Technology
Massive volumes of information are available in textual form, for instance on the Internet or on intranets. Such a large and rapidly changing body of knowledge cannot be exploited efficiently by people working unaided.
In most standard search technologies, the text and meaning of both the query and the document collection are reduced to an unordered bag of keywords.
For instance, a query such as "When was Mercury discovered?" is treated as equivalent to "mercury discovered". Standard Information Retrieval will return the documents that are most similar to that query. If you are searching the web, there is a good chance that the date will be found near those keywords, along with the names of the first observers.
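The reduction described above can be sketched in a few lines of Java. This is only an illustration, not Linguit code: the class name `BagOfKeywords`, the tiny stop-word list and the use of a set (which collapses repeated words) are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

final class BagOfKeywords {
    // Illustrative stop-word list (an assumption); real engines use larger, language-specific lists.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("when", "was", "is", "the", "a", "of"));

    /** Lowercases the text, splits it on non-letters and drops stop words; word order is lost. */
    static Set<String> toBag(String text) {
        Set<String> bag = new TreeSet<String>(); // sorted only to make the printed output stable
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                bag.add(token);
            }
        }
        return bag;
    }

    public static void main(String[] args) {
        System.out.println(toBag("When was Mercury discovered?")); // [discovered, mercury]
        System.out.println(toBag("mercury discovered"));           // [discovered, mercury]
    }
}
```

Both inputs yield the same keyword set, which is why a keyword engine cannot tell the question from the two-word query.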
While this approach can be sufficient for some search tasks, it is not always satisfactory:
- When a term of the query is ambiguous, e.g. Mercury, planet versus element;
- When one is limited by the medium of access (e.g. a mobile phone) and cannot afford to browse a list of documents;
- When the search corpus is small, specialized or has little redundancy, a match with the query may fail if the exact keyword is not found. For instance, the element mercury may only be referred to by the symbol Hg.
In each of these cases, you will need to rephrase your query and check the documents to ensure there is good evidence to support the answer that you (not the search engine!) identified. While this may be acceptable for personal searches, it is extremely time-consuming in batch searches, for instance market watch or open source intelligence.
The issue applies to web search of course, but it is also a general search problem that applies to corporate intranets, log analysis, large technical manuals and data banks, to name a few.
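The small-corpus case above (mercury referred to only by the symbol Hg) can be reproduced with a short Java sketch. The class `KeywordMatchDemo`, the sample sentence and the hand-written synonym table are hypothetical; the expansion step is one simple mitigation chosen for illustration and is not claimed to be how the Linguit platform works.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class KeywordMatchDemo {
    // Hypothetical synonym table, written by hand for this example only.
    private static final Map<String, Set<String>> SYNONYMS = new HashMap<String, Set<String>>();
    static {
        SYNONYMS.put("mercury", new HashSet<String>(Arrays.asList("hg", "quicksilver")));
    }

    /** Splits text into lowercase tokens made of letters and digits. */
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<String>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    /** True only if the document contains the exact keyword. */
    static boolean exactMatch(String keyword, String document) {
        return tokens(document).contains(keyword.toLowerCase(Locale.ROOT));
    }

    /** True if the document contains the keyword or one of its listed synonyms. */
    static boolean expandedMatch(String keyword, String document) {
        String key = keyword.toLowerCase(Locale.ROOT);
        Set<String> docTokens = tokens(document);
        if (docTokens.contains(key)) {
            return true;
        }
        Set<String> alternatives = SYNONYMS.get(key);
        if (alternatives == null) {
            return false;
        }
        for (String alternative : alternatives) {
            if (docTokens.contains(alternative)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String document = "The sample contained traces of Hg.";
        System.out.println(exactMatch("mercury", document));    // false
        System.out.println(expandedMatch("mercury", document)); // true
    }
}
```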
Linguit's approach
Our approach to search has three parts:
- Improve on text analysis and corpus pre-processing, beyond standard keyword indexing.
- Gather a deeper understanding of user queries, whether they search with keywords or questions.
- Retrieve answers rather than search results.
In our view, answering is a much more complex process than similarity matching. We see three essential steps to the answering process, articulated around intelligent information extraction, user modeling and rendering/visualization.
What we do
Our core technology is a platform dedicated to Information Retrieval and text mining. We do not offer a generic search engine for an undefined search task. Instead, we focus on task-oriented search problems: we offer a generic architecture that can perform different kinds of search, but each component is dedicated to a specific and well-defined task. These components can be used as standalone tools (for instance in the case of Named Entity Recognition), or as part of a higher-level task, such as query analysis or text indexing.
The generic aspect of the platform lies in its ability to handle different types of knowledge sources, for instance relational databases, text documents or semi-structured streams such as RSS feeds, and different types of front-ends, from graphical user interfaces to custom in-house applications or web services.
The input (query) and output (search results) of the platform are designed as an XML transaction. Customizing search parameters on the one hand, and the rendering of search results on the other, requires only minimal development using XSLT. This offers an extremely versatile approach to user interfacing, since both the query interface and the rendering of the search results are completely separated from the actual search process. It also makes it possible to offer different views of the search results, in terms of layout and structure, at minimal cost.
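As a rough illustration of this separation, the sketch below renders one XML result document through two different stylesheets using the JDK's standard `javax.xml.transform` API. The XML element names (`results`, `hit`, `title`), both stylesheets and the class `ResultRenderer` are assumptions invented for the example; they do not describe the actual Linguit transaction format.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class ResultRenderer {
    // Hypothetical search-result document; the real schema is not given in the text.
    static final String RESULTS_XML =
            "<results query=\"mercury discovered\">"
          + "<hit rank=\"1\"><title>First document</title></hit>"
          + "<hit rank=\"2\"><title>Second document</title></hit>"
          + "</results>";

    // View 1: an HTML-style bullet list of titles.
    static final String LIST_VIEW_XSLT =
            "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
          + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
          + "<xsl:template match=\"/results\">"
          + "<ul><xsl:for-each select=\"hit\"><li><xsl:value-of select=\"title\"/></li></xsl:for-each></ul>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    // View 2: plain text, one ranked title per line.
    static final String TEXT_VIEW_XSLT =
            "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
          + "<xsl:output method=\"text\"/>"
          + "<xsl:template match=\"/results\">"
          + "<xsl:for-each select=\"hit\">"
          + "<xsl:value-of select=\"@rank\"/>. <xsl:value-of select=\"title\"/>"
          + "<xsl:text>&#10;</xsl:text>"
          + "</xsl:for-each>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    /** Applies a stylesheet to an XML document and returns the rendered text. */
    static String render(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not render search results", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(render(RESULTS_XML, LIST_VIEW_XSLT));
        System.out.println(render(RESULTS_XML, TEXT_VIEW_XSLT));
    }
}
```

The search step produces the XML once; adding a new view means writing another stylesheet, with no change to the code that ran the search.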
The platform is also designed to support meta-search. It can be configured to access multiple sources of information from a single front-end, and it provides components to aggregate the search results from those sources.
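One simple way to aggregate results from several sources is sketched below. It assumes that every source returns a comparable relevance score and that documents share identifiers across sources; both are assumptions, as are the names `MetaSearch`, `Hit` and `aggregate`. The platform's own aggregation components may work quite differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MetaSearch {
    /** One result from one source; all fields are hypothetical. */
    static final class Hit {
        final String docId;
        final String source;
        final double score; // assumed to be comparable across sources

        Hit(String docId, String source, double score) {
            this.docId = docId;
            this.source = source;
            this.score = score;
        }
    }

    /** Keeps the best-scoring hit for each document, then sorts by descending score. */
    static List<Hit> aggregate(List<List<Hit>> resultsPerSource) {
        Map<String, Hit> best = new LinkedHashMap<String, Hit>();
        for (List<Hit> hits : resultsPerSource) {
            for (Hit hit : hits) {
                Hit current = best.get(hit.docId);
                if (current == null || hit.score > current.score) {
                    best.put(hit.docId, hit);
                }
            }
        }
        List<Hit> merged = new ArrayList<Hit>(best.values());
        Collections.sort(merged, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                return Double.compare(b.score, a.score);
            }
        });
        return merged;
    }

    public static void main(String[] args) {
        List<List<Hit>> all = new ArrayList<List<Hit>>();
        all.add(Arrays.asList(new Hit("doc-1", "database", 0.9), new Hit("doc-2", "database", 0.4)));
        all.add(Arrays.asList(new Hit("doc-2", "rss", 0.7), new Hit("doc-3", "rss", 0.5)));
        for (Hit hit : aggregate(all)) {
            System.out.println(hit.docId + " from " + hit.source + " score " + hit.score);
        }
    }
}
```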
For each new search task, we focus our attention on customizing the components that perform natural language processing. In this area, we have the expertise to assess your needs in terms of language technology requirements. This assessment usually translates into a hybrid strategy consisting of plugging in available linguistic resources, or acquiring new knowledge for your domain through machine learning techniques.
Linguit components in a nutshell
We provide the following components as part of our solutions, and offer corresponding expertise in consulting/training projects.
- Language identifier
- Sentence boundary detection
- Text tokenization
- Morphological processing (stemming, lemmatization)
- Spell checking and phonetic suggestions
- Grammatical part-of-speech (POS) tagging
- Fast, robust syntactic phrase parser
- Syntactic parser
- Dependency parser
- Semantic analysis component
- Mining of patterns from textual data
- Named entity recognition and classification
- Co-reference finder
- Toponym resolution (geo-tagger)
- Machine learning for natural language processing
- Question classification
- Indexer: crawls data to construct a searchable index for fast access
- Text to SQL parser
- Retrieve documents from a collection with high relevance
- Clustering
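To give a feel for how two of the components listed above chain together, the following sketch performs sentence boundary detection and then tokenization. It does not use the Linguit API, which is not documented here; it relies on the JDK's `java.text.BreakIterator` as a stand-in, and the class `TextPipeline` is a hypothetical name.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class TextPipeline {
    /** Splits text into sentences using the JDK's locale-aware sentence iterator. */
    static List<String> sentences(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(text);
        List<String> result = new ArrayList<String>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    /** Splits a sentence into word tokens, skipping whitespace and punctuation segments. */
    static List<String> tokens(String sentence, Locale locale) {
        BreakIterator iterator = BreakIterator.getWordInstance(locale);
        iterator.setText(sentence);
        List<String> result = new ArrayList<String>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
            String segment = sentence.substring(start, end);
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Mercury is a planet. It is also an element.";
        for (String sentence : sentences(text, Locale.ENGLISH)) {
            System.out.println(tokens(sentence, Locale.ENGLISH));
        }
    }
}
```

Later stages such as POS tagging or named entity recognition would consume the token lists produced here, which is the sense in which each component can run standalone or as part of a higher-level task.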
System Integration
The Linguit platform is developed in Java (version 1.5 or higher), and will run on any operating system with a Java Runtime Environment (Windows, Linux/UNIX, Macintosh/BSD).
Application Integration
Linguit applications are accessible through:
- HTTP transactions
- XML transactions (file system or web services, e.g. SOAP)
- Programmatically, through the Linguit public API
For an end-to-end application, we can provide a full integration based on a Tomcat container, including Apache integration on UNIX servers.
User Interface Integration
The Linguit platform supports multiple user interfaces, from Java GUIs to web-based interfaces. Input and output of the platform are XML-enabled, and we offer built-in XSLT transformations to facilitate layout customizations.
System Requirements
Actual requirements may vary depending on the components used and the type of text processing needed. By default, we recommend:
- 2 GHz 32-bit (x86) or 64-bit (x64) processor
- 512 MB of system memory minimum (800 MB recommended)
- 20 MB of available hard-drive space, which is enough to support a Linguit search application
Additional space may be needed for:
- Corpus indexing for text search: the size of the index varies with the search requirements, but it is usually smaller than the original corpus. However, the indexing process may require up to 2 times the size of the corpus before optimization.
- Knowledge resources and language models may also require additional space.
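The figures above translate into a simple upper bound on disk space during indexing. The helper below is a back-of-the-envelope sketch: the class `DiskEstimate` is hypothetical, and the estimate deliberately excludes the corpus itself, knowledge resources and language models, whose sizes are not stated.

```java
final class DiskEstimate {
    static final long APPLICATION_MB = 20;       // base footprint quoted in the requirements
    static final double PEAK_INDEX_FACTOR = 2.0; // "up to 2 times the size of the corpus"

    /** Upper bound, in MB, on disk space used by the application plus an index being built. */
    static long peakDiskMb(long corpusMb) {
        return APPLICATION_MB + (long) Math.ceil(corpusMb * PEAK_INDEX_FACTOR);
    }

    public static void main(String[] args) {
        // A 500 MB corpus: 20 MB + 2 x 500 MB = 1020 MB at peak, before optimization.
        System.out.println(peakDiskMb(500)); // 1020
    }
}
```

After optimization the index is usually smaller than the corpus, so the steady-state footprint should fall well below this peak.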

Contact phone: +44 7 780 850 038