How ABBYY InfoExtractor works
Fact Extraction with InfoExtractor
ABBYY InfoExtractor is designed to process unstructured texts written in natural language and to identify facts, entities and the relationships between them.
“Unstructured” means that the information in the document or text does not follow a fixed order or layout and, therefore, cannot easily be located with simple extraction rules. Unstructured information is almost everywhere:
- in office documents
- on the internet
- in PDFs and scanned documents or images.
InfoExtractor can natively process these documents and access the embedded textual information.
Natural language texts, of course, have a logical and linguistic structure. They consist of a sequence of characters that are grouped into words (separated by blanks or dashes). Sentences line up words, and punctuation marks often provide a simple, visible structure.
Properly built sentences follow grammar rules, but the entities and the relations between them are hidden: they are not obvious if you do not speak the language or only have a word-by-word translation.
Languages allow humans to package the same information in multiple ways by using different verbs, nouns and adjectives. Anybody can build different sentences that “mean” the same thing, without following a predictable logic.
The Reality Problem
Computers are good at counting characters, finding similar combinations (words), sorting and calculating statistics.
But current computer systems have no understanding of language, words or their meaning. To make things even more complicated, natural language is highly ambiguous, and the same word can have very different meanings, for example: plant, bank, apple, oracle.
In unstructured natural language texts, simple extraction rules can deliver information, but most of the time the result is only characters or words without a meaningful context. Regular expressions cannot solve the task efficiently, because extracting a single fact requires dozens of different rules to overcome the ambiguity of human language. One piece of information can be expressed:
- with different words
- in different word orders
In addition, the same word combination may mean different things in different contexts.
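The limitation of rule-based extraction can be illustrated with a small Python sketch. The pattern, the company names and the sentences are invented for this illustration and have nothing to do with InfoExtractor itself; the point is only that a single rule covers a single phrasing.

```python
import re

# Hypothetical rule: matches only the phrasing "<Buyer> bought <Item>".
BOUGHT_RULE = re.compile(r"(?P<buyer>[A-Z]\w+) bought (?P<item>[A-Z]\w+)")

# Three invented sentences that state the same purchase in different ways.
SENTENCES = [
    "Acme bought Globex.",
    "Globex was acquired by Acme.",
    "The purchase of Globex by Acme was completed.",
]


def extract_purchase(sentence):
    """Return (buyer, item) if the single rule matches, otherwise None."""
    match = BOUGHT_RULE.search(sentence)
    if match is None:
        return None
    return match.group("buyer"), match.group("item")


results = [extract_purchase(s) for s in SENTENCES]
for sentence, result in zip(SENTENCES, results):
    print(f"{sentence!r} -> {result}")
```

Only the first sentence is matched. Covering the other two would need additional patterns for the passive voice and for the nominal form, and every further paraphrase would need yet another rule.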
ABBYY's new approach
ABBYY InfoExtractor (based on Compreno technology) analyzes the text with different linguistic and statistical approaches. This produces a large amount of metadata from plain text. These “raw” linguistic hypotheses are then weighted and cross-checked against the embedded language and grammar rules. The best hypotheses are then matched against ABBYY's Universal Semantic Hierarchy to obtain the actual (semantic) meaning and the context in which the word is used in the sentence. The same nouns and verbs can represent different contexts, for example:
- “apple” used as a fruit or as an “IT company”
- “run” used in the meaning “move / run around”
- “run” in a technical sense, “execute an application”
- “run” in a business sense, “run a business”
InfoExtractor tries to “understand” the meaning of natural sentences in text.
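To make the idea of context-dependent meaning more tangible, here is a deliberately naive Python sketch. It is not how Compreno or the Universal Semantic Hierarchy work; the sense inventory, the sense names and the cue words are all invented for this example. It simply picks the sense whose cue words overlap most with the other words in the sentence.

```python
# Hypothetical sense inventory: each sense of a word lists a few context cue words.
SENSES = {
    "apple": {
        "fruit": {"eat", "tree", "juice", "ripe"},
        "it_company": {"iphone", "shares", "software", "ceo"},
    },
    "run": {
        "move": {"park", "marathon", "fast", "legs"},
        "execute_application": {"program", "script", "server", "application"},
        "manage_business": {"company", "business", "shop", "department"},
    },
}


def pick_sense(word, sentence):
    """Return the sense of `word` whose cue words overlap most with the sentence.

    Returns None when no cue word occurs at all. `word` must be a key of SENSES.
    """
    tokens = {t.strip(".,!?").lower() for t in sentence.split()}
    scores = {sense: len(cues & tokens) for sense, cues in SENSES[word].items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(pick_sense("apple", "She picked a ripe apple from the tree."))
print(pick_sense("run", "Please run the script on the server."))
```

A real system has to weigh many competing hypotheses about syntax and meaning at once; counting cue words only shows why the surrounding context is needed to decide between senses.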
As a product for natural language processing (NLP), InfoExtractor can:
- detect entities such as persons, institutions or companies
- find the meaningful relationships between them
- identify structured data, e.g. from tables, and connect it with facts that were extracted from the unstructured text sections of the document.
For a reliable NLP system, it is not enough to look for something like “Invoice number: xxxx-xxxx-xxx”. InfoExtractor looks for information on a more abstract, conceptual level. For example, if you would like to know “Who bought what? For how much money?”, InfoExtractor will try to identify which persons or companies are mentioned in the text, and then it will look for words or phrases that belong to the semantic class “to purchase or sell something”. Numbers are analyzed to check whether they represent a proper amount with a currency.
The detected facts are not just key-value pairs, but so-called semantic triples (statements of the form subject, predicate, object). The default export format of InfoExtractor is RDF/XML. RDF is the foundation of the Semantic Web.
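The difference between key-value pairs and triples can be sketched in a few lines of Python. The identifiers and predicate names below (`purchase_1`, `buyer`, `name` and so on) are hypothetical and are not taken from InfoExtractor output; they only show how triples link several entities into one small graph that can be followed from node to node.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Hypothetical identifiers and predicates, chosen for illustration only.
triples = [
    Triple("purchase_1", "type", "Purchase"),
    Triple("purchase_1", "buyer", "company_acme"),
    Triple("purchase_1", "seller", "company_globex"),
    Triple("company_acme", "type", "Company"),
    Triple("company_acme", "name", "Acme"),
    Triple("company_globex", "type", "Company"),
    Triple("company_globex", "name", "Globex"),
]


def objects_of(subject, predicate):
    """Return all objects stated for the given subject and predicate."""
    return [t.object for t in triples
            if t.subject == subject and t.predicate == predicate]


def buyer_names(purchase_id):
    """Follow purchase -> buyer -> name, a two-step walk through the graph."""
    names = []
    for buyer in objects_of(purchase_id, "buyer"):
        names.extend(objects_of(buyer, "name"))
    return names


print(buyer_names("purchase_1"))
```

A flat key-value record could store `buyer = Acme`, but it could not express that the buyer is itself an entity with its own type and further properties. In the triple form, the object of one statement can be the subject of another.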
The generated InfoExtractor RDF/XML results could present the following information about a purchase-and-sale fact in a structured way:
- who the buyer is
- who the seller is
- what the object of the purchase and sale is
- where the purchase was made
- when the purchase was made (date and time)
Because the extracted information is structured, it can be parsed and reused for different purposes in different systems.
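As a rough illustration of such reuse, the following Python sketch reads a small RDF/XML document with the standard library. The `rdf` namespace is the standard one, but the `ex` vocabulary (`Purchase`, `buyer`, `seller`, `object`, `place`, `date`) and all values are assumptions made up for this example; the actual InfoExtractor output schema is not described here and will differ. The sketch also treats the file as plain XML, which is enough for this simple layout but is not a full RDF parser.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX_NS = "http://example.org/facts#"  # hypothetical vocabulary, not ABBYY's

# Invented sample document describing one purchase fact.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/facts#">
  <ex:Purchase rdf:about="http://example.org/facts#purchase_1">
    <ex:buyer>Acme</ex:buyer>
    <ex:seller>Globex</ex:seller>
    <ex:object>office building</ex:object>
    <ex:place>Berlin</ex:place>
    <ex:date>2015-03-01</ex:date>
  </ex:Purchase>
</rdf:RDF>
"""


def read_purchases(rdf_xml):
    """Return one dict per ex:Purchase node, keyed by local property name."""
    root = ET.fromstring(rdf_xml)
    purchases = []
    for node in root.findall(f"{{{EX_NS}}}Purchase"):
        fact = {"id": node.get(f"{{{RDF_NS}}}about")}
        for child in node:
            # ElementTree reports tags as "{namespace}local"; keep the local part.
            fact[child.tag.split("}", 1)[1]] = child.text
        purchases.append(fact)
    return purchases


facts = read_purchases(SAMPLE)
for fact in facts:
    print(fact["buyer"], "bought", fact["object"], "from", fact["seller"])
```

Once the facts are available as ordinary dictionaries, they can be written to a database, compared with data from other systems, or passed on to a downstream workflow.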
InfoExtractor was “taught” by ABBYY linguists and computer scientists how a given language works and what the semantic meaning of words and phrases can be. This knowledge is used to analyze the text in documents on a very generic level. To detect and extract the desired facts and information snippets within a specific project or use case, it is also necessary to “tell” InfoExtractor what exactly to look for. This is done with so-called ontologies (formal representations of concepts and the relationships between those concepts). Creating these ontologies requires cooperation between the customer or project team and ABBYY: ABBYY ontology engineers work out the project-specific intelligence so that InfoExtractor can deliver the desired facts. In the future, the creation of such semantically based extraction ontologies will be simplified so that they can be set up and adjusted in a more flexible way.