Before describing the indexing process, we need to introduce a few basic concepts. MG4J deals with documents (e.g., HTML files, mail messages, etc.), and every document is composed of fields (e.g., the fields of a mail message will probably be its subject, sender, recipient, body, etc.). Although, as we shall see, MG4J provides support for non-textual fields, its bread and butter is textual fields, and for the time being we shall assume that we are dealing with documents composed of just one textual field.
A textual field (in our simplified view: a document) is a sequence of words: it is up to the factory producing the document to decide how to choose words (e.g., one may want to discard digits or not) and how to segment the document. For instance, the typical letter-to-nonletter transition used to split Western languages does not work very well with, say, Chinese. However, once segmentation has produced suitable words, they must be turned into indexable terms: for instance, you may want to downcase your words, but at the same time you may want to keep "ph" (as in "this soap's ph") separated from "Ph" (as in "Ph.D. degree"). You may also want to make more dramatic transformations, such as stemming, or avoid indexing a term altogether. All these operations are performed by a term processor, which can be specified on the command line. The option --downcase, for instance, selects the class it.unimi.dsi.mg4j.index.DowncaseTermProcessor. The chosen processor is recorded into the index structure: this is essential for interpreting queries correctly.
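As a toy illustration of what a term processor does, the sketch below downcases words and discards empty ones. It is a self-contained, hypothetical example: MG4J's actual TermProcessor interface has a different signature, and a smarter processor might, as discussed above, preserve case distinctions such as "ph" versus "Ph".

```java
import java.util.Locale;

// Hypothetical sketch in the spirit of DowncaseTermProcessor; not MG4J's
// actual TermProcessor interface.
public class SimpleDowncaseProcessor {
    /**
     * Turns a word into an indexable term, or returns null when the word
     * should not be indexed at all.
     */
    public static String processTerm(String word) {
        if (word == null || word.isEmpty()) return null; // nothing to index
        return word.toLowerCase(Locale.ROOT);            // algorithmic processing: downcasing
    }

    public static void main(String[] args) {
        System.out.println(processTerm("Ph.D.")); // prints "ph.d."
    }
}
```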
Note that in the design of other search engines segmentation and processing are somehow mixed into a generic tokenisation phase. We prefer a clear separation between linguistic and algorithmic term processing. Linguistic processing depends only on the writing customs of a language, whereas algorithmic processing might be language neutral (we do not exclude, however, that it might be language dependent, too).
If you scan the whole document collection, you can collect all terms that appear in it; the set of all such terms is called the term dictionary. Note that every term in the dictionary appears in at least one document, and probably it will appear in many documents, possibly even many times in some document. (By the way: terms that appear in just one document are called hapax legomena, and in many collections they are far more frequent than one might expect, especially because of typos.)
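The following toy example (not MG4J code) scans a tiny collection, represented simply as lists of terms, and collects the hapax legomena, i.e., the terms whose document frequency is exactly one:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

public class DictionaryDemo {
    /** Returns the terms that appear in exactly one document. */
    public static SortedSet<String> hapaxLegomena(List<List<String>> documents) {
        Map<String, Integer> docFrequency = new HashMap<>();
        for (List<String> document : documents)
            for (String term : new HashSet<>(document)) // count each document once per term
                docFrequency.merge(term, 1, Integer::sum);
        SortedSet<String> hapax = new TreeSet<>();
        for (Map.Entry<String, Integer> e : docFrequency.entrySet())
            if (e.getValue() == 1) hapax.add(e.getKey());
        return hapax;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            List.of("the", "cat", "sat"),
            List.of("the", "dog", "sat", "sat"));
        System.out.println(hapaxLegomena(docs)); // prints [cat, dog]
    }
}
```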
MG4J, like any other indexing tool, does not internally treat terms as character sequences: it uses numbers instead. This means that the terms in the dictionary are assigned an index (a number between 0 and the dictionary size minus 1), and that this index is used whenever the application needs to refer to a term. Usually, indices are assigned in lexicographical order: index 0 is assigned to the first term in lexicographic order, index 1 to the next one, and so on. The assignment between terms and indices is stored in a suitable data structure that compactly represents both the dictionary and the map.
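A minimal sketch of this assignment, assuming lexicographic order: sort the dictionary, let a term's index be its position in the sorted array, and recover the index with a binary search. (This is just the conceptual contract; the data structures described below realise it much more compactly.)

```java
import java.util.Arrays;

public class TermIndexDemo {
    /** Index of a term in the lexicographically sorted dictionary, or -1 if absent. */
    public static int indexOf(String[] sortedDictionary, String term) {
        int i = Arrays.binarySearch(sortedDictionary, term);
        return i >= 0 ? i : -1;
    }

    public static void main(String[] args) {
        String[] dictionary = { "apple", "cat", "zebra" }; // already in lexicographic order
        System.out.println(indexOf(dictionary, "cat"));    // prints 1
    }
}
```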
There are many possible representations of this assignment, each with its own memory requirements and each allowing different kinds of access to the data.
The simplest representation of a dictionary is the term list: a text file containing the whole dictionary, one term per line, in index order (the first line contains the term with index 0, the second line the term with index 1, etc.). This representation is not especially efficient, and access time is prohibitive for most applications. Usually, a file containing the term list is given the extension .terms; if the terms are not sorted lexicographically, the file is given the extension .terms.unsorted.
A much more efficient representation is an unsigned MPH (minimal perfect hash): an unsigned minimal perfect hash is a very efficient data structure that answers correctly the question “What is the index of term XXX?”. You can build an unsigned MPH from a term list using the main method of the it.unimi.dsi.mg4j.util.MinimalPerfectHash class. Usually, a file containing the unsigned MPH is given the extension .mph.
Unsigned MPHs are very efficient and compact, but they have a serious limitation. As we said before, they answer correctly the question “What is the index of term XXX?”, but only for terms XXX that appear in the dictionary. In other words, if the above question is posed for a term that does not appear anywhere, the answer you get from an unsigned MPH is completely useless. This causes no harm if you are sure that you will never query the MPH with a term that does not belong to the dictionary, but it becomes a nuisance in all other cases. To solve this problem, you can use a signed MPH: a signed minimal perfect hash is a slightly less compact data structure that answers correctly the question “What is the index of term XXX?” even when XXX does not belong to the dictionary; in the latter case, a signed MPH answers with a special value (-1) that means “the term is not in the dictionary”. To build a signed MPH, you can once more use the it.unimi.dsi.mg4j.util.MinimalPerfectHash class, providing the special option -c with the class name of a signed MPH class (e.g., HashCodeSignedMinimalPerfectHash).
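The signing idea can be sketched as follows: alongside the hash, store a signature for each index (here, the term's hash code, as the name HashCodeSignedMinimalPerfectHash suggests) and verify it at lookup time, answering -1 on a mismatch. This is a toy illustration with invented names, not MG4J's implementation; like a real hash-code-signed MPH, it can be fooled by the rare unknown term whose hash code collides with a dictionary term's.

```java
import java.util.HashMap;
import java.util.Map;

public class SignedLookupDemo {
    private final Map<String, Integer> unsigned = new HashMap<>(); // toy stand-in for the unsigned MPH
    private final int[] signature;                                 // hash code of the i-th term

    public SignedLookupDemo(String[] terms) {
        signature = new int[terms.length];
        for (int i = 0; i < terms.length; i++) {
            unsigned.put(terms[i], i);
            signature[i] = terms[i].hashCode();
        }
    }

    /** Index of the term, or -1 when the signature check reveals an unknown term. */
    public int getNumber(String term) {
        // The unsigned structure yields *some* candidate index even for unknown terms.
        int candidate = unsigned.getOrDefault(term,
            Math.floorMod(term.hashCode(), signature.length));
        // The stored signature catches (almost all) unknown terms.
        return signature[candidate] == term.hashCode() ? candidate : -1;
    }

    public static void main(String[] args) {
        SignedLookupDemo h = new SignedLookupDemo(new String[] { "apple", "cat", "zebra" });
        System.out.println(h.getNumber("cat"));  // prints 1
        System.out.println(h.getNumber("dodo")); // prints -1
    }
}
```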
Signed and unsigned MPHs are perfectly fine as long as you don't need to access the index with wildcards. Wildcard searches require the use of a term map. A term map is able to answer correctly questions like “What are the indices of terms starting with XXX?”. This is meaningful only if the terms are lexicographically sorted: in this case, the indices of terms starting with a given prefix are consecutive, so the above question can be answered by giving just two integers (the first and the last index of the terms satisfying the property). You can build a term map using the main method of one of the implementations of the TermMap interface, e.g., ImmutableExternalPrefixDictionary.
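The prefix query a term map answers can be sketched with two binary searches over a sorted term list: the terms sharing a prefix occupy a contiguous interval, delimited below by the insertion points of the prefix itself and of a string larger than any term extending it. This is a conceptual illustration with invented names (and the prefix + Character.MAX_VALUE trick assumes no term contains that character), not MG4J's far more compact implementation.

```java
import java.util.Arrays;

public class PrefixIntervalDemo {
    /** Returns {first, last} indices of terms starting with prefix, or {-1, -1} if none. */
    public static int[] interval(String[] sortedTerms, String prefix) {
        // Insertion point of the prefix: index of the first term that may extend it.
        int first = Arrays.binarySearch(sortedTerms, prefix);
        if (first < 0) first = -first - 1;
        // Insertion point of a string larger than every term extending the prefix.
        int afterLast = Arrays.binarySearch(sortedTerms, prefix + Character.MAX_VALUE);
        if (afterLast < 0) afterLast = -afterLast - 1;
        return first < afterLast ? new int[] { first, afterLast - 1 } : new int[] { -1, -1 };
    }

    public static void main(String[] args) {
        String[] terms = { "phase", "philosophy", "phone", "physics", "zebra" };
        System.out.println(Arrays.toString(interval(terms, "ph"))); // prints [0, 3]
        System.out.println(Arrays.toString(interval(terms, "q")));  // prints [-1, -1]
    }
}
```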