Before describing the indexing process, we need to introduce a few basic concepts. MG4J deals with documents (e.g., HTML files, mail messages, etc.), and every document is composed of fields (e.g., the fields of a mail message will probably be its subject, sender, recipient, body, etc.). Although, as we shall see, MG4J provides support for non-textual fields, its bread and butter is textual fields, and for the time being we shall assume that we are dealing with documents composed of just one textual field.
A textual field (in our simplified view: a document) is a sequence of words: it is up to the factory producing the document to decide how to choose words (e.g., one may want to discard digits or not) and how to segment the document. For instance, the typical letter-to-nonletter transition used to split Western languages does not work very well with, say, Chinese. However, once segmentation has produced suitable words, they must be turned into indexable terms: for instance, you may want to downcase your words, but at the same time you may want to keep "ph" (as in "this soap's ph") separated from "Ph" (as in "Ph.D. degree"). You may also want to make more dramatic transformations, such as stemming, or avoid indexing a term altogether. All these operations are performed by a term processor, which can be specified on the command line. The option --downcase, for instance, selects the class it.unimi.dsi.mg4j.index.DowncaseTermProcessor. The chosen processor is recorded into the index structure: this is essential for interpreting queries correctly.
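As a toy illustration of what a term processor does, the sketch below downcases words and discards empty ones. It is a self-contained, hypothetical example: MG4J's actual TermProcessor interface has a different signature, and a smarter processor might, as discussed above, preserve case distinctions such as "ph" versus "Ph".

```java
import java.util.Locale;

// Hypothetical sketch in the spirit of DowncaseTermProcessor; not MG4J's
// actual TermProcessor interface.
public class SimpleDowncaseProcessor {
    /**
     * Turns a word into an indexable term, or returns null when the word
     * should not be indexed at all.
     */
    public static String processTerm(String word) {
        if (word == null || word.isEmpty()) return null; // nothing to index
        return word.toLowerCase(Locale.ROOT);            // algorithmic processing: downcasing
    }

    public static void main(String[] args) {
        System.out.println(processTerm("Ph.D.")); // prints "ph.d."
    }
}
```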
Note that in the design of other search engines segmentation and processing are somehow mixed into a generic tokenisation phase. We prefer a clear separation between linguistic and algorithmic term processing. Linguistic processing depends only on the writing customs of a language, whereas algorithmic processing might be language neutral (we do not exclude, however, that it might be language dependent, too).
If you scan the whole document collection, you can collect all terms that appear in it; the set of all such terms is called the term dictionary. Note that every term in the dictionary appears in at least one document, and probably it will appear in many documents, possibly even many times in some document. (By the way: terms that appear in just one document are called hapax legomena, and in many collections they are far more frequent than one might expect, especially because of typos.)
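The following toy example (not MG4J code) scans a tiny collection, represented simply as lists of terms, and collects the hapax legomena, i.e., the terms whose document frequency is exactly one:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

public class DictionaryDemo {
    /** Returns the terms that appear in exactly one document. */
    public static SortedSet<String> hapaxLegomena(List<List<String>> documents) {
        Map<String, Integer> docFrequency = new HashMap<>();
        for (List<String> document : documents)
            for (String term : new HashSet<>(document)) // count each document once per term
                docFrequency.merge(term, 1, Integer::sum);
        SortedSet<String> hapax = new TreeSet<>();
        for (Map.Entry<String, Integer> e : docFrequency.entrySet())
            if (e.getValue() == 1) hapax.add(e.getKey());
        return hapax;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            List.of("the", "cat", "sat"),
            List.of("the", "dog", "sat", "sat"));
        System.out.println(hapaxLegomena(docs)); // prints [cat, dog]
    }
}
```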
MG4J, like any other indexing tool, does not internally treat terms as character sequences: it uses numbers instead. This means that the terms in the dictionary are assigned an index (a number between 0 and the dictionary size minus 1), and that this index is used whenever the application needs to refer to a term. Usually, indices are assigned in lexicographical order: index 0 is assigned to the first term in lexicographic order, index 1 to the next one, and so on. The assignment between terms and indices is stored in a suitable data structure that compactly represents both the dictionary and the map.
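A minimal sketch of this assignment, assuming lexicographic order: sort the dictionary, let a term's index be its position in the sorted array, and recover the index with a binary search. (This is just the conceptual contract; the data structures described below realise it much more compactly.)

```java
import java.util.Arrays;

public class TermIndexDemo {
    /** Index of a term in the lexicographically sorted dictionary, or -1 if absent. */
    public static int indexOf(String[] sortedDictionary, String term) {
        int i = Arrays.binarySearch(sortedDictionary, term);
        return i >= 0 ? i : -1;
    }

    public static void main(String[] args) {
        String[] dictionary = { "apple", "cat", "zebra" }; // already in lexicographic order
        System.out.println(indexOf(dictionary, "cat"));    // prints 1
    }
}
```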
There are many possible representations of this assignment, each with its own memory requirements and each allowing different kinds of access to the data.
The simplest representation of a dictionary is the term list: a text file containing the whole dictionary, one term per line, in index order (the first line contains the term with index 0, the second line the term with index 1, etc.). This representation is not especially efficient, and access time is prohibitive for most applications. Usually, a file containing the term list is given the extension .terms; if the terms are not sorted lexicographically, the file is given the extension .terms.unsorted.
A much more efficient representation is an unsigned MPH (minimal perfect hash): an unsigned minimal perfect hash is a very efficient data structure that answers correctly the question “What is the index of term XXX?”. You can build an unsigned MPH from a term list using the main method of the it.unimi.dsi.mg4j.util.MinimalPerfectHash class. Usually, a file containing the unsigned MPH is given the extension .mph.
Unsigned MPHs are very efficient and compact, but they have a serious limitation. As we said before, they answer correctly the question “What is the index of term XXX?”, but only for terms XXX that appear in the dictionary. In other words, if the above question is posed for a term that does not appear anywhere, the answer you get from an unsigned MPH is completely useless. This causes no harm if you are sure that you will never query the MPH with a term that does not belong to the dictionary, but it becomes a nuisance in all other cases. To solve this problem, you can use a signed MPH: a signed minimal perfect hash is a slightly less compact data structure that answers correctly the question “What is the index of term XXX?” even when XXX does not belong to the dictionary; in the latter case, a signed MPH answers with a special value (-1) that means “the term is not in the dictionary”. To build a signed MPH, you can once more use the it.unimi.dsi.mg4j.util.MinimalPerfectHash class, providing the special option -c with the class name of a signed MPH class (e.g., HashCodeSignedMinimalPerfectHash).
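The signing idea can be sketched as follows: alongside the hash, store a signature for each index (here, the term's hash code, as the name HashCodeSignedMinimalPerfectHash suggests) and verify it at lookup time, answering -1 on a mismatch. This is a toy illustration with invented names, not MG4J's implementation; like a real hash-code-signed MPH, it can be fooled by the rare unknown term whose hash code collides with a dictionary term's.

```java
import java.util.HashMap;
import java.util.Map;

public class SignedLookupDemo {
    private final Map<String, Integer> unsigned = new HashMap<>(); // toy stand-in for the unsigned MPH
    private final int[] signature;                                 // hash code of the i-th term

    public SignedLookupDemo(String[] terms) {
        signature = new int[terms.length];
        for (int i = 0; i < terms.length; i++) {
            unsigned.put(terms[i], i);
            signature[i] = terms[i].hashCode();
        }
    }

    /** Index of the term, or -1 when the signature check reveals an unknown term. */
    public int getNumber(String term) {
        // The unsigned structure yields *some* candidate index even for unknown terms.
        int candidate = unsigned.getOrDefault(term,
            Math.floorMod(term.hashCode(), signature.length));
        // The stored signature catches (almost all) unknown terms.
        return signature[candidate] == term.hashCode() ? candidate : -1;
    }

    public static void main(String[] args) {
        SignedLookupDemo h = new SignedLookupDemo(new String[] { "apple", "cat", "zebra" });
        System.out.println(h.getNumber("cat"));  // prints 1
        System.out.println(h.getNumber("dodo")); // prints -1
    }
}
```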
Signed and unsigned MPHs are perfectly fine as long as you don't need to access the index with wildcards. Wildcard searches require the use of a term map. A term map is able to answer correctly questions like “What are the indices of terms starting with XXX?”. This is meaningful only if the terms are lexicographically sorted: in this case, the indices of terms starting with a given prefix are consecutive, so the above question can be answered by giving just two integers (the first and the last index of the terms satisfying the property). You can build a term map using the main method of one of the implementations of the TermMap interface, e.g., ImmutableExternalPrefixDictionary.
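The prefix query a term map answers can be sketched with two binary searches over a sorted term list: the terms sharing a prefix occupy a contiguous interval, delimited below by the insertion points of the prefix itself and of a string larger than any term extending it. This is a conceptual illustration with invented names (and the prefix + Character.MAX_VALUE trick assumes no term contains that character), not MG4J's far more compact implementation.

```java
import java.util.Arrays;

public class PrefixIntervalDemo {
    /** Returns {first, last} indices of terms starting with prefix, or {-1, -1} if none. */
    public static int[] interval(String[] sortedTerms, String prefix) {
        // Insertion point of the prefix: index of the first term that may extend it.
        int first = Arrays.binarySearch(sortedTerms, prefix);
        if (first < 0) first = -first - 1;
        // Insertion point of a string larger than every term extending the prefix.
        int afterLast = Arrays.binarySearch(sortedTerms, prefix + Character.MAX_VALUE);
        if (afterLast < 0) afterLast = -afterLast - 1;
        return first < afterLast ? new int[] { first, afterLast - 1 } : new int[] { -1, -1 };
    }

    public static void main(String[] args) {
        String[] terms = { "phase", "philosophy", "phone", "physics", "zebra" };
        System.out.println(Arrays.toString(interval(terms, "ph"))); // prints [0, 3]
        System.out.println(Arrays.toString(interval(terms, "q")));  // prints [-1, -1]
    }
}
```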