Chapter 3. Performance

Table of Contents

Indexing Time
Setup Time
Query Time

Indexing Time

MG4J provides a great flexibility in index construction. For instance, you can choose several different codes for the components of the index, and moreover you can decide to drop parts you are not going to use (e.g., positions). All these choices have a significant impact on performance.

As a general rule, nonparametric codes are quicker than parametric codes. Thus, Golomb codes for document pointers have an excellent compression rate, but δ codes have very good compression, too, and can be decoded more quickly. The point here is that for unary, γ, shifted γ and δ codes MG4J uses precomputed decoding tables that speed up decompression by an order of magnitude. The default choice (γ for frequencies, δ for pointers, γ for counts, and δ for positions) is very reasonable. For maximum speed you could even try to use γ everywhere (as it is quicker to decode if the precomputed decoding tables fail).

Another important trick is that of discarding what you don't need. The default MG4J index type is called high-performance: it contains all information (pointers, counts, positions) but it is only partially interleaved―positions are kept in a separate file. This satisfies most needs, but If you are just using BM25 or TF/IDF scoring, there is no need to store positions in your index: you can force a standard, interleaved index.

Note that high-performance indices have always skips, but by default MG4J does not add skips if you require an interleaved index: if you are building a large index, however, skips are definitely a good idea. You should also try to raise the height of the skip towers to a large number (an index is able to skip in one shot as far as q2h, where q is the quantum and h is the height specified). However, as h grows the memory required to build the skip structures grow exponentially: the rule of thumb is setting h as large as possible without incurring in out-of-memory errors.