Scan: Building batches

In this step, MG4J scans the whole document collection, producing the so-called batches. Batches are subindices limited to a subset of documents; a new batch is created each time the number of indexed documents reaches a user-provided threshold, or when available memory runs low.

An occurrence is a triple of numbers, say (t,d,p), meaning that the term with index t appears in document d at position p. Here, the document is represented by an integer, called the document pointer, which is in most cases the position of the document in the document collection (0 for the first document, 1 for the second document, and so on). The position is also an integer, representing where the term occurs within the document.
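
For illustration only (MG4J does not expose occurrences through such a class; the names below are made up for this sketch, and Java records require JDK 16 or later), an occurrence can be modelled as a tiny Java record:

// Illustrative only: a conceptual representation of an occurrence triple.
public class OccurrenceExample {
    // A triple (t,d,p): term index, document pointer, position in the document.
    record Occurrence(int term, int document, int position) {}

    public static void main(String[] args) {
        // (5,0,1): term 5 ("love") occurs in document 0 ("I love you") at position 1.
        System.out.println(new Occurrence(5, 0, 1));
    }
}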

To understand what the scanning phase really does, suppose you have four documents:

Document pointer    Document
0                   I love you
1                   God is love
2                   Love is blind
3                   Blind justice

Here is the dictionary produced initially by the scanning phase:

Term index    Term
0             blind
1             god
2             i
3             is
4             justice
5             love
6             you

Now, at least conceptually, this is the list of occurrences:

Occurrences (in the same order as they are found when scanning the documents)
(2,0,0) (5,0,1) (6,0,2) (1,1,0) (3,1,1) (5,1,2) (5,2,0) (3,2,1) (0,2,2) (0,3,0) (4,3,1)

This simply means that term 2 (i) appears in document 0 at position 0, that term 5 (love) appears in document 0 at position 1, that term 6 (you) appears in document 0 at position 2, and so on. Inverted lists can now be obtained by re-sorting the occurrences in increasing term order, so that occurrences relative to the same term appear consecutively:

Term           Occurrences
0 (blind)      (0,2,2) (0,3,0)
1 (god)        (1,1,0)
2 (i)          (2,0,0)
3 (is)         (3,1,1) (3,2,1)
4 (justice)    (4,3,1)
5 (love)       (5,0,1) (5,1,2) (5,2,0)
6 (you)        (6,0,2)
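
As a conceptual sketch of this re-sorting step (plain Java, not MG4J code; the class and variable names are invented for the example), the inverted lists above can be obtained by sorting the occurrence triples by term and grouping equal terms together:

import java.util.*;

// Conceptual sketch: group occurrence triples (t,d,p) by term to obtain inverted lists.
public class InvertedListSketch {
    public static void main(String[] args) {
        // Occurrences in scanning order, as in the example above.
        int[][] occurrences = {
            {2,0,0}, {5,0,1}, {6,0,2}, {1,1,0}, {3,1,1}, {5,1,2},
            {5,2,0}, {3,2,1}, {0,2,2}, {0,3,0}, {4,3,1}
        };
        // Sort by term, then by document, then by position.
        Arrays.sort(occurrences, Comparator.<int[]>comparingInt(o -> o[0])
                .thenComparingInt(o -> o[1]).thenComparingInt(o -> o[2]));
        // Group consecutive occurrences with the same term into one inverted list.
        Map<Integer, List<String>> invertedLists = new TreeMap<>();
        for (int[] o : occurrences)
            invertedLists.computeIfAbsent(o[0], t -> new ArrayList<>())
                    .add("(" + o[0] + "," + o[1] + "," + o[2] + ")");
        invertedLists.forEach((term, list) -> System.out.println(term + ": " + list));
    }
}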

Now, the indexer must build the term dictionary, record every occurrence as the documents are scanned, and finally sort the occurrences by term so that the inverted lists can be built.

The last point needs further explanation. Since there are a great number of occurrences, it is not reasonable to expect that they can all be kept in memory. What the indexing pass does is keep an internal batch where occurrences are stored as they are found; when the batch is full, it is sorted by term and flushed to disk in the form of a subindex. Every batch will be in term order, but different batches may (and usually will) contain occurrences of the same term.
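
The following toy sketch (again plain Java, not MG4J's actual implementation; the threshold, file names and text format are made up) illustrates the idea: occurrences accumulate in memory and, when the batch is full, they are sorted by term and written out as a subindex:

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

// Toy sketch of batch accumulation; not MG4J's actual on-disk format.
public class BatchSketch {
    private static final int BATCH_SIZE = 1_000_000; // hypothetical threshold
    private final List<int[]> batch = new ArrayList<>();
    private int batchNumber = 0;

    void addOccurrence(int term, int document, int position) throws IOException {
        batch.add(new int[] { term, document, position });
        if (batch.size() == BATCH_SIZE) flush();
    }

    void flush() throws IOException {
        // Sort the current batch by term, so that each subindex is in term order
        // (the sort is stable, so document/position order is preserved within a term)...
        batch.sort(Comparator.comparingInt(o -> o[0]));
        Path file = Paths.get("batch-" + batchNumber++ + ".txt");
        try (BufferedWriter w = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (int[] o : batch) w.write(o[0] + " " + o[1] + " " + o[2] + "\n");
        }
        // ...but the same term may reappear in later batches, which is why
        // subindices must eventually be merged (or accessed as a cluster).
        batch.clear();
    }
}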

You can run a scanning phase by giving the following command:

java -Xmx256M it.unimi.dsi.mg4j.tool.Scan -S javadoc.collection \
    --downcase -I text javadoc

This means that we are going to index the collection named javadoc.collection (-S javadoc.collection), that we want to index only the field named text (-I text), and that all terms will be downcased (--downcase). The basename of the resulting index is going to be javadoc (as usual, completed with the field name).
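
If you prefer to drive the tool from your own Java code, the same invocation can be obtained by calling the tool's main method (a minimal sketch; exceptions are simply propagated):

import it.unimi.dsi.mg4j.tool.Scan;

// Programmatic equivalent of the command line above (sketch only).
public class RunScan {
    public static void main(String[] args) throws Exception {
        Scan.main(new String[] {
            "-S", "javadoc.collection",  // the serialized collection to index
            "--downcase",                // downcase all terms
            "-I", "text",                // index only the "text" field
            "javadoc"                    // basename of the resulting index
        });
    }
}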

After running this command, the directory contains the following files:

-rw-r--r--    1 vigna    vigna         453 Jun 11 21:38 javadoc-text.cluster.properties
-rw-r--r--    1 vigna    vigna         140 Jun 11 21:38 javadoc-text.cluster.strategy
-rw-r--r--    1 vigna    vigna        3.1k Jun 11 21:38 javadoc-text.frequencies
-rw-r--r--    1 vigna    vigna        4.0k Jun 11 21:38 javadoc-text.globcounts
-rw-r--r--    1 vigna    vigna         78k Jun 11 21:38 javadoc-text.index
-rw-r--r--    1 vigna    vigna        7.6k Jun 11 21:38 javadoc-text.offsets
-rw-r--r--    1 vigna    vigna        334k Jun 11 21:38 javadoc-text.positions
-rw-r--r--    1 vigna    vigna         452 Jun 11 21:38 javadoc-text.properties
-rw-r--r--    1 vigna    vigna         799 Jun 11 21:38 javadoc-text.sizes
-rw-r--r--    1 vigna    vigna        1.0k Jun 11 21:38 javadoc-text.stats
-rw-r--r--    1 vigna    vigna         23k Jun 11 21:38 javadoc-text.termmap
-rw-r--r--    1 vigna    vigna         48k Jun 11 21:38 javadoc-text.terms
-rw-r--r--    1 vigna    vigna        1.9k Jun 11 21:38 javadoc-text@0.frequencies
-rw-r--r--    1 vigna    vigna        2.4k Jun 11 21:38 javadoc-text@0.globcounts
-rw-r--r--    1 vigna    vigna        191k Jun 11 21:38 javadoc-text@0.index
-rw-r--r--    1 vigna    vigna        5.1k Jun 11 21:38 javadoc-text@0.offsets
-rw-r--r--    1 vigna    vigna         405 Jun 11 21:38 javadoc-text@0.properties
-rw-r--r--    1 vigna    vigna         444 Jun 11 21:38 javadoc-text@0.sizes
-rw-r--r--    1 vigna    vigna         29k Jun 11 21:38 javadoc-text@0.terms
-rw-r--r--    1 vigna    vigna        2.3k Jun 11 21:38 javadoc-text@1.frequencies
-rw-r--r--    1 vigna    vigna        3.0k Jun 11 21:38 javadoc-text@1.globcounts
-rw-r--r--    1 vigna    vigna        201k Jun 11 21:38 javadoc-text@1.index
-rw-r--r--    1 vigna    vigna        6.6k Jun 11 21:38 javadoc-text@1.offsets
-rw-r--r--    1 vigna    vigna         406 Jun 11 21:38 javadoc-text@1.properties
-rw-r--r--    1 vigna    vigna         355 Jun 11 21:38 javadoc-text@1.sizes
-rw-r--r--    1 vigna    vigna         39k Jun 11 21:38 javadoc-text@1.terms
-rw-r--r--    1 vigna    vigna         450 Jun 11 21:38 javadoc-title.cluster.properties
-rw-r--r--    1 vigna    vigna         140 Jun 11 21:38 javadoc-title.cluster.strategy
-rw-r--r--    1 vigna    vigna         139 Jun 11 21:38 javadoc-title.frequencies
-rw-r--r--    1 vigna    vigna         139 Jun 11 21:38 javadoc-title.globcounts
-rw-r--r--    1 vigna    vigna        1.6k Jun 11 21:38 javadoc-title.index
-rw-r--r--    1 vigna    vigna         467 Jun 11 21:38 javadoc-title.offsets
-rw-r--r--    1 vigna    vigna         661 Jun 11 21:38 javadoc-title.positions
-rw-r--r--    1 vigna    vigna         441 Jun 11 21:38 javadoc-title.properties
-rw-r--r--    1 vigna    vigna         223 Jun 11 21:38 javadoc-title.sizes
-rw-r--r--    1 vigna    vigna         998 Jun 11 21:38 javadoc-title.stats
-rw-r--r--    1 vigna    vigna        6.6k Jun 11 21:38 javadoc-title.termmap
-rw-r--r--    1 vigna    vigna        6.2k Jun 11 21:38 javadoc-title.terms
-rw-r--r--    1 vigna    vigna          81 Jun 11 21:38 javadoc-title@0.frequencies
-rw-r--r--    1 vigna    vigna          81 Jun 11 21:38 javadoc-title@0.globcounts
-rw-r--r--    1 vigna    vigna         800 Jun 11 21:38 javadoc-title@0.index
-rw-r--r--    1 vigna    vigna         203 Jun 11 21:38 javadoc-title@0.offsets
-rw-r--r--    1 vigna    vigna         392 Jun 11 21:38 javadoc-title@0.properties
-rw-r--r--    1 vigna    vigna         125 Jun 11 21:38 javadoc-title@0.sizes
-rw-r--r--    1 vigna    vigna        3.3k Jun 11 21:38 javadoc-title@0.terms
-rw-r--r--    1 vigna    vigna          66 Jun 11 21:38 javadoc-title@1.frequencies
-rw-r--r--    1 vigna    vigna          66 Jun 11 21:38 javadoc-title@1.globcounts
-rw-r--r--    1 vigna    vigna         605 Jun 11 21:38 javadoc-title@1.index
-rw-r--r--    1 vigna    vigna         158 Jun 11 21:38 javadoc-title@1.offsets
-rw-r--r--    1 vigna    vigna         392 Jun 11 21:38 javadoc-title@1.properties
-rw-r--r--    1 vigna    vigna          98 Jun 11 21:38 javadoc-title@1.sizes
-rw-r--r--    1 vigna    vigna        3.0k Jun 11 21:38 javadoc-title@1.terms

As you can see, there are several new files (there could be more or fewer, depending on the number of documents stored on your system): each file whose name starts with javadoc-text@ belongs to a specific subindex, generated from a batch of occurrences. The javadoc-text.properties file contains global information pertaining to all subindices. Other files, such as the .sizes files, contain the list of document sizes (the number of words contained in each document); these are useful for statistical purposes, but they may also be used by the indices to choose better compression methods for the inverted lists. The .terms files, instead, contain the terms indexed in each batch.
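
For instance, assuming the .terms files are plain UTF-8 text files containing one term per line, you can peek at the dictionary of a batch with a few lines of Java:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.List;

// Prints the number of terms and the first few entries of a .terms file,
// assuming a plain UTF-8 text file with one term per line.
public class TermsPeek {
    public static void main(String[] args) throws IOException {
        List<String> terms = Files.readAllLines(Paths.get("javadoc-text.terms"), StandardCharsets.UTF_8);
        System.out.println(terms.size() + " terms");
        terms.stream().limit(10).forEach(System.out::println);
    }
}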

Now, if you look into the javadoc-text.properties file, you will find some information:

# written by PropertiesConfiguration
# Mon Jun 11 21:38:38 CEST 2007

documents = 356
terms = 4908
postings = 62772
maxcount = 866
indexclass = it.unimi.dsi.mg4j.index.FileHPIndex
skipquantum = 64
skipheight = 8
coding = FREQUENCIES:GAMMA
coding = POINTERS:DELTA
coding = COUNTS:GAMMA
coding = POSITIONS:DELTA
termprocessor = it.unimi.dsi.mg4j.index.DowncaseTermProcessor
batches = 2
field = text
size = 3379672
maxdocsize = 13950
occurrences = 265665

You can see the overall number of occurrences (265665), the number of batches (2) and the maximum size (in words) of a document (13950). Similar information is available on a per-batch basis by looking at the remaining .properties files.
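
Since the file follows the usual key = value properties syntax, the single-valued entries can be read with java.util.Properties (a quick sketch; repeated keys such as coding are collapsed to their last value, so reading the file in full requires a configuration library such as the one MG4J itself uses):

import java.io.IOException;
import java.io.Reader;
import java.nio.file.*;
import java.util.Properties;

// Quick look at the global index properties; repeated keys (e.g., "coding")
// are collapsed to their last value by java.util.Properties.
public class IndexProperties {
    public static void main(String[] args) throws IOException {
        Properties p = new Properties();
        try (Reader r = Files.newBufferedReader(Paths.get("javadoc-text.properties"))) {
            p.load(r);
        }
        System.out.println("documents:   " + p.getProperty("documents"));
        System.out.println("terms:       " + p.getProperty("terms"));
        System.out.println("occurrences: " + p.getProperty("occurrences"));
        System.out.println("batches:     " + p.getProperty("batches"));
    }
}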

The files starting with javadoc-text.cluster present a cluster view of the set of batches just built. Essentially, they provide dynamic access to the entire set of batches as a single index. More information can be found in the documentation of the package it.unimi.dsi.mg4j.index.cluster.

Time/space requirements

The scanning phase is, by far, the most time- and space-consuming one. MG4J will work with little memory, but more memory makes it possible to build larger batches, which can then be merged more quickly and without opening too many files. You should set the JVM memory as high as you can, and choose a number of documents per batch that does not cause too many compactions (otherwise most of the time will be spent in the garbage collector), always keeping in mind that larger batches are better. If you experience out-of-memory errors (but it shouldn't happen!), just lower the number of documents per batch. Note that the memory compaction performed by MG4J seems to make the JVM erroneously think that there is too much garbage collection, sometimes resulting in an OutOfMemoryError due to excessive garbage-collector overhead. Please use the option -XX:-UseGCOverheadLimit to overcome the problem.
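
For instance, assuming the same collection as before, the scanning phase could be launched with a larger heap and the suggested flag (the heap size below is only an example; adjust it to your hardware):

java -Xmx2G -XX:-UseGCOverheadLimit it.unimi.dsi.mg4j.tool.Scan -S javadoc.collection \
    --downcase -I text javadoc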

The kind of document sequence you use heavily influences indexing time. The best way of providing data to MG4J is to stream documents to the standard input, separating them with a character (usually, newline or NUL). This is the default choice if you do not specify a collection explicitly. Other kinds of collections (e.g., database-based collections) might be reasonably efficient, but, for instance, do not expect great results from document sequences that retrieve documents directly from the file system.

Remember that the indexer will produce a number of subindices, and this number will depend on the overall number of occurrences (which is, essentially, proportional to the total document size). Combining these subindices (or accessing them using an on-the-fly index combiner) has a cost in time that increases logarithmically with the number of subindices. Moreover, for each subindex an on-the-fly combiner needs to allocate buffers, so the memory cost for batch or on-the-fly combination increases linearly with the number of subindices. The rule of thumb is that you should try to make batches as large as possible, but you should also check the logs because working with an almost full heap can slow down Java significantly.