In this step, MG4J scans the whole document collection producing the so-called batches. Batches are subindices limited to a subset of documents, and they are created each time the number of indexed documents reaches a user-provided threshold, or when the available memory is too little.
An occurrence is a triple of numbers, say (t, d, p), meaning that the term with index t appears in document d at position p. Here, the document is represented by an integer, called the document pointer, which is in most cases the position of the document in the document collection (0 for the first document, 1 for the second document, and so on). The position is also an integer, representing where the term occurs in the document.
To understand what the scanning phase really does, suppose you have four documents:
Document pointer | Document |
---|---|
0 | I love you |
1 | God is love |
2 | Love is blind |
3 | Blind justice |
Here is the dictionary produced initially by the scanning phase:
Term index | Term |
---|---|
0 | blind |
1 | god |
2 | i |
3 | is |
4 | justice |
5 | love |
6 | you |
Now, at least conceptually, this is the list of occurrences:
Occurrences (in the same order as they are found when scanning the documents) |
---|
(2,0,0) (5,0,1) (6,0,2) (1,1,0) (3,1,1) (5,1,2) (5,2,0) (3,2,1) (0,2,2) (0,3,0) (4,3,1) |
This simply means that:
term 2 (i) appears in document 0 at position 0;
term 5 (love) appears in document 0 at position 1;
term 6 (you) appears in document 0 at position 2;
and so on. Inverted lists can now be obtained by re-sorting the occurrences in increasing term order, so that occurrences relative to the same term appear consecutively:
Term | Occurrences |
---|---|
0 (blind) | (0,2,2) (0,3,0) |
1 (god) | (1,1,0) |
2 (i) | (2,0,0) |
3 (is) | (3,1,1) (3,2,1) |
4 (justice) | (4,3,1) |
5 (love) | (5,0,1) (5,1,2) (5,2,0) |
6 (you) | (6,0,2) |
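The transformation just shown — from documents to a dictionary, to occurrences in scanning order, and finally to term-sorted occurrences — can be sketched in a few lines of Java. This is a hypothetical illustration using the four example documents, not MG4J's actual code; the class and method names are invented for the example.

```java
import java.util.*;

// Sketch of the scanning phase on the four example documents.
public class ScanSketch {
    static final String[] DOCS = { "i love you", "god is love", "love is blind", "blind justice" };

    // Dictionary: the distinct terms in alphabetical order; a term's
    // position in this list is its term index.
    static List<String> dictionary() {
        TreeSet<String> terms = new TreeSet<>();
        for (String d : DOCS) terms.addAll(Arrays.asList(d.split(" ")));
        return new ArrayList<>(terms);
    }

    // Occurrences (t, d, p) gathered in scanning order, then re-sorted
    // by term (and document) so that each term's occurrences appear
    // consecutively, yielding the inverted lists.
    static List<int[]> sortedOccurrences() {
        List<String> dict = dictionary();
        List<int[]> occ = new ArrayList<>();
        for (int d = 0; d < DOCS.length; d++) {
            String[] words = DOCS[d].split(" ");
            for (int p = 0; p < words.length; p++)
                occ.add(new int[] { dict.indexOf(words[p]), d, p });
        }
        occ.sort(Comparator.comparingInt((int[] o) -> o[0])
                           .thenComparingInt(o -> o[1]));
        return occ;
    }

    public static void main(String[] args) {
        System.out.println(dictionary());
        for (int[] o : sortedOccurrences()) System.out.println(Arrays.toString(o));
    }
}
```

Running this prints the dictionary of the example (blind, god, i, is, justice, love, you) and the eleven occurrences grouped by term, matching the table above.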
Now, the indexer must:
scan all documents and extract occurrences;
if the list of terms has not yet been obtained, gather new terms as they are found;
sort the terms alphabetically, renumbering all occurrences correspondingly;
(if required) renumber the documents and sort them in increasing order;
sort, at least partially, the occurrences found in increasing term order;
when the number of accumulated documents reaches a given threshold, create a subindex containing the current batch of occurrences.
The last point needs further explanation. Since there are a great many occurrences, it is not reasonable to expect that they can all be kept in memory. What the indexing pass does is keep an internal batch where occurrences are stored as they are found; when the batch is full, it is sorted by term and flushed to disk in the form of a subindex. Every batch will be in term order, but different batches may (and usually will) contain occurrences of the same term.
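As a rough sketch of this policy (hypothetical code, not MG4J's — real batches are bounded by a document threshold and by available memory, and are written to disk rather than collected in a list):

```java
import java.util.*;

// Sketch of the batching policy: occurrences accumulate as documents
// are scanned; when the number of accumulated documents reaches the
// threshold, the batch is sorted by term and "flushed" as a subindex.
public class BatchSketch {
    static List<List<int[]>> scan(List<List<int[]>> documents, int docsPerBatch) {
        List<List<int[]>> subindices = new ArrayList<>();
        List<int[]> batch = new ArrayList<>();
        int docs = 0;
        for (List<int[]> doc : documents) {
            batch.addAll(doc);
            if (++docs == docsPerBatch) {
                subindices.add(flush(batch));
                batch = new ArrayList<>();
                docs = 0;
            }
        }
        if (!batch.isEmpty()) subindices.add(flush(batch)); // last, partial batch
        return subindices;
    }

    // Sort the batch in term order before writing it out as a subindex.
    private static List<int[]> flush(List<int[]> batch) {
        List<int[]> sorted = new ArrayList<>(batch);
        sorted.sort(Comparator.comparingInt(o -> o[0]));
        return sorted;
    }

    public static void main(String[] args) {
        List<List<int[]>> docs = List.of(
            List.of(new int[]{2,0,0}, new int[]{5,0,1}, new int[]{6,0,2}),
            List.of(new int[]{1,1,0}, new int[]{3,1,1}, new int[]{5,1,2}),
            List.of(new int[]{5,2,0}, new int[]{3,2,1}, new int[]{0,2,2}),
            List.of(new int[]{0,3,0}, new int[]{4,3,1}));
        for (List<int[]> subindex : scan(docs, 2)) {
            for (int[] o : subindex) System.out.print(Arrays.toString(o) + " ");
            System.out.println();
        }
    }
}
```

With the four example documents and a threshold of two documents per batch, this produces two batches, each internally in term order, and both contain occurrences of term 5 (love) — exactly the overlap that the later combination phase has to resolve.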
You can run a scanning phase by giving the following command:
java -Xmx256M it.unimi.dsi.mg4j.tool.Scan -S javadoc.collection \
    --downcase -I text javadoc
This means that we are going to index the collection named javadoc.collection (-S javadoc.collection), and that we want to index only the field named text (-I text). The basename of the resulting index is going to be javadoc (as usual, completed with the field name).
After running this command, the directory contains the following files:
-rw-r--r-- 1 vigna vigna 453 Jun 11 21:38 javadoc-text.cluster.properties
-rw-r--r-- 1 vigna vigna 140 Jun 11 21:38 javadoc-text.cluster.strategy
-rw-r--r-- 1 vigna vigna 3.1k Jun 11 21:38 javadoc-text.frequencies
-rw-r--r-- 1 vigna vigna 4.0k Jun 11 21:38 javadoc-text.globcounts
-rw-r--r-- 1 vigna vigna 78k Jun 11 21:38 javadoc-text.index
-rw-r--r-- 1 vigna vigna 7.6k Jun 11 21:38 javadoc-text.offsets
-rw-r--r-- 1 vigna vigna 334k Jun 11 21:38 javadoc-text.positions
-rw-r--r-- 1 vigna vigna 452 Jun 11 21:38 javadoc-text.properties
-rw-r--r-- 1 vigna vigna 799 Jun 11 21:38 javadoc-text.sizes
-rw-r--r-- 1 vigna vigna 1.0k Jun 11 21:38 javadoc-text.stats
-rw-r--r-- 1 vigna vigna 23k Jun 11 21:38 javadoc-text.termmap
-rw-r--r-- 1 vigna vigna 48k Jun 11 21:38 javadoc-text.terms
-rw-r--r-- 1 vigna vigna 1.9k Jun 11 21:38 [email protected]
-rw-r--r-- 1 vigna vigna 2.4k Jun 11 21:38 [email protected]
-rw-r--r-- 1 vigna vigna 191k Jun 11 21:38 [email protected]
-rw-r--r-- 1 vigna vigna 5.1k Jun 11 21:38 [email protected]
-rw-r--r-- 1 vigna vigna 405 Jun 11 21:38 [email protected]
-rw-r--r-- 1 vigna vigna 444 Jun 11 21:38 [email protected]
-rw-r--r-- 1 vigna vigna 29k Jun 11 21:38 [email protected]
-rw-r--r-- 1 vigna vigna 2.3k Jun 11 21:38 [email protected]
-rw-r--r-- 1 vigna vigna 3.0k Jun 11 21:38 [email protected]
-rw-r--r-- 1 vigna vigna 201k Jun 11 21:38 [email protected]
-rw-r--r-- 1 vigna vigna 6.6k Jun 11 21:38 [email protected]
-rw-r--r-- 1 vigna vigna 406 Jun 11 21:38 [email protected]
-rw-r--r-- 1 vigna vigna 355 Jun 11 21:38 [email protected]
-rw-r--r-- 1 vigna vigna 39k Jun 11 21:38 [email protected]
-rw-r--r-- 1 vigna vigna 450 Jun 11 21:38 javadoc-title.cluster.properties
-rw-r--r-- 1 vigna vigna 140 Jun 11 21:38 javadoc-title.cluster.strategy
-rw-r--r-- 1 vigna vigna 139 Jun 11 21:38 javadoc-title.frequencies
-rw-r--r-- 1 vigna vigna 139 Jun 11 21:38 javadoc-title.globcounts
-rw-r--r-- 1 vigna vigna 1.6k Jun 11 21:38 javadoc-title.index
-rw-r--r-- 1 vigna vigna 467 Jun 11 21:38 javadoc-title.offsets
-rw-r--r-- 1 vigna vigna 661 Jun 11 21:38 javadoc-title.positions
-rw-r--r-- 1 vigna vigna 441 Jun 11 21:38 javadoc-title.properties
-rw-r--r-- 1 vigna vigna 223 Jun 11 21:38 javadoc-title.sizes
-rw-r--r-- 1 vigna vigna 998 Jun 11 21:38 javadoc-title.stats
-rw-r--r-- 1 vigna vigna 6.6k Jun 11 21:38 javadoc-title.termmap
-rw-r--r-- 1 vigna vigna 6.2k Jun 11 21:38 javadoc-title.terms
-rw-r--r-- 1 vigna vigna 81 Jun 11 21:38 [email protected]
-rw-r--r-- 1 vigna vigna 81 Jun 11 21:38 [email protected]
-rw-r--r-- 1 vigna vigna 800 Jun 11 21:38 [email protected]
-rw-r--r-- 1 vigna vigna 203 Jun 11 21:38 [email protected]
-rw-r--r-- 1 vigna vigna 392 Jun 11 21:38 [email protected]
-rw-r--r-- 1 vigna vigna 125 Jun 11 21:38 [email protected]
-rw-r--r-- 1 vigna vigna 3.3k Jun 11 21:38 [email protected]
-rw-r--r-- 1 vigna vigna 66 Jun 11 21:38 [email protected]
-rw-r--r-- 1 vigna vigna 66 Jun 11 21:38 [email protected]
-rw-r--r-- 1 vigna vigna 605 Jun 11 21:38 [email protected]
-rw-r--r-- 1 vigna vigna 158 Jun 11 21:38 [email protected]
-rw-r--r-- 1 vigna vigna 392 Jun 11 21:38 [email protected]
-rw-r--r-- 1 vigna vigna 98 Jun 11 21:38 [email protected]
-rw-r--r-- 1 vigna vigna 3.0k Jun 11 21:38 [email protected]
As you can see, there are several new files (there could be more or fewer, depending on the number of documents stored on your system): each file whose name starts with javadoc-text@ belongs to a certain subindex, which was generated using a batch of occurrences. The javadoc-text.properties file contains global information pertaining to all subindices. Other files, such as the .sizes files, contain the list of document sizes (the number of words contained in each document). The latter is useful for statistical purposes, but it might also be used by the indices to establish better compression methods for the inverted lists. The .terms files, instead, contain the terms indexed in each batch.
Now, if you look into the
javadoc-text.properties
file, you will find some
information:
# written by PropertiesConfiguration
# Mon Jun 11 21:38:38 CEST 2007
documents = 356
terms = 4908
postings = 62772
maxcount = 866
indexclass = it.unimi.dsi.mg4j.index.FileHPIndex
skipquantum = 64
skipheight = 8
coding = FREQUENCIES:GAMMA
coding = POINTERS:DELTA
coding = COUNTS:GAMMA
coding = POSITIONS:DELTA
termprocessor = it.unimi.dsi.mg4j.index.DowncaseTermProcessor
batches = 2
field = text
size = 3379672
maxdocsize = 13950
occurrences = 265665
You can see, among other data, the overall number of occurrences (265665), the number of batches (2) and the maximum size (number of words) of a document (13950). Similar information is available on a per-batch basis by looking at the remaining .properties files.
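If you want to inspect these values programmatically, the key = value lines are easy to read with a few lines of code. This is a minimal sketch (the class and method names are made up; the real file is written by Commons Configuration, and java.util.Properties would silently collapse the repeated coding keys, so this sketch keeps all values in a list):

```java
import java.util.*;

// Minimal parser for "key = value" property lines like the ones above.
public class IndexProperties {
    static Map<String, List<String>> parse(String text) {
        Map<String, List<String>> properties = new LinkedHashMap<>();
        for (String line : text.split("\n")) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) continue; // skip comments
            int eq = line.indexOf('=');
            if (eq < 0) continue;
            // Keep repeated keys (e.g. the four "coding" lines) as a list.
            properties.computeIfAbsent(line.substring(0, eq).trim(), k -> new ArrayList<>())
                      .add(line.substring(eq + 1).trim());
        }
        return properties;
    }

    public static void main(String[] args) {
        Map<String, List<String>> p =
            parse("documents = 356\ncoding = FREQUENCIES:GAMMA\ncoding = POINTERS:DELTA");
        System.out.println(p.get("documents").get(0));
        System.out.println(p.get("coding"));
    }
}
```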
The files starting with javadoc-text.cluster
present a cluster view of the set of batches just
built. Essentially, they provide dynamic access to the entire set of
batches as a single index. More information can be found in the
documentation of the package
it.unimi.dsi.mg4j.index.cluster
.
The scanning phase is, by far, the most time/space consuming. MG4J
will work with little memory, but more memory will make it possible to
build larger batches, which can then be merged more quickly and without
opening too many files. You shoiuld set the JVM memory as high as you
can go, and a number of documens per batch that does not cause too many
compactions (or most of the time will be spent in the garbage
collector), always keeping in mind that larger batches are better. If
you experience out-of-memory errors (but it shouldn't happen!), just
lower the number of documents per batch. Note that the memory compaction
performed by MG4J seems to make the JVM erroneously think that there is
too much garbage collection, sometimes resulting in an
OutOfMemoryError
due to excessive
garbage-collector overhead. Please use the option
-XX:-UseGCOverheadLimit
to overcome the problem.
The kind of document sequence heavily influences indexing time. The best way of providing data to MG4J is to stream documents to the standard input, separating them with a character (usually, newline or NUL). This is the default choice if you do not specify a collection explicitly. Other kinds of collections (e.g., database-based collections) might be reasonably efficient, but, for instance, do not expect great results from document sequences retrieving documents directly from the file system.
Remember that the indexer will produce a number of subindices, and this number will depend on the overall number of occurrences (which is, essentially, proportional to the total document size). Combining these subindices (or accessing them using an on-the-fly index combiner) has a cost in time that increases logarithmically with the number of subindices. Moreover, for each subindex an on-the-fly combiner needs to allocate buffers, so the memory cost for batch or on-the-fly combination increases linearly with the number of subindices. The rule of thumb is that you should try to make batches as large as possible, but you should also check the logs because working with an almost full heap can slow down Java significantly.