Indexing in MG4J is centered around documents, exposed either as sequences or as collections. For the time being, let us concentrate on collections, which are randomly addressable lists of documents.
Each document in a collection is associated with a title and a URI.
Typical titles are filenames or the titles of HTML documents; a URI can be
the actual URL of a page. To build our first document collection, we use
the main method of the class FileSetDocumentCollection, which builds and
serializes a collection of documents specified by their filenames. As a
typical case, we will build a collection out of your Javadoc documentation
directory. Supposing your Javadocs are located in /usr/share/javadoc, you
may try the following:
find /usr/share/javadoc/ -iname \*.html -type f | \
  egrep -v "(package-|-tree|class-use|index-.*.html|allclasses)" | \
  java it.unimi.dsi.mg4j.document.FileSetDocumentCollection \
    -f HtmlDocumentFactory -p encoding=UTF-8 javadoc.collection
Let us try to understand what is happening. We are providing as input to
the main method of the class a list of files, one per line. Moreover, we
are specifying (using the -f option) a factory, that is, something that
turns a pure stream of bytes (provided, in this case, by a file) into a
document made of several fields (for instance, title and main text). The
factory needs to know the encoding of the files, and we are specifying
UTF-8 as a property (using the -p option). All this information is
serialized and stored in a file named javadoc.collection. Note that since
we are using a standard MG4J factory, we can omit the full factory class
name (it.unimi.dsi.mg4j.document.HtmlDocumentFactory).
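If find and egrep are not available, the file list fed to FileSetDocumentCollection can be produced in other ways. The following is a minimal sketch in plain Java, not part of MG4J, that mimics the selection made by the pipeline above: regular files whose name ends in .html (ignoring case), minus the paths matching the exclusion pattern. The class name JavadocFileListSketch and its methods are hypothetical names introduced only for this illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper (not an MG4J class): prints one file name per line.
public class JavadocFileListSketch {
    // Same pattern that is passed to "egrep -v" in the pipeline above.
    private static final Pattern EXCLUDED =
        Pattern.compile("(package-|-tree|class-use|index-.*.html|allclasses)");

    /** Mimics "-iname *.html" on the file name and "egrep -v" on the whole path. */
    public static boolean accept(String path) {
        String name = Paths.get(path).getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".html") && !EXCLUDED.matcher(path).find();
    }

    /** Walks the tree under root and returns the accepted regular files, sorted. */
    public static List<String> list(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)   // like "-type f"
                        .map(Path::toString)
                        .filter(JavadocFileListSketch::accept)
                        .sorted()
                        .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path root = Paths.get(args.length > 0 ? args[0] : "/usr/share/javadoc");
        if (!Files.isDirectory(root)) {
            System.err.println("Not a directory: " + root);
            return;
        }
        list(root).forEach(System.out::println);
    }
}
```

The output of this program could then be piped into FileSetDocumentCollection in place of the output of find and egrep.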
If you look into the file javadoc.collection, you will discover that it is
indeed a typical serialized Java object; note that the file does not
contain the files that are part of the collection, but only their names.
This means, in particular, that the very existence of the collection
depends on the existence of the files it refers to; in other words,
deleting or modifying any of the indexed files may cause inconsistencies in
the collection (and, more importantly, in the index produced in the
following steps). This is true of almost every collection: document
collections may base their existence on some external data (files, web
pages, mailbox files, etc.), and they usually become inconsistent as soon
as such data are modified or deleted.
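To make the point about external data concrete, here is a small sketch using only standard Java serialization. It is not the actual layout of FileSetDocumentCollection: FileNameList is a hypothetical stand-in that, like the collection described above, stores file names and nothing else. After a round trip through serialization the names are intact, but a file deleted in the meantime is simply gone.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class NameOnlyCollectionSketch {
    /** Hypothetical stand-in for a file-based collection: names only, no contents. */
    public static final class FileNameList implements Serializable {
        private static final long serialVersionUID = 1L;
        public final ArrayList<String> names;
        public FileNameList(List<String> names) { this.names = new ArrayList<>(names); }
    }

    public static byte[] serialize(FileNameList list) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(list);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static FileNameList deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (FileNameList) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the names that no longer point to a regular file. */
    public static List<String> missing(FileNameList list) {
        List<String> result = new ArrayList<>();
        for (String name : list.names) {
            if (!Files.isRegularFile(Paths.get(name))) result.add(name);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path page = Files.createTempFile("page", ".html");
        byte[] stored = serialize(new FileNameList(List.of(page.toString())));
        Files.delete(page);                       // the external data disappears
        FileNameList restored = deserialize(stored);
        System.out.println("Names stored:  " + restored.names);
        System.out.println("Files missing: " + missing(restored));
    }
}
```

A check along the lines of missing() before indexing or querying is one way to detect that a collection has become inconsistent.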
It is now time to index our collection. To do so, we simply pass the
collection to the main method of the class IndexBuilder, which scans all
documents in the collection and produces a number of indices, one for each
field of the collection. The number of fields depends on the factory used
to produce documents: in our case, we will get indices for the title (the
content of the HTML title element, if present; otherwise, the filename)
and the body (the textual content of the entire HTML page). Additionally,
FileSetDocumentCollection sets the URI of each document to a URI pointing
to the absolute location of the file in the file system; the document
title is, once more, going to be the title appearing in the HTML content.
java -Xmx256M it.unimi.dsi.mg4j.tool.IndexBuilder \
  --keep-batches --downcase -s 10000 -S javadoc.collection javadoc
The class IndexBuilder has a large number of options, as it runs in
sequence the two phases of the indexing process. These phases are also
available separately, mainly for very large collections (hundreds of
millions of documents) for which memory limits are rather tight.
In this example, we have used the --downcase option, which forces all
terms to be downcased: this means that the index will collapse words that
differ only in letter case. For example, the terms String and string will
not be distinguished. The -S option specifies that we are producing an
index for the specified collection (javadoc.collection): if the option
were omitted, IndexBuilder would expect to index a document sequence read
from standard input (more about this below). The -s option specifies the
batch size (see below). The --keep-batches option is not normally used,
but we specify it here so that we can have a look at the temporary files
generated during the indexing process. The last, unflagged argument,
javadoc, is the only mandatory argument of IndexBuilder: it is the index
basename, that is, the prefix from which the names of all index files are
derived.
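The effect of the --downcase option described above can be illustrated with a toy inverted index. The sketch below is plain Java and has nothing to do with MG4J's internal data structures or its tokenization: the class DowncaseSketch and its simple splitting on non-word characters are assumptions made only for this example. It shows that, once terms are downcased, String and string end up in the same posting list.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class DowncaseSketch {
    /** Builds a toy inverted index: term -> sorted set of document numbers. */
    public static Map<String, SortedSet<Integer>> index(List<String> documents, boolean downcase) {
        Map<String, SortedSet<Integer>> postings = new TreeMap<>();
        for (int doc = 0; doc < documents.size(); doc++) {
            // Naive tokenization, used here only for illustration.
            for (String term : documents.get(doc).split("\\W+")) {
                if (term.isEmpty()) continue;
                String key = downcase ? term.toLowerCase(Locale.ROOT) : term;
                postings.computeIfAbsent(key, k -> new TreeSet<>()).add(doc);
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("class String", "a string of characters");

        // Without downcasing, "String" and "string" are distinct terms.
        System.out.println(index(docs, false).keySet());
        // prints [String, a, characters, class, of, string]

        // With downcasing, both documents are listed under "string".
        System.out.println(index(docs, true).get("string"));
        // prints [0, 1]
    }
}
```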
Since our collection has documents containing two fields, named title and
text, there will be two sets of index files: each will be named, by
convention, with the index basename followed by the field name (separated
by a dash). Hence, there will be index files named javadoc-title.something
and files named javadoc-text.something.
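The naming convention just described is simple enough to write down as code. The following sketch, whose class and method names are hypothetical and not part of MG4J, derives the per-field basenames from the index basename and the field names; the actual file extensions are left out, since they are not listed here.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexNamingSketch {
    /** Joins the index basename and each field name with a dash. */
    public static List<String> fieldBasenames(String basename, List<String> fields) {
        List<String> names = new ArrayList<>();
        for (String field : fields) {
            names.add(basename + "-" + field);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(fieldBasenames("javadoc", List.of("title", "text")));
        // prints [javadoc-title, javadoc-text]
    }
}
```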
We have now built the indices, and we are ready to query them using a web
server. This is very easy in MG4J: we just run the main method of the
Query class, specifying the -h option and passing as arguments the indices
and (for showing snippets) the collection:
java it.unimi.dsi.mg4j.query.Query -h -i FileSystemItem \
  -c javadoc.collection javadoc-text javadoc-title
We can now either use the command line or open the search page by pointing
our browser to http://localhost:4242/Query and start querying the
collection. Note the -i option, which specifies how result items are
linked: the specified class links to a file in the file system using a
local HTTP server (the observation about class names made for factories
applies here, too).
Note that the names we specified for the indices (e.g., javadoc-text) are
actually URIs, so you can add options much as in a web query. For
instance, javadoc-text?inMemory=1 would load the index into main memory,
whereas javadoc-text?mapped=1 would try to use low-level memory-mapping
features of the operating system to cache the most frequently used part of
the index in main memory.
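To see how such an index name breaks down, the following sketch splits it into a basename and a map of options. This is not MG4J's own parser: the class IndexUriSketch is hypothetical, and the use of a semicolon to separate several key=value pairs is an assumption of this example that should be checked against the MG4J documentation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IndexUriSketch {
    /** Returns the part of the index name before the question mark. */
    public static String basename(String spec) {
        int q = spec.indexOf('?');
        return q < 0 ? spec : spec.substring(0, q);
    }

    /** Returns the key=value options after the question mark, in order. */
    public static Map<String, String> options(String spec) {
        Map<String, String> result = new LinkedHashMap<>();
        int q = spec.indexOf('?');
        if (q < 0) return result;
        // Assumption: multiple options are separated by semicolons.
        for (String pair : spec.substring(q + 1).split(";")) {
            if (pair.isEmpty()) continue;
            int eq = pair.indexOf('=');
            if (eq < 0) result.put(pair, "");
            else result.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        String spec = "javadoc-text?inMemory=1";
        System.out.println(basename(spec)); // prints javadoc-text
        System.out.println(options(spec));  // prints {inMemory=1}
    }
}
```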