Querying MG4J

Querying MG4J is easy if you have already used a text-indexing system. The simplest possible query is a single term, e.g., class: the answer to such a query is the set of all documents (in our case, all the files that have been indexed) that contain the word class (or any uppercase/lowercase variant thereof).

There are several additional operators you might want to try:

AND: writing more than one term (separated by whitespace) means that you want to look for documents that contain all the specified words (not necessarily consecutively or in the same order); for example, the query InputStream Reader encoding matches documents that contain all three words; you can convey the same meaning with the operator & (a.k.a. AND), writing InputStream & Reader & encoding instead;
OR: if you want to write a disjunctive query you can use the operator | (a.k.a. OR); thus, for example, the query InputStream | Reader | encoding means that you are looking for documents that contain any of the given words;
NOT: you can use the operator ! (a.k.a. NOT) to mean negation; thus, for example, the query InputStream & !Reader means that you are looking for documents that contain the first term but not the second;
phrase: you can force consecutivity by using quotation marks; thus "InputStream Reader" means that you want to look for documents that contain these two words consecutively;
proximity restriction: you can limit your search to documents where the words you are searching for appear within a limited portion of the document; this is done with the tilde operator; for example, (InputStream Reader)~5 means that you are looking for documents where the two given words appear (in any order) within 5 words of each other;
ordered AND: writing more than one term separated by < will find documents containing the given terms in the specified order; for example, InputStream < Reader means that you are looking for documents in which InputStream appears before Reader;
wildcard search: you can perform wildcard searches by appending * at the end of a term; for example, term* will look for documents containing "term", "terms", "termed" and so on;
parentheses: you can use parentheses to enforce priority when building complex queries; parentheses are not needed in many cases, but they are necessary, for example, when a boolean query is written within a phrase; thus, if you want to look for the word InputStream followed by Reader or Writer, you will enter the query "InputStream (Reader | Writer)";
index specifiers: by prefixing a query with the name of an index followed by a colon, you can restrict the search to that index; the name of an index is by default the name of the field that it has indexed, so title:Reader will search for Reader just in titles;
range queries: if you created an index containing payloads (dates, integers, etc.), you can perform range queries using square brackets and two dots; for instance, assuming the existence of a field date, the query [ 20/2/2007 .. 23/2/2007 ] will search for documents whose date is between 20 February and 23 February 2007, inclusive.
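
To make the set semantics of the first three operators concrete, here is a minimal sketch in Java. It is a toy model, not MG4J code: the class ToyBooleanQueries and all of its methods are hypothetical names invented for this illustration, and the tokenisation (lowercasing, splitting on non-word characters) is an assumption that only mimics the case-insensitive matching described above.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of the AND, OR and NOT operators over document sets (hypothetical, not MG4J code). */
public class ToyBooleanQueries {

    /** Maps each lowercased word to the set of numbers of the documents containing it. */
    static Map<String, Set<Integer>> index(List<String> docs) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        for (int d = 0; d < docs.size(); d++) {
            for (String word : docs.get(d).toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    postings.computeIfAbsent(word, k -> new TreeSet<>()).add(d);
                }
            }
        }
        return postings;
    }

    /** Documents containing the given term, ignoring case. */
    static Set<Integer> term(Map<String, Set<Integer>> idx, String t) {
        return new TreeSet<>(idx.getOrDefault(t.toLowerCase(), Collections.emptySet()));
    }

    /** a & b: documents in both sets. */
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new TreeSet<>(a);
        r.retainAll(b);
        return r;
    }

    /** a | b: documents in either set. */
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new TreeSet<>(a);
        r.addAll(b);
        return r;
    }

    /** a & !b: documents in the first set but not in the second. */
    static Set<Integer> andNot(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new TreeSet<>(a);
        r.removeAll(b);
        return r;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> idx = index(List.of(
                "InputStream and Reader handle encoding", // document 0
                "Reader only",                            // document 1
                "An InputStream without the other class"  // document 2
        ));
        Set<Integer> in = term(idx, "InputStream");
        Set<Integer> rd = term(idx, "Reader");
        System.out.println(and(in, rd));    // InputStream & Reader  -> [0]
        System.out.println(or(in, rd));     // InputStream | Reader  -> [0, 1, 2]
        System.out.println(andNot(in, rd)); // InputStream & !Reader -> [2]
    }
}
```

The sketch works on whole documents only; phrases, proximity restrictions and ordered AND also depend on the positions of the words inside each document, which this toy index does not record.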

MG4J will emphasise the intervals satisfying the query. Clicking on the link of a document opens the document in the browser.

The description we have just given only scratches the surface of the queries you can write with MG4J: all the operators can be freely combined, yielding very sophisticated constraints on the documents returned. More information on this topic can be found in the documentation of the package it.unimi.dsi.mg4j.search.

More sophisticated queries

MG4J actually provides very sophisticated query tuning. In particular, it provides scorers, which let you reorder the documents satisfying a query according to some criterion. To use these features, you must use the command-line interface, but all settings will also apply to subsequent web queries.

Type $ to get some help on the available options. A basic command is $mode, which lets you choose the kind of result: just the document number and title, the intervals, snippets and so on. Some options require a full index and a collection (for instance, snippets). The most interesting command, however, is $score, which lets you choose a scorer for your documents. For instance,

$score BM25Scorer VignaScorer

reproduces the standard settings: a BM25 scorer and a scorer that ranks first the documents satisfying your query more frequently and in smaller intervals, linearly combined with equal weight. Scorers are described in the documentation of the package it.unimi.dsi.mg4j.search.score.
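
As an illustration of what "linearly combined with equal weight" means, the following Java sketch ranks three documents by the weighted sum of two scores. It is not MG4J code: the class ToyScoreCombination, its methods and the score values are all hypothetical, and normalising the weights so that they sum to one is an assumption of the sketch rather than a documented property of MG4J.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy linear combination of the scores given by several scorers (hypothetical, not MG4J code). */
public class ToyScoreCombination {

    /** Weighted sum of the scores; the weights are normalised to sum to one (an assumption). */
    static double combine(double[] scores, double[] weights) {
        double total = 0;
        for (double w : weights) total += w;
        double sum = 0;
        for (int i = 0; i < scores.length; i++) sum += scores[i] * weights[i] / total;
        return sum;
    }

    /** Returns the document names sorted by decreasing combined score. */
    static List<String> rank(Map<String, double[]> scoresByDoc, double[] weights) {
        List<String> docs = new ArrayList<>(scoresByDoc.keySet());
        docs.sort(Comparator.comparingDouble((String d) -> -combine(scoresByDoc.get(d), weights)));
        return docs;
    }

    public static void main(String[] args) {
        // Made-up scores: { score from the first scorer, score from the second scorer }.
        Map<String, double[]> scoresByDoc = new LinkedHashMap<>();
        scoresByDoc.put("A", new double[] { 2.0, 0.5 });
        scoresByDoc.put("B", new double[] { 1.0, 3.0 });
        scoresByDoc.put("C", new double[] { 0.5, 0.5 });

        double[] equalWeights = { 1, 1 };
        System.out.println(combine(scoresByDoc.get("A"), equalWeights)); // 1.25
        System.out.println(rank(scoresByDoc, equalWeights));             // [B, A, C]
    }
}
```

With equal weights, document B comes first because its high second score outweighs the higher first score of document A.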

When you use a scorer, it is a good idea to use multiplexing: when multiplexing is on, each query is multiplexed to all indices (by default, a query is directed to the first index specified on the command line). Just type

$mplex on

Of course, you can always choose a specific index with the colon notation. You can also change the weight of your indices (which is particularly useful when multiplexing):

$weight text:1 title:3

In this way, weight-based scorers will usually consider the title field three times more important than the text field.
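
The effect of such a weight specification can be sketched as follows. This is a toy Java model, not MG4J's implementation: the class ToyIndexWeights and its methods are hypothetical, and the plain weighted sum is only one way a weight-based scorer might use the weights.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of per-index weights such as "text:1 title:3" (hypothetical, not MG4J code). */
public class ToyIndexWeights {

    /** Parses a specification such as "text:1 title:3" into a map from index name to weight. */
    static Map<String, Double> parseWeights(String spec) {
        Map<String, Double> weights = new LinkedHashMap<>();
        for (String part : spec.trim().split("\\s+")) {
            int colon = part.lastIndexOf(':');
            weights.put(part.substring(0, colon), Double.parseDouble(part.substring(colon + 1)));
        }
        return weights;
    }

    /** Weighted sum of the per-index scores; an index without a score contributes nothing. */
    static double weightedScore(Map<String, Double> weights, Map<String, Double> scoreByIndex) {
        double sum = 0;
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            sum += e.getValue() * scoreByIndex.getOrDefault(e.getKey(), 0.0);
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Double> weights = parseWeights("text:1 title:3");
        // The same raw score counts three times as much when it comes from the title index.
        System.out.println(weightedScore(weights, Map.of("text", 1.0)));  // 1.0
        System.out.println(weightedScore(weights, Map.of("title", 1.0))); // 3.0
    }
}
```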

You can also change the way snippets (or intervals) on display are chosen: MG4J provides an interval selector, a class that will try to choose the best intervals to be shown. You can set the maximum length of an interval, and the maximum number of intervals:

$selector 3 40

will show at most three intervals, and intervals longer than 40 characters will be broken. All these changes are reflected in the web interface.
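
The two limits can be pictured with a small Java sketch. It is only a toy: the class ToyIntervalSelector is hypothetical, it keeps the first intervals in the order given instead of trying to choose the best ones, and it models "breaking" an overlong interval as truncation, which is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy interval selector with a maximum count and a maximum length (hypothetical, not MG4J code). */
public class ToyIntervalSelector {

    /**
     * Keeps at most maxIntervals intervals, in the order given, and truncates each one
     * so that it spans at most maxLength positions. Intervals are { start, end }, inclusive.
     */
    static List<int[]> select(List<int[]> intervals, int maxIntervals, int maxLength) {
        List<int[]> selected = new ArrayList<>();
        for (int[] iv : intervals) {
            if (selected.size() == maxIntervals) break;
            int end = Math.min(iv[1], iv[0] + maxLength - 1);
            selected.add(new int[] { iv[0], end });
        }
        return selected;
    }

    public static void main(String[] args) {
        List<int[]> intervals = List.of(
                new int[] { 0, 9 },
                new int[] { 20, 99 },   // 80 positions long: will be cut to 40
                new int[] { 120, 125 },
                new int[] { 200, 210 }  // fourth interval: dropped
        );
        for (int[] iv : select(intervals, 3, 40)) {
            System.out.println(Arrays.toString(iv));
        }
        // Prints [0, 9], [20, 59] and [120, 125], one per line.
    }
}
```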

If you want to learn more about query resolution, you should have a look at the documentation of the class it.unimi.dsi.mg4j.query.QueryEngine, which embodies all the logic used to answer queries in MG4J.