Querying MG4J is easy if you have already used
a text-indexing system. The simplest possible query is a single term,
e.g., class: the answer to such a query is the set of all documents
(in our case, all the files that have been indexed) that contain the
word class (or any uppercase/lowercase variant thereof).
There are several additional operators you might want to try:
AND:
writing more than one term (separated by
whitespace) means that you want to look for documents that contain
all the specified words (not necessarily in the
same order or consecutively); for example, the query
InputStream Reader encoding matches documents that contain
all three words; you can convey the same meaning by using the
operator & (a.k.a. AND), thus writing
InputStream & Reader & encoding instead;
OR:
if you want to write a disjunctive query,
you can use the operator | (a.k.a. OR); thus, for
example, the query InputStream | Reader | encoding
means that you are looking for documents that contain any of
the given words;
NOT:
you can use the operator ! (a.k.a. NOT)
to mean negation; thus, for example, the query InputStream & !Reader
means that you are looking for documents that
contain the first term but not the second;
phrase: you can require words to be consecutive by using quotation marks;
thus "InputStream Reader" means that you want to
look for documents that contain these two words
one right after the other;
proximity restriction: you can limit your search to documents
where the words you are searching for appear within a limited portion of
the document; this is done with the tilde operator; for example,
(InputStream Reader)~5 means that you are looking
for documents where the two given words appear (in any order) within 5
words of each other;
ordered AND:
writing more than one term separated by <
will find documents containing
the given terms in the specified order;
wildcard search: you can perform wildcard searches by appending
* at the end of a term; for example, term* will
look for documents containing "term", "terms", "termed" and so
on;
parentheses: you can use parentheses to enforce priority when
building complex queries; parentheses are not needed in many cases,
but they are necessary, for example, when a boolean query is written
within a phrase; for example, if you want to look for the word
InputStream followed by Reader or Writer, you will enter the query
"InputStream (Reader | Writer)";
index specifiers: by prefixing a query with the name of an index
followed by a colon, you can restrict the search to that index. The
name of an index is by default the name of the field that it has
indexed, so title:Reader will search for Reader only in titles;
range queries: if you created an index containing
payloads (dates, integers, etc.) you can perform
range queries using square brackets and two dots: for instance,
assuming the existence of a field date, the query
[ 20/2/2007 .. 23/2/2007 ] will search for
documents whose date is between 20 February and 23 February 2007,
inclusive.
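The meaning of the basic operators can be illustrated with a toy model. The following Python sketch is not MG4J code and does not use its API: it builds a tiny in-memory positional index over three made-up documents and evaluates a few of the queries above by hand. All names (DOCS, docs_with, phrase, within) are hypothetical, and the reading of "within k words" as "both occurrences fit in a window of k words" is an assumption made for the example.

```python
from itertools import product

# Three made-up documents; terms are lowercased so that matching is
# case-insensitive, as described for single-term queries.
DOCS = {
    0: "InputStream Reader encoding",
    1: "Reader wraps an InputStream",
    2: "encoding of a Writer",
}

def build_index(docs):
    """Map each lowercased term to {doc_id: [positions]}."""
    index = {}
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index.setdefault(word, {}).setdefault(doc_id, []).append(pos)
    return index

INDEX = build_index(DOCS)

def docs_with(term):
    """Set of documents containing the term."""
    return set(INDEX.get(term.lower(), {}))

def phrase(first, second):
    """Documents where `second` immediately follows `first`."""
    a, b = INDEX.get(first.lower(), {}), INDEX.get(second.lower(), {})
    return {d for d in a.keys() & b.keys()
            if any(p + 1 in b[d] for p in a[d])}

def within(first, second, k):
    """Documents where both terms fit in a window of k words (any order)."""
    a, b = INDEX.get(first.lower(), {}), INDEX.get(second.lower(), {})
    return {d for d in a.keys() & b.keys()
            if any(abs(p - q) + 1 <= k for p, q in product(a[d], b[d]))}

# Toy counterparts of some queries from the list above.
and_q = docs_with("InputStream") & docs_with("Reader")   # InputStream & Reader
or_q = docs_with("InputStream") | docs_with("Writer")    # InputStream | Writer
not_q = docs_with("encoding") - docs_with("Reader")      # encoding & !Reader
phrase_q = phrase("InputStream", "Reader")               # "InputStream Reader"
near_q = within("InputStream", "Reader", 5)              # (InputStream Reader)~5
```

In this toy collection the phrase query matches only document 0, while the proximity query also matches document 1, where the two words appear in the opposite order and four positions apart.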
MG4J will emphasise the intervals satisfying the query. Clicking on the link of a document opens the document in the browser.
The description we have just given only scratches the surface of
the queries you can write with MG4J: all the
operators can be freely combined, obtaining very sophisticated constraints
on the documents returned. More information on this topic can be found in
the documentation of the package it.unimi.dsi.mg4j.search.
MG4J actually provides very sophisticated query tuning. In particular, it provides scorers, which let you reorder the documents satisfying a query according to some criterion. To use these features, you must use the command-line interface, although all settings will also be used for subsequent web queries.
Type $ to get some help on the available
options. A basic command is $mode, which lets you
choose the kind of result: just the document number and title, the
intervals, snippets and so on. Some options require a full index and a
collection (for instance, snippets). The most interesting command,
however, is $score, which lets you choose a scorer
for your documents. For instance,
$score BM25Scorer VignaScorer
reproduces
the standard settings, using a BM25 scorer and a scorer that shows
first the documents satisfying your queries more frequently and in smaller
intervals, linearly combined with equal weight. Scorers are described in
the documentation of the package it.unimi.dsi.mg4j.search.score.
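To make "linearly combined with equal weight" concrete, here is a minimal Python sketch of the arithmetic. It is a toy model under stated assumptions, not MG4J's implementation: the two score tables are made up, they do not come from BM25Scorer or VignaScorer, and the function names combine and rank are hypothetical.

```python
def combine(score_tables, weights):
    """Linearly combine per-document scores: sum over scorers of w_i * s_i(doc)."""
    docs = set().union(*score_tables)
    return {d: sum(w * table.get(d, 0.0)
                   for table, w in zip(score_tables, weights))
            for d in docs}

def rank(scores):
    """Document ids by decreasing combined score (ties broken by id)."""
    return sorted(scores, key=lambda d: (-scores[d], d))

# Made-up scores from two hypothetical scorers, for illustration only.
scorer_a = {"doc1": 0.9, "doc2": 0.4, "doc3": 0.1}
scorer_b = {"doc1": 0.2, "doc2": 0.8, "doc3": 0.3}

# Equal weight for both scorers.
combined = combine([scorer_a, scorer_b], [0.5, 0.5])
ranking = rank(combined)
```

With these numbers, doc1 is the favourite of the first scorer and doc2 of the second, but the equal-weight combination puts doc2 first because its two scores add up to more.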
When you use a scorer, it is a good idea to use multiplexing: when multiplexing is on, each query is multiplexed to all indices (by default, a query is directed to the first index specified on the command line). Just type
$mplex on
Of course, you can always choose a specific index with the colon notation. You can also change the weight of your indices (which is particularly useful when multiplexing):
$weight text:1 title:3
In this way, weight-based scorers will usually consider the
title field three times more important than the text field.
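The effect of such weights can be sketched as follows. This Python fragment is only an illustration: it assumes that a weight-based scorer computes a plain weighted sum of per-index scores, which may differ from what a given MG4J scorer actually does, and the names parse_weights and weighted_score, as well as the per-index scores, are hypothetical.

```python
def parse_weights(spec):
    """Parse a spec such as 'text:1 title:3' into {'text': 1.0, 'title': 3.0}."""
    weights = {}
    for item in spec.split():
        name, _, value = item.partition(":")
        weights[name] = float(value)
    return weights

def weighted_score(per_index_scores, weights):
    """Sum each index's score multiplied by that index's weight."""
    return sum(weights.get(name, 0.0) * score
               for name, score in per_index_scores.items())

weights = parse_weights("text:1 title:3")

# Made-up per-index scores: doc_a matches strongly in the text only,
# doc_b matches more weakly, but in the title.
doc_a = {"text": 1.0, "title": 0.0}
doc_b = {"text": 0.0, "title": 0.5}

score_a = weighted_score(doc_a, weights)
score_b = weighted_score(doc_b, weights)
```

Under this assumption the weaker title match of doc_b outranks the stronger text match of doc_a, because the title index counts three times as much.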
You can also change the way snippets (or intervals) on display are chosen: MG4J provides an interval selector, a class that will try to choose the best intervals to be shown. You can set the maximum length of an interval, and the maximum number of intervals:
$selector 3 40
will show at most three intervals, and intervals longer than 40 characters will be broken. All these changes are reflected in the web interface.
If you want to learn more about query resolution, you should have
a look at the documentation of the class
it.unimi.dsi.mg4j.query.QueryEngine, which
embodies all the logic used to answer queries in MG4J.