Matcher Design Notes

This document is incomplete at present. It lacks explanation of the min-heap used to keep the best N M-set items (Managing Gigabytes describes this technique well).

The PostList Tree

The matcher builds a tree structure from the query. This tree is a binary tree, and each node is a PostList sub-class object. This is more efficient than a n-ary tree in terms of the number of comparisons which need to be performed: <insert proof> (but this proof may only be valid for equal sized posting lists without optimisations, in which case there may be a more efficient way to do this - investigate!)

To built the tree, a PostList object is created for each term, and pairs of PostLists are combined using 2-way branching tree elements for AND, OR, etc - these are virtual PostLists whose class names reflect the operation (AndPostList, OrPostList, etc). See below for a full list.

The tree is deliberately built in an uneven way, such that we minimise the likely number of times a posting has to be passed up a level. For a group of OR operations, the PostLists with fewest entries are furthest down the group's subtree, minimising the amount of information needing to be passed up the tree. For a group of AND operations the PostLists with most entries are furthest down, so we look at the least frequent terms first, and skip the posting lists of the others. This will generally minimise the number of posting list entries we read and maximises the size of each skip_to. The OR tree is built up in a similar way to how an optimal huffman code is constructed. This is provably optimal, but with the assumption that the tree structure is immutable once created. If term distribution is uneven, rebalancing this tree during the match might be more efficient.

running the match

Once the tree is built, the matcher repeatedly asks the root of the tree for the next matching document and compares it to those in the proto-mset it maintains. If the next matching document scores more highly (either by weight, or in sort order if sorting is used) then it adds it and discards the lowest scoring document.

When one of a sub-tree of AND operations runs out, it signals "end of list", and each AND signals this too.

When an OR gets end of list, it autoprunes, replacing itself with the branch that still has postings - see below for full details. If the matcher itself gets "end of list", the match is complete.

The other operations also handle end of list in one of these two ways (for asymmetric operations, which happens may depend which branch has run out).

The matcher also passes the lowest weight currently needed make the proto-mset into the tree, and each node may adjust this weight and pass it on to its subtrees. Each PostList can report a minimum weight it could contribute - so if the left branch of an AND will always return a weight of 2 or more, then if the whole AND needs to return at least 6, the right branch is told it need to return at least 4.

For example, an OR knows that if its left branch can contribute at most a weight of 4 and its right branch at most 7, then if the minimum weight is 8, only documents matching both branches are now of interest so it mutates into an AND. If the minimum weight is 6 it changes into an AND_MAYBE (A AND_MAYBE B matches documents which which match A, but B contributes to the weight - in most search engines query syntax, that's expressed as `+A B'). See the "Operator Decay" section below for full details of these mutations. If the minimum weight needed is 12, no document is good enough, and the OR returns "end of list".

Phrase and near matching

The way phrase and near matching works is to perform an AND query for all the terms, with a filter node in front which only returns documents whose positional information fulfils the phrase requirements.

Unfortunately this creates a bad case is where a lot of documents have the words of the phrase in but few match the actual phrase - this filter does a lot of work, but the matcher can't stop early. We should look at hoisting the filtering part higher up the tree, but note that this may not always be a win. Some heuristics are probably required.

virtual postlist types

There are several types of virtual PostList. Each type can be treated as boolean or probabilistic - the only difference is whether the weights are ignored or not. The types are:

OrPostList: returns documents which match either branch
AndPostList: returns documents which match both branches
XorPostList: returns documents which match one branch or the other but not both
AndNotPostList: returns documents which match the left branch, but not the right (the weights of documents from the right branch are ignored).
AndMaybePostList: returns documents which match the left branch - weights from documents also in the right branch are added in for the probabilistic case ("X ANDMAYBE Y" can be expressed as "+X Y" in Altavista).
FilterPostList: applies the right branch as a boolean filter to the left branch (which is typically a probabilistic query. Note: same as AndPostList with the right branch weights ignored.

[Note: You can use AndNotPostList to apply an inverted boolean filter to a probabilistic query]

All the symmetric operators (i.e. OR, AND, XOR) are coding for maximum efficiency when the right branch has fewer postings in than the left branch.

There are 2 main optimisations which the best match performs: autoprune and operator decay.

autoprune

For example, if a branch in the match tree is "A OR B", when A runs out then "A OR B" is replaced by "B". Similar reductions occur for XOR, ANDNOT, and ANDMAYBE (if the right branch runs out). Other operators (AND, FILTER, and ANDMAYBE (when the left branch runs out) simply return "at_end" and this is dealt with somewhere further up the tree as appropriate.

An autoprune is indicated by the next or skip_to method returning a pointer to the PostList object to replace the postlist being read with.

operator decay

The matcher tracks the minimum weight needed for a document to make it into the m-set (this decreases monotonically as the m-set forms). This can be used to replace on boolean operator with a stricter one. E.g. consider A OR B - when maxweight(A) < minweight and maxweight(B) < minweight then only documents matching both A and B can make it into the m-set so we can replace the OR with an AND. Operator decay is flagged using the same mechanism as autoprune, by returning the replacement operator from next or skip_to.

Possible decays:

OR -> AND
OR -> ANDMAYBE
ANDMAYBE -> AND
XOR -> ANDNOT

A related optimisation is that the Match object may terminate early if maxweight for the whole tree is less than the smallest weight in the mset.