We'd like to add stemmers for more languages - see the Snowball site for information on how to contribute.
connection connections connective ---> connect connected connectingIt is important to appreciate that we use stemming with the intention of improving the performance of IR systems. It is not an exercise in etymology or grammar. In fact from an etymological or grammatical viewpoint, a stemming algorithm is liable to make many mistakes. In addition, stemming algorithms - at least the ones presented here - are applicable to the written, not the spoken, form of the language.
For some of the world's languages, Chinese for example, the concept of stemming is not applicable, but it is certainly meaningful for the many languages of the Indo-European group. In these languages words tend to be constant at the front, and to vary at the end:
-ion -ions connect-ive -ed -ingThe variable part is the `ending', or `suffix'. Taking these endings off is called `suffix stripping' or `stemming', and the residual part is called the stem.
confirmativescan be thought of as `confirm' with a chain of endings,
-atif (adjectival ending - morphological) plus -e (feminine ending - grammatical) plus -s (plural ending - grammatical)-atif can also be thought of as -ate plus -if. Note that the addition of endings can cause respellings, so -e changes preceding `f' to `v'.
Endings fall into two classes, grammatical and morphological. The addition of -s in English to make a plural is an example of a grammatical ending. The word remains of the same type. There is usually only one dictionary entry for a word with all its various grammatical endings. Morphological endings create new types of word. In English -ise or -ize makes verbs from nouns (`demon', `demonise'), -ly makes adverbs from adjectives (`foolish', `foolishly'), and so on. Usually there are separate dictionary endings for these creations.
From the source text derive a control vocabulary of words in sorted order. Sample vocabularies in this style are part of our Open Source release. If you make a small change to the stemming algorithm you should have a procedure that presents the change as a three column table: column one is the control vocabulary, column 2 the stemmed equivalent, and column 3 the stemmed equivalent after the change has been made to the algorithm. The effects of the change can be evaluated by looking at the differences between columns two and three.
The first job is to come up with a list of endings. This can be done by referring to the grammar, the dictionary, and also by browsing through the control vocabulary.
-IC (adj) -+-> -ATION (noun) +-> -ITY (noun) +-> -MENT (adv) \-> -AT (verb) -+-> -IV (adj) -+-> -ITY (noun) | \-> -MENT (adv) \-> -OR (noun) -ABLE (adj) -+-> -ITY (noun) \-> -MENT (adv) -OUS (adj) ---> -MENT (adv)The ending forms take different values in different languages. In French, -OR becomes `-eur' (m.) or `-rice' (f.), -AT disappears into the infinitive form of a verb. In English, -MENT becomes `-ly', and then one can recognise,
-IC-ATION fortification -IC-ITY electricity -IC-MENT fantastically -AT-IV contemplative -AT-OR conspirator -IV-ITY relativity -IV-MENT instinctively -ABLE-ITY incapability -ABLE-MENT charitably -OUS-MENT famouslyTrios, -IC-AT-IV etc., also occur, but sequences of length four, -IC-AT-IV-ITY and -IC-AT-IV-MENT, are absent (or occur very rarely).
Sometimes the validity of an ending depends on the immediately preceding group of letters. In Italian, for example, certain pronouns and pronoun groups attach to present participle and infinitive forms of verbs, for example,
scrivendole = scrivendo (writing) + le (to her)If E is the ending, the possible forms are -andoE, -endoE, -arE, -erE, -irE, so E is removed in the context -Xndo or Yr, where X is a or e, and Y is a or e or i. See the
mandarglielo = mandare (to give) + glielo (it to him)
attached_pronoun
procedure in the Italian
stemmer.The most useful criterion for removing an ending, however, is to base the decision on the syllable length of the stem that will remain. This idea was first used in the English stemming algorithm, and has been found to be applicable in the other stemming algorithms too. If C stands for a sequence of consonants, and V for a sequence of vowels, any word can be analysed as,
[C] V C ... V C [V]where [..] indicates arbitrary presence, and V C ... V C means V C repeated zero or more times. We can find successive positions 0, 1, 2 ... in a word corresponding to each vowel-consonant stretch V C,
t h u n d e r s t r i c k e n 0 1 2 3 4The closer E is to the beginning of the word, the more unwilling we should be remove it. So we might have a rule to remove E if at is after position 2, and so on.
This sounds like common sense, but it is actually very easy to fall into the trap of endlessly elaborating the rules without looking at their true effect. What you find eventually is that you can be improving performance in one area of the vocabulary, while causing a similar degradation of performance in another area. When this happens consistently it is time to call a halt to development and to regard the stemming algorithm as finished.
It is important to realise that the stemming process cannot be made perfect. For example, in French, the simple verb endings -ons and -ent of the present tense occur repeatedly in other contexts. -ons is the plural form of all nouns ending -on, and -ent is also a common noun ending. On balance it is best not to remove these endings. In practice this affects -ent verb endings more than -ons verb endings, since -ent endings are commoner. The result is that verbs stem not to a single form, but to a much smaller number of forms (three), among which the form given by the true stem of the verb is by far the commonest.
If we define errors A and B by,
error A: removing an ending when it is not an endingThen removing -ent leads to error A; not removing -ent leads to error B. We must adopt the rule that minimises the number of errors - not the rule that appears to be the most elegant.
error B: not removing an ending when it is an ending
An irregularity of English which does cause a problem is the word `news'. It
is now a singular noun, and is never regarded as the plural of `new'. This,
and a few more howlers, are placed in a table, irregular_forms
, in the
English stemming algorithm. Similar tables are provided in the other stemming
algorithms, with some provisional entries. The non-English stemming
algorithms have not been used enough for a significant list of irregular
forms to emerge, but as they do, they can be placed in the irregular_forms
table.
A superior approach is to keep each word, W, and its stemmed form, s(W), as a two-way relation in the IR system. W is held in the index with its own posting list. s(W) could have its separate posting list, but this would be derivable from the class of words that stem to s(W). The important thing is to have the W <-> s(W) relation. From W we can derive s(W), the stemmed form. From a stemmed form s(W) we can derive W plus the other words in the IR system which stem to s(W). Any word can then be searched on either stemmed or unstemmed. If the stemmed form of a word needs to be shown to the user, it can be represented by the commonest among the words which stem to that form.
Getting a list of stopwords can be done by sorting a vocabulary of a text corpus for a language by frequency, and going down the list picking off words to be discarded.
The stopword list connects in various ways with the stemming algorithm:
The stemming algorithm can itself be used to detect and remove stopwords. One
would add into the irregular_forms
table something like this,
"", /* null string */ "am/is/are/be/being/been/" /* BE */ "have/has/having/had/" /* HAD */ "do/does/doing/did/" /* DID */ ... /* multi-line string */so that the words `am', `is' etc. map to the null string (or some other easily recognised value).
Alternatively, stopwords could be removed before the stemming algorithm is applied, or after the stemming algorithm is applied. In this latter case, the words to be removed must themselves have gone through the stemmer, and the number of distinct forms will be greatly reduced as a result. In Italian for example, the four forms
questa queste questi questo(meaning `that') all stem to
questIn the xapian-data directory in the SVN repository, each language represented in the stemming section has, in addition to a large test vocabulary, a useful stopword list in both source and stemmed form. The source form, in the file
stopsource
, is carefully annotated, and the derived
file, stopwords
, contains an equivalent list of sorted, stemmed,
stopwords.