Class PorterStemmer

  object --+    
           |    
api.StemmerI --+
               |
              PorterStemmer

Known Subclasses:

Porter


A word stemmer based on the Porter stemming algorithm.

    Porter, M. "An algorithm for suffix stripping."
    Program 14.3 (1980): 130-137.

A few minor modifications have been made to Porter's basic
algorithm.  See the source code of this module for more
information.

The Porter Stemmer requires that all tokens have string types.

Instance Methods

__init__(self)
The main part of the stemming algorithm starts here.

cons(self, i)
cons(i) is TRUE <=> b[i] is a consonant.

m(self)
m() measures the number of consonant sequences between k0 and j.

vowelinstem(self)
vowelinstem() is TRUE <=> k0,...j contains a vowel

doublec(self, j)
doublec(j) is TRUE <=> j,(j-1) contain a double consonant.

cvc(self, i)
cvc(i) is TRUE <=> i-2, i-1, i has the form consonant - vowel - consonant and the second consonant is not w, x or y.

ends(self, s)
ends(s) is TRUE <=> k0,...k ends with the string s.

setto(self, s)
setto(s) sets (j+1),...k to the characters in the string s, readjusting k.

r(self, s)
r(s) is used further down.

step1ab(self)
step1ab() gets rid of plurals and -ed or -ing.

step1c(self)
step1c() turns terminal y to i when there is another vowel in the stem.

step2(self)
step2() maps double suffixes to single ones.

step3(self)
step3() deals with -ic-, -ful, -ness etc.

step4(self)
step4() takes off -ant, -ence etc., in context <c>vcvc<v>.

step5(self)
step5() removes a final -e if m() > 1, and changes -ll to -l if m() > 1.

stem_word(self, p, i=0, j=None)
In stem_word(p, i, j), p is the string holding the word (a char pointer in the original C code), and the part to be stemmed runs from p[i] to p[j] inclusive.

adjust_case(self, word, stem)

stem(self, word)
Strip affixes from the token and return the stem.

__repr__(self)
repr(x)

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __setattr__, __str__

Properties

Inherited from object: __class__

Method Details

__init__(self)
(Constructor)

The main part of the stemming algorithm starts here.
b is a buffer holding a word to be stemmed. The letters are in b[k0],
b[k0+1] ... ending at b[k]. In fact k0 = 0 in this demo program. k is
readjusted downwards as the stemming progresses. Zero termination is
not in fact used in the algorithm.

Note that only lower case sequences are stemmed. Forcing to lower case
should be done before stem(...) is called.

Overrides: object.__init__

m(self)

m() measures the number of consonant sequences between k0 and j.
If c is a consonant sequence and v a vowel sequence, and <..>
indicates arbitrary presence,

   <c><v>       gives 0
   <c>vc<v>     gives 1
   <c>vcvc<v>   gives 2
   <c>vcvcvc<v> gives 3
   ....
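
The measure can be illustrated with a small standalone sketch. It is not
the module's own code: the names is_consonant and measure are hypothetical,
and the treatment of 'y' (a consonant at the start of the word or after a
vowel, otherwise a vowel) is assumed from Porter's paper. The word is mapped
to a c/v pattern, runs are collapsed, and the "vc" pairs are counted.

```python
VOWELS = "aeiou"


def is_consonant(word, i):
    """Return True if word[i] is a consonant in Porter's sense (assumed rule)."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # 'y' counts as a consonant at the start of the word or after a
        # vowel, and as a vowel after a consonant (as in "ivy").
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Return m, the number of vc sequences in <c>(vc)^m<v>."""
    collapsed = ""
    for i in range(len(stem)):
        kind = "c" if is_consonant(stem, i) else "v"
        # Collapse runs: only record a letter class when it changes.
        if not collapsed or collapsed[-1] != kind:
            collapsed += kind
    return collapsed.count("vc")


if __name__ == "__main__":
    for w in ["tree", "by", "trouble", "oats", "troubles", "private"]:
        print(w, measure(w))
```

The sample words are the ones Porter's paper uses to illustrate m = 0, 1
and 2.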

cvc(self, i)

cvc(i) is TRUE <=>

a) (--NEW--) i == 1, and p[0] p[1] is vowel consonant, or

b) p[i - 2], p[i - 1], p[i] has the form consonant -
   vowel - consonant and the second consonant is not w, x or y. This
   is used when trying to restore an e at the end of a short word,
   e.g.

       cav(e), lov(e), hop(e), crim(e), but
       snow, box, tray.
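
A minimal standalone sketch of the two conditions above, assuming the same
consonant rule as Porter's paper. The function names is_consonant and
ends_cvc are hypothetical and operate on a plain string rather than on the
stemmer's internal buffer.

```python
VOWELS = "aeiou"


def is_consonant(word, i):
    """Return True if word[i] is a consonant in Porter's sense (assumed rule)."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # 'y' is a consonant at position 0 or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def ends_cvc(word, i):
    """Check conditions (a) and (b) for the letters ending at index i."""
    if i == 0:
        return False
    if i == 1:
        # Case (a): a two-letter vowel-consonant word.
        return (not is_consonant(word, 0)) and is_consonant(word, 1)
    is_cvc = (
        is_consonant(word, i - 2)
        and not is_consonant(word, i - 1)
        and is_consonant(word, i)
    )
    # Case (b): consonant - vowel - consonant, final consonant not w, x or y.
    return is_cvc and word[i] not in "wxy"


if __name__ == "__main__":
    for w in ["cav", "lov", "hop", "crim", "snow", "box", "tray"]:
        print(w, ends_cvc(w, len(w) - 1))
```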

step1ab(self)

step1ab() gets rid of plurals and -ed or -ing. e.g.

caresses  ->  caress
ponies    ->  poni
sties     ->  sti
tie       ->  tie        (--NEW--: see below)
caress    ->  caress
cats      ->  cat

feed      ->  feed
agreed    ->  agree
disabled  ->  disable

matting   ->  mat
mating    ->  mate
meeting   ->  meet
milling   ->  mill
messing   ->  mess

meetings  ->  meet
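
The plural half of this step can be sketched as follows. This is only an
illustration under stated assumptions: the function name strip_plural is
hypothetical, the rules are the four plural rules from Porter's paper
(sses -> ss, ies -> i, ss -> ss, s -> nothing), and neither the -ed/-ing
handling nor the --NEW-- modification is included.

```python
def strip_plural(word):
    """Apply the plural rules of step 1a to a lower-case word (sketch)."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


if __name__ == "__main__":
    for w in ["caresses", "ponies", "sties", "tie", "caress", "cats"]:
        print(w, "->", strip_plural(w))
```

The order of the tests matters: the longest suffix is tried first, so
"caress" is caught by the ss rule before the single s rule can shorten it.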

step1c(self)

step1c() turns terminal y to i when there is another vowel in the stem.
--NEW--: This has been modified from the original Porter algorithm so that y->i
is only done when y is preceded by a consonant, but not if the stem
is only a single consonant, i.e.

   (*c and not c) Y -> I

So 'happy' -> 'happi', but
  'enjoy' -> 'enjoy'  etc

This is a much better rule. Formerly 'enjoy'->'enjoi' and 'enjoyment'->
'enjoy'. Step 1c is perhaps done too soon; but with this modification that
no longer really matters.

Also, the removal of the vowelinstem(z) condition means that 'spy', 'fly',
'try' ... stem to 'spi', 'fli', 'tri' and conflate with 'spied', 'tried',
'flies' ...
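
One possible reading of the modified rule, as a standalone sketch: the
terminal y becomes i when the letter before it is a consonant and the stem
before the y is longer than a single letter. The names is_consonant and
step1c_sketch are hypothetical, and the length check is an assumption about
what "not if the stem is only a single consonant" means.

```python
VOWELS = "aeiou"


def is_consonant(word, i):
    """Return True if word[i] is a consonant in Porter's sense (assumed rule)."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # 'y' is a consonant at position 0 or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def step1c_sketch(word):
    """Turn a terminal y into i under the modified condition (sketch)."""
    if (
        len(word) > 2                           # stem is more than one letter
        and word.endswith("y")
        and is_consonant(word, len(word) - 2)   # y is preceded by a consonant
    ):
        return word[:-1] + "i"
    return word


if __name__ == "__main__":
    for w in ["happy", "enjoy", "spy", "fly", "try", "by"]:
        print(w, "->", step1c_sketch(w))
```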

step2(self)

step2() maps double suffixes to single ones.
So -ization ( = -ize plus -ation) maps to -ize etc. Note that the
string before the suffix must give m() > 0.
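
The strategy can be sketched with a short lookup table. This is an
illustration, not the module's code: the names DOUBLE_SUFFIXES,
map_double_suffix, measure and is_consonant are hypothetical, and the table
holds only a few of the suffix pairs listed in Porter's paper. The first
matching suffix is replaced only when the remaining stem has m > 0.

```python
VOWELS = "aeiou"

# Partial table, for illustration only.
DOUBLE_SUFFIXES = [
    ("ization", "ize"),
    ("ational", "ate"),
    ("fulness", "ful"),
    ("ousness", "ous"),
    ("iveness", "ive"),
]


def is_consonant(word, i):
    """Return True if word[i] is a consonant in Porter's sense (assumed rule)."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Return m, the number of vc sequences in <c>(vc)^m<v>."""
    collapsed = ""
    for i in range(len(stem)):
        kind = "c" if is_consonant(stem, i) else "v"
        if not collapsed or collapsed[-1] != kind:
            collapsed += kind
    return collapsed.count("vc")


def map_double_suffix(word):
    """Replace the first matching double suffix if the stem has m > 0."""
    for suffix, replacement in DOUBLE_SUFFIXES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if measure(stem) > 0:
                return stem + replacement
            return word  # suffix matched, but the stem is too short
    return word


if __name__ == "__main__":
    for w in ["vietnamization", "relational", "rational", "hopefulness"]:
        print(w, "->", map_double_suffix(w))
```

Here "rational" is left alone because the stem "r" has m = 0, while
"relational" becomes "relate" because "rel" has m = 1.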

step3(self)

step3() deals with -ic-, -ful, -ness etc. Similar strategy to step2.

stem_word(self, p, i=0, j=None)

In stem_word(p, i, j), p is the string holding the word (a char pointer
in the original C code), and the part to be stemmed runs from p[i] to
p[j] inclusive. Typically i is zero and j is the offset to the last
character of the string (in C, p[j+1] == '\0'). The stemmer adjusts the
characters p[i] ... p[j] and returns the new end-point of the string, k.
Stemming never increases word length, so i <= k <= j. The closing remark,
"To turn the stemmer into a module, declare 'stem' as extern, and delete
the remainder of this file", is carried over from the C version.

stem(self, word)

Strip affixes from the token and return the stem.

Parameters:

word - The token that should be stemmed.

Overrides: api.StemmerI.stem

(inherited documentation)

__repr__(self)
(Representation operator)

repr(x)

Overrides: object.__repr__: (inherited documentation)
