Package nltk :: Package stem :: Module porter :: Class PorterStemmer

Class PorterStemmer

  object --+    
           |    
api.StemmerI --+
               |
              PorterStemmer

A word stemmer based on the Porter stemming algorithm.

    Porter, M. "An algorithm for suffix stripping."
    Program 14.3 (1980): 130-137.

A few minor modifications have been made to Porter's basic
algorithm.  See the source code of this module for more
information.

The Porter Stemmer requires that all tokens have string types.

Instance Methods
 
__init__(self)
The main part of the stemming algorithm starts here.
 
cons(self, i)
cons(i) is TRUE <=> b[i] is a consonant.
 
m(self)
m() measures the number of consonant sequences between k0 and j.
 
vowelinstem(self)
vowelinstem() is TRUE <=> k0,...j contains a vowel
 
doublec(self, j)
doublec(j) is TRUE <=> j,(j-1) contain a double consonant.
 
cvc(self, i)
cvc(i) is TRUE <=> i-2,i-1,i has the form consonant - vowel - consonant and the second consonant is not w, x or y.
 
ends(self, s)
ends(s) is TRUE <=> k0,...k ends with the string s.
 
setto(self, s)
setto(s) sets (j+1),...k to the characters in the string s, readjusting k.
 
r(self, s)
r(s) calls setto(s) if m() > 0; it is used further down, in step2() and step3().
 
step1ab(self)
step1ab() gets rid of plurals and -ed or -ing.
 
step1c(self)
step1c() turns terminal y to i when there is another vowel in the stem.
 
step2(self)
step2() maps double suffixes to single ones.
 
step3(self)
step3() deals with -ic-, -full, -ness etc.
 
step4(self)
step4() takes off -ant, -ence etc., in context <c>vcvc<v>.
 
step5(self)
step5() removes a final -e if m() > 1, and changes -ll to -l if m() > 1.
 
stem_word(self, p, i=0, j=None)
In stem_word(p, i, j), p is a char pointer, and the string to be stemmed is from p[i] to p[j] inclusive.
 
adjust_case(self, word, stem)
 
stem(self, word)
Strip affixes from the token and return the stem.
 
__repr__(self)
repr(x)

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __setattr__, __str__

Properties

Inherited from object: __class__

Method Details

__init__(self)
(Constructor)

The main part of the stemming algorithm starts here.
b is a buffer holding a word to be stemmed. The letters are in b[k0],
b[k0+1] ... ending at b[k]. In fact k0 = 0 in this demo program. k is
readjusted downwards as the stemming progresses. Zero termination is
not in fact used in the algorithm.

Note that only lower case sequences are stemmed. Forcing to lower case
should be done before stem(...) is called.

Overrides: object.__init__

m(self)

m() measures the number of consonant sequences between k0 and j.
If c is a consonant sequence and v a vowel sequence, and <..>
indicates arbitrary presence,

   <c><v>       gives 0
   <c>vc<v>     gives 1
   <c>vcvc<v>   gives 2
   <c>vcvcvc<v> gives 3
   ....
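
As a rough illustration of the measure, the sketch below labels each letter as consonant or vowel, collapses runs of the same kind, and counts the vc pairs. It is not the module's code: is_consonant and measure are hypothetical helper names, they work on a plain string rather than on the buffer b with the offsets k0 and j, and y is assumed to count as a consonant only at the start of the word or after a vowel.

```python
VOWELS = "aeiou"


def is_consonant(word, i):
    """Return True if word[i] is a consonant (hypothetical helper)."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # Assumption: y is a consonant at position 0 or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the vc pairs in the <c>(vc)...<v> form of stem."""
    pattern = []
    for i in range(len(stem)):
        kind = "c" if is_consonant(stem, i) else "v"
        # Collapse runs: only record a change of letter class.
        if not pattern or pattern[-1] != kind:
            pattern.append(kind)
    return "".join(pattern).count("vc")


for w in ("tree", "trouble", "troubles"):
    print(w, measure(w))
```

With this sketch, "tree" has the form cv, "trouble" the form cvcv and "troubles" the form cvcvc, matching the first three rows of the table above.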

cvc(self, i)

cvc(i) is TRUE <=>

a) (--NEW--) i == 1, and p[0] p[1] is vowel consonant, or

b) p[i - 2], p[i - 1], p[i] has the form consonant -
   vowel - consonant and the second consonant is not w, x or y.

This is used when trying to restore an e at the end of a short word,
e.g.

       cav(e), lov(e), hop(e), crim(e), but
       snow, box, tray.
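
The two conditions can be sketched as a standalone check on the end of a word. This is an illustration only: ends_cvc and is_consonant are hypothetical names, the check takes a plain string instead of an index into the buffer, and returning False for words shorter than two letters is an assumption.

```python
VOWELS = "aeiou"


def is_consonant(word, i):
    """Return True if word[i] is a consonant (hypothetical helper)."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # Assumption: y is a consonant at position 0 or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def ends_cvc(word):
    """Sketch of the cvc test applied to the last letters of word."""
    i = len(word) - 1
    if i < 1:
        return False  # assumption: too short to match either condition
    if i == 1:
        # Condition a): a two-letter word of the form vowel consonant.
        return not is_consonant(word, 0) and is_consonant(word, 1)
    # Condition b): consonant - vowel - consonant ...
    if not (is_consonant(word, i - 2)
            and not is_consonant(word, i - 1)
            and is_consonant(word, i)):
        return False
    # ... where the final consonant is not w, x or y.
    return word[i] not in "wxy"


for w in ("hop", "crim", "snow", "box", "tray"):
    print(w, ends_cvc(w))
```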

step1ab(self)

step1ab() gets rid of plurals and -ed or -ing. e.g.

caresses  ->  caress
ponies    ->  poni
sties     ->  sti
tie       ->  tie        (--NEW--: see below)
caress    ->  caress
cats      ->  cat

feed      ->  feed
agreed    ->  agree
disabled  ->  disable

matting   ->  mat
mating    ->  mate
meeting   ->  meet
milling   ->  mill
messing   ->  mess

meetings  ->  meet
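
The plural rows of the table can be reproduced with a few suffix tests. The sketch below covers only the classic plural rules (-sses, -ies, -ss, -s); it leaves out the --NEW-- handling and the -ed / -ing rules, and strip_plural is a hypothetical name rather than a method of this class.

```python
def strip_plural(word):
    """Sketch of the plural-removal part of step 1 (classic rules only)."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


for w in ("caresses", "ponies", "sties", "caress", "cats"):
    print(w, "->", strip_plural(w))
```

The order of the tests matters: the longer suffixes have to be tried before the bare -s, otherwise "caress" would lose its final s.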

step1c(self)

step1c() turns terminal y to i when there is another vowel in the stem.
--NEW--: This has been modified from the original Porter algorithm so that y->i
is only done when y is preceded by a consonant, but not if the stem
is only a single consonant, i.e.

   (*c and not c) Y -> I

So 'happy' -> 'happi', but
  'enjoy' -> 'enjoy'  etc

This is a much better rule. Formerly 'enjoy'->'enjoi' and 'enjoyment'->
'enjoy'. Step 1c is perhaps done too soon; but with this modification that
no longer really matters.

Also, the removal of the vowelinstem(z) condition means that 'spy', 'fly',
'try' ... stem to 'spi', 'fli', 'tri' and conflate with 'spied', 'tried',
'flies' ...
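
A minimal sketch of the modified rule, assuming the whole word is passed as a plain string: the final y becomes i only when the letter before it is a consonant and the part before the y is longer than a single letter. The names y_to_i and is_consonant are hypothetical and are not part of this class.

```python
VOWELS = "aeiou"


def is_consonant(word, i):
    """Return True if word[i] is a consonant (hypothetical helper)."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # Assumption: y is a consonant at position 0 or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def y_to_i(word):
    """Sketch of the modified step 1c: (*c and not c) Y -> I."""
    if (len(word) > 2                      # stem is more than one letter
            and word.endswith("y")
            and is_consonant(word, len(word) - 2)):
        return word[:-1] + "i"
    return word


for w in ("happy", "enjoy", "spy", "by"):
    print(w, "->", y_to_i(w))
```

Here "by" is left alone because its stem is the single consonant b, and "enjoy" is left alone because the y follows a vowel.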

step2(self)

step2() maps double suffixes to single ones.
So -ization ( = -ize plus -ation) maps to -ize etc. Note that the
string before the suffix must give m() > 0.
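
The m() > 0 condition can be illustrated with a small lookup table. This is a sketch under stated assumptions, not the module's implementation: only the -ization entry comes from this page, the other entries are a partial selection from Porter's published rule list, and DOUBLE_SUFFIXES, map_double_suffix, measure and is_consonant are hypothetical names.

```python
VOWELS = "aeiou"


def is_consonant(word, i):
    """Return True if word[i] is a consonant (hypothetical helper)."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # Assumption: y is a consonant at position 0 or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the vc pairs in the <c>(vc)...<v> form of stem."""
    pattern = []
    for i in range(len(stem)):
        kind = "c" if is_consonant(stem, i) else "v"
        if not pattern or pattern[-1] != kind:
            pattern.append(kind)
    return "".join(pattern).count("vc")


# Partial table for illustration; the full rule list is longer.
DOUBLE_SUFFIXES = {
    "ational": "ate",
    "tional": "tion",
    "ization": "ize",
    "ation": "ate",
    "fulness": "ful",
    "ousness": "ous",
    "iveness": "ive",
}


def map_double_suffix(word):
    """Replace the longest matching suffix if the stem has m > 0."""
    for suffix in sorted(DOUBLE_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if measure(stem) > 0:
                return stem + DOUBLE_SUFFIXES[suffix]
            return word  # longest suffix matched but the stem is too short
    return word


for w in ("relational", "rational", "vietnamization"):
    print(w, "->", map_double_suffix(w))
```

In this sketch "relational" becomes "relate" because the stem "rel" has measure 1, while "rational" is left unchanged because the stem "r" has measure 0.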

step3(self)

step3() deals with -ic-, -full, -ness etc. It uses a similar strategy to step2.

stem_word(self, p, i=0, j=None)

In stem_word(p, i, j), p is a char pointer, and the string to be stemmed
is from p[i] to p[j] inclusive. Typically i is zero and j is the
offset to the last character of a string, (p[j+1] == '\0'). The
stemmer adjusts the characters p[i] ... p[j] and returns the new
end-point of the string, k. Stemming never increases word length, so
i <= k <= j. To turn the stemmer into a module, declare 'stem' as
extern, and delete the remainder of this file.

stem(self, word)

Strip affixes from the token and return the stem.

Parameters:
  • word - The token that should be stemmed.
Overrides: api.StemmerI.stem
(inherited documentation)

__repr__(self)
(Representation operator)

repr(x)

Overrides: object.__repr__
(inherited documentation)