Package nltk :: Package stem :: Module porter
[hide private]
[frames] | no frames]

Module porter

source code

Porter Stemming Algorithm

This is the Porter stemming algorithm, ported to Python from the
version coded up in ANSI C by the author. It follows the algorithm
presented in

Porter, M. "An algorithm for suffix stripping." Program 14.3 (1980): 130-137.

only differing from it at the points maked --DEPARTURE-- and --NEW--
below.

For a more faithful version of the Porter algorithm, see

    http://www.tartarus.org/~martin/PorterStemmer/

Later additions:

   June 2000
   
   The 'l' of the 'logi' -> 'log' rule is put with the stem, so that
   short stems like 'geo' 'theo' etc work like 'archaeo' 'philo' etc.

   This follows a suggestion of Barry Wilkins, reasearch student at
   Birmingham.


   February 2000

   the cvc test for not dropping final -e now looks after vc at the
   beginning of a word, so are, eve, ice, ore, use keep final -e. In this
   test c is any consonant, including w, x and y. This extension was
   suggested by Chris Emerson.

   -fully    -> -ful   treated like  -fulness -> -ful, and
   -tionally -> -tion  treated like  -tional  -> -tion

   both in Step 2. These were suggested by Hiranmay Ghosh, of New Delhi.

   Invariants proceed, succeed, exceed. Also suggested by Hiranmay Ghosh.

Additional modifications were made to incorperate this module into
nltk.  All such modifications are marked with "--NLTK--".  The nltk
version of this module is maintained by the NLTK developers, and is
available from <http://nltk.sourceforge.net>

Classes [hide private]
  PorterStemmer
A word stemmer based on the Porter stemming algorithm.
Functions [hide private]
 
demo()
A demonstration of the porter stemmer on a sample from the Penn Treebank corpus.
source code