Module porter
source code
Porter Stemming Algorithm
This is the Porter stemming algorithm, ported to Python from the
version coded up in ANSI C by the author. It follows the algorithm
presented in
Porter, M. "An algorithm for suffix stripping." Program 14.3 (1980): 130-137.
only differing from it at the points maked --DEPARTURE-- and --NEW--
below.
For a more faithful version of the Porter algorithm, see
http://www.tartarus.org/~martin/PorterStemmer/
Later additions:
June 2000
The 'l' of the 'logi' -> 'log' rule is put with the stem, so that
short stems like 'geo' 'theo' etc work like 'archaeo' 'philo' etc.
This follows a suggestion of Barry Wilkins, reasearch student at
Birmingham.
February 2000
the cvc test for not dropping final -e now looks after vc at the
beginning of a word, so are, eve, ice, ore, use keep final -e. In this
test c is any consonant, including w, x and y. This extension was
suggested by Chris Emerson.
-fully -> -ful treated like -fulness -> -ful, and
-tionally -> -tion treated like -tional -> -tion
both in Step 2. These were suggested by Hiranmay Ghosh, of New Delhi.
Invariants proceed, succeed, exceed. Also suggested by Hiranmay Ghosh.
Additional modifications were made to incorperate this module into
nltk. All such modifications are marked with "--NLTK--". The nltk
version of this module is maintained by the NLTK developers, and is
available from <http://nltk.sourceforge.net>
|
PorterStemmer
A word stemmer based on the Porter stemming algorithm.
|
|
demo()
A demonstration of the porter stemmer on a sample from
the Penn Treebank corpus. |
source code
|
|