1
2
3
4
5
6
7
8
9
10 """
11 Classes and interfaces for identifying non-overlapping linguistic
12 groups (such as base noun phrases) in unrestricted text. This task is
13 called X{chunk parsing} or X{chunking}, and the identified groups are
14 called X{chunks}. The chunked text is represented using a shallow
15 tree called a "chunk structure." A X{chunk structure} is a tree
16 containing tokens and chunks, where each chunk is a subtree containing
17 only tokens. For example, the chunk structure for base noun phrase
18 chunks in the sentence "I saw the big dog on the hill" is::
19
20 (SENTENCE:
21 (NP: <I>)
22 <saw>
23 (NP: <the> <big> <dog>)
24 <on>
25 (NP: <the> <hill>))
26
27 To convert a chunk structure back to a list of tokens, simply use the
28 chunk structure's L{leaves<Tree.leaves>} method.
29
30 The C{parser.chunk} module defines L{ChunkParserI}, a standard interface for
31 chunking texts; and L{RegexpChunkParser}, a regular-expression based
32 implementation of that interface. It also defines L{ChunkScore}, a
33 utility class for scoring chunk parsers.
34
35 RegexpChunkParser
36 =================
37
38 C{parse.RegexpChunkParser} is an implementation of the chunk parser interface
39 that uses regular-expressions over tags to chunk a text. Its
40 C{parse} method first constructs a C{ChunkString}, which encodes a
41 particular chunking of the input text. Initially, nothing is
42 chunked. C{parse.RegexpChunkParser} then applies a sequence of
43 C{RegexpChunkRule}s to the C{ChunkString}, each of which modifies
44 the chunking that it encodes. Finally, the C{ChunkString} is
45 transformed back into a chunk structure, which is returned.
46
47 C{RegexpChunkParser} can only be used to chunk a single kind of phrase.
48 For example, you can use an C{RegexpChunkParser} to chunk the noun
49 phrases in a text, or the verb phrases in a text; but you can not
50 use it to simultaneously chunk both noun phrases and verb phrases in
51 the same text. (This is a limitation of C{RegexpChunkParser}, not of
52 chunk parsers in general.)
53
54 RegexpChunkRules
55 ----------------
56
57 C{RegexpChunkRule}s are transformational rules that update the
58 chunking of a text by modifying its C{ChunkString}. Each
59 C{RegexpChunkRule} defines the C{apply} method, which modifies
60 the chunking encoded by a C{ChunkString}. The
61 L{RegexpChunkRule} class itself can be used to implement any
62 transformational rule based on regular expressions. There are
63 also a number of subclasses, which can be used to implement
64 simpler types of rules:
65
66 - L{ChunkRule} chunks anything that matches a given regular
67 expression.
68 - L{ChinkRule} chinks anything that matches a given regular
69 expression.
70 - L{UnChunkRule} will un-chunk any chunk that matches a given
71 regular expression.
72 - L{MergeRule} can be used to merge two contiguous chunks.
73 - L{SplitRule} can be used to split a single chunk into two
74 smaller chunks.
75 - L{ExpandLeftRule} will expand a chunk to incorporate new
76 unchunked material on the left.
77 - L{ExpandRightRule} will expand a chunk to incorporate new
78 unchunked material on the right.
79
80 Tag Patterns
81 ~~~~~~~~~~~~
82
83 C{RegexpChunkRule}s use a modified version of regular
84 expression patterns, called X{tag patterns}. Tag patterns are
85 used to match sequences of tags. Examples of tag patterns are::
86
87 r'(<DT>|<JJ>|<NN>)+'
88 r'<NN>+'
89 r'<NN.*>'
90
91 The differences between regular expression patterns and tag
92 patterns are:
93
94 - In tag patterns, C{'<'} and C{'>'} act as parentheses; so
95 C{'<NN>+'} matches one or more repetitions of C{'<NN>'}, not
96 C{'<NN'} followed by one or more repetitions of C{'>'}.
97 - Whitespace in tag patterns is ignored. So
98 C{'<DT> | <NN>'} is equivalant to C{'<DT>|<NN>'}
99 - In tag patterns, C{'.'} is equivalant to C{'[^{}<>]'}; so
100 C{'<NN.*>'} matches any single tag starting with C{'NN'}.
101
102 The function L{tag_pattern2re_pattern} can be used to transform
103 a tag pattern to an equivalent regular expression pattern.
104
105 Efficiency
106 ----------
107
108 Preliminary tests indicate that C{RegexpChunkParser} can chunk at a
109 rate of about 300 tokens/second, with a moderately complex rule set.
110
111 There may be problems if C{RegexpChunkParser} is used with more than
112 5,000 tokens at a time. In particular, evaluation of some regular
113 expressions may cause the Python regular expression engine to
114 exceed its maximum recursion depth. We have attempted to minimize
115 these problems, but it is impossible to avoid them completely. We
116 therefore recommend that you apply the chunk parser to a single
117 sentence at a time.
118
119 Emacs Tip
120 ---------
121
122 If you evaluate the following elisp expression in emacs, it will
123 colorize C{ChunkString}s when you use an interactive python shell
124 with emacs or xemacs ("C-c !")::
125
126 (let ()
127 (defconst comint-mode-font-lock-keywords
128 '(("<[^>]+>" 0 'font-lock-reference-face)
129 ("[{}]" 0 'font-lock-function-name-face)))
130 (add-hook 'comint-mode-hook (lambda () (turn-on-font-lock))))
131
132 You can evaluate this code by copying it to a temporary buffer,
133 placing the cursor after the last close parenthesis, and typing
134 "C{C-x C-e}". You should evaluate it before running the interactive
135 session. The change will last until you close emacs.
136
137 Unresolved Issues
138 -----------------
139
140 If we use the C{re} module for regular expressions, Python's
141 regular expression engine generates "maximum recursion depth
142 exceeded" errors when processing very large texts, even for
143 regular expressions that should not require any recursion. We
144 therefore use the C{pre} module instead. But note that C{pre}
145 does not include Unicode support, so this module will not work
146 with unicode strings. Note also that C{pre} regular expressions
147 are not quite as advanced as C{re} ones (e.g., no leftward
148 zero-length assertions).
149
150 @type CHUNK_TAG_PATTERN: C{regexp}
151 @var CHUNK_TAG_PATTERN: A regular expression to test whether a tag
152 pattern is valid.
153 """
154
155 from api import *
156 from util import *
157 from regexp import *
158
159 __all__ = [
160
161 'ChunkParserI',
162
163
164 'RegexpChunkParser', 'RegexpParser',
165 ]
166
167
168
169
170 from nltk.internals import Deprecated
172 """Use nltk.ChunkParserI instead."""
174 """Use nltk.RegexpChunkParser instead."""
175 -class Regexp(RegexpParser, Deprecated):
176 """Use nltk.RegexpParser instead."""
177