Package nltk :: Package chunk
Source Code for Package nltk.chunk

  1  # Natural Language Toolkit: Chunkers 
  2  # 
  3  # Copyright (C) 2001-2008 NLTK Project 
  4  # Author: Steven Bird <[email protected]> 
  5  #         Edward Loper <[email protected]> 
  6  # URL: <http://nltk.org> 
  7  # For license information, see LICENSE.TXT 
  8  # 
  9   
 10  """ 
 11  Classes and interfaces for identifying non-overlapping linguistic 
 12  groups (such as base noun phrases) in unrestricted text.  This task is 
 13  called X{chunk parsing} or X{chunking}, and the identified groups are 
 14  called X{chunks}.  The chunked text is represented using a shallow 
 15  tree called a "chunk structure."  A X{chunk structure} is a tree 
 16  containing tokens and chunks, where each chunk is a subtree containing 
 17  only tokens.  For example, the chunk structure for base noun phrase 
 18  chunks in the sentence "I saw the big dog on the hill" is:: 
 19   
 20    (SENTENCE: 
 21      (NP: <I>) 
 22      <saw> 
 23      (NP: <the> <big> <dog>) 
 24      <on> 
 25      (NP: <the> <hill>)) 
 26   
 27  To convert a chunk structure back to a list of tokens, simply use the 
 28  chunk structure's L{leaves<Tree.leaves>} method. 
 29   
 30  The C{parser.chunk} module defines L{ChunkParserI}, a standard interface for 
 31  chunking texts; and L{RegexpChunkParser}, a regular-expression based 
 32  implementation of that interface. It also defines L{ChunkScore}, a 
 33  utility class for scoring chunk parsers. 
 34   
 35  RegexpChunkParser 
 36  ================= 
 37   
 38  C{parse.RegexpChunkParser} is an implementation of the chunk parser interface 
 39  that uses regular-expressions over tags to chunk a text.  Its 
 40  C{parse} method first constructs a C{ChunkString}, which encodes a 
 41  particular chunking of the input text.  Initially, nothing is 
 42  chunked.  C{parse.RegexpChunkParser} then applies a sequence of 
 43  C{RegexpChunkRule}s to the C{ChunkString}, each of which modifies 
 44  the chunking that it encodes.  Finally, the C{ChunkString} is 
 45  transformed back into a chunk structure, which is returned. 
 46   
 47  C{RegexpChunkParser} can only be used to chunk a single kind of phrase. 
 48  For example, you can use an C{RegexpChunkParser} to chunk the noun 
 49  phrases in a text, or the verb phrases in a text; but you can not 
 50  use it to simultaneously chunk both noun phrases and verb phrases in 
 51  the same text.  (This is a limitation of C{RegexpChunkParser}, not of 
 52  chunk parsers in general.) 
 53   
 54  RegexpChunkRules 
 55  ---------------- 
 56   
 57  C{RegexpChunkRule}s are transformational rules that update the 
 58  chunking of a text by modifying its C{ChunkString}.  Each 
 59  C{RegexpChunkRule} defines the C{apply} method, which modifies 
 60  the chunking encoded by a C{ChunkString}.  The 
 61  L{RegexpChunkRule} class itself can be used to implement any 
 62  transformational rule based on regular expressions.  There are 
 63  also a number of subclasses, which can be used to implement 
 64  simpler types of rules: 
 65   
 66      - L{ChunkRule} chunks anything that matches a given regular 
 67        expression. 
 68      - L{ChinkRule} chinks anything that matches a given regular 
 69        expression. 
 70      - L{UnChunkRule} will un-chunk any chunk that matches a given 
 71        regular expression. 
 72      - L{MergeRule} can be used to merge two contiguous chunks. 
 73      - L{SplitRule} can be used to split a single chunk into two 
 74        smaller chunks. 
 75      - L{ExpandLeftRule} will expand a chunk to incorporate new 
 76        unchunked material on the left. 
 77      - L{ExpandRightRule} will expand a chunk to incorporate new 
 78        unchunked material on the right. 
 79   
 80  Tag Patterns 
 81  ~~~~~~~~~~~~ 
 82   
 83  C{RegexpChunkRule}s use a modified version of regular 
 84  expression patterns, called X{tag patterns}.  Tag patterns are 
 85  used to match sequences of tags.  Examples of tag patterns are:: 
 86   
 87       r'(<DT>|<JJ>|<NN>)+' 
 88       r'<NN>+' 
 89       r'<NN.*>' 
 90   
 91  The differences between regular expression patterns and tag 
 92  patterns are: 
 93   
 94      - In tag patterns, C{'<'} and C{'>'} act as parentheses; so 
 95        C{'<NN>+'} matches one or more repetitions of C{'<NN>'}, not 
 96        C{'<NN'} followed by one or more repetitions of C{'>'}. 
 97      - Whitespace in tag patterns is ignored.  So 
 98        C{'<DT> | <NN>'} is equivalant to C{'<DT>|<NN>'} 
 99      - In tag patterns, C{'.'} is equivalant to C{'[^{}<>]'}; so 
100        C{'<NN.*>'} matches any single tag starting with C{'NN'}. 
101   
102  The function L{tag_pattern2re_pattern} can be used to transform 
103  a tag pattern to an equivalent regular expression pattern. 
104   
105  Efficiency 
106  ---------- 
107   
108  Preliminary tests indicate that C{RegexpChunkParser} can chunk at a 
109  rate of about 300 tokens/second, with a moderately complex rule set. 
110   
111  There may be problems if C{RegexpChunkParser} is used with more than 
112  5,000 tokens at a time.  In particular, evaluation of some regular 
113  expressions may cause the Python regular expression engine to 
114  exceed its maximum recursion depth.  We have attempted to minimize 
115  these problems, but it is impossible to avoid them completely.  We 
116  therefore recommend that you apply the chunk parser to a single 
117  sentence at a time. 
118   
119  Emacs Tip 
120  --------- 
121   
122  If you evaluate the following elisp expression in emacs, it will 
123  colorize C{ChunkString}s when you use an interactive python shell 
124  with emacs or xemacs ("C-c !"):: 
125   
126      (let () 
127        (defconst comint-mode-font-lock-keywords  
128          '(("<[^>]+>" 0 'font-lock-reference-face) 
129            ("[{}]" 0 'font-lock-function-name-face))) 
130        (add-hook 'comint-mode-hook (lambda () (turn-on-font-lock)))) 
131   
132  You can evaluate this code by copying it to a temporary buffer, 
133  placing the cursor after the last close parenthesis, and typing 
134  "C{C-x C-e}".  You should evaluate it before running the interactive 
135  session.  The change will last until you close emacs. 
136   
137  Unresolved Issues 
138  ----------------- 
139   
140  If we use the C{re} module for regular expressions, Python's 
141  regular expression engine generates "maximum recursion depth 
142  exceeded" errors when processing very large texts, even for 
143  regular expressions that should not require any recursion.  We 
144  therefore use the C{pre} module instead.  But note that C{pre} 
145  does not include Unicode support, so this module will not work 
146  with unicode strings.  Note also that C{pre} regular expressions 
147  are not quite as advanced as C{re} ones (e.g., no leftward 
148  zero-length assertions). 
149   
150  @type CHUNK_TAG_PATTERN: C{regexp} 
151  @var CHUNK_TAG_PATTERN: A regular expression to test whether a tag 
152       pattern is valid. 
153  """ 
154   
155  from api import * 
156  from util import * 
157  from regexp import * 
158   
159  __all__ = [ 
160      # ChunkParser interface 
161      'ChunkParserI', 
162   
163      # Parsers 
164      'RegexpChunkParser', 'RegexpParser', 
165      ] 
166   
167  ###################################################################### 
168  #{ Deprecated 
169  ###################################################################### 
170  from nltk.internals import Deprecated 
171 -class ChunkParseI(ChunkParserI, Deprecated): 
172      """Use nltk.ChunkParserI instead.""" 
173 -class RegexpChunk(RegexpChunkParser, Deprecated): 
174      """Use nltk.RegexpChunkParser instead.""" 
175 -class Regexp(RegexpParser, Deprecated): 
176      """Use nltk.RegexpParser instead.""" 
177