nltk.chunk package¶
Submodules¶
nltk.chunk.api module¶
-
class
nltk.chunk.api.
ChunkParserI
[source]¶ Bases:
nltk.parse.api.ParserI
A processing interface for identifying non-overlapping groups in unrestricted text. Typically, chunk parsers are used to find base syntactic constituents, such as base noun phrases. Unlike
ParserI
, ChunkParserI
guarantees that the parse()
method will always generate a parse.
-
evaluate
(gold)[source]¶ Score the accuracy of the chunker against the gold standard. Remove the chunking from the gold standard text, rechunk it using the chunker, and return a
ChunkScore
object reflecting the performance of this chunk parser. Parameters: gold (list(Tree)) – The list of chunked sentences to score the chunker on. Return type: ChunkScore
-
nltk.chunk.named_entity module¶
Named entity chunker
-
class
nltk.chunk.named_entity.
NEChunkParser
(train)[source]¶ Bases:
nltk.chunk.api.ChunkParserI
Expected input: list of pos-tagged words
-
class
nltk.chunk.named_entity.
NEChunkParserTagger
(train)[source]¶ Bases:
nltk.tag.sequential.ClassifierBasedTagger
The IOB tagger used by the chunk parser.
nltk.chunk.regexp module¶
-
class
nltk.chunk.regexp.
ChinkRule
(tag_pattern, descr)[source]¶ Bases:
nltk.chunk.regexp.RegexpChunkRule
A rule specifying how to remove chinks from a
ChunkString
, using a matching tag pattern. When applied to a ChunkString
, it will find any substring that matches this tag pattern and that is contained in a chunk, and remove it from that chunk, splitting the chunk in two when the match falls in its interior.
-
class
nltk.chunk.regexp.
ChunkRule
(tag_pattern, descr)[source]¶ Bases:
nltk.chunk.regexp.RegexpChunkRule
A rule specifying how to add chunks to a
ChunkString
, using a matching tag pattern. When applied to a ChunkString
, it will find any substring that matches this tag pattern and that is not already part of a chunk, and create a new chunk containing that substring.
-
class
nltk.chunk.regexp.
ChunkRuleWithContext
(left_context_tag_pattern, chunk_tag_pattern, right_context_tag_pattern, descr)[source]¶ Bases:
nltk.chunk.regexp.RegexpChunkRule
A rule specifying how to add chunks to a
ChunkString
, using three matching tag patterns: one for the left context, one for the chunk, and one for the right context. When applied to a ChunkString
, it will find any substring that matches the chunk tag pattern, is surrounded by substrings that match the two context patterns, and is not already part of a chunk; it then creates a new chunk containing the substring that matched the chunk tag pattern. Caveat: Both the left and right context are consumed when this rule matches; therefore, if you need to find overlapping matches, you will need to apply your rule more than once.
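The caveat about consumed context can be reproduced with plain re.sub. The sketch below is not NLTK's implementation: it hand-writes a regular expression for a hypothetical rule that chunks <NN> between two <JJ> tags, and its trailing lookahead is modeled on the IN_CHINK_PATTERN constant listed under ChunkString.

```python
import re

# Hypothetical rule: chunk <NN> when it sits between two <JJ> tags.
# The trailing lookahead requires the match to end outside any chunk.
RULE = re.compile(r"(<JJ>)(<NN>)(<JJ>)(?=[^}]*(\{|$))")

def apply_once(encoded):
    """Add braces around the middle group of every non-overlapping match."""
    return RULE.sub(r"\1{\2}\3", encoded)

encoded = "<JJ><NN><JJ><NN><JJ>"
first_pass = apply_once(encoded)       # the shared middle <JJ> is consumed
second_pass = apply_once(first_pass)   # a second pass picks up the other <NN>
print(first_pass)
print(second_pass)
```

The first pass chunks only the first <NN>, because the middle <JJ> was consumed as its right context; the second pass finds the remaining match.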
-
class
nltk.chunk.regexp.
ChunkString
(chunk_struct, debug_level=1)[source]¶ Bases:
object
A string-based encoding of a particular chunking of a text. Internally, the
ChunkString
class uses a single string to encode the chunking of the input text. This string contains a sequence of angle-bracket delimited tags, with chunking indicated by braces. An example of this encoding is:{<DT><JJ><NN>}<VBN><IN>{<DT><NN>}<.>{<DT><NN>}<VBD><.>
ChunkString
objects are created from tagged texts (i.e., lists of tokens
whose type is TaggedType
). Initially, nothing is chunked. The chunking of a
ChunkString
can be modified with the xform()
method, which uses a regular expression to transform the string representation. These transformations should only add and remove braces; they should not modify the sequence of angle-bracket delimited tags. Variables: - _str –
The internal string representation of the text’s encoding. This string representation contains a sequence of angle-bracket delimited tags, with chunking indicated by braces. An example of this encoding is:
{<DT><JJ><NN>}<VBN><IN>{<DT><NN>}<.>{<DT><NN>}<VBD><.>
- _pieces – The tagged tokens and chunks encoded by this
ChunkString
. - _debug – The debug level. See the constructor docs.
- IN_CHUNK_PATTERN – A zero-width regexp pattern string that will only match positions that are in chunks.
- IN_CHINK_PATTERN – A zero-width regexp pattern string that will only match positions that are in chinks.
-
CHUNK_TAG
= '(<[^\\{\\}<>]+?>)'¶
-
CHUNK_TAG_CHAR
= '[^\\{\\}<>]'¶
-
IN_CHINK_PATTERN
= '(?=[^\\}]*(\\{|$))'¶
-
IN_CHUNK_PATTERN
= '(?=[^\\{]*\\})'¶
-
to_chunkstruct
(chunk_label='CHUNK')[source]¶ Return the chunk structure encoded by this
ChunkString
. Return type: Tree Raises: ValueError – If a transformation has generated an invalid chunkstring.
-
unicode_repr
()¶ Return a string representation of this
ChunkString
. It has the form: <ChunkString: '{<DT><JJ><NN>}<VBN><IN>{<DT><NN>}'>
Return type: str
-
xform
(regexp, repl)[source]¶ Apply the given transformation to the string encoding of this
ChunkString
. In particular, find all occurrences that match regexp
, and replace them using repl
(as done by re.sub
). This transformation should only add and remove braces; it should not modify the sequence of angle-bracket delimited tags. Furthermore, this transformation may not result in improper bracketing. Note, in particular, that bracketing may not be nested.
Parameters: - regexp (str or regexp) – A regular expression matching the substring
that should be replaced. This will typically include a
named group, which can be used by
repl
. - repl (str) – An expression specifying what should replace the
matched substring. Typically, this will include a named
replacement group, specified by
regexp
.
Return type: None
Raises: ValueError – If this transformation generated an invalid chunkstring.
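The string encoding and the constraints on transformations can be illustrated with the standard re module alone. This is a minimal sketch, not NLTK's code: encode and transform are hypothetical helper names, and the validity checks are one plausible reading of "only add and remove braces" and "bracketing may not be nested".

```python
import re

def encode(tags):
    """Encode an unchunked tag sequence as one <TAG> per token."""
    return "".join(f"<{tag}>" for tag in tags)

def transform(encoded, regexp, repl):
    """Apply a brace-editing substitution and reject invalid results."""
    result = re.sub(regexp, repl, encoded)
    # The tag sequence itself must be unchanged.
    if re.sub(r"[{}]", "", result) != re.sub(r"[{}]", "", encoded):
        raise ValueError("transformation modified the tag sequence")
    # Braces must be balanced and never nested.
    if not re.fullmatch(r"(?:[^{}]*\{[^{}]+\})*[^{}]*", result):
        raise ValueError("transformation produced improper bracketing")
    return result

encoded = encode(["DT", "JJ", "NN", "VBN", "IN", "DT", "NN", "."])
# A named group in the pattern is reused by the replacement expression.
chunked = transform(encoded, r"(?P<chunk>(?:<DT>|<JJ>|<NN>)+)", r"{\g<chunk>}")
print(chunked)
```

Applying a second substitution that wraps a tag already inside a chunk would nest braces, so the sketch raises ValueError in that case.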
-
class
nltk.chunk.regexp.
ExpandLeftRule
(left_tag_pattern, right_tag_pattern, descr)[source]¶ Bases:
nltk.chunk.regexp.RegexpChunkRule
A rule specifying how to expand chunks in a
ChunkString
to the left, using two matching tag patterns: a left pattern, and a right pattern. When applied to a ChunkString
, it will find any chunk whose beginning matches the right pattern and that is immediately preceded by a chink whose end matches the left pattern. It will then expand the chunk to incorporate the new material on the left.
-
class
nltk.chunk.regexp.
ExpandRightRule
(left_tag_pattern, right_tag_pattern, descr)[source]¶ Bases:
nltk.chunk.regexp.RegexpChunkRule
A rule specifying how to expand chunks in a
ChunkString
to the right, using two matching tag patterns: a left pattern, and a right pattern. When applied to a ChunkString
, it will find any chunk whose end matches the left pattern and that is immediately followed by a chink whose beginning matches the right pattern. It will then expand the chunk to incorporate the new material on the right.
-
class
nltk.chunk.regexp.
MergeRule
(left_tag_pattern, right_tag_pattern, descr)[source]¶ Bases:
nltk.chunk.regexp.RegexpChunkRule
A rule specifying how to merge chunks in a
ChunkString
, using two matching tag patterns: a left pattern, and a right pattern. When applied to a ChunkString
, it will find any chunk whose end matches the left pattern and that is immediately followed by a chunk whose beginning matches the right pattern. It will then merge those two chunks into a single chunk.
-
class
nltk.chunk.regexp.
RegexpChunkParser
(rules, chunk_label='NP', root_label='S', trace=0)[source]¶ Bases:
nltk.chunk.api.ChunkParserI
A regular expression based chunk parser.
RegexpChunkParser
uses a sequence of “rules” to find chunks of a single type within a text. The chunking of the text is encoded using a ChunkString
, and each rule acts by modifying the chunking in the ChunkString
. The rules are all implemented using regular expression matching and substitution. The
RegexpChunkRule
class and its subclasses (ChunkRule
, ChinkRule
, UnChunkRule
, MergeRule
, and SplitRule
) define the rules that are used by RegexpChunkParser
. Each rule defines an apply()
method, which modifies the chunking encoded by a given ChunkString
. Variables: - _rules – The list of rules that should be applied to a text.
- _trace – The default level of tracing.
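Applying rules in sequence to one encoded string can be sketched with the standard re module. The code below is an illustration under stated assumptions, not NLTK's implementation: chunk_everything and chink are hypothetical stand-ins for a chunk rule and a chink rule, and IN_CHUNK copies the idea of the IN_CHUNK_PATTERN lookahead.

```python
import re

TAG = r"<[^{}<>]+>"          # one angle-bracketed tag
IN_CHUNK = r"(?=[^{]*\})"    # lookahead: the next brace is a closing one

def chunk_everything(encoded):
    """Rule 1: wrap every run of tags in a chunk (assumes unchunked input)."""
    return re.sub(rf"((?:{TAG})+)", r"{\1}", encoded)

def chink(encoded, tags_regex):
    """Rule 2: cut any match that lies inside a chunk out of that chunk."""
    pattern = "((?:" + tags_regex + ")+)" + IN_CHUNK
    result = re.sub(pattern, r"}\1{", encoded)
    return result.replace("{}", "")  # drop chunks left empty by the cut

rules = [chunk_everything, lambda s: chink(s, r"<VBD>|<IN>")]
encoded = "<DT><JJ><NN><VBD><IN><DT><NN>"
for rule in rules:
    encoded = rule(encoded)  # each rule rewrites the current chunking
print(encoded)
```

The order matters: the chink rule only has something to cut because the chunk rule ran first.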
-
parse
(chunk_struct, trace=None)[source]¶ Parameters: - chunk_struct (Tree) – the chunk structure to be (further) chunked
- trace (int) – The level of tracing that should be used when
parsing a text.
0
will generate no tracing output; 1
will generate normal tracing output; and 2
or higher will generate verbose tracing output. This value overrides the trace level value that was given to the constructor.
Return type: Returns: a chunk structure that encodes the chunks in a given tagged sentence. A chunk is a non-overlapping linguistic group, such as a noun phrase. The set of chunks identified in the chunk structure depends on the rules used to define this
RegexpChunkParser
.
-
rules
()[source]¶ Returns: the sequence of rules used by RegexpChunkParser
. Return type: list(RegexpChunkRule)
-
class
nltk.chunk.regexp.
RegexpChunkRule
(regexp, repl, descr)[source]¶ Bases:
object
A rule specifying how to modify the chunking in a
ChunkString
, using a transformational regular expression. The RegexpChunkRule
class itself can be used to implement any transformational rule based on regular expressions. There are also a number of subclasses, which can be used to implement simpler types of rules, based on matching regular expressions. Each
RegexpChunkRule
has a regular expression and a replacement expression. When a RegexpChunkRule
is “applied” to a ChunkString
, it searches the ChunkString
for any substring that matches the regular expression, and replaces it using the replacement expression. This search/replace operation has the same semantics as re.sub
. Each
RegexpChunkRule
also has a description string, which gives a short (typically less than 75 characters) description of the purpose of the rule. The transformation defined by this
RegexpChunkRule
should only add and remove braces; it should not modify the sequence of angle-bracket delimited tags. Furthermore, this transformation may not result in nested or mismatched bracketing.
-
apply
(chunkstr)[source]¶ Apply this rule to the given
ChunkString
. See the class reference documentation for a description of what it means to apply a rule. Parameters: chunkstr (ChunkString) – The chunkstring to which this rule is applied. Return type: None Raises: ValueError – If this transformation generated an invalid chunkstring.
-
descr
()[source]¶ Return a short description of the purpose and/or effect of this rule.
Return type: str
-
static
fromstring
(s)[source]¶ Create a RegexpChunkRule from a string description. Currently, the following formats are supported:
{regexp}         # chunk rule
}regexp{         # chink rule
regexp}{regexp   # split rule
regexp{}regexp   # merge rule
Where
regexp
is a regular expression for the rule. Any text following the comment marker (#
) will be used as the rule’s description:
>>> from nltk.chunk.regexp import RegexpChunkRule
>>> RegexpChunkRule.fromstring('{<DT>?<NN.*>+}')
<ChunkRule: '<DT>?<NN.*>+'>
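The four string formats can be told apart by where the braces sit. The following is a rough sketch of that dispatch, not the library's parser: classify_rule is a hypothetical helper, it returns plain tuples rather than rule objects, and it assumes the pattern itself contains no "#" character.

```python
def classify_rule(s):
    """Return (kind, patterns, description) for one rule string."""
    rule, _, comment = s.partition("#")   # text after '#' is the description
    rule, comment = rule.strip(), comment.strip()
    if rule.startswith("{") and rule.endswith("}"):
        return ("chunk", (rule[1:-1],), comment)
    if rule.startswith("}") and rule.endswith("{"):
        return ("chink", (rule[1:-1],), comment)
    if "}{" in rule:
        left, right = rule.split("}{", 1)
        return ("split", (left, right), comment)
    if "{}" in rule:
        left, right = rule.split("{}", 1)
        return ("merge", (left, right), comment)
    raise ValueError(f"unrecognized rule format: {s!r}")

print(classify_rule("{<DT>?<NN.*>+} # noun phrase"))
print(classify_rule("<.*>}{<DT> # split at a determiner"))
```

Checking the chunk and chink forms before the split and merge forms keeps the tests unambiguous, since only the latter two have braces in the middle of the string.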
-
-
class
nltk.chunk.regexp.
RegexpParser
(grammar, root_label='S', loop=1, trace=0)[source]¶ Bases:
nltk.chunk.api.ChunkParserI
A grammar based chunk parser.
chunk.RegexpParser
uses a set of regular expression patterns to specify the behavior of the parser. The chunking of the text is encoded using a ChunkString
, and each rule acts by modifying the chunking in the ChunkString
. The rules are all implemented using regular expression matching and substitution. A grammar contains one or more clauses in the following form:
NP:
  {<DT|JJ>}          # chunk determiners and adjectives
  }<[\.VI].*>+{      # chink any tag beginning with V, I, or .
  <.*>}{<DT>         # split a chunk at a determiner
  <DT|JJ>{}<NN.*>    # merge chunk ending with det/adj
                     # with one starting with a noun
The patterns of a clause are executed in order. An earlier pattern may introduce a chunk boundary that prevents a later pattern from executing. Sometimes an individual pattern will match on multiple, overlapping extents of the input. As with regular expression substitution more generally, the chunker will identify the first match possible, then continue looking for matches after this one has ended.
The clauses of a grammar are also executed in order. A cascaded chunk parser is one having more than one clause. The maximum depth of a parse tree created by this chunk parser is the same as the number of clauses in the grammar.
When tracing is turned on, the comment portion of a line is displayed each time the corresponding pattern is applied.
Variables: - _start – The start symbol of the grammar (the root node of resulting trees)
- _stages – The list of parsing stages corresponding to the grammar
-
parse
(chunk_struct, trace=None)[source]¶ Apply the chunk parser to this input.
Parameters: - chunk_struct (Tree) – the chunk structure to be (further) chunked (this tree is modified, and is also returned)
- trace (int) – The level of tracing that should be used when
parsing a text.
0
will generate no tracing output; 1
will generate normal tracing output; and 2
or higher will generate verbose tracing output. This value overrides the trace level value that was given to the constructor.
Returns: the chunked output.
Return type:
-
class
nltk.chunk.regexp.
SplitRule
(left_tag_pattern, right_tag_pattern, descr)[source]¶ Bases:
nltk.chunk.regexp.RegexpChunkRule
A rule specifying how to split chunks in a
ChunkString
, using two matching tag patterns: a left pattern, and a right pattern. When applied to a ChunkString
, it will find any chunk that matches the left pattern followed by the right pattern. It will then split the chunk into two new chunks, at the point between the two pattern matches.
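Splitting and merging are inverse brace edits, which a short re.sub sketch makes concrete. The helpers below are hypothetical and simplified: they take ready-made regular expressions instead of tag patterns, and split_chunks assumes its input is already fully chunked, so it does not check that a match lies inside a chunk.

```python
import re

def split_chunks(encoded, left, right):
    """Insert a chunk boundary wherever `left` is directly followed by `right`."""
    return re.sub(f"({left})(?={right})", r"\1}{", encoded)

def merge_chunks(encoded, left, right):
    """Remove the boundary between a chunk ending in `left` and one starting with `right`."""
    return re.sub(f"({left})" + r"\}\{" + f"(?={right})", r"\1", encoded)

chunked = "{<DT><NN><DT><NN>}"
split = split_chunks(chunked, "<NN>", "<DT>")
merged = merge_chunks(split, "<NN>", "<DT>")
print(split)
print(merged)
```

Using a lookahead for the right pattern leaves it unconsumed, so only the text up to the boundary is rewritten.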
-
class
nltk.chunk.regexp.
UnChunkRule
(tag_pattern, descr)[source]¶ Bases:
nltk.chunk.regexp.RegexpChunkRule
A rule specifying how to remove chunks from a
ChunkString
, using a matching tag pattern. When applied to a ChunkString
, it will find any complete chunk that matches this tag pattern, and un-chunk it.
-
nltk.chunk.regexp.
demo
()[source]¶ A demonstration for the
RegexpChunkParser
class. A single text is parsed with four different chunk parsers, using a variety of rules and strategies.
-
nltk.chunk.regexp.
demo_eval
(chunkparser, text)[source]¶ Demonstration code for evaluating a chunk parser, using a
ChunkScore
. This function assumes that text
contains one sentence per line, and that each sentence has the form expected by tree.chunk
. It runs the given chunk parser on each sentence in the text, and scores the result. It prints the final score (precision, recall, and f-measure); and reports the set of chunks that were missed and the set of chunks that were incorrect. (At most 10 missing chunks and 10 incorrect chunks are reported). Parameters: - chunkparser (ChunkParserI) – The chunkparser to be tested
- text (str) – The chunked tagged text that should be used for evaluation.
-
nltk.chunk.regexp.
tag_pattern2re_pattern
(tag_pattern)[source]¶ Convert a tag pattern to a regular expression pattern. A “tag pattern” is a modified version of a regular expression, designed for matching sequences of tags. The differences between regular expression patterns and tag patterns are:
- In tag patterns, '<' and '>' act as parentheses; so '<NN>+' matches one or more repetitions of '<NN>', not '<NN' followed by one or more repetitions of '>'.
- Whitespace in tag patterns is ignored. So '<DT> | <NN>' is equivalent to '<DT>|<NN>'.
- In tag patterns, '.' is equivalent to '[^{}<>]'; so '<NN.*>' matches any single tag starting with 'NN'.
In particular,
tag_pattern2re_pattern
performs the following transformations on the given pattern:
- Replace ‘.’ with ‘[^<>{}]’
- Remove any whitespace
- Add extra parens around ‘<’ and ‘>’, to make ‘<’ and ‘>’ act like parentheses. E.g., so that in ‘<NN>+’, the ‘+’ has scope over the entire ‘<NN>’; and so that in ‘<NN|IN>’, the ‘|’ has scope over ‘NN’ and ‘IN’, but not ‘<’ or ‘>’.
- Check to make sure the resulting pattern is valid.
Parameters: tag_pattern (str) – The tag pattern to convert to a regular expression pattern. Raises: ValueError – If tag_pattern
is not a valid tag pattern. In particular, tag_pattern
should not include braces; and it should not contain nested or mismatched angle-brackets. Return type: str Returns: A regular expression pattern corresponding to tag_pattern
.
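The listed transformations can be approximated in a few lines. This is a simplified sketch and not the library function: sketch_tag_pattern_to_regex is a hypothetical name, escaped dots such as "\." are not handled, and validity checking is reduced to compiling the result.

```python
import re

TAG_CHAR = r"[^{}<>]"  # what '.' stands for inside a tag pattern

def sketch_tag_pattern_to_regex(tag_pattern):
    """Approximate the conversion steps described for tag patterns."""
    pattern = re.sub(r"\s", "", tag_pattern)                     # drop whitespace
    pattern = pattern.replace("<", "(<(").replace(">", ")>)")    # '<' and '>' act as parens
    pattern = pattern.replace(".", TAG_CHAR)                     # '.' matches one tag character
    re.compile(pattern)                                          # raises re.error if invalid
    return pattern

noun_tag = sketch_tag_pattern_to_regex("<NN.*>")
print(noun_tag)
print(bool(re.fullmatch(noun_tag, "<NNS>")))
print(bool(re.fullmatch(sketch_tag_pattern_to_regex("<NN>+"), "<NN><NN>")))
```

The angle brackets are expanded before the dot is replaced, because the replacement character class itself contains '<' and '>'.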
nltk.chunk.util module¶
-
class
nltk.chunk.util.
ChunkScore
(**kwargs)[source]¶ Bases:
object
A utility class for scoring chunk parsers.
ChunkScore
can evaluate a chunk parser’s output, based on a number of statistics (precision, recall, f-measure, missed chunks, incorrect chunks). It can also combine the scores from the parsing of multiple texts; this makes it significantly easier to evaluate a chunk parser that operates one sentence at a time. Texts are evaluated with the
score
method. The results of evaluation can be accessed via a number of accessor methods, such as precision
and f_measure
. A typical use of the ChunkScore
class is:
>>> chunkscore = ChunkScore()
>>> for correct in correct_sentences:
...     guess = chunkparser.parse(correct.leaves())
...     chunkscore.score(correct, guess)
>>> print('F Measure:', chunkscore.f_measure())
F Measure: 0.823
Variables: - kwargs –
Keyword arguments:
- max_tp_examples: The maximum number of actual examples of true positives to record. This affects the correct member function: correct will not return more than this number of true positive examples. This does not affect any of the numerical metrics (precision, recall, or f-measure).
- max_fp_examples: The maximum number of actual examples of false positives to record. This affects the incorrect member function and the guessed member function: incorrect will not return more than this number of examples, and guessed will not return more than this number of false positive examples. This does not affect any of the numerical metrics (precision, recall, or f-measure).
- max_fn_examples: The maximum number of actual examples of false negatives to record. This affects the missed member function and the correct member function: missed will not return more than this number of examples, and correct will not return more than this number of false negative examples. This does not affect any of the numerical metrics (precision, recall, or f-measure).
- chunk_label: A regular expression indicating which chunks should be compared. Defaults to '.*' (i.e., all chunks).
- _tp – List of true positives
- _fp – List of false positives
- _fn – List of false negatives
- _tp_num – Number of true positives
- _fp_num – Number of false positives
- _fn_num – Number of false negatives.
-
accuracy
()[source]¶ Return the overall tag-based accuracy for all texts that have been scored by this
ChunkScore
, using the IOB (conll2000) tag encoding. Return type: float
-
correct
()[source]¶ Return the chunks which were included in the correct chunk structures, listed in input order.
Return type: list of chunks
-
f_measure
(alpha=0.5)[source]¶ Return the overall F measure for all texts that have been scored by this
ChunkScore
. Parameters: alpha (float) – the relative weighting of precision and recall. Larger alpha biases the score towards the precision value, while smaller alpha biases the score towards the recall value. alpha
should have a value in the range [0,1]. Return type: float
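One common way to realize this weighting is the weighted harmonic mean of precision and recall. The sketch below assumes that definition; it is consistent with the behavior described for alpha, but the function name and the handling of zero values are choices made for this illustration.

```python
def weighted_f_measure(precision, recall, alpha=0.5):
    """Weighted harmonic mean; alpha=1 gives precision, alpha=0 gives recall."""
    if precision == 0 or recall == 0:
        return 0.0  # avoid dividing by zero when either score is zero
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)

balanced = weighted_f_measure(0.8, 0.5)            # same as 2PR / (P + R)
precision_heavy = weighted_f_measure(0.8, 0.5, alpha=0.9)
recall_heavy = weighted_f_measure(0.8, 0.5, alpha=0.1)
print(balanced, precision_heavy, recall_heavy)
```

With precision 0.8 and recall 0.5, a larger alpha pulls the score up toward 0.8 and a smaller alpha pulls it down toward 0.5.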
-
guessed
()[source]¶ Return the chunks which were included in the guessed chunk structures, listed in input order.
Return type: list of chunks
-
incorrect
()[source]¶ Return the chunks which were included in the guessed chunk structures, but not in the correct chunk structures, listed in input order.
Return type: list of chunks
-
missed
()[source]¶ Return the chunks which were included in the correct chunk structures, but not in the guessed chunk structures, listed in input order.
Return type: list of chunks
-
precision
()[source]¶ Return the overall precision for all texts that have been scored by this
ChunkScore
.Return type: float
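The relationship between the correct, guessed, incorrect, and missed chunk lists and the numerical scores can be sketched with sets. This is only an illustration: chunks are represented here as hypothetical (start, end, label) tuples, which is an assumption of the example rather than the class's internal representation.

```python
# Chunks as (start, end, label) spans over the token positions of one sentence.
correct = {(0, 1, "NP"), (2, 5, "NP"), (6, 8, "NP")}
guessed = {(0, 1, "NP"), (2, 4, "NP"), (6, 8, "NP")}

true_positives = correct & guessed   # chunks found with the right extent
incorrect = guessed - correct        # guessed but not in the gold standard
missed = correct - guessed           # in the gold standard but not guessed

precision = len(true_positives) / len(guessed)
recall = len(true_positives) / len(correct)
print(sorted(incorrect), sorted(missed), precision, recall)
```

A chunk with the wrong boundaries counts twice here: once as incorrect (the guessed span) and once as missed (the gold span).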
-
nltk.chunk.util.
accuracy
(chunker, gold)[source]¶ Score the accuracy of the chunker against the gold standard. Strip the chunk information from the gold standard and rechunk it using the chunker, then compute the accuracy score.
Parameters: - chunker (ChunkParserI) – The chunker being evaluated.
- gold (tree) – The chunk structures to score the chunker on.
Return type: float
-
nltk.chunk.util.
conllstr2tree
(s, chunk_types=('NP', 'PP', 'VP'), root_label='S')[source]¶ Return a chunk structure for a single sentence encoded in the given CONLL 2000 style string. This function converts a CoNLL IOB string into a tree. It uses the specified chunk types (defaults to NP, PP and VP), and creates a tree rooted at a node labeled S (by default).
Parameters: Return type:
Convert the CoNLL IOB format to a tree.
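The IOB-to-chunk conversion can be sketched without the library. The code below is a simplified illustration, not the function's implementation: iob_to_chunks is a hypothetical name, each input line is assumed to hold exactly three whitespace-separated fields (word, tag, IOB tag), and the result is a nested list rather than a Tree.

```python
def iob_to_chunks(s, chunk_types=("NP", "PP", "VP")):
    """Group 'word tag IOB' lines into (label, [(word, tag), ...]) chunks."""
    result = []
    current = None  # the chunk being extended, if any
    for line in s.strip().splitlines():
        word, tag, iob = line.split()
        state, _, chunk_type = iob.partition("-")
        if chunk_type not in chunk_types:
            state = "O"  # ignore chunk types that were not requested
        if state == "O":
            current = None
            result.append((word, tag))
        elif state == "B" or current is None or current[0] != chunk_type:
            # B- always opens a chunk; a stray or mismatched I- does too.
            current = (chunk_type, [(word, tag)])
            result.append(current)
        else:
            current[1].append((word, tag))
    return result

sentence = """he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
. . O"""
chunks = iob_to_chunks(sentence)
print(chunks)
```

Tokens tagged O, or belonging to a chunk type outside chunk_types, stay at the top level as plain (word, tag) pairs.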
-
nltk.chunk.util.
ieerstr2tree
(s, chunk_types=['LOCATION', 'ORGANIZATION', 'PERSON', 'DURATION', 'DATE', 'CARDINAL', 'PERCENT', 'MONEY', 'MEASURE'], root_label='S')[source]¶ Return a chunk structure containing the chunked tagged text that is encoded in the given IEER style string. Convert a string of chunked tagged text in the IEER named entity format into a chunk structure. Chunks are of several types, LOCATION, ORGANIZATION, PERSON, DURATION, DATE, CARDINAL, PERCENT, MONEY, and MEASURE.
Return type: Tree
Divide a string of bracketed tagged text into chunks and unchunked tokens, and produce a Tree. Chunks are marked by square brackets (
[...]
). Words are delimited by whitespace, and each word should have the form text/tag
. Words that do not contain a slash are assigned a tag
of None. Parameters: Return type:
Module contents¶
Classes and interfaces for identifying non-overlapping linguistic groups (such as base noun phrases) in unrestricted text. This task is called “chunk parsing” or “chunking”, and the identified groups are called “chunks”. The chunked text is represented using a shallow tree called a “chunk structure.” A chunk structure is a tree containing tokens and chunks, where each chunk is a subtree containing only tokens. For example, the chunk structure for base noun phrase chunks in the sentence “I saw the big dog on the hill” is:
(SENTENCE:
(NP: <I>)
<saw>
(NP: <the> <big> <dog>)
<on>
(NP: <the> <hill>))
To convert a chunk structure back to a list of tokens, simply use the
chunk structure’s leaves()
method.
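The shape of a chunk structure, and what flattening it back to tokens means, can be shown with nested tuples. This sketch deliberately avoids the Tree class: the tuple layout and the leaves function are stand-ins invented for the example.

```python
# A two-level structure: the root holds tokens and chunks; chunks hold only tokens.
chunk_structure = ("SENTENCE", [
    ("NP", ["I"]),
    "saw",
    ("NP", ["the", "big", "dog"]),
    "on",
    ("NP", ["the", "hill"]),
])

def leaves(node):
    """Return the tokens of a node in left-to-right order."""
    if isinstance(node, str):
        return [node]
    _label, children = node
    return [leaf for child in children for leaf in leaves(child)]

tokens = leaves(chunk_structure)
print(tokens)
```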
This module defines ChunkParserI
, a standard interface for
chunking texts; and RegexpChunkParser
, a regular-expression based
implementation of that interface. It also defines ChunkScore
, a
utility class for scoring chunk parsers.
RegexpChunkParser¶
RegexpChunkParser
is an implementation of the chunk parser interface
that uses regular-expressions over tags to chunk a text. Its
parse()
method first constructs a ChunkString
, which encodes a
particular chunking of the input text. Initially, nothing is
chunked. parse.RegexpChunkParser
then applies a sequence of
RegexpChunkRule
rules to the ChunkString
, each of which modifies
the chunking that it encodes. Finally, the ChunkString
is
transformed back into a chunk structure, which is returned.
RegexpChunkParser
can only be used to chunk a single kind of phrase.
For example, you can use a RegexpChunkParser
to chunk the noun
phrases in a text, or the verb phrases in a text; but you cannot
use it to simultaneously chunk both noun phrases and verb phrases in
the same text. (This is a limitation of RegexpChunkParser
, not of
chunk parsers in general.)
RegexpChunkRules¶
A RegexpChunkRule
is a transformational rule that updates the
chunking of a text by modifying its ChunkString
. Each
RegexpChunkRule
defines the apply()
method, which modifies
the chunking encoded by a ChunkString
. The
RegexpChunkRule
class itself can be used to implement any
transformational rule based on regular expressions. There are
also a number of subclasses, which can be used to implement
simpler types of rules:
- ChunkRule chunks anything that matches a given regular expression.
- ChinkRule chinks anything that matches a given regular expression.
- UnChunkRule will un-chunk any chunk that matches a given regular expression.
- MergeRule can be used to merge two contiguous chunks.
- SplitRule can be used to split a single chunk into two smaller chunks.
- ExpandLeftRule will expand a chunk to incorporate new unchunked material on the left.
- ExpandRightRule will expand a chunk to incorporate new unchunked material on the right.
Tag Patterns¶
A RegexpChunkRule
uses a modified version of regular
expression patterns, called “tag patterns”. Tag patterns are
used to match sequences of tags. Examples of tag patterns are:
r'(<DT>|<JJ>|<NN>)+'
r'<NN>+'
r'<NN.*>'
The differences between regular expression patterns and tag patterns are:
- In tag patterns, '<' and '>' act as parentheses; so '<NN>+' matches one or more repetitions of '<NN>', not '<NN' followed by one or more repetitions of '>'.
- Whitespace in tag patterns is ignored. So '<DT> | <NN>' is equivalent to '<DT>|<NN>'.
- In tag patterns, '.' is equivalent to '[^{}<>]'; so '<NN.*>' matches any single tag starting with 'NN'.
The function tag_pattern2re_pattern
can be used to transform
a tag pattern to an equivalent regular expression pattern.
Efficiency¶
Preliminary tests indicate that RegexpChunkParser
can chunk at a
rate of about 300 tokens/second, with a moderately complex rule set.
There may be problems if RegexpChunkParser
is used with more than
5,000 tokens at a time. In particular, evaluation of some regular
expressions may cause the Python regular expression engine to
exceed its maximum recursion depth. We have attempted to minimize
these problems, but it is impossible to avoid them completely. We
therefore recommend that you apply the chunk parser to a single
sentence at a time.
Emacs Tip¶
If you evaluate the following elisp expression in emacs, it will
colorize a ChunkString
when you use an interactive python shell
with emacs or xemacs (“C-c !”):
(let ()
(defconst comint-mode-font-lock-keywords
'(("<[^>]+>" 0 'font-lock-reference-face)
("[{}]" 0 'font-lock-function-name-face)))
(add-hook 'comint-mode-hook (lambda () (turn-on-font-lock))))
You can evaluate this code by copying it to a temporary buffer,
placing the cursor after the last close parenthesis, and typing
“C-x C-e
”. You should evaluate it before running the interactive
session. The change will last until you close emacs.
Unresolved Issues¶
If we use the re
module for regular expressions, Python’s
regular expression engine generates “maximum recursion depth
exceeded” errors when processing very large texts, even for
regular expressions that should not require any recursion. We
therefore use the pre
module instead. But note that pre
does not include Unicode support, so this module will not work
with unicode strings. Note also that pre
regular expressions
are not quite as advanced as re
ones (e.g., no leftward
zero-length assertions).
Variables: - CHUNK_TAG_PATTERN (regexp) – A regular expression to test whether a tag pattern is valid.