Package nltk :: Package chunk :: Module regexp :: Class ChunkString

Class ChunkString

object --+
         |
        ChunkString

A string-based encoding of a particular chunking of a text. Internally, the ChunkString class uses a single string to encode the chunking of the input text. This string contains a sequence of angle-bracket delimited tags, with chunking indicated by braces. An example of this encoding is:

   {<DT><JJ><NN>}<VBN><IN>{<DT><NN>}<.>{<DT><NN>}<VBD><.>
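
As a rough illustration of the encoding, the sketch below builds such a string from a list of tags and a list of chunk spans. The helper name encode_chunking and the (start, end) span format are assumptions made for this example; they are not part of ChunkString.

```python
def encode_chunking(tags, chunk_spans):
    """Encode tags plus half-open (start, end) chunk spans as a brace string.

    Hypothetical helper for illustration only; not part of ChunkString.
    """
    starts = {start for start, _ in chunk_spans}
    ends = {end for _, end in chunk_spans}
    parts = []
    for i, tag in enumerate(tags):
        if i in starts:
            parts.append("{")       # a chunk opens before this tag
        parts.append("<%s>" % tag)  # every tag is angle-bracket delimited
        if i + 1 in ends:
            parts.append("}")       # a chunk closes after this tag
    return "".join(parts)


tags = ["DT", "JJ", "NN", "VBN", "IN", "DT", "NN", ".", "DT", "NN", "VBD", "."]
spans = [(0, 3), (5, 7), (8, 10)]
encoded = encode_chunking(tags, spans)
print(encoded)
# {<DT><JJ><NN>}<VBN><IN>{<DT><NN>}<.>{<DT><NN>}<VBD><.>
```

With an empty span list the same helper yields the initial, unchunked form: the tags with no braces at all.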

ChunkStrings are created from tagged texts (i.e., lists of tokens whose type is TaggedType). Initially, nothing is chunked.

The chunking of a ChunkString can be modified with the xform method, which uses a regular expression to transform the string representation. These transformations should only add and remove braces; they should not modify the sequence of angle-bracket delimited tags.
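
The constraint that a transformation touches only braces can be checked directly on the string. The sketch below applies re.sub to a plain string rather than calling xform on a ChunkString instance, and the helper tag_sequence is a name invented for this example.

```python
import re


def tag_sequence(encoding):
    """Return the angle-bracket tags in order, ignoring braces (illustrative helper)."""
    return re.findall(r"<[^{}<>]+>", encoding)


before = "{<DT>}{<JJ>}{<NN>}<VBD>{<DT>}{<NN>}"

# Merge adjacent chunks by deleting every "}{" boundary.
after = re.sub(r"\}\{", "", before)
print(after)
# {<DT><JJ><NN>}<VBD>{<DT><NN>}

# Only braces were removed, so the tag sequence is unchanged.
assert tag_sequence(before) == tag_sequence(after)
```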

Instance Methods

__init__(self, chunk_struct, debug_level=1)
    Construct a new ChunkString that encodes the chunking of the text in chunk_struct.

_tag(self, tok)

_verify(self, s, verify_tags)
    Check to make sure that s still corresponds to some chunked version of _pieces.

to_chunkstruct(self, chunk_node='CHUNK')
    Returns: Tree. The chunk structure encoded by this ChunkString.

xform(self, regexp, repl)
    Apply the given transformation to this ChunkString's string encoding.
    Returns: None

__repr__(self)
    Returns: string. A string representation of this ChunkString.

__str__(self)
    Returns: string. A formatted representation of this ChunkString's string encoding.

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __setattr__

Class Variables
  CHUNK_TAG_CHAR = '[^\\{\\}<>]'
  CHUNK_TAG = '(<[^\\{\\}<>]+?>)'
  IN_CHUNK_PATTERN = '(?=[^\\{]*\\})'
A zero-width regexp pattern string that will only match positions that are in chunks.
  IN_CHINK_PATTERN = '(?=[^\\}]*(\\{|$))'
A zero-width regexp pattern string that will only match positions that are in chinks.
  _CHUNK = '(\\{(<[^\\{\\}<>]+?>)+?\\})+?'
  _CHINK = '((<[^\\{\\}<>]+?>)+?)+?'
  _VALID = re.compile(r'^(\{?(<[^\{\}<>]+?>)\}?)*?$')
  _BRACKETS = re.compile(r'[^\{\}]+')
  _BALANCED_BRACKETS = re.compile(r'(\{\})*$')
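
To see what the two zero-width patterns select, the sketch below copies CHUNK_TAG, IN_CHUNK_PATTERN, and IN_CHINK_PATTERN from the listing above and appends each lookahead to the tag pattern, so that the lookahead is evaluated at the position just after a tag. The sample encoding is made up for the example.

```python
import re

# Pattern strings copied from the class variables listed above.
CHUNK_TAG = r"(<[^\{\}<>]+?>)"
IN_CHUNK_PATTERN = r"(?=[^\{]*\})"
IN_CHINK_PATTERN = r"(?=[^\}]*(\{|$))"

encoding = "{<DT><NN>}<VBD>{<DT><NN>}<.>"

# A tag is inside a chunk if a "}" follows before any "{".
in_chunk = [m.group(1) for m in re.finditer(CHUNK_TAG + IN_CHUNK_PATTERN, encoding)]

# A tag is inside a chink if a "{" or the end of string follows before any "}".
in_chink = [m.group(1) for m in re.finditer(CHUNK_TAG + IN_CHINK_PATTERN, encoding)]

print(in_chunk)  # ['<DT>', '<NN>', '<DT>', '<NN>']
print(in_chink)  # ['<VBD>', '<.>']
```

Because both patterns are lookaheads, they consume no characters; they can be appended to another pattern without changing what that pattern matches.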
Instance Variables
  _debug
The debug level.
  _pieces (list of pieces: tagged tokens and chunks)
The tagged tokens and chunks encoded by this ChunkString.
  _str (string)
The internal string representation of the text's encoding.
Properties

Inherited from object: __class__

Method Details

__init__(self, chunk_struct, debug_level=1)
(Constructor)

Construct a new ChunkString that encodes the chunking of the text in chunk_struct.

Parameters:
  • chunk_struct (Tree) - The chunk structure to be further chunked.
  • debug_level (int) - The level of debugging which should be applied to transformations on the ChunkString. The valid levels are:
    • 0: no checks
    • 1: full check on to_chunkstruct
    • 2: full check on to_chunkstruct and cursory check after each transformation.
    • 3: full check on to_chunkstruct and full check after each transformation.

    We recommend you use at least level 1. You should probably use level 3 if you use any non-standard subclasses of RegexpChunkRule.

Overrides: object.__init__

_verify(self, s, verify_tags)

Check to make sure that s still corresponds to some chunked version of _pieces.

Parameters:
  • verify_tags (boolean) - Whether the individual tags should be checked. If this is false, _verify only checks that s encodes a chunked version of some list of tokens. If this is true, _verify also checks that the tags in s match those in _pieces.
Raises:
  • ValueError - if this ChunkString's internal string representation is invalid or not consistent with _pieces.
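
The checks described here can be approximated with the _VALID, _BRACKETS, and _BALANCED_BRACKETS patterns listed under Class Variables. The function below is a minimal sketch and not the library's implementation; the name verify_encoding and the expected_tags parameter (standing in for the tags of _pieces) are assumptions for the example.

```python
import re

# Patterns copied from the class variables listed earlier on this page.
_VALID = re.compile(r"^(\{?(<[^\{\}<>]+?>)\}?)*?$")
_BRACKETS = re.compile(r"[^\{\}]+")
_BALANCED_BRACKETS = re.compile(r"(\{\})*$")


def verify_encoding(s, expected_tags=None):
    """Raise ValueError if s is not a well-formed chunk encoding (sketch only)."""
    # Overall form: a sequence of tags, each optionally wrapped by braces.
    if not _VALID.match(s):
        raise ValueError("invalid chunkstring: %r" % s)
    # Strip everything except braces; what remains must be "{}{}{}...".
    brackets = _BRACKETS.sub("", s)
    if not _BALANCED_BRACKETS.match(brackets):
        raise ValueError("unbalanced or nested braces: %r" % s)
    # Optional tag check, playing the role of verify_tags.
    if expected_tags is not None:
        tags = re.findall(r"<([^\{\}<>]+)>", s)
        if tags != list(expected_tags):
            raise ValueError("tag sequence changed: %r" % s)


verify_encoding("{<DT><NN>}<VBD>", ["DT", "NN", "VBD"])  # passes silently

try:
    verify_encoding("{<DT>{<NN>}")  # a second "{" opens before the first closes
except ValueError as err:
    print("rejected:", err)
```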

to_chunkstruct(self, chunk_node='CHUNK')

Returns: Tree
the chunk structure encoded by this ChunkString.
Raises:
  • ValueError - If a transformation has generated an invalid chunkstring.
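
Decoding can be pictured as splitting the string at braces and alternating between chink and chunk segments. The sketch below assumes the encoding is already valid and uses plain (label, list) tuples in place of Tree; decode_chunking and the sample tokens are invented for the example.

```python
import re


def decode_chunking(encoding, tokens, chunk_label="CHUNK"):
    """Group tokens according to a valid brace encoding (illustrative sketch)."""
    pieces = []
    index = 0
    in_chunk = False
    # Splitting on braces yields segments that alternate chink, chunk, chink, ...
    for segment in re.split(r"[{}]", encoding):
        length = segment.count("<")           # number of tags in this segment
        span = tokens[index:index + length]   # the tokens those tags stand for
        if in_chunk:
            pieces.append((chunk_label, span))
        else:
            pieces.extend(span)
        index += length
        in_chunk = not in_chunk
    return pieces


tokens = [("the", "DT"), ("cat", "NN"), ("sat", "VBD"), ("mats", "NNS")]
decoded = decode_chunking("{<DT><NN>}<VBD>{<NNS>}", tokens)
print(decoded)
# [('CHUNK', [('the', 'DT'), ('cat', 'NN')]), ('sat', 'VBD'), ('CHUNK', [('mats', 'NNS')])]
```

The alternation only works when braces are balanced and unnested, which is why an invalid chunkstring has to be rejected before decoding.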

xform(self, regexp, repl)

Apply the given transformation to this ChunkString's string encoding. In particular, find all occurrences that match regexp, and replace them using repl (as done by re.sub).

This transformation should only add and remove braces; it should not modify the sequence of angle-bracket delimited tags. Furthermore, this transformation must not result in improper bracketing. Note, in particular, that bracketing must not be nested.

Parameters:
  • regexp (string or regexp) - A regular expression matching the substring that should be replaced. This will typically include a named group, which can be used by repl.
  • repl (string) - An expression specifying what should replace the matched substring. Typically, this will include a named replacement group, specified by regexp.
Returns: None
Raises:
  • ValueError - If this transformation generated an invalid chunkstring.
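
A regexp with a named group and a repl that refers back to it might look like the sketch below. It applies re.sub to a plain string instead of calling xform, and the tag pattern (optional DT, any number of JJ, then NN) and the group name chunk are illustrative choices. The IN_CHINK_PATTERN lookahead is copied from the class variables.

```python
import re

IN_CHINK_PATTERN = r"(?=[^\}]*(\{|$))"

# Named group "chunk" captures the tags to wrap; the lookahead restricts
# matches to material that is not already inside a chunk.
regexp = re.compile(r"(?P<chunk>(<DT>)?(<JJ>)*(<NN>))" + IN_CHINK_PATTERN)
repl = r"{\g<chunk>}"

encoding = "<DT><JJ><NN><VBD><DT><NN><.>"
result = regexp.sub(repl, encoding)
print(result)
# {<DT><JJ><NN>}<VBD>{<DT><NN>}<.>

# Applying the same transformation again changes nothing: every candidate
# match now sits inside a chunk, so the lookahead fails and no braces nest.
assert regexp.sub(repl, result) == result
```

Without the lookahead, the second application would wrap already-chunked tags in another pair of braces, which is exactly the nested bracketing the method forbids.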

__repr__(self)
(Representation operator)

repr(x)

Returns: string
A string representation of this ChunkString. This string representation has the form:
   <ChunkString: '{<DT><JJ><NN>}<VBN><IN>{<DT><NN>}'>
Overrides: object.__repr__

__str__(self)
(Informal representation operator)

str(x)

Returns: string
A formatted representation of this ChunkString's string encoding. This representation will include extra spaces to ensure that tags will line up with the representation of other ChunkStrings for the same text, regardless of the chunking.
Overrides: object.__str__
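
One way to obtain this kind of alignment is to pad the positions where a brace could appear with a space. The function below is a sketch under that assumption, applied to plain strings; format_encoding is a name invented for the example and the code is not claimed to match the library's __str__ exactly.

```python
import re


def format_encoding(encoding):
    """Pad an encoding so tags align across different chunkings (sketch only)."""
    # Add a space after each ">" that is not followed by a closing brace.
    spaced = re.sub(r">(?!\})", "> ", encoding)
    # Add a space before each "<" that is not preceded by an opening brace.
    spaced = re.sub(r"([^\{])<", r"\1 <", spaced)
    # A leading tag has no preceding character, so pad it separately.
    if spaced.startswith("<"):
        spaced = " " + spaced
    return spaced


unchunked = format_encoding("<DT><NN><VBD>")
chunked = format_encoding("{<DT><NN>}<VBD>")
print(repr(unchunked))  # ' <DT>  <NN>  <VBD> '
print(repr(chunked))    # '{<DT>  <NN>} <VBD> '
```

Each brace occupies a column that holds a space in the unchunked form, so the tags start at the same offsets in both strings.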

Instance Variable Details

_debug

The debug level. See the constructor docs.

_str

The internal string representation of the text's encoding. This string representation contains a sequence of angle-bracket delimited tags, with chunking indicated by braces. An example of this encoding is:
   {<DT><JJ><NN>}<VBN><IN>{<DT><NN>}<.>{<DT><NN>}<VBD><.>
Type: string