Package nltk :: Package chunk :: Module regexp :: Class ChunkString

Class ChunkString

object --+
         |
        ChunkString

A string-based encoding of a particular chunking of a text. Internally, the ChunkString class uses a single string to encode the chunking of the input text. This string contains a sequence of angle-bracket delimited tags, with chunking indicated by braces. An example of this encoding is:

   {<DT><JJ><NN>}<VBN><IN>{<DT><NN>}<.>{<DT><NN>}<VBD><.>

ChunkStrings are created from tagged texts (i.e., lists of tokens whose type is TaggedType). Initially, nothing is chunked.
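
As a rough illustration of the initial, unchunked encoding, the sketch below builds the string from a tagged text. It assumes tokens are plain (word, tag) tuples, which is a stand-in for TaggedType, and the helper name initial_encoding is hypothetical rather than part of the class.

```python
def initial_encoding(tagged_tokens):
    """Return the unchunked encoding for a list of (word, tag) pairs.

    Assumption: tokens are (word, tag) tuples; the real class may
    store tokens differently.
    """
    # Each token contributes only its tag, wrapped in angle brackets.
    return "".join("<%s>" % tag for _word, tag in tagged_tokens)


tagged = [("the", "DT"), ("little", "JJ"), ("cat", "NN"), ("sat", "VBD"), (".", ".")]
encoding = initial_encoding(tagged)
print(encoding)  # <DT><JJ><NN><VBD><.>
```

No braces appear in the result, which matches the statement that nothing is chunked at first.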

The chunking of a ChunkString can be modified with the xform method, which uses a regular expression to transform the string representation. These transformations should only add and remove braces; they should not modify the sequence of angle-bracket delimited tags.

Instance Methods

__init__(self, chunk_struct, debug_level=1)
Construct a new ChunkString that encodes the chunking of the text in chunk_struct.

_tag(self, tok)

_verify(self, s, verify_tags)
Check to make sure that s still corresponds to some chunked version of _pieces.

to_chunkstruct(self, chunk_node='CHUNK')
Returns: Tree. The chunk structure encoded by this ChunkString.

xform(self, regexp, repl)
Apply the given transformation to this ChunkString's string encoding. Returns: None.

__repr__(self)
Returns: string. A string representation of this ChunkString.

__str__(self)
Returns: string. A formatted representation of this ChunkString's string encoding.

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __setattr__

Class Variables

CHUNK_TAG_CHAR = '[^\\{\\}<>]'

CHUNK_TAG = '(<[^\\{\\}<>]+?>)'

IN_CHUNK_PATTERN = '(?=[^\\{]*\\})'
A zero-width regexp pattern string that will only match positions that are in chunks.

IN_CHINK_PATTERN = '(?=[^\\}]*(\\{|$))'
A zero-width regexp pattern string that will only match positions that are in chinks.
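
The two lookahead patterns above can be tried out directly with the standard re module. This is a small sketch using only the constants listed on this page; the sample encoding is made up for illustration.

```python
import re

# Class constants as listed above.
CHUNK_TAG = r'(<[^\{\}<>]+?>)'
IN_CHUNK_PATTERN = r'(?=[^\{]*\})'
IN_CHINK_PATTERN = r'(?=[^\}]*(\{|$))'

encoding = "{<DT><NN>}<VBD>{<NN>}<.>"

# A tag is inside a chunk if a "}" follows it before any "{".
in_chunk = [m.group(0) for m in re.finditer(CHUNK_TAG + IN_CHUNK_PATTERN, encoding)]

# A tag is inside a chink if a "{" or the end of the string
# follows it before any "}".
in_chink = [m.group(0) for m in re.finditer(CHUNK_TAG + IN_CHINK_PATTERN, encoding)]

print(in_chunk)  # ['<DT>', '<NN>', '<NN>']
print(in_chink)  # ['<VBD>', '<.>']
```

Because both patterns are zero-width, they can be appended to any tag pattern without consuming extra characters.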

_CHUNK = '(\\{(<[^\\{\\}<>]+?>)+?\\})+?'

_CHINK = '((<[^\\{\\}<>]+?>)+?)+?'

_VALID = re.compile(r'^(\{?(<[^\{\}<>]+?>)\}?)*?$')

_BRACKETS = re.compile(r'[^\{\}]+')

_BALANCED_BRACKETS = re.compile(r'(\{\})*$')

Instance Variables

_debug
The debug level.

_pieces (list of pieces: tagged tokens and chunks)
The tagged tokens and chunks encoded by this ChunkString.

_str (string)
The internal string representation of the text's encoding.

Properties

Inherited from object: __class__

Method Details

__init__(self, chunk_struct, debug_level=1)
(Constructor)

Construct a new ChunkString that encodes the chunking of the text in chunk_struct.

Parameters:

chunk_struct (Tree) - The chunk structure to be further chunked.
debug_level (int) - The level of debugging which should be applied to transformations on the ChunkString. The valid levels are:
- 0: no checks
- 1: full check on to_chunkstruct
- 2: full check on to_chunkstruct and cursory check after each transformation.
- 3: full check on to_chunkstruct and full check after each transformation.
We recommend you use at least level 1. You should probably use level 3 if you use any non-standard subclasses of RegexpChunkRule.

Overrides: object.__init__

_verify(self, s, verify_tags)

Check to make sure that s still corresponds to some chunked version of _pieces.

Parameters:

verify_tags (boolean) - Whether the individual tags should be checked. If this is false, _verify will check to make sure that _str encodes a chunked version of some list of tokens. If this is true, then _verify will check to make sure that the tags in _str match those in _pieces.

Raises:

ValueError - if this ChunkString's internal string representation is invalid or not consistent with _pieces.
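
The checks described above could be sketched with the class's _VALID, _BRACKETS and _BALANCED_BRACKETS patterns as follows. This is an approximation built from the documented constants, not a copy of the library's code; the function names verify and check and the expected_tags parameter are hypothetical.

```python
import re

# Class constants as listed on this page.
_VALID = re.compile(r'^(\{?(<[^\{\}<>]+?>)\}?)*?$')
_BRACKETS = re.compile(r'[^\{\}]+')
_BALANCED_BRACKETS = re.compile(r'(\{\})*$')


def verify(s, expected_tags=None):
    """Raise ValueError if s is not a well-formed chunk encoding."""
    # Cursory shape check: a sequence of tags, each optionally braced.
    if not _VALID.match(s):
        raise ValueError("invalid chunkstring: %r" % s)
    # Drop everything but braces; what remains must be "{}{}...{}".
    brackets = _BRACKETS.sub('', s)
    if not _BALANCED_BRACKETS.match(brackets):
        raise ValueError("unbalanced or nested braces: %r" % s)
    # Optional tag check: the tag sequence must be unchanged.
    if expected_tags is not None:
        tags = re.findall(r'<([^\{\}<>]+?)>', s)
        if tags != list(expected_tags):
            raise ValueError("tag sequence changed: %r" % s)


def check(s, expected_tags=None):
    """Return "ok" or the error message, for demonstration only."""
    try:
        verify(s, expected_tags)
        return "ok"
    except ValueError as err:
        return str(err)


results = [
    check("{<DT><NN>}<VBD>", ["DT", "NN", "VBD"]),  # well formed
    check("{<DT>{<NN>}<VBD>}"),                     # nested braces
    check("{<DT><NN>}<VBD>", ["DT", "JJ", "VBD"]),  # tags differ
]
print(results)
```

The nested example passes the shape check but fails the bracket check, which is why both steps are needed.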

to_chunkstruct(self, chunk_node='CHUNK')

Returns: Tree

the chunk structure encoded by this ChunkString.

Raises:

ValueError - If a transformation has generated an invalid chunkstring.
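
To show how the string encoding maps back onto the original tokens, here is a simplified decoder. It is only a sketch: it assumes (word, tag) tuples for tokens, assumes the encoding is already valid, and returns plain (label, tokens) tuples instead of the Tree that to_chunkstruct actually returns. The function name decode is hypothetical.

```python
import re


def decode(encoding, tagged_tokens, chunk_node="CHUNK"):
    """Pair a valid chunk encoding with its tokens.

    Returns a flat list in which each chunk is a (chunk_node, tokens)
    tuple and each unchunked token appears as itself.
    """
    result = []
    index = 0
    # Split into braced chunk groups and runs of tags outside braces.
    for part in re.findall(r'\{[^{}]*\}|[^{}]+', encoding):
        count = part.count("<")  # number of tags in this part
        tokens = tagged_tokens[index:index + count]
        index += count
        if part.startswith("{"):
            result.append((chunk_node, tokens))
        else:
            result.extend(tokens)
    return result


tokens = [("the", "DT"), ("cat", "NN"), ("chased", "VBD"), ("mice", "NN"), (".", ".")]
structure = decode("{<DT><NN>}<VBD>{<NN>}<.>", tokens)
print(structure)
```

Counting "<" characters is safe here because tag characters exclude angle brackets and braces, as CHUNK_TAG_CHAR shows.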

xform(self, regexp, repl)

Apply the given transformation to this ChunkString's string encoding. In particular, find all occurrences that match regexp, and replace them using repl (as done by re.sub).

This transformation should only add and remove braces; it should not modify the sequence of angle-bracket delimited tags. Furthermore, this transformation must not result in improper bracketing. Note, in particular, that bracketing must not be nested.

Parameters:

regexp (string or regexp) - A regular expression matching the substring that should be replaced. This will typically include a named group, which can be used by repl.
repl (string) - An expression specifying what should replace the matched substring. Typically, this will include a named replacement group, specified by regexp.

Returns: None

Raises:

ValueError - If this transformation generated an invalid chunkstring.
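
A transformation with a named group might look like the sketch below. It works on a bare string and returns the new string, whereas the real xform updates the ChunkString in place and returns None; the standalone function and the sample rule are assumptions made for illustration.

```python
import re


def xform(encoding, regexp, repl):
    """Apply re.sub and reject results that alter the tag sequence."""
    new_encoding = re.sub(regexp, repl, encoding)
    # Only braces may be added or removed: stripping braces from both
    # strings must give the same sequence of tags.
    def strip(s):
        return s.replace("{", "").replace("}", "")
    if strip(new_encoding) != strip(encoding):
        raise ValueError("transformation modified the tags: %r" % new_encoding)
    return new_encoding


encoding = "<DT><JJ><NN><VBD><DT><NN><.>"

# Chunk a determiner, any adjectives, and a noun. The named group
# "chunk" is reused in the replacement via \g<chunk>.
chunked = xform(encoding, r'(?P<chunk><DT>(<JJ>)*<NN>)', r'{\g<chunk>}')
print(chunked)  # {<DT><JJ><NN>}<VBD>{<DT><NN>}<.>

# A replacement that deletes a tag is rejected.
try:
    xform(encoding, r'<JJ>', '')
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```

This sketch only guards the tag sequence; checking for nested or unbalanced braces would be a separate step.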

__repr__(self)
(Representation operator)

repr(x)

Returns: string

A string representation of this ChunkString. This string representation has the form:

   <ChunkString: '{<DT><JJ><NN>}<VBN><IN>{<DT><NN>}'>

Overrides: object.__repr__

__str__(self)
(Informal representation operator)

str(x)

Returns: string

A formatted representation of this ChunkString's string encoding. This representation will include extra spaces to ensure that tags will line up with the representation of other ChunkStrings for the same text, regardless of the chunking.

Overrides: object.__str__
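
One way such alignment could be achieved is to reserve a column for a brace on each side of every tag. The sketch below follows that idea; it is an assumption about how the padding might work, not a statement of what the method does, and format_encoding is a hypothetical name.

```python
import re


def format_encoding(encoding):
    """Pad an encoding so tags line up across different chunkings."""
    # Add a space after each tag that is not followed by "}".
    s = re.sub(r'>(?!\})', r'> ', encoding)
    # Add a space before each tag that is not preceded by "{".
    s = re.sub(r'([^\{])<', r'\1 <', s)
    # A leading tag has no preceding character, so pad it by hand.
    if s.startswith('<'):
        s = ' ' + s
    return s


plain = format_encoding("<DT><NN><VBD>")
chunked = format_encoding("{<DT><NN>}<VBD>")
print(repr(plain))    # ' <DT>  <NN>  <VBD> '
print(repr(chunked))  # '{<DT>  <NN>} <VBD> '
```

Each brace occupies the column that a padding space would otherwise fill, so both strings have the same length and every tag starts at the same offset.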

Instance Variable Details

_debug

The debug level. See the constructor docs.

_str

The internal string representation of the text's encoding. This string representation contains a sequence of angle-bracket delimited tags, with chunking indicated by braces. An example of this encoding is:

   {<DT><JJ><NN>}<VBN><IN>{<DT><NN>}<.>{<DT><NN>}<VBD><.>

Type: string
