Package nltk :: Package parse :: Module viterbi :: Class ViterbiParser

Class ViterbiParser

 object --+    
          |    
api.ParserI --+
              |
             ViterbiParser

Known Subclasses:

ViterbiParse

A bottom-up PCFG parser that uses dynamic programming to find the single most likely parse for a text. The ViterbiParser parser parses texts by filling in a most likely constituent table. This table records the most probable tree representation for any given span and node value. In particular, it has an entry for every start index, end index, and node value, recording the most likely subtree that spans from the start index to the end index, and has the given node value.

The ViterbiParser parser fills in this table incrementally. It starts by filling in all entries for constituents that span one element of text (i.e., entries where the end index is one greater than the start index). After it has filled in all table entries for constituents that span one element of text, it fills in the entries for constitutants that span two elements of text. It continues filling in the entries for constituents spanning larger and larger portions of the text, until the entire table has been filled. Finally, it returns the table entry for a constituent spanning the entire text, whose node value is the grammar's start symbol.

In order to find the most likely constituent with a given span and node value, the ViterbiParser parser considers all productions that could produce that node value. For each production, it finds all children that collectively cover the span and have the node values specified by the production's right hand side. If the probability of the tree formed by applying the production to the children is greater than the probability of the current entry in the table, then the table is updated with this new tree.

A pseudo-code description of the algorithm used by ViterbiParser is:

Create an empty most likely constituent table, MLC.
For width in 1...len(text):
- For start in 1...len(text)-width:
  - For prod in grammar.productions:
    - For each sequence of subtrees [t[1], t[2], ..., t[n]] in MLC, where t[i].node==prod.rhs[i], and the sequence covers [start:start+width]:
      - old_p = MLC[start, start+width, prod.lhs]
      - new_p = P(t[1])*P(t[1])*...*P(t[n])*P(prod)
      - if new_p > old_p:
        
        new_tree = Tree(prod.lhs, t[1], t[2], ..., t[n])
        
        MLC[start, start+width, prod.lhs] = new_tree
Return MLC[0, len(text), start_symbol]

Instance Methods

[hide private]

__init__(self, grammar, trace=0)
Create a new ViterbiParser parser, that uses {grammar} to parse texts. source code

grammar(self)
Returns: The grammar used by this parser.

source code

None

trace(self, trace=2)
Set the level of tracing output that should be generated when parsing a text.

source code

list of Tree

nbest_parse(self, tokens, n=None)
Returns: A list of parse trees that represent possible structures for the given sentence.

source code

None

_add_constituents_spanning(self, span, constituents, tokens)
Find any constituents that might cover span, and add them to the most likely constituents table. source code

_find_instantiations(self, span, constituents)
Returns: a list of the production instantiations that cover a given span of the text.

source code

list of (list of ProbabilisticTree or Token)

_match_rhs(self, rhs, span, constituents)
Returns: a set of all the lists of children that cover span and that match rhs. source code

None

_trace_production(self, production, p, span, width)
Print trace output indicating that a given production has been applied at a given location.

source code

_trace_lexical_insertion(self, token, index, width)

source code

__repr__(self)
repr(x)

source code

Inherited from api.ParserI: batch_iter_parse, batch_nbest_parse, batch_parse, batch_prob_parse, iter_parse, parse, prob_parse

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __setattr__, __str__

Deprecated

Inherited from api.ParserI: batch_test, get_parse, get_parse_dict, get_parse_list, get_parse_prob

Instance Variables

[hide private]

pcfg.Grammar _grammar
The grammar used to parse sentences.

int _trace
The level of tracing output that should be generated when parsing a text.

Properties

[hide private]

Inherited from object: __class__

Method Details

[hide private]

init(self, grammar, trace=0)
(Constructor)

source code

Create a new ViterbiParser parser, that uses {grammar} to parse texts.

Parameters:

grammar (pcfg.Grammar) - The grammar used to parse texts.
trace (int) - The level of tracing that should be used when parsing a text. 0 will generate no tracing output; and higher numbers will produce more verbose tracing output.

Overrides: object.__init__

grammar(self)

source code

Returns:: The grammar used by this parser.
Overrides: api.ParserI.grammar: (inherited documentation)

trace(self, trace=2)

source code

Set the level of tracing output that should be generated when parsing a text.

Parameters:

trace (int) - The trace level. A trace level of 0 will generate no tracing output; and higher trace levels will produce more verbose tracing output.

Returns: None

nbest_parse(self, tokens, n=None)

source code

Parameters:

sent - The sentence to be parsed
n - The maximum number of trees to return.

Returns: list of Tree

A list of parse trees that represent possible structures for the given sentence. When possible, this list is sorted from most likely to least likely. If n is specified, then the returned list will contain at most n parse trees.

Overrides: api.ParserI.nbest_parse

(inherited documentation)

_add_constituents_spanning(self, span, constituents, tokens)

source code

Find any constituents that might cover span, and add them to the most likely constituents table.

Parameters:

span ((int, int)) - The section of the text for which we are trying to find possible constituents. The span is specified as a pair of integers, where the first integer is the index of the first token that should be included in the constituent; and the second integer is the index of the first token that should not be included in the constituent. I.e., the constituent should cover text[span[0]:span[1]], where text is the text that we are parsing.
constituents (dictionary from (int,int,Nonterminal) to (ProbabilisticToken or ProbabilisticTree).) - The most likely constituents table. This table records the most probable tree representation for any given span and node value. In particular, constituents(s,e,nv) is the most likely ProbabilisticTree that covers text[s:e] and has a node value nv.symbol(), where text is the text that we are parsing. When _add_constituents_spanning is called, constituents should contain all possible constituents that are shorter than span.
tokens (list of tokens) - The text we are parsing. This is only used for trace output.

Returns: None

_find_instantiations(self, span, constituents)

source code

Parameters:

span ((int, int)) - The section of the text for which we are trying to find production instantiations. The span is specified as a pair of integers, where the first integer is the index of the first token that should be covered by the production instantiation; and the second integer is the index of the first token that should not be covered by the production instantiation.
constituents (dictionary from (int,int,Nonterminal) to (ProbabilisticToken or ProbabilisticTree).) - The most likely constituents table. This table records the most probable tree representation for any given span and node value. See the module documentation for more information.

Returns:

a list of the production instantiations that cover a given span of the text. A production instantiation is a tuple containing a production and a list of children, where the production's right hand side matches the list of children; and the children cover span. @rtype: list of pair of Production, (list of (ProbabilisticTree or token.

_match_rhs(self, rhs, span, constituents)

source code

Parameters:

rhs (list of Nonterminal or (any)) - The list specifying what kinds of children need to cover span. Each nonterminal in rhs specifies that the corresponding child should be a tree whose node value is that nonterminal's symbol. Each terminal in rhs specifies that the corresponding child should be a token whose type is that terminal.
span ((int, int)) - The section of the text for which we are trying to find child lists. The span is specified as a pair of integers, where the first integer is the index of the first token that should be covered by the child list; and the second integer is the index of the first token that should not be covered by the child list.
constituents (dictionary from (int,int,Nonterminal) to (ProbabilisticToken or ProbabilisticTree).) - The most likely constituents table. This table records the most probable tree representation for any given span and node value. See the module documentation for more information.

Returns: list of (list of ProbabilisticTree or Token)

a set of all the lists of children that cover span and that match rhs.

_trace_production(self, production, p, span, width)

source code

Print trace output indicating that a given production has been applied at a given location.

Parameters:

production (Production) - The production that has been applied
p (float) - The probability of the tree produced by the production.
span (tuple) - The span of the production

Returns: None

repr(self)
(Representation operator)

source code

repr(x)

Overrides: object.__repr__: (inherited documentation)

Class ViterbiParser

__init__(self, grammar, trace=0) (Constructor)

grammar(self)

trace(self, trace=2)

nbest_parse(self, tokens, n=None)

_add_constituents_spanning(self, span, constituents, tokens)

_find_instantiations(self, span, constituents)

_match_rhs(self, rhs, span, constituents)

_trace_production(self, production, p, span, width)

__repr__(self) (Representation operator)

init(self, grammar, trace=0)
(Constructor)

repr(self)
(Representation operator)