nltk.parse package¶
Submodules¶
nltk.parse.api module¶
-
class
nltk.parse.api.
ParserI
[source]¶ Bases:
object
A processing class for deriving trees that represent possible structures for a sequence of tokens. These tree structures are known as “parses”. Typically, parsers are used to derive syntax trees for sentences. But parsers can also be used to derive other kinds of tree structure, such as morphological trees and discourse structures.
Subclasses must define:
- at least one of: parse(), parse_sents().
Subclasses may define:
- grammar()
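The "at least one of" contract above can be pictured with a small stand-alone sketch. Everything here is hypothetical (`ToyParserBase`, `FlatParser`, and the tuple-based trees are illustrative names, not part of NLTK); it only shows how two methods can default to each other so that a subclass needs to override just one.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical stand-in for a tree: (label, children). Not nltk.tree.Tree.
ToyTree = Tuple[str, tuple]


class ToyParserBase:
    """Toy analogue of the interface: each method defaults to the other."""

    def parse(self, sent: List[str]) -> Iterator[ToyTree]:
        # Default: delegate to parse_sents() with a one-sentence batch.
        return next(iter(self.parse_sents([sent])))

    def parse_sents(self, sents: Iterable[List[str]]) -> Iterator[Iterator[ToyTree]]:
        # Default: parse each sentence independently.
        return (self.parse(sent) for sent in sents)


class FlatParser(ToyParserBase):
    """Overrides only parse(); yields one flat tree per sentence."""

    def parse(self, sent):
        yield ("S", tuple(sent))


parser = FlatParser()
single = list(parser.parse(["the", "dog", "barks"]))
batch = [list(trees) for trees in parser.parse_sents([["a"], ["b", "c"]])]
```

A subclass that overrides neither method would recurse forever in this sketch, which is why at least one of the two has to be defined.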
nltk.parse.bllip module¶
-
class
nltk.parse.bllip.
BllipParser
(parser_model=None, reranker_features=None, reranker_weights=None, parser_options=None, reranker_options=None)[source]¶ Bases:
nltk.parse.api.ParserI
Interface for parsing with BLLIP Parser. BllipParser objects can be constructed with the BllipParser.from_unified_model_dir class method or manually using the BllipParser constructor.
-
classmethod
from_unified_model_dir
(model_dir, parser_options=None, reranker_options=None)[source]¶ Create a BllipParser object from a unified parsing model directory. Unified parsing model directories are a standardized way of storing BLLIP parser and reranker models together on disk. See bllipparser.RerankingParser.get_unified_model_parameters() for more information about unified model directories.
Returns: A BllipParser object using the parser and reranker models in the model directory.
Parameters: - model_dir (str) – Path to the unified model directory.
- parser_options (dict(str)) – Optional dictionary of parser options; see bllipparser.RerankingParser.RerankingParser.load_parser_options() for more information.
- reranker_options (dict(str)) – Optional dictionary of reranker options; see bllipparser.RerankingParser.RerankingParser.load_reranker_model() for more information.
Return type: BllipParser
-
parse
(sentence)[source]¶ Use BLLIP Parser to parse a sentence. Takes a sentence as a list of words; it will be automatically tagged with this BLLIP Parser instance’s tagger.
Returns: An iterator that generates parse trees for the sentence from most likely to least likely.
Parameters: sentence (list(str)) – The sentence to be parsed
Return type: iter(Tree)
-
tagged_parse
(word_and_tag_pairs)[source]¶ Use BLLIP to parse a sentence. Takes a sentence as a list of (word, tag) tuples; the sentence must have already been tokenized and tagged. BLLIP will attempt to use the tags provided but may use others if it can’t come up with a complete parse subject to those constraints. You may also specify a tag as None to leave a token’s tag unconstrained.
Returns: An iterator that generates parse trees for the sentence from most likely to least likely.
Parameters: word_and_tag_pairs (list(tuple(str, str))) – Input sentence to parse as (word, tag) pairs
Return type: iter(Tree)
nltk.parse.chart module¶
Data classes and parser implementations for “chart parsers”, which use dynamic programming to efficiently parse a text. A chart parser derives parse trees for a text by iteratively adding “edges” to a “chart.” Each edge represents a hypothesis about the tree structure for a subsequence of the text. The chart is a “blackboard” for composing and combining these hypotheses.
When a chart parser begins parsing a text, it creates a new (empty) chart, spanning the text. It then incrementally adds new edges to the chart. A set of “chart rules” specifies the conditions under which new edges should be added to the chart. Once the chart reaches a stage where none of the chart rules adds any new edges, parsing is complete.
Charts are encoded with the Chart class, and edges are encoded with the TreeEdge and LeafEdge classes. The chart parser module defines several chart parsers, including:
- ChartParser is a simple and flexible chart parser. Given a set of chart rules, it will apply those rules to the chart until no more edges are added.
- SteppingChartParser is a subclass of ChartParser that can be used to step through the parsing process.
-
class
nltk.parse.chart.
AbstractChartRule
[source]¶ Bases:
nltk.parse.chart.ChartRuleI
An abstract base class for chart rules.
AbstractChartRule provides:
- A default implementation for apply.
- A default implementation for apply_everywhere. (Currently, this implementation assumes that NUM_EDGES <= 3.)
- A default implementation for __str__, which returns a name based on the rule’s class name.
-
apply
(chart, grammar, *edges)[source]¶ Return a generator that will add edges licensed by this rule and the given edges to the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.
Parameters: edges (list(EdgeI)) – A set of existing edges. The number of edges that should be passed to apply() is specified by the NUM_EDGES class variable.
Return type: iter(EdgeI)
-
apply_everywhere
(chart, grammar)[source]¶ Return a generator that will add all edges licensed by this rule, given the edges that are currently in the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.
Return type: iter(EdgeI)
-
unicode_repr
¶ Return repr(self).
-
class
nltk.parse.chart.
BottomUpChartParser
(grammar, **parser_args)[source]¶ Bases:
nltk.parse.chart.ChartParser
A ChartParser using a bottom-up parsing strategy. See ChartParser for more information.
-
class
nltk.parse.chart.
BottomUpLeftCornerChartParser
(grammar, **parser_args)[source]¶ Bases:
nltk.parse.chart.ChartParser
A ChartParser using a bottom-up left-corner parsing strategy. This strategy is often more efficient than standard bottom-up. See ChartParser for more information.
-
class
nltk.parse.chart.
BottomUpPredictCombineRule
[source]¶ Bases:
nltk.parse.chart.BottomUpPredictRule
A rule licensing any edge corresponding to a production whose right-hand side begins with a complete edge’s left-hand side. In particular, this rule specifies that [A -> alpha *] licenses the edge [B -> A * beta] for each grammar production B -> A beta.
Note: This is like BottomUpPredictRule, but it also applies the FundamentalRule to the resulting edge.
-
NUM_EDGES
= 1¶
-
apply
(chart, grammar, edge)[source]¶ Return a generator that will add edges licensed by this rule and the given edges to the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.
Parameters: edges (list(EdgeI)) – A set of existing edges. The number of edges that should be passed to apply() is specified by the NUM_EDGES class variable.
Return type: iter(EdgeI)
-
-
class
nltk.parse.chart.
BottomUpPredictRule
[source]¶ Bases:
nltk.parse.chart.AbstractChartRule
A rule licensing any edge corresponding to a production whose right-hand side begins with a complete edge’s left-hand side. In particular, this rule specifies that [A -> alpha *] licenses the edge [B -> * A beta] for each grammar production B -> A beta.
-
NUM_EDGES
= 1¶
-
apply
(chart, grammar, edge)[source]¶ Return a generator that will add edges licensed by this rule and the given edges to the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.
Parameters: edges (list(EdgeI)) – A set of existing edges. The number of edges that should be passed to apply() is specified by the NUM_EDGES class variable.
Return type: iter(EdgeI)
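As an illustration of the bottom-up prediction step described for BottomUpPredictRule, here is a minimal sketch on plain tuples. `Production`, `ToyEdge`, and `bottom_up_predict` are hypothetical names, and the zero-width span placed at the start of the triggering edge is an assumption of this sketch rather than something stated above.

```python
from typing import List, NamedTuple, Tuple


class Production(NamedTuple):
    """Hypothetical stand-in for a grammar production."""
    lhs: str
    rhs: Tuple[str, ...]


class ToyEdge(NamedTuple):
    start: int
    end: int
    lhs: str
    rhs: Tuple[str, ...]
    dot: int


def bottom_up_predict(productions: List[Production], edge: ToyEdge) -> List[ToyEdge]:
    """Given a complete edge [A -> alpha *], propose [B -> * A beta]."""
    if edge.dot != len(edge.rhs):
        return []  # the rule only fires on complete edges
    return [
        # Assumed span: zero-width, at the start of the triggering edge.
        ToyEdge(edge.start, edge.start, prod.lhs, prod.rhs, 0)
        for prod in productions
        if prod.rhs and prod.rhs[0] == edge.lhs
    ]


productions = [
    Production("S", ("NP", "VP")),
    Production("NP", ("Det", "N")),
    Production("PP", ("P", "NP")),
]
np_edge = ToyEdge(0, 2, "NP", ("Det", "N"), 2)  # [NP -> Det N *][0:2]
predicted = bottom_up_predict(productions, np_edge)
```

Only `S -> NP VP` is predicted here, because it is the only production whose right-hand side begins with `NP`.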
-
-
class
nltk.parse.chart.
CachedTopDownPredictRule
[source]¶ Bases:
nltk.parse.chart.TopDownPredictRule
A cached version of TopDownPredictRule. After the first time this rule is applied to an edge with a given end and next, it will not generate any more edges for edges with that end and next.
If chart or grammar is changed, then the cache is flushed.
-
apply
(chart, grammar, edge)[source]¶ Return a generator that will add edges licensed by this rule and the given edges to the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.
Parameters: edges (list(EdgeI)) – A set of existing edges. The number of edges that should be passed to apply() is specified by the NUM_EDGES class variable.
Return type: iter(EdgeI)
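The caching idea can be sketched independently of NLTK: remember which (end, next symbol) pairs have already been expanded and return nothing for repeats. `CachedPredictor` and its tuple-based edges are hypothetical, and flushing is left as an explicit `reset()` call in this sketch instead of being triggered by a chart or grammar change.

```python
from typing import List, Optional, Set, Tuple

ToyProduction = Tuple[str, Tuple[str, ...]]              # (lhs, rhs)
ToyEdge = Tuple[int, int, str, Tuple[str, ...], int]     # (start, end, lhs, rhs, dot)


class CachedPredictor:
    def __init__(self, productions: List[ToyProduction]):
        self._productions = productions
        self._done: Set[Tuple[int, str]] = set()  # (end, nextsym) pairs already expanded

    def reset(self) -> None:
        """Flush the cache, e.g. after the chart or grammar changes."""
        self._done.clear()

    def predict(self, end: int, nextsym: Optional[str]) -> List[ToyEdge]:
        """Return zero-width predictions at `end`, or [] for a repeated pair."""
        if nextsym is None or (end, nextsym) in self._done:
            return []
        self._done.add((end, nextsym))
        return [(end, end, lhs, rhs, 0)
                for lhs, rhs in self._productions if lhs == nextsym]


predictor = CachedPredictor([("VP", ("V", "NP")), ("VP", ("V",)), ("NP", ("Det", "N"))])
first = predictor.predict(2, "VP")
second = predictor.predict(2, "VP")   # same (end, next): served from the cache
```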
-
-
class
nltk.parse.chart.
Chart
(tokens)[source]¶ Bases:
object
A blackboard for hypotheses about the syntactic constituents of a sentence. A chart contains a set of edges, and each edge encodes a single hypothesis about the structure of some portion of the sentence.
The select method can be used to select a specific collection of edges. For example, chart.select(is_complete=True, start=0) yields all complete edges whose start indices are 0. To ensure the efficiency of these selection operations, Chart dynamically creates and maintains an index for each set of attributes that have been selected on.
In order to reconstruct the trees that are represented by an edge, the chart associates each edge with a set of child pointer lists. A child pointer list is a list of the edges that license an edge’s right-hand side.
Variables: - _tokens – The sentence that the chart covers.
- _num_leaves – The number of tokens.
- _edges – A list of the edges in the chart
- _edge_to_cpls – A dictionary mapping each edge to a set of child pointer lists that are associated with that edge.
- _indexes – A dictionary mapping tuples of edge attributes to indices, where each index maps the corresponding edge attribute values to lists of edges.
-
child_pointer_lists
(edge)[source]¶ Return the set of child pointer lists for the given edge. Each child pointer list is a list of edges that have been used to form this edge.
Return type: list(list(EdgeI))
-
edges
()[source]¶ Return a list of all edges in this chart. New edges that are added to the chart after the call to edges() will not be contained in this list.
Return type: list(EdgeI)
See: iteredges, select
-
insert
(edge, *child_pointer_lists)[source]¶ Add a new edge to the chart, and return True if this operation modified the chart. In particular, return true iff the chart did not already contain edge, or if it did not already associate child_pointer_lists with edge.
Return type: bool
-
insert_with_backpointer
(new_edge, previous_edge, child_edge)[source]¶ Add a new edge to the chart, using a pointer to the previous edge.
-
iteredges
()[source]¶ Return an iterator over the edges in this chart. It is not guaranteed that new edges which are added to the chart before the iterator is exhausted will also be generated.
Return type: iter(EdgeI)
See: edges, select
-
leaves
()[source]¶ Return a list of the leaf values of each word in the chart’s sentence.
Return type: list(str)
-
parses
(root, tree_class=<class 'nltk.tree.Tree'>)[source]¶ Return an iterator of the complete tree structures that span the entire chart, and whose root node is root.
-
pretty_format
(width=None)[source]¶ Return a pretty-printed string representation of this chart.
Parameters: width – The number of characters allotted to each index in the sentence.
Return type: str
-
pretty_format_edge
(edge, width=None)[source]¶ Return a pretty-printed string representation of a given edge in this chart.
Parameters: width – The number of characters allotted to each index in the sentence.
Return type: str
-
pretty_format_leaves
(width=None)[source]¶ Return a pretty-printed string representation of this chart’s leaves. This string can be used as a header for calls to pretty_format_edge.
-
select
(**restrictions)[source]¶ Return an iterator over the edges in this chart. Any new edges that are added to the chart before the iterator is exhausted will also be generated. restrictions can be used to restrict the set of edges that will be generated.
Parameters: - span – Only generate edges e where e.span()==span
- start – Only generate edges e where e.start()==start
- end – Only generate edges e where e.end()==end
- length – Only generate edges e where e.length()==length
- lhs – Only generate edges e where e.lhs()==lhs
- rhs – Only generate edges e where e.rhs()==rhs
- nextsym – Only generate edges e where e.nextsym()==nextsym
- dot – Only generate edges e where e.dot()==dot
- is_complete – Only generate edges e where e.is_complete()==is_complete
- is_incomplete – Only generate edges e where e.is_incomplete()==is_incomplete
Return type: iter(EdgeI)
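To make the restriction-based lookup concrete, here is a small sketch of a chart that builds one index per set of attribute names the first time that set is selected on, and keeps existing indexes current on insert. `ToyChart` and `ToyEdge` are hypothetical and use plain attributes instead of accessor methods; this is a simplified model, not NLTK's implementation.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, NamedTuple, Tuple


class ToyEdge(NamedTuple):
    start: int
    end: int
    lhs: str
    is_complete: bool


class ToyChart:
    def __init__(self) -> None:
        self._edges: List[ToyEdge] = []
        # Maps a sorted tuple of attribute names to {attribute values: [edges]}.
        self._indexes: Dict[Tuple[str, ...], Dict[tuple, List[ToyEdge]]] = {}

    def insert(self, edge: ToyEdge) -> None:
        self._edges.append(edge)
        for keys, index in self._indexes.items():  # keep existing indexes current
            index[tuple(getattr(edge, k) for k in keys)].append(edge)

    def select(self, **restrictions) -> Iterator[ToyEdge]:
        if not restrictions:
            return iter(self._edges)
        keys = tuple(sorted(restrictions))
        if keys not in self._indexes:  # build the index on first use
            index = defaultdict(list)
            for edge in self._edges:
                index[tuple(getattr(edge, k) for k in keys)].append(edge)
            self._indexes[keys] = index
        return iter(self._indexes[keys][tuple(restrictions[k] for k in keys)])


chart = ToyChart()
chart.insert(ToyEdge(0, 1, "Det", True))
chart.insert(ToyEdge(0, 1, "NP", False))
chart.insert(ToyEdge(1, 2, "N", True))
before = list(chart.select(is_complete=True, start=0))
chart.insert(ToyEdge(0, 2, "NP", True))
after = list(chart.select(is_complete=True, start=0))
```

The second `select` call reuses the index created by the first one, and it already contains the edge inserted in between.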
-
trees
(edge, tree_class=<class 'nltk.tree.Tree'>, complete=False)[source]¶ Return an iterator of the tree structures that are associated with edge.
If edge is incomplete, then the unexpanded children will be encoded as childless subtrees, whose node value is the corresponding terminal or nonterminal.
Return type: list(Tree)
Note: If two trees share a common subtree, then the same Tree may be used to encode that subtree in both trees. If you need to eliminate this subtree sharing, then create a deep copy of each tree.
-
class
nltk.parse.chart.
ChartParser
(grammar, strategy=[<nltk.parse.chart.LeafInitRule object>, <nltk.parse.chart.EmptyPredictRule object>, <nltk.parse.chart.BottomUpPredictCombineRule object>, <nltk.parse.chart.SingleEdgeFundamentalRule object>], trace=0, trace_chart_width=50, use_agenda=True, chart_class=<class 'nltk.parse.chart.Chart'>)[source]¶ Bases:
nltk.parse.api.ParserI
A generic chart parser. A “strategy”, or list of ChartRuleI instances, is used to decide what edges to add to the chart. In particular, ChartParser uses the following algorithm to parse texts:
Until no new edges are added:
  For each rule in strategy:
    Apply rule to any applicable edges in the chart.
Return any complete parses in the chart
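The loop above is a fixpoint computation. The sketch below models it with a set of hashable "edges" and rules written as plain functions; `run_to_fixpoint` and `join_adjacent` are hypothetical names, and the single rule used here (joining adjacent spans) is only a placeholder for real chart rules.

```python
from typing import Callable, Iterable, List, Set, Tuple

Edge = Tuple[int, int]                        # a bare (start, end) span in this sketch
Rule = Callable[[Set[Edge]], Iterable[Edge]]


def run_to_fixpoint(initial: Iterable[Edge], strategy: List[Rule]) -> Set[Edge]:
    """Apply every rule repeatedly until a full pass adds no new edge."""
    chart: Set[Edge] = set(initial)
    added = True
    while added:                              # "until no new edges are added"
        added = False
        for rule in strategy:                 # "for each rule in strategy"
            for edge in list(rule(chart)):    # snapshot before mutating the chart
                if edge not in chart:
                    chart.add(edge)
                    added = True
    return chart


def join_adjacent(chart: Set[Edge]) -> List[Edge]:
    """Hypothetical rule: spans (i, j) and (j, k) license the span (i, k)."""
    return [(i, k) for (i, j) in chart for (j2, k) in chart if j == j2]


spans = run_to_fixpoint([(0, 1), (1, 2), (2, 3)], [join_adjacent])
```

Starting from three unit spans, the loop stops once every contiguous span over the three tokens is present.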
-
class
nltk.parse.chart.
ChartRuleI
[source]¶ Bases:
object
A rule that specifies what new edges are licensed by any given set of existing edges. Each chart rule expects a fixed number of edges, as indicated by the class variable NUM_EDGES. In particular:
- A chart rule with NUM_EDGES=0 specifies what new edges are licensed, regardless of existing edges.
- A chart rule with NUM_EDGES=1 specifies what new edges are licensed by a single existing edge.
- A chart rule with NUM_EDGES=2 specifies what new edges are licensed by a pair of existing edges.
Variables: NUM_EDGES – The number of existing edges that this rule uses to license new edges. Typically, this number ranges from zero to two.
-
apply
(chart, grammar, *edges)[source]¶ Return a generator that will add edges licensed by this rule and the given edges to the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.
Parameters: edges (list(EdgeI)) – A set of existing edges. The number of edges that should be passed to apply() is specified by the NUM_EDGES class variable.
Return type: iter(EdgeI)
-
class
nltk.parse.chart.
EdgeI
[source]¶ Bases:
object
A hypothesis about the structure of part of a sentence. Each edge records the fact that a structure is (partially) consistent with the sentence. An edge contains:
- A span, indicating what part of the sentence is consistent with the hypothesized structure.
- A left-hand side, specifying what kind of structure is hypothesized.
- A right-hand side, specifying the contents of the hypothesized structure.
- A dot position, indicating how much of the hypothesized structure is consistent with the sentence.
Every edge is either complete or incomplete:
- An edge is complete if its structure is fully consistent with the sentence.
- An edge is incomplete if its structure is partially consistent with the sentence. For every incomplete edge, the span specifies a possible prefix for the edge’s structure.
There are two kinds of edge:
- A TreeEdge records which trees have been found to be (partially) consistent with the text.
- A LeafEdge records the tokens occurring in the text.
The EdgeI interface provides a common interface to both types of edge, allowing chart parsers to treat them in a uniform manner.
-
dot
()[source]¶ Return this edge’s dot position, which indicates how much of the hypothesized structure is consistent with the sentence. In particular, self.rhs[:dot] is consistent with tokens[self.start():self.end()].
Return type: int
-
is_complete
()[source]¶ Return True if this edge’s structure is fully consistent with the text.
Return type: bool
-
is_incomplete
()[source]¶ Return True if this edge’s structure is partially consistent with the text.
Return type: bool
-
lhs
()[source]¶ Return this edge’s left-hand side, which specifies what kind of structure is hypothesized by this edge.
See: TreeEdge and LeafEdge for a description of the left-hand side values for each edge type.
-
nextsym
()[source]¶ Return the element of this edge’s right-hand side that immediately follows its dot.
Return type: Nonterminal or terminal or None
-
rhs
()[source]¶ Return this edge’s right-hand side, which specifies the content of the structure hypothesized by this edge.
See: TreeEdge and LeafEdge for a description of the right-hand side values for each edge type.
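The relationship between span, dot position, completeness, and the next symbol can be made concrete with a tiny edge class. `ToyTreeEdge` is a hypothetical model written for this illustration; its method names mirror the ones documented above, but it is not the library's class.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToyTreeEdge:
    span: Tuple[int, int]
    lhs: str
    rhs: Tuple[str, ...]
    dot: int = 0

    def is_complete(self) -> bool:
        # Complete once the dot has passed every right-hand-side symbol.
        return self.dot == len(self.rhs)

    def is_incomplete(self) -> bool:
        return not self.is_complete()

    def nextsym(self) -> Optional[str]:
        # The symbol immediately after the dot, or None for a complete edge.
        return None if self.is_complete() else self.rhs[self.dot]

    def move_dot_forward(self, new_end: int) -> "ToyTreeEdge":
        # Advance the dot by one symbol and replace the end index.
        return ToyTreeEdge((self.span[0], new_end), self.lhs, self.rhs, self.dot + 1)

    def __str__(self) -> str:
        symbols = list(self.rhs)
        symbols.insert(self.dot, "*")
        return f"[{self.lhs} -> {' '.join(symbols)}][{self.span[0]}:{self.span[1]}]"


edge = ToyTreeEdge((0, 0), "NP", ("Det", "N"))   # [NP -> * Det N][0:0]
after_det = edge.move_dot_forward(1)             # dot moves past Det
after_n = after_det.move_dot_forward(2)          # dot moves past N
```

Each call to `move_dot_forward` returns a new edge, so earlier hypotheses stay available in a chart alongside the extended ones.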
-
class
nltk.parse.chart.
EmptyPredictRule
[source]¶ Bases:
nltk.parse.chart.AbstractChartRule
A rule that inserts all empty productions as passive edges, in every position in the chart.
-
NUM_EDGES
= 0¶
-
apply
(chart, grammar)[source]¶ Return a generator that will add edges licensed by this rule and the given edges to the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.
Parameters: edges (list(EdgeI)) – A set of existing edges. The number of edges that should be passed to apply()
is specified by theNUM_EDGES
class variable.Return type: iter(EdgeI)
-
-
class
nltk.parse.chart.
FilteredBottomUpPredictCombineRule
[source]¶ Bases:
nltk.parse.chart.BottomUpPredictCombineRule
-
apply
(chart, grammar, edge)[source]¶ Return a generator that will add edges licensed by this rule and the given edges to the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.
Parameters: edges (list(EdgeI)) – A set of existing edges. The number of edges that should be passed to apply()
is specified by theNUM_EDGES
class variable.Return type: iter(EdgeI)
-
-
class
nltk.parse.chart.
FundamentalRule
[source]¶ Bases:
nltk.parse.chart.AbstractChartRule
A rule that joins two adjacent edges to form a single combined edge. In particular, this rule specifies that any pair of edges
[A -> alpha * B beta][i:j]
[B -> gamma *][j:k]
licenses the edge:
[A -> alpha B * beta][i:k]
-
NUM_EDGES
= 2¶
-
apply
(chart, grammar, left_edge, right_edge)[source]¶ Return a generator that will add edges licensed by this rule and the given edges to the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.
Parameters: edges (list(EdgeI)) – A set of existing edges. The number of edges that should be passed to apply() is specified by the NUM_EDGES class variable.
Return type: iter(EdgeI)
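A minimal sketch of the combination step, assuming edges are plain named tuples: an incomplete left edge that is waiting for B is joined with an adjacent complete edge for B, and the result covers both spans. `ToyEdge` and `fundamental` are hypothetical names used only for this illustration.

```python
from typing import NamedTuple, Optional, Tuple


class ToyEdge(NamedTuple):
    start: int
    end: int
    lhs: str
    rhs: Tuple[str, ...]
    dot: int

    def is_complete(self) -> bool:
        return self.dot == len(self.rhs)

    def nextsym(self) -> Optional[str]:
        return None if self.is_complete() else self.rhs[self.dot]


def fundamental(left: ToyEdge, right: ToyEdge) -> Optional[ToyEdge]:
    """Combine [A -> alpha * B beta][i:j] with [B -> gamma *][j:k]."""
    if (left.end == right.start
            and right.is_complete()
            and left.nextsym() == right.lhs):
        # Advance the dot over B and extend the span to the right edge's end.
        return left._replace(end=right.end, dot=left.dot + 1)
    return None  # the pair does not license a new edge


left = ToyEdge(0, 2, "S", ("NP", "VP"), 1)   # [S -> NP * VP][0:2]
right = ToyEdge(2, 3, "VP", ("V",), 1)       # [VP -> V *][2:3]
combined = fundamental(left, right)
```

The combined edge starts where the left edge starts and ends where the right edge ends, with the dot one symbol further along.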
-
class
nltk.parse.chart.
LeafEdge
(leaf, index)[source]¶ Bases:
nltk.parse.chart.EdgeI
An edge that records the fact that a leaf value is consistent with a word in the sentence. A leaf edge consists of:
- An index, indicating the position of the word.
- A leaf, specifying the word’s content.
A leaf edge’s left-hand side is its leaf value, and its right hand side is (). Its span is [index, index+1], and its dot position is 0.
-
dot
()[source]¶ Return this edge’s dot position, which indicates how much of the hypothesized structure is consistent with the sentence. In particular, self.rhs[:dot] is consistent with tokens[self.start():self.end()].
Return type: int
-
is_complete
()[source]¶ Return True if this edge’s structure is fully consistent with the text.
Return type: bool
-
is_incomplete
()[source]¶ Return True if this edge’s structure is partially consistent with the text.
Return type: bool
-
lhs
()[source]¶ Return this edge’s left-hand side, which specifies what kind of structure is hypothesized by this edge.
See: TreeEdge and LeafEdge for a description of the left-hand side values for each edge type.
-
nextsym
()[source]¶ Return the element of this edge’s right-hand side that immediately follows its dot.
Return type: Nonterminal or terminal or None
-
rhs
()[source]¶ Return this edge’s right-hand side, which specifies the content of the structure hypothesized by this edge.
See: TreeEdge and LeafEdge for a description of the right-hand side values for each edge type.
-
span
()[source]¶ Return a tuple (s, e), where tokens[s:e] is the portion of the sentence that is consistent with this edge’s structure.
Return type: tuple(int, int)
-
unicode_repr
()¶ Return repr(self).
-
class
nltk.parse.chart.
LeafInitRule
[source]¶ Bases:
nltk.parse.chart.AbstractChartRule
-
NUM_EDGES
= 0¶
-
apply
(chart, grammar)[source]¶ Return a generator that will add edges licensed by this rule and the given edges to the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.
Parameters: edges (list(EdgeI)) – A set of existing edges. The number of edges that should be passed to apply() is specified by the NUM_EDGES class variable.
Return type: iter(EdgeI)
-
-
class
nltk.parse.chart.
LeftCornerChartParser
(grammar, **parser_args)[source]¶ Bases:
nltk.parse.chart.ChartParser
-
class
nltk.parse.chart.
SingleEdgeFundamentalRule
[source]¶ Bases:
nltk.parse.chart.FundamentalRule
A rule that joins a given edge with adjacent edges in the chart, to form combined edges. In particular, this rule specifies that either of the edges:
[A -> alpha * B beta][i:j]
[B -> gamma *][j:k]
licenses the edge:
[A -> alpha B * beta][i:k]
if the other edge is already in the chart.
Note: This is basically FundamentalRule, with one edge left unspecified.
-
NUM_EDGES
= 1¶
-
apply
(chart, grammar, edge)[source]¶ Return a generator that will add edges licensed by this rule and the given edges to the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.
Parameters: edges (list(EdgeI)) – A set of existing edges. The number of edges that should be passed to apply() is specified by the NUM_EDGES class variable.
Return type: iter(EdgeI)
-
class
nltk.parse.chart.
SteppingChartParser
(grammar, strategy=[], trace=0)[source]¶ Bases:
nltk.parse.chart.ChartParser
A ChartParser that allows you to step through the parsing process, adding a single edge at a time. It also allows you to change the parser’s strategy or grammar midway through parsing a text.
The initialize method is used to start parsing a text. step adds a single edge to the chart. set_strategy changes the strategy used by the chart parser. parses returns the set of parses that has been found by the chart parser.
Variables: _restart – Records whether the parser’s strategy, grammar, or chart has been changed. If so, then step must restart the parsing algorithm.
-
parse
(tokens, tree_class=<class 'nltk.tree.Tree'>)[source]¶ Returns: An iterator that generates parse trees for the sentence. When possible, the trees are generated from most likely to least likely.
Parameters: tokens (list(str)) – The sentence to be parsed
Return type: iter(Tree)
-
parses
(tree_class=<class 'nltk.tree.Tree'>)[source]¶ Return the parse trees currently contained in the chart.
-
set_strategy
(strategy)[source]¶ Change the strategy that the parser uses to decide which edges to add to the chart.
Parameters: strategy (list(ChartRuleI)) – A list of rules that should be used to decide what edges to add to the chart.
-
step
()[source]¶ Return a generator that adds edges to the chart, one at a time. Each time the generator is resumed, it adds a single edge and yields that edge. If no more edges can be added, then it yields None.
If the parser’s strategy, grammar, or chart is changed, then the generator will continue adding edges using the new strategy, grammar, or chart.
Note that this generator never terminates, since the grammar or strategy might be changed to values that would add new edges. Instead, it yields None when no more edges can be added with the current strategy and grammar.
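The "never terminates, yields None when idle" behaviour can be sketched with an ordinary Python generator. `ToyStepper` and `join_adjacent` are hypothetical; the sketch re-reads the current strategy on every step, which is one simple way to pick up a strategy change mid-parse.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Set, Tuple

Edge = Tuple[int, int]
Rule = Callable[[Set[Edge]], Iterable[Edge]]


def join_adjacent(chart: Set[Edge]) -> List[Edge]:
    """Hypothetical rule: spans (i, j) and (j, k) license the span (i, k)."""
    return sorted((i, k) for (i, j) in chart for (j2, k) in chart if j == j2)


class ToyStepper:
    def __init__(self, initial: Iterable[Edge], strategy: List[Rule]):
        self.chart: Set[Edge] = set(initial)
        self.strategy: List[Rule] = list(strategy)

    def set_strategy(self, strategy: List[Rule]) -> None:
        self.strategy = list(strategy)

    def step(self) -> Iterator[Optional[Edge]]:
        """Yield one newly added edge per step; yield None when idle."""
        while True:
            for rule in list(self.strategy):
                new = next((e for e in rule(self.chart) if e not in self.chart), None)
                if new is not None:
                    self.chart.add(new)
                    yield new
                    break
            else:
                yield None  # nothing to add under the current strategy


stepper = ToyStepper([(0, 1), (1, 2)], strategy=[])
steps = stepper.step()
idle = next(steps)                     # no rules yet, so nothing can be added
stepper.set_strategy([join_adjacent])
added = next(steps)                    # the new strategy is picked up
idle_again = next(steps)
```

Because the generator never raises `StopIteration`, callers decide when to stop, typically on the first `None`.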
-
-
class
nltk.parse.chart.
TopDownChartParser
(grammar, **parser_args)[source]¶ Bases:
nltk.parse.chart.ChartParser
A ChartParser using a top-down parsing strategy. See ChartParser for more information.
-
class
nltk.parse.chart.
TopDownInitRule
[source]¶ Bases:
nltk.parse.chart.AbstractChartRule
A rule licensing edges corresponding to the grammar productions for the grammar’s start symbol. In particular, this rule specifies that [S -> * alpha][0:i] is licensed for each grammar production S -> alpha, where S is the grammar’s start symbol.
-
NUM_EDGES
= 0¶
-
apply
(chart, grammar)[source]¶ Return a generator that will add edges licensed by this rule and the given edges to the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.
Parameters: edges (list(EdgeI)) – A set of existing edges. The number of edges that should be passed to apply() is specified by the NUM_EDGES class variable.
Return type: iter(EdgeI)
-
-
class
nltk.parse.chart.
TopDownPredictRule
[source]¶ Bases:
nltk.parse.chart.AbstractChartRule
A rule licensing edges corresponding to the grammar productions for the nonterminal following an incomplete edge’s dot. In particular, this rule specifies that [A -> alpha * B beta][i:j] licenses the edge [B -> * gamma][j:j] for each grammar production B -> gamma.
Note: This rule corresponds to the Predictor Rule in Earley parsing.
-
NUM_EDGES
= 1¶
-
apply
(chart, grammar, edge)[source]¶ Return a generator that will add edges licensed by this rule and the given edges to the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.
Parameters: edges (list(EdgeI)) – A set of existing edges. The number of edges that should be passed to apply() is specified by the NUM_EDGES class variable.
Return type: iter(EdgeI)
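The top-down prediction step can be sketched in a few lines: look at the symbol after the dot of an incomplete edge and propose a fresh, zero-width edge at the edge's end for every production of that symbol. `Production`, `ToyEdge`, and `top_down_predict` are hypothetical names for this illustration.

```python
from typing import List, NamedTuple, Tuple


class Production(NamedTuple):
    """Hypothetical stand-in for a grammar production."""
    lhs: str
    rhs: Tuple[str, ...]


class ToyEdge(NamedTuple):
    start: int
    end: int
    lhs: str
    rhs: Tuple[str, ...]
    dot: int


def top_down_predict(productions: List[Production], edge: ToyEdge) -> List[ToyEdge]:
    """Given [A -> alpha * B beta][i:j], propose [B -> * gamma][j:j]."""
    if edge.dot == len(edge.rhs):
        return []  # complete edges expect nothing further
    wanted = edge.rhs[edge.dot]
    return [ToyEdge(edge.end, edge.end, prod.lhs, prod.rhs, 0)
            for prod in productions if prod.lhs == wanted]


productions = [
    Production("S", ("NP", "VP")),
    Production("VP", ("V", "NP")),
    Production("VP", ("V",)),
]
s_edge = ToyEdge(0, 2, "S", ("NP", "VP"), 1)   # [S -> NP * VP][0:2]
predictions = top_down_predict(productions, s_edge)
```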
-
-
class
nltk.parse.chart.
TreeEdge
(span, lhs, rhs, dot=0)[source]¶ Bases:
nltk.parse.chart.EdgeI
An edge that records the fact that a tree is (partially) consistent with the sentence. A tree edge consists of:
- A span, indicating what part of the sentence is consistent with the hypothesized tree.
- A left-hand side, specifying the hypothesized tree’s node value.
- A right-hand side, specifying the hypothesized tree’s children. Each element of the right-hand side is either a terminal, specifying a token with that terminal as its leaf value; or a nonterminal, specifying a subtree with that nonterminal’s symbol as its node value.
- A dot position, indicating which children are consistent with part of the sentence. In particular, if dot is the dot position, rhs is the right-hand side, (start,end) is the span, and tokens is the list of tokens in the sentence, then tokens[start:end] can be spanned by the children specified by rhs[:dot].
For more information about edges, see the EdgeI interface.
-
dot
()[source]¶ Return this edge’s dot position, which indicates how much of the hypothesized structure is consistent with the sentence. In particular, self.rhs[:dot] is consistent with tokens[self.start():self.end()].
Return type: int
-
static
from_production
(production, index)[source]¶ Return a new TreeEdge formed from the given production. The new edge’s left-hand side and right-hand side will be taken from production; its span will be (index,index); and its dot position will be 0.
Return type: TreeEdge
-
is_complete
()[source]¶ Return True if this edge’s structure is fully consistent with the text.
Return type: bool
-
is_incomplete
()[source]¶ Return True if this edge’s structure is partially consistent with the text.
Return type: bool
-
lhs
()[source]¶ Return this edge’s left-hand side, which specifies what kind of structure is hypothesized by this edge.
See: TreeEdge and LeafEdge for a description of the left-hand side values for each edge type.
-
move_dot_forward
(new_end)[source]¶ Return a new TreeEdge formed from this edge. The new edge’s dot position is increased by 1, and its end index will be replaced by new_end.
Parameters: new_end (int) – The new end index.
Return type: TreeEdge
-
nextsym
()[source]¶ Return the element of this edge’s right-hand side that immediately follows its dot.
Return type: Nonterminal or terminal or None
-
rhs
()[source]¶ Return this edge’s right-hand side, which specifies the content of the structure hypothesized by this edge.
See: TreeEdge and LeafEdge for a description of the right-hand side values for each edge type.
-
span
()[source]¶ Return a tuple (s, e), where tokens[s:e] is the portion of the sentence that is consistent with this edge’s structure.
Return type: tuple(int, int)
-
unicode_repr
()¶ Return repr(self).
nltk.parse.corenlp module¶
-
class
nltk.parse.corenlp.
CoreNLPDependencyParser
(url='http://localhost:9000', encoding='utf8', tagtype=None)[source]¶ Bases:
nltk.parse.corenlp.GenericCoreNLPParser
Dependency parser.
>>> dep_parser = CoreNLPDependencyParser(url='http://localhost:9000')
>>> parse, = dep_parser.raw_parse(
...     'The quick brown fox jumps over the lazy dog.'
... )
>>> print(parse.to_conll(4))
The     DT      4       det
quick   JJ      4       amod
brown   JJ      4       amod
fox     NN      5       nsubj
jumps   VBZ     0       ROOT
over    IN      9       case
the     DT      9       det
lazy    JJ      9       amod
dog     NN      5       nmod
.       .       5       punct
>>> print(parse.tree())
(jumps (fox The quick brown) (dog over the lazy) .)
>>> for governor, dep, dependent in parse.triples():
...     print(governor, dep, dependent)
('jumps', 'VBZ') nsubj ('fox', 'NN')
('fox', 'NN') det ('The', 'DT')
('fox', 'NN') amod ('quick', 'JJ')
('fox', 'NN') amod ('brown', 'JJ')
('jumps', 'VBZ') nmod ('dog', 'NN')
('dog', 'NN') case ('over', 'IN')
('dog', 'NN') det ('the', 'DT')
('dog', 'NN') amod ('lazy', 'JJ')
('jumps', 'VBZ') punct ('.', '.')
>>> (parse_fox, ), (parse_dog, ) = dep_parser.raw_parse_sents(
...     [
...         'The quick brown fox jumps over the lazy dog.',
...         'The quick grey wolf jumps over the lazy fox.',
...     ]
... )
>>> print(parse_fox.to_conll(4))
The     DT      4       det
quick   JJ      4       amod
brown   JJ      4       amod
fox     NN      5       nsubj
jumps   VBZ     0       ROOT
over    IN      9       case
the     DT      9       det
lazy    JJ      9       amod
dog     NN      5       nmod
.       .       5       punct
>>> print(parse_dog.to_conll(4))
The     DT      4       det
quick   JJ      4       amod
grey    JJ      4       amod
wolf    NN      5       nsubj
jumps   VBZ     0       ROOT
over    IN      9       case
the     DT      9       det
lazy    JJ      9       amod
fox     NN      5       nmod
.       .       5       punct
>>> (parse_dog, ), (parse_friends, ) = dep_parser.parse_sents(
...     [
...         "I 'm a dog".split(),
...         "This is my friends ' cat ( the tabby )".split(),
...     ]
... )
>>> print(parse_dog.to_conll(4))
I       PRP     4       nsubj
'm      VBP     4       cop
a       DT      4       det
dog     NN      0       ROOT
>>> print(parse_friends.to_conll(4))
This    DT      6       nsubj
is      VBZ     6       cop
my      PRP$    4       nmod:poss
friends NNS     6       nmod:poss
'       POS     4       case
cat     NN      0       ROOT
-LRB-   -LRB-   9       punct
the     DT      9       det
tabby   NN      6       appos
-RRB-   -RRB-   9       punct
>>> parse_john, parse_mary, = dep_parser.parse_text(
...     'John loves Mary. Mary walks.'
... )
>>> print(parse_john.to_conll(4))
John    NNP     2       nsubj
loves   VBZ     0       ROOT
Mary    NNP     2       dobj
.       .       2       punct
>>> print(parse_mary.to_conll(4))
Mary    NNP     2       nsubj
walks   VBZ     0       ROOT
.       .       2       punct
Non-breaking space inside of a token.
>>> len(
...     next(
...         dep_parser.raw_parse(
...             'Anhalt said children typically treat a 20-ounce soda bottle as one '
...             'serving, while it actually contains 2 1/2 servings.'
...         )
...     ).nodes
... )
21
Phone numbers.
>>> len(
...     next(
...         dep_parser.raw_parse('This is not going to crash: 01 111 555.')
...     ).nodes
... )
10
>>> print(
...     next(
...         dep_parser.raw_parse('The underscore _ should not simply disappear.')
...     ).to_conll(4)
... )
The         DT      3       det
underscore  VBP     3       amod
_           NN      7       nsubj
should      MD      7       aux
not         RB      7       neg
simply      RB      7       advmod
disappear   VB      0       ROOT
.           .       7       punct
>>> print(
...     '\n'.join(
...         next(
...             dep_parser.raw_parse(
...                 'for all of its insights into the dream world of teen life , and its electronic expression through '
...                 'cyber culture , the film gives no quarter to anyone seeking to pull a cohesive story out of its 2 '
...                 '1/2-hour running time .'
...             )
...         ).to_conll(4).split('\n')[-8:]
...     )
... )
its     PRP$    40      nmod:poss
2 1/2   CD      40      nummod
-       :       40      punct
hour    NN      31      nmod
running VBG     42      amod
time    NN      40      dep
.       .       24      punct
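The four-column rows printed by to_conll(4) above (word, tag, head index, relation) carry the same information as the governor/dependent triples. The sketch below converts one form to the other for the "John loves Mary." output; `conll4_to_triples` is a hypothetical helper, it assumes tab-separated columns, and it emits triples in row order, which may differ from the traversal order of parse.triples().

```python
from typing import List, Tuple

Node = Tuple[str, str]            # (word, tag)
Triple = Tuple[Node, str, Node]   # (governor, relation, dependent)


def conll4_to_triples(text: str) -> List[Triple]:
    """Turn 4-column CoNLL rows into (governor, relation, dependent) triples."""
    # Assumption: columns are separated by tabs, so tokens may contain spaces.
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    triples: List[Triple] = []
    for word, tag, head, rel in rows:
        head_index = int(head)
        if head_index == 0:       # head 0 marks the root, which has no governor
            continue
        gov_word, gov_tag = rows[head_index - 1][:2]   # heads are 1-based
        triples.append(((gov_word, gov_tag), rel, (word, tag)))
    return triples


sample = (
    "John\tNNP\t2\tnsubj\n"
    "loves\tVBZ\t0\tROOT\n"
    "Mary\tNNP\t2\tdobj\n"
    ".\t.\t2\tpunct\n"
)
triples = conll4_to_triples(sample)
```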
-
parser_annotator
= 'depparse'¶
-
-
class
nltk.parse.corenlp.
CoreNLPParser
(url='http://localhost:9000', encoding='utf8', tagtype=None)[source]¶ Bases:
nltk.parse.corenlp.GenericCoreNLPParser
>>> parser = CoreNLPParser(url='http://localhost:9000')
>>> next( ... parser.raw_parse('The quick brown fox jumps over the lazy dog.') ... ).pretty_print() ROOT | S _______________|__________________________ | VP | | _________|___ | | | PP | | | ________|___ | NP | | NP | ____|__________ | | _______|____ | DT JJ JJ NN VBZ IN DT JJ NN . | | | | | | | | | | The quick brown fox jumps over the lazy dog .
>>> (parse_fox, ), (parse_wolf, ) = parser.raw_parse_sents( ... [ ... 'The quick brown fox jumps over the lazy dog.', ... 'The quick grey wolf jumps over the lazy fox.', ... ] ... )
>>> parse_fox.pretty_print() ROOT | S _______________|__________________________ | VP | | _________|___ | | | PP | | | ________|___ | NP | | NP | ____|__________ | | _______|____ | DT JJ JJ NN VBZ IN DT JJ NN . | | | | | | | | | | The quick brown fox jumps over the lazy dog .
>>> parse_wolf.pretty_print() ROOT | S _______________|__________________________ | VP | | _________|___ | | | PP | | | ________|___ | NP | | NP | ____|_________ | | _______|____ | DT JJ JJ NN VBZ IN DT JJ NN . | | | | | | | | | | The quick grey wolf jumps over the lazy fox .
>>> (parse_dog, ), (parse_friends, ) = parser.parse_sents( ... [ ... "I 'm a dog".split(), ... "This is my friends ' cat ( the tabby )".split(), ... ] ... )
>>> parse_dog.pretty_print() ROOT | S _______|____ | VP | ________|___ NP | NP | | ___|___ PRP VBP DT NN | | | | I 'm a dog
>>> parse_friends.pretty_print() ROOT | S ____|___________ | VP | ___________|_____________ | | NP | | _______|_________ | | NP PRN | | _____|_______ ____|______________ NP | NP | | NP | | | ______|_________ | | ___|____ | DT VBZ PRP$ NNS POS NN -LRB- DT NN -RRB- | | | | | | | | | | This is my friends ' cat -LRB- the tabby -RRB-
>>> parse_john, parse_mary, = parser.parse_text( ... 'John loves Mary. Mary walks.' ... )
>>> parse_john.pretty_print() ROOT | S _____|_____________ | VP | | ____|___ | NP | NP | | | | | NNP VBZ NNP . | | | | John loves Mary .
>>> parse_mary.pretty_print() ROOT | S _____|____ NP VP | | | | NNP VBZ . | | | Mary walks .
>>> next( ... parser.raw_parse( ... 'NASIRIYA, Iraq—Iraqi doctors who treated former prisoner of war ' ... 'Jessica Lynch have angrily dismissed claims made in her biography ' ... 'that she was raped by her Iraqi captors.' ... ) ... ).height() 20
>>> next( ... parser.raw_parse( ... "The broader Standard & Poor's 500 Index <.SPX> was 0.46 points lower, or " ... '0.05 percent, at 997.02.' ... ) ... ).height() 9
-
parser_annotator
= 'parse'¶
-
-
class
nltk.parse.corenlp.
CoreNLPServer
(path_to_jar=None, path_to_models_jar=None, verbose=False, java_options=None, corenlp_options=None, port=None)[source]¶ Bases:
object
-
exception
nltk.parse.corenlp.
CoreNLPServerError
[source]¶ Bases:
OSError
Exceptions associated with the Core NLP server.
-
class
nltk.parse.corenlp.
GenericCoreNLPParser
(url='http://localhost:9000', encoding='utf8', tagtype=None)[source]¶ Bases:
nltk.parse.api.ParserI
,nltk.tokenize.api.TokenizerI
,nltk.tag.api.TaggerI
Interface to the CoreNLP Parser.
-
parse_sents
(sentences, *args, **kwargs)[source]¶ Parse multiple sentences.
Takes multiple sentences as a list where each sentence is a list of words. Each sentence will be automatically tagged with this CoreNLPParser instance’s tagger.
If a whitespace exists inside a token, then the token will be treated as several tokens.
Parameters: sentences (list(list(str))) – Input sentences to parse Return type: iter(iter(Tree))
-
parse_text
(text, *args, **kwargs)[source]¶ Parse a piece of text.
The text might contain several sentences which will be split by CoreNLP.
Parameters: text (str) – text to be split. Returns: an iterable of syntactic structures. # TODO: should it be an iterable of iterables?
-
raw_parse
(sentence, properties=None, *args, **kwargs)[source]¶ Parse a sentence.
Takes a sentence as a string; before parsing, it will be automatically tokenized and tagged by the CoreNLP Parser.
Parameters: sentence (str) – Input sentence to parse Return type: iter(Tree)
-
raw_parse_sents
(sentences, verbose=False, properties=None, *args, **kwargs)[source]¶ Parse multiple sentences.
Takes multiple sentences as a list of strings. Each sentence will be automatically tokenized and tagged.
Parameters: sentences (list(str)) – Input sentences to parse. Return type: iter(iter(Tree))
-
raw_tag_sents
(sentences)[source]¶ Tag multiple sentences.
Takes multiple sentences as a list where each sentence is a string.
Parameters: sentences (list(str)) – Input sentences to tag Return type: list(list(list(tuple(str, str))))
-
tag
(sentence)[source]¶ Tag a list of tokens.
Return type: list(tuple(str, str))
>>> parser = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
>>> tokens = 'Rami Eid is studying at Stony Brook University in NY'.split()
>>> parser.tag(tokens)
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'),
('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]
>>> parser = CoreNLPParser(url='http://localhost:9000', tagtype='pos')
>>> tokens = "What is the airspeed of an unladen swallow ?".split()
>>> parser.tag(tokens)
[('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'),
('unladen', 'JJ'), ('swallow', 'VB'), ('?', '.')]
-
tag_sents
(sentences)[source]¶ Tag multiple sentences.
Takes multiple sentences as a list where each sentence is a list of tokens.
Parameters: sentences (list(list(str))) – Input sentences to tag Return type: list(list(tuple(str, str)))
-
tokenize
(text, properties=None)[source]¶ Tokenize a string of text.
>>> parser = CoreNLPParser(url='http://localhost:9000')
>>> text = 'Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\nThanks.'
>>> list(parser.tokenize(text))
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
>>> s = "The colour of the wall is blue."
>>> list(
...     parser.tokenize(
...         'The colour of the wall is blue.',
...         properties={'tokenize.options': 'americanize=true'},
...     )
... )
['The', 'color', 'of', 'the', 'wall', 'is', 'blue', '.']
-
nltk.parse.dependencygraph module¶
Tools for reading and writing dependency trees. The input is assumed to be in Malt-TAB format (http://stp.lingfil.uu.se/~nivre/research/MaltXML.html).
-
class
nltk.parse.dependencygraph.
DependencyGraph
(tree_str=None, cell_extractor=None, zero_based=False, cell_separator=None, top_relation_label='ROOT')[source]¶ Bases:
object
A container for the nodes and labelled edges of a dependency structure.
-
add_arc
(head_address, mod_address)[source]¶ Adds an arc from the node specified by head_address to the node specified by mod_address.
-
connect_graph
()[source]¶ Fully connects all non-root nodes. All nodes are set to be dependents of the root node.
-
contains_address
(node_address)[source]¶ Returns true if the graph contains a node with the given node address, false otherwise.
-
contains_cycle
()[source]¶ Check whether there are cycles.
>>> dg = DependencyGraph(treebank_data)
>>> dg.contains_cycle()
False
>>> cyclic_dg = DependencyGraph()
>>> top = {'word': None, 'deps': [1], 'rel': 'TOP', 'address': 0}
>>> child1 = {'word': None, 'deps': [2], 'rel': 'NTOP', 'address': 1}
>>> child2 = {'word': None, 'deps': [4], 'rel': 'NTOP', 'address': 2}
>>> child3 = {'word': None, 'deps': [1], 'rel': 'NTOP', 'address': 3}
>>> child4 = {'word': None, 'deps': [3], 'rel': 'NTOP', 'address': 4}
>>> cyclic_dg.nodes = {
...     0: top,
...     1: child1,
...     2: child2,
...     3: child3,
...     4: child4,
... }
>>> cyclic_dg.root = top
>>> cyclic_dg.contains_cycle()
[3, 1, 2, 4]
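The check above can be illustrated with a small standalone sketch. It is not NLTK's implementation: it assumes the graph is given as a plain mapping from a node address to the list of its dependents (the hypothetical `deps` argument) and uses a depth-first search that reports the first cycle it meets, so the order of the returned addresses may differ from what `contains_cycle()` prints.

```python
def find_cycle(deps):
    """Return one cycle as a list of addresses, or None if there is none.

    `deps` maps a node address to the list of its dependents
    (an assumed, simplified stand-in for DependencyGraph.nodes).
    """
    UNSEEN, ACTIVE, DONE = 0, 1, 2
    state = {node: UNSEEN for node in deps}
    path = []  # addresses on the current depth-first path

    def visit(node):
        state[node] = ACTIVE
        path.append(node)
        for child in deps.get(node, []):
            if state.get(child, UNSEEN) == ACTIVE:
                # The child is already on the path: the cycle is the path tail.
                return path[path.index(child):]
            if state.get(child, UNSEEN) == UNSEEN:
                found = visit(child)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in list(deps):
        if state[node] == UNSEEN:
            found = visit(node)
            if found:
                return found
    return None


cyclic = {0: [1], 1: [2], 2: [4], 3: [1], 4: [3]}
acyclic = {0: [2], 1: [], 2: [1, 3], 3: []}
print(find_cycle(cyclic))   # [1, 2, 4, 3]
print(find_cycle(acyclic))  # None
```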
-
left_children
(node_index)[source]¶ Returns the number of left children under the node specified by the given address.
-
static
load
(filename, zero_based=False, cell_separator=None, top_relation_label='ROOT')[source]¶ Parameters: - filename – a name of a file in Malt-TAB format
- zero_based – nodes in the input file are numbered starting from 0 rather than 1 (as produced by, e.g., zpar)
- cell_separator (str) – the cell separator. If not provided, cells are split by whitespace.
- top_relation_label (str) – the label by which the top relation is identified, for example, ROOT, null or TOP.
Returns: a list of DependencyGraphs
-
redirect_arcs
(originals, redirect)[source]¶ Redirects arcs to any of the nodes in the originals list to the redirect node address.
-
remove_by_address
(address)[source]¶ Removes the node with the given address. References to this node in others will still exist.
-
right_children
(node_index)[source]¶ Returns the number of right children under the node specified by the given address.
-
to_conll
(style)[source]¶ The dependency graph in CoNLL format.
Parameters: style (int) – the style to use for the format (3, 4, 10 columns) Return type: str
-
to_dot
()[source]¶ Return a dot representation suitable for using with Graphviz.
>>> dg = DependencyGraph(
...     'John N 2\n'
...     'loves V 0\n'
...     'Mary N 2'
... )
>>> print(dg.to_dot())
digraph G{
edge [dir=forward]
node [shape=plaintext]
0 [label="0 (None)"]
0 -> 2 [label="ROOT"]
1 [label="1 (John)"]
2 [label="2 (loves)"]
2 -> 1 [label=""]
2 -> 3 [label=""]
3 [label="3 (Mary)"]
}
-
tree
()[source]¶ Starting with the root node, build a dependency tree using the NLTK Tree constructor. Dependency labels are omitted.
-
triples
(node=None)[source]¶ Extract dependency triples of the form: ((head word, head tag), rel, (dep word, dep tag))
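To make the triple format concrete, here is a minimal sketch that builds the same kind of triples from plain rows. It is an illustration only, not the library method: the row layout `(word, tag, head, rel)` with 1-based head indices and the name `triples_sketch` are assumptions, and the order of the results need not match the traversal order used by `triples()`.

```python
def triples_sketch(rows):
    """Yield ((head word, head tag), rel, (dep word, dep tag)) triples.

    `rows` is a list of (word, tag, head, rel) tuples; `head` is the
    1-based index of the governing row, or 0 for the root (assumed layout).
    """
    for word, tag, head, rel in rows:
        if head == 0:
            continue  # the root word has no governing word to pair with
        head_word, head_tag = rows[head - 1][:2]
        yield (head_word, head_tag), rel, (word, tag)


rows = [
    ("John", "NNP", 2, "nsubj"),
    ("loves", "VBZ", 0, "ROOT"),
    ("Mary", "NNP", 2, "dobj"),
]
for triple in triples_sketch(rows):
    print(triple)
# (('loves', 'VBZ'), 'nsubj', ('John', 'NNP'))
# (('loves', 'VBZ'), 'dobj', ('Mary', 'NNP'))
```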
-
unicode_repr
()¶ Return repr(self).
-
-
exception
nltk.parse.dependencygraph.
DependencyGraphError
[source]¶ Bases:
Exception
Dependency graph exception.
nltk.parse.earleychart module¶
Data classes and parser implementations for incremental chart parsers, which use dynamic programming to efficiently parse a text. A “chart parser” derives parse trees for a text by iteratively adding “edges” to a “chart”. Each “edge” represents a hypothesis about the tree structure for a subsequence of the text. The “chart” is a “blackboard” for composing and combining these hypotheses.
A parser is “incremental”, if it guarantees that for all i, j where i < j, all edges ending at i are built before any edges ending at j. This is appealing for, say, speech recognizer hypothesis filtering.
The main parser class is EarleyChartParser, which is a top-down algorithm, originally formulated by Jay Earley (1970).
-
class
nltk.parse.earleychart.
CompleterRule
[source]¶ Bases:
nltk.parse.earleychart.CompleteFundamentalRule
-
apply
(chart, grammar, edge)[source]¶ Return a generator that will add edges licensed by this rule and the given edges to the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.
Parameters: edges (list(EdgeI)) – A set of existing edges. The number of edges that should be passed to apply() is specified by the NUM_EDGES class variable.
Return type: iter(EdgeI)
-
-
class
nltk.parse.earleychart.
FeatureCompleteFundamentalRule
[source]¶ Bases:
nltk.parse.featurechart.FeatureSingleEdgeFundamentalRule
-
class
nltk.parse.earleychart.
FeatureIncrementalBottomUpLeftCornerChartParser
(grammar, **parser_args)[source]¶
-
class
nltk.parse.earleychart.
FeatureIncrementalChart
(tokens)[source]¶ Bases:
nltk.parse.earleychart.IncrementalChart
,nltk.parse.featurechart.FeatureChart
-
class
nltk.parse.earleychart.
FeatureIncrementalChartParser
(grammar, strategy=[<nltk.parse.chart.LeafInitRule object>, <nltk.parse.featurechart.FeatureEmptyPredictRule object>, <nltk.parse.featurechart.FeatureBottomUpPredictCombineRule object>, <nltk.parse.earleychart.FeatureCompleteFundamentalRule object>], trace_chart_width=20, chart_class=<class 'nltk.parse.earleychart.FeatureIncrementalChart'>, **parser_args)[source]¶ Bases:
nltk.parse.earleychart.IncrementalChartParser
,nltk.parse.featurechart.FeatureChartParser
-
class
nltk.parse.earleychart.
FilteredCompleteFundamentalRule
[source]¶ Bases:
nltk.parse.chart.FilteredSingleEdgeFundamentalRule
-
apply
(chart, grammar, edge)[source]¶ Return a generator that will add edges licensed by this rule and the given edges to the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.
Parameters: edges (list(EdgeI)) – A set of existing edges. The number of edges that should be passed to apply() is specified by the NUM_EDGES class variable.
Return type: iter(EdgeI)
-
-
class
nltk.parse.earleychart.
IncrementalBottomUpLeftCornerChartParser
(grammar, **parser_args)[source]¶
-
class
nltk.parse.earleychart.
IncrementalChart
(tokens)[source]¶ Bases:
nltk.parse.chart.Chart
-
edges
()[source]¶ Return a list of all edges in this chart. New edges that are added to the chart after the call to edges() will not be contained in this list.
Return type: list(EdgeI) See: iteredges, select
-
iteredges
()[source]¶ Return an iterator over the edges in this chart. It is not guaranteed that new edges which are added to the chart before the iterator is exhausted will also be generated.
Return type: iter(EdgeI) See: edges, select
-
select
(end, **restrictions)[source]¶ Return an iterator over the edges in this chart. Any new edges that are added to the chart before the iterator is exhausted will also be generated. restrictions can be used to restrict the set of edges that will be generated.
Parameters: - span – Only generate edges e where e.span()==span
- start – Only generate edges e where e.start()==start
- end – Only generate edges e where e.end()==end
- length – Only generate edges e where e.length()==length
- lhs – Only generate edges e where e.lhs()==lhs
- rhs – Only generate edges e where e.rhs()==rhs
- nextsym – Only generate edges e where e.nextsym()==nextsym
- dot – Only generate edges e where e.dot()==dot
- is_complete – Only generate edges e where e.is_complete()==is_complete
- is_incomplete – Only generate edges e where e.is_incomplete()==is_incomplete
Return type: iter(EdgeI)
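The restriction keywords can be read as "call the accessor of that name on each edge and compare the result". The sketch below shows that reading with a toy `Edge` class and a hypothetical `select_sketch` function; both are assumptions for illustration, and the real chart may answer such queries differently (for example through indexes) rather than by a linear scan.

```python
class Edge:
    """Toy edge exposing a few of the accessor methods named above."""

    def __init__(self, start, end, lhs, complete):
        self._start, self._end = start, end
        self._lhs, self._complete = lhs, complete

    def start(self):
        return self._start

    def end(self):
        return self._end

    def lhs(self):
        return self._lhs

    def is_complete(self):
        return self._complete


def select_sketch(edges, **restrictions):
    """Yield the edges whose accessor results equal every given restriction."""
    for edge in edges:
        if all(getattr(edge, name)() == wanted
               for name, wanted in restrictions.items()):
            yield edge


edges = [Edge(0, 1, "NP", True), Edge(0, 2, "S", False), Edge(1, 2, "VP", True)]
matches = [e.lhs() for e in select_sketch(edges, end=2, is_complete=True)]
print(matches)  # ['VP']
```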
-
-
class
nltk.parse.earleychart.
IncrementalChartParser
(grammar, strategy=[<nltk.parse.chart.LeafInitRule object>, <nltk.parse.chart.EmptyPredictRule object>, <nltk.parse.chart.BottomUpPredictCombineRule object>, <nltk.parse.earleychart.CompleteFundamentalRule object>], trace=0, trace_chart_width=50, chart_class=<class 'nltk.parse.earleychart.IncrementalChart'>)[source]¶ Bases:
nltk.parse.chart.ChartParser
An incremental chart parser implementing Jay Earley’s parsing algorithm:
For each index end in [0, 1, …, N]:
    For each edge such that edge.end = end:
        If edge is incomplete and edge.next is not a part of speech:
            Apply PredictorRule to edge
        If edge is incomplete and edge.next is a part of speech:
            Apply ScannerRule to edge
        If edge is complete:
            Apply CompleterRule to edge
Return any complete parses in the chart
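The loop above can be sketched as a compact recognizer in plain Python. This is a simplified illustration under stated assumptions, not the code behind IncrementalChartParser: the toy `GRAMMAR` and `LEXICON`, the `State` tuple, and the function name `recognize` are all invented for the example, the grammar has no empty productions, and the function only answers whether a parse exists instead of building trees.

```python
from collections import namedtuple

# A dotted rule lhs -> rhs, with the dot before rhs[dot], begun at index `start`.
State = namedtuple("State", "lhs rhs dot start")

GRAMMAR = {  # toy grammar (assumption)
    "S": [("NP", "VP")],
    "NP": [("Det", "N"), ("Name",)],
    "VP": [("V", "NP"), ("V",)],
}
LEXICON = {  # part of speech -> words (assumption)
    "Det": {"the"}, "N": {"dog"}, "Name": {"John"}, "V": {"sees", "walks"},
}


def recognize(tokens, start_symbol="S"):
    chart = [[] for _ in range(len(tokens) + 1)]

    def add(end, state):
        if state not in chart[end]:
            chart[end].append(state)

    for rhs in GRAMMAR[start_symbol]:
        add(0, State(start_symbol, rhs, 0, 0))

    for end in range(len(tokens) + 1):
        for state in chart[end]:  # the list may grow while we iterate
            if state.dot < len(state.rhs):
                nxt = state.rhs[state.dot]
                if nxt in LEXICON:
                    # Scanner: consume the next token if it has this part of speech.
                    if end < len(tokens) and tokens[end] in LEXICON[nxt]:
                        add(end + 1, state._replace(dot=state.dot + 1))
                else:
                    # Predictor: hypothesise every expansion of the next symbol.
                    for rhs in GRAMMAR.get(nxt, []):
                        add(end, State(nxt, rhs, 0, end))
            else:
                # Completer: advance the states that were waiting for this lhs.
                for waiting in chart[state.start]:
                    if (waiting.dot < len(waiting.rhs)
                            and waiting.rhs[waiting.dot] == state.lhs):
                        add(end, waiting._replace(dot=waiting.dot + 1))

    return any(s.lhs == start_symbol and s.dot == len(s.rhs) and s.start == 0
               for s in chart[-1])


print(recognize("John sees the dog".split()))  # True
print(recognize("sees John".split()))          # False
```

Because states ending at index `end` are only created while processing indices up to `end`, the sketch also respects the incremental ordering described at the top of this module.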
-
class
nltk.parse.earleychart.
ScannerRule
[source]¶ Bases:
nltk.parse.earleychart.CompleteFundamentalRule
-
apply
(chart, grammar, edge)[source]¶ Return a generator that will add edges licensed by this rule and the given edges to the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.
Parameters: edges (list(EdgeI)) – A set of existing edges. The number of edges that should be passed to apply() is specified by the NUM_EDGES class variable.
Return type: iter(EdgeI)
-
nltk.parse.evaluate module¶
-
class
nltk.parse.evaluate.
DependencyEvaluator
(parsed_sents, gold_sents)[source]¶ Bases:
object
Class for measuring labelled and unlabelled attachment score for dependency parsing. Note that the evaluation ignores punctuation.
>>> from nltk.parse import DependencyGraph, DependencyEvaluator
>>> gold_sent = DependencyGraph("""
... Pierre NNP 2 NMOD
... Vinken NNP 8 SUB
... , , 2 P
... 61 CD 5 NMOD
... years NNS 6 AMOD
... old JJ 2 NMOD
... , , 2 P
... will MD 0 ROOT
... join VB 8 VC
... the DT 11 NMOD
... board NN 9 OBJ
... as IN 9 VMOD
... a DT 15 NMOD
... nonexecutive JJ 15 NMOD
... director NN 12 PMOD
... Nov. NNP 9 VMOD
... 29 CD 16 NMOD
... . . 9 VMOD
... """)
>>> parsed_sent = DependencyGraph("""
... Pierre NNP 8 NMOD
... Vinken NNP 1 SUB
... , , 3 P
... 61 CD 6 NMOD
... years NNS 6 AMOD
... old JJ 2 NMOD
... , , 3 AMOD
... will MD 0 ROOT
... join VB 8 VC
... the DT 11 AMOD
... board NN 9 OBJECT
... as IN 9 NMOD
... a DT 15 NMOD
... nonexecutive JJ 15 NMOD
... director NN 12 PMOD
... Nov. NNP 9 VMOD
... 29 CD 16 NMOD
... . . 9 VMOD
... """)
>>> de = DependencyEvaluator([parsed_sent],[gold_sent])
>>> las, uas = de.eval()
>>> las
0.6...
>>> uas
0.8...
>>> abs(uas - 0.8) < 0.00001
True
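For readers unfamiliar with the two metrics, the following sketch computes them on a tiny hand-made sentence. It is a simplified illustration, not the DependencyEvaluator code: the `(word, head, rel)` input layout and the function name are assumptions, and a token is treated as punctuation when all of its characters are ASCII punctuation, which may differ from the library's own rule. It also assumes at least one non-punctuation token.

```python
import string


def attachment_scores(parsed, gold):
    """Return (LAS, UAS) for two aligned lists of (word, head, rel) tuples."""
    total = correct_heads = correct_labelled = 0
    for (word, p_head, p_rel), (_, g_head, g_rel) in zip(parsed, gold):
        if all(ch in string.punctuation for ch in word):
            continue  # punctuation is ignored by the evaluation
        total += 1
        if p_head == g_head:
            correct_heads += 1          # counts towards UAS
            if p_rel == g_rel:
                correct_labelled += 1   # head and label right: counts towards LAS
    return correct_labelled / total, correct_heads / total


gold = [("John", 2, "nsubj"), ("loves", 0, "ROOT"), ("Mary", 2, "dobj"), (".", 2, "punct")]
parsed = [("John", 2, "nsubj"), ("loves", 0, "ROOT"), ("Mary", 2, "nsubj"), (".", 1, "punct")]
las, uas = attachment_scores(parsed, gold)
print(round(las, 3), uas)  # 0.667 1.0
```

The wrong head on the final full stop does not lower either score because punctuation is skipped, while the wrong label on "Mary" lowers only the labelled score.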
nltk.parse.featurechart module¶
Extension of chart parsing implementation to handle grammars with feature structures as nodes.
-
class
nltk.parse.featurechart.
FeatureBottomUpPredictCombineRule
[source]¶ Bases:
nltk.parse.chart.BottomUpPredictCombineRule
-
apply
(chart, grammar, edge)[source]¶ Return a generator that will add edges licensed by this rule and the given edges to the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.
Parameters: edges (list(EdgeI)) – A set of existing edges. The number of edges that should be passed to apply() is specified by the NUM_EDGES class variable.
Return type: iter(EdgeI)
-
-
class
nltk.parse.featurechart.
FeatureBottomUpPredictRule
[source]¶ Bases:
nltk.parse.chart.BottomUpPredictRule
-
apply
(chart, grammar, edge)[source]¶ Return a generator that will add edges licensed by this rule and the given edges to the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.
Parameters: edges (list(EdgeI)) – A set of existing edges. The number of edges that should be passed to apply() is specified by the NUM_EDGES class variable.
Return type: iter(EdgeI)
-
-
class
nltk.parse.featurechart.
FeatureChart
(tokens)[source]¶ Bases:
nltk.parse.chart.Chart
A Chart for feature grammars. See Chart for more information.
-
class
nltk.parse.featurechart.
FeatureChartParser
(grammar, strategy=[<nltk.parse.chart.LeafInitRule object>, <nltk.parse.featurechart.FeatureEmptyPredictRule object>, <nltk.parse.featurechart.FeatureBottomUpPredictCombineRule object>, <nltk.parse.featurechart.FeatureSingleEdgeFundamentalRule object>], trace_chart_width=20, chart_class=<class 'nltk.parse.featurechart.FeatureChart'>, **parser_args)[source]¶ Bases:
nltk.parse.chart.ChartParser
-
class
nltk.parse.featurechart.
FeatureEmptyPredictRule
[source]¶ Bases:
nltk.parse.chart.EmptyPredictRule
-
apply
(chart, grammar)[source]¶ Return a generator that will add edges licensed by this rule and the given edges to the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.
Parameters: edges (list(EdgeI)) – A set of existing edges. The number of edges that should be passed to apply() is specified by the NUM_EDGES class variable.
Return type: iter(EdgeI)
-
-
class
nltk.parse.featurechart.
FeatureFundamentalRule
[source]¶ Bases:
nltk.parse.chart.FundamentalRule
A specialized version of the fundamental rule that operates on nonterminals whose symbols are FeatStructNonterminals. Rather than simply comparing the nonterminals for equality, they are unified. Variable bindings from these unifications are collected and stored in the chart using a FeatureTreeEdge. When a complete edge is generated, these bindings are applied to all nonterminals in the edge.
The fundamental rule states that:
[A -> alpha * B1 beta][i:j]
[B2 -> gamma *][j:k]
licenses the edge:
[A -> alpha B3 * beta][i:k]
assuming that B1 and B2 can be unified to generate B3.
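What "unified" means here can be shown with a deliberately small sketch. It assumes flat feature dictionaries with atomic values and no variables, which is far simpler than NLTK's feature structures; the function name `unify_flat` is hypothetical.

```python
def unify_flat(f1, f2):
    """Merge two flat feature dicts, or return None if any feature clashes."""
    result = dict(f1)
    for name, value in f2.items():
        if name in result and result[name] != value:
            return None  # conflicting values: unification fails
        result[name] = value
    return result


b1 = {"cat": "NP", "num": "sg"}
b2 = {"cat": "NP", "per": 3}
print(unify_flat(b1, b2))                        # {'cat': 'NP', 'num': 'sg', 'per': 3}
print(unify_flat({"num": "sg"}, {"num": "pl"}))  # None
```

In the rule above, B3 plays the role of the merged result: it carries the information from both B1 and B2, and the new edge is licensed only when no feature clashes.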
-
apply
(chart, grammar, left_edge, right_edge)[source]¶ Return a generator that will add edges licensed by this rule and the given edges to the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.
Parameters: edges (list(EdgeI)) – A set of existing edges. The number of edges that should be passed to apply() is specified by the NUM_EDGES class variable.
Return type: iter(EdgeI)
-
class
nltk.parse.featurechart.
FeatureSingleEdgeFundamentalRule
[source]¶ Bases:
nltk.parse.chart.SingleEdgeFundamentalRule
A specialized version of the completer / single edge fundamental rule that operates on nonterminals whose symbols are FeatStructNonterminals. Rather than simply comparing the nonterminals for equality, they are unified.
-
class
nltk.parse.featurechart.
FeatureTopDownInitRule
[source]¶ Bases:
nltk.parse.chart.TopDownInitRule
-
apply
(chart, grammar)[source]¶ Return a generator that will add edges licensed by this rule and the given edges to the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.
Parameters: edges (list(EdgeI)) – A set of existing edges. The number of edges that should be passed to apply() is specified by the NUM_EDGES class variable.
Return type: iter(EdgeI)
-
-
class
nltk.parse.featurechart.
FeatureTopDownPredictRule
[source]¶ Bases:
nltk.parse.chart.CachedTopDownPredictRule
A specialized version of the (cached) top down predict rule that operates on nonterminals whose symbols are FeatStructNonterminals. Rather than simply comparing the nonterminals for equality, they are unified.
The top down expand rule states that:
[A -> alpha * B1 beta][i:j]
licenses the edge:
[B2 -> * gamma][j:j]
for each grammar production B2 -> gamma, assuming that B1 and B2 can be unified.
-
apply
(chart, grammar, edge)[source]¶ Return a generator that will add edges licensed by this rule and the given edges to the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.
Parameters: edges (list(EdgeI)) – A set of existing edges. The number of edges that should be passed to apply() is specified by the NUM_EDGES class variable.
Return type: iter(EdgeI)
-
class
nltk.parse.featurechart.
FeatureTreeEdge
(span, lhs, rhs, dot=0, bindings=None)[source]¶ Bases:
nltk.parse.chart.TreeEdge
A specialized tree edge that allows shared variable bindings between nonterminals on the left-hand side and right-hand side.
Each FeatureTreeEdge contains a set of bindings, i.e., a dictionary mapping from variables to values. If the edge is not complete, then these bindings are simply stored. However, if the edge is complete, then the constructor applies these bindings to every nonterminal in the edge whose symbol implements the interface SubstituteBindingsI.
-
static
from_production
(production, index)[source]¶ Returns: A new TreeEdge formed from the given production. The new edge’s left-hand side and right-hand side will be taken from production; its span will be (index,index); and its dot position will be 0.
Return type: TreeEdge
-
move_dot_forward
(new_end, bindings=None)[source]¶ Returns: A new FeatureTreeEdge formed from this edge. The new edge’s dot position is increased by 1, and its end index will be replaced by new_end.
Return type: FeatureTreeEdge
Parameters: - new_end (int) – The new end index.
- bindings (dict) – Bindings for the new edge.
-
unicode_repr
()¶ Return repr(self).
-
-
class
nltk.parse.featurechart.
InstantiateVarsChart
(tokens)[source]¶ Bases:
nltk.parse.featurechart.FeatureChart
A specialized chart that ‘instantiates’ variables whose names start with ‘@’, by replacing them with unique new variables. In particular, whenever a complete edge is added to the chart, any variables in the edge’s lhs whose names start with ‘@’ will be replaced by unique new Variables.
-
insert
(edge, child_pointer_list)[source]¶ Add a new edge to the chart, and return True if this operation modified the chart. In particular, return true iff the chart did not already contain edge, or if it did not already associate child_pointer_lists with edge.
Return type: bool
-
instantiate_edge
(edge)[source]¶ If the edge is a FeatureTreeEdge, and it is complete, then instantiate all variables whose names start with ‘@’, by replacing them with unique new variables.
Note that instantiation is done in-place, since the parsing algorithms might already hold a reference to the edge for future use.
-
nltk.parse.generate module¶
-
nltk.parse.generate.
generate
(grammar, start=None, depth=None, n=None)[source]¶ Generates an iterator of all sentences from a CFG.
Parameters: - grammar – The Grammar used to generate sentences.
- start – The Nonterminal from which to start generating sentences.
- depth – The maximal depth of the generated tree.
- n – The maximum number of sentences to return.
Returns: An iterator of lists of terminal tokens.
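How the depth and n limits interact can be sketched without NLTK. The code below is an illustrative re-implementation under assumptions, not the library function: the grammar is a plain dict (`TOY_GRAMMAR`), any symbol that is not a key is treated as a terminal, and the exact depth-counting convention is a choice made for this example.

```python
from itertools import islice

TOY_GRAMMAR = {  # nonterminal -> list of right-hand sides (assumption)
    "S": [["NP", "VP"]],
    "NP": [["John"], ["Mary"]],
    "VP": [["walks"], ["sees", "NP"]],
}


def generate_sketch(symbol, depth):
    """Yield token lists derivable from `symbol` within `depth` expansions."""
    if symbol not in TOY_GRAMMAR:
        yield [symbol]  # terminal: emitted as-is
        return
    if depth == 0:
        return          # out of depth: this branch produces nothing
    for rhs in TOY_GRAMMAR[symbol]:
        yield from _expand(rhs, depth - 1)


def _expand(symbols, depth):
    """Yield every concatenation of expansions of `symbols`, left to right."""
    if not symbols:
        yield []
        return
    for first in generate_sketch(symbols[0], depth):
        for rest in _expand(symbols[1:], depth):
            yield first + rest


# The `n` limit corresponds to taking a prefix of the lazy iterator.
first_two = list(islice(generate_sketch("S", depth=3), 2))
print(first_two)  # [['John', 'walks'], ['John', 'sees', 'John']]
```

A depth limit matters most for recursive grammars, where the set of sentences is infinite and an unbounded expansion would not terminate.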
nltk.parse.malt module¶
-
class
nltk.parse.malt.
MaltParser
(parser_dirname, model_filename=None, tagger=None, additional_java_args=None)[source]¶ Bases:
nltk.parse.api.ParserI
A class for dependency parsing with MaltParser. The input is the paths to:
- a maltparser directory
- (optionally) the path to a pre-trained MaltParser .mco model file
- (optionally) the tagger to use for POS tagging before parsing
- (optionally) additional Java arguments
Example:
>>> from nltk.parse import malt
>>> # With the MALT_PARSER and MALT_MODEL environment variables set.
>>> mp = malt.MaltParser('maltparser-1.7.2', 'engmalt.linear-1.7.mco')
>>> mp.parse_one('I shot an elephant in my pajamas .'.split()).tree()
(shot I (elephant an) (in (pajamas my)) .)
>>> # Without the MALT_PARSER and MALT_MODEL environment variables.
>>> mp = malt.MaltParser('/home/user/maltparser-1.7.2/', '/home/user/engmalt.linear-1.7.mco')
>>> mp.parse_one('I shot an elephant in my pajamas .'.split()).tree()
(shot I (elephant an) (in (pajamas my)) .)
-
generate_malt_command
(inputfilename, outputfilename=None, mode=None)[source]¶ This function generates the maltparser command used at the terminal.
Parameters: - inputfilename (str) – path to the input file
- outputfilename (str) – path to the output file
-
parse_sents
(sentences, verbose=False, top_relation_label='null')[source]¶ Use MaltParser to parse multiple sentences. Takes a list of sentences, where each sentence is a list of words. Each sentence will be automatically tagged with this MaltParser instance’s tagger.
Parameters: sentences – Input sentences to parse Returns: iter(DependencyGraph)
-
parse_tagged_sents
(sentences, verbose=False, top_relation_label='null')[source]¶ Use MaltParser to parse multiple POS tagged sentences. Takes multiple sentences where each sentence is a list of (word, tag) tuples. The sentences must have already been tokenized and tagged.
Parameters: sentences – Input sentences to parse Returns: iter(iter(DependencyGraph)), the dependency graph representation of each sentence
-
train
(depgraphs, verbose=False)[source]¶ Train MaltParser from a list of DependencyGraph objects.
Parameters: depgraphs (DependencyGraph) – list of DependencyGraph objects for training input data
-
nltk.parse.malt.
find_malt_model
(model_filename)[source]¶ A function to find a pre-trained MaltParser model.
nltk.parse.nonprojectivedependencyparser module¶
-
class
nltk.parse.nonprojectivedependencyparser.
DemoScorer
[source]¶ Bases:
nltk.parse.nonprojectivedependencyparser.DependencyScorerI
-
score
(graph)[source]¶ Parameters: graph (DependencyGraph) – A dependency graph whose set of edges need to be scored.
Return type: A three-dimensional list of numbers.
Returns: The score is returned in a multidimensional(3) list, such that the outer-dimension refers to the head, and the inner-dimension refers to the dependencies. For instance, scores[0][1] would reference the list of scores corresponding to arcs from node 0 to node 1. The node’s ‘address’ field can be used to determine its number identification.
For further illustration, a score list corresponding to Fig.2 of Keith Hall’s ‘K-best Spanning Tree Parsing’ paper:
scores = [[[], [5], [1], [1]],
          [[], [], [11], [4]],
          [[], [10], [], [5]],
          [[], [8], [8], []]]
When used in conjunction with a MaxEntClassifier, each score would correspond to the confidence of a particular edge being classified with the positive training examples.
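To show how the `scores[head][dependent]` indexing is read, the sketch below picks, for each dependent, the head with the highest-scoring arc. It is only an illustration of the data layout: the function name `best_heads` is hypothetical, and it assumes node 0 acts as the root because no arc enters it in this matrix. Greedy selection like this is not a complete parser.

```python
scores = [
    [[], [5], [1], [1]],
    [[], [], [11], [4]],
    [[], [10], [], [5]],
    [[], [8], [8], []],
]


def best_heads(scores):
    """Map each dependent to the head whose arc has the highest score."""
    best = {}
    for dep in range(1, len(scores)):  # node 0 is assumed to be the root
        candidates = [
            (max(scores[head][dep]), head)
            for head in range(len(scores))
            if scores[head][dep]       # an empty list means there is no such arc
        ]
        if candidates:
            best[dep] = max(candidates)[1]
    return best


print(best_heads(scores))  # {1: 2, 2: 1, 3: 2}
```

Here nodes 1 and 2 choose each other as heads, which forms a cycle rather than a tree. Resolving such cycles is what a maximum spanning tree algorithm adds on top of the per-arc scores.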
-
train
(graphs)[source]¶ Parameters: graphs (list(DependencyGraph)) – A list of dependency graphs to train the scorer. Typically the edges present in the graphs can be used as positive training examples, and the edges not present as negative examples.
-
-
class
nltk.parse.nonprojectivedependencyparser.
DependencyScorerI
[source]¶ Bases:
object
A scorer for calculating the weights on the edges of a weighted dependency graph. This is used by a ProbabilisticNonprojectiveParser to initialize the edge weights of a DependencyGraph. While typically this would be done by training a binary classifier, any class that can return a multidimensional list representation of the edge weights can implement this interface. As such, it has no necessary fields.
-
score
(graph)[source]¶ Parameters: graph (DependencyGraph) – A dependency graph whose set of edges need to be scored.
Return type: A three-dimensional list of numbers.
Returns: The score is returned in a multidimensional(3) list, such that the outer-dimension refers to the head, and the inner-dimension refers to the dependencies. For instance, scores[0][1] would reference the list of scores corresponding to arcs from node 0 to node 1. The node’s ‘address’ field can be used to determine its number identification.
For further illustration, a score list corresponding to Fig.2 of Keith Hall’s ‘K-best Spanning Tree Parsing’ paper:
scores = [[[], [5], [1], [1]],
          [[], [], [11], [4]],
          [[], [10], [], [5]],
          [[], [8], [8], []]]
When used in conjunction with a MaxEntClassifier, each score would correspond to the confidence of a particular edge being classified with the positive training examples.
-
train
(graphs)[source]¶ Parameters: graphs (list(DependencyGraph)) – A list of dependency graphs to train the scorer. Typically the edges present in the graphs can be used as positive training examples, and the edges not present as negative examples.
-
-
class
nltk.parse.nonprojectivedependencyparser.
NaiveBayesDependencyScorer
[source]¶ Bases:
nltk.parse.nonprojectivedependencyparser.DependencyScorerI
A dependency scorer built around a MaxEnt classifier. In this particular class that classifier is a NaiveBayesClassifier. It uses head-word, head-tag, child-word, and child-tag features for classification.
>>> from nltk.parse.dependencygraph import DependencyGraph, conll_data2
>>> graphs = [DependencyGraph(entry) for entry in conll_data2.split('\n\n') if entry]
>>> npp = ProbabilisticNonprojectiveParser()
>>> npp.train(graphs, NaiveBayesDependencyScorer())
>>> parses = npp.parse(['Cathy', 'zag', 'hen', 'zwaaien', '.'], ['N', 'V', 'Pron', 'Adj', 'N', 'Punc'])
>>> len(list(parses))
1
-
score
(graph)[source]¶ Converts the graph into a feature-based representation of each edge, and then assigns a score to each based on the confidence of the classifier in assigning it to the positive label. Scores are returned in a multidimensional list.
Parameters: graph (DependencyGraph) – A dependency graph to score. Return type: 3 dimensional list Returns: Edge scores for the graph parameter.
-
train
(graphs)[source]¶ Trains a NaiveBayesClassifier using the edges present in graphs list as positive examples, the edges not present as negative examples. Uses a feature vector of head-word, head-tag, child-word, and child-tag.
Parameters: graphs (list(DependencyGraph)) – A list of dependency graphs to train the scorer.
-
-
class
nltk.parse.nonprojectivedependencyparser.
NonprojectiveDependencyParser
(dependency_grammar)[source]¶ Bases:
object
A non-projective, rule-based, dependency parser. This parser will return the set of all possible non-projective parses based on the word-to-word relations defined in the parser’s dependency grammar, and will allow the branches of the parse tree to cross in order to capture a variety of linguistic phenomena that a projective parser will not.
-
parse
(tokens)[source]¶ Parses the input tokens with respect to the parser’s grammar. Parsing is accomplished by representing the search-space of possible parses as a fully-connected directed graph. Arcs that would lead to ungrammatical parses are removed and a lattice is constructed of length n, where n is the number of input tokens, to represent all possible grammatical traversals. All possible paths through the lattice are then enumerated to produce the set of non-projective parses.
Parameters: tokens (list(str)) – A list of tokens to parse. Returns: An iterator of non-projective parses. Return type: iter(DependencyGraph)
-
-
class
nltk.parse.nonprojectivedependencyparser.
ProbabilisticNonprojectiveParser
[source]¶ Bases:
object
A probabilistic non-projective dependency parser.
Nonprojective dependencies allow for “crossing branches” in the parse tree, which is necessary for representing particular linguistic phenomena, or even typical parses in some languages. This parser follows the MST parsing algorithm, outlined in McDonald (2005), which likens the search for the best non-projective parse to finding the maximum spanning tree in a weighted directed graph.
>>> class Scorer(DependencyScorerI):
...     def train(self, graphs):
...         pass
...
...     def score(self, graph):
...         return [
...             [[], [5], [1], [1]],
...             [[], [], [11], [4]],
...             [[], [10], [], [5]],
...             [[], [8], [8], []],
...         ]
>>> npp = ProbabilisticNonprojectiveParser()
>>> npp.train([], Scorer())
>>> parses = npp.parse(['v1', 'v2', 'v3'], [None, None, None])
>>> len(list(parses))
1
>>> from nltk.grammar import DependencyGrammar
>>> grammar = DependencyGrammar.fromstring('''
... 'taught' -> 'play' | 'man'
... 'man' -> 'the' | 'in'
... 'in' -> 'corner'
... 'corner' -> 'the'
... 'play' -> 'golf' | 'dachshund' | 'to'
... 'dachshund' -> 'his'
... ''')
>>> ndp = NonprojectiveDependencyParser(grammar)
>>> parses = ndp.parse(['the', 'man', 'in', 'the', 'corner', 'taught', 'his', 'dachshund', 'to', 'play', 'golf'])
>>> len(list(parses))
4
-
best_incoming_arc
(node_index)[source]¶ Returns the source of the best incoming arc to the node with address: node_index
Parameters: node_index (integer.) – The address of the ‘destination’ node, the node that is arced to.
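The following standalone sketch shows why picking the best incoming arc for each node is only the first step of MST parsing. It reuses the numbers from the Scorer example, flattened to single values and indexed as `scores[head][dependent]` (that indexing, the `None` placeholders, and the helper `find_cycle` are assumptions of this sketch, not NLTK's internals). Greedy selection can produce a cycle, which then has to be collapsed.

```python
# Toy arc scores, scores[head][dependent]; None means there is no such arc.
scores = [
    [None, 5, 1, 1],
    [None, None, 11, 4],
    [None, 10, None, 5],
    [None, 8, 8, None],
]


def best_incoming_arc(scores, node):
    """Return the head whose arc into `node` has the highest score."""
    candidates = [
        (row[node], head) for head, row in enumerate(scores) if row[node] is not None
    ]
    return max(candidates)[1]


def find_cycle(best_head):
    """Return the nodes of one cycle in the head map, or None if there is none."""
    for start in best_head:
        seen = []
        node = start
        while node in best_head and node not in seen:
            seen.append(node)
            node = best_head[node]
        if node in seen:
            return seen[seen.index(node):]
    return None


# Node 0 plays the role of the root, so it gets no incoming arc.
best_head = {node: best_incoming_arc(scores, node) for node in range(1, len(scores))}
cycle = find_cycle(best_head)
```

With these scores, nodes 1 and 2 each choose the other as head, so the greedy choice is not a tree and the cycle must be contracted before continuing.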
-
collapse_nodes
(new_node, cycle_path, g_graph, b_graph, c_graph)[source]¶ Takes a list of nodes that have been identified to belong to a cycle, and collapses them into one larger node. The arcs of all nodes in the graph must be updated to account for this.
Parameters: - new_node (Node.) – A Node (Dictionary) to collapse the cycle nodes into.
- cycle_path (A list of integers.) – A list of node addresses, each of which is in the cycle.
- g_graph, b_graph, c_graph – Graphs which need to be updated.
-
compute_max_subtract_score
(column_index, cycle_indexes)[source]¶ When updating scores the score of the highest-weighted incoming arc is subtracted upon collapse. This returns the correct amount to subtract from that edge.
Parameters: - column_index (integer.) – An index representing the column of incoming arcs to a particular node being updated.
- cycle_indexes (A list of integers.) – Only arcs from cycle nodes are considered. This is a list of such nodes’ addresses.
-
compute_original_indexes
(new_indexes)[source]¶ As nodes are collapsed into others, they are replaced by the new node in the graph, but it’s still necessary to keep track of what these original nodes were. This takes a list of node addresses and replaces any collapsed node addresses with their original addresses.
Parameters: new_indexes (A list of integers.) – A list of node addresses to check for subsumed nodes.
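One way to picture this bookkeeping is a mapping from each collapsed node to the nodes it absorbed, expanded until only original addresses remain. The sketch below is a hypothetical standalone function, not the method's actual code; the `inner_nodes` mapping and its contents are assumptions made for illustration.

```python
def original_indexes(new_indexes, inner_nodes):
    """Expand collapsed node addresses into the original addresses they contain.

    inner_nodes maps a collapsed node's address to the addresses it replaced;
    those addresses may themselves be collapsed nodes.
    """
    result = []
    pending = list(new_indexes)
    while pending:
        index = pending.pop(0)
        if index in inner_nodes:
            # Replace the collapsed node by its members and keep expanding.
            pending = list(inner_nodes[index]) + pending
        else:
            result.append(index)
    return result


# Node 5 was formed from 1 and 2; node 6 was later formed from 5 and 3.
inner_nodes = {5: [1, 2], 6: [5, 3]}
expanded = original_indexes([6, 4], inner_nodes)
```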
-
initialize_edge_scores
(graph)[source]¶ Assigns a score to every edge in the
DependencyGraph
graph. These scores are generated via the parser’s scorer which was assigned during the training process.
Parameters: graph (DependencyGraph) – A dependency graph to assign scores to.
-
parse
(tokens, tags)[source]¶ Parses a list of tokens in accordance with the MST parsing algorithm for non-projective dependency parses. Assumes that the tokens to be parsed have already been tagged and those tags are provided. Various scoring methods can be used by implementing the DependencyScorerI interface and passing it to the training algorithm.
Parameters: - tokens (list(str)) – A list of words or punctuation to be parsed.
- tags (list(str)) – A list of tags corresponding by index to the words in the tokens list.
Returns: An iterator of non-projective parses.
Return type: iter(DependencyGraph)
-
train
(graphs, dependency_scorer)[source]¶ Trains a
DependencyScorerI
from a set of DependencyGraph objects, and establishes this as the parser’s scorer. This is used to initialize the scores on a DependencyGraph during the parsing procedure.
Parameters: - graphs (list(DependencyGraph)) – A list of dependency graphs to train the scorer.
- dependency_scorer (DependencyScorerI) – A scorer which implements the
DependencyScorerI
interface.
-
nltk.parse.pchart module¶
Classes and interfaces for associating probabilities with tree
structures that represent the internal organization of a text. The
probabilistic parser module defines BottomUpProbabilisticChartParser
.
BottomUpProbabilisticChartParser
is an abstract class that implements
a bottom-up chart parser for PCFG
grammars. It maintains a queue of edges,
and adds them to the chart one at a time. The ordering of this queue
is based on the probabilities associated with the edges, allowing the
parser to expand more likely edges before less likely ones. Each
subclass implements a different queue ordering, producing different
search strategies. Currently the following subclasses are defined:
- InsideChartParser searches edges in decreasing order of their trees’ inside probabilities.
- RandomChartParser searches edges in random order.
- LongestChartParser searches edges in decreasing order of their location’s length.
The BottomUpProbabilisticChartParser
constructor has an optional
argument beam_size. If non-zero, this controls the size of the beam
(aka the edge queue). This option is most useful with InsideChartParser.
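To make the role of the queue ordering and the beam concrete, here is a small standalone sketch. It does not use NLTK's Edge or Chart classes: edges are plain `(description, probability)` tuples and `next_edges` is a hypothetical helper, so treat it as an illustration of the idea rather than the parser's code.

```python
def next_edges(candidates, beam_size=0):
    """Order candidate edges from most to least probable.

    If beam_size is non-zero, keep only the beam_size most probable edges.
    """
    ordered = sorted(candidates, key=lambda edge: edge[1], reverse=True)
    if beam_size:
        ordered = ordered[:beam_size]
    return ordered


queue = [("NP[0:2]", 0.12), ("Det[0:1]", 0.50), ("VP[2:5]", 0.03), ("N[1:2]", 0.24)]
kept = next_edges(queue, beam_size=2)
everything = next_edges(queue)
```

A small beam trades completeness for speed: low-probability edges are dropped and can never contribute to a parse.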
-
class
nltk.parse.pchart.
BottomUpProbabilisticChartParser
(grammar, beam_size=0, trace=0)[source]¶ Bases:
nltk.parse.api.ParserI
An abstract bottom-up parser for
PCFG
grammars that uses a Chart to record partial results. BottomUpProbabilisticChartParser maintains a queue of edges that can be added to the chart. This queue is initialized with edges for each token in the text that is being parsed. BottomUpProbabilisticChartParser inserts these edges into the chart one at a time, starting with the most likely edges, and proceeding to less likely edges. For each edge that is added to the chart, it may become possible to insert additional edges into the chart; these are added to the queue. This process continues until enough complete parses have been generated, or until the queue is empty.
The sorting order for the queue is not specified by BottomUpProbabilisticChartParser. Different sorting orders will result in different search strategies. The sorting order for the queue is defined by the method sort_queue; subclasses are required to provide a definition for this method.
Variables: - _grammar – The grammar used to parse sentences.
- _trace – The level of tracing output that should be generated when parsing a text.
-
parse
(tokens)[source]¶ Returns: An iterator that generates parse trees for the sentence. When possible, the trees are generated from most likely to least likely.
Parameters: tokens (list(str)) – The sentence to be parsed Return type: iter(Tree)
-
sort_queue
(queue, chart)[source]¶ Sort the given queue of
Edge
objects, placing the edge that should be tried first at the beginning of the queue. This method will be called after each Edge is added to the queue.
Parameters: - queue (list(Edge)) – The queue of Edge objects to sort. Each edge in this queue is an edge that could be added to the chart by the fundamental rule, but that has not yet been added.
- chart (Chart) – The chart being used to parse the text. This chart can be used to provide extra information for sorting the queue.
Return type: None
-
class
nltk.parse.pchart.
InsideChartParser
(grammar, beam_size=0, trace=0)[source]¶ Bases:
nltk.parse.pchart.BottomUpProbabilisticChartParser
A bottom-up parser for
PCFG
grammars that tries edges in descending order of the inside probabilities of their trees. The “inside probability” of a tree is simply the probability of the entire tree, ignoring its context. In particular, the inside probability of a tree generated by production p with children c[1], c[2], …, c[n] is P(p)P(c[1])P(c[2])…P(c[n]); and the inside probability of a token is 1 if it is present in the text, and 0 if it is absent.
This sorting order results in a type of lowest-cost-first search strategy.
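The inside-probability formula can be checked by hand on a small tree. The sketch below is independent of NLTK: trees are nested tuples, and the production probabilities in `rule_prob` are invented numbers for a toy grammar, so the names and values are assumptions for illustration only.

```python
from math import prod  # available in Python 3.8+

# Hypothetical production probabilities, keyed by (lhs, rhs).
rule_prob = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("she",)): 0.4,
    ("NP", ("stars",)): 0.6,
    ("VP", ("V", "NP")): 0.7,
    ("V", ("saw",)): 1.0,
}


def label(node):
    """A token is its own label; a tree's label is its first element."""
    return node if isinstance(node, str) else node[0]


def inside_probability(node):
    """P(p) * P(c[1]) * ... * P(c[n]); a token present in the text counts as 1."""
    if isinstance(node, str):
        return 1.0
    lhs, children = node[0], node[1:]
    rhs = tuple(label(child) for child in children)
    return rule_prob[(lhs, rhs)] * prod(inside_probability(c) for c in children)


tree = ("S", ("NP", "she"), ("VP", ("V", "saw"), ("NP", "stars")))
p = inside_probability(tree)  # 1.0 * 0.4 * (0.7 * 1.0 * 0.6)
```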
-
sort_queue
(queue, chart)[source]¶ Sort the given queue of edges, in descending order of the inside probabilities of the edges’ trees.
Parameters: - queue (list(Edge)) – The queue of Edge objects to sort. Each edge in this queue is an edge that could be added to the chart by the fundamental rule, but that has not yet been added.
- chart (Chart) – The chart being used to parse the text. This chart can be used to provide extra information for sorting the queue.
Return type: None
-
-
class
nltk.parse.pchart.
LongestChartParser
(grammar, beam_size=0, trace=0)[source]¶ Bases:
nltk.parse.pchart.BottomUpProbabilisticChartParser
A bottom-up parser for
PCFG
grammars that tries longer edges before shorter ones. This sorting order results in a type of best-first search strategy.
-
sort_queue
(queue, chart)[source]¶ Sort the given queue of
Edge
objects, placing the edge that should be tried first at the beginning of the queue. This method will be called after each Edge is added to the queue.
Parameters: - queue (list(Edge)) – The queue of Edge objects to sort. Each edge in this queue is an edge that could be added to the chart by the fundamental rule, but that has not yet been added.
- chart (Chart) – The chart being used to parse the text. This chart can be used to provide extra information for sorting the queue.
Return type: None
-
-
class
nltk.parse.pchart.
ProbabilisticBottomUpInitRule
[source]¶ Bases:
nltk.parse.chart.AbstractChartRule
-
NUM_EDGES
= 0¶
-
apply
(chart, grammar)[source]¶ Return a generator that will add edges licensed by this rule and the given edges to the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.
Parameters: edges (list(EdgeI)) – A set of existing edges. The number of edges that should be passed to apply() is specified by the NUM_EDGES class variable.
Return type: iter(EdgeI)
-
-
class
nltk.parse.pchart.
ProbabilisticBottomUpPredictRule
[source]¶ Bases:
nltk.parse.chart.AbstractChartRule
-
NUM_EDGES
= 1¶
-
apply
(chart, grammar, edge)[source]¶ Return a generator that will add edges licensed by this rule and the given edges to the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.
Parameters: edges (list(EdgeI)) – A set of existing edges. The number of edges that should be passed to apply() is specified by the NUM_EDGES class variable.
Return type: iter(EdgeI)
-
-
class
nltk.parse.pchart.
ProbabilisticFundamentalRule
[source]¶ Bases:
nltk.parse.chart.AbstractChartRule
-
NUM_EDGES
= 2¶
-
apply
(chart, grammar, left_edge, right_edge)[source]¶ Return a generator that will add edges licensed by this rule and the given edges to the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.
Parameters: edges (list(EdgeI)) – A set of existing edges. The number of edges that should be passed to apply() is specified by the NUM_EDGES class variable.
Return type: iter(EdgeI)
-
-
class
nltk.parse.pchart.
ProbabilisticLeafEdge
(leaf, index)[source]¶ Bases:
nltk.parse.chart.LeafEdge
-
class
nltk.parse.pchart.
ProbabilisticTreeEdge
(prob, *args, **kwargs)[source]¶ Bases:
nltk.parse.chart.TreeEdge
-
class
nltk.parse.pchart.
RandomChartParser
(grammar, beam_size=0, trace=0)[source]¶ Bases:
nltk.parse.pchart.BottomUpProbabilisticChartParser
A bottom-up parser for
PCFG
grammars that tries edges in random order. This sorting order results in a random search strategy.
-
sort_queue
(queue, chart)[source]¶ Sort the given queue of
Edge
objects, placing the edge that should be tried first at the beginning of the queue. This method will be called after each Edge is added to the queue.
Parameters: - queue (list(Edge)) – The queue of Edge objects to sort. Each edge in this queue is an edge that could be added to the chart by the fundamental rule, but that has not yet been added.
- chart (Chart) – The chart being used to parse the text. This chart can be used to provide extra information for sorting the queue.
Return type: None
-
-
class
nltk.parse.pchart.
SingleEdgeProbabilisticFundamentalRule
[source]¶ Bases:
nltk.parse.chart.AbstractChartRule
-
NUM_EDGES
= 1¶
-
apply
(chart, grammar, edge1)[source]¶ Return a generator that will add edges licensed by this rule and the given edges to the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.
Parameters: edges (list(EdgeI)) – A set of existing edges. The number of edges that should be passed to apply() is specified by the NUM_EDGES class variable.
Return type: iter(EdgeI)
-
unicode_repr
¶ Return repr(self).
-
-
class
nltk.parse.pchart.
UnsortedChartParser
(grammar, beam_size=0, trace=0)[source]¶ Bases:
nltk.parse.pchart.BottomUpProbabilisticChartParser
A bottom-up parser for
PCFG
grammars that tries edges in no particular order.
-
sort_queue
(queue, chart)[source]¶ Sort the given queue of
Edge
objects, placing the edge that should be tried first at the beginning of the queue. This method will be called after each Edge is added to the queue.
Parameters: - queue (list(Edge)) – The queue of Edge objects to sort. Each edge in this queue is an edge that could be added to the chart by the fundamental rule, but that has not yet been added.
- chart (Chart) – The chart being used to parse the text. This chart can be used to provide extra information for sorting the queue.
Return type: None
-
-
nltk.parse.pchart.
demo
(choice=None, draw_parses=None, print_parses=None)[source]¶ A demonstration of the probabilistic parsers. The user is prompted to select which demo to run, and how many parses should be found; then each parser is run on the same demo, and a summary of the results is displayed.
nltk.parse.projectivedependencyparser module¶
-
class
nltk.parse.projectivedependencyparser.
ChartCell
(x, y)[source]¶ Bases:
object
A cell from the parse chart formed when performing the CYK algorithm. Each cell keeps track of its x and y coordinates (though this will probably be discarded), and a list of spans serving as the cell’s entries.
-
add
(span)[source]¶ Appends the given span to the list of spans representing the chart cell’s entries.
Parameters: span (DependencySpan) – The span to add.
-
unicode_repr
()¶ Returns: A concise string representation of this ChartCell. Return type: str
-
-
class
nltk.parse.projectivedependencyparser.
DependencySpan
(start_index, end_index, head_index, arcs, tags)[source]¶ Bases:
object
A contiguous span over some part of the input string representing dependency (head -> modifier) relationships amongst words. An atomic span corresponds to only one word so it isn’t a ‘span’ in the conventional sense, as its _start_index = _end_index = _head_index for concatenation purposes. All other spans are assumed to have arcs between all nodes within the start and end indexes of the span, and one head index corresponding to the head word for the entire span. This is the same as the root node if the dependency structure were depicted as a graph.
-
head_index
()[source]¶ Returns: A value indexing the head of the entire DependencySpan. Return type: int
-
unicode_repr
()¶ Returns: A concise string representation of the DependencySpan. Return type: str
-
-
class
nltk.parse.projectivedependencyparser.
ProbabilisticProjectiveDependencyParser
[source]¶ Bases:
object
A probabilistic, projective dependency parser.
This parser returns the most probable projective parse derived from the probabilistic dependency grammar derived from the train() method. The probabilistic model is an implementation of Eisner’s (1996) Model C, which conditions on head-word, head-tag, child-word, and child-tag. The decoding uses a bottom-up chart-based span concatenation algorithm that’s identical to the one utilized by the rule-based projective parser.
>>> from nltk.parse.dependencygraph import conll_data2
>>> graphs = [
...     DependencyGraph(entry) for entry in conll_data2.split('\n\n') if entry
... ]
>>> ppdp = ProbabilisticProjectiveDependencyParser()
>>> ppdp.train(graphs)
>>> sent = ['Cathy', 'zag', 'hen', 'wild', 'zwaaien', '.']
>>> list(ppdp.parse(sent))
[Tree('zag', ['Cathy', 'hen', Tree('zwaaien', ['wild', '.'])])]
-
compute_prob
(dg)[source]¶ Computes the probability of a dependency graph based on the parser’s probability model (defined by the parser’s statistical dependency grammar).
Parameters: dg (DependencyGraph) – A dependency graph to score. Returns: The probability of the dependency graph. Return type: int
-
concatenate
(span1, span2)[source]¶ Concatenates the two spans in whichever way possible. This includes rightward concatenation (from the leftmost word of the leftmost span to the rightmost word of the rightmost span) and leftward concatenation (vice-versa) between adjacent spans. Unlike Eisner’s presentation of span concatenation, these spans do not share or pivot on a particular word/word-index.
Returns: A list of new spans formed through concatenation. Return type: list(DependencySpan)
-
parse
(tokens)[source]¶ Parses the list of tokens subject to the projectivity constraint and the productions in the parser’s grammar. This uses a method similar to the span-concatenation algorithm defined in Eisner (1996). It returns the most probable parse derived from the parser’s probabilistic dependency grammar.
-
train
(graphs)[source]¶ Trains a ProbabilisticDependencyGrammar based on the list of input DependencyGraphs. This model is an implementation of Eisner’s (1996) Model C, which derives its statistics from head-word, head-tag, child-word, and child-tag relationships.
Parameters: graphs (list(DependencyGraph)) – A list of dependency graphs to train from.
-
-
class
nltk.parse.projectivedependencyparser.
ProjectiveDependencyParser
(dependency_grammar)[source]¶ Bases:
object
A projective, rule-based, dependency parser. A ProjectiveDependencyParser is created with a DependencyGrammar, a set of productions specifying word-to-word dependency relations. The parse() method will then return the set of all parses, in tree representation, for a given input sequence of tokens. Each parse must meet the requirements of both the grammar and the projectivity constraint, which specifies that the branches of the dependency tree are not allowed to cross. Alternatively, this can be understood as stating that each parent node and its children in the parse tree form a continuous substring of the input sequence.
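The projectivity constraint can be tested directly on a list of head indices. The sketch below is a minimal standalone check, not part of NLTK; the function name `is_projective` and the convention that `-1` marks the root are assumptions, and arcs from an artificial root node are ignored for simplicity.

```python
def is_projective(heads):
    """Return True if no two dependency arcs cross.

    heads[i] is the index of token i's head, or -1 if token i is the root.
    """
    # Store each arc as (left end, right end), regardless of direction.
    arcs = [(min(dep, head), max(dep, head)) for dep, head in enumerate(heads) if head >= 0]
    for a, b in arcs:
        for c, d in arcs:
            # Two arcs cross when exactly one end of the second lies inside the first.
            if a < c < b < d:
                return False
    return True


projective = is_projective([1, -1, 1])         # both outer tokens attach to the middle one
non_projective = is_projective([2, 3, -1, 2])  # arcs 0-2 and 1-3 cross
```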
-
concatenate
(span1, span2)[source]¶ Concatenates the two spans in whichever way possible. This includes rightward concatenation (from the leftmost word of the leftmost span to the rightmost word of the rightmost span) and leftward concatenation (vice-versa) between adjacent spans. Unlike Eisner’s presentation of span concatenation, these spans do not share or pivot on a particular word/word-index.
Returns: A list of new spans formed through concatenation. Return type: list(DependencySpan)
-
-
nltk.parse.projectivedependencyparser.
arity_parse_demo
()[source]¶ A demonstration showing the creation of a
DependencyGrammar
in which a specific number of modifiers is listed for a given head. This can further constrain the number of possible parses created by a ProjectiveDependencyParser.
nltk.parse.recursivedescent module¶
-
class
nltk.parse.recursivedescent.
RecursiveDescentParser
(grammar, trace=0)[source]¶ Bases:
nltk.parse.api.ParserI
A simple top-down CFG parser that parses texts by recursively expanding the fringe of a Tree, and matching it against a text.
RecursiveDescentParser
uses a list of tree locations called a “frontier” to remember which subtrees have not yet been expanded and which leaves have not yet been matched against the text. Each tree location consists of a list of child indices specifying the path from the root of the tree to a subtree or a leaf; see the reference documentation for Tree for more information about tree locations.
When the parser begins parsing a text, it constructs a tree containing only the start symbol, and a frontier containing the location of the tree’s root node. It then extends the tree to cover the text, using the following recursive procedure:
- If the frontier is empty, and the text is covered by the tree, then return the tree as a possible parse.
- If the frontier is empty, and the text is not covered by the tree, then return no parses.
- If the first element of the frontier is a subtree, then use CFG productions to “expand” it. For each applicable production, add the expanded subtree’s children to the frontier, and recursively find all parses that can be generated by the new tree and frontier.
- If the first element of the frontier is a token, then “match” it against the next token from the text. Remove the token from the frontier, and recursively find all parses that can be generated by the new tree and frontier.
See: nltk.grammar
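The expand/match procedure described above can be sketched in a few lines of plain Python. This is a simplified stand-in, not the class's implementation: the frontier holds symbols rather than tree locations, the result is a list of the productions used rather than a Tree, and `GRAMMAR` is a made-up toy grammar. Like any recursive descent parser, it would loop forever on a left-recursive grammar.

```python
# Hypothetical toy grammar: nonterminal -> list of right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
    "Det": [["the"]],
    "N": [["dog"], ["man"]],
    "V": [["saw"]],
}


def parse(frontier, text):
    """Yield one derivation (list of productions used) per complete parse.

    frontier: symbols still to be expanded or matched, leftmost first.
    text: tokens not yet covered.
    """
    if not frontier:
        if not text:  # frontier empty and text covered: a parse
            yield []
        return  # frontier empty but text left over: no parse
    first, rest = frontier[0], frontier[1:]
    if first in GRAMMAR:  # nonterminal: expand with each production
        for rhs in GRAMMAR[first]:
            for derivation in parse(rhs + rest, text):
                yield [(first, tuple(rhs))] + derivation
    elif text and text[0] == first:  # token: match against the next word
        yield from parse(rest, text[1:])


derivations = list(parse(["S"], "the dog saw the man".split()))
```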
-
class
nltk.parse.recursivedescent.
SteppingRecursiveDescentParser
(grammar, trace=0)[source]¶ Bases:
nltk.parse.recursivedescent.RecursiveDescentParser
A
RecursiveDescentParser
that allows you to step through the parsing process, performing a single operation at a time.
The initialize method is used to start parsing a text. expand expands the first element on the frontier using a single CFG production, and match matches the first element on the frontier against the next text token. backtrack undoes the most recent expand or match operation. step performs a single expand, match, or backtrack operation. parses returns the set of parses that have been found by the parser.
Variables: - _history – A list of (rtext, tree, frontier) triples, containing the previous states of the parser. This history is used to implement the backtrack operation.
- _tried_e – A record of all productions that have been tried for a given tree. This record is used by expand to perform the next untried production.
- _tried_m – A record of what tokens have been matched for a given tree. This record is used by step to decide whether or not to match a token.
See: nltk.grammar
-
backtrack
()[source]¶ Return the parser to its state before the most recent match or expand operation. Calling backtrack repeatedly returns the parser to successively earlier states. If no match or expand operations have been performed, backtrack will make no changes.
Returns: True if an operation was successfully undone. Return type: bool
-
currently_complete
()[source]¶ Returns: Whether the parser’s current state represents a complete parse. Return type: bool
-
expand
(production=None)[source]¶ Expand the first element of the frontier. In particular, if the first element of the frontier is a subtree whose node type is equal to production’s left hand side, then add a child to that subtree for each element of production’s right hand side. If production is not specified, then use the first untried expandable production. If all expandable productions have been tried, do nothing.
Returns: The production used to expand the frontier, if an expansion was performed. If no expansion was performed, return None. Return type: Production or None
-
expandable_productions
()[source]¶ Returns: A list of all the productions for which expansions are available for the current parser state. Return type: list(Production)
-
frontier
()[source]¶ Returns: A list of the tree locations of all subtrees that have not yet been expanded, and all leaves that have not yet been matched. Return type: list(tuple(int))
-
initialize
(tokens)[source]¶ Start parsing a given text. This sets the parser’s tree to the start symbol, its frontier to the root node, and its remaining text to
tokens
.
-
match
()[source]¶ Match the first element of the frontier. In particular, if the first element of the frontier has the same type as the next text token, then substitute the text token into the tree.
Returns: The token matched, if a match operation was performed. If no match was performed, return None Return type: str or None
-
parse
(tokens)[source]¶ Returns: An iterator that generates parse trees for the sentence. When possible, the trees are generated from most likely to least likely.
Parameters: tokens (list(str)) – The sentence to be parsed Return type: iter(Tree)
-
parses
()[source]¶ Returns: An iterator of the parses that have been found by this parser so far. Return type: iter(Tree)
-
remaining_text
()[source]¶ Returns: The portion of the text that is not yet covered by the tree. Return type: list(str)
-
set_grammar
(grammar)[source]¶ Change the grammar used to parse texts.
Parameters: grammar (CFG) – The new grammar.
-
step
()[source]¶ Perform a single parsing operation. If an untried match is possible, then perform the match, and return the matched token. If an untried expansion is possible, then perform the expansion, and return the production that it is based on. If backtracking is possible, then backtrack, and return True. Otherwise, return None.
Returns: None if no operation was performed; a token if a match was performed; a production if an expansion was performed; and True if a backtrack operation was performed. Return type: Production or String or bool
-
tree
()[source]¶ Returns: A partial structure for the text that is currently being parsed. The elements specified by the frontier have not yet been expanded or matched. Return type: Tree
-
untried_expandable_productions
()[source]¶ Returns: A list of all the untried productions for which expansions are available for the current parser state. Return type: list(Production)
nltk.parse.shiftreduce module¶
-
class
nltk.parse.shiftreduce.
ShiftReduceParser
(grammar, trace=0)[source]¶ Bases:
nltk.parse.api.ParserI
A simple bottom-up CFG parser that uses two operations, “shift” and “reduce”, to find a single parse for a text.
ShiftReduceParser
maintains a stack, which records the structure of a portion of the text. This stack is a list of strings and Trees that collectively cover a portion of the text. For example, while parsing the sentence “the dog saw the man” with a typical grammar, ShiftReduceParser will produce the following stack, which covers “the dog saw”:
[(NP: (Det: 'the') (N: 'dog')), (V: 'saw')]
ShiftReduceParser
attempts to extend the stack to cover the entire text, and to combine the stack elements into a single tree, producing a complete parse for the sentence.
Initially, the stack is empty. It is extended to cover the text, from left to right, by repeatedly applying two operations:
- “shift” moves a token from the beginning of the text to the end of the stack.
- “reduce” uses a CFG production to combine the rightmost stack elements into a single Tree.
Often, more than one operation can be performed on a given stack. In this case, ShiftReduceParser uses the following heuristics to decide which operation to perform:
- Only shift if no reductions are available.
- If multiple reductions are available, then apply the reduction whose CFG production is listed earliest in the grammar.
Note that these heuristics are not guaranteed to choose an operation that leads to a parse of the text. Also, if multiple parses exist, ShiftReduceParser will return at most one of them.
See: nltk.grammar
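The two operations and the two heuristics can be sketched without NLTK. In the example below, trees are nested tuples, `PRODUCTIONS` is a made-up toy grammar listed in priority order, and `try_reduce` and `shift_reduce` are hypothetical helpers; it illustrates the strategy rather than reproducing the class.

```python
# Hypothetical toy grammar; earlier productions win when several reductions apply.
PRODUCTIONS = [
    ("S", ("NP", "VP")),
    ("NP", ("Det", "N")),
    ("VP", ("V", "NP")),
    ("Det", ("the",)),
    ("N", ("dog",)),
    ("N", ("man",)),
    ("V", ("saw",)),
]


def label(item):
    """A token is its own label; a tree's label is its first element."""
    return item if isinstance(item, str) else item[0]


def try_reduce(stack):
    """Apply the earliest production matching the rightmost stack elements."""
    for lhs, rhs in PRODUCTIONS:
        n = len(rhs)
        if len(stack) >= n and tuple(label(x) for x in stack[-n:]) == rhs:
            stack[-n:] = [(lhs,) + tuple(stack[-n:])]
            return True
    return False


def shift_reduce(tokens):
    """Return a single parse as a nested tuple, or None if the heuristics fail."""
    stack, remaining = [], list(tokens)
    while True:
        if try_reduce(stack):  # reduce whenever possible
            continue
        if not remaining:
            break
        stack.append(remaining.pop(0))  # otherwise shift the next token
    return stack[0] if len(stack) == 1 and label(stack[0]) == "S" else None


tree = shift_reduce("the dog saw the man".split())
```

After 'saw' has been shifted and reduced, the stack in this sketch holds an NP tree followed by a V tree, which corresponds to the stack shown in the class description.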
-
class
nltk.parse.shiftreduce.
SteppingShiftReduceParser
(grammar, trace=0)[source]¶ Bases:
nltk.parse.shiftreduce.ShiftReduceParser
A
ShiftReduceParser
that allows you to step through the parsing process, performing a single operation at a time. It also allows you to change the parser’s grammar midway through parsing a text.
The initialize method is used to start parsing a text. shift performs a single shift operation, and reduce performs a single reduce operation. step will perform a single reduce operation if possible; otherwise, it will perform a single shift operation. parses returns the set of parses that have been found by the parser.
Variables: _history – A list of (stack, remaining_text) pairs, containing all of the previous states of the parser. This history is used to implement the undo operation.
See: nltk.grammar
-
initialize
(tokens)[source]¶ Start parsing a given text. This sets the parser’s stack to [] and sets its remaining text to tokens.
-
parse
(tokens)[source]¶ Returns: An iterator that generates parse trees for the sentence. When possible, the trees are generated from most likely to least likely.
Parameters: tokens (list(str)) – The sentence to be parsed Return type: iter(Tree)
-
parses
()[source]¶ Returns: An iterator of the parses that have been found by this parser so far. Return type: iter(Tree)
-
reduce
(production=None)[source]¶ Use production to combine the rightmost stack elements into a single Tree. If production does not match the rightmost stack elements, then do nothing.
Returns: The production used to reduce the stack, if a reduction was performed. If no reduction was performed, return None. Return type: Production or None
-
reducible_productions
()[source]¶ Returns: A list of the productions for which reductions are available for the current parser state. Return type: list(Production)
-
remaining_text
()[source]¶ Returns: The portion of the text that is not yet covered by the stack. Return type: list(str)
-
set_grammar
(grammar)[source]¶ Change the grammar used to parse texts.
Parameters: grammar (CFG) – The new grammar.
-
shift
()[source]¶ Move a token from the beginning of the remaining text to the end of the stack. If there are no more tokens in the remaining text, then do nothing.
Returns: True if the shift operation was successful. Return type: bool
-
step
()[source]¶ Perform a single parsing operation. If a reduction is possible, then perform that reduction, and return the production that it is based on. Otherwise, if a shift is possible, then perform it, and return True. Otherwise, return False.
Returns: False if no operation was performed; True if a shift was performed; and the CFG production used to reduce if a reduction was performed. Return type: Production or bool
-
undo
()[source]¶ Return the parser to its state before the most recent shift or reduce operation. Calling undo repeatedly returns the parser to successively earlier states. If no shift or reduce operations have been performed, undo will make no changes.
Returns: True if an operation was successfully undone. Return type: bool
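The history-based undo can be pictured as a stack of saved (stack, remaining_text) pairs. The class below is a hypothetical miniature written for this illustration; it supports only shift and undo and is not the stepping parser itself.

```python
class SteppingState:
    """Minimal illustration of undo via a history of saved states."""

    def __init__(self, tokens):
        self.stack = []
        self.remaining_text = list(tokens)
        self._history = []  # list of (stack, remaining_text) pairs

    def shift(self):
        """Move one token onto the stack; return False if no tokens remain."""
        if not self.remaining_text:
            return False
        # Save copies of the current state before changing it.
        self._history.append((list(self.stack), list(self.remaining_text)))
        self.stack.append(self.remaining_text.pop(0))
        return True

    def undo(self):
        """Restore the most recently saved state; return False if there is none."""
        if not self._history:
            return False
        self.stack, self.remaining_text = self._history.pop()
        return True


state = SteppingState(["the", "dog"])
state.shift()
undone = state.undo()
undone_again = state.undo()
```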
-
nltk.parse.stanford module¶
-
class
nltk.parse.stanford.
GenericStanfordParser
(path_to_jar=None, path_to_models_jar=None, model_path='edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz', encoding='utf8', verbose=False, java_options='-mx4g', corenlp_options='')[source]¶ Bases:
nltk.parse.api.ParserI
Interface to the Stanford Parser
-
parse_sents
(sentences, verbose=False)[source]¶ Use StanfordParser to parse multiple sentences. Takes multiple sentences as a list where each sentence is a list of words. Each sentence will be automatically tagged with this StanfordParser instance’s tagger. If whitespace exists inside a token, then the token will be treated as separate tokens.
Parameters: sentences (list(list(str))) – Input sentences to parse Return type: iter(iter(Tree))
-
raw_parse
(sentence, verbose=False)[source]¶ Use StanfordParser to parse a sentence. Takes a sentence as a string; before parsing, it will be automatically tokenized and tagged by the Stanford Parser.
Parameters: sentence (str) – Input sentence to parse Return type: iter(Tree)
-
raw_parse_sents
(sentences, verbose=False)[source]¶ Use StanfordParser to parse multiple sentences. Takes multiple sentences as a list of strings. Each sentence will be automatically tokenized and tagged by the Stanford Parser.
Parameters: sentences (list(str)) – Input sentences to parse Return type: iter(iter(Tree))
-
tagged_parse
(sentence, verbose=False)[source]¶ Use StanfordParser to parse a sentence. Takes a sentence as a list of (word, tag) tuples; the sentence must have already been tokenized and tagged.
Parameters: sentence (list(tuple(str, str))) – Input sentence to parse Return type: iter(Tree)
-
tagged_parse_sents
(sentences, verbose=False)[source]¶ Use StanfordParser to parse multiple sentences. Takes multiple sentences where each sentence is a list of (word, tag) tuples. The sentences must have already been tokenized and tagged.
Parameters: sentences (list(list(tuple(str, str)))) – Input sentences to parse Return type: iter(iter(Tree))
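The *_sents methods above return a nested iterator, iter(iter(Tree)): one inner iterator of parses per input sentence. The sketch below shows one way to flatten such a result with itertools instead of summing lists. It does not call the Stanford Parser; fake_parse_sents is a hypothetical stand-in that yields bracketed strings where a real parser would yield Tree objects.

```python
from itertools import chain
from typing import Iterable, Iterator, List


def fake_parse_sents(sentences: Iterable[List[str]]) -> Iterator[Iterator[str]]:
    """Hypothetical stand-in: yields one iterator of 'parses' per sentence."""
    for tokens in sentences:
        # A real parser would yield Tree objects; a string stands in here.
        yield iter(["(ROOT " + " ".join(tokens) + ")"])


def flatten(nested: Iterable[Iterable[str]]) -> List[str]:
    """Collapse an iterator of iterators into one flat list."""
    return list(chain.from_iterable(nested))


sentences = ["I 'm a dog".split(), "This is my cat".split()]
all_parses = flatten(fake_parse_sents(sentences))
```

chain.from_iterable consumes the outer iterator lazily and avoids the repeated list copying that sum(..., []) performs.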
-
-
class
nltk.parse.stanford.
StanfordDependencyParser
(*args, **kwargs)[source]¶ Bases:
nltk.parse.stanford.GenericStanfordParser
>>> dep_parser=StanfordDependencyParser( ... model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz" ... )
>>> [parse.tree() for parse in dep_parser.raw_parse("The quick brown fox jumps over the lazy dog.")] [Tree('jumps', [Tree('fox', ['The', 'quick', 'brown']), Tree('dog', ['over', 'the', 'lazy'])])]
>>> [list(parse.triples()) for parse in dep_parser.raw_parse("The quick brown fox jumps over the lazy dog.")] [[((u'jumps', u'VBZ'), u'nsubj', (u'fox', u'NN')), ((u'fox', u'NN'), u'det', (u'The', u'DT')), ((u'fox', u'NN'), u'amod', (u'quick', u'JJ')), ((u'fox', u'NN'), u'amod', (u'brown', u'JJ')), ((u'jumps', u'VBZ'), u'nmod', (u'dog', u'NN')), ((u'dog', u'NN'), u'case', (u'over', u'IN')), ((u'dog', u'NN'), u'det', (u'the', u'DT')), ((u'dog', u'NN'), u'amod', (u'lazy', u'JJ'))]]
>>> sum([[parse.tree() for parse in dep_graphs] for dep_graphs in dep_parser.raw_parse_sents(( ... "The quick brown fox jumps over the lazy dog.", ... "The quick grey wolf jumps over the lazy fox." ... ))], []) [Tree('jumps', [Tree('fox', ['The', 'quick', 'brown']), Tree('dog', ['over', 'the', 'lazy'])]), Tree('jumps', [Tree('wolf', ['The', 'quick', 'grey']), Tree('fox', ['over', 'the', 'lazy'])])]
>>> sum([[parse.tree() for parse in dep_graphs] for dep_graphs in dep_parser.parse_sents(( ... "I 'm a dog".split(), ... "This is my friends ' cat ( the tabby )".split(), ... ))], []) [Tree('dog', ['I', "'m", 'a']), Tree('cat', ['This', 'is', Tree('friends', ['my', "'"]), Tree('tabby', ['the'])])]
>>> sum([[list(parse.triples()) for parse in dep_graphs] for dep_graphs in dep_parser.tagged_parse_sents(( ... ( ... ("The", "DT"), ... ("quick", "JJ"), ... ("brown", "JJ"), ... ("fox", "NN"), ... ("jumped", "VBD"), ... ("over", "IN"), ... ("the", "DT"), ... ("lazy", "JJ"), ... ("dog", "NN"), ... (".", "."), ... ), ... ))],[]) [[((u'jumped', u'VBD'), u'nsubj', (u'fox', u'NN')), ((u'fox', u'NN'), u'det', (u'The', u'DT')), ((u'fox', u'NN'), u'amod', (u'quick', u'JJ')), ((u'fox', u'NN'), u'amod', (u'brown', u'JJ')), ((u'jumped', u'VBD'), u'nmod', (u'dog', u'NN')), ((u'dog', u'NN'), u'case', (u'over', u'IN')), ((u'dog', u'NN'), u'det', (u'the', u'DT')), ((u'dog', u'NN'), u'amod', (u'lazy', u'JJ'))]]
-
class
nltk.parse.stanford.
StanfordNeuralDependencyParser
(*args, **kwargs)[source]¶ Bases:
nltk.parse.stanford.GenericStanfordParser
>>> from nltk.parse.stanford import StanfordNeuralDependencyParser >>> dep_parser=StanfordNeuralDependencyParser(java_options='-mx4g')
>>> [parse.tree() for parse in dep_parser.raw_parse("The quick brown fox jumps over the lazy dog.")] [Tree('jumps', [Tree('fox', ['The', 'quick', 'brown']), Tree('dog', ['over', 'the', 'lazy']), '.'])]
>>> [list(parse.triples()) for parse in dep_parser.raw_parse("The quick brown fox jumps over the lazy dog.")] [[((u'jumps', u'VBZ'), u'nsubj', (u'fox', u'NN')), ((u'fox', u'NN'), u'det', (u'The', u'DT')), ((u'fox', u'NN'), u'amod', (u'quick', u'JJ')), ((u'fox', u'NN'), u'amod', (u'brown', u'JJ')), ((u'jumps', u'VBZ'), u'nmod', (u'dog', u'NN')), ((u'dog', u'NN'), u'case', (u'over', u'IN')), ((u'dog', u'NN'), u'det', (u'the', u'DT')), ((u'dog', u'NN'), u'amod', (u'lazy', u'JJ')), ((u'jumps', u'VBZ'), u'punct', (u'.', u'.'))]]
>>> sum([[parse.tree() for parse in dep_graphs] for dep_graphs in dep_parser.raw_parse_sents(( ... "The quick brown fox jumps over the lazy dog.", ... "The quick grey wolf jumps over the lazy fox." ... ))], []) [Tree('jumps', [Tree('fox', ['The', 'quick', 'brown']), Tree('dog', ['over', 'the', 'lazy']), '.']), Tree('jumps', [Tree('wolf', ['The', 'quick', 'grey']), Tree('fox', ['over', 'the', 'lazy']), '.'])]
>>> sum([[parse.tree() for parse in dep_graphs] for dep_graphs in dep_parser.parse_sents(( ... "I 'm a dog".split(), ... "This is my friends ' cat ( the tabby )".split(), ... ))], []) [Tree('dog', ['I', "'m", 'a']), Tree('cat', ['This', 'is', Tree('friends', ['my', "'"]), Tree('tabby', ['-LRB-', 'the', '-RRB-'])])]
-
class
nltk.parse.stanford.
StanfordParser
(*args, **kwargs)[source]¶ Bases:
nltk.parse.stanford.GenericStanfordParser
>>> parser=StanfordParser( ... model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz" ... )
>>> list(parser.raw_parse("the quick brown fox jumps over the lazy dog")) [Tree('ROOT', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['quick']), Tree('JJ', ['brown']), Tree('NN', ['fox'])]), Tree('NP', [Tree('NP', [Tree('NNS', ['jumps'])]), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['lazy']), Tree('NN', ['dog'])])])])])])]
>>> sum([list(dep_graphs) for dep_graphs in parser.raw_parse_sents(( ... "the quick brown fox jumps over the lazy dog", ... "the quick grey wolf jumps over the lazy fox" ... ))], []) [Tree('ROOT', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['quick']), Tree('JJ', ['brown']), Tree('NN', ['fox'])]), Tree('NP', [Tree('NP', [Tree('NNS', ['jumps'])]), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['lazy']), Tree('NN', ['dog'])])])])])]), Tree('ROOT', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['quick']), Tree('JJ', ['grey']), Tree('NN', ['wolf'])]), Tree('NP', [Tree('NP', [Tree('NNS', ['jumps'])]), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['lazy']), Tree('NN', ['fox'])])])])])])]
>>> sum([list(dep_graphs) for dep_graphs in parser.parse_sents(( ... "I 'm a dog".split(), ... "This is my friends ' cat ( the tabby )".split(), ... ))], []) [Tree('ROOT', [Tree('S', [Tree('NP', [Tree('PRP', ['I'])]), Tree('VP', [Tree('VBP', ["'m"]), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['dog'])])])])]), Tree('ROOT', [Tree('S', [Tree('NP', [Tree('DT', ['This'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('NP', [Tree('NP', [Tree('NP', [Tree('PRP$', ['my']), Tree('NNS', ['friends']), Tree('POS', ["'"])]), Tree('NN', ['cat'])]), Tree('PRN', [Tree('-LRB-', [Tree('', []), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['tabby'])]), Tree('-RRB-', [])])])])])])])]
>>> sum([list(dep_graphs) for dep_graphs in parser.tagged_parse_sents(( ... ( ... ("The", "DT"), ... ("quick", "JJ"), ... ("brown", "JJ"), ... ("fox", "NN"), ... ("jumped", "VBD"), ... ("over", "IN"), ... ("the", "DT"), ... ("lazy", "JJ"), ... ("dog", "NN"), ... (".", "."), ... ), ... ))],[]) [Tree('ROOT', [Tree('S', [Tree('NP', [Tree('DT', ['The']), Tree('JJ', ['quick']), Tree('JJ', ['brown']), Tree('NN', ['fox'])]), Tree('VP', [Tree('VBD', ['jumped']), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['lazy']), Tree('NN', ['dog'])])])]), Tree('.', ['.'])])])]
nltk.parse.transitionparser module¶
-
class
nltk.parse.transitionparser.
Configuration
(dep_graph)[source]¶ Bases:
object
Class for holding a configuration, which is the partial analysis of the input sentence. The transition-based parser aims to find a sequence of operators that transforms the initial configuration into the terminal configuration.
- The configuration includes:
- Stack: for storing partially processed words
- Buffer: for storing remaining input words
- Set of arcs: for storing the partially built dependency tree
This class also provides a method to represent a configuration as a list of features.
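The three components listed above can be pictured as a small data structure. This is a sketch under stated assumptions rather than the NLTK class: SketchConfiguration, initial() and is_terminal() are hypothetical names, word positions are plain integers with 0 standing for the artificial root, and "terminal" is taken to mean an empty buffer.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Arc = Tuple[int, str, int]  # (head index, relation label, dependent index)


@dataclass
class SketchConfiguration:
    stack: List[int] = field(default_factory=lambda: [0])  # 0 = artificial root
    buffer: List[int] = field(default_factory=list)
    arcs: List[Arc] = field(default_factory=list)

    @classmethod
    def initial(cls, n_words: int) -> "SketchConfiguration":
        # Initial state: only the root on the stack, every word in the buffer.
        return cls(stack=[0], buffer=list(range(1, n_words + 1)), arcs=[])

    def is_terminal(self) -> bool:
        # Assumption for this sketch: parsing stops once the buffer is empty.
        return not self.buffer


conf = SketchConfiguration.initial(9)
```

For a nine-word sentence this gives a stack of [0], a buffer of [1, 2, ..., 9] and no arcs.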
-
class
nltk.parse.transitionparser.
Transition
(alg_option)[source]¶ Bases:
object
This class defines a set of transitions that are applied to a configuration to obtain another configuration. Note that the transitions differ between parsing algorithms.
-
LEFT_ARC
= 'LEFTARC'¶
-
REDUCE
= 'REDUCE'¶
-
RIGHT_ARC
= 'RIGHTARC'¶
-
SHIFT
= 'SHIFT'¶
-
left_arc
(conf, relation)[source]¶ Note that the left-arc algorithm is nearly the same for arc-standard and arc-eager; only the precondition differs.
Parameters: conf – the current configuration Returns: A new configuration or -1 if the precondition is not satisfied
-
reduce
(conf)[source]¶ Note that the reduce operation is only available for arc-eager.
Parameters: conf – the current configuration Returns: A new configuration or -1 if the precondition is not satisfied
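To make the preconditions concrete, here is a self-contained sketch of the four arc-eager transitions on plain lists. It is an illustration, not the NLTK Transition class: ArcEagerSketch is a hypothetical name, the transitions mutate the object in place, and returning 0 on success is an assumption (only the -1 failure value comes from the description above). The script replays an arc-eager derivation for a nine-word sentence.

```python
from typing import List, Tuple

Arc = Tuple[int, str, int]  # (head, relation, dependent)


class ArcEagerSketch:
    def __init__(self, n_words: int):
        self.stack: List[int] = [0]  # 0 is the artificial root
        self.buffer: List[int] = list(range(1, n_words + 1))
        self.arcs: List[Arc] = []

    def _has_head(self, idx: int) -> bool:
        return any(dep == idx for _, _, dep in self.arcs)

    def shift(self) -> int:
        if not self.buffer:
            return -1
        self.stack.append(self.buffer.pop(0))
        return 0

    def left_arc(self, relation: str) -> int:
        # Precondition: the stack top is not the root and has no head yet.
        if not self.buffer or not self.stack:
            return -1
        top = self.stack[-1]
        if top == 0 or self._has_head(top):
            return -1
        self.stack.pop()
        self.arcs.append((self.buffer[0], relation, top))
        return 0

    def right_arc(self, relation: str) -> int:
        if not self.buffer or not self.stack:
            return -1
        dep = self.buffer.pop(0)
        self.arcs.append((self.stack[-1], relation, dep))
        self.stack.append(dep)
        return 0

    def reduce(self) -> int:
        # Precondition: the stack top must already have a head.
        if not self.stack or not self._has_head(self.stack[-1]):
            return -1
        self.stack.pop()
        return 0


script = [
    ("shift",), ("left_arc", "ATT"), ("shift",), ("left_arc", "SBJ"),
    ("right_arc", "ROOT"), ("shift",), ("left_arc", "ATT"),
    ("right_arc", "OBJ"), ("right_arc", "ATT"), ("shift",),
    ("left_arc", "ATT"), ("right_arc", "PC"), ("reduce",), ("reduce",),
    ("reduce",), ("right_arc", "PU"),
]
conf = ArcEagerSketch(9)
results = [getattr(conf, name)(*args) for name, *args in script]
```

Every step in this script satisfies its precondition, so results contains only 0; calling reduce() on a fresh configuration returns -1 because the root has no head.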
-
-
class
nltk.parse.transitionparser.
TransitionParser
(algorithm)[source]¶ Bases:
nltk.parse.api.ParserI
Class for a transition-based parser. Implements two algorithms: “arc-standard” and “arc-eager”.
-
ARC_EAGER
= 'arc-eager'¶
-
ARC_STANDARD
= 'arc-standard'¶
-
parse
(depgraphs, modelFile)[source]¶ Parameters: - depgraphs (list(DependencyGraph)) – the list of test sentences; each sentence is represented as a dependency graph in which the ‘head’ information is a dummy value
- modelFile (str) – the model file
Returns: list(DependencyGraph) with the ‘head’ and ‘rel’ information
-
-
nltk.parse.transitionparser.
demo
()[source]¶
>>> from nltk.parse import DependencyGraph, DependencyEvaluator
>>> from nltk.parse.transitionparser import TransitionParser, Configuration, Transition
>>> gold_sent = DependencyGraph("""
... Economic JJ 2 ATT
... news NN 3 SBJ
... has VBD 0 ROOT
... little JJ 5 ATT
... effect NN 3 OBJ
... on IN 5 ATT
... financial JJ 8 ATT
... markets NNS 6 PC
... . . 3 PU
... """)
>>> conf = Configuration(gold_sent)
###################### Check the Initial Feature ########################
>>> print(', '.join(conf.extract_features()))
STK_0_POS_TOP, BUF_0_FORM_Economic, BUF_0_LEMMA_Economic, BUF_0_POS_JJ, BUF_1_FORM_news, BUF_1_POS_NN, BUF_2_POS_VBD, BUF_3_POS_JJ
###################### Check The Transition #######################
Check the Initialized Configuration
>>> print(conf)
Stack : [0] Buffer : [1, 2, 3, 4, 5, 6, 7, 8, 9] Arcs : []
A. Do some transition checks for ARC-STANDARD
>>> operation = Transition('arc-standard')
>>> operation.shift(conf)
>>> operation.left_arc(conf, "ATT")
>>> operation.shift(conf)
>>> operation.left_arc(conf,"SBJ")
>>> operation.shift(conf)
>>> operation.shift(conf)
>>> operation.left_arc(conf, "ATT")
>>> operation.shift(conf)
>>> operation.shift(conf)
>>> operation.shift(conf)
>>> operation.left_arc(conf, "ATT")
Middle Configuration and Features Check
>>> print(conf)
Stack : [0, 3, 5, 6] Buffer : [8, 9] Arcs : [(2, 'ATT', 1), (3, 'SBJ', 2), (5, 'ATT', 4), (8, 'ATT', 7)]
>>> print(', '.join(conf.extract_features()))
STK_0_FORM_on, STK_0_LEMMA_on, STK_0_POS_IN, STK_1_POS_NN, BUF_0_FORM_markets, BUF_0_LEMMA_markets, BUF_0_POS_NNS, BUF_1_FORM_., BUF_1_POS_., BUF_0_LDEP_ATT
>>> operation.right_arc(conf, "PC")
>>> operation.right_arc(conf, "ATT")
>>> operation.right_arc(conf, "OBJ")
>>> operation.shift(conf)
>>> operation.right_arc(conf, "PU")
>>> operation.right_arc(conf, "ROOT")
>>> operation.shift(conf)
Terminated Configuration Check
>>> print(conf)
Stack : [0] Buffer : [] Arcs : [(2, 'ATT', 1), (3, 'SBJ', 2), (5, 'ATT', 4), (8, 'ATT', 7), (6, 'PC', 8), (5, 'ATT', 6), (3, 'OBJ', 5), (3, 'PU', 9), (0, 'ROOT', 3)]
B. Do some transition checks for ARC-EAGER
>>> conf = Configuration(gold_sent)
>>> operation = Transition('arc-eager')
>>> operation.shift(conf)
>>> operation.left_arc(conf,'ATT')
>>> operation.shift(conf)
>>> operation.left_arc(conf,'SBJ')
>>> operation.right_arc(conf,'ROOT')
>>> operation.shift(conf)
>>> operation.left_arc(conf,'ATT')
>>> operation.right_arc(conf,'OBJ')
>>> operation.right_arc(conf,'ATT')
>>> operation.shift(conf)
>>> operation.left_arc(conf,'ATT')
>>> operation.right_arc(conf,'PC')
>>> operation.reduce(conf)
>>> operation.reduce(conf)
>>> operation.reduce(conf)
>>> operation.right_arc(conf,'PU')
>>> print(conf)
Stack : [0, 3, 9] Buffer : [] Arcs : [(2, 'ATT', 1), (3, 'SBJ', 2), (0, 'ROOT', 3), (5, 'ATT', 4), (3, 'OBJ', 5), (5, 'ATT', 6), (8, 'ATT', 7), (6, 'PC', 8), (3, 'PU', 9)]
###################### Check The Training Function #######################
A. Check the ARC-STANDARD training
>>> import tempfile
>>> import os
>>> input_file = tempfile.NamedTemporaryFile(prefix='transition_parse.train', dir=tempfile.gettempdir(), delete=False)
>>> parser_std = TransitionParser('arc-standard')
>>> print(', '.join(parser_std._create_training_examples_arc_std([gold_sent], input_file)))
Number of training examples : 1
Number of valid (projective) examples : 1
SHIFT, LEFTARC:ATT, SHIFT, LEFTARC:SBJ, SHIFT, SHIFT, LEFTARC:ATT, SHIFT, SHIFT, SHIFT, LEFTARC:ATT, RIGHTARC:PC, RIGHTARC:ATT, RIGHTARC:OBJ, SHIFT, RIGHTARC:PU, RIGHTARC:ROOT, SHIFT
>>> parser_std.train([gold_sent],'temp.arcstd.model', verbose=False)
Number of training examples : 1
Number of valid (projective) examples : 1
>>> os.remove(input_file.name)
B. Check the ARC-EAGER training
>>> input_file = tempfile.NamedTemporaryFile(prefix='transition_parse.train', dir=tempfile.gettempdir(),delete=False)
>>> parser_eager = TransitionParser('arc-eager')
>>> print(', '.join(parser_eager._create_training_examples_arc_eager([gold_sent], input_file)))
Number of training examples : 1
Number of valid (projective) examples : 1
SHIFT, LEFTARC:ATT, SHIFT, LEFTARC:SBJ, RIGHTARC:ROOT, SHIFT, LEFTARC:ATT, RIGHTARC:OBJ, RIGHTARC:ATT, SHIFT, LEFTARC:ATT, RIGHTARC:PC, REDUCE, REDUCE, REDUCE, RIGHTARC:PU
>>> parser_eager.train([gold_sent],'temp.arceager.model', verbose=False)
Number of training examples : 1
Number of valid (projective) examples : 1
>>> os.remove(input_file.name)
###################### Check The Parsing Function ########################
A. Check the ARC-STANDARD parser
>>> result = parser_std.parse([gold_sent], 'temp.arcstd.model')
>>> de = DependencyEvaluator(result, [gold_sent])
>>> de.eval() >= (0, 0)
True
B. Check the ARC-EAGER parser
>>> result = parser_eager.parse([gold_sent], 'temp.arceager.model')
>>> de = DependencyEvaluator(result, [gold_sent])
>>> de.eval() >= (0, 0)
True
Remove test temporary files
>>> os.remove('temp.arceager.model')
>>> os.remove('temp.arcstd.model')
Note that the result is very poor because there is only one training example.
nltk.parse.util module¶
Utility functions for parsers.
-
class
nltk.parse.util.
TestGrammar
(grammar, suite, accept=None, reject=None)[source]¶ Bases:
object
Unit tests for CFG.
-
run
(show_trees=False)[source]¶ Sentences in the test suite are divided into two classes:
- grammatical (accept) and
- ungrammatical (reject).
If a sentence should parse according to the grammar, the value of trees will be a non-empty list. If a sentence should be rejected according to the grammar, then the value of trees will be None.
-
-
nltk.parse.util.
extract_test_sentences
(string, comment_chars='#%;', encoding=None)[source]¶ Parses a string with one test sentence per line. Lines can optionally begin with:
- a bool, saying if the sentence is grammatical or not, or
- an int, giving the number of parse trees it should have.
The result information is followed by a colon, and then the sentence. Empty lines and lines beginning with a comment char are ignored.
Returns: a list of tuples of sentences and expected results, where a sentence is a list of str, and a result is None, or bool, or int
Parameters: - comment_chars –
str
of possible comment characters. - encoding – the encoding of the string, if it is binary
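The line format described above can be sketched in a few lines of standard-library Python. This is an illustration, not the NLTK function: parse_test_lines is a hypothetical name, and accepting "true"/"false" case-insensitively and ignoring surrounding whitespace are assumptions of the sketch.

```python
from typing import List, Optional, Tuple, Union

Expected = Optional[Union[bool, int]]


def parse_test_lines(text: str, comment_chars: str = "#%;") -> List[Tuple[List[str], Expected]]:
    results: List[Tuple[List[str], Expected]] = []
    for line in text.splitlines():
        line = line.strip()
        # Skip empty lines and lines that start with a comment character.
        if not line or line[0] in comment_chars:
            continue
        expected: Expected = None
        head, sep, rest = line.partition(":")
        label = head.strip()
        if sep and label.lower() in ("true", "false"):
            expected = label.lower() == "true"  # grammatical or not
            line = rest
        elif sep and label.isdigit():
            expected = int(label)  # expected number of parse trees
            line = rest
        tokens = line.split()
        if tokens:
            results.append((tokens, expected))
    return results


suite = parse_test_lines("""
# a comment line
True: the dog barks
false: dog the barks
2: I saw the man with the telescope
the cat sleeps
""")
```

Lines without a recognised prefix are kept with a result of None, matching the "None, or bool, or int" result types listed above.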
-
nltk.parse.util.
load_parser
(grammar_url, trace=0, parser=None, chart_class=None, beam_size=0, **load_args)[source]¶ Load a grammar from a file, and build a parser based on that grammar. The parser depends on the grammar format, and might also depend on properties of the grammar itself.
The following grammar formats are currently supported:
- 'cfg' (CFGs: CFG)
- 'pcfg' (probabilistic CFGs: PCFG)
- 'fcfg' (feature-based CFGs: FeatureGrammar)
Parameters: - grammar_url (str) – A URL specifying where the grammar is located. The default protocol is "nltk:", which searches for the file in the NLTK data package.
- trace (int) – The level of tracing that should be used when parsing a text. 0 will generate no tracing output; and higher numbers will produce more verbose tracing output.
- parser – The class used for parsing; should be ChartParser or a subclass. If None, the class depends on the grammar format.
- chart_class – The class used for storing the chart; should be Chart or a subclass. Only used for CFGs and feature CFGs. If None, the chart class depends on the grammar format.
- beam_size (int) – The maximum length for the parser’s edge queue. Only used for probabilistic CFGs.
- load_args – Keyword parameters used when loading the grammar. See data.load for more information.
-
nltk.parse.util.
taggedsent_to_conll
(sentence)[source]¶ A function to convert a single POS-tagged sentence into CONLL format.
>>> from nltk import word_tokenize, pos_tag
>>> from nltk.parse.util import taggedsent_to_conll
>>> text = "This is a foobar sentence."
>>> for line in taggedsent_to_conll(pos_tag(word_tokenize(text))):
...     print(line, end="")
1 This _ DT DT _ 0 a _ _
2 is _ VBZ VBZ _ 0 a _ _
3 a _ DT DT _ 0 a _ _
4 foobar _ JJ JJ _ 0 a _ _
5 sentence _ NN NN _ 0 a _ _
6 . _ . . _ 0 a _ _
Parameters: sentence (list(tuple(str, str))) – A single input sentence to parse Return type: iter(str) Returns: a generator yielding a single sentence in CONLL format.
-
nltk.parse.util.
taggedsents_to_conll
(sentences)[source]¶ A function to convert a POS-tagged document stream (i.e. a list of sentences, each a list of (word, tag) tuples) and yield lines in CONLL format. This function yields one line per word and two newlines at the end of each sentence.
>>> from nltk import word_tokenize, sent_tokenize, pos_tag
>>> from nltk.parse.util import taggedsents_to_conll
>>> text = "This is a foobar sentence. Is that right?"
>>> sentences = [pos_tag(word_tokenize(sent)) for sent in sent_tokenize(text)]
>>> for line in taggedsents_to_conll(sentences):
...     if line:
...         print(line, end="")
1 This _ DT DT _ 0 a _ _
2 is _ VBZ VBZ _ 0 a _ _
3 a _ DT DT _ 0 a _ _
4 foobar _ JJ JJ _ 0 a _ _
5 sentence _ NN NN _ 0 a _ _
6 . _ . . _ 0 a _ _
1 Is _ VBZ VBZ _ 0 a _ _
2 that _ IN IN _ 0 a _ _
3 right _ NN NN _ 0 a _ _
4 ? _ . . _ 0 a _ _
Parameters: sentences – Input sentences to parse Return type: iter(str) Returns: a generator yielding sentences in CONLL format.
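The ten-column lines shown in the two examples above can be reproduced with a short generator. This is a sketch modelled on the displayed output, not a copy of the NLTK source: the sketch_ function names are hypothetical, and the tab separator is an assumption (the rendered output does not show which whitespace character is used).

```python
from typing import Iterable, Iterator, List, Tuple

TaggedSentence = List[Tuple[str, str]]


def sketch_taggedsent_to_conll(sentence: TaggedSentence) -> Iterator[str]:
    """Yield one CONLL-style line per (word, tag) pair, numbered from 1."""
    for index, (word, tag) in enumerate(sentence, start=1):
        # Columns follow the output shown above: the tag fills both POS
        # columns; head and relation are the placeholders 0 and a.
        columns = [str(index), word, "_", tag, tag, "_", "0", "a", "_", "_"]
        yield "\t".join(columns) + "\n"


def sketch_taggedsents_to_conll(sentences: Iterable[TaggedSentence]) -> Iterator[str]:
    """Yield lines for every sentence, with two newlines after each one."""
    for sentence in sentences:
        yield from sketch_taggedsent_to_conll(sentence)
        yield "\n\n"


single = list(sketch_taggedsent_to_conll([("This", "DT"), ("is", "VBZ")]))
document = list(sketch_taggedsents_to_conll([[("Hi", "UH")], [("Bye", "UH")]]))
```

Both functions are generators, so a large tagged corpus can be streamed to a file without holding every line in memory.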
nltk.parse.viterbi module¶
-
class
nltk.parse.viterbi.
ViterbiParser
(grammar, trace=0)[source]¶ Bases:
nltk.parse.api.ParserI
A bottom-up PCFG parser that uses dynamic programming to find the single most likely parse for a text. The ViterbiParser parser parses texts by filling in a “most likely constituent table”. This table records the most probable tree representation for any given span and node value. In particular, it has an entry for every start index, end index, and node value, recording the most likely subtree that spans from the start index to the end index, and has the given node value.
The ViterbiParser parser fills in this table incrementally. It starts by filling in all entries for constituents that span one element of text (i.e., entries where the end index is one greater than the start index). After it has filled in all table entries for constituents that span one element of text, it fills in the entries for constituents that span two elements of text. It continues filling in the entries for constituents spanning larger and larger portions of the text, until the entire table has been filled. Finally, it returns the table entry for a constituent spanning the entire text, whose node value is the grammar’s start symbol.
In order to find the most likely constituent with a given span and node value, the ViterbiParser parser considers all productions that could produce that node value. For each production, it finds all children that collectively cover the span and have the node values specified by the production’s right hand side. If the probability of the tree formed by applying the production to the children is greater than the probability of the current entry in the table, then the table is updated with this new tree.
A pseudo-code description of the algorithm used by ViterbiParser is:
Create an empty most likely constituent table, MLC.
For width in 1…len(text):
    For start in 0…len(text)-width:
        For prod in grammar.productions:
            For each sequence of subtrees [t[1], t[2], …, t[n]] in MLC,
              where t[i].label()==prod.rhs[i],
              and the sequence covers [start:start+width]:
                old_p = MLC[start, start+width, prod.lhs]
                new_p = P(t[1])P(t[2])…P(t[n])P(prod)
                if new_p > old_p:
                    new_tree = Tree(prod.lhs, t[1], t[2], …, t[n])
                    MLC[start, start+width, prod.lhs] = new_tree
Return MLC[0, len(text), start_symbol]
Variables: - _grammar – The grammar used to parse sentences.
- _trace – The level of tracing output that should be generated when parsing a text.
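The table-filling procedure described above can be tried out on a toy grammar. The sketch below is a simplified illustration, not the ViterbiParser implementation: it assumes a grammar in Chomsky normal form (only word-level and binary rules), stores trees as nested tuples, and uses a made-up grammar (LEXICAL, BINARY) and function name (viterbi_parse).

```python
from typing import Dict, List, Optional, Tuple

Tree = Tuple  # nested tuples: (label, child, ...) with str leaves

# Hypothetical toy PCFG in Chomsky normal form.
LEXICAL: Dict[Tuple[str, str], float] = {  # (lhs, word) -> probability
    ("NP", "I"): 0.5,
    ("NP", "fish"): 0.5,
    ("V", "eat"): 1.0,
}
BINARY: Dict[Tuple[str, str, str], float] = {  # (lhs, left, right) -> probability
    ("S", "NP", "VP"): 1.0,
    ("VP", "V", "NP"): 1.0,
}


def viterbi_parse(tokens: List[str], start: str = "S") -> Optional[Tuple[float, Tree]]:
    n = len(tokens)
    # mlc[(i, j, label)] = (probability, tree) of the best constituent over tokens[i:j]
    mlc: Dict[Tuple[int, int, str], Tuple[float, Tree]] = {}

    # Width 1: fill entries from the word-level rules.
    for i, word in enumerate(tokens):
        for (lhs, rhs_word), p in LEXICAL.items():
            if rhs_word == word and p > mlc.get((i, i + 1, lhs), (0.0, None))[0]:
                mlc[(i, i + 1, lhs)] = (p, (lhs, word))

    # Wider spans: combine two adjacent constituents that are already in the table.
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            for (lhs, left, right), p_rule in BINARY.items():
                for k in range(i + 1, j):
                    if (i, k, left) in mlc and (k, j, right) in mlc:
                        p_left, t_left = mlc[(i, k, left)]
                        p_right, t_right = mlc[(k, j, right)]
                        new_p = p_rule * p_left * p_right
                        # Keep the new tree only if it beats the current entry.
                        if new_p > mlc.get((i, j, lhs), (0.0, None))[0]:
                            mlc[(i, j, lhs)] = (new_p, (lhs, t_left, t_right))

    # The answer is the entry spanning the whole text with the start symbol.
    return mlc.get((0, n, start))


best = viterbi_parse(["I", "eat", "fish"])
```

For this grammar the only parse of "I eat fish" has probability 0.5 × 0.5 = 0.25; a token sequence with no parse, such as ["fish", "eat"], yields None.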
-
parse
(tokens)[source]¶ Returns: An iterator that generates parse trees for the sentence. When possible this list is sorted from most likely to least likely.
Parameters: tokens (list(str)) – The sentence to be parsed Return type: iter(Tree)
-
trace
(trace=2)[source]¶ Set the level of tracing output that should be generated when parsing a text.
Parameters: trace (int) – The trace level. A trace level of 0 will generate no tracing output; and higher trace levels will produce more verbose tracing output.
Return type: None
-
unicode_repr
()¶ Return repr(self).
Module contents¶
NLTK Parsers
Classes and interfaces for producing tree structures that represent the internal organization of a text. This task is known as “parsing” the text, and the resulting tree structures are called the text’s “parses”. Typically, the text is a single sentence, and the tree structure represents the syntactic structure of the sentence. However, parsers can also be used in other domains. For example, parsers can be used to derive the morphological structure of the morphemes that make up a word, or to derive the discourse structure for a set of utterances.
Sometimes, a single piece of text can be represented by more than one tree structure. Texts represented by more than one tree structure are called “ambiguous” texts. Note that there are actually two ways in which a text can be ambiguous:
- The text has multiple correct parses.
- There is not enough information to decide which of several candidate parses is correct.
However, the parser module does not distinguish these two types of ambiguity.
The parser module defines ParserI, a standard interface for parsing texts; and two simple implementations of that interface, ShiftReduceParser and RecursiveDescentParser. It also contains three sub-modules for specialized kinds of parsing:
- nltk.parser.chart defines chart parsing, which uses dynamic programming to efficiently parse texts.
- nltk.parser.probabilistic defines probabilistic parsing, which associates a probability with each parse.