Module cfg
source code
Basic data classes for representing context free grammars. A grammar specifies which
trees can represent the structure of a given text. Each of these trees
is called a parse
tree for the text (or simply a parse). In a context free grammar, the set of parse trees for
any piece of a text can depend only on that piece, and not on the rest of
the text (i.e., the piece's context). Context free grammars are often
used to find possible syntactic structures for sentences. In this
context, the leaves of a parse tree are word tokens; and the node values
are phrasal categories, such as NP
and VP
.
The Grammar
class is used to encode context free grammars. Each Grammar
consists of a start symbol and a set of productions. The start symbol
specifies the root node value for parse trees. For example, the start
symbol for syntactic parsing is usually S
. Start symbols
are encoded using the Nonterminal
class, which is discussed
below.
A Grammar's productions specify what parent-child relationships
a parse tree can contain. Each production specifies that a particular
node can be the parent of a particular set of children. For example, the
production <S> -> <NP> <VP>
specifies
that an S
node can be the parent of an NP
node
and a VP
node.
Grammar productions are implemented by the Production
class. Each Production
consists of a left hand side and a
right hand side. The left hand side is a Nonterminal
that
specifies the node type for a potential parent; and the right hand side
is a list that specifies allowable children for that parent. This lists
consists of Nonterminals
and text types: each
Nonterminal
indicates that the corresponding child may be a
TreeToken
with the specified node type; and each text type
indicates that the corresponding child may be a Token
with
the with that type.
The Nonterminal
class is used to distinguish node values
from leaf values. This prevents the grammar from accidentally using a
leaf value (such as the English word "A") as the node of a
subtree. Within a Grammar
, all node values are wrapped in
the Nonterminal
class. Note, however, that the trees that
are specified by the grammar do not include these
Nonterminal
wrappers.
Grammars can also be given a more procedural interpretation.
According to this interpretation, a Grammar specifies any tree structure
tree that can be produced by the following
procedure:
-
Set tree to the start symbol
-
Repeat until tree contains no more nonterminal
leaves:
-
Choose a production prod with whose left hand
side lhs is a nonterminal leaf of tree.
-
Replace the nonterminal leaf with a subtree, whose node value is
the value wrapped by the nonterminal lhs, and
whose children are the right hand side of prod.
The operation of replacing the left hand side (lhs) of a production with the right hand side (rhs) in a tree (tree) is known as expanding lhs to rhs in tree.
list of Nonterminal
|
nonterminals(symbols)
Given a string containing a list of symbol names, return a list of
Nonterminals constructed from those symbols. |
source code
|
|
|
parse_cfg_production(s)
Returns a list of productions |
source code
|
|
|
|
|
|
|
parse_pcfg_production(s)
Returns a list of PCFG productions |
source code
|
|
|
|
|
earley_lexicon(productions)
Convert CFG lexical productions into a dictionary indexed by the
lexical string. |
source code
|
|
|
parse_fcfg_production(line,
fstruct_parser) |
source code
|
|
|
parse_fcfg(input,
features=None)
Return a tuple (list of grammatical productions, lexicon dict). |
source code
|
|
|
cfg_demo()
A demonstration showing how Grammar s can be created and
used. |
source code
|
|
|
pcfg_demo()
A demonstration showing how PCFG Grammar s can be created
and used. |
source code
|
|
|
|
|
|
|
_PARSE_CFG_RE = re.compile(r'(?x) ^\s* ( \w+ (?: /\w+ ) ? ) \s* (?: [ -=] + ...
|
|
_SPLIT_CFG_RE = re.compile(r'( \w+ (?: /\w+ ) ? | [ -=] + >| "[^ "] + "| \'[^ ...
|
|
_PARSE_PCFG_RE = re.compile(r'(?x) ^\s* ( \w+ (?: /\w+ ) ? ) \s* (?: [ -=] ...
|
|
_SPLIT_PCFG_RE = re.compile(r'( \w+ (?: /\w+ ) ? | \[[ 01] ? \.\d+ \]| [ -=...
|
|
toy_pcfg1 = <Grammar with 17 productions>
|
|
toy_pcfg2 = <Grammar with 23 productions>
|
Given a string containing a list of symbol names, return a list of
Nonterminals constructed from those symbols.
- Parameters:
symbols (string ) - The symbol name string. This string can be delimited by either
spaces or commas.
- Returns:
list of Nonterminal
- A list of
Nonterminals constructed from the symbol
names given in symbols . The
Nonterminals are sorted in the same order as the
symbols names.
|
Induce a PCFG grammar from a list of productions.
The probability of a production A -> B C in a PCFG is:
| count(A -> B C) | P(B, C | A) = ---------------
where * is any right hand side | count(A -> *)
- Parameters:
start (Nonterminal) - The start symbol
productions (list of Production) - The list of productions that defines the grammar
|
Return a tuple (list of grammatical productions, lexicon dict).
- Parameters:
input - a grammar, either in the form of a string or else as a list of
strings.
|
_PARSE_CFG_RE
- Value:
re.compile(r'(?x) ^\s* ( \w+ (?: /\w+ ) ? ) \s* (?: [ -=] + >) \s* (?: ( "[^ "] + "| \'[^ \']
+ \'| \w+ (?: /\w+ ) ? | \|) \s* ) * $')
|
|
_SPLIT_CFG_RE
- Value:
re.compile(r'( \w+ (?: /\w+ ) ? | [ -=] + >| "[^ "] + "| \'[^ \'] + \'| \|) ')
|
|
_PARSE_PCFG_RE
- Value:
re.compile(r'(?x) ^\s* ( \w+ (?: /\w+ ) ? ) \s* (?: [ -=] + >) \s* (?: ( "[^ "] + "| \'[^ \']
+ \'| \w+ (?: /\w+ ) ? | \[[ 01] ? \.\d+ \]| \|) \s* ) * $')
|
|
_SPLIT_PCFG_RE
- Value:
re.compile(r'( \w+ (?: /\w+ ) ? | \[[ 01] ? \.\d+ \]| [ -=] + >| "[^ "] + "| \'[^ \'] + \'| \|
) ')
|
|