A collection of methods for tree (grammar) transformations used in
parsing natural language.
Although many of these methods are technically grammar transformations
(ie. Chomsky Norm Form), when working with treebanks it is much more
natural to visualize these modifications in a tree structure. Hence, we
will do all transformation directly to the tree itself. Transforming the
tree directly also allows us to do parent annotation. A grammar can then
be simply induced from the modified tree.
The following is a short tutorial on the available
transformations.
-
Chomsky Normal Form (binarization)
It is well known that any grammar has a Chomsky Normal Form (CNF)
equivalent grammar where CNF is defined by every production having
either two non-terminals or one terminal on its right hand side. When
we have hierarchically structured data (ie. a treebank), it is
natural to view this in terms of productions where the root of every
subtree is the head (left hand side) of the production and all of its
children are the right hand side constituents. In order to convert a
tree into CNF, we simply need to ensure that every subtree has either
two subtrees as children (binarization), or one leaf node
(non-terminal). In order to binarize a subtree with more than two
children, we must introduce artificial nodes.
There are two popular methods to convert a tree into CNF: left
factoring and right factoring. The following example demonstrates
the difference between them. Example:
Original Right-Factored Left-Factored
A A A
/ | \ / \ / \
B C D ==> B A|<C-D> OR A|<B-C> D
/ \ / \
C D B C
-
Parent Annotation
In addition to binarizing the tree, there are two standard
modifications to node labels we can do in the same traversal: parent
annotation and Markov order-N smoothing (or sibling smoothing).
The purpose of parent annotation is to refine the probabilities of
productions by adding a small amount of context. With this simple
addition, a CYK (inside-outside, dynamic programming chart parse) can
improve from 74% to 79% accuracy. A natural generalization from
parent annotation is to grandparent annotation and beyond. The
tradeoff becomes accuracy gain vs. computational complexity. We must
also keep in mind data sparcity issues. Example:
Original Parent Annotation
A A^<?>
/ | \ / \
B C D ==> B^<A> A|<C-D>^<?> where ? is the
/ \ parent of A
C^<A> D^<A>
-
Markov order-N smoothing
Markov smoothing combats data sparcity issues as well as
decreasing computational requirements by limiting the number of
children included in artificial nodes. In practice, most people use
an order 2 grammar. Example:
Original No Smoothing Markov order 1 Markov order 2 etc.
__A__ A A A
/ /|\ \ / \ / \ / \
B C D E F ==> B A|<C-D-E-F> ==> B A|<C> ==> B A|<C-D>
/ \ / \ / \
C ... C ... C ...
Annotation decisions can be thought about in the vertical
direction (parent, grandparent, etc) and the horizontal direction
(number of siblings to keep). Parameters to the following functions
specify these values. For more information see:
Dan Klein and Chris Manning (2003) "Accurate Unlexicalized
Parsing", ACL-03. http://www.aclweb.org/anthology/P03-1054
-
Unary Collapsing
Collapse unary productions (ie. subtrees with a single child) into
a new non-terminal (Tree node). This is useful when working with
algorithms that do not allow unary productions, yet you do not wish
to lose the parent information. Example:
A
|
B ==> A+B
/ \ / \
C D C D
|
chomsky_normal_form(tree,
factor=' right ' ,
horzMarkov=None,
vertMarkov=0,
childChar=' | ' ,
parentChar=' ^ ' ) |
source code
|
|
|
un_chomsky_normal_form(tree,
expandUnary=True,
childChar=' | ' ,
parentChar=' ^ ' ,
unaryChar=' + ' ) |
source code
|
|
|
collapse_unary(tree,
collapsePOS=False,
collapseRoot=False,
joinChar=' + ' )
Collapse subtrees with a single child (ie. |
source code
|
|
|
demo()
A demonstration showing how each tree transform can be used. |
source code
|
|