1
2
3
4
5
6
7
8 """
9 A collection of methods for tree (grammar) transformations used
10 in parsing natural language.
11
12 Although many of these methods are technically grammar transformations
13 (ie. Chomsky Norm Form), when working with treebanks it is much more
14 natural to visualize these modifications in a tree structure. Hence,
15 we will do all transformation directly to the tree itself.
16 Transforming the tree directly also allows us to do parent annotation.
17 A grammar can then be simply induced from the modified tree.
18
19 The following is a short tutorial on the available transformations.
20
21 1. Chomsky Normal Form (binarization)
22
23 It is well known that any grammar has a Chomsky Normal Form (CNF)
24 equivalent grammar where CNF is defined by every production having
25 either two non-terminals or one terminal on its right hand side.
26 When we have hierarchically structured data (ie. a treebank), it is
27 natural to view this in terms of productions where the root of every
28 subtree is the head (left hand side) of the production and all of
29 its children are the right hand side constituents. In order to
30 convert a tree into CNF, we simply need to ensure that every subtree
31 has either two subtrees as children (binarization), or one leaf node
32 (non-terminal). In order to binarize a subtree with more than two
33 children, we must introduce artificial nodes.
34
35 There are two popular methods to convert a tree into CNF: left
36 factoring and right factoring. The following example demonstrates
37 the difference between them. Example::
38
39 Original Right-Factored Left-Factored
40
41 A A A
42 / | \ / \ / \
43 B C D ==> B A|<C-D> OR A|<B-C> D
44 / \ / \
45 C D B C
46
47 2. Parent Annotation
48
49 In addition to binarizing the tree, there are two standard
50 modifications to node labels we can do in the same traversal: parent
51 annotation and Markov order-N smoothing (or sibling smoothing).
52
53 The purpose of parent annotation is to refine the probabilities of
54 productions by adding a small amount of context. With this simple
55 addition, a CYK (inside-outside, dynamic programming chart parse)
56 can improve from 74% to 79% accuracy. A natural generalization from
57 parent annotation is to grandparent annotation and beyond. The
58 tradeoff becomes accuracy gain vs. computational complexity. We
59 must also keep in mind data sparcity issues. Example::
60
61 Original Parent Annotation
62
63 A A^<?>
64 / | \ / \
65 B C D ==> B^<A> A|<C-D>^<?> where ? is the
66 / \ parent of A
67 C^<A> D^<A>
68
69
70 3. Markov order-N smoothing
71
72 Markov smoothing combats data sparcity issues as well as decreasing
73 computational requirements by limiting the number of children
74 included in artificial nodes. In practice, most people use an order
75 2 grammar. Example::
76
77 Original No Smoothing Markov order 1 Markov order 2 etc.
78
79 __A__ A A A
80 / /|\ \ / \ / \ / \
81 B C D E F ==> B A|<C-D-E-F> ==> B A|<C> ==> B A|<C-D>
82 / \ / \ / \
83 C ... C ... C ...
84
85
86
87 Annotation decisions can be thought about in the vertical direction
88 (parent, grandparent, etc) and the horizontal direction (number of
89 siblings to keep). Parameters to the following functions specify
90 these values. For more information see:
91
92 Dan Klein and Chris Manning (2003) "Accurate Unlexicalized
93 Parsing", ACL-03. http://www.aclweb.org/anthology/P03-1054
94
95 4. Unary Collapsing
96
97 Collapse unary productions (ie. subtrees with a single child) into a
98 new non-terminal (Tree node). This is useful when working with
99 algorithms that do not allow unary productions, yet you do not wish
100 to lose the parent information. Example::
101
102 A
103 |
104 B ==> A+B
105 / \ / \
106 C D C D
107
108 """
109
110 from nltk import Tree
111
164
165
206
207
208 -def collapse_unary(tree, collapsePOS = False, collapseRoot = False, joinChar = "+"):
209 """
210 Collapse subtrees with a single child (ie. unary productions)
211 into a new non-terminal (Tree node) joined by 'joinChar'.
212 This is useful when working with algorithms that do not allow
213 unary productions, and completely removing the unary productions
214 would require loss of useful information. The Tree is modified
215 directly (since it is passed by reference) and no value is returned.
216
217 @param tree: The Tree to be collapsed
218 @type tree: C{Tree}
219 @param collapsePOS: 'False' (default) will not collapse the parent of leaf nodes (ie.
220 Part-of-Speech tags) since they are always unary productions
221 @type collapsePOS: C{boolean}
222 @param collapseRoot: 'False' (default) will not modify the root production
223 if it is unary. For the Penn WSJ treebank corpus, this corresponds
224 to the TOP -> productions.
225 @type collapseRoot: C{boolean}
226 @param joinChar: A string used to connect collapsed node values (default = "+")
227 @type joinChar: C{string}
228 """
229
230 if collapseRoot == False and isinstance(tree, Tree) and len(tree) == 1:
231 nodeList = [tree[0]]
232 else:
233 nodeList = [tree]
234
235
236 while nodeList != []:
237 node = nodeList.pop()
238 if isinstance(node,Tree):
239 if len(node) == 1 and isinstance(node[0], Tree) and (collapsePOS == True or isinstance(node[0,0], Tree)):
240 node.node += joinChar + node[0].node
241 node[0:] = [child for child in node[0]]
242
243
244 nodeList.append(node)
245 else:
246 for child in node:
247 nodeList.append(child)
248
249
250
251
252
254 """
255 A demonstration showing how each tree transform can be used.
256 """
257
258 from nltk.draw.tree import draw_trees
259 from nltk import treetransforms, bracket_parse
260 from copy import deepcopy
261
262
263 sentence = """(TOP
264 (S
265 (S
266 (VP
267 (VBN Turned)
268 (ADVP (RB loose))
269 (PP
270 (IN in)
271 (NP
272 (NP (NNP Shane) (NNP Longman) (POS 's))
273 (NN trading)
274 (NN room)))))
275 (, ,)
276 (NP (DT the) (NN yuppie) (NNS dealers))
277 (VP (AUX do) (NP (NP (RB little)) (ADJP (RB right))))
278 (. .)))"""
279 tree = bracket_parse(sentence)
280
281
282 collapsedTree = deepcopy(tree)
283 treetransforms.collapse_unary(collapsedTree)
284
285
286 cnfTree = deepcopy(collapsedTree)
287 treetransforms.chomsky_normal_form(cnfTree)
288
289
290 parentTree = deepcopy(collapsedTree)
291 treetransforms.chomsky_normal_form(parentTree, horzMarkov=2, vertMarkov=1)
292
293
294 original = deepcopy(parentTree)
295 treetransforms.un_chomsky_normal_form(original)
296
297
298 sentence2 = original.pprint()
299 print sentence
300 print sentence2
301 print "Sentences the same? ", sentence == sentence2
302
303 draw_trees(tree, collapsedTree, cnfTree, parentTree, original)
304
305 if __name__ == '__main__':
306 demo()
307