nltk.treetransforms

1 # Natural Language Toolkit: Tree Transformations 2 # 3 # Copyright (C) 2005-2007 Oregon Graduate Institute 4 # Author: Nathan Bodenstab <[email protected]> 5 # URL: <http://nltk.org> 6 # For license information, see LICENSE.TXT 7 8 """ 9 A collection of methods for tree (grammar) transformations used 10 in parsing natural language. 11 12 Although many of these methods are technically grammar transformations 13 (ie. Chomsky Norm Form), when working with treebanks it is much more 14 natural to visualize these modifications in a tree structure. Hence, 15 we will do all transformation directly to the tree itself. 16 Transforming the tree directly also allows us to do parent annotation. 17 A grammar can then be simply induced from the modified tree. 18 19 The following is a short tutorial on the available transformations. 20 21 1. Chomsky Normal Form (binarization) 22 23 It is well known that any grammar has a Chomsky Normal Form (CNF) 24 equivalent grammar where CNF is defined by every production having 25 either two non-terminals or one terminal on its right hand side. 26 When we have hierarchically structured data (ie. a treebank), it is 27 natural to view this in terms of productions where the root of every 28 subtree is the head (left hand side) of the production and all of 29 its children are the right hand side constituents. In order to 30 convert a tree into CNF, we simply need to ensure that every subtree 31 has either two subtrees as children (binarization), or one leaf node 32 (non-terminal). In order to binarize a subtree with more than two 33 children, we must introduce artificial nodes. 34 35 There are two popular methods to convert a tree into CNF: left 36 factoring and right factoring. The following example demonstrates 37 the difference between them. Example:: 38 39 Original Right-Factored Left-Factored 40 41 A A A 42 / | \ / \ / \ 43 B C D ==> B A|<C-D> OR A|<B-C> D 44 / \ / \ 45 C D B C 46 47 2. Parent Annotation 48 49 In addition to binarizing the tree, there are two standard 50 modifications to node labels we can do in the same traversal: parent 51 annotation and Markov order-N smoothing (or sibling smoothing). 52 53 The purpose of parent annotation is to refine the probabilities of 54 productions by adding a small amount of context. With this simple 55 addition, a CYK (inside-outside, dynamic programming chart parse) 56 can improve from 74% to 79% accuracy. A natural generalization from 57 parent annotation is to grandparent annotation and beyond. The 58 tradeoff becomes accuracy gain vs. computational complexity. We 59 must also keep in mind data sparcity issues. Example:: 60 61 Original Parent Annotation 62 63 A A^<?> 64 / | \ / \ 65 B C D ==> B^<A> A|<C-D>^<?> where ? is the 66 / \ parent of A 67 C^<A> D^<A> 68 69 70 3. Markov order-N smoothing 71 72 Markov smoothing combats data sparcity issues as well as decreasing 73 computational requirements by limiting the number of children 74 included in artificial nodes. In practice, most people use an order 75 2 grammar. Example:: 76 77 Original No Smoothing Markov order 1 Markov order 2 etc. 78 79 __A__ A A A 80 / /|\ \ / \ / \ / \ 81 B C D E F ==> B A|<C-D-E-F> ==> B A|<C> ==> B A|<C-D> 82 / \ / \ / \ 83 C ... C ... C ... 84 85 86 87 Annotation decisions can be thought about in the vertical direction 88 (parent, grandparent, etc) and the horizontal direction (number of 89 siblings to keep). Parameters to the following functions specify 90 these values. For more information see: 91 92 Dan Klein and Chris Manning (2003) "Accurate Unlexicalized 93 Parsing", ACL-03. http://www.aclweb.org/anthology/P03-1054 94 95 4. Unary Collapsing 96 97 Collapse unary productions (ie. subtrees with a single child) into a 98 new non-terminal (Tree node). This is useful when working with 99 algorithms that do not allow unary productions, yet you do not wish 100 to lose the parent information. Example:: 101 102 A 103 | 104 B ==> A+B 105 / \ / \ 106 C D C D 107 108 """ 109 110 from nltk import Tree 111

112 -def chomsky_normal_form(tree, factor = "right", horzMarkov = None, vertMarkov = 0, childChar = "|", parentChar = "^"):

113 # assume all subtrees have homogeneous children 114 # assume all terminals have no siblings 115 116 # A semi-hack to have elegant looking code below. As a result, 117 # any subtree with a branching factor greater than 999 will be incorrectly truncated. 118 if horzMarkov == None: horzMarkov = 999 119 120 # Traverse the tree depth-first keeping a list of ancestor nodes to the root. 121 # I chose not to use the tree.treepositions() method since it requires 122 # two traversals of the tree (one to get the positions, one to iterate 123 # over them) and node access time is proportional to the height of the node. 124 # This method is 7x faster which helps when parsing 40,000 sentences. 125 126 nodeList = [(tree, [tree.node])] 127 while nodeList != []: 128 node, parent = nodeList.pop() 129 if isinstance(node,Tree): 130 131 # parent annotation 132 parentString = "" 133 originalNode = node.node 134 if vertMarkov != 0 and node != tree and isinstance(node[0],Tree): 135 parentString = "%s<%s>" % (parentChar, "-".join(parent)) 136 node.node += parentString 137 parent = [originalNode] + parent[:vertMarkov - 1] 138 139 # add children to the agenda before we mess with them 140 for child in node: 141 nodeList.append((child, parent)) 142 143 # chomsky normal form factorization 144 if len(node) > 2: 145 childNodes = [child.node for child in node] 146 nodeCopy = node.copy() 147 node[0:] = [] # delete the children 148 149 curNode = node 150 numChildren = len(nodeCopy) 151 for i in range(1,numChildren - 1): 152 if factor == "right": 153 newHead = "%s%s<%s>%s" % (originalNode, childChar, "-".join(childNodes[i:min([i+horzMarkov,numChildren])]),parentString) # create new head 154 newNode = Tree(newHead, []) 155 curNode[0:] = [nodeCopy.pop(0), newNode] 156 else: 157 newHead = "%s%s<%s>%s" % (originalNode, childChar, "-".join(childNodes[max([numChildren-i-horzMarkov,0]):-i]),parentString) 158 newNode = Tree(newHead, []) 159 curNode[0:] = [newNode, nodeCopy.pop()] 160 161 curNode = newNode 162 163 curNode[0:] = [child for child in nodeCopy]

164 165

166 -def un_chomsky_normal_form(tree, expandUnary = True, childChar = "|", parentChar = "^", unaryChar = "+"):

167 # Traverse the tree-depth first keeping a pointer to the parent for modification purposes. 168 nodeList = [(tree,[])] 169 while nodeList != []: 170 node,parent = nodeList.pop() 171 if isinstance(node,Tree): 172 # if the node contains the 'childChar' character it means that 173 # it is an artificial node and can be removed, although we still need 174 # to move its children to its parent 175 childIndex = node.node.find(childChar) 176 if childIndex != -1: 177 nodeIndex = parent.index(node) 178 parent.remove(parent[nodeIndex]) 179 # Generated node was on the left if the nodeIndex is 0 which 180 # means the grammar was left factored. We must insert the children 181 # at the beginning of the parent's children 182 if nodeIndex == 0: 183 parent.insert(0,node[0]) 184 parent.insert(1,node[1]) 185 else: 186 parent.extend([node[0],node[1]]) 187 188 # parent is now the current node so the children of parent will be added to the agenda 189 node = parent 190 else: 191 parentIndex = node.node.find(parentChar) 192 if parentIndex != -1: 193 # strip the node name of the parent annotation 194 node.node = node.node[:parentIndex] 195 196 # expand collapsed unary productions 197 if expandUnary == True: 198 unaryIndex = node.node.find(unaryChar) 199 if unaryIndex != -1: 200 newNode = Tree(node.node[unaryIndex + 1:], [i for i in node]) 201 node.node = node.node[:unaryIndex] 202 node[0:] = [newNode] 203 204 for child in node: 205 nodeList.append((child,node))

206 207

208 -def collapse_unary(tree, collapsePOS = False, collapseRoot = False, joinChar = "+"):

209 """ 210 Collapse subtrees with a single child (ie. unary productions) 211 into a new non-terminal (Tree node) joined by 'joinChar'. 212 This is useful when working with algorithms that do not allow 213 unary productions, and completely removing the unary productions 214 would require loss of useful information. The Tree is modified 215 directly (since it is passed by reference) and no value is returned. 216 217 @param tree: The Tree to be collapsed 218 @type tree: C{Tree} 219 @param collapsePOS: 'False' (default) will not collapse the parent of leaf nodes (ie. 220 Part-of-Speech tags) since they are always unary productions 221 @type collapsePOS: C{boolean} 222 @param collapseRoot: 'False' (default) will not modify the root production 223 if it is unary. For the Penn WSJ treebank corpus, this corresponds 224 to the TOP -> productions. 225 @type collapseRoot: C{boolean} 226 @param joinChar: A string used to connect collapsed node values (default = "+") 227 @type joinChar: C{string} 228 """ 229 230 if collapseRoot == False and isinstance(tree, Tree) and len(tree) == 1: 231 nodeList = [tree[0]] 232 else: 233 nodeList = [tree] 234 235 # depth-first traversal of tree 236 while nodeList != []: 237 node = nodeList.pop() 238 if isinstance(node,Tree): 239 if len(node) == 1 and isinstance(node[0], Tree) and (collapsePOS == True or isinstance(node[0,0], Tree)): 240 node.node += joinChar + node[0].node 241 node[0:] = [child for child in node[0]] 242 # since we assigned the child's children to the current node, 243 # evaluate the current node again 244 nodeList.append(node) 245 else: 246 for child in node: 247 nodeList.append(child)

248 249 ################################################################# 250 # Demonstration 251 ################################################################# 252

253 -def demo():

254 """ 255 A demonstration showing how each tree transform can be used. 256 """ 257 258 from nltk.draw.tree import draw_trees 259 from nltk import treetransforms, bracket_parse 260 from copy import deepcopy 261 262 # original tree from WSJ bracketed text 263 sentence = """(TOP 264 (S 265 (S 266 (VP 267 (VBN Turned) 268 (ADVP (RB loose)) 269 (PP 270 (IN in) 271 (NP 272 (NP (NNP Shane) (NNP Longman) (POS 's)) 273 (NN trading) 274 (NN room))))) 275 (, ,) 276 (NP (DT the) (NN yuppie) (NNS dealers)) 277 (VP (AUX do) (NP (NP (RB little)) (ADJP (RB right)))) 278 (. .)))""" 279 tree = bracket_parse(sentence) 280 281 # collapse subtrees with only one child 282 collapsedTree = deepcopy(tree) 283 treetransforms.collapse_unary(collapsedTree) 284 285 # convert the tree to CNF 286 cnfTree = deepcopy(collapsedTree) 287 treetransforms.chomsky_normal_form(cnfTree) 288 289 # convert the tree to CNF with parent annotation (one level) and horizontal smoothing of order two 290 parentTree = deepcopy(collapsedTree) 291 treetransforms.chomsky_normal_form(parentTree, horzMarkov=2, vertMarkov=1) 292 293 # convert the tree back to its original form (used to make CYK results comparable) 294 original = deepcopy(parentTree) 295 treetransforms.un_chomsky_normal_form(original) 296 297 # convert tree back to bracketed text 298 sentence2 = original.pprint() 299 print sentence 300 print sentence2 301 print "Sentences the same? ", sentence == sentence2 302 303 draw_trees(tree, collapsedTree, cnfTree, parentTree, original)

304 305 if __name__ == '__main__': 306 demo() 307

Source Code for Module nltk.treetransforms