nltk.corpus.chat80

# Natural Language Toolkit: Chat-80 KB Reader
# See http://www.w3.org/TR/swbp-skos-core-guide/
#
# Author: Ewan Klein <[email protected]>,
# URL: <http://nltk.sourceforge.net>
# For license information, see LICENSE.TXT

"""
Overview
========

Chat-80 was a natural language system which allowed the user to
interrogate a Prolog knowledge base in the domain of world
geography. It was developed in the early '80s by Warren and Pereira; see
U{http://acl.ldc.upenn.edu/J/J82/J82-3002.pdf} for a description and
U{http://www.cis.upenn.edu/~pereira/oldies.html} for the source
files.

This module contains functions to extract data from the Chat-80
relation files ('the world database'), and convert them into a format
that can be incorporated in the FOL models of
L{nltk.sem.evaluate}. The code assumes that the Prolog
input files are available in the NLTK corpora directory.

The Chat-80 World Database consists of the following files::

    world0.pl
    rivers.pl
    cities.pl
    countries.pl
    contain.pl
    borders.pl

This module uses a slightly modified version of C{world0.pl}, in which
a set of Prolog rules has been omitted. The modified file is named
C{world1.pl}. Currently, the file C{rivers.pl} is not read in, since
it uses a list rather than a string in the second field.

Reading Chat-80 Files
=====================

Chat-80 relations are like tables in a relational database. The
relation acts as the name of the table; the first argument acts as the
'primary key'; and subsequent arguments are further fields in the
table. In general, the name of the table provides a label for a unary
predicate whose extension is all the primary keys. For example,
relations in C{cities.pl} are of the following form::

   'city(athens,greece,1368).'

Here, C{'athens'} is the key, and will be mapped to a member of the
unary predicate M{city}.

The fields in the table are mapped to binary predicates. The first
argument of the predicate is the primary key, while the second
argument is the data in the relevant field. Thus, in the above
example, the third field is mapped to the binary predicate
M{population_of}, whose extension is a set of pairs such as C{'(athens,
1368)'}.

An exception to this general framework is required by the relations in
the files C{borders.pl} and C{contain.pl}. These contain facts of the
following form::

    'borders(albania,greece).'

    'contains0(africa,central_africa).'

We do not want to form a unary concept out of the element in
the first field of these records, and we want the label of the binary
relation just to be C{'border'}/C{'contain'} respectively.

In order to drive the extraction process, we use 'relation metadata bundles'
which are Python dictionaries such as the following::

  city = {'rel_name': 'city',
          'closures': [],
          'schema': ['city', 'country', 'population'],
          'filename': 'cities.pl'}

According to this, the file C{city['filename']} contains a list of
relational tuples (or more accurately, the corresponding strings in
Prolog form) whose predicate symbol is C{city['rel_name']} and whose
relational schema is C{city['schema']}. The notion of a C{closure} is
discussed in the next section.

Concepts
========
In order to encapsulate the results of the extraction, a class of
L{Concept}s is introduced. A L{Concept} object has a number of
attributes, in particular a C{prefLabel} and C{extension}, which make
it easier to inspect the output of the extraction. In addition, the
C{extension} can be further processed: in the case of the C{'border'}
relation, we check that the relation is B{symmetric}, and in the case
of the C{'contain'} relation, we carry out the B{transitive
closure}.
The closure properties associated with a concept are
indicated in the relation metadata, as described earlier.

The C{extension} of a L{Concept} object is then incorporated into a
L{Valuation} object.

Persistence
===========
The functions L{val_dump} and L{val_load} are provided to allow a
valuation to be stored in a persistent database and re-loaded, rather
than having to be re-computed each time.

Individuals and Lexical Items
=============================
As well as deriving relations from the Chat-80 data, we also create a
set of individual constants, one for each entity in the domain. The
individual constants are string-identical to the entities. For
example, given a data item such as C{'zloty'}, we add to the valuation
a pair C{('zloty', 'zloty')}. In order to parse English sentences that
refer to these entities, we also create a lexical item such as the
following for each individual constant::

   PropN[num=sg, sem=<\P.(P zloty)>] -> 'Zloty'

The set of rules is written to the file C{chat_pnames.cfg} in the
current directory.

"""

import re
import shelve
import os
import sys

import nltk.data

from util import *

###########################################################################
# Chat-80 relation metadata bundles needed to build the valuation
###########################################################################

borders = {'rel_name': 'borders',
           'closures': ['symmetric'],
           'schema': ['region', 'border'],
           'filename': 'borders.pl'}

contains = {'rel_name': 'contains0',
            'closures': ['transitive'],
            'schema': ['region', 'contain'],
            'filename': 'contain.pl'}

city = {'rel_name': 'city',
        'closures': [],
        'schema': ['city', 'country', 'population'],
        'filename': 'cities.pl'}

country = {'rel_name': 'country',
           'closures': [],
           'schema': ['country', 'region', 'latitude', 'longitude',
                      'area', 'population', 'capital', 'currency'],
           'filename': 'countries.pl'}

circle_of_lat = {'rel_name': 'circle_of_latitude',
                 'closures': [],
                 'schema': ['circle_of_latitude', 'degrees'],
                 'filename': 'world1.pl'}

circle_of_long = {'rel_name': 'circle_of_longitude',
                  'closures': [],
                  'schema': ['circle_of_longitude', 'degrees'],
                  'filename': 'world1.pl'}

continent = {'rel_name': 'continent',
             'closures': [],
             'schema': ['continent'],
             'filename': 'world1.pl'}

region = {'rel_name': 'in_continent',
          'closures': [],
          'schema': ['region', 'continent'],
          'filename': 'world1.pl'}

ocean = {'rel_name': 'ocean',
         'closures': [],
         'schema': ['ocean'],
         'filename': 'world1.pl'}

sea = {'rel_name': 'sea',
       'closures': [],
       'schema': ['sea'],
       'filename': 'world1.pl'}


items = ['borders', 'contains', 'city', 'country', 'circle_of_lat',
         'circle_of_long', 'continent', 'region', 'ocean', 'sea']
items = tuple(sorted(items))

item_metadata = {
    'borders': borders,
    'contains': contains,
    'city': city,
    'country': country,
    'circle_of_lat': circle_of_lat,
    'circle_of_long': circle_of_long,
    'continent': continent,
    'region': region,
    'ocean': ocean,
    'sea': sea
    }

rels = item_metadata.values()

not_unary = ['borders.pl', 'contain.pl']

###########################################################################

class Concept(object):
    """
    A Concept class, loosely
    based on SKOS (U{http://www.w3.org/TR/swbp-skos-core-guide/}).
    """

    def __init__(self, prefLabel, arity, altLabels=None, closures=None, extension=None):
        """
        @param prefLabel: the preferred label for the concept
        @type prefLabel: str
        @param arity: the arity of the concept
        @type arity: int
        @keyword altLabels: other (related) labels
        @type altLabels: list
        @keyword closures: closure properties of the extension \
            (list items can be C{symmetric}, C{reflexive}, C{transitive})
        @type closures: list
        @keyword extension: the extensional value of the concept
        @type extension: set
        """
        # Defaults are created per instance: mutable default arguments
        # would be shared between every Concept that relied on them.
        self.prefLabel = prefLabel
        self.arity = arity
        self.altLabels = altLabels if altLabels is not None else []
        self.closures = closures if closures is not None else []
        self.extension = extension if extension is not None else set()

    def __str__(self):
        # str() is applied to each element so that the pairs in a
        # binary extension can be joined as well as plain strings.
        _extension = ','.join(str(element) for element in sorted(self.extension))
        return "Label = '%s'\nArity = %s\nExtension = {%s}" % \
               (self.prefLabel, self.arity, _extension)

    def __repr__(self):
        return "Concept('%s')" % self.prefLabel

    def augment(self, data):
        """
        Add more data to the C{Concept}'s extension set.

        @param data: a new semantic value
        @type data: string or pair of strings
        @rtype: set

        """
        self.extension.add(data)
        return self.extension

    def _make_graph(self, s):
        """
        Convert a set of pairs into an adjacency linked list encoding of a graph.
        """
        g = {}
        for (x, y) in s:
            if x in g:
                g[x].append(y)
            else:
                g[x] = [y]
        return g

    def _transclose(self, g):
        """
        Compute the transitive closure of a graph represented as a linked list.
        """
        for x in g:
            for adjacent in g[x]:
                # check that adjacent is a key
                if adjacent in g:
                    for y in g[adjacent]:
                        if y not in g[x]:
                            g[x].append(y)
        return g

    def _make_pairs(self, g):
        """
        Convert an adjacency linked list back into a set of pairs.
        """
        pairs = []
        for node in g:
            for adjacent in g[node]:
                pairs.append((node, adjacent))
        return set(pairs)

    def close(self):
        """
        Close a binary relation in the C{Concept}'s extension set.

        @return: a new extension for the C{Concept} in which the
                 relation is closed under a given property
        """
        from nltk.sem import is_rel
        assert is_rel(self.extension)
        if 'symmetric' in self.closures:
            pairs = []
            for (x, y) in self.extension:
                pairs.append((y, x))
            sym = set(pairs)
            self.extension = self.extension.union(sym)
        if 'transitive' in self.closures:
            # 'graph' rather than 'all', to avoid shadowing the builtin
            graph = self._make_graph(self.extension)
            closed = self._transclose(graph)
            trans = self._make_pairs(closed)
            #print sorted(trans)
            self.extension = self.extension.union(trans)
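The two closure properties handled by `Concept.close` can be illustrated on plain sets of pairs. The following is a minimal sketch rather than the module's own method: it uses a fixed-point loop instead of the adjacency-list helpers, the function names `symmetric_closure` and `transitive_closure` are hypothetical, and the `'chad'` pair is made up for illustration.

```python
def symmetric_closure(pairs):
    """Return the pairs together with the reverse of every pair."""
    return set(pairs) | set((y, x) for (x, y) in pairs)


def transitive_closure(pairs):
    """Return the smallest transitive relation containing the pairs."""
    closed = set(pairs)
    while True:
        # every (x, w) reachable by chaining two pairs already in the relation
        new = set((x, w) for (x, y) in closed for (z, w) in closed if y == z)
        if new <= closed:
            return closed
        closed |= new


# 'borders(albania,greece).' yields one pair; symmetry adds its mirror image.
borders = symmetric_closure([('albania', 'greece')])

# The second pair is illustrative only, not taken from the Chat-80 files.
contains = transitive_closure([('africa', 'central_africa'),
                               ('central_africa', 'chad')])
```

The fixed-point loop is quadratic per pass, which is acceptable for a small relation like this one but would be a poor choice for large graphs.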

def clause2concepts(filename, rel_name, closures, schema):
    """
    Convert a file of Prolog clauses into a list of L{Concept} objects.

    @param filename: filename containing the relations
    @type filename: string
    @param rel_name: name of the relation
    @type rel_name: string
    @param closures: closure properties for the extension of the concept
    @type closures: list
    @param schema: the schema used in a set of relational tuples
    @type schema: list
    @return: a list of L{Concept}s
    @rtype: list
    """
    concepts = []
    # position of the subject of a binary relation
    subj = 0
    # label of the 'primary key'
    pkey = schema[0]
    # fields other than the primary key
    fields = schema[1:]

    # convert a file into a list of lists
    records = _str2records(filename, rel_name)

    # add a unary concept corresponding to the set of entities
    # in the primary key position
    # relations in 'not_unary' are more like ordinary binary relations
    if filename not in not_unary:
        concepts.append(unary_concept(pkey, subj, records))

    # add a binary concept for each non-key field
    for field in fields:
        obj = schema.index(field)
        concepts.append(binary_concept(field, closures, subj, obj, records))

    return concepts

def _str2records(filename, rel):
    """
    Read a file into memory and convert each relation clause into a list.
    """
    recs = []
    path = nltk.data.find("corpora/chat80/%s" % filename)
    for line in open(path):
        if line.startswith(rel):
            # strip the leading 'rel(' and the trailing ').'
            line = re.sub(rel+r'\(', '', line)
            line = re.sub(r'\)\.$', '', line)
            # drop the trailing newline
            line = line[:-1]
            record = line.split(',')
            recs.append(record)
    return recs
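To see what the clause-to-record conversion does to a single line, here is a small self-contained sketch that needs no corpus files. The helper name `parse_clause` is hypothetical; unlike `_str2records` it anchors the relation name at the start of the line and escapes it before building the pattern.

```python
import re


def parse_clause(line, rel):
    """Split a clause such as 'city(athens,greece,1368).' into its fields."""
    line = line.strip()
    # remove the leading 'rel(' ...
    line = re.sub('^' + re.escape(rel) + r'\(', '', line)
    # ... and the trailing ').'
    line = re.sub(r'\)\.$', '', line)
    return line.split(',')


record = parse_clause("city(athens,greece,1368).\n", "city")
```

Every field comes back as a string, including the population figure, which matches the way the records are treated in the rest of the module.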

def unary_concept(label, subj, records):
    """
    Make a unary concept out of the primary key in a record.

    A record is a list of entities in some relation, such as
    C{['france', 'paris']}, where C{'france'} is acting as the primary
    key.

    @param label: the preferred label for the concept
    @type label: string
    @param subj: position in the record of the subject of the predicate
    @type subj: int
    @param records: a list of records
    @type records: list of lists
    @return: L{Concept} of arity 1
    @rtype: L{Concept}
    """
    c = Concept(label, arity=1, extension=set())
    for record in records:
        c.augment(record[subj])
    return c

def binary_concept(label, closures, subj, obj, records):
    """
    Make a binary concept out of the primary key and another field in a record.

    A record is a list of entities in some relation, such as
    C{['france', 'paris']}, where C{'france'} is acting as the primary
    key, and C{'paris'} stands in the C{'capital_of'} relation to
    C{'france'}.

    More generally, given a record such as C{['a', 'b', 'c']}, where
    label is bound to C{'B'}, and C{obj} bound to 1, the derived
    binary concept will have label C{'B_of'}, and its extension will
    be a set of pairs such as C{('a', 'b')}.


    @param label: the base part of the preferred label for the concept
    @type label: string
    @param closures: closure properties for the extension of the concept
    @type closures: list
    @param subj: position in the record of the subject of the predicate
    @type subj: int
    @param obj: position in the record of the object of the predicate
    @type obj: int
    @param records: a list of records
    @type records: list of lists
    @return: L{Concept} of arity 2
    @rtype: L{Concept}
    """
    if label not in ('border', 'contain'):
        label = label + '_of'
    c = Concept(label, arity=2, closures=closures, extension=set())
    for record in records:
        c.augment((record[subj], record[obj]))
    # close the concept's extension according to the properties in closures
    c.close()
    return c
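The mapping from a schema and its records to one unary extension plus one binary extension per non-key field can be sketched with plain dictionaries. This is an illustration under simplifying assumptions: it skips the `Concept` class, the closures, and the special-casing of `'border'` and `'contain'`, and the names `unary` and `binary` are hypothetical.

```python
schema = ['city', 'country', 'population']
records = [['athens', 'greece', '1368']]

# The primary key (first field) gives the extension of the unary predicate.
unary = {schema[0]: set(r[0] for r in records)}

# Each remaining field gives a binary predicate '<field>_of' that pairs
# the primary key with the value in that field.
binary = {}
for field in schema[1:]:
    obj = schema.index(field)
    binary[field + '_of'] = set((r[0], r[obj]) for r in records)
```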

def process_bundle(rels):
    """
    Given a list of relation metadata bundles, make a corresponding
    dictionary of concepts, indexed by the relation name.

    @param rels: bundle of metadata needed for constructing a concept
    @type rels: list of dictionaries
    @return: a dictionary of concepts, indexed by the relation name.
    @rtype: dict
    """
    concepts = {}
    for rel in rels:
        rel_name = rel['rel_name']
        closures = rel['closures']
        schema = rel['schema']
        filename = rel['filename']

        concept_list = clause2concepts(filename, rel_name, closures, schema)
        for c in concept_list:
            label = c.prefLabel
            if label in concepts:
                # merge extensions of concepts that share a label
                for data in c.extension:
                    concepts[label].augment(data)
                concepts[label].close()
            else:
                concepts[label] = c
    return concepts
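Two bundles can produce concepts with the same label: both the `city` and `country` schemas contain a `population` field, so both give rise to `population_of`. The merge step in `process_bundle` can be sketched as follows, assuming plain `(label, extension)` pairs in place of `Concept` objects. The helper `merge_by_label` is hypothetical, and the second pair is a made-up placeholder rather than Chat-80 data.

```python
def merge_by_label(labelled_extensions):
    """Union the extensions of all entries that share a label."""
    merged = {}
    for label, extension in labelled_extensions:
        merged.setdefault(label, set()).update(extension)
    return merged


from_cities = ('population_of', set([('athens', '1368')]))
# placeholder values, for illustration only
from_countries = ('population_of', set([('some_country', '42')]))

merged = merge_by_label([from_cities, from_countries])
```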

def make_valuation(concepts, read=False, lexicon=False):
    """
    Convert a list of C{Concept}s into a list of (label, extension) pairs;
    optionally create a C{Valuation} object.

    @param concepts: concepts
    @type concepts: list of L{Concept}s
    @param read: if C{True}, C{(symbol, set)} pairs are read into a C{Valuation}
    @type read: bool
    @param lexicon: if C{True}, also write out lexical rules for the
        individuals (this implies C{read})
    @type lexicon: bool
    @rtype: list or a L{Valuation}
    """
    vals = []

    for c in concepts:
        vals.append((c.prefLabel, c.extension))
    if lexicon:
        read = True
    if read:
        from nltk.sem import Valuation
        val = Valuation(vals)
        # val.read(vals)
        # add labels for individuals
        val = label_indivs(val, lexicon=lexicon)
        return val
    else:
        return vals

def val_dump(rels, db):
    """
    Make a L{Valuation} from a list of relation metadata bundles and dump to
    persistent database.

    @param rels: bundle of metadata needed for constructing a concept
    @type rels: list of dictionaries
    @param db: name of file to which data is written.
               The suffix '.db' will be automatically appended.
    @type db: string
    """
    concepts = process_bundle(rels).values()
    valuation = make_valuation(concepts, read=True)
    db_out = shelve.open(db, 'n')

    db_out.update(valuation)

    db_out.close()

def val_load(db):
    """
    Load a L{Valuation} from a persistent database.

    @param db: name of file from which data is read.
               The suffix '.db' should be omitted from the name.
    @type db: string
    """
    dbname = db+".db"

    if not os.access(dbname, os.R_OK):
        sys.exit("Cannot read file: %s" % dbname)
    else:
        db_in = shelve.open(db)
        from nltk.sem import Valuation
        val = Valuation(db_in)
        # val.read(db_in.items())
        return val
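The persistence round trip behind `val_dump` and `val_load` can be tried without NLTK by storing an ordinary dictionary in a shelf. This is a sketch under assumptions: the helper `roundtrip` and the file name `chat80_demo` are hypothetical, a plain `dict` stands in for the `Valuation`, and the on-disk file name (including any suffix) depends on the dbm backend available on the platform.

```python
import os
import shelve
import tempfile


def roundtrip(valuation):
    """Write a dict of label -> extension entries to a shelf and read it back."""
    path = os.path.join(tempfile.mkdtemp(), 'chat80_demo')
    db_out = shelve.open(path, 'n')   # 'n': always create a new, empty shelf
    db_out.update(valuation)
    db_out.close()

    db_in = shelve.open(path)
    loaded = dict(db_in)
    db_in.close()
    return loaded


demo = {'city': set(['athens']),
        'country_of': set([('athens', 'greece')])}
restored = roundtrip(demo)
```

Because the suffix is backend-dependent, checking for a file literally named `<db>.db`, as `val_load` does, may not succeed on every system.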

def alpha(str):
    """
    Utility to filter out non-alphabetic constants.

    @param str: candidate constant
    @type str: string
    @rtype: bool
    """
    try:
        int(str)
        return False
    except ValueError:
        # some unknown values in records are labeled '?'
        return not str == '?'

def label_indivs(valuation, lexicon=False):
    """
    Assign individual constants to the individuals in the domain of a C{Valuation}.

    Given a valuation with an entry of the form {'rel': {'a': True}},
    add a new entry {'a': 'a'}.

    @type valuation: L{Valuation}
    @rtype: L{Valuation}
    """
    # collect all the individuals into a domain
    domain = valuation.domain
    # convert the domain into a sorted list of alphabetic terms
    entities = sorted(e for e in domain if alpha(e))
    # use the same string as a label
    pairs = [(e, e) for e in entities]
    if lexicon:
        lex = make_lex(entities)
        open("chat_pnames.cfg", mode='w').writelines(lex)
    # read the pairs into the valuation
    valuation.read(pairs)
    return valuation

def make_lex(symbols):
    """
    Create lexical CFG rules for each individual symbol.

    Given a valuation with an entry of the form {'zloty': 'zloty'},
    create a lexical rule for the proper name 'Zloty'.

    @param symbols: a list of individual constants in the semantic representation
    @type symbols: sequence
    @rtype: list
    """
    lex = []
    header = """
##################################################################
# Lexical rules automatically generated by running 'chat80.py -x'.
##################################################################

"""
    lex.append(header)
    template = "PropN[num=sg, sem=<\P.(P %s)>] -> '%s'\n"

    for s in symbols:
        parts = s.split('_')
        caps = [p.capitalize() for p in parts]
        pname = ('_').join(caps)
        rule = template % (s, pname)
        lex.append(rule)
    return lex
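The rule that `make_lex` builds for each symbol can be reproduced in isolation. In this sketch the helper name `proper_name_rule` is hypothetical and `'south_africa'` is an illustrative multi-part symbol, not one confirmed to occur in the data. The backslash is doubled so that the string literal contains a single `\P`.

```python
def proper_name_rule(symbol):
    """Build one lexical rule, capitalising each '_'-separated part."""
    pname = '_'.join(part.capitalize() for part in symbol.split('_'))
    return "PropN[num=sg, sem=<\\P.(P %s)>] -> '%s'" % (symbol, pname)


single = proper_name_rule('zloty')
multi = proper_name_rule('south_africa')
```

The semantic constant keeps its lower-case form, while the terminal on the right-hand side is capitalised part by part and keeps the underscore.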


###########################################################################
# Interface function to emulate other corpus readers
###########################################################################

def concepts(items = items):
    """
    Build a list of concepts corresponding to the relation names in C{items}.

    @param items: names of the Chat-80 relations to extract
    @type items: list of strings
    @return: the L{Concept}s which are extracted from the relations
    @rtype: list
    """
    if type(items) is str:
        items = (items,)

    rels = [item_metadata[r] for r in items]

    concept_map = process_bundle(rels)
    return concept_map.values()


###########################################################################

def main():
    import sys
    from optparse import OptionParser
    description = \
    """
Extract data from the Chat-80 Prolog files and convert them into a
Valuation object for use in the NLTK semantics package.
    """

    opts = OptionParser(description=description)
    opts.set_defaults(verbose=True, lex=False, vocab=False)
    opts.add_option("-s", "--store", dest="outdb",
                    help="store a valuation in DB", metavar="DB")
    opts.add_option("-l", "--load", dest="indb",
                    help="load a stored valuation from DB", metavar="DB")
    opts.add_option("-c", "--concepts", action="store_true",
                    help="print concepts instead of a valuation")
    opts.add_option("-r", "--relation", dest="label",
                    help="print concept with label REL (check possible labels with '-v' option)", metavar="REL")
    opts.add_option("-q", "--quiet", action="store_false", dest="verbose",
                    help="don't print out progress info")
    opts.add_option("-x", "--lex", action="store_true", dest="lex",
                    help="write a file of lexical entries for country names, then exit")
    opts.add_option("-v", "--vocab", action="store_true", dest="vocab",
                    help="print out the vocabulary of concept labels and their arity, then exit")

    (options, args) = opts.parse_args()
    if options.outdb and options.indb:
        opts.error("Options --store and --load are mutually exclusive")

    # The parenthesised single-argument print calls below behave the same
    # under the Python 2 print statement and the Python 3 print function.
    if options.outdb:
        # write the valuation to a persistent database
        if options.verbose:
            outdb = options.outdb+".db"
            print("Dumping a valuation to %s" % outdb)
        val_dump(rels, options.outdb)
        sys.exit(0)
    else:
        # try to read in a valuation from a database
        if options.indb is not None:
            dbname = options.indb+".db"
            if not os.access(dbname, os.R_OK):
                sys.exit("Cannot read file: %s" % dbname)
            else:
                valuation = val_load(options.indb)
        # we need to create the valuation from scratch
        else:
            # build some concepts
            concept_map = process_bundle(rels)
            concepts = concept_map.values()
            # just print out the vocabulary
            if options.vocab:
                items = [(c.arity, c.prefLabel) for c in concepts]
                items.sort()
                for (arity, label) in items:
                    print("%s %s" % (label, arity))
                sys.exit(0)
            # show all the concepts
            if options.concepts:
                for c in concepts:
                    print(c)
                    print("")
            if options.label:
                print(concept_map[options.label])
                sys.exit(0)
            else:
                # turn the concepts into a Valuation
                if options.lex:
                    if options.verbose:
                        print("Writing out lexical rules")
                    make_valuation(concepts, lexicon=True)
                else:
                    valuation = make_valuation(concepts, read=True)
                    print(valuation)


if __name__ == '__main__':
    main()

Source Code for Module nltk.corpus.chat80