
Source Code for Module nltk.corpus.chat80

# Natural Language Toolkit: Chat-80 KB Reader
# See http://www.w3.org/TR/swbp-skos-core-guide/
#
# Author: Ewan Klein <[email protected]>,
# URL: <http://nltk.sourceforge.net>
# For license information, see LICENSE.TXT

"""
Overview
========

Chat-80 was a natural language system which allowed the user to
interrogate a Prolog knowledge base in the domain of world
geography. It was developed in the early '80s by Warren and Pereira; see
U{http://acl.ldc.upenn.edu/J/J82/J82-3002.pdf} for a description and
U{http://www.cis.upenn.edu/~pereira/oldies.html} for the source
files.

This module contains functions to extract data from the Chat-80
relation files ('the world database'), and convert them into a format
that can be incorporated in the FOL models of
L{nltk.sem.evaluate}. The code assumes that the Prolog
input files are available in the NLTK corpora directory.

The Chat-80 World Database consists of the following files::

    world0.pl
    rivers.pl
    cities.pl
    countries.pl
    contain.pl
    borders.pl

This module uses a slightly modified version of C{world0.pl}, in which
a set of Prolog rules has been omitted. The modified file is named
C{world1.pl}. Currently, the file C{rivers.pl} is not read in, since
it uses a list rather than a string in the second field.

Reading Chat-80 Files
=====================

Chat-80 relations are like tables in a relational database. The
relation acts as the name of the table; the first argument acts as the
'primary key'; and subsequent arguments are further fields in the
table. In general, the name of the table provides a label for a unary
predicate whose extension is all the primary keys. For example,
relations in C{cities.pl} are of the following form::

   'city(athens,greece,1368).'

Here, C{'athens'} is the key, and will be mapped to a member of the
unary predicate M{city}.

The fields in the table are mapped to binary predicates. The first
argument of the predicate is the primary key, while the second
argument is the data in the relevant field. Thus, in the above
example, the third field is mapped to the binary predicate
M{population_of}, whose extension is a set of pairs such as C{'(athens,
1368)'}.

An exception to this general framework is required by the relations in
the files C{borders.pl} and C{contain.pl}. These contain facts of the
following form::

    'borders(albania,greece).'

    'contains0(africa,central_africa).'

We do not want to form a unary concept out of the element in
the first field of these records, and we want the label of the binary
relation just to be C{'border'}/C{'contain'} respectively.

In order to drive the extraction process, we use 'relation metadata bundles'
which are Python dictionaries such as the following::

  city = {'rel_name': 'city',
          'closures': [],
          'schema': ['city', 'country', 'population'],
          'filename': 'cities.pl'}

According to this, the file C{city['filename']} contains a list of
relational tuples (or more accurately, the corresponding strings in
Prolog form) whose predicate symbol is C{city['rel_name']} and whose
relational schema is C{city['schema']}. The notion of a C{closure} is
discussed in the next section.

Concepts
========
In order to encapsulate the results of the extraction, a class of
L{Concept}s is introduced.  A L{Concept} object has a number of
attributes, in particular a C{prefLabel} and C{extension}, which make
it easier to inspect the output of the extraction. In addition, the
C{extension} can be further processed: in the case of the C{'border'}
relation, we check that the relation is B{symmetric}, and in the case
of the C{'contain'} relation, we carry out the B{transitive
closure}. The closure properties associated with a concept are
specified in the relation metadata, as described earlier.

The C{extension} of a L{Concept} object is then incorporated into a
L{Valuation} object.

Persistence
===========
The functions L{val_dump} and L{val_load} are provided to allow a
valuation to be stored in a persistent database and re-loaded, rather
than having to be re-computed each time.

Individuals and Lexical Items
=============================
As well as deriving relations from the Chat-80 data, we also create a
set of individual constants, one for each entity in the domain. The
individual constants are string-identical to the entities. For
example, given a data item such as C{'zloty'}, we add to the valuation
a pair C{('zloty', 'zloty')}. In order to parse English sentences that
refer to these entities, we also create a lexical item such as the
following for each individual constant::

   PropN[num=sg, sem=<\P.(P zloty)>] -> 'Zloty'

The set of rules is written to the file C{chat_pnames.cfg} in the
current directory.

"""

import re
import shelve
import os
import sys

import nltk.data

from util import *

###########################################################################
# Chat-80 relation metadata bundles needed to build the valuation
###########################################################################

borders = {'rel_name': 'borders',
           'closures': ['symmetric'],
           'schema': ['region', 'border'],
           'filename': 'borders.pl'}

contains = {'rel_name': 'contains0',
            'closures': ['transitive'],
            'schema': ['region', 'contain'],
            'filename': 'contain.pl'}

city = {'rel_name': 'city',
        'closures': [],
        'schema': ['city', 'country', 'population'],
        'filename': 'cities.pl'}

country = {'rel_name': 'country',
           'closures': [],
           'schema': ['country', 'region', 'latitude', 'longitude',
                      'area', 'population', 'capital', 'currency'],
           'filename': 'countries.pl'}

circle_of_lat = {'rel_name': 'circle_of_latitude',
                 'closures': [],
                 'schema': ['circle_of_latitude', 'degrees'],
                 'filename': 'world1.pl'}

circle_of_long = {'rel_name': 'circle_of_longitude',
                  'closures': [],
                  'schema': ['circle_of_longitude', 'degrees'],
                  'filename': 'world1.pl'}

continent = {'rel_name': 'continent',
             'closures': [],
             'schema': ['continent'],
             'filename': 'world1.pl'}

region = {'rel_name': 'in_continent',
          'closures': [],
          'schema': ['region', 'continent'],
          'filename': 'world1.pl'}

ocean = {'rel_name': 'ocean',
         'closures': [],
         'schema': ['ocean'],
         'filename': 'world1.pl'}

sea = {'rel_name': 'sea',
       'closures': [],
       'schema': ['sea'],
       'filename': 'world1.pl'}



items = ['borders', 'contains', 'city', 'country', 'circle_of_lat',
         'circle_of_long', 'continent', 'region', 'ocean', 'sea']
items = tuple(sorted(items))

item_metadata = {
    'borders': borders,
    'contains': contains,
    'city': city,
    'country': country,
    'circle_of_lat': circle_of_lat,
    'circle_of_long': circle_of_long,
    'continent': continent,
    'region': region,
    'ocean': ocean,
    'sea': sea
    }

rels = item_metadata.values()

not_unary = ['borders.pl', 'contain.pl']

###########################################################################

class Concept(object):
    """
    A Concept class, loosely
    based on SKOS (U{http://www.w3.org/TR/swbp-skos-core-guide/}).
    """
    def __init__(self, prefLabel, arity, altLabels=None, closures=None, extension=None):
        """
        @param prefLabel: the preferred label for the concept
        @type prefLabel: str
        @param arity: the arity of the concept
        @type arity: int
        @keyword altLabels: other (related) labels
        @type altLabels: list
        @keyword closures: closure properties of the extension \
            (list items can be C{symmetric}, C{reflexive}, C{transitive})
        @type closures: list
        @keyword extension: the extensional value of the concept
        @type extension: set
        """
        # Create fresh containers for each instance: mutable default
        # arguments would be shared between all instances.
        if altLabels is None:
            altLabels = []
        if closures is None:
            closures = []
        if extension is None:
            extension = set()
        self.prefLabel = prefLabel
        self.arity = arity
        self.altLabels = altLabels
        self.closures = closures
        self.extension = extension

    def __str__(self):
        # str() is needed because the elements of a binary extension are tuples
        _extension = ','.join([str(element) for element in sorted(self.extension)])
        return "Label = '%s'\nArity = %s\nExtension = {%s}" % \
               (self.prefLabel, self.arity, _extension)

    def __repr__(self):
        return "Concept('%s')" % self.prefLabel

    def augment(self, data):
        """
        Add more data to the C{Concept}'s extension set.

        @param data: a new semantic value
        @type data: string or pair of strings
        @rtype: set

        """
        self.extension.add(data)
        return self.extension


    def _make_graph(self, s):
        """
        Convert a set of pairs into an adjacency linked list encoding of a graph.
        """
        g = {}
        for (x, y) in s:
            if x in g:
                g[x].append(y)
            else:
                g[x] = [y]
        return g

    def _transclose(self, g):
        """
        Compute the transitive closure of a graph represented as a linked list.
        """
        for x in g:
            # g[x] grows during iteration, so newly reached nodes are visited too
            for adjacent in g[x]:
                # check that adjacent is a key
                if adjacent in g:
                    for y in g[adjacent]:
                        if y not in g[x]:
                            g[x].append(y)
        return g

    def _make_pairs(self, g):
        """
        Convert an adjacency linked list back into a set of pairs.
        """
        pairs = []
        for node in g:
            for adjacent in g[node]:
                pairs.append((node, adjacent))
        return set(pairs)


    def close(self):
        """
        Close a binary relation in the C{Concept}'s extension set.

        The extension is updated in place so that the relation is
        closed under each property listed in C{self.closures}.
        """
        from nltk.sem import is_rel
        assert is_rel(self.extension)
        if 'symmetric' in self.closures:
            pairs = []
            for (x, y) in self.extension:
                pairs.append((y, x))
            sym = set(pairs)
            self.extension = self.extension.union(sym)
        if 'transitive' in self.closures:
            graph = self._make_graph(self.extension)
            closed = self._transclose(graph)
            trans = self._make_pairs(closed)
            #print sorted(trans)
            self.extension = self.extension.union(trans)



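# Illustrative sketch, not part of the original module: the two closure
# properties handled by Concept.close() can be tried out on plain sets of
# pairs. The helper names symmetric_closure and transitive_closure, and the
# ('central_africa', 'chad') pair, are assumptions made for this example.
# The fixed-point loop is a simpler stand-in for the adjacency-list
# traversal used in the class.

```python
def symmetric_closure(pairs):
    """Return the pairs together with the reverse of every pair."""
    return set(pairs) | set((y, x) for (x, y) in pairs)


def transitive_closure(pairs):
    """Return the smallest transitive relation that contains the pairs."""
    closed = set(pairs)
    while True:
        # join every (x, y) with every (y, z) to obtain (x, z)
        new = set((x, z) for (x, y) in closed for (w, z) in closed if y == w)
        if new <= closed:
            return closed
        closed |= new


# sample data in the style of contain.pl and borders.pl
contain = set([('africa', 'central_africa'), ('central_africa', 'chad')])
border = set([('albania', 'greece')])

closed_contain = transitive_closure(contain)
closed_border = symmetric_closure(border)
```

# Here closed_contain gains the derived pair ('africa', 'chad'), and
# closed_border gains the reversed pair ('greece', 'albania').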
def clause2concepts(filename, rel_name, closures, schema):
    """
    Convert a file of Prolog clauses into a list of L{Concept} objects.

    @param filename: filename containing the relations
    @type filename: string
    @param rel_name: name of the relation
    @type rel_name: string
    @param closures: closure properties for the extension of the concept
    @type closures: list
    @param schema: the schema used in a set of relational tuples
    @type schema: list
    @return: a list of L{Concept}s
    @rtype: list
    """
    concepts = []
    # position of the subject of a binary relation
    subj = 0
    # label of the 'primary key'
    pkey = schema[0]
    # fields other than the primary key
    fields = schema[1:]

    # convert a file into a list of lists
    records = _str2records(filename, rel_name)

    # add a unary concept corresponding to the set of entities
    # in the primary key position
    # relations in 'not_unary' are more like ordinary binary relations
    if filename not in not_unary:
        concepts.append(unary_concept(pkey, subj, records))

    # add a binary concept for each non-key field
    for field in fields:
        obj = schema.index(field)
        concepts.append(binary_concept(field, closures, subj, obj, records))

    return concepts

def _str2records(filename, rel):
    """
    Read a file into memory and convert each relation clause into a list.
    """
    recs = []
    path = nltk.data.find("corpora/chat80/%s" % filename)
    for line in open(path):
        if line.startswith(rel):
            line = re.sub(rel+r'\(', '', line)
            line = re.sub(r'\)\.$', '', line)
            line = line[:-1]
            record = line.split(',')
            recs.append(record)
    return recs

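# Illustrative sketch, not part of the original module: how one Prolog fact
# is turned into a record (a list of strings). The function name
# clause_to_record is an assumption made for this example; it uses a single
# anchored regular expression rather than the successive substitutions in
# _str2records.

```python
import re


def clause_to_record(line, rel):
    """
    Turn a fact such as 'city(athens,greece,1368).' into
    ['athens', 'greece', '1368'].  Return None if the line is not a
    fact for the relation rel.
    """
    match = re.match(re.escape(rel) + r'\((.*)\)\.\s*$', line)
    if match is None:
        return None
    return match.group(1).split(',')


record = clause_to_record("city(athens,greece,1368).\n", "city")
other = clause_to_record("borders(albania,greece).\n", "city")
```

# Every field stays a string, so the population arrives as '1368', not 1368.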
def unary_concept(label, subj, records):
    """
    Make a unary concept out of the primary key in a record.

    A record is a list of entities in some relation, such as
    C{['france', 'paris']}, where C{'france'} is acting as the primary
    key.

    @param label: the preferred label for the concept
    @type label: string
    @param subj: position in the record of the subject of the predicate
    @type subj: int
    @param records: a list of records
    @type records: list of lists
    @return: L{Concept} of arity 1
    @rtype: L{Concept}
    """
    c = Concept(label, arity=1, extension=set())
    for record in records:
        c.augment(record[subj])
    return c

def binary_concept(label, closures, subj, obj, records):
    """
    Make a binary concept out of the primary key and another field in a record.

    A record is a list of entities in some relation, such as
    C{['france', 'paris']}, where C{'france'} is acting as the primary
    key, and C{'paris'} stands in the C{'capital_of'} relation to
    C{'france'}.

    More generally, given a record such as C{['a', 'b', 'c']}, where
    label is bound to C{'B'}, and C{obj} bound to 1, the derived
    binary concept will have label C{'B_of'}, and its extension will
    be a set of pairs such as C{('a', 'b')}.


    @param label: the base part of the preferred label for the concept
    @type label: string
    @param closures: closure properties for the extension of the concept
    @type closures: list
    @param subj: position in the record of the subject of the predicate
    @type subj: int
    @param obj: position in the record of the object of the predicate
    @type obj: int
    @param records: a list of records
    @type records: list of lists
    @return: L{Concept} of arity 2
    @rtype: L{Concept}
    """
    if not label == 'border' and not label == 'contain':
        label = label + '_of'
    c = Concept(label, arity=2, closures=closures, extension=set())
    for record in records:
        c.augment((record[subj], record[obj]))
    # close the concept's extension according to the properties in closures
    c.close()
    return c


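# Illustrative sketch, not part of the original module: how a schema and a
# list of records give rise to one unary extension and one binary extension
# per non-key field. The helper names unary_extension and binary_extension
# and the sample records are assumptions made for this example; plain sets
# stand in for Concept objects.

```python
schema = ['city', 'country', 'population']
records = [['athens', 'greece', '1368'],
           ['bangkok', 'thailand', '1178']]


def unary_extension(records, subj=0):
    """Collect the primary keys of all records."""
    return set(record[subj] for record in records)


def binary_extension(records, obj, subj=0):
    """Pair the primary key of each record with the field at position obj."""
    return set((record[subj], record[obj]) for record in records)


# the primary key labels the unary predicate
extensions = {schema[0]: unary_extension(records)}
# every other field labels a binary predicate with the suffix '_of'
for field in schema[1:]:
    extensions[field + '_of'] = binary_extension(records, schema.index(field))
```

# The resulting labels are 'city', 'country_of' and 'population_of'.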
def process_bundle(rels):
    """
    Given a list of relation metadata bundles, make a corresponding
    dictionary of concepts, indexed by the relation name.

    @param rels: bundle of metadata needed for constructing a concept
    @type rels: list of dictionaries
    @return: a dictionary of concepts, indexed by the relation name.
    @rtype: dict
    """
    concepts = {}
    for rel in rels:
        rel_name = rel['rel_name']
        closures = rel['closures']
        schema = rel['schema']
        filename = rel['filename']

        concept_list = clause2concepts(filename, rel_name, closures, schema)
        for c in concept_list:
            label = c.prefLabel
            if label in concepts:
                # merge into the existing concept, then re-close it
                for data in c.extension:
                    concepts[label].augment(data)
                concepts[label].close()
            else:
                concepts[label] = c
    return concepts


def make_valuation(concepts, read=False, lexicon=False):
    """
    Convert a list of C{Concept}s into a list of (label, extension) pairs;
    optionally create a C{Valuation} object.

    @param concepts: concepts
    @type concepts: list of L{Concept}s
    @param read: if C{True}, C{(symbol, set)} pairs are read into a C{Valuation}
    @type read: bool
    @param lexicon: if C{True}, also write lexical rules for the individuals
        to C{chat_pnames.cfg}; this implies C{read=True}
    @type lexicon: bool
    @rtype: list or a L{Valuation}
    """
    vals = []

    for c in concepts:
        vals.append((c.prefLabel, c.extension))
    if lexicon: read = True
    if read:
        from nltk.sem import Valuation
        val = Valuation(vals)
        # val.read(vals)
        # add labels for individuals
        val = label_indivs(val, lexicon=lexicon)
        return val
    else: return vals


def val_dump(rels, db):
    """
    Make a L{Valuation} from a list of relation metadata bundles and dump to
    persistent database.

    @param rels: bundle of metadata needed for constructing a concept
    @type rels: list of dictionaries
    @param db: name of file to which data is written.
        The suffix '.db' will be automatically appended.
    @type db: string
    """
    concepts = process_bundle(rels).values()
    valuation = make_valuation(concepts, read=True)
    db_out = shelve.open(db, 'n')

    db_out.update(valuation)

    db_out.close()


def val_load(db):
    """
    Load a L{Valuation} from a persistent database.

    @param db: name of file from which data is read.
        The suffix '.db' should be omitted from the name.
    @type db: string
    """
    dbname = db+".db"

    if not os.access(dbname, os.R_OK):
        sys.exit("Cannot read file: %s" % dbname)
    else:
        db_in = shelve.open(db)
        from nltk.sem import Valuation
        val = Valuation(db_in)
        # val.read(db_in.items())
        return val


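# Illustrative sketch, not part of the original module: the dump/load round
# trip performed by val_dump and val_load, using a plain dictionary in place
# of a Valuation. The file name chat80_demo and the sample data are
# assumptions made for this example. Note that whether shelve adds a '.db'
# suffix to the file name depends on the underlying dbm backend.

```python
import os
import shelve
import shutil
import tempfile

# a stand-in for a Valuation: labels mapped to extensions
valuation = {'city': set(['athens']),
             'population_of': set([('athens', '1368')])}

tmp_dir = tempfile.mkdtemp()
db_path = os.path.join(tmp_dir, 'chat80_demo')

# 'n' always creates a new, empty database, as in val_dump
db_out = shelve.open(db_path, 'n')
db_out.update(valuation)
db_out.close()

# reopen with the same base name, as in val_load
db_in = shelve.open(db_path)
restored = dict(db_in)
db_in.close()

# remove the temporary files created for the demonstration
shutil.rmtree(tmp_dir)
```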
def alpha(str):
    """
    Utility to filter out non-alphabetic constants.

    @param str: candidate constant
    @type str: string
    @rtype: bool
    """
    try:
        int(str)
        return False
    except ValueError:
        # some unknown values in records are labeled '?'
        return str != '?'


def label_indivs(valuation, lexicon=False):
    """
    Assign individual constants to the individuals in the domain of a C{Valuation}.

    Given a valuation with an entry of the form {'rel': {'a': True}},
    add a new entry {'a': 'a'}.

    @type valuation: L{Valuation}
    @param lexicon: if C{True}, write lexical rules for the individuals
        to C{chat_pnames.cfg} in the current directory
    @type lexicon: bool
    @rtype: L{Valuation}
    """
    # collect all the individuals into a domain
    domain = valuation.domain
    # convert the domain into a sorted list of alphabetic terms
    entities = sorted(e for e in domain if alpha(e))
    # use the same string as a label
    pairs = [(e, e) for e in entities]
    if lexicon:
        lex = make_lex(entities)
        outfile = open("chat_pnames.cfg", mode='w')
        outfile.writelines(lex)
        outfile.close()
    # read the pairs into the valuation
    valuation.read(pairs)
    return valuation

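# Illustrative sketch, not part of the original module: how the individual
# constants are chosen from a domain. The name is_alpha and the sample
# domain are assumptions made for this example; a plain set stands in for
# the domain of a Valuation.

```python
def is_alpha(term):
    """Reject numeric strings and the unknown-value marker '?'."""
    try:
        int(term)
        return False
    except ValueError:
        return term != '?'


domain = set(['athens', 'greece', '1368', '?'])

# keep only the alphabetic terms, in sorted order
entities = sorted(e for e in domain if is_alpha(e))
# each individual constant is string-identical to its entity
pairs = [(e, e) for e in entities]
```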
def make_lex(symbols):
    """
    Create lexical CFG rules for each individual symbol.

    Given a valuation with an entry of the form {'zloty': 'zloty'},
    create a lexical rule for the proper name 'Zloty'.

    @param symbols: a list of individual constants in the semantic representation
    @type symbols: sequence
    @rtype: list
    """
    lex = []
    header = """
##################################################################
# Lexical rules automatically generated by running 'chat80.py -x'.
##################################################################

"""
    lex.append(header)
    template = "PropN[num=sg, sem=<\P.(P %s)>] -> '%s'\n"

    for s in symbols:
        parts = s.split('_')
        caps = [p.capitalize() for p in parts]
        pname = ('_').join(caps)
        rule = template % (s, pname)
        lex.append(rule)
    return lex
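
# Illustrative sketch, not part of the original module: how a single
# lexical rule is derived from an individual constant. The names TEMPLATE,
# proper_name and lexical_rule, and the sample symbol, are assumptions made
# for this example.

```python
# raw string, so that the backslash in the lambda term is kept literally
TEMPLATE = r"PropN[num=sg, sem=<\P.(P %s)>] -> '%s'"


def proper_name(symbol):
    """Capitalize each underscore-separated part of the symbol."""
    return '_'.join(part.capitalize() for part in symbol.split('_'))


def lexical_rule(symbol):
    """Build the grammar rule that maps the proper name to the symbol."""
    return TEMPLATE % (symbol, proper_name(symbol))


rule = lexical_rule('south_africa')
```

# Multi-word names keep their underscores, so the terminal for
# 'south_africa' is 'South_Africa'.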


###########################################################################
# Interface function to emulate other corpus readers
###########################################################################

def concepts(items = items):
    """
    Build a list of concepts corresponding to the relation names in C{items}.

    @param items: names of the Chat-80 relations to extract
    @type items: list of strings
    @return: the L{Concept}s which are extracted from the relations
    @rtype: list
    """
    if type(items) is str: items = (items,)

    rels = [item_metadata[r] for r in items]

    concept_map = process_bundle(rels)
    return concept_map.values()




###########################################################################


def main():
    import sys
    from optparse import OptionParser
    description = \
    """
Extract data from the Chat-80 Prolog files and convert them into a
Valuation object for use in the NLTK semantics package.
    """

    opts = OptionParser(description=description)
    opts.set_defaults(verbose=True, lex=False, vocab=False)
    opts.add_option("-s", "--store", dest="outdb",
                    help="store a valuation in DB", metavar="DB")
    opts.add_option("-l", "--load", dest="indb",
                    help="load a stored valuation from DB", metavar="DB")
    opts.add_option("-c", "--concepts", action="store_true",
                    help="print concepts instead of a valuation")
    opts.add_option("-r", "--relation", dest="label",
                    help="print concept with label REL (check possible labels with '-v' option)", metavar="REL")
    opts.add_option("-q", "--quiet", action="store_false", dest="verbose",
                    help="don't print out progress info")
    opts.add_option("-x", "--lex", action="store_true", dest="lex",
                    help="write a file of lexical entries for country names, then exit")
    opts.add_option("-v", "--vocab", action="store_true", dest="vocab",
                    help="print out the vocabulary of concept labels and their arity, then exit")

    (options, args) = opts.parse_args()
    if options.outdb and options.indb:
        opts.error("Options --store and --load are mutually exclusive")


    if options.outdb:
        # write the valuation to a persistent database
        if options.verbose:
            outdb = options.outdb+".db"
            print "Dumping a valuation to %s" % outdb
        val_dump(rels, options.outdb)
        sys.exit(0)
    else:
        # try to read in a valuation from a database
        if options.indb is not None:
            dbname = options.indb+".db"
            if not os.access(dbname, os.R_OK):
                sys.exit("Cannot read file: %s" % dbname)
            else:
                valuation = val_load(options.indb)
        # we need to create the valuation from scratch
        else:
            # build some concepts
            concept_map = process_bundle(rels)
            concepts = concept_map.values()
            # just print out the vocabulary
            if options.vocab:
                items = [(c.arity, c.prefLabel) for c in concepts]
                items.sort()
                for (arity, label) in items:
                    print label, arity
                sys.exit(0)
            # show all the concepts
            if options.concepts:
                for c in concepts:
                    print c
                    print
            if options.label:
                print concept_map[options.label]
                sys.exit(0)
            else:
                # turn the concepts into a Valuation
                if options.lex:
                    if options.verbose:
                        print "Writing out lexical rules"
                    make_valuation(concepts, lexicon=True)
                else:
                    valuation = make_valuation(concepts, read=True)
                    print valuation



if __name__ == '__main__':
    main()