"""
Overview
========

Chat-80 was a natural language system which allowed the user to
interrogate a Prolog knowledge base in the domain of world
geography. It was developed in the early '80s by Warren and Pereira; see
U{http://acl.ldc.upenn.edu/J/J82/J82-3002.pdf} for a description and
U{http://www.cis.upenn.edu/~pereira/oldies.html} for the source
files.

This module contains functions to extract data from the Chat-80
relation files ('the world database'), and convert them into a format
that can be incorporated in the FOL models of
L{nltk.sem.evaluate}. The code assumes that the Prolog
input files are available in the NLTK corpora directory.

The Chat-80 World Database consists of the following files::

    world0.pl
    rivers.pl
    cities.pl
    countries.pl
    contain.pl
    borders.pl

This module uses a slightly modified version of C{world0.pl}, in which
a set of Prolog rules has been omitted. The modified file is named
C{world1.pl}. Currently, the file C{rivers.pl} is not read in, since
it uses a list rather than a string in the second field.

Reading Chat-80 Files
=====================

Chat-80 relations are like tables in a relational database. The
relation acts as the name of the table; the first argument acts as the
'primary key'; and subsequent arguments are further fields in the
table. In general, the name of the table provides a label for a unary
predicate whose extension is all the primary keys. For example,
relations in C{cities.pl} are of the following form::

    'city(athens,greece,1368).'

Here, C{'athens'} is the key, and will be mapped to a member of the
unary predicate M{city}.

The fields in the table are mapped to binary predicates. The first
argument of the predicate is the primary key, while the second
argument is the data in the relevant field. Thus, in the above
example, the third field is mapped to the binary predicate
M{population_of}, whose extension is a set of pairs such as C{'(athens,
1368)'}.
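
As a rough sketch of this mapping (C{record_to_predicates} is a
hypothetical helper written for illustration and is not part of this
module, which builds L{Concept} objects instead), one record and its
schema can be split into a unary extension for the key and a binary
extension for each remaining field:

```python
def record_to_predicates(schema, record):
    # Hypothetical helper, for illustration only; the module itself
    # builds Concept objects rather than a plain dictionary.
    key = record[0]
    # The first schema entry names the unary predicate over primary keys.
    predicates = {schema[0]: set([key])}
    # Every further field yields a binary predicate labelled '<field>_of'.
    for position, field in enumerate(schema[1:], 1):
        predicates[field + '_of'] = set([(key, record[position])])
    return predicates

preds = record_to_predicates(['city', 'country', 'population'],
                             ['athens', 'greece', '1368'])
```

Under these assumptions, C{preds['population_of']} holds the pair
C{('athens', '1368')}; field values stay strings, as they do in the
records read from the Prolog files.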

An exception to this general framework is required by the relations in
the files C{borders.pl} and C{contain.pl}. These contain facts of the
following form::

    'borders(albania,greece).'

    'contains0(africa,central_africa).'

We do not want to form a unary concept out of the element in
the first field of these records, and we want the label of the binary
relation just to be C{'border'}/C{'contain'} respectively.

In order to drive the extraction process, we use 'relation metadata bundles'
which are Python dictionaries such as the following::

    city = {'rel_name': 'city',
            'closures': [],
            'schema': ['city', 'country', 'population'],
            'filename': 'cities.pl'}

According to this, the file C{city['filename']} contains a list of
relational tuples (or more accurately, the corresponding strings in
Prolog form) whose predicate symbol is C{city['rel_name']} and whose
relational schema is C{city['schema']}. The notion of a C{closure} is
discussed in the next section.
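
To illustrate how a bundle can drive extraction, here is a minimal
sketch that works on an in-memory list of clause strings rather than
on the file named in the bundle. The function C{parse_clauses} and the
sample lines are assumptions made for this example; the bundle keys
follow the module-level C{city} dictionary:

```python
import re

city = {'rel_name': 'city',
        'closures': [],
        'schema': ['city', 'country', 'population'],
        'filename': 'cities.pl'}

def parse_clauses(lines, rel_name):
    # Hypothetical stand-in for reading the file named in the bundle:
    # keep only clauses for rel_name and split their arguments.
    pattern = re.compile(re.escape(rel_name) + '[(](.*)[)][.]$')
    records = []
    for line in lines:
        match = pattern.match(line.strip())
        if match:
            records.append(match.group(1).split(','))
    return records

lines = ['city(athens,greece,1368).', 'country(greece,southern_europe).']
records = parse_clauses(lines, city['rel_name'])
```

Only the first line is a C{city} clause, so C{records} contains a
single record whose fields line up with C{city['schema']}.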

Concepts
========
In order to encapsulate the results of the extraction, a class of
L{Concept}s is introduced. A L{Concept} object has a number of
attributes, in particular a C{prefLabel} and C{extension}, which make
it easier to inspect the output of the extraction. In addition, the
C{extension} can be further processed: in the case of the C{'border'}
relation, we close the relation under B{symmetry}, and in the case
of the C{'contain'} relation, we carry out the B{transitive
closure}. The closure properties associated with a concept are
specified in the relation metadata, as described earlier.

The C{extension} of a L{Concept} object is then incorporated into a
L{Valuation} object.
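
The two closure operations can be sketched on plain sets of pairs as
follows. This is an illustrative fixed-point version with hypothetical
function names, not the adjacency-list procedure that
L{Concept.close} uses:

```python
def symmetric_closure(pairs):
    # Add the mirror image (y, x) of every pair (x, y).
    return set(pairs) | set((y, x) for (x, y) in pairs)

def transitive_closure(pairs):
    # Repeatedly add (x, z) whenever (x, y) and (y, z) are present,
    # until no new pair appears.
    closed = set(pairs)
    while True:
        new = set((x, z) for (x, y) in closed
                  for (w, z) in closed if y == w)
        if new <= closed:
            return closed
        closed |= new

borders = symmetric_closure([('albania', 'greece')])
contains = transitive_closure([('africa', 'central_africa'),
                               ('central_africa', 'chad')])
```

Here C{borders} gains the pair C{('greece', 'albania')}, and
C{contains} gains C{('africa', 'chad')}.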

Persistence
===========
The functions L{val_dump} and L{val_load} are provided to allow a
valuation to be stored in a persistent database and re-loaded, rather
than having to be re-computed each time.
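
The underlying idea can be sketched with the standard C{shelve} module
alone. The helper names, the temporary path and the sample data below
are assumptions for this example; L{val_dump} and L{val_load} store a
full L{Valuation} rather than a plain dictionary:

```python
import os
import shelve
import tempfile

def dump_extensions(extensions, db):
    # Hypothetical helper: write a label -> extension mapping to a shelf,
    # creating a new, empty database first (flag 'n').
    db_out = shelve.open(db, 'n')
    try:
        db_out.update(extensions)
    finally:
        db_out.close()

def load_extensions(db):
    # Read the whole shelf back into an ordinary dictionary.
    db_in = shelve.open(db)
    try:
        return dict(db_in)
    finally:
        db_in.close()

db = os.path.join(tempfile.mkdtemp(), 'chat80_sketch')
dump_extensions({'city': set(['athens', 'paris'])}, db)
restored = load_extensions(db)
```

The shelf pickles each value, so the set stored under C{'city'} comes
back as an equal set.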

Individuals and Lexical Items
=============================
As well as deriving relations from the Chat-80 data, we also create a
set of individual constants, one for each entity in the domain. The
individual constants are string-identical to the entities. For
example, given a data item such as C{'zloty'}, we add to the valuation
a pair C{('zloty', 'zloty')}. In order to parse English sentences that
refer to these entities, we also create a lexical item such as the
following for each individual constant::

    PropN[num=sg, sem=<\P.(P zloty)>] -> 'Zloty'

The set of rules is written to the file C{chat_pnames.cfg} in the
current directory.
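
A minimal sketch of how one such rule can be produced from an
individual constant is given below. The function name
C{proper_name_rule} is hypothetical; the rule format is the one shown
above:

```python
def proper_name_rule(symbol):
    # Capitalize each underscore-separated part of the constant,
    # e.g. 'south_africa' becomes 'South_Africa'.
    pname = '_'.join(part.capitalize() for part in symbol.split('_'))
    template = r"PropN[num=sg, sem=<\P.(P %s)>] -> '%s'"
    return template % (symbol, pname)

rule = proper_name_rule('south_africa')
```

The constant itself is kept unchanged inside the semantic
representation; only the terminal on the right-hand side is
capitalized.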

"""

import re
import shelve
import os
import sys

import nltk.data

from util import *


# Relation metadata bundles, one per Chat-80 relation that is extracted.

borders = {'rel_name': 'borders',
           'closures': ['symmetric'],
           'schema': ['region', 'border'],
           'filename': 'borders.pl'}

contains = {'rel_name': 'contains0',
            'closures': ['transitive'],
            'schema': ['region', 'contain'],
            'filename': 'contain.pl'}

city = {'rel_name': 'city',
        'closures': [],
        'schema': ['city', 'country', 'population'],
        'filename': 'cities.pl'}

country = {'rel_name': 'country',
           'closures': [],
           'schema': ['country', 'region', 'latitude', 'longitude',
                      'area', 'population', 'capital', 'currency'],
           'filename': 'countries.pl'}

circle_of_lat = {'rel_name': 'circle_of_latitude',
                 'closures': [],
                 'schema': ['circle_of_latitude', 'degrees'],
                 'filename': 'world1.pl'}

circle_of_long = {'rel_name': 'circle_of_longitude',
                  'closures': [],
                  'schema': ['circle_of_longitude', 'degrees'],
                  'filename': 'world1.pl'}

continent = {'rel_name': 'continent',
             'closures': [],
             'schema': ['continent'],
             'filename': 'world1.pl'}

region = {'rel_name': 'in_continent',
          'closures': [],
          'schema': ['region', 'continent'],
          'filename': 'world1.pl'}

ocean = {'rel_name': 'ocean',
         'closures': [],
         'schema': ['ocean'],
         'filename': 'world1.pl'}

sea = {'rel_name': 'sea',
       'closures': [],
       'schema': ['sea'],
       'filename': 'world1.pl'}


# Names of the metadata bundles, in sorted order.
items = ['borders', 'contains', 'city', 'country', 'circle_of_lat',
         'circle_of_long', 'continent', 'region', 'ocean', 'sea']
items = tuple(sorted(items))

# Map each bundle name to its metadata dictionary.
item_metadata = {
    'borders': borders,
    'contains': contains,
    'city': city,
    'country': country,
    'circle_of_lat': circle_of_lat,
    'circle_of_long': circle_of_long,
    'continent': continent,
    'region': region,
    'ocean': ocean,
    'sea': sea
    }

rels = item_metadata.values()

# Files whose relations do not give rise to a unary concept.
not_unary = ['borders.pl', 'contain.pl']


class Concept(object):
    """
    A Concept class, loosely
    based on SKOS (U{http://www.w3.org/TR/swbp-skos-core-guide/}).
    """
    def __init__(self, prefLabel, arity, altLabels=None, closures=None, extension=None):
        """
        @param prefLabel: the preferred label for the concept
        @type prefLabel: str
        @param arity: the arity of the concept
        @type arity: int
        @keyword altLabels: other (related) labels
        @type altLabels: list
        @keyword closures: closure properties of the extension \
            (list items can be C{symmetric}, C{reflexive}, C{transitive})
        @type closures: list
        @keyword extension: the extensional value of the concept
        @type extension: set
        """
        # Create fresh containers per instance; mutable default
        # arguments would be shared between all Concept objects.
        if altLabels is None:
            altLabels = []
        if closures is None:
            closures = []
        if extension is None:
            extension = set()
        self.prefLabel = prefLabel
        self.arity = arity
        self.altLabels = altLabels
        self.closures = closures
        self.extension = extension

    def __str__(self):
        # str() each element so that pairs in a binary extension can be
        # joined as well as plain strings.
        _extension = ','.join(str(element) for element in sorted(self.extension))
        return "Label = '%s'\nArity = %s\nExtension = {%s}" % \
               (self.prefLabel, self.arity, _extension)

    def __repr__(self):
        return "Concept('%s')" % self.prefLabel

    def augment(self, data):
        """
        Add more data to the C{Concept}'s extension set.

        @param data: a new semantic value
        @type data: string or pair of strings
        @rtype: set

        """
        self.extension.add(data)
        return self.extension


    def _make_graph(self, s):
        """
        Convert a set of pairs into an adjacency linked list encoding of a graph.
        """
        g = {}
        for (x, y) in s:
            if x in g:
                g[x].append(y)
            else:
                g[x] = [y]
        return g

    def _transclose(self, g):
        """
        Compute the transitive closure of a graph represented as a linked list.
        """
        for x in g:
            # g[x] grows while it is being iterated over, so nodes that
            # become reachable from x are visited in the same pass.
            for adjacent in g[x]:
                # check that adjacent is a key
                if adjacent in g:
                    for y in g[adjacent]:
                        if y not in g[x]:
                            g[x].append(y)
        return g

    def _make_pairs(self, g):
        """
        Convert an adjacency linked list back into a set of pairs.
        """
        pairs = []
        for node in g:
            for adjacent in g[node]:
                pairs.append((node, adjacent))
        return set(pairs)


    def close(self):
        """
        Close a binary relation in the C{Concept}'s extension set.

        @return: a new extension for the C{Concept} in which the
                 relation is closed under a given property
        """
        from nltk.sem import is_rel
        assert is_rel(self.extension)
        if 'symmetric' in self.closures:
            # add the inverse (y, x) of every pair (x, y)
            pairs = []
            for (x, y) in self.extension:
                pairs.append((y, x))
            sym = set(pairs)
            self.extension = self.extension.union(sym)
        if 'transitive' in self.closures:
            graph = self._make_graph(self.extension)
            closed = self._transclose(graph)
            trans = self._make_pairs(closed)
            self.extension = self.extension.union(trans)
        return self.extension


def clause2concepts(filename, rel_name, closures, schema):
    """
    Convert a file of Prolog clauses into a list of L{Concept} objects.

    @param filename: filename containing the relations
    @type filename: string
    @param rel_name: name of the relation
    @type rel_name: string
    @param closures: closure properties for the extension of the concepts
    @type closures: list
    @param schema: the schema used in a set of relational tuples
    @type schema: list
    @return: a list of L{Concept}s
    @rtype: list
    """
    concepts = []
    # position of the subject of a binary relation
    subj = 0
    # label of the 'primary key'
    pkey = schema[0]
    # fields other than the primary key
    fields = schema[1:]

    # convert the file into a list of records
    records = _str2records(filename, rel_name)

    # add a unary concept for the entities in the primary key position,
    # except for the relations in 'not_unary'
    if filename not in not_unary:
        concepts.append(unary_concept(pkey, subj, records))

    # add a binary concept for each non-key field
    for field in fields:
        obj = schema.index(field)
        concepts.append(binary_concept(field, closures, subj, obj, records))

    return concepts

def _str2records(filename, rel):
    """
    Read a file into memory and convert each relation clause into a list.
    """
    recs = []
    path = nltk.data.find("corpora/chat80/%s" % filename)
    with open(path) as infile:
        for line in infile:
            if line.startswith(rel):
                # strip the predicate name, its parentheses and the final '.'
                line = re.sub(rel+r'\(', '', line)
                line = re.sub(r'\)\.$', '', line)
                # drop the trailing newline, if there is one
                line = line.rstrip('\n')
                record = line.split(',')
                recs.append(record)
    return recs

def unary_concept(label, subj, records):
    """
    Make a unary concept out of the primary key in a record.

    A record is a list of entities in some relation, such as
    C{['france', 'paris']}, where C{'france'} is acting as the primary
    key.

    @param label: the preferred label for the concept
    @type label: string
    @param subj: position in the record of the subject of the predicate
    @type subj: int
    @param records: a list of records
    @type records: list of lists
    @return: L{Concept} of arity 1
    @rtype: L{Concept}
    """
    c = Concept(label, arity=1, extension=set())
    for record in records:
        c.augment(record[subj])
    return c

def binary_concept(label, closures, subj, obj, records):
    """
    Make a binary concept out of the primary key and another field in a record.

    A record is a list of entities in some relation, such as
    C{['france', 'paris']}, where C{'france'} is acting as the primary
    key, and C{'paris'} stands in the C{'capital_of'} relation to
    C{'france'}.

    More generally, given a record such as C{['a', 'b', 'c']}, where
    label is bound to C{'B'}, and C{obj} bound to 1, the derived
    binary concept will have label C{'B_of'}, and its extension will
    be a set of pairs such as C{('a', 'b')}.

    @param label: the base part of the preferred label for the concept
    @type label: string
    @param closures: closure properties for the extension of the concept
    @type closures: list
    @param subj: position in the record of the subject of the predicate
    @type subj: int
    @param obj: position in the record of the object of the predicate
    @type obj: int
    @param records: a list of records
    @type records: list of lists
    @return: L{Concept} of arity 2
    @rtype: L{Concept}
    """
    # 'border' and 'contain' keep their bare labels
    if label not in ('border', 'contain'):
        label = label + '_of'
    c = Concept(label, arity=2, closures=closures, extension=set())
    for record in records:
        c.augment((record[subj], record[obj]))
    # close the concept's extension according to its closure properties
    c.close()
    return c


def process_bundle(rels):
    """
    Given a list of relation metadata bundles, make a corresponding
    dictionary of concepts, indexed by the relation name.

    @param rels: bundle of metadata needed for constructing a concept
    @type rels: list of dictionaries
    @return: a dictionary of concepts, indexed by the relation name.
    @rtype: dict
    """
    concepts = {}
    for rel in rels:
        rel_name = rel['rel_name']
        closures = rel['closures']
        schema = rel['schema']
        filename = rel['filename']

        concept_list = clause2concepts(filename, rel_name, closures, schema)
        for c in concept_list:
            label = c.prefLabel
            if label in concepts:
                # merge the new data into the existing concept
                for data in c.extension:
                    concepts[label].augment(data)
                concepts[label].close()
            else:
                concepts[label] = c
    return concepts


def make_valuation(concepts, read=False, lexicon=False):
    """
    Convert a list of C{Concept}s into a list of (label, extension) pairs;
    optionally create a C{Valuation} object.

    @param concepts: concepts
    @type concepts: list of L{Concept}s
    @param read: if C{True}, C{(symbol, set)} pairs are read into a C{Valuation}
    @type read: bool
    @param lexicon: if C{True}, also write out lexical rules; implies C{read}
    @type lexicon: bool
    @rtype: list or a L{Valuation}
    """
    vals = []

    for c in concepts:
        vals.append((c.prefLabel, c.extension))
    if lexicon:
        read = True
    if read:
        from nltk.sem import Valuation
        val = Valuation(vals)
        # add labels for individuals
        val = label_indivs(val, lexicon=lexicon)
        return val
    else:
        return vals


def val_dump(rels, db):
    """
    Make a L{Valuation} from a list of relation metadata bundles and dump to
    persistent database.

    @param rels: bundle of metadata needed for constructing a concept
    @type rels: list of dictionaries
    @param db: name of file to which data is written.
               The suffix '.db' will be automatically appended.
    @type db: string
    """
    concepts = process_bundle(rels).values()
    valuation = make_valuation(concepts, read=True)
    # 'n' always creates a new, empty database
    db_out = shelve.open(db, 'n')

    db_out.update(valuation)

    db_out.close()


def val_load(db):
    """
    Load a L{Valuation} from a persistent database.

    @param db: name of file from which data is read.
               The suffix '.db' should be omitted from the name.
    @type db: string
    """
    dbname = db+".db"

    if not os.access(dbname, os.R_OK):
        sys.exit("Cannot read file: %s" % dbname)
    else:
        db_in = shelve.open(db)
        from nltk.sem import Valuation
        val = Valuation(db_in)

        return val


def alpha(str):
    """
    Utility to filter out non-alphabetic constants.

    @param str: candidate constant
    @type str: string
    @rtype: bool
    """
    try:
        int(str)
        return False
    except ValueError:
        # '?' marks an unknown value and is not a constant either
        return str != '?'


def label_indivs(valuation, lexicon=False):
    """
    Assign individual constants to the individuals in the domain of a C{Valuation}.

    Given a valuation with an entry of the form {'rel': {'a': True}},
    add a new entry {'a': 'a'}.

    @type valuation: L{Valuation}
    @param lexicon: if C{True}, write lexical rules to C{chat_pnames.cfg}
    @type lexicon: bool
    @rtype: L{Valuation}
    """
    # collect all the individuals into a domain
    domain = valuation.domain
    # convert the domain into a sorted list of alphabetic terms
    entities = sorted(e for e in domain if alpha(e))
    # use the same string as a label
    pairs = [(e, e) for e in entities]
    if lexicon:
        lex = make_lex(entities)
        with open("chat_pnames.cfg", mode='w') as outfile:
            outfile.writelines(lex)
    # read the pairs into the valuation
    valuation.read(pairs)
    return valuation

def make_lex(symbols):
    """
    Create lexical CFG rules for each individual symbol.

    Given a valuation with an entry of the form {'zloty': 'zloty'},
    create a lexical rule for the proper name 'Zloty'.

    @param symbols: a list of individual constants in the semantic representation
    @type symbols: sequence
    @rtype: list
    """
    lex = []
    header = """
##################################################################
# Lexical rules automatically generated by running 'chat80.py -x'.
##################################################################

"""
    lex.append(header)
    # the backslash is escaped so that the rule contains a literal '\P'
    template = "PropN[num=sg, sem=<\\P.(P %s)>] -> '%s'\n"

    for s in symbols:
        # capitalize each underscore-separated part of the symbol
        parts = s.split('_')
        caps = [p.capitalize() for p in parts]
        pname = ('_').join(caps)
        rule = template % (s, pname)
        lex.append(rule)
    return lex


def concepts(items=items):
    """
    Build a list of concepts corresponding to the relation names in C{items}.

    @param items: names of the Chat-80 relations to extract
    @type items: list of strings
    @return: the L{Concept}s which are extracted from the relations
    @rtype: list
    """
    # allow a single relation name to be passed as a string
    if isinstance(items, str):
        items = (items,)

    rels = [item_metadata[r] for r in items]

    concept_map = process_bundle(rels)
    return concept_map.values()


def main():
    from optparse import OptionParser
    description = \
    """
Extract data from the Chat-80 Prolog files and convert them into a
Valuation object for use in the NLTK semantics package.
    """

    opts = OptionParser(description=description)
    opts.set_defaults(verbose=True, lex=False, vocab=False)
    opts.add_option("-s", "--store", dest="outdb",
                    help="store a valuation in DB", metavar="DB")
    opts.add_option("-l", "--load", dest="indb",
                    help="load a stored valuation from DB", metavar="DB")
    opts.add_option("-c", "--concepts", action="store_true",
                    help="print concepts instead of a valuation")
    opts.add_option("-r", "--relation", dest="label",
                    help="print concept with label REL (check possible labels with '-v' option)", metavar="REL")
    opts.add_option("-q", "--quiet", action="store_false", dest="verbose",
                    help="don't print out progress info")
    opts.add_option("-x", "--lex", action="store_true", dest="lex",
                    help="write a file of lexical entries for country names, then exit")
    opts.add_option("-v", "--vocab", action="store_true", dest="vocab",
                    help="print out the vocabulary of concept labels and their arity, then exit")

    (options, args) = opts.parse_args()
    if options.outdb and options.indb:
        opts.error("Options --store and --load are mutually exclusive")

    if options.outdb:
        # write the valuation to a persistent database
        if options.verbose:
            outdb = options.outdb+".db"
            print("Dumping a valuation to %s" % outdb)
        val_dump(rels, options.outdb)
        sys.exit(0)
    else:
        # try to read in a valuation from a database
        if options.indb is not None:
            dbname = options.indb+".db"
            if not os.access(dbname, os.R_OK):
                sys.exit("Cannot read file: %s" % dbname)
            else:
                valuation = val_load(options.indb)
        # otherwise create the valuation from scratch
        else:
            # build the concepts
            concept_map = process_bundle(rels)
            concepts = concept_map.values()
            # just print out the vocabulary
            if options.vocab:
                items = [(c.arity, c.prefLabel) for c in concepts]
                items.sort()
                for (arity, label) in items:
                    print("%s %s" % (label, arity))
                sys.exit(0)

            # show all the concepts
            if options.concepts:
                for c in concepts:
                    print(c)
                    print("")
            if options.label:
                print(concept_map[options.label])
                sys.exit(0)
            else:
                # turn the concepts into a Valuation
                if options.lex:
                    if options.verbose:
                        print("Writing out lexical rules")
                    make_valuation(concepts, lexicon=True)
                else:
                    valuation = make_valuation(concepts, read=True)
                    print(valuation)


if __name__ == '__main__':
    main()