Information Extraction

Information Extraction standardly consists of three subtasks:

  1. Named Entity Recognition
  2. Relation Extraction
  3. Template Filling

Named Entities

The IEER corpus is marked up for a variety of Named Entities. A Named Entity (more strictly, a Named Entity mention) is a name of an entity belonging to a specified class. For example, the Named Entity classes in IEER include PERSON, LOCATION, ORGANIZATION, DATE and so on. Within NLTK, Named Entities are represented as subtrees within a chunk structure: the class name is treated as the node label, while the entity mention itself appears as the leaves of the subtree. This is illustrated below, where we show an extract of the chunk representation of document NYT_19980315.064:

>>> from nltk.corpus import ieer
>>> docs = ieer.parsed_docs('NYT_19980315')
>>> tree = docs[1].text
>>> print(tree) # doctest: +ELLIPSIS
(DOCUMENT
...
  ``It's
  a
  chance
  to
  think
  about
  first-level
  questions,''
  said
  Ms.
  (PERSON Cohn)
  ,
  a
  partner
  in
  the
  (ORGANIZATION McGlashan & Sarrail)
  firm
  in
  (LOCATION San Mateo)
  ,
  (LOCATION Calif.)
  ...)

Thus, the Named Entity mentions in this example are Cohn, McGlashan & Sarrail, San Mateo and Calif..
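
The subtree representation can be sketched without loading the corpus. The snippet below is only an illustration: it uses plain `(label, leaves)` tuples as a stand-in for NLTK trees, and the helper name `entity_mentions` is hypothetical rather than part of NLTK.

```python
# Stand-in for a chunk structure (an assumption for illustration):
# plain strings are ordinary tokens, and (label, leaves) tuples play
# the role of Named Entity subtrees.
chunked = [
    "said", "Ms.", ("PERSON", ["Cohn"]), ",", "a", "partner", "in", "the",
    ("ORGANIZATION", ["McGlashan", "&", "Sarrail"]), "firm", "in",
    ("LOCATION", ["San", "Mateo"]), ",", ("LOCATION", ["Calif."]),
]


def entity_mentions(chunks):
    """Return (class, mention) pairs, one per entity subtree."""
    mentions = []
    for item in chunks:
        if isinstance(item, tuple):
            label, leaves = item
            # The node label is the class; the leaves spell out the mention.
            mentions.append((label, " ".join(leaves)))
    return mentions


for label, mention in entity_mentions(chunked):
    print(label, "=>", mention)
```

With a real chunk structure, the same loop would test for subtrees and read the node label and leaves from each one.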

The CoNLL2002 Dutch and Spanish data is treated similarly, although in this case, the strings are also POS tagged.

>>> from nltk.corpus import conll2002
>>> for doc in conll2002.chunked_sents('ned.train')[27]:
...     print(doc)
('Het', 'Art')
(ORG Hof/N van/Prep Cassatie/N)
('verbrak', 'V')
('het', 'Art')
('arrest', 'N')
('zodat', 'Conj')
('het', 'Pron')
('moest', 'V')
('worden', 'V')
('overgedaan', 'V')
('door', 'Prep')
('het', 'Art')
('hof', 'N')
('van', 'Prep')
('beroep', 'N')
('van', 'Prep')
(LOC Antwerpen/N)
('.', 'Punc')

Relation Extraction

Relation Extraction standardly consists of identifying specified relations between Named Entities. For example, assuming that we can recognize ORGANIZATIONs and LOCATIONs in text, we might want to also recognize pairs (o, l) of these kinds of entities such that o is located in l.

The sem.relextract module provides some tools to help carry out a simple version of this task. The tree2semi_rel() function splits a chunk document into a list of two-member lists, each of which consists of a (possibly empty) list of word tokens followed by a Tree (i.e., a Named Entity):

>>> from nltk.sem import relextract
>>> pairs = relextract.tree2semi_rel(tree)
>>> for s, tree in pairs[18:22]:
...     print('("...%s", %s)' % (" ".join(s[-5:]),tree))
("...about first-level questions,'' said Ms.", (PERSON Cohn))
("..., a partner in the", (ORGANIZATION McGlashan & Sarrail))
("...firm in", (LOCATION San Mateo))
("...,", (LOCATION Calif.))
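
The splitting step can be approximated in a few lines. This is a minimal sketch rather than the library's implementation: entities are modelled as `(label, leaves)` tuples instead of trees, and `semi_rel_pairs` is a hypothetical name.

```python
def semi_rel_pairs(chunks):
    """Split a flat chunk sequence into [preceding_words, entity] pairs.

    Plain strings are ordinary tokens; tuples stand in for entity subtrees.
    Words that follow the last entity are dropped in this sketch.
    """
    pairs = []
    words = []
    for item in chunks:
        if isinstance(item, tuple):
            # An entity closes the current pair and starts a fresh word list.
            pairs.append([words, item])
            words = []
        else:
            words.append(item)
    return pairs


chunks = [
    "said", "Ms.", ("PERSON", ["Cohn"]), ",", "a", "partner", "in", "the",
    ("ORGANIZATION", ["McGlashan", "&", "Sarrail"]), "firm", "in",
    ("LOCATION", ["San", "Mateo"]), ",", ("LOCATION", ["Calif."]), ".",
]

for words, entity in semi_rel_pairs(chunks):
    print(" ".join(words), "|", entity)
```

Two adjacent entities would produce a pair whose word list is empty, which is why the first member of a pair may be empty.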

The function semi_rel2reldict() processes successive triples of these pairs, i.e., triples of the form ((string1, Tree1), (string2, Tree2), (string3, Tree3)), and outputs a list of dictionaries (reldicts), one per triple, in which Tree1 is the subject of the relation, string2 is the filler and Tree2 is the object of the relation. string1 and string3 are stored as left and right context respectively.

>>> reldicts = relextract.semi_rel2reldict(pairs)
>>> for k, v in sorted(reldicts[0].items()):
...     print(k, '=>', v) # doctest: +ELLIPSIS
filler => of messages to their own ``Cyberia'' ...
lcon => transactions.'' Each week, they post
objclass => ORGANIZATION
objsym => white_house
objtext => White House
rcon => for access to its planned
subjclass => CARDINAL
subjsym => hundreds
subjtext => hundreds
untagged_filler => of messages to their own ``Cyberia'' ...
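
The way a window of three pairs turns into one reldict can be sketched as follows. The reading used here is inferred from the example outputs in this document, not from the library source: the subject comes from the first pair, the filler and object from the second, and the right context from the third. The function name `reldicts_from_pairs`, the tuple representation of entities, and the five-token context window are all assumptions for illustration.

```python
def reldicts_from_pairs(pairs, window=5):
    """Build one dict per sliding window of three [words, entity] pairs."""
    result = []
    for i in range(len(pairs) - 2):
        (s1, t1), (s2, t2), (s3, _t3) = pairs[i:i + 3]
        result.append({
            "lcon": " ".join(s1[-window:]),   # left context: tail of string1
            "subjclass": t1[0],
            "subjtext": " ".join(t1[1]),
            "filler": " ".join(s2),           # text between subject and object
            "objclass": t2[0],
            "objtext": " ".join(t2[1]),
            "rcon": " ".join(s3[:window]),    # right context: head of string3
        })
    return result


pairs = [
    [["said", "Ms."], ("PERSON", ["Cohn"])],
    [[",", "a", "partner", "in", "the"],
     ("ORGANIZATION", ["McGlashan", "&", "Sarrail"])],
    [["firm", "in"], ("LOCATION", ["San", "Mateo"])],
    [[","], ("LOCATION", ["Calif."])],
]

for r in reldicts_from_pairs(pairs):
    print(r["subjtext"], "|", r["filler"], "|", r["objtext"])
```

Because the window slides one pair at a time, every entity except the first serves as an object, and every entity except the last two serves as a subject.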

The next example shows some of the values for two reldicts corresponding to the 'NYT_19980315' text extract shown earlier.

>>> for r in reldicts[18:20]:
...     print('=' * 20)
...     print(r['subjtext'])
...     print(r['filler'])
...     print(r['objtext'])
====================
Cohn
, a partner in the
McGlashan & Sarrail
====================
McGlashan & Sarrail
firm in
San Mateo

The function extract_rels() allows us to filter the reldicts according to the classes of the subject and object named entities. In addition, we can specify that the filler text has to match a given regular expression, as illustrated in the next example. Here, we are looking for pairs of entities in the IN relation, where IN has signature <ORG, LOC>.

>>> import re
>>> IN = re.compile(r'.*\bin\b(?!\b.+ing\b)')
>>> for fileid in ieer.fileids():
...     for doc in ieer.parsed_docs(fileid):
...         for rel in relextract.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern = IN):
...             print(relextract.rtuple(rel))  # doctest: +ELLIPSIS
[ORG: 'Christian Democrats'] ', the leading political forces in' [LOC: 'Italy']
[ORG: 'AP'] ') _ Lebanese guerrillas attacked Israeli forces in southern' [LOC: 'Lebanon']
[ORG: 'Security Council'] 'adopted Resolution 425. Huge yellow banners hung across intersections in' [LOC: 'Beirut']
[ORG: 'U.N.'] 'failures in' [LOC: 'Africa']
[ORG: 'U.N.'] 'peacekeeping operation in' [LOC: 'Somalia']
[ORG: 'U.N.'] 'partners on a more effective role in' [LOC: 'Africa']
[ORG: 'AP'] ') _ A bomb exploded in a mosque in central' [LOC: 'San`a']
[ORG: 'Krasnoye Sormovo'] 'shipyard in the Soviet city of' [LOC: 'Gorky']
[ORG: 'Kelab Golf Darul Ridzuan'] 'in' [LOC: 'Perak']
[ORG: 'U.N.'] 'peacekeeping operation in' [LOC: 'Somalia']
[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
[ORG: 'McGlashan &AMP; Sarrail'] 'firm in' [LOC: 'San Mateo']
[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
...
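
The behaviour of the IN pattern can be checked in isolation with the standard re module. The sketch below assumes the pattern is applied with match(), i.e. anchored at the start of the filler. The helper `matches_in` is hypothetical, and the last two sample fillers are invented for illustration rather than taken from the corpus.

```python
import re

IN = re.compile(r'.*\bin\b(?!\b.+ing\b)')


def matches_in(filler):
    """True if the filler contains the word 'in' and no later word ends in 'ing'."""
    return IN.match(filler) is not None


fillers = [
    "firm in",                                         # from the output above
    ", a self-described business incubator based in",  # from the output above
    "succeeded in supervising the",                    # invented: gerund follows 'in'
    "spokesman for",                                   # invented: no 'in' at all
]

for filler in fillers:
    print(repr(filler), matches_in(filler))
```

The negative lookahead rejects fillers in which the word "in" is followed by a word ending in "ing", so a phrase such as "succeeded in supervising" is not mistaken for a location relation.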

The next example illustrates a case where the pattern is a disjunction of roles that a PERSON can occupy in an ORGANIZATION.

>>> roles = r"""
... (.*(
... analyst|
... chair(wo)?man|
... commissioner|
... counsel|
... director|
... economist|
... editor|
... executive|
... foreman|
... governor|
... head|
... lawyer|
... leader|
... librarian).*)|
... manager|
... partner|
... president|
... producer|
... professor|
... researcher|
... spokes(wo)?man|
... writer|
... ,\sof\sthe?\s*  # "X, of (the) Y"
... """
>>> ROLES = re.compile(roles, re.VERBOSE)
>>> for fileid in ieer.fileids():
...     for doc in ieer.parsed_docs(fileid):
...         for rel in relextract.extract_rels('PER', 'ORG', doc, corpus='ieer', pattern=ROLES):
...             print(relextract.rtuple(rel)) # doctest: +ELLIPSIS
[PER: 'Kivutha Kibwana'] ', of the' [ORG: 'National Convention Assembly']
[PER: 'Boban Boskovic'] ', chief executive of the' [ORG: 'Plastika']
[PER: 'Annan'] ', the first sub-Saharan African to head the' [ORG: 'United Nations']
[PER: 'Kiriyenko'] 'became a foreman at the' [ORG: 'Krasnoye Sormovo']
[PER: 'Annan'] ', the first sub-Saharan African to head the' [ORG: 'United Nations']
[PER: 'Mike Godwin'] ', chief counsel for the' [ORG: 'Electronic Frontier Foundation']
...
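
The effect of re.VERBOSE on a pattern laid out like this can be tried on a shortened version. The sketch below is an assumption-laden illustration: it keeps only three of the role words plus the final alternative, assumes the pattern is applied with match() from the start of the filler, and uses a hypothetical helper `has_role`.

```python
import re

# Shortened, illustrative version of the roles pattern. Under re.VERBOSE,
# whitespace and newlines in the pattern are ignored and '#' starts a comment.
roles = r"""
(.*(
chair(wo)?man|
director|
executive).*)|
,\sof\sthe?\s*  # "X, of (the) Y"
"""
ROLES = re.compile(roles, re.VERBOSE)


def has_role(filler):
    """True if the filler matches the role pattern from its first character."""
    return ROLES.match(filler) is not None


for filler in [", chief executive of the", ", of the", "visited the"]:
    print(repr(filler), has_role(filler))
```

Writing the pattern as a raw string keeps sequences such as \s intact for the regular expression engine.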

In the case of the CoNLL2002 data, we can include POS tags in the query pattern. This example also illustrates how the output can be presented as something that looks more like a clause in a logical language.

>>> de = """
... .*
... (
... de/SP|
... del/SP
... )
... """
>>> DE = re.compile(de, re.VERBOSE)
>>> rels = [rel for doc in conll2002.chunked_sents('esp.train')
...         for rel in relextract.extract_rels('ORG', 'LOC', doc, corpus='conll2002', pattern = DE)]
>>> for r in rels[:10]:
...     print(relextract.clause(r, relsym='DE'))    # doctest: +NORMALIZE_WHITESPACE
DE('tribunal_supremo', 'victoria')
DE('museo_de_arte', 'alcorcón')
DE('museo_de_bellas_artes', 'a_coruña')
DE('siria', 'líbano')
DE('unión_europea', 'pekín')
DE('ejército', 'rogberi')
DE('juzgado_de_instrucción_número_1', 'san_sebastián')
DE('psoe', 'villanueva_de_la_serena')
DE('ejército', 'líbano')
DE('juzgado_de_lo_penal_número_2', 'ceuta')
>>> vnv = """
... (
... is/V|
... was/V|
... werd/V|
... wordt/V
... )
... .*
... van/Prep
... """
>>> VAN = re.compile(vnv, re.VERBOSE)
>>> for doc in conll2002.chunked_sents('ned.train'):
...     for r in relextract.extract_rels('PER', 'ORG', doc, corpus='conll2002', pattern=VAN):
...         print(relextract.clause(r, relsym="VAN"))
VAN("cornet_d'elzius", 'buitenlandse_handel')
VAN('johan_rottiers', 'kardinaal_van_roey_instituut')
VAN('annie_lennox', 'eurythmics')
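
The two ideas in these CoNLL2002 examples, matching against a POS-tagged filler and printing a clause-like result, can be sketched with the standard library alone. Everything below is a simplified illustration under Python 3: the helpers `tagged_filler`, `to_symbol` and `format_clause` are hypothetical names, the sample tokens are invented, and the symbol normalisation (lowercase, words joined by underscores) is inferred from the outputs shown rather than taken from the library.

```python
import re

DE = re.compile(r"""
.*
(
de/SP|
del/SP
)
""", re.VERBOSE)


def tagged_filler(tokens):
    """Join (word, tag) pairs into a 'word/TAG word/TAG' string."""
    return " ".join(f"{word}/{tag}" for word, tag in tokens)


def to_symbol(words):
    """Lowercase the words and join them with underscores."""
    return "_".join(words).lower()


def format_clause(relsym, subj_words, obj_words):
    """Render a relation as RELSYM('subject', 'object')."""
    return "%s(%r, %r)" % (relsym, to_symbol(subj_words), to_symbol(obj_words))


# Invented filler tokens; only the SP tag is taken from the pattern above.
filler = tagged_filler([("del", "SP")])
print(filler, bool(DE.match(filler)))

# Capitalisation of the input words is assumed for illustration.
print(format_clause("VAN", ["Annie", "Lennox"], ["Eurythmics"]))
```

Including the tag in the pattern lets a query distinguish, for example, a preposition from a homograph with a different part of speech.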