Information Extraction standardly consists of three subtasks: Named Entity Recognition, Relation Extraction and Template Filling.
The IEER corpus is marked up for a variety of Named Entities. A Named Entity (more strictly, a Named Entity mention) is a name of an entity belonging to a specified class. For example, the Named Entity classes in IEER include PERSON, LOCATION, ORGANIZATION, DATE and so on. Within NLTK, Named Entities are represented as subtrees within a chunk structure: the class name is treated as the node label, while the entity mention itself appears as the leaves of the subtree. This is illustrated below, where we show an extract of the chunk representation of document NYT_19980315.064:
Thus, the Named Entity mentions in this example are Cohn, McGlashan & Sarrail, San Mateo and Calif.
The CoNLL2002 Dutch and Spanish data is treated similarly, although in this case, the strings are also POS tagged.
Relation Extraction standardly consists of identifying specified relations between Named Entities. For example, assuming that we can recognize ORGANIZATIONs and LOCATIONs in text, we might want to also recognize pairs (o, l) of these kinds of entities such that o is located in l.
The sem.relextract module provides some tools to help carry out a simple version of this task. The tree2semi_rel() function splits a chunk document into a list of two-member lists, each of which consists of a (possibly empty) string followed by a Tree (i.e., a Named Entity):
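The splitting step described above can be sketched in plain Python. This is a simplified stand-in, not NLTK's implementation: the chunk document is modelled as a flat list in which strings are ordinary tokens and `(label, words)` tuples play the role of Named Entity subtrees. The sentence and the helper name `split_into_pairs` are hypothetical.

```python
# Hypothetical stand-in for a chunk structure: plain strings are ordinary
# tokens, (label, words) tuples stand for Named Entity subtrees.
chunked = [
    "The", "lawyer", "joined",
    ("ORGANIZATION", ["Acme", "Corp"]),
    "in",
    ("LOCATION", ["Springfield"]),
    ("LOCATION", ["Ill."]),
]


def split_into_pairs(chunks):
    """Pair each entity with the (possibly empty) text that precedes it."""
    pairs = []
    pending = []  # tokens seen since the previous entity
    for item in chunks:
        if isinstance(item, tuple):  # a Named Entity
            pairs.append([" ".join(pending), item])
            pending = []
        else:  # an ordinary token
            pending.append(item)
    return pairs


pairs = split_into_pairs(chunked)
for text, entity in pairs:
    print(repr(text), entity)
```

The last pair has an empty string as its first member because the two LOCATION entities are adjacent in this toy input.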
The function semi_rel2reldict() processes triples of these pairs, i.e., triples of the form ((string1, Tree1), (string2, Tree2), (string3, Tree3)), and outputs a dictionary (a reldict) in which Tree1 is the subject of the relation, string2 is the filler and Tree2 is the object of the relation. string1 and string3 are stored as left and right context respectively.
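A minimal sketch of the windowing idea, again in plain Python rather than with NLTK itself. It makes one assumption explicit: the entity that directly follows the filler is taken as the object, so the filler is exactly the text between subject and object. The input pairs are hypothetical, and the key names (`lcon`, `subjclass`, `filler`, and so on) are chosen to resemble those of a reldict but should be checked against the library.

```python
# Hypothetical [string, entity] pairs; entities are (label, words) tuples.
pairs = [
    ["The lawyer joined", ("ORGANIZATION", ["Acme", "Corp"])],
    ["in", ("LOCATION", ["Springfield"])],
    [", near", ("LOCATION", ["Shelbyville"])],
]


def build_reldicts(pairs):
    """Slide a window of three consecutive pairs and build one dict per window."""
    reldicts = []
    for first, second, third in zip(pairs, pairs[1:], pairs[2:]):
        left, subj = first      # string1, Tree1
        filler, obj = second    # string2, Tree2 (assumed to be the object)
        right, _ = third        # string3 supplies the right context
        reldicts.append({
            "lcon": left,
            "subjclass": subj[0],
            "subjtext": " ".join(subj[1]),
            "filler": filler,
            "objclass": obj[0],
            "objtext": " ".join(obj[1]),
            "rcon": right,
        })
    return reldicts


reldicts = build_reldicts(pairs)
for key, value in reldicts[0].items():
    print(f"{key:10} => {value!r}")
```

With three pairs there is only one window, so a single relation candidate is produced; the final entity only contributes right context.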
The next example shows some of the values for two reldicts corresponding to the 'NYT_19980315' text extract shown earlier.
The function relextract() allows us to filter the reldicts according to the classes of the subject and object named entities. In addition, we can specify that the filler text has to match a given regular expression, as illustrated in the next example. Here, we are looking for pairs of entities in the IN relation, where IN has signature <ORG, LOC>.
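The filtering logic can be approximated with the standard `re` module alone. In this sketch the reldicts are hand-written toy data, `filter_rels` is a hypothetical helper rather than an NLTK function, and the `IN` pattern is one plausible choice: it accepts a filler containing the word "in" unless a word ending in "-ing" follows it.

```python
import re

# "in" as a whole word, not followed later by a word ending in "ing".
IN = re.compile(r".*\bin\b(?!\b.+ing\b)")

# Toy reldicts (hypothetical data, reduced to the keys used here).
reldicts = [
    {"subjclass": "ORGANIZATION", "subjtext": "Acme Corp",
     "filler": "in", "objclass": "LOCATION", "objtext": "Springfield"},
    {"subjclass": "ORGANIZATION", "subjtext": "Acme Corp",
     "filler": "is succeeding in expanding to",
     "objclass": "LOCATION", "objtext": "Shelbyville"},
    {"subjclass": "PERSON", "subjtext": "Jane Doe",
     "filler": "was born in", "objclass": "LOCATION", "objtext": "Ogdenville"},
]


def filter_rels(subjclass, objclass, reldicts, pattern):
    """Keep reldicts with the requested classes whose filler matches pattern."""
    return [
        r for r in reldicts
        if r["subjclass"] == subjclass
        and r["objclass"] == objclass
        and pattern.match(r["filler"])
    ]


found = filter_rels("ORGANIZATION", "LOCATION", reldicts, IN)
for r in found:
    print(f"[ORG: {r['subjtext']!r}] {r['filler']!r} [LOC: {r['objtext']!r}]")
```

The second candidate is rejected by the negative lookahead ("in expanding"), and the third is rejected because its subject is a PERSON, not an ORGANIZATION.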
The next example illustrates a case where the pattern is a disjunction of roles that a PERSON can occupy in an ORGANIZATION.
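A disjunctive pattern of this kind is easiest to read when written with `re.VERBOSE`, one alternative per line. The role list below is a short illustrative selection, and the filler strings are made up for the sketch.

```python
import re

# One role per line; whitespace inside the pattern is ignored by re.VERBOSE.
ROLES = re.compile(r"""
    .*\b(
        analyst |
        chair(wo)?man |
        director |
        editor |
        president |
        spokes(wo)?man
    )\b.*
""", re.VERBOSE)

# Hypothetical filler strings found between a PERSON and an ORGANIZATION.
fillers = [
    ", chairwoman of",
    "'s spokesman at",
    "visited",
    "is a former director of the",
]

matched = [f for f in fillers if ROLES.match(f)]
for f in matched:
    print(repr(f))
```

Only the fillers that mention one of the listed roles survive, so "visited" is dropped.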
In the case of the CoNLL2002 data, we can include POS tags in the query pattern. This example also illustrates how the output can be presented as something that looks more like a clause in a logical language.
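The sketch below shows, under stated assumptions, how a pattern can refer to POS tags and how a match can be printed as a clause. Tokens are `(word, tag)` tuples, the filler is rendered as `word/TAG` strings before matching, and the pattern looks for the Spanish prepositions "de" or "del" tagged `SP`. The candidate relations, the tag values and the helpers `tagged_str`, `to_symbol` and `as_clause` are all hypothetical.

```python
import re


def tagged_str(tokens):
    """Render (word, tag) tokens as a 'word/TAG word/TAG' string."""
    return " ".join(f"{word}/{tag}" for word, tag in tokens)


def to_symbol(tokens):
    """Turn an entity's words into a lowercase, underscore-joined symbol."""
    return "_".join(word for word, _ in tokens).lower()


def as_clause(rel, relsym):
    """Format a relation as relsym('subject', 'object')."""
    return f"{relsym}({to_symbol(rel['subj'])!r}, {to_symbol(rel['obj'])!r})"


# The filler must end in "de" or "del" tagged as a preposition (SP).
DE = re.compile(r"(.*\s)?del?/SP$")

# Toy ORG/LOC candidates with POS-tagged tokens (hypothetical data).
candidates = [
    {"subj": [("Banco", "NC"), ("Central", "AQ")],
     "filler": [("de", "SP")],
     "obj": [("Chile", "NP")]},
    {"subj": [("Banco", "NC"), ("Central", "AQ")],
     "filler": [("situado", "AQ"), ("en", "SP")],
     "obj": [("Santiago", "NP")]},
]

clauses = [
    as_clause(rel, "DE")
    for rel in candidates
    if DE.match(tagged_str(rel["filler"]))
]
for clause in clauses:
    print(clause)
```

Matching against the tagged string lets the pattern distinguish "de" used as a preposition from the same letters appearing with another tag.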