.. -*- mode: rst -*-

.. include:: ../definitions.rst

========================================================
?. Natural Language Processing and Information Retrieval
========================================================

-------------------------
?. Information Extraction
-------------------------

For many NLP applications, the priority is to extract meaning from written text.
This must be done *robustly*; processing should not fail when unexpected or
ill-formed input is received.  Today, robust semantic interpretation is easiest
when shallow processing methods are used.  As we saw in the previous section,
these methods severely limit (or even omit) syntactic analysis and restrict the
complexity of the target semantic representations.

One well-known instance of shallow semantics is Information Extraction (IE).
In contrast to full text understanding, IE systems only try to recognize a
limited number of pre-specified semantic topics, and use a constrained semantic
representation.

The initial phase of information extraction involves the detection of named
entities: expressions that denote locations, people, companies, times, and
monetary amounts.  (In fact, these named entities are nothing other than the
low-level categories of the semantic grammars we saw in the previous section,
but under a new guise.)  To illustrate, the following text contains several
expressions which have been tagged as entities of types ``time``, ``company``
and ``location``:

| The incident occurred when a man walked into
| William Hill bookkeepers on Ware Road,
| and threatened a member of staff with a small black handgun.

The final output of an information extraction system, usually referred to as a
template, consists of a fixed number of slots that must be filled.  In more
semantic terms, we can think of the template as denoting an event in which the
participating entities play specific roles.  Here is an example corresponding
to the above text:

| Crime event
|   Time: around 5.30pm
|   Location: William Hill bookkeepers on Ware Road
|   Suspect: man
|   Weapon: small black handgun

Early work in IE demonstrated that hand-built rules could achieve good accuracy
in the tasks defined by the DARPA Message Understanding Conferences (MUC).  In
general, the highest accuracy has been achieved in identifying so-called named
entities, such as ``William Hill bookkeepers``.  However, it is harder to
identify which slots the entities should fill, and harder still to accurately
identify which event the entities are participating in.  In recent years,
attention has shifted to making IE systems more portable to new domains by
using some form of machine learning.  These systems work moderately well on
marked-up text (e.g. HTML-formatted text), but their performance on free text
still needs improvement before they can be adopted more widely.

IE systems commonly use external knowledge sources in addition to the input
text.  For example, named entity recognizers often use name or location
gazetteers.  In the bioinformatics domain it is common to draw on external
knowledge bases for standard names of genes and proteins.  A general-purpose
lexical ontology called WordNet is often used in IE systems.  Sometimes formal
domain ontologies are used as a knowledge source for IE, permitting systems to
perform simple taxonomic inferences and resolve ambiguities in the
classification of slot-fillers.  For example, an IE system for extracting
basketball statistics from news reports on the web might have to interpret the
sentence *Quincy played three games last weekend*, where the ambiguous subject
*Quincy* might be either a player or a team.  An ontology in which *Quincy* is
identified as the name of a college basketball team would permit the ambiguity
to be resolved.  In other cases, the template itself disambiguates the slot
filler.  Thus, in the basketball domain, *Washington* might refer to one of
many possible players, teams or locations.  If we know that the verb *won*
requires an entity that the taxonomy classifies as a team, then we can resolve
the ambiguity in the sentence *Washington won*.
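To make the two stages described above (named entity detection and template
filling) concrete, here is one possible minimal sketch in Python, using NLTK's
pretrained named entity chunker.  The entity labels it assigns (such as
``PERSON``, ``ORGANIZATION`` or ``GPE``) will not necessarily match the
``time``, ``company`` and ``location`` types used above, and the slot-filling
rules below are hypothetical toy rules written for this example only, not part
of any library::

    import nltk

    # Assumes the relevant NLTK data (tokenizer, POS tagger, NE chunker)
    # has already been fetched with nltk.download().
    text = ("The incident occurred when a man walked into "
            "William Hill bookkeepers on Ware Road, and threatened "
            "a member of staff with a small black handgun.")

    # Step 1: named entity detection with the pretrained chunker.
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
    entities = [(st.label(), " ".join(word for word, tag in st.leaves()))
                for st in tree.subtrees() if st.label() != "S"]
    print(entities)

    # Step 2: fill a fixed template using toy hand-built rules
    # (hypothetical; a real system would need richer rules or learning).
    template = {"Time": None, "Location": None, "Suspect": None, "Weapon": None}
    for label, name in entities:
        if label in ("ORGANIZATION", "FACILITY", "GPE", "LOCATION"):
            template["Location"] = name
    print(template)

The point of the sketch is only the shape of the pipeline: raw text comes in,
entity mentions are recognized and labeled, and a small set of rules (or, in a
real system, a learned model) maps them into template slots.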
Message Understanding
---------------------

A message understanding system will extract salient chunks of text from a news
story and populate a database.

.. figure:: ../images/chunk-muc.png
   :scale: 60

   The Message Understanding Process (from Abney 1996)

Consider the units that have been selected in this process: a name
(``Garcia Alvarado``), a verb cluster (``was killed``), and a locative
prepositional phrase (``on his vehicle``).  These are examples of ``NP``,
``VP`` and ``PP`` chunks.

MUC-7 Data?

Named Entity Recognition
------------------------

Toponym Resolution
------------------

Recognizing Temporal Expressions
--------------------------------

Semantic Role Labeling
----------------------

---------------------------------
Biomedical Information Extraction
---------------------------------

----------------------
Document Summarization
----------------------

------------
Other Topics
------------

* text retrieval
* text classification
* text segmentation
* topic detection
* web crawling
* saving extracted material into a database

.. include:: footer.txt