.. -*- mode: rst -*-

.. include:: ../definitions.rst

.. TODO: reconsider location of section on generative grammar

.. _chap-introduction:

==============================================
1. Introduction to Natural Language Processing
==============================================

------------------------
Language and Computation
------------------------

Language and Symbol Processing
------------------------------

The very notion that natural language could be treated in a computational manner grew out of a research program, dating back to the early 1900s, to reconstruct mathematical reasoning using logic, most clearly manifested in work by Frege, Russell, Wittgenstein, Tarski, Lambek and Carnap. This work led to the notion of language as a formal system amenable to automatic processing. Three later developments laid the foundation for natural language processing.

The first was `formal language theory`:dt:. This defined a language as a set of strings accepted by a class of automata (for example, the context-free languages, which are accepted by pushdown automata), and provided the underpinnings for computational syntax.

The second development was `symbolic logic`:dt:. This provided a formal method for capturing selected aspects of natural language that are relevant for expressing logical proofs. A formal calculus in symbolic logic provides the syntax of a language, together with rules of inference and, possibly, rules of interpretation in a set-theoretic model; examples are propositional logic and First Order Logic. Given such a calculus, with a well-defined syntax and semantics, it becomes possible to associate meanings with expressions of natural language by translating them into expressions of the formal calculus. For example, if we translate `John saw Mary`:lx: into a formula ``saw(j,m)``, we (implicitly or explicitly) interpret the English verb `saw`:lx: as a binary relation, and `John`:lx: and `Mary`:lx: as denoting individuals. More general statements like `All birds fly`:lx: require quantifiers, in this case |forall|, meaning `for all`:lx:: |forall|\ `x (bird(x)`:math: |rarr| `fly(x))`:math:. This use of logic provided the technical machinery to perform inferences that are an important part of language understanding.

A closely related development was the `principle of compositionality`:dt:, namely that the meaning of a complex expression is composed from the meaning of its parts and their mode of combination. This principle provided a useful correspondence between syntax and semantics, namely that the meaning of a complex expression could be computed recursively. Consider the sentence `It is not true that`:lx: `p`:math:, where `p`:math: is a proposition. We can represent the meaning of this sentence as `not(p)`:math:. Similarly, we can represent the meaning of `John saw Mary`:lx: as `saw(j, m)`:math:. Now we can compute the interpretation of `It is not true that John saw Mary`:lx: recursively, using the above information, to get `not(saw(j,m))`:math:. (We give a small programmatic sketch of this recursive computation below.)

The approaches just outlined share the premise that computing with natural language crucially relies on rules for manipulating symbolic representations. For a certain period in the development of |NLP|, particularly during the 1980s, this premise provided a common starting point for both linguists and practitioners of |NLP|, leading to a family of grammar formalisms known as unification-based (or feature-based) grammar, and to |NLP| applications implemented in the Prolog programming language.
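To make the principle of compositionality a little more concrete, here is a minimal sketch in Python (not part of NLTK) that builds the representation ``not(saw(j,m))`` from the meanings of its parts. Representing formulas as plain strings, and the particular function names used, are illustrative assumptions chosen only to mirror the example above::

    # A toy illustration of the principle of compositionality: the meaning of
    # a complex expression is computed recursively from the meanings of its
    # parts and their mode of combination.  (Illustrative sketch only.)

    def saw(subject, obj):
        """Meaning of '<subject> saw <object>': a binary relation."""
        return "saw(%s,%s)" % (subject, obj)

    def negate(proposition):
        """Meaning of 'It is not true that p'."""
        return "not(%s)" % proposition

    # 'John' and 'Mary' are interpreted as denoting individuals.
    j, m = "j", "m"

    # Compose the parts recursively:
    p = saw(j, m)        # 'John saw Mary'                      ->  saw(j,m)
    print(negate(p))     # 'It is not true that John saw Mary'  ->  not(saw(j,m))

The point is not the code itself but the shape of the computation: the meaning of the whole is obtained by applying a combination operation to the meanings of the parts.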
Although grammar-based |NLP| is still a significant area of research, it has become somewhat eclipsed in the last 15\ |ndash|\ 20 years due to a variety of factors. One significant influence came from automatic speech recognition. Although early work in speech processing adopted a model that emulated the kind of rule-based phonological `phonology`:topic: processing typified by the *Sound Pattern of English* [ChomskyHalle68]_, this turned out to be hopelessly inadequate in dealing with the hard problem of recognizing actual speech in anything like real time. By contrast, systems which involved learning patterns from large bodies of speech data were significantly more accurate, efficient and robust. In addition, the speech community found that progress in building better systems was hugely assisted by the construction of shared resources for quantitatively measuring performance against common test data. Eventually, much of the |NLP| community embraced a `data intensive`:dt: orientation to language processing, coupled with a growing use of machine-learning techniques and evaluation-led methodology.

Philosophical Divides
---------------------

The contrasting approaches to |NLP| described in the preceding section relate back to early metaphysical debates about `rationalism`:dt: versus `empiricism`:dt: and `realism`:dt: versus `idealism`:dt: that occurred in the Enlightenment period of Western philosophy. These debates took place against a backdrop of orthodox thinking in which the source of all knowledge was believed to be divine revelation. During this period of the seventeenth and eighteenth centuries, philosophers argued that human reason or sensory experience has priority over revelation. Descartes and Leibniz, amongst others, took the rationalist position, asserting that all truth has its origins in human thought, and in the existence of "innate ideas" implanted in our minds from birth. For example, they argued that the principles of Euclidean geometry were developed using human reason, and were not the result of supernatural revelation or sensory experience. In contrast, Locke and others took the empiricist view, that our primary source of knowledge is the experience of our faculties, and that human reason plays a secondary role in reflecting on that experience. Prototypical evidence for this position was Galileo's discovery |mdash| based on careful observation of the motion of the planets |mdash| that the solar system is heliocentric and not geocentric. In the context of linguistics, this debate leads to the following question: to what extent does human linguistic experience, versus our innate "language faculty", provide the basis for our knowledge of language? In |NLP| this matter surfaces as differences in the priority of corpus data versus linguistic introspection in the construction of computational models. We will return to this issue later in the book.

A further concern, enshrined in the debate between realism and idealism, was the metaphysical status of the constructs of a theory. Kant argued for a distinction between phenomena, the manifestations we can experience, and "things in themselves" which can never be known directly. A linguistic realist would take a theoretical construct like `noun phrase`:dt: to be a real-world entity that exists independently of human perception and reason, and which actually *causes* the observed linguistic phenomena.
A linguistic idealist, on the other hand, would argue that noun phrases, along with more abstract constructs like semantic representations, are intrinsically unobservable, and simply play the role of useful fictions. The way linguists write about theories often betrays a realist position, while |NLP| practitioners occupy neutral territory or else lean towards the idealist position. Thus, in |NLP|, it is often enough if a theoretical abstraction leads to a useful result; it does not matter whether this result sheds any light on human linguistic processing.

These issues are still alive today, and show up in the distinctions between symbolic vs statistical methods, deep vs shallow processing, binary vs gradient classifications, and scientific vs engineering goals. However, such contrasts are now highly nuanced, and the debate is no longer as polarized as it once was. In fact, most of the discussions |mdash| and most of the advances even |mdash| involve a "balancing act". For example, one intermediate position is to assume that humans are innately endowed with analogical and memory-based learning methods (weak rationalism), and to use these methods to identify meaningful patterns in their sensory language experience (empiricism).

For a more concrete illustration, consider the way in which statistics from large corpora may serve as evidence for binary choices in a symbolic grammar. For instance, dictionaries describe the words `absolutely`:lx: and `definitely`:lx: as nearly synonymous, yet their patterns of usage are quite distinct when combined with a following verb, as shown in Table absolutely_.

.. table:: absolutely

   +------------------+-------------+------------+------------+--------------+
   | Google hits      | `adore`:lx: | `love`:lx: | `like`:lx: | `prefer`:lx: |
   +------------------+-------------+------------+------------+--------------+
   | `absolutely`:lx: |     289,000 |    905,000 |     16,200 |          644 |
   +------------------+-------------+------------+------------+--------------+
   | `definitely`:lx: |       1,460 |     51,000 |    158,000 |       62,600 |
   +------------------+-------------+------------+------------+--------------+
   | ratio            |       198:1 |       18:1 |       1:10 |         1:97 |
   +------------------+-------------+------------+------------+--------------+

   `Absolutely`:lx: vs `Definitely`:lx: (Liberman 2005, LanguageLog.org)

.. http://itre.cis.upenn.edu/~myl/languagelog/archives/002022.html

As you will see, `absolutely adore`:lx: is about 200 times as popular as `definitely adore`:lx:, while `absolutely prefer`:lx: is about 100 times rarer than `definitely prefer`:lx:. This information is used by statistical language models, but it also counts as evidence for a symbolic account of word combination in which `absolutely`:lx: can only modify extreme actions or attributes, a property that could be represented as a binary-valued feature of certain lexical items. Thus, we see statistical data informing symbolic models. Once this information has been codified symbolically, it is available to be exploited as a contextual feature for statistical language modeling, alongside many other rich sources of symbolic information, like hand-constructed parse trees and semantic representations. Now the circle is closed, and we see symbolic information informing statistical models. This new rapprochement is giving rise to many exciting new developments. We will touch on some of these in the ensuing pages. We too will perform this balancing act, employing approaches to |NLP| that integrate these historically-opposed philosophies and methodologies.
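Before moving on, here is a small sketch in Python of how the raw Google counts in Table absolutely_ could be turned into the ratios just discussed, and then into a tentative binary lexical feature. The threshold and the feature name are illustrative assumptions, not part of any established analysis::

    # Raw Google hit counts from Table absolutely (Liberman 2005).
    counts = {
        ('absolutely', 'adore'):  289000, ('definitely', 'adore'):    1460,
        ('absolutely', 'love'):   905000, ('definitely', 'love'):    51000,
        ('absolutely', 'like'):    16200, ('definitely', 'like'):   158000,
        ('absolutely', 'prefer'):    644, ('definitely', 'prefer'):  62600,
    }

    for verb in ['adore', 'love', 'like', 'prefer']:
        ratio = counts[('absolutely', verb)] / float(counts[('definitely', verb)])
        # A crude, illustrative decision: treat a strong preference for
        # 'absolutely' as corpus evidence that the verb denotes an extreme
        # action or attribute -- a candidate binary-valued lexical feature.
        extreme = ratio > 10
        print('%-8s ratio=%8.2f  EXTREME=%s' % (verb, ratio, extreme))

For `adore`:lx: the computed ratio is roughly 198, and for `prefer`:lx: roughly 0.01, matching the 198:1 and 1:97 entries in the table.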
------------------------------------------------
The Architecture of Linguistic and |NLP| Systems
------------------------------------------------

.. add this to a chapter on grammar or semantics?

Generative Grammar and Modularity
---------------------------------

One of the intellectual descendants of formal language theory was the linguistic framework known as `generative grammar`:dt:. Such a grammar contains a set of rules that recursively specify (or `generate`) the set of well-formed strings in a language. While there is a wide spectrum of models that owe some allegiance to this core, Chomsky's transformational grammar, in its various incarnations, is probably the best known. In the Chomskyan tradition, it is claimed that humans have distinct kinds of linguistic knowledge, organized into different modules: for example, knowledge of a language's sound structure (`phonology`:dt:), knowledge of word structure (`morphology`:dt:), knowledge of phrase structure (`syntax`:dt:), and knowledge of meaning (`semantics`:dt:). In a formal linguistic theory, each kind of linguistic knowledge is made explicit as a different `module`:dt: of the theory, consisting of a collection of basic elements together with a way of combining them into complex structures. For example, a phonological module might provide a set of phonemes together with an operation for concatenating phonemes into phonological strings. Similarly, a syntactic module might provide labeled nodes as primitives together with a mechanism for assembling them into trees. A set of linguistic primitives, together with some operators for defining complex elements, is often called a `level of representation`:dt:.

As well as defining modules, a generative grammar will prescribe how the modules interact. For example, well-formed phonological strings will provide the phonological content of words, and words will provide the terminal elements of syntax trees. Well-formed syntactic trees will be mapped to semantic representations, and contextual or pragmatic information will ground these semantic representations in some real-world situation.

As we indicated above, an important aspect of theories of generative grammar is that they are intended to model the linguistic knowledge of speakers and hearers; they are not intended to explain how humans actually process linguistic information. This is, in part, reflected in the claim that a generative grammar encodes the `competence`:dt: of an idealized native speaker, rather than the speaker's `performance`:dt:. A closely related distinction is to say that a generative grammar encodes `declarative`:dt: rather than `procedural`:dt: knowledge. Declarative knowledge can be glossed as "knowing what", whereas procedural knowledge is "knowing how". As you might expect, computational linguistics has the crucial role of proposing procedural models of language. A central example is parsing, where we have to develop computational mechanisms that convert strings of words into structural representations such as syntax trees. Nevertheless, it is widely accepted that well-engineered computational models of language contain both declarative and procedural aspects. Thus, a full account of parsing will say how declarative knowledge in the form of a grammar and lexicon combines with procedural knowledge that determines how a syntactic analysis should be assigned to a given string of words.
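To make the declarative side of this division concrete, here is a small sketch, in plain Python data structures rather than any particular grammar formalism, of what a toy grammar and lexicon might look like when written down as "knowing what". The category labels and words are illustrative, chosen to match the parsing example developed below::

    # Declarative knowledge: a toy context-free grammar and lexicon written
    # down as plain data.  Nothing here says *how* a parse is to be computed.
    GRAMMAR = {
        'S':  [['NP', 'VP']],   # S  -> NP VP
        'NP': [['Det', 'N']],   # NP -> Det N
        'VP': [['V']],          # VP -> V
    }

    LEXICON = {
        'the':   'Det',
        'cat':   'N',
        'slept': 'V',
    }

The procedural knowledge that puts data like this to work is the subject of the next few paragraphs.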
This procedural knowledge will be expressed as an `algorithm`:dt:\ : that is, an explicit recipe for mapping some input into an appropriate output in a finite number of steps. A simple parsing algorithm for context-free grammars, for instance, looks first for a rule of the form ``S`` |rarr| ``X``:subscript:`1` ... ``X``:subscript:`n`, and builds a partial tree structure. It then steps through the grammar rules one by one, looking for a rule of the form ``X``:subscript:`1` |rarr| ``Y``:subscript:`1` ... ``Y``:subscript:`j` that will expand the leftmost daughter introduced by the ``S`` rule, and further extends the partial tree. This process continues, for example by looking for a rule of the form ``Y``:subscript:`1` |rarr| ``Z``:subscript:`1` ... ``Z``:subscript:`k` and expanding the partial tree appropriately, until the leftmost node label in the partial tree is a lexical category; the parser then checks to see if the first word of the input can belong to that category. To illustrate, let's suppose that the first grammar rule chosen by the parser is ``S`` |rarr| ``NP VP`` and the second rule chosen is ``NP`` |rarr| ``Det N``; then the partial tree will be as follows:

.. ex::
   .. tree:: (S (NP (Det \ ) (N)) (VP))

If we assume that the input string we are trying to parse is `the cat slept`:lx:, we will succeed in identifying `the`:lx: as a word that can belong to the category `Det`:gc:. In this case, the parser goes on to the next node of the tree, `N`:gc:, and the next input word, `cat`:lx:. However, if we had built the same partial tree with the input string `did the cat sleep`:lx:, the parse would fail at this point, since `did`:lx: is not of category `Det`:gc:. The parser would throw away the structure built so far and look for an alternative way of going from the `S`:gc: node down to a leftmost lexical category (e.g., using a rule `S`:gc: |rarr| `V NP VP`:gc:). The important point for now is not the details of this or other parsing algorithms; we discuss this topic much more fully in the chapter on parsing. Rather, we just want to illustrate the idea that an algorithm can be broken down into a finite number of steps that produce a definite result at the end (a minimal implementation sketch of such a top-down strategy is given below, after the discussion of Figure sds_).

In Figure sds_ we further illustrate some of these points in the context of a spoken dialogue system, such as our earlier example of an application that offers the user information about movies currently on show.

.. _sds:
.. figure:: ../images/dialogue.png
   :scale: 32

   Simple Pipeline Architecture for a Spoken Dialogue System

Along the top of the diagram, moving from left to right, is a "pipeline" of some representative speech understanding `components`:dt:. These map from speech input *via* syntactic parsing to some kind of meaning representation. Along the middle, moving from right to left, is an inverse pipeline of components for concept-to-speech generation. These components constitute the dynamic or procedural aspect of the system's natural language processing. At the bottom of the diagram are some representative bodies of static information: the repositories of language-related data that are called upon by the processing components.

The diagram illustrates that linguistically-motivated ways of modularizing linguistic knowledge are often reflected in computational systems. That is, the various components are organized so that the data which they exchange corresponds roughly to different levels of representation. For example, the output of the speech analysis component will contain sequences of phonological representations of words, and the output of the parser will be a semantic representation. Of course the parallel is not precise, in part because it is often a matter of practical expedience where to place the boundaries between different processing components. For example, we can assume that within the parsing component there is a level of syntactic representation, although we have chosen not to expose this at the level of the system diagram.
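Returning to the parsing algorithm described above, the following is a minimal sketch in Python of a top-down, left-to-right recognizer for a toy context-free grammar and lexicon (restated here so that the sketch is self-contained). It is not NLTK's parser, and it ignores many practical issues such as ambiguity and left recursion; it is intended only to show declarative knowledge (the grammar and lexicon) being put to work by procedural knowledge (the recognition function)::

    # Declarative knowledge: toy grammar and lexicon (cf. the earlier sketch).
    GRAMMAR = {
        'S':  [['NP', 'VP']],
        'NP': [['Det', 'N']],
        'VP': [['V']],
    }
    LEXICON = {'the': 'Det', 'cat': 'N', 'slept': 'V', 'did': 'Aux', 'sleep': 'V'}

    def expand(symbol, words):
        """Try to match symbol against a prefix of words; return the list of
        remaining words on success, or None on failure."""
        if symbol in GRAMMAR:                     # nonterminal: try each rule
            for rhs in GRAMMAR[symbol]:
                remaining = words
                for daughter in rhs:              # expand daughters left to right
                    remaining = expand(daughter, remaining)
                    if remaining is None:
                        break                     # this rule failed; try the next
                else:
                    return remaining              # every daughter matched
            return None                           # no rule worked: backtrack
        else:                                     # lexical category: check next word
            if words and LEXICON.get(words[0]) == symbol:
                return words[1:]
            return None

    def recognize(sentence):
        remaining = expand('S', sentence.split())
        return remaining == []                    # succeed iff all words consumed

    print(recognize('the cat slept'))             # True
    print(recognize('did the cat sleep'))         # False: 'did' is not a Det

A real parser would also record the tree structure it builds and explore alternative rules more systematically; the chapter on parsing develops these ideas properly.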
Despite such idiosyncrasies of boundary-drawing, most |NLP| systems break down their work into a series of discrete steps. In the process of natural language understanding, these steps go from more concrete levels to more abstract ones, while in natural language production, the direction is reversed.

----------------------------
Before Proceeding Further...
----------------------------

An important aspect of learning |NLP| using these materials is to experience both the challenge and |mdash| we hope |mdash| the satisfaction of creating software to process natural language. The accompanying software, NLTK, is available for free and runs on most operating systems, including Linux/Unix, Mac OS X and Microsoft Windows. You can download NLTK from |NLTK-URL|, along with extensive documentation. We encourage you to install Python and NLTK on your machine before reading beyond the end of this chapter.

---------------
Further Reading
---------------

Several websites have useful information about |NLP|, including conferences, resources, and special-interest groups, e.g. ``www.lt-world.org``, ``www.aclweb.org``, ``www.elsnet.org``. The website of the *Association for Computational Linguistics*, at ``www.aclweb.org``, contains an overview of computational linguistics, including copies of introductory chapters from recent textbooks. Wikipedia has entries for |NLP| and its subfields (but don't confuse natural language processing with the other |NLP|\ : neuro-linguistic programming). The new, second edition of *Speech and Language Processing* is a more advanced textbook that builds on the material presented here. Three books provide comprehensive surveys of the field: [Cole97]_, [Dale00handbook]_, [Mitkov02handbook]_.

Several |NLP| systems have online interfaces that you might like to experiment with, e.g.:

* WordNet: ``http://wordnet.princeton.edu/``
* Translation: ``http://world.altavista.com/``
* ChatterBots: ``http://www.loebner.net/Prizef/loebner-prize.html``
* Question Answering: ``http://www.answerbus.com/``
* Summarization: ``http://newsblaster.cs.columbia.edu/``

The example dialogue was taken from Carpenter and Chu-Carroll's ACL-99 Tutorial on Spoken Dialogue Systems.

.. include:: footer.txt