.. -*- mode: rst -*-

.. include:: ../definitions.rst

.. TODO: reconsider location of section on generative grammar

.. _chap-introduction:

==============================================
1. Introduction to Natural Language Processing
==============================================

------------------------
Language and Computation
------------------------

Language and Symbol Processing
------------------------------

The very notion that natural language could be treated in a computational manner grew out of a research program, dating back to the early 1900s, to reconstruct mathematical reasoning using logic, most clearly manifested in work by Frege, Russell, Wittgenstein, Tarski, Lambek and Carnap. This work led to the notion of language as a formal system amenable to automatic processing. Three later developments laid the foundation for natural language processing.

The first was `formal language theory`:dt:. This defined a language as a set of strings accepted by a class of automata (for example, the context-free languages, which are accepted by pushdown automata), and provided the underpinnings for computational syntax.

The second development was `symbolic logic`:dt:. This provided a formal method for capturing selected aspects of natural language that are relevant for expressing logical proofs. A formal calculus in symbolic logic provides the syntax of a language, together with rules of inference and, possibly, rules of interpretation in a set-theoretic model; examples are propositional logic and First Order Logic. Given such a calculus, with a well-defined syntax and semantics, it becomes possible to associate meanings with expressions of natural language by translating them into expressions of the formal calculus. For example, if we translate `John saw Mary`:lx: into a formula ``saw(j,m)``, we (implicitly or explicitly) interpret the English verb `saw`:lx: as a binary relation, and `John`:lx: and `Mary`:lx: as denoting individuals. More general statements like `All birds fly`:lx: require quantifiers, in this case |forall|, meaning `for all`:lx:: |forall|\ `x (bird(x)`:math: |rarr| `fly(x))`:math:. This use of logic provided the technical machinery to perform inferences that are an important part of language understanding.

A closely related development was the `principle of compositionality`:dt:, namely that the meaning of a complex expression is composed from the meaning of its parts and their mode of combination. This principle provided a useful correspondence between syntax and semantics, namely that the meaning of a complex expression could be computed recursively. Consider the sentence `It is not true that`:lx: `p`:math:, where `p`:math: is a proposition. We can represent the meaning of this sentence as `not(p)`:math:. Similarly, we can represent the meaning of `John saw Mary`:lx: as `saw(j, m)`:math:. Now we can compute the interpretation of `It is not true that John saw Mary`:lx: recursively, using the above information, to get `not(saw(j,m))`:math:. (We give a small programmatic sketch of this recursive computation below.)

The approaches just outlined share the premise that computing with natural language crucially relies on rules for manipulating symbolic representations. For a certain period in the development of |NLP|, particularly during the 1980s, this premise provided a common starting point for both linguists and practitioners of |NLP|, leading to a family of grammar formalisms known as unification-based (or feature-based) grammar, and to |NLP| applications implemented in the Prolog programming language.
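To make the principle of compositionality a little more concrete, here is a minimal sketch in Python (not part of NLTK) that builds the representation ``not(saw(j,m))`` from the meanings of its parts. Representing formulas as plain strings, and the particular function names used, are illustrative assumptions chosen only to mirror the example above::

    # A toy illustration of the principle of compositionality: the meaning of
    # a complex expression is computed recursively from the meanings of its
    # parts and their mode of combination.  (Illustrative sketch only.)

    def saw(subject, obj):
        """Meaning of '<subject> saw <object>': a binary relation."""
        return "saw(%s,%s)" % (subject, obj)

    def negate(proposition):
        """Meaning of 'It is not true that p'."""
        return "not(%s)" % proposition

    # 'John' and 'Mary' are interpreted as denoting individuals.
    j, m = "j", "m"

    # Compose the parts recursively:
    p = saw(j, m)        # 'John saw Mary'                      ->  saw(j,m)
    print(negate(p))     # 'It is not true that John saw Mary'  ->  not(saw(j,m))

The point is not the code itself but the shape of the computation: the meaning of the whole is obtained by applying a combination operation to the meanings of the parts.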
Although grammar-based |NLP| is still a significant area of research, it has become somewhat eclipsed in the last 15\ |ndash|\ 20 years due to a variety of factors. One significant influence came from automatic speech recognition. Although early work in speech processing adopted a model that emulated the kind of rule-based phonological `phonology`:topic: processing typified by the *Sound Pattern of English* [ChomskyHalle68]_, this turned out to be hopelessly inadequate in dealing with the hard problem of recognizing actual speech in anything like real time. By contrast, systems which involved learning patterns from large bodies of speech data were significantly more accurate, efficient and robust. In addition, the speech community found that progress in building better systems was hugely assisted by the construction of shared resources for quantitatively measuring performance against common test data. Eventually, much of the |NLP| community embraced a `data intensive`:dt: orientation to language processing, coupled with a growing use of machine-learning techniques and evaluation-led methodology.

Philosophical Divides
---------------------

The contrasting approaches to |NLP| described in the preceding section relate back to early metaphysical debates about `rationalism`:dt: versus `empiricism`:dt: and `realism`:dt: versus `idealism`:dt: that occurred in the Enlightenment period of Western philosophy. These debates took place against a backdrop of orthodox thinking in which the source of all knowledge was believed to be divine revelation. During this period of the seventeenth and eighteenth centuries, philosophers argued that human reason or sensory experience has priority over revelation. Descartes and Leibniz, amongst others, took the rationalist position, asserting that all truth has its origins in human thought, and in the existence of "innate ideas" implanted in our minds from birth. For example, they argued that the principles of Euclidean geometry were developed using human reason, and were not the result of supernatural revelation or sensory experience. In contrast, Locke and others took the empiricist view, that our primary source of knowledge is the experience of our faculties, and that human reason plays a secondary role in reflecting on that experience. Prototypical evidence for this position was Galileo's discovery |mdash| based on careful observation of the motion of the planets |mdash| that the solar system is heliocentric and not geocentric. In the context of linguistics, this debate leads to the following question: to what extent does human linguistic experience, versus our innate "language faculty", provide the basis for our knowledge of language? In |NLP| this matter surfaces as differences in the priority of corpus data versus linguistic introspection in the construction of computational models. We will return to this issue later in the book.

A further concern, enshrined in the debate between realism and idealism, was the metaphysical status of the constructs of a theory. Kant argued for a distinction between phenomena, the manifestations we can experience, and "things in themselves" which can never be known directly. A linguistic realist would take a theoretical construct like `noun phrase`:dt: to be a real-world entity that exists independently of human perception and reason, and which actually *causes* the observed linguistic phenomena.
A linguistic idealist, on the other hand, would argue that noun phrases, along with more abstract constructs like semantic representations, are intrinsically unobservable, and simply play the role of useful fictions. The way linguists write about theories often betrays a realist position, while |NLP| practitioners occupy neutral territory or else lean towards the idealist position. Thus, in |NLP|, it is often enough if a theoretical abstraction leads to a useful result; it does not matter whether this result sheds any light on human linguistic processing.

These issues are still alive today, and show up in the distinctions between symbolic vs statistical methods, deep vs shallow processing, binary vs gradient classifications, and scientific vs engineering goals. However, such contrasts are now highly nuanced, and the debate is no longer as polarized as it once was. In fact, most of the discussions |mdash| and most of the advances even |mdash| involve a "balancing act". For example, one intermediate position is to assume that humans are innately endowed with analogical and memory-based learning methods (weak rationalism), and to use these methods to identify meaningful patterns in their sensory language experience (empiricism).

For a more concrete illustration, consider the way in which statistics from large corpora may serve as evidence for binary choices in a symbolic grammar. For instance, dictionaries describe the words `absolutely`:lx: and `definitely`:lx: as nearly synonymous, yet their patterns of usage are quite distinct when combined with a following verb, as shown in Table absolutely_.

.. table:: absolutely

   +------------------+-------------+------------+------------+--------------+
   | Google hits      | `adore`:lx: | `love`:lx: | `like`:lx: | `prefer`:lx: |
   +------------------+-------------+------------+------------+--------------+
   | `absolutely`:lx: |     289,000 |    905,000 |     16,200 |          644 |
   +------------------+-------------+------------+------------+--------------+
   | `definitely`:lx: |       1,460 |     51,000 |    158,000 |       62,600 |
   +------------------+-------------+------------+------------+--------------+
   | ratio            |       198:1 |       18:1 |       1:10 |         1:97 |
   +------------------+-------------+------------+------------+--------------+

   `Absolutely`:lx: vs `Definitely`:lx: (Liberman 2005, LanguageLog.org)

.. http://itre.cis.upenn.edu/~myl/languagelog/archives/002022.html

As you will see, `absolutely adore`:lx: is about 200 times as popular as `definitely adore`:lx:, while `absolutely prefer`:lx: is about 100 times rarer than `definitely prefer`:lx:. This information is used by statistical language models, but it also counts as evidence for a symbolic account of word combination in which `absolutely`:lx: can only modify extreme actions or attributes, a property that could be represented as a binary-valued feature of certain lexical items. Thus, we see statistical data informing symbolic models. Once this information has been codified symbolically, it is available to be exploited as a contextual feature for statistical language modeling, alongside many other rich sources of symbolic information, like hand-constructed parse trees and semantic representations. Now the circle is closed, and we see symbolic information informing statistical models. This new rapprochement is giving rise to many exciting new developments. We will touch on some of these in the ensuing pages. We too will perform this balancing act, employing approaches to |NLP| that integrate these historically-opposed philosophies and methodologies.
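Before moving on, here is a small sketch in Python of how the raw Google counts in Table absolutely_ could be turned into the ratios just discussed, and then into a tentative binary lexical feature. The threshold and the feature name are illustrative assumptions, not part of any established analysis::

    # Raw Google hit counts from Table absolutely (Liberman 2005).
    counts = {
        ('absolutely', 'adore'):  289000, ('definitely', 'adore'):    1460,
        ('absolutely', 'love'):   905000, ('definitely', 'love'):    51000,
        ('absolutely', 'like'):    16200, ('definitely', 'like'):   158000,
        ('absolutely', 'prefer'):    644, ('definitely', 'prefer'):  62600,
    }

    for verb in ['adore', 'love', 'like', 'prefer']:
        ratio = counts[('absolutely', verb)] / float(counts[('definitely', verb)])
        # A crude, illustrative decision: treat a strong preference for
        # 'absolutely' as corpus evidence that the verb denotes an extreme
        # action or attribute -- a candidate binary-valued lexical feature.
        extreme = ratio > 10
        print('%-8s ratio=%8.2f  EXTREME=%s' % (verb, ratio, extreme))

For `adore`:lx: the computed ratio is roughly 198, and for `prefer`:lx: roughly 0.01, matching the 198:1 and 1:97 entries in the table.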
------------------------------------------------
The Architecture of Linguistic and |NLP| Systems
------------------------------------------------

.. add this to a chapter on grammar or semantics?

Generative Grammar and Modularity
---------------------------------

One of the intellectual descendants of formal language theory was the linguistic framework known as `generative grammar`:dt:. Such a grammar contains a set of rules that recursively specify (or `generate`) the set of well-formed strings in a language. While there is a wide spectrum of models that owe some allegiance to this core, Chomsky's transformational grammar, in its various incarnations, is probably the best known. In the Chomskyan tradition, it is claimed that humans have distinct kinds of linguistic knowledge, organized into different modules: for example, knowledge of a language's sound structure (`phonology`:dt:), knowledge of word structure (`morphology`:dt:), knowledge of phrase structure (`syntax`:dt:), and knowledge of meaning (`semantics`:dt:). In a formal linguistic theory, each kind of linguistic knowledge is made explicit as a different `module`:dt: of the theory, consisting of a collection of basic elements together with a way of combining them into complex structures. For example, a phonological module might provide a set of phonemes together with an operation for concatenating phonemes into phonological strings. Similarly, a syntactic module might provide labeled nodes as primitives together with a mechanism for assembling them into trees. A set of linguistic primitives, together with some operators for defining complex elements, is often called a `level of representation`:dt:.

As well as defining modules, a generative grammar will prescribe how the modules interact. For example, well-formed phonological strings will provide the phonological content of words, and words will provide the terminal elements of syntax trees. Well-formed syntactic trees will be mapped to semantic representations, and contextual or pragmatic information will ground these semantic representations in some real-world situation.

As we indicated above, an important aspect of theories of generative grammar is that they are intended to model the linguistic knowledge of speakers and hearers; they are not intended to explain how humans actually process linguistic information. This is, in part, reflected in the claim that a generative grammar encodes the `competence`:dt: of an idealized native speaker, rather than the speaker's `performance`:dt:. A closely related distinction is to say that a generative grammar encodes `declarative`:dt: rather than `procedural`:dt: knowledge. Declarative knowledge can be glossed as "knowing what", whereas procedural knowledge is "knowing how". As you might expect, computational linguistics has the crucial role of proposing procedural models of language. A central example is parsing, where we have to develop computational mechanisms that convert strings of words into structural representations such as syntax trees. Nevertheless, it is widely accepted that well-engineered computational models of language contain both declarative and procedural aspects. Thus, a full account of parsing will say how declarative knowledge in the form of a grammar and lexicon combines with procedural knowledge that determines how a syntactic analysis should be assigned to a given string of words.
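To make the declarative side of this division concrete, here is a small sketch, in plain Python data structures rather than any particular grammar formalism, of what a toy grammar and lexicon might look like when written down as "knowing what". The category labels and words are illustrative, chosen to match the parsing example developed below::

    # Declarative knowledge: a toy context-free grammar and lexicon written
    # down as plain data.  Nothing here says *how* a parse is to be computed.
    GRAMMAR = {
        'S':  [['NP', 'VP']],   # S  -> NP VP
        'NP': [['Det', 'N']],   # NP -> Det N
        'VP': [['V']],          # VP -> V
    }

    LEXICON = {
        'the':   'Det',
        'cat':   'N',
        'slept': 'V',
    }

The procedural knowledge that puts data like this to work is the subject of the next few paragraphs.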
This procedural knowledge will be expressed as an `algorithm`:dt:\ : that is, an explicit recipe for mapping some input into an appropriate output in a finite number of steps. A simple parsing algorithm for context-free grammars, for instance, looks first for a rule of the form ``S`` |rarr| ``X``:subscript:`1` ... ``X``:subscript:`n`, and builds a partial tree structure. It then steps through the grammar rules one by one, looking for a rule of the form ``X``:subscript:`1` |rarr| ``Y``:subscript:`1` ... ``Y``:subscript:`j` that will expand the leftmost daughter introduced by the ``S`` rule, and further extends the partial tree. This process continues, for example by looking for a rule of the form ``Y``:subscript:`1` |rarr| ``Z``:subscript:`1` ... ``Z``:subscript:`k` and expanding the partial tree appropriately, until the leftmost node label in the partial tree is a lexical category; the parser then checks to see if the first word of the input can belong to that category. To illustrate, let's suppose that the first grammar rule chosen by the parser is ``S`` |rarr| ``NP VP`` and the second rule chosen is ``NP`` |rarr| ``Det N``; then the partial tree will be as follows:

.. ex::
   .. tree:: (S (NP (Det \ ) (N)) (VP))

If we assume that the input string we are trying to parse is `the cat slept`:lx:, we will succeed in identifying `the`:lx: as a word that can belong to the category `Det`:gc:. In this case, the parser goes on to the next node of the tree, `N`:gc:, and the next input word, `cat`:lx:. However, if we had built the same partial tree with the input string `did the cat sleep`:lx:, the parse would fail at this point, since `did`:lx: is not of category `Det`:gc:. The parser would throw away the structure built so far and look for an alternative way of going from the `S`:gc: node down to a leftmost lexical category (e.g., using a rule `S`:gc: |rarr| `V NP VP`:gc:). The important point for now is not the details of this or other parsing algorithms; we discuss this topic much more fully in the chapter on parsing. Rather, we just want to illustrate the idea that an algorithm can be broken down into a finite number of steps that produce a definite result at the end (a minimal implementation sketch of such a top-down strategy is given below, after the discussion of Figure sds_).

In Figure sds_ we further illustrate some of these points in the context of a spoken dialogue system, such as our earlier example of an application that offers the user information about movies currently on show.

.. _sds:
.. figure:: ../images/dialogue.png
   :scale: 32

   Simple Pipeline Architecture for a Spoken Dialogue System

Along the top of the diagram, moving from left to right, is a "pipeline" of some representative speech understanding `components`:dt:. These map from speech input *via* syntactic parsing to some kind of meaning representation. Along the middle, moving from right to left, is an inverse pipeline of components for concept-to-speech generation. These components constitute the dynamic or procedural aspect of the system's natural language processing. At the bottom of the diagram are some representative bodies of static information: the repositories of language-related data that are called upon by the processing components.

The diagram illustrates that linguistically-motivated ways of modularizing linguistic knowledge are often reflected in computational systems. That is, the various components are organized so that the data which they exchange corresponds roughly to different levels of representation. For example, the output of the speech analysis component will contain sequences of phonological representations of words, and the output of the parser will be a semantic representation. Of course the parallel is not precise, in part because it is often a matter of practical expedience where to place the boundaries between different processing components. For example, we can assume that within the parsing component there is a level of syntactic representation, although we have chosen not to expose this at the level of the system diagram.
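Returning to the parsing algorithm described above, the following is a minimal sketch in Python of a top-down, left-to-right recognizer for a toy context-free grammar and lexicon (restated here so that the sketch is self-contained). It is not NLTK's parser, and it ignores many practical issues such as ambiguity and left recursion; it is intended only to show declarative knowledge (the grammar and lexicon) being put to work by procedural knowledge (the recognition function)::

    # Declarative knowledge: toy grammar and lexicon (cf. the earlier sketch).
    GRAMMAR = {
        'S':  [['NP', 'VP']],
        'NP': [['Det', 'N']],
        'VP': [['V']],
    }
    LEXICON = {'the': 'Det', 'cat': 'N', 'slept': 'V', 'did': 'Aux', 'sleep': 'V'}

    def expand(symbol, words):
        """Try to match symbol against a prefix of words; return the list of
        remaining words on success, or None on failure."""
        if symbol in GRAMMAR:                     # nonterminal: try each rule
            for rhs in GRAMMAR[symbol]:
                remaining = words
                for daughter in rhs:              # expand daughters left to right
                    remaining = expand(daughter, remaining)
                    if remaining is None:
                        break                     # this rule failed; try the next
                else:
                    return remaining              # every daughter matched
            return None                           # no rule worked: backtrack
        else:                                     # lexical category: check next word
            if words and LEXICON.get(words[0]) == symbol:
                return words[1:]
            return None

    def recognize(sentence):
        remaining = expand('S', sentence.split())
        return remaining == []                    # succeed iff all words consumed

    print(recognize('the cat slept'))             # True
    print(recognize('did the cat sleep'))         # False: 'did' is not a Det

A real parser would also record the tree structure it builds and explore alternative rules more systematically; the chapter on parsing develops these ideas properly.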
Despite such idiosyncrasies of boundary-drawing, most |NLP| systems break down their work into a series of discrete steps. In the process of natural language understanding, these steps go from more concrete levels to more abstract ones, while in natural language production, the direction is reversed.

----------------------------
Before Proceeding Further...
----------------------------

An important aspect of learning |NLP| using these materials is to experience both the challenge and |mdash| we hope |mdash| the satisfaction of creating software to process natural language. The accompanying software, NLTK, is available for free and runs on most operating systems, including Linux/Unix, Mac OS X and Microsoft Windows. You can download NLTK from |NLTK-URL|, along with extensive documentation. We encourage you to install Python and NLTK on your machine before reading beyond the end of this chapter.

---------------
Further Reading
---------------

Several websites have useful information about |NLP|, including conferences, resources, and special-interest groups, e.g. ``www.lt-world.org``, ``www.aclweb.org``, ``www.elsnet.org``. The website of the *Association for Computational Linguistics*, at ``www.aclweb.org``, contains an overview of computational linguistics, including copies of introductory chapters from recent textbooks. Wikipedia has entries for |NLP| and its subfields (but don't confuse natural language processing with the other |NLP|\ : neuro-linguistic programming). The new, second edition of *Speech and Language Processing* is a more advanced textbook that builds on the material presented here. Three books provide comprehensive surveys of the field: [Cole97]_, [Dale00handbook]_, [Mitkov02handbook]_.

Several |NLP| systems have online interfaces that you might like to experiment with, e.g.:

* WordNet: ``http://wordnet.princeton.edu/``
* Translation: ``http://world.altavista.com/``
* ChatterBots: ``http://www.loebner.net/Prizef/loebner-prize.html``
* Question Answering: ``http://www.answerbus.com/``
* Summarization: ``http://newsblaster.cs.columbia.edu/``

The example dialogue was taken from Carpenter and Chu-Carroll's ACL-99 Tutorial on Spoken Dialogue Systems.

.. include:: footer.txt