Deleted stuff with no home at present
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Simple approach: "delete one". First define a generator function that
yields a sequence of strings, each one having a different character
deleted from the supplied form:

>>> def delete_one(word):
...     for i in range(len(word)):
...         yield word[:i] + word[i+1:]
Next construct an index over all these forms:

>>> idx = {}
>>> for lex in lexemes:
...     for s in delete_one(lex):
...         if s not in idx:
...             idx[s] = set()
...         idx[s].add(lex)
Now we can define a lookup function:

>>> def lookup(word):
...     candidates = set()
...     for s in delete_one(word):
...         if s in idx:
...             candidates.update(idx[s])
...     return candidates
Now we can test it out:
>>> lookup('kokopouto')
set(['kokopeoto', 'kokopuoto'])
>>> lookup('kokou')
set(['kokoa', 'kokeu', 'kokio', 'kooru', 'kokoi', 'kooku', 'kokoo'])
Note that this simple method only returns candidate forms having the same length as the input word.
#. Write a spelling correction function which, given a word of length
   ``i``, can return candidate corrections of length ``i-1``, ``i``,
   or ``i+1``.  (One possible approach is sketched below.)
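A minimal sketch of one possible approach, reusing the ``delete_one``
generator and the ``idx`` index from above; the names ``lexeme_set`` and
``lookup_nearby`` are introduced here purely for illustration and are
not part of NLTK:

>>> lexeme_set = set(lexemes)
>>> def lookup_nearby(word):
...     candidates = set()
...     if word in idx:
...         # length i+1: lexemes that yield the word when one character is deleted
...         candidates.update(idx[word])
...     for s in delete_one(word):
...         if s in lexeme_set:
...             # length i-1: deleting one character from the word gives a lexeme
...             candidates.add(s)
...         if s in idx:
...             # length i: lexemes sharing a delete-one form with the word
...             candidates.update(idx[s])
...     return candidates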
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
---------------
NLTK Interfaces
---------------
An *interface* gives a partial specification of the behavior of a
class, including specifications for methods that the class should
implement. For example, a "comparable" interface might specify that a
class must implement a comparison method. Interfaces do not give a
complete specification of a class; they only specify a minimum set of
methods and behaviors which should be implemented by the class. For
example, the ``TaggerI`` interface specifies that a tagger class must
implement a ``tag`` method which takes a ``string`` and returns a
tuple consisting of that string and its part-of-speech tag; but it
does not specify what other methods the class should implement (if
any).
.. note:: The notion of "interfaces" can be very useful in ensuring that
   different classes work together correctly. Although the concept of
   "interfaces" is supported in many languages, such as Java, there is no
   native support for interfaces in Python.
NLTK therefore implements interfaces using classes, all of whose
methods raise the ``NotImplementedError`` exception. To distinguish
interfaces from other classes, they are always named with a trailing
``I``. If a class implements an interface, then it should be a
subclass of the interface. For example, the ``Ngram`` tagger class
implements the ``TaggerI`` interface, and so it is a subclass of
``TaggerI``.
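As an illustration only (this is not the actual NLTK source, and the
class names ``MyTaggerI`` and ``ConstantTagger`` are invented here), an
interface class and an implementing subclass might look like this:

>>> class MyTaggerI(object):
...     """Interface sketch: every method simply raises NotImplementedError."""
...     def tag(self, token):
...         raise NotImplementedError()
>>> class ConstantTagger(MyTaggerI):
...     """Implements the interface by assigning the same tag to every token."""
...     def __init__(self, tag):
...         self._tag = tag
...     def tag(self, token):
...         return (token, self._tag)
>>> ConstantTagger('nn').tag('banana')
('banana', 'nn')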
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
..
A piece to be relocated
-----------------------
Many natural language expressions are ambiguous, and we need to draw
on other sources of information to aid interpretation. For instance,
our preferred interpretation of `fruit flies like a
banana`:lx: depends on the presence of contextual cues that cause
us to expect `flies`:lx: to be a noun or a verb. Before
we can even address such issues, we need to be able to represent the
required linguistic information. Here is a possible representation:
========= ========= ======== ===== ==========
``Fruit`` ``flies`` ``like`` ``a`` ``banana``
noun      verb      prep     det   noun
========= ========= ======== ===== ==========

========= ========= ======== ===== ==========
``Fruit`` ``flies`` ``like`` ``a`` ``banana``
noun      noun      verb     det   noun
========= ========= ======== ===== ==========
Most language processing systems must recognize and interpret the
linguistic structures that exist in a sequence of words. This task is
virtually impossible if all we know about each word is its text
representation. To determine whether a given string of words has the
structure of, say, a noun phrase, it is infeasible to check through a
(possibly infinite) list of all strings which can be classed as noun
phrases. Instead we want to be able to generalize over `classes`:dt: of
words. These word classes are commonly given labels such as
'determiner', 'adjective' and 'noun'. Conversely, to interpret words
we need to be able to discriminate between different usages, such as
``deal`` as a noun or a verb.
We earlier presented two interpretations of `Fruit flies like a
banana`:lx: as examples of how a string of word tokens can be augmented
with information about the word classes that the words belong to. In
effect, we carried out tagging for the string `fruit flies like a
banana`:lx:. However, tags are more usually attached inline to the text
they are associated with.
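For instance, using the word classes from the first table above, the
tags could be attached to their words with a separator such as a slash;
the tuple representation and the helper expression below are just for
illustration:

>>> tagged = [('fruit', 'noun'), ('flies', 'verb'), ('like', 'prep'),
...           ('a', 'det'), ('banana', 'noun')]
>>> ' '.join(word + '/' + tag for (word, tag) in tagged)
'fruit/noun flies/verb like/prep a/det banana/noun'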
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
.. ===================== UNUSED =====================
Some of the following material might belong in other chapters
Programming?
------------
The ``nltk_lite.corpora`` package provides ready access to several
corpora included with NLTK, along with built-in tokenizers. For
example, ``brown.raw()`` is an iterator over sentences from
the Brown Corpus. We use ``extract()`` to extract a sentence of
interest:
>>> from nltk_lite.corpora import brown, extract
>>> print extract(0, brown.raw('a'))
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']
Old intro material
------------------
How do we know that a piece of text is a *word*, and how do we represent
words and associated information in a machine? It might seem
needlessly picky to ask what a word is. Can't we just say that a word
is a string of characters which has whitespace before and after it?
However, it turns out that things are quite a bit more complex. To get
a flavour of the problems, consider the following text from the Wall
Street Journal::
Let's start with the string `aren't`:lx:. According to our naive
definition, it counts as only one word. But consider a situation where
we wanted to check whether all the words in our text occurred in a
dictionary, and our dictionary had entries for `are`:lx: and `not`:lx:,
but not for `aren't`:lx:. In this case, we would probably be happy to
say that `aren't`:lx: is a contraction of two distinct words.
.. We can make a similar point about `1992's`:lx:. We might want to run
   a small program over our text to extract all words which express
   dates. In this case, we would achieve more generality by first
   stripping off the possessive `'s`:lx:, except that in this case we
   would not expect to find `1992`:lx: in a dictionary.
If we take our naive definition of word literally (as we should, if we
are thinking of implementing it in code), then there are some other
minor but real problems. First, assuming our file consists of a
number of separate lines, as in the WSJ text, all the
words which come at the beginning of a line will fail to be preceded
by whitespace (unless we treat the newline character as
whitespace). Second, according to our criterion, punctuation symbols
will form part of words; that is, a string like `investors,`:lx: will
also count as a word, since there is no whitespace between
`investors`:lx: and the following comma. Consequently, we run the risk
of failing to recognise that `investors,`:lx: (with appended comma) is a
token of the same type as `investors`:lx: (without appended comma). More
importantly, we would like punctuation to be a "first-class citizen"
for tokenization and subsequent processing. For example, we might want
to implement a rule which says that a word followed by a period is
likely to be an abbreviation if the immediately following word has a
lowercase initial. However, to formulate such a rule, we must be able
to identify a period as a token in its own right.
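A minimal sketch of such a rule, assuming the text has already been
broken into a list of tokens with punctuation as separate tokens (the
function name ``maybe_abbreviation`` and the crude lowercase test are
illustrative only):

>>> def maybe_abbreviation(tokens, i):
...     """Guess whether the token at position i is an abbreviation:
...     it is followed by a period token and the next word starts lowercase."""
...     return (i + 2 < len(tokens)
...             and tokens[i + 1] == '.'
...             and tokens[i + 2][:1].islower())
>>> maybe_abbreviation(['etc', '.', 'and', 'so', 'on'], 0)
True
>>> maybe_abbreviation(['ended', '.', 'The', 'next'], 0)
False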
A slightly different challenge is raised by examples such as the
following (drawn from the MedLine corpus):
#. This is a alpha-galactosyl-1,4-beta-galactosyl-specific adhesin.
#. The corresponding free cortisol fractions in these sera were 4.53
   +/- 0.15% and 8.16 +/- 0.23%, respectively.
In these cases, we encounter terms which are unlikely to be found in
any general purpose English lexicon. Moreover, we will have no success
in trying to syntactically analyse these strings using a standard
grammar of English. Now for some applications, we would like to
"bundle up" expressions such as
`alpha-galactosyl-1,4-beta-galactosyl-specific adhesin`:lx: and `4.53
+/- 0.15%`:lx: so that they are presented as unanalysable atoms to the
parser. That is, we want to treat them as single "words" for the
purposes of subsequent processing. The upshot is that, even if we
confine our attention to English text, the question of what we treat
as a word may depend a great deal on what our purposes are.
Representing tokens
-------------------
When written language is stored in a computer file it is normally
represented as a sequence or *string* of characters. That is, in a
standard text file, individual words are strings, sentences are
strings, and indeed the whole text is one long string. The characters
in a string don't have to be just the ordinary alphanumerics; strings
can also include special characters which represent space, tab and
newline.
Most computational processing is performed above the level of
characters. In compiling a programming language, for example, the
compiler expects its input to be a sequence of tokens that it knows
how to deal with; for example, the classes of identifiers, string
constants and numerals. Analogously, a parser will expect its input
to be a sequence of word tokens rather than a sequence of individual
characters. At its simplest, then, tokenization of a text involves
searching for locations in the string of characters containing
whitespace (space, tab, or newline) or certain punctuation symbols,
and breaking the string into word tokens at these points. For
example, suppose we have a file containing the following two lines::
    The cat climbed
    the tree.

From the parser's point of view, this file is just a string of
characters::

    'The_cat_climbed\n_the_tree.'

Note that we use single quotes to delimit strings, "_" to represent
space and ``\n`` to represent newline.
As we just pointed out, to tokenize this text for consumption by the
parser, we need to explicitly indicate which substrings are words. One
convenient way to do this in Python is to split the string into a
*list* of words, where each word is a string, such as
`'dog'`:lx:. [#]_
In Python, lists are printed as a series of objects
(in this case, strings), surrounded by square brackets and separated
by commas:
>>> words = ['the', 'cat', 'climbed', 'the', 'tree']
>>> words
['the', 'cat', 'climbed', 'the', 'tree']
.. [#] We say "convenient" because Python makes it easy to iterate
   through a list, processing the items one by one.
Notice that we have introduced a new variable `words`:lx: which is bound
to the list, and that we entered the variable on a new line to check
its value.
To summarize, we have just illustrated how, at its simplest,
tokenization of a text can be carried out by converting the single
string representing the text into a list of strings, each of which
corresponds to a word.
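As a concrete illustration, Python's built-in ``split`` method breaks a
string at whitespace (note that the final period stays attached to
`tree`:lx:, the punctuation problem discussed earlier):

>>> 'The cat climbed\nthe tree.'.split()
['The', 'cat', 'climbed', 'the', 'tree.']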
Some of this could maybe be discussed in the programming chapter?
----------------------------------------------------------------
Many natural language processing tasks involve analyzing texts of
varying sizes, ranging from single sentences to very large corpora.
There are a number of ways to represent texts using NLTK. The
simplest is as a single string. These strings can be loaded directly
from files:
>>> text_str = open('corpus.txt').read()
>>> text_str
'Hello world. This is a test file.\n'
However, as noted above, it is usually preferable to represent a text
as a list of tokens. These lists are typically created using a
*tokenizer*, such as ``tokenize.whitespace``, which splits strings into
words at whitespaces:
>>> from nltk_lite import tokenize
>>> text = 'Hello world. This is a test string.'
>>> list(tokenize.whitespace(text))
['Hello', 'world.', 'This', 'is', 'a', 'test', 'string.']
.. Note:: By "whitespace", we mean not only interword space, but
   also tab and line-end.
Note that tokenization may normalize the text, mapping all words to lowercase,
expanding contractions, and possibly even stemming the words. An
example of stemming is shown below:
>>> text = 'stemming can be fun and exciting'
>>> tokens = tokenize.whitespace(text)
>>> porter = tokenize.PorterStemmer()
>>> for token in tokens:
...     print porter.stem(token),
stem can be fun and excit
Tokenization based on whitespace is too simplistic for most
applications; for instance, it fails to separate the last word of a
phrase or sentence from punctuation characters, such as comma, period,
exclamation mark and question mark. As its name suggests,
``tokenize.regexp`` employs a regular expression to determine how text
should be split up. This regular expression specifies the characters
that can be included in a valid word. To define a tokenizer that
includes punctuation as separate tokens, we could use:
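One pattern that would do this matches either a run of word characters
or a run of non-word, non-space characters. The sketch below uses the
standard ``re`` module to illustrate the effect; whether
``tokenize.regexp`` takes its pattern in exactly this form is an
assumption:

>>> import re
>>> text = 'Hello world. This is a test string.'
>>> re.findall(r'\w+|[^\w\s]+', text)
['Hello', 'world', '.', 'This', 'is', 'a', 'test', 'string', '.']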
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
------------
More Grammar
------------
In this final section, we return to the grammar of English. We
consider more syntactic phenomena that will require us to refine
the productions of our phrase structure grammar.
Lexical heads other than `V`:gc: can be subcategorized for particular
complements:
**Nouns**
#. The rumour *that Kim was bald* circulated widely.
#. \*The picture *that Kim was bald* circulated widely.
**Adjectives**
#. Lee was afraid *to leave*.
#. \*Lee was tall *to leave*.
It has also been suggested that 'ordinary' prepositions are
transitive, and that many so-called adverbs are in fact intransitive
prepositions. For example, `towards`:lx: requires an `NP`:gc: complement,
while `home`:lx: and `forwards`:lx: forbid them.
.. example:: Lee ran towards the house.
.. example:: \*Lee ran towards.
.. example:: Sammy walked home.
.. example:: \*Sammy walked home the house.
.. example:: Brent stepped one pace forwards.
.. example:: \*Brent stepped one pace forwards the house.
Adopting this approach, we can also analyse certain prepositions as
allowing `PP`:gc: complements:
.. example:: Kim ran away *from the house*.
.. example:: Lee jumped down *into the boat*.
In general, the lexical categories `V`:gc:, `N`:gc:, `A`:gc: and `P`:gc: are
taken to be the heads of the respective phrases `VP`:gc:, `NP`:gc:,
`AP`:gc: and `PP`:gc:. Abstracting over the identity of these phrases, we
can say that a lexical category `X`:gc: is the head of its immediate
`XP`:gc: phrase, and moreover that the complements `C`:subscript:`1`
... `C`:subscript:`n` of
`X`:gc: will occur as sisters of `X`:gc: within that `XP`:gc:. This is
illustrated in the following schema:
.. ex::
   .. tree:: (XP (X) (*C_1*) ... (*C_n*))
We have argued that lexical categories need to be subdivided into
subcategories to account for the fact that different lexical items
select different sequences of following complements. That is, it is a
distinguishing property of complements that they co-occur with some
lexical items but not others. By contrast, :dt:`modifiers` can
occur with pretty much any instance of the relevant lexical class. For
example, consider the temporal adverbial *last Thursday*:
.. example:: The woman gave the telescope to the dog last Thursday.
.. example:: The woman saw a man last Thursday.
.. example:: The dog barked last Thursday.
Moreover, modifiers are always optional, whereas complements are at
least sometimes obligatory. We can use the phrase structure
geometry to draw a structural distinction between complements, which
occur as sisters of the lexical head, versus modifiers, which occur as
sisters of the phrase which encloses the head:
.. ex::
   .. tree:: (XP (XP (X) (*C_1*) ... (*C_n*)) (*Mod*))
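For instance, the first of the example sentences above could be
analysed along the following lines (an informal sketch; the exact node
labels are illustrative only):

.. ex::
   .. tree:: (VP (VP (V gave) (NP the telescope) (PP to the dog)) (NP last Thursday))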
Exercises
---------
#. Pick some of the syntactic constructions described in any
introductory syntax text (e.g. Jurafsky and Martin, Chapter 9) and
create a set of 15 sentences. Five sentences should be unambiguous
(have a unique parse), five should be ambiguous, and a further five
should be ungrammatical.
a) Develop a small grammar, consisting of about ten syntactic
productions, to account for this data. Refine your set of sentences
as needed to test and demonstrate the grammar. Write a function
to demonstrate your grammar on three sentences: (i) a
sentence having exactly one parse; (ii) a sentence having more than
one parse; (iii) a sentence having no parses. Discuss your
observations using inline comments.
b) Create a list ``words`` of all the words in your lexicon, and use
``random.choice(words)`` to generate sequences of 5-10 randomly
selected words, as sketched below. Does this generate any grammatical sentences which
your grammar rejects, or any ungrammatical sentences which your
grammar accepts? Now use this information to help you improve your
grammar. Discuss your findings.
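For the random-generation step in part (b), a minimal sketch (the
variable names are placeholders and the word list is just an example):

>>> import random
>>> words = ['the', 'dog', 'saw', 'a', 'man', 'in', 'the', 'park', 'barked']
>>> length = random.randint(5, 10)
>>> sent = ' '.join(random.choice(words) for i in range(length))

Each such ``sent`` can then be handed to your parser to check whether
the grammar accepts it.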
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
NLTK consists of a set of Python *modules*, each of which defines
classes and functions related to a single data structure or task.
Before you can use a module, you must ``import`` its contents. The
simplest way to import the contents of a module is to use the ``from
module import *`` command. For example, to import the contents of the
``nltk_lite.utilities`` module, which is discussed in this chapter, type:
>>> from nltk_lite.utilities import *
>>>
A disadvantage of this style of import statement is that it does not
specify what objects are imported; and it is possible that some of the
imported objects will unintentionally cause conflicts. To avoid this
disadvantage, you can explicitly list the objects you wish to import.
For example, as we saw earlier, we can import the ``re_show`` function
from the ``nltk_lite.utilities`` module as follows:
>>> from nltk_lite.utilities import re_show
>>>
Another option is to import the module itself, rather than
its contents. Its contents can then be accessed
using *fully qualified* dotted names:
>>> from nltk_lite import utilities
>>> utilities.re_show('green', sent)
colorless {green} ideas sleep furiously
>>>
For more information about importing, see any Python textbook.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We can also access the tagged text using the ``brown.tagged()`` method:
>>> print extract(0, brown.tagged())
[('The', 'at'), ('Fulton', 'np-tl'), ('County', 'nn-tl'), ('Grand', 'jj-tl'),
('Jury', 'nn-tl'), ('said', 'vbd'), ('Friday', 'nr'), ('an', 'at'),
('investigation', 'nn'), ('of', 'in'), ("Atlanta's", 'np$'), ('recent', 'jj'),
('primary', 'nn'), ('election', 'nn'), ('produced', 'vbd'), ('``', '``'),
('no', 'at'), ('evidence', 'nn'), ("''", "''"), ('that', 'cs'),
('any', 'dti'), ('irregularities', 'nns'), ('took', 'vbd'), ('place', 'nn'),
('.', '.')]
>>>
NLTK includes a 10% fragment of the Wall Street Journal section
of the Penn Treebank. This can be accessed using ``treebank.raw()``
for the raw text, ``treebank.tagged()`` for the tagged text, and
``treebank.parsed()`` for the parsed text:
>>> from nltk_lite.corpora import treebank
>>> print extract(0, treebank.parsed())
(S:
(NP-SBJ:
(NP: (NNP: 'Pierre') (NNP: 'Vinken'))
(,: ',')
(ADJP: (NP: (CD: '61') (NNS: 'years')) (JJ: 'old'))
(,: ','))
(VP:
(MD: 'will')
(VP:
(VB: 'join')
(NP: (DT: 'the') (NN: 'board'))
(PP-CLR:
(IN: 'as')
(NP: (DT: 'a') (JJ: 'nonexecutive') (NN: 'director')))
(NP-TMP: (NNP: 'Nov.') (CD: '29'))))
(.: '.'))
>>>
NLTK contains some simple chatbots, which will try to talk
intelligently with you. You can access the famous Eliza
chatbot using ``from nltk_lite.chat import eliza``, then
run ``eliza.demo()``. The other chatbots are called
``iesha`` (teen anime talk),
``rude`` (insulting talk), and
``zen`` (gems of Zen wisdom),
and were contributed by other students who have used NLTK.
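For example (this session is interactive, so it is not run as a doctest):

.. doctest-ignore::
    >>> from nltk_lite.chat import eliza
    >>> eliza.demo()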
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Predicting the Next Word (Revisited)
------------------------------------
We first import the Genesis corpus and create an empty conditional
frequency distribution:

>>> from nltk_lite.corpora import genesis
>>> from nltk_lite.probability import ConditionalFreqDist
>>> cfdist = ConditionalFreqDist()
We then examine each token in the corpus, and increment the
appropriate sample's count. We use the variable ``prev`` to record
the previous word.
>>> prev = None
>>> for word in genesis.raw():
...     cfdist[prev].inc(word)
...     prev = word
.. Note:: Sometimes the context for an experiment is unavailable, or
   does not exist. For example, the first token in a text does not
   follow any word. In these cases, we must decide what context to
   use. For this example, we use ``None`` as the context for the
   first token. Another option would be to discard the first token.
Once we have constructed a conditional frequency distribution for the
training corpus, we can use it to find the most likely word for any
given context. For example, taking the word `living`:lx: as our context,
we can inspect all the words that occurred in that context.
>>> word = 'living'
>>> cfdist[word].samples()
['creature,', 'substance', 'soul.', 'thing', 'thing,', 'creature']
We can set up a simple loop to generate text: we set an initial
context, pick the most likely token in that context as our next
word, and then use that word as our new context:

>>> word = 'living'
>>> for i in range(20):
...     print word,
...     word = cfdist[word].max()
living creature that he said, I will not be a wife of the land
of the land of the land
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Pronunciation Dictionary
------------------------
Here we access the CMU Pronouncing Dictionary, and find all words whose
pronunciation ends with the phones ``N IH0 K S``:
.. doctest-ignore::
>>> from nltk_lite.corpora import cmudict
>>> from string import join
>>> for word, num, pron in cmudict.raw():
...     if pron[-4:] == ('N', 'IH0', 'K', 'S'):
...         print word.lower(),
atlantic's audiotronics avionics beatniks calisthenics centronics
chetniks clinic's clinics conics cynics diasonics dominic's
ebonics electronics electronics' endotronics endotronics' enix
environics ethnics eugenics fibronics flextronics harmonics
hispanics histrionics identics ionics kibbutzniks lasersonics
lumonics mannix mechanics mechanics' microelectronics minix minnix
mnemonics mnemonics molonicks mullenix mullenix mullinix mulnix
munich's nucleonics onyx panic's panics penix pennix personics
phenix philharmonic's phoenix phonics photronics pinnix
plantronics pyrotechnics refuseniks resnick's respironics sconnix
siliconix skolniks sonics sputniks technics tectonics tektronix
telectronics telephonics tonics unix vinick's vinnick's vitronics
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Looking at Timestamps
~~~~~~~~~~~~~~~~~~~~~
>>> fd = nltk.FreqDist()
>>> for entry in lexicon:
...     date = entry.findtext('dt')
...     if date:
...         (day, month, year) = date.split('/')
...         fd.inc((month, year))
>>> for time in fd.sorted():
...     print time[0], time[1], ':', fd[time]
Oct 2005 : 220
Nov 2005 : 133
Feb 2005 : 107
Jun 2005 : 97
Dec 2004 : 74
Aug 2005 : 33
Jan 2005 : 32
Sep 2005 : 27
Dec 2005 : 21
Feb 2007 : 16
Nov 2006 : 15
Sep 2006 : 13
Feb 2004 : 12
Apr 2006 : 11
Jul 2005 : 11
May 2005 : 10
Sep 2004 : 9
Mar 2005 : 8
Jan 2007 : 7
Dec 2006 : 6
Jul 2004 : 6
Jan 2006 : 4
Mar 2006 : 4
Oct 2006 : 3
Jul 2006 : 3
Apr 2007 : 2
Feb 2006 : 2
May 2004 : 1
Oct 2004 : 1
Nov 2004 : 1
To put these in time order, we need to set up a special comparison function.
Otherwise, if we just sort the months, we'll get them in alphabetical order.
>>> month_index = {
... "Jan" : 1, "Feb" : 2, "Mar" : 3, "Apr" : 4,
... "May" : 5, "Jun" : 6, "Jul" : 7, "Aug" : 8,
... "Sep" : 9, "Oct" : 10, "Nov" : 11, "Dec" : 12
... }
>>> def time_cmp(a, b):
...     a2 = a[1], month_index[a[0]]
...     b2 = b[1], month_index[b[0]]
...     return cmp(a2, b2)
The comparison function says that we compare two times of the
form ``('Mar', '2004')`` by reversing the order of the month and
year, and converting the month into a number to get ``('2004', 3)``,
then using Python's built-in ``cmp`` function to compare them.
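For instance, March 2005 sorts before January 2006, since the years are
compared first:

>>> time_cmp(('Mar', '2005'), ('Jan', '2006'))
-1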
Now we can get the times found in the Toolbox entries, sort them
according to our ``time_cmp`` comparison function, and then print them
in order. This time we print bars to indicate frequency:
>>> times = fd.samples()
>>> times.sort(cmp=time_cmp)
>>> for time in times:
...     print time[0], time[1], ':', '#' * (1 + fd[time]/10)
Feb 2004 : ##
May 2004 : #
Jul 2004 : #
Sep 2004 : #
Oct 2004 : #
Nov 2004 : #
Dec 2004 : ########
Jan 2005 : ####
Feb 2005 : ###########
Mar 2005 : #
May 2005 : ##
Jun 2005 : ##########
Jul 2005 : ##
Aug 2005 : ####
Sep 2005 : ###
Oct 2005 : #######################
Nov 2005 : ##############
Dec 2005 : ###
Jan 2006 : #
Feb 2006 : #
Mar 2006 : #
Apr 2006 : ##
Jul 2006 : #
Sep 2006 : ##
Oct 2006 : #
Nov 2006 : ##
Dec 2006 : #
Jan 2007 : #
Feb 2007 : ##
Apr 2007 : #
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Next, let's consider some methods for discovering patterns in the lexical
entries. Which fields are the most frequent?
>>> fd = nltk.FreqDist(field.tag for entry in lexicon for field in entry)
>>> fd.sorted()[:10]
['xp', 'ex', 'xe', 'ge', 'tkp', 'lx', 'dt', 'ps', 'pt', 'rt']
Which sequences of fields are the most frequent?
>>> fd = nltk.FreqDist(':'.join(field.tag for field in entry) for entry in lexicon)
>>> top_ten = fd.sorted()[:10]
>>> print '\n'.join(top_ten)
lx:ps:pt:ge:tkp:dt:ex:xp:xe
lx:rt:ps:pt:ge:tkp:dt:ex:xp:xe
lx:rt:ps:pt:ge:tkp:dt:ex:xp:xe:ex:xp:xe
lx:ps:pt:ge:tkp:nt:dt:ex:xp:xe
lx:ps:pt:ge:tkp:nt:dt:ex:xp:xe:ex:xp:xe
lx:ps:pt:ge:tkp:dt:ex:xp:xe:ex:xp:xe
lx:rt:ps:pt:ge:ge:tkp:dt:ex:xp:xe:ex:xp:xe
lx:rt:ps:pt:ge:tkp:dt:ex:xp:xe:ex:xp:xe:ex:xp:xe
lx:ps:pt:ge:tkp:nt:sf:dt:ex:xp:xe
lx:ps:pt:ge:ge:tkp:dt:ex:xp:xe
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-------------------------------
NLTK in the Academic Literature
-------------------------------
|NLTK| has been presented at several international
conferences with published proceedings, from 2002 to the
present, as listed below:
Edward Loper and Steven Bird (2002).
NLTK: The Natural Language Toolkit,
*Proceedings of the ACL Workshop on Effective Tools and
Methodologies for Teaching Natural Language Processing and Computational
Linguistics*,
Somerset, NJ: Association for Computational Linguistics,
pp. 62-69, http://arXiv.org/abs/cs/0205028
Steven Bird and Edward Loper (2004).
NLTK: The Natural Language Toolkit,
*Proceedings of the ACL demonstration session*, pp. 214-217.
http://eprints.unimelb.edu.au/archive/00001448/
Edward Loper (2004).
NLTK: Building a Pedagogical Toolkit in Python,
*PyCon DC 2004*
Python Software Foundation,
http://www.python.org/pycon/dc2004/papers/
Steven Bird (2005).
NLTK-Lite: Efficient Scripting for Natural Language Processing,
*4th International Conference on Natural Language Processing*, pp. 1-8.
http://eprints.unimelb.edu.au/archive/00001453/
Steven Bird (2006).
NLTK: The Natural Language Toolkit,
*Proceedings of the ACL demonstration session*
http://www.ldc.upenn.edu/sb/home/papers/nltk-demo-06.pdf
Ewan Klein (2006).
Computational Semantics in the Natural Language Toolkit,
*Australian Language Technology Workshop*.
http://www.alta.asn.au/events/altw2006/proceedings/Klein.pdf
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|nopar|
**Further Reading:**
The Association for Computational Linguistics (ACL)
The ACL is the foremost professional body in |NLP|.
Its journal and conference proceedings,
approximately 10,000 articles, are available online with a full-text
search interface, via ``_.
Linguistic Terminology
A comprehensive glossary of linguistic terminology is available at:
``_.
*Language Files*
*Materials for an Introduction to Language and
Linguistics (Ninth Edition)*, The Ohio State University Department of
Linguistics. For more information, see
``_.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%