Deleted stuff with no home at present

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Simple approach: "delete one".  First define a function that returns a
list of strings, each one having a different character deleted from the
supplied form:

>>> def delete_one(word):
...     for i in range(len(word)):
...         yield word[:i]+word[i+1:]

Next construct an index over all these forms:

>>> idx = {}
>>> for lex in lexemes:
...     for s in delete_one(lex):
...         if s not in idx:
...             idx[s] = set()
...         idx[s].add(lex)

Now we can define a lookup function:

>>> def lookup(word):
...     candidates = set()
...     for s in delete_one(word):
...         if s in idx:
...             candidates.update(idx[s])
...     return candidates

Now we can test it out:

>>> lookup('kokopouto')
set(['kokopeoto', 'kokopuoto'])
>>> lookup('kokou')
set(['kokoa', 'kokeu', 'kokio', 'kooru', 'kokoi', 'kooku', 'kokoo'])

Note that this simple method only returns forms of the same length.

#. Write a spelling correction function which, given a word of length
   ``i``, can return candidate corrections of length ``i-1``, ``i``,
   or ``i+1``.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

---------------
NLTK Interfaces
---------------

An *interface* gives a partial specification of the behavior of a
class, including specifications for methods that the class should
implement.  For example, a "comparable" interface might specify that a
class must implement a comparison method.  Interfaces do not give a
complete specification of a class; they only specify a minimum set of
methods and behaviors which should be implemented by the class.  For
example, the ``TaggerI`` interface specifies that a tagger class must
implement a ``tag`` method, which takes a ``string``, and returns a
tuple, consisting of that string and its part-of-speech tag; but it
does not specify what other methods the class should implement (if
any).

.. note:: The notion of "interfaces" can be very useful in ensuring
   that different classes work together correctly.

Although the concept of "interfaces" is supported in many languages,
such as Java, there is no native support for interfaces in Python.
NLTK therefore implements interfaces using classes, all of whose
methods raise the ``NotImplementedError`` exception.  To distinguish
interfaces from other classes, they are always named with a trailing
``I``.  If a class implements an interface, then it should be a
subclass of the interface.  For example, the ``Ngram`` tagger class
implements the ``TaggerI`` interface, and so it is a subclass of
``TaggerI``.

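
The pattern can be sketched in a few lines.  This is an illustrative
sketch only: the names ``DemoTaggerI`` and ``DefaultTagger`` are
invented here, and NLTK's real ``TaggerI`` is richer than this.

>>> class DemoTaggerI(object):
...     """Interface: implementations must provide tag()."""
...     def tag(self, token):
...         raise NotImplementedError()
>>> class DefaultTagger(DemoTaggerI):
...     """Toy implementation: assigns the same tag to every token."""
...     def __init__(self, tag):
...         self._tag = tag
...     def tag(self, token):
...         return (token, self._tag)
>>> DefaultTagger('nn').tag('banana')
('banana', 'nn')
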

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

.. A piece to be relocated
   -----------------------

Many natural language expressions are ambiguous, and we need to draw
on other sources of information to aid interpretation.  For instance,
our preferred interpretation of `fruit flies like a banana`:lx:
depends on the presence of contextual cues that cause us to expect
`flies`:lx: to be a noun or a verb.  Before we can even address such
issues, we need to be able to represent the required linguistic
information.  Here is a possible representation:

=========  =========  ========  =====  ==========
``Fruit``  ``flies``  ``like``  ``a``  ``banana``
noun       verb       prep      det    noun
=========  =========  ========  =====  ==========

=========  =========  ========  =====  ==========
``Fruit``  ``flies``  ``like``  ``a``  ``banana``
noun       noun       verb      det    noun
=========  =========  ========  =====  ==========

Most language processing systems must recognize and interpret the
linguistic structures that exist in a sequence of words.  This task is
virtually impossible if all we know about each word is its text
representation.  To determine whether a given string of words has the
structure of, say, a noun phrase, it is infeasible to check through a
(possibly infinite) list of all strings which can be classed as noun
phrases.  Instead we want to be able to generalize over `classes`:dt:
of words.  These word classes are commonly given labels such as
'determiner', 'adjective' and 'noun'.  Conversely, to interpret words
we need to be able to discriminate between different usages, such as
``deal`` as a noun or a verb.

We earlier presented two interpretations of `Fruit flies like a
banana`:lx: as examples of how a string of word tokens can be
augmented with information about the word classes that the words
belong to.  In effect, we carried out tagging for the string `fruit
flies like a banana`:lx:.  However, tags are more usually attached
inline to the text they are associated with.

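
For instance, the second interpretation above can be represented in
Python as a list of (token, tag) pairs, from which an inline
``word/tag`` rendering is easily produced.  This is just an
illustrative sketch, using the informal tag labels from the tables
above:

>>> tagged = [('Fruit', 'noun'), ('flies', 'noun'), ('like', 'verb'),
...           ('a', 'det'), ('banana', 'noun')]
>>> ['%s/%s' % (token, tag) for (token, tag) in tagged]
['Fruit/noun', 'flies/noun', 'like/verb', 'a/det', 'banana/noun']
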

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

.. =====================
   UNUSED
   =====================

   Some of the following material might belong in other chapters

Programming?
------------

The ``nltk_lite.corpora`` package provides ready access to several
corpora included with NLTK, along with built-in tokenizers.  For
example, ``brown.raw()`` is an iterator over sentences from the Brown
Corpus.  We use ``extract()`` to extract a sentence of interest:

>>> from nltk_lite.corpora import brown, extract
>>> print extract(0, brown.raw('a'))
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an',
'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election',
'produced', '``', 'no', 'evidence', "''", 'that', 'any',
'irregularities', 'took', 'place', '.']

Old intro material
------------------

How do we know that a piece of text is a *word*, and how do we
represent words and associated information in a machine?  It might
seem needlessly picky to ask what a word is.  Can't we just say that a
word is a string of characters which has white space before and after
it?  However, it turns out that things are quite a bit more complex.
To get a flavour of the problems, consider the following text from the
Wall Street Journal::

Let's start with the string `aren't`:lx:.  According to our naive
definition, it counts as only one word.  But consider a situation
where we wanted to check whether all the words in our text occurred in
a dictionary, and our dictionary had entries for `are`:lx: and
`not`:lx:, but not for `aren't`:lx:.  In this case, we would probably
be happy to say that `aren't`:lx: is a contraction of two distinct
words.

.. We can make a similar point about `1992's`:lx:.  We might want to
   run a small program over our text to extract all words which
   express dates.  In this case, we would achieve more generality by
   first stripping off the possessive ending, except that we would not
   expect to find `1992`:lx: in a dictionary.

If we take our naive definition of word literally (as we should, if we
are thinking of implementing it in code), then there are some other
minor but real problems.  For example, assuming our file consists of a
number of separate lines, as in the WSJ text, then all the words which
come at the beginning of a line will fail to be preceded by whitespace
(unless we treat the newline character as a whitespace).  Second,
according to our criterion, punctuation symbols will form part of
words; that is, a string like `investors,`:lx: will also count as a
word, since there is no whitespace between `investors`:lx: and the
following comma.  Consequently, we run the risk of failing to
recognise that `investors,`:lx: (with appended comma) is a token of
the same type as `investors`:lx: (without appended comma).  More
importantly, we would like punctuation to be a "first-class citizen"
for tokenization and subsequent processing.  For example, we might
want to implement a rule which says that a word followed by a period
is likely to be an abbreviation if the immediately following word has
a lowercase initial.  However, to formulate such a rule, we must be
able to identify a period as a token in its own right.

A slightly different challenge is raised by examples such as the
following (drawn from the MedLine corpus):

#. This is a alpha-galactosyl-1,4-beta-galactosyl-specific adhesin.
#. The corresponding free cortisol fractions in these sera were
   4.53 +/- 0.15% and 8.16 +/- 0.23%, respectively.

In these cases, we encounter terms which are unlikely to be found in
any general purpose English lexicon.  Moreover, we will have no
success in trying to syntactically analyse these strings using a
standard grammar of English.  Now for some applications, we would like
to "bundle up" expressions such as
`alpha-galactosyl-1,4-beta-galactosyl-specific adhesin`:lx: and
`4.53 +/- 0.15%`:lx: so that they are presented as unanalysable atoms
to the parser.  That is, we want to treat them as single "words" for
the purposes of subsequent processing.  The upshot is that, even if we
confine our attention to English text, the question of what we treat
as a word may depend a great deal on what our purposes are.

Representing tokens
-------------------

When written language is stored in a computer file it is normally
represented as a sequence or *string* of characters.  That is, in a
standard text file, individual words are strings, sentences are
strings, and indeed the whole text is one long string.  The characters
in a string don't have to be just the ordinary alphanumerics; strings
can also include special characters which represent space, tab and
newline.

Most computational processing is performed above the level of
characters.  In compiling a programming language, for example, the
compiler expects its input to be a sequence of tokens that it knows
how to deal with; for example, the classes of identifiers, string
constants and numerals.  Analogously, a parser will expect its input
to be a sequence of word tokens rather than a sequence of individual
characters.  At its simplest, then, tokenization of a text involves
searching for locations in the string of characters containing
whitespace (space, tab, or newline) or certain punctuation symbols,
and breaking the string into word tokens at these points.  For
example, suppose we have a file containing the following two lines::

    The cat climbed
    the tree.

From the parser's point of view, this file is just a string of
characters::

    'The_cat_climbed\n_the_tree.'

Note that we use single quotes to delimit strings, "_" to represent
space and "\n" to represent newline.  As we just pointed out, to
tokenize this text for consumption by the parser, we need to
explicitly indicate which substrings are words.  One convenient way to
do this in Python is to split the string into a *list* of words, where
each word is a string, such as `'dog'`:lx:. [#]_  In Python, lists are
printed as a series of objects (in this case, strings), surrounded by
square brackets and separated by commas:

>>> words = ['the', 'cat', 'climbed', 'the', 'tree']
>>> words
['the', 'cat', 'climbed', 'the', 'tree']

.. [#] We say "convenient" because Python makes it easy to iterate
   through a list, processing the items one by one.

Notice that we have introduced a new variable `words`:lx: which is
bound to the list, and that we entered the variable on a new line to
check its value.  To summarize, we have just illustrated how, at its
simplest, tokenization of a text can be carried out by converting the
single string representing the text into a list of strings, each of
which corresponds to a word.

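
To make this concrete, here is a small illustrative sketch using
Python's built-in ``split()`` method rather than an NLTK tokenizer.
Note that splitting on whitespace alone leaves the final period
attached to `tree`:lx:, a point we return to below:

>>> text = 'The cat climbed\n the tree.'
>>> text.split()
['The', 'cat', 'climbed', 'the', 'tree.']
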

Some of this could maybe be discussed in the programming chapter?
------------------------------------------------------------------

Many natural language processing tasks involve analyzing texts of
varying sizes, ranging from single sentences to very large corpora.
There are a number of ways to represent texts using NLTK.  The
simplest is as a single string.  These strings can be loaded directly
from files:

>>> text_str = open('corpus.txt').read()
>>> text_str
'Hello world. This is a test file.\n'

However, as noted above, it is usually preferable to represent a text
as a list of tokens.  These lists are typically created using a
*tokenizer*, such as `tokenize.whitespace`:lx:, which splits strings
into words at whitespace:

>>> from nltk_lite import tokenize
>>> text = 'Hello world. This is a test string.'
>>> list(tokenize.whitespace(text))
['Hello', 'world.', 'This', 'is', 'a', 'test', 'string.']

.. Note:: By "whitespace", we mean not only interword space, but also
   tab and line-end.

Note that tokenization may normalize the text, mapping all words to
lowercase, expanding contractions, and possibly even stemming the
words.  An example of stemming is shown below:

>>> text = 'stemming can be fun and exciting'
>>> tokens = tokenize.whitespace(text)
>>> porter = tokenize.PorterStemmer()
>>> for token in tokens:
...     print porter.stem(token),
stem can be fun and excit

Tokenization based on whitespace is too simplistic for most
applications; for instance, it fails to separate the last word of a
phrase or sentence from punctuation characters, such as comma, period,
exclamation mark and question mark.  As its name suggests,
`tokenize.regexp`:lx: employs a regular expression to determine how
text should be split up.  This regular expression specifies the
characters that can be included in a valid word.  To define a
tokenizer that includes punctuation as separate tokens, we could use a
regular expression like the one sketched below.

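
This is only a hedged sketch: it assumes a ``tokenize.regexp(text,
pattern)`` function analogous to ``tokenize.whitespace()`` above, and
both that signature and the pattern are assumptions rather than
documented NLTK behavior.

.. doctest-ignore::
    >>> text = 'Hello world. This is a test string.'
    >>> pattern = r'\w+|[^\w\s]+'             # words, or runs of punctuation
    >>> list(tokenize.regexp(text, pattern))  # assumed signature
    ['Hello', 'world', '.', 'This', 'is', 'a', 'test', 'string', '.']
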

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

------------
More Grammar
------------

In this final section, we return to the grammar of English.  We
consider more syntactic phenomena that will require us to refine the
productions of our phrase structure grammar.  Lexical heads other than
`V`:gc: can be subcategorized for particular complements:

**Nouns**

#. The rumour *that Kim was bald* circulated widely.
#. \*The picture *that Kim was bald* circulated widely.

**Adjectives**

#. Lee was afraid *to leave*.
#. \*Lee was tall *to leave*.

It has also been suggested that 'ordinary' prepositions are
transitive, and that many so-called adverbs are in fact intransitive
prepositions.  For example, `towards`:lx: requires an `NP`:gc:
complement, while `home`:lx: and `forwards`:lx: forbid one.

.. example:: Lee ran towards the house.
.. example:: \*Lee ran towards.
.. example:: Sammy walked home.
.. example:: \*Sammy walked home the house.
.. example:: Brent stepped one pace forwards.
.. example:: \*Brent stepped one pace forwards the house.

Adopting this approach, we can also analyse certain prepositions as
allowing `PP`:gc: complements:

.. example:: Kim ran away *from the house*.
.. example:: Lee jumped down *into the boat*.

In general, the lexical categories `V`:gc:, `N`:gc:, `A`:gc: and
`P`:gc: are taken to be the heads of the respective phrases `VP`:gc:,
`NP`:gc:, `AP`:gc: and `PP`:gc:.  Abstracting over the identity of
these phrases, we can say that a lexical category `X`:gc: is the head
of its immediate `XP`:gc: phrase, and moreover that the complements
`C`:subscript:`1` ... `C`:subscript:`n` of `X`:gc: will occur as
sisters of `X`:gc: within that `XP`:gc:.  This is illustrated in the
following schema:

.. ex::
   .. tree:: (XP (X) (*C_1*) ... (*C_n*))

We have argued that lexical categories need to be subdivided into
subcategories to account for the fact that different lexical items
select different sequences of following complements.  That is, it is a
distinguishing property of complements that they co-occur with some
lexical items but not others.  By contrast, :dt:`modifiers` can occur
with pretty much any instance of the relevant lexical class.  For
example, consider the temporal adverbial *last Thursday*:

.. example:: The woman gave the telescope to the dog last Thursday.
.. example:: The woman saw a man last Thursday.
.. example:: The dog barked last Thursday.

Moreover, modifiers are always optional, whereas complements are at
least sometimes obligatory.  We can use the phrase structure geometry
to draw a structural distinction between complements, which occur as
sisters of the lexical head, and modifiers, which occur as sisters of
the phrase which encloses the head:

.. ex::
   .. tree:: (XP (XP (X) (*C_1*) ... (*C_n*)) (*Mod*))

Exercises
---------

#. Pick some of the syntactic constructions described in any
   introductory syntax text (e.g. Jurafsky and Martin, Chapter 9) and
   create a set of 15 sentences.  Five sentences should be unambiguous
   (have a unique parse), five should be ambiguous, and a further five
   should be ungrammatical.

   a) Develop a small grammar, consisting of about ten syntactic
      productions, to account for this data.  Refine your set of
      sentences as needed to test and demonstrate the grammar.  Write
      a function to demonstrate your grammar on three sentences:
      (i) a sentence having exactly one parse; (ii) a sentence having
      more than one parse; (iii) a sentence having no parses.  Discuss
      your observations using inline comments.
   b) Create a list ``words`` of all the words in your lexicon, and
      use ``random.choice(words)`` to generate sequences of 5-10
      randomly selected words.  Does this generate any grammatical
      sentences which your grammar rejects, or any ungrammatical
      sentences which your grammar accepts?  Now use this information
      to help you improve your grammar.  Discuss your findings.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

NLTK consists of a set of Python *modules*, each of which defines
classes and functions related to a single data structure or task.


Before you can use a module, you must ``import`` its contents.  The
simplest way to import the contents of a module is to use the ``from
module import *`` command.  For example, to import the contents of the
``nltk_lite.utilities`` module, which is discussed in this chapter,
type:

>>> from nltk_lite.utilities import *
>>>

A disadvantage of this style of import statement is that it does not
specify what objects are imported; and it is possible that some of the
imported objects will unintentionally cause conflicts.  To avoid this
disadvantage, you can explicitly list the objects you wish to import.
For example, as we saw earlier, we can import the ``re_show`` function
from the ``nltk_lite.utilities`` module as follows:

>>> from nltk_lite.utilities import re_show
>>>

Another option is to import the module itself, rather than its
contents.  Its contents can then be accessed using *fully qualified*
dotted names:

>>> from nltk_lite import utilities
>>> utilities.re_show('green', sent)
colorless {green} ideas sleep furiously
>>>

For more information about importing, see any Python textbook.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

We can also access the tagged text using the ``brown.tagged()``
method:

>>> print extract(0, brown.tagged())
[('The', 'at'), ('Fulton', 'np-tl'), ('County', 'nn-tl'),
('Grand', 'jj-tl'), ('Jury', 'nn-tl'), ('said', 'vbd'),
('Friday', 'nr'), ('an', 'at'), ('investigation', 'nn'), ('of', 'in'),
("Atlanta's", 'np$'), ('recent', 'jj'), ('primary', 'nn'),
('election', 'nn'), ('produced', 'vbd'), ('``', '``'), ('no', 'at'),
('evidence', 'nn'), ("''", "''"), ('that', 'cs'), ('any', 'dti'),
('irregularities', 'nns'), ('took', 'vbd'), ('place', 'nn'), ('.', '.')]
>>>

NLTK includes a 10% fragment of the Wall Street Journal section of the
Penn Treebank.  This can be accessed using ``treebank.raw()`` for the
raw text, ``treebank.tagged()`` for the tagged text, and
``treebank.parsed()`` for the parsed text:

>>> from nltk_lite.corpora import treebank
>>> print extract(0, treebank.parsed())
(S:
  (NP-SBJ:
    (NP: (NNP: 'Pierre') (NNP: 'Vinken'))
    (,: ',')
    (ADJP: (NP: (CD: '61') (NNS: 'years')) (JJ: 'old'))
    (,: ','))
  (VP:
    (MD: 'will')
    (VP:
      (VB: 'join')
      (NP: (DT: 'the') (NN: 'board'))
      (PP-CLR: (IN: 'as')
        (NP: (DT: 'a') (JJ: 'nonexecutive') (NN: 'director')))
      (NP-TMP: (NNP: 'Nov.') (CD: '29'))))
  (.: '.'))
>>>

NLTK contains some simple chatbots, which will try to talk
intelligently with you.  You can access the famous Eliza chatbot using
``from nltk_lite.chat import eliza``, then run ``eliza.demo()``.  The
other chatbots are called ``iesha`` (teen anime talk), ``rude``
(insulting talk), and ``zen`` (gems of Zen wisdom), and were
contributed by other students who have used NLTK.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Predicting the Next Word (Revisited)
------------------------------------

We first construct an empty conditional frequency distribution:

>>> from nltk_lite.corpora import genesis
>>> from nltk_lite.probability import ConditionalFreqDist
>>> cfdist = ConditionalFreqDist()

We then examine each token in the corpus, and increment the
appropriate sample's count.  We use the variable ``prev`` to record
the previous word.

>>> prev = None
>>> for word in genesis.raw():
...     cfdist[prev].inc(word)
...     prev = word

.. Note:: Sometimes the context for an experiment is unavailable, or
   does not exist.  For example, the first token in a text does not
   follow any word.  In these cases, we must decide what context to
   use.  For this example, we use ``None`` as the context for the
   first token.  Another option would be to discard the first token.

Once we have constructed a conditional frequency distribution for the
training corpus, we can use it to find the most likely word for any
given context.  For example, taking the word `living`:lx: as our
context, we can inspect all the words that occurred in that context.

>>> word = 'living'
>>> cfdist[word].samples()
['creature,', 'substance', 'soul.', 'thing', 'thing,', 'creature']

We can set up a simple loop to generate text: we set an initial
context, pick the most likely token in that context as our next word,
and then use that word as our new context:

>>> word = 'living'
>>> for i in range(20):
...     print word,
...     word = cfdist[word].max()
living creature that he said, I will not be a wife of the land of the land of the land

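
Notice that the generated text soon falls into a loop, since ``max()``
always picks the same successor for a given word.  One simple
variation, shown here as an illustrative sketch using only the methods
seen above, is to choose randomly among the observed successors; the
output then differs on every run, so none is shown:

.. doctest-ignore::
    >>> import random
    >>> word = 'living'
    >>> for i in range(20):
    ...     print word,
    ...     word = random.choice(cfdist[word].samples())
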

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Pronunciation Dictionary
------------------------

Here we access the CMU Pronouncing Dictionary, and find all words
whose pronunciation ends with the four sounds ``N IH0 K S``, like
`phonics`:lx::

.. doctest-ignore::
    >>> from nltk_lite.corpora import cmudict
    >>> from string import join
    >>> for word, num, pron in cmudict.raw():
    ...     if pron[-4:] == ('N', 'IH0', 'K', 'S'):
    ...         print word.lower(),
    atlantic's audiotronics avionics beatniks calisthenics centronics
    chetniks clinic's clinics conics cynics diasonics dominic's ebonics
    electronics electronics' endotronics endotronics' enix environics
    ethnics eugenics fibronics flextronics harmonics hispanics
    histrionics identics ionics kibbutzniks lasersonics lumonics mannix
    mechanics mechanics' microelectronics minix minnix mnemonics
    mnemonics molonicks mullenix mullenix mullinix mulnix munich's
    nucleonics onyx panic's panics penix pennix personics phenix
    philharmonic's phoenix phonics photronics pinnix plantronics
    pyrotechnics refuseniks resnick's respironics sconnix siliconix
    skolniks sonics sputniks technics tectonics tektronix telectronics
    telephonics tonics unix vinick's vinnick's vitronics

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Looking at Timestamps
~~~~~~~~~~~~~~~~~~~~~

>>> fd = nltk.FreqDist()
>>> for entry in lexicon:
...     date = entry.findtext('dt')
...     if date:
...         (day, month, year) = string.split(date, '/')
...         fd.inc((month, year))
>>> for time in fd.sorted():
...     print time[0], time[1], ':', fd[time]
Oct 2005 : 220
Nov 2005 : 133
Feb 2005 : 107
Jun 2005 : 97
Dec 2004 : 74
Aug 2005 : 33
Jan 2005 : 32
Sep 2005 : 27
Dec 2005 : 21
Feb 2007 : 16
Nov 2006 : 15
Sep 2006 : 13
Feb 2004 : 12
Apr 2006 : 11
Jul 2005 : 11
May 2005 : 10
Sep 2004 : 9
Mar 2005 : 8
Jan 2007 : 7
Dec 2006 : 6
Jul 2004 : 6
Jan 2006 : 4
Mar 2006 : 4
Oct 2006 : 3
Jul 2006 : 3
Apr 2007 : 2
Feb 2006 : 2
May 2004 : 1
Oct 2004 : 1
Nov 2004 : 1

To put these in time order, we need to set up a special comparison
function.  Otherwise, if we just sort the months, we'll get them in
alphabetical order.

>>> month_index = {
...     "Jan" : 1, "Feb" : 2, "Mar" : 3, "Apr" : 4,
...     "May" : 5, "Jun" : 6, "Jul" : 7, "Aug" : 8,
...     "Sep" : 9, "Oct" : 10, "Nov" : 11, "Dec" : 12
... }
>>> def time_cmp(a, b):
...     a2 = a[1], month_index[a[0]]
...     b2 = b[1], month_index[b[0]]
...     return cmp(a2, b2)

The comparison function says that we compare two times of the form
``('Mar', '2004')`` by reversing the order of the month and year, and
converting the month into a number, to get ``('2004', 3)``; we then
use Python's built-in ``cmp`` function to compare these pairs.

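
As a quick check, the comparison function puts March 2004 before
February 2005, even though "Feb" precedes "Mar" alphabetically:

>>> time_cmp(('Mar', '2004'), ('Feb', '2005'))
-1
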

Now we can get the times found in the Toolbox entries, sort them
according to our ``time_cmp`` comparison function, and then print them
in order.  This time we print bars to indicate frequency:

>>> times = fd.samples()
>>> times.sort(cmp=time_cmp)
>>> for time in times:
...     print time[0], time[1], ':', '#' * (1 + fd[time]/10)
Feb 2004 : ##
May 2004 : #
Jul 2004 : #
Sep 2004 : #
Oct 2004 : #
Nov 2004 : #
Dec 2004 : ########
Jan 2005 : ####
Feb 2005 : ###########
Mar 2005 : #
May 2005 : ##
Jun 2005 : ##########
Jul 2005 : ##
Aug 2005 : ####
Sep 2005 : ###
Oct 2005 : #######################
Nov 2005 : ##############
Dec 2005 : ###
Jan 2006 : #
Feb 2006 : #
Mar 2006 : #
Apr 2006 : ##
Jul 2006 : #
Sep 2006 : ##
Oct 2006 : #
Nov 2006 : ##
Dec 2006 : #
Jan 2007 : #
Feb 2007 : ##
Apr 2007 : #

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Next, let's consider some methods for discovering patterns in the
lexical entries.  Which fields are the most frequent?

>>> fd = nltk.FreqDist(field.tag for entry in lexicon for field in entry)
>>> fd.sorted()[:10]
['xp', 'ex', 'xe', 'ge', 'tkp', 'lx', 'dt', 'ps', 'pt', 'rt']

Which sequences of fields are the most frequent?

>>> fd = nltk.FreqDist(':'.join(field.tag for field in entry) for entry in lexicon)
>>> top_ten = fd.sorted()[:10]
>>> print '\n'.join(top_ten)
lx:ps:pt:ge:tkp:dt:ex:xp:xe
lx:rt:ps:pt:ge:tkp:dt:ex:xp:xe
lx:rt:ps:pt:ge:tkp:dt:ex:xp:xe:ex:xp:xe
lx:ps:pt:ge:tkp:nt:dt:ex:xp:xe
lx:ps:pt:ge:tkp:nt:dt:ex:xp:xe:ex:xp:xe
lx:ps:pt:ge:tkp:dt:ex:xp:xe:ex:xp:xe
lx:rt:ps:pt:ge:ge:tkp:dt:ex:xp:xe:ex:xp:xe
lx:rt:ps:pt:ge:tkp:dt:ex:xp:xe:ex:xp:xe:ex:xp:xe
lx:ps:pt:ge:tkp:nt:sf:dt:ex:xp:xe
lx:ps:pt:ge:ge:tkp:dt:ex:xp:xe

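
With the same objects in hand, we can also pull out the entries whose
field sequence deviates from the most frequent pattern, for example to
inspect them by hand.  The following is a minimal sketch using only
the ``fd`` and ``lexicon`` objects defined above (the names
``standard`` and ``nonstandard`` are ours); no output is shown, since
it depends on the particular lexicon:

.. doctest-ignore::
    >>> standard = fd.sorted()[0]   # the most frequent field sequence
    >>> nonstandard = [entry for entry in lexicon
    ...                if ':'.join(field.tag for field in entry) != standard]
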

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

-------------------------------
NLTK in the Academic Literature
-------------------------------

|NLTK| has been presented at several international conferences with
published proceedings, from 2002 to the present, as listed below:

Edward Loper and Steven Bird (2002).  NLTK: The Natural Language
Toolkit, *Proceedings of the ACL Workshop on Effective Tools and
Methodologies for Teaching Natural Language Processing and
Computational Linguistics*, Somerset, NJ: Association for
Computational Linguistics, pp. 62-69.
http://arXiv.org/abs/cs/0205028

Steven Bird and Edward Loper (2004).  NLTK: The Natural Language
Toolkit, *Proceedings of the ACL demonstration session*, pp. 214-217.
http://eprints.unimelb.edu.au/archive/00001448/

Edward Loper (2004).  NLTK: Building a Pedagogical Toolkit in Python,
*PyCon DC 2004*, Python Software Foundation.
http://www.python.org/pycon/dc2004/papers/

Steven Bird (2005).  NLTK-Lite: Efficient Scripting for Natural
Language Processing, *4th International Conference on Natural
Language Processing*, pp. 1-8.
http://eprints.unimelb.edu.au/archive/00001453/

Steven Bird (2006).  NLTK: The Natural Language Toolkit, *Proceedings
of the ACL demonstration session*.
http://www.ldc.upenn.edu/sb/home/papers/nltk-demo-06.pdf

Ewan Klein (2006).  Computational Semantics in the Natural Language
Toolkit, *Australian Language Technology Workshop*.
http://www.alta.asn.au/events/altw2006/proceedings/Klein.pdf

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

|nopar| **Further Reading:**

The Association for Computational Linguistics (ACL)
  The ACL is the foremost professional body in |NLP|.  Its journal and
  conference proceedings, approximately 10,000 articles, are available
  online with a full-text search interface, via ``_.

Linguistic Terminology
  A comprehensive glossary of linguistic terminology is available at:
  ``_.

*Language Files*
  *Materials for an Introduction to Language and Linguistics (Ninth
  Edition)*, The Ohio State University Department of Linguistics.
  For more information, see ``_.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
