nltk.lm package

Submodules

nltk.lm.api module

Language Model Interface.

class nltk.lm.api.LanguageModel(order, vocabulary=None, counter=None)[source]

Bases: object

ABC for Language Models.

Cannot be directly instantiated itself.

context_counts(context)[source]

Helper method for retrieving counts for a given context.

Assumes context has been checked and OOV words in it masked.

Parameters:context (tuple(str) or None) – Context to get counts for.
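
For example, with a small bigram MLE model (a minimal sketch), the counts for the context ("a",) look like this:

>>> from nltk.lm import MLE
>>> lm = MLE(2)
>>> lm.fit([[("a", "b"), ("b", "c")]], vocabulary_text=['a', 'b', 'c'])
>>> sorted(lm.context_counts(("a",)).items())
[('b', 1)]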

entropy(text_ngrams)[source]

Calculate cross-entropy of model for given evaluation text.

Parameters:text_ngrams (Iterable(tuple(str))) – A sequence of ngram tuples.
Return type:float
fit(text, vocabulary_text=None)[source]

Trains the model on a text.

Parameters:text – Training text as a sequence of sentences.
generate(num_words=1, text_seed=None, random_seed=None)[source]

Generate words from the model.

Parameters:
  • num_words (int) – How many words to generate. By default 1.
  • text_seed – Generation can be conditioned on preceding context.
  • random_seed – If provided, makes the random sampling part of generation reproducible.

Returns:One (str) word or a list of words generated from model.

Examples:

>>> from nltk.lm import MLE
>>> lm = MLE(2)
>>> lm.fit([[("a", "b"), ("b", "c")]], vocabulary_text=['a', 'b', 'c'])
>>> lm.fit([[("a",), ("b",), ("c",)]])
>>> lm.generate(random_seed=3)
'a'
>>> lm.generate(text_seed=['a'])
'b'
logscore(word, context=None)[source]

Evaluate the log score of this word in this context.

The arguments are the same as for score and unmasked_score.
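
A minimal sketch, using the same kind of toy model as the generate example above:

>>> from nltk.lm import MLE
>>> lm = MLE(2)
>>> lm.fit([[("a", "b"), ("b", "c")]], vocabulary_text=['a', 'b', 'c'])
>>> lm.logscore("b", ["a"])
0.0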

perplexity(text_ngrams)[source]

Calculates the perplexity of the given text.

This is simply 2 ** cross-entropy for the text, so the arguments are the same.
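
A minimal sketch of the relationship (the values depend on the toy model below):

>>> from nltk.lm import MLE
>>> lm = MLE(2)
>>> lm.fit([[("a", "b"), ("a", "c"), ("b", "c")]], vocabulary_text=['a', 'b', 'c'])
>>> test = [("a", "b"), ("b", "c")]
>>> lm.entropy(test)
0.5
>>> lm.perplexity(test) == 2 ** lm.entropy(test)
True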

score(word, context=None)[source]

Masks out of vocab (OOV) words and computes their model score.

For model-specific logic of calculating scores, see the unmasked_score method.

unmasked_score(word, context=None)[source]

Score a word given some optional context.

Concrete models are expected to provide an implementation. Note that this method does not mask its arguments with the OOV label. Use the score method for that.

Parameters:
  • word (str) – Word for which we want the score
  • context (tuple(str) or None) – Context the word is in. If None, compute unigram score.

Return type:float

class nltk.lm.api.Smoothing(vocabulary, counter)[source]

Bases: object

Ngram Smoothing Interface

Implements Chen & Goodman 1995’s idea that all smoothing algorithms have certain features in common. This should ideally allow smoothing algorithms to work both with Backoff and Interpolation.

Here counter stands for the ngram counts that the smoothing algorithm works with.

alpha_gamma(word, context)[source]
unigram_score(word)[source]
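
As a rough, hypothetical sketch of the expected interface (not one of NLTK’s own algorithms), an additive smoother might look like the following. It assumes, as the built-in smoothers do, that Smoothing keeps its constructor arguments around as self.vocab and self.counts, and that alpha_gamma returns an (alpha, gamma) pair which InterpolatedLanguageModel combines with the lower-order score.

from nltk.lm.api import Smoothing

class AddGammaSmoothing(Smoothing):
    """Hypothetical add-gamma smoother, shown only to illustrate the interface."""

    def __init__(self, vocabulary, counter, gamma=1.0):
        super().__init__(vocabulary, counter)
        self.gamma = gamma

    def unigram_score(self, word):
        # Add-gamma relative frequency among unigrams.
        unigrams = self.counts[1]
        return (unigrams[word] + self.gamma) / (unigrams.N() + self.gamma * len(self.vocab))

    def alpha_gamma(self, word, context):
        # alpha: the higher-order estimate; gamma: the weight handed to the lower-order estimate.
        # self.counts[context] is assumed to behave like NgramCounter's context lookup (see below).
        prefix_counts = self.counts[context]
        alpha = (prefix_counts[word] + self.gamma) / (prefix_counts.N() + self.gamma * len(self.vocab))
        return alpha, 0.0  # this toy smoother reserves no mass for the lower order

A class shaped like this could, in principle, be plugged into InterpolatedLanguageModel the same way the built-in smoothers in nltk.lm.smoothing are.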

nltk.lm.counter module

Language Model Counter

class nltk.lm.counter.NgramCounter(ngram_text=None)[source]

Bases: object

Class for counting ngrams.

Will count any ngram sequence you give it ;)

First we need to make sure we are feeding the counter sentences of ngrams.

>>> text = [["a", "b", "c", "d"], ["a", "c", "d", "c"]]
>>> from nltk.util import ngrams
>>> text_bigrams = [ngrams(sent, 2) for sent in text]
>>> text_unigrams = [ngrams(sent, 1) for sent in text]

The counting itself is very simple.

>>> from nltk.lm import NgramCounter
>>> ngram_counts = NgramCounter(text_bigrams + text_unigrams)

You can conveniently access ngram counts using standard python dictionary notation. String keys will give you unigram counts.

>>> ngram_counts['a']
2
>>> ngram_counts['aliens']
0

If you want to access counts for higher order ngrams, use a list or a tuple. These are treated as “context” keys, so what you get is a frequency distribution over all continuations after the given context.

>>> sorted(ngram_counts[['a']].items())
[('b', 1), ('c', 1)]
>>> sorted(ngram_counts[('a',)].items())
[('b', 1), ('c', 1)]

This is equivalent to specifying explicitly the order of the ngram (in this case 2 for bigram) and indexing on the context.

>>> ngram_counts[2][('a',)] is ngram_counts[['a']]
True

Note that the keys in ConditionalFreqDist cannot be lists, only tuples! It is generally advisable to use the less verbose and more flexible square bracket notation.

To get the count of the full ngram “a b”, do this:

>>> ngram_counts[['a']]['b']
1

Specifying the ngram order as a number can be useful for accessing all ngrams in that order.

>>> ngram_counts[2]
<ConditionalFreqDist with 4 conditions>

The keys of this ConditionalFreqDist are the contexts we discussed earlier. Unigrams can also be accessed with a human-friendly alias.

>>> ngram_counts.unigrams is ngram_counts[1]
True

Similarly to collections.Counter, you can update counts after initialization.

>>> ngram_counts['e']
0
>>> ngram_counts.update([ngrams(["d", "e", "f"], 1)])
>>> ngram_counts['e']
1
N()[source]

Returns grand total number of ngrams stored.

This includes ngrams from all orders, so some duplication is expected.

Return type:int

>>> from nltk.lm import NgramCounter
>>> counts = NgramCounter([[("a", "b"), ("c",), ("d", "e")]])
>>> counts.N()
3
unicode_repr

Return repr(self).

update(ngram_text)[source]

Updates ngram counts from ngram_text.

Expects ngram_text to be a sequence of sentences (sequences). Each sentence consists of ngrams as tuples of strings.

Parameters:ngram_text (Iterable(Iterable(tuple(str)))) – Text containing sentences of ngrams.
Raises:TypeError – if the ngrams are not tuples.

nltk.lm.models module

Language Models

class nltk.lm.models.InterpolatedLanguageModel(smoothing_cls, order, **kwargs)[source]

Bases: nltk.lm.api.LanguageModel

Logic common to all interpolated language models.

The idea to abstract this comes from Chen & Goodman 1995.

unmasked_score(word, context=None)[source]

Score a word given some optional context.

Concrete models are expected to provide an implementation. Note that this method does not mask its arguments with the OOV label. Use the score method for that.

Parameters:
  • word (str) – Word for which we want the score
  • context (tuple(str) or None) – Context the word is in. If None, compute unigram score.

Return type:float

class nltk.lm.models.KneserNeyInterpolated(order, discount=0.1, **kwargs)[source]

Bases: nltk.lm.models.InterpolatedLanguageModel

Interpolated version of Kneser-Ney smoothing.
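
A short usage sketch (the exact score depends on the smoothing details, so only its range is checked here):

>>> from nltk.lm import KneserNeyInterpolated
>>> from nltk.lm.preprocessing import padded_everygram_pipeline
>>> train, vocab = padded_everygram_pipeline(2, [['a', 'b', 'c'], ['a', 'c', 'b']])
>>> lm = KneserNeyInterpolated(2, discount=0.1)
>>> lm.fit(train, vocab)
>>> len(lm.vocab)
6
>>> 0 < lm.score("b", ["a"]) < 1
True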

class nltk.lm.models.Laplace(*args, **kwargs)[source]

Bases: nltk.lm.models.Lidstone

Implements Laplace (add one) smoothing.

Initialization identical to BaseNgramModel because gamma is always 1.

unicode_repr

Return repr(self).

class nltk.lm.models.Lidstone(gamma, *args, **kwargs)[source]

Bases: nltk.lm.api.LanguageModel

Provides Lidstone-smoothed scores.

In addition to initialization arguments from BaseNgramModel also requires a number by which to increase the counts, gamma.

unicode_repr

Return repr(self).

unmasked_score(word, context=None)[source]

Add-one smoothing: Lidstone or Laplace.

To see which kind, look at the gamma attribute on the class.
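
For illustration, assuming Lidstone/Laplace use the usual additive estimate (count + gamma) / (context count + gamma * |V|), with |V| including the "<UNK>" token, a Laplace (gamma = 1) bigram model scores like this (a minimal sketch):

>>> from nltk.lm import Laplace
>>> lm = Laplace(2)
>>> lm.fit([[("a", "b"), ("b", "c")]], vocabulary_text=['a', 'b', 'c'])
>>> lm.score("b", ["a"])  # (1 + 1) / (1 + 1 * 4), where |V| = 4 counting <UNK>
0.4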

class nltk.lm.models.MLE(order, vocabulary=None, counter=None)[source]

Bases: nltk.lm.api.LanguageModel

Class for providing MLE ngram model scores.

Inherits initialization from BaseNgramModel.

unicode_repr

Return repr(self).

unmasked_score(word, context=None)[source]

Returns the MLE score for a word given a context.

Parameters:
  • word – Expected to be a string.
  • context – Expected to be something reasonably convertible to a tuple.

class nltk.lm.models.WittenBellInterpolated(order, **kwargs)[source]

Bases: nltk.lm.models.InterpolatedLanguageModel

Interpolated version of Witten-Bell smoothing.

nltk.lm.preprocessing module

nltk.lm.preprocessing.flatten()

chain.from_iterable(iterable) –> chain object

Alternate chain() constructor taking a single iterable argument that evaluates lazily.

nltk.lm.preprocessing.pad_both_ends(sequence, n, *, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>')

Pads both ends of a sentence to length specified by ngram order.

Following convention, <s> pads the start of the sentence and </s> pads its end.

nltk.lm.preprocessing.padded_everygram_pipeline(order, text)[source]

Default preprocessing for a sequence of sentences.

Creates two iterators:

  • sentences padded and turned into sequences of nltk.util.everygrams
  • sentences padded as above and chained together for a flat stream of words

Parameters:
  • order – Largest ngram length produced by everygrams.
  • text (Iterable[Iterable[str]]) – Text to iterate over. Expected to be an iterable of sentences.

Returns:iterator over text as ngrams, iterator over text as vocabulary data
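
For instance (a small sketch; the everygram ordering matches the example in the Module contents section below):

>>> from nltk.lm.preprocessing import padded_everygram_pipeline
>>> train, vocab = padded_everygram_pipeline(2, [['a', 'b']])
>>> [list(sent) for sent in train]
[[('<s>',), ('a',), ('b',), ('</s>',), ('<s>', 'a'), ('a', 'b'), ('b', '</s>')]]
>>> list(vocab)
['<s>', 'a', 'b', '</s>']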

nltk.lm.preprocessing.padded_everygrams(order, sentence)[source]

Helper with some useful defaults.

Applies pad_both_ends to sentence and follows it up with everygrams.
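
A quick sketch of what this produces for one sentence (same ordering as the everygrams example in the Module contents section):

>>> from nltk.lm.preprocessing import padded_everygrams
>>> list(padded_everygrams(2, ['a', 'b', 'c']))
[('<s>',), ('a',), ('b',), ('c',), ('</s>',), ('<s>', 'a'), ('a', 'b'), ('b', 'c'), ('c', '</s>')]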

nltk.lm.smoothing module

Smoothing algorithms for language modeling.

According to Chen & Goodman 1995 these should work with both Backoff and Interpolation.

class nltk.lm.smoothing.GoodTuring(vocabulary, counter, **kwargs)[source]

Bases: nltk.lm.api.Smoothing

Good-Turing Smoothing

unigram_score(word)[source]
class nltk.lm.smoothing.KneserNey(vocabulary, counter, discount=0.1, **kwargs)[source]

Bases: nltk.lm.api.Smoothing

Kneser-Ney Smoothing.

alpha(word, prefix_counts)[source]
alpha_gamma(word, context)[source]
gamma(prefix_counts)[source]
unigram_score(word)[source]
class nltk.lm.smoothing.WittenBell(vocabulary, counter, discount=0.1, **kwargs)[source]

Bases: nltk.lm.api.Smoothing

Witten-Bell smoothing.

alpha(word, context)[source]
alpha_gamma(word, context)[source]
gamma(context)[source]
unigram_score(word)[source]
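
These smoothers are meant to be plugged into InterpolatedLanguageModel; the ready-made classes in nltk.lm.models do exactly that. A small sketch of the equivalence (assuming identical training data yields identical scores):

>>> from nltk.lm.models import InterpolatedLanguageModel
>>> from nltk.lm.smoothing import WittenBell
>>> from nltk.lm import WittenBellInterpolated
>>> from nltk.lm.preprocessing import padded_everygram_pipeline
>>> text = [['a', 'b', 'c'], ['a', 'c', 'b']]
>>> custom = InterpolatedLanguageModel(WittenBell, 2)
>>> ready_made = WittenBellInterpolated(2)
>>> train1, vocab1 = padded_everygram_pipeline(2, text)
>>> custom.fit(train1, vocab1)
>>> train2, vocab2 = padded_everygram_pipeline(2, text)
>>> ready_made.fit(train2, vocab2)
>>> custom.score("b", ["a"]) == ready_made.score("b", ["a"])
True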

nltk.lm.util module

Language Model Utilities

nltk.lm.util.log_base2(score)[source]

Convenience function for computing logarithms with base 2.
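
For example:

>>> from nltk.lm.util import log_base2
>>> log_base2(0.5)
-1.0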

nltk.lm.vocabulary module

Language Model Vocabulary

class nltk.lm.vocabulary.Vocabulary(counts=None, unk_cutoff=1, unk_label='<UNK>')[source]

Bases: object

Stores language model vocabulary.

Satisfies two common language modeling requirements for a vocabulary:

  • When checking membership and calculating its size, filters items by comparing their counts to a cutoff value.
  • Adds a special “unknown” token which unseen words are mapped to.

>>> words = ['a', 'c', '-', 'd', 'c', 'a', 'b', 'r', 'a', 'c', 'd']
>>> from nltk.lm import Vocabulary
>>> vocab = Vocabulary(words, unk_cutoff=2)

Tokens with counts greater than or equal to the cutoff value will be considered part of the vocabulary.

>>> vocab['c']
3
>>> 'c' in vocab
True
>>> vocab['d']
2
>>> 'd' in vocab
True

Tokens with frequency counts less than the cutoff value will be considered not part of the vocabulary even though their entries in the count dictionary are preserved.

>>> vocab['b']
1
>>> 'b' in vocab
False
>>> vocab['aliens']
0
>>> 'aliens' in vocab
False

Keeping the count entries for seen words allows us to change the cutoff value without having to recalculate the counts.

>>> vocab2 = Vocabulary(vocab.counts, unk_cutoff=1)
>>> "b" in vocab2
True

The cutoff value influences not only membership checking but also the result of getting the size of the vocabulary using the built-in len. Note that while the number of keys in the vocabulary’s counter stays the same, the items in the vocabulary differ depending on the cutoff. We use sorted to demonstrate because it keeps the order consistent.

>>> sorted(vocab2.counts)
['-', 'a', 'b', 'c', 'd', 'r']
>>> sorted(vocab2)
['-', '<UNK>', 'a', 'b', 'c', 'd', 'r']
>>> sorted(vocab.counts)
['-', 'a', 'b', 'c', 'd', 'r']
>>> sorted(vocab)
['<UNK>', 'a', 'c', 'd']

In addition to items it gets populated with, the vocabulary stores a special token that stands in for so-called “unknown” items. By default it’s “<UNK>”.

>>> "<UNK>" in vocab
True

We can look up words in a vocabulary using its lookup method. “Unseen” words (with counts less than cutoff) are looked up as the unknown label. If given one word (a string) as an input, this method will return a string.

>>> vocab.lookup("a")
'a'
>>> vocab.lookup("aliens")
'<UNK>'

If given a sequence, it will return a tuple of the looked up words.

>>> vocab.lookup(["p", 'a', 'r', 'd', 'b', 'c'])
('<UNK>', 'a', '<UNK>', 'd', '<UNK>', 'c')

It’s possible to update the counts after the vocabulary has been created. The interface follows that of collections.Counter.

>>> vocab['b']
1
>>> vocab.update(["b", "b", "c"])
>>> vocab['b']
3
cutoff

Cutoff value.

Items with count below this value are not considered part of vocabulary.
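
For example:

>>> from nltk.lm import Vocabulary
>>> Vocabulary(unk_cutoff=2).cutoff
2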

lookup(words)[source]

Look up one or more words in the vocabulary.

If passed one word as a string, will return that word or self.unk_label. Otherwise, will assume it was passed a sequence of words, will try to look each of them up, and will return an iterator over the looked-up words.

Parameters:words (Iterable(str) or str) – Word(s) to look up.
Return type:generator(str) or str
Raises:TypeError for types other than strings or iterables
>>> from nltk.lm import Vocabulary
>>> vocab = Vocabulary(["a", "b", "c", "a", "b"], unk_cutoff=2)
>>> vocab.lookup("a")
'a'
>>> vocab.lookup("aliens")
'<UNK>'
>>> vocab.lookup(["a", "b", "c", ["x", "b"]])
('a', 'b', '<UNK>', ('<UNK>', 'b'))
unicode_repr

Return repr(self).

update(*counter_args, **counter_kwargs)[source]

Update vocabulary counts.

Wraps collections.Counter.update method.

Module contents

NLTK Language Modeling Module.

Currently this module covers only ngram language models, but it should be easy to extend to neural models.

Preparing Data

Before we train our ngram models it is necessary to make sure the data we put in them is in the right format. Let’s say we have a text that is a list of sentences, where each sentence is a list of strings. For simplicity we just consider a text consisting of characters instead of words.

>>> text = [['a', 'b', 'c'], ['a', 'c', 'd', 'c', 'e', 'f']]

If we want to train a bigram model, we need to turn this text into bigrams. Here’s what the first sentence of our text would look like if we use a function from NLTK for this.

>>> from nltk.util import bigrams
>>> list(bigrams(text[0]))
[('a', 'b'), ('b', 'c')]

Notice how “b” occurs both as the first and second member of different bigrams but “a” and “c” don’t? Wouldn’t it be nice to somehow indicate how often sentences start with “a” and end with “c”? A standard way to deal with this is to add special “padding” symbols to the sentence before splitting it into ngrams. Fortunately, NLTK also has a function for that, let’s see what it does to the first sentence.

>>> from nltk.util import pad_sequence
>>> list(pad_sequence(text[0],
... pad_left=True,
... left_pad_symbol="<s>",
... pad_right=True,
... right_pad_symbol="</s>",
... n=2))
['<s>', 'a', 'b', 'c', '</s>']

Note the n argument: it tells the function we need padding for bigrams. Now, passing all these parameters every time is tedious and in most cases they can be safely assumed as defaults anyway. Thus our module provides a convenience function that has all these arguments already set while the other arguments remain the same as for pad_sequence.

>>> from nltk.lm.preprocessing import pad_both_ends
>>> list(pad_both_ends(text[0], n=2))
['<s>', 'a', 'b', 'c', '</s>']

Combining the two parts discussed so far we get the following preparation steps for one sentence.

>>> list(bigrams(pad_both_ends(text[0], n=2)))
[('<s>', 'a'), ('a', 'b'), ('b', 'c'), ('c', '</s>')]

To make our model more robust we could also train it on unigrams (single words) as well as bigrams, its main source of information. NLTK once again helpfully provides a function called everygrams. While not the most efficient, it is conceptually simple.

>>> from nltk.util import everygrams
>>> padded_bigrams = list(pad_both_ends(text[0], n=2))
>>> list(everygrams(padded_bigrams, max_len=2))
[('<s>',),
 ('a',),
 ('b',),
 ('c',),
 ('</s>',),
 ('<s>', 'a'),
 ('a', 'b'),
 ('b', 'c'),
 ('c', '</s>')]

We are almost ready to start counting ngrams, just one more step left. During training and evaluation our model will rely on a vocabulary that defines which words are “known” to the model. To create this vocabulary we need to pad our sentences (just like for counting ngrams) and then combine the sentences into one flat stream of words.

>>> from nltk.lm.preprocessing import flatten
>>> list(flatten(pad_both_ends(sent, n=2) for sent in text))
['<s>', 'a', 'b', 'c', '</s>', '<s>', 'a', 'c', 'd', 'c', 'e', 'f', '</s>']

In most cases we want to use the same text as the source for both vocabulary and ngram counts. Now that we understand what this means for our preprocessing, we can simply import a function that does everything for us.

>>> from nltk.lm.preprocessing import padded_everygram_pipeline
>>> train, vocab = padded_everygram_pipeline(2, text)

So as to avoid re-creating the text in memory, both train and vocab are lazy iterators. They are evaluated on demand at training time.

Training

Having prepared our data we are ready to start training a model. As a simple example, let us train a Maximum Likelihood Estimator (MLE). We only need to specify the highest ngram order to instantiate it.

>>> from nltk.lm import MLE
>>> lm = MLE(2)

This automatically creates an empty vocabulary…

>>> len(lm.vocab)
0

… which gets filled as we fit the model.

>>> lm.fit(train, vocab)
>>> print(lm.vocab)
<Vocabulary with cutoff=1 unk_label='<UNK>' and 9 items>
>>> len(lm.vocab)
9

The vocabulary helps us handle words that have not occurred during training.

>>> lm.vocab.lookup(text[0])
('a', 'b', 'c')
>>> lm.vocab.lookup(["aliens", "from", "Mars"])
('<UNK>', '<UNK>', '<UNK>')

Moreover, in some cases we want to ignore words that we did see during training but that didn’t occur frequently enough to provide us with useful information. You can tell the vocabulary to ignore such words; to find out how that works, check out the docs for the Vocabulary class.
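
One way to do that (a sketch reusing the text and preprocessing helpers from above) is to build the Vocabulary yourself with a higher unk_cutoff and hand it to the model constructor:

>>> from nltk.lm import Vocabulary
>>> strict_vocab = Vocabulary(flatten(pad_both_ends(sent, n=2) for sent in text), unk_cutoff=2)
>>> sorted(strict_vocab)
['</s>', '<UNK>', '<s>', 'a', 'c']
>>> strict_lm = MLE(2, vocabulary=strict_vocab)
>>> strict_train, _ = padded_everygram_pipeline(2, text)
>>> strict_lm.fit(strict_train)
>>> strict_lm.vocab.lookup(text[0])
('a', '<UNK>', 'c')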

Using a Trained Model

When it comes to ngram models the training boils down to counting up the ngrams from the training corpus.

>>> print(lm.counts)
<NgramCounter with 2 ngram orders and 24 ngrams>

This provides a convenient interface to access counts for unigrams…

>>> lm.counts['a']
2

…and bigrams (in this case “a b”)

>>> lm.counts[['a']]['b']
1

And so on. However, the real purpose of training a language model is to have it score how probable words are in certain contexts. This being MLE, the model returns the item’s relative frequency as its score.

>>> lm.score("a")
0.15384615384615385

Items that are not seen during training are mapped to the vocabulary’s “unknown label” token. This is “<UNK>” by default.

>>> lm.score("<UNK>") == lm.score("aliens")
True

Here’s how you get the score for a word given some preceding context. For example, we want to know the chance that “b” is preceded by “a”.

>>> lm.score("b", ["a"])
0.5

To avoid underflow when working with many small score values it makes sense to take their logarithm. For convenience this can be done with the logscore method.

>>> lm.logscore("a")
-2.700439718141092

Building on this method, we can also evaluate our model’s cross-entropy and perplexity with respect to sequences of ngrams.

>>> test = [('a', 'b'), ('c', 'd')]
>>> lm.entropy(test)
1.292481250360578
>>> lm.perplexity(test)
2.449489742783178

It is advisable to preprocess your test text exactly the same way as you did the training text.

One cool feature of ngram models is that they can be used to generate text.

>>> lm.generate(1, random_seed=3)
'<s>'
>>> lm.generate(5, random_seed=3)
['<s>', 'a', 'b', 'c', '</s>']

Provide random_seed if you want to consistently reproduce the same text, all other things being equal. Here we are using it to test the examples.

You can also condition your generation on some preceding text with the text_seed argument.

>>> lm.generate(5, text_seed=['c'], random_seed=3)
['</s>', '<s>', 'a', 'b', 'c']

Note that an ngram model is restricted in how much preceding context it can take into account. For example, a trigram model can only condition its output on 2 preceding words. If you pass in a 4-word context, the first two words will be ignored.

class nltk.lm.Vocabulary(counts=None, unk_cutoff=1, unk_label='<UNK>')[source]

Bases: object

Stores language model vocabulary.

Satisfies two common language modeling requirements for a vocabulary:

  • When checking membership and calculating its size, filters items by comparing their counts to a cutoff value.
  • Adds a special “unknown” token which unseen words are mapped to.

>>> words = ['a', 'c', '-', 'd', 'c', 'a', 'b', 'r', 'a', 'c', 'd']
>>> from nltk.lm import Vocabulary
>>> vocab = Vocabulary(words, unk_cutoff=2)

Tokens with counts greater than or equal to the cutoff value will be considered part of the vocabulary.

>>> vocab['c']
3
>>> 'c' in vocab
True
>>> vocab['d']
2
>>> 'd' in vocab
True

Tokens with frequency counts less than the cutoff value will be considered not part of the vocabulary even though their entries in the count dictionary are preserved.

>>> vocab['b']
1
>>> 'b' in vocab
False
>>> vocab['aliens']
0
>>> 'aliens' in vocab
False

Keeping the count entries for seen words allows us to change the cutoff value without having to recalculate the counts.

>>> vocab2 = Vocabulary(vocab.counts, unk_cutoff=1)
>>> "b" in vocab2
True

The cutoff value influences not only membership checking but also the result of getting the size of the vocabulary using the built-in len. Note that while the number of keys in the vocabulary’s counter stays the same, the items in the vocabulary differ depending on the cutoff. We use sorted to demonstrate because it keeps the order consistent.

>>> sorted(vocab2.counts)
['-', 'a', 'b', 'c', 'd', 'r']
>>> sorted(vocab2)
['-', '<UNK>', 'a', 'b', 'c', 'd', 'r']
>>> sorted(vocab.counts)
['-', 'a', 'b', 'c', 'd', 'r']
>>> sorted(vocab)
['<UNK>', 'a', 'c', 'd']

In addition to items it gets populated with, the vocabulary stores a special token that stands in for so-called “unknown” items. By default it’s “<UNK>”.

>>> "<UNK>" in vocab
True

We can look up words in a vocabulary using its lookup method. “Unseen” words (with counts less than cutoff) are looked up as the unknown label. If given one word (a string) as an input, this method will return a string.

>>> vocab.lookup("a")
'a'
>>> vocab.lookup("aliens")
'<UNK>'

If given a sequence, it will return a tuple of the looked up words.

>>> vocab.lookup(["p", 'a', 'r', 'd', 'b', 'c'])
('<UNK>', 'a', '<UNK>', 'd', '<UNK>', 'c')

It’s possible to update the counts after the vocabulary has been created. The interface follows that of collections.Counter.

>>> vocab['b']
1
>>> vocab.update(["b", "b", "c"])
>>> vocab['b']
3
cutoff

Cutoff value.

Items with count below this value are not considered part of vocabulary.

lookup(words)[source]

Look up one or more words in the vocabulary.

If passed one word as a string, will return that word or self.unk_label. Otherwise, will assume it was passed a sequence of words, will try to look each of them up, and will return an iterator over the looked-up words.

Parameters:words (Iterable(str) or str) – Word(s) to look up.
Return type:generator(str) or str
Raises:TypeError for types other than strings or iterables
>>> from nltk.lm import Vocabulary
>>> vocab = Vocabulary(["a", "b", "c", "a", "b"], unk_cutoff=2)
>>> vocab.lookup("a")
'a'
>>> vocab.lookup("aliens")
'<UNK>'
>>> vocab.lookup(["a", "b", "c", ["x", "b"]])
('a', 'b', '<UNK>', ('<UNK>', 'b'))
unicode_repr

Return repr(self).

update(*counter_args, **counter_kwargs)[source]

Update vocabulary counts.

Wraps collections.Counter.update method.

class nltk.lm.NgramCounter(ngram_text=None)[source]

Bases: object

Class for counting ngrams.

Will count any ngram sequence you give it ;)

First we need to make sure we are feeding the counter sentences of ngrams.

>>> text = [["a", "b", "c", "d"], ["a", "c", "d", "c"]]
>>> from nltk.util import ngrams
>>> text_bigrams = [ngrams(sent, 2) for sent in text]
>>> text_unigrams = [ngrams(sent, 1) for sent in text]

The counting itself is very simple.

>>> from nltk.lm import NgramCounter
>>> ngram_counts = NgramCounter(text_bigrams + text_unigrams)

You can conveniently access ngram counts using standard python dictionary notation. String keys will give you unigram counts.

>>> ngram_counts['a']
2
>>> ngram_counts['aliens']
0

If you want to access counts for higher order ngrams, use a list or a tuple. These are treated as “context” keys, so what you get is a frequency distribution over all continuations after the given context.

>>> sorted(ngram_counts[['a']].items())
[('b', 1), ('c', 1)]
>>> sorted(ngram_counts[('a',)].items())
[('b', 1), ('c', 1)]

This is equivalent to specifying explicitly the order of the ngram (in this case 2 for bigram) and indexing on the context.

>>> ngram_counts[2][('a',)] is ngram_counts[['a']]
True

Note that the keys in ConditionalFreqDist cannot be lists, only tuples! It is generally advisable to use the less verbose and more flexible square bracket notation.

To get the count of the full ngram “a b”, do this:

>>> ngram_counts[['a']]['b']
1

Specifying the ngram order as a number can be useful for accessing all ngrams in that order.

>>> ngram_counts[2]
<ConditionalFreqDist with 4 conditions>

The keys of this ConditionalFreqDist are the contexts we discussed earlier. Unigrams can also be accessed with a human-friendly alias.

>>> ngram_counts.unigrams is ngram_counts[1]
True

Similarly to collections.Counter, you can update counts after initialization.

>>> ngram_counts['e']
0
>>> ngram_counts.update([ngrams(["d", "e", "f"], 1)])
>>> ngram_counts['e']
1
N()[source]

Returns grand total number of ngrams stored.

This includes ngrams from all orders, so some duplication is expected.

Return type:int

>>> from nltk.lm import NgramCounter
>>> counts = NgramCounter([[("a", "b"), ("c",), ("d", "e")]])
>>> counts.N()
3
unicode_repr

Return repr(self).

update(ngram_text)[source]

Updates ngram counts from ngram_text.

Expects ngram_text to be a sequence of sentences (sequences). Each sentence consists of ngrams as tuples of strings.

Parameters:ngram_text (Iterable(Iterable(tuple(str)))) – Text containing sentences of ngrams.
Raises:TypeError – if the ngrams are not tuples.
class nltk.lm.MLE(order, vocabulary=None, counter=None)[source]

Bases: nltk.lm.api.LanguageModel

Class for providing MLE ngram model scores.

Inherits initialization from BaseNgramModel.

unicode_repr

Return repr(self).

unmasked_score(word, context=None)[source]

Returns the MLE score for a word given a context.

Parameters:
  • word – Expected to be a string.
  • context – Expected to be something reasonably convertible to a tuple.

class nltk.lm.Lidstone(gamma, *args, **kwargs)[source]

Bases: nltk.lm.api.LanguageModel

Provides Lidstone-smoothed scores.

In addition to initialization arguments from BaseNgramModel also requires a number by which to increase the counts, gamma.

unicode_repr

Return repr(self).

unmasked_score(word, context=None)[source]

Add-one smoothing: Lidstone or Laplace.

To see which kind, look at the gamma attribute on the class.

class nltk.lm.Laplace(*args, **kwargs)[source]

Bases: nltk.lm.models.Lidstone

Implements Laplace (add one) smoothing.

Initialization identical to BaseNgramModel because gamma is always 1.

unicode_repr

Return repr(self).

class nltk.lm.WittenBellInterpolated(order, **kwargs)[source]

Bases: nltk.lm.models.InterpolatedLanguageModel

Interpolated version of Witten-Bell smoothing.

class nltk.lm.KneserNeyInterpolated(order, discount=0.1, **kwargs)[source]

Bases: nltk.lm.models.InterpolatedLanguageModel

Interpolated version of Kneser-Ney smoothing.