|
|
|
Note that items are sorted in order of decreasing frequency; two items of the same frequency appear in indeterminate order.
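As a minimal sketch of that ordering, the snippet below uses the standard library's `collections.Counter` as a stand-in for a frequency distribution. The secondary sort key is an assumption added here for illustration: it makes tied items come out in a reproducible order, which the frequency ordering alone does not promise.

```python
from collections import Counter

counts = Counter("abracadabra")

# most_common() sorts by decreasing count only; the relative order of
# items with the same count should not be relied upon.
by_frequency = counts.most_common()

# Using the item itself as a secondary key sorts tied items
# alphabetically, so the result is the same on every run.
deterministic = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
```

When a test compares the full ordering, sorting with an explicit tie-breaker like this avoids spurious failures.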
|
|
|
|
|
nltk.FreqDist can be pickled:
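The round trip being tested can be sketched with `collections.Counter`, on which recent NLTK versions build `FreqDist`. This is a standalone illustration, not the doctest itself; the sample words are made up.

```python
import pickle
from collections import Counter

original = Counter(["the", "cat", "sat", "on", "the", "mat"])

# Serialize to bytes and load back; the restored object should be an
# equal but distinct copy.
restored = pickle.loads(pickle.dumps(original))
```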
|
We extract a small part (500 sentences) of the Brown corpus
|
We create an HMM trainer; note that we need the tags and symbols from the whole corpus, not just the training corpus
|
We divide the corpus into 90% training and 10% testing
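The reason for collecting tags and symbols before splitting can be shown on a toy corpus. The sketch below is illustrative only: `split_corpus` and the ten one-word sentences are hypothetical and are not the Brown corpus or NLTK's API.

```python
def split_corpus(sentences, train_percent=90):
    """Split a list of sentences into (training, testing) by position."""
    cutoff = len(sentences) * train_percent // 100
    return sentences[:cutoff], sentences[cutoff:]

# Ten toy sentences, each a list of (word, tag) pairs.
corpus = [[("w%d" % i, "NOUN" if i % 2 else "VERB")] for i in range(10)]

# Collect tags and symbols from the WHOLE corpus, before splitting.
tags = sorted({tag for sent in corpus for _, tag in sent})
symbols = sorted({word for sent in corpus for word, _ in sent})

train, test = split_corpus(corpus)

# Symbols seen only in the training portion miss anything that occurs
# solely in the held-out sentences.
train_symbols = {word for sent in train for word, _ in sent}
```

Here `w9` occurs only in the test portion, so a model built from `train_symbols` alone would have no entry for it.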
|
And now we can test the estimators
|
This resulted in an initialization error before r7209
|
Laplace (= Lidstone with gamma==1)
|
Expected Likelihood Estimation (= Lidstone with gamma==0.5)
|
Lidstone Estimation, for gamma==0.1, 0.5 and 1 (the latter two should be exactly equal to ELE and Laplace above)
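The relationship between these estimators follows from the Lidstone formula, `(count + gamma) / (N + bins * gamma)`. The function below is a standalone sketch of that formula, not NLTK's implementation; `lidstone_prob` and the toy counts are assumptions made for illustration.

```python
from collections import Counter

def lidstone_prob(counts, sample, gamma, bins):
    """Lidstone estimate: (count + gamma) / (N + bins * gamma)."""
    total = sum(counts.values())
    return (counts[sample] + gamma) / (total + bins * gamma)

counts = Counter({"a": 3, "b": 1})
bins = 3  # two observed samples plus one unseen sample, "c"

mle = lidstone_prob(counts, "c", 0, bins)        # gamma == 0: plain MLE
ele = lidstone_prob(counts, "c", 0.5, bins)      # gamma == 0.5: ELE
laplace = lidstone_prob(counts, "c", 1, bins)    # gamma == 1: Laplace

# For any gamma, the estimates over all bins sum to one.
total_mass = sum(lidstone_prob(counts, s, 0.1, bins) for s in "abc")
```

With `gamma == 0` the unseen sample gets probability zero, which is exactly what the smoothed variants avoid.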
|
This resulted in ZeroDivisionError before r7209
|
Good Turing Estimation
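The idea behind Good-Turing estimation is to replace each raw count `c` with an adjusted count `(c + 1) * N(c+1) / N(c)`, where `N(c)` is the number of samples seen exactly `c` times. The sketch below shows only this basic formula. It is not NLTK's estimator, and the fallback to the raw count when `N(c+1)` is zero is a simplification assumed here; practical implementations smooth the `N(c)` values instead.

```python
from collections import Counter

def good_turing_adjusted_counts(counts):
    """Return {sample: adjusted count} using the basic Good-Turing formula."""
    freq_of_freq = Counter(counts.values())  # N(c): samples seen c times
    adjusted = {}
    for sample, c in counts.items():
        n_c = freq_of_freq[c]
        n_c_plus_1 = freq_of_freq[c + 1]
        if n_c_plus_1:
            adjusted[sample] = (c + 1) * n_c_plus_1 / n_c
        else:
            # Simplifying assumption: keep the raw count when N(c+1) == 0.
            adjusted[sample] = float(c)
    return adjusted

counts = Counter({"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 3})
adjusted = good_turing_adjusted_counts(counts)
```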
|
Since the Kneser-Ney distribution is best suited for trigrams, we must adjust our testing accordingly.
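Adjusting the test means feeding the estimator trigram counts rather than single-token counts. A minimal sketch of building such counts from a token list follows; the `trigrams` helper and the sample sentence are illustrative and use only the standard library.

```python
from collections import Counter

def trigrams(tokens):
    """Return every run of three consecutive tokens as a tuple."""
    return list(zip(tokens, tokens[1:], tokens[2:]))

tokens = "the quick brown fox jumps".split()
tri = trigrams(tokens)

# Counting the tuples gives a frequency distribution over trigrams.
trigram_counts = Counter(tri)
```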
|
|
|
|
|
|
|
Remains to be added: tests for HeldoutProbDist, CrossValidationProbDist and MutableProbDist.

Issue 511: override pop and popitem to invalidate the cache
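Why `pop` and `popitem` need their own overrides can be seen in a small sketch. The class below is hypothetical and is not NLTK's `FreqDist`: it caches the total count, and because the built-in `dict.pop` and `dict.popitem` do not go through `__delitem__`, each must clear the cache itself.

```python
from collections import Counter

class CachedTotalCounter(Counter):
    """Hypothetical Counter subclass that caches the sum of all counts."""

    def __init__(self, *args, **kwargs):
        self._total = None
        super().__init__(*args, **kwargs)

    def cached_total(self):
        if self._total is None:
            self._total = sum(self.values())
        return self._total

    def __setitem__(self, key, value):
        self._total = None
        super().__setitem__(key, value)

    def __delitem__(self, key):
        self._total = None
        super().__delitem__(key)

    def pop(self, *args):
        self._total = None  # dict.pop bypasses __delitem__
        return super().pop(*args)

    def popitem(self):
        self._total = None  # dict.popitem bypasses __delitem__
        return super().popitem()

fd = CachedTotalCounter("aabbbc")
before = fd.cached_total()
fd.pop("b")
after = fd.cached_total()
```

Without the two overrides, `after` would still report the stale cached value.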
|
Issue 533: access cumulative frequencies with no arguments
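One way to picture the behaviour is a helper whose sample list is optional and defaults to every sample in decreasing-frequency order. The function below is a hypothetical sketch, not NLTK's method; the alphabetical tie-break is an assumption added to keep the output reproducible.

```python
from collections import Counter
from itertools import accumulate

def cumulative_frequencies(counts, samples=None):
    """Running totals of counts over `samples` (all samples if omitted)."""
    if samples is None:
        ordered = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
        samples = [sample for sample, _ in ordered]
    return list(accumulate(counts[sample] for sample in samples))

counts = Counter("abracadabra")
all_samples = cumulative_frequencies(counts)
chosen = cumulative_frequencies(counts, ["r", "a"])
```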
|
Issue 579: override clear to reset some variables
|
Issue 351: fix fileids method of CategorizedCorpusReader so that it does not inadvertently add errant categories
|
Issue 175: add the unseen bin to SimpleGoodTuringProbDist by default; otherwise any unseen events get a probability of zero, i.e., they don't get smoothed
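The effect of the extra bin can be sketched with the usual Good-Turing estimate of unseen mass, `N1 / N` (samples seen once, divided by the total count), shared among the unseen bins. The helper below is hypothetical and much simpler than SimpleGoodTuringProbDist; it only illustrates why a bin count equal to the number of observed samples leaves unseen events at zero.

```python
from collections import Counter

def unseen_probability(counts, bins=None):
    """Probability assigned to each unseen sample under a simple N1/N rule."""
    seen_bins = len(counts)
    if bins is None:
        bins = seen_bins + 1  # assumed default: one extra bin for unseen events
    unseen_bins = bins - seen_bins
    if unseen_bins <= 0:
        return 0.0  # no room for unseen events, so nothing is smoothed
    singletons = sum(1 for c in counts.values() if c == 1)
    total = sum(counts.values())
    return (singletons / total) / unseen_bins

counts = Counter({"a": 1, "b": 1, "c": 2})
with_default = unseen_probability(counts)
without_unseen_bin = unseen_probability(counts, bins=3)
two_unseen_bins = unseen_probability(counts, bins=5)
```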
|
MLEProbDist, ConditionalProbDist, DictionaryConditionalProbDist and ConditionalFreqDist can be pickled:
|