Package nltk :: Module probability :: Class HeldoutProbDist

Class HeldoutProbDist

object --+    
         |    
 ProbDistI --+
             |
            HeldoutProbDist

The heldout estimate for the probability distribution of the experiment used to generate two frequency distributions. These two frequency distributions are called the "heldout frequency distribution" and the "base frequency distribution." The heldout estimate uses the heldout frequency distribution to predict the probability of each sample, given its frequency in the base frequency distribution.

In particular, the heldout estimate approximates the probability for a sample that occurs r times in the base distribution as the average frequency in the heldout distribution of all samples that occur r times in the base distribution.

This average frequency is Tr[r]/(Nr[r]*N), where:

Tr[r] is the total count in the heldout distribution for all samples that occur r times in the base distribution.
Nr[r] is the number of samples that occur r times in the base distribution.
N is the number of outcomes recorded by the heldout frequency distribution.

In order to increase the efficiency of the prob member function, Tr[r]/(Nr[r]*N) is precomputed for each value of r when the HeldoutProbDist is created.
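The formula above can be illustrated with a minimal sketch that uses plain collections.Counter objects in place of FreqDist. The function name heldout_estimate, the explicit bins argument, and the toy data are assumptions made for illustration; this is not NLTK's implementation.

```python
from collections import Counter


def heldout_estimate(base, heldout, bins):
    """Return a list whose entry r is Tr[r] / (Nr[r] * N).

    Hypothetical helper: `base` and `heldout` are Counters of sample
    counts, and `bins` is the number of distinct samples the experiment
    can produce.
    """
    max_r = max(base.values())
    n = sum(heldout.values())  # N: outcomes recorded by the heldout counts

    # Tr[r]: total heldout count of all samples that occur r times in base.
    tr = [0.0] * (max_r + 1)
    for sample, count in heldout.items():
        tr[base[sample]] += count  # a Counter returns 0 for unseen samples

    # Nr[r]: number of samples that occur r times in base.
    nr = [0] * (max_r + 1)
    for count in base.values():
        nr[count] += 1
    # Assumption: every bin not seen in base counts as occurring 0 times.
    nr[0] = bins - len(base)

    # Entries with Nr[r] == 0 are never looked up, so mark them None.
    return [tr[r] / (nr[r] * n) if nr[r] else None for r in range(max_r + 1)]


base = Counter("aaabbcd")     # a:3, b:2, c:1, d:1
heldout = Counter("aabcce")   # a:2, b:1, c:2, e:1
bins = 5                      # assumed sample space: a, b, c, d, e
estimate = heldout_estimate(base, heldout, bins)

# Total probability mass over the seen samples plus the unseen bins.
total = sum(estimate[base[s]] for s in base) + (bins - len(base)) * estimate[0]
```

With these toy counts, Tr is [1, 2, 1, 2], Nr is [1, 2, 1, 1], and N is 6, so the estimate is 1/6 for r = 0, 1, and 2, and 1/3 for r = 3. Because bins matches the true number of possible samples, the total mass comes out to one.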

Instance Methods

__init__(self, base_fdist, heldout_fdist, bins=None)
Use the heldout estimate to create a probability distribution for the experiment used to generate base_fdist and heldout_fdist.

_calculate_Tr(self)
Returns: list of float: the list Tr, where Tr[r] is the total count in heldout_fdist for all samples that occur r times in base_fdist.

_calculate_estimate(self, Tr, Nr, N)
Returns: list of float: the list estimate, where estimate[r] is the probability estimate for any sample that occurs r times in the base frequency distribution.

base_fdist(self)
Returns: FreqDist: The base frequency distribution that this probability distribution is based on.

heldout_fdist(self)
Returns: FreqDist: The heldout frequency distribution that this probability distribution is based on.

samples(self)
Returns: list: A list of all samples that have nonzero probabilities.

prob(self, sample)
Returns: float: the probability for a given sample.

max(self)
Returns: any: the sample with the greatest probability.

discount(self)
Returns: float: The ratio by which counts are discounted on average: c*/c

__repr__(self)
Returns: string: A string representation of this ProbDist.

Inherited from ProbDistI: generate, logprob

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __setattr__, __str__

Class Variables

SUM_TO_ONE = False
True if the probabilities of the samples in this probability distribution will always sum to one.

Instance Variables

list of float _estimate
A list mapping from r, the number of times that a sample occurs in the base distribution, to the probability estimate for that sample.

int _max_r
The maximum number of times that any sample occurs in the base distribution.

Properties

Inherited from object: __class__

Method Details

__init__(self, base_fdist, heldout_fdist, bins=None)
(Constructor)

Use the heldout estimate to create a probability distribution for the experiment used to generate base_fdist and heldout_fdist.

Parameters:

base_fdist (FreqDist) - The base frequency distribution.
heldout_fdist (FreqDist) - The heldout frequency distribution.
bins (int) - The number of sample values that can be generated by the experiment that is described by the probability distribution. This value must be correctly set for the probabilities of the sample values to sum to one. If bins is not specified, it defaults to base_fdist.B().

Overrides: ProbDistI.__init__

_calculate_Tr(self)

Returns: list of float: the list Tr, where Tr[r] is the total count in heldout_fdist for all samples that occur r times in base_fdist.

_calculate_estimate(self, Tr, Nr, N)

Parameters:

Tr (list of float) - The list Tr, where Tr[r] is the total count in the heldout distribution for all samples that occur r times in the base distribution.
Nr (list of float) - The list Nr, where Nr[r] is the number of samples that occur r times in the base distribution.
N (int) - The total number of outcomes recorded by the heldout frequency distribution.

Returns: list of float: the list estimate, where estimate[r] is the probability estimate for any sample that occurs r times in the base frequency distribution. In particular, estimate[r] is Tr[r]/(Nr[r]*N). In the special case that Nr[r]=0, estimate[r] will never be used, so we define estimate[r]=None for those cases.
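The special case for Nr[r]=0 can be sketched as follows. The standalone function calculate_estimate and the input lists are hypothetical and only mirror the documented parameters; the actual method body may differ.

```python
def calculate_estimate(tr, nr, n):
    """Return estimate[r] = tr[r] / (nr[r] * n), or None where nr[r] == 0.

    Hypothetical stand-in for the documented private method.
    """
    estimate = []
    for tr_r, nr_r in zip(tr, nr):
        if nr_r == 0:
            # No sample occurs r times in the base distribution, so this
            # entry is never looked up; None marks it as undefined.
            estimate.append(None)
        else:
            estimate.append(tr_r / (nr_r * n))
    return estimate


# Assumed toy input: no sample occurs exactly twice in the base distribution.
tr = [1.0, 2.0, 0.0, 3.0]
nr = [1, 2, 0, 1]
estimate = calculate_estimate(tr, nr, n=6)
```

Here estimate[2] is None, while the other entries are ordinary floats. Using None rather than 0.0 makes an accidental lookup of an undefined entry easier to notice.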

base_fdist(self)

Returns: FreqDist: The base frequency distribution that this probability distribution is based on.

heldout_fdist(self)

Returns: FreqDist: The heldout frequency distribution that this probability distribution is based on.

samples(self)

Returns: list: A list of all samples that have nonzero probabilities. Use prob to find the probability of each sample.
Overrides: ProbDistI.samples: (inherited documentation)

prob(self, sample)

Parameters:

sample - The sample whose probability should be returned.

Returns: float: the probability for a given sample. Probabilities are always real numbers in the range [0, 1].
Overrides: ProbDistI.prob: (inherited documentation)
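Because the estimates are precomputed per value of r, a probability lookup only needs the sample's count in the base distribution. The sketch below assumes this lookup scheme; the class HeldoutLookup and the hard-coded estimate table are hypothetical and stand in for the real class only for illustration.

```python
from collections import Counter


class HeldoutLookup:
    """Hypothetical helper: maps a sample to a precomputed estimate."""

    def __init__(self, base, estimate):
        self._base = base          # sample -> count in the base distribution
        self._estimate = estimate  # r -> precomputed probability estimate

    def prob(self, sample):
        # r is the sample's count in the base distribution (0 if unseen).
        r = self._base[sample]
        return self._estimate[r]


base = Counter("aaabbcd")                 # a:3, b:2, c:1, d:1
estimate = [1 / 6, 1 / 6, 1 / 6, 1 / 3]   # assumed precomputed table
dist = HeldoutLookup(base, estimate)

p_a = dist.prob("a")        # looks up estimate[3]
p_unseen = dist.prob("z")   # looks up estimate[0]
```

Every sample with the same base count receives the same probability, so "c" and "d" (both seen once) are indistinguishable to this lookup.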

max(self)

Returns: any: the sample with the greatest probability. If two or more samples have the same probability, return one of them; which sample is returned is undefined.
Overrides: ProbDistI.max: (inherited documentation)

discount(self)

Returns: float: The ratio by which counts are discounted on average: c*/c
Overrides: ProbDistI.discount: (inherited documentation)

__repr__(self)
(Representation operator)

repr(x)

Returns: string: A string representation of this ProbDist.
Overrides: object.__repr__

Instance Variable Details

_estimate

A list mapping from r, the number of times that a sample occurs in the base distribution, to the probability estimate for that sample. _estimate[r] is calculated by finding the average frequency in the heldout distribution of all samples that occur r times in the base distribution. In particular, _estimate[r] = Tr[r]/(Nr[r]*N).

Type: list of float

_max_r

The maximum number of times that any sample occurs in the base distribution. _max_r is used to decide how large _estimate must be.

Type: int
