nltk.cluster package

Submodules

nltk.cluster.api module

class nltk.cluster.api.ClusterI[source]

Bases: object

Interface covering basic clustering functionality.

classification_probdist(vector)[source]

Classifies the token into a cluster, returning a probability distribution over the cluster identifiers.

classify(token)[source]

Classifies the token into a cluster, setting the token’s CLUSTER parameter to that cluster identifier.

cluster(vectors, assign_clusters=False)[source]

Assigns the vectors to clusters, learning the clustering parameters from the data. Returns a cluster identifier for each vector.

cluster_name(index)[source]

Returns the name of the cluster at index.

cluster_names()[source]

Returns the names of the clusters.

Return type: list

likelihood(vector, label)[source]

Returns the likelihood (a float) of the token having the corresponding cluster.

num_clusters()[source]

Returns the number of clusters.
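
To illustrate the contract, here is a minimal, hypothetical ClusterI subclass (the class name and its trivial behaviour are made up for illustration); a real clusterer must at least supply cluster, classify and num_clusters, while the remaining methods can be overridden as needed::

    from nltk.cluster.api import ClusterI

    class SingleClusterer(ClusterI):
        """Toy clusterer that assigns every token to one cluster."""

        def cluster(self, vectors, assign_clusters=False):
            # nothing to learn; optionally return an identifier per vector
            if assign_clusters:
                return [self.classify(v) for v in vectors]

        def classify(self, token):
            return 0  # every token belongs to cluster 0

        def num_clusters(self):
            return 1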

nltk.cluster.em module

class nltk.cluster.em.EMClusterer(initial_means, priors=None, covariance_matrices=None, conv_threshold=1e-06, bias=0.1, normalise=False, svd_dimensions=None)[source]

Bases: nltk.cluster.util.VectorSpaceClusterer

The Gaussian EM clusterer models the vectors as being produced by a mixture of k Gaussian sources. The parameters of these sources (prior probability, mean and covariance matrix) are then found to maximise the likelihood of the given data. This is done with the expectation maximisation algorithm. It starts with k arbitrarily chosen means, priors and covariance matrices. It then calculates the membership probabilities for each vector in each of the clusters; this is the ‘E’ step. The cluster parameters are then updated in the ‘M’ step using the maximum likelihood estimate from the cluster membership probabilities. This process continues until the likelihood of the data does not significantly increase.
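
A minimal usage sketch (the 2-D data points and initial means below are made up for illustration, echoing the demo)::

    import numpy
    from nltk.cluster import EMClusterer

    vectors = [numpy.array(f) for f in [[0.5, 0.5], [1.5, 0.5], [1.0, 3.0]]]
    # EM must be seeded with one initial mean per desired cluster
    clusterer = EMClusterer(initial_means=[[4.0, 2.0], [4.0, 2.01]])
    clusters = clusterer.cluster(vectors, assign_clusters=True)
    print(clusters)                                     # a cluster index per vector
    print(clusterer.classify(numpy.array([2.0, 2.0])))  # classify an unseen vector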

classify_vectorspace(vector)[source]

Returns the index of the appropriate cluster for the vector.

cluster_vectorspace(vectors, trace=False)[source]

Finds the clusters using the given set of vectors.

likelihood_vectorspace(vector, cluster)[source]

Returns the likelihood of the vector belonging to the cluster.

num_clusters()[source]

Returns the number of clusters.

unicode_repr()

Return repr(self).

nltk.cluster.em.demo()[source]

Non-interactive demonstration of the clusterers with simple 2-D data.

nltk.cluster.gaac module

class nltk.cluster.gaac.GAAClusterer(num_clusters=1, normalise=True, svd_dimensions=None)[source]

Bases: nltk.cluster.util.VectorSpaceClusterer

The Group Average Agglomerative Clusterer starts with each of the N vectors as singleton clusters. It then iteratively merges pairs of clusters which have the closest centroids. This continues until there is only one cluster. The order of merges gives rise to a dendrogram: a tree with the earlier merges lower than later merges. The membership of a given number of clusters c, 1 <= c <= N, can be found by cutting the dendrogram at depth c.

This clusterer uses the cosine similarity metric only, which allows for efficient speed-up in the clustering process.
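
A minimal usage sketch (the 2-D data points are made up for illustration)::

    import numpy
    from nltk.cluster import GAAClusterer

    vectors = [numpy.array(f) for f in [[3, 3], [1, 2], [4, 2], [4, 0], [2, 3], [3, 1]]]
    clusterer = GAAClusterer(num_clusters=2)  # cut the dendrogram into 2 groups
    clusters = clusterer.cluster(vectors, assign_clusters=True)
    print(clusters)
    clusterer.dendrogram().show()             # ASCII art of the merge order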

classify_vectorspace(vector)[source]

Returns the index of the appropriate cluster for the vector.

cluster(vectors, assign_clusters=False, trace=False)[source]

Assigns the vectors to clusters, learning the clustering parameters from the data. Returns a cluster identifier for each vector.

cluster_vectorspace(vectors, trace=False)[source]

Finds the clusters using the given set of vectors.

dendrogram()[source]

Returns: the dendrogram representing the current clustering

Return type: Dendrogram

num_clusters()[source]

Returns the number of clusters.

unicode_repr()

Return repr(self).

update_clusters(num_clusters)[source]

nltk.cluster.gaac.demo()[source]

Non-interactive demonstration of the clusterers with simple 2-D data.

nltk.cluster.kmeans module

class nltk.cluster.kmeans.KMeansClusterer(num_means, distance, repeats=1, conv_test=1e-06, initial_means=None, normalise=False, svd_dimensions=None, rng=None, avoid_empty_clusters=False)[source]

Bases: nltk.cluster.util.VectorSpaceClusterer

The K-means clusterer starts with k arbitrarily chosen means and then allocates each vector to the cluster with the closest mean. It then recalculates the means of each cluster as the centroid of the vectors in the cluster. This process repeats until the cluster memberships stabilise. This is a hill-climbing algorithm which may converge to a local maximum. Hence the clustering is often repeated with random initial means and the most commonly occurring output means are chosen.
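
A minimal usage sketch (the 2-D data points and the repeat count are illustrative)::

    import numpy
    from nltk.cluster import KMeansClusterer, euclidean_distance

    vectors = [numpy.array(f) for f in [[3, 3], [1, 2], [4, 2], [4, 0]]]
    # repeats=10 re-runs k-means from random initial means and keeps the
    # most commonly occurring means, mitigating convergence to a local maximum
    clusterer = KMeansClusterer(2, euclidean_distance, repeats=10)
    clusters = clusterer.cluster(vectors, assign_clusters=True)
    print(clusters)
    print(clusterer.means())  # the learned cluster means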

classify_vectorspace(vector)[source]

Returns the index of the appropriate cluster for the vector.

cluster_vectorspace(vectors, trace=False)[source]

Finds the clusters using the given set of vectors.

means()[source]

The means used for clustering.

num_clusters()[source]

Returns the number of clusters.

unicode_repr()

Return repr(self).

nltk.cluster.kmeans.demo()[source]

nltk.cluster.util module

class nltk.cluster.util.Dendrogram(items=[])[source]

Bases: object

Represents a dendrogram, a tree with a specified branching order. It must be initialised with the leaf items; merge is then called iteratively for each branch. The class constructs a tree representing the order of calls to the merge function.
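
A minimal sketch of building and cutting a dendrogram by hand (the leaf items are made up for illustration)::

    from nltk.cluster.util import Dendrogram

    d = Dendrogram(['a', 'b', 'c', 'd'])  # four leaf items
    d.merge(0, 1)        # combine 'a' and 'b'; the result replaces index 0
    d.merge(1, 2)        # combine 'c' and 'd'; the result replaces index 1
    d.merge(0, 1)        # final merge into a single root
    print(d.groups(2))   # cut giving 2 groups, e.g. [['a', 'b'], ['c', 'd']]
    d.show()             # print the dendrogram in ASCII art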

groups(n)[source]

Finds the n groups of items (leaves) reachable from a cut at depth n.

Parameters: n (int) – number of groups

merge(*indices)[source]

Merges the nodes at the given indices in the dendrogram. The combined node replaces the first node specified; all other nodes involved in the merge are removed.

Parameters: indices (seq of int) – indices of the items to merge (at least two)

show(leaf_labels=[])[source]

Prints the dendrogram in ASCII art to standard output.

Parameters: leaf_labels (list of str) – an optional list of strings to use for labeling the leaves
unicode_repr()

Return repr(self).

class nltk.cluster.util.VectorSpaceClusterer(normalise=False, svd_dimensions=None)[source]

Bases: nltk.cluster.api.ClusterI

Abstract clusterer which takes tokens and maps them into a vector space. Optionally performs singular value decomposition to reduce the dimensionality.
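
A hypothetical minimal subclass (the class name and fixed-means behaviour are made up for illustration), showing which methods a concrete vector-space clusterer must supply::

    import numpy
    from nltk.cluster.util import VectorSpaceClusterer

    class FixedMeansClusterer(VectorSpaceClusterer):
        """Assigns vectors to the nearest of a set of fixed means; no learning."""

        def __init__(self, means, normalise=False):
            VectorSpaceClusterer.__init__(self, normalise)
            self._means = [numpy.array(m, numpy.float64) for m in means]

        def cluster_vectorspace(self, vectors, trace=False):
            pass  # the means are fixed, so there is nothing to learn

        def classify_vectorspace(self, vector):
            distances = [numpy.sum((vector - m) ** 2) for m in self._means]
            return int(numpy.argmin(distances))

        def num_clusters(self):
            return len(self._means)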

classify(vector)[source]

Classifies the token into a cluster, setting the token’s CLUSTER parameter to that cluster identifier.

classify_vectorspace(vector)[source]

Returns the index of the appropriate cluster for the vector.

cluster(vectors, assign_clusters=False, trace=False)[source]

Assigns the vectors to clusters, learning the clustering parameters from the data. Returns a cluster identifier for each vector.

cluster_vectorspace(vectors, trace)[source]

Finds the clusters using the given set of vectors.

likelihood(vector, label)[source]

Returns the likelihood (a float) of the token having the corresponding cluster.

likelihood_vectorspace(vector, cluster)[source]

Returns the likelihood of the vector belonging to the cluster.

vector(vector)[source]

Returns the vector after normalisation and dimensionality reduction.

nltk.cluster.util.cosine_distance(u, v)[source]

Returns 1 minus the cosine of the angle between vectors v and u. This is equal to 1 - (u.v / |u||v|).

nltk.cluster.util.euclidean_distance(u, v)[source]

Returns the euclidean distance between vectors u and v. This is equivalent to the length of the vector (u - v).
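
A quick worked check of the two metrics on simple vectors::

    import numpy
    from nltk.cluster.util import cosine_distance, euclidean_distance

    u = numpy.array([1.0, 0.0])
    v = numpy.array([0.0, 1.0])
    print(cosine_distance(u, v))     # 1 - cos(90 degrees) = 1.0
    print(euclidean_distance(u, v))  # |u - v| = sqrt(2), about 1.414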

Module contents

This module contains a number of basic clustering algorithms. Clustering describes the task of discovering groups of similar items within a large collection. It is also described as unsupervised machine learning, as the data from which it learns is unannotated with class information, unlike the case of supervised learning. Annotated data is difficult and expensive to obtain in the quantities required for the majority of supervised learning algorithms. This problem, the knowledge acquisition bottleneck, is common to most natural language processing tasks, thus fueling the need for quality unsupervised approaches.

This module contains a k-means clusterer, an E-M clusterer and a group average agglomerative clusterer (GAAC). All these clusterers involve finding good cluster groupings for a set of vectors in multi-dimensional space.

The K-means clusterer starts with k arbitrarily chosen means and then allocates each vector to the cluster with the closest mean. It then recalculates the means of each cluster as the centroid of the vectors in the cluster. This process repeats until the cluster memberships stabilise. This is a hill-climbing algorithm which may converge to a local maximum. Hence the clustering is often repeated with random initial means and the most commonly occurring output means are chosen.

The GAAC clusterer starts with each of the N vectors as singleton clusters. It then iteratively merges pairs of clusters which have the closest centroids. This continues until there is only one cluster. The order of merges gives rise to a dendrogram - a tree with the earlier merges lower than later merges. The membership of a given number of clusters c, 1 <= c <= N, can be found by cutting the dendrogram at depth c.

The Gaussian EM clusterer models the vectors as being produced by a mixture of k Gaussian sources. The parameters of these sources (prior probability, mean and covariance matrix) are then found to maximise the likelihood of the given data. This is done with the expectation maximisation algorithm. It starts with k arbitrarily chosen means, priors and covariance matrices. It then calculates the membership probabilities for each vector in each of the clusters - this is the ‘E’ step. The cluster parameters are then updated in the ‘M’ step using the maximum likelihood estimate from the cluster membership probabilities. This process continues until the likelihood of the data does not significantly increase.

They all extend the ClusterI interface, which defines common operations available with each clusterer. These operations include:

  • cluster: clusters a sequence of vectors
  • classify: assign a vector to a cluster
  • classification_probdist: give the probability distribution over cluster memberships

The existing clusterers also extend nltk.cluster.util.VectorSpaceClusterer, an abstract class which allows for singular value decomposition (SVD) and vector normalisation. SVD is used to reduce the dimensionality of the vector space in such a manner as to preserve as much of the variation as possible, by reparameterising the axes in order of variability and discarding all bar the first d dimensions. Normalisation ensures that vectors fall in the unit hypersphere.
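
A minimal sketch of these options (the 3-D data points are made up for illustration)::

    import numpy
    from nltk.cluster import KMeansClusterer, euclidean_distance

    vectors = [numpy.array(f) for f in [[3, 3, 1], [1, 2, 1], [4, 2, 0], [4, 0, 0]]]
    # normalise=True projects each vector onto the unit hypersphere;
    # svd_dimensions=2 keeps only the two axes of greatest variability
    clusterer = KMeansClusterer(2, euclidean_distance,
                                normalise=True, svd_dimensions=2)
    print(clusterer.cluster(vectors, assign_clusters=True))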

Usage example (see also demo())::

    from nltk import cluster
    from nltk.cluster import euclidean_distance
    from numpy import array

    vectors = [array(f) for f in [[3, 3], [1, 2], [4, 2], [4, 0]]]

    # initialise the clusterer (will also assign the vectors to clusters)
    clusterer = cluster.KMeansClusterer(2, euclidean_distance)
    clusterer.cluster(vectors, True)

    # classify a new vector
    print(clusterer.classify(array([3, 3])))

Note that the vectors must use numpy array-like objects. nltk_contrib.unimelb.tacohn.SparseArrays may be used for efficiency when required.