Package nltk :: Package cluster :: Module kmeans :: Class KMeans

Class KMeans

api.ClusterI --+    
               |    
util.VectorSpace --+
                   |
                  KMeans

The K-means clusterer starts with k arbitrarily chosen means, then allocates each vector to the cluster with the closest mean. It then recalculates the mean of each cluster as the centroid of the vectors in the cluster. This process repeats until the cluster memberships stabilise. This is a hill-climbing algorithm which may converge to a local optimum rather than the global one. Hence the clustering is often repeated with random initial means and the most commonly occurring output means are chosen.
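
The procedure described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not NLTK's source: the names euclidean, centroid, kmeans_sketch and repeated_kmeans are hypothetical, vectors are assumed to be tuples of floats, and "most commonly occurring output means" is taken literally as a majority vote over trials.

```python
import math
import random
from collections import Counter


def euclidean(u, v):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def centroid(cluster):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(cluster)
    return tuple(sum(component) / n for component in zip(*cluster))


def kmeans_sketch(vectors, k, distance=euclidean, rng=None):
    """Run one k-means trial and return (means, assignments)."""
    rng = rng or random.Random()
    means = rng.sample(vectors, k)  # k arbitrarily chosen starting means
    assignments = None
    while True:
        # Allocate each vector to the cluster with the closest mean.
        new_assignments = [
            min(range(len(means)), key=lambda i: distance(v, means[i]))
            for v in vectors
        ]
        if new_assignments == assignments:
            return means, assignments  # memberships have stabilised
        assignments = new_assignments
        # Recalculate each mean as the centroid of its cluster;
        # a mean with no members is left where it is (an assumption).
        for i in range(len(means)):
            members = [v for v, a in zip(vectors, assignments) if a == i]
            if members:
                means[i] = centroid(members)


def repeated_kmeans(vectors, k, repeats, rng=None):
    """Repeat the trial and return the most frequently produced means."""
    rng = rng or random.Random()
    outcomes = Counter()
    for _ in range(repeats):
        means, _assignments = kmeans_sketch(vectors, k, rng=rng)
        # Sort so that the same means in a different order count as equal.
        outcomes[tuple(sorted(tuple(m) for m in means))] += 1
    return list(outcomes.most_common(1)[0][0])
```

Passing a seeded random.Random as rng makes a run reproducible, which is presumably the purpose of the rng constructor parameter documented below.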

Instance Methods
 
__init__(self, num_means, distance, repeats=1, conv_test=1e-06, initial_means=None, normalise=False, svd_dimensions=None, rng=None)
 
cluster_vectorspace(self, vectors, trace=False)
Finds the clusters using the given set of vectors.
 
_cluster_vectorspace(self, vectors, trace=False)
 
classify_vectorspace(self, vector)
Returns the index of the appropriate cluster for the vector.
 
num_clusters(self)
Returns the number of clusters.
 
means(self)
The means used for clustering.
 
_sum_distances(self, vectors1, vectors2)
 
_centroid(self, cluster)
 
__repr__(self)

Inherited from util.VectorSpace: classify, cluster, likelihood, likelihood_vectorspace, vector

Inherited from util.VectorSpace (private): _normalise

Inherited from api.ClusterI: classification_probdist, cluster_name, cluster_names

Method Details

__init__(self, num_means, distance, repeats=1, conv_test=1e-06, initial_means=None, normalise=False, svd_dimensions=None, rng=None)
(Constructor)

Parameters:
  • num_means (int) - the number of means to use (may use fewer)
  • distance (function taking two vectors and returning a float) - measure of distance between two vectors
  • repeats (int) - number of randomised clustering trials to use
  • conv_test (number) - maximum change in the means between iterations for the clustering to be deemed convergent
  • initial_means (sequence of vectors) - set of k initial means
  • normalise (boolean) - whether vectors should be normalised to length 1
  • svd_dimensions (int) - number of dimensions to use when reducing vector dimensionality with SVD
  • rng (Random) - random number generator (or None)
Overrides: util.VectorSpace.__init__
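
To make the distance and normalise parameters concrete, the sketch below shows one function with the documented shape (two vectors in, a float out) and what normalising to length 1 amounts to. Both names, cosine_distance and normalise_vector, are hypothetical helpers written for illustration; they are not claimed to be the functions NLTK ships or calls internally, and vectors are assumed to be non-zero sequences of numbers.

```python
import math


def normalise_vector(vector):
    """Scale a vector to length 1, as normalise=True is documented to request.

    Raises ZeroDivisionError for the zero vector.
    """
    length = math.sqrt(sum(x * x for x in vector))
    return tuple(x / length for x in vector)


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) as a float.

    Matches the documented shape of the distance argument:
    a function taking two vectors and returning a float.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)
```

Cosine distance ignores vector length, so it gives the same answer with or without normalisation; a length-sensitive measure such as Euclidean distance does not, which is when the normalise flag matters.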

cluster_vectorspace(self, vectors, trace=False)

Finds the clusters using the given set of vectors.

Overrides: util.VectorSpace.cluster_vectorspace
(inherited documentation)

classify_vectorspace(self, vector)

Returns the index of the appropriate cluster for the vector.

Overrides: util.VectorSpace.classify_vectorspace
(inherited documentation)
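
Going by the class description, the "appropriate" cluster is presumably the one whose mean is closest to the vector. A minimal sketch under that assumption follows; classify_sketch and manhattan are hypothetical names, and the means and distance function are passed in explicitly rather than read from a clusterer instance.

```python
def manhattan(u, v):
    """Sum of absolute coordinate differences (an example distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def classify_sketch(vector, means, distance):
    """Return the index of the mean closest to `vector`.

    Ties go to the lowest index, because min() keeps the first minimum.
    """
    return min(range(len(means)), key=lambda i: distance(vector, means[i]))
```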

num_clusters(self)

Returns the number of clusters.

Overrides: api.ClusterI.num_clusters
(inherited documentation)