
Source Code for Package nltk.cluster

# Natural Language Toolkit: Clusterers
#
# Copyright (C) 2001-2008 NLTK Project
# Author: Trevor Cohn <[email protected]>
# Porting: Steven Bird <[email protected]>
# URL: <http://nltk.org>
# For license information, see LICENSE.TXT

"""
This module contains a number of basic clustering algorithms. Clustering
describes the task of discovering groups of similar items within a large
collection. It is also described as unsupervised machine learning, as the data
from which it learns is unannotated with class information, in contrast to
supervised learning. Annotated data is difficult and expensive to obtain in
the quantities required for the majority of supervised learning algorithms.
This problem, the knowledge acquisition bottleneck, is common to most natural
language processing tasks, thus fueling the need for quality unsupervised
approaches.

This module contains a k-means clusterer, an EM clusterer and a group average
agglomerative clusterer (GAAC). All of these clusterers involve finding good
cluster groupings for a set of vectors in multi-dimensional space.

The K-means clusterer starts with k arbitrarily chosen means, then allocates
each vector to the cluster with the closest mean. It then recalculates the
mean of each cluster as the centroid of the vectors in the cluster. This
process repeats until the cluster memberships stabilise. This is a
hill-climbing algorithm which may converge to a local optimum. Hence the
clustering is often repeated with random initial means and the most commonly
occurring output means are chosen.

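For illustration, the loop just described can be sketched in a few lines of
plain numpy (a simplified stand-in, not the KMeans implementation in this
module)::

    import random
    from numpy import mean

    def kmeans_sketch(vectors, k, distance, max_iter=100):
        means = random.sample(list(vectors), k)   # k arbitrarily chosen means
        for _ in range(max_iter):
            clusters = [[] for _ in range(k)]
            for v in vectors:                     # allocate to closest mean
                i = min(range(k), key=lambda j: distance(v, means[j]))
                clusters[i].append(v)
            # recalculate each mean as the centroid of its cluster
            new_means = [mean(c, axis=0) if c else m
                         for c, m in zip(clusters, means)]
            if all((n == m).all() for n, m in zip(new_means, means)):
                break                             # memberships have stabilised
            means = new_means
        return means
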
The GAAC clusterer starts with each of the M{N} vectors as singleton clusters.
It then iteratively merges pairs of clusters which have the closest centroids.
This continues until there is only one cluster. The order of merges gives rise
to a dendrogram: a tree with the earlier merges lower than later merges. The
membership of a given number of clusters M{c}, M{1 <= c <= N}, can be found by
cutting the dendrogram at depth M{c}.

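The merge loop can be sketched as follows (an illustrative helper following
the description above, not the implementation in this module)::

    from numpy import mean

    def agglomerate_sketch(vectors, distance):
        clusters = [[v] for v in vectors]         # N singleton clusters
        merges = []                               # records the dendrogram
        while len(clusters) > 1:
            # find the pair of clusters with the closest centroids
            pairs = [(i, j) for i in range(len(clusters))
                     for j in range(i + 1, len(clusters))]
            i, j = min(pairs, key=lambda p: distance(
                mean(clusters[p[0]], axis=0), mean(clusters[p[1]], axis=0)))
            merges.append((clusters[i], clusters[j]))
            clusters[i] = clusters[i] + clusters[j]
            del clusters[j]
        return merges
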
The Gaussian EM clusterer models the vectors as being produced by a mixture
of k Gaussian sources. The parameters of these sources (prior probability,
mean and covariance matrix) are then found to maximise the likelihood of the
given data. This is done with the expectation maximisation algorithm. It
starts with k arbitrarily chosen means, priors and covariance matrices. It
then calculates the membership probabilities for each vector in each of the
clusters - this is the 'E' step. The cluster parameters are then updated in
the 'M' step using the maximum likelihood estimate from the cluster membership
probabilities. This process continues until the likelihood of the data does
not significantly increase.

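The 'E' step amounts to Bayes' rule over the mixture components. A simplified
sketch with spherical Gaussians of fixed variance (the clusterer in this
module estimates full covariance matrices)::

    from numpy import dot, exp

    def e_step(vectors, priors, means, var):
        responsibilities = []
        for v in vectors:
            # unnormalised posterior: prior times Gaussian density
            p = [pr * exp(-dot(v - m, v - m) / (2 * var))
                 for pr, m in zip(priors, means)]
            total = sum(p)
            responsibilities.append([x / total for x in p])
        return responsibilities
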
They all extend the ClusterI interface, which defines the common operations
available with each clusterer. These operations include:
   - cluster: clusters a sequence of vectors
   - classify: assigns a vector to a cluster
   - classification_probdist: gives the probability distribution over cluster memberships

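For a probabilistic clusterer such as the EM clusterer, the last of these can
be used to inspect the distribution over clusters for a vector (a sketch,
assuming a trained clusterer)::

    pdist = clusterer.classification_probdist(array([3, 3]))
    for c in range(clusterer.num_clusters()):
        print(c, pdist.prob(c))
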
The existing clusterers also extend cluster.VectorSpace, an abstract class
which allows for singular value decomposition (SVD) and vector normalisation.
SVD is used to reduce the dimensionality of the vector space in such a manner
as to preserve as much of the variation as possible, by reparameterising the
axes in order of variability and discarding all bar the first d dimensions.
Normalisation ensures that vectors fall in the unit hypersphere.

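Both behaviours are requested when constructing a clusterer. A hypothetical
illustration (the keyword names here are assumed from the VectorSpace base
class; see api.py for the actual signature)::

    clusterer = cluster.KMeans(2, euclidean_distance,
                               normalise=True, svd_dimensions=2)
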
Usage example (see also demo())::
    from numpy import array
    from nltk import cluster
    from nltk.cluster import euclidean_distance

    vectors = [array(f) for f in [[3, 3], [1, 2], [4, 2], [4, 0]]]

    # initialise the clusterer (will also assign the vectors to clusters)
    clusterer = cluster.KMeans(2, euclidean_distance)
    clusterer.cluster(vectors, True)

    # classify a new vector
    print(clusterer.classify(array([3, 3])))

Note that the vectors must use numpy array-like
objects. nltk_contrib.unimelb.tacohn.SparseArrays may be used for
efficiency when required.
"""

from api import *
from util import *
from kmeans import *
from gaac import *
from em import *