GraphLab: Distributed Graph-Parallel API  2.1
Clustering

This toolkit contains implementations of clustering algorithms.

Currently, the following algorithms are implemented:

KMeans++

The kmeans program implements the KMeans++ algorithm described by

Arthur, D. and Vassilvitskii, S. (2007). "k-means++: the advantages of careful seeding". Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. pp. 1027–1035.
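The seeding strategy from that paper can be sketched as follows. This is a minimal Python illustration of the general k-means++ idea, not the toolkit's C++ implementation; the function name is hypothetical:

```python
import random

def kmeans_pp_seed(points, k):
    """Pick k initial centers: the first uniformly at random, each
    subsequent one with probability proportional to the squared
    distance from the nearest center chosen so far."""
    centers = [random.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(sum((a - b) ** 2 for a, b in zip(x, ctr))
                  for ctr in centers)
              for x in points]
        # Sample a point with probability proportional to d2.
        r = random.uniform(0, sum(d2))
        acc = 0.0
        for x, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(x)
                break
    return centers
```

Because far-away points are more likely to be chosen, the initial centers tend to be well spread out, which is what gives k-means++ its approximation guarantee.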

It takes as input a collection of files where each line in each file represents a data point. Each line must contain a list of numbers, white-space or comma separated, and every line must have the same number of values.

For instance in this example input file, there are 6 datapoints, one per line, and each datapoint is a point in a 5-dimensional space.

-10.7551  6.82178 5.33455 -2.08247  2.86694 
-1.36687  10.8464 -5.28851  -4.26768  -5.50659  
-8.79834  8.01002 5.33418 0.102824  3.23318 
-8.64345  6.81946 1.2309  -4.46784  2.26341 
-8.29782  7.1154  3.32559 -2.59422  2.33936 
-8.12504  8.98924 4.15027 0.253153  1.75911 
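A line in this format can be parsed with a short sketch like the following (illustrative only; the toolkit does its own parsing internally):

```python
import re

def parse_dense_line(line):
    """Split a line on whitespace and/or commas and return the
    values as a list of floats."""
    tokens = [t for t in re.split(r"[,\s]+", line.strip()) if t]
    return [float(t) for t in tokens]
```

Both separators parse to the same datapoint, so `"-10.7551 6.82178"` and `"-10.7551,6.82178"` are equivalent.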

Synthetic Data

Example synthetic data can be generated by running

> ./generate_synthetic [Number of Clusters] [Number of Dimensions] [Number of datapoints]

This will generate a file called synthetic.txt, and will also print the cluster centers to the screen.

For instance:

> ./generate_synthetic 2 3 10
Usage: generate_synthetic [NClusters] [Dimensions] [Ndata]
Center 0 at: -6.69675 0.355189  -4.88601  
Center 1 at: 5.85919  0.0388327 5.50007  

> cat synthetic.txt
-4.31568  -0.396959 -6.29507  
-4.56112  -1.74917  -4.57874  
4.54508 0.102845  6.35385 
4.87746 -0.832591 7.06942 
-5.91254  -0.278006 -4.25934  
6.95139 0.120139  4.89531 
-6.28538  -0.88527  -4.74988  
-6.84791  0.887664  -4.91919  
7.47117 1.67911 6.02221 
-4.78011  1.2099  -4.55519  
Note:
Mathematically, the synthetic generator draws the centers uniformly from the [-10,10]^D hypercube, then generates each data point by picking one of the centers uniformly at random and sampling from a unit Gaussian centered on it.
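That generative process can be sketched as follows (a hypothetical Python rendering; the actual generate_synthetic is a separate C++ program):

```python
import random

def generate_synthetic(n_clusters, dims, n_data):
    """Draw centers uniformly from [-10, 10]^D, then for each
    datapoint pick a center uniformly at random and add
    unit-variance Gaussian noise in every dimension."""
    centers = [[random.uniform(-10, 10) for _ in range(dims)]
               for _ in range(n_clusters)]
    data = []
    for _ in range(n_data):
        c = random.choice(centers)
        data.append([x + random.gauss(0, 1) for x in c])
    return centers, data
```

With well-separated draws, points cluster tightly (standard deviation 1) around each center, which makes the data easy for kmeans to recover.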

Running KMeans

To run Kmeans clustering:

> ./kmeans --data=[data prefix] --clusters=[N cluster] --output-clusters=[cluster output file]

All files beginning with the data prefix will be loaded. The data may be on HDFS. The --clusters option specifies the number of clusters to solve for.

The cluster output file must be a target accessible on the local file system. (In the distributed case, it must be accessible from the 0th machine). This file will contain a list of all the solved cluster centers.

For instance, running on the synthetic data above:

>./kmeans --data=synthetic.txt --clusters=2 --output-clusters=cluster.txt
Number of datapoints: 10
Validating data...Initializing using Kmeans++
Running Kmeans...
Kmeans iteration 1: # points with changed assignments = 0
Writing Cluster Centers...

>cat cluster.txt
-5.45046  -0.201973 -4.8929 
5.96127 0.267376  6.0852  
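The per-iteration log line above counts how many points changed cluster on that pass. One Lloyd iteration, the update kmeans repeats until assignments stop changing, can be sketched like this (an illustrative Python version, not the distributed C++ implementation):

```python
def kmeans_iteration(points, centers, assignments):
    """One Lloyd iteration: reassign each point to its nearest
    center, count changed assignments, then move each center to
    the mean of its assigned points."""
    changed = 0
    new_assign = []
    for i, x in enumerate(points):
        best = min(range(len(centers)),
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(x, centers[c])))
        if assignments[i] != best:
            changed += 1
        new_assign.append(best)
    new_centers = []
    for c in range(len(centers)):
        members = [x for x, a in zip(points, new_assign) if a == c]
        if members:
            new_centers.append([sum(col) / len(members)
                                for col in zip(*members)])
        else:
            new_centers.append(centers[c])  # keep empty clusters fixed
    return new_centers, new_assign, changed
```

When `changed` reaches 0, as in the run above after one iteration, the algorithm has converged.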

To also output the cluster center assignments for each datapoint, add the option:

--output-data=[output prefix]

The output prefix is where the output data will be written. This may be located on HDFS. For instance, if the output prefix is "v_out", the output files will be written to:

v_out_1_of_16
v_out_2_of_16
...
v_out_16_of_16

Each line in the output files contains the original data point, followed by the cluster number it was assigned to. The lines need not be in the same order as the original input.

For instance, running kmeans on the example synthetic data above may produce the following output:

>./kmeans --data=synthetic.txt --clusters=2 --output-clusters=cluster.txt --output-data=data.txt
Number of datapoints: 10
Validating data...Initializing using Kmeans++
Running Kmeans...
Kmeans iteration 1: # points with changed assignments = 0
Writing Cluster Centers...
Writing Data with cluster assignments...

>cat data.txt
-4.78011  1.2099  -4.55519  0
7.47117 1.67911 6.02221 1
-6.84791  0.887664  -4.91919  0
-6.28538  -0.88527  -4.74988  0
6.95139 0.120139  4.89531 1
-5.91254  -0.278006 -4.25934  0
4.87746 -0.832591 7.06942 1
4.54508 0.102845  6.35385 1
-4.56112  -1.74917  -4.57874  0
-4.31568  -0.396959 -6.29507  0

This program can also be run distributed by using

> mpiexec -n [N machines] --hostfile [host file] ./kmeans ....

See your MPI documentation for details on how to launch this job. All machines must have access to the input data location and the output location. The data may be on HDFS. If you have problems loading HDFS files, see the FAQ.

--sparse Option

If --sparse=1 is set (the default is 0), the program will use a sparse vector representation. The file format is [feature id]:[value] [feature id]:[value] ... where each line corresponds to a datapoint. Each [feature id] must be a non-negative integer.

For instance:

> cat synthetic_sparse.txt
0:-4.31568  3:-0.396959 5:-6.29507  
5:-4.56112  9:-1.74917  
4:4.54508 5:0.102845  10:6.35385 
0:4.87746 7:-0.832591 
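A sparse line in this format can be parsed into an id-to-value map; features that are absent from a line are implicitly zero. A minimal sketch (hypothetical function name, not the toolkit's parser):

```python
def parse_sparse_line(line):
    """Parse '[feature id]:[value]' pairs into a {id: value} dict.
    Feature ids are non-negative integers; ids that do not appear
    on the line are treated as zero."""
    point = {}
    for token in line.split():
        fid, value = token.split(":")
        point[int(fid)] = float(value)
    return point
```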

--id Option

If --id=1 is set (the default is 0), the program will expect an id for each data point. The id of a data point must be written at the head of each line of the input data.

For instance:

> cat synthetic_with_id.txt
1 -4.31568  -0.396959 -6.29507  
2 -4.56112  -1.74917  -4.57874  
3 4.54508 0.102845  6.35385 
4 4.87746 -0.832591 7.06942 

The output data will consist of two columns: one for the ids and the other for the assigned clusters.

For instance:

> cat data_with_id.txt
1 0
2 0  
3 1
4 1 

--id can be used together with --sparse.

Adding Pairwise Reward

This program can consider pairwise rewards (and penalties) by using --pairwise-reward=[file prefix]. All files beginning with [file prefix] will be loaded as pairwise rewards. Each line of a pairwise reward file holds [id1] [id2] [weight]. This option must be used with --id=1.

For instance:

> cat pairwise_data.txt
1 2 10
2 4 -10

In this example the evaluation function gains 10 when data points 1 and 2 are in the same cluster, and loses 10 when data points 2 and 4 are in the same cluster; pairs not listed in the file contribute nothing.
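The contribution of the pairwise terms to the objective can be sketched as a simple sum over the listed pairs (an illustration of the scoring described above; names are hypothetical):

```python
def pairwise_reward(assignments, rewards):
    """Sum the weight of every pair whose two points share a
    cluster. assignments maps id -> cluster; rewards is a list of
    (id1, id2, weight) tuples. Negative weights act as penalties."""
    total = 0.0
    for id1, id2, w in rewards:
        if assignments[id1] == assignments[id2]:
            total += w
    return total
```

With the example file, putting 1 and 2 together while keeping 2 and 4 apart yields the maximum pairwise contribution of 10.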

Options

  • --data (Required). The prefix from which to load the input data.
  • --clusters (Required). The number of cluster centers.
  • --ncpus (Optional. Default 2). The number of processors that will be used for computation. Due to some implementation limitations within GraphLab, this parameter is not respected: it will use all processors on your machine if run on Linux, and only 1 processor if run on Mac.
  • --output-data (Optional). A target prefix to write the output data with cluster assignments. May be on HDFS.
  • --output-clusters (Optional). A target location to write the cluster centers. Must be on the local file system.
  • --sparse (Optional. Default 0). If set to 1, will use a sparse vector representation.
  • --id (Optional. Default 0). If set to 1, will use ids for data points.
  • --pairwise-reward (Optional). If set, will consider the pairwise rewards written in the files beginning with the given argument.
  • --max-iteration (Optional). The maximum number of iterations.

Spectral Clustering

This program takes as input a collection of files where each line in each file represents a data point. Each line must contain an id and a list of numbers, white-space or comma separated, and every line must have the same number of values. For instance, in this example input file there are 6 datapoints, one per line, and each datapoint is a point in a 5-dimensional space.

1 -10.7551  6.82178 5.33455 -2.08247  2.86694
2 -1.36687  10.8464 -5.28851  -4.26768  -5.50659
3 -8.79834  8.01002 5.33418 0.102824  3.23318 
4 -8.64345  6.81946 1.2309  -4.46784  2.26341 
5 -8.29782  7.1154  3.32559 -2.59422  2.33936
6 -8.12504  8.98924 4.15027 0.253153  1.75911 

The ids of data points must start from 1 and must not skip any numbers.

To run spectral clustering, the minimal set of options required are:

> ./spectral_clustering --data=[data prefix] --clusters=[N cluster]

This program uses svd from the GraphLab Collaborative Filtering Toolkit, kmeans from the GraphLab Clustering Toolkit, and eigen_vector_normalization from the GraphLab Graph Analytics Toolkit. The paths to the directories are specified by --svd-dir, --kmeans-dir and --graph-analytics-dir, respectively.

The program will create some intermediate files. The final clustering result is written in files named [data prefix].result with a suffix, for example [data prefix].result_1_of_4. The clustering result data will consist of two columns: one for the ids and the other for the assigned clusters. For instance:

1 1
2 0
3 1
4 1
5 1

Options

Relevant options are:

  • --data (Required). The prefix from which to load the input data.
  • --clusters (Required). The number of clusters.
  • --sigma (Optional. Default 0.1). Scale parameter for the Gaussian kernel. This value is often critical to the clustering result.
  • --t-nearest (Optional. Default 20). Number of nearest neighbors (=t). Will use only the t nearest similarities for each datapoint. If set to 0, will use all similarities.
  • --similarity-thres (Optional). Threshold to discard small similarities. If a value is set, similarities less than this value will be discarded.
  • --svd-dir (Optional. Default ../collaborative_filtering/). Path to the directory where GraphLab svd is located.
  • --kmeans-dir (Optional. Default ../clustering/). Path to the directory where GraphLab kmeans is located.
  • --graph-analytics-dir (Optional. Default ../graph_analytics/). Path to the directory where GraphLab eigen_vector_normalization is located.
  • --ncpus (Optional. Default 2). The number of processors that will be used for computation. Due to some implementation limitations within GraphLab, this parameter is not respected: it will use all processors on your machine if run on Linux, and only 1 processor if run on Mac.
  • --graph_opts (Optional. Default empty). Any additional graph options. See graphlab::distributed_graph for a list of options.
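The similarity construction controlled by --sigma, --t-nearest and --similarity-thres can be sketched as follows. This assumes the common spectral-clustering convention s(i,j) = exp(-||x_i - x_j||^2 / (2*sigma^2)); the document only calls sigma a scale parameter, so the exact placement of sigma in the exponent is an assumption, and the function is illustrative rather than the toolkit's implementation:

```python
import math

def gaussian_similarities(points, sigma, t_nearest=0, thres=None):
    """Compute pairwise Gaussian-kernel similarities, optionally
    keeping only each point's t nearest (most similar) neighbors
    and discarding similarities below a threshold."""
    n = len(points)
    sim = {}
    for i in range(n):
        row = []
        for j in range(n):
            if i == j:
                continue
            d2 = sum((a - b) ** 2
                     for a, b in zip(points[i], points[j]))
            row.append((j, math.exp(-d2 / (2 * sigma ** 2))))
        # Larger similarity means nearer neighbor.
        row.sort(key=lambda p: p[1], reverse=True)
        if t_nearest > 0:
            row = row[:t_nearest]
        for j, s in row:
            if thres is None or s >= thres:
                sim[(i, j)] = s
    return sim
```

Small sigma makes similarities fall off sharply with distance, which is why the choice of --sigma is often critical to the result.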