GraphLab: Distributed Graph-Parallel API 2.1
This toolkit contains implementations of clustering algorithms. Currently the implemented algorithms are KMeans++ and spectral clustering.
The kmeans program implements the KMeans++ algorithm described by
Arthur, D. and Vassilvitskii, S. (2007). "k-means++: the advantages of careful seeding". Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. pp. 1027–1035.
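For reference, the seeding procedure from that paper can be sketched in a few lines of Python. This is an illustrative sketch, not the toolkit's implementation; the helper name kmeanspp_seed is made up for this example.

```python
import random

def kmeanspp_seed(points, k, rng=None):
    """Pick k initial centers KMeans++-style: the first uniformly at
    random, each later one with probability proportional to the squared
    distance from a point to its nearest already-chosen center."""
    rng = rng or random.Random(0)
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
              for p in points]
        total = sum(d2)
        if total == 0:  # all points coincide with existing centers
            centers.append(list(rng.choice(points)))
            continue
        r = rng.uniform(0, total)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(list(p))
                break
    return centers
```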
It takes as input a collection of files, where each line in each file represents a data point. Each line must contain a list of numbers, white-space or comma separated, and every line must have the same number of values.
For instance, in this example input file there are 6 datapoints, one per line, and each datapoint is a point in a 5-dimensional space.
-10.7551 6.82178 5.33455 -2.08247 2.86694
-1.36687 10.8464 -5.28851 -4.26768 -5.50659
-8.79834 8.01002 5.33418 0.102824 3.23318
-8.64345 6.81946 1.2309 -4.46784 2.26341
-8.29782 7.1154 3.32559 -2.59422 2.33936
-8.12504 8.98924 4.15027 0.253153 1.75911
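Because values may be separated by whitespace or commas, a small reader that parses and validates this format might look like the following Python sketch (the load_dense helper is hypothetical, not part of the toolkit):

```python
def load_dense(lines):
    """Parse dense datapoints: one per line, numbers separated by
    whitespace and/or commas; every line must have the same length."""
    points = []
    for line in lines:
        fields = line.replace(",", " ").split()
        if not fields:
            continue  # skip blank lines
        points.append([float(x) for x in fields])
    dims = {len(p) for p in points}
    if len(dims) > 1:
        raise ValueError("inconsistent dimensionality: %s" % sorted(dims))
    return points
```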
Example synthetic data can be generated by running
> ./generate_synthetic [Number of Clusters] [Number of Dimensions] [Number of datapoints]
This will generate a file called synthetic.txt, and will also print the cluster centers to the screen.
For instance:
> ./generate_synthetic 2 3 10
Usage: generate_synthetic [NClusters] [Dimensions] [Ndata]
Center 0 at: -6.69675 0.355189 -4.88601
Center 1 at: 5.85919 0.0388327 5.50007
> cat synthetic.txt
-4.31568 -0.396959 -6.29507
-4.56112 -1.74917 -4.57874
4.54508 0.102845 6.35385
4.87746 -0.832591 7.06942
-5.91254 -0.278006 -4.25934
6.95139 0.120139 4.89531
-6.28538 -0.88527 -4.74988
-6.84791 0.887664 -4.91919
7.47117 1.67911 6.02221
-4.78011 1.2099 -4.55519
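The same kind of synthetic data can be produced outside the toolkit for quick experiments. The sketch below only approximates what generate_synthetic might do; the spread and noise parameters are assumptions for this example, not the tool's actual settings.

```python
import random

def generate_synthetic(n_clusters, dims, n_data, spread=10.0, noise=1.0,
                       rng=None):
    """Draw cluster centers uniformly in [-spread, spread]^dims, then
    draw each datapoint from a Gaussian around a randomly chosen center."""
    rng = rng or random.Random(0)
    centers = [[rng.uniform(-spread, spread) for _ in range(dims)]
               for _ in range(n_clusters)]
    data = []
    for _ in range(n_data):
        c = rng.choice(centers)
        data.append([x + rng.gauss(0.0, noise) for x in c])
    return centers, data
```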
To run Kmeans clustering:
> ./kmeans --data=[data prefix] --clusters=[N clusters] --output-clusters=[cluster output file]
All files beginning with the data prefix will be loaded; the data may be on HDFS. The --clusters option is the number of clusters to solve for.
The cluster output file must be a target accessible on the local file system (in the distributed case, it must be accessible from the 0th machine). This file will contain a list of all the solved cluster centers.
For instance, running on the synthetic data above:
> ./kmeans --data=synthetic.txt --clusters=2 --output-clusters=cluster.txt
Number of datapoints: 10
Validating data...
Initializing using Kmeans++
Running Kmeans...
Kmeans iteration 1: # points with changed assignments = 0
Writing Cluster Centers...
> cat cluster.txt
-5.45046 -0.201973 -4.8929
5.96127 0.267376 6.0852
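Given the centers written to cluster.txt, assigning a point to its nearest center is straightforward. The assign helper below is a hypothetical illustration, using the two centers from the run above:

```python
def assign(point, centers):
    """Return the index of the nearest center (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(point, centers[i])))
```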
To also output the cluster center assignments for each datapoint, add the option:
--output-data=[output prefix]
The output prefix is where the output data will be written; it may be located on HDFS. For instance, if the output prefix is "v_out", the output files will be written to:
v_out_1_of_16
v_out_2_of_16
...
v_out_16_of_16
Each line in the output files contains, first, the original data point, followed by the cluster number it was assigned to. The lines need not be in the same order as the original input.
For instance, running kmeans on the example synthetic data above may produce the following output:
> ./kmeans --data=synthetic.txt --clusters=2 --output-clusters=cluster.txt --output-data=data.txt
Number of datapoints: 10
Validating data...
Initializing using Kmeans++
Running Kmeans...
Kmeans iteration 1: # points with changed assignments = 0
Writing Cluster Centers...
Writing Data with cluster assignments...
> cat data.txt
-4.78011 1.2099 -4.55519 0
7.47117 1.67911 6.02221 1
-6.84791 0.887664 -4.91919 0
-6.28538 -0.88527 -4.74988 0
6.95139 0.120139 4.89531 1
-5.91254 -0.278006 -4.25934 0
4.87746 -0.832591 7.06942 1
4.54508 0.102845 6.35385 1
-4.56112 -1.74917 -4.57874 0
-4.31568 -0.396959 -6.29507 0
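Since the cluster number is the last field on each output line, the files are easy to post-process. For example, counting points per cluster (the cluster_sizes helper is hypothetical, not part of the toolkit):

```python
from collections import Counter

def cluster_sizes(lines):
    """Count assignments from --output-data lines, where the last
    whitespace-separated field on each line is the cluster id."""
    return Counter(int(line.split()[-1]) for line in lines if line.split())
```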
This program can also be run distributed by using
> mpiexec -n [N machines] --hostfile [host file] ./kmeans ....
See your MPI documentation for details on how to launch this job. All machines must have access to the input data location and the output location. Data may be on HDFS. If you have problems loading HDFS files, see the FAQ.
If --sparse=1 is set (the default is 0), the program will use a sparse vector representation. The file format is [feature id]:[value] [feature id]:[value] ..., where each line corresponds to a datapoint. [feature id] must be a non-negative integer.
For instance:
> cat synthetic_sparse.txt
0:-4.31568 3:-0.396959 5:-6.29507
5:-4.56112 9:-1.74917
4:4.54508 5:0.102845 10:6.35385
0:4.87746 7:-0.832591
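A sparse line can be expanded back into a dense vector by splitting each [feature id]:[value] token. The parse_sparse helper below is a hypothetical sketch; it assumes features missing from a line default to 0.0.

```python
def parse_sparse(line, dims=None):
    """Turn a line like '0:-4.3 3:1.2' into a dense list of floats.
    Feature ids are zero-based; missing features default to 0.0."""
    pairs = [(int(f), float(v))
             for f, v in (tok.split(":") for tok in line.split())]
    size = dims if dims is not None else max(f for f, _ in pairs) + 1
    dense = [0.0] * size
    for f, v in pairs:
        dense[f] = v
    return dense
```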
If --id=1 is set (the default is 0), the program will use an id for each data point. The id of a data point must be written at the head of each line of the input data.
For instance:
> cat synthetic_with_id.txt
1 -4.31568 -0.396959 -6.29507
2 -4.56112 -1.74917 -4.57874
3 4.54508 0.102845 6.35385
4 4.87746 -0.832591 7.06942
The output data will consist of two columns: one for the ids and the other for the assigned clusters.
For instance:
> cat data_with_id.txt
1 0
2 0
3 1
4 1
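This two-column format is easy to load back into an id-to-cluster mapping, e.g. with a small helper like read_assignments (hypothetical, not part of the toolkit):

```python
def read_assignments(lines):
    """Parse '[id] [cluster]' output lines into a dict {id: cluster}."""
    out = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            out[int(parts[0])] = int(parts[1])
    return out
```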
--id can be used together with --sparse.
This program can also consider pairwise rewards (and penalties) by using --pairwise-reward=[file prefix]. All files beginning with [file prefix] will be loaded as pairwise rewards. Each line of the pairwise reward file holds [id1] [id2] [weight]. This option must be used with --id=1.
For instance:
> cat pairwise_data.txt
1 2 10
2 4 -10
In this example, the evaluation function gains 10 when data point 1 and data point 2 are in the same cluster, and is penalized 10 when data point 2 and data point 4 are in the same cluster; pairs in different clusters contribute nothing.
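Given a mapping from ids to clusters, the contribution of the pairwise rewards can be computed directly. The pairwise_score helper below is a hypothetical sketch of that evaluation, not the toolkit's code:

```python
def pairwise_score(rewards, assignment):
    """Sum the weights of pairs whose two points share a cluster.
    rewards: iterable of (id1, id2, weight); assignment: {id: cluster}."""
    return sum(w for a, b, w in rewards
               if assignment.get(a) is not None
               and assignment.get(a) == assignment.get(b))
```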
The spectral_clustering program takes as input a collection of files where each line in each file represents a data point. Each line must contain an id and a list of numbers, white-space or comma separated, and every line must have the same number of values. For instance, in this example input file there are 6 datapoints, one per line, and each datapoint is a point in a 5-dimensional space.
1 -10.7551 6.82178 5.33455 -2.08247 2.86694
2 -1.36687 10.8464 -5.28851 -4.26768 -5.50659
3 -8.79834 8.01002 5.33418 0.102824 3.23318
4 -8.64345 6.81946 1.2309 -4.46784 2.26341
5 -8.29782 7.1154 3.32559 -2.59422 2.33936
6 -8.12504 8.98924 4.15027 0.253153 1.75911
The ids of data points must start from 1 and must not skip any numbers.
To run spectral clustering, the minimal set of options required are:
> ./spectral_clustering --data=[data prefix] --clusters=[N cluster]
This program uses svd from the GraphLab Collaborative Filtering Toolkit, kmeans from the GraphLab Clustering Toolkit, and eigen_vector_normalization from the GraphLab Graph Analytics Toolkit. The paths to these directories are specified by --svd-dir, --kmeans-dir, and --graph-analytics-dir, respectively.
The program will create some intermediate files. The final clustering result is written to files named [data prefix].result with a suffix, for example [data prefix].result_1_of_4. The clustering result will consist of two columns: one for the ids and the other for the assigned clusters. For instance:
1 1
2 0
3 1
4 1
5 1
Relevant options are: