GraphLab: Distributed Graph-Parallel API
The topic modeling toolkit contains a collection of applications targeted at clustering documents and extracting topical representations. The resulting topical representation can be used as a feature space in information retrieval tasks and to group topically related words and documents.
Currently the topic modeling toolkit implements a fast asynchronous collapsed Gibbs sampler for the widely used Latent Dirichlet Allocation (LDA) model. In the near future we plan to add a collapsed variational Bayesian inference algorithm for the LDA model as well as some more general topic models.
The LDA model associates a topic id with each token (word occurrence) in each document in the input corpus. Conceptually, topic ids correspond to semantic groups like "foods", "colors", and "politics"; however, the association between the ids 1, 2, ..., N and the particular topic meanings "foods", "colors", ... is not known in advance and is resolved by running the approximate inference algorithm. In addition, the LDA model assigns a distribution over topics to each document and a distribution over terms to each topic. The topic id for each token is drawn from the topic distribution for its document, and the actual word is then drawn from the term distribution for that topic. At a high level, the LDA model encodes the intuitive assumptions that each document mixes a small number of topics and that each topic favors a coherent set of words.
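The generative story above can be made concrete with a short sketch. This is purely illustrative Python (not GraphLab code); the corpus sizes and the Dirichlet hyperparameters `alpha` and `beta` are arbitrary values chosen for the example.

```python
import numpy as np

def generate_corpus(n_docs=5, n_words=20, n_topics=3, doc_len=50,
                    alpha=0.5, beta=0.1, seed=0):
    """Illustrative sketch of the LDA generative process."""
    rng = np.random.default_rng(seed)
    # Each topic is a distribution over the vocabulary.
    phi = rng.dirichlet(np.full(n_words, beta), size=n_topics)
    docs = []
    for _ in range(n_docs):
        # Each document has its own distribution over topics.
        theta = rng.dirichlet(np.full(n_topics, alpha))
        doc = []
        for _ in range(doc_len):
            z = rng.choice(n_topics, p=theta)   # draw a topic id for this token
            w = rng.choice(n_words, p=phi[z])   # draw the word from that topic
            doc.append((w, z))
        docs.append(doc)
    return docs, phi

docs, phi = generate_corpus()
```

Inference runs this story in reverse: given only the words `w`, it recovers plausible topic assignments `z` and the distributions `theta` and `phi`.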
Solving for the latent topic assignment of each token, the topic distribution for each document, and the term distribution for each topic is a challenging (NP-hard) task. Fortunately, there are several approximate inference algorithms that can typically recover coherent posterior estimates for the LDA model.
The topic modeling toolkit currently implements an asynchronous variant of the collapsed Gibbs sampler described by Griffiths and Steyvers in their landmark paper Finding Scientific Topics. The collapsed Gibbs sampler is a Markov Chain Monte Carlo (MCMC) algorithm that generates a sequence of topic assignments for each token which, in the limit, converges to samples drawn from the posterior distribution. In practice the algorithm is run for a sufficiently long time to allow the topics to "converge" (sometimes referred to as burn-in), and then the last few samples are used to estimate the posterior distribution over topics for each document and the posterior distribution over words for each topic.
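The core of the collapsed Gibbs sampler is a simple per-token update: remove the token's current assignment from the count statistics, draw a new topic from the conditional distribution, and add it back. The following is a sequential single-machine sketch in Python, not the GraphLab implementation; the toy corpus and the hyperparameters are made up for illustration.

```python
import numpy as np

def gibbs_pass(tokens, z, n_dk, n_kw, n_k, alpha, beta, rng):
    """One sweep of collapsed Gibbs sampling over all tokens (sequential sketch)."""
    n_topics, n_words = n_kw.shape
    for i, (d, w) in enumerate(tokens):
        k = z[i]
        # Remove this token's current assignment from the sufficient statistics.
        n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
        # P(z_i = k | rest) is proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
        p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + n_words * beta)
        k = rng.choice(n_topics, p=p / p.sum())
        # Record and add back the new assignment.
        z[i] = k
        n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

# Toy corpus: one (doc id, word id) pair per token, counts built from a random init.
rng = np.random.default_rng(0)
tokens = [(0, 0), (0, 0), (0, 4), (1, 2), (1, 2), (1, 2)]
n_docs, n_words, n_topics = 2, 5, 3
z = rng.integers(n_topics, size=len(tokens))
n_dk = np.zeros((n_docs, n_topics), int)   # per-document topic counts
n_kw = np.zeros((n_topics, n_words), int)  # per-topic word counts
n_k = np.zeros(n_topics, int)              # per-topic totals
for i, (d, w) in enumerate(tokens):
    n_dk[d, z[i]] += 1; n_kw[z[i], w] += 1; n_k[z[i]] += 1

for _ in range(100):  # burn-in sweeps
    gibbs_pass(tokens, z, n_dk, n_kw, n_k, alpha=0.1, beta=0.01, rng=rng)
```

After burn-in, the count matrices `n_dk` and `n_kw` are what the posterior estimates are read from.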
The parallelization of the collapsed Gibbs sampler is achieved by drawing new assignments for multiple tokens simultaneously, using a method similar to that described by Ahmed et al. (Paper). Unfortunately, collapsing the LDA model, which accelerates mixing of the Gibbs sampler, also eliminates the conditional independence structure needed to obtain a parallel ergodic sampler as described by Gonzalez et al. (Paper)
However, by mapping the collapsed Gibbs sampler into the GraphLab abstraction we obtain a statistically more efficient algorithm. To implement the collapsed Gibbs sampler in GraphLab we construct a bipartite graph connecting each document with the terms that occur in that document. Each edge stores the token count and the latent topic assignments for those tokens. The GraphLab update function maintains the term and document counts during the gather and apply phases and then samples new values for the tokens in the scatter phase. We exploit local atomic integer operations and the GraphLab caching model to immediately propagate changes. The asynchronous consistency model ensures that the tokens of each document-term pair are sampled by only one update at a time, improving upon the original formulation of the asynchronous Gibbs sampler described by Ahmed et al. (Paper) and the sampler described by Asuncion et al. (Paper).
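The bipartite structure is easy to picture with a small sketch. This Python snippet only illustrates the graph's shape (documents on one side, terms on the other, one edge per document-term pair carrying the token count); the actual GraphLab implementation uses its own C++ graph data structures.

```python
from collections import defaultdict

def build_bipartite_graph(triples):
    """Sketch: one edge per (doc, word) pair, carrying the token count.
    In the real application the edge would also hold the latent topic
    assignments for its tokens."""
    doc_adj = defaultdict(list)   # doc id  -> [(word id, count), ...]
    word_adj = defaultdict(list)  # word id -> [(doc id, count), ...]
    for doc, word, count in triples:
        doc_adj[doc].append((word, count))
        word_adj[word].append((doc, count))
    return doc_adj, word_adj

# The three example triples from the token-file format described below.
doc_adj, word_adj = build_bipartite_graph([(0, 0, 2), (0, 4, 1), (1, 2, 3)])
```

Because every token of a given document-term pair lives on a single edge, serializing updates per edge is exactly what the asynchronous consistency model provides.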
The collapsed Gibbs sampler application (cgs_lda.cpp) takes as input a text corpus represented as one or more token files. Each token file consists of lines of the form:
<docid> <wordid> <count>
For example, a file containing:

0 0 2
0 4 1
1 2 3
implies that the word with id 0 occurs twice in document 0, the word with id 4 occurs once in document 0, and the word with id 2 occurs three times in document 1.
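A minimal reader for this format might look like the following. This is an illustrative Python sketch of the format, not code from the toolkit, and the function name is our own.

```python
def parse_token_file(path):
    """Parse a token file of '<docid> <wordid> <count>' lines into triples."""
    triples = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            doc, word, count = map(int, line.split())
            triples.append((doc, word, count))
    return triples
```

Applied to the example file above, this yields the triples (0, 0, 2), (0, 4, 1), and (1, 2, 3).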
On termination the system outputs, for each term, the number of occurrences of that term assigned to each topic and, for each document, the number of its tokens assigned to each topic.
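These counts are typically turned into smoothed distributions. The standard estimator adds the Dirichlet hyperparameter to each count and normalizes; the sketch below shows the arithmetic (the function and the default smoothing value are ours, for illustration).

```python
import numpy as np

def smoothed_distribution(counts, prior=0.1):
    """Smoothed estimate of a distribution from per-topic counts:
    theta_k = (n_k + prior) / (sum_j n_j + K * prior).
    Use the document hyperparameter alpha for doc-topic counts and the
    topic hyperparameter beta for topic-term counts."""
    counts = np.asarray(counts, dtype=float)
    return (counts + prior) / (counts.sum() + prior * len(counts))

# A document whose 10 tokens split 8/2/0 across three topics.
theta = smoothed_distribution([8, 2, 0])
```

The smoothing keeps every topic's probability strictly positive even when its count is zero.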
To demonstrate how the CGS LDA application works we have obtained a copy of the Daily Kos bag-of-words data from the UCI Repository and reformatted it for the cgs_lda application. You can download the reformatted data from here. Once extracted the folder contains:
> ls -lR daily_kos
total 120
-rw-r--r--  1 jegonzal  staff      904 Jul  1 22:37 README
-rw-r--r--@ 1 jegonzal  staff    55467 Jul  1 22:21 dictionary.txt
drwxr-xr-x  3 jegonzal  staff      102 Jul  1 22:21 tokens

./tokens:
total 7960
-rw-r--r--  1 jegonzal  staff  4074516 Jul  1 22:21 doc_word_count.tsv
To run the CGS_LDA GraphLab application on a single machine we simply run:
> ./cgs_lda --corpus ./daily_kos/tokens --dictionary ./daily_kos/dictionary.txt
This will run indefinitely, displaying the top words in each topic every 10 seconds. To help visualize the output, open the webpage

graphlabapi/toolkits/topic_modeling/http/index.html

which renders a word cloud by connecting directly to the cgs_lda application's internal web server running on localhost on port 8090.
In most cases we will also be interested in collecting the final assignments. This can be done by running:
> ./cgs_lda --corpus ./daily_kos/tokens --dictionary ./daily_kos/dictionary.txt \
    --word_dir word_counts --doc_dir doc_counts --burnin=60
This will run the cgs_lda sampler for roughly 60 seconds and then save the counts of tokens in each topic for the words and documents in the files word_counts_x_of_x and doc_counts_x_of_x. If instead you would like to save the counts to separate folders you can prepend the folder path.
By default GraphLab runs with two threads. However we can increase the parallelism on a single machine by increasing the number of threads:
> ./cgs_lda --corpus ./daily_kos/tokens --dictionary ./daily_kos/dictionary.txt \
    --ncpus=8
The cgs_lda application can run in the distributed setting as well simply by using MPI to launch it:
> mpiexec --hostfile machine_list.txt -n 16 ./cgs_lda \
    --corpus ./daily_kos/tokens --dictionary ./daily_kos/dictionary.txt \
    --ncpus=8
This will run 16 instances, each using 8 threads, on the machines in the machine_list.txt file. Each of these instances will automatically communicate, splitting the work as well as the memory requirements. It is really that easy!
Since we are running in the distributed setting it is convenient to be able to read and write to a distributed filesystem. We have built HDFS support into the GraphLab abstraction, so we can simply change the arguments to:
> mpiexec --hostfile machine_list.txt -n 16 ./cgs_lda \
    --corpus hdfs://bros.ml.cmu.edu/daily_kos/tokens \
    --dictionary hdfs://bros.ml.cmu.edu/daily_kos/dictionary.txt \
    --ncpus=8
There is a wide range of options available when calling the cgs_lda program. For example, if a word occurs in a document more than max_count times, it is reported as occurring max_count times; this ensures that overly frequent words do not dominate documents. When using JSON input, the --corpus argument must point to the JSON files.