GraphLab: Distributed Graph-Parallel API
2.1
|
Finally, we get to writing the PageRank vertex program itself. A detailed description the pagerank update can be found on wikipedia here.
In pseudo-code:
To compute PageRank of page P: acc = 0; For Each In-page Q: acc += Q.pagerank / Q.num_out_links End P.pagerank = 0.85 * acc + 0.15
We need to map this to the GraphLab vertex program.
A "vertex program" can be thought of as a little program which is executed on a vertex in the graph. The vertex program is short lived: it performs the following 3 phases of execution: each phase providing it access to a different section of the neighborhood of the vertex, then is destroyed.
See graphlab::ivertex_program for detailed documentation on the behavior of the vertex program.
It is simplest to just demonstrate the PageRank vertex program in code:
The pagerank_program
inherits from graphlab::ivertex_program which is itself templatized over the type of the graph (graph_type
) and the type returned by the gather()
obersion.
The pagerank_program must also be Serializable. Since this program does not contain any data elements, it is a POD type and is sufficient to inherit from graphlab::IS_POD_TYPE.
We will now explain each function.
The gather_edges() function returns the collection of edges to gather. It may return graphlab::IN_EDGES, graphlab::OUT_EDGES, graphlab::NO_EDGES, or graphlab::ALL_EDGES. The PageRank update uses only in pages, thus we return graphlab::IN_EDGES.
The gather_edges() function is also passed a context
object which provides additional access to the execution environment: such as obtaining the number of edges in the graph (num_edges()), the ability to immediately stop execution (stop()) among others.
It is also passed a reference to the vertex the current vertex_program is executing on through the vertex
argument. Through vertex
, the function can read the data on the vertex as well as obtain other meta-data such as the number of in-edges of the current vertex (see graphlab::distributed_graph::vertex_type).
According to the PageRank equation, we must compute a weighted sum of the in-pages. The gather() function is thus executed on each in-edge in of the current vertex, returning the edge's contribution to the "acc" value.
To compute the current edge's contribution to the weight, we use the edge
argument which provides direct access to the data on the edge, the source vertex of the edge, and the destination vertex. (see graphlab::distributed_graph::edge_type)
gather()
technically has a non-const reference to the source and target vertex data through edge.source()
and edge.target()
, it should not modify them. The data on the edge (accessible through edge.data()
) is modifiable however.Once the contribution is computed, we simply return it since the result from all gather()
calls are summed up by the execution engine (using only the += operator). The data type of the accumulation is a double, and this must be provided in the second template argument of the ivertex_program the pagerank_program inheritted from.
The returned values from each gather are implicitly summed up (in parallel) behind the scenes as passed to the apply() function in the total
argument. Observe that now the vertex
parameter is modifiable, and we use this to write the new pagerank to the current vertex.
Scatter is similarly gather: it is executed on each edge of the vertex, but does not accumulate any values. Since the pagerank computation does not require a scatter operation we simply have scatter_edges return graphlab::NO_EDGES.
To run the above vertex program on all vertices in the graph once, we simply construct an engine in main() (after finalizing the graph)
The omni_engine is a engine wrapper that allows you to easily select between a synchronous engine or an asynchronous engine. In this case, we select a synchronous engine. Passing "async" will select an asynchronous engine.
The signal_all() function, as the name suggests, signals all the vertices in the graph to run. start() will begin execution of all signaled vertices. Since the synchronous engine is selected all vertices will perform the gather/apply/scatter operations in lock-step. If the asynchronous engine is selected, the vertices will run asynchronously, but a distributed locking procedure is used internally to ensure data consistency.
Each vertex in the graph will run exactly once. We could of course embed the signal+start operations in a loop to run multiple rounds, but that would be inefficient. In the next section we will learn how to signal vertices inside a vertex program.