GraphLab: Distributed Graph-Parallel API  2.1
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Macros Groups Pages
5: Writing the PageRank Vertex Program

Finally, we get to writing the PageRank vertex program itself. A detailed description the pagerank update can be found on wikipedia here.

In pseudo-code:

  To compute PageRank of page P:
    acc = 0;
    For Each In-page Q:
      acc += Q.pagerank / Q.num_out_links
    End
    P.pagerank = 0.85 * acc + 0.15

We need to map this to the GraphLab vertex program.

PageRank Vertex Program

A "vertex program" can be thought of as a little program which is executed on a vertex in the graph. The vertex program is short lived: it performs the following 3 phases of execution: each phase providing it access to a different section of the neighborhood of the vertex, then is destroyed.

  • A gather phase where gather() function in the vertex program is called on each edge on the vertex's adjacent edges. returning a value with each gather.
  • An apply phase where the values returned by the gather's are summed together and given to the apply() function in the vertex program.
  • A scatter phase where scatter() function in the vertex program is once again called on each edge on the vertex's adjacent edges.

See graphlab::ivertex_program for detailed documentation on the behavior of the vertex program.

It is simplest to just demonstrate the PageRank vertex program in code:

class pagerank_program :
public graphlab::ivertex_program<graph_type, double>,
public:
// we are going to gather on all the in-edges
edge_dir_type gather_edges(icontext_type& context,
const vertex_type& vertex) const {
}
// for each in-edge gather the weighted sum of the edge.
double gather(icontext_type& context, const vertex_type& vertex,
edge_type& edge) const {
return edge.source().data().pagerank / edge.source().num_out_edges();
}
// Use the total rank of adjacent pages to update this page
void apply(icontext_type& context, vertex_type& vertex,
const gather_type& total) {
double newval = total * 0.85 + 0.15;
vertex.data().pagerank = newval;
}
// No scatter needed. Return NO_EDGES
edge_dir_type scatter_edges(icontext_type& context,
const vertex_type& vertex) const {
}
};

The pagerank_program inherits from graphlab::ivertex_program which is itself templatized over the type of the graph (graph_type) and the type returned by the gather() obersion.

The pagerank_program must also be Serializable. Since this program does not contain any data elements, it is a POD type and is sufficient to inherit from graphlab::IS_POD_TYPE.

We will now explain each function.

gather_edges(icontext_type& context, const vertex_type& vertex)

The gather_edges() function returns the collection of edges to gather. It may return graphlab::IN_EDGES, graphlab::OUT_EDGES, graphlab::NO_EDGES, or graphlab::ALL_EDGES. The PageRank update uses only in pages, thus we return graphlab::IN_EDGES.

The gather_edges() function is also passed a context object which provides additional access to the execution environment: such as obtaining the number of edges in the graph (num_edges()), the ability to immediately stop execution (stop()) among others.

It is also passed a reference to the vertex the current vertex_program is executing on through the vertex argument. Through vertex, the function can read the data on the vertex as well as obtain other meta-data such as the number of in-edges of the current vertex (see graphlab::distributed_graph::vertex_type).

gather(icontext_type& context, const vertex_type& vertex, edge_type& edge)

According to the PageRank equation, we must compute a weighted sum of the in-pages. The gather() function is thus executed on each in-edge in of the current vertex, returning the edge's contribution to the "acc" value.

To compute the current edge's contribution to the weight, we use the edge argument which provides direct access to the data on the edge, the source vertex of the edge, and the destination vertex. (see graphlab::distributed_graph::edge_type)

Note:
While gather() technically has a non-const reference to the source and target vertex data through edge.source() and edge.target(), it should not modify them. The data on the edge (accessible through edge.data()) is modifiable however.

Once the contribution is computed, we simply return it since the result from all gather() calls are summed up by the execution engine (using only the += operator). The data type of the accumulation is a double, and this must be provided in the second template argument of the ivertex_program the pagerank_program inheritted from.

apply(icontext_type& context, vertex_type& vertex, const gather_type& total)

The returned values from each gather are implicitly summed up (in parallel) behind the scenes as passed to the apply() function in the total argument. Observe that now the vertex parameter is modifiable, and we use this to write the new pagerank to the current vertex.

scatter_edges(icontext_type& context, const vertex_type& vertex)

Scatter is similarly gather: it is executed on each edge of the vertex, but does not accumulate any values. Since the pagerank computation does not require a scatter operation we simply have scatter_edges return graphlab::NO_EDGES.

Running the Vertex Program

To run the above vertex program on all vertices in the graph once, we simply construct an engine in main() (after finalizing the graph)

graphlab::omni_engine<pagerank_program> engine(dc, graph, "sync");
engine.signal_all();
engine.start();

The omni_engine is a engine wrapper that allows you to easily select between a synchronous engine or an asynchronous engine. In this case, we select a synchronous engine. Passing "async" will select an asynchronous engine.

Note:
If graphlab::command_line_options are used, it can be passed as an additional 4th argument to the constructor. This will allow the engine type to be modified at runtime.

The signal_all() function, as the name suggests, signals all the vertices in the graph to run. start() will begin execution of all signaled vertices. Since the synchronous engine is selected all vertices will perform the gather/apply/scatter operations in lock-step. If the asynchronous engine is selected, the vertices will run asynchronously, but a distributed locking procedure is used internally to ensure data consistency.

Each vertex in the graph will run exactly once. We could of course embed the signal+start operations in a loop to run multiple rounds, but that would be inefficient. In the next section we will learn how to signal vertices inside a vertex program.