GraphLab: Distributed Graph-Parallel API 2.1
The data structure at the center of most of GraphLab's computation is the distributed_graph. The distributed graph is a directed graph data structure consisting of vertices and directed edges, with no duplicate edges allowed: there can be at most one edge from vertex A to vertex B, and at most one edge from vertex B to vertex A. An arbitrary user data type can be associated with each vertex and each edge, as long as the data type is Serializable.
Since we are writing PageRank, we first define a struct describing a web page. This will be the contents of each vertex. The struct holds the name of the web page as well as the resulting PageRank. A constructor that assigns a name is provided for later convenience. Note that we also define a default constructor, which is required for the struct to be used in the graph.
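The struct described above can be sketched as follows. The field names pagename and pagerank come from the text; the struct name web_page and the initial rank of 0.0 are assumptions made for illustration, and the small usage function is only there to show what the two constructors produce.

```cpp
#include <cassert>
#include <string>

// Vertex data for the PageRank example: one web page per vertex.
// The name "web_page" is an assumption made for this sketch.
struct web_page {
  std::string pagename;  // name of the web page
  double pagerank;       // resulting PageRank value

  // Default constructor, required for the type to be stored in the graph.
  // Starting the rank at 0.0 is an assumption, not a documented requirement.
  web_page() : pagerank(0.0) {}

  // Convenience constructor that assigns a name.
  explicit web_page(const std::string& name)
      : pagename(name), pagerank(0.0) {}
};

// Usage: both constructors leave the rank at the assumed starting value.
inline void web_page_usage() {
  web_page blank;
  web_page home("index.html");
  assert(blank.pagename.empty());
  assert(home.pagename == "index.html" && home.pagerank == 0.0);
}
```

Marking the one-argument constructor explicit keeps a plain string from being silently converted into a web_page.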
To make this struct Serializable, we need to define save and load member functions. The save function simply writes the pagename and pagerank fields into the output archive object. The load function performs the reverse. Care should be taken to ensure that the save and load functions are symmetric: load must read the fields in the same order in which save wrote them.
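The symmetry requirement can be illustrated with a self-contained sketch. The sketch::oarchive and sketch::iarchive classes below are hypothetical stand-ins written for this example so that it compiles without GraphLab; in a real program the archive objects would be the library's own archive classes (graphlab::oarchive and graphlab::iarchive, as far as we know). The helper round_trip is also an assumption introduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

namespace sketch {
// Hypothetical stand-in for an output archive (not GraphLab's class).
class oarchive {
 public:
  explicit oarchive(std::ostream& out) : out_(out) { out_.precision(17); }
  oarchive& operator<<(const std::string& s) {
    out_ << s.size() << ' ' << s;  // length prefix keeps spaces intact
    return *this;
  }
  oarchive& operator<<(double d) {
    out_ << ' ' << d << ' ';
    return *this;
  }
 private:
  std::ostream& out_;
};

// Hypothetical stand-in for an input archive (not GraphLab's class).
class iarchive {
 public:
  explicit iarchive(std::istream& in) : in_(in) {}
  iarchive& operator>>(std::string& s) {
    std::size_t n = 0;
    in_ >> n;
    in_.get();  // skip the separator written after the length
    s.resize(n);
    if (n > 0) in_.read(&s[0], static_cast<std::streamsize>(n));
    assert(in_ && "archive ended early");
    return *this;
  }
  iarchive& operator>>(double& d) {
    in_ >> d;
    assert(in_ && "archive ended early");
    return *this;
  }
 private:
  std::istream& in_;
};
}  // namespace sketch

struct web_page {
  std::string pagename;
  double pagerank;
  web_page() : pagerank(0.0) {}
  explicit web_page(const std::string& name)
      : pagename(name), pagerank(0.0) {}

  // Write the fields in a fixed order.
  void save(sketch::oarchive& oarc) const { oarc << pagename << pagerank; }
  // Read the fields back in exactly the same order.
  void load(sketch::iarchive& iarc) { iarc >> pagename >> pagerank; }
};

// Serialize a page into a buffer and rebuild it from that buffer.
inline web_page round_trip(const web_page& page) {
  std::stringstream buffer;
  sketch::oarchive oarc(buffer);
  page.save(oarc);
  sketch::iarchive iarc(buffer);
  web_page copy;
  copy.load(iarc);
  return copy;
}
```

If load read pagerank before pagename, the round trip would no longer reproduce the original page, which is the kind of mismatch the symmetry rule is meant to prevent.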
Since we do not need to store any information on the edges of the graph, we will just use the graphlab::empty data type, which ensures that the edge data does not take up any memory.
The graphlab::distributed_graph data type takes two template arguments:
VertexData
The type of data to be stored on each vertex
EdgeData
The type of data to be stored on each edge
For convenience, we define the type of the graph using a typedef:
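A minimal sketch of such a typedef follows. To keep it compilable without GraphLab, sketch::distributed_graph and sketch::empty are hypothetical stand-ins that only mirror the two template arguments described above; the names web_page and graph_type are likewise assumptions. The comment shows how the same line would read against the library's own types.

```cpp
#include <string>
#include <type_traits>

namespace sketch {
// Hypothetical stand-ins so the example compiles without GraphLab.
struct empty {};

template <typename VertexData, typename EdgeData>
class distributed_graph {
 public:
  typedef VertexData vertex_data_type;  // data stored on each vertex
  typedef EdgeData edge_data_type;      // data stored on each edge
};
}  // namespace sketch

// Reduced vertex type; the save and load members are omitted here.
struct web_page {
  std::string pagename;
  double pagerank;
  web_page() : pagerank(0.0) {}
};

// Vertices carry a web_page; edges carry no data.
// With GraphLab available, this would read:
//   typedef graphlab::distributed_graph<web_page, graphlab::empty> graph_type;
typedef sketch::distributed_graph<web_page, sketch::empty> graph_type;

// Compile-time checks that the typedef selects the intended data types.
static_assert(std::is_same<graph_type::vertex_data_type, web_page>::value,
              "vertex data should be web_page");
static_assert(std::is_same<graph_type::edge_data_type, sketch::empty>::value,
              "edge data should be empty");
```

Naming the graph type once means later code can refer to graph_type instead of repeating both template arguments.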
At this point, our code looks like this:
We have now defined the data types required for the graph to operate. In the next section, we will populate the graph with some synthetic data.