GraphLab: Distributed Graph-Parallel API  2.1
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Macros Groups Pages
Serialization

We have a custom serialization scheme which is designed for performance rather than compatibility. It does not perform type checking, It does not perform pointer tracking, and has only limited support across platforms. It has been tested, and should be compatible across x86 platforms.

For a summary of all serialization functionality see Serialization

There are two serialization classes graphlab::oarchive and graphlab::iarchive. The former does output, while the latter does input. To include all serialization headers, #include <graphlab/serialization/serialization_includes.hpp>.

Basic serialize/deserialize

To serialize data to disk, you just create an output archive, and associate it wiith an output stream.

For instance, to serialize to a file called "file.bin":

std::ofstream fout("file.bin", std::fstream::binary);
graphlab::oarchive oarc(fout);

The << stream operators are then used to write data into the archive.

int i = 10;
double j = 20;
std::vector<float> v(10,1.0); // create a vector of 10 "1.0" values
oarc << i << j << v;

To read back, you use the iarchive with an input stream, and read back the variables in the same order:

std::ifstream fin("file.bin", std::fstream::binary);
graphlab::iarchive iarc(fout);
int i;
double j;
std::vector<float> v;
iarc >> i >> j >> v;

Serializable

So what type of data is serializable?

Integer Types

All integer datatypes are serializable.

  • bool
  • char and unsigned char
  • short and unsigned short
  • int and unsigned int
  • long and unsigned long
  • long long and unsigned long long

Since all fixed width integer types from stdint (int16_t, int32_t, etc) are derived from these basic types, all fixed width integer types are also serializable.

  • int16_t and uint16_t
  • int32_t and uint32_t
  • int64_t and uint64_t

All integer types are saved in their raw binary form without any additional re-encoding. It is therefore important to deserialize with the same integer width as what was serialized.

The following code will fail in dramatic ways:

int i;
oarc << i; // write some integer to a file
...
// some time later we need to read back the integer.
long j;
iarc >> j; // this will fail

Floating Point Types

All floating point data types are serializable.

  • float
  • double
  • long double if your compiler supports quad precision.

Similar to integer types, all floating types are saved in raw binary form without re-encoding. You must deserialize with the same floating point width as what was serialized. (i.e. if you serialize a double, you must deserialize a double.

Containers

The following template containers are serializable as long as the contained types are all serializable. This can be recursively applied.

  • std::vector
  • std::list
  • std::set
  • std::map
  • boost::unordered_set
  • boost::unordered_map

For instance, a std::vector<int> is serializable. A std::list<std::vector<int> > is therefore also serializable.

There is special handling for the std::vector<T> for performance in the event that T is a simple POD (Plain Old Data) data type. POD types are data types which occupy a contiguous region in memory. For instance, basic types (double, int, etc), or structs which contains only basic types. Such types can be copied or replicated using a simple mem-copy operation and can be greatly acceleration during serialization / deserialization. All basic data types are automatically POD types. We will discuss structs and other user types in the next section.

User Structs and Classes

To serialize a struct/class, all you need to do is to define a public load/save function. For instance:

class TestClass{
public:
int i, j;
std::vector<int> k;
void save(graphlab::oarchive& oarc) const {
oarc << i << j << k;
}
void load(graphlab::iarchive& iarc) {
iarc >> i >> j >> k;
}
};

The save() and load() function prototypes must match exactly. Other conditions are that the class must be Default Constructible:

// it must be possible to create a variable of TestClass type like this
TestClass a;

And that the class must be Assignable:

TestClass a, b;
// it must be possible to assign one variable of TestClass to another
b = a;

After which, TestClass becomes serializable, and can be stored and read from an archive:

TestClass t;
// set values to t
oarc << t; // write it to a file
... some time afterwords...
TestClass t2;
iarc >> t2; // read it to a file

Since TestClass is now serializable, containers of TestClass listed in Containers are also serializable.

POD Serialization

As mentioned in Containers, POD data types occupy a contiguous region in memory and hence can be serialized and deserialized very quickly. Ideally, determination of whether a data type is POD or not should be handled by the compiler. However, this capability is only available in C++11 and not all compilers support it yet. We therefore implemented a simple workaround which will allow you to identify to the serializer that a class is POD, and avoid writing a save/load function.

We consider the following Coordinate struct.

struct Coordinate{
int x, y, z;
};

This struct can be defined to be a POD type using an accelerated serializer by simply inheriting from graphlab::IS_POD_TYPE

struct Coordinate: public graphlab::IS_POD_TYPE{
int x, y, z;
};

Now, Coordinate variables, or even vector<Coordinate> variables will serialize/deserialize faster. Also, you avoid writing a save() and load() function.

Note:
Currently POD detection is performed through the boost type traits library. When compilers implement std::is_pod (in C++11), POD detection will improve, increasing the scope of types which can be serialized quickly and automatically. A minor concern is that the scope of POD types is still slightly too large, since technically pointer types are POD, and those cannot not be serialized automatically.

Out of Place Serialization

In some situations, you may find that you need to make a data type serializable, but the data type is implemented by someone else, in a different library, making it impossible to extend and write a member save() and load() function as described in User Structs and Classes.

In this situation, it is necessary to implement an "Out of place" serializer. This is unfortunately somewhat more complicated.

For instance, if there is an external type implemented by some other library called Matrix which I would like to make serializable. The following code will have to be written in the global namespace

BEGIN_OUT_OF_PLACE_SAVE(oarc, Matrix, mat)
// write the "mat" variable which is of the type Matrix
// into the output archive oarc
END_OUT_OF_PLACE_SAVE()
BEGIN_OUT_OF_PLACE_LOAD(iarc, Matrix, mat)
// read the "mat" variable which is of the type Matrix
// from the input archive iarc
END_OUT_OF_PLACE_LOAD()

To facilitate reading and writing of data from the archives, the output oarchive object provides an graphlab::oarchive::write() oarchive::write() function which directly writes a sequence of bytes to the stream. Similarly, the input iarchive object provides a graphlab::iarchive::read() iarchive::read() function which directly reads a sequence of bytes from the stream.

For instance, if the Matrix type example above is defined in the following way:

struct Matrix {
int width; // width of the matrix
int height; // height of the matrix
double* data; // an array containing all the values in the matrix
int datalen; // the number of elements in the "data" array.
}

An "out of place" serializer could be implemented the following way:

BEGIN_OUT_OF_PLACE_SAVE(oarc, Matrix, mat)
// store the dimensions of the matrix
oarc << mat.width << mat.height;
// store the length of the data array
oarc << mat.datalen;
// write the double array
oarc.write((char*)(mat.data), sizeof(double) * mat.datalen);
END_OUT_OF_PLACE_SAVE()
BEGIN_OUT_OF_PLACE_LOAD(iarc, Matrix, mat)
// clear the matrix data if there is any
if (mat.data != NULL) delete [] mat.data;
// read the dimensions of the matrix
iarc >> mat.width >> mat.height;
// read the length of the data array
iarc >> mat.datalen;
// allocate sufficient storage for the array
mat.data = new double[mat.datalen];
// read the double array
iarc.read((char*)(mat.data), sizeof(double) * mat.datalen);
END_OUT_OF_PLACE_LOAD()