GraphLab: Distributed Graph-Parallel API
2.1
GraphLab uses a custom serialization scheme designed for performance rather than compatibility. It performs neither type checking nor pointer tracking, and has only limited cross-platform support. It has been tested and should be compatible across x86 platforms.
For a summary of all serialization functionality, see Serialization.
There are two serialization classes: graphlab::oarchive and graphlab::iarchive. The former does output, while the latter does input. To include all serialization headers, #include <graphlab/serialization/serialization_includes.hpp>.
To serialize data to disk, create an output archive and associate it with an output stream.
For instance, to serialize to a file called "file.bin":
The << stream operators are then used to write data into the archive.
To read back, you use the iarchive with an input stream, and read back the variables in the same order:
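The write-then-read pattern described above can be sketched with the standard library alone. The sketch::oarchive and sketch::iarchive classes below are hypothetical stand-ins that only mimic the usage pattern; they are not GraphLab's implementation, and an in-memory std::stringstream takes the place of a binary file stream opened on "file.bin" so that the example is self-contained.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>
#include <type_traits>

// Hypothetical stand-ins for graphlab::oarchive / graphlab::iarchive.
// They mimic the usage pattern only; this is not GraphLab code.
namespace sketch {

struct oarchive {
  std::ostream& out;
  explicit oarchive(std::ostream& os) : out(os) {}
};

struct iarchive {
  std::istream& in;
  explicit iarchive(std::istream& is) : in(is) {}
};

// Write an arithmetic value as raw bytes, with no re-encoding.
template <typename T>
typename std::enable_if<std::is_arithmetic<T>::value, oarchive&>::type
operator<<(oarchive& arc, const T& value) {
  arc.out.write(reinterpret_cast<const char*>(&value), sizeof(T));
  return arc;
}

// Read an arithmetic value back from raw bytes.
template <typename T>
typename std::enable_if<std::is_arithmetic<T>::value, iarchive&>::type
operator>>(iarchive& arc, T& value) {
  arc.in.read(reinterpret_cast<char*>(&value), sizeof(T));
  return arc;
}

}  // namespace sketch

// Writes three values, then reads them back in the same order.
// Returns true if every value survives the round trip.
bool round_trip_ok() {
  std::stringstream buffer(std::ios::in | std::ios::out | std::ios::binary);

  int a = 42;
  double b = 2.5;
  unsigned long long c = 1234567890123ULL;
  sketch::oarchive oarc(buffer);
  oarc << a << b << c;

  int a2 = 0;
  double b2 = 0.0;
  unsigned long long c2 = 0;
  sketch::iarchive iarc(buffer);
  iarc >> a2 >> b2 >> c2;  // same order and same types as the writes

  return buffer.good() && a2 == a && b2 == b && c2 == c;
}
```

With the real classes, the stream passed to the archive would typically be a std::ofstream or std::ifstream opened in binary mode.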
So what type of data is serializable?
All integer datatypes are serializable.
bool
char and unsigned char
short and unsigned short
int and unsigned int
long and unsigned long
long long and unsigned long long
Since the fixed-width integer types from stdint (int16_t, int32_t, etc.) are typedefs of these basic types, all fixed-width integer types are also serializable.
int16_t and uint16_t
int32_t and uint32_t
int64_t and uint64_t
All integer types are saved in their raw binary form without any additional re-encoding. It is therefore important to deserialize with the same integer width as what was serialized.
The following code will fail in dramatic ways:
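As an illustration of this failure mode, here is a minimal sketch that uses only the standard library rather than the GraphLab archives. It assumes the same raw-byte layout described above: two 32-bit integers are written, and the reader mistakenly asks for a 64-bit integer. The names mismatch_result and read_with_wrong_width are invented for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>

// Stand-alone sketch (not GraphLab code): raw-binary writes carry no type
// information, so the reader must use the same width as the writer.
struct mismatch_result {
  std::int64_t wide_value;  // what the reader sees when it asks for 8 bytes
  bool second_read_failed;  // true if the stream ran dry on the next read
};

mismatch_result read_with_wrong_width() {
  std::stringstream buffer(std::ios::in | std::ios::out | std::ios::binary);

  // Writer: two 32-bit integers, 8 bytes in total.
  std::int32_t first = 1;
  std::int32_t second = 2;
  buffer.write(reinterpret_cast<const char*>(&first), sizeof(first));
  buffer.write(reinterpret_cast<const char*>(&second), sizeof(second));

  // Reader (wrong): asks for a 64-bit integer, which swallows both values.
  mismatch_result r = {0, false};
  buffer.read(reinterpret_cast<char*>(&r.wide_value), sizeof(r.wide_value));

  // Nothing is left for the value the reader expects next.
  std::int32_t leftover = 0;
  buffer.read(reinterpret_cast<char*>(&leftover), sizeof(leftover));
  r.second_read_failed = buffer.fail();
  return r;
}
```

The 64-bit value is built from the bytes of both writes, so it matches neither original, and every later read is shifted or fails outright. No error is reported at the point of the mismatch, which is why the widths must agree by construction.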
All floating point data types are serializable.
float
double
long double (if your compiler supports quad precision)
Similar to integer types, all floating point types are saved in raw binary form without re-encoding. You must deserialize with the same floating point width as what was serialized (i.e. if you serialize a double, you must deserialize a double).
The following template containers are serializable as long as the contained types are all serializable. This can be recursively applied.
std::vector
std::list
std::set
std::map
boost::unordered_set
boost::unordered_map
For instance, a std::vector<int> is serializable. A std::list<std::vector<int> > is therefore also serializable.
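The recursive rule for containers can be sketched as follows, using only the standard library. The byte layout here (a 64-bit element count followed by the elements in order) is an assumption made for illustration and is not necessarily the layout GraphLab writes; the sketch namespace and its save/load functions are likewise hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <list>
#include <ostream>
#include <sstream>
#include <vector>

namespace sketch {  // hypothetical helpers, not GraphLab code

// Base case: a fixed-width integer is written as raw bytes.
inline void save(std::ostream& out, std::int32_t v) {
  out.write(reinterpret_cast<const char*>(&v), sizeof(v));
}
inline void load(std::istream& in, std::int32_t& v) {
  in.read(reinterpret_cast<char*>(&v), sizeof(v));
}

// Declared up front so that nested containers can find each other.
template <typename T> void save(std::ostream& out, const std::vector<T>& c);
template <typename T> void load(std::istream& in, std::vector<T>& c);
template <typename T> void save(std::ostream& out, const std::list<T>& c);
template <typename T> void load(std::istream& in, std::list<T>& c);

// Assumed layout: element count, then each element in order.
template <typename C>
void save_sequence(std::ostream& out, const C& c) {
  const std::uint64_t n = c.size();
  out.write(reinterpret_cast<const char*>(&n), sizeof(n));
  for (const auto& elem : c) save(out, elem);  // recurses into elements
}

template <typename C>
void load_sequence(std::istream& in, C& c) {
  std::uint64_t n = 0;
  in.read(reinterpret_cast<char*>(&n), sizeof(n));
  c.clear();
  for (std::uint64_t i = 0; i < n; ++i) {
    typename C::value_type elem = typename C::value_type();
    load(in, elem);  // recurses into elements
    c.push_back(elem);
  }
}

template <typename T>
void save(std::ostream& out, const std::vector<T>& c) { save_sequence(out, c); }
template <typename T>
void load(std::istream& in, std::vector<T>& c) { load_sequence(in, c); }
template <typename T>
void save(std::ostream& out, const std::list<T>& c) { save_sequence(out, c); }
template <typename T>
void load(std::istream& in, std::list<T>& c) { load_sequence(in, c); }

}  // namespace sketch

// Round-trips a list of vectors of integers through an in-memory stream.
bool nested_round_trip_ok() {
  std::list<std::vector<std::int32_t> > original;
  original.push_back(std::vector<std::int32_t>(3, 7));   // {7, 7, 7}
  original.push_back(std::vector<std::int32_t>());       // {}
  original.push_back(std::vector<std::int32_t>(1, -1));  // {-1}

  std::stringstream buffer(std::ios::in | std::ios::out | std::ios::binary);
  sketch::save(buffer, original);

  std::list<std::vector<std::int32_t> > restored;
  sketch::load(buffer, restored);
  return restored == original;
}
```

Because each container only delegates to whatever save/load exists for its element type, any nesting depth works as long as the innermost type is serializable.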
For performance, std::vector<T> receives special handling when T is a simple POD (Plain Old Data) type. For the purposes of the serializer, POD types are data types which occupy a contiguous region in memory: for instance, basic types (double, int, etc.), or structs which contain only basic types. Such types can be copied or replicated using a simple memory copy, which greatly accelerates serialization and deserialization. All basic data types are automatically POD types. We will discuss structs and other user types in the next section.
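The idea behind the vector fast path can be sketched with the standard library alone. This is an illustration of the technique and not GraphLab's code: the function names, the Point struct, and the count-then-bytes layout are all assumptions, and std::is_trivially_copyable (C++11) is used here as the safety check.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <type_traits>
#include <vector>

// Stand-alone sketch (not GraphLab code) of a bulk path for vectors whose
// elements can be copied byte for byte.
template <typename T>
void save_pod_vector(std::ostream& out, const std::vector<T>& v) {
  static_assert(std::is_trivially_copyable<T>::value,
                "bulk copy is only valid for trivially copyable types");
  const std::uint64_t n = v.size();  // assumed layout: count, then raw bytes
  out.write(reinterpret_cast<const char*>(&n), sizeof(n));
  if (n > 0) {
    // One write for the whole array instead of one write per element.
    out.write(reinterpret_cast<const char*>(v.data()),
              static_cast<std::streamsize>(n * sizeof(T)));
  }
}

template <typename T>
void load_pod_vector(std::istream& in, std::vector<T>& v) {
  static_assert(std::is_trivially_copyable<T>::value,
                "bulk copy is only valid for trivially copyable types");
  std::uint64_t n = 0;
  in.read(reinterpret_cast<char*>(&n), sizeof(n));
  v.resize(static_cast<std::size_t>(n));
  if (n > 0) {
    in.read(reinterpret_cast<char*>(v.data()),
            static_cast<std::streamsize>(n * sizeof(T)));
  }
}

// Hypothetical struct made only of basic types.
struct Point {
  double x;
  double y;
};

bool pod_vector_round_trip_ok() {
  std::vector<Point> original(3);
  for (std::size_t i = 0; i < original.size(); ++i) {
    original[i].x = 1.5 * i;
    original[i].y = -2.0 * i;
  }

  std::stringstream buffer(std::ios::in | std::ios::out | std::ios::binary);
  save_pod_vector(buffer, original);

  std::vector<Point> restored;
  load_pod_vector(buffer, restored);

  if (restored.size() != original.size()) return false;
  for (std::size_t i = 0; i < original.size(); ++i) {
    if (restored[i].x != original[i].x || restored[i].y != original[i].y) {
      return false;
    }
  }
  return true;
}
```

The saving comes from issuing a single stream write for the whole element array, which matters most for large vectors of small elements.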
To serialize a struct/class, all you need to do is define public save() and load() member functions. For instance:
The save() and load() function prototypes must match exactly. Other conditions are that the class must be Default Constructible:
And that the class must be Assignable:
After which, TestClass becomes serializable, and can be stored and read from an archive:
Since TestClass is now serializable, containers of TestClass listed in Containers are also serializable.
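The member save()/load() pattern described above can be sketched as follows. The archive classes in the sketch namespace are hypothetical stand-ins so that the example compiles on its own; with GraphLab, the two member functions would take graphlab::oarchive and graphlab::iarchive references instead. The fields of TestClass are chosen for illustration.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>
#include <type_traits>

namespace sketch {  // hypothetical stand-in archives, not GraphLab code

struct oarchive {
  std::ostream& out;
  explicit oarchive(std::ostream& os) : out(os) {}
};
struct iarchive {
  std::istream& in;
  explicit iarchive(std::istream& is) : in(is) {}
};

// Arithmetic types are written and read as raw bytes.
template <typename T>
typename std::enable_if<std::is_arithmetic<T>::value, oarchive&>::type
operator<<(oarchive& arc, const T& v) {
  arc.out.write(reinterpret_cast<const char*>(&v), sizeof(T));
  return arc;
}
template <typename T>
typename std::enable_if<std::is_arithmetic<T>::value, iarchive&>::type
operator>>(iarchive& arc, T& v) {
  arc.in.read(reinterpret_cast<char*>(&v), sizeof(T));
  return arc;
}

// Class types are forwarded to their member save()/load() functions.
template <typename T>
typename std::enable_if<std::is_class<T>::value, oarchive&>::type
operator<<(oarchive& arc, const T& v) {
  v.save(arc);
  return arc;
}
template <typename T>
typename std::enable_if<std::is_class<T>::value, iarchive&>::type
operator>>(iarchive& arc, T& v) {
  v.load(arc);
  return arc;
}

}  // namespace sketch

class TestClass {
 public:
  int i;
  int j;
  double w;

  TestClass() : i(0), j(0), w(0.0) {}  // Default Constructible

  // save() is const; load() fills the members in the same order.
  void save(sketch::oarchive& oarc) const { oarc << i << j << w; }
  void load(sketch::iarchive& iarc) { iarc >> i >> j >> w; }
};

bool test_class_round_trip_ok() {
  std::stringstream buffer(std::ios::in | std::ios::out | std::ios::binary);

  TestClass a;
  a.i = 1;
  a.j = 2;
  a.w = 3.5;
  sketch::oarchive oarc(buffer);
  oarc << a;

  TestClass b;
  sketch::iarchive iarc(buffer);
  iarc >> b;

  TestClass c;
  c = b;  // Assignable
  return c.i == 1 && c.j == 2 && c.w == 3.5;
}
```

Keeping the member order identical in save() and load() is what makes the round trip work, since the stream itself records no field names or types.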
As mentioned in Containers, POD data types occupy a contiguous region in memory and hence can be serialized and deserialized very quickly. Ideally, determination of whether a data type is POD or not should be handled by the compiler. However, this capability is only available in C++11 and not all compilers support it yet. We therefore implemented a simple workaround which will allow you to identify to the serializer that a class is POD, and avoid writing a save/load function.
We consider the following Coordinate struct.
This struct can be declared a POD type, and so use the accelerated serializer, by simply inheriting from graphlab::IS_POD_TYPE.
Now, Coordinate variables, or even vector<Coordinate> variables will serialize/deserialize faster. Also, you avoid writing a save() and load() function.
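One way a marker base class like this can work is sketched below. This is not GraphLab's implementation: sketch::IS_POD_TYPE is a hypothetical stand-in for graphlab::IS_POD_TYPE, the dispatch through std::is_base_of is an assumption about the general technique, and the x, y, z fields of Coordinate are chosen for illustration.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>
#include <type_traits>

namespace sketch {  // hypothetical stand-ins, not GraphLab code

// Empty marker base class: inheriting from it opts a type into raw copying.
struct IS_POD_TYPE {};

// True for basic arithmetic types and for anything that inherits the marker.
template <typename T>
struct is_marked_pod
    : std::integral_constant<bool,
                             std::is_arithmetic<T>::value ||
                                 std::is_base_of<IS_POD_TYPE, T>::value> {};

// Marked types are copied byte for byte; no save()/load() is needed.
template <typename T>
typename std::enable_if<is_marked_pod<T>::value>::type
save(std::ostream& out, const T& v) {
  static_assert(std::is_trivially_copyable<T>::value,
                "marker used on a type that is unsafe to copy bytewise");
  out.write(reinterpret_cast<const char*>(&v), sizeof(T));
}

template <typename T>
typename std::enable_if<is_marked_pod<T>::value>::type
load(std::istream& in, T& v) {
  static_assert(std::is_trivially_copyable<T>::value,
                "marker used on a type that is unsafe to copy bytewise");
  in.read(reinterpret_cast<char*>(&v), sizeof(T));
}

}  // namespace sketch

// The user type only has to inherit from the marker.
struct Coordinate : public sketch::IS_POD_TYPE {
  int x, y, z;
};

bool coordinate_round_trip_ok() {
  Coordinate c;
  c.x = 1;
  c.y = -2;
  c.z = 3;

  std::stringstream buffer(std::ios::in | std::ios::out | std::ios::binary);
  sketch::save(buffer, c);

  Coordinate d;
  d.x = d.y = d.z = 0;
  sketch::load(buffer, d);
  return d.x == 1 && d.y == -2 && d.z == 3;
}
```

The marker carries no data, so it does not change the size of the struct on typical compilers. The static_assert is an extra guard added in this sketch for compilers that do support the C++11 type traits.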
In some situations, you may find that you need to make a data type serializable, but the data type is implemented by someone else, in a different library, making it impossible to extend and write a member save() and load() function as described in User Structs and Classes.
In this situation, it is necessary to implement an "Out of place" serializer. This is unfortunately somewhat more complicated.
For instance, suppose an external type called Matrix, implemented by some other library, needs to be made serializable. The following code has to be written in the global namespace:
To facilitate reading and writing of data from the archives, the output oarchive object provides a graphlab::oarchive::write() function which directly writes a sequence of bytes to the stream. Similarly, the input iarchive object provides a graphlab::iarchive::read() function which directly reads a sequence of bytes from the stream.
For instance, if the Matrix type example above is defined in the following way:
An "out of place" serializer could be implemented the following way:
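The general shape of such a serializer can be sketched with free functions defined outside the type. Everything below is an assumption made for illustration: the sketch archives are stand-ins whose write() and read() members only mirror the byte-level functions described above, extlib::Matrix is a hypothetical third-party type, and plain operator overloads are used here rather than whatever mechanism GraphLab itself provides for this purpose.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

namespace sketch {  // hypothetical stand-in archives, not GraphLab code

struct oarchive {
  std::ostream& out;
  explicit oarchive(std::ostream& os) : out(os) {}
  // Writes a sequence of bytes directly to the stream.
  void write(const char* bytes, std::size_t n) {
    out.write(bytes, static_cast<std::streamsize>(n));
  }
};

struct iarchive {
  std::istream& in;
  explicit iarchive(std::istream& is) : in(is) {}
  // Reads a sequence of bytes directly from the stream.
  void read(char* bytes, std::size_t n) {
    in.read(bytes, static_cast<std::streamsize>(n));
  }
};

}  // namespace sketch

namespace extlib {  // hypothetical third-party library we cannot modify

struct Matrix {
  int rows;
  int cols;
  std::vector<double> values;  // invariant: rows * cols entries
};

}  // namespace extlib

// Out-of-place serializer: free functions in the global namespace, so the
// Matrix type itself does not need save() or load() members.
sketch::oarchive& operator<<(sketch::oarchive& arc, const extlib::Matrix& m) {
  arc.write(reinterpret_cast<const char*>(&m.rows), sizeof(m.rows));
  arc.write(reinterpret_cast<const char*>(&m.cols), sizeof(m.cols));
  if (!m.values.empty()) {
    arc.write(reinterpret_cast<const char*>(m.values.data()),
              m.values.size() * sizeof(double));
  }
  return arc;
}

sketch::iarchive& operator>>(sketch::iarchive& arc, extlib::Matrix& m) {
  arc.read(reinterpret_cast<char*>(&m.rows), sizeof(m.rows));
  arc.read(reinterpret_cast<char*>(&m.cols), sizeof(m.cols));
  // The dimensions are read first so the buffer can be sized before the data.
  m.values.resize(static_cast<std::size_t>(m.rows) *
                  static_cast<std::size_t>(m.cols));
  if (!m.values.empty()) {
    arc.read(reinterpret_cast<char*>(m.values.data()),
             m.values.size() * sizeof(double));
  }
  return arc;
}

bool matrix_round_trip_ok() {
  extlib::Matrix m;
  m.rows = 2;
  m.cols = 3;
  m.values.assign(6, 0.0);
  for (std::size_t i = 0; i < m.values.size(); ++i) m.values[i] = 0.5 * i;

  std::stringstream buffer(std::ios::in | std::ios::out | std::ios::binary);
  sketch::oarchive oarc(buffer);
  oarc << m;

  extlib::Matrix restored;
  sketch::iarchive iarc(buffer);
  iarc >> restored;
  return restored.rows == 2 && restored.cols == 3 &&
         restored.values == m.values;
}
```

Writing the dimensions before the data is the important design point: the loader cannot know how many bytes to read for the values until it has recovered the shape.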