Data Serialization and Evolution

When sending data over the network or storing it in a file, we need a way to encode the data into bytes. Data serialization has a long history and has evolved quite a bit over the last few years. People started with programming-language-specific serialization, such as Java serialization, which makes consuming the data in other languages inconvenient. People then moved to language-agnostic formats such as JSON. However, formats like JSON lack a strictly defined schema, which has two drawbacks. First, they are verbose, because field names and type information have to be repeated in every serialized record. Second, the lack of structure makes consuming data in these formats more challenging, because fields can be arbitrarily added or removed.

More recently, a few cross-language serialization libraries have emerged that require the data structure to be formally defined by a schema. These libraries include Thrift, Protocol Buffers, and Avro. The advantage of having a schema is that it clearly specifies the structure, the types, and the meaning (through documentation) of the data. With a schema, data can also be encoded more efficiently. Among these schema-aware serialization libraries, we particularly recommend Avro because its authors have put more thought into how schemas evolve. Next, we briefly introduce Avro and show how it can be used to support typical schema evolution.

Avro

An Avro schema defines the data structure in a JSON format. The following is an example Avro schema that specifies a user record with two fields: name and favorite_number of type string and int, respectively.

{"namespace": "example.avro",
 "type": "record",
 "name": "user",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number",  "type": "int"}
 ]
}

One of the interesting things about Avro is that it requires a schema not only during data serialization but also during data deserialization. Because the schema is provided at decoding time, metadata such as field names doesn’t have to be explicitly encoded in the data. This makes the binary encoding of Avro data very compact.
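To make this concrete, the following is a minimal sketch of a serialization round trip using Avro’s Java GenericRecord API. The class name UserRoundTrip and the sample values are ours for illustration; the Avro calls come from the org.apache.avro library.

import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class UserRoundTrip {
  static final String USER_SCHEMA =
      "{\"namespace\": \"example.avro\", \"type\": \"record\", \"name\": \"user\","
      + " \"fields\": ["
      + "   {\"name\": \"name\", \"type\": \"string\"},"
      + "   {\"name\": \"favorite_number\", \"type\": \"int\"}"
      + " ]}";

  public static void main(String[] args) throws IOException {
    Schema schema = new Schema.Parser().parse(USER_SCHEMA);

    // Build a record that conforms to the schema.
    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "alice");
    user.put("favorite_number", 7);

    // Serialize: only the field values are written; field names and types are
    // not stored in the bytes because the schema carries that information.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
    encoder.flush();
    byte[] bytes = out.toByteArray();

    // Deserialize: a schema must be supplied again to interpret the bytes.
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
    GenericRecord decoded =
        new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
    System.out.println(decoded);
  }
}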

An important aspect of data management is schema evolution. After the initial schema is defined, applications may need to evolve it over time. When this happens, it’s critical that downstream consumers can seamlessly handle data encoded with both the old and the new schema. This is an area that tends to be overlooked, and without thinking it through carefully, people often pay a high cost later on.

There are three common patterns of schema evolution: backward compatibility, forward compatibility, and full compatibility.

Backward Compatibility

First, consider the case where all the data is loaded into HDFS and we want to run SQL queries (e.g., using Apache Hive) over all the data. To support this kind of use case, we can evolve the schemas in a backward compatible way: data encoded with the old schema can be read with the newer schema. Avro has a set of rules on what changes are allowed in the new schema for it to be backward compatible. If all schemas are evolved in a backward compatible way, we can always use the latest schema to query all the data uniformly. For example, an application can evolve the previous user schema to the following by adding a new field favorite_color.

{"namespace": "example.avro",
 "type": "record",
 "name": "user",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number",  "type": "int"},
     {"name": "favorite_color", "type": "string", "default": "green"}
 ]
}

Note that the new field has a default value “green”. This allows data encoded with the old schema to be read with the new one: the default value specified in the new schema is used for the missing field when deserializing data encoded with the old schema. Had the default value been omitted, the new schema would not be backward compatible with the old one, since it would not be clear what value to assign to the new field, which is missing from the old data.
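As a sketch of how this looks with Avro’s Java API (continuing the earlier example; OLD_SCHEMA and NEW_SCHEMA are hypothetical constants holding the two schema JSON strings shown above, and bytes holds a record serialized with the old schema):

Schema oldSchema = new Schema.Parser().parse(OLD_SCHEMA);
Schema newSchema = new Schema.Parser().parse(NEW_SCHEMA);

// Writer schema = old, reader schema = new: Avro's schema resolution fills in
// the favorite_color field, which is missing in the old data, with its default.
GenericDatumReader<GenericRecord> reader =
    new GenericDatumReader<>(oldSchema, newSchema);
BinaryDecoder oldDecoder = DecoderFactory.get().binaryDecoder(bytes, null);
GenericRecord upgraded = reader.read(null, oldDecoder);
System.out.println(upgraded.get("favorite_color"));  // prints "green"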

Forward Compatibility

Second, consider another use case where a consumer has application logic tied to a particular version of the schema. When the schema evolves, the application logic may not be updated immediately. Therefore, we need to be able to project data encoded with a newer schema onto the schema that the application understands. To support this use case, we can evolve the schemas in a forward compatible way: data encoded with the new schema can be read with the old schema. For example, the new user schema above is also forward compatible with the old one. When projecting data written with the new schema onto the old one, the new field is simply dropped. Had the new schema instead removed the field favorite_number, it would not be forward compatible with the original user schema, since we wouldn’t know how to fill in the value of favorite_number for the new data.
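The projection in the other direction looks similar in the Java API (again a sketch continuing the example above; newBytes is a hypothetical byte array holding a record serialized with the new schema):

// Writer schema = new, reader schema = old: during decoding, Avro simply skips
// the favorite_color field, which the old schema doesn't know about.
GenericDatumReader<GenericRecord> projector =
    new GenericDatumReader<>(newSchema, oldSchema);
BinaryDecoder newDecoder = DecoderFactory.get().binaryDecoder(newBytes, null);
GenericRecord projected = projector.read(null, newDecoder);
// projected has only name and favorite_number, as defined by the old schema.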

Full Compatibility

Finally, to support both of the previous use cases on the same data, we can evolve the schemas in a fully compatible way: old data can be read with the new schema, and new data can also be read with the old schema.

As you can see, when using Avro, one of the most important things is to manage the schemas and reason about how they should evolve. Confluent’s Schema Registry is built for exactly that purpose. You can find the details on how we use it to store Avro schemas and enforce certain compatibility rules during schema evolution in the Schema Registry API Reference.