# Support for Avro Files
You can use the Greenplum Database gphdfs protocol to access Avro files on a Hadoop file system (HDFS).
## About the Avro File Format

An Avro file stores both the data definition (schema) and the data together in one file, making it easy for programs to dynamically understand the information stored in an Avro file. The Avro schema is in JSON format; the data is in a binary format, making it compact and efficient.
For example, this Avro schema defines an Avro record with three fields:

- name
- favorite_number
- favorite_color

```json
{"namespace": "example.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number", "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}
```

These are two rows of data based on the schema:

```json
{ "name" : "miguno" , "favorite_number" : 6 , "favorite_color" : "red" }
{ "name" : "BlizzardCS" , "favorite_number" : 21 , "favorite_color" : "green" }
```
For information about the Avro file format, see http://avro.apache.org/docs/1.7.7/
## Avro File Format Support
- Avro file format - GPDB certified with Avro version 1.7.7
- Reading and writing Avro files
- Support for overriding the Avro schema when reading an Avro file
- Compressing Avro files during writing
- Automatic Avro schema generation when writing an Avro file
Greenplum Database returns an error if the Avro file contains unsupported features or if the specified schema does not match the data.
## Reading from and Writing to Avro Files

To read from or write to an Avro file, you create the external table with FORMAT 'AVRO':

```sql
CREATE EXTERNAL TABLE tablename (column_spec)
   LOCATION ( 'gphdfs://location')
   FORMAT 'AVRO'
```
The location can be an individual Avro file or a directory containing a set of Avro files. If the location specifies multiple files (a directory name or a file name containing wildcard characters), Greenplum Database uses the schema in the first file of the directory as the schema of the whole directory. In the file name, you can specify the wildcard character * to match any number of characters.
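For example, this sketch of a CREATE EXTERNAL TABLE command uses a wildcard to read a set of Avro files in one directory; the host, port, paths, and table definition are hypothetical:

```sql
-- Hypothetical host, port, and path; the wildcard * matches any number of characters.
-- The schema of the first matching file is used for all of the files.
CREATE EXTERNAL TABLE avro_sales (id int, region text)
   LOCATION ('gphdfs://my_hdfs:8020/data/sales/part-*.avro')
   FORMAT 'avro';
```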
You can add parameters after the file specified in the location. Add parameters with HTTP query string syntax: start the parameter list with ? and separate field and value pairs with &.

For readable external tables, the only valid parameter is schema. The gphdfs protocol uses this schema instead of the Avro file schema when reading Avro files. See Avro Schema Overrides for Readable External Tables.

For writable external tables, you can specify schema, namespace, and parameters for compression.
| Parameter | Value | Readable/Writable | Default Value |
|---|---|---|---|
| schema | URL_to_schema_file | Read and Write | None. For a readable external table, the Avro file schema is used. For a writable external table, a schema is generated based on the external table definition. |
| namespace | avro_namespace | Write only | public.avro. If specified, a valid Avro namespace. |
| compress | true or false | Write only | false |
| compression_type | block | Write only | Optional. For avro format, compression_type must be block if compress is true. |
| codec | deflate or snappy | Write only | deflate |
| codec_level (deflate codec only) | integer between 1 and 9 | Write only | 6. The level controls the trade-off between speed and compression. Valid values are 1 to 9, where 1 is the fastest and 9 is the most compressed. |
This set of parameters specifies snappy compression:

```
'compress=true&codec=snappy'
```
These two sets of parameters specify deflate compression and are equivalent:

```
'compress=true&codec=deflate&codec_level=1'
'compress=true&codec_level=1'
```
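Putting the compression parameters together, a writable external table that writes block-compressed Avro data with the snappy codec might be defined as follows; the host, path, and table definition are hypothetical:

```sql
-- Hypothetical host, path, and columns; compression parameters as documented above.
CREATE WRITABLE EXTERNAL TABLE sales_out (id int, region text)
   LOCATION ('gphdfs://my_hdfs:8020/tmp/sales_out.avro ?compress=true&compression_type=block&codec=snappy')
   FORMAT 'avro';
```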
## Data Conversion When Reading Avro Files

When reading data from an Avro file, Greenplum Database performs these conversions:

- For an Avro primitive data type, Greenplum Database converts the data to a Greenplum Database type.
- For an Avro complex data type that is not map or record, Greenplum Database converts the data to a Greenplum Database type.
- For an Avro record that is a sub-record (nested within the top-level Avro schema record), Greenplum Database converts the data to XML.
This table lists the Avro primitive data types and the Greenplum Database types they are converted to.

| Avro Data Type | Greenplum Database Data Type |
|---|---|
| null | Supported only in an Avro union data type. See Data Conversion when Writing Avro Files. |
| boolean | boolean |
| int | int |
| long | bigint |
| float | real |
| double | double |
| bytes | bytea |
| string | text |
This table lists the Avro complex data types and the Greenplum Database types they are converted to.

| Avro Data Type | Greenplum Database Data Type |
|---|---|
| enum | int. The integer represents the zero-based position of the symbol in the schema. |
| array | array. The Greenplum Database array dimensions match the Avro array dimensions. The element type is converted from the Avro data type to the Greenplum Database data type. |
| maps | Not supported |
| union | The first non-null data type. |
| fixed | bytea |
| record | XML data |
## Example Avro Schema

This is an example Avro schema. When reading the data from the Avro file:

- name and color data are converted to Greenplum Database text.
- number data is converted to Greenplum Database int.
- clist records are converted to XML.

```json
{"namespace": "example.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "number", "type": ["int", "null"]},
     {"name": "color", "type": ["string", "null"]},
     {"name": "clist", "type": {
         "type": "record",
         "name": "clistRecord",
         "fields": [
             {"name": "class", "type": ["string", "null"]},
             {"name": "score", "type": ["double", "null"]},
             {"name": "grade", "type": {
                 "type": "record",
                 "name": "inner2",
                 "fields": [
                     {"name": "a", "type": ["double", "null"]},
                     {"name": "b", "type": ["string", "null"]}
                 ]}
             },
             {"name": "grade2", "type": {
                 "type": "record",
                 "name": "inner",
                 "fields": [
                     {"name": "a", "type": ["double", "null"]},
                     {"name": "b", "type": ["string", "null"]},
                     {"name": "c", "type": {
                         "type": "record",
                         "name": "inner3",
                         "fields": [
                             {"name": "c1", "type": ["string", "null"]},
                             {"name": "c2", "type": ["int", "null"]}
                         ]}
                     }
                 ]}
             }
         ]}
     }
 ]
}
```

This is an example of a clist sub-record converted to XML data:

```xml
<clist type="clistRecord">
  <class type="string">math</class>
  <score type="double">99.5</score>
  <grade type="inner2">
    <a type="double">88.8</a>
    <b type="string">subb0</b>
  </grade>
  <grade2 type="inner">
    <a type="double">77.7</a>
    <b type="string">subb20</b>
    <c type="inner3">
      <c1 type="string">subc</c1>
      <c2 type="int">0</c2>
    </c>
  </grade2>
</clist>
```
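Based on these conversions, a readable external table for the example schema could be defined as follows; this is a sketch, with a hypothetical host and file path, in which the clist sub-record is read as XML:

```sql
-- Hypothetical host and path; column types follow the conversions described above.
CREATE EXTERNAL TABLE users_avro (name text, number int, color text, clist xml)
   LOCATION ('gphdfs://my_hdfs:8020/avro/users.avro')
   FORMAT 'avro';
```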
## Avro Schema Overrides for Readable External Tables
When you specify the schema parameter for a readable external table that uses an Avro file as a source, Greenplum Database uses that schema when reading data from the Avro file. The specified schema overrides the Avro file schema.

You can specify a file that contains an Avro schema as part of the location parameter in the CREATE EXTERNAL TABLE command to override the Avro file schema. If a set of Avro files contains different, related schemas, you can specify an Avro schema to retrieve the data common to all the files.

Greenplum Database extracts the data from the Avro files based on the field name. If an Avro file contains a field with the same name, Greenplum Database reads the data; otherwise, NULL is returned.

For example, suppose a set of Avro files contains data with one of two different schemas. This is the original schema:
```json
{
    "type": "record",
    "name": "tav2",
    "namespace": "public.avro",
    "doc": "",
    "fields": [
        {"name": "id", "type": ["null", "int"], "doc": ""},
        {"name": "name", "type": ["null", "string"], "doc": ""},
        {"name": "age", "type": ["null", "long"], "doc": ""},
        {"name": "birth", "type": ["null", "string"], "doc": ""}
    ]
}
```
This updated schema contains a comment field:
```json
{
    "type": "record",
    "name": "tav2",
    "namespace": "public.avro",
    "doc": "",
    "fields": [
        {"name": "id", "type": ["null", "int"], "doc": ""},
        {"name": "name", "type": ["null", "string"], "doc": ""},
        {"name": "birth", "type": ["null", "string"], "doc": ""},
        {"name": "age", "type": ["null", "long"], "doc": ""},
        {"name": "comment", "type": ["null", "string"], "doc": ""}
    ]
}
```
This is the Avro schema in the override schema file (cust.avsc in the following example):

```json
{
    "type": "record",
    "name": "tav2",
    "namespace": "public.avro",
    "doc": "",
    "fields": [
        {"name": "id", "type": ["null", "int"], "doc": ""},
        {"name": "name", "type": ["null", "string"], "doc": ""},
        {"name": "birth", "type": ["null", "string"], "doc": ""},
        {"name": "comment", "type": ["null", "string"], "doc": ""}
    ]
}
```
This CREATE EXTERNAL TABLE command specifies the schema override file:

```sql
CREATE EXTERNAL TABLE cust_avro (id int, name text, birth date)
   LOCATION ('gphdfs://my_hdfs:8020/tmp/cust*.avro ?schema=hdfs://my_hdfs:8020/avro/cust.avsc')
   FORMAT 'avro';
```
When reading the Avro data, if Greenplum Database reads a file that does not contain a comment field, a NULL is returned for the comment data.
## Data Conversion when Writing Avro Files
When you create a writable external table to write data to an Avro file, each table row is an Avro record and each table column is an Avro field. When writing an Avro file, the default compression algorithm is deflate.
For a writable external table, if the schema option is not specified, Greenplum Database creates an Avro schema for the Avro file based on the Greenplum Database external table definition. The name of the table column is the Avro field name. The data type is a union data type. See the following table:
| Greenplum Database Data Type | Avro Union Data Type Definition |
|---|---|
| boolean | ["boolean", "null"] |
| int | ["int", "null"] |
| bigint | ["long", "null"] |
| real | ["float", "null"] |
| double | ["double", "null"] |
| bytea | ["bytes", "null"] |
| text | ["string", "null"] |
| array | [{array}, "null"]. The Greenplum Database array is converted to an Avro array with the same dimensions and the same element type as the Greenplum Database array. |
| other data types | ["string", "null"]. Data are formatted strings. The gphdfs protocol casts the data to Greenplum Database text and writes the text to the Avro file as an Avro string. For example, date and time data are formatted as date and time strings and converted to the Avro string type. |
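As an illustration of automatic schema generation, a writable external table defined with the columns (id int, name text) would produce union-typed Avro fields along these lines. The record name here is an assumption, while the public.avro namespace is the documented default:

```json
{
    "namespace": "public.avro",
    "type": "record",
    "name": "my_table",
    "fields": [
        {"name": "id", "type": ["int", "null"]},
        {"name": "name", "type": ["string", "null"]}
    ]
}
```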
You can specify a schema with the schema option. The schema file can be on the segment hosts or on HDFS where it is accessible to Greenplum Database. For a local file, the file must exist in the same location on all segment hosts. For a file on HDFS, the file must exist in the same cluster as the data file.
This example schema option specifies a schema on HDFS:

```
'schema=hdfs://mytest:8000/avro/array_simple.avsc'
```

This example schema option specifies a schema on the host file system:

```
'schema=file:///mydata/avro_schema/array_simple.avsc'
```
## Limitations
For a Greenplum Database writable external table definition, columns cannot specify the NOT NULL clause.
Greenplum Database does not support the Avro map data type and returns an error when one is encountered.
When Greenplum Database reads an array from an Avro file, the array is converted to the literal text value. For example, the array [1,3] is converted to '{1,3}'.
User-defined types (UDTs), including array UDTs, are supported. For a writable external table, the type is converted to a string.
## Examples
This CREATE EXTERNAL TABLE command reads the data in a single Avro file:

```sql
CREATE EXTERNAL TABLE avro1 (id int, ba bytea[])
   LOCATION ('gphdfs://my_hdfs:8020/avro/singleAvro/array2.avro')
   FORMAT 'avro';
```

This CREATE WRITABLE EXTERNAL TABLE command specifies the Avro schema that the gphdfs protocol uses to create the Avro file:

```sql
CREATE WRITABLE EXTERNAL TABLE at1w (id int, names text[], nums int[])
   LOCATION ('gphdfs://my_hdfs:8020/tmp/at1 ?schema=hdfs://my_hdfs:8020/avro/array_simple.avsc')
   FORMAT 'avro';
```
This CREATE WRITABLE EXTERNAL TABLE command writes to an Avro file and specifies a namespace for the Avro schema:

```sql
CREATE WRITABLE EXTERNAL TABLE atudt1 (id int, info myt, birth date, salary numeric)
   LOCATION ('gphdfs://my_hdfs:8020/tmp/emp01.avro ?namespace=public.pivotal.avro')
   FORMAT 'avro';
```