Support for Parquet Files

You can use the Greenplum Database gphdfs protocol to access Parquet files on a Hadoop file system (HDFS).

About the Parquet File Format

The Parquet file format is designed to take advantage of compressed, efficient columnar data representation available to projects in the Hadoop ecosystem. Parquet supports complex nested data structures and uses Dremel record shredding and assembly algorithms. Parquet supports very efficient compression and encoding schemes. Parquet allows compression schemes to be specified on a per-column level, and supports adding more encodings as they are invented and implemented.

For information about the Parquet, see the Parquet documentation http://parquet.apache.org/documentation/latest/.

For an overview of columnar data storage and the Parquet file format, see https://blog.twitter.com/2013/dremel-made-simple-with-parquet.

Parquet File Format Support

The Greenplum Database gphdfs protocol supports the Parquet file format version 1 or 2. Parquet takes advantage of compressed, columnar data representation on HDFS. In a Parquet file, the metadata (Parquet schema definition) contains data structure information is written after the data to allow for single pass writing.

This is an example of the Parquet schema definition format:

message test {
    repeated byte_array binary_field;
    required int32 int32_field;
    optional int64 int64_field;
    required boolean boolean_field;
    required fixed_len_byte_array(3) flba_field;
    required byte_array someDay (utf8);
    };

The definition for last field someDay specifies the binary data type with the utf8 annotation. The data type and annotation defines the data as a UTF-8 encoded character string.

Reading from and Writing to Parquet Files

To read from or write to a Parquet file, you create an external table and specify the location of the parquet file in the LOCATION clause and 'PARQUET' in the FORMAT clause. For example, this is the syntax for a readable external table.

CREATE EXTERNAL TABLE tablename (column_spec) LOCATION ( 'gphdfs://location') FORMAT 'PARQUET'

The location can be an Parquet file or a directory containing a set of Parquet files. For the file name you can specify the wildcard character * to match any number of characters. If the location specifies multiple files when reading Parquet files, Greenplum Database uses the schema in the first file that is read as the schema for the other files.

Reading a Parquet File

The following table lists how Greenplum database converts the Parquet data type if the Parquet schema definition does not contain an annotation.

Table 1. Data Type Conversion when Reading a Parquet File
Parquet Data Type	Greenplum Database Data Type
boolean	boolean
int32	int
int64	long
int96	bytea
float	real
double	double
fixed_lenth_byte_array	bytea
byte_array	bytea

The gphdfs protocol considers Parquet schema annotations for these cases. Otherwise, data conversion is based on the parquet schema primitive type :

Table 2. Data Type (with Annotation) Conversion when Reading Parquet File
Parquet Schema Data Type and Annotation	Greenplum Database Data Type
binary with `json` or `utf8` annotation	text
binary and the Greenplum Database column data type is text	text
int32, int64, fixed_lenth_byte_array, or binary with `decimal` annotation	decimal
`repeated`	array column - The data type is converted according to Table 1
`optional`, `required`	Data type is converted according to Table 1

Note: See Limitations and Notes and the Parquet documentation when specifying decimal, date, interval, or time* annotations.

The gphdfs protocol converts the field data to text if the Parquet field type is binary without any annotation, and the data type is defined as text for the corresponding Greenplum Database external table column.

When reading Parquet type group, the gphdfs protocol converts the group data into an XML document.

This schema contains a required group with the name inner.

message test {
    required byte_array binary_field;
    required int64 int64_field;
    required group inner {
       int32 age;
       required boolean test;
       required byte_array name (UTF8);
       } 
    };

This how a single row of the group data would be converted to XML.

<inner type="group">
  <age type="int">50</age>
  <test type="boolean">true</test>
  <name type="string">fred</name>
</inner>

This example schema contains a repeated group with the name inner.

message test {
    required byte_array binary_field;
    required int64 int64_field;
    repeated group inner {
       int32 age;
       required boolean test;
       required byte_array name (UTF8);
       } 
    };

For a repeated group, the Parquet file can contain multiple sets of the group data in a single row. For the example schema, the data for the inner group is converted into XML data.

This is sample output if the data in the Parquet file contained two sets of data for the inner group.

<inner type="repeated">
  <inner type="group">
    <age type="int">50</age>
    <test type="boolean">true</test>
    <name type="string">fred</name>
  </inner>
  <inner>
    <age type="int">23</age>
    <test type="boolean">false</test>
    <name type="string">sam</name>
  </inner>
</inner>

Writing a Parquet File

For writable external tables, you can add parameters after the file specified in the location. You add parameters with the http query string syntax that starts with ? and & between field and value pairs.

Table 3. Parquet Format External Table location Parameters
Option	Values	Readable/Writeable	Default Value
schema	`URL_to_schema`	Write only	None. If not specified, the `gphdfs` protocol creates a schema according to the external table definition.
pagesize	> 1024 Bytes	Write only	1 MB
rowgroupsize	> 1024 Bytes	Write only	8 MB
version	`v1`, `v2`	Write only	`v1`
codec	`UNCOMPRESSED`, `GZIP`, `LZO`, `snappy`	Write only	`UNCOMPRESSED`
dictionaryenable¹	`true`, `false`	Write only	false
dictionarypagesize¹	> 1024 Bytes	Write only	512 KB

Note:

Creates an internal dictionary. Enabling a dictionary can improve Parquet file compression if text columns contain similar or duplicate data.

When writing a Parquet file, the gphdfs protocol can generate a Parquet schema based on the table definition.

The table name is used as the Parquet message name.
The column name is uses as the Parquet field name.

When creating the Parquet schema from a Greenplum Database table definition, the schema is generated based on the column data types and

Table 4. Schema Data Type Conversion when Writing a Parquet File
Greenplum Database Data Type	Parquet Schema Data Type
boolean	optional boolean
int, smallint	optional int32
bigint	optional int64
real	optional float
double	optional double
numeric or decimal	binary with annotation `decimal`
bytea	optional binary
array column	repeated field - The data type is the same data type as the Greenplum Database the array. For example, `array[int]` is converted to `repeated int`
Others	binary with annotation `utf8`

Note: To support Null data, gphdfs protocol specifies the Parquet optional schema annotation when creating a Parquet schema.

A simple example of a Greenplum Database table definition and the Parquet schema generated by the gphdfs protocol.

An example external table definition for a Parquet file.

CREATE WRITABLE EXTERNAL TABLE films (
   code char(5), 
   title varchar(40),
   id integer,
   date_prod date, 
   subtitle boolean
) LOCATION ( 'gphdfs://my-films') FORMAT 'PARQUET' ;

This is the Parquet schema for the Parquet file my-films generated by the gphdfs protocol.

message films {
    optional byte_array code;
    optional byte_array title (utf8);
    optional int32 id;
    optional binary date_prod (utf8);
    optional byte_array(3) flba_field;
    optional boolean subtitle;
    };

Limitations and Notes

For writable external tables, column definitions in Greenplum Database external table cannot specify NOT NULL to support automatically generating a Parquet schema. When the gphdfs protocol automatically generates a Parquet schema, the gphdfs protocol specifies the field attribute optional to support null in the Parquet schema. Repeated fields can be null in Parquet.
The gphdfs protocol supports Parquet nested group structures only for readable external files. The nested structures are converted to an XML document.
Greenplum Database does not have an unsigned int data type. Greenplum Database converts the Parquet unsigned int data type to the next largest Greenplum Database int type. For example, Parquet uint_8 is converted to Greenplum Database int (32 bit).
Greenplum Database supports any UDT data type or UDT array data type. Greenplum Database attempts to convert the UDT to a sting. If the UDT cannot be converted to a sting, Greenplum Database returns an error.
The definition of the Interval data type in Parquet is significantly different than the Interval definition in Greenplum Database and cannot be converted. The Parquet Interval data is formatted as bytea.
The Date data type in Parquet is starts from 1970.1.1, while Date in Greenplum Database starts from 4173 BC, Greenplum Database cannot convert date data types because largest values are different. A similar situation occurs between Timestamp_millis in Parquet and Timestamp in Greenplum Database.