Reading and Writing Custom-Formatted HDFS Data

Use MapReduce and the CREATE EXTERNAL TABLE command to read and write data with custom formats on HDFS.

To read custom-formatted data:

Author and run a MapReduce job that creates a copy of the data in a format accessible to Greenplum Database.
Use CREATE EXTERNAL TABLE to read the data into Greenplum Database.

See Example 1 - Read Custom-Formatted Data from HDFS.

To write custom-formatted data:

Write the data.
Author and run a MapReduce program to convert the data to the custom format and place it on the Hadoop Distributed File System.

See Example 2 - Write Custom-Formatted Data from Greenplum Database to HDFS.

MapReduce code is written in Java. Greenplum provides Java APIs for use in the MapReduce code. The Javadoc is available in the $GPHOME/docs directory. To view the Javadoc, expand the file gnet-1.1-javadoc.tar and open index.html. The Javadoc documents the following packages:

com.emc.greenplum.gpdb.hadoop.io
com.emc.greenplum.gpdb.hadoop.mapred
com.emc.greenplum.gpdb.hadoop.mapreduce.lib.input
com.emc.greenplum.gpdb.hadoop.mapreduce.lib.output

The HDFS cross-connect packages contain the Java library, which contains the packages GPDBWritable, GPDBInputFormat, and GPDBOutputFormat. The Java packages are available in $GPHOME/lib/hadoop. Compile and run the MapReduce job with the cross-connect package. For example, compile and run the MapReduce job with gphd-1.0-gnet-1.0.0.1.jar if you use the Greenplum HD 1.0 distribution of Hadoop.

To make the Java library available to all Hadoop users, the Hadoop cluster administrator should place the corresponding gphdfs connector jar in the $HADOOP_HOME/lib directory and restart the job tracker. If this is not done, a Hadoop user can still use the gphdfs connector jar; but with the distributed cache technique.