HadoopReader

Available in Community Designer

Short Description

HadoopReader reads Hadoop sequence files.

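| Component | Data source | Input ports | Output ports | Each to all outputs | Different to different outputs | Transformation | Transf. req. | Java | CTL | Auto-propagated metadata |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HadoopReader | Hadoop Sequence File | 0–1 | 1 | no | no | no | no | no | no | no |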

Ports

| Port type | Number | Required | Description | Metadata |
| --- | --- | --- | --- | --- |
| Input | 0 | no | For Input Port Reading. Only source mode is supported. | Any |
| Output | 0 | yes | For read data records. | Any |

Metadata

HadoopReader does not propagate metadata.

HadoopReader has no metadata template.

HadoopReader Attributes

| Attribute | Req | Description | Possible values |
| --- | --- | --- | --- |
| Basic | | | |
| Hadoop connection | | Hadoop connection with Hadoop libraries containing the Hadoop sequence file parser implementation. If a Hadoop connection ID is specified in an hdfs:// URL in the File URL attribute, the value of this attribute is ignored. | Hadoop connection ID |
| File URL | yes | URL of a file on HDFS or on the local file system. URLs without a protocol (that is, absolute or relative paths) or with the file:// protocol are treated as located on the local file system. If the file is located on HDFS, use a URL of the form hdfs://ConnID/path/to/myfile, where ConnID is the ID of a Hadoop connection (the Hadoop connection attribute is then ignored) and /path/to/myfile is the absolute path on the corresponding HDFS to the file named myfile. | |
| Key field | yes | Name of the output edge record field in which the key of each key-value pair is stored. | |
| Value field | yes | Name of the output edge record field in which the value of each key-value pair is stored. | |
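
The File URL rules above can be illustrated with a small sketch. It is not part of the component; classify_file_url is a hypothetical helper that only mirrors the documented behavior: no protocol or file:// means the local file system, and hdfs://ConnID/... names a Hadoop connection by its ID. Other protocols are simply rejected here, which is an assumption of the sketch.

```python
def classify_file_url(url):
    """Return (location, connection_id, path) for a File URL value.

    Hypothetical helper that mirrors the documented rules only.
    """
    if "://" not in url:
        # No protocol: an absolute or relative path on the local file system.
        return ("local", None, url)
    scheme, rest = url.split("://", 1)
    if scheme == "file":
        # file:// URLs also point to the local file system.
        return ("local", None, rest)
    if scheme == "hdfs":
        # hdfs://ConnID/path/to/myfile: the part before the first slash
        # is the ID of the Hadoop connection, the remainder is the HDFS path.
        conn_id, _, path = rest.partition("/")
        return ("hdfs", conn_id, "/" + path)
    # Assumption of this sketch: anything else is out of scope.
    raise ValueError("protocol not covered by this sketch: " + scheme)


print(classify_file_url("hdfs://ConnID/path/to/myfile"))
print(classify_file_url("data-in/products.dat"))
```

When the location is "hdfs", the connection ID taken from the URL is the one that applies, which is why the separate Hadoop connection attribute is ignored in that case.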

Details

HadoopReader reads data from Hadoop sequence files (org.apache.hadoop.io.SequenceFile). These files contain key-value pairs and are used in MapReduce jobs as input and output file formats. The component can read a single file or a collection of files, which must be located on HDFS or on the local file system.

If you read local sequence files, there is no need to connect to a Hadoop cluster. However, you still need a valid Hadoop connection (with the correct version of the libraries).

The exact version of the file format supported by HadoopReader depends on the Hadoop libraries that you supply in the Hadoop connection referenced from the File URL attribute. In general, sequence files created by one version of Hadoop may not be readable by a different version.

Hadoop sequence files may contain compressed data. HadoopReader detects this automatically and decompresses the data. Which compression codecs are supported again depends on the libraries you specify in the Hadoop connection.

For technical details about Hadoop sequence files, see the Apache Hadoop Wiki.

Examples

Reading data from local sequence files

Read records from the Hadoop sequence file products.dat. The file has ProductID as the key and ProductName as the value.

Solution

Create a valid Hadoop connection or use an existing one. See Hadoop connection.

Use the Hadoop connection, File URL, Key field and Value field attributes.

| Attribute | Value |
| --- | --- |
| Hadoop connection | MyHadoopConnection |
| File URL | ${DATA_IN}/products.dat |
| Key field | ProductID |
| Value field | ProductName |
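
The effect of the Key field and Value field settings can be modeled with a short sketch. The helper to_output_records and the sample pairs are hypothetical and serve only as an illustration: each key-value pair becomes one output record whose ProductID field holds the key and whose ProductName field holds the value.

```python
def to_output_records(pairs, key_field, value_field):
    """Model how each key-value pair is mapped to one output record."""
    return [{key_field: key, value_field: value} for key, value in pairs]


# Hypothetical content of products.dat, made up for illustration.
sample_pairs = [(1, "Keyboard"), (2, "Mouse")]

for record in to_output_records(sample_pairs, "ProductID", "ProductName"):
    print(record)
```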

See also

HadoopWriter
Hadoop connection
Common Properties of Components
Specific Attribute Types
Common Properties of Readers
Readers Comparison