Writing to HDFS from Samza
The `samza-hdfs` module implements a Samza Producer to write to HDFS. The current implementation includes a ready-to-use `HdfsSystemProducer` and two `HdfsWriter`s: one that writes messages of raw bytes to a `SequenceFile` of `BytesWritable` keys and values, and one that writes UTF-8 `String`s to a `SequenceFile` with `LongWritable` keys and `Text` values.
Configuring an HdfsSystemProducer
You can configure an `HdfsSystemProducer` like any other Samza system: using configuration keys and values set in a `job.properties` file. You might configure the system producer for use by your `StreamTask`s like this:
```properties
# Set the SystemFactory implementation to instantiate HdfsSystemProducer aliased to 'hdfs-clickstream'
systems.hdfs-clickstream.samza.factory=org.apache.samza.system.hdfs.HdfsSystemFactory

# Define a serializer/deserializer for the hdfs-clickstream system
systems.hdfs-clickstream.samza.msg.serde=some-serde-impl

# Consumer configs are not needed for the HDFS system; a reader is not implemented yet

# Assign a Metrics implementation via a label defined elsewhere in the props file
systems.hdfs-clickstream.streams.metrics.samza.msg.serde=some-metrics-impl

# Assign the implementation class for this system's HdfsWriter
systems.hdfs-clickstream.producer.hdfs.writer.class=org.apache.samza.system.hdfs.writer.TextSequenceFileHdfsWriter

# Set the HDFS SequenceFile compression type. Only BLOCK compression is supported currently
systems.hdfs-clickstream.producer.hdfs.compression.type=snappy

# The base dir for HDFS output. The default Bucketer for SequenceFile HdfsWriters
# is currently /BASE/JOB_NAME/DATE_PATH/FILES, where BASE is set below
systems.hdfs-clickstream.producer.hdfs.base.output.dir=/user/me/analytics/clickstream_data

# Assign the implementation class for the HdfsWriter's Bucketer
systems.hdfs-clickstream.producer.hdfs.bucketer.class=org.apache.samza.system.hdfs.writer.JobNameDateTimeBucketer

# Configure the DATE_PATH format the Bucketer uses to bucket output files by day for this job run
systems.hdfs-clickstream.producer.hdfs.bucketer.date.path.format=yyyy_MM_dd

# Optionally set the max output bytes per file. A new file will be cut and output
# continued on the next write call each time this many bytes are written.
systems.hdfs-clickstream.producer.hdfs.write.batch.size.bytes=134217728
```
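For reference, the `date.path.format` value above appears to follow Java's standard date-pattern syntax, and the default batch size of 134217728 bytes is exactly 128 MiB (a common HDFS block size). A minimal standalone sketch (plain Java, no Samza dependencies; the job name in the sample path is illustrative) of what a resulting bucket path looks like:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class BucketPathDemo {
    public static void main(String[] args) {
        // Format a date with the same pattern configured above.
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy_MM_dd");
        String datePath = LocalDate.of(2015, 3, 17).format(fmt);

        // /BASE/JOB_NAME/DATE_PATH/FILES -- 'my-job' here is a hypothetical job name.
        String dir = "/user/me/analytics/clickstream_data/my-job/" + datePath;
        System.out.println(dir); // /user/me/analytics/clickstream_data/my-job/2015_03_17

        // The configured write.batch.size.bytes is 128 MiB.
        System.out.println(134217728 == 128 * 1024 * 1024); // true
    }
}
```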
The above configuration assumes that Metrics and Serde implementations have been properly configured against the `some-serde-impl` and `some-metrics-impl` labels elsewhere in the same `job.properties` file. Each of these properties has a reasonable default, so you can leave out any you don't need to customize for your job run.
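For completeness, here is one way the `some-serde-impl` label could be bound elsewhere in the same `job.properties` file, using Samza's `serializers.registry` syntax (the factory class shown is Samza's built-in byte serde; pick whichever serde matches your chosen `HdfsWriter`):

```properties
# Bind the 'some-serde-impl' label referenced above to a concrete serde factory.
# ByteSerdeFactory passes raw bytes through, matching the raw-bytes SequenceFile writer.
serializers.registry.some-serde-impl.class=org.apache.samza.serializers.ByteSerdeFactory
```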