Configuration Options

Schema Registry Configuration

schema.registry.url

URL of the Schema Registry instance. This is required for the Avro message decoder to interact with Schema Registry.

  • Type: string
  • Required: Yes
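
For example, pointing the decoder at a local Schema Registry (the host and port below are placeholders; 8081 is the Schema Registry default port):

schema.registry.url=http://localhost:8081
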
max.schemas.per.subject

The max number of schema objects allowed per subject.

  • Type: int
  • Required: No
  • Default: 1000
is.new.producer

Set to true if the data Camus is consuming from Kafka was created using the new producer, or set to false if it was created with the old producer.

  • Type: boolean
  • Required: No
  • Default: true

Camus Job Configuration

camus.job.name

The name of the Camus job.

  • Type: string
  • Required: No
  • Default: "Camus Job"
camus.message.decoder.class

Decoder class for converting Kafka messages to Avro records. The default is the Confluent version of AvroMessageDecoder, which integrates with Schema Registry.

  • Type: string
  • Required: No
  • Default: io.confluent.camus.etl.kafka.coders.AvroMessageDecoder
etl.record.writer.provider.class

Class for writing records to HDFS/S3.

  • Type: string
  • Required: No
  • Default: com.linkedin.camus.etl.kafka.common.AvroRecordWriterProvider
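
The default provider writes Avro files. As a sketch of plain-text output instead, assuming your Camus build ships LinkedIn's StringRecordWriterProvider (verify the class name against your distribution):

etl.record.writer.provider.class=com.linkedin.camus.etl.kafka.common.StringRecordWriterProvider
etl.output.record.delimiter=\n
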
etl.partitioner.class

Class for partitioning Camus output. The default partitioner writes incoming data into hourly partitions (see the example layout below).

  • Type: string
  • Required: No
  • Default: com.linkedin.camus.etl.kafka.partitioner.DefaultPartitioner
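
With the default partitioner, output lands under etl.destination.path in a layout along these lines (an illustrative sketch only; the exact directories depend on the partitioner, topic names, and record writer):

<etl.destination.path>/<topic>/hourly/<year>/<month>/<day>/<hour>/...
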
camus.work.allocator.class

Class for creating input splits from ETL requests.

  • Type: string
  • Required: No
  • Default: com.linkedin.camus.workallocater.BaseAllocator
etl.destination.path

Top-level output directory; sub-directories are created dynamically for each topic pulled.

  • Type: string
  • Required: Yes
etl.execution.base.path

HDFS location for execution files, i.e., offsets, error logs, and count files.

  • Type: string
  • Required: Yes
etl.execution.history.path

HDFS location for historical execution files, usually a sub-directory of etl.execution.base.path.

  • Type: string
  • Required: Yes
hdfs.default.classpath.dir

All files in this directory will be added to the distributed cache and placed on the classpath for Hadoop tasks.

  • Type: string
  • Required: No
  • Default: null
mapred.map.tasks

Maximum number of Hadoop map tasks to use; each task can pull multiple topic partitions.

  • Type: int
  • Required: No
  • Default: 30

Kafka Configuration

kafka.brokers

Comma-separated list of Kafka brokers from which Camus pulls metadata (see the example below).

  • Type: string
  • Required: Yes
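
For example (hostnames and ports are placeholders; 9092 is the Kafka default port):

kafka.brokers=kafka1.example.com:9092,kafka2.example.com:9092,kafka3.example.com:9092
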
kafka.max.pull.hrs

The maximum duration, in hours, from the timestamp of the first record pulled to the timestamp of the last. When the limit is reached, the pull stops. -1 means no limit.

  • Type: int
  • Required: No
  • Default: -1
kafka.max.historical.days

Events with a timestamp older than this many days are discarded. -1 means no limit.

  • Type: int
  • Required: No
  • Default: -1
kafka.max.pull.minutes.per.task

Maximum number of minutes each mapper spends pulling messages. -1 means no limit.

  • Type: int
  • Required: No
  • Default: -1
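
The three time-based limits above can be combined. For example, the following settings (values are illustrative, matching the example configuration at the end of this section) pull at most one hour of event time per run, discard events older than three days, and leave per-mapper pull time unbounded:

kafka.max.pull.hrs=1
kafka.max.historical.days=3
kafka.max.pull.minutes.per.task=-1
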
kafka.blacklist.topics

Topics on the blacklist are not pulled from Kafka.

  • Type: string
  • Required: No
  • Default: null
kafka.whitelist.topics

If the whitelist has values, only whitelisted topics are pulled from Kafka (see the example below).

  • Type: string
  • Required: No
  • Default: null
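
Both lists are typically given as comma-separated topic names. For example (topic names are placeholders):

kafka.whitelist.topics=orders,page_views
kafka.blacklist.topics=
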
etl.output.record.delimiter

Delimiter for writing string records.

  • Type: string
  • Required: No
  • Default: "\n"

Example Configuration

# The job name.
camus.job.name=Camus Job


# Kafka brokers to connect to, format: kafka.brokers=host1:port,host2:port,host3:port
kafka.brokers=

# Top-level data output directory, sub-directories are dynamically created for each topic pulled
etl.destination.path=/user/username/topics

# HDFS location to keep execution files: requests, offsets, error logs and count files
etl.execution.base.path=/user/username/exec

# HDFS location to keep historical execution files, usually a sub-directory of the base path
etl.execution.history.path=/user/username/camus/exec/history


# Concrete implementation of the decoder class to use.
camus.message.decoder.class=io.confluent.camus.etl.kafka.coders.AvroMessageDecoder


# Max number of MapReduce tasks to use, each task can pull multiple topic partitions
mapred.map.tasks=30

# Max historical time that will be pulled from each partition based on event timestamp
kafka.max.pull.hrs=1

# Events with a timestamp older than this will be discarded
kafka.max.historical.days=3

# Max minutes for each mapper to pull messages from Kafka (-1 means no limit)
kafka.max.pull.minutes.per.task=-1


# Only topics in the whitelist are pulled from Kafka, and no topics in the blacklist are pulled
kafka.blacklist.topics=
kafka.whitelist.topics=


log4j.configuration=true

# Name of the client as seen by kafka
kafka.client.name=camus


# Stops the mapper from getting inundated with decoder exceptions from the same topic
max.decoder.exceptions.to.print=5
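
With these properties saved to a file (camus.properties is a placeholder name), the job is typically launched as a Hadoop jar. As a sketch, assuming the standard Camus entry point and a shaded example jar built from your Camus checkout:

hadoop jar camus-example-<version>-shaded.jar com.linkedin.camus.etl.kafka.CamusJob -P camus.properties
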