Kafka Connect

Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other data systems. It makes it simple to quickly define connectors that move large data sets into and out of Kafka. Kafka Connect can ingest entire databases or collect metrics from all your application servers into Kafka topics, making the data available for stream processing with low latency. An export connector can deliver data from Kafka topics into secondary indexes like Elasticsearch or into batch systems such as Hadoop for offline analysis.

Kafka Connect’s scope is narrow: it focuses only on copying streaming data to and from Kafka and does not handle other tasks, such as stream processing, which are already well addressed by other systems. This focus makes it much simpler for developers to write high-quality, reliable, and high-performance connector plugins, makes it possible for the framework to provide guarantees that are difficult to achieve in other frameworks, and ultimately makes Kafka Connect easier for developers and administrators to use. It also avoids duplicating functionality already provided by stream processing frameworks. Kafka Connect is not an ETL framework, although it can be an integral component of an ETL pipeline when combined with Kafka and a stream processing framework.

Kafka Connect can run either as a standalone process for testing and one-off jobs, or as a distributed, scalable, fault tolerant service supporting an entire organization. This allows it to scale down to development, testing, and small production deployments with a low barrier to entry and low operational overhead, and to scale up to support a large organization’s data pipeline.

Quickstart

To demonstrate the basic functionality of Kafka Connect and its integration with the Confluent Schema Registry, we’ll run a few local, standalone Kafka Connect processes with connectors that copy data written to a file into Kafka and write data from a Kafka topic to the console.

The following assumes you have Kafka and the Confluent Schema Registry running. See the quickstart for instructions on getting the Confluent Platform services running.

First, read through the standalone worker configuration. This file is used to specify options that are not specific to connectors. For example, in this configuration, we specify that we want to use the AvroConverter class to read/write Avro-formatted data, give the addresses for Kafka and the Schema Registry, and specify a local file where offsets can be stored so we can stop and restart the process and pick up where we left off previously.

$ cat ./etc/schema-registry/connect-avro-standalone.properties
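
The exact contents depend on your installation, but the important settings should look roughly like the following sketch (the broker and Schema Registry addresses assume a local, default-port setup, and the offsets file path is illustrative):

  # Kafka broker(s) the standalone worker connects to
  bootstrap.servers=localhost:9092
  # Use the AvroConverter, backed by the Schema Registry, for message keys and values
  key.converter=io.confluent.connect.avro.AvroConverter
  key.converter.schema.registry.url=http://localhost:8081
  value.converter=io.confluent.connect.avro.AvroConverter
  value.converter.schema.registry.url=http://localhost:8081
  # Local file where the standalone worker stores source connector offsets
  offset.storage.file.filename=/tmp/connect.offsets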

Next, look at the configuration for the source connector that will read input from the file and write each line to Kafka as a message. The configuration contains all the common settings shared by all source connectors: a unique name, the connector class to instantiate, a maximum number of tasks to control parallelism (only 1 makes sense here), and the name of the topic to produce data to. It also has one connector-specific setting: the name of the file to read from.

$ cat ./etc/kafka/connect-file-source.properties
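
Your copy may name things slightly differently (the connector name below is illustrative, and the file may reference the connector class by a short alias), but the settings should look roughly like this:

  # Unique name for this connector instance
  name=local-file-source
  # The file source connector that ships with Kafka
  connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
  # A single task is enough for one input file
  tasks.max=1
  # Connector-specific setting: the file to read from
  file=test.txt
  # Topic to produce the data to
  topic=connect-test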

Now let's seed the file with some sample data. Note that the connector configuration specifies a relative path for the file, so you should create it in the same directory from which you run Kafka Connect.

$ echo -e "log line 1\nlog line 2" > test.txt

Next, start a Kafka Connect instance in standalone mode running this connector. For standalone mode, we can specify the connector configurations directly on the command line:

$ ./bin/connect-standalone ./etc/schema-registry/connect-avro-standalone.properties \
      ./etc/kafka/connect-file-source.properties

Each of the two lines in our log file should be delivered to Kafka, and a schema for the data should be registered with the Schema Registry. One way to validate that the data arrived is to use the console consumer in another terminal to inspect the contents of the topic:

$ ./bin/kafka-avro-console-consumer --zookeeper localhost:2181 --topic connect-test --from-beginning
  "log line 1"
  "log line 2"

We can also start another connector that writes each message to the console. Hit Ctrl-C to gracefully stop the running Kafka Connect process, then take a look at the configuration for this sink connector:

$ cat ./etc/kafka/connect-console-sink.properties

The configuration contains settings similar to those of the file source. Because the connector's functionality is so simple, it has no additional connector-specific configuration parameters.
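
As a rough sketch (the connector name here is illustrative), the sink configuration looks something like this; with no output file configured, the stock file sink connector writes each record to standard output:

  # Unique name for this sink connector instance
  name=local-console-sink
  # File sink connector; without a file setting it logs records to standard output
  connector.class=org.apache.kafka.connect.file.FileStreamSinkConnector
  tasks.max=1
  # Topic(s) to consume from -- note that sinks use "topics" rather than "topic"
  topics=connect-test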

Now start the Kafka Connect standalone process, but this time specify both connector configurations. They will run in the same process, but each will have its own dedicated thread.

$ ./bin/connect-standalone ./etc/schema-registry/connect-avro-standalone.properties \
      ./etc/kafka/connect-file-source.properties ./etc/kafka/connect-console-sink.properties
  ... [ start up logs clipped ] ...
  "log line 1"
  "log line 2"

Once the process is up and running, you should see the two log lines written to the console as the sink connector consumes them. Note that the messages were not written again by the source connector because it was able to resume from the same point in the file where it left off when we shut down the previous process.

With both connectors running, we can see data flowing end-to-end in real time. Use another terminal to add more lines to the text file:

$ echo -e "log line 3\nlog line 4" >> test.txt

and you should see them output on the console of the Kafka Connect standalone process. The new data was picked up by the source connector, written to Kafka, read by the sink connector from Kafka, and finally logged to the console.

Both the source and sink connectors track offsets, so you can stop and restart the process any number of times, add more data to the input file, and both will resume where they previously left off.
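
For example, after stopping the standalone process with Ctrl-C, you could append another line and then start it again with the same commands as before; assuming the offsets file is intact, only the newly appended line should be copied and logged, since the previously copied lines are skipped:

$ echo "log line 5" >> test.txt
$ ./bin/connect-standalone ./etc/schema-registry/connect-avro-standalone.properties \
      ./etc/kafka/connect-file-source.properties ./etc/kafka/connect-console-sink.properties
  ... [ start up logs clipped ] ...
  "log line 5"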

The connectors demonstrated in this quickstart are intentionally simple so that no additional dependencies are necessary. Most connectors require more configuration to specify how to connect to the source or sink system and what data to copy, and for many you will want to run them on a Kafka Connect cluster for scalability and fault tolerance. To get started, see the user guide for more details on running and managing Kafka Connect, including how to run in distributed mode. The Connectors section includes details on configuring and deploying the connectors that ship with Confluent Platform.
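
As a taste of what the user guide covers: in distributed mode, connector configurations are submitted to the workers over a REST API rather than passed on the command line. A sketch of submitting the same file source, assuming a distributed worker listening on the default REST port 8083:

$ curl -X POST -H "Content-Type: application/json" \
      --data '{"name": "local-file-source", "config": {"connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector", "tasks.max": "1", "file": "test.txt", "topic": "connect-test"}}' \
      http://localhost:8083/connectors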

Requirements

  • Kafka 0.9.0.0-cp1
  • Required for Avro support: Schema Registry 2.0.0 recommended, 1.0 minimum