Installing and Configuring Kafka Connect¶
This section describes how you can install and configure a Kafka Connect instance.
Getting Started¶
The quickstart describes how to get started in standalone mode. It demonstrates an end-to-end job, importing data from one “system” (the filesystem) into Kafka, then exporting that same data from the Kafka topic to another “system” (the console). This section is for users comfortable with the concepts and quickstart. Those interested in configuring Kafka Connect with security should go here. This section covers the following:
- Planning a Kafka Connect installation
- Running workers in standalone and distributed modes
- Installing connector plugins
- Configuring workers
Planning for Installation¶
When getting started with Kafka Connect, there are a few considerations to be aware of that will help your environment scale to the long-term needs of your data pipeline. This section aims to provide some context around those decisions.
Prerequisites¶
Kafka Connect has only one hard prerequisite in order to get started: a set of Kafka brokers. However, as your cluster grows, there are a couple of items that are helpful to consider ahead of time:
- Internal Topic Creation
As we will discuss in more detail below, it is important to create Kafka Connect’s required internal topics ahead of time with a high replication factor, a compaction cleanup policy, and the correct number of partitions. This helps avoid having to reconfigure these topics later on.
- Schema Registry
Although the Schema Registry is not a required service for Kafka Connect, it enables you to easily use Avro as the common data format for all connectors. This keeps the need to write custom code to a minimum and standardizes your data in a flexible format. Additionally, you get the benefits of enforced compatibility rules for schema evolution.
Standalone vs. Distributed¶
As we discussed in the concepts section, workers can be run in two different modes. It is useful to identify which mode works best for your environment before getting started. For development or environments that lend themselves to single agents (e.g. sending logs from webservers to Kafka), standalone mode is well suited. In use cases where a single source or sink may require heavy data volumes (e.g. sending data from Kafka to HDFS), distributed mode is more flexible in terms of scalability and offers the added advantage of a highly available service to minimize downtime. In the end, the choice is up to the installer, but knowing what you are moving and where you are moving it with Kafka Connect will help inform this decision. We recommend distributed mode for production deployments for ease of management and scalability.
Deployment Considerations¶
Kafka Connect workers can be deployed in a number of ways, each with their own benefits. Workers lend themselves well to being run in containers in managed environments such as YARN, Mesos, or Docker Swarm, as all state is stored in Kafka, making the local processes themselves stateless. We provide Docker images, and documentation for getting started with those images is available here. By design, Kafka Connect does not automatically handle restarting or scaling workers, which means your existing clustering solutions can continue to be used transparently.
Additionally, Kafka Connect workers are simply JVM processes and as such can be run on shared machines that have sufficient resources. The resource limit depends heavily on the types of connectors being run by the workers, but in most cases users should be aware of CPU and memory bounds when running workers concurrently on a single machine.
Hardware requirements for Kafka Connect workers are similar to those of the standard Java producers and consumers. Deployments that are expected to send large messages will require more memory, while those that make heavy use of compression will require more powerful CPUs. We recommend starting with the default heap size setting and monitoring the JMX metrics and the system to be sure CPU, memory, and network (10GbE and up is recommended) are sufficient for the load.
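For example, like other Kafka command line tools, the worker scripts typically honor the KAFKA_HEAP_OPTS environment variable, so one way to adjust the heap from its default is the following sketch (the sizes shown are only illustrative):
$ KAFKA_HEAP_OPTS="-Xms512M -Xmx2G" bin/connect-distributed worker.properties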
Installing Connector Plugins¶
Confluent Platform ships with commonly used connector plugins that have been tested with the rest of the platform. Additionally, Kafka Connect is designed to be extensible so that it is easy for other developers to create new connectors and for users to run them with minimal effort.
To install a new plugin, place it in the CLASSPATH of the Kafka Connect process. All the scripts for running Kafka Connect will use the CLASSPATH environment variable if it is set when they are invoked, making it easy to run with additional connector plugins:
$ export CLASSPATH=/path/to/my/connectors/*
$ bin/connect-standalone standalone.properties new-custom-connector.properties
Running Workers¶
Standalone Mode¶
To execute a worker in standalone mode, run the following command:
$ bin/connect-standalone worker.properties connector1.properties [connector2.properties connector3.properties ...]
The first parameter is always a configuration file for the worker as described below. This configuration gives you control over settings such as the Kafka cluster to use and the serialization format. All additional parameters are connector configuration files. Each file contains a single connector configuration.
If you run multiple standalone instances on the same host, there are a couple of settings that must be unique between each instance:
offset.storage.file.filename
- storage for connector offsets, which are stored on the local filesystem in standalone mode; using the same file will lead to offset data being deleted or overwritten with different values
rest.port
- the port the REST interface listens on for HTTP requests
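For example, a second standalone worker on the same host might use a worker properties file containing overrides such as the following (the file path and port here are placeholders):
offset.storage.file.filename=/tmp/connect-worker2.offsets
rest.port=8084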
Distributed Mode¶
To start a distributed worker, create a worker configuration just as you would for standalone mode, but with the options for distributed workers specified. We recommend creating the required Kafka topics manually, as follows:
# config.storage.topic=connect-configs
$ bin/kafka-topics --create --zookeeper localhost:2181 --topic connect-configs --replication-factor 3 --partitions 1 --config cleanup.policy=compact
# offset.storage.topic=connect-offsets
$ bin/kafka-topics --create --zookeeper localhost:2181 --topic connect-offsets --replication-factor 3 --partitions 50 --config cleanup.policy=compact
# status.storage.topic=connect-status
$ bin/kafka-topics --create --zookeeper localhost:2181 --topic connect-status --replication-factor 3 --partitions 10 --config cleanup.policy=compact
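You can then confirm that each topic was created with the intended settings, for example:
$ bin/kafka-topics --describe --zookeeper localhost:2181 --topic connect-configs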
Then, simply start the worker process with a worker configuration file. An example configuration file can be found in etc/kafka/connect-distributed.properties; you can copy it and adapt it to your needs:
$ bin/connect-distributed worker.properties
Distributed mode does not have any additional command line parameters other than a worker configuration file. New workers will either start a new group or join an existing one based on the worker properties provided. Workers then coordinate similarly to consumer groups to distribute the work to be done. This is different from standalone mode where users may optionally provide connector configurations at the command line as only a single worker instance exists and no coordination is required in standalone mode. Use the REST API to manage the connectors when running in distributed mode as described here.
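As a sketch of what this looks like, the following request submits a new connector configuration to a worker’s REST API using the FileStreamSource connector that ships with Kafka; the connector name, file, and topic shown are placeholders:
$ curl -X POST -H "Content-Type: application/json" \
    --data '{"name": "local-file-source", "config": {"connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector", "tasks.max": "1", "file": "/tmp/test.txt", "topic": "connect-test"}}' \
    http://localhost:8083/connectors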
In distributed mode, if you run more than one worker per host (for example, if you are testing distributed mode locally during development), the following settings must have different values for each instance:
rest.port
- the port the REST interface listens on for HTTP requests
Configuring Workers¶
Whether you’re running standalone or distributed mode, Kafka Connect workers are configured by passing a properties
file containing any required or overridden options as the first parameter to the worker process. Some example
configuration files are included with the Confluent Platform to help you get started. We recommend using the files
etc/schema-registry/connect-avro-[standalone|distributed].properties
as a starting point because they include the
necessary configuration to use Confluent Platform’s Avro converters that integrate with the Schema Registry. They are
configured to work well with Kafka and Schema Registry services running locally, making it easy to test Kafka Connect
locally; using them with production deployments should only require adjusting the hostnames for Kafka and Schema Registry.
An exhaustive listing of options is provided in the References. This section calls out a few commonly updated configurations.
Common Worker Configs¶
bootstrap.servers
A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. The client will make use of all servers irrespective of which servers are specified here for bootstrapping; this list only impacts the initial hosts used to discover the full set of servers. This list should be in the form host1:port1,host2:port2,.... Since these servers are just used for the initial connection to discover the full cluster membership (which may change dynamically), this list need not contain the full set of servers (you may want more than one, though, in case a server is down).
- Type: list
- Default: [localhost:9092]
- Importance: high
key.converter
Converter class for key Connect data. This controls the format of the data that will be written to Kafka for source connectors or read from Kafka for sink connectors. Popular formats include Avro and JSON.
- Type: class
- Default:
- Importance: high
value.converter
Converter class for value Connect data. This controls the format of the data that will be written to Kafka for source connectors or read from Kafka for sink connectors. Popular formats include Avro and JSON.
- Type: class
- Default:
- Importance: high
internal.key.converter
Converter class for internal key Connect data that implements the
Converter
interface. Used for converting data like offsets and configs.
- Type: class
- Default:
- Importance: low
internal.value.converter
Converter class for internal value Connect data that implements the
Converter
interface. Used for converting data like offsets and configs.
- Type: class
- Default:
- Importance: low
rest.host.name
Hostname for the REST API. If this is set, it will only bind to this interface.
- Type: string
- Importance: low
rest.port
Port for the REST API to listen on.
- Type: int
- Default: 8083
- Importance: low
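Pulling the common options together, a minimal worker properties sketch might look like the following; it mirrors the bundled example files and uses the JsonConverter for illustration (see Configuring Converters below for the Avro setup we recommend):
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
rest.port=8083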
Standalone Worker Configuration¶
In addition to the common worker configuration options, the following are available in standalone mode.
offset.storage.file.filename
The file to store connector offsets in. By storing offsets on disk, a standalone process can be stopped and started on a single node and resume where it previously left off.
- Type: string
- Default: “”
- Importance: high
Distributed Worker Configuration¶
In distributed mode, the workers need to be able to discover each other and share both connector
configurations and offset data. This requires workers to have matching values for group.id
to form a Connect cluster group and have access to three Kafka topics which are defined by the properties config.storage.topic
, offset.storage.topic
, and status.storage.topic
.
These properties are specified in the properties file passed to the connect worker when started in distributed mode as shown here. Detailed descriptions of these properties and guidelines for creating the required topics are below.
group.id
A unique string that identifies the Connect cluster group this worker belongs to.
- Type: string
- Default: “”
- Importance: high
config.storage.topic
The topic to store connector and task configuration data in. This must be the same for all workers with the same
group.id
. This topic should always have a single partition and be highly replicated (3x or more). This topic should be compacted to avoid losing data due to retention policy.
- Type: string
- Default: “”
- Importance: high
offset.storage.topic
The topic to store offset data for connectors in. This must be the same for all workers with the same
group.id
. To support large Kafka Connect clusters, this topic should have a large number of partitions (e.g. 25 or 50, just like Kafka’s built-in __consumer_offsets
topic) and be highly replicated (3x or more). This topic should be compacted to avoid losing data due to retention policy.
- Type: string
- Default: “”
- Importance: high
status.storage.topic
The Kafka topic to store status updates for connectors and tasks. This topic can have multiple partitions and should be highly replicated (3x or more). This topic should be compacted to avoid losing data due to retention policy.
- Type: string
- Default: “”
- Importance: high
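Taken together, the distributed-specific portion of a worker properties file might look like the following sketch, using the topic names created earlier (the group name is illustrative):
group.id=connect-cluster
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status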
Configuring Converters¶
In going through the common worker configurations, you will have noticed the key.converter and value.converter properties where you specify the converters to use. Each converter implementation has its own associated configuration requirements. To configure a converter-specific property, prefix it with the name of the worker property in which the converter is specified. The following example worker property file snippet shows that the AvroConverter bundled with the Schema Registry requires the URL for the Schema Registry to be passed as a property:
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
We recommend using the AvroConverter
for your Kafka Connect data. Those with a need to use JSON for Connect data can use the
JsonConverter
bundled with Kafka. An example of using the JsonConverter
without schemas for converting keys looks like:
key.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
Overriding Producer & Consumer Settings¶
Internally, Kafka Connect uses the standard Java producers and consumers to communicate with Kafka. Kafka Connect configures these producer and consumer instances with good defaults. Most importantly, it uses settings that ensure data from sources will be delivered to Kafka in order and without any loss. The most critical site-specific options, such as the Kafka bootstrap servers, are already exposed via the standard worker configuration.
Occasionally, you may have an application that needs to adjust the default settings. One example is a standalone process that runs a log file connector. For the logs being collected, you might prefer low-latency, best-effort delivery: minor data loss in the case of connectivity issues may be acceptable for this application in order to avoid any data buffering on the client, keeping the log collection as lean as possible.
All new producer configs and
new consumer configs can be overridden by prefixing
them with producer.
or consumer.
, respectively. For example:
producer.retries=1
consumer.max.partition.fetch.bytes=10485760
would override the producers to only retry sending messages once and increase the default amount of data fetched from a partition per request to 10 MB.
Note that these configuration changes are applied to all connectors running on the worker. You should be especially careful making any changes to these settings when running distributed mode workers.
Upgrading Kafka Connect Workers¶
Documentation for upgrading your Kafka Connect workers is found in the platform upgrade section. To upgrade individual connectors, please see our documentation on upgrading connector plugins.