This article covers how to install and configure Apache Kafka on an Ubuntu 20.04 LTS machine. Apache Kafka is a distributed event streaming platform built to handle high-performance data pipelines. It was originally developed at LinkedIn, was later released as an open-source project, and is now used by many companies around the world.
Terms related to Apache Kafka Infrastructure:
1. Topic: A topic is a named stream to which a particular kind of data is published and in which it is stored. For example, if you wish to store all the data about pages being clicked, you can give the topic a name such as "page-clicks" (topic names may contain only letters, digits, ".", "_" and "-").
2. Partition: Every topic is split into partitions ("baskets"). The number of partitions must be specified when a topic is created, but it can be increased later as the need arises. Each message stored in a partition is assigned an incremental ID known as its offset.
3. Kafka Broker: Every server with Kafka installed on it is known as a broker. It acts as a container for several topics and their partitions.
4. Zookeeper: Zookeeper manages Kafka's cluster state and configurations.
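The relationship between a topic, its partitions, and offsets can be illustrated with a toy model in plain shell. This is only a sketch and not how Kafka is implemented: the produce helper, the use of cksum as the key hash, and the one-file-per-partition layout are all assumptions made for illustration (Kafka's own partitioner uses a different hash function).

```shell
#!/usr/bin/env bash
# Toy model of one topic with three partitions (illustration only, not Kafka code).
topic_dir=$(mktemp -d)   # stands in for the topic's storage
partitions=3             # fixed when the "topic" is created

# produce KEY VALUE: append VALUE to the partition chosen by hashing KEY,
# then print the partition number and the offset the message received.
produce() {
  local key=$1 value=$2 hash partition log offset=0
  hash=$(printf '%s' "$key" | cksum | cut -d' ' -f1)
  partition=$(( hash % partitions ))
  log="$topic_dir/partition-$partition.log"
  # The offset is the number of messages already stored in this partition.
  if [ -f "$log" ]; then
    offset=$(( $(wc -l < "$log") ))
  fi
  printf '%s\n' "$value" >> "$log"
  echo "partition=$partition offset=$offset"
}

produce user-1 "clicked /home"
produce user-1 "clicked /pricing"   # same key: same partition, next offset
produce user-2 "clicked /docs"
```

Messages with the same key always land in the same partition, and offsets count up independently inside each partition, which is why ordering is only guaranteed per partition.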
Main use cases of Apache Kafka:
1. Messaging: Compared with most messaging systems, Kafka has better throughput, built-in partitioning, replication, and fault tolerance, which makes it a good solution for large-scale message processing applications.
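2. Website Activity Tracking: Site activity such as page views and searches is published to topics as real-time feeds that other systems can subscribe to.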
3. Log Aggregation: Kafka abstracts away the details of files and gives a cleaner abstraction of log or event data as a stream of messages.
4. Stream Processing: This involves capturing data in real time from event sources, storing these event streams durably for later retrieval, and routing them to different destination technologies as needed.
5. Event Sourcing: This is a style of application design where state changes are logged as a time-ordered sequence of records.
6. Commit Log: Kafka can serve as a kind of external commit-log for a distributed system. The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data.
7. Metrics: This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.
To install Apache Kafka on Ubuntu:
1. Update your fresh Ubuntu 20.04 server and install Java as shown below.
$ sudo apt update && sudo apt upgrade
$ sudo apt install default-jre wget git unzip -y
$ sudo apt install default-jdk -y
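2. Download Kafka and extract it into /usr/local/kafka-server. Apache moves older releases from downloads.apache.org to archive.apache.org/dist/kafka/, so adjust the URL if the download below fails.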
$ cd ~
$ wget https://downloads.apache.org/kafka/2.6.0/kafka_2.13-2.6.0.tgz
$ sudo mkdir /usr/local/kafka-server && cd /usr/local/kafka-server
$ sudo tar -xvzf ~/kafka_2.13-2.6.0.tgz --strip-components=1
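The strip option passed to tar above (--strip-components=1, which GNU tar also accepts abbreviated as --strip 1) drops the top-level kafka_2.13-2.6.0/ directory, so that bin/, config/ and the rest land directly in the target directory. A minimal sketch of that behaviour, using a throwaway archive in a temporary directory rather than the real download; the file inside it is only a placeholder:

```shell
#!/usr/bin/env bash
# Build a tiny stand-in archive with the same top-level layout as the Kafka tarball.
work=$(mktemp -d)
mkdir -p "$work/src/kafka_2.13-2.6.0/bin" "$work/dest"
echo '# placeholder' > "$work/src/kafka_2.13-2.6.0/bin/kafka-server-start.sh"
tar -czf "$work/kafka.tgz" -C "$work/src" kafka_2.13-2.6.0

# Extract while dropping the first path component (the versioned directory).
tar -xzf "$work/kafka.tgz" -C "$work/dest" --strip-components=1

# bin/ now sits directly under the destination, with no kafka_2.13-2.6.0/ level.
ls "$work/dest"   # prints: bin
```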
3. Create the Kafka and Zookeeper systemd unit files.
i. Let us begin with the Zookeeper service.
$ sudo vim /etc/systemd/system/zookeeper.service
Description=Apache Zookeeper Server
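Only the Description= line of the unit file is shown above. A minimal sketch of what a complete unit could look like follows. Everything apart from that line is an assumption based on the /usr/local/kafka-server layout from step 2 and the ZooKeeper start/stop scripts that ship in Kafka's bin/ directory. The script writes the file to a scratch directory (UNIT_DIR, a name chosen here) so it can be reviewed before being copied into place.

```shell
#!/usr/bin/env bash
# Sketch: generate zookeeper.service in a scratch directory for review.
KAFKA_HOME="${KAFKA_HOME:-/usr/local/kafka-server}"   # install path from step 2
UNIT_DIR="${UNIT_DIR:-$(mktemp -d)}"                  # hypothetical scratch dir

cat > "$UNIT_DIR/zookeeper.service" <<EOF
[Unit]
Description=Apache Zookeeper Server
Requires=network.target remote-fs.target
After=network.target remote-fs.target

[Service]
Type=simple
# Assumed: the bundled ZooKeeper scripts and default config shipped with Kafka.
ExecStart=$KAFKA_HOME/bin/zookeeper-server-start.sh $KAFKA_HOME/config/zookeeper.properties
ExecStop=$KAFKA_HOME/bin/zookeeper-server-stop.sh
Restart=on-abnormal

[Install]
WantedBy=multi-user.target
EOF

echo "Wrote $UNIT_DIR/zookeeper.service"
```

After review, the generated file could be installed with sudo cp "$UNIT_DIR/zookeeper.service" /etc/systemd/system/.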
ii. Then create the Kafka service. Make sure JAVA_HOME is set correctly, or Kafka will not start.
$ sudo vim /etc/systemd/system/kafka.service
Description=Apache Kafka Server
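As with Zookeeper, only the Description= line of the Kafka unit is shown above. The sketch below fills in one plausible version under stated assumptions: Kafka lives in /usr/local/kafka-server, the service should start after zookeeper.service, and JAVA_HOME is derived from wherever the java binary really lives. The fallback path is the usual OpenJDK 11 location on Ubuntu 20.04 and should be checked on your machine. UNIT_DIR is a scratch directory name chosen for this example.

```shell
#!/usr/bin/env bash
# Sketch: generate kafka.service in a scratch directory for review.
KAFKA_HOME="${KAFKA_HOME:-/usr/local/kafka-server}"   # install path from step 2
UNIT_DIR="${UNIT_DIR:-$(mktemp -d)}"                  # hypothetical scratch dir

# Derive JAVA_HOME from the real location of the java binary when possible;
# otherwise fall back to the usual OpenJDK 11 path on Ubuntu 20.04 (assumption).
if [ -z "${JAVA_HOME:-}" ] && command -v java >/dev/null 2>&1; then
  java_bin=$(readlink -f "$(command -v java)")
  JAVA_HOME=${java_bin%/bin/java}
fi
JAVA_HOME="${JAVA_HOME:-/usr/lib/jvm/java-11-openjdk-amd64}"

cat > "$UNIT_DIR/kafka.service" <<EOF
[Unit]
Description=Apache Kafka Server
Requires=zookeeper.service
After=zookeeper.service

[Service]
Type=simple
Environment="JAVA_HOME=$JAVA_HOME"
ExecStart=$KAFKA_HOME/bin/kafka-server-start.sh $KAFKA_HOME/config/server.properties
ExecStop=$KAFKA_HOME/bin/kafka-server-stop.sh
Restart=on-abnormal

[Install]
WantedBy=multi-user.target
EOF

echo "Wrote $UNIT_DIR/kafka.service with JAVA_HOME=$JAVA_HOME"
```

Check the printed JAVA_HOME before copying the file to /etc/systemd/system/, since a wrong value there is the most likely reason for the service failing to start.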
iii. Reload the systemd daemon to apply changes and then start the services.
$ sudo systemctl daemon-reload
$ sudo systemctl enable --now zookeeper
$ sudo systemctl enable --now kafka
$ sudo systemctl status kafka zookeeper
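Once both services report active, a quick end-to-end check is to create a topic, publish one message, and read it back with the command-line tools in Kafka's bin/ directory. The following is a sketch under assumptions: the broker listens on localhost:9092 (the default when the listener settings in server.properties are left untouched), and the function name kafka_smoke_test and the topic name smoke-test are made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: end-to-end check that the broker accepts and returns a message.
KAFKA_HOME="${KAFKA_HOME:-/usr/local/kafka-server}"   # install path from step 2
BOOTSTRAP="${BOOTSTRAP:-localhost:9092}"              # assumed default listener
TOPIC="${TOPIC:-smoke-test}"                          # hypothetical topic name

kafka_smoke_test() {
  local bin="$KAFKA_HOME/bin"
  if [ ! -x "$bin/kafka-topics.sh" ]; then
    echo "kafka-topics.sh not found under $bin" >&2
    return 1
  fi
  # Create a single-partition topic (skipped if it already exists) and show it.
  "$bin/kafka-topics.sh" --create --if-not-exists --topic "$TOPIC" \
    --partitions 1 --replication-factor 1 --bootstrap-server "$BOOTSTRAP" || return 1
  "$bin/kafka-topics.sh" --describe --topic "$TOPIC" --bootstrap-server "$BOOTSTRAP"
  # Publish one message, then read one message back from the start of the topic.
  echo "hello kafka" |
    "$bin/kafka-console-producer.sh" --topic "$TOPIC" --bootstrap-server "$BOOTSTRAP"
  "$bin/kafka-console-consumer.sh" --topic "$TOPIC" --from-beginning \
    --max-messages 1 --timeout-ms 10000 --bootstrap-server "$BOOTSTRAP"
}
```

Run kafka_smoke_test after loading the script into your shell. If the consumer prints the message, the broker and Zookeeper are working together; if it times out, inspect the logs with journalctl -u kafka.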
4. Install Cluster Manager for Apache Kafka (CMAK), formerly known as Kafka Manager.
$ cd ~
$ git clone https://github.com/yahoo/CMAK.git
5. Configure CMAK on Ubuntu.