Each task that contains a stateful processor has exclusive access to the state stores in the processor. State store basicsĪ stateful processor may use one or more state stores. To parallelize the processing, Kafka Streams distributes the five tasks, t 0-t 4, over all Kafka Streams clients belonging to the same application via the Kafka rebalance protocol. Task t 0 processes records from partition p 0, task t 1 processes records from partition p 1, and so on. For example, if the input topic in the topology above has five partitions p 0-p 4, Kafka Streams will create five tasks t 0-t 4. Kafka Streams creates a task for each partition of the input topic, and each task processes records from its input partition. Just as a process is an instance of a program that is executed by a computer, a task is an instance of a topology that is executed by Kafka Streams. Once a topology is specified, Kafka Streams will execute the topology. The last processor in the topology writes its output records to a Kafka topic. Each processor applies its logic on the input record and forwards an output record to the downstream processors. The topology in the figure above reads records from a Kafka topic and streams the records through a series of stateless and stateful processors. While managing state outside of Kafka Streams is possible, we usually recommend managing state inside Kafka Streams to benefit from high performance and processing guarantees. Note that Kafka Streams does not consider processors as stateful if their state is exclusively managed outside of Kafka Streams, that is, when user code within the processor directly calls an external database. For example, an aggregation operation (such as counting the number of input records received in the past five minutes) needs to retrieve the current aggregated value from the state store, update the current aggregated value with the input record, and finally write the new aggregated value to the state store as well as forward the new aggregated value to the downstream processors in the topology. Stateful processors query and maintain a state during the processing of records. For example, a processor that implements a map operation (such as masking all but the last four digits of a credit card number) transforms a record into another record without querying any other data. Stateless processors process records independently of any other data. A processor executes its logic on a stream record by record. A topology consists of processors connected by streams. Kafka Streams defines its computational logic through a so-called topology. Once we understand the foundational principles, we’ll deep dive into operational issues that you may encounter when you operate your Kafka Streams application with RocksDB state stores, and most importantly how you can tune RocksDB to overcome those issues. Then, we‘ll provide an overview about RocksDB, including the two most used compaction styles, level compaction and universal compaction. We’ll first explain the basics around Kafka Streams and how it uses state stores. This blog post will cover key concepts that show how Kafka Streams uses RocksDB to maintain its state and how you can tune RocksDB for Kafka Streams’ state stores. Kafka Streams configures RocksDB to deliver a write-optimized state store. Many companies use RocksDB in their infrastructure to get high performance to serve data. RocksDB is a highly adaptable, embeddable, and persistent key-value store that was originally built by the Engineering team at Facebook. For these state stores, Kafka Streams uses RocksDB as its default storage to maintain local state on a computing node (think: a container that runs one instance of your distributed application). A Kafka Streams application can perform stateless operations like maps and filters as well as stateful operations like windowed joins and aggregations on incoming data records.įor stateful operations, Kafka Streams uses local state stores that are made fault-tolerant by associated changelog topics stored in Kafka. Apache Kafka ships with Kafka Streams, a powerful yet lightweight client library for Java and Scala to implement highly scalable and elastic applications and microservices that process and analyze data stored in Kafka.
0 Comments
Leave a Reply. |