What problem does Kafka solve?

Apache Kafka is becoming increasingly popular for mapping information flows within a wide variety of IT architectures, be it in the area of big data, where real-time analyses are carried out on streaming data, or in service-to-service communication between microservices.

This blog article explains which problems in the Apache Kafka environment a Kafka schema registry can solve and which advantages the Avro data format offers over "classic" data formats such as JSON and over other binary formats. Basic knowledge of Apache Kafka is assumed.

Confluent Kafka Schema Registry

At its core, Apache Kafka is a distributed streaming platform that provides streams of messages in a fault-tolerant manner, similar to message queue or enterprise messaging systems. Apache Kafka accepts any sequence of bytes as a "message" - be it (semi-)structured data in the form of JSON documents, unstructured text or entire images. When a message is sent, its "correctness" is not verified: it does not matter to Apache Kafka whether, for example, the data schema of a JSON document matches the expectations of the consumers of that message.
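
To make this concrete, here is a minimal sketch using the confluent-kafka-go client (broker address, topic and payload are made up): the broker stores the bytes exactly as sent, misspelled field name and all.

golang

package main

import (
    "log"

    "github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
    p, err := kafka.NewProducer(&kafka.ConfigMap{"bootstrap.servers": "localhost:9092"})
    if err != nil {
        log.Fatal(err)
    }
    defer p.Close()

    topic := "payments"
    // Note the misspelled field "amout": Kafka stores the bytes unchanged,
    // no component checks the payload against any schema.
    payload := []byte(`{"user": "alice", "amout": 42}`)

    err = p.Produce(&kafka.Message{
        TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny},
        Value:          payload,
    }, nil)
    if err != nil {
        log.Fatal(err)
    }
    p.Flush(5000) // wait up to 5 seconds for delivery
}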


Problem cases and best practices in the operation of Apache Kafka

Typical problems in the long-term operation of an Apache Kafka platform arise precisely from this fact. Things change, are further developed or extended, and individual data fields within JSON documents are renamed or even dropped entirely. Due to the loose coupling of producers and consumers within a Kafka ecosystem, consumers are often not notified of upcoming changes. One possible workaround is to wrap every parsing step of a message in try-catch constructs, which often comes at the expense of the readability of the program code. Even corporate paradigms such as "talking to each other" sometimes only last until responsibilities and/or people and their roles change.
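
In Go terms, such defensive parsing might look like the following sketch (the PaymentEvent struct and its field names are made up):

golang

package consumer

import (
    "encoding/json"
    "fmt"
)

// PaymentEvent is a hypothetical structure a consumer expects in each message.
type PaymentEvent struct {
    User   string  `json:"user"`
    Amount float64 `json:"amount"`
}

// parsePayment guards every parsing step, because nothing guarantees that the
// bytes read from Kafka still match the expected structure.
func parsePayment(value []byte) (*PaymentEvent, error) {
    var evt PaymentEvent
    if err := json.Unmarshal(value, &evt); err != nil {
        return nil, fmt.Errorf("malformed message: %w", err)
    }
    if evt.User == "" {
        // A field renamed or dropped on the producer side ends up here.
        return nil, fmt.Errorf("message parsed, but expected field 'user' is missing")
    }
    return &evt, nil
}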

How the schema registry acts as a contract between producers and consumers

This is exactly the problem the Confluent Schema Registry solves. It can be integrated as an additional component into an existing Kafka ecosystem and forces producers to compare the data schema inherent in a message against the schema stored in the registry before the message is sent (see Figure 1). If the message schema matches the expected schema, the producer may send the message via Kafka; otherwise the message is rejected. In this way, consumers of a message can be sure that the messages they read actually correspond to the content they expect.

Fig. 1

The schema registry fulfills the following functions:

  • Storage of a schema,
  • Retrieval of a stored schema,
  • Comparison and validation of a schema against one already stored,
  • Evolution and (re-)validation of a schema.
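
Under the hood, the registry exposes these four functions via a simple REST API, roughly along the following lines (subject name, host and schema ID are illustrative):

  • POST /subjects/payments-value/versions - register a new schema for a subject and receive its schema ID
  • GET /subjects/payments-value/versions/latest - retrieve the most recently registered schema of a subject
  • POST /compatibility/subjects/payments-value/versions/latest - check a new schema version for compatibility with the stored one
  • GET /schemas/ids/42 - retrieve a schema by its ID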

The last of these functions, schema evolution, is necessary to retain a certain degree of flexibility. Any schema that has already been published can be evolved within a chosen compatibility level. For example, new fields with meaningful default values can be added to preserve backward compatibility. A list of the supported compatibility levels can be found here.
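
Anticipating the Avro schema notation introduced further below, such a backward-compatible addition might look like this (the field name is purely illustrative):

JSON

{ "name": "email", "type": ["null", "string"], "default": null, "doc": "Added in a later version; the default keeps old messages readable" }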

Libraries for interacting with the schema registry are available for almost all common programming languages (Java, Scala, NodeJS, Go, etc.).

Apache Avro - Deep Dive

Apache Avro is a compact binary data format that can be serialized and deserialized quickly and was first published in the context of Apache Hadoop. The aim was to develop a container format that allows data to be stored persistently and in a type-safe manner. This is in direct contrast to the JSON documents that are ubiquitous these days and offer neither type safety nor fixed structures; as a consequence, the underlying data structure effectively has to be sent again with every message in the form of repeated field names. In addition, JSON is not a binary format, so JSON documents are generally much more memory-intensive than other formats.

Avro schemas themselves are described in the form of JSON documents. Field names, data types and documentation of individual data fields can be recorded here:

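A minimal, illustrative example of such a schema (the record and field names are made up):

JSON

{
  "type": "record",
  "name": "Payment",
  "namespace": "com.example.events",
  "doc": "A single payment event",
  "fields": [
    { "name": "id", "type": "string", "doc": "Unique ID of the payment" },
    { "name": "user", "type": "string", "doc": "User who triggered the payment" },
    { "name": "amount", "type": "double", "doc": "Amount in euros" }
  ]
}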


With the help of a schema like the one shown above, the actual data can not only be written and (de-)serialized, but corresponding classes/objects can also be generated for a wide variety of programming languages. Experience shows that this drastically reduces the development time for new consumers, since thanks to the existing data schemas no corresponding JSON parsers/objects have to be written by hand.

By using the schema registry, schemas are managed and stored centrally, so that they can be made accessible to all developers in a company in a single place. In addition, the schema registry offers another invaluable advantage: instead of including the schema in every single message, it is possible to send just a schema ID along with the actual message. Before processing the message, a consumer can fetch the corresponding schema by sending the schema ID to the schema registry and can then process the message in a type-safe manner. This further reduces the size of the messages, which in turn reduces not only the storage space required, but also the time needed to read and process a message.
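
Sketched in Go with nothing but the standard library (the registry URL is an assumption; the registry's GET /schemas/ids/{id} endpoint returns the schema as a JSON-encoded string):

golang

package consumer

import (
    "encoding/json"
    "fmt"
    "net/http"
)

// fetchSchema resolves a schema ID against the schema registry.
// The response has the form {"schema": "<Avro schema as a string>"}.
func fetchSchema(registryURL string, schemaID uint32) (string, error) {
    resp, err := http.Get(fmt.Sprintf("%s/schemas/ids/%d", registryURL, schemaID))
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        return "", fmt.Errorf("schema registry returned status %d", resp.StatusCode)
    }

    var body struct {
        Schema string `json:"schema"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
        return "", err
    }
    return body.Schema, nil
}

In practice, resolved schemas are typically cached by ID on the consumer side, so the registry only has to be contacted once per schema.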

Avro vs. Thrift vs. Protocol Buffers (ProtoBuf)

Why should Avro be used instead of other binary data formats such as Apache Thrift or ProtoBuf? In contrast to Thrift or ProtoBuf, Avro does not require a fixed order of the data fields within a message, because fields are referenced by name rather than by position. This is roughly comparable to accessing individual columns within relational tables by name, instead of accessing columns by their fixed position, as is the case in Excel, for example. At the same time, this allows a certain dynamism in the further development of the data structure and makes it easier to store the data in systems such as databases, Hadoop or Druid: with positional referencing, all existing data would otherwise have to be re-read and rewritten every time the data model changes, in order to keep the order of the data fields consistent across old and new data sets.
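
As a small illustrative sketch, an older producer might write records with the following schema:

JSON

{
  "type": "record",
  "name": "Payment",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "amount", "type": "double" }
  ]
}

A newer consumer can declare the same fields in a different order; because Avro matches fields by name during schema resolution, the two remain compatible:

JSON

{
  "type": "record",
  "name": "Payment",
  "fields": [
    { "name": "amount", "type": "double" },
    { "name": "id", "type": "string" }
  ]
}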

In contrast to Thrift or ProtoBuf, Avro also separates the area in which the schema is stored from the actual content of the file. With Thrift and ProtoBuf, field numbers and field contents always alternate, so the schema is strongly "interwoven" with the content. This separation is what makes it possible to replace the schema with a schema ID, as described above.

The structure of a message in Avro format when using the schema registry therefore looks as follows:

  • Byte 0 - "Magic Byte": version number of the Confluent serialization format (currently always 0)
  • Bytes 1-4 - "Schema ID": the 4-byte schema ID as returned by the schema registry
  • Byte 5 onwards - "Data": the actual data, serialized in the Avro binary format

In most cases, a message in this format is constructed by the Kafka client library in use. With some programming languages, however, this part has to be implemented by the developer in the respective program code. Here is an implementation in Go as an example:

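The following is a minimal sketch using only the Go standard library; it assumes that the record has already been serialized to Avro binary (for example with an Avro library such as goavro) and that the schema ID has been obtained from the registry beforehand.

golang

package producer

import (
    "bytes"
    "encoding/binary"
)

// buildConfluentMessage wraps an already Avro-serialized payload in the
// Confluent wire format: magic byte 0, a 4-byte big-endian schema ID,
// followed by the Avro binary data.
func buildConfluentMessage(schemaID uint32, avroPayload []byte) ([]byte, error) {
    var buf bytes.Buffer

    // Byte 0: magic byte, currently always 0.
    buf.WriteByte(0)

    // Bytes 1-4: the schema ID as returned by the schema registry.
    if err := binary.Write(&buf, binary.BigEndian, schemaID); err != nil {
        return nil, err
    }

    // Byte 5 onwards: the Avro-serialized record itself.
    buf.Write(avroPayload)

    return buf.Bytes(), nil
}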


Conclusion

By using the Confluent Kafka Schema Registry and Apache Avro, it is possible to guarantee consistent data quality company-wide, to simplify collaboration between teams, to reduce development time and to connect Apache Kafka to data sinks such as Hadoop, Hive, Presto or Druid performantly and without much effort.

If you have any questions about the use of Apache Kafka, a schema registry, Avro or other topics related to Hadoop, please do not hesitate to contact us. We also support you in all other areas of business intelligence, data warehousing and data analytics.