It’s Okay to Store Data in Apache Kafka (2017)

vore · on Oct 23, 2019

I think what the article greatly skimps over is data migrations: what do you do if you need to change the format of your data? If you retain logs in Kafka indefinitely as the source of truth for your data, then if you need to migrate materialized data to a new format, you'll also need to either 1) support all the previous forms of materialized data so operations from the log are guaranteed to be safely replayable on it, or 2) don't do that and keep one form of materialized data and hope you have enough test coverage to make sure some unexpectedly old data doesn't silently corrupt your materialized data.

Event sourcing is useful, but using it as a source of truth data store in itself instead of e.g. an occasional journalling mechanism seems pretty fraught.

Joeri · on Oct 23, 2019

You use AVRO and the confluent schema store so that you can do compatible schema upgrades (automatically upgrading messages when reading). If you need to do a breaking change, you can detect the avro schema from the message and have code in the reader to convert one to the other. It’s not that much of a problem in practice.

vore · on Oct 23, 2019

This is a problem we've actually encountered in practice: you need to preserve the conversion code for as long as the history exists, which is permanent technical debt. If you get your domain model wrong at the start, it will haunt you forever.

Fiahil · on Oct 23, 2019

You can truncate the history by reading events from the starting point to the junction, then creating a new snapshot that will serve as a new starting point, and finally deleting the previous history segment.

dwenzek · on Oct 23, 2019

This is definitely a decision which requires a long term support, as any solution regarding storage. This is not a technical debt unless you see it as a temporary solution, waiting for a better time to replace it.

You highlight the "need to preserve the conversion code for as long as the history exists". This is a point so decisive that I even wonder if Kafka should provide some support to keep that relationship.

parasubvert · on Oct 23, 2019

This is no different from evolving an RDBMS with ActiveRecord or Flyway. It’s not technical debt (there is nothing to repay or maintain).

dllthomas · on Nov 4, 2019

If you write "read it to the latest version" code for every version, you need to write that for every past version - potentially a high burden. Writing "convert it to the next version" instead is much less work for humans, and probably usually the best call, but less efficient when you actually need to read that old data. There are hybrid approaches to consider, too.

Wherever you are on that spectrum, there is code. Exercising it may slow down your test suite. It might need to be touched when you update lint rules (at least to add annotations to shut off the new rules).

Code nearly always represents some amount of debt. Code to deal with history, more so.

parasubvert · on Nov 4, 2019

We have a slight disagreement on what it means to accrue technical debt.

When a schema changes for an all, someone has to migrate it and the data, unless the design is that all incompatible data is dropped. Either that’s the codebase itself or the dev team is punting to the user. Punting to the user just shifts the burden around.

If the R&D team shipped a shitty / incomplete schema to get the software out the door that they need to change later, then yes, that is technical debt - something they’ll eventually need to repay.

If requirements evolve over time and thus the schema needs to, that is not necessarily technical debt in the usual sense, which usually implies a temporary technical compromise for the purposes of expediency / getting something out the door.

I suppose I agree there is a trade off here and a penalty - after many years the migrations get slow to apply etc, and you could say that checkpointing the schema every few versions and preventing upgrades from any prior point is a way of cleaning up the “migration debt”.

But people like to suggest that there’s some other way , ie. the OP saying “ If you get your domain model wrong at the start, it will haunt you forever.”.... I have never seen any system get the domain model perfectly right at the start!

dllthomas · on Nov 5, 2019

That may be. I was mostly trying to address the claim that "[i]t's not technical debt [because] there is nothing to [...] maintain."

Financially speaking, it's more like an annuity than a credit card.

lojack · on Oct 23, 2019

So, like a set of traditional database migrations?

tele_ski · on Oct 23, 2019

I personally don't understand avro, are there tools to write the schemas easier? The json format is very difficult to read and it just seems so clunky compared to protobuf.

abalone · on Oct 23, 2019

> 1) support all the previous forms of materialized data so operations from the log are guaranteed to be safely replayable on it

Yes, I believe with event sourcing you typically do exactly that. The very point of it is to use the event log as the source of truth, not just an “occasional journaling mechanism.”

mumblemumble · on Oct 23, 2019

I've worked on maintaining systems like that. IME, it's classic technical debt: You can get up and running much more quickly and cheaply, but in the long run it can become quite expensive. So, like any good technical debt, it's fine, even desirable, in the short run, but less fine if you end up carrying it indefinitely.

Fortunately, Kafka's already thought about this, so, when the time comes, it should be pretty easy to migrate to an architecture where you're no longer using it as your long-term source of truth.

UK-Al05 · on Oct 23, 2019

People go out there way to make events the source of truth in non-kafka systems. With a lot of effort involved. It isn't just people being lazy with Kafka. Event sourcing existed before with traditional databases.

I came across it in the .net world with cqrs + event sourcing combo long before Kafka became popular. People put a lot effort in to move away traditional current state storage. Often used in industries that have a lot of regulation.

abalone · on Oct 23, 2019

Event sourcing may be many things but I don't think "quick and cheap" is one of them. If you want quick and cheap do a (distributed) monolith.

UK-Al05 · on Oct 23, 2019

The entire point of event sourcing is that it is the source of truth. The sourcing in event sourcing.

Otherwise it's not different from a audit log.

zbentley · on Oct 23, 2019

I think what GP meant is: what if one of your "indexes" is a SQL database, and over time the shape that database needs to be changes?

In that situation, how do you handle migrations on that database, or building a new copy of it from scratch via the log? Your log will have historical data in a different shape than the schema expects, so you'll end up in an uncomfortable "replay a little, migrate, repeat" situation when reading historical data into the index for any reason.

UK-Al05 · on Oct 23, 2019

Normally you version your events, uplift or convert a stream to a updated version.

epistasis · on Oct 23, 2019

Kreps has a gift for writing, this is so clear, well organized, and far more fun to read than the topic has any right to be. Hopefully he'll retire after confluent and finally start writing novels.

sitkack · on Oct 23, 2019

I came to say something similar, but targeted at the quoted paragraph below.

> If Kafka isn’t going to be a universal format for queries, what is it? I think it makes more sense to think of your datacenter as a giant database, in that database Kafka is the commit log, and these various storage systems are kinds of derived indexes or views. It’s true that a log like Kafka can be fantastic primitive for building a database, but the queries are still served out of indexes that were built for the appropriate access pattern.

A wonderful insight!

mvitorino · on Oct 23, 2019

IMO using Kafka for long term storage is not the greatest idea. It is expensive to keep CPU and RAM constantly on top of data that it going to be cold most of the time. There is no DML which means mistakes are expensive (from an engineering pov). And while the whole event sourcing paradigm can work quite well in narrow domains with teams fully aware of the implications of what they are doing, in practice, on large orgs, it is hard to scale (from a people perspective).

tempodox · on Oct 24, 2019

> hard to scale (from a people perspective).

This is always going to be true. Either your system is built in a way that even the most ignorant of users cannot do permanent damage, or you design it to be understood and run by a small group of people who fully comprehend what they're doing. Of course neither option is easy, but the alternative is a prolific generator of daily trouble.

ryanthedev · on Oct 23, 2019

Well maybe for non critical data. Multi regional Kafka clustering is not easy. There are much better and cheaper data storage options that can provide eventual consistency.

onefuncman · on Oct 23, 2019

Mirror Maker is fine for small scale, or you upgrade into https://github.com/uber/uReplicator ... region-homing data is generally more practical

You should provide examples of either better or cheaper to substantiate your claim.

je42 · on Oct 23, 2019

What tech would you use to store a critical append-only data if not with Kafka ?

chrisjc · on Oct 23, 2019

Perhaps store it out to hdfs/blob storage. Then use some kind of unified processing layer to reconstruct history or parts of history when you need it? Eg: https://news.ycombinator.com/item?id=20059006#20062821

Or use a durable unbounded solution like Pravega instead of Kafka? https://news.ycombinator.com/item?id=20059006#20062821

edit: this could be problematic especially when data privacy laws are involved. See the comments from gopalv below.

rockostrich · on Oct 23, 2019

Kafka topics compacted to some sort of disk storage? It really depends on what you need to do with it. All of our kafka topics have a TTL on messages that's less than or equal to 7 days. If the data is critical, then that topic has a archiver job that writes the messages to files in HDFS.

Postgres, Spanner, etc. come to mind as options if you need to store critical append-only data and also have it be queryable in real-time.

tr4r3x · on Oct 23, 2019

postgres is a bit different use case, isn't? kafka is mainly about scalability, it has nice partitioning out of the box. and i would not say that it's not reliable, because you always replicate data with it

rad_gruchalski · on Oct 23, 2019

Apache Pulsar: https://streamnative.io/blog/tech/2019-10-22-powering-tencen.... Or BookKeeper directly.

je42 · on Oct 25, 2019

this is the first time I read about Kafka has durability and consistency issues... but the the article also doesn't say what the durability and consistency issues are.

super_mario · on Oct 23, 2019

Appache Cassandra.

je42 · on Oct 25, 2019

ok,in what areas is Cassandra better than Kafka ? i am reading this https://stackoverflow.com/questions/35711129/how-to-stream-d...

According to https://issues.apache.org/jira/browse/CASSANDRA-8844: CDC seems a lot less capable than Kafka:

> Cassandra would only write to the CDC log, and never delete from it. > Cleaning up consumed logfiles would be the client daemon's responibility

on the other hand there is the Cassandra CDC Kafka Connector, that seems to do some work...

in the i still don't see why i would want to use Cassandra over Kafka for storing data, when i care about durability and streaming of append only data.

KaiserPro · on Oct 23, 2019

you _can_ do it, but you shouldn't

treating your message bus as a infinite storage system is going give you a bad time.

benjaminwootton · on Oct 23, 2019

Literally the whole article talks about why Kafka isn’t a message bus and makes the case whilst this is fine.

That said, I do find pretty quickly you need to get your data out of Kafka in order to query it. Keeping your data in Kafka forever and re-upserting it into your database has mileage though.

marton78 · on Oct 23, 2019

This is quite an unsubstantiated claim in response to a detailed article. Care to elaborate?

gopalv · on Oct 23, 2019

Not the original comment, but I have run into a specific set of regulatory scenarios where the immutability of any event log is a big problem.

The data retention policies for users who leave the platform is set to 30 days by GDPR, which requires the data to be deleted and expunged by the storage system in ways in which normal users cannot recover - or in another way, the data needs to "offline".

This is not actually that "all data older than 30 days is thrown away", but that "all data for users who have deleted their accounts need to be deleted from the start of their account creation, 30 days after they say forget-me".

Kafka (or Pulsar or Pravega) or any of the other immutable commit log implementations make forgetting a small slice of data from a large set a complicated and nearly impossible task to accomplish.

You can accomplish some part of this with log compaction assuming you have a definite primary key for all updates (i.e the log needs to be partitioned on a key to do compaction along that key). If there's a way I could declare a primary key as a device+event_timestamp+metric, but delete by a user_id column, let me know and I'll be happy to find out how to do it.

However in its original form, Kafka is still very useful.

Being able to hold data in Kafka in those periods of time is extremely valuable and naturally lets the system lose part of its state with the ability to replay itself back into the same state from a saved checkpoint.

If you store 7+ days of Kafka data and flush the newly arrived data-set into a persistent, but mutable columnar store every day & maintain the partition/offsets on commit, then you can recover from a complete loss of the mutable store's in-memory data by replaying the log from where you left off.

The row-major nature of its storage still hurts though if you plan to do all your analysis off it directly, because you'll burn through the disk bandwidth for no good reason.

cfontes · on Oct 23, 2019

> forgetting a small slice of data from a large set a complicated and nearly impossible task to accomplish.

We use Kafka as our storage for almost everything, and we managed to solve this by encrypting all user data that is relevant to GDPR and trowing away the key when asked for a removal.

if a user asks to be forgotten, we commit a empty privacy key for this user and compress the privacykeys topic and all is done, no service will be able to decrypt it anymore.

So far it has been a good solution and it was easy to implement on all our services.

ec109685 · on Oct 23, 2019

This is a great idea. If you couple this practice with storing user data encryption keys with additional layers of security, you’ll decrease your susceptibility of someone being able to extract all data if they get access to your kafka. Spotify talks about this here: http://labs.spotify.com/2018/09/18/scalable-user-privacy/

Kalium · on Oct 23, 2019

This practice - cryptoshredding - works well with two caveats.

First, it requires some policing of Kafka use. It's easy for developers to slip up and some PII to spill into the append-only data systems.

Second, your developers will have to handle for what happens when the key is deleted. The happy-path of fetching data, fetching key, and applying will fail quite hard the first time the rare event of a key deletion comes around.

dalore · on Oct 23, 2019

If you consider the log streams as backups then GDPR doesn't apply.

According to France's GDPR supervisory authority, CNIL, organisations don't have to delete backups when complying with the right to erasure. Nonetheless, they must clearly explain to the data subject that backups will be kept for a specified length of time (outlined in your retention policy).

je42 · on Oct 23, 2019

If you have time, you may want to read the article Database inside/out it gives a bit more background: https://www.confluent.io/blog/turning-the-database-inside-ou...

thanatropism · on Oct 23, 2019

It's funny. I've always had the instinct (to a fault) to make every data model include a log. E.g. if I'm making a Facesmash/Elo app, to store the latest score and the history of matches. Because who knows what kind of time-series analytics I might want to do one day, it feels overly destructive to just erase old scores.

That quickly gets unwieldy and I discard that first instinct. But now people are doing it for real!

zbentley · on Oct 23, 2019

This article seems to propose log compaction as the answer to the question of size (i.e. how much historical data is going to have to be kept around and how much is that going to cost). However, log compaction is not well suited to many use cases: storing partial updates or diffs in the log, storing many trillions of tiny entries (as keys), multiple messages on the log corresponding to related (contingent) updates, and so on.

Those are tractable but hard to solve; log compaction is not a silver bullet and unless you think really hard about how your data changes over time, you may end up storing more of it than you expect if you use the log as an eternal source of truth--compaction or not.