Elasticsearch as a Time Series Data Store (elastic.co)
126 points by trampi on Nov 13, 2015 | hide | past | favorite | 17 comments


That's great for unstructured data, like data with high cardinality on the dimensions. But for most real-world metrics outside analytics, this isn't necessary, and a data model like Prometheus's makes more sense. If I did the math right, even after compression Elasticsearch uses about 22 bytes per data point (508 megabytes / 23M points), where Prometheus uses about 2.5-3.5 bytes per data point.
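The arithmetic, for anyone who wants to check it (a back-of-the-envelope sketch assuming SI megabytes, not the benchmark's exact methodology):

```python
points = 23_000_000           # data points in the benchmarked index
es_bytes = 508 * 1_000_000    # 508 MB Elasticsearch index size

bytes_per_point = es_bytes / points
print(round(bytes_per_point, 1))   # roughly 22 bytes per point

# At the quoted 2.5-3.5 bytes/point, Prometheus would need roughly:
low, high = points * 2.5 / 1e6, points * 3.5 / 1e6
print(f"{low:.1f}-{high:.1f} MB")
```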

Disclosure: Prometheus contributor here


Given your background, I'm going to take this opportunity to ask some very noobish questions. (I will do my due diligence and read up on Prometheus; I likely should have done so already. These are just some questions off the top of my head that I'm not sure I'd gain a good intuition for until late in the learning process.)

- How is the bulk of this additional compression derived? Is it explicitly the existence of a data model that lets you use more aggressive/intelligent compression strategies?

- Does this come at a cost? (increased CPU overhead, latency at read time, something like that.)


If you can predict the next value in the sequence, you can achieve better compression (XOR the predicted value with the actual one, rather than the previous value with the next: http://users.ices.utexas.edu/~burtscher/papers/dcc06.pdf), but I don't think the Prometheus developers are using this.
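The intuition behind the XOR trick is that consecutive float64 samples usually share sign, exponent, and leading mantissa bits, so their XOR residual is mostly zeros and compresses well. A minimal sketch of just the residual step (illustrative only, not the FPC predictor or any real encoder):

```python
import struct

def xor_residuals(samples):
    """XOR each float64 sample's bit pattern with its predecessor's.

    Similar values yield residuals with many leading zero bits,
    which a real encoder would then store compactly.
    """
    as_bits = [struct.unpack(">Q", struct.pack(">d", s))[0] for s in samples]
    prev = 0
    out = []
    for bits in as_bits:
        out.append(bits ^ prev)
        prev = bits
    return out

# A slowly changing gauge: after the first sample, residuals are
# either zero or tiny compared to the raw 64-bit values.
residuals = xor_residuals([100.0, 100.0, 100.5, 100.5])
print([r.bit_length() for r in residuals])
```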

And BTW, I don't believe in 2.5-3.5 bytes per data point anywhere outside synthetic benchmarks.


While I'm happy to hear about a great success story of a great piece of open source software, Elasticsearch has done a great disservice by making application developers lazy about learning the ins and outs of various analytical/transactional/storage backend systems.

Echoing other commenters, Elasticsearch is hardly the best tool for many kinds of analytics. In fact, it is strictly not a good tool for several use cases. For starters:

1. It's not good at joining two or more data sources

2. It's not good at complex analytical processing like window functions (for example, calculating session length based on the deltas of consecutive timestamps, partitioned by user_id and ordered by time).
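To make point 2 concrete: in SQL this is a one-liner with `LAG(ts) OVER (PARTITION BY user_id ORDER BY ts)`, while Elasticsearch has no direct equivalent. A plain-Python sketch of the same per-user delta computation (hypothetical event data):

```python
from collections import defaultdict

def session_gaps(events):
    """Per-user deltas between consecutive timestamps.

    events: iterable of (user_id, unix_ts) pairs, in any order.
    Returns {user_id: [gap_seconds, ...]}, the raw material for
    session-length calculations.
    """
    by_user = defaultdict(list)
    for user_id, ts in events:
        by_user[user_id].append(ts)
    gaps = {}
    for user_id, stamps in by_user.items():
        stamps.sort()
        gaps[user_id] = [b - a for a, b in zip(stamps, stamps[1:])]
    return gaps

# A sessionizer would then split wherever a gap exceeds some timeout.
print(session_gaps([("u1", 100), ("u2", 50), ("u1", 160), ("u1", 400)]))
```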

Of course, it's also good at many things, like simple filtering and aggregation against "real-time" data. Being in-memory really helps with performance, and with the right tools, it's horizontally scalable. Elastic's commercial support is also not to be discounted.

However, as an old OLAP fart who spent years optimizing KDB+ queries, I am deeply concerned by the willful ignorance of data processing systems that I see among Elasticsearch fans. Just take my word for it and study Postgres (with the cstore_fdw extension) and other real databases, in-memory or otherwise, open source or proprietary, so that you won't shoot yourself (or future co-workers) in the foot trying to shoehorn Elasticsearch and its ilk into suboptimal workloads. (To be fair, I see a similar tendency among Splunk zealots.)


> Of course, it's also good at many things like simple filtering and aggregation against "real-time" data.

And also fulltext search at scale, which is basically its primary use case.

PostgreSQL's fulltext search isn't quite at the same level. The last time I looked into its capabilities, it didn't fully support TF-IDF. (I don't think it keeps track of corpus frequencies for terms.) Interestingly, I think SQLite's fulltext support does include TF-IDF, but I could be misremembering.
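For reference, textbook TF-IDF needs exactly that corpus-wide statistic: the document frequency of each term. A minimal sketch in Python (the standard formula, not Lucene's or Postgres's actual scoring):

```python
import math

def tf_idf(term, doc, corpus):
    """Textbook TF-IDF: term frequency within `doc`, weighted by the
    inverse document frequency of `term` across `corpus`.

    doc and each corpus entry are lists of tokens.
    """
    tf = doc.count(term) / len(doc)
    # This is the corpus-wide bookkeeping a full-text index must maintain:
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / (1 + df))   # +1 smoothing
    return tf * idf

corpus = [["elastic", "search"], ["search", "engine"], ["cat"]]
print(tf_idf("elastic", corpus[0], corpus))  # rare term, positive weight
print(tf_idf("search", corpus[0], corpus))   # common term, lower weight
```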

I mean, the Elasticsearch docs are pretty clear that joining doesn't work well (or really, at all). I'm not sure how being clear about the trade-offs of your software is "doing a disservice." Sometimes you don't need to store relational data. Sometimes you do need to store relational data, but the other benefits of Elasticsearch outweigh shoehorning relational data into what is effectively a document database.

If your only complaint is that people misuse software... Well... Yeah. It's been happening for a while now. We should help educate others. I'm not sure your approach is the most constructive.


Worth mentioning that Elastic.co has a commercial product called Watcher [1], which I think is a really nice way to build an automated alert system. The downside is that, being a commercial product, I can't use Watcher and would have to implement one myself.

I am still deciding between ES, a relational database, and Cassandra for time series data. We use Graphite now and are happy with it, but I think having a single database handle logs, events, and metrics data would be ideal. Having logs already in ES does make ES a better choice.

[1]: https://www.elastic.co/guide/en/watcher/current/index.html


We used elastalert on a project and it did the trick: https://github.com/Yelp/elastalert


Thanks! And it's Python, so I don't have to reinvent anything. Will take a look, thanks!


We used Cassandra for timeseries data for a while.

My word of caution would be to set up a dedicated cluster for it. Cassandra is good at a lot of things, but if you have different workloads running on the same cluster, performance is hard to tune. We used it for a raw document store (write once, read almost never) and time series in the same place (lots of appending, even more reads), and it wasn't great.


Yeah, absolutely, points taken, thanks. I think monitoring infrastructure :D should be dedicated. And agreed that for metrics, reads outnumber writes; plus, writes in Cassandra are intrinsically quick.


The 2.5 release of the time series focused dashboard Grafana added support for Elasticsearch. In a way they've come full-circle, since Grafana started several years ago as a fork of the Elasticsearch dashboard Kibana.

http://grafana.org/blog/2015/10/28/Grafana-2-5-Released.html


So many ways you can abuse Lucene :) Many years ago, we used it as a graph data storage as well.


And I used Lucene as the data backend for an inference engine recently :) If there is one library that keeps me in the Java world, it's Lucene. No other open-source project comes close to Lucene in its category.


We use Elasticsearch in a very similar manner to that described in the article, to store high-frequency data for our instance and multi-cloud profiling/benchmarking tool:

https://profiler.bitfusionlabs.com

Since we are collecting data at sub-second granularity and did not want to introduce noise on the profiled instances themselves, whether for CPU, memory, or disk, we had to play a few tricks with how we collect the data and precisely when we send it to Elasticsearch, but in general it has been working out very well for us.


I tend to think of time series data as being several orders of magnitude larger than 23 million data points per week (about 38 per second), but now I can't seem to find a good definition of time series data. Anyone have thoughts on the rough threshold between event data and time series data? I think of arrays of hundreds or thousands of individual sensors taking 10 measurements a second as "different" from user-generated data that is time-ordered.


I agree: time series should be more like 1,000 sensors measured 100 times a second. Industrial acquisition data is not the same thing as timestamped web log data.
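A rough comparison of the two rates being discussed (back-of-the-envelope only):

```python
# Rate from the article's benchmark: 23M points over one week.
article_rate = 23_000_000 / (7 * 24 * 3600)   # ~38 points/s

# Rate from the industrial example: 1,000 sensors at 100 Hz.
industrial_rate = 1000 * 100                  # 100,000 points/s

print(int(article_rate), industrial_rate)
print(round(industrial_rate / article_rate))  # several thousand times larger
```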


Elasticsearch, good for everything? Looks like computers got cheap and fast enough to do almost anything. Why not put the data in an SQL database? I suppose that would be much better.

But nothing seems strange anymore when, in order to monitor one server, you have to run a ten-machine cluster with an Elasticsearch log collector.



