This looks very interesting. I'm doing some log file processing in Apache Spark in Clojure. Spark is written in Scala, but has a Java API, which is wrapped by Flambo. It looks and feels entirely like Clojure.
The semantics look very similar indeed. Does anyone have a comparison between Onyx and Spark?
I've used Onyx, but I haven't used Spark, so take this with a grain of salt.
A few key differences:
Onyx aggressively uses data structures to define the structure of a computation, describing both the data flow (the Onyx workflow) and its parameterization (the Onyx catalog) as plain Clojure maps and vectors. Flambo and Spark, by contrast, define the structure of a computation via functions over collections. One way Onyx's approach pays off is that it becomes trivial to manipulate workflows or catalogs at runtime before submitting a job, letting you add additional tasks, task options, etc.
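To make that concrete, here's a rough sketch of what "workflows and catalogs as data" buys you. The task names and most catalog keys below are illustrative, not the exact Onyx schema:

```clojure
;; A hypothetical Onyx-style job: the workflow is a vector of
;; [from to] edges, and the catalog is a vector of maps.
;; (Key names here are illustrative, not the exact Onyx schema.)
(def workflow
  [[:read-lines :parse] [:parse :write-results]])

(def catalog
  [{:onyx/name :read-lines    :onyx/type :input}
   {:onyx/name :parse         :onyx/type :function}
   {:onyx/name :write-results :onyx/type :output}])

;; Because it's all plain data, splicing a new task into the flow at
;; runtime is ordinary collection manipulation -- no builder API:
(defn insert-task [wf cat from to task]
  [(-> (vec (remove #{[from to]} wf))
       (conj [from task] [task to]))
   (conj cat {:onyx/name task :onyx/type :function})])

(first (insert-task workflow catalog :parse :write-results :filter-errors))
;; => [[:read-lines :parse] [:parse :filter-errors] [:filter-errors :write-results]]
```

The same trick works for catalogs: another program, or even another machine, can build or rewrite these structures before the job is submitted, because nothing about them is tied to a live JVM object graph.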
Onyx also builds batch processing on top of streaming operations, whereas Spark appears to do the opposite, layering streaming (as micro-batches) over a batch engine. There are likely to be trade-offs between these approaches.
Spark is also a lot faster, though this isn't necessarily intrinsic to the approaches.
- Storm is significantly more mature and performant at the moment.
- Storm has a better cross-language story in terms of bolt functions.
- Pretty much everything in Onyx is much more open ended. This applies to deployment, program structure, and workflow creation - and is mostly an artifact of how aggressively Onyx uses data structures.
- Onyx has a far better reach across languages in terms of its information model.
- Onyx will be adopting a tweaked version of Storm's message model next release to get on the same level of performance and reliability. We're dropping the HornetQ dependency.
- Onyx was born out of years of frustration with direct usage of Storm and Hadoop.
As someone who has been using Storm, this looks very interesting. What I particularly like are the clean, well thought-out ideas. Also, easily reconfigurable (at runtime) topologies are something we'd be interested in. I will definitely take a very close look at Onyx.
Performance is important: in our case, decreasing it significantly below Storm's level would not be acceptable.
Also, I watched the Strange Loop presentation and the tree model looks limiting to me: I have topologies where I need to merge information from two streams (but perhaps I haven't understood the Onyx model yet).
Hi Michael, thanks for your work creating Onyx - it looks really cool.
I can infer two of your frustrations with Storm from the above post: that Storm was too closed, and its information model didn't span across languages very well. If you have the time, could you elaborate on these pain points, and any others that you found?
I'll paraphrase a few snippets from my own documentation to answer these questions. Happy to comment more if needed.
Information models are often superior to APIs, and almost always better than DSLs. The hyper-flexibility of a data structure literal allows Onyx workflows and catalogs to be constructed at a distance, meaning on another machine, in a different language, by another program, etc. Contrast this to Storm. Topologies are written with functions, macros, and objects. These things are specific to a programming language, and make it hard to work at a distance - specifically in the browser. JavaScript is the ultimate place to be when creating specifications.
Further, the information model for an Onyx workflow has the distinct advantage that it's possible to compile other workflows (perhaps a datalog) into the workflow that Onyx understands.
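As a sketch of what that compilation step could look like (the adjacency-map input format and names here are hypothetical, invented for illustration):

```clojure
;; Hypothetical compilation: turn a higher-level adjacency-map
;; description of a dataflow into the flat edge-vector form an
;; Onyx-style workflow uses. Any richer representation (a datalog,
;; a GUI-built graph, JSON from a browser) could target this too.
(defn adjacency->workflow [adj]
  (vec (for [[from tos] adj
             to tos]
         [from to])))

(adjacency->workflow {:read-logs [:parse]
                      :parse     [:errors :stats]})
;; => [[:read-logs :parse] [:parse :errors] [:parse :stats]]
```

Since the target is just a vector of keyword pairs, the compiler can live anywhere, including in JavaScript in a browser, and emit the workflow as EDN or JSON.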
Michael, can you explain this more? "[Storm] Topologies are written with functions, macros, and objects. These things are specific to a programming language, and make it hard to work at a distance -- specifically in the browser. JavaScript is the ultimate place to be when creating specifications."
I don't really get it. Storm Topologies are built in Java or Clojure using a builder interface, but the data structures for topologies themselves are actually DAGs that serialize using Thrift. It's true that this is a bit heavy-weight compared to something like JSON or EDN, but offering an alternative is a discussion in the community right now. What would your ideal representation of topologies be, actually?
I wasn't aware that they're Thrift serializable - that's cool, and offers roughly what Onyx does in terms of its workflow representation.
Onyx goes a little further though in terms of its catalog. I wanted more of the computation to be pulled out into a data structure. That includes runtime parameters, flow, performance tuning knobs, and grouping functions. All of these things are represented as data in Onyx. It's a little harder, at least in my experience, to do these things in Storm.
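For example, because a tuning knob lives in the catalog data rather than in code, changing it is just a map update. A minimal sketch, assuming an abbreviated catalog entry shape (`:onyx/batch-size` is a real Onyx catalog key; the rest is trimmed for illustration):

```clojure
;; A trimmed catalog entry carrying a performance tuning knob as data.
(def catalog
  [{:onyx/name :parse :onyx/type :function :onyx/batch-size 100}])

;; Retuning a task is ordinary data manipulation over the catalog:
(defn set-batch-size [cat task n]
  (mapv #(if (= task (:onyx/name %))
           (assoc % :onyx/batch-size n)
           %)
        cat))

(set-batch-size catalog :parse 1000)
;; => [{:onyx/name :parse, :onyx/type :function, :onyx/batch-size 1000}]
```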
What were the main pain points that motivated you to develop Onyx? What capabilities do you want to add or have already added that Storm doesn't provide?
Re: Onyx's architecture. I would wonder about performance when keeping a shared log in ZooKeeper. Why not use something like Kafka -- it is designed for high-volume, immutable logging. ZK works best for less-frequently changing configuration, such as node connection information or snapshotting. I could be wrong. I'd like to hear your thoughts and experience.
From a brief examination, Tesser looks a lot simpler (probably because it encodes most of the folding using various monoids). Does Onyx have a similar abstraction model that I missed?
Tesser also allows you to distribute work using Hadoop, I think. I haven't used it; I only happened to hear about it when @aphyr gave a talk at the Clojure eXchange in London.