They have the right idea, it's a real need and a major opportunity. Airbyte[1] i...

glogla · on March 1, 2021

Another one on the market is pipelinewise.

Meltano and pipelinewise are both ways to orchestrate Singer.io taps and targets, while Airbyte is its own thing.

EDIT: I also don't know how the just-released Airbyte can be ahead of something that's surely been in production for a while.

lmeyerov · on March 1, 2021

Singer.io seems to run the data plane on json -- is there an Apache Arrow equiv for this ecosystem/problem, or something like it?

tayloramurphy · on March 1, 2021

I'm not aware of any. I did just open this issue[0] in the Meltano project to open discussion with the team/community. It could be an interesting iteration on the Singer Spec[1] if we find that users are interested in it and it helps solve some bottleneck challenges.

[0] https://gitlab.com/meltano/meltano/-/issues/2616 [1] https://github.com/singer-io/getting-started/blob/master/doc...

lmeyerov · on March 1, 2021

yeah, we do not push an etl pipeline through json unless we have to (and generally cannot). most etl-scale data engineering we do is almost all arrow/parquet/orc/protobuf etc, plus slow legacy odbc/json streams that we turn into typed and compact data. I think json is fine for command/metadata layers though, esp early on, but pretty core to what I look for in an etl/streaming tool is out-of-the-box foundations for the data plane

the good news is the implicitly typed json examples look arrow friendly, so users can to/from_json if they don't care about data speed/quality like when prototyping and not think about it. there may be other data-engineering-friendly formats that'd work too.
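a rough sketch of what "arrow friendly" means here, using only the Python standard library: row-oriented JSON records get pivoted into columns that each carry one inferred type. This is not Arrow itself, and `infer_type` and `to_columns` are names made up for the example; a real pipeline would hand the records to an Arrow library instead.

```python
def infer_type(values):
    """Return the single Python type shared by all non-null values, else None."""
    kinds = {type(v) for v in values if v is not None}
    return kinds.pop() if len(kinds) == 1 else None


def to_columns(records):
    """Pivot a list of JSON-style dicts into {name: (type, values)} columns."""
    names = []
    for record in records:
        for key in record:
            if key not in names:
                names.append(key)
    columns = {}
    for name in names:
        # Missing keys become None, which plays the role of a null slot.
        values = [record.get(name) for record in records]
        columns[name] = (infer_type(values), values)
    return columns


# Hypothetical records, as they might arrive in JSON messages.
records = [{"id": 1, "score": 0.5}, {"id": 2, "score": None}]
columns = to_columns(records)
```

a column whose values mix types comes back with `None` as its type, which is exactly the case where a typed format would force a decision that JSON lets you postpone.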

prefect, dask, and friends solve it by abstracting over it. you can send whatever you want.. and it happens to be friendly to dataframes (pydata) / compact & typed data. but these projects seem to be more about source/sink, so encouraging structure by default would be helpful...

edgarrmondragon · on March 1, 2021

AFAIK Meltano uses JSON only in the interface between a tap (source) and a target, to communicate schema, state and records.
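
As a rough illustration of that interface, here is a minimal sketch of a tap writing the three Singer message types (SCHEMA, RECORD, STATE) as JSON lines. The `users` stream, its fields, and the `emit_messages` helper are made up for the example, and an in-memory buffer stands in for stdout.

```python
import io
import json


def emit_messages(out):
    """Write one SCHEMA, two RECORD, and one STATE message as JSON lines."""
    messages = [
        # SCHEMA describes the stream's records as JSON Schema.
        {
            "type": "SCHEMA",
            "stream": "users",  # hypothetical stream name
            "schema": {
                "type": "object",
                "properties": {
                    "id": {"type": "integer"},
                    "name": {"type": "string"},
                },
            },
            "key_properties": ["id"],
        },
        {"type": "RECORD", "stream": "users", "record": {"id": 1, "name": "Ada"}},
        {"type": "RECORD", "stream": "users", "record": {"id": 2, "name": "Linus"}},
        # STATE carries a bookmark so a later run can resume from it.
        {"type": "STATE", "value": {"users": 2}},
    ]
    for message in messages:
        out.write(json.dumps(message) + "\n")


buffer = io.StringIO()  # stands in for the tap's stdout
emit_messages(buffer)
lines = buffer.getvalue().splitlines()
```

A target reads these lines from its stdin, one message per line, and decides what to do with each.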

It's up to the target what it does with the JSON messages it receives, so you can for example have a target-avro that takes JSON records and outputs them as an Avro file and translates the JSON schema to the corresponding Avro schema.
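
To make the schema translation concrete, here is a minimal sketch of mapping a flat JSON schema to an Avro record schema. It is not how any existing target-avro is implemented; the `JSON_TO_AVRO` table and `to_avro_schema` function are assumptions for illustration, and nested objects, arrays, and formats like date-time are left out.

```python
# Hypothetical mapping from JSON Schema primitive types to Avro types.
JSON_TO_AVRO = {
    "string": "string",
    "integer": "long",
    "number": "double",
    "boolean": "boolean",
    "null": "null",
}


def to_avro_schema(stream, json_schema):
    """Translate a flat JSON Schema object into an Avro record schema."""
    fields = []
    for name, prop in json_schema["properties"].items():
        json_types = prop["type"]
        if isinstance(json_types, str):
            json_types = [json_types]
        avro_types = [JSON_TO_AVRO[t] for t in json_types]
        # A single type stays a plain string; several become an Avro union.
        avro_type = avro_types[0] if len(avro_types) == 1 else avro_types
        fields.append({"name": name, "type": avro_type})
    return {"type": "record", "name": stream, "fields": fields}


# Hypothetical schema, as it might arrive in a SCHEMA message.
schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "email": {"type": ["null", "string"]},
    },
}
avro = to_avro_schema("users", schema)
```

The nullable `email` field shows the one non-obvious step: a JSON schema type list maps naturally onto an Avro union.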

glogla · on March 1, 2021

Yep, definitely.

Then the holy grail would be to have a bunch of tap-target pairs running in parallel for a single pipeline, each working on a subset of streams.
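
A minimal sketch of what that could look like, assuming the streams are independent of each other: split the stream names into subsets and run one worker per subset. Everything here is hypothetical, and `run_pair` is only a placeholder for starting a real tap and target restricted to the given streams.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(streams, workers):
    """Split stream names round-robin into at most `workers` subsets."""
    subsets = [streams[i::workers] for i in range(workers)]
    return [s for s in subsets if s]


def run_pair(subset):
    """Placeholder for launching one tap-target pair limited to `subset`."""
    # A real version would start the tap and target processes here.
    return {stream: "done" for stream in subset}


def run_pipeline(streams, workers=2):
    """Run one placeholder tap-target pair per subset and merge the results."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(run_pair, partition(streams, workers)):
            results.update(partial)
    return results


# Hypothetical stream names for a single pipeline.
streams = ["users", "orders", "invoices", "events", "tickets"]
results = run_pipeline(streams, workers=2)
```

The hard part this skips is state: each pair would emit its own bookmarks, and something has to merge them back into one state for the pipeline.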

rahimnathwani · on March 1, 2021

I'm currently using Stitch (https://www.stitchdata.com/) to move data from a few sources to Redshift.

Since Stitch, Meltano and Pipelinewise all use Singer.io taps and targets under the hood, I wonder if there's any reason to choose one over the other?

tayloramurphy · on March 1, 2021

It's dependent on your use case, but the three examples you've listed here all have a slightly different approach to the market. Stitch (as mentioned in a comment below) is SaaS only. Pipelinewise is open source but, as far as I know, its maintainers have no plans to build a company around it. With Meltano, we're aiming to grow the project and community and eventually build a business around it in a similar manner to what GitLab has done. Our docs[0] have more information about our current focus and roadmap if you're curious.

[0] https://meltano.com/docs/#focus

glogla · on March 1, 2021

Stitch is SaaS only. That by nature makes it suitable for some uses and unsuitable for others (like when you have a provision that your data can't leave your network, or when you are in a company where adding a new vendor isn't a quick process).

Meltano and Pipelinewise are open source projects that someone built for themselves but are sharing. You can just start playing with them and change the code or whatever, but there's no support to pay for.

For example, from where I'm standing, the best option would be "Stitch I can self-host for free for a PoC and then eventually engage the vendor about a support contract while still self-hosting it for security reasons", but there doesn't seem to be anything like it.

dataminded · on March 1, 2021

Easy -- Meltano has not been in production. They are both relatively new tools.

tayloramurphy · on March 1, 2021

The GitLab Data Team is running Meltano in production[0]. We're currently extracting Zoom data with it and have plans for several more extractors (Slack, GMail, PTO by Roots, EdCast, and a few more). I just made this MR[1] to update the list of Extractors to include Zoom too.

[0] https://about.gitlab.com/handbook/business-ops/data-team/pla... [1] https://gitlab.com/gitlab-com/www-gitlab-com/-/merge_request...

DouweM · on March 1, 2021

And the GitLab Data Team is not alone! The Meltano Slack community (link on the homepage) is about 800 strong right now, and every day we've got people discussing their production deployments and helping new users set up their own.

PS. Like Taylor, I'm on the Meltano team at GitLab.