If you want to work in a "big data"-type role as a developer, I wouldn't worry about finding huge data sets. There's a dearth of candidates, especially ones who actually have hands-on experience, and having deep knowledge of (and a little experience with) a broad range of tools will make you a pretty good candidate:
Fire up a VM with a single-node install on it [1] and just grab any old CSVs. Load them into HDFS, query them with Hive, query them with Impala (Drill, Spark SQL, etc.). Rinse and repeat for some syslog data, then JSON data. Write a MapReduce job to transform the files in some way. Move on to some Spark exercises [2]. Read up on Kafka, understand how it works, and think about ways to get exactly-once message delivery. Hook Kafka up to HDFS, or HBase, or a complex event processing pipeline. You'll probably need to know about serialization formats too, so study up on Avro, protobuf, and Parquet (or ORCFile, as long as you understand columnar storage).
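If you want something concrete to start from, here's a rough sketch of the "write a MapReduce job to transform the files" step, in the Hadoop Streaming style (the CSV layout here is made up for illustration; a tiny local driver fakes the shuffle/sort phase so you can test the logic before going anywhere near a cluster):

```python
from itertools import groupby

# Hypothetical CSV layout assumed: timestamp,host,bytes
def mapper(line):
    """Map step: emit (host, bytes) for each well-formed record."""
    parts = line.strip().split(",")
    if len(parts) == 3:
        yield parts[1], int(parts[2])

def reducer(host, byte_counts):
    """Reduce step: total bytes per host."""
    return host, sum(byte_counts)

def run_local(lines):
    """Fake the shuffle/sort phase locally so the map/reduce logic is
    testable before it ever touches a cluster."""
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return [reducer(host, [v for _, v in group])
            for host, group in groupby(pairs, key=lambda kv: kv[0])]

# run_local(["1,a,10", "2,b,5", "3,a,7"]) -> [("a", 17), ("b", 5)]
```

On a real cluster you'd wire `mapper` and `reducer` up to Hadoop Streaming (reading lines on stdin, writing tab-separated pairs on stdout); the local driver is purely for checking the logic.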
If you can talk intelligently about the whole grab bag of stuff these teams use, that'll get you in the door. Understanding RDBMSes, data warehousing concepts, and ETL is a big plus for people doing infrastructure work. If you're focused on analytics you can get away with less of the above, but knowing some of it, plus stats and BI tools (or D3 if you want to roll your own visualization) is a plus.
Is that really all that "big data" positions need? In my past experience (at Google), working with actually huge data sets introduced all sorts of problems that weren't handled by the available frameworks (not even MapReduce, which by most accounts is significantly more advanced than Hadoop). Things like:
1. With a big data set, there is no easy way to verify the correctness of your algorithms. The data is too big to hand inspect, and so assuming your code is syntactically well-formed and doesn't crash, you will get an answer. Is your answer correct? Well, you don't actually know, and any number of logic errors might throw it off without causing a detectable programming error.
2. Big data is messy. There will be records in your data set that are formatted differently than you expect, or whose contents mean something semantically different than you expect. Best case, your Hadoop job crashes 4 hours in. Worst case, it silently succeeds, and your results are polluted by spurious records you never knew existed.
3. Big data will expose basically every code path and combination of code paths in your analysis program, so it all better be bulletproof. Learn how to write code correctly the first time, or you're going to be spending a lot of time waiting for the script to run and then fixing crashes several hours in.
4. Big data contains outliers. Oftentimes, the outliers will dominate your results, and so if you don't have a way of filtering them out or making your algorithm less sensitive to them, you will get garbage as your final answer.
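Points 2 and 4 are worth internalizing early, and you can practice both on small data. A toy sketch (the field layout is invented): a defensive parser that counts malformed records instead of crashing or silently swallowing them, plus a trimmed mean so a handful of outliers can't dominate the answer:

```python
import statistics
from collections import Counter

def parse_amounts(lines):
    """Defensive parsing: tally malformed records rather than crashing.
    The tally itself is the important part -- report it, don't hide it."""
    stats, amounts = Counter(), []
    for line in lines:
        parts = line.strip().split(",")
        try:
            amounts.append(float(parts[1]))
            stats["ok"] += 1
        except (IndexError, ValueError):
            stats["bad"] += 1
    return amounts, stats

def trimmed_mean(values, frac=0.05):
    """Drop the top and bottom `frac` of values before averaging, so a
    few extreme outliers can't drag the result into nonsense."""
    xs = sorted(values)
    k = int(len(xs) * frac)
    return statistics.mean(xs[k:len(xs) - k] if k else xs)
```

With 98 latencies of 10ms and two stragglers at 10s and 20s, the plain mean is ~310ms while the trimmed mean stays at 10ms; which one is "correct" depends on the question you're asking, but you should at least know the outliers are there.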
There are techniques to deal with these, but they are built into your workflow as a data scientist, not into the tools. One thing that always amazed me at Google was how much time the data scientists on staff spent not writing code. Writing your MapReduces takes perhaps 5-10% of your day; most of the rest is mundane stuff like staring at data and compiling golden sets.
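The golden-set habit is easy to start practicing on day one: keep a tiny, hand-verified input/output pair next to every aggregation, and check the real code against it before kicking off a full run. A minimal sketch (the aggregation here is just a stand-in):

```python
def daily_totals(records):
    """The aggregation under test: sum amounts per day."""
    totals = {}
    for day, amount in records:
        totals[day] = totals.get(day, 0) + amount
    return totals

# A golden set: small enough to verify by hand, run before every big job.
GOLDEN_INPUT = [("mon", 3), ("mon", 4), ("tue", 1)]
GOLDEN_OUTPUT = {"mon": 7, "tue": 1}

def check_golden():
    """Cheap sanity check that catches logic errors the cluster won't."""
    return daily_totals(GOLDEN_INPUT) == GOLDEN_OUTPUT
```

It won't catch everything (point 1 above stands), but it does catch the class of logic errors that otherwise surface as a plausible-looking wrong answer four hours later.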
Definitely - playing with any data is the best way to learn the tools.
Just for giggles I built a tool that extracts all Oracle permissions, sums them up into relationship information using Pig (this schema owner reads from here, writes to there, etc.), and plots the results, first with R/ggplot2 and later with Gephi.
None of the data sets could be called big data by any stretch, and I could have done the processing more quickly with perl or python, or even a mix of shell commands. But that wasn't the point. It was to expand on the one day of training I'd had and help cement the ideas, and frankly it was to have fun.
Find something that you're passionate about or just plain sounds like fun and then use the tool you want to learn to solve your problem.
For someone with a typical web dev background (comfortable with databases like Oracle or MySQL, but knowing nothing about tools like Hadoop), could you recommend a course/book to start big data with? Also, with so much to learn, how does one go about deciding what field within big data to specialize in?
>how does one go about deciding what field within big data to specialize in?
In my experience your job generally dictates what you specialize in. I ended up being more data engineer than scientist since my job had a lot of tricky data warehousing problems.
There's a lot of stuff under the "Big Data" umbrella: I focused on Hadoop below because that's my focus right now. I'm sure I'm missing some roles here, but the specialities I can think of are:
- getting data out of production systems and transforming it (infrastructure or ETL)
- analytical querying and reporting
- system administration
- machine learning
There's also the wide world of NoSQL data stores, which people lump in with big data, but which require vastly different skills.
The Hadoop VM I linked to above is good for working through exercises for all of the above.
As a starting point, this book [1] walks through the motivation behind Hadoop, then gets a little into internals and use cases. It's out of date, but working through it will get you into the right frame of mind, comfortable with HDFS, etc.
AMP Camp (that I linked to above) is an introduction to Spark for people with a little Hadoop experience. Spark is getting a lot of attention, you could run into it in a number of roles.
If you're going to be planning the whole pipeline, or doing any sort of infrastructure role, I recommend Hadoop Application Architecture[2] for more modern tools and design patterns. This blog post[3] is a pretty good overview of distributed logs, which are essential for horizontal scale. Understanding Kafka and ZooKeeper is really useful for infrastructure roles, maybe less so for admins.
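If the distributed-log idea is new to you, the core of it fits in a few lines. This is only a toy in-memory sketch, nothing like Kafka's real API: an append-only sequence of offsets, with each consumer tracking its own position independently:

```python
class Log:
    """Toy append-only log: each record gets a monotonically
    increasing offset, and nothing is ever mutated in place."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # the new record's offset

    def read(self, offset):
        """Return everything from `offset` onward."""
        return self.records[offset:]

class Consumer:
    """Consumers track their own committed offset, so readers never
    coordinate with each other -- the key to horizontal scale."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        batch = self.log.read(self.offset)
        self.offset += len(batch)
        return batch
```

Because each consumer owns its offset, adding readers costs the log almost nothing, and a crashed consumer just resumes from its last committed offset. "Exactly once" in practice usually ends up being at-least-once delivery plus idempotent handlers that dedupe on a message key.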
If you're planning to be in the reporting layer, having a deep understanding of SQL and data warehousing is useful. This book[4] is old hat, but I would say it's expected knowledge for anyone planning a warehouse, and it's interesting to understand best practices. Most places will also expect knowledge of Tableau or a similar BI tool, but that's tougher to learn on your own since licenses are brutal. Visualization with D3 is nice to have in this space, especially if you're coming from a web background - Scott Murray's tutorials [5] are a good starting place.
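On the warehousing side, the star schema is the pattern that book drills over and over: facts joined to dimensions, then aggregated. A toy example using SQLite so it's runnable anywhere (the table and column names are invented):

```python
import sqlite3

# Minimal star schema: one fact table plus one dimension table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (product_id INTEGER, amount REAL);
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO fact_sales VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")

# The classic warehouse query shape: join the fact table to a
# dimension, then roll up along a dimension attribute.
rows = conn.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales f JOIN dim_product d USING (product_id)
    GROUP BY d.category ORDER BY d.category
""").fetchall()
# rows -> [('books', 15.0), ('games', 7.5)]
```

Hive, Impala, and Redshift queries against a warehouse all reduce to variations on this join-then-aggregate shape, just at much larger scale.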
It's harder to point to resources for sysadmins - if you weren't a sysadmin before, you need to understand a lot of other concepts before you worry about Hadoop stuff. ML is similar - you need to understand the principles and be able to work on a single node. There's lots of good resources out there about getting started in data science.
>If you can talk intelligently about the whole grab bag of stuff these teams use, that'll get you in the door. Understanding RDBMSes, data warehousing concepts, and ETL is a big plus for people doing infrastructure work.
This is sadly true, for now. I don't think folks here disagree with the "true" part. Let me explain why it is "sadly" and "for now".
The biggest issue with big data is most of it sits unused. In many organizations, HDFS ends up being an alternative to NetApp storage servers, storing terabytes of data with the hopes of them being useful one day.
In fact, even getting to the stage of using HDFS as a storage server requires a decent ETL team that can put data into HDFS with a menacing combination of ad hoc scripts and a workflow that looks like a cobweb produced by a deranged spider. For now, knowing the ins and outs of various semi-functional open source components, plus the tenacity, patience, and skill to deal with the gnarliest of ETL tasks, will get you a high-paying data engineering job.
But, in the long term, there will be a big change.
1. Tools are getting better: many data practitioners are realizing there are huge gaps between different data infrastructure components, and they are trying to fill them. Query execution engines (Presto, Impala, Spark, etc.) get a lot of attention, but I find data collection and workflow management tools just as critical (if not higher leverage) right now. Fluentd (log collector) [1] and Luigi (workflow engine) are open source efforts in this direction.
2. Data-related cloud services are becoming really, really good: huge kudos to services like AWS, GCP, Heroku (through Addons). They are quickly building a great ecosystem of data processing/analysis/database components that frankly work better than most self-administered OSS counterparts. (Disclaimer: my perception might be colored here since I work for a data processing/collaboration SaaS myself [3])
So, back to the question. I think aspiring data engineers have two distinct career paths:
1. Becoming an expert in a particular data engineering component: this would be building a query execution engine, designing a distributed stream processing system, etc. (It would be awesome if you decide to release as open source)
2. Becoming an expert on quickly and effectively deploying cloud services to get the job done: this is the skill most desired among data engineers at startups.
What not to become is one of these OSS DIY bigots: not good enough to build truly differentiating technology, but adamant about building and running their own <up and coming OSS technology>. These folks will be wiped out in the next decade or so.
> The biggest issue with big data is most of it sits unused
This is really variable. If you're at a place that jumped on the bandwagon, then yes. But there are also lots of companies (and not just Google/FB/LinkedIn) that build mission-critical reporting and ML infrastructure on Hadoop. These companies appreciate the value of workflow coordination, and they wouldn't move ahead without (at least) Oozie/Azkaban in place to give some visibility into their workflows.
> But, in the long term, there will be a big change.
I think more types of work will become commoditized. If you just want log processing, there are lots of on-premises and cloud options. Splunk has been doing this forever. Ostensibly with good-enough BI software you could just focus on ingest, and everything else is drag and drop. On a long enough time frame, hand-rolling pipelines will become obsolete. This is like a 10+ year timeline for any player to get significant market share. In the meantime, people have to actually get stuff done, and their skills will be transferable because they understand distributed systems, ETL, warehousing, and a lot of other stuff that hasn't really changed in a decade.
> Becoming an expert in a particular data engineering component
Are you advocating that nobody write Spark Streaming jobs, because they should rewrite Spark instead? Don't learn to work with Impala, learn to rewrite Impala? I disagree: the tools are only getting better, and it's going to take more and more work to replace the entrenched players. Working on top of solid tools will make you far more productive than engaging in NIH and writing your own SQL engine.
> Becoming an expert on quickly and effectively deploying cloud services to get the job done
Like Redshift, EMR, and Amazon Data Pipeline? They're hardly turnkey solutions. Amazon's Kinesis is just Kafka with paid throughput - you can absolutely reuse your skills in the cloud without having to cave and get locked in to a single vendor serving one specific use case.
> What not to become is one of these OSS DIY bigots: not good enough to build truly differentiating technology, but adamant about building and running their own <up and coming OSS technology>
So in your mind you either pick a vendor to handle all your data for you, or you're an "OSS DIY bigot"? For a startup, something like owning your entire user analytics pipeline isn't mission critical, and it's stupid to build it yourself?
> These folks will be wiped out in the next decade or so.
Even though Oracle is amazing and great, lots of people still use Postgres, MySQL, etc. There's always going to be a continuum from "we should buy this turnkey thing" to "we started by rolling our own SQL query engine". You need to be able to identify when each is appropriate, not shoehorn in a one-size-fits-all solution.
[1] http://www.cloudera.com/content/cloudera/en/downloads/quicks... [2] http://ampcamp.berkeley.edu/5/