Is that really all that "big data" positions need? In my past experience (at Google), working with actually huge data sets introduced all sorts of problems that weren't handled by the available frameworks (not even MapReduce, which by most accounts is significantly more advanced than Hadoop). Things like:
1. With a big data set, there is no easy way to verify the correctness of your algorithms. The data is too big to hand inspect, and so assuming your code is syntactically well-formed and doesn't crash, you will get an answer. Is your answer correct? Well, you don't actually know, and any number of logic errors might throw it off without causing a detectable programming error.
2. Big data is messy. There will be some records in your data set that are formatted differently than you expect, or that mean something semantically different than you expect. Best case, your Hadoop job crashes 4 hours in. Worst case, it silently succeeds, and you have no idea your results were polluted by spurious records you never knew existed (the defensive-parsing sketch after this list is the usual countermeasure).
3. Big data will expose basically every code path and combination of code paths in your analysis program, so it all better be bulletproof. Learn how to write code correctly the first time, or you're going to be spending a lot of time waiting for the script to run and then fixing crashes several hours in.
4. Big data contains outliers. Oftentimes, the outliers will dominate your results, and so if you don't have a way of filtering them out or making your algorithm less sensitive to them, you will get garbage as your final answer (the trimmed-mean sketch after this list is one way to bound their influence).
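To make point 2 concrete, here's a minimal sketch (plain Python; the two-column CSV layout and the user_id/value field names are just assumptions for illustration) of the defensive-parsing habit: parse each record in a try/except, skip the junk, and count it rather than hide it, so a few malformed lines neither kill a long job nor silently pollute the output.

```python
import csv
import sys
from collections import Counter

def parse_record(row):
    """Parse one raw CSV row into (user_id, value); raise ValueError on junk."""
    if len(row) != 2:
        raise ValueError("wrong field count")
    user_id, raw_value = row
    return user_id, float(raw_value)  # float() raises ValueError on garbage

def load_records(path):
    """Yield clean records, counting (rather than hiding) every malformed one."""
    stats = Counter()
    with open(path, newline="") as f:
        for row in csv.reader(f):
            stats["seen"] += 1
            try:
                record = parse_record(row)
            except ValueError:
                stats["bad"] += 1
                continue
            yield record
    # Surface the damage instead of silently succeeding with polluted results.
    print(f"skipped {stats['bad']} of {stats['seen']} records", file=sys.stderr)
```

The point is the skip counter: if 30% of your input got thrown away, you want to know that before you trust the answer.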
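And for point 4, a tiny sketch of one way to make an aggregate less sensitive to outliers: a trimmed mean instead of a plain mean. The cutoff fraction here is an assumption you'd tune for your data, not a recommendation.

```python
def trimmed_mean(values, trim_fraction=0.01):
    """Mean after dropping the top and bottom trim_fraction of values.

    A handful of absurd outliers (corrupt timestamps, test accounts, etc.)
    can dominate a plain mean; trimming bounds their influence.
    """
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k] or ordered  # keep everything if too few points
    return sum(kept) / len(kept)

# One corrupt record would otherwise swamp the average of this toy sample:
print(trimmed_mean([1.2, 0.9, 1.1, 1.0, 5_000_000.0], trim_fraction=0.2))  # ~1.1
```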
There are techniques to deal with these, but they're habits built into your workflow as a data scientist, not features of the available tools. One thing that always amazed me at Google was how much time the data scientists on staff spent not writing code. Writing your MapReduces takes perhaps 5-10% of your day; most of the rest of it is mundane stuff like staring at data and compiling golden sets.
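To make the "golden set" idea concrete: the habit is to keep a small, hand-verified sample of inputs and expected outputs and run your per-record logic over it before burning hours on the full data set. A minimal sketch in plain Python; the name score_record and the numeric-tolerance comparison are hypothetical stand-ins for whatever your job actually computes.

```python
def check_against_golden_set(pipeline_fn, golden_inputs, golden_outputs, tolerance=1e-6):
    """Run the analysis on a small hand-verified sample and compare.

    A passing check doesn't prove the job is right at full scale, but a
    failing one catches logic errors before you wait hours for garbage.
    """
    failures = []
    for record, expected in zip(golden_inputs, golden_outputs):
        actual = pipeline_fn(record)
        if abs(actual - expected) > tolerance:
            failures.append((record, expected, actual))
    return failures

# Hypothetical usage, where score_record is the per-record logic your job runs:
# failures = check_against_golden_set(score_record, golden_inputs, golden_outputs)
# assert not failures, failures
```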