Friday, 5 September 2014

Data science at scale - calling out the "Big Data Scientist"

“Data science” is a popular term and one in the ascendant in Gartner’s Hype Cycle for Emerging Technologies 2014. It means different things depending on whom you ask. One way to deal with subjective interpretations is to crowdsource the answer and pick the popular interpretations, provided there is enough data. Recently, a data scientist (who else?) at LinkedIn attempted to define the term “data scientist” using data from profiles across its network that contain the phrase “data scientist”. His results are available in a short post over at LinkedIn. Unusually for a data scientist, the author doesn’t provide any quantified data at all; I would have expected to see at least the number of profiles analyzed, the popularity scores for skills, and the strength of the relationships between terms. Without numbers, there isn’t much interpretation that outsiders like me can do. Looking at the information qualitatively, the set of data scientists in the LinkedIn network seems distinctly tilted towards “small data” analysis as opposed to “large data” analysis. I gauge this from two indicators: (a) the absence from the “Most popular skills” table of skills typically associated exclusively with large-data analysis; (b) the small sizes of the bubbles for those large-data skills and the lack of any strong connections (look at the higher-resolution image in that post) from any of them to the popular “small data” skill bubbles.

Does this mean that the majority of data practitioners are “small data” scientists? Where are the “Big Data Scientists” (a portmanteau of “big data” and “data scientist”) and what sets them apart?

As that post and many others delineate, a good data scientist has mastery over a breadth of techniques, the tools that encode those techniques, and the domain knowledge that gives the results their extra oomph. The tools – be they statistical or visualization-oriented – provide algorithms and implementations of techniques out of the box, to be applied as deemed fit for the data problem at hand. The tools themselves do not provide ready-made solutions; it is the data scientist who knows which tools to use and which techniques to apply given the nature of the data, the type of problem being addressed, and the targets to be achieved, if any. It is no wonder, then, that data science is sometimes referred to as an “art”, with its practitioners commanding a premium.

Data science at scale is a completely different beast from data science on a single machine. Data analysis on a single machine is hard enough, but data analysis at scale challenges fundamentals that are often taken for granted. Take sorting. It is one of the first problems introduced in an algorithms course in a computer science curriculum, and how to sort data is well understood. However, when the data being sorted is larger than the memory available on the machine, a different algorithm is required: the textbook algorithms are main-memory algorithms, while this single-machine class is known as external (out-of-core) sorting. When the data grows larger still and no longer fits on a single machine, even those algorithms do not suffice, and yet another design is required: distributed sorting algorithms. Sorting at massive scale is a problem class in itself and even has dedicated big data benchmarks (look for the Gray Sort, named after Jim Gray).
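To make the distinction concrete, below is a minimal sketch in Python of the external (out-of-core) approach: sort memory-sized chunks with the textbook algorithm, spill each sorted run to disk, then stream a k-way merge over the runs. The function and parameter names are mine, for illustration only, and this assumes newline-terminated text records.

```python
import heapq
import itertools
import tempfile

def external_sort(input_path, output_path, max_lines_in_memory=1000000):
    """Sort a text file too large for RAM, one memory-sized run at a time."""
    run_files = []
    with open(input_path) as infile:
        while True:
            # Phase 1: sort a chunk that fits in memory (the textbook sort)
            # and spill it to disk as a sorted "run".
            chunk = list(itertools.islice(infile, max_lines_in_memory))
            if not chunk:
                break
            chunk.sort()
            run = tempfile.TemporaryFile("w+")
            run.writelines(chunk)
            run.seek(0)
            run_files.append(run)
    # Phase 2: k-way merge of the sorted runs; only one line per run is
    # resident in memory at any moment.
    with open(output_path, "w") as out:
        out.writelines(heapq.merge(*run_files))
    for run in run_files:
        run.close()
```

A distributed sort changes the design yet again: the data is typically range-partitioned across machines first, so that each node can sort its disjoint slice locally.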

Sorting is one of the simplest problems in the world of big data. There are far more complicated ones, such as machine learning at scale. All in all, I would argue that the most challenging aspect of being a “Big Data Scientist” is knowing not only when to use a given analysis approach (e.g. clustering versus classification) and which technique fits (e.g. k-means for clustering), but also how the underlying algorithms are designed. Knowledge of an algorithm’s internals comes in handy when designing a distributed version of it that performs well on massive data, as the sketch below illustrates. This crucial ability – to not only know “data science” but also design and implement its algorithms to run on massive data – is what really sets a “Big Data Scientist” apart from a “small data” scientist.
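As a toy illustration of what that redesign looks like, here is one k-means iteration in plain Python reshaped into a map/reduce form: each partition of the data emits only per-centroid sums and counts, and those small summaries – not the points themselves – are all that must travel over the network. All names here are mine; this is a sketch, not production code.

```python
def closest(point, centroids):
    """Index of the nearest centroid by squared Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[i])))

def map_partition(points, centroids):
    """Map side: per-centroid [sum vector, count] for one data partition."""
    stats = {}
    for point in points:
        i = closest(point, centroids)
        if i not in stats:
            stats[i] = [list(point), 0]
        else:
            stats[i][0] = [a + b for a, b in zip(stats[i][0], point)]
        stats[i][1] += 1
    return stats

def reduce_stats(partials):
    """Reduce side: merge the partition summaries into new centroids."""
    combined = {}
    for stats in partials:
        for i, (s, n) in stats.items():
            if i not in combined:
                combined[i] = [list(s), n]
            else:
                combined[i][0] = [a + b for a, b in zip(combined[i][0], s)]
                combined[i][1] += n
    return {i: [x / float(n) for x in s] for i, (s, n) in combined.items()}

# Two "machines", each holding a partition of the points.
partitions = [[(1.0, 1.0), (1.2, 0.8)], [(8.0, 9.0), (9.0, 8.0)]]
centroids = [(0.0, 0.0), (10.0, 10.0)]
partials = [map_partition(p, centroids) for p in partitions]
print(reduce_stats(partials))   # {0: [1.1, 0.9], 1: [8.5, 8.5]}
```

The textbook algorithm is unchanged; what the “Big Data Scientist” contributes is spotting that the assignment step is embarrassingly parallel and that only the small per-centroid summaries need to be shuffled between machines.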

In the last couple of years, there has been a steady stream of software packages offering big data-enabled algorithms out of the box. Open-source packages in the Hadoop stack include the popular Mahout and the newer Spark MLlib, to name two. If you do not subscribe to the Hadoop architecture, GraphLab, built on MPI, can run standalone. Among the few proprietary packages offering massive-scale algorithms out of the box, Teradata Aster is a great example, and I cut my teeth in big data by contributing massive-scale algorithms to its analytic foundation library. These packages make the transition from “small data” scientist to “Big Data Scientist” easier, but talk to any expert statistician and you’ll hear that they still cover only a fraction of the breadth of algorithms and techniques required.
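To give a flavour of “out of the box”, here is roughly what a k-means run through Spark MLlib’s Python API looks like; the HDFS path and the parameter values are hypothetical.

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="kmeans-at-scale")

# Parse a (hypothetical) file of space-separated feature vectors into an
# RDD partitioned across the cluster.
data = sc.textFile("hdfs:///data/features.txt") \
         .map(lambda line: [float(x) for x in line.split()])

# MLlib supplies the distributed k-means implementation; the user supplies
# only the data and the hyperparameters.
model = KMeans.train(data, k=10, maxIterations=20)
print(model.clusterCenters)
```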


O’Reilly Media conducted a survey of data scientist salaries across the 2012 and 2013 editions of Strata. That report is a better example of presenting data about data scientists (the report calls them data professionals, since not every respondent wears the “data scientist” tag). Parts of the survey, especially those about proprietary tool usage, are less useful: the Strata audience tends towards the open-source-Kool-Aid-drinking types, so the sample is biased. The size and geographic variance of the Strata audiences are also necessarily smaller than what LinkedIn could potentially see in its global member data. Nevertheless, the O’Reilly survey reinforces the point of this post: the portmanteau role of “Big Data Scientist” is a rare combination and commands a premium over even the “small data” scientist.
