Friday, 12 September 2014

Data Science at scale: the 1-hop neighbours problem

This post is technical. An understanding of how systems work and algorithms designed is recommended. 

As a companion post to “Calling out the Big Data Scientist”, here is an illustration of the kind of decisions a data scientist would need to make when carrying out graph analytics with large data sets.

Consider building a recommendation engine for people/friends on a social network. There are many attributes that make up people recommendations, including common subjects of interest, places visited, educational universities studied together, workplaces shared etc. One of the most important features for recommendations is the number of friends that the user and the potential candidate to be recommended have in common. Let’s refer to the user as U and the recommended person as X. The number of friends that U and X have in common is in turn the number of intermediaries that U is connected to and that in turn connects to X. In graph jargon, it is the number of distinct paths that connect U and X in a single hop. That’s why this problem is also referred to as the "1-hop neighbours problem.”

For ease of understanding and without loss of generality, assume that a typical user U has 10 connections, and those friends are typical users themselves in that they also each have 10 connections. Note that there is a good chance that a typical user could also be connected to a well-connected user (where well-connected refers to having a large number of connections, say 1000 or more), for instance a celebrity or a headhunter or a popular subject matter expert.

To solve the "1-hop neighbours” problem, a programmer’s approach could be to store the data of edges (i.e. who is connected to whom) in a file and then write some code. Clearly it is the code that is the complex part of this approach. We will not dig deep into this line of approach since it is brute-force and there are more readymade techniques available that will solve the problem for lesser effort and time. 

A RDBMS-oriented analyst could model the edges data as a relation with each tuple having 2 columns - “src” and “dest”  - possibly representing people as vertices with an identifier. There are other representations for graph data that are possible but an edge-based representation turns out to work well in practice for large, real-world graphs.

In this RDBMS setup, the SQL query to calculate the number of common friends between every pair of users is as follows:
SELECT L1.src, L2.dest, count(1) as common_friends_count 
FROM  edges L1 INNER JOIN edges L2 ON (L1.dest=L2.src) 
WHERE L1.src <> L2.dest 
GROUP BY L1.src, L2.dest
In brief, the query joins the edges table (referred as L1) with itself (referred to as L2, to disambiguate) so that for every user (i.e L1’s “src” column) we get his friend (i.e  L1’s “dest” column for the corresponding tuple) from the edges table,  and then use the edges L2 table to get that friend’s friends (i.e the L2’s “dest” column for all tuples as determined by L1's "dest" column). Then, by counting the number of tuples that result from the join, we get the answer for every pair of users. We also need to avoid loops, the WHERE clause helps to filter that out.

The SQL query above performs well for a small graph. Unfortunately, it fails quite spectacularly on graphs of large real-world networks. This is because of two reasons: 
  • real-world graphs are sparse (intuition: the number of people each of us typically knows is far far lesser than the total number of people in the network worldwide); 
  • real-world graphs have network effects (meaning, the user U who has 10 friends can, by tapping into his friends’ networks, reach 100 users at least who in turn can tap into their friends’ networks to allow U to reach 1000 users at least, and so on; since there are many well-connected individuals in real-world graphs, within a small number of hops, a typical user can reach most of the network - this is the so-called “Six Degrees of Separation”).

To appreciate the argument, let us consider first how we would execute this query on a large graph in a database. To start with, single machine databases are not sufficient to carry out the analysis on a large graph. Even if the graph can be stored on disk, it is prohibitively expensive to have enough resources on a single machine to be able to compute the answer. So, a parallel database, preferably with a shared-nothing, massively parallel processing architecture, needs to be considered. In a typical parallel database, data is usually laid out distributed by the hash of a particular column or set of columns - for the edges data of 2-tuples, there are only 2 choices "src" or “dest” either of which is reasonable. Let us choose “src” as the distribution column. This implies that all edges from the same “src” are available together in a single machine. The various "src" values are spread out roughly uniformly across the entire cluster. 

The distribution of the graph on a column across the parallel database is itself a problem. This is due to the presence of well-connected users. The specific machines that store the edge data for such well-connected users are assigned a much larger chunk of data than other machines resulting in skew. For a parallel database, skew is a real performance killer. Let us set this revisit this graph distribution issue later.

Now, consider the processing side of the argument. Let us get into the operations of a parallel database involved in the execution of this query:
  1.  In order to execute the join of the edges table with itself, the join needs to be executed locally on every machine. This requires having the data necessary for the join to be present locally. This in turn requires creating a copy of edges that is distributed on the “dest” column. The dominant cost in this operation is the cost of transferring the entire edges table once over the network.
  2. Each machine in the cluster then executes the join locally. The join produces significantly more number of tuples than the input tables due to network effects of expanding from the circle of friends to the circle of friends of friends. The dominant cost here is the join itself, which is be executed using any of a number of well-known relational algebra join techniques like hash joins, sort joins, hybrid hash joins etc.
  3. Since the aggregation in the query happens on columns that are different from those used in the join condition, the output of the join will need to be redistributed. The dominant cost here is the cost of transferring the output table from the join over the network.
  4. The aggregation is carried out on the redistributed join output. The dominant cost depends on the algorithm used to perform the aggregation, which could be any of a number of well-known techniques like hash aggregations, sort aggregations, hybrid hash aggregations etc.

The join operation in step 2 is more expensive than it could be considering the sparsity of the graph. The network effects present in real-world graphs produce a large amount of output, which adds to the latency involved in step 2 and also has a more significant impact in the aggregation operation. The aggregation operation again is more expensive than it needs to be due to the same reasons of sparsity. As an illustration of sparsity, even though network effects mean a user U could now reach 100 friends in 1-hop starting from his 10 friends, notice that computing the number of friends user U has in common with any of those 100 users is not dependent on the rest of the network.

By adopting a scalable approach and a platform that is optimised for processing large graphs, the "1-hop neighbours” problem becomes easier to solve and get the performance right. The popularity of bulk synchronous parallel (BSP) models, or vertex-centric processing, is because of the benefits it provides for large-scale processing of real-world graphs. Notable among the implementations is Google’s Pregel and its many clones including ApacheGiraph, Teradata Aster SQL-GR, and GraphLab. In this model, each vertex becomes an independent thread of computation where messages that are sent by its friends are processed followed by the sending of messages, if any, to its friends. The BSP model guarantees that a vertex receives only the messages that is sent to it and a message is sent only to the vertex it is meant for.

For the task of solving the “1-hop neighbours" problem, an implementation using the BSP model has the following 3 iterations after which the model stops execution since there are no more messages:
  1. Each vertex sends to its neighbours a message about that edge of the form “src”-”dest”. The dominant cost in this operation is the cost of transferring the entire edges table once over the network.
  2. In the next iteration, a vertex broadcasts each message that it receives to every one of its neighbours. The dominant cost in this operation is the cost of transferring the output messages which are the tuples corresponding to the friends of friends.
  3. In the last iteration, each vertex aggregates over its messages the number of occurrences of each “src”, sets itself as the "dest" and constructs the output as the "src"-"dest"-count.

For a large graph, iterations 1 and 2 have nearly the same cost as steps 1 and 3 in the parallel database approach. Note that there is no join operation in the BSP model as vertex-to-vertex message routing is provided by the platform. Iteration 3 in the BSP approach carries out the aggregation operation but needs to only consider those rows that influence the output for that pair of users. 

Comparing at a deeper level the aggregation steps in the 2 approaches, Step 4 in the parallel database approach is aggregating over all rows. Each machine in the cluster executes the aggregation locally and needs to consider a memory structure (typically, a hash table) suitable to hold all pairs of vertices that occur in that machine. When the number of distinct pairs (and their counts) is likely to exceed available memory, the database engine decides to adopt a disk-based approach to carry out the aggregation. When that happens (which it does for large graphs), the same input data needs to be scanned multiple times. This translates to significant overheads compared to the aggregation in the BSP model.

In the BSP model, since each vertex is a thread of computation, iteration 3 carries out aggregation easily in memory scanning only those friends of friends tuples that vertex receives. The typical vertex uses up very little memory in carrying out a hash aggregation, exploiting the sparsity observation. For the extremely connected users, it is possible that the number of distinct pairs is far larger and so a disk-based aggregation would be required. Again, since each vertex’s aggregation operation operates on only its friends of friends, the disk-based aggregation would need to store, retrieve and scan multiple times only the tuples of that vertex. Even with putting together aggregations that happen across all vertices, the resources used in the BSP model are lesser than in the parallel database world.

In summary, the BSP model is equivalent to the parallel database approach in terms of network cost of data transfer, avoids the join processing cost almost completely, and minimises the cost of the aggregation step exploiting graph sparsity. For a large graph, even though network costs are significant, the data processing costs are significant too. Coupled with potential skew issues related to graph distribution over the database, the parallel database approach is less optimal than the BSP model for the "1-hop neighbours" problem.

A quick note on graph distribution and the potential problems of skew: the BSP model allows for much more flexibility in dealing with the problem of distributing real-world graphs. It is not to say that the skew problem disappears altogether but only that the flexibility allows modeling the distribution in such a way so as to minimise the impact of skew.   

A recent blogpost over at Intel's Science and Technology Center (ISTC) for Big Data presents results comparing performance of a special class of parallel databases (namely, a column-store) with that of a BSP model on two graph-processing problems of computing page rank and computing shortest paths. The post goes on to claim that the "1-hop neighbours" problem too is best solved in a database using SQL to express the solution. As we saw earlier in this post, though the claim of the SQL being easy to write is indeed true, it is far from obvious that databases would perform as well as a BSP model. Notwithstanding the fact that the performance in practice is not dependent on theory alone but also on the implementation of the systems (for examples of BSP model implementations that are faster than Apache Giraph, see GoFFish and Giraphx), for the “1-hop neighbours” problem on massive graphs, our past experience has in fact been to the contrary to what the post over at ISTC claims - namely, a database by itself is not the right vehicle for the "1-hop neighbours" problem as compared to the BSP model.

Using the right tool for the right job can go a long way in solving a problem, and that is very much true for data science as well. The relational model along with SQL has many advantages owing to the maturity of the industry. By exploiting specific properties of the problem (including number of vertices, number of edges, connectivity etc.), there are other approaches on parallel databases a data scientist could design for solving the problem of “1-hop neighbours” that could be faster than - for instance, using custom user defined functions and keeping a copy of the edges table on every machine in the cluster provided the graph is not too large. A big data scientist learns this by studying the properties of the problem and then prototyping and experimenting these variants.

Friday, 5 September 2014

Data science at scale - calling out the "Big Data Scientist"

“Data science” is a popular term and one in the ascendancy in Gartner’s Hype Cycle for Emerging Technologies 2014. It has multiple meanings based on whom you ask. One way to deal with subjective interpretations is to crowdsource the answer and pick the popular interpretations, provided there is enough data. Recently, a data scientist (who else?) at LinkedIn attempted to define the term “data scientist” using data from profiles of people that have the phrase “data scientist” across its network.  His results are available in a small post over at LinkedIn's page. Unusually for a data scientist, the author doesn’t provide any quantified data at all, whereas I would have expected to see at least the numbers of profiles analyzed, the popularity scores and the strength of the relationship between the terms or the popularity scores for skills. Without numbers, there isn’t a whole lot of interpretation that outsiders like me can do though. Looking at the information qualitatively, the set of data scientists in the LinkedIn network seems to be distinctly tilted towards “small data” analysis as opposed to “large data” analysis. I gauge this from two indicators: (a) absence from the “Most popular skills” table of those skills typically associated exclusively with large data analysis; (b) the small sizes of the bubbles of these large data-focused skills and the lack of any strong connections (look at the higher resolution image in that post) from any of these to the popular “small data” skill bubbles.

Does this mean that the majority of data practitioners are “small data” scientists? Where are the “Big Data Scientists” (a portmanteau of “big data” and “data scientist”) and what sets them apart?

As that post and many others delineate, a good data scientist has mastery over a breadth of techniques, the tools that encode these techniques, and the domain knowledge that helps provide the extra oomph to the results. As aids, the tools – be it statistical or visualization in nature – provide algorithms and implementations of techniques out-of-the-box that are then used as deemed fit for the data problem at hand. The tools themselves do not provide readymade solutions to the problem, whereas it is the data scientist who knows how to use which tools and what techniques given the nature of the data, the type of problem being addressed and the targets to be achieved, if any. It is no wonder then that data science is sometimes referred to as “art” with the practitioners commanding a premium.

Data science at scale is a completely different beast from data science on a single machine. Data analysis on a single machine is itself hard but, data analysis at scale typically challenges fundamentals that are often taken for granted. Take the problem of sorting. It is one of the first to be introduced in an algorithms course in a computer science curriculum, and how to sort data is well understood. However, when the data being sorted is larger than the memory available in a machine, a different algorithm is required. Let’s call this the single machine algorithm while the textbook algorithms could be classified as main-memory algorithms. When the data becomes even larger and no longer fits within a single machine, the previous algorithms do not suffice and yet another design for algorithms is required. These could be called the distributed sorting algorithms. Sorting at massive scale is a problem class in itself and has a dedicated big data benchmark too (look for "Gray").

Sorting is a very simple problem in the world of big data. There are complicated ones as well, like machine learning at scale. In all, I would argue that the most challenging aspect of being a “Big Data Scientist” is to know when to use some data analysis approaches (e.g. clustering versus classification) and the techniques for each (e.g. k-means for clustering) and to also know the design of algorithms. Knowledge of the internals of algorithms comes handy in designing a distributed version of the same algorithm that works with good performance on massive data. This crucial task of having to not only know “data science” but also be able to design and implement the algorithms to run on massive data really sets apart a “Big Data Scientist” from a “small data" scientist.

In the last couple of years, there has been a steady stream of software packages offering big data-enabled algorithms out-of-the-box. Open-source packages in the Hadoop stack include the popular Mahout and the newer Spark MLlib, to name a few. If you do not subscribe to the Hadoop architecture, GraphLab built using MPI can be executed standalone. Amongst the really few proprietary packages offering massive-scale algorithms out-of-the-box, Teradata Aster is a great example, and I cut my teeth in big data by contributing massive-scale algorithms to its analytic foundation library. These software packages make the transition from a “small data” scientist to a “Big Data Scientist” easier, but talk to any expert statistician and you’ll know that the coverage of the required breadth of algorithms and techniques is still poor.

O’Reilly Media conducted a survey of data scientist salaries across Strata editions in 2012 and 2013. That report is a better example of presenting data about data scientists (the report calls them out as data professionals since not all individuals wear the “data scientist” tag). Parts of the survey, especially those about proprietary tool usage, are not that useful since the majority of the audience at Strata tend to be the open-source-kool-aid consuming types and the survey sample is therefore biased. The size and the geographic variance of the audiences at Strata are also necessarily lesser than what LinkedIn could potentially see in its data of the world. Nevertheless, the O’Reilly survey also reinforces the points in this post that the portmanteau role of “Big Data Scientist” is a rare combination and commands a premium over even the “small data" scientist.