Friday, 12 September 2014

Data Science at scale: the 1-hop neighbours problem

This post is technical. An understanding of how systems work and algorithms designed is recommended. 

As a companion post to “Calling out the Big Data Scientist”, here is an illustration of the kind of decisions a data scientist would need to make when carrying out graph analytics with large data sets.

Consider building a recommendation engine for people/friends on a social network. There are many attributes that make up people recommendations, including common subjects of interest, places visited, educational universities studied together, workplaces shared etc. One of the most important features for recommendations is the number of friends that the user and the potential candidate to be recommended have in common. Let’s refer to the user as U and the recommended person as X. The number of friends that U and X have in common is in turn the number of intermediaries that U is connected to and that in turn connects to X. In graph jargon, it is the number of distinct paths that connect U and X in a single hop. That’s why this problem is also referred to as the "1-hop neighbours problem.”

For ease of understanding and without loss of generality, assume that a typical user U has 10 connections, and those friends are typical users themselves in that they also each have 10 connections. Note that there is a good chance that a typical user could also be connected to a well-connected user (where well-connected refers to having a large number of connections, say 1000 or more), for instance a celebrity or a headhunter or a popular subject matter expert.

To solve the "1-hop neighbours” problem, a programmer’s approach could be to store the data of edges (i.e. who is connected to whom) in a file and then write some code. Clearly it is the code that is the complex part of this approach. We will not dig deep into this line of approach since it is brute-force and there are more readymade techniques available that will solve the problem for lesser effort and time. 

A RDBMS-oriented analyst could model the edges data as a relation with each tuple having 2 columns - “src” and “dest”  - possibly representing people as vertices with an identifier. There are other representations for graph data that are possible but an edge-based representation turns out to work well in practice for large, real-world graphs.

In this RDBMS setup, the SQL query to calculate the number of common friends between every pair of users is as follows:
SELECT L1.src, L2.dest, count(1) as common_friends_count 
FROM  edges L1 INNER JOIN edges L2 ON (L1.dest=L2.src) 
WHERE L1.src <> L2.dest 
GROUP BY L1.src, L2.dest
In brief, the query joins the edges table (referred as L1) with itself (referred to as L2, to disambiguate) so that for every user (i.e L1’s “src” column) we get his friend (i.e  L1’s “dest” column for the corresponding tuple) from the edges table,  and then use the edges L2 table to get that friend’s friends (i.e the L2’s “dest” column for all tuples as determined by L1's "dest" column). Then, by counting the number of tuples that result from the join, we get the answer for every pair of users. We also need to avoid loops, the WHERE clause helps to filter that out.

The SQL query above performs well for a small graph. Unfortunately, it fails quite spectacularly on graphs of large real-world networks. This is because of two reasons: 
  • real-world graphs are sparse (intuition: the number of people each of us typically knows is far far lesser than the total number of people in the network worldwide); 
  • real-world graphs have network effects (meaning, the user U who has 10 friends can, by tapping into his friends’ networks, reach 100 users at least who in turn can tap into their friends’ networks to allow U to reach 1000 users at least, and so on; since there are many well-connected individuals in real-world graphs, within a small number of hops, a typical user can reach most of the network - this is the so-called “Six Degrees of Separation”).

To appreciate the argument, let us consider first how we would execute this query on a large graph in a database. To start with, single machine databases are not sufficient to carry out the analysis on a large graph. Even if the graph can be stored on disk, it is prohibitively expensive to have enough resources on a single machine to be able to compute the answer. So, a parallel database, preferably with a shared-nothing, massively parallel processing architecture, needs to be considered. In a typical parallel database, data is usually laid out distributed by the hash of a particular column or set of columns - for the edges data of 2-tuples, there are only 2 choices "src" or “dest” either of which is reasonable. Let us choose “src” as the distribution column. This implies that all edges from the same “src” are available together in a single machine. The various "src" values are spread out roughly uniformly across the entire cluster. 

The distribution of the graph on a column across the parallel database is itself a problem. This is due to the presence of well-connected users. The specific machines that store the edge data for such well-connected users are assigned a much larger chunk of data than other machines resulting in skew. For a parallel database, skew is a real performance killer. Let us set this revisit this graph distribution issue later.

Now, consider the processing side of the argument. Let us get into the operations of a parallel database involved in the execution of this query:
  1.  In order to execute the join of the edges table with itself, the join needs to be executed locally on every machine. This requires having the data necessary for the join to be present locally. This in turn requires creating a copy of edges that is distributed on the “dest” column. The dominant cost in this operation is the cost of transferring the entire edges table once over the network.
  2. Each machine in the cluster then executes the join locally. The join produces significantly more number of tuples than the input tables due to network effects of expanding from the circle of friends to the circle of friends of friends. The dominant cost here is the join itself, which is be executed using any of a number of well-known relational algebra join techniques like hash joins, sort joins, hybrid hash joins etc.
  3. Since the aggregation in the query happens on columns that are different from those used in the join condition, the output of the join will need to be redistributed. The dominant cost here is the cost of transferring the output table from the join over the network.
  4. The aggregation is carried out on the redistributed join output. The dominant cost depends on the algorithm used to perform the aggregation, which could be any of a number of well-known techniques like hash aggregations, sort aggregations, hybrid hash aggregations etc.

The join operation in step 2 is more expensive than it could be considering the sparsity of the graph. The network effects present in real-world graphs produce a large amount of output, which adds to the latency involved in step 2 and also has a more significant impact in the aggregation operation. The aggregation operation again is more expensive than it needs to be due to the same reasons of sparsity. As an illustration of sparsity, even though network effects mean a user U could now reach 100 friends in 1-hop starting from his 10 friends, notice that computing the number of friends user U has in common with any of those 100 users is not dependent on the rest of the network.

By adopting a scalable approach and a platform that is optimised for processing large graphs, the "1-hop neighbours” problem becomes easier to solve and get the performance right. The popularity of bulk synchronous parallel (BSP) models, or vertex-centric processing, is because of the benefits it provides for large-scale processing of real-world graphs. Notable among the implementations is Google’s Pregel and its many clones including ApacheGiraph, Teradata Aster SQL-GR, and GraphLab. In this model, each vertex becomes an independent thread of computation where messages that are sent by its friends are processed followed by the sending of messages, if any, to its friends. The BSP model guarantees that a vertex receives only the messages that is sent to it and a message is sent only to the vertex it is meant for.

For the task of solving the “1-hop neighbours" problem, an implementation using the BSP model has the following 3 iterations after which the model stops execution since there are no more messages:
  1. Each vertex sends to its neighbours a message about that edge of the form “src”-”dest”. The dominant cost in this operation is the cost of transferring the entire edges table once over the network.
  2. In the next iteration, a vertex broadcasts each message that it receives to every one of its neighbours. The dominant cost in this operation is the cost of transferring the output messages which are the tuples corresponding to the friends of friends.
  3. In the last iteration, each vertex aggregates over its messages the number of occurrences of each “src”, sets itself as the "dest" and constructs the output as the "src"-"dest"-count.

For a large graph, iterations 1 and 2 have nearly the same cost as steps 1 and 3 in the parallel database approach. Note that there is no join operation in the BSP model as vertex-to-vertex message routing is provided by the platform. Iteration 3 in the BSP approach carries out the aggregation operation but needs to only consider those rows that influence the output for that pair of users. 

Comparing at a deeper level the aggregation steps in the 2 approaches, Step 4 in the parallel database approach is aggregating over all rows. Each machine in the cluster executes the aggregation locally and needs to consider a memory structure (typically, a hash table) suitable to hold all pairs of vertices that occur in that machine. When the number of distinct pairs (and their counts) is likely to exceed available memory, the database engine decides to adopt a disk-based approach to carry out the aggregation. When that happens (which it does for large graphs), the same input data needs to be scanned multiple times. This translates to significant overheads compared to the aggregation in the BSP model.

In the BSP model, since each vertex is a thread of computation, iteration 3 carries out aggregation easily in memory scanning only those friends of friends tuples that vertex receives. The typical vertex uses up very little memory in carrying out a hash aggregation, exploiting the sparsity observation. For the extremely connected users, it is possible that the number of distinct pairs is far larger and so a disk-based aggregation would be required. Again, since each vertex’s aggregation operation operates on only its friends of friends, the disk-based aggregation would need to store, retrieve and scan multiple times only the tuples of that vertex. Even with putting together aggregations that happen across all vertices, the resources used in the BSP model are lesser than in the parallel database world.

In summary, the BSP model is equivalent to the parallel database approach in terms of network cost of data transfer, avoids the join processing cost almost completely, and minimises the cost of the aggregation step exploiting graph sparsity. For a large graph, even though network costs are significant, the data processing costs are significant too. Coupled with potential skew issues related to graph distribution over the database, the parallel database approach is less optimal than the BSP model for the "1-hop neighbours" problem.

A quick note on graph distribution and the potential problems of skew: the BSP model allows for much more flexibility in dealing with the problem of distributing real-world graphs. It is not to say that the skew problem disappears altogether but only that the flexibility allows modeling the distribution in such a way so as to minimise the impact of skew.   

A recent blogpost over at Intel's Science and Technology Center (ISTC) for Big Data presents results comparing performance of a special class of parallel databases (namely, a column-store) with that of a BSP model on two graph-processing problems of computing page rank and computing shortest paths. The post goes on to claim that the "1-hop neighbours" problem too is best solved in a database using SQL to express the solution. As we saw earlier in this post, though the claim of the SQL being easy to write is indeed true, it is far from obvious that databases would perform as well as a BSP model. Notwithstanding the fact that the performance in practice is not dependent on theory alone but also on the implementation of the systems (for examples of BSP model implementations that are faster than Apache Giraph, see GoFFish and Giraphx), for the “1-hop neighbours” problem on massive graphs, our past experience has in fact been to the contrary to what the post over at ISTC claims - namely, a database by itself is not the right vehicle for the "1-hop neighbours" problem as compared to the BSP model.

Using the right tool for the right job can go a long way in solving a problem, and that is very much true for data science as well. The relational model along with SQL has many advantages owing to the maturity of the industry. By exploiting specific properties of the problem (including number of vertices, number of edges, connectivity etc.), there are other approaches on parallel databases a data scientist could design for solving the problem of “1-hop neighbours” that could be faster than - for instance, using custom user defined functions and keeping a copy of the edges table on every machine in the cluster provided the graph is not too large. A big data scientist learns this by studying the properties of the problem and then prototyping and experimenting these variants.

No comments:

Post a Comment