Sunday, 21 June 2015

The case for standardising data lake ingestion

The recent Hadoop Summit in San Jose brought out a few themes, some familiar and some hyperbolic. One indisputable observation was that conversations at the conference were no longer about "what" but about "how". Hadoop in particular and the big data ecosystem in general are not only game-changing but have been around for a while now. The market needs best practices, reference architectures and success stories of implemented solutions, not just slideware. In the first of a series targeted at enterprises that want to know how to adopt Hadoop as the data lake, this post highlights the topic of governance and makes the case for standardising ingestion into the data lake.

A previous post highlighted governance as one of the requirements that the enterprise needs from its data lake. In fact, there has been active mainstream discussion on data lake governance of late, an example being this O'Reilly webcast by a cybersecurity company, and more notably the acknowledgement by Hortonworks in their recent announcement of the Apache Atlas project. In this post, we argue that there is no better way to get governance right than by standardising ingestion into the data lake. By defining as standards what data can go into the data lake and how, we posit that data governance is realised with minimal effort. In contrast, the typical approach to using Hadoop has been to load the data in first and start analysing, implicitly deferring the question of "what needs to happen in production?" to a later time. While it is indeed cool that Hadoop supports "load first, ask questions later" approaches, that only works for data science experiments.

Indeed, there are more facets to the governance question, including who gets access to the data thus loaded and what those interfaces look like. However, decisions about data access can usually be taken and implemented on an as-needed basis after data is loaded into the data lake. On the data ingestion side, though, such just-in-time, or, worse, post-facto decision making is not conducive to guaranteeing key governance principles of data lineage, traceability, and metadata capture. It is absolutely necessary then to "fix" data governance at ingestion time for all enterprise data, and then proceed to define governed access patterns.

To start on the journey of standardising data lake ingestion, identify the sources from which data needs to be sourced and loaded onto the data lake. For a large enterprise, data comes from a host of different types of source systems including mainframes, relational databases, file systems, logs and even object stores. Fortunately, the Hadoop ecosystem comes with a myriad of data sourcing options that can pull data from all of these systems, from legacy mainframes to the latest NoSQL databases.

Typically, the particular source system also dictates the type of data that is made available (e.g. full dump, change data capture or CDC, incremental data) and its format (e.g. binary, JSON, delimited). Though they do not need to be standardised, the type of data and its format need to be taken into account in the design of the data ingestion pipeline.

Next, outline the ways in which the data would get to the data lake. Various data transport mechanisms are usually in play in a large enterprise (e.g. the data bus, file transfer clients). New paradigms like the direct, parallel, source-to-Hadoop transfer enabled by Sqoop are also available from the Hadoop ecosystem.

The third step is to define the type of processing to perform on the data in the ingestion pipeline, which usually depends on the business criticality of the data.

Business-critical data must meet data quality standards, adhere to specifications laid out by the data stewards/owners, and be processed into a business-consumable form. Ingestion of such data requires a processing framework that allows different data quality checks and validation rules to be deployed, while at the same time being able to consume data in its native form as made available by the earlier-defined standard sourcing methods. Business-critical data also has stricter SLAs and tighter ingestion windows.

Non-critical data is usually loaded into the data lake with a view to running data science experiments and discovering potential for improving the business. Though the rigour around such data is less than for business-critical data, there are still specifications to follow: verifying data quality at source, carrying out transformations, and choosing data formats for materialising processed data that allow easy, discovery-oriented access. SLAs for ingestion could still apply to non-critical data as well. What is more, non-critical and business-critical data coexist on the same data lake in a multi-tenant deployment model, resulting in the need for careful design and reuse of ingestion patterns.

In either case, it is necessary to ensure that the flexibility of data processing, a unique value proposition of Hadoop in comparison to traditional data warehouses, is not lost. So, the various steps in the processing pipeline need to be customisable without sacrificing governance properties.

As a final step in the standardisation process, identify all the points in the data ingestion pipeline from which metadata needs to be logged and lineage information needs to be captured. This serves a variety of purposes including satisfying audit and compliance requirements.
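
To make these steps concrete, here is a minimal sketch of what a configuration-driven ingestion specification and driver could look like. The field names and helper logic are illustrative assumptions made for this post, not the interface of any particular framework.

# Illustrative sketch only: the field names and structure are assumptions
# made for this post, not the configuration of any specific framework.
import json, datetime

INGESTION_SPEC = {
    "feed_name": "retail_pos_transactions",
    "source": {"type": "rdbms", "data_type": "cdc", "format": "delimited"},
    "transport": "sqoop",                  # could also be a file transfer or a message bus
    "criticality": "business-critical",    # drives SLAs and the rigour of checks
    "quality_checks": ["row_count_reconciliation", "not_null:txn_id"],
    "target": {"raw_path": "/data/raw/retail_pos", "processed_format": "orc"},
    "sla_minutes": 60,
}

def log_metadata(event, spec, **details):
    """Capture metadata and lineage at every step; here it is simply printed as JSON."""
    record = {"ts": datetime.datetime.utcnow().isoformat(),
              "feed": spec["feed_name"], "event": event, **details}
    print(json.dumps(record))

def run_ingestion(spec):
    """Drive a (pretend) pipeline entirely from the specification."""
    log_metadata("sourcing_started", spec, transport=spec["transport"])
    # ... pull data using the declared transport (Sqoop, file transfer, etc.) ...
    log_metadata("sourcing_completed", spec, landed_at=spec["target"]["raw_path"])
    for check in spec["quality_checks"]:
        # ... run the declared data quality check against the landed data ...
        log_metadata("quality_check_passed", spec, check=check)
    # ... transform and materialise into the processed, business-consumable form ...
    log_metadata("ingestion_completed", spec,
                 processed_format=spec["target"]["processed_format"])

run_ingestion(INGESTION_SPEC)

The point is that everything the governance function cares about - source, transport, quality checks, SLAs, metadata capture - is declared up front rather than buried in ad-hoc scripts.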

Rather surprisingly, for a technology ecosystem that has seen so much hype over the years, there has been little effort to define such a standardised ingestion framework. Part of the reason is surely the complexity associated with supporting all of the enterprise's data with all of the various ingestion patterns. Late last year, LinkedIn (a major power user of Hadoop) revealed the existence of Gobblin as an overarching ingestion framework in play for all of its data ingestion needs.

At The Data Team, we have engineered and built a standardised data lake ingestion framework for all of an enterprise's data, using completely open-source components, that serves the above needs. Moreover, the framework automates almost all of the common tasks typically associated with data lake ingestion using a configuration-driven approach. Emphasis is placed on convention over configuration to ease the pain associated with on-boarding new data onto the data lake. Metadata and lineage information is logged at every step in the ingestion pipeline. A standardised template of options is provided for each processing step so that new projects can easily choose what's applicable for a particular data ingestion and immediately on-board data in a governed manner. At the same time, since enterprise needs do vary, the design also allows new processing additions using custom code in a plug-and-play manner.
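
As a hedged illustration of the plug-and-play idea (the registry and the step signature below are assumptions made for this sketch, not our framework's actual interface), a custom processing step could be registered under a name and then referenced purely from configuration:

# Sketch of a plug-and-play processing step; the registry and signature are
# assumptions for illustration, not a specific framework's interface.
PROCESSING_STEPS = {}

def processing_step(name):
    """Register a custom processing step under a name usable from configuration."""
    def register(func):
        PROCESSING_STEPS[name] = func
        return func
    return register

@processing_step("mask_pan")
def mask_pan(record):
    """Example custom step: mask a card number field before landing the data."""
    if "pan" in record:
        record["pan"] = "XXXX-XXXX-XXXX-" + record["pan"][-4:]
    return record

# A pipeline defined purely by configuration can now reference "mask_pan" by name:
pipeline = ["mask_pan"]
record = {"txn_id": 42, "pan": "4111-1111-1111-1234"}
for step_name in pipeline:
    record = PROCESSING_STEPS[step_name](record)
print(record)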

Finally, as further validation of the need to standardise ingestion into the data lake, consider the fact that one of last year's more significant acquisitions in the big data space was for a technology that helped find what data already resides on a Hadoop cluster and generate metadata about it. Oh, inverted world! Apparently, such a configuration-driven ingestion into Hadoop is patented. Oh, cruel world!

Sunday, 15 February 2015

Guaranteeing exactly-once load semantics in moving data from Kafka to HDFS

In a previous post, the importance of fault-tolerance in the design of distributed systems and their applications was highlighted. This post continues in the same vein.

A good use of Hadoop in a mature enterprise is as an ETL platform to complement and offload from the warehouse. In this use, data is loaded in its raw form and the typically intensive transformation jobs are then migrated to Hadoop, both from the warehouse and from standalone ETL servers and tools. By designing transformations as standard, reusable processing patterns, significant cost savings can be obtained by the enterprise. More on this in a later post.

A key architectural principle in this Hadoop-as-an-ETL use case is that of exactly-once semantics for processing (in general) and writes (in particular) in the presence of failures. The principle refers to the problem of whether the data ingestion pipeline can guarantee that data is loaded into Hadoop exactly once, ruling out both data loss (as is possible with "at-most-once" write semantics) and data duplication (as is possible with "at-least-once" write semantics). If the ingestion guarantee is at-least-once, then downstream processing needs to perform data de-duplication, a task made easy or difficult by the properties of the input data being ingested. Of course, at-most-once semantics is not acceptable where critical data is involved. The problem discussed here is not as much of a concern for mainstream RDBMSs, where data writes are transactional.
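
As a hedged illustration of that de-duplication burden: if every record carries a stable unique key (an assumption that does not always hold), de-duplication downstream of an at-least-once pipeline can be as simple as keeping the first occurrence of each key. At scale the same idea would more likely be expressed as a GROUP BY in Hive than as the in-memory pass sketched below.

# Illustration only: assumes each ingested record carries a stable unique key,
# which is precisely the property that makes downstream de-duplication "easy".
def deduplicate(records, key_field="event_id"):
    """Keep the first occurrence of each key; later duplicates produced by an
    at-least-once ingestion pipeline are dropped."""
    seen = set()
    for record in records:
        key = record[key_field]
        if key not in seen:
            seen.add(key)
            yield record

batch = [{"event_id": 1, "v": "a"}, {"event_id": 2, "v": "b"}, {"event_id": 1, "v": "a"}]
print(list(deduplicate(batch)))   # the duplicate of event_id 1 is removed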

Consider a technology stack where data is being loaded into Hadoop from Kafka, an increasingly popular enterprise-wide message broker (see, for example, Loggly's explanation for why they chose Kafka). One approach for getting data from Kafka into Hadoop is to use Storm, a popular choice advocated by many practitioners. As we found out and confirmed with experts as well, despite the use of Trident, the Storm-HDFS connector does not provide a mechanism for ingesting Kafka data into HDFS that guarantees exactly-once write semantics. Let us see why.

The data pipeline includes the Storm-Kafka opaque transactional spout, which supports transactional semantics, and the Storm processing pipeline itself, which is fault-tolerant and supports exactly-once processing semantics through the Trident implementation. However, even the use of the Storm-HDFS Trident state fails to provide exactly-once semantics. Instead, what it provides is at-least-once write semantics. To understand this, let us dig through the behaviour.

Writing data into HDFS using the connector is controlled by a notion of batching that is configurable (either based on time or by file size). Once the batch threshold is reached, the bolt can be configured to rotate the file into another HDFS directory and continue writing to a new HDFS file. A Storm worker is responsible for writing into one HDFS file, with each bolt worker writing to a unique file. Until the destination file is rotated, though, the Storm worker keeps the file open for further writes. Since the target write batch size is different from the notion of a batch as implemented (opaquely) by the Trident API, in practice multiple Trident batches of data are written to the same HDFS file by each worker.

Consider a failure in the process of writing to an HDFS file by the Storm-HDFS Trident-based connector. Assume the failure is in the worker process itself. As is the guarantee provided by Trident, the worker is restarted by Storm and the Trident batch whose write failed is attempted again. Since a new worker is started, a new destination HDFS file is opened and written to with data from the just-failed Trident batch. The previous destination HDFS file is left behind as an orphan with no worker process owning it, so no explicit file close happens and the file is not rotated. Note that this orphaned file could be storing data from prior, successfully-written Trident batches. To avoid data loss, the orphaned file cannot be ignored but has to be included in downstream processing by a separate, out-of-band process. However, some of the data from the failed Trident batch could have been written to the orphaned file, while the entire batch is written again into the new HDFS file. In including the orphaned file in downstream processing, there is the chance that the same data appears twice. Thus, the Storm-HDFS connector provides at-least-once semantics, and orphaned files need to be handled explicitly by the downstream process to avoid data loss.

A second approach to loading data from Kafka into Hadoop would be to use the Hadoop MR API and write a Map function to consume data from Kafka. The function updates the Kafka offsets after data up to a threshold is consumed. The consumption offsets are stored in a fault-tolerant system that provides durable and atomic changes, like ZooKeeper (or even HDFS itself, since a file move is an atomic operation). If the MR job consuming from Kafka fails, it can simply be started again. This works because the Kafka offsets are updated only after the data pull into HDFS happens, and the changes to the offsets themselves are atomic. In retrying the same job, the data is pulled from Kafka from the same offset as before, ensuring the same data gets pulled in without data duplication or data loss. The Camus project from LinkedIn and the Kangaroo project from Conductor are implementations of this method. By controlling the size of the threshold, the batching effects of this approach can be controlled, though it is not yet suited to the near real-time loads (i.e. on the order of seconds) into HDFS that the Storm approach is capable of.
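
A minimal sketch of this pull-then-commit pattern follows. It illustrates the idea rather than Camus or Kangaroo themselves: the Kafka consumer and the HDFS write are placeholders, and a local file rename stands in for the atomic move that HDFS or ZooKeeper would provide.

# Sketch of the pull-then-commit pattern; the consumer and HDFS interactions are
# placeholders, and a local file rename stands in for HDFS's atomic move.
import json, os

OFFSETS_FILE = "offsets.json"          # in production: ZooKeeper or a file on HDFS

def load_offsets():
    if os.path.exists(OFFSETS_FILE):
        with open(OFFSETS_FILE) as f:
            return json.load(f)
    return {"topic-partition-0": 0}

def commit_offsets(offsets):
    # Write to a temp file and rename: the rename is the atomic "commit".
    tmp = OFFSETS_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(offsets, f)
    os.replace(tmp, OFFSETS_FILE)

def run_load(fetch_batch, write_target):
    offsets = load_offsets()
    start = offsets["topic-partition-0"]
    records, new_offset = fetch_batch(start)     # pull from Kafka starting at `start`
    write_target(records)                        # land the data first ...
    offsets["topic-partition-0"] = new_offset
    commit_offsets(offsets)                      # ... then commit offsets atomically

if __name__ == "__main__":
    demo = [("k1", "v1"), ("k2", "v2")]
    run_load(fetch_batch=lambda start: (demo[start:], len(demo)),
             write_target=lambda recs: print("landed", recs))

If the job dies before the commit, a rerun pulls the same records again from the old offset; as long as the landing write is idempotent (e.g. it overwrites the same target path), there is neither loss nor duplication.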

Thursday, 6 November 2014

Building Fault-Tolerant Big Data Applications

The world of big data is driven by open-source. The popularity of software such as Hadoop has shot through the roof over the last few years as more and more people equate "big data" to "Hadoop" (and its various incarnations). This has also led to bigger community participation in these projects and, consequently, rapid innovation. Witness the short (by enterprise software standards) 6-month release cycles for landmark features in the major versions of the Hortonworks Hadoop distribution (assuming it is a usable proxy for the innovation happening directly in the respective Apache projects). One of the flip-sides to all this quick innovation is unstable software.

True, the broader Hadoop community is helped by large corporations like Yahoo which provide testing grounds at scale. What is also true is that despite such functional and scale testing, buggy software still gets released and makes it into the packaged distribution that goes out of the door to the enterprises out there (heck, forget open-source software, this is true even for proprietary enterprise software). At The Data Team, one of our most recent trysts with buggy software was with Tez in Hive 0.13, which sometimes crashed and sometimes didn't for the same query executing on the same (nearly idle) cluster with the same settings.

Builders of big data applications that use open-source software need to bear in mind the important design principle that the builders of distributed systems software have long taken to heart: expect failure and design for it.

As Jeff Dean of Google famously said, "fault-tolerant software is inevitable". In fact, the design principle that sets apart building software for single machine use versus software for use across multiple machines is that failures are all too common and only become more so as the scale grows larger and larger.

Modern distributed systems (meaning those designed by the legions of Internet-era engineers that have been inspired by the systems they built and saw being built at companies like Google) are designed in a decoupled manner allowing various components to talk to each other using well-defined interfaces. These interface specifications are not tied to any particular platform choice or programming language or OS or geographic location. Fault-tolerance is built into each and every one of these components and at the interfaces of component interaction too.

When integrating big data technologies like Hadoop into the enterprise architecture, the same design principle of fault-tolerance should be applied to the use of the application interfaces exposed by the various open-source software on the Hadoop stack. Though Hadoop internally might be able to withstand a large number of common failures and still be alive to service data requests, the fault tolerance of the architecture as a whole is also very much dependent on the points of integration of Hadoop with the rest of the components in the enterprise. In other words, application integration needs to assume that a Hive query or a Hadoop MapReduce job might in fact fail at any given time. What's more, failures could be non-deterministic: a query that ran successfully yesterday could now fail for reasons unfathomable by any amount of machine automation.

In short, applications that use big data need to be designed for faults.

What does this imply for the enterprise? While the implications for the architecture need to be considered and designed for, there are also implications for the process and the people in the enterprise.

First, applications need to anticipate failures. This means capturing and relaying useful error messages that allow a reasonable amount of troubleshooting to happen in the common case, or, in the case of critical failures, shutting down gracefully. In particular, applications should not crash on the user unexpectedly.
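
As a sketch of what anticipating failure can look like at an integration point (the job callable and the parameter names below are placeholders, not any particular API), a submission wrapper with bounded retries, exponential backoff and useful error messages goes a long way:

# Sketch only: run_job is a placeholder for whatever launches the Hive query or
# MapReduce job; the wrapper illustrates the retry-and-report pattern, not an API.
import time, logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

def run_with_retries(run_job, job_name, max_attempts=3, base_delay_secs=30):
    """Retry a possibly non-deterministic failure a bounded number of times,
    then fail loudly with enough context for troubleshooting."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_job()
        except Exception as exc:
            logging.warning("job %s failed on attempt %d/%d: %s",
                            job_name, attempt, max_attempts, exc)
            if attempt == max_attempts:
                logging.error("job %s giving up after %d attempts; "
                              "check cluster logs before rerunning", job_name, max_attempts)
                raise
            time.sleep(base_delay_secs * 2 ** (attempt - 1))   # exponential backoff

# Example: a flaky placeholder job that succeeds on the second attempt.
attempts = {"n": 0}
def flaky_job():
    attempts["n"] += 1
    if attempts["n"] < 2:
        raise RuntimeError("transient container failure")
    return "ok"

print(run_with_retries(flaky_job, "daily_kpi_report", base_delay_secs=0))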

Second, businesses need to budget for, and build into their expectations, an occasional, unexplained delay in the delivery of insights. Where they are probably used to having reports of KPIs come in every morning on the hour, they now need to be relaxed about the eventuality that, every month or so, a report might not arrive until the afternoon.

Third, while it is common knowledge today that the skills required to exploit big data are in short supply, project managers and delivery managers also need to explicitly acknowledge the reality of unstable software and factor in the time and effort required to investigate, think up alternatives and implement them. Due to the nascent nature of the big data landscape, there will be precious little by way of precedent, and so interacting directly with the code contributors and designers of the software will prove to be the only feasible resolution path (with associated delays on account of time zone differences and how busy those contributors tend to be).

Fourth, the organization needs to anticipate a fair amount of dialogue between the business users and applications on the one side and the IT teams on the other, attempting to dig into why a certain deliverable was not achieved in time. Such dialogue could quickly degenerate into finger-pointing unless the right expectations are set up front. At one deployment, the IT team had locked down access to the master nodes and the logs of the various software components of Hadoop such that only the administrators could view these logs. Since the cluster management interface revealed precious little about the nature of failures and their potential causes, users were left irritated and confused, and had to keep going back to the administrators to troubleshoot for them, who in turn were made to work on tasks that had not been budgeted for.

Friday, 12 September 2014

Data Science at scale: the 1-hop neighbours problem

This post is technical. An understanding of how systems work and how algorithms are designed is recommended.

As a companion post to “Calling out the Big Data Scientist”, here is an illustration of the kind of decisions a data scientist would need to make when carrying out graph analytics with large data sets.

Consider building a recommendation engine for people/friends on a social network. There are many attributes that make up people recommendations, including common subjects of interest, places visited, universities attended together, workplaces shared, etc. One of the most important features for recommendations is the number of friends that the user and the potential candidate to be recommended have in common. Let's refer to the user as U and the recommended person as X. The number of friends that U and X have in common is in turn the number of intermediaries that U is connected to and that in turn connect to X. In graph jargon, it is the number of distinct paths that connect U and X through a single intermediate vertex. That's why this problem is also referred to as the "1-hop neighbours problem."

For ease of understanding and without loss of generality, assume that a typical user U has 10 connections, and those friends are typical users themselves in that they also each have 10 connections. Note that there is a good chance that a typical user could also be connected to a well-connected user (where well-connected refers to having a large number of connections, say 1000 or more), for instance a celebrity or a headhunter or a popular subject matter expert.

To solve the "1-hop neighbours” problem, a programmer’s approach could be to store the data of edges (i.e. who is connected to whom) in a file and then write some code. Clearly it is the code that is the complex part of this approach. We will not dig deep into this line of approach since it is brute-force and there are more readymade techniques available that will solve the problem for lesser effort and time. 

An RDBMS-oriented analyst could model the edge data as a relation, with each tuple having two columns - "src" and "dest" - possibly representing people as vertices with an identifier. There are other possible representations for graph data, but an edge-based representation turns out to work well in practice for large, real-world graphs.

In this RDBMS setup, the SQL query to calculate the number of common friends between every pair of users is as follows:
SELECT L1.src, L2.dest, count(1) as common_friends_count 
FROM  edges L1 INNER JOIN edges L2 ON (L1.dest=L2.src) 
WHERE L1.src <> L2.dest 
GROUP BY L1.src, L2.dest
In brief, the query joins the edges table (referred to as L1) with itself (referred to as L2, to disambiguate) so that for every user (i.e. L1's "src" column) we get his friend (i.e. L1's "dest" column for the corresponding tuple), and then uses the L2 copy of edges to get that friend's friends (i.e. L2's "dest" column for all tuples matching L1's "dest" column). Then, by counting the number of tuples that result from the join, we get the answer for every pair of users. We also need to avoid paths that loop back to the starting user; the WHERE clause filters those out.

The SQL query above performs well for a small graph. Unfortunately, it fails quite spectacularly on graphs of large real-world networks, for two reasons: 
  • real-world graphs are sparse (intuition: the number of people each of us typically knows is far, far smaller than the total number of people in the network worldwide); 
  • real-world graphs have network effects (meaning, the user U who has 10 friends can, by tapping into his friends' networks, reach at least 100 users, who in turn can tap into their friends' networks to allow U to reach at least 1000 users, and so on; since there are many well-connected individuals in real-world graphs, within a small number of hops a typical user can reach most of the network - this is the so-called "Six Degrees of Separation").

To appreciate the argument, let us first consider how we would execute this query on a large graph in a database. To start with, single-machine databases are not sufficient to carry out the analysis on a large graph. Even if the graph can be stored on disk, it is prohibitively expensive to have enough resources on a single machine to compute the answer. So, a parallel database, preferably with a shared-nothing, massively parallel processing architecture, needs to be considered. In a typical parallel database, data is usually laid out distributed by the hash of a particular column or set of columns - for the edge data of 2-tuples, there are only two choices, "src" or "dest", either of which is reasonable. Let us choose "src" as the distribution column. This implies that all edges from the same "src" are available together on a single machine. The various "src" values are spread out roughly uniformly across the entire cluster.

The distribution of the graph on a column across the parallel database is itself a problem. This is due to the presence of well-connected users. The specific machines that store the edge data for such well-connected users are assigned a much larger chunk of data than other machines, resulting in skew. For a parallel database, skew is a real performance killer. Let us set this aside and revisit the graph distribution issue later.

Now, consider the processing side of the argument. Let us get into the operations of a parallel database involved in the execution of this query:
  1.  In order to execute the join of the edges table with itself, the join needs to be executed locally on every machine. This requires having the data necessary for the join to be present locally. This in turn requires creating a copy of edges that is distributed on the “dest” column. The dominant cost in this operation is the cost of transferring the entire edges table once over the network.
  2. Each machine in the cluster then executes the join locally. The join produces significantly more tuples than the input tables due to the network effects of expanding from the circle of friends to the circle of friends of friends. The dominant cost here is the join itself, which can be executed using any of a number of well-known relational algebra join techniques like hash joins, sort joins, hybrid hash joins etc.
  3. Since the aggregation in the query happens on columns that are different from those used in the join condition, the output of the join will need to be redistributed. The dominant cost here is the cost of transferring the output table from the join over the network.
  4. The aggregation is carried out on the redistributed join output. The dominant cost depends on the algorithm used to perform the aggregation, which could be any of a number of well-known techniques like hash aggregations, sort aggregations, hybrid hash aggregations etc.

The join operation in step 2 is more expensive than it could be, considering the sparsity of the graph. The network effects present in real-world graphs produce a large amount of output, which adds to the latency of step 2 and has an even more significant impact on the aggregation operation. The aggregation operation, again, is more expensive than it needs to be for the same sparsity reasons. As an illustration of sparsity, even though network effects mean a user U could now reach 100 users in one hop starting from his 10 friends, notice that computing the number of friends user U has in common with any of those 100 users does not depend on the rest of the network.

By adopting a scalable approach and a platform that is optimised for processing large graphs, the "1-hop neighbours" problem becomes easier to solve and to get the performance right. The popularity of bulk synchronous parallel (BSP) models, or vertex-centric processing, is due to the benefits they provide for large-scale processing of real-world graphs. Notable among the implementations is Google's Pregel and its many clones, including Apache Giraph, Teradata Aster SQL-GR, and GraphLab. In this model, each vertex becomes an independent thread of computation that processes the messages sent to it by its neighbours and then sends messages, if any, to its neighbours. The BSP model guarantees that a vertex receives only the messages sent to it and that a message is delivered only to the vertex it is meant for.

For the task of solving the "1-hop neighbours" problem, an implementation using the BSP model has the following three iterations, after which the model stops execution since there are no more messages (a minimal sketch follows the list):
  1. Each vertex sends to each of its neighbours a message identifying the connecting edge, of the form "src"-"dest". The dominant cost in this operation is the cost of transferring the entire edges table once over the network.
  2. In the next iteration, a vertex broadcasts each message that it receives to every one of its neighbours. The dominant cost in this operation is the cost of transferring the output messages which are the tuples corresponding to the friends of friends.
  3. In the last iteration, each vertex aggregates over its messages the number of occurrences of each “src”, sets itself as the "dest" and constructs the output as the "src"-"dest"-count.
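
Here is the promised sketch: a toy, single-process simulation of the three iterations above. It mimics the vertex-centric message passing of the BSP model and is not written against any particular framework's API.

# A toy, single-process simulation of the three BSP iterations described above;
# it mimics vertex-centric message passing, not any specific framework's API.
from collections import Counter, defaultdict

edges = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "A"),
         ("B", "D"), ("D", "B"), ("C", "D"), ("D", "C")]   # undirected: both directions stored

neighbours = defaultdict(list)
for src, dest in edges:
    neighbours[src].append(dest)

# Iteration 1: every vertex sends each of its edges "src"-"dest" to the dest vertex.
inbox = defaultdict(list)
for src, dest in edges:
    inbox[dest].append(src)

# Iteration 2: every vertex rebroadcasts each received "src" to all of its neighbours,
# producing the friends-of-friends tuples.
inbox2 = defaultdict(list)
for vertex, messages in inbox.items():
    for src in messages:
        for dest in neighbours[vertex]:
            inbox2[dest].append(src)

# Iteration 3: every vertex counts, per "src", the messages it received,
# excluding paths that loop back to itself.
for dest, messages in inbox2.items():
    counts = Counter(src for src in messages if src != dest)
    for src, common in counts.items():
        print(src, dest, common)    # number of common friends between src and dest

On this toy graph the output reports, for example, that A and D have two friends in common (B and C), matching what the SQL query would produce.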

For a large graph, iterations 1 and 2 have nearly the same cost as steps 1 and 3 in the parallel database approach. Note that there is no join operation in the BSP model as vertex-to-vertex message routing is provided by the platform. Iteration 3 in the BSP approach carries out the aggregation operation but needs to only consider those rows that influence the output for that pair of users. 

Comparing the aggregation steps in the two approaches at a deeper level: step 4 in the parallel database approach aggregates over all rows. Each machine in the cluster executes the aggregation locally and needs a memory structure (typically, a hash table) suitable to hold all pairs of vertices that occur on that machine. When the number of distinct pairs (and their counts) is likely to exceed available memory, the database engine adopts a disk-based approach to carry out the aggregation. When that happens (which it does for large graphs), the same input data needs to be scanned multiple times. This translates to significant overheads compared to the aggregation in the BSP model.

In the BSP model, since each vertex is a thread of computation, iteration 3 carries out the aggregation easily in memory, scanning only those friends-of-friends tuples that the vertex receives. The typical vertex uses very little memory in carrying out a hash aggregation, exploiting the sparsity observation. For extremely well-connected users, it is possible that the number of distinct pairs is far larger, and so a disk-based aggregation would be required. Again, since each vertex's aggregation operates only on its own friends of friends, the disk-based aggregation needs to store, retrieve and scan multiple times only the tuples of that vertex. Even putting together the aggregations that happen across all vertices, the resources used in the BSP model are less than in the parallel database world.

In summary, the BSP model is equivalent to the parallel database approach in terms of the network cost of data transfer, avoids the join processing cost almost completely, and minimises the cost of the aggregation step by exploiting graph sparsity. For a large graph, even though network costs are significant, the data processing costs are significant too. Coupled with the potential skew issues related to graph distribution over the database, the parallel database approach is less suitable than the BSP model for the "1-hop neighbours" problem.

A quick note on graph distribution and the potential problems of skew: the BSP model allows for much more flexibility in dealing with the problem of distributing real-world graphs. That is not to say the skew problem disappears altogether, only that the flexibility allows modeling the distribution in such a way as to minimise the impact of skew.

A recent blogpost over at Intel's Science and Technology Center (ISTC) for Big Data presents results comparing the performance of a special class of parallel databases (namely, a column-store) with that of a BSP model on two graph-processing problems: computing PageRank and computing shortest paths. The post goes on to claim that the "1-hop neighbours" problem, too, is best solved in a database using SQL to express the solution. As we saw earlier in this post, though the claim that the SQL is easy to write is indeed true, it is far from obvious that databases would perform as well as a BSP model. Notwithstanding the fact that performance in practice is not dependent on theory alone but also on the implementation of the systems (for examples of BSP model implementations that are faster than Apache Giraph, see GoFFish and Giraphx), for the "1-hop neighbours" problem on massive graphs our past experience has in fact been contrary to what the post over at ISTC claims - namely, a database by itself is not the right vehicle for the "1-hop neighbours" problem as compared to the BSP model.

Using the right tool for the right job can go a long way in solving a problem, and that is very much true for data science as well. The relational model along with SQL has many advantages owing to the maturity of the industry. By exploiting specific properties of the problem (including the number of vertices, the number of edges, connectivity etc.), there are other approaches on parallel databases that a data scientist could design for solving the "1-hop neighbours" problem that could be faster than the straightforward SQL above - for instance, using custom user-defined functions and keeping a copy of the edges table on every machine in the cluster, provided the graph is not too large. A big data scientist learns this by studying the properties of the problem and then prototyping and experimenting with these variants.

Friday, 5 September 2014

Data science at scale - calling out the "Big Data Scientist"

"Data science" is a popular term and one in the ascendancy in Gartner's Hype Cycle for Emerging Technologies 2014. It has multiple meanings depending on whom you ask. One way to deal with subjective interpretations is to crowdsource the answer and pick the popular interpretations, provided there is enough data. Recently, a data scientist (who else?) at LinkedIn attempted to define the term "data scientist" using data from profiles of people that carry the phrase "data scientist" across its network. His results are available in a small post over at LinkedIn's page. Unusually for a data scientist, the author doesn't provide any quantified data at all, whereas I would have expected to see at least the number of profiles analyzed, the popularity scores and the strength of the relationships between the terms or the popularity scores for skills. Without numbers, there isn't a whole lot of interpretation that outsiders like me can do. Looking at the information qualitatively, the set of data scientists in the LinkedIn network seems to be distinctly tilted towards "small data" analysis as opposed to "large data" analysis. I gauge this from two indicators: (a) the absence from the "Most popular skills" table of those skills typically associated exclusively with large data analysis; (b) the small sizes of the bubbles of these large data-focused skills and the lack of any strong connections (look at the higher resolution image in that post) from any of these to the popular "small data" skill bubbles.

Does this mean that the majority of data practitioners are “small data” scientists? Where are the “Big Data Scientists” (a portmanteau of “big data” and “data scientist”) and what sets them apart?

As that post and many others delineate, a good data scientist has mastery over a breadth of techniques, the tools that encode these techniques, and the domain knowledge that provides the extra oomph to the results. As aids, the tools - be they statistical or visualization in nature - provide algorithms and implementations of techniques out-of-the-box that are then used as deemed fit for the data problem at hand. The tools themselves do not provide readymade solutions to the problem; it is the data scientist who knows which tools and which techniques to use given the nature of the data, the type of problem being addressed and the targets to be achieved, if any. It is no wonder then that data science is sometimes referred to as "art", with the practitioners commanding a premium.

Data science at scale is a completely different beast from data science on a single machine. Data analysis on a single machine is itself hard, but data analysis at scale typically challenges fundamentals that are often taken for granted. Take the problem of sorting. It is one of the first problems introduced in an algorithms course in a computer science curriculum, and how to sort data is well understood. However, when the data being sorted is larger than the memory available in a machine, a different algorithm is required. Let's call this the single-machine (external) algorithm, while the textbook algorithms could be classified as main-memory algorithms. When the data becomes even larger and no longer fits within a single machine, the previous algorithms do not suffice and yet another design is required. These could be called distributed sorting algorithms. Sorting at massive scale is a problem class in itself and has a dedicated big data benchmark too (look for "Gray").
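
To make the distinction concrete, here is a hedged sketch of the single-machine (external) variant: sort chunks that fit in memory, spill each sorted run to disk, and stream-merge the runs. Distributed sorting builds on the same idea, adding a partitioning step that routes ranges of keys to different machines.

# Sketch of an external (out-of-core) sort: sort chunks that fit in memory,
# spill each sorted run to disk, then stream-merge the runs with a heap.
import heapq, os, tempfile

def external_sort(numbers, chunk_size=1000):
    """Sort an iterable that is too large for memory (simulated by chunk_size)."""
    run_paths = []

    def spill(chunk):
        # Phase 1: sort a chunk that fits in memory and write it out as a sorted run.
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{x}\n" for x in sorted(chunk))
        run_paths.append(path)

    chunk = []
    for x in numbers:
        chunk.append(x)
        if len(chunk) == chunk_size:
            spill(chunk)
            chunk = []
    if chunk:
        spill(chunk)

    # Phase 2: stream the sorted runs back and k-way merge them with a heap.
    run_files = [open(path) for path in run_paths]
    merged = list(heapq.merge(*((int(line) for line in f) for f in run_files)))
    for f, path in zip(run_files, run_paths):
        f.close()
        os.remove(path)
    return merged

print(external_sort([5, 3, 9, 1, 7, 2, 8], chunk_size=3))   # [1, 2, 3, 5, 7, 8, 9]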

Sorting is a very simple problem in the world of big data. There are complicated ones as well, like machine learning at scale. In all, I would argue that the most challenging aspect of being a "Big Data Scientist" is to know when to use which data analysis approach (e.g. clustering versus classification) and which technique for each (e.g. k-means for clustering), and to also know how the algorithms are designed. Knowledge of the internals of algorithms comes in handy when designing a distributed version of the same algorithm that performs well on massive data. This crucial task of having to not only know "data science" but also be able to design and implement the algorithms to run on massive data is what really sets apart a "Big Data Scientist" from a "small data" scientist.

In the last couple of years, there has been a steady stream of software packages offering big data-enabled algorithms out-of-the-box. Open-source packages in the Hadoop stack include the popular Mahout and the newer Spark MLlib, to name a few. If you do not subscribe to the Hadoop architecture, GraphLab built using MPI can be executed standalone. Amongst the really few proprietary packages offering massive-scale algorithms out-of-the-box, Teradata Aster is a great example, and I cut my teeth in big data by contributing massive-scale algorithms to its analytic foundation library. These software packages make the transition from a “small data” scientist to a “Big Data Scientist” easier, but talk to any expert statistician and you’ll know that the coverage of the required breadth of algorithms and techniques is still poor.

O'Reilly Media conducted a survey of data scientist salaries across Strata editions in 2012 and 2013. That report is a better example of presenting data about data scientists (the report calls them data professionals since not all individuals wear the "data scientist" tag). Parts of the survey, especially those about proprietary tool usage, are not that useful since the majority of the audience at Strata tend to be the open-source-kool-aid-consuming types and the survey sample is therefore biased. The size and the geographic variance of the audiences at Strata are also necessarily smaller than what LinkedIn could potentially see in its data of the world. Nevertheless, the O'Reilly survey also reinforces the point in this post that the portmanteau role of "Big Data Scientist" is a rare combination and commands a premium over even the "small data" scientist.

Friday, 29 August 2014

Big Data and the Enterprise

Businesses that have defined a data strategy know that data is an integral part of the enterprise. There are a slew of enterprise standards for all data to adhere to, irrespective of whether the data is small or big, structured or unstructured, comes from sensors or websites or transactions, is housed in the holiest of data centres with the strictest of controls or is stored out in the wild, wild west of freshest-on-github software. A successful data strategy not only meets the business needs but also incorporates the enterprise standards in a holistic manner.

With big data at the peak of its hype nowadays (see Gartner's latest hype cycle for emerging technologies), many companies have unfortunately jumped eagerly into its adoption without adequate consideration. Of course, the prime motivating factors for considering big data - including those discussed in an earlier post in these pages, Data as a Strategy - are usually in place and are not the subject of this post. It is the set of enterprise standards and requirements, which typically sit in the background but serve the crucial purpose of keeping the house in order, that gets glossed over in the excitement over the "new". In this post, let us look at the top requirements that fall under the headline of enterprise standards and that apply as much to big data as they do to business-critical data. Mature organisations with a thriving data strategy will hopefully not find any of the below surprising. For the rest of this post, I'll assume that the enterprise already has a data-focused implementation in place - in the form of a data warehouse or an all-purpose database, for instance - and also has enterprise standards to follow.

First, the question "Why should an enterprise care about applying its corporate standards to new data that is not critical to the business?" In customer conversations, clickstream data arises frequently as an example of this new data. Suppose that clickstream data is collected and analysed in large scale in a big data technology in a R&D/labs environment. Incidentally, the reasons why a classic database would not be suited for such analysis is that the data is of unknown value to the business, has variable structure and popular big data technologies including Hadoop allow storing such data at a significantly lower cost. In the course of analysis, suppose there are insights of significant value about online customer buying behaviour that have been discovered and the clickstream data consequently becomes critical for the business. The repeated extraction of that value needs the clickstream data and the process of extraction be put into production. The new data and the associated processes need to be “operationalised” (to use a coined term), and therefore should have to adhere to the standards set by the enterprise. In other words, even if big data exploitation starts off as a project in an R&D/labs setting, when it starts adding value to the business, it would need to be turned into an operational/production platform in order to extend the data hygiene that the enterprise already has to encompass this new data. In a later post, we'll see how we can effect this transition from labs to production in an effective manner.

Let us get back to the enterprise standards and their requirements of big data.

Top most amongst the requirements is governance around this new data. Irrespective of the nature of the data, data governance is a critical requirement in all mature enterprises, since a lack of (adequate amounts of) governance runs the risk of breaking businesses. Data that is not governed can, when analysed, produce dubious results leading to a questionable business decision that, in the worst case, can be catastrophic. Most companies that start down the road of big data without due consideration get this critical requirement awfully wrong. From our experience, one reason for this seems to be the misconception that governance processes require time and effort and introduce latencies, whereas that time is better spent doing the more "cool" activity of data science. Unless the big data project is intended to always remain in the labs environment, this is seriously faulty thinking. By the time a labs experiment is well and truly on its way, the governance cat (if I may) is already out of the bag. There is already some unknown amount of data duplication that has happened internally (and, heavens forbid, externally too), some unrecorded number of unauthorised accesses (e.g. data scientist to outside-of-work friend: "see this awesome analysis I did on average spends by HNIs"), and a pot-pourri of ad-hoc scripts and execution logs that serve as the only documentation of how the data got in.

Security and the related topics of data audit and access audit are equally critical requirements for big data. Like governance, security demands a clear plan in action even before the first access to data happens. Otherwise, the risks to the enterprise are too great, especially in the world of big data where there could be more to the data than meets the eye. Access audits demand a record of every interaction of every user with the data and the steps followed before and after data access. In most industries and countries, access audits are mandatory for legal compliance. Data audits, on the other hand, are required for compliance in some industries like finance; in other industries, though not legally required, they are strongly recommended in order to maintain good hygiene. Data audits pertain to the record of how a particular piece of data came to be, by capturing all the steps in the data processing logic that produced it, alongside the prior representations of that data tracing all the way back to the source. In other words, data audit requires that a lineage be available for each and every portion of the data.

The last, but by no means the least, of the critical requirements for big data is the integration of the new data and the big data technologies in use into the existing data ecosystem of the enterprise. Technology integration into the ecosystem involves making sure the existing interfaces are supported, the upstream and downstream tools are tested for compatibility and seamless execution with the new big data technologies, applications on this new data can be implemented using existing tools, and management and administration happen in the same way for all technologies. Preferably, none of the tasks listed would require the procurement of yet more new technologies. Data integration into the existing ecosystem refers to rationalisation with existing metadata repositories, quality control, and the creation of new metadata repositories as required. Note that this requirement, coupled with the previous requirements, implies that aspects like traceability need to be designed and implemented in a holistic manner that includes, but is not limited to, the big data technologies.

The phrase "operational big data platform”, bringing together two apparently conflicting terms “operational” and “big data”, would probably have elicited a few smirks some time ago but that is no longer the case now. The enterprise should carefully plan and orchestrate big data projects with the same emphasis on standards relating to governance, security and ecosystem integration that they have in place for their mission-critical data, preparing for the eventuality that their new data becomes critical to the continued success of their business with the right use of big data and data science.

Thursday, 14 August 2014

Data as a Strategy

At The Data Team, we realize that "big data" and "data science" are hyped and over-used terms, whereas in reality organizations find it challenging to go beyond the initial hype and see the value. The main reasons are a lack of clarity on what to expect from "big data" and "data science", and the absence of a mature strategy to leverage data. In this post, we will demystify the term "big data" and then touch upon what constitutes "data as a strategy". The two concepts are related, so much so that the latter is the framework that leverages the former at the right time. In a subsequent post, we'll dissect the term "data science" and tie it back to data strategy as well.

Let us begin by seeing how the popular conceptions of "big data" fall short.

Big data is not about the Three V's. After all, large volumes have been handled by massively parallel processing architectures for a while now (for instance, by my ex-employer). It is not about velocity, since rapid ingestion of and action on data has been around since the time of transaction processing systems.

Big data is not about a use case. I have come across innumerable companies claiming to offer "big data products" or "be" big data companies whereas in fact most of them play in the social media/digital marketing space. Let me tell you that social media or digital marketing is probably not the first use case your company will be solving with big data, since deriving value from social media requires reasonable-to-high penetration in various social media channels, marketing maturity to take advantage of such engagements, and some legal clearances.

Big data should not be mistakenly equated to a specific technology. It is not a farm where all animals are equal and the elephant is more equal than the rest.

The hype around big data is certainly justified. We postulate that this is because of the emphasis big data has placed on promoting a culture that uses data for furthering business. This data culture demands of the organization the ability to allow anyone to analyze any data of any size by using any (combination of) tool to serve business objectives. This data culture doesn’t obey organizational boundaries like business and IT, is motivated by feedback and sharing internally and externally, doesn’t shy away from large data sizes, and in fact thrives when challenged with frugality and complexity. Some of the tools the practitioners use have been around in the enterprise ecosystem for a while now, and some are relatively new. A select few are powered by research at the cutting edge of computer science (for example, deep learning).

We argue that it is this data culture that is the fundamental disruption that big data has brought to the market. Not all companies have the need to analyse terabytes of information from day one. Companies might not need sophisticated data algorithms or the data scientists that write them. However, almost all companies have data, and that data, if used strategically, will impact the business. So, every B2B and B2C organisation needs to embrace this data culture in an evolutionary yet holistic manner. This is the process of "Data as a Strategy". A successful data strategy provides benefits that are immediate and revolutionary, and at the same time charts a roadmap for growth and further data-derived benefits by incorporating big data principles into its fold.