Thursday, 6 November 2014

Building Fault-Tolerant Big Data Applications

The world of big data is driven by open-source. Popularity of software such as Hadoop has shot through the roof over the last few years as more and more people equate "big data" to "Hadoop" (and its various incarnations). This has also led to a bigger community participation in these projects and, consequently, rapid innovations. Witness the short (by enterprise software standards) 6-month release cycles for landmark features in the major versions of the Hortonworks Hadoop distribution (assuming it is a usable proxy in place of the innovations happening directly in the respective Apache projects). One of the flip-sides to all this quick innovation is unstable software.

True, the broader Hadoop community is helped by large corporations like Yahoo which provide testing grounds at scale. What is also true is that despite such functional and scale testing, buggy software still gets released and makes it to the packaged distribution that goes out of the door to the enterprises out there (heck, forget open-source software, this is true even for proprietary enterprise software). At the Data Team, one of our most recent trysts with buggy software was with Tez in Hive 0.13 that sometimes crashed and sometimes didn't for the same query executing on the same (nearly idle) cluster having the same settings.

Builders of big data applications that use open-source software need to bear in mind the important design principle that the builders of distributed systems software have long taken to heart: expect failure and design for it.

As Jeff Dean of Google famously said, "fault-tolerant software is inevitable". In fact, the design principle that sets apart building software for single machine use versus software for use across multiple machines is that failures are all too common and only become more so as the scale grows larger and larger.

Modern distributed systems (meaning those designed by the legions of Internet-era engineers that have been inspired by the systems they built and saw being built at companies like Google) are designed in a decoupled manner allowing various components to talk to each other using well-defined interfaces. These interface specifications are not tied to any particular platform choice or programming language or OS or geographic location. Fault-tolerance is built into each and every one of these components and at the interfaces of component interaction too.

When integrating big data technologies like Hadoop into the enterprise architecture, the same design principle of fault-tolerance should be applied to the use of the application interfaces that are exposed by the various open-source software on the Hadoop stack. Though Hadoop internally might be able to withstand a large number of common failures and still be alive to service data requests, the fault tolerance of the architecture as a whole is also very much dependent on the the points of integration of Hadoop with the rest of the components in the enterprise. In other words, application integration needs to assume that a Hive query or a Hadoop MapReduce job might in fact fail at any given time. What's more, failures could be non-deterministic, in that where a query ran successfully yesterday, the same query could now fail for reasons unfathomable by any amount of machine automation.

In short, applications that use big data need to be designed for faults.

What does this imply for the enterprise? While the implications on the architecture need to be considered and designed for, there are also implications for the process and the people in the enterprise.

First, applications need to anticipate failures. This means having to capture and relay useful error messages that allow a reasonable amount of troubleshooting to happen in the common case, or in the case of critical failures shut down gracefully. In particular, they should not be crashing on the user unexpectedly.

Second, businesses should need to budget for and build into their expectations an occasional, unexplained delay in delivery of insights. Where they are probably used to having reports of KPIs come in every morning on the hour, they now need to be relaxed about the eventuality that every month or so, a report might not arrive until the afternoon.

Third, while it is common knowledge today that the skills required to exploit big data are lesser than the supply, project managers and delivery managers need to also explicitly acknowledge the reality of unstable software and factor in the time and effort required to investigate, think up alternatives and implement them. Due to the nascent nature of the big data landscape, there will be precious little by way of precedence and so interacting directly with the code contributors and designers of the software will prove to be the only feasible resolution path (with associated delays accounting for time zone differences and the busy-bodies that coders tend to be).

Fourth, the organization needs to anticipate that there would be a fair amount of dialogue now between the business users and applications on the one side and the IT teams on the other side that attempt to dig into why a certain deliverable was not achieved in time. Such dialogue could quickly degenerate into finger-pointing unless the right expectations are set up front. At a deployment, the IT team had locked down access to the Master nodes and the logs of the various software components of Hadoop such that only the administrators could view these logs. Since the cluster management interface revealed precious little about the nature of failures and potential causes, users were left irritated and confused, and moreover had to put up with the process of having to go back to the administrators to troubleshoot for them who in turn were made to work on tasks that were not budgeted for.