Friday, 29 August 2014

Big Data and the Enterprise

Businesses that have defined a data strategy know that data is an integral part of the enterprise. There is a slew of enterprise standards for all data to adhere to, irrespective of whether the data is small or big, structured or unstructured, comes from sensors or websites or transactions, is housed in the holiest of data centres with the strictest of controls or is stored out in the wild, wild west of freshest-on-GitHub software. A successful data strategy not only meets the business needs but also incorporates the enterprise standards in a holistic manner.

With big data at the peak of its hype (see Gartner's latest hype cycle for emerging technologies), many companies have, unfortunately, eagerly jumped into adopting it without adequate consideration. Of course, the prime motivating factors for considering big data - including those discussed in an earlier post in these pages, Data as a Strategy - are usually in place and are not the subject of this post. It is the set of enterprise standards and requirements - typically in the background, but serving the crucial purpose of keeping the house in order - that gets glossed over in the excitement over the "new". In this post, let us look at the top requirements under the headline of enterprise standards that apply as much to big data as they do to business-critical data. Mature organisations with a thriving data strategy will hopefully not find any of the below surprising. For the rest of this post, I'll assume that the enterprise already has a data-focused implementation in place - in the form of a data warehouse or an all-purpose database, for instance - and also has enterprise standards to follow.

First, the question: "Why should an enterprise care about applying its corporate standards to new data that is not critical to the business?" In customer conversations, clickstream data frequently arises as an example of this new data. Suppose that clickstream data is collected and analysed at large scale on a big data technology in an R&D/labs environment. (Incidentally, the reasons a classic database would not be suited for such analysis are that the data is of unknown value to the business, has variable structure, and popular big data technologies, including Hadoop, allow storing such data at a significantly lower cost.) In the course of analysis, suppose insights of significant value about online customer buying behaviour are discovered, and the clickstream data consequently becomes critical for the business. Repeated extraction of that value requires that the clickstream data and the extraction process be put into production. The new data and the associated processes need to be "operationalised" (to use a coined term), and therefore have to adhere to the standards set by the enterprise. In other words, even if big data exploitation starts off as a project in an R&D/labs setting, when it starts adding value to the business it needs to be turned into an operational/production platform, so that the data hygiene the enterprise already has extends to encompass this new data. In a later post, we'll see how to effect this transition from labs to production in an effective manner.

Let us get back to the enterprise standards and their requirements of big data.

Topmost amongst the requirements is governance around this new data. Irrespective of the nature of the data, data governance is a critical requirement in all mature enterprises, since a lack of (adequate amounts of) governance runs the risk of breaking businesses. Data that is not governed when analysed can produce dubious results, leading to questionable business decisions that, in the worst case, can be catastrophic. Most companies that start down the road of big data without due consideration get this critical requirement awfully wrong. From our experience, one reason seems to be the misconception that governance processes require time and effort and introduce latencies, whereas that time is better spent doing the more "cool" activity of data science. Unless the big data project is intended to always remain in the labs environment, this is seriously faulty thinking. By the time a labs experiment is well and truly on its way, the governance cat (if I may) is already out of the bag. There is already some unknown amount of data duplication that has happened internally (and heaven forbid, externally too), some unrecorded number of unauthorised accesses (e.g. data scientist to outside-of-work friend: "see this awesome analysis I did on average spends by HNIs"), and a potpourri of ad-hoc scripts and execution logs that serve as the only documentation of how the data got in.

Security and the related topics of data audit and access audit are equally critical requirements for big data. Like governance, security demands a clear plan of action even before the first access to data happens. Otherwise, the risks to the enterprise are too great, especially in the world of big data, where there could be more to the data than meets the eye. Access audits demand a record of every interaction of every user with the data, and of the steps followed before and after data access. In most industries and countries, access audits are mandatory for legal compliance. Data audits, on the other hand, are required for compliance in some industries, like finance; in other industries, though not legally required, they are strongly recommended in order to maintain good hygiene. Data audits pertain to the record of how a particular piece of data came to be, by way of capturing all the steps in the data processing logic that preceded it, alongside the prior representations of that data, tracing all the way back to the source. In other words, data audit requires that a lineage be available for each and every portion of the data.
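To make the lineage requirement concrete, here is a minimal sketch, not a production design: a hypothetical `LineageLog` (the class and the dataset names are purely illustrative) in which each derived dataset records the transformation that produced it and the input datasets it came from, so that the full chain back to the sources can be reconstructed on demand.

```python
from dataclasses import dataclass, field


@dataclass
class LineageLog:
    """Minimal lineage record: dataset name -> (transformation, source datasets)."""
    steps: dict = field(default_factory=dict)

    def record(self, output, transformation, sources):
        # Capture how 'output' was produced and from which inputs.
        self.steps[output] = (transformation, list(sources))

    def trace(self, dataset):
        # Walk back from 'dataset' towards its ultimate sources,
        # collecting every recorded derivation step along the way.
        chain = []
        frontier = [dataset]
        while frontier:
            name = frontier.pop()
            if name in self.steps:
                transformation, sources = self.steps[name]
                chain.append((name, transformation, sources))
                frontier.extend(sources)
        return chain


# Illustrative dataset names (hypothetical): raw clickstream -> cleaned -> aggregated.
log = LineageLog()
log.record("clicks_clean", "drop malformed rows", ["clicks_raw"])
log.record("spend_by_segment", "aggregate by customer segment",
           ["clicks_clean", "crm_extract"])

for name, transformation, sources in log.trace("spend_by_segment"):
    print(f"{name} <- {transformation} <- {sources}")
```

Real deployments would of course use a metadata service rather than an in-memory dictionary, but the principle is the same: no derived dataset exists without a recorded derivation.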

The last, but by no means the least, of the critical requirements for big data is the integration of the new data and the big data technologies in use into the existing data ecosystem of the enterprise. Technology integration into the ecosystem involves making sure that the existing interfaces are supported, that the upstream and downstream tools are tested for compatibility and seamless execution with the new big data technologies, that applications on this new data can be implemented using existing tools, and that management and administration happen in the same way across all technologies. Preferably, none of the tasks listed would require procurement of yet more new technologies. Data integration into the existing ecosystem refers to rationalisation with existing metadata repositories, quality control, and the creation of new metadata repositories as required. Note that this requirement, coupled with the previous ones, implies that aspects like traceability need to be designed and implemented in a holistic manner that includes, but is not limited to, the big data technologies.

The phrase "operational big data platform", bringing together the two apparently conflicting terms "operational" and "big data", would probably have elicited a few smirks some time ago, but that is no longer the case. The enterprise should carefully plan and orchestrate big data projects with the same emphasis on standards relating to governance, security and ecosystem integration that it has in place for its mission-critical data, preparing for the eventuality that, with the right use of big data and data science, this new data becomes critical to the continued success of the business.

Thursday, 14 August 2014

Data as a Strategy

At The Data Team, we realize that "big data" and "data science" are hyped and over-used terms, whereas in reality organizations find it challenging to go beyond the initial hype and see the value. The main reasons are a lack of clarity on what to expect from "big data" and "data science", and the absence of a mature strategy to leverage data. In this post, we will demystify the term "big data" and then touch upon what constitutes "data as a strategy". The two concepts are closely related: the latter is the framework that leverages the former at the right time. In a subsequent post, we'll dissect the term "data science" and tie it back to data strategy as well.

Let us begin by seeing how the popular conceptions of "big data" fall short.

Big data is not about the Three V's. After all, large volumes have been handled by massively parallel processing architectures for a while now (for instance, by my ex-employer). Nor is it about velocity, since rapid ingestion of and action on data have been around since the time of transaction processing systems.

Big data is not about a use case. I have come across innumerable companies claiming to offer "big data products" or to "be" big data companies, whereas in fact most of them play in the social media/digital marketing space. Let me tell you that social media or digital marketing is probably not the first use case your company will solve with big data, since deriving value from social media requires reasonable-to-high penetration in various social media channels, the marketing maturity to take advantage of such engagements, and some legal clearances.

Big data should not be mistakenly equated to a specific technology. It is not a farm where all animals are equal and the elephant is more equal than the rest.

Yet the hype around big data is certainly justified. We postulate that this is because of the emphasis big data has placed on promoting a culture that uses data to further the business. This data culture demands of the organization the ability to allow anyone to analyze any data of any size using any (combination of) tools to serve business objectives. This data culture doesn't obey organizational boundaries like business and IT, is motivated by feedback and sharing internally and externally, doesn't shy away from large data sizes, and in fact thrives when challenged with frugality and complexity. Some of the tools the practitioners use have been around in the enterprise ecosystem for a while now, and some are relatively new. A select few are powered by research at the cutting edge of computer science (for example, deep learning).

We argue that this data culture is the fundamental disruption that big data has brought to the market. Not all companies need to analyse terabytes of information from day one. Companies might not need sophisticated data algorithms or the data scientists who write them. However, almost all companies have data, and that data, if used strategically, will impact the business. So, every B2B and B2C organisation needs to embrace this data culture in an evolutionary yet holistic manner. This is the process of "Data as a Strategy". A successful data strategy provides benefits that are immediate and revolutionary, and at the same time charts a roadmap for growth and further data-derived benefits by incorporating big data principles into its fold.