While there is relatively little to knock cloud from its hype perch, among web startups, BI and data geeks, the emergence of Big Data has become a game changer. Itâ€™s analytics and operational intelligence gone extreme.
Big Data typically is associated with obscene amounts of data â€“ the scale blows away anything that most enterprises would maintain within their core back end business systems. We’re talking hundreds of terabytes or even petabytes.
Itâ€™s because Internet giants Google, Yahoo, Facebook, Amazon, and others had to roll their own technologies to deal with magnitudes of data far beyond conventional wisdom of what was possible with enterprise systems. What makes the conversation interesting is that this technology is on the cusp of entering the enterprise mainstream today. Itâ€™s not just a matter of technology looking for a problem. When Facebook needs to understand how its 500 million members update their walls, share photographs, and have conversations, itâ€™s because (1) it needs to optimize its IT infrastructure to support how its members use the site, but more importantly (2) it needs to understand more about its members so it can sell advertising.
And when Facebook makes its API publicly available, that same issue becomes a critical for any marketer that is B2C. And as the technology becomes available, suddenly there are downstream uses in capital markets for conducting brute force analyses on trading positions, healthcare providers for understanding outcomes, homeland security for controlling borders, metropolitan entities seeking to manage congestion pricing, life sciences organization seeking to decipher clinical studies, mobile carriers seeking to prevent or minimize customer churn, and so on.
There are a couple technology and market paths that have opened for contending with Big Data. There are Advanced SQL analytic database providers that have adapted SQL for structured data through strategies such as reducing indexing, introducing new forms of data compression and query optimization, columnar architectures, and embedding analytics and data transformation directly into the data engine to minimize data movement; in some cases, they have developed optimized appliances. Weâ€™re talking about the Aster Datas, Greenplums, Netezzas, ParAccels, and Verticas of the world â€“ and players like Teradata that invented big data warehousing, Oracle that has extended it, and Sybase which acquired the first column-oriented database. Business has obviously picked up here; IBM, EMC, Teradata, and HP have all made acquisitions in this space over the past 12 months.
But the Facebooks and Googles of the world werenâ€™t dealing with structured data in the enterprise sense â€“ they are contending with web log files, document APIs, rich media files, and so on. They are dealing with data whose structure and volume is so varied and huge that there is no time to model it and form a schema; they need to just load the data into the file system and then analyze it. That spawned the NoSQL movement â€“ initially a focus on technologies that avoided the overhead and scalability limits of SQL.
Until now, neither Google, Yahoo, or Facebook considered themselves in the tools or database business. So they released the fruits of their innovation as open source, with one of the best known projects being Apache Hadoop. Hadoop is a family of projects that includes a distributed file system, the MapReduce framework that parcels out massively parallel computing jobs across a cluster plus a number of other frameworks, file systems, and utilities.
Whatâ€™s kind of fascinating is the almost incestuous relationship between these NoSQL projects. Hadoop, developed at Yahoo was descended from the Google File System that in turn was developed for Google BigTable; the same was true for Cassandra, another NoSQL file system. Meanwhile, Facebook develops Hive, a relational-like table structure designed to work with Hadoop. You get the picture.
Cloudera has stepped to the forefront in commercializing Hadoop technology and applying MapReduce. Using a Red Hat-like business model, it offers support, several open source extensions, plus an enterprise edition that adds a number of proprietary monitoring and management features. It has distinguished itself with forging partnerships with almost every major BI and data warehousing player except one â€“ IBM. the highlights are its relationships with Informatica, for data transformation, and MicroStrategy, which provides a data mart strategy designed to complement Hadoop. And it has garnered roughly 75 enterprise paying customers in a market segment that has barely commercialized.
In the long run, we also expect IBM to make a stab at Hadoop and related technologies by extending its InfoSphere offerings -â€“ it can see Cloudera-Informatica and Cloudera-MicroStrategy raise it one with its own InfoSphere DataStage and Cognos offerings, before it even talks about partnerships. Today we saw a shot from left field â€“ Yahoo which invented the technology â€“ is now saying it might spin off its Hadoop business to go up against Cloudera, and potentially IBM. In a way, its closing the doors after the horses left the barn as the creator of Hadoop is now part of Cloudera.
Clearly there will be a market for NoSQL technologies in the quest for Big Data, although for now, they require sufficient specialized skills that they are not for the faint of heart. that is, if you can find any Hadoop and MapReduce programmers who haven’t already bee scarfed up by Amazon, Zynga, or JP Morgan Chase. That market will not necessarily be in competition with Advanced SQL as there are different use cases for each. And in fact, there will likely be a blending of the technologies in the long run. Today, many Advanced SQL platforms are already extending support for MapReduce, and in the long run, we expect that SQL-like technologies in the NoSQL space like Hive or HBase will themselves be made more accessible to the huge base of SQL developers.
But we digress.
For Yahoo, this would clearly be a shot out of its comfort zone, as it is not a tools company. But it is hungry for monetizing its intellectual property, even if that property has already been open sourced. Itâ€™s redolent of Sun striving to monetize Java and we all know how that went. Obviously this will be an uphill battle for Yahoo, but at least this would be a spinoff so hopefully there won’t be distractions from the mother ship. Given Yahooâ€™s fortunes, we shouldnâ€™t be surprised that they are now looking to maximize what they can get out of the family jewels.