Hadoop Ecosystem Starts Crystallizing

What a difference a year makes. A year ago, Big Data was an abstract concept left to the domain of a bunch of niche players and open source groups. Over the next 9 months, the Advanced SQL space dramatically consolidated as EMC, IBM, HP, and Teradata made their moves. In the past 3 months, it’s been Hadoop’s turn.

We’ve seen Yahoo flirt with the idea of setting up its response to Cloudera and IBM with its own Hadoop support company, while EMC announced ambitious but ambiguous plans to – choose your term – extend or fork Hadoop. After a series of increasingly vocal hints, IBM has placed its cards on the table, while Informatica has fleshed out its plans for civilizing NoSQL data.

IBM’s InfoSphere BigInsights productizes what IBM has been talking about for months and spelled out at its Big Data analyst summit held at its Yorktown lab (yup, the place where Watson played Jeopardy). IBM is offering three things: a free core edition, which includes a distribution of Hadoop with the HDFS file system and MapReduce, plus integration with DB2; paid support; and an enterprise edition that adds indexing, integrated text analytics, access control security features, the requisite administrative console, and a development studio based around Jaql, a SQL-like query language developed by Google that takes elements of Hive and Pig and targets JSON (the data objects of JavaScript).

Unlike EMC, which hedged its words when describing whether it would support Apache Hadoop, IBM came down clearly on the side of aligning its effort with the Apache projects. We shouldn’t be surprised, as IBM gave Yahoo’s VP of Hadoop development, Eric Baldeschwieler, the soapbox at its analyst event to plead for Hadoop not to be forked into competing technology implementations.

Informatica in turn fleshed out its big data support, which was the highlight of its 9.1 platform release being announced today. While Informatica already provides the ability to extract data from Hadoop for ETL to SQL data warehouses, the 9.1 release adds new adapters for the social networks LinkedIn, Twitter, and Facebook, along with new capabilities to connect to call detail records and image files as part of its B2B unstructured data exchange offering. More importantly, whereas before Informatica PowerCenter could only extract data from Hadoop, now it can feed data back in, providing another path for tapping the power of MapReduce for processing that might not otherwise be easily supported in your relational data warehouse.

This is the start of the taming of the so-called “unstructured” data that populates NoSQL; in actuality, most of this data has structure, much of which has yet to be defined. Informatica’s release of social network adapters targets the lowest hanging fruit, as social media sentiment analysis has become one of the most popular use cases for building data warehouses on steroids. It couples well with text analytics, which was one of the BI market’s first forays outside the transaction world. But there are many other NoSQL data types awaiting some form of structural definition, such as sensory, graph, or rich media metadata (some of this could leverage text parsing capabilities).

It’s still early days for commercialization of tooling for big data; while 2010 was the year that major database and platform players discovered Advanced SQL, 2011 is the year they have begun turning their attention to NoSQL. You can see the head start on the Advanced SQL side, where the use cases are pouring out. For NoSQL, and more specifically Hadoop, commercialization moves are just the first steps, as Jim Kobielus points out.

Hadoop itself is a fairly complex ecosystem of Apache projects; saying that you support Hadoop is not the same as saying that you support Linux, because Hadoop lacks Linux’s singular nature. And different pieces of Hadoop are interchangeable: for instance, you can swap out its HBase table system for Cassandra or Cloudbase if you want something more interactive.

For now there is an infatuation with Hadoop, but work remains to be done for vendors to lift from customers the burden of integrating the disparate pieces.

Furthermore, the technology use cases are only starting to be fleshed out for what to use where. Inevitably this will lend itself to a solutions approach, rather than a raw database tools approach, for the more popular use cases such as instant or long term social activity graph analysis for marketing, civil infrastructure management, telco churn management, and so on. The bigness of big data also means that you might want to attack certain tasks differently. For instance, once the data is at rest, you don’t want to move it. Data governance in the NoSQL environment is still a blank slate waiting to be filled with best practices, not to mention tooling support. For instance, while Facebook data might be available by public API, will having access to that data trigger any customer privacy issues? Also, while Hadoop’s file system provides relatively low cost storage when measured per terabyte, at some point there will be a need to profile, cleanse, compress, and eventually deprecate that data. Again, more white space for tooling and best practices.

IBM’s embrace of what otherwise appears to be an obscure query language is yet another indicator that, aside from general “brand” awareness of Hadoop and MapReduce (which is a framework, not a language or technology), the target market of enterprise developers remains in learning mode and as yet lacks the knowledge to choose the right tools for the job.