Life's getting more interesting around the Hadoop world. Until now, if you were looking for commercial support, Cloudera was the only game in town. Barely a couple of weeks back, Yahoo, which invented the technology, began making noises about a possible commercial spinoff to go up against Cloudera. That came, ironically, after Yahoo decided to drop its own Hadoop distribution. Go figure.
But the point today is that EMC Greenplum has decided to dive in, packaging Hadoop to run natively within the Greenplum Advanced SQL analytic database. This is a departure from Greenplum's previous agreement, concluded last summer before EMC acquired Greenplum, to interface with the Cloudera edition of Hadoop. This will be EMC's own distribution, incorporating modifications from Facebook to address potential single points of failure such as the NameNode and JobTracker.
More to the point, it adds to the variety of choices that are becoming available with Hadoop, which is essentially a grab bag of technologies that includes file systems, column-oriented table structures, data warehousing and transformation query languages, parallel computing frameworks, serialization, workload coordination, and so on. While Hadoop is known as a place for storing lots of data rather than for its speed, there are offshoots providing more interactive capabilities. For instance, you can use Hadoop but substitute Cassandra or Cloudbase for the HDFS file system. Or you can add relational nodes, as Hadapt is trying.
If you're confused, join the crowd. These are early days where innovation is raw, and multiple approaches to managing all the other data that doesn't neatly fit in a SQL database are just emerging.
At the end of the day, it's about solving analytic problems for the business, not about analyzing specific kinds of data. For instance, you may wish to marry the transactional interactions with customers stored by your CRM system with the things that they are saying about you on Facebook and, in turn, you'll probably want to know where they're getting their ideas from. The idea that EMC Greenplum is pushing is to use the same platform, but run different parts of the analytic question on the appropriate data store.
From a market development standpoint, we're now at the second inflection point in the Big Data tooling market. The first was the rapid wave of consolidation that hit the more familiar Advanced SQL analytic data portion of the market: within the last 8 to 9 months alone, EMC (Greenplum), IBM (Netezza), HP (Vertica), and more recently, Teradata (Aster Data) made acquisitions in this space. While Advanced SQL is in a phase of consolidation, just the opposite is happening with Hadoop or, more broadly, the NoSQL space at large. It's a period where there is now a competition of raw ideas and also the beginnings of a convergence between SQL and NoSQL.
The latter is what EMC Greenplum's move is all about. By repackaging Hadoop, EMC Greenplum is helping to civilize it, for Greenplum customers anyway. Greenplum is placing it under its own management umbrella and, this being EMC, obviously adding APIs for plugging in storage. Additionally, it is leveraging its own internal high-speed, low-latency interconnects, and providing a certified stack for what would otherwise be an unwieldy grab bag of Apache and other open source projects. It's also part of a longer-term trend of addressing the skills gap with MapReduce and Hadoop: just as Java developers were hard to find in 1999, the same is true with Hadoop and MapReduce today. In part the laws of supply and demand will resolve that, but in the long run, the NoSQL world (which many consider to mean "Not only SQL") is going to get managed by many of the same tools that DBAs and software developers are already familiar with.
If you're an enterprise customer, moves like EMC Greenplum's make it safe for you to start piloting. It gives you a view of what will be the end game in the convergence of the SQL world with NoSQL. But keep in mind that as a technology stack, Hadoop is still very much a moving target.