We’re in the thick of analyst conference season – Informatica last week, SAS tomorrow. So on this Sunday afternoon between gigs, we’re digesting what went down at Strata 2013 in Santa Clara last week. It was kind of a frustrating day in that we had limited time, were scheduled wall to wall with meetings, and missed what were likely some fascinating sessions. But we got a sense of some dominant themes: Harden Hadoop for the enterprise, and take the SQL world to Hadoop.
The Hadoop vendor ecosystem is filling in – new players with their own distros, and new capabilities focused on making Hadoop more enterprise grade. The field is early enough that the approaches are still quite diverse – it’s time to invent, not consolidate. Let the games proceed.
EMC stole the jump early in the week by announcing the grafting of its own Greenplum Advanced SQL analytic data store onto Hadoop – basically, the Greenplum MPP database squooched (wanted an excuse to use a “word” like that) atop HDFS. Tastes like a SQL analytic database, scales like Hadoop. Cloudera Impala will soon be in a GA branded as RTQ (Real-Time Query). Not to be outshined, Hortonworks, which works through the official Hadoop project itself, announced a couple responding initiatives: the Tez runtime and Stinger interactive query engine. You wouldn’t be seeing all these efforts to make Hadoop interactive if the demand wasn’t out there; while Hadoop as a platform for extending the range of analytics has become very compelling to enterprises, they clearly expect that the platform must be SQL interactive if it is to become a part of their analytic system portfolio.
While we’ve been expending electrons on the SQLization of Hadoop, the next stage of hardening is rapidly emerging. Specifically, make Hadoop and Hadoop data more governable and secure. This involves capabilities such as data masking (where you permanently obliterate sensitive pieces of data), data encryption (where you can recover the original data), activity monitoring (who does what), data lineage (who and where this data came from, and who has done what to it), and of course, more fine grained access control (preferably role-based) that picks up where Kerberos authentication leaves off. The pieces are just beginning to fall into place.
Dataguise, a niche player in data obfuscation that relaunched itself in the Hadoop space last year, has had an encryption product out for roughly six months and has drawn several customers; they promote a self-learning feature that discovers sensitive data (e.g., credit card numbers), selectively encrypts, and then acts only when data is changed. IBM already has capabilities in Optim that are typically used when pulling data from an external database; a user-defined function can mask it in Hadoop, or mask data as it is drawn from Hadoop. IBM offers data masking and activity monitoring, a capability that Cloudera just announced. Specifically, Cloudera’s new Navigation tool places agents (like everybody else, they characterize them as “lightweight”) on HDFS, Hive, and HBase, and you can configure them. For instance, the traffic on Hive is likely to be a fraction of that for HBase, which is more interactive, so you can configure monitoring of event changes to data accordingly. And then we came across Revelytix, which focuses on data lineage
Then out of the blue, Intel swooped in with announcement of its own Hadoop distribution. As if that was the last thing the world needed. But Intel has carved some interesting angles: it is utilizing the native instruction set of the Xeon processor to move encryption and I/O optimization directly into the chip. Intel’s play addresses the issue that these processes are resource-heavy, a point where the sheer size of Hadoop data stores add insult to injury. And that is not to mention that embedding encryption in hardware lessens the load of developers. Intel has drawn a number of partners including SAP, where integration with the HANA in-memory platform offers some interesting Fast Data possibilities. So far we’ve missed signals with Intel, but will speak with them later next week to get a better idea of where they hope to take hardware optimization with Hadoop.
Loose ends: Time is running out on us, but coming out of this week, there are several issues that are running in the back of our mind:
• Hive – we thought this was a done deal. Hive is one of the earliest components of Hadoop. Having been designed when MapReduce was the predominant processing pattern, and the jobs to spawn the metadata were batch in nature. We were surprised that the debate over Hive’s use remains very, very live. The issue is over how dynamic Hive can become – yes, it can support interactive queries, but is it based on metadata that is current? We sense that this will become another area for vendor differentiation.
• Apache Hadoop project – This could be spin, but there is sniping behind the scenes that the Hadoop project is no longer so broad-based when it comes to contributions. The flipside is arguments over whether a particular vendor has enough (or any) committers rings a bit hollow. The operable question for enterprises is whether the distro of Hadoop is and will remain well-supported.
• Resource management – this one has multiple angles. Of course there is debate over YARN. It is supposed to be the über resource manager of Hadoop, so MapReduce jobs don’t collide with those of other frameworks that may have different (and conflicting) demand on processing and data access. There’s active debate over whether YARN has sufficiently weaned itself of its MapReduce batch lineage, or whether it should be a batch-oriented sub manager in a scheme where there is yet another layer of control. The counterargument to that is that this may make life (or at least levels of control) far too complex. Expect vendor differentiation here.