Intel’s Hadoop Sleeper

One of the more out-of-the-blue announcements to have dropped last week at Strata was Intel’s entry into the Hadoop space. Our first reactions were (1) what does Intel know about the software market (actually they do own McAfee) and (2) the last thing the Hadoop ecosystem needs is yet another distro to further confuse matters. We had a chance to briefly review it during our Strata write-up, but since then we’ve had a chance for a more detailed drill down. There’s some interesting technology under the covers that could accelerate Hadoop thanks to optimization at the Xeon chip instruction set level.

The headline is that it will speed data crunching; in the press release, it cites an internal benchmark of reducing analyzing of a terabyte of data from 4 hours down to 7 minutes admittedly, we don’t know what kind of analysis it was, or whether there are certain forms of processing that will benefit more from chip-level optimization than others, but we’ll accept the general message.

Update: The spec was based on the Terasort benchmark

The headline of the initial release pertains to an area that has so far posed an unmet challenge in Hadoop: embedding encryption functions at the hardware level. It uses a specific encryption instruction set in Xeon designed for the Advanced Encryption Standard (AES) instruction set. It addresses a key conundrum in Hadoop data security: if you have sensitive data, it would be nice to encrypt it, but encryption is such a compute-intensive operation that applying it at terabyte scale would seem impractical. For now, IBM and Dataguise are the only ones that are providing such capability for Hadoop, and that, encryption is selective; at hardware level, the performance differences could be significant. Admittedly, hardware-based encryption is not new; there are appliances on the market that already handle it, but until now nothing that works inside Hadoop.

Like Cloudera, Intel is also competing with its own proprietary Hadoop management tooling, which in this case incorporates patented auto-tuning technology for optimizing clusters.

The Intel release is based on technology developed for HPC clusters, and was initially developed for several Chinese telco clients. It is a hybrid of proprietary and open source technologies. Obviously, optimization at the chip level is proprietary, but accompanying refinements to HDFS, HBase, and Hive to work with Intel processors are being contributed back to open source. Additionally, Intel is co-sponsoring a new open source project GraphLab, which will develop a graph processing framework that will rival the existing Apache GIRAPH project, both of which are intended to provide more efficient alternatives to graph processing than MapReduce. Intel is also backing several nascent open source initiatives, including Panthera, a SQL parser which was announced last fall and Rhino, which is actually a framework for encryption and key management intended to be seeded across various Hadoop projects (components) (don’t confuse this with a similarly named project which specifies a server-side JavaScript engine).

Clearly, Intel’s entry to the Hadoop space ups the bar on performance. There are a number of different paths to the same summit; Cloudera’s Impala is a framework that is supposed to speed SQL processing while avoiding the bottleneck of MapReduce; Hortonworks is proposing the Tez runtime and Stinger interactive query projects to accomplish a similar end; Greenplum has fused its MPP SQL database while using HDFS for storage under the Pivotal HD system; while MapR swaps out HDFS altogether for a more highly performant, NFS-compatible one.

Intel enters an increasingly competitive and forking Hadoop market where there is contention at almost every layer of the stack. Although Intel has sizable enterprise software businesses, it is mostly at the lower levels of the stack (e.g., hardware optimization, security). So it is new to database level of the stack. On one hand, the Big Data software unit of Intel will be at most a tiny speck of its overall business; but as a means to an end (selling more Hadoop boxes), it is potentially very strategic.

The question is whether Intel is better off competing head to head with its own hardware-optimized platform, or forming great OEM deals with the players that are gaining critical mass in the Hadoop space, which could sell even more Xeon servers. Intel has announced a number of opening round partners which include several data platforms on the other side of the Hadoop divide. Among the most interesting are SAP, where chip-optimized Hadoop would form a natural side-by-side install with the in-memory HANA database; and Teradata, whose Aster unit already offers an appliance with rival Hortonworks (could the Intel distro become a virtual extension of the Teradata Extreme Data Appliance?).

There’s little question that Hadoop could use a hardware kick.