Hadoop spinning YARNs

With the Strata 2013 Santa Clara conference about to kick into high gear a week from now, we’re bracing for a wave of SQL-related announcements. You won’t hear a lot about this in the vendor announcements, but behind the scenes, there’s a major disruption occurring that will determine whether MapReduce and other products or frameworks play friendly with each other on Hadoop.

MapReduce has historically been the yin to Hadoop’s yan. Historically, the literature about Hadoop invariably mentioned MapReduce, often in within the same sentence. So excuse us for having wondered, once upon a naïve time, if they were synonymous.

MapReduce is the processing framework that put Hadoop on the map because it so effectively took advantage of Hadoop’s scalable Internet data center-style architecture. In and of itself, MapReduce is a generic idea for massively parallel computing: break a job into multiple con current threads( Map) and then consolidate them (Reduce) to get a result. The MapReduce framework itself was written for Hadoop’s architecture. It pushes Map operations directly to Hadoop data nodes; each operation being completely self-contained (e.g., it supports massively parallel, shared-nothing operation); it treats data as the key-value pairs that Hadoop uses; and it works directly with Hadoop’s JobTracker and TaskTracker to provide a closed-loop process for checking and submitting correctly-formed jobs, tracking their progress to completion (where the results of each Map are shuffled together as part of the Reduce phase).

A key advantage of MapReduce is that it treats, not only individual Map operations and self-contained, but also each Map-Reduce cycle as a self-contained operation. That allows huge flexibility to allow problems to be solved iteratively through a chained series of MapReduce cycles. Such a process proved extremely effectively for crunching through petabytes of data.

Yet, that advantage is also a drawback: each MapReduce cycle is extremely read/write-intensive, as each MapReduce step is written to disk (rather than cached), which makes the process time-consuming and best suited for batch operation. If anything, the trend in enterprise analytics has been towards interactive and in some cases real-time operation, but Hadoop has been off limits to that – until recently.

As we’ve noted, with convergence of SQL and Hadoop, we believe that the predominant theme this year for Hadoop development is rationalization with the SQL world. While of course there is batch processing in the SQL world, the dominant mode is interactive. But this doesn’t rule out innovation in other directions with Hadoop, as the platform’s flexibility could greatly extend and expand the types of analytics. Yes, there will be other types of batch analytics, but it’s hard to ignore the young elephant in the room: interactive Hadoop.

Enter YARN. As we said, there was a good reason why we used to get confused between Hadoop and MapReduce. Although you could run jobs in any style that could scan and process HDFS files, the only framework you could directly manage with core Hadoop was MapReduce. YARN takes the resource management piece out of MapReduce. That means (1) MapReduce can just be MapReduce and (2) you can use the same resource manager to run other processing frameworks. YARN is a major element of the forthcoming Hadoop 2.0, which we expect to see formal release of around Q3.

That’s the end goal with YARN; it’s still a work in process, as is all of Hadoop 2.0. At this point, YARN has been tested at scale at Yahoo — over 30,000 nodes and 14 million applications reports Arun Murthy in his blog (as release manager, he’s in charge of herding cats for Hadoop 2.0). OK, so YARN has tested at scale (MapReduce could do the same thing) but still needs some API work.

So what other frameworks will be supporting YARN? We had a chat with Arun a couple weeks back to get a better idea of what’s shakin’. It’s still early days; for now, Apache HAMA (a rather unfortunate name in our view), which you could imagine as MapReduce’s scientific computing cousin, supports YARN. Others are still work in progress. Giraph, an incubating Apache project that addresses graph processing, will likely join the fray. Others include Spark, a framework for supporting in-memory cluster computing (it provides the engine behind Shark, an Apache Hive-compatible data warehousing system that is supposed to run 100x faster than Hive). Pervasive Software (about to be acquired by Actian) has gone on record that its DataRush engine would run under YARN. We wouldn’t be surprised if Twitter Storm, which focuses on distributed, real-time processing of streaming data, also comes under the YARN umbrella.

There are of course other frameworks emerging that may or may not support YARN. Cloudera already supports a prerelease version of YARN as part of CDH 4, but it has not stated whether Impala, its own open source SQL-compatible MPP data warehousing framework, will run under YARN. Speaking of Impala, there are a number of other approaches that are emerging for making Hadoop more interactive or real-time, such as Platfora, which adapts a common approach from the relational world for tiering “hot” data into memory. There are others, like Hadapt and Splice Machine that are inserting SQL directly into HDFS.

The 64-petabyte question of course is whether everybody is going to play nice and rational and make their frameworks or products work with YARN. In essence, it’s a literal power grab question – should I let the operation of my own product or framework be governed by a neutral resource manager, or can that resource manager fully support my product’s style of execution? The answer is both technology and market maturity.

On the technology end, there’s the question of whether YARN can get beyond its MapReduce (batch) roots. The burden of proof is on the YARN project folks for demonstrating, not only that their framework works at scale and supports the necessary APIs, but also that it can support other styles such as interactive or real-time streaming modes, and that it can balance workloads as approaches from the database world, such as data tiering, require their own unique optimizations.

The commercial end of the argument is where the boundary between open source and commercial value-add (proprietary or non-Hadoop open source) lies. It’s a natural rite of passage for any open source platform that becomes a victim of its own success. And it’s a question for enterprises to consider when they make their decision: ultimately, it’s about the ecosystem or club that they want to belong to.