Hadoop: The Third Way

Working with Hadoop has been kind of a throwback. Until recently, Hadoop was synonymous with MapReduce programming, meaning that when you worked with Hadoop, it seemed that you were working with a newfangled mainframe. As if client/server never happened.

With the emergence of, and heavy competition among, the various interactive SQL frameworks (e.g., Impala, Tez, Presto, Drill, BigSQL, Big Data SQL, QueryGrid, Spark SQL), a second path opened up for database developers. The Hadoop mainframe became a client/server machine. As if n-tier never happened.

The need for speed is what made n-tier happen: applications had to bypass the bottleneck of database I/O and shed the overhead of large, monolithic designs. And so the application server platform was born, and with it, ways to abstract functions such as integration, security, and transaction management so they could operate as modular parts alongside whatever application or database. It also helped prevent abandoned online shopping carts, letting a transaction execute without being held hostage to ACID compliance. Internet-based applications were now being developed on WebSphere, WebLogic, JBoss, and more recently, more compact open source alternatives like Apache Tomcat.

But with Hadoop, we’ve remained stuck in the era of the mainframe or client/server. With the 2.x generation, however, where resource management has been taken out of MapReduce, the way has been cleared to make Hadoop more of a multi-purpose platform. While interactive SQL was the first shot, new frameworks supporting streaming (Storm, Spark Streaming), machine learning (Spark), and search (Solr) are among the new additions to the palette.

But at this point, we’re still looking at Hadoop as either a mainframe or a two-tier system. Developers write MapReduce or Spark programs, or BI/query tools access HDFS with or without Hive. There’s nothing available for writing data-driven applications, such as real-time user scoring or intrusion detection.

Nearly four years ago, a startup with a weird name – Continuuity – emerged to become, in its own terms, “the JBoss for Hadoop.” The goal was to build a data fabric that abstracted the low-level APIs to HDFS, MapReduce, Hive, and other Hadoop components, clearing the way for developers not just to write MapReduce programs or run BI tools, but to write API-driven applications that could connect to Hadoop. Just as, a generation ago, application servers abstracted data and programs so they could flexibly connect with each other. Its first project was a data ingestion platform written on Storm that would be easier to work with than existing Hadoop projects such as Flume.

Continuuity’s problem was that the company was founded too early. During the period when Hadoop was exclusively a batch processing platform, there was little clamor among developers to write data-driven applications. But as new frameworks transform Hadoop into a platform that can deliver experiences closer to real-time, demand should emerge among developers to write, not just programs, but applications that can run against Hadoop (or other platforms).

In the interim, Continuuity changed its name to Cask and changed its business model to become an open source company. It has diversified its streaming engine to work with other frameworks besides Storm so it can more readily persist data. And the 40-person company, which was founded a few blocks from Cloudera’s original headquarters, next to Fry’s Electronics in Palo Alto, has just drawn a modest investment from Cloudera to further develop its middleware platform.

Admittedly, while Cask’s website doesn’t make a good case for it (the home page gives you a 404 error), providing an application platform for Hadoop opens up possibilities limited only by the imagination. For instance, it could make possible event-driven programs for performing data validation or detecting changes in customer interactions, and so on.

For Cloudera, Cask is a low-risk proposition for developing that long-missing third path to Hadoop to further its transformation to a multi-purpose platform.