Is SQL the Gateway Drug for Hadoop?

How much difference does a year make? Last year was the point where each Hadoop vendor was compelled to plant its stake in supporting interactive SQL: Cloudera Impala; Hortonworks’ Stinger (injecting steroids into Hive); IBM’s Big SQL; Pivotal’s HAWQ; MapR and Drill (or Impala available upon request); and for good measure, Actian turbocharging its Vectorwise processing engine onto Hadoop.

This year, the benchmarketing has followed: Cloudera Impala clobbering the latest version of Hive in its own benchmarks, Hortonworks’ response, and Actian’s numbers with the Vectorwise engine (rebranded Vortex) now native on Hadoop supposedly trumping the others. OK, there are lies, damn lies, and benchmarks, but at least Hadoop vendors feel compelled to optimize interactive SQL performance.

As the Hadoop stack gets filled out, it also gets more complicated. In his keynote before this year’s Hadoop Summit, Gartner’s Merv Adrian made note of all the technologies and frameworks that are either filling out the Apache Hadoop project, such as YARN, or adding new choices and options, such as the number of frameworks for tiering to memory or Flash. Add to that the number of interactive SQL frameworks.

So where does this leave the enterprises that comprise the Hadoop market? In all likelihood, dazed and confused. All that interactive SQL is part of the problem, but it’s also part of the solution.

Yes, Big Data analytics has brought new relevance to the Java community, which now has something sexier than middleware to keep itself employed. And it’s provided a jolt to Python, which as it turns out is a very useful data manipulation language, not to mention open source R for statistical processing. And there are loads of data science programs bringing new business to higher ed computer science departments.

But we digress.

Java, Python and R will add new blood to analytics teams. But face it, no enterprise in its right mind is going to swap out its IT staff. From our research at Ovum, we have concluded that Big Data (and Hadoop) must become first-class citizens in the enterprise if they are to gain traction. Inevitably, that means SQL must be part of the mix.

Ironically, the great interactive SQL rollout is occurring just as something potentially far more disruptive is under way: the diversification of data platforms. Hadoop and data warehousing platforms are each adding multiple personas. As Hadoop adds interactive SQL, SQL data warehouses are adding column stores, JSON/document-style support, and MapReduce-style analytics.

But SQL is not the only new trick up Hadoop’s sleeve; there are several open source frameworks that promise to make real-time streaming analytics possible, not to mention search, and… if only the community could settle on some de facto standard language(s) and storage formats, graph. YARN, still in its early stages, offers the possibility of running multiple workloads concurrently on the same Hadoop cluster without the need to physically split it up. On the horizon are tools applying machine learning to take ETL outside the walled garden of enterprise data, not to mention BI tools that employ other approaches not easily implemented in SQL, such as path analysis. Our research has found that the most common use cases for Big Data analytics are actually very familiar problems (e.g., customer experience, risk/fraud prevention, operational efficiency), but with new data and new techniques that improve visibility.

Therefore, it would be a waste if enterprises used Hadoop only as a cheaper ETL box or a place to offload some SQL analytics. Hopefully, SQL will become the gateway drug for enterprises to adopt Hadoop.

Postscript: Here’s the broader context for our thoughts: databases are converging. There are more platforms to run SQL than ever before. Here’s a link to our presentation at the 2014 Hadoop Summit on how Hadoop, SQL, and NoSQL data platforms are converging.