Hadoop and enterprise databases: Brothers from different mothers?

Hadoop has come a long way in its first decade. Doug Cutting, who along with Mike Cafarella cofounded the Hadoop project as an offshoot of the Apache Nutch web crawler (search) project, has written an elegant post from the perspective of someone who wasn’t just an eyewitness, but who made history.

Cutting’s main point is that the open source model has become the new default for much of commercial software development. With the Linux project having laid the groundwork, open source today has become the accepted, and in many cases, expected development and delivery model for infrastructure-based software. Open source works as long as you don’t go up toward the application level, where IP becomes much too specific to encourage a lowest common denominator approach to value creation. Hadoop has validated that model – the platform has become the preferred destination for big data analytics, and increasingly, the data lake.

For enterprises, the most important attribute of open source is not the access to source code (most IT organizations are not about to spare resources for that), but access to technology that should forestall vendor lock-in.

But reality is not quite so black and white. While open source projects clear the way for cross-vendor support, open source technology itself is no guarantee of portability. And when you peer beneath the surface, the question is whether open source has changed the dynamics of competition.

Compare notes with relational databases, which emerged during the zenith of the closed-source model for enterprise software. Forget for a moment what’s different; look at what’s similar. The most obvious is market consolidation: a decade ago, the enterprise database market was defined around Oracle, Microsoft, IBM, and for data warehousing, Teradata. For Hadoop, the landscape has winnowed down to four flavors: Amazon EMR, Cloudera, Hortonworks (and its ODPi partners), and MapR.

Dig down another level, and both enterprise databases and Hadoop developed around core standards: the ANSI SQL query language for relational databases, and HDFS (storage) plus MapReduce (compute) for Hadoop. But in both cases, those core pieces served as launching points for products that became differentiated in the commercial market.

For SQL relational databases, the Oracles, Sybases, Microsofts, and IBMs distinguished themselves with different storage engines, computational approaches (e.g., stored procedures), tuning, security, and administrative tools. In some cases, enterprise applications were written to run natively on them.

For Hadoop, the original foundational components, HDFS and MapReduce, are no longer sacrosanct. The operational definition of Hadoop now centers on APIs to components that are becoming mix and match. Core components such as HBase, Hive, Pig, and YARN are still supported across the board (HDFS being the exception), although different vendors support different versions. But beyond that, there is a beehive (pun intended) of overlapping, competing open source and proprietary projects for platform management, security/access control, data governance, interactive SQL, and streaming; you can find a good blow-by-blow description here. Portability? That’s defined by APIs to storage and compute (remember them?).
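To make that point concrete, here is a minimal sketch (not from Cutting’s post or any vendor documentation) of what “portability defined by APIs” looks like in practice. The paths, bucket name, and column name are hypothetical; the point is that the same Spark job can run against HDFS on one distribution or S3 on Amazon EMR by changing only the storage URI, while the vendor-specific pieces (security, governance, cluster management) live outside this code.

```python
# Sketch: portability hinges on the storage and compute APIs, not the
# underlying implementation. Only the path scheme changes between platforms.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("portability-sketch")
         .getOrCreate())

# Hypothetical locations; hdfs:// for an on-premises cluster, s3a:// for EMR.
ON_PREM_PATH = "hdfs:///warehouse/events/2016/"
CLOUD_PATH = "s3a://my-bucket/warehouse/events/2016/"

# The same read-and-aggregate logic works against either location.
events = spark.read.parquet(ON_PREM_PATH)   # or spark.read.parquet(CLOUD_PATH)
daily_counts = events.groupBy("event_date").count()
daily_counts.show()
```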

So while you might be able to migrate from one Hadoop vendor’s implementation of Hive or Spark to another’s, that won’t be the case with those overlapping projects, open source or not. Welcome back, vendor stack.

The question for the Hadoop market, as it was for the relational database folks a decade ago, is whether the book on competition is closed.

At first glance, the database market in 2006 appeared to be a done deal, as vendor consolidation had long passed the inflection point (that happened roughly a decade earlier). But on the margins, developers of simple web-based database applications did not require the sophistication or expense of enterprise relational data platforms. And with the emergence of Linux as a viable alternative (which had received significant backing from IBM several years before), the scene was set for the LAMP stack: Linux, the Apache web server, MySQL, and one of the “P” languages: Perl, Python, or PHP.

While web developers sparked demand for open source at the low end, Internet companies sparked similar needs at the high end, for handling torrents of flexibly structured data for analytics and real-time processing. That begat Hadoop and the wave of NoSQL databases (e.g., MongoDB, Cassandra, and Couchbase) that set the stage for today’s increasingly multi-polar data platform landscape.

In all this, Hadoop’s future is increasingly tied to that of the data lake: the default place for ingesting and storing raw data, and the resting place for actively archived aging data. The data lake is priming Hadoop for growth, if platform providers can adequately address ease-of-implementation and usage issues.

Does demand for the data lake, and the need for enterprise-ready platform providers, necessarily narrow the Hadoop field for good? Packaging and integration of the core (as opposed to state-of-the-art) components will likely commoditize to the point where we see low-cost upstarts emerge in emerging world regions. But the future of this market lies in the broader ecosystem of higher value-add, where Hadoop players will have to compete head-on. At one end, demand for Spark-based analytics in the cloud without the “overhead” of Hadoop is already materializing, but that obscures the real picture: competition over security, performance optimization and cost management, information lifecycle management, data governance, and query “ownership.” Those are all things that standalone Spark lacks and that the Hadoop ecosystem is still building. As inheritors of the data lake, Hadoop vendors will want to own these functions. But what’s to keep those functions from being virtualized away from Hadoop? Or from other data platforms? Or from being owned by other data platforms outright?

The tip of the iceberg will be ownership of the query because, in most organizations, the data lake will be only one of many sources of truth, and it will be more economical to push down analytic processing to where the data resides, not the other way around. Hadoop players want to own this, but so do incumbent database providers as they extend their spheres of governance, not to mention analytic and integration tool providers. And by the way, all that competition is a good thing, even if the path gets a bit messy in the bargain.
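As a rough illustration of what “pushing down processing to where the data resides” can look like, the hypothetical sketch below has Spark hand an aggregation query to an incumbent relational database over JDBC, so that only the small result set travels to the data lake side. The connection URL, table, and credentials are made up, and a suitable JDBC driver would need to be on the classpath.

```python
# Sketch: push the aggregation down to the source database instead of
# copying the full table into the data lake and aggregating there.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pushdown-sketch").getOrCreate()

# Hypothetical connection details for an incumbent enterprise database.
jdbc_url = "jdbc:postgresql://warehouse.example.com:5432/sales"

# The subquery is executed by the database; Spark only sees the result.
pushdown_query = """
    (SELECT region, SUM(amount) AS total
       FROM orders
      WHERE order_date >= DATE '2016-01-01'
      GROUP BY region) AS regional_totals
"""

regional_totals = (spark.read.format("jdbc")
                   .option("url", jdbc_url)
                   .option("dbtable", pushdown_query)
                   .option("user", "analyst")
                   .option("password", "*****")
                   .load())
regional_totals.show()
```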

There is little question that open source as a model for commercial software development has become the new norm. The key concept here is development: open source opens the floodgates to development and visibility. The commercial model for supporting open source has become widely accepted among enterprise IT organizations. But as to the competitive dynamics of a commercial market, competing players will inevitably have to differentiate, and the result is likely to be unique blends of open source projects with or without proprietary content (in most cases, with). And so it shouldn’t be surprising that, when you examine the competitive dynamics, the Hadoop market is following in many of the same footsteps as its relational brothers of a generation ago.