What will Hadoop be when it grows up?

Hadoop World was sold out and it seemed like “For Hire” signs were all over the place –- or at least that’s what it said on the slides at the end of many of the presentations. “We’re hiring, and we’re paying 10% more than the other guys,” declared a member of the office of the CIO at JP MorganChase in a conference keynote. Not to mention predictions that there’s big money in big data. Or that Accel Partner’s announced a new $100 million venture fund for big data startups; Cloudera scored $40 million in D funding; and rival Hortonworks previously secured $20 million for Round A.

These are heady days. For some like Matt Asay it’s time to voice a word of caution for all the venture money pouring into Hadoop: Is the field bloating with more venture dollars than it can swallow?

The resemblance to Java 1999 was more than coincidental; like Java during the dot com bubble, Hadoop is a relatively new web-related technology undergoing its first wave of commercialization ahead of the buildup of the necessary skills base. We haven’t seen such a greenfield opportunity in the IT space in over a decade. And so the mood at the conference became a bit heady –– where else in the IT world today is the job scene a seller’s market?

Hadoop has come a long way in the past year. A poll of conference attendees showed at least 200 petabytes under management. And while Cloudera has had a decent logo slide of partners for a while, it is no longer the lonely voice in the wilderness for delivering commercial distributions and enterprise support of Hadoop. Within this calendar year alone, Cloudera has finally drawn the competition to legitimize Hadoop as a commercial market. You’ve got the household names from data management and storage -– IBM, Oracle, EMC, Microsoft, and Teradata — jumping in.

Savor the moment. Because the laws of supply and demand are going to rectify the skills shortage in Hadoop and MapReduce and the market is going to become more “normal.” Colleagues like Forrester’s Jim Kobielus predict Hadoop is going to enter the enterprise data warehousing mainstream; he’s also gone on record that interactive and near real-time Hadoop analytics are not far off.

Nonetheless, Hadoop is not going to be the end-all; with the learning curve, we’ll understand the use cases where Hadoop fits and where it doesn’t.

But before we declare victory and go home, we’ve got to get a better handle of what Hadoop is and what it can and should do. In some respects, Hadoop is undergoing a natural evolution that happens with any successful open source technology: there are always questions over what is the kernel and where vendors can differentiate.

Let’s start with the Apache Hadoop stack, which is increasingly resembling a huge brick wall where things are arbitrarily stacked atop one another with no apparent order, sequence, or interrelationship. Hadoop is not a single technology or open source project but –– depending on your perspective –– an ecosystem or a tangled jumble of projects. We won’t bore you with the full list here, but Apache projects are proliferating. That’s great if you’re an open source contributor as it provides lots of outlet for innovation, but if you’re at the consuming end in enterprise IT, the last thing you want is to have to maintain a live scorecard on what’s hot and what’s not.

Compounding the situation, there is still plenty of experimentation going on. Like most open source technologies that get commercialized, there is the question of where the open source kernel leaves off and vendor differentiation picks up. For instance, MapR and IBM each believe it is in the file system, with both having have their own answers to the inadequacies of the core Hadoop file system, (HDFS).

But enterprises need an answer. They need to know what makes Hadoop, Hadoop. Knowing that is critical, not only for comparing vendor implementations, but software compatibility. Over the coming year, we expect others to follow Karmasphere and create development tooling, and we also except new and existing analytic applications to craft solutions targeted at Hadoop. If that’s the case, we better know where to insist on compatibility. Defining Hadoop the way that Supreme Court justice Potter Stewart defined pornography (“I know it when I see it”) just won’t cut it.

Of course, Apache is the last place to expect clarity as that’s not its mission. The Apache Foundation is a meritocracy. Its job is not to pick winners, although it will step aside once the market pulls the plug as it did when it mothballed Project Harmony. That’s where the vendors come in –– they package the distributions and define what they support. What’s needed is not an intimidating huge rectangle showing a profile, but instead a concentric circle diagram. For instance, you’d think that the file system would be sacred to Hadoop, but if not, what are the core building blocks or kernel of Hadoop? Put that at the center of the circle and color it a dark red, blue, or the most convincing shade of elephant yellow. Everything else surrounds the core and is colored pale. We call upon the Clouderas, Hortonworks, IBMs, EMCs et al to step up the plate and define Hadoop.

Then there’s the question of what Hadoop does. We know what it’s done traditionally. It’s a large distributed file system that is used for offline, a.k.a., batch –– analytic runs grinding through ridiculous amounts of data. Hadoop literally chops huge problems down to size thanks a lot of things: it has a simple file structure and it brings computation directly to the data; leverages cheap commodity hardware; supports scaled-out clustering; has a highly distributed and replicated architecture; and uses the MapReduce pattern for dividing and pipelining jobs into lots of concurrent threads, and mapping them back to unity.

But we also caught a presentation from Facebook’s Jonathan Grey on how Hadoop and its HBase column store was adapted to real-time operation for several core applications at Facebook such as its unified messaging system, the polar opposite of a batch application. In summary, there were a number of brute force workarounds to make Hadoop and HBase more performant, such as extreme denormalization of data; heavy reliance on smart caching; and use of inverted indexes that point to the physical location of data, and so on. There’s little doubt that Hadoop won’t become a mainstream enterprise analytic platform until performance bottlenecks are addressed. Not surprisingly, there’s little doubt that the HBase Apache project is targeting interactivity as one of the top development goals.

Conversely, we also heard lots of mention about the potential for Hadoop to function as an online alternative to offline archiving. That’s fed by an architectural design assumption that Big Data analytic data stores allow organizations to analyze all the data, not just a sample of it. Organizations like Yahoo have demonstrated dramatic increases in click-through rates from using Hadoop to dissect all user interactions. That’s instead of using MySQL or other relational data warehouse that can only analyze a sampling. And the Yahoos and Googles of the world currently have no plan to archive their data –– they will just keep scaling their Hadoop clusters out and distributing them. Facebook’s messaging system –– which was used for rolling out real-time Hadoop, is also designed with the use case that old data will not be archived.

The challenge is that the same Hadoop cannot be all things to all people. Optimizing the same data store for interactive and online archiving is like violating the laws of gravity –– either you make the storage cheap or you make it fast. Maybe there will be different flavors of Hadoop, as data in most organizations outside the Googles, Yahoos, or Facebooks of the world is more mortal –– as are the data center budgets.

Admittedly, there is an emerging trend to brute force design databases for mixed workloads –– that’s the design pattern behind Oracle’s Exadata. But even Oracle’s Exadata strategy has limitations in that its design will be overkill for smaller-midsize organizations, and that is exactly why Oracle came out with the Oracle Database Appliance. Same engine, but optimized differently. As few organizations will have Google’s IT budget, Hadoop will also have to have personas –– one size won’t fit all. And the Hadoop community –– Apache and vendor alike –– has got to decide what Hadoop’s going to be when it grows up.