Category Archives: Data Management

The Open Data Platform is and is not like UNIX, Cloudera cracks $100m, and what becomes of Pivotal

How’s that for a mouthful?

It shouldn’t be surprising that the run-up to Strata is full of announcements designed to shape mindsets. And so today, we have a trio of announcements that settle – for now – the question of whether Pivotal is still in the Hadoop business (or at least with its own distro); verify that Cloudera did make $100m last year; and announce the formation of a cross-industry initiative, the Open Data Platform.

First, we’ll get our thoughts on Cloudera and Pivotal out of the way. Cloudera’s announcement didn’t surprise us; we had estimated that they were on their way to a $100m year given typical $250k deal sizes (outliers go a lot higher than that), a new customer run rate that we pegged at about 50 per quarter, and of course subscription renewals that inflate as customers grow their deployments (back of the envelope, 50 new customers a quarter at roughly $250k apiece works out to about $50m a year in new business before renewals). To put this in perspective, we’re still in a greenfield market where a rising tide is lifting all boats; we estimate that business is also doubling for most of Cloudera’s rivals – but Cloudera has had an obvious head start.

As to Pivotal, they’ve been the subject of much FUD in the wake of published reports last fall of a layoff of 60 employees on the Big Data side of their business. Word on the street was that Pivotal, the last to enter the Hadoop distribution business, would be the first to leave – with Hortonworks the logical candidate to take its place, as Pivotal disclosed last summer that it would replace its Command Center with the Hortonworks-led Ambari project for Hadoop cluster management.

The news is that Pivotal is making a final break from its proprietary technology legacy and open sourcing everything – including the Greenplum database. And yes, Pivotal will OEM support for HDP, but it will still offer its own distribution optimized for HAWQ and for integration with its other data engines, including the GemFire in-memory database. This announcement didn’t happen in a vacuum, but in conjunction with another announcement today: the Open Data Platform, of which Pivotal and Hortonworks (along with IBM and others) are members. We’re frankly puzzled as to why Pivotal would continue offering its own distribution. But we’ll get back to that.

The Open Data Platform is an initiative that tries to put the toothpaste back into the tube: define, integrate, test, and certify a standard Hadoop core. Once upon a time, Apache Hadoop could be defined by its core projects – like what was on the Apache project home page. But since then there have been multiple overlapping and often competing projects for running interactive SQL (do we use Hive or bypass it?); cluster management (Ambari or various vendors’ proprietary management systems); managing security; managing resources (YARN for everything, or just batch jobs, and what about Mesos?); streaming (Storm or Spark Streaming); and so on. When even the core file system, HDFS, may not be in every distro, the question of what makes Hadoop, Hadoop remains key.

Of course, ODP is not just about defining core Hadoop, but about designating, in effect, a stable base into which value-added features or third-party software can reliably hook. It picks up where the Apache community, which simply designates which releases are stable, leaves off, by providing a formal certification base. That’s the type of thing that vendor consortia, rather than open source communities, are best equipped to deliver. For the record, ODP pledges to work alongside Apache.

So far so good, except that this initiative comprises only half the global Hadoop vendor base. This is where the historical analogies with UNIX come in; recall the Open Software Foundation, which was everybody vs. the industry leader Sun? It repeats the dynamic of the community vs. the market leaders – for now, the Cloudera and Amazon customer bases will outnumber ODP committers.

Over time OSF UNIXes remained overshadowed by Solaris, but eventually everybody turned their attention to dealing with Microsoft. After laying down arms, OSF morphed into The Open Group, which refocused on enterprise architecture frameworks and best practices.

The comparison between ODP and OSF is only in the competitive dynamics. Otherwise, UNIX and Hadoop are different creatures. While both are commodity technologies, Hadoop is a destination product that enterprises buy, whereas UNIX (and Linux) are foundational components that are built into the purchase of servers and appliances. Don’t get confused by those who characterize Hadoop as a data operating system, as enterprises are increasingly demanding capabilities like security, manageability, configurability, and recovery that are expected of any data platform that they would buy.

And further, where the narrative differs is that Hadoop, unlike UNIX, lacks a common enemy. Hadoop will exist alongside, not instead of other database platforms as they eventually meld into a fabric where workloads are apportioned. So we don’t necessarily expect history to repeat itself with Open Data Platform. The contribution of ODP will be the expectation of a non-moving target that becomes a consensus, although not an absolutely common one. It’s also the realization that value-add in Hadoop increasingly comes, not from the core, but from the analytics that run on it and the connective glue that the platform provider supplies.

As for Pivotal and what it’s still doing in the Hadoop business, our expectation is that ODP provides the umbrella under which its native distribution converges and becomes a de facto dialect of HDP. We believe that Pivotal’s value-add won’t be in the Hadoop distribution business, but how it integrates GemFire and optimizes implementation for its Cloud Foundry Platform-as-a-Service cloud.

Postscript: No good deed goes unpunished. Here’s Mike Olson’s take.

Making Yin and Yang of YARN and Mesos

YARN has drawn a considerable spotlight as the resource scheduler that allows Hadoop 2.x to finally transcend its MapReduce roots. Those MapReduce roots are both YARN’s strength and its weakness: there is backward compatibility with the MapReduce workloads that have dominated Hadoop, but also limitations for running ongoing workloads because of its job-oriented, batch origins. By contrast, Apache Mesos is an open source project that has been around for some time and provides resource management for scale-out clusters of all kinds – not just Hadoop. It is well suited to dynamic management of continuous (ongoing) workloads.

While a bit dated, this 2011 Quora posting provides a good point-by-point comparison of YARN’s and Mesos’ strengths and shortcomings. Although the two are not directly comparable, until now they have been considered rival approaches.

A new project – Myriad – proposes to bring them together. It would superimpose Mesos as the top-level dynamic juggler of resources, while YARN sticks to its knitting and schedules them. In essence, it would make YARN elastic. MapR, which is staking new ground as a participant in rather than a consumer of Apache projects, is joining with Mesosphere and eBay to drive the project, with plans to submit it to Apache for incubation.
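
To make that division of labor concrete, here is a self-contained toy sketch in Python – a conceptual model only, not the Myriad, Mesos, or YARN APIs – in which a Mesos-like allocator hands out resource offers and a YARN-like scheduler decides which queued containers fit within each offer:

```python
# Toy model of two-level scheduling: a Mesos-like allocator offers spare
# capacity, and a YARN-like scheduler places queued containers within each
# offer. Purely illustrative; node names and numbers are hypothetical.
from collections import namedtuple

Offer = namedtuple("Offer", ["node", "cpus", "mem_gb"])
Container = namedtuple("Container", ["job", "cpus", "mem_gb"])

class MesosLikeAllocator:
    """Top-level resource juggler: advertises spare capacity per node."""
    def __init__(self, nodes):
        self.nodes = nodes  # {node: (cpus, mem_gb)}

    def offers(self):
        return [Offer(n, c, m) for n, (c, m) in self.nodes.items()]

class YarnLikeScheduler:
    """Framework-level scheduler: fits queued containers into offers."""
    def __init__(self, queue):
        self.queue = list(queue)

    def schedule(self, offer):
        placed, cpus, mem = [], offer.cpus, offer.mem_gb
        for c in list(self.queue):
            if c.cpus <= cpus and c.mem_gb <= mem:
                placed.append(c)
                self.queue.remove(c)
                cpus, mem = cpus - c.cpus, mem - c.mem_gb
        return placed

allocator = MesosLikeAllocator({"node1": (8, 32), "node2": (4, 16)})
scheduler = YarnLikeScheduler([Container("etl", 2, 8),
                               Container("query", 4, 16),
                               Container("model", 4, 12)])

for offer in allocator.offers():
    for c in scheduler.schedule(offer):
        print(f"launch {c.job} on {offer.node} ({c.cpus} cpus, {c.mem_gb} GB)")
```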

Myriad is not the only game in town. Slider, a project led by Hortonworks, takes the reverse approach. Instead of Mesos dynamically allocating containers (resources) to YARN, Slider works as a helper to YARN, dynamically requesting new resources when a YARN container fails.

Myriad vs. Slider typifies the emerging reality for Hadoop; when issues arise in the Hadoop platform, chances are there will be competing remedies vying for adoption.

Hadoop: The Third Way

Working with Hadoop has been kind of a throwback. Until recently, Hadoop was synonymous with MapReduce programming, meaning that when you worked with Hadoop, it seemed that you were working with a newfangled mainframe. As if client/server never happened.

With the emergence of, and heavy competition among, the various interactive SQL frameworks (e.g., Impala, Tez, Presto, Drill, Big SQL, Big Data SQL, QueryGrid, Spark SQL), a second path emerged for database developers. So the Hadoop mainframe became a client/server machine. As if n-tier never happened.

The need for speed is what made n-tier happen – the need to bypass the bottleneck of database I/O and the overhead of large, monolithic applications. And so the application server platform was born, and with it, ways to abstract functions such as integration, security, and transaction management so they could operate as modular piece parts with whatever application or database needed them. Or, to keep online shopping carts from being abandoned, to let a transaction execute without being held hostage to full ACID compliance. Internet-based applications came to be developed on WebSphere, WebLogic, JBoss, and more recently, more compact open source alternatives like Apache Tomcat.

With Hadoop, we’re still in the era of the mainframe or client/server. But in the 2.x generation, where resource management has been taken out of MapReduce, the way has been cleared to make Hadoop more of a multi-purpose platform. While interactive SQL was the first shot, new frameworks supporting streaming (Storm, Spark Streaming), machine learning (Spark), and search (Solr) are among the new additions to the palette.

But at this point, we’re still looking at Hadoop as either a mainframe or two-tier system. Developers write MapReduce or Spark programs, or BI/query tools access HDFS with or without Hive. There’s nothing available to write data-driven programs, such as real-time user scoring or intrusion detection.

Nearly four years ago, a startup with a weird name – Continuuity – emerged to become, in its own terms, “the JBoss for Hadoop.” The goal was building a data fabric that abstracted the low-level APIs to HDFS, MapReduce, Hive, and other Hadoop components, clearing the way for developers to write, not just MapReduce programs or BI queries, but API-driven applications that could connect to Hadoop – just as, a generation ago, application servers abstracted data and programs so they could flexibly connect with each other. Its first project was a data ingestion platform written on Storm that would be easier to work with than existing Hadoop projects such as Flume.
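
To give a sense of the low-level plumbing such a fabric abstracts away, here is a minimal sketch of direct HDFS access via the WebHDFS REST interface; the namenode host, port, and file paths are hypothetical:

```python
# A sketch of "low-level" Hadoop access: listing a directory and reading a
# file over the WebHDFS REST API with plain HTTP calls. The namenode host,
# port, and paths are hypothetical.
import requests

NAMENODE = "http://namenode.example.com:50070/webhdfs/v1"

# List a directory (op=LISTSTATUS returns JSON file metadata).
listing = requests.get(f"{NAMENODE}/data/clickstream", params={"op": "LISTSTATUS"})
for status in listing.json()["FileStatuses"]["FileStatus"]:
    print(status["pathSuffix"], status["length"])

# Read a file (op=OPEN; the namenode redirects the client to a datanode).
resp = requests.get(f"{NAMENODE}/data/clickstream/part-00000",
                    params={"op": "OPEN"}, allow_redirects=True)
print(resp.text[:200])
```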

Continuuity’s problem was that the company was founded too early. During a period where Hadoop was exclusively a batch processing platform, there was little clamor for developers to write data-driven applications. But as new frameworks transform Hadoop into a platform that can deliver experiences closer to real-time, demand should emerge among developers to write, not just programs, but applications that can run against Hadoop (or other platforms).

In the interim, Continuuity changed its name to Cask, and changed its business model to become an open source company. It has diversified its streaming engine to work with other frameworks besides Storm to more readily persist data. And the 40-person company which was founded a few blocks away from Cloudera’s original headquarters, next to Fry’s Electronics in Palo Alto, has just drawn a modest investment from Cloudera to further develop its middleware platform.

Admittedly, Cask’s website doesn’t really make the case (the home page gives you a 404 error), but providing an application platform for Hadoop opens up possibilities limited only by the imagination. For instance, it could make possible event-driven programs for performing data validation or detecting changes in customer interactions, and so on.

For Cloudera, Cask is a low-risk proposition for developing that long-missing third path to Hadoop to further its transformation to a multi-purpose platform.

MongoDB grows up

One could say that MongoDB has been at the right place at the right time. When web developers demanded a fast, read-intensive store of complex variably-structured data, the company formerly known as 10Gen came up with a simple engine backed by intuitive developer-friendly tooling. It grew incredibly popular for applications like product catalogs, tracking hierarchical events (like chat strings with responses), and some forms of web content management.

In a sense, MongoDB and JSON became the moral equivalents of MySQL and the LAMP stack, which were popular with web developers who needed an easy-to-deploy transactional SQL database sans all the overhead of an Oracle.

Some things have changed. Over the past decade, Internet developers expanded from web to also include mobile. And the need for databases has extended to variably structured data. Enter JSON. It provided the long-elusive path to a simple operational database with an object-like representation of the world, minus the associated baggage (e.g., polymorphism, inheritance), using a language (JavaScript) and a data structure that were already lingua franca with web developers.

Like MySQL, Mongo was known for its simplicity. It had a simple data model, a query framework that was easy for developers to use, and well-developed indexing that made reads very fast. It’s been cited by db-Engines as the fourth most popular database among practitioners.
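
By way of illustration, here is a minimal pymongo sketch of that developer experience – store a variably structured document, add an index, query into the nested structure. The database, collection, and field names are made up:

```python
# A minimal sketch of MongoDB's developer appeal: flexible documents,
# straightforward queries, and secondary indexes for fast reads.
# Database, collection, and field names are illustrative.
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
catalog = client["shop"]["products"]

# Documents in the same collection need not share a schema.
catalog.insert_one({
    "sku": "A-1001",
    "name": "trail shoe",
    "price": 89.0,
    "attributes": {"sizes": [8, 9, 10], "colors": ["black", "red"]},
})

# A secondary index keeps lookups fast.
catalog.create_index([("sku", ASCENDING)])

# Queries reach naturally into the nested document structure.
for doc in catalog.find({"attributes.colors": "red"},
                        {"_id": 0, "sku": 1, "price": 1}):
    print(doc)
```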

And like MySQL, MongoDB was not known for its ability to scale (just ask Cassandra fans). For MySQL, it took a third-party engine – InnoDB – to provide the heart transplant that turned it into a serious database.

Fast forward, and some alumni from Sleepycat Software (which developed BerkeleyDB, later bought by Oracle) founded WiredTiger, ginning out an engine that could add similar scale to Mongo. WiredTiger offers a more write-friendly engine that aggressively takes advantage of (configurable) compression to scale and deliver high performance. And it provides a much more granular and configurable approach to locking that could alleviate many of the write bottlenecks that plagued Mongo.

History took interesting paths. Oracle bought Sleepycat and later inherited MySQL via the Sun acquisition. And last fall, MongoDB bought WiredTiger.

Which brings us to MongoDB 3.0 (the release originally numbered 2.8). It’s no mystery – aside from the jump in release number – that the WiredTiger engine would end up in Mongo; their integration was destiny. Also not surprising is that the original MongoDB MMAP engine lives on. There is a huge installed base, and for existing read-heavy applications (e.g., recommendation engines), it works perfectly well for a wide spectrum of use cases. The new release makes the storage engine pluggable via a public API.

We’ve been down this road before; today MySQL has almost a dozen storage engines. Right out of the gate, MongoDB will have the two supported by the company: the classic MMAP engine or the industrial-strength WiredTiger engine. There’s also an “experimental” in-memory engine that’s part of this release. And off in the future, there’s no reason why HDFS, cloud-based object storage, or even SQL engines couldn’t follow.
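
If you are curious which engine a given 3.0+ server is actually running, here is a quick pymongo sketch; the connection string is hypothetical, and the storageEngine field of serverStatus should only appear on 3.0 and later:

```python
# A sketch: ask a MongoDB 3.0+ server which pluggable storage engine it is
# running (e.g., "mmapv1" or "wiredTiger"). Connection string is hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
status = client.admin.command("serverStatus")

# serverStatus reports the active engine under the storageEngine field.
print(status.get("storageEngine", {}).get("name", "unknown"))
```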

The significance of the 3.0 release is that the MongoDB architecture becomes an extensible family. And in fact, this is quite consistent with trends that we at Ovum have been seeing with other data platforms, which are all overlapping and taking on multiple personas. That doesn’t mean that every database will become the same, but that each will have its area of strength while also being able to take on boundary cases. For instance:
• Hadoop platforms have been competing on adding interactive SQL;
• SQL databases have been adding the ability to query JSON data; and
• MongoDB is now adding the fast, scalable write capabilities associated with rival NoSQL engines like Cassandra or Couchbase, reducing the performance gap with key-value stores.

Database convergence or overlap doesn’t mean that you’ll suddenly use Hadoop to replace your data warehouse, or MongoDB instead of your OLTP SQL database. And if you really need fast write performance, key-value stores will probably remain your first choice. Instead, view these as extended capabilities that allow you to handle a greater variety of use cases, data types, and queries off the same platform with familiar development, query, and administration tools.

Back to MongoDB 3.0, there are a few other key enhancements in this release. Concurrency control (the source of those annoying write locks with the original MMAP engine) becomes more granular: instead of having to lock the entire database for consistent writes, locks can now be confined to a specific collection (the MongoDB equivalent of a table), reducing an annoying bottleneck. Meanwhile, WiredTiger adds more granular memory management to further improve write performance, and may eventually bring schema validation to Mongo.

We don’t view this release as being about existing MongoDB customers migrating to the new engine; yes, the new engine will support the same tools, but it will require a one-time reload of the database. Instead, we view this as expanding MongoDB’s addressable market, with the obvious targets being key-value stores like Cassandra, BerkeleyDB (now commercially available as Oracle NoSQL Database), or Amazon DynamoDB. That’s just what other data platforms are doing as they add their own overlapping capabilities.

Hortonworks takes on the thankless, necessary job of governing data in Hadoop

Governance has always been a tough sell for IT or the business. It’s a cost of doing business that, although not optional, inevitably gets kicked down the road. That is, unless your policies, external regulatory requirements, or embarrassing public moments force the issue.

Put simply, data governance consists of the policies, rules, and practices that organizations enforce for the way they handle data. It shouldn’t be surprising that in most organizations, data governance is at best ad hoc. Organizations with data (and often governance) in their names offer best practices, frameworks, and so on. IBM has been active on this front as well, having convened its own data governance council among leading clients for the better part of the last decade; it has published maturity models and blueprints for action.

As data governance is a broad umbrella encompassing multiple disciplines from data architecture to data quality, security and privacy management, risk management, lifecycle management, classification and metadata, and audit logging, it shouldn’t be surprising that there is a wealth of disparate tools out there for performing specific functions.

The challenge with Hadoop, like any emerging technology, is its skunk works origins among Internet companies who had (at the time) unique problems to solve and had to invent new technology to solve them. But as Big Data – and Hadoop as platform – has become a front burner issue for enterprises at large, the dilemma is ensuring that this new Data Lake not become a desert island when it comes to data governance. Put another way, implementing a data lake won’t be sustainable if data handling is out of compliance with whatever internal policies are in force. The problem comes to a head for any organization dealing with sensitive or private information, because in Hadoop, even the most cryptic machine data can contain morsels that could compromise the identity (and habits) of individuals and the trade secrets of organizations.

For Hadoop, the pattern is repeating. Open source projects such as Sentry, Knox, Ranger, Falcon and others are attacking pieces of the problem. But there is no framework that brings it all together – as if that were possible.

Towards that end, we salute Hortonworks for taking on what in our eyes is otherwise a thankless task: herding cats to create definable targets that could be the focus of future Apache projects – and for Hortonworks, value-added additions to its platform. Its Data Governance Initiative, announced earlier this morning, is the beginning of an effort that mimics to some extent what IBM has been doing for years: convene big industry customers to help define practice, and for the vendor, define targets for technology development. Charter members include Target, Merck, Aetna – plus SAS, as technology partner. This is likely to spawn future Apache projects that, if successful, will draw critical mass participation for technologies that will be optimized for the distinct environment of Hadoop.

A key challenge will be delineating where vertical industry practices and requirements leave off (a space already covered by many industry groups), so that the initiative doesn’t wind up reinventing the wheel. The same is true across the general domain of data management, where, as we stated before, there are already organizations that have defined the landscape; we hope the new initiative formally or informally syncs up with them.

Hortonworks, Big Data, and Big Money

So the Hadoop market has finally had its first IPO. Hortonworks’ successful $100 million IPO reflects pent-up demand for those outside the Silicon Valley venture community to get a piece of the action in the fast emerging Hadoop space. Excluding established BI/analytics players who extended their wares to support Hadoop, until now VCs had all the fun. Significantly, Hortonworks and New Relic both conducted successful IPOs that saw share prices surging 40% on the same day that the Dow otherwise went south.

This all comes during a period where there’s been an unquestioned surge in Big Data, and technology investments in general. Before the IPO, all three Hadoop pure plays raised nearly $1 billion in venture funding during this calendar year, and add to that about another $300 million for NoSQL players MongoDB, DataStax, and Couchbase (if you start the clock last fall). In retrospect, what’s interesting about the Hortonworks IPO is not the size, because at $100 million, it’s dwarfed by previous rounds of venture financing.

From our seat way back in the peanut gallery, it appears that Hortonworks IPO was about making a statement that the loss-driven vendor’s business – and Hadoop as a whole – is becoming investment-grade. And we believe it was about getting first in line while the iron was still hot.

This is very much a greenfield market: almost all sales are new, with few competitive replacements. It is a high-growth market; Hortonworks alone reported that subscription sales tripled year over year as of the end of Q3 2014. That drops a broad hint about prospective growth: with the overall Hadoop paid installed base (across all vendors) at 1,000 – 1,500 (depending on whether you count paid sandboxes in the cloud), there’s still a lot of virgin market out there. But the flip side is heavy investment, both in product development and in building a global go-to-market network from scratch. Looking at the Hortonworks S-1, those two areas gobbled up most of the reported $80 million in losses for the first three quarters of this calendar (and fiscal) year.

We don’t expect that Hadoop pure plays (or at least those that haven’t been acquired) will be profitable for at least another 2 – 3 years.

As we’ve noted before, we’re bullish on Hadoop as a pillar of the data platform market in the short run, where we expect sales to grow geometrically, and in the long run, where it joins SQL, NoSQL, and real-time streaming platforms as part of the data ecosystem that enterprises are expected to manage. But we’re concerned over the midterm, where the expectations of capital collide with the realities of greenfield markets. There is growth, but also start-up expense.

Hortonworks’ numbers have been well known since it filed its S-1 last month. Admittedly, it is not unusual for high-growth companies to IPO while still in the red. But Hortonworks is not the hottest company in its field; it is one of three hot companies, and it happened to be the one that IPO’d first. Nonetheless, there were several red flags:
• The revenue base is too narrow, being concentrated in its top three customers. Admittedly, the revenue base is getting more diversified, but even this year, the top three customers still accounted for over a third of business.
• The business is too low margin, with over 40% of sales coming from professional services (subscriptions are more profitable and for a product company, a more reliable growth indicator).

Hortonworks states that gross billings are a more reliable trending indicator of its business, as they recognize revenues that subscription-based accounting normally defers; with that added in, its business for the first three quarters of this year is roughly 25% higher. And as of the end of Q3 2014, it is reporting $47.7 million in deferred revenue plus $17.3 million in backlog.

While we’re happy that Hadoop has finally made it to NASDAQ and congratulate Hortonworks for its strong first day showing, our wish is that the company had deferred this offering by another 6 – 12 months to show a more diversified business.

Strata 2014 Part 2: Exploratory Analytics and the need to Fail Fast

Among the unfulfilled promises of BI and data warehousing was the prospect of analytic dashboards on the desk of everyman. History didn’t quite turn out that way – BI and data warehousing caught on, but only as the domain of elite power users who were able to create and manipulate dashboards, understand KPIs, and know what to ask for when calling on IT to set up their data marts or query environments. The success of Tableau and Qlik revealed latent demand for intuitive, highly visual BI self-service tools, and the feasibility of navigating data with lesser reliance on IT.

History has repeated itself with Big Data analytics and Hadoop platforms – except that we need even more specialized skills on this go round. Whereas BI required power users and DBAs, for Big Data it’s cluster specialists, Hadoop programmers, and data scientists. When we asked an early enterprise adopter at a Hortonworks-sponsored user panel back at Hadoop Summit as to their staffing requirements, they listed Java, R, and Python programmers.

Even if you’re able to find or train Hadoop programmers and cluster admins, and can even spot the spare underemployed data scientist, you’ll still face a gap in operationalizing analytics. Unless you’re planning to rely on an elite team of programmers, data scientists or statisticians, Big Data analytics will wind up in a more gilded version of the familiar data warehousing/BI ghetto.

This pattern won’t be sustainable for mainstream enterprises. We’ve gone on record that Big Data and Hadoop must become first class citizens in the enterprise. And that means mapping to the skills of the army you already have. No wonder that interactive SQL is becoming the gateway drug for Hadoop in the enterprise. That at least gets Big Data and Hadoop to a larger addressable practitioner base, but unless you’re simply querying longer periods of the same customer data that you held in the data warehouse, you’ll face new bottlenecks getting your arms around all that bigger and badder data. You’re probably going to be wondering:
• What questions do you ask when you have a greater variety of data to choose from?
• What data sets should you select for analysis when you have dozens being ingested into your Hadoop cluster?
• What data sets will you tell your systems admins or DBAs (yes, they can be retrained for schema-on-read data collections) to provision for your Hadoop cluster?

If you’re asking these questions, you’re not lost. Your team is likely acquiring new data sets to provide richer context to perennial challenges such as optimizing customer engagement, reducing financial risk exposure, improving security, or managing operations. Unless your team is analyzing log files with a specific purpose, chances are you won’t have the exact questions or know specifically which data sets you should pinpoint in advance.

Welcome to the world of Exploratory Analytics. This is where you iterate your queries and identify which data sets yield the answers. It’s different from traditional analytics, where the data sets, schema, and queries are already predetermined for you. In the exploratory phase, you look for answers that explain why your KPIs have changed, or whether you’re looking at the right KPIs at all. Exploratory analytics does not replace your existing regimen of analytics, query, or reporting – it complements it. Exploratory Analytics:
• Gives you the Big Picture; it shows you the forest. Traditional analytics gives you the Precise Picture, where you get down to the trees.
• May be used for quickly getting a fix on some unique scenario, where you might run a query once and move on; or for recalibrating where you should do your core data warehousing analytics – which means it is a preparatory stage for feeding new data to the data warehouse.
• Gives you the context for making decisions. Data warehousing analytics are where final decisions (for which your organization may be legally accountable) are made.

A constant refrain from the just-concluded Strata Hadoop World conference was the urgency of being able to fail fast. You are conducting a process of discovery in which you test and retest assumptions and hypotheses. The nature of that process is that you are not always going to be right – and in fact, if you are thorough enough in your discovery process, you won’t be. That doesn’t mean you don’t start out with a question or hypothesis – you do. But unlike conventional BI and data warehousing, your assumptions and hypotheses are not baked in from the moment the schema is set in concrete.

At the other end of the scale, exploratory analytics should not degenerate into a hunting expedition. Like any scientific process, you need direction and ways for setting bounds on your experiments.

Exploratory analytics requires self-service, from data preparation through query and analysis. You can’t afford to wait for IT to transform your data and build query environments, and then repeat the process for the next stage of refining your hypothesis. It takes long enough to build a single data mart.

It starts with getting data. You must identify data sets and then transform them (schema on read doesn’t eliminate this step). That may involve searching externally for data sets or scanning those already available from your organization’s portfolio of transaction or messaging systems, log files, or other sources; or the data might already be on your Hadoop cluster. You need to identify what’s in the data sets of interest, reconcile them, and conduct the transformation – a process often characterized as data wrangling. This process is not identical to ETL but rather its precursor. You are getting the big picture and may not require the same degree of precision in matching and de-duplicating records as you would inside a data warehouse. You’re designing for queries that give you the big picture (is your organization on the right track?) as opposed to the precise or exact picture (where you are making decisions that carry legal and/or financial accountability).
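
As a minimal sketch of what that lightweight wrangling can look like in practice – here with pandas, and with hypothetical file and column names – note that the matching is deliberately looser than warehouse-grade ETL:

```python
# A sketch of lightweight wrangling ahead of exploratory queries: infer
# types, normalize a key, and de-duplicate loosely. File and column names
# are hypothetical; this is a precursor to ETL, not a replacement for it.
import pandas as pd

# Pull in a raw extract; let pandas infer column types on read.
events = pd.read_csv("web_events_sample.csv", parse_dates=["timestamp"])

# Normalize a join key so sources that spell it differently still line up.
events["customer_id"] = (events["customer_id"].astype(str)
                         .str.strip().str.upper())

# Loose de-duplication: exact matches on key + timestamp are good enough
# for a big-picture view, even if a warehouse load would be stricter.
events = events.drop_duplicates(subset=["customer_id", "timestamp"])

# A first exploratory cut: which customers generate the most events?
print(events.groupby("customer_id").size()
            .sort_values(ascending=False).head(10))
```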

You will have to become self-sufficient in performing the wrangling, setting up the queries, and getting the results. You’ll need the system to assist you in this endeavor. As serial database entrepreneur and MIT professor Michael Stonebraker put it during a Strata presentation, when you have more than 20 – 25 data sources, it is not humanly possible to keep track of all of them manually; you’ll need automation to track and organize data sets for you. You may need the system to assist you in selecting data sets, and you will certainly need it to help you determine how to correlate and transform those data sets into workable form. And to keep from reinventing the wheel, you’ll need a way to track – and preferably collaborate with others on – the wrangling process: your knowledge about data sets, transformations, queries, and so on.

Advances in machine learning are helping make the wild goose chase manageable. Compared to traditional ETL tools, these offerings use a variety of techniques, such as recognizing patterns in the data and identifying what kind of data is in a column, calibrated with various training techniques where you download a sample set and “teach” the system, or provide prompted or unprompted feedback on the correctness of a transform. Unlike traditional ETL tools, you can operate from a simple spreadsheet rather than having to navigate schema.

Emerging players like Trifacta, Paxata, and Tamr have introduced such techniques to data preparation and reconciliation; IBM has embraced these approaches with Data Refinery and Watson Analytics, while Informatica leverages machine learning with its Springbok cloud service; and we expect to hear from Oracle very soon.

The next step is data consumption. Where data is already formatted as relational tables, existing BI self-service visualization tools may suffice. But other approaches are emerging that deduce the story. IBM’s Watson Analytics can suggest what questions to ask and pre-populate a storyboard or infographic; ClearStory Data combines live blending of data from internal and external sources to generate interactive storyboards that similarly venture beyond dashboards.

For organizations that already actively conduct BI and analytics, the prime value-add from Big Data will be the addition of exploratory analytics at the front end of the process. Exploratory analytics won’t replace traditional BI query and reporting, as the latter is where the data and processes are repeatable. Exploratory analytics allows organizations to search deeper and wider for new signals or patterns; the results might in some cases be elevated to the data warehouse, while in other cases they may feed a background process that helps the organization get the bigger picture – to understand whether it has the right business or operational strategy, whether it is asking the right questions, serving the right customers, or protecting against the right threats.

Strata 2014 Part 1: Hadoop, Bright Lights, Big City

If you’re running a conference in New York, there’s pretty much no middle ground between a large hotel and the Javits Center. And so this year, Strata Hadoop World made the leap, getting provisional access to a small part of the big convention center to see if it could fill the place. That turned out to be a foregone conclusion.

The obvious question was whether Hadoop, and Big Data, had in fact “crossed the chasm” to become a mainstream enterprise IT market. In case you were wondering, the O’Reilly folks got Geoffrey Moore up on the podium to answer that very question.

For Big Data-powered businesses, there’s little chasm to cross when you factor in the cloud. As Moore put it, if you only need to rent capacity on AWS, the cost of entry is negligible. All that early adopter, early majority, late majority stuff doesn’t really apply. A social site has a business model of getting a million eyes or nothing, and getting there is a matter of having the right buzz to go viral – the key is that there’s scant cost of entry and you get to fail fast. Save that thought – because the fail fast principle also applies to enterprises implementing Big Data projects (we’ll explain in Part 2 of this post, soon to come).

Enterprise adoption follows Moore’s more familiar chasm model – and there we’re still at the early majority stage, where the tools of the trade are arcane languages and frameworks like Spark and Pig. But the key, Moore says, is for “pragmatists” to feel pain; that is the chasm to the late majority, the point where conventional wisdom is to embrace the new thing. Pragmatists in the ad industry are feeling pain responding to Google; the same goes for the media and entertainment sectors, where even cable TV mainstays such as HBO are willing to risk decades-old relationships with cable providers to embrace pure Internet delivery.

According to Cloudera’s Mike Olson, Hadoop must “disappear” to become mainstream. That’s a 180-degree switch, as the platform has long required specialized skills, even if you ran an off-the-shelf BI tool against it. Connecting from familiar desktop analytics tools is the easy part – they all carry interfaces that translate SQL into queries that can run on Hive, or on any of the expanding array of interactive-SQL-on-Hadoop frameworks that are making Hadoop analytics more accessible (and SQL on Hadoop a less painful experience).

Between BI tools and frameworks like Impala, HAWQ, Tez, Big SQL, Big Data SQL, QueryGrid, Drill, or Presto, we’ve got the last mile covered. But the first miles – mounting clusters, managing and optimizing them, wrangling the data into shape, and governing the data – are still works in progress (there is some good news regarding data wrangling), as are the tools that hide the complexity and the applications that move it under the hood.

No wonder that for many enterprises, offloading ETL cycles was their first use of Hadoop. Not that there’s anything wrong with that – moving ETL off Teradata, Oracle, or DB2 can yield savings because you’ve moved low-value workloads off platforms where you pay by footprint. Those savings can pay the bill while your team defines where it wants to go next.

We couldn’t agree with Olson more – Hadoop will not make it into the enterprise as this weird, difficult, standalone platform that requires special skills. Making a new platform technology like Hadoop “disappear” isn’t new — it’s been done before with BI and Data Warehousing. In fact, Hadoop and Big Data today are at the same point where BI and data warehousing were in the 1995 – 96 timeframe.

The resemblance is uncanny. At the time, data warehouses were unfamiliar and required special skills because few organizations or practitioners had relevant experience. Furthermore, SQL relational databases were the Big Data of their day, providing common repositories for data that was theoretically liberated from application silos (well, reality proved a bit otherwise). Once tools automated ETL, query, and reporting, BI and data warehousing in essence disappeared. Data Warehouses became part of the enterprise database environment, while BI tools became routine additions to the enterprise application portfolio. Admittedly, the promise of BI and Data warehousing was never completely fulfilled as analytic dashboards for “everyman” remained elusive.

Back to the original question: have Hadoop and Big Data gone mainstream? The conference had little trouble filling up the hall and, questions about economic cycles notwithstanding, shouldn’t have issues occupying more of Javits next year. We’re optimists based on Moore’s “pragmatist pain” criterion – in some sectors, pragmatists will have little choice but to embrace the Big Data analytics that their rivals are already leveraging.

More specifically, we’re bullish in the short term and the long term, but concerned over the medium term. There’s been a lot of venture funding pouring into this space over the past year for platform players and tools providers. Some players, like Cloudera, are well past the billion-dollar valuation mark. Yet if you look at the current enterprise paid installed base for Hadoop, conservatively we’re in the 1,000 – 2,000 range (depending on how you count). Even if these numbers double or triple over the next year, will that be enough to satisfy venture backers? And what about the impacts of Vladimir Putin or Ebola on the economy over the near term?

At Strata we had some interesting conversations with members of the venture community, who indicated that the money pouring in is 10-year money. That’s a lot of faith – but then again, there’s more pain spreading around certain sectors where leaders are taking leaps to analyze torrents of data from new sources. But ingesting the data, or pointing an interactive SQL tool (or streaming or search) at it, is the easy part. When you get beyond the enterprise data walled garden, you have to wonder whether you’re looking at the right data or asking the right questions. In the long run, that will be the gating factor for how, whether, and when such data analysis becomes routine in the enterprise. And that’s what we’re going to talk about in Part 2.

We believe that self-service will be essential for enterprises to successfully embrace Big Data. We’ll tell why in our next post.

Is SQL the Gateway Drug for Hadoop?

How much difference does a year make? Last year was the point where each Hadoop vendor felt compelled to plant its stake in supporting interactive SQL: Cloudera’s Impala; Hortonworks’ Stinger (injecting steroids into Hive); IBM’s Big SQL; Pivotal’s HAWQ; MapR and Drill (or Impala, available upon request); and for good measure, Actian porting its turbocharged Vectorwise processing engine onto Hadoop.

This year, the benchmarketing has followed: Cloudera Impala clobbering the latest version of Hive in its own benchmarks, Hortonworks’ response, and Actian’s numbers with the Vectorwise engine (rebranded Vortex) now native on Hadoop supposedly trumping the others. OK, there are lies, damn lies, and benchmarks, but at least Hadoop vendors feel compelled to optimize interactive SQL performance.

As the Hadoop stack gets filled out, it also gets more complicated. In his keynote before this year’s Hadoop Summit, Gartner’s Merv Adrian made note of all the technologies and frameworks that are either filling out the Apache Hadoop project – such as YARN – or adding new choices and options, such as the various frameworks for tiering to memory or flash. Add to that the number of interactive SQL frameworks.

So where does this leave the enterprises that comprise the Hadoop market? In all likelihood, dazed and confused. All that interactive SQL is part of the problem, but it’s also part of the solution.

Yes, Big Data analytics has pumped new relevancy into the Java community, which now has something sexier than middleware to keep itself employed. It has also provided a jolt to Python, which turns out to be a very useful data manipulation language, not to mention open source R for statistical processing. And there are loads of data science programs bringing new business to higher-ed computer science departments.

But we digress.

Java, Python and R will add new blood to analytics teams. But face it, no enterprise in its right mind is going to swap out its IT staff. From our research at Ovum, we have concluded that Big Data (and Hadoop) must become first class citizens in the enterprise if they are to gain traction. Inevitably, that means SQL must be part of the mix.
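
To make that concrete, here is a minimal sketch of what the SQL path into Hadoop looks like from ordinary client code – a query sent to HiveServer2 via the PyHive library; the host, port, database, and table names are hypothetical:

```python
# A minimal sketch: querying a Hive table over HiveServer2 from client code,
# the way a database developer would, instead of writing MapReduce.
# Host, port, database, and table names are hypothetical.
from pyhive import hive  # pip install 'pyhive[hive]'

conn = hive.connect(host="hadoop-edge.example.com", port=10000,
                    database="default")
cursor = conn.cursor()

# Ordinary SQL; Hive (or a similar interactive SQL-on-Hadoop engine behind a
# comparable interface) turns it into work that runs on the cluster.
cursor.execute("""
    SELECT customer_id, COUNT(*) AS events
    FROM clickstream
    WHERE dt = '2015-02-01'
    GROUP BY customer_id
    ORDER BY events DESC
    LIMIT 10
""")

for customer_id, events in cursor.fetchall():
    print(customer_id, events)

cursor.close()
conn.close()
```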

Ironically, the great interactive SQL rollout is happening just as something potentially far more disruptive is occurring: the diversification of data platforms. Hadoop and data warehousing platforms are each adding multiple personas. As Hadoop adds interactive SQL, SQL data warehouses are adding column stores, JSON/document-style support, and MapReduce-style analytics.

But SQL is not the only new trick up Hadoop’s sleeve; there are several open source frameworks that promise to make real-time streaming analytics possible, not to mention search, and – if only the community could settle on some de facto standard language(s) and storage formats – graph. YARN, still in its early stages, offers the possibility of running multiple workloads concurrently on the same Hadoop cluster without the need to physically split it up. On the horizon are tools applying machine learning to take ETL outside the walled garden of enterprise data, not to mention BI tools that employ approaches not easily implemented in SQL, such as path analysis. Our research has found that the most common use cases for Big Data analytics are actually very familiar problems (e.g., customer experience, risk/fraud prevention, operational efficiency), but with new data and new techniques that improve visibility.

It would therefore be a waste if enterprises used Hadoop only as a cheaper ETL box or a place to offload some SQL analytics. Hopefully, SQL will become the gateway drug for enterprises to adopt Hadoop.

Cloudera’s show of numbers

The announcement of Cloudera’s new $160 million venture funding almost looked too perfectly timed. It came midway during Cloudera’s first formal dog and pony show in front of industry analysts. And we’re not just talking about the usual suspects, but a broader, more sober crowd of doubters from across the IT spectrum: app development, IT infrastructure, database, and BI, where the consensus remains that Hadoop is not a database.

Unlike Hortonworks, Cloudera has not been afraid to ruffle feathers. It dares to offer a hybrid open source/proprietary model in a market born in open source. Or, more importantly, to announce a strategic Enterprise Data Hub path that potentially places it in competition with established data warehouse providers that might otherwise be logical partners. Cloudera’s enterprise data hub positioning is ambitious, auspicious, and for now, a conceptual leap. Hadoop is not a database, and it currently lacks enterprise-grade features for performance management, SLA conformance, security, and data governance. The emphasis is on “currently,” as platform and practice are evolving rapidly; Hadoop will grow into a more robust platform that can compete for the role of hub.

There is little question that Hadoop is here to stay; Cloudera has drawn competition from Hortonworks, which positions itself as the 100% open source platform that is very OEM-friendly; MapR, whose implementation includes proprietary technology that gets the platform closer to the robustness and performance of databases; and IBM, which after a brief flirtation with Cloudera subsequently reiterated its positioning as the adult in the room. Meanwhile, Teradata, Oracle, Microsoft, and Amazon include Hadoop in their data stacks.

Cloudera hardly needed the capital as it already had $140 million in the bank. The new infusion jumps that to $300 million. More to the point, it includes a battery of firms, such as T. Rowe Price, who tend to be long-term investors, plus Michael Dell’s venture arm and Google Ventures as “strategic” backers. The company does not deny having IPO aspirations, but states that the new money gives it more flexibility on the timing.

Immediately following the announcement, we received several press queries as to whether Cloudera was for sale. In our view the most likely candidates would be Oracle (which resells Cloudera’s full platform as part of its Big Data Appliance, and just saw disappointing Q3 numbers) and newly privatized Dell. The common thread is that both are seeking engines to rekindle growth. But the $300 million in the bank inflates Cloudera’s valuation to the point that it would be a very, very expensive buy.

Nonetheless, there’s a lot of venture money floating around right now. And with Facebook’s $19 billion acquisition of a company that few ever heard of (except for hundreds of millions of casual subscribers like us who have the app but don’t use it), we have the makings of a venture capital bubble. As such, there is a flight to quality (invest in market leaders) for Tier 1 VCs. In the Big Data arena, players like Cloudera and MongoDB are perceived to be among them.

So we don’t believe that Cloudera is currently for sale. With Enterprise Data Hub, they are not claiming to replace data warehousing incumbents, but the pressure to move data storage and compute cycles onto the cheaper Hadoop platform is potentially quite threatening. (We believe that the incumbents must assert their value higher up the stack, such as with in-database analytic functionality, data governance, and query optimization.)

Whatever Cloudera’s next step (IPO or acquisition), their immediate goal is placing more facts on the ground with product and market share to raise the stakes on whatever transpires. That will inevitably include Cloudera making its own acquisitions – a skill that the company needs to learn – and likely diversification of the product line. At the analyst session, we viewed a demonstration of a Hadoop-based predictive analytics system that Cloudera uses as its nerve center for customer support; it’s a technology that could be generalized beyond Hadoop users.

$300 million in the bank may be a nice security blanket. But look at the state of adoption: Cloudera, which has had a multiyear jump in the market, counts an installed base of 10,000 – 12,000, plus or minus. But that boils down to about 350 paying subscribers (currently growing by about 40 – 50 per quarter). Any market where the leader’s paid base numbers in the hundreds is either a niche segment or a very immature one. Obviously, Hadoop is the latter, and as such, there are any number of potential disruptors that could surface on the road to mainstream adoption. For Cloudera and its rivals, it’s hardly game over.

Postscript: That $160 million was quickly dwarfed barely a week later with another $740 million infusion from Intel. Minus payments to earlier investors, we believe Cloudera netted about $500 million new funding in these couple weeks of March.