Category Archives: Data Management

Hadoop and Spark: A Tale of Two Cities

If it seems like we’ve been down this path before, well, maybe we have. June has been a month of juxtapositions, back and forth to the west coast for Hadoop and Spark Summits. The mood from last week to this has been quite contrasting. Spark Summit has the kind of canned heat that Hadoop conferences had a couple years back. We won’t stretch the Dickens metaphor.

Yeah, it’s human nature to say, down with the old and in with the new.

But let’s set something straight: Spark ain’t going to replace Hadoop, as we’re talking about apples and oranges. Spark can run on Hadoop, and it can run on other data platforms. What it might replace is MapReduce, if Spark can overcome its scaling hurdles. And it could fulfill IBM’s vision as the next analytic operating system if it addresses mundane – but very important – concerns for supporting scaling, high concurrency, and bulletproof security. Spark originated at UC Berkeley’s AMPLab back in 2009, with the founders forming Databricks. With roughly 700 contributors, Spark has ballooned into the most active open source project in the Apache community, barely two years after becoming an Apache project.

Spark is best known as a sort of in-memory analytics replacement for iterative computation frameworks like MapReduce; both employ massively parallel compute and then shuffle interim results, with the difference being that Spark caches in memory while MapReduce writes to disk. But that’s just the tip of the iceberg. Spark offers a simpler programming model, better fault tolerance, and it’s far more extensible than MapReduce. Spark handles virtually any form of iterative computation, and it was designed to support specific extensions; among the most popular are machine learning, microbatch stream processing, graph computing, and even SQL.
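To make the caching contrast concrete, here’s a minimal PySpark sketch (Spark 1.x-era API); the file path and record layout are purely illustrative. The point is that the parsed data set is pinned in memory and reused across two different passes, where an equivalent MapReduce pipeline would write and reread interim results on disk between jobs.

```python
from pyspark import SparkContext

sc = SparkContext(appName="cache-demo")  # cluster settings omitted for brevity

# Parse raw events once; the path and comma-delimited layout are hypothetical.
events = (sc.textFile("hdfs:///data/events/*.log")
            .map(lambda line: line.split(","))
            .filter(lambda fields: len(fields) == 3))

# Pin the parsed RDD in memory so later passes skip the disk entirely.
events.cache()

# Pass 1: count events per user (a classic MapReduce-style aggregation).
per_user = events.map(lambda f: (f[0], 1)).reduceByKey(lambda a, b: a + b)

# Pass 2: reuse the same cached data for a different question; with MapReduce,
# this second pass would mean another full read of interim results from disk.
error_count = events.filter(lambda f: f[2] == "ERROR").count()

print(per_user.take(10), error_count)
```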

By contrast, Hadoop is a data platform. It is one of many that can run Spark, because Spark is platform-independent. So you could also run Spark on Cassandra, another NoSQL data store, or SQL databases, but Hadoop is the most popular target right now.

And let’s not forget Apache Mesos, another AMPLab creation for cluster management, with which Spark was originally closely associated.

There’s little question about the excitement level over Spark. By now the headlines have poured out over IBM investing $300 million, committing 3500 developers, establishing a Spark open source development center a few BART stops from AMPLab in San Francisco, and aiming directly and through partners to educate 1 million professionals on Spark in the next few years (or about 4 – 5x the current number registered for IBM’s online Big Data university). IBM views Spark’s strength as machine learning, and wants to make machine learning a declarative programming experience that will follow in SQL’s footsteps with its new SystemML language (which it plans to open source).

That’s not to overshadow Databricks’ announcement that its Spark developer cloud, in preview over the past year, has now gone GA. The big challenge facing Databricks was making its cloud scalable and sufficiently elastic to meet the demand – and not become a victim of its own success. And there is the growing number of vendors that are embedding Spark within their analytic tools, streaming products, and development tools. The Spark 1.4 release brings new manageability, including the capability to automatically renew Kerberos tokens for long-running processes like streaming. But there remain growing pains, like reducing the number of moving parts needed to make Spark a first-class citizen with Hadoop YARN.

By contrast, last week was about Hadoop becoming more manageable and more amenable to enterprise infrastructure, like shared storage, as our colleague Merv Adrian pointed out. Not to mention enduring adolescent factional turf wars.

It’s easy to get excited by the idealism around the shiny new thing. While the sky seems the limit, the reality is that there’s lots of blocking and tackling ahead. And there’s a need to engage not only developers but business stakeholders, through applications rather than development tools, and through success stories with tangible results. It’s a stage that the Hadoop community is just starting to embrace now.

MongoDB widens its sights

MongoDB has passed several key watershed events over the past year, including a major redesign of its core platform and a strategic shift in its management team. By now, the architectural transition is relatively old news; as we noted last winter, MongoDB 3.0 made the storage engine pluggable. So voila! Just like MySQL before it, Mongo becomes whatever you want it to be. Well eventually, anyway, but today there’s the option of substituting the more write-friendly WiredTiger engine, and in the near future, an in-memory engine now in preview could provide an even faster write-ahead cache to complement the new overcaffeinated tiger. And there are likely other engines to come.

From a platform – and market – standpoint, the core theme is Mongo broadening its aim. Initially, it will be through new storage engines that allow Mongo to be whatever you make of it. MongoDB has started the fray with WiredTiger and the new in-memory data store, but with the publishing of the API, there are opportunities for other engines to plug in. At MongoDB’s user conference, we saw one such result – the RocksDB engine developed at Facebook for extremely I/O-intensive transactions involving log data. And as we’ve speculated, there’s nothing to stop other storage engines, even SQL ones, from plugging in.

Letting a thousand flowers bloom
Analytics is an example where Mongo is spreading its focus. While Mongo and other NoSQL data stores are typically used for operational applications requiring fast reads and/or writes, for operational simplicity there is also growing demand for in-line analytics. Why move data to a separate data warehouse, data mart, or Hadoop cluster if it can be avoided? And why not embed some analytics with your operational applications? This is hardly an outlier – a key selling point for the latest generations of Oracle and SAP applications is the ability to embed analytics with transaction processing. Analytics evolves from an after-the-fact exercise to an inline process that is part of processing a transaction. Any real-time customer-facing or operational process is ripe for analytics that can prompt inline decisions for providing next-best offers or tweaking the operation of an industrial process, supply chain, or the delivery of a service. And so a growing number of MongoDB deployments are adding analytics to the mix.
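To illustrate what in-line analytics can look like on an operational store, here’s a hedged PyMongo sketch – the orders collection and its fields are hypothetical – that computes per-customer spend inside MongoDB’s aggregation pipeline instead of shipping the data off to a warehouse or data mart first.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.shop  # hypothetical operational database

# Aggregate in place: total and average order value per customer,
# computed by the database rather than an external data mart.
pipeline = [
    {"$match": {"status": "complete"}},
    {"$group": {
        "_id": "$customer_id",
        "total_spend": {"$sum": "$amount"},
        "avg_order": {"$avg": "$amount"},
        "orders": {"$sum": 1},
    }},
    {"$sort": {"total_spend": -1}},
    {"$limit": 10},
]

for row in db.orders.aggregate(pipeline):
    print(row)
```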

It’s almost a no-brainer for SQL BI tools to target JSON data per se because the data has a structure. (Admittedly, this assumes the data is relatively clean, which in many cases is not a given.) But by nature, JSON has a more complex and potentially richer structure than SQL tables to the degree that the data is nested. Yet most SQL tools do away with the nesting and hierarchies that are stored in JSON documents, “flattening” the structure into a single column.
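A quick, illustrative sketch of what gets lost: the nested document below carries a one-to-many relationship (an order with several line items) that a flattened single column can’t express, while an aggregation that unwinds the array keeps the hierarchy queryable. The collection and fields are invented for the example.

```python
from pymongo import MongoClient

db = MongoClient().catalog  # hypothetical database

# A nested JSON document: one order, many line items.
db.orders.insert_one({
    "order_id": 1001,
    "customer": {"name": "Acme Corp", "segment": "enterprise"},
    "items": [
        {"sku": "A-17", "qty": 2, "price": 40.0},
        {"sku": "B-02", "qty": 1, "price": 15.5},
    ],
})

# Naive flattening collapses the items array into one opaque column.
doc = db.orders.find_one({"order_id": 1001})
flattened = {"order_id": doc["order_id"], "items": str(doc["items"])}

# Querying the nesting natively keeps the hierarchy usable:
# unwind the array and aggregate revenue per SKU.
pipeline = [
    {"$unwind": "$items"},
    {"$group": {"_id": "$items.sku",
                "revenue": {"$sum": {"$multiply": ["$items.qty", "$items.price"]}}}},
]
print(flattened)
print(list(db.orders.aggregate(pipeline)))
```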

We’ve always wondered when analytic tools would wake up to the potential of querying JSON natively – that is, not flattening the structure, but incorporating that information when processing the query. The upcoming MongoDB 3.2 release will add a new connector to BI and visualization tools that will push down analytic processing into MongoDB, rather than require data to be extracted first to populate an external data mart or data warehouse for the analytic tool to target. But this enhancement is not so much about enriching the query with information pertaining to the JSON schema; it’s more about efficiency, eliminating data transport.

But some emerging startups are looking to address that JSON-native query gap. jSonar demonstrated SonarW, a columnar data warehouse engine that plugs into the Mongo API, with a key difference: it provides metadata that preserves a logical representation of the nested and hierarchical relationships. We saw a reporting tool from Slamdata that applies similar context to the data, based on patent-pending algorithms that apply relational algebra to slicing, dicing, and aggregating deeply nested data.

Who says JSON data has to be dirty?
A key advantage of NoSQL data stores like Mongo is that you don’t have to worry about applying strict schema or validation (e.g., ensuring that the database isn’t sparse and that the data in the fields is not gibberish). But there’s nothing inherent to JSON that rules out validation and robust data typing. MongoDB will be introducing a tool supporting schema validation for those use cases that demand it, plus a tool for visualizing the schema to provide a rough indication of unique fields and unique data (e.g., cardinality) within those fields. While maybe not a full-blown data profiling capability, it is a start.
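Since the validation tooling hadn’t shipped at the time of writing, the following is only a sketch of what opt-in validation could look like, modeled on the query-expression style of validator MongoDB has discussed; the collection name, rules, and server support are assumptions.

```python
from pymongo import MongoClient
from pymongo.errors import WriteError

db = MongoClient().crm  # hypothetical database

# Opt-in validation: documents must carry a string email and a plausible age.
# (Assumes a server version that honors a 'validator' option on create.)
db.create_collection("contacts", validator={
    "email": {"$exists": True, "$type": 2},   # BSON type 2 = string
    "age": {"$gte": 0, "$lte": 150},
})

db.contacts.insert_one({"email": "a@example.com", "age": 34})   # accepted

try:
    db.contacts.insert_one({"email": 42, "age": 34})            # rejected
except WriteError as err:
    print("validation failed:", err)
```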

Breaking the glass ceiling
The script for MongoDB has been familiar up ‘til now: the entrepreneurial startup whose product has grown popular through grassroots appeal. The natural trajectory for MongoDB is to start engaging the C-level and the business, who write larger checks. A decade ago, MySQL played this role. It was kind of an Oracle or SQL Server Lite that was less complex than its enterprise cousins. That’s been very much MongoDB’s appeal. But by making the platform more extensible, MongoDB creates a technology path to grow up. Can the business grow with it?

Over the past year MongoDB’s upper management team has largely been replaced; the CEO, CMO, and head of sales are new. It’s the classic story of startup visionaries, followed by those experienced at building the business. President and CEO Dev Ittycheria, most recently from the venture community, previously took BladeLogic public before eventually selling to BMC for $900 million in 2008. Its heads of sales and marketing come from similar backgrounds with long track records. While MongoDB is clearly not slacking off on product development, it is plowing much of its capitalization into building out the go-to-market.

The key challenge facing Mongo, and all the new data platform players, is where (or whether) they will break the proverbial glass ceiling. There are several perspectives to this challenge. For open source players like MongoDB, it is determining where the value-add lies. It’s a moving target; functions that make a data store enterprise-grade, such as data governance, management, and security, were traditionally unique to the vendor and platform, but open source is eating away at that. Just look at the Hadoop world, where there’s Ambari, while Cloudera and IBM offer their own tooling either as the core or as an optional replacement. So this dilemma is hardly unique to MongoDB. Our take is that a lowest-common-denominator approach cannot be applied to governance, security, or management, but it will become a case where platform players like MongoDB must branch out and offer related value-add such as optimizations for cloud deployment, information lifecycle management, and so on.

Such a strategy of broadening the value-add grows even more important given market expectations for pricing; in essence, coping with the “I’m not going to pay a lot for this muffler” syndrome. The expectation with open source and other emerging platforms is that enterprises are not willing, or lack the budget, to pay the types of licenses customary with established databases and data warehouse systems. Yes, the land-and-expand model is critical for the likes of MongoDB, Cloudera, Hortonworks and others for growing revenues. They may not replace the Oracles or Microsofts of the world, but they are angling to be the favorite for new-generation applications supplementing what’s already on the back end (e.g., customer experience, enhancing and working alongside classical CRM).

Land and expand into the enterprise, and broadening from data platform to data management are familiar scripts. Even in an open source, commodity platform world, these scripts will remain as important as ever for MongoDB.

Hortonworks evens the score

Further proof that Hadoop competition is going up the stack toward areas such as packaged analytics, security, and data management and integration can be seen in Hortonworks’ latest series of announcements today – a refresh of the Hortonworks Data Platform with Ambari 2.0 and the acquisition of cloud deployment automation tool SequenceIQ.

Specifically, Ambari 2.0 provides much of the automation previously missing, such as automating rolling updates, restarts, Kerberos authentications, alerting and health checks, and so on. Until now, automation of deployment, monitoring and alerting, root cause diagnosis, and authentication was a key differentiator for Cloudera Manager. While Hadoop systems management may not be a done deal (e.g., updating to major new dot-zero releases is not yet a lights-out operation), the basic blocking and tackling is no longer a differentiator; any platform should have these capabilities. The recent debut of the Open Data Platform – where IBM and Pivotal are leveraging the core Hortonworks platform as the starting point for their Hadoop distributions – is further evidence. Ambari is the cornerstone of all those implementations, although IBM will still offer more “premium” value-add with options such as Platform Symphony and Adaptive MapReduce.
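For a flavor of what that blocking and tackling looks like programmatically, here’s a hedged sketch that polls Ambari’s REST API for service state. The host, credentials, and response fields are assumptions patterned on Ambari’s documented /api/v1 style rather than anything specific to the 2.0 release.

```python
import requests

AMBARI = "http://ambari-host:8080/api/v1"   # hypothetical host and port
AUTH = ("admin", "admin")                    # default credentials; change in production
HEADERS = {"X-Requested-By": "ambari"}       # header Ambari expects on API calls

# List the clusters this Ambari instance manages.
clusters = requests.get(AMBARI + "/clusters", auth=AUTH, headers=HEADERS).json()
cluster = clusters["items"][0]["Clusters"]["cluster_name"]

# Pull the state of each service (STARTED, INSTALLED, etc.) as a crude health check.
resp = requests.get(
    AMBARI + "/clusters/" + cluster + "/services",
    params={"fields": "ServiceInfo/state"},
    auth=AUTH,
    headers=HEADERS,
).json()

for svc in resp["items"]:
    info = svc["ServiceInfo"]
    print(info["service_name"], info["state"])
```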

Likewise, Hortonworks’ acquisition of SequenceIQ is a similar move to even the score with Cloudera Director. Both handle automation of cloud deployment with policy-based elastic scaling (e.g., when to provision or kill compute nodes). The comparison may not yet be apples-to-apples; for instance, Cloudera Director has been a part of the Cloudera enterprise platform (the paid edition) since last fall, whereas the ink is just drying on the Hortonworks acquisition of SequenceIQ. And while SequenceIQ’s product, Cloudbreak, is cloud infrastructure-agnostic and Cloudera Director right now only supports Amazon, that too will change.

More to the point is where competition is heading – we believe that it is heading from the core platform higher up the value chain to analytic capabilities and all forms of data management – stewardship, governance, and integration. In short, it’s a page out of the playbook of established data warehousing platforms that have had to provide value-add that could be embedded inside the database. Just take a look at Cloudera’s latest announcements: acquisition of Xplain and a strategic investment in Cask. Xplain automates the design, integration, and optimization of data models to reduce or eliminate hurdles to conducting self-service analytics on Hadoop. Cask on the other hand provides hooks for developers to integrate applications with Hadoop – the third way that until now has been overlooked.

As Hadoop graduates from a specialized platform for complex, data science computing to an enterprise data lake, the blocking and tackling functions – e.g., systems management and housekeeping – become checklist items. What’s more important is how to manage data, make data and analytics accessible beyond data scientists and statistical programming experts, and provide the security that is expected of any enterprise-grade platform.

Spark Summit debrief: Relax, the growing pains are mundane

As the most active project (by number of committers) in the Apache Hadoop open source community, it’s not surprising that Spark has drawn much excitement and expectation. At the core, there are several key elements to Spark’s appeal:
1. It provides a much simpler and more resilient programming model compared to MapReduce – for instance, it can restart failed nodes in process rather than requiring the entire run to be restarted from scratch.
2. It takes advantage of DRAM memory, significantly accelerating compute jobs – and because of the speed, allowing more complex, chained computations to run (which could be quite useful for simulations or orchestrated computations based on if/then logic).
3. It is extensible. Spark provides a unified computing model that lets you mix and match complex iterative MapReduce-style computation with SQL, streaming, machine learning and other processes on the same node, with the same data, on the same cluster, without having to invoke separate programs. It’s akin to what Teradata is doing with the SNAP framework to differentiate its proprietary Aster platform.

Mike Olson, among others, has termed Spark “The leading candidate for ‘successor to MapReduce’.” How’s that for setting modest expectations?

So we were quite pleased to see Spark Summit making it to New York and have the chance to get immersed in the discussion.

Last fall, Databricks, whose founders created Spark from their work at UC Berkeley’s AMPLab, announced their first commercial product – a Spark Platform-as-a-Service (PaaS) cloud for developing Spark programs. We view the Databricks Cloud as a learning tool and incubator for developers to get up to speed on Spark without having to worry about marshaling compute clusters. The question on everybody’s minds at the conference was when the Databricks Cloud would go GA. The answer, like everything Spark, is about dealing with scalability – in this case, being capable of handling highly concurrent, highly spiky workloads. The latest word is later this year.

The trials and tribulations of the Databricks Cloud are quite typical for Spark – it’s dealing with scale, whether that be in numbers of users (concurrency) or data (when the data sets get too big for memory and must spill to disk). At a meetup last summer where we heard a trip report from the previous Spark Summit, the key pain point cited was having a more graceful spilling to disk.

Memory-resident compute frameworks of course are nothing new. SAS for instance has its LASR Server, which it contends is far more robust in dealing with concurrency and compute-intensive workloads. But, as SAS’s core business is analytics, we expect that they will meet Spark halfway to appeal to Spark developers.

While Spark is thought of as a potential replacement for MapReduce, in actuality we believe that MapReduce will be as dead as the mainframe. While DRAM memory is, in the long run, getting cheaper, it will never be as cheap as disk. And while ideally, you shouldn’t have to comb through petabytes of data on a routine basis (that’s part of defining your query and identifying the data sets), there are going to be analytic problems involving data sets that won’t completely fit in memory. Not to mention that not all computations (e.g., anything that requires developing a comprehensive model) will be suited for real-time or interactive computation. Not surprisingly, most of the use cases that we came across at Spark Summit were more about “medium data,” such as curating data feeds, real-time fraud detection, or heat maps of NYC taxi cab activity.

While dealing with scaling is part of the Spark roadmap, so is making it more accessible. At this stage, the focus is on developers, through APIs to popular statistical computation languages such as Python or R, and with frameworks such as Spark SQL and Spark DataFrames.

On one hand, with Hadoop and NoSQL platform providers competing with their own interactive SQL frameworks, the question is why the world needs another SQL framework. In actuality, Spark SQL doesn’t compete with Impala, Tez, BigSQL, Drill, Presto or whatever. First, it’s not only about SQL, but about querying data with any kind of explicit schema. The use case for Spark SQL is running SQL programs in line with other computations, such as chaining SQL queries to streaming or machine learning runs. As for DataFrames, Databricks is simply adapting the distributed DataFrame technology already implemented in languages such as Java, Python, and R to access data sets that are organized as tables with columns containing typed data.
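Here’s a small, hedged sketch of that “SQL in line with other computations” pattern, using the Spark 1.4-era Python API with hypothetical paths and fields: register JSON data as a table, query it with SQL, then keep working on the result as a DataFrame in the same job.

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="sql-inline")
sqlContext = SQLContext(sc)

# JSON with enough explicit structure to infer a schema; the path is hypothetical.
events = sqlContext.read.json("hdfs:///data/clickstream/*.json")
events.registerTempTable("events")

# SQL step: sessions per user over the raw feed.
sessions = sqlContext.sql("""
    SELECT user_id, COUNT(DISTINCT session_id) AS sessions
    FROM events
    GROUP BY user_id
""")

# Same job, same data: keep going with DataFrame operations -- no separate engine,
# and the result could feed straight into an MLlib or streaming step from here.
heavy_users = sessions.filter(sessions.sessions > 10) \
                      .orderBy(sessions.sessions, ascending=False)
heavy_users.show(5)
```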

Spark’s extensibility is both blessing and curse. Blessing in that the framework can run a wide variety of workloads, but curse in that developers can drown in abundance. One of the speakers at Summit called for package management so developers won’t stumble over their expanding array of Spark libraries and wind up reinventing the wheel.

Making Spark more accessible to developers is a logical step in growing the skills base. But ultimately, for Spark to have an impact with enterprises, it must be embraced by applications. In those scenarios, the end user doesn’t care what process is used under the hood. There are a few applications and tools, like ClearStory Data for curating data feeds, or ZoomData, an emerging Big Data BI tool that has some unique IP (likely to stay proprietary) for handling scale and concurrency.

There’s no shortage of excitement and hype around Spark. The teething issues (e.g., scalability, concurrency, package management) are rather mundane. The hype – that Spark will replace MapReduce – is ahead of the reality; as we’ve previously noted, there’s a place for in-memory computing, but it won’t replace all workloads or make disk-based databases obsolete. And while Spark hardly has a monopoly on in-memory computing, the accessibility and economics of an open source framework on commodity hardware open up lots of possibilities for drawing a skills base and new forms of analytics. But let’s not get too far ahead of ourselves first.

IBM and Twitter: Another piece of the analytics puzzle

Roughly 20 years ago, IBM faced a major fork in the road from the hardware-centric model that defined the computer industry from the days of Grace Hopper. It embraced a services-heavy model that leveraged IBM’s knowledge of how and where enterprises managed their information in an era when many were about to undergo drastic replatforming in the wake of Y2K.

Today it’s about the replatforming, not of IT infrastructure necessarily, but of the business in the face of the need to connect in an increasingly mobile and things-connected world. And so IBM is in the midst of a reinvention, trying to embrace all things mobile, all things data, and all things connected. A key pillar of this strategy has been IBM’s mounting investment in Watson, where it has aggressively recruited and incubated partners to flesh out a new path of business solutions based on cognitive computing. On the horizon, we’ll be focusing our attention on a new path of insight: exploratory analytics, an area that is enabled by the next generation of business intelligence tools – Watson Analytics among them.

Which brings us to last fall’s announcement that IBM and Twitter would form a strategic partnership to develop real-time business solutions. As IBM has been seeking to reinvent itself, Twitter has been seeking to invent itself as a profitable business that can monetize its data in a manner that maintains trust among its members – yours truly among them. Twitter’s key value proposition is the immediacy of its data. While it may lack the richness and depth of content-heavy social networks like Facebook, it is, in essence, the world’s heartbeat. A ticker feed that is about, not financial markets, but the world.

When something happens, you might post on Facebook; within minutes or hours, blogs and news feeds may populate headlines. But for real-time immediacy, nothing beats the ease and simplicity of 140 characters. Uniquely, Twitter is sort of a hybrid between a consumer-oriented social network like Facebook and a professional one like LinkedIn. There is an immediacy and uniqueness to the data feed that Twitter provides. With its acquisition last year of partner Gnip (which already had commercial relationships with enterprise software providers like SAP), Twitter now has a direct pipeline for mounting the enterprise value chain.

So far, so good, but what has IBM done to build a real business out of all this? A few months in, IBM is on a publicity offensive to show there is real business here. It is partway to a goal of cross-training over 10,000 of its 140,000 GBS consultants on Twitter solutions. IBM has already signed a handful of reference customer deals, and is disclosing some of the real-world use cases that are the focus of actual engagements.

Meanwhile, Twitter has been on a heavily publicized path to monetize the data that it has – which is a unique real-time pulse of what’s happening in the world. Twitter certainly has had its spate of challenges here. It sits on a data stream that is rich with currency, but lacking the depth that social networks like Facebook offer in abundance. Nonetheless, Twitter is unique in that it provides a ticker feed of what’s happening in the world. That was what was behind the announcement last fall that Twitter would become a strategic partner with IBM – to help Twitter monetize its data and for IBM to generate unique real-time business solutions.

Roughly six months into the partnership, IBM has taken the offensive to demonstrate that the new partnership is generating real business and tangible use cases. We sat down for some off the record discussions with IBM, Twitter, and several customers and prospects ahead of today’s announcements.

The obvious low-hanging fruit is customer experience. As we wrote this in midflight, before boarding we had a Twitter exchange with United regarding whether we’d be put on another flight if our plane – delayed for a couple hours with software trouble (yes… software) – was going to get cancelled (the story had a happy ending). Businesses are already using Twitter – that’s not the question. Instead, it’s whether there are other analytics-driven use cases – sorta like the type of thing we used to talk about with CEP, but real and not theoretical.

We had some background conversations with IBM last week ahead of today’s announcements. They told us of some engagements that they’ve booked during the first few months of the Twitter initiative. What’s remarkable is they are very familiar use cases, where Twitter adds another verifying data point.

An obvious case is mobile carriers – this being the beachfront real estate of telco. As mobile embeds itself in our lives, there is more at stake for carriers who fear churn, and even more so, the reputational damage that can come when defecting customers cry out about bad service publicly over social media. Telcos already have real-time data; they have connection data from their operational systems, and because this is mobile, location data as well. What’s kind of interesting to us is IBM’s assertion that what’s less understood is the relationship between tweets and churn – as we already use Twitter, we thought those truths were self-evident. You have a crappy connection, the mobile carrier has the data on what calls, texts, or web access were dropped, and if the telco already knows its customers’ Twitter handles, it should be as plain as day what the relationship is between tweets and potential churn events. IBM’s case here was that integrating Twitter with data that was already available – connections, weather, cell tower traffic, etc. – helped connect the dots. IBM makes the claim that correlating Twitter with weather data alone could improve the accuracy of telco churn models by 5.
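As a toy illustration of that connect-the-dots idea – the data frames, columns, and thresholds below are entirely hypothetical – here’s a pandas sketch that joins dropped-connection counts with complaint tweets by customer handle and flags likely churn risks.

```python
import pandas as pd

# Hypothetical operational data: dropped calls/texts per customer handle.
connections = pd.DataFrame({
    "handle": ["@alice", "@bob", "@carol"],
    "dropped_events": [14, 2, 9],
})

# Hypothetical Twitter feed, already scored for sentiment (-1 negative .. 1 positive).
tweets = pd.DataFrame({
    "handle": ["@alice", "@alice", "@carol"],
    "sentiment": [-0.8, -0.6, -0.2],
})

# Average tweet sentiment per customer, join it to the connection data,
# and flag customers showing both bad service and public complaints.
signals = (tweets.groupby("handle", as_index=False)["sentiment"].mean()
                 .merge(connections, on="handle", how="right")
                 .fillna({"sentiment": 0.0}))

signals["churn_risk"] = (signals["dropped_events"] > 5) & (signals["sentiment"] < -0.5)
print(signals)
```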

Another example drawn from early engagements is employee turnover. Now, unless an employee has gotten to the point where they’d rather take this job and shove it, you’d think that putting your gripes out over the Twitter feed would be a career-limiting move. But the approach here was more indirect: look at consumer businesses and correlate customer Twitter activity with locations where employee morale is sagging, or look at the Twitter data to deduce that staff loyalty was flagging.

A more obvious use case was in the fashion industry. IBM is adapting another technology from its labs – psycholinguistic analysis (a.k.a., what are you really saying?) – to conduct a more nuanced form of sentiment analysis of your tweets. For this engagement, a fashion industry firm employed this analysis to gain more insight into why different products sold or not.

Integrating Twitter is just another piece of the puzzle when trying to decipher signals from the market. It’s not a case of blazing new trails; indeed, sentiment analysis has become a well-established discipline for consumer marketers. The data from Twitter is crying out to be added to the mix of feeds used for piecing together the big picture. IBM’s alliance with Twitter is notable in that both are putting real skin in the game for productizing the insights that can be gained from Twitter feeds.

It’s not a criticism to say this, but incorporating Twitter is evolutionary, not revolutionary. That’s true for most big data analytics – we’re just expanding and deepening the window to solve very familiar problems. The data is out there – we might as well use it.

Strata 2015 post mortem: Does the Hadoop market have a glass ceiling?

The move of this year’s west coast Strata HadoopWorld conference to the same venue as Hadoop Summit gave the event a bit of a mano a mano air: who can throw the bigger, louder party?

But show business dynamics aside, the net takeaway from these events is looking at milestones in the development of the ecosystem. Indeed, the bulk of our time was spent “speed dating” with third-party tools and applications that are steadily addressing the white space in the Big Data and Hadoop markets. While our sampling is hardly representative, we saw growth, not only from the usual suspects from the data warehousing world, but also from a growing population of vendors who are aiming to package machine learning algorithms, real-time streaming, and more granular data security, along with new domains such as entity analytics. Databricks, inventor of Spark, announced in a keynote a new DataFrames initiative to make it easier for R and Python programmers accustomed to working on laptops to commandeer and configure clusters to run their computations using Spark.

Immediately preceding the festivities, the Open Data Platform initiative announced its existence, and Cloudera announced its $100 million 2014 numbers – ground we already covered. After the event, Hortonworks did its first quarterly financial call. Depending on how you count, they did nearly $50 million in business last year; but the billings, which signify the pipeline, came in at $87 million. Hortonworks closed an impressive 99 new customers in Q4. There’s little question that Hortonworks has momentum, but right now, so does everybody. We’re at a stage in the market where a rising tide is lifting all boats; even the smallest Hadoop player – Pivotal – grew from token revenues to our estimate of $20 million in Hadoop sales last year.

At this point, there’s nowhere for the Hadoop market to go but up, as we estimate that the paid enterprise installed base (at roughly 1200 – 1500 customers) is just a fraction of the potential base. Or in revenues, our estimate of $325 million for 2014 (Hadoop subscriptions and related professional services, but not including third-party software or services), up against $10 billion+ for the database market. Given that Hadoop is just a speck compared to the overall database market, what is the realistic addressable market?

Keep in mind that while Hadoop may shift some data warehouse workloads, the real picture is not necessarily a zero sum game, but the commoditization of the core database business. Exhibit One: Oracle’s recent X5 engineered systems announcement designed to meet Cisco UCS at its commodity price point. Yes, there will be some contention, as databases are converging and overlapping, competing for many of the same use cases.

But the likely outcome is that organizations will use more data platforms and grow accustomed to paying more commodity prices – whether that is through open source subscriptions or cloud pay-by-the-drink (or both). The value-add increasingly will come from housekeeping tools (e.g., data security; access control and authentication; data lineage and audit for compliance; cluster performance management and optimization; lifecycle and job management; query management and optimization in a heterogeneous environment).

The takeaway here is that the tasks normally associated with the care and feeding of a database, not to mention the governance of data, grow far more complex when superseding traditional enterprise data with Big Data. So the Hadoop subscription business may only grow so far, but that will be just the tip of the iceberg regarding the ultimate addressable market.

The Open Data Platform is and is not like UNIX, Cloudera cracks $100m, and what becomes of Pivotal

How’s that for a mouthful?

It shouldn’t be surprising that the run-up to Strata is full of announcements designed to shape mindsets. And so today, we have a trio of announcements that solve – for now – the issue of whether Pivotal is still in the Hadoop business (or at least with its own distro); verify that Cloudera did make $100m last year; and announce formation of a cross-industry initiative, the Open Data Platform.

First, we’ll get our thoughts on Cloudera and Pivotal out of the way. Cloudera’s announcement didn’t surprise us; we’ve estimated that they were on their way to a $100m year given our estimates of typical $250k deal sizes (outliers go a lot higher than that), a new customer run rate that we pegged at about 50 per quarter, and of course subscription renewals that inflate as customers grow their deployments. To put that in perspective, we’re still in a greenfield market where a rising tide is lifting all boats; we estimate that business is also doubling for most of Cloudera’s rivals – but Cloudera has had an obvious head start.

As to Pivotal, they’ve been the subject of much FUD in the wake of published reports last fall of a layoff of 60 employees on the Big Data side of their business. Word on the street was that Pivotal, the last to enter the Hadoop distribution business, would be the first to leave – with Hortonworks the logical candidate to pick up the pieces, as Pivotal disclosed last summer that it would replace its Command Center with the Hortonworks-led Ambari project for Hadoop cluster management.

The news is that Pivotal is making a final break from its proprietary technology legacy and open sourcing everything – including the Greenplum database. And yes, Pivotal will OEM and support HDP, but it will still offer its own distribution optimized for HAWQ and for integration with its other data engines, including the GemFire in-memory database. This announcement didn’t happen in a vacuum, but in conjunction with another announcement today of the Open Data Platform – of which Pivotal and Hortonworks (along with IBM and others) are members. We’re frankly puzzled as to why Pivotal would continue offering its own distribution. But we’ll get back to that.

The Open Data Platform is an initiative that tries to put the toothpaste back into the tube: define, integrate, test, and certify a standard Hadoop core. Once upon a time, Apache Hadoop could be defined by core projects, like what was on the Apache project home page. But since then there have been multiple overlapping and often competing projects for running interactive SQL (do we use Hive or bypass it?); cluster management (Ambari or various vendor-proprietary management systems); managing security; managing resources (YARN for everything, or just batch jobs, and what about Mesos?); streaming (Storm or Spark Streaming); and so on. When even the core file system HDFS may not be in every distro, the question of what makes Hadoop, Hadoop remains key.

Of course, ODP is not just about defining core Hadoop, but designating, in effect, a stable base on which value-added features or third-party software can reliably hook in. It picks up where the Apache community, which simply designates what releases are stable, leaves off, by providing a formal certification base. That’s the type of thing that vendor consortia rather than open source communities are best equipped to deliver. For the record, ODP pledges to work alongside Apache.

So far so good, except that this initiative comprises only half the global Hadoop vendor base. This is where the historical analogies with UNIX come in; recall the Open Software Foundation, which was everybody vs. the industry leader Sun? It repeats the dynamic of the community vs. the market leaders – for now, the Cloudera and Amazon customer bases will outnumber ODP committers.

Over time OSF UNIXes remained overshadowed by Solaris, but eventually everybody turned their attention to dealing with Microsoft. After laying down arms, OSF morphed into The Open Group, which refocused on enterprise architecture frameworks and best practices.

The comparison between ODP and OSF is only in the competitive dynamics. Otherwise, UNIX and Hadoop are different creatures. While both are commodity technologies, Hadoop is a destination product that enterprises buy, whereas UNIX (and Linux) are foundational components that are built into the purchase of servers and appliances. Don’t get confused by those who characterize Hadoop as a data operating system, as enterprises are increasingly demanding capabilities like security, manageability, configurability, and recovery that are expected of any data platform that they would buy.

And further, where the narrative differs is that Hadoop, unlike UNIX, lacks a common enemy. Hadoop will exist alongside, not instead of other database platforms as they eventually meld into a fabric where workloads are apportioned. So we don’t necessarily expect history to repeat itself with Open Data Platform. The contribution of ODP will be the expectation of a non-moving target that becomes a consensus, although not an absolutely common one. It’s also the realization that value-add in Hadoop increasingly comes, not from the core, but from the analytics that run on it and the connective glue that the platform provider supplies.

As for Pivotal and what it’s still doing in the Hadoop business, our expectation is that ODP provides the umbrella under which its native distribution converges and becomes a de facto dialect of HDP. We believe that Pivotal’s value-add won’t be in the Hadoop distribution business, but how it integrates GemFire and optimizes implementation for its Cloud Foundry Platform-as-a-Service cloud.

Postscript: No good deed goes unpunished. Here’s Mike Olson’s take.

Making Yin and Yang of YARN and Mesos

YARN has drawn considerable spotlight as the resource scheduler allowing Hadoop 2.x to finally transcend its MapReduce roots. The strength and weakness of YARN are those MapReduce roots – meaning there is backward compatibility to managing the MapReduce workloads that dominated Hadoop, but also limitations for running ongoing workloads because of its job-oriented batch origins. By contrast, Apache Mesos is an open source project that has existed for some time, providing resource management for scale-out clusters of all kinds – not just Hadoop. It is well suited for dynamic management of continuous (ongoing) workloads.

While a bit dated, this 2011 Quora posting provides a good point by point comparison of YARN’s and Mesos’ strengths and shortcomings. Although not directly comparable, until now both have been considered rival approaches.

A new project – Myriad – proposes to bring them together. Pending Apache incubation status, it would superimpose Mesos as the top level dynamic juggler of resources, while YARN sticks to its knitting and schedules them. In essence, it would make YARN elastic. MapR, which is staking new ground as a participant rather than consumer of Apache projects, is joining with Mesosphere and eBay to drive the project with plans to submit to Apache for incubation.

Myriad is not the only game in town. Slider, a project led by Hortonworks, is taking the reverse approach. Instead of Mesos dynamically allocating containers (resources) to YARN, Slider works as a helper to YARN for dynamically requesting new resources when a YARN container fails.

Myriad vs. Slider typifies the emerging reality for Hadoop; when issues arise in the Hadoop platform, chances are there will be competing remedies vying for adoption.

Hadoop: The Third Way

Working with Hadoop has been kind of a throwback. Until recently, Hadoop was synonymous with MapReduce programming, meaning that when you worked with Hadoop, it seemed that you were working with a newfangled mainframe. As if client/server never happened.

With the emergence of, and heavy competition between, the various interactive SQL frameworks (e.g., Impala, Tez, Presto, Drill, BigSQL, Big Data SQL, QueryGrid, Spark SQL), a second path emerged for database developers. So the Hadoop mainframe became a client/server machine. As if n-tier never happened.

The need for speed made n-tier happen – due to the need to bypass the bottleneck of database I/O and the overhead of large, monolithic applications. And so the application server platform was born, and with it, ways to abstract functions such as integration, security, and transaction management so they could operate as modular parts with whatever application or database. Or to prevent abandoned online shopping carts, so a transaction can be executed without being held hostage to ensuring ACID compliance. Internet-based applications were now being developed on WebSphere, WebLogic, JBoss, and more recently, more compact open source alternatives like Apache Tomcat.

With Hadoop, though, we’re still in the era of the mainframe or client/server. But with the 2.x generation, where resource management has been taken out of MapReduce, the way has been cleared to make Hadoop more of a multi-purpose platform. While interactive SQL was the first shot, new frameworks supporting streaming (Storm, Spark Streaming), machine learning (Spark), and search (Solr) are among some of the new additions to the palette.

But at this point, we’re still looking at Hadoop as either a mainframe or two-tier system. Developers write MapReduce or Spark programs, or BI/query tools access HDFS with or without Hive. There’s nothing available to write data-driven programs, such as real-time user scoring or intrusion detection.

Nearly four years ago, a startup with a weird name – Continuuity – emerged to become in its own terms “the JBoss for Hadoop.” The goal was building a data fabric that abstracted the low-level APIs to HDFS, MapReduce, Hive, and other Hadoop components to clear the way for developers to write, not just MapReduce programs or run BI tools, but write API-driven programs that could connect to Hadoop. Just as a generation ago, application servers abstracted data and programs so they could flexibly connect with each other. Its first project was a data ingestion platform written on Storm that would be easier to work with than existing Hadoop projects such as Flume.

Continuuity’s problem was that the company was founded too early. During a period where Hadoop was exclusively a batch processing platform, there was little clamor for developers to write data-driven applications. But as new frameworks transform Hadoop into a platform that can deliver experiences closer to real-time, demand should emerge among developers to write, not just programs, but applications that can run against Hadoop (or other platforms).

In the interim, Continuuity changed its name to Cask, and changed its business model to become an open source company. It has diversified its streaming engine to work with other frameworks besides Storm to more readily persist data. And the 40-person company, which was founded a few blocks away from Cloudera’s original headquarters next to Fry’s Electronics in Palo Alto, has just drawn a modest investment from Cloudera to further develop its middleware platform.

Admittedly, Cask’s website really doesn’t make a good case (the home page gives you a 404 error), but providing an application platform for Hadoop opens up possibilities limited only by the imagination. For instance, it could make possible event-driven programs for performing data validation or detecting changes in customer interactions, and so on.

For Cloudera, Cask is a low-risk proposition for developing that long-missing third path to Hadoop to further its transformation to a multi-purpose platform.

MongoDB grows up

One could say that MongoDB has been at the right place at the right time. When web developers demanded a fast, read-intensive store of complex variably-structured data, the company formerly known as 10Gen came up with a simple engine backed by intuitive developer-friendly tooling. It grew incredibly popular for applications like product catalogs, tracking hierarchical events (like chat strings with responses), and some forms of web content management.

In a sense, MongoDB and JSON became the moral equivalents of MySQL and the LAMP stack, which were popular with web developers who needed an easy-to-deploy transactional SQL database sans all the overhead of an Oracle.

Some things changed. Over the past decade, Internet developers expanded from web to also include mobile developers. And the need for databases has now extended to variably structured data. Enter JSON. It provided that long-elusive answer to providing a simple operational database with an object-like representation of the world without the associated baggage (e.g., polymorphism, inheritance), using a language (JavaScript) and data structure that was already lingua franca with web developers.

Like MySQL, Mongo was known for its simplicity. It had a simple data model, a query framework that was easy for developers to use, and well-developed indexing that made reads very fast. It’s been cited by db-Engines as the fourth most popular database among practitioners.

And like MySQL, MongoDB was not known for its ability to scale (just ask Cassandra fans). For MySQL, it took a separately developed engine, InnoDB, to provide the heart transplant that could turn it into a serious database.

Fast forward, and some alumni from Sleepycat Software (the Berkeley company that developed BerkeleyDB, later bought by Oracle) founded WiredTiger, ginning out an engine that could add similar scale to Mongo. WiredTiger offers a more write-friendly engine that aggressively takes advantage of (configurable) compression to scale and deliver high performance. And it provides a much more granular and configurable approach to locking that could alleviate much of the write bottlenecks that plagued Mongo.

History took interesting paths. Oracle bought Sleepycat and later inherited MySQL via the Sun acquisition. And last fall, MongoDB bought WiredTiger.

Which brings us to MongoDB 3.0 (originally numbered 2.8). It’s no mystery (except for the dot-zero release number) that the WiredTiger engine would end up in Mongo, as their integration was destiny. Also not surprising is that the original MongoDB MMAP engine lives on. There is a huge installed base, and for existing read-heavy applications, it works perfectly well for a wide spectrum of use cases (e.g., recommendation engines). The new release makes the storage engine pluggable via a public API.

We’ve been down this road before; today MySQL has almost a dozen storage engines. Right out of the gate, MongoDB will have the two supported by the company: classic MMAP or the industrial-strength WiredTiger engine. Then there’s also an “experimental” in-memory engine that’s part of this release. And off in the future, there’s no reason why HDFS, cloud-based object storage, or even SQL engines couldn’t follow.
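As a hedged sketch of what the pluggable-engine choice looks like in practice: the server is started against the engine you want (for example, mongod --storageEngine wiredTiger in the 3.0 release), and individual collections can pass engine-specific options at creation time. The database, collection, and compression setting below are illustrative.

```python
from pymongo import MongoClient

# Assumes the server was launched with: mongod --storageEngine wiredTiger
db = MongoClient().telemetry  # hypothetical database

# Engine-specific options can ride along at collection-creation time;
# here, a configurable block compressor for a write-heavy collection.
db.create_collection(
    "sensor_readings",
    storageEngine={"wiredTiger": {"configString": "block_compressor=zlib"}},
)

db.sensor_readings.insert_one({"sensor": "pump-7", "psi": 87.2})
print(db.sensor_readings.find_one({"sensor": "pump-7"}))
```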

The significance of the 3.0 release is that the MongoDB architecture becomes an extensible family. And in fact, this is quite consistent with trends that we at Ovum have been seeing with other data platforms, which are all overlapping and taking on multiple personas. That doesn’t mean that every database will become the same, but that each will have its area of strength while also being able to take on boundary cases. For instance:
• Hadoop platforms have been competing on adding interactive SQL;
• SQL databases have been adding the ability to query JSON data; and
• MongoDB is now adding the fast, scalable write capabilities associated with rival NoSQL engines like Cassandra or Couchbase, reducing the performance gap with key-value stores.

Database convergence or overlap doesn’t mean that you’ll suddenly use Hadoop to replace your data warehouse, or MongoDB instead of your OLTP SQL database. And if you really need fast write performance, key-value stores will probably remain your first choice. Instead, view these as extended capabilities that allow you to handle a greater variety of use cases, data types, and queries off the same platform with familiar development, query, and administration tools.

Back to MongoDB 3.0, there are a few other key enhancements with this release. Concurrency control (the source of those annoying write locks with the original MMAP engine) becomes more granular in this release. Instead of having to lock the entire database for consistent writes, locks can now be confined to a specific collection (the MongoDB equivalent of a table), reducing an annoying bottleneck. Meanwhile, WiredTiger adds more granular memory management to further improve write performance, along with engine-level schema support; eventually, WiredTiger might even bring schema validation to Mongo.

We don’t view this release as being about existing MongoDB customers migrating to the new engine; yes, the new engine will support the same tools, but it will require a one-time reload of the database. Instead, we view this as expanding MongoDB’s addressable market, with the obvious target being key-value stores like Cassandra, BerkeleyDB (now commercially available as Oracle NoSQL Database), or Amazon DynamoDB. It’s just like what other data platforms are doing by adding their own overlapping capabilities.