Category Archives: Big Data

Big Data 2015-2016: A look back and a look ahead

Quickly looking back
2015 was the year of Spark.

If you follow Big Data, you’d have to be living under a rock to have missed the Spark juggernaut. The extensive use of in-memory processing has helped machine learning go mainstream, because the speed of processing enables the system to quickly detect patterns and deliver actionable intelligence. It has surfaced in data prep/data curation tools, where the system helps you get an idea of what’s in your big data and how it fits together, and in a new breed of predictive analytics tools that are now, thanks to machine learning, starting to become prescriptive. Yup, Cloudera brought Spark to our attention a couple of years back as the eventual successor to MapReduce, but it was the endorsement of IBM, backed by a commitment of 3,500 developers and a $300 million investment in tool and technology development, that establishes the beachhead for Spark computing to pass from early adopter to enterprise. We believe that will mostly happen through tools that embed Spark under the covers. The game isn’t over for Spark: issues of scalability and security persist, but there’s little question it’s here to stay.

We also saw continued overlap and convergence in the tectonic plates of databases. Hadoop became more SQL-like, and if you didn’t think there were enough SQL-on-Hadoop frameworks, this year we got two more, from MapR and Teradata. It underscored our belief that there will be as many flavors of SQL on Hadoop as there are in the enterprise database market.

And while we’re on the topic of overlap, there’s the unmistakable trend of NoSQL databases adding SQL faces: Couchbase’s N1QL, Cassandra/DataStax’s CQL, and most recently, the SQL extensions for MongoDB. It reflects the reality that, while NoSQL databases emerged to serve operational roles, there is a need to add some lightweight analytics to them – not to replace data warehouses or Hadoop, but to add some inline analytics as you are handling live customer sessions. Also pertinent to overlap is the morphing of MongoDB, which has been the poster child for the lightweight, developer-friendly database. Like Hadoop, MongoDB is no longer known for its storage engine, but for its developer tooling and APIs. With the 3.0 release, the storage engines became pluggable (the same path trod by MySQL a decade earlier). With the just-announced 3.2 version, the write-friendlier WiredTiger replaces the original MMAP as the default storage engine (meaning you can still use MMAP if you override the factory settings).

A year ago, we expected streaming, machine learning, and search to become the fastest-growing Big Data analytic use cases; it turns out that machine learning was the hands-down winner last year, but we’ve also seen quite an upsurge of interest in streaming, thanks to a perfect-storm convergence of IoT and mobile data use cases (which epitomize real time) with technology opportunity (open source has lowered barriers for developers, enterprises, and vendors alike, while commodity scale-out architecture provides the economical scaling to handle torrents of real-time data). Open source is not necessarily replacing proprietary technology; proprietary products offer the polish (e.g., ease of use, data integration, application management, and security) that is either lacking from open source products or requires manual integration. But open source has injected new energy into a field that was formerly more of a complex solution looking for a problem.

So what’s up in 2016?

A lot… but three trends pop out at us.

1. Appliances and cloud drive the next wave of Hadoop adoption.
Hadoop has been too darn hard to implement. Even with the deployment and management tools offered with packaged commercial distributions, implementation remains developer-centric and best undertaken with teams experienced with DevOps-style continuous integration. The difficulty of implementation was not a show-stopper for early adopters (e.g., Internet firms who invent their own technology, digital media and adtech firms who thrive on advanced technology, and capital markets firms who compete on being bleeding edge), or early enterprise adopters (innovators from the Global 2000). But it will be for the next wave, who lack the depth or sophistication of IT skills/resources of the trailblazers.

The wake-up call came when we heard that Oracle’s Big Data Appliance, which barely registered on the map during its first couple of years of existence, encountered a significant upsurge in sales among the company’s European client base. Considered in conjunction with continued healthy growth in Amazon’s cloud adoption, it dawned on us that the next wave of Hadoop adoption will be driven by simpler paths: either via appliance or cloud. This is not to say that packaged Hadoop offerings won’t further automate deployment, but the cloud and appliances are the straightest paths to a more black-box experience.

2. Machine learning becomes a fact of life with analytics tools. And more narratives, fewer dashboards.
Machine learning is already a checklist item with data preparation tools, and we expect the same to happen with analytics tools this year. Until now, the skills threshold for taking advantage of machine learning has been steep. There are numerous techniques to choose from; first you identify whether you already know what type of outcome you’re looking for, then you choose between approaches such as linear regression models, decision trees, random forests, clustering, anomaly detection, and so on to solve your problem. It takes a statistical programmer to make that choice. Then you have to write the algorithm, or use tools that prepackage those algorithms for you, such as those from H2O or Skytree. The big nut to crack will be how to apply these algorithms and interpret the results.
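To make that supervised-versus-unsupervised fork concrete, here is a minimal sketch – ours, not drawn from any of the tools mentioned above – using Python’s scikit-learn on synthetic data; the model choices and parameters are purely illustrative.

```python
# A minimal sketch of the choice described above: pick a supervised model when
# the outcome is known, an unsupervised one when it is not. Data is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))          # 500 observations, 4 features

# Case 1: we know the outcome we're looking for (a labeled churn flag, say),
# so a supervised learner such as a random forest applies.
y = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("supervised accuracy:", clf.score(X, y))

# Case 2: no known outcome -- fall back to an unsupervised technique
# such as clustering to surface structure in the data.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(clusters))
```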

But we expect to see more of these models packaged under the hood. We’ve seen some cool tools this past year, like Adatao, that combine natural language query for business end users with an underlying development environment for R and Python programmers. We’re seeing tooling that puts all this more inside the black box, combining natural language querying with the ability to recognize signals in the data, guide the user on what to query, and automatically construct narratives or storyboards, as opposed to abstract dashboards. Machine learning plays a foundational role in generating such guided experiences. We’ve seen varying bits and pieces of these capabilities in offerings such as IBM Watson Analytics, Oracle Big Data Discovery, and Amazon QuickSight – and in the coming year, we expect to see more.

3. Data Lake enters the agenda
The Data Lake, the stuff of debate over the past few years, starts becoming reality with early enterprise adopters. The definition of a data lake is in the eye of the beholder – we view it as the governed repository that acts as the default ingest point and repository for raw data and the resting point for aged data that is retained online for active archiving. It’s typically not the first use case for Hadoop, and it shouldn’t be: you shouldn’t build a repository until you know how to use the underlying platform and, in the case of the data lake, how to work with big data. But as the early wave of enterprise adopters grows comfortable with Hadoop in production serving more than a single organization, planning for the data lake is a logical follow-on step. It’s not that we’ll see full adoption in 2016 – Rome wasn’t built in a day. But we’ll start seeing more scrutiny on data management, building on the rudimentary data lineage capabilities currently available with Hadoop platforms (e.g., Cloudera Navigator, Apache Atlas) and in data wrangling tools. Data lake governance is a work in progress; there is much white space to be filled in around lifecycle management/data tiering, data retention, data protection, and cost/performance optimization.

Data Scientists are people too

There’s been lots of debate over whether the data scientist position is the sexiest job of the 21st century. Despite the unicorn hype, spending a day with data scientists at the Wrangle conference, an event staged by Cloudera, was a surprisingly earthy experience. It wasn’t an event chock full of algorithms; instead, it was about the trials and tribulations of making data science work in a business. The issues were surprisingly mundane. And by the way, the brains in the room spoke perfectly understandable English.

It starts with questions as elementary as finding the data – and enough of it – to learn something meaningful. Or defining your base assumptions; a data scientist with a financial payments processor found that definitions of fraud were not as black and white as she (or anybody) would have expected. And assuming you’ve found those data sets and established some baseline truths, there are the usual growing pains of scaling infrastructure and analytics. What computes well in a 10-node cluster might have issues when you scale to many times that. Significantly, the hiccups can be logical as well as physical: if your computations have any interdependencies, surprises can emerge as the threads multiply.

But let’s get down to brass tacks. Like why run a complex algorithm when a simple one will do? For instance, when a flyer tweets about bad service, it’s far more effective for the airline to simply respond to the tweet asking the customer to provide their booking number (through a private message) rather than resort to elaborate graph analytics to establish the customer’s identity. And don’t just show data for the sake of it; there’s a good reason why Google Maps simply shows colored lines to highlight the best routes rather than dashboards at each intersection showing what percentage of drivers turned left or went straight. When formulating queries or hypotheses, look outside your peer group to see if they make sense through other people’s eyes.

Data scientists face many of the same issues as developers at large. One of the speakers admitted resorting to Python scripts rather than heavier-weight frameworks like Storm or Kafka; the question in retrospect is how well those scripts are documented for future reference. Another spoke of the pain of scaling up infrastructure not designed for sophisticated analytics; in this case, a system built with Ruby scripting (not exactly well suited to statistical programming) on a Mongo database (not well suited to analytics), and taking Band-Aid approaches (e.g., replicating the database nightly to a Hadoop cluster) before finally biting the bullet and rewriting the code to eliminate the need for wasteful data transfers. Another spoke of the difficulty of debugging machine learning algorithms that get too complex for their own good.

There are moral questions as well. Clare Corthell, who heads her own ML consulting firm, made an impassioned plea for data scientists to root out bias in their algorithms. Of course, the idea of any human viewing or querying data objectively is a literal impossibility; we’re all human, and we see things through our own mental lenses. In essence, it means factoring in human biases even in the most objective computational problems. For instance, the algorithms for online dating sites should factor in skews, such as Asian men tending to rate African American women more negatively than average; or that loan approvals based on ‘objective’ metrics such as income, assets, and zip code in effect perpetuate the same redlining practices that fair lending laws were supposed to prohibit.

Data science may be a much-hyped profession; supply is far outstripped by demand. We’ve long believed that there will always be a need for data scientists, but also that, for the large mass of enterprises, applications will start embedding data science. And it’s already happening, thanks to machine learning providing a system assist to humans in BI tools and data prep/data wrangling tools. But at the end of the day, as much as they might be considered unicorns, data scientists face very familiar issues.

Strata 2015 Post Mortem: Sparking expectations for Smart, Fast Applications

A year ago, Turing Award winner Dr. Michael Stonebraker made the point that, when you try managing more than a handful of data sets, manual approaches run out of gas and the machine must come in to help. He was referring to the task of cataloging data sets in the context of capabilities performed by his latest startup, Tamr. If your typical data warehouse or data mart involves three or four data sources, it’s possible to get your head around figuring out the idiosyncrasies of each data set and how to integrate them for analytics.

But push that number to dozens, if not hundreds or thousands, of data sets, and any human brain is going to hit the wall – maybe literally. And that’s where machine learning first made big data navigable, not just to data scientists, but to business users. Introduced by Paxata and since followed by a long tail of startups and household names, these tools apply machine learning to help the user wrangle data through a new kind of iterative process. Since then, analytic tools such as IBM’s Watson Analytics have begun employing machine learning to help end users perform predictive analytics.

Walking the floor of last week’s Strata Hadoop World in New York, we saw machine learning powering “emergent” approaches to building data warehouses. Infoworks monitors what data end users are targeting for their queries by taking a change-data-capture-like approach to monitoring logs; but instead of just tracking changes (which is useful for data lineage), it deduces the data model and builds OLAP cubes. Alation, another startup, uses a similar approach for crawling data sets to build catalogs with Google-like PageRanks showing which tables and queries are the most popular. It’s supplemented with a collaboration environment where people add context, and a natural language query capability that browses the catalog.

Just as machine learning is transforming the data transformation process to help business users navigate their way through big data, it’s also starting to provide the intelligence to help business users become more effective with exploratory analytics. While over the past couple years, interactive SQL was the most competitive battle for Hadoop providers — enabling established BI tools to treat Hadoop as simply a larger data warehouse — machine learning will become essential to helping users become productive with exploratory analytics on big data.

What makes machine learning possible within an interactive experience is the emerging Spark compute engine. Spark is what’s turning Hadoop from a Big Data platform into a Fast Data one. By now, every commercial Hadoop distro includes a Spark implementation, although which Spark engines (e.g., SQL, Streaming, Machine Learning, and Graph) are supported still varies by vendor. A few months back, IBM declared it would invest $300 million and dedicate 3,500 developers to Spark machine learning product development, followed by Cloudera’s announcement of a One Platform initiative to plug Spark’s gaps.

And so our interest was piqued by Netflix’s Strata session on running Spark at petabyte scale. Among Spark’s weaknesses is that it hasn’t consistently scaled beyond a thousand nodes, and it is not known for high concurrency. Netflix’s data warehouse currently tops out at 20 petabytes and serves roughly 350 users (we presume technically savvy data scientists and data engineers). Spark is still in its infancy at Netflix; while workloads are growing, they are not at a level that would merit a dedicated cluster (Netflix runs its computing in the Amazon cloud, on S3 storage). Much of the Spark workload is streaming, run under YARN. And that leads to a number of issues showing that, at high scale and high concurrency, Spark is a work in progress.

A few of the measures Netflix is taking to scale Spark include adding caching steps to accelerate the loading of large data sets. Related to that is reducing the latency of retrieving the large metadata sets (“list calls”) that are often associated with large data sets; Netflix is working on an optimization that would apply to Amazon’s S3. Another scaling issue relates to file scanning (Spark normally scans all Hive tables when a query is first run); Netflix has designed a workaround to push down predicate processing so queries only scan relevant tables.
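Netflix’s specific workaround isn’t spelled out here, but the general idea of predicate pushdown can be sketched in PySpark: partition the data on a column and filter on it, and Spark prunes the scan to the matching partitions instead of reading everything. The paths and column names below are hypothetical.

```python
# Hedged sketch of predicate pushdown / partition pruning in Spark SQL.
# Paths, table, and column names are invented, not Netflix's.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pushdown-sketch").getOrCreate()

# Write a DataFrame partitioned by day so each day lands in its own directory.
events = spark.range(0, 1_000_000).selectExpr(
    "id", "cast(id % 30 as string) as day")
events.write.mode("overwrite").partitionBy("day").parquet("/tmp/events")

# A filter on the partition column is pushed down: Spark only scans
# the matching partition directories instead of the whole data set.
recent = spark.read.parquet("/tmp/events").filter("day = '7'")
recent.explain()          # physical plan shows the partition/pushed filters
print(recent.count())
```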

For most business users, the issue of Spark scaling won’t be relevant, as their queries are not routinely expected to involve multiple petabytes of data. But for Spark to fulfill its promise of supplanting MapReduce for iterative, complex, data-intensive workloads, scale will be an essential hurdle to clear. We have little doubt that the sizable Spark community will rise to the task. But the future won’t necessarily be all Spark all the time. Keep your eye out for the Apex streaming project; it’s drawn some key principals who have been known for backing Storm.

So is Spark really outgrowing Hadoop?

That’s one of the headlines of a newly released Databricks survey that you should definitely check out. Because Spark only requires a JVM to run, there’s been plenty of debate on whether you really need to run it on Hadoop, or whether Spark will displace it altogether. Technically, you don’t need Hadoop: all you need is a JVM installed on the cluster, or a lightweight cluster manager like Apache Mesos. It’s the familiar argument about why bother with the overhead of installing and running a general-purpose platform if you only have a single-purpose workload.
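As a rough illustration of that point – not drawn from the survey – the same PySpark job can target a local JVM, a standalone Spark cluster, Mesos, or YARN simply by changing the master URL; the host names below are placeholders.

```python
# Sketch: one job, different cluster managers, selected by the master URL.
from pyspark.sql import SparkSession

def build_session(master_url: str) -> SparkSession:
    """Create a session against whichever cluster manager is given."""
    return (SparkSession.builder
            .appName("deployment-sketch")
            .master(master_url)
            .getOrCreate())

# Local JVM only -- no Hadoop required.
spark = build_session("local[*]")

# Alternatives (one at a time), assuming the corresponding cluster exists:
#   build_session("spark://standalone-master:7077")   # Spark standalone
#   build_session("mesos://mesos-master:5050")        # Apache Mesos
#   build_session("yarn")                             # Hadoop YARN

print(spark.range(10).count())
```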

Actually, there are reasons, if security or data governance are necessary, but hold that thought.

According to the Databricks survey, which polled nearly 1,500 respondents online over the summer, nearly half are running Spark standalone, with another 40% running under YARN (Hadoop) and 11% on Mesos. There’s a strong push for dedicated deployment.

But let’s take a closer look at the numbers. About half the respondents are also running Spark on a public cloud. Admittedly, running in the cloud does not automatically equate to standalone deployment. But there’s a lot more than coincidence in the numbers, given that popular cloud-based Spark services from Databricks, and more recently Amazon and Google, are (or will be) running in dedicated environments.

And consider what stage we’re at with Spark adoption. Commercial support is barely a couple years old and cloud PaaS offerings are much newer than that. The top 10 sectors using Spark are the classic early adopters of Big Data analytics (and, ironically in this case, Hadoop): Software, web, and mobile technology/solutions providers. So the question is whether the trend will continue as Spark adoption breaks into mainstream IT, and as Spark is embedded into commercial analytic tools and data management/data wrangling tools (which it already is).

This is not to say that running Spark standalone will become just a footnote to history. If you’re experimenting with new analytic workloads – like testing another clustering or machine learning algorithm – dedicated sandboxes are great places to run those proofs of concept. If you have specific types of workloads, there have long been good business and technology cases for running them on the right infrastructure; if you’re running a compute-intensive workload, for instance, you’ll probably want to run it on servers or clusters that are compute- rather than storage-heavy. And if you’re running real-time, operational analytics, you’ll want to run it on hardware that has heavily bulked up on memory.

Hardware providers like Teradata, Oracle, and IBM have long offered workload-optimized machines, while cloud providers like Amazon offer arrays of different compute and storage instances that clients can choose for deployment. There’s no reason why Spark should be any different – and that’s why there’s an expanding marketplace of PaaS providers that are offering Spark-optimized environments.

But if dedicated Spark deployment is to become the norm rather than the exception, it must reinvent the wheel when it comes to security, data protection, lifecycle workflows, data localization, and so on. The Spark open source community is busy addressing many of the same gaps that are currently challenging the Hadoop community (just that the Hadoop project has a two-year head start). But let’s assume that the Spark project dots all the i’s and crosses all the t’s to deliver the robustness that is expected of any enterprise data platform. As Spark workloads get productionized, will your organization really want to run them in yet another governance silo?

Note: There are plenty of nuggets in the Databricks survey beyond Hadoop. Recommendation systems, log processing, and business intelligence (an umbrella category) are the most popular uses. The practitioners are mostly data engineers and data scientists – suggesting that adoption is concentrated among those with new generation skills. But while advanced analytics and real-time streaming are viewed by respondents as the most important Spark features, paradoxically, Spark SQL is the most used Spark component. While new bells and whistles are important, at the end of the day, accessibility from and integration with enterprise analytics trump all.

Hadoop and Spark: A Tale of two Cities

If it seems like we’ve been down this path before, well, maybe we have. June has been a month of juxtapositions, back and forth to the west coast for the Hadoop and Spark Summits. The mood from last week to this has been quite a contrast. Spark Summit has the kind of canned heat that Hadoop conferences had a couple of years back. We won’t stretch the Dickens metaphor.

Yeah, it’s human nature to say, down with the old and in with the new.

But let’s set something straight: Spark ain’t going to replace Hadoop, as we’re talking about apples and oranges. Spark can run on Hadoop, and it can run on other data platforms. What it might replace is MapReduce, if Spark can overcome its scaling hurdles. And it could fulfill IBM’s vision of the next analytic operating system if it addresses mundane – but very important – concerns around scaling, high concurrency, and bulletproof security. Spark originated at UC Berkeley’s AMPLab back in 2009, with its creators going on to form Databricks. With roughly 700 contributors, Spark has ballooned into the most active open source project in the Apache community, barely two years after becoming an Apache project.

Spark is best known as a sort of in-memory analytics replacement for iterative computation frameworks like MapReduce; both employ massively parallel compute and then shuffle interim results, with the difference being that Spark caches in memory while MapReduce writes to disk. But that’s just the tip of the iceberg. Spark offers a simpler programming model, better fault tolerance, and it’s far more extensible than MapReduce. Spark can run any form of iterative computation, and it was designed to support specific extensions; among the most popular are machine learning, microbatch stream processing, graph computing, and even SQL.
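A minimal PySpark sketch of that caching difference (synthetic data, illustrative only): an interim result is pinned in memory and reused across several passes, where MapReduce would have written and re-read it from disk each time.

```python
# Sketch of the caching pattern described above: an intermediate data set is
# kept in memory and reused across iterations, instead of being rewritten
# to disk between passes as MapReduce would.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cache-sketch").getOrCreate()

# Pretend this is an expensive upstream transformation.
base = (spark.range(0, 5_000_000)
        .withColumn("value", (F.col("id") % 97).cast("double"))
        .cache())                      # keep the interim result in memory

# Several "iterations" reuse the cached result without recomputing it.
for threshold in (10, 50, 90):
    count = base.filter(F.col("value") > threshold).count()
    print(f"values above {threshold}: {count}")
```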

By contrast, Hadoop is a data platform. It is one of many that can run Spark, because Spark is platform-independent. So you could also run Spark on Cassandra, another NoSQL data store, or SQL databases, but Hadoop has been the most popular target so far.

And let’s not forget Apache Mesos, another AMPLab project for cluster management, with which Spark was originally closely associated.

There’s little question about the excitement level over Spark. By now the headlines have poured out over IBM investing $300 million, committing 3,500 developers, establishing a Spark open source development center a few BART stops from AMPLab in San Francisco, and aiming, directly and through partners, to educate 1 million professionals on Spark in the next few years (or about 4 – 5x the current number registered for IBM’s online Big Data University). IBM views Spark’s strength as machine learning, and wants to make machine learning a declarative programming experience that will follow in SQL’s footsteps with its new SystemML language (which it plans to open source).

That’s not to overshadow Databricks’ announcement that its Spark developer cloud, in preview over the past year, has now gone GA. The big challenge facing Databricks was making its cloud scalable and sufficiently elastic to meet demand – and not become a victim of its own success. And there is the growing number of vendors that are embedding Spark within their analytic tools, streaming products, and development tools. The release announcement of Spark 1.4 brings new manageability features, including the capability to automatically renew Kerberos tokens for long-running processes like streaming. But there remain growing pains, like reducing the number of moving parts needed to make Spark a first-class citizen with Hadoop YARN.

By contrast, last week was about Hadoop becoming more manageable and more amenable to enterprise infrastructure, like shared storage as our colleague Merv Adrian pointed out. Not to mention enduring adolescent factional turf wars.

It’s easy to get excited by the idealism around the shiny new thing. While the sky seems the limit, the reality is that there’s lots of blocking and tackling ahead. And the need for engaging, not only developers, but business stakeholders through applications, rather than development tools, and success stories with tangible results. It’s a stage that the Hadoop community is just starting to embrace now.

MongoDB widens its sights

MongoDB has passed several key watersheds over the past year, including a major redesign of its core platform and a strategic shift in its management team. By now, the architectural transition is relatively old news; as we noted last winter, MongoDB 3.0 made the storage engine pluggable. So voila! Just like MySQL before it, Mongo becomes whatever you want it to be. Well, eventually, anyway; but today there’s the option of substituting the more write-friendly WiredTiger engine, and in the near future, an in-memory engine now in preview could provide an even faster write-ahead cache to complement the new overcaffeinated tiger. And there are likely other engines to come.

From a platform – and market – standpoint, the core theme is Mongo broadening its aim. Initially, it will be through new storage engines that allow Mongo to be whatever you make of it. MongoDB has started the fray with WiredTiger and the new in-memory data store, but with the publishing of the API, there are opportunities for other engines to plug in. At MongoDB’s user conference, we saw one such result – the RocksDB engine developed at Facebook for extremely I/O-intensive transactions involving log data. And as we’ve speculated, there’s nothing to stop other storage engines, SQL ones included, from plugging in.

Letting a thousand flowers bloom
Analytics is an example of where Mongo is spreading its focus. While Mongo and other NoSQL data stores are typically used for operational applications requiring fast reads and/or writes, for operational simplicity there is also growing demand for in-line analytics. Why move data to a separate data warehouse, data mart, or Hadoop if it can be avoided? And why not embed some analytics with your operational applications? This is hardly an outlier – a key selling point for the latest generations of Oracle and SAP applications is the ability to embed analytics with transaction processing. Analytics evolves from an after-the-fact exercise to an inline process that is part of processing a transaction. Any real-time customer-facing or operational process is ripe for analytics that can prompt inline decisions for providing next-best offers or tweaking the operation of an industrial process, supply chain, or the delivery of a service. And so a growing number of MongoDB deployments are adding analytics to the mix.
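As a hedged illustration of what such in-line analytics can look like – the connection string, collection, and field names are invented – a MongoDB aggregation pipeline can compute a rollup directly against the operational collection, with no export to a warehouse:

```python
# Sketch of in-line analytics on an operational MongoDB collection using the
# aggregation pipeline, rather than exporting to a warehouse first.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

pipeline = [
    {"$match": {"status": "complete"}},                 # only finished orders
    {"$group": {"_id": "$customer_id",                  # roll up per customer
                "order_count": {"$sum": 1},
                "total_spend": {"$sum": "$amount"}}},
    {"$sort": {"total_spend": -1}},
    {"$limit": 10},                                     # top 10 customers
]

for row in orders.aggregate(pipeline):
    print(row["_id"], row["order_count"], row["total_spend"])
```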

It’s almost a no-brainer for SQL BI tools to target JSON data per se, because the data has a structure. (Admittedly, this assumes the data is relatively clean, which in many cases is not a given.) But by nature, JSON has a more complex and potentially richer structure than SQL tables to the degree that the data is nested. Yet most SQL tools do away with the nesting and hierarchies that are stored in JSON documents, “flattening” the structure into a single column.
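For contrast, here is a small sketch – our own, with an invented document – of what querying nested JSON natively looks like in Spark SQL: struct fields are reached with dot notation and arrays are unnested only where needed, rather than flattening everything into a single column.

```python
# Sketch of querying nested JSON without flattening it away. The document,
# database, and field names are invented for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("nested-json-sketch").getOrCreate()

doc = ['{"order_id": 1, "customer": {"name": "Ann", "city": "Brooklyn"}, '
       '"items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}']
orders = spark.read.json(spark.sparkContext.parallelize(doc))

# Dot notation reaches into the nested struct without flattening the schema...
orders.select("order_id", "customer.city").show()

# ...and explode() unnests the array only where the query needs line items.
(orders.select("order_id", F.explode("items").alias("item"))
       .select("order_id", "item.sku", "item.qty")
       .show())
```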

We’ve always wondered when analytic tools would wake up to the potential of querying JSON natively – at the least, not flattening the structure, but incorporating that information when processing the query. The upcoming MongoDB 3.2 release will add a new connector for BI and visualization tools that pushes analytic processing down into MongoDB, rather than requiring data to be extracted first to populate an external data mart or data warehouse for the analytic tool to target. But this enhancement is not so much about enriching the query with information pertaining to the JSON schema; it’s more about efficiency, eliminating data transport.

But some emerging startups are looking to address that native JSON query gap. jSonar demonstrated SonarW, a columnar data warehouse engine that plugs into the Mongo API, with a key difference: it provides metadata that offers a logical representation of the nested and hierarchical relationships. We also saw a reporting tool from Slamdata that applies similar context to the data, based on patent-pending algorithms that apply relational algebra to slicing, dicing, and aggregating deeply nested data.

Who says JSON data has to be dirty?
A key advantage of NoSQL data stores like Mongo is that you don’t have to worry about applying a strict schema or validation (e.g., ensuring that the database isn’t sparse and that the data in the fields is not gibberish). But there’s nothing inherent to JSON that rules out validation and robust data typing. MongoDB will be introducing a tool supporting schema validation for those use cases that demand it, plus a tool for visualizing the schema to provide a rough indication of unique fields and unique data (e.g., cardinality) within those fields. While maybe not a full-blown data profiling capability, it is a start.
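MongoDB’s validation tooling isn’t detailed here, but document validation arrived with the 3.2 server, and a minimal pymongo sketch – with invented database, collection, and field names – shows the opt-in flavor of it:

```python
# Sketch of opt-in document validation in MongoDB: the collection rejects
# documents that miss required fields or fall outside allowed ranges.
from pymongo import MongoClient
from pymongo.errors import WriteError

client = MongoClient("mongodb://localhost:27017")
db = client["crm"]

# Create a collection whose validator requires a string email and a sane age.
db.create_collection("contacts", validator={
    "email": {"$type": "string"},
    "age": {"$gte": 0, "$lte": 150},
})

db["contacts"].insert_one({"email": "ann@example.com", "age": 34})   # passes

try:
    db["contacts"].insert_one({"age": 34})          # no email -> rejected
except WriteError as err:
    print("validation failed:", err)
```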

Breaking the glass ceiling
The script for MongoDB has been familiar up ‘til now: the entrepreneurial startup whose product has grown popular through grassroots appeal. The natural trajectory for MongoDB is to start engaging the C-level and the business, who write larger checks. A decade ago, MySQL played this role. It was kind of an Oracle or SQL Server Lite that was less complex than its enterprise cousins. That’s been very much MongoDB’s appeal. But by making the platform more extensible, MongoDB creates a technology path to grow up. Can the business grow with it?

Over the past year, MongoDB’s upper management team has largely been replaced; the CEO, CMO, and head of sales are new. It’s the classic story of startup visionaries followed by those experienced at building the business. President and CEO Dev Ittycheria, most recently from the venture community, previously took BladeLogic public before eventually selling it to BMC for $900 million in 2008. Its heads of sales and marketing come from similar backgrounds and long track records. While MongoDB is clearly not sloughing off on product development, it is plowing much of its capitalization into building out the go-to-market.

The key challenge facing Mongo, and all the new data platform players, is where (or whether) they will break the proverbial glass ceiling. There are several perspectives on this challenge. For open source players like MongoDB, it is determining where the value-add lies. It’s a moving target; while the functions that make a data store enterprise grade, such as data governance, management, and security, were traditionally unique to the vendor and platform, open source is eating away at that. Just look at the Hadoop world, where there’s Ambari, while Cloudera and IBM offer their own tooling either as core or as an optional replacement. So this dilemma is hardly unique to MongoDB. Our take is that a lowest-common-denominator approach cannot be applied to governance, security, or management, but it will become a case where platform players, like MongoDB, must branch out and offer related value-add such as optimizations for cloud deployment, information lifecycle management, and so on.

Such a strategy of broadening the value-add grows even more important given market expectations for pricing; in essence, coping with the “I’m not going to pay a lot for this muffler” syndrome. The expectation with open source and other emerging platforms is that enterprises are not willing, or lack the budget, to pay the types of licenses customary with established databases and data warehouse systems. Yes, the land-and-expand model is critical for the likes of MongoDB, Cloudera, Hortonworks, and others for growing revenues. They may not replace the Oracles or Microsofts of the world, but they are angling to be the favorites for new-generation applications supplementing what’s already on the back end (e.g., customer experience applications enhancing and working alongside classical CRM).

Land and expand into the enterprise, and broadening from data platform to data management are familiar scripts. Even in an open source, commodity platform world, these scripts will remain as important as ever for MongoDB.

Hortonworks evens the score

Further proof that Hadoop competition is going up the stack toward areas such as packaged analytics, security, and data management and integration can be seen in Hortonworks’ latest series of announcements today: a refresh of the Hortonworks Data Platform with Ambari 2.0 and the acquisition of cloud deployment automation tool SequenceIQ.

Specifically, Ambari 2.0 provides much of the automation previously missing, such as automating rolling updates, restarts, Kerberos authentication, alerting and health checks, and so on. Until now, automation of deployment, monitoring and alerting, root cause diagnosis, and authentication was a key differentiator for Cloudera Manager. While Hadoop systems management may not be a done deal (e.g., updating to major new dot-zero releases is not yet a lights-out operation), the basic blocking and tackling is no longer a differentiator; any platform should have these capabilities. The recent debut of the Open Data Platform – where IBM and Pivotal are leveraging the core Hortonworks platform as the starting point for their Hadoop distributions – is further evidence. Ambari is the cornerstone of all these implementations, although IBM will still offer more “premium” value-add with options such as Platform Symphony and Adaptive MapReduce.

Likewise, Hortonworks’ acquisition of SequenceIQ is a similar move to even the score with Cloudera Director. Both handle automation of cloud deployment with policy-based elastic scaling (e.g., when to provision or kill compute nodes). The comparison may not yet be apples-to-apples; for instance, Cloudera Director has been part of the Cloudera enterprise platform (the paid edition) since last fall, whereas the ink is just drying on the Hortonworks acquisition of SequenceIQ. And while SequenceIQ’s product, Cloudbreak, is cloud infrastructure-agnostic and Cloudera Director right now only supports Amazon, that too will change.

More to the point is where competition is heading – we believe that it is heading from the core platform higher up the value chain to analytic capabilities and all forms of data management – stewardship, governance, and integration. In short, it’s a page out of the playbook of established data warehousing platforms that have had to provide value-add that could be embedded inside the database. Just take a look at Cloudera’s latest announcements: acquisition of Xplain and a strategic investment in Cask. Xplain automates the design, integration, and optimization of data models to reduce or eliminate hurdles to conducting self-service analytics on Hadoop. Cask on the other hand provides hooks for developers to integrate applications with Hadoop – the third way that until now has been overlooked.

As Hadoop graduates from a specialized platform for complex, data science computing to an enterprise data lake, the blocking and tackling functions – e.g., systems management and housekeeping – become checklist items. What’s more important is how to manage data, make data and analytics accessible beyond data scientists and statistical programming experts, and provide the security that is expected of any enterprise-grade platform.

Spark Summit debrief: Relax, the growing pains are mundane

As the most active project (by number of committers) in the Apache Hadoop open source community, it’s not surprising that Spark has drawn much excitement and expectation. At the core, there are several key elements to Spark’s appeal:
1. It provides a much simpler and more resilient programming model compared to MapReduce – for instance, it can restart failed nodes in process rather than requiring the entire run to be restarted from scratch.
2. It takes advantage of DRAM memory, significantly accelerating compute jobs – and because of the speed, allowing more complex, chained computations to run (which could be quite useful for simulations or orchestrated computations based on if/then logic).
3. It is extensible. Spark provides a unified computing model that lets you mix and match complex iterative MapReduce-style computation with SQL, streaming, machine learning and other processes on the same node, with the same data, on the same cluster, without having to invoke separate programs. It’s akin to what Teradata is doing with the SNAP framework to differentiate its proprietary Aster platform. (A sketch of this mix-and-match model follows below.)
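As a rough sketch of that unified model referenced in point 3 – entirely illustrative, with synthetic data – the output of a Spark SQL step can feed an MLlib step in the same program, on the same cluster, without invoking a separate engine:

```python
# Sketch: a Spark SQL step feeds directly into an MLlib step in one program.
# The data and column names are invented for illustration.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("unified-sketch").getOrCreate()

# Step 1: SQL over a (synthetic) table of events.
spark.range(0, 10_000).selectExpr(
    "id", "id % 7 as store", "rand() * 100 as amount"
).createOrReplaceTempView("events")
per_store = spark.sql(
    "SELECT store, avg(amount) AS avg_amount, count(*) AS n "
    "FROM events GROUP BY store")

# Step 2: feed the SQL result straight into machine learning (clustering),
# with no export step and no separate engine.
features = VectorAssembler(
    inputCols=["avg_amount", "n"], outputCol="features").transform(per_store)
model = KMeans(k=2, seed=1).fit(features)
model.transform(features).select("store", "prediction").show()
```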

Mike Olson, among others, has termed Spark “The leading candidate for ‘successor to MapReduce’.” How’s that for setting modest expectations?

So we were quite pleased to see Spark Summit making it to New York and have the chance to get immersed in the discussion.

Last fall, Databricks, whose founders created Spark from their work at UC Berkeley’s AMPLab, announced their first commercial product – a Spark Platform-as-a-Service (PaaS) cloud for developing Spark programs. We view the Databricks Cloud as a learning tool and incubator for developers to get up to speed on Spark without having to worry about marshaling compute clusters. The question on everybody’s minds at the conference was when the Databricks Cloud would go GA. The answer, like everything Spark, is about dealing with scalability – in this case, being capable of handling high-concurrency, highly spiky workloads. The latest word is later this year.

The trials and tribulations of the Databricks Cloud are quite typical for Spark – it’s dealing with scale, whether that be in numbers of users (concurrency) or data (when the data sets get too big for memory and must spill to disk). At a meetup last summer where we heard a trip report from Spark Summit 2015, the key pain point was the need for more graceful spilling to disk.

Memory-resident compute frameworks of course are nothing new. SAS for instance has its LASR Server, which it contends is far more robust in dealing with concurrency and compute-intensive workloads. But, as SAS’s core business is analytics, we expect that they will meet Spark halfway to appeal to Spark developers.

While Spark is thought of as a potential replacement for MapReduce, in actuality we believe that MapReduce will be about as dead as the mainframe – which is to say, not dead at all. While DRAM memory is, in the long run, getting cheaper, it will never be as cheap as disk. And while ideally you shouldn’t have to comb through petabytes of data on a routine basis (that’s part of defining your query and identifying the data sets), there are going to be analytic problems involving data sets that won’t completely fit in memory. Not to mention that not all computations (e.g., anything that requires developing a comprehensive model) will be suited to real-time or interactive computation. Not surprisingly, most of the use cases that we came across at Spark Summit were more about “medium data,” such as curating data feeds, real-time fraud detection, or heat maps of NYC taxi cab activity.

While dealing with scaling is part of the Spark roadmap, so is making it more accessible. At this stage, the focus is on developers, through APIs to popular statistical computation languages such as Python or R, and with frameworks such as Spark SQL and Spark DataFrames.

On one hand, with Hadoop and NoSQL platform providers competing with their own interactive SQL frameworks, the question is why the world needs another SQL framework. In actuality, Spark SQL doesn’t compete with Impala, Tez, BigSQL, Drill, Presto, or whatever. First, it’s not only about SQL, but about querying data with any kind of explicit schema. The use case for Spark SQL is running SQL programs in line with other computations, such as chaining SQL queries to streaming or machine learning runs. As for DataFrames, Databricks is simply adapting the distributed DataFrame concept already implemented in languages such as Java, Python, and R to access data sets that are organized as tables with columns containing typed data.

Spark’s extensibility is both blessing and curse. Blessing in that the framework can run a wide variety of workloads, but curse in that developers can drown in abundance. One of the speakers at Summit called for package management so developers won’t stumble over their expanding array of Spark libraries and wind up reinventing the wheel.

Making Spark more accessible to developers is a logical step in growing the skills base. But ultimately, for Spark to have an impact with enterprises, it must be embraced by applications. In those scenarios, the end user doesn’t care what process is used under the hood. There are already a few such applications and tools, like ClearStory Data for curating data feeds, or ZoomData, an emerging Big Data BI tool that has some unique IP (likely to stay proprietary) for handling scale and concurrency.

There’s no shortage of excitement and hype around Spark. The teething issues (e.g., scalability, concurrency, package management) are rather mundane. The hype – that Spark will replace MapReduce – is ahead of the reality; as we’ve previously noted, there’s a place for in-memory computing, but it won’t replace all workloads or make disk-based databases obsolete. And while Spark hardly has a monopoly on in-memory computing, the accessibility and economics of an open source framework on commodity hardware open lots of possibilities for drawing a skills base and new forms of analytics. But let’s not get too far ahead of ourselves.

IBM and Twitter: Another piece of the analytics puzzle

Roughly 20 years ago, IBM faced a major fork in the road from the hardware-centric model that defined the computer industry from the days of Grace Hopper. It embraced a services-heavy model that leveraged IBM’s knowledge of how and where enterprises managed their information in an era when many were about to undergo drastic replatforming in the wake of Y2K.

Today it’s about the replatforming, not necessarily of IT infrastructure, but of the business in the face of the need to connect in an increasingly mobile and things-connected world. And so IBM is in the midst of a reinvention, trying to embrace all things mobile, all things data, and all things connected. A key pillar of this strategy has been IBM’s mounting investment in Watson, where it has aggressively recruited and incubated partners to flesh out a new path of business solutions based on cognitive computing. On the horizon, we’ll be focusing our attention on a new path of insight: exploratory analytics, an area that is enabled by the next generation of business intelligence tools – Watson Analytics among them.

Which brings us to last fall’s announcement that IBM and Twitter would form a strategic partnership to develop real-time business solutions. As IBM has been seeking to reinvent itself, Twitter has been seeking to invent itself as a profitable business that can monetize its data in a manner that maintains trust among its members – yours truly among them. Twitter’s key value proposition is the immediacy of its data. While it may lack the richness and depth of content-heavy social networks like Facebook, it is, in essence, the world’s heartbeat. A ticker feed that is about, not financial markets, but the world.

When something happens, you might post on Facebook; within minutes or hours, blogs and news feeds may populate headlines. But for real-time immediacy, nothing beats the ease and simplicity of 140 characters. Uniquely, Twitter is sort of a hybrid between a consumer-oriented social network like Facebook and a professional one like LinkedIn. There is an immediacy and uniqueness to the data feed that Twitter provides. With its acquisition last year of partner Gnip (which already had commercial relationships with enterprise software providers like SAP), Twitter now has a direct pipeline for climbing the enterprise value chain.

So far, so good, but what has IBM done to build a real business out of all this? A few months in, IBM is on a publicity offensive to show there is real business here. It is part way to a goal of cross-training more than 10,000 of its 140,000 GBS consultants on Twitter solutions. IBM has already signed a handful of reference customer deals, and is disclosing some of the real-world use cases that are the focus of actual engagements.

Meanwhile, Twitter has been on a heavily publicized path to monetize the data that it has – which is a unique real-time pulse of what’s happening in the world. Twitter certainly has had its spate of challenges here. It sits on a data stream that is rich with currency, but lacking the depth that social networks like Facebook offer in abundance. Nonetheless, Twitter is unique in that it provides a ticker feed of what’s happening in the world. That was what was behind the announcement last fall that Twitter would become a strategic partner with IBM – to help Twitter monetize its data and for IBM to generate unique real-time business solutions.

Roughly six months into the partnership, IBM has taken the offensive to demonstrate that the new partnership is generating real business and tangible use cases. We sat down for some off the record discussions with IBM, Twitter, and several customers and prospects ahead of today’s announcements.

The obvious low-hanging fruit is customer experience. We wrote this in midflight; before boarding, we had a Twitter exchange with United regarding whether we’d be put on another flight if our plane – delayed for a couple of hours with software trouble (yes… software) – was going to get cancelled (the story had a happy ending). Businesses are already using Twitter – that’s not the question. Instead, it’s whether there are other analytics-driven use cases – sorta like the type of thing we used to talk about with CEP, but real and not theoretical.

We had some background conversations with IBM last week ahead of today’s announcements. They told us of some engagements that they’ve booked during the first few months of the Twitter initiative. What’s remarkable is they are very familiar use cases, where Twitter adds another verifying data point.

An obvious case is mobile carriers – this being the beachfront real estate of telco. As mobile embeds itself in our lives, there is more at stake for carriers who fear churn, and even more so the reputational damage that can come when defecting customers cry out about bad service publicly over social media. Telcos already have real-time data; they have connection data from their operational systems and, because this is mobile, location data as well. What’s kind of interesting to us is IBM’s assertion that what’s less understood is the relationship between tweets and churn – as we already use Twitter, we thought those truths were self-evident. You have a crappy connection, the mobile carrier has the data on what calls, texts, or web access were dropped, and if the telco already knows its customers’ Twitter handles, it should be as plain as day what the relationship is between tweets and potential churn events. IBM’s case here was that integrating Twitter with data that was already available – connections, weather, cell tower traffic, etc. – helped connect the dots. IBM makes the claim that correlating Twitter with weather data alone could improve the accuracy of telco churn models by 5.

Another example drawn from early engagements is employee turnover. Now, unless an employee has gotten to the point where they’d rather take this job and shove it, you’d think that putting your gripes out over the Twitter feed would be a career-limiting move. But the approach here was more indirect: look at consumer businesses and correlate customer Twitter activity with locations where employee morale is sagging, or look at the Twitter data to deduce that staff loyalty is flagging.

A more obvious use case was in the fashion industry. IBM is adapting another technology from its labs – psycholinguistic analysis (a.k.a. what are you really saying?) – to conduct a more nuanced form of sentiment analysis of tweets. For this engagement, a fashion industry firm employed this analysis to gain more insight into why different products sold or not.

Integrating Twitter is just another piece of the puzzle when trying to decipher signals from the market. It’s not a case of blazing new trails; indeed, sentiment analysis has become a well-established discipline for consumer marketers. The data from Twitter is crying out to be added to the mix of feeds used for piecing together the big picture. IBM’s alliance with Twitter is notable in that both are putting real skin in the game for productizing the insights that can be gained from Twitter feeds.

It’s not a criticism to say this, but incorporating Twitter is evolutionary, not revolutionary. That’s true for most big data analytics – we’re just expanding and deepening the window to solve very familiar problems. The data is out there – we might as well use it.

Strata 2015 post mortem: Does the Hadoop market have a glass ceiling?

The move of this year’s west coast Strata HadoopWorld conference to the same venue as Hadoop Summit gave the event a bit of a mano a mano air: who can throw the bigger, louder party?

But show business dynamics aside, the net takeaway from these events is looking at milestones in the development of the ecosystem. Indeed, the bulk of our time was spent “speed dating” with third-party tools and applications that are steadily addressing the white space in the Big Data and Hadoop markets. While our sampling is hardly representative, we saw growth, not only from the usual suspects from the data warehousing world, but also from a growing population of vendors who are aiming to package machine learning algorithms, real-time streaming, and more granular data security, along with new domains such as entity analytics. Databricks, inventor of Spark, announced in a keynote a new DataFrames initiative to make it easier for R and Python programmers accustomed to working on laptops to commandeer and configure clusters to run their computations using Spark.

Immediately preceding the festivities, the Open Data Platform initiative announced its existence, and Cloudera announced its $100 million 2014 numbers – ground we already covered. After the event, Hortonworks held its first quarterly financial call. Depending on how you count, it did nearly $50 million in business last year, but billings, which signify the pipeline, came in at $87 million. Hortonworks closed an impressive 99 new customers in Q4. There’s little question that Hortonworks has momentum, but right now, so does everybody. We’re at a stage in the market where a rising tide is lifting all boats; even the smallest Hadoop player – Pivotal – grew from token revenues to our estimate of $20 million in Hadoop sales last year.

At this point, there’s nowhere for the Hadoop market to go but up, as we estimate the paid enterprise installed base (at roughly 1,200 – 1,500) to be just a fraction of the potential base. Or in revenues: our estimate of $325 million for 2014 (Hadoop subscriptions and related professional services, but not including third-party software or services), up against $10 billion+ for the database market. Given that Hadoop is just a speck compared to the overall database market, what is the realistic addressable market?

Keep in mind that while Hadoop may shift some data warehouse workloads, the real picture is not necessarily a zero sum game, but the commoditization of the core database business. Exhibit One: Oracle’s recent X5 engineered systems announcement designed to meet Cisco UCS at its commodity price point. Yes, there will be some contention, as databases are converging and overlapping, competing for many of the same use cases.

But the likely outcome is that organizations will use more data platforms and grow accustomed to paying more commodity prices – whether that is through open source subscriptions or cloud pay-by-the-drink (or both). The value-add will increasingly come from housekeeping tools (e.g., data security; access control and authentication; data lineage and audit for compliance; cluster performance management and optimization; lifecycle and job management; query management and optimization in a heterogeneous environment).

The takeaway here is that the tasks normally associated with the care and feeding of a database, not to mention the governance of data, grow far more complex when superseding traditional enterprise data with Big Data. So the Hadoop subscription business may only grow so far, but that will be just the tip of the iceberg regarding the ultimate addressable market.