Big Data 2015-2016: A look back and a look ahead

Quickly looking back
2015 was the year of Spark.

If you follow Big Data, you’d have to be living under a rock to have missed the Spark juggernaut. The extensive use of in-memory processing has helped machine learning go mainstream, because the speed of processing enables the system to quickly detect patterns and provide actionable intelligence. It’s surfaced in data prep/data curation tools, where the system helps you get an idea of what’s in your big data and how it fits together, and in a new breed of predictive analytics tools that are now, thanks to machine learning, starting to become prescriptive. Yup, Cloudera brought Spark to our attention a couple of years back as the eventual successor to MapReduce, but it was IBM’s endorsement, backed by a commitment of 3,500 developers and a $300 million investment in tool and technology development, that planted the beachhead for Spark computing to pass from early adopter to enterprise. We believe that will mostly happen through tools that embed Spark under the covers. The game isn’t fully won yet; issues of scalability and security persist, but there’s little question Spark is here to stay.
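To make the in-memory point concrete, here is a minimal PySpark sketch (our illustration, not anything from the vendors named above) of why caching matters for machine learning: an iterative algorithm such as k-means re-scans the same training data on every pass, so keeping it in memory avoids repeated disk reads. The HDFS path and feature layout are hypothetical.

```python
# Minimal sketch: iterative ML over an in-memory (cached) RDD.
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="inmemory-ml-sketch")

# Hypothetical input: one comma-separated feature vector per line.
points = (sc.textFile("hdfs:///data/features.csv")
            .map(lambda line: [float(x) for x in line.split(",")])
            .cache())  # keep parsed vectors in memory across iterations

# Each of the (up to) 20 iterations re-reads the cached RDD from RAM
# rather than from disk -- the speedup that helped ML go mainstream on Spark.
model = KMeans.train(points, k=5, maxIterations=20)
print(model.clusterCenters)

sc.stop()
```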

We also saw continued overlap and convergence in the tectonic plates of databases. Hadoop became more SQL-like, and if you didn’t think there were enough SQL-on-Hadoop frameworks, this year we got two more, from MapR and Teradata. It underscored our belief that there will be as many flavors of SQL on Hadoop as there are in the enterprise database market.

And while we’re on the topic of overlap, there’s the unmistakable trend of NoSQL databases adding SQL faces: Couchbase’s N1QL, Cassandra/DataStax’s CQL, and most recently, the SQL extensions for MongoDB. It reflects the reality that, while NoSQL databases emerged to serve operational roles, there is a need to run some lightweight analytics on them – not to replace data warehouses or Hadoop, but to add some inline analytics as you are handling live customer sessions. Also pertinent to overlap is the morphing of MongoDB, which has been the poster child for the lightweight, developer-friendly database. Like Hadoop, MongoDB is no longer defined by its storage engine, but by its developer tooling and APIs. With the 3.0 release, the storage engines became pluggable (the same path trod by MySQL a decade earlier). With the just-announced 3.2 version, the write-friendlier WiredTiger replaces the original MMAP as the default storage engine (meaning you can still use MMAP if you override the factory settings).
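As a hedged sketch of the “inline analytics on an operational store” idea (an illustration we’ve added, not a vendor example), the snippet below runs a small aggregation directly against live session documents in MongoDB rather than shipping them to a warehouse first. The sessions collection and its fields are hypothetical.

```python
# Sketch: lightweight, in-place analytics on an operational MongoDB store.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
sessions = client["shop"]["sessions"]  # hypothetical database/collection

# Roll up active customer sessions by region -- a small aggregation run
# inline on the operational database, not a warehouse-scale query.
pipeline = [
    {"$match": {"status": "active"}},
    {"$group": {"_id": "$region",
                "sessions": {"$sum": 1},
                "avg_cart_value": {"$avg": "$cart_value"}}},
    {"$sort": {"sessions": -1}},
]
for row in sessions.aggregate(pipeline):
    print(row)
```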

A year ago, we expected streaming, machine learning, and search to become the fastest growing Big Data analytic use cases. It turns out that machine learning was the hands-down winner last year, but we’ve also seen quite an upsurge of interest in streaming, thanks to a perfect storm-like convergence of IoT and mobile data use cases (which epitomize real time) with technology opportunity (open source has lowered barriers for developers, enterprises, and vendors alike, while commodity scale-out architecture provides the economical scaling to handle torrents of real-time data). Open source is not necessarily replacing proprietary technology; proprietary products offer the polish (e.g., ease of use, data integration, application management, and security) that is either lacking from open source products or requires manual integration. But open source has injected new energy into a field that was formerly more of a complex solution looking for a problem.

So what’s up in 2016?

A lot… but three trends pop out at us.

1. Appliances and cloud drive the next wave of Hadoop adoption.
Hadoop has been too darn hard to implement. Even with the deployment and management tools offered with packaged commercial distributions, implementation remains developer-centric and is best undertaken by teams experienced with DevOps-style continuous integration. The difficulty of implementation was not a show-stopper for early adopters (e.g., Internet firms that invent their own technology, digital media and adtech firms that thrive on advanced technology, and capital markets firms that compete on being bleeding edge), or for early enterprise adopters (innovators from the Global 2000). But it will be for the next wave, which lacks the depth and sophistication of IT skills and resources that the trailblazers have.

The wake-up call came when we heard that Oracle’s Big Data Appliance, which barely registered on the map during its first couple of years of existence, saw a significant upsurge in sales among the company’s European client base. Considered in conjunction with continued healthy growth in Amazon’s cloud adoption, it dawned on us that the next wave of Hadoop adoption will be driven by simpler paths: either via appliance or via cloud. This is not to say that packaged Hadoop offerings won’t further automate deployment, but cloud and appliances are the straightest paths to a more black-box experience.

2. Machine learning becomes a fact of life with analytics tools. And more narratives, fewer dashboards.
Machine learning is already a checklist item with data preparation tools, and we expect the same to happen with analytics tools this year. Until now, the skills threshold for taking advantage of machine learning has been steep. There are numerous techniques to choose from: first you identify whether you already know what type of outcome you’re looking for, then you choose between approaches such as linear regression models, decision trees, random forests, clustering, anomaly detection, and so on to solve your problem. It takes a statistical programmer to make that choice. Then you have to write the algorithm, or use tools that prepackage those algorithms for you, such as those from H2O or Skytree. The big nut to crack will be how to apply these algorithms and interpret their results.
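The sketch below (ours, with synthetic data, using scikit-learn rather than the tools named above) shows that choice in miniature: if you already know the outcome you’re predicting, you reach for a supervised model such as a random forest; if you don’t, you reach for an unsupervised technique such as clustering.

```python
# Sketch: the supervised vs. unsupervised fork described above, on toy data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(200, 4)  # 200 synthetic observations, 4 features

# Case 1: the outcome is known (labels exist) -> supervised learning,
# e.g. a random forest classifier.
y = (X[:, 0] + X[:, 1] > 1).astype(int)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print("predicted labels:", clf.predict(X[:5]))

# Case 2: no known outcome -> unsupervised learning, e.g. k-means
# clustering to discover structure in the data.
km = KMeans(n_clusters=3, random_state=0).fit(X)
print("cluster assignments:", km.labels_[:5])
```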

But we expect to see more of these models packaged under the hood. We’ve seen some cool tools this past year, like Adatao, that combine natural language query for business end users with an underlying development environment for R and Python programmers. We’re seeing tooling that puts all this more inside the black box, combining natural language querying with the ability to recognize signals in the data, guide the user on what to query, and automatically construct narratives or storyboards, as opposed to abstract dashboards. Machine learning plays a foundational role in generating such guided experiences. We’ve seen varying bits and pieces of these capabilities in offerings such as IBM Watson Analytics, Oracle Big Data Discovery, and Amazon QuickSight – and in the coming year, we expect to see more.

3. The Data Lake enters the agenda.
The Data Lake, the stuff of debate over the past few years, starts becoming reality with early enterprise adopters. The definition of the data lake is in the eye of the beholder – we view it as a governed repository that acts as the default ingest point and store for raw data, and the resting place for aged data that is retained online for active archiving. It’s typically not the first use case for Hadoop, and shouldn’t be: you shouldn’t build a repository until you know how to use the underlying platform and, in the case of the data lake, how to work with big data. But as the early wave of enterprise adopters grows comfortable with Hadoop in production serving more than a single organization, planning for the data lake is a logical follow-on step. It’s not that we’ll see full adoption in 2016 – Rome wasn’t built in a day. But we’ll start seeing more scrutiny of data management, building on the rudimentary data lineage capabilities currently available with Hadoop platforms (e.g., Cloudera Navigator, Apache Atlas) and as part of data wrangling tools. Data lake governance is a work in progress; there is much white space to fill in around lifecycle management/data tiering, data retention, data protection, and cost/performance optimization.