Category Archives: Machine learning/advanced analytics

Spark 2.0: Walking the tightrope

Here’s the second post summarizing our takeaways from the recent Spark Summit East.

In April or May, we’ll see Spark 2.0. The release aims to fill gaps, enhance performance, and refactor to nip API sprawl in the bud.

Rewinding the tape: in 2015 the Spark project added new entry points beyond Resilient Distributed Datasets (RDDs). We saw DataFrames, a schema-based data API that borrowed constructs familiar to Python and R developers. Besides opening Spark to SQL developers (who could write analytics against database-like tabular representations) and BI tools, the DataFrame API also leveled the playing field between Scala (the native language of Spark) and R, Python, and Java via a common API. But DataFrames are not as fast or efficient as RDDs, so more recently, Datasets were introduced to provide the best of both worlds: the efficiency of Spark’s native data objects with the ability to surface them through a schema.
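To make the contrast concrete, here is a minimal sketch in Scala of the two entry points as they stood before 2.0; the file names and column layout are illustrative assumptions, not taken from the Spark project.

```scala
// Pre-Spark-2.0 entry points: the low-level RDD vs. the schema-aware DataFrame.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("EntryPoints").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// RDD: Spark sees opaque objects; the structure lives in your code.
val adults = sc.textFile("people.csv")       // hypothetical input file
  .map(_.split(","))
  .map(cols => (cols(0), cols(1).toInt))     // (name, age)
  .filter { case (_, age) => age > 21 }

// DataFrame: a database-like tabular representation that SQL developers
// and BI tools can target directly.
val df = sqlContext.read.json("people.json") // hypothetical input file
df.filter(df("age") > 21).groupBy("age").count().show()
```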

The Spark 2.0 release will consolidate the DataFrame and Dataset APIs into one; DataFrame becomes, in effect, the row-level construct of Dataset. Together they will be positioned as Spark’s default interchange format and its richer API, carrying more semantics than the low-level RDD.
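In code, the consolidation looks roughly like the sketch below, assuming the 2.0 API as previewed; the Person case class and input file are hypothetical.

```scala
// Spark 2.0 sketch: DataFrame becomes an alias for Dataset[Row].
import org.apache.spark.sql.{Dataset, Row, SparkSession}

case class Person(name: String, age: Int)

val spark = SparkSession.builder.appName("Consolidated").master("local[*]").getOrCreate()
import spark.implicits._

val df: Dataset[Row] = spark.read.json("people.json")  // a DataFrame, i.e., Dataset[Row]
val ds: Dataset[Person] = df.as[Person]                // the same data, strongly typed
ds.filter(_.age > 21).show()
```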

If you want ease of use, go with Datasets; if feeds and speeds are the goal, that’s where RDDs fit in. And that’s where the next enhancement comes in. Spark 2.0 adds the first tweaks to the recently released Tungsten project (adding code generation), which aims to replace the 20-year-old JVM’s memory management with a more efficient mechanism for managing memory and CPU. That’s a key strategy for juicing Spark performance, and maybe one that will make Dataset performance good enough. The backdrop is that with in-memory processing and faster networks (10 GbE is becoming commonplace), the CPU has become the bottleneck. By eliminating the overhead of JVM garbage collection, Tungsten hopes to even the score with storage and network performance.
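The Dataset API taps Tungsten through Encoders, which map JVM objects into Spark’s compact binary format so operators can work on serialized bytes rather than garbage-collected objects. A rough sketch, with a hypothetical Event class:

```scala
// Sketch: an implicit Encoder describes Event's layout, letting Spark keep the
// data in its own compact binary representation instead of JVM objects.
import org.apache.spark.sql.SparkSession

case class Event(id: Long, value: Double)

val spark = SparkSession.builder.appName("TungstenSketch").master("local[*]").getOrCreate()
import spark.implicits._  // supplies Encoders for case classes

val events = spark.createDataset(Seq(Event(1L, 3.2), Event(2L, 7.9)))
events.filter(_.value > 5.0).show()  // executes against the binary format
```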

The final highlight of Spark 2.0 is Structured Streaming, which will extend Spark SQL and DataFrames (which in turn are becoming part of the Dataset API) with a streaming API. That will allow streaming and interactive steps, which formerly had to be orchestrated as separate programs, to run as one. And it makes streaming analytics richer; instead of running basic filtering or count actions, you will be able to run more complex queries and transforms. The initial release in 2.0 will support ETL; future releases will extend support for richer querying.
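A hedged sketch of the model: the same DataFrame/Dataset operations you would write interactively, applied to an unbounded input. The socket source, host, and port are illustrative stand-ins.

```scala
// Structured Streaming sketch: a continuous word count over a text stream.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StructuredSketch").master("local[*]").getOrCreate()
import spark.implicits._

val lines = spark.readStream
  .format("socket")             // illustrative source
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// The same query you could run against a static DataFrame, now run continuously.
val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()
```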

Beyond the 2.0 generation, Spark Streaming will finally get – catch this – streaming. Spark Streaming has been a misnomer, as it is really Spark microbatching. By contrast, rival open source streaming engines such as Storm and Flink give you the choice of streaming (processing exactly one event at a time) or microbatch. In the future, Spark Streaming will give you that choice as well. Sometimes you want pure streaming, where you need to resolve down to a single event; other use cases are better suited to microbatch, where you can run more complex processes such as data aggregations and joins. One other thing: Spark Streaming has never been known for low latency; at best it can resolve batches of events in seconds rather than subseconds. When paired with Tungsten memory management, that should hopefully change.
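For contrast with the structured sketch above, today’s DStream API makes the microbatch model explicit: the batch interval sets a floor on latency. Host and port are again illustrative.

```scala
// Spark Streaming today: the stream is chopped into fixed microbatches.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("MicrobatchSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(2))  // 2-second batches: the latency floor

val events = ssc.socketTextStream("localhost", 9999)  // illustrative source
events.count().print()  // emits one count per 2-second batch, not per event

ssc.start()
ssc.awaitTermination()
```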

Spark 2.0 walks a tightrope between adding functionality and consolidating APIs without breaking them. For now, it leaves open the question of all the housekeeping that will be necessary when running Spark standalone. In the cloud, the service provider should offer perimeter security, but for now more fine-grained access control will have to be implemented in the application or storage layers. There are some pieces – such as managing the lifecycle of Spark compute artifacts like RDDs and DataFrames – that may be the domain of third-party value-added tools. And if, as seems likely, Spark establishes itself as the successor to MapReduce for the bulk of complex Big Data analytics workloads, the challenge will be drawing the line between what innovations belong on the Apache side (to prevent fragmentation) and what sits better with third parties. We began that discussion in our last post. Later this year, we expect it to hit the forefront.

Big Data 2015-2016: A look back and a look ahead

Quickly looking back
2015 was the year of Spark.

If you follow Big Data, you’d have to be living under a rock to have missed the Spark juggernaut. Its extensive use of in-memory processing has helped machine learning go mainstream, because the speed of processing enables the system to quickly detect patterns and deliver actionable intelligence. It has surfaced in data prep/data curation tools, where the system helps you get an idea of what’s in your big data and how it fits together, and in a new breed of predictive analytics tools that are now, thanks to machine learning, starting to become prescriptive. Yup, Cloudera brought Spark to our attention a couple of years back as the eventual successor to MapReduce, but it was the endorsement of IBM, backed by a commitment of 3,500 developers and a $300 million investment in tool and technology development, that planted the beachhead for Spark to pass from early adopters to the enterprise. We believe that will mostly happen through tools that embed Spark under the covers. It’s not all smooth sailing for Spark; issues of scalability and security persist, but there’s little question it’s here to stay.

We also saw continued overlap and convergence among the tectonic plates of databases. Hadoop became more SQL-like, and if you didn’t think there were enough SQL-on-Hadoop frameworks, this year we got two more, from MapR and Teradata. It underscored our belief that there will be as many flavors of SQL on Hadoop as there are in the enterprise database market.

And while we’re on the topic of overlap, there’s the unmistakable trend of NoSQL databases adding SQL faces: Couchbase’s N1QL, Cassandra/DataStax’s CQL, and most recently, the SQL extensions for MongoDB. It reflects the reality that, while NoSQL databases emerged to serve operational roles, there is a need for some lightweight analytics on them – not to replace data warehouses or Hadoop, but to add inline analytics while you are handling live customer sessions. Also pertinent to overlap is the morphing of MongoDB, long the poster child for the lightweight, developer-friendly database. Like Hadoop, MongoDB is no longer known for its storage engine, but for its developer tooling and APIs. With the 3.0 release, the storage engines became pluggable (the same path trod by MySQL a decade earlier). With the just-announced 3.2 version, the write-friendlier WiredTiger replaces the original MMAP as the default storage engine (you can still use MMAP if you override the factory settings).

A year ago, we expected streaming, machine learning, and search to become the fastest-growing Big Data analytic use cases. It turns out that machine learning was the hands-down winner last year, but we’ve also seen quite an upsurge of interest in streaming, thanks to a perfect-storm convergence of IoT and mobile data use cases (which epitomize real time) with technology opportunity (open source has lowered barriers for developers, enterprises, and vendors alike, while commodity scale-out architecture provides the economical scaling to handle torrents of real-time data). Open source is not necessarily replacing proprietary technology; proprietary products offer the polish (e.g., ease of use, data integration, application management, and security) that is either lacking from open source products or requires manual integration. But open source has injected new energy into a field that was formerly more of a complex solution looking for a problem.

So what’s up in 2016?

A lot… but three trends pop out at us.

1. Appliances and cloud drive the next wave of Hadoop adoption.
Hadoop has been too darn hard to implement. Even with the deployment and management tools offered with packaged commercial distributions, implementation remains developer-centric and best undertaken with teams experienced with DevOps-style continuous integration. The difficulty of implementation was not a show-stopper for early adopters (e.g., Internet firms who invent their own technology, digital media and adtech firms who thrive on advanced technology, and capital markets firms who compete on being bleeding edge), or early enterprise adopters (innovators from the Global 2000). But it will be for the next wave, who lack the depth or sophistication of IT skills/resources of the trailblazers.

The wake-up call came when we heard that Oracle’s Big Data Appliance, which barely registered on the map during its first couple of years of existence, saw a significant upsurge in sales among the company’s European client base. Considered in conjunction with continued healthy growth in Amazon’s cloud adoption, it dawned on us that the next wave of Hadoop adoption will be driven by simpler paths: either appliance or cloud. This is not to say that packaged Hadoop offerings won’t further automate deployment, but the cloud and appliances are the straightest paths to a more black-box experience.

2. Machine learning becomes a fact of life with analytics tools. And more narratives, fewer dashboards.
Machine learning is already a checklist item with data preparation tools, and we expect the same to happen with analytics tools this year. Until now, the skills threshold for taking advantage of machine learning has been steep. There are numerous techniques to choose from: first you identify whether you already know what type of outcome you’re looking for, then you choose among approaches such as linear regression models, decision trees, random forests, clustering, anomaly detection, and so on to solve your problem. It takes a statistical programmer to make that choice. Then you have to write the algorithm, or use tools that prepackage those algorithms for you, such as those from H2O or Skytree. The big nut to crack is how to apply these algorithms and interpret the results.
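For a sense of what prepackaged algorithms look like in practice, here is a hedged sketch using Spark’s MLlib pipeline API (our choice for illustration; H2O and Skytree have their own APIs). The input file, column names, and numeric 0/1 label are all assumptions.

```scala
// Sketch: a prepackaged random forest via Spark MLlib's pipeline API.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("RFSketch").master("local[*]").getOrCreate()

// Hypothetical training data with numeric features and a 0/1 "label" column.
val data = spark.read.option("header", "true").option("inferSchema", "true").csv("training.csv")

// Assemble raw columns into the single feature vector MLlib expects.
val assembler = new VectorAssembler()
  .setInputCols(Array("income", "assets", "tenure"))
  .setOutputCol("features")

// The algorithm itself is prepackaged; the judgment is in choosing it.
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(50)

val model = new Pipeline().setStages(Array(assembler, rf)).fit(data)
model.transform(data).select("label", "prediction").show()
```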

But we expect to see more of these models packaged under the hood. We’ve seen some cool tools this past year, like Adatao, that combine natural language query for business end users with an underlying development environment for R and Python programmers. We’re seeing tooling that pushes more of this inside the black box, combining natural language querying with the ability to recognize signals in the data, guide the user on what to query, and automatically construct narratives or storyboards rather than abstract dashboards. Machine learning plays a foundational role in generating such guided experiences. We’ve seen varying bits and pieces of these capabilities in offerings such as IBM Watson Analytics, Oracle Big Data Discovery, and Amazon QuickSight – and in the coming year, we expect to see more.

3. Data Lake enters the agenda
The Data Lake, the stuff of debate over the past few years, starts becoming reality with early enterprise adopters. The definition of a data lake is in the eye of the beholder; we view it as a governed repository that acts as the default ingest point for raw data and the resting place for aged data retained online for active archiving. It’s typically not the first use case for Hadoop, and it shouldn’t be: you shouldn’t build a repository until you know how to use the underlying platform and, in the case of a data lake, how to work with big data. But as the early wave of enterprise adopters grows comfortable with Hadoop in production serving more than a single organization, planning for the data lake is a logical follow-on step. It’s not that we’ll see full adoption in 2016 – Rome wasn’t built in a day. But we’ll start seeing more scrutiny of data management, building on the rudimentary data lineage capabilities currently available with Hadoop platforms (e.g., Cloudera Navigator, Apache Atlas) and with data wrangling tools. Data lake governance is a work in progress; there is much white space to be filled in around lifecycle management/data tiering, data retention, data protection, and cost/performance optimization.

Data Scientists are people too

There’s been lots of debate over whether the data scientist position is the sexiest job of the 21st century. Despite the unicorn hype, spending a day with data scientists at the Wrangle conference, an event staged by Cloudera, was a surprisingly earthy experience. It wasn’t an event chock full of algorithms; instead, it was about the trials and tribulations of making data science work in a business. The issues were surprisingly mundane. And by the way, the brains in the room spoke perfectly understandable English.

It starts with questions as elementary as finding the data, and enough of it, to learn something meaningful. Or defining your base assumptions: a data scientist with a financial payments processor found that definitions of fraud were not as black and white as she (or anybody) would have expected. And assuming you’ve found those data sets and established some baseline truths, there are the usual growing pains of scaling infrastructure and analytics. What computes well in a 10-node cluster might have issues when you scale to many times that. Significantly, the hiccups can be logical as well as physical: if your computations have any interdependencies, surprises can emerge as the threads multiply.

But let’s get down to brass tacks. Like why run a complex algorithm when a simple one will do? For instance, when a flyer tweets about bad service, it’s far more effective for the airline to simply reply and ask the customer for their booking number (through private message) than to resort to elaborate graph analytics to establish the customer’s identity. And don’t just show data for the sake of it; there’s a good reason why Google Maps simply shows colored lines to highlight the best routes rather than a dashboard at each intersection showing what percentage of drivers turned left or went straight. When formulating queries or hypotheses, look outside your peer group to see if they make sense through other people’s eyes.

Data scientists face many of the same issues as developers at large. One speaker admitted resorting to Python scripts rather than heavier-weight frameworks like Storm or Kafka; the question in retrospect is how well those scripts are documented for future reference. Another spoke of the pain of scaling up infrastructure not designed for sophisticated analytics: in this case, a system built with Ruby scripting (not exactly well suited to statistical programming) on a Mongo database (not well suited to analytics), taking Band-Aid approaches (e.g., replicating the database nightly to a Hadoop cluster) before finally biting the bullet and rewriting the code to eliminate the need for wasteful data transfers. Yet another spoke of the difficulty of debugging machine learning algorithms that get too complex for their own good.

There are moral questions as well. Clare Corthell, who heads her own ML consulting firm, made an impassioned plea for data scientists to root out bias in their algorithms. Of course, the idea of any human viewing or querying data objectively is a literal impossibility: we’re all human, and we see things through our own mental lenses. In essence, it means factoring in human biases even in the most objective computational problems. For instance, the algorithms for online dating sites should factor in skews, such as Asian men tending to rate African American women more negatively than average; and loan approvals based on ‘objective’ metrics such as income, assets, and zip code can in effect perpetuate the same redlining practices that fair lending laws were supposed to prohibit.

Data science may be a much-hyped profession where supply is dwarfed by demand. We’ve long believed that there will always be a need for data scientists, but also that, for the large mass of enterprises, applications will start embedding data science. It’s already happening, thanks to machine learning providing a system assist to humans in BI tools and data prep/data wrangling tools. But at the end of the day, as much as they might be considered unicorns, data scientists face very familiar issues.