Spark 2.0: Walking the tightrope

Here’s the second post summarizing our takeaways from the recent Spark Summit East.

In April or May, we’ll see Spark 2.0. The focus is on filling gaps, boosting performance, and refactoring to nip API sprawl in the bud.

Rewinding the tape: in 2015, the Spark project added new entry points beyond Resilient Distributed Datasets (RDDs). We saw DataFrames, a schema-based data API that borrowed constructs familiar to Python and R developers. Besides opening Spark to SQL developers (who could write analytics against database-like tabular representations) and to BI tools, the DataFrame API also leveled the playing field between Scala (the native language of Spark) and R, Python, Java, and Clojure via a common API. But DataFrames are not as fast or efficient as RDDs, so Datasets were recently introduced to provide the best of both worlds: the efficiency of Spark’s native data objects, with the ability to surface them through a schema.
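To make the distinction concrete, here’s a minimal Scala sketch against the pre-2.0 (1.6-era) entry points, where RDDs, DataFrames, and Datasets are still separate constructs; the Person case class, the sample rows, and the local master setting are hypothetical placeholders for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

object ApiTour {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("api-tour").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc) // pre-2.0 entry point for DataFrames and Datasets
    import sqlContext.implicits._

    // RDD: the original low-level entry point, distributed objects with no schema the engine can see.
    val rdd = sc.parallelize(Seq(Person("Ann", 34), Person("Bo", 29)))

    // DataFrame: the same data as a schema-based, database-like table (SQL- and BI-friendly).
    val df = rdd.toDF()
    df.filter($"age" > 30).show()

    // Dataset (added in 1.6): typed objects plus a schema the engine understands.
    val ds = df.as[Person]
    ds.filter(_.age > 30).show()

    sc.stop()
  }
}
```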

The Spark 2.0 release will consolidate the DataFrame and Dataset APIs into one: DataFrame becomes, in effect, the row-level construct of Dataset. Together, they will be positioned as Spark’s default interchange format and its richer API, with more semantics than the low-level RDD.
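In code, the consolidation looks roughly like this. A minimal Spark 2.0-style sketch, assuming the new SparkSession entry point: DataFrame becomes simply an alias for Dataset[Row], and .as[T] flips the same data into a typed view (the Person case class and sample rows are hypothetical).

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}

case class Person(name: String, age: Int)

object UnifiedApi {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("unified-api").master("local[*]").getOrCreate()
    import spark.implicits._

    // In 2.0, DataFrame is just Dataset[Row]: the untyped, row-level view of the data...
    val df: Dataset[Row] = Seq(Person("Ann", 34), Person("Bo", 29)).toDF()

    // ...while .as[T] exposes the very same data as a typed Dataset, so one API covers both styles.
    val people: Dataset[Person] = df.as[Person]
    people.filter(_.age > 30).show()

    spark.stop()
  }
}
```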

If you want ease of use, go with Datasets; if feeds and speeds are the goal, that’s where RDDs fit in. And that’s where the next enhancement comes in. Spark 2.0 adds the first tweaks (code generation) to the recently released Tungsten, which aims to replace the 20-year-old JVM’s object and memory management with a more efficient mechanism for managing memory and CPU. That’s a key strategy for juicing Spark performance, and maybe one that will make Dataset performance good enough. The backdrop is that with in-memory processing and faster networks (10 GbE is becoming commonplace), the CPU has become the bottleneck. By eliminating the overhead of JVM garbage collection, Tungsten hopes to even the score with storage and network performance.
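Most of Tungsten’s gains arrive without code changes, but you can opt into its explicit off-heap memory manager and see where code generation kicks in. A minimal sketch, assuming Spark 2.0 behavior: spark.memory.offHeap.* are the relevant settings, and explain() marks operators fused by whole-stage code generation with an asterisk; the toy aggregation is only a placeholder.

```scala
import org.apache.spark.sql.SparkSession

object TungstenPeek {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tungsten-peek")
      .master("local[*]")
      // Manage memory explicitly off-heap instead of leaning on the garbage-collected JVM heap.
      .config("spark.memory.offHeap.enabled", "true")
      .config("spark.memory.offHeap.size", "536870912") // 512 MB, in bytes
      .getOrCreate()
    import spark.implicits._

    val agg = spark.range(0, 1000000).toDF("id")
      .groupBy(($"id" % 10).as("bucket"))
      .count()

    // Operators fused by whole-stage code generation show up starred (*) in the physical plan.
    agg.explain()
    agg.show()

    spark.stop()
  }
}
```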

The final highlight of Spark 2.0 is Structured Streaming, which will extend Spark SQL and DataFrames (which in turn are becoming part of Dataset) with a streaming API. That will allow streaming and interactive steps, which formerly had to be orchestrated as separate programs, to run as one. And it makes streaming analytics richer: instead of running basic filtering or count actions, you will be able to run more complex queries and transforms. The initial release in 2.0 will support ETL; future releases will extend querying.
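As a rough sketch of the programming model, here is the word-count pattern from the 2.0 Structured Streaming API: a stream is treated as an unbounded DataFrame/Dataset, and an ordinary query runs over it incrementally. The socket source on localhost:9999 and the console sink are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

object StructuredWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("structured-wordcount").master("local[*]").getOrCreate()
    import spark.implicits._

    // Treat a stream of text lines as an unbounded DataFrame...
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // ...and run an ordinary DataFrame/Dataset query over it.
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // The query keeps running, updating its result incrementally as new data arrives.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```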

Beyond the 2.0 generation, Spark Streaming will finally get – catch this – streaming. Spark Streaming has been a misnomer; it is really Spark microbatching. By contrast, rival open source streaming engines such as Storm and Flink give you the choice of true streaming (processing exactly one event at a time) or microbatch. In the future, Spark Streaming will give you that choice as well. Sometimes you want pure streaming, where you need to resolve down to a single event; other use cases are better suited to microbatch, where you can do more complex processing such as data aggregations and joins. One other thing: Spark Streaming has never been known for low latency; at best it resolves batches of events in seconds rather than subseconds. When paired with Tungsten memory management, that should hopefully change.
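For contrast, here is a minimal sketch of today’s DStream-based Spark Streaming, which shows why “microbatching” is the more accurate label: work is chopped into fixed batch intervals (two seconds below) rather than processed one event at a time. The socket source and the interval are assumptions for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicrobatchWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("microbatch-wordcount").setMaster("local[2]")

    // Every incoming event is buffered into a fixed 2-second batch, then processed as a small RDD job.
    val ssc = new StreamingContext(conf, Seconds(2))

    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```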

Spark 2.0 walks a tightrope between adding functionality and consolidating APIs without breaking them. For now, it leaves open the question of all the housekeeping that will be necessary if you run Spark standalone. In the cloud, the cloud service provider should offer perimeter security, but for now more fine-grained access control will have to be implemented in the application or storage layers. Some pieces – such as managing the lifecycle of Spark compute artifacts like RDDs or DataFrames – may be the domain of third-party value-added tools. And if, as seems likely, Spark establishes itself as the successor to MapReduce for the bulk of complex Big Data analytics workloads, the challenge will be drawing the line between which innovations belong on the Apache side (to prevent fragmentation) and which sit better with third parties. We began that discussion in our last post. Later this year, we expect it to hit the forefront.