As the most active project (by number of committers) in the Apache Hadoop open source community, it’s not surprising that Spark has drawn much excitement and expectation. At the core, there are several key elements to Spark’s appeal:
1. It provides a much simpler and more resilient programming model than MapReduce – for instance, it can recover from failed nodes mid-job rather than requiring the entire run to be restarted from scratch.
2. It takes advantage of DRAM, significantly accelerating compute jobs – and, because of that speed, enabling more complex, chained computations (which could be quite useful for simulations or orchestrated computations based on if/then logic).
3. It is extensible. Spark provides a unified computing model that lets you mix and match complex iterative MapReduce-style computation with SQL, streaming, machine learning and other processes on the same node, with the same data, on the same cluster, without having to invoke separate programs. It’s akin to what Teradata is doing with the SNAP framework to differentiate its proprietary Aster platform.
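The resilience point in (1) rests on Spark’s lineage model: instead of checkpointing every intermediate result, Spark records how each partition was derived and recomputes only the lost pieces. A toy sketch of that idea in plain Python – purely illustrative, with invented class and method names that bear no relation to Spark’s actual internals:

```python
# Toy illustration of lineage-based recovery (NOT Spark's real API).
# A "partition" remembers the chain of transformations that produced it,
# so a lost result can be rebuilt without rerunning the whole job.

class LineagePartition:
    def __init__(self, source, transforms=()):
        self.source = list(source)          # the durable input data
        self.transforms = list(transforms)  # ordered record of how to derive the result
        self.data = None                    # materialized result (may be lost)

    def map(self, fn):
        # Record the transformation instead of applying it eagerly.
        return LineagePartition(self.source, self.transforms + [fn])

    def compute(self):
        # Replay the recorded lineage from the source data.
        result = self.source
        for fn in self.transforms:
            result = [fn(x) for x in result]
        self.data = result
        return result

part = LineagePartition([1, 2, 3]).map(lambda x: x * 10).map(lambda x: x + 1)
print(part.compute())   # [11, 21, 31]
part.data = None        # simulate a node failure losing the materialized result
print(part.compute())   # recovered by replaying lineage: [11, 21, 31]
```

The point is that the source data plus the recorded transformations are enough to rebuild any lost piece on demand, rather than restarting the run from scratch.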
So we were quite pleased to see Spark Summit making it to New York and have the chance to get immersed in the discussion.
Last fall, Databricks, whose founders created Spark from their work at UC Berkeley’s AMPLab, announced their first commercial product: a Spark Platform-as-a-Service (PaaS) cloud for developing Spark programs. We view the Databricks Cloud as a learning tool and incubator for developers to get up to speed on Spark without having to worry about marshaling compute clusters. The question on everybody’s mind at the conference was when the Databricks Cloud would go GA. The answer, like everything Spark, comes down to scalability – in this case, being capable of handling highly concurrent, highly spiky workloads. The latest word is later this year.
The trials and tribulations of the Databricks Cloud are quite typical for Spark – it’s dealing with scale, whether in the number of users (concurrency) or the data (when data sets get too big for memory and must spill to disk). At a meetup last summer where we heard a trip report from the previous Spark Summit, the key pain point cited was the need for more graceful spilling to disk.
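Graceful spilling is, at heart, external processing: when working state outgrows memory, partial results go to disk and get merged afterward. Here is a stdlib-only toy sketch of the idea – a word count that spills partial tallies to temp files. The function and its threshold are invented for illustration and reflect nothing of Spark’s actual shuffle/spill machinery:

```python
import os
import pickle
import tempfile
from collections import Counter

# Toy external aggregation: count words while capping in-memory state,
# spilling partial tallies to disk and merging them at the end.

def spilling_word_count(words, max_keys_in_memory=2):
    spill_files = []
    counts = Counter()
    for w in words:
        counts[w] += 1
        if len(counts) > max_keys_in_memory:
            # Memory "full": spill the partial tally to disk and reset.
            f = tempfile.NamedTemporaryFile(delete=False)
            pickle.dump(counts, f)
            f.close()
            spill_files.append(f.name)
            counts = Counter()
    # Merge the in-memory remainder with every spilled partial result.
    for path in spill_files:
        with open(path, "rb") as f:
            counts.update(pickle.load(f))
        os.remove(path)
    return dict(counts)

result = spilling_word_count(["a", "b", "a", "c", "b", "a"])
print(result)   # counts: a=3, b=2, c=1
```

The hard engineering – the part Spark users were asking for – is making that transition happen smoothly and automatically, without the job falling over when the spill threshold is crossed.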
Memory-resident compute frameworks of course are nothing new. SAS for instance has its LASR Server, which it contends is far more robust in dealing with concurrency and compute-intensive workloads. But, as SAS’s core business is analytics, we expect that they will meet Spark halfway to appeal to Spark developers.
While Spark is often thought of as a potential replacement for MapReduce, in actuality we believe that MapReduce will be as dead as the mainframe – which is to say, not dead at all. While DRAM is, in the long run, getting cheaper, it will never be as cheap as disk. And while, ideally, you shouldn’t have to comb through petabytes of data on a routine basis (that’s part of defining your query and identifying the data sets), there are going to be analytic problems involving data sets that won’t completely fit in memory. Not to mention that not all computations (e.g., anything that requires developing a comprehensive model) will be suited to real-time or interactive computation. Not surprisingly, most of the use cases we came across at Spark Summit were more about “medium data,” such as curating data feeds, real-time fraud detection, or heat maps of NYC taxi cab activity.
While dealing with scale is part of the Spark roadmap, so is making Spark more accessible. At this stage, the focus is on developers, through APIs for popular data science languages such as Python and R, and through frameworks such as Spark SQL and Spark DataFrames.
With Hadoop and NoSQL platform providers already competing with their own interactive SQL frameworks, the obvious question is why the world needs yet another one. In actuality, Spark SQL doesn’t compete with Impala, Tez, BigSQL, Drill, Presto or the rest. First, it’s not only about SQL, but about querying any data that carries an explicit schema. The use case for Spark SQL is running SQL in line with other computations, such as chaining SQL queries to streaming or machine learning runs. As for DataFrames, Databricks is simply adapting the distributed DataFrame technology already implemented in languages such as Java, Python, and R to access data sets organized as tables with columns containing typed data.
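That “in line” point is easier to see in code. As a loose, stdlib-only analogy – using sqlite3 as a stand-in for Spark SQL, with an invented taxi-ride schema – the value is feeding a declarative query directly into ordinary procedural code within the same program:

```python
import sqlite3
import statistics

# Analogy for mixing SQL with downstream computation in one program.
# sqlite3 stands in for Spark SQL here; the table and columns are invented.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (borough TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO rides VALUES (?, ?)",
    [("Manhattan", 12.5), ("Manhattan", 9.0), ("Queens", 30.0)],
)

# Step 1: a declarative SQL query selects and shapes the data...
fares = [row[0] for row in conn.execute(
    "SELECT fare FROM rides WHERE borough = 'Manhattan'")]

# Step 2: ...and feeds straight into procedural/statistical code,
# with no separate program or export step in between.
print(statistics.mean(fares))   # 10.75
conn.close()
```

In Spark the second step could just as easily be a streaming or machine learning stage operating on the same in-memory data, on the same cluster – which is the differentiator the Spark SQL pitch rests on.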
Spark’s extensibility is both blessing and curse. Blessing in that the framework can run a wide variety of workloads, but curse in that developers can drown in abundance. One of the speakers at Summit called for package management so developers won’t stumble over their expanding array of Spark libraries and wind up reinventing the wheel.
Making Spark more accessible to developers is a logical step in growing the skills base. But ultimately, for Spark to have an impact with enterprises, it must be embraced by applications – and in those scenarios, the end user doesn’t care what engine runs under the hood. A few such applications and tools are already emerging, like ClearStory Data for curating data feeds, or ZoomData, an emerging Big Data BI tool that has some unique IP (likely to stay proprietary) for handling scale and concurrency.
There’s no shortage of excitement and hype around Spark. The teething issues (e.g., scalability, concurrency, package management) are rather mundane. The hype – that Spark will replace MapReduce – is ahead of the reality; as we’ve previously noted, there’s a place for in-memory computing, but it won’t replace all workloads or make disk-based databases obsolete. And while Spark hardly has a monopoly on in-memory computing, the accessibility and economics of an open source framework on commodity hardware open up lots of possibilities for drawing a skills base and enabling new forms of analytics. But let’s not get too far ahead of ourselves.