Industrializing Spark

This is the first of two pieces summarizing our takeaways from the recent Spark Summit East.

Given the 1,000+ contributors to the Apache Spark project, it shouldn’t be surprising that development is pacing in dog years. Last year, Spark exploded as the emerging fact of life for bringing Fast Data velocity to Big Data, courtesy of a critical mass of commercial endorsements underscored by IBM’s bear hug at mid-year. The Spark practitioner community has been highly successful speaking to itself – Spark would not have become the most active and fastest-ramping Apache project were it not for grassroots interest that has translated into action, and it would not have piqued IBM’s attention were it just a small clique of developers.

But with momentum building behind Spark and related projects (almost a couple hundred of which use Spark, at last count), it’s time to deal with the reality of taking Spark to the enterprise. For practitioners, we’ve been harping on the need to explain the benefits of Spark-based analytics in business terms. Those benefits can be summarized in two words: Smart Analytics. Machine learning can provide the assist for sifting through torrents of data and helping the business ask the right questions.

The corollary is that the Spark engine, and the management infrastructure for running it, has to become ready for prime time. It’s time to industrialize the running of Spark. That will grow even more critical, not just as data analysts and data scientists write programs, but as commercial software tools and applications embed Spark.

If you are implementing Spark standalone – we’ve already weighed in on that – you’re going to have to reinvent all the measures associated with running a data processing platform: security, workload management, and systems management.

But regardless of whether you run Spark standalone or under a data platform or cloud service with its own management and security infrastructure, what do you do about the plumbing of running Spark operations on an ongoing basis, serving many masters? At Spark Summit East last week, a team from Bloomberg gave a glimpse of what organizations will encounter: building a registry of RDDs and DataFrames so that runtimes would not have to be recreated from scratch each time analysts want to tackle specific problems. The registry, which is also meant to store valuable lineage metadata on the provenance of the data or real-time stream, was something Bloomberg had to invent because there is nothing off-the-shelf yet for managing frequently used RDDs. Our take is that within the year, you’ll see ISVs introducing solutions to manage your Spark compute artifacts – not just RDDs or DataFrames, but also Datasets and the new constructs for Structured Streaming feeds. And tools and applications that embed Spark will similarly have to manage these artifacts under the hood.
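To make the idea concrete, here is a minimal sketch of what such a registry might look like, written against Spark’s Scala API. The DataFrameRegistry object, its method names, and the sample path are our own illustration of the pattern – cache a dataset once, record its provenance, and hand the same cached instance to later jobs – not Bloomberg’s actual implementation.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.collection.concurrent.TrieMap

// Hypothetical sketch of a named-DataFrame registry: cache a DataFrame once,
// note where it came from, and serve the cached instance to later jobs
// instead of rebuilding it from source each time.
object DataFrameRegistry {
  private case class Entry(df: DataFrame, source: String, registeredAt: Long)
  private val entries = TrieMap.empty[String, Entry]

  // Register a DataFrame under a name, caching it and recording its provenance.
  def register(name: String, df: DataFrame, source: String): DataFrame = {
    val cached = df.cache()
    entries.put(name, Entry(cached, source, System.currentTimeMillis()))
    cached
  }

  // Reuse the registered DataFrame if present; otherwise build and register it.
  def getOrRegister(name: String, source: String)(build: => DataFrame): DataFrame =
    entries.get(name).map(_.df).getOrElse(register(name, build, source))

  // Simple lineage lookup: where did this registered dataset come from?
  def provenance(name: String): Option[String] = entries.get(name).map(_.source)
}

object RegistryExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("registry-sketch")
      .master("local[*]")
      .getOrCreate()

    // The first caller pays the cost of reading and caching; later callers reuse it.
    val trades = DataFrameRegistry.getOrRegister("trades", "s3://bucket/trades.parquet") {
      spark.read.parquet("s3://bucket/trades.parquet")
    }
    trades.groupBy("symbol").count().show()
  }
}
```

A production version would also have to handle eviction, access control, and persistence of the lineage metadata – which is exactly the kind of plumbing we expect ISVs to start packaging.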

Bloomberg’s RDD registry is just the tip of the iceberg. As you industrialize Spark, there will be issues relating to managing which Spark workloads get priority, which ones get first dibs on memory, and how to optimize workloads to fit into available memory. These are issues not for the core Spark project to solve, but for the ISV community to address with solutions – the sketch below shows how far Spark’s built-in knobs take you today.
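The example below uses Spark’s existing scheduling and memory settings – fair-scheduler pools and the unified memory fraction – to route one class of jobs ahead of another. The pool names, file path, and numbers are illustrative assumptions, and the pools themselves would have to be defined in a separate fairscheduler.xml allocation file; the point is that today this is manual configuration rather than managed tooling.

```scala
import org.apache.spark.sql.SparkSession

// Sketch of prioritizing workloads with Spark's built-in knobs. Pool names
// ("reporting", "adhoc"), the allocation-file path, and memory settings are
// illustrative; the pools' weights and minimum shares live in fairscheduler.xml.
object WorkloadPriorityExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("priority-sketch")
      .master("local[*]")
      // Switch from the default FIFO scheduler to fair scheduling across jobs.
      .config("spark.scheduler.mode", "FAIR")
      // Point at an allocation file that defines pools, weights, and minShares.
      .config("spark.scheduler.allocation.file", "/etc/spark/fairscheduler.xml")
      // Cap how much executor memory goes to execution/storage vs. user code.
      .config("spark.memory.fraction", "0.6")
      .getOrCreate()

    val sc = spark.sparkContext

    // Jobs submitted from this thread land in the (assumed higher-weight) reporting pool.
    sc.setLocalProperty("spark.scheduler.pool", "reporting")
    spark.range(1000000).selectExpr("sum(id)").show()

    // Exploratory work can be routed to a lower-weight pool so it yields resources.
    sc.setLocalProperty("spark.scheduler.pool", "adhoc")
    spark.range(1000000).filter("id % 7 = 0").count()

    spark.stop()
  }
}
```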