That’s one of the headlines of a newly released Databricks survey that you should definitely check out. Because Spark only requires a JVM to run, there’s been plenty of debate on whether you really need to run it on Hadoop, or whether Spark will displace it altogether. Technically, the answer is no. To run Spark, all you need is a JVM installed on the cluster, or a lightweight cluster manager like Apache Mesos. It’s the familiar argument about why bother with the overhead of installing and running a general-purpose platform if you only have a single-purpose workload.
Actually, there are reasons, if security or data governance are necessary, but hold that thought.
According to the Databricks survey, which polled nearly 1500 respondents online over the summer, nearly half are running Spark standalone, with the other 40% running under YARN (Hadoop) and 11% on Mesos. There’s a strong push for dedicated deployment.
But let’s take a closer look at the numbers. About half the respondents are also running Spark on a public cloud. Admittedly, running in the cloud does not necessarily automatically equate with standalone deployment. But there’s a lot more than coincidence in the numbers given that popular cloud-based Spark services from Databricks, and more recently, Amazon and Google, are (or will be) running in dedicated environments.
And consider what stage we’re at with Spark adoption. Commercial support is barely a couple years old and cloud PaaS offerings are much newer than that. The top 10 sectors using Spark are the classic early adopters of Big Data analytics (and, ironically in this case, Hadoop): Software, web, and mobile technology/solutions providers. So the question is whether the trend will continue as Spark adoption breaks into mainstream IT, and as Spark is embedded into commercial analytic tools and data management/data wrangling tools (which it already is).
This is not to say that running Spark standalone will become just a footnote to history. If you’re experimenting with new analytic workloads – like testing another clustering or machine learning algorithm, dedicated sandboxes are great places to run those proofs-of-concepts. If you have specific types of workloads, there has long been good business and technology cases for running them on the right infrastructure; if you’re running a compute-intensive workloads, for instance, you’ll probably want to run it on servers or clusters that are compute- rather than storage-heavy. And if you’re running real-time, operational analytics, you’ll want to run it on hardware that has heavily bulked up on memory.
Hardware providers like Teradata, Oracle, and IBM have long offered workload-optimized machines, while cloud providers like Amazon offer arrays of different compute and storage instances that clients can choose for deployment. There’s no reason why Spark should be any different – and that’s why there’s an expanding marketplace of PaaS providers that are offering Spark-optimized environments.
But if Spark dedicated deployment is to become the norm rather than the exception, it must reinvent the wheel where it comes to security, data protection, lifecycle workflows, data localization, and so on. The Spark open source community is busy addressing many of the same gaps that are currently challenging the Hadoop community (just that the Hadoop project has a 2-year head start). But let’s assume that the Spark project dots all the i’s and crosses all the t’s to deliver the robustness that is expected of any enterprise data platform. As Spark workloads get productionalized, will your organization really want to run them in yet another governance silo?
Note: There are plenty of nuggets in the Databricks survey beyond Hadoop. Recommendation systems, log processing, and business intelligence (an umbrella category) are the most popular uses. The practitioners are mostly data engineers and data scientists – suggesting that adoption is concentrated among those with new generation skills. But while advanced analytics and real-time streaming are viewed by respondents as the most important Spark features, paradoxically, Spark SQL is the most used Spark component. While new bells and whistles are important, at the end of the day, accessibility from and integration with enterprise analytics trump all.