Apache Arrow: Lining up the ducks in a row… or column

As we noted a couple of years back, data is getting bigger and fast data is getting faster because of the steadily declining cost of infrastructure. Nowhere has that been more apparent than with in-memory and Flash storage. For instance, after SAP HANA yanked the in-memory database out of its formerly specialized niche, IBM, Oracle, and Teradata one-upped it with in-memory columnar add-ons to their core platforms. In the NoSQL world, where Aerospike debuted with its Flash-based operational database, use of in-memory and Flash storage is no longer unusual. And while in-memory processing is not the only advantage of the Spark compute engine, the Apache project would not have caught on at its wildfire pace were memory still cost-prohibitive.

But the not-so-subtle problem with all this is that silicon-based storage is not simply a faster version of disk. Disk is optimized for retrieval – especially in cases where you have different temperatures of data and want to tier, stripe, or shard it so that the most frequently accessed data is on the most accessible (typically outer) part of the platter. With Flash, you want to minimize writes (which are inefficient) and rewrites/updates (which can shorten Flash lifespan). With memory, it’s all about lining up similar data types and functions so you can, in effect, operate on them as a form of pipeline that keeps the chip working at an even pace.

That’s why, as we stated a couple years back, the architecture of storage impacts the architecture of the database and the application(s) running on it.

Now, compound the issue with the CPU, which has both storage and processing ramifications. There is chip-level cache that gives you an even faster form of on-board storage for highly volatile processes, and then there is the compute itself, where pipelining techniques can cram multiple actions into a single compute cycle. These factors were not that critical when systems were disk-based, but when you start to level the playing field between storage and CPU performance, the slightest perturbation can add serious speed bumps that could defeat the whole purpose of going to Flash or in-memory storage.

So the attention paid to Spark is a reflection of the importance given to speed in processing Big Data. When you can run what-if scenarios and trend analytics in seconds or minutes instead of hours (or days), machine learning becomes useful. But as Spark, Flink, and the seemingly endless array of interactive SQL, streaming, and graph engines emerge, each of them has to solve the problem of literally lining up its data for in-memory processing. This is a thankless task and one that offers zilch added value. When every engine builds its own interfaces for converting data for in-memory instantiation, it is, in effect, running in place.

That’s where a new project, Apache Arrow, comes in. It is led by the CTO and co-founder (formerly of MapR) of Dremio, a stealth startup that will be doing something with Apache Drill. The team has taken a momentary detour to build a standard interface and protocol for marshaling data and, literally, lining it up in an easily consumable columnar format for processing. So all the 4-byte integers are processed together, all the join operations are processed together, and so on. It’s built around the SIMD (single instruction, multiple data) parallelism engineered into Intel Xeon processors.
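To make "lining it up in a columnar format" concrete, here is a minimal sketch using only Python's standard library. It is not Arrow's actual format or API; the sample records and the `to_columns` helper are hypothetical, and the `array` type code `"i"` is a C int, which is typically (but not guaranteed to be) 4 bytes.

```python
from array import array

# Hypothetical sample records in row-oriented form: each row mixes types.
rows = [
    (1, "east", 250),
    (2, "west", 125),
    (3, "east", 300),
]

def to_columns(rows):
    """Pivot row tuples into one sequence per column (illustrative helper)."""
    ids, regions, amounts = zip(*rows)
    return {
        "id": array("i", ids),          # C ints packed back to back in one buffer
        "region": list(regions),        # strings kept as a plain list in this sketch
        "amount": array("i", amounts),  # same-typed values sit contiguously
    }

columns = to_columns(rows)

# A scan over one column touches only that column's contiguous buffer,
# rather than stepping over ids and region strings row by row.
total = sum(columns["amount"])
```

The point of the layout is that every value in `columns["amount"]` has the same type and width and sits next to its neighbors, which is the kind of arrangement that cache-friendly and SIMD-style processing depends on.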

Significantly, this project has the potential to go viral in a similar, but quieter, way to Spark. It’s backed by a who’s who list of over 20 Apache committers from Dremio, MapR, Cloudera, Hortonworks, Salesforce, DataStax, Twitter, and AWS. Coming out of the gate, it’s going to be supported by Spark, Storm, Drill, Impala, Pig, Phoenix, Hive, Cassandra, Pandas, Parquet, HBase, and Kudu. And the project, just coming out of stealth today, is not even bothering with incubation. The Apache community has already ratified it as a new top-level project.

Arrow will provide a standard for columnar in-memory processing and for interchange of data, using an in-memory representation that can be shared by multiple engines (e.g., Spark and Impala) residing on the same node. That means that each database or compute engine does not have to have its own dedicated slice of memory – in-memory columnar storage can now be a common pool, meaning that users can get more use out of the same in-memory footprint, reducing infrastructure costs. The project will have implementations for C, C++, Python, and Java, with more languages to come.
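The idea of several consumers reading one buffer, rather than each holding its own copy, can be sketched with Python's built-in `memoryview`. This is only an analogy within a single process, not how Arrow shares data between engines; the two "engine" functions and the sample values are hypothetical.

```python
from array import array

# One columnar buffer standing in for the common in-memory pool.
shared_column = array("i", [250, 125, 300, 75])

def engine_a_sum(view):
    """Hypothetical first engine: aggregates the column."""
    return sum(view)

def engine_b_max(view):
    """Hypothetical second engine: finds the largest value."""
    return max(view)

# Each memoryview references the same underlying memory; nothing is copied.
view_a = memoryview(shared_column)
view_b = memoryview(shared_column)

total = engine_a_sum(view_a)
largest = engine_b_max(view_b)

# A write to the underlying buffer is visible through both views,
# which shows that they share storage rather than holding copies.
shared_column[0] = 500
```

Because both views point at one buffer, the memory footprint stays the same however many readers attach to it, which is the cost argument made above.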

The project will succeed as long as it keeps its aspirations contained – open source projects get more widely adopted when they don’t usurp the unique IP of commercial products. So, while the Arrow project may develop some sample functions (e.g., joins) for manipulating data within the column, we suggest that it not grow overly ambitious.