Flattening of Big Data architecture has become something of an epidemic. The largeness of Big Data has forced the middle and bottom layers of the stack – analytics and data – to converge; the accessibility of SQL married to the scale of Hadoop has driven a similar result. And now we’re seeing the top, middle, and in some cases lower levels of the stack converging with BI and transformation atop an increasingly ambitious data tier.
It began with the notion of making BI more self-service; give ordinary people the ability to make ad hoc queries without waiting for IT to clear its backlog. Tools like Tableau, QlikTech, and Spotfire have popularized visualization with intuitive front ends backed typically by some form of data caching mechanism for materializing views PDQ. Originally these approaches may have amounted to putting lipstick on a pig (e.g., big, ugly, complex SQL databases), but in many cases, these tools are packing more back end functionality to not simply paint pictures, but quickly assemble them. They are increasingly embedding their own transformation tools. While they are eliminating the ETL tier, they are definitely not eliminating the “T” – although that message tends to get blotted out by marketing hyperbole. That in turn is leading to the next step, which is elevating cache to becoming full-bore in-memory databases. So now you’ve collapsed, not only the ETL middle tier, but the data back end tier. We’re seeing platforms that would otherwise be classed Advanced SQL or NewSQL databases, like SiSense.
That phenomenon is also working its way on the NoSQL side where we see Platfora packaging, not only an in-memory caching tier for Hadoop, but also the means to marshal and transform data and views on the fly.
This is not simply a tale of flattening architecture for its own sake. The ramifications are basic changes to the analytics workflow and lifecycle. Instead of planning your data structures and queries ahead of time, generate schema and views on demand. Flattening the architecture facilitates this new style.
Traditionally with SQL – both for data warehousing and transaction systems – the process was completely different. You modeled the data and specified the tables at design time based on your knowledge of the content of the data and the queries and reports you were going to generate.
As you might recall, NoSQL was a reaction to the constraints of imposing a schema on the database at design time. By contrast, NoSQL did not necessarily do away with structure; it simply allowed the process to become more flexible. Collect the data and when it’s time to harvest it, explore it, discover the problem, and then derive the structure. And because in most NoSQL platforms, you still retain the data in raw form, you can generate a different schema as the nature of the problem, business challenge, or content of the data itself changes. Just run another series of MapReduce or similar processes to generate new views. Nonetheless, this view of flexible schema was borne with assumptions of a batch processing environment.
What’s changed is the declining cost of silicon-based storage: Flash (SSD) and memory (DRAM). That’s allowed those cute SQL D-I-Y visualization tools to morph into in-memory data platforms, because it was now cheap enough to gang terabytes of memory together. Likewise, it has cleared the way for Oracle and SAP to release in-memory platforms. And on the NoSQL side, it is making the notion of dynamic views from Hadoop thinkable.
We’re only at the beginning of the great rethink of analytic data views, schematizing processes, and architectural refactoring. It fires a shot across the bow to traditional BI players who have built their solutions on traditional schema at design rather than run time; in their case, it will require some significant architectural redesign. The old way will not disappear, as the contents of core end-of-period reporting and similar processes will not go away. There will always be a need for data warehouses and BI/reporting tools that provide repeatable, baseline query and reporting. And the analytic and data protection/housekeeping functions that are provided by established platforms will continue to be in demand. Astute BI and DW vendors should consider these new options as additive; they will be challenged by upstarts offering highly discounted pricing. For established vendors, the key is emphasizing the value add while they provide the means for taking advantage of the new, more flexible style of schema on demand.
Sadly, as we’re at the beginning of a new era of dynamic schema and dynamic analytics, there is also a lot of noise, like the dubious proposition that we can eliminate ETL. Folks, we are eliminating a tier, not the process itself. Even with Hadoop, when you analyze data, you inevitably end up forming it into a structure so you can grind out your analytics.
Disregard the noise and hype. You’re not going to replace your data warehouses for routine, mandated processes. But there will be new analytics will become more organic. It’s not simply a phenomenon in the Hadoop world, but with SQL as well. Your analytics infrastructure will flatten, and your schema and analytics will grow more flexible and organic.