Category Archives: Big Data

Strata 2014 Part 2: Exploratory Analytics and the need to Fail Fast

Among the unfulfilled promises of BI and data warehousing was the prospect of analytic dashboards on the desk of everyman. History didn’t quite turn out that way – BI and data warehousing caught on, but only as the domain of elite power users who were able to create and manipulate dashboards, understand KPIs, and know what to ask for when calling on IT to set up their data marts or query environments. The success of Tableau and Qlik revealed latent demand for intuitive, highly visual BI self-service tools, and the feasibility of navigating data with lesser reliance on IT.

History has repeated itself with Big Data analytics and Hadoop platforms – except that we need even more specialized skills on this go round. Whereas BI required power users and DBAs, for Big Data it’s cluster specialists, Hadoop programmers, and data scientists. When we asked an early enterprise adopter at a Hortonworks-sponsored user panel back at Hadoop Summit as to their staffing requirements, they listed Java, R, and Python programmers.

Even if you’re able to find or train Hadoop programmers and cluster admins, and can even spot the spare underemployed data scientist, you’ll still face a gap in operationalizing analytics. Unless you’re planning to rely on an elite team of programmers, data scientists or statisticians, Big Data analytics will wind up in a more gilded version of the familiar data warehousing/BI ghetto.

This pattern won’t be sustainable for mainstream enterprises. We’ve gone on record that Big Data and Hadoop must become first class citizens in the enterprise. And that means mapping to the skills of the army you already have. No wonder that interactive SQL is becoming the gateway drug for Hadoop in the enterprise. That at least gets Big Data and Hadoop to a larger addressable practitioner base, but unless you’re simply querying longer periods of the same customer data that you held in the data warehouse, you’ll face new bottlenecks getting your arms around all that bigger and badder data. You’re probably going to be wondering:
• What questions do you ask when you have a greater variety of data to choose from?
• What data sets should you select for analysis when you have dozens being ingested into your Hadoop cluster?
• What data sets will you tell your systems admins or DBAs (yes, they can be retrained for schema-on-read data collections) to provision for your Hadoop cluster?

If you’re asking these questions, you’re not lost. Your team is likely acquiring new data sets to provide richer context to perennial challenges such as optimizing customer engagement, reducing financial risk exposure, improving security, or managing operations. Unless your team is analyzing log files with a specific purpose, chances are you won’t have the exact questions or know specifically which data sets you should pinpoint in advance.

Welcome to the world of Exploratory Analytics. This is where you iterate your queries, and identify which data sets to yield the answers. It’s different from traditional analytics, where the data sets, schema, and queries are already pre-determined for you. At exploratory phase, you look for answers that explains why your KPIs have changed, or whether you’re looking at the right KPIs at all. Exploratory analytics does not replace your existing regimen of analytic, query, or reporting – it complements it. Exploratory Analytics:
• Gives you the Big Picture; it shows you the forest. Traditional analytics gives you the Precise Picture, where you get down to the trees.
• May be used for quickly getting a fix on some unique scenario occurring, where you might run a query once and move on; or it can be used for recalibrating where you should do your core data warehousing analytics – which means that it is a preparatory stage for feeding new data to the data warehouse.
• Gives you the context for making decisions. Data warehousing analytics are where final decisions (for which your organization may be legally accountable) are made.

A constant refrain from the just-concluded Strata Hadoop World conference was the urgency of being able to fail fast. You are conducting a process of discovery where you are testing and retesting assumptions and hypotheses. The nature of that process is that you are not always going to be right, and in fact, if you are thorough enough in your discovery process, you won’t. That doesn’t mean that you don’t start out with a question or hypothesis – you are. But unlike conventional BI and data warehousing, your assumptions and hypotheses are not baked in from the moment that schema has been set in concrete.

At the other end of the scale, exploratory analytics should not degenerate into a hunting expedition. Like any scientific process, you need direction and ways for setting bounds on your experiments.

Exploratory analytics requires self-service, from data preparation through query and analysis. You can’t afford to wait for IT to transform your data and build query environments, and then repeat the process for the next stage of refining your hypothesis. It takes long enough to build a single data mart.

It starts with getting data. You must identify data sets and then transform them (schema on read doesn’t eliminate this step). It may involve searching externally for data sets or scanning the data sets that are already available from your organization’s portfolio of transaction or messaging systems, log files, or other sources; or that data might already be on your Hadoop cluster. You need to identify what’s in the data sets of interest, reconcile, and conduct the transformation, a process often characterized as data wrangling. This process is not identical to ETL, but rather, the precursor. You are getting the big picture and may not require the same degree of precision in matching and de-duplicating records as you would inside a data warehouse. You’re designing for queries that give you the big picture (is your organization on the right track) as opposed to the precise or exact picture (where you are making decisions that carry legal and/or financial accountability).

You will have to become self-sufficient in performing the wrangling and setting up the queries and get the results. You’ll need the system to assist you in this endeavor. As serial database entrepreneur and MIT professor Michael Stonebraker put it during a Strata presentation, when you have more than 20 – 25 data sources, it is not humanly possible to keep track of all of them manually; you’ll need automation to help track and organize data sets for you. You may need the system to assist you in selecting data sets, and you certainly need the system to help you determine how to correlate and transform those data sets into workable form. And to keep from reinventing the wheel, you’ll need a way to track and preferably collaborate with others regarding the wrangling process – e.g., your knowledge about data sets, transformations, queries, and so on.

Advances in machine learning are helping make the wild goose chase become manageable. Compared to traditional ETL tools, these offerings use a variety of techniques, such as capabilities to recognize patterns of data and identify what kind of data is in a column, calibrated with various training techniques where you download a sample set and “teach” the system, or provide prompted or unprompted feedback as to the correctness of the transform. Unlike traditional ETL tools, you can operate from a simple spreadsheet rather than have to navigate schema.

Emerging players like Trifacta, Paxata, and Tamr introduced techniques to data preparation and reconciliation; IBM has embraced these approaches with Data Refinery and Watson Analytics, while Informatica leverages machine learning with its Springbok cloud service; and we expect to hear from Oracle very soon.

The next step is data consumption. Where data is already formatted as relational tables, existing BI self-service visualization tools may suffice. But other approaches are emerging that deduce the story. IBM’s Watson Analytics can suggest what questions to ask and pre-populate a storyboard or infographic; ClearStory Data combines live blending of data from internal and external sources to generate interactive storyboards that similarly venture beyond dashboards.

For organizations that already actively conduct BI and analytics, the prime value-add from Big Data will be the addition of exploratory analytics at the front end of the process. Exploratory analytics won’t replace traditional BI query and reporting, as the latter is where the data and processes are repeatable. Exploratory Analytics allows organization to search deeper and wider for new signals or patterns; the results might in some cases be elevated to the data warehouse, but in other cases, may provide a background process that helps the organization get the bigger picture to understand whether it has the right business or operational strategy, whether it is asking the right questions, serving the right customers, or protecting against the right threats.

Strata 2014 Part 1: Hadoop, Bright Lights, Big City

If you’re running a conference in New York, there’s pretty much no middle ground between a large hotel and the Javits Center. And so this year, Strata Hadoop World made the leap, getting provisional access to a small part of the big conventional center to see if it could fill the place. That turned out to be a foregone conclusion.

The obvious question was whether Hadoop, and Big Data, had in fact “crossed the chasm” to become a mainstream enterprise IT market. In case you were wondering, the O’Reilly folks got Geoffrey Moore up on the podium to answer that very question.

For Big Data-powered businesses, there’s little chasm to cross when you factor in the cloud. As Moore put it, if you only need to rent capacity on AWS, the cost of entry is negligible. All that early adopter, early majority, late majority stuff doesn’t really apply. A social site has a business model of getting a million eyes or nothing, and getting there is a matter of having the right buzz to go viral – the key is that there’s scant cost of entry and you get to fail fast. Save that thought – because the fail fast principle also applies to enterprises implementing Big Data projects (we’ll explain in Part 2 of this post, soon to come).

Enterprise adoption follows Moore’s more familiar chasm model – at that we’re still at early majority where the tools of the trade are arcane languages and frameworks like Spark and Pig. But the key, Moore says, is for “pragmatists” to feel pain; that is the chasm to late majority, the point where conventional wisdom is to embrace the new thing. Pragmatists in the ad industry are feeling pain responding to Google; the same goes with media and entertainment sectors were even cable TV mainstays such as HBO are willing to risk decades-old relationships with cable providers to embrace pure internet delivery.

According to Cloudera’s Mike Olson, Hadoop must “disappear” to become mainstream. That’s a 180 switch as the platform has long required specialized skills, even if you ran an off-the-shelf BI tool against it. Connecting from familiar desktop analytics tools is the easy part – they all carry interfaces that translate SQL to the query language that can run on Hive, or on any of the expanding array of interactive-SQL-on-Hadoop frameworks that are making Hadoop analytics more accessible (and SQL on Hadoop a less painful experience).

Between BI tools and frameworks like Impala, HAWQ, Tez, Big SQL, Big Data SQL, Query Grid, Drill, or Presto, we’ve got the last mile covered. But the first miles, which involve mounting clusters, managing and optimizing them, wrangling the data into shape, and governing the data, are still works in progress (there is some good news regarding data wrangling). Tools that hide the complexity and applications that move the complexity under the hood are works in progress.

No wonder that for many enterprises, offloading ETL cycles was their first use of Hadoop. Not that there’s anything wrong with that – moving ETL off Teradata, Oracle, or DB2 can yield savings because you’ve moved low value workloads off platforms where you pay by footprint. Those savings can pay the bill while your team defines where it wants to go next,

We couldn’t agree with Olson more – Hadoop will not make it into the enterprise as this weird, difficult, standalone platform that requires special skills. Making a new platform technology like Hadoop “disappear” isn’t new — it’s been done before with BI and Data Warehousing. In fact, Hadoop and Big Data today are at the same point where BI and data warehousing were in the 1995 – 96 timeframe.

The resemblance is uncanny. At the time, data warehouses were unfamiliar and required special skills because few organizations or practitioners had relevant experience. Furthermore, SQL relational databases were the Big Data of their day, providing common repositories for data that was theoretically liberated from application silos (well, reality proved a bit otherwise). Once tools automated ETL, query, and reporting, BI and data warehousing in essence disappeared. Data Warehouses became part of the enterprise database environment, while BI tools became routine additions to the enterprise application portfolio. Admittedly, the promise of BI and Data warehousing was never completely fulfilled as analytic dashboards for “everyman” remained elusive.

Back to the original question, have Hadoop and Big Data gone mainstream? The conference had little troubled filling up the hall, and questions about economic cycles notwithstanding, shouldn’t have issues occupying more of Javits next year. We’re optimists based on Moore’s “pragmatist pain” criteria — in some sectors, pragmatists will have little choice but to embrace the Big Data analytics that their rivals are already leveraging.

More specifically, we’re bullish in the short term and long term, but are concerned over the medium term. There’s been a lot of venture funding pouring into this space over the past year for platform players and tools providers. Some players, like Cloudera, have well broken the billion-dollar valuation range. Yet, if you look at the current enterprise paid installed base for Hadoop, conservatively we’re in the 1000 – 2000 range (depending on how you count). Even if these numbers double or triple over the next year, will that be enough to satisfy venture backers? And what about the impacts of Vladimir Putin or Ebola on the economy over the near term?

At Strata we had some interesting conversations with members of the venture community, who indicated that the money pouring in is 10-year money. That’s a lot of faith – but then again, there’s more pain spreading around certain sectors where leaders are taking leaps to analyze torrents of data from new sources. But ingesting the data or pointing an interactive SQL tool (or streaming or search) at it is the easy part. When you’re getting beyond the enterprise data wall garden, you have to wonder if you’re looking at the right data or asking the right questions. In the long run, that will be the gating factor as to how, whether, and when analysis data will become routine in the enterprise. And that’s what we’re going to talk about in Part 2.

We believe that self-service will be essential for enterprises to successfully embrace Big Data. We’ll tell why in our next post.

Is SQL the Gateway Drug for Hadoop?

How much difference does a year make? Last year, Last year was the point where each Hadoop vendor was compelled to plant their stake in supporting interactive SQL. Cloudera Impala; Hortonworks’ Stinger (injecting steroids to Hive); IBM’s Big SQL; Pivotal’s HAWQ; MapR and Drill (or Impala available upon request); and for good measure, Actian turbocharging their Vectorwise processing engine onto Hadoop.

This year, the benchmarketing has followed: Cloudera Impala clobbering the latest version of Hive in its own benchmarks, Hortonworks’ response, and Actian’s numbers with the Vectorwise engine (rebranded Vortex) now native on Hadoop supposedly trumping the others. OK, there are lies, damn lies, and benchmarks, but at least Hadoop vendors feel compelled to optimize interactive SQL performance.

As the Hadoop stack gets filled out, it also gets more complicated. In his keynote before this year’s Hadoop Summit, Gartner’s Merv Adrian made note of all the technologies and frameworks that are either filling out the Apache Hadoop project – such as YARN – and those that are adding new choices and options, such as the number of frameworks for tiering to memory or Flash. Add to that, the number of interactive SQL frameworks.

So where does this leave the enterprises that comprise the Hadoop market? In all likelihood, dazed and confused. All that interactive SQL is part of the problem, but it’s also part of the solution.

Yes, Big Data analytics has pumped new relevancy to the java community, which now has something sexier than middleware to keep itself employed. And it’s provided a jolt to Python, which as it turns out is a very useful data manipulation language, not to mention open source R for statistical processing. And there are loads of data science programs bringing new business to Higher Ed computer science programs.

But we digress.

Java, Python and R will add new blood to analytics teams. But face it, no enterprise in its right mind is going to swap out its IT staff. From our research at Ovum, we have concluded that Big Data (and Hadoop) must become first class citizens in the enterprise if they are to gain traction. Inevitably, that means SQL must be part of the mix.

Ironically, the great interactive SQL rollout is occurring as something potentially far more disruptive is occurring: the diversification of data platforms. Hadoop and data warehousing platforms are each adding multiple personas. As Hadoop adds interactive SQL, SQL data warehouses are adding column stores, JSON/document style support, MapReduce style analytics.

But SQL is not the only new trick up Hadoop’s sleeve; there are several open source frameworks that promise to make real-time streaming analytics possible, not to mention search, and… if only the community could settle on some de facto standard language(s) and storage formats, graph. YARN, still in its early stages, offers the possibility of running multiple workloads concurrently on the same Hadoop cluster without the need to physically split it up. On the horizon are tools applying machine learning to take ETL outside the wall garden of enterprise data, not to mention BI tools that employ other approaches not easily implemented in SQL such as path analysis. Our research has found that the most common use cases for Big Data analytics are actually very familiar problems (e.g., customer experience, risk/fraud prevention, operational efficiency), but with new data and new techniques that improve visibility.

Therefore it would be a waste if enterprises only use Hadoop as a cheaper ETL box or place to offload some SQL analytics. Hopefully, SQL will become the gateway drug for enterprises to adopt Hadoop.

Postscript: Here’s the broader context for our thoughts: databases are converging. There are more platforms to run SQL than ever before. Here’s a link to our presentation at 20145 Hadoop Summit on how Hadoop, SQL, and NoSQL data platforms are converging.

Cloudera’s show of numbers

The announcement of Cloudera’s new $160 million venture funding almost looked too perfectly timed. It came midway during Cloudera’s first formal dog and pony show in front of industry analysts. And we’re not just talking about the usual suspects, but a broader, more sober crowd of doubters from across the IT spectrum: app development, IT infrastructure, database, and BI, where the consensus remains that Hadoop is not a database.

Unlike Hortonworks, Cloudera has not been afraid to ruffle feathers. It dares to offer a hybrid open source/proprietary model in a market born in open source. Or more importantly, announce a strategic Enterprise Data Hub path that potentially places it in competition with established data warehouse providers that might otherwise form logical partners. Cloudera’s enterprise data hub positioning is ambitious, auspicious, and for now, a concept leap. Hadoop is not a database, and it currently lacks enterprise-grade features for performance management, SLA conformance, security, and data governance. The emphasis is on “currently” as platform and practice are evolving rapidly; Hadoop will grow into a more robust platform that can compete for the role of hub.

There is little question that Hadoop is here to stay; Cloudera has drawn competition from Hortonworks, which positions itself as the 100% open source platform that is very OEM-friendly; MapR, whose implementation includes proprietary technology that gets the platform closer to the robustness and performance of databases; and IBM, which after a brief flirtation with Cloudera subsequently reiterated its positioning as the adult in the room. Meanwhile, Teradata, Oracle, Microsoft, and Amazon include Hadoop in their data stacks.

Cloudera hardly needed the capital as it already had $140 million in the bank. The new infusion jumps that to $300 million. More to the point, it includes a battery of firms, such as T. Rowe Price, who tend to be long-term investors, plus Michael Dell’s venture arm and Google Ventures as “strategic” backers. The company does not deny having IPO aspirations, but states that the new money gives it more flexibility on the timing.

Immediately following the announcement, we received several press queries as to whether Cloudera was for sale. In our view the most likely candidates would be Oracle (which resells Cloudera’s full platform as part of its Big Data Appliance, and just saw disappointing Q3 numbers) and newly privatized Dell. The common thread is that both are seeking engines to rekindle growth. But the $300 million in the bank inflates Cloudera’s valuation to the point that it would be a very, very expensive buy.

Nonetheless, there’s a lot of venture money floating around right now. And with Facebook’s $19 billion acquisition of a company that few ever heard of (except for hundreds of millions of casual subscribers like us who have the app but don’t use it), we have the makings of a venture capital bubble. As such, there is a flight to quality (invest in market leaders) for Tier 1 VCs. In the Big Data arena, players like Cloudera and MongoDB are perceived to be among them.

So we don’t believe that Cloudera is currently for sale. With Enterprise Data Hub, they are not claiming to replace data warehousing incumbents, but the pressure to move data storage and compute cycles onto the cheaper Hadoop platform is potentially quite threatening. (We believe that the incumbents must assert their value higher up the stack, such as with in-database analytic functionality, data governance, and query optimization.)

Whatever Cloudera’s next step (IPO or acquisition), their immediate goal is placing more facts on the ground with product and market share to raise the stakes on whatever transpires. That will inevitably include Cloudera making its own acquisitions – a skill that the company needs to learn – and likely diversification of the product line. At the analyst session, we viewed a demonstration of a Hadoop-based predictive analytics system that Cloudera uses as its nerve center for customer support; it’s a technology that could be generalized beyond Hadoop users.

$300 million in the bank may be a nice security blanket. But look at the state of adoption: Cloudera, which has had a multiyear jump in the market, counts a 10 – 12,000 installed base, plus or minus. But that boils down to about 350 paying subscribers (currently growing at about 40 – 50/monthquarter). Any market where the leader’s paid base numbers in the hundreds is either a niche segment or a very immature one. Obviously, Hadoop’s the latter, and as such, there are any number of potential disruptors that could surface on the road to mainstream adoption. For Cloudera and its rivals, it’s hardly game over.

Postscript: That $160 million was quickly dwarfed barely a week later with another $740 million infusion from Intel. Minus payments to earlier investors, we believe Cloudera netted about $500 million new funding in these couple weeks of March.

Hadoop vendor ecosystem gaining critical mass

Nature abhors a vacuum, and enterprises abhor platforms lacking tooling. Few enterprises have the developer resources or technology savvy of early adopters. For Hadoop, early adopters invented the technology; mainstream enterprises want to consume it.

On our just-concluded tour of Ovum enterprise clients across Australia/Pacific Rim, we found that the few who have progressed beyond discussion stage with Hadoop are doing so with technology staff accustomed to being on their own, building their own R programs and experimenting with embryonic frameworks like Mesos and YARN. Others are either awaiting more commercial tooling or still sorting out perennial data silos.

But Hadoop is steadily turning into a more “normal” software market. And with it, the vendor ecosystem vacuum is starting to fill in. It’s very much in line with what happened with BI and data warehousing back in the mid-1990s, when tools civilized what was a new architecture for managing data that originally required manual scripting.

So let’s take a brief tour.

Look at the exhibitor list for last month’s Strata HadoopWorld conference; as the largest such Big Data event in North America, it provides a good sampling of the ecosystem. Of nearly a hundred sponsors, roughly a third were tools encompassing BI and analytics, data federation and integration, data protection, and middleware.

There was a mix of the usual suspects who regard Hadoop as their newest target. SAS analytics takes an agnostic approach, bundling a distro of Hadoop in its LASR in-memory appliance; but SAS analytics can also execute inside Hadoop clusters, converting their HPC routines to MapReduce. MicroStrategy and other BI players are connecting to Hadoop in a variety of ways; they either provide suboptimal experience of having your SQL query execute in batch on Hadoop (which few use), or work through the data warehouse or Hadoop platform’s path for interactive SQL.

But there are also new players that are taking BI beyond SQL. Datameer and Platfora each provide their own operators (e.g., clustering, time series, decision trees, or other forms of analysis that would be laborious with SQL), presenting data either through spreadsheets or visualizations. ClearStory Data, which emerged from stealth at the show, provides a way to semantically crawl your own data and mash it with external data from publicly-available APIs. Players like Pivotal, Hadapt, SpliceMachineand CitusData are implementing or co-locating SQL data stores inside HDFS or HBase.

Significantly, some are starting to package forms of data science as well, with almost a half dozen machine learning programs. A necessary development, because there are just so many Hilary Masons to go around. Having people who have a natural feel for data, able to understand its significance, how to analyze it, and most importantly, its relevance, will remain few and far between. To use these tools, you’ll need to know what algorithms to use, but at least you don’t have to build them from scratch. For instance, 0xdata packages machine learning algorithms and combines it with a caching engine for high performance analytics on Hadoop. Skytree, packages classification, clustering, regression analyses, and most importantly, dimension reduction so you can see something meaningful after combing a billion nodes (points) and edges (relationships and context).

Security, a perennial weakness of Hadoop, is another area where you’re seeing vendor activity. Originally designed for trusted environments, Hadoop has long had the remote authentication piece down (Kerberos), because early adopters needed to gain access to remote clusters, and now there are incubating open source projects tackling the other two A’s of AAA – a gateway for access control (Knox) and a mechanism for role-based authorization (Sentry). Yes, there is also a specialized project for “cell” (data entity) level protection created for the NSA (Accumulo), which is being led by Sqrrl. But otherwise, we expect that vendor-based proprietary tools are going to be where most of the action is. Policy-based data protection, either about encryption or data masking, is now emerging via emerging players like Zettaset and Gazzang, with incumbents such as Protegrity and IBM extending support beyond SQL. Data lineage and activity monitoring (the first steps that could eventually lead to full-blown audit and selective read/write access) are emerging from IBM, Cloudera, and Revelytix.

We’ve long believed that for Big Data – and Hadoop – to gain traction with enterprises, that it must become a first class citizen. Among other things, it means Hadoop must integrate with the data center and, inevitably, apps that run against it. Incumbent data integration like Informatica, Talend, Syncsort, and Pentaho view Hadoop as yet another target. Originally touching Hadoop at arm’s length via the traditional ETL staging server topology, they have enabled their transformation tools to work natively inside Hadoop as the idea is a natural (Hadoop promises cheaper compute cycles for the task). Emerging players are adding new integration capabilities – Cirro for data federation; JethroData, for adding indexing to Hadoop; Kapow and Continuuity that are providing middleware for applications to integrate to Hadoop; and Appfluent for extending its data lifecycle management tool to support active archiving on Hadoop.

The subtext of the explosion of the ecosystem is Hadoop’s evolution into a more varied platform; to play anything more than a niche role in the enterprise (and draw a tooling and applications ecosystem), Hadoop must provide other processing options besides MapReduce.

Not surprisingly, interactive SQL on Hadoop became a prime battleground for vendors to differentiate. Cloudera introduced Impala, an MPP-based alternative to MapReduce that uses Hive metadata but bypasses the bottleneck of Hive processing (which had traditionally relied on MapReduce). Meanwhile, Hortonworks has led projects to make Hive better (read: faster), complementing it with a faster alternative to MapReduce. As noted above, several players are implementing SQL data stores directly inside Hadoop, while IBM has modified SQL to run against Hive.

The YARN (a.k.a., MapReduce 2.0) framework provides resource allocation (not full-blown resource management, however) that will allow multiple (read: MapReduce and alternative) workloads to run on Hadoop clusters. Hortonworks, which led development, announced a circle of partners who are supporting the new framework. Its rival, Cloudera, is taking a more measured approach; MapReduce and Impala workloads will be allocated under the YARN umbrella, but streaming or search won’t. Having been carved out of the original resource manager for pre-2.0 MapReduce, Cloudera doesn’t believe the new framework is suited for handling continuous workloads that don’t have starts or stops.

So, going forward, we’re seeing Hadoop emerge with an increasingly well-rounded third party ecosystem where little existed before. We expect that in the coming year, this will spread beyond tools to applications as well; we’ll see more of what the likes of Causata are doing.

So what role will Hadoop play?
For now, Hadoop remains a work in progress – data integration and lifecycle management, security, performance management, and governance practices and technologies are at early stages of evolution. At Strata, Facebook’s Ken Rudin made an eloquent plea for coexistence; they tracked against the wind by starting with Hadoop and learning that it was best for exploratory analytics while relational was best suited for queries with standard metrics (he’s pitched the same message to the data warehousing audience as well).

Cloudera’s Mike Olson, who had the podium right before Rudin, announced Cloudera’s vision of Hadoop as enterprise data hub: Hadoop is not just the logical landing spot for data, but also the place where you can run multiple workloads. Andrew Brust equates Cloudera’s positioning as making Hadoop become “the Ellis Island of data.”

So is Olson agreeing or arguing with Rudin?

The context is that analytic (and some transactional) data platforms are taking on multiple personalities (e.g., SQL row stores adding column engines, file/HDFS data stores, JSON stores – in some cases alongside or in hybrid). All analytic data platforms are grabbing for multiple data types and running workloads. They are also vying to become the logical spot where analytics are choreographed – mixing and matching data sets on different platforms for running analytic problems.

Cloudera aims to compete, not just as another Hadoop platform, but as the default platform where analytic data lives. It doesn’t necessarily replace SQL enterprise data warehouses, but assumes more workloads requiring scale, inexpensive compute cycles, and the ability to run multiple types of workloads – not just MapReduce. SQL data warehouses aren’t standing still either, and in many cases are embracing Hadoop. Hadoop has the edge on cost of compute cycles, but pieces must fall into place to gain parity regarding service level management and performance, security, availability and reliability, and information lifecycle management. Looking ahead, we expect analytics to run on multiple platforms, with the center of gravity up for grabs.

Is the sky the limit for Flash and In-Memory Databases?

Big Data is getting bigger, and Fast Data is getting faster because of the continuing declining cost of all things infrastructure. Ongoing commoditization of powerful, multi-core CPU, storage media, and connectivity made scale-out Internet data centers possible, and with them, scale-out data platforms such as Hadoop and the new generation of Advanced SQL/NewSQL analytic data stores. Bandwidth is similarly going crazy; while the lack of 4G may make bandwidth seem elusive to mobile users, growth of bandwidth for connecting devices and things has become another fact taken for granted.

Conventional wisdom is that similar trends are impacting storage, and until recently, that was the Kool-Aid that we swallowed. For sure, the macro picture is that declining price and ascending density curves are changing the conversation where it comes to deploying data. The type of media on which you store data is no longer just a price/performance tradeoff, but increasingly an architectural consideration on how data is processed and applications that run on data are engineered. Bigger, cheaper storage makes bigger analytics possible; faster, cheaper storage makes more complex and functional applications possible.

At 100,000 feet, such trends for storage are holding, but dig beneath the surface and the picture gets more nuanced. And those nuances are increasingly driving how we design our data-driven transaction applications and analytics.

Cut through the terminology
But before we dive into the trends, let’s get our terminology straight, because the term memory is used much too loosely (does it mean DRAM or Flash?). For this discussion, we’ll stick with the following conventions:
• CPU cache is the memory on chip that is used for temporarily holding data being processed by the processor.
• DRAM memory is the fastest storage layer that sits outside the chip, and is typically parceled out in GBytes per compute core.
• Solid State Drive (SSD) based on Flash memory, is the silicon-based, faster substitute to traditional hard drives are typically sized at hundreds of GBytes (with some units just under a terabyte). But it is not as fast as DRAM.
• Hard disk, or “disk,” is the workhorse that now scales economically up to 1 – 3 TBytes per spindle.

So what’s best for which?
For hard drives, conventional wisdom has been that they keep getting faster and cheaper. Turns out, only the latter is true. The cheapness of 1- and 3-TByte drives has made scale-out Internet data centers possible, and with it, scale-out Big Data analytic platforms like Hadoop. Hard disk continues to be the medium of choice for large volumes of data because individual drives routinely scale to 1 – 3 TBytes. And momentary supply chain disruptions like the 2011 Thailand floods aside, the supply remains more than adequate. Flash drives simply don’t get as fat.

But if anything, hard drives are getting slower because it’s no longer worthwhile to try speeding them up. With Flash being at least 10 – 100x faster, there’s no way that disk will easily catch up even if the technology gets refreshed. Flash is actually pulling the rug out from under demand for 7200-RPM disks (currently the state of the art for disk). Not surprisingly, disk technology development has hit the wall.

Given current price trends, where Flash prices are expectedsome analysts expect Flash to reach parity with disk in the next 12 – 18 months (or maybe sooner), there will be less reason for your next transaction system to be disk-based. In fact there is good reason to be a bit skeptical on how soon supply of SSD Flash will ramp up adequately for the transaction system market; but SSD Flash will gradually make its way to prime time. Conversely, with disk likely to remain fatter in capacity than Flash, it will be best suited for active archiving that keeps older data otherwise bound for tape live; and for Big Data analytics, where the need is for volume. Nonetheless, the workhorse of large Hadoop, and similar disk-based Big Data analytic or active archive clusters will likely be the slower 5400 RPM models.

So what about even faster modes of storage? In the past couple years, DRAM memory prices crossed the threshold where it became feasible to deploy them for persistent storage rather than caching of currently used data. That cleared the way for the in-memory database (IMDB), which is often code word for all-DRAM memory storage.

In-memory databases are hardly new, but until the last 3 – 4 years they were highly specialized. Oracle TimesTen, one of the earliest commercial offerings, was designed for tightly-coupled, specialized transactional applications; other purpose-built in-memory data stores have existed for capital markets for at least a decade or more. But DRAM memory prices dropped to bring them into the enterprise mainstream. Kognitio opened the floodgates as it reincarnated its MOLAP cube and row store analytic platform to in-memory on industry-standard hardware just over 5 years ago; SAP put in-memory in the spotlight with HANA for analytics and transactional applications; followed by Oracle, which reincarnated TimesTen as Exalytics for running Oracle Business Intelligence Enterprise Edition (OBIEE) and Essbase.

Yet, an interesting blip happened on the way to the “inevitable” all in-memory database future: Last spring, DRAM memory prices stopped dropping. In part this was attributable to consolidation of the industry to fewer suppliers. But the larger driver was that the wisdom of crowds – e.g., that DRAM memory was now ready for prime time – got ahead of itself. Yes, the laws of supply and demand will eventually shift the trajectory of memory pricing. But nope, that won’t change the fact of life that, no matter how cheap, DRAM memory (and cache) will always be premium storage.

In-memory databases are dead, long live tiered databases
The sky is not the limit for DRAM in-memory databases. The rush to in-memory will morph into an expansion of data tiering. And actually that’s not such a bad thing: do you really need to put all of that data in memory? We think not.

IBM and Teradata have shunned all in-memory architectures; their contention is that the 80/20 rule should govern which data goes into memory. And under their breaths, the all in-memory database folks have fallbacks for paging data between disk and memory. If designed properly, this is not constant paging, but rather a process that only occurs for that rare out-of-range query. Kognitio has a clever pricing model where they don’t charge you for the disk, but just for the volume of memory. As for HANA, disk is designed into the system for permanent offline storage, but SAP quietly adds that it can also be utilized for paging data during routine operation. Maybe SAP shouldn’t be so quiet about that.

There’s one additional form of tiering to consider for highly complex analytics: it’s the boost that can come from pipelining computations inside chip cache. Oracle is looking to similar techniques for further optimizing upcoming generations of its Exadata database appliance platform. It’s a technique that’s part of IBM’s recent BLU architecture for DB2. High-performance analytic platforms such as SiSense also incorporate in-chip pipelining to actually reduce balance of system costs (e.g., require less DRAM).

It’s all about balance of system
Balance of system is hardly new, but until recently, it meant trading off CPU, bandwidth with tiers of disk. Application and database design in turn focused on distributing or sharding data to place the most frequently accessed data on the disk or portions of disk that could be accessed the fastest. New forms of storage, including Flash and DRAM memory, add a few new elements to the mix. You’ll still configure storage (along with processor and interconnects) for the application and vice versa, but you’ll have a couple new toys in your arsenal.

For Flash, it means fast OLTP applications that could add basic analytics, such as what Oracle’s recent wave of In-Memory Applications promise. For in-memory, that would dictate OLTP applications with even more complex analytics and/or what-if simulations embedded in line, such as what SAP is promising with its recently-introduced Business Suite and CRM applications on HANA.

For in-memory, we’d contend that for most cases, configurations for keeping 100% of data in DRAM will remain overkill. Unless you are running a Big Data analytic problem that is supposed to encompass all of the data, you will likely work with just a fraction of the data. Furthermore, IBM, Oracle, and Teradata are incorporating data skipping features into their analytic platforms that deliberately filter irrelevant data so it is not scanned. There a many ways to speed processing before using the fast storage option.

Storage will become an application design option
Although we’re leery about hopping the 100% DRAM in-memory bandwagon, smartly deployed, in-memory or DRAM could truly transform applications. When you eliminate the latency, you can embed complex analytics in-line with transactional applications, enable the running of more complex analytics, or make it feasible for users to run more what-if simulations to couch their decisions.

Examples include transaction applications that differentiate how to fulfill orders from gold, silver, or bronze-level customers based on levels of services and cost of fulfillment. It could help mitigate risk when making operational or fiduciary decisions by allowing the running of more permutations of scenarios. It could also enhance Big Data analytics by tiering the more frequently used data (and logic) in memory.

Whether to use DRAM or Flash will be a function of degree of data volume and problem complexity. No longer will inclusion of storage tiers be simply a hardware platform design decision; it will also become a configuration decision for application designers as well.

The Odd Couple: Hadoop and Data Security

Until now, security wasn’t a term that would normally be associated with Hadoop. Hadoop was originally conceived for a trusted environment, where the user base was confined to a small elite set of advanced programmers or statisticians and the data was largely weblogs. But as Hadoop crosses over to the enterprise, the concern is that many enterprises will want to store pretty sensitive data.

So what stood for Hadoop security up till now was using Kerberos to authenticate users for firing up remote clusters if the local ones were already used; accessing specific MapReduce tasks; applying coarse-grained access control to HDFS files or directories; or providing authorization to run Hive metadata tasks.

Security is a headache for even the most established enterprise systems. But getting beyond concern about Trojan Horses and other incursions, there is a laundry list of capabilities that are expected with any enterprise data store. There are the “three As” of Authorization, Access, and Authentication; there is the concern over the sanctity of the data; and there may be requirements for restricting access based on workload type and capacity utilization concerns. Clearly there are many missing pieces for Hadoop waiting to fall into place.

For starters, Hadoop is not a monolithic system, but a collection of modules that correspond to open source projects or vendor proprietary extensions. Until now the only single mechanism has been Kerberos tokens for authenticating users to different Hadoop components, but only standalone mechanisms for admitting named users (e.g. to Hive metadata stores). You could restrict user access at file system or directory level, but for the most part such measures are only useful for preventing accidental erasures. And when it comes to monitoring the system (e.g., JobTracker, TaskTracker), you can do so from insecure web clients.

There are more missing pieces concerning data, as nothing was built into the Apache project. There was no standard way for encrypting data, and neither was there any way for regulating who can have what kinds of privileges with which sets of data. Obviously, that matters when you transition from low level weblog data to handling names, account numbers, account balances or other personal data. And by the way, even low level machine data such as weblogs or from other devices that detect location grow sensitive when associated with people’s identities.

There’s a mix of activity on the open source and vendor proprietary sides for addressing the void. There are some projects at incubation stage within Apache, or awaiting Apache approval, for providing LDAP/Active Directory linked gateways (Knox), data lifecycle policies (Falcon), and APIs for processor-based encryption (Rhino). There’s also an NSA-related project for adding fine-grained data security (Accumulo) based on Google BigTable constructs. And Hive Server 2 will add the LDAP/AD integration that’s current missing.

For the near term, not surprisingly, vendors are working to fill the void. Besides employing file system or directory-level permissions, you could physically isolate specific clusters, then superimpose perimeter security. Or you could use virtualization to similar effect, although that was not what it was engineered for. Zettaset Secure Data Warehouse includes role-based access control to augment Kerberos authentication, and applies a form of virtualization with its proprietary file system swap-out for HDFS; VMware’s Project Serengeti is intended to open source the API to virtualization.

Activity monitoring, what happens to data over its lifecycle – is supposed to be the domain of Apache Falcon, which is still in incubation. But vendors are beating the Apache project to the punch: IBM has extended its InfoSphere Guardium data lineage tool to monitor who is interacting with which data in HDFS, HBase, Hive, and MapReduce. Cloudera has recently introduced the Navigator module to Cloudera Manager to monitor data interaction, while Revelytix has introduced Loom, a tool for tracking the lineage of data feeding into HDFS. These offerings are the first steps — but not the last word — for establishing audit trails. Clearly, such capability will be essential as Hadoop gets utilized for data and applications subject to any regulatory compliance and/or enforcing any types of privacy protections over data.

Similarly, there are also moves active for masking or encrypting Hadoop data. Although there is nothing built into the Hadoop platform that prevents data masking or encryption, the sheer volume of data involved has tended to keep the task selective at best. IBM has extended InfoSphere Optim to provide selective data masking for Hadoop; Dataguise and Protegrity offers a similar capability for encryption at HDFS file level. Intel however holds the trump cards here with the recent Project Rhino initiative for open sourcing APIs for implementing hardware-based encryption, stealing a page from the SSL appliance industry (it forms the basis of Intel’s new, hardware-optimized Hadoop distro).

It’s in this context that Cloudera has introduced Sentry, a new open source project for providing role-based authorization for creating Hive Server 2 and Impala 1.1 tables. It’s a logical step for implementing the types of data protection measures for table creation and modification that are taken for granted in the SQL world, where privileges are differentiated by role for specific actions (e.g., select views and tables, insert to tables, transform schema at server or node level).

There are several common threads to Cloudera’s announcement. It breaks important new ground for adding features that gradually raise Hadoop security to the level of the SQL database world, yet also reveals what’s missing. For instance, the initial release lacks any visual navigation of HDFS files, Hive or Impala metadata tables, and data structures; it also lacks the type of time-based authorization that is common practice with Kerberos tickets. Most importantly, it lacks direct integration with Active Directory or LDAP (it can be performed manually), and it stores permissions locally. Sentry is only version 1; much of this wish list is already on Cloudera’s roadmap.

Correction: Sentry does leverage Kerberos for time windowing of access privileges and utilizes Linux utilities for tying in with directories such as Active Directory; we have not seen how well these OS functions are integrated into the Sentry engine. While Sentry leverages visual file navigation via the Apache Hue project GUI, it currently lacks a visual interface for configuring group/role permissions. Currently a standalone utility, we expect more integration with related capabilities, such as monitoring data activity, will be on Cloudera’s roadmap.

All these are good starts, but the core issue is that Sentry, like all Hadoop security features, remain point solutions lacking any capability for integrating to common systems of record for identity and access management – or end user directories. Sure, for many of these offerings, you can perform the task manually and wind up with piecemeal integration.

Admittedly, it’s not that single sign-on is ubiquitous in the SQL world. You can take the umbrella approach of buying all your end user and data security tools from your database provider, but there remains an active market for best in breed – that’s popular for enterprises seeking to address points of pain. But at least most best of breed SQL data security tools incorporate at least some provision for integration with directories or broader-based identity and access management systems.

Incumbents such as IBM, Oracle, Microsoft, and SAP/Sybase will (and in some cases already are) likely extend their protection umbrellas to Hadoop. That places onus on third parties – both veteran and startups – on the Hadoop side to get their acts together and align behind some security gateway framework allowing centralized administration of authentication, access, authorization, and data protection measures. Incubating projects like Apache Knox offer some promise, but again, the balkanized nature of open source projects threatens to make this a best of breed challenge as well as you have related functions — like what you do with data at different portions of the lifecycle – part of the domain of separate projects.

For Hadoop to go enterprise, the burden of integration must be taken off the backs of enterprise customers. For Hadoop security, the open source meritocracy will have to stop talking piecemeal projects before incumbents deliver their own faits accomplis: captive open source silos to their own security umbrellas.

Hadoop as your other data warehouse

Are data warehouses becoming victims of their own success? It’s hard to ignore the reality that the appetite for BI analytics has grown steadily – it’s one of the few enterprise IT software markets to continue enjoying steady growth over the past decade or more. While BI has not yet morphed into that long-promised democratic knowledge tool for the masses, there’s little question that it has become firmly embedded as a pillar of enterprise IT. Increasingly, analytics are being integrated with transactions, and there’s a movement for self-service BI that aims to address the everyman gap.

On the data side, there’s little question about the impact of data warehousing and BI. Enterprises have increasingly voracious appetite for data. And there are more kinds of data coming in. As part of our day job, we globally surveyed large enterprise data warehouse users (DWs over a terabyte) a couple years back and discovered over half of them were already routinely conducting text analytics in addition to conventionally structured data.

While SQL platforms have steadily increased scale and performance (it’s easy to forget that 30 years ago, conventional wisdom was that they would never scale to support enterprise OLTP systems), the legwork of operating data warehouses is becoming a source of bottlenecks. Data warehouses and transactional systems have traditionally been kept apart because their workloads significantly differed; they were typically kept at arm’s length with separate staging servers in the middle tier, where ETL operations were performed.

Yet, surging data volumes are breaking this pattern. With growing data volumes has come an emerging pattern where data and processing are brought together on the same platform. The “ELT” pattern was thus born based on the notion that collocating transformation operations inside the data warehouse would be more efficient as it would reduce data movements. The downside of ELT, however, is that data transformation compute cycles compete for finite resource with analytics.

Enter Hadoop. First developed for Internet companies for solving what were considered unique Internet problems (e.g., search indexes, ad optimization, gaming, etc.), enterprises are intrigued by the new platform for its ability to broaden the scope of their analytics to accommodate data outside the traditional purview of the data warehouse. Indeed, Hadoop allows organizations to broaden their analytic view. Instead of a traditional “360 view” of the customer that is largely transaction-based, add in social or mobile data to get a fuller picture. The same goes with machine data for operations, or weblog data for any Internet site.

But Hadoop can have another role in supplementing the data warehouse, performing the heavy lift at less cost. For starters, inserting a Hadoop platform to perform data transformation can offload cycles from the data warehouse to an environment where compute (and storage) are far less expensive. The software is far less expensive and the hardware is pure commodity (x86, standard Ethernet, and cheap 1 – 3 TB disk). And the system is sufficiently scalable that, should you need to commandeer additional resources to crunch higher transformation loads, they can be added economically by growing out your clusters.

Admittedly, there’s no free lunch; Hadoop is not free, as in free beer.
1. Like most open source software, you will pay for support. However, when compared to the cost of licensing commercial database software (where fees are related to installation size), the cost of open source should be far more modest.
2. In the short run, you will likely pay more for Hadoop skills because they are not (yet) as plentiful as SQL. This is a temporary state of affairs; as with Java 1999, the laws of supply and demand will eventually resolve this hurdle.
3. Hadoop is not as mature a platform as off-the-shelf SQL counterparts, but that is also a situation that time will eventually resolve.
4. Hadoop adds another tier to your analytic platform environment. If you embraced ELT, it does introduce some additional data movement; but then again, you may not have to load all of that data to a SQL target in the end.

But even if Hadoop is not free, it presents a lower cost target for shifting transform compute cycles. More importantly, it adds new options for analytic processing. With SQL and Hadoop converging, there are new paths for SQL developers to access data in Hadoop without having to learn MapReduce. These capabilities will not eliminate SQL querying to your existing data warehouse, as such platforms are well-suited for routine queries (with many of them carrying their own embedded specialized functions). But they supplement them by providing the opportunity to conduct exploratory querying that rounds out the picture and provides the opportunity to test drive new analytics before populating them to the primary data warehouse.

Emergence of Hadoop is part of a trend towards away from monolithic data warehousing environments. While enterprise DWs live on, they are no longer the sole focus of the universe. For instance, Teradata, which has long been associated with enterprise data warehousing, now promotes a unified data architecture that acknowledges that you’ll need different types of platforms for different workloads: operational data store, interactive analytics, and data deep dives. IBM, Oracle, and Microsoft are similarly diversifying their data platforms. Hadoop is just the latest addition, bringing capabilities for bulk transformation and exploratory analytics in SQL (and other) styles.

We will be discussing this topic in more detail later this week in a webinar sponsored by Cloudera. Catch our session “Hadoop: Extending your Data Warehouse” on Thursday, May 9, 2013 at 2:00pm ET/ 11:00a PT. You can register for the session here.

Does it matter if your SQL is bad?

That’s a paraphrase of a question raised by IDC analyst Carl Olofson at an IBM Big Data analyst event earlier this week. Carl’s question neatly summarized our impressions from the session, which centered around some big data announcements that IBM made. It concerned some new performance improvements that IBM has made that might render some issues with poorly formed SQL moot. More about that in a moment.

The question was all the more fitting and ironic given the setting – the event was held at IBM’s Almaden research facility, which happened to be the same place where Edgar (Ted) Codd invented SQL; IBM will video webcast excerpts on April 30.

Specifically, IBM made a series of announcements; while much of the press focused on announcement of a preview for IBM’s PureData for Hadoop appliance, to us the highlight was unveiling of a new architecture, branded BLU acceleration. Independent DB2 consultant Dave Beulke, whom we met at the launch, has published one of the best post mortems on the significance of the announcement.

BLU is supposed to be lightning fast. BNSF railroad, a BLU beta customer, reported performing a 4-billion row join in 8 milliseconds.

So what does this all mean?

Databases are assuming multiple personalities
BLU acceleration consists of a new engine that accelerates database performance. Let’s dissect that seemingly innocuous – and ambiguous – statement. Traditionally, the database and the underlying engine were considered one and the same. But increasingly, databases are evolving into broader data platforms with multiple personalities that are each designed for a specific form of processing or compute problem scenarios. Today’s DB2, Exadata, and Teradata 14 are not your father’s row-based data stores—they also have columnar support; even Microsoft SQL Server Parallel Warehouse Edition supports columnar indexing that can double as full-blown data tables alongside your row store. So you run rows for your existing applications (probably most of them are transactional) and run new analytic apps against the column store. Or if you’re a Pivotal HD (nee EMC Greenplum), run your SQL analytic queries against the relational engine, which happens to use Hadoop’s HDFS as the back end file system. And for Hadoop, new frameworks are emerging alongside MapReduce that are adding interactive, graph, and stream processing faces.

This week’s announcements by IBM of the BLU architecture marked yet another milestone in this trend. BLU is an engine that can exist side by side with DB2’s traditional row-based data store (it will be supported inside DB2 10.5). So you can run existing apps on the row store while migrating a select few to tap BLU. BLU is also being made available for IBM’s Informix TimeSeries 12.1, and in the long run, you’re likely to see it going into IBM’s other data platforms (think PureData for Analytics, Operational Analytics, and Hadoop models).

From IBM: More mixing and matching
In the same spirit, we expect to see IBM (and its rivals) do more mixing and matching in the future. We’re waiting for IBM to release an appliance that combines SQL analytics side by side with an instance of Hadoop, where you could run blended analytic queries (think: analytics from your CRM system alongside social, weblog, and mobile data harvested by Hadoop).

And while we’re on the topic of piling on data engines, IBM announced a preview of a JSON data store (think MongoDB style); it will become yet another engine to sit under the DB2 umbrella. We don’t expect MongoDB users to suddenly flock to buy DB2 licenses, but it will be a way to for existing DB2 shops to add an engine for developers who would otherwise implement their own Mongo one-off projects. The carrot is that IBM JSON takes advantage of data protection and security services of the DB2 platform that is not available from Mongo.

Dissecting BLU
BLU includes a number of features that individually, are not that unique (although there may be debates regarding degree of optimization). But together, they form a well-rounded approach to not only accelerating processing inside a SQL platform, but allowing new types of analytic processing. For instance, think about applying some of the late-binding schema practices from the Hadoop world to SQL (don’t believe for a moment that analytics on Hadoop doesn’t involve structuring data, but you can do it on demand, for the specific problem).

Put another way, in the Hadoop world, the competitive spotlight currently is on convergence with SQL. And now in the SQL world, styles of analytic processing from the NoSQL side are bleeding into SQL. Consider it a case of man bites dog.

The laundry list for BLU includes:
• Columnar and in-memory processing. Most Advanced SQL (or NewSQL) analytic platforms such as Teradata Aster, Pivotal HD (and its Greenplum predecessors), Vertica, ParaAccel, and others incorporate columnar as a core design. Hadoop’s HBase database also uses column storage. And of course as noted above, columnar engines are increasingly being incorporated alongside existing row-oriented stores inside relational warhorses. Columnar lends itself well to analytics because it reduces table scanning (you only need to look at specific columns rather than across entire rows) and focuses on aggregate data rather than individual records.
• Data compression – Compression and columnar tend to go together because, when you focus on representing aggregates, you can greatly reduce the number of bits for providing the data you need, such as averages, means, or outliers. Almost every column store employs some form of compression with ratios in the double-digit to 1 territory common. BLU is differentiated by a feature that IBM calls “actionable compression:” you can read compressed data without de-compressing it first, which significantly boosts performance because you can avoid de-compress/re-compress compute cycles.
• Data skipping – Many analytic data stores incorporate algorithms for minimizing data scans, with BLU’s algorithms doing so by ferreting out non-relevant data.

There are more optimizations under the hood. For instance, BLU tiers active columnar data into and out of memory and/or Flash (solid state disk) drives. And while in memory, BLU optimizes processing so that several columns can be crammed into a single memory register; that may sound quite geeky, but this design pattern is a key ingredient to accelerating throughput.

IBM contends that its in-memory and Flash optimizations are “good enough” to the point that a 100% in-memory PureData appliance to counter SAP HANA is not likely. But for Flash, never say never. In our view, given rapidly declining prices, we wouldn’t be surprised to see IBM at some point come out with an all-Flash unit.

Again, what does this mean for SQL and the DBA?
Now, back to our original question: When performance is accelerated to such an extent, does it really matter whether you’ve structured your tables, tuned your database, or formed your SQL statements properly? At first blush, that sounds like a rather academic question, but consider that time spent modeling databases and optimizing queries is time diverted from taking on new problems that could cut into the development backlog. And there is historical precedent; in SQL’s early days, conventional wisdom was that it required so much processing overhead (compared to hierarchical file systems that prevailed at the time) that it would never scale for the enterprise. Well, Moore’s Law brute forced the solution; SQL processing didn’t get that much more efficient, but hardware got much more powerful. Will on-demand SQL acceleration do the same for database modeling and SQL querying? Will optimization and automation make DBAs obsolete?

It seemed sacrilegious that, nearing the 40th anniversary of SQL, that such a question was posed at the very place where the technology was born.

But matters aren’t quite so black and white; as one set of problems get solved, broader ones emerge. For the DBA, the multiple personalities of data platforms are changing the nature of problem-solving: instead of writing the best SQL statement, focus on defining and directing the right query, to the right data, on the right engine, at the right time.

For instance, a hot new mobile device is released to the market with huge fanfare, sales initially spike before unexpectedly dropping through the floor. Such a query might fuse SQL (from the CRM analytic system) with sentiment analysis (to see what customers and prospects were saying), graph analysis (to understand who is friends with, and influences, who), and time series (to see how sentiment changed over time). The query may run across SQL, Hadoop, and possibly another specialized data store.

Admittedly, there will be a significant role for automation to optimize such queries, but the trend points to a bigger reality for DBAs where they don’t worry as much about SQL schema or syntax per se, but focus more on optimizing (with the system’s help) data and queries in more global terms.

BI crashing into the database

Flattening of Big Data architecture has become something of an epidemic. The largeness of Big Data has forced the middle and bottom layers of the stack – analytics and data – to converge; the accessibility of SQL married to the scale of Hadoop has driven a similar result. And now we’re seeing the top, middle, and in some cases lower levels of the stack converging with BI and transformation atop an increasingly ambitious data tier.

It began with the notion of making BI more self-service; give ordinary people the ability to make ad hoc queries without waiting for IT to clear its backlog. Tools like Tableau, QlikTech, and Spotfire have popularized visualization with intuitive front ends backed typically by some form of data caching mechanism for materializing views PDQ. Originally these approaches may have amounted to putting lipstick on a pig (e.g., big, ugly, complex SQL databases), but in many cases, these tools are packing more back end functionality to not simply paint pictures, but quickly assemble them. They are increasingly embedding their own transformation tools. While they are eliminating the ETL tier, they are definitely not eliminating the “T” – although that message tends to get blotted out by marketing hyperbole. That in turn is leading to the next step, which is elevating cache to becoming full-bore in-memory databases. So now you’ve collapsed, not only the ETL middle tier, but the data back end tier. We’re seeing platforms that would otherwise be classed Advanced SQL or NewSQL databases, like SiSense.

That phenomenon is also working its way on the NoSQL side where we see Platfora packaging, not only an in-memory caching tier for Hadoop, but also the means to marshal and transform data and views on the fly.

This is not simply a tale of flattening architecture for its own sake. The ramifications are basic changes to the analytics workflow and lifecycle. Instead of planning your data structures and queries ahead of time, generate schema and views on demand. Flattening the architecture facilitates this new style.

Traditionally with SQL – both for data warehousing and transaction systems – the process was completely different. You modeled the data and specified the tables at design time based on your knowledge of the content of the data and the queries and reports you were going to generate.

As you might recall, NoSQL was a reaction to the constraints of imposing a schema on the database at design time. By contrast, NoSQL did not necessarily do away with structure; it simply allowed the process to become more flexible. Collect the data and when it’s time to harvest it, explore it, discover the problem, and then derive the structure. And because in most NoSQL platforms, you still retain the data in raw form, you can generate a different schema as the nature of the problem, business challenge, or content of the data itself changes. Just run another series of MapReduce or similar processes to generate new views. Nonetheless, this view of flexible schema was borne with assumptions of a batch processing environment.

What’s changed is the declining cost of silicon-based storage: Flash (SSD) and memory (DRAM). That’s allowed those cute SQL D-I-Y visualization tools to morph into in-memory data platforms, because it was now cheap enough to gang terabytes of memory together. Likewise, it has cleared the way for Oracle and SAP to release in-memory platforms. And on the NoSQL side, it is making the notion of dynamic views from Hadoop thinkable.

We’re only at the beginning of the great rethink of analytic data views, schematizing processes, and architectural refactoring. It fires a shot across the bow to traditional BI players who have built their solutions on traditional schema at design rather than run time; in their case, it will require some significant architectural redesign. The old way will not disappear, as the contents of core end-of-period reporting and similar processes will not go away. There will always be a need for data warehouses and BI/reporting tools that provide repeatable, baseline query and reporting. And the analytic and data protection/housekeeping functions that are provided by established platforms will continue to be in demand. Astute BI and DW vendors should consider these new options as additive; they will be challenged by upstarts offering highly discounted pricing. For established vendors, the key is emphasizing the value add while they provide the means for taking advantage of the new, more flexible style of schema on demand.

Sadly, as we’re at the beginning of a new era of dynamic schema and dynamic analytics, there is also a lot of noise, like the dubious proposition that we can eliminate ETL. Folks, we are eliminating a tier, not the process itself. Even with Hadoop, when you analyze data, you inevitably end up forming it into a structure so you can grind out your analytics.

Disregard the noise and hype. You’re not going to replace your data warehouses for routine, mandated processes. But there will be new analytics will become more organic. It’s not simply a phenomenon in the Hadoop world, but with SQL as well. Your analytics infrastructure will flatten, and your schema and analytics will grow more flexible and organic.