Are data warehouses becoming victims of their own success? It’s hard to ignore the reality that the appetite for BI analytics has grown steadily – it’s one of the few enterprise IT software markets to continue enjoying steady growth over the past decade or more. While BI has not yet morphed into that long-promised democratic knowledge tool for the masses, there’s little question that it has become firmly embedded as a pillar of enterprise IT. Increasingly, analytics are being integrated with transactions, and there’s a movement for self-service BI that aims to address the everyman gap.
On the data side, there’s little question about the impact of data warehousing and BI. Enterprises have increasingly voracious appetite for data. And there are more kinds of data coming in. As part of our day job, we globally surveyed large enterprise data warehouse users (DWs over a terabyte) a couple years back and discovered over half of them were already routinely conducting text analytics in addition to conventionally structured data.
While SQL platforms have steadily increased scale and performance (it’s easy to forget that 30 years ago, conventional wisdom was that they would never scale to support enterprise OLTP systems), the legwork of operating data warehouses is becoming a source of bottlenecks. Data warehouses and transactional systems have traditionally been kept apart because their workloads significantly differed; they were typically kept at arm’s length with separate staging servers in the middle tier, where ETL operations were performed.
Yet, surging data volumes are breaking this pattern. With growing data volumes has come an emerging pattern where data and processing are brought together on the same platform. The “ELT” pattern was thus born based on the notion that collocating transformation operations inside the data warehouse would be more efficient as it would reduce data movements. The downside of ELT, however, is that data transformation compute cycles compete for finite resource with analytics.
Enter Hadoop. First developed for Internet companies for solving what were considered unique Internet problems (e.g., search indexes, ad optimization, gaming, etc.), enterprises are intrigued by the new platform for its ability to broaden the scope of their analytics to accommodate data outside the traditional purview of the data warehouse. Indeed, Hadoop allows organizations to broaden their analytic view. Instead of a traditional “360 view” of the customer that is largely transaction-based, add in social or mobile data to get a fuller picture. The same goes with machine data for operations, or weblog data for any Internet site.
But Hadoop can have another role in supplementing the data warehouse, performing the heavy lift at less cost. For starters, inserting a Hadoop platform to perform data transformation can offload cycles from the data warehouse to an environment where compute (and storage) are far less expensive. The software is far less expensive and the hardware is pure commodity (x86, standard Ethernet, and cheap 1 – 3 TB disk). And the system is sufficiently scalable that, should you need to commandeer additional resources to crunch higher transformation loads, they can be added economically by growing out your clusters.
Admittedly, there’s no free lunch; Hadoop is not free, as in free beer.
1. Like most open source software, you will pay for support. However, when compared to the cost of licensing commercial database software (where fees are related to installation size), the cost of open source should be far more modest.
2. In the short run, you will likely pay more for Hadoop skills because they are not (yet) as plentiful as SQL. This is a temporary state of affairs; as with Java 1999, the laws of supply and demand will eventually resolve this hurdle.
3. Hadoop is not as mature a platform as off-the-shelf SQL counterparts, but that is also a situation that time will eventually resolve.
4. Hadoop adds another tier to your analytic platform environment. If you embraced ELT, it does introduce some additional data movement; but then again, you may not have to load all of that data to a SQL target in the end.
But even if Hadoop is not free, it presents a lower cost target for shifting transform compute cycles. More importantly, it adds new options for analytic processing. With SQL and Hadoop converging, there are new paths for SQL developers to access data in Hadoop without having to learn MapReduce. These capabilities will not eliminate SQL querying to your existing data warehouse, as such platforms are well-suited for routine queries (with many of them carrying their own embedded specialized functions). But they supplement them by providing the opportunity to conduct exploratory querying that rounds out the picture and provides the opportunity to test drive new analytics before populating them to the primary data warehouse.
Emergence of Hadoop is part of a trend towards away from monolithic data warehousing environments. While enterprise DWs live on, they are no longer the sole focus of the universe. For instance, Teradata, which has long been associated with enterprise data warehousing, now promotes a unified data architecture that acknowledges that you’ll need different types of platforms for different workloads: operational data store, interactive analytics, and data deep dives. IBM, Oracle, and Microsoft are similarly diversifying their data platforms. Hadoop is just the latest addition, bringing capabilities for bulk transformation and exploratory analytics in SQL (and other) styles.
We will be discussing this topic in more detail later this week in a webinar sponsored by Cloudera. Catch our session “Hadoop: Extending your Data Warehouse” on Thursday, May 9, 2013 at 2:00pm ET/ 11:00a PT. You can register for the session here.
That’s a paraphrase of a question raised by IDC analyst Carl Olofson at an IBM Big Data analyst event earlier this week. Carl’s question neatly summarized our impressions from the session, which centered around some big data announcements that IBM made. It concerned some new performance improvements that IBM has made that might render some issues with poorly formed SQL moot. More about that in a moment.
The question was all the more fitting and ironic given the setting – the event was held at IBM’s Almaden research facility, which happened to be the same place where Edgar (Ted) Codd invented SQL; IBM will video webcast excerpts on April 30.
Specifically, IBM made a series of announcements; while much of the press focused on announcement of a preview for IBM’s PureData for Hadoop appliance, to us the highlight was unveiling of a new architecture, branded BLU acceleration. Independent DB2 consultant Dave Beulke, whom we met at the launch, has published one of the best post mortems on the significance of the announcement.
BLU is supposed to be lightning fast. BNSF railroad, a BLU beta customer, reported performing a 4-billion row join in 8 milliseconds.
So what does this all mean?
Databases are assuming multiple personalities
BLU acceleration consists of a new engine that accelerates database performance. Let’s dissect that seemingly innocuous – and ambiguous – statement. Traditionally, the database and the underlying engine were considered one and the same. But increasingly, databases are evolving into broader data platforms with multiple personalities that are each designed for a specific form of processing or compute problem scenarios. Today’s DB2, Exadata, and Teradata 14 are not your father’s row-based data stores—they also have columnar support; even Microsoft SQL Server Parallel Warehouse Edition supports columnar indexing that can double as full-blown data tables alongside your row store. So you run rows for your existing applications (probably most of them are transactional) and run new analytic apps against the column store. Or if you’re a Pivotal HD (nee EMC Greenplum), run your SQL analytic queries against the relational engine, which happens to use Hadoop’s HDFS as the back end file system. And for Hadoop, new frameworks are emerging alongside MapReduce that are adding interactive, graph, and stream processing faces.
This week’s announcements by IBM of the BLU architecture marked yet another milestone in this trend. BLU is an engine that can exist side by side with DB2’s traditional row-based data store (it will be supported inside DB2 10.5). So you can run existing apps on the row store while migrating a select few to tap BLU. BLU is also being made available for IBM’s Informix TimeSeries 12.1, and in the long run, you’re likely to see it going into IBM’s other data platforms (think PureData for Analytics, Operational Analytics, and Hadoop models).
From IBM: More mixing and matching
In the same spirit, we expect to see IBM (and its rivals) do more mixing and matching in the future. We’re waiting for IBM to release an appliance that combines SQL analytics side by side with an instance of Hadoop, where you could run blended analytic queries (think: analytics from your CRM system alongside social, weblog, and mobile data harvested by Hadoop).
And while we’re on the topic of piling on data engines, IBM announced a preview of a JSON data store (think MongoDB style); it will become yet another engine to sit under the DB2 umbrella. We don’t expect MongoDB users to suddenly flock to buy DB2 licenses, but it will be a way to for existing DB2 shops to add an engine for developers who would otherwise implement their own Mongo one-off projects. The carrot is that IBM JSON takes advantage of data protection and security services of the DB2 platform that is not available from Mongo.
BLU includes a number of features that individually, are not that unique (although there may be debates regarding degree of optimization). But together, they form a well-rounded approach to not only accelerating processing inside a SQL platform, but allowing new types of analytic processing. For instance, think about applying some of the late-binding schema practices from the Hadoop world to SQL (don’t believe for a moment that analytics on Hadoop doesn’t involve structuring data, but you can do it on demand, for the specific problem).
Put another way, in the Hadoop world, the competitive spotlight currently is on convergence with SQL. And now in the SQL world, styles of analytic processing from the NoSQL side are bleeding into SQL. Consider it a case of man bites dog.
The laundry list for BLU includes:
• Columnar and in-memory processing. Most Advanced SQL (or NewSQL) analytic platforms such as Teradata Aster, Pivotal HD (and its Greenplum predecessors), Vertica, ParaAccel, and others incorporate columnar as a core design. Hadoop’s HBase database also uses column storage. And of course as noted above, columnar engines are increasingly being incorporated alongside existing row-oriented stores inside relational warhorses. Columnar lends itself well to analytics because it reduces table scanning (you only need to look at specific columns rather than across entire rows) and focuses on aggregate data rather than individual records.
• Data compression – Compression and columnar tend to go together because, when you focus on representing aggregates, you can greatly reduce the number of bits for providing the data you need, such as averages, means, or outliers. Almost every column store employs some form of compression with ratios in the double-digit to 1 territory common. BLU is differentiated by a feature that IBM calls “actionable compression:” you can read compressed data without de-compressing it first, which significantly boosts performance because you can avoid de-compress/re-compress compute cycles.
• Data skipping – Many analytic data stores incorporate algorithms for minimizing data scans, with BLU’s algorithms doing so by ferreting out non-relevant data.
There are more optimizations under the hood. For instance, BLU tiers active columnar data into and out of memory and/or Flash (solid state disk) drives. And while in memory, BLU optimizes processing so that several columns can be crammed into a single memory register; that may sound quite geeky, but this design pattern is a key ingredient to accelerating throughput.
IBM contends that its in-memory and Flash optimizations are “good enough” to the point that a 100% in-memory PureData appliance to counter SAP HANA is not likely. But for Flash, never say never. In our view, given rapidly declining prices, we wouldn’t be surprised to see IBM at some point come out with an all-Flash unit.
Again, what does this mean for SQL and the DBA?
Now, back to our original question: When performance is accelerated to such an extent, does it really matter whether you’ve structured your tables, tuned your database, or formed your SQL statements properly? At first blush, that sounds like a rather academic question, but consider that time spent modeling databases and optimizing queries is time diverted from taking on new problems that could cut into the development backlog. And there is historical precedent; in SQL’s early days, conventional wisdom was that it required so much processing overhead (compared to hierarchical file systems that prevailed at the time) that it would never scale for the enterprise. Well, Moore’s Law brute forced the solution; SQL processing didn’t get that much more efficient, but hardware got much more powerful. Will on-demand SQL acceleration do the same for database modeling and SQL querying? Will optimization and automation make DBAs obsolete?
It seemed sacrilegious that, nearing the 40th anniversary of SQL, that such a question was posed at the very place where the technology was born.
But matters aren’t quite so black and white; as one set of problems get solved, broader ones emerge. For the DBA, the multiple personalities of data platforms are changing the nature of problem-solving: instead of writing the best SQL statement, focus on defining and directing the right query, to the right data, on the right engine, at the right time.
For instance, a hot new mobile device is released to the market with huge fanfare, sales initially spike before unexpectedly dropping through the floor. Such a query might fuse SQL (from the CRM analytic system) with sentiment analysis (to see what customers and prospects were saying), graph analysis (to understand who is friends with, and influences, who), and time series (to see how sentiment changed over time). The query may run across SQL, Hadoop, and possibly another specialized data store.
Admittedly, there will be a significant role for automation to optimize such queries, but the trend points to a bigger reality for DBAs where they don’t worry as much about SQL schema or syntax per se, but focus more on optimizing (with the system’s help) data and queries in more global terms.
Flattening of Big Data architecture has become something of an epidemic. The largeness of Big Data has forced the middle and bottom layers of the stack – analytics and data – to converge; the accessibility of SQL married to the scale of Hadoop has driven a similar result. And now we’re seeing the top, middle, and in some cases lower levels of the stack converging with BI and transformation atop an increasingly ambitious data tier.
It began with the notion of making BI more self-service; give ordinary people the ability to make ad hoc queries without waiting for IT to clear its backlog. Tools like Tableau, QlikTech, and Spotfire have popularized visualization with intuitive front ends backed typically by some form of data caching mechanism for materializing views PDQ. Originally these approaches may have amounted to putting lipstick on a pig (e.g., big, ugly, complex SQL databases), but in many cases, these tools are packing more back end functionality to not simply paint pictures, but quickly assemble them. They are increasingly embedding their own transformation tools. While they are eliminating the ETL tier, they are definitely not eliminating the “T” – although that message tends to get blotted out by marketing hyperbole. That in turn is leading to the next step, which is elevating cache to becoming full-bore in-memory databases. So now you’ve collapsed, not only the ETL middle tier, but the data back end tier. We’re seeing platforms that would otherwise be classed Advanced SQL or NewSQL databases, like SiSense.
That phenomenon is also working its way on the NoSQL side where we see Platfora packaging, not only an in-memory caching tier for Hadoop, but also the means to marshal and transform data and views on the fly.
This is not simply a tale of flattening architecture for its own sake. The ramifications are basic changes to the analytics workflow and lifecycle. Instead of planning your data structures and queries ahead of time, generate schema and views on demand. Flattening the architecture facilitates this new style.
Traditionally with SQL – both for data warehousing and transaction systems – the process was completely different. You modeled the data and specified the tables at design time based on your knowledge of the content of the data and the queries and reports you were going to generate.
As you might recall, NoSQL was a reaction to the constraints of imposing a schema on the database at design time. By contrast, NoSQL did not necessarily do away with structure; it simply allowed the process to become more flexible. Collect the data and when it’s time to harvest it, explore it, discover the problem, and then derive the structure. And because in most NoSQL platforms, you still retain the data in raw form, you can generate a different schema as the nature of the problem, business challenge, or content of the data itself changes. Just run another series of MapReduce or similar processes to generate new views. Nonetheless, this view of flexible schema was borne with assumptions of a batch processing environment.
What’s changed is the declining cost of silicon-based storage: Flash (SSD) and memory (DRAM). That’s allowed those cute SQL D-I-Y visualization tools to morph into in-memory data platforms, because it was now cheap enough to gang terabytes of memory together. Likewise, it has cleared the way for Oracle and SAP to release in-memory platforms. And on the NoSQL side, it is making the notion of dynamic views from Hadoop thinkable.
We’re only at the beginning of the great rethink of analytic data views, schematizing processes, and architectural refactoring. It fires a shot across the bow to traditional BI players who have built their solutions on traditional schema at design rather than run time; in their case, it will require some significant architectural redesign. The old way will not disappear, as the contents of core end-of-period reporting and similar processes will not go away. There will always be a need for data warehouses and BI/reporting tools that provide repeatable, baseline query and reporting. And the analytic and data protection/housekeeping functions that are provided by established platforms will continue to be in demand. Astute BI and DW vendors should consider these new options as additive; they will be challenged by upstarts offering highly discounted pricing. For established vendors, the key is emphasizing the value add while they provide the means for taking advantage of the new, more flexible style of schema on demand.
Sadly, as we’re at the beginning of a new era of dynamic schema and dynamic analytics, there is also a lot of noise, like the dubious proposition that we can eliminate ETL. Folks, we are eliminating a tier, not the process itself. Even with Hadoop, when you analyze data, you inevitably end up forming it into a structure so you can grind out your analytics.
Disregard the noise and hype. You’re not going to replace your data warehouses for routine, mandated processes. But there will be new analytics will become more organic. It’s not simply a phenomenon in the Hadoop world, but with SQL as well. Your analytics infrastructure will flatten, and your schema and analytics will grow more flexible and organic.
One of the more out-of-the-blue announcements to have dropped last week at Strata was Intel’s entry into the Hadoop space. Our first reactions were (1) what does Intel know about the software market (actually they do own McAfee) and (2) the last thing the Hadoop ecosystem needs is yet another distro to further confuse matters. We had a chance to briefly review it during our Strata write-up, but since then we’ve had a chance for a more detailed drill down. There’s some interesting technology under the covers that could accelerate Hadoop thanks to optimization at the Xeon chip instruction set level.
The headline is that it will speed data crunching; in the press release, it cites an internal benchmark of reducing analyzing of a terabyte of data from 4 hours down to 7 minutes
admittedly, we don’t know what kind of analysis it was, or whether there are certain forms of processing that will benefit more from chip-level optimization than others, but we’ll accept the general message.
Update: The spec was based on the Terasort benchmark
The headline of the initial release pertains to an area that has so far posed an unmet challenge in Hadoop: embedding encryption functions at the hardware level. It uses a specific encryption instruction set in Xeon designed for the Advanced Encryption Standard (AES) instruction set. It addresses a key conundrum in Hadoop data security: if you have sensitive data, it would be nice to encrypt it, but encryption is such a compute-intensive operation that applying it at terabyte scale would seem impractical. For now, IBM and Dataguise are the only ones that are providing such capability for Hadoop, and that, encryption is selective; at hardware level, the performance differences could be significant. Admittedly, hardware-based encryption is not new; there are appliances on the market that already handle it, but until now nothing that works inside Hadoop.
Like Cloudera, Intel is also competing with its own proprietary Hadoop management tooling, which in this case incorporates patented auto-tuning technology for optimizing clusters.
Clearly, Intel’s entry to the Hadoop space ups the bar on performance. There are a number of different paths to the same summit; Cloudera’s Impala is a framework that is supposed to speed SQL processing while avoiding the bottleneck of MapReduce; Hortonworks is proposing the Tez runtime and Stinger interactive query projects to accomplish a similar end; Greenplum has fused its MPP SQL database while using HDFS for storage under the Pivotal HD system; while MapR swaps out HDFS altogether for a more highly performant, NFS-compatible one.
Intel enters an increasingly competitive and forking Hadoop market where there is contention at almost every layer of the stack. Although Intel has sizable enterprise software businesses, it is mostly at the lower levels of the stack (e.g., hardware optimization, security). So it is new to database level of the stack. On one hand, the Big Data software unit of Intel will be at most a tiny speck of its overall business; but as a means to an end (selling more Hadoop boxes), it is potentially very strategic.
The question is whether Intel is better off competing head to head with its own hardware-optimized platform, or forming great OEM deals with the players that are gaining critical mass in the Hadoop space, which could sell even more Xeon servers. Intel has announced a number of opening round partners which include several data platforms on the other side of the Hadoop divide. Among the most interesting are SAP, where chip-optimized Hadoop would form a natural side-by-side install with the in-memory HANA database; and Teradata, whose Aster unit already offers an appliance with rival Hortonworks (could the Intel distro become a virtual extension of the Teradata Extreme Data Appliance?).
There’s little question that Hadoop could use a hardware kick.
We’re in the thick of analyst conference season – Informatica last week, SAS tomorrow. So on this Sunday afternoon between gigs, we’re digesting what went down at Strata 2013 in Santa Clara last week. It was kind of a frustrating day in that we had limited time, were scheduled wall to wall with meetings, and missed what were likely some fascinating sessions. But we got a sense of some dominant themes: Harden Hadoop for the enterprise, and take the SQL world to Hadoop.
The Hadoop vendor ecosystem is filling in – new players with their own distros, and new capabilities focused on making Hadoop more enterprise grade. The field is early enough that the approaches are still quite diverse – it’s time to invent, not consolidate. Let the games proceed.
EMC stole the jump early in the week by announcing the grafting of its own Greenplum Advanced SQL analytic data store onto Hadoop – basically, the Greenplum MPP database squooched (wanted an excuse to use a “word” like that) atop HDFS. Tastes like a SQL analytic database, scales like Hadoop. Cloudera Impala will soon be in a GA branded as RTQ (Real-Time Query). Not to be outshined, Hortonworks, which works through the official Hadoop project itself, announced a couple responding initiatives: the Tez runtime and Stinger interactive query engine. You wouldn’t be seeing all these efforts to make Hadoop interactive if the demand wasn’t out there; while Hadoop as a platform for extending the range of analytics has become very compelling to enterprises, they clearly expect that the platform must be SQL interactive if it is to become a part of their analytic system portfolio.
While we’ve been expending electrons on the SQLization of Hadoop, the next stage of hardening is rapidly emerging. Specifically, make Hadoop and Hadoop data more governable and secure. This involves capabilities such as data masking (where you permanently obliterate sensitive pieces of data), data encryption (where you can recover the original data), activity monitoring (who does what), data lineage (who and where this data came from, and who has done what to it), and of course, more fine grained access control (preferably role-based) that picks up where Kerberos authentication leaves off. The pieces are just beginning to fall into place.
Dataguise, a niche player in data obfuscation that relaunched itself in the Hadoop space last year, has had an encryption product out for roughly six months and has drawn several customers; they promote a self-learning feature that discovers sensitive data (e.g., credit card numbers), selectively encrypts, and then acts only when data is changed. IBM already has capabilities in Optim that are typically used when pulling data from an external database; a user-defined function can mask it in Hadoop, or mask data as it is drawn from Hadoop. IBM offers data masking and activity monitoring, a capability that Cloudera just announced. Specifically, Cloudera’s new Navigation tool places agents (like everybody else, they characterize them as “lightweight”) on HDFS, Hive, and HBase, and you can configure them. For instance, the traffic on Hive is likely to be a fraction of that for HBase, which is more interactive, so you can configure monitoring of event changes to data accordingly. And then we came across Revelytix, which focuses on data lineage
Then out of the blue, Intel swooped in with announcement of its own Hadoop distribution. As if that was the last thing the world needed. But Intel has carved some interesting angles: it is utilizing the native instruction set of the Xeon processor to move encryption and I/O optimization directly into the chip. Intel’s play addresses the issue that these processes are resource-heavy, a point where the sheer size of Hadoop data stores add insult to injury. And that is not to mention that embedding encryption in hardware lessens the load of developers. Intel has drawn a number of partners including SAP, where integration with the HANA in-memory platform offers some interesting Fast Data possibilities. So far we’ve missed signals with Intel, but will speak with them later next week to get a better idea of where they hope to take hardware optimization with Hadoop.
Loose ends: Time is running out on us, but coming out of this week, there are several issues that are running in the back of our mind:
• Hive – we thought this was a done deal. Hive is one of the earliest components of Hadoop. Having been designed when MapReduce was the predominant processing pattern, and the jobs to spawn the metadata were batch in nature. We were surprised that the debate over Hive’s use remains very, very live. The issue is over how dynamic Hive can become – yes, it can support interactive queries, but is it based on metadata that is current? We sense that this will become another area for vendor differentiation.
• Apache Hadoop project – This could be spin, but there is sniping behind the scenes that the Hadoop project is no longer so broad-based when it comes to contributions. The flipside is arguments over whether a particular vendor has enough (or any) committers rings a bit hollow. The operable question for enterprises is whether the distro of Hadoop is and will remain well-supported.
• Resource management – this one has multiple angles. Of course there is debate over YARN. It is supposed to be the über resource manager of Hadoop, so MapReduce jobs don’t collide with those of other frameworks that may have different (and conflicting) demand on processing and data access. There’s active debate over whether YARN has sufficiently weaned itself of its MapReduce batch lineage, or whether it should be a batch-oriented sub manager in a scheme where there is yet another layer of control. The counterargument to that is that this may make life (or at least levels of control) far too complex. Expect vendor differentiation here.
It’s going to be quite a whirlwind this week. Informatica Analyst conference tomorrow, wall-to-wall meetings at Strata Wed afternoon, before heading off to the mountains of Colorado for the SAS Analyst meet next week.
So EMC Greenplum caught us at an opportune time in the news cycle. They chose the day before strata to saturate the airwaves with announcement that they are staking their future on Hadoop.
Specifically, they have come up with a product with a proliferation of brands. EMC is the parent company, Greenplum is the business unit and Advanced SQL MPP database, Pivotal HD is the branding of their new SQL/Hadoop offering, and guess what, did we forget to tell you about the HAWQ engine that provides the interactivity with Hadoop? Sounds like a decision by committee for branding.
(OK, HAWQ is the code name for the project, but Greeenplum has been promoting it quite prominently. Maybe they might want to tone it down as the HAWQ domain is already claimed.)
But it is an audacious move for Greenplum. While its rivals are putting Hadoop either at arm’s reach or in Siamese twin deployments, Greenplum is putting its engine directly into Hadoop, sitting atop HDFS. The idea is that to the enterprise user, it looks like Greenplum, but with the scale-out of HDFS underneath. Greenplum is not alone in looking to a singular Hadoop destination for big Data analytics; Cloudera is also pushing that direction heavily with Impala. And while Hortonworks has pushed coexistence with its HCatalog Apache incubator project ad close OEM partnerships with Teradata Aster and Microsoft, it is responding with announcement of the Tez runtime and Stinger interactive Hadoop query projects.
These developments simply confirm what we’ve been saying: SQL is converging with Hadoop, and for good reason. For Big Data – and Hadoop – to get accepted into the mainstream enterprise, it has to become a first class citizen with IT, the data center, and the business. That means (1) reasonably map to the skills (SQL, Java) that already exist, and extend from there (2) fit in with the databases, applications, and systems management practices (e.g., storage, virtualization) that are how the data center operate and (3) the analytics must start by covering the ground that the business understands.
For enterprises, these announcements represent vendors placing stakes in the ground; these are newly announced products that in many cases are still using pre-release technology. But it is important to understand the direction of the announcements, what this means for the analytics that your shop produces, and how your future needs are to be met.
Clearly, Hadoop and its programming styles (for now MapReduce remains the best known) offer a new approach for a new kind of analytic to connect the dots. But for enterprises, the journey to Big Data and Hadoop will be more evolutionary.
It was never a question of whether SAP would bring it flagship product, Business Suite to HANA, but when. And when I saw this while parking the car at my physical therapist over the holidays, I should’ve suspected that something was up: SAP at long last was about to announce … this.
From the start, SAP has made clear that its vision for HANA was not a technical curiosity, positioned as some high-end niche product or sideshow. In the long run, SAP was going to take HANA to Broadway.
SAP product rollouts on HANA have proceeded in logical, deliberate fashion. Start with the lowest hanging fruit, analytics, because that is the sweet spot of the embryonic market for in-memory data platforms. Then work up the food chain, with the CRM introduction in the middle of last year – there’s an implicit value proposition for having a customer database on a real-time system, especially while your call center reps are on the phone and would like to either soothe, cross-sell, or upsell the prospect. Get some initial customer references with a special purpose transactional product in preparation for taking it to the big time.
There’s no question that in-memory can have real impact, from simplifying deployment to speeding up processes and enabling more real-time agility. Your data integration architecture is much simpler and the amount of data you physically must store is smaller. SAP provides a cute video that shows how HANA cuts through the clutter.
For starters, when data is in memory, you don’t have to denormalize or resort to tricks like sharding or striping of data to enhance access to “hot” data. You also don’t have to create staging servers to perform ETL of you want to load transaction data into a data warehouse. Instead, you submit commands or routines that, thanks to processing speeds that are up to what SAP claims to be 1000x faster than disk, convert the data almost instantly to the form in which you need to consume it. And when you have data in memory, you can now perform more ad hoc analyses. In the case of production and inventory planning (a.k.a., the MRP portion of ERP), you could run simulations when weighing the impact of changing or submitting new customer orders, or judging the impact of changing sourcing strategies when commodity process fluctuate. For beta customer John Deere, they achieved positive ROI based solely on the benefits of implementing it for pricing optimization (SAP has roughly a dozen customers in ramp up for Business Suite on HANA).
It’s not a question of whether you can run ERP in real time. No matter how fast you construct or deconstruct your business planning, there is still a supply chain that introduces its own lag time. Instead, the focus is how to make enterprise planning more flexible, enhanced with built-in analytics.
But how hungry are enterprises for such improvements? To date, SAP has roughly 500 HANA installs, primarily for Business Warehouse (BW) where the in-memory data store was a logical upgrade for analytics, where demand for in-memory is more established. But on the transactional side, it’s a more uphill battle as enterprises are not clamoring to conduct forklift replacements of their ERP systems, not to mention their databases as well. Changing both is no trivial matter, and in fact, changing databases is even rarer because of the specialized knowledge that is required. Swap out your database, and you might as well swap out your DBAs.
The best precedent is Oracle, which introduced Fusion Applications two years ago. Oracle didn’t necessarily see Fusion as replacement for E-Business Suite, JD Edwards, or PeopleSoft. Instead it viewed Fusion Apps as a gap filler for new opportunities among its installed base or the rare case of greenfield enterprise install. We’d expect no less from SAP.
Yet in the exuberance of rollout day, SAP was speaking of the transformative nature of HANA, claiming it “Reinvents the Real-Time Enterprise.” It’s not the first time that SAP has positioned HANA in such terms.
Yes, HANA is transformative when it comes to how you manage data and run applications, but let’s not get caught down another path to enterprise transformation. We’ve seen that movie before, and few of us want to sit through it again.
With Strata, IBM IOD, and Teradata Partners conferences all occurring this week, it’s not surprising that this is a big week for Hadoop-related announcements. The common thread of announcements is essentially, “We know that Hadoop is not known for performance, but we’re getting better at it, and we’re going to make it look more like SQL.” In essence, Hadoop and SQL worlds are converging, and you’re going to be able to perform interactive BI analytics on it.
The opportunity and challenge of Big Data from new platforms such as Hadoop is that it opens a new range of analytics. On one hand, Big Data analytics have updated and revived programmatic access to data, which happened to be the norm prior to the advent of SQL. There are plenty of scenarios where taking programmatic approaches are far more efficient, such as dealing with time series data or graph analysis to map many-to-many relationships. It also leverages in-memory data grids such as Oracle Coherence, IBM WebSphere eXtreme Scale, GigaSpaces and others, and, where programmatic development (usually in Java) proved more efficient for accessing highly changeable data for web applications where traditional paths to the database would have been I/O-constrained. Conversely Advanced SQL platforms such as Greenplum and Teradata Aster have provided support for MapReduce-like programming because, even with structured data, sometimes using a Java programmatic framework is a more efficient way to rapidly slice through volumes of data.
Until now, Hadoop has not until now been for the SQL-minded. The initial path was, find someone to do data exploration inside Hadoop, but once you’re ready to do repeatable analysis, ETL (or ELT) it into a SQL data warehouse. That’s been the pattern with Oracle Big Data Appliance (use Oracle loader and data integration tools), and most Advanced SQL platforms; most data integration tools provide Hadoop connectors that spawn their own MapReduce programs to ferry data out of Hadoop. Some integration tool providers, like Informatica, offer tools to automate parsing of Hadoop data. Teradata Aster and Hortonworks have been talking up the potentials of HCatalog, actuality an enhanced version of Hive with RESTful interfaces, cost optimizers, and so on, to provide a more SQL friendly view of data residing inside Hadoop.
But when you talk analytics, you can’t simply write off the legions of SQL developers that populate enterprise IT shops. And beneath the veneer of chaos, there is an implicit order to most so-called “unstructured” data that is within the reach programmatic transformation approaches that in the long run could likely be automated or packaged inside a tool.
At Ovum, we have long believed that for Big Data to crossover to the mainstream enterprise, that it must become a first-class citizen with IT and the data center. The early pattern of skunk works projects, led by elite, highly specialized teams of software engineers from Internet firms to solve Internet-style problems (e.g., ad placement, search optimization, customer online experience, etc.) are not the problems of mainstream enterprises. And neither is the model of recruiting high-priced talent to work exclusively on Hadoop sustainable for most organizations; such staffing models are not sustainable for mainstream enterprises. It means that Big Data must be consumable by the mainstream of SQL developers.
Making Hadoop more SQL-like is hardly new
Hive and Pig became Apache Hadoop projects because of the need for SQL-like metadata management and data transformation languages, respectively; HBase emerged because of the need for a table store to provide a more interactive face – although as a very sparse, rudimentary column store, does not provide the efficiency of an optimized SQL database (or the extreme performance of some columnar variants). Sqoop in turn provides a way to pipeline SQL data into Hadoop, a use case that will grow more common as organizations look to Hadoop to provide scalable and cheaper storage than commercial SQL. While these Hadoop subprojects that did not exactly make Hadoop look like SQL, they provided building blocks from which many of this week’s announcements leverage.
Progress marches on
One train of thought is that if Hadoop can look more like a SQL database, more operations could be performed inside Hadoop. That’s the theme behind Informatica’s long-awaited enhancement of its PowerCenter transformation tool to work natively inside Hadoop. Until now, PowerCenter could extract data from Hadoop, but the extracts would have to be moved to a staging server where the transformation would be performed for loading to the familiar SQL data warehouse target. The new offering, PowerCenter Big Data Edition, now supports an ELT pattern that uses the power of MapReduce processes inside Hadoop to perform transformations. The significance is that PowerCenter users now have a choice: load the transformed data to HBase, or continue loading to SQL.
There is growing support for packaging Hadoop inside a common hardware appliance with Advanced SQL. EMC Greenplum was the first out of gate with DCA (Data Computing Appliance) that bundles its own distribution of Apache Hadoop (not to be confused with Greenplum MR, a software only product that is accompanied by a MapR Hadoop distro). Teradata Aster has just joined the fray with Big Analytics Appliance, bundling the Hortonworks Data Platform Hadoop; this move was hardly surprising given their growing partnership around HCatalog, an enhancement of the SQL-like Hive metadata layer of Hadoop that adds features such as a cost optimizer and RESTful interfaces that make the metadata accessible without the need to learn MapReduce or Java. With HCatalog, data inside Hadoop looks like another Aster data table.
Not coincidentally, there is a growing array of analytic tools that are designed to execute natively inside Hadoop. For now they are from emerging players like Datameer (providing a spreadsheet-like metaphor; which just announced an app store-like marketplace for developers), Karmasphere (providing an application develop tool for Hadoop analytic apps), or a more recent entry, Platfora (which caches subsets of Hadoop data in memory with an optimized, high performance fractal index).
Yet, even with Hadoop analytic tooling, there will still be a desire to disguise Hadoop as a SQL data store, and not just for data mapping purposes. Hadapt has been promoting a variant where it squeezes SQL tables inside HDFS file structures – not exactly a no-brainer as it must shoehorn tables into a file system with arbitrary data block sizes. Hadapt’s approach sounds like the converse of object-relational stores, but in this case, it is dealing with a physical rather than a logical impedance mismatch.
Hadapt promotes the ability to query Hadoop directly using SQL. Now, so does Cloudera. It has just announced Impala, a SQL-based alternative to MapReduce for querying the SQL-like Hive metadata store, supporting most but not all forms of SQL processing (based on SQL 92; Impala lacks triggers, which Cloudera deems low priority). Both Impala and MapReduce rely on parallel processing, but that’s where the similarity ends. MapReduce is a blunt instrument, requiring Java or other programming languages; it splits a job into multiple, concurrently, pipelined tasks where, at each step along the way, reads data, processes it, and writes it back to disk and then passes it to the next task. Conversely, Impala takes a shared nothing, MPP approach to processing SQL jobs against Hive; using HDFS, Cloudera claims roughly 4x performance against MapReduce; if the data is in HBase, Cloudera claims performance multiples up to a factor of 30. For now, Impala only supports row-based views, but with columnar (on Cloudera’s roadmap), performance could double. Cloudera plans to release a real-time query (RTQ) offering that, in effect, is a commercially supported version of Impala.
By contrast, Teradata Aster and Hortonworks promote a SQL MapReduce approach that leverages HCatalog, an incubating Apache project that is a superset of Hive that Cloudera does not currently include in its roadmap. For now, Cloudera claims bragging rights for performance with Impala; over time, Teradata Aster will promote the manageability of its single appliance, and with the appliance has the opportunity to counter with hardware optimization.
The road to SQL/programmatic convergence
Either way – and this is of interest only to purists – any SQL extension to Hadoop will be outside the Hadoop project. But again, that’s an argument for purists. What’s more important to enterprises is getting the right tool for the job – whether it is the flexibility of SQL or raw power of programmatic approaches.
SQL convergence is the next major battleground for Hadoop. Cloudera is for now shunning HCatalog, an approach backed by Hortonworks and partner Teradata Aster. The open question is whether Hortonworks can instigate a stampede of third parties to overcome Cloudera’s resistance. It appears that beyond Hive, the SQL face of Hadoop will become a vendor-differentiated layer.
Part of conversion will involve a mix of cross-training and tooling automation. Savvy SQL developers will cross train to pick up some of the Java- or Java-like programmatic frameworks that will be emerging. Tooling will help lower the bar, reducing the degree of specialized skills necessary. And for programming frameworks, in the long run, MapReduce won’t be the only game in town. It will always be useful for large-scale jobs requiring brute force, parallel, sequential processing. But the emerging YARN framework, which deconstructs MapReduce to generalize the resource management function, will provide the management umbrella for ensuring that different frameworks don’t crash into one another by trying to grab the same resources. But YARN is not yet ready for primetime – for now it only supports the batch job pattern of MapReduce. And that means that YARN is not yet ready for Impala or vice versa.
Of course, mainstreaming Hadoop – and Big Data platforms in general – is more than just a matter of making it all look like SQL. Big Data platforms must be manageable and operable by the people who are already in IT; they will need some new skills and grow accustomed to some new practices (like exploratory analytics), but the new platforms must also look and act familiar enough. Not all announcements this week were about SQL; for instance, MapR is throwing a gauntlet to the Apache usual suspects by extending its management umbrella beyond the proprietary NFS-compatible file system that is its core IP to the MapReduce framework and HBase, making a similar promise of high performance. On the horizon, EMC Isilon and NetApp are proposing alternatives promising a more efficient file system but at the “cost” of separating the storage from the analytic processing. And at some point, the Hadoop vendor community will have to come to grips with capacity utilization issues, because in the mainstream enterprise world, no CFO will approve the purchase of large clusters or grids that get only 10 – 15% utilization. Keep an eye on VMware’s Project Serengeti.
They must be good citizens in data centers that need to maximize resource (e.g., virtualization, optimized storage); must comply with existing data stewardship policies and practices; and must fully support existing enterprise data and platform security practices. These are all topics for another day.
Data warehousing and analytics have accumulated a reasonably robust set of best practices and methodologies since they emerged in the mid-1990s. Although not all enterprises are equally vigilant, the state of practices around data stewardship (e.g., data quality, information lifecycle management, privacy and security) is pretty mature.
With emergence of Big Data and new analytic data platforms that handle different kinds of data such as Hadoop, the obvious question is whether these practices still apply. Admittedly, not all Hadoop use cases have been for analytics, but arguably, the brunt of early implementations are. That reality is reinforced by how most major IT data platform household brands have positioned Hadoop: EMC Greenplum, HP Vertica, Teradata Aster and others paint a picture that Hadoop is an extension of your [SQL] enterprise data warehouse.
That provokes the following question: if Hadoop is an extension of your data warehouse or analytic platform environment, should the same data stewardship practices apply?
We’ll train our focus on quality. Hadoop frees your analytic data store of limits, both to quantity of data and structure, which were part and parcel of maintaining a traditional data warehouse. Hadoop’s scalability frees your organization to analyze all of the data, not just a digestible sample of it. And not just structured data or text, but all sorts of data whose structure is entirely variable. With Hadoop, the whole world’s an analytic theatre.
Significantly, with the spotlight on volume and variety, the spotlight has been off quality. The question is, with different kinds and magnitudes of data, does data quality still matter? Can you afford to cleanse multiple terabytes of data? Is “bad data” still bad?
The answers aren’t obvious. Traditional data warehouses treated “bad” data as something to be purged, cleansed, or reconciled. While the maxim “garbage in, garbage out” has been with us since the dawn of computing, the issue of data quality hit the fan when data warehouses provided the opportunity to aggregate more, diverse sources of data that was not necessarily consistent in completeness, accuracy, or structure. The fix was cleansing record by record based on the proposition that analytics required strict apples to apples comparisons.
Yet volume and variety of Hadoop data casts doubt on the practicality of traditional data hygiene practice. Remediating record by record will take forever, and anyway, it’s simply not going to be practice – or worthwhile – to cleanse log files which are highly variable (and low value) by nature. The variety of data, not only by structure, but also source, makes it more difficult to know what is the correct structure and form of any individual record. And given that individual machine data readings are often cryptic and provide little value except when aggregated at huge scale also militates against traditional practice.
So now Hadoop becomes a special case. However, given that Hadoop also supports a different approach to analytics, by reason, data should also be treated differently.
Exact Picture or Big Picture?
Quality in Hadoop becomes more of a broad spectrum of choice that depends on the nature of the application and the characteristics of the data – specifically, the 4 V’s. Is your application mission-critical? That might augur for a more vigilant practice of data quality, but that depends on whether the application requires strict audit trails and carries regulatory compliance exposure. In those cases, better get the data right. However, web applications such as searching engines or ad placement may also be mission-critical but not necessarily bring the enterprise to its knees if the data is not 100% correct.
So you’ve got to ask yourself the question: are you trying to get the big picture, or the exact one? In some cases, they may be different.
The nature of data in turn determines the practicality of cleansing strategies. More volume dictates against traditional record-by-record approaches, variety makes the job of clean sing more difficult, while high velocity makes it virtually impossible. For instance, high throughput complex event processing (CEP)/data streaming applications are typically implemented for detecting patterns that drive operational decisions; cleansing would add too much processing overhead for especially high-velocity/low latency apps. Then there’s the question of data value; there’s more value in a customer identity record an individual reading that is the output of a sensor.
A spectrum of data hygiene approaches
Enforcing data quality is not impossible in Hadoop. There are different approaches, that, depending on the nature of the data and application, may dictate different levels of cleansing or none at all.
A “crowdsourcing” approach widens the net of data collection to a larger array of sources with the notion that enough good data from enough sources will drown out the noise. In actuality, that’s been the de facto approach that has been taken with early adopters, and it’s a fairly passive one. But such approaches could be juiced up with trending analytics that dynamically track the sweet spot of good data to see if the norm is drifting.
Another idea is unleashing the power of data science, not only to connect the dots, but also correct them. We’re not suggesting that you turn your expensive (and rare) data scientists into data QA techs, but to apply the same methodologies for exploration to dynamically track quality. Other variants are applying approaches that apply cleansing logic, not at the point of data ingestion, but consumption; that’s critical for highly-regulated processes, such as assessing counter-party risk for capital markets. In one particular case, an investment bank used a rules-based, semantic domain model using the OMG’s Common Warehouse Model as a means for validating data consumed.
Bad Data may be good
Big Data in Hadoop may be different data, and may be analyzed differently. The same logic applies to “bad data” that in conventional terms appears as outlier, incomplete, or plain wrong. The operable question of why the data may be “bad” may yield as much value as analyzing data within the comfort zone. It’s the inverse of analyzing the drift over time of the sweet spot of good data. When there’s enough bad data, that makes it fair game for trending to check whether different components or pieces of infrastructure are drifting off calibration, or if the assumptions on what constitute “normal” conditions are changing. Like rising sea levels, typical daily temperature swings, for instance. Similar ideas could apply to human-readable data, where perceived outliers reflect flawed assumptions on the meaning of data, such as when conducting sentiment analysis. In Hadoop, bad data may be good.
Hadoop remains a difficult platform for most enterprises to master. For now skills are still hard to come by – both for data architect or engineer, and especially for data scientists. It still takes too much skill, tape, and baling wire to get a Hadoop cluster together. Not every enterprise is Google or Facebook, with armies of software engineers that they can throw at a problem. With some exceptions, most enterprises don’t deal with data on the scale of Google or Facebook either – but the bar is rising.
If 2011 was the year that the big IT data warehouse and analytic platform brand names discovered Hadoop, 2012 becomes the year where a tooling ecosystem starts emerging to make Hadoop more consumable for the enterprise. Let’s amend that – along with tools, Hadoop must also become a first-class citizen with enterprise IT infrastructure. Hadoop won’t cross over to the enterprise if it has to be treated as some special island. That means meshing with the practices and technology approaches that enterprises are using to manage their data centers or cloud deployments. Like SQL, data integration, virtualization, storage strategy, and so on.
Admittedly, much of this cuts against the grain of early Hadoop deployment that stressed open source and commodity infrastructure. Early adopters did so out of necessity as commercial software ran out of gas for Facebook when its data warehouse daily refreshes were breaking terabyte range, not to mention that the cost of commercial licenses for such scaled out analytic platforms wouldn’t have been trivial. Anyway, Hadoop’s linearity leverages scale out of commodity blades and direct attached disk as far as the eye can see, enabling such an almost pure noncommercial approach. At the time, Google’s, Yahoo’s, and Facebook’s issues were considered rather unique – most enterprise don’t run global search engines – not to mention that their business was built on armies of software engineers.
As we’ve previously noted, something’s got to give on the skills front. Hadoop in the enterprise faces limits – the data problems are getting bigger and more complex for sure, but resources and skills are far more finite. So we envision tools and solutions addressing two areas:
1. Products that address “clusterophobia” – organizations that seeks the scalable analytics of Hadoop but lack the appetite to erect infinite data centers out in the fields or hire the necessary skillsets. Obviously, using the cloud is one option – but the questions there revolve around whether corporate policies allow maintenance of data off premises, and also, as datra store size grows, whether the cloud is still economical.
2. The other side of the coin is consummability – tools that simplify access to and manipulation of the data.
In the run-up to this year’s Hadoop Summit, a number of tooling announcements addressing clusterophobia and consumption are pouring out.
On the fear of clusters side, players like Oracle, EMC Greenplum, and Teradata Aster are already offering appliances that simplify deployment of Hadoop, typically in conjunction with an Advanced SQL analytic platform. While most vendors position this as a way for Hadoop to “extend’ your data warehouse so you perform exploration in Hadoop, but the serious analytics in SQL, we view appliances as more than transitional strategy; the workloads are going to get more equitably distributed, and in the long run, we wouldn’t be surprised to see more Hadoop-only appliances, sort of like Oracle’s (for the record, they also bundle another NoSQL database).
Also addressing the same constituency are storage and virtualization – facts of life in the data center. For Hadoop to cross over to the enterprise, it, too, must get virtualization-friendly; storage is an open question. The need for virtualization becomes even more apparent because (1) the exploratory nature of Hadoop analytics demands the ability to try out queries offline without having to disrupt or physically build a new cluster; and (2) the variable nature of Hadoop processing suggests that workloads are likely to be elastic. So we’ve been waiting for VMware to make their move. VMware – also part of EMC – has announced a pair of initiatives. First, they are working with the Apache Hadoop project to make the core pieces (HDFS and MapReduce) virtualization-aware, and separately, they are hosting their own open source project (Serengeti) for virtualizing Hadoop clusters. While Project Serengeti is not VM-specific, there’s little doubt that this will be a VMware project (we’d be shocked if the Xen folks were to buy in).
Where there’s virtualized servers, storage often closely follows. A few months back, EMC dropped the other shoe, finally unveiling a strategy for leveraging Isilon with the Greenplum HD platform, the closest thing in NAS that replicates the scale-out model storage model popularized with Hadoop. This opens an argument of whether the scales of data in Hadoop make premium products such as Isilon unaffordable; the flip side however is the “open source tax,” where you hire the skills in your IT organization to manage and deploy scale-out storage, or pay consultants to do it for you.
In the spirit of making Hadoop more consummable, we expect a lot of vibes from new players that are simplifying navigation of Hadoop and building SQL bridges. Datameer is bringing down the pricing of its uber Hadoop spreadsheet to personal and workgroup levels courtesy of entry level pricing from $299 to $2999. Teradata Aster, which already offers a patented framework that translates SQL to MapReduce (there are also others out there) is now taking an early bet on the incubating Apache HCatalog metadata spec so that you could write SQL statements that go up against Hadoop. It joins approaches such as those from Hadapt, which hangs SQL tables from HDFS file nodes, and mainstream BI players such as Jaspersoft, that already provide translators that can grab reports directly from Hadoop.
This doesn’t take away from the evolution of the Hadoop platform itself; Cloudera and Hortonworks are among those releasing new distributions that bundle their own mix of recent and current Apache Hadoop modules. While the Apache project has addressed the NameNode HA issue, it is still early in the game with bringing enterprise-grade manageability to MapReduce. That’s largely an academic issue as the bulk of enterprises have yet to implement Hadoop; by the time enterprises are ready, many of the core issues should resolve — although there will always be questions about the uptake of peripheral Hadoop projects.
What’s more important – and where the action will be – is in tools that allow enterprises to run and, more importantly, consume Hadoop. A chicken and egg situation, enterprises won’t implement before tools are available and vice versa.
Note: If you’re in San Jose, we invite you to join us at Hadoop Summit to catch our presentation Hadoop – Do Data Warehousing Rules Apply on Thursday morning at 10:30.
« Previous entries Next Page » Next Page »