Category Archives: Storage

It’s happening: Hadoop and SQL worlds are converging

With Strata, IBM IOD, and Teradata Partners conferences all occurring this week, it’s not surprising that this is a big week for Hadoop-related announcements. The common thread of announcements is essentially, “We know that Hadoop is not known for performance, but we’re getting better at it, and we’re going to make it look more like SQL.” In essence, Hadoop and SQL worlds are converging, and you’re going to be able to perform interactive BI analytics on it.

The opportunity and challenge of Big Data from new platforms such as Hadoop is that it opens a new range of analytics. On one hand, Big Data analytics have updated and revived programmatic access to data, which happened to be the norm prior to the advent of SQL. There are plenty of scenarios where taking programmatic approaches are far more efficient, such as dealing with time series data or graph analysis to map many-to-many relationships. It also leverages in-memory data grids such as Oracle Coherence, IBM WebSphere eXtreme Scale, GigaSpaces and others, and, where programmatic development (usually in Java) proved more efficient for accessing highly changeable data for web applications where traditional paths to the database would have been I/O-constrained. Conversely Advanced SQL platforms such as Greenplum and Teradata Aster have provided support for MapReduce-like programming because, even with structured data, sometimes using a Java programmatic framework is a more efficient way to rapidly slice through volumes of data.

Until now, Hadoop has not until now been for the SQL-minded. The initial path was, find someone to do data exploration inside Hadoop, but once you’re ready to do repeatable analysis, ETL (or ELT) it into a SQL data warehouse. That’s been the pattern with Oracle Big Data Appliance (use Oracle loader and data integration tools), and most Advanced SQL platforms; most data integration tools provide Hadoop connectors that spawn their own MapReduce programs to ferry data out of Hadoop. Some integration tool providers, like Informatica, offer tools to automate parsing of Hadoop data. Teradata Aster and Hortonworks have been talking up the potentials of HCatalog, actuality an enhanced version of Hive with RESTful interfaces, cost optimizers, and so on, to provide a more SQL friendly view of data residing inside Hadoop.

But when you talk analytics, you can’t simply write off the legions of SQL developers that populate enterprise IT shops. And beneath the veneer of chaos, there is an implicit order to most so-called “unstructured” data that is within the reach programmatic transformation approaches that in the long run could likely be automated or packaged inside a tool.

At Ovum, we have long believed that for Big Data to crossover to the mainstream enterprise, that it must become a first-class citizen with IT and the data center. The early pattern of skunk works projects, led by elite, highly specialized teams of software engineers from Internet firms to solve Internet-style problems (e.g., ad placement, search optimization, customer online experience, etc.) are not the problems of mainstream enterprises. And neither is the model of recruiting high-priced talent to work exclusively on Hadoop sustainable for most organizations; such staffing models are not sustainable for mainstream enterprises. It means that Big Data must be consumable by the mainstream of SQL developers.

Making Hadoop more SQL-like is hardly new
Hive and Pig became Apache Hadoop projects because of the need for SQL-like metadata management and data transformation languages, respectively; HBase emerged because of the need for a table store to provide a more interactive face – although as a very sparse, rudimentary column store, does not provide the efficiency of an optimized SQL database (or the extreme performance of some columnar variants). Sqoop in turn provides a way to pipeline SQL data into Hadoop, a use case that will grow more common as organizations look to Hadoop to provide scalable and cheaper storage than commercial SQL. While these Hadoop subprojects that did not exactly make Hadoop look like SQL, they provided building blocks from which many of this week’s announcements leverage.

Progress marches on
One train of thought is that if Hadoop can look more like a SQL database, more operations could be performed inside Hadoop. That’s the theme behind Informatica’s long-awaited enhancement of its PowerCenter transformation tool to work natively inside Hadoop. Until now, PowerCenter could extract data from Hadoop, but the extracts would have to be moved to a staging server where the transformation would be performed for loading to the familiar SQL data warehouse target. The new offering, PowerCenter Big Data Edition, now supports an ELT pattern that uses the power of MapReduce processes inside Hadoop to perform transformations. The significance is that PowerCenter users now have a choice: load the transformed data to HBase, or continue loading to SQL.

There is growing support for packaging Hadoop inside a common hardware appliance with Advanced SQL. EMC Greenplum was the first out of gate with DCA (Data Computing Appliance) that bundles its own distribution of Apache Hadoop (not to be confused with Greenplum MR, a software only product that is accompanied by a MapR Hadoop distro). Teradata Aster has just joined the fray with Big Analytics Appliance, bundling the Hortonworks Data Platform Hadoop; this move was hardly surprising given their growing partnership around HCatalog, an enhancement of the SQL-like Hive metadata layer of Hadoop that adds features such as a cost optimizer and RESTful interfaces that make the metadata accessible without the need to learn MapReduce or Java. With HCatalog, data inside Hadoop looks like another Aster data table.

Not coincidentally, there is a growing array of analytic tools that are designed to execute natively inside Hadoop. For now they are from emerging players like Datameer (providing a spreadsheet-like metaphor; which just announced an app store-like marketplace for developers), Karmasphere (providing an application develop tool for Hadoop analytic apps), or a more recent entry, Platfora (which caches subsets of Hadoop data in memory with an optimized, high performance fractal index).

Yet, even with Hadoop analytic tooling, there will still be a desire to disguise Hadoop as a SQL data store, and not just for data mapping purposes. Hadapt has been promoting a variant where it squeezes SQL tables inside HDFS file structures – not exactly a no-brainer as it must shoehorn tables into a file system with arbitrary data block sizes. Hadapt’s approach sounds like the converse of object-relational stores, but in this case, it is dealing with a physical rather than a logical impedance mismatch.

Hadapt promotes the ability to query Hadoop directly using SQL. Now, so does Cloudera. It has just announced Impala, a SQL-based alternative to MapReduce for querying the SQL-like Hive metadata store, supporting most but not all forms of SQL processing (based on SQL 92; Impala lacks triggers, which Cloudera deems low priority). Both Impala and MapReduce rely on parallel processing, but that’s where the similarity ends. MapReduce is a blunt instrument, requiring Java or other programming languages; it splits a job into multiple, concurrently, pipelined tasks where, at each step along the way, reads data, processes it, and writes it back to disk and then passes it to the next task. Conversely, Impala takes a shared nothing, MPP approach to processing SQL jobs against Hive; using HDFS, Cloudera claims roughly 4x performance against MapReduce; if the data is in HBase, Cloudera claims performance multiples up to a factor of 30. For now, Impala only supports row-based views, but with columnar (on Cloudera’s roadmap), performance could double. Cloudera plans to release a real-time query (RTQ) offering that, in effect, is a commercially supported version of Impala.

By contrast, Teradata Aster and Hortonworks promote a SQL MapReduce approach that leverages HCatalog, an incubating Apache project that is a superset of Hive that Cloudera does not currently include in its roadmap. For now, Cloudera claims bragging rights for performance with Impala; over time, Teradata Aster will promote the manageability of its single appliance, and with the appliance has the opportunity to counter with hardware optimization.

The road to SQL/programmatic convergence
Either way – and this is of interest only to purists – any SQL extension to Hadoop will be outside the Hadoop project. But again, that’s an argument for purists. What’s more important to enterprises is getting the right tool for the job – whether it is the flexibility of SQL or raw power of programmatic approaches.

SQL convergence is the next major battleground for Hadoop. Cloudera is for now shunning HCatalog, an approach backed by Hortonworks and partner Teradata Aster. The open question is whether Hortonworks can instigate a stampede of third parties to overcome Cloudera’s resistance. It appears that beyond Hive, the SQL face of Hadoop will become a vendor-differentiated layer.

Part of conversion will involve a mix of cross-training and tooling automation. Savvy SQL developers will cross train to pick up some of the Java- or Java-like programmatic frameworks that will be emerging. Tooling will help lower the bar, reducing the degree of specialized skills necessary. And for programming frameworks, in the long run, MapReduce won’t be the only game in town. It will always be useful for large-scale jobs requiring brute force, parallel, sequential processing. But the emerging YARN framework, which deconstructs MapReduce to generalize the resource management function, will provide the management umbrella for ensuring that different frameworks don’t crash into one another by trying to grab the same resources. But YARN is not yet ready for primetime – for now it only supports the batch job pattern of MapReduce. And that means that YARN is not yet ready for Impala or vice versa.

Of course, mainstreaming Hadoop – and Big Data platforms in general – is more than just a matter of making it all look like SQL. Big Data platforms must be manageable and operable by the people who are already in IT; they will need some new skills and grow accustomed to some new practices (like exploratory analytics), but the new platforms must also look and act familiar enough. Not all announcements this week were about SQL; for instance, MapR is throwing a gauntlet to the Apache usual suspects by extending its management umbrella beyond the proprietary NFS-compatible file system that is its core IP to the MapReduce framework and HBase, making a similar promise of high performance. On the horizon, EMC Isilon and NetApp are proposing alternatives promising a more efficient file system but at the “cost” of separating the storage from the analytic processing. And at some point, the Hadoop vendor community will have to come to grips with capacity utilization issues, because in the mainstream enterprise world, no CFO will approve the purchase of large clusters or grids that get only 10 – 15% utilization. Keep an eye on VMware’s Project Serengeti.

They must be good citizens in data centers that need to maximize resource (e.g., virtualization, optimized storage); must comply with existing data stewardship policies and practices; and must fully support existing enterprise data and platform security practices. These are all topics for another day.

Leo Apotheker to target HP’s forgotten business

Ever since its humble beginnings in the Palo Alto garage, HP has always been kind of a geeky company – in spite of Carly Fiorina’s superficial attempts to prod HP towards a vision thing during her aborted tenure. Yet HP keeps talking about getting back to that spiritual garage.

Software has long been the forgotten business of HP. Although – surprisingly – the software business was resuscitated under Mark Hurd’s reign (revenues have more than doubled as of a few years ago), software remains almost a rounding error in HP’s overall revenue pie.

Yes, Hurd gave the software business modest support. Mercury Interactive was acquired under his watch, giving the business a degree of critical mass when combined with the legacy OpenView business. But during Hurd’s era, there were much bigger fish to fry beyond all the internal cost cutting for which Wall Street cheered, but insiders jeered. Converged Infrastructure has been the mantra, reminding us one and all that HP was still very much a hardware company. The message remains loud and clear with HP’s recent 3PAR acquisition at a heavily inflated $2.3 billion which was concluded in spite of the interim leadership vacuum.

The dilemma that HP faces is that, yes, it is the world’s largest hardware company (they call it technology), but the bulk of that is from personal systems. Ink, anybody?

The converged infrastructure strategy was a play at the CTO’s office. Yet HP is a large enough company that it needs to compete in the leagues of IBM and Oracle, and for that it needs to get meetings with the CEO. Ergo, the rumors of feelers made to IBM Software’s Steve Mills, and the successful offer to Leo Apotheker, and agreement for Ray Lane as non executive chairman.

Our initial reaction was one of disappointment; others have felt similarly. But Dennis Howlett feels that Apotheker is the right choice “to set a calm tone” that there won’t be a massive a debilitating reorg in the short term.

Under Apotheker’s watch, SAP stagnated, hit by the stillborn Business ByDesign and the hike in maintenance fees that, for the moment, made Oracle look warmer and fuzzier. Of course, you can’t blame all of SAP’s issues on Apotheker; the company was in a natural lull cycle as it was seeking a new direction in a mature ERP market. The problem with SAP is that, defensive acquisition of Business Objects notwithstanding, the company has always been limited by a “not invented here” syndrome that has tended to blind the company to obvious opportunities – such as inexplicably letting strategic partner IDS Scheer slip away to Software AG. Apotheker’s shortcoming was not providing the strong leadership to jolt SAP out of its inertia.

Instead, Apotheker’s – and Ray Lane’s for that matter – value proposition is that they know the side of the enterprise market that HP doesn’t. That’s the key to this transition.

The next question becomes acquisitions. HP has a lot on its plate already. It took at least 18 months for HP to digest the $14 billion acquisition of EDS, providing a critical mass IT services and data center outsourcing business. It is still digesting nearly $7 billion of subsequent acquisitions of 3Com, 3PAR, and Palm to make its converged infrastructure strategy real. HP might be able to get backing to make new acquisitions, but the dilemma is that Converged Infrastructure is a stretch in the opposite direction from enterprise software. So it’s not just a question of whether HP can digest another acquisition; it’s an issue of whether HP can strategically focus in two different directions that ultimately might come together, but not for a while.

So let’s speculate about software acquisitions.

SAP, the most logical candidate, is, in a narrow sense, relatively “affordable” given that its stock is roughly about 10 – 15% off its 2007 high. But SAP would be obviously the most challenging given the scale; it would be difficult enough for HP to digest SAP under normal circumstances, but with all the converged infrastructure stuff on its plate, it’s back to the question of how can you be in two places at once. Infor is a smaller company, but as it is also a polyglot of many smaller enterprise software firms, would present HP additional integration headaches that it doesn’t need.

HP may have little choice but to make a play for SAP if IBM or Microsoft were unexpectedly to actively bid. Otherwise, its best bet is to revive the relationship which would give both companies the time to acclimate. But in a rapidly consolidating technology market, who has the luxury of time these days?

Salesforce.com would make a logical stab as it would reinforce HP Enterprise Services’ (formerly EDS) outsourcing and BPO business. It would be far easier for HP to get its arms around this business. The drawback is that Salesforce.com would not be very extensible as an application as it uses a proprietary stored procedures database architecture. That would make it difficult to integrate with a prospective ERP SaaS acquisition, which would otherwise be the next logical step to growing the enterprise software footprint.

Informatica is often brought up – if HP is to salvage its Neoview BI business, it would need a data integration engine to help bolster it. Better yet, buy Teradata, which is one of the biggest resellers of Informatica PowerCenter – that would give HP far more credible presence in the analytics space. Then it will have to ward off Oracle – which has an even more pressing need for Informatica to fill out the data integration piece in its Fusion middleware stack – for Informatica. But with Teradata, there would at least be a real anchor for the Informatica business.

HP has to decide what kind of company it needs to be as Tom Kucharvy summarized well a few weeks back. Can HP afford to converge itself in another direction? Can it afford not to? Leo Apotheker has a heck of a listening tour ahead of him.

The Network is the Computer

It’s sometimes funny that history takes some strange turns. Back in the 1980s, Sun began building its empire in the workgroup by combining two standards: UNIX boxes with TCP/IP networks built in. Sun’s The Network is the Computer message declared that computing was of little value without the network. Of course, Sun hardly had a lock on the idea, as Bob Metcalfe devised the law stating that the value of the network was exponentially related to the number of nodes connected, and that Digital (DEC) (remember them?) actually scaled out the idea at division level where Sun was elbowing its way into the workgroup.

Funny that DEC was there first but they only got the equation half right – bundling a proprietary OS to a standard networking protocol. Fast forward a decade and Digital was history, Sun was the dot in dot com. Then go a few more years later, as Linux made even a “standard” OS like UNIX look proprietary, Sun suffers DEC’s fate (OK they haven’t been acquired, yet and still have cash reserves, if they could only figure out what to do when they finally grow up), and bandwidth, blades get commodity enough that businesses start thinking that the cloud might be a cheaper, more flexible alternative to the data center. Throw in a very wicked recession and companies are starting to think that the numbers around the cloud – cheap bandwidth, commodity OS, commodity blades – might provide the avoided cost dollars they’ve all been looking for. That is, if they can be assured that lacing data out in the cloud won’t violate ay regulatory or privacy headaches.

So today it gets official. After dropping hints for months, Cisco has finally made it official: its Unified Computing System is to provide in essence a prepackaged data center:

Blades + Storage Networking + Enterprise Networking in a box.

By now you’ve probably read the headlines – that UCS is supposed to do, what observers like Dana Gardner term, bring an iPhone like unity to the piece parts that pass for data centers. It would combine blade, network device, storage management and VMware’s virtualization platform (as you might recall, Cisco owns a $150 million chunk of VMware) to provide, in essence, a data center appliance in the cloud.

In a way, UCS is a closing of the circle that began with mainframe host/terminal architectures of a half century ago: a single monolithic architecture with no external moving parts.

Of course, just as Sun wasn’t the first to exploit TCP/IP network, but got the lion’s share of credit from, similarly, Cisco is hardly the first to bridge the gap between compute and networking node. Sun already has a Virtual Network Machines Project for processing network traffic on general-purpose servers, while its Project Crossbow is supposed to make networks virtual as well as part of its OpenSolaris project. Sounds like a nice open source research project to us that’s limited to the context of the Solaris OS. Meanwhile HP has raped up its Procurve business, which aims at the heart of Cisco territory. Ironically, the dancer left on the sidelines is IBM, which sold off its global networking business to AT&T over a decade ago, and its ROLM network switches nearly a decade before that.

It’s also not Cisco’s first foray out of the base of the network OSI stack. Anybody remember Application-Oriented Networking? Cisco’s logic, to build a level of content-based routing into its devices was supposed to make the network “understand” application traffic. Yes, it secured SAP’s endorsement for the rollout, but who were you really going to sell this to in the enterprise? Application engineers didn’t care for the idea of ceding some of their domain to their network counterparts. On the other hand, Cisco’s successful foray into storage networking proves that the company is not a one-trick pony.

What makes UCS different on this go round are several factors. Commoditization of hardware and firmware, emergence of virtualization and the cloud, makes division of networking, storage, and datacenter OS artificial. Recession makes enterprises hungry for found money, maturation of the cloud incents cloud providers to buy pre-packaged modules to cut acquisition costs and improve operating margins. Cisco’s lineup of partners is also impressive – VMware, Microsoft, Red Hat, Accenture, BMC, etc. – but names and testimonials alone won’t make UCS fly. The fact is that IT has no more hunger for data center complexity, the divisions between OS, storage, and networking no longer adds value, and cloud providers need a rapid way of prefabricating their deliverables.

Nonetheless we’ve heard lots of promises of all-in-one before. The good news is this time around there’s lots of commodity technology and standards available. But if Cisco is to make a real alternative to IBM, HP, or Dell, it’s got to put make datacenter or cloud-in-the box reality.

Buy ‘em While They’re Hot

Old computing ideas never seem to die. For instance, grid and other forms of utility computing are updates of classic time-sharing. Besides the fact that both ideas are based around the idea of shared resource, both were — or are — being driven by brute economics. Twenty years ago, few companies could afford their own mainframe. Today, hardware has gotten so cheap that most companies have bought too much of it, winding up with lots of spare computing capacity that are inflating IT overhead dollars.

Behind the idea of shared resource is the notion of virtualization — a way to make multiple systems, or multiple parts of a system, look like one. An established idea from the mainframe world, the concept is gradually being introduced to the heterogeneous computing world, where virtualization has long proven easier said than done.

That’s especially been the case in storage, where the market for storage networks and tools has been highly fragmented (standards take-up has been slow). But for Intel machines, one company — VMware — has created a de facto standard. Initially a geek tool for partitioning test and development machines with several instances of Windows and Linux, the company’s business took off and entered the black last year when it introduced server virtualization. Today, over half of VMware’s revenues are coming from its newer server line, which isn’t surprising, since data center budgets are typically higher than development teams.

With little if any competition (Microsoft’s Virtual PC lags in server functionality), 5-year old VMware has been a rare bright spot in a down tech market. At roughly $635 million — roughly triple anticipated 2004 revenues — EMC paid dearly for the privilege. Originally, EMC was going to wait until next summer to buy VMware, because the ink is still drying on its 2-month old Documentum acquisition (expected to close in a few days). However, given VMware’s growth, EMC had to nip what would have been one of 2004′s most highly awaited tech IPOs in the bud.

The VMware acquisition is a solid but costly buy for EMC. The proof is that there is already significant customer overlap. That makes sense: If you’re virtualizing storage, you probably want to consolidate servers as well.

Butterflies Aren’t Free

Just like Moore’s law in processors, storage continues to get denser and cheaper. While a gigabyte-size storage appliance would have cost about $50,000 in 1980, today a handheld 60-Gbyte drive can be ordered for $200 or less.

Not surprisingly, companies are buying more storage and spending less. According to IDC, in Q2 2003, worldwide shipments totaled 180 petabytes, a 36% increase in raw capacity, but a 4% revenue decrease over the same period last year. Businesses are piling up more transaction, business intelligence, and unstructured (e.g., documents) data, at a faster rate, than ever before.

But as companies accumulate data, they are spending even more on managing it. Gartner predicts that the storage resource management market will grow at a 24% annual clip through 2007.

And that’s why EMC has pushed the Information Life Cycle, a strategy to migrate the company from hardware to solutions provider. Over the past 90 days or so, it has bought two software companies to do just that: Legato, a Veritas wannabee in storage management, and just yesterday, Documentum, one of the largest players in document management.

The Documentum deal grabs our attention because this gets EMC beyond its pure infrastructure niche. Aside from hierarchical storage management (HSM) tools, the relationship between data value and storage management has been ephemeral at best. While business units have typically bought document management solutions, storage management has been the domain of the data center. Consequently, business folks keep buying enterprise applications with little if any idea on what it will cost to store all that data, while the IT folks end up buying solutions reacting to the data access bottlenecks that result.

As with any IT initiative these days, the ideal is getting the business and technology sides of the house on the same page, so both understand the true cost of ownership. Conceivably, with Documentum part of EMC, quantifying the total costs of managing more content should become more straightforward. At least, that’s the goal.

Nonetheless, EMC will face a missionary sell because of the different target buyer demographics. Not surprisingly, EMC plans to continue operating its two software acquisitions as separate business units — and if it continues to do so, both will simply amount to what the consumer goods industry characterizes as “brand line extensions.” EMC’s real challenge, however, is to forge a new target market that finally breaks the blood-brain barrier separating the data center from the rest of the enterprise.