
Searching for Data Scientists as a Service

It’s no secret that rocket .. err … data scientists are in short supply. The explosion of data, the corresponding explosion of tools, and the knock-on impacts of Moore’s and Metcalfe’s laws mean that there is more data, more connections, and more technology to process it all than ever. At last year’s Hadoop World, there was a feeding frenzy for data scientists, which only narrowly eclipsed demand for the more technically oriented data architects. In English, that means:

1. Potential MacArthur Grant recipients who have a passion and insight for data, the mathematical and statistical prowess for ginning up the algorithms, and the artistry for painting the picture that all that data leads to. That’s what we mean by data scientists.
2. People who understand the platform side of Big Data, a.k.a., data architect or data engineer.

The data architect side will be the more straightforward nut to crack. Understanding big data platforms (Hadoop, MongoDB, Riak) and emerging Advanced SQL offerings (Exadata, Netezza, Greenplum, Vertica, and a bunch of recent upstarts like Calpont) is a technical skill that can be taught with well-defined courses. The laws of supply and demand will solve this one – just as they did when the dot com bubble created demand for Java programmers back in 1999.

Behind all the noise for Hadoop programmers, there’s a similar, if quieter, desperate rush to recruit data scientists. While some deride “data scientist” as a buzzword, the need is real.

However, data science will be a tougher nut to crack. It’s all about connecting the dots, which is not as easy as it sounds. The V’s of big data – volume, variety, velocity, and value – require someone who discovers insights from data; traditionally, that role was performed by the data miner. But data miners dealt with better-bounded problems and well-bounded (and known) data sets that made the problem more two-dimensional. The variety of Big Data – in form and in sources – introduces an element of the unknown. Deciphering Big Data requires a mix of investigative savvy, communications skills, creativity/artistry, and the ability to think counter-intuitively. And don’t forget that it all comes atop a foundation of a solid statistical and machine learning background, plus technical knowledge of the tools and programming languages of the trade.

Sometimes it seems like we’re looking for Albert Einstein or somebody smarter.

As nature abhors a vacuum, there’s also a rush not only to define what a data scientist is, but to develop programs that could somehow teach it, software that to some extent packages it, and otherwise throw fresh talent into the meat … err, free market. EMC and other vendors are stepping up to the plate to offer training, not just on platforms, but for data science itself. Kaggle offers an innovative cloud-based, crowdsourced approach to data science, making available a predictive modeling platform and then staging sponsored 24-hour competitions for moonlighting data scientists to devise the best solutions to particular problems (redolent of the Netflix $1 million prize to devise a smarter algorithm for predicting viewer preferences).

With data science talent scarce, we’d expect that consulting firms would buy up talent that could then be “rented” to multiple clients. Excluding a few offshore firms, few SIs have yet stepped up to the plate to roll out formal big data practices (the logical place where data scientists would reside), but we expect that to change soon.

Opera Solutions, which has been in the game of predictive analytics consulting since 2004, is taking the next step down the packaging route. Having raised $84 million in Series A funding last year, the company has staffed up to nearly 200 data scientists, making it one of the largest assemblages of genius this side of Google. Opera’s predictive analytics solutions are designed for a variety of platforms, SQL and Hadoop alike, and today they join the SAP Sapphire announcement stream with a release of their offering on the HANA in-memory database. Andrew Brust provides a good drilldown on the details of this announcement.

From SAP’s standpoint, Opera’s predictive analytics solutions are a logical fit for HANA as they involve the kinds of complex problems (e.g., a computation triggers other computations) that their new in-memory database platform was designed for.

There’s too much value at stake to expect that Opera will remain the only large aggregation of data scientists for hire. But ironically, the barriers to entry will keep the competition narrow and highly concentrated. Of course, with market demand, there will inevitably be a watering down of the definition of data scientists so that more companies can claim they’ve got one… or many.

The laws of supply and demand will kick in for data scientists, but the ramp up of supply won’t be as quick as that for the more platform-oriented data architect or engineer. Of necessity, that supply of data scientists will have to be augmented by software that automates the interpretation of machine learning, but there’s only so far that you can program creativity and counter-intuitive insight into a machine.

Another vote for the Apache Hadoop Stack

As we’ve noted previously, the measure of success of an open source stack is the degree to which the target remains intact. That either comes as part of a captive open source project, where a vendor unilaterally open sources their code (typically hosting the project) to promote adoption, or a community model where a neutral industry body hosts the project and gains support from a diverse cross section of vendors and advanced developers. In that case, the goal is getting the formal standard to also become the de facto standard.

The most successful open source projects are those that represent commodity software – otherwise, why would vendors choose not to compete with software that anybody can freely license or consume? That’s been the secret behind the success of Linux, where there has been general agreement on where the kernel ends, and as a result, a healthy market of products that run atop (and license) Linux. For community open source projects, vendors obviously have to agree on where the line between commodity and unique value-add begins.

And so we’ve discussed that the fruition of Hadoop will require some informal agreement as to exactly what components make Hadoop, Hadoop. For a while, the question appeared in doubt, as one of the obvious pillars – the file system – was being contested with proprietary alternatives like MapR and IBM’s GPFS.

What’s interesting is that the two primary commercial providers that signed on for the proprietary file systems – IBM and EMC (via partnership with MapR) – have retrenched and clarified their messages. They still offer the proprietary file systems in question, but both are now emphasizing that they also offer pure Apache versions. IBM made the announcement today, buried below the fold after its announced intention to acquire federated search player Vivisimo. The announcement had a bit of a grudging aspect to it – unlike Oracle, which has a full OEM agreement with Cloudera, IBM is simply stating that it will certify Cloudera’s Hadoop as one of the approved distributions for InfoSphere BigInsights – there’s no exchange of money or other skin in the game. If IBM also sees demand for the Hortonworks distro (or if it wants to keep Cloudera in its place), it’ll likely add Hortonworks to the approved list as well.

Against this background is a technology that is a moving target. The primary drawback – that there was no redundancy or failover for the HDFS NameNode (which acts as a file directory) – has been addressed in the latest versions of Hadoop. The other – the lack of POSIX compliance, which would let Hadoop be accessed through the NFS standard – only matters for very high, transaction-like (OK, not ACID) performance, which so far has not been an issue. If you want that kind of performance, Hadoop’s HBase offers more promise.
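For readers curious what the NameNode failover fix looks like in practice, the HA setup in recent Apache Hadoop releases is configured along these lines in hdfs-site.xml (the nameservice and host names below are placeholders):

```xml
<configuration>
  <!-- Logical name for the HA-enabled namespace -->
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <!-- The two NameNodes backing the namespace -->
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>namenode1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>namenode2.example.com:8020</value>
  </property>
  <!-- Client-side failover between the active and standby NameNodes -->
  <property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
</configuration>
```

Clients address the logical nameservice rather than a single host, so a NameNode failure no longer takes the file directory down with it.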

But just as the market has passed judgment on what comprises the Hadoop “kernel” (using some Linuxspeak), that doesn’t rule out differences in implementation. Teradata Aster and Sybase IQ are promoting their analytics data stores as swappable, more refined replacements for HBase (Hadoop’s column store), while upstarts like Hadapt are proposing to hang SQL data nodes onto HDFS.

When it comes to Hadoop, you gotta reverse the old maxim: The more things stay the same, the more things are actually changing.

SAP and databases no longer an oxymoron

In its rise to leadership of the ERP market, SAP shrewdly placed bounds around its strategy: it would stick to its knitting on applications and rely on partnerships with systems integrators to get critical mass implementation across the Global 2000. When it came to architecture, SAP left no doubt of its ambitions to own the application tier, while leaving the data tier to the kindness of strangers (or in Oracle’s case, the estranged).

Times change in more ways than one – and one of those ways is in the data tier. The headlines of SAP acquiring Sybase (for its mobile assets, primarily) and subsequent emergence of HANA, its new in-memory data platform, placed SAP in the database market. And so it was that at an analyst meeting last December, SAP made the audacious declaration that it wanted to become the #2 database player by 2015.

Of course, none of this occurs in a vacuum. SAP’s declaration to become a front line player in the database market threatens to destabilize existing relationships with Microsoft and IBM as longtime SAP observer Dennis Howlett commented in a ZDNet post. OK, sure, SAP is sick of leaving money on the table to Oracle, and it’s throwing in roughly $500 million in sweeteners to get prospects to migrate. But if the database is the thing, to meet its stretch goals, says Howlett, SAP and Sybase would have to grow that part of the business by a cool 6x – 7x.

But SAP would be treading down a ridiculous path if it were just trying to become a big player in the database market for the heck of it. Fortuitously, during SAP’s press conference on announcements of their new mobile and database strategies, chief architect Vishal Sikka tamped down the #2 aspirations as that’s really not the point – it’s the apps that count, and increasingly, it’s the database that makes the apps. Once again.

Back to our main point: IT innovation goes in waves. During the emergence of client/server, innovation focused on the database, where the need was mastering SQL and relational table structures; during the latter stages of client/server and the subsequent waves of Web 1.0 and 2.0, activity shifted to the app tier, which grew more distributed. With the emergence of Big Data and Fast Data, energy has shifted back to the data tier, given the efficiencies of processing data – big or fast – inside the data store itself. Not surprisingly, when you hear SAP speak about HANA, they describe the ability to perform more complex analytic problems or compound operational transactions. It’s no coincidence that SAP now states that it’s in the database business.

So how will SAP execute its new database strategy? Given the hype over HANA, how does SAP convince Sybase ASE, IQ, and SQL Anywhere customers that they’re not headed down a dead end street?

That was the point of the SAP announcements, which in the press release stated the near term roadmap but shed little light on how SAP would get there. Specifically, the announcements were:
• SAP BW on HANA is now going GA, and at the low (SMB) end SAP is coming out with aggressive pricing: roughly $3,000 for SAP BusinessOne on HANA and $40,000 for HANA Edge.
• Ending a 15-year saga, SAP will finally port its ERP applications to Sybase ASE, with a tentative target date of year end. HANA will play a supporting role as the real-time reporting adjunct platform for ASE customers.
• Sybase SQL Anywhere would be positioned as the mobile front end database atop HANA, supporting real-time mobile applications.
• Sybase’s event stream (CEP) offerings would have optional integration with HANA, providing convergence between CEP and BI – where rules are used for stripping key event data for persistence in HANA. In so doing, analysis of event streams could be integrated or directly correlated with historical data.
• Integrations are underway between HANA and IQ with Hadoop.
• Sybase is extending its PowerDesigner data modeling tools to address each of its database engines.
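The CEP-to-HANA convergence in the list above boils down to running rules over an event stream and persisting only the events that matter. As a rough, generic sketch (the event fields and rule are invented for illustration; Sybase’s actual CEP products express rules in their own language):

```python
# Illustrative rule-based event filter: keep only events worth persisting.
# The fields and the rule predicate are made up for this sketch; a real
# Sybase CEP deployment would define rules in its own event language.

def filter_events(events, rule):
    """Apply a predicate rule to a stream, yielding key data to persist."""
    for event in events:
        if rule(event):
            # In the integration described above, these would land in HANA.
            yield {"symbol": event["symbol"], "price": event["price"]}

ticks = [
    {"symbol": "SAP", "price": 52.0, "volume": 100},
    {"symbol": "SAP", "price": 61.5, "volume": 90000},  # spike worth keeping
    {"symbol": "ORCL", "price": 29.1, "volume": 120},
]

# Rule: persist only unusually large trades for historical correlation
big_trades = list(filter_events(ticks, lambda e: e["volume"] > 10000))
print(big_trades)  # [{'symbol': 'SAP', 'price': 61.5}]
```

The point of the pattern is that the full stream is never stored; only the rule-selected residue is correlated against historical data.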

Most of the announcements, like HANA going GA or Sybase ASE supporting SAP Business suite, were hardly surprises. Aside from go-to-market issues, which are many and significant, we’ll direct our focus on the technology roadmaps.

We’ve maintained that if SAP were serious about its database goals, that it had to do three basic things:
1. Unify its database organization. The good news is that it has started down that path as of January 1 of this year. Of course, org charts are only the first step as ultimately it comes down to people.
2. Branding. Although long eclipsed in the database market, Sybase still has an identifiable brand and would be the logical choice; for now SAP has punted.
3. Cross-fertilize technology. Here, SAP can learn lessons from IBM which, despite (or because of) acquiring multiple products that fall under different brands, freely blends technologies. For instance, Cognos BI reporting capabilities are embedded into Rational and Tivoli reporting tools.

The third part is the heavy lift. For instance, given that data platforms are increasingly employing advanced caching, it would at first glance seem logical to blend some of HANA’s in-memory capabilities into the ASE platform; however, architecturally, that would be extremely difficult, as one of HANA’s strengths – dynamic indexing – would be difficult to implement in ASE.

On the other hand, given that HANA can index or restructure data on the fly (e.g., organize data into columnar structures on demand), the question is, does that make IQ obsolete? The short answer is that while memory keeps getting cheaper, it will never be as cheap as disk and that therefore, IQ could evolve as near-line storage for HANA. Of course that begs the question as to whether Hadoop could eventually perform the same function. SAP maintains that Hadoop is too slow and therefore should be reserved for offline cases; that’s certainly true today, but given developments with HBase, it could easily become fast and cheap enough for SAP to revisit the IQ question a year or two down the road.

Not that SAP Sybase is sitting still with Hadoop integration. They are providing MapReduce and R capabilities in IQ (SAP Sybase is hardly alone here, as most Advanced SQL platforms offer similar support). SAP Sybase is also providing capabilities to map IQ tables into Hadoop Hive, slotting IQ as an alternative to HBase; in effect, that’s akin to a number of strategies to put SQL layers inside Hadoop (in a way, similar to what the lesser-known Hadapt is doing). And of course, like most of the relational players, SAP Sybase also supports bulk ETL/ELT loads from HDFS to HANA or IQ.
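The bulk ETL/ELT loads mentioned above reduce to a familiar pattern: read records out of HDFS files and bulk-insert them into the SQL target. A toy sketch of that pattern, with an in-memory CSV standing in for an HDFS extract and SQLite standing in for HANA or IQ:

```python
import csv
import io
import sqlite3

# Sketch of a bulk ELT load: the "HDFS extract" is simulated as an
# in-memory CSV; sqlite stands in for the target SQL store (IQ or HANA
# in the text above). Table and column names are invented.
extract = io.StringIO("user,clicks\nalice,3\nbob,7\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (user TEXT, clicks INTEGER)")

# Bulk load: parse once, insert in one batch rather than row-by-row calls
rows = [(r["user"], int(r["clicks"])) for r in csv.DictReader(extract)]
conn.executemany("INSERT INTO clicks VALUES (?, ?)", rows)

total = conn.execute("SELECT SUM(clicks) FROM clicks").fetchone()[0]
print(total)  # 10
```

Once loaded, the data is queryable with ordinary SQL, which is exactly the appeal of the ELT route for relational players.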

On SAP’s side for now is the paucity of Hadoop talent, so pitching IQ as an alternative to HBase may help soften the blow for organizations seeking to get a handle on Hadoop. But in the long run, we believe that SAP Sybase will have to revisit this strategy. Because, if it’s serious about the database market, it will have to amplify its focus to add value atop the new realities on the ground.

Fast Data hits the Big Data Fast Lane

Of the 3 “V’s” of Big Data – volume, variety, velocity (we’d add “value” as the 4th V) – velocity has been the unsung “V.” With the spotlight on Hadoop, the popular image of Big Data is large petabyte data stores of unstructured data (which are the first two V’s). While Big Data has been thought of as large stores of data at rest, it can also be about data in motion.

“Fast Data” refers to processes that require lower latencies than would otherwise be possible with optimized disk-based storage. Fast Data is not a single technology, but a spectrum of approaches that process data that might or might not be stored. It could encompass event processing, in-memory databases, or hybrid data stores that optimize cache with disk.
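To make the “data that might or might not be stored” point concrete, here’s a minimal sketch of stream processing: a rolling aggregate computed as events arrive, with only a small window ever held in memory (the window size and readings are illustrative):

```python
from collections import deque

def rolling_average(stream, window=3):
    """Process events as they arrive, keeping only a small window in memory."""
    buf = deque(maxlen=window)  # old events fall off automatically
    for value in stream:
        buf.append(value)
        yield sum(buf) / len(buf)

# Each reading is acted on immediately; nothing beyond the window is stored.
readings = [10, 20, 30, 40]
print(list(rolling_average(readings)))  # [10.0, 15.0, 20.0, 30.0]
```

The same shape underlies CEP engines and in-memory analytics alike: latency comes from acting on data in flight rather than landing it on disk first.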

Fast Data is nothing new, but because of the cost of memory, was traditionally restricted to a handful of extremely high-value use cases. For instance:
• Wall Street firms routinely analyze live market feeds, and in many cases, run sophisticated complex event processing (CEP) programs on event streams (often in real time) to make operational decisions.
• Telcos have handled such data in optimizing network operations while leading logistics firms have used CEP to optimize their transport networks.
• In-memory databases, used as a faster alternative to disk, have similarly been around for well over a decade, having been employed for program stock trading, telecommunications equipment, airline schedulers, and large online retail destinations (e.g., Amazon).

Hybrid in-memory and disk systems have also become commonplace, especially amongst data warehousing systems (e.g., Teradata, Kognitio), and more recently among the emergent class of advanced SQL analytic platforms (e.g., Greenplum, Teradata Aster, IBM Netezza, HP Vertica, ParAccel) that employ smart caching in conjunction with a number of other bells and whistles to juice SQL performance and scaling (e.g., flatter indexes, extensive use of various data compression schemes, columnar table structures, etc.). Many of these systems are in turn packaged as appliances that come with specially tuned, high-performance backplanes and direct attached disk.

Finally, caching is hardly unknown to the database world. Hot spots of data that are frequently accessed are often placed in cache, as are snapshots of database configurations that are often stored to support restore processes, and so on.
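The hot-spot idea is easy to demonstrate: repeated reads of frequently accessed keys are served from cache, and only misses fall through to slower storage. A toy sketch using Python’s built-in LRU cache (the “disk” here is simulated):

```python
from functools import lru_cache

# Hot spots of frequently accessed data are served from cache; only
# misses fall through to the (simulated) disk-based store.
DISK_READS = 0

@lru_cache(maxsize=2)
def fetch(key):
    global DISK_READS
    DISK_READS += 1          # a real system would hit disk here
    return f"row-{key}"

for k in ["a", "b", "a", "a", "b"]:   # "a" and "b" are hot keys
    fetch(k)

print(DISK_READS)  # 2 - repeated reads of hot keys never touch disk
```

The engineering problem George Crump raises later in this piece is exactly what happens when the cache’s eviction policy guesses the hot set wrong.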

So what’s changed?
The usual factors: the same data explosion that created the urgency for Big Data is also generating demand for making the data instantly actionable. Bandwidth, commodity hardware, and of course, declining memory prices, are further forcing the issue: Fast Data is no longer limited to specialized, premium use cases for enterprises with infinite budgets.

Not surprisingly, pure in-memory databases are now going mainstream: Oracle and SAP are choosing in-memory as one of the next places where they are establishing competitive stakes, with SAP HANA vs. Oracle Exalytics. Both Oracle and SAP for now are targeting analytic processing, including OLAP (raising the size limits on OLAP cubes) and more complex, multi-stage analytic problems that traditionally would have required batch runs (such as multivariate pricing) or would not have been run at all (too complex, too much delay). More to the point, SAP is counting on HANA as a major pillar of its stretch goal to become the #2 database player by 2015, which means expanding HANA’s target to include next generation enterprise transactional applications with embedded analytics.

Potential use cases for Fast Data could encompass:
• A homeland security agency monitoring the borders requires the ability to parse, decipher, and act on complex occurrences in real time to prevent suspicious people from entering the country
• Capital markets trading firms require real-time analytics and sophisticated event processing to conduct algorithmic or high-frequency trades
• Entities managing smart infrastructure must digest torrents of sensory data to make real-time decisions that optimize the use of transportation or public utility infrastructure
• B2B consumer products firms monitoring social networks may require real-time response to understand sudden swings in customer sentiment

For such organizations, Fast Data is no longer a luxury, but a necessity.

More specialized use cases are similarly emerging now that the core in-memory technology is becoming more affordable. YarcData, a startup from venerable HPC player Cray, is targeting graph data, which represents data with many-to-many relationships. Graph computing is extremely process-intensive and, as such, has traditionally been run in batch when involving Internet-size data sets. YarcData adopts a classic hybrid approach that pipelines computations in memory but persists data to disk. YarcData is the tip of the iceberg – we expect to see more specialized applications that utilize hybrid caching to combine speed with scale.
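To see why graph data is so process-intensive, consider that even the simplest traversal hops unpredictably across many-to-many relationships, defeating the sequential access patterns disks prefer. A toy breadth-first traversal makes the access pattern visible (the graph is obviously illustrative):

```python
from collections import deque

# Toy graph as an adjacency list: edges model many-to-many relationships.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": [],
}

def hops(graph, start, target):
    """Breadth-first search: fewest hops between two nodes, or None."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nbr in graph[node]:   # random access scattered across the graph
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None

print(hops(graph, "alice", "dave"))  # 2
```

At Internet scale those neighbor lookups land all over the data set, which is why keeping the working set in memory (as YarcData does) pays off so dramatically.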

But don’t forget, memory’s not the new disk
The movement – or tiering – of data to faster or slower media is also nothing new. What is new is that data in memory may no longer be such a transient thing, and if memory is relied upon for in situ processing of data in motion or rapid processing of data at rest, memory cannot simply be treated as the new disk. Excluding specialized forms of memory such as ROM, anything solid state is by nature volatile: there goes your power… and there goes your data. Not surprisingly, in-memory systems such as HANA still replicate to disk to reduce volatility. For conventional disk data stores that increasingly leverage memory, Storage Switzerland’s George Crump makes the case that caching practices must become smarter to avoid misses (where data gets mistakenly swapped out). There are also balance-of-system considerations: memory may be fast, but is its processing speed well matched with the processor? Maybe solid state overcomes the I/O issues associated with disk, but it may still be vulnerable to coupling issues if processors get bottlenecked or MapReduce jobs are not optimized.
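The replicate-to-disk point about HANA reflects a general pattern: make the write durable first, then serve it from memory. A minimal write-ahead-log sketch of that idea (the log format is invented; real systems are far more sophisticated):

```python
import json
import os
import tempfile

# Minimal write-ahead persistence: every in-memory write is appended to a
# log on disk first, so a power loss only costs unflushed volatile state.
class DurableDict:
    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        if os.path.exists(log_path):            # replay the log on restart
            with open(log_path) as f:
                for line in f:
                    k, v = json.loads(line)
                    self.data[k] = v

    def put(self, key, value):
        with open(self.log_path, "a") as f:     # durable first...
            f.write(json.dumps([key, value]) + "\n")
        self.data[key] = value                  # ...then fast, in memory

log = os.path.join(tempfile.mkdtemp(), "wal.log")
DurableDict(log).put("balance", 100)
recovered = DurableDict(log)    # simulate restart after a power loss
print(recovered.data)  # {'balance': 100}
```

Reads stay memory-speed; only writes pay the disk tax, which is the trade that keeps “memory as primary store” from being an oxymoron.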

Declining memory prices are putting Fast Data on the fast lane to mainstream. But while the technology is now becoming affordable, we’re still early in the learning curve of how to design for it.

Informatica’s Stretch Goal

Informatica is within a year or two of becoming a $1 billion company, and the CEO’s stretch goal is to get to $3b.

Informatica has been on a decent tear. It’s had a string of roughly 30 consecutive growth quarters, with growth over the last 6 years averaging 20% and 2011 revenues nearing $800 million. CEO Sohaib Abbasi took charge back in 2004, lifting Informatica out of its midlife crisis by ditching an abortive foray into analytic applications and instead expanding from the company’s data transformation roots to data integration. Getting the company to its current level came largely through a series of acquisitions that expanded the category of data integration itself. While master data management (MDM) has been the headliner, other recent acquisitions have targeted information lifecycle management (ILM), complex event processing (CEP), and low latency messaging (ultra messaging), along with filling gaps in its B2B and data quality offerings. While some of those pieces were obvious additions, others, such as ultra messaging or event processing, were not.

Abbasi is talking about a stretch goal of $3 billion in revenue. The obvious chunk is to deepen the company’s share of existing customer wallets. We’re not at liberty to say how much, but Informatica had a significant number of 6-figure deals. Getting more $1m+ deals will help, but on their own they won’t triple revenue.

So how to get to $3 billion?
Obviously, two strategies: deepen the existing business while applying the original formula to expand the footprint of what counts as data integration.

First, the existing business. Of the current portfolio, MDM is likely best primed to allow Informatica to more deeply penetrate the installed base. Most of its data integration clients haven’t yet done MDM, and it is not a trivial investment. And for MDM clients who may have started with a customer or product domain, there are always more domains to tackle. During Q&A, Abbasi listed MDM as having as much potential addressable market as the traditional ETL and data quality segments.

The addition of SAP and Oracle veteran Dennis Moore to the Informatica MDM team points to the classic tightrope for any middleware vendor that claims it’s not in the applications game – build more “solutions” or jumpstart templates to confront the same generic barrier that packaged applications software was designed to surmount: provide customers an alternative to raw toolsets or custom programming. For MDM, think industry-specific “solutions” like counter-party risk, or horizontal patterns like social media profiles. If you’re Informatica, don’t think analytic applications.

That’s part of a perennial debate (or rant) on whether middleware is the new enterprise application: you implement for a specific business purpose as opposed to a technology project, such as application or data integration, and you implement with a product that offers patterns of varying granularity as a starting point. Informatica MDM product marketing director Ravi Shankar argues it’s not an application because applications have specific data models and logic that become their own de facto silos, whereas MDM solutions reuse the same core metadata engine for different domains (e.g., customer, product, operational process). Our contention? If it solves a business problem and it’s more than a raw programming toolkit, it’s a de facto application. If anybody else cares about this debate, raise your hand.
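Shankar’s argument – one metadata engine, many domains – can be illustrated with a toy match-key function whose behavior is driven entirely by per-domain configuration. This is a sketch of the concept, not Informatica’s actual MDM engine:

```python
# Illustrative: one generic match engine reused across domains, configured
# by per-domain metadata declaring which fields identify a master record.
# Field names and configs are invented for the sketch.

def match_key(record, domain_config):
    """Build a normalized match key from the fields the domain declares."""
    return tuple(str(record.get(f, "")).strip().lower()
                 for f in domain_config["match_fields"])

customer_domain = {"match_fields": ["name", "email"]}
product_domain = {"match_fields": ["sku"]}

a = {"name": "Ada Lovelace ", "email": "ADA@example.com"}
b = {"name": "ada lovelace", "email": "ada@example.com"}

# Same engine, different domain metadata:
print(match_key(a, customer_domain) == match_key(b, customer_domain))  # True
print(match_key({"sku": "X-1"}, product_domain))  # ('x-1',)
```

Whether that counts as “an application” or “a metadata engine” is precisely the debate above; the code is agnostic, which is rather the point.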

MDM is typically a very dry subject, but demo’ing a social MDM straw man – a commerce application integrated into Facebook – sparked Twitter debate among analysts in the room. The operative notion is that such a use of MDM could update the customer’s (some might say, victim’s) profile based on the associations that they make in social networks. An existing Informatica higher education client that shall remain anonymous actually used MDM to mine LinkedIn to prove that its grads got jobs.

This prompts the question: just because you can do it, should you? When a merchant knows just a bit too much about you – and your friends (who may not have necessarily opted in) – that more than borders on creepy. Informatica’s Facebook MDM integration was quite effective; as a pattern for social business, well, we’ll see.

So what about staking new ground? When questioned, Abbasi stated that Informatica had barely scratched the surface with productizing around several megatrend areas that it sees impacting its market: cloud, social media, mobile, and Big Data. More specifically:
• Cloud continues to be a growing chunk of the business. Informatica doesn’t have all of its tooling up in the cloud, but it’s getting there. Consumption of services from the Informatica Cloud continues to grow at a 100 – 150% annual run rate. Most of the 1500 cloud customers are new to Informatica. Among recent introductions are a wizard-driven Contact Validation service that verifies and corrects postal addresses from over 240 countries and territories. A new rapid connectivity framework further eases the ability of third parties to OEM Informatica Cloud services.
• Social media – there were no individual product announcements here per se, just that Informatica’s tools must increasingly parse data coming from social feeds. That covers MDM, data profiling, and data quality. Much of it leverages HParser, the new Hadoop data parsing tool released late last year.
• Mobile – for now this is mostly a matter of making Informatica tools and apps (we’ll use the term) consumable on small devices. On the back end, there are opportunities for virtualizing and replicating data on demand to the edges of highly distributed networks. Aside from newly-announced features such as iPhone and Android support for monitoring the Informatica cloud, for now Informatica is making a statement of product direction.
• Big Data – Informatica, like other major BI and database vendors, has discovered Big Data with a vengeance over the past year. The ability to extract from Hadoop is nothing special – other vendors have that – but Informatica took a step ahead with the release of HParser last fall. In general there’s growing opportunity for tooling in a variety of areas touching Hadoop, with Informatica’s data integration focus being one of them. We expect to see extension of Informatica’s core tools to not only parse or extract from Hadoop, but increasingly, work natively inside HDFS, on the assumption that customers are no longer simply using it as a staging platform. We also see opportunities in refinements to HParser, such as templates or other shortcuts for deciphering sensory data. ILM is another obvious one: while Facebook et al might not archive or deprecate their Hadoop data, mere mortal enterprises will have to bite the bullet. Data quality in Hadoop in many cases may not demand the same degree of vigilance as SQL data warehouses, creating demand for lighter weight data profiling and cleansing tooling. And for other real-time, web-centric use cases, alternative stores like MongoDB, Couchbase, and Cassandra may become new Informatica data platform targets.
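Much of the Hadoop tooling opportunity above – HParser included – comes down to turning semi-structured feed or sensor data into rows. A generic sketch of that parse step (the feed format is invented; this is not HParser’s actual API):

```python
# Invented sensor-feed format: "timestamp|sensor_id|metric=value;metric=value"
# This sketches the parse-into-rows step generically; it is not HParser's
# actual API or template language.

def parse_line(line):
    ts, sensor, metrics = line.strip().split("|")
    row = {"ts": ts, "sensor": sensor}
    for pair in metrics.split(";"):       # flatten metrics into columns
        key, value = pair.split("=")
        row[key] = float(value)
    return row

feed = "2012-05-01T12:00:00|s-42|temp=21.5;humidity=0.4"
print(parse_line(feed))
# {'ts': '2012-05-01T12:00:00', 'sensor': 's-42', 'temp': 21.5, 'humidity': 0.4}
```

The value of templates, as suggested above, is precisely that users shouldn’t have to hand-write a parser like this for every feed format they meet.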

What, no exit talk?
Abbasi commented at the end of the company’s annual IT analyst meeting that this was the first time in recent memory that none of the analysts asked who would buy Informatica, and when. Buttonholing him after the session, we got his take, which, very loosely translated into Survivor terms: Informatica has avoided getting voted off the island.

At this point, Informatica’s main rivals – Oracle and IBM – have bulked up their data integration offerings to the point where an Informatica acquisition would no longer be gap filling; it would simply be a strategy of taking out a competitor – and with Informatica’s growth, an expensive one at that. One could then point to dark horses like EMC, Tibco, Teradata, or SAP (for obvious reasons we’ve omitted HP). A case might be made for EMC, or for SAP if it remains serious about raising its profile as a database player – but we believe both have bigger fish to fry. Never say never. But otherwise, the common thread is that data integration will not differentiate these players and therefore is not strategic to their growth plans.

EMC’s Hadoop Strategy cuts to the chase

To date, Big Storage has been locked out of Big Data. It’s been all about direct attached storage, for several reasons. First, Advanced SQL players have typically optimized their architectures through data structure (columnar storage), unique compression algorithms, and liberal use of caching to juice response times over hundreds of terabytes. For the NoSQL side, it’s been about cheap, cheap, cheap along the Internet data center model: have lots of commodity stuff and scale it out. Hadoop was engineered exactly for such an architecture; rather than speed, it was optimized for sheer linear scale.

Over the past year, most of the major platform players have planted their table stakes with Hadoop. Not surprisingly, IT household names are seeking to somehow tame Hadoop and make it safe for the enterprise.

Up till now, Hadoop has been the province of Internet firms with armies of the best software engineers money could buy – organizations that could brute-force their way to scaling out humongous clusters, invent their own technology where necessary, and share with and harvest from the open source community at will. That’s hardly a suitable scenario for the enterprise mainstream, so not surprisingly, the common thread behind the diverse strategies of IBM, EMC, Microsoft, and Oracle toward Hadoop has been to make it more approachable.

What’s been conspicuously absent so far was a play from Big Optimized Storage. The conventional wisdom is that SAN or NAS are premium, architected systems whose costs might be prohibitive when you talk petabytes of data. Similarly, so far there has been a different operating philosophy behind the first generation implementations from the NoSQL world that assumed that parts would fail, and that five nines service levels were overkill. And anyway, the design of Hadoop brute forced the solution: replicate to have three unique copies of the data distributed around the cluster, as hardware is cheap.
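Hadoop’s brute-force redundancy is literally a one-line setting: the number of copies HDFS keeps of each block is set in hdfs-site.xml, with three as the default:

```xml
<configuration>
  <!-- Number of copies HDFS keeps of each block; 3 is the default -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```

Three copies on commodity disk is the design bet that made premium SAN/NAS economics look superfluous to the first generation of Hadoop shops.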

As Big Data gains traction in the enterprise, some of it will certainly fit this pattern of something being better than nothing, as the result is unique insights that would not otherwise be possible. For instance, if a system running analysis of Facebook or Twitter data goes down, it probably won’t take the business down with it. But as enterprises adopt Hadoop – and as pioneers stretch Hadoop to new operational use cases such as what Facebook is doing with its messaging system – those concepts of mission-criticality are being revisited.

And so, ever since EMC announced last spring that its Greenplum unit would start supporting and bundling different versions of Hadoop, we’ve been waiting for the other shoe to drop: When would EMC infuse its Big Data play with its core DNA, storage?

Today, EMC announced that its Isilon networked storage system was adding native support for Apache Hadoop’s HDFS file system. There were some interesting nuances to the rollout.

1. Big vendors are feeling their way around Hadoop
It’s interesting to see how IT household names are cautiously navigating their way into unfamiliar territory. EMC becomes the latest, after Oracle and Microsoft, to calibrate their Hadoop strategy in public.

Oracle announced its Big Data appliance last fall before it lined up its Hadoop distribution. Microsoft ditched its Dryad project built around its HPC Server. Now EMC has recalibrated its Hadoop strategy; when it first unveiled that strategy last spring, the spotlight was on MapR’s proprietary alternatives to the HDFS file system of Apache Hadoop. It’s interesting that vendors’ initial announcements have either been vague, or have been tweaked as they’ve waded into the market. More on EMC’s shift below.

2. What is Hadoop? For EMC, HDFS is the mainstream, not MapR

MapR’s strategy (and IBM’s along with it, regarding GPFS) has prompted debate and concern in the Hadoop community about commercial vendors forking the technology. As we’ve ranted previously, Hadoop’s growth will be tied, not only to megaplatform vendors that support it, but the third party tools and solutions ecosystem that grows around it. For such a thing to happen, ISVs and consulting firms need to have a common target to write against, and having forked versions of Hadoop won’t exactly grow large partner communities.

Regarding EMC, the original strategy was two Greenplum Hadoop editions: a Community Edition with a free Apache distro and an Enterprise Edition that bundled MapR, both under the Greenplum HD branding umbrella. At first blush, it looked like EMC was going to earn the bulk of its money from the proprietary side of the Hadoop business. What’s significant is that the new announcement of Isilon support pertains to the HDFS open source side. More to the point, EMC is rebranding and subtly repositioning its Greenplum Hadoop offerings: Greenplum HD is the Apache HDFS edition with the optional Isilon support, and Greenplum MR is the MapR version, targeted at the niche of advanced Hadoop use cases that demand higher performance.

Update: Even if EMC later extends Isilon support to Greenplum MR, it doesn’t change the core positioning.

Coming atop recent announcements from Oracle and Microsoft, which have come down clearly on the side of OEM’ing Apache rather than anything limited or proprietary, this amounts to an unqualified endorsement of Apache Hadoop/HDFS as not only the formal, but also the de facto standard. This reflects emerging conventional wisdom that the enterprise mainstream is leery of lock-in to anything that smells proprietary for technology where they are still on the learning curve. Other forks may emerge, but they will not be at the base file system layer. This leaves IBM and MapR pigeonholed – admittedly, there will be API compatibility, but clearly both are swimming upstream.

3. Central Storage is newest battleground for Scale Up vs. Scale Out Hadoop

As noted earlier, Hadoop’s heritage has been the classic Internet data center scale-out model. The advantage is that, leveraging Hadoop’s highly linear scalability, organizations could expand their clusters quite easily by plugging in more commodity servers and disks. Pioneers or purists would scoff at the notion of an appliance approach because it was always simply scaling out inexpensive, commodity hardware, rather than paying premiums for big vendor boxes.

In blunt terms, the choice is whether you pay now or pay later. As mentioned before, do-it-yourself compute clusters require sweat equity – you need engineers who know how to design, deploy, and operate them. The flipside is that many, arguably most corporate IT organizations either lack the skills or the capital. There are various solutions to what might otherwise appear a Hobson’s Choice:
• Go to a cloud service provider that has already created the infrastructure, such as what Microsoft is offering with its Hadoop-on-Azure services;
• Look for a happy, simpler medium such as Amazon’s Elastic MapReduce service;
• Subscribe to SaaS providers that offer Hadoop applications (e.g., social network analysis, smart grid as a service) as a service;
• Get a platform and have a systems integrator put it together for you (key to IBM’s BigInsights offering, and applicable to any SI that has a Hadoop practice)
• Go to an appliance or engineered systems approach that puts Hadoop and/or its subsystems in a box, such as with Oracle Big Data Appliance or EMC’s Greenplum DCA. The systems engineering is mostly done for you, but the increments for growing the system can be much larger than simply adding a few x86 servers here or there (Greenplum HD DCA can scale in groups of 4 server modules). Entry or expansion costs are not necessarily cheap, but then again, you have to balance capital cost against labor.
• Surrounding Hadoop infrastructure with solutions. This is not a mutually exclusive strategy; unless you’re Cloudera or Hortonworks, which make their business bundling and supporting the core Apache Hadoop platform, most of the household names will bundle frameworks, algorithms, and eventually solutions that in effect place Hadoop under the hood. For EMC, the strategy is their recent announcement of a Unified Analytics Platform (UAP) that provides collaborative development capabilities for Big Data applications. EMC is (or will be) hardly alone here.

With EMC’s new offering, the scale-up option tackles the next variable: storage. This is the natural progression of a market that will address many constituencies, and where there will be no single silver bullet that applies to all.

Oracle fills another gap in its Big Data offering

When we last left Oracle’s Big Data plans, there was definitely a missing piece. Oracle’s Big Data Appliance as initially disclosed at last fall’s OpenWorld was a vague plan that appeared to be positioned primarily as an appliance that would accompany and feed data to Exadata. Oracle did specify some utilities, such as an enterprise version of the open source R statistical processing program that was designed for multithreaded execution, plus a distribution of a NoSQL database based on Oracle’s BerkeleyDB as an alternative to Apache Hive. But the emphasis appeared to be extraction and transformation of data for Exadata via Oracle’s own utilities that were optimized for its platform.

As such, Oracle’s plan for Hadoop was to compete, not with Cloudera (or Hortonworks), which feature full Apache Hadoop platforms, but with EMC, which offers a comparable, appliance-based strategy that pairs Hadoop with an Advanced SQL data store, and with IBM, which took a different approach by emphasizing Hadoop as an analytics platform destination enhanced with text and predictive analytics engines, and other features such as unique query languages and file systems.

Oracle’s initial Hadoop blueprint lacked explicit support for many pieces of the Hadoop stack such as HBase, Hive, Pig, Zookeeper, and Avro. No more. With Oracle’s announcement of general availability of the Big Data Appliance, it is filling in the blanks by disclosing that it is OEM’ing Cloudera’s CDH Hadoop distribution and, more importantly, the management tooling that is key to Cloudera’s revenue stream. For Oracle, OEM’ing Cloudera’s Hadoop offering fully fleshes out its Hadoop distribution and positions it as a full-fledged analytic platform in its own right; for Cloudera, the deal is a coup that will help establish its distribution as the reference. It is fully consistent with Cloudera’s goal to become the Red Hat of Hadoop, as it does not aspire to spread its footprint into applications or frameworks.

Of course, whenever you put Oracle in the same sentence as OEM deal, the question of acquisition inevitably pops up. There are several reasons why an Oracle acquisition of Cloudera is unlikely.

1. Little upside for Oracle. While Oracle likes to assert maximum control of the stack, from software to hardware, its foray into productizing its own support for Red Hat Enterprise Linux has been strictly defensive; its offering has not weakened Red Hat.

2. Scant leverage. Compare Hadoop to MySQL and you have a Tale of Two Open Source projects. One is hosted and controlled by Apache, the other is hosted and controlled by Oracle. As a result, while Oracle can change licensing terms for MySQL, which it owns, it has no such control over Hadoop. Were Oracle to buy Cloudera, another provider could easily move in to fill the vacuum. The same would happen to Cloudera if, as a prelude to such a deal, it began forking from the Apache project with its own proprietary add-ons or substitutions.

OEM deals are a major stage of building the market. Cloudera has used its first-mover advantage with Hadoop well, with deals with Dell, and now Oracle. Microsoft in turn has decided to keep the “competition” honest by signing up Hortonworks to (eventually) deliver the Hadoop engine for Azure.

OEM deals are important for attaining another key goal in developing the Hadoop market: defining the core stack – as we’ve ranted about previously. Just as Linux took off once a robust kernel was defined, the script will be identical for Hadoop. With IBM and EMC/MapR forking the Apache stack at the core file system level, and with niche providers like Hadapt offering replacements for HBase and Hive, there is growing variability in the Hadoop stack. However, to develop the third party ecosystem that will be vital to the development of Hadoop, a common target (and APIs for where the forks occur) must emerge. A year from now, the outlines of the market’s decision on what makes Hadoop Hadoop will become clear.

The final piece of the trifecta will be commitments from the Accentures and Deloittes of the world to develop practices based on specific Hadoop platforms. For now they are still keeping their cards close to their vests.

What will Hadoop be when it grows up?

Hadoop World was sold out and it seemed like “For Hire” signs were all over the place – or at least that’s what it said on the slides at the end of many of the presentations. “We’re hiring, and we’re paying 10% more than the other guys,” declared a member of the office of the CIO at JPMorgan Chase in a conference keynote. Not to mention predictions that there’s big money in big data. Or that Accel Partners announced a new $100 million venture fund for big data startups; Cloudera scored $40 million in Series D funding; and rival Hortonworks previously secured $20 million in its Series A.

These are heady days. For some like Matt Asay it’s time to voice a word of caution for all the venture money pouring into Hadoop: Is the field bloating with more venture dollars than it can swallow?

The resemblance to Java 1999 was more than coincidental; like Java during the dot com bubble, Hadoop is a relatively new web-related technology undergoing its first wave of commercialization ahead of the buildup of the necessary skills base. We haven’t seen such a greenfield opportunity in the IT space in over a decade. And so the mood at the conference became a bit heady – where else in the IT world today is the job scene a seller’s market?

Hadoop has come a long way in the past year. A poll of conference attendees showed at least 200 petabytes under management. And while Cloudera has had a decent logo slide of partners for a while, it is no longer the lonely voice in the wilderness for delivering commercial distributions and enterprise support of Hadoop. Within this calendar year alone, Cloudera has finally drawn the competition to legitimize Hadoop as a commercial market. You’ve got the household names from data management and storage – IBM, Oracle, EMC, Microsoft, and Teradata – jumping in.

Savor the moment. Because the laws of supply and demand are going to rectify the skills shortage in Hadoop and MapReduce and the market is going to become more “normal.” Colleagues like Forrester’s Jim Kobielus predict Hadoop is going to enter the enterprise data warehousing mainstream; he’s also gone on record that interactive and near real-time Hadoop analytics are not far off.

Nonetheless, Hadoop is not going to be the end-all; with the learning curve, we’ll understand the use cases where Hadoop fits and where it doesn’t.

But before we declare victory and go home, we’ve got to get a better handle of what Hadoop is and what it can and should do. In some respects, Hadoop is undergoing a natural evolution that happens with any successful open source technology: there are always questions over what is the kernel and where vendors can differentiate.

Let’s start with the Apache Hadoop stack, which increasingly resembles a huge brick wall where things are arbitrarily stacked atop one another with no apparent order, sequence, or interrelationship. Hadoop is not a single technology or open source project but – depending on your perspective – an ecosystem or a tangled jumble of projects. We won’t bore you with the full list here, but Apache projects are proliferating. That’s great if you’re an open source contributor, as it provides lots of outlets for innovation, but if you’re at the consuming end in enterprise IT, the last thing you want is to have to maintain a live scorecard on what’s hot and what’s not.

Compounding the situation, there is still plenty of experimentation going on. Like most open source technologies that get commercialized, there is the question of where the open source kernel leaves off and vendor differentiation picks up. For instance, MapR and IBM each believe it is in the file system, with both having their own answers to the inadequacies of the core Hadoop file system (HDFS).

But enterprises need an answer. They need to know what makes Hadoop, Hadoop. Knowing that is critical, not only for comparing vendor implementations, but also for software compatibility. Over the coming year, we expect others to follow Karmasphere and create development tooling, and we also expect new and existing analytic applications to craft solutions targeted at Hadoop. If that’s the case, we had better know where to insist on compatibility. Defining Hadoop the way that Supreme Court justice Potter Stewart defined pornography (“I know it when I see it”) just won’t cut it.

Of course, Apache is the last place to expect clarity, as that’s not its mission. The Apache Foundation is a meritocracy. Its job is not to pick winners, although it will step aside once the market pulls the plug, as it did when it mothballed Project Harmony. That’s where the vendors come in – they package the distributions and define what they support. What’s needed is not an intimidating huge rectangle showing a profile, but instead a concentric circle diagram. For instance, you’d think that the file system would be sacred to Hadoop, but if not, what are the core building blocks or kernel of Hadoop? Put that at the center of the circle and color it a dark red, blue, or the most convincing shade of elephant yellow. Everything else surrounds the core and is colored pale. We call upon the Clouderas, Hortonworks, IBMs, EMCs et al to step up to the plate and define Hadoop.

Then there’s the question of what Hadoop does. We know what it’s done traditionally. It’s a large distributed file system used for offline – a.k.a. batch – analytic runs grinding through ridiculous amounts of data. Hadoop literally chops huge problems down to size thanks to a lot of things: it has a simple file structure and brings computation directly to the data; it leverages cheap commodity hardware; it supports scaled-out clustering; it has a highly distributed and replicated architecture; and it uses the MapReduce pattern for dividing and pipelining jobs into lots of concurrent threads, then mapping them back to unity.
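The MapReduce pattern itself can be sketched in a few lines; this is a toy single-process illustration of the map, shuffle, and reduce phases applied to the canonical word-count job, not Hadoop’s actual Java API:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: collapse each key's list of values back to a single result."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big clusters", "big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 3, 'data': 2, 'clusters': 1}
```

In a real cluster, the map and reduce functions run as many concurrent tasks on the nodes that already hold the data; the shuffle is the network-heavy step in between.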

But we also caught a presentation from Facebook’s Jonathan Gray on how Hadoop and its HBase column store were adapted to real-time operation for several core applications at Facebook, such as its unified messaging system – the polar opposite of a batch application. In summary, there were a number of brute-force workarounds to make Hadoop and HBase more performant, such as extreme denormalization of data; heavy reliance on smart caching; use of inverted indexes that point to the physical location of data; and so on. There’s little doubt that Hadoop won’t become a mainstream enterprise analytic platform until performance bottlenecks are addressed. Not surprisingly, the HBase Apache project is targeting interactivity as one of its top development goals.
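The inverted-index workaround is easy to picture with a hypothetical miniature (the message texts and IDs here are invented for illustration): instead of scanning every record at query time, you precompute a map from each term to the locations that contain it.

```python
from collections import defaultdict

def build_inverted_index(messages):
    """Map each term to the set of message IDs containing it, so a lookup
    goes straight to the relevant records instead of scanning everything."""
    index = defaultdict(set)
    for msg_id, text in messages.items():
        for term in text.lower().split():
            index[term].add(msg_id)
    return index

messages = {
    1: "meeting moved to Friday",
    2: "Friday deadline for the report",
    3: "report attached",
}
index = build_inverted_index(messages)
print(sorted(index["friday"]))  # [1, 2]
print(sorted(index["report"]))  # [2, 3]
```

The trade-off is classic: the index costs extra storage and must be maintained on every write, which is exactly the kind of denormalization a batch-oriented store would never bother with.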

Conversely, we also heard lots of mention of the potential for Hadoop to function as an online alternative to offline archiving. That’s fed by an architectural design assumption that Big Data analytic data stores allow organizations to analyze all the data, not just a sample of it. Organizations like Yahoo have demonstrated dramatic increases in click-through rates from using Hadoop to dissect all user interactions, instead of using MySQL or another relational data warehouse that can only analyze a sampling. And the Yahoos and Googles of the world currently have no plan to archive their data – they will just keep scaling their Hadoop clusters out and distributing them. Facebook’s messaging system – which was used for rolling out real-time Hadoop – is also designed with the use case that old data will not be archived.

The challenge is that the same Hadoop cannot be all things to all people. Optimizing the same data store for interactive use and online archiving is like violating the laws of gravity – either you make the storage cheap or you make it fast. Maybe there will be different flavors of Hadoop, as data in most organizations outside the Googles, Yahoos, or Facebooks of the world is more mortal – as are the data center budgets.

Admittedly, there is an emerging trend to brute-force the design of databases for mixed workloads – that’s the design pattern behind Oracle’s Exadata. But even Oracle’s Exadata strategy has limitations, in that its design will be overkill for small to midsize organizations; that is exactly why Oracle came out with the Oracle Database Appliance. Same engine, but optimized differently. As few organizations will have Google’s IT budget, Hadoop will also have to have personas – one size won’t fit all. And the Hadoop community – Apache and vendor alike – has got to decide what Hadoop’s going to be when it grows up.



From Big to Bigger Data: First Thoughts from Teradata Influencer Summit

It’s kind of ironic that Teradata, which actually invented the big data warehouse, is being grilled about its big data strategy. Hold that thought.

The crux of the first day of Teradata’s Third Party Influencers conference, a kind of Vegas summer camp for selected partners and analysts, was about how Teradata is expanding its footprint as it competes as an independent in an avenue of giants.

As part of the tour, we were given a nostalgic glimpse at a 1998–99 vintage slide showing Teradata’s definition of an Enterprise Data Warehouse; it’s the definition of the classic galactic enterprise storehouse that never really became the single repository of all things analytic over the years. But for organizations like Wal-Mart or eBay, it provided the core resource for the big analytic problems that such businesses require.

Teradata has recalibrated this vision to an “Integrated Data Warehouse,” a more realistic notion in a world that has become so interconnected that it’s ridiculous to think you can centralize wisdom in a single place. Instead, the idea is to think beyond single purpose data warehouses – not necessarily to consolidate every departmental data mart in sight, but to put together places where you might have several intersecting fonts of wisdom. For instance, in a consumer products company, you might want to stage a warehouse that covers customer and product data, because there are going to be synergies when you start doing analytics to segment your customer base, since product preferences may add richness to the demographics.

In the past year, Teradata has done a couple of acquisitions that could reshape its course going forward. The acquisition of Aprimo, an integrated marketing campaign management provider that competes with IBM’s recently acquired Unica, places Teradata in the applications space, although – like IBM – it still positions itself as not being in the applications business. Sure, Aprimo gives Teradata a chance to sell an additional product to consumer product companies, but today’s session provided little insight into the long-term synergies that it will provide to the mother ship.

As to the applications issue, well, that’s a natural issue that any vendor in the middle or data tiers has got to confront because (1) the enterprise software market continues to consolidate, and vendors can’t stand still when it comes to growing their footprint and (2) the natural direction to embed more logic in the middle or data tiers will thrust otherwise agnostic software vendors into the apps space whether they consciously intend to get there or not.

In Teradata’s case, it’s been gradually heading in this direction for years with its vertical industry data models, so at some point, as the company strives to aim higher up the value chain, it has to add more logic that could be construed as applications. The same goes for IBM with its vertical industry oriented middleware frameworks.

But ironically what drew the spotlight was the plan for Teradata and its other acquisition, Aster Data. Ironic because it wasn’t even on the official program today – Tasso Argyros, who co-founded Aster Data, won’t be speaking until tomorrow. It prompted questions from the peanut gallery as to how Teradata was going to get into the big data market, which prompted Teradata to throw the challenge back to us cynical questioners: how would we define big data? “I hate [the term] big data,” stated Randy Lea, VP of product marketing and management, as the term has become one of those buzzwords that means all things to all people.

The irony, of course, is that Teradata’s heritage was having a platform that could house bigger data warehouses; it essentially invented the original Big Data market 30 years ago, when Big Data was measured in megabytes. But there is a different vibe to big data today, not only in volume, but in the variety of forms – and, some say, the velocity at which it comes in. We’d add that it also has a different vibe when it comes to governance, whether that means archiving or dealing with privacy and confidentiality over data that was theoretically made public in a social network, but not necessarily in the context of a marketing database maintained by a third party.

Although parts of the briefing veered into non-disclosure territory, we still left the day with confirmation of our existing belief that in the long run there will be convergence of traditional SQL data warehouse platforms with the new Advanced SQL technologies associated with MapReduce and other capabilities that allow them to process ridiculous amounts of data, fast. We also believe that there will not only be convergence between SQL and MapReduce (already happening and public with many vendors), but also with the principles of NoSQL data stores. From that standpoint, it was quite interesting that almost every third question from the audience was in some way related to: what will Teradata do with Aster Data?

Hadoop Ecosystem Starts Crystallizing

What a difference a year makes. A year ago, Big Data was an abstract concept left to the domain of a bunch of niche players and open source groups. Over the next 9 months, the Advanced SQL space dramatically consolidated as EMC, IBM, HP, and Teradata made their moves. In the past 3 months, it’s been Hadoop’s turn.

We’ve seen Yahoo flirt with the idea of setting up its response to Cloudera and IBM with its own Hadoop support company, while EMC announced ambitious but ambiguous plans to – choose your term – extend or fork Hadoop. After a series of increasingly vocal hints, IBM has placed its cards on the table, while Informatica has fleshed out its plans for civilizing NoSQL data.

IBM’s InfoSphere BigInsights productizes what IBM has been talking about for months and vocalized at its BigData analyst summit held at its Yorktown lab (yup, the place where Watson played Jeopardy). IBM is offering the core freebie, which includes a distribution of Hadoop (with the HDFS file system and MapReduce) plus integration to DB2; paid support; and an enterprise edition that adds indexing, integrated text analytics, access control security features, the requisite administrative console, and a development studio based around Jaql, a SQL-like query language developed by Google that takes elements of Hive and Pig and targets JSON (the data objects of JavaScript).

In contrast to EMC, which hedged its words on whether it would support Apache Hadoop, IBM came down clearly on the side of aligning its effort with the Apache projects. We shouldn’t be surprised, as IBM gave Yahoo’s VP of Hadoop development, Eric Baldeschwieler, a soapbox at its analyst event to plead for Hadoop not to be forked into competing technology implementations.

Informatica in turn fleshed out its big data support, which was the highlight of the 9.1 platform release being announced today. While Informatica already provided the ability to extract data from Hadoop for ETL to SQL data warehouses, the 9.1 release adds new adapters for the social networks LinkedIn, Twitter, and Facebook, plus new capabilities to connect to call detail records and image files as part of its B2B unstructured data exchange offering. More importantly, whereas before Informatica PowerCenter could extract data from Hadoop, now it can feed data back in, providing another path for tapping the power of MapReduce for workloads that might not otherwise be easily supported in your relational data warehouse.

This is the start of the taming of so-called “unstructured” data that populates NoSQL; in actuality, most of this data has structure, much of which has yet to be defined. Informatica’s release of social network adapters targets the lowest hanging fruit, as social media sentiment analysis has become one of the most popular use cases for building data warehouses on steroids. It couples well with text analytics, which was one of the BI market’s first forays outside the transaction world. But there are many other NoSQL data types awaiting some form of structural definition, such as sensor, graph, or rich media metadata (some of this could leverage text parsing capabilities).

It’s still early days for commercialization of tooling for big data; while 2010 was the year that major database and platform players discovered Advanced SQL, 2011 is the point where they began directing attention at NoSQL. You can see that on the Advanced SQL side as the use cases are pouring out. For NoSQL, and more specifically Hadoop, commercialization moves are just the first steps, as Jim Kobielus points out.

Hadoop itself is a fairly complex ecosystem of Apache projects; saying that you support Hadoop is not the same as that for Linux because it lacks Linux’s singular nature. And different pieces of Hadoop are interchangeable: for instance, you can swap out its HBase table system for Cassandra or Cloudbase if you want something more interactive.

For now there is an infatuation with Hadoop, but work remains to be done for vendors to lift the burden off customers of integrating the disparate pieces.

Furthermore, the technology use cases are only starting to be fleshed out for what to use where. Inevitably this will lend itself to a solutions approach, rather than a raw database tools approach, for the more popular use cases such as instant or long-term social activity graph analysis for marketing, civil infrastructure management, telco churn management, and so on. Furthermore, the bigness of big data means that you might want to attack certain tasks differently; for instance, once the data is at rest, you don’t want to move it. Data governance in the NoSQL environment is still a blank slate waiting to be filled with best practices, not to mention tooling support. For instance, while Facebook data might be available by public API, will having access to that data trigger any customer privacy issues? Also, while Hadoop’s file system provides relatively low cost storage when measured per terabyte, at some point there will be a need to profile, cleanse, compress, and eventually deprecate that data. Again, more white space for tooling and best practices.

IBM’s embrace of what otherwise appears to be an obscure query language is yet another indicator that, aside from general “brand” awareness of Hadoop and MapReduce (which is a framework, not a language or technology), the target market of enterprise developers remains in learning mode and as yet lacks the knowledge to choose the right tools for the job.