Category Archives: Big Data

Greenplum ramps up competition in the Hadoop space

Life’s getting more interesting around the Hadoop world – until now, if you were looking for commercial support, Cloudera was the only game in town. Barely a couple of weeks back, Yahoo – which invented the technology – began making noises about a possible commercial spinoff to go up against Cloudera. That came, ironically, after Yahoo decided to drop its own Hadoop distribution. Go figure.

But the point today is that EMC Greenplum has decided to dive in as it packages Hadoop to run natively within the Greenplum Advanced SQL analytic database. This is a departure from Greenplum’s previous agreement to interface with the Cloudera edition of Hadoop, concluded last summer before EMC acquired Greenplum. This will be EMC’s own distribution, incorporating modifications from Facebook to address potential single points of failure such as the name node and job tracker.

More to the point, it adds to the variety of choices that are becoming available with Hadoop – which is essentially a grab bag of technologies that include file systems, column-oriented table structures, data warehousing and transformation query languages, parallel computing frameworks, serialization, workload coordination, and so on. While Hadoop is known as a place for storing lots of data but not known for its speed, there are offshoots providing more interactive capabilities. For instance, you can use Hadoop but substitute Cassandra or Cloudbase for the HDFS file system. Or you can add relational nodes, as Hadapt is trying.

If you’re confused, join the crowd. These are early days where innovation is raw, and multiple approaches to managing all the data that doesn’t neatly fit in a SQL database are just emerging.

At the end of the day, it’s about solving analytic problems for the business, not about analyzing specific kinds of data. For instance, you may wish to marry the transactional interactions with customers stored by your CRM system with the things that they are saying about you on Facebook – and in turn – you’ll probably want to know where they’re getting their ideas from. The idea that EMC Greenplum is pushing is to use the same platform, but run different parts of the analytic question on the appropriate data store.

From a market development standpoint, we’re now at the second inflection point in the Big Data tooling market. The first was the rapid wave of consolidation that hit the more familiar Advanced SQL analytic portion of the market – within the last 8 to 9 months alone, EMC (Greenplum), IBM (Netezza), HP (Vertica), and more recently, Teradata (Aster Data) made acquisitions in this space. While Advanced SQL is in a phase of consolidation, just the opposite is happening with Hadoop, or more broadly, the NoSQL space at large. It’s a period where there is now a competition of raw ideas and also the beginnings of a convergence between SQL and NoSQL.

The latter is what EMC Greenplum’s move is all about. By repackaging Hadoop, EMC Greenplum is helping to civilize it – for Greenplum customers, anyway. Greenplum is placing it under its own management umbrella – and, this being EMC, obviously adding APIs for plugging in storage. Additionally, it is leveraging its own internal high-speed, low-latency interconnects, and providing a certified stack for what would otherwise be an unwieldy grab bag of Apache and other open source projects. It’s also part of a longer-term trend of addressing the skills gap with MapReduce and Hadoop – just as Java developers were hard to find in 1999, the same is true of Hadoop and MapReduce developers today. In part, the laws of supply and demand will resolve that, but in the long run, the NoSQL world (which many take to mean “Not only SQL”) is going to get managed by many of the same tools that DBAs and software developers already know.

If you’re an enterprise customer, moves like EMC Greenplum’s make it safe for you to start piloting. It gives you a view of what will be the end game in the convergence of the SQL world with NoSQL. But keep in mind that as a technology stack, Hadoop is still very much a moving target.

Yahoo to Hadoop: Show Me the Money

While there is relatively little to knock cloud from its hype perch, among web startups, BI and data geeks, the emergence of Big Data has become a game changer. It’s analytics and operational intelligence gone extreme.

Big Data typically is associated with obscene amounts of data – the scale blows away anything that most enterprises would maintain within their core back end business systems. We’re talking hundreds of terabytes or even petabytes.

Today, Yahoo announced that it might take the business around its best-known Big Data brainchild, Hadoop, and spin it off into a new entity.

So why are we having this conversation?

It’s because Internet giants Google, Yahoo, Facebook, Amazon, and others had to roll their own technologies to deal with magnitudes of data far beyond conventional wisdom of what was possible with enterprise systems. What makes the conversation interesting is that this technology is on the cusp of entering the enterprise mainstream today. It’s not just a matter of technology looking for a problem. When Facebook needs to understand how its 500 million members update their walls, share photographs, and have conversations, it’s because (1) it needs to optimize its IT infrastructure to support how its members use the site, but more importantly (2) it needs to understand more about its members so it can sell advertising.

And when Facebook makes its API publicly available, that same issue becomes critical for any B2C marketer. And as the technology becomes available, suddenly there are downstream uses: capital markets conducting brute-force analyses of trading positions, healthcare providers understanding outcomes, homeland security controlling borders, metropolitan entities managing congestion pricing, life sciences organizations deciphering clinical studies, mobile carriers seeking to prevent or minimize customer churn, and so on.

There are a couple technology and market paths that have opened for contending with Big Data. There are Advanced SQL analytic database providers that have adapted SQL for structured data through strategies such as reducing indexing, introducing new forms of data compression and query optimization, columnar architectures, and embedding analytics and data transformation directly into the data engine to minimize data movement; in some cases, they have developed optimized appliances. We’re talking about the Aster Datas, Greenplums, Netezzas, ParAccels, and Verticas of the world – and players like Teradata that invented big data warehousing, Oracle that has extended it, and Sybase which acquired the first column-oriented database. Business has obviously picked up here; IBM, EMC, Teradata, and HP have all made acquisitions in this space over the past 12 months.

But the Facebooks and Googles of the world weren’t dealing with structured data in the enterprise sense – they are contending with web log files, document APIs, rich media files, and so on. They are dealing with data whose structure and volume are so varied and huge that there is no time to model it and form a schema; they need to just load the data into the file system and then analyze it. That spawned the NoSQL movement – initially a focus on technologies that avoided the overhead and scalability limits of SQL.
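The “load first, analyze later” idea – sometimes called schema-on-read – can be sketched in a few lines of plain Python. The log format and field names below are invented for illustration; the point is that structure is imposed at analysis time, not at load time.

```python
# Illustrative schema-on-read sketch: raw log lines are stored as-is,
# and a schema is applied only when a question is asked of the data.

raw_logs = [
    "2011-04-01T10:00:00 GET /products/42 200",
    "2011-04-01T10:00:01 GET /cart 200",
    "2011-04-01T10:00:02 POST /checkout 500",
]

def parse(line):
    # The schema lives in the analysis code, not in the storage layer
    ts, method, path, status = line.split()
    return {"ts": ts, "method": method, "path": path, "status": int(status)}

# Ad hoc question: how many server errors did we log?
errors = [r for r in map(parse, raw_logs) if r["status"] >= 500]
print(len(errors))
```

A different question tomorrow simply gets a different `parse` function; nothing about the stored data has to change.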

Until now, none of Google, Yahoo, or Facebook considered themselves to be in the tools or database business. So they released the fruits of their innovation as open source, one of the best-known projects being Apache Hadoop. Hadoop is a family of projects that includes a distributed file system, the MapReduce framework that parcels out massively parallel computing jobs across a cluster, plus a number of other frameworks, file systems, and utilities.
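The MapReduce pattern at the heart of Hadoop can be sketched without Hadoop itself: a map phase emits key/value pairs, a shuffle step groups them by key, and a reduce phase aggregates each group. This is a toy single-process illustration of the programming model, not Hadoop’s actual API – in the real framework, each phase is distributed across a cluster.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the document
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each group independently (here, a simple sum)
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big ideas", "data everywhere"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 2, 'ideas': 1, 'everywhere': 1}
```

Because each map call and each reduce group is independent, the framework can run them in parallel on many machines – which is the whole trick.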

What’s kind of fascinating is the almost incestuous relationship between these NoSQL projects. Hadoop, developed at Yahoo, descended from the Google File System, which in turn was developed for Google BigTable; much the same lineage holds for Cassandra, another NoSQL data store. Meanwhile, Facebook developed Hive, a relational-like table structure designed to work with Hadoop. You get the picture.

Cloudera has stepped to the forefront in commercializing Hadoop technology and applying MapReduce. Using a Red Hat-like business model, it offers support, several open source extensions, plus an enterprise edition that adds a number of proprietary monitoring and management features. It has distinguished itself by forging partnerships with almost every major BI and data warehousing player except one – IBM. The highlights are its relationships with Informatica, for data transformation, and MicroStrategy, which provides a data mart strategy designed to complement Hadoop. And it has garnered roughly 75 paying enterprise customers in a market segment that has barely commercialized.

In the long run, we also expect IBM to make a stab at Hadoop and related technologies by extending its InfoSphere offerings – it can see Cloudera-Informatica and Cloudera-MicroStrategy and raise them one with its own InfoSphere DataStage and Cognos offerings, before it even talks about partnerships. Today we saw a shot from left field: Yahoo – which invented the technology – is now saying it might spin off its Hadoop business to go up against Cloudera, and potentially IBM. In a way, it’s closing the barn door after the horses have left, as the creator of Hadoop is now part of Cloudera.

Clearly there will be a market for NoSQL technologies in the quest for Big Data, although for now, they require sufficiently specialized skills that they are not for the faint of heart. That is, if you can find any Hadoop and MapReduce programmers who haven’t already been scarfed up by Amazon, Zynga, or JP Morgan Chase. That market will not necessarily be in competition with Advanced SQL, as there are different use cases for each. And in fact, there will likely be a blending of the technologies in the long run. Today, many Advanced SQL platforms are already extending support for MapReduce, and in the long run, we expect that SQL-like technologies in the NoSQL space, like Hive or HBase, will themselves be made more accessible to the huge base of SQL developers.

But we digress.

For Yahoo, this would clearly be a shot out of its comfort zone, as it is not a tools company. But it is hungry for monetizing its intellectual property, even if that property has already been open sourced. It’s redolent of Sun striving to monetize Java and we all know how that went. Obviously this will be an uphill battle for Yahoo, but at least this would be a spinoff so hopefully there won’t be distractions from the mother ship. Given Yahoo’s fortunes, we shouldn’t be surprised that they are now looking to maximize what they can get out of the family jewels.

Big Data analytics in the cloud could be HP’s enterprise trump card

Unfortunately, scheduling conflicts have kept us from attending Leo Apotheker’s keynote today before the HP Analyst Summit in San Francisco. But yesterday, he tipped his cards for his new software vision for HP before a group of investment analysts. HP’s software focus is not to reinvent the wheel – at least where it comes to enterprise apps. Apotheker had to put to rest any notion that he’s about to stage a grudge match and buy the company that dismissed him. There is already plenty of coverage here, including interesting comment from Tom Foremski (we agree with him about SAP being a non-starter), and from the Software Advice guys, who are conducting a poll.

To some extent this has been little surprise with HP’s already stated plans for WebOS and its recently announced acquisition of Vertica. We do have one question though: what happened to Converged Infrastructure?

For now, we’re not revisiting the acquisitions stakes, although if you follow #HPSummit twitter tags today, you’ll probably see lots of ideas floating around today after 9am Pacific time. We’ll instead focus on the kind of company HP wants to be, based on its stated objectives.

1. Develop a portfolio of cloud services from infrastructure to platform services and run the industry’s first open cloud marketplace that will combine a secure, scalable and trusted consumer app store and an enterprise application and services catalog.

This hits two points on the checklist: providing a natural market for all those PCs that HP sells, and venturing higher up the food chain than just selling lots of iron. That certainly makes sense. The last part is where we have a question: offering cloud services to consumers, the enterprise, and developers sounds at first blush like HP wants its cloud to be all things to all people.

The good news is that HP has a start on the developer side where it has been offering performance testing services for years – but is now catching up to providers like CollabNet (with which it is aligned and would make a logical acquisition candidate) and Rally in offering higher value planning services for the app lifecycle.

In the other areas – consumer apps and enterprise apps – HP is starting from square one. It obviously must separate the two, as cloud is just about the only thing that the two have in common.

For the consumer side, HP (like Google Android and everyone else) is playing catchup to Apple. It is not simply a matter of building it and expecting they will come. Apple has built an entire ecosystem around its iOS platform that has penetrated content and retail – challenging Amazon, not just Salesforce or a would-be HP, using its user experience as the basis for building a market for an audience that is dying to be captive. For its part, HP hopes to build WebOS to have the same “Wow!” factor as the iPhone/iPad experience. It’s got a huge uphill battle on its hands.

For the enterprise, it’s a more wide open space where only Salesforce’s AppExchange has made any meaningful mark. Again, the key is a unifying ecosystem, with the most likely outlet being enterprise outsourcing customers for HP’s Enterprise Services (the former EDS operation). The key principle is that when you build a marketplace, you have to identify who your customers are and give them a reason to visit. A key challenge, as we’ve stated in our day job, is that enterprise apps are not the enterprise equivalent of those $2.99 apps that you’ll see in the AppStore. The experience at Salesforce – the classic inversion of the long tail – is that the market is primarily for add-ons to the Salesforce.com CRM application or use of the Force.com development platform, but that most entries simply get buried deep down the list.

Enterprise apps marketplaces are not simply going to provide a cheaper channel for solutions that still require consultative sells. We’ve suggested that they adhere more to the user group model, which also includes forums, chats, exchanges of ideas, and by the way, places to get utilities that can make enterprise software programs more useful. Enterprise app stores are not an end in themselves, but a means for reinforcing a community — whether it be for a core enterprise app – or for HP, more likely, for the community of outsourcing customers that it already has.

2. Build webOS into a leading connectivity platform.
HP clearly hopes to replicate Apple’s success with iOS here – the key being that it wants to extend the next-generation Palm platform to its base of PCs and other devices. This one’s truly a Hail Mary pass designed to rescue the Palm platform from irrelevance in a market where iOS, Android, Adobe Flash, Blackberry, and Microsoft Windows 7/Silverlight are battling it out. Admittedly, mobile developers have always tolerated fragmentation as a fact of life in this space – but of course that was when the stakes (with feature phones) were rather modest. With smart devices – in all their varied form factors from phone to tablet – becoming the next major consumer (and to some extent, enterprise) frontier, there’s a fresh battle for mindshare. That mindshare will be built on the size of the third-party app ecosystem that these platforms attract.

As Palm was always more an enterprise than a consumer platform – before the Blackberry eclipsed it – HP’s likely WebOS venue will be the enterprise space. That means another uphill battle with Microsoft (which has the office apps), Blackberry (with its substantial corporate email base), and yes, Apple, where enterprise users are increasingly sneaking iPhones in the back door, just as they did with PCs 25 years ago.

3. Build presence with Big Data
Like (1), this also checks a key box for where to sell all those HP PCs. HP has had a half-hearted presence via the discontinued Neoview business. The Vertica acquisition was clearly the first one that had Apotheker’s stamp on it. Of HP’s announced strategies, this is the one that aligns most closely with the enterprise software strategy that we’ve all expected Apotheker to champion. Obviously Vertica is the first step here – and there are many logical acquisitions that could fill this out, as we’ve noted previously regarding Tibco, Informatica, and Teradata. The important point is that classic business intelligence never really suffered through the recession, and arguably, Big Data is becoming the next frontier for BI – not just a nice-to-have, but increasingly an expected cost of competition.

What’s interesting so far is that in all the talk about Big Data, there’s been relatively scant attention paid to utilizing the cloud to provide the scaling to conduct such analytics. We foresee a market where organizations that don’t necessarily want to buy all that hardware, and that run large advanced analytics on an event-driven basis, consume the cloud for their Hadoop – or Vertica – runs. Big Data analytics in the cloud could be HP’s enterprise trump card.

The Second Wave of Analytics

Throughout the recession, business intelligence (BI) was one of the few growth markets in IT. Transactional systems that report “what” is happening are simply the price of entry for remaining in a market; BI and analytics systems answer the question of “why” something is happening and, ideally, provide actionable intelligence so you know “how” to respond. Not surprisingly, understanding the whys and hows is essential for maximizing the top line in growing markets, and for pinpointing the path to survival in down ones – the latter being why BI kept growing in the IT and business applications space through the recession.

Analytic databases are cool again. Teradata, the analytic database provider with a 30-year track record, had its strongest Q2 in what was otherwise a lousy 2010 for most IT vendors. Over the past year, IBM, SAP, and EMC made major acquisitions in this space, while some of the loudest decibels at this year’s Oracle OpenWorld were over the Exadata optimized database machine. There are a number of upstarts with significant venture funding – Vertica, Cloudera, Aster Data, ParAccel, and others – that are not only charting solid growth, but whose varied approaches reveal that the market is far from mature and that there remains plenty of demand for innovation.

We are seeing today a second wave of innovation in BI and analytics that matches the ferment and intensity of the 1995-96 era, when data warehousing and analytic reporting went commercial. There isn’t any one thing that is driving BI innovation. At one end of the spectrum, you have Big Data, and at the other end, Fast Data – the actualization of real-time business intelligence. Advances in commodity hardware, memory density, and parallel programming models, along with the emergence of NoSQL, open source statistical programming languages, and the cloud, are bringing this all within reach. There is more and more data everywhere that’s begging to be sliced, diced, and analyzed.

The amount of data being generated is mushrooming, but much of it will not necessarily be persisted to storage. For instance, if you’re a power company instituting a smart grid, moving from monthly to daily meter reads multiplies your data volume by a factor of 30, and if you decide to take readings every 15 minutes, multiply all that again by a factor of roughly 100. Much of this data will be consumed as events. Even where it is persisted, traditional relational models won’t handle the load – not only because of the overhead of operating all that iron, but because of the concurrent need for additional equipment, space, HVAC, and power.
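The smart-meter arithmetic is easy to check; a few lines make the multipliers explicit (assuming a 30-day month for simplicity):

```python
# Back-of-the-envelope check of the smart-meter read volumes cited above.
reads_per_month_monthly = 1
reads_per_month_daily = 30           # one read per day
reads_per_month_15min = 4 * 24 * 30  # four reads per hour, every day: 2,880

print(reads_per_month_daily / reads_per_month_monthly)  # 30x over monthly reads
print(reads_per_month_15min / reads_per_month_daily)    # 96x over daily reads
```

The exact daily-to-15-minute multiplier is 96, which the post rounds to “a factor of 100” – the order of magnitude is the point.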

Unlike the past, when the biggest databases were maintained inside the walls of research institutions, public sector agencies, or within large telcos or banks, today many of the largest data stores on the Internet are getting opened through APIs, such as from Facebook. Big databases are no longer restricted to use by big companies.

Compare that to the 1995-96 period, when relational databases, which made enterprise data accessible, reached critical mass adoption; rich Windows clients, which put powerful apps on the desktop, became the enterprise standard; and new approaches emerged for optimizing data storage and productizing the kind of enterprise reporting pioneered by Information Builders. With it all came the debates: OLAP (or MOLAP) vs. ROLAP, star vs. snowflake schema, and ad hoc vs. standard reporting. Ever since, BI has become ingrained with enterprise applications, as reflected by the consolidation that saw Cognos, Business Objects, and Hyperion acquired by IBM, SAP, and Oracle, respectively. How much more establishment can you get?

What’s old is new again. When SQL relational databases emerged in the 1980s, conventional wisdom was that the need for indexes and related features would limit their ability to perform or scale to support enterprise transactional systems. Moore’s Law and the emergence of client/server made a mockery of that argument – until the web, the proliferation of XML data, smart sensory devices, and the realization that unstructured data contained valuable morsels of market and process intelligence in turn made a mockery of the argument that relational was the enterprise database end-state.

In-memory databases are nothing new either, but the same hardware commoditization trends that helped mainstream SQL have also brought the costs of these engines down to earth.

What’s interesting is that there is no single source or style of innovation. Just as 1995 proved a year of discovery and debate over new concepts, today you are seeing a proliferation of approaches ranging from different strategies for massively parallel, shared-nothing architectures; columnar databases; massive networked and hierarchical file systems; and SQL vs. programmatic approaches. It is not simply SQL vs. a single post-SQL model, but variations that mix and match SQL-like programming with various approaches to parallelism, data compression, and use of memory. And don’t forget the application of analytic models to complex event processing for identifying key patterns in long-running events, or combing through streaming data that arrives in torrents too fast and large to ever consider putting into persistent storage.
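For the streaming case, the analysis typically amounts to maintaining a rolling aggregate over a bounded window of recent events rather than querying a stored table. A minimal sketch, with an invented window size and reading values:

```python
from collections import deque

class SlidingAverage:
    """Rolling average over the last `size` readings of a stream.

    Only the window is kept in memory; nothing is written to
    persistent storage, which is the point for torrent-speed data.
    """
    def __init__(self, size):
        self.window = deque(maxlen=size)  # old readings fall off automatically

    def observe(self, value):
        self.window.append(value)
        return sum(self.window) / len(self.window)

avg = SlidingAverage(size=3)
readings = [10, 20, 30, 100]          # illustrative sensor values
results = [avg.observe(r) for r in readings]
print(results)  # [10.0, 15.0, 20.0, 50.0] - the spike at 100 shows up immediately
```

A real complex event processing engine layers pattern matching and time-based windows on top of this idea, but the memory-bounded window is the common core.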

This time, much of the innovation is coming from the open source world, as evidenced by projects like Hadoop, the Java-based distributed computing platform developed at Yahoo and modeled on Google’s published designs; the MapReduce parallel programming model, which originated at Google; the Hive project, which makes MapReduce look like SQL; and the R statistical programming language. Google has added fuel to the fire by releasing to developers its BigQuery and Prediction API for analyzing large data sets and applying predictive algorithms.

These are not simply technology innovations looking for problems, as use cases for Big Data or real-time analysis are mushrooming. Want to extend your analytics from structured data to blogs, emails, instant messaging, wikis, or sensory data? Want to convene the world’s largest focus group? There’s sentiment analysis to be conducted from Facebook; trending topics for Wikipedia; power distribution optimization for smart grids; or predictive analytics for use cases such as real-time inventory analysis for retail chains, or strategic workforce planning, and so on.

Adding icing to the cake was an excellent talk at a New York Technology Council meeting by Merv Adrian, a 30-year veteran of the data management field (who will soon be joining Gartner), who outlined the content of a comprehensive multi-client study on analytic databases that can be downloaded free from Bitpipe.

Adrian speaks of a generational disruption occurring in the database market, one attacking new forms of an age-old problem: how to deal with expanding datasets while maintaining decent performance. As mundane as that. But the explosion of data, coupled with the commoditization of hardware and increasing bandwidth, has exacerbated matters to the point where we can no longer apply the brute-force approach of tweaking relational architectures. “Most of what we’re doing is figuring out how to deal with the inadequacies of existing systems,” he said, adding that the market and state of knowledge have not yet matured to the point where we’re thinking about how the data management scheme should look logically.

So it’s not surprising that competition has opened wide for new approaches to solving the Big and Fast Data challenges; the market has not yet matured to the point where there are one or a handful of consensus approaches around which to build a critical mass practitioner base. But when Adrian describes the spate of vendor acquisitions over the past year, it’s just a hint of things to come.

Watch this space.