Category Archives: Fast Data

Strata 2015 Post Mortem: Sparking expectations for Smart, Fast Applications

A year ago, Turing award winner Dr. Michael Stonebraker made the point that, when you try managing more than a handful of data sets, manual approaches run out of gas and the machine must step in to help. He was referring to the task of cataloging data sets in the context of capabilities performed by his latest startup, Tamr. If your typical data warehouse or data mart involves three or four data sources, it’s possible to get your head around the idiosyncrasies of each data set and how to integrate them for analytics.

But push that number to dozens, if not hundreds or thousands of data sets, and any human brain is going to hit the wall — maybe literally. And that’s where machine learning first made big data navigable, not just to data scientists, but to business users. Introduced by Paxata, and extended since then by a long tail of startups and household names, these tools applied machine learning to help the user wrangle data through a new kind of iterative process. More recently, analytic tools such as IBM’s Watson Analytics have employed machine learning to help end users perform predictive analytics.

Walking the floor of last week’s Strata Hadoop World in New York, we saw machine learning powering “emergent” approaches to building data warehouses. Infoworks monitors what data end users are targeting for their queries by taking a change data capture-like approach to monitoring logs; but instead of just tracking changes (which is useful for data lineage), it deduces the data model and builds OLAP cubes. Alation, another startup, uses a similar approach for crawling data sets to build catalogs with Google-like PageRanks showing which tables and queries are the most popular. It’s supplemented with a collaboration environment where people add context, and a natural language query capability that browses the catalog.

Just as machine learning is transforming data preparation to help business users navigate their way through big data, it’s also starting to provide the intelligence to help business users become more effective with exploratory analytics. While interactive SQL has been the most competitive battleground for Hadoop providers over the past couple of years — enabling established BI tools to treat Hadoop as simply a larger data warehouse — machine learning will become essential to helping users become productive with exploratory analytics on big data.

What makes machine learning possible within an interactive experience is the emerging Spark compute engine. Spark is what’s turning Hadoop from a Big Data platform into a Fast Data one. By now, every commercial Hadoop distro includes a Spark implementation, although which Spark engines are included (e.g., SQL, Streaming, Machine Learning, and Graph) still varies by vendor. A few months back, IBM declared it would invest $300 million and dedicate 3500 developers to Spark machine learning product development, followed by Cloudera’s announcement of a One Platform initiative to plug Spark’s gaps.

And so our interest was piqued by Netflix’s Strata session on running Spark at petabyte scale. Among Spark’s weaknesses is that it hasn’t consistently scaled beyond a thousand nodes, and it is not known for high concurrency. Netflix’s data warehouse currently tops out at 20 petabytes and serves roughly 350 users (we presume, technically savvy data scientists and data engineers). Spark is still in its infancy at Netflix; while workloads are growing, they are not at a level that would merit a dedicated cluster (Netflix runs its computing in the Amazon cloud, on S3 storage). Most of the Spark workloads are for streaming, run under YARN. And that leads to a number of issues showing that at high scale and high concurrency, Spark is a work in progress.

Among the issues Netflix is working through to scale Spark is adding caching steps to accelerate the loading of large data sets. Related to that is reducing the latency of retrieving the large metadata sets (“list calls”) that are often associated with large data sets; Netflix is working on an optimization that would apply to Amazon’s S3. Another scaling issue relates to file scanning (Spark normally scans all Hive tables when a query is first run); Netflix has designed a workaround that pushes down predicate processing so queries only scan relevant tables.
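To make those patterns concrete, here’s a minimal PySpark sketch — our own illustration, not Netflix’s code, with hypothetical table and path names — of the two generic techniques involved: caching a large data set after its first (expensive) load, and filtering on a partition column so the optimizer prunes data at planning time instead of listing and scanning everything.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scaling-sketch").getOrCreate()

# Cache a frequently reused data set after the first (expensive) load, so
# repeat queries hit memory instead of re-listing and re-reading S3 objects.
events = spark.read.parquet("s3a://bucket/warehouse/events")  # hypothetical path
events.cache()

# Filter on the partition column ("dateint" here) so the optimizer prunes
# partitions at planning time rather than scanning the whole table.
daily = events.where(events.dateint == 20151001).groupBy("title").count()
daily.show()
```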

For most business users, the issue of Spark scaling won’t be relevant, as their queries are not routinely expected to involve multiple petabytes of data. But for Spark to fulfill its promise of supplanting MapReduce for iterative, complex, data-intensive workloads, scale will prove an essential hurdle. We have little doubt that the sizable Spark community will rise to the task. But the future won’t necessarily be all Spark all the time. Keep your eye out for the Apex streaming project; it’s drawn some key principals previously known for backing Storm.

So is Spark really outgrowing Hadoop?

That’s one of the headlines of a newly released Databricks survey that you should definitely check out. Because Spark only requires a JVM to run, there’s been plenty of debate on whether you really need to run it on Hadoop, or whether Spark will displace it altogether. Technically, you don’t need Hadoop: all Spark requires is a JVM installed on the cluster, or a lightweight cluster manager like Apache Mesos. It’s the familiar argument: why bother with the overhead of installing and running a general-purpose platform if you only have a single-purpose workload?

Actually, there are reasons, if security or data governance are necessary, but hold that thought.

According to the Databricks survey, which polled nearly 1500 respondents online over the summer, nearly half are running Spark standalone, with another 40% running under YARN (Hadoop) and 11% on Mesos. That’s a strong push toward dedicated deployment.

But let’s take a closer look at the numbers. About half the respondents are also running Spark on a public cloud. Admittedly, running in the cloud does not automatically equate with standalone deployment. But there’s more than coincidence in the numbers, given that popular cloud-based Spark services from Databricks, and more recently, Amazon and Google, are (or will be) running in dedicated environments.

And consider what stage we’re at with Spark adoption. Commercial support is barely a couple years old and cloud PaaS offerings are much newer than that. The top 10 sectors using Spark are the classic early adopters of Big Data analytics (and, ironically in this case, Hadoop): Software, web, and mobile technology/solutions providers. So the question is whether the trend will continue as Spark adoption breaks into mainstream IT, and as Spark is embedded into commercial analytic tools and data management/data wrangling tools (which it already is).

This is not to say that running Spark standalone will become just a footnote to history. If you’re experimenting with new analytic workloads – like testing another clustering or machine learning algorithm – dedicated sandboxes are great places to run those proofs of concept. If you have specific types of workloads, there have long been good business and technology cases for running them on the right infrastructure; if you’re running a compute-intensive workload, for instance, you’ll probably want to run it on servers or clusters that are compute- rather than storage-heavy. And if you’re running real-time, operational analytics, you’ll want hardware that has heavily bulked up on memory.

Hardware providers like Teradata, Oracle, and IBM have long offered workload-optimized machines, while cloud providers like Amazon offer arrays of different compute and storage instances that clients can choose for deployment. There’s no reason why Spark should be any different – and that’s why there’s an expanding marketplace of PaaS providers that are offering Spark-optimized environments.

But if dedicated Spark deployment is to become the norm rather than the exception, it must reinvent the wheel when it comes to security, data protection, lifecycle workflows, data localization, and so on. The Spark open source community is busy addressing many of the same gaps that are currently challenging the Hadoop community (just that the Hadoop project has a two-year head start). But let’s assume that the Spark project dots all the i’s and crosses all the t’s to deliver the robustness expected of any enterprise data platform. As Spark workloads get productionalized, will your organization really want to run them in yet another governance silo?

Note: There are plenty of nuggets in the Databricks survey beyond Hadoop. Recommendation systems, log processing, and business intelligence (an umbrella category) are the most popular uses. The practitioners are mostly data engineers and data scientists – suggesting that adoption is concentrated among those with new generation skills. But while advanced analytics and real-time streaming are viewed by respondents as the most important Spark features, paradoxically, Spark SQL is the most used Spark component. While new bells and whistles are important, at the end of the day, accessibility from and integration with enterprise analytics trump all.

Hadoop and Spark: A Tale of Two Cities

If it seems like we’ve been down this path before, well, maybe we have. June has been a month of juxtapositions, back and forth to the west coast for the Hadoop and Spark Summits. The mood from last week to this has been quite a contrast: Spark Summit has the kind of canned heat that Hadoop conferences had a couple of years back. We won’t stretch the Dickens metaphor.

Yeah, it’s human nature to say, out with the old and in with the new.

But let’s set something straight: Spark ain’t going to replace Hadoop, as we’re talking about apples and oranges. Spark can run on Hadoop, and it can run on other data platforms. What it might replace is MapReduce, if Spark can overcome its scaling hurdles. And it could fulfill IBM’s vision of the next analytic operating system if it addresses mundane – but very important – concerns around scaling, high concurrency, and bulletproof security. Spark originated at UC Berkeley’s AMPLab back in 2009, with its founders later forming Databricks. With roughly 700 contributors, Spark has ballooned into the most active open source project in the Apache community, barely two years after becoming an Apache project.

Spark is best known as an in-memory replacement for MapReduce-style iterative computation; both employ massively parallel compute and then shuffle interim results, with the difference being that Spark caches those results in memory while MapReduce writes them to disk. But that’s just the tip of the iceberg. Spark offers a simpler programming model, better fault tolerance, and it’s far more extensible than MapReduce. Spark supports virtually any form of iterative computation, and it was designed to support specific extensions; among the most popular are machine learning, microbatch stream processing, graph computing, and even SQL.
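For a sense of why that caching difference matters, here is a hedged sketch — our own toy example, with hypothetical paths — of an iterative computation in PySpark: the input is cached once, so each of the ten passes reads from memory, where an equivalent chain of MapReduce jobs would write and re-read interim results on disk.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iteration-sketch").getOrCreate()
sc = spark.sparkContext

# Load once, cache once: every pass below re-reads from memory, where an
# equivalent chain of MapReduce jobs would re-read (and re-write) on disk.
points = sc.textFile("hdfs:///data/points.txt").map(float).cache()

estimate = 0.0
for _ in range(10):                      # ten passes over the cached data set
    gradient = points.map(lambda x: x - estimate).mean()
    estimate += 0.5 * gradient           # simple damped update

print("converged estimate:", estimate)
```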

By contrast, Hadoop is a data platform. It is one of many that can run Spark, because Spark is platform-independent. So you could also run Spark on Cassandra, other NoSQL data stores, or SQL databases, but Hadoop is the most popular target right now.

And let’s not forget Apache Mesos, another AMPLab creation, a cluster manager with which Spark was originally closely associated.

There’s little question about the excitement level over Spark. By now the headlines have poured out over IBM investing $300 million, committing 3500 developers, establishing a Spark open source development center a few BART stops from AMPLab in San Francisco, and aiming directly and through partners to educate 1 million professionals on Spark in the next few years (or about 4 – 5x the current number registered for IBM’s online Big Data university). IBM views Spark’s strength as machine learning, and wants to make machine learning a declarative programming experience that will follow in SQL’s footsteps with its new SystemML language (which it plans to open source).

That’s not to overshadow Databricks’ announcement that its Spark developer cloud, in preview over the past year, has now gone GA. The big challenge facing Databricks was making its cloud scalable and sufficiently elastic to meet the demand – and not become a victim of its own success. And there is the growing number of vendors that are embedding Spark within their analytic tools, streaming products, and development tools. The release of Spark 1.4 brings new manageability features, such as the capability to automatically renew Kerberos tokens for long-running processes like streaming. But there remain growing pains, like reducing the number of moving parts needed to make Spark a first-class citizen with Hadoop YARN.

By contrast, last week was about Hadoop becoming more manageable and more amenable to enterprise infrastructure, like shared storage, as our colleague Merv Adrian pointed out. Not to mention enduring adolescent factional turf wars.

It’s easy to get excited by the idealism around the shiny new thing. While the sky seems the limit, the reality is that there’s lots of blocking and tackling ahead. There’s also the need to engage, not only developers, but business stakeholders – through applications rather than development tools, and through success stories with tangible results. It’s a stage that the Hadoop community is just starting to embrace now.

Spark Summit debrief: Relax, the growing pains are mundane

As the most active project (by number of committers) in the Apache Hadoop open source community, it’s not surprising that Spark has drawn much excitement and expectation. At the core, there are several key elements to Spark’s appeal:
1. It provides a much simpler and more resilient programming model compared to MapReduce – for instance, it can recover failed tasks in process rather than requiring the entire run to be restarted from scratch.
2. It takes advantage of DRAM memory, significantly accelerating compute jobs – and because of the speed, allowing more complex, chained computations to run (which could be quite useful for simulations or orchestrated computations based on if/then logic).
3. It is extensible. Spark provides a unified computing model that lets you mix and match complex iterative MapReduce-style computation with SQL, streaming, machine learning and other processes on the same node, with the same data, on the same cluster, without having to invoke separate programs. It’s akin to what Teradata is doing with the SNAP framework to differentiate its proprietary Aster platform.
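As a hedged illustration of that unified model (our own sketch; the file, the column names, and the choice of k-means are hypothetical), the following PySpark program chains a SQL query straight into a machine learning run on the same data, in the same program:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("unified-sketch").getOrCreate()

# SQL step: query a structured data set in place.
spark.read.json("hdfs:///data/transactions.json") \
     .createOrReplaceTempView("transactions")
recent = spark.sql("""SELECT customer_id, amount, num_items
                      FROM transactions WHERE ts > '2015-01-01'""")

# Machine learning step: feed the SQL result straight into a clustering
# run -- same data, same cluster, same program, no separate system.
features = VectorAssembler(inputCols=["amount", "num_items"],
                           outputCol="features").transform(recent)
model = KMeans(k=5, featuresCol="features").fit(features)
print(model.clusterCenters())
```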

Mike Olson, among others, has termed Spark “The leading candidate for ‘successor to MapReduce’.” How’s that for setting modest expectations?

So we were quite pleased to see Spark Summit making it to New York and have the chance to get immersed in the discussion.

Last fall, Databricks, whose founders created Spark from their work at UC Berkeley’s AMPLab, announced its first commercial product – a Spark Platform-as-a-Service (PaaS) cloud for developing Spark programs. We view the Databricks Cloud as a learning tool and incubator for developers to get up to speed on Spark without having to worry about marshaling compute clusters. The question on everybody’s minds at the conference was when the Databricks Cloud would go GA. The answer, like everything Spark, is about dealing with scalability – in this case, being capable of handling highly concurrent, highly spiky workloads. The latest word is later this year.

The trials and tribulations of the Databricks Cloud are quite typical for Spark – it’s dealing with scale, whether that be in numbers of users (concurrency) or data (when the data sets get too big for memory and must spill to disk). At a meetup last summer where we heard a trip report from Spark Summit, the key pain point cited was the need for more graceful spilling to disk.

Memory-resident compute frameworks of course are nothing new. SAS for instance has its LASR Server, which it contends is far more robust in dealing with concurrency and compute-intensive workloads. But, as SAS’s core business is analytics, we expect that they will meet Spark halfway to appeal to Spark developers.

While Spark is thought of as a potential replacement for MapReduce, in actuality we believe that MapReduce will be as dead as the mainframe – which is to say, it will live on for the workloads that suit it. While DRAM memory is, in the long run, getting cheaper, it will never be as cheap as disk. And while ideally you shouldn’t have to comb through petabytes of data on a routine basis (that’s part of defining your query and identifying the data sets), there are going to be analytic problems involving data sets that won’t completely fit in memory. Not to mention that not all computations (e.g., anything that requires developing a comprehensive model) will be suited for real-time or interactive computation. Not surprisingly, most of the use cases that we came across at Spark Summit were more about “medium data,” such as curating data feeds, real-time fraud detection, or heat maps of NYC taxi cab activity.

While dealing with scaling is part of the Spark roadmap, so is making it more accessible. At this stage, the focus is on developers, through APIs to popular statistical computation languages such as Python or R, and with frameworks such as Spark SQL and Spark DataFrames.

On one hand, with Hadoop and NoSQL platform providers competing with their own interactive SQL frameworks, the question is why the world needs another one. In actuality, Spark SQL doesn’t compete with Impala, Tez, BigSQL, Drill, Presto, or whatever. First, it’s not only about SQL, but about querying data with any kind of explicit schema. The use case for Spark SQL is running SQL programs in line with other computations, such as chaining SQL queries to streaming or machine learning runs. As for DataFrames, Databricks is simply adapting the distributed DataFrame technology already implemented in languages such as Java, Python, and R to access data sets that are organized as tables whose columns contain typed data.

Spark’s extensibility is both blessing and curse. Blessing in that the framework can run a wide variety of workloads, but curse in that developers can drown in abundance. One of the speakers at Summit called for package management so developers won’t stumble over their expanding array of Spark libraries and wind up reinventing the wheel.

Making Spark more accessible to developers is a logical step in growing the skills base. But ultimately, for Spark to have an impact with enterprises, it must be embraced by applications. In those scenarios, the end user doesn’t care what process is used under the hood. There are a few applications and tools, like ClearStory Data for curating data feeds, or ZoomData, an emerging Big Data BI tool that has some unique IP (likely to stay proprietary) for handling scale and concurrency.

There’s no shortage of excitement and hype around Spark. The teething issues (e.g., scalability, concurrency, package management) are rather mundane. The hype – that Spark will replace MapReduce – is ahead of the reality; as we’ve previously noted, there’s a place for in-memory computing, but it won’t replace all workloads or make disk-based databases obsolete. And while Spark hardly has a monopoly on in-memory computing, the accessibility and economics of an open source framework on commodity hardware open lots of possibilities for drawing a skills base and new forms of analytics. But let’s not get too far ahead of ourselves.

IBM and Twitter: Another piece of the analytics puzzle

Roughly 20 years ago, IBM faced a major fork in the road from the hardware-centric model that defined the computer industry from the days of Grace Hopper. It embraced a services-heavy model that leveraged IBM’s knowledge of how and where enterprises managed their information in an era when many were about to undergo drastic replatforming in the wake of Y2K.

Today it’s about the replatforming, not necessarily of IT infrastructure, but of the business, in the face of the need to connect in an increasingly mobile and things-connected world. And so IBM is in the midst of a reinvention, trying to embrace all things mobile, all things data, and all things connected. A key pillar of this strategy has been IBM’s mounting investment in Watson, where it has aggressively recruited and incubated partners to flesh out a new path of business solutions based on cognitive computing. On the horizon, we’ll be focusing our attention on a new path of insight: exploratory analytics, an area that is enabled by the next generation of business intelligence tools – Watson Analytics among them.

Which brings us to last fall’s announcement that IBM and Twitter would form a strategic partnership to develop real-time business solutions. As IBM has been seeking to reinvent itself, Twitter has been seeking to invent itself as a profitable business that can monetize its data in a manner that maintains trust among its members – yours truly among them. Twitter’s key value proposition is the immediacy of its data. While it may lack the richness and depth of content-heavy social networks like Facebook, it is, in essence, the world’s heartbeat. A ticker feed that is about, not financial markets, but the world.

When something happens, you might post on Facebook; within minutes or hours, blogs and news feeds may populate headlines. But for real-time immediacy, nothing beats the ease and simplicity of 140 characters. Uniquely, Twitter is a sort of hybrid between a consumer-oriented social network like Facebook and a professional one like LinkedIn. There is an immediacy and uniqueness to the data feed that Twitter provides. With its acquisition last year of partner Gnip (which already had commercial relationships with enterprise software providers like SAP), Twitter now has a direct pipeline for mounting the enterprise value chain.

So far, so good, but what has IBM done to build a real business out of all this? A few months in, IBM is on a publicity offensive to show there is real business here. It is partway to a goal of cross-training over 10,000 of its 140,000 GBS consultants on Twitter solutions. IBM has already signed a handful of reference customer deals, and is disclosing some of the real-world use cases that are the focus of actual engagements.

Meanwhile, Twitter has been on a heavily publicized path to monetize its data – a unique real-time pulse of what’s happening in the world. Twitter certainly has had its spate of challenges here. It sits on a data stream that is rich with currency, but lacking the depth that social networks like Facebook offer in abundance. Nonetheless, Twitter is unique in that it provides a ticker feed of what’s happening in the world. That was what was behind the announcement last fall that Twitter would become a strategic partner with IBM – to help Twitter monetize its data and for IBM to generate unique real-time business solutions.

Roughly six months into the partnership, IBM has gone on the offensive to demonstrate that it is generating real business and tangible use cases. We sat down for some off-the-record discussions with IBM, Twitter, and several customers and prospects ahead of today’s announcements.

The obvious low-hanging fruit is customer experience. We wrote this in midflight; before boarding, we had a Twitter exchange with United regarding whether we’d be put on another flight if our plane – delayed for a couple of hours with software trouble (yes… software) – was going to get cancelled (the story had a happy ending). Businesses are already using Twitter – that’s not the question. Instead, it’s whether there are other analytics-driven use cases – sort of like the things we used to talk about with complex event processing (CEP), but real rather than theoretical.

We had some background conversations with IBM last week ahead of today’s announcements. They told us of some engagements that they’ve booked during the first few months of the Twitter initiative. What’s remarkable is that they are very familiar use cases, where Twitter adds another verifying data point.

An obvious case is mobile carriers – this being the beachfront real estate of telco. As mobile embeds itself in our lives, there is more at stake for carriers who fear churn, and even more so the reputational damage that can come when defecting customers cry out about bad service publicly over social media. Telcos already have real-time data: they have connection data from their operational systems, and because this is mobile, location data as well. What’s kind of interesting to us is IBM’s assertion that what’s less understood is the relationship between tweets and churn – as we already use Twitter, we thought those truths were self-evident. You have a crappy connection, the mobile carrier has the data on what calls, texts, or web accesses were dropped, and if the telco already knows its customers’ Twitter handles, it should be as plain as day what the relationship is between tweets and potential churn events. IBM’s case here was that integrating Twitter with data that was already available – connections, weather, cell tower traffic, etc. – helped connect the dots. IBM claims that correlating Twitter with weather data alone could improve the accuracy of telco churn models by 5%.

Another example drawn from early engagements is employee turnover. Now, unless employees have gotten to the point where they’d rather take this job and shove it, you’d think that putting your gripes out over the Twitter feed would be a career-limiting move. But the approach here was more indirect: look at consumer businesses and correlate customer Twitter activity with locations where employee morale is sagging. Or look at the Twitter data to deduce that staff loyalty was flagging.

A more obvious use case was in the fashion industry. IBM is adapting another technology from its labs – psycholinguistic analysis (a.k.a., what are you really saying?) – to conduct a more nuanced form of sentiment analysis of your tweets. For this engagement, a fashion industry firm employed the analysis to gain more insight into why different products sold or didn’t.

Integrating Twitter is just another piece of the puzzle when trying to decipher signals from the market. It’s not a case of blazing new trails; indeed, sentiment analysis has become a well-established discipline for consumer marketers. The data from Twitter is crying out to be added to the mix of feeds used for piecing together the big picture. IBM’s alliance with Twitter is notable in that both are putting real skin in the game for productizing the insights that can be gained from Twitter feeds.

It’s not a criticism to say this, but incorporating Twitter is evolutionary, not revolutionary. That’s true for most big data analytics – we’re just expanding and deepening the window to solve very familiar problems. The data is out there – we might as well use it.

Hadoop: The Third Way

Working with Hadoop has been kind of a throwback. Until recently, Hadoop was synonymous with MapReduce programming, meaning that when you worked with Hadoop, it seemed that you were working with a newfangled mainframe. As if client/server never happened.

With the emergence of, and heavy competition among, the various interactive SQL frameworks (e.g., Impala, Tez, Presto, Drill, BigSQL, Big Data SQL, QueryGrid, Spark SQL), a second path emerged for database developers. So the Hadoop mainframe became a client/server machine. As if n-tier never happened.

The need for speed made n-tier happen – driven by the need to bypass the bottleneck of database I/O and the overhead of large, monolithic applications. And so the application server platform was born, and with it, ways to abstract functions such as integration, security, and transaction management so they could operate as modular parts with whatever application or database. Or, to prevent abandoned online shopping carts, a transaction could be executed without being held hostage to ensuring ACID compliance. Internet-based applications were now being developed on WebSphere, WebLogic, JBoss, and more recently, more compact open source alternatives like Apache Tomcat.

With Hadoop, though, we’re still in the era of the mainframe or client/server. But with the 2.x generation, where resource management has been taken out of MapReduce, the way has been cleared to make Hadoop more of a multi-purpose platform. While interactive SQL was the first shot, new frameworks supporting streaming (Storm, Spark Streaming), machine learning (Spark), and search (Solr) are among the new additions to the palette.

But at this point, we’re still looking at Hadoop as either a mainframe or two-tier system. Developers write MapReduce or Spark programs, or BI/query tools access HDFS with or without Hive. There’s nothing available to write data-driven programs, such as real-time user scoring or intrusion detection.

Nearly four years ago, a startup with a weird name – Continuuity – emerged to become, in its own terms, “the JBoss for Hadoop.” The goal was building a data fabric that abstracted the low-level APIs to HDFS, MapReduce, Hive, and other Hadoop components to clear the way for developers to write, not just MapReduce programs or run BI tools, but API-driven programs that could connect to Hadoop – just as, a generation ago, application servers abstracted data and programs so they could flexibly connect with each other. Its first project was a data ingestion platform written on Storm that would be easier to work with than existing Hadoop projects such as Flume.

Continuuity’s problem was that the company was founded too early. During the period when Hadoop was exclusively a batch processing platform, there was little clamor from developers to write data-driven applications. But as new frameworks transform Hadoop into a platform that can deliver experiences closer to real time, demand should emerge among developers to write, not just programs, but applications that can run against Hadoop (or other platforms).

In the interim, Continuuity changed its name to Cask, and changed its business model to become an open source company. It has diversified its streaming engine to work with other frameworks besides Storm and to more readily persist data. And the 40-person company, which was founded a few blocks away from Cloudera’s original headquarters, next to Fry’s Electronics in Palo Alto, has just drawn a modest investment from Cloudera to further develop its middleware platform.

Admittedly, Cask’s website doesn’t really make a good case (the home page gives you a 404 error), but providing an application platform for Hadoop opens up possibilities limited only by the imagination. For instance, it could make possible event-driven programs for performing data validation or detecting changes in customer interactions, and so on.

For Cloudera, Cask is a low-risk proposition for developing that long-missing third path to Hadoop to further its transformation to a multi-purpose platform.

Hadoop vendor ecosystem gaining critical mass

Nature abhors a vacuum, and enterprises abhor platforms lacking tooling. Few enterprises have the developer resources or technology savvy of early adopters. For Hadoop, early adopters invented the technology; mainstream enterprises want to consume it.

On our just-concluded tour of Ovum enterprise clients across Australia/Pacific Rim, we found that the few who have progressed beyond the discussion stage with Hadoop are doing so with technology staff accustomed to being on their own, building their own R programs and experimenting with embryonic frameworks like Mesos and YARN. Others are either awaiting more commercial tooling or still sorting out perennial data silos.

But Hadoop is steadily turning into a more “normal” software market. And with it, the vendor ecosystem vacuum is starting to fill in. It’s very much in line with what happened with BI and data warehousing back in the mid-1990s, when tools civilized what was a new architecture for managing data that originally required manual scripting.

So let’s take a brief tour.

Look at the exhibitor list for last month’s Strata HadoopWorld conference; as the largest such Big Data event in North America, it provides a good sampling of the ecosystem. Of nearly a hundred sponsors, roughly a third were tool providers encompassing BI and analytics, data federation and integration, data protection, and middleware.

There was a mix of the usual suspects who regard Hadoop as their newest target. SAS takes an agnostic approach, bundling a distro of Hadoop in its LASR in-memory appliance; but SAS analytics can also execute inside Hadoop clusters, converting its HPC routines to MapReduce. MicroStrategy and other BI players are connecting to Hadoop in a variety of ways; they either provide the suboptimal experience of having your SQL query execute in batch on Hadoop (which few use), or work through the data warehouse or the Hadoop platform’s path for interactive SQL.

But there are also new players that are taking BI beyond SQL. Datameer and Platfora each provide their own operators (e.g., clustering, time series, decision trees, or other forms of analysis that would be laborious with SQL), presenting data either through spreadsheets or visualizations. ClearStory Data, which emerged from stealth at the show, provides a way to semantically crawl your own data and mash it with external data from publicly available APIs. Players like Pivotal, Hadapt, SpliceMachine, and CitusData are implementing or co-locating SQL data stores inside HDFS or HBase.

Significantly, some are starting to package forms of data science as well, with almost a half dozen machine learning programs. A necessary development, because there are only so many Hilary Masons to go around. People who have a natural feel for data – able to understand its significance, how to analyze it, and most importantly, its relevance – will remain few and far between. To use these tools, you’ll need to know which algorithms to use, but at least you don’t have to build them from scratch. For instance, 0xdata packages machine learning algorithms and combines them with a caching engine for high-performance analytics on Hadoop. Skytree packages classification, clustering, regression analyses, and most importantly, dimension reduction, so you can see something meaningful after combing a billion nodes (points) and edges (relationships and context).

Security, a perennial weakness of Hadoop, is another area where you’re seeing vendor activity. Originally designed for trusted environments, Hadoop has long had the remote authentication piece down (Kerberos), because early adopters needed to gain access to remote clusters; now there are incubating open source projects tackling the other two A’s of AAA – a gateway for access control (Knox) and a mechanism for role-based authorization (Sentry). Yes, there is also a specialized project for “cell” (data entity) level protection created for the NSA (Accumulo), which is being led by Sqrrl. But otherwise, we expect that vendor-based proprietary tools are going to be where most of the action is. Policy-based data protection, whether encryption or data masking, is now emerging via players like Zettaset and Gazzang, with incumbents such as Protegrity and IBM extending support beyond SQL. Data lineage and activity monitoring (the first steps that could eventually lead to full-blown audit and selective read/write access) are emerging from IBM, Cloudera, and Revelytix.

We’ve long believed that for Big Data – and Hadoop – to gain traction with enterprises, it must become a first-class citizen. Among other things, that means Hadoop must integrate with the data center and, inevitably, the apps that run against it. Incumbent data integration players like Informatica, Talend, Syncsort, and Pentaho view Hadoop as yet another target. Originally touching Hadoop at arm’s length via the traditional ETL staging server topology, they have since enabled their transformation tools to work natively inside Hadoop – a natural move, as Hadoop promises cheaper compute cycles for the task. Emerging players are adding new integration capabilities: Cirro for data federation; JethroData for adding indexing to Hadoop; Kapow and Continuuity, which provide middleware for applications to integrate with Hadoop; and Appfluent, for extending its data lifecycle management tool to support active archiving on Hadoop.

The subtext of the explosion of the ecosystem is Hadoop’s evolution into a more varied platform; to play anything more than a niche role in the enterprise (and draw a tooling and applications ecosystem), Hadoop must provide other processing options besides MapReduce.

Not surprisingly, interactive SQL on Hadoop became a prime battleground for vendors to differentiate. Cloudera introduced Impala, an MPP-based alternative to MapReduce that uses Hive metadata but bypasses the bottleneck of Hive processing (which had traditionally relied on MapReduce). Meanwhile, Hortonworks has led projects to make Hive better (read: faster), complementing it with a faster alternative to MapReduce. As noted above, several players are implementing SQL data stores directly inside Hadoop, while IBM has modified SQL to run against Hive.

The YARN (a.k.a., MapReduce 2.0) framework provides resource allocation (not full-blown resource management, however) that will allow multiple (read: MapReduce and alternative) workloads to run on Hadoop clusters. Hortonworks, which led development, announced a circle of partners who are supporting the new framework. Its rival, Cloudera, is taking a more measured approach; MapReduce and Impala workloads will be allocated under the YARN umbrella, but streaming or search won’t. Because YARN was carved out of the original resource manager for pre-2.0 MapReduce, Cloudera doesn’t believe the new framework is suited for handling continuous workloads that don’t have starts or stops.

So, going forward, we’re seeing Hadoop emerge with an increasingly well-rounded third party ecosystem where little existed before. We expect that in the coming year, this will spread beyond tools to applications as well; we’ll see more of what the likes of Causata are doing.

So what role will Hadoop play?
For now, Hadoop remains a work in progress – data integration and lifecycle management, security, performance management, and governance practices and technologies are at early stages of evolution. At Strata, Facebook’s Ken Rudin made an eloquent plea for coexistence; Facebook tacked against the wind by starting with Hadoop and learning that it was best for exploratory analytics, while relational was best suited for queries with standard metrics (he’s pitched the same message to the data warehousing audience as well).

Cloudera’s Mike Olson, who had the podium right before Rudin, announced Cloudera’s vision of Hadoop as enterprise data hub: Hadoop is not just the logical landing spot for data, but also the place where you can run multiple workloads. Andrew Brust equates Cloudera’s positioning as making Hadoop become “the Ellis Island of data.”

So is Olson agreeing or arguing with Rudin?

The context is that analytic (and some transactional) data platforms are taking on multiple personalities (e.g., SQL row stores adding column engines, file/HDFS data stores, JSON stores – in some cases alongside or in hybrid). All analytic data platforms are grabbing for multiple data types and workloads. They are also vying to become the logical spot where analytics are choreographed – mixing and matching data sets on different platforms for running analytic problems.

Cloudera aims to compete, not just as another Hadoop platform, but as the default platform where analytic data lives. It doesn’t necessarily replace SQL enterprise data warehouses, but takes on more workloads requiring scale, inexpensive compute cycles, and the ability to run multiple types of workloads – not just MapReduce. SQL data warehouses aren’t standing still either, and in many cases are embracing Hadoop. Hadoop has the edge on cost of compute cycles, but pieces must fall into place to gain parity regarding service level management and performance, security, availability and reliability, and information lifecycle management. Looking ahead, we expect analytics to run on multiple platforms, with the center of gravity up for grabs.

Is the sky the limit for Flash and In-Memory Databases?

Big Data is getting bigger, and Fast Data is getting faster, because of the continually declining cost of all things infrastructure. Ongoing commoditization of powerful multi-core CPUs, storage media, and connectivity made scale-out Internet data centers possible, and with them, scale-out data platforms such as Hadoop and the new generation of Advanced SQL/NewSQL analytic data stores. Bandwidth is similarly going crazy; while the lack of 4G may make bandwidth seem elusive to mobile users, the growth of bandwidth for connecting devices and things has become another fact taken for granted.

Conventional wisdom is that similar trends are impacting storage, and until recently, that was the Kool-Aid that we swallowed. For sure, the macro picture is that declining price and ascending density curves are changing the conversation when it comes to deploying data. The type of media on which you store data is no longer just a price/performance tradeoff, but increasingly an architectural consideration in how data is processed and how applications that run on data are engineered. Bigger, cheaper storage makes bigger analytics possible; faster, cheaper storage makes more complex and functional applications possible.

At 100,000 feet, such trends for storage are holding, but dig beneath the surface and the picture gets more nuanced. And those nuances are increasingly driving how we design our data-driven transaction applications and analytics.

Cut through the terminology
But before we dive into the trends, let’s get our terminology straight, because the term memory is used much too loosely (does it mean DRAM or Flash?). For this discussion, we’ll stick with the following conventions:
• CPU cache is the on-chip memory used for temporarily holding data being processed by the processor.
• DRAM memory is the fastest storage layer that sits outside the chip, and is typically parceled out in GBytes per compute core.
• Solid State Drives (SSDs), based on Flash memory, are the silicon-based, faster substitute for traditional hard drives; they are typically sized at hundreds of GBytes (with some units just under a terabyte), but are not as fast as DRAM.
• Hard disk, or “disk,” is the workhorse that now scales economically up to 1 – 3 TBytes per spindle.

So what’s best for which?
For hard drives, conventional wisdom has been that they keep getting faster and cheaper. Turns out, only the latter is true. The cheapness of 1- and 3-TByte drives has made scale-out Internet data centers possible, and with them, scale-out Big Data analytic platforms like Hadoop. Hard disk continues to be the medium of choice for large volumes of data because individual drives routinely scale to 1 – 3 TBytes. And momentary supply chain disruptions like the 2011 Thailand floods aside, the supply remains more than adequate. Flash drives simply don’t get as fat.

But if anything, hard drives are getting slower because it’s no longer worthwhile to try speeding them up. With Flash being at least 10 – 100x faster, there’s no way that disk will easily catch up even if the technology gets refreshed. Flash is actually pulling the rug out from under demand for 7200-RPM disks (currently the state of the art for disk). Not surprisingly, disk technology development has hit the wall.

Given current price trends, where some analysts expect Flash to reach parity with disk in the next 12 – 18 months (or maybe sooner), there will be less reason for your next transaction system to be disk-based. In fact, there is good reason to be a bit skeptical about how soon the supply of SSD Flash will ramp up adequately for the transaction system market, but SSD Flash will gradually make its way to prime time. Conversely, with disk likely to remain fatter in capacity than Flash, it will be best suited for active archiving that keeps older data otherwise bound for tape live, and for Big Data analytics, where the need is for volume. Nonetheless, the workhorse of large Hadoop and similar disk-based Big Data analytic or active archive clusters will likely be the slower 5400-RPM models.

So what about even faster modes of storage? In the past couple of years, DRAM memory prices crossed the threshold where it became feasible to deploy DRAM for persistent storage rather than just caching currently used data. That cleared the way for the in-memory database (IMDB), which is often code for all-DRAM storage.

In-memory databases are hardly new, but until the last three or four years they were highly specialized. Oracle TimesTen, one of the earliest commercial offerings, was designed for tightly coupled, specialized transactional applications; other purpose-built in-memory data stores have existed for capital markets for at least a decade or more. But DRAM prices eventually dropped enough to bring in-memory databases into the enterprise mainstream. Kognitio opened the floodgates when it reincarnated its MOLAP cube and row store analytic platform as an in-memory platform on industry-standard hardware just over five years ago; SAP put in-memory in the spotlight with HANA for analytics and transactional applications; followed by Oracle, which reincarnated TimesTen as Exalytics for running Oracle Business Intelligence Enterprise Edition (OBIEE) and Essbase.

Yet an interesting blip happened on the way to the “inevitable” all-in-memory database future: last spring, DRAM memory prices stopped dropping. In part this was attributable to consolidation of the industry to fewer suppliers. But the larger driver was that the wisdom of crowds – e.g., that DRAM memory was now ready for prime time – got ahead of itself. Yes, the laws of supply and demand will eventually shift the trajectory of memory pricing. But no, that won’t change the fact of life that, no matter how cheap, DRAM memory (and cache) will always be premium storage.

In-memory databases are dead, long live tiered databases
The sky is not the limit for DRAM in-memory databases. The rush to in-memory will morph into an expansion of data tiering. And actually that’s not such a bad thing: do you really need to put all of that data in memory? We think not.

IBM and Teradata have shunned all-in-memory architectures; their contention is that the 80/20 rule should govern which data goes into memory. And under their breaths, the all-in-memory database folks have fallbacks for paging data between disk and memory. If designed properly, this is not constant paging, but rather a process that only occurs for that rare out-of-range query. Kognitio has a clever pricing model where they don’t charge you for the disk, just for the volume of memory. As for HANA, disk is designed into the system for permanent offline storage, but SAP quietly adds that it can also be utilized for paging data during routine operation. Maybe SAP shouldn’t be so quiet about that.

There’s one additional form of tiering to consider for highly complex analytics: it’s the boost that can come from pipelining computations inside chip cache. Oracle is looking to similar techniques for further optimizing upcoming generations of its Exadata database appliance platform. It’s a technique that’s part of IBM’s recent BLU architecture for DB2. High-performance analytic platforms such as SiSense also incorporate in-chip pipelining to actually reduce balance of system costs (e.g., require less DRAM).

It’s all about balance of system
Balance of system is hardly new, but until recently it meant trading off CPU and bandwidth against tiers of disk. Application and database design in turn focused on distributing or sharding data to place the most frequently accessed data on the disk, or portions of disk, that could be accessed the fastest. New forms of storage, including Flash and DRAM memory, add a few new elements to the mix. You’ll still configure storage (along with processors and interconnects) for the application and vice versa, but you’ll have a couple of new toys in your arsenal.

For Flash, it means fast OLTP applications that could add basic analytics, such as what Oracle’s recent wave of In-Memory Applications promise. For in-memory, that would dictate OLTP applications with even more complex analytics and/or what-if simulations embedded in line, such as what SAP is promising with its recently-introduced Business Suite and CRM applications on HANA.

For in-memory, we’d contend that in most cases, configurations keeping 100% of data in DRAM will remain overkill. Unless you are running a Big Data analytic problem that is supposed to encompass all of the data, you will likely work with just a fraction of it. Furthermore, IBM, Oracle, and Teradata are incorporating data-skipping features into their analytic platforms that deliberately filter irrelevant data so it is not scanned. There are many ways to speed processing before reaching for the fastest storage option.

Storage will become an application design option
Although we’re leery about hopping on the 100% DRAM in-memory bandwagon, smartly deployed DRAM could truly transform applications. When you eliminate the latency, you can embed complex analytics in line with transactional applications, enable the running of more complex analytics, or make it feasible for users to run more what-if simulations to inform their decisions.

Examples include transaction applications that differentiate how to fulfill orders from gold, silver, or bronze-level customers based on level of service and cost of fulfillment. It could help mitigate risk when making operational or fiduciary decisions by allowing more permutations of scenarios to be run. It could also enhance Big Data analytics by tiering the more frequently used data (and logic) in memory.

Whether to use DRAM or Flash will be a function of data volume and problem complexity. No longer will the choice of storage tiers be simply a hardware platform design decision; it will become a configuration decision for application designers as well.

Strata 2013 debrief: Enterprise-ready Hadoop Wars heat up

We’re in the thick of analyst conference season – Informatica last week, SAS tomorrow. So on this Sunday afternoon between gigs, we’re digesting what went down at Strata 2013 in Santa Clara last week. It was kind of a frustrating day in that we had limited time, were scheduled wall to wall with meetings, and missed what were likely some fascinating sessions. But we got a sense of some dominant themes: Harden Hadoop for the enterprise, and take the SQL world to Hadoop.

The Hadoop vendor ecosystem is filling in – new players with their own distros, and new capabilities focused on making Hadoop more enterprise grade. The field is early enough that the approaches are still quite diverse – it’s time to invent, not consolidate. Let the games proceed.

EMC got the jump early in the week by announcing the grafting of its own Greenplum Advanced SQL analytic data store onto Hadoop – basically, the Greenplum MPP database squooched (wanted an excuse to use a “word” like that) atop HDFS. Tastes like a SQL analytic database, scales like Hadoop. Cloudera Impala will soon go GA, branded as RTQ (Real-Time Query). Not to be outshined, Hortonworks, which works through the official Hadoop project itself, announced a couple of responding initiatives: the Tez runtime and the Stinger interactive query engine. You wouldn’t be seeing all these efforts to make Hadoop interactive if the demand weren’t out there; while Hadoop as a platform for extending the range of analytics has become very compelling to enterprises, they clearly expect that the platform must be SQL-interactive if it is to become a part of their analytic system portfolio.

While we’ve been expending electrons on the SQLization of Hadoop, the next stage of hardening is rapidly emerging: specifically, making Hadoop and Hadoop data more governable and secure. This involves capabilities such as data masking (where you permanently obliterate sensitive pieces of data), data encryption (where you can recover the original data), activity monitoring (who does what), data lineage (who and where this data came from, and who has done what to it), and of course, more fine-grained access control (preferably role-based) that picks up where Kerberos authentication leaves off. The pieces are just beginning to fall into place.
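To make the masking/encryption distinction concrete, here is a minimal Python sketch — our own illustration, not any vendor’s product, assuming the widely used cryptography package for the encryption half:

```python
import hashlib
from cryptography.fernet import Fernet  # assumes the 'cryptography' package

card = "4111-1111-1111-1234"

# Masking: one-way. The visible value keeps only the last four digits, and
# a hash can serve as a stable join key; the original is unrecoverable.
masked = "****-****-****-" + card[-4:]
token = hashlib.sha256(card.encode()).hexdigest()[:12]

# Encryption: two-way. Anyone holding the key can recover the original.
key = Fernet.generate_key()
cipher = Fernet(key)
ciphertext = cipher.encrypt(card.encode())
assert cipher.decrypt(ciphertext).decode() == card  # round-trips with the key

print(masked, token)
```

The point of the contrast: a masked value can be released into a less trusted environment with no key to lose, while an encrypted value remains sensitive for as long as the key exists.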

Dataguise, a niche player in data obfuscation that relaunched itself in the Hadoop space last year, has had an encryption product out for roughly six months and has drawn several customers; it promotes a self-learning feature that discovers sensitive data (e.g., credit card numbers), selectively encrypts, and then acts only when data is changed. IBM already has capabilities in Optim that are typically used when pulling data from an external database; a user-defined function can mask data in Hadoop, or mask it as it is drawn from Hadoop. IBM also offers activity monitoring, a capability that Cloudera just announced. Specifically, Cloudera’s new Navigator tool places agents (like everybody else, they characterize them as “lightweight”) on HDFS, Hive, and HBase, and you can configure them; for instance, the traffic on Hive is likely to be a fraction of that for HBase, which is more interactive, so you can configure monitoring of event changes to data accordingly. And then we came across Revelytix, which focuses on data lineage.

Then out of the blue, Intel swooped in with the announcement of its own Hadoop distribution. As if that was the last thing the world needed. But Intel has carved out some interesting angles: it is utilizing the native instruction set of the Xeon processor to move encryption and I/O optimization directly into the chip. Intel’s play addresses the fact that these processes are resource-heavy, a point where the sheer size of Hadoop data stores adds insult to injury. And that is not to mention that embedding encryption in hardware lessens the load on developers. Intel has drawn a number of partners including SAP, where integration with the HANA in-memory platform offers some interesting Fast Data possibilities. So far we’ve missed signals with Intel, but will speak with them later next week to get a better idea of where they hope to take hardware optimization with Hadoop.

Loose ends: Time is running out on us, but coming out of this week, several issues are running around in the back of our minds:
• Hive – we thought this was a done deal. Hive, one of the earliest components of Hadoop, was designed when MapReduce was the predominant processing pattern and the jobs that spawned its metadata were batch in nature. We were surprised that the debate over Hive’s use remains very much alive. The issue is how dynamic Hive can become – yes, it can support interactive queries, but is it based on metadata that is current? We sense that this will become another area for vendor differentiation.
• Apache Hadoop project – This could be spin, but there is sniping behind the scenes that the Hadoop project is no longer so broad-based when it comes to contributions. The flipside is that arguments over whether a particular vendor has enough (or any) committers ring a bit hollow. The operable question for enterprises is whether their distro of Hadoop is and will remain well supported.
• Resource management – this one has multiple angles. Of course there is debate over YARN. It is supposed to be the über resource manager of Hadoop, so MapReduce jobs don’t collide with those of other frameworks that may have different (and conflicting) demands on processing and data access. There’s active debate over whether YARN has sufficiently weaned itself from its MapReduce batch lineage, or whether it should be a batch-oriented sub-manager in a scheme where there is yet another layer of control. The counterargument is that this may make life (or at least the levels of control) far too complex. Expect vendor differentiation here.

Fast Data — the TV show

We’ve been talking about Fast Data over the past year, and so has Oracle. Last week we had the chance to make it a dialogue, as we were interviewed by Hasan Rizvi, who heads Oracle’s middleware business as Executive Vice President, Oracle Fusion Middleware and Java. The podcast, which will also include an appearance by Oracle customer Turkcell, will go live on February 27. You can sign up for it here.