Category Archives: Fast Data

Spark Summit debrief: Relax, the growing pains are mundane

As the most active project (by number of committers) in the Apache Hadoop open source community, it’s not surprising that Spark has drawn much excitement and expectation. At the core, there are several key elements to Spark’s appeal:
1. It provides a much simpler and more resilient programming model compared to MapReduce – for instance, it can restart failed nodes in process rather than requiring the entire run to be restarted from scratch.
2. It takes advantage of DRAM memory, significantly accelerating compute jobs – and because of the speed, allowing more complex, chained computations to run (which could be quite useful for simulations or orchestrated computations based on if/then logic).
3. It is extensible. Spark provides a unified computing model that lets you mix and match complex iterative MapReduce-style computation with SQL, streaming, machine learning and other processes on the same node, with the same data, on the same cluster, without having to invoke separate programs (see the sketch after this list). It’s akin to what Teradata is doing with the SNAP framework to differentiate its proprietary Aster platform.
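To make the unified-model point concrete, here is a minimal PySpark sketch (using today’s DataFrame-era API, which postdates this post) that runs a SQL aggregation and an MLlib clustering job against the same cached dataset in one program; the file name and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("unified-demo").getOrCreate()

# Hypothetical taxi-trip file and column names, for illustration only.
trips = spark.read.csv("trips.csv", header=True, inferSchema=True).cache()
trips.createOrReplaceTempView("trips")

# SQL step: rank pickup zones by activity.
busiest = spark.sql(
    "SELECT zone, COUNT(*) AS rides FROM trips GROUP BY zone ORDER BY rides DESC")
busiest.show(5)

# ML step, chained on the same cached data: cluster trips by coordinates.
vectors = VectorAssembler(inputCols=["lon", "lat"],
                          outputCol="features").transform(trips)
model = KMeans(k=8, seed=1).fit(vectors)
print(model.clusterCenters())
```

The point is not the specific query or model, but that both steps share one in-memory dataset in one program, with no separate engines to invoke.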

Mike Olson, among others, has termed Spark “The leading candidate for ‘successor to MapReduce’.” How’s that for setting modest expectations?

So we were quite pleased to see Spark Summit making it to New York and to have the chance to get immersed in the discussion.

Last fall, Databricks, whose founders created Spark from their work at UC Berkeley’s AMPLab, announced its first commercial product: a Spark Platform-as-a-Service (PaaS) cloud for developing Spark programs. We view the Databricks Cloud as a learning tool and incubator for developers to get up to speed on Spark without having to worry about marshaling compute clusters. The question on everybody’s minds at the conference was when the Databricks Cloud would go GA. The answer, like everything Spark, is about dealing with scalability – in this case, being capable of handling high-concurrency, highly spiky workloads. The latest word is later this year.

The trials and tribulations of the Databricks Cloud are quite typical for Spark – it’s dealing with scale, whether that be in numbers of users (concurrency) or data (when the data sets get too big for memory and must spill to disk). At a meetup last summer where we heard a trip report from Spark Summit 2015, the key pain point was the need for more graceful spilling to disk.

Memory-resident compute frameworks of course are nothing new. SAS for instance has its LASR Server, which it contends is far more robust in dealing with concurrency and compute-intensive workloads. But, as SAS’s core business is analytics, we expect that they will meet Spark halfway to appeal to Spark developers.

While Spark is thought of as a potential replacement for MapReduce, in actuality we believe that MapReduce will be about as dead as the mainframe – which is to say, not dead at all. While DRAM memory is, in the long run, getting cheaper, it will never be as cheap as disk. And while ideally you shouldn’t have to comb through petabytes of data on a routine basis (that’s part of defining your query and identifying the data sets), there are going to be analytic problems involving data sets that won’t completely fit in memory. Not to mention that not all computations (e.g., anything that requires developing a comprehensive model) will be suited for real-time or interactive computation. Not surprisingly, most of the use cases that we came across at Spark Summit were more about “medium data,” such as curating data feeds, real-time fraud detection, or heat maps of NYC taxi cab activity.

While dealing with scaling is part of the Spark roadmap, so is making it more accessible. At this stage, the focus is on developers, through APIs to popular statistical computation languages such as Python or R, and with frameworks such as Spark SQL and Spark DataFrames.

On one hand, with Hadoop and NoSQL platform providers competing with their own interactive SQL frameworks, the question is why the world needs another SQL framework. In actuality, Spark SQL doesn’t compete with Impala, Tez, BigSQL, Drill, Presto or whatever. First, it’s not only about SQL, but about querying data with any kind of explicit schema. The use case for Spark SQL is running SQL programs in line with other computations, such as chaining SQL queries to streaming or machine learning runs. As for DataFrames, Databricks is simply adapting the distributed DataFrame technology already implemented in languages such as Java, Python, and R to access data sets that are organized as tables with columns containing typed data.

Spark’s extensibility is both blessing and curse. Blessing in that the framework can run a wide variety of workloads, but curse in that developers can drown in abundance. One of the speakers at Summit called for package management so developers won’t stumble over their expanding array of Spark libraries and wind up reinventing the wheel.

Making Spark more accessible to developers is a logical step in growing the skills base. But ultimately, for Spark to have an impact with enterprises, it must be embraced by applications. In those scenarios, the end user doesn’t care what process is used under the hood. There are a few applications and tools, like ClearStory Data for curating data feeds, or ZoomData, an emerging Big Data BI tool that has some unique IP (likely to stay proprietary) for handling scale and concurrency.

There’s no shortage of excitement and hype around Spark. The teething issues (e.g., scalability, concurrency, package management) are rather mundane. The hype – that Spark will replace MapReduce – is ahead of the reality; as we’ve previously noted, there’s a place for in-memory computing, but it won’t replace all workloads or make disk-based databases obsolete. And while Spark hardly has a monopoly on in-memory computing, the accessibility and economics of an open source framework on commodity hardware open lots of possibilities for drawing a skills base and new forms of analytics. But let’s not get too far ahead of ourselves.

IBM and Twitter: Another piece of the analytics puzzle

Roughly 20 years ago, IBM faced a major fork in the road from the hardware-centric model that defined the computer industry from the days of Grace Hopper. It embraced a services-heavy model that leveraged IBM’s knowledge of how and where enterprises managed their information in an era when many were about to undergo drastic replatforming in the wake of Y2K.

Today it’s about the replatforming, not of IT infrastructure necessarily, but of the business, in the face of the need to connect in an increasingly mobile and connected world. And so IBM is in the midst of a reinvention, trying to embrace all things mobile, all things data, and all things connected. A key pillar of this strategy has been IBM’s mounting investment in Watson, where it has aggressively recruited and incubated partners to flesh out a new path of business solutions based on cognitive computing. On the horizon, we’ll be focusing our attention on a new path of insight: exploratory analytics, an area that is enabled by the next generation of business intelligence tools – Watson Analytics among them.

Which brings us to last fall’s announcement that IBM and Twitter would form a strategic partnership to develop real-time business solutions. As IBM has been seeking to reinvent itself, Twitter has been seeking to invent itself as a profitable business that can monetize its data in a manner that maintains trust among its members – yours truly among them. Twitter’s key value proposition is the immediacy of its data. While it may lack the richness and depth of content-heavy social networks like Facebook, it is, in essence, the world’s heartbeat. A ticker feed that is about, not financial markets, but the world.

When something happens, you might post on Facebook; within minutes or hours, blogs and news feeds may populate headlines. But for real-time immediacy, nothing beats the ease and simplicity of 140 characters. Uniquely, Twitter is sort of a hybrid between a consumer-oriented social network like Facebook and a professional one like LinkedIn. There is an immediacy and uniqueness to the data feed that Twitter provides. With its acquisition last year of partner Gnip (which already had commercial relationships with enterprise software providers like SAP), Twitter gained a direct pipeline for mounting the enterprise value chain.

So far, so good, but what has IBM done to build a real business out of all this? A few months in, IBM is on a publicity offensive to show there is real business here. It is partway to a goal of cross-training over 10,000 of its 140,000 GBS consultants on Twitter solutions. IBM has already signed a handful of reference customer deals, and is disclosing some of the real-world use cases that are the focus of actual engagements.

Meanwhile, Twitter has been on a heavily publicized path to monetize the data that it has – which is a unique real-time pulse of what’s happening in the world. Twitter certainly has had its spate of challenges here. It sits on a data stream that is rich with currency, but lacking the depth that social networks like Facebook offer in abundance. Nonetheless, Twitter is unique in that it provides a ticker feed of what’s happening in the world. That was what was behind the announcement last fall that Twitter would become a strategic partner with IBM – to help Twitter monetize its data and for IBM to generate unique real-time business solutions.

Roughly six months into the partnership, IBM has taken the offensive to demonstrate that the new partnership is generating real business and tangible use cases. We sat down for some off-the-record discussions with IBM, Twitter, and several customers and prospects ahead of today’s announcements.

The obvious low-hanging fruit is customer experience. We wrote this in midflight; before boarding, we had a Twitter exchange with United regarding whether we’d be put on another flight if our plane – delayed for a couple of hours with software trouble (yes… software) – was going to get cancelled (the story had a happy ending). Businesses are already using Twitter – that’s not the question. Instead, it’s whether there are other analytics-driven use cases – the sort of thing we used to talk about with CEP, but real, not theoretical.

We had some background conversations with IBM last week ahead of today’s announcements. They told us of some engagements that they’ve booked during the first few months of the Twitter initiative. What’s remarkable is they are very familiar use cases, where Twitter adds another verifying data point.

An obvious case is mobile carriers – this being the beachfront real estate of telco. As mobile embeds itself in our lives, there is more at stake for carriers who fear churn, and even more so, the reputational damage that can come when defecting customers cry out about bad service publicly over social media. Telcos already have real-time data: they have connection data from their operational systems, and because this is mobile, location data as well. What’s kind of interesting to us is IBM’s assertion that what’s less understood is the relationship between tweets and churn – as we already use Twitter, we thought those truths were self-evident. You have a crappy connection, the mobile carrier has the data on what calls, texts, or web access were dropped, and if the telco already knows its customers’ Twitter handles, it should be as plain as day what the relationship is between tweets and potential churn events. IBM’s case here was that integrating Twitter with data that was already available – connections, weather, cell tower traffic, etc. – helped connect the dots (see the sketch below). IBM makes the claim that correlating Twitter with weather data alone could improve the accuracy of telco churn models by 5%.
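As a toy illustration of the kind of dot-connecting described above – not IBM’s actual method – here’s a pandas sketch that lines up tweets from known customer handles with dropped-call records in the same time window; all data, handles, and column names are invented.

```python
import pandas as pd

# Invented dropped-call records from the carrier's operational systems.
drops = pd.DataFrame({
    "handle": ["@ann", "@bob"],
    "dropped_at": pd.to_datetime(["2015-04-01 09:00", "2015-04-01 18:30"]),
}).sort_values("dropped_at")

# Invented tweets from customers whose handles the telco already knows.
tweets = pd.DataFrame({
    "handle": ["@ann"],
    "tweeted_at": pd.to_datetime(["2015-04-01 09:07"]),
    "text": ["third dropped call today, switching carriers"],
}).sort_values("tweeted_at")

# For each tweet, find the same customer's most recent dropped call within
# the previous hour -- a candidate churn signal for the model.
signals = pd.merge_asof(tweets, drops, left_on="tweeted_at",
                        right_on="dropped_at", by="handle",
                        tolerance=pd.Timedelta("1h"))
print(signals.dropna())
```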

Another example drawn from early engagements is employee turnover. Now, unless an employee has gotten to the point where they’d rather take this job and shove it, you’d think that putting your gripes out over the Twitter feed would be a career-limiting move. But the approach here was more indirect: look at consumer businesses and correlate customer Twitter activity with locations where employee morale is sagging, or look at the Twitter data to deduce that staff loyalty was flagging.

A more obvious use case was in the fashion industry. IBM is adapting another technology from its labs – psycholinguistic analysis (a.k.a., what are you really saying?) – to conduct a more nuanced form of sentiment analysis of your tweets. For this engagement, a fashion industry firm employed the analysis to gain more insight on why different products sold or not.

Integrating Twitter is just another piece of the puzzle when trying to decipher signals from the market. It’s not a case of blazing new trails; indeed, sentiment analysis has become a well-established discipline for consumer marketers. The data from Twitter is crying out to be added to the mix of feeds used for piecing together the big picture. IBM’s alliance with Twitter is notable in that both are putting real skin in the game for productizing the insights that can be gained from Twitter feeds.

It’s not a criticism to say this, but incorporating Twitter is evolutionary, not revolutionary. That’s true for most big data analytics – we’re just expanding and deepening the window to solve very familiar problems. The data is out there – we might as well use it.

Hadoop: The Third Way

Working with Hadoop has been kind of a throwback. Until recently, Hadoop was synonymous with MapReduce programming, meaning that when you worked with Hadoop, it seemed that you were working with a newfangled mainframe. As if client/server never happened.

With the emergence of, and heavy competition between, the various interactive SQL frameworks (e.g., Impala, Tez, Presto, Drill, BigSQL, Big Data SQL, QueryGrid, Spark SQL), a second path emerged for database developers. So the Hadoop mainframe became a client/server machine. As if n-tier never happened.

The need for speed made n-tier happen – the need to bypass the bottleneck of database I/O and the overhead of large, monolithic applications. And so the application server platform was born, and with it, ways to abstract functions such as integration, security, and transaction management so they could operate as modular piece parts with any application or database. Or to prevent abandoned online shopping carts, so a transaction could be executed without being held hostage to ensuring ACID compliance. Internet-based applications were now being developed on WebSphere, WebLogic, JBoss, and more recently, more compact open source alternatives like Apache Tomcat.

But with Hadoop, we’re still in the era of the mainframe or client/server. With the 2.x generation, however, where resource management has been taken out of MapReduce, the way has been cleared to make Hadoop more of a multi-purpose platform. While interactive SQL was the first shot, new frameworks supporting streaming (Storm, Spark Streaming), machine learning (Spark), and search (Solr) are among the new additions to the palette.

But at this point, we’re still looking at Hadoop as either a mainframe or two-tier system. Developers write MapReduce or Spark programs, or BI/query tools access HDFS with or without Hive. There’s nothing available to write data-driven programs, such as real-time user scoring or intrusion detection.

Nearly four years ago, a startup with a weird name – Continuuity – emerged to become in its own terms “the JBoss for Hadoop.” The goal was building a data fabric that abstracted the low-level APIs to HDFS, MapReduce, Hive, and other Hadoop components to clear the way for developers to write, not just MapReduce programs or run BI tools, but write API-driven programs that could connect to Hadoop. Just as a generation ago, application servers abstracted data and programs so they could flexibly connect with each other. Its first project was a data ingestion platform written on Storm that would be easier to work with than existing Hadoop projects such as Flume.

Continuuity’s problem was that the company was founded too early. During a period when Hadoop was exclusively a batch processing platform, there was little call for developers to write data-driven applications. But as new frameworks transform Hadoop into a platform that can deliver experiences closer to real time, demand should emerge among developers to write, not just programs, but applications that can run against Hadoop (or other platforms).

In the interim, Continuuity changed its name to Cask, and changed its business model to become an open source company. It has diversified its streaming engine to work with other frameworks besides Storm to more readily persist data. And the 40-person company which was founded a few blocks away from Cloudera’s original headquarters, next to Fry’s Electronics in Palo Alto, has just drawn a modest investment from Cloudera to further develop its middleware platform.

Admittedly, Cask’s website really doesn’t make a good case (the home page gives you a 404 error), but providing an application platform for Hadoop opens up possibilities limited only by the imagination. For instance, it could make possible event-driven programs for performing data validation or detecting changes in customer interactions, and so on.

For Cloudera, Cask is a low-risk proposition for developing that long-missing third path to Hadoop to further its transformation to a multi-purpose platform.

Hadoop vendor ecosystem gaining critical mass

Nature abhors a vacuum, and enterprises abhor platforms lacking tooling. Few enterprises have the developer resources or technology savvy of early adopters. For Hadoop, early adopters invented the technology; mainstream enterprises want to consume it.

On our just-concluded tour of Ovum enterprise clients across Australia/Pacific Rim, we found that the few who have progressed beyond discussion stage with Hadoop are doing so with technology staff accustomed to being on their own, building their own R programs and experimenting with embryonic frameworks like Mesos and YARN. Others are either awaiting more commercial tooling or still sorting out perennial data silos.

But Hadoop is steadily turning into a more “normal” software market. And with it, the vendor ecosystem vacuum is starting to fill in. It’s very much in line with what happened with BI and data warehousing back in the mid-1990s, when tools civilized what was a new architecture for managing data that originally required manual scripting.

So let’s take a brief tour.

Look at the exhibitor list for last month’s Strata HadoopWorld conference; as the largest such Big Data event in North America, it provides a good sampling of the ecosystem. Of nearly a hundred sponsors, roughly a third offered tools encompassing BI and analytics, data federation and integration, data protection, and middleware.

There was a mix of the usual suspects who regard Hadoop as their newest target. SAS takes an agnostic approach, bundling a distro of Hadoop in its LASR in-memory appliance; but SAS analytics can also execute inside Hadoop clusters, converting their HPC routines to MapReduce. MicroStrategy and other BI players are connecting to Hadoop in a variety of ways; they either provide the suboptimal experience of having your SQL query execute in batch on Hadoop (which few use), or work through the data warehouse or the Hadoop platform’s path for interactive SQL.

But there are also new players that are taking BI beyond SQL. Datameer and Platfora each provide their own operators (e.g., clustering, time series, decision trees, or other forms of analysis that would be laborious with SQL), presenting data either through spreadsheets or visualizations. ClearStory Data, which emerged from stealth at the show, provides a way to semantically crawl your own data and mash it with external data from publicly available APIs. Players like Pivotal, Hadapt, Splice Machine, and CitusData are implementing or co-locating SQL data stores inside HDFS or HBase.

Significantly, some are starting to package forms of data science as well, with almost a half dozen machine learning programs. A necessary development, because there are only so many Hilary Masons to go around. People who have a natural feel for data – able to understand its significance, how to analyze it, and most importantly, its relevance – will remain few and far between. To use these tools, you’ll need to know which algorithms to use, but at least you don’t have to build them from scratch. For instance, 0xdata packages machine learning algorithms and combines them with a caching engine for high-performance analytics on Hadoop. Skytree packages classification, clustering, and regression analyses, and most importantly, dimension reduction, so you can see something meaningful after combing a billion nodes (points) and edges (relationships and context).

Security, a perennial weakness of Hadoop, is another area where you’re seeing vendor activity. Originally designed for trusted environments, Hadoop has long had the remote authentication piece down (Kerberos), because early adopters needed to gain access to remote clusters, and now there are incubating open source projects tackling the other two A’s of AAA – a gateway for access control (Knox) and a mechanism for role-based authorization (Sentry). Yes, there is also a specialized project for “cell” (data entity) level protection created for the NSA (Accumulo), which is being led by Sqrrl. But otherwise, we expect that vendor-based proprietary tools are going to be where most of the action is. Policy-based data protection, whether for encryption or data masking, is now arriving via emerging players like Zettaset and Gazzang, with incumbents such as Protegrity and IBM extending support beyond SQL. Data lineage and activity monitoring (the first steps that could eventually lead to full-blown audit and selective read/write access) are emerging from IBM, Cloudera, and Revelytix.

We’ve long believed that for Big Data – and Hadoop – to gain traction with enterprises, it must become a first-class citizen. Among other things, that means Hadoop must integrate with the data center and, inevitably, the apps that run against it. Incumbent data integration players like Informatica, Talend, Syncsort, and Pentaho view Hadoop as yet another target. Originally touching Hadoop at arm’s length via the traditional ETL staging server topology, they have since enabled their transformation tools to work natively inside Hadoop, as the idea is a natural fit (Hadoop promises cheaper compute cycles for the task). Emerging players are adding new integration capabilities – Cirro for data federation; JethroData for adding indexing to Hadoop; Kapow and Continuuity, which are providing middleware for applications to integrate with Hadoop; and Appfluent for extending its data lifecycle management tool to support active archiving on Hadoop.

The subtext of the explosion of the ecosystem is Hadoop’s evolution into a more varied platform; to play anything more than a niche role in the enterprise (and draw a tooling and applications ecosystem), Hadoop must provide other processing options besides MapReduce.

Not surprisingly, interactive SQL on Hadoop became a prime battleground for vendors to differentiate. Cloudera introduced Impala, an MPP-based alternative to MapReduce that uses Hive metadata but bypasses the bottleneck of Hive processing (which had traditionally relied on MapReduce). Meanwhile, Hortonworks has led projects to make Hive better (read: faster), complementing it with a faster alternative to MapReduce. As noted above, several players are implementing SQL data stores directly inside Hadoop, while IBM has modified SQL to run against Hive.

The YARN (a.k.a. MapReduce 2.0) framework provides resource allocation (not full-blown resource management, however) that will allow multiple (read: MapReduce and alternative) workloads to run on Hadoop clusters. Hortonworks, which led development, announced a circle of partners who are supporting the new framework. Its rival, Cloudera, is taking a more measured approach; MapReduce and Impala workloads will be allocated under the YARN umbrella, but streaming or search won’t. Because YARN was carved out of the original resource manager for pre-2.0 MapReduce, Cloudera doesn’t believe the new framework is suited to handling continuous workloads that don’t have starts or stops.

So, going forward, we’re seeing Hadoop emerge with an increasingly well-rounded third party ecosystem where little existed before. We expect that in the coming year, this will spread beyond tools to applications as well; we’ll see more of what the likes of Causata are doing.

So what role will Hadoop play?
For now, Hadoop remains a work in progress – data integration and lifecycle management, security, performance management, and governance practices and technologies are at early stages of evolution. At Strata, Facebook’s Ken Rudin made an eloquent plea for coexistence; Facebook tacked against the wind by starting with Hadoop and learning that it was best for exploratory analytics, while relational was best suited for queries with standard metrics (he’s pitched the same message to the data warehousing audience as well).

Cloudera’s Mike Olson, who had the podium right before Rudin, announced Cloudera’s vision of Hadoop as an enterprise data hub: Hadoop is not just the logical landing spot for data, but also the place where you can run multiple workloads. Andrew Brust likens Cloudera’s positioning to making Hadoop “the Ellis Island of data.”

So is Olson agreeing or arguing with Rudin?

The context is that analytic (and some transactional) data platforms are taking on multiple personalities (e.g., SQL row stores adding column engines, file/HDFS data stores, and JSON stores – in some cases alongside or in hybrid). All analytic data platforms are grabbing for multiple data types and workloads. They are also vying to become the logical spot where analytics are choreographed – mixing and matching data sets on different platforms for running analytic problems.

Cloudera aims to compete, not just as another Hadoop platform, but as the default platform where analytic data lives. It doesn’t necessarily replace SQL enterprise data warehouses, but assumes more of the workloads requiring scale, inexpensive compute cycles, and the ability to run multiple types of workloads – not just MapReduce. SQL data warehouses aren’t standing still either, and in many cases are embracing Hadoop. Hadoop has the edge on cost of compute cycles, but pieces must fall into place to gain parity in service level management and performance, security, availability and reliability, and information lifecycle management. Looking ahead, we expect analytics to run on multiple platforms, with the center of gravity up for grabs.

Is the sky the limit for Flash and In-Memory Databases?

Big Data is getting bigger, and Fast Data is getting faster, because of the continually declining cost of all things infrastructure. Ongoing commoditization of powerful multi-core CPUs, storage media, and connectivity made scale-out Internet data centers possible, and with them, scale-out data platforms such as Hadoop and the new generation of Advanced SQL/NewSQL analytic data stores. Bandwidth is similarly going crazy; while the lack of 4G may make bandwidth seem elusive to mobile users, growth of bandwidth for connecting devices and things has become another fact taken for granted.

Conventional wisdom is that similar trends are impacting storage, and until recently, that was the Kool-Aid that we swallowed. For sure, the macro picture is that declining price and ascending density curves are changing the conversation when it comes to deploying data. The type of media on which you store data is no longer just a price/performance tradeoff, but increasingly an architectural consideration in how data is processed and how applications that run on data are engineered. Bigger, cheaper storage makes bigger analytics possible; faster, cheaper storage makes more complex and functional applications possible.

At 100,000 feet, such trends for storage are holding, but dig beneath the surface and the picture gets more nuanced. And those nuances are increasingly driving how we design our data-driven transaction applications and analytics.

Cut through the terminology
But before we dive into the trends, let’s get our terminology straight, because the term memory is used much too loosely (does it mean DRAM or Flash?). For this discussion, we’ll stick with the following conventions:
• CPU cache is the memory on the chip that is used for temporarily holding data being processed by the processor.
• DRAM memory is the fastest storage layer that sits outside the chip, and is typically parceled out in GBytes per compute core.
• Solid State Drives (SSDs), based on Flash memory, are the silicon-based, faster substitute for traditional hard drives; they are typically sized at hundreds of GBytes (with some units just under a terabyte), but not as fast as DRAM.
• Hard disk, or “disk,” is the workhorse that now scales economically up to 1 – 3 TBytes per spindle.

So what’s best for which?
For hard drives, conventional wisdom has been that they keep getting faster and cheaper. It turns out only the latter is true. The cheapness of 1- and 3-TByte drives has made scale-out Internet data centers possible, and with them, scale-out Big Data analytic platforms like Hadoop. Hard disk continues to be the medium of choice for large volumes of data because individual drives routinely scale to 1 – 3 TBytes. And momentary supply chain disruptions like the 2011 Thailand floods aside, the supply remains more than adequate. Flash drives simply don’t get as fat.

But if anything, hard drives are getting slower because it’s no longer worthwhile to try speeding them up. With Flash being at least 10 – 100x faster, there’s no way that disk will easily catch up even if the technology gets refreshed. Flash is actually pulling the rug out from under demand for 7200-RPM disks (currently the state of the art for disk). Not surprisingly, disk technology development has hit the wall.

Given current price trends, where some analysts expect Flash to reach parity with disk in the next 12 – 18 months (or maybe sooner), there will be less reason for your next transaction system to be disk-based. In fact, there is good reason to be a bit skeptical about how soon the supply of SSD Flash will ramp up adequately for the transaction system market; but SSD Flash will gradually make its way to prime time. Conversely, with disk likely to remain fatter in capacity than Flash, it will be best suited for active archiving that keeps older data otherwise bound for tape live, and for Big Data analytics, where the need is for volume. Nonetheless, the workhorse of large Hadoop and similar disk-based Big Data analytic or active archive clusters will likely be the slower 5400-RPM models.

So what about even faster modes of storage? In the past couple of years, DRAM memory prices crossed the threshold where it became feasible to deploy DRAM for persistent storage rather than caching of currently used data. That cleared the way for the in-memory database (IMDB), which is often code for all-DRAM storage.

In-memory databases are hardly new, but until the last 3 – 4 years they were highly specialized. Oracle TimesTen, one of the earliest commercial offerings, was designed for tightly-coupled, specialized transactional applications; other purpose-built in-memory data stores have existed for capital markets for at least a decade or more. But DRAM memory prices dropped to bring them into the enterprise mainstream. Kognitio opened the floodgates as it reincarnated its MOLAP cube and row store analytic platform to in-memory on industry-standard hardware just over 5 years ago; SAP put in-memory in the spotlight with HANA for analytics and transactional applications; followed by Oracle, which reincarnated TimesTen as Exalytics for running Oracle Business Intelligence Enterprise Edition (OBIEE) and Essbase.

Yet, an interesting blip happened on the way to the “inevitable” all in-memory database future: Last spring, DRAM memory prices stopped dropping. In part this was attributable to consolidation of the industry to fewer suppliers. But the larger driver was that the wisdom of crowds – e.g., that DRAM memory was now ready for prime time – got ahead of itself. Yes, the laws of supply and demand will eventually shift the trajectory of memory pricing. But nope, that won’t change the fact of life that, no matter how cheap, DRAM memory (and cache) will always be premium storage.

In-memory databases are dead, long live tiered databases
The sky is not the limit for DRAM in-memory databases. The rush to in-memory will morph into an expansion of data tiering. And actually that’s not such a bad thing: do you really need to put all of that data in memory? We think not.

IBM and Teradata have shunned all-in-memory architectures; their contention is that the 80/20 rule should govern which data goes into memory. And under their breaths, the all-in-memory database folks have fallbacks for paging data between disk and memory. If designed properly, this is not constant paging, but rather a process that only occurs for that rare out-of-range query. Kognitio has a clever pricing model where they don’t charge you for the disk, just for the volume of memory. As for HANA, disk is designed into the system for permanent offline storage, but SAP quietly adds that it can also be utilized for paging data during routine operation. Maybe SAP shouldn’t be so quiet about that.
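A minimal sketch of the tiering idea in plain Python – schematic, not any vendor’s design: a bounded in-memory tier spills its least-recently-used entries to a stand-in “disk” tier, and promotes them back on that rare out-of-range read.

```python
from collections import OrderedDict

class TieredStore:
    """Toy illustration of the 80/20 idea: a bounded hot (memory) tier
    backed by a cold (disk) tier; the coldest items spill on overflow."""

    def __init__(self, hot_capacity):
        self.hot = OrderedDict()   # in-memory tier, kept in LRU order
        self.cold = {}             # stands in for the disk tier
        self.cap = hot_capacity

    def put(self, key, value):
        self.hot[key] = value
        self.hot.move_to_end(key)
        if len(self.hot) > self.cap:            # spill least-recently-used
            old_key, old_val = self.hot.popitem(last=False)
            self.cold[old_key] = old_val

    def get(self, key):
        # Assumes the key exists in one of the two tiers.
        if key in self.hot:
            self.hot.move_to_end(key)           # refresh recency
            return self.hot[key]
        value = self.cold.pop(key)              # the rare out-of-range read
        self.put(key, value)                    # promote back into memory
        return value
```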

There’s one additional form of tiering to consider for highly complex analytics: it’s the boost that can come from pipelining computations inside chip cache. Oracle is looking to similar techniques for further optimizing upcoming generations of its Exadata database appliance platform. It’s a technique that’s part of IBM’s recent BLU architecture for DB2. High-performance analytic platforms such as SiSense also incorporate in-chip pipelining to actually reduce balance of system costs (e.g., require less DRAM).

It’s all about balance of system
Balance of system is hardly new, but until recently, it meant trading off CPU and bandwidth against tiers of disk. Application and database design in turn focused on distributing or sharding data to place the most frequently accessed data on the disks, or portions of disk, that could be accessed the fastest. New forms of storage, including Flash and DRAM memory, add a few new elements to the mix. You’ll still configure storage (along with processors and interconnects) for the application and vice versa, but you’ll have a couple of new toys in your arsenal.

For Flash, it means fast OLTP applications that could add basic analytics, such as what Oracle’s recent wave of In-Memory Applications promise. For in-memory, that would dictate OLTP applications with even more complex analytics and/or what-if simulations embedded in line, such as what SAP is promising with its recently-introduced Business Suite and CRM applications on HANA.

For in-memory, we’d contend that in most cases, configurations for keeping 100% of data in DRAM will remain overkill. Unless you are running a Big Data analytic problem that is supposed to encompass all of the data, you will likely work with just a fraction of it. Furthermore, IBM, Oracle, and Teradata are incorporating data skipping features into their analytic platforms that deliberately filter out irrelevant data so it is not scanned. There are many ways to speed processing before resorting to the fast storage option.
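To show what data skipping buys, here’s a toy Python sketch – not any vendor’s implementation – where each block of rows carries min/max summaries, so a range scan can bypass blocks that cannot possibly match.

```python
# Each "block" carries min/max summary statistics for its column values.
blocks = [
    (list(range(1, 10)),      1,    9),
    (list(range(5000, 6000)), 5000, 5999),
    (list(range(40, 90)),     40,   89),
]

def scan(blocks, lo, hi):
    hits, read = [], 0
    for rows, bmin, bmax in blocks:
        if bmax < lo or bmin > hi:
            continue                 # skip: summary range can't match predicate
        read += 1
        hits.extend(r for r in rows if lo <= r <= hi)
    return hits, read

hits, read = scan(blocks, 50, 60)
print(len(hits), "rows found; blocks actually read:", read)  # 11 rows, 1 block
```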

Storage will become an application design option
Although we’re leery about hopping on the 100% DRAM in-memory bandwagon, smartly deployed, in-memory DRAM could truly transform applications. When you eliminate the latency, you can embed complex analytics in line with transactional applications, enable the running of more complex analytics, or make it feasible for users to run more what-if simulations to couch their decisions.

Examples include transaction applications that differentiate how to fulfill orders from gold-, silver-, or bronze-level customers based on levels of service and cost of fulfillment. In-memory could help mitigate risk in operational or fiduciary decisions by allowing more permutations of scenarios to be run. It could also enhance Big Data analytics by tiering the more frequently used data (and logic) in memory.

Whether to use DRAM or Flash will be a function of data volume and problem complexity. No longer will the choice of storage tiers be simply a hardware platform design decision; it will become a configuration decision for application designers as well.

Strata 2013 debrief: Enterprise-ready Hadoop Wars heat up

We’re in the thick of analyst conference season – Informatica last week, SAS tomorrow. So on this Sunday afternoon between gigs, we’re digesting what went down at Strata 2013 in Santa Clara last week. It was kind of a frustrating day in that we had limited time, were scheduled wall-to-wall with meetings, and missed what were likely some fascinating sessions. But we got a sense of the dominant themes: harden Hadoop for the enterprise, and take the SQL world to Hadoop.

The Hadoop vendor ecosystem is filling in – new players with their own distros, and new capabilities focused on making Hadoop more enterprise grade. The field is early enough that the approaches are still quite diverse – it’s time to invent, not consolidate. Let the games proceed.

EMC got the jump early in the week by announcing the grafting of its own Greenplum Advanced SQL analytic data store onto Hadoop – basically, the Greenplum MPP database squooched (we wanted an excuse to use a “word” like that) atop HDFS. Tastes like a SQL analytic database, scales like Hadoop. Cloudera Impala will soon go GA, branded as RTQ (Real-Time Query). Not to be outshone, Hortonworks, which works through the official Hadoop project itself, announced a couple of responding initiatives: the Tez runtime and the Stinger interactive query engine. You wouldn’t be seeing all these efforts to make Hadoop interactive if the demand weren’t out there; while Hadoop as a platform for extending the range of analytics has become very compelling to enterprises, they clearly expect that the platform must be SQL-interactive if it is to become a part of their analytic system portfolio.

While we’ve been expending electrons on the SQLization of Hadoop, the next stage of hardening is rapidly emerging: making Hadoop and Hadoop data more governable and secure. This involves capabilities such as data masking (where you permanently obliterate sensitive pieces of data), data encryption (where you can recover the original data), activity monitoring (who does what), data lineage (who and where this data came from, and who has done what to it), and of course, more fine-grained access control (preferably role-based) that picks up where Kerberos authentication leaves off. The pieces are just beginning to fall into place.
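The masking-versus-encryption distinction is easy to see in a toy Python sketch (using the third-party cryptography package; the card number is fake): masking permanently obliterates the sensitive digits, while encryption keeps the original recoverable by whoever holds the key.

```python
from cryptography.fernet import Fernet  # third-party: pip install cryptography

card = "4111-1111-1111-1111"            # fake value, for illustration only

# Masking: the sensitive digits are permanently obliterated -- no way back.
masked = "****-****-****-" + card[-4:]

# Encryption: the original is recoverable by whoever holds the key.
key = Fernet.generate_key()
token = Fernet(key).encrypt(card.encode())
restored = Fernet(key).decrypt(token).decode()

assert restored == card and masked != card
```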

Dataguise, a niche player in data obfuscation that relaunched itself in the Hadoop space last year, has had an encryption product out for roughly six months and has drawn several customers; they promote a self-learning feature that discovers sensitive data (e.g., credit card numbers), selectively encrypts it, and then acts only when data is changed. IBM already has capabilities in Optim that are typically used when pulling data from an external database; a user-defined function can mask it in Hadoop, or mask data as it is drawn from Hadoop. IBM offers data masking and activity monitoring, a capability that Cloudera just announced. Specifically, Cloudera’s new Navigator tool places agents (like everybody else, they characterize them as “lightweight”) on HDFS, Hive, and HBase, and you can configure them. For instance, the traffic on Hive is likely to be a fraction of that for HBase, which is more interactive, so you can configure monitoring of event changes to data accordingly. And then we came across Revelytix, which focuses on data lineage.

Then out of the blue, Intel swooped in with the announcement of its own Hadoop distribution – as if that was the last thing the world needed. But Intel has carved out some interesting angles: it is utilizing the native instruction set of the Xeon processor to move encryption and I/O optimization directly into the chip. Intel’s play addresses the issue that these processes are resource-heavy, a point where the sheer size of Hadoop data stores adds insult to injury. And that is not to mention that embedding encryption in hardware lessens the load on developers. Intel has drawn a number of partners including SAP, where integration with the HANA in-memory platform offers some interesting Fast Data possibilities. So far we’ve missed signals with Intel, but we will speak with them later next week to get a better idea of where they hope to take hardware optimization with Hadoop.

Loose ends: Time is running out on us, but coming out of this week, there are several issues running in the back of our minds:
• Hive – we thought this was a done deal. Hive is one of the earliest components of Hadoop; it was designed when MapReduce was the predominant processing pattern and the jobs that spawn the metadata were batch in nature. We were surprised that the debate over Hive’s use remains very, very live. The issue is how dynamic Hive can become – yes, it can support interactive queries, but is it based on metadata that is current? We sense that this will become another area for vendor differentiation.
• Apache Hadoop project – This could be spin, but there is sniping behind the scenes that the Hadoop project is no longer so broad-based when it comes to contributions. The flipside is that arguments over whether a particular vendor has enough (or any) committers ring a bit hollow. The operable question for enterprises is whether their distro of Hadoop is and will remain well-supported.
• Resource management – this one has multiple angles. Of course there is debate over YARN. It is supposed to be the über resource manager of Hadoop, so MapReduce jobs don’t collide with those of other frameworks that may have different (and conflicting) demands on processing and data access. There’s active debate over whether YARN has sufficiently weaned itself from its MapReduce batch lineage, or whether it should be a batch-oriented sub-manager in a scheme where there is yet another layer of control. The counterargument is that this may make life (or at least levels of control) far too complex. Expect vendor differentiation here.

Fast Data — the TV show

We’ve been talking about Fast Data over the past year, and so has Oracle. Last week we had the chance to make it a dialogue, as we were interviewed by Hasan Rizvi, who heads Oracle’s middleware business as Executive Vice President of Oracle Fusion Middleware and Java. The podcast, which will also include an appearance by Oracle customer Turkcell, will go live on February 27. You can sign up for it here.

What will Splunk be when it grows up?

Much of the hype around Big Data is that not only are people generating more data, but so are machines. Machine data has always been there – it was traditionally collected by dedicated systems such as network node managers, firewall systems, SCADA systems, and so on. But that’s where the data stayed.

Machine data is obviously pretty low-level stuff. Depending on the format of the data spewed forth by devices, it may be highly cryptic or may actually contain text that is human-intelligible. It was traditionally considered low-density data that was digested either by specific programs or applications, or by specific people – typically systems operators or security specialists.

Splunk’s reason for existence is putting this data onto a common data platform, then indexing it to make it searchable as a function of time. The operable notion is that the data can then be shared or correlated across applications, such as weblogs. Splunk’s roots are in the underside of IT infrastructure management systems, where it is often the embedded data engine. An increasingly popular use case is security, where Splunk can reach across network, server, storage, and web domains to provide a picture of exploits that could be end-to-end, at least within the data center.
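As a schematic of that core idea – not Splunk’s actual engine – here’s a minimal Python sketch that timestamps raw log lines on ingest and buckets them into a time-ordered index so searches can filter by keyword and time range; the log format is an assumption.

```python
from collections import defaultdict
from datetime import datetime

index = defaultdict(list)   # one-minute time bucket -> raw events

def ingest(line):
    # Assumes lines that lead with an ISO-8601 timestamp, e.g.
    # "2012-09-14T10:02:17 app=checkout status=500 ..."
    ts = datetime.fromisoformat(line.split()[0])
    index[ts.replace(second=0, microsecond=0)].append(line)

def search(term, start, end):
    # Walk buckets in time order; keep events in range containing the term.
    return [event
            for bucket in sorted(index)
            if start <= bucket <= end
            for event in index[bucket]
            if term in event]

ingest("2012-09-14T10:02:17 app=checkout status=500 cart=abandoned")
print(search("status=500",
             datetime(2012, 9, 14, 10, 0), datetime(2012, 9, 14, 11, 0)))
```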

There’s been a bit of hype around the company, which IPO’ed earlier this year and reported a strong Q2. Consumer technology still draws the headlines (just look at how much the release of the iPhone 5 drowned out almost all other tech news this week). But after Facebook’s market dive, maybe the turn of events on Wall Street could be characterized as revenge of the enterprise, given the market’s previous infatuation with the usual suspects in the consumer space – mobile devices, social networks, and gaming.

Splunk has a lot of headroom. With machine data proliferating and the company promoting its offering as an operational intelligence platform, Splunk is well-positioned as a company that leverages Fast Data. While Splunk is not split-second or deterministic real-time, its ability to build searchable indexes on the fly positions it nicely for tracking volatile environments as they change, as opposed to waiting until after the fact (although Splunk can be used for retrospective historical analysis, too).

But Splunk faces real growing pains, both up the value chain, and across it.

While Splunk’s heritage is in IT infrastructure data, the company bills itself as being about the broader category of machine data analytics. And there’s certainly lots of it around, given the explosion of sensory devices that are sending log files from all over the place, inside the four walls of a data center or enterprise, and out: the Internet of Things. IBM’s Smarter Planet campaign over the past few years has raised general awareness of how instrumented and increasingly intelligent Spaceship Earth is becoming. Maybe we’re jaded, but it’s become common knowledge that the world is full of sensory points, whether it is traffic sensors embedded in the pavement, weather stations, GPS units, smartphones, biomedical devices, industrial machinery, oil and gas recovery and refining, not to mention the electronic control modules sitting between driver and powertrain in your car.

And within the enterprise, there may be plenty of resistance to getting the bigger picture. For instance, while ITO owns infrastructure data, marketing probably owns the Omniture logs; yet absent the means to correlate the two, it may not be possible to get the answer on why the customer did or did not make the purchase online.

For a sub-$200-million firm, this is all a lot of ground to cover. Splunk knows the IT and security market, but lacks the breadth of an IBM to address all of the other segments across national intelligence, public infrastructure, smart utility grids, or healthcare verticals, to name a few. And it has no visibility above IT operations or appdev organizations. Splunk needs to pick its targets.

Splunk is trying to address scale – that’s where the Big Data angle comes in. Splunk is adding features to increase its scale, with the new 5.0 release adding federated indexing to boost performance over larger bodies of data. But for real scale, that’s where integration with Hadoop comes in, acting as a near-line archive for Splunk data that might otherwise be purged. Splunk offers two forms of connectivity: HadoopConnect, which provides a way to stream and transform Splunk data to populate HDFS; and Shuttl, a slower archival feature that treats Hadoop as a tape library (the data is heavily compressed with GZip). It’s definitely a first step – HadoopConnect, as the name implies, establishes connectivity, but the integration is hardly seamless or intuitive yet. It uses Splunk’s familiar fill-in-the-blank interface (we’d love to see something more point-and-click), with the data in Hadoop retrievable, but without Splunk’s familiar indexes (unless you re-import the data back to Splunk). On the horizon, we’d love to see Splunk tackle the far more challenging problem of getting its indexes to work natively inside Hadoop, maybe with HBase.

Then there’s the eternal question of making machine data meaningful to the business. Splunk’s search-based interface today is intuitive to developers and systems admins, as it requires knowledge of the types of data elements being stored. But it won’t work for anybody who doesn’t work with the guts of applications or computing infrastructure. Yet it becomes more critical to convey that message as Splunk is used to correlate log files with higher-level sources – such as correlating abandoned shopping carts with underlying server data to see if the missed sale was attributable to system bugs or to the buyer changing her mind.

The log files that record how different elements of IT infrastructure perform are, in aggregate, telling a story about how well your organization is serving the customer. Yet the perennial challenge of all systems-level management platforms has been conveying the business impact of the events that generated those log files. For those who don’t have to dye their hair gray, there are distant memories of providers like CA, IBM, and HP promoting how their panes of glass displaying data center performance could convey a business message. There’s been the challenge for ITIL adopters to codify the running of processes in the data center to support the business. The list of stillborn attempts to convey business meaning from the underlying operations is endless.

So maybe given the hype of the IPO, the relatively new management team that is in place, and the reality of Splunk’s heritage, it shouldn’t be surprising that we heard two different messages and tones.

From recently appointed product SVP Guido Schroeder, we heard talk of creating a semantic metadata layer that would, in effect, create de facto business objects. That shouldn’t be surprising, as in his previous incarnation he oversaw the integration of Business Objects into the SAP business. For anyone who has tracked the BI business over the years, the key to success has been creation of a metadata layer that not only codified the entities, but made it possible to attain reuse in ad hoc query and standard reporting. Schroeder and the current management team are clearly looking to take Splunk above IT operations to the CIO level.

But attend almost any session at the conference, and the enterprise message was largely missing. That shouldn’t be surprising, as the conference itself was aimed at the people who buy Splunk’s tools – and they tend to be down in the depths of operations.

There were a few exceptions. One of the sessions in the Big Data track, led by Stuart Hirst, CTO of Converging Data, an Australian big data consulting firm, communicated the importance of preserving the meaning of data as it moves through the lifecycle. In this case, it was a pitch counter to the conventional wisdom of Big Data, which is: ingest the data now, explore and classify it later. As Splunk data is ingested, it is time-stamped to provide a chronological record. Although this may be low-level data, as you bring more of it together, there should be a record of lineage, not to mention sensitivity (e.g., are customer-facing systems involved?).

From that standpoint, the notion of adding a semantic metadata layer atop its indexing sounds quite intuitive – assign higher-level meanings to buckets of log data that carry some business process meaning. For that, Splunk would have to rely on external sources – the applications and databases that run atop the infrastructure whose log files it tracks. That’s a tall order, and one that will require partners – not to mention the question of how to define which entities should be defined. Unfortunately, the track record for cross-enterprise repositories is not great; maybe there could be some leveraging of MDM implementations around customer or product that could provide a beginning frame of reference.

But we’re getting way, way ahead of ourselves here. Splunk is the story of an engineering-oriented company that is seeking to climb higher up the value chain in the enterprise. Yet, as it seeks to engage higher-level people within the customer organization, Splunk can’t afford to lose track of the base that has been responsible for its success. Splunk’s best route upward is likely through partnering with enterprise players like SAP. That doesn’t deal with the question of how to expand its footprint to follow the spread of what is called machine data, but then again, that’s a question for another day. First things first: Splunk needs to pick its target(s) carefully.

SAP and databases no longer an oxymoron

In its rise to leadership of the ERP market, SAP shrewdly placed bounds around its strategy: it would stick to its knitting on applications and rely on partnerships with systems integrators to get critical mass implementation across the Global 2000. When it came to architecture, SAP left no doubt of its ambitions to own the application tier, while leaving the data tier to the kindness of strangers (or in Oracle’s case, the estranged).

Times change in more ways than one – and one of those ways is in the data tier. The headlines of SAP acquiring Sybase (for its mobile assets, primarily) and subsequent emergence of HANA, its new in-memory data platform, placed SAP in the database market. And so it was that at an analyst meeting last December, SAP made the audacious declaration that it wanted to become the #2 database player by 2015.

Of course, none of this occurs in a vacuum. SAP’s declaration that it will become a front-line player in the database market threatens to destabilize existing relationships with Microsoft and IBM, as longtime SAP observer Dennis Howlett commented in a ZDNet post. OK, sure, SAP is sick of leaving money on the table to Oracle, and it’s throwing in roughly $500 million in sweeteners to get prospects to migrate. But if the database is the thing, then to meet its stretch goals, says Howlett, SAP and Sybase would have to grow that part of the business by a cool 6x – 7x.

But SAP would be treading down a ridiculous path if it were just trying to become a big player in the database market for the heck of it. Fortuitously, during SAP’s press conference announcing its new mobile and database strategies, chief architect Vishal Sikka tamped down the #2 aspirations: that’s really not the point – it’s the apps that count, and increasingly, it’s the database that makes the apps. Once again.

Back to our main point: IT innovation goes in waves. During the emergence of client/server, innovation focused on the database, where the need was mastering SQL and relational table structures; during the latter stages of client/server and the subsequent waves of Webs 1.0 and 2.0, activity shifted to the app tier, which grew more distributed. With the emergence of Big Data and Fast Data, energy has shifted back to the data tier, given the efficiencies of processing data, big or fast, inside the data store itself. Not surprisingly, when you hear SAP speak about HANA, they describe an ability to perform more complex analytic problems or compound operational transactions. It’s no coincidence that SAP now states that it’s in the database business.

So how will SAP execute its new database strategy? Given the hype over HANA, how does SAP convince Sybase ASE, IQ, and SQL Anywhere customers that they’re not headed down a dead end street?

That was the point of the SAP announcements, which in the press release stated the near-term roadmap but shed little light on how SAP would get there. Specifically, the announcements were:
• SAP BW on HANA is now going GA, and at the low (SMB) end comes out with aggressive pricing: roughly $3,000 for SAP Business One on HANA; $40,000 for HANA Edge.
• Ending a 15-year saga, SAP will finally port its ERP applications to Sybase ASE, with a tentative target date of year-end. HANA will play a supporting role as the real-time reporting adjunct platform for ASE customers.
• Sybase SQL Anywhere would be positioned as the mobile front end database atop HANA, supporting real-time mobile applications.
• Sybase’s event stream (CEP) offerings would have optional integration with HANA, providing convergence between CEP and BI, where rules are used to strip key event data for persistence in HANA. In so doing, analysis of event streams could be integrated or directly correlated with historical data (see the sketch after this list).
• Integrations are underway between Hadoop and both HANA and IQ.
• Sybase is extending its PowerDesigner data modeling tools to address each of its database engines.
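
To make the CEP-to-HANA rule concrete, here is a minimal, hypothetical sketch of the pattern described above: a rule inspects each incoming event and strips out only the key fields worth persisting for later correlation with historical data. All names are illustrative – a real deployment would use Sybase's CEP engine and a HANA client rather than plain Python.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Event:
    symbol: str
    price: float
    volume: int

# Hypothetical rule: only "significant" events are worth persisting.
def is_significant(evt: Event) -> bool:
    return evt.volume > 10_000  # e.g., unusually large trades

def strip_for_persistence(evt: Event) -> dict:
    # Keep only the key fields to be stored for historical correlation.
    return {"symbol": evt.symbol, "price": evt.price, "volume": evt.volume}

def process_stream(events, persist: Callable[[dict], None]) -> None:
    for evt in events:
        if is_significant(evt):
            persist(strip_for_persistence(evt))  # e.g., an INSERT into HANA

# Usage: persist could wrap a database cursor; here we just collect rows.
stored = []
process_stream([Event("XYZ", 101.5, 50_000), Event("XYZ", 101.6, 100)],
               stored.append)
print(stored)  # only the large trade is kept
```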

Most of the announcements, like HANA going GA or Sybase ASE supporting SAP Business Suite, were hardly surprises. Aside from go-to-market issues, which are many and significant, we’ll direct our focus to the technology roadmaps.

We’ve maintained that if SAP were serious about its database goals, it had to do three basic things:
1. Unify its database organization. The good news is that it has started down that path as of January 1 of this year. Of course, org charts are only the first step as ultimately it comes down to people.
2. Branding. Although long eclipsed in the database market, Sybase still has an identifiable brand and would be the logical choice; for now SAP has punted.
3. Cross-fertilize technology. Here, SAP can learn lessons from IBM which, despite (or because of) acquiring multiple products that fall under different brands, freely blends technologies. For instance, Cognos BI reporting capabilities are embedded into Rational and Tivoli reporting tools.

The third part is the heavy lift. For instance, given that data platforms are increasingly employing advanced caching, it would at first glance seem logical to blend some of HANA’s in-memory capabilities into the ASE platform; architecturally, however, that would be extremely difficult, as one of HANA’s strengths – dynamic indexing – would be hard to implement in ASE.

On the other hand, given that HANA can index or restructure data on the fly (e.g., organize data into columnar structures on demand), the question is: does that make IQ obsolete? The short answer is that while memory keeps getting cheaper, it will never be as cheap as disk; IQ could therefore evolve into near-line storage for HANA. Of course, that raises the question of whether Hadoop could eventually perform the same function. SAP maintains that Hadoop is too slow and should therefore be reserved for offline cases; that’s certainly true today, but given developments with HBase, it could easily become fast and cheap enough for SAP to revisit the IQ question a year or two down the road.
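
To see why organizing data into columnar structures pays off for analytics, consider this toy illustration (plain Python, our sketch – not HANA's internals): a scan or aggregate over one column touches only that column's values, whereas a row layout drags every field along.

```python
# Row-oriented layout: each record is stored together, so an analytic
# scan of one field still reads every field of every record.
rows = [
    {"id": 1, "region": "NE", "sales": 250.0},
    {"id": 2, "region": "SW", "sales": 410.0},
]
total_from_rows = sum(r["sales"] for r in rows)

# Column-oriented layout: each field is stored contiguously, so a scan
# reads only the column it needs (and compresses far better, too).
columns = {
    "id": [1, 2],
    "region": ["NE", "SW"],
    "sales": [250.0, 410.0],
}
total_from_columns = sum(columns["sales"])  # touches one array only

print(total_from_rows, total_from_columns)  # 660.0 660.0
```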

Not that SAP Sybase is sitting still with Hadoop integration. It is adding MapReduce and R capabilities to IQ (SAP Sybase is hardly alone here, as most Advanced SQL platforms offer similar support). SAP Sybase is also providing capabilities to map IQ tables into Hadoop Hive, slotting IQ in as an alternative to HBase; in effect, that’s akin to a number of strategies for putting SQL layers inside Hadoop (in a way, similar to what the lesser-known Hadapt is doing). And of course, like most of the relational players, SAP Sybase also supports bulk ETL/ELT loads from HDFS to HANA or IQ.
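
For a rough sense of what such a bulk load involves, here is a bare-bones sketch that streams a file out of HDFS with the standard hdfs dfs -cat command and inserts the rows through a generic Python DB-API connection. The path, table name, and connection are hypothetical placeholders, not SAP's actual tooling.

```python
import csv
import io
import subprocess

def load_hdfs_csv(hdfs_path: str, conn, table: str) -> None:
    """Bulk-load a CSV file from HDFS into a relational table (sketch only)."""
    # Stream the file out of HDFS via the standard hdfs CLI.
    raw = subprocess.run(
        ["hdfs", "dfs", "-cat", hdfs_path],
        check=True, capture_output=True, text=True,
    ).stdout
    rows = list(csv.reader(io.StringIO(raw)))
    if not rows:
        return
    cur = conn.cursor()
    # Placeholder style ("?") varies by DB-API driver; adjust as needed.
    placeholders = ",".join(["?"] * len(rows[0]))
    cur.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    conn.commit()
```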

On SAP’s side for now is the paucity of Hadoop talent, so pitching IQ as an alternative to HBase may help soften the blow for organizations seeking to get a handle on Big Data. But in the long run, we believe that SAP Sybase will have to revisit this strategy. Because, if it’s serious about the database market, it will have to amplify its focus to add value atop the new realities on the ground.

Fast Data hits the Big Data Fast Lane

Of the 3 “V’s” of Big Data – volume, variety, velocity (we’d add “Value” as the 4th V) – velocity has been the unsung ‘V.’ With the spotlight on Hadoop, the popular image of Big Data is large petabyte stores of unstructured data (the first two V’s). While Big Data has been thought of as large stores of data at rest, it can also be about data in motion.

“Fast Data” refers to processes that require lower latencies than would otherwise be possible with optimized disk-based storage. Fast Data is not a single technology, but a spectrum of approaches that process data that might or might not be stored. It could encompass event processing, in-memory databases, or hybrid data stores that optimize cache with disk.

Fast Data is nothing new, but because of the cost of memory, it was traditionally restricted to a handful of extremely high-value use cases. For instance:
• Wall Street firms routinely analyze live market feeds, and in many cases run sophisticated complex event processing (CEP) programs on event streams (often in real time) to make operational decisions (see the sketch after this list).
• Telcos have handled such data in optimizing network operations, while leading logistics firms have used CEP to optimize their transport networks.
• In-memory databases, used as a faster alternative to disk, have similarly been around for well over a decade, having been employed for program stock trading, telecommunications equipment, airline scheduling, and large online retail destinations (e.g., Amazon).
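
As a flavor of what stream-oriented processing looks like, here is a toy sliding-window operator over a price feed, with a made-up deviation rule standing in for a real CEP engine's logic:

```python
from collections import deque

class MovingAverage:
    """Sliding-window average over a live feed (a toy CEP-style operator)."""
    def __init__(self, window: int):
        self.window = deque(maxlen=window)

    def update(self, price: float) -> float:
        self.window.append(price)
        return sum(self.window) / len(self.window)

ma = MovingAverage(window=3)
for tick in [100.0, 101.0, 103.0, 99.0]:
    avg = ma.update(tick)
    if abs(tick - avg) > 2.0:  # hypothetical trading rule
        print(f"signal: tick {tick} deviates from moving average {avg:.2f}")
```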

Hybrid in-memory and disk systems have also become commonplace, especially amongst data warehousing systems (e.g., Teradata, Kognitio), and more recently among the emergent class of advanced SQL analytic platforms (e.g., Greenplum, Teradata Aster, IBM Netezza, HP Vertica, ParAccel) that employ smart caching in conjunction with a number of other bells and whistles to juice SQL performance and scaling (e.g., flatter indexes, extensive use of various data compression schemes, columnar table structures, etc.). Many of these systems are in turn packaged as appliances that come with specially tuned, high-performance backplanes and direct-attached disk.

Finally, caching is hardly unknown to the database world. Hot spots of frequently accessed data are often placed in cache, as are snapshots of database configurations that are stored to support restore processes, and so on.
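
The textbook mechanism for keeping hot spots in memory is a least-recently-used (LRU) cache. A bare-bones sketch (our illustration, not any vendor's implementation):

```python
from collections import OrderedDict

class LRUCache:
    """Keep the hottest items in memory; evict the least recently used."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None             # cache miss: caller fetches from disk
        self.data.move_to_end(key)  # mark as recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the coldest entry
```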

So what’s changed?
The usual factors: the same data explosion that created the urgency for Big Data is also generating demand for making the data instantly actionable. Bandwidth, commodity hardware, and of course declining memory prices are further forcing the issue: Fast Data is no longer limited to specialized, premium use cases for enterprises with infinite budgets.

Not surprisingly, pure in-memory databases are now going mainstream: Oracle and SAP are choosing in-memory as one of the next places to establish competitive stakes: SAP HANA vs. Oracle Exalytics. For now, both are targeting analytic processing, including OLAP (raising the size limits on OLAP cubes) and more complex, multi-stage analytic problems that traditionally would have required batch runs (such as multivariate pricing) or would not have been run at all (too complex, too much delay). More to the point, SAP is counting on HANA as a major pillar of its stretch goal to become the #2 database player by 2015, which means expanding HANA’s target to include next-generation enterprise transactional applications with embedded analytics.

Potential use cases for Fast Data could encompass:
• A homeland security agency monitoring the borders requires the ability to parse, decipher, and act on complex occurrences in real time to prevent suspicious people from entering the country
• Capital markets trading firms require real-time analytics and sophisticated event processing to conduct algorithmic or high-frequency trades
• Entities managing smart infrastructure must digest torrents of sensory data to make real-time decisions that optimize use of transportation or public utility infrastructure
• B2B consumer products firms monitoring social networks may require real-time response to understand sudden swings in customer sentiment

For such organizations, Fast Data is no longer a luxury, but a necessity.

More specialized use cases are similarly emerging now that the core in-memory technology is becoming more affordable. YarcData, a startup from venerable HPC player Cray Computer, is targeting graph data, which represents data with many-to-many relationships. Graph computing is extremely process-intensive and, as such, has traditionally been run in batch when Internet-size data sets are involved. YarcData adopts a classic hybrid approach that pipelines computations in memory but persists data to disk. YarcData is the tip of the iceberg – we expect to see more specialized applications that utilize hybrid caching to combine speed with scale.
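
To see why graph workloads reward memory, consider a breadth-first traversal over adjacency lists – the many-to-many structure in question. Each hop is a random access, cheap in DRAM but ruinous on disk. A toy sketch:

```python
from collections import deque

# A graph as adjacency lists: each node can relate to many others.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["carol", "dave"],
    "carol": ["dave"],
    "dave": [],
}

def reachable(graph: dict, start: str) -> set:
    """Breadth-first traversal: pointer-chasing access patterns like this
    thrash disk but thrive when the whole graph sits in memory."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return seen

print(reachable(graph, "alice"))  # {'alice', 'bob', 'carol', 'dave'}
```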

But don’t forget, memory’s not the new disk
The movement – or tiering – of data to faster or slower media is also nothing new. What is new is that data in memory may no longer be such a transient thing; if memory is relied upon for in situ processing of data in motion or rapid processing of data at rest, memory cannot simply be treated as the new disk. Excluding specialized forms of memory such as ROM, the DRAM that in-memory systems run on is by nature volatile: there goes your power… and there goes your data. Not surprisingly, in-memory systems such as HANA still replicate to disk to reduce volatility. For conventional disk data stores that increasingly leverage memory, Storage Switzerland’s George Crump makes the case that caching practices must become smarter to avoid misses (where data gets mistakenly swapped out). There are also balance-of-system considerations: memory may be fast, but is its speed well matched with the processor’s? Solid state may overcome the I/O issues associated with disk, but it can still be vulnerable to coupling issues if processors get bottlenecked or MapReduce jobs are not optimized.
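
The standard answer to volatility is the pattern HANA-style systems rely on: persist each write to a disk log before acknowledging it, then replay the log on restart. A bare-bones sketch of the idea (our illustration, not HANA's actual mechanism):

```python
import json
import os

class DurableKV:
    """In-memory store that appends every write to a disk log, so a
    power loss does not lose data that was acknowledged."""
    def __init__(self, log_path: str):
        self.log_path = log_path
        self.data = {}
        if os.path.exists(log_path):  # replay the log on restart
            with open(log_path) as f:
                for line in f:
                    rec = json.loads(line)
                    self.data[rec["k"]] = rec["v"]

    def put(self, key, value):
        with open(self.log_path, "a") as f:  # persist first...
            f.write(json.dumps({"k": key, "v": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.data[key] = value               # ...then update memory
```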

Declining memory prices are putting Fast Data on the fast lane to mainstream. But while the technology is now becoming affordable, we’re still early on the learning curve of how to design for it.