Nature abhors a vacuum, and enterprises abhor platforms lacking tooling. Few enterprises have the developer resources or technology savvy of early adopters. For Hadoop, early adopters invented the technology; mainstream enterprises want to consume it.
On our just-concluded tour of Ovum enterprise clients across Australia/Pacific Rim, we found that the few who have progressed beyond the discussion stage with Hadoop are doing so with technology staff accustomed to being on their own, building their own R programs and experimenting with embryonic frameworks like Mesos and YARN. Others are either awaiting more commercial tooling or still sorting out perennial data silos.
But Hadoop is steadily turning into a more “normal” software market. And with it, the vendor ecosystem vacuum is starting to fill in. It’s very much in line with what happened with BI and data warehousing back in the mid-1990s, when tools civilized what was a new architecture for managing data that originally required manual scripting.
So let’s take a brief tour.
Look at the exhibitor list for last month’s Strata HadoopWorld conference; as the largest such Big Data event in North America, it provides a good sampling of the ecosystem. Of nearly a hundred sponsors, roughly a third were tools encompassing BI and analytics, data federation and integration, data protection, and middleware.
There was a mix of the usual suspects who regard Hadoop as their newest target. SAS takes an agnostic approach, bundling a distro of Hadoop in its LASR in-memory appliance; but SAS analytics can also execute inside Hadoop clusters, converting its HPC routines to MapReduce. MicroStrategy and other BI players are connecting to Hadoop in a variety of ways; they either provide the suboptimal experience of having your SQL query execute in batch on Hadoop (which few use), or work through the data warehouse or the Hadoop platform’s path for interactive SQL.
But there are also new players that are taking BI beyond SQL. Datameer and Platfora each provide their own operators (e.g., clustering, time series, decision trees, or other forms of analysis that would be laborious with SQL), presenting data either through spreadsheets or visualizations. ClearStory Data, which emerged from stealth at the show, provides a way to semantically crawl your own data and mash it with external data from publicly-available APIs. Players like Pivotal, Hadapt, SpliceMachine, and CitusData are implementing or co-locating SQL data stores inside HDFS or HBase.
Significantly, some are starting to package forms of data science as well, with almost a half dozen machine learning programs. That’s a necessary development, because there are only so many Hilary Masons to go around. People with a natural feel for data – who understand its significance, how to analyze it, and most importantly, its relevance – will remain few and far between. To use these tools, you’ll still need to know which algorithms to use, but at least you don’t have to build them from scratch. For instance, 0xdata packages machine learning algorithms and combines them with a caching engine for high-performance analytics on Hadoop. Skytree packages classification, clustering, regression analyses, and most importantly, dimension reduction, so you can see something meaningful after combing a billion nodes (points) and edges (relationships and context).
Security, a perennial weakness of Hadoop, is another area where you’re seeing vendor activity. Originally designed for trusted environments, Hadoop has long had the remote authentication piece down (Kerberos), because early adopters needed to gain access to remote clusters, and now there are incubating open source projects tackling the other two A’s of AAA – a gateway for access control (Knox) and a mechanism for role-based authorization (Sentry). Yes, there is also a specialized project for “cell” (data entity) level protection created for the NSA (Accumulo), which is being led by Sqrrl. But otherwise, we expect that vendor-based proprietary tools are going to be where most of the action is. Policy-based data protection, whether encryption or data masking, is now emerging via players like Zettaset and Gazzang, with incumbents such as Protegrity and IBM extending support beyond SQL. Data lineage and activity monitoring (the first steps that could eventually lead to full-blown audit and selective read/write access) are emerging from IBM, Cloudera, and Revelytix.
We’ve long believed that for Big Data – and Hadoop – to gain traction with enterprises, it must become a first-class citizen. Among other things, that means Hadoop must integrate with the data center and, inevitably, the apps that run against it. Incumbent data integration players like Informatica, Talend, Syncsort, and Pentaho view Hadoop as yet another target. Having originally touched Hadoop at arm’s length via the traditional ETL staging server topology, they have since enabled their transformation tools to work natively inside Hadoop – a natural move, since Hadoop promises cheaper compute cycles for the task. Emerging players are adding new integration capabilities – Cirro for data federation; JethroData for adding indexing to Hadoop; Kapow and Continuuity, which provide middleware for applications to integrate with Hadoop; and Appfluent, which is extending its data lifecycle management tool to support active archiving on Hadoop.
The subtext of the explosion of the ecosystem is Hadoop’s evolution into a more varied platform; to play anything more than a niche role in the enterprise (and draw a tooling and applications ecosystem), Hadoop must provide other processing options besides MapReduce.
Not surprisingly, interactive SQL on Hadoop became a prime battleground for vendors to differentiate. Cloudera introduced Impala, an MPP-based alternative to MapReduce that uses Hive metadata but bypasses the bottleneck of Hive processing (which had traditionally relied on MapReduce). Meanwhile, Hortonworks has led projects to make Hive better (read: faster), complementing it with a faster alternative to MapReduce. As noted above, several players are implementing SQL data stores directly inside Hadoop, while IBM has modified SQL to run against Hive.
The YARN (a.k.a. MapReduce 2.0) framework provides resource allocation (not full-blown resource management, however) that will allow multiple (read: MapReduce and alternative) workloads to run on Hadoop clusters. Hortonworks, which led development, announced a circle of partners who are supporting the new framework. Its rival, Cloudera, is taking a more measured approach; MapReduce and Impala workloads will be allocated under the YARN umbrella, but streaming and search won’t. Because YARN was carved out of the original resource manager for pre-2.0 MapReduce, Cloudera doesn’t believe the new framework is suited for handling continuous workloads that don’t have clear starts or stops.
So, going forward, we’re seeing Hadoop emerge with an increasingly well-rounded third party ecosystem where little existed before. We expect that in the coming year, this will spread beyond tools to applications as well; we’ll see more of what the likes of Causata are doing.
So what role will Hadoop play?
For now, Hadoop remains a work in progress – data integration and lifecycle management, security, performance management, and governance practices and technologies are at early stages of evolution. At Strata, Facebook’s Ken Rudin made an eloquent plea for coexistence; Facebook tacked against the wind by starting with Hadoop and learning that it was best for exploratory analytics, while relational was best suited for queries with standard metrics (he’s pitched the same message to the data warehousing audience as well).
Cloudera’s Mike Olson, who had the podium right before Rudin, announced Cloudera’s vision of Hadoop as enterprise data hub: Hadoop is not just the logical landing spot for data, but also the place where you can run multiple workloads. Andrew Brust likens Cloudera’s positioning to making Hadoop “the Ellis Island of data.”
So is Olson agreeing or arguing with Rudin?
The context is that analytic (and some transactional) data platforms are taking on multiple personalities (e.g., SQL row stores adding column engines, file/HDFS data stores, JSON stores – in some cases side by side, in others in hybrid form). All analytic data platforms are grabbing for multiple data types and workloads. They are also vying to become the logical spot where analytics are choreographed – mixing and matching data sets on different platforms for running analytic problems.
Cloudera aims to compete, not just as another Hadoop platform, but as the default platform where analytic data lives. It doesn’t necessarily replace SQL enterprise data warehouses, but assumes more workloads requiring scale, inexpensive compute cycles, and the ability to run multiple types of workloads – not just MapReduce. SQL data warehouses aren’t standing still either, and in many cases are embracing Hadoop. Hadoop has the edge on cost of compute cycles, but pieces must fall into place to gain parity regarding service level management and performance, security, availability and reliability, and information lifecycle management. Looking ahead, we expect analytics to run on multiple platforms, with the center of gravity up for grabs.
Big Data is getting bigger, and Fast Data is getting faster because of the continuing declining cost of all things infrastructure. Ongoing commoditization of powerful, multi-core CPU, storage media, and connectivity made scale-out Internet data centers possible, and with them, scale-out data platforms such as Hadoop and the new generation of Advanced SQL/NewSQL analytic data stores. Bandwidth is similarly going crazy; while the lack of 4G may make bandwidth seem elusive to mobile users, growth of bandwidth for connecting devices and things has become another fact taken for granted.
Conventional wisdom is that similar trends are impacting storage, and until recently, that was the Kool-Aid that we swallowed. For sure, the macro picture is that declining price and ascending density curves are changing the conversation where it comes to deploying data. The type of media on which you store data is no longer just a price/performance tradeoff, but increasingly an architectural consideration on how data is processed and applications that run on data are engineered. Bigger, cheaper storage makes bigger analytics possible; faster, cheaper storage makes more complex and functional applications possible.
At 100,000 feet, such trends for storage are holding, but dig beneath the surface and the picture gets more nuanced. And those nuances are increasingly driving how we design our data-driven transaction applications and analytics.
Cut through the terminology
But before we dive into the trends, let’s get our terminology straight, because the term memory is used much too loosely (does it mean DRAM or Flash?). For this discussion, we’ll stick with the following conventions:
• CPU cache is the memory on chip that is used for temporarily holding data being processed by the processor.
• DRAM memory is the fastest storage layer that sits outside the chip, and is typically parceled out in GBytes per compute core.
• Solid State Drives (SSDs), based on Flash memory, are the silicon-based, faster substitute for traditional hard drives; they are typically sized at hundreds of GBytes (with some units just under a terabyte), but they are not as fast as DRAM.
• Hard disk, or “disk,” is the workhorse that now scales economically up to 1 – 3 TBytes per spindle.
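To make those tiers concrete, here is a rough back-of-the-envelope comparison in Python. The latency and cost-per-GB figures are illustrative assumptions only – order-of-magnitude placeholders rather than quoted prices – but they capture the shape of the tradeoff: each step down the hierarchy is dramatically cheaper and dramatically slower.

```python
# Illustrative, order-of-magnitude figures only; actual prices and latencies
# vary widely by vendor, generation, and workload.
TIERS = {
    #              latency (sec)   approx. cost ($/GB)   typical capacity
    "DRAM":        (100e-9,        5.00,                 "GBytes per compute core"),
    "Flash SSD":   (100e-6,        0.50,                 "hundreds of GBytes"),
    "Hard disk":   (10e-3,         0.05,                 "1 - 3 TBytes per spindle"),
}

def compare(reference="Hard disk"):
    ref_latency, ref_cost, _ = TIERS[reference]
    for name, (latency, cost, capacity) in TIERS.items():
        print(f"{name:10s}: ~{ref_latency / latency:,.0f}x the speed of {reference}, "
              f"~{cost / ref_cost:,.0f}x its cost per GB, typically {capacity}")

compare()
```

Swap in current street prices and the ratios shift, but the ordering does not: the faster the medium, the more of a premium it commands per gigabyte.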
So what’s best for which?
For hard drives, conventional wisdom has been that they keep getting faster and cheaper. Turns out, only the latter is true. The cheapness of 1- and 3-TByte drives has made scale-out Internet data centers possible, and with it, scale-out Big Data analytic platforms like Hadoop. Hard disk continues to be the medium of choice for large volumes of data because individual drives routinely scale to 1 – 3 TBytes. And momentary supply chain disruptions like the 2011 Thailand floods aside, the supply remains more than adequate. Flash drives simply don’t get as fat.
But if anything, hard drives are getting slower because it’s no longer worthwhile to try speeding them up. With Flash being at least 10 – 100x faster, there’s no way that disk will easily catch up even if the technology gets refreshed. Flash is actually pulling the rug out from under demand for 7200-RPM disks (currently the state of the art for disk). Not surprisingly, disk technology development has hit the wall.
Given current price trends, where some analysts expect Flash to reach parity with disk in the next 12 – 18 months (or maybe sooner), there will be less reason for your next transaction system to be disk-based. In fact, there is good reason to be a bit skeptical about how soon the supply of SSD Flash will ramp up adequately for the transaction system market; but SSD Flash will gradually make its way to prime time. Conversely, with disk likely to remain fatter in capacity than Flash, it will be best suited for active archiving that keeps live the older data otherwise bound for tape, and for Big Data analytics, where the need is for volume. Nonetheless, the workhorse of large Hadoop and similar disk-based Big Data analytic or active archive clusters will likely be the slower 5400-RPM models.
So what about even faster modes of storage? In the past couple of years, DRAM prices crossed the threshold where it became feasible to deploy DRAM for persistent storage rather than just caching of currently used data. That cleared the way for the in-memory database (IMDB), which is often code for all-DRAM storage.
In-memory databases are hardly new, but until the last 3 – 4 years they were highly specialized. Oracle TimesTen, one of the earliest commercial offerings, was designed for tightly-coupled, specialized transactional applications; other purpose-built in-memory data stores have existed for capital markets for at least a decade. But DRAM prices dropped enough to bring in-memory into the enterprise mainstream. Kognitio opened the floodgates when it reincarnated its MOLAP cube and row store analytic platform as an in-memory platform on industry-standard hardware just over 5 years ago; SAP put in-memory in the spotlight with HANA for analytics and transactional applications, followed by Oracle, which reincarnated TimesTen as Exalytics for running Oracle Business Intelligence Enterprise Edition (OBIEE) and Essbase.
Yet, an interesting blip happened on the way to the “inevitable” all in-memory database future: Last spring, DRAM memory prices stopped dropping. In part this was attributable to consolidation of the industry to fewer suppliers. But the larger driver was that the wisdom of crowds – e.g., that DRAM memory was now ready for prime time – got ahead of itself. Yes, the laws of supply and demand will eventually shift the trajectory of memory pricing. But nope, that won’t change the fact of life that, no matter how cheap, DRAM memory (and cache) will always be premium storage.
In-memory databases are dead, long live tiered databases
The sky is not the limit for DRAM in-memory databases. The rush to in-memory will morph into an expansion of data tiering. And actually that’s not such a bad thing: do you really need to put all of that data in memory? We think not.
IBM and Teradata have shunned all-in-memory architectures; their contention is that the 80/20 rule should govern which data goes into memory. And under their breaths, the all-in-memory database folks have fallbacks for paging data between disk and memory. If designed properly, this is not constant paging, but rather a process that occurs only for that rare out-of-range query. Kognitio has a clever pricing model where they don’t charge you for the disk, just for the volume of memory. As for HANA, disk is designed into the system for permanent offline storage, but SAP quietly adds that it can also be utilized for paging data during routine operation. Maybe SAP shouldn’t be so quiet about that.
There’s one additional form of tiering to consider for highly complex analytics: it’s the boost that can come from pipelining computations inside chip cache. Oracle is looking to similar techniques for further optimizing upcoming generations of its Exadata database appliance platform. It’s a technique that’s part of IBM’s recent BLU architecture for DB2. High-performance analytic platforms such as SiSense also incorporate in-chip pipelining to actually reduce balance of system costs (e.g., require less DRAM).
It’s all about balance of system
Balance of system is hardly new, but until recently, it meant trading off CPU and bandwidth against tiers of disk. Application and database design in turn focused on distributing or sharding data to place the most frequently accessed data on the disk, or portions of disk, that could be accessed the fastest. New forms of storage, including Flash and DRAM, add a few new elements to the mix. You’ll still configure storage (along with processor and interconnects) for the application and vice versa, but you’ll have a couple of new toys in your arsenal.
For Flash, it means fast OLTP applications that could add basic analytics, such as what Oracle’s recent wave of In-Memory Applications promise. For in-memory, that would dictate OLTP applications with even more complex analytics and/or what-if simulations embedded in line, such as what SAP is promising with its recently-introduced Business Suite and CRM applications on HANA.
For in-memory, we’d contend that in most cases, configurations for keeping 100% of data in DRAM will remain overkill. Unless you are running a Big Data analytic problem that is supposed to encompass all of the data, you will likely work with just a fraction of it. Furthermore, IBM, Oracle, and Teradata are incorporating data-skipping features into their analytic platforms that deliberately filter out irrelevant data so it is not scanned. There are many ways to speed processing before resorting to the fastest storage option.
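To illustrate what a data-skipping feature does under the hood, here is a minimal sketch of the zone-map style approach (block-level min/max statistics). The vendors above each have their own proprietary variants, so treat this as a generic illustration rather than any one platform’s implementation.

```python
# Minimal sketch of zone-map style data skipping: keep min/max statistics per
# block so a query can skip blocks that cannot possibly match its predicate.
def build_zone_map(blocks):
    """blocks: list of lists of numeric values (e.g., a date or price column)."""
    return [(min(b), max(b)) for b in blocks]

def scan(blocks, zone_map, low, high):
    matches, blocks_scanned = [], 0
    for block, (lo, hi) in zip(blocks, zone_map):
        if hi < low or lo > high:
            continue                      # skip: no row in this block can qualify
        blocks_scanned += 1
        matches.extend(v for v in block if low <= v <= high)
    return matches, blocks_scanned

blocks = [[1, 5, 9], [10, 14, 19], [20, 25, 29], [30, 35, 39]]
zmap = build_zone_map(blocks)
rows, read = scan(blocks, zmap, 12, 18)
print(rows, f"({read} of {len(blocks)} blocks actually scanned)")
```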
Storage will become an application design option
Although we’re leery about hopping on the 100% DRAM in-memory bandwagon, smartly deployed, in-memory DRAM could truly transform applications. When you eliminate the latency, you can embed complex analytics in line with transactional applications, run more complex analytics, or make it feasible for users to run more what-if simulations to couch their decisions.
Examples include transaction applications that differentiate how to fulfill orders from gold, silver, or bronze-level customers based on levels of services and cost of fulfillment. It could help mitigate risk when making operational or fiduciary decisions by allowing the running of more permutations of scenarios. It could also enhance Big Data analytics by tiering the more frequently used data (and logic) in memory.
Whether to use DRAM or Flash will be a function of data volume and problem complexity. No longer will the inclusion of storage tiers be simply a hardware platform design decision; it will become a configuration decision for application designers as well.
We’re in the thick of analyst conference season – Informatica last week, SAS tomorrow. So on this Sunday afternoon between gigs, we’re digesting what went down at Strata 2013 in Santa Clara last week. It was kind of a frustrating day in that we had limited time, were scheduled wall to wall with meetings, and missed what were likely some fascinating sessions. But we got a sense of some dominant themes: Harden Hadoop for the enterprise, and take the SQL world to Hadoop.
The Hadoop vendor ecosystem is filling in – new players with their own distros, and new capabilities focused on making Hadoop more enterprise grade. The field is early enough that the approaches are still quite diverse – it’s time to invent, not consolidate. Let the games proceed.
EMC got the jump early in the week by announcing the grafting of its own Greenplum Advanced SQL analytic data store onto Hadoop – basically, the Greenplum MPP database squooched (wanted an excuse to use a “word” like that) atop HDFS. Tastes like a SQL analytic database, scales like Hadoop. Cloudera Impala will soon go GA, branded as RTQ (Real-Time Query). Not to be outshone, Hortonworks, which works through the official Hadoop project itself, announced a couple of responding initiatives: the Tez runtime and the Stinger interactive query engine. You wouldn’t be seeing all these efforts to make Hadoop interactive if the demand weren’t out there; while Hadoop as a platform for extending the range of analytics has become very compelling to enterprises, they clearly expect that the platform must be SQL-interactive if it is to become part of their analytic system portfolio.
While we’ve been expending electrons on the SQLization of Hadoop, the next stage of hardening is rapidly emerging. Specifically, make Hadoop and Hadoop data more governable and secure. This involves capabilities such as data masking (where you permanently obliterate sensitive pieces of data), data encryption (where you can recover the original data), activity monitoring (who does what), data lineage (where the data came from, and who has done what to it), and of course, more fine-grained access control (preferably role-based) that picks up where Kerberos authentication leaves off. The pieces are just beginning to fall into place.
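A toy sketch of the masking-versus-encryption distinction, in Python: masking (or one-way tokenization) destroys the original value, while encryption keeps it recoverable by whoever holds the key. The Fernet recipe from the third-party cryptography package merely stands in for whatever scheme a given vendor actually uses, and the card number is made up.

```python
# Illustrative only: masking is irreversible, encryption is reversible with a key.
# Fernet (from the third-party "cryptography" package) is a stand-in for whatever
# symmetric scheme a real product uses; the card number below is fake.
import hashlib
from cryptography.fernet import Fernet

card_number = "4111-1111-1111-1111"

# Masking: permanently obliterate the sensitive value (keep last 4 digits for utility).
masked = "****-****-****-" + card_number[-4:]
# Or tokenize with a one-way hash so equal values can still be joined/correlated.
token = hashlib.sha256(card_number.encode()).hexdigest()[:16]

# Encryption: the original value is recoverable by anyone holding the key.
key = Fernet.generate_key()
ciphertext = Fernet(key).encrypt(card_number.encode())
recovered = Fernet(key).decrypt(ciphertext).decode()

print(masked, token, recovered == card_number)
```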
Dataguise, a niche player in data obfuscation that relaunched itself in the Hadoop space last year, has had an encryption product out for roughly six months and has drawn several customers; it promotes a self-learning feature that discovers sensitive data (e.g., credit card numbers), selectively encrypts it, and then acts only when data is changed. IBM already has capabilities in Optim that are typically used when pulling data from an external database; a user-defined function can mask it in Hadoop, or mask data as it is drawn from Hadoop. IBM offers data masking and activity monitoring, a capability that Cloudera just announced. Specifically, Cloudera’s new Navigator tool places agents (like everybody else, they characterize them as “lightweight”) on HDFS, Hive, and HBase, and you can configure them. For instance, the traffic on Hive is likely to be a fraction of that for HBase, which is more interactive, so you can configure monitoring of event changes to data accordingly. And then we came across Revelytix, which focuses on data lineage.
Then out of the blue, Intel swooped in with the announcement of its own Hadoop distribution. As if that were the last thing the world needed. But Intel has carved out some interesting angles: it is utilizing the native instruction set of the Xeon processor to move encryption and I/O optimization directly into the chip. Intel’s play addresses the issue that these processes are resource-heavy, a point where the sheer size of Hadoop data stores adds insult to injury. And that is not to mention that embedding encryption in hardware lessens the load on developers. Intel has drawn a number of partners including SAP, where integration with the HANA in-memory platform offers some interesting Fast Data possibilities. So far we’ve missed signals with Intel, but will speak with them later next week to get a better idea of where they hope to take hardware optimization with Hadoop.
Loose ends: Time is running out on us, but coming out of this week, there are several issues that are running in the back of our mind:
• Hive – we thought this was a done deal. Hive is one of the earliest components of Hadoop, designed when MapReduce was the predominant processing pattern and the jobs that spawned the metadata were batch in nature. We were surprised that the debate over Hive’s use remains very, very live. The issue is over how dynamic Hive can become – yes, it can support interactive queries, but is it based on metadata that is current? We sense that this will become another area for vendor differentiation.
• Apache Hadoop project – This could be spin, but there is sniping behind the scenes that the Hadoop project is no longer so broad-based when it comes to contributions. The flip side is that arguments over whether a particular vendor has enough (or any) committers ring a bit hollow. The operable question for enterprises is whether their distro of Hadoop is, and will remain, well supported.
• Resource management – this one has multiple angles. Of course there is debate over YARN. It is supposed to be the über resource manager of Hadoop, so MapReduce jobs don’t collide with those of other frameworks that may have different (and conflicting) demands on processing and data access. There’s active debate over whether YARN has sufficiently weaned itself from its MapReduce batch lineage, or whether it should be a batch-oriented sub-manager in a scheme where there is yet another layer of control. The counterargument is that this may make life (or at least levels of control) far too complex. Expect vendor differentiation here.
We’ve been talking about Fast Data over the past year, and so has Oracle. Last week we had the chance to make it a dialogue when we were interviewed by Hasan Rizvi, who heads Oracle’s middleware business as Executive Vice President, Oracle Fusion Middleware and Java. The podcast, which will also include an appearance with Oracle customer Turkcell, will go live on February 27. You can sign up for it here.
Much of the hype around Big Data is that not only are people generating more data, but so are machines. Machine data has always been there – it was traditionally collected by dedicated systems such as network node managers, firewall systems, SCADA systems, and so on. But that’s where the data stayed.
Machine data is obviously pretty low level stuff. Depending on the format of data spewed forth by devices, it may be highly cryptic or may actually contain text that is human intelligible. It was traditionally considered low-density data that was digested either by specific programs or applications or by specific people – typically systems operators or security specialists.
Splunk’s reason for existence is putting this data onto a common data platform, then indexing it to make it searchable as a function of time. The operable notion is that the data can then be shared or correlated across applications, such as weblogs. Its roots are in the underside of IT infrastructure management systems, where Splunk is often the embedded data engine. An increasingly popular use case is security, where Splunk can reach across network, server, storage, and web domains to provide a picture of exploits that could be end-to-end, at least within the data center.
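The underlying idea, sketched as a toy in Python rather than as Splunk’s actual implementation: parse a timestamp out of each raw event, bucket events by time, and answer searches by touching only the buckets in the requested window.

```python
# Toy sketch of time-bucketed indexing of machine data -- the concept, not
# Splunk's implementation. The log format and bucket size are invented.
from collections import defaultdict
from datetime import datetime

BUCKET_SECONDS = 3600                    # one bucket per hour, an arbitrary choice
index = defaultdict(list)

def ingest(raw_line):
    # Assume lines begin with an ISO-8601 timestamp, e.g. "2013-03-01T14:02:11 GET /cart ..."
    ts_text, _, message = raw_line.partition(" ")
    ts = datetime.fromisoformat(ts_text)
    bucket = int(ts.timestamp()) // BUCKET_SECONDS
    index[bucket].append((ts, message))

def search(term, start, end):
    first = int(start.timestamp()) // BUCKET_SECONDS
    last = int(end.timestamp()) // BUCKET_SECONDS
    for bucket in range(first, last + 1):            # only the relevant time slices
        for ts, message in index.get(bucket, []):
            if start <= ts <= end and term in message:
                yield ts, message

ingest("2013-03-01T14:02:11 GET /cart status=500")
ingest("2013-03-01T16:45:09 GET /checkout status=200")
print(list(search("status=500", datetime(2013, 3, 1, 14), datetime(2013, 3, 1, 15))))
```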
There’s been a bit of hype around the company, which IPO’ed earlier this year and reported a strong Q2. Consumer technology still draws the headlines (just look at how much the release of the iPhone 5 drowned out almost all other tech news this week). But given Facebook’s market dive, maybe the turn of events on Wall Street could be characterized as revenge of the enterprise, given the market’s previous infatuation with the usual suspects in the consumer space – mobile devices, social networks, and gaming.
Splunk has a lot of headroom. With machine data proliferating and the company promoting its offering as an operational intelligence platform, Splunk is well positioned as a company that leverages Fast Data. While Splunk is not split-second or deterministic real-time, its ability to build searchable indexes on the fly positions it nicely for tracking volatile environments as they change, as opposed to waiting until after the fact (although Splunk can also be used for retrospective historical analysis).
But Splunk faces real growing pains, both up the value chain, and across it.
While Splunk’s heritage is in IT infrastructure data, the company bills itself as being about the broader category of machine data analytics. And there’s certainly lots of it around, given the explosion of sensory devices that are sending log files from all over the place, inside the four walls of a data center or enterprise, and out. There’s the Internet of Things. IBM’s Smarter Planet campaign over the past few years has raised general awareness of how instrumented and increasingly intelligent Spaceship Earth is becoming. Maybe we’re jaded, but it’s become common knowledge that the world is full of sensory points, whether it is traffic sensors embedded in the pavement, weather stations, GPS units, smartphones, biomedical devices, industrial machinery, oil and gas recovery and refining, not to mention the electronic control modules sitting between driver and powertrain in your car.
And within the enterprise, there may be plenty of resistance to getting the bigger picture. For instance, while ITO owns infrastructure data, marketing probably owns the Omniture logs; yet absent the means to correlate the two, it may not be possible to get the answer on why the customer did or did not make the purchase online.
For a sub-$200-million firm, this is a lot of ground to cover. Splunk knows the IT and security market but lacks the breadth of an IBM to address all of the other segments across national intelligence, public infrastructure, smart utility grids, or healthcare verticals, to name a few. And it has no visibility above IT operations or appdev organizations. Splunk needs to pick its targets.
Splunk is trying to address scale – that’s where the Big Data angle comes in. Splunk is adding some features to increase its scale, with the new 5.0 release adding federated indexing to boost performance over larger bodies of data. But for real scale, that’s where integration with Hadoop comes in, acting as a near-line archive for Splunk data that might otherwise be purged. Splunk offers two forms of connectivity: HadoopConnect, which provides a way to stream and transform Splunk data to populate HDFS, and Shuttl, a slower archival feature that treats Hadoop as a tape library (the data is heavily compressed with GZip). It’s definitely a first step – HadoopConnect, as the name implies, establishes connectivity, but the integration is hardly seamless or intuitive yet. It uses Splunk’s familiar fill-in-the-blank interface (we’d love to see something more point-and-click), with the data in Hadoop retrievable, but without Splunk’s familiar indexes (unless you re-import the data back to Splunk). On the horizon, we’d love to see Splunk tackle the far more challenging problem of getting its indexes to work natively inside Hadoop, maybe with HBase.
Then there’s the eternal question of making machine data meaningful to the business. Splunk’s search-based interface today is intuitive to developers and systems admins, as it requires knowledge of the types of data elements being stored. But it won’t work for anybody who doesn’t work with the guts of applications or computing infrastructure. And conveying that message becomes more critical as Splunk is used to correlate log files with higher-level sources – for example, correlating abandoned shopping carts with underlying server data to see if the missed sale was attributable to system bugs or the buyer changing her mind.
The log files that record how different elements of IT infrastructure perform are, in aggregate, telling a story about how well your organization is serving the customer. Yet the perennial challenge of all systems-level management platforms has been conveying the business impact of the events that generated those log files. For those who don’t have to dye their hair gray, there are distant memories of providers like CA, IBM, and HP promoting how their panes of glass displaying data center performance could tell a business message. There’s been the challenge for ITIL adopters to codify the running of processes in the data center to support the business. The list of stillborn attempts to convey business meaning from the underlying operations is endless.
So maybe given the hype of the IPO, the relatively new management team that is in place, and the reality of Splunk’s heritage, it shouldn’t be surprising that we heard two different messages and tones.
From recently-appointed product SVP Guido Schroeder, we heard talk of creating a semantic metadata layer that would, in effect, create de facto business objects. That shouldn’t be surprising, as during his previous incarnation he oversaw the integration of Business Objects into the SAP business. For anyone who has tracked the BI business over the years, the key to success has been creation of a metadata layer that not only codified the entities, but made it possible to attain reuse in ad hoc query and standard reporting. Schroeder and the current management team are clearly looking to take Splunk above IT operations to CIO level.
But attend almost any session at the conference, and the enterprise message was largely missing. That shouldn’t be surprising, as the conference itself was aimed at the people who buy Splunk’s tools – and they tend to be down in the depths of operations.
There were a few exceptions. One of the sessions in the Big Data track, led by Stuart Hirst, CTO of Australian big data consulting firm Converging Data, communicated the importance of preserving the meaning of data as it moves through the lifecycle. In this case, it was a pitch counter to the conventional wisdom of Big Data, which is to ingest the data first and explore and classify it later. As Splunk data is ingested, it is time-stamped to provide a chronological record. Although this may be low-level data, as you bring more of it together, there should be a record of lineage, not to mention sensitivity (e.g., are customer-facing systems involved?).
From that standpoint, the notion of adding a semantic metadata layer atop its indexing sounds quite intuitive – assign higher-level meanings to buckets of log data that carry some business process meaning. For that, Splunk would have to rely on external sources – the applications and databases that run atop the infrastructure whose log files it tracks. That’s a tall order, and one that will require partners, not to mention the question of how you decide which entities should be defined. Unfortunately, the track record for cross-enterprise repositories is not great; maybe there could be some leveraging of MDM implementations around customer or product that could provide a beginning frame of reference.
But we’re getting way, way ahead of ourselves here. Splunk is the story of an engineering-oriented company that is seeking to climb higher up the value chain in the enterprise. Yet, as it seeks to engage higher-level people within the customer organization, Splunk can’t afford to lose track of the base that has been responsible for its success. Splunk’s best route upward is likely through partnering with enterprise players like SAP. That doesn’t deal with the question of how to expand its footprint to follow the spread of what is called machine data, but then again, that’s a question for another day. First things first, Splunk needs to pick its target(s) carefully.
In its rise to leadership of the ERP market, SAP shrewdly placed bounds around its strategy: it would stick to its knitting on applications and rely on partnerships with systems integrators to get critical mass implementation across the Global 2000. When it came to architecture, SAP left no doubt of its ambitions to own the application tier, while leaving the data tier to the kindness of strangers (or in Oracle’s case, the estranged).
Times change in more ways than one – and one of those ways is in the data tier. The headlines of SAP acquiring Sybase (for its mobile assets, primarily) and subsequent emergence of HANA, its new in-memory data platform, placed SAP in the database market. And so it was that at an analyst meeting last December, SAP made the audacious declaration that it wanted to become the #2 database player by 2015.
Of course, none of this occurs in a vacuum. SAP’s declaration to become a front line player in the database market threatens to destabilize existing relationships with Microsoft and IBM as longtime SAP observer Dennis Howlett commented in a ZDNet post. OK, sure, SAP is sick of leaving money on the table to Oracle, and it’s throwing in roughly $500 million in sweeteners to get prospects to migrate. But if the database is the thing, to meet its stretch goals, says Howlett, SAP and Sybase would have to grow that part of the business by a cool 6x – 7x.
But SAP would be treading down a ridiculous path if it were just trying to become a big player in the database market for the heck of it. Fortuitously, during SAP’s press conference announcing its new mobile and database strategies, chief architect Vishal Sikka tamped down the #2 aspirations: that’s really not the point – it’s the apps that count, and increasingly, it’s the database that makes the apps. Once again.
Back to our main point: IT innovation goes in waves. During the emergence of client/server, innovation focused on the database, where the need was mastering SQL and relational table structures; during the latter stages of client/server and the subsequent waves of Webs 1.0 and 2.0, activity shifted to the app tier, which grew more distributed. With the emergence of Big Data and Fast Data, energy has shifted back to the data tier, given the efficiencies of processing data, big or fast, inside the data store itself. Not surprisingly, when you hear SAP speak about HANA, they describe an ability to perform more complex analytic problems or compound operational transactions. It’s no coincidence that SAP now states that it’s in the database business.
So how will SAP execute its new database strategy? Given the hype over HANA, how does SAP convince Sybase ASE, IQ, and SQL Anywhere customers that they’re not headed down a dead end street?
That was the point of the SAP announcements, which in the press release stated the near term roadmap but shed little light on how SAP would get there. Specifically, the announcements were:
• SAP HANA on BW is now going GA, and at the low (SMB) end comes with aggressive pricing: roughly $3,000 for SAP BusinessOne on HANA; $40,000 for HANA Edge.
• Ending a 15-year saga, SAP will finally port its ERP applications to Sybase ASE, with a tentative target date of year end. HANA will play a supporting role as the real-time reporting adjunct platform for ASE customers.
• Sybase SQL Anywhere would be positioned as the mobile front end database atop HANA, supporting real-time mobile applications.
• Sybase’s event stream (CEP) offerings would have optional integration with HANA, providing convergence between CEP and BI – rules are used for stripping key event data for persistence in HANA. In so doing, analysis of event streams could be integrated or directly correlated with historical data (see the sketch after this list).
• Integrations are underway between HANA and IQ with Hadoop.
• Sybase is extending its PowerDesigner data modeling tools to address each of its database engines.
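To make the CEP-to-HANA bullet above concrete, here is a minimal sketch of the pattern: a rule watches the live stream, strips out the few fields worth keeping, and persists only those records so they can later be correlated with historical data. The event fields and the threshold are invented for illustration.

```python
# Minimal sketch of the CEP-to-warehouse pattern: apply a rule to a live event
# stream and persist only the stripped-down records that match. Field names
# and the threshold are made up; "persisted" stands in for the analytic store.
persisted = []

def rule(event):
    # Keep only large trades; discard the rest of the stream.
    return event["type"] == "trade" and event["amount"] >= 1_000_000

def on_event(event):
    if rule(event):
        persisted.append({"ts": event["ts"], "symbol": event["symbol"], "amount": event["amount"]})

stream = [
    {"ts": 1, "type": "quote", "symbol": "XYZ", "amount": 0},
    {"ts": 2, "type": "trade", "symbol": "XYZ", "amount": 2_500_000},
    {"ts": 3, "type": "trade", "symbol": "ABC", "amount": 400},
]
for e in stream:
    on_event(e)
print(persisted)   # only the one event worth correlating with history survives
```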
Most of the announcements, like HANA going GA or Sybase ASE supporting SAP Business suite, were hardly surprises. Aside from go-to-market issues, which are many and significant, we’ll direct our focus on the technology roadmaps.
We’ve maintained that if SAP were serious about its database goals, it had to do three basic things:
1. Unify its database organization. The good news is that it has started down that path as of January 1 of this year. Of course, org charts are only the first step as ultimately it comes down to people.
2. Branding. Although long eclipsed in the database market, Sybase still has an identifiable brand and would be the logical choice; for now SAP has punted.
3. Cross-fertilize technology. Here, SAP can learn lessons from IBM which, despite (or because of) acquiring multiple products that fall under different brands, freely blends technologies. For instance, Cognos BI reporting capabilities are embedded into Rational and Tivoli reporting tools.
The third part is the heavy lift. For instance, given that data platforms are increasingly employing advanced caching, it would at first glance seem logical to blend some of HANA’s in-memory capabilities into the ASE platform; architecturally, however, that would be extremely difficult, as one of HANA’s strengths – dynamic indexing – would be difficult to implement in ASE.
On the other hand, given that HANA can index or restructure data on the fly (e.g., organize data into columnar structures on demand), the question is: does that make IQ obsolete? The short answer is that while memory keeps getting cheaper, it will never be as cheap as disk, and therefore IQ could evolve as near-line storage for HANA. Of course that begs the question of whether Hadoop could eventually perform the same function. SAP maintains that Hadoop is too slow and therefore should be reserved for offline cases; that’s certainly true today, but given developments with HBase, it could easily become fast and cheap enough for SAP to revisit the IQ question a year or two down the road.
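A quick sketch of why on-demand columnar organization matters for the IQ question: the same table laid out row-wise and column-wise, where an aggregate over one or two columns only has to touch those columns in the columnar layout. This is a generic illustration, not HANA’s or IQ’s internals.

```python
# Sketch of why columnar organization helps analytic scans: the same table,
# stored row-wise and column-wise. The tiny table is invented for illustration.
rows = [
    {"order_id": 1, "region": "EMEA", "amount": 120.0},
    {"order_id": 2, "region": "APAC", "amount": 80.0},
    {"order_id": 3, "region": "EMEA", "amount": 200.0},
]

# Row store: each record kept together -- good for OLTP-style point lookups/updates.
row_store = rows

# Column store: one array per column -- good for scans and aggregates over few columns.
column_store = {col: [r[col] for r in rows] for col in rows[0]}

# Aggregate over two columns only reads those two arrays, not every field of every row.
total_emea = sum(a for region, a in zip(column_store["region"], column_store["amount"])
                 if region == "EMEA")
print(total_emea)  # 320.0
```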
Not that SAP Sybase is sitting still with Hadoop integration. It is providing MapReduce and R capabilities to IQ (SAP Sybase is hardly alone here, as most Advanced SQL platforms offer similar support). SAP Sybase is also providing capabilities to map IQ tables into Hadoop Hive, slotting IQ as an alternative to HBase; in effect, that’s akin to a number of strategies to put SQL layers inside Hadoop (in a way, similar to what the lesser-known Hadapt is doing). And of course, like most of the relational players, SAP Sybase also supports bulk ETL/ELT loads from HDFS to HANA or IQ.
On SAP’s side for now is the paucity of Hadoop talent, so pitching IQ as an alternative to HBase may help soften the blow for organizations seeking to get a handle on Hadoop. But in the long run, we believe that SAP Sybase will have to revisit this strategy. Because, if it’s serious about the database market, it will have to amplify its focus to add value atop the new realities on the ground.
Of the 3 “V’s” of Big Data – volume, variety, velocity (we’d add “Value” as the 4th V) – velocity has been the unsung ‘V.’ With the spotlight on Hadoop, the popular image of Big Data is large petabyte data stores of unstructured data (which are the first two V’s). While Big Data has been thought of as large stores of data at rest, it can also be about data in motion.
“Fast Data” refers to processes that require lower latencies than would otherwise be possible with optimized disk-based storage. Fast Data is not a single technology, but a spectrum of approaches that process data that might or might not be stored. It could encompass event processing, in-memory databases, or hybrid data stores that optimize cache with disk.
Fast Data is nothing new, but because of the cost of memory, was traditionally restricted to a handful of extremely high-value use cases. For instance:
• Wall Street firms routinely analyze live market feeds, and in many cases, run sophisticated complex event processing (CEP) programs on event streams (often in real time) to make operational decisions.
• Telcos have handled such data in optimizing network operations while leading logistics firms have used CEP to optimize their transport networks.
• In-memory databases, used as a faster alternative to disk, have similarly been around for well over a decade, having been employed for program stock trading, telecommunications equipment, airline schedulers, and large destination online retail (e.g., Amazon).
Hybrid in-memory and disk have also become commonplace, especially among data warehousing systems (e.g., Teradata, Kognitio), and more recently among the emergent class of advanced SQL analytic platforms (e.g., Greenplum, Teradata Aster, IBM Netezza, HP Vertica, ParAccel) that employ smart caching in conjunction with a number of other bells and whistles to juice SQL performance and scaling (e.g., flatter indexes, extensive use of various data compression schemes, columnar table structures, etc.). Many of these systems are in turn packaged as appliances that come with specially tuned, high-performance backplanes and direct-attached disk.
Finally, caching is hardly unknown to the database world. Hot spots of frequently accessed data are often placed in cache, as are snapshots of database configurations that are stored to support restore processes, and so on.
So what’s changed?
The usual factors: the same data explosion that created the urgency for Big Data is also generating demand for making the data instantly actionable. Bandwidth, commodity hardware, and of course, declining memory prices, are further forcing the issue: Fast Data is no longer limited to specialized, premium use cases for enterprises with infinite budgets.
Not surprisingly, pure in-memory databases are now going mainstream: Oracle and SAP are choosing in-memory as one of the next places to establish competitive stakes: SAP HANA vs. Oracle Exalytics. Both Oracle and SAP for now are targeting analytic processing, including OLAP (raising the size limits on OLAP cubes) and more complex, multi-stage analytic problems that traditionally would have required batch runs (such as multivariate pricing) or would not have been run at all (too complex, too much delay). More to the point, SAP is counting on HANA as a major pillar of its stretch goal to become the #2 database player by 2015, which means expanding HANA’s target to include next-generation enterprise transactional applications with embedded analytics.
Potential use cases for Fast Data could encompass:
• A homeland security agency monitoring the borders requires the ability to parse, decipher, and act on complex occurrences in real time to prevent suspicious people from entering the country
• Capital markets trading firms require real-time analytics and sophisticated event processing to conduct algorithmic or high-frequency trades
• Entities managing smart infrastructure must digest torrents of sensory data to make real-time decisions that optimize the use of transportation or public utility infrastructure
• B2B consumer products firms monitoring social networks may require real-time response to understand sudden swings in customer sentiment
For such organizations, Fast Data is no longer a luxury, but a necessity.
More specialized use cases are similarly emerging now that the core in-memory technology is becoming more affordable. YarcData, a startup from venerable HPC player Cray, is targeting graph data, which represents data with many-to-many relationships. Graph computing is extremely process-intensive and, as such, has traditionally been run in batch when involving Internet-size sets of data. YarcData adopts a classic hybrid approach that pipelines computations in memory but persists data to disk. YarcData is the tip of the iceberg – we expect to see more specialized applications that utilize hybrid caching to combine speed with scale.
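As a rough illustration of why graph workloads are so process-intensive, the sketch below walks many-to-many relationships with a breadth-first traversal; the fan-out at each hop is what makes Internet-scale graphs expensive. The tiny adjacency list is invented for illustration.

```python
# Sketch of why graph workloads are process-intensive: traversing many-to-many
# relationships fans out quickly, so each extra "hop" can multiply the work.
from collections import deque

graph = {  # adjacency list: node -> connected nodes (edges carry the relationships)
    "a": ["b", "c"], "b": ["d", "e"], "c": ["e", "f"],
    "d": [], "e": ["g"], "f": ["g"], "g": [],
}

def neighbors_within(start, hops):
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen - {start}

print(neighbors_within("a", 2))  # everything reachable within two hops of "a"
```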
But don’t forget, memory’s not the new disk
The movement – or tiering – of data to faster or slower media is also nothing new. What is new is that data in memory may no longer be such a transient thing, and if memory is relied upon for in situ processing of data in motion or rapid processing of data at rest, memory cannot simply be treated as the new disk. By nature, DRAM is volatile: there goes your power… and there goes your data. Not surprisingly, in-memory systems such as HANA still replicate to disk to reduce volatility. For conventional disk data stores that increasingly leverage memory, Storage Switzerland’s George Crump makes the case that caching practices must become smarter to avoid misses (where data gets mistakenly swapped out). There are also balance-of-system considerations: memory may be fast, but is its processing speed well matched with the processor? Maybe solid state overcomes the I/O issues associated with disk, but it may still be vulnerable to coupling issues if processors get bottlenecked or MapReduce jobs are not optimized.
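Crump’s point about smarter caching is easiest to see against the simplest possible policy. The sketch below is a bare-bones LRU cache; every miss is a trip back to the slower tier, which is exactly the cost smarter placement tries to avoid.

```python
# Minimal LRU cache sketch: the simplest eviction policy a tiered store might use.
# A "miss" is the case Crump warns about -- data swapped out and re-fetched from
# the slower tier. Real systems use smarter placement than plain LRU.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity, self.data, self.misses = capacity, OrderedDict(), 0

    def get(self, key, fetch_from_slow_tier):
        if key in self.data:
            self.data.move_to_end(key)          # hit: mark as recently used
            return self.data[key]
        self.misses += 1                        # miss: fall back to disk/SSD
        value = fetch_from_slow_tier(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)       # evict the least recently used item
        return value

cache = LRUCache(capacity=2)
for key in ["a", "b", "a", "c", "b"]:
    cache.get(key, fetch_from_slow_tier=lambda k: k.upper())
print(cache.misses)  # 4 misses on this access pattern
```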
Declining memory prices are putting Fast Data in the fast lane to the mainstream. But even as the technology becomes affordable, we’re still early in the learning curve for how to design for it.
Tibco has been running on all cylinders of late. In earnings and revenues, it has kept up with the Joneses in the enterprise software neighborhood, running respectable 25% revenue and 30+% software license growth numbers in its most recent quarterly year over year results as we’ve noted in several of our recent Ovum research notes.
It is beginning to make the turn from its geeky roots toward more solution selling to the business side, in tone and deed. Ever since the 2007 Spotfire acquisition – which brought real-time analytic visualizations – it has made several buys that are targeted more to the business than strictly the IT or CIO side. They include Netrics, for fuzzy-logic technologies for pattern matching; Loyalty Lab, for managing customer affinity programs; and Nimbus, a recent addition, which adds process discovery and management of the manual activities that comprise the other 80% of what happens inside an enterprise.
Of course it’s not as if Tibco were trying to pull an HP in doing a 180 on its business strategy (heaven forbid, we don’t need any more senseless Silicon Valley soap operas!). Core infrastructure plays, such as FTL ultra low latency messaging or the DataSynapse data grid, remain core to Tibco’s 2-second advantage mission. It’s just that, in modest but growing cases, the raw technology is being packaged as a black box underneath more business-focused solutions. For instance, Tibco is packaging solutions for retail such as Active Catalog and Active Fulfillment that underneath the hood bundle Tibco Business Events (CEP), Active Matrix BPM, and other pieces.
Of course, such transformations don’t come overnight, as there is the need to get field sales up to speed and accustomed to calling on new entry points at target prospects. Not surprisingly, Tibco is also ramping up vertical solutions, but on an opportunistic basis. An example: we met with a European telco customer that is using Business Events for monitoring devices (in this case, water meters), which may present an opportunity for Tibco to develop an M2M (machine-to-machine) event-driven integration solution that could be more widely applied to segments such as utilities or logistics/transportation.
Several of its recent acquisitions, such as Foresight, a healthcare payer EDI gateway, and OpenSpirit, for data integration for upstream oil and gas processing, are strictly vertical plays. Loyalty Lab, which provides analytics for customer affinity programs, has helped make retail one of its fastest-growing verticals, coming from a near-zero client base a few years back.
Tibco is traveling a similar road as IBM, but is starting from a much earlier point in developing vertical solutions. As Tibco lacks the professional services presence of IBM, it has to cherry-pick its vertical opportunities.
At this point, the major disrupters for Tibco are big data and mobility.
For mobile the challenge is integrating alerts from Tibco’s Business Events and Spotfire engines to clients; tibbr, its internal collaboration messaging platform, provides the logical environment for bringing its events feed out to mobile devices. This could be bolstered with its recent Nimbus acquisition, both for input (process discovery, using mobile devices to snap a picture, for instance) and output (for communicating how to perform manual processes out to the field).
Big data positioning and productization for Tibco is also a work in progress. Its message busses can in some cases handle enormous amounts of data; its business event engine could also provide feeds if Tibco can make the sensing agent more lightweight; its BPM offering could be configured to get triggered based on specific event patterns that may involve crunching of enormous volumes of event feeds.
But there is a brave new world of variably structured data that is becoming fair game for enterprises to sense and respond to. We don’t expect Tibco to buy its own Advanced SQL platform or create its own Hadoop distribution, as Tibco is not about data at rest, nor is it a database player (OK, its MDM offering does have to store master and reference data). Nonetheless, delivering the 2-second advantage in a big world where the data is getting bigger and bigger, and more heterogeneous, raises the urgency for Tibco to distinguish itself by extending its visibility.
When we were asked by the executive marketing team for our impressions this year, our thoughts were, well, there was hardly anything newsworthy. That’s not necessarily a bad thing: during a strategy roadmap presentation at this year’s Tibco TUCON conference, a timeline of Tibco acquisitions showed roughly a half dozen entries for 2010 and just one for this year. Over the past year Tibco has been preoccupied with absorbing the new acquisitions and so – Nimbus excluded – has not been active on this front lately. For instance, Tibco has integrated the Netrics fuzzy pattern matching engine into Business Events, where it belongs. It has similarly blended the recently acquired data grid technology with Business Events. Check out Sandy Kemsley’s post for a more detailed blow-by-blow on how Tibco has rounded out its product portfolio over the past year.
With the swoon on Wall Street, Tibco has left its $250 million cash stash alone, in spite of the fact that there are plenty of acquisition targets available at reasonable prices right now as a lot of venture funds are looking for exits. In its CFO’s words, the company is not as enormous as IBM or Oracle, where acquisitions don’t disrupt the entire company. Nonetheless, we expect that 2012 will grow more active in acquisitions – we hope that acquisition of a data quality provider makes the top of the shopping list.
While there is relatively little to knock cloud from its hype perch, among web startups and BI and data geeks the emergence of Big Data has become a game changer. It’s analytics and operational intelligence gone extreme.
Big Data typically is associated with obscene amounts of data – the scale blows away anything that most enterprises would maintain within their core back end business systems. We’re talking hundreds of terabytes or even petabytes.
Today, Yahoo announced that it might take the business of its best-known Big Data brainchild, Hadoop, and consider spinning it off into a new entity.
So why are we having this conversation?
It’s because Internet giants Google, Yahoo, Facebook, Amazon, and others had to roll their own technologies to deal with magnitudes of data far beyond conventional wisdom of what was possible with enterprise systems. What makes the conversation interesting is that this technology is on the cusp of entering the enterprise mainstream today. It’s not just a matter of technology looking for a problem. When Facebook needs to understand how its 500 million members update their walls, share photographs, and have conversations, it’s because (1) it needs to optimize its IT infrastructure to support how its members use the site, but more importantly (2) it needs to understand more about its members so it can sell advertising.
And when Facebook makes its API publicly available, that same issue becomes critical for any B2C marketer. And as the technology becomes available, suddenly there are downstream uses in capital markets for conducting brute-force analyses of trading positions, healthcare providers for understanding outcomes, homeland security for controlling borders, metropolitan entities seeking to manage congestion pricing, life sciences organizations seeking to decipher clinical studies, mobile carriers seeking to prevent or minimize customer churn, and so on.
There are a couple technology and market paths that have opened for contending with Big Data. There are Advanced SQL analytic database providers that have adapted SQL for structured data through strategies such as reducing indexing, introducing new forms of data compression and query optimization, columnar architectures, and embedding analytics and data transformation directly into the data engine to minimize data movement; in some cases, they have developed optimized appliances. We’re talking about the Aster Datas, Greenplums, Netezzas, ParAccels, and Verticas of the world – and players like Teradata that invented big data warehousing, Oracle that has extended it, and Sybase which acquired the first column-oriented database. Business has obviously picked up here; IBM, EMC, Teradata, and HP have all made acquisitions in this space over the past 12 months.
But the Facebooks and Googles of the world weren’t dealing with structured data in the enterprise sense – they are contending with web log files, document APIs, rich media files, and so on. They are dealing with data whose structure and volume is so varied and huge that there is no time to model it and form a schema; they need to just load the data into the file system and then analyze it. That spawned the NoSQL movement – initially a focus on technologies that avoided the overhead and scalability limits of SQL.
Until now, none of Google, Yahoo, or Facebook considered themselves to be in the tools or database business. So they released the fruits of their innovation as open source, with one of the best-known projects being Apache Hadoop. Hadoop is a family of projects that includes a distributed file system, the MapReduce framework that parcels out massively parallel computing jobs across a cluster, plus a number of other frameworks, file systems, and utilities.
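For readers who haven’t met MapReduce, the canonical word-count example captures the pattern; the sketch below simulates it in a single Python process, whereas real Hadoop runs the map and reduce functions in parallel across the cluster and handles the shuffle between them.

```python
# The canonical MapReduce illustration -- word count -- simulated in one process.
# In real Hadoop, map and reduce run in parallel across the cluster and the
# framework handles the shuffle/sort between the two phases.
from collections import defaultdict

def map_phase(document):
    for word in document.split():
        yield word.lower(), 1                  # emit (key, value) pairs

def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)             # group values by key
    return grouped

def reduce_phase(key, values):
    return key, sum(values)                    # aggregate per key

documents = ["Big Data is getting bigger", "Fast Data is getting faster"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["data"], counts["getting"])       # 2 2
```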
What’s kind of fascinating is the almost incestuous relationship between these NoSQL projects. Hadoop, developed at Yahoo, descended from the Google File System, which in turn underpins Google BigTable; the same lineage holds for Cassandra, another NoSQL data store. Meanwhile, Facebook developed Hive, a relational-like table structure designed to work with Hadoop. You get the picture.
Cloudera has stepped to the forefront in commercializing Hadoop technology and applying MapReduce. Using a Red Hat-like business model, it offers support, several open source extensions, plus an enterprise edition that adds a number of proprietary monitoring and management features. It has distinguished itself by forging partnerships with almost every major BI and data warehousing player except one – IBM. The highlights are its relationships with Informatica, for data transformation, and MicroStrategy, which provides a data mart strategy designed to complement Hadoop. And it has garnered roughly 75 enterprise paying customers in a market segment that has barely commercialized.
In the long run, we also expect IBM to make a stab at Hadoop and related technologies by extending its InfoSphere offerings – it can see Cloudera-Informatica and Cloudera-MicroStrategy and raise them with its own InfoSphere DataStage and Cognos offerings, before it even talks about partnerships. Today we saw a shot from left field – Yahoo, which invented the technology, is now saying it might spin off its Hadoop business to go up against Cloudera, and potentially IBM. In a way, it’s closing the barn doors after the horses have left, as the creator of Hadoop is now part of Cloudera.
Clearly there will be a market for NoSQL technologies in the quest for Big Data, although for now, they require sufficiently specialized skills that they are not for the faint of heart – that is, if you can find any Hadoop and MapReduce programmers who haven’t already been scarfed up by Amazon, Zynga, or JP Morgan Chase. That market will not necessarily be in competition with Advanced SQL, as there are different use cases for each. And in fact, there will likely be a blending of the technologies in the long run. Today, many Advanced SQL platforms are already extending support for MapReduce, and in the long run, we expect that SQL-like technologies in the NoSQL space, like Hive or HBase, will themselves be made more accessible to the huge base of SQL developers.
But we digress.
For Yahoo, this would clearly be a shot out of its comfort zone, as it is not a tools company. But it is hungry to monetize its intellectual property, even if that property has already been open sourced. It’s redolent of Sun striving to monetize Java, and we all know how that went. Obviously this will be an uphill battle for Yahoo, but at least this would be a spinoff, so hopefully there won’t be distractions from the mother ship. Given Yahoo’s fortunes, we shouldn’t be surprised that it is now looking to maximize what it can get out of the family jewels.