With the Strata 2013 Santa Clara conference about to kick into high gear a week from now, we’re bracing for a wave of SQL-related announcements. You won’t hear a lot about this in the vendor announcements, but behind the scenes, there’s a major disruption occurring that will determine whether MapReduce and other products or frameworks play friendly with each other on Hadoop.
MapReduce has historically been the yin to Hadoop’s yang. The literature about Hadoop invariably mentioned MapReduce, often within the same sentence. So excuse us for having wondered, once upon a naïve time, if they were synonymous.
MapReduce is the processing framework that put Hadoop on the map because it so effectively took advantage of Hadoop’s scalable Internet data center-style architecture. In and of itself, MapReduce is a generic idea for massively parallel computing: break a job into multiple concurrent threads (Map) and then consolidate them (Reduce) to get a result. The MapReduce framework itself was written for Hadoop’s architecture. It pushes Map operations directly to Hadoop data nodes, with each operation being completely self-contained (i.e., it supports massively parallel, shared-nothing operation); it treats data as the key-value pairs that Hadoop uses; and it works directly with Hadoop’s JobTracker and TaskTracker to provide a closed-loop process for checking and submitting correctly formed jobs and tracking their progress to completion (where the results of each Map are shuffled together as part of the Reduce phase).
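To make the Map, shuffle, and Reduce steps concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, run_job) and the word-count task are our own illustrative assumptions, and a real job would distribute the splits across data nodes rather than loop over them.

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit a (key, value) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]

def shuffle(mapped_outputs):
    """Shuffle: group the values from all map outputs by key."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: consolidate each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}

def run_job(splits):
    # Each split is mapped independently of the others (shared-nothing),
    # which is what would let a cluster run the maps in parallel.
    mapped = [map_phase(split) for split in splits]
    return reduce_phase(shuffle(mapped))

# Hypothetical input: two splits that could live on two different nodes.
splits = ["Hadoop runs MapReduce", "MapReduce runs on Hadoop"]
counts = run_job(splits)
```

The point of the sketch is the shape of the computation: the map step never looks outside its own split, and only the shuffle brings results for the same key together.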
A key advantage of MapReduce is that it treats not only individual Map operations as self-contained, but also each Map-Reduce cycle as a self-contained operation. That provides huge flexibility to solve problems iteratively through a chained series of MapReduce cycles. Such a process proved extremely effective for crunching through petabytes of data.
Yet, that advantage is also a drawback: each MapReduce cycle is extremely read/write-intensive, as each MapReduce step is written to disk (rather than cached), which makes the process time-consuming and best suited for batch operation. If anything, the trend in enterprise analytics has been towards interactive and in some cases real-time operation, but Hadoop has been off limits to that – until recently.
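The read/write cost of chaining cycles can be sketched as follows. This is a toy illustration under our own assumptions (the run_cycle and chained_job helpers, the JSON files, and the two-step job are all hypothetical), not how Hadoop stores intermediate output, but it shows the pattern: each cycle persists its result to disk and the next cycle reads it back.

```python
import json
import os
import tempfile
from collections import defaultdict

def run_cycle(records, mapper, reducer, out_path):
    """Run one map-reduce cycle and persist its output to disk."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    results = [[key, reducer(values)] for key, values in sorted(groups.items())]
    with open(out_path, "w") as f:
        json.dump(results, f)  # every cycle ends with a disk write
    return out_path

def read_output(path):
    """Read a previous cycle's output back from disk."""
    with open(path) as f:
        return json.load(f)

def chained_job(lines, workdir):
    # Cycle 1: count how often each word occurs.
    step1 = run_cycle(
        lines,
        lambda line: [(word, 1) for word in line.split()],
        sum,
        os.path.join(workdir, "step1.json"),
    )
    # Cycle 2: starts by re-reading cycle 1's output, then counts how many
    # words share each frequency.
    step2 = run_cycle(
        read_output(step1),
        lambda record: [(record[1], 1)],
        sum,
        os.path.join(workdir, "step2.json"),
    )
    return read_output(step2)

with tempfile.TemporaryDirectory() as workdir:
    histogram = chained_job(["a b a", "b c a"], workdir)
```

Because each cycle only needs the previous cycle's file, the steps stay self-contained; the price is a full write and read between every pair of steps, which is why the approach suits batch work better than interactive queries.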
As we’ve noted, with convergence of SQL and Hadoop, we believe that the predominant theme this year for Hadoop development is rationalization with the SQL world. While of course there is batch processing in the SQL world, the dominant mode is interactive. But this doesn’t rule out innovation in other directions with Hadoop, as the platform’s flexibility could greatly extend and expand the types of analytics. Yes, there will be other types of batch analytics, but it’s hard to ignore the young elephant in the room: interactive Hadoop.
Enter YARN. As we said, there was a good reason why we used to get confused between Hadoop and MapReduce. Although you could run jobs in any style that could scan and process HDFS files, the only framework you could directly manage with core Hadoop was MapReduce. YARN takes the resource management piece out of MapReduce. That means (1) MapReduce can just be MapReduce and (2) you can use the same resource manager to run other processing frameworks. YARN is a major element of the forthcoming Hadoop 2.0, which we expect to be formally released around Q3.
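The idea of pulling resource management out of the processing framework can be sketched with a toy scheduler. To be clear, this is not YARN's API: the ResourceManager class, its request and release methods, and the notion of a fixed pool of interchangeable containers are simplifying assumptions made purely for illustration.

```python
class ResourceManager:
    """Toy cluster-wide scheduler shared by several frameworks (not YARN's API)."""

    def __init__(self, total_containers):
        self.free = total_containers
        self.allocations = {}

    def request(self, app_name, containers):
        """Grant as many containers as are free, up to the amount requested."""
        granted = min(containers, self.free)
        self.free -= granted
        self.allocations[app_name] = self.allocations.get(app_name, 0) + granted
        return granted

    def release(self, app_name):
        """Return all of an application's containers to the shared pool."""
        self.free += self.allocations.pop(app_name, 0)

# Three hypothetical frameworks competing for the same ten containers.
rm = ResourceManager(total_containers=10)
batch_granted = rm.request("mapreduce-job", 6)
graph_granted = rm.request("graph-job", 6)    # only 4 containers remain
rm.release("mapreduce-job")                   # the batch job finishes
stream_granted = rm.request("stream-job", 5)
```

The scheduler knows nothing about what each job does with its containers, which is the separation that lets MapReduce become one tenant among several.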
That’s the end goal with YARN; it’s still a work in progress, as is all of Hadoop 2.0. At this point, YARN has been tested at scale at Yahoo: over 30,000 nodes and 14 million applications, reports Arun Murthy in his blog (as release manager, he’s in charge of herding cats for Hadoop 2.0). OK, so YARN has been tested at scale (MapReduce could do the same thing) but still needs some API work.
So what other frameworks will be supporting YARN? We had a chat with Arun a couple weeks back to get a better idea of what’s shakin’. It’s still early days; for now, Apache HAMA (a rather unfortunate name in our view), which you could imagine as MapReduce’s scientific computing cousin, supports YARN. Others are still work in progress. Giraph, an incubating Apache project that addresses graph processing, will likely join the fray. Others include Spark, a framework for supporting in-memory cluster computing (it provides the engine behind Shark, an Apache Hive-compatible data warehousing system that is supposed to run 100x faster than Hive). Pervasive Software (about to be acquired by Actian) has gone on record that its DataRush engine would run under YARN. We wouldn’t be surprised if Twitter Storm, which focuses on distributed, real-time processing of streaming data, also comes under the YARN umbrella.
There are of course other frameworks emerging that may or may not support YARN. Cloudera already supports a prerelease version of YARN as part of CDH 4, but it has not stated whether Impala, its own open source SQL-compatible MPP data warehousing framework, will run under YARN. Speaking of Impala, there are a number of other approaches that are emerging for making Hadoop more interactive or real-time, such as Platfora, which adapts a common approach from the relational world for tiering “hot” data into memory. There are others, like Hadapt and Splice Machine, that are inserting SQL directly into HDFS.
The 64-petabyte question of course is whether everybody is going to play nice and rational and make their frameworks or products work with YARN. In essence, it’s a literal power grab question – should I let the operation of my own product or framework be governed by a neutral resource manager, or can that resource manager fully support my product’s style of execution? The answer depends on both technology and market maturity.
On the technology end, there’s the question of whether YARN can get beyond its MapReduce (batch) roots. The burden of proof is on the YARN project folks for demonstrating, not only that their framework works at scale and supports the necessary APIs, but also that it can support other styles such as interactive or real-time streaming modes, and that it can balance workloads as approaches from the database world, such as data tiering, require their own unique optimizations.
The commercial end of the argument is where the boundary between open source and commercial value-add (proprietary or non-Hadoop open source) lies. It’s a natural rite of passage for any open source platform that becomes a victim of its own success. And it’s a question for enterprises to consider when they make their decision: ultimately, it’s about the ecosystem or club that they want to belong to.
It was never a question of whether SAP would bring its flagship product, Business Suite, to HANA, but when. And when I saw this while parking the car at my physical therapist over the holidays, I should’ve suspected that something was up: SAP at long last was about to announce … this.
From the start, SAP has made clear that its vision for HANA was not a technical curiosity, positioned as some high-end niche product or sideshow. In the long run, SAP was going to take HANA to Broadway.
SAP product rollouts on HANA have proceeded in logical, deliberate fashion. Start with the lowest hanging fruit, analytics, because that is the sweet spot of the embryonic market for in-memory data platforms. Then work up the food chain, with the CRM introduction in the middle of last year – there’s an implicit value proposition for having a customer database on a real-time system, especially while your call center reps are on the phone and would like to either soothe, cross-sell, or upsell the prospect. Get some initial customer references with a special purpose transactional product in preparation for taking it to the big time.
There’s no question that in-memory can have real impact, from simplifying deployment to speeding up processes and enabling more real-time agility. Your data integration architecture is much simpler and the amount of data you physically must store is smaller. SAP provides a cute video that shows how HANA cuts through the clutter.
For starters, when data is in memory, you don’t have to denormalize or resort to tricks like sharding or striping of data to enhance access to “hot” data. You also don’t have to create staging servers to perform ETL if you want to load transaction data into a data warehouse. Instead, you submit commands or routines that, thanks to processing speeds that SAP claims to be up to 1000x faster than disk, convert the data almost instantly to the form in which you need to consume it. And when you have data in memory, you can now perform more ad hoc analyses. In the case of production and inventory planning (a.k.a. the MRP portion of ERP), you could run simulations when weighing the impact of changing or submitting new customer orders, or judging the impact of changing sourcing strategies when commodity prices fluctuate. Beta customer John Deere achieved positive ROI based solely on the benefits of implementing it for pricing optimization (SAP has roughly a dozen customers in ramp-up for Business Suite on HANA).
It’s not a question of whether you can run ERP in real time. No matter how fast you construct or deconstruct your business planning, there is still a supply chain that introduces its own lag time. Instead, the focus is how to make enterprise planning more flexible, enhanced with built-in analytics.
But how hungry are enterprises for such improvements? To date, SAP has roughly 500 HANA installs, primarily for Business Warehouse (BW) where the in-memory data store was a logical upgrade for analytics, where demand for in-memory is more established. But on the transactional side, it’s a more uphill battle as enterprises are not clamoring to conduct forklift replacements of their ERP systems, not to mention their databases as well. Changing both is no trivial matter, and in fact, changing databases is even rarer because of the specialized knowledge that is required. Swap out your database, and you might as well swap out your DBAs.
The best precedent is Oracle, which introduced Fusion Applications two years ago. Oracle didn’t necessarily see Fusion as a replacement for E-Business Suite, JD Edwards, or PeopleSoft. Instead it viewed Fusion Apps as a gap filler for new opportunities among its installed base or the rare case of a greenfield enterprise install. We’d expect no less from SAP.
Yet in the exuberance of rollout day, SAP was speaking of the transformative nature of HANA, claiming it “Reinvents the Real-Time Enterprise.” It’s not the first time that SAP has positioned HANA in such terms.
Yes, HANA is transformative when it comes to how you manage data and run applications, but let’s not get caught down another path to enterprise transformation. We’ve seen that movie before, and few of us want to sit through it again.
Ever since IBM exited the applications business, it has been steadily inching its way back up the value chain from pure infrastructure software. IBM has over the past few years unleashed a string of initiatives seeking to deliver, not only infrastructure software and the integration services to accompany them, but gradually more bits of software that deliver content aimed for the needs of specific scenarios in specific verticals. Naturally, with a highly diversified organization like IBM, there have been multiple initiatives with, of course, varying levels of success.
It started with the usual scenario among IT service providers seeking to derive reusable content from client engagements. Then followed a series of acquisitions for capabilities targeted at vertical industries such as fixed asset management for capital-intensive sectors such as manufacturing or utilities; product information management for consumer product companies; commerce for B2B transactions; online marketing analytic capabilities, and so on. Then came the acquisition of Webify in 2007, where we thought this would lead to a new generation of SOA-based, composite vertical applications (disclosure: we were still drinking the SOA Kool-Aid at the time). At the time, IBM announced there would be Business Fabric SOA frameworks for telco, banking, and insurance, which left us waiting for the shoe to drop for more sectors. Well, that’s all they wrote.
Last year, IBM Software Group (SWG) reorganized into two uber organizations: Middleware under the lead of Robert Leblanc, and Solutions under Mike Rhodin. Both presented at SWG’s 2011 analyst forum as to what the reorg meant. What was interesting was that for organizational purposes, this was a very ecumenical definition of Middleware: it included many of the familiar products from the Information Management, Tivoli, and Rational brand portfolios, and as such, was far more encompassing (e.g., it also included the data layer).
More to the point, once you get past middleware infrastructure, what’s left? At his presentation last year, Rhodin outlined five core areas: Business Analytics and Optimization; Smarter Commerce; Social Business; Smarter Cities; and Watson Solutions. And he outlined IBM’s staged process for developing new markets, expressed as incubation, where the missionary work is done; “make a market,” where the product and market are formally defined and materialized; and scale a market, which is self-explanatory. Beyond that, we still wondered what makes an IBM solution.
This year, Rhodin fleshed out the answer. To paraphrase, Rhodin said that “it’s not about creating 5000 new products, but creating new market segments.” Rhodin defined segments as markets that are large enough to have visible impact on a $100 billion corporation’s top line. Not $100 million markets, but instead, add a zero or two to it.
An example is Smarter Cities, which began with the customary reference customer engagements to define a solution space. IBM had some marquee urban infrastructure engagements with Washington DC, Singapore, Stockholm, and other cities, out of which came its Intelligent Operations Center. IBM is at an earlier stage with Watson Solutions, with engagements at WellPoint (for approving procedures) and Memorial Sloan-Kettering Cancer Center (healthcare delivery) in fleshing out a Smart Healthcare solution.
Of these, Smarter Analytics (not to be confused with Smart Analytics System – even big companies sometimes run out of original brand names) is the most mature.
The good news is that we have a better idea of what IBM means when it says solutions – it’s not individual packaged products per se, but groups of related software products, services, and systems. And we know at a very high level where IBM is going to focus its solutions efforts.
Plus ça change… IBM has always been about software, services, and systems – although in recent years the first two have taken front stage. The flip side is that some of these solutions areas are overly broad. Smarter Analytics is a catch-all covering the familiar areas of business intelligence and performance management (much of the Cognos portfolio), predictive analytics and analytical decision management (much of the SPSS portfolio), and analytic applications (Cognos products tailored to specific line organizations like sales, finance, and operations).
It hasn’t been in doubt that for IBM, solutions meant addressing the line of business rather than just IT. That’s certainly a logical strategy for IBM to spread its footprint within the Global 2000. The takeaway from getting a better definition of IBM’s Solutions business is that it gives us an idea of the scale and the acquisition opportunities that they’re after.
Conventional wisdom is that once Big Data is at rest, don’t move it or shake it. Akin to “don’t fold, spindle, or mutilate.” But seriously, if mainstream enterprises adopt Hadoop, they will expect it to become more robust. And so you start looking at things like data replication, or at least replication of the NameNode or other components that govern how and where data resides in Hadoop and how operations are performed against it.
So here’s an interesting one to watch: WANdisco buying AltoStor. They are applying replication technology developed for Subversion to Hadoop. We’re gonna check this one out.
With Strata, IBM IOD, and Teradata Partners conferences all occurring this week, it’s not surprising that this is a big week for Hadoop-related announcements. The common thread of announcements is essentially, “We know that Hadoop is not known for performance, but we’re getting better at it, and we’re going to make it look more like SQL.” In essence, Hadoop and SQL worlds are converging, and you’re going to be able to perform interactive BI analytics on it.
The opportunity and challenge of Big Data from new platforms such as Hadoop is that it opens a new range of analytics. On one hand, Big Data analytics have updated and revived programmatic access to data, which happened to be the norm prior to the advent of SQL. There are plenty of scenarios where taking programmatic approaches is far more efficient, such as dealing with time series data or graph analysis to map many-to-many relationships. Programmatic access also leverages in-memory data grids such as Oracle Coherence, IBM WebSphere eXtreme Scale, GigaSpaces, and others, where programmatic development (usually in Java) proved more efficient for accessing highly changeable data for web applications where traditional paths to the database would have been I/O-constrained. Conversely, Advanced SQL platforms such as Greenplum and Teradata Aster have provided support for MapReduce-like programming because, even with structured data, sometimes using a Java programmatic framework is a more efficient way to rapidly slice through volumes of data.
Until now, Hadoop has not been for the SQL-minded. The initial path was: find someone to do data exploration inside Hadoop, but once you’re ready to do repeatable analysis, ETL (or ELT) it into a SQL data warehouse. That’s been the pattern with Oracle Big Data Appliance (use Oracle loader and data integration tools), and most Advanced SQL platforms; most data integration tools provide Hadoop connectors that spawn their own MapReduce programs to ferry data out of Hadoop. Some integration tool providers, like Informatica, offer tools to automate parsing of Hadoop data. Teradata Aster and Hortonworks have been talking up the potential of HCatalog, actually an enhanced version of Hive with RESTful interfaces, cost optimizers, and so on, to provide a more SQL-friendly view of data residing inside Hadoop.
But when you talk analytics, you can’t simply write off the legions of SQL developers that populate enterprise IT shops. And beneath the veneer of chaos, there is an implicit order to most so-called “unstructured” data that is within the reach of programmatic transformation approaches that in the long run could likely be automated or packaged inside a tool.
At Ovum, we have long believed that for Big Data to cross over to the mainstream enterprise, it must become a first-class citizen with IT and the data center. The early pattern of skunk works projects, led by elite, highly specialized teams of software engineers from Internet firms to solve Internet-style problems (e.g., ad placement, search optimization, customer online experience, etc.), does not address the problems of mainstream enterprises. Neither is the model of recruiting high-priced talent to work exclusively on Hadoop sustainable for most organizations. It means that Big Data must be consumable by the mainstream of SQL developers.
Making Hadoop more SQL-like is hardly new
Hive and Pig became Apache Hadoop projects because of the need for SQL-like metadata management and data transformation languages, respectively; HBase emerged because of the need for a table store to provide a more interactive face – although as a very sparse, rudimentary column store, it does not provide the efficiency of an optimized SQL database (or the extreme performance of some columnar variants). Sqoop in turn provides a way to pipeline SQL data into Hadoop, a use case that will grow more common as organizations look to Hadoop to provide scalable and cheaper storage than commercial SQL. While these Hadoop subprojects did not exactly make Hadoop look like SQL, they provided building blocks that many of this week’s announcements leverage.
Progress marches on
One train of thought is that if Hadoop can look more like a SQL database, more operations could be performed inside Hadoop. That’s the theme behind Informatica’s long-awaited enhancement of its PowerCenter transformation tool to work natively inside Hadoop. Until now, PowerCenter could extract data from Hadoop, but the extracts would have to be moved to a staging server where the transformation would be performed for loading to the familiar SQL data warehouse target. The new offering, PowerCenter Big Data Edition, now supports an ELT pattern that uses the power of MapReduce processes inside Hadoop to perform transformations. The significance is that PowerCenter users now have a choice: load the transformed data to HBase, or continue loading to SQL.
There is growing support for packaging Hadoop inside a common hardware appliance with Advanced SQL. EMC Greenplum was the first out of the gate with DCA (Data Computing Appliance) that bundles its own distribution of Apache Hadoop (not to be confused with Greenplum MR, a software-only product that is accompanied by a MapR Hadoop distro). Teradata Aster has just joined the fray with Big Analytics Appliance, bundling the Hortonworks Data Platform Hadoop; this move was hardly surprising given their growing partnership around HCatalog, an enhancement of the SQL-like Hive metadata layer of Hadoop that adds features such as a cost optimizer and RESTful interfaces that make the metadata accessible without the need to learn MapReduce or Java. With HCatalog, data inside Hadoop looks like another Aster data table.
Not coincidentally, there is a growing array of analytic tools that are designed to execute natively inside Hadoop. For now they are from emerging players like Datameer (providing a spreadsheet-like metaphor, and which just announced an app store-like marketplace for developers), Karmasphere (providing an application development tool for Hadoop analytic apps), or a more recent entry, Platfora (which caches subsets of Hadoop data in memory with an optimized, high performance fractal index).
Yet, even with Hadoop analytic tooling, there will still be a desire to disguise Hadoop as a SQL data store, and not just for data mapping purposes. Hadapt has been promoting a variant where it squeezes SQL tables inside HDFS file structures – not exactly a no-brainer as it must shoehorn tables into a file system with arbitrary data block sizes. Hadapt’s approach sounds like the converse of object-relational stores, but in this case, it is dealing with a physical rather than a logical impedance mismatch.
Hadapt promotes the ability to query Hadoop directly using SQL. Now, so does Cloudera. It has just announced Impala, a SQL-based alternative to MapReduce for querying the SQL-like Hive metadata store, supporting most but not all forms of SQL processing (based on SQL 92; Impala lacks triggers, which Cloudera deems low priority). Both Impala and MapReduce rely on parallel processing, but that’s where the similarity ends. MapReduce is a blunt instrument, requiring Java or other programming languages; it splits a job into multiple concurrent, pipelined tasks where each step along the way reads data, processes it, writes it back to disk, and then passes it to the next task. Conversely, Impala takes a shared nothing, MPP approach to processing SQL jobs against Hive; using HDFS, Cloudera claims roughly 4x performance against MapReduce; if the data is in HBase, Cloudera claims performance multiples up to a factor of 30. For now, Impala only supports row-based views, but with columnar (on Cloudera’s roadmap), performance could double. Cloudera plans to release a real-time query (RTQ) offering that, in effect, is a commercially supported version of Impala.
By contrast, Teradata Aster and Hortonworks promote a SQL MapReduce approach that leverages HCatalog, an incubating Apache project that is a superset of Hive that Cloudera does not currently include in its roadmap. For now, Cloudera claims bragging rights for performance with Impala; over time, Teradata Aster will promote the manageability of its single appliance, and with the appliance has the opportunity to counter with hardware optimization.
The road to SQL/programmatic convergence
Either way – and this is of interest only to purists – any SQL extension to Hadoop will be outside the Hadoop project. But again, that’s an argument for purists. What’s more important to enterprises is getting the right tool for the job – whether it is the flexibility of SQL or raw power of programmatic approaches.
SQL convergence is the next major battleground for Hadoop. Cloudera is for now shunning HCatalog, an approach backed by Hortonworks and partner Teradata Aster. The open question is whether Hortonworks can instigate a stampede of third parties to overcome Cloudera’s resistance. It appears that beyond Hive, the SQL face of Hadoop will become a vendor-differentiated layer.
Part of convergence will involve a mix of cross-training and tooling automation. Savvy SQL developers will cross-train to pick up some of the Java- or Java-like programmatic frameworks that will be emerging. Tooling will help lower the bar, reducing the degree of specialized skills necessary. And for programming frameworks, in the long run, MapReduce won’t be the only game in town. It will always be useful for large-scale jobs requiring brute force, parallel, sequential processing. But the emerging YARN framework, which deconstructs MapReduce to generalize the resource management function, will provide the management umbrella for ensuring that different frameworks don’t crash into one another by trying to grab the same resources. But YARN is not yet ready for primetime – for now it only supports the batch job pattern of MapReduce. And that means that YARN is not yet ready for Impala or vice versa.
Of course, mainstreaming Hadoop – and Big Data platforms in general – is more than just a matter of making it all look like SQL. Big Data platforms must be manageable and operable by the people who are already in IT; they will need some new skills and grow accustomed to some new practices (like exploratory analytics), but the new platforms must also look and act familiar enough. Not all announcements this week were about SQL; for instance, MapR is throwing a gauntlet to the Apache usual suspects by extending its management umbrella beyond the proprietary NFS-compatible file system that is its core IP to the MapReduce framework and HBase, making a similar promise of high performance. On the horizon, EMC Isilon and NetApp are proposing alternatives promising a more efficient file system but at the “cost” of separating the storage from the analytic processing. And at some point, the Hadoop vendor community will have to come to grips with capacity utilization issues, because in the mainstream enterprise world, no CFO will approve the purchase of large clusters or grids that get only 10 – 15% utilization. Keep an eye on VMware’s Project Serengeti.
Big Data platforms must be good citizens in data centers that need to maximize resource utilization (e.g., through virtualization and optimized storage); must comply with existing data stewardship policies and practices; and must fully support existing enterprise data and platform security practices. These are all topics for another day.
Much of the hype around Big Data is that not only are people generating more data, but so are machines. Machine data has always been there – it was traditionally collected by dedicated systems such as network node managers, firewall systems, SCADA systems, and so on. But that’s where the data stayed.
Machine data is obviously pretty low level stuff. Depending on the format of data spewed forth by devices, it may be highly cryptic or may actually contain text that is human intelligible. It was traditionally considered low-density data that was digested either by specific programs or applications or by specific people – typically systems operators or security specialists.
Splunk’s reason for existence is putting this data onto a common data platform, then indexing it to make it searchable as a function of time. The operable notion is that the data could then be shared or correlated across applications, such as the weblogs. Its roots are in the underside of IT infrastructure management systems, where Splunk is often the embedded data engine. An increasingly popular use case is security, where Splunk can reach across network, server, storage, and web domains to provide a picture of exploits that could be end-to-end, at least within the data center.
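The general idea of indexing heterogeneous machine data so that it is searchable as a function of time can be sketched in a few lines of Python. This is not how Splunk is implemented; the TimeIndex class, the whitespace tokenization, and the sample events are assumptions made for illustration only.

```python
import bisect
from collections import defaultdict

class TimeIndex:
    """Toy inverted index whose postings are kept sorted by timestamp."""

    def __init__(self):
        self.postings = defaultdict(list)  # term -> sorted [(timestamp, event_id)]
        self.events = []                   # event_id -> (timestamp, source, text)

    def add(self, timestamp, source, text):
        """Store an event and index each distinct term it contains."""
        event_id = len(self.events)
        self.events.append((timestamp, source, text))
        for term in set(text.lower().split()):
            bisect.insort(self.postings[term], (timestamp, event_id))

    def search(self, term, start, end):
        """Return events containing term with start <= timestamp <= end."""
        postings = self.postings.get(term.lower(), [])
        lo = bisect.bisect_left(postings, (start, -1))
        hi = bisect.bisect_right(postings, (end, len(self.events)))
        return [self.events[event_id] for _, event_id in postings[lo:hi]]

# Hypothetical events from three different kinds of systems.
index = TimeIndex()
index.add(100, "firewall", "DENY tcp from 10.0.0.5")
index.add(105, "web", "GET /checkout 500 error")
index.add(230, "app", "payment error timeout")
hits = index.search("error", 100, 200)
```

Because every source lands in the same index, one query for a term and a time window cuts across firewall, web, and application logs, which is the cross-domain view the security use case relies on.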
There’s been a bit of hype around the company, which IPO’ed earlier this year and reported a strong Q2. Consumer technology still draws the headlines (just look at how much the release of the iPhone 5 drowned out almost all other tech news this week). But given Facebook’s market dive, maybe the turn of events on Wall Street could be characterized as revenge of the enterprise, given the market’s previous infatuation with the usual suspects in the consumer space – mobile devices, social networks, and gaming.
Splunk has a lot of headroom. With machine data proliferating and the company promoting its offering as an operational intelligence platform, Splunk is well-positioned as a company that leverages Fast Data. While Splunk is not split second or deterministic real-time, its ability to build searchable indexes on the fly positions it nicely for tracking volatile environments as they change as opposed to waiting until after the fact (although Splunk can also be used for retrospective historical analysis).
But Splunk faces real growing pains, both up the value chain, and across it.
While Splunk’s heritage is in IT infrastructure data, the company bills itself as being about the broader category of machine data analytics. And there’s certainly lots of it around, given the explosion of sensory devices that are sending log files from all over the place, inside the four walls of a data center or enterprise, and out. There’s The Internet of Things. IBM’s Smarter Planet campaign over the past few years has raised general awareness of how instrumented and increasingly intelligent Spaceship Earth is becoming. Maybe we’re jaded, but it’s become common knowledge that the world is full of sensory points, whether it is traffic sensors embedded in the pavement, weather stations, GPS units, smartphones, biomedical devices, industrial machinery, oil and gas recovery and refining, not to mention the electronic control modules sitting between driver and the powertrain in your car.
And within the enterprise, there may be plenty of resistance to getting the bigger picture. For instance, while ITO owns infrastructure data, marketing probably owns the Omniture logs; yet absent the means to correlate the two, it may not be possible to get the answer on why the customer did or did not make the purchase online.
For a sub $200-million firm, this is all a lot of ground to cover. Splunk knows the IT and security market but lacks the breadth of an IBM to address all of the other segments across national intelligence, public infrastructure, smart utility grids, or healthcare verticals, to name a few. And it has no visibility above IT operations or appdev organizations. Splunk needs to pick its targets.
Splunk is trying to address scale – that’s where the Big Data angle comes in. Splunk is adding some features to increase its scale, with the new 5.0 release adding federated indexing to boost performance over larger bodies of data. But for real scale, that’s where integration with Hadoop comes in, acting as a near-line archive for Splunk data that might otherwise be purged. Splunk offers two forms of connectivity: HadoopConnect, which provides a way to stream and transform Splunk data to populate HDFS, and Shuttl, a slower archival feature that treats Hadoop as a tape library (the data is heavily compressed with GZip). It’s definitely a first step – HadoopConnect, as the name implies, establishes connectivity, but the integration is hardly seamless or intuitive, yet. It uses Splunk’s familiar fill-in-the-blank interface (we’d love to see something more point and click), with the data in Hadoop retrievable, but without Splunk’s familiar indexes (unless you re-import the data back to Splunk). On the horizon, we’d love to see Splunk tackle the far more challenging problem of getting its indexes to work natively inside Hadoop, maybe with HBase.
Then there’s the eternal question of making machine data meaningful to the business. Splunk’s search-based interface today is intuitive to developers and systems admins, as it requires knowledge of the types of data elements that are being stored. But it won’t work for anybody who doesn’t work with the guts of applications or computing infrastructure. And it becomes more critical to convey that message as Splunk is used to correlate log files with higher level sources, such as correlating abandoned shopping carts with underlying server data to see if the missed sale was attributable to system bugs or the buyer changing her mind.
The log files that record how different elements of IT infrastructure perform are, in aggregate, telling a story of how well your organization is serving the customer. Yet the perennial challenge of all systems level management platforms has been conveying the business impact from the events that generated those log files. For those who don’t have to dye their hair gray, there are distant memories of providers like CA, IBM, and HP promoting how their panes of glass displaying data center performance could tell a business message. There’s been the challenge for ITIL adopters to codify the running of processes in the data center to support the business. The list of stillborn attempts to convey business meaning to the underlying operations is endless.
So maybe given the hype of the IPO, the relatively new management team that is in place, and the reality of Splunk’s heritage, it shouldn’t be surprising that we heard two different messages and tones.
From recently-appointed product SVP Guido Schroeder, we heard talk of creating a semantic metadata layer that would, in effect, create de facto business objects. That shouldn’t be surprising, as during his previous incarnation he oversaw the integration of Business Objects into the SAP business. For anyone who has tracked the BI business over the years, the key to success has been creation of a metadata layer that not only codified the entities, but made it possible to attain reuse in ad hoc query and standard reporting. Schroeder and the current management team are clearly looking to take Splunk above IT operations to CIO level.
But attend almost any session at the conference, and the enterprise message was largely missing. That shouldn’t be surprising as the conference itself was aimed at the people who buy Splunk’s tools – and they tend to be down more in the depths of operations.
There were a few exceptions. One of the sessions in the Big Data track, led by Stuart Hirst, CTO of the Australian big data consulting firm Converging Data, communicated the importance of preserving the meaning of data as it moves through the lifecycle. In this case, it was a counter-intuitive pitch to the conventional wisdom of Big Data, which is to ingest the data, then explore and classify it later. As Splunk data is ingested, it is time stamped to provide a chronological record. Although this may be low level data, as you bring more of it together, there should be a record of lineage, not to mention sensitivity (e.g., are customer-facing systems involved).
From that standpoint, the notion of adding a semantic metadata layer atop its indexing sounds quite intuitive – assign higher level meanings to buckets of log data that carry some business process meaning. For that, Splunk would have to rely on external sources – the applications and databases that run atop the infrastructure whose log files it tracks. That’s a tall order and one that will require partners, not to mention the question of how you decide which entities should be defined. Unfortunately, the track record for cross enterprise repositories is not great; maybe there could be some leveraging of MDM implementations around customer or product that could provide some beginning frame of reference.
But we’re getting way, way ahead of ourselves here. Splunk is the story of an engineering-oriented company that is seeking to climb higher up the value chain in the enterprise. Yet, as it seeks to engage higher level people within the customer organization, Splunk can’t afford to lose track of the base that has been responsible for its success. Splunk’s best route upward is likely through partnering with enterprise players like SAP. That doesn’t deal with the question of how to expand out the footprint to follow the footprint of what is called machine data, but then again, that’s a question for another day. First things first, Splunk needs to pick its target(s) carefully.
It’s natural to look back at the passing of Neil Armstrong and conclude that they just don’t make The Right Stuff like they used to. Or maybe in an era of declining expectations, it’s an unusual feeling to get a sense of pride that the U.S. is still able to muster a major accomplishment.
Yet the shots of people standing at 1:30am on a Sunday night/Monday morning in Times Square appeared a throwback to a more innocent, hopeful time. About a month ago, the Mars Science Laboratory Curiosity (MSL Curiosity) made the most improbable of landings on Mars. An eerie freak of timing that America’s greatest space achievement since the landing on the moon coincided within a few weeks of the passing of the man who uttered those words as he took the footsteps from the lunar lander.
We were reminded of this during a keynote from Doug McCuistion, who heads NASA’s Mars Exploration program, at Siemens PLM’s analyst conference last week. It was a fascinating talk, where he gave us background on why we’ve kept going to Mars (40 times over the past 40 years) and rarely succeeded (only 16 missions have made it there).
What are we doing there? It’s the obsession with familiarity: Mars is the closest relative to Earth, from adjacency and similarity (it’s the only terrestrial planet in the neighborhood). And all the surveillance and experiments point to a truism: there but for fortune Mars lost its atmosphere and most of its water. The evidence of water is both black and white – white as in the patches of silica (beach sand) uncovered by the tire tracks of a recent rover, and dark discolorations of sedimentary rock at the foot of Mt. Sharp. The Phoenix lander that visited the Martian pole back in 2008 discovered ice sheets that are several kilometers thick.
McCuistion explained that the series of missions to Mars have followed a logical progression; the Global Surveyor identified old river channels while the Mars Reconnaissance Orbiter has been taking high resolution photos of the entire planet, both of which have been used to select landing spots with greater likelihood for evidence of organics and water.
The dramatic landing of MSL Curiosity was just the latest of a series of high risk maneuvers that the mission endured. As for those Seven Minutes of Terror, it was closer to 10 minutes according to McCuistion, but who’s counting anyway? That’s where the relevance of speaking at a PLM conference came in; McCuistion spoke of the importance of simulation to “buy down” risk to the extent possible (the team used plenty of Siemens modeling tools to optimize component design), because artifacts like the operation of the huge parachute through the Martian atmosphere (which is 10% as dense as Earth’s) could not be physically tested. Simulation helped the team optimize and in some cases completely change the designs or plans for the plutonium power module and guided instruments. As for the unusual descent, it was dismissed out of hand until all the options were weighed.
While hardly the only game in town, the Mars Exploration Program has replaced manned spaceflight as the public face of NASA. And to its credit, NASA marketed this mission extremely well, having a comprehensive web strategy replete with Twitter and Facebook feeds, partnering arrangements with games providers like Angry Birds, and staging the spectacle of live viewing in Times Square. Just think, if the touchdown had occurred at a more civilized hour, imagine the size of the crowds. It was an all-too-rare moment of feeling of shared accomplishment – and it wasn’t America’s alone. Technology onboard Curiosity had an unmistakable international pedigree, including a neutron detector from Russia.
The good news is that beyond the images of a shuttered manned spaceflight program, private ventures like SpaceX are starting to fill a void. But SpaceX et al would not be possible had NASA not ventured where no man has gone before (SpaceX didn’t build that, but capitalized on it).
The question is whether, in an era where the national debate is all about cutbacks, we are willing to invest anew in science, math, and engineering education. The Curiosity landing did not have the same global impact as Apollo 11. But would it be too naïve to hope that those Seven Minutes of Terror become the early 21st century’s Sputnik moment?
This guest post comes from Ovum colleague Michael Azoff.
Agile practices have been around for over twenty years. The Agile Manifesto was written a decade after ‘agile’ first emerged (under different names of course; the term Agile was first coined at the 2001 manifesto meeting). There are also plenty of proof points around what works in agile and when to apply it. If you are still asking for agile to prove itself then you are missing where software development has progressed to.
Going back to Waterfall is not an option because it has inherent faults and those faults are visible all around in many failed IT projects. Ultimately, if waterfall is not broken for you then don’t fix it. But you should consider alternatives to waterfall if your software development processes or organization have become dysfunctional; over time, you might find difficulty in recruiting developers for legacy processes, but that’s another issue.
Ken Schwaber, a co-originator of Scrum, has said that only 25% of Scrum deployments succeed. The question then is what happens in the other 75%, the failures. The problem can be examined at three levels of maturity: intra-team agility, extra-team agility, and business agility.
Teams may not be perfectly pure about their agile adoption, and we can get into discussions, as Jeff Sutherland has, about Scrum But scenarios (i.e. Scrum, but without some Scrum practices). But there comes a point where the team’s partial adoption of Scrum leads to failure. It could also be that cultural impediments prevent certain agile practices from taking root: a highly hierarchical organization will be antithetical to the practice of self-organizing agile teams, for example.
The interface between the business and an agile team can harbor impediments. For example, processes on the business side may have originally evolved around supporting waterfall processes and constrain a team that has transitioned to agile. In this scenario, failure of agile is now a problem that spans beyond intra-team agile adoption and across the business-IT interface.
The biggest challenge and opportunity is with the organization as a whole: Can the business transform its agility? Can the business become agile and thereby make the agile IT department an integral part of the business, rather than a department in the basement that no executive visits? Today, many major businesses are essentially IT businesses and divorcing the IT team from the business becomes a serious handicap – witness successful businesses in technology, financial services, retail and more, where IT and the business are integral and are agile about it.
There is no magic recipe for agile adoption and it is seen in practice that the most successful agile transformation is one where the team goes through a learning process of self-discovery. Introducing agile practices, using trial and error, learning through experience, seeing what works and what does not, allows the team to evolve its agility and fit it to the constraints of the organization culture.
Organizations need support, training, and coaching in their agile transformation, but the need for business agility is greater the larger the scale of the IT project. Large scale agile projects can be swamped by business waterfall processes that impede their agility at levels above core software development. Interestingly, there are cases where agility at the higher levels is introduced and succeeds, while intra-team processes remain waterfall. There is no simple ‘right’ way to adopt agile. It all depends on the individual cases, but as long as we are agile about agile adoption, then we can avoid agile failure, or at least improve on what went before. Failure in adopting agile is not about giving up on agile, but re-thinking the problem and seeing what can be improved, incrementally.
Data warehousing and analytics have accumulated a reasonably robust set of best practices and methodologies since they emerged in the mid-1990s. Although not all enterprises are equally vigilant, the state of practices around data stewardship (e.g., data quality, information lifecycle management, privacy and security) is pretty mature.
With the emergence of Big Data and new analytic data platforms that handle different kinds of data, such as Hadoop, the obvious question is whether these practices still apply. Admittedly, not all Hadoop use cases have been for analytics, but arguably, the bulk of early implementations are. That reality is reinforced by how most major IT data platform household brands have positioned Hadoop: EMC Greenplum, HP Vertica, Teradata Aster and others paint a picture that Hadoop is an extension of your [SQL] enterprise data warehouse.
That provokes the following question: if Hadoop is an extension of your data warehouse or analytic platform environment, should the same data stewardship practices apply?
We’ll train our focus on quality. Hadoop frees your analytic data store of limits, both to quantity of data and structure, which were part and parcel of maintaining a traditional data warehouse. Hadoop’s scalability frees your organization to analyze all of the data, not just a digestible sample of it. And not just structured data or text, but all sorts of data whose structure is entirely variable. With Hadoop, the whole world’s an analytic theatre.
Significantly, with the spotlight on volume and variety, the spotlight has been off quality. The question is, with different kinds and magnitudes of data, does data quality still matter? Can you afford to cleanse multiple terabytes of data? Is “bad data” still bad?
The answers aren’t obvious. Traditional data warehouses treated “bad” data as something to be purged, cleansed, or reconciled. While the maxim “garbage in, garbage out” has been with us since the dawn of computing, the issue of data quality hit the fan when data warehouses provided the opportunity to aggregate more, and more diverse, sources of data that were not necessarily consistent in completeness, accuracy, or structure. The fix was cleansing record by record based on the proposition that analytics required strict apples to apples comparisons.
Yet the volume and variety of Hadoop data cast doubt on the practicality of traditional data hygiene practice. Remediating record by record will take forever, and anyway, it’s simply not going to be practical – or worthwhile – to cleanse log files which are highly variable (and low value) by nature. The variety of data, not only by structure, but also source, makes it more difficult to know what is the correct structure and form of any individual record. And the fact that individual machine data readings are often cryptic and provide little value except when aggregated at huge scale also militates against traditional practice.
So now Hadoop becomes a special case. However, given that Hadoop also supports a different approach to analytics, by reason, data should also be treated differently.
Exact Picture or Big Picture?
Quality in Hadoop becomes more of a broad spectrum of choice that depends on the nature of the application and the characteristics of the data – specifically, the 4 V’s. Is your application mission-critical? That might augur for a more vigilant practice of data quality, but that depends on whether the application requires strict audit trails and carries regulatory compliance exposure. In those cases, better get the data right. However, web applications such as search engines or ad placement may also be mission-critical but not necessarily bring the enterprise to its knees if the data is not 100% correct.
So you’ve got to ask yourself the question: are you trying to get the big picture, or the exact one? In some cases, they may be different.
The nature of data in turn determines the practicality of cleansing strategies. More volume dictates against traditional record-by-record approaches, variety makes the job of cleansing more difficult, while high velocity makes it virtually impossible. For instance, high throughput complex event processing (CEP)/data streaming applications are typically implemented for detecting patterns that drive operational decisions; cleansing would add too much processing overhead, especially for high-velocity/low latency apps. Then there’s the question of data value; there’s more value in a customer identity record than in an individual reading that is the output of a sensor.
A spectrum of data hygiene approaches
Enforcing data quality is not impossible in Hadoop. There are different approaches that, depending on the nature of the data and application, may dictate different levels of cleansing or none at all.
A “crowdsourcing” approach widens the net of data collection to a larger array of sources with the notion that enough good data from enough sources will drown out the noise. In actuality, that’s been the de facto approach that has been taken with early adopters, and it’s a fairly passive one. But such approaches could be juiced up with trending analytics that dynamically track the sweet spot of good data to see if the norm is drifting.
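As a rough illustration of tracking whether the norm is drifting, here is a minimal Python sketch. It is only one of many possible approaches, and everything in it is an assumption for illustration: the function names, the use of the median as the "sweet spot," the fixed non-overlapping windows, and the tolerance threshold. The median is used because a lone wild reading barely moves it, which mirrors the idea of good data drowning out the noise.

```python
from statistics import median


def window_medians(readings, window_size):
    """Median of each consecutive, non-overlapping window of readings."""
    return [
        median(readings[i:i + window_size])
        for i in range(0, len(readings) - window_size + 1, window_size)
    ]


def drifting_windows(readings, window_size, tolerance):
    """Indexes of windows whose median strays from the first window's
    median by more than `tolerance` (hypothetical threshold)."""
    medians = window_medians(readings, window_size)
    if not medians:
        return []
    baseline = medians[0]
    return [i for i, m in enumerate(medians) if abs(m - baseline) > tolerance]


# Illustrative readings: one wild outlier (500) in the second window,
# then a genuine upward shift in the third window.
readings = [10, 11, 9, 10, 500, 10, 11, 10, 14, 15, 16, 15]
print(window_medians(readings, 4))
print(drifting_windows(readings, 4, tolerance=2))
```

With these made-up readings, the outlier does not trigger a flag, while the sustained shift in the last window does.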
Another idea is unleashing the power of data science, not only to connect the dots, but also correct them. We’re not suggesting that you turn your expensive (and rare) data scientists into data QA techs, but that you apply the same methodologies used for exploration to dynamically track quality. Other variants apply cleansing logic, not at the point of data ingestion, but at consumption; that’s critical for highly-regulated processes, such as assessing counter-party risk for capital markets. In one particular case, an investment bank used a rules-based, semantic domain model using the OMG’s Common Warehouse Metamodel as a means for validating data consumed.
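To make the idea of cleansing at consumption rather than ingestion concrete, here is a small Python sketch. It is not the investment bank's model and has nothing to do with the OMG specification; the field names (`counterparty_id`, `exposure`), the rules, and the function names are all hypothetical. The point is simply that raw records are stored untouched, and the rules run only when a consumer reads them.

```python
# Hypothetical validation rules, keyed by field name.
RULES = {
    "counterparty_id": lambda v: isinstance(v, str) and v != "",
    "exposure": lambda v: isinstance(v, (int, float)) and v >= 0,
}


def violations(record, rules=RULES):
    """Names of fields that are missing or fail their rule."""
    return [
        field for field, ok in rules.items()
        if field not in record or not ok(record[field])
    ]


def consume(raw_records, rules=RULES):
    """Split raw records into (valid, rejected) at read time.
    The stored records themselves are never modified."""
    valid, rejected = [], []
    for record in raw_records:
        failed = violations(record, rules)
        if failed:
            rejected.append((record, failed))
        else:
            valid.append(record)
    return valid, rejected


raw = [
    {"counterparty_id": "CP-1", "exposure": 1000.0},
    {"counterparty_id": "", "exposure": 50},
    {"counterparty_id": "CP-3"},
]
valid, rejected = consume(raw)
print(len(valid), "valid;", [failed for _, failed in rejected])
```

Because the rejects are kept alongside the reasons they failed, a different consumer with looser requirements could read the same raw data under a different rule set.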
Bad Data may be good
Big Data in Hadoop may be different data, and may be analyzed differently. The same logic applies to “bad data” that in conventional terms appears as outlier, incomplete, or plain wrong. The operable question of why the data may be “bad” may yield as much value as analyzing data within the comfort zone. It’s the inverse of analyzing the drift over time of the sweet spot of good data. When there’s enough bad data, that makes it fair game for trending to check whether different components or pieces of infrastructure are drifting off calibration, or if the assumptions on what constitutes “normal” conditions are changing: rising sea levels or typical daily temperature swings, for instance. Similar ideas could apply to human-readable data, where perceived outliers reflect flawed assumptions on the meaning of data, such as when conducting sentiment analysis. In Hadoop, bad data may be good.
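One hedged way to picture trending the bad data itself: instead of discarding out-of-range or missing readings, count their share per time window and watch whether that share keeps climbing. The sketch below is illustrative only; the "normal" range, the window size, the function names, and the sample readings are all assumptions.

```python
def bad_share_by_window(readings, low, high, window_size):
    """Fraction of readings in each non-overlapping window that are
    missing (None) or outside the assumed normal range [low, high]."""
    shares = []
    for i in range(0, len(readings) - window_size + 1, window_size):
        window = readings[i:i + window_size]
        bad = sum(1 for r in window if r is None or not (low <= r <= high))
        shares.append(bad / window_size)
    return shares


def is_rising(shares):
    """True if the bad share never falls from one window to the next
    and ends higher than it started."""
    return (
        len(shares) > 1
        and all(a <= b for a, b in zip(shares, shares[1:]))
        and shares[-1] > shares[0]
    )


# Made-up sensor readings with an assumed normal range of 0..50.
readings = [20, 21, None, 22, 20, 95, 21, -40, 99, 98, 21, None]
shares = bad_share_by_window(readings, low=0, high=50, window_size=4)
print(shares, is_rising(shares))
```

A steadily rising share would be a prompt to ask why: a sensor drifting off calibration, or a "normal" range that no longer matches reality.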
Hadoop remains a difficult platform for most enterprises to master. For now skills are still hard to come by – both for data architects and engineers, and especially for data scientists. It still takes too much skill, tape, and baling wire to get a Hadoop cluster together. Not every enterprise is Google or Facebook, with armies of software engineers that they can throw at a problem. With some exceptions, most enterprises don’t deal with data on the scale of Google or Facebook either – but the bar is rising.
If 2011 was the year that the big IT data warehouse and analytic platform brand names discovered Hadoop, 2012 becomes the year where a tooling ecosystem starts emerging to make Hadoop more consumable for the enterprise. Let’s amend that – along with tools, Hadoop must also become a first-class citizen with enterprise IT infrastructure. Hadoop won’t cross over to the enterprise if it has to be treated as some special island. That means meshing with the practices and technology approaches that enterprises are using to manage their data centers or cloud deployments. Like SQL, data integration, virtualization, storage strategy, and so on.
Admittedly, much of this cuts against the grain of early Hadoop deployment that stressed open source and commodity infrastructure. Early adopters did so out of necessity as commercial software ran out of gas for Facebook when its data warehouse daily refreshes were breaking terabyte range, not to mention that the cost of commercial licenses for such scaled out analytic platforms wouldn’t have been trivial. Anyway, Hadoop’s linearity leverages scale out of commodity blades and direct attached disk as far as the eye can see, enabling such an almost pure noncommercial approach. At the time, Google’s, Yahoo’s, and Facebook’s issues were considered rather unique – most enterprises don’t run global search engines – not to mention that their business was built on armies of software engineers.
As we’ve previously noted, something’s got to give on the skills front. Hadoop in the enterprise faces limits – the data problems are getting bigger and more complex for sure, but resources and skills are far more finite. So we envision tools and solutions addressing two areas:
1. Products that address “clusterophobia” – organizations that seek the scalable analytics of Hadoop but lack the appetite to erect infinite data centers out in the fields or hire the necessary skillsets. Obviously, using the cloud is one option – but the questions there revolve around whether corporate policies allow maintenance of data off premises, and also, as data store size grows, whether the cloud is still economical.
2. The other side of the coin is consumability – tools that simplify access to and manipulation of the data.
In the run-up to this year’s Hadoop Summit, a number of tooling announcements addressing clusterophobia and consumption are pouring out.
On the fear of clusters side, players like Oracle, EMC Greenplum, and Teradata Aster are already offering appliances that simplify deployment of Hadoop, typically in conjunction with an Advanced SQL analytic platform. While most vendors position this as a way for Hadoop to “extend” your data warehouse, so that you perform exploration in Hadoop but the serious analytics in SQL, we view appliances as more than a transitional strategy; the workloads are going to get more equitably distributed, and in the long run, we wouldn’t be surprised to see more Hadoop-only appliances, sort of like Oracle’s (for the record, they also bundle another NoSQL database).
Also addressing the same constituency are storage and virtualization – facts of life in the data center. For Hadoop to cross over to the enterprise, it, too, must get virtualization-friendly; storage is an open question. The need for virtualization becomes even more apparent because (1) the exploratory nature of Hadoop analytics demands the ability to try out queries offline without having to disrupt or physically build a new cluster; and (2) the variable nature of Hadoop processing suggests that workloads are likely to be elastic. So we’ve been waiting for VMware to make their move. VMware – also part of EMC – has announced a pair of initiatives. First, they are working with the Apache Hadoop project to make the core pieces (HDFS and MapReduce) virtualization-aware, and separately, they are hosting their own open source project (Serengeti) for virtualizing Hadoop clusters. While Project Serengeti is not VM-specific, there’s little doubt that this will be a VMware project (we’d be shocked if the Xen folks were to buy in).
Where there are virtualized servers, storage often closely follows. A few months back, EMC dropped the other shoe, finally unveiling a strategy for leveraging Isilon with the Greenplum HD platform, the closest thing in NAS that replicates the scale-out storage model popularized with Hadoop. This opens an argument of whether the scales of data in Hadoop make premium products such as Isilon unaffordable; the flip side however is the “open source tax,” where you hire the skills in your IT organization to manage and deploy scale-out storage, or pay consultants to do it for you.
In the spirit of making Hadoop more consumable, we expect a lot of vibes from new players that are simplifying navigation of Hadoop and building SQL bridges. Datameer is bringing down the pricing of its uber Hadoop spreadsheet to personal and workgroup levels courtesy of entry level pricing from $299 to $2999. Teradata Aster, which already offers a patented framework that translates SQL to MapReduce (there are also others out there), is now taking an early bet on the incubating Apache HCatalog metadata spec so that you could write SQL statements that go up against Hadoop. It joins approaches such as those from Hadapt, which hangs SQL tables from HDFS file nodes, and mainstream BI players such as Jaspersoft, that already provide translators that can grab reports directly from Hadoop.
This doesn’t take away from the evolution of the Hadoop platform itself; Cloudera and Hortonworks are among those releasing new distributions that bundle their own mix of recent and current Apache Hadoop modules. While the Apache project has addressed the NameNode HA issue, it is still early in the game with bringing enterprise-grade manageability to MapReduce. That’s largely an academic issue as the bulk of enterprises have yet to implement Hadoop; by the time enterprises are ready, many of the core issues should resolve — although there will always be questions about the uptake of peripheral Hadoop projects.
What’s more important – and where the action will be – is in tools that allow enterprises to run and, more importantly, consume Hadoop. A chicken and egg situation, enterprises won’t implement before tools are available and vice versa.
Note: If you’re in San Jose, we invite you to join us at Hadoop Summit to catch our presentation Hadoop – Do Data Warehousing Rules Apply on Thursday morning at 10:30.