Category Archives: Database

Strata 2015 Post Mortem: Sparking expectations for Smart, Fast Applications

A year ago, Turing award winner Dr. Michael Stonebraker made the point that, when you try managing more than a handful of data sets, manual approaches run out of gas and the machine must come in to help. He was referring to the task of cataloging data sets in the context of capabilities performed by his latest startup, Tamr. If your typical data warehouse or data mart involves three or four data sources, it’s possible for you to get your head around figuring out the idiosyncrasies of each data set and how to integrate them for analytics.

But push that number to dozens, if not hundreds or thousands of data sets, and any human brain is going to hit the wall — maybe literally. And that’s where machine learning first made big data navigable, not just to data scientists, but to business users. Introduced by Paxata, and since then, through a long tail of startups and household names, these tools applied machine learning to help the user wrangle data through a new kind of iterative process. Since then, analytic tools such as IBM’s Watson Analytics are employing machine learning to help end users perform predictive analytics.

Walking the floor of last week’s Strata Hadoop World in New York, we saw machine learning powering “emergent” approaches to building data warehouses. Infoworks monitors what data end users are targeting for their queries by taking a change data capture-like approach to monitoring logs; but instead of just tracking changes (which is useful for data lineage), it deduces the data model and builds OLAP cubes. Alation, another startup, uses a similar approach for crawling data sets to build catalogs with Google-like PageRanks showing which tables and queries are the most popular. It’s supplemented with a collaboration environment where people add context, and a natural language query capability that browses the catalog.

Just as machine learning is transforming the data transformation process to help business users navigate their way through big data, it’s also starting to provide the intelligence to help business users become more effective with exploratory analytics. While over the past couple years, interactive SQL was the most competitive battle for Hadoop providers — enabling established BI tools to treat Hadoop as simply a larger data warehouse — machine learning will become essential to helping users become productive with exploratory analytics on big data.

What makes machine learning possible within an interactive experience is the emerging Spark compute engine. Spark is what’s turning Hadoop from a Big Data platform to a Fast Data one. By now, every commercial Hadoop distro includes a Spark implementation, although which Spark engines (e.g., SQL, Streaming, Machine Learning, and Graph) still varies by vendor. A few months back IBM declared it would invest $300 million and dedicate 3500 developers to Spark machine learning product development, followed by Cloudera’s announcement of a One Platform initiative to plug Spark’s gaps.

And so our attention was piqued by Netflix’s Strata session on running Spark at petabyte scale. Among Spark’s weaknesses is that it hasn’t consistently scaled over a thousand nodes, and is not known for high concurrency. Netflix’s data warehouse currently tops out at 20 petabytes and serves roughly 350 users (we presume, technically savvy data scientists and data engineers). Spark is still in its infancy at Netflix; while workloads are growing, they are not at a level that would merit a dedicated cluster (Netflix runs its computing in the Amazon cloud, on S3 storage). Many of the Spark workloads are for streaming, run under YARN. And that leads to a number of issues showing that at high scale, and high concurrency, Spark is a work in progress.

Netflix is working on several issues in scaling Spark, including adding caching steps to accelerate loading of large data sets. Related to that is reducing the latency of retrieving large metadata sets (“list calls”) that are often associated with large data sets; Netflix is working on an optimization that would apply to Amazon’s S3. Another scaling issue relates to file scanning (Spark normally scans all Hive tables when a query is first run); Netflix has designed a workaround to push down predicate processing so queries only scan relevant tables.
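
To make the file-scanning point concrete, here is a minimal Python sketch of the general idea behind pushing a predicate down so that only matching partitions are read. It is not Netflix’s workaround or Spark’s implementation; the partition catalog, the `s3://bucket/...` paths, and the `prune_partitions` helper are all hypothetical names invented for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical catalog: partition key (a date) -> files stored under that partition.
PARTITIONS: Dict[str, List[str]] = {
    "2015-09-28": ["s3://bucket/events/dt=2015-09-28/part-0"],
    "2015-09-29": [
        "s3://bucket/events/dt=2015-09-29/part-0",
        "s3://bucket/events/dt=2015-09-29/part-1",
    ],
    "2015-09-30": ["s3://bucket/events/dt=2015-09-30/part-0"],
}


def full_scan(partitions: Dict[str, List[str]]) -> List[str]:
    """Without pushdown: every file is listed and read, then filtered afterwards."""
    return [path for files in partitions.values() for path in files]


def prune_partitions(
    partitions: Dict[str, List[str]], predicate: Callable[[str], bool]
) -> List[str]:
    """With pushdown: the predicate is applied to partition keys first,
    so files in non-matching partitions are never touched."""
    return [
        path
        for key, files in partitions.items()
        if predicate(key)
        for path in files
    ]


all_files = full_scan(PARTITIONS)
wanted_files = prune_partitions(PARTITIONS, lambda dt: dt == "2015-09-29")
```

The saving comes from the difference between `all_files` and `wanted_files`: the fewer partitions that survive the predicate, the fewer list calls and file reads the query needs.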

For most business users, the issue of Spark scaling won’t be relevant as their queries are not routinely expected to involve multiple petabytes of data. But for Spark to reach its promise for supplanting MapReduce for iterative, complex, data-intensive workloads, scale will prove an essential hurdle. We have little doubt that the sizable Spark community will rise to the task. But the future won’t necessarily be all Spark all the time. Keep your eye out for the Apex streaming project; it’s drawn some key principals who have been known for backing Storm.

MongoDB widens its sights

MongoDB has passed several key watershed events over the past year, including a major redesign of its core platform and a strategic shift in its management team. By now, the architectural transition is relatively old news; as we noted last winter, MongoDB 3.0 made the storage engine pluggable. So voila! Just like MySQL before it, Mongo becomes whatever you want it to be. Well eventually, anyway, but today there’s the option of substituting the more write-friendly WiredTiger engine, and in the near future, an in-memory engine now in preview could provide an even faster write-ahead cache to complement the new overcaffeinated tiger. And there are likely other engines to come.

From a platform and market standpoint, the core theme is Mongo broadening its aim. Initially, it will be through new storage engines that allow Mongo to be whatever you make of it. MongoDB has started the fray with WiredTiger and the new in-memory data store, but with publishing of the API, there are opportunities for other engines to plug in. At MongoDB’s user conference, we saw one such result – the RocksDB engine developed at Facebook for extremely I/O-intensive transactions involving log data. And as we’ve speculated, there’s nothing to stop other storage engines like SQL from plugging in.

Letting a thousand flowers bloom
Analytics is an example where Mongo is spreading its focus. While Mongo and other NoSQL data stores are typically used for operational applications requiring fast reads and/or writes, for operational simplicity, there is also growing demand for in-line analytics. Why move data to a separate data warehouse, data mart, or Hadoop if it can be avoided? And why not embed some analytics with your operational applications? This is hardly an outlier – a key selling point for the latest generations of Oracle and SAP applications is the ability to embed analytics with transaction processing. Analytics evolves from after-the-fact to an inline process that is part of processing a transaction. Any real-time customer facing or operational process is ripe for analytics that can prompt inline decisions for providing next-best offers or tweaking the operation of an industrial process, supply chain, or the delivery of a service. And so a growing number of MongoDB deployments are adding analytics to the mix.

It’s almost a no-brainer for SQL BI tools to target JSON data per se because the data has a structure. (Admittedly, this is assuming the data is relatively clean, which in many cases is not a given.) But by nature, JSON has a more complex and potentially richer structure than SQL tables in the degree that the data is nested. Yet most SQL tools do away with the nesting and hierarchies that are stored in JSON documents, “flattening” the structure into a single column.
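
A small Python sketch may help show what is lost. It assumes a made-up order document and two hypothetical helpers: `flatten`, which mimics the kind of flattening described above by collapsing an array into a single string column, and `unwind`, which keeps each nested element queryable. Neither represents any particular BI tool or MongoDB feature.

```python
import json
from typing import Any, Dict, List

# Made-up document with a nested object and a nested array.
order: Dict[str, Any] = {
    "customer": {"name": "Ada", "city": "London"},
    "items": [
        {"sku": "A1", "qty": 2},
        {"sku": "B7", "qty": 1},
    ],
}


def flatten(doc: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Collapse nested objects into dotted column names.
    Arrays are serialized into one opaque string column, losing their structure."""
    row: Dict[str, Any] = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, name + "."))
        elif isinstance(value, list):
            row[name] = json.dumps(value)
        else:
            row[name] = value
    return row


def unwind(doc: Dict[str, Any], array_field: str) -> List[Dict[str, Any]]:
    """Emit one record per array element, carrying the parent fields along,
    so the nested values stay individually addressable."""
    parent = {k: v for k, v in doc.items() if k != array_field}
    return [{**parent, array_field: element} for element in doc[array_field]]


flat_row = flatten(order)
item_rows = unwind(order, "items")
total_qty = sum(row["items"]["qty"] for row in item_rows)
```

In `flat_row`, the line items survive only as a string, so a question like “how many units were ordered?” needs re-parsing. In `item_rows`, the same question is a simple aggregate.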

We’ve always wondered when analytic tools would wake up to the potential of querying JSON natively – at least, not flattening the structure, but incorporating that information when processing the query. The upcoming MongoDB 3.2 release will add a new connector to BI and visualization tools that will push down analytic processing into MongoDB, rather than require data to be extracted first to populate an external data mart or data warehouse for the analytic tool to target. But this enhancement is not as much about enriching the query with information pertaining to the JSON schema; it’s more about efficiency, eliminating data transport.

But some emerging startups are looking to address that JSON native query gap. jSonar demonstrated SonarW, a columnar data warehouse engine that plugs into the Mongo API, with a key difference: it provides metadata that gives a logical representation of the nested and hierarchical relationships. We saw a reporting tool from Slamdata that applies similar context to the data, based on patent-pending algorithms that apply relational algebra to slicing, dicing, and aggregating deeply nested data.

Who says JSON data has to be dirty?
A key advantage of NoSQL data stores, like Mongo, is that you don’t have to worry about applying strict schema or validation (e.g., ensuring that the database isn’t sparse and that the data in the fields is not gibberish). But there’s nothing inherent to JSON that rules out validation and robust data typing. MongoDB will be introducing a tool supporting schema validation for those use cases that demand it, plus a tool for visualizing the schema to provide a rough indication of unique fields and unique data (e.g., cardinality) within these fields. While maybe not a full-blown data profiling capability, it is a start.
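
As a rough illustration of what validation and a cardinality-style profile can look like for JSON documents, here is a minimal Python sketch. It is not MongoDB’s tooling; the `SCHEMA` dictionary, the sample documents, and the `validate` and `field_cardinality` helpers are assumptions made up for this example.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Hypothetical schema: required field name -> expected Python type.
SCHEMA: Dict[str, type] = {"name": str, "age": int}

# Made-up documents: one clean, one with a wrong type, one with a missing field.
DOCS: List[Dict[str, Any]] = [
    {"name": "Ada", "age": 36},
    {"name": "Grace", "age": "unknown"},
    {"name": "Ada"},
]


def validate(doc: Dict[str, Any], schema: Dict[str, type]) -> List[str]:
    """Return a list of problems; an empty list means the document passes."""
    errors: List[str] = []
    for field, expected in schema.items():
        if field not in doc:
            errors.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected):
            actual = type(doc[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
    return errors


def field_cardinality(docs: List[Dict[str, Any]]) -> Dict[str, int]:
    """Count distinct values seen per field, as a rough profile of the data."""
    seen: Dict[str, set] = defaultdict(set)
    for doc in docs:
        for field, value in doc.items():
            seen[field].add(repr(value))
    return {field: len(values) for field, values in seen.items()}


problems = [validate(doc, SCHEMA) for doc in DOCS]
profile = field_cardinality(DOCS)
```

The design choice here is to report problems rather than reject documents outright, which fits use cases where validation is wanted only selectively.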

Breaking the glass ceiling
The script for MongoDB has been familiar up ‘til now. The entrepreneurial startup whose product has grown popular through grassroots appeal. The natural trajectory for MongoDB is to start engaging the C-level and the business, who write larger checks. A decade ago, MySQL played this role. It was kind of an Oracle or SQL Server Lite that was less complex than its enterprise cousins. That’s been very much MongoDB’s appeal. But with making the platform more extensible, MongoDB creates a technology path to grow up. Can the business grow with it?

Over the past year MongoDB’s upper management team has largely been replaced; the CEO, CMO, and head of sales are new. It’s the classic story of startup visionaries, followed by those experienced at building the business. President and CEO Dev Ittycheria, most recently from the venture community, previously took BladeLogic public before eventually selling to BMC for $900 million in 2008. Its heads of sales and marketing come from similar backgrounds and long track records. While MongoDB is clearly not slacking off on product development, it is plowing much of its capitalization into building out the go-to-market.

The key challenge facing Mongo, and all the new data platform players, is where (or whether) they will break the proverbial glass ceiling. There are several perspectives to this challenge. For open source players like MongoDB, it is determining where the value-add lies. It’s a moving target; while functions that make a data store enterprise grade, such as data governance, management, and security, were traditionally unique to the vendor and platform, open source is eating away at them. Just look at the Hadoop world where there’s Ambari, while Cloudera and IBM offer their own either as core or optional replacement. So this dilemma is hardly unique to MongoDB. Our take is that lowest common denominator cannot be applied to governance, security, or management, but it will become a case where platform players, like MongoDB, must branch out and offer related value-add such as optimizations for cloud deployment, information lifecycle management, and so on.

Such a strategy of broadening the value-add grows even more important given market expectations for pricing; in essence, coping with the “I’m not going to pay a lot for this muffler” syndrome. The expectation with open source and other emerging platforms is that enterprises are unwilling, or lack the budget, to pay the types of license fees customary with established databases and data warehouse systems. Yes, the land and expand value is critical for the likes of MongoDB, Cloudera, Hortonworks and others for growing revenues. They may not replace the Oracles or Microsofts of the world, but they are angling to be the favorite for new generation applications supplementing what’s already on the back end (e.g., customer experience, enhancing and working alongside classical CRM).

Land and expand into the enterprise, and broadening from data platform to data management are familiar scripts. Even in an open source, commodity platform world, these scripts will remain as important as ever for MongoDB.

Hortonworks evens the score

Further proof that Hadoop competition is going up the stack toward areas such as packaged analytics, security, and data management and integration can be seen from Hortonworks’ latest series of announcements today – refresh of the Hortonworks Data Platform with Ambari 2.0 and the acquisition of cloud deployment automation tool SequenceIQ.

Specifically, Ambari 2.0 provides much of the automation previously missing, such as automating rolling updates, restarts, Kerberos authentications, alerting and health checks, and so on. Until now, automation of deployment, monitoring and alerting, root cause diagnosis, and authentications was a key differentiator for Cloudera Manager. While Hadoop systems management may not be a done deal (e.g., updating to major new dot zero releases is not yet a lights-out operation), the basic blocking and tackling is no longer a differentiator; any platform should have these capabilities. The recent debut of the Open Data Platform – where IBM and Pivotal are leveraging the core Hortonworks platform as the starting point for their Hadoop distributions – is further evidence. Ambari is the cornerstone of all implementations, although IBM will still offer a more “premium” value-add with options such as Platform Symphony and Adaptive MapReduce.

Likewise, Hortonworks’ acquisition of SequenceIQ is a similar move to even the score with Cloudera Director. Both handle automation of cloud deployment with policy-based elastic scaling (e.g., when to provision or kill compute nodes). The comparison may not yet be apples-to-apples; for instance, Cloudera Director has been a part of the Cloudera enterprise platform (the paid edition) since last fall, whereas the ink is just drying on the Hortonworks acquisition of SequenceIQ. And, while SequenceIQ’s product, Cloudbreak, is cloud infrastructure-agnostic and Cloudera Director right now only supports Amazon, that too will change.
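
For readers unfamiliar with policy-based elastic scaling, the following Python sketch shows the basic shape of such a rule: a policy of thresholds and bounds decides when to provision or kill a compute node. It does not reflect how Cloudbreak or Cloudera Director actually work; the `ScalingPolicy` fields, the threshold values, and the `desired_nodes` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    """Hypothetical policy: utilization thresholds plus cluster size bounds."""
    scale_up_above: float = 0.80    # add a node when utilization exceeds this
    scale_down_below: float = 0.30  # remove a node when utilization drops below this
    min_nodes: int = 3
    max_nodes: int = 20


def desired_nodes(current: int, utilization: float, policy: ScalingPolicy) -> int:
    """Return the node count the policy asks for, one step at a time."""
    if utilization > policy.scale_up_above:
        return min(current + 1, policy.max_nodes)
    if utilization < policy.scale_down_below:
        return max(current - 1, policy.min_nodes)
    return current


policy = ScalingPolicy()
busy = desired_nodes(current=5, utilization=0.90, policy=policy)
idle = desired_nodes(current=5, utilization=0.10, policy=policy)
steady = desired_nodes(current=5, utilization=0.50, policy=policy)
```

Stepping by one node and clamping to `min_nodes` and `max_nodes` keeps the sketch simple; a real policy would likely also consider cool-down periods and cost.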

More to the point is where competition is heading – we believe that it is heading from the core platform higher up the value chain to analytic capabilities and all forms of data management – stewardship, governance, and integration. In short, it’s a page out of the playbook of established data warehousing platforms that have had to provide value-add that could be embedded inside the database. Just take a look at Cloudera’s latest announcements: acquisition of Xplain and a strategic investment in Cask. Xplain automates the design, integration, and optimization of data models to reduce or eliminate hurdles to conducting self-service analytics on Hadoop. Cask on the other hand provides hooks for developers to integrate applications with Hadoop – the third way that until now has been overlooked.

As Hadoop graduates from specialized platform for complex, data science computing to an enterprise data lake, the blocking and tackling functions – e.g., systems management and housekeeping – become checklist items. What’s more important is how to manage data, make data and analytics more accessible beyond data scientists and statistical programming experts, and provide the security that is expected of any enterprise-grade platform.

Strata 2015 post mortem: Does the Hadoop market have a glass ceiling?

The move of this year’s west coast Strata HadoopWorld conference to the same venue as Hadoop Summit gave the event a bit of a mano a mano air: who can throw the bigger, louder party?

But show business dynamics aside, the net takeaway from these events is looking at milestones in the development of the ecosystem. Indeed, the brunt of our time was spent “speed dating” with third party tools and applications that are steadily addressing the white space in the Big Data and Hadoop markets. While our sampling is hardly representative, we saw growth, not only from the usual suspects from the data warehousing world, but also from a growing population of vendors who are aiming to package machine learning algorithms, real-time streaming, more granular data security, along with new domains such as entity analytics. Databricks, the company founded by Spark’s creators, announced in a keynote a new DataFrames initiative to make it easier for R and Python programmers accustomed to working on laptops to commandeer and configure clusters to run their computations using Spark.

Immediately preceding the festivities, the Open Data Platform initiative announced its existence, and Cloudera announced its $100 million 2014 numbers – ground we already covered. After the event, Hortonworks did its first quarterly financial call. Depending on how you count, they did nearly $50 million business last year; but the billings, which signify the pipeline, came in at $87 million. Hortonworks closed an impressive 99 new customers in Q4. There’s little question that Hortonworks has momentum, but right now, so does everybody. We’re at a stage in the market where a rising tide is lifting all boats; even the smallest Hadoop player – Pivotal – grew from token revenues to our estimate of $20 million Hadoop sales last year.

At this point, there’s nowhere for the Hadoop market to go but up, as we estimate that the paid enterprise installed base (at roughly 1200 – 1500) is just a fraction of the potential base. In revenue terms, our estimate of $325 million for 2014 (Hadoop subscriptions and related professional services, but not including third party software or services) compares with $10 billion+ for the database market. Given that Hadoop is just a speck compared to the overall database market, what is the realistic addressable market?

Keep in mind that while Hadoop may shift some data warehouse workloads, the real picture is not necessarily a zero sum game, but the commoditization of the core database business. Exhibit One: Oracle’s recent X5 engineered systems announcement designed to meet Cisco UCS at its commodity price point. Yes, there will be some contention, as databases are converging and overlapping, competing for many of the same use cases.

But the likely outcome is that organizations will use more data platforms and grow accustomed to paying more commodity prices – whether that is through open source subscriptions or cloud pay-by-the-drink (or both). The value-add increasingly will come from housekeeping tools (e.g., data security; access control and authentication; data lineage and audit for compliance; cluster performance management and optimization; lifecycle and job management; query management and optimization in a heterogeneous environment).

The takeaway here is that the tasks normally associated with the care and feeding of a database, not to mention the governance of data, grow far more complex when superseding traditional enterprise data with Big Data. So the Hadoop subscription business may only grow so far, but that will be just the tip of the iceberg regarding the ultimate addressable market.

MongoDB grows up

One could say that MongoDB has been at the right place at the right time. When web developers demanded a fast, read-intensive store of complex variably-structured data, the company formerly known as 10Gen came up with a simple engine backed by intuitive developer-friendly tooling. It grew incredibly popular for applications like product catalogs, tracking hierarchical events (like chat strings with responses), and some forms of web content management.

In a sense, MongoDB and JSON became the moral equivalents of MySQL and the LAMP stack, which were popular with web developers who needed an easy-to-deploy transactional SQL database sans all the overhead of an Oracle.

Some things changed. Over the past decade, Internet developers expanded from web to also include mobile developers. And the need for databases has now extended to variably structured data. Enter JSON. It provided that long-elusive answer to providing a simple operational database with an object-like representation of the world without the associated baggage (e.g., polymorphism, inheritance), using a language (JavaScript) and data structure that was already lingua franca with web developers.

Like MySQL, Mongo was known for its simplicity. It had a simple data model, a query framework that was easy for developers to use, and well-developed indexing that made reads very fast. It’s been cited by db-Engines as the fourth most popular database among practitioners.

And like MySQL, MongoDB was not known for its ability to scale (just ask Cassandra fans). For MySQL, engines from outside companies, such as BerkeleyDB from Sleepycat Software and InnoDB, provided a heart transplant that could turn MySQL into a serious database.

Fast forward, and some alumni from Sleepycat Software (which developed BerkeleyDB, later bought by Oracle) founded WiredTiger, ginning out an engine that could add similar scale to Mongo. WiredTiger offers a more write-friendly engine that aggressively takes advantage of compression (that is configurable) to scale and deliver high performance. And it provides a much more granular and configurable approach to locking that could alleviate many of those write bottlenecks that plagued Mongo.

History took interesting paths. Oracle bought Sleepycat and later inherited MySQL via the Sun acquisition. And last fall, MongoDB bought WiredTiger.

Which brings us to MongoDB 3.0 (formerly numbered 2.8). It’s no mystery (except for the dot zero release number) that the WiredTiger engine would end up in Mongo, as their integration was destiny. Also not surprising is that the original MongoDB MMAP engine lives on. There is a huge installed base, and for existing read-heavy applications, it works perfectly well for a wide spectrum of use cases (e.g., recommendation engines). The new release makes the storage engine pluggable via a public API.

We’ve been down this road before; today MySQL has almost a dozen storage engines. Starting out of the gate, MongoDB will have the two supported by the company: classic MMAP or the industrial-strength WiredTiger engine. Then there’s also an “experimental” in-memory engine that’s part of this release. And off in the future, there’s no reason why HDFS, cloud-based object storage, or even SQL engines couldn’t follow.

The significance of the 3.0 release is that the MongoDB architecture becomes an extensible family. And in fact, this is quite consistent with trends that we at Ovum have been seeing with other data platforms that are all overlapping and taking on multiple personas. That doesn’t mean that every database will become the same, but that each will have its area of strength while also being able to take on boundary cases. For instance:
• Hadoop platforms have been competing on adding interactive SQL;
• SQL databases have been adding the ability to query JSON data; and
• MongoDB is now adding the fast, scalable write capabilities associated with rival NoSQL engines like Cassandra or Couchbase, reducing the performance gap with key-value stores.

Database convergence or overlap doesn’t mean that you’ll suddenly use Hadoop to replace your data warehouse, or MongoDB instead of your OLTP SQL database. And if you really need fast write performance, key-value stores will probably remain your first choice. Instead, view these as extended capabilities that allow you to handle a greater variety of use cases, data types, and queries off the same platform with familiar development, query, and administration tools.

Back to MongoDB 3.0, there are a few other key enhancements with this release. Concurrency control (the source of those annoying write locks with the original MMAP engine) becomes more granular in this release. Instead of having to lock the entire database for consistent writes, locks can now be confined to a specific collection (the MongoDB equivalent of table) level, reducing an annoying bottleneck. Meanwhile, WiredTiger adds more granular memory management to further improve write performance. Eventually, WiredTiger might even bring schema validation to Mongo.
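
The difference between database-level and collection-level locking can be sketched in a few lines of Python. This is only a toy model using standard `threading` locks, not MongoDB’s concurrency code; the two lock-manager classes, the collection names, and the `can_write_concurrently` helper are hypothetical.

```python
import threading
from collections import defaultdict


class DatabaseLevelLocks:
    """One lock for the whole database: any write blocks every other write."""

    def __init__(self) -> None:
        self._lock = threading.Lock()

    def lock_for(self, collection: str) -> threading.Lock:
        return self._lock


class CollectionLevelLocks:
    """One lock per collection: writes to different collections do not collide."""

    def __init__(self) -> None:
        self._locks = defaultdict(threading.Lock)

    def lock_for(self, collection: str) -> threading.Lock:
        return self._locks[collection]


def can_write_concurrently(manager, coll_a: str, coll_b: str) -> bool:
    """While a write holds the lock for coll_a, check whether a second
    writer could take the lock for coll_b without waiting."""
    first = manager.lock_for(coll_a)
    second = manager.lock_for(coll_b)
    with first:
        acquired = second.acquire(blocking=False)
        if acquired:
            second.release()
        return acquired


db_wide = can_write_concurrently(DatabaseLevelLocks(), "orders", "users")
per_collection = can_write_concurrently(CollectionLevelLocks(), "orders", "users")
same_collection = can_write_concurrently(CollectionLevelLocks(), "orders", "orders")
```

With a single database-wide lock the second writer always waits; with per-collection locks it waits only when both writes target the same collection, which is the bottleneck reduction described above.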

We don’t view this release as being about existing MongoDB customers migrating to the new engine; yes, the new engine will support the same tools, but it will require a one-time reload of the database. Instead, we view this as expanding MongoDB’s addressable market, with the obvious target being key-value stores like Cassandra, BerkeleyDB (now commercially available as Oracle NoSQL Database), or Amazon DynamoDB. It’s just like other data platforms are doing by adding on their share of capability overlaps.

Hortonworks takes on thankless, necessary job for governing data in Hadoop

Governance has always been a tough sell for IT or the business. It’s a cost of doing business that although it might not be optional, is something that inevitably gets kicked down the road. That is, unless your policies, external regulatory requirements, or embarrassing public moments force the issue.

Put simply, data governance consists of the policies, rules, and practices that organizations enforce for the way they handle data. It shouldn’t be surprising that in most organizations, data governance is at best ad hoc. Organizations with data (and often governance) in their names offer best practices, frameworks, and so on. IBM has been active on this front as well, having convened its own data governance council among leading clients for the better part of the last decade; it has published maturity models and blueprints for action.

As data governance is a broad umbrella encompassing multiple disciplines from data architecture to data quality, security and privacy management, risk management, lifecycle management, classification and metadata, and audit logging, it shouldn’t be surprising that there is a wealth of disparate tools out there for performing specific functions.

The challenge with Hadoop, like any emerging technology, is its skunk works origins among Internet companies who had (at the time) unique problems to solve and had to invent new technology to solve them. But as Big Data – and Hadoop as platform – has become a front burner issue for enterprises at large, the dilemma is ensuring that this new Data Lake not become a desert island when it comes to data governance. Put another way, implementing a data lake won’t be sustainable if data handling is out of compliance with whatever internal policies are in force. The problem comes to a head for any organization dealing with sensitive or private information, because in Hadoop, even the most cryptic machine data can contain morsels that could compromise the identity (and habits) of individuals and the trade secrets of organizations.

For Hadoop, the pattern is repeating. Open source projects such as Sentry, Knox, Ranger, Falcon and others are attacking pieces of the problem. But there is no framework that brings it all together – as if that were possible.

Towards that end, we salute Hortonworks for taking on what in our eyes is otherwise a thankless task: herding cats to create definable targets that could be the focus of future Apache projects – and for Hortonworks, value-added additions to its platform. Its Data Governance Initiative, announced earlier this morning, is the beginning of an effort that mimics to some extent what IBM has been doing for years: convene big industry customers to help define practice, and for the vendor, define targets for technology development. Charter members include Target, Merck, Aetna – plus SAS, as technology partner. This is likely to spawn future Apache projects that, if successful, will draw critical mass participation for technologies that will be optimized for the distinct environment of Hadoop.

A key challenge is delineating where vertical industry practice and requirements leave off (as that is a space already covered by many industry groups) so it doesn’t wind up reinventing the wheel. The same is true across the general domain of data management – where as we stated before there are already organizations that have defined the landscape, to which we hope that the new initiative formally or informally syncs up.

Strata 2014 Part 2: Exploratory Analytics and the need to Fail Fast

Among the unfulfilled promises of BI and data warehousing was the prospect of analytic dashboards on the desk of everyman. History didn’t quite turn out that way – BI and data warehousing caught on, but only as the domain of elite power users who were able to create and manipulate dashboards, understand KPIs, and know what to ask for when calling on IT to set up their data marts or query environments. The success of Tableau and Qlik revealed latent demand for intuitive, highly visual BI self-service tools, and the feasibility of navigating data with lesser reliance on IT.

History has repeated itself with Big Data analytics and Hadoop platforms – except that we need even more specialized skills on this go round. Whereas BI required power users and DBAs, for Big Data it’s cluster specialists, Hadoop programmers, and data scientists. When we asked an early enterprise adopter at a Hortonworks-sponsored user panel back at Hadoop Summit as to their staffing requirements, they listed Java, R, and Python programmers.

Even if you’re able to find or train Hadoop programmers and cluster admins, and can even spot the spare underemployed data scientist, you’ll still face a gap in operationalizing analytics. Unless you’re planning to rely on an elite team of programmers, data scientists or statisticians, Big Data analytics will wind up in a more gilded version of the familiar data warehousing/BI ghetto.

This pattern won’t be sustainable for mainstream enterprises. We’ve gone on record that Big Data and Hadoop must become first class citizens in the enterprise. And that means mapping to the skills of the army you already have. No wonder that interactive SQL is becoming the gateway drug for Hadoop in the enterprise. That at least gets Big Data and Hadoop to a larger addressable practitioner base, but unless you’re simply querying longer periods of the same customer data that you held in the data warehouse, you’ll face new bottlenecks getting your arms around all that bigger and badder data. You’re probably going to be wondering:
• What questions do you ask when you have a greater variety of data to choose from?
• What data sets should you select for analysis when you have dozens being ingested into your Hadoop cluster?
• What data sets will you tell your systems admins or DBAs (yes, they can be retrained for schema-on-read data collections) to provision for your Hadoop cluster?

If you’re asking these questions, you’re not lost. Your team is likely acquiring new data sets to provide richer context to perennial challenges such as optimizing customer engagement, reducing financial risk exposure, improving security, or managing operations. Unless your team is analyzing log files with a specific purpose, chances are you won’t have the exact questions or know specifically which data sets you should pinpoint in advance.

Welcome to the world of Exploratory Analytics. This is where you iterate your queries, and identify which data sets yield the answers. It’s different from traditional analytics, where the data sets, schema, and queries are already pre-determined for you. In the exploratory phase, you look for answers that explain why your KPIs have changed, or whether you’re looking at the right KPIs at all. Exploratory analytics does not replace your existing regimen of analytics, query, or reporting – it complements it. Exploratory Analytics:
• Gives you the Big Picture; it shows you the forest. Traditional analytics gives you the Precise Picture, where you get down to the trees.
• May be used for quickly getting a fix on some unique scenario occurring, where you might run a query once and move on; or it can be used for recalibrating where you should do your core data warehousing analytics – which means that it is a preparatory stage for feeding new data to the data warehouse.
• Gives you the context for making decisions. Data warehousing analytics are where final decisions (for which your organization may be legally accountable) are made.

A constant refrain from the just-concluded Strata Hadoop World conference was the urgency of being able to fail fast. You are conducting a process of discovery where you are testing and retesting assumptions and hypotheses. The nature of that process is that you are not always going to be right, and in fact, if you are thorough enough in your discovery process, you won’t be. That doesn’t mean that you don’t start out with a question or hypothesis – you do. But unlike conventional BI and data warehousing, your assumptions and hypotheses are not baked in from the moment that schema has been set in concrete.

At the other end of the scale, exploratory analytics should not degenerate into a hunting expedition. Like any scientific process, you need direction and ways for setting bounds on your experiments.

Exploratory analytics requires self-service, from data preparation through query and analysis. You can’t afford to wait for IT to transform your data and build query environments, and then repeat the process for the next stage of refining your hypothesis. It takes long enough to build a single data mart.

It starts with getting data. You must identify data sets and then transform them (schema on read doesn’t eliminate this step). It may involve searching externally for data sets or scanning the data sets that are already available from your organization’s portfolio of transaction or messaging systems, log files, or other sources; or that data might already be on your Hadoop cluster. You need to identify what’s in the data sets of interest, reconcile, and conduct the transformation, a process often characterized as data wrangling. This process is not identical to ETL, but rather, the precursor. You are getting the big picture and may not require the same degree of precision in matching and de-duplicating records as you would inside a data warehouse. You’re designing for queries that give you the big picture (is your organization on the right track) as opposed to the precise or exact picture (where you are making decisions that carry legal and/or financial accountability).
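To make the point about looser matching concrete, here is a minimal sketch in Python of "good enough" de-duplication for exploratory work. It is an illustration under our own assumptions, not how any particular wrangling tool works: the function names (`normalize`, `loose_dedupe`) and the 0.85 similarity cutoff are hypothetical, and the matching relies on the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    spaced = "".join(c if c.isalnum() else " " for c in name.lower())
    return " ".join(spaced.split())


def similar(a, b, cutoff=0.85):
    """True when the normalized strings are at least `cutoff` similar."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= cutoff


def loose_dedupe(records, cutoff=0.85):
    """Greedy single pass: keep a record unless it is close to one already kept.

    The cutoff is an illustrative assumption; lowering it merges more
    aggressively, which may be acceptable for big-picture queries but
    not for data warehouse-grade reconciliation.
    """
    kept = []
    for rec in records:
        if not any(similar(rec, k, cutoff) for k in kept):
            kept.append(rec)
    return kept


customers = ["Acme Corp.", "ACME corp", "Acme Corporation", "Globex Inc", "Globex, Inc."]
unique_customers = loose_dedupe(customers)
```

The cutoff is the knob that trades precision for speed of discovery: a data warehouse load would want exact or carefully reviewed matches, while an exploratory pass can tolerate a few wrongly merged or wrongly separated records.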

You will have to become self-sufficient in performing the wrangling, setting up the queries, and getting the results. You’ll need the system to assist you in this endeavor. As serial database entrepreneur and MIT professor Michael Stonebraker put it during a Strata presentation, when you have more than 20 – 25 data sources, it is not humanly possible to keep track of all of them manually; you’ll need automation to help track and organize data sets for you. You may need the system to assist you in selecting data sets, and you certainly need the system to help you determine how to correlate and transform those data sets into workable form. And to keep from reinventing the wheel, you’ll need a way to track and preferably collaborate with others regarding the wrangling process – e.g., your knowledge about data sets, transformations, queries, and so on.

Advances in machine learning are helping make the wild goose chase manageable. Compared to traditional ETL tools, these offerings use a variety of techniques, such as capabilities to recognize patterns of data and identify what kind of data is in a column, calibrated with various training techniques where you download a sample set and “teach” the system, or provide prompted or unprompted feedback as to the correctness of the transform. Unlike traditional ETL tools, you can operate from a simple spreadsheet rather than have to navigate schema.
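As a rough illustration of the idea of recognizing what kind of data is in a column and then accepting user feedback, here is a small Python sketch. It is deliberately simplistic and is our own assumption of how such a step could look: it uses hand-written regular expressions where the commercial tools apply learned models, and the names `PATTERNS`, `guess_column_type`, and `label_columns`, along with the 0.8 threshold, are hypothetical.

```python
import re

# Hypothetical pattern library; real tools learn far richer signals from samples.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "integer": re.compile(r"^-?\d+$"),
    "decimal": re.compile(r"^-?\d+\.\d+$"),
}


def guess_column_type(values, threshold=0.8):
    """Return the label that matches at least `threshold` of the non-empty values.

    Falls back to 'text' when no pattern is dominant enough.
    """
    cleaned = [v.strip() for v in values if v and v.strip()]
    if not cleaned:
        return "text"
    best_label, best_share = "text", 0.0
    for label, pattern in PATTERNS.items():
        share = sum(1 for v in cleaned if pattern.match(v)) / len(cleaned)
        if share > best_share:
            best_label, best_share = label, share
    return best_label if best_share >= threshold else "text"


def label_columns(table, corrections=None):
    """Guess a type per column; user corrections always win over the guess."""
    corrections = corrections or {}
    return {
        name: corrections.get(name, guess_column_type(values))
        for name, values in table.items()
    }


sample = {
    "contact": ["a@x.com", "b@y.org", ""],
    "signup": ["2014-10-01", "2014-10-15", "n/a"],
    "visits": ["3", "12", "7"],
}
first_pass = label_columns(sample)
second_pass = label_columns(sample, corrections={"signup": "date"})
```

In this sketch the "signup" column falls below the threshold because of the stray "n/a", so the first pass labels it as plain text; the second pass shows the user overriding the guess, which stands in for the feedback loop described above.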

Emerging players like Trifacta, Paxata, and Tamr introduced these techniques to data preparation and reconciliation; IBM has embraced these approaches with Data Refinery and Watson Analytics, while Informatica leverages machine learning with its Springbok cloud service; and we expect to hear from Oracle very soon.

The next step is data consumption. Where data is already formatted as relational tables, existing BI self-service visualization tools may suffice. But other approaches are emerging that deduce the story. IBM’s Watson Analytics can suggest what questions to ask and pre-populate a storyboard or infographic; ClearStory Data combines live blending of data from internal and external sources to generate interactive storyboards that similarly venture beyond dashboards.

For organizations that already actively conduct BI and analytics, the prime value-add from Big Data will be the addition of exploratory analytics at the front end of the process. Exploratory analytics won’t replace traditional BI query and reporting, as the latter is where the data and processes are repeatable. Exploratory Analytics allows organizations to search deeper and wider for new signals or patterns; the results might in some cases be elevated to the data warehouse, but in other cases, may provide a background process that helps the organization get the bigger picture to understand whether it has the right business or operational strategy, whether it is asking the right questions, serving the right customers, or protecting against the right threats.

Strata 2014 Part 1: Hadoop, Bright Lights, Big City

If you’re running a conference in New York, there’s pretty much no middle ground between a large hotel and the Javits Center. And so this year, Strata Hadoop World made the leap, getting provisional access to a small part of the big convention center to see if it could fill the place. That turned out to be a foregone conclusion.

The obvious question was whether Hadoop, and Big Data, had in fact “crossed the chasm” to become a mainstream enterprise IT market. In case you were wondering, the O’Reilly folks got Geoffrey Moore up on the podium to answer that very question.

For Big Data-powered businesses, there’s little chasm to cross when you factor in the cloud. As Moore put it, if you only need to rent capacity on AWS, the cost of entry is negligible. All that early adopter, early majority, late majority stuff doesn’t really apply. A social site has a business model of getting a million eyes or nothing, and getting there is a matter of having the right buzz to go viral – the key is that there’s scant cost of entry and you get to fail fast. Save that thought – because the fail fast principle also applies to enterprises implementing Big Data projects (we’ll explain in Part 2 of this post, soon to come).

Enterprise adoption follows Moore’s more familiar chasm model – and there we’re still at early majority, where the tools of the trade are arcane languages and frameworks like Spark and Pig. But the key, Moore says, is for “pragmatists” to feel pain; that is the chasm to late majority, the point where conventional wisdom is to embrace the new thing. Pragmatists in the ad industry are feeling pain responding to Google; the same goes for the media and entertainment sectors, where even cable TV mainstays such as HBO are willing to risk decades-old relationships with cable providers to embrace pure internet delivery.

According to Cloudera’s Mike Olson, Hadoop must “disappear” to become mainstream. That’s a 180 switch as the platform has long required specialized skills, even if you ran an off-the-shelf BI tool against it. Connecting from familiar desktop analytics tools is the easy part – they all carry interfaces that translate SQL to the query language that can run on Hive, or on any of the expanding array of interactive-SQL-on-Hadoop frameworks that are making Hadoop analytics more accessible (and SQL on Hadoop a less painful experience).

Between BI tools and frameworks like Impala, HAWQ, Tez, Big SQL, Big Data SQL, Query Grid, Drill, or Presto, we’ve got the last mile covered. But the first miles, which involve mounting clusters, managing and optimizing them, wrangling the data into shape, and governing the data, are still works in progress (there is some good news regarding data wrangling). Tools that hide the complexity and applications that move the complexity under the hood are works in progress.

No wonder that for many enterprises, offloading ETL cycles was their first use of Hadoop. Not that there’s anything wrong with that – moving ETL off Teradata, Oracle, or DB2 can yield savings because you’ve moved low value workloads off platforms where you pay by footprint. Those savings can pay the bill while your team defines where it wants to go next.

We couldn’t agree with Olson more – Hadoop will not make it into the enterprise as this weird, difficult, standalone platform that requires special skills. Making a new platform technology like Hadoop “disappear” isn’t new — it’s been done before with BI and Data Warehousing. In fact, Hadoop and Big Data today are at the same point where BI and data warehousing were in the 1995 – 96 timeframe.

The resemblance is uncanny. At the time, data warehouses were unfamiliar and required special skills because few organizations or practitioners had relevant experience. Furthermore, SQL relational databases were the Big Data of their day, providing common repositories for data that was theoretically liberated from application silos (well, reality proved a bit otherwise). Once tools automated ETL, query, and reporting, BI and data warehousing in essence disappeared. Data Warehouses became part of the enterprise database environment, while BI tools became routine additions to the enterprise application portfolio. Admittedly, the promise of BI and Data warehousing was never completely fulfilled as analytic dashboards for “everyman” remained elusive.

Back to the original question, have Hadoop and Big Data gone mainstream? The conference had little trouble filling up the hall, and questions about economic cycles notwithstanding, shouldn’t have issues occupying more of Javits next year. We’re optimists based on Moore’s “pragmatist pain” criterion – in some sectors, pragmatists will have little choice but to embrace the Big Data analytics that their rivals are already leveraging.

More specifically, we’re bullish in the short term and long term, but are concerned over the medium term. There’s been a lot of venture funding pouring into this space over the past year for platform players and tools providers. Some players, like Cloudera, have well broken the billion-dollar valuation range. Yet, if you look at the current enterprise paid installed base for Hadoop, conservatively we’re in the 1000 – 2000 range (depending on how you count). Even if these numbers double or triple over the next year, will that be enough to satisfy venture backers? And what about the impacts of Vladimir Putin or Ebola on the economy over the near term?

At Strata we had some interesting conversations with members of the venture community, who indicated that the money pouring in is 10-year money. That’s a lot of faith – but then again, there’s more pain spreading around certain sectors where leaders are taking leaps to analyze torrents of data from new sources. But ingesting the data or pointing an interactive SQL tool (or streaming or search) at it is the easy part. When you’re getting beyond the enterprise data walled garden, you have to wonder if you’re looking at the right data or asking the right questions. In the long run, that will be the gating factor as to how, whether, and when analyzing data will become routine in the enterprise. And that’s what we’re going to talk about in Part 2.

We believe that self-service will be essential for enterprises to successfully embrace Big Data. We’ll tell you why in our next post.

Is SQL the Gateway Drug for Hadoop?

How much difference does a year make? Last year was the point where each Hadoop vendor was compelled to plant their stake in supporting interactive SQL: Cloudera Impala; Hortonworks’ Stinger (injecting steroids to Hive); IBM’s Big SQL; Pivotal’s HAWQ; MapR and Drill (or Impala available upon request); and for good measure, Actian turbocharging their Vectorwise processing engine onto Hadoop.

This year, the benchmarketing has followed: Cloudera Impala clobbering the latest version of Hive in its own benchmarks, Hortonworks’ response, and Actian’s numbers with the Vectorwise engine (rebranded Vortex) now native on Hadoop supposedly trumping the others. OK, there are lies, damn lies, and benchmarks, but at least Hadoop vendors feel compelled to optimize interactive SQL performance.

As the Hadoop stack gets filled out, it also gets more complicated. In his keynote before this year’s Hadoop Summit, Gartner’s Merv Adrian made note of all the technologies and frameworks that are either filling out the Apache Hadoop project – such as YARN – or adding new choices and options, such as the number of frameworks for tiering to memory or Flash. Add to that the number of interactive SQL frameworks.

So where does this leave the enterprises that comprise the Hadoop market? In all likelihood, dazed and confused. All that interactive SQL is part of the problem, but it’s also part of the solution.

Yes, Big Data analytics has pumped new relevancy into the Java community, which now has something sexier than middleware to keep itself employed. And it’s provided a jolt to Python, which as it turns out is a very useful data manipulation language, not to mention open source R for statistical processing. And there are loads of data science programs bringing new business to Higher Ed computer science programs.

But we digress.

Java, Python and R will add new blood to analytics teams. But face it, no enterprise in its right mind is going to swap out its IT staff. From our research at Ovum, we have concluded that Big Data (and Hadoop) must become first class citizens in the enterprise if they are to gain traction. Inevitably, that means SQL must be part of the mix.

Ironically, the great interactive SQL rollout is occurring as something potentially far more disruptive is occurring: the diversification of data platforms. Hadoop and data warehousing platforms are each adding multiple personas. As Hadoop adds interactive SQL, SQL data warehouses are adding column stores, JSON/document style support, and MapReduce style analytics.

But SQL is not the only new trick up Hadoop’s sleeve; there are several open source frameworks that promise to make real-time streaming analytics possible, not to mention search, and… if only the community could settle on some de facto standard language(s) and storage formats, graph. YARN, still in its early stages, offers the possibility of running multiple workloads concurrently on the same Hadoop cluster without the need to physically split it up. On the horizon are tools applying machine learning to take ETL outside the walled garden of enterprise data, not to mention BI tools that employ other approaches not easily implemented in SQL such as path analysis. Our research has found that the most common use cases for Big Data analytics are actually very familiar problems (e.g., customer experience, risk/fraud prevention, operational efficiency), but with new data and new techniques that improve visibility.

Therefore it would be a waste if enterprises only use Hadoop as a cheaper ETL box or place to offload some SQL analytics. Hopefully, SQL will become the gateway drug for enterprises to adopt Hadoop.

Postscript: Here’s the broader context for our thoughts: databases are converging. There are more platforms to run SQL than ever before. Here’s a link to our presentation at the 2014 Hadoop Summit on how Hadoop, SQL, and NoSQL data platforms are converging.

Hadoop vendor ecosystem gaining critical mass

Nature abhors a vacuum, and enterprises abhor platforms lacking tooling. Few enterprises have the developer resources or technology savvy of early adopters. For Hadoop, early adopters invented the technology; mainstream enterprises want to consume it.

On our just-concluded tour of Ovum enterprise clients across Australia/Pacific Rim, we found that the few who have progressed beyond discussion stage with Hadoop are doing so with technology staff accustomed to being on their own, building their own R programs and experimenting with embryonic frameworks like Mesos and YARN. Others are either awaiting more commercial tooling or still sorting out perennial data silos.

But Hadoop is steadily turning into a more “normal” software market. And with it, the vendor ecosystem vacuum is starting to fill in. It’s very much in line with what happened with BI and data warehousing back in the mid-1990s, when tools civilized what was a new architecture for managing data that originally required manual scripting.

So let’s take a brief tour.

Look at the exhibitor list for last month’s Strata Hadoop World conference; as the largest such Big Data event in North America, it provides a good sampling of the ecosystem. Of nearly a hundred sponsors, roughly a third were tools encompassing BI and analytics, data federation and integration, data protection, and middleware.

There was a mix of the usual suspects who regard Hadoop as their newest target. SAS analytics takes an agnostic approach, bundling a distro of Hadoop in its LASR in-memory appliance; but SAS analytics can also execute inside Hadoop clusters, converting their HPC routines to MapReduce. MicroStrategy and other BI players are connecting to Hadoop in a variety of ways; they either provide the suboptimal experience of having your SQL query execute in batch on Hadoop (which few use), or work through the data warehouse or Hadoop platform’s path for interactive SQL.

But there are also new players that are taking BI beyond SQL. Datameer and Platfora each provide their own operators (e.g., clustering, time series, decision trees, or other forms of analysis that would be laborious with SQL), presenting data either through spreadsheets or visualizations. ClearStory Data, which emerged from stealth at the show, provides a way to semantically crawl your own data and mash it with external data from publicly-available APIs. Players like Pivotal, Hadapt, SpliceMachine and CitusData are implementing or co-locating SQL data stores inside HDFS or HBase.

Significantly, some are starting to package forms of data science as well, with almost a half dozen machine learning programs. A necessary development, because there are only so many Hilary Masons to go around. People who have a natural feel for data, who are able to understand its significance, how to analyze it, and most importantly, its relevance, will remain few and far between. To use these tools, you’ll need to know what algorithms to use, but at least you don’t have to build them from scratch. For instance, 0xdata packages machine learning algorithms and combines them with a caching engine for high performance analytics on Hadoop. Skytree packages classification, clustering, regression analyses, and most importantly, dimension reduction so you can see something meaningful after combing a billion nodes (points) and edges (relationships and context).

Security, a perennial weakness of Hadoop, is another area where you’re seeing vendor activity. Originally designed for trusted environments, Hadoop has long had the remote authentication piece down (Kerberos), because early adopters needed to gain access to remote clusters, and now there are incubating open source projects tackling the other two A’s of AAA – a gateway for access control (Knox) and a mechanism for role-based authorization (Sentry). Yes, there is also a specialized project for “cell” (data entity) level protection created for the NSA (Accumulo), which is being led by Sqrrl. But otherwise, we expect that vendor-based proprietary tools are going to be where most of the action is. Policy-based data protection, whether through encryption or data masking, is now arriving via emerging players like Zettaset and Gazzang, with incumbents such as Protegrity and IBM extending support beyond SQL. Data lineage and activity monitoring (the first steps that could eventually lead to full-blown audit and selective read/write access) are emerging from IBM, Cloudera, and Revelytix.

We’ve long believed that for Big Data – and Hadoop – to gain traction with enterprises, it must become a first class citizen. Among other things, it means Hadoop must integrate with the data center and, inevitably, apps that run against it. Incumbent data integration providers like Informatica, Talend, Syncsort, and Pentaho view Hadoop as yet another target. Originally touching Hadoop at arm’s length via the traditional ETL staging server topology, they have enabled their transformation tools to work natively inside Hadoop, as the idea is a natural (Hadoop promises cheaper compute cycles for the task). Emerging players are adding new integration capabilities – Cirro for data federation; JethroData, for adding indexing to Hadoop; Kapow and Continuuity, which are providing middleware for applications to integrate with Hadoop; and Appfluent for extending its data lifecycle management tool to support active archiving on Hadoop.

The subtext of the explosion of the ecosystem is Hadoop’s evolution into a more varied platform; to play anything more than a niche role in the enterprise (and draw a tooling and applications ecosystem), Hadoop must provide other processing options besides MapReduce.

Not surprisingly, interactive SQL on Hadoop became a prime battleground for vendors to differentiate. Cloudera introduced Impala, an MPP-based alternative to MapReduce that uses Hive metadata but bypasses the bottleneck of Hive processing (which had traditionally relied on MapReduce). Meanwhile, Hortonworks has led projects to make Hive better (read: faster), complementing it with a faster alternative to MapReduce. As noted above, several players are implementing SQL data stores directly inside Hadoop, while IBM has modified SQL to run against Hive.

The YARN (a.k.a., MapReduce 2.0) framework provides resource allocation (not full-blown resource management, however) that will allow multiple (read: MapReduce and alternative) workloads to run on Hadoop clusters. Hortonworks, which led development, announced a circle of partners who are supporting the new framework. Its rival, Cloudera, is taking a more measured approach; MapReduce and Impala workloads will be allocated under the YARN umbrella, but streaming or search won’t. Because YARN was carved out of the original resource manager for pre-2.0 MapReduce, Cloudera doesn’t believe the new framework is suited for handling continuous workloads that don’t have starts or stops.

So, going forward, we’re seeing Hadoop emerge with an increasingly well-rounded third party ecosystem where little existed before. We expect that in the coming year, this will spread beyond tools to applications as well; we’ll see more of what the likes of Causata are doing.

So what role will Hadoop play?

For now, Hadoop remains a work in progress – data integration and lifecycle management, security, performance management, and governance practices and technologies are at early stages of evolution. At Strata, Facebook’s Ken Rudin made an eloquent plea for coexistence; they tacked against the wind by starting with Hadoop and learning that it was best for exploratory analytics while relational was best suited for queries with standard metrics (he’s pitched the same message to the data warehousing audience as well).

Cloudera’s Mike Olson, who had the podium right before Rudin, announced Cloudera’s vision of Hadoop as enterprise data hub: Hadoop is not just the logical landing spot for data, but also the place where you can run multiple workloads. Andrew Brust likens Cloudera’s positioning to making Hadoop “the Ellis Island of data.”

So is Olson agreeing or arguing with Rudin?

The context is that analytic (and some transactional) data platforms are taking on multiple personalities (e.g., SQL row stores adding column engines, file/HDFS data stores, JSON stores – in some cases alongside or in hybrid). All analytic data platforms are grabbing for multiple data types and running workloads. They are also vying to become the logical spot where analytics are choreographed – mixing and matching data sets on different platforms for running analytic problems.

Cloudera aims to compete, not just as another Hadoop platform, but as the default platform where analytic data lives. It doesn’t necessarily replace SQL enterprise data warehouses, but assumes more workloads requiring scale, inexpensive compute cycles, and the ability to run multiple types of workloads – not just MapReduce. SQL data warehouses aren’t standing still either, and in many cases are embracing Hadoop. Hadoop has the edge on cost of compute cycles, but pieces must fall into place to gain parity regarding service level management and performance, security, availability and reliability, and information lifecycle management. Looking ahead, we expect analytics to run on multiple platforms, with the center of gravity up for grabs.