Strata 2014 Part 2: Exploratory Analytics and the Need to Fail Fast

Among the unfulfilled promises of BI and data warehousing was the prospect of analytic dashboards on the desk of everyman. History didn’t quite turn out that way – BI and data warehousing caught on, but only as the domain of elite power users who could create and manipulate dashboards, understand KPIs, and know what to ask for when calling on IT to set up their data marts or query environments. The success of Tableau and Qlik revealed latent demand for intuitive, highly visual self-service BI tools, and the feasibility of navigating data with less reliance on IT.

History has repeated itself with Big Data analytics and Hadoop platforms – except that this time around we need even more specialized skills. Whereas BI required power users and DBAs, for Big Data it’s cluster specialists, Hadoop programmers, and data scientists. When we asked an early enterprise adopter about their staffing requirements at a Hortonworks-sponsored user panel back at Hadoop Summit, they listed Java, R, and Python programmers.

Even if you’re able to find or train Hadoop programmers and cluster admins, and can even scare up the occasional underemployed data scientist, you’ll still face a gap in operationalizing analytics. If you plan to rely on an elite team of programmers, data scientists, or statisticians, Big Data analytics will wind up as a more gilded version of the familiar data warehousing/BI ghetto.

This pattern won’t be sustainable for mainstream enterprises. We’ve gone on record that Big Data and Hadoop must become first class citizens in the enterprise. And that means mapping to the skills of the army you already have. No wonder that interactive SQL is becoming the gateway drug for Hadoop in the enterprise. That at least gets Big Data and Hadoop to a larger addressable practitioner base, but unless you’re simply querying longer periods of the same customer data that you held in the data warehouse, you’ll face new bottlenecks getting your arms around all that bigger and badder data. You’re probably going to be wondering:
• What questions do you ask when you have a greater variety of data to choose from?
• What data sets should you select for analysis when you have dozens being ingested into your Hadoop cluster?
• What data sets will you tell your systems admins or DBAs (yes, they can be retrained for schema-on-read data collections) to provision for your Hadoop cluster?
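
To make the interactive SQL point concrete, here is a minimal sketch of the kind of exploratory query a SQL-literate analyst might fire at a Hadoop cluster. It assumes a HiveServer2 endpoint and the PyHive Python client, and the host, table, and column names are invented for illustration rather than drawn from any particular stack.

```python
# A minimal sketch of an exploratory SQL-on-Hadoop query via HiveServer2.
# Assumes the PyHive client; host, table, and column names are hypothetical.
from pyhive import hive

conn = hive.connect(host="hadoop-edge-node", port=10000, username="analyst")
cursor = conn.cursor()

# Exploratory question: which referral sources drove the most orders last quarter?
cursor.execute("""
    SELECT referral_source, COUNT(*) AS orders
    FROM clickstream_events
    WHERE event_type = 'order' AND event_date >= '2014-07-01'
    GROUP BY referral_source
    ORDER BY orders DESC
    LIMIT 20
""")

for source, orders in cursor.fetchall():
    print(source, orders)

cursor.close()
conn.close()
```

The point is less the query than the workflow: it can be written, run, and revised in minutes by the SQL-literate analysts you already have.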

If you’re asking these questions, you’re not lost. Your team is likely acquiring new data sets to provide richer context to perennial challenges such as optimizing customer engagement, reducing financial risk exposure, improving security, or managing operations. Unless your team is analyzing log files with a specific purpose in mind, chances are you won’t know the exact questions, or which data sets to zero in on, in advance.

Welcome to the world of Exploratory Analytics. This is where you iterate your queries and identify which data sets will yield the answers. It’s different from traditional analytics, where the data sets, schema, and queries are pre-determined for you. In the exploratory phase, you look for answers that explain why your KPIs have changed, or whether you’re looking at the right KPIs at all. Exploratory analytics does not replace your existing regimen of analytics, query, and reporting – it complements it. Exploratory Analytics:
• Gives you the Big Picture; it shows you the forest. Traditional analytics gives you the Precise Picture, where you get down to the trees.
• May be used for quickly getting a fix on a one-off scenario, where you might run a query once and move on; or for recalibrating where you should focus your core data warehousing analytics – in which case it becomes a preparatory stage for feeding new data to the data warehouse.
• Gives you the context for making decisions. Data warehousing analytics are where final decisions (for which your organization may be legally accountable) are made.

A constant refrain from the just-concluded Strata Hadoop World conference was the urgency of being able to fail fast. You are conducting a process of discovery in which you are testing and retesting assumptions and hypotheses. The nature of that process is that you are not always going to be right; in fact, if you are thorough enough in your discovery, you often won’t be. That doesn’t mean you don’t start out with a question or hypothesis – you do. But unlike conventional BI and data warehousing, your assumptions and hypotheses are not baked in from the moment the schema is set in concrete.

At the other end of the scale, exploratory analytics should not degenerate into a hunting expedition. As in any scientific process, you need direction and ways to set bounds on your experiments.

Exploratory analytics requires self-service, from data preparation through query and analysis. You can’t afford to wait for IT to transform your data and build query environments, and then repeat the process for the next stage of refining your hypothesis. It takes long enough to build a single data mart.

It starts with getting the data. You must identify data sets and then transform them (schema on read doesn’t eliminate this step). That may involve searching externally for data sets, scanning what’s already available from your organization’s portfolio of transaction or messaging systems, log files, or other sources, or working with data that is already sitting on your Hadoop cluster. You need to identify what’s in the data sets of interest, reconcile them, and carry out the transformation – a process often characterized as data wrangling. This is not ETL, but rather its precursor. Because you are after the big picture, you may not need the same degree of precision in matching and de-duplicating records as you would inside a data warehouse: you’re designing for queries that tell you whether your organization is on the right track, as opposed to the precise or exact picture required when you’re making decisions that carry legal and/or financial accountability.
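
As a rough illustration of what this lighter-touch wrangling might look like in practice, the sketch below profiles a raw extract and applies “good enough” cleanup with pandas; the file name, columns, and cleanup rules are assumptions made purely for the example.

```python
# A rough sketch of lightweight data wrangling for exploratory analysis.
# Uses pandas; the file and column names are invented for illustration.
import pandas as pd

# Pull a raw extract as-is: with schema on read, interpretation happens now, not at load time.
df = pd.read_csv("customer_touchpoints_raw.csv", dtype=str)

# Profile what is actually in each column before deciding how to use it.
for col in df.columns:
    print(col, df[col].nunique(), "distinct,", round(df[col].isna().mean(), 3), "missing")

# "Good enough" cleanup for a big-picture view: normalize case, trim whitespace,
# and drop obvious duplicates -- far looser than warehouse-grade matching and merging.
df["email"] = df["email"].str.strip().str.lower()
df = df.drop_duplicates(subset=["email", "event_date"])

# Quick cut: touchpoints per channel, just to see whether the data tells a story.
print(df.groupby("channel").size().sort_values(ascending=False).head(10))
```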

You will have to become self-sufficient in performing the wrangling, setting up the queries, and getting the results – and you’ll need the system to assist you in that endeavor. As serial database entrepreneur and MIT professor Michael Stonebraker put it during a Strata presentation, once you have more than 20-25 data sources, it is not humanly possible to keep track of them all manually; you’ll need automation to track and organize the data sets for you. You may need the system to help you select data sets, and you will certainly need it to help you determine how to correlate and transform those data sets into workable form. And to keep from reinventing the wheel, you’ll need a way to track, and preferably collaborate with others on, what you learn along the way – which data sets you used, which transformations you applied, which queries worked, and so on.
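
To picture what that kind of system assistance might capture, a bare-bones catalog record could look something like the sketch below; the fields are illustrative assumptions, not the design of any product named in this post.

```python
# A bare-bones sketch of the record a data-set catalog might keep so that wrangling
# knowledge can be tracked and shared; the fields are illustrative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataSetEntry:
    name: str                                                  # e.g. "clickstream_events"
    source: str                                                # where it was ingested from
    owner: str                                                 # who knows this data best
    transformations: List[str] = field(default_factory=list)   # wrangling steps applied
    known_queries: List[str] = field(default_factory=list)     # queries that worked
    notes: str = ""                                            # caveats, join keys, quality issues

catalog = [
    DataSetEntry(
        name="clickstream_events",
        source="web server logs via Flume",
        owner="marketing analytics",
        transformations=["parsed user agent", "normalized timestamps to UTC"],
        known_queries=["orders by referral source"],
        notes="bot traffic not yet filtered",
    ),
]

# With dozens of sources, even this minimal bookkeeping beats tribal memory -- and it is
# exactly the sort of metadata the tooling should capture and share automatically.
```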

Advances in machine learning are helping to make that wild goose chase manageable. Unlike traditional ETL tools, these offerings use techniques such as recognizing patterns in the data to identify what kind of information a column holds, calibrated through training: you load a sample set to “teach” the system, or provide prompted or unprompted feedback on whether a transform got it right. And instead of navigating schema, you work from a simple spreadsheet-style view.
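
To give a feel for the simplest form of that pattern recognition, the toy sketch below guesses what kind of data lives in a column from regular-expression matches over a sample of its values; commercial wrangling tools layer trained models and user feedback on top of ideas like this. The patterns and threshold here are assumptions chosen for illustration.

```python
# A toy sketch of pattern-based column-type recognition: sample a column's values and
# guess its type from regular expressions. Real wrangling tools add trained models and
# user feedback ("was this transform right?") on top of ideas like this.
import re

PATTERNS = {
    "email":    re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "date":     re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "zip_code": re.compile(r"^\d{5}(-\d{4})?$"),
    "number":   re.compile(r"^-?\d+(\.\d+)?$"),
}

def guess_column_type(values, threshold=0.8):
    """Return the best-matching type label for a sample of column values."""
    sample = [v.strip() for v in values if v and v.strip()]
    if not sample:
        return "unknown"
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in sample if pattern.match(v))
        if hits / len(sample) >= threshold:
            return label
    return "text"

# Example: a column sampled from a raw file ingested onto the cluster.
print(guess_column_type(["jane@example.com", "bob@example.org", "", "ann@example.net"]))  # email
```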

Emerging players like Trifacta, Paxata, and Tamr have brought these techniques to data preparation and reconciliation; IBM has embraced the approach with Data Refinery and Watson Analytics, while Informatica leverages machine learning with its Springbok cloud service; and we expect to hear from Oracle very soon.

The next step is data consumption. Where data is already formatted as relational tables, existing BI self-service visualization tools may suffice. But other approaches are emerging that deduce the story. IBM’s Watson Analytics can suggest what questions to ask and pre-populate a storyboard or infographic; ClearStory Data combines live blending of data from internal and external sources to generate interactive storyboards that similarly venture beyond dashboards.

For organizations that already actively conduct BI and analytics, the prime value-add from Big Data will be the addition of exploratory analytics at the front end of the process. Exploratory analytics won’t replace traditional BI query and reporting, as the latter is where the data and processes are repeatable. Exploratory analytics allows organizations to search deeper and wider for new signals or patterns; in some cases the results might be elevated to the data warehouse, while in others they feed a background process that helps the organization get the bigger picture – whether it has the right business or operational strategy, whether it is asking the right questions, serving the right customers, or protecting against the right threats.