What will Splunk be when it grows up?

Much of the hype around Big Data is that, not only are people generating more data, but machines. Machine data has always been there – it was traditionally collected by dedicated systems such as network node managers, firewalls systems, SCADA systems, and so on. But that’s where the data stayed.

Machine data is obviously pretty low level stuff. Depending on the format of data spewed forth by devices, it may be highly cryptic or may actually contain text that is human intelligible. It was traditionally considered low-density data that was digested either by specific programs or applications or by specific people – typically systems operators or security specialists.

Splunk’s reason for existence is putting this data onto a common data platform, then index it to make it searchable as a function of time. The operable notion is that the data could then be shared or correlated across applications, such as the weblogs. Its roots are in the underside of IT infrastructure management systems, where Splunk is often the embedded data engine. An increasingly popular use case is security, where Splunk can reach across network, server, storage, and web domains to provide a picture of exploits that could be end-to-end, at least within the data center.

There’s been a bit of hype around the company, which IPO’ed earlier this year and reported a strong Q2. Consumer technology still draws the headlines (just look at how much the release of the iPhone 5 drowned out almost all other tech news this week). But given Facebook’s market dive, maybe the turn of events on Wall Street could be characterized as revenge of the enterprise, given the market’s previous infatuation with the usual suspects in the consumer space – mobile devices, social networks, and gaming.

Splunk has a lot of headroom. With machine data proliferating and the company’s promoting its offering as an operational intelligence platform, Splunk is well-positioned as a company that leverages Fast Data. While Splunk is not split second or deterministic real-time, its ability to build searchable indexes on the fly positions itself nicely for tracking volatile environments as they change as opposed to waiting after the fact (although Splunk can also be used for retrospective historical analysis, too).

But Splunk faces real growing pains, both up the value chain, and across it.

While Splunk’s heritage is in IT infrastructure data, the company bills itself as being about the broader category of machine data analytics. And there’s certainly lots of it around, given the explosion of sensory devices that are sending log files from all over the place, inside the four walls of a data center or enterprise, and out. There’s The Internet of Things. IBM’s Smarter Planet campaign over the past few years has raised general awareness of how instrument and increasingly intelligent Spaceship Earth is becoming. Maybe we’re jaded, but it’s become common knowledge that the world is full of sensory points, whether it is traffic sensors embedded in the pavement, weather stations, GPS units, smartphones, biomedical devices, industrial machinery, oil and gas recovery and refining, not to mention the electronic control modules sitting between driver and the powertrain in your car.

And within the enterprise, there may be plenty of resistance to getting the bigger picture. For instance, while ITO owns infrastructure data, marketing probably owns the Omniture logs; yet absent the means to correlate the two, it may not be possible to get the answer on why the customer did or did not make the purchase online.

For a sub $200-million firm, this is all a lot of ground to cover. Splunk knows the IT and security market but lacks the breadth of an IBM to address all of the other segments across national intelligence, public infrastructure, smart utility grids, or healthcare verticals, to name a few. And it has no visibility above IT operations or appdev organizations. Splunk needs to pick its targets.

Splunk is trying to address scale – that’s where the Big Data angle comes in. Splunk is adding some features to increases its scale, with the new 5.0 release adding federated indexing to boost performance over larger bodies of data. But for real scale, that’s where integration with Hadoop comes in, acting as a near-line archive for Splunk data that might otherwise be purged. Splunk offers two forms of connectivity: HadoopConnect, which provides a way to stream and transform Splunk data to populate HDFS and Shuttl, a slower archival feature that treats Hadoop as a tape library (the data is heavily compressed with GZip). It’s definitely a first step – HadoopConnect as the name implies establishes connectivity, but the integration is hardly seamless or intuitive, yet. It uses Splunk’s familiar fill-in-the-blank interface (we’d love to see something more point and click), with the data in Hadoop retrievable, but without Splunk’s familiar indexes (unless you re-import the data back to Splunk). On the horizon, we’d love to see Splunk tackle the far more challenging problem of getting its indexes to work natively inside Hadoop, maybe with HBase.

Then there’s the eternal question of making machine data meaningful to the business. Splunk’s search-based interface today is intuitive to developers and systems admins, as it requires knowledge of the types of data elements that are being stored. But it won’t work for anybody that doesn’t work with the guts of applications or computing infrastructure. But it becomes more critical to convey that message as Splunk is used to correlate log files with higher level sources, such as the correlating abandoned shopping carts with underlying server data to see if the missed sale was attributable to system bugs or the buyer changing her mind.

The log files that record how different elements of IT infrastructure perform are in aggregate telling a story that tells how well your organization is serving the customer. Yet the perennial challenge of all systems level management platforms has been conveying the business impact from the events that generated those log files. For those who don’t have to dye their hair gray, there are distant memories of providers like CA, IBM, and HP promoting how their panes of glass displaying data center performance could tell a business message. There’s been the challenge for ITIL adopters to codify the running of processes in the data center to support the business. The lists of stillborn attempts to convey business meaning to the underlying operations are endless.

So maybe given the hype of the IPO, the relatively new management team that is in place, and the reality of Splunk’s heritage, it shouldn’t be surprising that we heard two different messages and tones.

From recently-appointed product SVP Guido Schroeder, we heard talk of creating a semantic metadata layer that would, in effect, create de facto business objects. That shouldn’t be surprising, as during his previous incarnation he oversaw the integration of Business Objects into the SAP business. For anyone who has tracked the BI business over the years, the key to success has been creation of a metadata layer that not only codified the entities, but made it possible to attain reuse in ad hoc query and standard reporting. Schroeder and the current management team are clearly looking to take Splunk above IT operations to CIO level.

But attend almost session at the conference, and the enterprise message was largely missing. That shouldn’t be surprising as the conference itself was aimed at the people who buy Splunk’s tools – and they tend to be down more in the depths of operations.

There were a few exceptions. One of the sessions in the Big Data track, led by Stuart Hirst, CTO of an Australian big data consulting firm Converging Data, communicated the importance of preserving the meaning of data as it moves through the lifecycle. In this case, it was a counter-intuitive pitch to conventional wisdom of Big Data, which is ingest the data, explore and classify it later. As Splunk data is ingested, it is time stamped to provide a chronological record. Although this may be low level data, as you bring more of it together, there should be a record of lineage, not to mention sensitivity (e.g., are customer-facing systems involved.

From that standpoint, the notion of adding a semantic metadata layer atop its indexing sounds quite intuitive – assign higher level meanings to buckets of log data that carries some business process meaning. For that, Splunk would have to rely on external sources – the applications and databases that run atop the infrastructure whose log files it tracks. That’s a tall order and one that will require partners, not to mention how do you define what are the entities that should be defined. Unfortunately, the track record for cross enterprise repositories is not great; maybe there could be some leveraging of MDM implementations around customer or product that could provide some beginning frame of reference.

But we’re getting way, way ahead of ourselves here. Splunk is the story of an engineering-oriented company that is seeking to climb higher up the value chain in the enterprise. Yet, as it seeks to engage higher level people within the customer organization, Splunk can’t afford to lose track of the base that has been responsible for its success. Splunk’s best route upward is likely through partnering with enterprise players like SAP. That doesn’t deal with the question of how to expand out the footprint to follow the footprint of what is called machine data, but then again, that’s a question for another day. First things first, Splunk needs to pick its target(s) carefully.