Hortonworks takes on thankless, necessary job for governing data in Hadoop

Governance has always been a tough sell, whether to IT or to the business. It's a cost of doing business that, while not optional, inevitably gets kicked down the road. That is, unless internal policies, external regulatory requirements, or embarrassing public moments force the issue.

Put simply, data governance consists of the policies, rules, and practices that organizations enforce for the way they handle data. It shouldn’t be surprising that in most organizations, data governance is at best ad hoc. Organizations with data (and often governance) in their names offer best practices, frameworks, and so on. IBM has been active on this front as well, having convened its own data governance council among leading clients for the better part of the last decade; it has published maturity models and blueprints for action.

Data governance is a broad umbrella encompassing multiple disciplines: data architecture, data quality, security and privacy management, risk management, lifecycle management, classification and metadata, and audit logging. So it shouldn't be surprising that there is a wealth of disparate tools out there, each performing a specific function.

The challenge with Hadoop, as with any emerging technology, is its skunkworks origins among Internet companies that had (at the time) unique problems to solve and had to invent new technology to solve them. But as Big Data (and Hadoop as a platform) has become a front-burner issue for enterprises at large, the dilemma is ensuring that this new data lake does not become a desert island when it comes to data governance. Put another way, implementing a data lake won't be sustainable if data handling is out of compliance with whatever internal policies are in force. The problem comes to a head for any organization dealing with sensitive or private information, because in Hadoop, even the most cryptic machine data can contain morsels that compromise the identities (and habits) of individuals or the trade secrets of organizations.

For Hadoop, the pattern is repeating. Open source projects are attacking pieces of the problem: Sentry and Ranger for access policies, Knox for perimeter security, Falcon for data lifecycle management, and others besides. But there is no framework that brings it all together (as if that were possible).
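To make those pieces concrete, below is a minimal, purely illustrative Python sketch of the tag-based access checks that policy tools of this kind enforce. The policy structure, tag names, and function names are hypothetical, invented for illustration; they are not drawn from Sentry, Ranger, or any other project's actual API.

```python
# Purely illustrative sketch of tag-based access control, the general pattern
# behind Hadoop governance tools. Names and structures are hypothetical.

from dataclasses import dataclass


@dataclass
class Policy:
    tag: str                 # classification tag on the data, e.g. "PII"
    allowed_groups: set      # groups permitted to touch data with this tag
    allowed_actions: set     # actions those groups may perform, e.g. {"read"}


def is_allowed(user_groups, resource_tags, action, policies):
    """Allow the action only if every tag on the resource has a policy
    granting that action to at least one of the user's groups."""
    for tag in resource_tags:
        matching = [p for p in policies if p.tag == tag]
        if not matching:
            return False     # data with an unknown classification is denied by default
        if not any(user_groups & p.allowed_groups and action in p.allowed_actions
                   for p in matching):
            return False
    return True


# Hypothetical usage: analysts may read PII, but only compliance may export it.
policies = [Policy("PII", {"analysts", "compliance"}, {"read"}),
            Policy("PII", {"compliance"}, {"read", "export"})]

print(is_allowed({"analysts"}, {"PII"}, "read", policies))    # True
print(is_allowed({"analysts"}, {"PII"}, "export", policies))  # False
```

One design choice worth noting in the sketch: the check denies by default, so data carrying a classification with no matching policy cannot be touched at all, which is the posture most governance regimes ask for.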

Towards that end, we salute Hortonworks for taking on what in our eyes is otherwise a thankless task: herding cats to create definable targets that could become the focus of future Apache projects, and, for Hortonworks, value-added additions to its platform. Its Data Governance Initiative, announced earlier this morning, is the beginning of an effort that mimics to some extent what IBM has been doing for years: convene big industry customers to help define practice and, for the vendor, to define targets for technology development. Charter members include Target, Merck, and Aetna, plus SAS as a technology partner. The initiative is likely to spawn future Apache projects that, if successful, will draw critical-mass participation for technologies optimized for Hadoop's distinct environment.

A key challenge is delineating where vertical industry practice and requirements leave off, since that space is already covered by many industry groups, so the initiative doesn't wind up reinventing the wheel. The same is true across the general domain of data management, where, as stated above, organizations have already defined the landscape; we hope the new initiative formally or informally syncs up with them.