Category Archives: Java

Hadoop: The Third Way

Working with Hadoop has been kind of a throwback. Until recently, Hadoop was synonymous with MapReduce programming, meaning that when you worked with Hadoop, it seemed that you were working with a newfangled mainframe. As if client/server never happened.

With the emergence of, and heavy competition among, the various interactive SQL frameworks (e.g., Impala, Tez, Presto, Drill, BigSQL, Big Data SQL, QueryGrid, Spark SQL), a second path opened for database developers. So the Hadoop mainframe became a client/server machine. As if n-tier never happened.

The need for speed made n-tier happen: it grew out of the need to bypass the bottleneck of database I/O and the overhead of large, monolithic applications. And so the application server platform was born, and with it, ways to abstract functions such as integration, security, and transaction management so they could operate as modular parts with whatever application or database – or to prevent abandoned online shopping carts by executing a transaction without holding it hostage to full ACID compliance. Internet-based applications came to be developed on WebSphere, WebLogic, JBoss, and more recently, more compact open source alternatives like Apache Tomcat.

But with Hadoop, we’re still in the era of the mainframe or client/server. With the 2.x generation, however, where resource management has been taken out of MapReduce, the way has been cleared to make Hadoop more of a multi-purpose platform. While interactive SQL was the first shot, new frameworks supporting streaming (Storm, Spark Streaming), machine learning (Spark), and search (Solr) are among the new additions to the palette.

But at this point, we’re still looking at Hadoop as either a mainframe or a two-tier system. Developers write MapReduce or Spark programs, or BI/query tools access HDFS with or without Hive. There’s nothing available for writing data-driven programs, such as real-time user scoring or intrusion detection.

Nearly four years ago, a startup with a weird name – Continuuity – emerged to become, in its own terms, “the JBoss for Hadoop.” The goal was building a data fabric that abstracted the low-level APIs to HDFS, MapReduce, Hive, and other Hadoop components, clearing the way for developers not just to write MapReduce programs or run BI tools, but to write API-driven programs that could connect to Hadoop – just as, a generation ago, application servers abstracted data and programs so they could flexibly connect with each other. Its first project was a data ingestion platform written on Storm that would be easier to work with than existing Hadoop projects such as Flume.

Continuuity’s problem was that the company was founded too early. During a period when Hadoop was exclusively a batch processing platform, there was little clamor from developers to write data-driven applications. But as new frameworks transform Hadoop into a platform that can deliver experiences closer to real time, demand should emerge among developers to write, not just programs, but applications that can run against Hadoop (or other platforms).

In the interim, Continuuity changed its name to Cask and changed its business model to become an open source company. It has diversified its streaming engine to work with other frameworks besides Storm so that data can be persisted more readily. And the 40-person company, founded a few blocks away from Cloudera’s original headquarters next to Fry’s Electronics in Palo Alto, has just drawn a modest investment from Cloudera to further develop its middleware platform.

Admittedly, Cask’s website really doesn’t make a good case (the home page gives you a 404 error), but providing an application platform for Hadoop opens up possibilities limited only by the imagination. For instance, it could make possible event-driven programs for performing data validation or detecting changes in customer interactions, and so on.

For Cloudera, Cask is a low-risk proposition for developing that long-missing third path to Hadoop to further its transformation to a multi-purpose platform.

It’s happening: Hadoop and SQL worlds are converging

With Strata, IBM IOD, and Teradata Partners conferences all occurring this week, it’s not surprising that this is a big week for Hadoop-related announcements. The common thread of announcements is essentially, “We know that Hadoop is not known for performance, but we’re getting better at it, and we’re going to make it look more like SQL.” In essence, Hadoop and SQL worlds are converging, and you’re going to be able to perform interactive BI analytics on it.

The opportunity and challenge of Big Data from new platforms such as Hadoop is that it opens a new range of analytics. On one hand, Big Data analytics have updated and revived programmatic access to data, which happened to be the norm prior to the advent of SQL. There are plenty of scenarios where programmatic approaches are far more efficient, such as dealing with time series data or graph analysis to map many-to-many relationships. Programmatic access also leverages in-memory data grids such as Oracle Coherence, IBM WebSphere eXtreme Scale, GigaSpaces, and others, where programmatic development (usually in Java) proved more efficient for accessing highly changeable data for web applications in which traditional paths to the database would have been I/O-constrained. Conversely, Advanced SQL platforms such as Greenplum and Teradata Aster have provided support for MapReduce-like programming because, even with structured data, a Java programmatic framework is sometimes a more efficient way to rapidly slice through volumes of data.
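To make the many-to-many point concrete: in SQL, each additional hop through a relationship is another self-join, while in code each hop is just another lookup. Below is a minimal, hypothetical Java sketch (not drawn from any of the products named above) of a two-hop traversal over an adjacency map:

```java
import java.util.*;

// A hedged, self-contained sketch of why programmatic traversal suits
// many-to-many relationships: each hop is a map lookup, not a self-join.
public class GraphHops {
    public static Set<String> reachableWithinHops(Map<String, List<String>> adjacency,
                                                  String start, int maxHops) {
        Set<String> visited = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>(List.of(start));
        for (int hop = 0; hop < maxHops && !frontier.isEmpty(); hop++) {
            Deque<String> next = new ArrayDeque<>();
            for (String node : frontier) {
                for (String neighbor : adjacency.getOrDefault(node, List.of())) {
                    if (visited.add(neighbor)) {
                        next.add(neighbor);   // one hop deeper; no join required
                    }
                }
            }
            frontier = next;
        }
        visited.remove(start);
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
            "alice", List.of("bob", "carol"),
            "bob", List.of("dave"),
            "carol", List.of("dave", "erin"));
        // Prints bob, carol, dave, erin (in some order).
        System.out.println(reachableWithinHops(graph, "alice", 2));
    }
}
```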

Until now, Hadoop has not been for the SQL-minded. The initial path was: find someone to do data exploration inside Hadoop, but once you’re ready to do repeatable analysis, ETL (or ELT) it into a SQL data warehouse. That’s been the pattern with Oracle Big Data Appliance (use Oracle loader and data integration tools) and most Advanced SQL platforms; most data integration tools provide Hadoop connectors that spawn their own MapReduce programs to ferry data out of Hadoop. Some integration tool providers, like Informatica, offer tools to automate the parsing of Hadoop data. Teradata Aster and Hortonworks have been talking up the potential of HCatalog – in actuality an enhanced version of Hive with RESTful interfaces, cost optimizers, and so on – to provide a more SQL-friendly view of data residing inside Hadoop.

But when you talk analytics, you can’t simply write off the legions of SQL developers that populate enterprise IT shops. And beneath the veneer of chaos, there is an implicit order to most so-called “unstructured” data that is within the reach of programmatic transformation approaches – approaches that in the long run could likely be automated or packaged inside a tool.

At Ovum, we have long believed that for Big Data to cross over to the mainstream enterprise, it must become a first-class citizen with IT and the data center. The early pattern of skunk works projects – led by elite, highly specialized teams of software engineers from Internet firms to solve Internet-style problems (e.g., ad placement, search optimization, customer online experience) – does not reflect the problems of mainstream enterprises. Nor is the model of recruiting high-priced talent to work exclusively on Hadoop sustainable for most organizations. That means Big Data must become consumable by the mainstream of SQL developers.

Making Hadoop more SQL-like is hardly new
Hive and Pig became Apache Hadoop projects because of the need for SQL-like metadata management and data transformation languages, respectively; HBase emerged because of the need for a table store to provide a more interactive face – although, as a very sparse, rudimentary column store, it does not provide the efficiency of an optimized SQL database (or the extreme performance of some columnar variants). Sqoop in turn provides a way to pipeline SQL data into Hadoop, a use case that will grow more common as organizations look to Hadoop to provide scalable and cheaper storage than commercial SQL. While these Hadoop subprojects did not exactly make Hadoop look like SQL, they provided the building blocks that many of this week’s announcements leverage.
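As an illustration of the “SQL face” these building blocks provide, here is a hedged sketch of querying Hive from Java over plain JDBC. The driver class, endpoint, and weblogs table below are assumptions for illustration (adjust for your distribution); the point is that a familiar interface compiles down to MapReduce behind the scenes:

```java
import java.sql.*;

// A minimal sketch, assuming a HiveServer2 endpoint on its default port
// and the standard Hive JDBC driver on the classpath.
public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");             // assumed driver class
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", ""); // assumed endpoint
             Statement stmt = conn.createStatement();
             // HiveQL looks like SQL but executes as MapReduce jobs.
             ResultSet rs = stmt.executeQuery(
                 "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```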

Progress marches on
One train of thought is that if Hadoop can look more like a SQL database, more operations could be performed inside Hadoop. That’s the theme behind Informatica’s long-awaited enhancement of its PowerCenter transformation tool to work natively inside Hadoop. Until now, PowerCenter could extract data from Hadoop, but the extracts would have to be moved to a staging server where the transformation would be performed for loading to the familiar SQL data warehouse target. The new offering, PowerCenter Big Data Edition, now supports an ELT pattern that uses the power of MapReduce processes inside Hadoop to perform transformations. The significance is that PowerCenter users now have a choice: load the transformed data to HBase, or continue loading to SQL.

There is growing support for packaging Hadoop inside a common hardware appliance with Advanced SQL. EMC Greenplum was the first out of the gate with DCA (Data Computing Appliance), which bundles its own distribution of Apache Hadoop (not to be confused with Greenplum MR, a software-only product accompanied by a MapR Hadoop distro). Teradata Aster has just joined the fray with the Big Analytics Appliance, bundling the Hortonworks Data Platform Hadoop; this move was hardly surprising given their growing partnership around HCatalog, an enhancement of the SQL-like Hive metadata layer of Hadoop that adds features such as a cost optimizer and RESTful interfaces, which make the metadata accessible without the need to learn MapReduce or Java. With HCatalog, data inside Hadoop looks like just another Aster data table.

Not coincidentally, there is a growing array of analytic tools that are designed to execute natively inside Hadoop. For now they are from emerging players like Datameer (providing a spreadsheet-like metaphor, and which just announced an app store-like marketplace for developers), Karmasphere (providing an application development tool for Hadoop analytic apps), or a more recent entry, Platfora (which caches subsets of Hadoop data in memory with an optimized, high-performance fractal index).

Yet, even with Hadoop analytic tooling, there will still be a desire to disguise Hadoop as a SQL data store, and not just for data mapping purposes. Hadapt has been promoting a variant where it squeezes SQL tables inside HDFS file structures – not exactly a no-brainer as it must shoehorn tables into a file system with arbitrary data block sizes. Hadapt’s approach sounds like the converse of object-relational stores, but in this case, it is dealing with a physical rather than a logical impedance mismatch.

Hadapt promotes the ability to query Hadoop directly using SQL. Now, so does Cloudera. It has just announced Impala, a SQL-based alternative to MapReduce for querying the SQL-like Hive metadata store, supporting most but not all forms of SQL processing (based on SQL-92; Impala lacks triggers, which Cloudera deems low priority). Both Impala and MapReduce rely on parallel processing, but that’s where the similarity ends. MapReduce is a blunt instrument, requiring Java or other programming languages; it splits a job into multiple, concurrent, pipelined tasks in which each step reads data, processes it, writes it back to disk, and passes it to the next task. Conversely, Impala takes a shared-nothing, MPP approach to processing SQL jobs against Hive; against HDFS, Cloudera claims roughly 4x the performance of MapReduce, and if the data is in HBase, performance multiples up to a factor of 30. For now, Impala only supports row-based views, but with columnar support (on Cloudera’s roadmap), performance could double. Cloudera plans to release a real-time query (RTQ) offering that, in effect, is a commercially supported version of Impala.
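To see why we call MapReduce a blunt instrument, consider the canonical word count: what would be a one-line SELECT word, COUNT(*) ... GROUP BY word in Impala takes a full mapper, reducer, and job-wiring boilerplate in Java. A minimal sketch against the Hadoop MapReduce API (input and output paths passed as arguments):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// The classic word count: a one-line SQL aggregation, spelled out as MapReduce.
public class WordCount {
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                context.write(word, ONE);  // emit (word, 1); spilled to disk between stages
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```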

By contrast, Teradata Aster and Hortonworks promote a SQL-MapReduce approach that leverages HCatalog, an incubating Apache project that is a superset of Hive and that Cloudera does not currently include in its roadmap. For now, Cloudera claims bragging rights for performance with Impala; over time, Teradata Aster will promote the manageability of its single appliance, and with that appliance it has the opportunity to counter with hardware optimization.

The road to SQL/programmatic convergence
Either way – and this is of interest only to purists – any SQL extension to Hadoop will sit outside the Apache Hadoop project itself. What’s more important to enterprises is getting the right tool for the job – whether that is the flexibility of SQL or the raw power of programmatic approaches.

SQL convergence is the next major battleground for Hadoop. Cloudera is for now shunning HCatalog, an approach backed by Hortonworks and partner Teradata Aster. The open question is whether Hortonworks can instigate a stampede of third parties to overcome Cloudera’s resistance. It appears that beyond Hive, the SQL face of Hadoop will become a vendor-differentiated layer.

Part of the convergence will involve a mix of cross-training and tooling automation. Savvy SQL developers will cross-train to pick up some of the Java or Java-like programmatic frameworks that are emerging. Tooling will help lower the bar, reducing the degree of specialized skills necessary. And for programming frameworks, in the long run MapReduce won’t be the only game in town. It will always be useful for large-scale jobs requiring brute-force, parallel, sequential processing. But the emerging YARN framework, which deconstructs MapReduce to generalize the resource management function, will provide the management umbrella for ensuring that different frameworks don’t crash into one another by trying to grab the same resources. YARN is not yet ready for primetime, however – for now it only supports the batch job pattern of MapReduce. And that means YARN is not yet ready for Impala, or vice versa.
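What “generalizing the resource management function” means in practice: under YARN, any framework submits an application master and negotiates containers from the resource manager, rather than assuming it owns the cluster. A rough, hedged sketch using the YARN client API – the application name and launch command are placeholders:

```java
import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

// A sketch of a framework asking YARN for resources instead of owning them.
public class SubmitToYarn {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext context = app.getApplicationSubmissionContext();
        context.setApplicationName("my-framework");   // placeholder name

        // The command that launches this framework's (hypothetical) application master.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList("java com.example.MyAppMaster"));
        context.setAMContainerSpec(amContainer);

        // Negotiate memory and cores with the resource manager.
        Resource capability = Records.newRecord(Resource.class);
        capability.setMemory(512);
        capability.setVirtualCores(1);
        context.setResource(capability);

        yarnClient.submitApplication(context);
    }
}
```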

Of course, mainstreaming Hadoop – and Big Data platforms in general – is more than just a matter of making it all look like SQL. Big Data platforms must be manageable and operable by the people who are already in IT; they will need some new skills and will have to grow accustomed to some new practices (like exploratory analytics), but the new platforms must also look and act familiar enough. Not all announcements this week were about SQL; for instance, MapR is throwing down a gauntlet to the Apache usual suspects by extending its management umbrella beyond the proprietary NFS-compatible file system that is its core IP to the MapReduce framework and HBase, making a similar promise of high performance. On the horizon, EMC Isilon and NetApp are proposing alternatives that promise a more efficient file system, but at the “cost” of separating the storage from the analytic processing. And at some point, the Hadoop vendor community will have to come to grips with capacity utilization issues, because in the mainstream enterprise world, no CFO will approve the purchase of large clusters or grids that get only 10–15% utilization. Keep an eye on VMware’s Project Serengeti.

They must be good citizens in data centers that need to maximize resources (e.g., through virtualization and optimized storage); they must comply with existing data stewardship policies and practices; and they must fully support existing enterprise data and platform security practices. These are all topics for another day.

Searching for Data Scientists as a Service

It’s no secret that rocket .. err … data scientists are in short supply. The explosion of data, the corresponding explosion of tools, and the knock-on impacts of Moore’s and Metcalfe’s laws mean that there is more data, more connections, and more technology to process it than ever. At last year’s Hadoop World, there was a feeding frenzy for data scientists, which only barely outstripped the demand for the more technically oriented data architects. In English, that means:

1. Potential MacArthur Grant recipients who have a passion and insight for data, the mathematical and statistical prowess for ginning up the algorithms, and the artistry for painting the picture that all that data leads to. That’s what we mean by data scientists.
2. People who understand the platform side of Big Data, a.k.a., data architect or data engineer.

The data architect side will be the more straightforward nut to crack. Understanding big data platforms (Hadoop, MongoDB, Riak) and emerging Advanced SQL offerings (Exadata, Netezza, Greenplum, Vertica, and a bunch of recent upstarts like Calpont) is a technical skill that can be taught with well-defined courses. The laws of supply and demand will solve this one – just as they did when the dot com bubble created demand for Java programmers back in 1999.

Behind all the noise for Hadoop programmers, there’s a similar but quieter desperate rush to recruit data scientists. While some call “data scientist” a buzzword, the need is real.

However, data science will be a tougher number to crack. It’s all about connecting the dots, and that’s not as easy as it sounds. The V’s of big data – volume, variety, velocity, and value – require someone who can discover insights from data; traditionally, that role was performed by the data miner. But data miners dealt with better-bounded problems and well-bounded (and known) data sets that made the problem more two-dimensional. The variety of Big Data – in form and in sources – introduces an element of the unknown. Deciphering Big Data requires a mix of investigative savvy, communications skills, creativity/artistry, and the ability to think counter-intuitively. And don’t forget that it all comes atop a foundation of a solid statistical and machine learning background plus technical knowledge of the tools and programming languages of the trade.

Sometimes it seems like we’re looking for Albert Einstein or somebody smarter.

As nature abhors a vacuum, there’s also a rush to not only define what a data scientist is, but develop programs that could somehow teach it, software packages that to some extent package it, and otherwise throw them into a meat … err, the free market. EMC and other vendors are stepping up to the plate to offer training, not just on platforms, but for data science. Kaggle offers an innovative cloud-based, crowdsourced approach to data science, making available a predictive modeling platform and then staging sponsored 24-hour competitions for moonlighting data scientists to devise the best solutions to particular problems (redolent of the Netflix $1 million prize to devise a smarter algorithm for predicting viewer preferences).

With data science talent scarce, we’d expect consulting firms to buy up talent that could then be “rented” to multiple clients. Excluding a few offshore firms, few SIs have yet stepped up to the plate to roll out formal big data practices (the logical place where data scientists would reside), but we expect that to change soon.

Opera Solutions, which has been in the game of predictive analytics consulting since 2004, is taking the next step down the packaging route. Having raised $84 million in Series A funding last year, the company has staffed up to nearly 200 data scientists, making it one of the largest assemblages of genius this side of Google. Opera’s predictive analytics solutions are designed for a variety of platforms, SQL and Hadoop, and today it joins the SAP Sapphire announcement stream with a release of its offering on the HANA in-memory database. Andrew Brust provides a good drilldown on the details of this announcement.

From SAP’s standpoint, Opera’s predictive analytics solutions are a logical fit for HANA as they involve the kinds of complex problems (e.g., a computation triggers other computations) that their new in-memory database platform was designed for.

There’s too much value at stake to expect that Opera will remain the only large aggregation of data scientists for hire. But ironically, the barriers to entry will keep the competition narrow and highly concentrated. Of course, with market demand, there will inevitably be a watering down of the definition of data scientists so that more companies can claim they’ve got one… or many.

The laws of supply and demand will kick in for data scientists, but the ramp up of supply won’t be as quick as that for the more platform-oriented data architect or engineer. Of necessity, that supply of data scientists will have to be augmented by software that automates the interpretation of machine learning, but there’s only so far that you can program creativity and counter-intuitive insight into a machine.

Google and Motorola: Quick Post Mortem

There’s been plenty of excellent commentary on Google’s $12.5 billion deal for Motorola Mobility Inc. (MMI) over the past few days, and we’re certainly not going to rehash covered ground.

Clearly this is a lot of money that was invested defensively. Money that could have gone into research or acquisitions that would have grown the business or opened new markets.

Whatever.

That thought hit us this morning after reading a NY Times piece on the bull market for patents. It reinforced our reaction when word of the deal broke: this was money spent to arm Google against patent predators in courts of law – in this case, predators sensing blood, looking to slow down or at least exact royalties from the Android platform juggernaut.

Of course much of the issue stems from the subjective nature of software patents; that’s a longstanding issue given the iterative nature of software development. It is difficult, if not impossible, to prove that a software innovation does not base itself in some way on prior invention. Furthermore, the fact that software relies on other software to operate makes the notion of software patents even more dubious.

This doesn’t mean that software developers should get away with plagiarism. Although discovery is still underway, the evidence continues to get more damning in the Oracle-Google case over Dalvik, the Android VM that on closer inspection looks like the JVM in sheep’s clothing. The irony is that when Google was still pulling its (J)VM clean-room act, the company at the other end of the line was Sun. To us, this is a reflection of Google’s not-invented-here mentality. Would it have killed them to secure a JVM license at the time, when they could have gotten far more reasonable terms from Sun rather than Oracle, the new sheriff in town?

Just askin’.

Yahoo to Hadoop: Show Me the Money

While there is relatively little to knock cloud from its hype perch, among web startups, BI and data geeks, the emergence of Big Data has become a game changer. It’s analytics and operational intelligence gone extreme.

Big Data typically is associated with obscene amounts of data – the scale blows away anything that most enterprises would maintain within their core back end business systems. We’re talking hundreds of terabytes or even petabytes.

Today, Yahoo announced that it might take the business of its best-known Big Data brainchild, Hadoop, and consider spinning it off into a new entity.

So why are we having this conversation?

It’s because Internet giants Google, Yahoo, Facebook, Amazon, and others had to roll their own technologies to deal with magnitudes of data far beyond conventional wisdom of what was possible with enterprise systems. What makes the conversation interesting is that this technology is on the cusp of entering the enterprise mainstream today. It’s not just a matter of technology looking for a problem. When Facebook needs to understand how its 500 million members update their walls, share photographs, and have conversations, it’s because (1) it needs to optimize its IT infrastructure to support how its members use the site, but more importantly (2) it needs to understand more about its members so it can sell advertising.

And when Facebook makes its API publicly available, that same issue becomes critical for any B2C marketer. And as the technology becomes available, suddenly there are downstream uses: capital markets conducting brute-force analyses on trading positions, healthcare providers understanding outcomes, homeland security controlling borders, metropolitan entities managing congestion pricing, life sciences organizations deciphering clinical studies, mobile carriers seeking to prevent or minimize customer churn, and so on.

There are a couple of technology and market paths that have opened for contending with Big Data. There are Advanced SQL analytic database providers that have adapted SQL for structured data through strategies such as reducing indexing, introducing new forms of data compression and query optimization, columnar architectures, and embedding analytics and data transformation directly into the data engine to minimize data movement; in some cases, they have developed optimized appliances. We’re talking about the Aster Datas, Greenplums, Netezzas, ParAccels, and Verticas of the world – and players like Teradata, which invented big data warehousing; Oracle, which has extended it; and Sybase, which acquired the first column-oriented database. Business has obviously picked up here; IBM, EMC, Teradata, and HP have all made acquisitions in this space over the past 12 months.

But the Facebooks and Googles of the world weren’t dealing with structured data in the enterprise sense – they are contending with web log files, document APIs, rich media files, and so on. They are dealing with data whose structure and volume is so varied and huge that there is no time to model it and form a schema; they need to just load the data into the file system and then analyze it. That spawned the NoSQL movement – initially a focus on technologies that avoided the overhead and scalability limits of SQL.

Until now, neither Google, Yahoo, nor Facebook considered themselves in the tools or database business. So they released the fruits of their innovation as open source, with one of the best-known projects being Apache Hadoop. Hadoop is a family of projects that includes a distributed file system, the MapReduce framework that parcels out massively parallel computing jobs across a cluster, plus a number of other frameworks, file systems, and utilities.
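The distributed file system piece is the heart of the “just load the data, then analyze it” pattern described above. A small illustrative Java sketch against the HDFS FileSystem API – the namenode host and paths are placeholders – where raw files land with no schema and analysis comes later via MapReduce:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// A hedged sketch of schema-less ingest into HDFS: no table, no model, just files.
public class HdfsLoad {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Write a raw log line straight into the cluster.
        try (FSDataOutputStream out = fs.create(new Path("/raw/weblogs/2011-05-01.log"))) {
            out.writeBytes("203.0.113.9 GET /index.html 200\n");
        }

        // The file system is the catalog: list what landed.
        for (FileStatus status : fs.listStatus(new Path("/raw/weblogs"))) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
    }
}
```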

What’s kind of fascinating is the almost incestuous relationship between these NoSQL projects. Hadoop, developed at Yahoo, descended from Google’s published designs for the Google File System and MapReduce; Cassandra, another NoSQL data store, similarly borrows from Google’s BigTable design. Meanwhile, Facebook developed Hive, a relational-like table structure designed to work with Hadoop. You get the picture.

Cloudera has stepped to the forefront in commercializing Hadoop technology and applying MapReduce. Using a Red Hat-like business model, it offers support, several open source extensions, plus an enterprise edition that adds a number of proprietary monitoring and management features. It has distinguished itself by forging partnerships with almost every major BI and data warehousing player except one – IBM. The highlights are its relationships with Informatica, for data transformation, and MicroStrategy, which provides a data mart strategy designed to complement Hadoop. And it has garnered roughly 75 paying enterprise customers in a market segment that has barely commercialized.

In the long run, we also expect IBM to make a stab at Hadoop and related technologies by extending its InfoSphere offerings – it can see Cloudera-Informatica and Cloudera-MicroStrategy and raise them one with its own InfoSphere DataStage and Cognos offerings, before it even talks about partnerships. Today we saw a shot from left field: Yahoo, which invented the technology, is now saying it might spin off its Hadoop business to go up against Cloudera, and potentially IBM. In a way, it’s closing the barn doors after the horses have left, as the creator of Hadoop is now part of Cloudera.

Clearly there will be a market for NoSQL technologies in the quest for Big Data, although for now they require sufficiently specialized skills that they are not for the faint of heart – that is, if you can find any Hadoop and MapReduce programmers who haven’t already been scarfed up by Amazon, Zynga, or JP Morgan Chase. That market will not necessarily be in competition with Advanced SQL, as there are different use cases for each. And in fact, there will likely be a blending of the technologies in the long run. Today, many Advanced SQL platforms are already extending support for MapReduce, and in the long run we expect that SQL-like technologies in the NoSQL space, like Hive or HBase, will themselves be made more accessible to the huge base of SQL developers.

But we digress.

For Yahoo, this would clearly be a shot outside its comfort zone, as it is not a tools company. But it is hungry to monetize its intellectual property, even if that property has already been open sourced. It’s redolent of Sun striving to monetize Java, and we all know how that went. Obviously this will be an uphill battle for Yahoo, but at least this would be a spinoff, so hopefully there won’t be distractions from the mother ship. Given Yahoo’s fortunes, we shouldn’t be surprised that it is now looking to maximize what it can get out of the family jewels.

How should enterprises navigate forks in the Hudson?

A South Jersey neighbor of ours — runner, educator, and open source mischief maker Bob Bickel — recently blogged a status report on what’s been going on with the Jenkins open source project ever since it split off from Hudson.

That’s prompted us to wade in to ask the question that’s been glossed over by the theatrics: what about the user?

For background: this is a case of a promising grassroots technology that took off beyond expectation and became a victim of its own success: governance just did not keep up with the project’s growing popularity and attractiveness to enterprise developers. The sign of a mature open source project is that its governing body has successfully balanced the conflicting pressures of constant innovation vs. the need to slow things down for stable, enterprise-ready releases. Hudson failed in this regard.

That led to unfortunate conflicts that degenerated into stupid, petty, and entirely avoidable personality squabbles that in the end screwed the very enterprise users that were pumping much of the oxygen in. We know the actors on both sides – in their everyday roles they are normal, reasonable people who got caught up in mob frenzy. Both sides shot themselves in the feet as matters careened out of control. Go to SD Times if you want the blow-by-blow.

So what is Hudson – or Jenkins – and why is it important?

Hudson is a continuous integration (CI) server open source project that grew very popular among Java developers. The purpose of a CI server is to support the agile practice of continuous integration with a server that maintains the latest copy of the truth. The project was the brainchild of Kohsuke Kawaguchi, formerly of Sun and Oracle and now with Cloudbees.

Since the split, the project has forked into Hudson and Jenkins branches, with Jenkins attracting the vast majority of committers and much livelier mailing list activity. Bickel has given us a good snapshot from the Jenkins side, with which he’s aligned: a diverse governance body has been established that plans to publish the results of its meetings and commit not only to continuing the crazy schedule of weekly builds, but also to “stable” quarterly releases. The plan is to go “stable” with the recent 1.400 release, for which a stream of patches is underway.

So most of the committers have gone to Jenkins. Case closed? From the Hudson side, Jason van Zyl of Sonatype, whose business was built around Apache Maven, states that the essential plug-ins are already in the existing Hudson version, and that the current work is more about consolidating the technology already in place, testing it, and refactoring to comply with JSR 330, which standardizes the dependency injection technology popularized by the Spring framework. Although the promise is to keep the APIs stable, this is going to be a rewrite of Hudson’s innards.
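For readers who haven’t bumped into JSR 330: it standardizes the dependency injection annotations that Spring and Google Guice popularized, so a component declares what it needs and the container wires it up. A purely illustrative sketch – the classes here are hypothetical, not from the Hudson codebase, and a JSR 330 container (Guice, Spring, etc.) is assumed to do the wiring:

```java
import javax.inject.Inject;
import javax.inject.Named;
import javax.inject.Singleton;

// Hypothetical components illustrating the JSR 330 style: no "new" in user code;
// the container supplies dependencies based on the annotations.
@Singleton
class BuildScheduler {
    private final ArtifactRepository repository;

    @Inject  // the container injects an ArtifactRepository bound to the name "maven"
    BuildScheduler(@Named("maven") ArtifactRepository repository) {
        this.repository = repository;
    }

    void scheduleBuild(String jobName) {
        repository.resolveDependencies(jobName);
        // ... queue the build ...
    }
}

interface ArtifactRepository {
    void resolveDependencies(String jobName);
}
```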

Behind the scenes, Sonatype is competing on the natural affinity between Maven and Hudson, which share a large mass of common users, while the emerging force behind Jenkins is Cloudbees, which wants to establish itself as the leading platform for Java development in the cloud.

So if you’re wondering what to do, join the crowd. There are bigger commercial forces at work, but as far as you’re concerned, you want stable releases that don’t break the APIs you already use. Jenkins must prove it’s not just the favorite of the hard core, but that its governance structure has grown up to provide stability and assurance to the enterprise, while Hudson must prove that the new rewrite won’t destabilize the old, and that it has managed to retain the enterprise base in spite of all the noise otherwise.

Stay tuned.

April 28, 2011 update: Bob Bickel reports that since the “divorce,” Jenkins has drawn 733 commits vs. 172 for Hudson.

May 4, 2011 update: Oracle has decided to submit the Hudson project to the Eclipse Foundation. Eclipse board member Mik Kersten voices his support of this effort. Oracle says it didn’t consider this before because going to Eclipse was originally perceived as too heavyweight. This leaves us wondering: why didn’t Oracle propose this earlier? Where was the common sense?

The Second Wave of Analytics

Throughout the recession, business intelligence (BI) was one of the few growth markets in IT. Transactional systems that report “what” is happening are simply the price of entry for remaining in a market; BI and analytics systems answer the question of “why” something is happening and, ideally, provide intelligence that is actionable so you know “how” to respond. Not surprisingly, understanding the whys and hows is essential for maximizing the top line in growing markets, and for pinpointing the path to survival in down markets.

Analytic databases are cool again. Teradata, the analytic database provider with a 30-year track record, had its strongest Q2 in what was otherwise a lousy 2010 for most IT vendors. Over the past year, IBM, SAP, and EMC made major acquisitions in this space, while some of the loudest decibels at this year’s Oracle OpenWorld were over the Exadata optimized database machine. There are a number of upstarts with significant venture funding – Vertica, Cloudera, Aster Data, ParAccel, and others – that are not only charting solid growth, but pursuing a range of varied approaches that reveal a market far from mature, with plenty of demand remaining for innovation.

We are seeing today a second wave of innovation in BI and analytics that matches the ferment and intensity of the 1995-96 era, when data warehousing and analytic reporting went commercial. There isn’t any one thing driving BI innovation. At one end of the spectrum you have Big Data, and at the other end Fast Data – the actualization of real-time business intelligence. Advances in commodity hardware, memory density, and parallel programming models, plus the emergence of NoSQL, open source statistical programming languages, and the cloud, are bringing this all within reach. There is more and more data everywhere that’s begging to be sliced, diced, and analyzed.

The amount of data being generated is mushrooming, but much of it will not necessarily be persisted to storage. For instance, if you’re a power company that wants to institute a smart grid, moving from monthly to daily meter reads multiplies your data volumes by a factor of 30, and if you decide to take readings every 15 minutes, multiply all that again by a factor of roughly 100. Much of this data will be consumed as events. Even where data is persisted, traditional relational models won’t handle the load – not only because of the overhead of operating all the iron, but because of the concurrent need for additional equipment, space, HVAC, and power.
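The back-of-envelope arithmetic behind those multipliers, for the record – the “factor of 100” is really 96 (four reads an hour, 24 hours a day), as this trivial sketch shows:

```java
// Reads per meter per month under each regime; the ratios are the multipliers cited above.
public class MeterMath {
    public static void main(String[] args) {
        int monthly = 1;
        int daily = 30;                 // ~30 reads per month
        int every15min = 30 * 24 * 4;   // 2,880 reads per month
        System.out.println("daily vs. monthly:  " + (daily / monthly) + "x");       // 30x
        System.out.println("15-min vs. daily:   " + (every15min / daily) + "x");    // 96x
        System.out.println("15-min vs. monthly: " + (every15min / monthly) + "x");  // 2880x
    }
}
```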

Unlike the past, when the biggest databases were maintained inside the walls of research institutions, public sector agencies, or within large telcos or banks, today many of the largest data stores on the Internet are getting opened through APIs, such as from Facebook. Big databases are no longer restricted to use by big companies.

Compare that to the 1995-96 period, when relational databases, which made enterprise data accessible, reached critical-mass adoption; rich Windows clients, which put powerful apps on the desktop, became enterprise standard; and new approaches emerged for optimizing data storage and productizing the kind of enterprise reporting pioneered by Information Builders. And with it all came the debates: OLAP (or MOLAP) vs. ROLAP, star vs. snowflake schema, and ad hoc vs. standard reporting. Ever since, BI has become ingrained with enterprise applications, as reflected by the acquisitions of Cognos, Business Objects, and Hyperion by IBM, SAP, and Oracle. How much more establishment can you get?

What’s old is new again. When SQL relational databases emerged in the 1980s, conventional wisdom was that the need for indexes and related features would limit their ability to perform or scale to support enterprise transactional systems. Moore’s Law and the emergence of client/server made a mockery of that argument – until the web, the proliferation of XML data, smart sensory devices, and the realization that unstructured data contained valuable morsels of market and process intelligence in turn made a mockery of the argument that relational was the enterprise database end-state.

In-memory databases are nothing new either, but the same hardware commoditization trends that helped mainstream SQL have also brought the costs of these engines down to earth.

What’s interesting is that there is no single source or style of innovation. Just as 1995 proved a year of discovery and debate over new concepts, today you are seeing a proliferation of approaches, ranging from different strategies for massively parallel, shared-nothing architectures to columnar databases, massive networked and hierarchical file systems, and SQL vs. programmatic approaches. It is not simply SQL vs. a single post-SQL model, but variations that mix and match SQL-like programming with various approaches to parallelism, data compression, and use of memory. And don’t forget the application of analytic models to complex event processing for identifying key patterns in long-running events, or combing through streaming data that arrives in torrents too fast and large to ever consider putting into persistent storage.

This time, much of the innovation is coming from the open source world, as evidenced by projects like Hadoop, the Java-based distributed computing platform developed at Yahoo from Google’s published designs; the MapReduce parallel programming model developed by Google; the Hive project that makes MapReduce look like SQL; and the R statistical programming language. Google has added fuel to the fire by releasing to developers its BigQuery and Prediction API for analyzing large sets of data and applying predictive algorithms.

These are not simply technology innovations looking for problems, as use cases for Big Data or real-time analysis are mushrooming. Want to extend your analytics from structured data to blogs, emails, instant messaging, wikis, or sensory data? Want to convene the world’s largest focus group? There’s sentiment analysis to be conducted from Facebook; trending topics for Wikipedia; power distribution optimization for smart grids; or predictive analytics for use cases such as real-time inventory analysis for retail chains, or strategic workforce planning, and so on.

Adding icing to the cake was an excellent talk at a New York Technology Council meeting by Merv Adrian, a 30-year veteran of the data management field (soon to join Gartner), who outlined the content of a comprehensive multi-client study on analytic databases that can be downloaded free from Bitpipe.

Adrian speaks of a generational disruption occurring in the database market, one attacking new forms of an age-old problem: how to deal with expanding datasets while maintaining decent performance. As mundane as that. But the explosion of data, coupled with the commoditization of hardware and increasing bandwidth, has exacerbated matters to the point where we can no longer apply the brute-force approach of tweaking relational architectures. “Most of what we’re doing is figuring out how to deal with the inadequacies of existing systems,” he said, adding that the market and state of knowledge have not yet matured to the point where we’re thinking about how the data management scheme should look logically.

So it’s not surprising that competition has opened wide for new approaches to solving the Big and Fast Data challenges; the market has not yet matured to the point where there are one or a handful of consensus approaches around which to build a critical mass practitioner base. But when Adrian describes the spate of vendor acquisitions over the past year, it’s just a hint of things to come.

Watch this space.

Pragmatism over Harmony

Times like this morning, we can really appreciate that the east coast is the right coast, as you get a nice early start on the day. We caught the announcement of IBM’s agreement with Oracle to shift JDK development efforts from Apache Harmony to Oracle’s (formerly Sun’s) OpenJDK project for about 5 minutes as we were literally running to the JetBlue gate at JFK on the way to SFO. By the following morning, bright and early west coast time, the verdict was out: the announcement clearly knocks the pillars out from under Google’s Dalvik JVM project, which was based on Apache Harmony.

Good summaries of the announcement are available from Gavin Clarke; great commentary is available from Shain McGlaun, Carlo Daffara, and Sacha Labourey.

As Thinkovation’s Gary Barnett tweeted to us yesterday, the announcement was Deja moo all over again. We recalled sitting in the audience at JavaOne 2005, where Steve Mills appeared in a video reiterating support for the Java Community Process in spite of differences with Sun over its more-than-equal role in governing the JCP – and in spite of, of course, IBM’s successful effort in cultivating Eclipse as the rival to Sun in the Java tooling space.

One of the critiques of Sun over the years regarding Java was that (1) it couldn’t make a viable business of it and (2) its stewardship over the JCP and the OpenJDK was too heavy-handed. Well, the first critique would hardly apply to Oracle, with its recent filing of litigation against Google over the so-called clean-room Dalvik JVM, which Google created based on content from the Apache Harmony JDK (the rival to OpenJDK in which IBM was heavily vested).

IBM’s Bob Sutor characterized the agreement as “the pragmatic choice,” providing a way to work with Oracle from the inside to open up access to the Java certification tests that it denied Apache.

Clearly the bystander is Google; with IBM’s agreement to dump Apache Harmony, Google is left exposed, as its Dalvik code uses class libraries from the Harmony project that IBM has abandoned. In essence, IBM didn’t have a dog in the Oracle/Google fight over Android: if you can’t beat ‘em, join ‘em. Google has become the giant pin cushion: it sits on piles of search ad dollars yet, as a player in the Java community, is an outsider whose not-invented-here mentality has not exactly conjured grassroots support.

Labourey, formerly of JBoss, summarized the prospects for the Red Hats of the world to essentially accept the new status quo. For Google, it’s another fact on the ground that, in the end, will likely result in its coming to a royalty settlement with Oracle sooner rather than later.

Oct. 20 Update — The Register’s Gavin Clarke has a scoop on the fallout at Apache.

Stack envy: Impressions of Oracle OpenWorld 2010

Last year, the anticipation of the unveiling of Fusion apps was palpable. Although we’re not finished with Oracle OpenWorld 2010 yet – we still have the Fusion middleware analyst summit tomorrow, and loose ends remain regarding Oracle’s Java strategy – by now our overall impressions are fairly code complete.

In his second conference keynote – which unfortunately turned out to be almost a carbon copy of his first – Larry Ellison boasted that they “announced more new technology this week than anytime in Oracle’s history.” Of course, that shouldn’t be a heavy lift given that Oracle is a much bigger company with many more products across the portfolio, and with Sun, has a much broader hardware/software footprint at that.

On the software end – and post-Sun acquisition, we have to make that distinction – it’s hard to follow up last year’s unveiling of Fusion apps. The Fusion apps are certainly a monster in size with over 5000 tables, 10,000 task flows, representing five years of development. Among other things, the embedded analytics provide the context long missing from enterprise apps like ERP and CRM, which previously required you to slip into another module as a separate task. There is also good integration of process modeling, although for now BPM models developed using either of Oracle’s modeling tools won’t be executable. For now, Fusion apps will not change the paradigm of model, then develop.

A good sampling of coverage and comment can be found from Ray Wang, Dennis Howlett, Therese Poletti, Stefan Ried, and for the Java side, Lucas Jellema.

The real news is that Fusion apps, excluding manufacturing, will be in limited release by year end and general release in Q1. That’s pretty big news.

But at the conference, Fusion apps took a back seat to Exadata, the database appliance (originally HP-based, now running on Sun hardware) unveiled last year, and Exalogic, the cloud-in-a-box unwrapped this year. It’s no mystery that growth in the enterprise apps market has been flat for quite some time, with the main greenfield opportunities going forward being midsized businesses and the BRIC region. Yet Fusion apps will be overkill for small-to-midsized enterprises that won’t need such a rich palette of functionality (NetSuite is more likely their speed), which leaves the emerging economies as the prime growth target. The reality is that most enterprises are not about to replace the very ERP systems they implemented as part of modernization or Y2K remediation efforts a decade ago. At best, Fusion will be a gap filler, picking up where current enterprise applications leave off – a potential growth opportunity for Oracle, but not exactly a blockbuster one.

Nonetheless, as Oracle was historically a software company, the bulk of attendees, along with the press and analyst community (including us), pretty much tuned out all the hardware talk. That likely explains why, if you subscribed to the #oow10 Twitter hashtag, you heard nothing but frustration from software bigots like ourselves and others who got sick of the all-Exadata/Exalogic-all-the-time treatment during the keynotes.

In a memorable metaphor, Ellison stated that one Exalogic device could schedule the entire Chinese rail system, and that two of them could run Facebook – to which a Twitter user retorted: how many enterprises have the computing load of a Facebook?

Frankly, Larry Ellison has long been at the point in his life where he can afford to disregard popular opinion. Give a pure hardware talk Sunday night, then do it almost exactly again on Wednesday (although on the second go-round we were also treated to a borscht belt routine taking Salesforce’s Marc Benioff down more than a peg on who has the real cloud). Who is going to say no to the guy who sponsored and crewed on the team that won the America’s Cup?

But if you look at the dollars-and-sense opportunity for Oracle, it’s all about owning the full stack that crunches and manages the data. Even in a recession, if there’s anything that’s growing, it’s the amount of data floating around. Combine the impacts of broadband, sensory data, and lifestyles that are becoming more digital, and you have the makings of a data counterpart to Metcalfe’s Law. Owning the hardware closes the circle. Last year, Ellison spoke of his vision to recreate the unity of the IBM System/360 era, because at the end of the day, nothing works better than software and hardware tuned for each other.

So if you want to know why Ellison is talking about almost nothing else except hardware, it’s not only because it’s his latest toy (OK, maybe it’s partly that). It’s because if you run the numbers, there’s far more growth potential to the Exadata/Exalogic side of the business than there is for Fusion applications and middleware.

And if you look at the positioning, owning the entire stack means deeper account control. It’s the same strategy behind the entire Fusion software stack, which uses SOA to integrate internally and with the rest of the world. But Fusion apps and middleware remain optimized for an all-Oracle Fusion environment, underpinned by a declarative Application Development Framework (ADF) and tooling designed specifically for that stack.

So on one hand, Oracle’s pitch that big database processing works best on optimized hardware can sound attractive to CIOs that are seeking to optimize one of their nagging points of pain. But the flipside is that, given Oracle’s reputation for aggressive sales and pricing, will the market be receptive to giving Oracle even more control? To some extent the question is moot; with Oracle having made so many acquisitions, enterprises that followed a best of breed strategy can easily find themselves unwittingly becoming all-Oracle shops by default.

Admittedly, the entire IT industry is consolidating, but each player is vying for different combinations of the hardware, software, networking, and storage stack. Arguably, applications are the most sensitive layer of the IT technology stack because that is where the business logic lies. As Oracle asserts greater claim to that part of the IT stack and everything around it, it requires a strategy for addressing potential backlash from enterprises seeking second sources when it comes to managing their family jewels.

Please Keep the LightSwitch on

It seems almost quaint to think that once upon a time, you really had to be a rocket scientist to develop software. OK, correct that: computer scientist. Your IDE was a cryptic command-line text editor, and you freelanced debugging manually. That’s OK – that was during the cowboy days of appdev, when ideas like objects, components, or models were considered the stuff of idle dreams. Besides, what self-respecting programmer (we didn’t call them developers back then) would ever condescend to using somebody else’s code? Real coders only need command lines, and they don’t need formalized architecture to tell them how to program.

Roughly 25 years ago, what was then Borland introduced the integrated development environment, and several years after that, Microsoft blew the lid off that market with the first programming language really designed for, to borrow Apple’s terminology, “the rest of us”: Visual Basic. For the first time, here was a language that was fairly easy to learn, offered lots of flexibility, and, tapping the innovations of visual development, made software development more intuitive. As God’s gift to liberal arts majors, they could now get paid for doing something other than waiting tables, driving cabs, or teaching art history or philosophy.

Of course, lowering barriers to entry lets the unwashed masses in, and yes, there is a sound argument that allowing anybody to program would lower the quality of coding. Yet democratizing development became essential because in the early 90s, the coming boom in client/server, followed by web development, unleashed an enormous appetite for applications that there weren’t enough computer scientists in the world to deliver. Even with bandwidth bringing millions of Indian, Chinese, and Ukrainian developers online, supply is still mismatched with demand. While you might think about outsourcing large projects or maintenance, it is simply impractical to task teams located a dozen time zones away (not to mention language or cultural barriers) with churning out the kinds of quick, tactical applications that some agile team in a corner could crank out in days.

Not surprisingly, the democratization of development unleashed by Visual Basic, and almost every development tool after it, made it possible for the IT profession to meet demand; it didn’t create that demand. But with all that sloppy coding out there, robust frameworks like Java EE and .NET emerged to clean up the mess with disciplined practices like strongly typed coding. And as the laws of physics predict a counter-reaction to every action, dynamic scripting languages like PHP and Ruby emerged to provide the ease and light weight that the top-down frameworks forgot.

Anyway, it is difficult to make it through a vendor briefing call these days without hearing bromides on how they are making their tools accessible to “business developers” – as if there is such a class of people in the business who do software development. What they are really saying is that they have tools that let business stakeholders with day jobs craft quick little productivity or business-insight apps on the side with drag-and-drop tools. It’s the same thing we have heard from players like Zoho, which seem more like cloud platforms for developing trivial apps of little meaning.

Therefore our ears perked up at Microsoft’s release of LightSwitch, which provides a simpler path to developing real data-centric .NET applications. We’ll spare you the details because Andrew Brust has explained them much better than we could, and he hopes that LightSwitch might become part of “a long overdue turnaround” from Microsoft’s last decade of “courting complexity.”

We share his hopes, but our optimism is a bit more measured. Microsoft doesn’t exactly have a great track record backing innovation these days. A couple of years ago, it had a similar kind of great idea with Oslo, an innovative attempt to make modeling of data-driven applications (do we see a pattern here?) more developer-centric through a coding-oriented approach. Less than a year after unveiling Oslo, Microsoft backtracked and made it a development pattern for SQL Server. Let’s hope that on this go-round, Microsoft has the patience and perseverance to keep the LightSwitch on.