04.25.12

Another vote for the Apache Hadoop Stack

Posted in Big Data, Data Management at 8:59 pm by Tony Baer

As we’ve noted previously, the measure of success of an open source stack is the degree to which the stack remains an intact, stable target. That happens either through a captive open source project, where a vendor unilaterally open sources its code (and typically hosts the project) to promote adoption, or through a community model, where a neutral industry body hosts the project and draws support from a diverse cross section of vendors and advanced developers. In the latter case, the goal is getting the formal standard to also become the de facto standard.

The most successful open source projects are those that represent commodity software – otherwise, why would vendors agree not to compete with software that anybody can freely license or consume? That’s been the secret behind the success of Linux, where there has been general agreement on where the kernel ends and, as a result, a healthy market of products that run atop (and license) Linux. For community open source projects, vendors obviously have to agree on where the line between commodity and unique value-add begins.

And so we’ve discussed that the fruition of Hadoop will require some informal agreement as to exactly what components make Hadoop, Hadoop. For a while, the answer appeared in doubt, as one of the obvious pillars – the file system – was being contested by proprietary alternatives such as MapR’s file system and IBM’s GPFS.

What’s interesting is that the two primary commercial providers that signed on for the proprietary file systems – IBM and EMC (via partnership with MapR) – have since clarified their messages. They still offer the proprietary file systems in question, but both are now emphasizing that they also offer Apache versions. IBM made its announcement today, buried below the fold after its announced intention to acquire federated search player Vivisimo. The announcement had a bit of a grudging aspect to it – unlike Oracle, which has a full OEM agreement with Cloudera, IBM is simply stating that it will certify Cloudera’s Hadoop as one of the approved distributions for InfoSphere BigInsights – there’s no exchange of money or other skin in the game. If IBM also sees demand for the Hortonworks distro (or if it wants to keep Cloudera in its place), it will likely add Hortonworks to the approved list as well.

Against this background is a technology that remains a moving target. The primary drawback – that there was no redundancy or failover for the HDFS NameNode (which acts as the file system’s directory) – has been addressed in the latest versions of Hadoop. The other – the lack of POSIX compliance that would let Hadoop be accessed through the NFS standard – matters only for very fast, transaction-like (OK, not ACID) workloads, which so far have not been an issue. If you want that kind of performance, Hadoop’s HBase offers more promise.
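
For readers who want to see the distinction, HDFS is built for large sequential scans, while HBase layers keyed, low-latency reads and writes on top of it. Here is a minimal sketch of an HBase random read – the table and column names are hypothetical, and the client API shown is the Hadoop 1.x-era HBase Java client:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        HTable table = new HTable(conf, "user_events");      // hypothetical table name
        try {
            Get get = new Get(Bytes.toBytes("user#42"));      // single-row lookup by key
            get.addColumn(Bytes.toBytes("d"), Bytes.toBytes("last_login"));
            Result result = table.get(get);                   // low-latency random read, no MapReduce job
            byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("last_login"));
            System.out.println(value == null ? "not found" : Bytes.toString(value));
        } finally {
            table.close();
        }
    }
}
```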

But just as the market has passed judgment on what comprises the Hadoop “kernel” (using some Linuxspeak), that doesn’t rule out differences in implementation. Teradata Aster and Sybase IQ are promoting their analytics data stores as swappable, more refined replacements for HBase (Hadoop’s column store), while upstarts like Hadapt are proposing to hang SQL data nodes onto HDFS.

When it comes to Hadoop, you gotta reverse the old maxim: The more things stay the same, the more things are actually changing.

04.16.12

Big Data and the Product Lifecycle

Posted in Application Lifecycle Management (ALM), Complex Engineered Systems, Product Lifecycle at 2:04 am by Tony Baer

Our twitter feed went silent for a few days last week as we spent some time at a conference where chance conversations, personal reunions, and discovery were the point. In fact, this was one of the few events where attendees – like us – didn’t have their heads buried in their computers. We’re speaking of Cyon Research’s COFES 2012 design engineering software conference, where we had the opportunity to explore the synergy of Big Data and the Product Lifecycle, why ALM and PLM systems can’t play nice, and how to keep a handle on finding the right data as product development adopts a 24/7, follow-the-sun strategy. It wasn’t an event of sessions in the conventional sense, but one of hallways, where you spent most of your time in chance, impromptu meetings. And it was a great chance to hook up with colleagues whom we hadn’t caught up with in years.

There were plenty of contrarian views, including a couple of keynotes in the conventional sense that each took a different shot at the issue of risk. Retired Ford product lifecycle management director Richard Riff took aim at conventional wisdom when it comes to product testing. After years of ingrained lean, six sigma, and zero-defects practices – not to mention Ford’s old slogan that quality is job one – Riff countered with a provocative notion: sometimes the risk of not testing is the better path. It comes down to balancing the cost of defects against the cost of testing, the likely incidence of defects, and the reliability of testing. While we couldn’t repeat the math, in essence it amounted to a lifecycle cost approach to testing. He claimed that the method even accounted for intangible factors, such as social media buzz or loss of reputation, referring to recent, highly publicized quality issues at some of Ford’s rivals.
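
While we can’t reproduce Riff’s actual formula, the flavor of the argument is an expected-cost comparison along the following lines. The numbers and the simple model below are our own illustration, not his:

```java
public class TestingCostSketch {
    public static void main(String[] args) {
        // Illustrative inputs only -- not Riff's figures or his actual model.
        double defectProbability = 0.02;   // likely incidence of a defect reaching the field
        double defectCost = 500000;        // cost of a field defect (warranty, recall, reputation)
        double testCost = 15000;           // cost of running the test program
        double testReliability = 0.9;      // chance the test catches the defect if it is present

        // Expected lifecycle cost with and without the test program.
        double costWithoutTesting = defectProbability * defectCost;
        double costWithTesting = testCost + defectProbability * (1 - testReliability) * defectCost;

        System.out.printf("Expected cost without testing: $%,.0f%n", costWithoutTesting);
        System.out.printf("Expected cost with testing:    $%,.0f%n", costWithTesting);
        // With these inputs, skipping the test is actually cheaper -- Riff's point.
        // Raise defectProbability or defectCost and the balance flips back toward testing.
    }
}
```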

Xerox PARC computing legend Alan Kay made the case for reducing risk through a strategy that applied a combination of object-oriented design (of which he was one of the pioneers – along with the GUI, of course) and what sounded to us like domain-specific languages. More specifically, the software describes the function, and other programs then automatically generate the code to execute it. Kay decried the instability that we have come to accept in software design – which reminded us that since the mainframe days, we have become all too accustomed to hearing that the server is down. Showing examples of ancient Roman engineering (e.g., a 2,000-year-old bridge in Spain that still carries cars today and looks well intact), he insisted that engineers can do better.

Some credit goes to host Brad Holtz, who deciphered that there really was a link between our diverging interests: Big Data and meshing software development with the product lifecycle. By the definition of Big Data – volume, variety, velocity, and value – Big Data is nothing new to the product lifecycle. CAD files, models, and simulations are extremely data-intensive and contain a variety of data types, encompassing graphical and alphanumeric data. Today, the brass ring for the modeling and simulation world is implementing co-simulations, where models drive one another (the results of one feed the inputs of the other).
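
To make the co-simulation idea concrete, here is a toy sketch of the coupling pattern, where each model’s output becomes the other model’s input on every time step. The two models and their coupling are invented purely for illustration; real co-simulation frameworks add time synchronization, convergence checks, and standardized model interfaces:

```java
public class CoSimulationSketch {

    // Crude thermal model: temperature responds to the applied load.
    static double thermalStep(double temperature, double load, double dt) {
        return temperature + dt * (0.5 * load - 0.1 * (temperature - 20.0));
    }

    // Crude controller model: backs off the load as temperature rises.
    static double controlStep(double temperature) {
        return Math.max(0.0, 100.0 - 1.5 * (temperature - 20.0));
    }

    public static void main(String[] args) {
        double temperature = 20.0;   // degrees C
        double load = 100.0;         // percent of rated load
        double dt = 0.1;             // time step in seconds

        for (int step = 0; step < 50; step++) {
            // The result of one model drives the other, and vice versa, each step.
            temperature = thermalStep(temperature, load, dt);
            load = controlStep(temperature);
        }
        System.out.printf("Settled at %.1f C and %.1f%% load%n", temperature, load);
    }
}
```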

But is anybody looking at the bigger picture? Modeling has traditionally been siloed – for instance, models are not typically shared across product teams, projects, or product families. Yet new technologies could provide the economical storage and processing power to analyze and compare the utilization and reliability of different models across different scenarios – with the possible result being metamodels that provide frameworks for optimizing model development and parameters for specific scenarios. All of this is highly data-intensive.

What about the operational portion of the product lifecycle? Today, it’s rare for products not to have intelligence baked into controllers. Privacy issues aside (they must be dealt with), machinery connected to networks can feed back performance data; vehicles can yield data while in the repair shop or, thanks to mobile devices, provide operational data while in motion. Add to that reams of publicly available data from services such as NOAA or the FAA, and now there is context around the performance data (did bad weather cause performance to drop?). Such data could feed processes ranging from MRO (maintenance, repair, and operations) and warranty to feedback loops that validate product tests and simulation models.
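
As a toy illustration of what “context around the performance data” could look like in practice, the sketch below joins daily machine output against a severe-weather flag by date and calls out unexplained dips. The data, field names, and thresholds are all hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WeatherContextSketch {
    public static void main(String[] args) {
        // Daily units produced by a machine (hypothetical telemetry feed).
        Map<String, Double> dailyOutput = new LinkedHashMap<>();
        dailyOutput.put("2012-04-02", 980.0);
        dailyOutput.put("2012-04-03", 610.0);
        dailyOutput.put("2012-04-04", 1005.0);

        // Severe-weather flag by date (hypothetical, e.g., derived from a NOAA feed).
        Map<String, Boolean> severeWeather = new LinkedHashMap<>();
        severeWeather.put("2012-04-02", false);
        severeWeather.put("2012-04-03", true);
        severeWeather.put("2012-04-04", false);

        double baseline = 1000.0;   // expected daily output
        for (Map.Entry<String, Double> day : dailyOutput.entrySet()) {
            boolean dip = day.getValue() < 0.8 * baseline;
            Boolean storm = severeWeather.get(day.getKey());
            if (dip) {
                System.out.println(day.getKey() + ": output dip"
                        + (Boolean.TRUE.equals(storm)
                           ? " coincides with severe weather"
                           : " with no weather explanation on record"));
            }
        }
    }
}
```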

Let’s take another angle – harvesting publicly available data for the business. For instance, businesses could use disaster preparedness models to inform their scenario planning, as described in this brief video snippet from last year’s COFES conference. Emerging organizations, such as the Center for Understanding Change, aim to make this a reality by making available models and expertise developed with tax dollars in the national laboratory system.

Big Data and connectivity can also be used to overcome gaps in locating expertise and speed product development. Adapting techniques from the open source software world, where software is developed collaboratively by voluntary groups of experts in the field, crowdsourcing is invading design and data science (we especially enjoyed our conversation with Kaggle’s Jeremy Howard).

A personal note on the sessions – the conference marked a reunion with folks with whom we have crossed paths over more than 20 years. Our focus on application development led us to engineered systems, an area of white space between software engineering and classic product engineering disciplines. And as noted above, that in turn brought us full circle to our roots covering the emergence of CAD/CAM in the 80s, as we had the chance to reconnect with many who continue to advance the engineering discipline. What a long, fun trip it’s been.

04.12.12

SAP and databases no longer an oxymoron

Posted in Big Data, Business Intelligence, Data Management, Database, Fast Data at 12:44 am by Tony Baer

In its rise to leadership of the ERP market, SAP shrewdly placed bounds around its strategy: it would stick to its knitting on applications and rely on partnerships with systems integrators to get critical mass implementation across the Global 2000. When it came to architecture, SAP left no doubt of its ambitions to own the application tier, while leaving the data tier to the kindness of strangers (or in Oracle’s case, the estranged).

Times change in more ways than one – and one of those ways is in the data tier. SAP’s headline acquisition of Sybase (primarily for its mobile assets) and the subsequent emergence of HANA, its new in-memory data platform, placed SAP squarely in the database market. And so it was that at an analyst meeting last December, SAP made the audacious declaration that it wanted to become the #2 database player by 2015.

Of course, none of this occurs in a vacuum. SAP’s declaration that it will become a front-line player in the database market threatens to destabilize existing relationships with Microsoft and IBM, as longtime SAP observer Dennis Howlett commented in a ZDNet post. OK, sure, SAP is sick of leaving money on the table to Oracle, and it’s throwing in roughly $500 million in sweeteners to get prospects to migrate. But if the database is the thing, then to meet its stretch goals, says Howlett, SAP and Sybase would have to grow that part of the business by a cool 6x – 7x.

But SAP would be treading down a ridiculous path if it were just trying to become a big player in the database market for the heck of it. Fortuitously, during SAP’s press conference announcing its new mobile and database strategies, chief architect Vishal Sikka tamped down the #2 aspirations: that’s really not the point – it’s the apps that count, and increasingly, it’s the database that makes the apps. Once again.

Back to our main point: IT innovation goes in waves. During the emergence of client/server, innovation focused on the database, where the need was mastering SQL and relational table structures; during the latter stages of client/server and the subsequent waves of Web 1.0 and 2.0, activity shifted to the app tier, which grew more distributed. With the emergence of Big Data and Fast Data, energy has shifted back to the data tier, given the efficiencies of processing data – big or fast – inside the data store itself. Not surprisingly, when you hear SAP speak about HANA, they describe an ability to tackle more complex analytic problems or compound operational transactions. It’s no coincidence that SAP now states that it’s in the database business.

So how will SAP execute its new database strategy? Given the hype over HANA, how does SAP convince Sybase ASE, IQ, and SQL Anywhere customers that they’re not headed down a dead-end street?

That was the point of the SAP announcements, which in the press release laid out the near-term roadmap but shed little light on how SAP would get there. Specifically, the announcements were:
• SAP HANA on BW is now going GA, and at the low (SMB) end it comes with aggressive pricing: roughly $3,000 for SAP Business One on HANA and $40,000 for HANA Edge.
• Ending a 15-year saga, SAP will finally port its ERP applications to Sybase ASE, with a tentative target date of year end. HANA will play a supporting role as the real-time reporting adjunct platform for ASE customers.
• Sybase SQL Anywhere would be positioned as the mobile front end database atop HANA, supporting real-time mobile applications.
• Sybase’s event stream (CEP) offerings would have optional integration with HANA, providing convergence between CEP and BI, where rules strip out key event data for persistence in HANA. In so doing, analysis of event streams could be integrated with, or directly correlated against, historical data (see the sketch after this list).
• Integrations are underway between HANA and IQ with Hadoop.
• Sybase is extending its PowerDesigner data modeling tools to address each of its database engines.
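
As a generic illustration of that CEP-to-database pattern – not SAP’s or Sybase’s actual APIs – the sketch below applies a simple rule to an in-memory event stream and persists only the flagged events for later correlation with historical data. The event fields, the rule, and the JDBC URL are all hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class EventFilterSketch {

    static class TradeEvent {
        final String symbol;
        final double price;
        final long volume;
        TradeEvent(String symbol, double price, long volume) {
            this.symbol = symbol; this.price = price; this.volume = volume;
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a live event stream.
        TradeEvent[] stream = {
            new TradeEvent("XYZ", 10.2, 500),
            new TradeEvent("XYZ", 10.9, 250000),   // unusually large trade
            new TradeEvent("ABC", 44.1, 800)
        };

        // Hypothetical JDBC URL and table; substitute whatever analytic store applies.
        try (Connection conn = DriverManager.getConnection("jdbc:somedb://host/analytics");
             PreparedStatement insert = conn.prepareStatement(
                     "INSERT INTO flagged_events (symbol, price, volume) VALUES (?, ?, ?)")) {
            for (TradeEvent e : stream) {
                // The "rule": persist only high-volume events for historical correlation.
                if (e.volume > 100000) {
                    insert.setString(1, e.symbol);
                    insert.setDouble(2, e.price);
                    insert.setLong(3, e.volume);
                    insert.executeUpdate();
                }
            }
        }
    }
}
```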

Most of the announcements, like HANA going GA or Sybase ASE supporting SAP Business Suite, were hardly surprises. Aside from the go-to-market issues, which are many and significant, we’ll direct our focus to the technology roadmaps.

We’ve maintained that if SAP were serious about its database goals, it would have to do three basic things:
1. Unify its database organization. The good news is that it has started down that path as of January 1 of this year. Of course, org charts are only the first step as ultimately it comes down to people.
2. Branding. Although long eclipsed in the database market, Sybase still has an identifiable brand and would be the logical choice; for now SAP has punted.
3. Cross-fertilize technology. Here, SAP can learn lessons from IBM, which, despite (or because of) acquiring multiple products that fall under different brands, freely blends technologies. For instance, Cognos BI reporting capabilities are embedded into Rational and Tivoli reporting tools.

The third part is the heavy lift. For instance, given that data platforms increasingly employ advanced caching, it would at first glance seem logical to blend some of HANA’s in-memory capabilities into the ASE platform; architecturally, however, that would be extremely difficult, as one of HANA’s strengths – dynamic indexing – does not translate readily to ASE.

On the other hand, given that HANA can index or restructure data on the fly (e.g., organize data into columnar structures on demand), the question is: does that make IQ obsolete? The short answer is that while memory keeps getting cheaper, it will never be as cheap as disk, and therefore IQ could evolve into near-line storage for HANA. Of course, that raises the question of whether Hadoop could eventually perform the same function. SAP maintains that Hadoop is too slow and should therefore be reserved for offline cases; that’s certainly true today, but given developments with HBase, it could easily become fast and cheap enough for SAP to revisit the IQ question a year or two down the road.

Not that SAP Sybase is sitting still on Hadoop integration. It is adding MapReduce and R capabilities to IQ (SAP Sybase is hardly alone here, as most Advanced SQL platforms offer similar support). SAP Sybase is also providing the capability to map IQ tables into Hadoop Hive, slotting IQ in as an alternative to HBase; in effect, that’s akin to a number of strategies for putting SQL layers inside Hadoop (in a way, similar to what the lesser-known Hadapt is doing). And of course, like most of the relational players, SAP Sybase also supports bulk ETL/ELT loads from HDFS into HANA or IQ.
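
For readers who haven’t seen it, the bulk HDFS-to-relational load pattern is straightforward plumbing: read files out of HDFS and batch-insert them into the target database over JDBC (or a vendor bulk loader). The sketch below is generic and hypothetical – the JDBC URL, table, and file layout are made up, and it is not SAP Sybase’s actual tooling:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBulkLoadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml from the classpath
        FileSystem hdfs = FileSystem.get(conf);

        try (Connection conn = DriverManager.getConnection("jdbc:somedb://host/warehouse");
             PreparedStatement insert = conn.prepareStatement(
                     "INSERT INTO web_logs (ts, url, status) VALUES (?, ?, ?)");
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(hdfs.open(new Path("/data/web_logs/part-00000"))))) {

            String line;
            int batched = 0;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");   // assume tab-delimited log lines
                insert.setString(1, fields[0]);
                insert.setString(2, fields[1]);
                insert.setInt(3, Integer.parseInt(fields[2]));
                insert.addBatch();
                if (++batched % 1000 == 0) insert.executeBatch();   // load in batches
            }
            insert.executeBatch();   // flush the final partial batch
        }
    }
}
```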

On SAP’s side for now is the paucity of Hadoop talent, so pitching IQ as an alternative to HBase may help soften the blow for organizations seeking to get a handle on Hadoop. But in the long run, we believe that SAP Sybase will have to revisit this strategy, because if it’s serious about the database market, it will have to amplify its focus to add value atop the new realities on the ground.