Category Archives: Application Development

Hadoop and Spark: A Tale of two Cities

If it seems like we’ve been down this path before, well, maybe we have. June has been a month of juxtapositions, back and forth to the west coast for Hadoop and Spark Summits. The mood from last week to this has been quite contrasting. Spark Summit has the kind of canned heat that Hadoop conferences had a couple years back. We won’t stretch the Dickens metaphor.

Yeah, it’s human nature to say, down with the old and in with the new.

But let’s set something straight: Spark ain’t going to replace Hadoop, as we’re talking about apples and oranges. Spark can run on Hadoop, and it can run on other data platforms. What it might replace is MapReduce, if Spark can overcome its scaling hurdles. And it could fulfil IBM’s vision as the next analytic operating system if it addresses mundane – but very important concerns – for supporting scaling, high concurrency, and bulletproof security. Spark originated at UC Berkeley’s AMPLab back in 2009, with the founders forming Databricks. With roughly 700 committers contributors, Spark has ballooned to becoming the most active open source project in the apache community, barely 2 years after becoming an Apache project.

Spark is best known as a sort of in-memory analytics replacement for iterative computation frameworks like MapReduce; both employ massively parallel compute and then shuffle interim results, with the difference being that Spark caches in memory while MapReduce writes to disk. But that’s just the tip of the iceberg. Spark offers a simpler programming model, better fault tolerance, and it’s far more extensible than MapReduce. Spark is any form of iterative computation, and it was designed to support specific extensions; among the most popular are machine learning, microbatch stream processing, graph computing, and even SQL.

By contrast, Hadoop is a data platform. It is one of many that can run Spark, because Spark is platform-independent. So you could also run Spark on Cassandra, other NoSQL data store, or SQL databases, but Hadoop has been the most popular target right now.

And, not to forget Apache Mesos, another AMPLab discovery for cluster management to which Spark had originally been closely associated.

There’s little question about the excitement level over Spark. By now the headlines have poured out over IBM investing $300 million, committing 3500 developers, establishing a Spark open source development center a few BART stops from AMPLab in San Francisco, and aiming directly and through partners to educate 1 million professionals on Spark in the next few years (or about 4 – 5x the current number registered for IBM’s online Big Data university). IBM views Spark’s strength as machine learning, and wants to make machine learning a declarative programming experience that will fellow in SQL’s footsteps with its new SystemML language (which it plans to open source).

That’s not to overshadow Databricks’ announcement that its Spark developer cloud, in preview over the past year, has now gone GA. The big challenge facing Databricks was making its cloud scalable and sufficiently elastic to meet the demand – and not become a victim of its own success. And there is the growing number of vendors that are embedding Spark within their analytic tools, streaming products, and development tools. The release announcement of Spark 1.4 brings new manageability and capability for automatically renewing Kerberos tokens for long running processes like streaming. But there remain growing pains, like reducing the number of moving parts needed to make Spark a first class citizen with Hadoop YARN.

By contrast, last week was about Hadoop becoming more manageable and more amenable to enterprise infrastructure, like shared storage as our colleague Merv Adrian pointed out. Not to mention enduring adolescent factional turf wars.

It’s easy to get excited by the idealism around the shiny new thing. While the sky seems the limit, the reality is that there’s lots of blocking and tackling ahead. And the need for engaging, not only developers, but business stakeholders through applications, rather than development tools, and success stories with tangible results. It’s a stage that the Hadoop community is just starting to embrace now.

Hadoop: The Third Way

Working with Hadoop has been kind of a throwback. Until recently, Hadoop was synonymous with MapReduce programming, meaning that when you worked with Hadoop, it seemed that you were working with a newfangled mainframe. As if client/server never happened.

With emergence and heavy competition between the various interactive SQL frameworks (e.g., Impala, Tez, Presto, Drill, BigSQL, Big Data SQL, QueryGrid, Spark SQL), a second path emerged for database developers. So the Hadoop mainframe became a client/server machine. As if n-tier never happened.

The need for speed made n-tier happen – due to the need to bypass the bottleneck of database I/O and the overhead of large, monolithic applications. And so the application server platform was born, and with it, ways to abstract functions such as integration, security, transaction management so they could operate as modular piece parts with whatever application or database. Or to prevent abandoned online shopping carts, so a transaction can be executed without being held hostage to ensuring ACID compliance. Internet-based applications were now being developed on WebSphere, WebLogic, JBoss, and more recently, more compact open source alternatives like Apache Tomcat.

But with Hadoop, we’re still in the era of mainframe or client/server. But with the 2.x generation, where resource management has been taken out of MapReduce, the way has been cleared to make Hadoop more of a multi-purpose platform. While interactive SQL was the first shot, new frameworks supporting streaming (Storm, Spark Streaming), machine learning (Spark), and search (Solr) are among some of the new additions to the palette.

But at this point, we’re still looking at Hadoop as either a mainframe or two-tier system. Developers write MapReduce or Spark programs, or BI/query tools access HDFS with or without Hive. There’s nothing available to write data-driven programs, such as real-time user scoring or intrusion detection.

Nearly four years ago, a startup with a weird name – Continuuity – emerged to become in its own terms “the JBoss for Hadoop.” The goal was building a data fabric that abstracted the low-level APIs to HDFS, MapReduce, Hive, and other Hadoop components to clear the way for developers to write, not just MapReduce programs or run BI tools, but write API-driven programs that could connect to Hadoop. Just as a generation ago, application servers abstracted data and programs so they could flexibly connect with each other. Its first project was a data ingestion platform written on Storm that would be easier to work with than existing Hadoop projects such as Flume.

Continuuity’s problem was that the company was founded too early. During a period where Hadoop was exclusively a batch processing platform, there was little clamor for developers to write data-driven applications. But as new frameworks transform Hadoop into a platform that can deliver experiences closer to real-time, demand should emerge among developers to write, not just programs, but applications that can run against Hadoop (or other platforms).

In the interim, Continuuity changed its name to Cask, and changed its business model to become an open source company. It has diversified its streaming engine to work with other frameworks besides Storm to more readily persist data. And the 40-person company which was founded a few blocks away from Cloudera’s original headquarters, next to Fry’s Electronics in Palo Alto, has just drawn a modest investment from Cloudera to further develop its middleware platform.

Admittedly, Cask’s website really doesn’t make a good case (the home page gives you as 404 error), providing an application platform for Hadoop opens up possibilities sonly limited by the imagination. For instance, it could make possible event-driven programs for performing data validation or detecting changes in customer interactions, and so on.

For Cloudera, Cask is a low-risk proposition for developing that long-missing third path to Hadoop to further its transformation to a multi-purpose platform.

Hadoop and Replication

Conventional wisdom is that once Big Data is at rest, don’t move it or shake it. Akin to “don’t fold, spindle, or mutilate.” But seriously, if mainstream enterprises adopt Hadoop, they will expect it to become more robust. And so you start looking at things like data replication, or at least replication of the NameNode or other components that govern how and where data resides in Hadoop and how operations are performed against.

So here’s an interesting one to watch: Wandisco buying Altostore. They are applying replication technol developed for Subversion to Hadoop. We’re gonna check this one out

It’s happening: Hadoop and SQL worlds are converging

With Strata, IBM IOD, and Teradata Partners conferences all occurring this week, it’s not surprising that this is a big week for Hadoop-related announcements. The common thread of announcements is essentially, “We know that Hadoop is not known for performance, but we’re getting better at it, and we’re going to make it look more like SQL.” In essence, Hadoop and SQL worlds are converging, and you’re going to be able to perform interactive BI analytics on it.

The opportunity and challenge of Big Data from new platforms such as Hadoop is that it opens a new range of analytics. On one hand, Big Data analytics have updated and revived programmatic access to data, which happened to be the norm prior to the advent of SQL. There are plenty of scenarios where taking programmatic approaches are far more efficient, such as dealing with time series data or graph analysis to map many-to-many relationships. It also leverages in-memory data grids such as Oracle Coherence, IBM WebSphere eXtreme Scale, GigaSpaces and others, and, where programmatic development (usually in Java) proved more efficient for accessing highly changeable data for web applications where traditional paths to the database would have been I/O-constrained. Conversely Advanced SQL platforms such as Greenplum and Teradata Aster have provided support for MapReduce-like programming because, even with structured data, sometimes using a Java programmatic framework is a more efficient way to rapidly slice through volumes of data.

Until now, Hadoop has not until now been for the SQL-minded. The initial path was, find someone to do data exploration inside Hadoop, but once you’re ready to do repeatable analysis, ETL (or ELT) it into a SQL data warehouse. That’s been the pattern with Oracle Big Data Appliance (use Oracle loader and data integration tools), and most Advanced SQL platforms; most data integration tools provide Hadoop connectors that spawn their own MapReduce programs to ferry data out of Hadoop. Some integration tool providers, like Informatica, offer tools to automate parsing of Hadoop data. Teradata Aster and Hortonworks have been talking up the potentials of HCatalog, actuality an enhanced version of Hive with RESTful interfaces, cost optimizers, and so on, to provide a more SQL friendly view of data residing inside Hadoop.

But when you talk analytics, you can’t simply write off the legions of SQL developers that populate enterprise IT shops. And beneath the veneer of chaos, there is an implicit order to most so-called “unstructured” data that is within the reach programmatic transformation approaches that in the long run could likely be automated or packaged inside a tool.

At Ovum, we have long believed that for Big Data to crossover to the mainstream enterprise, that it must become a first-class citizen with IT and the data center. The early pattern of skunk works projects, led by elite, highly specialized teams of software engineers from Internet firms to solve Internet-style problems (e.g., ad placement, search optimization, customer online experience, etc.) are not the problems of mainstream enterprises. And neither is the model of recruiting high-priced talent to work exclusively on Hadoop sustainable for most organizations; such staffing models are not sustainable for mainstream enterprises. It means that Big Data must be consumable by the mainstream of SQL developers.

Making Hadoop more SQL-like is hardly new
Hive and Pig became Apache Hadoop projects because of the need for SQL-like metadata management and data transformation languages, respectively; HBase emerged because of the need for a table store to provide a more interactive face – although as a very sparse, rudimentary column store, does not provide the efficiency of an optimized SQL database (or the extreme performance of some columnar variants). Sqoop in turn provides a way to pipeline SQL data into Hadoop, a use case that will grow more common as organizations look to Hadoop to provide scalable and cheaper storage than commercial SQL. While these Hadoop subprojects that did not exactly make Hadoop look like SQL, they provided building blocks from which many of this week’s announcements leverage.

Progress marches on
One train of thought is that if Hadoop can look more like a SQL database, more operations could be performed inside Hadoop. That’s the theme behind Informatica’s long-awaited enhancement of its PowerCenter transformation tool to work natively inside Hadoop. Until now, PowerCenter could extract data from Hadoop, but the extracts would have to be moved to a staging server where the transformation would be performed for loading to the familiar SQL data warehouse target. The new offering, PowerCenter Big Data Edition, now supports an ELT pattern that uses the power of MapReduce processes inside Hadoop to perform transformations. The significance is that PowerCenter users now have a choice: load the transformed data to HBase, or continue loading to SQL.

There is growing support for packaging Hadoop inside a common hardware appliance with Advanced SQL. EMC Greenplum was the first out of gate with DCA (Data Computing Appliance) that bundles its own distribution of Apache Hadoop (not to be confused with Greenplum MR, a software only product that is accompanied by a MapR Hadoop distro). Teradata Aster has just joined the fray with Big Analytics Appliance, bundling the Hortonworks Data Platform Hadoop; this move was hardly surprising given their growing partnership around HCatalog, an enhancement of the SQL-like Hive metadata layer of Hadoop that adds features such as a cost optimizer and RESTful interfaces that make the metadata accessible without the need to learn MapReduce or Java. With HCatalog, data inside Hadoop looks like another Aster data table.

Not coincidentally, there is a growing array of analytic tools that are designed to execute natively inside Hadoop. For now they are from emerging players like Datameer (providing a spreadsheet-like metaphor; which just announced an app store-like marketplace for developers), Karmasphere (providing an application develop tool for Hadoop analytic apps), or a more recent entry, Platfora (which caches subsets of Hadoop data in memory with an optimized, high performance fractal index).

Yet, even with Hadoop analytic tooling, there will still be a desire to disguise Hadoop as a SQL data store, and not just for data mapping purposes. Hadapt has been promoting a variant where it squeezes SQL tables inside HDFS file structures – not exactly a no-brainer as it must shoehorn tables into a file system with arbitrary data block sizes. Hadapt’s approach sounds like the converse of object-relational stores, but in this case, it is dealing with a physical rather than a logical impedance mismatch.

Hadapt promotes the ability to query Hadoop directly using SQL. Now, so does Cloudera. It has just announced Impala, a SQL-based alternative to MapReduce for querying the SQL-like Hive metadata store, supporting most but not all forms of SQL processing (based on SQL 92; Impala lacks triggers, which Cloudera deems low priority). Both Impala and MapReduce rely on parallel processing, but that’s where the similarity ends. MapReduce is a blunt instrument, requiring Java or other programming languages; it splits a job into multiple, concurrently, pipelined tasks where, at each step along the way, reads data, processes it, and writes it back to disk and then passes it to the next task. Conversely, Impala takes a shared nothing, MPP approach to processing SQL jobs against Hive; using HDFS, Cloudera claims roughly 4x performance against MapReduce; if the data is in HBase, Cloudera claims performance multiples up to a factor of 30. For now, Impala only supports row-based views, but with columnar (on Cloudera’s roadmap), performance could double. Cloudera plans to release a real-time query (RTQ) offering that, in effect, is a commercially supported version of Impala.

By contrast, Teradata Aster and Hortonworks promote a SQL MapReduce approach that leverages HCatalog, an incubating Apache project that is a superset of Hive that Cloudera does not currently include in its roadmap. For now, Cloudera claims bragging rights for performance with Impala; over time, Teradata Aster will promote the manageability of its single appliance, and with the appliance has the opportunity to counter with hardware optimization.

The road to SQL/programmatic convergence
Either way – and this is of interest only to purists – any SQL extension to Hadoop will be outside the Hadoop project. But again, that’s an argument for purists. What’s more important to enterprises is getting the right tool for the job – whether it is the flexibility of SQL or raw power of programmatic approaches.

SQL convergence is the next major battleground for Hadoop. Cloudera is for now shunning HCatalog, an approach backed by Hortonworks and partner Teradata Aster. The open question is whether Hortonworks can instigate a stampede of third parties to overcome Cloudera’s resistance. It appears that beyond Hive, the SQL face of Hadoop will become a vendor-differentiated layer.

Part of conversion will involve a mix of cross-training and tooling automation. Savvy SQL developers will cross train to pick up some of the Java- or Java-like programmatic frameworks that will be emerging. Tooling will help lower the bar, reducing the degree of specialized skills necessary. And for programming frameworks, in the long run, MapReduce won’t be the only game in town. It will always be useful for large-scale jobs requiring brute force, parallel, sequential processing. But the emerging YARN framework, which deconstructs MapReduce to generalize the resource management function, will provide the management umbrella for ensuring that different frameworks don’t crash into one another by trying to grab the same resources. But YARN is not yet ready for primetime – for now it only supports the batch job pattern of MapReduce. And that means that YARN is not yet ready for Impala or vice versa.

Of course, mainstreaming Hadoop – and Big Data platforms in general – is more than just a matter of making it all look like SQL. Big Data platforms must be manageable and operable by the people who are already in IT; they will need some new skills and grow accustomed to some new practices (like exploratory analytics), but the new platforms must also look and act familiar enough. Not all announcements this week were about SQL; for instance, MapR is throwing a gauntlet to the Apache usual suspects by extending its management umbrella beyond the proprietary NFS-compatible file system that is its core IP to the MapReduce framework and HBase, making a similar promise of high performance. On the horizon, EMC Isilon and NetApp are proposing alternatives promising a more efficient file system but at the “cost” of separating the storage from the analytic processing. And at some point, the Hadoop vendor community will have to come to grips with capacity utilization issues, because in the mainstream enterprise world, no CFO will approve the purchase of large clusters or grids that get only 10 – 15% utilization. Keep an eye on VMware’s Project Serengeti.

They must be good citizens in data centers that need to maximize resource (e.g., virtualization, optimized storage); must comply with existing data stewardship policies and practices; and must fully support existing enterprise data and platform security practices. These are all topics for another day.

Being Agile about Agile Development

This guest post comes from Ovum colleague Michael Azoff.

Agile practices have been around for over twenty years. The Agile Manifesto was written a decade after ‘agile’ first emerged (under different names of course, Agile was first coined at the 2001 manifesto meeting). There are also plenty of proof points around what works in agile and when to apply it. If you are still asking for agile to prove itself then you are missing where software development has progressed to.
Going back to Waterfall is not an option because it has inherent faults and those faults are visible all around in many failed IT projects. Ultimately, if waterfall is not broken for you then don’t fix it. But you should consider alternatives to waterfall if your software development processes or organization have become dysfunctional; over time, you might find difficulty in recruiting developers for legacy processes, but that’s another issue.
Ken Schwaber a co-originator of Scrum has said that only 25% of Scrum deployments succeed. The question then is what happens to the other 75% of failures. The problem can be examined at three levels of maturity: intra-team agility, extra-team agility, and business agility.
Teams may not be perfectly pure about their agile adoption, and we can get into discussions as Jeff Sutherland has with Scrum But scenarios (i.e. Scrum, but without some Scrum practices). But at some point there reaches a point where the team’s partial adoption of Scrum leads to failure. It could also be that cultural impediments prevent certain agile practices to take root: a highly hierarchical organization will be antithetical to the practice of self-organizing agile teams, for example.
The interface between the business and an agile team can harbor impediments. For example processes on the business side may have originally evolved around supporting waterfall processes and constrain a team that has transitioned to agile. In this scenario failure of agile is now a problem that spans beyond intra-team agile adoption and across the business-IT interface.
The biggest challenge and opportunity is with the organization as a whole: Can the business transform its agility? Can the business become agile and thereby make the agile IT department an integral part of the business, rather than a department in the basement that no executive visits? Today, many major businesses are essentially IT businesses and divorcing the IT team from the business becomes a serious handicap – witness successful businesses in technology, financial services, retail and more, where IT and the business are integral and are agile about it.
There is no magic recipe for agile adoption and it is seen in practice that the most successful agile transformation is one where the team goes through a learning process of self-discovery. Introducing agile practices, using trial and error, learning through experience, seeing what works and what does not, allows the team to evolve its agility and fit it to the constraints of the organization culture.
Organizations need support, training, and coaching in their agile transformation, but the need for business agility is greater the larger the scale of the IT project. Large scale agile projects can be swamped by business waterfall processes that impede their agility at levels above core software development. Interestingly there are cases where agility at the higher levels are introduced and succeed, while intra-team processes remain waterfall. There is no simple ‘right’ way to adopt agile. It all depends on the individual cases, but as long as we are agile about agile adoption, then we can avoid agile failure, or at least improve on what went before. Failure in adopting agile is not about giving up on agile, but re-thinking the problem and seeing what can be improved, incrementally.

Searching for Data Scientists as a Service

It’s no secret that rocket .. err … data scientists are in short supply. The explosion of data and the corresponding explosion of tools, and the knock-on impacts of Moore’s and Metcalfe’s laws, is that there is more data, more connections, and more technology to process it than ever. At last year’s Hadoop World, there was a feeding frenzy for data scientists, which only barely dwarfed demand for the more technically oriented data architects. in English, that means:

1. Potential MacArthur Grant recipients who have a passion and insight for data, the mathematical and statistical prowess for ginning up the algorithms, and the artistry for painting the picture that all that data leads to. That’s what we mean by data scientists.
2. People who understand the platform side of Big Data, a.k.a., data architect or data engineer.

The data architect side will be the more straightforward nut to crack. Understanding big data platforms (Hadoop, MongoDB, Riak) and emerging Advanced SQL offerings (Exadata, Netezza, Greenplum, Vertica, and a bunch of recent upstarts like Calpont) is a technical skill that can be taught with well-defined courses. The laws of supply and demand will solve this one – just as they did when the dot com bubble created demand for Java programmers back in 1999.

Behind all the noise for Hadoop programmers, there’s a similar, but quieter desperate rush to recruit data scientists. While some data scientists call data scientist a buzzword, the need is real.

However, data science will be a tougher number to crack. It’s all about connecting the dots, not as easy as it sounds. The V’s of big data – volume, variety, velocity, and value — require someone who discovers insights from data; traditionally, that role was performed by the data miner. But data miners dealt with better-bounded problems and well-bounded (and known) data sets that made the problem more 2-dimensional. The variety of Big Data – in form and in sources – introduces an element of the unknown. Deciphering Big Data requires a mix of investigative savvy, communications skills, creativity/artistry, and the ability to think counter-intuitively. And don’t forget it all comes atop a foundation of a solid statistical and machine learning background plus technical knowledge of the tools and programming languages of the trade.

Sometimes it seems like we’re looking for Albert Einstein or somebody smarter.

As nature abhors a vacuum, there’s also a rush to not only define what a data scientist is, but develop programs that could somehow teach it, software packages that to some extent package it, and otherwise throw them into a meat … err, the free market. EMC and other vendors are stepping up to the plate to offer training, not just on platforms, but for data science. Kaggle offers an innovative cloud-based, crowdsourced approach to data science, making available a predictive modeling platform and then staging sponsored 24-hour competitions for moonlighting data scientists to devise the best solutions to particular problems (redolent of the Netflix $1 million prize to devise a smarter algorithm for predicting viewer preferences).

With data science talent scarce, we’d expect that consulting firms would buy up talent that could then be “rented’ to multiple clients. Excluding a few offshore firms, few SIs have yet stepped up to the plate to roll out formal big data practices (the logical place where data scientists would reside), but we expect that to change soon.

Opera Solutions, which has been in the game of predictive analytics consulting since 2004, is taking the next step down the packaging route. having raised $84 million in Series A funding last year, the company has staffed up to nearly 200 data scientists, making it one of the largest assemblages of genius this side of Google. Opera’s predictive analytics solutions are designed for a variety of platforms, SQL and Hadoop, and today they join the SAP Sapphire announcement stream with a release of their offering on the HANA in-memory database. Andrew Brust provides a good drilldown on the details on this announcement.

From SAP’s standpoint, Opera’s predictive analytics solutions are a logical fit for HANA as they involve the kinds of complex problems (e.g., a computation triggers other computations) that their new in-memory database platform was designed for.

There’s too much value at stake to expect that Opera will remain the only large aggregation of data scientists for hire. But ironically, the barriers to entry will keep the competition narrow and highly concentrated. Of course, with market demand, there will inevitably be a watering down of the definition of data scientists so that more companies can claim they’ve got one… or many.

The laws of supply and demand will kick in for data scientists, but the ramp up of supply won’t be as quick as that for the more platform-oriented data architect or engineer. Of necessity, that supply of data scientists will have to be augmented by software that automates the interpretation of machine learning, but there’s only so far that you can program creativity and counter-intuitive insight into a machine.

Big Data and the Product Lifecycle

Our twitter feed went silent for a few days last week as we spent some time at a conference that where chance conversations, personal reunions, and discovery were the point. In fact, this was one of the few events where attendees – like us – didn’t have our heads down buried in our computers. We’re speaking of Cyon Research’s COFES 2012 design engineering software conference, where we had the opportunity to explore the synergy of Big Data and the Product Lifecycle, why ALM and PLM systems can’t play nice, and how to keep a handle on finding the right data as product development adopts a 24/7 follow-the-sun strategy. It wasn’t an event of sessions in the conventional sense, but lots of hallways where you spent most of your time in chance, impromptu meetings. And it was a great chance to hook up with colleagues whom we haven’t caught in years.

There were plenty of contrarian views. There were a couple of keynotes in the conventional sense that each took different shots at the issue of risk. Retired Ford product life cycle management director Richard Riff took aim at conventional wisdom when it comes to product testing. After years of ingrained lean, six sigma, and zero defects practices – not to mention Ford’s old slogan that quality is job one — Riff countered with a provocative notion: sometimes the risk of not testing is the better path. It comes down to balancing the cost of defects vs. the cost of testing, the likely incidence of defects, and the reliability of testing. While we couldn’t repeat the math, in essence, it amounted to a lifecycle cost approach for testing. He claimed that the method even accounted for intangible factors, such as social media buzz or loss of reputation, when referring g to recently highly publicized quality issues with some of Ford’s rivals.

Xerox PARC computing legend Alan Kay made the case for reducing risk through a strategy that applied a combination of object-oriented design (or which he was one of the pioneers – along with the GUI of course) and what sounded to us like domain-specific languages. Or more specifically, that software describes the function, then lets other programs automatically generate the programming to execute it. Kay decried the instability that we have come to accept with software design – which reminded us that since the mainframe days, we have become all too accustomed to hearing that the server is down. Showing some examples of ancient Roman design (e.g., a 2000-year old bridge in Spain that today still carries cars and looks well intact), he insists that engineers can do better.

Some credit to host Brad Holtz who deciphered that there really was a link between our diverging interests: Big Data and meshing software development with the product lifecycle. By the definition of Big Data – volume, variability, velocity, and value – Big Data is nothing new to the product lifecycle. CAD files, models, and simulations are extremely data-intensive and containing a variety of data types encompassing graphical and alphanumeric data. Today, the brass ring for the modeling and simulation world is implementing co-simulations, where models each drive other models (the results of one drives the other).

But is anybody looking at the bigger picture? Modeling has been traditionally silo’ed – for instance, models are not typically shared across product teams, projects, or product families. Yet new technologies could provide the economical storage and processing power to make it possible to analyze and compare the utilization and reliability of different models for different scenarios – with the possible result being metamodels that provide frameworks for optimizing model development and parameters with specific scenarios. All this is highly data-intensive.

What about the operational portion of the product lifecycle? Today, it’s rare for products not to have intelligence baked into controllers. Privacy issues aside (they must be dealt with), machinery connected to networks can feed back performance data; vehicles can yield data while in the repair shop, or thanks to mobile devices, provide operational data while in movement. Add to that reams of publicly available data from services such as NOAA or the FAA, and now there is context placed around performance data (did bad weather cause performance to drop?). Such data could feed processes, ranging from MRO (maintenance, repair, and operation) and warranty, to providing feedback loops that can validate product tests and simulation models.

Let’s take another angle – harvesting publicly available data for the business. For instance, businesses could use disaster preparedness models to help their scenario planning, as described in this brief video snippet from last years COFES conference. Emerging organizations, such as the Center for Understanding Change, aim to make this reality by making available models and expertise developed through tax dollars in the national laboratory system.

Big Data and connectivity can also be used to overcome gaps in locating expertise and speed product development. Adapting techniques from the open source software world, where software is developed collaboratively by voluntary groups of experts in the field, crowdsourcing is invading design and data science (we especially enjoyed our conversation with Kaggle’s Jeremy Howard).

A personal note on the sessions – the conference marked a reunion with folks whom we have crossed paths with in over 20 years. Our focus on application development lead us to engineered systems, an area of white space between software engineering and classic product engineering disciplines. And as noted above, that in turn bought us full circle to our roots covering the emergence of CADCAM in the 80s as we had the chance to reconnect many who continue to advance the engineering discipline. What a long, fun trip it’s been.

Who Owns the Product Lifecycle?

Turn on the ignition of your car, back out of the parking space and go into drive. As you engaged the transmission, gently tapped the accelerator and stepped on the brake, you didn’t directly interact with the powertrain. Instead, your actions were detected by sensors and executed by actuators on electronics control units that then got the car to shift, move, then stop.

Although in the end, Toyota’s recall issues from 2009-10 wound up isolating misadjusted accelerator controls, speculation around the recalls directed the spotlight to the prominent role of embedded software, prompting the realization that today when you operate your car, you are driving by wire.

Today’s automobiles are increasingly looking a lot more like consumer electronics products. They contain nearly as much software an iPhone, and in the future will contain even more. According to IDC, the market for embedded software that is designed into engineered products (like cars, refrigerators, airplanes, and consumer electronics) will double by 2015.

Automobiles are the tip of the iceberg where it comes to smart products; today most engineered products, from refrigerators to industrial machinery and aircraft all feature smart control. Adding intelligence allows designers to develop flexible control logic that brings more functionality to products and provides ways to optimize operation to gain savings in weight, bulk, and cost.

Look at the hybrid car: to function, the battery, powertrain, gas and electric engines, and braking systems must all interoperate to attain fuel economy. It takes software to determine when to let the electric engine run or let the battery recharge. The degree of interaction between components is greater compared to traditional electromechanical products designs. Features such as anti-lock braking or airbag deployment depend on the processing of data from multiple sources – wheel rotation, deceleration rate, steering, etc.

The growth of software content changes the ground rules for product development, which has traditionally been a very silo’ed process. There are well established disciplines in mechanical and electrical engineering, with each having their own sets of tools, not to mention claims to ownership of the product design. Yet with software playing the role as the “brains” of product operation, there is the need for engineering disciplines to work more interactively across silos rather than rely on systems engineers to crack the whip on executing the blueprint.

We were reminded of this after a rather enjoyable, freewheeling IEEE webcast that we had with IBM Rational’s Dominic Tavasolli last week.

Traditionally, product design fell under the mechanical engineering domain, which designed the envelope and specified the geometry, components, materials, physical properties (such as resistance to different forms of stress) and determined the clearance within which electronics could be shoehorned.

Drill down deeper and you’ll note that each engineering domain has its full lifecycle of tools. It’s analogous to enterprise software development organizations, where you’ll often stumble across well entrenched camps of Microsoft, Java, and web programmers. Within the lifecycle there is a proliferation of tools and languages to deal with the wide variety of engineering problems that must be addressed when developing a product. Unlike the application lifecycle, where you have specific tools that handle modeling or QA, on the engineering side there are multiple tools because there are many different ways to simulate a product’s behavior in the real world to perform the engineering equivalent of QA. You might want to test mechanical designs for wind shear, thermal deformation, or compressive stresses, and electrical ones for their ability to handle voltage and disperse heat from processing units.

Now widen out the picture. Engineering and manufacturing groups each have their own definitions of the product. It is expressed in the bill of materials (BOM): engineering has its own BOM, which details the design hierarchy, while the manufacturing BOM itemizes the inventory materials and the manufacturing processes needed to fabricate and assemble the product. That sets the stage for the question of who owns the product lifecycle management (PLM) process: the CADCAM vs. the ERP folks.

Into the mix between the different branches of engineering and the silos between engineering and manufacturing, now introduce the software engineers. They used to be an afterthought, yet today their programs are affecting, not only how product components and systems behave, but in many cases might impact the physical specifications. for instance, if you can design software to enable a motor to run more efficiently, the mechanical engineers can then design a smaller, lighter weight engine.

In the enterprise computing world, we’ve long gotten hung up on the silos that divide different parts of IT from itself – the developers vs. QA, DBAs, enterprise architects, systems operations – or IT from the business. However, the silos that plague enterprise IT are child’s play compared to the situation in product development where you have engineering groups pared off against each other, and against manufacturing.

OK, so the product lifecycle is a series of fiefdoms – why bother or care about making it more efficient? There is too much at stake in the success of a product: there are the constantly escalating pressures to squeeze time, defects, and cost out of the product lifecycle. That’s been the routine ever since the Japanese introduced American concepts of lean manufacturing back in the 1980s. But as automobiles and other complex engineered products adds more intelligence, the challenge is leveraging the rapid innovation of the software and consumer electronics industries for product sectors where, of necessity, lead times will stretch into one or more years.

There is no easy solution because there is no single solution. Each industry has different product characteristics that impact the length of the lifecycle and how product engineering teams are organized. Large, highly complex products such as automobiles, aircraft, or heavy machinery will have long lead times because of supply chain dependencies. At the other end of the scale, handheld consumer electronics or biomedical devices might not have heavy supply chain dependences. But, for instance, smart phones have short product lifespans and are heavily driven by the fats pace of innovation in processing power and software capabilities, meaning that product lifecycles must be quicker in order for new products to catch the market window. Biomedical devices on the other hand are often compact, but have significant regulatory hurdles to mount which impacts how the devices are tested.

The product lifecycle is a highly varied creature. The common thread is the need to more effectively integrate software engineering, which in turn is forcing the issue of integration and collaboration between other engineering disciplines. It is no longer sufficient to rely on systems engineers to get it together in the end – as manufacturers learned the hard way, it costs more to rework a design that doesn’t fit together, perform well, or be readily assembled with existing staff and facilities. The rapid evolution of software and processors also forces the issue on whether and where agile development processes can be coupled with linear or hierarchical development processes that are necessary for long-fuse products.

There is no single lifecycle process that will apply to all sectors, and no single set of tools that can perform every design and test function necessary to get from idea to product. Ultimately, the answer – as loose as it is – is that in larger product development organizations, work on the assumption that there are multiple sources of truth. The ALM and PLM worlds have at best worked warily at arms length from each other as there is a DMZ when it comes to requirements, change, and quality management. The reality is that no single constituency owns the product lifecycle – get used to federation that will proceed on rules of engagement that will remain industry- and organization-specific.

Ideally it would be great to integrate everything. Good luck. With the exception of frameworks that are proprietary for specific vendors, there is no associativity between tools that provides a process-level integration. The best that can be expected at this point is at the data exchange level.

It’s a start.

Google and Motorola: Quick Post Mortem

There’s been plenty of excellent commentary on Google’s $12.5 billion deal for Motorola Mobility Inc. (MMI) over the past few days, and we’re certainly not going to rehash covered ground.

Clearly this is a lot of money that was invested defensively. Money that could have gone into research or acquisitions that would have grown the business or opened new markets.


That thought hit us this morning after reading a NY Times piece on the bull market for patents. It reinforced our thoughts after word of the deal broke: that this was money spent for arming Google against patent predators in courts of law. In this case, it’s predators sensing blood to slow down or at least exact royalties from the Android platform juggernaut.

Of course much of the issue stems from the subjective nature of software patents; that’s a longstanding issue given that the iterative nature of software development. It is simply difficult if not impossible to prove that a software innovation does not base itself in some way on prior invention. Furthermore, the fact that software relies on other software to operate makes the notion of software patents even more dubious.

This doesn’t mean that software developers should get away plagiarism. Although discovery is still underway, the evidence continues to get more damning in the Oracle-Google case over Dalvik, the Android VM that on closer inspection looks like the JVM in sheep’s clothing. The irony is that when Google was still pulling its (J)VM clean room act, the company at the other end of the line was Sun. To us, this is a reflection of Google’s Not-Invented-Here mentality. Would it have killed them to secure a JVM license at the time, as they could have gotten far more reasonable terms from Sun – rather than Oracle, the new sheriff in town.

Just askin’.

Yahoo to Hadoop: Show Me the Money

While there is relatively little to knock cloud from its hype perch, among web startups, BI and data geeks, the emergence of Big Data has become a game changer. It’s analytics and operational intelligence gone extreme.

Big Data typically is associated with obscene amounts of data – the scale blows away anything that most enterprises would maintain within their core back end business systems. We’re talking hundreds of terabytes or even petabytes.

Today, Yahoo announced that it might take the business of its best-known Big Data brainchild, Hadoop, and and consider spinning it off into a new entity.

So why are we having this conversation?

It’s because Internet giants Google, Yahoo, Facebook, Amazon, and others had to roll their own technologies to deal with magnitudes of data far beyond conventional wisdom of what was possible with enterprise systems. What makes the conversation interesting is that this technology is on the cusp of entering the enterprise mainstream today. It’s not just a matter of technology looking for a problem. When Facebook needs to understand how its 500 million members update their walls, share photographs, and have conversations, it’s because (1) it needs to optimize its IT infrastructure to support how its members use the site, but more importantly (2) it needs to understand more about its members so it can sell advertising.

And when Facebook makes its API publicly available, that same issue becomes a critical for any marketer that is B2C. And as the technology becomes available, suddenly there are downstream uses in capital markets for conducting brute force analyses on trading positions, healthcare providers for understanding outcomes, homeland security for controlling borders, metropolitan entities seeking to manage congestion pricing, life sciences organization seeking to decipher clinical studies, mobile carriers seeking to prevent or minimize customer churn, and so on.

There are a couple technology and market paths that have opened for contending with Big Data. There are Advanced SQL analytic database providers that have adapted SQL for structured data through strategies such as reducing indexing, introducing new forms of data compression and query optimization, columnar architectures, and embedding analytics and data transformation directly into the data engine to minimize data movement; in some cases, they have developed optimized appliances. We’re talking about the Aster Datas, Greenplums, Netezzas, ParAccels, and Verticas of the world – and players like Teradata that invented big data warehousing, Oracle that has extended it, and Sybase which acquired the first column-oriented database. Business has obviously picked up here; IBM, EMC, Teradata, and HP have all made acquisitions in this space over the past 12 months.

But the Facebooks and Googles of the world weren’t dealing with structured data in the enterprise sense – they are contending with web log files, document APIs, rich media files, and so on. They are dealing with data whose structure and volume is so varied and huge that there is no time to model it and form a schema; they need to just load the data into the file system and then analyze it. That spawned the NoSQL movement – initially a focus on technologies that avoided the overhead and scalability limits of SQL.

Until now, neither Google, Yahoo, or Facebook considered themselves in the tools or database business. So they released the fruits of their innovation as open source, with one of the best known projects being Apache Hadoop. Hadoop is a family of projects that includes a distributed file system, the MapReduce framework that parcels out massively parallel computing jobs across a cluster plus a number of other frameworks, file systems, and utilities.

What’s kind of fascinating is the almost incestuous relationship between these NoSQL projects. Hadoop, developed at Yahoo was descended from the Google File System that in turn was developed for Google BigTable; the same was true for Cassandra, another NoSQL file system. Meanwhile, Facebook develops Hive, a relational-like table structure designed to work with Hadoop. You get the picture.

Cloudera has stepped to the forefront in commercializing Hadoop technology and applying MapReduce. Using a Red Hat-like business model, it offers support, several open source extensions, plus an enterprise edition that adds a number of proprietary monitoring and management features. It has distinguished itself with forging partnerships with almost every major BI and data warehousing player except one – IBM. the highlights are its relationships with Informatica, for data transformation, and MicroStrategy, which provides a data mart strategy designed to complement Hadoop. And it has garnered roughly 75 enterprise paying customers in a market segment that has barely commercialized.

In the long run, we also expect IBM to make a stab at Hadoop and related technologies by extending its InfoSphere offerings -– it can see Cloudera-Informatica and Cloudera-MicroStrategy raise it one with its own InfoSphere DataStage and Cognos offerings, before it even talks about partnerships. Today we saw a shot from left field – Yahoo which invented the technology – is now saying it might spin off its Hadoop business to go up against Cloudera, and potentially IBM. In a way, its closing the doors after the horses left the barn as the creator of Hadoop is now part of Cloudera.

Clearly there will be a market for NoSQL technologies in the quest for Big Data, although for now, they require sufficient specialized skills that they are not for the faint of heart. that is, if you can find any Hadoop and MapReduce programmers who haven’t already bee scarfed up by Amazon, Zynga, or JP Morgan Chase. That market will not necessarily be in competition with Advanced SQL as there are different use cases for each. And in fact, there will likely be a blending of the technologies in the long run. Today, many Advanced SQL platforms are already extending support for MapReduce, and in the long run, we expect that SQL-like technologies in the NoSQL space like Hive or HBase will themselves be made more accessible to the huge base of SQL developers.

But we digress.

For Yahoo, this would clearly be a shot out of its comfort zone, as it is not a tools company. But it is hungry for monetizing its intellectual property, even if that property has already been open sourced. It’s redolent of Sun striving to monetize Java and we all know how that went. Obviously this will be an uphill battle for Yahoo, but at least this would be a spinoff so hopefully there won’t be distractions from the mother ship. Given Yahoo’s fortunes, we shouldn’t be surprised that they are now looking to maximize what they can get out of the family jewels.