Conventional wisdom is that once Big Data is at rest, don’t move it or shake it. Akin to “don’t fold, spindle, or mutilate.” But seriously, if mainstream enterprises adopt Hadoop, they will expect it to become more robust. And so you start looking at things like data replication, or at least replication of the NameNode or other components that govern how and where data resides in Hadoop and how operations are performed against it.
So here’s an interesting one to watch: WANdisco buying AltoStor. They are applying replication technology developed for Subversion to Hadoop. We’re gonna check this one out.
With Strata, IBM IOD, and Teradata Partners conferences all occurring this week, it’s not surprising that this is a big week for Hadoop-related announcements. The common thread of announcements is essentially, “We know that Hadoop is not known for performance, but we’re getting better at it, and we’re going to make it look more like SQL.” In essence, Hadoop and SQL worlds are converging, and you’re going to be able to perform interactive BI analytics on it.
The opportunity and challenge of Big Data from new platforms such as Hadoop is that it opens a new range of analytics. On one hand, Big Data analytics have updated and revived programmatic access to data, which happened to be the norm prior to the advent of SQL. There are plenty of scenarios where taking programmatic approaches is far more efficient, such as dealing with time series data or graph analysis to map many-to-many relationships. It also leverages in-memory data grids such as Oracle Coherence, IBM WebSphere eXtreme Scale, GigaSpaces, and others, where programmatic development (usually in Java) proved more efficient for accessing highly changeable data for web applications where traditional paths to the database would have been I/O-constrained. Conversely, Advanced SQL platforms such as Greenplum and Teradata Aster have provided support for MapReduce-like programming because, even with structured data, sometimes using a Java programmatic framework is a more efficient way to rapidly slice through volumes of data.
Until now, Hadoop has not been for the SQL-minded. The initial path was: find someone to do data exploration inside Hadoop, but once you’re ready to do repeatable analysis, ETL (or ELT) it into a SQL data warehouse. That’s been the pattern with Oracle Big Data Appliance (use Oracle loader and data integration tools), and most Advanced SQL platforms; most data integration tools provide Hadoop connectors that spawn their own MapReduce programs to ferry data out of Hadoop. Some integration tool providers, like Informatica, offer tools to automate parsing of Hadoop data. Teradata Aster and Hortonworks have been talking up the potential of HCatalog, in actuality an enhanced version of Hive with RESTful interfaces, cost optimizers, and so on, to provide a more SQL-friendly view of data residing inside Hadoop.
But when you talk analytics, you can’t simply write off the legions of SQL developers that populate enterprise IT shops. And beneath the veneer of chaos, there is an implicit order to most so-called “unstructured” data that is within the reach of programmatic transformation approaches that in the long run could likely be automated or packaged inside a tool.
At Ovum, we have long believed that for Big Data to cross over to the mainstream enterprise, it must become a first-class citizen with IT and the data center. The early pattern of skunk works projects, led by elite, highly specialized teams of software engineers from Internet firms to solve Internet-style problems (e.g., ad placement, search optimization, customer online experience, etc.), does not address the problems of mainstream enterprises. Neither is the model of recruiting high-priced talent to work exclusively on Hadoop sustainable for most organizations. It means that Big Data must be consumable by the mainstream of SQL developers.
Making Hadoop more SQL-like is hardly new
Hive and Pig became Apache Hadoop projects because of the need for SQL-like metadata management and data transformation languages, respectively; HBase emerged because of the need for a table store to provide a more interactive face – although as a very sparse, rudimentary column store, it does not provide the efficiency of an optimized SQL database (or the extreme performance of some columnar variants). Sqoop in turn provides a way to pipeline SQL data into Hadoop, a use case that will grow more common as organizations look to Hadoop to provide scalable and cheaper storage than commercial SQL. While these Hadoop subprojects did not exactly make Hadoop look like SQL, they provided building blocks that many of this week’s announcements leverage.
Progress marches on
One train of thought is that if Hadoop can look more like a SQL database, more operations could be performed inside Hadoop. That’s the theme behind Informatica’s long-awaited enhancement of its PowerCenter transformation tool to work natively inside Hadoop. Until now, PowerCenter could extract data from Hadoop, but the extracts would have to be moved to a staging server where the transformation would be performed for loading to the familiar SQL data warehouse target. The new offering, PowerCenter Big Data Edition, now supports an ELT pattern that uses the power of MapReduce processes inside Hadoop to perform transformations. The significance is that PowerCenter users now have a choice: load the transformed data to HBase, or continue loading to SQL.
There is growing support for packaging Hadoop inside a common hardware appliance with Advanced SQL. EMC Greenplum was the first out of the gate with DCA (Data Computing Appliance) that bundles its own distribution of Apache Hadoop (not to be confused with Greenplum MR, a software-only product that is accompanied by a MapR Hadoop distro). Teradata Aster has just joined the fray with Big Analytics Appliance, bundling the Hortonworks Data Platform Hadoop; this move was hardly surprising given their growing partnership around HCatalog, an enhancement of the SQL-like Hive metadata layer of Hadoop that adds features such as a cost optimizer and RESTful interfaces that make the metadata accessible without the need to learn MapReduce or Java. With HCatalog, data inside Hadoop looks like another Aster data table.
Not coincidentally, there is a growing array of analytic tools that are designed to execute natively inside Hadoop. For now they are from emerging players like Datameer (providing a spreadsheet-like metaphor, and which just announced an app store-like marketplace for developers), Karmasphere (providing an application development tool for Hadoop analytic apps), or a more recent entry, Platfora (which caches subsets of Hadoop data in memory with an optimized, high performance fractal index).
Yet, even with Hadoop analytic tooling, there will still be a desire to disguise Hadoop as a SQL data store, and not just for data mapping purposes. Hadapt has been promoting a variant where it squeezes SQL tables inside HDFS file structures – not exactly a no-brainer as it must shoehorn tables into a file system with arbitrary data block sizes. Hadapt’s approach sounds like the converse of object-relational stores, but in this case, it is dealing with a physical rather than a logical impedance mismatch.
Hadapt promotes the ability to query Hadoop directly using SQL. Now, so does Cloudera. It has just announced Impala, a SQL-based alternative to MapReduce for querying the SQL-like Hive metadata store, supporting most but not all forms of SQL processing (based on SQL 92; Impala lacks triggers, which Cloudera deems low priority). Both Impala and MapReduce rely on parallel processing, but that’s where the similarity ends. MapReduce is a blunt instrument, requiring Java or other programming languages; it splits a job into multiple concurrent, pipelined tasks, each of which reads data, processes it, writes it back to disk, and then passes it to the next task. Conversely, Impala takes a shared nothing, MPP approach to processing SQL jobs against Hive; using HDFS, Cloudera claims roughly 4x performance against MapReduce; if the data is in HBase, Cloudera claims performance multiples up to a factor of 30. For now, Impala only supports row-based views, but with columnar (on Cloudera’s roadmap), performance could double. Cloudera plans to release a real-time query (RTQ) offering that, in effect, is a commercially supported version of Impala.
By contrast, Teradata Aster and Hortonworks promote a SQL MapReduce approach that leverages HCatalog, an incubating Apache project that is a superset of Hive that Cloudera does not currently include in its roadmap. For now, Cloudera claims bragging rights for performance with Impala; over time, Teradata Aster will promote the manageability of its single appliance, and with the appliance has the opportunity to counter with hardware optimization.
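To make the contrast concrete, here is a minimal single-process Python sketch of the MapReduce pattern described above, in which each stage reads its input, processes it, and writes the result back to disk before handing off to the next stage. It is an illustration only and not Hadoop’s API: the function names (map_stage, reduce_stage, run_job), the JSON-lines intermediate files, and the word-count job are all assumptions made for the example.

```python
import json
import os
import tempfile
from collections import defaultdict


def map_stage(input_path, output_path):
    """Read raw lines, emit (word, 1) pairs, and write them to disk."""
    with open(input_path) as src, open(output_path, "w") as dst:
        for line in src:
            for word in line.split():
                dst.write(json.dumps([word.lower(), 1]) + "\n")


def reduce_stage(input_path, output_path):
    """Read the mapped pairs back from disk, group by key, and sum the counts."""
    totals = defaultdict(int)
    with open(input_path) as src:
        for line in src:
            word, count = json.loads(line)
            totals[word] += count
    with open(output_path, "w") as dst:
        json.dump(dict(totals), dst)


def run_job(lines):
    """Chain the stages; every hand-off goes through a file (hypothetical layout)."""
    with tempfile.TemporaryDirectory() as workdir:
        raw = os.path.join(workdir, "raw.txt")
        mapped = os.path.join(workdir, "mapped.jsonl")
        reduced = os.path.join(workdir, "reduced.json")
        with open(raw, "w") as f:
            f.write("\n".join(lines))
        map_stage(raw, mapped)
        reduce_stage(mapped, reduced)
        with open(reduced) as f:
            return json.load(f)


if __name__ == "__main__":
    print(run_job(["Hadoop meets SQL", "SQL meets Hadoop again"]))
```

The disk round trip between stages is what makes this pattern robust for large batch jobs and, at the same time, poorly suited to interactive queries, which is the gap that an MPP-style engine aims to close.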
The road to SQL/programmatic convergence
Either way – and this is of interest only to purists – any SQL extension to Hadoop will be outside the Hadoop project. But again, that’s an argument for purists. What’s more important to enterprises is getting the right tool for the job – whether it is the flexibility of SQL or raw power of programmatic approaches.
SQL convergence is the next major battleground for Hadoop. Cloudera is for now shunning HCatalog, an approach backed by Hortonworks and partner Teradata Aster. The open question is whether Hortonworks can instigate a stampede of third parties to overcome Cloudera’s resistance. It appears that beyond Hive, the SQL face of Hadoop will become a vendor-differentiated layer.
Part of convergence will involve a mix of cross-training and tooling automation. Savvy SQL developers will cross-train to pick up some of the Java- or Java-like programmatic frameworks that will be emerging. Tooling will help lower the bar, reducing the degree of specialized skills necessary. And for programming frameworks, in the long run, MapReduce won’t be the only game in town. It will always be useful for large-scale jobs requiring brute force, parallel, sequential processing. But the emerging YARN framework, which deconstructs MapReduce to generalize the resource management function, will provide the management umbrella for ensuring that different frameworks don’t crash into one another by trying to grab the same resources. YARN, however, is not yet ready for primetime – for now it only supports the batch job pattern of MapReduce. And that means that YARN is not yet ready for Impala or vice versa.
Of course, mainstreaming Hadoop – and Big Data platforms in general – is more than just a matter of making it all look like SQL. Big Data platforms must be manageable and operable by the people who are already in IT; they will need some new skills and grow accustomed to some new practices (like exploratory analytics), but the new platforms must also look and act familiar enough. Not all announcements this week were about SQL; for instance, MapR is throwing down a gauntlet to the Apache usual suspects by extending its management umbrella beyond the proprietary NFS-compatible file system that is its core IP to the MapReduce framework and HBase, making a similar promise of high performance. On the horizon, EMC Isilon and NetApp are proposing alternatives promising a more efficient file system but at the “cost” of separating the storage from the analytic processing. And at some point, the Hadoop vendor community will have to come to grips with capacity utilization issues, because in the mainstream enterprise world, no CFO will approve the purchase of large clusters or grids that get only 10 – 15% utilization. Keep an eye on VMware’s Project Serengeti.
Big Data platforms must be good citizens in data centers that need to maximize resource utilization (e.g., through virtualization and optimized storage); must comply with existing data stewardship policies and practices; and must fully support existing enterprise data and platform security practices. These are all topics for another day.
This guest post comes from Ovum colleague Michael Azoff.
Agile practices have been around for over twenty years. The Agile Manifesto was written a decade after ‘agile’ first emerged (under different names, of course; the term ‘Agile’ was coined at the 2001 manifesto meeting). There are also plenty of proof points around what works in agile and when to apply it. If you are still asking for agile to prove itself, then you are missing where software development has progressed to.
Going back to Waterfall is not an option because it has inherent faults and those faults are visible all around in many failed IT projects. Ultimately, if waterfall is not broken for you then don’t fix it. But you should consider alternatives to waterfall if your software development processes or organization have become dysfunctional; over time, you might find difficulty in recruiting developers for legacy processes, but that’s another issue.
Ken Schwaber, a co-originator of Scrum, has said that only 25% of Scrum deployments succeed. The question then is what happens to the other 75% that fail. The problem can be examined at three levels of maturity: intra-team agility, extra-team agility, and business agility.
Teams may not be perfectly pure about their agile adoption, and we can get into discussions, as Jeff Sutherland has, about “Scrum But” scenarios (i.e., Scrum, but without some Scrum practices). But there comes a point where the team’s partial adoption of Scrum leads to failure. It could also be that cultural impediments prevent certain agile practices from taking root: a highly hierarchical organization will be antithetical to the practice of self-organizing agile teams, for example.
The interface between the business and an agile team can harbor impediments. For example, processes on the business side may have originally evolved around supporting waterfall processes and constrain a team that has transitioned to agile. In this scenario, failure of agile is now a problem that spans beyond intra-team agile adoption and across the business-IT interface.
The biggest challenge and opportunity is with the organization as a whole: Can the business transform its agility? Can the business become agile and thereby make the agile IT department an integral part of the business, rather than a department in the basement that no executive visits? Today, many major businesses are essentially IT businesses and divorcing the IT team from the business becomes a serious handicap – witness successful businesses in technology, financial services, retail and more, where IT and the business are integral and are agile about it.
There is no magic recipe for agile adoption and it is seen in practice that the most successful agile transformation is one where the team goes through a learning process of self-discovery. Introducing agile practices, using trial and error, learning through experience, seeing what works and what does not, allows the team to evolve its agility and fit it to the constraints of the organization culture.
Organizations need support, training, and coaching in their agile transformation, but the need for business agility is greater the larger the scale of the IT project. Large scale agile projects can be swamped by business waterfall processes that impede their agility at levels above core software development. Interestingly, there are cases where agility at the higher levels is introduced and succeeds, while intra-team processes remain waterfall. There is no simple ‘right’ way to adopt agile. It all depends on the individual cases, but as long as we are agile about agile adoption, then we can avoid agile failure, or at least improve on what went before. Failure in adopting agile is not about giving up on agile, but re-thinking the problem and seeing what can be improved, incrementally.
It’s no secret that rocket .. err … data scientists are in short supply. The upshot of the explosion of data, the corresponding explosion of tools, and the knock-on impacts of Moore’s and Metcalfe’s laws is that there is more data, more connections, and more technology to process it than ever. At last year’s Hadoop World, there was a feeding frenzy for data scientists, which only barely outpaced demand for the more technically oriented data architects. In English, that means:
1. Potential MacArthur Grant recipients who have a passion and insight for data, the mathematical and statistical prowess for ginning up the algorithms, and the artistry for painting the picture that all that data leads to. That’s what we mean by data scientists.
2. People who understand the platform side of Big Data, a.k.a., data architect or data engineer.
The data architect side will be the more straightforward nut to crack. Understanding big data platforms (Hadoop, MongoDB, Riak) and emerging Advanced SQL offerings (Exadata, Netezza, Greenplum, Vertica, and a bunch of recent upstarts like Calpont) is a technical skill that can be taught with well-defined courses. The laws of supply and demand will solve this one – just as they did when the dot com bubble created demand for Java programmers back in 1999.
Behind all the noise for Hadoop programmers, there’s a similar, but quieter desperate rush to recruit data scientists. While some data scientists call data scientist a buzzword, the need is real.
However, data science will be a tougher nut to crack. It’s all about connecting the dots, which is not as easy as it sounds. The V’s of big data – volume, variety, velocity, and value – require someone who discovers insights from data; traditionally, that role was performed by the data miner. But data miners dealt with better-bounded problems and well-bounded (and known) data sets that made the problem more 2-dimensional. The variety of Big Data – in form and in sources – introduces an element of the unknown. Deciphering Big Data requires a mix of investigative savvy, communications skills, creativity/artistry, and the ability to think counter-intuitively. And don’t forget it all comes atop a foundation of a solid statistical and machine learning background plus technical knowledge of the tools and programming languages of the trade.
Sometimes it seems like we’re looking for Albert Einstein or somebody smarter.
As nature abhors a vacuum, there’s also a rush to not only define what a data scientist is, but develop programs that could somehow teach it, software packages that to some extent package it, and otherwise throw them into a meat … err, the free market. EMC and other vendors are stepping up to the plate to offer training, not just on platforms, but for data science. Kaggle offers an innovative cloud-based, crowdsourced approach to data science, making available a predictive modeling platform and then staging sponsored 24-hour competitions for moonlighting data scientists to devise the best solutions to particular problems (redolent of the Netflix $1 million prize to devise a smarter algorithm for predicting viewer preferences).
With data science talent scarce, we’d expect that consulting firms would buy up talent that could then be “rented” to multiple clients. Excluding a few offshore firms, few SIs have yet stepped up to the plate to roll out formal big data practices (the logical place where data scientists would reside), but we expect that to change soon.
Opera Solutions, which has been in the game of predictive analytics consulting since 2004, is taking the next step down the packaging route. Having raised $84 million in Series A funding last year, the company has staffed up to nearly 200 data scientists, making it one of the largest assemblages of genius this side of Google. Opera’s predictive analytics solutions are designed for a variety of platforms, SQL and Hadoop, and today they join the SAP Sapphire announcement stream with a release of their offering on the HANA in-memory database. Andrew Brust provides a good drilldown on the details of this announcement.
From SAP’s standpoint, Opera’s predictive analytics solutions are a logical fit for HANA as they involve the kinds of complex problems (e.g., a computation triggers other computations) that their new in-memory database platform was designed for.
There’s too much value at stake to expect that Opera will remain the only large aggregation of data scientists for hire. But ironically, the barriers to entry will keep the competition narrow and highly concentrated. Of course, with market demand, there will inevitably be a watering down of the definition of data scientists so that more companies can claim they’ve got one… or many.
The laws of supply and demand will kick in for data scientists, but the ramp up of supply won’t be as quick as that for the more platform-oriented data architect or engineer. Of necessity, that supply of data scientists will have to be augmented by software that automates the interpretation of machine learning, but there’s only so far that you can program creativity and counter-intuitive insight into a machine.
Our Twitter feed went silent for a few days last week as we spent some time at a conference where chance conversations, personal reunions, and discovery were the point. In fact, this was one of the few events where attendees – like us – didn’t have our heads down buried in our computers. We’re speaking of Cyon Research’s COFES 2012 design engineering software conference, where we had the opportunity to explore the synergy of Big Data and the Product Lifecycle, why ALM and PLM systems can’t play nice, and how to keep a handle on finding the right data as product development adopts a 24/7 follow-the-sun strategy. It wasn’t an event of sessions in the conventional sense, but lots of hallways where you spent most of your time in chance, impromptu meetings. And it was a great chance to hook up with colleagues whom we haven’t caught up with in years.
There were plenty of contrarian views. There were a couple of keynotes in the conventional sense that each took different shots at the issue of risk. Retired Ford product life cycle management director Richard Riff took aim at conventional wisdom when it comes to product testing. After years of ingrained lean, six sigma, and zero defects practices – not to mention Ford’s old slogan that quality is job one – Riff countered with a provocative notion: sometimes the risk of not testing is the better path. It comes down to balancing the cost of defects vs. the cost of testing, the likely incidence of defects, and the reliability of testing. While we couldn’t repeat the math, in essence, it amounted to a lifecycle cost approach for testing. He claimed that the method even accounted for intangible factors, such as social media buzz or loss of reputation, when referring to recent highly publicized quality issues with some of Ford’s rivals.
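We can’t reproduce Riff’s actual method, but the general shape of such a trade-off can be sketched as a simple expected-cost comparison. Everything below is our own assumption for illustration: the formulas, the function names, and the dollar figures are hypothetical and are not taken from the keynote. The sketch weighs the four factors named above: cost of defects, cost of testing, likely incidence of defects, and reliability of testing.

```python
def cost_without_testing(p_defect, cost_field_failure):
    """Expected cost if the product ships untested (hypothetical model)."""
    return p_defect * cost_field_failure


def cost_with_testing(p_defect, cost_field_failure, cost_test,
                      test_reliability, cost_fix_before_release):
    """Expected cost if the product is tested first (hypothetical model).

    A defect is caught with probability test_reliability and fixed cheaply
    before release; otherwise it escapes and incurs the full field cost.
    """
    caught = test_reliability * cost_fix_before_release
    escaped = (1.0 - test_reliability) * cost_field_failure
    return cost_test + p_defect * (caught + escaped)


def should_test(**kwargs):
    """Return True when testing has the lower expected lifecycle cost."""
    untested = cost_without_testing(kwargs["p_defect"],
                                    kwargs["cost_field_failure"])
    return cost_with_testing(**kwargs) < untested


if __name__ == "__main__":
    # Made-up numbers: a rare defect and an expensive test campaign.
    scenario = dict(p_defect=0.02, cost_field_failure=1_000_000,
                    cost_test=50_000, test_reliability=0.9,
                    cost_fix_before_release=10_000)
    print("Test first?", should_test(**scenario))
```

Intangibles such as reputational damage could be folded into the field failure cost in a model like this, which is one plausible reading of how such factors might be accounted for.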
Xerox PARC computing legend Alan Kay made the case for reducing risk through a strategy that applied a combination of object-oriented design (of which he was one of the pioneers – along with the GUI of course) and what sounded to us like domain-specific languages. Or more specifically, that software describes the function, then lets other programs automatically generate the programming to execute it. Kay decried the instability that we have come to accept with software design – which reminded us that since the mainframe days, we have become all too accustomed to hearing that the server is down. Showing some examples of ancient Roman design (e.g., a 2000-year old bridge in Spain that today still carries cars and looks well intact), he insisted that engineers can do better.
Some credit to host Brad Holtz, who deciphered that there really was a link between our diverging interests: Big Data and meshing software development with the product lifecycle. By the definition of Big Data – volume, variability, velocity, and value – Big Data is nothing new to the product lifecycle. CAD files, models, and simulations are extremely data-intensive and contain a variety of data types encompassing graphical and alphanumeric data. Today, the brass ring for the modeling and simulation world is implementing co-simulations, where models each drive other models (the results of one drive the other).
But is anybody looking at the bigger picture? Modeling has been traditionally silo’ed – for instance, models are not typically shared across product teams, projects, or product families. Yet new technologies could provide the economical storage and processing power to make it possible to analyze and compare the utilization and reliability of different models for different scenarios – with the possible result being metamodels that provide frameworks for optimizing model development and parameters with specific scenarios. All this is highly data-intensive.
What about the operational portion of the product lifecycle? Today, it’s rare for products not to have intelligence baked into controllers. Privacy issues aside (they must be dealt with), machinery connected to networks can feed back performance data; vehicles can yield data while in the repair shop, or thanks to mobile devices, provide operational data while in movement. Add to that reams of publicly available data from services such as NOAA or the FAA, and now there is context placed around performance data (did bad weather cause performance to drop?). Such data could feed processes, ranging from MRO (maintenance, repair, and operation) and warranty, to providing feedback loops that can validate product tests and simulation models.
Let’s take another angle – harvesting publicly available data for the business. For instance, businesses could use disaster preparedness models to help their scenario planning, as described in this brief video snippet from last year’s COFES conference. Emerging organizations, such as the Center for Understanding Change, aim to make this a reality by making available models and expertise developed through tax dollars in the national laboratory system.
Big Data and connectivity can also be used to overcome gaps in locating expertise and speed product development. Adapting techniques from the open source software world, where software is developed collaboratively by voluntary groups of experts in the field, crowdsourcing is invading design and data science (we especially enjoyed our conversation with Kaggle’s Jeremy Howard).
A personal note on the sessions – the conference marked a reunion with folks whom we have crossed paths with over the past 20 years. Our focus on application development led us to engineered systems, an area of white space between software engineering and classic product engineering disciplines. And as noted above, that in turn brought us full circle to our roots covering the emergence of CADCAM in the 80s as we had the chance to reconnect with many who continue to advance the engineering discipline. What a long, fun trip it’s been.
Turn on the ignition of your car, back out of the parking space and go into drive. As you engage the transmission, gently tap the accelerator, and step on the brake, you don’t directly interact with the powertrain. Instead, your actions are detected by sensors and executed by actuators on electronic control units that then get the car to shift, move, then stop.
Although in the end, Toyota’s recall issues from 2009-10 wound up isolating misadjusted accelerator controls, speculation around the recalls directed the spotlight to the prominent role of embedded software, prompting the realization that today when you operate your car, you are driving by wire.
Today’s automobiles are increasingly looking a lot more like consumer electronics products. They contain nearly as much software as an iPhone, and in the future will contain even more. According to IDC, the market for embedded software that is designed into engineered products (like cars, refrigerators, airplanes, and consumer electronics) will double by 2015.
Automobiles are the tip of the iceberg when it comes to smart products; today most engineered products, from refrigerators to industrial machinery and aircraft, feature smart control. Adding intelligence allows designers to develop flexible control logic that brings more functionality to products and provides ways to optimize operation to gain savings in weight, bulk, and cost.
Look at the hybrid car: to function, the battery, powertrain, gas and electric engines, and braking systems must all interoperate to attain fuel economy. It takes software to determine when to let the electric engine run or let the battery recharge. The degree of interaction between components is greater compared to traditional electromechanical product designs. Features such as anti-lock braking or airbag deployment depend on the processing of data from multiple sources – wheel rotation, deceleration rate, steering, etc.
The growth of software content changes the ground rules for product development, which has traditionally been a very silo’ed process. There are well established disciplines in mechanical and electrical engineering, with each having their own sets of tools, not to mention claims to ownership of the product design. Yet with software playing the role as the “brains” of product operation, there is the need for engineering disciplines to work more interactively across silos rather than rely on systems engineers to crack the whip on executing the blueprint.
We were reminded of this after a rather enjoyable, freewheeling IEEE webcast that we had with IBM Rational’s Dominic Tavasolli last week.
Traditionally, product design fell under the mechanical engineering domain, which designed the envelope and specified the geometry, components, materials, physical properties (such as resistance to different forms of stress) and determined the clearance within which electronics could be shoehorned.
Drill down deeper and you’ll note that each engineering domain has its full lifecycle of tools. It’s analogous to enterprise software development organizations, where you’ll often stumble across well entrenched camps of Microsoft, Java, and web programmers. Within the lifecycle there is a proliferation of tools and languages to deal with the wide variety of engineering problems that must be addressed when developing a product. Unlike the application lifecycle, where you have specific tools that handle modeling or QA, on the engineering side there are multiple tools because there are many different ways to simulate a product’s behavior in the real world to perform the engineering equivalent of QA. You might want to test mechanical designs for wind shear, thermal deformation, or compressive stresses, and electrical ones for their ability to handle voltage and disperse heat from processing units.
Now widen out the picture. Engineering and manufacturing groups each have their own definitions of the product. It is expressed in the bill of materials (BOM): engineering has its own BOM, which details the design hierarchy, while the manufacturing BOM itemizes the inventory materials and the manufacturing processes needed to fabricate and assemble the product. That sets the stage for the question of who owns the product lifecycle management (PLM) process: the CADCAM vs. the ERP folks.
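The two views of the product can be sketched in a few lines of Python. This is a toy illustration under our own assumptions, not how any PLM or ERP system models a BOM: the class names (DesignItem, ManufacturingLine), the part names, and the part-to-process mapping are all hypothetical. The engineering BOM is a design hierarchy; the manufacturing BOM is a flat list of inventory items tied to processes.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DesignItem:
    """Node in the engineering BOM: a hierarchy of assemblies and parts."""
    name: str
    quantity: int = 1
    children: list[DesignItem] = field(default_factory=list)


@dataclass
class ManufacturingLine:
    """Row in the manufacturing BOM: an inventory item and the process using it."""
    part: str
    quantity: int
    process: str


def flatten(item, multiplier=1, totals=None):
    """Roll the design hierarchy up into total quantities per leaf part."""
    totals = Counter() if totals is None else totals
    count = multiplier * item.quantity
    if not item.children:
        totals[item.name] += count
    for child in item.children:
        flatten(child, count, totals)
    return totals


def to_manufacturing_bom(root, process_for):
    """Derive manufacturing lines from the design using a part-to-process map."""
    return [
        ManufacturingLine(part, qty, process_for.get(part, "unassigned"))
        for part, qty in sorted(flatten(root).items())
    ]


if __name__ == "__main__":
    pump = DesignItem("pump", children=[
        DesignItem("motor", children=[DesignItem("rotor"), DesignItem("bolt", 4)]),
        DesignItem("housing", children=[DesignItem("casting"), DesignItem("bolt", 6)]),
    ])
    for line in to_manufacturing_bom(pump, {"bolt": "fastening", "casting": "foundry"}):
        print(line)
```

Even in this toy form, the same bolt shows up in two places in the design hierarchy but as a single inventory line on the manufacturing side, which hints at why the two camps end up with different definitions of the same product.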
Into the mix between the different branches of engineering and the silos between engineering and manufacturing, now introduce the software engineers. They used to be an afterthought, yet today their programs are affecting not only how product components and systems behave, but in many cases the physical specifications as well. For instance, if you can design software to enable a motor to run more efficiently, the mechanical engineers can then design a smaller, lighter weight engine.
In the enterprise computing world, we’ve long gotten hung up on the silos that divide different parts of IT from itself – the developers vs. QA, DBAs, enterprise architects, systems operations – or IT from the business. However, the silos that plague enterprise IT are child’s play compared to the situation in product development where you have engineering groups squared off against each other, and against manufacturing.
OK, so the product lifecycle is a series of fiefdoms – why bother or care about making it more efficient? There is too much at stake in the success of a product: there are the constantly escalating pressures to squeeze time, defects, and cost out of the product lifecycle. That’s been the routine ever since the Japanese introduced American concepts of lean manufacturing back in the 1980s. But as automobiles and other complex engineered products add more intelligence, the challenge is leveraging the rapid innovation of the software and consumer electronics industries for product sectors where, of necessity, lead times will stretch into one or more years.
There is no easy solution because there is no single solution. Each industry has different product characteristics that impact the length of the lifecycle and how product engineering teams are organized. Large, highly complex products such as automobiles, aircraft, or heavy machinery will have long lead times because of supply chain dependencies. At the other end of the scale, handheld consumer electronics or biomedical devices might not have heavy supply chain dependencies. But, for instance, smart phones have short product lifespans and are heavily driven by the fast pace of innovation in processing power and software capabilities, meaning that product lifecycles must be quicker in order for new products to catch the market window. Biomedical devices, on the other hand, are often compact, but have significant regulatory hurdles to surmount, which impacts how the devices are tested.
The product lifecycle is a highly varied creature. The common thread is the need to more effectively integrate software engineering, which in turn is forcing the issue of integration and collaboration between other engineering disciplines. It is no longer sufficient to rely on systems engineers to get it together in the end – as manufacturers learned the hard way, it costs more to rework a design that doesn’t fit together, perform well, or be readily assembled with existing staff and facilities. The rapid evolution of software and processors also forces the issue on whether and where agile development processes can be coupled with linear or hierarchical development processes that are necessary for long-fuse products.
There is no single lifecycle process that will apply to all sectors, and no single set of tools that can perform every design and test function necessary to get from idea to product. Ultimately, the answer – as loose as it is – is that in larger product development organizations, work on the assumption that there are multiple sources of truth. The ALM and PLM worlds have at best worked warily at arms length from each other as there is a DMZ when it comes to requirements, change, and quality management. The reality is that no single constituency owns the product lifecycle – get used to federation that will proceed on rules of engagement that will remain industry- and organization-specific.
Ideally it would be great to integrate everything. Good luck. With the exception of frameworks that are proprietary for specific vendors, there is no associativity between tools that provides a process-level integration. The best that can be expected at this point is at the data exchange level.
It’s a start.
There’s been plenty of excellent commentary on Google’s $12.5 billion deal for Motorola Mobility Inc. (MMI) over the past few days, and we’re certainly not going to rehash covered ground.
Clearly this is a lot of money that was invested defensively. Money that could have gone into research or acquisitions that would have grown the business or opened new markets.
That thought hit us this morning after reading a NY Times piece on the bull market for patents. It reinforced our thoughts after word of the deal broke: that this was money spent for arming Google against patent predators in courts of law. In this case, it’s predators sensing blood to slow down or at least exact royalties from the Android platform juggernaut.
Of course much of the issue stems from the subjective nature of software patents; that’s a longstanding issue given the iterative nature of software development. It is simply difficult if not impossible to prove that a software innovation does not base itself in some way on prior invention. Furthermore, the fact that software relies on other software to operate makes the notion of software patents even more dubious.
This doesn’t mean that software developers should get away with plagiarism. Although discovery is still underway, the evidence continues to get more damning in the Oracle-Google case over Dalvik, the Android VM that on closer inspection looks like the JVM in sheep’s clothing. The irony is that when Google was still pulling its (J)VM clean room act, the company at the other end of the line was Sun. To us, this is a reflection of Google’s Not-Invented-Here mentality. Would it have killed them to secure a JVM license at the time, when they could have gotten far more reasonable terms from Sun rather than from Oracle, the new sheriff in town?
While there is relatively little to knock cloud from its hype perch, among web startups, BI and data geeks, the emergence of Big Data has become a game changer. It’s analytics and operational intelligence gone extreme.
Big Data typically is associated with obscene amounts of data – the scale blows away anything that most enterprises would maintain within their core back end business systems. We’re talking hundreds of terabytes or even petabytes.
Today, Yahoo announced that it might take the business of its best-known Big Data brainchild, Hadoop, and consider spinning it off into a new entity.
So why are we having this conversation?
It’s because Internet giants Google, Yahoo, Facebook, Amazon, and others had to roll their own technologies to deal with magnitudes of data far beyond conventional wisdom of what was possible with enterprise systems. What makes the conversation interesting is that this technology is on the cusp of entering the enterprise mainstream today. It’s not just a matter of technology looking for a problem. When Facebook needs to understand how its 500 million members update their walls, share photographs, and have conversations, it’s because (1) it needs to optimize its IT infrastructure to support how its members use the site, but more importantly (2) it needs to understand more about its members so it can sell advertising.
And when Facebook makes its API publicly available, that same issue becomes critical for any marketer that is B2C. And as the technology becomes available, suddenly there are downstream uses in capital markets for conducting brute force analyses on trading positions, healthcare providers for understanding outcomes, homeland security for controlling borders, metropolitan entities seeking to manage congestion pricing, life sciences organizations seeking to decipher clinical studies, mobile carriers seeking to prevent or minimize customer churn, and so on.
There are a couple technology and market paths that have opened for contending with Big Data. There are Advanced SQL analytic database providers that have adapted SQL for structured data through strategies such as reducing indexing, introducing new forms of data compression and query optimization, columnar architectures, and embedding analytics and data transformation directly into the data engine to minimize data movement; in some cases, they have developed optimized appliances. We’re talking about the Aster Datas, Greenplums, Netezzas, ParAccels, and Verticas of the world – and players like Teradata that invented big data warehousing, Oracle that has extended it, and Sybase which acquired the first column-oriented database. Business has obviously picked up here; IBM, EMC, Teradata, and HP have all made acquisitions in this space over the past 12 months.
But the Facebooks and Googles of the world weren’t dealing with structured data in the enterprise sense – they are contending with web log files, document APIs, rich media files, and so on. They are dealing with data whose structure and volume is so varied and huge that there is no time to model it and form a schema; they need to just load the data into the file system and then analyze it. That spawned the NoSQL movement – initially a focus on technologies that avoided the overhead and scalability limits of SQL.
Until now, none of Google, Yahoo, or Facebook considered themselves in the tools or database business. So they released the fruits of their innovation as open source, with one of the best known projects being Apache Hadoop. Hadoop is a family of projects that includes a distributed file system, the MapReduce framework that parcels out massively parallel computing jobs across a cluster, plus a number of other frameworks, file systems, and utilities.
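For readers who have not seen the MapReduce pattern, here is a minimal single-machine sketch of the idea: split the input, map each piece independently, group the results by key, then reduce each group. This is not Hadoop's API. The function names are our own, and a local thread pool stands in for the cluster that a real framework would parcel work across.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every token in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks, workers=4):
    """Run the map step in parallel over chunks, then shuffle and reduce."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    # Illustrative input only: two "log file" chunks.
    logs = ["GET /home GET /cart", "POST /cart GET /home"]
    print(word_count(logs))
```

Because each chunk is mapped without reference to any other, the map step can be spread over as many machines as there are chunks, which is what makes the approach attractive for very large volumes of loosely structured data.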
What’s kind of fascinating is the almost incestuous relationship between these NoSQL projects. Hadoop, developed at Yahoo, was descended from the Google File System that in turn was developed for Google BigTable; the same was true for Cassandra, another NoSQL file system. Meanwhile, Facebook developed Hive, a relational-like table structure designed to work with Hadoop. You get the picture.
Cloudera has stepped to the forefront in commercializing Hadoop technology and applying MapReduce. Using a Red Hat-like business model, it offers support, several open source extensions, plus an enterprise edition that adds a number of proprietary monitoring and management features. It has distinguished itself by forging partnerships with almost every major BI and data warehousing player except one – IBM. The highlights are its relationships with Informatica, for data transformation, and MicroStrategy, which provides a data mart strategy designed to complement Hadoop. And it has garnered roughly 75 paying enterprise customers in a market segment that has barely commercialized.
In the long run, we also expect IBM to make a stab at Hadoop and related technologies by extending its InfoSphere offerings – it can see Cloudera-Informatica and Cloudera-MicroStrategy and raise them one with its own InfoSphere DataStage and Cognos offerings, before it even talks about partnerships. Today we saw a shot from left field: Yahoo, which invented the technology, is now saying it might spin off its Hadoop business to go up against Cloudera, and potentially IBM. In a way, it’s closing the barn doors after the horses have left, as the creator of Hadoop is now part of Cloudera.
Clearly there will be a market for NoSQL technologies in the quest for Big Data, although for now, they require sufficient specialized skills that they are not for the faint of heart. That is, if you can find any Hadoop and MapReduce programmers who haven’t already been scarfed up by Amazon, Zynga, or JP Morgan Chase. That market will not necessarily be in competition with Advanced SQL as there are different use cases for each. And in fact, there will likely be a blending of the technologies in the long run. Today, many Advanced SQL platforms are already extending support for MapReduce, and in the long run, we expect that SQL-like technologies in the NoSQL space like Hive or HBase will themselves be made more accessible to the huge base of SQL developers.
But we digress.
For Yahoo, this would clearly be a shot out of its comfort zone, as it is not a tools company. But it is hungry for monetizing its intellectual property, even if that property has already been open sourced. It’s redolent of Sun striving to monetize Java and we all know how that went. Obviously this will be an uphill battle for Yahoo, but at least this would be a spinoff so hopefully there won’t be distractions from the mother ship. Given Yahoo’s fortunes, we shouldn’t be surprised that they are now looking to maximize what they can get out of the family jewels.
And now for something completely different. This week, we offer a guest post from my Ovum colleague and agile methodology expert Michael Azoff.
Software development is more art than science: more about sociology than computer science — Agile has demonstrated that. The dream of computer scientists back in the 1970s, the era of the birth of computing, was that all you needed was a perfect specification and that programmers simply had to implement that spec. And what was implied? Maybe one day, you could automate that step and remove the need for a human programmer. Of course that dream didn’t happen and could never be fulfilled. The reasons are twofold: change and people.
* Change: because you can never nail down a spec perfectly upfront for most projects. Change is introduced during the lifetime of the project, so even if you had that perfect spec it can easily go stale.
* People: because for anything but the smallest projects, you need a team or multiple teams. And when people interact, there is scope for miscommunication and misunderstandings.
It is not a joke when project leaders looking back on large scale project failures say that rather than the hundred developers that were used, if they could rewind history and try again, they would pick the ten ablest developers and get the job completed and in short time.
Fast forward to today: Agile methodology has reached beyond the innovators and visionaries and has arguably gone mainstream. In practice that means various contortions and customizations of Agile methodologies exist, entwined with other processes and methodologies found in organizations, including neo-waterfall.
Neo-waterfall is an interesting case. I use that term because I do not believe developers ever did strict waterfall — if they did, the job would never get accomplished. So there was even a hint of agile in classic waterfall. Developers generally do what is necessary to get the job done and present the results to management in whatever form management expects it. Some form of iteration is essential, call it rework or doing it twice or whatever, because most software requirements are unique and getting it implemented perfectly right first time is difficult.
So now we have a situation where Agile adoption has reached the masses and organizations are ready to try it alongside other options, or, in some cases using only Agile and nothing else. The question is where do we go from here? Have we now solved the software development problem? (To recognize that there is a problem, read Fred Brooks’ The Mythical Man-Month). First of all, the overall (research and anecdotal) evidence is in favor of Agile: it is a step in the right direction (actually major strides forward). Agile methodologies solve development project management problems better than other known methodologies and processes.
However, Agile is not the end of the software development road. There is a “beyond Agile.” The idea is to retain the strengths of Agile and improve its weaker areas.
On the strengths side: the values and principles as expressed in the Agile Manifesto; the philosophy of adaptability and continuous learning (there is good overlap with Lean thinking here); the embracing of change; the emphasis on delivering business value; the iteration heart beat; the retrospectives for making continuous learning happen; the use of testing throughout the lifecycle; gaining feedback from users; getting the business involved; applying macro-management to the team, with a multi-skilled team self-organizing. The list continues: pair programming, test-driven development etc.
However, what will change in ‘beyond Agile’ are the areas where Agile has addressed itself less well. So the emphasis in early phase Agile has been on small teams where possible. The problem is that some enterprise projects are very large scale and need a lot of teams on a global basis. Various Agile development groups have addressed these issues but there is no consensus. The use of architecture and modeling also varies across these approaches. I expect some new form of Agile-friendly architecture and modeling will emerge. Certainly, the technology needs to improve: nothing quite beats the adaptability, the versatility (the agility) of programming languages — and creating software by drawing UML diagrams alone is dreadfully dull.
Another fault line has to do with QA and testing. I have listened to some developers talk about how they had to bypass the (non-Agile) QA facility within their organization because it became a bottleneck, and took on the job of QA themselves. That illustrates how QA and development have become separated in some organizations. ‘Beyond Agile’ I envisage will see QA and testing (the whole range, not just developer testing) become better integrated with development. While Agile developers have embraced quality and testing, the expertise in traditional QA and testing should not be lost.
Managing Agile stories, vast numbers running into thousands, and dealing with interdependencies and the transformation from business orientation to technical orientation — this is another area that could benefit from refinement.
While Agile expands its reach beyond development into operations in DevOps, and into business development (where Lean thinking is already established), the question is whether in the future the practices will be recognizably Agile or follow a new development wave. My hunch is that it will be recognizably rooted in our current understanding of Agile. It would be just fine if Agile became so established and traditional that we called it simply ‘software development’, without further distinction.
Event notice: Special 10th anniversary: The Agile Alliance’s Agile2011 Conference, Aug 8-12, 2011, will be revisiting Salt Lake City, Utah, where the Agile Manifesto was written back in 2001. I’m told the original signatories of the Agile Manifesto will be on stage to debate the progress of Agile.
A South Jersey neighbor of ours — runner, educator, and open source mischief maker Bob Bickel – recently blogged a status report on what’s been going on with the Jenkins open source project ever since it split off from Hudson.
That’s prompted us to wade in to ask the question that’s been glossed over by the theatrics: what about the user?
For background: This is a case of a promising grassroots technology that took off beyond expectation and became a victim of its own success: governance just did not keep up with the project’s growing popularity and attractiveness to enterprise developers. The sign of a mature open source project is that its governing body has successfully balanced the conflicting pressures of constant innovation vs. the need to slow things down for stable, enterprise-ready releases. Hudson failed in this regard.
That led to unfortunate conflicts that degenerated to stupid, petty, and entirely avoidable personality squabbles that in the end screwed the very enterprise users that were pumping much of the oxygen in. We know the actors on both sides – who in their everyday roles are normal, reasonable people that got caught up in mob frenzy. Both sides shot themselves in the feet as matters careened out of control. Go to SD Times if you want the blow by blow.
So what is Hudson – or Jenkins – and why is it important?
Hudson is a continuous integration (CI) server open source project that grew very popular for Java developers. The purpose of a CI server is to support agile practices of continuous integration with a server that maintains the latest copy of the truth. The project was the brainchild of former Sun and Oracle, and current Cloudbees employee Kohsuke Kawaguchi.
Since the split, it has forked into the Hudson and Jenkins branches, with Jenkins having attracted the vast majority of committers and much livelier mailing list activity. Bickel has given us a good snapshot from the Jenkins side with which he’s aligned: a diverse governance body has been established that plans to publish results of its meetings and commit, not only to continuing the crazy schedule of weekly builds, but “stable” quarterly releases. The plan is to go “stable” with the recent 1.400 release, for which a stream of patches is underway.
So most of the committers have gone to Jenkins. Case closed? From the Hudson side, Jason van Zyl of Sonatype, whose business was built around Apache Maven, states that the essential plug-ins are already in the existing Hudson version, and that the current work is more about consolidating the technology already in place, testing it, and refactoring to comply with JSR 330, built around the dependency injection technology popularized by the Spring framework. Although the promises are to keep the APIs stable, this is going to be a rewrite of the innards of Hudson.
Behind the scenes, Sonatype is competing on the natural affinity between Maven and Hudson, which share a large mass of common users, while the emerging force behind Jenkins is Cloudbees, which wants to establish itself as the leading platform for Java development in the cloud.
So if you’re wondering what to do, join the crowd. There are bigger commercial forces at work, but as far as you’re concerned, you want stable releases that don’t break the APIs you already use. Jenkins must prove it’s not just the favorite of the hard core, but that its governance structure has grown up to provide stability and assurance to the enterprise, while Hudson must prove that the new rewrite won’t destabilize the old, and that it has managed to retain the enterprise base in spite of all the noise otherwise.
April 28, 2011 update. Bob Bickel has reported to me that since the “divorce,” Jenkins has drawn 733 commits vs 172 for Hudson.
May 4, 2011 update. Oracle has decided to submit the Hudson project to the Eclipse Foundation. Eclipse board member Mik Kersten voices his support of this effort. Oracle says it didn’t consider this before because going to Eclipse was originally perceived as being too heavyweight. This leaves us wondering: why didn’t Oracle propose to do this earlier? Where was the common sense?