05.15.12

Searching for Data Scientists as a Service

Posted in Big Data, Java at 8:12 am by Tony Baer

It’s no secret that rocket .. err … data scientists are in short supply. The upshot of the explosion of data, the corresponding explosion of tools, and the knock-on impacts of Moore’s and Metcalfe’s laws is that there is more data, more connections, and more technology to process it than ever. At last year’s Hadoop World, there was a feeding frenzy for data scientists, which only barely eclipsed demand for the more technically oriented data architects. In English, that means:

1. Potential MacArthur Grant recipients who have a passion and insight for data, the mathematical and statistical prowess for ginning up the algorithms, and the artistry for painting the picture that all that data leads to. That’s what we mean by data scientists.
2. People who understand the platform side of Big Data, a.k.a., data architect or data engineer.

The data architect side will be the more straightforward nut to crack. Understanding big data platforms (Hadoop, MongoDB, Riak) and emerging Advanced SQL offerings (Exadata, Netezza, Greenplum, Vertica, and a bunch of recent upstarts like Calpont) is a technical skill that can be taught with well-defined courses. The laws of supply and demand will solve this one – just as they did when the dot com bubble created demand for Java programmers back in 1999.

Behind all the noise for Hadoop programmers, there’s a similar, but quieter desperate rush to recruit data scientists. While some data scientists call data scientist a buzzword, the need is real.

However, data science will be a tougher nut to crack. It’s all about connecting the dots, which is not as easy as it sounds. The V’s of big data – volume, variety, velocity, and value – require someone who discovers insights from data; traditionally, that role was performed by the data miner. But data miners dealt with better-bounded problems and well-bounded (and known) data sets that made the problem more 2-dimensional. The variety of Big Data – in form and in sources – introduces an element of the unknown. Deciphering Big Data requires a mix of investigative savvy, communications skills, creativity/artistry, and the ability to think counter-intuitively. And don’t forget it all comes atop a foundation of a solid statistical and machine learning background plus technical knowledge of the tools and programming languages of the trade.

Sometimes it seems like we’re looking for Albert Einstein or somebody smarter.

As nature abhors a vacuum, there’s also a rush to not only define what a data scientist is, but develop programs that could somehow teach it, software packages that to some extent package it, and otherwise throw them into a meat … err, the free market. EMC and other vendors are stepping up to the plate to offer training, not just on platforms, but for data science. Kaggle offers an innovative cloud-based, crowdsourced approach to data science, making available a predictive modeling platform and then staging sponsored 24-hour competitions for moonlighting data scientists to devise the best solutions to particular problems (redolent of the Netflix $1 million prize to devise a smarter algorithm for predicting viewer preferences).

With data science talent scarce, we’d expect that consulting firms would buy up talent that could then be “rented” to multiple clients. Excluding a few offshore firms, few SIs have yet stepped up to the plate to roll out formal big data practices (the logical place where data scientists would reside), but we expect that to change soon.

Opera Solutions, which has been in the game of predictive analytics consulting since 2004, is taking the next step down the packaging route. Having raised $84 million in Series A funding last year, the company has staffed up to nearly 200 data scientists, making it one of the largest assemblages of genius this side of Google. Opera’s predictive analytics solutions are designed for a variety of platforms, SQL and Hadoop, and today they join the SAP Sapphire announcement stream with a release of their offering on the HANA in-memory database. Andrew Brust provides a good drilldown on the details of this announcement.

From SAP’s standpoint, Opera’s predictive analytics solutions are a logical fit for HANA as they involve the kinds of complex problems (e.g., a computation triggers other computations) that their new in-memory database platform was designed for.

There’s too much value at stake to expect that Opera will remain the only large aggregation of data scientists for hire. But ironically, the barriers to entry will keep the competition narrow and highly concentrated. Of course, with market demand, there will inevitably be a watering down of the definition of data scientists so that more companies can claim they’ve got one… or many.

The laws of supply and demand will kick in for data scientists, but the ramp up of supply won’t be as quick as that for the more platform-oriented data architect or engineer. Of necessity, that supply of data scientists will have to be augmented by software that automates the interpretation of machine learning, but there’s only so far that you can program creativity and counter-intuitive insight into a machine.

04.25.12

Another vote for the Apache Hadoop Stack

Posted in Big Data, Data Management at 8:59 pm by Tony Baer

As we’ve noted previously, the measure of success of an open source stack is the degree to which the target remains intact. That either comes as part of a captive open source project, where a vendor unilaterally open sources their code (typically hosting the project) to promote adoption, or a community model where a neutral industry body hosts the project and gains support from a diverse cross section of vendors and advanced developers. In that case, the goal is getting the formal standard to also become the de facto standard.

The most successful open source projects are those that represent commodity software – otherwise, why would vendors choose not to compete with software that anybody can freely license or consume? That’s been the secret behind the success of Linux, where there has been general agreement on where the kernel ends, and as a result, a healthy market of products that run atop (and license) Linux. For community open source projects, vendors obviously have to agree on where the line between commodity and unique value-add begins.

And so we’ve discussed that the fruition of Hadoop will require some informal agreement as to exactly what components make Hadoop, Hadoop. For a while, the question appeared in doubt, as one of the obvious pillars – the file system – was being contested with proprietary alternatives like MapR and IBM’s GPFS.

What’s interesting is that the two primary commercial providers that signed on for the proprietary file systems – IBM and EMC (via partnership with MapR) – have clarified their messages. They still offer the proprietary file systems in question, but both are now emphasizing that they also offer Apache versions. IBM made the announcement today, buried below the fold after its announced intention to acquire data federation search player, Vivisimo. The announcement had a bit of a grudging aspect to it – unlike Oracle, which has a full OEM agreement with Cloudera, IBM is simply stating that it will certify Cloudera’s Hadoop as one of the approved distributions for InfoSphere BigInsights – there’s no exchange of money or other skin in the game. If IBM also gets demand for the Hortonworks distro (or if it wants to keep Cloudera in its place), it’ll also likely add Hortonworks to the approved list.

Against this background is a technology that is a moving target. The primary drawback of HDFS – that there was no redundancy or failover with the NameNode (which acts as a file directory) – has been addressed with the latest versions of Hadoop. The other gap – POSIX compliance, which would let Hadoop be accessed through the NFS standard – matters only for very high, transactional-like (OK, not ACID) performance, which so far has not been an issue. If you want that kind of performance, Hadoop’s HBase offers more promise.

But just because the market has passed judgment on what comprises the Hadoop “kernel” (using some Linuxspeak), that doesn’t rule out differences in implementation. Teradata Aster and Sybase IQ are promoting their analytics data stores as swappable, more refined replacements for HBase (Hadoop’s column store), while upstarts like Hadapt are proposing to hang SQL data nodes onto HDFS.

When it comes to Hadoop, you gotta reverse the old maxim: The more things stay the same, the more things are actually changing.

04.16.12

Big Data and the Product Lifecycle

Posted in Application Lifecycle Management (ALM), Complex Engineered Systems, Product Lifecycle at 2:04 am by Tony Baer

Our Twitter feed went silent for a few days last week as we spent some time at a conference where chance conversations, personal reunions, and discovery were the point. In fact, this was one of the few events where attendees – like us – didn’t have our heads down buried in our computers. We’re speaking of Cyon Research’s COFES 2012 design engineering software conference, where we had the opportunity to explore the synergy of Big Data and the Product Lifecycle, why ALM and PLM systems can’t play nice, and how to keep a handle on finding the right data as product development adopts a 24/7 follow-the-sun strategy. It wasn’t an event of sessions in the conventional sense, but lots of hallways where you spent most of your time in chance, impromptu meetings. And it was a great chance to hook up with colleagues whom we haven’t seen in years.

There were plenty of contrarian views. There were a couple of keynotes in the conventional sense that each took different shots at the issue of risk. Retired Ford product life cycle management director Richard Riff took aim at conventional wisdom when it comes to product testing. After years of ingrained lean, six sigma, and zero defects practices – not to mention Ford’s old slogan that quality is job one – Riff countered with a provocative notion: sometimes the risk of not testing is the better path. It comes down to balancing the cost of defects vs. the cost of testing, the likely incidence of defects, and the reliability of testing. While we couldn’t repeat the math, in essence, it amounted to a lifecycle cost approach for testing. He claimed that the method even accounted for intangible factors, such as social media buzz or loss of reputation, when referring to recent, highly publicized quality issues with some of Ford’s rivals.

Xerox PARC computing legend Alan Kay made the case for reducing risk through a strategy that applied a combination of object-oriented design (of which he was one of the pioneers – along with the GUI, of course) and what sounded to us like domain-specific languages. Or more specifically, that software describes the function, then lets other programs automatically generate the programming to execute it. Kay decried the instability that we have come to accept with software design – which reminded us that since the mainframe days, we have become all too accustomed to hearing that the server is down. Showing some examples of ancient Roman design (e.g., a 2,000-year-old bridge in Spain that today still carries cars and looks well intact), he insisted that engineers can do better.

Some credit to host Brad Holtz, who deciphered that there really was a link between our diverging interests: Big Data and meshing software development with the product lifecycle. By the definition of Big Data – volume, variety, velocity, and value – Big Data is nothing new to the product lifecycle. CAD files, models, and simulations are extremely data-intensive and contain a variety of data types encompassing graphical and alphanumeric data. Today, the brass ring for the modeling and simulation world is implementing co-simulations, where models drive other models (the results of one drive the other).

But is anybody looking at the bigger picture? Modeling has been traditionally silo’ed – for instance, models are not typically shared across product teams, projects, or product families. Yet new technologies could provide the economical storage and processing power to make it possible to analyze and compare the utilization and reliability of different models for different scenarios – with the possible result being metamodels that provide frameworks for optimizing model development and parameters with specific scenarios. All this is highly data-intensive.

What about the operational portion of the product lifecycle? Today, it’s rare for products not to have intelligence baked into controllers. Privacy issues aside (they must be dealt with), machinery connected to networks can feed back performance data; vehicles can yield data while in the repair shop, or thanks to mobile devices, provide operational data while in movement. Add to that reams of publicly available data from services such as NOAA or the FAA, and now there is context placed around performance data (did bad weather cause performance to drop?). Such data could feed processes, ranging from MRO (maintenance, repair, and operation) and warranty, to providing feedback loops that can validate product tests and simulation models.

Let’s take another angle – harvesting publicly available data for the business. For instance, businesses could use disaster preparedness models to help their scenario planning, as described in this brief video snippet from last year’s COFES conference. Emerging organizations, such as the Center for Understanding Change, aim to make this a reality by making available models and expertise developed through tax dollars in the national laboratory system.

Big Data and connectivity can also be used to overcome gaps in locating expertise and speed product development. Adapting techniques from the open source software world, where software is developed collaboratively by voluntary groups of experts in the field, crowdsourcing is invading design and data science (we especially enjoyed our conversation with Kaggle’s Jeremy Howard).

A personal note on the sessions – the conference marked a reunion with folks with whom we have crossed paths over more than 20 years. Our focus on application development led us to engineered systems, an area of white space between software engineering and classic product engineering disciplines. And as noted above, that in turn brought us full circle to our roots covering the emergence of CADCAM in the 80s, as we had the chance to reconnect with many who continue to advance the engineering discipline. What a long, fun trip it’s been.

04.12.12

SAP and databases no longer an oxymoron

Posted in Big Data, Business Intelligence, Data Management, Database, Fast Data at 12:44 am by Tony Baer

In its rise to leadership of the ERP market, SAP shrewdly placed bounds around its strategy: it would stick to its knitting on applications and rely on partnerships with systems integrators to get critical mass implementation across the Global 2000. When it came to architecture, SAP left no doubt of its ambitions to own the application tier, while leaving the data tier to the kindness of strangers (or in Oracle’s case, the estranged).

Times change in more ways than one – and one of those ways is in the data tier. The headlines of SAP acquiring Sybase (for its mobile assets, primarily) and subsequent emergence of HANA, its new in-memory data platform, placed SAP in the database market. And so it was that at an analyst meeting last December, SAP made the audacious declaration that it wanted to become the #2 database player by 2015.

Of course, none of this occurs in a vacuum. SAP’s declaration to become a front line player in the database market threatens to destabilize existing relationships with Microsoft and IBM as longtime SAP observer Dennis Howlett commented in a ZDNet post. OK, sure, SAP is sick of leaving money on the table to Oracle, and it’s throwing in roughly $500 million in sweeteners to get prospects to migrate. But if the database is the thing, to meet its stretch goals, says Howlett, SAP and Sybase would have to grow that part of the business by a cool 6x – 7x.

But SAP would be treading down a ridiculous path if it were just trying to become a big player in the database market for the heck of it. Fortuitously, during SAP’s press conference on announcements of their new mobile and database strategies, chief architect Vishal Sikka tamped down the #2 aspirations as that’s really not the point – it’s the apps that count, and increasingly, it’s the database that makes the apps. Once again.

Back to our main point, IT innovation goes in waves; during emergence of client/server, innovation focused on the database, where the need was mastering SQL and relational table structures; during the latter stages of client/server and subsequent waves of Webs 1.0 and 2.0, activity shifted to the app tier, which grew more distributed. With emergence of Big Data and Fast Data, energy shifted back to the data tier given the efficiencies of processing data big or fast inside the data store itself. Not surprisingly, when you hear SAP speak about HANA, they describe an ability to solve more complex analytic problems or perform compound operational transactions. It’s no coincidence that SAP now states that it’s in the database business.

So how will SAP execute its new database strategy? Given the hype over HANA, how does SAP convince Sybase ASE, IQ, and SQL Anywhere customers that they’re not headed down a dead end street?

That was the point of the SAP announcements, which in the press release stated the near term roadmap but shed little light on how SAP would get there. Specifically, the announcements were:
• SAP HANA on BW is now going GA, and at the low (SMB) end SAP is coming out with aggressive pricing: roughly $3000 for SAP BusinessOne on HANA; $40,000 for HANA Edge.
• Ending a 15-year saga, SAP will finally port its ERP applications to Sybase ASE, with tentative target date of year end. HANA will play a supporting role as the real-time reporting adjunct platform for ASE customers.
• Sybase SQL Anywhere would be positioned as the mobile front end database atop HANA, supporting real-time mobile applications.
• Sybase’s event stream (CEP) offerings would have optional integration with HANA, providing convergence between CEP and BI – where rules are used for stripping key event data for persistence in HANA. In so doing, analysis of event streams could be integrated or directly correlated with historical data.
• Integrations are underway between HANA and IQ with Hadoop.
• Sybase is extending its PowerDesigner data modeling tools to address each of its database engines.

Most of the announcements, like HANA going GA or Sybase ASE supporting SAP Business suite, were hardly surprises. Aside from go-to-market issues, which are many and significant, we’ll direct our focus on the technology roadmaps.

We’ve maintained that if SAP were serious about its database goals, that it had to do three basic things:
1. Unify its database organization. The good news is that it has started down that path as of January 1 of this year. Of course, org charts are only the first step as ultimately it comes down to people.
2. Branding. Although long eclipsed in the database market, Sybase still has an identifiable brand and would be the logical choice; for now SAP has punted.
3. Cross-fertilize technology. Here, SAP can learn lessons from IBM which, despite (or because of) acquiring multiple products that fall under different brands, freely blends technologies. For instance, Cognos BI reporting capabilities are embedded into Rational and Tivoli reporting tools.

The third part is the heavy lift. For instance, given that data platforms are increasingly employing advanced caching, it would at first glance seem logical to blend in some of HANA’s in-memory capabilities to the ASE platform; however, architecturally, that would be extremely difficult as one of HANA’s strengths – dynamic indexing – would be difficult to implement in ASE.

On the other hand, given that HANA can index or restructure data on the fly (e.g., organize data into columnar structures on demand), the question is, does that make IQ obsolete? The short answer is that while memory keeps getting cheaper, it will never be as cheap as disk and that therefore, IQ could evolve as near-line storage for HANA. Of course that begs the question as to whether Hadoop could eventually perform the same function. SAP maintains that Hadoop is too slow and therefore should be reserved for offline cases; that’s certainly true today, but given developments with HBase, it could easily become fast and cheap enough for SAP to revisit the IQ question a year or two down the road.

Not that SAP Sybase is sitting still with Hadoop integration. They are providing MapReduce and R capabilities to IQ (SAP Sybase is hardly alone here, as most Advanced SQL platforms are offering similar support). SAP Sybase is also providing capabilities to map IQ tables into Hadoop Hive, slotting IQ as an alternative to HBase; in effect, that’s akin to a number of strategies to put SQL layers inside Hadoop (in a way, similar to what the lesser-known Hadapt is doing). And of course, like most of the relational players, SAP Sybase is also supporting bulk ETL/ELT loads from HDFS to HANA or IQ.

On SAP’s side for now is the paucity of Hadoop talent, so pitching IQ as an alternative to HBase may help soften the blow for organizations seeking to get a handle on Hadoop. But in the long run, we believe that SAP Sybase will have to revisit this strategy. Because, if it’s serious about the database market, it will have to amplify its focus to add value atop the new realities on the ground.

02.22.12

Informatica’s Stretch Goal

Posted in Big Data, Business Intelligence, Data Management, Database, Middleware at 8:00 am by Tony Baer

Informatica is within a year or two of becoming a $1 billion company, and the CEO’s stretch goal is to get to $3b.

Informatica has been on a decent tear. It’s had a string of roughly 30 consecutive growth quarters, growth over the last 6 years averaging 20%, and 2011 revenues nearing $800 million. Abbasi took charge back in 2004, lifting Informatica out of its midlife crisis by ditching an abortive foray into analytic applications, instead expanding from the company’s data transformation roots to data integration. Getting the company to its current level came largely through a series of acquisitions that then expanded the category of data integration itself. While master data management (MDM) has been the headliner, other recent acquisitions have targeted information lifecycle management (ILM), complex event processing (CEP), low latency messaging (ultra messaging), along with filling gaps in its B2B and data quality offerings. While some of those pieces were obvious additions, others such as ultra messaging or event processing were not.

CEO Sohaib Abbasi is talking about a stretch goal of $3 billion revenue. The obvious chunk is to deepen the company’s share of existing customer wallets. We’re not at liberty to say how much, but Informatica had a significant number of 6-figure deals. Getting more $1m+ deals will help, but on their own won’t triple revenue.
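To put the stretch goal in perspective, here is a back-of-the-envelope sketch using only the figures cited above: 2011 revenue of roughly $800 million and average growth of 20% a year. Holding that rate constant is our simplifying assumption, not a company forecast, and the function name is hypothetical.

```python
def years_to_reach(current, target, annual_growth):
    """Count full years of compound growth needed to reach the target."""
    if annual_growth <= 0:
        raise ValueError("annual_growth must be positive")
    years = 0
    revenue = current
    while revenue < target:
        revenue *= 1 + annual_growth
        years += 1
    return years


# Revenue in millions of dollars, starting from the 2011 figure.
print(years_to_reach(800, 1000, 0.20))  # 2
print(years_to_reach(800, 3000, 0.20))  # 8
```

At the historical pace, $1 billion is a couple of years out, while $3 billion would take most of a decade, which is why deeper wallet share alone is unlikely to get there.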

So how to get to $3 billion?
Obviously, two strategies: deepen the existing business while applying the original formula of expanding the footprint of what counts as data integration.

First, the existing business. Of the current portfolio, MDM is likely best primed to allow Informatica to more deeply penetrate the installed base. Most of its data integration clients haven’t yet done MDM, and it is not a trivial investment. And for MDM clients who may have started with a customer or product domain, there are always more domains to tackle. During Q&A, Abbasi listed MDM as having as much potential addressable market as the traditional ETL and data quality segments.

The addition of SAP and Oracle veteran Dennis Moore to the Informatica MDM team points to the classic tightrope for any middleware vendor that claims it’s not in the applications game – build more “solutions” or jumpstart templates to confront the same generic barrier that packaged applications software was designed to surmount: provide customers an alternative to raw toolsets or custom programming. For MDM, think industry-specific “solutions” like counter-party risk, or horizontal patterns like social media profiles. If you’re Informatica, don’t think analytic applications.

That’s part of a perennial debate (or rant) on whether middleware is the new enterprise application: you implement for a specific business purpose as opposed to a technology project, such as application or data integration, and you implement with a product that offers patterns of varying granularity as a starting point. Informatica MDM product marketing director Ravi Shankar argues it’s not an application because applications have specific data models and logic that become their own de facto silos, whereas MDM solutions reuse the same core metadata engine for different domains (e.g., customer, product, operational process). Our contention? If it solves a business problem and it’s more than a raw programming toolkit, it’s a de facto application. If anybody else cares about this debate, raise your hand.

MDM is typically a very dry subject, but demo’ing a social MDM straw man showing a commerce application integrated into Facebook sparked Twitter debate among analysts in the room. The operable notion is that such a use of MDM could update the customer’s (some might say, victim’s) profile by the associations that they make in social networks. An existing Informatica higher education client that shall remain anonymous actually used MDM to mine LinkedIn to prove that its grads got jobs.

This prompts the question: just because you can do it, should you? When a merchant knows just a bit too much about you – and your friends (who may not have necessarily opted in) – that more than borders on creepy. Informatica’s Facebook MDM integration was quite effective; as a pattern for social business, well, we’ll see.

So what about staking new ground? When questioned, Abbasi stated that Informatica had barely scratched the surface with productizing around several megatrend areas that it sees impacting its market: cloud, social media, mobile, and Big Data. More specifically:
• Cloud continues to be a growing chunk of the business. Informatica doesn’t have all of its tooling up in the cloud, but it’s getting there. Consumption of services from the Informatica Cloud continues to grow at a 100 – 150% annual run rate. Most of the 1500 cloud customers are new to Informatica. Among recent introductions is a wizard-driven Contact Validation service that verifies and corrects postal addresses from over 240 countries and territories. A new rapid connectivity framework further eases the ability of third parties to OEM Informatica Cloud services.
• Social media – there were no individual product announcements here per se, just that Informatica’s tools must increasingly parse data coming from social feeds. That covers MDM, data profiling and data quality. Much of it leverages HParser, the new Hadoop data parsing tool released late last year.
• Mobile – for now this is mostly a matter of making Informatica tools and apps (we’ll use the term) consumable on small devices. On the back end, there are opportunities for optimizing, virtualizing, and replicating data on demand to the edges of highly distributed networks. Aside from newly-announced features such as iPhone and Android support of monitoring the Informatica cloud, for now Informatica is making a statement of product direction.
• Big Data – Informatica, like other major BI and database vendors, has discovered Big Data with a vengeance over the past year. The ability to extract from Hadoop is nothing special – other vendors have that – but Informatica took a step ahead with release of HParser last fall. In general there’s growing opportunity for tooling in a variety of areas touching Hadoop, with Informatica’s data integration focus being one of them. We expect to see extension of Informatica’s core tools to not only parse or extract from Hadoop, but increasingly, work natively inside HDFS on the assumption that customers are not simply using it as a staging platform anymore. We also see opportunities in refinements to HParser providing templates or other shortcuts for deciphering sensory data. ILM is another obvious one. While Facebook et al might not archive or deprecate their Hadoop data, mere mortal enterprises will have to bite the bullet. Data quality in Hadoop in many cases may not demand the same degree of vigilance as SQL data warehouses, creating demand for lighter weight data profiling and cleansing tooling. And for other real-time, web-centric use cases, alternative stores like MongoDB, Couchbase, and Cassandra may become new Informatica data platform targets.

What, no exit talk?
Abbasi commented at the end of the company’s annual IT analyst meeting that this was the first time in recent memory that none of the analysts asked who would buy Informatica when. Buttonholing him after the session, we got his take, which, very loosely translated to Survivor terms, is that Informatica has avoided getting voted off the island.

At this point, Informatica’s main rivals – Oracle and IBM – have bulked up their data integration offerings to the point where an Informatica acquisition would no longer be gap filling; it would simply be a strategy of taking out a competitor – and with Informatica’s growth, an expensive one at that. One could then point to dark horses like EMC, Tibco, Teradata, or SAP (for obvious reasons we’ve omitted HP). A case might be made for EMC, or SAP if it remains serious in raising its profile as database player – but we believe both have bigger fish to fry. Never say never. But otherwise, the common thread is that data integration will not differentiate these players and therefore it is not strategic to their growth plans.

01.31.12

EMC’s Hadoop Strategy cuts to the chase

Posted in Big Data, Data Management at 8:00 am by Tony Baer

To date, Big Storage has been locked out of Big Data. It’s been all about direct attached storage for several reasons. First, Advanced SQL players have typically optimized architectures from data structure (using columnar), unique compression algorithms, and liberal usage of caching to juice response over hundreds of terabytes. For the NoSQL side, it’s been about cheap, cheap, cheap along the Internet data center model: have lots of commodity stuff and scale it out. Hadoop was engineered exactly for such an architecture; rather than speed, it was optimized for sheer linear scale.

Over the past year, most of the major platform players have planted their table stakes with Hadoop. Not surprisingly, IT household names are seeking to somehow tame Hadoop and make it safe for the enterprise.

Up till now, anybody with armies of the best software engineers that Internet firms could buy could brute force their way to scale out humungous clusters and, if necessary, invent their own technology, then share and harvest from the open source community at will. That is hardly a suitable scenario for the enterprise mainstream, so the common thread behind the diverse strategies of IBM, EMC, Microsoft, and Oracle toward Hadoop has been to make Hadoop more approachable.

What’s been conspicuously absent so far was a play from Big Optimized Storage. The conventional wisdom is that SAN or NAS are premium, architected systems whose costs might be prohibitive when you talk petabytes of data. Similarly, so far there has been a different operating philosophy behind the first generation implementations from the NoSQL world that assumed that parts would fail, and that five nines service levels were overkill. And anyway, the design of Hadoop brute forced the solution: replicate to have three unique copies of the data distributed around the cluster, as hardware is cheap.
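The cost side of that brute-force approach is easy to put in numbers. The sketch below simply multiplies out the three-copy replication described above; the function names and the 100 TB and 900 TB figures are hypothetical, chosen only for illustration.

```python
def raw_capacity_needed(dataset_tb, replication_factor=3):
    """Raw disk (in TB) needed to hold a dataset at the given replication."""
    return dataset_tb * replication_factor


def usable_capacity(raw_tb, replication_factor=3):
    """Unique data (in TB) that fits on a cluster's raw disk."""
    return raw_tb / replication_factor


# Hypothetical cluster sizes.
print(raw_capacity_needed(100))  # 300
print(usable_capacity(900))      # 300.0
```

Tripling raw capacity is tolerable on commodity disk, but the same multiplier applied to premium SAN or NAS capacity is what makes the conventional wisdom about prohibitive cost plausible.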

As Big Data gains traction in the enterprise, some of it will certainly fit this pattern of something being better than nothing, as the result is unique insights that would not otherwise be possible. For instance, if the cluster running your analysis of Facebook or Twitter data goes down, it probably won’t take the business with it. But as enterprises adopt Hadoop – and as pioneers stretch Hadoop to new operational use cases such as what Facebook is doing with its messaging system – those concepts of mission-criticality are being revisited.

And so, ever since EMC announced last spring that its Greenplum unit would start supporting and bundling different versions of Hadoop, we’ve been waiting for the other shoe to drop: When would EMC infuse its Big Data play with its core DNA, storage?

Today, EMC announced that its Isilon networked storage system was adding native support for Apache Hadoop’s HDFS file system. There were some interesting nuances to the rollout.

1. Big vendors are feeling their way around Hadoop
It’s interesting to see how IT household names are cautiously navigating their way into unfamiliar territory. EMC becomes the latest, after Oracle and Microsoft, to calibrate its Hadoop strategy in public.

Oracle announced its Big Data appliance last fall before it lined up its Hadoop distribution. Microsoft ditched its Dryad project built around its HPC Server. Now EMC has recalibrated its Hadoop strategy; when it first unveiled its Hadoop strategy last spring, the spotlight was on the MapR proprietary alternatives to the HDFS file system of Apache Hadoop. It’s interesting that vendor initial announcements have either been vague, or have been tweaked as they’ve waded into the market. For EMC’s shift, more about that below.

2. What is Hadoop? For EMC, HDFS is the mainstream, not MapR

MapR’s strategy (and IBM’s along with it, regarding GPFS) has prompted debate and concern in the Hadoop community about commercial vendors forking the technology. As we’ve ranted previously, Hadoop’s growth will be tied, not only to megaplatform vendors that support it, but the third party tools and solutions ecosystem that grows around it. For such a thing to happen, ISVs and consulting firms need to have a common target to write against, and having forked versions of Hadoop won’t exactly grow large partner communities.

Regarding EMC, the original strategy was two Greenplum Hadoop editions: a Community Edition with a free Apache distro and an Enterprise Edition that bundled MapR, both under the Greenplum HD branding umbrella. At first blush, it looked like EMC was going to earn the bulk of its money from the proprietary side of the Hadoop business. What’s significant is that the new announcement of Isilon support pertains to the HDFS open source side. More to the point, EMC is rebranding and subtly repositioning its Greenplum Hadoop offerings: Greenplum HD is the Apache HDFS edition with the optional Isilon support, and Greenplum MR is the MapR version, which is niche targeted towards advanced Hadoop use cases that demand higher performance.

Update: Even if EMC later extends Isilon support to Greenplum MR, it doesn’t change the core positioning.

Coming atop recent announcements from Oracle and Microsoft that have come out clearly on the side of OEM’ing Apache rather than anything limited or proprietary, this amounts to an unqualified endorsement of Apache Hadoop/HDFS as not only the formal, but also the de facto standard. This reflects emerging conventional wisdom that the enterprise mainstream is leery about lock-in to anything that smells proprietary for technology where they still are in the learning curve. Other forks may emerge, but they will not be at the base file system layer. This leaves IBM and MapR pigeonholed – admittedly, there will be API compatibility, but clearly both are swimming upstream.

3. Central Storage is newest battleground for Scale Up vs. Scale Out Hadoop

As noted earlier, Hadoop’s heritage has been the classic Internet data center scale-out model. The advantage is that, leveraging Hadoop’s highly linear scalability, organizations could expand their clusters quite easily by adding more commodity servers and disks. Pioneers or purists would scoff at the notion of an appliance approach because it was always simply a matter of scaling out inexpensive, commodity hardware, rather than paying premiums for big vendor boxes.

In blunt terms, the choice is whether you pay now or pay later. As mentioned before, do-it-yourself compute clusters require sweat equity – you need engineers who know how to design, deploy, and operate them. The flipside is that many, arguably most corporate IT organizations either lack the skills or the capital. There are various solutions to what might otherwise appear a Hobson’s Choice:
• Go to a cloud service provider that has already created the infrastructure, such as what Microsoft is offering with its Hadoop-on-Azure services;
• Look for a happy, simpler medium such as Amazon’s Elastic MapReduce on its DynamoDB service;
• Subscribe to SaaS providers that offer Hadoop applications (e.g., social network analysis, smart grid as a service) as a service;
• Get a platform and have a systems integrator put it together for you (key to IBM’s BigInsights offering, and applicable to any SI that has a Hadoop practice);
• Go to an appliance or engineered systems approach that puts Hadoop and/or its subsystems in a box, such as with Oracle Big Data Appliance or EMC’s Greenplum DCA. The systems engineering is mostly done for you, but the increments for growing the system can be much larger than simply adding a few x86 servers here or there (Greenplum HD DCA can scale in groups of 4 server modules). Entry or expansion costs are not necessarily cheap, but then again, you have to balance capital cost against labor.
• Surround Hadoop infrastructure with solutions. This is not a mutually exclusive strategy; unless you’re Cloudera or Hortonworks, which make their business bundling and supporting the core Apache Hadoop platform, most of the household names will bundle frameworks, algorithms, and eventually solutions that in effect place Hadoop under the hood. For EMC, the strategy is their recent announcement of a Unified Analytics Platform (UAP) that provides collaborative development capabilities for Big Data applications. EMC is (or will be) hardly alone here.

With EMC’s new offering, the scale-up option tackles the next variable: storage. This is the natural progression of a market that will address many constituencies, and where there will be no single silver bullet that applies to all.

01.10.12

Oracle fills another gap in its Big Data offering

Posted in Big Data, Business Intelligence, Database at 2:35 pm by Tony Baer

When we last left Oracle’s Big Data plans, there was definitely a missing piece. Oracle’s Big Data Appliance as initially disclosed at last fall’s OpenWorld was a vague plan that appeared to be positioned primarily as an appliance that would accompany and feed data to Exadata. Oracle did specify some utilities, such as an enterprise version of the open source R statistical processing program that was designed for multithreaded execution, plus a distribution of a NoSQL database based on Oracle’s BerkeleyDB as an alternative to Apache Hive. But the emphasis appeared to be extraction and transformation of data for Exadata via Oracle’s own utilities that were optimized for its platform.

As such, Oracle’s plan for Hadoop was competition, not for Cloudera (or Hortonworks), which featured a full Apache Hadoop platform, but EMC which offered a comparable, appliance-based strategy that pairs Hadoop with an Advanced SQL data store; and IBM, which took a different approach by emphasizing Hadoop as an analytics platform destination enhanced with text and predictive analytics engines, and other features such as unique query languages and file systems.

Oracle’s initial Hadoop blueprint lacked explicit support of many pieces of the Hadoop stack such as HBase, Hive, Pig, Zookeeper, and Avro. No more. With Oracle’s announcement of general availability of the Big Data appliance, it is filling in the blanks by disclosing that it is OEM’ing Cloudera’s CDH Hadoop distribution, and more importantly, the management tooling that is key to its revenue stream. For Oracle, OEM’ing Cloudera’s Hadoop offering fully fleshes out its Hadoop distribution and positions it as a full-fledged analytic platform in its own right; for Cloudera, the deal is a coup that will help establish its distribution as the reference. It is fully consistent with Cloudera’s goal to become the Red Hat of Hadoop as it does not aspire to spread its footprint into applications or frameworks.

Of course, whenever you put Oracle in the same sentence as OEM deal, the question of acquisition inevitably pops up. There are several reasons why an Oracle acquisition of Cloudera is unlikely.

1. Little upside for Oracle. While Oracle likes to assert maximum control of the stack, from software to hardware, its foray into productizing its own support for Red Hat Enterprise Linux has been strictly defensive; its offering has not weakened Red Hat.

2. Scant leverage. Compare Hadoop to MySQL and you have a Tale of Two Open Source projects. One is hosted and controlled by Apache, the other is hosted and controlled by Oracle. As a result, while Oracle can change licensing terms for MySQL, which it owns, it has no such control over Hadoop. Were Oracle to buy Cloudera, another provider could easily move in to fill the vacuum. The same would happen to Cloudera if, as a prelude to such a deal, it began forking from the Apache project with its own proprietary add-ons or substitutions.

OEM deals are a major stage of building the market. Cloudera has used its first mover advantage with Hadoop well, with deals with Dell and now Oracle. Microsoft in turn has decided to keep the “competition” honest by signing up Hortonworks to (eventually) deliver the Hadoop engine for Azure.

OEM deals are important for attaining another key goal in developing the Hadoop market: defining the core stack – as we’ve ranted about previously. Just as Linux took off once a robust kernel was defined, the script will be identical for Hadoop. With IBM and EMC/MapR forking the Apache stack at the core file system level, and with niche providers like Hadapt offering replacements for HBase and Hive, there is growing variability in the Hadoop stack. However, to develop the third party ecosystem that will be vital to the development of Hadoop, a common target (and APIs for where the forks occur) must emerge. A year from now, the outlines of the market’s decision on what makes Hadoop Hadoop will become clear.

The final piece of the trifecta will be commitments from the Accentures and Deloittes of the world to develop practices based on specific Hadoop platforms. For now they are still keeping their cards close to their vests.

11.20.11

Who Owns the Product Lifecycle?

Posted in Application Development, Product Lifecycle at 9:14 pm by Tony Baer

Turn on the ignition of your car, back out of the parking space and go into drive. As you engage the transmission, gently tap the accelerator, and step on the brake, you don’t directly interact with the powertrain. Instead, your actions are detected by sensors and executed by actuators on electronic control units that then get the car to shift, move, then stop.

Although in the end, Toyota’s recall issues from 2009-10 wound up isolating misadjusted accelerator controls, speculation around the recalls directed the spotlight to the prominent role of embedded software, prompting the realization that today when you operate your car, you are driving by wire.

Today’s automobiles are increasingly looking a lot more like consumer electronics products. They contain nearly as much software as an iPhone, and in the future will contain even more. According to IDC, the market for embedded software that is designed into engineered products (like cars, refrigerators, airplanes, and consumer electronics) will double by 2015.

Automobiles are the tip of the iceberg when it comes to smart products; today most engineered products, from refrigerators to industrial machinery and aircraft, feature smart control. Adding intelligence allows designers to develop flexible control logic that brings more functionality to products and provides ways to optimize operation to gain savings in weight, bulk, and cost.

Look at the hybrid car: to function, the battery, powertrain, gas and electric engines, and braking systems must all interoperate to attain fuel economy. It takes software to determine when to let the electric engine run or let the battery recharge. The degree of interaction between components is greater than in traditional electromechanical product designs. Features such as anti-lock braking or airbag deployment depend on the processing of data from multiple sources – wheel rotation, deceleration rate, steering, etc.

The growth of software content changes the ground rules for product development, which has traditionally been a very silo’ed process. There are well established disciplines in mechanical and electrical engineering, with each having their own sets of tools, not to mention claims to ownership of the product design. Yet with software playing the role as the “brains” of product operation, there is the need for engineering disciplines to work more interactively across silos rather than rely on systems engineers to crack the whip on executing the blueprint.

We were reminded of this after a rather enjoyable, freewheeling IEEE webcast that we had with IBM Rational’s Dominic Tavasolli last week.

Traditionally, product design fell under the mechanical engineering domain, which designed the envelope and specified the geometry, components, materials, physical properties (such as resistance to different forms of stress) and determined the clearance within which electronics could be shoehorned.

Drill down deeper and you’ll note that each engineering domain has its full lifecycle of tools. It’s analogous to enterprise software development organizations, where you’ll often stumble across well entrenched camps of Microsoft, Java, and web programmers. Within the lifecycle there is a proliferation of tools and languages to deal with the wide variety of engineering problems that must be addressed when developing a product. Unlike the application lifecycle, where you have specific tools that handle modeling or QA, on the engineering side there are multiple tools because there are many different ways to simulate a product’s behavior in the real world to perform the engineering equivalent of QA. You might want to test mechanical designs for wind shear, thermal deformation, or compressive stresses, and electrical ones for their ability to handle voltage and disperse heat from processing units.

Now widen out the picture. Engineering and manufacturing groups each have their own definitions of the product. It is expressed in the bill of materials (BOM): engineering has its own BOM, which details the design hierarchy, while the manufacturing BOM itemizes the inventory materials and the manufacturing processes needed to fabricate and assemble the product. That sets the stage for the question of who owns the product lifecycle management (PLM) process: the CADCAM vs. the ERP folks.
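To make the difference between the two views concrete, here is a minimal sketch in Python. The part names and quantities are invented for illustration, and the flat roll-up is a simplification: a real manufacturing BOM would also carry the processes and inventory data mentioned above.

```python
from collections import Counter

# Hypothetical engineering BOM: a design hierarchy in which each assembly
# maps to its (child, quantity) pairs. Items with no entry are purchased
# or fabricated parts.
ENGINEERING_BOM = {
    "brake_assembly": [("caliper", 2), ("sensor_module", 1)],
    "caliper": [("piston", 2), ("bolt", 4)],
    "sensor_module": [("wheel_speed_sensor", 1), ("bolt", 2)],
}


def flatten_bom(bom, item, quantity=1):
    """Roll a hierarchical BOM up into total quantities of leaf parts,
    which is closer to the itemized list manufacturing works from."""
    totals = Counter()
    children = bom.get(item)
    if not children:
        # Leaf part: count it directly.
        totals[item] += quantity
        return totals
    for child, child_qty in children:
        totals.update(flatten_bom(bom, child, quantity * child_qty))
    return totals


for part, qty in sorted(flatten_bom(ENGINEERING_BOM, "brake_assembly").items()):
    print(part, qty)
```

Both structures describe the same product, but neither can be read straight off the other without a translation step, and that translation step is where the ownership argument starts.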

Into the mix between the different branches of engineering and the silos between engineering and manufacturing, now introduce the software engineers. They used to be an afterthought, yet today their programs affect not only how product components and systems behave, but in many cases the physical specifications as well. For instance, if you can design software to enable a motor to run more efficiently, the mechanical engineers can then design a smaller, lighter weight engine.

In the enterprise computing world, we’ve long gotten hung up on the silos that divide different parts of IT from itself – the developers vs. QA, DBAs, enterprise architects, systems operations – or IT from the business. However, the silos that plague enterprise IT are child’s play compared to the situation in product development where you have engineering groups paired off against each other, and against manufacturing.

OK, so the product lifecycle is a series of fiefdoms – why bother or care about making it more efficient? There is too much at stake in the success of a product: there are the constantly escalating pressures to squeeze time, defects, and cost out of the product lifecycle. That’s been the routine ever since the Japanese introduced American concepts of lean manufacturing back in the 1980s. But as automobiles and other complex engineered products add more intelligence, the challenge is leveraging the rapid innovation of the software and consumer electronics industries for product sectors where, of necessity, lead times will stretch into one or more years.

There is no easy solution because there is no single solution. Each industry has different product characteristics that impact the length of the lifecycle and how product engineering teams are organized. Large, highly complex products such as automobiles, aircraft, or heavy machinery will have long lead times because of supply chain dependencies. At the other end of the scale, handheld consumer electronics or biomedical devices might not have heavy supply chain dependencies. But, for instance, smart phones have short product lifespans and are heavily driven by the fast pace of innovation in processing power and software capabilities, meaning that product lifecycles must be quicker in order for new products to catch the market window. Biomedical devices on the other hand are often compact, but have significant regulatory hurdles to surmount, which impacts how the devices are tested.

The product lifecycle is a highly varied creature. The common thread is the need to more effectively integrate software engineering, which in turn is forcing the issue of integration and collaboration between other engineering disciplines. It is no longer sufficient to rely on systems engineers to get it together in the end – as manufacturers learned the hard way, it costs more to rework a design that doesn’t fit together, doesn’t perform well, or can’t be readily assembled with existing staff and facilities. The rapid evolution of software and processors also forces the issue of whether and where agile development processes can be coupled with linear or hierarchical development processes that are necessary for long-fuse products.

There is no single lifecycle process that will apply to all sectors, and no single set of tools that can perform every design and test function necessary to get from idea to product. Ultimately, the answer – as loose as it is – is that larger product development organizations should work on the assumption that there are multiple sources of truth. The ALM and PLM worlds have at best worked warily at arm’s length from each other as there is a DMZ when it comes to requirements, change, and quality management. The reality is that no single constituency owns the product lifecycle – get used to federation that will proceed on rules of engagement that will remain industry- and organization-specific.

Ideally it would be great to integrate everything. Good luck. With the exception of frameworks that are proprietary for specific vendors, there is no associativity between tools that provides a process-level integration. The best that can be expected at this point is at the data exchange level.

It’s a start.

11.11.11

What will Hadoop be when it grows up?

Posted in Big Data, Business Intelligence, Data Management, Database at 6:26 pm by Tony Baer

Hadoop World was sold out and it seemed like “For Hire” signs were all over the place – or at least that’s what it said on the slides at the end of many of the presentations. “We’re hiring, and we’re paying 10% more than the other guys,” declared a member of the office of the CIO at JPMorgan Chase in a conference keynote. Not to mention predictions that there’s big money in big data. Or that Accel Partners announced a new $100 million venture fund for big data startups; Cloudera scored $40 million in Series D funding; and rival Hortonworks previously secured $20 million in its Series A round.

These are heady days. For some like Matt Asay it’s time to voice a word of caution for all the venture money pouring into Hadoop: Is the field bloating with more venture dollars than it can swallow?

The resemblance to Java 1999 was more than coincidental; like Java during the dot com bubble, Hadoop is a relatively new web-related technology undergoing its first wave of commercialization ahead of the buildup of the necessary skills base. We haven’t seen such a greenfield opportunity in the IT space in over a decade. And so the mood at the conference became a bit heady – where else in the IT world today is the job scene a seller’s market?

Hadoop has come a long way in the past year. A poll of conference attendees showed at least 200 petabytes under management. And while Cloudera has had a decent logo slide of partners for a while, it is no longer the lonely voice in the wilderness for delivering commercial distributions and enterprise support of Hadoop. Within this calendar year alone, Cloudera has finally drawn the competition to legitimize Hadoop as a commercial market. You’ve got the household names from data management and storage – IBM, Oracle, EMC, Microsoft, and Teradata – jumping in.

Savor the moment. Because the laws of supply and demand are going to rectify the skills shortage in Hadoop and MapReduce and the market is going to become more “normal.” Colleagues like Forrester’s Jim Kobielus predict Hadoop is going to enter the enterprise data warehousing mainstream; he’s also gone on record that interactive and near real-time Hadoop analytics are not far off.

Nonetheless, Hadoop is not going to be the end-all; with the learning curve, we’ll understand the use cases where Hadoop fits and where it doesn’t.

But before we declare victory and go home, we’ve got to get a better handle on what Hadoop is and what it can and should do. In some respects, Hadoop is undergoing a natural evolution that happens with any successful open source technology: there are always questions over what is the kernel and where vendors can differentiate.

Let’s start with the Apache Hadoop stack, which is increasingly resembling a huge brick wall where things are arbitrarily stacked atop one another with no apparent order, sequence, or interrelationship. Hadoop is not a single technology or open source project but – depending on your perspective – an ecosystem or a tangled jumble of projects. We won’t bore you with the full list here, but Apache projects are proliferating. That’s great if you’re an open source contributor as it provides plenty of outlets for innovation, but if you’re at the consuming end in enterprise IT, the last thing you want is to have to maintain a live scorecard on what’s hot and what’s not.

Compounding the situation, there is still plenty of experimentation going on. Like most open source technologies that get commercialized, there is the question of where the open source kernel leaves off and vendor differentiation picks up. For instance, MapR and IBM each believe it is in the file system, with both having their own answers to the inadequacies of the core Hadoop file system (HDFS).

But enterprises need an answer. They need to know what makes Hadoop, Hadoop. Knowing that is critical, not only for comparing vendor implementations, but also for software compatibility. Over the coming year, we expect others to follow Karmasphere and create development tooling, and we also expect new and existing analytic applications to craft solutions targeted at Hadoop. If that’s the case, we had better know where to insist on compatibility. Defining Hadoop the way that Supreme Court justice Potter Stewart defined pornography (“I know it when I see it”) just won’t cut it.

Of course, Apache is the last place to expect clarity as that’s not its mission. The Apache Foundation is a meritocracy. Its job is not to pick winners, although it will step aside once the market pulls the plug as it did when it mothballed Project Harmony. That’s where the vendors come in – they package the distributions and define what they support. What’s needed is not an intimidating huge rectangle showing a profile, but instead a concentric circle diagram. For instance, you’d think that the file system would be sacred to Hadoop, but if not, what are the core building blocks or kernel of Hadoop? Put that at the center of the circle and color it a dark red, blue, or the most convincing shade of elephant yellow. Everything else surrounds the core and is colored pale. We call upon the Clouderas, Hortonworks, IBMs, EMCs et al to step up to the plate and define Hadoop.

Then there’s the question of what Hadoop does. We know what it’s done traditionally. It’s a large distributed file system that is used for offline – a.k.a. batch – analytic runs grinding through ridiculous amounts of data. Hadoop literally chops huge problems down to size thanks to a lot of things: it has a simple file structure and it brings computation directly to the data; leverages cheap commodity hardware; supports scaled-out clustering; has a highly distributed and replicated architecture; and uses the MapReduce pattern for dividing and pipelining jobs into lots of concurrent threads, and mapping them back to unity.
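For readers who have not seen the MapReduce pattern up close, here is a minimal single-process sketch in Python, using the customary word count example. It is our own illustration and does not use the Hadoop API; the function names are assumptions, and on a real cluster the map and reduce steps would run in parallel across many nodes rather than in one loop.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(chunk: str) -> List[Tuple[str, int]]:
    """Map: turn one chunk of input into (key, value) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values that share a key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: collapse each key's values into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks: List[str]) -> Dict[str, int]:
    # Each chunk stands in for a block of a file stored on a different node.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle(mapped))


print(word_count(["big data", "big clusters"]))
```

Because each chunk is mapped independently, the work divides cleanly across machines, and that independence is what makes the near-linear scale-out possible.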

But we also caught a presentation from Facebook’s Jonathan Gray on how Hadoop and its HBase column store were adapted to real-time operation for several core applications at Facebook such as its unified messaging system, the polar opposite of a batch application. In summary, there were a number of brute force workarounds to make Hadoop and HBase more performant, such as extreme denormalization of data; heavy reliance on smart caching; and use of inverted indexes that point to the physical location of data, and so on. There’s little doubt that Hadoop won’t become a mainstream enterprise analytic platform until performance bottlenecks are addressed. Not surprisingly, the HBase Apache project is targeting interactivity as one of its top development goals.
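As a reminder of what an inverted index buys you, here is a minimal Python sketch. It is not Facebook’s implementation; the row keys and message text are invented, and a plain in-memory dictionary stands in for whatever structure actually records where the data lives.

```python
from collections import defaultdict
from typing import Dict, List


def build_inverted_index(rows: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each term to the sorted row keys whose text contains it."""
    index = defaultdict(set)
    for row_key, text in rows.items():
        for term in text.lower().split():
            index[term].add(row_key)
    return {term: sorted(keys) for term, keys in index.items()}


def lookup(index: Dict[str, List[str]], term: str) -> List[str]:
    """Return the row keys for a term without scanning any message text."""
    return index.get(term.lower(), [])


# Hypothetical messages keyed by row key.
messages = {
    "msg1": "lunch at noon",
    "msg2": "Noon meeting moved",
}
index = build_inverted_index(messages)
print(lookup(index, "noon"))
```

The lookup goes straight from a term to the rows that hold it, trading extra storage and write-time work for a fast read, which is the same bargain as the denormalization and caching workarounds.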

Conversely, we also heard lots of mention about the potential for Hadoop to function as an online alternative to offline archiving. That’s fed by an architectural design assumption that Big Data analytic data stores allow organizations to analyze all the data, not just a sample of it. Organizations like Yahoo have demonstrated dramatic increases in click-through rates from using Hadoop to dissect all user interactions. That’s instead of using MySQL or another relational data warehouse that can only analyze a sampling. And the Yahoos and Googles of the world currently have no plan to archive their data – they will just keep scaling their Hadoop clusters out and distributing them. Facebook’s messaging system, which was used for rolling out real-time Hadoop, is also designed around the assumption that old data will not be archived.

The challenge is that the same Hadoop cannot be all things to all people. Optimizing the same data store for interactive and online archiving is like violating the laws of gravity – either you make the storage cheap or you make it fast. Maybe there will be different flavors of Hadoop, as data in most organizations outside the Googles, Yahoos, or Facebooks of the world is more mortal – as are the data center budgets.

Admittedly, there is an emerging trend to brute force design databases for mixed workloads – that’s the design pattern behind Oracle’s Exadata. But even Oracle’s Exadata strategy has limitations in that its design will be overkill for small to midsize organizations, and that is exactly why Oracle came out with the Oracle Database Appliance. Same engine, but optimized differently. As few organizations will have Google’s IT budget, Hadoop will also have to have personas – one size won’t fit all. And the Hadoop community – Apache and vendor alike – has got to decide what Hadoop’s going to be when it grows up.

10.05.11

The Elegance of Steve Jobs

Posted in OS/Platforms, Technology Market Trends at 9:17 pm by Tony Baer

Outside of politicians there are few individuals that have truly changed the way we live. It’s more than coincidental that Steve Jobs named his company after the record company of The Beatles, the group of four individuals who changed the musical tastes of our generation.

Steve Jobs’ life was obviously too short, but in that short life he crammed four public lives. He was one of the first in Silicon Valley who saw a personal future for the technology being invented there; that culminated with the Apple II. His next life introduced the GUI; after a false start with Lisa, the Mac was a fully realized system that made Apple the de facto publishing machine. It also transformed Apple into a corporation, a challenge for which Jobs was not yet prepared. His third life was NeXT, which provided the springboard for his fourth and final life, returning to Apple.

It would be an accomplishment on its own to say that Jobs returned Apple to its former glory. That’s an understatement. Under his (final) watch, Apple evolved beyond being a computer company, changing the way we consume music and media; significantly it was Jobs who finally got the record companies to agree on a common pricing model. Then he redefined the mobile experience with the iPhone, and introduced a new form of computing with the iPad.

In so doing, Apple has changed our lives and changed industries. Although music downloads were going to happen regardless of the iPod, it made obsolete not only CDs, but also record stores, and arguably, albums. It also made it easier for garage bands everywhere to distribute their music and bypass the record company, a situation from which the record companies have yet to recover. It’s also changing the nature of the phone business, and realigning major handset providers.

But most of all we’ll miss Steve Jobs’ sense of style. The minimalism that was Apple provided a sense of elegance and peace that cuts through the noise of our everyday lives. For that alone, thank you Steve Jobs.
