Home
About
Perspectives
Writings
Contact
Resources
WritingsPerspectivesCollateralMediaResearchOverview

ComputerWire

Data Quality --Methodology and Payback

Issue Date: January 1997
Issue Number: 7.08
Category: INDUSTRY

Data warehousing and enterprise client-server migrations have made organizations painfully aware of the poor state of their data. Cleaning up the mistakes can be costly - but it can be highly rewarding, as this report goes to show the high upfront costs will produce a solid return in the long term.

Most organizations take their data quality for granted. When data is entered, on-line transaction systems may include validation routines that check the format, and in some cases, the content of the data being entered. Chances are, the newer the transaction system, the more sophisticated the validation checks.

Until the advent of data warehousing, for most organizations data quality, if a priority, was considered a tactical chore. Consumer marketing organizations relied on service bureaus to 'process' their customer lists as part of direct mail fulfillment services, while most other organizations regarded incomplete archival data as better than none at all. Several trends are responsible for the sudden surge in interest:

* Data warehouses are forcing organizations to make decisions regarding the management of legacy data. Their responses are mixed: Some take an 80/20 approach, contending that data warehouses can be useful even if the data isn't perfect because they are not mission-critical systems. Other organizations believe just the opposite: that faulty data in data warehouses defeats the purpose of extracting intelligence from historical data.

* Enterprise client-server ERP migrations, which often consolidate the operations of multiple, formerly standalone systems, are forcing organizations to pay attention to the quality of their data as part of their reengineering efforts.

* Householding has become a powerful inducement for consumer-oriented marketing organizations, including retailing, telecommunications, and banking. These organizations believe that they could market to existing consumers far more effectively (and cross-sell related products and services) if they only had an accurate picture of who in each household or family was already buying which products and services.

Is data cleansing worth it? In this report, we examine the leading approaches and their impacts on project cost and budgeting.

LOOKING FOR A SOURCE SOLUTION
Data cleansing isn't a new process. Many organizations incorporated validations checks for data entry which managed part of the problem. Other organizations have mounted special efforts to correct data. For instance, at a major telecommunications carrier, small scale projects were implemented for specific product lines in past years, but until recent advances in hardware price/performance and the availability of configurable software, 'the magnitudes made it impossible to deal with', the project leader noted. Upcoming deregulation of the market, along with advances in hardware price/performance and the recent availability of off-the-shelf, configurable, maintainable software made the difference.

Although data warehousing projects have taken the data cleansing issue off the back burner, most organizations are not progressing further than high-level data reformatting. According to an informal survey conducted by the data extraction tool vendor Prism, only 15% - 20% of Prism customers are currently using data cleansing tools. Prism also reports that among its customer base, the amount of time spent on data cleansing is still quite modest. According to an informal survey of customer representatives, about 5% - 10% of total project time and money is spent on data cleansing; that compares to 25%- 30% for data extraction and transformation, with the remainder of time directed to analysis, design, testing, and training. Nonetheless, Prism reports that interest among its customer base has grown over the past year.

* Costs and Budgeting: Data cleansing isn't cheap. Serious efforts which comb millions of records begin at $250,000, including the cost of software and staff resources for well-bounded problems such as name and address matching. Prior to its acquisition by New York's Chemical Bank, Chase Manhattan Bank invested nearly 100 staff years on cleansing name, address, and account information for over 20 million customers.

When dealing with such large volumes of data, organizations must adopt a form of triage to keep their cleansing projects on budget. 'You have to decide if a change that only impacts 0.03% of the data is worthwhile', commented the lead analyst for a name and address cleansing initiative at a major telecommunications carrier.

The rewards can be huge. One organization recently invested $250,000 in tools and staff time to clean up tens of millions of customer names and addresses, plus another $1.2 million for syndicated demographic data to fill in information gaps. It used the results to develop pinpoint direct marketing campaigns which netted over $60 million in revenues, roughly triple the returns of what such campaigns usually earned.

Not surprisingly, while many organizations are willing to discuss costs, few are willing to discuss the returns on data cleansing because they consider data quality as a unique competitive advantage. Furthermore, without side-by-side tests of clean and dirty data samples, it is difficult to determine that any improved returns were due to the use of better lists or the content of the marketing campaign itself.

Organizations have used a wide range of measures to clean their data, ranging from routine data entry validations to service bureaus and software tools. The traditional approach is for organizations to check the quality of their data during routine data entry or data migration/extraction operations. In these cases, data cleansing is part of an overall task, and is therefore not budgeted or staffed separately. Conversely, organizations that invest in tools from the likes of Vality, Harte Hanks, or ISI must budget separately for the tool and its implementation.

* Responsibilities: This varies, depending on the scope of the data involved and the organization's priorities. In some cases, line of business (LOB) personnel are made responsible, while in other cases, it becomes the guarantor of data quality. In many cases, LOB personnel are responsible for departmental or workgroup applications and databases, while IT staff is responsible for enterprise level implementations. However, there is no cut and dry formula; the division of responsibility often has more to do with the IT organization's internal role as facilitator or owner of enterprise data resources.

In most cases, active data cleansing projects are usually managed by IT because of its technical knowledge of managing data. However, like any business software project, data cleansing requires a blend of IT and domain expertise. Domain experts are usually consulted (and may devote part of their time to these projects) for decisions regarding data sources and content, because the end result from many projects is a form of expert system. According to one project leader, domain experts are hard to find and recruit. At a telecommunications carrier, a project to scrub nearly 100 million consumer records involved a ten-person team including a mix of IT, name and address matching, and marketing experience. Most people on the team had at least 5 - 10 years experience in their respective fields. The team began with four people who had domain expertise, and as the project advanced, IT people were added.

On the technical side, the team's skillset included data manipulation, text processing, mathematics, pattern recognition, and statistical processing skills. Domain experts familiar with variations in address patterns and naming conventions, helped team members devise or apply the proper algorithms for deconstructing lines of text into standard, formatted, recognizable data.

OUTLINING GENERAL APPROACHES
Data Entry Validations and Data Cleansing: This includes several lines of defense, the first of which occurs at data entry. Validation routines can be implemented which check:

* Formatting (e.g., product serial numbers contain a specific number of characters, some of which are alphabetic or numeric).

* Data ranges (e.g., date field values cannot exceed '12' for months or '31' for days; or the date of a customer shipment must be later than the date the goods were originally received in the warehouse; or the 'sold to' amount must be greater than the raw component price).

* Required fields (e.g., all US addresses require zip codes, or all government regulated items require specific data such as product inspection grades or shelf-life expiration dates).

More recent software applications often contain more sophisticated validation logic or other devices, such as the use of pop-up menus or dialogue boxes which limit data entry to approved values or codes.

Because they are the first line of defense, data entry validations can nip data quality problems at the source, but they do not prevent them. A data entry clerk can still input data or codes that are within range, but still incorrect. If the goal is to get as close to 100% accuracy as possible, data entry validations can help reduce data cleansing workloads down the line, but they do not replace them.

Nonetheless, many IT operations consider them sufficient guarantee that the data is usable. 'We feel comfortable with the quality of our data', said the manager of data mart projects for a leading pharmaceutical manufacturer. He added, 'We're not doing a lot of data analysis and clean-up because we have these mature [IMS] systems where we've put in all the edits and constraints necessary to a good application system.'

At this point, data quality is not a separately budgeted item which requires staff, special software tools, or other resources. Instead, the incorporation of validation checks is considered part of the budget for developing and maintaining applications.

Data Extraction and Meta Data Tools: The second line of defense (when building a data warehouse), these tools are used to identify source data, transform it, and move it to a target database. Like the data validation checks mentioned earlier, these tools can be used as a coarse method of cleansing data. Some tools include verification capabilities that allow users to statistically sample the consistency of incoming data and edit values, either individually or through global replacements (e.g., standardizing abbreviations, deleting unwanted data types). In some cases, data extractions tools (e.g., Carleton Passport) allow users to test data using specific rules.

Data cleansing is just one element of a more generalized solution for moving data to a data warehouse. Other steps include:

* Meta Data Mapping: building data models which identify source data and necessary conversion operations, generate data conversion audit trails, and specify target databases.

* Change Management: identifying which data elements have changed for updating purposes.

* Data Extraction: the process of retrieving data from a source database or file.

* Data Conversion: the process of merging, consolidating, and sorting data before it is loaded into the target database.

Traditional market leaders - Carleton (Burlington, MA), Prism (Sunnyvale, CA), Evolutionary Technologies Inc. (ETI) (Austin, TX) - have provided groups of tools designed to extract data, primarily from mainframe databases and files (the primary source of historical data in most organizations). They have recently been joined by Informatica and Sagent (both based in Menlo Park, CA), which provide integrated, Windows-based toolsets designed primarily to work with client-server source and target databases.

The minimum investment for data extraction solutions (including tools and staff resources) is at least six figures. The ultimate cost is highly scope-dependent; if a modest, subject-oriented data mart is involved, the number of sources and data elements can be well-bounded. Tools, such as Informatica's Powermart, begin at $45,000 for a single data source and target platform, and escalate from there. Extraction tools for enterprise data warehouses involving mainframe sources cost several times that. Staff resources range from 6 - 12 weeks for simple data marts, and rising considerably for enterprise data warehouses.

In most cases, the data cleansing portion of data extraction is not usually planned or budgeted as a separate activity. Here, data integrity is based on consistency (e.g., applying the same COBOL routines on a VSAM file and getting repeatable results). The result is that data flowing into the warehouse is filtered for consistency (e.g., erroneous characters are removed and missing fields filled). Consistency is not the same as accuracy.

Some data extraction tools are providing more robust data cleansing capabilities. Apertus (Eden Prairie, MN), which is better known as a middleware company, also provides Enterprise/Integrator, a tool which can best be described as data extraction on steroids. The key is its object-oriented meta data repository (built on ODI's ObjectStore OO database), which provides stronger data cleansing capabilities than most extraction tools. Enterprise/Integrator allows users to graphically map sources and targets (like other data extraction tools) and use a C++ code generator to develop rules for converting and cleansing data. It also provides fuzzy logic capabilities which searches for logically similar records and consolidates them into composite objects that represent best values, and a utility for resolving conflicts.

DATA CLEANSING TOOL OPTIONS
For a growing number of organizations, data entry validations and meta data tools are not sufficient for meeting their data quality requirements. Specifically, these organizations require tools that are able to examine individual records and use rules to correct them, and in many cases relate them to other records within the database(s). Customer information is the largest single category of tools. Mass marketing companies have paid the price of poor data quality in high postal costs and lower response rates to direct marketing campaigns. Budgeting for data cleansing was typically a sales and marketing, not an IT expense.

Traditionally, most organizations relied on third party service bureaus which bundled data cleansing as part of other direct marketing fulfillment activities. Over the past 20 years, a number of firms such as Epsilon (Burlington, Mass.) and Harte Hanks (Billerica, Mass) have developed services which 'clean' customer lists to generate accurate lists for mass mailings. For instance, Harte Hanks currently processes over 500 million names each month. The level of cleansing varied from performing the basic tasks necessary to render a mass mailing compliant with USPS (postal) standards and regulations (e.g., address format, sort sequences, and reporting requirements) to market research which attempted to identify customer demographics, and perform sophisticated merge/purge operations.

In recent years, service bureaus and ISVs have developed analytical tools which provide more intelligence to the process of cleansing customer data files. Many include extensive sets of rules that perform the following functions:

* Identifying the relationships between members of a family who may or may not live in the same household.

* Identifying the relationships between companies and their subsidiaries or partners.

* Identifying whether a specific address is a single-family residence or rental apartment complex.

* Identifying naming conventions (e.g., if the word 'Church' is a last name or part of the name of a religious institution, depending on which line the word appears and whether it is located next to a word recognized as a first name or initial).

* Parsing street addresses.

* Adding postal codes and postal bar coding.

More recently, other tools have emerged to tackle other types of data. Typically, they are employed when businesses reengineer, consolidate operations, and migrate to new enterprise-wide business systems.

Because a separate investment is required for these tools, data cleansing becomes a distinct line item on the budget which is funded by IT, LOBs, and/or sales and marketing.

Customer Information Tools: Major providers include Harte Hanks, Group 1 (Lanham, MD), and PostalSoft (La Crosse, WI). Their packages clean, code, and standardize address information according to USPS standards. Other features include adding Zip+4 codes, USPS standard bar coding, generating bulk mailing forms and reports, and organizing mailing to USPS third-class bulk mailing standards.

Most of these offerings were designed to run in batch mode for bulk mailings on the mainframe or AS/400, and are only beginning to be ported over to open systems environments.

Harte Hanks' Trillium package provides additional capabilities beyond traditional direct mail packages. Written in C, it was designed as platform-independent, and for both on-line and batch operation. Often, a customer file is downloaded from the mainframe and processed in batch mode. The resulting customer file is maintained in a SQL relational database and can be updated on-line. One of Trillium's strongest selling points is that it allows telemarketing and customer service representatives to update customer files while the customer is on the phone - the time when information capture is the most reliable.

Trillium is based on a series of user-configurable tables which package the logic of the company's original mass marketing service within those tables. Harte Hanks claims that this architecture makes Trillium a low-cost, rapid implementation tool which is configured, not programmed, and designed to run inside the customer service applications and business processes which an organization already has built. According to one user, the application was fully implemented within 16 weeks, cleansing 200 million records, with most of the effort devoted to configuring Trillium's rules tables. The overall cost of software and implementation was under $250,000.

The tables and logic are encapsulated as CORBA-compliant objects which can be inserted into business applications (e.g., order entry) as appropriate. This feature makes Trillium more than a bulk mail application; its objects can be inserted into customer service applications (e.g., order entry and post-sales support) and financial applications (e.g., customer or vendor listings for accounts payable or receivable programs).

Another vendor, Innovative Systems Inc. (part of the Innovative Group Inc., Pittsburgh, PA), offers a mainframe-based customer data cleansing system that has been heavily oriented to the banking industry. The company claims that 60% of all 'major banks' use ISI software to scrub their customer account records. It offers a complete set of tools and services that matches and corrects the spelling of names, and identifies relationships (either within consumer households or between companies. The company is used to tackling big projects, and provided nearly a dozen consultants to Chase Manhattan Bank's four-year effort to clean the records of over 20 million account holders.

Programming Tools: A prime criticism of bulk mailing and customer information tools is that they weren't designed for other types of corporate data (see Table 1). Vality (Boston, MA) is the best-known provider of tools designed for all types of structured and free-form data.

The advantage is that these tools offer a wide degree of flexibility; the drawback is that they tend to be much costlier than customer-focused approaches because they are generalized programming environments, not niche applications. Vality's Integrity tool offers:

* Lexical analysis (rules which determine the business significance of particular data items).

* Patterns analysis (which develop maps of field content and structure), around which users develop applications for cleansing any form of data.

* Algorithms which assign probability factors for defining matches.

However, because the tool is designed for all forms of data, it does not contain the knowledge bases associated with customer-focused tools from Harte Hanks or ISI.

Once the user has developed the business rules covering the data in question, Integrity has a number of features that allow it to decompose data for the cleansing process. For instance, data is not simply parsed at field level, but can be broken apart to help users identify 'hidden' data. Integrity also provides pattern analysis that can tag any data type, and based on that tagging, take mixed fields and parse them into standard record formats. The product can also fill in missing values and format into business object level information for data warehouse usage. For example, rather than having separate fields for company name, address, city, state, zip code, and phone number, Integrity provides data aggregation into a meaningful business object, identifiable by an index number. This approach is far more sophisticated than that of simple data movement utilities.

The primary drawback of Integrity is cost. Its $235,000 list price is high, especially considering the fact that the product is not a configurable application, but a programming environment which requires significant user effort. Although the tool automates parsing, pattern recognition, and consolidation, analysis and rules-making is labor-intensive, and does not lend itself to rapid delivery data warehouse initiatives. A high-end Vality project involved 1.5 staff years of Integrity programming plus 25 staff years of COBOL programming; the company claims that more typical projects involve 1 - 6 staff months from information model development through actual data conversion (roughly $4,000 - $50,000 staff cost).

Not surprisingly, only a minority of Integrity installations cover customer lists. Roughly 40% comprise migrations to enterprise client-server business applications such as SAP R/3 (for manufacturing) or PMSC (for the insurance industry), with another 40% used for data warehouses (some of which include customer information lists).

Statistical Quality Tools: QDB Inc. (Cambridge, MA) offers a statistical approach to evaluating the incidence of faulty data. Like a manufacturing quality management program, the QDB tool examines samples of data, and using rules developed by the user, rates the incidence of deviation or inconsistency. The system can also create tables of good and bad data for further inspection. In most cases, the tool is not used for the actual cleansing, although Elf Atochem, North America (Philadelphia, PA) did use it as part of its R/3 migration.

QDB's approach is that it provides a low-cost means for conducting data cleansing triage: the degree of the problem is scoped out, and the user decides which part of the problem should merit attention. The notion is that data cleansing is expensive, and therefore, to get the biggest bang for the buck, cleansing should only be performed where absolutely needed.

The cost is relatively modest: the base package costs $29,500; the company estimates that an R/3 conversion analysis project could cost between $50,000 - $75,000 for software and consulting.

USER EXPERIENCES
Bank of Montreal: Data quality was considered the responsibility of LOBs, when the bank implemented three department-level data warehouses (tackling narrowly-defined functions such as credit monitoring and loan approvals) over the past year. The data warehouses were built using the bank's existing DB2/MVS infrastructure; although the projects resemble data marts in that they are targeted to individual LOBs, they utilise the existing DB2 database. Over the coming year, the bank will build 13 more such data warehouse projects.

The bank did not invest in data cleansing tools. Instead, it relied on Prism's Warehouse Data Manager to filter incoming data, the diligence of end users, who 'owned' the data, and careful analysis for identifying the best sources of data.

Data selection and transformation was the most labour intensive part of the project; it consumed half of all project time, with JAD sessions, data modelling, and front-end design accounting for the other 50%. The most difficult part of the data transformation task was identifying the best sources of data (especially when there were multiple sources available) and resolving conflicts in data definitions. For instance, some operating units termed accounts delinquent when they were 30 days past due while other units defined the threshold at 60 days.

Each data warehouse project required two IT people dedicated full time to data transformations; at the going rate of C$80/hour (burdened), the overall staff cost per project for data transformation was roughly C$50,000. By comparison, data modelling required only one IT staff member. As for the software, the bank paid roughly $50,000 for Prism, and continues to pay 15% annual maintenance.

Ohio Casualty: Although the company's property and casualty policies are marketed by independent agents, the firm recently changed its notion of customer, or 'client' to include any person or entity that does business with the company, directly or indirectly. This change was in sync with the rest of the industry, according to customer information project leader Susan Parson. The definition of clients included policy-holders, lien-holders, contractors, attorneys representing claimants, payee organizations (e.g., body shops for automotive policies), and employees. The company currently maintains 1 million policies; the number of people and organizations in its client database will be several times that.

The initial goal was providing coherent customer information for its new national customer service center. Although customers actually buy policies from independent agents, happy policy-holders will in turn make happy (and loyal) agents. Eventually, the company may also create data warehouses based on customer data for decision support.

From the outset, the company realized that such a system would be useless if the data wasn't clean. It initially conducted a limited project devising its own cleansing system for roughly 1,500 Ohio policy-holders. Although not considered a pilot project, the initial effort was a proof-of-concept to give the team a better idea of what would be required for scale up. A three-person IT team spent a full year on the project; the estimated staff cost was around $200,000 (although domain experts were called in occasionally, the organization did not track the amount of time involved for non-IT personnel).

The experience of writing a home-grown system taught them that 'we'd be retired' before it would have gotten finished, according to Parson. The team chose Vality because they considered it more flexible than other name and address matching alternatives. In all, the Vality effort involved 8 days of training, 5 days of vendor consulting, 3 - 4 weeks of startup (for learning the tool), and 8 weeks for cleansing the first 3,000 policy-holder names.

Since then, the three-person team has cleansed 10,000 additional names over 4 months, and expects to have all 1 million policy-holder name, address, and policy number records cleansed within 2 years. With the team up to speed, they estimate that it will cost $1/customer record for the remainder of the job. (The company would not disclose how much it paid for Vality tools and services.)

Elf Atochem: A SAP R/3 migration was the driver for its data quality effort. Previously, the company operated multiple standalone systems for 13 business units, each of which were being consolidated for the enterprise client-server migration. (For instance, the 13 LOBs previously operated 23 separate payroll systems.)

The company used the QDB tool to measure the incidence of bad data. Some of the results included:

* 7% of all product codes were incorrect. While this sounds like a small number, errors here could result in wrong or redundant products and invoices sent to the customer, producing unhappy customers.

* 44% of all customer credit files had discrepancies between order entry and accounts receivable - another problem that could cause unhappy customers.

* Payroll transactions normally take 1 - 30 seconds; correcting errors could require 10 minutes to several hours.

* Over 35% of all vendor listings were duplicates or dormant (or, 15,000 out of a total vendor population of 40,000).

Once top management was convinced of the need to make data cleansing a top priority, three IT staff were dedicated full time for one year (cost, roughly $250,000) to screen and clean the data. Elf Atochem spent $10,000 for the PC-based QDB software, and roughly $300,000 for consulting. It also dedicated three full-time IT staff for a year (estimated cost: $250,000).

THE COST OF DATA QUALITY
Data quality is an apple pie issue that has often been taken for granted. Consumer marketing companies have felt the pinch first. They were the first ones to treat data quality as a business problem, and paid for outside organizations to cleanse data as part of a direct marketing bundle of services. Here, data cleansing was a defensive task aimed at minimizing direct mail rejects.

The advent of householding has turned data cleansing into a marketing strategy. Marketers used to look at customers account by account. Householding allows them to look at the number and type of accounts held by a specific person, and the relationships that the customer has with other members of their family or household. By identifying such patterns, marketers uncover opportunities for selling related products (affinity selling). For data cleansing, the bottom line is that it changes from cost center to revenue producer. Consumer marketing currently remains the largest industry sector for data cleansing.

Data warehousing has also raised consciousness of the issue, but not necessarily consensus. Because data warehouses are not mission-critical systems, may users believe that an 80/20 approach to data quality is good enough. Conversely, others claim that data accuracy is critical in order to realize the benefits of data warehousing as decision support database environments.

Similarly, there is no consensus as to who should be responsible for cleaning data. Clearly, it requires a combination of IT experience, analytical skills, and business (domain) knowledge. Some organizations leave this to users, who are closer to the data and can correct problems in the bud. However, when information is integrated for enterprise systems, common data quality standards must be managed and enforced by IT personnel.

However, if the goal is consistent data, a centralized approach is necessary. Previous decentralized efforts at a telecommunications carrier to cleanse data by product market resulted in the proliferation of standards, which subsequently required extensive merge and purge efforts to put Humpty Dumpty back together again. Cost justifying data cleansing is difficult, unless scientific steps are taken to compare results with clean and dirty batches of data. However, one Trillium user noted that it could proudly point to the fact that its $250,000 project saved $500,000 in avoided mailing costs alone.

Table 1. Advantages and disadvantages of data cleansing methods (Source: Computer Finance).

Method

Advantages

Disadvantages

Data Entry Validations

Low-cost

Does not prevent content errors

 

Prevents low-level (e.g., data format, out-of-range) errors

 

 

Ensures required fields are completed

 

Data Extraction Tools

Rudimentary data cleansing

Most tools do not prevent content errors

 

Builds data models which are useful for identifying data sources, conversions, etc.

Provides change management

Customer Information Tools

Provides pre-built applications for formatting names and address data for bulk mailing requirements

Limited capability for cleansing internal business data (outside customer files)

 

Some tools contain robust naming convention rules, expert systems

Also requires extraction tools for retrieving data from legacy sources

 

Robust parsing and pattern matching capabilities

 

Programming Tools

Capable of cleansing any type of business data

Requires extensive programming

 

Highly flexible solutions allow users to tailor solutions to internal business rules

High cost because applications must be built

 

Robust parsing and pattern matching capabilities

Also requires extraction tools for retrieving data from legacy sources

Statistical Quality Tools

Low cost

Not intended for data cleansing

 

Decision support tools which illustrate extent and sources of problems

Also requires investment in data cleansing tools or internal software development

 

Well-suited for triage strategies

Also requires extraction tools for retrieving data from legacy sources


all original content copyright © 2001 - 2006 onstrategies
site designed and copyright © 2001 - 2006 oz barron