




Toward Quality Data: An
Attribute-Based Approach

Richard Y. Wang

M.P. Reddy

Henry B. Kon

WP #3762 November 1992

PROFIT #92-01



Productivity From Information Technology

"PROFIT" Research Initiative

Sloan School of Management

Massachusetts Institute of Technology

Cambridge, MA 02139 USA

(617)253-8584

Fax: (617)258-7579



Copyright Massachusetts Institute of Technology 1992. The research described
herein has been supported (in whole or in part) by the Productivity From Information
Technology (PROFIT) Research Initiative at MIT. This copy is for the exclusive use of
PROFIT sponsor firms.



Productivity From Information Technology

(PROFIT)

The Productivity From Information Technology (PROFIT) Initiative was established
on October 23, 1992 by MIT President Charles Vest and Provost Mark Wrighton "to
study the use of information technology in both the private and public sectors and
to enhance productivity in areas ranging from finance to transportation, and from
manufacturing to telecommunications." At the time of its inception, PROFIT took
over the Composite Information Systems Laboratory and Handwritten Character
Recognition Laboratory. These two laboratories are now involved in research
related to context mediation and imaging, respectively.







In addition, PROFIT has undertaken joint efforts with a number of research centers,
laboratories, and programs at MIT, and the results of these efforts are documented
in Discussion Papers published by PROFIT and/or the collaborating MIT entity.

Correspondence can be addressed to:

The "PROFIT" Initiative
Room E53-310, MIT
50 Memorial Drive
Cambridge, MA 02142-1247
Tel: (617) 253-8584
Fax: (617) 258-7579
E-Mail: [email protected]



Toward Quality Data: An Attribute-Based Approach



ABSTRACT The need for a quality perspective in the management of the
data resource is becoming increasingly critical. Managing data quality, however, is a
complex task. Although it would be ideal to achieve zero defect data, this may not
always be attainable. Moreover, different users may have different criteria in
determining the quality of data. This suggests that it would be useful to be able to
tag data with quality indicators which are characteristics of the data and its
manufacturing process. From these quality indicators, users can make their own
judgment of the quality of the data for the specific application at hand.

This paper investigates how quality indicators may be specified, stored,
retrieved, and processed. Specifically, we propose an attribute-based data model that
facilitates cell-level tagging of data. Included in this attribute-based model are a
mathematical model description that extends the relational model, a set of quality
integrity rules, and a quality indicator algebra which can be used to process SQL
queries that are augmented with quality indicator requirements. From these quality
indicators, the user can make a better interpretation of the data and determine the
believability of the data. In order to establish the relationship between data quality
dimensions and quality indicators, a data quality requirements analysis
methodology that extends the Entity Relationship model is also presented.



1. Introduction
1.1. Dimensions of data quality
1.2. Data quality: an attribute-based example
1.3. Research focus and paper organization
2. Research background
2.1. Rationale for cell-level tagging
2.2. Work related to data tagging
2.3. Terminology
3. Data quality requirements analysis
3.1. Step 1: Establishing the applications view
3.2. Step 2: Determine (subjective) quality parameters
3.3. Step 3: Determine (objective) quality indicators
3.4. Step 4: Creating the quality schema
4. The attribute-based model of data quality
4.1. Data structure
4.2. Data integrity
4.3. Data manipulation
4.3.1. QI-Compatibility and QIV-Equal
4.3.2. Quality Indicator Algebra
4.3.2.1. Selection
4.3.2.2. Projection
4.3.2.3. Union
4.3.2.4. Difference
4.3.2.5. Cartesian Product
5. Discussion and future directions
6. References
7. Appendix A: Premises about data quality requirements analysis
7.1. Premises related to data quality modeling
7.2. Premises related to data quality definitions and standards across users
7.3. Premises related to a single user




1. Introduction

Organizations in industries such as banking, insurance, retail, consumer marketing, and health
care are increasingly integrating their business processes across functional, product, and geographic
lines. The integration of these business processes, in turn, accelerates demand for more effective
application systems for product development, product delivery, and customer service (Rockart & Short,
1989). As a result, many applications today require access to corporate functional and product
databases. Unfortunately, most databases are not error-free, and some contain a surprisingly large
number of errors (Johnson, Leitch, & Neter, 1981). In a recent industry executive report, Computerworld
surveyed 500 medium-size corporations (with annual sales of more than $20 million), and reported that
more than 60% of the firms had problems in data quality.¹ The Wall Street Journal also reported that:

Thanks to computers, huge databases brimming with information are at our fingertips, just
waiting to be tapped. They can be mined to find sales prospects among existing customers; they
can be analyzed to unearth costly corporate habits; they can be manipulated to divine future
trends. Just one problem: Those huge databases may be full of junk. ... In a world where people
are moving to total quality management, one of the critical areas is data.²

In general, inaccurate, out-of-date, or incomplete data can have significant impacts both
socially and economically (Laudon, 1986; Liepins & Uppuluri, 1990; Liepins, 1989; Wang & Kon, 1992;
Zarkovich, 1966). Managing data quality, however, is a complex task. Although it would be ideal to
achieve zero defect data,³ this may not always be necessary or attainable for, among others, the
following two reasons:

First, in many applications, it may not always be necessary to attain zero defect data. Mailing
addresses in database marketing is a good example. In sending promotional materials to target
customers, it is not necessary to have the correct city name in an address as long as the zip code is correct.

Second, there is a cost/quality tradeoff in implementing data quality programs. Ballou and
Pazer found that "in an overwhelming majority of cases, the best solutions in terms of error rate
reduction is the worst in terms of cost" (Ballou & Pazer, 1987). The Pareto Principle also suggests that
losses are never uniformly distributed over the quality characteristics. Rather, the losses are always
distributed in such a way that a small percentage of the quality characteristics, "the vital few,"
always contributes a high percentage of the quality loss. As a result, the cost improvement potential is



¹ Computerworld, September 28, 1992, pp. 80-84.

² The Wall Street Journal, May 26, 1992, page B6.

³ Just like the well publicized concept of zero defect products in the manufacturing literature.



high for "the vital few" projects whereas the "trivial many" defects are not worth tackling because the
cure costs more than the disease (Juran & Gryna, 1980). In sum, when the cost is prohibitively high, it is
not feasible to attain zero defect data.

Given that zero defect data may not always be necessary nor attainable, it would be useful to be
able to judge the quality of data. This suggests that we tag data with quality indicators which are
characteristics of the data and its manufacturing process. From these quality indicators, the user can
make a judgment of the quality of the data for the specific application at hand. In making a financial
decision to purchase stocks, for example, it would be useful to know the quality of data through quality
indicators such as who originated the data, when the data was collected, and how the data was
collected.

In this paper, we propose an attribute-based model that facilitates cell-level tagging of data.
Included in this attribute-based model are a mathematical model description that extends the
relational model, a set of quality integrity rules, and a quality indicator algebra which can be used to
process SQL queries that are augmented with quality indicator requirements. From these quality
indicators, the user can make a better interpretation of the data and determine the believability of the
data. In order to establish the relationship between data quality dimensions and quality indicators, a
data quality requirements analysis methodology that extends the Entity Relationship (ER) model is
also presented.

Just as it is difficult to manage product quality without understanding the attributes of the
product which define its quality, it is also difficult to manage data quality without understanding the
characteristics that define data quality. Therefore, before one can address issues involved in data
quality, one must define what data quality means. In the following subsection, we present a definition
for the dimensions of data quality.

1.1. Dimensions of data quality

Accuracy is the most obvious dimension when it comes to data quality. Morey suggested that
"errors occur because of delays in processing times, lengthy correction times, and overly or insufficiently
stringent data edits" (Morey, 1982). In addition to defining accuracy as "the recorded value is in
conformity with the actual value," Ballou and Pazer defined timeliness (the recorded value is not out
of date), completeness (all values for a certain variable are recorded), and consistency (the
representation of the data value is the same in all cases) as the key dimensions of data quality (Ballou
& Pazer, 1987). Huh et al. identified accuracy, completeness, consistency, and currency as the most
important dimensions of data quality (Huh, et al., 1990).

It is interesting to note that although methods for quality control have been well established in
the manufacturing field (e.g., Juran, 1979), neither the dimensions of quality for manufacturing nor for
data have been rigorously defined (Ballou & Pazer, 1985; Garvin, 1983; Garvin, 1987; Garvin, 1988;
Huh, et al., 1990; Juran, 1979; Juran






As shown in Table 2, "<7, (source: Barron's, reporting_date: 10-05-92, data_entry_operator: Joe)>"
in Column 3 indicates that "$7 was the Earnings Estimate of IBM" was reported by Barron's on
October 5, 1992 and was entered by Joe. An experienced analyst would know that Barron's is a credible
source; that October 5, 1992 is timely (assuming that October 5 was recent); and that Joe is experienced,
therefore the data is likely to be accurate. As a result, he may conclude that the earnings estimate is
believable. This example both illustrates the need for, and provides an example approach for,
incorporating quality indicators into the database through data tagging.
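The tagged cell above can be sketched in a few lines of code. This is an illustrative sketch only; `TaggedValue` and its field names are invented here, not part of the paper's model.

```python
from dataclasses import dataclass, field

@dataclass
class TaggedValue:
    """An attribute value together with its cell-level quality indicators."""
    value: object
    indicators: dict = field(default_factory=dict)

# The Earnings Estimate cell for IBM from Table 2:
estimate = TaggedValue(
    value=7,
    indicators={
        "source": "Barron's",
        "reporting_date": "10-05-92",
        "data_entry_operator": "Joe",
    },
)

# An analyst inspects the indicators before deciding how much to trust the value.
credible = estimate.indicators["source"] in {"Barron's", "The Wall Street Journal"}
```

A query processor could surface `indicators` alongside `value`, letting each user apply their own credibility criteria to the same stored data.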

1.3. Research focus and paper organization

The goal of the attribute-based approach is to facilitate the collection, storage, retrieval, and
processing of data that has quality indicators. Central to the approach is the notion that an attribute
value may have a set of quality indicators associated with it. In some applications, it may be
necessary to know the quality of the quality indicators themselves, in which case a quality indicator
may, in turn, have another set of associated quality indicators. As such, an attribute may have an
arbitrary number of underlying levels of quality indicators. This constitutes a tree structure, as shown in
Figure 2 below.

[Figure 2: An attribute with quality indicators — a tree with the attribute at its root and quality indicators at the levels below, where each indicator may itself have child indicators.]

Conventional spreadsheet programs and database systems are not appropriate for handling
data which is structured in this manner. In particular, they lack the quality integrity constraints
necessary for ensuring that quality indicators are always tagged along with the data (and deleted
when the data is deleted) and the algebraic operators necessary for attribute-based query processing.
In order to associate an attribute with its immediate quality indicators, a mechanism must be
developed to facilitate the linkage between the two, as well as between a quality indicator and the set
of quality indicators associated with it.
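One way to sketch such a linkage is as a recursive structure; the names below are assumptions for illustration, while the paper's actual mechanism is developed in Section 4.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Indicator:
    """A quality indicator, which may itself carry further indicators,
    giving the tree of Figure 2 an arbitrary number of levels."""
    value: object
    indicators: Dict[str, "Indicator"] = field(default_factory=dict)

@dataclass
class Attribute:
    """An attribute value at the root of its quality-indicator tree."""
    value: object
    indicators: Dict[str, Indicator] = field(default_factory=dict)

# Two indicator levels: the source indicator is itself qualified.
attr = Attribute(
    value=7,
    indicators={
        "source": Indicator(
            "Barron's",
            {"collection_method": Indicator("manual entry")},
        )
    },
)

def depth(node) -> int:
    """Number of indicator levels below a node (0 = no indicators)."""
    if not node.indicators:
        return 0
    return 1 + max(depth(child) for child in node.indicators.values())
```

Because each indicator set is owned by its parent node, discarding the attribute discards the whole subtree with it — a structural analogue of the integrity constraint that indicators be deleted along with their data.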

This paper is organized as follows. Section 2 presents the research background. Section 3
presents the data quality requirements analysis methodology. In Section 4, we present the attribute-based
data model. Discussion and future directions are presented in Section 5.

2. Research background

In this section we discuss our rationale for tagging data at the cell level, summarize the
literature related to data tagging, and present the terminology used in this paper.

2.1. Rationale for cell-level tagging

Any characteristics of data at the relation level should be applicable to all instances of the
relation. It is, however, not reasonable to assume that all instances (i.e., tuples) of a relation have the
same quality. Therefore, tagging quality indicators at the relation level is not sufficient to handle
quality heterogeneity at the instance level.



By the same token, any characteristics of data tagged at the tuple level should be applicable
to all attribute values in the tuple. However, each attribute value in a tuple may be collected from
different sources, through different collection methods, and updated at different points in time.
Therefore, tagging data at the tuple level is also insufficient. Since the attribute value of a cell is the
basic unit of manipulation, it is necessary to tag quality information at the cell level.
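A toy tuple makes the point concrete (all values invented for illustration): a single tuple-level tag could not record that its two cells came from different sources on different dates.

```python
# Each cell of one tuple carries its own quality indicators.
row = {
    "share_price": {"value": 98.5, "source": "NYSE feed", "updated": "10-05-92"},
    "earnings_estimate": {"value": 7, "source": "Barron's", "updated": "10-01-92"},
}

# Two distinct sources within the same tuple: any single tag applied at the
# tuple level would necessarily lose one of them.
sources = {cell["source"] for cell in row.values()}
```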

We now examine the literature related to data tagging.

2.2. Work related to data tagging

A mechanism for tagging data has been proposed by Codd. It includes NOTE, TAG, and
DENOTE operations to tag and un-tag the name of a relation to each tuple. The purpose of these
operators is to permit both the schema information and the database extension to be manipulated in a
uniform way (Codd, 1979). It does not, however, allow for the tagging of other data (such as source) at
either the tuple or cell level.

Although self-describing data files and meta-data management have been proposed at the
schema level (McCarthy, 1982; McCarthy, 1984; McCarthy, 1988), no specific solution has been offered
to manipulate such quality information at the tuple and cell levels.

A rule-based representation language based on a relational schema has been proposed to store
data semantics at the instance level (Siegel & Madnick, 1991). These rules are used to derive
meta-attribute values based on values of other attributes in the tuple. However, these rules are specified at
the tuple level as opposed to the cell level, and thus cell-level operations are not inherent in the
model.

A polygen model (poly = multiple, gen = source) (Wang & Madnick, 1990) has been proposed to
tag multiple data sources at the cell level in a heterogeneous database environment where it is
important to know not only the originating data source but also the intermediate data sources which
contribute to final query results. The research, however, focused on the "where from" perspective and
did not provide mechanisms to deal with more general quality indicators.

In (Sciore, 1991), annotations are used to support the temporal dimension of data in an
object-oriented environment. However, data quality is a multi-dimensional concept. Therefore, a more
general treatment is necessary to address the data quality issue. More importantly, no algebra or
calculus-based language is provided to support the manipulation of annotations associated with the
data.



The examination of the above research efforts suggests that in order to support the
functionality of our attribute-based model, an extension of existing data models is required.

2.3. Terminology

To facilitate further discussion, we introduce the following terms:

• An application attribute refers to an attribute associated with an entity or a relationship in an
entity-relationship (ER) diagram. This would include the data traditionally associated with
an application such as part number and supplier.

• A quality parameter is a qualitative or subjective dimension of data quality that a user of data
defines when evaluating data quality. For example, believability and timeliness are such
dimensions.

• As introduced in Section 1, quality indicators provide objective information about the
characteristics of data and its manufacturing process.⁴ Data source, creation time, and
collection method are examples of such objective measures.

• A quality parameter value is the value determined (directly or indirectly) by the user of data
for a particular quality parameter based on underlying quality indicators. Functions can be
defined by users to map quality indicators to quality parameters. For example, the quality
parameter credibility may be defined as high or low depending on the quality indicator source
of the data.

• A quality indicator value is a measured characteristic of the stored data. For example, the
data quality indicator source may have a quality indicator value The Wall Street Journal.
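The mapping from (objective) indicator values to a (subjective) parameter value can be sketched as a user-defined function. The trusted-source set below is one user's judgment, assumed here for illustration, not part of the model.

```python
# One user's set of sources they consider credible (an assumption for
# illustration; each user would define their own).
TRUSTED_SOURCES = {"The Wall Street Journal", "Barron's"}

def credibility(source: str) -> str:
    """Map the `source` quality indicator value to a value of the
    quality parameter `credibility`."""
    return "high" if source in TRUSTED_SOURCES else "low"
```

Different users supply different functions over the same stored indicators, which is exactly why the indicators themselves are kept objective.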

We have discussed the rationale for cell-level tagging, summarized work related to data
tagging, and introduced the terminology used in this paper. In the next section, we present a
methodology for the specification of data quality parameters and indicators. The intent is to allow
users to think through their data quality requirements, and to determine which quality indicators
would be appropriate for a given application.



⁴ We consider an indicator objective if it is generated using a well-defined and widely accepted measure.



3. Data quality requirements analysis

In general, different users may have different data quality requirements, and different types of
data may have different quality characteristics. The reader is referred to Appendix A for a more
thorough treatment of these issues.

Data quality requirements analysis is an effort similar in spirit to traditional data
requirements analysis (Batini, Lenzerini, & Navathe, 1986; Navathe, Batini, & Ceri, 1992; Teorey,
1990), but focusing on quality aspects of the data. Based on this similarity, parallels can be drawn
between traditional data requirements analysis and data quality requirements analysis. Figure 3
depicts the steps involved in performing the proposed data quality requirements analysis.



[Figure 3: The process of data quality requirements analysis — application requirements feed Step 1 (determine the application view of data), producing the application view; Step 2 determines (subjective) quality parameters for the application, producing the parameter view; Step 3 determines (objective) quality indicators for the application, producing quality views (1) ... (n); Step 4 integrates the quality views into the quality schema.]

The input, output and objective of each step are described in the following subsections.



3.1. Step 1: Establishing the applications view

Step 1 is the whole of the traditional data modeling process and will not be elaborated upon in
this paper. A comprehensive treatment of the subject has been presented elsewhere (Batini, Lenzerini,
& Navathe, 1986; Navathe, Batini, & Ceri, 1992; Teorey, 1990).

For illustrative purposes, suppose that we are interested in designing a portfolio management
system which contains companies that issue stocks. A company has a company name, a CEO, and an
earnings estimate, while a stock has a share price, a stock exchange (NYSE, AMEX, or OTC), and a ticker
symbol. An ER diagram that documents the application view for our running example is shown below in
Figure 4.



[Figure 4: Application view (output from Step 1) — an ER diagram with entities COMPANY (attributes: NAME, CEO, EARNINGS ESTIMATE) and STOCK (attributes including SHARE PRICE), connected by the ISSUES relationship.]
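The application view of Figure 4 can be sketched as plain data structures; the field names and sample values below are illustrative, not from the paper.

```python
from dataclasses import dataclass

@dataclass
class Stock:
    """STOCK entity from the application view."""
    share_price: float
    stock_exchange: str  # "NYSE", "AMEX", or "OTC"
    ticker_symbol: str

@dataclass
class Company:
    """COMPANY entity; the `issues` field models the ISSUES relationship."""
    name: str
    ceo: str
    earnings_estimate: float
    issues: Stock

# A hypothetical instance of the running example (values invented):
ibm = Company("IBM", "(CEO name)", 7.0, Stock(98.5, "NYSE", "IBM"))
```

This plain view carries no quality information; the quality parameters and indicators of Steps 2 and 3 are layered on top of it.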
3.2. Step 2: Determine (subjective) quality parameters

The goal in this step is to elicit quality parameters from the user given an application view.
These parameters need to be gathered from the user in a systematic way as data quality is a multi-dimensional
concept, and may be operationalized for tagging purposes in different ways. Figure 5
illustrates the addition of two high-level parameters, interpretability and believability, to the
application view. Each quality parameter identified is shown inside a "cloud" in the diagram.




Figure 5: Interpretability and believability added to the application view



Interpretability can be defined through quality indicators such as data units (e.g., in dollars)
and scale (e.g., in millions). Believability can be defined in terms of lower-level quality parameters
such as completeness, timeliness, consistency, credibility, and accuracy. Timeliness, in turn, can be

