




WORKING PAPER
ALFRED P. SLOAN SCHOOL OF MANAGEMENT



Data Quality Requirements Analysis
and Modeling



December 1992



WP #3515-93
CISL WP# 92-04



Richard Y. Wang

M. P. Reddy

H. B. Kon

Sloan School of Management, MIT



MASSACHUSETTS

INSTITUTE OF TECHNOLOGY

50 MEMORIAL DRIVE

CAMBRIDGE, MASSACHUSETTS 02139



To appear in the Journal of Decision Support Systems (DSS)
Special Issue on Information Technologies and Systems



Richard Y. Wang E53-317
M. P. Reddy E53-322
Henry B. Kon E53-322

Sloan School of Management

Massachusetts Institute of Technology

Cambridge, MA 02139





Toward Quality Data: An Attribute-Based Approach

Richard Y. Wang
M. P. Reddy
Henry B. Kon

November 1992

(CIS-92-04, revised)

Composite Information Systems Laboratory

E53-320, Sloan School of Management

Massachusetts Institute of Technology

Cambridge, Mass. 02139

ATTN: Prof. Richard Wang

(617) 253-0442
Bitnet Address: [email protected]

© 1992 Richard Y. Wang, M.P. Reddy, and Henry B. Kon



ACKNOWLEDGEMENTS Work reported herein has been supported, in part, by
MIT's International Financial Service Research Center and MIT's Center for
Information Systems Research. The authors wish to thank Stuart Madnick and
Amar Gupta for their comments on earlier versions of this paper. Thanks are also
due to Amar Gupta for his support and Gretchen Fisher for helping prepare this
manuscript.



1. Introduction

1.1. Dimensions of data quality

1.2. Data quality: an attribute-based example

1.3. Research focus and paper organization

2. Research background

2.1. Rationale for cell-level tagging

2.2. Work related to data tagging

2.3. Terminology

3. Data quality requirements analysis

3.1. Step 1: Establishing the applications view

3.2. Step 2: Determine (subjective) quality parameters

3.3. Step 3: Determine (objective) quality indicators

3.4. Step 4: Creating the quality schema

4. The attribute-based model of data quality

4.1. Data structure

4.2. Data integrity

4.3. Data manipulation

4.3.1. QI-Compatibility and QIV-Equal

4.3.2. Quality Indicator Algebra

4.3.2.1. Selection

4.3.2.2. Projection

4.3.2.3. Union

4.3.2.4. Difference

4.3.2.5. Cartesian Product

5. Discussion and future directions

6. References

7. Appendix A: Premises about data quality requirements analysis

7.1. Premises related to data quality modeling

7.2. Premises related to data quality definitions and standards across users

7.3. Premises related to a single user



Toward Quality Data: An Attribute-Based Approach

1. Introduction

Organizations in industries such as banking, insurance, retail, consumer marketing, and health
care are increasingly integrating their business processes across functional, product, and geographic
lines. The integration of these business processes, in turn, accelerates demand for more effective
application systems for product development, product delivery, and customer service (Rockart & Short,
1989). As a result, many applications today require access to corporate functional and product
databases. Unfortunately, most databases are not error-free, and some contain a surprisingly large
number of errors (Johnson, Leitch, & Neter, 1981). In a recent industry executive report, Computerworld
surveyed 500 medium size corporations (with annual sales of more than $20 million), and reported that
more than 60% of the firms had problems in data quality.^1 The Wall Street Journal also reported that:

Thanks to computers, huge databases brimming with information are at our fingertips, just
waiting to be tapped. They can be mined to find sales prospects among existing customers; they
can be analyzed to unearth costly corporate habits; they can be manipulated to divine future
trends. Just one problem: Those huge databases may be full of junk. ... In a world where people
are moving to total quality management, one of the critical areas is data.^2

In general, inaccurate, out-of-date, or incomplete data can have significant impacts both
socially and economically (Laudon, 1986; Liepins & Uppuluri, 1990; Liepins, 1989; Wang & Kon, 1992;
Zarkovich, 1966). Managing data quality, however, is a complex task. Although it would be ideal to
achieve zero defect data,^3 this may not always be necessary or attainable for, among others, the
following two reasons:

First, in many applications, it may not always be necessary to attain zero defect data. Mailing
addresses in database marketing are a good example. In sending promotional materials to target
customers, it is not necessary to have the correct city name in an address as long as the zip code is correct.

Second, there is a cost/quality tradeoff in implementing data quality programs. Ballou and
Pazer found that "in an overwhelming majority of cases, the best solutions in terms of error rate
reduction is the worst in terms of cost" (Ballou & Pazer, 1987). The Pareto Principle also suggests that
losses are never uniformly distributed over the quality characteristics. Rather, the losses are always
distributed in such a way that a small percentage of the quality characteristics, "the vital few,"
always contributes a high percentage of the quality loss. As a result, the cost improvement potential is
high for "the vital few" projects whereas the "trivial many" defects are not worth tackling because the
cure costs more than the disease (Juran & Gryna, 1980). In sum, when the cost is prohibitively high, it is
not feasible to attain zero defect data.

1 Computerworld, September 28, 1992, p. 80-84.

2 The Wall Street Journal, May 26, 1992, page B6.

3 Just like the well publicized concept of zero defect products in the manufacturing literature.

Given that zero defect data may not always be necessary nor attainable, it would be useful to be
able to judge the quality of data. This suggests that we tag data with quality indicators which are
characteristics of the data and its manufacturing process. From these quality indicators, the user can
make a judgment of the quality of the data for the specific application at hand. In making a financial
decision to purchase stocks, for example, it would be useful to know the quality of data through quality
indicators such as who originated the data, when the data was collected, and how the data was
collected.
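
The tagging idea can be sketched in a few lines of Python. This is only an illustration, not the model defined later in the paper: the class and field names (`TaggedValue`, `source`, `collected_on`, `method`), the `acceptable` helper, and every sample value are hypothetical. The three indicators simply mirror the who, when, and how questions above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TaggedValue:
    """A data value paired with quality indicators (hypothetical field names)."""
    value: float
    source: str          # who originated the data
    collected_on: date   # when the data was collected
    method: str          # how the data was collected


def acceptable(cell: TaggedValue, trusted_sources: set, not_before: date) -> bool:
    """Application-specific judgment: the user supplies the criteria."""
    return cell.source in trusted_sources and cell.collected_on >= not_before


# Hypothetical stock price with made-up indicator values.
price = TaggedValue(
    value=52.25,
    source="exchange feed",
    collected_on=date(1992, 10, 1),
    method="electronic transfer",
)

# The same tagged value is judged differently by two applications.
print(acceptable(price, {"exchange feed"}, date(1992, 9, 1)))  # True
print(acceptable(price, {"analyst memo"}, date(1992, 9, 1)))   # False
```

The value itself carries no quality verdict. Each application applies its own criteria to the indicators, which is why the same cell can be acceptable for one use and not for another.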

In this paper, we propose an attribute-based model that facilitates cell-level tagging of data.
Included in this attribute-based model are a mathematical model description that extends the
relational model, a set of quality integrity rules, and a quality indicator algebra which can be used to
process SQL queries that are augmented with quality indicator requirements. From these quality
indicators, the user can make a better interpretation of the data and determine the believability of the
data. In order to establish the relationship between data quality dimensions and quality indicators, a
data quality requirements analysis methodology that extends the Entity Relationship (ER) model is
also presented.

Just as it is difficult to manage product quality without understanding the attributes of the
product which define its quality, it is also difficult to manage data quality without understanding the
characteristics that define data quality. Therefore, before one can address issues involved in data
quality, one must define what data quality means. In the following subsection, we present a definition
for the dimensions of data quality.

1.1. Dimensions of data quality

Accuracy is the most obvious dimension when it comes to data quality. Morey suggested that
"errors occur because of delays in processing times, lengthy correction times, and overly or insufficiently
stringent data edits" (Morey, 1982). In addition to defining accuracy as "the recorded value is in
conformity with the actual value," Ballou and Pazer defined timeliness (the recorded value is not out
of date), completeness (all values for a certain variable are recorded), and consistency (the
representation of the data value is the same in all cases) as the key dimensions of data quality (Ballou
& Pazer, 1987). Huh et al. identified accuracy, completeness, consistency, and currency as the most
important dimensions of data quality (Huh, et al., 1990).

It is interesting to note that although methods for quality control have been well established in
the manufacturing field (e.g., Juran, 1979), neither the dimensions of quality for manufacturing nor for
data have been rigorously defined (Ballou & Pazer, 1985; Garvin, 1983; Garvin, 1987; Garvin, 1988;
Huh, et al., 1990; Juran, 1979; Juran & Gryna, 1980; Morey, 1982; Wang & Guarrascio, 1991). It is also
interesting to note that there are two intrinsic characteristics of data quality:

(1) Data quality is a multi-dimensional concept.

(2) Data quality is a hierarchical concept.

We illustrate these two characteristics by considering how a user may make decisions based on
certain data retrieved from a database. First the user must be able to get to the data, which means that
the data must be accessible (the user has the means and privilege to get the data). Second, the user
must be able to interpret the data (the user understands the syntax and semantics of the data). Third,
the data must be useful (data can be used as an input to the user's decision making process). Finally, the
data must be believable to the user (to the extent that the user can use the data as a decision input).
Resulting from this list are the following four dimensions: accessibility, interpretability, usefulness,
and believability. In order to be accessible to the user, the data must be available (exists in some form
that can be accessed); to be useful, the data must be relevant (fits requirements for making the decision);
and to be believable, the user may consider, among other factors, that the data be complete, timely,
consistent, credible, and accurate. Timeliness, in turn, can be characterized by currency (when the data
item was stored in the database) and volatility (how long the item remains valid). Figure 1 depicts
the data quality dimensions illustrated in this scenario.




Figure 1: A Hierarchy of Data Quality Dimensions
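
As a rough sketch, the hierarchy described in the scenario can be written down as a nested mapping. The structure follows the text only: four top-level dimensions, with timeliness refined into currency and volatility. The noun forms of the names, the `HIERARCHY` constant, and the `paths` and `depth` helpers are assumptions made for this illustration and are not taken from the figure.

```python
# Each key is a dimension; its value maps the sub-dimensions named in the text.
HIERARCHY = {
    "accessibility": {"availability": {}},
    "interpretability": {},
    "usefulness": {"relevance": {}},
    "believability": {
        "completeness": {},
        "timeliness": {"currency": {}, "volatility": {}},
        "consistency": {},
        "credibility": {},
        "accuracy": {},
    },
}


def paths(tree, prefix=()):
    """Yield every root-to-node path in the hierarchy, depth first."""
    for name, children in tree.items():
        path = prefix + (name,)
        yield path
        yield from paths(children, path)


def depth(tree):
    """Return the number of levels contained in the tree."""
    if not tree:
        return 0
    return 1 + max(depth(children) for children in tree.values())


for path in paths(HIERARCHY):
    print(" > ".join(path))

print(depth(HIERARCHY))  # 3
```

The four top-level keys reflect the multi-dimensional character of data quality, and the nesting reflects its hierarchical character.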



These multi-dimensional concepts and hierarchy of data quality dimensions provide a
conceptual framework for understanding the characteristics that define data quality. In this paper, we
focus on interpretability and believability, as we consider accessibility to be primarily a function of the
information system and usefulness to be primarily a function of an interaction between the data and the
application domain. The idea of data tagging is illustrated more concretely below.

1.2. Data quality: an attribute-based example

Suppose an analyst maintains a database on technology companies. The schema used to support
this effort may contain attributes such as company name, CEO name, and earnings estimate (Table 1).
Data may be collected over a period of time and come from a variety of sources.

Table 1: Company Information



Company Name    CEO name    Earnings Estimate
IBM             Akers       7
DELL            Dell        3



As part of determining the believability of the data (assuming high interpretability), the
analyst may want to know when the data was generated, where it came from, how it was originally
obtained, and by what means it was recorded into the database. From Table 1, the analyst would have
no means of obtaining this information. We illustrate in Table 2 an approach in which the data is
tagged with quality indicators which may help the analyst determine the believability of the data.

Table 2: Company information with quality indicators



Company Name    CEO name    Earnings Estimate
IBM             Akers





1 3
