Copyright
Yeona Jang.

A data consumer-based approach to supporting data quality judgement online






WORKING PAPER
ALFRED P. SLOAN SCHOOL OF MANAGEMENT



A Data Consumer-based Approach to Supporting
Data Quality Judgement



December 1992



WP #3516-93
CISL WP# 92-05



Y. Jang

H. B. Kon

Richard Y. Wang

Sloan School of Management, MIT



MASSACHUSETTS

INSTITUTE OF TECHNOLOGY

50 MEMORIAL DRIVE

CAMBRIDGE, MASSACHUSETTS 02139



Published in the Proceedings of the Second Annual Workshop

on Information Technology and Systems (WITS)

Dallas, TX December 1992






Yeona Jang E53-317

Henry B. Kon E53-322

Richard Y. Wang E53-322

Sloan School of Management

Massachusetts Institute of Technology

Cambridge, MA 02139






A Data Consumer-based Approach
to Supporting Data Quality Judgment



Yeona Jang

Henry B. Kon

Richard Y. Wang

December 1992

CISL-92-05



Composite Information Systems Laboratory

E53-320, Sloan School of Management

Massachusetts Institute of Technology

Cambridge, Mass. 02139

ATTN: Professor Richard Wang

Tel: (617) 253-3442

Fax: (617) 734-2137

e-mail: [email protected]

© 1992 Yeona Jang, Henry B. Kon, and Richard Y. Wang



A Knowledge-Based Approach
to Assisting In Data Quality Judgment

(Extended Abstract)
Yeona Jang
Laboratory for Computer Science
Massachusetts Institute of Technology
yeona@lcs.mit.edu

Henry B. Kon
Sloan School of Management
Massachusetts Institute of Technology
hkon@mit.edu

Richard Y. Wang
Sloan School of Management
Massachusetts Institute of Technology
rwang@mit.edu

Abstract

As the integration of information systems enables greater accessibility to data from multiple sources, the issue
of data quality becomes increasingly important. This paper attempts to formally address the data quality
judgment problem with a knowledge-based approach. Our analysis has identified several related theoretical
and practical issues. For example, data quality is determined by several factors, referred to as quality
parameters. Quality parameters are often not independent of each other, raising the issue of how to represent
relationships among quality parameters and reason with such relationships to draw insightful knowledge
about the overall quality of data.

In particular, this paper presents a data quality reasoner. The data quality reasoner is a data quality
judgment model based on the notion of a "census of needs." It provides a framework for deriving an overall
data quality value from local relationships among quality parameters. The data quality reasoner will assist
data consumers in judging data quality. This is particularly important when a large amount of data involved
in decision-making come from different, unfamiliar sources.

1. Introduction

As the integration of information systems has enabled data consumers to gain access to both familiar
and unfamiliar data, there has been growing interest and activity in the area of data quality. Even if
each individual data supplier were to guarantee the integrity and consistency of data, data from
different suppliers may still be of different quality levels — due, for example, to different data
maintenance policies. Unfortunately, as demonstrated in studies presented in the literature such as
[Bonoma, 1985; Burnham, 1985; Johnson, 1990; Laudon, 1986], decisions made based on inaccurate or
out-of-date data can result in serious economic and social damage. The problem of data quality is thus
increasingly critical.

The majority of previous research on data quality has focused on providing data
consumers with "meta-data," i.e., data about data, that can facilitate the judgment of data quality; for
example, data source, creation time, and collection method. We refer to these characteristics of the
data manufacturing process as quality indicators (see Table 1 for examples of quality indicators).
Data quality judgment is still, however, left to the data consumers. Unfortunately, information overload
makes it difficult to analyze such data and draw useful conclusions about data quality. This paper
seeks to assist data consumers in judging whether the quality of data meets their requirements, by
reasoning about information critical to data quality judgment.

Regarding data quality, this paper focuses especially on the problem of assessing levels of data
quality, i.e., the degree to which data meets desired characteristics of the data from the user's
perspective. In considering the data quality assessment problem, our analysis has identified several
theoretical and practical issues:

1) What are data quality requirements?

2) How can relationships between dimensions of these requirements be represented?

3) What can be known about overall data quality from such relationships, and how?

The study of major US firms reported in [Wang & Guarrascio, 1991] identified a relatively
exhaustive list of requirements, such as timeliness and credibility. Such requirements are referred to as
quality parameters in this paper (see Table 2 for examples of data quality parameters). Unfortunately,
requirements of data depend largely on the intended usage of the data. For example, consider patient
records. Availability of the records may be more important than accuracy to hospital administration,
while to physicians accuracy is as important as availability of the records for effective patient
management. The issue, then, is how to deal with such user- or application-specificity of
quality-parameter relationships. This paper attempts to address this issue with a knowledge-based approach.
This raises the issue of how to represent relationships among quality parameters. Another important
issue is how to reason with such relationships to draw insightful knowledge about overall data
quality. This paper focuses mainly on addressing the last two issues: representational and reasoning
issues. To do so, we assume that data quality parameters, such as those shown in Table 2, are available for
use.

Research presented in this paper was supported in part by the International Financial Services Research Center at MIT, in part by the National
Heart, Lung, and Blood Institute under the grant number R01 HL33041, and in part by the National Institute of Health under the grant number R01
LM04493 from the National Library of Medicine.



Table 1: Data Quality Indicators

Indicator           data #1    data #2       data #3
Source              DB#1       DB#2          DB#3
Creation-time       6/11/92    6/9/92        6/2/92
Update-frequency    daily      weekly        monthly
Collection-method   barcode    entry clerk   radio freq.


Table 2: Data Quality Parameters

Parameter     data #1    data #2    data #3
Credibility   High       Medium     Medium
Timeliness    High       Low        Low
Accuracy      High       Medium     Medium



The mechanism investigated in this paper is the data quality reasoner. This is a simple data
quality judgment model based on the notion of a "census of needs." It applies a knowledge-based
approach in data quality judgment. The intention is to provide flexibility advantages in dealing with
the subjective, decision-analytic nature of data quality judgment. The data quality reasoner provides a
framework for representing and reasoning with local relationships among quality parameters to
produce an overall data quality level. Such "informating" ability of the data quality reasoner would
have significant value for assisting data consumers in judging data quality, particularly when data
involved in decision-making come from different, unfamiliar sources.

1.1. Quality Indicators and Quality Parameters

It is worth noting the relationship between quality parameters and quality indicators. The essential
distinction between quality indicators and quality parameters is that quality indicators are intended
(primarily) to represent objective information about the data manufacturing process [Wang & Kon,
1992]. Quality parameters, however, can be user- or application-specific, and are derived from either
underlying quality indicators or other quality parameters. The topology of the "quality hierarchy" in
this paper is: a single quality parameter being derived from n underlying quality parameters. Each
underlying quality parameter, in turn, could be derived from either its underlying quality parameters or
quality indicators. For example, a user may conceptualize quality parameter Credibility as one
depending on underlying quality parameters such as Source-reputation and Timeliness. The quality
parameter Source-reputation, in turn, can be derived from quality indicators such as the number of times
that a source supplies obsolete data. This paper assumes that such derivations are complete, and that
relevant quality parameter values are available.
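
As an illustration, the quality hierarchy described above can be sketched as a small tree in Python. This is only a sketch under assumptions: the class names, the `leaves` helper, the indicator name, and the count of 2 are hypothetical and do not come from the paper, which does not prescribe a data structure or a derivation rule.

```python
from dataclasses import dataclass, field


@dataclass
class QualityIndicator:
    """Objective fact about the data manufacturing process."""
    name: str
    value: object


@dataclass
class QualityParameter:
    """User- or application-specific parameter derived from the nodes beneath it."""
    name: str
    derived_from: list = field(default_factory=list)

    def leaves(self):
        """Return the names of all quality indicators under this parameter."""
        found = []
        for node in self.derived_from:
            if isinstance(node, QualityIndicator):
                found.append(node.name)
            else:
                found.extend(node.leaves())
        return found


# Hypothetical indicator name and count, for illustration only.
obsolete_count = QualityIndicator("times-source-supplied-obsolete-data", 2)
source_reputation = QualityParameter("Source-reputation", [obsolete_count])
timeliness = QualityParameter("Timeliness")
credibility = QualityParameter("Credibility", [source_reputation, timeliness])
```

Here Credibility sits above Source-reputation and Timeliness, mirroring the example in the text; how values are actually derived along the edges is left open, as in the paper.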



1.2. Overview

In general, several quality parameters may be involved in determining overall data quality. This
raises the issue of how to specify the degree to which each quality parameter contributes to overall
data quality. One approach is to specify the degree, in certain absolute terms, for each quality
parameter. It may not, however, be practical to completely specify such values. Rather, people often
conceptualize local relationships, such as "Timeliness is more important than the credibility of a source
for this data, except when timeliness is low." In such a case, if Timeliness is high and Source-credibility is
medium, the data may be of high quality. The model presented in this paper provides a formal
specification of such local "dominance relationships" between quality parameters.

The issue is, then, how to use these local dominance relationships between quality parameters,
and what can be known about data quality from them. Observe that each local relationship between
quality parameters specifies the local relative significance of quality parameters. One way to use local
dominance relationships would be to rank and enumerate quality parameters in the order of
significance implied by local dominance relationships. Finding a total ordering of quality parameters
consistent with local relative significance, however, can be computationally intensive. In addition, a
complete enumeration of quality parameters may contain too much information to convey to data
consumers any insights about overall data quality. This paper provides a model to help data consumers
raise their levels of knowledge about the data they use, and thus make informed decisions. Such a
process represents data quality filtering.

Our project involves an investigation of a data quality judgment model, with the aim of raising
related issues and describing mechanisms behind the use of knowledge about local quality-parameter
relationships in data quality judgment. Section 2 discusses a representation for specifying various local
relationships between quality parameters. Section 3 discusses the computational component of the
quality judgment model. It includes a mechanism for reasoning with local dominance relationships to
identify information critical to overall data quality. Finally, Section 4 summarizes this research and
suggests future directions for the field of data quality evaluation.

1.3. Related Work

The decision-analytic approach, as summarized in [Keeney & Raiffa, 1976], and utility analysis under
multiple objectives, as summarized in [Chankong & Haimes, 1983], describe solution approaches for
specifying preferences and resolving multiple objectives. The preference structure of a decision maker or
evaluator is specified as a hierarchy of objectives. Through a decomposition of objectives using either
subjectively defined mappings or formal utility analyses, the hierarchy can be reduced to an overall
value. The decision-analytic approach is generally built around the presupposition of the existence of
continuous utility functions. The approach presented in this paper, on the other hand, does not require
that dominance relations between quality parameters be continuous functions, or that their interactions
be completely specified. It only presupposes that some local dominance relationships between quality
parameters exist.

Representational schemes similar to the one presented in this paper have been investigated, to represent
preferences, in subdisciplines of Artificial Intelligence such as Planning [Wellman, 1990; Wellman &
Doyle, 1991]. That research effort, however, has focused primarily on issues involved in representing
preferences, and much less so on computational mechanisms for reasoning with such knowledge.

2. Data Quality Reasoner

This section discusses the data quality reasoner, called DQR. DQR is a data quality judgment model
which derives an overall data quality value for a particular data element, based on the following
information:

1) A set, QP, of underlying quality parameters that affect data quality: $QP = \{q_1, q_2, \dots, q_n\}$.

2) A set, DR, of local dominance relationships between quality parameters in QP.

In particular, this paper addresses the following fundamental issues that arise in considering
the use of local relationships between quality parameters in data quality judgment:

1) How to represent local dominance relationships between quality parameters.

2) What to do with such local dominance relationships.

Section 2.1 presents a representation scheme for specifying local dominance relationships between
quality parameters in order to facilitate data quality judgment. Section 2.2 discusses a computational
framework which exploits such relationships to draw insights about overall data quality.

2.1. Representation of Local Dominance Relationships

This subsection discusses a representation of local dominance relationships between quality parameters.
To facilitate further discussion, additional notations are introduced below. For any quality parameter
$q_i$, let $V_i$ denote the set of values that $q_i$ can take on. In addition, the following notation is used
to describe value assignments for quality parameters. For any quality parameter $q_i$, the value
assignment $q_i := v$ (for example, Timeliness := High) represents the instantiation of the value of $q_i$ as $v$,
for some $v$ in $V_i$. Value assignments for quality parameters, such as $q_i := v$, are called
"quality-parameter value assignments". A quality parameter with a particular value assigned to it is also
referred to as an instantiated quality parameter.
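
The notation above can be made concrete with a minimal Python sketch. The names `VALUE_SETS`, `Assignment`, and `assign` are hypothetical, and the three-level value sets are an assumption based on the High / Medium / Low values used in Table 2; the paper itself does not fix the contents of any $V_i$.

```python
from typing import NamedTuple

# Assumed value sets V_i; Table 2 uses High / Medium / Low.
VALUE_SETS = {
    "Timeliness": {"High", "Medium", "Low"},
    "Credibility": {"High", "Medium", "Low"},
}


class Assignment(NamedTuple):
    """A quality-parameter value assignment, q_i := v."""
    parameter: str
    value: str


def assign(parameter: str, value: str) -> Assignment:
    """Instantiate `parameter` with `value`, checking that the value is in V_i."""
    allowed = VALUE_SETS[parameter]
    if value not in allowed:
        raise ValueError(f"{value!r} is not in the value set of {parameter}")
    return Assignment(parameter, value)


# Timeliness := High
timeliness_high = assign("Timeliness", "High")
```

A call such as `assign("Timeliness", "Excellent")` raises `ValueError`, because the value lies outside the assumed value set.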

For quality parameters $q_1, q_2, \dots, q_n$, for some integer $n \geq 1$, $q_1 \cap q_2 \cap \dots \cap q_n$ represents a
conjunction of quality parameters. Similarly, $q_1 := v_1 \cap q_2 := v_2 \cap \dots \cap q_n := v_n$, for some $v_i$ in $V_i$,
for all $i = 1, 2, \dots, n$, represents a conjunction of quality-parameter value assignments. Note that the symbol $\cap$
used in the above statement denotes the logical conjunction, not set intersection, of events asserted by
instantiating quality parameters.

Finally, notation "$\oplus$" is used to state that data quality is affected by quality parameters. The statement
that data quality is affected by quality parameters $q_1, q_2, \dots,$ and $q_n$ is represented as $\oplus(q_1 \cap q_2 \cap \dots \cap q_n)$.
Statement $\oplus(q_1 \cap q_2 \cap \dots \cap q_n)$ is called a quality-merge statement, and is read as "the quality merge of $q_1$,
$q_2, \dots,$ and $q_n$." Simpler notation, $\oplus(q_1, q_2, \dots, q_n)$, is also used. A quality-merge statement is said to be
instantiated, if all quality parameters in a quality-merge statement are instantiated to certain values.
For example, statement $\oplus(q_1 := v_1 \cap q_2 := v_2 \cap \dots \cap q_n := v_n)$ is an instantiated quality-merge statement of
$\oplus(q_1, q_2, \dots, q_n)$, for some $v_i$ in $V_i$, for all $i = 1, 2, \dots, n$.

The following defines a local dominance relationship among quality parameters.

Definition 1 (Dominance relation): Let $E_1$ and $E_2$ be two conjunctions of quality-parameter value
assignments. $E_1$ is said to dominate $E_2$, denoted by $E_1 >_d E_2$, if and only if $\oplus(E_1, E_2, +)$ is reducible to
$\oplus(E_1, +)$, where "+" stands for the conjunction of value assignments for the rest of the quality
parameters, in QP, which are shown neither in $E_1$ nor in $E_2$.

Note that as implied by "+," this definition assumes the context-insensitivity of reduction: $\oplus(E_1, E_2, +)$
can be reduced to $\oplus(E_1, +)$, regardless of the values of the quality parameters, in QP, that are not
involved in the reduction. Moreover, "+" implies that these uninvolved quality parameters in QP
remain unaffected by the application of reduction. For example, consider a quality-merge statement
which consists of quality parameters Source-credibility, Interpretability, Timeliness, and more.
Suppose that when Source-credibility and Timeliness are High, and Interpretability is Medium,
Interpretability dominates the other two. This dominance relationship can be represented as follows:

"Interpretability := Medium $>_d$ Source-credibility := High $\cap$ Timeliness := High."

Then, $\oplus$(Source-credibility := High, Interpretability := Medium, Timeliness := High, +) is reducible to
quality-merge statement $\oplus$(Interpretability := Medium, +).
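
The reduction in this example can be sketched in Python. This is a minimal illustration under assumptions, not the authors' implementation: the function name `reduce_once` and the dictionary encoding of a quality-merge statement are hypothetical, and the extra parameter Accuracy := Low is chosen arbitrarily to stand in for the "+" part.

```python
def reduce_once(merge, dominant, dominated):
    """Apply E1 >_d E2 to an instantiated quality-merge statement.

    `merge`, `dominant` (E1) and `dominated` (E2) are dicts mapping a
    quality parameter to its assigned value. If every assignment in E1 and
    E2 appears in `merge`, the assignments of E2 are dropped; everything
    else (the "+" part) is left untouched. Otherwise a copy of `merge` is
    returned unchanged.
    """
    needed = {**dominant, **dominated}
    if all(merge.get(p) == v for p, v in needed.items()):
        return {p: v for p, v in merge.items() if p not in dominated}
    return dict(merge)


merge = {
    "Source-credibility": "High",
    "Interpretability": "Medium",
    "Timeliness": "High",
    "Accuracy": "Low",  # stands in for the "+" part; chosen arbitrarily
}
e1 = {"Interpretability": "Medium"}
e2 = {"Source-credibility": "High", "Timeliness": "High"}

reduced = reduce_once(merge, e1, e2)
# reduced == {"Interpretability": "Medium", "Accuracy": "Low"}
```

The uninvolved parameter survives the reduction with its value intact, which is the context-insensitivity that "+" expresses in Definition 1.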

As mentioned at the beginning of Section 2, the evaluation of the overall data quality for a
particular data element requires information about a set of quality parameters that play a role in
determining the overall quality, $QP = \{q_1, q_2, \dots, q_n\}$, and a set DR of local dominance relationships
between quality parameters in QP. Information provided in QP is interpreted by DQR as "the overall
quality is the result of quality merge of quality parameters $q_1, q_2, \dots,$ and $q_n$, i.e., $\oplus(q_1, q_2, \dots, q_n)$." Local
dominance relationships in DR are used to derive an overall data quality value. It may be unnecessary
or impossible, however, to explicitly state each and every plausible relationship between quality
parameters in DR. Assuming incompleteness of preferences in quality parameter relationships, this
paper approaches the incompleteness issue with the following default assumption: For any two
conjunctions of quality parameters, if no information on dominance relationships between them is
available, then they are assumed to be in the indominance relation. The indominance relation is
represented as follows:

Definition 2 (Indominance relation): Let $E_1$ and $E_2$ be two conjunctions of quality-parameter value
assignments. $E_1$ and $E_2$ are said to be in the indominance relation, if neither $E_1 >_d E_2$ nor $E_2 >_d E_1$.

When two conjunctions of quality parameters are indominant, a data consumer may specify the result of
quality merge of them, according to his or her needs.
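
The default assumption can be sketched as a lookup that falls back to indominance. The function name `relation`, the string labels it returns, and the encoding of DR as a set of (dominant, dominated) pairs are assumptions made for illustration only.

```python
def relation(e1, e2, dr):
    """Classify two conjunctions of value assignments.

    `e1` and `e2` are frozensets of (parameter, value) pairs; `dr` is a set
    of (dominant, dominated) pairs of such frozensets. Anything not listed
    in `dr` falls back to the default assumption of indominance.
    """
    if (e1, e2) in dr:
        return "e1 dominates e2"
    if (e2, e1) in dr:
        return "e2 dominates e1"
    return "indominance"


interp_medium = frozenset({("Interpretability", "Medium")})
cred_time_high = frozenset({("Source-credibility", "High"), ("Timeliness", "High")})
accuracy_high = frozenset({("Accuracy", "High")})

# DR holds only the relationship from the example in Section 2.1.
DR = {(interp_medium, cred_time_high)}
```

With this DR, `relation(interp_medium, accuracy_high, DR)` yields "indominance", since no relationship between the two conjunctions has been stated.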



2.2. Reasoning Component of DQR

The previous subsection discussed how to represent local relationships between quality parameters. The
next question that arises is then how to derive overall data quality from such local dominance
relationships, i.e., how to evaluate a quality-merge statement based on such relationships. This task,
simply referred to as the "data-quality-estimating problem," is summarized as follows:



Data-Quality-Estimating Problem:

Let DR be a set of local dominance relationships between quality parameters, $q_1, q_2, \dots,$ and $q_n$.
Compute $\oplus(q_1, q_2, \dots, q_n)$, subject to local dominance relationships in DR.



An instance of the data-quality-estimating problem is represented as a list of a quality-merge
statement and a corresponding set of local dominance relationships, i.e., $(\oplus(q_1, q_2, \dots, q_n), DR)$.

The rest of this section presents a framework for solving the data-quality-estimating problem,
based on the notion of "reduction". The following axiom defines the data quality value when only one
quality parameter is involved in quality merge.

Axiom 1 (Quality Merge): For any quality-merge statement $\oplus(q_1, q_2, \dots, q_n)$, if $n = 1$, then the value of
$\oplus(q_1, q_2, \dots, q_n)$ is equal to that of $q_1$.

Quality-merge statements with more than one quality parameter are reduced to ones with a
smaller number of quality parameters. The following axioms provide a basis for the
reduction. As implied by Definition 1 and the default assumption, any two conjunctions of
quality-parameter value assignments can be in either the dominance relation or the indominance relation.
The following axiom specifies that any two conjunctions cannot be both in the dominance relation and in
the indominance relation.

Axiom 2 (Mutual Exclusivity): For any two conjunctions $E_1$ and $E_2$ of quality-parameter value
assignments, $E_1$ and $E_2$ are related to each other in exactly one of the following ways:

1. $E_1 >_d E_2$
2. $E_2 >_d E_1$
3. $E_1$ and $E_2$ are in the indominance relation.

The following axiom defines the precedence of the dominance relation over the indominance
relation. This implies that while evaluating a quality-merge statement, quality parameters in the
dominance relation are considered before those not in the dominance relation.

Axiom 3 (Precedence of $>_d$): The dominance relation takes precedence over the indominance relation.

Reduction-Based Evaluation: A reduction-based evaluation scheme is any evaluation process where the
reduction operations take precedence over all other evaluation operations. Definition 1 and axiom 3
allow the reduction-based evaluation strategy to be used to solve the data-quality-estimating problem
for quality-merge statements with more than one quality parameter.
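
A reduction-based evaluation of this kind might be sketched as follows. This is an illustration under stated assumptions rather than the DQR algorithm itself: the names `evaluate`, `merge_indominant`, and `worst` are hypothetical, relationships are tried in list order (an arbitrary choice), and the pessimistic callback is only one possible consumer-specified merge for indominant parameters.

```python
def evaluate(merge, dr, merge_indominant):
    """Evaluate an instantiated quality-merge statement.

    merge: dict mapping quality parameter -> value.
    dr: list of (dominant, dominated) pairs, each a dict of assignments.
    merge_indominant: consumer-supplied callback that returns an overall
        value for parameters no dominance relationship can reduce.
    """
    merge = dict(merge)
    applied = True
    while applied:  # reductions are tried before anything else (Axiom 3)
        applied = False
        for dominant, dominated in dr:
            needed = {**dominant, **dominated}
            if dominated and all(merge.get(p) == v for p, v in needed.items()):
                for p in dominated:
                    del merge[p]  # drop the dominated assignments
                applied = True
                break  # rescan DR from the start
    if len(merge) == 1:  # Axiom 1: a single parameter remains
        return next(iter(merge.values()))
    return merge_indominant(merge)


def worst(merge):
    """Hypothetical consumer rule: take the lowest remaining value."""
    rank = {"Low": 0, "Medium": 1, "High": 2}
    return min(merge.values(), key=rank.get)


dr = [({"Interpretability": "Medium"},
       {"Source-credibility": "High", "Timeliness": "High"})]
statement = {"Source-credibility": "High",
             "Interpretability": "Medium",
             "Timeliness": "High"}
overall = evaluate(statement, dr, worst)  # reduces to Interpretability := Medium
```

Because relationships are applied in list order, a different ordering of `dr` may lead to a different reduction when several relationships are applicable at once.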

The use of dominance relationships to reduce a quality-merge statement raises the issue of
which local dominance relationships to apply first, i.e., regarding the order in which local dominance
relationships are applied. Unfortunately, the reduction of a quality-merge statement is not always
well-defined. In particular, a quality-merge statement can be reduced in more than one way, depending
on the order in which the reduction is performed. For example, consider an instance of the
data-quality-estimating problem, ($\oplus$(

