Chris F. Kemerer
Benjamin S. Porter

October 1991

CISR WP No. 229
Sloan WP No. 3352-91-MSA

Center for Information Systems Research

Massachusetts Institute of Technology

Sloan School of Management

77 Massachusetts Avenue

Cambridge, Massachusetts, 02139




©1991 C.F. Kemerer, B.S. Porter


Improving the Reliability of Function Point Measurement:

An Empirical Study

Chris F. Kemerer

Massachusetts Institute of Technology


50 Memorial Drive

Cambridge, MA 02139

[email protected]

617/253-2971 (o)

617/258-7579 (fax)

Benjamin S. Porter

DMR Group, Inc

12 Post Office Square

Boston, MA 02109

617/451-9500 (o)

617/695-1537 (fax)

October 1991

Research support from the International Function Point Users Group and MIT's Center for
Information Systems Research is gratefully acknowledged. The cooperation of A. Belden,
M. Braun, and J. Frisbie was invaluable in providing data for this research. Helpful
comments were received from J.M. Desharnais, F. Mazzucco, J. Quillard, R. Selby, C. Scates,
W. Rumpf, and L. Smith on an earlier version.

Improving the Reliability of Function Point Measurement:

An Empirical Study


Information Systems development has operated for virtually its entire history
without the quantitative measurement capability of other business functional areas
such as marketing or manufacturing. Today, managers of Information Systems
organizations are increasingly taken to task to measure and report, in quantitative
terms, the effectiveness and efficiency of their internal operations. In addition,
measurement of information systems development products is also an issue of
increasing importance due to the growing costs associated with information systems
development and maintenance.

One measure of the size and complexity of information systems that is growing in
acceptance and adoption is Function Points, a user-oriented, non-source-line-of-code
metric of the product of systems development. Recent research has
documented the degree of reliability of Function Points as a metric. This research
extends that work by (a) identifying the major sources of variation through a survey
of current practice, and (b) estimating the magnitude of the effect of these sources of
variation using detailed case study data from actual commercial systems.

The results of the research show that a relatively small number of factors have the
greatest potential for affecting reliability, and recommendations are made for using
these results to improve the reliability of Function Point counting in organizations.

ACM CR Categories and Subject Descriptors: D.2.8 (Software Engineering): Metrics; D.2.9 (Software
Engineering): Management; K.6.0 (Management of Computing and Information Systems): General - Economics;
K.6.1 (Management of Computing and Information Systems): Project and People Management; K.6.3
(Management of Computing and Information Systems): Software Management

General Terms: Management, Measurement, Performance, Estimation, Reliability.

Additional Key Words and Phrases: Function Points, Project Planning, Productivity Evaluation.


Management of software development and maintenance encompasses two major
functions, planning and control, both of which require the capability to accurately and
reliably measure the software being delivered. Planning of software development projects
emphasizes estimation of the size of the delivered system in order that appropriate budgets
and schedules can be agreed upon. Without valid size estimates, this process is likely to be
highly inaccurate, leading to software that is delivered late and over-budget. Control of
software development requires a means to measure progress on the project and to perform
after-the-fact evaluations of the project in order, for example, to evaluate the effectiveness
of the tools and techniques employed on the project to improve productivity and quality.

Unfortunately, as current practice often demonstrates, both of these activities are typically
not well performed, in part because of the lack of well-accepted measures, or metrics.
Software size is a critical component of productivity and quality ratios, and has
traditionally been measured by the number of source lines of code (SLOC) delivered in the
final system. This metric has been criticized in both its planning and control applications.
In planning, the task of estimating the final SLOC count for a proposed system has been
shown to be difficult to do accurately in actual practice (Low and Jeffery 1990). And in
control, SLOC measures for evaluating productivity have weaknesses as well, in particular,
the problem of comparing systems written in different languages (Jones 1986).

Against this background, an alternative software size metric was developed by Allan
Albrecht of IBM (Albrecht and Gaffney 1983). This metric, which he termed "function
points" (hereafter FPs), is designed to size a system in terms of its delivered functionality,
measured as a weighted sum of numbers of inputs, outputs, inquiries, and files. Albrecht
argued that these components would be much easier to estimate than SLOC early in the
software project life-cycle, and would be generally more meaningful to non-programmers.

In addition, for evaluation purposes, they would avoid the difficulties involved in
comparing SLOC counts for systems written in different languages.
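Concretely, the unadjusted count is a weighted sum over the five component types. The sketch below uses the standard average-complexity weights from the published method; the component counts themselves are hypothetical, not data from this study.

```python
# Unadjusted Function Point count as a weighted sum of the five component
# types. Weights are the standard "average complexity" weights from the
# published method; the component counts below are hypothetical.
WEIGHTS = {
    "external_inputs": 4,
    "external_outputs": 5,
    "external_inquiries": 4,
    "internal_logical_files": 10,
    "external_interface_files": 7,
}

def unadjusted_fp(counts):
    """Return the weighted sum of the five function component counts."""
    return sum(WEIGHTS[component] * n for component, n in counts.items())

example = {
    "external_inputs": 20,
    "external_outputs": 15,
    "external_inquiries": 10,
    "internal_logical_files": 12,
    "external_interface_files": 5,
}
print(unadjusted_fp(example))  # 20*4 + 15*5 + 10*4 + 12*10 + 5*7 = 350
```

Because the weights multiply raw component counts directly, a disagreement over whether an item counts at all shifts the total by the full weight of that component, which is why differing interpretations of the five types can matter a great deal.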

FPs have proven to be a broadly accepted metric with both practitioners and academic
researchers. Dreger estimates that some 500 major corporations world-wide are using FPs
(Dreger 1989), and, in a survey by the Quality Assurance Institute, FPs were found to be
regarded as the best available MIS productivity metric (Perry 1986). They have also been
widely used by researchers in such applications as cost estimation (Kemerer 1987), software
development productivity evaluation (Behrens 1983) (Rudolph 1983), software
maintenance productivity evaluation (Banker et al. 1991), software quality evaluation
(Cooprider and Henderson 1989) and software project sizing (Banker and Kemerer 1989).
Additional work in defining standards has been done by Zwanzig (Zwanzig 1984) and
Desharnais (Desharnais 1988). Although originally developed by Albrecht for traditional
MIS applications, recently there has been significant work in extending FPs to scientific and
real time systems (Jones 1988; Reifer 1990; Whitmire et al. 1991).

Despite their wide use by researchers and their growing acceptance in practice, FPs are not
without criticism. The main criticism revolves around the alleged low inter-rater
reliability of FP counts, that is, whether two individuals performing a FP count for the
same system would generate the same result (Carmines and Zeller 1979). Barry Boehm, a
leading researcher in the software estimation and modeling area, has described the
definitions of function types as "ambiguous" (Boehm 1987). And, the author of a leading
software engineering textbook summarizes his discussion of FPs as follows:

"The function-point metric, like LOC, is relatively controversial... Opponents claim that the method
requires some 'sleight of hand' in that computation is based on subjective, rather than objective,
data..." (Pressman 1987, p. 94)

This perception of FPs as being unreliable has undoubtedly slowed their acceptance as a
metric, as both practitioners and researchers may feel that in order to ensure sufficient
measurement reliability either a) a single individual would be required to count all

systems, or b) multiple raters should be used for all systems and their counts averaged to
approximate the 'true' value (Symons 1988). Both of these options are unattractive, in
terms of decreased flexibility in the first case and likely increased cost and cycle times
in the second.

Against this background some recent research has measured the actual magnitude of the
inter-rater reliability. Kemerer performed a field experiment where pairs of systems
developers measured FP counts for completed medium-sized commercial systems
(Kemerer 1991). The results of this analysis were that the pairs of FP counts were highly
correlated (p = .8) and had an average variance of approximately eleven percent.

While these results are encouraging for the continued use of FPs, as the reliability is much
higher than previously speculated, there is clearly still room for improvement. In
particular, given that one use of FPs is for managerial control in the form of post-
implementation productivity and quality evaluations, an 11% variance in counting could
mask small but real underlying productivity changes, and therefore could interfere with
proper managerial decision making. For example, a software project might have been a
pilot test for use of a new tool or method, which resulted in a ten percent productivity
gain. If, through unfortunate coincidence the output of this project was understated by
eleven percent, then managers might come to the mistaken conclusion that the new tool
or method had no or even a slightly negative impact, and thus inappropriately abandon it.
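The arithmetic of this scenario (with hypothetical figures) shows how easily the gain disappears:

```python
# Hypothetical illustration: a genuine 10% output gain is masked by an
# 11% undercount, so the pilot appears to have produced *less* than baseline.
baseline_fp = 450.0                             # comparable earlier project
true_pilot_fp = baseline_fp * 1.10              # pilot truly delivered 10% more
measured_pilot_fp = true_pilot_fp * (1 - 0.11)  # but the count is 11% low

print(true_pilot_fp, measured_pilot_fp)  # measured count falls below baseline
```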

Given this and similar scenarios, it is clearly important for management to have reliable
instruments with which to measure their output. And, given that (1) FPs are already
widely in use as a metric, and (2) have been shown to have good but imperfect reliability, it
seems appropriate to attempt to determine the sources of the variation in counting as a
first step towards eliminating them and making FPs an even more reliable metric.

The previous research described above used a large scale experimental design to identify
the magnitude of the variations in FP counting. However, that research approach is
ill-suited to the detailed analysis necessary to address the source of the variations in
reliability. Therefore, this paper reports on the results of a two-phased research approach
that is complementary to the research described earlier. The first phase used a
combination of key informants and a field survey to identify the most likely sources of FP
counting variance. The second phase collected data from three detailed case studies which
were then used to estimate the magnitude of effect of the variations. In all, thirty-three
FP counts were estimated from the detailed case study data.

The results from this analysis identified three potential sources of variation in FP
counting: the treatment of backup files, menus, and external files used as transactions.
These are the three areas where tighter standards are necessary and where managers
should focus their attention on adopting and adhering to standard counting practices. The
results of this research also identified several areas that have been suggested to cause
variation, but may not be important sources of error in actual practice. These include
treatment of error message responses and hard coded tables.

This paper is organized as follows. Section 2 presents a brief description of the research
problem and the previous research. Section 3 describes the research methodology, which
consisted of a survey and a set of quantitative case studies. Results of this analysis are
presented in Section 4, and Section 5 offers some concluding remarks.


2.1. Introduction

The uses of software measurement are as varied as the organizations which are putting the
measures into practice. One widespread use of software measurement is to improve the
estimation of the size of development projects. Much of the early literature on software
measurement focuses on the complexities of estimation (Boehm 1981) (Jones 1986).

It has only been within the past several years that many organizations have begun
systematically collecting a wide variety of data about their software development and
maintenance activities. These measurement activities have accompanied the advent of both management
programs (designed to set and achieve various effectiveness and efficiency objectives) and
professional development programs (assisting professionals in the furtherance of their
development and maintenance skills).

2.2. Previous Research

Despite both the widespread use of FPs and some attendant criticism of their suspected lack
of reliability, there has been little research on this question. Perhaps the first attempt at
investigating the inter-rater reliability question was made by members of the IBM GUIDE
Productivity Project Group, the results of which are described by Rudolph as follows:

"In a pilot experiment conducted in February 1983 by members of the GUIDE Productivity Project Group
...about 20 individuals judged independently the function point value of a system, using the
requirement specifications. Values within the range +/- 30% of the average judgement were observed
...The difference resulted largely from differing interpretation of the requirement specification. This
should be the upper limit of the error range of the function point technique. Programs available in
source code or with detailed design specification should have an error of less than +/- 10% in their
function point assessment. With a detailed description of the system there is not much room for
different interpretations." (Rudolph 1983, p. 6)

Aside from this description, the only other research documented study is by Low and
Jeffery (Low and Jeffery 1990). Their research focused on the inter-rater reliability of FP
counts using as their research methodology an experiment using professional systems
developers as subjects, with the unit of analysis being a set of program level specifications.
Two sets of program specifications were used, both pre-tested with student subjects. For
the inter-rater reliability question, 22 systems development professionals who counted FPs
as part of their employment in seven Australian organizations were used, as were an
additional 20 inexperienced raters who were given training in the then current Albrecht
standard. Each of the experienced raters used his or her organization's own variation on
the Albrecht standard (Jeffery 1990). With respect to the inter-rater reliability research

question Low and Jeffery found that the consistency of FP counts "appears to be within the
30 percent reported by Rudolph" within organizations (Low and Jeffery 1990, p. 71).

Most recently, Kemerer conducted a large-scale field experiment to address, among other
objectives, the question of inter-rater reliability using a different research design. Low and
Jeffery chose a small group experiment, with each subject's identical task being to count the
FPs implied from the two program specifications. Due to this design choice, they were
limited to choosing relatively small tasks, with the mean FP size of each program being 58
and 40 FPs, respectively. A possible concern with this design would be the external validity
of the results obtained from the experiment in relation to real world systems. Typical
medium sized application systems are generally an order of magnitude larger than the
programs counted in the Low and Jeffery experiment (Emrick 1988) (Topper 1990). The
Kemerer study tested inter-rater reliability using more than 100 different total counts in a
data set with 27 actual commercial systems. Multiple raters were used to count the
systems, whose average size was 450 FPs. The results of the study were that the FP counts
from pairs of raters using a standard method¹ differed on average by approximately eleven
percent. These results suggest that FPs are much more reliable than previously suspected,
and therefore may indicate that wider acceptance and greater adoption of FPs as a software
metric is appropriate.

However, these results also point out that variation is still present, and that the ideal goal
of zero percentage variation has not been achieved in practice. In addition, this previous
research, while identifying the magnitude of the variance, has not identified its sources.
Therefore, of continued interest to managers are any systematic sources of this variation
with accompanying recommendations for how to reduce or eliminate these variations.

¹ As defined by the International Function Point Users Group Counting Practices Manual, Release 3.0.


3.1 Introduction

This research was designed to address the question of the sources of decreased reliability
in FP counting and consisted of two phases. In the first phase, key informants identified
sixteen likely sources of variation. A survey of forty-five experienced users identified nine
of these sixteen as especially problematic. In the second phase, detailed quantitative case
study data on three commercial systems were collected and each system was counted using
each rule variation. These cases are from three diverse organizations and management
information systems.

3.2 Survey Phase

Development of the survey form was accomplished with significant involvement of the
Counting Practices Committee (CPC) of the International Function Point Users Group
(IFPUG). The committee consists of approximately a dozen experts drawn from within the
membership of IFPUG. IFPUG consists of approximately 350 member organizations
worldwide, with the vast majority being from the United States and Canada (Scates 1991).
IFPUG is generally viewed as the lead organization involved with FP measurement and
the CPC is the standards setting body within IFPUG (Albrecht 1990).

The CPC is responsible for the publication of the Counting Practices Manual (CPM), now in
its third general release (Sprouls 1990). This is their definitive standards manual for the
counting of FPs. In soliciting input from the CPC for this research, attention was focused
on those systems areas for which (a) no current standard exists in the CPM, and (b) areas
for which a standard exists but for which there is believed to be significant
non-compliance.

From a series of meetings and correspondence with these key informants an original
survey of fourteen questions was developed². This survey was pre-tested with members of
the CPC and a small number of IFPUG member organizations not represented on the CPC,
which resulted in the addition of two questions and some minor changes to existing
questions. The final sixteen question survey is presented in Appendix A. This survey was
mailed to eighty-four volunteer member organizations of IFPUG, who were asked to
document how FP counting was actually done within their organization. No
compensation was provided for completing the survey, although respondents were
promised a summary of the results. Completion of the survey was estimated to require
one hour of an experienced FP counter's time. Forty-five usable surveys were received,
for a response rate of fifty-four percent. The survey respondents are believed to represent
experienced to expert practice in current FP counting.

3.3. Case Study Phase

3.3.1 Introduction

While the survey phase of the research identified those areas that are likely sources of
variation, it did not identify the magnitude of those effects. For example, while
organizations may differ on the proper interpretation of a given FP construct, it may be the
case that the situation described is relatively rare within actual information systems, such
that differences in how it is treated may have negligible effect on an average FP count.
Detailed data for each variant are required to assess the magnitude of the potential
differences caused by each of the possible sources of variation. Given these data
requirements, a quantitative case study methodology was chosen. As described by
Swanson and Beath, this approach features the collection of multiple types of data,
including documentation, archival records, and interviews (Swanson and Beath 1988).

² It is interesting to note that all of these questions deal with how to measure the five function count types, and
none with the fourteen complexity factors. This reflects the fact that any reliability concerns relating to the
fourteen complexity factors are small, given that their potential impact on the final FP count is constrained by
the mathematical formula (Albrecht and Gaffney 1983) (Bock and Klepper 1990). This is in contrast to the
five function types, where the impact of a different interpretation is unconstrained, and can be potentially very
large. Empirical research has also documented the result that the impact of the fourteen complexity factors is
small (Kemerer 1987).
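The bounded impact of the fourteen complexity factors can be made concrete. Under the published formula, the adjusted count equals the unadjusted count times a value adjustment factor of 0.65 + 0.01 × ΣFᵢ, with each factor rated 0 to 5, so the factors can move a count by at most ±35%:

```python
# Value adjustment factor from the published FP method: each of the 14
# complexity factors is rated 0..5, so the multiplier is bounded in
# [0.65, 1.35] -- at most a +/-35% effect on the final count.
def value_adjustment_factor(factors):
    assert len(factors) == 14 and all(0 <= f <= 5 for f in factors)
    return 0.65 + 0.01 * sum(factors)

lowest = value_adjustment_factor([0] * 14)   # all factors absent: -35% bound
highest = value_adjustment_factor([5] * 14)  # all factors maximal: +35% bound
print(lowest, highest)
```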

The demand for detailed data with which to evaluate the multiple variations suggested by
the surveys had two effects upon the research. First, a significant data collection and
analysis effort was required for each case, since each variant required the collection of
additional data and the development of a new FP count. Second, the detailed data
requirements excluded a number of initially contacted organizations from participating in
the final research.

The project selection criteria were that the projects had been recently completed and
already had a completed FP count in the range of 200 - 600 FPs. This range was
selected as encompassing medium sized application development and is the size range of
the bulk of projects which are undertaken in North American systems development
organizations today (Dreger 1989) (Kemerer 1991). None of them was composed of leading
edge technology which might limit the applicability of standard FP analysis, such as
"multi-media" or "compound document" systems. Rather, they represent typical MIS
applications, and are described in more detail in the next section.

Obtaining the final usable three sets of case study data required the solicitation of ten
organizations. Only these three possessed the necessary data and were willing to share
these data with the researchers. These cases represent systems that are of the type for
which FPs were developed, and which are representative of the type of systems developed
and maintained by the original survey respondents.

The results were obtained using a variance analysis approach. Each of the systems
submitted for the analysis had an original FP count and other relevant documentation.
The analysis then systematically applied single variations of the counting rules which
were identified in the research. These variations were those identified in the first phase
for further analysis because they were different from the CPM standard (or for which no
standard had been established in the area), and they were being used by a significant
population of the survey respondents.
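A minimal sketch of this variance analysis, with hypothetical counts (the real analysis re-counted each system from its documentation): each rule variation is applied singly and its effect is expressed as a percentage deviation from the baseline count.

```python
# Sketch of the variance-analysis approach: re-count a system once per rule
# variation and express each result as a % deviation from the original
# (baseline) FP count. All counts here are hypothetical.
def deviations(baseline, variant_counts):
    """Map each rule variation to its percentage deviation from baseline."""
    return {name: 100.0 * (count - baseline) / baseline
            for name, count in variant_counts.items()}

site = deviations(400, {
    "backup_files_counted": 430,          # treat backup files as logical files
    "menus_counted_as_inquiries": 424,    # treat menus as external inquiries
    "external_files_as_transactions": 412,
})
print(site)  # each variation's shift, as a percent of the baseline count
```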

3.3.2. Site A - Fortune 100 Manufacturer; Accounting Application

This case was provided by a large, diversified manufacturing and financial services
company. This accounting application supports the need for rapid access to information
from a variety of separate Accounts Payable applications. It was designed to operate in a
PC/LAN environment, and is primarily used by accountants for inquiry purposes. It has
built-in help facilities which can be maintained by the users of the system.

3.3.3. Site B - Fortune 50 Financial Services firm; MIS Data Base System

This case was provided by a large diversified financial services organization that has
recently implemented a software measurement program. The system under study was
developed as a stand-alone PC application, using a relational data base technology. The
application is initially used by a single individual, but is expected to be expanded in its
availability as its data bases become more robust. The application supports the
management of the development function of the business, providing data and analysis to
the managers of the software development and maintenance functions. The system was
designed for ease of access, and has a robust set of menus to give the users easy access to the

3.3.4. Site C - Fortune 100 Manufacturing Company; Program Management System

This case was provided by the high technology division of a large aerospace manufacturing
company. The system is used to track information concerning the management of various
"programs" which are in process within the division. The system specifically tracks the

