Carl Campbell Brigham.

Variable Factors in the Binet Tests


presented to the

Faculty of Princeton University

in Candidacy for the Degree

of Doctor of Philosophy




Princeton University Press


Table of Contents

I. Introduction

II. Subjects and Methods

III. The Personal Equation

IV. Grade Correlations

V. Sex Differences

VI. Summary


During the past decade, the Binet-Simon measuring scale for
intelligence has received considerable attention, and a large
amount of literature has appeared on the subject. No attempt
has been made in the following pages to review all the literature
on this scale or other systems of intelligence testing. Kite (38)
gives an excellent account of the history and nature of the scale.
Kohs (41) has assembled a very complete bibliography on the
subject up to June 1914. Schmitt (57) gives an historical ac-
count of the development of the various attempts to correlate
psychological findings with general intelligence, particularly in
this country and England. Bobertag (10) and Schmitt both
give detailed descriptions and analyses of the individual tests.
Stern (62) has devoted a monograph to the collection, exposition
and critical analysis of the large amount of data bearing on the
problem of intelligence testing, and in another work (61) has
assembled the literature of cognate fields. The literature bearing
on the Binet scale up to 1912 is largely descriptive of the scale
itself, the standard methods of procedure, etc. The more recent
literature has been critical and reveals a tendency at the present
time for investigators to depart from the methods of the exten-
sive application of the scale as a whole to the more intensive
study of the individual tests.

All systems of intelligence tests may be classified as qualitative
or quantitative. The qualitative system consists of an aggrega-
tion of tests designed to detect the capacities or incapacities of
the subject in order to afford the experimenter an opportunity
to make a diagnosis concerning the subject's mentality. This
method throws the responsibility for the final diagnosis on the
experimenter. The system of tests proposed by Healy and Fernald
(34) is of this type. Quantitative systems of tests necessitate
a final score of some sort, whether that score be in the form
of a mental age, a mental quotient, a certain number of points,
a coefficient of intellectual ability, a percentile rank or what not.
The essential characteristics of the quantitative systems are the
interpretation of the total scores in terms of the age of the sub-
ject, and the placing of the responsibility for the final diagnosis
on the tests rather than the experimenter.

Binet and Simon's 1905 scale (5 and 6) was of the qualitative
type. A series of 30 tests of approximately increasing difficulty
was published with directions for their application. The authors
reported in a general way that from their experience in examin-
ing a few selected normal children of different ages, and other
subnormal children in the schools and at the Salpêtrière, approximate
levels of performance could be found characteristic of the
development of normal children of 3, 7, 9 and 11 years chrono-
logically, the performance of idiots, imbeciles and morons cor-
responding roughly with that of normal children of 3, 7 and 9.
Although the reference to chronological ages introduced the
quantitative element, at no place were the authors insistent on
this point, merely stating that they had found the series of tests
exceedingly valuable in diagnosing and classifying defectives,
and in their opinion others would also find it valuable.

The 1908 scale (7) was quantitative in character owing to
the introduction of the concept of "mental age". It included
a list of 56 tests grouped according to ages from 3 to 13, each
group containing from four to eight tests. Most of the tests of
the 1905 series were included, the additions including in a large
measure tests of a scholastic nature. The authors gave directions
for applying the series and for computing the resultant "mental
age". A child testing three years below his chronological age
was to be considered defective.

Although the scheme of the 1908 series was entirely quantita-
tive, the authors did not discard the qualitative idea, and they
cautioned against the application of the scale in the manner of
a measure of height or weight. The border line between the
idiot and the imbecile was fixed by the ability to use and compre-
hend spoken language. The imbecile was differentiated from
the moron by the use of written language, illiteracy being
differentiated from imbecility by certain tests. The authors stated
that the moron could be defined only in terms of the environ-
ment in which he lived, and they considered six tests important
in differentiating the moron from the normal individual of the
Paris population. Any system of tests which throws more
weight on some tests than on others in making a differential
diagnosis is fundamentally qualitative in kind, for the responsi-
bility is placed not on the score but on the judgment of the ex-
perimenter. The idea of a quantitative measuring scale of
intelligence however met with instant favor. The interest that
actuated the psychologists of the "early nineties" to correlate
the measurements of reaction time, motor ability, sensory dis-
crimination, etc. with intelligence was revived. The scale was
translated into several languages and applied to individuals of
many classes and types.

In 1911, the authors published a revised scale (8) in which
many of the tests of scholastic ability were discarded, and the
remaining tests shifted about so that there were five tests for
every year except one from III to X with similar groups for
"twelve year", "fifteen year" and "adult" mentality. In the same
year, Binet published an article (4), his last word on the sub-
ject, in which he discussed many of the criticisms which the scale
had received, and again sounded the note of warning against
the mechanical interpretation of results. However, as one traces
Binet's thought on the subject through his writings, he may see
the idea of a qualitative system of tests gradually dropping into
the background, and more and more weight placed on the "scien-
tific" (quantitative) measure of intelligence.

That Binet did not depart entirely from the qualitative stand-
point is shown by his discussion of the test of comprehending
difficult questions. "Sometimes after an examination one hesi-
tates on a diagnosis. The child has failed in one or two tests,
but this does not seem to be convincing. Failure to give the day
and date and the months of the year are excusable errors, which
may be caused by distraction or by lack of education. But the
questions for comprehension dissipate all doubts. We recall
several instances when teachers brought us children, desiring to
know whether or not they were abnormal; occasionally, in this
way they set a trap for us, but we did not object, it was fair
play. Our questions for comprehension decided us every time.
We remember one child who was very slow in answering as
though dull, his face was expressionless and unprepossessing;
he knew neither the day nor the date, nor what day comes after
Sunday, and he was 10½ years old; his reading was syllabic.
But when we asked question 5: Why do we judge a person by
his acts rather than by his words? he gave the following answer:
Because words are not very sure and acts are more sure. This
was enough — our opinion was formed, that child was not so bad
as he seemed." (Town's (72) translation, page 48.)

The popular interest that was manifest before the advent of
the 1911 scale was tremendously reinforced in this country by
Goddard's (30) publication of the results of the application of
the scale to "two thousand" non-selected school children in Vine-
land, N. J. Popular interest increased rapidly, and the scale
continued to have wider and wider application in the hands of
less and less experienced investigators. The concept of "mental
age" was exceedingly easy of comprehension, no apparatus was
needed, and the scale has now become the common property of
all. This development or overdevelopment has taken place in
spite of the warnings of the authors themselves and the psycho-
logical fraternity in general. The very fact of overdevelopment
however is striking evidence that persons interested in the social
sciences need a quantitative scale for measuring intelligence.

The question whether the Binet scale is an accurate measure
of intelligence can be decided only by the study of the individual
tests and the factors underlying them. A study of this sort will
show the errors that underlie the total score or "mental age",
and at the same time will show the direction in which the cor-
rection of the scale should take place. The proper understanding
of the individual tests involves the theory on which the measur-
ing scale was constructed.

The method which Binet and Simon used in constructing their
measuring scale of intelligence was entirely empirical. A large
number of tests were given to children of a certain social status.
Certain tests could be shown to be correlated with age, and in
the authors' opinion were correlated with intelligence. The fact
that at a certain age a test could be passed by a certain propor-
tion of the subjects was taken to mean that the test in question
was characteristic of that age. Tests that were characteristic
of the same age level were then combined into one age group.
In this way a scale was built up with a number of tests for each
age group. By a certain arbitrary system of scoring the re-
actions of a subject to all or part of the scale of tests, the "men-
tal age" of the subject was obtained. The comparison of the
"mental age" with the chronological age of the subject would
show him to be advanced, at age or retarded, and the amount
of acceleration or retardation would afford a quantitative index
of his intelligence.

A person could construct a scale on the same basis and arrive
at an age score using entirely different tests. A scale could be
constructed containing tests of height, weight, vital capacity,
strength of grip, circumference of the head, etc. and the results
interpreted in terms of age. In this case however the age ob-
tained would be more physical than mental. A scale of tests
could also be constructed which involved the subject's knowledge
of geography, spelling, history, grammar, etc. but in this case
the resulting age would be determined very largely by the amount
of training the subject had received.

The assumptions that a child at a certain age should weigh
25 pounds, at another age 50 pounds, etc., that a child can repeat
3 digits at one age, 5 digits at another and 7 digits at another,
and that a certain percentage of children at one age can enu-
merate the months, and a higher percentage at another age, differ
only in the possible determiners to which the growth may be re-
ferred. In the first case the growth is referred to certain physio-
logical processes which are supposedly independent of intelligence
and training. Binet believed that the principal determiner of
growth in the last two cases was intelligence, but the possibility
remains that they might be more or less independent of intelligence,
and more or less dependent on training and other variable
factors.

The principle on which the scale was constructed involves three
assumptions, (1) that the individual tests are correlated with
age, (2) that the individual tests are correlated with intelligence,
and (3) that intelligence is correlated with age — three distinct
assumptions any one of which does not necessarily involve the
others. The purpose of this investigation is to study the correla-
tion of the individual tests with age, to determine the variable
factors that might operate on the tests to produce an apparent
correlation with age that was not a real correlation, or that might
alter the real correlation in some way.

There is a possibility that an error might occur in the statistical
treatment of the results, so that figures which would apparently
indicate a correlation with age of a certain degree might actually
represent a correlation of another degree. Another variable
factor is the personal equation of the experimenter, who might
alter the procedure in giving a certain test so that the correlation
of that test with age might be different from the correlation
obtained by another experimenter. If the subjects of various
ages had received different school training, this difference might
introduce another factor which would vary independently of the
age of the subjects. If the tests used depended on any inherited
or acquired differences between the sexes, then the correlation
of the tests with age might be different for the two sexes. If
any or all of the variable factors mentioned prove to be present
in the correlation of the tests with age, then certain allowances
will have to be made for these factors in making a diagnosis
of the subject's intellectual ability on the basis of his total score
or "mental age", and the scale becomes qualitative rather than
quantitative.

At the Fourth International Conference for School Hygiene
held in Buffalo in the summer of 1913, several persons of unquestioned
authority in the field of mental tests held an informal
conference on the Binet-Simon scale, reporting the results in
1914 in the form of recommendations and suggestions (15).
The question, "How much is the outcome of the testing in-
fluenced by the personal equation, both of the examiner and ex-
aminee?" was answered, "Undoubtedly there is some influence
and it may be a serious source of error." Another question,
"How much do previous environment and school training affect
the outcome of the tests?" was left unanswered by the opinion,
"The experimental evidence thus far available is conflicting.
Further investigation is needed." The question, "Should the
scale be divided, in the upper years at least, to furnish separate
standards or separate tests for the two sexes?" was answered,
"We do not know, and recommend this as a subject for investigation."
The following study is in part an attempt to answer these
questions.

The method used in this study is that of studying the indi-
vidual tests, disregarding entirely the total score or "mental
age". There are at present so many revisions and editions of
the Binet scale, that the term "mental age" has no meaning out-
side of the particular scale in question. The tests that are used
in the various standardizations are however approximately the
same, so that conclusions concerning the factors underlying the
individual tests have a wider significance than those drawn from
the "mental ages". Furthermore variable factors in the indi-
vidual tests may balance each other in the total score so that
their influence might be obscured.

The subjects and methods will be described first, and in con-
nection with the methods of treating the results a statistical error
will be pointed out. The problems of the personal equation,
grade correlations and sex differences will then be taken up in
turn.


The data which are here analysed to determine the influence
of the personal equation, of grade training and of sex differ-
ences, are derived from all the boys and girls below the seventh
grade in the Princeton, N. J., Model School. This group includes
422 subjects of the following age distribution:

Chronological Ages.

Age        4   5   6   7   8   9  10  11  12  13  14  15  16
Subjects   4  17  62  52  56  42  53  49  36  32  11   6   2

Each of the first six school grades was divided into a plus
and minus grade, the latter division being under a different
teacher, and containing those who were either backward, or, on
account of illness, change of school, or for reasons not neces-
sarily related to their mental development, were not sufficiently
advanced to perform the work of their grade. The school also
contained a special class for defective and exceptionally back-
ward children. The subjects were distributed in the school
grades as follows:

School Grades.

Grade     Spec. Kind.    I-    I+   II-   II+  III-  III+   IV-   IV+    V-    V+   VI-   VI+
Subjects     18    32    38    51    12    40    12    45    15    35    15    49    11    49

39 or 9.2% of the subjects were children of non-English speak-
ing parents, this group including 6.6% of the children in the
Kindergarten and first six regular grades, and 15.7% of those
in the special class and minus grades.
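
The totals and percentages just given can be checked against one another. The short Python sketch below is not part of the original study. It assumes the grade table reads as transcribed in its first comment, and it infers the two subgroup counts, which the text does not state, by rounding the quoted percentages.

```python
# Grade sizes as read from the School Grades table (an assumed transcription).
grades = {
    "Spec.": 18, "Kind.": 32,
    "I-": 38, "I+": 51, "II-": 12, "II+": 40,
    "III-": 12, "III+": 45, "IV-": 15, "IV+": 35,
    "V-": 15, "V+": 49, "VI-": 11, "VI+": 49,
}

total = sum(grades.values())

# Special class plus the minus divisions; everything else is "regular".
backward = sum(n for g, n in grades.items() if g == "Spec." or g.endswith("-"))
regular = total - backward

# Counts implied by the quoted percentages (inferred, not given in the text).
non_english_regular = round(0.066 * regular)
non_english_backward = round(0.157 * backward)
non_english = non_english_regular + non_english_backward

print(total, regular, backward)
print(non_english, f"{100 * non_english / total:.1f}%")
```

Under these assumptions the grade sizes sum to the 422 subjects reported, and the two inferred subgroup counts add up to the 39 children stated in the text.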

The selection of subjects is only fairly typical of the general
run, for Princeton has no manufactories. The children examined
came, for the most part, from the homes of laborers, domestics,
artisans, farmers, tradesmen, clergymen and college professors.
The selection is atypical in that none of the children came from
homes of the manufacturing class, while an unusually large proportion
came from the homes of those engaged in domestic,
personal, and professional service.


The scale used was Goddard's (28) 1911 revision of the
Binet-Simon scale. The methods used in giving the tests were,
as far as possible, the same as those outlined by Goddard in the
original revision, incorporating the rules and suggestions for
standardized scoring published by that writer (29) in 1913.
The methods used will not be discussed in detail, for the data
are not used in obtaining age norms and standards for children
generally. For the analysis of the data in terms of grade and
sex it is not necessary that the procedure should be absolutely
standardized, but that the experimenters who gave the tests
should have used the same procedure. Differences in the tech-
nique of the experimenters will be discussed in the chapter on
the personal equation.

One variation from the usual procedure was adopted. In no
case did the experimenter know the chronological age of the
child being tested. The influence of any prejudice or bias on
the part of the experimenter is therefore eliminated from the
problem of the correlation of the tests with age. The three
experimenters who gathered the material in the spring of 1913
examined the sixth grade first and the remaining grades in
decreasing order. During the school year 1913-1914, the fourth
experimenter examined all children at that time in the kinder-
garten and first grades, and others who were not examined in
the spring of 1913.

The tests in the "three year", "four year", "five year", "fifteen
year" and "adult" groups were given so infrequently that the
data from them are not treated. The tests used are as follows.
The figure at the right shows the total number of times each
test was given.


"Six year" group

1. Distinguishing between morning and afternoon 108
2. Defining in terms of use 333
3. Executing three commissions 100
4. Showing right hand and left ear 107
5. Choosing the prettier of given faces 117

"Seven year" group

1. Counting 13 pennies 217
2. Describing pictures 219
3. Indicating omissions in pictures 217
4. Copying the diamond (in pencil) 225
5. Naming four colors 218

"Eight year" group

1. Comparing remembered objects (butterfly and fly) 271
2. Counting backwards from 20 to 251
3. Enumerating the days of the week 277
4. Counting stamps 258
5. Repeating 5 digits 413

"Nine year" group

1. Making change 271
2. Defining in terms superior to use 333
3. Giving the day and date 307
4. Enumerating the months 284
5. Arranging five weights 334

"Ten year" group

1. Recognizing pieces of money 282
2. Copying designs from memory 252
3. Repeating 6 digits 413
4. Comprehending easy and difficult questions 250
5. Using three words in sentence (two ideas) 279

"Eleven year" group

1. Detecting absurdities in statements 226
2. Using three words in sentence (one idea) 279
3. Giving 60 words in three minutes 233
4. Giving rhymes with day, mill and spring 213
5. Reconstructing dissected sentences 190

"Twelve year" group

1. Repeating 7 digits 413
2. Defining abstract terms 144
3. Repeating a sentence of 28 syllables 169
4. Resisting suggestion (length of lines) 203
5. Solving problems from various facts 123

The tests in the "six year" group, with the exception of de-
fining in terms of use, and the tests in the "twelve year" group,
with the exception of repeating 7 digits, were given so infre-
quently or so irregularly that the data from them could not be
treated. The apparatus used in the test of arranging five weights
was not constant throughout the experiment, the standard cubes
and weighted pill boxes being used at different times by different
experimenters. On this account, the data from this test are
not included in the subsequent discussion.

Methods of Treating Results

The chronological age of each subject was taken as that at
the last birthday, one tenth of a year being allowed for each
36 days beyond the birthday. The subject that was 10 years
and 35 days would be rated 10.0 years, while ten years and 36
days would be 10.1 years. A subject one day short of 11 would
be rated 10.9, etc. The teachers of each grade submitted the
dates of birth of all pupils after the grade had been tested.
These data were later checked up from the entrance cards. Since
the purpose of this study is to analyze the factors involved in
the individual tests, no "mental ages" or total scores were fig-
ured. The classifications of the subjects are all made independ-
ently of the tests.
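
The age-rating rule described above can be sketched in a few lines of Python. This is an illustration only, not the author's procedure as recorded: the function name `rated_age` is hypothetical, and the cap at nine tenths is an assumption made so that a subject one day short of the next birthday is rated at .9 rather than rolling over, as the example in the text requires.

```python
def rated_age(years: int, days_past_birthday: int) -> float:
    """Age at last birthday plus one tenth of a year per 36 days elapsed."""
    if not 0 <= days_past_birthday <= 365:
        raise ValueError("days_past_birthday must lie within one year")
    # Cap at nine tenths (an assumption) so the rating never reaches the next year.
    tenths = min(days_past_birthday // 36, 9)
    return (10 * years + tenths) / 10


print(rated_age(10, 35))   # 10.0
print(rated_age(10, 36))   # 10.1
print(rated_age(10, 364))  # 10.9
```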

Two measures of central tendency will be used in the subse-
quent discussion, the average and the median. The measure
of variability from the average, that will be used, is the mean
variation (or average deviation), the average of the differences,
regardless of signs, between the separate measures in the series
and the average of the whole series. The measure of variability
from the median that will be used is the semi-interquartile range
(Q), or half the difference between the measure with three
times as many measures above as below it and the measure with
one third as many measures above as below it, i. e. half the
difference between the 25 percentile, and the 75 percentile. Any
coefficients of correlation used will be stated in terms of the
formula applied. The reader is referred to Thorndike (70) for
the discussion and explanation of the statistical measures used.
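
The two measures of variability defined above may be illustrated with a minimal Python sketch. The scores are made up for the purpose, the function names are hypothetical, and the quartiles are taken with the default interpolation of the standard library's `statistics.quantiles`, which is only one of several conventions and not necessarily the one followed in Thorndike (70).

```python
from statistics import mean, median, quantiles


def mean_variation(values):
    """Average absolute deviation of the measures from their average."""
    avg = mean(values)
    return mean(abs(v - avg) for v in values)


def semi_interquartile_range(values):
    """Q: half the distance between the 25 and 75 percentiles."""
    # quantiles() interpolates between neighbours; one convention of several.
    q1, _, q3 = quantiles(values, n=4)
    return (q3 - q1) / 2


scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  # made-up measures
print(mean(scores), median(scores))
print(mean_variation(scores), semi_interquartile_range(scores))
```

The mean variation is reported alongside the average and Q alongside the median, following the pairing given in the text.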

The measures of ability in most of the tests are in the "all
or none" form — the tests are either passed or failed. The only
measure that can be obtained from data of this sort is the per-
centage that an ability is present in a defined group. This
method of treating the results has as many "pit-falls" as the
tests themselves. Before undertaking the analysis of the Princeton
data to determine the effect of the personal equation of the
experimenter, and the age, grade, and sex of the subject upon
the results of the individual tests, it is necessary to consider
an error which underlies incomplete data, or those data derived
from experimenting in which every test is not given to every
subject.
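
A minimal Python sketch of the percentage measure for "all or none" data follows. The records are hypothetical and the function name `percent_passing` is an assumption; the point is only that, when a test is not given to every subject, the percentage rests on those actually tested and not on the whole group.

```python
from collections import defaultdict


def percent_passing(records):
    """Percentage passing a test within each group, over those actually tested."""
    passed = defaultdict(int)
    given = defaultdict(int)
    for group, result in records:
        if result is None:  # test not given to this subject
            continue
        given[group] += 1
        passed[group] += bool(result)
    return {g: 100 * passed[g] / given[g] for g in given}


# Hypothetical records: (age group, True = passed, False = failed, None = not given).
records = [
    (8, True), (8, False), (8, None), (8, True),
    (9, True), (9, True), (9, None), (9, None),
]
print(percent_passing(records))
```

In this made-up case each age group contains four subjects, yet the percentages are based on three and two of them respectively; subjects who were not tested drop out of the figure without leaving a trace in it.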

No uniform instructions were given to the experimenters con-
cerning the order in which the tests should be given, nor the
number of tests that should be tried. The experimenters at-
tempted to determine the mental age of the child according to
the scale. In doing this they would start with some test which
they considered would be interesting to the child, and, at the
same time, well within his reach. The tests given first were
usually those of describing pictures and arranging five weights.
The experimenter would then gradually explore the subject's
range of ability, varying the order of the tests so as to maintain
the subject's interest, and to ward off fatigue. In this way the
experimenter would eventually establish the basal age of the
subject (that age in which he passed all five of the tests), and
by the end of the examination would have tried all the tests
above the basal age which, in his judgment, there was any possi-


