Validity
————
We shall use the concepts validity and invalidity to refer to the best available approximation to the truth or falsity of propositions, including propositions about cause. In keeping with the discussion in chapter 1, we should always use the modifier "approximately" when referring to validity, since one can never know what is true. At best, one can know what has not yet been ruled out as false. Hence, when we use the terms valid and invalid in the rest of this book, they should always be understood to be prefaced by the modifiers "approximately" or "tentatively."
One could invoke many types of validity when trying to develop a framework in which to understand experiments in complex field settings. Campbell and Stanley (1963) invoked two, which they called "internal" and "external" validity. Internal validity refers to the approximate validity with which we infer that a relationship between two variables is causal or that the absence of a relationship implies the absence of cause. External validity refers to the approximate validity with which we can infer that the presumed causal relationship can be generalized to and across alternate measures of the cause and effect and across different types of persons, settings, and times.
For convenience, we shall further subdivide the two validity types of Campbell and Stanley. Covariation is a necessary condition for inferring cause, and practicing scientists begin by asking of their data: "Are the presumed independent and dependent variables related?" Therefore, it is useful to consider the particular reasons why we can draw false conclusions about covariation. We shall call these reasons (which are threats to valid inference-making) threats to statistical conclusion validity, for conclusions about covariation are made on the basis of statistical evidence. (This type of validity was listed by Campbell [1969] as a threat to internal validity. It was called "instability" and was concerned with drawing false conclusions about population covariation from unstable sample data. We shall later consider "instability" as one of the major threats to statistical conclusion validity.)
If a decision is made on the basis of sample data that two variables are related, then the practicing researcher's next question is likely to be: "Is there a causal relationship from variable A to variable B, where A and B are manipulated or measured variables (operations) rather than the theoretical or otherwise generalized constructs they are meant to represent?" To answer this question, the researcher has to rule out a variety of other reasons for the relationship, including the threat that B causes A and the threat that C causes both A and B. The first of these threats is usually handled easily in experiments, as we shall see later. The latter is not so easily dealt with, especially in quasi-experiments. Much of the researcher's task involves self-consciously thinking through and testing the plausibility of noncausal reasons why the two variables might be related and why "change" might have been observed in the dependent variable even in the absence of any explicit treatment of theoretical or practical significance. We shall use the term internal validity to refer to the validity with which statements can be made about whether there is a causal relationship from one variable to another in the form in which the variables were manipulated or measured.
Internal validity has nothing to do with the abstract labeling of a presumed cause or effect; rather, it deals with the relationship between the research operations irrespective of what they theoretically represent. However, researchers would like to be able to give their presumed cause and effect operations names which refer to theoretical constructs. The need for this is most explicit in theory-testing research where the operations are explicitly derived to represent theoretical notions. But applied researchers also like to give generalized abstract names to their variables, for it is hardly useful to assume that the relationship between the two variables is causal if one cannot summarize these variables other than by describing them in exhaustive operational detail. Whether one wants to test theory about the effects of "dissonance" on "attitude change," or is interested in policy issues relating to "school desegregation" and "academic achievement," one wants to be able to make generalizations about higher-order terms that have a referent in explicit theory or everyday abstract language. Following the lead of Cronbach and Meehl (1955) in the area of measurement, we shall use the term construct validity of causes or effects to refer to the approximate validity with which we can make generalizations about higher-order constructs from research operations. Extending their usage, we shall use the term to refer to manipulated independent variables as well as measured traits. We shall base inferences about constructs more on the fit between operations and conceptual definitions than on the fit between obtained data patterns and theoretical predictions about such data patterns—more on what Campbell (1960) called trait validity than what Cronbach and Meehl (1955) termed nomological validity. We shall not ignore nomological validity, however.
The construct validity of causes and effects was listed by Campbell and Stanley (1963) under the heading of "external validity," and it is what experimentalists mean when they refer to inadvertent "confounding."² (That is, was the effect
² Confounding is sometimes done deliberately in more complex experimental designs, e.g., Latin squares or incomplete lattice designs. Such deliberate confounding is meant to achieve efficiency at the cost of reduced interpretability for some carefully chosen interactions that are considered implausible or that are of little theoretical or practical significance.
due to the planned variable X, or was X confounded with "experimenter expectancies" or a "Hawthorne effect," or was X a "negative incentive" rather than "dissonance arousal"?) As such, construct validity had to do with generalization, with the question: "Can I generalize from this one operation or set of operations to a referent construct?" Given this grounding in the need to generalize, it is not difficult to see why Campbell and Stanley linked generalizing to abstract constructs with generalizing to (and across) populations of persons, settings, and historical moments. Just as one gains more information by knowing that a causal relationship is probably not limited to particular operational representations of a cause and effect, so one gains by knowing that the relationship (1) is not limited to a particular idiosyncratic sample of persons or settings of a given type, and (2) is not limited to a particular population of Xs but also holds with populations of Ys and Zs. We shall use the term external validity to refer to the approximate validity with which conclusions are drawn about the generalizability of a causal relationship to and across populations of persons, settings, and times.
Our justification for restricting the discussion of validity to these four types is practical only, based on their apparent correspondence to four major decision questions that the practicing researcher faces. These are: (1) Is there a relationship between the two variables? (2) Given that there is a relationship, is it plausibly causal from one operational variable to the other or would the same relationship have been obtained in the absence of any treatment of any kind? (3) Given that the relationship is plausibly causal and is reasonably known to be from one variable to another, what are the particular cause and effect constructs involved in the relationship? and (4) Given that there is probably a causal relationship from construct A to construct B, how generalizable is this relationship across persons, settings, and times? As stated previously, each of these decision questions was implicit in Campbell and Stanley's explication of validity, with the present statistical conclusion and internal validities being part of internal validity and with the present construct and external validities being part of external validity. All we have done here is to subdivide each validity type and try to make the differences among types explicit. We want to stress that our approach is entirely practical, being derived from our belief that practicing researchers need to answer each of the above questions in their work. There are no totally compelling logical reasons for the classification.
STATISTICAL CONCLUSION VALIDITY
Introduction
In evaluating any experiment, three decisions about covariation have to be made with the sample data on hand: (1) Is the study sensitive enough to permit reasonable statements about covariation? (2) If it is sensitive enough, is there any reasonable evidence from which to infer that the presumed cause and effect covary? and (3) If there is such evidence, how strongly do the two variables covary?

The first of these issues concerns statistical power. It is vital in reporting and planning experiments to analyze how much power one has to detect an
those that were actually obtained in the study, then one can compute with a known level of confidence whether a specified point standard has been exceeded with the data on hand. This is clearly a desirable situation for any data analyst.
Let us illustrate the above points by describing a section from a report on the effects of manpower training programs on subsequent earnings. Aschenfelter (1974) knew that training costs were about $1,800 for each trainee. He estimated that a return of at least 10% on this investment (i.e., $200) would be adequate for declaring the manpower training program a "success." Then, assuming equal numbers of persons in the experimental training group and the no-training control group, and knowing from previous data that the standard deviation in income was about $2,000 for white males, Aschenfelter calculated that about 1,600 persons would be needed in the experiment if a true effect of at least $200 was to be detected with 95% certainty. However, Aschenfelter further calculated that if he were to break down the data by two race and two sex groups, he would need a total of 6,400 respondents—1,600 in each subgroup. Knowing this, he was then in a position to assess whether he had the necessary resources to design an experiment of this size or whether he would be better served by using some other technique for trying to evaluate the training program.
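The report's arithmetic is not reproduced here, but a minimal sketch under the usual normal-approximation assumption recovers the same order of magnitude. The function name, the two-sided critical value z = 1.96, and the reading of "detected with 95% certainty" as the true effect reaching that critical value are our own assumptions, not Aschenfelter's:

```python
import math

def total_n(delta, sigma, z=1.96):
    """Total sample size (two equal groups) needed for a true mean
    difference `delta` to reach the two-sided 5% critical value,
    given an outcome standard deviation `sigma` (normal
    approximation; the z value is an assumed convention)."""
    n_per_group = 2 * (z * sigma / delta) ** 2
    return 2 * math.ceil(n_per_group)

# Aschenfelter's figures: a $200 effect against a $2,000 income SD.
print(total_n(200, 2000))      # about 1,600 in total, as in the text
print(4 * total_n(200, 2000))  # about 6,400 for 2 race x 2 sex subgroups
```

The quadratic dependence on sigma/delta is what makes the subgroup breakdown so costly: halving the detectable effect, or doubling the outcome variability, quadruples the required sample.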
Unfortunately, it is rare to have valid variance estimates and a prior point estimate of the size of an expected effect. The problem of specifying expected effect sizes is sometimes political, largely because a publicized point estimate can become a reference against which a social innovation is evaluated. As a result, even if an innovation has had some ameliorative effect, it may not be given much credit if it failed to have the promised effect. It is no small wonder, therefore, that the managers of programs prefer less specific statements such as "We want to increase achievement," or "We want to reduce inflation" to statements such as "We want to increase achievement by two years for every year of teaching" or "We want to reduce inflation to 5% a year." The problem of specifying magnitudes is also sometimes one of "consciousness," for the issue may simply not be considered in designing the research. Alternatively, it may be silently considered by some persons but never brought to the level of discussion for fear that different parties to the research may disagree on the level of effect required to conclude that a treatment has made a significant, practical difference.
Situation 3
Even when no magnitude-of-effect estimate is available, it is still possible to use information about sample sizes and variances in order to calculate retrospectively the size of any effect that could have been detected in a particular experiment with, say, 95% confidence. This magnitude can then be inspected and interpreted. At times it will seem so unreasonably large that the only responsible conclusion is that the experiment was not powerful enough to have detected a true effect. For instance, in the Aschenfelter case a sample size of 400, split equally between experimentals and controls, would have required the experimentals to earn considerably more than $200 on the average if a true effect were to be detected at the 5% level. How reasonable is it to expect an average increase in earnings over $200 in the first working year after graduating from a job-training program? The answer to this cannot be definitive since no criteria exist for assessing reasonableness. Nonetheless, the figure seems to us to be very high. We would strongly advise anyone whose research results in a no-difference conclusion to conduct the retrospective analysis indicated above.
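The retrospective calculation can be sketched the same way: hold the sample size and variance fixed and solve for the smallest effect that would reach the critical value. This is our own reconstruction under the same normal-approximation assumption, not a formula given in the text:

```python
import math

def minimum_detectable_effect(n_total, sigma, z=1.96):
    """Smallest mean difference reaching the two-sided 5% critical
    value when n_total subjects are split equally between two groups
    (normal approximation; the z value is an assumed convention)."""
    n_per_group = n_total / 2
    se_of_difference = sigma * math.sqrt(2 / n_per_group)
    return z * se_of_difference

# With 400 subjects and a $2,000 SD, the detectable mean difference
# is roughly double the $200 target return:
print(round(minimum_detectable_effect(400, 2000)))  # 392
```

A no-difference result from such a study says little: only effects of roughly $400 or more stood a good chance of being detected.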
Situation 4
When data are first analyzed, it is often the case that the estimate of the treatment effect (say, a difference between sample means) is statistically nonsignificant but in the expected direction. Typically, efforts are then made to "reduce the error term" used for testing the treatment effect—a topic that we shall now discuss.

Obviously, it is desirable to design the research initially so as to minimize this error. For instance, "Student" (1931) reexamined an experiment which compared how four months of free pasteurized milk affected the height and weight of Scottish school children when compared to four months of raw milk. About 5,000 students received each type of milk. "Student" maintained that the same statistical power (and much lower financial costs) would have resulted had only 50 sets of identical twins been used. This is because weight and height are highly correlated for monozygotic twins, leading to lower error terms than those associated with differences between nonrelated children. In light of modern knowledge, we might not want to design the study in the way "Student" suggested because of nonstatistical considerations. For example, would parents seek to supplement one of their twins' diets if they knew that the other was receiving a school-provided supplement, and how generalizable would findings from 50 sets of twins be? Nonetheless, "Student"'s point is important, and it suggests designing research wherever possible so as to have small error terms, provided that the means of reducing the error do not trivialize the research.
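"Student"'s point can be illustrated numerically. The sketch below computes the standard error of a treatment-control mean difference; the within-pair correlation of .99 is an assumed figure chosen to make the two designs comparable, not one reported by "Student":

```python
def se_difference(sigma, n_per_group, rho=0.0):
    """Standard error of a treatment-control mean difference. With
    matched pairs (e.g., identical twins) whose outcomes correlate
    rho, the error term shrinks by a factor of sqrt(1 - rho)."""
    return (2 * sigma ** 2 * (1 - rho) / n_per_group) ** 0.5

# 5,000 unrelated children per milk group versus only 50 twin pairs
# with an assumed within-pair correlation of .99:
print(round(se_difference(1.0, 5000), 4))      # 0.02
print(round(se_difference(1.0, 50, 0.99), 4))  # 0.02
```

Under these assumed figures the two designs yield the same standard error, which is the sense in which 50 twin pairs can match 5,000 unrelated children per group.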
Perhaps the best way of reducing the error due to differences between persons is to match before random assignment to treatments. (This, we shall soon see, is quite different from matching as a substitute for randomization. While matching prior to randomization can increase statistical conclusion validity and permit tests to discover in which particular subgroups a treatment effect is obtained, matching as an alternative to randomization often leads to statistical regression artifacts that can masquerade as treatment effects.) The best matching variables are those that are most highly correlated with posttest scores. Normally the pretest is the best single matching variable since it is a proxy for all the social and biological forces that make some individuals or aggregates of individuals different from others. The actual process of matching is simple. One takes all the scores on the matching variable, ranks them, and places them into blocks whose size corresponds to the number of experimental groups (say, three). Then, the three persons in the first block are randomly assigned to one of the experimental groups, the next three in the next block are randomly assigned, and so on until all the cases are assigned. The data that result from such a design can then be analyzed as coming from a Levels x Treatment design. The same logic basically holds when matching takes place on multiple variables, but the problem of finding matches is harder. (Matching will be discussed in greater detail in several of the chapters to come.)

Given random assignment to treatment conditions, it is nonetheless possible to match retrospectively after all the data have been collected. With large sample
sizes, retrospective matching will result in treatment groups that have comparable
proportions of units with the characteristic on which matching takes place. How-
ever, the major disadvantage of this technique compared to prospective matching
is that subgroups with few members (e.g., blacks in many settings) can be dispro-
portionately represented in each treatment group, with very few persons in one of
the groups. This makes it difficult to estimate treatment effects for the subgroups
in question. But this problem aside, ex post facto blocking can be extremely use-
ful both because effects of the blocking variable can be removed from the error
term and because the interaction of the blocking variable with the treatment can be
assessed.
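The ranking-and-blocking procedure described above can be sketched as follows; the function name and the illustrative pretest scores are our own:

```python
import random

def blocked_assignment(scores, n_groups, seed=0):
    """Rank units on a matching variable (e.g., a pretest), slice the
    ranking into blocks whose size equals the number of treatment
    groups, and randomly assign the members of each block across the
    groups, as in a Levels x Treatment design."""
    rng = random.Random(seed)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    assignment = {}
    for start in range(0, len(ranked), n_groups):
        block = ranked[start:start + n_groups]
        groups = list(range(n_groups))
        rng.shuffle(groups)
        for unit, group in zip(block, groups):
            assignment[unit] = group
    return assignment

# Nine pretest scores and three treatment groups: each block of three
# similar scorers is split across the three groups.
pretest = [52, 75, 61, 48, 90, 67, 55, 82, 70]
print(blocked_assignment(pretest, 3))
```

Because every block contributes one unit to each group, between-person variation on the matching variable is balanced across treatments rather than left in the error term.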
When there is no interest in testing how the dependent variables are related to the matching or blocking variable, an alternative method of reducing the error term can be used that loses fewer degrees of freedom. It requires using multiple regression strategies involving variables which are correlated with the dependent variable within treatment groups and whose effects are to be partialled out of the dependent variable. Covariance analysis is one such strategy. The extent to which covariance analysis reduces error depends on the correlation between the covariates (the lower the better) and the correlation of each of them with the dependent variable (the higher the better). But two words of caution are required about such multiple regression adjustments. First, important statistical assumptions have to be demonstrably met for the results of the analysis to be meaningful, especially the assumption of homogeneous regression within treatment groups. Second, in experiments with noncomparable groups, the analysis will reduce error but will rarely adjust away all the initial differences between groups. Thus, the function of reducing error—which makes covariance so useful with both randomized experiments and quasi-experiments—should not be confused with the purported function of making groups equivalent. Equivalence is not needed with randomized experiments and is rarely achieved by regression adjustments with quasi-experiments. (For an extended discussion of these last points see chapters 3 and 4.)
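A toy sketch of the single-covariate case, with made-up numbers, shows the mechanism: partialling a within-group covariate out of the outcome shrinks the error term. This illustrates the error-reduction function only, not a full analysis of covariance with significance tests:

```python
def error_variances(groups):
    """groups: one list of (covariate, outcome) pairs per treatment
    group. Returns the within-group error variance of the raw
    outcomes and of outcomes adjusted by the pooled within-group
    regression on the covariate."""
    sxx = sxy = syy = 0.0
    n = 0
    for g in groups:
        xs = [x for x, _ in g]
        ys = [y for _, y in g]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        sxx += sum((x - mx) ** 2 for x in xs)
        sxy += sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        syy += sum((y - my) ** 2 for y in ys)
        n += len(g)
    k = len(groups)
    raw = syy / (n - k)                              # unadjusted error
    adjusted = (syy - sxy ** 2 / sxx) / (n - k - 1)  # covariate removed
    return raw, adjusted

# Hypothetical data in which the covariate tracks the outcome closely:
treated = [(1, 3), (2, 5), (3, 6), (4, 8)]
control = [(1, 2), (2, 3), (3, 5), (4, 6)]
print(error_variances([treated, control]))  # error drops sharply
```

Note that each covariate spends a degree of freedom, which is why the adjustment pays off only when the within-group covariate-outcome correlation is substantial.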
Both matching and multiple regression adjustments assume that measures have been made of the variables for which adjustments are to be made. Failure to measure them means that error terms cannot be reduced to reflect the way that person or setting variables are related to the major outcome measures of an experiment. Increasing one's confidence in accepting the null hypothesis demands valid measurement of the variables that are most likely to affect posttest scores.
There is little point in reducing the error variance due to differences among persons and settings if the outcome measures are so unreliable that they cannot register true change. Thus, the experimenter has to be certain to begin with reliable measures. Alternatively, the experimenter has to try and develop even more reliable measures after an experiment is under way by adding items to tests, by rescaling, or by aggregating data. But whether or not attempts are undertaken to increase reliability, it is important that internal consistency estimates and test-retest correlations be displayed in a research report. The reader can at least judge for himself the extent to which measures may have been capable of registering true changes.
Statistical procedures exist for correcting for unreliable measurement. This means that analysis of "true" scores should be possible. (Details of this procedure are available in many standard texts, including McNemar, 1975.) These correctional procedures can often be misleading in practice. First, there are many ways of conceptualizing reliability, each of which implies a different reliability measure and different numerical estimates of the amount of reliability. Second, for any one kind of reliability, its own reliability will not be directly known. And third, reliability-adjusted values do not logically correspond with the values that would have been obtained had there been perfectly reliable measurement. This is perhaps most dramatically illustrated by reliability-adjusted correlations in excess of 1.00, or by the fact that a nonsignificant r of, say, −.10 must inevitably result in a higher adjusted value of the same sign, whereas the population correlation may have been +.04. Great caution must be exercised, therefore, in the use of reliability adjustments. It would be naive to present the results only for adjusted data or, when adjusted results are presented, to use only one estimate of reliability. A range would be better.
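The correction at issue here is the classical correction for attenuation (the observed correlation divided by the square root of the product of the two reliabilities); the text cites the procedure without giving the formula, so the sketch below is our reconstruction of the standard version. It reproduces both pathologies mentioned above:

```python
import math

def disattenuated_r(r_xy, r_xx, r_yy):
    """Classical correction for attenuation: observed correlation
    divided by the square root of the product of the reliabilities
    of the two measures."""
    return r_xy / math.sqrt(r_xx * r_yy)

# With low reliabilities the "corrected" value can exceed 1.00 ...
print(round(disattenuated_r(0.60, 0.50, 0.60), 2))   # 1.1
# ... and a negative observed r can only grow more negative, even if
# the population correlation is actually slightly positive.
print(round(disattenuated_r(-0.10, 0.70, 0.80), 2))  # -0.13
```

Running the correction over a range of plausible reliability estimates, as the text recommends, amounts to calling this function several times and reporting the spread.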
Each of the foregoing strategies can reduce error terms. Consequently, it is advisable for purposes of statistical conclusion validity to use as many as possible of the following design features: (1) Each person might be his own control (i.e., serve in more than one experimental group); (2) samples might be selected that are as homogeneous as possible (monozygotic twins are merely the extreme of this); (3) pretest measures should be collected on the same scales that are used for measuring effect; (4) matching might take place, before or after randomization, on variables that are correlated with the posttest; (5) the effects of other variables that are correlated with the posttest might be covaried out; (6) the reliability of dependent variable measures might be increased; or (7) occasionally the raw scores might be adjusted for unreliability. In addition, (8) estimates of the desired magnitude of effect should be elicited, where possible, before the research begins. Even when no such estimate can be determined, (9) the absolute magnitude of a treatment effect should be presented so that readers can infer for themselves whether a statistically reliable effect is so small as to be practically insignificant or whether a nonreliable effect is so large as to merit further research with more powerful statistical analyses. It should not be forgotten that all of these strategies have negative consequences if uncritically used and that all of them require trade-offs that will become more obvious later. Moreover, most of them are more problematic when analyzing data from quasi-experiments than data from randomized experiments.
Situation 5

Having tried to make the error term as small as possible, the researcher will encounter a problem if the data analysis still fails to result in statistically significant effects. All one can then conclude is that this particular example of this particular treatment contrast had no observable effects. One cannot easily draw conclusions about what would have happened if each treatment had been more homogeneously implemented (i.e., each person or unit in a group had received exactly the same amounts of the treatment) or if the experimental contrast had been larger (i.e., the mean difference between groups had been greater on some measure designed to assess the strength of the treatment implementation).
As we shall see later, quasi-experimental analyses can sometimes be conducted to assess these two possibilities by capitalizing upon the fact that measures of treatment implementation can be made which estimate presumed differences in the strength of the treatment. Such differences can then be associated with estimates of the magnitude of changes between a pretest and posttest in order to determine if the two are related. While such analyses should definitely be conducted, chapters 3 and 4 will illustrate that great care must be exercised in interpreting the results. This is because individuals will normally have voluntarily chosen to expose themselves to treatments in different amounts, and so the kind of person at one treatment level is likely to be different from a person at another level. Nonetheless, if sophisticated quasi-experimental analyses of the kind in chapters 3 and 4 still fail to result in covariation between the treatment and outcome measures, then the analyst can be all the more confident in accepting the null hypothesis.
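A crude version of the dose-response check described above is simply the correlation between a measured strength of implementation and pretest-posttest change; the data below are hypothetical, and (as the text warns) a positive association here is suggestive rather than causal because exposure is usually self-selected:

```python
def pearson_r(xs, ys):
    """Pearson correlation between strength-of-implementation scores
    and pretest-posttest change scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

dose = [0, 1, 1, 2, 3, 3, 4, 5]    # hypothetical treatment exposure
change = [1, 0, 2, 2, 3, 4, 4, 6]  # posttest minus pretest
print(round(pearson_r(dose, change), 2))  # a strong dose-response association
```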
INTERNAL VALIDITY
Introduction
Once it has been established that two variables covary, the problem is to decide whether there is any causal relationship between the two and, if there is, to decide whether the direction of causality is from the measured or manipulated A to the measured B, or vice versa.

The task of ascertaining the direction of causality usually depends on knowledge of a time sequence. Such knowledge is usually available for experiments, as opposed to most passive observational studies. In a randomized experiment, the researcher knows that the measurement of possible outcomes takes place after the treatment has been manipulated. In quasi-experiments, most of which require both pretest and posttest measurement, the researcher can relate some measure of pretest-posttest change to differences in treatments.
It is more difficult to assess the possibility that A and B may be related only through some third variable (C). If they were, the causal relationship would have to be described as: A → C → B. This is quite different from the model A → B which most clearly implies that A causes B. To conclude that A causes B when in fact the model A → C → B is true would be to draw a false positive conclusion about cause. Accounting for third-variable alternative interpretations of presumed A-B relationships is the essence of internal validity and is the major focus of this book.
Although in the examples that follow we shall deal primarily with the possibility of false positive findings, it should not be forgotten that third variables can also threaten internal validity by leading to false negatives. The latter occur whenever relationship signs are as below. In the case to the left, an increase in A causes an increase in both B and C, but the increase in C causes a decrease in B. Thus,

    Left:    A →(+) B,   A →(+) C,   C →(−) B
    Center:  A →(+) B,   A →(−) C,   C →(+) B
    Right:   A →(−) B,   A →(−) C,   C →(−) B
the net effect of A and C on B would be to tend to obscure a true A → B relationship. In the case depicted in the center, an increase in A would cause an increase in B and a decrease in C, while a decrease in C would cause a decrease in B. Once again, the effects of A and C would tend to cancel each other out. In the final case, an increase in A would cause a decrease in both B and C, and the decrease in C would cause a countervailing increase in B.
Let us give an example of the second of these three relationships. Imagine that A is tutoring and B is academic achievement. Imagine, further, that tutoring is given to the weaker students academically and is withheld from the stronger, this process of selection into the treatment being C. Since tutoring is negatively related to initial achievement, we have A → C. Being weaker, the students with tutoring would be expected to gain less over time than their fellow students for a number of reasons that have nothing to do with tutoring (e.g., slower rates of learning from other sources). Hence, C → B. Thus, if tutoring did raise achievement (A → B) but the children who received tutoring were expected to gain less from schooling anyway (that is, C → B), then the effects of tutoring and of lower expected growth rates would countervail. In the special case where the two forces were of equal magnitude, they would totally cancel each other out. In cases where one force was stronger than the other, the stronger cause would prevail but its effect would be weakened by the countervailing cause. We hope that our later examples, which emphasize internal validity threats and false positive findings, will not blind readers to the effects that such threats can have in leading to false negative findings because of the operation of suppressor variables.
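The cancellation can be made concrete with a toy simulation (our own construction, not from the text): A raises B directly but also moves a third variable C whose influence on B runs the other way, so the observed A-B covariation is the sum of the two paths:

```python
import random

def net_ab_covariation(direct, a_to_c, c_to_b, n=100_000, seed=1):
    """Monte Carlo sketch of a suppressor pattern: with A of unit
    variance, the observed A-B covariance approaches
    direct + a_to_c * c_to_b."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        a = rng.gauss(0, 1)
        c = a_to_c * a + rng.gauss(0, 1)
        b = direct * a + c_to_b * c + rng.gauss(0, 1)
        total += a * b
    return total / n

# Tutoring raises achievement directly (+0.5), but it goes to weaker
# students (path -0.5 to C) who gain less anyway (path +1.0 from C
# to B): the two forces cancel almost exactly, leaving near-zero
# observed covariation despite a real treatment effect.
print(round(net_ab_covariation(0.5, -0.5, 1.0), 2))
```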
It is possible for more than one internal validity threat to operate in a given situation. The net bias that the threats cause depends on whether they are similar or different in the direction of bias and on the magnitude of any bias they cause independently. Clearly, false causal inferences are more likely the more numerous and powerful the validity threats and the more homogeneous the direction of their effects. Though our discussion will be largely in terms of threats taken singly, this should not blind readers to the possibility that multiple internal validity threats can operate in cumulative or countervailing fashion in a single study.
Threats to Internal Validity
Bearing this brief introduction in mind, we want to define some specific
threats to internal validity.
History
"History" is a threat when an observed effect might be due to an event which takes place between the pretest and the posttest, when this event is not the treatment of research interest. In much laboratory research the threat is controlled by insulating respondents from outside influences (e.g., in a quiet laboratory) or by choosing dependent variables that could not plausibly have been affected by outside forces (e.g., the learning of nonsense syllables). Unfortunately, these techniques are rarely available to the field researcher.
operated, then the investigator has to conclude that a demonstrated relationship between two variables may or may not be causal. Sometimes the alternative interpretations may seem implausible enough to be ignored and the investigator will be inclined to dismiss them. They can be dismissed with a special degree of confidence when the alternative interpretations seem unlikely on the basis of findings from a research tradition with a large number of relevant and replicated findings.

Invoking plausibility has its pitfalls, since it may often be difficult to obtain high inter-judge agreement about the plausibility of a particular alternative interpretation. Moreover, theory testers place great emphasis on testing theoretical predictions that seem so implausible that neither common sense nor other theories would make the same prediction. There is in this an implied confession that the "implausible" is sometimes true. Thus, "implausible" alternative interpretations should reduce, but not eliminate, our doubt about whether relationships are causal.
When respondents are randomly assigned to treatment groups, each group is similarly constituted on the average (no selection, maturation, or selection-maturation problems). Each experiences the same testing conditions and research instruments (no testing or instrumentation problems). No deliberate selection is made of high and low scorers on any tests except under conditions where respondents are first matched according to, say, pretest scores and are then randomly assigned to treatment conditions (no statistical regression problem). Each group experiences the same global pattern of history (no history problem). And if there are treatment-related differences in who drops out of the experiment, this is interpretable as a consequence of the treatment. Thus, randomization takes care of many threats to internal validity.
With quasi-experimental groups, the situation is quite different. Instead of relying on randomization to rule out most internal validity threats, the investigator has to make all the threats explicit and then rule them out one by one. His task is, therefore, more laborious. It is also less enviable since his final causal inference will not be as strong as if he had conducted a randomized experiment. The principal reason for choosing to conduct randomized experiments over other types of research design is that they make causal inference easier.
Threats to Internal Validity That Randomization Does Not Rule Out
Though randomization conveniently rules out many threats to internal validity, it does not rule out all of them. In particular, imitation of treatments, compensatory equalization, compensatory rivalry, and demoralization in groups receiving less desirable treatments can each threaten internal validity even when randomization has been successfully implemented and maintained over time. Some of these threats will usually cause spurious differences (e.g., demoralization in the controls). However, other threats will tend to obscure true differences, especially by making no-treatment control groups perform atypically. This last happens with the imitation of treatments, compensatory equalization, and compensatory rivalry. We want to make clear that, while randomized experiments are superior to quasi-experiments with respect to internal validity, they are not perfect.
Most of the threats that randomization does not rule out result from the focused inequities that inevitably accompany experimentation because some people receive one treatment and others receive different treatments or no treatment at all. Since much social experimentation is ameliorative, treatments have to differ in desirability by virtue of the very research problem (e.g., the different payment levels in a compensatory education or an income supplement program, or the different amounts of time that can be spent away from cell-block confinement in a prison experiment on "rehabilitation"). Obviously, individual respondents want to receive the more desirable treatments. In the same vein, officials want to avoid salient inequities which can lead to charges that they favored some respondents over others in distributing treatments.
It is rare in our society to have valuable resources distributed on a random basis. Instead, we expect them to be distributed according to need, merit, seniority, or on a "first come, first served" basis. The point is that distributing resources by lottery violates the meritocratic and/or social responsibility norms which regulate and justify most differences in rewards and opportunities in the United States. This is not to say that lotteries are never used in resource distribution. They seem to be convenient, for instance, in distributing sudden "windfalls" or universally undesired resources (e.g., a lottery was used for inducting young men into the U.S. armed services after 1969). Nonetheless, distribution by merit or need is more common than distribution by chance, and the latter often violates expectations about what is "just." It is this which leads to randomization exacerbating some internal validity threats.
The extent of an administrator's apprehension about randomization probably depends on four subjective estimates: (1) the differences in desirability between treatments; (2) the probability that individuals will learn of treatment differences; (3) the probability that organized constituencies will learn of these differences; and (4) how much the various constituencies will feel that their interests are affected by the most likely research outcomes. Some research questions make it difficult to rule out all of an administrator's apprehensions since, first, they absolutely require treatments that differ in desirability (e.g., what is the effect of extra payments to schools?). Second, scarce research resources require geographical contiguity (e.g., we can only do the study in one school district). Third, it seems to be part of an administrator's job to consider how various constituencies might react to focused inequities and to fear the worst (e.g., what will the teachers' union or the PTA think if resources are distributed by chance instead of by need or merit?). And fourth, administrators know that constituency representatives want to get the best possible advantages for their charges and want to avoid any potential harm to them (e.g., a teachers' union official might think: If performance contracting works in schools, then the role of the classroom teacher could be reduced in scope and importance—do we want that?). Such considerations highlight both the difficulties of gaining permission to randomize and of ruling out the threat of compensatory equalization when randomization has taken place.
The only other internal validity threat that can operate in a randomized experiment is differential mortality from the treatment groups. While such differences can be interpreted as a consequence of the treatment—and as a result will often be very important—they have the undesirable side effect of obscuring the interpretation of other results. This is because the units remaining in one treatment group may not be comparable on the average to those in another group. Thus, if there were differential attrition from, say, an experiment on the effect of income supplements on the motivation to find work, we would not be sure if a relationship between the dollar value of a supplement and the number of days worked was due to the supplement reducing the number of days worked or to selection differences associated with the kinds of persons who remained in each treatment condition for the entire experiment. Treatment-correlated attrition leads to the possibility of a selection confound. We might readily surmise that such attrition is all the more likely the more the treatments differ in desirability.
With the exception of differential mortality and the selection problems that follow from it, the threats to internal validity which random assignment does not rule out are caused by atypical behavior on the part of persons in no-treatment control groups or groups that receive less desirable treatments. Such behavior represents an unplanned but nonetheless causal consequence of the planned experimental contrast. Even when there is a valid causal relationship at the operational level, one may wonder how differences in B can be interpreted as the result of threats to internal validity. Internal validity is, after all, concerned with threats that cast doubt on whether there is a valid causal connection, and the threats we are discussing do not deny the validity of a causal connection. The answer is in one sense simple. Internal validity refers to doubt about whether there is a causal connection from A-as-manipulated (or measured) to B-as-measured; on the other hand, the threats to internal validity which we are discussing (e.g., resentful demoralization of the controls) cast doubt on whether the causal connection is from A to B or is from A's comparison group to B. (In another sense, this issue is academic, for causal inference always depends on the contrast between A and A's comparison, irrespective of whether A or the comparison causes the observed changes in the dependent variable. Given our emphasis on the desirability of identifying active causal agents, it is important to identify whether A or its comparison accounts for change, since knowing the active causal agent allows one to know what to manipulate. This is why we specify internal validity in terms of the pattern of influence from A to B rather than in terms of the pattern of influence from the contrast between A and its comparison to B).
Assessing the Plausibility of Internal Validity Threats If a Randomized
Experiment Has Been Implemented
The possibility of a selection artifact resulting from differential attrition can best be empirically assessed in two ways. First, an analysis is called for of the proportion of respondents, originally assigned to each experimental condition, who actually provide posttest data. Differences in this proportion across treatments indicate a differential dropout. Second, an analysis is called for of the pretest scores in each treatment group computed on the basis of all those who provided posttest data. This gives a preliminary indication of whether the dropouts differed across groups on the background characteristics that are most likely to affect posttest scores (i.e., those that are highly correlated with pretest scores on the same test). We will deal with these points in greater detail in chapter 8.
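The two analyses just described can be sketched as follows; the records and function names are invented for illustration.

```python
import statistics

# Hypothetical records: (condition, pretest score, posttest score or None).
# None marks a dropout who never provided posttest data.
records = [
    ("treatment", 52, 55), ("treatment", 48, 50), ("treatment", 61, 64),
    ("treatment", 45, None), ("treatment", 58, 62), ("treatment", 50, 53),
    ("control",   51, 51), ("control",   47, None), ("control",   60, None),
    ("control",   44, None), ("control",   57, 58), ("control",   49, 49),
]

def attrition_check(records):
    """First analysis: share of originally assigned respondents who
    actually provided posttest data, computed per condition."""
    rates = {}
    for cond in {r[0] for r in records}:
        assigned = [r for r in records if r[0] == cond]
        completed = [r for r in assigned if r[2] is not None]
        rates[cond] = len(completed) / len(assigned)
    return rates

def completer_pretest_means(records):
    """Second analysis: pretest means restricted to those who provided
    posttest data -- a preliminary check for selection differences."""
    means = {}
    for cond in {r[0] for r in records}:
        pres = [r[1] for r in records if r[0] == cond and r[2] is not None]
        means[cond] = statistics.mean(pres)
    return means

print(attrition_check(records))         # unequal rates flag differential dropout
print(completer_pretest_means(records)) # unequal means flag a selection confound
```

In this toy data set the control group loses half its respondents while the treatment group loses one in six, which is exactly the pattern the first analysis is meant to detect.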
An assessment of imitation, compensatory equalization, or compensatory rivalry can often be made by direct measures in the experimental and control groups of the process that the independent variable was meant to affect. Thus, if a treatment were meant to provide money to some schools but not others, the finances of both kinds of schools would need examining. If a treatment was expected to make experimental children view an educational television program, it would be necessary to measure how often they watch the show and how often the controls watch it. A small or nonexistent experimental contrast would suggest that imitation, compensatory equalization, or compensatory rivalry may have occurred. Thus, measures of the exact nature of the treatment in all treatment and control groups are absolutely vital in any experiment. The sooner such measurements are taken, the easier it will be to detect unexpected patterns of behavior in the experimental and control groups and the easier it will be to take corrective action.
It will normally be easy to use background information to find out if controls had contact with experimentals and copied them or to find out if administrators provided additional resources to some units from nonexperimental sources. It will normally not be as easy to assess whether compensatory rivalry took place, though direct measures of verbal expressions of such rivalry by the controls can give a lead, as can indications of whether control group performance is greater than would be expected. Saretsky (1972), it will be remembered, tried to determine this from the performance level in past years in the same classes; but he probably ran into a regression problem. Nonetheless, if used with care, the use of secondary data from past classes can be useful for attempting to assess the magnitude of any compensatory rivalry. Such data could also be useful for assessing resentful demoralization, because this threat leads to the testable prediction that performance should be atypically low in the control group during the experiment.
CONSTRUCT VALIDITY OF PUTATIVE CAUSES AND EFFECTS
Introduction
Construct validity is what experimental psychologists are concerned with when they worry about "confounding." This refers to the possibility that the operations which are meant to represent a particular cause or effect construct can be construed in terms of more than one construct, each of which is stated at the same level of reduction. Confounding means that what one investigator interprets as a causal relationship between theoretical constructs labeled A and B, another investigator might interpret as a causal relationship between constructs A and Y or between X and B or even between X and Y.
In the discussion that follows we shall restrict ourselves to the construct validity of presumed causes and effects, since these play an especially crucial role in experiments whose raison d'être is to test causal propositions. But it should be clearly noted that construct validity concerns are not limited to cause and effect constructs. All aspects of the research require naming samples in generalizable terms, including samples of people and settings as well as samples of measures or manipulations. Even with internal validity and statistical conclusion validity, inferences have to be made about abstract constructs: viz., "cause" and "reliable change" or "reliable differences."
The reference to the level of reduction in the definition of "confounding" is important, because it is always possible to "translate" sociological terms into psychological ones, or psychological terms into biological ones. For example, participative decision making could become conformity to membership group norms on one level, or some correlate of, say, the ascending reticular activating system on another. Each of these levels of reduction is useful in different ways and none is more legitimate than any other. But such "translations" from one level to another do not involve the confounding of rival explanations that is at issue here.
Before we continue our abstract characterization of construct validity, some concrete examples of well-known construct validity concerns may help. In earlier medical experiments on drugs, the psychotherapeutic effect of the doctor's helpful concern was confounded with the chemical action of the pill. So, too, were the doctor's and the patient's belief that the pill should have helped. To circumvent these problems and to increase confidence that any observed effects could be attributed to the chemical action of the pill alone, the placebo control group and the double-blind experimental design were introduced. (The first of these involves giving a chemically inert substance to respondents, and the second requires that neither the person prescribing the pill nor the person evaluating its effects knows the experimental condition to which the patient has been assigned.)
In industrial relations research, the Hawthorne effect is another confound which causes uncertainty about how operations should be labeled. If we assume for the moment that productivity was increased in the original Hawthorne studies by the planned experimental intervention, the issue for construct validity purposes is: Was the increase due to shifts in illumination (the planned treatment) or to the demonstrated administrative concern over improved working conditions (the "Hawthorne effect") or to telling the women how well they were doing their work (an inadvertent correlate of increasing the illumination)?
Construct validity concerns begin to surface at the planning and pilot-testing stages of an experiment when attempts are made to fit the anticipated cause and effect operations to their referent constructs, whether these are derived from formal social science theory or from policy considerations. Such "fitting" to the construct of interest is best achieved (1) by the careful preexperimental explication of constructs so that definitions are clear and in conformity with public understanding of the words being used, and (2) by data analyses directed at some of the four following points, preferably all of them.
First, a test should be made of the extent to which the independent variables alter what they are meant to alter. This is done by assessing whether the treatment manipulation is related to direct measures of the process designed to be affected by the treatment. (This is called "assessing the 'take' of the independent variable.") Second, a test should be conducted to assess whether an independent variable does not vary with measures of related but different constructs. For instance, a manipulation of "communicator expertise" should be correlated with reports from respondents about the communicator's level of knowledge, but it should not be correlated with attributions about cognate constructs, such as trustworthiness, congeniality, or power. If there are such correlations, it is difficult to differentiate effects due to expertise from those due to the other variables. Third, the proposed dependent variables should tap into the factors they are meant to measure. Normally, some form of inter-item correlation can suggest this. And fourth, the dependent variables should not be dominated by irrelevant factors that make them measures of more or less than was intended. Thus, the outcome construct, like the treatment construct, has to be differentiated from its particular cognates.
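The first two checks can be sketched with a toy manipulation-check analysis of the expertise example; the ratings and the `pearson` helper are invented for illustration.

```python
import statistics

def pearson(x, y):
    """Plain Pearson correlation, sufficient for these rough checks."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Hypothetical data for a "communicator expertise" manipulation:
# 0/1 condition codes and respondents' post-session ratings.
condition         = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
rated_expertise   = [3, 2, 4, 3, 2, 6, 7, 6, 5, 7]  # should covary with condition
rated_trustworthy = [5, 4, 6, 5, 4, 5, 6, 4, 5, 5]  # cognate construct; should not

take = pearson(condition, rated_expertise)           # the "take" of the IV
discriminant = pearson(condition, rated_trustworthy) # should stay near zero
print(round(take, 2), round(discriminant, 2))
```

A high first correlation with a near-zero second one is the pattern that lets the treatment be labeled "expertise" rather than some cognate like trustworthiness.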
As we have detailed the procedure, assessing construct validity depends on two processes: first, testing for a convergence across different measures or manipulations of the same "thing" and, second, testing for a divergence between measures and manipulations of related but conceptually distinct "things." Our position should not be interpreted to imply that construct validity absolutely depends on having information about both convergences and divergences, for it is clearly desirable to have information about convergences even when nothing is known directly about divergences. Indeed, other discussions of construct validity have restricted themselves to convergences, even while noting that a close correspondence between different types of measures of the same thing is less meaningful if there are similar measurement irrelevancies associated with each measure, as when only paper-and-pencil or observational measures of the same construct are made—see Campbell and Tyler, 1957; Cronbach and Meehl, 1955; Cronbach, Gleser, Nanda, and Rajaratnam, 1972. However, as Campbell and Fiske (1959) suggest, a construct should be differentiated from related theoretical constructs as well as from methodological irrelevancies. (For an example of differentiation from other theoretical constructs in basic research, see Cook, Crosby and Hennigan, 1977; and for an example in applied research, see the differentiation of viewing "Sesame Street" from "being encouraged to view 'Sesame Street' by paid professionals," Cook et al., 1975.)
We can illustrate these points by considering a possible experiment on the effects of supervisory distance. Suppose we operationalized "supervision" as a foreman standing within comfortable speaking distance of workers (e.g., ten feet). This particular operationalization would exclude distances that were beyond speaking but not beyond seeing, and the treatment might be more exactly characterized as "supervision from speaking distances." It would be dangerous to generalize from such a specific treatment to the general "supervision" construct, especially if supervision has different consequences when it comes from shorter and longer distances. To lessen this possibility, it would be useful if supervisory distance were systematically varied by means of planned manipulations. That is not always possible. However, it would still be useful if supervision inadvertently varied across a wide range of distances because foremen differed in their behavior from day to day. Careful analysis of the effects of spontaneous variation in distance would then allow us to test whether we can generalize from one supervisory distance to another. If we can, we can generalize with greater confidence to the general construct of "supervision," whereas if we cannot, we would like to restrict our generalization to "supervision from ten feet or less."
The foremen might also differ from each other, or might themselves differ from day to day, in whether they supervise with a smile or in an officious manner. Neither the smile nor the officiousness would seem to be necessary components of most definitions of "supervision." Hence, the researcher might
Mono-Method Bias
To have more than one operational representation of a construct does not necessarily imply that all irrelevancies have been made heterogeneous. Indeed, when all the manipulations are presented the same way, or all the measures use the same means of recording responses, then the method is itself an irrelevancy whose influence cannot be dissociated from the influence of the target construct. Thus, if all the experts in the previous hypothetical example had been presented to respondents in writing, it would not logically be possible to generalize to experts who are seen or heard. Thus it would be more accurate to label the treatment as "experts presented in writing." To cite another example, attitude scales are often presented to respondents without apparent thought to (a) using methods of recording other than paper-and-pencil, (b) varying whether the attitude statements are positively or negatively worded, or (c) varying whether the positive or negative end of the response scale appears on the right or left of the page. On these three points depends whether one can test if "personal private attitude" has been measured as opposed to "paper-and-pencil nonaccountable responses," or "acquiescence," or "response bias."
Hypothesis-Guessing Within Experimental Conditions
The internal validity threats called "resentful demoralization" and "compensatory rivalry" were assumed to result because persons who received less desirable treatments compared themselves to persons who received more desirable treatments, making it unclear whether treatment effects of any kind occurred in the treatment group. Reactive research may not only obscure the treatment effects, but also result in effects of diminished interpretability. This is especially true if it is suspected that persons in one treatment group compared themselves to persons in other groups and guessed how the experimenters expected them to behave. Indeed, in many situations it is not difficult to guess what the experimenters hope for, especially in education or industrial organizations. Hypothesis-guessing can occur without social comparison processes, as when respondents know only about their own treatment but persist in trying to discover what the experimenters want to learn from the research.
The problem of hypothesis-guessing can best be avoided by making hypotheses (if present) hard to guess, by decreasing the general level of reactivity in the experiment, or by deliberately giving different hypotheses to different respondents. But these solutions are at best partial, since respondents are not passive and can always generate their own treatment-related hypotheses which may or may not be the same as the experimenters'. Learning an hypothesis does not necessarily imply either the motivation or the ability to alter one's behavior because of the hypothesis. Despite the widespread discussion of treatment confounds that are presumed to result from wanting to give data that will please the researcher—which we suspect is a result of discussions of the Hawthorne effect—there is neither widespread evidence of the Hawthorne effect in field experiments (see reviews by D. Cook, 1967; Diamond, 1974), nor is there evidence of a similar orientation in laboratory contexts (Weber and Cook, 1972). However, we still lack a sophisticated and empirically corroborated theory of the conditions under which hypothesis-guessing (a) occurs, (b) is treatment specific, and (c) is translated into behavior that (d) could lead to erroneous conclusions about the nature of a treatment construct when (e) the research takes place in a field setting.
Evaluation Apprehension
Rosenberg (1969) has reviewed considerable evidence from laboratory experiments which indicates that respondents are apprehensive about being evaluated by persons who are experts in personality adjustment or the assessment of human skills. In such cases respondents attempt to present themselves to such persons as both competent and psychologically healthy. It is not clear how widespread such an orientation is in social science experiments in field settings, especially when treatments last a long time and populations do not especially value the way that social scientists or their sponsors evaluate them. Nonetheless, it is possible that some past treatment effects were due to respondents being willing to present themselves to experimenters in ways that would lead to a favorable personal evaluation. Being evaluated favorably by experimenters is rarely the target construct around which experiments are designed. It is a confound.
Experimenter Expectancies
There is some literature (Rosenthal, 1972) which indicates that an experimenter's expectancies can bias the data obtained. When this happens, it will not be clear whether the causal treatment is the treatment-as-labeled or the expectations of the persons who deliver the treatments to respondents. This threat can be decreased by employing experimenters who have no expectations or have false expectations, or by analyzing the data separately for persons who deliver the treatments and have different kinds or levels of expectancy. Experimenter expectancies are thus a special case of treatment-correlated irrelevancy, and they may well operate in some (but certainly not all) field settings.
Confounding Constructs and Levels of Constructs
Experiments can involve the manipulation of several discrete levels of an independent variable that is continuous. Thus, one might conclude from an experiment that A does not affect B when in fact A-at-level-one does not affect B, whereas A-at-level-four might well have affected B if A had been manipulated as far as level four. This threat is a problem when A and B are not linearly related along the whole continuum of A; and it is especially prevalent, we assume, when treatments have only a weak impact. If they do, because low levels of A are manipulated, and if conclusions are drawn about A without any qualifications concerning the strength of the manipulation, then misleading negative conclusions can be drawn. The best control for this threat is to conduct parametric research in which many levels of A are varied and many levels of B are measured.
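A toy illustration of the threat, using an entirely hypothetical threshold-shaped outcome function:

```python
# Hypothetical nonlinear dose-response: levels of A at or below 2 are
# below threshold and produce no change in B; higher levels do.
def outcome(level_of_a):
    return 0.0 if level_of_a <= 2 else 5.0 * (level_of_a - 2)

# An experiment that only contrasts A at level 1 with A at level 0 finds
# no effect and might wrongly conclude that "A does not affect B".
weak_contrast = outcome(1) - outcome(0)

# Parametric research across many levels reveals the true shape.
dose_response = [outcome(a) for a in range(5)]
print(weak_contrast, dose_response)  # 0.0 [0.0, 0.0, 0.0, 5.0, 10.0]
```

The flat low end of the curve is exactly the region in which a weak manipulation yields a misleading negative conclusion about the whole construct.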
Interaction of Different Treatments
This threat occurs if respondents experience more than one treatment, which is common in laboratory research but quite rare in field settings. We do not
fiction books but not for the circulation of factual ones. The process of hypothesizing constructs and testing how well treatment and outcome operations fit these constructs is similar whether it occurs before the research begins or after the data are received. The major difference is that in the later stage one specifies constructs that fit the data, whereas in the earlier stage one derives operations from constructs.
In their pathfinding discussion of construct validity, Cronbach and Meehl (1955) stressed the utility of drawing inferences about constructs from the fit between patterns of data that would be predicted if a particular theoretical construct was operating and the multivariate pattern of data actually obtained in the research. They used the term "nomological net" to refer to the predicted pattern of relationships that would permit naming a construct. For instance, a current version of dissonance theory predicts that being underpaid for a counterattitudinal advocacy will result in greater belief change than being overpaid, provided that the individual who makes the advocacy thinks he has a free choice to refuse to perform the advocacy. The construct "dissonance" would therefore be partially validated if experimental data showed that underpayment caused more belief change than overpayment but only under free choice conditions. However, the fit between the complex prediction and the complex data only facilitates belief in "dissonance" to the extent that other theoretical constructs could not explain this same data pattern. Bem (1972) obviously believes that reinforcement constructs do as good a job of complex prediction in this case as "dissonance."
We have implicitly used the "nomological net" idea in our presentation of construct validity. First, we discussed the usefulness—for labeling the treatment—of examining whether the planned treatment is related to direct measures of the treatment process and is not related to cognate processes. Second, we discussed the advantages of determining in what ways the outcome variables are related to treatments and the type of treatment that could have resulted in such a differentiated impact. For instance, if the introduction of television decreases the circulation of fiction but not fact books, one can hypothesize that the causal impact is mediated by television taking time away from activities that are functionally similar—such as fantasy amusement—but not from functionally dissimilar activities—such as learning specific facts. However, our emphasis has differed slightly from that of Cronbach and Meehl (1955) inasmuch as we are more interested in fitting cause and effect operations to a generalizable construct (see Campbell, 1960—the discussion of "trait validity") than we are in using complex predictions and data patterns to validate entirely hypothetical scientific constructs like "anxiety," "intelligence" or "dissonance." However, we readily acknowledge that the way the data turn out in experiments helps us edit the constructs we deal with, as when we find that a foreman's "supervision" has different consequences from less than ten feet as opposed to more than ten feet.
EXTERNAL VALIDITY
Introduction
Under external validity, Campbell and Stanley originally listed the threat of not being able to generalize across exemplars of a particular presumed cause or effect construct. We have obviously chosen to incorporate this feature under construct validity as "mono-operation bias." The reason for listing this threat differently from Campbell and Stanley is not fundamental. Rather it is meant to emphasize that most researchers want to draw conclusions about constructs, but the Campbell and Stanley discussion had a flavor of definitional operationalism, although a multiple definitional operationalism. We have tried to avoid this flavor by invoking construct validity to replace generalizing across cause and effect exemplars. The other features of Campbell and Stanley's conceptualization of external validity are preserved here and elaborated upon. They have to do with (1) generalizing to particular target persons, settings, and times, and (2) generalizing across types of persons, settings, and times.
Bracht and Glass (1968) have succinctly explicated external validity, pointing out that a two-stage process is involved: a target population of persons, settings, or times has first to be defined and then samples are drawn to represent these populations. Very occasionally, the samples are drawn from the populations with known probabilities, thereby maximizing the final representativeness discussed in textbooks on sampling theory (e.g., Kish, 1965). But usually the samples cannot be drawn so systematically and are drawn instead because they are convenient and give an intuitive impression of representativeness, even if it is only the representativeness entailed by class membership (e.g., I want to generalize to Englishmen and the people I found on streetcorners in Birkenhead, England, belong to the class called Englishmen). Accidental sampling, as it is technically labeled, gives us no guarantee that the achieved population (a subset of Englishmen who hang around streetcorners in Birkenhead) is representative of the target population of which they are members. Consequently, we find it useful to distinguish among (1) target populations, (2) formally representative samples that correspond to known populations, (3) samples actually achieved in field research, and (4) achieved populations.
One of many examples that could be cited to illustrate these last points concerns the design of the first negative income tax experiment. Practical administrative considerations led to the study being conducted in a few localities within New Jersey and in one city in neighboring Pennsylvania. Since the basic question guiding the research did not require such a restricted geographical location, the New Jersey-Pennsylvania setting must be considered a limitation which reduces one's ability to generalize to the implicit target population of the whole United States. (To criticize the study because the achieved sample of settings was not formally representative of the target population may appear unduly harsh in light of the fact that financial and logistical resources for the experiment were limited, and so sampling was conducted for convenience rather than formal representativeness. We shall return to this point later. For the present, however, it is worth noting that accidental samples of convenience do not make it easy to infer the target population, nor is it clear what population is actually achieved.)
Generalizing to well-explicated target populations should be clearly distinguished
from generalizing across populations. Each is germane to external
validity: the former is crucial for ascertaining whether any research goals that
specified populations have been met, and the latter is crucial for ascertaining
which different populations (or subpopulations) have been affected by a treatment,
i.e., for assessing how far one can generalize. Let us give an example.
VALIDITY
Suppose a new television show were introduced that was aimed at teaching
basic arithmetic to seven-year-olds in the United States. Suppose, further, that
one could somehow draw a random sample of all seven-year-olds to give a
representative national sample within known limits of sampling error. Suppose,
further, that one could then randomly assign each of the children to watching
or not watching the television show. This would result in two randomly
formed, and thus equivalent, experimental groups which were representative of
all seven-year-olds in the United States. Imagine, now, that the data analysis
indicated that the average child in the viewing group gained more than the
average child in the nonviewing group. One could generalize such a finding to
the average seven-year-old in the nation, the target population of interest.
This is equivalent to saying that the results were obtained despite possible
variations in how much different kinds of children in the experimental viewing
group may have gained from the show. A more differentiated data analysis
might show that the boys gained more than the girls (or even that only the
boys gained), or the analysis might show that children with certain kinds of
home background gained while children from different backgrounds did not.
Such differentiated findings indicate that the effects of the televised arithmetic
show could not be generalized across all subpopulations of seven-year-old
viewers, even though they could be generalized to the population of seven-
year-old viewers in the United States.
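The contrast drawn above, between generalizing *to* the population and *across* its subgroups, can be sketched as a toy subgroup analysis. All of the gain scores below are invented purely for illustration:

```python
# Hypothetical illustration: an overall treatment effect that generalizes *to*
# the population but not *across* subgroups. All numbers are invented.
gains = {
    # (group, sex): gain scores for individual children (fabricated data)
    ("viewing", "boys"):     [8, 9, 7, 10, 8],
    ("viewing", "girls"):    [2, 3, 2, 1, 2],
    ("nonviewing", "boys"):  [2, 3, 2, 2, 1],
    ("nonviewing", "girls"): [2, 2, 3, 1, 2],
}

def mean(xs):
    return sum(xs) / len(xs)

# Undifferentiated comparison: average viewer vs. average nonviewer
view_all = gains[("viewing", "boys")] + gains[("viewing", "girls")]
nonview_all = gains[("nonviewing", "boys")] + gains[("nonviewing", "girls")]
overall_effect = mean(view_all) - mean(nonview_all)

# Differentiated (subgroup) comparisons
boys_effect = mean(gains[("viewing", "boys")]) - mean(gains[("nonviewing", "boys")])
girls_effect = mean(gains[("viewing", "girls")]) - mean(gains[("nonviewing", "girls")])

print(f"overall effect: {overall_effect:.1f}")  # positive: holds for the average child
print(f"boys effect:    {boys_effect:.1f}")     # large
print(f"girls effect:   {girls_effect:.1f}")    # near zero: no generalization across sexes
```

Here the overall comparison supports a conclusion about the average seven-year-old, while the differentiated analysis shows the effect is carried almost entirely by one subgroup.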
To generalize across subpopulations like boys and girls logically presupposes
being able to generalize to boys and girls. Thus, the logical distinction
between generalizing to and across should not be overstressed. The distinction
is most useful for its practical implications insofar as many researchers who are
concerned about generalizing across populations are usually not as concerned
with careful sampling as are persons who want to generalize to target populations.
Many researchers with the former focus would be happy to conclude that
a treatment had a specific effect with the particular achieved sample of boys or
girls in the study, irrespective of how well the achieved population of boys or
girls can be specified.
The distinction between generalizing to target populations and across multiple
populations or subpopulations is also useful because commentators on external
validity have often implicitly stressed one over the other. For instance, some persons
discuss external validity as though it were only about estimating limits of generalizability,
as is evidenced by comments such as: "Sure, the treatment affected seven-year-olds
in Tucson, Arizona, and that was your target group. But what about children
of different ages in other areas of the United States?" Other commentators
discuss external validity exclusively in terms of the fit between samples and target
populations, as is evidenced by comments such as: "I'm not sure whether the treatment
is really effective with children who have learning disabilities, for if you look
at the pretest achievement means for the groups in your experiment, you'll see that
they are as high as the test publisher quotes for the national average. How could
children with learning disabilities have scored so high? I doubt that the research
really involved the kind of child you said it did."
Finally, we make the distinction between generalizing to and across in order to
emphasize the greater stress that we shall place in this presentation on generalizing
across. The rationale for this is that formal random sampling for representativeness
is rare in field research, so that strict generalizing to targets of external validity
is rare. Instead, the practice is more one of generalizing across haphazard
instances where similar-appearing treatments are implemented. Any inferences
about the targets to which one can generalize from these instances are necessarily
fallible, and their validity is only haphazardly checked by examining the instances
in question and any new instances that might later be experimented upon. It is also
worth noting that formal generalization to target populations of persons is
often associated with large-scale experiments. These are often difficult to administer
both in terms of treatment implementation and securing high-quality measurement.
Moreover, attrition is almost inevitable, and so the sample with which
one finishes the research may not represent the same population with which one
began the research. A case can be made, therefore, that external validity is enhanced
more by a number of smaller studies with haphazard samples than by a single study
with initially representative samples, if the latter could be implemented. Of course, it
should not be forgotten that all the haphazard instances of persons and settings that
are examined can belong to the class of persons or settings to which one would like
to be able to generalize research findings. Indeed, they should belong to such a
class.
List of Threats to External Validity
Tests of the extent to which one can generalize across various kinds of per-
sons, settings, and times are, in essence, tests of statistical interactions. If there is
an interaction between, say, an educational treatment and the social class of chil-
dren, then we cannot say that the same result holds across social classes. We
know that it does not. Where effects of different magnitude exist, we must then
specify where the effect does and does not hold and, hopefully, begin to explore
why these differences exist. Since the method we prefer of conceptualizing external
validity involves generalizing across achieved populations, however unclearly
defined, we have chosen to list all of the threats to external validity in terms of
statistical interaction effects.
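As a minimal sketch (with invented cell means), a failure to generalize across subpopulations shows up as a nonzero interaction, i.e., a difference between the simple treatment effects computed within each subpopulation:

```python
# Hypothetical sketch: a threat to external validity expressed as a statistical
# interaction. The cell means below are invented; in practice one would test the
# interaction term in, e.g., a two-way ANOVA or a regression with a product term.
cell_means = {
    # (condition, social_class): mean outcome (fabricated)
    ("treated", "working"): 55.0,
    ("control", "working"): 50.0,
    ("treated", "middle"):  70.0,
    ("control", "middle"):  52.0,
}

# Simple treatment effect within each social class
effect_working = cell_means[("treated", "working")] - cell_means[("control", "working")]
effect_middle = cell_means[("treated", "middle")] - cell_means[("control", "middle")]

# The difference between the simple effects is the interaction contrast:
interaction = effect_middle - effect_working

print(f"effect (working class): {effect_working}")  # 5.0
print(f"effect (middle class):  {effect_middle}")   # 18.0
print(f"interaction:            {interaction}")     # 13.0: the effect differs by class
```

A nonzero interaction contrast means we cannot claim the same result holds across social classes, which is exactly the sense in which the threats below are framed.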
Interaction of Selection and Treatment
In which categories of persons can a cause-effect relationship be generalized?
Can it be generalized beyond the groups used to establish the initial relationship—
to various racial, social, geographical, age, sex, or personality groups? Even when
respondents belong to a target class of interest, systematic recruitment factors lead
to findings that are only applicable to volunteers, exhibitionists, hypochondriacs,
scientific do-gooders, those who have nothing else to do, and so forth. One feas-
ible way of reducing this bias is to make cooperation in the experiment as conven-
ient as possible. For example, volunteers in a television-radio audience experiment
who have to come downtown to participate are much more likely to be atypical
than are volunteers in an experiment carried door-to-door. An experiment involving
executives is more likely to be ungeneralizable if it takes a day's time than if it
takes only ten minutes, for only the latter experiment is likely to include those
people who have little free time.
preliminary understanding of what a business or project is capable of. But that is
another matter.)
The determination of modal instances is more difficult the closer one comes to
theoretical research. This is because target populations are less likely to be specified.
For instance, in testing propositions about "helping" behavior, it is not
desirable to generalize only to workers who are presently employed in a particular
factory, working at a particular task, and producing a particular product. The
persons, the settings, the task, and the product would be irrelevant to any helping
theory. Yet—logically speaking—the factors incorporated into a particular test of
a proposition about helping determine the external validity of the findings, and the
researcher presumably does not welcome this restriction. Instead, he or she would
like to generalize to all persons (in the United States? beyond our shores?), all
settings (the street, the home, the factory?), and all tasks (helping someone who
has fainted, helping the permanently disabled?). The feasibility of confident generalizations
of such breadth is low, and the most that the basic researcher can do is
to attempt to replicate his or her original findings across settings with different
restrictions or to wait until others have conducted the replications. Sampling for
heterogeneity is at issue here rather than sampling to obtain impressionistically
modal instances that the researcher cannot convincingly define.
It should be clear by now that, where targets are specified, the model of random
sampling for representativeness is the most powerful model for generalizing
and that the model of impressionistic modal instances is the least powerful. The
model of heterogeneous instances lies between the two. However, the last model
has advantages over the other two in that it can be used when no targets are
specified and the major concern is not to be limited in one's generalizations.
Moreover, it can be used with small numbers of samples of convenience. In many
cases the random selection of instances results in generalizing to targets that are of
minimal significance for persons whose interests differ from those of the original
researcher. For instance, to be able to generalize to all whites living in the
Detroit area, while of interest for some purposes, is generally of little interest to
most people. However, it is worth noting that whites in Detroit differ in age, SES,
intelligence, and the like, so that it is possible to test whether a particular treatment
can have similar effects despite such differences. In addition, subgroup analyses
can be conducted to examine generality across subpopulations. In other words,
formal randomization from populations of low interest can be used to test causal
relationships across heterogeneous subpopulations; an important
function of random samples is to permit examining the data for differential effects
on a variety of subpopulations. Given the negative relationship between "inferential"
power and feasibility, the model of heterogeneous instances would seem
most useful, particularly if great care is taken to include impressionistically modal
instances among the heterogeneous ones.
In the last analysis, external validity—like construct validity—is a matter of
replication. It is worth noting that one can have multiple replication both within a
single study—subgroup analyses exemplify this—and also across studies—as
when one investigator is intrigued by a pattern of findings and tries to replicate
them using his or her own procedures or procedures that have been closely modeled
on the original investigators'.
Three dimensions of replication are worth noting. First is the simultaneous or
consecutive replication dimension, with the latter being preferable since it offers
some test, however restricted, of whether a causal relationship can be corroborated
at two different times. (Generalizing across times is necessarily more difficult than
generalizing across persons or settings.) Second is the independent or nonindependent
investigator dimension, with the former being more important, especially if
the independent investigators have different expectations about how an experiment
will turn out. Third is the dimension of demonstrated or assumed replication. The
former is assessed by explicit comparisons among different types of persons and
settings where some persons did or did not receive a particular treatment. The
latter is inferred from treatment effects that are obtained with heterogeneous samples,
but no explicit statistical cognizance is taken of the differences among persons,
settings, and times. Demonstrated replication is clearly more informative
than assumed, for to obtain an effect with a mixed sample of, say, boys and girls
does not logically entail that the effect could be obtained separately for both boys
and girls. It only entails that the effect was obtained despite any differences
between boys and girls in how they reacted to the treatment.
The difficulties associated with external validity should not blind experimenters
to practical steps that can be taken to increase generalizability. For instance, one can
often deliberately choose to perform an experiment at three or more sites where
different kinds of persons live or work. Or, if one can randomly sample, it is useful
to do so even if the population involved is not meaningful, for random sampling
ensures heterogeneity. Thus, in their experiment on the relationship between beliefs
and behavior about open housing, Brannon et al. (1973) chose a random sample of
all white households in the metropolitan Detroit area. While few of us are interested
in generalizing to such a population, the sample was nonetheless considerably more
heterogeneous than that used in most research, despite the homogeneity on the attributes
of race and geographical residence.
In addition, our three models for increasing external validity can be used in
combination, as has been achieved in some survey research experiments on
improving survey research procedures (Schuman and Duncan, 1974). Usually,
random samples of respondents are chosen in such experiments, but the interviewers
are not randomly chosen; they are merely impressionistically modal of all
experienced interviewers. Moreover, the physical setting of the research is limited
to one target setting that is of little interest to anyone who is not a survey
researcher—the respondent's living room—and the range of outcome variables is
usually limited to those that survey researchers typically study—that is, those that
can be assessed using paper and pencil. However, great care is normally taken that
these questions cover a wide range of possible effects, thereby ensuring considerable
heterogeneity in the effect constructs studied.
Our pessimism about external validity should not be overgeneralized. An
awareness of targets of generalization, of the kinds of settings in which a target
class of behaviors most frequently occurs, and of the kinds of persons who most
often experience particular kinds of natural treatments will, at the very least, prevent
the designing of experiments that many persons shrug off willy-nilly as
"irrelevant." Also, it is frequently possible to conduct multiple replications of an
experiment at different times, in different settings, and with different kinds of
experimenters and respondents. Indeed, a strong case can be made that external
validity is enhanced more by many heterogeneous small experiments than by one or
two large experiments, for with the latter one runs the risk of having heterogeneous
treatment, measures that are not as reliable as they could be, and measures
that do not reflect the unique nature of the treatment at different sites. Many
small-scale experiments with local control and choice of measures are in many
ways preferable to giant national experiments with a promised standardization that
is neither feasible nor even desirable from the standpoint of making irrelevancies
heterogeneous.
RELATIONSHIPS AMONG THE FOUR KINDS OF VALIDITY
Internal Validity and Statistical Conclusion Validity
Drawing false positive or false negative conclusions about causal hypotheses
is the essence of internal validity. This was a major justification for Campbell
(1969) adding "instability" to his list of threats to internal validity. "Instability"
was defined as "unreliability of measures, fluctuations in sampling persons
or components, autonomous instability of repeated or equivalent measures," all
of which are threats to drawing correct conclusions about covariation and hence
about a treatment's effect. (What precipitated the need for this additional threat
was the viewpoint of some sociologists who had argued against using tests of
significance unless the comparison followed random assignment to treatments.
See Winch and Campbell, 1969, for further details.)
The status of statistical conclusion validity as a special case of internal
validity can be further illustrated by considering the distinction between bias
and error. Bias refers to factors which systematically affect the value of means;
error refers to factors which increase variability and decrease the chance of
obtaining statistically significant effects. If we erroneously conclude from a
quasi-experiment that A causes B, this might either be because threats to internal
validity bias the relevant means or because, for a specifiable percentage of
possible comparisons, sample differences as large as those found in a study
would be obtained by chance. If we erroneously conclude that A does not
affect B (or cannot be demonstrated to affect B), this can either be because
threats to internal validity bias means and obscure true differences or because
the uncontrolled variability obscures true differences. Statistical conclusion
validity is concerned not with sources of systematic bias but with sources of
random error and with the appropriate use of statistics and statistical tests.
An important caveat has to be added to the preceding statement that random
errors reduce the chance of statistically corroborating true differences. This
does not imply that random errors invariably inflate standard errors or that they
never lead to false positive conclusions about covariation. Let us try to illustrate
these points. Imagine multiple replications of an unbiased experiment
where the treatment had no effect. The distribution of sample mean differences
should be normal with a mean of zero. However, many individual sample
mean differences will not be zero. Some will inevitably be larger or smaller
than zero, even to a statistically significant degree.
Imagine, now, the same assumptions except that bias is operating. Because
of the bias, the distribution of sample mean differences will no longer have a
mean of zero, and the difference from zero indicates the magnitude of the bias.
However, the point to be emphasized is that some sample mean differences
will be as large when there is bias as when there is not, although the proportion
of differences reaching the specified magnitude will vary between the bias
and nonbias cases depending on the direction and magnitude of bias. Since
sampling error, which is one kind of random error, affects both sample means
and variances, it can lead to both false positive and false negative differences.
In this respect, sampling error is like internal validity. But it is unlike internal
validity in that it cannot affect population means. Only sources of bias—threats
to internal validity—can do the latter.
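The two thought experiments above can be illustrated with a small simulation (the sample size, bias magnitude, and replication count are arbitrary choices for illustration, not values from the text):

```python
# Simulation sketch: replicate an unbiased two-group "experiment" with no true
# effect many times, then add a constant bias, and compare the two
# distributions of sample mean differences. All parameters are invented.
import random
import statistics

random.seed(0)
N, REPS, BIAS = 30, 2000, 0.5

def mean_diff(bias=0.0):
    """One replication: difference between two group means, scores ~ N(bias, 1) vs N(0, 1)."""
    treat = [random.gauss(bias, 1.0) for _ in range(N)]
    ctrl = [random.gauss(0.0, 1.0) for _ in range(N)]
    return statistics.mean(treat) - statistics.mean(ctrl)

unbiased = [mean_diff() for _ in range(REPS)]
biased = [mean_diff(BIAS) for _ in range(REPS)]

# Unbiased case: differences center on zero, yet individual replications stray
# from zero (sampling error alone can look like an effect).
print(f"unbiased mean of differences: {statistics.mean(unbiased):+.3f}")
# Biased case: the whole distribution shifts; the shift estimates the bias.
print(f"biased mean of differences:   {statistics.mean(biased):+.3f}")
# Some unbiased differences are as large as typical biased ones:
big = sum(1 for d in unbiased if abs(d) > BIAS) / REPS
print(f"share of unbiased |diff| > {BIAS}: {big:.3f}")
```

The last figure shows that a nonzero proportion of purely chance differences match the magnitude of the bias, which is the overlap between the bias and nonbias cases described above.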
Construct Validity and External Validity
Making generalizations is the essence of both construct and external validity.
It is instructive, we think, to analyze the similarities and differences
between the two types of validity. The major similarity can perhaps best be
summarized in terms of the notion of statistical interaction—that is, the sign or
direction of a treatment effect differs across populations. It is easy to see how
person, setting, and time variables can moderate the effectiveness of a treatment.
It is probably also easy to see how an estimate of the effect may depend
on such threats to construct validity as the number of treatments a respondent
receives or the frequency with which outcomes are measured. It may be less
easy to see how a treatment effect can interact with (i.e., depend on) the particular
method used for collecting data (mono-method bias), or the expectancies
of the persons implementing a treatment (experimenter expectancies), or
the guesses that respondents make about how they are supposed to behave
(hypothesis-guessing). But in all these instances an internally valid effect can
be obtained under one condition (say, when paper-and-pencil measures of attitude
are used) and a different, but still valid, effect may result when attitude is
measured some other way.
Specifying the factors that codetermine the direction and size of a particular
cause-effect relationship is useful for inferring cause and effect constructs.
This is because some of the causes or effects that might explain a particular
relationship observed under one condition may not be able to explain why there
are different causal relationships under other conditions. It should especially be
noted that specifying the populations of persons, settings, and times over which
a relationship holds can also clarify construct validity issues. For instance, suppose
a negative income tax causes more married women than men to withdraw
their labor from the labor market (see the summary statements of the four negative
income tax experiments in Cook, Del Rosario, Hennigan, Mark, and Trochim,
1978). Such an action might suggest that the causal treatment can be
understood, not just in monetary terms but also in terms of a possible shift in
economic risks (i.e., where the family breadwinner is guaranteed an income,
the withdrawal of his or her labor could have extremely serious consequences
if the income guarantee were withdrawn or if the guaranteed sum failed to rise
with inflation. But where a source of more marginal income is involved—as
with some married women—the withdrawal of their labor is less critical since
the family is not so heavily dependent on that one source of income.) Other
interpretations of why men and women are affected differently are also possible.
Their existence highlights the difficulty of inferring causal constructs where
the clarifying inference is indirect, being based on differences in responding
across populations rather than on attempts to refine the causal operations
directly so that they better fit a planned construct. The major point to be noted,
however, is that both external and construct validity are concerned with specifying
the contingencies on which a causal relationship depends, and all such
specifications have important implications for the generalizability and nature of
causal relationships. Indeed, external validity and construct validity are so
highly related that it was difficult for us to classify some of the threats as
belonging to one validity type or another. In fact, two of them are differently
placed in this book than in Cook and Campbell (1976). These are "the interaction
of treatments" and "the interaction of testing and treatment." They were
formerly included as threats to external validity on grounds that the number of
treatments and testings were part of the research setting. On reflection, however,
we think they are more useful for specifying cause and effect constructs
than for delimiting the settings under which a causal relationship holds, though
they obviously can serve both purposes.
The major difference between external and construct validity has to do with
the extent to which real target populations are available. In the case of external
validity the researcher often wants to generalize to specific populations of persons,
settings, and times that have a grounded existence, even if he or she can
only accomplish this by impressionistically examining data patterns across accidental
samples. However, with cause and effect constructs it is more difficult
to specify a particular construct—what, for instance, is aggression? Any definitions
would be disputed and would not have the independent existence of, say,
the population of American citizens over 18 years of age. Even though the latter
is a theoretical construct, it is obviously more grounded in reality than such
constructs as "attitude towards authority" or "a negative income tax."
Issues of Priority Among Validity Types
Some ways of increasing one kind of validity will probably decrease
another kind. For instance, internal validity is best served by carrying out
randomized experiments, but the organizations willing to tolerate these are
probably less representative than organizations willing to tolerate passive measurement.
Second, statistical conclusion validity is increased if the experimenter
can rigidly control the stimuli impinging on respondents, but this procedure can
decrease both external and construct validity. And third, increasing the construct
validity of effects by multiply operationalizing each of them is likely to
increase the tedium of measurement and to cause either attrition from the
experiment or lower reliability for individual measures.
These countervailing relationships—and there are many others—suggest how
crucial it is to be explicit about the priority ordering among validity types in
planning any experiment. Means have to be developed for avoiding all unnecessary
trade-offs between one kind of validity and another, and to minimize the
loss entailed by the necessary trade-offs. However, since some trade-offs are
inevitable, we think it unrealistic to expect that a single piece of research will
effectively answer all of the validity questions surrounding even the simplest
causal relationship.
The priority among validity types varies with the kind of research being
conducted. For persons interested in theory testing it is almost as important to
show that the variables involved in the research are constructs A and B (construct
validity) as it is to show that the relationship is causal and goes from
one variable to the other (internal validity). Few theories specify crucial target
settings, populations, or times to or across which generalization is desired.
Consequently, external validity is of relatively little importance. In practice, it
is often sacrificed for the greater statistical power that comes through having
isolated settings, standardized procedures, and homogeneous respondent populations.
For investigators with theoretical interests our estimate is that the types
of validity, in order of importance, are probably internal, construct, statistical
conclusion, and external validity.
We also estimate that the construct validity of causes may be more important
for such researchers than the construct validity of effects, particularly in
psychology. Think, for example, of how simplistically "attitude" is operationalized
in many persuasion experiments, or "cooperation" in bargaining studies,
or "aggression" in studies of interpersonal violence. Think, on the other hand,
about how much care goes into demonstrating that a particular manipulation
varied "cognitive dissonance" and not reactance, communicator expertise and
not experimenter expectancies or evaluation apprehension. Might not the construct
validity of effects rank lower than statistical conclusion validity for most
theory-testing researchers? If it does, this would be ironic since multiple operationalism
makes it easier to achieve higher construct validity of effects than of
causes.
Much applied research has a different set of priorities. It is concerned with
testing whether a particular problem has been alleviated by a treatment—recidivism
in criminal justice settings, achievement in education, or productivity in
industry (high internal validity and high construct validity of the effect). It is
also crucial that any demonstration of change in the indicator be made in a
context which permits either wide generalization or generalization to the specific
target settings or persons in whom the researcher or his clients are particularly
interested (high interest in external validity). The researcher is relatively
less concerned with determining the causally efficacious components of a complex
treatment package, for the major issue is whether the treatment as implemented
caused the desired change (low interest in construct validity of the
cause). The priority ordering for many applied researchers is something like
internal validity, external validity, construct validity of the effect, statistical
conclusion validity, and construct validity of the cause.
For the kinds of causal problems we have been discussing, the primacy of
internal validity should be noted for both basic and applied researchers. The
reasons for this will be given below, and they relate to the often considerable
costs of being wrong about the magnitude and direction of causal relations, and
the often minimal gains in external validity that are achieved in moving from
notion of "cause" is an abstract one and that the single study only approximates
causal knowledge. But we believe it is confusing to insist that internal
validity is a contradiction in terms because all validity is external, referring to
abstract concepts beyond a study and not to concrete research operations within
a study. It is confusing because the choice of populations and the fit between
samples and populations determines representativeness, whereas neither populations
nor samples are necessary for inferring cause.
Nonetheless, the critics make a very useful point, for if the goals of a
research project are formulated well enough to permit specifying target constructs
and populations, then the research operations have to represent these targets
if the research is to be relevant either to theory or to policy. Moreover, a
focus on representativeness has historically entailed a heightened sensitivity to
unplanned and irrelevant targets that unnecessarily limit generalizability, as
when all the persons who collect posttest achievement data in an early childhood
experiment with economically disadvantaged children are of the same
social class. Clearly, relevant research demands representativeness where target
constructs or populations are specified. It also demands heterogeneity where
irrelevant populations could limit the applicability of the research. Though we
advocate putting considerable resources into the preexperimental explication of
relevant theory or policy questions—and hence targets—this should not be
interpreted in any way as an exclusive focus. As we tried to demonstrate in the
discussion of both construct and external validity, it is sometimes the case that
the data, once collected and analyzed, force us to restrict (or extend) generalizability
beyond the scope of the original formulation of target constructs and
populations. The data edit the kinds of general statements we can make.
For instance, in his experiment on the help given to compatriots and foreigners,
Feldman (1968) wanted to generalize to "cooperation." He deduced
that if his independent variable affected cooperation, he would find five dependent
variable measures related to his treatment. But only two were related, and
the data outcomes forced him to conclude tentatively that his treatment was differently
related to two kinds of cooperation. Similarly, the designers of the
New Jersey Negative Income Tax Experiment wanted to generalize to working
poor persons, but the data forced them tentatively to conclude that working
poor blacks responded one way to the treatments, working poor persons who
were Spanish speaking reacted another way, and working poor whites probably
did not respond to the treatments at all. The point is this: While it is laudable
to sample for representativeness when targets of generalization are specified in
advance—and we heartily endorse such sampling—in the last analysis it is the
patterning of data outcomes which determines the range of constructs and populations
over which one can claim a treatment effect was obtained. One has
always to be alert to the data demanding a respecification of the affected populations
and constructs and to the possibility that the affected populations and
constructs will not be the same as those originally specified.
A fourth objection has been directed towards Campbell and Stanley's stress
on the primacy of internal over external validity. The critics argue that no kind
of validity can logically have precedence over another. Of what use, critics
say, is a theory-testing experiment if the true causal variable is not what the
researchers say it is; or of what use is a policy experiment about the effects of
school desegregation if it involves a school in rural Mississippi when most
desegregation is in large, northern cities where white children have fewer alter-
natives to public schools than in the deep South? This point of view has been
best expressed by Snow (1974). He uses the term "referent validity" to designate
the extent to which research operations correspond to their referent terms
in research propositions of the form: "Counseling for pregnant teenagers
improves their mental health" or "The introduction of national health insurance
causes an increase in the use of outpatient services." Without using our termi-
nology, Snow notes that such propositions usually contain implicit or explicit
references to populations, settings, and times (external validity), to the nature
of the presumed cause and effect (construct validity), to whether the operations
representing the cause and effect covary (statistical conclusion validity), and to
whether this covariation is plausibly the result of causal forces (internal validity).
For a study to be useful, the argument goes, each part of the proposition
should be given approximately equal weight. There is no need to stress the causality
term over any other. Other critics (Hultsch and Hickey, 1978; Cronbach, in
preparation) take the argument one step further and stress the primacy of external
over internal validity. Hultsch claims that if we have a target population of special
interest—for example, the educable mentally retarded—then it is better to test
causal propositions about this group on representative samples. He maintains this
should be done even if less rigorous means have to be used for testing causal
propositions than would be the case if a study was restricted to easily available but
nonrepresentative subgroups of the educable mentally retarded or to children who
were not educable and retarded. Cronbach (in preparation) echoes this argument
and adds, first, that in much applied social research the results are needed quickly
and, second, that a high degree of confidence about causal attribution is less
important in the decisions of policy-makers (broadly conceived) than is confidence
in knowing that one is working with formally or impressionistically representative
samples. Consequently, Cronbach contends that the time demands of experiments
with experimenter-controlled manipulanda and the reality of how research is (and
is not) used in decision making suggest a higher priority for speedy research using
available data sets, simple one-wave measurement studies or qualitative studies as
opposed to studies which, like quasi-experiments, take more time and explicitly
stress internal validity.
It is in some ways ironic that the charge of neglecting external validity should
be leveled against one of the persons who invented the construct and elevated its
importance in the eyes of those who restricted experimentation to laboratory set-
tings and who wrote about experimentation without formally mentioning the spe-
cial problems that arise in field settings. But this aside, we have no quarrel in the
abstract with the point of view that, where causal propositions include references
to populations of persons and settings and to constructs about cause and effect,
each should be equally weighted in empirical tests of these propositions. The real
difficulty comes in particular instances of research design and implementation
where very often the investigator is forced to make undesirable choices between
internal and external validity. Gaining a representative sample of educable mentally
retarded students across the whole nation demands considerable resources.
Even gaining such a sample in a few cities located more closely together is difficult,
requiring resources for implementing a treatment, ensuring its consistent delivery,
collecting the required pretest and posttest data, and doing the necessary
public relations work. Without such resources, one runs the risk of a large, poorly
implemented study with a representative sample or of a smaller, better implemented
study where the small sample size limits our confidence in generalizing.
Since random sampling is so rare for purposes of achieving representativeness,
it is useful to consider the trade-off between internal and external validity when
heterogeneous but unrepresentative sampling is used or when impressionistically
modal but unrepresentative instances are selected that at least belong in the general
class to which generalization is desired. Samples selected this way will have
unknown initial biases, since not all schools will volunteer to permit measurement,
even fewer schools will agree to deliberate manipulation of any kind, and
the sample of schools that will agree to randomized manipulation will probably be
even more circumscribed than the sample of schools that agrees to measurement
with or without quasi-experimentation. The crucial issue is this: Would one do
better to work with the initially more representative sample of schools in a particular
geographical area that volunteered to permit measurement, even though no
deliberate manipulation took place? Or would one rather work with a less representative
sample of schools where both measurement and deliberate manipulation
took place?
Solving this problem boils down, we think, to asking whether the internal
validity costs of eschewing deliberate manipulation and more confident causal
inferences are worth the gains for external validity of having an initially more
representative sample from which bias-inducing attrition will nonetheless take
place. Any resolution must also consider two other factors. First, the study which
stresses internal validity has at least to take place in a setting and with persons
who belong in the class to which generalization is desired, however formally
unrepresentative of the class they might be. Second, the study which stresses external
validity and has apparently more representative samples of settings and persons will
result in less confident causal conclusions because more powerful techniques of field
experimentation were not used or were not used as well as they might have been under
other circumstances.
The art of designing causal studies is to minimize the need for trade-offs and
to try to estimate in any particular instance the size of the gains and losses in
internal and external validity that are involved in different trade-off options.
Scholars differ considerably in their estimate of gains and losses. Cronbach maintains
that timely, representative, but less rigorous studies can still lead to reasonable
approximations to causal inference, even if the studies are nonexperimental
and of the kind we shall discuss—somewhat pessimistically—in chapter 7. Campbell
and Boruch (1975), on the other hand, maintain that causal inference is problematic
with nonexperimental and single-wave quasi-experiments because of the
many threats to internal validity that remain unexamined or have to be ruled out
by fiat rather than through direct design or measurement. The issue involves estimating
how to balance timeliness and the quality of causal inference, whether the
costs of being wrong in one's causal inference are not greater than the costs of
being late with the results.
Consider two cases of timely research aimed at answering causal questions
which used manifestly inadequate experimental procedures. Head Start was evaluated
by Ohio-Westinghouse (Cicirelli, 1969) in a design with only one-wave
measurement of academic achievement. The conclusion—Head Start was harmful.
Analysis of the same data using different statistical models appeared to corroborate
this conclusion (Barnow, 1973); to reverse it completely, making Head Start appear
helpful (Magidson, 1977); or to result in no-difference findings (Bentler and Woodward,
1978). Since we do not know the effects of Head Start, any timely decisions
based on the data would have been premature and perhaps harmful. The second
example worth citing is the Coleman Report (Coleman et al., 1966). In this large-scale,
one-wave study it was concluded that black children gained more in achievement
the higher the percentage of white children in their classes. This finding was
used to justify school desegregation. However, better designed subsequent research
has shown that if blacks gain at all because of desegregation (which is not clear),
they gain much less than was originally claimed. It is important, we feel, not to
underestimate the costs of producing timely results about cause, particularly its
direction, which turn out to be wrong. Clearly, the chances of being wrong about
cause are higher the more one deviates from an experimental model and conducts
nonexperimental research using primitive one-wave quasi-experiments.
Because timeliness is important in policy research—though less so for basic
researchers for whom this book is also intended—we shall devote some of the next
chapter to quasi-experimental designs that do not require pretests and to ways in
which archives can be used for rigorous and timely causal analysis. In the end,
however, each investigator has to try to design research which maximizes all kinds
of validity and, if he or she decides to place a primacy on internal validity, this
should not be allowed to trivialize the research.
We have not tried to place internal validity above other forms of validity.
Rather, we wanted to outline the issues. In a sense, by writing a book about
experimentation in field settings, we are assuming that readers already believe that
internal validity is of great importance, for the raison d'etre of experiments is to
facilitate causal inference. Other forms of knowledge about the social world are
more accurately or more efficiently gained through other means—e.g., surveys or
participant observation. Our aim differs, therefore, from that of the last critics we
discussed. They argue that experimentation is not necessary for causal inference or
that it is harmful to the pursuit of knowledge which will be useful in policy formulation.
We assume that readers believe causal inference is important and that
experimentation is one of the most useful, if not the most useful, ways of gaining
knowledge about cause.
SOME OBJECTIONS TO OUR TENTATIVE PHILOSOPHY
OF THE SOCIAL SCIENCES
Protests against "scientism" have been prominent in recent commentaries on the
theory of conducting social science. Such protests focus on inappropriate and
blind efforts to apply "the scientific method" to the social sciences. Critics argue
that quantification, random assignment, control groups and the deliberate intrusion
of treatments—all techniques borrowed from the physical sciences—distort the
context in which social research takes place. Their protest against scientism is
often linked with the now-pervasive rejection of the logical positivist philosophy
of science and is frequently accompanied by a greater emphasis on humanistic and
qualitative research methods such as ethnography, participant observation, and
ethnomethodology. Critics also point to the irreducibly judgmental and subjective
components in all social science research and to the pretensions to scientific preci-
sion found in many current studies.
We agree with much of this criticism and have addressed the issue in our
previous work (Campbell, 1966, 1974, 1975; Cook, 1974a; Cook and Cook,
1977; Cook and Gruder, 1978; Cook and Reichardt, in press). However, some of
the critics of scientism (Guttentag, 1971, 1973; Weiss and Rein, 1970; Hultsch
and Hickey, 1978; Mitroff and Bonoma, 1978; Mitroff and Kilman, 1978; Cron-
bach, in preparation) have cited Campbell and Stanley (1966) and Cook and
Campbell (1976) as prime examples of the scientistic norm to which they object.
While the identification of our previous work with scientism oversimplifies and
blurs the issues, we acknowledge that in this volume, as in the past, we advocate
using the methods of experiments and quantitative science that are shared in part
with the physical sciences. We cannot here comment extensively on these criti-
cisms of our background assumptions, which go beyond criticisms of causation
issues alone. But we can indicate in broad terms the approach we would take in
responding to these objections.
First, we of course agree with the critics of logical positivism. The philosophy
was wrong in describing how physical science achieved its degree of validity,
which was not through descriptive best-fit theories and definitional operationalism.
Although the error did not have much impact on the practice of physics, its effect
on social science methods was disastrous. We join in the criticism of positivist
social science when positivist is used in this technical sense rather than as a synonym
for "science." We do not join critics when they advocate giving up the
search for objective, intersubjectively verifiable knowledge. Instead we advocate
substituting a critical-realist philosophy of science, which will help us understand
the success of the physical sciences and guide our efforts to achieve a more valid
social science. Critical realists (Mandelbaum, 1964) or "metaphysical realists"
(Popper, 1972), "structural realists" (Maxwell, 1972), or "logical realists"
(Northrop, 1959; Northrop and Livingston, 1964) are among the most vigorous
modern critics of logical positivism. Critical realists particularly concerned with
the social sciences identify their position with Marx's materialist criticism of idealism
and positivism (e.g., Bhaskar, 1975, 1978; Keat and Urry, 1975).
Second, it is generally agreed that the social disciplines, pure or applied, are
not truly successful as sciences. In fact, they may never have the predictive and
explanatory power of the physical sciences—a pessimistic conclusion that merits
serious debate (Herskovits, 1972; Campbell, 1972). This book, with its many
categories of threats to validity and its general tone of modesty and caution in
making causal inferences, supports such pessimism and underscores the equivocal
nature of our conclusions. However, it is sometimes forgotten that these threats
are not limited to quantitative or deliberately experimental studies. They also arise
in less formal, more commonsense, humanistic, global, contextual, integrative, and
qualitative approaches to knowledge. Even the "regression artifacts," identified
with measurement error, are an observational-inferential illusion that occurs in
ordinary cognition (see Tversky and Kahneman, 1974, and Fischhoff, 1975).
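The regression artifact just mentioned is easy to demonstrate. The following is a minimal simulation sketch, not drawn from the text itself; all numbers (a population with mean 100, measurement error, selection of the lowest-scoring tenth) are illustrative assumptions. With no treatment at all, a group selected for extreme pretest scores drifts back toward the mean at posttest, which naive inference would read as improvement.

```python
import random

# Illustrative assumption: observed score = true ability + measurement error.
random.seed(42)

def observed_score(true_ability):
    # Measurement error makes any single score an unreliable estimate.
    return true_ability + random.gauss(0, 10)

population = [random.gauss(100, 15) for _ in range(10000)]
pretest = [(t, observed_score(t)) for t in population]

# Select the lowest-scoring 10% on the pretest, as a remedial program might.
cutoff = sorted(s for _, s in pretest)[len(pretest) // 10]
selected = [(t, s) for t, s in pretest if s <= cutoff]

pre_mean = sum(s for _, s in selected) / len(selected)
# Posttest: same true abilities, fresh measurement error, no intervention.
post_mean = sum(observed_score(t) for t, _ in selected) / len(selected)

print(f"pretest mean of selected group:  {pre_mean:.1f}")
print(f"posttest mean of selected group: {post_mean:.1f}")
# The posttest mean rises purely because the group was selected on scores
# that partly reflected unusually negative measurement errors.
```

The apparent "gain" appears even though nothing was done to the group, which is exactly why one-group pretest-posttest designs with extreme-score selection invite spurious causal conclusions.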
We feel that those who advocate qualitative methods for social science
research are at their best when they expose the blindness and gullibility of spe-
cious quantitative studies. Field experimentation should always include qualitative
research to describe and illuminate the context and conditions under which
research is conducted. These efforts often may uncover important site-specific
threats to validity and contribute to valid explanations of experimental results in
general and of perplexing or unexpected outcomes in particular. We also believe,
along with many critics, that quantitative researchers in the past have used poorly
framed questions to generate quantitative scores and that these scores have then
been applied uncritically to a variety of situations. (Chapters 4 and 7, in particu-
lar, highlight some of the abuses associated with traditions of quantitative data
analysis which have probably led to many specious findings.) In uncritical quanti-
tative research, measurement has been viewed as an essential first step in the
research process, whereas in physics the routine measures are the products of past
crucial experiments and elegant theories, not the essential first steps. Also, the
definitional operationalism of logical positivists has supported the uncritical reifi-
cation of measures and has encouraged research practitioners to overlook the mea-
sures’ inevitable shortcomings and the consequences of these shortcomings. A
fundamental oversight of uncritical quantifiers has been to misinterpret quantifica-
tions as replacing rather than depending upon ordinary perception and judgment,
even though quantification at its best goes beyond these factors (Campbell, 1966,
1974, 1975). Experimental and quantitative social scientists have often used tests
of significance as though they were the sole and final proof of their conclusions.
From our perspective, tests of significance render implausible only one of the
many plausible threats to validity that are continually arising. Naive social quantifiers
continue to overlook the presumptive, qualitatively judgmental nature of all
science. In contrast, the tradition we represent, with its heavy use of the word
"plausible," stresses that the scientist must continually judge whether a given
rival hypothesis will explain the data. Qualitative contextual information (as well
as quantitative evidence on tangential variables) has long been recognized as relevant
to such judgments.
Valid as these criticisms are, it is not enough merely to point out the limita-
tions of our methods. Critics should be able to offer viable alternatives. To be
superior to the techniques described in the next six chapters, however, the pro-
posed qualitative methods would have to eliminate more of the threats to validity
listed in this chapter than do the quantitative methods. In this regard, it is refreshing
to note that our humanistic colleague, Howard Becker (1978), has tried to rule
out some of the validity threats in research which uses photographs either as evidence
or as the means of presenting final research results, no quantification having
intervened. Others conducting qualitative research under nonlaboratory conditions
also recognize the equivocal nature of any causal inferences drawn from their
observations. Many sociologists, anthropologists and historians have attempted to
avoid causal explanations, aiming instead for uninterpreted description. Yet care-
ful linguistic analysis of their reports shows that they are rarely successful. Their