Lecture Notes on Validity - Advanced Practicum in Clinical Psychology | PSY 394Q, Assignments of Psychology

Material Type: Assignment; Class: 4-ADV PRACTICUM IN CLIN PSYCH; Subject: Psychology; University: University of Texas - Austin; Term: Unknown 1989;


Validity

We shall use the concepts validity and invalidity to refer to the best available approximation to the truth or falsity of propositions, including propositions about cause. In keeping with the discussion in chapter 1, we should always use the modifier "approximately" when referring to validity, since one can never know what is true. At best, one can know what has not yet been ruled out as false. Hence, when we use the terms valid and invalid in the rest of this book, they should always be understood to be prefaced by the modifiers "approximately" or "tentatively."

One could invoke many types of validity when trying to develop a framework in which to understand experiments in complex field settings. Campbell and Stanley (1963) invoked two, which they called "internal" and "external" validity. Internal validity refers to the approximate validity with which we infer that a relationship between two variables is causal or that the absence of a relationship implies the absence of cause. External validity refers to the approximate validity with which we can infer that the presumed causal relationship can be generalized to and across alternate measures of the cause and effect and across different types of persons, settings, and times.

For convenience, we shall further subdivide the two validity types of Campbell and Stanley. Covariation is a necessary condition for inferring cause, and practicing scientists begin by asking of their data: "Are the presumed independent and dependent variables related?" Therefore, it is useful to consider the particular reasons why we can draw false conclusions about covariation. We shall call these reasons (which are threats to valid inference-making) threats to statistical conclusion validity, for conclusions about covariation are made on the basis of statistical evidence. (This type of validity was listed by Campbell [1969] as a threat to internal validity. It was called "instability" and was concerned with drawing false conclusions about population covariation from unstable sample data. We shall later consider "instability" as one of the major threats to statistical conclusion validity.)

If a decision is made on the basis of sample data that two variables are related, then the practicing researcher's next question is likely to be: "Is there a causal relationship from variable A to variable B, where A and B are manipulated or measured variables (operations) rather than the theoretical or otherwise generalized constructs they are meant to represent?" To answer this question, the researcher has to rule out a variety of other reasons for the relationship, including the threat that B causes A and the threat that C causes both A and B. The first of these threats is usually handled easily in experiments, as we shall see later. The latter is not so easily dealt with, especially in quasi-experiments. Much of the researcher's task involves self-consciously thinking through and testing the plausibility of noncausal reasons why the two variables might be related and why "change" might have been observed in the dependent variable even in the absence of any explicit treatment of theoretical or practical significance.
We shall use the term internal validity to refer to the validity with which statements can be made about whether there is a causal relationship from one variable to another in the form in which the variables were manipulated or measured. Internal validity has nothing to do with the abstract labeling of a presumed cause or effect; rather, it deals with the relationship between the research operations irrespective of what they theoretically represent.

However, researchers would like to be able to give their presumed cause and effect operations names which refer to theoretical constructs. The need for this is most explicit in theory-testing research where the operations are explicitly derived to represent theoretical notions. But applied researchers also like to give generalized abstract names to their variables, for it is hardly useful to assume that the relationship between the two variables is causal if one cannot summarize these variables other than by describing them in exhaustive operational detail. Whether one wants to test theory about the effects of "dissonance" on "attitude change," or is interested in policy issues relating to "school desegregation" and "academic achievement," one wants to be able to make generalizations about higher-order terms that have a referent in explicit theory or everyday abstract language. Following the lead of Cronbach and Meehl (1955) in the area of measurement, we shall use the term construct validity of causes or effects to refer to the approximate validity with which we can make generalizations about higher-order constructs from research operations. Extending their usage, we shall use the term to refer to manipulated independent variables as well as measured traits. We shall base inferences about constructs more on the fit between operations and conceptual definitions than on the fit between obtained data patterns and theoretical predictions about such data patterns—more on what Campbell (1960) called trait validity than what Cronbach and Meehl (1955) termed nomological validity. We shall not ignore nomological validity, however.

The construct validity of causes and effects was listed by Campbell and Stanley (1963) under the heading of "external validity," and it is what experimentalists mean when they refer to inadvertent "confounding." (That is, was the effect due to the planned variable X, or was X confounded with experimenter expectancies or a Hawthorne effect, or was X a "negative incentive" rather than dissonance arousal?)

[Footnote: Confounding is sometimes done deliberately in more complex experimental designs, e.g., Latin squares or incomplete lattice designs. Such deliberate confounding is meant to achieve efficiency at the cost of reduced interpretability for some carefully chosen interactions that are considered implausible or that are of little theoretical or practical significance.]
As such, construct validity has to do with generalization, and in particular with the question: "Can I generalize from this one operation or set of operations to a referent construct?" Given this grounding in the need to generalize, it is not difficult to see why Campbell and Stanley linked generalizing to abstract constructs with generalizing to (and across) populations of persons, settings, and historical moments. Just as one gains more information by knowing that a causal relationship is probably not limited to particular operational representations of a cause and effect, so one gains by knowing that the relationship (1) is not limited to a particular idiosyncratic sample of persons or settings of a given type, and (2) is not limited to a particular population of Xs but also holds with populations of Ys and Zs. We shall use the term external validity to refer to the approximate validity with which conclusions are drawn about the generalizability of a causal relationship to and across populations of persons, settings, and times.

Our justification for restricting the discussion of validity to these four types is practical only, based on their apparent correspondence to four major decision questions that the practicing researcher faces. These are: (1) Is there a relationship between the two variables? (2) Given that there is a relationship, is it plausibly causal from one operational variable to the other or would the same relationship have been obtained in the absence of any treatment of any kind? (3) Given that the relationship is plausibly causal and is reasonably known to be from one variable to another, what are the particular cause and effect constructs involved in the relationship? and (4) Given that there is probably a causal relationship from construct A to construct B, how generalizable is this relationship across persons, settings, and times? As stated previously, each of these decision questions was implicit in Campbell and Stanley's explication of validity, with the present statistical conclusion and internal validities being part of internal validity and with the present construct and external validities being part of external validity. All we have done here is to subdivide each validity type and try to make the differences among types explicit. We want to stress that our approach is entirely practical, being derived from our belief that practicing researchers need to answer each of the above questions in their work. There are no totally compelling logical reasons for the […]

STATISTICAL CONCLUSION VALIDITY

Introduction

In evaluating any experiment, three decisions about covariation have to be made with the sample data on hand: (1) Is the study sensitive enough to permit reasonable statements about covariation? (2) If it is sensitive enough, is there any reasonable evidence from which to infer that the presumed cause and effect covary? and (3) If there is such evidence, how strongly do the two variables covary? The first of these issues concerns statistical power. It is vital in reporting and planning experiments to analyze how much power one has to detect an […]

[…] those that were actually obtained in the study, then one can compute with a known level of confidence whether a specified point standard has been exceeded with the data on hand. This is clearly a desirable situation for any data analyst.

Let us illustrate the above points by describing a section from a report on the effects of manpower training programs on subsequent earnings. Ashenfelter (1974) knew that training costs were about $1,800 for each trainee. He estimated that a return of at least 10% on this investment (i.e., $200) would be adequate for declaring the manpower training program a "success." Then, assuming equal numbers of persons in the experimental training group and the no-training control group, and knowing from previous data that the standard deviation in income was about $2,000 for white males, Ashenfelter calculated that about 1,600 persons would be needed in the experiment if a true effect of at least $200 was to be detected with 95% certainty. However, Ashenfelter further calculated that if he were to break down the data by two race and two sex groups, he would need a total of 6,400 respondents—1,600 in each subgroup. Knowing this, he was then in a position to assess whether he had the necessary resources to design an experiment of this size or whether he would be better served by using some other technique for trying to evaluate the training program.
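As a rough illustration of the arithmetic behind such figures, the sketch below applies the standard normal-approximation formula for a two-sample comparison of means. The alpha and power conventions are our assumptions, not stated in the text, so the results only bracket the quoted figure of about 1,600.

```python
# A minimal sketch of a prospective power calculation of the kind described
# above, using the normal approximation for a two-sample comparison of means.
# The alpha/power conventions are assumptions; the text does not state which
# ones Ashenfelter used, so these results only bracket his ~1,600 figure.
from scipy.stats import norm

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Sample size per group to detect a true mean difference `delta`
    given outcome standard deviation `sigma` (two-sided z-test)."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the test
    z_beta = norm.ppf(power)            # quantile for the desired power
    return 2 * ((z_alpha + z_beta) * sigma / delta) ** 2

# The inputs from the illustration: a $200 effect against a $2,000 SD.
print(n_per_group(delta=200, sigma=2000))              # ~1,570 per group at 80% power
print(n_per_group(delta=200, sigma=2000, power=0.95))  # ~2,600 per group at 95% power
```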
Unfortunately, it is rare to have valid variance estimates and a prior point estimate of the size of an expected effect. The problem of specifying expected effect sizes is sometimes political, largely because a publicized point estimate can become a reference against which a social innovation is evaluated. As a result, even if an innovation has had some ameliorative effect, it may not be given much credit if it failed to have the promised effect. It is no small wonder, therefore, that the managers of programs prefer less specific statements such as "We want to increase achievement," or "We want to reduce inflation," to statements such as "We want to increase achievement by two years for every year of teaching" or "We want to reduce inflation to 5% a year." The problem of specifying magnitudes is also sometimes one of "consciousness," for the issue may simply not be considered in designing the research. Alternatively, it may be silently considered by some persons but never brought to the level of discussion for fear that different parties to the research may disagree on the level of effect required to conclude that a treatment has made a significant, practical difference.

Situation 3

Even when no magnitude-of-effect estimate is available, it is still possible to use information about sample sizes and variances in order to calculate retrospectively the size of any effect that could have been detected in a particular experiment with, say, 95% confidence. This magnitude can then be inspected and interpreted. At times it will seem so unreasonably large that the only responsible conclusion is that the experiment was not powerful enough to have detected a true effect. For instance, in the Ashenfelter case a sample size of 400, split equally between experimentals and controls, would have required the experimentals to earn considerably more than $200 on the average if a true effect were to be detected at the 5% level. How reasonable is it to expect an average increase in earnings over $200 in the first working year after graduating from a job-training program? The answer to this cannot be definitive since no criteria exist for assessing reasonableness. Nonetheless, the figure seems to us to be very high. We would strongly advise anyone whose research results in a no-difference conclusion to conduct the retrospective analysis indicated above.
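The retrospective calculation described under Situation 3 simply inverts the same formula. The sketch below, under the same assumed alpha and power conventions as before, shows why a 400-person experiment could only have detected an effect well above $200.

```python
# A sketch of the retrospective ("minimum detectable effect") calculation,
# inverting the normal-approximation formula used in the previous sketch.
# The 80% power level is an assumed convention; "detected at the 5% level"
# in the text leaves the power criterion open.
from math import sqrt
from scipy.stats import norm

def min_detectable_effect(n_per_group, sigma, alpha=0.05, power=0.80):
    """Smallest true mean difference a two-group experiment of this size
    would detect with the stated power (two-sided z-test)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return (z_alpha + z_beta) * sigma * sqrt(2 / n_per_group)

# 400 respondents split equally, as in the illustration above:
print(min_detectable_effect(n_per_group=200, sigma=2000))  # ~$560, well above $200
```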
Situation 4

When data are first analyzed, it is often the case that the estimate of the treatment effect (say, a difference between sample means) is statistically nonsignificant but in the expected direction. Typically, efforts are then made to "reduce the error term" used for testing the treatment effect—a topic that we shall now discuss.

Obviously, it is desirable to design the research initially so as to minimize this error. For instance, "Student" (1931) reexamined an experiment which compared how four months of free pasteurized milk affected the height and weight of Scottish school children when compared to four months of raw milk. About 5,000 students received each type of milk. "Student" maintained that the same statistical power (and much lower financial costs) would have resulted had only 50 sets of identical twins been used. This is because weight and height are highly correlated for monozygotic twins, leading to lower error terms than those associated with differences between nonrelated children. In light of modern knowledge, we might not want to design the study in the way "Student" suggested because of nonstatistical considerations. For example, would parents seek to supplement one of their twins' diets if they knew that the other was receiving a school-provided supplement, and how generalizable would findings from 50 sets of twins be? Nonetheless, "Student's" point is important, and it suggests designing research wherever possible so as to have small error terms, provided that the means of reducing the error do not trivialize the research.

Perhaps the best way of reducing the error due to differences between persons is to match before random assignment to treatments. (This, we shall soon see, is quite different from matching as a substitute for randomization. While matching prior to randomization can increase statistical conclusion validity and permit tests to discover in which particular subgroups a treatment effect is obtained, matching as an alternative to randomization often leads to statistical regression artifacts that can masquerade as treatment effects.) The best matching variables are those that are most highly correlated with posttest scores. Normally the pretest is the best single matching variable since it is a proxy for all the social and biological forces that make some individuals or aggregates of individuals different from others. The actual process of matching is simple. One takes all the scores on the matching variable, ranks them, and places them into blocks whose size corresponds to the number of experimental groups (say, three). Then, the three persons in the first block are randomly assigned to one of the experimental groups, the next three in the next block are randomly assigned, and so on until all the cases are assigned. The data that result from such a design can then be analyzed as coming from a Levels x Treatment design. The same logic basically holds when matching takes place on multiple variables, but the problem of finding matches is harder. (Matching will be discussed in greater detail in several of the chapters to come.)
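A minimal sketch of the blocking procedure just described, assuming a single pretest as the matching variable; the function name and the scores are illustrative, not from the text.

```python
# Minimal sketch of matching-then-randomizing as described above: rank units
# on the matching variable (e.g., a pretest), slice them into blocks the size
# of the number of treatment groups, then randomly assign within each block.
import random

def blocked_assignment(pretest_scores, n_groups=3, seed=0):
    rng = random.Random(seed)
    # Rank the units by the matching variable.
    ranked = sorted(range(len(pretest_scores)), key=lambda i: pretest_scores[i])
    assignment = {}
    # Walk through consecutive blocks of size n_groups.
    for start in range(0, len(ranked), n_groups):
        block = ranked[start:start + n_groups]
        groups = list(range(n_groups))
        rng.shuffle(groups)  # random assignment within the block
        for unit, group in zip(block, groups):
            assignment[unit] = group
    return assignment  # unit index -> treatment group

scores = [52, 87, 63, 91, 45, 70, 58, 77, 66]  # nine units, three groups
print(blocked_assignment(scores))
```

The resulting data can then be analyzed as a Levels x Treatment design, with block as a factor.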
Given random assignment to treatment conditions, it is nonetheless possible to match retrospectively after all the data have been collected. With large sample sizes, retrospective matching will result in treatment groups that have comparable proportions of units with the characteristic on which matching takes place. However, the major disadvantage of this technique compared to prospective matching is that subgroups with few members (e.g., blacks in many settings) can be disproportionately represented in each treatment group, with very few persons in one of the groups. This makes it difficult to estimate treatment effects for the subgroups in question. But this problem aside, ex post facto blocking can be extremely useful both because effects of the blocking variable can be removed from the error term and because the interaction of the blocking variable with the treatment can be assessed.

When there is no interest in testing how the dependent variables are related to the matching or blocking variable, an alternative method of reducing the error term can be used that loses fewer degrees of freedom. It requires using multiple regression strategies involving variables which are correlated with the dependent variable within treatment groups and whose effects are to be partialled out of the dependent variable. Covariance analysis is one such strategy. The extent to which covariance analysis reduces error depends on the correlation between the covariates (the lower the better) and the correlation of each of them with the dependent variable (the higher the better). But two words of caution are required about such multiple regression adjustments. First, important statistical assumptions have to be demonstrably met for the results of the analysis to be meaningful, especially the assumption of homogeneous regression within treatment groups. Second, in experiments with noncomparable groups, the analysis will reduce error but will rarely adjust away all the initial differences between groups. Thus, the function of reducing error—which makes covariance so useful with both randomized experiments and quasi-experiments—should not be confused with the purported function of making groups equivalent. Equivalence is not needed with randomized experiments and is rarely achieved by regression adjustments with quasi-experiments. (For an extended discussion of these last points see chapters 3 and 4.)
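As a hedged illustration of covariance analysis as an error-reduction strategy (a sketch with simulated data, not the authors' own analysis), the following uses the statsmodels formula interface to show the standard error of the treatment effect shrinking once a pretest covariate is partialled out.

```python
# Sketch: covariance analysis reduces the error term against which a
# treatment effect is tested. Variable names and data are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
pretest = rng.normal(50, 10, n)
treat = rng.integers(0, 2, n)                        # randomized assignment
posttest = pretest + 5 * treat + rng.normal(0, 5, n)
df = pd.DataFrame({"pre": pretest, "treat": treat, "post": posttest})

# Without the covariate, pretest variation inflates the error term.
print(smf.ols("post ~ C(treat)", data=df).fit().bse["C(treat)[T.1]"])
# Partialling out the pretest shrinks the standard error of the effect.
print(smf.ols("post ~ C(treat) + pre", data=df).fit().bse["C(treat)[T.1]"])
```

The homogeneous-regression assumption noted above can be probed in the same framework by adding a treatment-by-pretest interaction term and testing it.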
Both matching and multiple regression adjustments assume that measures have been made of the variables for which adjustments are to be made. Failure to measure them means that error terms cannot be reduced to reflect the way that person or setting variables are related to the major outcome measures of an experiment. Increasing one's confidence in accepting the null hypothesis demands valid measurement of the variables that are most likely to affect posttest scores.

There is little point in reducing the error variance due to differences among persons and settings if the outcome measures are so unreliable that they cannot register true change. Thus, the experimenter has to be certain to begin with reliable measures. Alternatively, the experimenter has to try and develop even more reliable measures after an experiment is under way by adding items to tests, by rescaling, or by aggregating data. But whether or not attempts are undertaken to increase reliability, it is important that internal consistency estimates and test-retest correlations be displayed in a research report. The reader can at least judge for himself the extent to which measures may have been capable of registering true changes.

Statistical procedures exist for correcting for unreliable measurement. This means that analysis of "true" scores should be possible. (Details of this procedure are available in many standard texts, including McNemar, 1975.) These correctional procedures can often be misleading in practice. First, there are many ways of conceptualizing reliability, each of which implies a different reliability measure and different numerical estimates of the amount of reliability. Second, for any one kind of reliability, its own reliability will not be directly known. And third, reliability-adjusted values do not logically correspond with the values that would have been obtained had there been perfectly reliable measurement. This is perhaps most dramatically illustrated by reliability-adjusted correlations in excess of 1.00, or by the fact that a nonsignificant r of, say, -.10 must inevitably result in a higher adjusted value of the same sign, whereas the population correlation may have been +.04. Great caution must be exercised, therefore, in the use of reliability adjustments. It would be naive to present the results only for adjusted data or, when adjusted results are presented, to use only one estimate of reliability. A range would be better.

Each of the foregoing strategies can reduce error terms. Consequently, it is advisable for purposes of statistical conclusion validity to use as many as possible of the following design features. (1) Each person might be his own control (i.e., serve in more than one experimental group); (2) samples might be selected that are as homogeneous as possible (monozygotic twins are merely the extreme of this); (3) pretest measures should be collected on the same scales that are used for measuring effect; (4) matching might take place, before or after randomization, on variables that are correlated with the posttest; (5) the effects of other variables that are correlated with the posttest might be covaried out; (6) the reliability of dependent variable measures might be increased; or (7) occasionally the raw scores might be adjusted for unreliability. In addition, (8) estimates of the desired magnitude of effect should be elicited, where possible, before the research begins. Even when no such estimate can be determined, (9) the absolute magnitude of a treatment effect should be presented so that readers can infer for themselves whether a statistically reliable effect is so small as to be practically insignificant or whether a nonreliable effect is so large as to merit further research with more powerful statistical analyses. It should not be forgotten that all of these strategies have negative consequences if uncritically used and that all of them require trade-offs that will become more obvious later. Moreover, most of them are more problematic when analyzing data from quasi-experiments than data from randomized experiments.
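Strategy (7), adjusting scores for unreliability, usually means the classical correction for attenuation. The sketch below uses invented numbers to show both a routine adjustment and the pathological case noted earlier, a "corrected" correlation above 1.00.

```python
# Sketch of the classical correction for attenuation:
#   r_true = r_xy / sqrt(r_xx * r_yy)
# where r_xx and r_yy are reliability estimates. Numbers are invented to
# illustrate the caveat discussed above: with low reliabilities the
# "corrected" value can exceed 1.00.
from math import sqrt

def disattenuate(r_xy, r_xx, r_yy):
    return r_xy / sqrt(r_xx * r_yy)

print(disattenuate(0.30, 0.80, 0.75))  # ~0.39, a modest adjustment
print(disattenuate(0.45, 0.40, 0.45))  # ~1.06, an impossible "correlation"
```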
Situation 5

Having tried to make the error term as small as possible, the researcher will encounter a problem if the data analysis still fails to result in statistically significant effects. All one can then conclude is that this particular example of this particular treatment contrast had no observable effects. One cannot easily draw conclusions about what would have happened if each treatment had been more homogeneously implemented (i.e., each person or unit in a group had received exactly the same amounts of the treatment) or if the experimental contrast had been larger (i.e., the mean difference between groups had been greater on some measure designed to assess the strength of the treatment implementation). As we shall see later, quasi-experimental analyses can sometimes be conducted to assess these two possibilities by capitalizing upon the fact that measures of treatment implementation can be made which estimate presumed differences in the strength of the treatment. Such differences can then be associated with estimates of the magnitude of changes between a pretest and posttest in order to determine if the two are related. While such analyses should definitely be conducted, chapters 3 and 4 will illustrate that great care must be exercised in interpreting the results. This is because individuals will normally have voluntarily chosen to expose themselves to treatments in different amounts, and so the kind of person at one treatment level is likely to be different from a person at another level. Nonetheless, if sophisticated quasi-experimental analyses of the kind in chapters 3 and 4 still fail to result in covariation between the treatment and outcome measures, then the analyst can be all the more confident in accepting the null hypothesis.

INTERNAL VALIDITY

Introduction

Once it has been established that two variables covary, the problem is to decide whether there is any causal relationship between the two and, if there is, to decide whether the direction of causality is from the measured or manipulated A to the measured B, or vice versa.

The task of ascertaining the direction of causality usually depends on knowledge of a time sequence. Such knowledge is usually available for experiments, as opposed to most passive observational studies. In a randomized experiment, the researcher knows that the measurement of possible outcomes takes place after the treatment has been manipulated. In quasi-experiments, most of which require both pretest and posttest measurement, the researcher can relate some measure of pretest-posttest change to differences in treatments.

It is more difficult to assess the possibility that A and B may be related only through some third variable (C). If they were, the causal relationship would have to be described as: A → C → B. This is quite different from the model A → B, which most clearly implies that A causes B. To conclude that A causes B when in fact the model A → C → B is true would be to draw a false positive conclusion about cause. Accounting for third-variable alternative interpretations of presumed A-B relationships is the essence of internal validity and is the major focus of this book.

Although in the examples that follow we shall deal primarily with the possibility of false positive findings, it should not be forgotten that third variables can also threaten internal validity by leading to false negatives. The latter occur whenever relationship signs are as below.

[Figure: three path diagrams in which a direct path from A to B is accompanied by an indirect path through C of opposite net sign: (left) A → B positive, A → C positive, C → B negative; (center) A → B positive, A → C negative, C → B positive; (right) A → B negative, A → C negative, C → B negative.]

In the case to the left, an increase in A causes an increase in both B and C, but the increase in C causes a decrease in B. Thus, the net effect of A and C on B would be to tend to obscure a true A → B relationship. In the case depicted in the center, an increase in A would cause an increase in B and a decrease in C, while a decrease in C would cause a decrease in B. Once again, the effects of A and C would tend to cancel each other out. In the final case, an increase in A would cause a decrease in both B and C, and the decrease in C would cause a countervailing increase in B.
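A toy simulation may make the cancellation in the second sign pattern concrete; the coefficients below are invented for illustration and are not from the text.

```python
# Toy simulation of the center sign pattern above: A raises B directly,
# A lowers C, and C raises B, so the indirect path through C suppresses
# the direct A -> B effect.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
A = rng.normal(size=n)
C = -1.0 * A + rng.normal(size=n)            # A -> C is negative
B = 1.0 * A + 1.0 * C + rng.normal(size=n)   # direct A -> B and C -> B positive

# The simple A-B correlation is near zero even though A truly affects B.
print(np.corrcoef(A, B)[0, 1])               # ~0: the two paths cancel
# Holding C fixed (partialling it out) recovers the direct effect.
resid_B = B - C * (np.cov(B, C)[0, 1] / np.cov(C))
resid_A = A - C * (np.cov(A, C)[0, 1] / np.cov(C))
print(np.corrcoef(resid_A, resid_B)[0, 1])   # clearly positive
```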
Let us give an example of the second of these three relationships. Imagine that A is tutoring and B is academic achievement. Imagine, further, that tutoring is given to the weaker students academically and is withheld from the stronger, this process of selection into the treatment being C. Since tutoring is negatively related to initial achievement, we have A → C. Being weaker, the students with tutoring would be expected to gain less over time than their fellow students for a number of reasons that have nothing to do with tutoring (e.g., slower rates of learning from other sources). Hence, C → B. Thus, if tutoring did raise achievement (A → B) but the children who received tutoring were expected to gain less from schooling anyway (that is, C → B), then the effects of tutoring and of lower expected growth rates would countervail. In the special case where the two forces were of equal magnitude, they would totally cancel each other out. In cases where one force was stronger than the other, the stronger cause would prevail but its effect would be weakened by the countervailing cause. We hope that our later examples, which emphasize internal validity threats and false positive findings, will not blind readers to the effects that such threats can have in leading to false negative findings because of the operation of suppressor variables.

It is possible for more than one internal validity threat to operate in a given situation. The net bias that the threats cause depends on whether they are similar or different in the direction of bias and on the magnitude of any bias they cause independently. Clearly, false causal inferences are more likely the more numerous and powerful the validity threats and the more homogeneous the direction of their effects. Though our discussion will be largely in terms of threats taken singly, this should not blind readers to the possibility that multiple internal validity threats can operate in cumulative or countervailing fashion in a single study.

Threats to Internal Validity

Bearing this brief introduction in mind, we want to define some specific threats to internal validity.

History

"History" is a threat when an observed effect might be due to an event which takes place between the pretest and the posttest, when this event is not the treatment of research interest. In much laboratory research the threat is controlled by insulating respondents from outside influences (e.g., in a quiet laboratory) or by choosing dependent variables that could not plausibly have been affected by outside forces (e.g., the learning of nonsense syllables). Unfortunately, these techniques are rarely available to the field researcher.

[…]

[…] operated, then the investigator has to conclude that a demonstrated relationship between two variables may or may not be causal. Sometimes the alternative interpretations may seem implausible enough to be ignored and the investigator will be inclined to dismiss them. They can be dismissed with a special degree of confidence when the alternative interpretations seem unlikely on the basis of findings from a research tradition with a large number of relevant and replicated findings. Invoking plausibility has its pitfalls, since it may often be difficult to obtain high inter-judge agreement about the plausibility of a particular alternative interpretation. Moreover, theory testers place great emphasis on testing theoretical predictions that seem so implausible that neither common sense nor other theories would make the same prediction. There is in this an implied confession that the "implausible" is sometimes true. Thus, "implausible" alternative interpretations should reduce, but not eliminate, our doubt about whether relationships are causal.
When respondents are randomly assigned to treatment groups, each group is similarly constituted on the average (no selection, maturation, or selection-maturation problems). Each experiences the same testing conditions and research instruments (no testing or instrumentation problems). No deliberate selection is made of high and low scorers on any tests except under conditions where respondents are first matched according to, say, pretest scores and are then randomly assigned to treatment conditions (no statistical regression problem). Each group experiences the same global pattern of history (no history problem). And if there are treatment-related differences in who drops out of the experiment, this is interpretable as a consequence of the treatment. Thus, randomization takes care of many threats to internal validity.

With quasi-experimental groups, the situation is quite different. Instead of relying on randomization to rule out most internal validity threats, the investigator has to make all the threats explicit and then rule them out one by one. His task is, therefore, more laborious. It is also less enviable since his final causal inference will not be as strong as if he had conducted a randomized experiment. The principal reason for choosing to conduct randomized experiments over other types of research design is that they make causal inference easier.

Threats to Internal Validity That Randomization Does Not Rule Out

Though randomization conveniently rules out many threats to internal validity, it does not rule out all of them. In particular, imitation of treatments, compensatory equalization, compensatory rivalry, and demoralization in groups receiving less desirable treatments can each threaten internal validity even when randomization has been successfully implemented and maintained over time. Some of these threats will usually cause spurious differences (e.g., demoralization in the controls). However, other threats will tend to obscure true differences, especially by making no-treatment control groups perform atypically. This last happens with the imitation of treatments, compensatory equalization, and compensatory rivalry. We want to make clear that, while randomized experiments are superior to quasi-experiments with respect to internal validity, they are not perfect.

Most of the threats that randomization does not rule out result from the focused inequities that inevitably accompany experimentation because some people receive one treatment and others receive different treatments or no treatment at all. Since much social experimentation is ameliorative, treatments have to differ in desirability by virtue of the very research problem (e.g., the different payment levels in a compensatory education or an income supplement program, or the different amounts of time that can be spent away from cell-block confinement in a prison experiment on "rehabilitation"). Obviously, individual respondents want to receive the more desirable treatments. In the same vein, officials want to avoid salient inequities which can lead to charges that they favored some respondents over others in distributing treatments.

It is rare in our society to have valuable resources distributed on a random basis. Instead, we expect them to be distributed according to need, merit, seniority, or on a "first come, first served" basis.
The point is that distributing resources by lottery violates the meritocratic and/or social responsibility norms which regulate and justify most differences in rewards and opportunities in the United States. This is not to say that lotteries are never used in resource distribution. They seem to be convenient, for instance, in distributing sudden "windfalls" or universally undesired resources (e.g., a lottery was used for inducting young men into the U.S. armed services after 1969). Nonetheless, distribution by merit or need is more common than distribution by chance, and the latter often violates expectations about what is "just." It is this which leads to randomization exacerbating some internal validity threats.

The extent of an administrator's apprehension about randomization probably depends on four subjective estimates: (1) the differences in desirability between treatments; (2) the probability that individuals will learn of treatment differences; (3) the probability that organized constituencies will learn of these differences; and (4) how much the various constituencies will feel that their interests are affected by the most likely research outcomes. Some research questions make it difficult to rule out all of an administrator's apprehensions since, first, they absolutely require treatments that differ in desirability (e.g., what is the effect of extra payments to schools?). Second, scarce research resources require geographical contiguity (e.g., we can only do the study in one school district). Third, it seems to be part of an administrator's job to consider how various constituencies might react to focused inequities and to fear the worst (e.g., what will the teachers' union or the PTA think if resources are distributed by chance instead of by need or merit?). And fourth, administrators know that constituency representatives want to get the best possible advantages for their charges and want to avoid any potential harm to them (e.g., a teachers' union official might think: If performance contracting works in schools, then the role of the classroom teacher could be reduced in scope and importance—do we want that?). Such considerations highlight both the difficulties of gaining permission to randomize and of ruling out the threat of compensatory equalization when randomization has taken place.

The only other internal validity threat that can operate in a randomized experiment is differential mortality from the treatment groups. While such differences can be interpreted as a consequence of the treatment—and as a result will often be very important—they have the undesirable side effect of obscuring the interpretation of other results. This is because the units remaining in one treatment group may not be comparable on the average to those in another group. Thus, if there were differential attrition from, say, an experiment on the effect of income supplements on the motivation to find work, we would not be sure if a relationship between the dollar value of a supplement and the number of days worked was due to the supplement reducing the number of days worked or to selection differences associated with the kinds of persons who remained in each treatment condition for the entire experiment. Treatment-correlated attrition leads to the possibility of a selection confound. We might readily surmise that such attrition is all the more likely the more the treatments differ in desirability.
With the exception of differential mortality and the selection problems that follow from it, the threats to internal validity which random assignment does not rule out are caused by atypical behavior on the part of persons in no-treatment control groups or groups that receive less desirable treatments. Such behavior represents an unplanned but nonetheless causal consequence of the planned experimental contrast.

Even when there is a valid causal relationship at the operational level, one may wonder how differences in B can be interpreted as the result of threats to internal validity. Internal validity is, after all, concerned with threats that cast doubt on whether there is a valid causal connection, and the threats we are discussing do not deny the validity of a causal connection. The answer is in one sense simple. Internal validity refers to doubt about whether there is a causal connection from A-as-manipulated (or measured) to B-as-measured; on the other hand, the threats to internal validity which we are discussing (e.g., resentful demoralization of the controls) cast doubt on whether the causal connection is from A to B or is from A's comparison group to B. (In another sense, this issue is academic, for causal inference always depends on the contrast between A and A's comparison, irrespective of whether A or the comparison causes the observed changes in the dependent variable. Given our emphasis on the desirability of identifying active causal agents, it is important to identify whether A or its comparison accounts for change, since knowing the active causal agent allows one to know what to manipulate. This is why we specify internal validity in terms of the pattern of influence from A to B rather than in terms of the pattern of influence from the contrast between A and its comparison to B.)

Assessing the Plausibility of Internal Validity Threats If a Randomized Experiment Has Been Implemented

The possibility of a selection artifact resulting from differential attrition can best be empirically assessed in two ways. First, an analysis is called for of the proportion of respondents, originally assigned to each experimental condition, who actually provide posttest data. Differences in this proportion across treatments indicate a differential dropout. Second, an analysis is called for of the pretest scores in each treatment group computed on the basis of all those who provided posttest data. This gives a preliminary indication of whether the dropouts differed across groups on the background characteristics that are most likely to affect posttest scores (i.e., those that are highly correlated with pretest scores on the same test). We will deal with these points in greater detail in chapter 8.
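The two checks just described amount to a few lines of analysis. A minimal sketch follows, with invented column names and data: completion rates by condition, and pretest means among those who provided posttest data.

```python
# Sketch of the two attrition checks above. Column names and data are
# illustrative assumptions, not from the text.
import pandas as pd

df = pd.DataFrame({
    "condition": ["treatment"] * 6 + ["control"] * 6,
    "pretest":   [48, 55, 60, 52, 58, 50, 47, 56, 61, 53, 59, 49],
    "posttest":  [70, None, 74, 68, None, 66, 62, 64, 69, 63, 65, 61],
})

completed = df["posttest"].notna()
# Check 1: proportion of originally assigned respondents providing posttest data.
print(df.assign(completed=completed).groupby("condition")["completed"].mean())
# Check 2: pretest means computed only over those who provided posttest data.
print(df[completed].groupby("condition")["pretest"].mean())
```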
An assessment of imitation, compensatory equalization, or compensatory rivalry can often be made by direct measures in the experimental and control groups of the process that the independent variable was meant to affect. Thus, if a treatment were meant to provide money to some schools but not others, the finances of both kinds of schools would need examining. If a treatment was expected to make experimental children view an educational television program, it would be necessary to measure how often they watch the show and how often the controls watch it. A small or nonexistent experimental contrast would suggest that imitation, compensatory equalization, or compensatory rivalry may have occurred. Thus, measures of the exact nature of the treatment in all treatment and control groups are absolutely vital in any experiment. The sooner such measurements are taken, the easier it will be to detect unexpected patterns of behavior in the experiment and control groups and the easier it will be to take corrective action.

It will normally be easy to use background information to find out if controls had contact with experimentals and copied them or to find out if administrators provided additional resources to some units from nonexperimental sources. It will normally not be as easy to assess whether compensatory rivalry took place, though direct measures of verbal expressions of such rivalry by the controls can give a lead, as can indications of whether control group performance is greater than would be expected. Saretsky (1972), it will be remembered, tried to determine this from the performance level in past years in the same classes; but he probably ran into a regression problem. Nonetheless, if used with care, the use of secondary data from past classes can be useful for attempting to assess the magnitude of any compensatory rivalry. Such data could also be useful for assessing resentful demoralization, because this threat leads to the testable prediction that performance should be atypically low in the control group during the experiment.

CONSTRUCT VALIDITY OF PUTATIVE CAUSES AND EFFECTS

Introduction

Construct validity is what experimental psychologists are concerned with when they worry about "confounding." This refers to the possibility that the operations which are meant to represent a particular cause or effect construct can be construed in terms of more than one construct, each of which is stated at the same level of reduction. Confounding means that what one investigator interprets as a causal relationship between theoretical constructs labeled A and B, another investigator might interpret as a causal relationship between constructs A and Y or between X and B or even between X and Y.

In the discussion that follows we shall restrict ourselves to the construct validity of presumed causes and effects, since these play an especially crucial role in experiments whose raison d'être is to test causal propositions. But it should be clearly noted that construct validity concerns are not limited to cause and effect constructs. All aspects of the research require naming samples in generalizable terms, including samples of people and settings as well as samples of measures or manipulations. Even with internal validity and statistical conclusion validity, inferences have to be made about abstract constructs: viz., "cause" and "reliable change" or "reliable differences."

The reference to the level of reduction in the definition of "confounding" is important, because it is always possible to "translate" sociological terms into psychological ones, or psychological terms into biological ones. For example, participative decision making could become conformity to membership group norms on one level, or some correlate of, say, the ascending reticular activating system on another. Each of these levels of reduction is useful in different ways and none is more legitimate than any other. But such "translations" from one level to another do not involve the confounding of rival explanations that is at issue here.

Before we continue our abstract characterization of construct validity, some concrete examples of well-known construct validity concerns may help.
In earlier medical experiments on drugs, the psychotherapeutic effect of the doctor's helpful concern was confounded with the chemical action of the pill. So, too, were the doctor's and the patient's belief that the pill should have helped. To circumvent these problems and to increase confidence that any observed effects could be attributed to the chemical action of the pill alone, the placebo control group and the double-blind experimental design were introduced. (The first of these involves giving a chemically inert substance to respondents, and the second requires that neither the person prescribing the pill nor the person evaluating its effects knows the experimental condition to which the patient has been assigned.)

In industrial relations research, the Hawthorne effect is another confound which causes uncertainty about how operations should be labeled. If we assume for the moment that productivity was increased in the original Hawthorne studies by the planned experimental intervention, the issue for construct validity purposes is: Was the increase due to shifts in illumination (the planned treatment), or to the demonstrated administrative concern over improved working conditions (the "Hawthorne effect"), or to telling the women how well they were doing their work (an inadvertent correlate of increasing the illumination)?

Construct validity concerns begin to surface at the planning and pilot-testing stages of an experiment when attempts are made to fit the anticipated cause and effect operations to their referent constructs, whether these are derived from formal social science theory or from policy considerations. Such "fitting" to the construct of interest is best achieved (1) by the careful preexperimental explication of constructs so that definitions are clear and in conformity with public understanding of the words being used, and (2) by data analyses directed at some of the four following points, preferably all of them.

First, a test should be made of the extent to which the independent variables alter what they are meant to alter. This is done by assessing whether the treatment manipulation is related to direct measures of the process designed to be affected by the treatment. (This is called "assessing the 'take' of the independent variable.") Second, a test should be conducted to assess whether an independent variable does not vary with measures of related but different constructs. For instance, a manipulation of "communicator expertise" should be correlated with reports from respondents about the communicator's level of knowledge, but it should not be correlated with attributions about cognate constructs, such as trustworthiness, congeniality, or power. If there are such correlations, it is difficult to differentiate effects due to expertise from those due to the other variables. Third, the proposed dependent variables should tap into the factors they are meant to measure. Normally, some form of inter-item correlation can suggest this. And fourth, the dependent variables should not be dominated by irrelevant factors that make them measures of more or less than was intended. Thus, the outcome construct, like the treatment construct, has to be differentiated from its particular cognates.
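The first two of these four analyses are, in effect, a convergent and a discriminant manipulation check. A minimal sketch with simulated ratings follows; the variable names and data are invented for illustration.

```python
# Sketch of the first two checks above: the "take" of the manipulation and
# its discriminant pattern against a cognate construct.
import numpy as np

rng = np.random.default_rng(2)
n = 300
expertise_manip = rng.integers(0, 2, n)  # low vs. high "communicator expertise"
# Respondent ratings: knowledge should track the manipulation; a cognate
# construct such as trustworthiness should not.
knowledge = 1.2 * expertise_manip + rng.normal(0, 1, n)
trustworthiness = rng.normal(0, 1, n)

print(np.corrcoef(expertise_manip, knowledge)[0, 1])        # sizable: the manipulation "took"
print(np.corrcoef(expertise_manip, trustworthiness)[0, 1])  # near zero: no cognate confound
```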
As we have detailed the procedure, assessing construct validity depends on two processes: first, testing for a convergence across different measures or manipulations of the same "thing" and, second, testing for a divergence between measures and manipulations of related but conceptually distinct "things." Our position should not be interpreted to imply that construct validity absolutely depends on having information about both convergences and divergences, for it is clearly desirable to have information about convergences even when nothing is known directly about divergences. Indeed, other discussions of construct validity have restricted themselves to convergences, even while noting that a close correspondence between different types of measures of the same thing is less meaningful if there are similar measurement irrelevancies associated with each measure, as when only paper-and-pencil or observational measures of the same construct are made—see Campbell and Tyler, 1957; Cronbach and Meehl, 1955; Cronbach, Gleser, Nanda, and Rajaratnam, 1972. However, as Campbell and Fiske (1959) suggest, a construct should be differentiated from related theoretical constructs as well as from methodological irrelevancies. (For an example of differentiation from other theoretical constructs in basic research, see Cook, Crosby and Hennigan, 1977; and for an example in applied research, see the differentiation of viewing "Sesame Street" from "being encouraged to view 'Sesame Street' by paid professionals," Cook et al., 1975.)

We can illustrate these points by considering a possible experiment on the effects of supervisory distance. Suppose we operationalized "supervision" as a foreman standing within comfortable speaking distance of workers (e.g., ten feet). This particular operationalization would exclude distances that were beyond speaking but not beyond seeing, and the treatment might be more exactly characterized as "supervision from speaking distances." It would be dangerous to generalize from such a specific treatment to the general "supervision" construct, especially if supervision has different consequences when it comes from shorter and longer distances. To lessen this possibility, it would be useful if supervisory distance were systematically varied by means of planned manipulations. That is not always possible. However, it would still be useful if supervision inadvertently varied across a wide range of distances because foremen differed in their behavior from day to day. Careful analysis of the effects of spontaneous variation in distance would then allow us to test whether we can generalize from one supervisory distance to another. If we can, we can generalize with greater confidence to the general construct of "supervision," whereas if we cannot, we would like to restrict our generalization to "supervision from ten feet or less."

The foremen might also differ from each other, or might themselves differ from day to day, in whether they supervise with a smile or in an officious manner. Neither the smile nor the officiousness would seem to be necessary components of most definitions of "supervision." Hence, the researcher might […]

Mono-Method Bias

To have more than one operational representation of a construct does not necessarily imply that all irrelevancies have been made heterogeneous.
Indeed, when all the manipulations are presented the same way, or all the measures use the same means of recording responses, then the method is itself an irrelevancy whose influence cannot be dissociated from the influence of the target construct. Thus, if all the experts in the previous hypothetical example had been presented to respondents in writing, it would not logically be possible to generalize to experts who are seen or heard. Thus it would be more accurate to label the treatment as "experts presented in writing." To cite another example, attitude scales are often presented to respondents without apparent thought to (a) using methods of recording other than paper-and-pencil, (b) varying whether the attitude statements are positively or negatively worded, or (c) varying whether the positive or negative end of the response scale appears on the right or left of the page. On these three points depends whether one can test if "personal private attitude" has been measured as opposed to "paper-and-pencil nonaccountable responses," or "acquiescence," or "response bias."

Hypothesis-Guessing Within Experimental Conditions

The internal validity threats called "resentful demoralization" and "compensatory rivalry" were assumed to result because persons who received less desirable treatments compared themselves to persons who received more desirable treatments, making it unclear whether treatment effects of any kind occurred in the treatment group. Reactive research may not only obscure the treatment effects, but also result in effects of diminished interpretability. This is especially true if it is suspected that persons in one treatment group compared themselves to persons in other groups and guessed how the experimenters expected them to behave. Indeed, in many situations it is not difficult to guess what the experimenters hope for, especially in education or industrial organizations. Hypothesis-guessing can occur without social comparison processes, as when respondents know only about their own treatment but persist in trying to discover what the experimenters want to learn from the research.

The problem of hypothesis-guessing can best be avoided by making hypotheses (if present) hard to guess, by decreasing the general level of reactivity in the experiment, or by deliberately giving different hypotheses to different respondents. But these solutions are at best partial, since respondents are not passive and can always generate their own treatment-related hypotheses which may or may not be the same as the experimenters'. Learning an hypothesis does not necessarily imply either the motivation or the ability to alter one's behavior because of the hypothesis. Despite the widespread discussion of treatment confounds that are presumed to result from wanting to give data that will please the researcher—which we suspect is a result of discussions of the Hawthorne effect—there is neither widespread evidence of the Hawthorne effect in field experiments (see reviews by D. Cook, 1967; Diamond, 1974), nor is there evidence of a similar orientation in laboratory contexts (Weber and Cook, 1972). However, we still lack a sophisticated and empirically corroborated theory of the conditions under which hypothesis-guessing (a) occurs, (b) is treatment specific, and (c) is translated into behavior that (d) could lead to erroneous conclusions about the nature of a treatment construct when (e) the research takes place in a field setting.
Evaluation Apprehension

Rosenberg (1969) has reviewed considerable evidence from laboratory experiments which indicates that respondents are apprehensive about being evaluated by persons who are experts in personality adjustment or the assessment of human skills. In such cases respondents attempt to present themselves to such persons as both competent and psychologically healthy. It is not clear how widespread such an orientation is in social science experiments in field settings, especially when treatments last a long time and populations do not especially value the way that social scientists or their sponsors evaluate them. Nonetheless, it is possible that some past treatment effects were due to respondents being willing to present themselves to experimenters in ways that would lead to a favorable personal evaluation. Being evaluated favorably by experimenters is rarely the target construct around which experiments are designed. It is a confound.

Experimenter Expectancies

There is some literature (Rosenthal, 1972) which indicates that an experimenter's expectancies can bias the data obtained. When this happens, it will not be clear whether the causal treatment is the treatment-as-labeled or the expectations of the persons who deliver the treatments to respondents. This threat can be decreased by employing experimenters who have no expectations or have false expectations, or by analyzing the data separately for persons who deliver the treatments and have different kinds or levels of expectancy. Experimenter expectancies are thus a special case of treatment-correlated irrelevancy, and they may well operate in some (but certainly not all) field settings.

Confounding Constructs and Levels of Constructs

Experiments can involve the manipulation of several discrete levels of an independent variable that is continuous. Thus, one might conclude from an experiment that A does not affect B when in fact A-at-level-one does not affect B, whereas A-at-level-four might well have affected B if A had been manipulated as far as level four. This threat is a problem when A and B are not linearly related along the whole continuum of A; and it is especially prevalent, we assume, when treatments have only a weak impact. If they do, because low levels of A are manipulated, and if conclusions are drawn about A without any qualifications concerning the strength of the manipulation, then misleading negative conclusions can be drawn. The best control for this threat is to conduct parametric research in which many levels of A are varied and many levels of B are measured.

Interaction of Different Treatments

This threat occurs if respondents experience more than one treatment, which is common in laboratory research but quite rare in field settings. We do not […]

[…] fiction books but not for the circulation of factual ones. The process of hypothesizing constructs and testing how well treatment and outcome operations fit these constructs is similar whether it occurs before the research begins or after the data are received. The major difference is that in the later stage one specifies constructs that fit the data, whereas in the earlier stage one derives operations from constructs.
In their pathfinding discussion of construct validity, Cronbach and Meehl (1955) stressed the utility of drawing inferences about constructs from the fit between the pattern of data that would be predicted if a particular theoretical construct were operating and the multivariate pattern of data actually obtained in the research. They used the term "nomological net" to refer to the predicted pattern of relationships that would permit naming a construct. For instance, a current version of dissonance theory predicts that being underpaid for a counterattitudinal advocacy will result in greater belief change than being overpaid, provided that the individual who makes the advocacy thinks he has a free choice to refuse to perform the advocacy. The construct "dissonance" would therefore be partially validated if experimental data showed that underpayment caused more belief change than overpayment but only under free choice conditions. However, the fit between the complex prediction and the complex data only facilitates belief in "dissonance" to the extent that other theoretical constructs could not explain this same data pattern. Bem (1972) obviously believes that reinforcement constructs do as good a job of complex prediction in this case as "dissonance."

We have implicitly used the "nomological net" idea in our presentation of construct validity. First, we discussed the usefulness—for labeling the treatment—of examining whether the planned treatment is related to direct measures of the treatment process and is not related to cognate processes. Second, we discussed the advantages of determining in what ways the outcome variables are related to treatments and the type of treatment that could have resulted in such a differentiated impact. For instance, if the introduction of television decreases the circulation of fiction but not fact books, one can hypothesize that the causal impact is mediated by television taking time away from activities that are functionally similar—such as fantasy amusement—but not from functionally dissimilar activities—such as learning specific facts. However, our emphasis has differed slightly from that of Cronbach and Meehl (1955) inasmuch as we are more interested in fitting cause and effect operations to a generalizable construct (see Campbell, 1960—the discussion of "trait validity") than we are in using complex predictions and data patterns to validate entirely hypothetical scientific constructs like "anxiety," "intelligence," or "dissonance." However, we readily acknowledge that the way the data turn out in experiments helps us edit the constructs we deal with, as when we find that a foreman's "supervision" has different consequences from less than ten feet as opposed to more than ten feet.

EXTERNAL VALIDITY

Introduction

Under external validity, Campbell and Stanley originally listed the threat of not being able to generalize across exemplars of a particular presumed cause or effect construct. We have obviously chosen to incorporate this feature under
construct validity as "mono-operation bias." The reason for listing this threat differently from Campbell and Stanley is not fundamental. Rather it is meant to emphasize that most researchers want to draw conclusions about constructs, but the Campbell and Stanley discussion had a flavor of definitional operationalism, although a multiple definitional operationalism. We have tried to avoid this flavor by invoking construct validity to replace generalizing across cause and effect exemplars. The other features of Campbell and Stanley's conceptualization of external validity are preserved here and elaborated upon. They have to do with (1) generalizing to particular target persons, settings, and times, and (2) generalizing across types of persons, settings, and times.

Bracht and Glass (1968) have succinctly explicated external validity, pointing out that a two-stage process is involved: a target population of persons, settings, or times has first to be defined, and then samples are drawn to represent these populations. Very occasionally, the samples are drawn from the populations with known probabilities, thereby maximizing the formal representativeness discussed in textbooks on sampling theory (e.g., Kish, 1965). But usually the samples cannot be drawn so systematically and are drawn instead because they are convenient and give an intuitive impression of representativeness, even if it is only the representativeness entailed by class membership (e.g., I want to generalize to Englishmen, and the people I found on streetcorners in Birkenhead, England, belong to the class called Englishmen). Accidental sampling, as it is technically labeled, gives us no guarantee that the achieved population (a subset of Englishmen who hang around streetcorners in Birkenhead) is representative of the target population of which they are members. Consequently, we find it useful to distinguish among (1) target populations, (2) formally representative samples that correspond to known populations, (3) samples actually achieved in field research, and (4) achieved populations.

One of many examples that could be cited to illustrate these last points concerns the design of the first negative income tax experiment. Practical administrative considerations led to the study being conducted in a few localities within New Jersey and in one city in neighboring Pennsylvania. Since the basic question guiding the research did not require such a restricted geographical location, the New Jersey-Pennsylvania setting must be considered a limitation which reduces one's ability to generalize to the implicit target population of the whole United States. (To criticize the study because the achieved sample of settings was not formally representative of the target population may appear unduly harsh in light of the fact that financial and logistical resources for the experiment were limited, and so sampling was conducted for convenience rather than formal representativeness. We shall return to this point later. For the present, however, it is worth noting that accidental samples of convenience do not make it easy to infer the target population, nor is it clear what population is actually achieved.)

Generalizing to well-explicated target populations should be clearly distinguished from generalizing across populations. Each is germane to external validity: the former is crucial for ascertaining whether any research goals that specified populations have been met, and the latter is crucial for ascertaining which different populations (or subpopulations) have been affected by a treatment, i.e., for assessing how far one can generalize. Let us give an example.
Suppose a new television show were introduced that was aimed at teaching basic arithmetic to seven-year-olds in the United States. Suppose, further, that one could somehow draw a random sample of all seven-year-olds to give a representative national sample within known limits of sampling error. Suppose, further, that one could then randomly assign each of the children to watching or not watching the television show. This would result in two randomly formed, and thus equivalent, experimental groups which were representative of all seven-year-olds in the United States. Imagine, now, that the data analysis indicated that the average child in the viewing group gained more than the average child in the nonviewing group. One could generalize such a finding to the average seven-year-old in the nation, the target population of interest. This is equivalent to saying that the results were obtained despite possible variations in how much different kinds of children in the experimental viewing group may have gained from the show. A more differentiated data analysis might show that the boys gained more than the girls (or even that only the boys gained), or the analysis might show that children with certain kinds of home background gained while children from different backgrounds did not. Such differentiated findings indicate that the effects of the televised arithmetic show could not be generalized across all subpopulations of seven-year-old viewers, even though they could be generalized to the population of seven-year-old viewers in the United States.

To generalize across subpopulations like boys and girls logically presupposes being able to generalize to boys and girls. Thus, the logical distinction between generalizing to and across should not be overstressed. The distinction is most useful for its practical implications insofar as many researchers who are concerned about generalizing across populations are usually not as concerned with careful sampling as are persons who want to generalize to target populations. Many researchers with the former focus would be happy to conclude that a treatment had a specific effect with the particular achieved sample of boys or girls in the study, irrespective of how well the achieved population of boys or girls can be specified.

The distinction between generalizing to target populations and across multiple populations or subpopulations is also useful because commentators on external validity have often implicitly stressed one over the other. For instance, some persons discuss external validity as though it were only about estimating limits of generalizability, as is evidenced by comments such as: "Sure, the treatment affected seven-year-olds in Tucson, Arizona, and that was your target group. But what about children of different ages in other areas of the United States?" Other commentators discuss external validity exclusively in terms of the fit between samples and target populations, as is evidenced by comments such as: "I'm not sure whether the treatment is really effective with children who have learning disabilities, for if you look at the pretest achievement means for the groups in your experiment, you'll see that they are as high as the test publisher quotes for the national average. How could children with learning disabilities have scored so high?
I doubt that the research really involved the kind of child you said it did."

Finally, we make the distinction between generalizing to and across in order to emphasize the greater stress that we shall place in this presentation on generalizing across. The rationale for this is that formal random sampling for representativeness is rare in field research, so that strict generalizing to targets of external validity is rare. Instead, the practice is more one of generalizing across haphazard instances where similar-appearing treatments are implemented. Any inferences about the targets to which one can generalize from these instances are necessarily fallible, and their validity is only haphazardly checked by examining the instances in question and any new instances that might later be experimented upon. It is also worth noting that the formal generalization to target populations of persons is often associated with large-scale experiments. These are often difficult to administer both in terms of treatment implementation and securing high-quality measurement. Moreover, attrition is almost inevitable, and so the sample with which one finishes the research may not represent the same population with which one began the research. A case can be made, therefore, that external validity is enhanced more by a number of smaller studies with haphazard samples than by a single study with initially representative samples, if the latter could be implemented. Of course, it should not be forgotten that all the haphazard instances of persons and settings that are examined can belong to the class of persons or settings to which one would like to be able to generalize research findings. Indeed, they should belong to such a class.

List of Threats to External Validity

Tests of the extent to which one can generalize across various kinds of persons, settings, and times are, in essence, tests of statistical interactions. If there is an interaction between, say, an educational treatment and the social class of children, then we cannot say that the same result holds across social classes. We know that it does not. Where effects of different magnitude exist, we must then specify where the effect does and does not hold and, hopefully, begin to explore why these differences exist. Since the method we prefer of conceptualizing external validity involves generalizing across achieved populations, however unclearly defined, we have chosen to list all of the threats to external validity in terms of statistical interaction effects.

Interaction of Selection and Treatment

In which categories of persons can a cause-effect relationship be generalized? Can it be generalized beyond the groups used to establish the initial relationship—to various racial, social, geographical, age, sex, or personality groups? Even when respondents belong to a target class of interest, systematic recruitment factors lead to findings that are only applicable to volunteers, exhibitionists, hypochondriacs, scientific do-gooders, those who have nothing else to do, and so forth. One feasible way of reducing this bias is to make cooperation in the experiment as convenient as possible. For example, volunteers in a television-radio audience experiment who have to come downtown to participate are much more likely to be atypical than are volunteers in an experiment carried door-to-door.
An experiment involving executives is more likely to be ungeneralizable if it takes a day's time than if it takes only ten minutes, for only the latter experiment is likely to include those people who have little free time.

preliminary understanding of what a business or project is capable of. But that is another matter.)

The determination of modal instances is more difficult the closer one comes to theoretical research. This is because target populations are less likely to be specified. For instance, in testing propositions about "helping" behavior, it is not desirable to generalize only to workers who are presently employed in a particular factory, working at a particular task, and producing a particular product. The persons, the settings, the task, and the product would be irrelevant to any helping theory. Yet—logically speaking—the factors incorporated into a particular test of a proposition about helping determine the external validity of the findings, and the researcher presumably does not welcome this restriction. Instead, he or she would like to generalize to all persons (in the United States? beyond our shores?), all settings (the street, the home, the factory?), and all tasks (helping someone who has fainted, helping the permanently disabled?). The feasibility of confident generalizations of such breadth is low, and the most that the basic researcher can do is to attempt to replicate his or her original findings across settings with different restrictions or to wait until others have conducted the replications. Sampling for heterogeneity is at issue here rather than sampling to obtain impressionistically modal instances that the researcher cannot convincingly define.

It should be clear by now that, where targets are specified, the model of random sampling for representativeness is the most powerful model for generalizing and that the model of impressionistic modal instances is the least powerful. The model of heterogeneous instances lies between the two. However, the last model has advantages over the other two in that it can be used when no targets are specified and the major concern is not to be limited in one's generalizations. Moreover, it can be used with small numbers of samples of convenience. In many cases the random selection of instances results in generalizing to targets that are of minimal significance for persons whose interests differ from those of the original researcher's. For instance, to be able to generalize to all whites living in the Detroit area, while of interest for some purposes, is generally of little interest to most people. However, it is worth noting that whites in Detroit differ in age, SES, intelligence, and the like, so that it is possible to test whether a particular treatment can have similar effects despite such differences. In addition, subgroup analyses can be conducted to examine generality across subpopulations. In other words, formal randomization from populations of low interest can be used to test causal relationships across heterogeneous subpopulations; an important function of random samples is to permit examining the data for differential effects on a variety of subpopulations. Given the negative relationship between "inferential" power and feasibility, the model of heterogeneous instances would seem most useful, particularly if great care is taken to include impressionistically modal instances among the heterogeneous ones.
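To make the subgroup logic just sketched concrete, the following minimal simulation in Python imagines a treatment tested across deliberately diverse sites. The site names, sample sizes, and effect values are invented assumptions for illustration only. The point is that estimating the effect within each subgroup, and not merely in the pooled heterogeneous sample, is what reveals whether a finding generalizes across the instances sampled.

    import random
    import statistics

    rng = random.Random(0)

    # Invented site types with invented 'true' treatment effects; the
    # third site type is deliberately given no effect at all.
    SUBGROUPS = {"urban school": 5.0, "rural school": 5.0, "suburban school": 0.0}

    def simulate(effect, n=200):
        """Simulate treated and control outcomes for one site type."""
        control = [rng.gauss(50, 10) for _ in range(n)]
        treated = [rng.gauss(50 + effect, 10) for _ in range(n)]
        return treated, control

    per_site = {}
    all_treated, all_control = [], []
    for site, effect in SUBGROUPS.items():
        treated, control = simulate(effect)
        per_site[site] = statistics.mean(treated) - statistics.mean(control)
        all_treated += treated
        all_control += control

    # Pooled estimate over the whole heterogeneous sample.
    pooled = statistics.mean(all_treated) - statistics.mean(all_control)
    print(f"pooled effect: {pooled:.2f}")

    # Estimates within each subgroup: the pooled figure conceals the
    # site type in which the treatment does nothing.
    for site, diff in per_site.items():
        print(f"{site}: effect = {diff:.2f}")

A pooled estimate can look solidly positive even when one subgroup shows no effect at all, which anticipates the distinction drawn below between demonstrated and assumed replication.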
In the last analysis, external validity—like construct validity—is a matter of replication. It is worth noting that one can have multiple replications both within a single study—subgroup analyses exemplify this—and also across studies—as when one investigator is intrigued by a pattern of findings and tries to replicate them using his or her own procedures or procedures that have been closely modeled on the original investigators'.

Three dimensions of replication are worth noting. First is the simultaneous or consecutive replication dimension, with the latter being preferable since it offers some test, however restricted, of whether a causal relationship can be corroborated at two different times. (Generalizing across times is necessarily more difficult than generalizing across persons or settings.) Second is the independent or nonindependent investigator dimension, with the former being more important, especially if the independent investigators have different expectations about how an experiment will turn out. Third is the dimension of demonstrated or assumed replication. The former is assessed by explicit comparisons among different types of persons and settings where some persons did or did not receive a particular treatment. The latter is inferred from treatment effects that are obtained with heterogeneous samples, but no explicit statistical cognizance is taken of the differences among persons, settings, and times. Demonstrated replication is clearly more informative than assumed, for to obtain an effect with a mixed sample of, say, boys and girls does not logically entail that the effect could be obtained separately for both boys and girls. It only entails that the effect was obtained despite any differences between boys and girls in how they reacted to the treatment.

The difficulties associated with external validity should not blind experimenters to practical steps that can be taken to increase generalizability. For instance, one can often deliberately choose to perform an experiment at three or more sites where different kinds of persons live or work. Or, if one can randomly sample, it is useful to do so even if the population involved is not meaningful, for random sampling ensures heterogeneity. Thus, in their experiment on the relationship between beliefs and behavior about open housing, Brannon et al. (1973) chose a random sample of all white households in the metropolitan Detroit area. While few of us are interested in generalizing to such a population, the sample was nonetheless considerably more heterogeneous than that used in most research, despite the homogeneity on the attributes of race and geographical residence.

In addition, our three models for increasing external validity can be used in combination, as has been achieved in some survey research experiments on improving survey research procedures (Schuman and Duncan, 1974). Usually, random samples of respondents are chosen in such experiments, but the interviewers are not randomly chosen; they are merely impressionistically modal of all experienced interviewers. Moreover, the physical setting of the research is limited to one target setting that is of little interest to anyone who is not a survey researcher—the respondent's living room—and the range of outcome variables is usually limited to those that survey researchers typically study—that is, those that can be assessed using paper and pencil.
However, great care is normally taken that these questions cover a wide range of possible effects, thereby ensuring considerable heterogeneity in the effect constructs studied.

Our pessimism about external validity should not be overgeneralized. An awareness of targets of generalization, of the kinds of settings in which a target class of behaviors most frequently occurs, and of the kinds of persons who most often experience particular kinds of natural treatments will, at the very least, prevent the designing of experiments that many persons shrug off willy-nilly as "irrelevant." Also, it is frequently possible to conduct multiple replications of an experiment at different times, in different settings, and with different kinds of experimenters and respondents. Indeed, a strong case can be made that external validity is enhanced more by many heterogeneous small experiments than by one or two large experiments, for with the latter one runs the risks of having heterogeneous treatments, measures that are not as reliable as they could be, and measures that do not reflect the unique nature of the treatment at different sites. Many small-scale experiments with local control and choice of measures are in many ways preferable to giant national experiments with a promised standardization that is neither feasible nor even desirable from the standpoint of making irrelevancies heterogeneous.

RELATIONSHIPS AMONG THE FOUR KINDS OF VALIDITY

Internal Validity and Statistical Conclusion Validity

Drawing false positive or false negative conclusions about causal hypotheses is the essence of internal validity. This was a major justification for Campbell (1969) adding "instability" to his list of threats to internal validity. "Instability" was defined as "unreliability of measures, fluctuations in sampling persons or components, autonomous instability of repeated or equivalent measures," all of which are threats to drawing correct conclusions about covariation and hence about a treatment's effect. (What precipitated the need for this additional threat was the viewpoint of some sociologists who had argued against using tests of significance unless the comparison followed random assignment to treatments. See Winch and Campbell, 1969, for further details.)

The status of statistical conclusion validity as a special case of internal validity can be further illustrated by considering the distinction between bias and error. Bias refers to factors which systematically affect the value of means; error refers to factors which increase variability and decrease the chance of obtaining statistically significant effects. If we erroneously conclude from a quasi-experiment that A causes B, this might either be because threats to internal validity bias the relevant means or because, for a specifiable percentage of possible comparisons, sample differences as large as those found in a study would be obtained by chance. If we erroneously conclude that A does not affect B (or cannot be demonstrated to affect B), this can either be because threats to internal validity bias means and obscure true differences or because the uncontrolled variability obscures true differences. Statistical conclusion validity is concerned not with sources of systematic bias but with sources of random error and with the appropriate use of statistics and statistical tests.

An important caveat has to be added to the preceding statement that random errors reduce the risk of statistically corroborating true differences.
This does not imply that random errors invariably inflate standard errors or that they never lead to false positive conclusions about covariation. Let us try to illustrate these points. Imagine multiple replications of an unbiased experiment where the treatment had no effect. The distribution of sample mean differences should be normal with a mean of zero. However, many individual sample mean differences will not be zero. Some will inevitably be larger or smaller than zero, even to a statistically significant degree.

Imagine, now, the same assumptions except that bias is operating. Because of the bias, the distribution of sample mean differences will no longer have a mean of zero, and the difference from zero indicates the magnitude of the bias. However, the point to be emphasized is that some sample mean differences will be as large when there is bias as when there is not, although the proportion of differences reaching the specified magnitude will vary between the bias and nonbias cases depending on the direction and magnitude of bias. Since sampling error, which is one kind of random error, affects both sample means and variances, it can lead to both false positive and false negative differences. In this respect, sampling error is like a threat to internal validity. But it is unlike internal validity threats in that it cannot affect population means. Only sources of bias—threats to internal validity—can do the latter. (A small simulation sketch below illustrates the bias and no-bias cases.)

Construct Validity and External Validity

Making generalizations is the essence of both construct and external validity. It is instructive, we think, to analyze the similarities and differences between the two types of validity. The major similarity can perhaps best be summarized in terms of the notion of statistical interaction—that is, the sign or direction of a treatment effect differs across populations. It is easy to see how person, setting, and time variables can moderate the effectiveness of a treatment. It is probably also easy to see how an estimate of the effect may depend on such threats to construct validity as the number of treatments a respondent receives or the frequency with which outcomes are measured. It may be less easy to see how a treatment effect can interact with (i.e., depend on) the particular method used for collecting data (mono-method bias), or the expectancies of the persons implementing a treatment (experimenter expectancies), or the guesses that respondents make about how they are supposed to behave (hypothesis-guessing). But in all these instances an internally valid effect can be obtained under one condition (say, when paper-and-pencil measures of attitude are used) and a different, but still valid, effect may result when attitude is measured some other way.

Specifying the factors that codetermine the direction and size of a particular cause-effect relationship is useful for inferring cause and effect constructs. This is because some of the causes or effects that might explain a particular relationship observed under one condition may not be able to explain why there are different causal relationships under other conditions. It should especially be noted that specifying the populations of persons, settings, and times over which a relationship holds can also clarify construct validity issues.
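Returning to the bias-versus-error illustration flagged above, the same point can be made concrete with a small simulation, sketched below in Python. The sample sizes, error variance, and bias constant are arbitrary assumptions chosen for illustration. The sketch shows replicated null experiments producing mean differences that scatter around zero, occasionally by a nominally "significant" margin, while a constant bias displaces the center of the whole distribution.

    import random
    import statistics

    rng = random.Random(1)

    def mean_difference(bias=0.0, n=30):
        """One replication of a two-group experiment with no true
        treatment effect; 'bias' is a constant added to every treated
        score."""
        treated = [rng.gauss(0, 1) + bias for _ in range(n)]
        control = [rng.gauss(0, 1) for _ in range(n)]
        return statistics.mean(treated) - statistics.mean(control)

    unbiased = [mean_difference(bias=0.0) for _ in range(5000)]
    biased = [mean_difference(bias=0.5) for _ in range(5000)]

    # Random error alone: the differences center on zero, but individual
    # replications stray from it, sometimes by a large margin.
    print("no bias:  mean =", round(statistics.mean(unbiased), 3),
          " extreme =", round(max(unbiased, key=abs), 3))

    # Bias: the center of the whole distribution is displaced from zero,
    # and the displacement estimates the magnitude of the bias.
    print("bias 0.5: mean =", round(statistics.mean(biased), 3))

Larger samples shrink the scatter but not the displacement; only removing the source of bias can do that.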
For an instance of how population specification can clarify constructs, suppose a negative income tax causes more married women than men to withdraw their labor from the labor market (see the summary statements of the four negative income tax experiments in Cook, Del Rosario, Hennigan, Mark, and Trochim, 1978). Such an action might suggest that the causal treatment can be understood not just in monetary terms but also in terms of a possible shift in economic risks (i.e., where the family breadwinner is guaranteed an income, the withdrawal of his or her labor could have extremely serious consequences if the income guarantee were withdrawn or if the guaranteed sum failed to rise with inflation. But where a source of more marginal income is involved—as with some married women—the withdrawal of their labor is less critical since the family is not so heavily dependent on that one source of income.) Other interpretations of why men and women are affected differently are also possible. Their existence highlights the difficulty of inferring causal constructs where the clarifying inference is indirect, being based on differences in responding across populations rather than on attempts to refine the causal operations directly so that they better fit a planned construct. The major point to be noted, however, is that both external and construct validity are concerned with specifying the contingencies on which a causal relationship depends, and all such specifications have important implications for the generalizability and nature of causal relationships. Indeed, external validity and construct validity are so highly related that it was difficult for us to classify some of the threats as belonging to one validity type or another. In fact, two of them are differently placed in this book than in Cook and Campbell (1976). These are "the interaction of treatments" and "the interaction of testing and treatment." They were formerly included as threats to external validity on grounds that the number of treatments and testings were part of the research setting. On reflection, however, we think they are more useful for specifying cause and effect constructs than for delimiting the settings under which a causal relationship holds, though they obviously can serve both purposes.

The major difference between external and construct validity has to do with the extent to which real target populations are available. In the case of external validity the researcher often wants to generalize to specific populations of persons, settings, and times that have a grounded existence, even if he or she can only accomplish this by impressionistically examining data patterns across accidental samples. However, with cause and effect constructs it is more difficult to specify a particular construct—what, for instance, is aggression? Any definitions would be disputed and would not have the independent existence of, say, the population of American citizens over 18 years of age. Even though the latter is a theoretical construct, it is obviously more grounded in reality than such constructs as "attitude towards authority" or "a negative income tax."

Issues of Priority Among Validity Types

Some ways of increasing one kind of validity will probably decrease another kind. For instance, internal validity is best served by carrying out randomized experiments, but the organizations willing to tolerate these are probably less representative than organizations willing to tolerate passive measurement.
Second, statistical conclusion validity is increased if the experimenter can rigidly control the stimuli impinging on respondents, but this procedure can decrease both external and construct validity. And third, increasing the construct validity of effects by multiply operationalizing each of them is likely to increase the tedium of measurement and to cause either attrition from the experiment or lower reliability for individual measures.

These countervailing relationships—and there are many others—suggest how crucial it is to be explicit about the priority ordering among validity types in planning any experiment. Means have to be developed for avoiding all unnecessary trade-offs between one kind of validity and another, and for minimizing the loss entailed by the necessary trade-offs. However, since some trade-offs are inevitable, we think it unrealistic to expect that a single piece of research will effectively answer all of the validity questions surrounding even the simplest causal relationship.

The priority among validity types varies with the kind of research being conducted. For persons interested in theory testing it is almost as important to show that the variables involved in the research are constructs A and B (construct validity) as it is to show that the relationship is causal and goes from one variable to the other (internal validity). Few theories specify crucial target settings, populations, or times to or across which generalization is desired. Consequently, external validity is of relatively little importance. In practice, it is often sacrificed for the greater statistical power that comes through having isolated settings, standardized procedures, and homogeneous respondent populations. For investigators with theoretical interests our estimate is that the types of validity, in order of importance, are probably internal, construct, statistical conclusion, and external validity.

We also estimate that the construct validity of causes may be more important for such researchers than the construct validity of effects, particularly in psychology. Think, for example, of how simplistically "attitude" is operationalized in many persuasion experiments, or "cooperation" in bargaining studies, or "aggression" in studies of interpersonal violence. Think, on the other hand, about how much care goes into demonstrating that a particular manipulation varied "cognitive dissonance" and not reactance, communicator expertise and not experimenter expectancies or evaluation apprehension. Might not the construct validity of effects rank lower than statistical conclusion validity for most theory-testing researchers? If it does, this would be ironic since multiple operationalism makes it easier to achieve higher construct validity of effects than of causes.

Much applied research has a different set of priorities. It is concerned with testing whether a particular problem has been alleviated by a treatment—recidivism in criminal justice settings, achievement in education, or productivity in industry (high internal validity and high construct validity of the effect). It is also crucial that any demonstration of change in the indicator be made in a context which permits either wide generalization or generalization to the specific target settings or persons in whom the researcher or his clients are particularly interested (high interest in external validity).
The researcher is relatively less concerned with determining the causally efficacious components of a complex treatment package, for the major issue is whether the treatment as implemented caused the desired change (low interest in construct validity of the cause). The priority ordering for many applied researchers is something like internal validity, external validity, construct validity of the effect, statistical conclusion validity, and construct validity of the cause.

For the kinds of causal problems we have been discussing, the primacy of internal validity should be noted for both basic and applied researchers. The reasons for this will be given below, and they relate to the often considerable costs of being wrong about the magnitude and direction of causal relations, and the often minimal gains in external validity that are achieved in moving from

notion of "cause" is an abstract one and that the single study only approximates causal knowledge. But we believe it is confusing to insist that internal validity is a contradiction in terms because all validity is external, referring to abstract concepts beyond a study and not to concrete research operations within a study. It is confusing because the choice of populations and the fit between samples and populations determines representativeness, whereas neither populations nor samples are necessary for inferring cause.

Nonetheless, the critics make a very useful point, for if the goals of a research project are formulated well enough to permit specifying target constructs and populations, then the research operations have to represent these targets if the research is to be relevant either to theory or to policy. Moreover, a focus on representativeness has historically entailed a heightened sensitivity to unplanned and irrelevant targets that unnecessarily limit generalizability, as when all the persons who collect posttest achievement data in an early childhood experiment with economically disadvantaged children are of the same social class. Clearly, relevant research demands representativeness where target constructs or populations are specified. It also demands heterogeneity where irrelevant populations could limit the applicability of the research. Though we advocate putting considerable resources into the preexperimental explication of relevant theory or policy questions—and hence targets—this should not be interpreted in any way as an exclusive focus. As we tried to demonstrate in the discussion of both construct and external validity, it is sometimes the case that the data, once collected and analyzed, force us to restrict (or extend) generalizability beyond the scope of the original formulation of target constructs and populations. The data edit the kinds of general statements we can make.

For instance, in his experiment on the help given to compatriots and foreigners, Feldman (1968) wanted to generalize to "cooperation." He deduced that if his independent variable affected cooperation, he would find five dependent variable measures related to his treatment. But only two were related, and the data outcomes forced him to conclude tentatively that his treatment was differently related to two kinds of cooperation.
Similarly, the designers of the New Jersey Negative Income Tax Experiment wanted to generalize to working poor persons, but the data forced them tentatively to conclude that working poor blacks responded one way to the treatments, working poor persons who were Spanish speaking reacted another way, and working poor whites probably did not respond to the treatments at all. The point is this: While it is laudable to sample for representativeness when targets of generalization are specified in advance—and we heartily endorse such sampling—in the last analysis it is the patterning of data outcomes which determines the range of constructs and populations over which one can claim a treatment effect was obtained. One has always to be alert to the data demanding a respecification of the affected populations and constructs and to the possibility that the affected populations and constructs will not be the same as those originally specified.

A fourth objection has been directed towards Campbell and Stanley's stress on the primacy of internal over external validity. The critics argue that no kind of validity can logically have precedence over another. Of what use, critics say, is a theory-testing experiment if the true causal variable is not what the researchers say it is; or of what use is a policy experiment about the effects of school desegregation if it involves a school in rural Mississippi when most desegregation is in large, northern cities where white children have fewer alternatives to public schools than in the deep South? This point of view has been best expressed by Snow (1974). He uses the term "referent validity" to designate the extent to which research operations correspond to their referent terms in research propositions of the form: "Counseling for pregnant teenagers improves their mental health" or "The introduction of national health insurance causes an increase in the use of outpatient services." Without using our terminology, Snow notes that such propositions usually contain implicit or explicit references to populations, settings, and times (external validity), to the nature of the presumed cause and effect (construct validity), to whether the operations representing the cause and effect covary (statistical conclusion validity), and to whether this covariation is plausibly the result of causal forces (internal validity). For a study to be useful, the argument goes, each part of the proposition should be given approximately equal weight. There is no need to stress the causality term over any other.

Other critics (Hultsch and Hickey, 1978; Cronbach, in preparation) take the argument one step further and stress the primacy of external over internal validity. Hultsch claims that if we have a target population of special interest—for example, the educable mentally retarded—then it is better to test causal propositions about this group on representative samples. He maintains this should be done even if less rigorous means have to be used for testing causal propositions than would be the case if a study was restricted to easily available but nonrepresentative subgroups of the educable mentally retarded or to children who were not educable and retarded.
Cronbach (in preparation) echoes this argument and adds, first, that in much applied social research the results are needed quickly and, second, that a high degree of confidence about causal attribution is less important in the decisions of policy-makers (broadly conceived) than is confidence in knowing that one is working with formally or impressionistically representative samples. Consequently, Cronbach contends that the time demands of experiments with experimenter-controlled manipulanda and the reality of how research is (and is not) used in decision making suggest a higher priority for speedy research using available data sets, simple one-wave measurement studies, or qualitative studies as opposed to studies which, like quasi-experiments, take more time and explicitly stress internal validity.

It is in some ways ironic that the charge of neglecting external validity should be leveled against one of the persons who invented the construct and elevated its importance in the eyes of those who restricted experimentation to laboratory settings and who wrote about experimentation without formally mentioning the special problems that arise in field settings. But this aside, we have no quarrel in the abstract with the point of view that, where causal propositions include references to populations of persons and settings and to constructs about cause and effect, each should be equally weighted in empirical tests of these propositions. The real difficulty comes in particular instances of research design and implementation where very often the investigator is forced to make undesirable choices between internal and external validity. Gaining a representative sample of educable, mentally retarded students across the whole nation demands considerable resources. Even gaining such a sample in a few cities located more closely together is difficult, requiring resources for implementing a treatment, ensuring its consistent delivery, collecting the required pretest and posttest data, and doing the necessary public relations work. Without such resources, one runs the risk of a large, poorly implemented study with a representative sample or of a smaller, better implemented study where the small sample size limits our confidence in generalizing.

Since random sampling is so rare for purposes of achieving representativeness, it is useful to consider the trade-off between internal and external validity when heterogeneous but unrepresentative sampling is used or when impressionistically modal but unrepresentative instances are selected that at least belong in the general class to which generalization is desired. Samples selected this way will have unknown initial biases, since not all schools will volunteer to permit measurement, even fewer schools will agree to deliberate manipulation of any kind, and the sample of schools that will agree to randomized manipulation will probably be even more circumscribed than the sample of schools that agrees to measurement with or without quasi-experimentation. The crucial issue is this: Would one do better to work with the initially more representative sample of schools in a particular geographical area that volunteered to permit measurement, even though no deliberate manipulation took place? Or would one rather work with a less representative sample of schools where both measurement and deliberate manipulation took place?
Solving this problem boils down, we think, to asking whether the internal validity costs of eschewing deliberate manipulation and more confident causal inferences are worth the gains for external validity of having an initially more representative sample from which bias-inducing attrition will nonetheless take place. Any resolution must also consider two other factors. First, the study which stresses internal validity has at least to take place in a setting and with persons who belong in the class to which generalization is desired, however formally unrepresentative of the class they might be. Second, the study which stresses external validity and has apparently more representative samples of settings and persons will result in less confident causal conclusions because more powerful techniques of field experimentation were not used or were not used as well as they might have been under other circumstances.

The art of designing causal studies is to minimize the need for trade-offs and to try to estimate in any particular instance the size of the gains and losses in internal and external validity that are involved in different trade-off options. Scholars differ considerably in their estimates of gains and losses. Cronbach maintains that timely, representative, but less rigorous studies can still lead to reasonable approximations to causal inference, even if the studies are nonexperimental and of the kind we shall discuss—somewhat pessimistically—in chapter 7. Campbell and Boruch (1975), on the other hand, maintain that causal inference is problematic with nonexperimental and single-wave quasi-experiments because of the many threats to internal validity that remain unexamined or have to be ruled out by fiat rather than through direct design or measurement. The issue involves estimating how to balance timeliness and the quality of causal inference, whether the costs of being wrong in one's causal inference are not greater than the costs of being late with the results.

Consider two cases of timely research aimed at answering causal questions which used manifestly inadequate experimental procedures. Head Start was evaluated by Ohio-Westinghouse (Cicirelli, 1969) in a design with only one wave of measurement of academic achievement. The conclusion—Head Start was harmful. Analysis of the same data using different statistical models appeared to corroborate this conclusion (Barnow, 1973); to reverse it completely, making Head Start appear helpful (Magidson, 1977); or to result in no-difference findings (Bentler and Woodward, 1978). Since we do not know the effects of Head Start, any timely decisions based on the data would have been premature and perhaps harmful. The second example worth citing is the Coleman Report (Coleman et al., 1966). In this large-scale, one-wave study it was concluded that black children gained more in achievement the higher the percentage of white children in their classes. This finding was used to justify school desegregation. However, better designed subsequent research has shown that if blacks gain at all because of desegregation (which is not clear), they gain much less than was originally claimed. It is important, we feel, not to underestimate the costs of producing timely results about cause, particularly its direction, which turn out to be wrong. Clearly, the chances of being wrong about cause are higher the more one deviates from an experimental model and conducts nonexperimental research using primitive one-wave quasi-experiments.
Since timeliness is important in policy research—though less so for basic researchers for whom this book is also intended—we shall devote some of the next chapter to quasi-experimental designs that do not require pretests and to ways in which archives can be used for rigorous and timely causal analysis. In the end, however, each investigator has to try to design research which maximizes all kinds of validity and, if he or she decides to place a primacy on internal validity, this should not be allowed to trivialize the research.

We have not tried to place internal validity above other forms of validity. Rather, we wanted to outline the issues. In a sense, by writing a book about experimentation in field settings, we are assuming that readers already believe that internal validity is of great importance, for the raison d'être of experiments is to facilitate causal inference. Other forms of knowledge about the social world are more accurately or more efficiently gained through other means—e.g., surveys or participant observation. Our aim differs, therefore, from that of the last critics we discussed. They argue that experimentation is not necessary for causal inference or that it is harmful to the pursuit of knowledge which will be useful in policy formulation. We assume that readers believe causal inference is important and that experimentation is one of the most useful, if not the most useful, ways of gaining knowledge about cause.

SOME OBJECTIONS TO OUR TENTATIVE PHILOSOPHY OF THE SOCIAL SCIENCES

Protests against "scientism" have been prominent in recent commentaries on the theory of conducting social science. Such protests focus on inappropriate and blind efforts to apply "the scientific method" to the social sciences. Critics argue that quantification, random assignment, control groups and the deliberate intrusion of treatments—all techniques borrowed from the physical sciences—distort the context in which social research takes place. Their protest against scientism is often linked with the now-pervasive rejection of the logical positivist philosophy of science and is frequently accompanied by a greater emphasis on humanistic and qualitative research methods such as ethnography, participant observation, and ethnomethodology. Critics also point to the irreducibly judgmental and subjective components in all social science research and to the pretensions to scientific precision found in many current studies.

We agree with much of this criticism and have addressed the issue in our previous work (Campbell, 1966, 1974, 1975; Cook, 1974a; Cook and Cook, 1977; Cook and Gruder, 1978; Cook and Reichardt, in press). However, some of the critics of scientism (Guttentag, 1971, 1973; Weiss and Rein, 1970; Hultsch and Hickey, 1978; Mitroff and Bonoma, 1978; Mitroff and Kilman, 1978; Cronbach, in preparation) have cited Campbell and Stanley (1966) and Cook and Campbell (1976) as prime examples of the scientistic norm to which they object. While the identification of our previous work with scientism oversimplifies and blurs the issues, we acknowledge that in this volume, as in the past, we advocate using the methods of experiments and quantitative science that are shared in part with the physical sciences. We cannot here comment extensively on these criticisms of our background assumptions, which go beyond criticisms of causation issues alone. But we can indicate in broad terms the approach we would take in responding to these objections.
First, we of course agree with the critics of logical positivism. The philosophy was wrong in describing how physical science achieved its degree of validity, which was not through descriptive best-fit theories and definitional operationalism. Although the error did not have much impact on the practice of physics, its effect on social science methods was disastrous. We join in the criticism of positivist social science when positivist is used in this technical sense rather than as a synonym for "science." We do not join critics when they advocate giving up the search for objective, intersubjectively verifiable knowledge. Instead we advocate substituting a critical-realist philosophy of science, which will help us understand the success of the physical sciences and guide our efforts to achieve a more valid social science. Critical realists (Mandelbaum, 1964), or "metaphysical realists" (Popper, 1972), "structural realists" (Maxwell, 1972), or "logical realists" (Northrop, 1959; Northrop and Livingston, 1964), are among the most vigorous modern critics of logical positivism. Critical realists particularly concerned with the social sciences identify their position with Marx's materialist criticism of idealism and positivism (e.g., Bhaskar, 1975, 1978; Keat and Urry, 1975).

Second, it is generally agreed that the social disciplines, pure or applied, are not truly successful as sciences. In fact, they may never have the predictive and explanatory power of the physical sciences—a pessimistic conclusion that merits serious debate (Herskovits, 1972; Campbell, 1972). This book, with its many categories of threats to validity and its general tone of modesty and caution in making causal inferences, supports such pessimism and underscores the equivocal nature of our conclusions. However, it is sometimes forgotten that these threats are not limited to quantitative or deliberately experimental studies. They also arise in less formal, more commonsense, humanistic, global, contextual, integrative and qualitative approaches to knowledge. Even the "regression artifacts" identified with measurement error are an observational-inferential illusion that occurs in ordinary cognition (see Tversky and Kahneman, 1974, and Fischhoff, 1975; a small simulation sketch below illustrates the effect).

We feel that those who advocate qualitative methods for social science research are at their best when they expose the blindness and gullibility of specious quantitative studies. Field experimentation should always include qualitative research to describe and illuminate the context and conditions under which research is conducted. These efforts often may uncover important site-specific threats to validity and contribute to valid explanations of experimental results in general and of perplexing or unexpected outcomes in particular. We also believe, along with many critics, that quantitative researchers in the past have used poorly framed questions to generate quantitative scores and that these scores have then been applied uncritically to a variety of situations. (Chapters 4 and 7, in particular, highlight some of the abuses associated with traditions of quantitative data analysis which have probably led to many specious findings.) In uncritical quantitative research, measurement has been viewed as an essential first step in the research process, whereas in physics the routine measures are the products of past crucial experiments and elegant theories, not the essential first steps.
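As an aside, the regression-artifact illusion mentioned above is easy to demonstrate. The following sketch in Python is purely illustrative; the population parameters, the error standard deviation, and the bottom-decile selection rule are arbitrary assumptions. It shows how a group selected for extreme pretest scores appears to "improve" on retest although no true change has occurred.

    import random
    import statistics

    rng = random.Random(2)

    # Stable true scores; each observed score adds independent
    # measurement error, so test-retest reliability is well below 1.0.
    true_scores = [rng.gauss(100, 15) for _ in range(10000)]

    def observe(true_score):
        return true_score + rng.gauss(0, 10)  # arbitrary error SD

    pretest = [observe(t) for t in true_scores]
    posttest = [observe(t) for t in true_scores]

    # Select the apparently worst-off decile on the pretest.
    cutoff = sorted(pretest)[len(pretest) // 10]
    chosen = [i for i, s in enumerate(pretest) if s <= cutoff]

    pre_mean = statistics.mean(pretest[i] for i in chosen)
    post_mean = statistics.mean(posttest[i] for i in chosen)

    # With no treatment and no true change, the selected group's mean
    # rises on retest purely through regression toward the mean.
    print(f"selected decile: pretest {pre_mean:.1f} -> posttest {post_mean:.1f}")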
The definitional operationalism of logical positivists has likewise supported the uncritical reification of measures and has encouraged research practitioners to overlook the measures' inevitable shortcomings and the consequences of these shortcomings. A fundamental oversight of uncritical quantifiers has been to misinterpret quantifications as replacing rather than depending upon ordinary perception and judgment, even though quantification at its best goes beyond these factors (Campbell, 1966, 1974, 1975). Experimental and quantitative social scientists have often used tests of significance as though they were the sole and final proof of their conclusions. From our perspective, tests of significance render implausible only one of the many plausible threats to validity that are continually arising. Naive social quantifiers continue to overlook the presumptive, qualitatively judgmental nature of all science. In contrast, the tradition we represent, with its heavy use of the word "plausible," stresses that the scientist must continually judge whether a given rival hypothesis will explain the data. Qualitative contextual information (as well as quantitative evidence on tangential variables) has long been recognized as relevant to such judgments.

Valid as these criticisms are, it is not enough merely to point out the limitations of our methods. Critics should be able to offer viable alternatives. To be superior to the techniques described in the next six chapters, however, the proposed qualitative methods would have to eliminate more of the threats to validity listed in this chapter than do the quantitative methods. In this regard, it is refreshing to note that our humanistic colleague, Howard Becker (1978), has tried to rule out some of the validity threats in research which uses photographs either as evidence or as the means of presenting final research results, no quantification having intervened. Others conducting qualitative research under nonlaboratory conditions also recognize the equivocal nature of any causal inferences drawn from their observations. Many sociologists, anthropologists and historians have attempted to avoid causal explanations, aiming instead for uninterpreted description. Yet careful linguistic analysis of their reports shows that they are rarely successful.