An Experimental Evaluation of Data Flow and Mutation Testing

A. JEFFERSON OFFUTT†
Department of ISSE - 4A4, George Mason University, Fairfax, VA 22030 U.S.A., ofut@isse.gmu.edu

AND

JIE PAN
PRC, 12005 Sunrise Valley Drive, Reston, VA 22091 U.S.A., pan_jenny@prc.com

AND

KANUPRIYA TEWARY
Oracle Corporation, Box 659311, 500 Oracle Parkway, Redwood Shores, CA 94065 U.S.A., ktewary@us.oracle.com

AND

TONG ZHANG
SRA Corporation, 2000 15th Street North, Arlington, VA 22201 U.S.A.

SUMMARY

This paper presents two experimental comparisons of data flow and mutation testing. These two techniques are widely considered to be effective for unit-level software testing, but can only be analytically compared to a limited extent. We compare the techniques by evaluating the effectiveness of test data developed for each. For a number of programs, we develop ten independent sets of test data; five to satisfy the mutation criterion, and five to satisfy the all-uses data flow criterion. These test sets are developed using automated tools, in a manner consistent with the way a test engineer might be expected to generate test data in practice. We use these test sets in two separate experiments. First we measure the effectiveness of the test data that was developed for one technique in terms of the other technique. Second, we investigate the ability of the test sets to find faults. We place a number of faults into each of our subject programs, and measure the number of faults that are detected by the test sets. Our results indicate that while both techniques are effective, mutation-adequate test sets are closer to satisfying the data flow criterion, and detect more faults.

KEY WORDS: Software testing; Data flow; Mutation; Experimentation

† Partially supported by the National Science Foundation under grant CCR-93-11967.

1 INTRODUCTION

Mutation testing and data flow testing are two unit testing techniques that have recently matured enough to be used by industrial software developers. Both techniques are thought to provide a higher level of testing than older techniques such as statement and branch coverage (for example, mutation and data flow subsume statement and branch coverage), but are also more costly to apply, and require automation. Many engineering advances have been made in the past few years to support mutation and data flow, and commercial tools are now available (PiSCES for mutation and ATAC for data flow), which are currently being used in practical situations. Unfortunately, the relative merits of these techniques are still not well understood. Test engineers and test managers need objective, factual studies such as this to make well-informed decisions about testing. In this paper, we try to establish some idea of the practical cost-to-benefit tradeoffs between mutation and data flow testing, based on experience with the techniques. Both techniques are white box in nature [1] and require large amounts of computational and human resources (although recent engineering advances are reducing both types of cost). Although experience has led us to believe there is significant overlap between the two techniques, they have not been successfully compared on either an analytical or experimental basis. We attempt the comparison using two experiments.
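To make the two kinds of test requirement concrete, the small C fragment below illustrates a mutant and a DU-pair. It is an illustrative sketch added for exposition, not one of the subject programs used in the study.

```c
/* Illustrative sketch (not one of the study's subject programs): a tiny
 * function annotated with one example mutant and its DU-pairs. */
int max_of_two(int a, int b)
{
    int m = a;      /* definition of m */
    if (b > a)      /* example mutant: relational operator replacement
                       changes '>' to '<'; a test case "kills" the mutant
                       if the original and mutant outputs differ (any
                       input with a != b does so here) */
        m = b;      /* another definition of m */
    return m;       /* use of m: (def "int m = a", use "return m") is one
                       DU-pair, exercised only by tests with b <= a;
                       (def "m = b", use "return m") is a second DU-pair,
                       exercised by tests with b > a */
}
```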
First, we compare mutation and the all-uses data flow criterion to see whether either method covers the other in the sense of how close test data sets that satisfy one technique come to satisfying the other. Second, we compare mutation and all-uses by executing faulty versions of programs and comparing how many faults are found by test data sets that satisfy each technique. Our results lead us to believe that while mutation offers more stringent testing than data flow does, both techniques provide benefits the other lacks. Our eventual goal is to find a way to test software that provides the advantages of both techniques, either by combining the two techniques or by deriving a new technique that offers the power of both mutation and data flow testing.

The remainder of this section includes a short discussion on the notion of test adequacy criteria, provides overviews of mutation and data flow testing and reviews related research. Subsequent sections present our analytical results, and discuss our experimental procedures and results. Details about the programs and faults we used can be found in a technical report [2].

Adequacy Criteria

There are two aspects of any testing process. The first is test data generation, which may be manual, automated, or a combination of both. The second aspect of a testing process is the stopping rule, or adequacy of the generated test data. Early researchers in mutation defined adequacy as follows: a test set is adequate if, for every fault in the program being tested, there is a test case in the test set that detects that fault [3, 4].

A testing criterion C1 is ProbBetter than C2 for a program P if a randomly selected test set T that satisfies C1 is more "likely" to detect a failure than a randomly selected test set that satisfies C2. Mathur and Wong [17, 18] suggest a different relationship called ProbSubsumes: a testing criterion C1 ProbSubsumes C2 for a program P if a test set T that is adequate with respect to C1 is "likely" to be adequate with respect to C2. If C1 ProbSubsumes C2, C1 is said to be more "difficult" to satisfy than C2. The ProbBetter relation is defined with respect to the fault detection capability of test sets, whereas the ProbSubsumes relation is defined with respect to the difficulty of satisfying one criterion in terms of another. Both are probabilistic relations between two testing criteria and are defined in terms of specific programs. Although this means that it is difficult to draw general conclusions from any one study, as the number and variety of programs studied increases, our confidence in the validity of a ProbSubsumes or a ProbBetter relationship with a larger set of programs also increases.

Mathur and Wong [17, 18] used the ProbSubsumes relation in experimental comparisons of all-uses data flow testing with mutation testing, by manually generating test data to satisfy both criteria and comparing the scores. They used 4 programs and 30 sets of test cases per program and detected equivalent mutants and unexecutable subpaths by hand. This study indicated that mutation-adequate test sets were closer to being data flow-adequate than data flow-adequate test sets were to being mutation-adequate. Their study was quite similar to ours, particularly insofar as we both compared mutation with all-uses data flow. We used different sets of programs and test cases, and different methods of generating the test cases. While we use more programs (10 rather than 4), Mathur and Wong generated more test sets per program (30 rather than 5).
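As a minimal sketch of how such a score can be computed (the data layout and names here are hypothetical; in the studies these scores are reported by the mutation and data flow tools themselves), the following C fragment computes the mutation score of one test set, with equivalent mutants excluded from the denominator.

```c
#include <stdio.h>

/* Minimal sketch, with a hypothetical data layout:
 * killed[t][m] is 1 if test case t kills mutant m; equivalent[m] is 1 if
 * mutant m is equivalent and therefore excluded from the denominator. */
#define N_TESTS   5
#define N_MUTANTS 8

double mutation_score(const int killed[N_TESTS][N_MUTANTS],
                      const int equivalent[N_MUTANTS])
{
    int killed_count = 0, killable = 0;
    for (int m = 0; m < N_MUTANTS; m++) {
        if (equivalent[m])
            continue;                         /* cannot be killed by any test */
        killable++;
        for (int t = 0; t < N_TESTS; t++) {
            if (killed[t][m]) {               /* some test in the set kills it */
                killed_count++;
                break;
            }
        }
    }
    return killable ? (double)killed_count / killable : 1.0;
}

int main(void)
{
    /* Toy data: 5 test cases, 8 mutants, mutant 7 is equivalent. */
    const int killed[N_TESTS][N_MUTANTS] = {
        {1,0,0,1,0,0,0,0},
        {0,1,0,0,0,0,0,0},
        {0,0,1,0,1,0,0,0},
        {0,0,0,0,0,1,0,0},
        {1,1,0,0,0,0,1,0},
    };
    const int equivalent[N_MUTANTS] = {0,0,0,0,0,0,0,1};

    /* Prints 1.00: the set kills all seven non-equivalent mutants and is
     * therefore mutation-adequate.  The all-uses score is computed the
     * same way with "DU-pair covered" in place of "mutant killed" and
     * infeasible DU-pairs excluded; applying one criterion's computation
     * to the other criterion's adequate test sets gives the cross score
     * used to assess a ProbSubsumes relationship. */
    printf("mutation score = %.2f\n", mutation_score(killed, equivalent));
    return 0;
}
```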
The major difference is that we have also measured the fault detection ability of the test case sets. Although no experiment can conclusively show that one test technique is better than another, more and different programs and studies can increase our confidence in the validity of such a conclusion. Mathur and Wong's study developed from a previous study by Mathur [19] that used human testers to compare the difficulty of satisfying another data flow criterion (all-DU-paths [6]) and mutation. This study found that the mutation criterion was more difficult to satisfy than all-DU-paths, but the study had several problems, as detailed by Mathur and Wong [17].

Recently, Tewary [20] devised algorithms for inserting faults into programs using the program dependence graph. These algorithms were demonstrated by inserting faults into programs and comparing the fault detection ability of mutation and data flow testing. She found that when the faults involved changes to the control dependence relations in the program dependence graph, the mutation-adequate and data flow-adequate test sets were almost equally effective in detecting the faults.

EXPERIMENTAL HYPOTHESES AND CONDUCT

We compare mutation and data flow in two different ways. First, it seems reasonable to suppose that if test sets created for one criterion also satisfy another, then the first criterion can in some sense be considered to be "better" than the second. This is the essence of the ProbSubsumes relationship. Thus, we have tried to determine if mutation-adequate test sets always cover data flow, and vice versa. Second, an independent and perhaps more practically useful question is whether test sets created for a testing technique will actually find faults in programs. This is the essence of the ProbBetter relationship. For our comparison, we have formulated the following hypotheses:

Hypothesis 1: Mutation testing ProbSubsumes all-uses data flow.
Hypothesis 2: All-uses data flow testing ProbSubsumes mutation.
Hypothesis 3: Mutation testing is ProbBetter than all-uses data flow.
Hypothesis 4: All-uses data flow testing is ProbBetter than mutation.

We chose 10 program units that cover a range of applications. These programs range in size from 10 to 29 executable statements, have from 183 to 3010 mutants, and have from 10 to 101 DU-pairs. The programs are described in Table 1. For each program, we give a short description and the number of executable Fortran statements. We also give the number of DU-pairs (both predicate and computation) and the number of infeasible DU-pairs, and the number of mutants and equivalent mutants. Because of the nature of the two techniques, programs typically have many more mutants than DU-pairs. There also tends to be a lot of overlap in the test cases in the sense that one test case will usually kill many mutants, and often cover several DU-pairs. The programs are listed in a technical report [2].

Program    Description                           Statements  DU-pairs  Infeasible  Mutants  Equivalent
Bub        Bubble sort on an integer array           11          29         1         338        35
Cal        Days between two dates                    29          28         0        3010       236
Euclid     Greatest common divisor (Euclid's)        11          10         1         196        24
Find       Partitions an array                       28         100        13        1022        75
Insert     Insertion sort on an integer array        14          29         1         460        46
Mid        Median of three integers                  16          30         0         183        13
Pat        Pattern matching                          17          55         3         513        61
Quad       Real roots of quadratic equation          10          15         0         359        31
Trityp     Classifies triangle types                 28         101        14         951       109
Warshall   Transitive closure of a matrix            11          44         0         305        35

Table 1: Experimental Programs.
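The Infeasible and Equivalent columns of Table 1 count test requirements that no test case can satisfy. The C fragment below is an illustrative example of both kinds (it is not one of the study's subject programs); how such requirements were handled in the experiment is described below.

```c
/* Illustrative only: examples of the two kinds of unrealizable test
 * requirement counted in Table 1. */
int abs_value(int x)
{
    int r = x;          /* definition of r                          */
    if (x < 0)
        r = -x;         /* redefinition of r                        */
    if (r < 0)          /* always false by this point ...           */
        r = 0;          /* ... so this definition can never execute */
    return r;           /* use of r                                 */
}

/* Equivalent mutant: changing the first test to "if (x <= 0)" yields a
 * mutant whose output equals the original's for every input (when x == 0,
 * -x == x), so no test case can kill it.
 *
 * Infeasible DU-pair: the definition "r = 0" is unreachable, so the
 * DU-pair (def "r = 0", use "return r") cannot be covered by any test. */
```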
We used three tools for our experimentation. The Mothra mutation system automates the process of mutation testing by creating and executing mutants, managing test cases, and computing the mutation score. We used all twenty-two Mothra mutation operators [21] for this experiment. To generate test data to satisfy mutation, we used Godzilla, an automated constraint-based test case generator that is integrated with Mothra [22]. For the data flow analysis part of the experiment we used ATAC, a data flow tool for C programs developed by Bellcore [23]. ATAC implements all-uses by adding the requirement that, if a predicate uses the same variable in more than one condition, each condition must be evaluated separately. There is no test data generation tool associated with ATAC, thus we generated test data to satisfy all-uses by using a special-purpose random test data generator of our own devising. This tool repetitively generated test cases, keeping test cases that covered new DU-pairs, and throwing away test cases that did not. We feel that these methods of generating test data are realistic in the sense that if a coverage-based criterion is used, they are reasonable ways that test engineers might be expected to generate test data in practice.

Since Mothra tests Fortran-77 programs and ATAC tests C programs, we had to translate each program into both languages. We started with Fortran versions of the programs, and first made sure that the programs did not use any features of Fortran-77 that would not translate directly into C. Then we hand-translated the programs into C, taking care to use as direct a translation as possible so as not to introduce any bias into our results by using different programs. The control flow graphs, executable statements, and DU-pairs were exactly the same for all pairs of programs. We tested our translations by running both versions on every test case that we generated, and comparing the outputs of the two versions. As described below, this amounted to a total of 10 different test sets per program.

We define test requirements to be things that must be satisfied or covered; for example, reaching statements are test requirements for statement coverage, killing mutants are test requirements for mutation, and executing DU-pairs are test requirements for data flow testing. Both mutation and data flow have problems with unrealizable test requirements. Mutation systems create equivalent mutants, which cannot be killed, and data flow systems ask for infeasible DU-pairs to be covered. For each program, as part of our preparation, we identified all equivalent mutants and infeasible DU-pairs by hand.

For each program, we generated test sets that were mutation-adequate and test sets that were data flow-adequate. To avoid any bias that could be introduced by any particular test set, we generated five independent test sets for each criterion. Thus, for each program, we had ten test sets; five mutation-adequate test sets, and five data flow-adequate test sets, for a total of 100 test sets for our ten programs. We consider a minimum test case set for a criterion to contain the smallest number of cases necessary to satisfy the criterion, and a minimal test case set to be a satisfying set such that if any test case was removed, the set would no longer satisfy the criterion. We eliminated redundant test cases (by incrementally adding test cases, and only keeping those that contributed to satisfying the criteria) until we had minimal test sets, but did not attempt to create minimum test sets. The minimal test case sets are in the technical report [2].
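A sketch of the random all-uses generation and redundancy-elimination loop described above might look as follows. This is our own reconstruction under assumptions, not the authors' actual tool: the input shape, the constants, and the tracing callback (which runs the instrumented program and reports which feasible DU-pairs a test case covers) are hypothetical.

```c
#include <stdlib.h>
#include <time.h>

#define N_DU_PAIRS   30        /* hypothetical: feasible DU-pairs of one program */
#define MAX_TESTS    50
#define MAX_ATTEMPTS 100000

typedef struct { int input[3]; } TestCase;   /* hypothetical input shape */

/* Greedy random generation: a candidate test case is kept only if it
 * covers at least one DU-pair not covered by previously kept cases;
 * otherwise it is thrown away.  Each kept case therefore contributed to
 * satisfying the criterion when it was added. */
int generate_alluses_tests(TestCase kept[MAX_TESTS],
                           void (*run_and_trace)(const TestCase *,
                                                 int covered[N_DU_PAIRS]))
{
    int covered[N_DU_PAIRS] = {0};
    int n_covered = 0, n_kept = 0, attempts = 0;

    srand((unsigned)time(NULL));
    while (n_covered < N_DU_PAIRS && n_kept < MAX_TESTS &&
           attempts++ < MAX_ATTEMPTS) {
        TestCase tc;
        for (int i = 0; i < 3; i++)
            tc.input[i] = rand() % 200 - 100;     /* random input values */

        int trace[N_DU_PAIRS] = {0};
        run_and_trace(&tc, trace);                /* which DU-pairs did tc cover? */

        int new_pairs = 0;
        for (int p = 0; p < N_DU_PAIRS; p++)
            if (trace[p] && !covered[p]) {
                covered[p] = 1;
                new_pairs++;
            }

        if (new_pairs > 0)          /* contributes new coverage: keep it */
            kept[n_kept++] = tc;    /* otherwise the case is discarded   */
        n_covered += new_pairs;
    }
    return n_kept;                  /* number of test cases in the reduced set */
}
```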
EXPERIMENTS AND ANALYSIS

We compare the two techniques of all-uses and mutation testing on three bases: a coverage measurement (using the ProbSubsumes relationship), fault detection capability (using the ProbBetter relationship), and the number of test cases required to satisfy each criterion, as a rough measure of relative cost.

Fault Detection Experimentation

To further assess the relative merits of the testing techniques, we inserted several faults into each of the programs, and evaluated the test sets based on the number of faults detected by them. So as to avoid any bias, we introduced faults according to the following considerations:

1. faults must not be equivalent to mutants; otherwise the mutation-adequate test data would by definition detect them,
2. faults should not be N-order mutants (else the coupling effect would indicate that mutation-based test cases should find the fault, biasing our results in favor of mutation),
3. the faults should not have a high failure rate, or the detection becomes trivial.

A general outline of our fault creation procedure is that for each program statement, we attempted to:

1. create multiple related transpositions of variables (e.g., substituting one variable for another throughout, or exchanging the use of two variables),
2. modify multiple, related, arithmetic or relational operators,
3. change precedence of operations (i.e., by changing parentheses),
4. delete a conditional or iterative clause,
5. change conditional expressions by adding extra conditions,
6. change the initial values and stop conditions of iteration variables.

The changes were only applied when a change did not violate one of our considerations. For the most part, these resulted in faults that appear to be realistic in the sense that they look like mistakes that programmers typically make. None of the faults were found by all test cases. Additionally, neither criterion seemed biased towards any of our fault types in the sense that the criterion always found faults of that type. The actual faults are listed in the technical report [2].

To gather the results, we inserted each fault separately, creating N incorrect versions of each program. This allowed us to always know which fault a test case detected when the faulty program failed. The data are shown in Table 4. The Mutation column gives the percentage of faults detected by the mutation-adequate test cases, averaged over the five sets of data for each program, and the Data Flow column gives the percentage of faults detected by the data flow-adequate test cases, averaged over the five sets of data for each program. The mutation sets detected all the faults for six of our ten programs, and the lowest percentage of faults detected was 67% for Find. The data flow sets detected all the faults for two programs, and as few as 15% for one program (Insert). On average, the mutation sets detected 92% of the faults, versus 76% of the faults for the data flow sets. Thus, we conclude that our data support hypothesis 3, but not hypothesis 4.

Program      Faults   Mutation %   Data Flow %
Bub             5         100           92
Cal            10          98           56
Euclid          6          83           83
Find            6          67           47
Insert          4          75           15
Mid             5         100          100
Pat             6         100           87
Quad            6         100          100
Trityp          7         100           86
Warshall        5         100           92
TOTALS/MEAN    60          92           76

Table 4: Percentage of Faults Found by Mutation-Adequate and Data Flow-Adequate Test Data
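Operationally, a test case detects a seeded fault when the faulty version of the program produces output that differs from the correct version's output on that test case. A minimal sketch of the per-program measurement behind Table 4 is shown below; it is our own reconstruction, and the type and callback names are hypothetical stand-ins for running the compiled program versions.

```c
#include <string.h>

#define MAX_OUTPUT 256

typedef struct { int input[3]; } TestCase;   /* hypothetical input shape */

/* run_version(v, tc, out): run version v of the subject program on test
 * case tc and write its output to out.  Version 0 is the correct program;
 * versions 1..n_faults each contain exactly one seeded fault. */
typedef void (*RunVersionFn)(int v, const TestCase *tc, char out[MAX_OUTPUT]);

/* Percentage of the n_faults seeded faults detected by one test set:
 * a fault is detected if some test case in the set makes the faulty
 * version's output differ from the correct version's output. */
double percent_faults_detected(const TestCase tests[], int n_tests,
                               int n_faults, RunVersionFn run_version)
{
    int detected = 0;
    for (int f = 1; f <= n_faults; f++) {
        for (int t = 0; t < n_tests; t++) {
            char expected[MAX_OUTPUT], actual[MAX_OUTPUT];
            run_version(0, &tests[t], expected);   /* correct program       */
            run_version(f, &tests[t], actual);     /* program with fault f  */
            if (strcmp(expected, actual) != 0) {   /* outputs differ: found */
                detected++;
                break;                             /* go to the next fault  */
            }
        }
    }
    return n_faults ? 100.0 * detected / n_faults : 0.0;
}

/* Table 4 reports this percentage averaged over the five mutation-adequate
 * sets and, separately, the five data-flow-adequate sets for each program. */
```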
Test Set Size

Table 5 gives the mean number of test cases for the mutation-adequate test sets and the data flow-adequate test sets for each program. The most obvious observation is that in most cases, mutation requires many more test cases than data flow does. Weyuker [24] discusses comparing the costs of testing criteria based on the number of test cases. With the ability to automatically generate test data, this cost is somewhat less important during initial testing, although the cost of examining the outputs still makes the number of test cases a factor. Additionally, the number of test cases is still important during regression testing.

Program    Mutation Adequate   Data Flow Adequate
Bub               6.6                 1.4
Cal              36.0                 6.2
Euclid            4.0                 1.0
Find             14.0                 6.2
Insert            3.8                 3.0
Mid              24.6                 6.0
Pat              26.4                 5.8
Quad             13.4                 2.0
Trityp           51.4                14.0
Warshall          4.8                 3.2

Table 5: Mean Number of Test Cases Per Set

CONCLUSIONS

In this paper, we have presented results from three empirical studies. First, we compared mutation with all-uses data flow on the basis of a "cross scoring", where tests generated for each criterion are measured against the other. Second, we measured the fault detection of test data generated for each criterion, and compared the results. Third, we compared the two techniques on the basis of the number of test cases generated to satisfy them, in a rough attempt to compare their relative costs.

For our programs, the mutation scores for the data flow-adequate test sets are reasonably high, with an average coverage of mutation by data flow of 88.66%. While this implies that a program tested with the all-uses data flow criterion has been tested to a level close to mutation-adequate, it may still have to be tested further to obtain the testing strength afforded by mutation. However, the mutation-adequate test sets come very close to covering the data flow criterion. The average coverage of data flow by mutation is 98.99% for our ten programs. We can infer that a program that has been completely tested with mutation analysis will usually be very close to having been tested to the all-uses data flow criterion, within one or two DU-pairs of being complete. This provides some evidence that mutation ProbSubsumes all-uses data flow. On the other hand, mutation required more test cases in almost every case than data flow testing did, providing a cost-to-benefit tradeoff between the two techniques.

These conclusions are supported by the faults that the test sets detected. Although both mutation-adequate and data flow-adequate test sets detected significant percentages of the faults, the mutation-adequate test sets detected an average of 16% more faults than the data flow-adequate test sets. The difference was as high as 60% for one program. This provides some evidence that mutation is ProbBetter than all-uses data flow.

Of course, these experiments have limitations that are difficult to avoid in this area. To do certain kinds of statistical analyses of data, we must be able to assume that the data is based on "representative samples" from the population being studied. Unfortunately, we have no way to choose a representative sample of software, test cases, or faults, so we are limited in our ability to use statistical analysis tools to make claims of significance. The programs we studied are also relatively small, which leaves the question of how the conclusions might scale up to large software systems. Although we cannot know without further study, it is worth observing that both mutation and data flow testing are expected to be used for unit testing, not integration or system testing. That is, the techniques are typically applied to subroutines and functions individually, and largely independently of other subroutines and functions.
Although there is ongoing research in applying data flow inter-procedurally, this research has not yet been put into practical use. A positive aspect is that our results are in general agreement with those of Mathur and Wong [17]. Both studies found that mutation offers more coverage than data flow, but at a higher cost. Although their study did not include any fault detection, we also found that mutation-adequate test sets detected more faults, which is in general agreement with both of our other results. The fact that both studies, performed at about the same time by different researchers using different programs and test cases, got similar results, greatly strengthens both conclusions.