Over a decade ago, Jeff Offutt noted, "The field of mutation analysis has been growing, both in the number of published papers and the number of active researchers."33 This trend has since continued, as confirmed by a survey of recent literature.36
Mutation analysis is "the use of well-defined rules defined on syntactic descriptions to make systematic changes to the syntax or to objects developed from the syntax."33 It has been successfully used in research for assessing test efficacy and as a building block for testing and debugging approaches. It systematically generates syntactic variations, called mutants, of an original program based on a set of mutation operators, which are well-defined program transformation rules. The most common use case of mutation analysis is to assess test efficacy. In this use case, mutants represent faulty versions of the original program, and the ratio of detected mutants quantifies a test suite's efficacy. Empirical evidence supports the use of systematically generated mutants as a proxy for real faults.2,5,19 Another use case is automated debugging (for example, Gazzola et al.10 and Ghanbari et al.11). In this use case, mutants represent variations of a faulty program and are used either to locate the fault or to iteratively mutate the program until it satisfies a given specification (for example, passes a given test suite).
Mutation analysis can be applied at different levels, including design and specification level, unit level, and integration level. Similarly, it can be applied to both models and programs. For example, prior work applied mutation analysis at the design level to finite state machines, state charts, Estelle specifications, Petri nets, network protocols, security policies, and Web services.16
Mutation-based testing is a testing approach that leverages mutation analysis, using mutants as test goals to create or improve a test suite. Mutation-based testing has long been considered impractical because of the sheer number of mutants that can be generated, even for small programs. It is now increasingly adopted in industry, in part due to a shift in perspective, including the notion of incremental, commit-level mutation, suppression of unproductive mutants, and the focus on individual mutants as opposed to adequacy regarding mutant detection.3,37,38,39
This article characterizes the empirical studies that analyzed and compared Java mutation tools based on a rapid review of the research literature. Additionally, we propose a framework for comparing mutation tools along five dimensions: tool version, deployment, mutation process, user-centric features, and mutation operators. Finally, we apply this framework to highlight the similarities and differences of eight state-of-the-art Java mutation tools.
Figure 1 visualizes a common mutation analysis process and how mutation-based testing is a specific use case and instantiation of that process. Readers can find related process descriptions in Papadakis et al.36 and Jia et al.,16 which, like ours, are adaptations of Offutt and Untch's original formulation of mutation analysis.32 The literature has largely used the terms mutation analysis and mutation testing interchangeably, but we make the distinction more precise because other mutation-based approaches and use cases exist (for example, test suite reduction, fault localization, or program repair). To avoid ambiguity, we use the terms mutation analysis and mutation-based testing. Mutation analysis involves two main steps—mutant generation and test suite execution. Mutation-based testing iteratively applies mutation analysis, until a stopping condition is met, and involves two additional steps—test suite augmentation and (possibly) program repair.
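The two main analysis steps and the iterative testing loop described above can be sketched as follows. This is a deliberately simplified, runnable illustration of our own, not any tool's implementation: real tools operate on compiled classes and JUnit suites rather than on in-memory lambdas, and the mutants and tests here are hand-written stand-ins.

```java
import java.util.*;
import java.util.function.IntBinaryOperator;

public class MutationLoop {
    // A "test" is an input pair plus the expected result of the original program.
    record Test(int a, int b, int expected) {}

    // Step 2 of the process: execute the suite against each mutant and
    // collect the mutants that no test kills (the live mutants).
    static List<String> liveMutants(Map<String, IntBinaryOperator> mutants,
                                    List<Test> suite) {
        List<String> live = new ArrayList<>();
        for (var m : mutants.entrySet()) {
            boolean killed = suite.stream().anyMatch(
                t -> m.getValue().applyAsInt(t.a(), t.b()) != t.expected());
            if (!killed) live.add(m.getKey());
        }
        return live;
    }

    public static void main(String[] args) {
        // Step 1: mutant generation (here: a fixed, hand-written set of
        // variants of an original "a + b" program).
        Map<String, IntBinaryOperator> mutants = new LinkedHashMap<>();
        mutants.put("m1", (a, b) -> a - b);
        mutants.put("m2", (a, b) -> a * b);

        List<Test> suite = new ArrayList<>(List.of(new Test(2, 0, 2)));
        // Mutation-based testing: iterate until the stopping condition
        // (here: no live mutants) is met.
        List<String> live;
        while (!(live = liveMutants(mutants, suite)).isEmpty()) {
            System.out.println("live: " + live);
            // Step 3: test suite augmentation. A tester would write a test
            // targeting each live mutant; m1 (a - b) survives until a test
            // passes a non-zero second argument.
            suite.add(new Test(2, 3, 5));
        }
        System.out.println("final suite size: " + suite.size());
    }
}
```

The loop terminates here because the added test kills every remaining mutant; a real stopping condition would typically be a score threshold or a test budget, as noted below.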
As an example, consider Figure 2—a mutation analysis, with an original program and a corresponding test suite. First, the analysis generates the three mutants (m1–m3), each by applying a mutation operator to the return statement of the original program. Next, the analysis executes each test against each mutant and computes the kill matrix shown in the lower-right corner. A test that detects a mutant is said to kill that mutant. A mutant that is not killed by any test is referred to as a live mutant. Finally, the analysis reports on the results, indicating the mutation score, the set of live mutants, and the kill matrix. While the mutation score is usually defined as the ratio of killed to all non-equivalent mutants, most tools approximate it and report the number of killed mutants divided by the total number of generated mutants. The reason is that the set of equivalent mutants is unknown: reasoning about program equivalence is an undecidable problem. Note that the computation of a complete kill matrix is not required for all use cases. For example, if the goal is to simply compute the mutation score and the set of live mutants, then t3 need not be executed against m2 after t2—m2 is already known to be killed at that point. Indeed, a kill matrix is rarely, if ever, computed in mutation-based testing because it is computationally expensive.
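Since Figure 2 is not reproduced here, the following hypothetical reconstruction illustrates the example. The method name, the concrete mutants, and the tests are our own stand-ins, chosen to be consistent with the description in the text (one killed mutant, one live mutant, one equivalent mutant).

```java
// Hypothetical reconstruction of the Figure 2 example: an original add
// method, three mutants produced by operators on its return statement,
// and a test suite whose tests all pass 0 as the second argument.
public class MutationExample {
    static int add(int a, int b) { return a + b; } // original program
    static int m1(int a, int b) { return a - b; }  // + replaced by -: live,
                                                   // since a - 0 == a + 0
    static int m2(int a, int b) { return a * b; }  // + replaced by *: killed
    static int m3(int a, int b) { return b + a; }  // operand swap: equivalent

    public static void main(String[] args) {
        // t1: add(0, 0) == 0  kills no mutant (all agree on 0)
        // t2: add(2, 0) == 2  kills m2 only (2 * 0 == 0, not 2)
        // t3: add(3, 0) == 3  kills no additional mutant
        // Reported (approximate) score: 1 killed / 3 generated.
        // True score: 1 killed / 2 non-equivalent, since m3 is equivalent.
        System.out.println(m2(2, 0)); // 0, differs from add(2, 0) == 2
        System.out.println(m1(2, 0)); // 2, identical to the original: live
        // Only a test with a non-zero second argument, for example
        // add(2, 3) == 5 versus m1(2, 3) == -1, would kill m1.
    }
}
```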
The mutation analysis results in Figure 2 show that two out of three mutants are live and that t1 does not kill any mutants. Generally, a test suite that fails to kill most of the mutants is deficient and should be improved. The core idea of mutation-based testing is to use live mutants as concrete test goals. In the example in Figure 2, mutant m1 is a live mutant and indicates the test suite lacks a test case—one that passes a non-zero argument to the second parameter of the add method. Mutation-based testing repeats this iterative process of adding tests based on mutants until a stopping condition is met, for example, a given mutation score threshold or a fixed test budget.
Not every mutant can be killed. An equivalent mutant is semantically equivalent to the original program and cannot be killed by any test. Mutant m3 in Figure 2 is an example of an equivalent mutant. Moreover, not every mutant that can be killed should be killed. Traditionally, killable mutants were generally deemed desirable because they lead to tests; conversely, equivalent mutants were generally deemed undesirable. Petrović et al.,39 however, noted that this classification is unworkable in practice and insufficient to capture the notion of developer productivity. For example, developers justifiably should not and, in practice, will not write a test for a killable mutant if that test would be detrimental to the test suite quality, in particular maintainability. Conversely, equivalent mutants may point to actual program issues, prompting developers to make meaningful improvements to the code itself. Petrović et al. introduced the notion of productive mutants.
A mutant is productive if the mutant is killable and elicits an effective test, or if the mutant is equivalent but its analysis and resolution improves code quality. For example, a mutant that changes the initial capacity of a Java collection (for example, replacing new ArrayList(20) with new ArrayList(10)) is unproductive. While such a mutant is theoretically killable by writing a test that asserts on the collection capacity or expected memory allocations, it is unproductive to do so because the corresponding test would be brittle and would not test actual functionality. Note that the notion of productive mutants is qualitative: different developers may sometimes reach different conclusions as to whether a test is effective.
Building on what previous comparative studies propose, we collected evidence from the literature to understand how mutation tools have been compared and whether some mutation tools consistently outperform others along multiple dimensions.
We adopted a rapid review (RR)41 process for this purpose. RRs are literature review processes that are less formal than systematic mappings (SMs) and systematic literature reviews (SLRs) but, like them, follow a well-structured selection process. Hence, RRs can be further analyzed, replicated, and improved by other researchers. According to Cartaxo et al.,4 the main goal of an RR is to reduce the amount of time needed to gather, analyze, interpret, review, and publish evidence that could benefit practitioners. To achieve this goal, RRs deliberately omit or simplify steps of traditional SLRs (for example, limiting the literature search, using just one person to screen studies, or skipping formal synthesis). Our RR process relies on the following four sequential steps:
Scopus search string definition and application. We defined a search string, specifically crafted for the Scopusa search engine, shown in Figure 3. The key rationale is to search for studies in computer science that present an empirical evaluation or an experimental comparison of mutation tools. We applied this search string to the Scopus engine in November 2020, and it returned 187 results.
Primary studies selection. Our goal was to select primary studies presenting an empirical evaluation comparing two or more Java mutation tools. To that end, we divided the 187 search results into two sets of 93 and 94 studies. Two of the authors independently analyzed the studies of the two sets. Specifically, each researcher identified papers that satisfied all four of the following inclusion criteria:
I1: The study presents an empirical study involving at least two Java mutation tools.
I2: The study considers Java mutation tools that are publicly accessible and free of charge.
I3: The study analyzes Java mutation tools that are described in at least one publication.
I4: The authors of the study are different from the authors of the analyzed mutation tools.
Data extraction. One of the authors fully read the five primary studies and extracted sentences to collect evidence on which Java mutation tools were analyzed and how these tools were compared. These sentences were stored in a spreadsheet file for analysis.
Data analysis and abstraction. We adopted a Delphi method, which is commonly used when the problem under analysis can benefit from collective and subjective judgments or decisions and when group dynamics do not allow for effective communication (for example, time differences, distance).14 Three of the authors, in weekly meetings, iteratively analyzed the extracted data, resolved ambiguity, and converged onto the final abstraction shown in Table 1. Based on a final data analysis, we made three key observations.
Observation 1. The five primary studies compared a total of eight Java mutation tools. Table 1 lists these tools together with the references that report on how these tools are implemented and how they are used. As shown in Table 1, PIT is the only tool analyzed in all five studies, followed by Major and MuJava (four studies), Jumble and Judy (three studies), Jester (two studies), and Bacterio (one study).
To ensure our analysis did not miss other Java mutation tools, we executed an additional query in the Scopus database, using the search string in Figure 4.
This query searches more broadly for Java mutation tools presented in the literature. We found three additional Java-specific mutation tools: HOMAJ,34 Paraμ,27 and JavaMut.6 We did not include these three tools for two main reasons: no empirical comparison considered them, and they are not available. To the best of our knowledge, our list of mutation tools, reported in Table 1, represents the state of the art in available Java mutation tools.
Observation 2. The empirical studies that compared the mutation tools used different study designs and measures. More concretely, the studies differ in three main aspects: how the test cases for killing mutants were generated, what evaluation metrics were adopted, and what Java subjects were selected. We observed three distinct approaches for test case generation: manually writing test cases; automatically generating test cases using tools such as EvoSuite9 or Randoop35; and using existing test cases (that is, the test cases distributed with the subject application). Further, we observed a total of 10 adopted metrics, nine of which evaluate the effectiveness of the mutation tools and one their efficiency. Almost all the studies reported on absolute measures such as the mutation score, number of mutants, and number of test cases. S1, S2, and S3 reported on additional measures:
Only two studies considered the efficiency of the mutation tools. In both cases, the costs of adopting the mutation tools were evaluated in terms of mutation analysis execution time. The five primary studies considered either simple Java classes or real-world projects from the Defects4J benchmark.18
Observation 3. Only four of the eight tools (PIT, Major, Jumble, and Judy) did not present any limitations when executed on real-world projects. For example, prior studies excluded mutation tools from parts of their empirical evaluations because of tool limitations (see S1, S2, and S5 for further details).
Given the diversity of study designs and measures, there is no clear evidence that one of the eight tools consistently outperforms the others—particularly when considering different use cases. While PIT and Major overall achieve slightly better results in most of the empirical evaluations, there is insufficient information to compare the tools for different use cases.
The analysis of the selected papers showed that Java mutation tools were compared from the point of view of the features they offer. To provide a comprehensive, unified representation of the different ways the Java mutation tools can be qualitatively compared according to their features, we inferred the mutation tool comparison framework shown in Figure 5. This model describes each tool along five dimensions, each with one or more attributes. The gray boxes represent dimensions or attributes that were already used as comparative parameters in the primary studies we analyzed, while the white boxes represent the novel dimensions and attributes we introduced to provide additional details for comparing mutation tools. Overall, we introduced 11 novel attributes, for a total of 21 attributes across five dimensions.
Version. This dimension characterizes the version of the tool and provides some indication about its level of obsolescence. It has the following three attributes:
Deployment. This dimension typifies the requirements of the execution environment where the mutation tool can be installed and executed. It has the following three attributes:
Mutation process. This dimension describes the features provided by the tool in supporting the execution of mutation analysis processes. It has the following seven attributes:
User-centric features. This dimension describes the "pick and use" characteristics of the tool. It has the following four attributes:
Mutation operators. This dimension expresses the tool's ability to implement different classes of mutation operators. Due to the lack of a unified approach for describing the operators actually implemented by each tool in all the primary studies we analyzed, we decided to abstract a set of reference mutation operator classes according to the official Java documentation.b This dimension has the following four attributes to represent classes of mutation operators:
Our goal was to describe the eight mutation tools according to the proposed framework. To that end, we first abstracted all the possible values that can be assumed by the framework attributes and then we outlined the mutation tools according to these values. We used a process that involved three different surveys for inferring these values: literature and documentation survey, student survey, and tool-author survey. Table 2 shows, for each framework attribute, which survey(s) we adopted for inferring its possible values. We used the literature and documentation survey to assess all attribute values for each tool. Additionally, we used the student and tool-author surveys to validate and improve the attribute values inferred by the literature and documentation survey and infer missing information.
Literature and documentation survey process. This process was performed in two steps. First, we performed a snowballing procedure.43 Starting from the primary studies reported in Table 1, we gathered from the literature additional published papers describing in detail the selected tools (how they work, how they can be used for mutation analysis, and how they have been designed and implemented). Additionally, we consulted the tools' official documentation (user manual, technical report, and more). Afterwards, we followed a Delphi-type cycle during which three researchers read the collected documents, classified the tools based on the framework and explained their judgment.
Student survey process. We performed a user study with 46 MSc students in computer science. The user study involved an exit survey and had three main steps. The first step involved a full, theoretical lecture (90 minutes) on mutation analysis and mutation-based testing. After this lecture, the students were divided into groups of two or three members, and each group was assigned a mutation tool. The second step was designed to provide a hands-on training session for employing a mutation tool, using a simple Java class. During this step, an instructor provided guidance and focused on consolidating the students' theoretical knowledge, resolving open questions, and addressing any problems reported by the students. The simple Java class served as a didactic example to bring the students up to speed on using a mutation tool. The third step aimed at assessing the tools' characteristics and inferring some attributes by running the tools. After providing sufficient background and preparation, the goal of this step was to assess to what extent the students were able to install and use a mutation tool for a more complex Java class, relying on the provided documentation. To avoid bias, we randomly assigned a new tool to each group of students.
No student had experience with mutation analysis and mutation tools prior to the lecture in the first step. The Java class used in the second step was Triangle.java, which contains a single method with three integer parameters, representing the lengths of a triangle's sides. This method's return value indicates the type of the triangle. At the end of the second step, all students successfully used the assigned tool and applied the learned concepts related to mutation-based testing. All tools described in Table 2, except Jester, were used in this step. The main reason is that Jester requires mutation operators to be provided in a configuration file, which was out of the scope of this work. Instead of asking students to use Jester, one of the authors installed the tool and used it to assess its attributes. Consequently, Jester was excluded from the third step. The third step randomly assigned a new mutation tool to each group and asked them to use these tools to perform mutation-based testing over three classes from the Defects4J benchmark: one from the Cli project, com.google.gson.stream.JsonWriter (Gson), and org.apache.commons.lang.time.DateUtils (Lang). We aimed at using PIT, MuJava, Major, Jumble, Judy, and Bacterio for the third step, but had to exclude MuJava and Bacterio because we were unable to run them on the Defects4J classes: MuJava gave errors when trying to generate mutants, and Bacterio was unable to execute the test cases.
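The training class from the second step (Triangle.java) might be sketched as follows. This is a hypothetical reconstruction: the study's exact class is not shown in the article, and the return-code convention here is our own.

```java
// A sketch of a Triangle class of the kind described in the text: one
// method, three integer side lengths, return value indicating the type.
public class Triangle {
    // Illustrative convention: 0 = not a valid triangle, 1 = scalene,
    // 2 = isosceles, 3 = equilateral
    public static int classify(int a, int b, int c) {
        if (a <= 0 || b <= 0 || c <= 0) return 0;          // non-positive side
        if (a + b <= c || a + c <= b || b + c <= a) return 0; // inequality fails
        if (a == b && b == c) return 3;
        if (a == b || b == c || a == c) return 2;
        return 1;
    }

    public static void main(String[] args) {
        System.out.println(classify(3, 4, 5)); // 1 (scalene)
        System.out.println(classify(2, 2, 3)); // 2 (isosceles)
        System.out.println(classify(1, 1, 3)); // 0 (violates inequality)
    }
}
```

Conditions like these are dense in relational and logical operators, which makes such a class a natural target for hands-on mutation exercises.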
The students installed the tools in a predefined environment and configuration: an Oracle VM running Ubuntu 20.04, using Java 1.8 and JUnit 4. After installation, students had to create mutants for the assigned class, perform mutation analysis to determine the number of killed and live mutants, and perform mutation-based testing to develop additional test cases to kill live mutants. At the end of the user study, students had to deliver a report and complete an exit surveyc with nine questions. Each question evaluated a tool characteristic on a scale of 1–5 and elicited a justification for the choice made. The final question allowed students to provide additional information.
Tool-author survey process. We implemented a surveyd in Google Forms and sent it to the tool authors, asking them to describe their tool according to the proposed framework. We designed this survey to collect information that was neither clearly reported in the tool's online documentation nor directly inferable by using the tool. As such, the authors' answers complemented the data extracted from the other surveys. By analyzing the answers, we were able to validate the data extracted in the literature survey and collect additional information we were not able to find elsewhere. In case discrepancies arose, the author survey overrode our initial, partial findings. The survey included seven specific and four open-ended questions. The seven specific questions had an "Other" field where the respondents were free to extend the options we provided as possible answers. Three of the four open-ended questions elicited more information about equivalent mutant prevention, analysis runtime reduction, and the required inputs and produced outputs. The mutation operators implemented by the tools were inferred by means of the literature and author surveys. The last open-ended question elicited suggestions and comments about our research.
Tools description according to the inferred attribute values. The data collected through the surveys were merged into a single spreadsheet file and discussed by three researchers in weekly meetings. The attribute values were inferred after unanimous consent was reached. Table 3 shows the values of the attributes we abstracted and the descriptions of the eight mutation tools according to them.
Version. As for the License attribute, we observed that six tools have a free software license, such as Apache 2, GPL, or LGPL. Jester and Bacterio are considered freeware, since they provide only an executable .jar file that can be freely used but do not distribute the source code. Regarding version control, all the tools rely on a version control system; only Javalanche does not provide release tracking. As for the Release version, most of the tools have not changed for five or more years (MuJava, Jumble, Javalanche, Jester, and Judy), whereas PIT, Major, and Bacterio have very recent updates.
Deployment. Regarding the Java version, we observed that PIT, Major, Jumble, and Judy worked with no limitations on large-scale projects developed in Java 1.8, whereas MuJava and Bacterio ran only on simple Java classes developed in Java 1.8. When executed on real projects, MuJava raised unhandled exceptions, whereas the behavior of Bacterio was unreliable; that is, it produced different mutants in different executions on the same code. We were not able to run Javalanche with Java 1.8, so we used Java 1.6. However, even with Java 1.6, students experienced many problems with the installation process and could neither generate mutants nor run test cases. Jester was not used in our experiments, so we do not have practical evidence that it works on large-scale projects. However, it worked with Java 1.8 on a small project.
As for the build-tool integration attribute, we observed that Javalanche and Jester need additional tools in the running environment: Javalanche requires Ant to compile, assemble, test, and run both the mutation tool and the test cases, and Jester relies on Python for running its scripts. All the remaining tools are self-contained, and some of them, such as PIT, Major, and Jumble, can use optional additional tools for extra features. PIT supports several build tools, including Maven, Ant, and Gradle. It can also be installed as an Eclipse plugin, like Jumble, or as an IntelliJ plugin. Major is distributed along with its own Ant version. As for the Testing framework, all the tools need JUnit for running the test cases; PIT is the only one that also supports test cases developed in TestNG. Almost all tools support the automatic execution of test cases developed in JUnit 4, except for Jester and Bacterio, which work only with JUnit 3.
Mutation process. Regarding the mutation level, five of the eight tools work exclusively at the byte code level, two tools—MuJava and Major—apply mutations at both the source code and byte code levels, and only Jester works exclusively at the source code level.
As for the test selection, PIT, Major, and Javalanche provide an automatic mechanism that selects the test cases to execute based on code coverage. For the other five tools, the tester selects the tests manually. In Jester and Judy, the tester must remove the tests that should not be run from the folder where they are placed. In Jumble, the JUnit tests to run must be listed through the command line. MuJava and Bacterio allow the tester to select tests through the GUI.
Concerning the mutation operator selection, we observed that six of the eight tools can be configured to apply selected mutation operators, even if they belong to different classes. Bacterio and Jumble are less configurable, since they can only execute all the operators belonging to selected classes; MuJava provides both options.
Regarding mutant inspection, Bacterio and MuJava provide side-by-side windows showing the original code alongside the mutated versions. However, students were not always able to make those windows appear in Bacterio. All other tools provide information about the mutation operators applied to the lines of code (LoC), but Jumble provides this data only for live mutants.
As for the kill matrix, we observed that four tools do not render a kill matrix. Jumble provides a very coarse-grained kill matrix, showing for each mutant which test cases kill it. MuJava, Major, and Bacterio generate fine-grained kill matrices, displaying both the test cases and the test methods killing mutants.
As for the analysis runtime reduction, we observed that all the tools implement at least one reduction mechanism or a combination of them. From the answers we gathered from the authors, we were able to infer six possible values for this attribute. Code coverage indicates strategies that reduce the execution time of subsequent iterations by prioritizing or excluding the execution of mutation operators based on the code coverage achieved by a set of tests. Test order refers to mechanisms where the execution order or the exclusion of the test cases is determined before they are launched. Mutant ranking is a priority-based mechanism that defines the execution order of the mutation operators. Infinite loop prediction is a strategy that predicts and excludes mutants that may produce infinite loops in the mutated code. In parallel execution, mutation tools run in two or more JVMs in parallel. The limited mutants mechanism executes only a limited number of mutants.
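The first of these mechanisms, code-coverage-based reduction, can be sketched as follows. This is our own minimal sketch, not any tool's implementation; the coverage map and method names are hypothetical. The key insight is that a test that never executes the mutated line cannot kill the mutant, so it need not be run.

```java
import java.util.*;

public class CoverageBasedSelection {
    // Given a hypothetical coverage map (test name -> set of covered line
    // numbers), return only the tests that cover the mutated line; the
    // rest cannot kill the mutant and are skipped.
    static List<String> testsToRun(Map<String, Set<Integer>> coverage,
                                   int mutatedLine) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Set<Integer>> e : coverage.entrySet()) {
            if (e.getValue().contains(mutatedLine)) {
                selected.add(e.getKey()); // this test may kill the mutant
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> coverage = new LinkedHashMap<>();
        coverage.put("t1", Set.of(1, 2));
        coverage.put("t2", Set.of(2, 3));
        coverage.put("t3", Set.of(4));
        // A mutant on line 3 only needs t2; t1 and t3 are skipped entirely.
        System.out.println(testsToRun(coverage, 3));
    }
}
```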
Regarding equivalent mutant prevention, we observed that only four tools implement a mechanism for reducing the generation of equivalent mutants. PIT and Jumble provide a statistical one, executing only operators with a low probability of generating equivalent mutants. The code-based approach, implemented by Major and Javalanche, avoids the generation of equivalent mutants by relying on code knowledge obtained through AST analysis or by considering the code coverage reached by the test cases.
User-centric features. Regarding the user interface, we observed that almost all the tools provide a command-line interface (CLI), with the exception of Bacterio, which provides only a graphical user interface (GUI). By default, MuJava is used through its GUI, but it provides muScript, a CLI allowing direct access to MuJava's key functionality. Moreover, PIT, MuJava, and Jumble distribute plugins that extend well-known Java IDEs such as Eclipse and IntelliJ. As for the required inputs, all the tools need test cases as input. Moreover, five of the eight tools require as input the Java byte code of the classes to be mutated; the remaining three tools work on the Java source code. PIT requires both source and byte code: even though it applies mutations to the compiled code, it needs the source code for evaluating the code coverage reached by the test cases.
Regarding the produced outputs, all the tools produce a report summarizing the results of the tool execution. PIT and Major also produce a code coverage report showing the code executed by the test cases. Likewise, PIT and Major classify killed mutants, distinguishing between mutants that crash during execution, mutants that are killed by an assertion, and mutants that time out. Other tools only provide aggregate metrics, such as the number of generated and killed mutants. Moreover, MuJava, Bacterio, Jester, and Major produce the mutated source code, while Bacterio, PIT, Major, Javalanche, and Judy generate mutated byte code. Whether mutated source code or mutated byte code is preferable depends on the use case. For instance, generating source code mutants is important for mutation-based testing, which involves reasoning about the mutated code to develop new tests. In contrast, mutated byte code may be more efficient and sufficient for a mutation analysis that simply measures test-suite efficacy. Bacterio also produces a reduced JUnit test suite.
As for the documentation quality, we observed that Bacterio and PIT provide good documentation. However, even with good documentation, Bacterio was very difficult to use: the tool did not behave as described in the documentation, and some students were not able to select or execute the test cases. Major and Jumble do not provide sufficiently clear documentation, and students felt they would need more information, for instance, to help them with the installation process. The documentation of MuJava, Javalanche, Jester, and Judy is clearly insufficient and needs to be improved. MuJava, Javalanche, and Jester do not provide enough information to support the installation process. Judy was difficult to use, and more information is needed to interpret its final report. Also, Jester does not provide enough information to understand how the configuration file may be built or updated.
Mutation operators. In our analysis we identified 12 different types of mutation operators, corresponding to the four attributes reported in the framework:
As Table 3 shows, MuJava and Major cover the most mutation types (10 out of 12), followed by PIT, Jumble, and Judy (each covering nine out of 12). Jester covers the fewest mutation operator types (four out of 12). From the operators' perspective, arithmetic, unary, and relational mutation operators are implemented by all the tools, followed by primitive data types and conditional operators, which are supported by most of the tools. It is interesting to note that no tool provides mutation operators able to inject mutations related to the concurrent nature of Java. Also, the inheritance and polymorphism mutation operators have low support, since they are implemented by only three tools.
We assert that the question of what the best or most suitable mutation tool is has no generic answer and depends on the concrete use case. Specifically, when researchers, educators, or practitioners select a mutation tool, their choice is motivated by different considerations.
To better understand what considerations are most important and to what extent our framework is sufficiently detailed, we created three questionnaires with identical questions for research,e education,f and practice.g These questionnaires were anonymous, based on our framework attributes, and included 21 specific questions plus one open-ended question for additional comments. We sent links to these questionnaires to contacts in academia and industry, whom we also asked to share them with other colleagues in the mutation analysis and mutation-based testing domain.
We received a total of 47 answers: 24 from researchers, 14 from educators, and 9 from practitioners. Based on the answers, this section outlines common and use-case-specific considerations. For simplicity, it refers to "important," "very important," and "mandatory" responses collectively as important considerations. It also shows how the proposed framework can aid in selecting a suitable tool for research, education, and practice by linking important considerations to attributes in Table 3.
Common considerations. It is desirable to select a mutation tool that is actively maintained and evolving. This increases the chances of using a state-of-the-art approach and timely resolving questions and issues. Another consideration is compatibility of a mutation tool, particularly regarding supported testing frameworks and language features. A major challenge for comparing mutation tools is the lack of standardized descriptions and labels for supported mutation operators. Mutation tools may name the same mutation operator differently, they may use the same name for related yet distinct mutation operators, or they may apply the same mutation operator in different scopes. Consider Table 4, which demonstrates this challenge. Specifically, this table shows what mutants each of three mutation tools generate for the statement
return x+10;. To produce this table, we applied three tools, MuJava, Major, and Jester (enabling all mutation operators they support), to the return statement. MuJava applied 15 mutation operators, Major six, and Jester just one. Moreover, MuJava and Major name related mutation operators differently, and Jester does not name them at all. Finally, efficiency is a general concern, but specific requirements may differ between use cases.
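To illustrate the kind of mutants Table 4 describes, the sketch below hand-writes a few arithmetic-operator-replacement mutants of return x+10;. The mutant set and method names are illustrative, not the exact output of MuJava, Major, or Jester, and tool-specific operator labels for this category differ (for instance, MuJava's AORB versus Major's AOR).

```java
public class ReturnMutants {
    static int original(int x)  { return x + 10; }

    // Illustrative arithmetic-operator-replacement mutants:
    static int mutantSub(int x) { return x - 10; } // + replaced by -
    static int mutantMul(int x) { return x * 10; } // + replaced by *
    static int mutantDiv(int x) { return x / 10; } // + replaced by /
    static int mutantMod(int x) { return x % 10; } // + replaced by %

    public static void main(String[] args) {
        // A single input, x = 5, distinguishes all four mutants from
        // the original (15 vs. -5, 50, 0, and 5, respectively).
        System.out.println(original(5));   // 15
        System.out.println(mutantSub(5));  // -5
        System.out.println(mutantMul(5));  // 50
        System.out.println(mutantDiv(5));  // 0
        System.out.println(mutantMod(5));  // 5
    }
}
```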
Based on the responses to our questionnaires, we observed the following. Out of 47 respondents:
PIT and Major satisfy all three considerations. Javalanche, Jester, and Bacterio are compatible with older versions of Java or JUnit. A related, cross-sectional concern is whether a tool works on sufficiently complex code (in Table 3, two asterisks next to the Java version indicate that a tool does not). The tools working on real projects are PIT, Major, Jumble, and Judy.
Furthermore, 44 out of 47 respondents consider a tool's capability for mutant inspection important.
Out of 47 respondents:
The tools fulfilling these requirements are MuJava, Major, and Jester. An interesting observation is that 38 out of 47 respondents do not consider mutated byte code an important tool output (that is, they answered "not important" or "do not care"). Additionally, 24 out of 47 respondents prefer a side-by-side visualization of the mutated and original code; MuJava and Bacterio are the only tools supporting this functionality.
Finally, 45 out of 47 respondents rate comprehensive Documentation quality as important, which is something most tools lack: only PIT and Bacterio provide good documentation. Additionally, 39 out of 47 consider a standardized description of the supported Mutation operators important.
Research. Selecting a mutation tool for research purposes requires some unique considerations. For example, foundational research exploring the effectiveness of individual mutation operators and mutant selection strategies (for example, Gopinath et al.,13 Just et al.,20 Kurtz et al.,24 and Zhang et al.44), requires a high degree of configurability for selecting mutation operators. Furthermore, studying subsumption relationships and redundancy1,23 requires the computation of a complete kill matrix.
From the answers to our questionnaires, we observed the following. Out of 24 researchers:
Tools satisfying these requirements are MuJava, Major, and Bacterio. The most configurable tools are MuJava and Major, which support the selection of individual and classes of mutation operators.
Finally, a response to the open-ended question supports the notion of different use cases, suggesting that a complete kill matrix should be computed during "execution in research mode."
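The complete kill matrix mentioned above records, for every test and every mutant, whether that test detects that mutant, rather than stopping at the first detection. A minimal sketch of this computation follows; encoding mutants and the original as `IntBinaryOperator` values is an illustrative simplification (real tools run compiled mutants against a test suite), and all names here are hypothetical.

```java
import java.util.Arrays;
import java.util.function.IntBinaryOperator;

public class KillMatrixSketch {
    // A test kills a mutant if the mutant's output differs from the
    // original's output on that test input. A complete kill matrix runs
    // every test against every mutant instead of stopping at the first kill.
    static boolean[][] compute(IntBinaryOperator original,
                               IntBinaryOperator[] mutants,
                               int[][] testInputs) {
        boolean[][] kills = new boolean[testInputs.length][mutants.length];
        for (int t = 0; t < testInputs.length; t++) {
            int x = testInputs[t][0], y = testInputs[t][1];
            int expected = original.applyAsInt(x, y);
            for (int m = 0; m < mutants.length; m++) {
                kills[t][m] = mutants[m].applyAsInt(x, y) != expected;
            }
        }
        return kills;
    }

    public static void main(String[] args) {
        IntBinaryOperator original = (x, y) -> x + y;
        IntBinaryOperator[] mutants = {
            (x, y) -> x - y,    // arithmetic mutant: + replaced by -
            (x, y) -> x + y + 1 // hypothetical off-by-one mutant
        };
        int[][] tests = { {0, 0}, {2, 3} };
        for (boolean[] row : compute(original, mutants, tests)) {
            System.out.println(Arrays.toString(row));
        }
        // Test (0,0) kills only the second mutant; test (2,3) kills both.
    }
}
```

Such a matrix is the raw input for studying subsumption and redundancy: a mutant whose kill-matrix column is implied by another mutant's column adds no discriminating power to the suite.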
Education. The selection of a mutation tool for educational purposes is less dependent on access to a highly configurable tool that implements state-of-the-art approaches or achieves a high degree of developer productivity. Likewise, the goal for this use case is not for students to overcome the difficulties of the installation process and a steep learning curve, but rather to learn important concepts related to mutation-based testing and to understand its challenges and benefits. Therefore, ease of installation and use and the existence of supporting documentation are relevant concerns.
Another relevant aspect is the presentation and interpretability of the produced artifacts. For example, a tool may only produce a summary of the generated mutants but no mutated source code for inspection. Likewise, one tool may provide information about why a mutant was (not) killed, whereas others may require additional tooling. As a result, a self-contained tool that works on small examples may be preferable to tools that work at scale but need to be fully integrated into a developer's workflow. Similarly, a GUI may be preferable to a command-line interface.
Based on the responses to our questionnaires, all educators prefer free/open source tools, and many identify ease of installation as an important consideration. Additionally, out of 14 educators:
Javalanche, Jester, and Judy do not provide any type of GUI, and MuJava and Bacterio are the only tools that provide a side-by-side comparison. MuJava, Major, and Bacterio support the computation of a complete kill matrix.
Practice. Adopting a mutation tool in practice usually requires its integration with the existing development environment. For example, some tools only work with certain testing frameworks and Java versions. Another important aspect is ease of interpretation—that is, quickly understanding how a mutant was generated and how it can be killed. This is particularly important for developers if mutants serve as test goals. Likewise, suppressing unproductive mutants and preventing equivalent mutants are important to maximize developer productivity. Another consideration is the mutation level—that is, whether a tool mutates the source or byte code. While these details are often not clearly described in a tool's documentation, they are important when reasoning about mutant interpretability and workflow integration.
Based on the responses to our questionnaires, we observed the following. Out of nine practitioners:
PIT, Major, Jumble, and Javalanche include techniques for the first two aspects, and PIT, Major, and Javalanche are the most suitable tools for integration—these can be integrated with traditional CI/CD environments such that they can be run from scripts.
Unlike in other use cases, five out of nine practitioners prefer tools that support automated Test selection and coarse-grained Mutation operator selection. Major is the only tool that fulfills both requirements.
Discussion. To provide an aggregated view and to contrast the answers between use cases (research, education, and practice), Tables 5, 6, and 7 summarize the expressed importance of, and preferences for, individual aspects of mutation tools. All three tables highlight percentages above 70% (↑) and below 30% (↓) to draw attention to the most important aspects as well as to conflicts between use cases.
Table 5 lists the most and least important aspects for each use case. This table groups all responses for each aspect into two categories: important (grouping "important," "very important," and "mandatory" together) and not important (grouping "not important" and "do not care" together). The table reports the percentage of responses that fall into the first category. While comprehensive documentation and support for a recent JUnit version are important for all use cases, the importance of many other aspects varies by use case. For example, a detailed summary report and a standardized description of mutation operators are important for research (96%) and education (86%), but not in practice (44%). Likewise, a selected tool being recently updated is important for research (75%) and in practice (89%), but to a lesser extent for education (64%). Moreover, some aspects are mostly important for a single use case. For example, support for recent Java versions is important in practice (78%), and producing a kill matrix is important for research (79%). Regarding the least important aspects, outputting mutated byte code is not important for any use case, and outputting a reduced test suite is not important for education. The importance of run-time reduction techniques shows a conflict: while it is not important for education, it is important for practice.
Table 6 shows the preferred type of user interface for each use case. This was the only multiple-choice question in the three questionnaires, and it included a "do not care" option. Two responses to the research questionnaire and one response to the practice questionnaire did not indicate a preference; we excluded these responses. The responses show that IDE integration is a preference for education (79%) and in practice (88%). In contrast, a command-line interface is not a preference for education (21%), but again a preference in practice (88%). While a command-line interface is also preferred for research (68%), the differences among the three options are not as striking for this use case. A (dedicated) graphical user interface is not a preference for any use case.
Table 7 summarizes the expressed preferences for one of two implementation choices for each use case. As before, we excluded "do not care" responses on a per-choice basis. For example, five responses to the research questionnaire, two responses to the education questionnaire, and three responses to the practice questionnaire did not express a preference for the choice of mutation level. This is important context: the table reports the percentage and fraction of responses that do prefer the underlined choice, calculated over the number of responses that indeed expressed an opinion. For all three use cases, there is a strong preference for mutating source code, as opposed to byte code, as well as for providing source code as the input to the mutation tool. Similarly, there is a preference for tools being free/open source, though only 44% (4/9) of responses to the practice questionnaire expressed an opinion on this. The preference for selecting individual mutation operators (for example, mutating a+b to a-b) as opposed to groups of mutation operators (for example, mutating all arithmetic operators) shows a conflict: individual mutation operators are preferred for research (78%) and education (75%), but not in practice (29%). Preferences for mutant inspection and test selection show a similar conflict, but the differences are not as pronounced.
Overall, the responses to the three questionnaires show that there is consensus among the three use cases about the importance of many aspects of mutation tools. However, the responses also highlight some conflicts. These results, together with the summary of the surveyed tools' features (Table 3) allow researchers, educators, and practitioners to select a suitable mutation tool, based on a ranking of aspects that are most important to them. These results also allow developers of mutation tools to make informed decisions about what features to implement and how to improve their tools, based on their target audience.
This article presents the results of a meta-analysis of existing comparisons of Java mutation tools, following a rapid review (RR) literature process. First, it proposes a comprehensive mutation tool comparison framework encompassing five dimensions, each with multiple attributes. Second, it reports on an application of the proposed framework to eight state-of-the-art Java mutation tools, based on a literature survey, a tool-author survey, and a student survey. Finally, it reports on a survey of researchers, educators, and practitioners to understand which tool characteristics are most important for choosing a mutation tool for a particular use case. The responses indicate common as well as use-case-specific considerations, and the article shows how the proposed framework can be used to identify the most suitable mutation tool for a given use case.
Acknowledgment. This work is supported in part by National Science Foundation grant CCF-1942055.
Figure. Watch the authors discuss this work in the exclusive Communications video. https://cacm.acm.org/videos/java-mutation-tools
3. Beller, M. et al. What it would take to use mutation testing in industry—A study at Facebook. In 2021 IEEE/ACM 43rd Intern. Conf. on Software Engineering: Software Engineering in Practice (2021), 268–277; https://doi.org/10.1109/ICSESEIP52600.2021.00036
4. Cartaxo, B. et al. Software engineering research community viewpoints on rapid reviews. In 2019 ACM/IEEE Intern. Symp. Empirical Software Engineering and Measurement, 1–12; https://doi.org/10.1109/ESEM.2019.8870144
6. Chevalley, P. Applying mutation analysis for object-oriented programs using a reflective approach. In Proceedings 8th Asia-Pacific Software Engineering Conf. (2001), 267–270; https://doi.org/10.1109/APSEC.2001.991487.
7. Coles, H., Laurent, T., Henard, C., Papadakis, M., and Ventresque, A. PIT: A Practical Mutation Testing Tool for Java (Demo). In Proceedings of the 25th Intern. Symposium on Software Testing and Analysis (2016), 449–452.
14. Grime, M.M. and Wright, G. Delphi Method. American Cancer Society (2016), 1–6; https://doi.org/10.1002/9781118445112.stat07879
15. Irvine, S., Pavlinic, T., Trigg, L., Cleary, J., Inglis, S., and Utting, M. Jumble Java byte code to measure the effectiveness of unit tests. In Proceedings of Testing: Academic and Industrial Conf. Practice and Research Techniques, (2007).
18. Just, R., Jalali, D., and Ernst, M.D. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the Intern. Symp. Software Testing and Analysis (2014), 437–440.
19. Just, R., Jalali, D., Inozemtseva, L., Ernst, M., Holmes, R., and Fraser, G. Are mutants a valid substitute for real faults in software testing? In Proceedings of the Symp. Foundations of Software Engineering (2014), 654–665.
21. Just, R., Schweiggert, F., and Kapfhammer, G.M. MAJOR: An efficient and extensible tool for mutation analysis in a Java compiler. In Proceedings of the Intern. Conf. on Automated Software Engineering (2011), 612–615.
22. Kintis, M., Papadakis, M., Papadopoulos, A., Valvis, E., Malevris, N., and Traon, Y. How effective are mutation testing tools? An empirical analysis of Java mutation testing tools with manual analysis and real faults. Empirical Softw. Eng. 23, 4 (2018), 2426–2463.
23. Kurtz, B., Ammann, P., Delamaro, M.E., Offutt, J., and Deng, L. Mutant subsumption graphs. In Proceedings of the Intern. Conf. on Software Testing, Verification and Validation Workshops (2014), 176–185.
26. Madeyski, L., Orzeszyna, W., Torkar, R., and Jozala, M. Overcoming the equivalent mutant problem: A systematic literature review and a comparative experiment of second order mutation. IEEE Trans. Softw. Eng. 40, 1 (2014), 23–42.
27. Madiraju, P. and Namin, A. Para—A partial and higher-order mutation tool with concurrency operators. In Proceedings of the IEEE 4th Inter. Conf. on Software Testing, Verification and Validation Workshops (2011), 351–356; https://doi.org/10.1109/ICSTW.2011.34
34. Omar, E., Ghosh, S., and Whitley, D. HOMAJ: A tool for higher order mutation testing in AspectJ and Java. In Proceedings of the IEEE 7th Intern. Conf. on Software Testing, Verification and Validation Workshops (2014), 165–170; https://doi.org/10.1109/ICSTW.2014.19
39. Petrović, G., Ivanković, M., Kurtz, B., Ammann, P., and Just, R. An industrial application of mutation testing: Lessons, challenges, and research directions. In Proceedings of the Intern. Workshop on Mutation Analysis (Mutation) (2018), 47–53.
40. Rani, S., Suri, B., and Khatri, S.K. Experimental comparison of automated mutation testing tools for Java. In Proceedings of the 4th Intern. Conf. on Reliability, Infocom Technologies and Optimization (2015), 1–6.
42. Schuler, D. and Zeller, A. Javalanche: Efficient mutation testing for Java. In Proceedings of the 7th Joint Meeting of the European Software Engineering Conf. and the ACM SIGSOFT Symp. on Foundations of Software Engineering (Amsterdam, The Netherlands, 2009). ACM, New York, NY, USA, 297–298.
43. Wohlin, C. Guidelines for snowballing in systematic literature studies and a replication in software engineering. In Proceedings of the 18th Intern. Conf. on Evaluation and Assessment in Software Engineering (London, England, U.K., 2014). ACM, New York, NY, USA, Article 38; https://doi.org/10.1145/2601248.2601268
44. Zhang, L., Gligoric, M., Marinov, D., and Khurshid, S. Operator-based and random mutant selection: Better together. In Proceedings of the Intern. Conf. on Automated Software Engineering (2013), 92–102.
This work is licensed under a Creative Commons Attribution 4.0 International license (http://creativecommons.org/licenses/by/4.0/).
The Digital Library is published by the Association for Computing Machinery. Copyright © 2022 ACM, Inc.