The need for changing existing software has been with us since the first programs were written. After the pioneering work of Lehman30 we know that real-world software systems require continuous change and enhancement to satisfy new user requirements and expectations, to adapt to new and emerging business models and organizations, to adhere to changing legislation, to cope with technology innovation, and to preserve the system structure from deterioration. A major proportion of software life cycle costs is due to maintenance and support; while figures vary, there is general agreement in the software industry that the proportion is well over 50% of software costs.
The need for modifying software induces the need to comprehend it. Software comprehension is a human-intensive process, where developers acquire sufficient knowledge about a software artifact, or an entire system, so as to be able to successfully accomplish a given task, by means of analysis, guessing, and formulation of hypotheses. In most cases, software comprehension is challenged by the lack of adequate and up-to-date software documentation, often due to limited skills or to resource and timing constraints during development activities.
Software reverse engineering is a broad term that encompasses an array of methods and tools to derive information and knowledge from existing software artifacts and leverage it into software engineering processes. In a seminal paper, Chikofsky and Cross12 define software reverse engineering as "the process of analyzing a subject system to identify the system's components and their inter-relationships and create representations of the system in another form or at a higher level of abstraction." In accordance with this definition, reverse engineering is a process of examination rather than a process of change, as the core of reverse engineering is deriving, from available software artifacts, representations understandable by humans.
Software reverse engineering originated in software maintenance: the IEEE-1219 standard recommends reverse engineering as a key supporting technology to deal with systems that have the source code as the only reliable representation. Since then, it has been successfully exploited to deal with numerous software engineering problems. A non-exhaustive list includes: recovering architectures and design patterns, re-documenting programs and databases, identifying reusable assets, building traceability between software artifacts, computing change impacts, re-modularizing existing systems, renewing user interfaces, and migrating toward new architectures and platforms.9
To support such a wide range of applications, many different types of tools have been developed in both academia and industry, including pretty printers, static and dynamic code analyzers, and tools for code visualization and exploration, design recovery, document and diagram generation, and, more recently, for mining software repositories.
During the last decade, the Y2K and the Euro problems focused the world's attention on system evolution issues, demonstrating the maturity of reverse engineering as an effective support to face difficult and critical problems. In such a context, reverse engineering was used, above all, to locate features that needed modifications, and to evaluate the impact of such modifications. Success stories of reverse engineering can be seen every day, when organizations:
In this article we discuss, with no ambition to provide an exhaustive survey, developments of broad significance in the field of reverse engineering, and highlight unresolved questions and future directions. We summarize the most important reverse engineering concepts and describe the main challenges and achievements. We also provide references to relevant and available tools for the two major activities of a reverse engineering process,12 namely performing software analysis and building software views. Finally, we outline future trends of reverse engineering in today's software engineering landscape.
The accompanying figure shows a conceptual model for software reverse engineering as a UML class diagram. This model stems from the Chikofsky and Cross view of the reverse engineering process, highlighted in the gray area in the figure. As described in the model, a reverse engineering activity is performed by a software engineer to solve some specific problems related to a software product, consisting of software artifacts (for example, source code, executables, project documentation, test cases, and change requests). Software engineers benefit from reverse engineering by exploring software artifacts by means of software views, which are representations, sometimes visual, aimed at increasing the current comprehension of a software product, or at favoring maintenance and evolution activities. Software views are built by means of an abstractor, which in turn uses information extracted from software artifacts and stored in the information base by an analyzer. Often reverse engineering aims at analyzing the evolution of software artifacts across their revisions, which occur to satisfy change requests.
The analyzers can extract information by means of static analysis (static analyzers), dynamic analysis (dynamic analyzers), or a combination of the two (hybrid analyzers). Recently, historical analyzers, which extract information from the evolution repository of a software product, have been gaining popularity.
The software engineer can provide feedback to the reverse engineering tool with the aim of producing refined and more precise views. A particular type of software view that has emerged recently is the recommendation system,39 which provides suggestions to the software engineer and, if needed, triggers a new change request.
Software analysis is performed by analyzers, that is, tools that take software artifacts as input and extract information relevant to reverse engineering tasks. Software analysis can be: static, when it is performed, within a single system snapshot, on software artifacts without requiring their execution; dynamic, when it is performed by analyzing execution traces obtained from the execution of instrumented versions of a program, or by using an execution environment able to capture facts from program executions; and historical, when the aim is to gain information about the evolution of the system under analysis by considering the changes performed by developers to software artifacts, as recorded by versioning systems. David Binkley6 has written a thorough survey on source code analysis.
Static analyzers must deal with a number of challenges:
There are tools such as the Design Maintenance Systems (DMS)3 that perform a complete parsing of the source code, also handling preprocessor directives where the programming language provides for them. Whenever we do not need a complete parsing, but rather want to extract some specific information, we can use island and lake parsers.35 Island parsers only parse source code fragments of interest, ignoring any other token and activating the parser only when tokens of interest are encountered. Lake parsers, instead, are able to ignore source code fragments not covered by the grammar, such as those related to programming language dialects. Tools such as TXL15 or SrcML14 allow for robust island parsing of source code. Whenever the source code is not available, decompilation tools are needed to perform the analysis.13
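The island idea can be illustrated with a deliberately small sketch. It is not how DMS, TXL, or SrcML work internally; it only assumes that the "islands" of interest are C-like function headers of the simple form `type name(params) {`, and it treats everything else, including dialect-specific tokens, as "water" to be skipped. The pattern, the keyword filter, and the sample input are all hypothetical.

```python
import re

# Island: a simple C-like function header such as "int add(int a, int b) {".
# Anything that does not match is "water" and is skipped without error.
ISLAND = re.compile(
    r"^\s*(?P<ret>[A-Za-z_]\w*)\s+(?P<name>[A-Za-z_]\w*)\s*\([^)]*\)\s*\{",
    re.MULTILINE,
)

# Control keywords can look like a header ("else if (x) {"), so filter them out.
NOT_FUNCTIONS = {"if", "while", "for", "switch"}


def extract_function_islands(source: str) -> list[tuple[str, str]]:
    """Return (return_type, name) pairs for the function headers found."""
    return [
        (m.group("ret"), m.group("name"))
        for m in ISLAND.finditer(source)
        if m.group("name") not in NOT_FUNCTIONS
    ]


# Hypothetical input mixing parsable code with tokens no grammar would accept.
SAMPLE = """
#ifdef VENDOR_DIALECT
__vendor_pragma weird tokens here;
#endif
int add(int a, int b) {
    return a + b;
}
static junk that the grammar does not cover
void log_msg(char msg) {
    puts(msg);
}
"""

islands = extract_function_islands(SAMPLE)
print(islands)
```

The sketch trades precision for robustness: it never fails on unknown input, but it also misses headers it was not written for (pointer return types, for instance), which is the usual price of parsing only the fragments of interest.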
Fact extractors such as the Bauhaus fact extractor26 or Columbus19 extract facts (for example, dependencies and metrics) from the source code and store them in an information base. Approaches such as the one by Kuhn et al.28 consider the source code as text, and extract facts from it by indexing the terms it contains.
Tools such as CodeSurfer1 make it possible to perform a series of analyses related to a program's semantics, such as control dependence analysis, data flow analysis, or pointer analysis, and then build views on top of these analyses, for example, to represent a program slice or to highlight the impact of a change.
Static analysis is reasonably fast, precise, and cheap. However, many peculiarities of programming languages, such as pointers and polymorphism, or dynamic class loading, make static analysis difficult and sometimes imprecise. In addition, the extraction of some specific information related, for instance, to the interaction of a user with the application or to the sequence of messages exchanged between objects, is difficult, if not infeasible, to perform statically.
To overcome the limitations of static analysis, reverse engineers can count on dynamic analysis, which extracts information from execution traces. For many reverse engineering tasks, it is beneficial to combine static and dynamic analysis, as static analysis can be imprecise, for example, not able to fully deal with aspects such as pointer resolution, while dynamic analysis can be incomplete, because it depends on program inputs. An extensive survey of dynamic analysis approaches for the specific purpose of program comprehension can be found in Cornelissen et al.16
Key challenges of dynamic analysis include:
For some reverse engineering tasks, it may be necessary to achieve thorough coverage of the application code or functionality, while in other cases it would be more appropriate to execute the program under realistic execution scenarios, where the most frequently used features are exercised more than others.
Approaches such as the one proposed by Hassan et al.23 aim at filtering the execution traces and mining recurring patterns to analyze the application's operational profiles, and tools like Daikon18 identify likely invariants by analyzing execution traces.
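To make the notion of likely invariants concrete, the following is a minimal sketch in the spirit of that idea, not a description of how Daikon is implemented. It assumes a hypothetical instrumented loop that logs two variables, `i` and `n`, and a small fixed set of candidate properties; a candidate is reported only if it holds on every observation in the trace.

```python
from itertools import combinations


def record_loop_states(values: list[int]) -> list[dict[str, int]]:
    """Hypothetical instrumented loop: log index i and length n per iteration."""
    n = len(values)
    return [{"i": i, "n": n} for i, _ in enumerate(values)]


def likely_invariants(observations: list[dict[str, int]]) -> set[str]:
    """Return the candidate properties that hold on every observation."""
    if not observations:
        return set()
    names = sorted(observations[0])
    candidates = {}
    for v in names:
        first = observations[0][v]
        candidates[f"{v} >= 0"] = lambda o, v=v: o[v] >= 0
        candidates[f"{v} == {first}"] = lambda o, v=v, c=first: o[v] == c
    for a, b in combinations(names, 2):
        candidates[f"{a} <= {b}"] = lambda o, a=a, b=b: o[a] <= o[b]
        candidates[f"{b} <= {a}"] = lambda o, a=a, b=b: o[b] <= o[a]
    return {
        text
        for text, holds in candidates.items()
        if all(holds(o) for o in observations)
    }


# One execution suggests a spurious property; a second input refutes it.
single_run = record_loop_states([4, 1, 7])
two_runs = single_run + record_loop_states([5])

from_single_run = likely_invariants(single_run)
from_two_runs = likely_invariants(two_runs)
print(sorted(from_single_run))
print(sorted(from_two_runs))
```

The difference between the two results shows why dynamic analysis depends on program inputs: with a single run the constant length looks like an invariant, and only a second, different input removes it.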
Some approaches31 reverse engineer properties of a deployed service relying only on the observed outputs, without the need for source code analysis.
The growing diffusion of versioning systems, bug tracking systems, and other software repositories, such as mailing lists or security advisories, lays the basis for a third dimension of software analysis, namely the historical analysis of data extracted from software repositories. Historical analysis is complementary to static and dynamic analysis, in that it allows us to understand how an artifact was modified over time, when it was changed, why it was changed, who changed it, and what artifacts changed together. (For a thorough survey on historical analysis see Kagdi et al.24) While mining software repositories opens challenging research directions, such sources of information must be used with particular care. For instance, in a recent roundtable of mining software repositories experts,21 Notkin warned against the flourishing of mining studies that find relationships in and among repositories which, although statistically significant, are not causal, or cannot lead to improved approaches and tools to develop better software systems. Also, Mockus pointed out that information about software projects cannot be fully observed from versioning systems and problem reporting systems only.
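As a small illustration of the question "what artifacts changed together," the sketch below counts how often two files appear in the same commit. It assumes the commit history has already been extracted from a versioning system into lists of file names; the history shown is invented for the example.

```python
from collections import Counter
from itertools import combinations


def co_change_counts(commits: list[list[str]]) -> Counter:
    """Count, for every pair of files, the commits in which both changed."""
    counts: Counter = Counter()
    for files in commits:
        # Sorting makes each pair canonical, so (a, b) and (b, a) coincide.
        for pair in combinations(sorted(set(files)), 2):
            counts[pair] += 1
    return counts


# Hypothetical history: each inner list is the set of files in one commit.
HISTORY = [
    ["parser.c", "parser.h"],
    ["parser.c", "parser.h", "lexer.c"],
    ["lexer.c"],
]

pairs = co_change_counts(HISTORY)
print(pairs.most_common(1))
```

In line with the caveat above, a high count only indicates that two files tend to change together; it does not by itself establish why.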
Historical analysis poses a series of challenges related to:
A widely used heuristic aims at matching bug tracking IDs to versioning system commit notes. Another problem historical analysis must deal with is the accuracy of the available information, for example, the way issues posted on bug tracking systems are classified and prioritized.
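A minimal sketch of such a heuristic follows. The keyword pattern, the commit identifiers, and the set of known bug IDs are assumptions made for illustration; real projects use varying conventions, so any such pattern needs tuning. Checking candidate numbers against the IDs that actually exist in the tracker is a simple way to reduce false links.

```python
import re

# Assumed convention: a keyword such as "bug", "issue", or "fix(es/ed)"
# followed by an optional ':' or '#' and then the numeric ID.
BUG_REF = re.compile(r"\b(?:bug|issue|fix(?:es|ed)?)\s*[:#]?\s*#?(\d+)", re.IGNORECASE)


def link_commits_to_bugs(
    commit_notes: dict[str, str], known_bug_ids: set[int]
) -> dict[str, list[int]]:
    """Map each commit to the tracker IDs mentioned in its commit note."""
    links: dict[str, list[int]] = {}
    for commit_id, note in commit_notes.items():
        ids = [int(n) for n in BUG_REF.findall(note)]
        confirmed = [i for i in ids if i in known_bug_ids]
        if confirmed:
            links[commit_id] = confirmed
    return links


# Hypothetical commit notes and tracker contents.
NOTES = {
    "c1": "Fix bug 1234: null pointer in parser",
    "c2": "Fixes #42 and issue 7",
    "c3": "Update README for release 2011",
}
TRACKER_IDS = {1234, 42}

links = link_commits_to_bugs(NOTES, TRACKER_IDS)
print(links)
```

Here the reference to issue 7 is dropped because no such ID exists in the assumed tracker, and the year in the third note is never considered because no keyword precedes it.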
There are approaches that work on a flat representation of the program, thus ensuring language independence, and others, language-dependent, that work on ASTs. The first category includes the Unix diff and ldiff,11 which overcomes the limitations of diff in distinguishing likely line changes from additions and removals, and in tracking moved lines. The second category includes ChangeDistiller,20 which is able to provide detailed information about the changes that occurred, for example, a change to a method parameter or to a control-flow construct.
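The flat, language-independent style of differencing can be sketched with Python's standard `difflib` module, used here as a stand-in for the Unix diff; it is not ldiff or ChangeDistiller, and it knows nothing about the syntax of the lines it compares. The two file versions are invented for the example.

```python
import difflib


def line_changes(
    old_lines: list[str], new_lines: list[str]
) -> list[tuple[str, list[str], list[str]]]:
    """Return (tag, removed_lines, added_lines) for each non-equal region."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    return [
        (tag, old_lines[i1:i2], new_lines[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical revisions of the same file, one statement per line.
OLD = ["int x = 0;", "x = x + 1;", "print(x);"]
NEW = ["int x = 0;", "x += 1;", "print(x);", "return x;"]

changes = line_changes(OLD, NEW)
for change in changes:
    print(change)
```

The result reports that a line was replaced and another inserted, but not that the replacement is a semantically equivalent rewrite of an assignment; recovering that kind of detail is what AST-based differencing aims at.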
Table 1 provides a categorization of some relevant examples of static, dynamic, historical, and hybrid analyzers.
A primary goal of reverse engineering is to produce software views by abstracting facts stored in the knowledge base. During the process of building views, there is a series of challenges that reverse engineers must face:
Examples of querying mechanisms include Grok, a notation for manipulating binary relations using Tarski's relational algebra, which comprises operators for manipulating sets or relations, and CrocoPat,5 which uses relational calculus to query the information base. If the information base is populated with source code ASTs, it can be queried by means of pattern matching mechanisms working over ASTs, as happens for tools such as DMS and TXL, or even using XQuery when the AST is represented in XML, as happens for SrcML.
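The flavor of querying binary relations with relational algebra can be conveyed by a short sketch. It does not use Grok or CrocoPat syntax; it simply models relations as Python sets of pairs and assumes a hypothetical information base with two relations, `contains` (module to function) and `calls` (function to function). Composing them lifts function-level calls to module-level dependencies.

```python
Relation = set[tuple[str, str]]


def compose(r: Relation, s: Relation) -> Relation:
    """Relational composition: (a, d) if a r b and b s d for some b."""
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}


def inverse(r: Relation) -> Relation:
    """Swap the two sides of every pair."""
    return {(b, a) for (a, b) in r}


def transitive_closure(r: Relation) -> Relation:
    """Repeatedly add composed pairs until nothing new appears."""
    closure = set(r)
    while True:
        extra = compose(closure, closure) - closure
        if not extra:
            return closure
        closure |= extra


# Hypothetical facts stored in the information base.
contains: Relation = {("parser", "parse"), ("parser", "lex"), ("io", "read_file")}
calls: Relation = {("parse", "lex"), ("parse", "read_file")}

# Lift calls to modules: contains, then calls, then back up via inverse(contains).
lifted = compose(compose(contains, calls), inverse(contains))
module_deps = {(a, b) for (a, b) in lifted if a != b}  # drop intra-module calls
print(module_deps)
```

The same three operators are enough to express many architectural queries, which is one reason relational notations are attractive as a querying mechanism over extracted facts.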
There are tools aimed at computing and visualizing slices or data-flow information,1 extracting source code clones and visualizing them,25 or at identifying the presence of crosscutting concerns.33 A different case of code view is a decompilation,13 where the objective is not to augment the source code with further information, but rather to abstract the (unavailable) source code from binaries.
Techniques using information retrieval methods have been developed to recover traceability links, on the hypothesis that the terms of high-level documents are consistent with source code identifiers and comments.2
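A minimal sketch of this idea ranks source files against a requirement by textual similarity. The specific choices made here (splitting identifiers into terms, TF-IDF weighting with a smoothed IDF, and cosine similarity) are one common information retrieval setup assumed for illustration, not necessarily the method of the cited work; the requirement and the two source files are invented.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Split text and camelCase/snake_case identifiers into lowercase terms."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return [t.lower() for t in re.findall(r"[A-Za-z]+", spaced)]


def tfidf_vectors(texts: dict[str, str]) -> dict[str, dict[str, float]]:
    """Build a term -> weight vector per document (smoothed IDF)."""
    counts = {name: Counter(tokenize(t)) for name, t in texts.items()}
    n = len(counts)
    df = Counter(term for c in counts.values() for term in c)
    return {
        name: {t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, tf in c.items()}
        for name, c in counts.items()
    }


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def rank_candidates(requirement: str, artifacts: dict[str, str]) -> list[tuple[str, float]]:
    """Rank artifacts by similarity to the requirement, best first."""
    vectors = tfidf_vectors({"__requirement__": requirement, **artifacts})
    query = vectors.pop("__requirement__")
    scored = [(name, cosine(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical requirement and source artifacts.
REQUIREMENT = "The customer shall withdraw cash from an account"
ARTIFACTS = {
    "Account.java": "class Account { void withdrawCash(int amount) { balance -= amount; } }",
    "Report.java": "class Report { void printMonthlyReport() { /* format report pages */ } }",
}

ranking = rank_candidates(REQUIREMENT, ARTIFACTS)
print(ranking[0][0])
```

The ranking is only as good as the naming discipline in the code: if identifiers and comments do not share vocabulary with the documents, the hypothesis behind the technique fails and so does the recovery.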
Software visualizations are not necessarily related to a single snapshot of a software system. Historical analysis allows for building historical views; as an example, the animated storyboards proposed by Beyer and Hassan4 use a sequence of animated panels to visualize the changes occurring in a software system.
Some relevant examples of tools supporting the creation of reverse engineering views are described in Table 2.
Reverse engineering has reached great maturity, as we have examined. However, there are still a number of open problems that call for additional research to advance and improve current methods and tools, and to consider new software development scenarios where reverse engineering could play a crucial role. Moreover, radical innovations are needed to cope with new and emerging software development scenarios and new system architectures.
Current approaches to build views suffer from two main limitations:
These limitations indicate the need to close the loop between tool support and developers. Future research activities in reverse engineering should push towards a tight integration of human feedback into automatic reverse engineering techniques. In other words, the reverse engineering machinery should be able to learn from expert feedback to automatically produce better results.
Reverse engineering has traditionally been intended as a support for post-delivery activities. There is a growing need for full integration of reverse engineering within the development process, so that developers can benefit from on-the-fly application of reverse engineering techniques, for example, while they are writing code or a design model. Muller et al.34 highlighted the idea of exploiting information extracted by reverse engineering as feedback on the forward development process. This idea of continuous reverse engineering has several expected benefits:
In recent years, the availability of multiple sources of data (software repositories) has created a demand for techniques that combine them and filter the resulting mass of information, which would otherwise overload developers.36 Recommender systems are an emerging response to this kind of issue, favored by the availability of highly customizable development environments, such as Eclipse and NetBeans, that provide ways to quickly develop new tools completely integrated in the environment the developer is using. For example, CloneTracker17 keeps track of changes occurring on source code clones. Recommender systems integrate many reverse engineering techniques, and their growing popularity suggests that next-generation software development environments will most probably make reverse engineering part of the standard outfit of software developers.
With a view to analyzing developers' work on software projects, an important challenge is to relate information about software artifacts, extracted by means of reverse engineering techniques, with information about communication and cooperation among developers or, in general, among project contributors. In recent years, the combination of program analysis and mining software repositories with social network analysis techniques has become an effective tool that helps to understand the relationships between developers' social networks and the characteristics of the code they tend to modify.7
Today, many systems are multi-language and cross-platform, requiring reverse engineering tools able to deal with multiple languages and multiple platforms within a single conceptual framework. Finally, new kinds of software artifacts that need to be reverse engineered are emerging, including spreadsheets and macros deriving from the growing phenomenon of end-user programming.8
The advent of service oriented architectures (SOA) poses new challenges to reverse engineering. A service oriented system is composed of distributed services published by different providers and poses relevant software understanding issues.22 In fact, each service offers, through its interface, a limited view of its features, because providers "sell" the usage of a service but want to protect the technology behind it. This affects the understandability of the service and, since the implementation is not available for reverse engineering, black box understanding techniques must be used. Upcoming work in reverse engineering can also support service providers in (semi) automatically producing service annotations, inferred using black box reverse engineering techniques, for the purpose of automatic discovery and reconfiguration.31 On the service provider side, service annotations can also be produced using source code reverse engineering techniques, aimed at extracting models to be used for annotating the service. Of course, this kind of automatically extracted information can differ from what is assumed today by existing automatic service discovery and composition mechanisms, necessitating a step back and a rethinking of some of them.
Many organizations are currently using SOA as an integration platform for their systems. Thus, within an enterprise a key challenge is often to turn monolithic systems into service oriented architectures, so that they are better aligned with the business processes. Reverse engineering is a valuable option to help re-architect these intra-enterprise applications.10 Recasting existing systems into services is also a prerequisite to move towards cloud computing, a style of computing where not only software but also the infrastructure and the platform are seen as services.
Reverse engineering has traditionally focused on recovering functional and architectural views from existing software artifacts. However, reverse engineering approaches have also proven useful to recover performance models and workload models of an application, mainly by analyzing execution traces.23,44 These models have proven particularly useful to deal with applications running on limited-resource devices, including embedded systems and mobile and pervasive applications, and to help migrate existing applications onto these devices. Examples of key problems are adapting the interface to the limited size and resolution of these devices and miniaturizing libraries to fit in a reduced amount of memory.
With the increasing diffusion of handheld devices and wireless sensor networks, power management is becoming a key issue. Consequently, an emerging goal of reverse engineering is to analyze an application and identify changes aimed at reducing power consumption. This is crucial for mobile devices such as smart phones and Personal Digital Assistants (PDAs), but especially in the context of sensor networks, where there is the need for increasing battery duration for geographically distributed sensors. In this context, reverse engineering has the role of understanding application power bottlenecks, which may be an excessive usage of the display or of a wireless connection, but also the use of CPU instructions having high power requirements. Last but not least, in a long-term vision, "green computing" envisages scenarios where general-purpose computing makes a reduced utilization of power resources. This is, indeed, a software engineering problem and, for existing systems, a reverse engineering problem.
Reverse engineering is, at the same time, an old activity and a young discipline. In fact, developers have always struggled with analyzing software components to gather information that the documentation leaves out, examining source code to reconstruct the underlying rationale and design choices, and inspecting data formats to maintain communications among applications. However, for a long time these and other reverse engineering tasks have been carried out using ad hoc approaches, with the help of very simple general-purpose tools, such as editors and regular-expression matchers.
Only in the past two decades has software engineering recognized the need for systematic approaches, and supporting tools, to gather information from existing software and leverage it into engineering processes. Nevertheless, pushing the adoption of reverse engineering techniques in development practice is still a major need; it requires appropriate education of developers, both in university courses and within industry, and support for reverse engineering techniques and tools in the form of empirical evidence about their performance and usability and guidelines for their adoption.
3. Baxter, I.D., Pidgeon, C., and Mehlich, M. DMS: Program transformations for practical scalable software evolution. In Proc. of the 26th International Conference on Software Engineering, 2004, 625-634.
7. Bird, C., Pattison, D., D'Souza, R., Filkov, V., and Devanbu, P. Latent social structure in open source projects. In Proc. of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2008. ACM Press, 24-35.
9. Canfora, G. and Di Penta, M. New frontiers of reverse engineering. In Proc. of the International Conference on Software Engineering - Future of Software Engineering Track (FOSE), 2007. IEEE CS Press, 326-341.
10. Canfora, G., Fasolino, A.R., Frattolillo, G., and Tramontana, P. A wrapping approach for migrating legacy system interactive functionalities to service oriented architectures. Journal of Systems and Software 81, 4 (2008), 463-480.
16. Cornelissen, B., Zaidman, A., van Deursen, A., Moonen, L., and Koschke, R. A systematic survey of program comprehension through dynamic analysis. IEEE Transactions on Software Engineering. http://doi.ieeecomputersociety.org/10.1109/TSE.2009.28
19. Ferenc, R., Beszédes, Á., Tarkiainen, M., and Gyimóthy, T. Columbus: Reverse engineering tool and schema for C++. In Proc. of the 18th International Conference on Software Maintenance, 2002. IEEE CS Press, 172-181.
21. Godfrey, M.W., Hassan, A.E., Herbsleb, J.D., Murphy, G.C., Robillard, M.P., Devanbu, P.T., Mockus, A., Perry, D.E., and Notkin, D. Future of mining software archives: A roundtable. IEEE Software 26, 1 (2009), 67-70.
23. Hassan, A.E., Martin, D.J., Flora, P., Mansfield, P., and Dietz, D. An industrial case study of customizing operational profiles using log compression. In Proc. of the 30th International Conference on Software Engineering, 2008. ACM Press, 713-723.
24. Kagdi, H.H., Collard, M.L., and Maletic, J.I. A survey and taxonomy of approaches for mining software repositories in the context of software evolution. Journal of Software Maintenance 19, 2 (2007). Wiley, 77-131.
25. Kamiya, T., Kusumoto, S., and Inoue, K. CCFinder: A multilinguistic token-based code clone detection system for large scale source code. IEEE Transactions on Software Engineering 28, 7 (2002), 654-670.
34. Muller, H.A., Jahnke, J.H., Smith, D.B., Storey, M.D., Tilley, S.R., and Wong, K. Reverse engineering: A roadmap. In Proc. of the 22nd International Conference on Software Engineering - Future of Software Engineering Track, 2000, 47-60.
37. Nierstrasz, O., Ducasse, S., and Gîrba, T. The story of Moose: An agile reengineering environment. In Proc. of the 10th European Software Engineering Conference held jointly with 13th ACM SIGSOFT International Symposium on Foundations of Software Engineering (Lisbon, Portugal, Sept. 5-9, 2005). ACM Press, 1-10.
41. Storey, M.D., and Müller, H.A. Manipulating and documenting software structures using Shrimp views. In Proc. of the 11th International Conference on Software Maintenance, 1995. IEEE CS Press, 275-284.
44. Woodside, C.M., Franks, G., and Petriu, D.C. The future of software performance engineering. In Proc. of the International Conference on Software Engineering, Future of Software Engineering Track (FOSE), 2007. IEEE CS Press, 171-187.
©2011 ACM 0001-0782/11/0400 $10.00