Sign In

Communications of the ACM

Research highlights

Predicting Program Properties from 'Big Code'


View as: Print Mobile App ACM Digital Library Full Text (PDF) In the Digital Edition Share: Send by email Share on reddit Share on StumbleUpon Share on Hacker News Share on Tweeter Share on Facebook
figure standing next to giant laptop

We present a new approach for predicting program properties from large codebases (aka "Big Code"). Our approach learns a probabilistic model from "Big Code" and uses this model to predict properties of new, unseen programs.

The key idea of our work is to transform the program into a representation that allows us to formulate the problem of inferring program properties as structured prediction in machine learning. This enables us to leverage powerful probabilistic models such as Conditional Random Fields (CRFs) and perform joint prediction of program properties.

As an example of our approach, we built a scalable prediction engine called JSNICE for solving two kinds of tasks in the context of JavaScript: predicting (syntactic) names of identifiers and predicting (semantic) type annotations of variables. Experimentally, JSNICE predicts correct names for 63% of name identifiers and its type annotation predictions are correct in 81% of cases. Since its public release at http://jsnice.org, JSNice has become a popular system with hundreds of thousands of uses.

By formulating the problem of inferring program properties as structured prediction, our work opens up the possibility for a range of new "Big Code" applications such as de-obfuscators, decompilers, invariant generators, and others.

Back to Top

1. Introduction

Recent years have seen significant progress in the area of programming languages driven by advances in type systems, constraint solving, program analysis, and synthesis techniques. Fundamentally, these methods reason about each program in isolation and while powerful, the effectiveness of programming tools based on these techniques is approaching its inherent limits. Thus, a more disruptive change is needed if a significant improvement is to take place.

At the same time, creating probabilistic models from large datasets (also called "Big Data") has transformed a number of areas such as natural language processing, computer vision, recommendation systems, and many others. However, despite the overwhelming success of "Big Data" in a variety of application domains, learning from large datasets of programs has previously not had tangible impact on programming tools. Yet, with the tremendous growth of publicly available source code in repositories such as GitHub4 and BitBucket2 (referred to as "Big Code" by a recent DARPA initiative11) comes the opportunity to create new kinds of programming tools based on probabilistic models of such data. The vision is that by leveraging the massive effort already spent in developing millions of programs, such tools will have the ability to solve tasks beyond the reach of traditional techniques. However, effectively learning from programs is a challenge. One reason is that programs are data transformers with complex semantics that should be captured and preserved in the learned probabilistic model.

* 1.1. Structured prediction for programs

The core technical insight of our work is transforming the input program into a representation that enables us to formulate the problem of predicting program properties as structured prediction with Conditional Random Fields (CRFs).15 Indeed, CRFs are a powerful probabilistic model successfully used in a wide variety of applications including computer vision and natural language processing.12, 15, 16 We show how to instantiate this approach towards predicting semantic information (e.g., type annotations) as well as syntactic facts (e.g., identifier names). To our knowledge, this is the first work which shows how CRFs can be learned and used in the context of programs. By connecting programs to CRFs, a wide range of learning and inference algorithms14 can be used in the domain of programs.

Figure 1 illustrates the structured prediction approach. In the prediction phase (upper part of figure), we are given an input program for which we are to infer properties of interest. In the next step, we convert the program into a representation called dependency network. The dependency network captures relationships between program elements whose properties are to be predicted with elements whose properties are known. Once the network is obtained, we perform structured prediction and in particular, a query referred to as Maximum a Posteriori (MAP) inference.14 This query makes a joint prediction for all elements together by optimizing a scoring function based on the learned CRF model. Making a joint prediction which takes into account structure and dependence is particularly important as properties of different elements are often related. A useful analogy is the ability to make joint predictions in image processing where the prediction of a pixel label is influenced by the predictions of neighboring pixels.

f1.jpg
Figure 1. Structured prediction for programs.

f2.jpg
Figure 2. A screenshot of http://jsnice.org/: minified code (left), deobfuscated version (right).

* 1.2. JSNICE: Name and type inference for JavaScript

As an example of this approach, we built a system which addresses two important challenges in JavaScript: predicting (syntactic) identifier names and (semantic) type annotations of variables. Such predictions have applications in software engineering (e.g., refactoring to improve code readability), program analysis (e.g., type inference) and security (e.g., deobfuscation). We focused on JavaScript for three reasons. First, in terms of type inference, recent years have seen extensions of JavaScript that add type annotations such as the Google Closure Compiler5 and TypeScript.7 However, these extensions rely on traditional type inference, which does not scale to realistic programs that make use of dynamic evaluation and complex libraries (e.g., jQuery).13 Our work predicts likely type annotations for real world programs which can then be provided to the programmer or to a standard type checker. Second, much of JavaScript code found on the Web is obfuscated, making it difficult to understand what the program is doing. Our approach recovers likely identifier names, thereby making much of the code on the Web readable again. This is enabled by a large and well-annotated corpus of JavaScript programs available in open source repositories such as GitHub.

Since its release, JSNICE has become a widely used system with users ranging from JavaScript developers to security specialists. In a period of a year, our users deobfuscated over 9 GB (87.7 mn lines of code) of unique (non-duplicate) JavaScript programs. Figure 3 shows a histogram of the size of these programs, indicating that users often query it with large code fragments. The average JavaScript program size is 91.7 KB.

f3.jpg
Figure 3. Histogram of query sizes to http://jsnice.org/ sent by users in the period May 10, 2015-May 10, 2016.

* 1.3. Nice2Predict: Structured prediction framework

To facilitate faster creation of new applications (JSNICE being one example), we built a reusable framework called Nice2Predict (found at http://nice2predict.org) which includes all components of this work (e.g., training and inference) except the definition of feature functions (which are application specific). Then, to use our method one only needs to phrase their application in terms of a CRF model which is done by defining suitable feature functions (we show such functions for JSNICE later in the paper) and then invoke the Nice2Predict training and inference mechanisms. A recent example of this instantiation is DeGuard9 (http://apk-deguard.com), a system that performs Android layout de-obfuscation by predicting method, class, and field names erased by ProGuard.6

Back to Top

2. Overview

We now provide an informal description of our probabilistic approach on a running example. Consider the JavaScript program shown in Figure 4(a). This is a program which has short, non-descriptive identifier names. Such names can be produced by both a novice inexperienced programmer or by an automated process known as minification (a form of layout obfuscation) which replaces identifier names with shorter names. In the case of client-side JavaScript, minification is a common process on the Web and is used to reduce the size of the code being transferred over the network and/or to prevent users from understanding what the program is actually doing. In addition to obscure names, variables in this program also lack annotated type information. It can be difficult to understand that this obfuscated program happens to partition an input string into chunks of given sizes, storing those chunks into consecutive entries of an array.

f4.jpg
Figure 4. Probabilistic inference of program properties on an example.

Given the program in Figure 4(a), JSNICE automatically produces the program in Figure 4(e). The output program has new identifier names and is annotated with predicted types for the parameters, local variables, and return statement. Overall, it is easier to understand what that program does when compared to the original. We now provide an overview of the prediction recipe that performs this transformation. We focus on predicting names (reversing minification), but the process for predicting types is identical.

* 2.1. Step 1: Determine known and unknown properties

Given the program in Figure 4(a), we use a simple static (scope) analysis which determines the set of program elements for which we would like to infer properties. These are elements whose properties are unknown in the input (i.e., are affected by minification). When predicting names, this set consists of all local variables and function parameters of the input program: e, t, n, r, and i. We also determine the set of elements whose properties are known (not affected by minification). These include field and method names (e.g., the field element with name length). Both kinds of elements are shown in Figure 4(b). The goal is to predict the unknown properties based on: (i) the obtained known properties and (ii) the relationship between elements (discussed here).

* 2.2. Step 2: Build dependency network

Next, using features we later define, we build a dependency network capturing relationships between program elements. The dependency network is key to capturing structure when performing predictions and intuitively determines how properties to be predicted influence each other. For example, the link between known and unknown properties allows us to leverage the fact that many programs use common anchors (e.g., JQuery API) meaning that the unknown quantities we aim to predict are influenced by how known elements are used. Dependencies are triplets of the form (n,m,rel) where, n, m are program elements, and rel is the particular relationship between the two elements. In our work all dependencies are triplets, but these can be extended to relationships involving more than two elements.

In Figure 4(c), we show three example dependencies between the program elements. For instance, the statement i += t generates a dependency (i,t,L+=R), because i and t are on the left and right side of a+= expression. Similarly, the statement var r = e.length generates several dependencies including (r,length,L=_.R) which designates that the left part of the relationship, denoted by L, appears before the de-reference of the right side denoted by R (we elaborate on the different types of relationships in Section 4). For clarity, in Figure 4(c) we include only some of the relationships.

* 2.3. Step 3: MAP inference

After obtaining the dependency network, we next infer the most likely values for the nodes via a query known as MAP inference.14 Informally, in this type of query, a prediction leverages the structure of the network (i.e., the connections between the nodes) as opposed to predicting separately each node at a time. As illustrated in Figure 4(d), for the network of Figure 4(c), our system infers the new names step and len. It also predicts that the previous name i was most likely. Let us consider how we predicted the names step and len. The network in Figure 4(d) is the same one as in Figure 4(c) but with additional tables produced as an output of the learning phase (i.e., the learning determines the concrete feature functions and the weights associated with them). Each table is a function that scores the assignment of properties when connected by the corresponding edge (intuitively, how likely the particular pair is). Consider the topmost table in Figure 4(d). The first row says the assignment of i and step is scored with 0.5. The MAP inference searches for assignment of properties to nodes such that the sum of the scores shown in the tables is maximized. For the two nodes i and t, the inference ends up selecting the highest score from that table (i.e., the values i and step). Similarly for nodes i and r. However, for nodes r and length, the inference does not select the topmost row but the values from the second row. The reason is that if it had selected the topmost row, then the only viable choice (in order to match the value length) for the remaining relationship is the second row of i, r table (with value 0.6). However, the assignment 0.6 leads to a lower combined overall score. That is, MAP inference must take into account the global structure and cannot simply select the maximal score of each function.

To achieve good performance for the MAP inference, we developed a new algorithmic variant which targets the domain of programs (existing inference algorithms cannot efficiently deal with the combination of unrestricted network structure and a large number of possible predictions per element).

* 2.4. Output program

Finally, after the new names are inferred, our system transforms the original program to use these names. The output of the entire inference process is captured in the program shown in Figure 4(e). Notice how in this output program, the names tend to accurately capture what the program does.

* 2.5. Predicting type annotations

Even though we illustrated the inference process for variable names, the overall flow for predicting type annotations is identical. Interestingly, after predicting types, one can invoke a standard type checker to check whether the predicted types are valid for the program. For our example in Figure 4(e), the predicted type annotations (shown in comments) are indeed valid. In general, when predicting semantic properties (such as types) where soundness plays a role, our approach can be used as part of a guess-and-check loop.

* 2.6. Independent of minification

We note that our name inference process is independent of what the minified names are. In particular, the process will return the same names regardless of which minifier was used to obfuscate the original program (provided these minifiers always rename the same set of variables).

Back to Top

3. Structured Prediction

We now introduce the structured prediction approach. We later instantiate this approach in Section 4.

* 3.1. Notation: programs, labels, predictions

Let x X be a program. As with standard program analysis, we will infer properties about program statements or expressions (referred to as program elements). For a program x, each element (e.g., a variable) is identified with an index (a natural number). We usually separate the elements into two kinds: (i) elements for which we are interested in inferring properties and (ii) elements whose properties we know (e.g., obtained via say standard program analysis or manual annotation). We use two helper functions n, m: X → N to return the appropriate number of program elements for a given program x: n(x) returns the number of elements of kind (i) and m(x) the number of elements of kind (ii). When x is clear from the context, we write n instead of n(x) and m instead of m(x).

We use the set LabelsU to denote all possible values that a predicted property can take. For instance, in type prediction, LabelsU contains all possible basic types (e.g., number, string, etc). Then, for a program x, we use the notation y = (y1, ..., yn(x)) to denote a vector of predicted program properties. Here, y Y where Y = (LabelsU)*. That is, each entry yi in vector y ranges over LabelsU and denotes that program element i has property yi.

For a program x, we define the vector cacm6203_m.gif to capture known properties. Here, each cacm6203_n.gif can take a value from the set of properties LabelsK which could potentially differ from the set of properties LabelsU that we use for prediction. For example, if the known properties are integer constants, LabelsK will contain all valid integers. To avoid clutter when x is clear from the context, we use z instead of Zx. We use Labels = LabelsU LabelsK to denote the set of all properties.

Note that to apply this method the total number of predictions must be fixed (bounded) in advance (i.e., n(x)). This is unlike other settings, for example, grammars,10 where the number of predictions can be unbounded.

* 3.2. Problem definition

Let cacm6203_o.gif denote the training data: a set of t programs each annotated with corresponding properties. Our goal is to learn a model that captures the conditional probability Pr(y | x). Once the model is learned, we can predict properties of new programs by posing the following MAP query:

Given a new program x, find y = arg maxy'Ωx Pr(y'|x)

That is, we aim to find the most likely assignment of program properties y according to the probabilistic model. Here, ΩxY describes the set of possible assignments of properties y' to program elements of x. The set Ωx allows restricting the set of possible properties and is useful for encoding problem-specific constraints. For example, in type annotation inference, the set Ωx may restrict the annotations to types that make the resulting program typecheck.

* 3.3. Conditional random fields (CRFs)

We now briefly describe CRFs, a particular model defined in Lafferty et al.15 and previously used for a range of tasks such as natural language processing, image classification, and others. CRFs represent the conditional probability Pr(y | x). We consider the case where the factors are positive in which case, without loss of generality, any conditional probability of properties y given a program x can be encoded as follows:

ueq01.gif

where, score is a function that returns a real number indicating the score of an assignment of properties y for a program x. Assignments with higher score are more likely than assignments with lower score. Z(x), called the partition function, ensures that the above expression does in fact encode a conditional distribution. It returns a real number depending only on the program x, that is:

ueq02.gif

W.l.o.g, score can be expressed as a composition of a sum of k feature functions fi associated with weights wi:

ueq03.gif

Here, f is a vector of functions fi and w is a vector of weights wi. The feature functions fi: Y x X → R; are used to score assignments of program properties. This representation of score functions is well-suited for learning (as the weights w can be learned from data).

* 3.4. Features as constraints

Feature functions are key to controlling the likelihood predictions. For instance, a feature function can be defined to prohibit or lower the score of an undesirable prediction: say if fi(yB, x) = -H where H is a very large positive number, then fi (with weight wi > 0) essentially disables an assignment yB as Pr(yB | x) will approach 0.

* 3.5. General prediction approach

Let us first define the kind of relations between program elements we use when making predictions. Let the set of possible relations be Rels. An example relation we considered in our running example was L + = R. It relates variables i and t in Figure 4(c) and arises due to the expression i + = t in the code. Examples of other relations are found in Section 4.

* 3.6. Pairwise indicator feature functions

Let cacm6203_p.gif be a set of pairwise feature functions where each ψi: Labels x Labels x Rels → R scores a pair of properties when related with a relation from Rels. For example:

ueq04.gif

In general, any feature function can be used, but our work shows that these pairwise functions are sufficient for making high-quality predictions of names and types. Next, we go over the steps of the prediction procedure more formally.

* 3.7. Step 1: Build dependency network

We begin by building the network Gx = Vx, Ex for a program x, capturing dependencies between predictions. Here, cacm6203_q.gif consists of elements (e.g., variables) for which we would like to predict properties cacm6203_r.gif and elements whose properties we already know cacm6203_s.gif The set of edges ExVx x Vx x Rels denotes the fact that there is a relationship between two program elements and describes what that relationships is. This definition of network is also called a multi-graph because there is no restriction on having only a single edge between a pair of nodes – our definition permits multiple dependencies with different Rels.

We define the feature functions f(y, x) over the graph Gx as follows. Let (y, zx) be a concatenation of the unknown properties y and the known properties zx in x. Then, fi is defined as the sum of the applications of its corresponding ψi over the set of all network edges in Gx:

ueq05.gif

* 3.8. Step 2: MAP inference

Recall that the key query we perform is MAP inference. That is, given a program x, find a prediction y such that:

ueq06.gif

As arg max is independent of Z(x), we obtain an equivalent simplified query:

ueq07.gif

In theory, one can use any of the available inference algorithms to solve for the above query (exact inference is in general an NP-hard MaxSAT problem). In this work, we designed a fast and approximate greedy MAP inference algorithm tailored to our setting of programs: pairwise feature functions, unrestricted nature of Gx and the a large set of possible assignments. Our algorithm changes the labels of each node one-by-one or in pairs until the assignment score cannot be improved further.

* 3.9. Learning

The goal of the training phase (lower part of Figure 1) is to learn the weights w used in the score function from a large training set of programs. These weights cannot be obtained by means of counting in the training data.14 [Section 20.3.1]. Instead, we use a learning technique from online support vector machines: given a training dataset cacm6203_t.gif of t samples, the goal is to find w such that the given assignments y(j) are the highest scoring assignments in as many training samples as possible subject to additional margin learning constraints. The learning procedure is described in Ratliff et al.17

Back to Top

4. Predicting Names and Types

In this section we present the JSNICE system for prediting: (i) names of local variables and (ii) type annotations of function arguments. We investigate the above challenges in the context of JavaScript, a popular language where addressing the above two questions is of significant importance. We do note however that much of the machinery discussed in this section applies as-is to other programming languages.

* 4.1. JavaScript identifier name prediction

The goal of our name prediction task is to predict the (most likely) names of local variables in a given program x. We proceed as follows. First, we define cacm6203_u.gif to range over all constants, objects properties, methods, and global variables of x. Each element in cacm6203_u.gif can be assigned values from the set LabelsK = JSConsts JSNames, where JSNames is a set of all valid identifier names and JSConsts is a set of possible constants. We note that object property names and Application Programming Interface (API) names are modeled as constants, as the dot (.) operator takes an object on the left-hand side and a string constant on the right-hand side. We define the set cacm6203_r.gif to contain all local variables of x. Here, a variable name belonging to two different scopes leads to two program elements in cacm6203_r.gif. Finally, LabelsU ranges over JSNames.

To ensure the newly predicted names preserve program semantics, we ensure the following additional constraints hold: (i) all references to a renamed local variable must be renamed to the same name. This is enforced by how we define cacm6203_r.gif (each element corresponds to a local variable as opposed to one element per variable occurrence), (ii) the predicted identifier names must not be reserved keywords. This is enforced by ensuring that LabelsU does not contain keywords, and (iii) the prediction must not suggest the same name for two different variables in the same scope. This is enforced by prohibiting assignments of labels with conflicting names.

* 4.2. JavaScript type annotation prediction

Our second application involves probabilistic type annotation inference of function parameters. These annotations are particularly challenging to derive via standard program analysis techniques because such a derivation would require finding all possible callers of a function. Instead, we leverage existing manually (type) annotated JavaScript programs. In JSNICE we use JSDoc1 annotated code for training data.

The simplified language over which we predict type annotations is defined as follows:

ueq08.gif

Here, n ranges over constants (n JSConsts), var is a meta-variable ranging over the program variables, * ranges over binary operators (+, -, *, /, ., <, ==, ===, etc.), and τ ranges over all possible variable types. That is, τ = {?} L where L is a set of types (we discuss how to instantiate L below) and ? denotes the unknown type. To be explicit, we use the set JSTypes where JSTypes = τ. We use the function []x : exJSTypes to obtain the type of an expression in a program x. This map can be manually provided or built via program analysis. When the program x is clear from the context we use [e] as a shortcut for []x(e).

* 4.3. Defining known and unknown program elements

We define the set of unknown program elements as follows:

ueq09.gif

That is, cacm6203_r.gif contains variables whose type is unknown. In principle, we can differentiate between the type and the unknown type ? in order to allow for finer control over which types we would like to predict. However, since standard type inference cannot predict types of function parameters, we annotate all non-annotated parameters with type?.

Next, we define the set of known elements cacm6203_u.gif. Note that cacm6203_u.gif can contain any expression, not just variables like cacm6203_r.gif:

ueq10.gif

That is, cacm6203_u.gif contains both, expressions whose types are known as well as constants. We do not restrict the set of possible assignments Ωx, that is, Ωx = (JSTypes)n (recall that n is a function which returns the number of elements whose property is to be predicted). This means that we rely entirely on the learning to discover the rules that will produce non-contradicting types. The only restriction (discussed here) that we apply is constraining JSTypes when performing predictions.

For JSTypes = {?} L the set L of types can be instantiated in various ways. In this work we define L = P(T) where T, is a complete lattice of types with T and as defined in Figure 5. In the figure we use "..." to denote a potentially infinite number of user-defined object types. Note that L allows that a variable can be of a union type {string, number} which for convenience can also be written as string V number.

f5.jpg
Figure 5. The lattice of types over which prediction occurs.

* 4.4. Relating program elements

The relationships between program elements that we introduce define how to build the set of edges Ex of a program x. Since the elements for both prediction tasks are similar, so are the relationships. If a relationship is specific to a particular task, we explicitly state so in its description.

Relating expressions. The first relationship we discuss is syntactic in nature: it relates two program elements based on the program's Abstract Syntax Tree (AST). Let us consider the expression i + j<k. First, we build the AST of the expression as shown in Figure 6 (a). Suppose we are interested in performing name prediction for variables i, j, and k, represented with indices 1, 2, and 3 respectively, that is, cacm6203_v.gif Then, we build the dependency network as shown in Figure 6(b) to indicate the prediction for the three elements are dependent on one another (with the particular relationship shown over the edge). For example, the edge between 1 and 2 represents the relationship that these nodes participate in an expression L+R where L is a node for 1 and R is a node for 2.

f6.jpg
Figure 6. (a) The AST of expression i + j < k and two dependency networks built from the AST relations: (b) for name, and (c) for type predictions.

Relationships are defined using the following grammar:

ueq11.gif

All relationships relast are part of Rels, that is, relast Rels. As discussed earlier, ranges over binary operators. All relationships derived using the above grammar have exactly one occurrence of L and R. For a relationship r relast, let r[x/L, y/R, e/_] denote the expression where x is substituted for L, y is substituted for R and the expression e is substituted for _. Then, given two program elements a and b and a relationship r relast, a match is said to exist if r[a/L, b/R, [expr]/_] Exp(x) ≠ /. Here, [expr] denotes all possible expressions in the programming language and Exp(x) is all expressions of program x. An edge (a, b, r) Ex between two program elements a and b exists if there exists a match between a, b, and r.

Note that for a given pair of elements a and b there could be more than one relationship which matches, that is, both r1, r2 relast match where r1r2 (therefore, there could be multiple edges between a and b with different relationships).

The relationships described above are useful for both name and type inference. For names, the expressions being related are variables, while for types, they need not be restricted to variables. For example, in Figure 6(c) there is a relationship between the types of k and i + j via L<R. Note that our rules do not directly capture relationships between [i] and [i+j], but they are transitively dependent. Still, many important relationships for type inference are present. For instance, in classic type inference, the relationship L=R implies a constraint rule [L] [R] where is the supertype relationship (indicated in Figure 5). Interestingly, our inference model can learn such rules instead of providing them manually.

Other relations. We introduce three other types of semantic relations: (i) a relationship capturing aliasing between expressions, (ii) a function calling relationship capturing whether a function (represented by a variable) may call another function (also represented by a variable), and (iii) an object access relationship specifying whether an object field (represented by a string constant) may be accessed by a function. The last two relationships are only used in name prediction and are particularly effective when predicting function names.

Back to Top

5. Implementation and Evaluation

We implemented our approach in an interactive system, called JSNICE, which targets name and type predictions for JavaScript. JSNICE modifies the Google Closure Compiler.5 In standard operation, this compiler takes human-readable JavaScript with optional type annotations, type-checks it and returns an optimized, minified, and human-unreadable JavaScript with stripped annotations.

In our system, we added a new mode to the compiler that reverses its standard operation: given an optimized minified JavaScript code, JSNICE generates JavaScript code that is well annotated (with types) and as human-readable as possible (with useful identifier names). Our two applications for names and types were implemented as two probabilistic models that can be invoked separately.

* 5.1. Our dataset

To evaluate our approach, we collected two disjoint sets of JavaScript programs to form our training and evaluation data. For training, we downloaded 10,517 JavaScript projects from GitHub.4 For evaluation, we took the 50 JavaScript projects with the highest number of commits from BitBucket.2 By taking projects from different repositories, we decrease the likelihood of overlap between training and evaluation data.

We also searched in GitHub to check that the projects in the evaluation data are not included in the training data. Finally, we implemented a simple checker to detect and filter out minified and obfuscated files from the training and the evaluation data. After filtering minified files, we ended up with training data consisting of 324,501 files and evaluation data of 2,710 files.

* 5.2. Precision

To evaluate the precision of JSNICE, we first minified all 2,710 files in the training data with UglifyJS8 (other sound minifiers should produce input that is equivalent for the purposes of using JSNICE). As mentioned, we focus on a particular popular form of obfuscation called layout obfuscation. It works by renaming local variables to meaningless short names and removing whitespaces and type annotations (other types of obfuscation such as encryption are not considered in this work). Each minified program is semantically equivalent (except when using with or eval) to the original. Then, we used JSNICE on the minified programs to evaluate its capabilities in reconstructing name and type information. We compared the precision of the following configurations:

  • The most powerful system works with all of the training data and performs structured prediction as described so far.
  • Two systems using a fraction of the training data — one on 10% and one on 1% of the files.
  • To evaluate the effect of structure when making predictions, we disabled relationships between unknown properties and performed predictions on that network (the training phase still uses structure).
  • A naïve baseline which does no prediction: it keeps names the same and sets all types to the most common type string.

Name predictions. To evaluate the accuracy of name predictions, we took each of the minified programs and used the name inference in JSNICE to rename its local variables. Then, we compared the new names to the original names (before obfuscation) for each of the tested programs. The results for the name reconstruction are summarized in the second column of Table 1. Overall, our best system exactly recovers 63.4% of identifier names. The systems trained on less data have significantly lower precision showing the importance of training data size.

t1.jpg
Table 1. Precision and recall for name and type reconstruction of minified JavaScript programs evaluated on our test set.

Not using structured prediction also drops the accuracy significantly and has about the same effect as an order of magnitude less data. Finally, not changing any identifier names produces accuracy of 25.3%—this is because minifying the code may not rename some variables (e.g., global variables) in order to guarantee semantic preserving transformations and occasionally one-letter local variable names stay the same (e.g., induction variable of a loop).

Type annotation predictions. Out of the 2,710 test programs, 396 have type annotations for functions in a JSDoc. For these 396, we took the minified version with no type annotations and tried to rediscover all types in the function signatures. We first ran the Closure compiler type inference, which produces no types for the function parameters. Then, we ran and evaluated JSNICE on inferring these function parameter types.

JSNICE does not always produce a type for each function parameter. For example, if a function has an empty body, or a parameter is not used, we often cannot relate the parameter to any known program properties and as a result, no prediction can be made and the unknown type (?) is returned. To take this effect into account, we report both recall and precision. Recall is the percentage of function parameters in the evaluation for which JSNICE made a prediction other than ?. Precision refers to the percentage of cases—among the ones for which JSNICE made a prediction—where it was exactly equal to the manually provided JSDoc annotation of the test programs. We note that the manual annotations are not always precise, and as a result 100% precision is not necessarily a desired outcome.

We present our evaluation results for types in the last two columns of Table 1. Since we evaluate on production JavaScript applications that typically have short methods with complex relationships, the recall for predicting program types is only 66.9% for our best system. However, we note that none of the types we infer are inferred by state-of-the-art forward type analysis (e.g., Facebook Flow3).

Since the total number of commonly used types is not as high as the number of names, the amount of training data has less impact on the system precision and recall. To further increase the precision and recall of type prediction, we hypothesize that adding more (semantic) relationships between program elements would be of higher importance than adding more training data. Dropping structure increases the precision of the predicted types slightly, but at the cost of a significantly reduced recall. The reason is that some types are related to known properties only transitively via other predicted types—relationships that non-structured approaches cannot capture. On the other end of the spectrum is a prediction system that suggests for every variable the most likely type in JavaScript programs—string. Such a system has 100% recall, but its precision is only 37.8%.

* 5.3. Usefulness of predicted types

To see if the predicted types are useful, we compared them to the original ones. First, we note that our evaluation data has 3,505 type annotations for function parameters in 396 programs. After removing these annotations and reconstructing them with JSNICE, the number of annotations that are not ? increased to 4,114 for the same programs. The reason JSNICE produces more types than originally present despite having 66.3% recall is that not all functions in the original programs had manually provided types.

Interestingly, despite annotating more functions than the original code, the output of JSNICE has fewer type errors. We summarize these findings in Figure 7. For each of the 396 programs, we ran the typechecking pass of Google's Closure Compiler to discover type errors. Among others, this pass checks for incompatible types, calling into a non-function, conflicting, missing types, and non-existent properties on objects. For our evaluation, we kept all checks except the non-existent property check, which fails on almost all (even valid) programs, because it depends on annotating all properties of JavaScript classes—annotations that almost no program in training or evaluation data possesses.

f7.jpg
Figure 7. Evaluation results for the number of type-checking programs with manually provided types and with predicted types.

When we ran typechecking on the input programs, we found the majority (289) to have typechecking errors. While surprising, this can be explained by the fact that JavaScript developers typically do not typecheck their annotations. Among others, we found the original code to have misspelled type names. Most typecheck errors occur due to missing or conflicting types. In a number of cases, the types provided were interesting for documentation, but were semantically wrong—for example, a parameter is a string that denotes function name, but the manual annotation designates its type to be Function. In contrast, the types reconstructed by JSNICE make the majority (227) of programs typecheck. In 141 of programs that originally did not typecheck, JSNICE was able to infer correct types. On the other hand, JSNICE introduced type errors in 21 programs. We investigated some of these errors and found that not all of them were due to wrong types—in several cases the predicted types were rejected due to inability of the type system to precisely express the desired program properties without also manually providing type cast annotations.

* 5.4. Model sizes

Our models contain 7,627,484 features for names and 70,052 features for types. Each feature is stored as a triple, along with its weight. As a result we need only 20 bytes per feature, resulting in a 145.5 MB model for names and 1.3 MB model for types. The dictionary which stores all names and types requires 16.8 MB. As we do not compress our model, the memory requirements for query processing are proportional to model size.

Back to Top

6. Conclusion

We presented a new probabilistic approach for predicting program properties by learning from large codebases (termed "Big Code"). The core technical idea is to formulate the problem of property inference as structured prediction with CRFs, enabling joint predictions of program properties. As an instantiation of our method, we presented a system called JSNICE that can reverse the process of layout de-obfuscation by predicting name and type annotations for JavaScript. Since its public release, JSNICE has become a popular tool for JavaScript layout de-obfuscation.

Interesting items for future work include reversing other types of obfuscation (beyond layout), extending the approach to predict semantic invariants of programs, as well as richer integration with standard program analyzers where the next prediction of the machine learning model is guided by a potential counter-example produced by the analyzer to the previous prediction.

We also remark that over the last few years the field of learning from "Big Code" has become an active area of research. Recent surveys covering various developments in this space can be found in Raychev18 and Vechev & Yahav.19

Back to Top

References

1. Annotating javascript. https://github.com/google/closure-compiler/wiki/Annotating-JavaScript-for-the-Closure-Compiler.

2. Bitbucket. https://bitbucket.org/.

3. Facebook flow. https://github.com/facebook/flow.

4. Github. http://github.com/.

5. Google closure compiler. https://developers.google.com/closure/compiler/.

6. Shrink your code and resources. ProGuard for Android Applications: https://developer.android.com/studio/build/shrink-code.html.

7. Typescript. https://www.typescriptlang.org/.

8. Uglifyjs. https://github.com/mishoo/UglifyJS.

9. Bichsel, B., Raychev, V., Tsankov, P., Vechev, M. Statistical deobfuscation of android applications. CCS 2016.

10. Bielik, P., Raychev, V., Vechev, M.T. PHOG: probabilistic model for code. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016 (2016), pp. 2933–2942.

11. DARPA. Mining and understanding software enclaves (muse). http://www.darpa.mil/news-events/2014-03-06a (2014).

12. He, X., Zemel, R.S., Carreira-Perpiñán, M.A. Multiscale conditional random fields for image labeling. CVPR 2004.

13. Jensen, S.H., Møller, A., Thiemann, P. Type analysis for javascript. In Proceedings of the 16th International Symposium on Static Analysis, SAS 2009 (Berlin, Heidelberg, 2009), Springer-Verlag, pp. 238–255.

14. Koller, D., Friedman, N. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, Cambridge, Massachusetts and London, England, 2009.

15. Lafferty, J.D., McCallum, A., Pereira, F.C.N. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML 2001 (San Francisco, CA, USA, 2001), pp. 282–289.

16. Quattoni, A., Collins, M., Darrell, T. Conditional random fields for object recognition. In NIPS (2004), 1097–1104.

17. Ratliff, N.D., Bagnell, J.A., Zinkevich, M. (Approximate) subgradient methods for structured prediction. In AISTATS (2007), 380–387.

18. Raychev, V. Learning from Large Codebases. PhD dissertation, ETH Zurich, 2016.

19. Vechev, M., Yahav, E. Programming with "big code". Foundations and Trends in Programming Languages 3, 4 (2016), 231–284.

Back to Top

Authors

Veselin Raychev ([email protected]), ETH Zurich, Zurich, Switzerland.

Martin Vechev ([email protected]), ETH Zurich, Zurich, Switzerland.

Andreas Krause ([email protected]), ETH Zurich, Zurich, Switzerland.

Back to Top

Footnotes

The original version of this paper was published in ACM POPL'2015.


©2019 ACM  0001-0782/19/03

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from [email protected] or fax (212) 869-0481.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2019 ACM, Inc.


 

No entries found