Artificial Intelligence and Machine Learning Research highlights

Spreadsheet Data Manipulation Using Examples

By Sumit Gulwani, William R. Harris, and Rishabh Singh

Posted Aug 1 2012

Abstract
1. Introduction
2. Overview
3. Syntactic Transformations
4. Semantic Transformations
5. Table Layout Transformations
6. Related Work
7. Conclusion and Future Work
References
Authors
Footnotes
Figures

Millions of computer end users need to perform tasks over large spreadsheet data, yet lack the programming knowledge to do such tasks automatically. We present a programming by example methodology that allows end users to automate such repetitive tasks. Our methodology involves designing a domain-specific language and developing a synthesis algorithm that can learn programs in that language from user-provided examples. We present instantiations of this methodology for particular domains of tasks: (a) syntactic transformations of strings using restricted forms of regular expressions, conditionals, and loops, (b) semantic transformations of strings involving lookup in relational tables, and (c) layout transformations on spreadsheet tables. We have implemented this technology as an add-in for the Microsoft Excel Spreadsheet system and have evaluated it successfully over several benchmarks picked from various Excel help forums.

1. Introduction

The IT revolution over the past few decades has resulted in two significant advances: the digitization of massive amounts of data and widespread access to computational devices. It is thus not surprising that more than 500 million people worldwide use spreadsheets for storing and manipulating data. These business end users have myriad diverse backgrounds and include commodity traders, graphic designers, chemists, human resource managers, finance professionals, marketing managers, underwriters, compliance officers, and even mailroom clerks—they are not professional programmers, but they need to create small, often one-off, applications to perform business tasks.⁴

Unfortunately, the state of the art of interfacing with spreadsheets is far from satisfactory. Spreadsheet systems, like Microsoft Excel, come with a maze of features, but end users struggle to find the correct features to accomplish their tasks.¹² More significantly, programming is still required to perform tedious and repetitive tasks such as transforming names or phone numbers or dates from one format to another, cleaning data, or extracting data from several text files or Web pages into a single document. Excel allows users to write macros using a rich inbuilt library of string and numerical functions, or to write arbitrary scripts in Visual Basic or .NET programming languages. However, since end users are not proficient in programming, they find it too difficult to write desired macros or scripts. Moreover, even skilled programmers might hesitate to write a script for a one-off repetitive task.

We performed an extensive case study of spreadsheet help forums and observed that string and table processing is a very common class of programming problems that end users struggle with. This is not surprising given that various languages such as Perl, Awk, and Python were designed to support string processing, and that new languages such as Java and C# provide rich support for string processing. During our study, we also observed how novice users specified their desired programs to expert users: most specifications consisted solely of one or more inputoutput examples. Since inputoutput examples may underspecify a program, the interaction between a novice and an expert often involved multiple rounds of communication over multiple days. Inspired by this observation, we developed a programming by example (PBE), or inductive synthesis, methodology¹⁵ that has produced synthesizers that can automatically generate a wide range of string/table manipulating programs in spreadsheets from inputoutput examples. Each synthesizer takes the role of the forum expert, removing a human from the interaction loop and enabling users to solve their problems in a few seconds instead of few days.

This paper is organized as follows. We start with a brief overview of our PBE methodology (Section 2). We then describe an application of this methodology to perform syntactic string manipulation tasks (Section 3).⁶ This is followed by an extension that automates more sophisticated semantic string manipulations requiring background knowledge, which can often be encoded as relational tables (Section 4).¹⁸ We also describe an application of this methodology to perform layout transformations on tables (Section 5).⁸ In Section 6, we discuss related work, and in Section 7, we conclude and discuss future work.

2. Overview

In this section, we outline a general methodology that we have used for developing inductive synthesizers for end-user programming tasks, along with how a user can interact with the synthesizers. In the first step of our methodology, we identify a domain of useful tasks that end users struggle with and can clearly describe with examples, by studying help forums or performing user studies (this paper presents two domains: string manipulation and table manipulation). We then develop the following.

Domain-specific language: We design a domain-specific language L that is expressive enough to capture several real-world tasks in the domain, but also restricted enough to enable efficient learning from examples.

Data structure for representing consistent programs: The number of programs in L that are consistent with a given set of inputoutput examples can be huge. We define a data structure D based on a version-space algebra¹⁴ to succinctly represent a large set of such programs.

Algorithm for synthesizing consistent programs: Our synthesis algorithm for language L applies two key procedures: (i) Generate learns the set of all programs, represented using data structure D, that are consistent with a given single example. (ii) Intersect intersects these sets (each corresponding to a different example).

Ranking: We develop a scheme that ranks programs, preferring programs that are more general. Each ranking scheme is inspired by Occam’s razor, which states that a smaller and simpler explanation is usually the correct one. We define a partial order relationship between programs to compare them. Any partial order can be used that efficiently orders programs represented in the version-space algebra used by the data structure D. Such an order can be applied to efficiently select the top-ranked programs from among a set represented using D. The ranking scheme can also take into account any test inputs provided by the user (i.e., new additional inputs on which the user may execute a synthesized program). A program that is undefined on any test input or generates an output whose characteristics are different from that of training outputs can be ranked lower.

2.1. Interaction models

A user provides to the synthesizer a small number of examples, and then can interact with the synthesizer according to multiple models. In one model, the user runs the top-ranked synthesized program on other inputs in the spreadsheet and checks the outputs produced by the program. If any output is incorrect, the user can fix it and reapply the synthesizer, using the fix as an additional example. However, requiring the user to check the results of the synthesized program, especially on a large spreadsheet, can be cumbersome. To enable easier interaction, the synthesizer can run all synthesized programs on each new input to generate a set of corresponding outputs for that input. The synthesizer can highlight for the user the inputs that cause multiple distinct outputs. Our prototypes, implemented as Excel add-ins, support this interaction model.

A second model accommodates a user who requires a reusable program. In this model, the synthesizer presents the set of consistent programs to the user. The synthesizer can show the top k programs or walk the user through the data structure that succinctly represents all consistent programs and let the user select a program. The programs can be shown using programming-language syntax, or they can be described in a natural language. The differences between different programs can be explained by synthesizing a distinguishing input on which the programs behave differently.¹⁰ The user can reapply the synthesizer with the distinguishing input and its desired output as an additional example.

3. Syntactic Transformations

Spreadsheet users often struggle with reformatting or cleaning data in spreadsheet columns. For example, consider the following task.

EXAMPLE 1 (PHONE NUMBERS). An Excel user wants to uniformly format the phone numbers in the input column, adding a default area code of “425” if the area code is missing.

Such tasks can be automated by applying a program that performs syntactic string transformations. We now present an expressive domain-specific language of string-processing programs that supports limited conditionals and loops, syntactic string operations such as substring and concatenate, and matching based on regular expressions.⁶

3.1. Domain-specific language

Our domain-specific programming language for performing syntactic string transformations is given in Figure 1(a). A string program P is an expression that maps an input state σ, which holds values for m string variables v₁, …, v_m (denoting the multiple input columns in a spreadsheet), to an output string s. The top-level string expression P is a Switch constructor whose arguments are pairs of Boolean expressions b and trace expressions e. The set of Boolean expressions in a Switch construct must be disjoint, that is, for any input state, at most one of the Boolean expressions can be true. The value of P in a given input state σ is the value of the trace expression that corresponds to the Boolean expression satisfied by σ. A Boolean expression b is a propositional formula in Disjunctive Normal Form (DNF). A predicate Match(v_i, r, k) is satisfied if and only if v_i contains at least k nonoverlapping matches of regular expression r. (In general, any finite set of predicates can be used.)

A trace expression Concatenate(f₁, …, f_n) is the concatenation of strings represented by atomic expressions f₁, …, f_n. An atomic expression f is either a constant-string expression ConstStr, a substring expression constructed from SubStr, or a loop expression constructed from Loop.

The substring expression SubStr(v_i, p₁, p₂) is defined partly by two position expressions p₁ and p₂, each of which implicitly refers to the (subject) string v_i and must evaluate to a position within the string v_i. (A string with characters has + 1 positions, numbered from 0 to starting from left.) SubStr(v_i, p₁, p₂) is the substring of string v_i in between positions p₁ and p₂. For a nonnegative constant k, CPos(k) denotes the k^th position in the subject string. For a negative constant k, CPos(k) denotes the ( + 1 + k)^th position in the subject string, where = Length(s). Pos(r₁,r₂,c) is another position expression, where r₁ and r₂ are regular expressions and integer expression c evaluates to a nonzero integer. Pos(r₁,r₂,c) evaluates to a position t in the subject string s such that r₁ matches some suffix of s[0:t], and r₂ matches some prefix of s[t: ], where = Length(s). Furthermore, if c is positive (negative), then t is the |c|^th such match starting from the left side (right side). We use the expression s[t₁:t₂] to denote the substring of s between positions t₁ and t₂. The substring construct is quite expressive. For example, in the expression SubStr(v_i, Pos(r₁,r₂,c), Pos(r₃,r₄,c)), r₂ and r₃ describe the characteristics of the substring in v_i to be extracted, while r₁ and r₄ describe the characteristics of the surrounding delimiters. We use the expression SubStr2(v_i, r, c) as an abbreviation to denote the c^th occurrence of regular expression r in v_i, that is, SubStr(v_i, Pos(ε, r, c), Pos(r, ε, c)).

A regular expression r is ε (which matches the empty string, and therefore can match at any position of any string), a token T, or a token sequence TokenSeq(T₁, …, T_n). This restricted choice of regular expressions enables efficient enumeration of regular expressions that match certain parts of a string. We use the following finite (but easily extended) set of tokens: (a) StartTok, which matches the beginning of a string, (b) EndTok, which matches the end of a string, (c) a token for each special character, such as hyphen, dot, semicolon, comma, slash, or left/right parenthesis/bracket, and (d) two tokens for each character class C, one that matches a sequence of one or more characters in C, and another that matches a sequence of one or more characters that are not in C. Examples of a character class C include numeric digits (09), alphabetic characters (azAZ), lowercase alphabetic characters (az), uppercase alphabetic characters (AZ), alphanumeric characters, and whitespace characters. UpperTok, NumTok, and AlphNumTok match a nonempty sequence of uppercase alphabetic characters, numeric digits, and alphanumeric characters, respectively. DecNumTok matches a non-empty sequence of numeric digits and/or decimal point. HyphenTok and SlashTok match the hyphen character and the slash character, respectively.

The task described in Example 1 can be expressed in our domain-specific language as

The atomic expression Loop(λw : e) is the concatenation of e₁, e₂, …, e_n, where e_i is obtained from e by replacing all occurrences of integer w by i, and n is the smallest integer such that evaluation of e_n+1 is undefined. (It is also possible to define more interesting termination conditions, e.g., based on position expressions or predicates.) A trace expression e is undefined when (i) a constituent CPos(k) expression refers to a position not within its subject string, (ii) a constituent Pos(r₁, r₂, c) expression is such that the subject string does not contain c occurrences of a match bounded by r₁ and r₂, or (iii) a constituent SubStr(v_i, p₁, p₂) expression has position expressions that are both defined but the first refers to a position that occurs later in the subject string than the position indicated by the second. The following example illustrates the utility of the loop construct.

EXAMPLE 2 (GENERATE ABBREVIATION). The following task was presented originally as an Advanced Text Formula.²³

This task can be expressed in our language as

Our tool synthesizes this program from the first example row and uses it to produce the entries in the second and third rows (shown here in boldface for emphasis) of the output column.

3.2. Synthesis algorithm

The synthesis algorithm first computes, for each inputoutput example (σ, s), the set of all trace expressions that map input σ to output s (using procedure Generate). It then intersects these sets for similar examples and learns conditionals to handle different cases (using procedure Intersect). The size of such sets can be huge; therefore, we must develop a data structure that allows us to succinctly represent and efficiently manipulate huge sets of program expressions.

Data structure: Figure 1(b) describes our data structure for succinctly representing sets of programs from our domain-specific language. &cacm5508_b;, and &cacm5508_c; denote representations of, respectively, a set of string programs, a set of trace expressions, a set of atomic expressions, and a set of position expressions. &cacm5508_d; and &cacm5508_e; represent a set of regular expressions and a set of integer expressions; these sets are represented explicitly.

The Concatenate constructor used in our string language is generalized to the Dag constructor &cacm5508_f;, where &cacm5508_g; is a set of nodes containing two distinctly marked source and target nodes η^s and η^t, &cacm5508_h; is a set of edges over nodes in &cacm5508_g; that defines a Directed Acyclic Graph (DAG), and W maps each &cacm5508_i; &cacm5508_j; to a set of atomic expressions. The set of all Concatenate expressions represented by a &cacm5508_f; constructor includes exactly those whose ordered arguments belong to the corresponding edges on any path from η^s to η^t. The Switch, Loop, SubStr, and Pos constructors are all overloaded to construct sets of the corresponding program expressions that are shown in Figure 1(a). The ConstStr and CPos constructors can be regarded as producing singleton sets.

The data structure supports efficient implementation of various useful operations including intersection, enumeration of programs, and their simultaneous execution on a given input. The most interesting of these is the intersection operation, which is similar to regular automata intersection. The additional challenge is to intersect edge labels—in the case of automata, the labels are simply sets of characters, while in our case, the labels are sets of string expressions.

Procedure Generate: The number of trace expressions that can generate a given output string from a given input state can be huge. For example, consider the second inputoutput pair in Example 1, where the input state consists of one string “(425)-706-7709” and the output string is “425-706-7709”. Figure 2 shows a small sampling of different ways of generating parts of the output string from the input string using SubStr and ConstStr constructors. Each substring extraction task itself can be expressed with a huge number of expressions, as explained later. The following are three of the trace expressions represented in the figure, of which only the second one, shown in the figure in bold, expresses the program expected by the user:

Extract substring “425”. Extract substring “-706-7709”.
Extract substring “425”. Print constant “-“. Extract substring “706”. Print constant “-“. Extract substring “7709”.
Extract substring “425”. Extract substring “-706”. Print constant “-“. Extract substring “7709”.

We apply two crucial observations to succinctly generate and represent all such trace expressions. First, the logic for generating some substring of an output string is completely decoupled from the logic for generating another disjoint substring of the output string. Second, the total number of different substrings/parts of a string is quadratic (and not exponential) in the size of that string.

The Generate procedure creates a Directed Acyclic Graph (DAG) &cacm5508_f; that represents the trace set of all trace expressions that generate a given output string from a given input state. Generate constructs a node corresponding to each position within the output string and constructs an edge from a node corresponding to any position to a node corresponding to any later position. Each edge corresponds to some substring of the output and is annotated with the set of all atomic expressions that generate that substring. We describe below how to generate the set of all such SubStr expressions. Any Loop expressions are generated by first generating candidate expressions (by unifying the sets of trace expressions associated with the substrings s[k₁ : k₂] and s[k₂ : k₃], where k₁, k₂, and k₃ are the boundaries of the first two loop iterations, identified by considering all possibilities), and then validating them.

The number of substring expressions that can extract a given substring from a given string can be huge. For example, following is a small sample of various expressions that extract “706” from the string “425-706-7709” (call it v₁).

Second number: SubStr2(v₁, NumTok, 2).
2^nd last alphanumeric token: SubStr2(v₁, AlphNumTok, −2).
Substring between the first hyphen and the last hyphen: SubStr(v₁, Pos(HyphenTok, ε, 1), Pos(ε, HyphenTok, −1)).
First number that occurs between hyphen on both ends. SubStr(v₁, Pos(HyphenTok, TokenSeq(NumTok, HyphenTok), 1), Pos(TokenSeq(HyphenTok, NumTok), HyphenTok, 1)).
First number preceded by a numberhyphen sequence. SubStr(v₁, Pos(TokenSeq(NumTok, HyphenTok), NumTok, 1), Pos(TokenSeq(NumTok, HyphenTok, NumTok), ε, 1)).

The substring-extraction problem can be decomposed into two independent position-identification problems, each of which can be solved independently. The solutions to the substring-extraction problem can also be maintained succinctly by independently representing the solutions to the two position-identification problems. Note the representation of the SubStr constructor in Eq. 1 in Figure 1(b).

Procedure Intersect: Given a trace set for each inputoutput example, the Intersect procedure generates the top-level Switch constructor. Intersect first partitions the examples, so that inputs in the same partition are handled by the same conditional in the top-level Switch expression, and then intersects the trace sets for inputs in the same partition. If a set of inputs are in the same partition, then the intersection of trace sets is non-empty. Intersect uses a greedy heuristic to minimize the number of partitions by starting with singleton partitions and then iteratively merging partitions that have the highest compatibility score, which is a function of the size of the resulting partition and its potential to be merged with other partitions.

Intersect then constructs a classifier for each of the resultant partitions, which is a Boolean expression that is satisfied by exactly the inputs in the partition. The classifier for each partition and the intersection of trace sets for the inputs in the partition serve as the Boolean condition and corresponding trace expression in the constructed Switch expression, respectively.

Ranking: We prefer Concatenate and TokenSeq expressions that have fewer arguments. We prefer SubStr expressions to both ConstStr expressions (it is less likely for constant parts of an output string to also occur in the input) and Concatenate expressions (if there is a long substring match between the input and output, it is more likely that the corresponding part of the output was produced by a single substring extraction). We prefer a Pos expression to CPos expression (giving less preference to extraction expressions based on constant offsets). StartTok and EndTok are our most preferred tokens; otherwise, we prefer tokens corresponding to a larger character class (favoring generality).

The implementation of the synthesis algorithm is less than 5,000 lines of C# code, and takes less than 0.1 s on average for a benchmark suite of more than 100 tasks obtained from online help forums and the Excel product team.

4. Semantic Transformations

Some string transformation tasks also involve manipulating strings that need to be interpreted as more than a sequence of characters, for example, as a column entry from some relational table, or as some standard data type such as date, time, currency, or phone number. For example, consider the following task from an Excel help forum.

EXAMPLE 3. A shopkeeper wants to compute the selling price of an item (Output) from its name (Input v₁) and selling date (Input v₂). The inventory database of the shop consists of two tables: (i) MarkupRec table that stores id, name and markup percentage of items, and (ii) CostRec table that stores id, purchase date (in month/year format), and purchase price of items. The selling price of an item is computed by adding its purchase price (for the corresponding month) to its markup charges, which in turn is calculated by multiplying the markup percentage by the purchase price.

To perform the above task, the user must perform a join of the two tables on the common item Id column to lookup the item Price from its Name (v₁) and selling Date (substring of v₂). We present an extension to the trace expression (from Section 3.1) that can also manipulate strings present in such relational tables.¹⁸

4.1. Domain-specific language

We extend the trace expression (from Section 3.1), as shown in Figure 3(a), to obtain the semantic string transformation language that can also perform table lookup operations. The atomic expression f is modified to represent a constant string, a select expression, or a substring of a select expression. A select expression e_t is either an input string variable v_i or a lookup expression denoted by Select(Col, Tab, g), where Tab is a relational table identifier and Col is a column identifier of the table. The Boolean condition g is an ordered conjunction of column predicates h₁ … h_n, where a column predicate h is an equality comparison between the content of some column of the table and a constant or a trace expression e. We require columns present in these conditions to together constitute a primary key of the table to ensure that the select queries produce a single string as opposed to a set of strings.

The task in Example 3 can be represented in the language as

The expression f₄ looks up the Id of the item (present in v₁) from the MarkupRec table and f₅ generates a substring of the date present in v₂, which are then used together to lookup the Price of the item from the CostRec table (f₁). The expression f₆ looks up the Markup percentage of the item from the MarkupRec table and f₂ generates a substring of this lookup value by extracting the first numeric token (thus removing the % sign). Similarly, the expression f₃ generates a substring of f₁, removing the $ symbol. Finally, the top-level expression concatenates the strings obtained from expressions f₁, f₂, and f₃ with constant strings “+0.” and “*”.

This extended language also enables manipulation of strings that represent standard data types, whose semantic meaning can be encoded as a database of relational tables. For example, consider the following date manipulation task.

EXAMPLE 4 (DATE MANIPULATION). An Excel user wanted to convert dates from one format to another, and the fixed set of hard-coded date formats supported by Excel 2010 do not match the input and output formats. Thus, the user solicited help on a forum.

We can encode the required background knowledge for the date data type in two tables, namely a Month table with 12 entries: (1, January), …, (12, December) and a DateOrd table with 31 entries (1, st), (2, nd), …, (31, st). The desired transformation is represented in our language as

where e₁ = SubStr2(v₁, NumTok, 1), e₂ = SubStr2(v₁, NumTok, 2), e₃ = SubStr2(v₁, NumTok, 3). (MW,MN) and (Num,Ord) denote the columns of the Month and DateOrd tables, respectively.

4.2. Synthesis algorithm

We now describe the key extensions to the synthesis algorithm for syntactic transformations (Section 3.2) to obtain the synthesis algorithm for semantic transformations.

Data structure: Figure 3(b) describes the data structure that succinctly represents the large set of programs in the semantic transformation language that are consistent with a given inputoutput example. The data structure consists of a generalized expression &cacm5508_k;_t, generalized Boolean condition &cacm5508_l;, and generalized predicate &cacm5508_m; (which, respectively, denote a set of select expressions, a set of Boolean conditions g, and a set of predicates h). The generalized expression &cacm5508_k;_t is represented using a tuple (&cacm5508_g;, η^t, Progs) where &cacm5508_g; denotes a set of nodes containing a distinct target node η^t (representing the output string), and Progs: &cacm5508_g; → 2^&cacm5508_n; maps each node η ε &cacm5508_g; to a set consisting of input variables v_i or generalized select expressions Select(Col, Tab, B). A generalized Boolean condition &cacm5508_l;_i corresponds to some primary key of table T and is a conjunction of generalized predicates &cacm5508_m;_j, where each &cacm5508_m;_j is an equality comparison of the j^th column of the corresponding primary key with a constant string s or some node &cacm5508_g; or both. The two key aspects of this data structure are (i) the use of intermediate nodes for sharing sub-expressions to represent an exponential number of expressions in polynomial space and (ii) the use of Conjunctive Normal Form (CNF) form of Boolean conditions to represent an exponential number of conditionals &cacm5508_l; in polynomial space.

Procedure Generate: We first consider the simpler case where there are no syntactic manipulations on the table lookups and the lookups are performed using exact string matches, that is, the predicate h is Col = e_t instead of Col = e. The Generate procedure operates by iteratively computing a set of nodes (&cacm5508_g;), where each node η ε &cacm5508_g; represents a string val(η) that corresponds to some table entry or an input string. Generate performs an iterative forward reachability analysis of the string values that can be generated in a single step (i.e., using a single Select expression) from the string values computed in the previous step based on string equality and assigns the Select expression to the Progs map of the corresponding node. The base case of the procedure creates a node for each of the input string variables. After performing the analysis for a bounded number of iterations, the procedure returns the set of select expressions of the node that corresponds to the output string s, that is, Progs[val⁻¹(s)].

The Generate procedure for the general case, which also includes syntactic manipulations on table lookups, requires a relaxation of the above-mentioned reachability criterion of strings that is based on string equality. We now define a table entry to be reachable from a set of previously reachable strings if the entry can be generated from the set of reachable strings using the Generate procedure of Section 3.2. The rest of the reachability algorithm operates just as before.

Procedure Intersect: A basic ingredient of the Intersect procedure for syntactic transformations is a method to intersect two Dag constructs, each representing a set of trace expressions. We replace this by a method to intersect two tuples (&cacm5508_g;₁, η₁^t, Progs₁) and (&cacm5508_g;₂, η₂^t, Progs₂), each representing a set of extended trace expressions. The tuple after intersection is (&cacm5508_g;₁ × &cacm5508_g;₂, (η₁^t,η₂^t), Progs₁₂), where Progs₁₂((&cacm5508_g;₁, &cacm5508_g;₂)) is given by the intersection of Progs₁(&cacm5508_g;₁) and Progs₂(&cacm5508_g;₂).

Ranking: We prefer expressions of smaller depth (fewer nested chains of Select expressions) and ones that match longer strings in table entries for indexing. We prefer lookup expressions that use distinct tables (for join queries) as opposed to using the same table twice. We prefer conditionals with fewer predicates. We prefer predicates that compare columns with other table entries or input variables (as opposed to comparing columns with constant strings).

We implemented our algorithm as an extension to the Excel add-in (Section 3.2) and evaluated it successfully over more than 50 benchmark problems obtained from various help forums and the Excel product team. For each benchmark, our implementation learned the desired transformation in <10 s (88% of them taking <1 s each) requiring at most three inputoutput examples (with 70% of them requiring only one example). The data structure had size between 100 and 2,000 (measured as the number of terminal symbols in the data-structure syntax), with an average size of 600, and typically represented 10²⁰ expressions.

5. Table Layout Transformations

End users often transform a spreadsheet table not by changing the data stored in the cells of a table, but instead by changing how the cells are grouped or arranged. In other words, users often transform the layout of a table.⁸

EXAMPLE 5. The following example input table and subsequent example output table were provided by a novice on an Excel user help thread to specify a layout transformation:

The example input contains a set of dates on which tests were given, where each date is in a row corresponding to the name of the test taker, and in a column corresponding to the name of the test. For every date, the user needs to produce a row in the output table containing the name of the test taker, the name of the test, and the date on which the test was taken. If a date cell in the input is empty, then no corresponding row should be produced in the output.

5.1. Domain-specific language

We may view every program P that transforms the layout of a table as constructing a map m_P from the cells of an input table to the coordinates of the output table. For a cell c in an input table, if m_P(c) = (row, col), P fills the cell in the output table at the coordinate (row, col) with the data in c. A program in our language of layout transformations is defined syntactically as a finite collection of component programs, each of which builds a map from input cells to output coordinates (Figure 4: table program). We designed our language on the principle that most layout transformations can be implemented by a set of component programs that construct their map using one of the two complementary procedures: filtering and associating.

When a component program filters, it scans the cells of the input table in row-major order, selects a subset of the cells, and maps them in order to a subrange of the coordinates in the output table. A filter program Filter(φ, SEQ_i,j,k) (Figure 4: filter program) is a mapping condition φ, which is a function whose body is a conjunction of predicates over input cells drawn from a fixed set and an output sequencer SEQ_i,j,k, where i, j, and k are nonnegative integers. For a mapping condition φ and sequencer SEQ_i,j,k, the filter program Filter(φ, SEQ_i,j,k) scans an input table and maps each cell that satisfies φ to the coordinates in the output table between columns i and j, starting at row k, in row-major order.

For the tables in Example 5, the filter program F₁ = Filter(λc.(c.data ≠ “” c.col ≠ 1 c.row ≠ 1), SEQ_3,3,1) maps each date, that is, each nonempty cell not in column 1 and not in row 1, to its corresponding cell in column 3 of the output table, starting at row 1. Call this map m_F₁.

A table program can also construct a map using spatial relationships between cells in the input table and spatial relationships between coordinates in the output table; we call this construction association. When a table program associates, it takes a cell c in the input table mapped by some filter program F, picks a cell c₁ in the input table whose coordinate is related to c, finds the coordinate m_F(c) that c maps to under m_F, picks a coordinate d₁ whose coordinate is related to m_F(c), and maps c₁ to d₁.

An associative program A = Assoc(F, s₀, s₁) (Figure 4: Associative program) is constructed from a filter program F and two spatial functions s₀ and s₁, each of which may be of the form RelCol_i or RelRow_j. The spatial function RelCol_i takes a cell c as input, and returns the cell in the same row as c and in column i. The spatial function RelRow_j takes a cell c as input, and returns the cell in row j and in the same column as c. For each cell c in the domain of m_K, the map of A contains an entry m_A(s₀(c)) = s₁(m_F(c)).

For the example tables in Example 5, and the filter program F₁ introduced above that maps to column 3 of the example output table, the associative program A₁ = Assoc(F₁, RelCol₁, RelCol₁) constructs a map to every cell in column 1 of the output table. To construct its map, A₁ takes each cell c in the input table mapped by F₁, finds the cell RelCol₁(c) in the same row as c and in column 1, finds the coordinate m_F₁(c) that F₁ maps c to, finds the coordinate RelCol₁(m_F₁(c)), and maps RelCol₁(c) to RelCol₁(m_F₁(c)): that is, A₁ sets m_A₁(RelCol₁(c)) = RelCol₁(m_F₁(c)). Similarly, the associative program A₂ = Assoc(F₁, RelRow₁, RelCol₂) constructs a map to every cell in column 2 of the example output table. The table program TabProg({F₁, A₁, A₂}) takes the input table in Example 5 and maps to every cell in the output table.

It is possible that the ranges of constituent component programs of a table program may overlap on a given input table. In such a case, if two cells with different values are mapped to the same output coordinate, then we say that the table program is undefined on the input table.

5.2. Synthesis algorithm

The synthesis algorithm generates the set of all table programs that are consistent with each example, intersects the sets, and picks a table program from the intersection that is consistent with all of the examples.

Data structure for sets of table programs: To compactly represent sets of table programs, our synthesis algorithm uses a table program itself. Let a component program K be consistent with an inputoutput example if when K is applied to the example input and K maps an input cell c, then the cell in the output table at coordinate m_K(c) has the same data as cell c in the input table. Let a set of component programs &cacm5508_o; cover an example if, for each cell coordinate d in the example output, there is some component K ε &cacm5508_o; and cell c in the input table such that d = m_K(c). Let a table program TabProg(&cacm5508_o;) be consistent with an example if &cacm5508_o; is consistent with the example and &cacm5508_o; covers the example. For a fixed inputoutput example, TabProg(&cacm5508_o;) stores TabProg(&cacm5508_o;’) if &cacm5508_o;’ ⊆ &cacm5508_o; covers the example.

Procedure Generate: From a single inputoutput example, Generate constructs a table program that stores the set of all table programs that are consistent with the example by constructing the set &cacm5508_o;* of all consistent component programs, in three steps. First, from the example input and output, Generate defines a set of spatial functions and map predicates. Second, from the set of map predicates, Generate collects the set of all consistent filter programs. Third, from the set of consistent filter programs, Generate constructs the set of consistent associative programs. To generate associative programs, Generate combines each consistent filter program with pairs of spatial functions defined in the first step and checks if the resulting associative program is consistent. If so, then Generate adds the associative program to the set of consistent component programs. The table program TabProg(&cacm5508_o;*) stores all table programs that are consistent with the example if any exist, and is thus called the complete table program of the example.

Procedure Intersect: Given two sets of table programs represented as table programs stored in TabProg(&cacm5508_o;₀) and TabProg(&cacm5508_o;₁), Intersect efficiently constructs the intersection of the sets as all consistent table programs stored in TabProg(&cacm5508_o;₀ ∩ &cacm5508_o;₁).

The synthesis algorithm applies Generate to construct the complete table program for each example, applies Intersect to the set of complete table programs, and checks if the resulting table program TabProg(&cacm5508_o;_I) is consistent with each of the examples. If so, it returns TabProg(&cacm5508_o;’_I) for some subset &cacm5508_o;’_I of &cacm5508_o;_I that covers each of the examples. The exact choice of &cacm5508_o;’_I depends on the ranking criteria.

Ranking: We prefer table programs that are constructed from smaller sets of component programs, as such table programs are intuitively simpler. The subset order over sets of component programs thus serves as a partial order for ranking. Also, suppose that a table program P₀ uses a filter program F₀, while another table program P₁ uses a filter program F₁ that builds the same map as F₀, but whose condition is a conjunction of fewer predicates than the condition of F₀. Then, we prefer P₁, as the condition used by F₁ is intuitively more general.

To evaluate our synthesis algorithm, we implemented it as a plug-in for the Excel spreadsheet program and applied it to inputoutput tasks taken directly from over 50 real-world Excel user help threads. The tool automatically inferred programs for each task. The tool usually ran in under 10 s and inferred a program that behaved as we expected, given the user’s description in English of their required transformation. If the highest-ranking program inferred by the tool behaved in an unexpected way on an input, it inferred a program that behaved as expected after taking at most two additional examples.

6. Related Work

The Human Computer Interaction (HCI) community has developed programming by demonstration (PBD) systems¹ for data cleaning, which require the user to specify a complete demonstration or trace visually on the data instead of writing code: SMARTedit¹⁴ for text manipulation, Simultaneous Editing¹⁶ for string manipulation, and Wrangler¹¹ for table transformations. Our system is based on programming by example (as opposed to demonstration); it requires the user to provide only the initial and final states, as opposed to also providing the intermediate states. This renders our system more usable,¹³ but at the expense of complicating the learning problem. Furthermore, we synthesize programs over a more sophisticated language consisting of conditionals and loops.

The database community has studied the view synthesis problem,^{2, 22} which aims to find a succinct query for a given relational view instance. Our semantic string transformation synthesis also infers similar queries, but infers from very few examples and over a richer language that combines lookup operations with syntactic manipulations. Our table layout synthesis addresses the challenges of dealing with spreadsheet tables, which, unlike database relations, have ordering relationships between rows and carry meta-information in some cells.

The PADS project in the programming languages community has simplified ad hoc data-processing tasks for programmers by developing domain-specific languages for describing data formats, and learning algorithms for inferring such formats using annotations provided by the user.³ The learned format can then be used by programmers to implement custom data-analysis tools. In contrast, we focus on automating the entire end-to-end process for relatively simpler tasks, which include not only learning the text structure from the inputs, but also learning the desired transformation from the outputs, without any user annotations.

The area of program synthesis is gaining renewed interest. However, it has traditionally focused on synthesizing nontrivial algorithms²⁰ (e.g., graph algorithms⁹ and program inverses¹⁹) and discovering intricate code-snippets (e.g., bit-vector tricks,⁷ switching logic in hybrid systems²¹). In this paper, we apply program synthesis to discover simpler programs required by the much larger class of spreadsheet end-users. Various classes of techniques have been explored for program synthesis: exhaustive search, logical reasoning, probabilistic inference, and version-space algebras (for a recent survey, see Gulwani⁵). Because the data manipulations required by end users are usually relatively simple, we can apply version-space algebras, which allow real-time performance, unlike other techniques. Version-space algebras were pioneered by Mitchell for refinement-based learning of Boolean functions,¹⁷ while Lau et al. extended the concept to learning more complex functions in a PBD setting.¹⁴ Our synthesis algorithms lift the concepts of version-space algebra to the PBE setting, for a fairly expressive string expression language involving conditionals and loops.

7. Conclusion and Future Work

General-purpose computational devices, such as smartphones and computers, are becoming accessible to people at large at an impressive rate. In the future, even robots will become household commodities. Unfortunately, programming such general-purpose platforms has never been easy, because we are still mostly stuck with the model of providing step-by-step, detailed, and syntactically correct instructions on how to accomplish a certain task, instead of simply describing what the task is. Program synthesis has the potential to revolutionize this landscape, when targeted for the right set of problems and using the right interaction model.

This paper reports our initial experience with designing domain-specific languages and inductive synthesizers for string and table transformations. Our choice of domains was motivated by our study of help-forum problems that end users struggled with. A next step is to develop a general framework that can allow synthesizer writers to easily develop domain-specific synthesizers of the kind described in this paper, similar to how declarative parsing frameworks allow a compiler writer to easily write a parser.

Acknowledgments

We thank Guy L. Steele Jr. for providing insightful and detailed feedback on multiple versions of this draft.

Figures

Figure 1. (a) Syntax of syntactic string-processing programs. (b) Data structure for representing a set of such programs.

Figure 2. Small sampling of different ways of generating parts of an output string from the input string.

Figure 3. Extensions to the syntax and data structure in Figure 1 for semantic processing.

Figure 4. Syntax of layout transformations.

Submit an Article to CACM

CACM welcomes unsolicited submissions on topics of relevance and value to the computing community.

You Just Read

Spreadsheet Data Manipulation Using Examples

View in the ACM Digital Library

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

DOI

10.1145/2240236.2240260

August 2012 Issue

Published: August 1, 2012

Vol. 55 No. 8

Pages: 97-105

Table of Contents

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

The Latest from CACM

Explore More

News Apr 23 2024

Maximizing Power Grid Security

R. Colin Johnson

Security and Privacy

News Apr 18 2024

Keeping AI Out of Elections

Bennie Mols

Artificial Intelligence and Machine Learning

BLOG@CACM Apr 17 2024

Technical Marvels

Herbert Bruderer

Computer History

Shape the Future of Computing

ACM encourages its members to take a direct hand in shaping the future of the association. There are more ways than ever to get involved.

Get Involved

Communications of the ACM (CACM) is now a fully Open Access publication.

By opening CACM to the world, we hope to increase engagement among the broader computer science community and encourage non-members to discover the rich resources ACM has to offer.

Learn More

1. Introduction

2. Overview

3. Syntactic Transformations

4. Semantic Transformations

5. Table Layout Transformations

6. Related Work

7. Conclusion and Future Work

Figures

Spreadsheet Data Manipulation Using Examples

DOI

August 2012 Issue

Related Reading

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

Shape the Future of Computing

Communications of the ACM (CACM) is now a fully Open Access publication.