Home/Magazine Archive/February 2019 (Vol. 62, No. 2)/Scalable Computation of High-Order Optimization Queries/Full Text

Research highlights
# Scalable Computation of High-Order Optimization Queries

Constrained optimization problems are at the heart of significant applications in a broad range of domains, including finance, transportation, manufacturing, and healthcare. Modeling and solving these problems has relied on application-specific solutions, which are often complex, error-prone, and do not generalize. Our goal is to create a domain-independent, declarative approach, supported and powered by the system where the data relevant to these problems typically resides: the database. We present a complete system that supports *package queries*, a new query model that extends traditional database queries to handle complex constraints and preferences over answer sets, allowing the declarative specification and efficient evaluation of a significant class of constrained optimization problemsinteger linear programs (ILP)within a database.

Traditional database queries follow a simple model: they define constraints, in the form of selection predicates, that each tuple in the result must satisfy. This model is computationally efficient, as the database system can evaluate each tuple individually to determine whether it satisfies the query conditions. However, many practical, real-world problems require a collection of result tuples to satisfy constraints collectively, rather than individually.

EXAMPLE 1 (MEAL PLANNER). *A dietitian needs to design a daily meal plan for a patient. She wants a set of three gluten-free meals, between 2000 and 2500 calories in total, and with a low total intake of saturated fats.*

Similar scenarios, requiring complex, high-order constraints arise frequently, and in many practical settings. A broad set of domains have applications that boil down to modeling and solving constrained optimization problems, for example, coordinating fleet and crew assignments in airline scheduling to reduce delays and costs,^{19} managing delinquent consumer credit to minimize losses,^{14} optimizing organ transplant allocation and acceptance,^{1} and planning of cancer radiotherapy treatments.^{20, 21} A significant class of constrained optimization problems are *integer linear programs* (ILP). ILP solutions alone account for billions in US dollars of projected benefits within each of these and other industry sectors.^{7}

Modeling and solving these problems has relied on application-specific solutions,^{2, 9, 13, 17, 23, 18} which can often be complex and error-prone, and fail to generalize. Our goal is to create a domain-independent, declarative approach, supported and powered by the system where the data relevant to these problems typically resides: the database. We present a complete system that supports *package queries*, a new query model that extends traditional database queries to handle complex constraints and preferences over answer sets, allowing the declarative specification and efficient evaluation of a significant class of constrained optimization problemsILPwithin a database. Package queries are defined over traditional relations, but return *packages.* A package is a collection of tuples that (a) individually satisfy *base predicates* (traditional selection predicates), and (b) collectively satisfy *global predicates* (package-specific predicates). Package queries are combinatorial in nature: the result of a package query is a (potentially infinite) set of packages, and an *objective criterion* can define a preference ranking among them.

Extending traditional database functionality to provide support for packages, rather than supporting packages at the application level, is justified by two reasons: First, the features of packages and the algorithms for constructing them are not unique to each application; therefore, the burden of package support should be lifted off application developers, and database systems should support package queries like traditional queries. Second, the data used to construct packages typically reside in a database system, and packages themselves are structured data objects that should naturally be stored in and manipulated by a database system.

Our work addresses *three important challenges.* The first challenge is to support *declarative* specification of packages. SQL enables the declarative specification of properties that result tuples should satisfy. In Example 1, it is easy to specify the exclusion of meals with gluten using a regular selection predicate in SQL. However, it is difficult to specify global constraints (e.g., total calories of a set of meals should be between 2000 and 2500 calories). Expressing such a query in SQL requires either complex self-joins that explode the size of the query, or recursion, which results in extremely complex queries that are hard to specify and optimize. Our goal is to maintain the declarative power of SQL, while extending its expressiveness to allow for the easy specification of packages.

The second challenge relates to the *evaluation* of package queries. Due to their combinatorial complexity, package queries are harder to evaluate than traditional database queries.^{10} Package queries are in fact as hard as ILP.^{5} Existing database technology is ineffective at evaluating package queries, even if one were to express them in SQL. Figure 1 shows the performance of evaluating a package query expressed as a multi-way self-join query in traditional SQL. As the cardinality of the package increases, so does the number of joins, and the runtime quickly becomes prohibitive: In a small set of 100 tuples from the Sloan Digital Sky Survey (SDSS) dataset,^{22} SQL evaluation takes almost 24 hours to construct a package of 7 tuples. Our goal is to extend the database evaluation engine to take advantage of external tools, such as ILP solvers, which are more effective for combinatorial problems.

**Figure 1. Traditional database technology is ineffective at package evaluation, and the runtime of a SQL formulation of a package query grows exponentially. In contrast, tools such as ILP solvers are more effective.**

The third challenge pertains to query evaluation *performance* and *scaling* to large datasets. Integer programming solvers have two major limitations: they require the entire problem to fit in main memory, and they fail when the problem is too complex (e.g., too many variables and/or too many constraints). Our goal is to overcome these limitations through sophisticated evaluation methods that allow solvers to scale to large data sizes.

Our work addresses these challenges through the design of language and algorithmic support for the specification and evaluation of package queries. We present PaQL (Package Query Language), a declarative language that provides simple extensions to standard SQL to support constraints at the package level. PaQL is at least as expressive as ILP, which implies that evaluation of package queries is NP-hard.^{5} We present a fundamental evaluation strategy, Direct, that combines the capabilities of databases and constraint optimization solvers to derive solutions to package queries. The core of our approach is a set of translation rules that transform a package query to an ILP. This translation allows for the use of highly-optimized external solvers for the evaluation of package queries. We introduce an offline data partitioning strategy that allows package query evaluation to scale to large data sizes. The core of our evaluation strategy, SKETCHREFINE, lies in separating the package computation into multiple stages, each with small subproblems, which the solver can evaluate efficiently. In the first stage, the algorithm "sketches" an initial sample package from a set of representative tuples, while the subsequent stages "refine" the current package by solving an ILP within each partition. SKETCHREFINE offers strong approximation guarantees for the package results compared to DIRECT. We present an extensive experimental evaluation on real-world data that shows that our query evaluation method SKETCHREFINE: (1) is able to produce packages an order of magnitude faster than the ILP solver used directly on the entire problem; (2) scales up to sizes that the solver cannot manage directly; (3) produces packages of very good quality in terms of objective value.

Database systems do not natively support package queries. While there are ways to express package queries in SQL, these are cumbersome and inefficient.

**Specifying packages with self-joins.** In the limited case of packages with strict cardinality, that is, a fixed number of tuples, it is possible to express package queries using relational self-joins. The query of Example 1 requires three meals (a package with cardinality three) and can be expressed as a three-way self-join:

Such a query is efficient only for constructing packages with very small cardinality: larger cardinality requires a larger number of self-joins, quickly rendering evaluation time prohibitive (Figure 1). The benefit of this specification is that the optimizer can use the traditional relational algebra operators and augment its decisions with package-specific strategies. However, this method does not apply for packages of unbounded cardinality.

**Specifying packages using recursion.** SQL can express package queries by generating and testing each possible subset of the input relation. This requires recursion to build a *powerset table*; checking each set in the powerset table for the query conditions will yield the result packages. This approach has three major drawbacks. First, it is not declarative, and the specification is tedious and complex. Second, it is not amenable to optimization in existing systems. Third, it is extremely inefficient to evaluate, because the powerset table generates an exponential number of candidates.

**2.1. PaQL: The package query language**

Our goal is to support declarative and intuitive package specification. In this section, we describe PaQL, a declarative query language that introduces simple extensions to SQL to define package semantics and package-level constraints. Figure 2 shows the general syntax of PaQL (left) and the specification for the query of Example 1 (right), which we use as a running example to demonstrate PaQL's features. Square brackets enclose optional clauses and arguments, and a vertical bar separates syntax alternatives. In this specification,

is a non-negative integer; **repeat**

is a Boolean expression over tuple values (as in standard SQL) and can only contain references to **w_expression**

and **relation_name**

is a Boolean expression and **relation_alias; st_expression**

is an expression over aggregate functions or SQL subqueries with aggregate functions; both **obj_expression**

and **st_expression**

can only contain references to **obj_expression**

, which specifies the name of the package result.**package_name**

**Figure 2. Specification of the PaQL syntax (left), and the PaQL query for Example 1 (right).**

**Basic package query.** The new keyword PACKAGE differentiates PaQL from traditional SQL queries.

The semantics of Q_{1} and Q_{2} are fundamentally different: Q_{1} is a traditional SQL query, with a unique, finite result set (the entire Recipes table), whereas there are infinitely many packages that satisfy the package query Q_{2}: all possible *multisets* of tuples from the input relation. The result of a package query like Q_{2} is a set of packages. Each package resembles a relational table containing a collection of tuples (with possible repetitions) from relation `Recipes`

, and therefore a package result of Q_{2} follows the schema of `Recipes.`

Similar to SQL, the PaQL syntax allows the specification of the output schema in the SELECT clause. For example, PACKAGE(sat_fat, kcal) only returns the saturated fat and calorie attributes of the package.

Although semantically valid, a query like Q_{2} would not occur in practice, as most application scenarios expect few, or even exactly one result. We proceed to describe the additional constraints in the example query Q (Figure 2) that restrict the number of package results.

**Repetition constraints.** The REPEAT 0 statement in query Q from Figure 2 specifies that each tuple from the input relation Recipe can appear in a package result at most once (no repetitions are allowed). If this restriction is absent (as in query Q_{2}), the multiplicity of a tuple is unbounded. By allowing no repetitions, Q restricts the package space from infinite to 2^{n}, where *n* is the size of the input relation. Generalizing, REPEAT allows a package to repeat tuples up to times, resulting in (2 + )^{n} candidate packages.

**Base and global predicates.** A package query defines two types of predicates. A *base predicate*, defined in the WHERE clause, is equivalent to a selection predicate and can be evaluated with standard SQL: any tuple in the package needs to *individually* satisfy the base predicate. For example, query Q from Figure 2 specifies the base predicate: `R.gluten`

= 'free'. Since base predicates directly filter input tuples, they are specified over the input relation R. *Global predicates* are the core of package queries, and they appear in the new SUCH THAT clause. Global predicates are higher-order than base predicates: they cannot be evaluated on individual tuples, but on tuple collections. Since they describe package-level constraints, they are specified over the package result P, for example, COUNT(P.*) = 3, which limits the query results to packages of exactly 3 tuples.

The global predicates in query Q abbreviate aggregates that are in reality SQL subqueries. For example, COUNT(P.*) = 3, abbreviates (SELECT COUNT(*) FROM P) = 3. Using sub-queries, PaQL can express arbitrarily complex global constraints among aggregates over a package.

**Objective clause.** The objective clause specifies a ranking among candidate package results and appears with either the MINIMIZE or MAXIMIZE keyword. It is a condition on the package-level, and hence it is specified over the package result P, for example, MINIMIZE SUM(P.sat_fat). Similar to global predicates, this form is a shorthand for MINIMIZE (SELECT SUM(sat_fat) FROM P). A PaQL query with an objective clause returns a single result: the package that optimizes the value of the objective. The evaluation methods that we present in this work focus on such queries. In prior work,^{6} we described preliminary techniques for returning multiple packages in the absence of optimization objectives, but a thorough study of such methods is left to future work.

**Expressiveness and complexity.** PaQL can express general ILP, which means that evaluation of package queries is NP-complete.^{4, 5} As a first step in package evaluation, we proceed to show how a PaQL query can be transformed into a linear program and solved using general ILP solvers.

In this section, we present an ILP formulation for package queries, which is at the core of our evaluation methods DIRECT and SKETCHREFINE. The results in this section are inspired by the translation rules employed by Tiresias^{15} to answer *how-to queries.*

**3.1. PaQL to ILP translation**

Let R indicate the input relation of the package query, *n* = |R| be the number of tuples in R, R.attr an attribute of R, P a package, *f* a linear aggregate function (such as COUNT and SUM), {,} a constraint inequality, and *v* R a constant. For each tuple *t _{i}* from R, 1

**Repetition constraint.** The REPEAT keyword, expressible in the FROM clause, restricts the domain that the variables can take on. Specifically, REPEAT implies 0 *x _{i}* + 1.

**Base predicate.** Let be a base predicate, for example, R.gluten = 'free', and R_{} the relation containing tuples from R satisfying . We encode by setting *x _{i}* = 0 for every tuple

**Global predicate.** Each global predicate in the SUCH THAT clause takes the form *f*(P) *v.* For each such predicate, we derive a linear function over the integer variables. A cardinality constraint *f*(P) = COUNT(P.*) is translated into a linear function . A summation constraint *f*(P) = SUM(P.attr) is translated into a linear function . Other nontrivial constraints and general Boolean expressions over the global predicates can be encoded into a linear program with the help of Boolean variables and linear transformation tricks found in the literature.^{3} We refer to the original version of this paper for further details.^{4,5}

**Objective clause.** We encode MAXIMIZE *f*(P) as max , where is the encoding of *f*(P). Similarly MINIMIZE *f*(P) is encoded as min .

EXAMPLE 2 (ILP TRANSLATION). *Figure 3 shows a toy example of the* Recipes *table, with two columns and 5 tuples. To transform Q into an ILP, we first create a non-negative, integer variable for each tuple:* *x*_{1}, ..., *x*_{5}. *The cardinality constraint specifies that the sum of the* *x _{i}*

**Figure 3. Example ILP formulation and solution for query Q, on a sample Recipe dataset. There are only two packages that satisfy all the constraints, namely { t_{2}, t_{3}, t_{5}} and {t_{1}, t_{2}, t_{5}}, but the first one is the optimal because it minimizes the objective function.**

**3.2. Query evaluation with DIRECT**

Using the ILP formulation, we develop DIRECT, our basic evaluation method for package queries. In Section 4, we extend this technique to our main algorithm, SKETCHREFINE, which supports efficient package evaluation in large datasets. Package evaluation with DIRECT employs three steps:

**Base Relations:**We first compute the base relations, such as R_{}, R_{c}, and R_{p}, with a series of standard SQL queries, one for each, or by simply scanning R once and populating these relations simultaneously.**ILP Formulation:**We transform the PaQL query to an ILP problem using the rules described in Section 3.1. After this phase, all variables*x*such that_{i}*x*= 0 can be eliminated from the ILP problem because the corresponding tuple_{i}*t*cannot appear in any package solution._{i}**ILP Execution:**We employ an off-the-shelf ILP solver, as a black box, to get a solution to each of the integer variables*x*. Each_{i}*x*informs the number of times tuple_{i}*t*should be included in the answer package._{i}

EXAMPLE 3 (ILP SOLUTION). *The ILP solver operating on the program of Figure 3 returns the variable assignments to x _{i} that lead to the optimal solution; x_{i}* = 0

The DIRECT algorithm has two crucial drawbacks. First, it is only applicable if the input relation is small enough to fit entirely in main memory: ILP solvers, such as IBM's CPLEX, require the entire problem to be loaded in memory before execution. Second, even for problems that fit in main memory, this approach may fail due to the complexity of the integer problem. In fact, ILP is a notoriously hard problem, and modern ILP solvers use algorithms, such as *branch-and-cut*,^{16} that often perform well in practice, but can "choke" even on small problem sizes due to their exponential worst-case complexity.^{8} This may result in unreasonable performance if the solvers use too many resources (main memory, virtual memory, CPU time), eventually thrashing the entire system.

In this section, we present SKETCHREFINE, an approximate divide-and-conquer evaluation technique for efficiently answering package queries on large datasets. Rather than solving the original large problem with DIRECT, SKETCHREFINE smartly decomposes a query into smaller queries, formulates them as ILP problems, and employs an ILP solver as a blackbox evaluation method to answer each individual query. By breaking down the problem into smaller subproblems, the algorithm avoids the drawbacks of DIRECT.

The algorithm is based on an important observation: *similar tuples are likely to be interchangeable within packages.* A group of similar tuples can therefore be "compressed" to a single *representative tuple* for the entire group. SKETCHREFINE *sketches* an initial answer package using only the set of representative tuples, which is substantially smaller than the original dataset. This initial solution is then *refined* by evaluating a subproblem for each group, iteratively replacing the representative tuples in the current package solution with original tuples from the dataset. Figure 4 provides a high-level illustration of the three main steps of SKETCHREFINE:

**Figure 4. The original tuples (a) are partitioned into four groups and a representative is constructed for each group (b). The initial sketch package (c) contains only representative tuples, with possible repetitions up the size of each group. The refine query for group G_{1} (d) involves the original tuples from G_{1} and the aggregated solutions to all other groups (G_{2}, G_{3}, and G_{4}). Group G_{2} can be skipped (e) because no representatives could be picked from it. Any solution to previously refined groups is used while refining the solution for the remaining groups (f and g). The final approximate package (h) contains only original tuples.**

**Offline Partitioning (Section 4.1):**The algorithm assumes a partitioning of the data into groups of similar tuples, with a representative tuple chosen for each group. This partitioning is performed offline (not at query time).**Sketch (Section 4.2.1):**SKETCHREFINE sketches an initial package by evaluating the package query only over the set of representative tuples.**Refine (Section 4.2.2):**Finally, SKETCHREFINE transforms the initial package into a complete package by replacing each representative tuple with some of the original tuples from the same group, one group at a time.

SKETCHREFINE always constructs *approximate feasible* packages, that is, packages that satisfy all the query constraints, but with a possibly sub-optimal objective value that is guaranteed to be within certain approximation bounds. SKETCHREFINE may suffer from *false infeasibility*, which happens when the algorithm reports a feasible query to be infeasible. The probability of false infeasibility is, however, low and bounded. We formalize these properties in Section 4.3.

In the subsequent discussion, we use R(attr_{1},...,attr_{k}) to denote an input relation with *k* attributes. R is partitioned into *m* groups *G*_{1}, ..., *G _{m}*. Each group

No entries found