Home/Magazine Archive/April 2011 (Vol. 54, No. 4)/A Co-Relational Model of Data for Large Shared Data.../Full Text

Practice
# A Co-Relational Model of Data for Large Shared Data Banks

Fueled by their promise to solve the problem of distilling
valuable information and business insight from big data in a
scalable and programmer-friendly way, noSQL databases have been
one of the hottest topics in our field recently. With a plethora
of open source and commercial offerings (Membase, CouchDB,
Cassandra, MongoDB, Riak, Redis, Dynamo, BigTable, Hadoop, Hive,
Pig, among others) and a surrounding cacophony of technical terms
(Paxos, CAP, Merkle trees, gossip, vector clocks, sloppy quorums,
MapReduce, and so on), however, it is hard for businesses and
practitioners to see the forest for the trees. The current noSQL
market satisfies the three characteristics of a monopolistically
competitive market: the barriers to entry and exit are low; there
are many small suppliers; and these suppliers produce technically
heterogeneous, highly differentiated
products.^{12} Monopolistically
competitive markets are inconsistent with the conditions for
perfect competition. Hence in the long run monopolistically
competitive firms will make zero economic profit.

In the early 1970s, the database world was in a similar sorry
state.^{14} An overabundance of database
products exposed many low-level implementation details and, as
database people like to say, forced programmers to work at the
physical level instead of the logical level. The landscape
changed radically when Ted Codd proposed a new data model and a
structured query language (SQL) based on the mathematical concept
of relations and foreign-/primary-key
relationships.^{4} In the relational
model, data is stored in conceptually simple containers (tables
of rows), and queries over this data are expressed declaratively
without knowledge of the underlying physical storage
organization.

Codd's relational model and SQL allowed implementations from
different vendors to be (near) perfect substitutes, and hence
provided the conditions for perfect competition. Standardizing on
the relational model and SQL created a secondary network effect
around complementary producers such as educators, tool vendors,
consultants, etc., all targeting the same underlying mathematical
principles. Differences between actual relational database
implementations and SQL dialects became to a large extent
irrelevant.^{7}

Today, the relational database market is a classic example of an oligopoly. The market has a few large players (Oracle, IBM, Microsoft, MySQL), the barriers to entry are high, and all existing SQL-based relational database products are largely indistinguishable. Oligopolies can retain high profits in the long run; today the database industry is worth an estimated $32 billion and still growing in the double digits.

In this article we present a mathematical data model for the
most common noSQL databases—namely, key/value
relationships—and demonstrate that this data model is the
mathematical dual of SQL's relational data model of
foreign-/primary-key relationships. Following established
mathematical nomenclature, we refer to the dual of SQL as
*coSQL.* We also show how a single generalization of the
relational algebra over sets—namely, monads and monad
comprehensions—forms the basis of a common query language
for both SQL and noSQL. Despite common wisdom, SQL and coSQL are
not diabolically opposed, but instead deeply connected via
beautiful mathematical theory.

Just as Codd's discovery of relational algebra as a formal basis for SQL shifted the database industry from a monopolistically competitive market to an oligopoly and thus propelled a billion-dollar industry around SQL and foreign-/primary-key stores, we believe that our categorical data-model formalization and monadic query language will allow the same economic growth to occur for coSQL key-value stores.

To set the scene let's look at a simple example of products with authors and recommendations as found in the Amazon SimpleDB samples, and implement it using both object graphs and relational tables.

While we don't often think of it this way, the RAM for storing object graphs is actually a key-value store where keys are addresses (l-values) and values are the data stored at some address in memory (r-values). Languages such as C# and Java make no distinction between r-values and l-values, unlike C or C++, where the distinction is explicit. In C, the pointer dereference operator *p retrieves the value stored at address p in the implicit global store. In the rest of this article we conveniently confuse the words object (graph) and key-value store.

In C# (or any other modern object-oriented language) we can model products using the following class declaration, which for each product has scalar properties for title, author, publication date, and number of pages, and which contains two nested collections—one for keywords and another for ratings:

Given this class declaration, we can use object initializers to create a product and insert it into a collection using collection initializers:

The program produces in memory the object graph shown in Figure 1.

Note that the two nested collections for the
`Keywords`

and `Ratings`

properties are
both represented by actual objects with their own identity.

Using the LINQ (Language Integrated Query) comprehension
syntax introduced in C# 3.0,^{11} we can
find the titles and keywords of all products that have four-star
ratings using the following query:

The LINQ comprehension syntax is just syntactic sugar for a
set of standard query operators that can be defined in any modern
programming language with closures, lambda expressions (written
here as `rating==>rating == "****"`

), or
inner-classes such as Objective-C, Ruby, Python, JavaScript,
Java, or C++. The C# compiler translates the previous query to
the following de-sugared target expression:

The various values in the query result, in particular the Keywords collection, are fully shared with the original object graph, and the result is a perfectly valid object graph, shown in Figure 2.

Now let's redo this example using the relational model. First
of all, we must normalize our nested Product class into three
flat tables, for `Products, Keywords`

, and
`Ratings`

respectively, as shown in Figure
3. Each value in the relational world requires a new
primary key property (here all called ID). Furthermore, the
`Keywords`

and `Ratings`

tables require an
additional foreign-key property `ProductID`

that
encodes the one-to-many association between `Products`

and `Keywords`

and `Ratings`

. Later we will
decorate these class declarations with additional metadata to
reflect the underlying database tables.

In most commercial relational database systems, tables are defined by executing imperative CREATE TABLE DDL (data definition language) statements.

As usual in the relational world, we do not model the individual collections of keywords and ratings for each product as separate entities, but instead we directly associate multiple keywords and ratings to a particular product. This shortcut works only for one-to-many relationships. The standard practice for many-to-many relationships (multivalued functions) requires intersection tables containing nothing but pairs of foreign keys to link the related rows.

Perhaps surprisingly for a "declarative" language, SQL does not have expressions that denote tables or rows directly. Instead we must fill the three tables in an imperative style using loosely typed DML statements, which we express in C# as shown in Figure 4.

These DML statements create three tables filled with the rows as shown in Figure 5.

An important consequence of normalizing a single type into
separate tables is that the database must maintain *referential
integrity* to ensure that: the foreign-/primary-key
relationships across related tables remain synchronized across
mutations to the tables and rows; the primary key of every row
remains unique within its table; and foreign keys always point to
a valid primary key. For example, we cannot delete a row in the
`Products`

table without also deleting the related
rows in the `Keywords`

and `Ratings`

tables.

Referential integrity implies a *closed-world assumption*
where transactions on the database are serialized by
(conceptually) suspending the world synchronously, applying the
required changes, and resuming the world again when referential
integrity has been restored successfully, rolling back any
chances otherwise. Assuming a closed world is, as we claim, both
a strength and a weakness of the relational model. On the one
hand, it simplifies the life of developers via ACID (atomicity,
consistency, isolation, durability) transactions (although in
practice, for efficiency, one must often deal with much weaker
isolation levels) and allows for impressive (statistics-based)
query optimization. The closed-world assumption, however, is in
direct contradiction with distribution and scale-out. The bigger
the world, the harder it is to keep closed.

Returning to our example, we present the naïve query to
find the titles and keywords of all products that have four
stars, expressed directly in terms of the relational model. It
creates the cross-product of all possible combinations of
`Products, Keywords`

, and `Ratings`

, and
selects only the title and keyword where the keyword and rating
are related to the product and the rating has four stars:

The result of this query is the row set shown in Figure 6. Disappointingly, this row set is not itself normalized.

In fact, to return the normalized representation of our object-graph query, we need to perform two queries (within a single transaction) against the database: one to return the title and its primary key, and a second query that selects the related keywords.

What we observe here is SQL's lack of
*compositionality*—the ability arbitrarily to combine
complex values from simpler values without falling outside the
system. By looking at the grammar definition for table rows, we
can immediately see that SQL lacks compositionality; since there
is no recursion, rows can contain only scalar values:

Compare this with the definition for anonymous types, where a row can contain arbitrary values, including other rows (or nested collections):

SQL is rife with noncompositional features. For example, the
semantics of NULL is a big mess: why does adding the number 13 to
a NULL value, `13+NULL`

, return NULL, but summing the
same two values, `SUM(13, NULL)`

, returns 13?

Also, even though query optimizers in modern SQL implementations are remarkably powerful, the original query will probably run in cubic time when implemented via three nested loops that iterate over every value in the three tables. A seasoned SQL programmer would instead use explicit join syntax to ensure that the query is as efficient as our object-based query:

Depending on the encoding of the nesting of the result of a
join using flat result sets, the SQL programmer must choose among
various flavors of `INNER, OUTER, LEFT`

, and
`RIGHT`

joins.

In 1984 George Copeland and David Maier recognized the
impedance mismatch between the relational and the object-graph
model just described,^{5} and in the
quarter century since, we have seen an explosion of O/R
(object-relational) mappers that attempt to bridge the gap
between the two worlds.

A more skeptical view of O/R mappers is that they are undoing the damage caused by normalizing the original object model into multiple tables. For our running example this means that we have to add back information to the various tables to recover the relationships that existed in the original model. In this particular case we use the LINQ-to-SQL custom metadata annotations; other O/R mappers use similar annotations, which often can be inferred from naming conventions of types and properties.

Note that the resulting object model is necessarily more complex than we started with since we are forced to introduce explicit representations for Rating and Keyword collections that did not exist in our original object model. The existence of the various foreign- and primary-key properties is further evidence that the O/R mapping abstraction is leaky.

Aside from those small differences, the net result of all this work is that we can now write the query to find all products nearly as concisely as we did before normalization:

Since the results must be rendered as object graphs, the O/R
mapper will make sure that the proper nested result structures
are created. Unfortunately, not every O/R mapper does this
efficiently.^{9}

It is not only the programmer who needs to recover the
original structure of the code. The database implementer must
also jump through hoops to make queries execute efficiently by
building indexes that avoid the potential cubic effect that we
observed earlier. For one-to-many relationships, indexes are
nothing more than nested collections resulting from precomputing
joins between tables to quickly find all the rows whose foreign
keys point to a row with a particular primary key. Since the
relational model is not closed under composition, however, the
notion of index has to be defined *outside* the model.

Two natural indexes correspond to the relationship between
`Products`

and `Ratings`

and
`Products`

and `Keywords`

, respectively.
For each product in the Products table, the ratings index
contains the collection of all related ratings:

Similarly, for each `product`

in the
`Product`

table, the *keywords* index contains
the collection of all `keywords`

related to that
`product`

:

If we visualize the indexes as additional columns on the Products table, the reversal of the original relationships between the tables becomes apparent. Each row in the Products table now has a collection of foreign keys pointing to the Keywords and Ratings tables much as the original object graph, as shown in Figure 7.

One of the advantages touted for normalization over hierarchical data is the ability to perform ad-hoc queries—that is, to join tables on conditions not envisioned in the original data model. For example, we could try to find all pairs of products where the length of the title of one product is the same as the length of the author's name in the other using the following query:

Without an index, however, this query requires a full table scan and hence takes quadratic time in the length of the Products table.

The ability to create indexes makes a closed-world assumption. For example, if we modify the previous ad-hoc query to find all pairs of Web pages where one page has a URL referencing the other, it should be obvious that building an index for this join is quite a hard task when you do not have the whole Web available inside a single database:

Summarizing what we have learned so far, we see that in order to use a relational database, starting with a natural hierarchical object model, the designer needs to normalize the data model into multiple types that no longer reflect the original intent; the application developer must reencode the original hierarchical structure by decorating the normalized data with extra metadata; and, finally, the database implementer has to speed up queries over the normalized data by building indexes that essentially re-create the original nested structure of the data as well.

At this point it feels like the conceptual dissonance between the key-value and foreign-/primary-key data models is insurmountable. That would be a pity since clearly each has its strengths and weaknesses. Wouldn't it be great if we could give a more mathematical explanation of where the relational model shines and where the object-graph model works best? As it turns out, we can find the answer to this question by taking a closer look at the (in-memory) structures created for our running example in both models.

Let's start by slightly simplifying the object-graph example. We do so by removing the object identity of the Ratings and Authors collections to reflect more directly how they are modeled in the relational world. We inline the Keywords and Ratings items directly into the parent Product, as if we had value-based collections. Pictorially, we move from the diagram on the left to the one on the right in Figure 8:

For the tabular representation, we show explicit arrows for the relationship from foreign keys to primary keys. Again, pictorially, we move from the diagram on the left to the one on the right in Figure 9:

When we do a side-by-side comparison of the two rightmost diagrams, we notice two interesting facts:

- In the object-graph model, the identity of objects is
*intensional*—that is, object identity is not part of the values themselves but determined by their keys in the store. In the relational model, object identity is*extensional*—that is, object identity is part of the value itself, in the form of a primary key. - Modulo the notion of object identity, the two representations are extremely similar; the only difference is that the arrows are reversed!

At this point, it appears there is a strong correspondence
between these two representations: they both consist of a
collection of elements (objects or rows) and a collection of
arrows between the elements, the only difference being the
direction of the arrows. Is there some precise way of describing
such a situation? Fortunately, there is an entire area of
mathematics designed exactly for this: category
theory.^{1}

Obviously, the precise formalization of SQL and noSQL as
categories is outside the scope of this article, but it is
illustrative to learn a little bit of category theory
nonetheless. Category theory arose from studies of mathematical
structures and an attempt to relate classes of structures by
considering the relations between them. Thus, category theory
expresses mathematical concepts in terms of *objects,
arrows* between objects, and the *composition* of arrows,
along with some axioms that the composition of arrows should
satisfy. A computational view of category theory is that it is a
highly stylized, compact functional language. Small as it is,
it's enough to represent all of mathematics. For computer
scientists, category theory has proved to be a rich source of
techniques that are readily applicable to many real-world
problems. For example, Haskell's approach to modeling imperative
programming is lifted straight from a categorical model of the
problem.

The first powerful concept from category theory that we will
use is *duality.* Examples of duality abound in computer
science. Every programmer is familiar with the two dual forms of
De Morgan's law:

Other examples of dualities in computer science are between reference counting and tracing garbage collection, between call-by-value and call-by-name, between push- and pull-based collections, and between transactional memory and garbage collection, among many others.

Formally, given a category `C`

of objects and
arrows, we obtain the dual category `co(C)`

by
reversing all the arrows in `C`

. If a statement
`T`

is true in `C`

, then its dual statement
`co(T)`

is true in the dual category
`co(C)`

. In the context of this article, we can read
"opposite statement" for "dual statement".

In the SQL category, child nodes point to parent nodes when the foreign key of a child node equals the primary key of the parent node (Figure 10). In the noSQL category, the arrows are reversed. Parent nodes point to child nodes when the child pointer in the parent equals the address of the child node in the store (see Figure 11).

In other words, the noSQL category is the dual of the SQL category—noSQL is really coSQL. The implication of this duality is that coSQL and SQL are not in conflict, like good and evil. Instead they are two opposites that coexist in harmony and can transmute into each other like yin and yang. Interestingly, in Chinese philosophy yin symbolizes open and hence corresponds to the open world of coSQL, and yang symbolizes closed and hence corresponds to the closed world of SQL.

Because SQL and coSQL are mathematically dual, we can reason precisely about the tradeoffs between the two instead of relying on rhetoric and anecdotal evidence. The accompanying table gives a number of statements and their duals as they hold for SQL and coSQL, respectively.

If we really take the duality to heart, we may also choose to
(but don't have to) fine-tune our model for key-value stores to
reflect the duality between values and computations, and that
between synchronous ACID and asynchronous BASE (basically
available, soft state, eventually
consistent).^{13}

Looking up a value given its address or key in a key-value
story can involve a *computation*, which in a truly open
world has potential latency and may fail. For example, in the C#
language getters and setters, known as properties, can invoke
arbitrary code. Perhaps an even better example of a
computation-driven key-value store with long latency and high
probability of failure (always able to handle a 404) is the Web,
with URI (Uniform Resource Identifier) as keys, "resources" as
values, and the HTTP verbs as a primitive query and
data-manipulation language. On the other hand, in a C-like
key-value memory model, we usually make the simplifying
assumption that a key lookup in memory takes constant time and
does not fail.

Traversing a relationship in the closed world of the
relational model involves comparing two *values* for
equality, which is guaranteed to succeed because of referential
integrity; and vice versa, referential consistency dictates that
relationships are value-based. Otherwise, we could never be sure
that referential consistency actually holds.

Note that comparing keys by value requires that objects in the SQL category are strongly typed, at least enough to identify primary and foreign keys; and dually, since we do not need to know anything about the value of a coSQL object to find it using its key, objects in the coSQL world can be loosely typed.

Our abstract model of the SQL category did not impose any restrictions on the structure of rows; we assumed only that we could determine a primary or foreign key to relate two rows.

In the typical relational model we would further impose the constraint that rows consist of flat sequences of scalar values (the so-called First Normal Form, or 1-NF). If we dualize relations in 1-NF, then we get a key-value model where values consist of either scalars or keys or collections of scalars or keys. Surprisingly, this is precisely the Amazon SimpleDB data model (see Figure 12).

If we assume that rows in the foreign-/primary-key model are simply blobs and keys are strings, then the dual key-value model is exactly the HTML5 key-value storage model:

So far we have discussed the basic data models for SQL and coSQL, but we have not yet touched upon queries. By applying a little more category theory we can show how a single abstraction, monads and monad comprehensions, can be used as a unified query language for both SQL and coSQL.

To talk about queries, we need to be more precise about what
we mean by *collections* of values. Pure relational algebra
is based on sets of rows, but actual relational databases use
multisets (bags) or ordered multisets (permutations). To model
collections abstractly, we look at sets/bags/permutations of rows
and apply the category theory dictum: "What is the interface that
these various collections of rows implement?" and "How do we
generalize queries based on such an interface?"

First, let us stick with simple set collections. When we write a SQL query such as

the SQL compiler translates that pretty syntax into a relational-algebra expression in terms of selection, projection, joins, and Cartesian product. As is the custom in the relational world, the various operations are denoted using Greek symbols:

There is no reason to restrict the relational algebra
operators to work over just sets (or bags, or permutations) of
rows. Instead, we can define similar operators on *any*
collection `M<T>`

of values of arbitrary type
`T`

.

The interface for such abstract collections
`M<T>`

is a straightforward generalization of
that of sets. It allows us to create an empty collection using
the constant `Ø`

; create a singleton collection
of type `M<T>`

given some value of type T using
the function `{_} → M<T>`

(the notation
`T → M<T>`

denotes a
function/closure/lambda expression that maps an argument value of
type `T`

to a result collection of type
`M<T>`

); and combine two collections into a
larger collection using the binary operator ```
(depending on the commutativity and idempotence of
```

```
, we obtain the various sorts of collections
such as lists, permutations, bags, and sets):
```

Using these constructors, we can generalize the traditional
relational algebra operators (selection σ_{P},
projection π_{F}, and Cartesian product ×) to
operate over generalized collections using the following
signatures:

In fact, if we assume a single operator for correlated
subqueries, which we call `SelectMany`

, but is often
called `CROSS APPLY`

in SQL

then we can define the other operators in terms of this single one, as follows:

Rather incredibly, an interface of this shape is well known in
category theory. It is called a *monad*, where the type
constructor `M< _ >`

is a *functor* of the
monad; the constructor `{_}`

is the unit of the monad;
`SelectMany`

is the bind of the monad; and
`Ø`

and ```
are the neutral
element and addition, respectively. For the rest of us, they are
just the signatures for methods defined on an abstract interface
for collections.
```

This is no theoretical curiosity. We can play the same
syntactic tricks that SQL does with relational algebra, but using
monads instead. Such *monad comprehensions* have been
recognized as a versatile query language by both functional
programmers and database
researchers.^{8}

LINQ queries are just a more familiar SQL-like syntax for
monad comprehensions. Instead of Greek symbols, LINQ uses
human-readable identifiers such as `xs.Where(P)`

for
`σ`

and
_{P}(xs)`xs.Select(F)`

for `π`

.
To accommodate a wide range of queries, the actual LINQ standard
query operators contain additional operators for aggregation and
grouping such as_{F}(xs)

Any data source that implements the standard LINQ query operators can be queried using comprehension syntax, including both SQL and coSQL data sources, as we show in the next section.

The .NET framework defines a pair of standard interfaces
`IEnumerable<T>`

and
`IQueryable<T>`

that are often implemented by
data sources to support querying, but it is by no means necessary
to use these particular interfaces. Other standard interfaces
that support LINQ query operators include the
`IObservable<T>`

and
`IQbservable<T>`

interfaces that make it
possible to use LINQ for complex event
processing.^{10}

In contrast to most treatments of noSQL, we did not mention scalability as a defining characteristic of coSQL. The openness of the coSQL model eases scalable implementations across large numbers of physically distributed machines.

When using LINQ to query SQL databases, typically similar rows
are stored in tables of some concrete type that implements the
`IQueryable<T>`

interface. Relationships between
rows are lifted to bulk operations over collections that perform
joins across these tables; hence, queries take any number of
related tables and from those produce a new one. Pictorially for
three tables, this is shown in Figure 13.

Because relationships in SQL cross different tables, it is nontrivial to partition the single closed world into independent worlds of "strongly connected" components that can be treated independently. In a closed world, however, query optimizers can leverage all the information and statistics that are available about the various tables. This allows users to write declarative queries focusing on the "what" and letting the system take care of the "how."

In the coSQL case, a typical scenario is to have a single
collection of type `IQueryable<S>`

of (pointers
to) self-contained denormalized "documents" of type S. In that
case, queries have type `IQueryable<S> → R`

(see Figure 14).

When there are no cross-table relationships, collections
{x_{0}, x_{1},...,x_{n–1}}
`M<S>`

that are the source of coSQL queries can
be naturally horizontally partitioned, or sharded, into
individual subcollections
{x_{0}}{x}...{x_{n-1}}, and
each such subcollection {x_{i}} can be distributed across
various machines on a cluster.

For a large subset of coSQL queries, the shape of the query
closely follows the shape of the data. Such *homomorphic*
queries map a collection
xs={x_{0}}{x_{1}}...{x_{n-1}}
to the value
f(x_{0})f(x_{1})...f(x_{n-1})—that
is, they are of the form
`xs.Select(f).Aggregate()`

for some function
`f S → R`

and binary operator
` RxR→R`

. In fact, Richard Bird's
first homomorphism lemma^{3} says that
any function `h M<S>→R`

is a
homomorphism with respect to if and only if it can be
factored into a map followed by a reduce: ```
h(xs) =
xs.Select(f).Aggregate()
```

. Mathematics dictates that
coSQL queries are performed using
MapReduce.^{6}

Practical implementations of MapReduce usually slightly
generalize Bird's lemma to use `SelectMany`

instead of
`Select`

so that the map phase can return a collection
instead of a single value, and insert an intermediate
`GroupBy`

as a way to "write" equivalence classes of
values from the map phase into the key-value store for subsequent
processing in the reduce phase, and then aggregate over each
subcollection:

For example, DryadLINQ^{15} uses the
type `PartitionedTable<S>:IQuer yable<S>`

to represent the partitioned input collection for LINQ queries
and then implements MapReduce over the partitioned collection
using the function illustrated in Figure 15.

In an open world where collections are distributed across the network, it is much harder for a query optimizer to perform a global optimization taking into account latency, errors, etc. Hence, most coSQL databases rely on explicit programmatic queries of a certain pattern such as MapReduce that can be executed reliably on the target physical machine configuration or cluster.

The nascent noSQL market is extremely fragmented, with many competing vendors and technologies. Programming, deploying, and managing noSQL solutions requires specialized and low-level knowledge that does not easily carry over from one vendor's product to another.

A necessary condition for the network effect to take off in
the noSQL database market is the availability of a common
abstract mathematical data model and an associated query language
for noSQL that removes product differentiation at the *logical
level* and instead shifts competition to the *physical*
and *operational* level. The availability of such a common
mathematical underpinning of all major noSQL databases can
provide enough critical mass to convince businesses, developers,
educational institutions, etc. to invest in noSQL.

In this article we developed a mathematical data model for the most common form of noSQL—namely, key-value stores as the mathematical dual of SQL's foreign-/primary-key stores. Because of this deep and beautiful connection, we propose changing the name of noSQL to coSQL. Moreover, we show that monads and monad comprehensions (i.e., LINQ) provide a common query mechanism for both SQL and coSQL and that many of the strengths and weaknesses of SQL and coSQL naturally follow from the mathematics.

In contrast to common belief, the question of big versus small
data is orthogonal to the question of SQL versus coSQL. While the
coSQL model naturally supports extreme sharding, the fact that it
does not require strong typing and normalization makes it
attractive for "small" data as well. On the other hand, it is
possible to scale SQL databases by careful
partitioning.^{2}

What this all means is that coSQL and SQL are not in conflict, like good and evil. Instead they are two opposites that coexist in harmony and can transmute into each other like yin and yang. Because of the common query language based on monads, both can be implemented using the same principles.

Many thanks to Brian Beckman, Jimmy "the aggregator" Nilsson, Bedarra-Dave Thomas, Ralf Lämmel, Torsten Grust, Maarten Fokkinga, Rene Bouw, Alexander Stojanovic, and the anonymous referee for their comments that drastically improved the presentation of this paper, and of course to Dave Campbell for supporting work on all cool things LINQ.

**Related articles
on queue.acm.org**

**A Conversation with Erik Meijer and Jose Blakeley**

http://queue.acm.org/detail.cfm?id=1394137

**BASE: An Acid Alternative**

*Dan Pritchett*

http://queue.acm.org/detail.cfm?id=1394128

**Bridging the Object-Relational Divide**

*Craig Russell*

http://queue.acm.org/detail.cfm?id=1394139

1. Awodey, S. *Category Theory* (2nd
edition). Oxford University Press, 2010.

2. Baker, J., Bond, C. et al. Megastore:
providing scalable, highly available storage for interactive
services. *Conference on Innovative Data Systems Research.*
(2011).

3. Bird, R. An introduction to the theory of
lists. In *Logic Programming and Calculi of Discrete
Design.* M. Broy, ed. Springer-Verlag (1987), 3–42.

4. Codd, T. A relational model of data for
large shared data banks. *Commun. ACM 13* (June 1970).

5. Copeland, G. and Maier, D. Making
Smalltalk a database system. In *Proceedings of the ACM SIGMOD
International Conference on Management of Data.*1984.

6. Fokkinga, M. MapReduce—a two-page explanation for laymen; http://www.cs.utwente.nl/~fokkinga/mmf2008j.pdf.

7. Ghosh, R.A. An economic basis for open standards (2005); flosspols.org.

8. Grust, T. 2003. Monad comprehensions: a
versatile representation for queries. In *The Functional
Approach to Data Management.* P. Gray, L. Kerschberg, P. King,
and A. Poulovassilis, Eds. Springer Verlag, 2003,
288–311.

9. Grust, T., Rittinger, J., Schreiber, T.
Avalanche-safe LINQ compilation. *Proceedings of the VLDB
Endowment 3* (1–2), 2010.

10. Meijer, E. Subject/Observer is dual to iterator. Presented at FIT: Fun Ideas and Thoughts at the Conference on Programming Language Design and Implementation (2010); http://www.cs.stanford.edu/pldi10/fit.html.

11. Meijer, E., Beckman, B., Bierman, G.
LINQ: reconciling objects, relations, and XML in the .NET
framework. *Proceedings of the ACM SIGMOD International
Conference on Management of Data.* ACM, New York, 2006.

12. Pirayoff, R. *Economics Micro &
Macro.* Cliffs Notes, 2004.

13. Pritchett, D. BASE: an Acid alternative.
*ACM Queue* (July 2008).

14. Stonebraker, M., Hellerstein, J.M. What
goes around comes around. In *Readings in Database Systems*
(Fourth Edition). M. Stonebraker, and J.M. Hellerstein, eds. MIT
Press, Cambridge, MA, 2005, 2–41.

15. Yuan Yu, M.I. DryadLINQ: A system for
general-purpose distributed data-parallel computing using a
high-level language. *Operating Systems Design and
Implementation.* 2008.

Figure 1. Object graph for Products collection.

Figure 2. Object graph result for books with four star rating.

Figure 3. Data declaration for Product database.

Figure 4. Inserting values into Product database.

Figure 5. Relational tables for Product database.

Figure 6. Tabular result for books with four-star ratings.

Figure 7. Keyword and Ratings index on Products table.

Figure 8. Object graph for Products collection with keywords and ratings inlined.

Figure 9. Relational tables for Products database with explicit relationships.

Figure 10. SQL arrow from child to parent.

Figure 11. coSQL arrow from parent to child.

Figure 12. Amazon simpleDB representation of Products collection.

Figure 13. Foreign key relationships between three relational tables.

**©2011
ACM 0001-0782/11/0400 $10.00**

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2011 ACM, Inc.

I like the idea but would one need to manually merge the result sets like LINQ'ing over SQL and XML or use the same query for both?

It would be handy to have coSQL take an IQueryable and return an IObservable or preferably Task of IObservable. As most GUIs only need the first x number of reference items for the initial display in the sequence. ie. drop down, lists before scrolling, etc. Having the rest come when ready or having the control cancel the request if not needed would be handy. "JIT data access".

- N2Cheval

Interesting subject. But I always have a feel that query on object models cannot perform like relational models. Or is this with assumption that performance can be achieved by distributed computing frameworks like Map reduce etc..

+Anil

The following letter was published in the Letters to the Editor in the July 2011 CACM (http://cacm.acm.org/magazines/2011/7/109897).

--CACM Administrator

Erik Meijer's and Gavin Bierman's article "A Co-Relational Model of Data for Large Shared Data Banks" (Apr. 2011) overreached by claiming equivalence between the Relational Model and NoSQL "key-value pairs" without regard to the definition of a data model by E.F. Codd more than 30 years ago. Finding similarity in NoSQL systems to some parts of the Relational Model, Meijer and Bierman mistakenly concluded the two are equivalent.

Codd, in his paper "Data Models in Database Management" in Proceedings of the 1980 Workshop on Data Abstraction, Databases and Conceptual Modeling (http://portal.acm.org/citation.cfm?id=806891) defined a data model as comprising three components: data structures to represent well-formed expressions in first-order logic; operators closed over these structures, permitting inferencing; and integrity constraints to enforce internal consistency.

NoSQL systems have no data model so defined. All else is commentary.

Meijer and Bierman ignored logic and inferencing and did not explain how key-value systems recognize, let alone enforce, integrity constraints. They cited referential integrity a form of integrity constraint as an exogenous cost relational databases bear to correct for a deficiency. The truth is actually the opposite; consistency is a central obligation of any database-management system. The lack of constraint-checking in key-value systems imposes the constraint-checking burden on the application, a situation the Relational Model was invented specifically to correct.

Codd encountered a similar lack of understanding in his day. In the same proceedings paper, he wrote, "In comparing data models people often ignore the operators and integrity rules altogether. When this occurs, the resulting comparisons run the risk of being meaningless."

Codd's landmark article "A Relational Model of Data for Large Shared Data Banks" (Communications, June 1970) addressed other points raised by Meijer and Bierman, including path independence. An interested reader would learn much in an evening spent with that one article alone.

Object Relational Mapping libraries and NoSQL systems attempt to solve (through technical means) a nontechnical problem: reluctance of talented people to master the Relational Model, and thus benefit from its data consistency and logical inferencing capabilities. Rather than exploit it and demand more relational functionality from DBMS vendors, they seek to avoid and replace it, unwittingly advocating a return to the fragile, unreliable, illogical systems of the 1960s, minus the green-bar fanfold paper.

James K. Lowden

New York

----------------------------------------------------

AUTHORS' RESPONSE:

Lowden's comment contains a number of errors. Our article was, in fact, explicitly critical of the lack of an agreed data model for NoSQL. We didn't ignore "inferencing," proposing instead a query language based on monad comprehensions interestingly, the same query language we prefer for the relational model. We did not assert that the relational and key-value models are equivalent, but rather dual. The issue of weakening consistency checking goes to the heart of the interest in NoSQL systems and is beyond the scope of our article.

Erik Meijer

Redmond, WA

Gavin Bierman

Cambridge, U.K.

The following letter was published in the Letters to the Editor in the June 2011 CACM (http://cacm.acm.org/magazines/2011/6/108669).

--CACM Administrator

I would like to clarify Erik Meijer's and Gavin Bierman's statement concerning the origin of SQL in their article "A Co-Relational Model of Data for Large Shared Data Banks" (Apr. 2011) that said, "The landscape changed radically when Ted Codd proposed a new data model and a structured query language (SQL) based on the mathematical concepts of relations and foreign/primary key relationships." There were actually additional steps and additional people involved getting from Codd's influential 1970 Communications article "A Relational Model of Data for Large Shared Data Banks" (cited by Meijer and Bierman) to the SQL language. In it and in subsequent articles and papers, Codd discussed query languages based on both relational calculus and relational algebra. But it was Raymond Boyce and Donald Chamberlin who, after studying Codd's work, defined the SEQUEL language, later renamed SQL, as first documented by Chamberlin and Boyce.(1) For more on how Codd's ideas influenced Boyce and Chamberlin and the rest of the System R team, see Chamberlin.(2)

Paul McJones

Mountain View, CA

REFERENCES

(1) Chamberlin, D. D. and Boyce, R. F. SEQUEL: A structured English query language. In Proceedings of the 1974 ACM SIGFIDET (Now Sigmod) Workshop on Data Description, Access and Control (Ann Arbor, MI, May 13). ACM Press, New York, 1974, 249264.

(2) Chamberlin, D. Oral History, Paul McJones, interviewer. Computer History Museum, Mountain View, CA, July 21, 2009; http://www.computerhistory.org/collections/accession/102702111

Displaying **all 4** comments