Graph databases (GDBs)13,30 have gained momentum with the rise of large unstructured repositories of information that emphasize relations between entities. Dozens of GDB management systems,8,22,25,31 prototypes,1,2,15,21 models and languages,3,10,12,14 large knowledge graphs like Wikidata,33 and efforts from companies like Apache, Facebook, Google, Microsoft, Neo4j, and Oracle, illustrate the growing interest in this technology. While the expressive power and flexibility of their data models and query languages are the key to their success, the efficiency challenges posed by their implementation are the main obstacle to the wider adoption of GDBs.
Latin America has a long-standing tradition in fundamental research areas like database theory, string processing, information retrieval, and the design and analysis of algorithms and data structures—all of which are relevant for the development of GDBs. In the last few years, several researchers in Chile started collaborating on algorithms and systems for evaluating complex queries on large-scale GDBs. Indeed, this is one of the main research lines within the Millennium Institute for Foundational Research on Data (www.imfd.cl). The main objective is to develop new data models, query languages, and algorithmic solutions with solid theoretical foundations that are also practical and efficient. We shall briefly describe key advances achieved by these researchers to address some of the challenges that arise when querying GDBs. We will focus on the Ring index,6 a succinct data structure that supports some of the most important kinds of queries popular in GDBs, with the added benefit of having a small memory footprint as compared to other algorithms.
GDBs for modeling and querying data. A GDB represents information as a labeled graph. There are several GDB models, and several languages to query them (for example, see Hogan,13 and Robinson et al.30). For concreteness, we focus on the RDF model19 and its query language SPARQL,12 although the challenges and techniques we discuss apply to most GDB models and query languages. In RDF, a graph is a set of triples (s,p,o) representing the graph edges, where s is the subject (source node), p is the predicate (edge label), and o is the object (target node). Consider the graph in Figure 1 as our running example, where nodes represent scientists and the Nobel prize, and edges indicate that a scientist advised (adv) another or that a scientist won the Nobel prize.
A SPARQL query12 is based on a graph pattern to be matched against the GDB. The simplest such pattern is a triple pattern, which searches for individual edges. A triple pattern specifies constants or variables for the subject, predicate, and object of the desired triples; every matching triple in the graph binds the variables in the triple pattern. In our example, (Nobel,won,x) binds x to Nobel prize winners (that is, x=Thorne, x=Bohr, and so on).
Basic graph patterns (BGPs) are sets of triple patterns that aim to match subgraphs in the database, returning all the variable bindings that make the subgraph occur. BGPs are akin to natural joins in relational databases, or full conjunctive queries in logic databases. In our example, {(x,adv,y), (Nobel,won,x), (Nobel,won,y)} returns pairs (x,y) where x advised y and both won the Nobel prize (that is, (x,y) = (Bohr,Thomson) and (x,y) = (Thomson,Strutt)).
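To make the semantics concrete, the following toy sketch evaluates a BGP by matching each triple pattern and joining the resulting bindings with a naive nested-loop join. The edge set is an illustrative reconstruction consistent with the examples in the text; the names `match_triple`, `match_bgp`, and the `?`-prefix convention for variables are our own.

```python
# Illustrative reconstruction of part of the running example (Figure 1).
TRIPLES = {
    ("Wheeler", "adv", "Bohr"),
    ("Bohr", "adv", "Thomson"),
    ("Thomson", "adv", "Strutt"),
    ("Nobel", "won", "Thorne"),
    ("Nobel", "won", "Bohr"),
    ("Nobel", "won", "Thomson"),
    ("Nobel", "won", "Strutt"),
}

def match_triple(pattern, triples):
    """Bind the variables (strings starting with '?') of one triple pattern."""
    out = []
    for t in triples:
        binding = {}
        for p, v in zip(pattern, t):
            if p.startswith("?"):
                if binding.get(p, v) != v:   # repeated variable must agree
                    break
                binding[p] = v
            elif p != v:                     # constant must match exactly
                break
        else:
            out.append(binding)
    return out

def match_bgp(patterns, triples):
    """Naive BGP evaluation: join the bindings of every triple pattern."""
    results = [{}]
    for pattern in patterns:
        results = [{**r, **b}
                   for r in results
                   for b in match_triple(pattern, triples)
                   if all(r.get(k, v) == v for k, v in b.items())]
    return results
```

Running the BGP from the text, `match_bgp([("?x","adv","?y"), ("Nobel","won","?x"), ("Nobel","won","?y")], TRIPLES)`, yields exactly the two advisor pairs in which both scientists won the Nobel prize.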
GDBs can also obtain paths between nodes, commonly by means of regular path queries (RPQs), which extract pairs of nodes connected by a path whose labels conform to a regular expression. In our example, the RPQ “Wheeler adv+ x” retrieves the academic descendants of Wheeler (that is, x=Bohr, x=Thomson, and x=Strutt).
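An RPQ like “Wheeler adv+ x” can be evaluated, in its simplest form, as a graph traversal restricted to edges whose label matches the expression. The sketch below handles only the one-label transitive pattern `label+` via BFS; the graph is the same illustrative reconstruction of Figure 1, and `rpq_plus` is a hypothetical helper name.

```python
# Illustrative reconstruction of part of the running example (Figure 1).
TRIPLES = {
    ("Wheeler", "adv", "Bohr"),
    ("Bohr", "adv", "Thomson"),
    ("Thomson", "adv", "Strutt"),
    ("Nobel", "won", "Thorne"),
    ("Nobel", "won", "Bohr"),
}

def rpq_plus(start, label, triples):
    """Evaluate the RPQ "start label+ ?x": all nodes reachable from `start`
    through one or more edges carrying `label` (a plain BFS/DFS)."""
    adj = {}
    for s, p, o in triples:
        if p == label:
            adj.setdefault(s, []).append(o)
    seen, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen
```

On this graph, `rpq_plus("Wheeler", "adv", TRIPLES)` returns the academic descendants of Wheeler, matching the example in the text. General regular expressions require the product-automaton traversal discussed later.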
BGPs and RPQs are integral parts of graph query languages, but their evaluation presents several challenges, especially over huge graphs. BGPs can join tens of triple patterns (for example, up to 22 triple patterns were found in some queries in the Wikidata query log18), which traditional relational engines are not designed to process efficiently. Furthermore, RPQs cannot be evaluated in relational algebra, rather requiring recursion. Such challenges have triggered work on algorithms and data structures able to handle complex queries on large knowledge graphs.
An important breakthrough for efficiently evaluating join queries was the development of worst-case optimal (wco) join algorithms. A join algorithm is wco if its time complexity is upper bounded by the AGM bound,7 that is, the maximum possible output of the query over some database with the same schema and number of tuples as the one at hand. Techniques used by relational engines since the 1960s, applying pairwise joins, are intrinsically non-wco. Several wco algorithms were developed for relational databases,17,26–29,31 and later translated to GDBs,15 where they are particularly relevant because BGPs tend to involve many joins.1,15,16,29 Experimentally, wco algorithms notably improve performance when compared to pairwise joins,1 especially for processing queries involving many joins.15,34
Leapfrog Triejoin (LTJ),32 a seminal wco join algorithm, proceeds attribute-wise over all the involved tables simultaneously, binding one attribute at a time (that is, finding all the possible values the attribute may take in the output). For LTJ to run efficiently, tables must be arranged as tries,11 where each tuple of the table becomes a root-to-leaf path in the trie. The order in which attributes are read root-to-leaf must correspond to the order in which they are bound by the query process (attributes not bound in the query or by a join must come at the end). We maintain a pointer to the current node in the trie of each joined relation (all starting at the root). Say we decide to start by binding attribute A. We then find the values of A that appear in all the joined tables, which is done by intersecting the children of the current node in all the tries (relations) having attribute A. For each concrete value of A in the intersection, LTJ descends to that child in all those tries and continues along that branch until either an attribute cannot be bound (the current branch is abandoned), or all the join attributes have been bound (we output all possible combinations of the remaining attributes as a Cartesian product). The main primitive needed to implement LTJ’s intersections is seek(x), which finds the child of the current trie node with the smallest value y ≥ x. LTJ is wco if seek is supported in polylogarithmic time.
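The algorithm can be sketched as follows. This is a teaching simplification, not the real implementation: each candidate list is a sorted Python list so that seek(x) is a binary search, and `candidates` simulates the trie descent by filtering the triples under the current partial binding (a real trie does this in polylogarithmic time). The helper names are our own, and every pattern is assumed to mention each variable at most once.

```python
from bisect import bisect_left, bisect_right

def seek(vals, x):
    """Core LTJ primitive: smallest y >= x in the sorted list vals, or None."""
    i = bisect_left(vals, x)
    return vals[i] if i < len(vals) else None

def leapfrog_intersect(lists):
    """Intersect sorted duplicate-free lists by leapfrogging with seek."""
    if not lists or any(not l for l in lists):
        return []
    out, x = [], max(l[0] for l in lists)
    while x is not None:
        ys = [seek(l, x) for l in lists]
        if any(y is None for y in ys):
            break
        hi = max(ys)
        if all(y == hi for y in ys):      # all lists agree: emit, move past hi
            out.append(hi)
            i = bisect_right(lists[0], hi)
            x = lists[0][i] if i < len(lists[0]) else None
        else:                             # leapfrog the lagging lists up to hi
            x = hi
    return out

def candidates(pattern, triples, var, binding):
    """Sorted values var can take in pattern under the partial binding
    (this stands in for descending the pattern's trie to var's level)."""
    vals = set()
    for t in triples:
        if all((binding.get(p, v) == v if p.startswith("?") else p == v)
               for p, v in zip(pattern, t)):
            vals.add(t[pattern.index(var)])
    return sorted(vals)

def ltj(patterns, triples, var_order, binding=None, out=None):
    """LTJ sketch: bind one variable at a time, intersecting the candidate
    lists of every pattern that mentions it."""
    binding = {} if binding is None else binding
    out = [] if out is None else out
    if len(binding) == len(var_order):
        out.append(dict(binding))
        return out
    v = var_order[len(binding)]
    lists = [candidates(p, triples, v, binding) for p in patterns if v in p]
    for val in leapfrog_intersect(lists):
        binding[v] = val
        ltj(patterns, triples, var_order, binding, out)
        del binding[v]
    return out
```

On the BGP from the running example, `ltj([("?x","adv","?y"), ("Nobel","won","?x"), ("Nobel","won","?y")], triples, ["?x","?y"])` first intersects the candidates for ?x across the patterns mentioning it, then, for each value, intersects the candidates for ?y.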
However, this progress comes at the cost of space. At indexing time, one cannot predict which attributes will be joined in queries. Furthermore, a query planner should be able to choose different binding orders to improve performance. LTJ requires that each relation with d attributes is indexed in d! tries: one per possible attribute ordering. For triples (d=3), this implies 3! = 6 tries. Other wco algorithms pose similar or worse space problems. This is at odds with leveraging wco algorithms for querying massive knowledge graphs. To illustrate, Wikidata is approaching 15 billion triples;a 6 copies of it, using just 32 bits per element and without the additional trie structures, surpass a terabyte.
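As a back-of-the-envelope check of that figure (a sketch; the constants are taken directly from the text):

```python
triples = 15_000_000_000   # Wikidata is approaching 15 billion triples
components = 3             # s, p, o per triple
bytes_per_id = 4           # 32 bits per element
tries = 6                  # 3! attribute orderings for d = 3

total_bytes = triples * components * bytes_per_id * tries
print(total_bytes / 1e12)  # about 1.08 terabytes, before any trie overhead
```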
Tackling the Main Challenges
Basic graph pattern matching. The Ring index6 is a novel space-efficient data structure that supports BGPs in wco time. It replaces the GDB with a data representation that uses about the same space as the graph, allows one to retrieve the original graph, and can simulate the 6 tries needed by the LTJ algorithm. It is a breakthrough in GDB indexing, enabling practical implementations of the LTJ algorithm at large-scale. The high-level idea is that each (s,p,o) triple is regarded as a circular string that can be navigated forward or backward. Any of the 6 orders can be then obtained by starting somewhere on the circle and moving in some direction.
To build the Ring, we must sort the triples in lexicographic (s,p,o) order, to then collect the o components in a sequence Lo. We then sort them in (o,s,p) order to build sequence Lp, and finally sort in (p,o,s) order to build Ls. The three sequences use the same space as a plain representation of the triples. Using techniques like the well-known FM-Index9 for indexing compressed text, every possible node in any of the 6 tries corresponds to a range in some of the three sequences. Figure 2 shows the Ring data structure for our example database, where we keep only the three highlighted columns.
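The construction described above can be sketched directly. This toy version assumes triples of integer identifiers and produces plain Python lists; the actual Ring stores these columns as succinct, indexed sequences rather than arrays, and the function name is our own.

```python
def build_ring_columns(triples):
    """Ring construction sketch: sort the triples in three cyclic orders
    and keep only the last component of each order."""
    # (s,p,o) order: collect the o components
    Lo = [o for (s, p, o) in sorted(triples, key=lambda t: (t[0], t[1], t[2]))]
    # (o,s,p) order: collect the p components
    Lp = [p for (s, p, o) in sorted(triples, key=lambda t: (t[2], t[0], t[1]))]
    # (p,o,s) order: collect the s components
    Ls = [s for (s, p, o) in sorted(triples, key=lambda t: (t[1], t[2], t[0]))]
    return Lo, Lp, Ls
```

Note that the three columns together hold exactly 3n values for n triples, matching the claim that the Ring uses the same space as a plain representation of the graph.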
The seek(x) LTJ primitive corresponds to finding the smallest value y ≥ x appearing in a given range L*[i..j]. This is supported in logarithmic time by representing strings L* with wavelet trees,23 along with several more sophisticated operations and the primitives necessary to simulate trie navigation on the sequences L* forward or backward as needed.
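The following sketch shows how a wavelet tree answers that query. It is a pointer-based teaching version, not the succinct structure used by the Ring: each node splits the alphabet range in half and records, for every prefix, how many symbols go to the left child, which is enough to remap a position range down the tree in O(log σ) steps.

```python
class WaveletTree:
    """Pointer-based wavelet tree over a nonempty sequence of integers,
    supporting range_next_value(i, j, x): the smallest y >= x among
    seq[i..j] (a sketch of the seek primitive on a Ring column)."""

    def __init__(self, seq, lo=None, hi=None):
        self.lo = min(seq) if lo is None else lo
        self.hi = max(seq) if hi is None else hi
        self.left = self.right = None
        if self.lo == self.hi or not seq:
            return                                # leaf (or empty) node
        mid = (self.lo + self.hi) // 2
        # ranks[k]: how many of seq[0..k-1] hold values <= mid (go left)
        self.ranks = [0]
        for v in seq:
            self.ranks.append(self.ranks[-1] + (v <= mid))
        self.left = WaveletTree([v for v in seq if v <= mid], self.lo, mid)
        self.right = WaveletTree([v for v in seq if v > mid], mid + 1, self.hi)

    def range_next_value(self, i, j, x):
        """Smallest y >= x in positions i..j (inclusive), or None."""
        if i > j or x > self.hi:
            return None
        if self.left is None:                     # leaf: all values equal lo
            return self.lo
        # try the left child first (smaller values), remapping the range
        y = self.left.range_next_value(self.ranks[i], self.ranks[j + 1] - 1, x)
        if y is not None:
            return y
        return self.right.range_next_value(i - self.ranks[i],
                                           j + 1 - self.ranks[j + 1] - 1, x)
```

For instance, over the sequence [3, 1, 4, 1, 5, 9, 2, 6], the smallest value ≥ 2 within positions 0..3 is 3. Each query visits O(log σ) nodes, matching the logarithmic-time bound stated above.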
Using about 13 bytes per triple in practice (which is almost no space on top of the raw data, and 5–140 times less space than competing algorithms and systems), the Ring is on average twice as fast as the next-best competitor. Within this space the Ring offers even further functionality, like on-the-fly statistics to help find good query plans. The Ring can be extended to higher dimensions, needing much fewer than the d! copies required by classical schemes (for example, one needs 7 rings, instead of 720 tries, for d=6). This allows the implementation of LTJ on higher dimensions.
Path queries. The Ring also supports RPQs efficiently,5 using little additional space. The idea is to carry out the classical traversal of the product automaton,20 with a couple of twists. First, the NFA is produced by Glushkov’s algorithm,24 which obtains the worst-case minimum number of states and has some regularities we can exploit (for example, all transitions to a given state have the same label). Second, the wavelet trees of the sequences L* are enhanced to find in optimal time the edges of the product automaton that lead to relevant NFA states. The Ring further provides additional information for free (in terms of space) to conduct traversals more efficiently. For instance, an RPQ can be split at some edge that disconnects the NFA and is infrequent in the graph, obtaining a “left” part (from the initial state to the split) and a “right” part (from the split to the final states), whose results are computed and then joined. This yields large performance improvements.5 The abilities of the wavelet trees representing the L* sequences to count distinct values in a range and to intersect two ranges are key for this improvement and cannot be implemented for free on classical representations. The resulting algorithm is competitive with state-of-the-art systems (thrice as fast as the next-best, on average), while using 3–5 times less space than all of them.
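The classical product-automaton traversal underlying this approach can be sketched as a BFS over pairs (graph node, NFA state). The sketch omits all of the Ring's enhancements (Glushkov regularities, wavelet-tree acceleration, query splitting); the two-state NFA below, which recognizes adv+, and the function name are illustrative.

```python
from collections import deque

def rpq(graph, start, nfa, q0, accepting):
    """Nodes x such that some path start -> x spells a word the NFA accepts.
    graph: node -> list of (label, neighbor)
    nfa:   (state, label) -> iterable of next states"""
    seen = {(start, q0)}
    queue = deque(seen)
    out = set()
    while queue:
        node, q = queue.popleft()
        if q in accepting:
            out.add(node)
        for label, nxt in graph.get(node, ()):
            for q2 in nfa.get((q, label), ()):
                if (nxt, q2) not in seen:      # each product state visited once
                    seen.add((nxt, q2))
                    queue.append((nxt, q2))
    return out

# "Wheeler adv+ ?x" on the running example, with a 2-state NFA for adv+
GRAPH = {
    "Wheeler": [("adv", "Bohr")],
    "Bohr": [("adv", "Thomson")],
    "Thomson": [("adv", "Strutt")],
}
NFA = {(0, "adv"): [1], (1, "adv"): [1]}   # state 1 is accepting
```

The traversal visits each (node, state) pair at most once, so its cost is bounded by the size of the product graph; the Ring's contribution is finding the outgoing product edges quickly and in little space.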
Multimodal graph databases. While graph query languages are agnostic to the semantics of data, nodes may feature implicit properties or relations not expressed by edges. For example, if some nodes are cities, we may wish to query nearby cities but without materializing pairwise distance edges. We refer to multimodal GDBs as those where graph nodes may have types, on which some relations can be defined. For example, some nodes can belong to a metric space and we might want to include relations like near(x,y) in queries, together with triple patterns. For example, a query like (u,lives-in,x), (v,lives-in,y), (u,works-in,z), (v,works-in,z), near(x,y) finds pairs (u,v) of people that could share a ride to work.
Our strategy to support multimodal queries is to generalize LTJ to progressively bind variables and instantiate the corresponding clauses (now triple patterns and relations). The key operation we must support to carry out the LTJ intersections across different triples and relations is the operation seek. In the case of relations mentioning a certain variable v, given a value x for v, we should find the least value y ≥ x for which the relation holds if we give value y to v. Our aim is then to represent the typed information on the nodes in a way that uses little space, and supports operation seek efficiently, ideally in polylogarithmic time. For near(x,y), for example, if we have bound x to c, we can be given d and asked for the smallest y ≥ d such that near(c,y).
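As a minimal illustration of what such a seek looks like, the sketch below assumes the near relation has been reduced to a map from each node c to the sorted list (by node identifier) of nodes within some distance threshold; a real system would answer seek from a metric index rather than materializing these lists, and the class name is hypothetical.

```python
from bisect import bisect_left

class NearRelation:
    """Sketch of seek for near(x, y) once x is bound to a concrete node c.
    within: node c -> iterable of nodes y with near(c, y)."""

    def __init__(self, within):
        self.within = {c: sorted(ys) for c, ys in within.items()}

    def seek(self, c, d):
        """Smallest y >= d with near(c, y), or None (binary search)."""
        ys = self.within.get(c, [])
        i = bisect_left(ys, d)
        return ys[i] if i < len(ys) else None
```

With node identifiers as integers, `NearRelation({1: [9, 2, 5]}).seek(1, 3)` advances past 3 to the next node near node 1, exactly the operation the generalized LTJ intersection needs.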
Some examples of types we are currently addressing with this approach are:
Metric spaces: A distance function specifies that x is within the k nearest neighbors of y, or that x and y are within some distance of each other. We have shown that this type can be handled by overlaying the database graph with the k-nearest-neighbor graph of the metric space, reusing the same Ring representation on this second graph to integrate it seamlessly into the query process.4
Topological information: Nodes represent objects in an abstract space, where two elements can be contained in each other, adjacent to each other, overlap each other, and so on. We are developing data structures supporting seeks on such binary relations, where one element is bound and the other is not, that can work in time logarithmic on the data size, and possibly linear in the height of the topological hierarchy.
Temporal information: Edges have a time interval of validity, thus forming a temporal graph. Several query extensions are possible. The simplest finds the timestamps where a given BGP occurred in the graph. In a more general model, triples can be extended to quads with a fourth component for time, and BGPs on quads can be specified. We are developing data representations that can still offer wco times in the more general model while using linear space on the data size. RPQs also extend to temporal graphs, representing journeys of different types (for example, the whole path must exist at some time instant, edges must exist at consecutive time instances, and so on).
A challenge for multimodal queries is that LTJ is not necessarily wco anymore, even if we manage to support seek in logarithmic time, because node types may restrict the possible database output sizes. For example, in the first case, the query kNN(x,y) that states that y must belong to the k elements closest to x cannot be instantiated to any arbitrary relation of two columns, because for each value of x there are at most k distinct values of y. This is known as a degree constraint in GDBs and requires specialized algorithms to handle it. In cases where general wco algorithms cannot be obtained, we aim for algorithms that are wco for common cases and efficient in practice.
Conclusion
We are conducting research on efficient algorithms and systems for evaluating graph queries, motivated by the growing popularity of GDBs. Within this sphere, we see much promise in applying compact data structures—as adopted for text indexing—to index graphs at large scale, with little additional space overhead, while still supporting all the primitives necessary to efficiently evaluate graph patterns and path queries.
Future challenges involve scaling up to support fully fledged query languages, supporting multimodal queries on new data types, reaching finer measures of optimality beyond wco,1,17,21 supporting updates, transactions, and recovery in the database, and many others. This work requires expertise in various areas, principally database theory and systems, information retrieval, and compact data structures. It also spans both theoretical and practical challenges. To put these techniques into practice, we are further developing MillenniumDB,34 an open-source graph database system implementing state-of-the-art techniques, such as those described here. A demo of MillenniumDB can be accessed at https://wikidata.imfd.cl/, where we are hosting the full Wikidata knowledge graph using some of the techniques described in this article.