Technical Perspective: Mesa Takes Data Warehousing to New Heights

Leave it to Google to make business data processing—among the stodgiest topics in the rather buttoned-up world of database systems—seem cool. The application here involves producing reports over Google’s ads infrastructure: Google executives want to see how many ads each Google property is serving, and how profitable they are, and Google’s customers want to see how many users are clicking on their ads, how much they are paying, and so on.

At a small scale, solving this problem is straightforward—new ad click and sales data are appended to a database file as they are sent from the processing system, and computing the answer to a particular query over the data involves reading the contents of ("scanning") the data file to compute a running total of the records in the groups the user is interested in. Making this perform at the scale of Google Ads, where billions of clicks happen per day, is the challenge addressed by the Mesa system described in this following paper.

A natural question is how Mesa compares to existing parallel transactional database systems?

Fundamentally, the key technique is to employ massive parallelism, both when adding new data and when looking up specific records. The techniques used are largely a collection of best practices developed in the distributed systems and database communities over the last decade, with some clever new ideas thrown in. Some of the highlights from this work include:

The use of batch updates and append-only storage. New data is not added one record at a time, but is aggregated into batches that are sent into Mesa. Instead of merging these batches into the existing storage, these batches are simply written out as additional files that need to be scanned when processing a query. This has the advantage that existing files are never modified, so queries can continue to be executed while new data is added without worrying about new data being partially read by these existing queries.
Massive scale parallel query processing. Each query is answered by one query processing node, but there can be hundreds or thousands of compute nodes. They can each answer queries independently because of the use of append-only storage: query processors never need to worry that the files they are scanning will change while they are running, and query processors never wait for each other to perform operations.
Atomic updates. Some care is needed to atomically install update batches, such that they are either not seen at all or are seen in their entirety. Mesa labels each update with a unique, monotonically incrementing version number, which is periodically communicated to each query processor. Once a query processor learns of a new version number, it will answer queries up to and including that batch, and is guaranteed that the files containing the batch have been completely written and will not change. This means it can take some time (a few seconds to a minute) for a query processor to see a new update, but this will only result in a slightly stale answer (missing the most recent update), never an answer that is missing some arbitrary subset of updates.
Unusual data model. Unlike most database systems that simply represent (or "model") data as a collection of records each with a number of attributes, Mesa allows a programmer to specify a merge function that can be used to combine two records when a record with a duplicate key arrives. This makes it possible, for example, to keep a running total of clicks or revenue for a particular ad or customer. One advantage of this model is that it allows new data to be added without reading the old data first when computing running sums and so on.

A natural question is how Mesa compares to existing parallel transactional database systems? Database systems are optimized for high throughput, but lack several features that are a requirement of the Mesa solution. First, Mesa fits neatly into the elegant modular (layered) software architecture stack Google has built: It runs on top of Colossus (their distributed file system), and provides a substrate on which advanced query processing techniques (like their F1 system) can be built. Layering software this way allows different engineering teams to maintain code, and allows different layers to service multiple clients. Many existing data processing systems are much more monolithic, and would be difficult to integrate into the Google software ecosystem. Second, conventional databases were not built to replicate data across multiple datacenters. Traditional systems (typically) use a single-master approach for fault tolerance, replicating to a (read-only) standby that can take over on a master failure. Such a design will not work well if datacenter failures or network partitions are frequent.

Footnotes

To view the accompnaying paper, visit doi.acm.org/10.1145/2936722