We have arrived in the so-called Big Data era, with data volumes growing exponentially. This requires effective data management systems with extreme scalability in size as well as speed, often called "Internet scalability." David Patterson (Communications, Oct. 2004) referred to an old network saying: "Bandwidth problems can be cured with money. Latency problems are harder because the speed of light is fixed—you can't bribe God." Therefore, reducing latency in a global application scenario requires distributing (that is, storing and/or replicating) data worldwide. Unfortunately, building efficient and reliable distributed data management systems remains an inherently difficult task.
With the emergence of (geographically) distributed data management in cloud infrastructures, key-value (KV) stores were promoted as NoSQL systems. To achieve maximum availability and performance, these KV stores sacrificed the "holy grail" of database consistency and relied on relaxed consistency models, such as eventual consistency. This was a consequence of Eric Brewer's so-called CAP observation (a.k.a. theorem),1 which states that only two of the following three desiderata of distributed systems can be satisfied at the same time: consistency, where every read operation receives the most recent committed write; availability, which nowadays typically strives for the so-called many nines of up-time (meaning, say, 99.99999% up-time); and partition tolerance, which demands that the system remain operational even under severe network malfunctions.
Because of this conjecture, NoSQL system designers traded consistency for higher availability and network fault tolerance. The resulting relaxed replica-convergence models were categorized by Amazon's Werner Vogels2 in 2009. However, eventual consistency only ensures that all replicas eventually converge to the same state; many mission-critical applications, even in cloud settings, require stronger consistency guarantees.
FoundationDB, as explored in the following paper, pioneered the development of a scalable distributed KV store with strong consistency guarantees. It started 10 years ago as an open-source project and is now widely used as a mission-critical backbone repository in the cloud infrastructures of companies such as Apple and Snowflake. In this respect, FoundationDB reunites the NoSQL paradigm of high availability and low latency with the ACID (atomicity, consistency, isolation, durability) guarantees of traditional database systems. This new breed of systems is therefore called NewSQL, albeit not all of them offer SQL interfaces.
For scalability and elasticity in cloud infrastructures, FoundationDB features a fully disaggregated architecture consisting of a storage system, a logging system, and a separate transaction system. Storage servers are decoupled from log servers, which maintain the "ground truth." Storage servers continuously apply log records from the log servers and thus lag only slightly behind the transaction commit state; in production measurements, the authors report a lag on the order of only milliseconds. The transaction system employs an optimistic multi-version concurrency control scheme with a subsequent verification phase to ensure strict serializability. Read-only transactions are thereby not penalized, as they access versions that were already committed by the time they started. For this purpose, a so-called sequencer process determines the corresponding timestamp, which proxies then use to access the correct version from the storage servers.
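This read path is visible in FoundationDB's client API. The following minimal sketch uses the official Python bindings (the `fdb` package); it assumes a locally running cluster reachable via the default cluster file and an API version (710 here) matching the installed client. The first read in a transaction implicitly obtains a read version, the snapshot timestamp handed out by the sequencer, and all subsequent reads in that transaction observe this snapshot.

```python
import fdb

fdb.api_version(710)  # must be called once, before any other fdb call
db = fdb.open()       # connects using the default cluster file

@fdb.transactional
def get_counter(tr):
    # The first read acquires a read version (the snapshot timestamp
    # issued by the sequencer); all reads in this transaction then
    # see that consistent snapshot of the database.
    v = tr[b'counter']
    return fdb.tuple.unpack(v)[0] if v.present() else 0

@fdb.transactional
def increment_counter(tr):
    # The read and the write below execute within one atomic transaction.
    tr[b'counter'] = fdb.tuple.pack((get_counter(tr) + 1,))

increment_counter(db)
print(get_counter(db))  # a read-only transaction; it never blocks writers
```

When called with the database handle, the `@fdb.transactional` decorator creates the transaction and automatically retries it if commit-time verification detects a conflict.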
For write transactions, the sequencer assigns the commit timestamp, with which the resolvers verify strict serializability by comparing the transaction's read set (that is, the key ranges of the key-value pairs accessed during the transaction) with the write sets of transactions committed in the meantime. Successfully verified transactions are made durable by writing their logs to multiple log servers for fault tolerance. These log servers can be replicated across distinct (geographic) regions to tolerate failures (such as power outages) of an entire region. The FoundationDB system can be configured for synchronous as well as asynchronous log transferal, thereby trading off safety against latency. In case of a primary server failure, the synchronous mode guarantees seamless, instantaneous takeover by the secondary server, albeit at the cost of extra latency during commit processing. Under asynchronous mode, the primary server maintains several satellites within the same region which, in case of failure, can submit the suffix of the log that was not yet sent to the secondary server in a remote region. Thus, the trade-off between performance/responsiveness and failure resilience is one of the foremost goals of the FoundationDB design.
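To make the resolver's check concrete, here is a simplified, self-contained sketch of optimistic conflict detection over key ranges. The class and method names are hypothetical and do not reflect FoundationDB's internal interfaces. The rule it implements: a transaction that read at version rv and wants to commit at version cv must abort if any transaction that committed in the interval (rv, cv] wrote a key range overlapping its read set.

```python
from dataclasses import dataclass, field

@dataclass
class Resolver:
    # (commit_version, write_ranges) of recently committed transactions;
    # in practice this history is truncated to a short time window.
    recent_commits: list = field(default_factory=list)

    def try_commit(self, read_version, commit_version,
                   read_ranges, write_ranges):
        for cv, writes in self.recent_commits:
            # Only transactions committed after our snapshot matter.
            if read_version < cv <= commit_version:
                for r_lo, r_hi in read_ranges:
                    for w_lo, w_hi in writes:
                        if r_lo < w_hi and w_lo < r_hi:  # ranges overlap
                            return False  # conflict: the read data is stale
        self.recent_commits.append((commit_version, write_ranges))
        return True

r = Resolver()
# T1 reads and writes ['a','b'); no concurrent commits, so it succeeds.
assert r.try_commit(0, 1, [(b'a', b'b')], [(b'a', b'b')])
# T2 also read ['a','b') at version 0, but T1 committed a write to that
# range at version 1 <= 2, so T2 must abort and retry.
assert not r.try_commit(0, 2, [(b'a', b'b')], [(b'x', b'y')])
print("resolver sketch OK")
```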
Correctness and robustness were also key aspects of the FoundationDB development life cycle. To achieve them, the team relied on simulation testing, which allows the distributed code to run deterministically, enabling error reproducibility and extensive test coverage.
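FoundationDB's actual simulator is built into its C++ Flow runtime; the toy Python sketch below merely illustrates the underlying principle. All nondeterminism (message arrival, delivery order, injected faults) is drawn from a single seeded random generator, so any failing run can be replayed exactly from its seed.

```python
import random

def simulate(seed, steps=100):
    rng = random.Random(seed)  # the ONLY source of nondeterminism
    state, pending, trace = 0, [], []
    for _ in range(steps):
        roll = rng.random()
        if roll < 0.1 and pending:
            # Injected fault: drop a random in-flight message.
            pending.pop(rng.randrange(len(pending)))
        elif roll < 0.6:
            pending.append(rng.randint(1, 5))  # a new message arrives
        if pending and rng.random() < 0.5:
            state += pending.pop(0)            # deliver the next message
        trace.append(state)
    return trace

# Identical seeds yield identical executions, so a failure discovered
# with a given seed can be replayed and debugged deterministically.
assert simulate(42) == simulate(42)
print("deterministic replay OK")
```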
To conclude, the design of FoundationDB lays the “foundation” for future cloud information systems that must balance performance and consistency requirements.