Transactions and Serverless are Made for Each Other

Serverless cloud offerings are becoming increasingly popular for stateless applications because they simplify cloud deployment. This article argues that if serverless platforms could wrap functions in database transactions, they would also be a good fit for database-backed applications. There are two unique benefits of such a transactional serverless platform: time-travel debugging of past events and reliable program execution with “exactly-once” semantics.

Serverless cloud platforms such as Amazon Web Services (AWS) Lambda and Azure Functions are increasingly popular for building production applications as varied as website front ends, machine-learning (ML) pipelines, and image-processing systems. These platforms radically simplify development by managing application deployment. Developers can deploy functions with the click of a button and the platform automatically hosts them, guarantees their availability, and scales them to handle changing loads.

Serverless platforms are primarily used for stateless operations such as image resizing or video processing. Here, we will argue they should also be used to deploy stateful applications, particularly database-backed applications whose business logic frequently queries and updates a transactional database such as Postgres or MySQL. Database-backed applications are ubiquitous in modern businesses; examples include e-commerce Web services, banking systems, and online reservation systems. Today they run primarily on server-based platforms such as Kubernetes. Because they include the back ends of most enterprise APIs and much of the modern Web, they represent a massive opportunity for serverless offerings.

To make serverless work for database-backed applications, serverless platforms would need to make one critical addition: Allow developers to execute functions as database transactions. Figure 1 shows an inventory reservation function implemented in a conventional serverless platform versus a transactional serverless platform. The checkInventory and updateInventory functions perform SQL queries. In a conventional serverless platform, if a function accesses the database, developers must obtain a database connection, manually begin a transaction, execute business logic and SQL queries, and then finally commit the transaction (Figure 1a).

Figure 1. An inventory reservation function implemented in a conventional serverless platform versus a transactional serverless platform. checkInventory and updateInventory perform SQL queries.

By contrast, a transactional serverless platform manages the database connection: If a function accesses the database, it uses a platform-provided connection that automatically wraps the function in a transaction (Figure 1b). The idea of building such a platform has been explored in several research projects, both by these authors¹ and by others.³,⁴
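The contrast between the two styles can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from Figure 1: it uses an in-memory SQLite database in place of Postgres or MySQL, and the inventory schema, the transactional decorator, and the function names are all hypothetical stand-ins for what a platform might provide.

```python
import functools
import sqlite3

def make_db():
    # Hypothetical schema: one row per item with its remaining stock.
    # isolation_level=None means we issue BEGIN/COMMIT ourselves.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE inventory (item TEXT PRIMARY KEY, stock INTEGER NOT NULL)")
    conn.execute("INSERT INTO inventory VALUES ('widget', 1)")
    return conn

def check_inventory(conn, item):
    row = conn.execute("SELECT stock FROM inventory WHERE item = ?", (item,)).fetchone()
    return row is not None and row[0] > 0

def update_inventory(conn, item):
    conn.execute("UPDATE inventory SET stock = stock - 1 WHERE item = ?", (item,))

# (a) Conventional style: the function manages the transaction by hand.
def reserve_conventional(conn, item):
    conn.execute("BEGIN IMMEDIATE")
    try:
        if not check_inventory(conn, item):
            conn.execute("ROLLBACK")
            return False
        update_inventory(conn, item)
        conn.execute("COMMIT")
        return True
    except Exception:
        conn.execute("ROLLBACK")
        raise

# (b) Transactional style: a (hypothetical) platform-provided wrapper begins,
# commits, or rolls back the transaction around the business logic.
def transactional(fn):
    @functools.wraps(fn)
    def wrapper(conn, *args):
        conn.execute("BEGIN IMMEDIATE")
        try:
            result = fn(conn, *args)
        except Exception:
            conn.execute("ROLLBACK")
            raise
        conn.execute("COMMIT")
        return result
    return wrapper

@transactional
def reserve_transactional(conn, item):
    if not check_inventory(conn, item):
        return False
    update_inventory(conn, item)
    return True
```

In the second version the function body contains only business logic; the transaction boundaries are the platform's responsibility, which is what later makes replay and exactly-once execution possible.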

As this article explains, a transactional serverless platform not only is more convenient for the developer but can also provide powerful benefits for database-backed applications beyond the capabilities of conventional serverless or server-based systems.

First, a transactional serverless platform makes programs easier to debug. Modern applications are difficult to debug because they run in distributed settings with frequent concurrent accesses to shared state, so bugs often involve complex race conditions that are not easy to reproduce in a development environment. Reproducing errors is particularly difficult in conventional serverless platforms because their execution environment is transient and exists only in the cloud. A transactional serverless platform, however, can simplify debugging through time travel.² Because the platform wraps functions in transactions to coordinate their state accesses, a debugger can leverage the transaction log to “travel back in time” and locally replay any past transactional function execution.

Second, a transactional serverless platform can provide reliable program execution. Writing reliable database-backed applications is difficult because they often coordinate several business-critical tasks, any of which may fail. In a server-based application, addressing this problem is difficult, as developers must manually track each request’s status and recover failed requests. Conventional serverless platforms make this easier by automatically restarting any task that fails, but this can be problematic if it causes an operation to execute multiple times (for example, paying twice). If functions are transactions, however, the platform can record their success or failure in the same transaction as their business logic, thus guaranteeing that each function executes once and only once.

Programming a Transactional Serverless Platform

A transactional serverless platform could provide a programming model similar to today’s serverless platforms, where developers write programs as workflows of functions. Each function performs a single operation. Workflows, implemented as directed graphs or state machines, orchestrate many functions. Popular serverless workflow orchestrators include AWS Step Functions and Azure Durable Functions.

The distinguishing feature of a transactional serverless platform is that all functions accessing the application database are wrapped in atomic, consistent, isolated, and durable (ACID) database transactions, as shown in Figure 1b. These functions must be deterministic and have no side effects outside the database. Functions not accessing the database, such as those making external API calls, work the same as they do in conventional serverless platforms.

As a running example for this article, Figure 2 shows a diagram of a serverless checkout service workflow that first reserves inventory for all items in an order, then processes payment for the order, and finally marks the order as ready to fulfill. Each step is implemented in a separate function. All functions except “process payment” (which uses a third-party payment provider) contact the database and are wrapped in transactions. If any step fails, the workflow runs rollback functions to undo previous operations (for example, returning reserved inventory if the payment fails).

Figure 2. Serverless checkout service workflow, including both success and rollback paths.
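A workflow with rollback paths like the one in Figure 2 might be orchestrated roughly as follows. This is a minimal sketch, assuming a plain dictionary in place of the application database and hypothetical step functions; a real platform would run the database steps as transactions and call a payment provider for the payment step.

```python
def reserve_inventory(state, order):
    if state["stock"] < order["qty"]:
        raise RuntimeError("insufficient inventory")
    state["stock"] -= order["qty"]

def return_inventory(state, order):
    state["stock"] += order["qty"]

def process_payment(state, order):
    # Stand-in for a call to a third-party payment provider.
    if order["card_declined"]:
        raise RuntimeError("payment declined")
    state["paid"].append(order["id"])

def refund_payment(state, order):
    state["paid"].remove(order["id"])

def mark_ready(state, order):
    state["ready"].append(order["id"])

# Each step is (name, action, rollback function or None).
CHECKOUT = [
    ("reserve inventory", reserve_inventory, return_inventory),
    ("process payment", process_payment, refund_payment),
    ("mark ready to fulfill", mark_ready, None),
]

def run_workflow(steps, state, order):
    """Run steps in order; if one fails, undo the completed steps in reverse."""
    completed_undos = []
    for name, action, undo in steps:
        try:
            action(state, order)
        except RuntimeError:
            for undo_fn in reversed(completed_undos):
                undo_fn(state, order)
            return f"failed at {name}"
        if undo is not None:
            completed_undos.append(undo)
    return "succeeded"
```

For example, an order whose payment is declined reserves inventory, fails at the payment step, and then has its inventory returned by the rollback path.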

Time-Travel Debugging

One powerful and unique feature enabled by a transactional serverless platform is time-travel debugging: letting developers faithfully replay production traces in a local development environment to reproduce bugs that happened in the past. Time-travel debugging is especially useful for database-backed applications because they frequently run in distributed environments where bugs manifest as race conditions that occur only under high concurrency and are nearly impossible to reproduce locally.

For example, suppose the “reserve inventory” operation in Figure 2 is split into two separate transactional functions, one that checks inventory and one that updates it, as in the buggy implementation shown in Figure 3. This implementation contains a race condition: if two requests arrive at the same time, both can reserve the same item, potentially causing the vendor to sell more items than it has available.

Figure 3. A buggy implementation of the “reserve inventory” operation. Two concurrent requests both try to reserve the same item and both succeed, causing overselling.
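The race can be made concrete with a small sketch. It is not the code from Figure 3; it assumes an in-memory SQLite database and a hypothetical inventory table, and it simulates the two concurrent requests by interleaving their transactions by hand on one connection.

```python
import sqlite3

def make_db(stock):
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE inventory (item TEXT PRIMARY KEY, stock INTEGER NOT NULL)")
    conn.execute("INSERT INTO inventory VALUES ('widget', ?)", (stock,))
    return conn

def check_inventory(conn, item):
    """Transaction 1: report whether the item is in stock."""
    conn.execute("BEGIN")
    row = conn.execute("SELECT stock FROM inventory WHERE item = ?", (item,)).fetchone()
    conn.execute("COMMIT")
    return row[0] > 0

def update_inventory(conn, item):
    """Transaction 2: decrement stock, trusting a check that may now be stale."""
    conn.execute("BEGIN")
    conn.execute("UPDATE inventory SET stock = stock - 1 WHERE item = ?", (item,))
    conn.execute("COMMIT")

def interleaved_requests():
    conn = make_db(stock=1)
    a_ok = check_inventory(conn, "widget")  # request A sees stock = 1
    b_ok = check_inventory(conn, "widget")  # request B also sees stock = 1
    if a_ok:
        update_inventory(conn, "widget")    # stock becomes 0
    if b_ok:
        update_inventory(conn, "widget")    # stock becomes -1: oversold
    return conn.execute("SELECT stock FROM inventory").fetchone()[0]
```

Each transaction is individually correct and isolated; the bug lies in the gap between them, which is why it appears only under a particular interleaving.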

Debugging issues like this is tricky because they surface only if multiple concurrent requests with specific inputs are interleaved in a specific way with a particular database state. To reproduce the bug locally, the developer must determine not only which requests caused the bug, but also the order in which different operations in those requests interleaved and the exact database state that made the bug possible. In a conventional platform, tracking execution order and reconstructing database state are prohibitively expensive: Requests execute concurrently on many parallel threads on many distributed servers, potentially modifying the database thousands of times per second.

By contrast, prior research² has shown that a transactional serverless platform makes faithful replay practical because each function is wrapped in an isolated, atomic, and deterministic transaction. This enables a time-travel debugger, which can faithfully replay a production trace (including race conditions and concurrency bugs) in two steps:

1. Using database transaction logs, it can reconstruct the state of the application database at the time of the trace’s first request.
2. It can locally execute each request in the trace on the reconstructed database, executing their transactional functions in the order they originally executed in the application database’s transaction log.
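The two steps above can be sketched as follows. This is a simplified model rather than the debugger described in the cited work: the database is a dictionary, and the log format (a commit position, a request ID, a function name, arguments, and the writes each transaction made) is an assumption made for illustration.

```python
# Hypothetical transactional functions under replay.
def check(db, item):
    return db[item] > 0

def update(db, item):
    db[item] -= 1

FUNCTIONS = {"check": check, "update": update}

# Hypothetical transaction log, ordered by commit position.
LOG = [
    {"commit": 1, "request": "restock", "function": "restock", "args": ("widget",), "writes": {"widget": 1}},
    {"commit": 2, "request": "A", "function": "check", "args": ("widget",), "writes": {}},
    {"commit": 3, "request": "B", "function": "check", "args": ("widget",), "writes": {}},
    {"commit": 4, "request": "A", "function": "update", "args": ("widget",), "writes": {"widget": 0}},
    {"commit": 5, "request": "B", "function": "update", "args": ("widget",), "writes": {"widget": -1}},
]

def reconstruct(snapshot, log, first_commit):
    """Step 1: rebuild the database as it was just before first_commit."""
    db = dict(snapshot)
    for entry in sorted(log, key=lambda e: e["commit"]):
        if entry["commit"] >= first_commit:
            break
        db.update(entry["writes"])
    return db

def replay(db, log, request_ids):
    """Step 2: re-execute the trace's functions in their original commit order."""
    trace = sorted(
        (e for e in log if e["request"] in request_ids),
        key=lambda e: e["commit"],
    )
    results = []
    for entry in trace:
        output = FUNCTIONS[entry["function"]](db, *entry["args"])
        results.append((entry["request"], entry["function"], output))
    return results
```

Replaying requests A and B on the reconstructed database runs both checks before either update, so the local run ends with the same negative stock that occurred in production.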

A time-travel debugger improves developers’ lives by reproducing complex concurrency bugs in a controlled local environment. For example, if the debugger is run on a trace containing the bug described in Figure 3, it executes both check transactions on a database containing only one item, and then executes both update transactions, thus overselling the item and reproducing the bug. This process is shown in Figure 4.

Figure 4. A time-travel debugger replaying an execution trace containing the reserve inventory bug.

A time-travel debugger can provide another powerful feature called retroaction: the execution of modified code over past events. For a given trace, the debugger performs retroaction similarly to faithful replay but uses the updated implementation of each function instead of the original one. Retroaction is especially useful for regression testing: running a new code version over old production traces to verify it handles them correctly. For example, assume the bug in Figure 3 was fixed by combining the check and update functions into a single transactional function. A time-travel debugger can retroactively test this fix by re-executing the original trace but running the combined function in place of the original checks and updates. As shown in Figure 5, this validates that the fix eliminates the bug.

Figure 5. A time-travel debugger testing a fix to the reserve inventory bug using retroaction.
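Retroaction can be sketched in the same simplified style. The code below assumes a dictionary database and a hypothetical log format, and it assumes that each request is re-run at the position of its first original transaction; how a real debugger orders modified functions is a design decision this sketch does not settle.

```python
def reserve(db, item):
    """Fixed version: check and update happen in one transactional function."""
    if db[item] <= 0:
        return False
    db[item] -= 1
    return True

# Hypothetical log of the original buggy trace, ordered by commit position.
LOG = [
    {"commit": 2, "request": "A", "function": "check", "args": ("widget",)},
    {"commit": 3, "request": "B", "function": "check", "args": ("widget",)},
    {"commit": 4, "request": "A", "function": "update", "args": ("widget",)},
    {"commit": 5, "request": "B", "function": "update", "args": ("widget",)},
]

def retroact(db, log, request_ids, new_function):
    """Re-run each traced request once, using new_function in place of the
    original check/update pair, ordered by each request's first commit."""
    outcomes = {}
    for entry in sorted(log, key=lambda e: e["commit"]):
        request = entry["request"]
        if request in request_ids and request not in outcomes:
            outcomes[request] = new_function(db, *entry["args"])
    return outcomes
```

Run against a database holding one widget, request A reserves the item and request B is refused, so the stock never goes negative.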

Reliable Program Execution

Another key benefit of a transactional serverless platform is reliable program execution. Many database-backed applications must coordinate multiple business-critical tasks, any of which may fail. For example, the checkout workflow in Figure 2 performs three tasks for each order:

Reserving its inventory
Processing its payment
Marking it as ready to fulfill

To execute reliably, such applications must not only handle failures in any of those tasks, but also recover from interruptions such as server crashes. Specifically, they must have two properties:

Programs run to completion. If a program begins executing, it must continue, recovering through any interruptions until it reaches a terminal success or failure state. For example, if the checkout service is interrupted after processing a payment, it must recover and either mark the order as ready to fulfill (if the payment succeeded) or cancel the order and return reserved inventory (if the payment failed).
Operations execute exactly once. While executing a program, each of its operations must execute once and only once. For example, if you are recovering the checkout service after it is interrupted, you cannot naively re-send the payment request; otherwise, the customer may pay twice. You must instead determine the status of the original payment request (whether it was sent at all, and if so, whether it succeeded or failed) and recover accordingly.

Manually obtaining these properties in a traditional server-based application is difficult. One approach is to write the application as a state machine that checkpoints its state to persistent storage after every operation. If the program is interrupted, it resumes execution from the last checkpointed state. To ensure “exactly-once” execution, all operations must be idempotent so they can be safely re-executed during recovery. While such an approach works, it is tedious and error-prone, and it requires careful program design.

Existing serverless platforms simplify writing programs that run to completion but do not provide “exactly-once” execution. This follows naturally from the serverless programming model. If a program is written as a workflow of functions, the workflow orchestrator can record the workflow’s state after every function execution, then resume from the last recorded state if workflow execution is interrupted. Thus, serverless function orchestrators such as AWS Step Functions and Azure Durable Functions run workflows to completion, restarting each function until it succeeds or reaches a predefined failure state.

Durable workflow engines such as Temporal provide similar guarantees for server-based programs, provided they are written as workflows of operations. Because orchestrators treat functions as black boxes, however, they cannot provide “exactly-once” semantics, but instead restart each function until it succeeds. If a function crashes after completion but before its success is recorded, it is re-executed, potentially corrupting data.

As prior work has shown,¹,⁴ a transactional serverless platform can guarantee not only that programs run to completion, but also that transactional operations execute exactly once. Because the platform wraps functions in transactions, it can record the success or failure of a transactional function in the same transaction as the function. Therefore, if a function completes, its success or failure is always recorded in the database, while if a function is interrupted before completing, all its actions are rolled back by the database. Thus, the platform knows never to re-execute a function that has a recorded result, and it can always safely re-execute a function that has none.
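The mechanism can be illustrated with a short sketch. It assumes an in-memory SQLite database, a hypothetical results table keyed by workflow ID and step name, and a hypothetical run_exactly_once wrapper; it is one way to realize the idea, not the implementation from the cited systems.

```python
import json
import sqlite3

def make_db():
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE inventory (item TEXT PRIMARY KEY, stock INTEGER NOT NULL)")
    conn.execute(
        "CREATE TABLE results (workflow_id TEXT, step TEXT, output TEXT, "
        "PRIMARY KEY (workflow_id, step))"
    )
    conn.execute("INSERT INTO inventory VALUES ('widget', 5)")
    return conn

def run_exactly_once(conn, workflow_id, step, fn, *args):
    """Run fn and record its output in the same transaction as its writes."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT output FROM results WHERE workflow_id = ? AND step = ?",
            (workflow_id, step),
        ).fetchone()
        if row is not None:
            conn.execute("COMMIT")
            return json.loads(row[0])  # already executed: return the recorded result
        output = fn(conn, *args)
        conn.execute(
            "INSERT INTO results VALUES (?, ?, ?)",
            (workflow_id, step, json.dumps(output)),
        )
    except Exception:
        conn.execute("ROLLBACK")  # undoes the function's writes and leaves no record
        raise
    conn.execute("COMMIT")
    return output

def reserve(conn, item, crash=False):
    conn.execute("UPDATE inventory SET stock = stock - 1 WHERE item = ?", (item,))
    if crash:
        raise RuntimeError("simulated interruption before commit")
    return True

def demo():
    conn = make_db()
    stock = lambda: conn.execute("SELECT stock FROM inventory").fetchone()[0]
    observed = []
    try:
        run_exactly_once(conn, "order-1", "reserve", reserve, "widget", True)
    except RuntimeError:
        pass  # interrupted attempt: update and record rolled back together
    observed.append(stock())
    run_exactly_once(conn, "order-1", "reserve", reserve, "widget")  # retry succeeds
    observed.append(stock())
    run_exactly_once(conn, "order-1", "reserve", reserve, "widget")  # duplicate retry is a no-op
    observed.append(stock())
    return observed
```

Because the result row and the inventory update commit or roll back together, a retry after an interruption finds either both or neither, and the stock is decremented only once no matter how many times recovery re-invokes the step.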

Conclusion

Database-backed applications are an exciting new frontier for serverless computation. By tightly integrating application execution and data management, a transactional serverless platform enables many new features not possible in either existing serverless platforms or server-based deployments.

This article has explained how such a platform could benefit application debuggability and reliability. Its additional benefits include:

Observability, as the platform can track the full history (provenance) of each data item through all functions that have modified it
Security, as the platform can monitor all operations on data in real time
Performance, as the platform can collocate transactional functions with the application database.

We look forward to future work in this space.

December 2024 Issue

Communications of the ACM (CACM) is now a fully Open Access publication.
