Notes on DoublePlay: Parallelizing Sequential Logging and Replay

The paper [1] won a Best Paper Award at ASPLOS 2011 (ASPLOS = Architectural Support for Programming Languages and Operating Systems). Its ancestry is a long series of papers on deterministic replay, plus a different series of papers on operating system support for speculation:

The lineage, following the edges of the original diagram (asterisks mark best paper awards):

- Deterministic replay: ReVirt (2002) → OS Support (2003) → BackTracker (2003*) → Time-Traveling Virtual Machines (2005*) → ExtraVirt (2005) → SubVirt (2005) → SMP-ReVirt (2008)
- OS speculation: BlueFS (2004) → Speculator (2005*) → xsyncfs (2006*)
- Both: Speck (2008), drawing on SubVirt and Speculator; Respec (2010), drawing on SMP-ReVirt and Speck; and DoublePlay (2011*), drawing on Speck and Respec.

We read the paper as an introduction to two interesting and increasingly used operating systems techniques, deterministic replay and OS speculation; because it is new; and for the sweet technical idea that is the paper’s main contribution, namely uniparallelism.

For your information, here’s how DoublePlay fits in with the prior work in the diagram above:

Deterministic replay

Deterministic replay is the ability to exactly reproduce an execution of a system. One can replay the execution of a virtual machine, a process, a group of processes, anything.

Deterministic replay of software systems is in some ways easy, since most machine instructions have deterministic effects. (addl $1, %eax depends only on the value of %eax.) The main requirement is to record and replay all nondeterministic events, so that these events happen at exactly the same times and in exactly the same ways during replay as they did originally. In single-threaded code, the only nondeterministic events are system call return values and signal delivery, and these happen rarely enough that logging them is pretty cheap. (The DoublePlay paper considers single-threaded replay a solved problem.) But multithreaded code, in which multiple threads concurrently access shared memory, presents a huge challenge. Memory shared between concurrent threads forms a very high-bandwidth nondeterministic communication channel. (It’s nondeterministic because different threads can run at slightly different speeds, and unpredictable factors like bus design can affect which of several simultaneous modifications to a memory address will “win.”) Logging this channel is super expensive.
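To make the single-threaded case concrete, here is a minimal Python sketch of the record/replay idea (the names and structure are mine, not DoublePlay’s): log each nondeterministic result during recording, then substitute the logged values on replay.

```python
import random

class Recorder:
    """Record mode: log every nondeterministic result."""
    def __init__(self):
        self.log = []
    def call(self, fn, *args):
        result = fn(*args)      # actually perform the nondeterministic call
        self.log.append(result)
        return result

class Replayer:
    """Replay mode: return logged results instead of re-executing."""
    def __init__(self, log):
        self.log = iter(log)
    def call(self, fn, *args):
        return next(self.log)   # deterministic: replay the recorded value

def program(env):
    # All nondeterminism is routed through env.call; the rest of the
    # program is deterministic, so replay reproduces it exactly.
    total = 0
    for _ in range(3):
        total += env.call(random.randrange, 10)
    return total

rec = Recorder()
original = program(rec)
replayed = program(Replayer(rec.log))
assert original == replayed
```

Because the log contains every nondeterministic input, the replayed run is forced through the same execution as the original, whatever `random.randrange` produced.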

Why would anyone care about deterministic replay? Fundamentally for debugging, but there are other reasons. Deterministic replay is a primitive useful in several contexts. For instance, imagine replaying a system under an augmented virtual machine—with more security checks, say. The deterministic replay guarantee tells you that any security bugs found actually happened on the original execution. The “time-traveling virtual machines” work [2] uses deterministic replay to debug operating systems with features like “single-step backwards”.

OS speculation

Operating system speculation is the ability for processes to enter speculative mode, a state analogous to the middle of a database transaction in which the process’s external effects are temporarily buffered. Each OS speculation is eventually committed or aborted. On commit, the relevant processes leave speculative mode and their effects become permanent—for example, any file changes may be sent to the disk. On abort, however, all of the affected processes’ actions are undone. Any forked processes are obliterated; any disk writes are thrown away; any pending network packets are junked; and any signals delivered to other processes are “undelivered”, by rolling those receiving processes back to a checkpointed state immediately before the signal.
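The commit/abort semantics above can be modeled in a few lines of Python. This is a toy, hypothetical API, not Speculator’s real kernel interface: external effects are buffered while speculative, commit releases them, and abort restores the checkpoint and junks them.

```python
class Speculation:
    """Toy model of OS speculation: external effects are buffered until
    commit; abort rolls state back to the checkpoint."""
    def __init__(self, state):
        self.state = state
        self.checkpoint = dict(state)  # snapshot taken on entering speculation
        self.pending = []              # buffered disk writes / network packets

    def external_write(self, data):
        self.pending.append(data)      # held back while speculative

    def commit(self):
        flushed, self.pending = self.pending, []
        self.checkpoint = dict(self.state)
        return flushed                 # effects become permanent / visible

    def abort(self):
        self.pending.clear()                 # pending effects are thrown away
        self.state.clear()
        self.state.update(self.checkpoint)   # roll back to the checkpoint

# Abort: nothing escapes, state is restored.
st = {"x": 1}
spec = Speculation(st)
st["x"] = 99
spec.external_write("packet")
spec.abort()
assert st == {"x": 1} and spec.pending == []

# Commit: buffered effects are released.
spec2 = Speculation({"x": 1})
spec2.external_write("block 7")
assert spec2.commit() == ["block 7"]
```

The real system must of course also undo forks and undeliver signals by rolling back the receiving processes, which this sketch ignores.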

Speculation is a long-known performance improvement technique and has been implemented many times in different contexts. The version relevant here was originally developed to improve distributed file system performance [3]. It’s really well done: first off, it’s in a conventional OS kernel, which is hard, and secondly, it handles many things (like signals) previous systems did not.

Like deterministic replay, speculation is a primitive useful in several contexts. It may be even more widely applicable than replay: it is used in replay systems, to speed up distributed file systems, to make synchronous I/O appear faster, and elsewhere.

As you read these papers be aware of the new primitives that you could use in your own work!

Approach

DoublePlay’s goal is efficient deterministic replay of multithreaded systems. The big problem they need to solve is how to log the high-bandwidth shared memory channel. Their solution is a clever variant on earlier ideas. Specifically, they transform the multithreaded code to a form with lower-bandwidth nondeterminism, record that form, and then check that the result has the same observable effects as the original code. This is called uniparallelism.

Why do they need speculative execution? The goal, remember, is efficiency. They don’t want to slow down the original execution. But uniparallelism slows down execution a lot, since it reduces nondeterminism bandwidth by running threads sequentially (interleaved). A uniparallel execution of n concurrent threads might run n times slower.

Speed can be recovered if we run and record many uniparallel executions in parallel with the original code. The uniparallel executions turn into checks. At the end of an epoch, the uniparallel (“epoch-parallel”) execution’s results are compared with a checkpointed version of the original execution’s results. (The original execution is now far ahead.) If the results are the same, all is well: the uniparallel execution, when replayed deterministically, will produce the same effects as the ongoing truly-parallel execution. If they’re not the same, though, there’s a problem. We can fix the problem by rolling back to before the uniparallel checkpoint and trying again. This, though, might never make progress. The authors instead implement forward recovery, which adopts the uniparallel execution’s results as the truth. (Why is this OK?) The truly-parallel execution must be killed (rolled back) and then restarted with a copy of the uniparallel execution’s state.
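The epoch loop with forward recovery can be sketched as follows. Structure and names are mine; `parallel_run` models the fast, racy multithreaded execution, while `uniparallel_run` models the time-sliced execution whose thread schedule is cheap to log and deterministic to replay.

```python
def record_epochs(epochs, parallel_run, uniparallel_run):
    """Epoch-by-epoch recording with forward recovery (a sketch, not
    DoublePlay's kernel implementation)."""
    state = {}
    for epoch in epochs:
        fast = parallel_run(dict(state), epoch)        # races ahead
        checked = uniparallel_run(dict(state), epoch)  # logged, replayable
        if fast != checked:
            # Forward recovery: adopt the uniparallel result as the truth
            # rather than rolling back and retrying, which might never
            # converge; the parallel execution restarts from this state.
            fast = checked
        state = fast
    return state

def uni(state, epoch):
    state["sum"] = state.get("sum", 0) + epoch
    return state

def par(state, epoch):
    s = uni(state, epoch)
    if epoch == 2:          # simulate a divergent, racy epoch
        s["sum"] += 1000
    return s

# Epoch 2 diverges; forward recovery keeps the uniparallel result.
assert record_epochs([1, 2, 3], par, uni) == {"sum": 6}
```

In the real system the two executions run concurrently on spare cores and the comparison covers system call results and memory state at epoch boundaries; here both are collapsed into a dictionary comparison.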

Uniparallelism is a clever idea (related most to the ideas behind Speck). Even better for us, it combines interesting primitives (speculation, programs as state machines, deterministic execution) to get a cool system.

DoublePlay vs. Respec

Respec could support offline replay, despite DoublePlay’s claim. (“When requested, Respec can optionally save information to enable an offline replay of the recorded process.” [p83, 4]) In offline mode, Respec would log a checksum of each thread’s state (memory, registers, etc.) after each epoch; during replay, Respec would verify these checksums, with a rollback and retry in case of divergence.
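Respec’s offline mode, as described, amounts to the following loop (a sketch under my own naming; Respec actually checksums each thread’s registers and memory in the kernel):

```python
import hashlib

def checksum(state):
    # Stand-in for Respec's per-epoch checksum of each thread's state.
    return hashlib.md5(repr(sorted(state.items())).encode()).hexdigest()

def record(epochs, run):
    """Record phase: log a checksum after each epoch."""
    state, sums = {}, []
    for e in epochs:
        state = run(dict(state), e)
        sums.append(checksum(state))
    return sums

def offline_replay(epochs, run, sums, max_retries=100):
    """Replay phase: verify checksums, rolling back and retrying an epoch
    on divergence. Respec itself has no retry bound, which is exactly the
    unbounded-replay-time concern discussed below."""
    state = {}
    for e, expected in zip(epochs, sums):
        for _ in range(max_retries):
            candidate = run(dict(state), e)     # retry from the checkpoint
            if checksum(candidate) == expected:
                state = candidate
                break                           # epoch verified; move on
        else:
            raise RuntimeError("replay diverged too many times")
    return state

def run(state, e):          # a toy, deterministic epoch body
    state[e] = e * e
    return state

sums = record([1, 2, 3], run)
final = offline_replay([1, 2, 3], run, sums)
assert final == {1: 1, 2: 4, 3: 9}
```

With a genuinely nondeterministic `run`, each retry could produce a different interleaving, so the inner loop may spin many times before a candidate matches the recorded checksum.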

The theoretical difference between Respec and DoublePlay is that DoublePlay can always precisely replay any recorded execution in bounded time. (And DoublePlay can record any execution in bounded time.) Respec may find replay difficult: if the replayed execution diverges from the recorded execution, Respec must roll back and retry that epoch, possibly unboundedly many times. Note that this is the same problem DoublePlay faced before the forward recovery optimization.

Possibly-unbounded replay time seems infinitely bad. However, we could view this difference as quantitative, rather than qualitative; and rather than eliminating Respec because it might diverge, we could evaluate how often it diverges, or how many rollbacks are required in practice. Respec points out that “since the recorded process has been replayed successfully at least once [i.e., during the record phase], it is likely that offline replay will eventually succeed, although it may require a number of rollbacks and retries.” [4]

The quantitative performance difference between Respec and DoublePlay is not evaluated. Respec does not evaluate offline replay directly, and the benchmarks in the two papers can’t be compared directly: they overlap, but appear to use different problem sizes, for example.

We can guess how a comparison might go. Respec must calculate MD5 checksums during the record phase, which DoublePlay need not; this will add some cost. Frequent rollbacks caused by replay divergence could make Respec much slower than DoublePlay, but DoublePlay’s evaluation shows only a modest benefit from the forward recovery optimization (Table 2), which addresses exactly the replay-divergence problem. Respec by default features parallel replay, which is faster than uniparallel replay by a factor of n for n threads (though uniparallel replay could be sped up). Thus, it is possible that for online replay (or even, possibly, offline!) Respec is as fast or faster.

Nevertheless, the DoublePlay authors believe that DoublePlay will be slightly faster than or equivalent to Respec for most applications, as long as rollbacks are rare (personal communication). They also emphasize the risk that a Respec-recorded execution might, due to bad luck, be impossible to replay.

Whether or not it makes a difference for common-case multithreaded replay, the uniparallelism idea is cool enough and evocative enough to discuss and understand.

References

  1. “DoublePlay: Parallelizing sequential logging and replay”, Kaushik Veeraraghavan, Dongyoon Lee, Benjamin Wester, Peter M. Chen, Jason Flinn, and Satish Narayanasamy, in Proc. ASPLOS XVI, Mar. 2011 (ACM Digital Library)

  2. “Debugging operating systems with time-traveling virtual machines”, Samuel T. King, George W. Dunlap, and Peter M. Chen, in Proc. USENIX 2005 Annual Technical Conference. (ACM Digital Library)

  3. “Speculative execution in a distributed file system”, Edmund B. Nightingale, Peter M. Chen, and Jason Flinn, in Proc. SOSP ’05. (ACM Digital Library)

  4. “Respec: efficient online multiprocessor replay via speculation and external determinism”, Dongyoon Lee, Benjamin Wester, Kaushik Veeraraghavan, Satish Narayanasamy, Peter M. Chen, and Jason Flinn, in Proc. ASPLOS XV, 2010. (ACM Digital Library)