The paper [1] appeared at the USENIX Annual Technical Conference, a great home for seriously practical work.
The motivating question: What bottlenecked the performance of the Flash web server [2]? But first, we need to discuss blocking vs. non-blocking system calls.
A system call blocks if it can cause the calling process to become unrunnable. (We then say the process has blocked.)
Some system calls inherently block: the purpose of sleep, for example, is to block the calling process. However, other system calls block by convention. Take open, for example. open cannot complete until the kernel knows whether the file exists, has the right type, and the calling process has sufficient permissions to open it. Finding these things out might require reading data from disk or elsewhere, so the open implementation blocks the calling process until the necessary data is available. But this is not the only possible design! open could return a distinguished value if it couldn’t complete yet—say, an EAGAIN error. Then the kernel could load the relevant data in the background, leaving the calling process responsible for retrying the open later. This application-retry approach is called polling or non-blocking I/O.
In a pure polling system, applications never block. An application always remains schedulable and retries the system call until it succeeds, like an annoying four-year-old: “Are we there yet? Are we there yet? Are we there yet? Are we there yet? Are we there yet?” Pure polling often causes utilization problems: the CPU is kept busy answering uninteresting questions, causing lots of context switches and wasted time.
Notified polling is usually an improvement over pure polling. In notified polling, there are two classes of system calls, action system calls and event notification system calls. An action system call changes the system’s state, for example by opening a file, but might return an EAGAIN value if it can’t complete. An event notification system call blocks until some incomplete action system call from a specified set can make progress. select, poll, kevent, epoll, and aio_suspend are examples of event notification APIs in Unix.
Here’s an example of how blocking, pure-polling, and notified-polling APIs behave:
BLOCKING PURE POLLING NOTIFIED POLLING
P: read ... P: read P: read
| P: ... returns -EAGAIN P: ... returns -EAGAIN
| not ready P: read P: select ...
| so P blocks P: ... returns -EAGAIN |
| P: read | not ready
| P: ... returns -EAGAIN | so P blocks
| P: read |
v P: ... returns -EAGAIN v
P: ... returns 100 P: read P: ... returns 0
P: ... returns 100 P: read
P: ... returns 100
Blocking makes the smallest number of system calls, and therefore has the lowest kernel-crossing overhead. Notified polling is a close second, but can have the highest latency, since the extra read happens after the select returns. Pure polling looks bad here. (If a server always has some work to do, pure polling can match or beat the other approaches, but this is rare in practice.)
Blocking system calls are a great solution for completely serial processes, where every operation must complete before the next operation can begin. But fewer and fewer applications actually work this way. Servers generally support multiple connections at once; if one connection blocks, the server can tend to another. Any application with a GUI involves independent event streams, one to the user and one to the system’s other hardware. (For example, users like the ability to cancel expensive operations, such as slow web page loads.) Furthermore, even for noninteractive applications, speculative performance improvements like prefetching can’t be implemented with blocking system calls in a single thread.
Multiple threads, each with blocking system calls, can implement independent event streams. Unfortunately, threads introduce overhead that can be greater than the cost of notified polling. In notified polling, the kernel can run a single server process until it has no more work to do; in multi-threaded blocking, the kernel must switch among many threads, which is generally a bit slower than returning to the same process that was previously running. The fastest servers in practice use notified polling system calls. (Note that server code need not look event driven to use notified-polling system calls. For instance, von Behren et al.’s Capriccio project compiles multithreaded servers to use notified-polling system calls underneath.)
A nice property of notified polling is that event notification system calls are hints. If a select system call returns, but the relevant file descriptor actually isn’t ready (for instance, because another process sharing the file descriptor already read all available data by the time the selecting process runs), it’s no big deal, since the application must already be prepared for EAGAIN. It is much easier to implement event notification hints than guarantees.
Sidebar: In notified polling, state changes are only performed by action system calls that succeed. An application might or might not retry an action system call that returned EAGAIN: the application can choose. But in a variant of notified polling called asynchronous I/O, action system calls execute in the background. The application can fire off many action system calls in parallel; they will complete in any order. Event notification is used to determine which system calls have completed. In Unix, POSIX Asynchronous I/O offers an asynchronous I/O interface to reading and writing files. Asynchronous I/O offers some advantages over notified polling; for example, fewer system calls are required. However, it also has disadvantages. The calling process cannot nail down an execution order for asynchronous system calls, and must prepare memory for system call return values in advance. Synchronization is just hard! Due to these factors, and some implementation-specific costs of the POSIX AIO API, notified polling is much more commonly used.
Flash is the best of the early Web server design papers [2]. It surveys several alternate server architectures, then introduces a new one, called AMPED (Asymmetric Multi-Process Event-Driven).
The key to Web server performance is minimizing overhead. Modern networks are so fast, and client loads can be so high, that server overheads, such as wasted memory, blocking, and frequent context switching, actually start to matter. Server architectures can be analyzed for overhead this way:
Multi-process (MP): A server consists of multiple processes, each of which processes one incoming connection at a time. A main advantage is simplicity of programming, but overhead is rather high: each process has its own stack, heap, and kernel data structures; the several server processes may have independent, and possibly redundant, memory caches; and so forth.
Multi-threaded (MT): Like MP, but the multiple processes become multiple threads sharing the same address space. One main overhead is that of each thread’s stack. And since the kernel is allowed to schedule threads simultaneously on independent cores, the threads must synchronize with each other, which can be a surprisingly large cost. (For instance, simply linking with the POSIX Threads library (-lpthread) can slow down your program by 20%!)
Single-process event-driven (SPED): A single, single-threaded server process handles all connections using non-blocking system calls and notified polling. This eliminates thread-stack and synchronization overhead, but can complicate the programming model. Unfortunately, not all blocking Unix system calls have good non-blocking equivalents. For example, the non-blocking open (available in POSIX AIO) doesn’t integrate well with the best non-blocking system calls for communicating with network connections. As a result, SPED servers can block; and when they do, it’s a disaster for efficiency, causing all server connections to pause until the process unblocks.
Asymmetric multi-process event-driven (AMPED): This is the Flash contribution. A main server process handles connections SPED-style, but separate helper processes (or threads) are spun off to handle system calls that might block.
Note 1. The Flash paper is called “Flash: An Efficient and Portable Web Server.” Why is “portable” in the title? Well, SPED servers have what amounts to a bad-API problem: most system call APIs lack important non-blocking system calls. One could just fix the API (and I wish someone would for Linux; Windows’s API is better). But Flash’s AMPED architecture shows that the benefits of SPED can be obtained even on current APIs.
Note 2. Web servers are important, but in some ways a special case for server performance in that they rarely have much CPU work of their own to do. Services that require significant computation are driven inevitably toward multithreaded/multiprocess designs on modern multicore hardware. But to get good performance and good I/O concurrency, such services might still use notified polling system calls!
There are two key server performance metrics, throughput and latency. (Throughput can be measured in connections/second and latency in seconds/connection; the DeBox paper shows latency CDFs, an excellent choice, whereas the earlier Flash paper has graphs of connection rate vs. file size.) In very simple systems—single-threaded, blocking systems processing streams of identically-expensive requests—these metrics are related: average throughput equals one over average latency. But in more complex systems, this relationship doesn’t hold. Systems that process many connections at once can exhibit high throughput and high latency.
Both performance metrics matter, but for Web servers (and many others), good latency is harder to achieve than good throughput. The reason is kernel buffering. Many server responses are larger than one packet, so the kernel stores the extra data in a buffer until TCP decides it’s ready to send. (32 kB is a typical socket buffer size; see man setsockopt, SO_SNDBUF.) This kernel buffering can keep throughput high even when the server is otherwise blocked. But blocking always affects server latency. (For similar reasons, per-connection latency is often more variable than per-connection throughput.)
“Our experiments with the Flash web server indicate that adding a 1 ms delay to one out of every 1000 requests can degrade latency by a factor of 8 while showing little impact on throughput.” [p3, 1]
The Flash paper shows that Flash achieves good throughput, but its latency measurements aren’t very deep. The DeBox paper concentrates not on throughput, but on a “SPECweb99” score. This is the number of simultaneous connections a server can sustain while still providing a minimum quality of service, which is basically a latency measurement. Flash’s initial SPECweb99 score was disappointing. Thus, DeBox.
DeBox provides first-class, in-band, per-call performance analysis of system calls: performance information is returned to the application with each system call’s normal results, rather than collected out of band by a separate profiler. A DeBoxControl system call turns profiling on and off and controls how much information is collected, after which users simply make system calls as usual. Each system call then fills out a DeBoxInfo structure in addition to returning its normal results; the information collected includes where time was spent in the call and where and why the call slept (the PerSleepInfo feature). Overhead is relatively low, ranging from 1–3% on macrobenchmarks.
But we don’t read the paper for DeBox itself. We read it for the detailed and fascinating case study of how DeBox was used to improve Flash performance. Over the course of the paper Flash’s SPECweb99 score goes from 200 to 820, and its median latency drops by a factor of 47 (!!). Internalize this debugging process and you’ll be well equipped to evaluate real system performance problems, whether or not you do so with first-class, in-band, per-call performance information.
Flash was already fast, so it’s not surprising that there was no one key optimization. Profiling a slow server often reveals obviously slow routines, but profiling a fast server may not reveal much. The key to improving performance of a fast server is to improve everything that can ever be slow. And, in Flash’s architecture, the key to improving latency is ensuring that the main server never blocks. DeBox helps this process by exposing system call performance and by letting the application assert-fail on blocking.
All performance improvements involve (1) making expensive operations cheaper, or (2) making expensive operations unnecessary. Flash does both. Here’s a summary.
mincore degradation (§5.1): Server is CPU-bound, and mincore is the slowest system call. Solution: Make mincore cheaper. Replace in-kernel linked lists with splay trees, reducing mincore expense from O(n) to O(log n) (where n is the number of mapped regions). Also switch write system calls to sendfile, which lets the kernel copy data directly out of the buffer cache.

select degradation (§5.1): Server is CPU-bound. The event notification API (select) costs O(n) for n file descriptors (see also the kqueue paper). Solution: Switch to an event notification API (kqueue/kevent) that usually costs O(1) time.

open blocking: DeBox shows that open might block—not because the necessary information had been flushed from the cache, but because some other open had locked the directory (for example). Solution: perform opens in helper processes, which transfer the resulting file descriptors back to the main process.

Cache inconsistency: Assert-failing on blocking opens (possible because DeBox is in-line) demonstrates a bug—two caches aren’t in sync. The solution: file descriptor transfer, again.

fork degradation (§5.4): fork is the most expensive system call, and it gets more expensive over time.

mmap degradation (§5.4): mmap is an expensive system call, and it gets more expensive over time. Diagnosis: mmap is getting more expensive because the main server process’s memory map is filling up. Solution: Eliminate mmap. Flash used mmap just so it could use mincore. Forget them both; instead, infer memory residency using the helpers (despite the small blocking risk this introduces).

CGI read (§5.5): In Flash’s CGI interface, the main process’s read system call is a heavy hitter (20% of kernel time). DeBox helps separate “read() calls by call sites”—well, strace could separate the time! Solution: eliminate the read by passing file descriptors instead.

sendfile blocking (§5.6): sendfile can block. Diagnosis: sendfile can block because it has a limited buffer pool (discovered through DeBox’s PerSleepInfo label feature). Limited pools are often silly. Solution: Unlimit its buffer pool.

sendfile blocking (§5.6): sendfile can block. Diagnosis: sendfile must block if it needs to read data from disk. Since the server still shouldn’t block, implement a non-blocking sendfile.

sendfile per-packet overhead (§5.6): sendfile is not much faster than write, even though it copies less. Diagnosis: sendfile does relatively worse for small files. The reason: sendfile simply sends more packets, since it flushes the initial HTTP headers (specified by write) before filling up packets with file data. Solution: Make sendfile smart enough to combine file data with header data.

The following optimizations improve performance by making expensive operations cheaper:
mincore degradation (reduce cost of mincore from O(n) to O(log n))
select degradation (reduce cost of event notification from O(n) to O(1) via kevent)
fork degradation (reduce cost of fork from O(n) to O(1))
sendfile per-packet overhead (reduce per-packet cost by 1)

The following optimizations eliminate expensive operations:
Copying (eliminate write in favor of sendfile)
mmap degradation (eliminate mmap by relying on memory residency heuristics)
CGI read (eliminate read by passing fd)
sendfile blocking (eliminate blocking by unlimiting a sendfile buffer pool)
sendfile blocking (eliminate blocking via a non-blocking API)

The following optimizations change the Flash architecture:
Copying (switch from write to sendfile)
CGI read (eliminate redundancy and copying by fd passing)
mmap degradation (eliminate mmap in favor of heuristics)
sendfile blocking (eliminate heuristics in favor of a non-blocking API)

(File descriptor passing is a great technique to remember. It uses Unix-domain sockets [man unix], the sendmsg and recvmsg system calls, and the SCM_RIGHTS ancillary message type.)
The new Flash is both simpler and faster: a lovely combination. Perhaps they could have implemented “new Flash” originally, but without this specific performance information, it would be hard to justify the work.
The evaluation demonstrates how Flash improved as a result of the DeBox experience. The improvements are impressive, particularly for latency.
“Unfortunately, we were unable to get Linux to run properly on our existing hardware, despite several attempts to resolve the issue on the Linux kernel list.” [p11, 1] Funny! I think only rarely now will you find hardware where FreeBSD works but Linux doesn’t.
[1] “Making the ‘Box’ Transparent: System Call Performance as a First-Class Result”, Yaoping Ruan and Vivek Pai, in Proc. USENIX ATC 2004.
[2] “Flash: An Efficient and Portable Web Server”, Vivek S. Pai, Peter Druschel, and Willy Zwaenepoel, in Proc. USENIX ATC 1999.
[3] “SEDA: An Architecture for Well-Conditioned, Scalable Internet Services”, Matt Welsh, David Culler, and Eric Brewer, in Proc. 18th ACM SOSP, 2001.