Notes on Making the “Box” Transparent

The paper [1] appeared at the USENIX Annual Technical Conference, a great home for seriously practical work.

The motivating question: What bottlenecked the performance of the Flash web server [2]? But first, we need to discuss blocking vs. non-blocking system calls.

Blocking vs. Non-blocking

A system call blocks if it can cause the calling process to become unrunnable. (We then say the process has blocked.)

Some system calls inherently block: the purpose of sleep, for example, is to block the calling process. Other system calls, however, block by convention. Take open. open cannot complete until the kernel knows whether the file exists, whether it has the right type, and whether the calling process has sufficient permission to open it. Finding these things out might require reading data from disk or elsewhere, so the open implementation blocks the calling process until the necessary data is available. But this is not the only possible design! open could return a distinguished value, say an EAGAIN error, if it couldn’t complete yet. Then the kernel could load the relevant data in the background, leaving the calling process responsible for retrying the open later. This application-retry approach is called polling or non-blocking I/O.

In a pure polling system, applications never block. An application always remains schedulable and retries the system call until it succeeds, like an annoying four-year-old: “Are we there yet? Are we there yet? Are we there yet? Are we there yet? Are we there yet?” Pure polling often causes utilization problems: the CPU is kept busy answering uninteresting questions, causing lots of context switches and wasted time.
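Here is roughly what pure polling looks like in code; this is a minimal sketch, assuming fd has already been put into non-blocking mode with O_NONBLOCK, and polling_read is just an illustrative name:

    #include <errno.h>
    #include <unistd.h>

    /* Pure polling: retry a non-blocking read until it succeeds.
       Assumes fd was opened (or fcntl'd) with O_NONBLOCK. */
    ssize_t polling_read(int fd, void *buf, size_t count) {
        for (;;) {
            ssize_t n = read(fd, buf, count);
            if (n >= 0)
                return n;                          /* data, or 0 at EOF */
            if (errno != EAGAIN && errno != EWOULDBLOCK)
                return -1;                         /* a real error */
            /* not ready yet: ask again ("Are we there yet?") */
        }
    }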

Notified polling is usually an improvement over pure polling. In notified polling, there are two classes of system calls, action system calls and event notification system calls. An action system call changes the system’s state, for example by opening a file, but might return an EAGAIN value if it can’t complete. An event notification system call blocks until some incomplete action system call from a specified set can make progress. select, poll, kevent, epoll, and aio_suspend are examples of event notification APIs in Unix.

Here’s an example of how blocking, pure-polling, and notified-polling APIs behave:

BLOCKING              PURE POLLING               NOTIFIED POLLING
P: read ...           P: read                    P: read
    |                 P: ... returns -EAGAIN     P: ... returns -EAGAIN
    | not ready       P: read                    P: select ...
    | so P blocks     P: ... returns -EAGAIN         |
    |                 P: read                        | not ready
    |                 P: ... returns -EAGAIN         | so P blocks
    |                 P: read                        |
    v                 P: ... returns -EAGAIN         v
P: ... returns 100    P: read                    P: ... returns 1
                      P: ... returns 100         P: read
                                                 P: ... returns 100

Blocking makes the fewest system calls, and therefore has the lowest kernel-crossing overhead. Notified polling is a close second, but can have the highest latency, since the extra read happens only after the select returns. Pure polling looks bad here. (If a server always has some work to do, pure polling can match or beat the other approaches, but this is rare in practice.)
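A corresponding sketch of notified polling, again assuming a non-blocking fd and an illustrative function name; note that the read after select still has to handle EAGAIN, for reasons discussed below:

    #include <errno.h>
    #include <sys/select.h>
    #include <unistd.h>

    /* Notified polling: block in select() until fd looks readable, then
       issue the non-blocking read.  select's answer is only a hint, so the
       read may still return EAGAIN, in which case we simply select again. */
    ssize_t notified_read(int fd, void *buf, size_t count) {
        for (;;) {
            fd_set rfds;
            FD_ZERO(&rfds);
            FD_SET(fd, &rfds);
            if (select(fd + 1, &rfds, NULL, NULL, NULL) < 0)
                return -1;                         /* select itself failed */

            ssize_t n = read(fd, buf, count);
            if (n >= 0)
                return n;                          /* data, or 0 at EOF */
            if (errno != EAGAIN && errno != EWOULDBLOCK)
                return -1;                         /* a real error */
            /* stale hint: loop and wait again */
        }
    }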

Blocking system calls are a great solution for completely serial processes, where every operation must complete before the next operation can begin. But fewer and fewer applications actually work this way. Servers generally support multiple connections at once; if one connection blocks, the server can tend to another. Any application with a GUI handles independent event streams: input from the user and events from the rest of the system’s hardware. (For example, users like the ability to cancel expensive operations, such as slow web page loads.) Furthermore, even for noninteractive applications, speculative performance improvements like prefetching can’t be implemented with blocking system calls in a single thread.

Multiple threads, each with blocking system calls, can implement independent event streams. Unfortunately, threads introduce overhead that can be greater than the cost of notified polling. In notified polling, the kernel can run a single server process until it has no more work to do; in multi-threaded blocking, the kernel must switch among many threads, which is generally a bit slower than returning to the same process that was previously running. The fastest servers in practice use notified polling system calls. (Note that server code need not look event driven to use notified-polling system calls. For instance, von Behren et al.’s Capriccio project compiles multithreaded servers to use notified-polling system calls underneath.)

A nice property of notified polling is that event notification system calls are hints. If a select system call returns, but the relevant file descriptor actually isn’t ready (for instance, because another process sharing the file descriptor already read all available data by the time the selecting process runs), it’s no big deal, since the application must already be prepared for EAGAIN. It is much easier to implement event notification hints than guarantees.

Sidebar: In notified polling, state changes are only performed by action system calls that succeed. An application might or might not retry an action system call that returned EAGAIN: the application can choose. But in a variant of notified polling called asynchronous I/O, action system calls execute in the background. The application can fire off many action system calls in parallel; they will complete in any order. Event notification is used to determine which system calls have completed. In Unix, POSIX Asynchronous I/O offers an asynchronous I/O interface to reading and writing files. Asynchronous I/O offers some advantages over notified polling; for example, fewer system calls are required. However, it also has disadvantages. The calling process cannot nail down an execution order for asynchronous system calls, and must prepare memory for system call return values in advance. Synchronization is just hard! Due to these factors, and some implementation-specific costs of the POSIX AIO API, notified polling is much more commonly used.
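For comparison, here is a minimal sketch of the POSIX AIO pattern: queue a read with aio_read, wait for it with aio_suspend (the event-notification call), and collect the result with aio_return. The helper name and the single-request structure are just for illustration; real code would queue many requests at once:

    #include <aio.h>
    #include <errno.h>
    #include <string.h>

    /* Asynchronous I/O: queue a read in the background, use aio_suspend()
       as the event-notification call, then collect the result with
       aio_return().  (Link with -lrt on some systems.) */
    ssize_t aio_read_once(int fd, void *buf, size_t count, off_t offset) {
        struct aiocb cb;
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;            /* must stay valid until completion */
        cb.aio_nbytes = count;
        cb.aio_offset = offset;

        if (aio_read(&cb) < 0)          /* queues the read, returns at once */
            return -1;

        const struct aiocb *list[1] = { &cb };
        while (aio_error(&cb) == EINPROGRESS)
            aio_suspend(list, 1, NULL); /* block until this request completes */

        return aio_return(&cb);         /* the read's result, collected once */
    }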

Flash

Flash is the best of the early Web server design papers [2]. It surveys several alternate server architectures, then introduces a new one, called AMPED (Asymmetric Multi-Process Event-Driven).

The key to Web server performance is minimizing overhead. Modern networks are so fast, and client loads can be so high, that server overheads, such as wasted memory, blocking, and frequent context switching, actually start to matter. Server architectures can be analyzed for overhead this way:

Note 1. The Flash paper is called “Flash: An Efficient and Portable Web Server.” Why is “portable” in the title? Well, SPED (Single Process Event-Driven) servers have what amounts to a bad-API problem: most system call APIs lack important non-blocking system calls. One could just fix the API (and I wish someone would for Linux; Windows’s API is better). But Flash’s AMPED architecture shows that the benefits of SPED can be obtained even on current APIs.

Note 2. Web servers are important, but in some ways a special case for server performance in that they rarely have much CPU work of their own to do. Services that require significant computation are driven inevitably toward multithreaded/multiprocess designs on modern multicore hardware. But to get good performance and good I/O concurrency, such services might still use notified polling system calls!

Performance metrics

There are two key server performance metrics, throughput and latency. (Throughput can be measured in connections/second and latency in seconds/connection; the DeBox paper shows latency CDFs, an excellent choice, whereas the earlier Flash paper has graphs of connection rate vs. file size.) In very simple systems (single-threaded, blocking systems processing streams of identically-expensive requests), these metrics are related: average throughput equals one over average latency. But in more complex systems, this relationship doesn’t hold. Systems that process many connections at once can exhibit high throughput and high latency: a server juggling 1,000 concurrent connections, each taking one second to complete, delivers roughly 1,000 connections/second of throughput, but a full second of latency per connection.

Both performance metrics matter, but for Web servers (and many others), good latency is harder to achieve than good throughput. The reason is kernel buffering. Many server responses are larger than one packet, so the kernel stores the extra data in a buffer until TCP decides it’s ready to send. (32 kB is a typical socket buffer size; see man setsockopt, SO_SNDBUF.) This kernel buffering can keep throughput high even when the server is otherwise blocked. But blocking always affects server latency. (For similar reasons, per-connection latency is often more variable than per-connection throughput.)
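You can see (and adjust) this buffer with setsockopt; a small sketch, where show_sndbuf is an illustrative name and sock is assumed to be a connected TCP socket:

    #include <stdio.h>
    #include <sys/socket.h>

    /* Inspect and adjust the per-socket send buffer (SO_SNDBUF): the kernel
       queues response bytes here until TCP is ready to put them on the wire. */
    void show_sndbuf(int sock) {
        int size = 0;
        socklen_t len = sizeof(size);
        if (getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &size, &len) == 0)
            printf("send buffer: %d bytes\n", size);

        int wanted = 32 * 1024;         /* the 32 kB mentioned above */
        setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &wanted, sizeof(wanted));
    }

Note that the kernel is free to clamp or round the requested size rather than honoring it exactly.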

“Our experiments with the Flash web server indicate that adding a 1 ms delay to one out of every 1000 requests can degrade latency by a factor of 8 while showing little impact on throughput.” [p3, 1]

The Flash paper shows that Flash achieves good throughput, but its latency measurements aren’t very deep. The DeBox paper concentrates not on throughput, but on a “SPECweb99” score. This is the number of simultaneous connections a server can sustain while still providing a minimum quality of service, which is basically a latency measurement. Flash’s initial SPECweb99 score was disappointing. Thus, DeBox.

DeBox

DeBox provides first-class, in-band, per-call performance analysis of system calls. This means:

Performance information collected includes:

A DeBoxControl system call turns profiling on and off and controls how much information is collected, after which users simply make system calls as usual. Each system call fills out a DeBoxInfo structure in addition to returning its normal results. Overhead is relatively low, ranging from 1% to 3% on macrobenchmarks.

Questions

Debugging narrative (Case study)

But we don’t read the paper for DeBox itself. We read it for the detailed and fascinating case study of how DeBox was used to improve Flash performance. Over the course of the paper Flash’s SPECweb99 score goes from 200 to 820, and its median latency drops by a factor of 47 (!!). Internalize this debugging process and you’ll be well equipped to evaluate real system performance problems, whether or not you do so with first-class, in-band, per-call performance information.

Flash was already fast, so it’s not surprising that there was no one key optimization. Profiling a slow server often reveals obviously slow routines, but profiling a fast server may not reveal much. The key to improving performance of a fast server is to improve everything that can ever be slow. And, in Flash’s architecture, the key to improving latency is ensuring that the main server never blocks. DeBox helps this process by exposing system call performance and by letting the application assert-fail on blocking.

All performance improvements involve (1) making expensive operations cheaper, or (2) making expensive operations unnecessary. Flash does both. Here’s a summary.

The following optimizations improve performance by making expensive operations cheaper:

The following optimizations eliminate expensive operations:

The following optimizations change the Flash architecture:

(File descriptor passing is a great technique to remember. It uses Unix-domain sockets [man unix], the sendmsg and recvmsg system calls, and the SCM_RIGHTS ancillary message type.)
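Here is a sketch of the sending side, with send_fd as an illustrative name; the receiver mirrors it with recvmsg and reads the new descriptor out of CMSG_DATA:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Pass an open file descriptor to another process over a Unix-domain
       socket, using sendmsg() with an SCM_RIGHTS ancillary message. */
    int send_fd(int unix_sock, int fd_to_send) {
        char dummy = 'F';               /* must carry at least one data byte */
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };

        char cbuf[CMSG_SPACE(sizeof(int))];
        memset(cbuf, 0, sizeof(cbuf));

        struct msghdr msg;
        memset(&msg, 0, sizeof(msg));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = cbuf;
        msg.msg_controllen = sizeof(cbuf);

        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type  = SCM_RIGHTS;  /* "the payload is file descriptors" */
        cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

        return sendmsg(unix_sock, &msg, 0) < 0 ? -1 : 0;
    }

The receiving process gets a fresh descriptor number that refers to the same open file.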

The new Flash is both simpler and faster: a lovely combination. Perhaps they could have implemented “new Flash” originally, but without this specific performance information, it would be hard to justify the work.

Questions

Evaluation

The evaluation demonstrates how Flash improved as a result of the DeBox experience. The improvements are impressive, particularly for latency.

Extras

“Unfortunately, we were unable to get Linux to run properly on our existing hardware, despite several attempts to resolve the issue on the Linux kernel list.” [p11, 1] Funny! I think only rarely now will you find hardware where FreeBSD works but Linux doesn’t.


  1. “Making the ‘Box’ Transparent: System Call Performance as a First-Class Result”, Yaoping Ruan and Vivek Pai, in Proc. USENIX ATC 2004 (via USENIX)

  2. “Flash: An Efficient and Portable Web Server”, Vivek S. Pai, Peter Druschel, and Willy Zwaenepoel, in Proc. USENIX ATC 1999 (via USENIX)

  3. “SEDA: An Architecture for Well-Conditioned, Scalable Internet Services”, Matt Welsh, David Culler, and Eric Brewer, in Proc. 18th ACM SOSP (via ACM Digital Library)