The paper [1] appeared at the USENIX Annual Technical Conference, a great home for seriously practical work.
The motivating question: What bottlenecked the performance of the Flash web server [2]? But first, we need to discuss blocking vs. non-blocking system calls.
A system call blocks if it can cause the calling process to become unrunnable. (We then say the process has blocked.)
Some system calls inherently block: the purpose of sleep, for example, is to block the calling process. However, other system calls block by convention. Take open, for example. open cannot complete until the kernel knows whether the file exists, has the right type, and the calling process has sufficient permissions to open it. Finding these things out might require reading data from disk or elsewhere, so the open implementation blocks the calling process until the necessary data is available. But this is not the only possible design! open could return a distinguished value if it couldn’t complete yet—say, an EAGAIN error. Then the kernel could load the relevant data in the background, leaving the calling process responsible for retrying the open later. This application-retry approach is called polling or non-blocking I/O.
In a pure polling system, applications never block. An application always remains schedulable and retries the system call until it succeeds, like an annoying four-year-old: “Are we there yet? Are we there yet? Are we there yet? Are we there yet? Are we there yet?” Pure polling often causes utilization problems: the CPU is kept busy answering uninteresting questions, causing lots of context switches and wasted time.
Notified polling is usually an improvement over pure polling. In notified polling, there are two classes of system calls, action system calls and event notification system calls. An action system call changes the system’s state, for example by opening a file, but might return an EAGAIN value if it can’t complete. An event notification system call blocks until some incomplete action system call from a specified set can make progress. select, poll, kevent, epoll, and aio_suspend are examples of event notification APIs in Unix.
Here’s an example of how blocking, pure-polling, and notified-polling APIs behave:
BLOCKING PURE POLLING NOTIFIED POLLING
P: read ... P: read P: read
| P: ... returns -EAGAIN P: ... returns -EAGAIN
| not ready P: read P: select ...
| so P blocks P: ... returns -EAGAIN |
| P: read | not ready
| P: ... returns -EAGAIN | so P blocks
| P: read |
v P: ... returns -EAGAIN v
P: ... returns 100 P: read P: ... returns 0
P: ... returns 100 P: read
P: ... returns 100
Blocking makes the smallest number of system calls, and therefore has the lowest kernel-crossing overhead. Notified polling is a close second, but can have the highest latency, since the extra read happens after the select returns. Pure polling looks bad here. (If a server always has some work to do, pure polling can match or beat the other approaches, but this is rare in practice.)
Blocking system calls are a great solution for completely serial processes, where every operation must complete before the next operation can begin. But fewer and fewer applications actually work this way. Servers generally support multiple connections at once; if one connection blocks, the server can tend to another. Any application with a GUI involves independent event streams, one to the user and one to the system’s other hardware. (For example, users like the ability to cancel expensive operations, such as slow web page loads.) Furthermore, even for noninteractive applications, speculative performance improvements like prefetching can’t be implemented with blocking system calls in a single thread.
Multiple threads, each with blocking system calls, can implement independent event streams. Unfortunately, threads introduce overhead that can be greater than the cost of notified polling. In notified polling, the kernel can run a single server process until it has no more work to do; in multi-threaded blocking, the kernel must switch among many threads, which is generally a bit slower than returning to the same process that was previously running. The fastest servers in practice use notified polling system calls. (Note that server code need not look event driven to use notified-polling system calls. For instance, von Behren et al.’s Capriccio project compiles multithreaded servers to use notified-polling system calls underneath.)
A nice property of notified polling is that event notification system calls are hints. If a select system call returns, but the relevant file descriptor actually isn’t ready (for instance, because another process sharing the file descriptor already read all available data by the time the selecting process runs), it’s no big deal, since the application must already be prepared for EAGAIN. It is much easier to implement event notification hints than guarantees.
Sidebar: In notified polling, state changes are only performed by action system calls that succeed. An application might or might not retry an action system call that returned EAGAIN: the application can choose. But in a variant of notified polling called asynchronous I/O, action system calls execute in the background. The application can fire off many action system calls in parallel; they will complete in any order. Event notification is used to determine which system calls have completed. In Unix, POSIX Asynchronous I/O offers an asynchronous I/O interface to reading and writing files. Asynchronous I/O offers some advantages over notified polling; for example, fewer system calls are required. However, it also has disadvantages. The calling process cannot nail down an execution order for asynchronous system calls, and must prepare memory for system call return values in advance. Synchronization is just hard! Due to these factors, and some implementation-specific costs of the POSIX AIO API, notified polling is much more commonly used.
Flash is the best of the early Web server design papers [2]. It surveys several alternate server architectures, then introduces a new one, called AMPED (Asymmetric Multi-Process Event-Driven).
The key to Web server performance is minimizing overhead. Modern networks are so fast, and client loads can be so high, that server overheads, such as wasted memory, blocking, and frequent context switching, actually start to matter. Server architectures can be analyzed for overhead this way:
Multi-process (MP): A server consists of multiple processes, each of which processes one incoming connection at a time. A main advantage is simplicity of programming, but overhead is rather high: each process has its own stack, heap, and kernel data structures; the several server processes may have independent, and possibly redundant, memory caches; and so forth.
Multi-threaded (MT): Like MP, but the multiple processes become multiple threads sharing the same address space. One main overhead is that of each thread’s stack. And since the kernel is allowed to schedule threads simultaneously on independent cores, the threads must synchronize with each other, which can be a surprisingly large cost. (For instance, simply linking with the POSIX Threads library (-lpthread) can slow down your program by 20%!)
Single-process event-driven (SPED): A single, single-threaded server process handles all connections using non-blocking system calls and notified polling. This eliminates thread-stack and synchronization overhead, but can complicate the programming model. Unfortunately, not all blocking Unix system calls have good non-blocking equivalents. For example, the non-blocking open (available in POSIX AIO) doesn’t integrate well with the best non-blocking system calls for communicating with network connections. As a result, SPED servers can block; and when they do, it’s a disaster for efficiency, causing all server connections to pause until the process unblocks.
Asymmetric multi-process event-driven (AMPED): This is the Flash contribution. A main server process handles connections SPED-style, but separate helper processes (or threads) are spun off to handle system calls that might block.
Note 1. The Flash paper is called “Flash: An Efficient and Portable Web Server.” Why is “portable” in the title? Well, SPED servers have what amounts to a bad-API problem: most system call APIs lack important non-blocking system calls. One could just fix the API (and I wish someone would for Linux; Windows’s API is better). But Flash’s AMPED architecture shows that the benefits of SPED can be obtained even on current APIs.
Note 2. Web servers are important, but in some ways a special case for server performance in that they rarely have much CPU work of their own to do. Services that require significant computation are driven inevitably toward multithreaded/multiprocess designs on modern multicore hardware. But to get good performance and good I/O concurrency, such services might still use notified polling system calls!
There are two key server performance metrics, throughput and latency. (Throughput can be measured in connections/second and latency in seconds/connection; the DeBox paper shows latency CDFs, an excellent choice, whereas the earlier Flash paper has graphs of connection rate vs. file size.) In very simple systems—single-threaded, blocking systems processing streams of identically-expensive requests—these metrics are related: average throughput equals one over average latency. But in more complex systems, this relationship doesn’t hold. Systems that process many connections at once can exhibit high throughput and high latency.
Both performance metrics matter, but for Web servers (and many others), good latency is harder to achieve than good throughput. The reason is kernel buffering. Many server responses are larger than one packet, so the kernel stores the extra data in a buffer until TCP decides it’s ready to send. (32 kB is a typical socket buffer size; see man setsockopt, SO_SNDBUF.) This kernel buffering can keep throughput high even when the server is otherwise blocked. But blocking always affects server latency. (For similar reasons, per-connection latency is often more variable than per-connection throughput.)
“Our experiments with the Flash web server indicate that adding a 1 ms delay to one out of every 1000 requests can degrade latency by a factor of 8 while showing little impact on throughput.” [p3, 1]
The Flash paper shows that Flash achieves good throughput, but its latency measurements aren’t very deep. The DeBox paper concentrates not on throughput, but on a “SPECweb99” score. This is the number of simultaneous connections a server can sustain while still providing a minimum quality of service, which is basically a latency measurement. Flash’s initial SPECweb99 score was disappointing. Thus, DeBox.
DeBox provides first-class, in-band, per-call performance analysis of system calls: performance information is returned to the application with each system call’s normal results, rather than collected out of band by a separate profiler. A DeBoxControl system call turns profiling on and off and controls how much information is collected, after which users simply make system calls as usual. Each system call then fills out a DeBoxInfo structure in addition to returning its normal results; the information collected includes where time was spent in the call and where and why the call slept (the PerSleepInfo feature). Overhead is relatively low, ranging from 1–3% on macrobenchmarks.
But we don’t read the paper for DeBox itself. We read it for the detailed and fascinating case study of how DeBox was used to improve Flash performance. Over the course of the paper Flash’s SPECweb99 score goes from 200 to 820, and its median latency drops by a factor of 47 (!!). Internalize this debugging process and you’ll be well equipped to evaluate real system performance problems, whether or not you do so with first-class, in-band, per-call performance information.
Flash was already fast, so it’s not surprising that there was no one key optimization. Profiling a slow server often reveals obviously slow routines, but profiling a fast server may not reveal much. The key to improving performance of a fast server is to improve everything that can ever be slow. And, in Flash’s architecture, the key to improving latency is ensuring that the main server never blocks. DeBox helps this process by exposing system call performance and by letting the application assert-fail on blocking.
All performance improvements involve (1) making expensive operations cheaper, or (2) making expensive operations unnecessary. Flash does both. Here’s a summary.
mincore degradation (§5.1): Server is CPU-bound, and mincore is the slowest system call. Solution: Make mincore cheaper. Replace in-kernel linked lists with splay trees, reducing mincore expense from O(n) to O(log n) (where n is the number of mapped regions). Also switch write system calls to sendfile, which lets the kernel copy data directly out of the buffer cache.

select degradation (§5.1): Server is CPU-bound. The event notification API (select) costs O(n) for n file descriptors (see also the kqueue paper). Solution: Switch to an event notification API (kqueue/kevent) that usually costs O(1) time.

open blocking: DeBox shows that open might block—not because the necessary information had been flushed from the cache, but because some other open had locked the directory (for example). Solution: perform opens in helper processes, which transfer the resulting file descriptors back to the main process.

Cache inconsistency: Assert-failing on blocking opens (possible because DeBox is in-line) demonstrates a bug—two caches aren’t in sync. The solution: file descriptor transfer, again.

fork degradation (§5.4): fork is the most expensive system call, and it gets more expensive over time.

mmap degradation (§5.4): mmap is an expensive system call, and it gets more expensive over time. Diagnosis: mmap is getting more expensive because the main server process’s memory map is filling up. Solution: Eliminate mmap. Flash used mmap just so it could use mincore. Forget them both; instead, infer memory residency using the helpers (despite the small blocking risk this introduces).

CGI read (§5.5): In Flash’s CGI interface, the main process’s read system call is a heavy hitter (20% of kernel time). DeBox helps separate “read() calls by call sites”—well, strace could separate the time! Solution: eliminate the read by passing file descriptors instead.

sendfile blocking (§5.6): sendfile can block. Diagnosis: sendfile can block because it has a limited buffer pool (discovered through DeBox’s PerSleepInfo label feature). Limited pools are often silly. Solution: Unlimit its buffer pool.

sendfile blocking (§5.6): sendfile can block. Diagnosis: sendfile must block if it needs to read data from disk. Since the server still shouldn’t block, implement a non-blocking sendfile.

sendfile per-packet overhead (§5.6): sendfile is not much faster than write, even though it copies less. Diagnosis: sendfile does relatively worse for small files. The reason: sendfile simply sends more packets, since it flushes the initial HTTP headers (specified by write) before filling up packets with file data. Solution: Make sendfile smart enough to combine file data with header data.

The following optimizations improve performance by making expensive operations cheaper:
mincore degradation (reduce cost of mincore from O(n) to O(log n))
select degradation (reduce cost of event notification from O(n) to O(1) via kevent)
fork degradation (reduce cost of fork from O(n) to O(1))
sendfile per-packet overhead (reduce per-packet cost by 1)

The following optimizations eliminate expensive operations:
Copying (eliminate write in favor of sendfile)
mmap degradation (eliminate mmap by relying on memory residency heuristics)
CGI read (eliminate read by passing fd)
sendfile blocking (eliminate blocking by unlimiting a sendfile buffer pool)
sendfile blocking (eliminate blocking via a non-blocking API)

The following optimizations change the Flash architecture:
Copying (switch from write to sendfile)
CGI read (eliminate redundancy and copying by fd passing)
mmap degradation (eliminate mmap in favor of heuristics)
sendfile blocking (eliminate heuristics in favor of a non-blocking API)

(File descriptor passing is a great technique to remember. It uses Unix-domain sockets [man unix], the sendmsg and recvmsg system calls, and the SCM_RIGHTS ancillary message type.)
The new Flash is both simpler and faster: a lovely combination. Perhaps they could have implemented “new Flash” originally, but without this specific performance information, it would be hard to justify the work.
The evaluation demonstrates how Flash improved as a result of the DeBox experience. The improvements are impressive, particularly for latency.
“Unfortunately, we were unable to get Linux to run properly on our existing hardware, despite several attempts to resolve the issue on the Linux kernel list.” [p11, 1] Funny! I think only rarely now will you find hardware where FreeBSD works but Linux doesn’t.
[1] “Making the ‘Box’ Transparent: System Call Performance as a First-Class Result”, Yaoping Ruan and Vivek Pai, in Proc. USENIX ATC 2004.
[2] “Flash: An Efficient and Portable Web Server”, Vivek S. Pai, Peter Druschel, and Willy Zwaenepoel, in Proc. USENIX ATC 1999.
[3] “SEDA: An Architecture for Well-Conditioned, Scalable Internet Services”, Matt Welsh, David Culler, and Eric Brewer, in Proc. 18th ACM SOSP, 2001.