Exercise 3

In this problem, your goal is to construct a robust algorithm for Web client notification in the face of failures.

Background

That’s pretty generic, so let’s make it concrete. You are building a web site that supports, say, a chat feature. In the Web architecture, it’s pretty easy to send live chat messages. When the user clicks a send button, Javascript packages up the chat message into a request and sends it to a server; the server records the message in a database. This fits naturally into the Web because the client’s in charge: the client makes an active request to the server. But receiving chat messages fits much less naturally. In the Web architecture, servers can’t contact clients to inform them of new chats! Every message exchange is initiated by a client.

Several Web technologies and techniques try to address this limitation; see push technology. We’ll focus on two techniques, polling and long polling.

In the polling approach, the client makes frequent requests to check for updates. For instance, our client might check every 5 seconds for newly received chats. Polling fits cleanly into the Web architecture and is easy to write. Unfortunately, it uses a lot of resources, and it doesn’t provide instantaneous notification—an expected 2.5-second delay per chat is a long time!

Long polling is a version of polling that addresses some of its problems. In long polling, the client again requests the server to check for updates, but the server does not respond until an update actually occurs. That means the user gets near-instantaneous notification, and the server only gets one request per notification, rather than one request every 5 seconds.

Abstractly, long polling seems pretty close to perfect for notification! The problems with the approach only show up once you dig into its implementation.

Resources. On the server, a long-polling request corresponds to an open HTTP connection, which occupies resources. First, there’s the file descriptor for the connection. Operating systems enforce multiple limits on file descriptors: there’s a global limit on the number of file descriptors (on a current EC2 Linux machine, this is large—378,358), and a local limit on the number of file descriptors per process (on a current EC2 machine, this is small—1024—though it can be raised). Thus, a server might be able to handle many thousands of short-polling requests per second (because each request’s file descriptor is quickly closed, the limits are never reached), but 1024 concurrent long-polling requests could shut it down completely. Depending on the server technology, each long-polling request might also reserve a server process and a lot of server memory.

Timeouts. Networked programs must handle runaway requests: clients can silently die, server processes can enter infinite loops and never return, the network can crash, and so forth. It’s important for robustness that all runaway requests be shut down eventually; otherwise, a single programming error in a server process could take up resources forever, and a silent failure could be mistaken for success. Timeouts are the typical solution—if a request has run too long (say, for 5 minutes), the client and/or server will treat it as failed and shut it down. There are usually many timeouts simultaneously in play: the user’s browser can time out; the user’s networking equipment can time out; the server can time out on the client (Apache, NGINX) or on the process it runs to generate a response (Apache, NGINX); and the process can time itself out (for instance, the PHP interpreter has a time limit after which it suicides). Timeouts work transparently when successful requests are instantaneous—but long polling violates this assumption. It is easy for one or more system timeouts to cancel a long-polling request mistakenly, so we should set long-polling request timeouts to large values. But large timeouts don’t catch problems very well! A failed long-polling request can be mistaken for a successful one that just hasn’t returned yet.

Because of these issues, a robust, resilient client notification system must rely on a combination of polling and long-polling techniques, and ideally use other techniques too.

The problem

You are writing a Javascript library for client notification. The server maintains a value (a string). The client has one or more tabs open that point to the server. Your library should call a function notify(value) on each tab every time the value changes.

Your library is configured with two server URLs, the polling URL and the long-polling URL. The polling server returns the current value. The long-polling server takes a value x as an argument; it returns the current value as soon as that value does not equal x. (This might be right away.)

Your goal is to design an algorithm with the following properties.

Here are the APIs you may use.

Polling requests

The API for a polling request is:

make_poll(complete, timeout);

where timeout is a timeout value in milliseconds, and complete is a callback function. Semantics:

Failure modes. Polling requests do not truly fail unless the server has crashed. Crashed servers will eventually restart (between 10 seconds and 3 minutes after they crash). If the server crashes, the polling request will fail within 250 milliseconds. Otherwise, the polling request will succeed. Typically it will succeed within 100 milliseconds, but sometimes a message gets lost and it will take 30 seconds to succeed (and the returned value might be 30 seconds old).

Long polling requests

The API for a long polling request is:

make_long_poll(complete, timeout, value, [args]);

where timeout is a timeout value in milliseconds, complete is a callback function, value is the current value, and args are extra arguments you can define (see below). Semantics:

Failure modes. Long polling requests unfortunately have several failure modes. If the server crashes, the long polling request may fail immediately. However, if the server is misconfigured or overloaded, it may simply not respond. This will cause the request to fail after 5 minutes (900000 milliseconds). Furthermore, in some cases (e.g. certain browsers that enforce tight message timeouts), a long poll request will “fail” after 2–5 minutes even though the server is fine. The long-polling server is more likely to be down than the normal server.

Additional arguments. You may change the behavior of the long-polling server if you want. Describe additional args you would provide, and how the long-polling server should treat those arguments. (You cannot eliminate failure modes, however.)

Local storage

You may also use local browser storage to improve your system’s behavior. Local storage lets the browser tabs share information in the form of a single JSON value (briefly: null, a boolean, number, or string, an array, or an object). API:

storage_get()
storage_set(x)
storage_register(f)

storage_get returns the stored value (or null if there is no stored value); storage_set sets the stored value. storage_register(f) registers an observer function f that is called when the value changes. It’s called with two arguments, the old value and the new value.

Failure modes. The local storage methods never fail.

Opening and closing tabs

The user occasionally opens & closes tabs, and your code is called on that tab when either event occurs:

open_tab()    [your function]
close_tab()   [your function]

Closing a tab automatically closes all connections opened by that tab. This happens without causing connection errors (so, for example, if you call make_long_poll(timeout, complete, value) and the tab closes, complete will not be called).

Failure modes. open_tab never fails. close_tab, however, can fail: it can happen that a tab is closed but close_tab is not called.

Additional APIs

You may find these APIs useful:

now()
x = set_timeout(f, delay)
clear_timeout(x)

now returns the current time in milliseconds. set_timeout calls f() after delay milliseconds have elapsed; it returns a unique identifier for the timeout (a “truthy” value). clear_timeout(x) resets the timeout whose UID was x; if x doesn’t correspond to an active timeout, it does nothing.

Turnin

The minimal deliverable for this assignment is a textual description of the algorithm you propose to use in your notification library. The description should be specific—for instance, pseudocode—and you should also explain why the algorithm achieves the goal.

If you choose, you may write the algorithm in Javascript code. Your Javascript code will contain, at minimum, a definition of the open_tab function; this function will likely call the other APIs, set up callbacks, etc.

You should also go above and beyond the minimal deliverable. Here are some ways to do so:

The assignment is due Monday 3/2 at midnight.

Background reading