DISTRIBUTED SYSTEMS

by Navin Jain, Cuichang Zhao, SongXin Wu, Hyeon (harold) Kim

Coupling

Distributed systems lie along a spectrum from Closely Coupled to Loosely Coupled. Failure characteristics determine where a system lies on the spectrum. In a closely coupled system, the system fails if any of its parts fail. In a loosely coupled system, the system can continue working (although maybe with less functionality) even after one or more of its parts fail.

As an example to motivate distributed systems, let's consider an imaginary Giant Climate Simulation.

No machine in the universe has 100 PB of memory, although some sites have at least 1 PB of disk storage (the Internet Archive's Wayback Machine, for example). Since no single machine's memory can meet the simulation's requirements, we would need to use the disk to make this simulation work.

So now what are the problems with disk usage?

Swapping is initiated by page faults. Let's take a look at how page faults would be handled in the above system:
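
Roughly, the disk-based handler does the following (a sketch; the helper names are illustrative, not from the notes):

void page_fault(vaddr_t faulting_addr)
{
    page_t *pp = pick_victim_page();        /* choose a physical page to evict     */
    if (pp->dirty)
        write_to_disk(pp);                  /* slow: a disk write                  */
    read_from_disk(pp, faulting_addr);      /* slow: a disk read (seek + transfer) */
    update_page_table(faulting_addr, pp);   /* map the virtual page to the frame   */
    /* return; the faulting instruction restarts and now hits in memory */
}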

How can we make this faster? Well, the slow part here is going out to the disk. What if we were able to fetch the data out of a nearby computer's memory? While we could never get close to 100 PB, we might be able to avoid some expensive seeks by making use of other machines' caches. We can try that with...

Distributed Shared Memory

The intuition behind distributed shared memory is to replace the disk-based page fault handler, above, with a version that uses other computers' memory to store pages. If the computers are connected by a fast network, this might be faster than waiting for an expensive seek!
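
A minimal sketch of that idea (the directory and message helpers below are illustrative assumptions, not a particular DSM implementation):

void dsm_page_fault(vaddr_t faulting_addr)
{
    vpage_t vp    = vpage_of(faulting_addr);
    node_id owner = directory_lookup(vp);    /* which node holds this page?       */
    page_t *pp    = pick_victim_page();
    send_page_request(owner, vp);
    receive_page_into(pp);                   /* a network round trip -- hopefully */
                                             /* faster than a disk seek           */
    update_page_table(faulting_addr, pp);
    directory_update(vp, my_node_id);        /* everyone must learn the new owner */
}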

            = *NEED: directory mapping (virtual pages -> network nodes)

                        -> A faulting node must be able to locate the page!

            = *Problems: 1. Every computer needs the whole directory (additional messages?)

                                    2. The directory must be updated all the time! This can cause a lot of overhead.

                        -> CLOSELY-COUPLED (which is why DSM is rare in practice)

* Let us look for a better abstraction!

Remote Procedure Call (RPC)

DSM inserted distributed messaging at too low a level (memory pages), requiring too much coordination. Instead, let's insert messaging at a higher level.

            = How about functions?

            = *RPC (= Remote Procedure Call)

                        -> Function that executes on another machine.

                       

      Computer A                 
|main() {             |                 
|    x = f();         |                 
|    x = rf();        |                 
|}                    |                 
|                     |                 
|int f() {            |                 
|    return 0;        |
|}                    |
|                     |
|int rf() { /*STUB*/  |                             Computer B
|    create message;  |                 |int server_impl() {            |
|    send message;    | ==============> |    parse message;             |
|    wait for reply;  | <==+            |    call implementation of rf; |
|    parse reply;     |    |            |    create reply message;      |
|    return result;   |    +=========== |    send reply;                |
|}                    |                 |}                              |

                        -> Less closely-coupled

                        -> How must RPC work?

                                    Global variables (not OK) - there is no way to make them truly global across machines

                                    Pointer arguments (not OK) - the machines do not share memory (see the sketch after this list)

                                    Integer arguments (OK)

                                    Return values (OK)
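
To see why pointer arguments fail, consider a hypothetical remote strlen; the opcode RSTRLEN and the stub below are illustrative, written in the same marshalling style as the example further down:

int rstrlen(char *s) {                       /* hypothetical remote strlen        */
    char buf[1024];
    *((int *)(buf)) = RSTRLEN;               /* hypothetical opcode               */
    *((char **)(buf + 4)) = s;               /* WRONG: this ships Computer A's    */
                                             /* address, not the string's bytes;  */
                                             /* Computer B cannot dereference it  */
    write(serverfd, buf, 4 + sizeof(char *));
    /* A correct stub would have to copy the string's contents into the message. */
    read(serverfd, buf, 8);
    return *((int *)(buf + 4));
}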

                        -> ex.>

int rf (int arg) {
    char buf[1024];            /* "Marshalling": putting */
    *((int *)(buf)) = RF;      /* data into the message  */
    *((int *)(buf + 4)) = arg;
    write(serverfd, buf, 8);
    // wait for reply...
    read(serverfd, buf, 8);
    // "Unmarshalling": parse the reply,
    // find the return value, and return it
    return *((int *)(buf + 4));
}

Stubs, like rf, let the program call RPCs like normal functions. The stub doesn't do much work on its own; it just sends a message to the server, then waits for and parses its reply. (A good stub might also start the server connection when necessary, restart a hung connection, handle timeouts, and so forth.) A separate stub exists for each separate remote procedure. When the client invokes a remote procedure, the RPC system calls the appropriate stub, passing it the parameters provided to the remote procedure. This stub locates the port on the server and marshals the parameters.

A stub is a proxy for the remote object.
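
On the server side (Computer B in the diagram), the matching skeleton might look roughly like this; clientfd, the dispatch constant RF, and rf_impl are assumptions used for illustration:

void server_impl(int clientfd) {
    char buf[1024];
    read(clientfd, buf, 8);                  /* receive the request                */
    int op  = *((int *)(buf));               /* which remote procedure was asked?  */
    int arg = *((int *)(buf + 4));
    if (op == RF) {
        int result = rf_impl(arg);           /* call the real implementation of rf */
        *((int *)(buf)) = RF;                /* create the reply message           */
        *((int *)(buf + 4)) = result;
        write(clientfd, buf, 8);             /* send the reply                     */
    }
}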

Stubs follow a conventional pattern, so people have designed Stub Generators, which take simple function prototype-level interface descriptions and generate the corresponding stubs.

However, there is a problem with the above implementation: if the server does not reply, the client waits forever, which keeps the system closely coupled. One fix is to make the system more loosely coupled by adding a timer, so that the client gives up after a timeout instead of waiting indefinitely. There are other ways to attack the problem as well. Functions are just one abstraction; we can build distributed systems on other abstractions, such as asynchronous messages, though all of these look less like regular function calls.
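
A minimal sketch of the timeout fix, reusing serverfd and RF from the stub above (the five-second timeout and the -1 error return are illustrative choices):

#include <poll.h>

int rf_with_timeout(int arg) {
    char buf[1024];
    *((int *)(buf)) = RF;
    *((int *)(buf + 4)) = arg;
    write(serverfd, buf, 8);

    struct pollfd pfd = { .fd = serverfd, .events = POLLIN };
    if (poll(&pfd, 1, 5000) <= 0)            /* give up after 5 seconds           */
        return -1;                           /* report failure (an assumed error  */
                                             /* convention, not from the notes)   */
    read(serverfd, buf, 8);
    return *((int *)(buf + 4));
}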

Distributed File Systems

So RPC is maybe more closely coupled than we wanted. What other interfaces might we use?

How about files? Files are another good abstraction for a distributed system, because in a file system we already expect failures. Functions are more closely coupled: when we call a function, we always expect it to return or do some useful work. But when we open a file, we already expect that it might fail, so a file interface makes the system more loosely coupled.

Distributed File System:

When we open a file, the file might exist on another computer. In that case, we need to share files among different machines. We have already seen one way to share files, SMB. Now we introduce another way: the Network File System (NFS).

We want network errors to be translated into file system errors, so that when an error occurs, programs can fail (or exit) the way they already do, without any changes to existing programs.
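
As a sketch of what that translation means (the helper rpc_read, the RPC_TIMED_OUT code, and the nfs_fh type are hypothetical, and returning -EIO is just one plausible convention):

#include <errno.h>

int nfs_read(nfs_fh fh, off_t offset, size_t len, char *buf) {
    int rc = rpc_read(fh, offset, len, buf);   /* hypothetical RPC to the server     */
    if (rc == RPC_TIMED_OUT)
        return -EIO;                           /* surface the network failure as an  */
                                               /* ordinary file system I/O error     */
    return rc;                                 /* bytes read on success              */
}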

Example: Virtual File System Layer:

[Figure not reproduced: the application's file system calls go through the VFS layer to the NFS client, which talks to the NFS server; the operations highlighted (dark yellow) in the original figure are the RPCs (remote procedure calls).]

Also, the NFS server is not required to run at the kernel level; it can be a user-level application.

 

How does the server handle a read message from a client? Usually, the server needs the following information (a sketch of such a message follows the list):

Credentials (user ID)

file name

offset + length (for the file)
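
As a sketch, a name-based read request might carry fields like these (names and sizes are illustrative, not a real wire format):

#include <stdint.h>

struct read_request {
    uint32_t uid;              /* credentials: the requesting user ID   */
    char     path[256];        /* file name                             */
    uint64_t offset;           /* where in the file to start reading    */
    uint32_t length;           /* how many bytes to read                */
};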

 

What happens if the file name is changed while we are reading from the file? Example:

Process A is reading file "f"                     Process B wants to rename "f" to "g"

            read(fd)                                                          rename("f", "g")

            ...reading

            ...reading

            read(fd)

 

Process B's rename happens while process A is in the middle of reading. Is this a problem for process A on a local Linux system? The answer is no. In Linux, the file system is built around inodes: once process A opens the file, its file descriptor points to the file's inode, and it keeps pointing to the same inode until process A is done. Even if process B renames the file during the read, process A's pointer to the inode does not change. Furthermore, if process B deletes the file instead of renaming it, the inode is not freed until process A is done (the file stays available, as expected, until the file descriptor is closed).

But suppose NFS sent the file name in each message. If process B renames the file, the name on the server changes, and when the client continues to read using the old name, the read fails because that name no longer exists; the file appears to be gone even though the fd is still open. (Here the client sends a message and gets data back from the server one block at a time; the same issue arises in any distributed file system built this way.) We can instead implement the same logic in NFS as in Unix by replacing the file name in the message with the inode number, since the inode identifies the file no matter what it is named. Thus, the information we need in the message is:

NFS Read Message:

file handle ~= the file's inode number (plus a generation number, so handles are not reused)

offset + length

 

We can obtain a file's handle by looking up its name in a directory:

NFS lookup message:

File handle for directory

Name => File handle
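
A rough sketch of these two messages in C (field names and sizes are illustrative, not the actual NFS wire format):

#include <stdint.h>

struct nfs_fh {                /* file handle ~= inode number (+ generation) */
    uint64_t inode;
    uint32_t generation;
};

struct read_args {             /* NFS read message                      */
    struct nfs_fh fh;          /* which file                            */
    uint64_t offset;           /* where in the file to start            */
    uint32_t length;           /* how many bytes to read                */
};

struct lookup_args {           /* NFS lookup message                    */
    struct nfs_fh dir_fh;      /* file handle of the directory          */
    char name[256];            /* name to resolve; the reply carries    */
                               /* the file handle for this name         */
};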

 

Server Implementation:

Straw man: Server opens files, uses file descriptor numbers as the file handles.

If we use this implementation, could the system be attacked?

Yes! For example, someone could request many files at the same time, or not even request real files but simply send messages to the server as if requesting them; either way, the server may run out of file descriptors. Also, someone could guess another client's file descriptor number and read that client's file.

So how should we implement the read function on the server instead?

read (filehandle, offset, len)
{
    filename = DB(filehandle);        // use a database to map the file handle to the filename
    fd = open(filename, O_RDONLY);    // NB: keep the DB lookup and the open in a critical
                                      //     section so we open the correct file
    lseek(fd, offset, SEEK_SET);
    read(fd, buf, len);               // read the requested block
    close(fd);                        // or maybe cache the fd for a while
}

NFS is in fact a stateless protocol. Early versions of NFS had no open or close messages. The point of open and close is to keep per-file state, but if the file system is stateless, why bother with them?

Problems with having no open and close:

            Usually we do permission checks at open time; without an open, we must defer the check until we actually read the file. Also, as we learned from the Soft Updates paper, most systems flush data to disk about 30 seconds after a file is closed; without a close in NFS, it is hard to decide when to flush. For these reasons, later versions of NFS added open and close while remaining (largely) stateless.

[Figure not reproduced: open-file and inode structures with their operation tables (open, read, write), showing how file operations are routed either to the NFS server or to the local disk.]