This is not the current version of the class.

Lecture 15

NCQ

Journaling

Notes

(Notes by Abby Lyons)

(These notes start 40 minutes into the lecture, sorry.)

Flash memory

Flash memory is weird. You can't just overwrite one block; you must overwrite a whole group of 16 blocks (64 KB) at once. Blocks also degrade over time as they are rewritten.

Modern flash drives have a remapping layer. It marks sectors as bad as they fail and remaps them elsewhere, so only the drive knows where blocks are physically located; the OS doesn't.
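
To make the remapping idea concrete, here is a minimal sketch of a remapping table. It is not how any real drive firmware works; the class name `RemappingLayer`, the pool of spare blocks, and the method names are all assumptions made for illustration.

```python
class RemappingLayer:
    """Toy remapping layer: logical block numbers -> physical block numbers."""

    def __init__(self, n_logical, n_physical):
        assert n_physical >= n_logical
        # Initially logical block i lives in physical block i.
        self.map = {lbn: lbn for lbn in range(n_logical)}
        # Extra physical blocks held in reserve (an assumption of this sketch).
        self.spare = list(range(n_logical, n_physical))
        self.bad = set()

    def mark_bad(self, lbn):
        """Retire the physical block behind `lbn` and remap it to a spare."""
        if not self.spare:
            raise RuntimeError("no spare blocks left")
        self.bad.add(self.map[lbn])
        self.map[lbn] = self.spare.pop(0)

    def physical(self, lbn):
        """Where the drive really stores logical block `lbn`."""
        return self.map[lbn]


layer = RemappingLayer(n_logical=4, n_physical=6)
layer.mark_bad(2)
print(layer.physical(2))  # 4
```

The OS keeps asking for logical block 2 and never learns that its contents moved to physical block 4.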

Failure models

The goal is to have a journal format that tolerates disk failure due to power failure.

Disk failure model:

        DOOM

-------------------> Time

w w w w w w 

The last two writes are in flight at the moment of failure. What happened to them?
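
One way to think about the question is to enumerate the possibilities. The sketch below assumes an aggressive model in which each in-flight write independently leaves the old contents, the new contents, or garbage; that three-way split, and the names `crash_outcomes` and `GARBAGE`, are assumptions for illustration rather than a definition from the lecture.

```python
from itertools import product

GARBAGE = "<garbage>"  # stand-in for a block whose contents were scrambled


def crash_outcomes(disk, completed, in_flight):
    """Return every disk state that could be found after the crash.

    disk:      dict mapping block number -> contents before any of the writes
    completed: list of (block, data) writes that finished before the failure
    in_flight: list of (block, data) writes still in progress at the failure
               (assumed here to target distinct blocks)
    """
    base = dict(disk)
    for blk, data in completed:
        base[blk] = data  # completed writes are durable

    # Each in-flight write independently ends in one of three ways.
    choices = [(base.get(blk), data, GARBAGE) for blk, data in in_flight]
    outcomes = []
    for pick in product(*choices):
        state = dict(base)
        for (blk, _), value in zip(in_flight, pick):
            state[blk] = value
        outcomes.append(state)
    return outcomes


disk = {n: "old" for n in range(6)}
completed = [(n, "new") for n in range(4)]   # the first four writes
in_flight = [(4, "new"), (5, "new")]         # the last two writes
print(len(crash_outcomes(disk, completed, in_flight)))  # 9
```

With two in-flight writes there are already nine states the recovery code has to cope with.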

Journal design

Our solution to the aggressive failure model is physical redo journaling.

One common alternative to physical journaling is logical journaling, which doesn't store entire blocks; rather, it has records that track changes like "change byte 0xabc from 3 to 5". This makes the journal smaller, but it can't protect against the most aggressive failures, which turn entire blocks to garbage.
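
A small sketch of the difference, assuming 4096-byte blocks and a disk modeled as a dict from block number to bytes. The record classes `PhysicalRecord` and `LogicalRecord` are hypothetical names, not a real journal format.

```python
from dataclasses import dataclass

BLOCK_SIZE = 4096  # assumed block size


@dataclass
class PhysicalRecord:
    block: int
    data: bytes  # the full new contents of the block

    def apply(self, disk):
        # Redo: overwrite the whole block, whatever is there now.
        disk[self.block] = self.data


@dataclass
class LogicalRecord:
    block: int
    offset: int
    old: int
    new: int

    def apply(self, disk):
        # Redo: patch one byte; every other byte must already be intact.
        buf = bytearray(disk[self.block])
        buf[self.offset] = self.new
        disk[self.block] = bytes(buf)


# The intended block: all zeros except byte 0xabc, which changes from 3 to 5.
intended = bytearray(BLOCK_SIZE)
intended[0xabc] = 5
intended = bytes(intended)

# Suppose the crash turned block 0 into garbage.
crashed = {0: b"\xff" * BLOCK_SIZE}

d1 = dict(crashed)
PhysicalRecord(0, intended).apply(d1)
d2 = dict(crashed)
LogicalRecord(0, 0xabc, 3, 5).apply(d2)

print(d1[0] == intended)  # True: the physical record restores the block
print(d2[0] == intended)  # False: the rest of the block is still garbage
```

The logical record is a few bytes long while the physical record carries a whole block, which is the size trade-off described above.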

To use the journal, one must:

  1. write the blocks to disk in a special journal location
  2. write a commit record to the journal
  3. when the journal writes complete, write the blocks to their final locations
  4. when those writes complete, write a completion record to the journal. That part of the journal can now be overwritten.

What about parallelism? As it turns out, steps 1 and 2 can be done in parallel. Then every part of step 3 can be done in parallel. Only after everything else is complete can step 4 take place.

But we must wait for steps 1 and 2 to complete before starting step 3, and wait for step 3 to complete before starting step 4. These mandatory waits are called barriers. A barrier is a kind of synchronization primitive.
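
The four steps and the two barriers can be sketched against a toy disk. This is a simulation under stated assumptions, not a real I/O interface: `SimDisk`, its `barrier()` method, and the tuple-shaped journal locations are all invented for illustration.

```python
class SimDisk:
    """Toy disk: writes stay in flight until barrier() makes them durable."""

    def __init__(self):
        self.durable = {}    # location -> contents that survived for sure
        self.in_flight = []  # issued writes not yet known to be complete
        self.log = []        # trace of operations, in issue order

    def write(self, loc, data):
        self.in_flight.append((loc, data))
        self.log.append(("write", loc))

    def barrier(self):
        # Wait until every previously issued write has completed.
        for loc, data in self.in_flight:
            self.durable[loc] = data
        self.in_flight.clear()
        self.log.append(("barrier",))


def journaled_update(disk, txn_id, updates):
    """updates: dict mapping final block location -> new block contents."""
    # Steps 1 and 2: journal copies and the commit record, issued together.
    for i, (loc, data) in enumerate(updates.items()):
        disk.write(("journal", txn_id, i), (loc, data))
    disk.write(("journal", txn_id, "commit"), len(updates))
    disk.barrier()  # journal must be durable before touching final locations

    # Step 3: writes to the final locations, all issued together.
    for loc, data in updates.items():
        disk.write(("data", loc), data)
    disk.barrier()  # final writes must be durable before declaring completion

    # Step 4: completion record; the journal space can then be reused.
    disk.write(("journal", txn_id, "complete"), True)


disk = SimDisk()
journaled_update(disk, 1, {7: b"x", 9: b"y"})
print(disk.log.count(("barrier",)))  # 2
```

Within each phase the order of writes does not matter; only the two barriers impose ordering.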

The journal in memory

Checksums help us figure out whether writes happened successfully inside the journal: if the checksum stored in the metablock (metadata block) matches the checksum computed from the corresponding data block, the write happened without any corruption. The metablock also stores a checksum of itself, in case it gets corrupted.
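
A minimal sketch of that check, assuming CRC-32 as the checksum and a made-up metablock layout (a 4-byte self-checksum followed by one 4-byte checksum per data block). The actual journal format may use a different checksum and layout.

```python
import struct
import zlib


def make_metablock(data_blocks):
    """Build a metablock: [crc32 of body][crc32 of each data block]."""
    body = b"".join(struct.pack("<I", zlib.crc32(b)) for b in data_blocks)
    return struct.pack("<I", zlib.crc32(body)) + body


def verify(metablock, data_blocks):
    """Return None if the metablock is corrupt, else one bool per data block."""
    (self_sum,) = struct.unpack_from("<I", metablock, 0)
    body = metablock[4:]
    if zlib.crc32(body) != self_sum:
        return None  # the metablock itself did not survive
    if len(body) != 4 * len(data_blocks):
        return None  # metablock does not describe this set of blocks
    sums = struct.unpack("<%dI" % len(data_blocks), body)
    return [zlib.crc32(b) == s for b, s in zip(data_blocks, sums)]


blocks = [b"a" * 16, b"b" * 16]
meta = make_metablock(blocks)
print(verify(meta, blocks))  # [True, True]

# A data block that was damaged in flight fails its check.
damaged = [blocks[0], b"b" * 15 + b"c"]
print(verify(meta, damaged))  # [True, False]
```

Because the metablock checks itself, a torn metablock is rejected outright instead of being trusted to vouch for its data blocks.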

Each journal block has a sequence number, which increases by one from each block to the next. This is useful because:

Each metablock has a commit boundary and a complete boundary. This may seem redundant, but the boundaries are used for cumulative acknowledgement, which is a way of saying "I have heard everything up to point X." We use them to figure out which transactions were actually completed.
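
A sketch of how recovery might use the two boundaries. It assumes that both boundaries are sequence numbers and that "everything up to point X" means every transaction with a sequence number below X; the function `classify` and the three labels are hypothetical.

```python
def classify(txn_seqs, commit_boundary, complete_boundary):
    """Label each transaction using the boundaries from the newest valid metablock.

    Assumed convention: a boundary X acknowledges every sequence number < X.
    """
    result = {}
    for seq in txn_seqs:
        if seq < complete_boundary:
            result[seq] = "complete"  # already at its final location
        elif seq < commit_boundary:
            result[seq] = "replay"    # committed but not complete: redo it
        else:
            result[seq] = "discard"   # never committed: ignore it
    return result


print(classify(range(1, 7), commit_boundary=5, complete_boundary=3))
# {1: 'complete', 2: 'complete', 3: 'replay', 4: 'replay', 5: 'discard', 6: 'discard'}
```

One pair of numbers covers any number of transactions, so no per-transaction acknowledgement has to be stored.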