Let’s talk about the requirements we laid out for Graph in HW1, focusing only on graphs and nodes.

A Graph is a collection of Nodes with the following functions (more or less). We include their complexity requirements.

```
template <typename V> class Graph {
  /** Return number of nodes. O(1) time. */
  size_type size() const;
  /** Return the node with index i. O(1) time.
      @pre 0 <= i < size() */
  Node node(size_type i);
  /** Add a node. O(1) amortized time.
      @param[in] position the node's position
      @param[in] value the node's value
      @return result (the new node)
      @post new size() == old size() + 1
      @post result.index() == old size() */
  Node add_node(Point position, node_value_type value);
  /** Remove a node. Time polynomial in size().
      Invalidates @a n, but not any other node.
      Decrements the indexes of nodes above @a n. */
  void remove_node(Node n);
  class Node {
    /** Return the node's position. O(1) time. */
    Point position();
    /** Return the node's value. O(1) time. */
    node_value_type& value();
    /** Return the node's index. O(1) time. */
    size_type index();
  };
};
```

How should we implement this specification? A natural way is to start from the complexity requirements and use data structures with that complexity. For instance, take node(i). This returns a node in O(1) time, so it seems to imply a vector. (Vectors and hash tables are the basic data structures with O(1) access time.)

```
class Graph { ...
 private:
  struct nodeinfo {
    Point position_;
    node_value_type value_;
  };
  std::vector<nodeinfo> nodes_; // index is node index
};
```

The Node object is a proxy for the position and value information stored in the graph under the node’s index.

```
class Graph { ...
  class Node { ...
    Point position() {
      return graph_->nodes_[index_].position_;
    }
    size_type index() {
      return index_;
    }
   private:
    graph_type *graph_;
    size_type index_;
    Node(graph_type *graph, size_type index)
      : graph_(graph), index_(index) {
    }
  };
  Node node(size_type i) {
    assert(0 <= i && i < nodes_.size());
    return Node(this, i);
  }
};
```

But this will cause a problem with removing nodes. Removing the node with
index *i* must shift all nodes with greater indexes, to keep the indexes
contiguous. Consider:

```
Graph<int> g;
auto n0 = g.add_node(Point(0,0,0), 0); // n0.index_ == 0
auto n1 = g.add_node(Point(1,0,0), 1); // n1.index_ == 1
auto n2 = g.add_node(Point(2,0,0), 2); // n2.index_ == 2
// g.nodes_ == [<(0,0,0),0>, <(1,0,0),1>, <(2,0,0),2>]
g.remove_node(n0); // Shifts values around in g.nodes_, but
// does not update n1.index_ and n2.index_!
// g.nodes_ == [<(1,0,0),1>, <(2,0,0),2>]
assert(n1.position() == Point(1,0,0)); // WILL FAIL!
// n1.index_ == 1, but now g.nodes_[1] points to the
// node with position (2,0,0)!
auto nx = g.node(0); // Expect the node with position (1,0,0)
assert(nx == n1); // WILL FAIL! They have different index_
```

We need to associate a more permanent identifier with each node—something that doesn’t change as nodes are removed. We called this second node index a “unique identifier” or “uid.” Here’s how we did it:

```
class Graph { ...
 private:
  struct nodeinfo {
    Point position_;
    node_value_type value_;
    size_type index_;
  };
  std::vector<nodeinfo> nodes_; // index is uid
  std::vector<node_id_type> i2u_; // index is index, value is uid
};
```

The Node object is still a proxy, but by UID, not index. The primary change is
`i2u_`, but nodeinfo changes as well: we need an O(1) map from UID to index to
implement the Node::index() function; struct nodeinfo is a natural place to
store that map.

```
class Graph { ...
  class Node { ...
    Point position() {
      return graph_->nodes_[uid_].position_;
    }
    size_type index() {
      return graph_->nodes_[uid_].index_;
    }
   private:
    graph_type *graph_;
    node_id_type uid_;
    Node(graph_type *graph, node_id_type uid)
      : graph_(graph), uid_(uid) {
    }
  };
  Node node(size_type i) {
    // i is a node index, so it is bounded by i2u_.size(), not nodes_.size()
    assert(0 <= i && i < i2u_.size());
    return Node(this, i2u_[i]);
  }
};
```

With suitable changes to add_node and remove_node to keep `i2u_` up to date,
this works great. A key change is that *remove_node does not remove old nodes
from the nodes_ array*. If it did, then the uid-to-node mapping would change,
invalidating nodes exactly as before! We spend space to get better complexity
in a classic tradeoff.
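To make those "suitable changes" concrete, here is a minimal sketch of how add_node and remove_node might maintain `i2u_` and `index_`. It is one possible arrangement, not necessarily what anyone's HW1 looked like: the `Point` struct is a hypothetical stand-in for the course's class, edges are left out, and remove_node takes O(size()) time to renumber the nodes above the removed one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the course's Point class.
struct Point {
  double x, y, z;
};

template <typename V>
class Graph {
 public:
  using size_type = std::size_t;
  using node_id_type = std::size_t;
  using node_value_type = V;

  class Node {
   public:
    Point position() const { return graph_->nodes_[uid_].position_; }
    node_value_type& value() { return graph_->nodes_[uid_].value_; }
    size_type index() const { return graph_->nodes_[uid_].index_; }
    bool operator==(const Node& o) const {
      return graph_ == o.graph_ && uid_ == o.uid_;
    }
   private:
    friend class Graph;
    Graph* graph_;
    node_id_type uid_;
    Node(Graph* graph, node_id_type uid) : graph_(graph), uid_(uid) {}
  };

  size_type size() const { return i2u_.size(); }

  Node node(size_type i) {
    assert(i < i2u_.size());
    return Node(this, i2u_[i]);
  }

  Node add_node(Point position, node_value_type value) {
    node_id_type uid = nodes_.size();  // uids are never reused in this version
    nodes_.push_back(nodeinfo{position, value, i2u_.size()});
    i2u_.push_back(uid);
    return Node(this, uid);
  }

  void remove_node(Node n) {
    size_type i = n.index();
    i2u_.erase(i2u_.begin() + i);  // shifts the uids above i down by one
    for (size_type j = i; j < i2u_.size(); ++j)
      nodes_[i2u_[j]].index_ = j;  // keep index_ in sync with i2u_
    // nodes_[n.uid_] stays where it is, as an unused slot.
  }

 private:
  struct nodeinfo {
    Point position_;
    node_value_type value_;
    size_type index_;
  };
  std::vector<nodeinfo> nodes_;    // index is uid
  std::vector<node_id_type> i2u_;  // index is index, value is uid
};
```

Note that remove_node never changes `nodes_.size()`; the removed slot is simply left behind.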

```
Graph<int> g;
auto n0 = g.add_node(Point(0,0,0), 0); // n0.uid_ == 0
auto n1 = g.add_node(Point(1,0,0), 1); // n1.uid_ == 1
auto n2 = g.add_node(Point(2,0,0), 2); // n2.uid_ == 2
// g.nodes_ == [<(0,0,0),0>, <(1,0,0),1>, <(2,0,0),2>]
// g.i2u_ == [0, 1, 2]
g.remove_node(n0); // Shifts values around in g.i2u_!
// g.nodes_ == [<UNUSED>, <(1,0,0),1>, <(2,0,0),2>]
// g.i2u_ == [1, 2]
assert(n1.position() == Point(1,0,0)); // SUCCESS!
auto nx = g.node(0); // Expect the node with position (1,0,0)
assert(nx == n1); // SUCCESS!
```

Of course, now the `nodes_` array can grow without bound. This is a huge bummer,
but one we can fix. Before doing so, we’ll take a tour of specifications,
abstraction functions, and representation invariants. These properties will
help us as we analyze and improve our data structure.

The specifications at the top of the post refer over and over to a few concepts:

```
... the node's index ...
... the node's position ...
... the node's value ...
Invalidates ...
```

These together form an *abstract* concept of a graph. The user of the `Graph`
class shouldn’t need to understand its *implementation*, but only its
*interface*; and the interface is defined in abstract terms.

We win when interfaces are specific enough that it is possible to reason about their correctness. And for that, we need a specific graph abstraction.

Here’s one:

- A graph G is a tuple ⟨N, E⟩.
- N is a sequence of *nodes* [*n*_{0}, *n*_{1}, …, *n*_{m–1}] and E is a set of *edges*.
- Each node is a pair of *position* and *value* ⟨*p*, *val*⟩ where *p* is a point in 3D space and *val* is an object of value type.
- Each edge represents an unordered pair of nodes: if *e* ∈ E, then *e* = {*n*_{i}, *n*_{j}} where *n*_{i} and *n*_{j} are elements of N.

If we wanted, we could now write out our specifications more precisely in terms of abstract objects. For example:

```
/** Add a node. O(1) amortized time.
@param[in] position the node's position
@param[in] value the node's value
@return result (the new node)
@post new size() == old size() + 1
@post result.index() == old size()
In abstract terms, new G = <new N, new E>,
where new N = old N ++ [<@a position, @a value>]
and new E = old E. */
Node add_node(Point position, node_value_type value);
```

(Here, ++ on sequences concatenates the sequences together.) But the informal specifications are good enough in practice, as long as we can reliably extract a formal specification if and when we need one.

An *abstraction function* AF maps an internal representation of a class to the
corresponding abstract concept. Abstraction functions let us bridge between
the more abstract specifications provided by the comments and what actually
happens in the code. Abstraction functions go from representation objects to
abstract objects, because often *many* representation objects could stand for
*the same* abstract object. For one example, we don’t generally care exactly
where a Graph object is located in memory; it “means” the same thing
regardless of its address.

An object’s representation consists of its data members. For Graph, this is
the `nodes_` and `i2u_` arrays. The abstraction function, then, looks like this:

- AF(*this) = G = ⟨N, E⟩, where:
  - N = [n0, n1, …, n(m–1)], m = `i2u_.size()`, and ni = ⟨`nodes_[i2u_[i]].position_`, `nodes_[i2u_[i]].value_`⟩ for all i in [0,m).
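If it helps, the node part of the abstraction function can be written as ordinary code. The sketch below is illustrative only: `GraphRep`, `AbstractNode`, and `abstraction` are made-up names, the representation is exposed as a plain struct so that we can build it by hand, the value type is fixed to `int`, and `Point` is a hypothetical stand-in for the course's class.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the course's Point class.
struct Point {
  double x, y, z;
  bool operator==(const Point& o) const {
    return x == o.x && y == o.y && z == o.z;
  }
};

// The representation: the same two arrays as in Graph.
struct nodeinfo {
  Point position_;
  int value_;
  std::size_t index_;
};
struct GraphRep {
  std::vector<nodeinfo> nodes_;    // index is uid
  std::vector<std::size_t> i2u_;   // index is index, value is uid
};

// An abstract node is just a <position, value> pair.
using AbstractNode = std::pair<Point, int>;

// AF restricted to nodes: builds the abstract sequence N.
// Only meaningful when the representation is valid.
std::vector<AbstractNode> abstraction(const GraphRep& g) {
  std::vector<AbstractNode> N;
  for (std::size_t i = 0; i < g.i2u_.size(); ++i) {
    const nodeinfo& info = g.nodes_[g.i2u_[i]];
    N.emplace_back(info.position_, info.value_);
  }
  return N;
}
```

Two representations that differ only in their uids and unused slots produce equal sequences, since neither the uids nor `index_` end up in the result.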

(We’re not considering edges, so forget about E for now.) The key thing to
note is that *the particular values of i2u_ do not occur in the abstract
concept* (the output of the abstraction function). Neither do the values of
`nodes_[x].index_`. This is important, and common. Good data structures often
include “helper members” that don’t match directly to parts of the
corresponding abstract concept. We use those members to make the data
structure better: either faster or, as here, less likely to cause problems for
users. (It would be very difficult to use a Graph whose Node objects all got
invalidated by every remove_node operation!) Thus, many graph representations
with different node uids correspond to the same abstract graph.

A *representation invariant* defines whether a class representation is valid.
We use representation invariants to help prove that data structure operations
are correct: every public data structure operation can assume that the data
structure is valid on input, and must provide a postcondition that the data
structure is valid on output. (There’s an exception for operations that
destroy data structures, whose specifications say that they invalidate their
input. Remove_node is an example.)

Representation invariants are functions that take representation objects and return Boolean values (true for valid, false for invalid).

For Graph, the representation invariant needs to check that the `nodes_` and
`i2u_` arrays are synchronized. RI(*this) is true if and only if:

- For every i in [0, `i2u_.size()`), `nodes_[i2u_[i]].index_ == i`.
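Since a representation invariant is just a Boolean function of the representation, we can write it down as one. This is a sketch under some assumptions: `GraphRep` and `rep_ok` are made-up names, the representation is a plain struct, and `position_` and `value_` are left out because the invariant says nothing about them. The explicit range test is only there so the checker itself never reads out of bounds.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct nodeinfo {
  std::size_t index_;  // position_ and value_ omitted: RI ignores them
};
struct GraphRep {
  std::vector<nodeinfo> nodes_;    // index is uid
  std::vector<std::size_t> i2u_;   // index is index, value is uid
};

// RI: for every i in [0, i2u_.size()), nodes_[i2u_[i]].index_ == i.
bool rep_ok(const GraphRep& g) {
  for (std::size_t i = 0; i < g.i2u_.size(); ++i) {
    std::size_t uid = g.i2u_[i];
    if (uid >= g.nodes_.size())
      return false;  // uid out of range for nodes_
    if (g.nodes_[uid].index_ != i)
      return false;  // index_ and i2u_ disagree
  }
  return true;
}
```

The check takes O(size()) time, so it suits debug assertions at the end of add_node and remove_node better than the O(1) accessors.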

The key thing to note here is that *values not listed in the abstract concept
appear in the representation invariant*. This is again important, and common.
We add helper members to improve the data structure; but they have to be
correct to help! And here, the basic correctness requirement on nodes is that
the `index_` member is right.

Several other useful consistency requirements are actually already expressed by this invariant:

- For each i with 0 ≤ i < `i2u_.size()`, 0 ≤ `i2u_[i]` < `nodes_.size()`. (This is implied since otherwise the element access `nodes_[i2u_[i]]` would fail.)
- The uids in `i2u_` are distinct: if 0 ≤ i < j < `i2u_.size()`, then `i2u_[i] ≠ i2u_[j]`. (This is implied since `nodes_[i].index_` can take only one value.)

It’s usually good to express the invariant as compactly as possible, since that makes it easier to understand and prove.

Our representation invariant doesn’t mention `position_` or `value_` because
there are no internal consistency requirements on those fields. The
abstraction function and representation invariant serve different purposes and
can be quite independent.

Abstraction functions always work on valid representations, so if RI(x) is false it’s OK for AF(x) to break or return weird garbage.

The Node subobject has its own abstraction function and representation invariant. The abstract concept of a node is a subconcept of that of a graph.

- AF(n) = ni, where i = `n.graph_->nodes_[n.uid_].index_` and ni is the i’th node in AF(*n.graph_).
- RI(n) is true if and only if 0 ≤ `n.uid_` < `n.graph_->nodes_.size()`.

Do you think this is complete, though? Think about it for a minute.

It’s not complete, because removed nodes are invalid, but their uids are still in range by design! We can improve the representation invariant to catch removed nodes this way:

- RI(n) is true if and only if `n.graph_->nodes_[n.uid_].index_` = i, where `n.graph_->i2u_[i] = n.uid_`.

If `i2u_` and `nodes_[].index_` don’t match, the node has been deleted. Again we
can elide some implied requirements, such as that `n.uid_` and i are in range
for their respective arrays. This is very cool: we can add an O(1)-time
valid() function to Node that verifies a node is valid, and then use that
function in assertions!

```
class Node { ...
 private:
  bool valid() {
    return uid_ >= 0 && uid_ < graph_->nodes_.size()
        && graph_->nodes_[uid_].index_ < graph_->i2u_.size()
        && graph_->i2u_[graph_->nodes_[uid_].index_] == uid_;
  }
 public:
  Point position() {
    assert(valid());
    return graph_->nodes_[uid_].position_;
  }
  ...
};
```

Note how `valid()` actually contains the *implied* requirements from the
representation invariant, not just the main requirement. This is important.
The purpose of `valid()` is to detect invalid nodes, so unlike most other
operations, it doesn’t assume its input is totally valid. The carefully
written out checks avoid crashing when a node is invalid and (say) has
`index_` that’s out of range for `i2u_`.

Now let’s return to our space concern: if we call “n = add_node();
remove_node(n)” repeatedly, our graph data structure will grow more and more
<UNUSED> elements. The total size of the graph is proportional to the
*total number of add_node calls*, not the graph’s size or even its maximum
size. To do better, we must reuse space from unused elements. And to do that,
we must keep track of which elements are unused. We need a *free list*.

A lot of you had good ideas on how to represent the free list. Add a stack of
free element indexes, or a vector, or even a double-ended queue (!). These
work and are even good ideas (because they are simpler code). But you can do
it by adding *four bytes* to the graph representation. How would you do this?
Think about it.

What operations must the free list support? Not very many, if we think systematically.

- remove_node() will add a node to the free list.
- add_node() should check the free list for an element that could be reused. If there is one, it should reuse that element and advance the free list to the next free element.

Sounds like push_front() and pop_front(). Several container structures support
these operations in O(1) time. We turn to *singly linked lists*. A singly
linked list uses two types of data: (1) a *head* pointer to the first list
element, and (2) per-element *next* pointers that link the list together. The
end of the list is indicated by a distinguished *sentinel* value that can
never equal a valid pointer (such as NULL).

Adding a head pointer to the first free element would take 4 extra bytes. But
where can we find space for next pointers? Simple: reuse the `nodes_[].index_`
values! List links don’t *need* to be true C pointers; integers work just as
well.

```
class Graph { ...
 private:
  struct nodeinfo { ...
    node_id_type index_; // or next free nodeinfo
  };
  std::vector<nodeinfo> nodes_;
  std::vector<node_id_type> i2u_;
  node_id_type free_; // initialized to (node_id_type) -1
 public:
  void remove_node(Node n) {
    ... free adjacencies, etc. ...
    // remove node from i2u_ and renumber the nodes above it
    size_type i = n.index();
    i2u_.erase(i2u_.begin() + i);
    for (size_type j = i; j < i2u_.size(); ++j)
      nodes_[i2u_[j]].index_ = j;
    // mark node as free
    nodes_[n.uid_].index_ = free_;
    free_ = n.uid_;
  }
  Node add_node(Point position, node_value_type value) {
    node_id_type uid;
    if (free_ != (node_id_type) -1) {
      // we have a free slot
      uid = free_;
      free_ = nodes_[free_].index_;
    } else {
      // no free slot, add a new slot to the back
      uid = nodes_.size();
      nodes_.push_back(nodeinfo());
    }
    // rest is unchanged
    nodes_[uid].position_ = position;
    nodes_[uid].value_ = value;
    nodes_[uid].index_ = i2u_.size(); // overwrites any old free-list link
    i2u_.push_back(uid);
    return Node(this, uid);
  }
};
```

But wait a minute: the representation invariant RI puts requirements on the
`index_` member; are we allowed to reuse it?!

Yes, and when you see why, you’ll understand a lot about abstraction functions and representations. The graph representation invariant is, again:

- For every i in [0, `i2u_.size()`), `nodes_[i2u_[i]].index_` == i.

But free nodes’ uids *are not listed in i2u_*. (They aren’t valid nodes,
after all.) The representation invariant only discusses uids found in `i2u_`,
so it puts no requirements on `nodes_[i].index_`, as long as i is a free uid.

It would be useful, however, to extend our representation invariant to check the free list. A correct graph will ensure that free items and used items are disjoint, and that free items and used items together cover all items.

- (Old invariant) For every i in [0, `i2u_.size()`), `nodes_[i2u_[i]].index_` == i.
- Let F equal the set of uids listed on the free list, starting from `free_`; and let U equal the set of uids in the `i2u_` array. Then F and U are disjoint, and F ∪ U equals the range [0, `nodes_.size()`).
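Here is a sketch that puts the free list and the extended invariant together so we can watch the space bound hold. The names `GraphRep`, `NO_FREE`, and `rep_ok` are made up for illustration, the representation is a plain struct, nodes are identified by bare uids rather than Node objects, and a single `int` stands in for position and value.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using uid_type = std::size_t;
const uid_type NO_FREE = static_cast<uid_type>(-1);  // end-of-list sentinel

struct nodeinfo {
  int value_;       // stands in for position_ and value_
  uid_type index_;  // node index if in use, next free uid if free
};
struct GraphRep {
  std::vector<nodeinfo> nodes_;
  std::vector<uid_type> i2u_;
  uid_type free_ = NO_FREE;  // head of the free list
};

uid_type add_node(GraphRep& g, int value) {
  uid_type uid;
  if (g.free_ != NO_FREE) {  // pop a slot off the free list
    uid = g.free_;
    g.free_ = g.nodes_[uid].index_;
  } else {                   // no free slot, grow nodes_
    uid = g.nodes_.size();
    g.nodes_.push_back(nodeinfo());
  }
  g.nodes_[uid].value_ = value;
  g.nodes_[uid].index_ = g.i2u_.size();
  g.i2u_.push_back(uid);
  return uid;
}

void remove_node(GraphRep& g, uid_type uid) {
  std::size_t i = g.nodes_[uid].index_;
  g.i2u_.erase(g.i2u_.begin() + i);
  for (std::size_t j = i; j < g.i2u_.size(); ++j)
    g.nodes_[g.i2u_[j]].index_ = j;  // renumber the nodes above i
  g.nodes_[uid].index_ = g.free_;    // push the slot onto the free list
  g.free_ = uid;
}

// Extended RI: old invariant, plus F and U disjoint and covering all uids.
bool rep_ok(const GraphRep& g) {
  std::vector<bool> seen(g.nodes_.size(), false);
  std::size_t count = 0;
  for (std::size_t i = 0; i < g.i2u_.size(); ++i) {  // used uids (U)
    uid_type u = g.i2u_[i];
    if (u >= g.nodes_.size() || seen[u] || g.nodes_[u].index_ != i)
      return false;
    seen[u] = true;
    ++count;
  }
  for (uid_type u = g.free_; u != NO_FREE; u = g.nodes_[u].index_) {  // F
    if (u >= g.nodes_.size() || seen[u])
      return false;  // out of range, overlaps U, or the list has a cycle
    seen[u] = true;
    ++count;
  }
  return count == g.nodes_.size();  // F and U together cover every slot
}
```

With this version, calling add_node and then remove_node a hundred times in a row leaves `nodes_` holding a single slot, and `rep_ok` stays true after every step.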

Now, if we want, we can prove our code maintains this invariant for every
operation. It’s easy for most operations: Node::position() doesn’t change
`i2u_` or `index_`, for example, so the postcondition “RI(`*graph_`)” follows
directly from the precondition. For others (add_node()) it’s hard, but
possible. The invariant doesn’t hold at every point during the operation, but
assuming it holds at the beginning, we can prove it holds at the end.

Unfortunately, this space-saving change changes the meaning of our
representation invariant on *nodes*.

A node becomes invalid as soon as it’s removed from the graph. This validity transition is instantaneous and doesn’t require any code—it just happens, at the semantic level. For instance:

```
auto n1 = g.add_node(...);
auto n2 = n1;
auto n3 = g.add_node(...);
n3 = n1;
g.remove_node(n3); // INSTANTLY n1, n2, and n3 become invalid
```

The previous node representation invariant allowed us to *check* node
validity. After g.remove_node(n3), *all* of n1.valid(), n2.valid(), and
n3.valid() would return false. And since node uids were never reused, the
nodes would remain checkably invalid forever.

But now we reuse node uids, which can make an old uid appear valid again!

```
Graph<...> g; // new graph
auto n1 = g.add_node(...); // n1.uid_ == 0
g.remove_node(n1); // free n1.uid_
assert(!n1.valid()); // checkably invalid
auto n2 = g.add_node(...); // reuse uid 0!
assert(n1.valid()); // n1 appears valid again!
```

Now, is this *bad*? That depends.

We program C and C++ because we are interested in performance. We give up some safety for that performance: we can turn pointers into integers, write to random memory, access memory after freeing it, all sorts of awful stuff. This makes representation invariants inherently incomplete. Every C/C++ representation invariant assumes, as a precondition, that the representation in question wasn’t destroyed by random memory writes. Given that assumption, it’s not too far fetched to expect programmers to avoid other kinds of problems, such as touching invalid nodes. Also, in some cases, preconditions and representation invariants are unacceptably expensive to check. Imagine a full precondition checker for binary search: it would have to check that the input sequence was sorted—which takes O(n) time, violating the binary search’s complexity requirement!
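To see the cost problem concretely, here is a small sketch of a binary search that checks its own precondition. `checked_binary_search` is a made-up name; `std::is_sorted` and `std::binary_search` are the standard library algorithms.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Binary search with a full precondition check.
bool checked_binary_search(const std::vector<int>& v, int x) {
  // The check scans the whole sequence: O(n) time.
  assert(std::is_sorted(v.begin(), v.end()));
  // The search itself needs only O(log n) comparisons.
  return std::binary_search(v.begin(), v.end(), x);
}
```

While the assertion is compiled in, every call costs O(n). Defining `NDEBUG` removes the check and restores the O(log n) bound, at the price of no longer catching unsorted input.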

Nevertheless, invariant checking is often cheap. And when it is, you should definitely load your program with relevant assertions. They might catch real bugs! You can turn them off, if you must, after you prove your code correct.

Is it possible, then, to change the Graph representation so that we can detect
*all* invalid nodes, including node copies, in O(1) time? Think about it.

Yes, we can, as long as we spend some space. We need to reuse uids to save
space, but to detect reuse of invalid nodes, we can simply *add another
identifier that is never reused*. This type of identifier is often called a
*generation number*. Add an “`unsigned gen_`” to struct nodeinfo, and an
“`unsigned gen_`” to Node. On every Node operation, check that the generations
match. Done!
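As a rough preview, here is one way the generation check might look. This is only a sketch under assumptions of our own: `GraphRep`, `NodeHandle`, and `NO_FREE` are made-up names, `NodeHandle` stands in for Node's private members, a single `int` stands in for position and value, and the generation is bumped when a slot is freed. It also ignores the (distant) possibility that an `unsigned` generation wraps around.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using uid_type = std::size_t;
const uid_type NO_FREE = static_cast<uid_type>(-1);  // end-of-list sentinel

struct nodeinfo {
  int value_;       // stands in for position_ and value_
  uid_type index_;  // node index if in use, next free uid if free
  unsigned gen_;    // bumped every time this slot is freed
};
struct NodeHandle {  // hypothetical stand-in for Node's private members
  uid_type uid_;
  unsigned gen_;     // generation of the slot when the handle was made
};
struct GraphRep {
  std::vector<nodeinfo> nodes_;
  std::vector<uid_type> i2u_;
  uid_type free_ = NO_FREE;
};

// O(1): the handle is valid only if its slot has not been freed since.
bool valid(const GraphRep& g, NodeHandle n) {
  return n.uid_ < g.nodes_.size() && g.nodes_[n.uid_].gen_ == n.gen_;
}

NodeHandle add_node(GraphRep& g, int value) {
  uid_type uid;
  if (g.free_ != NO_FREE) {  // reuse a slot, keeping its current gen_
    uid = g.free_;
    g.free_ = g.nodes_[uid].index_;
  } else {                   // brand new slot starts at generation 0
    uid = g.nodes_.size();
    g.nodes_.push_back(nodeinfo{0, 0, 0});
  }
  g.nodes_[uid].value_ = value;
  g.nodes_[uid].index_ = g.i2u_.size();
  g.i2u_.push_back(uid);
  return NodeHandle{uid, g.nodes_[uid].gen_};
}

void remove_node(GraphRep& g, NodeHandle n) {
  assert(valid(g, n));
  std::size_t i = g.nodes_[n.uid_].index_;
  g.i2u_.erase(g.i2u_.begin() + i);
  for (std::size_t j = i; j < g.i2u_.size(); ++j)
    g.nodes_[g.i2u_[j]].index_ = j;  // renumber the nodes above i
  ++g.nodes_[n.uid_].gen_;           // every existing handle goes stale
  g.nodes_[n.uid_].index_ = g.free_;
  g.free_ = n.uid_;
}
```

In this sketch, a handle to a removed node stays checkably invalid even after its uid is handed out again, because the new node carries a newer generation.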

Or almost. Next time we’ll implement the generation version more carefully and write its invariants.

Posted on February 24, 2012