Introduction to Expert Programming (in C)

Appendix M: Memory Requirements for Tables

Introduction

(Unless stated otherwise log is taken to the natural base e = 2.71828.)

How much computer memory will we need to store the employee ID numbers and birthdates for a large company?

Of course the answer depends on various factors: (a) Do we want to maximize processing speed, or should we compress the data for minimum storage?; (b) Did we assign each of 1000 employees a unique 3-digit ID, or did we use their 9-digit social security number?; (c) etc. In practice, storage details may depend on a variety of compromises, but in this appendix we are interested in the theoretical minimum of space.

We will always choose to determine storage per element. For example, if we use 9800 bytes total for our 1000-employee table, the answer would be ``9.8 bytes.'' Storage may be measured in ``bytes,'' ``bits,'' or ``nats.'' While a byte usually distinguishes 256 cases, a bit distinguishes 2 cases, and a nat distinguishes 2.71828 cases. Since the number of cases will always be an integer, the nat may seem severely impractical, but it's actually a very natural measure of information!

A table entry will consist, at least conceptually, of two parts -- key(s) and attribute(s). Employee ID is an example of key; distinct employees will always have distinct ID's. Birthdate is an attribute; two or more employees may share a birthdate. A typical query would ask two questions (a) Does any employee have ID 666-01-1357?, (b) If so, what are his/her attributes? Of course a practical query might go the other way (``Which employees have their birthday today?'') but we aren't concerned with access methods here.

Simplest Case

In the simplest case, each of K employees has a unique ID in the range of 1 to K, and the attributes are a fixed length bit string, say of length A bytes. Required storage will be A bytes per element and the storage might be declared with the simple C syntax

```
struct {
char attributes[A];
} Element[K+1];
```
(Most C programmers wouldn't waste one element as we have done here, but
• Any employee with employee ID number zero is likely to be a laughing-stock.
• Zero is often convenient as a failure code.
• Subtracting 1 from the ID whenever you access the table is likely to lead to a bug: eventually you'll forget to subtract 1 ... or subtract it twice.
• The storage loss is insignificant and, anyway, there may be a logical use for the extra element, e.g. as a default initializing element.
Having said all this, I admit I usually start normal indexes at zero for no good reason but that it seems ``purer.'' Hey! Do as I say, not as I do!)

For the remainder of this appendix we need only consider the two separate complicating factors: (a) keys (ID's) may not be assigned consecutively; (b) attributes may be of variable length.

Bit-maps and Search-tables

Sometimes a table entry will need no attributes at all: All we want to know is whether a certain ID is valid or not. That is, we want to know for each number in {0,1,2,3,...,N-1} whether it is one of the K valid ID's, or one of the (N-K) invalid ID's.

Such a memory might be called a spell-checking dictionary, or simply spell-checker. While entries in a regular dictionary will have attributes (definition, etymology, etc.), in spell-checking we just want to know whether a word exists or not.

The most common (and most obvious) approach is the bit-map. A table of N bits is used with the j'th bit set to 1 if and only if j is one of the K valid ID's. The storage requirement is 1 bit per possible ID, or, since we've agreed to measure in bits per valid element, (N / K) bits.

Although a bit-map will often be just what we need, it suffers three problems which often make it unusable:

• If attributes are stored the same way as the existence bits (indexed by 0 ... N-1) we will spend (N / K) times as much storage on attributes as we need.
• We'd like to convert the valid ID's to an index in the range (0 ... K-1) but how? Scanning the table sequentially to count 1-bits is absurd for large N, so instead of 1 bit we'd need, say 10 bits for K = 1000 to include the valid index.
• Even when no attributes are stored, bit maps can't be used for very large N. Representing employees' social security numbers this way would require 1 billion bits no matter how small our company is.

The other major approach is to just list valid ID's in a table. This requires log N storage per element, so is more efficient than a bit map whenever K < N / log2 N. (I.e., for N = 8,000,000 the bit map uses more storage than a search table when K < 345,000.)

A random access in the table will use O(K) time for sequential search, O(log K) time for binary search, and O(1) time for hash-table search. Moreover, inserting a new element into a table takes only O(1) time for a hash table, but O(log K) for a binary-search tree and perhaps even more time for a binary-search table not organized as a tree.

Hash tables are therefore often preferable, and we will speak of bit maps and hash tables as the two main choices for large tables. But be aware that you'll program with binary-search tables more often than with hash tables: they're better (simpler) when they're good enough (when K isn't huge and there are no dynamic insertions). And remember that so-called perfect hash functions are usually impractical, so O(1) performance will require wasted space in a hash table: it will use perhaps 25% more storage than a binary-search table of the same ``size,'' that is 1.25 log N nats per element instead of log N.

Incremental Trees

A spell-checking memory which is organized as a tree need not store the complete key at an entry. Instead a dictionary entry for ``helplessness'' might just describe the addition of ``ness'' to its mother node, which in turn just added ``less'' to its mother.

There are many variations on this idea, and the storage cost will be much less than the naive search tables just mentioned. But for now I offer no discussion, so this brief section just serves as a placeholder for an important topic.

Theoretical Minimum

Now, let's calculate the theoretical minimum storage requirement for a reliable spell-checker which will inform us which K of N possible elements are valid elements.

There are p = C(N, K) ways that K valid words can be chosen from N possible words. A spell-checker needs to distinguish one of these p possibilities. If we assume each way is equally likely, then the spell-checker will require log p nats of storage, or (log p) / K nats per valid element (or (log2 p) / K bits per valid element).

As an exercise, stop now and find a simple approximation to that expression valid with the assumption that N is much larger than K and that K is much larger than 1.

We first present three mathematical facts. Use these as ``hints'' and then work the exercise without reading further ahead.

```
C(N,K)      =   N! / K! (N-K)!
log (M!)   ~=   (M + 0.5) * log(M) - M + 0.919
log (N-K)  ~=   log(N) - K/N
```

The first formula is ancient and famous. The second formula is a restatement of Stirling's formula. The third formula comes from elementary calculus and is one of the reasons e is such a natural base for logarithms. All three formulae are worth memorizing, but the second is complicated and you will be forgiven if you write it down in your personal address-book/crib-sheet!

Now what is

```
Cost = (log C(N,K)) / K
```
when N is much larger than K and K is much larger than 1?

Without showing intermediate steps, the answer in nats is

```
Cost = log (N / K) + 1
```

This seems remarkably simple and elegant, as will become evident in the following discussion. I don't recall seeing it in any textbook, nor did it seem that the comp.theory Usenet group was familiar with it.

I'm sure the formula was discovered hundreds of times before I did. Still, it was only the excitement of this discovery that led me to produce this ``Appendix M'' at all.

Quotienting, 2-bit and 1-trit solutions

When I first wrote the next section ("Practical Minimum"), I'd never heard of hash table quotienting, and ended up reinventing that "wheel" in a round-about way. I don't have time to write a general article on memory-efficient hash tables just now, but will make some brief comments. Although practical systems might use a hash table, where one needs to worry about capacity slack and collision relocation, we'll continue with the "purer" problem of minimal subset storage.

Rephrasing the result of the previous section, the minimum per-element storage cost of a subset whose density is 2^-M, is given in bits by
Cost = M + log2(2.71828)

This cost consists of "quotient" (M bits) and overhead (1 nat minimum). It seems silly to try to reduce overhead to log(2.71828) (one nat), but there are straightforward ways to achieve log(4) (two bits). Cleary uses two bits of overhead (he calls them "virgin" and "change" bits) in his memory-efficient Compact Hash. Valmari discusses methods for this; and so on.

Two-bit overhead: One very simple way to achieve overhead of only two bits is to divide the universe into as many buckets as there are elements in the subset and, for each bucket, encode in unary the number of subset-selected elements in the bucket. (The unary code is 0 --> 0; 1 --> 10; 2 --> 110; 3 --> 1110, etc.)

One-trit overhead: But 2.71828 < 3 < 4, and I wondered if there's a way to reduce overhead from 2 bits to 1 trit. One day, a lightbulb flashed in my head and I realized it was easy! Instead of a complete count of the elements in the bucket, just provide three states: {0, 1, 2 or more}. When there are two or more elements, e.g. a1 < a2 < a3 < ... < ak, the count can be deduced if they are arranged in almost ascending order as
a1, a2, ..., ak-2, ak, ak-1

This is likely to be impractical, or forbidden by other considerations, but I was still pleased to discover it! (By the way, a practical way to encode trits is to pack five of them into an 8-bit byte. This uses 1.600 bits per trit compared with the theoretical minimum of log 3 / log 2 = 1.585 bits.)

Practical Minimum

The elemental cost of a hash table is 1.25 log(N) and the elemental cost of a bit map is N / K. Neither approaches the log (N / K) + 1 optimum we have just calculated.

There is a way, however, to improve on these results by combining, in a sense, the bit-map and hash-table approaches. This is discussed briefly in Chapter 3 (see 3.3 and the suggested solution to 3.5).

Write N = A * B and produce A instances of hash tables; let's call them subtables (or compartments). Each subtable, presumably, will have elemental cost of 1.25 log(B). The keys in each hash table only need log2(B) bits because the log2(N) bits of the key have been split into two fields, and the one with log2(A) bits isn't needed as a hash-key -- it's a hash-table selector.

Applying the above formula it would seem that we could make the cost as small as we want, even zero by choosing B = 1. What's wrong with this reasoning? As an exercise, figure out the fallacy and find a more realistic cost estimate before reading on.

The fallacy is that there will be some minimal overhead for every subtable; even an empty subtable will need a bit to tell us it's empty. In fact, if B = 1, we then end up with a pure bit-map with its elemental cost of N / K.

Do the math. I find cost is minimal when the average table size is equal to the subtable overhead in nats. For such small subtables we wouldn't use hashing; sequential search would be fine. The elemental cost would approach log (N / K) + O(1), in close agreement with our lower limit of log (N / K) + 1.

Professor Antti Valmari has done a detailed study of such minimum memory data structures, and has offered me corrections and comments after an earlier version of this page. The practical schemes he has designed provide fast access as well as memory efficiency; they are based on multi-compartment hash tables but with a clever system of overflow buckets. Depending on particular parameters (e.g. for space/time tradeoff) he achieves a typical elemental cost of about 1.1 log (N / K) + .02 log (K) + 4.5 nats compared with the theoretical minimum of log (N / K) + 1. (The log (K) term arises from the need to store indexes to overflow buckets, whose number will be proportional to K.)

There are various practical methods for approaching the economy of such a table, but the multi-compartment hash table is very easy to implement and can get you near the general ``ballpark'' of optimality.

Thus, although the section title ``Practical Minimum'' contrasted with the ``Theoretical Minimum'' title of the preceding section, we find that we can get close to the theoretical minimum in practice.

Before leaving this section, let's consider the case A = K; that is average subtable size is 1; and suppose for the moment that each subtable has size exactly one. A trivial data structure would then suffice:

```
struct {
Key;	/* n - k (= log2 N - log2 K) bits */
} Element[K];	/* indexed by k (-log2 K) bits of the n (=log2 N) bit key */
```

The reason this doesn't work is that we aren't guaranteed each k-bit key field occurs once: a given value might occur more than once or never. What is the theoretical storage overhead needed to deal with this problem? It is exactly one nat per element as we showed above!

Unreliable Spell-checking

The spell-checking memories we've discussed have two properties. (a) They are reliable: valid elements are never classed as non-elements nor vice versa; (b) The list of elements can be constructed from the memory. (Discuss whether the second property follows from the first.)

In chapter 3, section 3.4 we discuss an unreliable spell-checker in which valid elements are always seen as valid, but for every S non-elements, one on average will be seen erroneously as valid. The minimum storage requirement per valid element for such a Bloom filter memory is (log S) (log 2)^-2.

What is the theoretical best one can do for such an unreliable spell-checker? Does an optimized Bloom filter memory achieve the optimal performance? I don't know the answers to these questions, but it is a good math exercise to show that the optimal Bloom filter uses (log S) (log 2)^-2 storage.

First write three equations in six unknowns:

```
M    =   total size of Bloom memory
S^-1 =   rate of false positives
H    =   number of hash functions
P    =   density of 1-bits in Bloom memory
K    =   number of valid elements
C    =   storage cost per element

C    =   M / K                       (1)
P^H  =   S^-1                        (2)
1-P  =   ((M-1)/M)^KH                (3)

H    =   -C (log 1-P)                (4)
H    =   -(log S) / (log P)          (5)
```

(I said "three equations" but five are shown. (4) is just (3), simplified for large M with a calculus rule given earlier, and substituting (1). (5) is the same as (2).)

The system reduces to one equation in three unknowns:

```
(log S)   =   C (log 1-P) (log P)
```

The problem is now: ``For given S, what p minimizes C?'' Or, ``For given C, what p maximizes S?'' The answer to each question is the same, and one hardly needs calculus to find it: p = 0.5 is the optimum, as we might have known: p = 0.5 is the ``maximal entropy point'' where any bit-based memory carries maximal information.

Given the premise that p = 0.5 will be optimal, the memory requirement of the unreliable spell-checker is computed more easily. Suppose we can tolerate a false recognition rate of 2^-p. We'll then compute p hash functions for each of K words and set (p*K) bits into the table. If there are no collisions, we'll want (2*p*K) total bits in the table, but there will be collisions so (1.4427*p*K) is the correct estimate.

If we can tolerate a false recognition rate of .001 (about 2^-10), we'll need 14.427 bits per "stored" element. This figure is independent of the number or density of the stored elements. The "magic number" 1.4427 is log2(e), the number of bits in a nat.

I entered one of Al Zimmerman's Programming Contest; for this one an in-memory table of small primes was needed. (Here "small prime" is defined as a prime smaller than about 100 million, or rather whatever limit will allow the use of a smallish in-memory table. The contest required checking primality on much larger numbers as well, but when possible we'd rather use a fast lookup than a slow primality checker.)

(The Programming Contest newsletter provided a hint about a special way to code a prime number table: Use an ordinary bit-map with 1-bits set for primes and 0-bits for non-primes. You won't need even numbers to be mapped, obviously, and extending that idea you can pack 30 numbers into 8 bits -- figure that out! The more general problem remains however; hence the remaining discussion in this section.)

An unreliable table would have been fine; false recognitions (at a rate of say .001) being weaned out later when a solution looked like a promising candidate. So 80 million nats (14.4 megabytes) would have been just enough to store 8 million primes using the method of the previous section.

But a reliable memory requires only log (N/K) + 2 nats, as we saw above. The reliable memory will require less storage than the Bloom filter (with error .001) when log (N/K) < 8, or equivalently K/N > 1/2981 where K/N is the density of "valid words" (or in this case, small primes).

The small primes have a much greater density than (1/2981) so I didn't use a Bloom filter in that programming contest.

Yet intuitively, it seems that an unreliable memory should require less memory than a reliable one! Can the reader think of an alternative to the Bloom filter that would deal with the "high-density" situation?

Variable-length Data

When constructing a minimal memory representation, it is often necessary to deal with Variable-length Data. With fixed-length element entries, one simply multiplies an index or hash key by the entry size in bytes and gets the attribute's storage address, but that doesn't work for variable-length data, unless each element has storage reserved for the worst case.

A simple solution is usually best: allocate variable fields with malloc() and put a pointer to them in a fixed-length structure. But when there is a very large number of tiny variable-length elements, the pointers themselves will use too much memory. Instead we need to embed length information into the data and navigate through it.

For example, sometimes we may prefix an S-bit length code to a table entry, allowing the entry to have a suffix up to (2^S)-1 bits long. If we need to inspect the first R such length codes just to locate the (R+1)'th table entry, it would be impractical to use non-sequential access on the table, but one might compromise, by providing a direct pointer to every, say, 20th entry.

We can save 3 bits in the length code by counting bytes instead of bits. This may mean part of the last byte of an entry is unused; the unused portion would average 3.5 bits, for a net loss of just 0.5 bits. More generally, reducing the bit length code by B bits costs (2^(B-1) - B - .5) bits on average (this is a gain when B = 1 or 2.)

I've gone into these length codes in a little detail just to show the care needed to calculate and optimize storage costs. General matters of compression and compaction codes are beyond our scope here, but we should mention that data can have a variable length defined with suffixes instead of prefixes.

A simple example showing length-defining affixes would be message strings. In Pascal, strings have a length prefix. In C, strings have a (null-character) termination suffix. (The C method has the advantage that strings can exceed 255 bytes, but the disadvantage that only 255 of 256 values can occur inside a string.)

The single byte used for Pascal string lengths has the advantage that few bits are spent, but the disadvantage that arbitrarily large numbers cannot be represented. Before reading the next paragraph, design a method to solve this problem.

We will give one more example of length-definitions with prefixes or suffixes. The Elias gamma code is an important compact representation which goes by many names. It is a bit-oriented representation of positive integers in which few bits are needed for small integers, but arbitrarily large integers can be represented. Here are the first few integers in their Elias gamma code bit representation, as well as a more general form:

```
1                =   1
010              =   2
011              =   3
00100            =   4
00101            =   5
00110            =   6
00111            =   7
000001xxxxx      =   32 to 63
```

There are many variations on this Elias code. I like the one above because, except for leading zero-bits, it is simply normal binary representation. (The number of leading zeros is chosen to put the first 1-bit in the exact middle of the code word.)

With the above giving the idea and shown as (a), here are three variations of the Elias gamma code:

```
1              =   1         (a)
01x            =   2 to 3    (a)
000001xxxxx    =   32 to 63  (a)

1              =   1         (b)
0x1            =   2 to 3    (b)
0x0x0x0x0x1    =   32 to 63  (b)

1              =   1         (c)
0x0            =   2 to 3    (c)
0x1x1x1x1x0    =   32 to 63  (c)
```

All three of these codes represent a k-bit integer using (2k-1) bits; they use (k-1) binary value bits and k structural bits to define the variable length; and all are in a sense the same code. But the form (a) uses a length prefix (1-terminated string of 0's), and form (b) uses a termination suffix (1-bit in even position).

The form (c) is almost identical to form (b), with the termination condition slightly harder to state. But (c) has a very remarkable property: it can be decoded in either direction! Normally a transmission error implies loss of all data until the next synchronization point, but with a code like (c), only the mistransmitted token itself need be lost (assuming there's only one error between synchronization points).