Copyright © 2003 by James D. Allen

- Introduction
- Simplest Case
- Bit-maps and Search-tables
- Incremental Trees
- Theoretical Minimum
- Quotienting, 2-bit and 1-trit solutions
- Practical Minimum
- Unreliable Spell-checking
- What about "high-density" spell-checking?
- Variable-length Data

(Unless stated otherwise *log* is taken to the natural base
e = 2.71828.)

How much computer memory will we need to store the employee ID numbers and birthdates for a large company?

Of course the answer depends on various factors:
(a) Do we want to maximize processing speed, or should we compress
the data for minimum storage?; (b) Did we assign each of 1000
employees a unique 3-digit ID, or did we use their 9-digit
social security number?; (c) etc.
In practice, storage details may depend on a variety of compromises,
but in this appendix we are interested in the theoretical minimum
of *space*.

We will always choose to determine *storage per element*.
For example, if we use 9800 bytes total for our 1000-employee table,
the answer would be ``9.8 bytes.'' Storage may be measured in
``bytes,'' ``bits,'' or ``nats.''
While a **byte** usually distinguishes 256 cases,
a **bit** distinguishes 2 cases,
and a **nat** distinguishes 2.71828 cases.
Since the number of cases will always be an integer, the *nat*
may seem severely impractical, but it's actually a
very **nat**ural measure of information!

A table entry will consist, at least conceptually, of two parts -- key(s) and attribute(s). Employee ID is an example of key; distinct employees will always have distinct ID's. Birthdate is an attribute; two or more employees may share a birthdate. A typical query would ask two questions (a) Does any employee have ID 666-01-1357?, (b) If so, what are his/her attributes? Of course a practical query might go the other way (``Which employees have their birthday today?'') but we aren't concerned with access methods here.

In the simplest case, each of K employees has a unique ID in the range of 1 to K, and the attributes are a fixed-length field, say A bytes long. Required storage will be A bytes per element, and the storage might be declared with the simple C syntax

    struct { char attributes[A]; } Element[K+1];

Most C programmers wouldn't waste one element as we have done here, but:

- Any employee with employee ID number zero is likely to be a laughing-stock.
- Zero is often convenient as a failure code.
- Subtracting 1 from the ID whenever you access the table is likely to lead to a bug: eventually you'll forget to subtract 1 ... or subtract it twice.
- The storage loss is insignificant and, anyway, there may be a logical use for the extra element, e.g. as a default initializing element.

For the remainder of this appendix we need only consider the two separate complicating factors: (a) keys (ID's) may not be assigned consecutively; (b) attributes may be of variable length.

Sometimes a table entry will need no attributes at all: All we want to know is whether a certain ID is valid or not. That is, we want to know for each number in {0,1,2,3,...,N-1} whether it is one of the K valid ID's, or one of the (N-K) invalid ID's.

Such a memory might be called a spell-checking dictionary,
or simply *spell-checker*. While entries in a regular dictionary
will have attributes (definition, etymology, etc.), in spell-checking
we just want to know whether a word exists or not.

The most common (and most obvious) approach is the **bit-map**.
A table of **N** bits is used with
the **j**'th bit set to **1** if and only if
**j** is one of the K valid ID's.
The storage requirement is 1 bit per *possible* ID, or, since we've
agreed to measure in *bits per valid element*, **(N / K)** bits.

Although a bit-map will often be just what we need, it suffers from three problems which often make it unusable:

- If attributes are stored the same way as the existence bits (indexed by 0 ... N-1) we will spend (N / K) times as much storage on attributes as we need.
- We'd like to convert each valid ID to an index in the range (0 ... K-1), but how? Scanning the table sequentially to count 1-bits is absurd for large N, so instead of 1 bit per possible ID we'd need, say, 10 bits for K = 1000 to store the valid index as well.
- Even when no attributes are stored, bit maps can't be used for very large N. Representing employees' social security numbers this way would require 1 billion bits no matter how small our company is.

The other major approach is to just list valid ID's in a table.
This requires **log N** storage per element, so is more efficient
than a bit map whenever **K < N / log2 N**.
(I.e., for N = 8,000,000 the bit map uses more storage than
a search table when K < 349,000 or so.)

A random access in the table will use **O(K)** time for
sequential search, **O(log K)** time for binary search,
and **O(1)** time for hash-table search.
Moreover, inserting a new element into a table takes only **O(1)**
time for a hash table, but **O(log K)** for a *binary-search tree*
and perhaps even more time for a binary-search table not organized
as a tree.

Hash tables are therefore often preferable, and we will speak of
*bit maps* and *hash tables* as the two main choices for
large tables. But be aware that you'll program with binary-search tables
more often than with hash tables: they're better (*simpler*)
when they're good enough (when K isn't huge and there are no dynamic
insertions).
And remember that so-called *perfect hash functions*
are usually impractical, so **O(1)** performance will require wasted
space in a hash table: it will use perhaps 25% more storage than
a binary-search table of the same ``size,''
that is **1.25 log N** nats per element instead of **log N**.

A spell-checking memory which is organized as a tree need not store the complete key at an entry. Instead a dictionary entry for ``helplessness'' might just describe the addition of ``ness'' to its mother node, which in turn just added ``less'' to its mother.

There are many variations on this idea, and the storage cost will be much less than the naive search tables just mentioned. But for now I offer no discussion, so this brief section just serves as a placeholder for an important topic.

Now, let's calculate the theoretical minimum storage requirement
for a *reliable spell-checker* which will
inform us which K of N possible elements are valid elements.

There are **p = C(N, K)** ways that K valid words can be chosen from
N possible words.
A spell-checker needs to distinguish one of these **p** possibilities.
If we assume each way is equally likely, then
the spell-checker will require **log p** *nats* of storage,
or **(log p) / K** *nats per valid element*
(or **(log2 p) / K** *bits* per valid element).

As an exercise, stop now and find a simple approximation to
that expression valid with
the assumption that **N**
is *much* larger than **K**
and that **K** is *much* larger than **1**.

We first present three mathematical facts. Use these as ``hints'' and then work the exercise without reading further ahead.

    C(N,K) = N! / (K! (N-K)!)
    log (M!) ~= (M + 0.5) * log(M) - M + 0.919
    log (N-K) ~= log(N) - K/N

The first formula is ancient and famous.
The second formula is a restatement of *Stirling's formula*.
The third formula comes from elementary calculus and is one
of the reasons *e* is such a *natural* base for
logarithms.
All three formulae are worth memorizing, but the second is
complicated and you will be forgiven if you write it down
in your personal address-book/crib-sheet!

Now, what is

    Cost = (log C(N,K)) / K

when N is much larger than K and K is much larger than 1?

Without showing intermediate steps, the answer in *nats* is

Cost = log (N / K) + 1

This seems remarkably simple and elegant, as will become evident
in the following discussion. I don't recall seeing it in any textbook,
nor did it seem that the *comp.theory* Usenet group was familiar
with it.
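For readers who want the omitted steps, here is a sketch of the derivation using the three formulae above (the linear terms and the 0.919 constants contribute only O(1) in total, which vanishes per element):

```latex
\log C(N,K) \;=\; \log N! - \log K! - \log (N-K)!
  \;\approx\; (N+\tfrac12)\log N - (K+\tfrac12)\log K - (N-K+\tfrac12)\log(N-K)
```

Applying log(N-K) ~= log(N) - K/N to the last term,

```latex
(N-K+\tfrac12)\log(N-K) \;\approx\; (N-K+\tfrac12)\log N - K
  \quad\text{(dropping } O(K^2/N)\text{)},
```

so that

```latex
\log C(N,K) \;\approx\; K\log N - (K+\tfrac12)\log K + K
  \;\approx\; K\Bigl(\log\frac{N}{K} + 1\Bigr),
```

and dividing by K gives Cost = log (N / K) + 1 nats per element.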

I'm sure the formula was discovered hundreds of times before I did. Still, it was only the excitement of this discovery that led me to produce this ``Appendix M'' at all.

When I first wrote the next section ("Practical Minimum"),
I'd never heard of hash table *quotienting*,
and ended up reinventing that "wheel" in a round-about way.
I don't have time to write a general article on memory-efficient
hash tables just now, but will make some brief comments.
Although practical systems might use a hash table, where
one needs to worry about capacity slack and collision relocation,
we'll continue with the *"purer"* problem of minimal subset
storage.

Rephrasing the result of the previous section, the
minimum per-element storage cost of a subset whose density
is **2^-M**, is given *in bits* by

**Cost = M + log2(2.71828)**

This cost consists of a "quotient" (M bits) and overhead (1 nat minimum).
Reducing the overhead all the way to log(2.71828) (one *nat*) may seem
out of reach, but there are straightforward ways to achieve log(4) (two bits).
Cleary uses two bits of overhead (he calls them "virgin" and "change" bits)
in his memory-efficient Compact Hash. Valmari discusses methods
for this; and so on.

**Two-bit overhead**: One very simple way to achieve overhead
of only two bits is to divide the universe into as many buckets
as there are elements in the subset and, for each bucket,
encode in unary the number of subset-selected elements in the bucket.
(The unary code is 0 --> 0; 1 --> 10; 2 --> 110; 3 --> 1110, etc.)

**One-trit overhead**: But 2.71828 < 3 < 4, and I wondered
if there's a way to reduce overhead from 2 bits to 1 trit.
One day, a lightbulb flashed in my head and I realized it was easy!
Instead of a complete count of the elements in the bucket, just
provide three states: {0, 1, 2 or more}.
When there are two or more elements, e.g. a1 < a2 < a3 < ... < ak,
the count can be deduced if they are arranged in
*almost* ascending order as

a_{1}, a_{2},
..., a_{k-2}, a_{k}, a_{k-1}

This is likely to be impractical, or forbidden by other considerations, but I was still pleased to discover it! (By the way, a practical way to encode trits is to pack five of them into an 8-bit byte. This uses 1.600 bits per trit compared with the theoretical minimum of log 3 / log 2 = 1.585 bits.)

The elemental cost of a hash table is **1.25 log(N)**
and the elemental cost of a bit map is **N / K**.
Neither approaches the **log (N / K) + 1** optimum we have just
calculated.

There is a way, however, to improve on these results by combining, in a sense, the bit-map and hash-table approaches. This is discussed briefly in Chapter 3 (see 3.3 and the suggested solution to 3.5).

Write **N = A * B** and produce A instances of hash tables;
let's call them *subtables* (or *compartments*).
Each subtable, presumably, will have elemental cost of **1.25 log(B)**.
The keys in each hash table only need log2(B) bits because the
log2(N) bits of the key have been split into two fields, and the
one with log2(A) bits isn't needed as a hash-key -- it's a hash-table
selector.

Applying the above formula it would seem that we could make
the cost as small as we want, even zero by choosing B = 1.
** What's wrong with this reasoning?**
As an exercise, figure out the fallacy and find a more realistic
cost estimate before reading on.

The fallacy is that there will be some minimal overhead for
every subtable; even an empty subtable will need a bit to tell us
it's empty. In fact, if **B = 1**, we then end up with a pure bit-map
with its elemental cost of **N / K**.

Do the math. I find cost is minimal when the average table size
is equal to the subtable overhead in nats. For such small subtables
we wouldn't use hashing; sequential search would be fine.
The elemental cost would approach **log (N / K) + O(1)**,
in close agreement with our lower limit of **log (N / K) + 1**.
Professor Antti Valmari has done a detailed study of such minimum
memory data structures, and has offered me corrections and
comments after an earlier version of this page.
The practical schemes he has designed
provide fast access as well as memory efficiency;
they are based on multi-compartment
hash tables but with a clever system of overflow buckets.
Depending on particular parameters (e.g. for space/time tradeoff)
he achieves a typical elemental cost of about
**1.1 log (N / K) + .02 log (K) + 4.5** nats
compared with the theoretical minimum of
**log (N / K) + 1**.
(The log (K) term arises from the need to store indexes
to overflow buckets, whose number will be proportional to K.)

There are various practical methods for approaching the economy of such a table, but the multi-compartment hash table is very easy to implement and can get you near the general ``ballpark'' of optimality.

Thus, although the section title ``Practical Minimum'' contrasted with
the ``Theoretical Minimum'' title of the preceding section, we find that
we *can* get close to the theoretical minimum in practice.

Before leaving this section, let's consider the case **A = K**;
that is, the average subtable size is 1; and suppose for the moment that
each subtable has size *exactly* one.
A trivial data structure would then suffice:

    struct {
        Key key;     /* n - k (= log2 N - log2 K) bits */
    } Element[K];    /* indexed by the k (= log2 K) high bits of the n (= log2 N)-bit key */

The reason this doesn't work is that we aren't guaranteed each k-bit
key field occurs once: a given value might occur more than once or never.
What is the theoretical storage overhead needed to deal with this problem?
It is exactly one *nat* per element as we showed above!

The spell-checking memories we've discussed have two properties.
(a) They are *reliable*: valid elements
are never classed as non-elements nor vice versa;
(b) The list of elements can be constructed from the memory.
(Discuss whether the second property follows from the first.)

In chapter 3, section 3.4 we discuss an unreliable spell-checker
in which valid elements are always seen as valid, but for every **S**
non-elements, one on average will be seen erroneously as valid.
The minimum storage requirement per valid element for
such a *Bloom filter memory* is **(log S) (log 2)^-2**.

What is the theoretical best one can do for such an unreliable
spell-checker? Does an optimized *Bloom filter memory* achieve the
optimal performance?
I don't know the answers to these questions, but it is a good math exercise
to show that the optimal Bloom filter uses **(log S) (log 2)^-2** storage.

First write three equations in six unknowns:

M = total size of Bloom memory
S^-1 = rate of false positives
H = number of hash functions
P = density of 1-bits in Bloom memory
K = number of valid elements
C = storage cost per element

    C = M / K                    (1)
    P^H = S^-1                   (2)
    1 - P = ((M-1)/M)^(K*H)      (3)
    H = -C log(1-P)              (4)
    H = -(log S) / (log P)       (5)

(I said "three equations" but five are shown. (4) is just (3), simplified for large M with a calculus rule given earlier, and substituting (1). (5) is the same as (2).)

The system reduces to one equation in three unknowns:

    (log S) = C (log(1-P)) (log P)

The problem is now: ``For given S, what P minimizes C?''
Or, ``For given C, what P maximizes S?''
The answer to each question is the same, and one hardly needs
calculus to find it: **P = 0.5** is the optimum, as we might
have known: **P = 0.5** is the ``maximal entropy point''
where any bit-based memory carries maximal information.

Given the premise that **p = 0.5** will be optimal,
the memory requirement of the unreliable spell-checker
is computed more easily.
Suppose we can tolerate a false recognition rate of 2^-H.
We'll then compute H hash functions for each of K words and
set (H*K) bits in the table.
If there were no collisions, we'd want (2*H*K) total bits
in the table, but there will be collisions, so (1.4427*H*K)
is the correct estimate.

If we can tolerate a false recognition rate of .001
(about 2^-10), we'll need 14.427 bits per "stored" element.
This figure is independent of the number or density of
the stored elements.
The "magic number" 1.4427 is log2(e), the number of bits in
a *nat*.

I entered one of Al Zimmerman's Programming Contests; for this one an in-memory table of small primes was needed. (Here "small prime" is defined as a prime smaller than about 100 million, or rather whatever limit will allow the use of a smallish in-memory table. The contest required checking primality on much larger numbers as well, but when possible we'd rather use a fast lookup than a slow primality checker.)

(The Programming Contest newsletter provided a hint about a special way to code a prime number table: Use an ordinary bit-map with 1-bits set for primes and 0-bits for non-primes. You won't need even numbers to be mapped, obviously, and extending that idea you can pack 30 numbers into 8 bits -- figure that out! The more general problem remains however; hence the remaining discussion in this section.)

An unreliable table would have been fine; false recognitions
(at a rate of say .001) could be weeded out later when a solution looked
like a promising candidate. So 80 million *nats*
(14.4 megabytes) would have been
just enough to store 8 million primes using the method of the
previous section.

But a reliable memory requires only **log (N/K) + 2**
nats, as we saw above.
The *reliable* memory will require less storage than
the *Bloom filter* (with error .001) when
**log (N/K) < 8**, or equivalently **K/N > 1/2981**
where **K/N** is the *density* of "valid words"
(or in this case, small primes).

The small primes have a much greater density than (1/2981)
so I didn't use a *Bloom filter* in that programming
contest.

Yet intuitively, it seems that an *unreliable*
memory should require less memory than a reliable one!
Can the reader think of an alternative to the *Bloom filter*
that would deal with the "high-density" situation?

When constructing a minimal memory representation, it is often
necessary to deal with *Variable-length Data*.
With fixed-length element entries, one simply multiplies an index
or hash key by the entry size in bytes and gets the attribute's storage
address, but that doesn't work for variable-length data, unless
each element has storage reserved for the worst case.

A simple solution is usually best: allocate variable
fields with *malloc()* and put a pointer to them in a fixed-length
structure. But when there is a very large number of tiny
variable-length elements, the pointers themselves will use
too much memory.
Instead we need to embed length information into the data and
navigate through it.

For example, sometimes we may prefix an S-bit length code to a table entry, allowing the entry to have a suffix up to (2^S)-1 bits long. If we need to inspect the first R such length codes just to locate the (R+1)'th table entry, it would be impractical to use non-sequential access on the table, but one might compromise, by providing a direct pointer to every, say, 20th entry.

We can save 3 bits in the length code by counting bytes instead of bits.
This may mean part of the last byte of an entry is unused; the unused
portion would average 3.5 bits, for a net loss of just 0.5 bits.
More generally, reducing the bit length code by **B** bits costs
(2^(B-1) - B - .5) bits on average (this is a *gain* when B = 1 or 2.)
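As a quick check of that expression: counting length in units of 2^B bits saves B bits of length field, while the unused tail of the final unit is (roughly) uniform on {0, 1, ..., 2^B - 1}:

```latex
\text{expected padding} = \frac{2^B - 1}{2}, \qquad
\text{net cost} = \frac{2^B - 1}{2} - B \;=\; 2^{B-1} - B - \tfrac12 .
```

For B = 3 this gives 4 - 3 - 0.5 = 0.5 bit lost, matching the byte-counting example; for B = 1 or 2 it gives -0.5, a gain.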

I've gone into these length codes in a little detail just to show
the care needed to calculate and optimize storage costs.
General matters of *compression* and *compaction codes*
are beyond our scope here, but we should mention that data can
have a variable length defined with suffixes instead of prefixes.

A simple example showing length-defining affixes would be message
strings. In Pascal, strings have a length *prefix*.
In C, strings have a (null-character) termination *suffix*.
(The C method has the advantage that strings can exceed 255 bytes,
but the disadvantage that only 255 of 256 values can occur inside
a string.)

The single byte used for Pascal string lengths has the advantage that few bits are spent, but the disadvantage that arbitrarily large numbers cannot be represented. Before reading the next paragraph, design a method to solve this problem.

We will give one more example of length-definitions with prefixes
or suffixes. The *Elias gamma code* is an important compact
representation which goes by many names. It is a bit-oriented
representation of
positive integers in which few bits are needed for small integers,
but arbitrarily large integers can be represented.
Here are the first few integers in their *Elias gamma code*
bit representation, as well as a more general form:

              1 = 1
            010 = 2
            011 = 3
          00100 = 4
          00101 = 5
          00110 = 6
          00111 = 7
    000001xxxxx = 32 to 63

There are many variations on this Elias code. I like the one above because, except for leading zero-bits, it is simply normal binary representation. (The number of leading zeros is chosen to put the first 1-bit in the exact middle of the code word.)

With the above giving the idea and shown as (a), here are three variations of the Elias gamma code:

    1 = 1      01x = 2 to 3      000001xxxxx = 32 to 63    (a)
    1 = 1      0x1 = 2 to 3      0x0x0x0x0x1 = 32 to 63    (b)
    1 = 1      0x0 = 2 to 3      0x1x1x1x1x0 = 32 to 63    (c)

All three of these codes represent a k-bit integer using (2k-1) bits; they use (k-1) binary value bits and k structural bits to define the variable length; and all are in a sense the same code. But the form (a) uses a length prefix (1-terminated string of 0's), and form (b) uses a termination suffix (1-bit in even position).

The form (c) is almost identical to form (b), with the termination condition slightly harder to state. But (c) has a very remarkable property: it can be decoded in either direction! Normally a transmission error implies loss of all data until the next synchronization point, but with a code like (c), only the mistransmitted token itself need be lost (assuming there's only one error between synchronization points).

