This page describes an invention in the craft of statistical bit coding. We assume fixed (static) probability estimation; and ignore any context conditioning or other data modeling. We consider only "simple" codes (in a sense we won't define rigorously) and assume no token occurs with overwhelming (greater than about 70%) probability. Also, to keep the exposition simple we describe a conceptual decoder which navigates a decoding tree step-by-step. A practical decoder might be table-driven for faster performance. For these various reasons, the page is not a comprehensive tutorial on the application of this invention to data compression.
A statistical bit-code maps decision tokens to variable-length bit strings with the intent of compression. For example
|Token||Probability (case 1)||Probability (case 2)||Compressed bit-string|
The average coding cost in bits is (- Σ p log2q) where p is a token's actual probability, and q = 2-k is the probability for which the code would be optimal. Using the examples in Table 1, the optimal and actual costs for case-1 probabilities are 1.50 and 1.50 bits per token; the optimal and actual costs for case-2 probabilities are 1.575 and 1.640 bits per token.
The depicted code has minimal cost among all simple prefix codes for either of the two probability distributions. (A simple prefix code of minimal cost is called a Huffman code.) In case 1, this Huffman code wastes zero compared with the limiting bound -- the cost of an optimal Shannon (arithmetic) code. In case 2, the Huffman code wastes 0.065 bits (1.640 - 1.575) per token compared with the Shannon limit. (That's a whopping 4% inefficiency; though that figure is misleading since practical Huffman codes usually encode many more than 3 tokens.)
A Huffman code will waste, typically, about 0.028 bits per token compared with the Shannon limit. Our case-2 example wastes much more than this because its dominant probabilities (36%) are very poorly approximated by any power of 1/2.
This paper describes a new invention ("Truffman code") which improves on Huffman coding by using a tiny amount of state information. (A key part of the extra state is a trit -- hence the name "TRuffman code.") Instead of wasting 0.028 bits per token, a crude implementation of the new invention wastes typically about 0.012 bits per token. (This is a wild guess -- I've only done a few experiments.) It might be possible to refine the invention to reduce this to 0.009 or so (??).
Before explaining the Truffman code, we will show an easy way to improve performance in our case-2 example.
|Compound Token||Probability (case 2)||Compressed bit-string|
|A A||0.13 (~.36*.36)||000
||B A||0.13 (~.36*.36)||001
||A B||0.13 (~.36*.36)||010
||B B||0.13 (~.36*.36)||011
||A C||0.10 (~.36*.28)||100
||B C||0.10 (~.36*.28)||101
By considering two adjacent tokens to be a single token, a new Huffman code is fashioned with 2.720 average bits per token compared with 2.709 for the Shannon limit; or only 0.011 (0.4%) wastage. This 0.011 wastage compares favorably with the typical Huffman code wastage of 0.0284 bits because the probabilities of the compound tokens are now all fortuitously close to powers of 1/2.
In the example, there are only three tokens, so a maximum of nine (3*3) compound tokens if at most two tokens are combined as in the hand-crafted example. In practice, a Huffman code might support 200 tokens or so, leading to as many as 40,000 compound tokens. This could be unwieldy. The new Truffman code does not increase the number of tokens; it does not use compounding. It achieves the effect of Table 2 in another way.
In a Huffman code, one bit is used to encode a decision of probability 50% and two bits for a 25% decision. We'd like to use 1 trit for a decision of probability of 33% or so. We'd also like to use the left-over space when a 25% decision does not occur to encode a 75% decision, or the left-over from a trit to encode a 67% decision. This is effectively what the Truffman code achieves.
We now describe a conceptual Truffman decoder in detail. The decoder will be in one of three states: (A) Base, (B) Preserved-trit, (C) Preserved-bit. As the names imply, a trit or bit is remembered in the Preserved states. Hence, you may prefer to think of six states: A, B0, B1, B2, C0, C1.
A conceptual Huffman decoder inputs bits (50-50 decisions). The Truffman decoder may also input skewed bits (70-30 decisions). The decoder will know from context whether it needs an ordinary bit or a skewed bit.
In the Base state, two bits are input when a skewed bit is needed. If those bits are both 1 then the 30% branch is taken. If not, the 70% branch is taken and the decoder enters the Preserved-trit state with the value of the two bits (00, 01, or 10) being the preserved trit.
In the Preserved-trit state, when a skewed bit is needed, no data is input; the preserved trit is used. If the trit is 10, the 30% branch is taken. Otherwise the 70% branch is taken and the decoder enters the Preserved-bit state. In the Preserved-bit state, the preserved trit (now restricted to 00 or 01) is used for the next needed bit, after which the decoder goes to Base state. (It is possible to design the code so that a skewed bit is never needed in the Preserved-bit state.)
Note that the optimal coding probability for the so-called "70% branch" will alternate between 75% and 67% as skewed-bit decoding alternates between the Base state and the Preserved-trit state.
I won't outline a Truffman encoder except to note that the distance between needed skewed bits can, in principle, be arbitrarily long; thus the Preserved-trit value may not be known until long after that coding point is encountered. This is no problem in my proof-of-concept implementation: I allocate enough memory to contain the entire compressed file. A simple workaround in a more practical system would be to impose a rule such as: If a Preserved-trit remains pending for more than 4 tokens, it collapses to a Preserved-bit; the efficiency loss from such a rule might be very modest.
There is an elegant algorithm due to David Huffman for constructing a minimal-cost Huffman code; there are other simple algorithms for constructing near-optimal ("good enough") Huffman-like codes. Constructing even a near-optimal Truffman code seems to be much more challenging. In this page I report only my initial effort. I'm embarrassed: it will seem quite "kludgey" even with the kludgiest details omitted.
The first step is to construct a histogram of an input training set in order to obtain the fixed probabilities. Then I triplicate that histogram, add an escape token, and apply Huffman's algorithm for the minimal Huffman code. (In the copies of the histogram I thought it useful to, for example, multiply all probabilities by 0.8 in one of the copies and by 0.65 in another.) For example, if there are 100 tokens, 301 tokens will be submitted to Huffman's algorithm. (The escape token of very low probability may be convenient, e.g. as an EOF indication.) Practical Huffman codes will often impose an upper limit on coded token lengths (or almost equivalently, a lower limit on probabilities submitted to Huffman's algorithm). In my implementation, such bounds are convenient for the post-processing kludges to function properly.
By summing the "Huffmanized" probabilities (probabilities which are powers of 1/2) of the three instances of each token, one derives a net probability estimate which is either 3, 4 or 5 times a power of 1/2. Some automated post-processing, which I won't describe in detail, reduces these three classes to two, while ensuring they comply with certain constraints: for example, odd-sized subclasses can cause trouble; the 2nd-least probable token needs to be forced to class 3. (I told you this was embarrassing.) Finally tokens and nodes are combined into higher nodes in an obvious way. One can achieve slightly better ultimate performance by considering the actual token probabilities rather than the "Huffmanized" estimates when combining nodes.
It would lead to much simpler processing, were the histogram duplicated instead of triplicated. And quadruplicating might have some advantages! I've started to code both approaches, but efforts to optimize the assigned classes may be tedious; and the existing "proof of concept" implementation may be enough for my present purpose.
Dynamic probability estimates are superior to static estimates, and various high-speed arithmetic and quasi-arithmetic codes have been developed. For these reasons, as well as complexities associated with generating a Truffman tree and using it to encode or decode, the Truffman Code may have little practical utility.
Still, it is a fun concept. Perhaps someone would be interested enough to pursue a conference presentation. Or perhaps, someone might adopt an underlying concept and apply it to something of practical value.