As computers and networks grow, any fixed-length field in a protocol tends to become obsolete. Therefore, for future upward compatibility, we want variable precision numbers. This class of problem is that of a universal code for integers.

The particular coding I propose here is a variation on Elias encoding, though I did not realize it when I invented it.

On reflection, my proposed encoding is too clever by half: better to use Elias omega coding, with large arbitrary limits on the represented numbers, rather than clever custom codings for each field.

A fixed length field is always in danger of running out, so one needs a committee to allocate numbers. With an arbitrary length field there is always plenty of headroom; we can just let people use whatever numbers seem good to them, and if there is a collision, well, one or both of the colliders can move to another number.

For example, the hash of a public key structure has to contain an algorithm identifier for the hashing algorithm, to accommodate the possibility that in future the existing algorithm becomes too weak, and we must introduce new algorithms while retaining compatibility with the old. There could potentially be quite a lot of algorithms, though in practice initially there will be only one, and it will be a long time before there are two.

We would like a way to represent an arbitrarily large number: a Huffman style representation of the numbers. This is not strictly Huffman encoding, since we want to be able to efficiently encode and decode large numbers without using a table, and we do not have precise knowledge of what the probabilities of numbers are likely to be, other than that small numbers are substantially more probable than large numbers. In the example above, we would like to be able to represent numbers up to O(2^{32}), but efficiently represent the numbers one and two, and reasonably efficiently represent the numbers three and four. So, to be strictly correct, this is “prefix free number encoding”. As we shall see at the end, prefix free number encoding always corresponds to Huffman encoding for some reasonable weights, but we are not worrying too much about weights, so we are not Huffman encoding.

Assume X is a prefix free sequence of bit strings – that is to say, if we are expecting a member of this sequence, we can tell where the member ends.

Let [m…n] represent a sequence of integers m to n-1.

Then the function X→[m…n] is the function that converts a bit string of X to the corresponding integer of [m…n], and similarly for [m…n]→X.

Thus X→[m…n] and [m…n]→X provide us with a prefix free representation of numbers greater than or equal to m, and less than n.

Assume the sequence X has n elements, and we can generate and recognize each element.

Let ℓ(X,k) be a new sequence, constructed by taking the first element of X and appending to it each of the 2^{k} bit patterns of length k, then taking the next element of X and appending to it each of the 2^{k+1} bit patterns of length k+1, and so on and so forth.

ℓ is a function that gives us this new sequence from an existing sequence and an integer.

The new sequence ℓ(X,k) will be a sequence of prefix free bit patterns that has 2^{n+k} - 2^{k} elements.

We can proceed iteratively, and define a sequence ℓ(ℓ(X,j),k); this class of sequences is useful and efficient for numbers that are typically quite small, but could often be very large. We will prescribe more precisely which sequences are useful and efficient for which purposes when we relate our encoding to Huffman coding.

To generate the (m+1)th element of ℓ(X,k), where X is a sequence that has n elements:

Let j = m + 2^{k}

Let p = floor(log_{2}(j)), that is to say, p is the position of the high order bit of j: zero if j is one, one if j is two or three, two if j is four, five, six, or seven, and so on and so forth.

We encode p into its representation using the encoding [k…n+k]→X, and append to that the low order p bits of j.
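The generation step above can be sketched in Python (a sketch under my own naming; `ell_encode` is not from the original, and X is given as an explicit list of prefix free bit strings, here the illustrative list “0”, “10”, “11”):

```python
def ell_encode(m, X, k):
    """Generate the (m+1)th element of ℓ(X, k).

    X is a prefix free sequence given as a list of bit strings;
    its (p-k)th element is the encoding [k…n+k]→X of p.
    """
    j = m + (1 << k)
    p = j.bit_length() - 1         # position of the high order bit of j
    prefix = X[p - k]              # encode p using [k…n+k]→X
    low_bits = format(j, "b")[1:]  # drop the leading 1: the low order p bits of j
    return prefix + low_bits

# Illustrative prefix free X (this happens to be the sequence ℒ(3) defined below)
X = ["0", "10", "11"]
print([ell_encode(m, X, 1) for m in range(4)])  # first four elements of ℓ(X,1)
```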

To do the reverse operation, decoding from the prefix free representation to the zero based sequence position, that is, to perform the function ℓ(X,k)→[0…2^{n+k}-2^{k}], we extract p from the bit stream using the decoding X→[k…n+k], then extract the next p bits of the bit stream, and construct m from 2^{p} - 2^{k} plus the number represented by those bits.
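The corresponding decode step, again as a hedged Python sketch (`ell_decode` is my name for it; because X is prefix free, at most one of its elements can match the front of the bit stream):

```python
def ell_decode(bits, X, k):
    """Decode a codeword of ℓ(X, k) back to its zero based position m.

    Reads p via the decoding X→[k…n+k], then p further bits;
    m = 2^p - 2^k + (value of those p bits).
    Returns (m, number of bits consumed).
    """
    for i, x in enumerate(X):       # at most one element of a
        if bits.startswith(x):      # prefix free X can match
            p = k + i
            low = bits[len(x):len(x) + p]
            m = (1 << p) - (1 << k) + int(low or "0", 2)
            return m, len(x) + p
    raise ValueError("not a codeword of this sequence")

# Illustrative prefix free X, matching the encode sketch
X = ["0", "10", "11"]
print(ell_decode("1001", X, 1))
```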

Now all we need is an efficient sequence X for small numbers.

Let ℒ(n) be such a sequence with n values.

The first bit pattern of ℒ(n) is 0

The next bit pattern of ℒ(n) is 10

The next bit pattern of ℒ(n) is 110

The next bit pattern of ℒ(n) is 1110

…

The next to last bit pattern of ℒ(n) is 11…110, containing n-2 one bits
and one zero bit.

The last bit pattern of ℒ(n) breaks the sequence, for it is 11…11,
containing n-1 one bits and no zero bit.
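The whole sequence ℒ(n) can be generated in a couple of lines; a Python sketch (the helper name `unary_codes` is mine):

```python
def unary_codes(n):
    """ℒ(n): n prefix free bit strings; the ith is i ones followed by a zero,
    except the last, which is n-1 ones with no terminating zero."""
    return ["1" * i + "0" for i in range(n - 1)] + ["1" * (n - 1)]

print(unary_codes(6))
```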

The reason why we break the sequence, not permitting the representation of unboundedly large numbers, is that computers cannot handle unboundedly large numbers: one must always specify a bound, or else some attacker will cause our code to crash, producing results that we did not anticipate, results that the attacker may well be able to make use of.

Perhaps a better solution is to waste a bit, thereby allowing future expansion. We use a representation that can represent arbitrarily large numbers, but clients and servers can put some arbitrary maximum on the size of the number. If that maximum proves too low, future clients can just expand it without breaking backward compatibility. This is similar to the fact that different file systems have different arbitrary maxima for the nesting of directories, the length of paths, and the length of directory names. Provided the maxima are generous, it does not matter that they are not the same.

Thus the numbers 1 to 2 are represented by [1…3] → ℒ(2): 1 being the pattern “0”, and 2 being the pattern “1”.

The numbers 0 to 5 are represented by [0…6] → ℒ(6), being the patterns

“0”, “10”, “110”, “1110”, “11110”, “11111”

Thus ([0…6] → ℒ(6))(3), the bit pattern that represents the number 3, is “1110”.

This representation is only useful if we expect our numbers to be quite small.

[0…6] → ℓ(ℒ(2),1) is the sequence “00”, “01”, “100”, “101”, “110”, “111”, representing the numbers zero to five, that is, the numbers from 0 to less than 2^{2+1} - 2^{1}.
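This whole sequence can be generated mechanically; a Python sketch (the helper names `unary_codes` for ℒ and `ell` for ℓ are mine, not part of the original):

```python
def unary_codes(n):
    # ℒ(n): i ones then a zero, except the last element, which is n-1 ones
    return ["1" * i + "0" for i in range(n - 1)] + ["1" * (n - 1)]

def ell(X, k):
    # ℓ(X, k): the ith element of X extended by every bit pattern of length k+i
    out = []
    for i, x in enumerate(X):
        L = k + i
        for b in range(1 << L):
            out.append(x + (format(b, "b").zfill(L) if L else ""))
    return out

print(ell(unary_codes(2), 1))
```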

[1…15] → ℓ(ℒ(3),1) is similarly the sequence

“00”, “01”,

“1000”, “1001”, “1010”, “1011”,

“11000”, “11001”, “11010”, “11011”,“11100”, “11101”, “11110”, “11111”,

representing the numbers one to fourteen, that is, the numbers from 1 to less than 1 + 2^{3+1} - 2^{1}.

We notice that ℓ(ℒ(n),k) has 2^{n+k} - 2^{k} patterns; the shortest patterns are of length 1+k, and the largest patterns are of length 2n+k-2.
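These three formulas can be checked numerically for a few values of n and k, reusing the same hypothetical helpers as before:

```python
def unary_codes(n):
    # ℒ(n)
    return ["1" * i + "0" for i in range(n - 1)] + ["1" * (n - 1)]

def ell(X, k):
    # ℓ(X, k)
    out = []
    for i, x in enumerate(X):
        L = k + i
        for b in range(1 << L):
            out.append(x + (format(b, "b").zfill(L) if L else ""))
    return out

for n, k in [(2, 1), (3, 1), (5, 2)]:
    S = ell(unary_codes(n), k)
    assert len(S) == 2**(n + k) - 2**k       # number of patterns
    assert min(map(len, S)) == 1 + k         # shortest pattern
    assert max(map(len, S)) == 2 * n + k - 2 # longest pattern
print("formulas check out")
```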

This representation in general requires about twice as many bits to represent large numbers as the usual, non self terminating representation does (assuming k to be small).

We can iterate this process again, to get the bit string sequence:

ℓ(ℓ(ℒ(n),j),k)

which sequence has 2^{2^{n+j} - 2^{j} + k} - 2^{k} elements.

This representation is asymptotically efficient for very large numbers, making further iterations pointless.

ℓ(ℒ(5),1) has 62 elements, starting with a two bit pattern, and ending
with a nine bit pattern. Thus ℓ(ℓ(ℒ(5),1),2) has
2^{64}-4 elements, starting with a four bit pattern, and finishing
with a 72 bit pattern.
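The counts and pattern lengths in this example follow from the earlier formulas by plain arithmetic (the variable names below are mine):

```python
n, j, k = 5, 1, 2
inner = 2**(n + j) - 2**j        # elements of ℓ(ℒ(5),1)
outer = 2**(inner + k) - 2**k    # elements of ℓ(ℓ(ℒ(5),1),2)
shortest = (1 + j) + k           # shortest inner pattern plus k suffix bits
# longest inner pattern plus the longest possible suffix of inner + k - 1 bits
longest = (2 * n + j - 2) + (inner + k - 1)
print(inner, outer, shortest, longest)
```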

Now let us consider a Huffman representation of the numbers when we assign the number `n` the weight `1/(n*(n+1)) = 1/n - 1/(n+1)`.

In this case the weight of the numbers in the range `n…m` is `1/n - 1/(m+1)`.
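The range weight follows because the sum of the individual weights telescopes; a quick exact-arithmetic check (the helper names are mine):

```python
from fractions import Fraction

def weight(n):
    # weight assigned to the number n
    return Fraction(1, n * (n + 1))

def range_weight(n, m):
    # total weight of the numbers n…m
    return sum(weight(i) for i in range(n, m + 1))

# the sum telescopes to 1/n - 1/(m+1)
assert range_weight(2, 3) == Fraction(1, 2) - Fraction(1, 4)
assert range_weight(4, 7) == Fraction(1, 4) - Fraction(1, 8)
print("telescoping checks out")
```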

So our bit patterns are:

0 (representing 1)

100 101 representing 2 to 3

11000 11001 11010 11011 representing 4 to 7

1110000 1110001 1110010 1110011 1110100 1110101
1110110 1110111 representing 8 to 15

We see that the Huffman coding of the numbers that are weighted as having probability `1/(n*(n+1))` is our friend [1…] → ℓ(ℒ(n),0), where n is very large.
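Generating ℓ(ℒ(n),0) for a moderately large n reproduces exactly the bit patterns listed above; a sketch with the same hypothetical helpers as before (here n = 10 stands in for “very large”):

```python
def unary_codes(n):
    # ℒ(n)
    return ["1" * i + "0" for i in range(n - 1)] + ["1" * (n - 1)]

def ell(X, k):
    # ℓ(X, k)
    out = []
    for i, x in enumerate(X):
        L = k + i
        for b in range(1 << L):
            out.append(x + (format(b, "b").zfill(L) if L else ""))
    return out

codes = ell(unary_codes(10), 0)  # the codeword for the number m+1 is codes[m]
print(codes[:7])
```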

Thus this is good in a situation where we are quite unlikely to encounter a big number. However, a very common situation, perhaps the most common situation, is that we are quite likely to encounter numbers smaller than a given small amount, but also quite likely to encounter numbers larger than a given huge amount: the probability of encountering a number in the range 0…5 is somewhat comparable to the probability of encountering a number in the range 5000…50000000.

In such a case, we should represent such values by members of a prefix free sequence `ℓ(ℓ(ℒ(n),j),k)`.

These documents are licensed under the Creative Commons Attribution-Share Alike 3.0 License