5. Coding for Efficiency

 

            Suppose we have a very long message which we want to save or send to someone, and we would like to shorten it to save ourselves space or sending time. This happens in particular if our message contains some kind of audio or video information, which can go on and on.

 

            There are all sorts of procedures and protocols used to compactify information. We will study only one simple problem.

 

            Suppose we know nothing whatever about our message, and want to compress it using only information we can glean from it by doing a relatively small amount of counting on its data.

 

            Suppose then that our message is represented as a long string of 0's and 1's of length N. One thing we can do is to pick a small number, k, divide the message up into consecutive blocks, each consisting of k bits of information (that is, k binary digits), and count how many blocks in our message match each possible block pattern of length k.

 

            There are, of course, 2^k different possible patterns of digits 0 and 1 of length k.

If we were to do this, we would find, say, f(q) blocks with pattern q for each such pattern q.

 

            If k is 2, the possible patterns are 00, 01, 10 and 11, which we can call 0, 1, 2 and 3, corresponding to reading them in binary notation. So we will get four frequencies, f(0), …, f(3). Their sum will be N/k, or rather the smallest integer at least as large as N/k, since that is how many blocks we will have.
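
            Here, for concreteness, is a minimal sketch in Python of this counting step; the function name block_frequencies and the choice to pad a final partial block with 0's are my own, not anything fixed above.

from collections import Counter

def block_frequencies(message, k):
    # `message` is a string of '0's and '1's; a final partial block, if any,
    # is padded with '0's so that every block has length exactly k.
    blocks = [message[i:i+k].ljust(k, '0') for i in range(0, len(message), k)]
    counts = Counter(int(b, 2) for b in blocks)        # read each pattern in binary
    return [counts.get(q, 0) for q in range(2 ** k)]   # f(0), ..., f(2^k - 1)

# With k = 2 the patterns 00, 01, 10, 11 are called 0, 1, 2, 3:
print(block_frequencies("0001101100011011", 2))        # [2, 2, 2, 2]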

 

            So we could find 2^k such pattern occurrence numbers (let's call them pattern frequencies), and we want to use this information to shorten our message.

 

            How can we do so?

 

             Well, we can assign a "code word" to each block pattern, and make the words that correspond to our frequently occurring patterns short, and those corresponding to rare patterns long.

 

            This is what Morse code sort of does for English language text.

 

            And this brings us to two specific questions:

 

1.      How much can we shorten our message by using this technique?

 

2.      How can we find an efficient shortening algorithm for this problem?

 

5.1 Shannon’s Theorem: A lower bound to the length of our shortened message

 

            We now look at the first of these questions:

What can we say about the length of the shortest possible message obtained by trying this approach?

 

            We can employ our old friend the Pigeonhole Principle, to get a bound.

 

            To do so we address two questions:

 

                        If we shorten our message so it ends up being M bits long, how many possible messages can we distinguish?

 

                        Given what we are attempting, how many possible messages do we have to be able to handle in order to be able to decode every possible coded message successfully?

 

            The first question is straightforward: if our shortened message is M bits long, there are 2^M possible messages that can be created.

 

            So let us look at the second question: how many possible messages are there, given the information that the frequencies f(q) tell us about our message of length N?

 

            We know from our frequency information how often each block pattern occurs; but we have no information about the ordering of the blocks in our message. Thus, as far as the frequency information is concerned, our message could be any ordering of f(q) samples of block q for each q.

 

            So the question we want to address here is: How many different ways are there to order N/k blocks, given our frequency information?

 

            The answer is what is known as a “Multinomial Coefficient”.

Let us first then remind ourselves about “Binomial Coefficients”.

 

            Suppose we had chosen k to be 1, so that we had only two possible values for q.

Our current question would then be, how many different arrangements of two symbols are there given that there are f(0) occurrences of the symbol 0 and f(1) of 1?

 

            This is the number of ways of placing f(0) 0's among f(0)+f(1) symbols all together. This is the same as the number of terms in which you choose x from exactly f(0) of the factors in multiplying out (1 + x)^(f(0)+f(1)) using the distributive law over and over again.

 

            It is called the Binomial Coefficient, C(f(0)+f(1), f(0)), and it is defined by the "Binomial Theorem"

 

(1 + x)^n = SUM (over k from 0 to n of) C(n,k) x^k.
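
            As a small illustration (my own, not part of the argument), one can carry out the repeated use of the distributive law in Python and check that the coefficients of (1 + x)^n are exactly the C(n,k):

from math import comb

def expand_coefficients(n):
    # Coefficients of (1 + x)^n, built by multiplying by (1 + x) n times.
    coeffs = [1]
    for _ in range(n):
        coeffs = [(coeffs[i] if i < len(coeffs) else 0)
                  + (coeffs[i - 1] if i >= 1 else 0)
                  for i in range(len(coeffs) + 1)]
    return coeffs

n = 6
print(expand_coefficients(n))              # [1, 6, 15, 20, 15, 6, 1]
print([comb(n, k) for k in range(n + 1)])  # the binomial coefficients C(6, k)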

 

 We can write a similar theorem, which we could call the multinomial theorem, as follows:

 

            (1 + x_1 + x_2 + … + x_j)^n = SUM (over k_1, …, k_j of) C_j(n, k_1, …, k_j) * (x_1)^(k_1) * (x_2)^(k_2) * … * (x_j)^(k_j),

 

where once again the multinomial coefficients, the C_j(n, k_1, …, k_j), can be obtained by use of the distributive law.

 

(This is something of a dopey notation, since it singles out the 0th symbol, and does not list how often it is picked (it would be k_0). We can of course deduce that k_0 in each term on the right here is n minus the sum of all the other k's.)

 

            In our case we can identify f(q) with k_q, recognizing that every ordering of our symbols corresponds to a term obtained by applying the distributive law to the function on the left here. (The symbol appearing in position r in an arrangement is the index s corresponding to the variable x_s that was chosen from the r-th factor in the expanded term we are looking at.)

 

            So what then is a multinomial coefficient?

 

            There is a standard way of calculating these things, which goes as follows.

 

            If we specify how many copies of each symbol we are to have, which we do, all arrangements are alike except for order. We use the following fact:

 

            The number of orderings is the number of arrangements multiplied by the number of orders which leave each such arrangement fixed.

           

            This is a basic fact, and I usually argue for it by saying it again louder, which unfortunately doesn’t help those who don’t see it. Let me try it here.

 

            Take any arrangement of our blocks. Now reorder the blocks according to some other order. Our initial arrangement will get mapped into some arrangement. As we run through all possible orders, each arrangement A will be hit in this way the same number of times, say Q times. We can deduce then that the number of arrangements is the number of possible orders R divided by Q.

 

            For example, if there are 24 orders and among them they map our original order into each arrangement 3 times, we can conclude that there must be 8 arrangements.
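
            Here is a tiny check of this fact (my own illustration) on the hypothetical multiset of four blocks 0, 0, 1, 1: there are 4! = 24 orders, each distinct arrangement is hit 2!*2! = 4 times, and indeed 24/4 = 6 arrangements appear.

from itertools import permutations
from math import factorial

blocks = [0, 0, 1, 1]                  # f(0) = 2 copies of block 0, f(1) = 2 of block 1
orders = list(permutations(blocks))    # all 4! = 24 orderings, repetitions included
arrangements = set(orders)             # the distinct arrangements

Q = factorial(2) * factorial(2)        # orders that leave any given arrangement fixed
print(len(orders), len(arrangements), len(orders) // Q)   # prints: 24 6 6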

 

            This fact tells us what the binomial and multinomial coefficients are.

 

How many orders are there? With n blocks, we have n! possible orderings of them (for us, n is N/k).

 

 How many such orders keep fixed an arrangement with f(q) copies of block q for each q?

 

            Any reordering of blocks which merely permutes the blocks corresponding to the same q value does not change the arrangement at all. The total number of orderings that merely permute blocks having the same q values is the product over all q values of f(q)!.

 

            We conclude that our multinomial coefficient is in general

 

n! / (PRODUCT (over s from 0 to j of) (k_s)!)

 

            In our case we find we have

 

                                    (N/k)! /(f(0)!*f(1)!*…*f(j)!)

 

possible arrangements of our blocks that make distinct messages, of which our message is one. (Here and below, j denotes 2^k - 1, the largest of the block pattern values.)

 

            So our pigeonhole principle tells us:

 

2^M >= (N/k)! / (f(0)! * f(1)! * … * f(j)!)

taking logs we get

 

                        M >= log_2 ((N/k)! / (f(0)! * f(1)! * … * f(j)!)).
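
            A short sketch of this bound in Python (the names are my own): compute the multinomial coefficient from the frequencies f(q) and take its log to the base 2.

from math import factorial, log2, prod

def pigeonhole_lower_bound(freqs):
    # freqs = [f(0), f(1), ..., f(j)]; their sum is the number of blocks, N/k.
    n_blocks = sum(freqs)
    arrangements = factorial(n_blocks) // prod(factorial(f) for f in freqs)
    return log2(arrangements)          # any workable M must be at least this large

print(pigeonhole_lower_bound([2, 2, 2, 2]))   # 8 blocks, 2520 arrangements, about 11.3 bits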

 

            We will now massage the right hand side here to get the bound of Shannon’s Theorem, which theorem is:

 

Theorem: If we compress a message of length N by dividing it into N/k blocks of length k each, using only frequency information contained in the frequencies f(s) of the various block patterns, we will need a number of binary digits for our message that is at least

 

(N/k) * H({p(0), p(1), …, p(2^k - 1)})

 

and we can find a way to compress our message in this way that uses at most

 

(N/k) * (H({p(0), p(1), …, p(2^k - 1)}) + 1)

 

binary digits. (Usually we get much closer to the former number.)

 

            Here we use the notation p(s)= kf(s)/N, which is the “relative frequency” of the block pattern s: it is the number of s blocks divided by the total number of blocks; and H(…) is the “entropy of information” of the sequence of p’s.

           

            What then is this “entropy of information”, usually written as H({p(s)})?  

           

            It should come as no surprise here that it is essentially the log to the base 2 of the right hand side of our pigeonhole principle bound, which boils down to

 

H({p(s)}) = SUM (over s from 0 to 2^k - 1 of) (-p(s) log_2 p(s)).
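
            Here is a minimal sketch of this computation (my own names), assuming the block frequencies are given as a list; it also prints the (N/k)*H lower bound of the theorem.

from math import log2

def entropy(freqs):
    # H({p(s)}) in bits per block, with p(s) = f(s)/(N/k); patterns with f(s) = 0 contribute 0.
    total = sum(freqs)
    return sum(-(f / total) * log2(f / total) for f in freqs if f > 0)

freqs = [4, 2, 1, 1]                # 8 blocks, relative frequencies 1/2, 1/4, 1/8, 1/8
print(entropy(freqs))               # 1.75 bits per block
print(sum(freqs) * entropy(freqs))  # the (N/k)*H lower bound: 14.0 bits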

 

            We will do the boiling down below. Here we notice that the first part of this theorem is an immediate consequence of the pigeonhole principle and this boiling.

 

            The second part of this “Shannon’s Theorem” is the statement that the bound here can be achieved, almost. We prove this soon.

 

            The remainder of this Chapter consists of 3 parts:

 

1.      A brief description of what entropy of information means and how it relates to the concept of entropy in physics.

2.      A convenient approximation for the function n! that allows us to reduce our bound to a statement about entropy, and the reduction itself.

3.      Proof of the second part of the theorem.

 

5.2 What is Entropy? What is Entropy of Information?

 

            The physical concept of entropy was developed in the subjects of thermodynamics and statistical mechanics.

 

            If we look at a small piece of a large interacting system, we will find that it will not in general be in some fixed state. Rather, it can be found in any one of a usually large number of different states. If our small piece exchanges energy with the rest of the system, we will find it in states of differing energy at different times.

 

            Its entropy is a logarithmic measure of the diversity of the states of the system that can be observed in our small subsystem. These states generally appear with a frequency proportional to a factor of exp(-E/kT), where E represents the energy of the state, T is the temperature of the system, and k here is Boltzmann's constant (not our block length). In physics the entropy is proportional to the logarithm of the sum over all states of the exponential factor just mentioned.

 

            At high temperatures the exponential factor is close to 1 for many states and the entropy is high. At low temperature, only states with close to the lowest possible energy occur much at all, and the entropy is usually very low.

 

            There are two laws of thermodynamics that involve entropy. The second law states that entropy of a system left on its own never decreases. That reflects the physical fact that states do not coalesce spontaneously, and a diverse system will retain its diversity.

 

            There is also a third law of thermodynamics which states that the entropy of a physical system approaches 0 near 0 temperature. This is of course only true of systems with very few low energy states. At low temperature only these states predominate in our subsystem and there is little diversity among the states. There is an exception to this law, and it is the substance known as water, but that is beyond our present horizon.

 

             Shannon invented the concept of entropy of information, as a logarithmic measure of the diversity among potential messages that is wiped out upon receipt of a specific message. It thus provides a measure of how much information, how many bits you actually learn when you get a message.

 

            In the case of information, if we know the frequency of our blocks, the diversity of potential messages comes entirely from the different orders in which these blocks can occur.

 

            In our case, our frequency information tells us that the number of potential messages is a certain multinomial coefficient. The log to the base 2 of this coefficient is the number of bits you must have in order to be able to distinguish that many potential messages, and is the number of bits you need to supply in order to reduce the original diversity down to one particular message.

 

 

5.3 Stirling’s Formula and Approximations for Multinomial Coefficients

 

            We have written a formula for a multinomial coefficient in terms of factorials, so we can write it and its logarithm conveniently if we can find a good approximation to n!.

 

            There is such a formula and it can be written as follows:

 

n! = (n/e)^n * W,

 

where W is a factor on the order of n^(1/2).

           

In fact we can write W = (2 pi n + pi/3)^(1/2) * (1 + O(1/n)).

 

When we take logarithms of n! the factor W  recedes into insignificance, and contributes nothing to the leading terms here. It can therefore be ignored in our present discussion.

 

(It is curious that the approximate formula implied above for n! is quite accurate even for n = 0 (if you interpret 0^0 as 1), and gets better as n increases, in that the ratio of the formula to n! approaches 1 as n increases.)
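
            A quick numerical check of this remark (my own sketch, using the simpler factor (2 pi n)^(1/2) in place of the refined W given above):

from math import factorial, e, pi, sqrt

# Ratio of n! to (n/e)^n * (2*pi*n)^(1/2); Stirling's formula says it approaches 1.
for n in (1, 2, 5, 10, 20, 50):
    approx = (n / e) ** n * sqrt(2 * pi * n)
    print(n, factorial(n) / approx)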

 

An argument for this form of approximation comes from the power series expression for the function exp(n).

 

This series is:

 

exp(n) = 1 + n + … + n^(n-1)/(n-1)! + n^n/n! + n^(n+1)/(n+1)! + …

 

 

You can verify that the terms increase up to n^(n-1)/(n-1)!, then the next term, n^n/n!, is the same, and then they start downward, at ever increasing speed, since the ratio of the kth term to the (k-1)st is n/k.

 

We can therefore represent the left-hand side, the exponential function, by the product of the largest term on the right and a measure of how many terms on the right make major contributions; in short, by (n^n/n!) * W, where W is easily seen to be less than n.

 

And we can rearrange this relation to read n! = W * n^n / exp(n) = (n/e)^n * W, as claimed, with 1 < W < n.

 

The notation O(f(n)) means that as n goes to infinity the quantity in question behaves like f(n), in that its ratio to f(n) remains bounded by a constant, while o(f(n)) means its ratio to f(n) goes to 0 as n increases.

 

What then does this tell us about the logarithm of a multinomial coefficient?

 

Consider first a binomial coefficient, C(n,np(1)).

 

This can be written as n! / ((n*p(0))! * (n*p(1))!), where p(0) = 1 - p(1).

 

Applying Stirling’s formula this becomes

 

(n/e)^n * (n*p(0)/e)^(-n*(1-p(1))) * (n*p(1)/e)^(-n*p(1)) * W_n / (W_(n*p(1)) * W_(n*(1-p(1)))),

where W_m denotes the W factor belonging to m!.

 

 And the n and e factors cancel, and we get the following expression for the binomial coefficient: (p(1)^(-p(1)) * (1-p(1))^(-(1-p(1))))^n * W_n / (W_(n*p(1)) * W_(n*(1-p(1)))).

 

If we take logs of both sides, we find that the W factors are insignificant for large n, and the remaining factor contributes n times the quantity

 

H(p_1) = -p_1 log_2 p_1 - (1-p_1) log_2 (1-p_1)

            = -p_1 log_2 p_1 - p_0 log_2 p_0.

 

And the log of the binomial coefficient is close to n*H(p_1).

 

As a function of p_1, H(p_1) is symmetric about p_1 = 1/2, is 0 at p_1 = 0 and at p_1 = 1, and is 1 at p_1 = 1/2.

 

And what about the multinomial coefficient? When one takes its logarithm we get, for each s, n times a term -p_s log_2 p_s; again the W terms can be ignored, and we get

 

H({p_s}) = -p_0 log_2 p_0 - p_1 log_2 p_1 - … - p_j log_2 p_j.

 

The log of the multinomial coefficient is roughly n*H({p_s}).
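
            A numerical illustration of this approximation in the binomial case (my own, with p_1 = 1/4): the exact log to the base 2 of C(n, n*p_1) and the approximation n*H(p_1) agree to leading order as n grows.

from math import comb, log2

def H(p1):
    # Binary entropy H(p_1) in bits; H(0) = H(1) = 0.
    return 0.0 if p1 in (0.0, 1.0) else -p1 * log2(p1) - (1 - p1) * log2(1 - p1)

p1 = 0.25
for n in (20, 200, 1000):
    print(n, log2(comb(n, round(n * p1))), n * H(p1))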

 

5.4 How can we get a code that comes close to the bound here?

 

            We can do this by defining a code, which assigns a binary sequence to each block pattern. The sequence assigned to pattern q, of length l(q), will be used a total of f(q) times in the message, and the length of the coded message will be

 

            SUM (over all q from 0 to j of) l(q)f(q)

 

 

What code should we use?

 

            We first notice that there is a simple code that can be used to meet the lower bound just described exactly, when each p(s) is the reciprocal of a power of 2.

 

             We assume now that the p(s) add up to 1; this tells us that the two rarest block patterns must have the same frequency, when the p(s) are all reciprocals of powers of 2.

 

            We can then take the two rarest patterns, assign their last bits to be 0 and 1 respectively, and from then on treat them as one combined pattern.

 

            Since their frequencies were the same, combining them as one doubles the frequency: the combined pattern's p value is twice that of either one.

 

            We can immediately deduce, by induction on the number of block patterns, that by doing this we can assign each pattern q a code word whose length l(q) is -log_2 p(q).

 

            For, by the induction hypothesis (using induction on the number of block patterns) we can assume this to be true after combining the two rarest frequencies.

 

That combination was achieved at the cost of adding one extra bit to each of their code words; but combining them also subtracted 1 from each of their -log_2 p(q) values, which came from doubling their frequency (doubling p(q) lowers -log_2 p(q) by 1).

           

            The induction hypothesis tells us we can assign these two patterns a common prefix whose length is one bit shorter than their -log_2 p(q) values, plus one extra bit to distinguish them, which gives exactly -log_2 p(q) bits in all, as desired.

 

            If we have a code like this, then the total message length is exactly the lower bound we have found from the first part of Shannon’s Theorem.
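
            Here is a sketch of the combining procedure just described (my own code, with hypothetical names), for relative frequencies p(q) that are reciprocals of powers of 2. Prepending a bit at each combining step means the bit assigned when a pattern is first combined ends up as the last bit of its code word, exactly as in the argument above.

def dyadic_code(ps):
    # ps maps each block pattern q to p(q); when every p(q) is 1/2^m for some m,
    # each pattern ends up with a code word of length exactly -log2 p(q).
    codes = {q: "" for q in ps}
    groups = [([q], p) for q, p in ps.items()]   # (patterns merged so far, combined p)
    while len(groups) > 1:
        groups.sort(key=lambda g: g[1])          # the two rarest come first
        (qs0, p0), (qs1, p1) = groups[0], groups[1]
        for q in qs0:
            codes[q] = "0" + codes[q]            # prepend a distinguishing bit
        for q in qs1:
            codes[q] = "1" + codes[q]
        groups = [(qs0 + qs1, p0 + p1)] + groups[2:]
    return codes

print(dyadic_code({0: 1/2, 1: 1/4, 2: 1/8, 3: 1/8}))
# {0: '1', 1: '01', 2: '000', 3: '001'} -- lengths 1, 2, 3, 3 = -log2 p(q)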

 

            We can also observe that given any values for p(q), we can lower them to make them reciprocals of powers of 2 at the cost, at worst, of halving them. (In reality if we lower some p(q) we would raise others to keep the sum at 1, so we would do lots better than halving each p(q). But halving each p(q) adds only 1 to each negative log, and that would affect the total message length at worst as indicated in the statement of the theorem.)

 

            So we can actually find a reasonably good code by rounding our relative frequencies p(q) down to inverse powers of 2 and assigning code words whose lengths are given by the -log_2 of the rounded values.
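
            A sketch of this rounding step (again the names are my own): lowering each p(q) to an inverse power of 2 gives a code word of length equal to the smallest integer at least -log_2 p(q), at most one bit more than -log_2 p(q) itself; and since the rounded values still sum to at most 1, code words of these lengths can indeed be assigned.

from math import ceil, log2

def rounded_lengths(ps):
    # Round each p(q) down to an inverse power of 2; the code word for pattern q
    # then gets length -log2 of the rounded value, i.e. ceil(-log2 p(q)).
    return {q: ceil(-log2(p)) for q, p in ps.items() if p > 0}

print(rounded_lengths({0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}))
# {0: 2, 1: 2, 2: 3, 3: 4}: the rounded p's are 1/4, 1/4, 1/8, 1/16, which sum to less than 1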

 

            But this code is not optimal, and we can do better by a still easier procedure which we will describe in the next section.

 

Exercises: 1.  Use a spreadsheet or other means to plot the function H(p_1) in the range 0 to 1 for p_1.

2.      On a spreadsheet or otherwise compute r(n) = n! / ((n/e)^n * n^(1/2)) for n = 2, 4, 8, 16, 32, 64, 128.

Use these results and extrapolation to estimate the limit of this ratio.

(To extrapolate, look at s(k) = 2r(2k) - r(k),

                                    then t(k) = 4s(2k)/3 - s(k)/3,

                                    then u(k) = 8t(2k)/7 - t(k)/7,

                                    then v(k) = 16u(2k)/15 - u(k)/15.)

            Compare your answer to (2 pi)^(1/2).

            3. Given the following frequencies f(q) for q = 0 to 15, compute the entropy of this list (compute their total n, the p(q), and then the entropy function of these):

                        1, 22, 11, 15, 2, 1, 2, 3, 45, 95, 2, 2, 1, 2, 25, 1