8. Coding for Error Correction: the Shannon Bound

 

 

8.1 Introduction and Statement of the Theorem

 

            We now consider the problem of transmitting or storing information accurately, when we have noise in our transmission device or dead spots in our storage medium, which destroy some of the information.

 

            Suppose we have a message which requires M bits to send. We mean by this that the actual message will be one among 2^M possible messages that we might send.

 

            We suppose further that each bit will be in error a certain proportion p of the time. You might imagine p to be 10^-4 or something like that.

 

            We assume that errors appear independently in our bits with probability p. We have no information, however, as to which bits will be in error when we send our message.

 

            In practice, errors often are not independent, but rather have a tendency to occur in bursts. This is useful information for us, because we can structure our procedures for handling errors so that they are less vulnerable to several errors in a small segment of the message than they are in general. For the moment we will ignore this possibility and assume independence of our errors, in the sense of probability.

 

            The assumptions we have made so far allow us to use both the (weak) Law of Large Numbers derived in the last chapter and the binomial distribution of errors discussed there as well.

 

            In this context our plan is to make our message longer than M bits, so that we have some redundancy in it. We want to use that redundancy to recognize the errors in our message, after it is sent and received or after it is retrieved from storage, and to recover our original message.

 

            We will address two questions:

 

            How much redundancy do we need to add so that we can recover the original message? In other words, how long must our message actually be?

 

            How can we actually construct useful codes that are reasonably efficient in error correction?

 

            The answer to the first question is given by Shannon's Second Theorem, which is as follows.

 

Theorem 8.1: To transmit M message bits, given a probability p that each bit is in error, requires for large M that a proportion H(p) of the transmitted bits be additional redundant bits (called check bits), where H(p) = -p log2(p) - (1-p) log2(1-p). Thus, if we let the total number of bits needed be N, we have

N = M + NH(p).

 

Not only must the code be at least this long, but almost all codes whose length exceeds this bound by cM bits, for any c > 0, allow us to recover the message under the given assumptions a proportion of the time approaching 1.
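
            To make the formula concrete, here is a minimal Python sketch (our own illustration; the function names H and total_bits are not from the text) that evaluates H(p) and solves N = M + NH(p) for the total length N. For p = 0.01 it shows that roughly 8% of the transmitted bits must be check bits.

import math

def H(p):
    """Binary entropy: -p*log2(p) - (1-p)*log2(1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def total_bits(M, p):
    """Solve N = M + N*H(p) for N, i.e. N = M / (1 - H(p))."""
    return M / (1 - H(p))

# Hypothetical example: a 1000-bit message, each bit flipped with probability 1%.
M, p = 1000, 0.01
N = total_bits(M, p)
print(round(H(p), 4))      # about 0.0808
print(round(N, 1))         # about 1087.9 total bits
print(round(N - M, 1))     # about 87.9 check bits, i.e. N*H(p)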

 

            The remainder of this chapter will consist of a proof of this theorem. We will then turn to the second question.

 

            The rest of this section consists of philosophical remarks.

 

            We will find that we can construct interesting codes that are quite efficient, but nobody has been able to discover classes of useful codes, for large M and large ranges of p, that come close to this bound.

 

            The theorem implies that a code in which the code words were chosen at random for each potential message, from among code words of the length noted above, would very probably allow us to recover our message most of the time.

 

            However, we do not know how to find such a code in general.

 

            You might ask, why not generate a code word for each potential message randomly and use that code when you need one?

 

            There are three objections to such a procedure:

 

            First, if M is large, the number of potential messages, 2^M, is far too vast for an independent choice of code word for each potential message to be at all possible. True, we can break our message up into blocks of length k and define a code word for each potential block, but even this becomes hopeless when the block size gets near 50.

 

            Second, encoding and decoding in such a code require table lookup, which means keeping the code words for every potential message available.

 

            Finally, decoding when there are errors requires additional work, because the received message with its errors will not be in the table of code words. By the Law of Large Numbers, it will differ from the sent code word by roughly pN bits. There are techniques for locating the message (these are extensively used in biology to find genes among sequences of DNA), but they are cumbersome.

           

            Thus, we will later seek codes which have lots of structure, structure that makes decoding easy even in the presence of errors. For large k and given p it becomes difficult to find codes that make this possible and also meet the Shannon bound.

 

            Finding efficient error-correcting codes is, then, a problem with the remarkable property that human beings have difficulty constructing codes of a given length that correct errors with a proportion of check bits H(p) + c for small c, while almost all randomly chosen codes of that length have this error-correcting property.

 

            In other words, for this problem (and quite a few others in computer science) random constructions are better than anything we know how to devise ourselves.

 

            This tells us a lesson that perhaps explains a bit of history. Throughout the Nineteenth and much of the Twentieth Century it was an article of faith among many intellectuals that a planned economy could and should be far more efficient than the random and chaotic economy that is built up from the bottom by the independent action of many individuals.

 

            However, when put into practice in the Twentieth Century, planned economies turned out to be not only quite irksome and occasionally deadly to the intended beneficiaries of the planning, but economic disasters as well.

 

            Perhaps then the construction of an economy is a problem like those of computer science such as finding error correcting codes, where random construction is almost always better than anything systematic we can come up with.

 

8.2 Proof of the Shannon Bound

 

            We want to prove that we need a proportion of redundant bits beyond our original M bits of at least something like H(p) in order to be able to correctly decode our message.

 

            When we receive a message we will have no idea which bits have been garbled. We will assume, however, that no bits are dropped or added; the only possible mistakes are to switch 0 and 1 here and there throughout the message.

 

            We seek a unique decoding; under these circumstances the best candidate for the true sent message is the message whose code word differs from the received word in the fewest places.

 

            The number of places in which two words of the same length differ is called the Hamming distance between them.
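
            As an aside, the Hamming distance is trivial to compute; the following short Python function (our own illustration) counts the positions at which two equal-length bit strings differ.

def hamming_distance(a, b):
    """Number of positions at which two equal-length words differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance("10110", "10011"))   # prints 2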

 

 

            The Hamming distance between the true code word and the received word is the number of errors our message was afflicted with. Now we know from the Law of Large Numbers of Chapter 7 that this will be on the order of Np.

 

            If our message is long enough, this number will be quite large. Suppose then that Np + q errors are made, with |q| much smaller than Np.

 

            We now ask, how many different ways are there to make Np+q errors, in a message of length N?

 

            The answer is the binomial coefficient C(N, Np+q). We have seen in the last chapter that we can write this binomial coefficient as

 

C(N, Np+q) = 2^(NH(p + q/N)(1 + o(1))).
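
            A quick numerical check of this approximation, using Python's exact binomial coefficients (a sketch of our own; the parameters are arbitrary):

import math

def H(x):
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

N, p, q = 10000, 0.01, 5
k = int(N * p) + q                      # Np + q errors
exact = math.log2(math.comb(N, k))      # log2 of C(N, Np+q)
approx = N * H(k / N)                   # N*H(p + q/N)
print(round(exact), round(approx), round(exact / approx, 3))   # ratio is close to 1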

            Each error pattern with this number of errors is as probable as any other. We also have no advance knowledge of which message was sent, and there were 2^M possible messages.

 

            Thus there are at least 2^(M + NH(p + q/N)(1 + o(1))) possible received words, all of which have the same probability of being received.

 

            By the pigeonhole principle, N must be at least on the order of log2 of this number of possibilities if we are to distinguish among them. And notice that if s of these possibilities collide in a single received word, we can decode that word to at most one code word, and so we will not be able to distinguish among the s different sent messages that could have produced it.

 

            For us to have a reasonable chance of decoding correctly most of the time, s should be quite small (preferably 1), and that means that N must be at least on the order of M + NH(p).

 

            This is what we set out to prove.

 

            To summarize the argument: the errors that we know will happen spread each of our 2^M messages over on the order of 2^(NH(p)) possible error-filled received words, and this implies our claim.

 

8.3 Proof that Random Codes Come Close to the Shannon Bound

 

            To prove this, we look at the sample space consisting of all possible codes (assignments of code words to potential messages) in which all code words have length N, the code word for each message is chosen with equal probability among all words of that length, and the code words for different messages are chosen independently; together with all possible distributions of errors, each bit position being in error with probability p independently.

 

            Suppose we receive a word R. We know from our weak Law of Large Numbers that the true decoding of R, the message Z, has a code word C(Z) whose Hamming distance from R is on the order of Np.

 

            Our plan will be to decode R to the nearest code word to it, which we call D.

 

            If that word is C(Z) we will be right, and otherwise we will be wrong.
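
            Before asking when this plan fails, here is a small Monte Carlo sketch of it in Python (our own illustration; the parameters M_bits, N, p, and trials are arbitrary choices, and N here is well above the Shannon bound so that decoding almost always succeeds).

import random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def simulate(M_bits=4, N=24, p=0.05, trials=500, seed=1):
    """Random code + binary symmetric channel + nearest-code-word decoding."""
    rng = random.Random(seed)
    # One random N-bit code word for each of the 2**M_bits possible messages.
    code = [[rng.randint(0, 1) for _ in range(N)] for _ in range(2 ** M_bits)]
    successes = 0
    for _ in range(trials):
        z = rng.randrange(2 ** M_bits)                            # the message actually sent
        received = [bit ^ (rng.random() < p) for bit in code[z]]  # flip each bit with prob p
        decoded = min(range(2 ** M_bits), key=lambda m: hamming(code[m], received))
        successes += (decoded == z)
    return successes / trials

print(simulate())   # usually well above 0.9 for these generous parameters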

 

            Then when will we be wrong?

 

            We will be wrong if there is some code word other than C(Z) that is closer to R than C(Z) is. So we will be right if no other code word has Hamming distance on the order of N(p+e) or less from R, for a small positive e.

 

            Let the number of words of length N having Hamming distance at most N(p+e) from R be V(N(p+e)).

 

            Since the code word for each of the other messages is chosen at random from among the 2^N possible words, it has probability 2^-N of being any particular word, and probability V(N(p+e))/2^N of being within Hamming distance N(p+e) of R.

 

            Thus the probability of that code word being too far from R to be confused with the actual sent code word is at least

 

(1 - V(N(p+e))/2^N).

 

            The probability that this happens independently for each of the 2^M - 1 other code words is the product of the probabilities of each of these events happening, or

 

 

(1 - V(N(p+e))/2^N) raised to the power 2^M - 1.

 

            And if this happens, we will decode successfully.
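
            These quantities are easy to evaluate directly. The following sketch (our own; the parameters N, M, p, and e are arbitrary) computes V(N(p+e)) by summing binomial coefficients and then evaluates the probability displayed above.

import math

def hamming_ball(N, r):
    """V(N, r): number of length-N binary words within Hamming distance r of a fixed word."""
    return sum(math.comb(N, i) for i in range(r + 1))

N, M, p, e = 80, 36, 0.05, 0.05
v = hamming_ball(N, round(N * (p + e)))
one_other_safe = 1 - v / 2 ** N            # one other code word misses the ball around R
all_safe = one_other_safe ** (2 ** M - 1)  # all 2^M - 1 other code words miss it
print(round(all_safe, 4))                  # about 0.998 with these parameters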

 

            Now V(N(p+e))/2^N is quite small. One minus it is therefore very close to e raised to the power -V(N(p+e))/2^N.

           

            We deduce, then, that we will decode successfully with probability at least

exp(-V(N(p+e))/2^N) raised to the power 2^M - 1, which (since 2^M - 1 < 2^M) is at least

exp(-V(N(p+e))/2^(N-M)).

 

Exercise: Prove log2(V(N(p+e))) is of the order of NH(p+e).
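
            A numerical sanity check of the exercise (not a proof; a sketch of our own, taking p + e = 0.1):

import math

def H(x):
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def hamming_ball(N, r):
    return sum(math.comb(N, i) for i in range(r + 1))

for N in (100, 400, 1000):
    r = round(N * 0.1)                                   # p + e = 0.1
    ratio = math.log2(hamming_ball(N, r)) / (N * H(0.1))
    print(N, round(ratio, 3))                            # ratios approach 1 as N grows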

 

            By the result you have just proven, your chance of success in decoding is at least 

 

exp(-2^-(N - M - NH(p+e))),

 

and when N exceeds M + NH(p+e) by a large amount, which means that the number of redundant bits is more than NH(p+e) for some positive e, the exponent here is close to 0 and the probability of successful decoding is close to 1.

 

            This is what we set out to prove.

 

 

            To restate the key point: whenever no other code word lies within the Hamming sphere of radius (p+e)N about R, we decode our message correctly, and the argument above shows that for a random code of the stated length this happens with probability close to 1.
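
            To get a feeling for how quickly this bound approaches 1, here is a brief evaluation in Python (a sketch of our own; the chosen rate keeps the redundancy at NH(p+e) plus 2% of N).

import math

def H(x):
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def success_lower_bound(N, M, p, e):
    """The bound exp(-2^-(N - M - N*H(p+e))) derived above."""
    return math.exp(-2.0 ** -(N - M - N * H(p + e)))

p, e = 0.05, 0.05
for N in (100, 500, 2000):
    M = int(N * (1 - H(p + e) - 0.02))    # redundancy is N*H(p+e) plus 2% of N
    print(N, M, round(success_lower_bound(N, M, p, e), 6))
# The printed bounds climb rapidly toward 1 as N grows.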