8. Coding for Error Correction: the Shannon Bound
We now consider the problem of transmitting or storing
information accurately, when we have noise in our transmission device or dead
spots in our storage medium, which destroy some of the information.
Suppose we want to send a message
which requires M bits. We mean by this that the actual message will be
one among 2^M possible messages that we might send.
We suppose further that each bit will be in error a certain
proportion p of the time. You might imagine p to be 10^{-4} or
something like that.
We assume that errors appear independently in our bits with
probability p. We have no information, however, as to which bits will be in
error when we send our message.
In practice, errors often are not independent, but rather have a
tendency to occur in bursts. This is useful information for us, because we can
structure our procedures for handling errors so that they are less vulnerable
to several errors in a small segment of the message than they are in general.
For the moment we will ignore this possibility and assume independence of our
errors, in the sense of probability.
The assumptions we have made so far allow us to use both the
(weak) Law of Large Numbers derived in the last
chapter, and also the binomial distribution of errors that was also discussed.
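As a small illustration of these assumptions, here is a sketch in Python (our own, with parameters chosen purely for illustration) of a channel that flips each bit independently with probability p; the number of flipped bits comes out close to Np, as the Law of Large Numbers predicts.

    import random

    def transmit(bits, p, rng=random.Random(0)):
        # Flip each bit independently with probability p (a binary symmetric channel).
        return [b ^ (1 if rng.random() < p else 0) for b in bits]

    # For a long message, the number of flipped bits should be close to N*p.
    N, p = 100_000, 0.01
    received = transmit([0] * N, p)
    print(sum(received), N * p)   # roughly 1000 flips out of 100000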
In this context our plan
is to make our message longer than M bits, so that we have some redundancy in
it. We want to use that redundancy to recognize the errors in our message,
after it is sent and received, or after it is
retrieved from storage, and recover our original message.
We will address two questions:
How much redundancy need we add here so that we can recover
the original message? In other words, how long must our message actually
be?
How can we actually construct useful codes for error correction
for which the encoding and decoding operations are reasonably efficient?
The answer to the first question is given by Shannon's Second
Theorem, which states that the total length N of the encoded message must obey, at least asymptotically,

N = M + NH(p),

where H(p) is the binary entropy function; equivalently, N must be at least roughly M/(1 − H(p)).
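In concrete terms the bound says that N(1 − H(p)) must be at least M. A quick sketch of the arithmetic (our own illustration, with H the binary entropy function):

    from math import log2

    def H(p):
        # Binary entropy, in bits per transmitted bit.
        return -p * log2(p) - (1 - p) * log2(1 - p)

    def shannon_length(M, p):
        # Smallest N satisfying N = M + N*H(p), i.e. N = M / (1 - H(p)).
        return M / (1 - H(p))

    print(H(1e-4))                      # about 0.00147
    print(shannon_length(1000, 1e-4))   # about 1001.5 bits to send 1000 bits reliably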
The next part of this section consists of more informal and philosophical remarks; the proof of the bound follows them.
We will find that though Shannon proved his bound in 1948 with a relatively
easy argument, for fifty years after this nobody was able to
discover classes of useful codes, for large M and a wide range of p, which
come close to this bound. The theorem says, though, that almost all
codes with code words of the right length actually work.
The theorem implies that a code in which the code words were
chosen at random for each potential message, from among code words of the
length noted above, would very probably allow us to recover our message most of
the time.
However, finding such codes that could be used in practice was much
harder than proving that they existed. It was only in the last decade
that codes coming fairly close to the bound (turbo codes and low density
parity check codes) were constructed. Even now, nobody can
prove that these codes can be made to reach Shannon's bound in the
asymptotic limit, although it has been proved that if you tweak the
parameters in low density parity check codes in the right way, you can
come very close. And for both of these classes of codes, randomization
is still used in their constructions, although unlike completely random
codes they have efficient algorithms for encoding and decoding.
Nobody knows how to construct codes coming close to Shannon's bound
without using randomness.
You might ask, why not generate a code word for each potential
message randomly and use that code when you need one?
There
are three objections to such a procedure:
First, if M is large, the
number of potential messages, 2^M, is far too vast for us to make an
independent choice of code word for each potential message. We can, of course,
break our message up into blocks of length k and define a code word for
each potential block, but even this becomes hopeless when the block size gets near
50.
Second, encoding and
decoding in such a code requires table look up, which requires keeping the code
words for every potential message available. This would require immense storage
space.
Finally, decoding when there are errors requires additional work,
because the received message with its errors will not be in the table of code
words. By the law of large numbers, it will differ from the sent code word by
roughly pN bits. There are techniques for locating
the message (these are extensively used, for example in biology, to find genes
among sequences of DNA), but they are cumbersome.
Thus, we will actually seek codes which have lots of structure
and that make decoding easy even in the presence of errors. For large k and
given p it becomes difficult to find codes that both make this possible and
meet the Shannon bound.
Finding efficient error correcting codes is a problem that has,
then, the remarkable property that human beings have difficulty constructing
codes of a given length that correct errors with a proportion of check bits
H(p) + c for small c, while almost all randomly chosen codes of this length have
the error correcting property.
In other words, for this problem (and quite a few others in
computer science) random constructions are better than anything we know how to
devise ourselves.
This tells us a lesson
that perhaps explains a bit of history. Throughout the Nineteenth and much of the
Twentieth Century it was an article of faith among many intellectuals that a
planned economy could and should be far more efficient than the random and
chaotic economy that is built up from the bottom by the independent action of
many individuals.
However, when put into practice in the Twentieth Century, planned
economies turned out to be not only quite irksome and occasionally deadly to
the intended beneficiaries of the planning, but economic disasters as well.
Perhaps then the
construction of an economy is a problem like those of computer science such as
finding error correcting codes, where a random construction developed locally by
a host of individual decisions is almost always better than anything systematic
we can come up with and impose from above.
We now want to prove that, beyond our original M bits, we need a proportion of
redundant bits of at least something like H(p)
in order to be able to decode our message correctly.
When we receive a message we will have no idea which bits have
been garbled. We will assume, however, that no bits are dropped or added; the
only possible mistakes are to switch 0 and 1 here and there throughout the
message.
We seek the optimal decoding; under these circumstances the best
candidate to be the true sent message is the message whose code word differs
from the received word in the fewest places.
The number of places in which two
words of the same length differ is called the Hamming distance between
them.
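A minimal sketch of this definition (our own code, not part of the argument):

    def hamming_distance(u, v):
        # Number of positions at which two words of equal length differ.
        assert len(u) == len(v)
        return sum(a != b for a, b in zip(u, v))

    print(hamming_distance("10110", "10011"))   # prints 2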
The Hamming distance between the
true code word and the received word is the number of errors our message was
afflicted with. Now we know from the Law of Large Numbers of Chapter 7 that
this will be on the order of Np.
If our message is long enough, this number will be quite large. Suppose then that Np + q errors
are made, for |q| much smaller than Np.
We now ask: how many different ways are there to make Np+q errors in a message of length N?
The answer is the binomial coefficient C(N, Np+q). We have seen in the last chapter that we can
write this binomial coefficient as

C(N, Np+q) = 2^{NH(p+q/N)(1+o(1))}.

We have 2^M possible sent messages, each of which can accumulate errors in 2^{NH(p+q/N)(1+o(1))} different ways, all of which are equally likely. Suppose we had fewer than 2^{M + NH(p+q/N)(1+o(1))} possible received words. Then the pigeonhole principle tells us that some received words must arise from more than one message. And if there are substantially fewer than this number of received words, then our discussion of probability tells us that on average, each received word will arise from several messages.
By the pigeonhole principle, then, N must be at least on the order of log_2
of this number of possible received words in order to distinguish among them.
For us to have a reasonable chance of decoding, N must therefore be at
least on the order of M + NH(p), which is what we set out to prove.
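To see numerically how the entropy enters this count, one can compare log_2 of the binomial coefficient with NH(p) for a moderate N (a sketch with parameters of our own choosing):

    from math import comb, log2

    def H(x):
        # Binary entropy function.
        return -x * log2(x) - (1 - x) * log2(1 - x)

    N, p = 10_000, 0.01
    k = int(N * p)              # the expected number of errors, Np
    print(log2(comb(N, k)))     # about 803
    print(N * H(p))             # about 808; the ratio tends to 1 as N grows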
To summarize the argument, the errors that we know will happen
will spread each of our 2^M messages to on the order of 2^{NH(p)} possible
erroneous messages, and this implies our claim.
To prove the complementary claim, that random codes of roughly this length will
very probably succeed, we look at the sample space consisting of all
possible codes (assignments of code words to potential messages) in which all
code words have length N, with all possible words having the same probability
of being a code word, and code words for different messages chosen
independently.
We assume further that each bit of the received word acquires an error with probability p, and that errors on different bits are independent.
Suppose we receive a word
R. We know from our weak law of large numbers that the true decoding of R, the message Z, has a code word C(Z) whose Hamming
distance from R is on the order of Np.
Our plan will be to decode R to the nearest code word to it that we find, which we call D.
If that word is C(Z) we will be right,
and otherwise we will be wrong.
When, then, will we be wrong?
We will be wrong if some code word other than C(Z) is
closer to R than C(Z) is. So we will be right if no other code word has Hamming
distance on the order of N(p+ε) or less from R.
Let the number of words of
length N having Hamming distance at most N(p+ε) from R be V(N(p+ε)).
Since the code word for each message other than Z is chosen at random
from among the 2^N possible words, it has probability
2^{-N} of being any single word, and
probability V(N(p+ε))/2^N of
being within Hamming distance at most N(p+ε)
from R.
Thus the probability that this code word is too far
from R to be confused with the actual sent code word is at least

1 − V(N(p+ε))/2^N.

The probability that this happens for each of
the 2^M − 1 other code words is, by independence, the product of the probabilities of
each of these events happening, or

(1 − V(N(p+ε))/2^N)^{2^M − 1}.
And if this happens, we will decode
successfully.
Now V(N(p+ε))/2^N
is quite small. One minus it is therefore very close to
e^{−V(N(p+ε))/2^N}.
We deduce then that we will decode successfully with probability
at least

e^{−(2^M − 1) V(N(p+ε))/2^N}.

By the result you will have proven by
performing this exercise, your chance of success in decoding is at least roughly

e^{−2^{M + NH(p+ε) − N}}.
When N>>M+NH(p+
ε) holds, which means that the number of
redundant bits is more than NH(p+ε), the exponent here is
close to 0 and the probability of successful decoding is close to 1.
This is what we set out to prove.
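To get a feeling for this bound, one can evaluate it for small parameters (chosen by us purely for illustration; the theorem itself concerns much larger N):

    from math import comb

    def ball_volume(N, r):
        # V(r): the number of length-N words within Hamming distance r of a fixed word.
        return sum(comb(N, i) for i in range(r + 1))

    M, N, p, eps = 20, 60, 0.05, 0.05
    r = int(N * (p + eps))
    V = ball_volume(N, r)
    success = (1 - V / 2**N) ** (2**M - 1)
    print(success)   # about 0.99995

Here N = 60 comfortably exceeds M + NH(p + ε) ≈ 48, so even at this small size the estimate already puts the probability of successful decoding close to 1.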
If no other code word is in the Hamming sphere
of radius (p+ε)N
around R, we will decode our message correctly. That is the long and the short of
the argument, which is fleshed out by using probability theory to show that the
probability of this approaches one if we have enough check bits.
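As a concrete and entirely optional illustration of the whole argument, one can simulate a small random code over such a channel and decode each received word to the nearest code word. The parameters and names below are our own, and the block length is tiny compared with the regime the theorem really addresses, but the decoder already succeeds most of the time.

    import random

    def hamming(u, v):
        return sum(a != b for a, b in zip(u, v))

    def simulate(M=4, N=20, p=0.05, trials=2000, rng=random.Random(1)):
        # Random code: an independent random length-N code word for each of the
        # 2^M messages, sent over a channel that flips each bit with probability p,
        # and decoded to the nearest code word.
        code = [tuple(rng.randint(0, 1) for _ in range(N)) for _ in range(2 ** M)]
        correct = 0
        for _ in range(trials):
            msg = rng.randrange(2 ** M)
            received = tuple(b ^ (1 if rng.random() < p else 0) for b in code[msg])
            decoded = min(range(2 ** M), key=lambda m: hamming(code[m], received))
            correct += (decoded == msg)
        return correct / trials

    print(simulate())   # typically well above 0.9 for these toy parameters

Here M + NH(p) is well below N, so the theorem leads us to expect mostly correct decoding, and that is what the simulation shows.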