3. Finding the Median (or the key at any
other rank)
We now address the following question.
We have
N keys, that are completely unsorted, and wish to find the key at rank m; that
is so that there are m-1 keys bigger than it, and N-m smaller than it.
When m is N/2 this
key is called the median key, and that is the hardest one to find.
Obviously we can sort the whole mess, and find the median
key and those of every rank with Nlog2N comparisons, as already
noted.
So the question is: can we do better? Can we find the
median in a linear number of comparisons, that is, in cN comparisons, with c a
constant, independent of N?
If we want to do this for a small or limited sized list of
keys, the answer is of course yes, because log2N will be limited by
a constant. Thus if we had a thousand keys, the log is 10 and we can certainly
find the median with under 10N comparisons. And if we went up to a billion
keys, then 30N comparisons would tell us which key is at each rank. (actually
we could get away with about half of this but who cares)
If we want to do this in a practical way, we can do so with
not much more than 3N/2 comparisons, if we only want to have a very good chance
of succeeding.
For, we can pick a random sample of something like N1/2
keys, find the median (or whatever proportional rank we seek) of the sample,
which will require much less than N comparisons, compare that key with all the
keys not in the sample, and with any luck it will not be very far from the
desired rank. We can then find the exact
key we want with on the order of N/2 comparisons.
How can we do this?
Well, if we the sample median turns out to have rank N/2 +
d with d small compared to N, we can immediately eliminate all keys smaller
than the sample median, and can use tournament sort to find the exact median in
(N/2+d) + dlog2(N/2+d) comparisons or so.
The big terms here will be N from finding the rank of our
special key and N/2 from running the tournament to find the winner, for a total
of 3N/2.
Now we come to a harder question: suppose we want to find
the median ranked key (say) "in
worst case". This means we assume that we get no breaks at all: if we
choose a key at random it will turn out to have rank 1 or N and me almost
useless to us; any sample we choose will be equally atypical. How many
comparisons will we have to make in such a case?
(Notice that in such a case,
quicksort is quadratic, since if we pick our keys to insert at random, they
will always turn out to place only themselves after comparing them to all other keys. And quicksort is a
very commonly used sorting algorithm so this is not really a practical question.)
In particular we ask: can we find the median in worst case
with a number of comparisons that is linear in N?
The answer is yes, and in fact this can be done with at
most cN comparisons with c well under 10.
We will find a crude way to do this which can be improved
by various clever tricks.
How can we do this?
First notice that if we want to find the median or any
other rank in at most 10N steps we can do that easily if N is less than a half
a million, by having a tournament, and then, using the fact that the second
biggest loses only to the biggest, find the second biggest in log2N
more comparisons. Repeating that second biggest step N/2-1 times will give us
the median. If we wanted a rank lower than the median we could do the same
thing reversing winning and losing.
So suppose N is much greater than half a million.
Our plan is this: find a good candidate for the median, and
compare that candidate with every other key. That will establish the rank of
the candidate; if that is too high (higher than the rank we seek), we can
eliminate every key higher than the candidate, and if the candidate is too low
we can eliminate all keys of even lower rank than it. Thus we can reduce the
finding the median on N keys problem to solving a smaller, finding-an-off-median-key
problem.
Well then, how can we find a good candidate for the median?
We will first consider a
simple plan, and then a more complicated but more efficient way to do this
Here is first our plan:
First we pick a nice
odd number above 3. Say 5.
Having picked 5, we arbitrarily split the keys into sets of cardinality 5, and sort each set of 5 keys completely. There will be N/5 sets;
(assuming N is divisible by 5; we ignore small differences otherwise.)
Then we take each of the keys of rank 3 (that is, median rank)
in its set (There will be N/5 of these) and find their median, by whatever
means is appropriate for N/5 keys.
That median of
medians is our good candidate.
Notice, that at this point we know that this median median
key ranks higher than half the other medians, N/10 of them, and therefore ranks
above the 2 other members of their sets that each of these medians ranks above.
Thus we know that this median median ranks higher than
3N/10 keys, and ranks lower than another set of 3N/10 keys, by the same
reasoning.
This means we can find its exact rank, by comparing it with
the 2N/5 other keys not in either of these two sets.
When we have found that exact rank (or even before) we can
find out whether this median median key is above or below the true median.
If our median median is above the true median than all keys
above it are also above the true median, cannot be the true median and are
eliminated as candidates for it.
Thus, once we know the relation between the true median and
our median median, we can eliminate either those keys we know to rank higher
than the median median, or those we know to rank lower than it, and we have
reduced our problem to a smaller one.
In particular, here we will eliminate at least 3N/10 of the
keys, so our problem will be reduced to one with at most 7N/10 keys.
Let us ask now, what can we deduce from just this information?
Let f(N) be the number of comparisons we have to make in
order to find the median given N unsorted keys. We seek an upper bound on f(N)
of the form cN.
So let us assume, by induction, that for all numbers of
keys, M, that are smaller than N the median can be found in with at most cM
comparisons.
The first step of our procedure involved sorting each of
N/5 sets of 5 keys each, which takes 8N/5 comparisons. Let us define the number
of comparisons that have to be done after this step to be g(N).
We have f(N) = g(N) + 8N/5.
We have seen that g(N) is at most f(N/5), (the number of
comparisons needed to find the median median) plus the number necessary to find
out if the median median is too big or too small by comparing it with the 2N/5
keys left after the first step, plus the number necessary to put the 7N/10 keys
left into blocks of 5 plus g(7N/10).
In the worst case, when all
off the 2N/5 keys lie on the same side of the median median, we can actually
detect which side the median median lies in N/10 comparisons (in general, if x
keys lie on the other side in N/10+4x comparisons, but there will then be
(7N/10 – x) keys to handle after this step).
Once we know where the median median goes, we will have
N/10 sets of sorted 5's and N/10 sets of sorted 2's. These are the remnants of
the sorted 5's after we have eliminated 3 of their members.
We can put them into sorted blocks of 5 using the two already
known comparisons within each set of 5 (and N/25 sets of 5 to be sorted) with 6
more comparisons per set or 6N/25 more all together.
This tells us that g(N) is at
most f(N/5) + N/10+ 6N/25 + g(7N/10).
Here we have assumed that x=0
above is the worst case, and when we are looking for a number off from the
median the problem makes things easier and not harder. These are true
statements but we will not verify them here. We could replace the N/10 by 2N/5
and not have these problems. We will do the calculation both ways, but realize
that the one using N/10 comparisons to decide about the median median is right
but the 2N/5 is all we are proving here.
Putting f(N/5) <
cN/5, above, and g(7N/10) <7cN/10 –
56N/50, we find that
g(N) is at most 9cN/10 –
39N/50, which tells us
f(N) < 9cN/10 + 41N/50.
This tells us that if c>41/5, we can deduce f(N) < 9cN/10 +
NC/10 = cN.
So that our algorithm allows
us to prove our bound for c at least 8.1.
The bound proven here, in which we compare our median median with all
2N/5 others is 9.6N.
Exercises:
1. Exhibit how to sort 5 keys using 8 comparisons, and
how to start from two sorted 2's and one other and completely sort them using 6
comparisons
2. give a rigorous derivation of the 11.2N bound here.
3. Show how N/10 comparisons can be enough to locate
the median median with respect to the true median in the worst case (to within
1 or 2)
4. Derive a similar bound dividing into sets of 7 keys
rather than the 5 used here. What bounds do you get?