3. Finding the Median (or the key at any other rank)

We now address the following question.

We have N keys, that are completely unsorted, and wish to find the key at rank m; that is so that there are m-1 keys bigger than it, and N-m smaller than it.

When m is N/2 this key is called the median key, and that is the hardest one to find.

Obviously we can sort the whole mess, and find the median key and those of every rank with Nlog2N comparisons, as already noted.

So the question is: can we do better? Can we find the median in a linear number of comparisons, that is, in cN comparisons, with c a constant, independent of N?

If we want to do this for a small or limited sized list of keys, the answer is of course yes, because log2N will be limited by a constant. Thus if we had a thousand keys, the log is 10 and we can certainly find the median with under 10N comparisons. And if we went up to a billion keys, then 30N comparisons would tell us which key is at each rank. (actually we could get away with about half of this but who cares)

If we want to do this in a practical way, we can do so with not much more than 3N/2 comparisons, if we only want to have a very good chance of succeeding.

For, we can pick a random sample of something like N1/2 keys, find the median (or whatever proportional rank we seek) of the sample, which will require much less than N comparisons, compare that key with all the keys not in the sample, and with any luck it will not be very far from the desired rank. We can then find the exact key we want with on the order of N/2 comparisons.

How can we do this?

Well, if we the sample median turns out to have rank N/2 + d with d small compared to N, we can immediately eliminate all keys smaller than the sample median, and can use tournament sort to find the exact median in (N/2+d) + dlog2(N/2+d) comparisons or so.

The big terms here will be N from finding the rank of our special key and N/2 from running the tournament to find the winner, for a total of 3N/2.

Now we come to a harder question: suppose we want to find the median ranked key (say) "in worst case". This means we assume that we get no breaks at all: if we choose a key at random it will turn out to have rank 1 or N and me almost useless to us; any sample we choose will be equally atypical. How many comparisons will we have to make in such a case?

(Notice that in such a case, quicksort is quadratic, since if we pick our keys to insert at random, they will always turn out to place only themselves after comparing  them to all other keys. And quicksort is a very commonly used sorting algorithm so this is not really a practical question.)

In particular we ask: can we find the median in worst case with a number of comparisons that is linear in N?

The answer is yes, and in fact this can be done with at most cN comparisons with c well under 10.

We will find a crude way to do this which can be improved by various clever tricks.

How can we do this?

First notice that if we want to find the median or any other rank in at most 10N steps we can do that easily if N is less than a half a million, by having a tournament, and then, using the fact that the second biggest loses only to the biggest, find the second biggest in log2N more comparisons. Repeating that second biggest step N/2-1 times will give us the median. If we wanted a rank lower than the median we could do the same thing reversing winning and losing.

So suppose N is much greater than half a million.

Our plan is this: find a good candidate for the median, and compare that candidate with every other key. That will establish the rank of the candidate; if that is too high (higher than the rank we seek), we can eliminate every key higher than the candidate, and if the candidate is too low we can eliminate all keys of even lower rank than it. Thus we can reduce the finding the median on N keys problem to solving a smaller, finding-an-off-median-key problem.

Well then, how can we find a good candidate for the median?

We will first consider a simple plan, and then a more complicated but more efficient way to do this

Here is first our plan:

First we pick a nice odd number above 3. Say  5.

Having picked 5, we  arbitrarily split the keys into sets of cardinality 5,  and sort each set of  5 keys completely. There will be N/5 sets; (assuming N is divisible by 5; we ignore small differences otherwise.)

Then we take each of the keys of rank 3 (that is, median rank) in its set (There will be N/5 of these) and find their median, by whatever means is appropriate for N/5 keys.

That median of medians is our good candidate.

Notice, that at this point we know that this median median key ranks higher than half the other medians, N/10 of them, and therefore ranks above the 2 other members of their sets that each of these medians ranks above.

Thus we know that this median median ranks higher than 3N/10 keys, and ranks lower than another set of 3N/10 keys, by the same reasoning.

This means we can find its exact rank, by comparing it with the 2N/5 other keys not in either of these two sets.

When we have found that exact rank (or even before) we can find out whether this median median key is above or below the true median.

If our median median is above the true median than all keys above it are also above the true median, cannot be the true median and are eliminated as candidates for it.

Thus, once we know the relation between the true median and our median median, we can eliminate either those keys we know to rank higher than the median median, or those we know to rank lower than it, and we have reduced our problem to a smaller one.

In particular, here we will eliminate at least 3N/10 of the keys, so our problem will be reduced to one with at most 7N/10 keys.

Let us ask now, what can we deduce from just this information?

Let f(N) be the number of comparisons we have to make in order to find the median given N unsorted keys. We seek an upper bound on f(N) of the form  cN.

So let us assume, by induction, that for all numbers of keys, M, that are smaller than N the median can be found in with at most cM comparisons.

The first step of our procedure involved sorting each of N/5 sets of 5 keys each, which takes 8N/5 comparisons. Let us define the number of comparisons that have to be done after this step to be g(N).

We have f(N) = g(N) + 8N/5.

We have seen that g(N) is at most f(N/5), (the number of comparisons needed to find the median median) plus the number necessary to find out if the median median is too big or too small by comparing it with the 2N/5 keys left after the first step, plus the number necessary to put the 7N/10 keys left into blocks of 5 plus g(7N/10).

In the worst case, when all off the 2N/5 keys lie on the same side of the median median, we can actually detect which side the median median lies in N/10 comparisons (in general, if x keys lie on the other side in N/10+4x comparisons, but there will then be (7N/10 – x) keys to handle after this step).

Once we know where the median median goes, we will have N/10 sets of sorted 5's and N/10 sets of sorted 2's. These are the remnants of the sorted 5's after we have eliminated 3 of their members.

We can put them into sorted blocks of 5 using the two already known comparisons within each set of 5 (and N/25 sets of 5 to be sorted) with 6 more comparisons per set or 6N/25 more all together.

This tells us that g(N) is at most f(N/5) + N/10+ 6N/25 + g(7N/10).

Here we have assumed that x=0 above is the worst case, and when we are looking for a number off from the median the problem makes things easier and not harder. These are true statements but we will not verify them here. We could replace the N/10 by 2N/5 and not have these problems. We will do the calculation both ways, but realize that the one using N/10 comparisons to decide about the median median is right but the 2N/5 is all we are proving here.

Putting f(N/5) < cN/5,  above, and g(7N/10) <7cN/10 – 56N/50, we find that

g(N) is at most 9cN/10 – 39N/50, which tells us

f(N) < 9cN/10 + 41N/50.

This tells us that if c>41/5, we can deduce f(N) < 9cN/10 + NC/10 = cN.

So that our algorithm allows us to prove our bound for c at least 8.1.

The  bound proven here, in which we compare our median median with all 2N/5 others is 9.6N.

Exercises:

1. Exhibit how to sort 5 keys using 8 comparisons, and how to start from two sorted 2's and one other and completely sort them using 6 comparisons

2. give a rigorous derivation of the 11.2N bound here.

3. Show how N/10 comparisons can be enough to locate the median median with respect to the true median in the worst case (to within 1 or 2)

4. Derive a similar bound dividing into sets of 7 keys rather than the 5 used here. What bounds do you get?