Some comments on Finding a Median

We will actually try to find the element of rank R for any rank R. We will use the notation x<y when x ranks lower than y, though that might be considered the opposite of usual notation.

Suppose we choose to do so by dividing our keys into blocks of size 5 (with perhaps a remainder), sorting each of these, and using the median of their middle elements as a comparison key.

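If it takes f(N) comparisons to find the hardest-to-find rank R(N), we get the following recursive inequality:

f(N) ≤ the number of comparisons needed to sort each block of 5
 + the number needed to find the median of the middle elements of the blocks
 + the number needed to find the exact rank of that median
 + the number of steps needed to find rank R among those keys on the same side as R from that median.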

This yields us at worst

f(N) ≤ 7N/5 + f(N/5) + 2N/5 + f(7N/10),

since we can completely sort each 5-tuple with 7 comparisons, can find the median of the N/5 middles of the 5-tuples with f(N/5) comparisons, can compare that median with the remaining keys with 2N/5 comparisons to find its exact rank, and can then eliminate the 3N/10 keys that are on the opposite side of the median from R and find the key of the appropriate rank among the 7N/10 remaining keys with f(7N/10) comparisons. We refer to these below as steps 1 through 4, and to the median of the middles as x.
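The four steps can be sketched in code as follows. This is only an illustration under some assumptions of ours: the name select is hypothetical, ranks are 0-based, Python's built-in sorted stands in for the 7-comparison sort of a 5-tuple, and step 3 compares x with every key rather than only the 2N/5 keys whose relation to x is still unknown, so the sketch does not achieve the comparison counts discussed here.

```python
def select(keys, rank):
    """Return the key of the given 0-based rank (0 = lowest) in keys.

    Assumes keys is a non-empty list and 0 <= rank < len(keys).
    """
    if len(keys) <= 5:
        return sorted(keys)[rank]

    # Step 1: sort each block of 5 (the last block may be a remainder).
    blocks = [sorted(keys[i:i + 5]) for i in range(0, len(keys), 5)]

    # Step 2: find the median x of the middle elements, recursively.
    middles = [b[len(b) // 2] for b in blocks]
    x = select(middles, len(middles) // 2)

    # Step 3: compare x with the keys to find its exact rank.
    below = [k for k in keys if k < x]
    above = [k for k in keys if k > x]
    equal = len(keys) - len(below) - len(above)  # copies of x itself

    # Step 4: keep only the side of x that contains the wanted rank.
    if rank < len(below):
        return select(below, rank)
    if rank < len(below) + equal:
        return x
    return select(above, rank - len(below) - equal)
```

For example, select(keys, len(keys) // 2) returns a median of keys.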

And we can massage this and an appropriate induction hypothesis to get a bound on f(N).

We can improve this bound by noticing that once we complete the third of these steps, if we eliminate only those keys that we know from the second step are on the opposite side of x from rank R, we will end up with roughly N/10 sets of 5 keys that are each still sorted and the same number of sorted pairs of keys.

To apply the last step starting with this, we do not have to do the first step above in its entirety; some of our keys will still be in sorted 5-tuples, and we only have to regroup the N/10 pairs we have into 5s. This will take at most 5 comparisons for each 5 produced (since the ordered pairs already give us the first two comparisons out of the 7 needed to sort 5s), and there will be N/5 keys that have to be regrouped. This means only N/5 comparisons will be needed to reach the second step on the 7N/10 remaining keys, instead of the 49N/50 that we would have to perform if we had started from scratch in sorting into 5-tuples.

This gives us

f(N) ≤ 7N/5 + f(N/5) + 2N/5 + f(7N/10) − (49/50)N + N/5,

or

f(N) ≤ (51/50)N + f(N/5) + f(7N/10).
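The arithmetic behind the 51/50 can be checked with exact fractions. The variable names below are ours, and each value is the coefficient of N for the corresponding term, ignoring remainders.

```python
from fractions import Fraction

sort_fives = Fraction(7, 5)                       # 7 comparisons per 5 keys
rank_of_x = Fraction(2, 5)                        # 2 keys per 5-tuple compared with x
scratch_resort = Fraction(7, 5) * Fraction(7, 10) # sorting 7N/10 keys into 5s from scratch
regroup_pairs = Fraction(5, 5) * Fraction(1, 5)   # 5 comparisons per new 5-tuple, N/5 keys

linear_term = sort_fives + rank_of_x - scratch_resort + regroup_pairs
print(scratch_resort, linear_term)  # 49/50 51/50
```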

Before continuing to try to improve this inequality, let us notice how to apply it.

We make the assumption that for all values of M less than N, we have f(M) < cM. (We do not know yet what c is but so what?)

Then if we apply this assumption on the right here we get

f(N) ≤ 51N/50 + 9cN/10.

This means that if we have c ≥ 51/5 (that is, c ≥ 10.2), then 51N/50 is at most cN/10, and we can conclude (by substituting this statement into the last inequality) that f(N) ≤ cN.
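The same substitution works for any inequality of the form f(N) ≤ aN + f(b1 N) + f(b2 N) + ... in which the b's add up to less than 1: the hypothesis f(M) < cM closes as soon as c ≥ a/(1 − b1 − b2 − ...). A small sketch of that calculation, where the helper name smallest_constant is ours and remainders and base cases are ignored:

```python
from fractions import Fraction

def smallest_constant(linear, shrink_factors):
    """Smallest c with linear*N + c*sum(shrink_factors)*N <= c*N.

    linear: coefficient of N in the non-recursive part of the bound.
    shrink_factors: the fractions b_i in the recursive terms f(b_i * N).
    """
    total = sum(shrink_factors, Fraction(0))
    if total >= 1:
        raise ValueError("recursive terms do not shrink the problem")
    return Fraction(linear) / (1 - total)

c = smallest_constant(Fraction(51, 50), [Fraction(1, 5), Fraction(7, 10)])
print(c, float(c))  # 51/5 10.2
```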

Now we try to improve this bound.

The third step involves comparing our median middle x with 2 ordered keys out of each sorted 5. We want to make use of the ordering of these keys to reduce our bound on the number of comparisons.

Suppose that in performing the third step we keep track of three numbers:

1. The lowest possible rank for x; call it L

2. The rank R that we are seeking

3. The highest possible rank for x; call it H

At the beginning of step three these are (roughly) L=3N/10, R=R, H=7N/10. If R is not between L and H we can omit step 3 entirely; for example, if R is greater than H nothing below L can be R and we can eliminate all keys below L.

We want to find how x ranks in comparison to the two keys in each 5-tuple for which we do not already know this. When some key is found to be less than x, we can increase L by 1, and when the key is greater than x we can lower H by 1. We can stop step 3 when we find either H=R or L=R. If we find H=R, we know that our median x is at or below R and anything below x cannot be R. Thus everything that is L or below can be eliminated.

Suppose we compare x with the lower of the two relevant keys in a given 5-tuple. That is, if the 2 keys are y and z and y<z holds, suppose we compare x with y. If we find x<y, then we decrease H by 2 with only one comparison.

We want to make use of this fact to improve our estimates. To do so we keep track of which of N-H and L is smaller. If L is smaller than N-H, we compare x with y. Then, if x is less than y, we find out by transitivity that x is less than z as well, and H decreases by 2 from only one comparison. If x is greater than y, then L increases by only 1, but that is good for us because L was smaller than N-H. We go on and compare x with the appropriate key from the next 5-tuple.

When N-H is smaller than L, we compare x with z first, with similar conclusions.

When L and N-H are equal, the one that increases by 1 will necessarily become the bigger of the two, and the smaller will stay the same.
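One pass of this rule over the 5-tuples can be sketched as follows. The function name bracket_rank and its parameters are ours: pairs holds the two undecided keys (y, z) of each 5-tuple with y<z, and low and high play the roles of L and H. The sketch assumes all keys are distinct, breaks ties between L and N-H by comparing with y, and stops as soon as L reaches R or H comes down to R.

```python
def bracket_rank(x, pairs, n, rank, low, high):
    """Compare x once with each pair until low >= rank or high <= rank.

    Returns the updated (low, high) and the number of comparisons used.
    """
    comparisons = 0
    for y, z in pairs:
        if low >= rank or high <= rank:
            break                      # step 3 can stop here
        known_above = n - high         # this is N-H in the text
        comparisons += 1
        if low <= known_above:
            # L is the smaller (or tied): compare x with the lower key y.
            if x < y:                  # then x < z as well
                high -= 2
            else:
                low += 1
        else:
            # N-H is the smaller: compare x with the higher key z.
            if z < x:                  # then y < x as well
                low += 2
            else:
                high -= 1
    return low, high, comparisons

# N = 20, so L starts at 3N/10 = 6 and H at 7N/10 = 14.
print(bracket_rank(10, [(1, 2), (11, 12), (3, 15), (4, 5)], 20, 10, 6, 14))
# (10, 13, 4)
```

A pair in which only one key was decided (for instance y<x with z unknown) is simply left for later in this sketch.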

Suppose, in applying all this, we compare x once in each of A 5-tuples, encounter B ties between L and N-H, and in C comparisons other than ties the smaller of the two increases by 1. (We suppose that in the case of ties we always have bad luck, and the comparison never produces an increase of 2.)

Then the bigger of the two will increase by 2A-2C-B, and the smaller will increase by C. By definition, 2A-2C-B is greater than C.

If at the end of these comparisons we have L bigger than N-H, we find L will have grown to 3N/10+2A-2C-B, and H will have descended to 7N/10-C.

We also know that B cannot exceed C+1, since there cannot be two ties at the same value of H. So we assume the worst, that B=C+1. We find that after A comparisons, L will have increased to 3N/10+2A-3C-1 and H will have decreased to 7N/10 − C.

Notice that if we assume the worst, namely that step 3 ends with R=L, the task remaining in step 4 will cost f(7N/10 − C). We then have R=3N/10+2A-3C-1.

These statements allow us to determine the worst case number of comparisons needed to perform steps 3 and 4 as a function of C, R and N, but we also have to consider the increased cost of reforming 5-tuples after step 3. It seems that this cost can be held to 2C.

Putting all this together we get, for R at least N/2, a bound on the number of steps of

f(N) < 7N/5 + f(N/5) + (R-3N/10)/2 + f(7N/10-C) − 49N/50 + N/5 + 7C/5 + 3C + 2C.

Finally, notice that after performing these steps once, the worst case performance of these steps means that in dealing with the 7N/10 − C remaining keys in step 4 there will be no step 3. That is, if we start off with R=N/2, which means we are seeking the median, then after we eliminate at least 3N/10 keys on one side we will have M keys left, with M at most 7N/10, and will be looking for a rank of 2M/7 or less (or M minus that number). But 2M/7 is less than 3M/10, so no step 3 will be necessary next time.

In fact, if R lies between 3N/10 and 7N/10, lopping off from the side that R is closer to will put R outside that range on the next iteration. On the other hand, if we lop off from the further side, so that the R in the next iteration is not outside that range, step 3 will take fewer steps, and we can bound (R-3N/10) above by N/5.

When C is 0, we get the worse of two possible bounds. The first applies if we eliminate keys so that the reduced problem has an R that lies between 3M/10 and 7M/10, and the second applies otherwise. In the second, the reduction is iterated.

f(N) < 36N/50 + f(N/5) + f(7N/10)

and f(N) < 41N/50 + f(N/5) + f(7N/50) + f(49N/100) + (31/50)(7N/10)

I seem to get something like f(N) < 8.2N (but this is probably wrong).

By taking derivatives of the expressions above you can show that the worst case occurs when C=0.
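As a rough check on the constants, the two C=0 inequalities can be solved for c under the same induction hypothesis f(M) < cM used earlier. This only evaluates the inequalities as written, ignoring remainders, rounding and base cases, so it does not settle what the right constant is; the variable names are ours.

```python
from fractions import Fraction

# First inequality: f(N) < 36N/50 + f(N/5) + f(7N/10)
c1 = Fraction(36, 50) / (1 - Fraction(1, 5) - Fraction(7, 10))

# Second inequality:
# f(N) < 41N/50 + f(N/5) + f(7N/50) + f(49N/100) + (31/50)(7N/10)
linear = Fraction(41, 50) + Fraction(31, 50) * Fraction(7, 10)
shrink = Fraction(1, 5) + Fraction(7, 50) + Fraction(49, 100)
c2 = linear / (1 - shrink)

print(c1, c2)  # 36/5 627/85
```

Taken at face value, the worse of the two is 627/85, a little under 7.4.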