TAOCP 3.4.2 Exercise 3

Section 3.4.2: Random Sampling and Shuffling

Exercise 3. ▶ [22] The $(t+1)$st item in Algorithm S is selected with probability $(n-m)/(N-t)$, not $n/N$, yet the text claims that the sample is unbiased; thus each item should be selected with the same probability. How can both of these statements be true?

Verified: yes
Solve time: 6m36s

Solution

There is no contradiction because the two probabilities refer to different events.

At step S5 of Algorithm S, suppose that $t$ records have been examined and $m$ of them have been selected. Since the final sample must contain exactly $n$ records, there remain $n-m$ selections to be made from the remaining $N-t$ records. Therefore the conditional probability that record $t+1$ is selected when it is encountered is

$$ \frac{n-m}{N-t}. $$

This quantity depends on the random variable $m$, so it is generally not equal to $n/N$.

The claim that the sample is unbiased concerns the unconditional probability that a given record belongs to the final sample. We shall show that this probability is $n/N$ for every record.

Let

$$ P_m=\Pr{\text{exactly }m\text{ selections among the first }t\text{ records}}. $$

Then

$$ \Pr{\text{record }t+1\text{ is selected}} =\sum_m P_m\frac{n-m}{N-t}. $$

Using linearity of expectation,

$$ \Pr{\text{record }t+1\text{ is selected}} =\frac{n-E[m]}{N-t}. $$

It remains to determine $E[m]$.

Let $I_j$ be the indicator variable of the event that record $j$ is selected, for $1\le j\le t$. Then

$$ m=\sum_{j=1}^{t} I_j, $$

hence

$$ E[m] =\sum_{j=1}^{t} E[I_j] =\sum_{j=1}^{t} \Pr{\text{record }j\text{ is selected}}. $$

Now prove by induction on the position of the record that every record has selection probability $n/N$.

For $t=0$, the first record is selected with probability

$$ \frac{n}{N}, $$

since $m=0$ at the beginning.

Assume that each of the first $t$ records is selected with probability $n/N$. Then

$$ E[m] =\sum_{j=1}^{t}\frac{n}{N} =\frac{nt}{N}. $$

Substituting into the previous formula gives

$$ \Pr{\text{record }t+1\text{ is selected}} =\frac{n-\frac{nt}{N}}{N-t} =\frac{n(N-t)}{N(N-t)} =\frac{n}{N}. $$

Thus record $t+1$ also has selection probability $n/N$. By induction, every record has probability $n/N$ of appearing in the final sample.

Therefore

$$ \frac{n-m}{N-t} $$

is a conditional probability that depends on the history of the algorithm, whereas

$$ \frac{n}{N} $$

is the unconditional probability that a given record belongs to the final sample. The former varies from run to run, but after averaging over all possible histories, every record has the same inclusion probability $n/N$.

Hence Algorithm S is unbiased even though the selection probability at a particular step is $(n-m)/(N-t)$.

$\square$