TAOCP 3.4.2 Exercise 1
Equation (1) in Section 3.
Section 3.4.2: Random Sampling and Shuffling
Exercise 1. [M12] Explain Eq. (1).
Verified: yes
Solve time: 1m46s
Solution
Equation (1) in Section 3.4.2 gives the probability with which the $(t+1)$st record of a file should be selected in Algorithm S, in order to maintain a uniform sample of size $n$ from the first $t+1$ records. Specifically, the probability that the new record is chosen to replace an element in the current sample of size $n$ is
$$ \frac{n}{t+1}. $$
To justify this, consider that after $t$ records have been processed, each record in the current sample occurs with probability exactly $n/t$, by the inductive assumption. When the $(t+1)$st record arrives, we want the sample to remain unbiased over the $t+1$ records, so that each has probability $n/(t+1)$ of being included in the new sample.
Suppose we select the new record with probability $p$, and if selected, we replace a randomly chosen element of the existing sample. Let $R_i$ denote the event that the $i$th record remains in the sample after processing the $(t+1)$st record. Then for $1 \le i \le t$, we have
$$ \Pr(R_i) = \Pr(\text{record } i \text{ survives}) = \Pr(\text{new record not selected}) + \Pr(\text{new record selected}) \cdot \Pr(\text{record } i \text{ not replaced}) . $$
Since the new record is replaced uniformly at random among $n$ elements, we have $\Pr(\text{record } i \text{ not replaced}) = 1 - \frac{1}{n} = \frac{n-1}{n}$. Thus
$$ \Pr(R_i) = (1 - p) + p \cdot \frac{n-1}{n} = 1 - \frac{p}{n}. $$
We require that after processing the $(t+1)$st record, each of the first $t$ records has probability $n/(t+1)$ of remaining in the sample. By the inductive hypothesis, $\Pr(\text{record } i \text{ is in sample before new record}) = n/t$. Therefore, the probability that record $i$ is in the sample after processing the new record is
$$ \frac{n}{t} \cdot \left( 1 - \frac{p}{n} \right). $$
Setting this equal to the desired probability $n/(t+1)$ gives
$$ \frac{n}{t} \left( 1 - \frac{p}{n} \right) = \frac{n}{t+1} \implies 1 - \frac{p}{n} = \frac{t}{t+1} \implies p = \frac{n}{t+1}. $$
Hence the $(t+1)$st record must be selected with probability $n/(t+1)$, as stated in equation (1). This ensures that after each step of the algorithm, every record seen so far has an equal probability $n/t$ of being in the sample, maintaining uniformity.
This completes the explanation of equation (1). ∎