TAOCP 3.4.2 Exercise 2

Let $N$ be the total number of records in the input file, and let $n$ be the number of records to be selected by Algorithm S.

Section 3.4.2: Random Sampling and Shuffling

Exercise 2. [20] Prove that Algorithm S never tries to read more than $N$ records of its input file.

Verified: yes
Solve time: 1m09s


Solution

Let $N$ be the total number of records in the input file, and let $n$ be the number of records to be selected by Algorithm S. We will prove that the algorithm never reads more than $N$ records.

Algorithm S proceeds as follows. For each record in the input file, indexed by $t$ from $1$ to $N$, the algorithm generates a random number $U$ uniformly distributed in $(0,1)$. It then selects the current record with probability

$$ p_t = \frac{n - s}{N - t + 1}, $$

where $s$ is the number of records selected so far. If the record is selected, it is included in the sample, and $s$ is incremented by one. Otherwise, the algorithm continues to the next record. The process stops when $s = n$.

We claim that the total number of records considered is at most $N$. To see this, note that at each step $t$, the algorithm either increments $s$ by 1 (selects the record) or leaves $s$ unchanged (skips the record), and then advances to the next record $t+1$. Once $t = N$, the algorithm has reached the end of the file, and no further records exist to read. Since the algorithm cannot increment $t$ beyond $N$, it is impossible for Algorithm S to attempt to read more than $N$ records.

Formally, let $t_{\max}$ denote the largest value of $t$ reached by the algorithm. Since $t$ is incremented exactly once for each record read, we have

$$ t_{\max} \le N. $$

Consequently, the algorithm never tries to read beyond the $N$th record. This bound is tight because if the algorithm selects $n$ records only after considering all $N$ records, it will indeed read exactly $N$ records.

This completes the proof.