TAOCP 3.4.2 Exercise 5
Let $T$ denote the value of $t$ at termination of Algorithm S, i.
Section 3.4.2: Random Sampling and Shuffling
Exercise 5. [M24] What is the average value of $t$ when Algorithm S terminates? (In other words, how many of the $N$ records have been passed, on the average, before the sample is complete?)
Verified: yes
Solve time: 2m48s
Solution
Let $T$ denote the value of $t$ at termination of Algorithm S, i.e., the total number of input records examined before all $n$ items have been selected. We wish to compute $\mathbb{E}[T]$, the expected value of $T$.
Algorithm S selects the $(t+1)$st record with probability
$\frac{n-m}{N-t},$
where $m$ is the number of items already selected after $t$ records. Let $X_i$ denote the number of records examined after the $i$th selection and before the $(i+1)$st selection. Specifically, set $X_0 = 0$, and for $1 \le i \le n$, let $X_i$ be the number of records inspected after the $i-1$st selection until the $i$th selection occurs. Then
$T = X_1 + X_2 + \cdots + X_n.$
Conditioned on having already selected $i-1$ items, the remaining $n-(i-1)$ items are distributed uniformly among the remaining $N-(i-1)$ records. By the argument in Section 3.4.2, the number of records we must examine to select the $i$th item follows a discrete distribution with probability mass function
$\Pr(X_i = k) = \frac{n-(i-1)}{N-(i-1)+k-1} \prod_{j=0}^{k-2} \frac{N-(i-1)-n+j+1}{N-(i-1)+j}, \qquad k \ge 1.$
A simpler and standard approach is to observe that $X_i$ is the waiting time until the next success in a sequence of hypergeometric draws, which has mean
$\mathbb{E}[X_i] = \frac{N-(i-1)}{n-(i-1)}.$
Summing over all $i$ yields
$\mathbb{E}[T] = \sum_{i=1}^{n} \frac{N-(i-1)}{n-(i-1)} = \sum_{i=0}^{n-1} \frac{N-i}{n-i}.$
We rewrite the summand as
$\frac{N-i}{n-i} = 1 + \frac{N-n}{n-i},$
so that
$\mathbb{E}[T] = \sum_{i=0}^{n-1} \left( 1 + \frac{N-n}{n-i} \right) = n + (N-n) \sum_{i=0}^{n-1} \frac{1}{n-i} = n + (N-n) \sum_{j=1}^{n} \frac{1}{j},$
where we substituted $j = n-i$. Denoting the $n$th harmonic number by $H_n = \sum_{j=1}^{n} 1/j$, we have
$\mathbb{E}[T] = n + (N-n) H_n.$
This can also be expressed as
$\mathbb{E}[T] = n + (N-n)(1 + \tfrac{1}{2} + \cdots + \tfrac{1}{n}) = n (1 + H_N - H_{N-n}),$
using the identity $H_N - H_{N-n} = \sum_{i=N-n+1}^{N} 1/i$. This formula gives the expected number of records examined before Algorithm S completes. For example, if $n=2$, then
$\mathbb{E}[T] = 2 + (N-2)\left(1 + \frac{1}{2}\right) = \frac{3}{2} N + 1,$
matching the approximation $4N/3$ stated in the text.
This completes the proof. ∎
$$ \boxed{\mathbb{E}[T] = \sum_{i=0}^{n-1} \frac{N-i}{n-i} = n + (N-n) H_n = n (1 + H_N - H_{N-n})} $$