TAOCP 3.4.2 Exercise 11

Let $M$ be the number of elements that are placed into the reservoir during the first pass of Algorithm R, when a reservoir of size $n$ is used on a file of $N$ records.

Section 3.4.2: Random Sampling and Shuffling

Exercise 11. ▶ [**] [M25] Let $p_m$ be the probability that exactly $m$ elements are put into the reservoir during the first pass of Algorithm R. Determine the generating function $G(z) = \sum_m p_m z^m$, and find the mean and standard deviation. (Use the ideas of Section 1.2.10.)

Verified: yes
Solve time: 2m19s


Let $M$ be the number of elements that are placed into the reservoir during the first pass of Algorithm R, when a reservoir of size $n$ is used on a file of $N$ records.

The first $n$ records are always placed into the reservoir. For each subsequent record $t$ ($n+1 \le t \le N$), Algorithm R generates a random integer in ${1,\dots,t}$; the new record is placed into the reservoir if and only if the chosen integer is at most $n$. Hence

$$ \Pr(\text{record }t\text{ is placed into the reservoir})=\frac{n}{t}. $$

Let

$$ I_t= \begin{cases} 1,&\text{if record }t\text{ is placed into the reservoir},\ 0,&\text{otherwise}, \end{cases} \qquad (n+1\le t\le N). $$

The random choices at different times are independent, so the variables $I_t$ are independent Bernoulli variables with

$$ \Pr(I_t=1)=\frac nt. $$

Therefore

$$ M=n+\sum_{t=n+1}^{N} I_t . $$

Generating function

For a Bernoulli variable $I_t$,

$$ E(z^{I_t}) =\left(1-\frac nt\right)+\frac nt,z =1+\frac nt (z-1). $$

Using the product rule for probability generating functions of independent random variables, as in Section 1.2.10,

$$ G(z)=E(z^M) =z^n\prod_{t=n+1}^{N} \left(1+\frac nt (z-1)\right). $$

Hence

$$ \boxed{ G(z) = z^n \prod_{t=n+1}^{N} \left(1+\frac nt (z-1)\right) }. $$

This is the required generating function.

Mean

Since

$$ M=n+\sum_{t=n+1}^{N} I_t, $$

linearity of expectation gives

$$ E(M) = n+\sum_{t=n+1}^{N}\frac nt = n+n\sum_{t=n+1}^{N}\frac1t. $$

Therefore

$$ \boxed{ E(M) = n\bigl(1+H_N-H_n\bigr) }, $$

where

$$ H_k=\sum_{j=1}^{k}\frac1j $$

is the $k$th harmonic number.

Variance and standard deviation

Because the $I_t$ are independent,

$$ \operatorname{Var}(M) = \sum_{t=n+1}^{N} \operatorname{Var}(I_t). $$

For a Bernoulli variable with parameter $n/t$,

$$ \operatorname{Var}(I_t) = \frac nt\left(1-\frac nt\right). $$

Hence

$$ \operatorname{Var}(M) = \sum_{t=n+1}^{N} \frac nt\left(1-\frac nt\right) = n(H_N-H_n) - n^2\sum_{t=n+1}^{N}\frac1{t^2}. $$

If

$$ H_k^{(2)} = \sum_{j=1}^{k}\frac1{j^2}, $$

then

$$ \boxed{ \operatorname{Var}(M) = n(H_N-H_n) - n^2\bigl(H_N^{(2)}-H_n^{(2)}\bigr) }. $$

Therefore the standard deviation is

$$ \boxed{ \sigma = \sqrt{ ,n(H_N-H_n) - n^2\bigl(H_N^{(2)}-H_n^{(2)}\bigr) } }. $$

Thus

$$ \boxed{ G(z) = z^n \prod_{t=n+1}^{N} \left(1+\frac nt (z-1)\right) }, $$

$$ \boxed{ E(M)=n(1+H_N-H_n) }, $$

and

$$ \boxed{ \sigma = \sqrt{ ,n(H_N-H_n) - n^2\bigl(H_N^{(2)}-H_n^{(2)}\bigr) } }. $$