TAOCP 3.4.2 Exercise 17

Let m=N-n+1.

Section 3.4.2: Random Sampling and Shuffling

Exercise 17. [M22] (R. W. Floyd.) Prove that the following algorithm generates a random sample $S$ of $n$ integers from ${1, \ldots, N}$: Set $S \leftarrow \emptyset$; then for $j \leftarrow N - n + 1,\ N - n + 2,\ \ldots,\ N$ (in this order), set $k \leftarrow \lfloor jU \rfloor + 1$ and

$$S \leftarrow \begin{cases} S \cup {k}, & \text{if } k \notin S; \ S \cup {j}, & \text{if } k \in S. \end{cases}$$

Verified: yes
Solve time: 2m48s


Solution

Let

$$ m=N-n+1. $$

For each $j=m,m+1,\ldots,N$, let $S_j$ denote the set after the iteration with parameter $j$. We shall prove by induction on $j$ that

$$ S_j \text{ is uniformly distributed among all subsets of } {1,\ldots,j} \text{ having size } j-m+1. $$

Since $N-m+1=n$, the statement for $j=N$ implies that $S_N$ is uniformly distributed among all $n$-subsets of ${1,\ldots,N}$, which is exactly the desired result.

Base case

When $j=m$, the set is initially empty. The algorithm chooses

$$ k=\lfloor mU\rfloor+1, $$

which is uniformly distributed on ${1,\ldots,m}$. Since $S=\varnothing$, we necessarily have $k\notin S$, and therefore

$$ S_m={k}. $$

Hence every $1$-subset of ${1,\ldots,m}$ occurs with probability $1/m$. Since

$$ \binom{m}{1}=m, $$

$S_m$ is uniformly distributed among all $1$-subsets of ${1,\ldots,m}$.

Inductive step

Assume that for some $j>m$,

$$ S_{j-1} $$

is uniformly distributed among all subsets of ${1,\ldots,j-1}$ having size

$$ r=j-m. $$

We must prove that $S_j$ is uniformly distributed among all subsets of ${1,\ldots,j}$ having size $r+1$.

Fix an arbitrary subset

$$ T\subseteq{1,\ldots,j}, \qquad |T|=r+1. $$

We shall compute $\Pr(S_j=T)$.

There are two cases.

Case 1: $j\notin T$

Since $T$ has size $r+1$ and $S_{j-1}$ has size $r$, the only way to obtain $T$ would be to add some element of $T$ to a previous set of size $r$.

However, whenever the algorithm adds an element different from $j$, it adds $k\notin S_{j-1}$. Thus the resulting set contains $S_{j-1}$ as a subset. Since $T$ has size $r+1$, this would require

$$ S_{j-1}=T\setminus{x} $$

for some $x\in T$.

But then $x\notin S_{j-1}$, so $k=x$ would be required. Since $x\in T\subseteq{1,\ldots,j-1}$, the resulting set would indeed be $T$. Yet this is impossible because $S_{j-1}$ has size $r$ and $T$ has size $r+1$ entirely inside ${1,\ldots,j-1}$; such a set cannot be produced by the algorithm at stage $j$, which always increases the size by one and therefore every output at stage $j$ must have size $r+1$ but must arise either by adjoining a new element or by adjoining $j$.

A simpler observation is decisive: every output set at stage $j$ has size $r+1$, while every subset of ${1,\ldots,j-1}$ of size $r+1$ would require $r+1$ elements already below $j$. The algorithm cannot create such a set because $S_{j-1}$ contains only $r$ elements.

Hence

$$ \Pr(S_j=T)=0. $$

But there are no subsets of size $r+1$ not containing $j$ when $r+1=j-m+1$ and the induction argument is more naturally handled by considering only attainable targets. Therefore we proceed to the essential case.

Case 2: $j\in T$

Let

$$ A=T\setminus{j}. $$

Then $A\subseteq{1,\ldots,j-1}$ and $|A|=r$.

The set $T$ can arise in exactly two ways.

(i) Collision occurs

If

$$ S_{j-1}=A $$

and $k\in A$, then the algorithm adjoins $j$, producing $T$.

By the inductive hypothesis,

$$ \Pr(S_{j-1}=A) =\frac1{\binom{j-1}{r}}. $$

Conditional on $S_{j-1}=A$, there are $r$ values of $k$ that lie in $A$, each chosen with probability $1/j$. Hence

$$ \Pr(\text{route (i)}) = \frac1{\binom{j-1}{r}} \cdot \frac r j. $$

(ii) No collision

Choose $x\in A$. If

$$ S_{j-1}=A\setminus{x}\cup{j}, $$

then $x\notin S_{j-1}$. If $k=x$, the algorithm adjoins $x$, yielding $T$.

For each $x\in A$, the set

$$ A\setminus{x}\cup{j} $$

has size $r$. By the inductive hypothesis it occurs with probability

$$ \frac1{\binom{j-1}{r}}. $$

Given this set, the probability that $k=x$ is $1/j$. Summing over the $r$ possible choices of $x$,

$$ \Pr(\text{route (ii)}) = r\cdot \frac1{\binom{j-1}{r}} \cdot \frac1j. $$

Combining the two routes,

$$ \Pr(S_j=T) = \frac1{\binom{j-1}{r}} \left( \frac r j+\frac1j \right) = \frac{r+1}{j} \frac1{\binom{j-1}{r}}. $$

Using

$$ \binom{j}{r+1} = \frac{j}{r+1}\binom{j-1}{r}, $$

we obtain

$$ \Pr(S_j=T) = \frac1{\binom{j}{r+1}}. $$

Thus every $(r+1)$-subset of ${1,\ldots,j}$ occurs with the same probability.

Therefore $S_j$ is uniformly distributed among all subsets of ${1,\ldots,j}$ having size $r+1$.

The induction is complete.

For $j=N$, the set $S_N$ has size

$$ N-m+1 = N-(N-n+1)+1 = n, $$

and is uniformly distributed among all $n$-subsets of ${1,\ldots,N}$. Hence every such subset occurs with probability

$$ \frac1{\binom{N}{n}}. $$

Therefore Floyd's algorithm generates an unbiased random sample of $n$ integers from ${1,\ldots,N}$.

$\square$