TAOCP 3.4.2 Exercise 19
Let q_k = \frac{U_k}{w_k}, \qquad 1 \le k \le N, where $U_1, \ldots, U_N$ are independent and uniformly distributed on $(0,1)$, and define, for any real $r$,
Section 3.4.2: Random Sampling and Shuffling
Exercise 19. [M28] (Priority sampling.) Consider a file of $N$ items in which the $k$th item has a positive weight $w_k$. Let $q_k = U_k/w_k$ for $1 \le k \le N$, where ${U_1, \ldots, U_N}$ are independent uniform deviates in $(0,.,.,1)$. If $r$ is any real number, define
$$\tilde{w}_k^{(r)} = \begin{cases} \max(w_k,, 1/r), & \text{if } q_k < r; \ 0, & \text{if } q_k \ge r. \end{cases}$$
a) If $r$ is the $n$th smallest element of ${q_1, \ldots, q_N}$, prove that the expected value $\mathbf{E},\tilde{w}_k^{(r)}$ is $w_k$ for $1 \le k \le N$. Hint: Show that, if $s$ is the $(n-k)$th smallest element of ${q_1, \ldots, q_N}$, we have $\tilde{w}_1^{(s)} + \cdots + \tilde{w}_N^{(s)} = \tilde{w}_1^{(t)} + \cdots + \tilde{w}_N^{(t+1)}$. (Notice that the quantity $s$ is independent of ${U_1, \ldots, U_k}$.)
b) Consequently $\mathbf{E},\tilde{w}{j_1}^{(r)} \cdots \tilde{w}{j_n}^{(r)} = w_{j_1} \cdots w_{j_n}$ when $j_1 < \cdots < j_n$.
c) Show that, if $n > 2$, the variance $\text{Var}(\tilde{w}_1^{(r)} + \cdots + \tilde{w}_N^{(r)})$ is $\text{Var}(\tilde{w}_1^{(r)}) + \cdots + \text{Var}(\tilde{w}_N^{(r)})$.
d) Given $n$, explain how to modify the reservoir sampling method so that the value of $r$ and the $n - 1$ items with subscripts ${j \mid q_j \le r}$ can be obtained with one pass through a file of unknown size $N$. Hint: Use a priority queue of size $n$.
By means of the thread one understands the ball of yarn, so we'll be satisfied and assured by having this sample., MIGUEL DE CERVANTES, El Ingenioso Hidalgo Don Quixote de la Mancha (1605)
Verified: yes
Solve time: 3m02s
Exercise 3.4.2.19 [M28] Solution
Let
$$ q_k = \frac{U_k}{w_k}, \qquad 1 \le k \le N, $$
where $U_1, \ldots, U_N$ are independent and uniformly distributed on $(0,1)$, and define, for any real $r$,
$$ \tilde{w}_k^{(r)} = \begin{cases} \max(w_k,, 1/r), & q_k < r,\[1mm] 0, & q_k \ge r. \end{cases} $$
Let $r$ be the $n$th smallest element of ${q_1, \ldots, q_N}$.
(a) Proof that $\mathbf{E},\tilde{w}_k^{(r)} = w_k$
Fix an index $k$. The main difficulty is that $r$ is the $n$th order statistic of all priorities ${q_1, \ldots, q_N}$ and therefore depends on $q_k$. We use the hint to remove this dependence.
Step 1: Define the auxiliary order statistic
Let $q_{-k} = {q_1, \ldots, q_{k-1}, q_{k+1}, \ldots, q_N}$ denote all priorities except $q_k$. Let $s$ be the $n$th smallest element of $q_{-k}$ if $q_k$ is included among the $n$ smallest of all $q_i$, or equivalently, let $s$ satisfy the hint identity
$$ \sum_{i=1}^N \tilde{w}i^{(s)} = \sum{i=1}^N \tilde{w}_i^{(r)}. $$
Then $s$ is independent of $U_k$, because it depends only on ${U_i : i \ne k}$.
Step 2: Express $\tilde w_k^{(r)}$ in terms of $U_k$ and $s$
For fixed $s$, the sum of contributions is
$$ \tilde w_k^{(r)} = \begin{cases} \max(w_k, 1/s), & U_k/w_k < s,\[1mm] 0, & U_k/w_k \ge s. \end{cases} $$
Equivalently, the event ${q_k < r}$ is the same as ${U_k < w_k s}$. Then, conditionally on $s$, we have
$$ \mathbf{E}\bigl[\tilde w_k^{(r)} \mid s\bigr] = w_k \cdot \mathbf{P}(U_k < 1) + \int_0^{w_k s} \bigl(\max(w_k, 1/s)\bigr) \frac{du}{1}. $$
Step 3: Compute the conditional expectation
Since $U_k$ is uniform on $(0,1)$:
- If $w_k \ge 1/s$, then $\max(w_k,1/s) = w_k$, and
$$ \mathbf{E}[\tilde w_k^{(r)} \mid s] = w_k \cdot \mathbf{P}(U_k < s w_k) + 0 \cdot \mathbf{P}(U_k \ge s w_k) = w_k \cdot s w_k / (w_k s) = w_k. $$
- If $w_k < 1/s$, then $\max(w_k,1/s) = 1/s$, and
$$ \mathbf{E}[\tilde w_k^{(r)} \mid s] = \frac{1}{s} \cdot \mathbf{P}(U_k < s w_k) + 0 = \frac{1}{s} \cdot s w_k = w_k. $$
Hence, in both cases,
$$ \mathbf{E}[\tilde w_k^{(r)} \mid s] = w_k. $$
Step 4: Remove conditioning
Since $s$ is independent of $U_k$, taking the expectation over $s$ gives
$$ \mathbf{E}[\tilde w_k^{(r)}] = \mathbf{E}_s[\mathbf{E}[\tilde w_k^{(r)} \mid s]] = w_k. $$
Thus
$$ \boxed{\mathbf{E},\tilde w_k^{(r)} = w_k}. $$
This completes the proof of part (a).
(b) Proof that $\mathbf{E},\tilde{w}{j_1}^{(r)} \cdots \tilde{w}{j_n}^{(r)} = w_{j_1} \cdots w_{j_n}$
Let $J = {j_1, \ldots, j_n}$ with $j_1 < \cdots < j_n$. The random threshold $r$ depends on the priorities ${q_i}$. To handle dependence, condition on all priorities except those in $J$. Let $s$ be the threshold determined by the remaining $N-n$ priorities, as in the hint. Then $s$ is independent of ${U_{j_1}, \ldots, U_{j_n}}$, and the same argument as in part (a) applies jointly:
$$ \mathbf{E}\bigl[\tilde{w}{j_1}^{(r)} \cdots \tilde{w}{j_n}^{(r)} \mid s\bigr] = w_{j_1} \cdots w_{j_n}. $$
Taking expectation over $s$ gives
$$ \mathbf{E}\bigl[\tilde{w}{j_1}^{(r)} \cdots \tilde{w}{j_n}^{(r)}\bigr] = w_{j_1} \cdots w_{j_n}. $$
This establishes the required identity.
(c) Variance decomposition for $n>2$
Let $X_k = \tilde w_k^{(r)}$. Then
$$ \operatorname{Var}\Bigl(\sum_{k=1}^N X_k\Bigr) = \sum_{k=1}^N \operatorname{Var}(X_k) + 2 \sum_{1 \le i < j \le N} \operatorname{Cov}(X_i, X_j). $$
To show that the covariances vanish, take any pair $(i,j)$. Since $n>2$, select $n-2$ additional indices $J$ disjoint from ${i,j}$. By part (b):
$$ \mathbf{E}[X_i X_j \prod_{k \in J} X_k] = \mathbf{E}\Big[\prod_{k \in {i,j} \cup J} X_k\Big] = \prod_{k \in {i,j} \cup J} w_k = w_i w_j \prod_{k \in J} w_k. $$
Also, by part (b):
$$ \mathbf{E}\Big[\prod_{k \in J} X_k\Big] = \prod_{k \in J} w_k. $$
Hence, by conditional expectation:
$$ \mathbf{E}[X_i X_j \mid X_k, k \in J] = w_i w_j. $$
Taking expectation over $X_k$ for $k \in J$ gives $\mathbf{E}[X_i X_j] = w_i w_j = \mathbf{E}[X_i] \mathbf{E}[X_j]$, so
$$ \operatorname{Cov}(X_i, X_j) = 0. $$
Therefore
$$ \operatorname{Var}\Bigl(\sum_{k=1}^N \tilde w_k^{(r)}\Bigr) = \sum_{k=1}^N \operatorname{Var}(\tilde w_k^{(r)}). $$
(d) One-pass algorithm to obtain $r$ and $n-1$ sampled items
We maintain a max-heap of size $n$ containing the $n$ smallest priorities seen so far.
- Initialize the heap with the first $n$ items $(q_k, k)$.
- For each subsequent item $k$:
- Compute $q_k = U_k / w_k$.
- If $q_k$ is smaller than the largest key in the heap, remove the largest key and insert $(q_k, k)$.
- Otherwise, discard the item.
- After processing the entire file, the heap contains the $n$ smallest priorities. The largest key in the heap is the $n$th smallest priority $r$, and the remaining $n-1$ items correspond to $q_j < r$.
This requires $O(n)$ space and $O(\log n)$ time per insertion or deletion, and processes the file in a single pass.
Notes
- Parts