Reservoir Sampling: Uniform Sampling of Streaming Data

Consider a stream of data that we receive, call them x_1, x_2, \ldots, x_n where x_i is the i^\text{th} element in the stream. We receive x_i at the i^\text{th} time step, and once we move on to the next time step, x_i is no longer accessible. Furthermore, we don’t even know the value of n. How can we possibly uniformly sample an element from this stream? When I first thought about this, I was completely stuck; I couldn’t fathom any such procedure. To me, it just felt wrong that we needed to produce a uniform sample over n elements without even knowing n!

However, as the title implies, one method to accomplish this is called Reservoir Sampling. I’ll introduce the mechanics directly first; don’t be discouraged if it doesn’t make sense right away, since it can be quite non-intuitive. Before we delve into the mathematics, I recommend you take a look at this post, especially the basics of probability theory.

Development of the Theory

First, we let our stream be x_1, \ldots, x_n and we start by seeing x_1 and continue all the way to x_n. Now consider the random variables X_1, \ldots, X_n defined by

    \begin{align*} X_i = \begin{cases} x_i & \text{with probability $\frac{1}{i}$} \\ X_{i - 1} & \text{otherwise} \end{cases} \end{align*}

In other words, we keep a reservoir sample and replace it with the element we are currently reading with probability equal to the reciprocal of the current timestep; the variables X_1, \ldots, X_n are simply the state of the reservoir at each timestep. We claim that

    \begin{align*} \Pr[X_n = x_i] = \frac{1}{n}, \end{align*}

for all i, i.e. X_n is a uniformly distributed random variable from the sample space of all the stream elements. And in fact, we define X_n as our sample from the stream.

Now what confused me the most was that the probability that X_j (fixing j) equals x_j is \frac{1}{j}, yet somehow we conclude that the probability that X_n remains x_j is \frac{1}{n}. Seemingly confusing, right? We may select x_j at the j^\text{th} timestep, but for X_n to remain x_j we must also not select any of the later elements. That requirement of not selecting the other elements is exactly what makes X_n uniformly distributed, and now we will see this formally. I will provide a rigorous proof first, and then a more intuitive argument afterward.

Claim: Defining x_1, \ldots, x_n and X_1, \ldots, X_n as we have, we claim

    \begin{align*} \Pr[X_n = x_i] = \frac{1}{n}. \end{align*}

Proof: We prove the claim by induction on n. First consider the case of n = 1 where then we choose the first and only element of the stream with full probability, so obviously our selection is uniform.

Now consider some arbitrary n \ge 1, and assume the claim holds for the stream x_1, \ldots, x_n. Add an element x_{n + 1}, so our stream is now x_1, \ldots, x_n, x_{n + 1}. By definition, X_{n + 1} = X_n with probability 1 - \frac{1}{n + 1}, so for all j = 1, \ldots, n:

    \begin{align*} \Pr[X_{n + 1} = x_j] &= \Pr[(X_{n + 1} = X_n) \cap (X_n = x_j)] \\ &= \Pr[X_{n + 1} = X_n] \Pr[X_n = x_j \mid X_{n + 1} = X_n] \\ &= \Pr[X_{n + 1} = X_n] \Pr[X_n = x_j] \\ &= \left( 1 - \frac{1}{n + 1} \right) \left( \frac{1}{n} \right) \\ &= \frac{n}{n + 1} \cdot \frac{1}{n} \\ &= \frac{1}{n + 1} \end{align*}

as required. The second line is by Conditional Equivalence (details can be found here), the third holds because the coin flip at timestep n + 1 is independent of the reservoir’s state X_n, and the fourth follows from the inductive hypothesis. Furthermore, by definition we know \Pr[X_{n + 1} = x_{n + 1}] = \frac{1}{n + 1}. Thus we have shown for all i = 1, \ldots, n + 1:

    \begin{align*} \Pr[X_{n + 1} = x_i] = \frac{1}{n + 1}, \end{align*}

i.e. X_{n + 1} is a uniform sample from the stream x_1, \ldots, x_n, x_{n + 1} as claimed, thus proving the correctness of Reservoir Sampling. \quad \blacksquare

A more intuitive approach, although less rigorous, is by writing

    \begin{align*} \Pr[X_n = x_j] = \Pr[X_j = x_j] \cdot \prod_{i = j + 1}^n \Pr[X_i = X_{i - 1}] \end{align*}

which is essentially the probability we select x_j and then never select any other element of the stream. We can simplify:

    \begin{align*} \Pr[X_n = x_j] &= \frac{1}{j} \cdot \prod_{i = j + 1}^n \left( 1 - \frac{1}{i} \right) \\ &= \frac{1}{j} \cdot \prod_{i = j + 1}^n \left( \frac{i - 1}{i} \right) \\ &= \left( \frac{1}{j} \right) \left( \frac{j}{j + 1} \right) \cdots \left( \frac{n - 1}{n} \right) \\ &= \frac{1}{n} \end{align*}

as required, so we have shown that X_n is uniform. This isn’t rigorous since we haven’t formally established the relations between the random variables needed to justify that initial product, and we never need to delve into that with our proof by induction.
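As a quick sanity check of this telescoping product, we can evaluate it with exact rational arithmetic (a minimal sketch; `prob_survives` is my own name for the computation):

```python
from fractions import Fraction

def prob_survives(j, n):
  # probability we select x_j at step j and then never replace it:
  # (1/j) * prod_{i = j+1}^{n} (1 - 1/i)
  p = Fraction(1, j)
  for i in range(j + 1, n + 1):
    p *= 1 - Fraction(1, i)
  return p

# every index j should survive with probability exactly 1/n
n = 10
print([prob_survives(j, n) for j in range(1, n + 1)])
# each entry is Fraction(1, 10)
```

Using `Fraction` instead of floats makes the telescoping cancellation exact, so every index comes out to precisely \frac{1}{n}.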

Verifying Reservoir Sampling

Now that we have understood the theory of Reservoir Sampling, let’s take a look at it in full motion. The following is the Python code for the Reservoir Sampling Algorithm.

import random

def get_reservoir_sample(stream):
  X = None  # our reservoir sample
  for i, x in enumerate(stream):
    # replace the reservoir with probability 1/(i + 1)
    if random.random() < 1 / (i + 1):
      X = x
  return X

Now we generate a stream of 1,000 elements whose distribution is depicted in the following histogram.

Distribution of Streaming Data

Now after sampling 10,000 elements by the Reservoir Sampling, we arrive at the following distribution of samples.

Distribution of Samples by Reservoir Sampling of Streaming Data

Notice that the distributions are almost identical, which verifies the uniformity of Reservoir Sampling in practice.

I encourage you to try this for yourself and witness the power of Reservoir Sampling.
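If you'd like to try it, here is a minimal self-contained experiment (my own toy stream of 20 distinct elements, not the 1,000-element stream from the figures above; `sample_once` is my own helper name):

```python
import random
from collections import Counter

def sample_once(stream):
  # one-element reservoir sampler, as defined earlier
  X = None
  for i, x in enumerate(stream):
    if random.random() < 1 / (i + 1):
      X = x
  return X

stream = list(range(20))  # a toy stream of 20 distinct elements
num_trials = 100_000
counts = Counter(sample_once(stream) for _ in range(num_trials))

# each element should appear in roughly 1/20 = 5% of the trials
for x in stream:
  print(x, counts[x] / num_trials)
```

Each printed frequency should hover around 0.05, which is the uniform probability \frac{1}{20}.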

Practical Use of Reservoir Sampling

As is common in streaming algorithms, we assume our stream is extremely large, so storing the contents of the stream in memory is simply not viable. For example, consider a stream of posts from all CS blogs; storing each one in memory is a waste of space, but with Reservoir Sampling we can sample one post uniformly with \Theta(1) space, compared to \Theta(n) if we kept all the posts.

Furthermore, if we need k independent samples from a stream, we can run k Reservoir Samplers simultaneously and independently, obtaining our k samples with only one pass through the stream. In practice, this tends to be the preferred method of sampling, as we often need more than just one sample.
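The one-pass, k-sampler idea can be sketched as follows (a minimal sketch with my own function name; note that since the k samplers are independent, these are samples with replacement and may repeat):

```python
import random

def get_k_reservoir_samples(stream, k):
  # run k independent one-element reservoir samplers in a single pass
  samples = [None] * k
  for i, x in enumerate(stream):
    for j in range(k):
      # each sampler flips its own coin with probability 1/(i + 1)
      if random.random() < 1 / (i + 1):
        samples[j] = x
  return samples
```

At the first timestep the replacement probability is 1, so every sampler is initialized to x_1 before any later coin flips, and each entry of the result is a uniform sample by the argument above.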

Also, there are more applications of this principle of keeping a reservoir (the k samples) and replacing it with new stream elements as time passes, so look out for posts on this topic of streaming algorithms.
