Reservoir Sampling: Uniform Sampling of Streaming Data

Consider a stream of data that we receive, call them x_1, x_2, \ldots, x_n where x_i is the i^\text{th} element in the stream. Note that we receive every x_i at the i^\text{th} time step and that x_i is then no more in our access once we move on to the next time step. Furthermore, we don’t even know the value of n. How can we possibly uniformly sample an element from this stream? When I first thought about this, I was completely stuck; I couldn’t fathom any such procedure; to me, it just felt wrong that we needed to gain a uniform sample over n samples without even knowing n!

However, as the title implies, one method to accomplish this is called Reservoir Sampling. I’ll first introduce the mechanics directly, and don’t be discouraged if it doesn’t make sense since it can be quite non-intuitive at first. Before we delve into the mathematics, I recommend you take a look at this post, especially the basics of probability theory.

Development of the Theory

First, we let our stream be x_1, \ldots, x_n and we start by seeing x_1 and continue all the way to x_n. Now consider the random variables X_1, \ldots, X_n defined by

    \begin{align*} X_i = \begin{cases} x_i & \text{with probability $\frac{1}{i}$} \\ X_{i - 1} & \text{otherwise} \end{cases} \end{align*}

In other words, we keep a reservoir sample and switch it to the current element we are reading with probability inverse to the current timestep, and the variables X_1, \ldots, X_n simply are the state of the reservoir at different timesteps. We claim that

    \begin{align*} \Pr[X_n = x_i] = \frac{1}{n}, \end{align*}

for all i, i.e. X_n is a uniformly distributed random variable from the sample space of all the stream elements. And in fact, we define X_n as our sample from the stream.

Now what confused me the most was that the probability that X_j (fixing j) is x_j is \frac{1}{j}, but then somehow we concluded the probability that X_n remains as x_j is \frac{1}{n}. Seemingly confusing, right? We may select x_j at the j^\text{th} timestep; however, we forget that we still need to not select the rest of the elements in order for X_n to remain x_j. That part about not selecting other elements weighs into proving that X_n is uniformly distributed, and now we will see this in motion formally. I will provide a rigorous proof first, and then a more intuitive argument afterward.

Claim: Defining x_1, \ldots, x_n and X_1, \ldots, X_n as we have, we claim

    \begin{align*} \Pr[X_n = x_i] = \frac{1}{n}. \end{align*}

Proof: We prove the claim by induction on n. First consider the case of n = 1 where then we choose the first and only element of the stream with full probability, so obviously our selection is uniform.

Now consider some arbitrary n \ge 1, and assume the claim for the stream x_1, \ldots, x_n. Now add an element x_{n + 1} to the stream, so our stream is x_1, \ldots, x_n, x_{n + 1} now. Now by definition X_{n + 1} = X_n with probability 1 - \frac{1}{n + 1}, so for all j = 1, \ldots, n:

    \begin{align*} \Pr[X_{n + 1} = x_j] &= \Pr[(X_{n + 1} = X_n) \cap (X_n = x_j)] \\ &= \Pr[X_{n + 1} = X_n] \Pr[X_n = x_j \mid X_{n + 1} = X_n] \\ &= \Pr[X_{n + 1} = X_n] \Pr[X_n = x_j] \\ &= \lp1 - \frac{1}{n + 1}\rp \lp \frac{1}{n} \rp \\ &=\frac{n}{n + 1} \cdot \frac{1}{n} \\ &= \frac{1}{n + 1}  \end{align*}

as required. The second line is by Conditional Equivalence (details can be found here), the third by the independence of the reservoir’s state at a previous timestep, and the fourth derived from the inductive hypothesis. Furthermore, by definition we know \Pr[X_{n + 1} = x_{n + 1}] = \frac{1}{n + 1}. Thus we have shown for all i = 1, \ldots, n + 1:

    \begin{align*} \Pr[X_{n + 1} = x_i] = \frac{1}{n + 1}, \end{align*}

i.e. X_{n + 1} is an uniform sample from the stream x_1, \ldots, x_n, x_{n + 1} as claimed; thus proving the correctness of Reservoir Sampling. \quad \blacksquare

A more intuitive approach, although less rigorous, is by writing

    \begin{align*} \Pr[X_n = x_j] = \Pr[X_j = x_j] \cdot \prod_{i = j + 1}^n \Pr[X_i = X_{i - 1}] \end{align*}

which is essentially the probability we select x_j and then never select any other element of the stream. We can simplify:

    \begin{align*} \Pr[X_n = x_j] &= \frac{1}{j} \cdot \prod_{i = j + 1}^n \lp 1 - \frac{1}{i} \rp \\ &= \frac{1}{j} \cdot \prod_{i = j + 1}^n \lp \frac{i - 1}{i} \rp \\ &= \lp \frac{1}{j} \rp \lp \frac{j}{j + 1} \rp \cdots \lp \frac{n - 1}{n} \rp \\ &= \frac{1}{n} \end{align*}

as required, so we have shown that X_n is uniform. This isn’t rigorous since we haven’t had a chance to discuss the relations between the random variables in order to get that initial product, and we never need to delve into that with our proof by induction.

Verifying Reservoir Sampling

Now that we have understood the theory of Reservoir Sampling, let’s take a look at it in full motion. The following is the Python code for the Reservoir Sampling Algorithm.

def get_reservoir_sample(stream):
  n = len(stream)
  X = 0 # our sample
  for i in range(n):
    if rand.random() < 1/(i + 1):
      X = stream[i]
  return X

Now we generate a stream of 1,000 elements whose distribution is depicted in the following histogram.

Distribution of Streaming Data

Now after sampling 10,000 elements by the Reservoir Sampling, we arrive at the following distribution of samples.

Distribution of Samples by Reservoir Sampling of Streaming Data

Notice that the distributions are almost identical which verifies the uniformity of the Reservoir Sampling in practice.

I encourage you to try this for yourself and witness the power of Reservoir Sampling.

Practical Use of Reservoir Sampling

Common in most Streaming Algorithms, we tend to assume our stream to be extremely large, so storing the contents of the stream in memory is simply not viable. For example, consider a stream of posts from all CS blogs; obviously, storing each one in memory is a waste of space, but with Reservoir Sampling we can sample one post uniformly with \Theta(1) space compared to \Theta(n) if we decided to keep all the posts.

Furthermore, if we need k independent samples from a stream, we can Reservoir Sample simultaneously and independently, so we can get our k samples with only one pass through the stream. In practice, this method tends to be the preferred method of sampling as often we need more than just one sample.

Also, there are more applications of this principle of keeping a reservoir (the k samples) and replacing it with new stream elements as time passes, so look out for posts on this topic of streaming algorithms.

About the Author

Leave a Reply

Your email address will not be published. Required fields are marked *

Related Posts