Reservoir Sampling

Reservoir sampling solves the following class of problems: randomly choose $k$ items from a stream of $n$ elements, where $n$ may be very large or unknown in advance, such that every element in the stream is equally likely to be selected, with probability $k/n$.

The algorithm works as follows.

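Let’s first look at the simple case $k = 1$. When the $i$-th item arrives, we either keep it with probability $1/i$ or keep the previously selected item with probability $1 - 1/i$. We repeat this process until the end of the stream, i.e., until all $n$ elements have been visited. Item $i$ is the final choice only if it is selected when it arrives and is not replaced by any later item, so the probability that it is chosen in the end is

$$\frac{1}{i} \cdot \frac{i}{i+1} \cdot \frac{i+1}{i+2} \cdots \frac{n-1}{n} = \frac{1}{n}$$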

This proves that the algorithm gives every element the same probability of being chosen. A Java implementation of this algorithm might look like this:

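import java.util.Random;

// Returns the 1-based index of the item chosen from a stream of n items.
int random(int n) {
    Random rnd = new Random();
    int ret = 0;  // stays 0 only when n < 1
    for (int i = 1; i <= n; i++)
        // nextInt(i) is uniform on [0, i), so this holds with probability 1/i
        if (rnd.nextInt(i) == 0)
            ret = i;
    return ret;
}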
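
The version above takes $n$ as a parameter, while the problem statement allows the stream length to be unknown. The following is a minimal sketch of the same idea applied to a stream read one item at a time. It assumes the stream is exposed as a `java.util.Iterator`; the class name `SingleItemReservoir` and the method name `pick` are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Random;

class SingleItemReservoir {
    // Returns one element chosen uniformly at random from the stream,
    // or null if the stream is empty. The stream length is never needed.
    static <T> T pick(Iterator<T> stream, Random rnd) {
        T chosen = null;
        int seen = 0;  // number of items read so far
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            // nextInt(seen) is uniform on [0, seen), so it equals 0
            // with probability 1/seen: replace the current choice.
            if (rnd.nextInt(seen) == 0) {
                chosen = item;
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        int[] counts = new int[5];
        // Repeat the sampling many times and count how often each value wins.
        for (int trial = 0; trial < 100_000; trial++) {
            int picked = pick(Arrays.asList(0, 1, 2, 3, 4).iterator(), rnd);
            counts[picked]++;
        }
        // The five counts should come out roughly equal.
        System.out.println(Arrays.toString(counts));
    }
}
```

Only the current choice and a counter are kept in memory, so the cost per item is constant regardless of how long the stream turns out to be.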

The case $k > 1$ is a little trickier. One straightforward way is to simply run the previous algorithm $k$ times. However, this requires multiple passes over the stream. Here we discuss another approach to select $k$ elements at random.

For the $i$-th item, there are two cases to handle:

  1. When $i \le k$, we just blindly keep item $i$.
  2. When $i > k$, we keep item $i$ with probability $k/i$.

A simple implementation requires memory to store the $k$ selected elements, say $s_1, \dots, s_k$. For every $i > k$ we first draw a random integer $j$ uniformly from $[1, i]$ and keep item $i$ when $j \le k$, i.e., we set $s_j = i$. Otherwise item $i$ is discarded. This guarantees the $k/i$ probability in the second case.

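The proof is similar to the previous one. For $i > k$, the probability that item $i$ is chosen in the end is

$$\frac{k}{i} \cdot \left(1 - \frac{k}{i+1} \cdot \frac{1}{k}\right) \cdot \left(1 - \frac{k}{i+2} \cdot \frac{1}{k}\right) \cdots \left(1 - \frac{k}{n} \cdot \frac{1}{k}\right) = \frac{k}{i} \cdot \frac{i}{i+1} \cdot \frac{i+1}{i+2} \cdots \frac{n-1}{n} = \frac{k}{n}$$

Here $\frac{k}{j} \cdot \frac{1}{k}$ is the probability that item $i$ is replaced by a later item $j$ ($i < j \le n$): item $j$ is kept with probability $k/j$ and lands in the slot holding item $i$ with probability $1/k$. For $i \le k$ the leading factor is 1 and the product starts at $j = k + 1$, which again gives $k/n$.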
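
As a quick sanity check of this argument, here is the product worked out for one small case, $k = 2$ and $n = 5$ (values picked only for illustration, not taken from the original post). For item $i = 3$, which arrives after the reservoir is full:

$$\frac{2}{3} \cdot \left(1 - \frac{2}{4} \cdot \frac{1}{2}\right) \cdot \left(1 - \frac{2}{5} \cdot \frac{1}{2}\right) = \frac{2}{3} \cdot \frac{3}{4} \cdot \frac{4}{5} = \frac{2}{5}$$

For item $i = 1$, which is kept unconditionally at first and can be replaced at steps 3, 4 and 5:

$$1 \cdot \left(1 - \frac{2}{3} \cdot \frac{1}{2}\right) \cdot \left(1 - \frac{2}{4} \cdot \frac{1}{2}\right) \cdot \left(1 - \frac{2}{5} \cdot \frac{1}{2}\right) = \frac{2}{3} \cdot \frac{3}{4} \cdot \frac{4}{5} = \frac{2}{5}$$

Both results equal $k/n = 2/5$, as expected.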

Below is a sample implementation in Java:

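import java.util.Random;

// Returns k elements sampled uniformly from a; assumes k <= a.length.
int[] random(int[] a, int k) {
    int[] s = new int[k];
    Random rnd = new Random();
    // Case 1: the first k items fill the reservoir.
    for (int i = 0; i < k; i++)
        s[i] = a[i];
    // Case 2: a[i] is the (i + 1)-th item, kept with probability k / (i + 1).
    for (int i = k; i < a.length; i++) {
        int j = rnd.nextInt(i + 1);  // uniform on [0, i]
        if (j < k) s[j] = a[i];
    }
    return s;
}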
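
The array version knows the input length up front. The two cases can also be sketched for a stream of unknown length. This is a minimal illustration, assuming the stream is exposed as a `java.util.Iterator` and holds fewer than $2^{31}$ items (the counter is an `int`). The class name `StreamReservoir` and the method name `sample` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

class StreamReservoir {
    // Returns k elements sampled uniformly from the stream, or every
    // element if the stream holds fewer than k items.
    static <T> List<T> sample(Iterator<T> stream, int k, Random rnd) {
        List<T> reservoir = new ArrayList<>(k);
        int seen = 0;  // number of items read so far
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (reservoir.size() < k) {
                // Case 1: the first k items are kept unconditionally.
                reservoir.add(item);
            } else {
                // Case 2: j is uniform on [0, seen), so j < k holds
                // with probability k/seen; slot j is then overwritten.
                int j = rnd.nextInt(seen);
                if (j < k) {
                    reservoir.set(j, item);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(10, 20, 30, 40, 50, 60);
        // Prints three of the six values; the result varies between runs.
        System.out.println(sample(data.iterator(), 3, new Random()));
    }
}
```

Memory use is proportional to $k$, not to the stream length, and each item is read exactly once.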
