## A.2 Reservoir Sampling

To perform the sampling operation,
both `SampleDiscrete()` and alias tables require the number
of outcomes being sampled from as well as all their probabilities to
be stored in memory.
Often this is not a problem, but for cases where we would like
to draw a sample from a large number of events, or where each
event requires a large amount of memory, it is useful to be able to
generate samples without storing all of the candidates at once.

A family of algorithms based on a technique called *reservoir
sampling* makes this possible, by taking a stream of candidate samples one at a time
and randomly keeping just one of them in a way that ensures that the sample
that is kept is from the distribution defined by the samples that have been
seen so far. Reservoir sampling algorithms date to the early days of computer
tape drives, where data could only be accessed sequentially and there was
often more of it than could be stored in main memory. Reservoir sampling
made it possible to draw random samples from data stored on tape while only
reading the tape once.

The basic reservoir sampling algorithm is easily expressed: each candidate sample replaces the sample stored in the reservoir with probability equal to one over the number of candidates that have been considered so far.
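The idea can be sketched in a few lines of C++; this is a minimal illustration of the technique, not pbrt's actual implementation, and the class and method names are chosen only for this example.

```cpp
#include <cstdint>
#include <random>

// Minimal sketch of unweighted reservoir sampling: candidate i (1-based)
// replaces the stored sample with probability 1/i, so after n candidates
// each one is retained with probability 1/n.
template <typename T>
class ReservoirSampler {
  public:
    explicit ReservoirSampler(uint64_t seed = 0) : rng(seed) {}

    void Add(const T &sample) {
        ++nSeen;
        // Replace the stored sample with probability 1 / nSeen. The first
        // candidate is always stored, since 1/1 = 1.
        std::uniform_real_distribution<double> u(0., 1.);
        if (u(rng) < 1.0 / nSeen)
            reservoir = sample;
    }

    const T &GetSample() const { return reservoir; }
    int64_t SamplesSeen() const { return nSeen; }

  private:
    std::mt19937_64 rng;
    int64_t nSeen = 0;
    T reservoir{};
};
```

Note that only the current reservoir contents and a count are stored; the stream of candidates itself never needs to be kept in memory.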

The correctness of this algorithm can be shown using induction. For the base case, the first candidate is always stored in the reservoir, since it is kept with probability 1/1 = 1, and the reservoir has successfully drawn a sample with the appropriate probability from the sample distribution.

Now consider the case where n samples have been considered and assume that the sample stored in the reservoir has been kept with probability 1/n. When sample n+1 is considered, it will be kept with probability 1/(n+1), which is clearly the correct probability for it. The existing sample is kept with probability n/(n+1); the product of that probability and its 1/n probability of already being stored in the reservoir gives the correct probability, 1/(n+1), as well.
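This argument is easy to verify empirically. The following sketch (independent of any pbrt code) draws a reservoir sample from three candidates many times and tallies how often each candidate is the one kept; each should appear about 1/3 of the time.

```cpp
#include <array>
#include <cstdint>
#include <random>

// Draw one reservoir sample from the candidates {0, 1, 2} in each of
// nTrials independent runs and count how often each candidate is kept.
inline std::array<int, 3> TallyReservoirPicks(int nTrials, uint64_t seed) {
    std::mt19937_64 rng(seed);
    std::uniform_real_distribution<double> u(0., 1.);
    std::array<int, 3> counts = {0, 0, 0};
    for (int t = 0; t < nTrials; ++t) {
        int reservoir = 0;
        // Candidate i (0-based) replaces the stored sample with
        // probability 1/(i+1); candidate 0 is always stored.
        for (int i = 0; i < 3; ++i)
            if (u(rng) < 1.0 / (i + 1))
                reservoir = i;
        ++counts[reservoir];
    }
    return counts;
}
```

With a reasonable number of trials, all three counts cluster tightly around nTrials/3.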

*Weighted reservoir sampling* algorithms generalize the basic
algorithm by making it possible to associate a
nonnegative weight with each sample. Samples are then kept with
probability given by the ratio of their weight to the sum of weights of all
of the candidate samples that have been seen so far. The
`WeightedReservoirSampler` class implements this algorithm. It is
parameterized by `T`, the type of object being sampled.


`WeightedReservoirSampler` stores an `RNG` object that provides
the random numbers that are used in deciding whether to add each
sample to the reservoir. The constructor correspondingly takes a seed
value that is passed on to the `RNG`.

If an array of `WeightedReservoirSampler`s is allocated, then the
default constructor runs instead. In that case, the `RNG`s in
individual samplers can be seeded via the `Seed()` method.

The `Add()` method takes a single sample and a nonnegative weight
value and updates the reservoir so that the stored sample is from
the expected distribution.

<<Add sample to reservoir>>

The probability `p` for storing the sample candidate in the reservoir
is easily found given `weightSum`.

<<Add sample to reservoir>>=

The weight of the sample kept in the reservoir is recorded in
`reservoirWeight`; it is needed to compute the value of the probability
mass function (PMF) for the sample that
is kept.

A second `Add()` method takes a callback function that returns a
sample. This function is only called when the sample is to be
stored in the reservoir. This variant is useful in cases where the
sample’s weight can be computed independently of its value and where its
value is relatively expensive to compute. The fragment that contains its
implementation, <<Process weighted reservoir sample via callback>>,
otherwise follows the same structure as the first `Add()` method, so it
is not included here.
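The essential point of the callback variant can be shown with a small free-function sketch (a hypothetical signature, not pbrt's): the weight is supplied up front, and the sample value is only materialized if the candidate actually wins the reservoir slot.

```cpp
#include <cstdint>
#include <random>

// Callback-based reservoir update: makeSample() is invoked only when the
// candidate is going to be stored, so an expensive sample value is never
// computed just to be discarded.
template <typename T, typename F>
void AddSampleViaCallback(double weight, F makeSample, double &weightSum,
                          T &reservoir, double &reservoirWeight,
                          std::mt19937_64 &rng) {
    weightSum += weight;
    double p = weight / weightSum;
    if (std::uniform_real_distribution<double>(0., 1.)(rng) < p) {
        reservoir = makeSample();  // called only if the sample is kept
        reservoirWeight = weight;
    }
}
```

A caller might pass a lambda that performs the expensive computation, paying its cost only for the candidates that are actually stored.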

A number of methods provide access to the sample and the probability that it was stored in the reservoir.

It is sometimes useful to reset a `WeightedReservoirSampler` and
restart from scratch with a new stream of samples; the `Reset()`
method handles this task.

Remarkably, it is possible to merge two reservoirs into one in such a way
that the stored sample is kept with the same probability as if a single
reservoir had considered all of the samples seen by the two.
Merging two reservoirs is a matter of randomly taking the sample
stored by the second reservoir with probability equal to the second
reservoir's sum of sample weights divided by the sum of both reservoirs'
weight sums, which is exactly the operation that the `Add()` method performs.
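A merge operation along these lines can be sketched with a simplified reservoir struct (again, illustrative names rather than pbrt's actual code): the second reservoir's sample is treated as a single candidate whose weight is that reservoir's entire weight sum.

```cpp
#include <cstdint>
#include <random>

// Simplified reservoir state for illustrating the merge operation.
struct Reservoir {
    int sample = 0;
    double weightSum = 0, reservoirWeight = 0;
};

// Merge b into a: take b's stored sample with probability equal to b's
// weight sum over the combined weight sum, exactly as if b's sample had
// been passed to Add() with weight b.weightSum.
inline Reservoir MergeReservoirs(const Reservoir &a, const Reservoir &b,
                                 std::mt19937_64 &rng) {
    Reservoir merged = a;
    merged.weightSum = a.weightSum + b.weightSum;
    double p = b.weightSum / merged.weightSum;
    if (std::uniform_real_distribution<double>(0., 1.)(rng) < p) {
        merged.sample = b.sample;
        merged.reservoirWeight = b.reservoirWeight;
    }
    return merged;
}
```

In the boundary cases the behavior is deterministic: an empty second reservoir can never win the merge, and an empty first reservoir always yields the second reservoir's sample.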