Christian Gunning and Jonathan Olmsted — written Apr 12, 2013 — updated Dec 12, 2022 — source

All of R’s (r*, p*, q*, d*) distribution functions are available in C++ via
the R API. R is written in C, and the R API has no concept of a vector (at
least not in the STL sense). Consequently, R’s `sample()`

function can’t just be
exported via the R API, despite its importance and usefulness. The purpose of
`RcppArmadilloExtensions/sample.h`

(written by Christian Gunning) is to provide
this functionaility within C++ programs.

Given sampling’s central role in statistical programming, it is surprising that no standard library implementation for C or C++ is commonly available. There have been repeated questions about this on the Rcpp mailing list. StackExchange contains an extensive discussion of this question, but there is no “canonical” implementation.

In general, it’s best to use R’s tried-and-true RNG-related functions leaving
the praise (and blame) for their performance to others. R’s C routines for
sampling can be found in `src/main/random.c`

, with a discussion of the
associated algorithms in Ripley 87.

The goal is to exactly reproduce the behavior of R’s sample() function in a
templated C++/Rcpp function. So far, we’ve reproduced everything but R’s
implementation of the Walker Alias method (used only when sampling with
replacement using >200 non-zero weights). It uses convenience functions from
Armadillo, and thus is added to RcppArmadillo as an extension. (The hope is
that future extensions will follow!) All you need is this simple call, which
should work for any Rcpp Vector: ```
RcppArmadillo::sample(x, size, replace,
prob)
```

.

Make sure you have a recent version of `RcppArmadillo`

. The earliest adequate
release is 3.800.1 or preferably release 0.3.810.0. The usual
`install.packages("RcppArmadillo")`

command will help if
you need to update.
You are ready to go from there:

Loading required package: RcppArmadillo

This post was originally written in 2013. Years later, with R release 3.6.0,
a change was made to how R generates random integers. This affects the
`sample()`

function discussed here. To restore the behaviour present when
this post was written (and to which the C++ implementation is calibrated) use
`RNGkind(sample.kind = “Rounding”).

Here’s a quick test to make sure it works.

Some C++ code that can be hooked into with `sourceCpp()`

:

Notice that we only need `#include <RcppArmadilloExtensions/sample.h>`

because
`sample.h`

then `#include`

-s RcppArmadillo.

We invoke the (automatically defined) `csample_char()`

R function:

Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler used

[1] TRUE

Of course, R’s sample() function is “internally” vectorized and
already fast. This functionality was *not* added to speed up
`sample()`

! Instead, this lets you stay in C++ when you need to sample
from an Rcpp Vector, be it Numeric, Character, or whatever else.

That said, performance is still a concern. A quick test shows a dead-heat for sampling with replacement when compared to vanilla R:

Consider the following timing where we compare vanilla R’s `sample()`

to
`RcppArmadillo::sample()`

. See the results for sampling with replacement with
and without probability weights.

Loading required package: rbenchmark

test replications relative elapsed 2 cpp 1000 1.000 0.084 1 r 1000 1.512 0.127

test replications relative elapsed 2 cpp.prob 1000 1.000 0.318 1 r.prob 1000 1.006 0.320

The two perform equally well.

Next we look at the performance of sampling without replacement. The number of draws can be no larger than the number of elements. Thus we’re sampling fewer elements. Otherwise, the code is identical.

test replications relative elapsed 2 cpp 1000 1.00 0.004 1 r 1000 4.25 0.017

test replications relative elapsed 2 cpp.prob 1000 1.000 0.009 1 r.prob 1000 1.444 0.013

Finally, what we haven’t done. For sampling with replacement and more than 200 non-zero weights, R uses Walker’s Alias method. This method can be substantially faster than the vanilla sampling method (with replacement, less than 200 non-zero weights). Rather than risk leading users astray with inefficient and inappropriate methods, we throw an error.

**tags:**
rng

- Extending R with C++ and Fortran — Dirk Eddelbuettel and JBrandon Duck-Mayr
- Benchmarking Rcpp code with RcppClock — Zach DeBruine
- Simulation Smoother using RcppArmadillo — Tomasz Woźniak