Christian Gunning and Jonathan Olmsted — written Apr 12, 2013 — source
All of R’s (r*, p*, q*, d*) distribution functions are available in C++ via
the R API. R is written in C, and the R API has no concept of a vector (at
least not in the STL sense). Consequently, R’s
sample() function can’t just be
exported via the R API, despite its importance and usefulness. The purpose of
RcppArmadilloExtensions/sample.h (written by Christian Gunning) is to provide
this functionaility within C++ programs.
Given sampling’s central role in statistical programming, it is surprising that no standard library implementation for C or C++ is commonly available. There have been repeated questions about this on the Rcpp mailing list. StackExchange contains an extensive discussion of this question, but there is no “canonical” implementation.
In general, it’s best to use R’s tried-and-true RNG-related functions leaving
the praise (and blame) for their performance to others. R’s C routines for
sampling can be found in
src/main/random.c, with a discussion of the
associated algorithms in Ripley 87.
The goal is to exactly reproduce the behavior of R’s sample() function in a
templated C++/Rcpp function. So far, we’ve reproduced everything but R’s
implementation of the Walker Alias method (used only when sampling with
replacement using >200 non-zero weights). It uses convenience functions from
Armadillo, and thus is added to RcppArmadillo as an extension. (The hope is
that future extensions will follow!) All you need is this simple call, which
should work for any Rcpp Vector:
RcppArmadillo::sample(x, size, replace,
Make sure you have a recent version of
RcppArmadillo. The earliest adequate
release is 3.800.1 or preferably release 0.3.810.0. The usual
install.packages("RcppArmadillo") command will help if
you need to update.
You are ready to go from there:
Loading required package: RcppArmadillo
Loading required package: Rcpp
Here’s a quick test to make sure it works.
Some C++ code that can be hooked into with
Notice that we only need
#include <RcppArmadilloExtensions/sample.h> because
We invoke the (automatically defined)
csample_char() R function:
Of course, R’s sample() function is “internally” vectorized and
already fast. This functionality was not added to speed up
sample()! Instead, this lets you stay in C++ when you need to sample
from an Rcpp Vector, be it Numeric, Character, or whatever else.
That said, performance is still a concern. A quick test shows a dead-heat for sampling with replacement when compared to vanilla R:
Consider the following timing where we compare vanilla R’s
RcppArmadillo::sample(). See the results for sampling with replacement with
and without probability weights.
Loading required package: rbenchmark
test replications relative elapsed 2 cpp 1000 1.000 0.186 1 r 1000 1.188 0.221
test replications relative elapsed 1 r.prob 1000 1.000 0.759 2 cpp.prob 1000 1.009 0.766
The two perform equally well.
Next we look at the performance of sampling without replacement. The number of draws can be no larger than the number of elements. Thus we’re sampling fewer elements. Otherwise, the code is identical.
test replications relative elapsed 2 cpp 1000 1.000 0.011 1 r 1000 1.364 0.015
test replications relative elapsed 1 r.prob 1000 1 0.029 2 cpp.prob 1000 1 0.029
Finally, what we haven’t done. For sampling with replacement and more than 200 non-zero weights, R uses Walker’s Alias method. This method can be substantially faster than the vanilla sampling method (with replacement, less than 200 non-zero weights). Rather than risk leading users astray with inefficient and inappropriate methods, we throw an error.
Warning: Walker's alias method used: results are different from R < 2.2.0
Error: Walker Alias method not implemented. R-core sample() is likely faster for this problem.