Nathan Russell — written Jul 26, 2017 — source
- C++ templates and function overloading are incompatible with R’s C API, so polymorphism must be achieved via run-time dispatch, handled explicitly by the programmer.
- The traditional technique for operating on SEXP objects in a generic manner entails a great deal of boilerplate code, which can be unsightly, unmaintainable, and error-prone.
- The desire to provide polymorphic functions which operate on vectors and matrices is common enough that Rcpp provides the utility macros RCPP_RETURN_VECTOR and RCPP_RETURN_MATRIX to simplify the process.
- Subsequently, these macros were extended to handle an (essentially) arbitrary number of arguments, provided that a C++11 compiler is used.
To motivate a discussion of polymorphic functions, imagine that we desire a
function (ends) which, given an input vector x and an integer n, returns
a vector containing the first and last n elements of x concatenated.
Furthermore, we require ends to be a single interface which is capable of
handling multiple types of input vectors (integers, floating point values,
strings, etc.), rather than having a separate function for each case. How can
this be achieved?
A naïve implementation in R might look something like this:
ends <- function(x, n = 6L)
{
n <- min(n, length(x) %/% 2)
c(head(x, n), tail(x, n))
}
ends(1:9)
[1] 1 2 3 4 6 7 8 9
ends(letters, 3)
[1] "a" "b" "c" "x" "y" "z"
ends(rnorm(20), 2)
[1] -0.560476 -0.230177 0.701356 -0.472791
The simple function above demonstrates a key feature of many dynamically-typed
programming languages, one which has undoubtedly been a significant factor in their
rise to popularity: the ability to write generic code with little-to-no
additional effort on the part of the developer. Without getting into a discussion
of the pros and cons of static vs. dynamic typing, it is evident that being able
to dispatch a single function generically on multiple object types, as opposed to,
e.g. having to manage separate implementations of ends for each vector type,
helps us to write more concise, expressive code. Being an article about Rcpp,
however, the story does not end here, and we consider how this problem might
be approached in C++, which has a much more strict type system than R.
For simplicity, we begin by considering solutions in the context of a “pure”
(i.e., not called from R) C++ program. Eschewing more complicated tactics
involving run-time dispatch (virtual functions, etc.), the C++ language
provides us with two straightforward methods of achieving this at compile time:
- Function overloading
- Templates
The first case can be demonstrated as follows:
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
typedef std::vector<int> ivec;
ivec ends(const ivec& x, std::size_t n = 6)
{
n = std::min(n, x.size() / 2);
ivec res(2 * n);
std::copy(x.begin(), x.begin() + n, res.begin());
std::copy(x.end() - n, x.end(), res.begin() + n);
return res;
}
typedef std::vector<double> dvec;
dvec ends(const dvec& x, std::size_t n = 6)
{
n = std::min(n, x.size() / 2);
dvec res(2 * n);
std::copy(x.begin(), x.begin() + n, res.begin());
std::copy(x.end() - n, x.end(), res.begin() + n);
return res;
}
typedef std::vector<std::string> svec;
// and so on...
int main()
{
ivec x, xres;
dvec y, yres;
for (int i = 0; i < 20; i++) {
x.push_back(i);
y.push_back(i + 0.5);
}
xres = ends(x, 4);
yres = ends(y);
for (std::size_t i = 0; i < xres.size(); i++) {
std::cout << xres[i] << "\n";
}
for (std::size_t i = 0; i < yres.size(); i++) {
std::cout << yres[i] << "\n";
}
}
Although the above program meets our criteria, the code duplication is profound. Being seasoned C++ programmers, we recognize this as a textbook use case for templates and refactor accordingly:
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
template <typename T>
T ends(const T& x, std::size_t n = 6)
{
n = std::min(n, x.size() / 2);
T res(2 * n);
std::copy(x.begin(), x.begin() + n, res.begin());
std::copy(x.end() - n, x.end(), res.begin() + n);
return res;
}
typedef std::vector<int> ivec;
typedef std::vector<double> dvec;
typedef std::vector<std::string> svec;
// and so on...
int main()
{
// as before
}
This approach is much more maintainable as we have a single implementation
of ends rather than one implementation per typedef. With this in hand, we
now look to make our C++ version of ends callable from R via Rcpp.
Many people, myself included, have attempted some variation of the following at one point or another:
#include <Rcpp.h>
// [[Rcpp::export]]
template <typename T>
T ends(const T& x, std::size_t n = 6)
{
n = std::min(n, x.size() / 2);
T res(2 * n);
std::copy(x.begin(), x.begin() + n, res.begin());
std::copy(x.end() - n, x.end(), res.begin() + n);
return res;
}
Sadly, this does not work: magical as Rcpp attributes may be, there are limits to what they can do, and at least for the time being, translating C++ template functions into something compatible with R’s C API is out of the question. Similarly, the first C++ approach from earlier is also not viable, as the C programming language does not support function overloading. In fact, C does not support any flavor of type-safe static polymorphism, meaning that our generic function must be implemented through run-time polymorphism, as touched on in Kevin Ushey’s Gallery article Dynamic Wrapping and Recursion with Rcpp.
Armed with the almighty TYPEOF macro and a SEXPTYPE cheatsheet, we
modify the template code like so:
#include <Rcpp.h>
using namespace Rcpp;
namespace impl {
template <int RTYPE>
Vector<RTYPE> ends(const Vector<RTYPE>& x, int n)
{
n = std::min((R_xlen_t)n, x.size() / 2);
Vector<RTYPE> res(2 * n);
std::copy(x.begin(), x.begin() + n, res.begin());
std::copy(x.end() - n, x.end(), res.begin() + n);
return res;
}
} // impl
// [[Rcpp::export]]
SEXP ends(SEXP x, int n = 6) {
switch (TYPEOF(x)) {
case INTSXP: {
return impl::ends(as<IntegerVector>(x), n);
}
case REALSXP: {
return impl::ends(as<NumericVector>(x), n);
}
case STRSXP: {
return impl::ends(as<CharacterVector>(x), n);
}
case LGLSXP: {
return impl::ends(as<LogicalVector>(x), n);
}
case CPLXSXP: {
return impl::ends(as<ComplexVector>(x), n);
}
default: {
warning(
"Invalid SEXPTYPE %d (%s).\n",
TYPEOF(x), type2name(x)
);
return R_NilValue;
}
}
}
ends(1:9)
[1] 1 2 3 4 6 7 8 9
ends(letters, 3)
[1] "a" "b" "c" "x" "y" "z"
ends(rnorm(20), 2)
[1] -1.067824 -0.217975 -0.305963 -0.380471
ends(list())
Warning in ends(list()): Invalid SEXPTYPE 19 (VECSXP).
NULL
Some key remarks:
- We changed the ends template to use an integer parameter instead of a type parameter. This is a crucial point, and later on, we will exploit it to our benefit.
- We use SEXP for our input / output vector since we need a single input / output type. In this particular situation, replacing SEXP with the Rcpp type RObject would also be suitable as it is a generic class capable of representing any SEXP type.
- Because the input is a SEXP, we convert it to the appropriate Rcpp::Vector type within each case label. (For further reference, see Rcpp's list of vector aliases.) Finally, we could dress each return value in Rcpp::wrap to convert the Rcpp::Vector to a SEXP, but it isn’t necessary because Rcpp attributes will do this automatically (if possible).
At this point we have a polymorphic function, written in C++, and callable from
R. But that switch statement sure is an eyesore, and it will need to be
implemented every time we wish to export a generic function to R. Aesthetics
aside, a more pressing concern is that boilerplate such as this increases the
likelihood of introducing bugs into our codebase – and since we are leveraging
run-time dispatch, these bugs will not be caught by the compiler. For example,
there is nothing to prevent this from compiling:
// ...
case INTSXP: {
return impl::ends(as<CharacterVector>(x), n);
}
// ...
In our particular case, such mistakes likely would not be too disastrous, but it should not be difficult to see how situations like this can put you (or a user of your library!) on the fast track to segfault.
The C preprocessor is undeniably one of the more controversial aspects of the C++ programming language, as its utility as a metaprogramming tool is rivaled only by its potential for abuse. A proper discussion of the various pitfalls associated with C-style macros is well beyond the scope of this article, so the reader is encouraged to explore this topic on their own. On the bright side, the particular macros that we will be discussing are sufficiently complex and limited in scope that misuse is much more likely to result in a compiler error than a silent bug, so practically speaking, one can expect a fair bit of return for relatively little risk.
At a high level, we summarize the RCPP_RETURN macros as follows:
- There are two macros, one for vectors and one for matrices: RCPP_RETURN_VECTOR and RCPP_RETURN_MATRIX, respectively.
- Code is generated for the following SEXPTYPEs:
  - INTSXP (integers)
  - REALSXP (numerics)
  - RAWSXP (raw bits)
  - LGLSXP (logicals)
  - CPLXSXP (complex numbers)
  - STRSXP (characters / strings)
  - VECSXP (lists)
  - EXPRSXP (expressions)
- Each macro takes a template function and a SEXP object.
Finally, the template function must meet the following criteria:
- It is templated on a single integer parameter (representing the Vector type).
- In the C++98 case, it accepts a single SEXP (or something convertible to SEXP) argument.
Examining our templated impl::ends function from the previous section, we see
that it meets the first requirement, but fails the second, due to its second
parameter n. Before exploring how ends might be adapted to meet the (C++98)
template requirements, it will be helpful to demonstrate correct usage with a few
simple examples.
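As a rough mental model of what such a macro generates, the following standalone sketch replaces SEXP with a toy tagged struct. Everything in it (Tag, Object, Storage, Vec, RETURN_VECTOR_SKETCH) is hypothetical and is not Rcpp's actual definition; it only illustrates the idea of writing the switch once and letting template argument deduction select the right instantiation of a function templated on an integer.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical type codes standing in for R's SEXPTYPEs.
enum Tag { INT_TAG, REAL_TAG, STR_TAG };

// Toy type-erased object standing in for SEXP: a run-time tag plus storage.
struct Object {
    Tag tag;
    std::vector<int> ints;
    std::vector<double> reals;
    std::vector<std::string> strs;
};

// Maps each integer tag to its element type and storage member.
template <int TAG> struct Storage;

template <> struct Storage<INT_TAG> {
    typedef int type;
    static const std::vector<int>& get(const Object& o) { return o.ints; }
};
template <> struct Storage<REAL_TAG> {
    typedef double type;
    static const std::vector<double>& get(const Object& o) { return o.reals; }
};
template <> struct Storage<STR_TAG> {
    typedef std::string type;
    static const std::vector<std::string>& get(const Object& o) { return o.strs; }
};

// Typed view over an Object, parameterized on an integer tag
// (loosely analogous to Rcpp::Vector<RTYPE>).
template <int TAG>
struct Vec {
    typedef typename Storage<TAG>::type value_type;
    const std::vector<value_type>& data;
    explicit Vec(const Object& o) : data(Storage<TAG>::get(o)) {}
    std::size_t size() const { return data.size(); }
};

// The switch is written once, here, instead of in every exported function.
#define RETURN_VECTOR_SKETCH(FUN, OBJ)                           \
    switch ((OBJ).tag) {                                         \
    case INT_TAG:  return FUN(Vec<INT_TAG>(OBJ));                \
    case REAL_TAG: return FUN(Vec<REAL_TAG>(OBJ));               \
    case STR_TAG:  return FUN(Vec<STR_TAG>(OBJ));                \
    default: throw std::invalid_argument("unsupported tag");     \
    }

namespace impl {

// Templated on a single integer parameter, taking a single argument.
template <int TAG>
int len(const Vec<TAG>& x)
{
    return static_cast<int>(x.size());
}

} // impl

// Single entry point; the tag is inspected at run time.
int len(const Object& x)
{
    RETURN_VECTOR_SKETCH(impl::len, x);
}
```

Because each case label and its Vec instantiation are paired inside the macro, a mismatch such as handling INT_TAG with a string view can only be introduced in one place, rather than in every function that needs dispatch.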
We consider two situations where our input type is generic, but our output type is fixed:
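- Determining the length of a vector, in which case an int is always returned.
- Determining the dimensions of a matrix, in which case an IntegerVector is always returned.
First, our len function: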
#include <Rcpp.h>
using namespace Rcpp;
namespace impl {
template <int RTYPE>
int len(const Vector<RTYPE>& x)
{
return static_cast<int>(x.size());
}
} // impl
// [[Rcpp::export]]
int len(RObject x)
{
RCPP_RETURN_VECTOR(impl::len, x);
}
(Note that we omit the return keyword, as it is part of the macro definition.)
Testing this out on the various supported vector types:
classes <- c(
"integer", "numeric", "raw", "logical",
"complex", "character", "list", "expression"
)
sapply(seq_along(classes), function(i) {
x <- vector(mode = classes[i], length = i)
all.equal(len(x), length(x))
})
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Similarly, creating a generic function that determines the dimensions of an input matrix is trivial:
#include <Rcpp.h>
using namespace Rcpp;
namespace impl {
template <int RTYPE>
Vector<INTSXP> dims(const Matrix<RTYPE>& x)
{
return Vector<INTSXP>::create(x.nrow(), x.ncol());
}
} // impl
// [[Rcpp::export]]
IntegerVector dims(RObject x)
{
RCPP_RETURN_MATRIX(impl::dims, x);
}
And checking this against base::dim,
classes <- c(
"integer", "numeric", "raw", "logical",
"complex", "character", "list", "expression"
)
sapply(seq_along(classes), function(i) {
x <- matrix(
vector(mode = classes[i], length = i ^ 2),
nrow = i
)
all.equal(dims(x), dim(x))
})
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
everything seems to be in order.
It’s worth pointing out that, for various reasons, it is possible to pass a
matrix object to an Rcpp function which calls RCPP_RETURN_VECTOR:
len(1:9)
[1] 9
len(matrix(1:9, 3))
[1] 9
Although this is sensible in the case of len – and even saves us from
implementing a matrix-specific version – there may be situations where
this behavior is undesirable. To distinguish between the two object types we
can rely on the API function Rf_isMatrix:
#include <Rcpp.h>
using namespace Rcpp;
namespace impl {
template <int RTYPE>
int len(const Vector<RTYPE>& x)
{
return static_cast<int>(x.size());
}
} // impl
// [[Rcpp::export]]
int len2(RObject x)
{
if (Rf_isMatrix(x)) {
stop("matrix objects not supported.");
}
RCPP_RETURN_VECTOR(impl::len, x);
}
len2(1:9)
[1] 9
tryCatch(
len2(matrix(1:9, 3)),
error = function(e) print(e)
)
<Rcpp::exception in len2(matrix(1:9, 3)): matrix objects not supported.>
We don’t have to worry about the opposite scenario, as this is already handled within Rcpp library code:
tryCatch(
dims(1:5),
error = function(e) print(e)
)
<Rcpp::not_a_matrix in dims(1:5): Not a matrix.>
In many cases our return type will correspond to our input type. For example,
exposing the Rcpp sugar function rev is trivial:
#include <Rcpp.h>
using namespace Rcpp;
template <int RTYPE>
Vector<RTYPE> Rev(const Vector<RTYPE>& x)
{
return rev(x);
}
// [[Rcpp::export]]
RObject rev2(RObject x)
{
RCPP_RETURN_VECTOR(Rev, x);
}
rev2(1:5)
[1] 5 4 3 2 1
rev2(as.list(1:5 + 2i))
[[1]]
[1] 5+2i

[[2]]
[1] 4+2i

[[3]]
[1] 3+2i

[[4]]
[1] 2+2i

[[5]]
[1] 1+2i

rawToChar(rev2(charToRaw("abcde")))
[1] "edcba"
As a slightly more complex example, suppose we would like to write a function
to sort matrices which preserves the dimensions of the input, since
base::sort falls short of the latter stipulation:
sort(matrix(c(1, 3, 5, 7, 9, 2, 4, 6, 8), 3))
[1] 1 2 3 4 5 6 7 8 9
There are two obstacles we need to overcome:
- The Matrix class does not implement its own sort method. However, since Matrix inherits from Vector, we can sort the matrix as a Vector and construct the result from this sorted data with the appropriate dimensions.
- The RCPP_RETURN macros will generate code to handle exactly 8 SEXPTYPEs; no less, no more. Some functions, like Vector::sort, are not implemented for all eight of these types, so in order to avoid a compilation error, we need to add template specializations.
With this in mind, we have the following implementation of msort:
#include <Rcpp.h>
using namespace Rcpp;
// primary template
template <int RTYPE>
Matrix<RTYPE> Msort(const Matrix<RTYPE>& x)
{
return Matrix<RTYPE>(
x.nrow(),
x.ncol(),
clone(x).sort().begin()
);
}
// template specializations for raw vectors,
// lists, and expression vectors
//
// we can just throw an exception, as base::sort
// does the same
template <>
Matrix<RAWSXP> Msort(const Matrix<RAWSXP>& x)
{ stop("sort not allowed for raw vectors."); }
template <>
Matrix<VECSXP> Msort(const Matrix<VECSXP>& x)
{ stop("sort not allowed for lists."); }
template <>
Matrix<EXPRSXP> Msort(const Matrix<EXPRSXP>& x)
{ stop("sort not allowed for expression vectors."); }
// [[Rcpp::export]]
RObject msort(RObject x)
{
RCPP_RETURN_MATRIX(Msort, x);
}
Note that elements will be sorted in column-major order since we filled our
result using the Matrix constructor that takes an iterator. We can verify that msort works as intended by checking a few test cases:
(x <- matrix(c(1, 3, 5, 7, 9, 2, 4, 6, 8), 3))
[,1] [,2] [,3]
[1,] 1 7 4
[2,] 3 9 6
[3,] 5 2 8
msort(x)
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
sort(x)
[1] 1 2 3 4 5 6 7 8 9
(x <- matrix(c("a", "c", "z", "y", "b", "x"), 3))
[,1] [,2]
[1,] "a" "y"
[2,] "c" "b"
[3,] "z" "x"
msort(x)
[,1] [,2]
[1,] "a" "x"
[2,] "b" "y"
[3,] "c" "z"
sort(x)
[1] "a" "b" "c" "x" "y" "z"
x <- matrix(as.list(1:9), 3); str(x)
List of 9
 $ : int 1
 $ : int 2
 $ : int 3
 $ : int 4
 $ : int 5
 $ : int 6
 $ : int 7
 $ : int 8
 $ : int 9
 - attr(*, "dim")= int [1:2] 3 3
tryCatch(
msort(x),
error = function(e) print(e)
)
<Rcpp::exception in msort(x): sort not allowed for lists.>
tryCatch(
sort(x),
error = function(e) print(e)
)
<simpleError in sort.int(x, na.last = na.last, decreasing = decreasing, ...): 'x' must be atomic>
Having familiarized ourselves with basic usage of the RCPP_RETURN macros, we
can return to the problem of implementing our ends function with
RCPP_RETURN_VECTOR. Just to recap the situation, the template function
passed to the macro must meet the following two criteria in C++98 mode:
- It is templated on a single integer parameter (representing the Vector type).
- It accepts a single SEXP (or convertible to SEXP) argument.
Currently ends has the signature
template <int RTYPE>
Vector<RTYPE> ends(const Vector<RTYPE>&, int);
meaning that the first criterion is met, but the second is not. In order to
preserve the functionality provided by the int parameter, we effectively
need to generate a new template function which has access to the user-provided
value at run-time, but without passing it as a function parameter.
The technique we are looking for is called partial function application, and it can be implemented
using one of my favorite C++ tools: the functor. Contrary to typical functor
usage, however, our implementation features a slight twist: rather than
using a template class with a non-template function call operator, as is the
case with std::greater, etc., we are
going to make operator() a template itself:
#include <Rcpp.h>
using namespace Rcpp;
class Ends {
private:
int n;
public:
Ends(int n)
: n(n)
{}
template <int RTYPE>
Vector<RTYPE> operator()(const Vector<RTYPE>& x)
{
n = std::min((R_xlen_t)n, x.size() / 2);
Vector<RTYPE> res(2 * n);
std::copy(x.begin(), x.begin() + n, res.begin());
std::copy(x.end() - n, x.end(), res.begin() + n);
return res;
}
};
// [[Rcpp::export]]
RObject ends(RObject x, int n = 6)
{
RCPP_RETURN_VECTOR(Ends(n), x);
}
Not bad, right? All in all, the changes are fairly minor:
- The body of Ends::operator() is identical to that of impl::ends.
- n is now a private data member rather than a function parameter, which gets initialized in the constructor.
- Instead of passing a template function to RCPP_RETURN_VECTOR, we pass the expression Ends(n), where n is supplied at run-time from the R session. In turn, the macro will invoke Ends::operator() on the SEXP (RObject, in our case), using the specified n value.
We can demonstrate this on various test cases:
ends(1:9)
[1] 1 2 3 4 6 7 8 9
ends(letters, 3)
[1] "a" "b" "c" "x" "y" "z"
ends(rnorm(20), 2)
[1] -0.694707 -0.207917 0.123854 0.215942
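The functor trick does not depend on Rcpp, so it may help to see it in isolation. The sketch below is a plain C++ analogue that assumes std::vector in place of Rcpp::Vector and a type parameter in place of the integer RTYPE; apply_unary is a hypothetical helper that, like the C++98 macros, can forward only a single argument. Unlike the Rcpp version above, the call operator here is const and clamps into a local variable instead of overwriting the stored n.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// A caller that forwards exactly one argument to whatever it is given.
// (Hypothetical helper for illustration; not part of Rcpp.)
template <typename F, typename T>
std::vector<T> apply_unary(F f, const std::vector<T>& x)
{
    return f(x);
}

// Partial application: n is bound when the object is constructed,
// leaving a callable that takes only the vector.
class Ends {
private:
    std::size_t n_;

public:
    explicit Ends(std::size_t n) : n_(n) {}

    // The call operator is itself a template, so a single Ends object
    // works for any element type.
    template <typename T>
    std::vector<T> operator()(const std::vector<T>& x) const
    {
        // Clamp into a local so the stored value is left untouched.
        const std::size_t n = std::min(n_, x.size() / 2);
        std::vector<T> res(2 * n);
        std::copy(x.begin(), x.begin() + n, res.begin());
        std::copy(x.end() - n, x.end(), res.begin() + n);
        return res;
    }
};
```

For instance, apply_unary(Ends(6), v) with v holding the integers 1 through 9 yields the eight values 1 2 3 4 6 7 8 9, matching the R output above.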
As alluded to earlier, a more modern compiler (supporting C++11 or later)
will free us from the “single SEXP argument” restriction, which means
that we no longer have to move additional parameters into a function
object. Here is ends re-implemented using the C++11 version of
RCPP_RETURN_VECTOR (note the // [[Rcpp::plugins(cpp11)]]
attribute declaration):
// [[Rcpp::plugins(cpp11)]]
#include <Rcpp.h>
using namespace Rcpp;
namespace impl {
template <int RTYPE>
Vector<RTYPE> ends(const Vector<RTYPE>& x, int n)
{
n = std::min((R_xlen_t)n, x.size() / 2);
Vector<RTYPE> res(2 * n);
std::copy(x.begin(), x.begin() + n, res.begin());
std::copy(x.end() - n, x.end(), res.begin() + n);
return res;
}
} // impl
// [[Rcpp::export]]
RObject ends(RObject x, int n = 6)
{
RCPP_RETURN_VECTOR(impl::ends, x, n);
}
ends(1:9)
[1] 1 2 3 4 6 7 8 9
ends(letters, 3)
[1] "a" "b" "c" "x" "y" "z"
ends(rnorm(20), 2)
[1] 0.379639 -0.502323 0.181303 -0.138891
The current definition of RCPP_RETURN_VECTOR and RCPP_RETURN_MATRIX allows for up
to 24 arguments to be passed, although in principle the true upper bound
depends on your compiler’s implementation of the __VA_ARGS__ macro, which
is likely greater than 24. Having said this, if you find yourself trying
to pass around more than 3 or 4 parameters at once, it’s probably time
to do some refactoring.
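To make the argument forwarding concrete, here is a minimal standalone sketch of a variadic dispatch macro. It is not Rcpp's definition: Object and RETURN_VECTOR_SKETCH are hypothetical, the toy type covers only two element types, the template uses a type parameter for brevity, and this simplified form requires at least one extra argument after the object.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy type-erased value (hypothetical stand-in for SEXP / RObject).
struct Object {
    enum Kind { INTS, REALS } kind;
    std::vector<int> ints;
    std::vector<double> reals;
    // Converting constructors let a typed result become an Object on return.
    Object(const std::vector<int>& v) : kind(INTS), ints(v) {}
    Object(const std::vector<double>& v) : kind(REALS), reals(v) {}
};

// Variadic macro: everything after OBJ is forwarded unchanged to FUN.
#define RETURN_VECTOR_SKETCH(FUN, OBJ, ...)                      \
    switch ((OBJ).kind) {                                        \
    case Object::INTS: return FUN((OBJ).ints, __VA_ARGS__);      \
    default:           return FUN((OBJ).reals, __VA_ARGS__);     \
    }

namespace impl {

// The extra parameter n stays an ordinary function parameter.
template <typename T>
std::vector<T> ends(const std::vector<T>& x, std::size_t n)
{
    n = std::min(n, x.size() / 2);
    std::vector<T> res(2 * n);
    std::copy(x.begin(), x.begin() + n, res.begin());
    std::copy(x.end() - n, x.end(), res.begin() + n);
    return res;
}

} // impl

// Single entry point: dispatches on the run-time kind and forwards n.
Object ends(const Object& x, std::size_t n = 6)
{
    RETURN_VECTOR_SKETCH(impl::ends, x, n);
}
```

With this shape, adding a second or third parameter to impl::ends only requires listing it after x in the macro call; no function object is needed to carry the extra state.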