Nathan Russell — written Jul 26, 2017 — source
- C++ templates and function overloading are incompatible with R’s C API, so polymorphism must be achieved via run-time dispatch, handled explicitly by the programmer.
- The traditional technique for operating on SEXP objects in a generic manner entails a great deal of boilerplate code, which can be unsightly, unmaintainable, and error-prone.
- The desire to provide polymorphic functions which operate on vectors and matrices is common enough that Rcpp provides the utility macros RCPP_RETURN_VECTOR and RCPP_RETURN_MATRIX to simplify the process.
- Subsequently, these macros were extended to handle an (essentially) arbitrary number of arguments, provided that a C++11 compiler is used.
To motivate a discussion of polymorphic functions, imagine that we desire a function (ends) which, given an input vector x and an integer n, returns a vector containing the first and last n elements of x concatenated. Furthermore, we require ends to be a single interface which is capable of handling multiple types of input vectors (integers, floating point values, strings, etc.), rather than having a separate function for each case. How can this be achieved?
A naïve implementation in R might look something like this:
[1] 1 2 3 4 6 7 8 9
[1] "a" "b" "c" "x" "y" "z"
[1] -0.560476 -0.230177 0.701356 -0.472791
The simple function above demonstrates a key feature of many dynamically-typed programming languages, one which has undoubtedly been a significant factor in their rise to popularity: the ability to write generic code with little-to-no additional effort on the part of the developer. Without getting into a discussion of the pros and cons of static vs. dynamic typing, it is evident that being able to dispatch a single function generically on multiple object types, as opposed to, e.g., having to manage separate implementations of ends for each vector type, helps us to write more concise, expressive code. This being an article about Rcpp, however, the story does not end here, and we consider how this problem might be approached in C++, which has a much stricter type system than R.
For simplicity, we begin by considering solutions in the context of a “pure” (i.e., not called from R) C++ program. Eschewing more complicated tactics involving run-time dispatch (virtual functions, etc.), the C++ language provides us with two straightforward methods of achieving this at compile time:
- Function overloading
- Templates
The first case can be demonstrated as follows:
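A minimal sketch of the overloading approach, assuming only two element types (int and std::string) and a default of n = 6. The typedef names ivec and svec and the helper example() are illustrative choices rather than anything prescribed by Rcpp:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

typedef std::vector<int> ivec;
typedef std::vector<std::string> svec;

// Overload for vectors of int.
ivec ends(const ivec& x, std::size_t n = 6) {
    n = std::min(n, x.size());  // never read past either end of x
    ivec res;
    res.reserve(2 * n);
    std::copy(x.begin(), x.begin() + n, std::back_inserter(res));
    std::copy(x.end() - n, x.end(), std::back_inserter(res));
    return res;
}

// Overload for vectors of std::string: same body, different types.
svec ends(const svec& x, std::size_t n = 6) {
    n = std::min(n, x.size());
    svec res;
    res.reserve(2 * n);
    std::copy(x.begin(), x.begin() + n, std::back_inserter(res));
    std::copy(x.end() - n, x.end(), std::back_inserter(res));
    return res;
}

// Usage: the overload is selected at compile time from the argument type.
void example() {
    int raw[] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    ivec x(raw, raw + 9);
    ivec y = ends(x, 4);
    assert(y.size() == 8);
    assert(y[0] == 1 && y[3] == 4 && y[4] == 6 && y[7] == 9);
}
```

Every additional element type would need another copy of the same function body.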
Although the above program meets our criteria, the code duplication is profound. Being seasoned C++ programmers, we recognize this as a textbook use case for templates and refactor accordingly:
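One way the refactoring might look, sketched with a type parameter T over std::vector (the helper example() is illustrative only):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// One implementation for every element type T.
template <typename T>
std::vector<T> ends(const std::vector<T>& x, std::size_t n = 6) {
    n = std::min(n, x.size());  // never read past either end of x
    std::vector<T> res;
    res.reserve(2 * n);
    std::copy(x.begin(), x.begin() + n, std::back_inserter(res));
    std::copy(x.end() - n, x.end(), std::back_inserter(res));
    return res;
}

// Usage: T is deduced from the argument, so no type is spelled out at the call.
void example() {
    std::vector<double> x;
    for (int i = 1; i <= 9; ++i) x.push_back(i / 2.0);  // 0.5, 1.0, ..., 4.5
    std::vector<double> y = ends(x, 2);
    assert(y.size() == 4);
    assert(y[0] == 0.5 && y[1] == 1.0 && y[2] == 4.0 && y[3] == 4.5);
}
```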
This approach is much more maintainable as we have a single implementation of ends rather than one implementation per typedef. With this in hand, we now look to make our C++ version of ends callable from R via Rcpp.
Many people, myself included, have attempted some variation of the following at one point or another:
Sadly this does not work: magical as Rcpp attributes may be, there are limits to what they can do, and at least for the time being, translating C++ template functions into something compatible with R’s C API is out of the question. Similarly, the first C++ approach from earlier is also not viable, as the C programming language does not support function overloading. In fact, C does not support any flavor of type-safe static polymorphism, meaning that our generic function must be implemented through run-time polymorphism, as touched on in Kevin Ushey’s Gallery article Dynamic Wrapping and Recursion with Rcpp.
Armed with the almighty TYPEOF macro and a SEXPTYPE cheat sheet, we modify the template code like so:
[1] 1 2 3 4 6 7 8 9
[1] "a" "b" "c" "x" "y" "z"
[1] -1.067824 -0.217975 -0.305963 -0.380471
Warning in ends(list()): Invalid SEXPTYPE 19 (VECSXP).
NULL
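To see the dispatch pattern in isolation, here is a sketch in plain C++ that needs neither R nor Rcpp. Tag and Value are hypothetical stand-ins for SEXPTYPE and SEXP (they are not R or Rcpp types), and unlike the Rcpp version, whose output above shows a warning and NULL for a list, an unsupported tag throws here:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins for SEXPTYPE and SEXP; these are NOT R or Rcpp types.
enum Tag { INT_TAG, STR_TAG, LIST_TAG };

struct Value {
    Tag tag;                        // run-time type code, like TYPEOF(x)
    std::vector<int> ints;          // payload used when tag == INT_TAG
    std::vector<std::string> strs;  // payload used when tag == STR_TAG
};

namespace impl {

// The statically typed implementation.
template <typename T>
std::vector<T> ends(const std::vector<T>& x, std::size_t n) {
    n = std::min(n, x.size());
    std::vector<T> res;
    res.reserve(2 * n);
    std::copy(x.begin(), x.begin() + n, std::back_inserter(res));
    std::copy(x.end() - n, x.end(), std::back_inserter(res));
    return res;
}

}  // namespace impl

// The single, untemplated entry point: inspect the tag at run time, then
// forward to the matching instantiation of impl::ends.
Value ends(const Value& x, std::size_t n = 6) {
    Value res;
    res.tag = x.tag;
    switch (x.tag) {
        case INT_TAG:
            res.ints = impl::ends(x.ints, n);
            return res;
        case STR_TAG:
            res.strs = impl::ends(x.strs, n);
            return res;
        default:
            throw std::invalid_argument("unsupported type tag");
    }
}

// Usage: the caller never names an element type.
void example() {
    Value v;
    v.tag = INT_TAG;
    for (int i = 1; i <= 9; ++i) v.ints.push_back(i);
    Value r = ends(v, 4);
    assert(r.tag == INT_TAG && r.ints.size() == 8);
    assert(r.ints[3] == 4 && r.ints[4] == 6);
}
```

Each case label pairs a tag with a payload by hand, and the compiler does not check that the pairing is correct.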
Some key remarks:
- The ends template now uses an integer parameter instead of a type parameter. This is a crucial point, and later on, we will exploit it to our benefit.
- We use SEXP for our input / output vector since we need a single input / output type. In this particular situation, replacing SEXP with the Rcpp type RObject would also be suitable as it is a generic class capable of representing any SEXP type.
- The input is cast to the appropriate Rcpp::Vector type within each case label. (For further reference, the list of vector aliases can be found here). Finally, we could dress each return value in Rcpp::wrap to convert the Rcpp::Vector to a SEXP, but it isn’t necessary because Rcpp attributes will do this automatically (if possible).
At this point we have a polymorphic function, written in C++, and callable from R. But that switch statement sure is an eyesore, and it will need to be implemented every time we wish to export a generic function to R. Aesthetics aside, a more pressing concern is that boilerplate such as this increases the likelihood of introducing bugs into our codebase – and since we are leveraging run-time dispatch, these bugs will not be caught by the compiler. For example, there is nothing to prevent this from compiling:
// ...
case INTSXP: {
    return impl::ends(as<CharacterVector>(x), n);
}
// ...
In our particular case, such mistakes likely would not be too disastrous, but it should not be difficult to see how situations like this can put you (or a user of your library!) on the fast track to segfault.
The C preprocessor is undeniably one of the more controversial aspects of the C++ programming language, as its utility as a metaprogramming tool is rivaled only by its potential for abuse. A proper discussion of the various pitfalls associated with C-style macros is well beyond the scope of this article, so the reader is encouraged to explore this topic on their own. On the bright side, the particular macros that we will be discussing are sufficiently complex and limited in scope that misuse is much more likely to result in a compiler error than a silent bug, so practically speaking, one can expect a fair bit of return for relatively little risk.
At a high level, we summarize the RCPP_RETURN macros as follows:
- There are two separate macros for dealing with vectors and matrices, RCPP_RETURN_VECTOR and RCPP_RETURN_MATRIX, respectively.
- In either case, code is generated for the following SEXPTYPEs:
  - INTSXP (integers)
  - REALSXP (numerics)
  - RAWSXP (raw bytes)
  - LGLSXP (logicals)
  - CPLXSXP (complex numbers)
  - STRSXP (characters / strings)
  - VECSXP (lists)
  - EXPRSXP (expressions)
- Each macro takes a template function and a SEXP object as arguments.
Finally, the template function must meet the following criteria:
- It is templated on a single integer parameter (representing the Vector type).
- In C++98 mode, it takes a single SEXP (or something convertible to SEXP) argument.
Examining our templated impl::ends function from the previous section, we see that it meets the first requirement, but fails the second, due to its second parameter n. Before exploring how ends might be adapted to meet the (C++98) template requirements, it will be helpful to demonstrate correct usage with a few simple examples.
We consider two situations where our input type is generic, but our output type is fixed:
- Determining the length of a vector, in which case an int is always returned.
- Determining the dimensions of a matrix, in which case an IntegerVector is always returned.
First, our len function:
(Note that we omit the return keyword, as it is part of the macro definition.)
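To make the role of the embedded return concrete, here is a loose model of such a dispatch macro in plain C++. Everything in it is a hypothetical stand-in: Tag and Value play the parts of SEXPTYPE and SEXP, Vec<TAG> plays the part of an Rcpp vector type, and RETURN_BY_TAG is an illustration of the idea, not Rcpp's actual macro definition:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins for SEXPTYPE and SEXP; NOT R or Rcpp types.
enum Tag { INT_TAG, STR_TAG };

struct Value {
    Tag tag;
    std::vector<int> ints;
    std::vector<std::string> strs;
};

// Typed wrapper parameterised on an integer tag, constructible from a Value.
template <int TAG> struct Vec;

template <> struct Vec<INT_TAG> {
    std::vector<int> data;
    explicit Vec(const Value& v) : data(v.ints) {}
};

template <> struct Vec<STR_TAG> {
    std::vector<std::string> data;
    explicit Vec(const Value& v) : data(v.strs) {}
};

// Template function: one integer template parameter, one argument.
template <int TAG>
int len_impl(const Vec<TAG>& x) {
    return static_cast<int>(x.data.size());
}

// Illustrative dispatch macro. Note that `return` lives inside the macro,
// so the calling function does not write it.
#define RETURN_BY_TAG(FUN, VALUE)                               \
    switch ((VALUE).tag) {                                      \
        case INT_TAG: return FUN(Vec<INT_TAG>(VALUE));          \
        case STR_TAG: return FUN(Vec<STR_TAG>(VALUE));          \
        default: throw std::range_error("unsupported tag");     \
    }

// The generic entry point is now a one-liner.
int len(const Value& x) {
    RETURN_BY_TAG(len_impl, x);
}

void example() {
    Value v;
    v.tag = INT_TAG;
    v.ints.assign(3, 0);
    assert(len(v) == 3);
}
```

The switch is written once, inside the macro, so each case pairs a tag with its wrapper type in a single place.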
Testing this out on the various supported vector types:
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Similarly, creating a generic function that determines the dimensions of an input matrix is trivial:
And checking this against base::dim,
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
everything seems to be in order.
It’s worth pointing out that, for various reasons, it is possible to pass a matrix object to an Rcpp function which calls RCPP_RETURN_VECTOR:
[1] 9
[1] 9
Although this is sensible in the case of len – and even saves us from implementing a matrix-specific version – there may be situations where this behavior is undesirable. To distinguish between the two object types we can rely on the API function Rf_isMatrix:
[1] 9
<Rcpp::exception in len2(matrix(1:9, 3)): matrix objects not supported.>
We don’t have to worry about the opposite scenario, as this is already handled within Rcpp library code:
<Rcpp::not_a_matrix in dims(1:5): Not a matrix.>
In many cases our return type will correspond to our input type. For example, exposing the Rcpp sugar function rev is trivial:
[1] 5 4 3 2 1
[[1]]
[1] 5+2i

[[2]]
[1] 4+2i

[[3]]
[1] 3+2i

[[4]]
[1] 2+2i

[[5]]
[1] 1+2i
[1] "edcba"
As a slightly more complex example, suppose we would like to write a function to sort matrices which preserves the dimensions of the input, since base::sort falls short of the latter stipulation:
[1] 1 2 3 4 5 6 7 8 9
There are two obstacles we need to overcome:
- The Matrix class does not implement its own sort method. However, since Matrix inherits from Vector, we can sort the matrix as a Vector and construct the result from this sorted data with the appropriate dimensions.
- The RCPP_RETURN macros will generate code to handle exactly 8 SEXPTYPEs; no fewer, no more. Some functions, like Vector::sort, are not implemented for all eight of these types, so in order to avoid a compilation error, we need to add template specializations.
With this in mind, we have the following implementation of msort:
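Both ideas can be sketched without Rcpp. In the plain C++ model below, Matrix<T> is a hypothetical column-major container and Item is a hypothetical stand-in for list elements, which have no ordering; neither is an Rcpp type. The error message is borrowed from the output shown further down:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical element type standing in for list elements (no operator<).
struct Item {};

// Minimal column-major matrix: data holds nrow * ncol elements.
template <typename T>
struct Matrix {
    std::size_t nrow, ncol;
    std::vector<T> data;
    Matrix(std::size_t nr, std::size_t nc, const std::vector<T>& d)
        : nrow(nr), ncol(nc), data(d) {}
};

// Generic case: sort the underlying storage, then rebuild a matrix with the
// same dimensions, so the result is ordered down the columns.
template <typename T>
Matrix<T> msort(const Matrix<T>& x) {
    std::vector<T> sorted(x.data);
    std::sort(sorted.begin(), sorted.end());
    return Matrix<T>(x.nrow, x.ncol, sorted);
}

// Specialization for the unsortable type. Without it, a call with
// Matrix<Item> would instantiate the generic version and fail to compile,
// because Item cannot be compared.
template <>
Matrix<Item> msort(const Matrix<Item>&) {
    throw std::invalid_argument("sort not allowed for lists.");
}

void example() {
    int raw[] = {1, 3, 5, 7, 9, 2, 4, 6, 8};  // columns (1,3,5), (7,9,2), (4,6,8)
    Matrix<int> m(3, 3, std::vector<int>(raw, raw + 9));
    Matrix<int> s = msort(m);
    assert(s.nrow == 3 && s.ncol == 3);
    assert(s.data.front() == 1 && s.data.back() == 9);
}
```

The specialization turns what would be a compile-time failure into a run-time error that is raised only when an unsupported type is actually passed.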
Note that elements will be sorted in column-major order since we filled our
result using this constructor. We can verify that msort
works as intended by checking a few test cases:
     [,1] [,2] [,3]
[1,]    1    7    4
[2,]    3    9    6
[3,]    5    2    8
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
[1] 1 2 3 4 5 6 7 8 9
     [,1] [,2]
[1,] "a"  "y"
[2,] "c"  "b"
[3,] "z"  "x"
     [,1] [,2]
[1,] "a"  "x"
[2,] "b"  "y"
[3,] "c"  "z"
[1] "a" "b" "c" "x" "y" "z"
List of 9
 $ : int 1
 $ : int 2
 $ : int 3
 $ : int 4
 $ : int 5
 $ : int 6
 $ : int 7
 $ : int 8
 $ : int 9
 - attr(*, "dim")= int [1:2] 3 3
<Rcpp::exception in msort(x): sort not allowed for lists.>
<simpleError in sort.int(x, na.last = na.last, decreasing = decreasing, ...): 'x' must be atomic>
Having familiarized ourselves with basic usage of the RCPP_RETURN macros, we can return to the problem of implementing our ends function with RCPP_RETURN_VECTOR. Just to recap the situation, the template function passed to the macro must meet the following two criteria in C++98 mode:
- It is templated on a single integer parameter (representing the Vector type).
- It takes a single SEXP (or convertible to SEXP) argument.
Currently ends takes two parameters (the input vector and the integer n), meaning that the first criterion is met, but the second is not. In order to preserve the functionality provided by the int parameter, we effectively need to generate a new template function which has access to the user-provided value at run-time, but without passing it as a function parameter.
The technique we are looking for is called partial function application, and it can be implemented using one of my favorite C++ tools: the functor. Contrary to typical functor usage, however, our implementation features a slight twist: rather than using a template class with a non-template function call operator, as is the case with std::greater, etc., we are going to make operator() a template itself:
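The shape of such a functor can be sketched in plain C++. This version assumes std::vector<T> with a type parameter in place of an Rcpp vector with an integer parameter, so it illustrates the technique rather than the exact Rcpp code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

class Ends {
private:
    std::size_t n;  // captured once, at construction

public:
    explicit Ends(std::size_t n_) : n(n_) {}

    // The call operator is the template, not the class, so a single Ends
    // object can be applied to vectors of any element type.
    template <typename T>
    std::vector<T> operator()(const std::vector<T>& x) const {
        std::size_t k = std::min(n, x.size());
        std::vector<T> res;
        res.reserve(2 * k);
        std::copy(x.begin(), x.begin() + k, std::back_inserter(res));
        std::copy(x.end() - k, x.end(), std::back_inserter(res));
        return res;
    }
};

// Usage: n is bound first; the vector is supplied later as the sole argument.
void example() {
    Ends first_and_last_two(2);

    std::vector<int> x;
    for (int i = 1; i <= 5; ++i) x.push_back(i);
    std::vector<int> y = first_and_last_two(x);
    assert(y.size() == 4);
    assert(y[0] == 1 && y[1] == 2 && y[2] == 4 && y[3] == 5);

    std::vector<std::string> s(4, "m");
    s.front() = "a";
    s.back() = "z";
    std::vector<std::string> t = Ends(1)(s);
    assert(t.size() == 2 && t[0] == "a" && t[1] == "z");
}
```

Because n lives in the object, the call operator is left with exactly one parameter, which is what a single-argument dispatcher requires.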
Not bad, right? All in all, the changes are fairly minor:
- The body of Ends::operator() is identical to that of impl::ends.
- n is now a private data member rather than a function parameter, which gets initialized in the constructor.
- Instead of passing a template function to RCPP_RETURN_VECTOR, we pass the expression Ends(n), where n is supplied at run-time from the R session. In turn, the macro will invoke Ends::operator() on the SEXP (RObject, in our case), using the specified n value.
We can demonstrate this on various test cases:
[1] 1 2 3 4 6 7 8 9
[1] "a" "b" "c" "x" "y" "z"
[1] -0.694707 -0.207917 0.123854 0.215942
As alluded to earlier, a more modern compiler (supporting C++11 or later) will free us from the “single SEXP argument” restriction, which means that we no longer have to move additional parameters into a function object. Here is ends re-implemented using the C++11 version of RCPP_RETURN_VECTOR (note the // [[Rcpp::plugins(cpp11)]] attribute declaration):
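The mechanism that makes extra arguments possible is the variadic macro, which C++11 adopted from C99. The sketch below models the idea in plain C++11 with hypothetical stand-ins (Tag, Value, Vec<TAG>, RETURN_BY_TAG); it is not Rcpp's definition, and for simplicity it assumes at least one extra argument is always supplied:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins for SEXPTYPE and SEXP; NOT R or Rcpp types.
enum Tag { INT_TAG, STR_TAG };

struct Value {
    Tag tag;
    std::vector<int> ints;
    std::vector<std::string> strs;
};

// Typed wrapper parameterised on an integer tag, convertible to and from Value.
template <int TAG> struct Vec;

template <> struct Vec<INT_TAG> {
    std::vector<int> data;
    explicit Vec(const Value& v) : data(v.ints) {}
    Value value() const { Value v; v.tag = INT_TAG; v.ints = data; return v; }
};

template <> struct Vec<STR_TAG> {
    std::vector<std::string> data;
    explicit Vec(const Value& v) : data(v.strs) {}
    Value value() const { Value v; v.tag = STR_TAG; v.strs = data; return v; }
};

// The implementation keeps n as an ordinary function parameter.
template <int TAG>
Value ends_impl(Vec<TAG> x, std::size_t n) {
    n = std::min(n, x.data.size());
    decltype(x.data) res(x.data.begin(), x.data.begin() + n);
    res.insert(res.end(), x.data.end() - n, x.data.end());
    x.data.swap(res);
    return x.value();
}

// Illustrative variadic dispatch macro: everything after VALUE is forwarded
// to FUN unchanged. This simple form requires at least one extra argument.
#define RETURN_BY_TAG(FUN, VALUE, ...)                                  \
    switch ((VALUE).tag) {                                              \
        case INT_TAG: return FUN(Vec<INT_TAG>(VALUE), __VA_ARGS__);     \
        case STR_TAG: return FUN(Vec<STR_TAG>(VALUE), __VA_ARGS__);     \
        default: throw std::range_error("unsupported tag");             \
    }

Value ends(const Value& x, std::size_t n = 6) {
    RETURN_BY_TAG(ends_impl, x, n);
}

void example() {
    Value v;
    v.tag = INT_TAG;
    for (int i = 1; i <= 9; ++i) v.ints.push_back(i);
    Value r = ends(v, 4);
    assert(r.tag == INT_TAG && r.ints.size() == 8);
    assert(r.ints[3] == 4 && r.ints[4] == 6);
}
```

With the extra argument forwarded by the macro, the function object from the previous section is no longer needed.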
[1] 1 2 3 4 6 7 8 9
[1] "a" "b" "c" "x" "y" "z"
[1] 0.379639 -0.502323 0.181303 -0.138891
The current definition of RCPP_RETURN_VECTOR and RCPP_RETURN_MATRIX allows for up to 24 arguments to be passed, although in principle, the true upper bound depends on your compiler’s implementation of the __VA_ARGS__ macro, which is likely greater than 24. Having said this, if you find yourself trying to pass around more than 3 or 4 parameters at once, it’s probably time to do some refactoring.