Hadley Wickham and Dirk Eddelbuettel, written Dec 2, 2013
The plyr package uses a couple of small C functions to optimise a number of particularly bad bottlenecks. Recently, two functions were converted to C++. This was mostly stimulated by a segmentation fault caused by some large inputs to the split_indices() function: rather than figuring out exactly what was going wrong with the complicated C code, it was easier to rewrite it in simple, correct C++.
The job of split_indices() is simple: given a vector of integers, x, it returns a list where the i-th element is an integer vector containing the positions of x equal to i. For example, for x = (2, 1, 2) the result has two elements: (2) and (1, 3). This is a useful building block for many of the functions in plyr.
It is fairly easy to see what is going on in the C++ code:
We create a std::vector of integer vectors called ids. It will grow efficiently as we add new values, and Rcpp will automatically convert it to a list of integer vectors when returned to R.
The loop iterates through each element of x, adding its index to the end of the matching element of ids. It also makes sure that ids is long enough. (The plus and minus ones are needed because C++ uses 0-based indices and R uses 1-based indices.)
The code is simple, easy to understand (if one is a little familiar with the STL), and performant. The most awkward aspect of the code is switching between R’s 1-based indexing and C++’s 0-based indexing (and indeed this was the source of a bug in a previous version of this code).
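The steps described above can be sketched in standalone C++ roughly as follows. This is a minimal illustration that assumes a plain std::vector<int> as input in place of an Rcpp vector, so it compiles without R; the optional parameter n (an initial number of groups) and the error handling are assumptions made for the example, not a copy of the plyr source.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Standalone sketch of the split_indices() idea using only the STL.
// Assumption: x holds 1-based group ids, as it would when coming from R.
// The parameter n (initial number of groups) is hypothetical.
std::vector<std::vector<int> > split_indices(const std::vector<int>& x, int n = 0) {
    if (n < 0) throw std::invalid_argument("n must be non-negative");

    std::vector<std::vector<int> > ids(n);

    for (std::size_t i = 0; i < x.size(); ++i) {
        const int group = x[i];
        if (group < 1) throw std::invalid_argument("group ids must be >= 1");

        // Make sure ids is long enough to hold this group.
        // Comparing against the current size means ids never shrinks.
        if (static_cast<std::size_t>(group) > ids.size()) {
            ids.resize(group);
        }

        // group - 1 maps R's 1-based id to a 0-based slot in ids;
        // i + 1 maps the 0-based loop counter back to a 1-based position.
        ids[group - 1].push_back(static_cast<int>(i) + 1);
    }
    return ids;
}
```

Called with x = {2, 1, 2, 4}, the sketch returns four vectors: {2}, {1, 3}, an empty vector, and {4}. In an Rcpp setting the same nested std::vector would be converted to an R list of integer vectors on return, as noted above.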
Compare it to the original C code:
The C version is almost three times as long, and has a bug in it. It is substantially more complicated because it:
has to take care of memory management by hand (in R's C API, typically with PROTECT and UNPROTECT); Rcpp takes care of this for us
needs an additional loop through the data to determine how long each vector should be; the std::vector grows efficiently and eliminates this extra pass
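To illustrate the second point, here is a sketch of what sizing the output in advance involves. It is written in plain C++ rather than against R's C API, so it is not the original plyr code; the function name split_indices_two_pass is made up for this example.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical count-then-fill variant, shown only to illustrate the
// extra pass that fixed-size output vectors require.
std::vector<std::vector<int> > split_indices_two_pass(const std::vector<int>& x) {
    // Pass 1: find the number of groups and how many positions each holds.
    std::vector<std::size_t> counts;
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (x[i] < 1) throw std::invalid_argument("group ids must be >= 1");
        const std::size_t group = static_cast<std::size_t>(x[i]);
        if (group > counts.size()) counts.resize(group, 0);
        ++counts[group - 1];
    }

    // Allocate every output vector at its final size up front.
    std::vector<std::vector<int> > ids(counts.size());
    for (std::size_t g = 0; g < counts.size(); ++g) {
        ids[g].resize(counts[g]);
    }

    // Pass 2: write each 1-based position into the next free slot of its group.
    std::vector<std::size_t> next(counts.size(), 0);
    for (std::size_t i = 0; i < x.size(); ++i) {
        const std::size_t g = static_cast<std::size_t>(x[i]) - 1;
        ids[g][next[g]++] = static_cast<int>(i) + 1;
    }
    return ids;
}
```

The result is the same as with a vector that grows on demand, but the function needs two passes over x plus the bookkeeping in counts and next, which is the kind of overhead the push_back-based version avoids.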
Conversion to C++ can make code shorter and easier to understand and maintain, while remaining just as performant.