Kevin Ushey, written Feb 27, 2013
Recall that factors are really just integer vectors with ‘levels’, i.e., character labels that get mapped to each integer in the vector. How can we take an arbitrary character, integer, numeric, or logical vector and coerce it to a factor with Rcpp? It’s actually quite easy with the Rcpp sugar functions sort_unique and match.
Note a few things:
We template over the RTYPE, i.e., the internal type that R assigns to its objects. For this example, we just need to know that the R types (as exposed in an R session) map to internal C types as integer -> INTSXP, numeric -> REALSXP, and character -> STRSXP.
We return an IntegerVector. Remember that factors are just integer vectors with a levels attribute and class factor.
To generate our factor, we simply need to calculate the sorted unique values (the levels), and then match our vector back to those levels.
Next, we can just set the attributes on the object so that R will interpret it as a factor, rather than a plain old integer vector, when it’s returned.
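The two steps above (compute the sorted unique levels, then match each element back to them) can be sketched in standard C++, independent of Rcpp. This is only an illustration under some assumptions: the Factor struct and make_factor are hypothetical names standing in for an R factor, the lookup uses a binary search rather than whatever Rcpp's match does internally, and strings are ordered bytewise rather than by R's locale-aware sort.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for an R factor: 1-based integer codes plus labels.
struct Factor {
    std::vector<int> codes;           // codes[i] is the 1-based index into levels
    std::vector<std::string> levels;  // sorted unique values
};

// Build a factor from strings: sort + unique gives the levels, then a binary
// search matches every element back to the position of its level.
inline Factor make_factor(const std::vector<std::string>& x) {
    Factor f;
    f.levels = x;
    std::sort(f.levels.begin(), f.levels.end());
    f.levels.erase(std::unique(f.levels.begin(), f.levels.end()),
                   f.levels.end());

    f.codes.reserve(x.size());
    for (const std::string& value : x) {
        // Every value is present in levels, so lower_bound lands exactly on it.
        auto pos = std::lower_bound(f.levels.begin(), f.levels.end(), value);
        f.codes.push_back(static_cast<int>(pos - f.levels.begin()) + 1);
    }
    return f;
}
```

For example, calling make_factor on the values b, a, c, a would give the levels a, b, c and the codes 2, 1, 3, 1. In the Rcpp version, the levels and class attributes on the returned integer vector play the role that the struct plays here.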
And a quick test:
    Unit: milliseconds
                  expr   min    lq median    uq   max neval
          factor(lets) 5.065 5.788  5.976 6.375 36.57   100
     fast_factor(lets) 1.367 1.421  1.453 1.520  2.83   100
(However, note that this doesn’t handle NAs; fixing that is left as an exercise. Similarly for logical vectors: it’s not quite as simple as just adding an LGLSXP templated call, but it’s still not tough. Use INTSXP and set the levels to FALSE and TRUE.)
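As a rough idea of what that exercise might look like, here is a plain C++ sketch (again without Rcpp) for the logical case with missing values. The names kMissing, LogicalFactor and logical_factor are hypothetical, and the sketch assumes missing values are marked with INT_MIN, which mirrors how R stores NA in integer and logical vectors. Following the hint above, the levels are fixed to FALSE and TRUE.

```cpp
#include <climits>
#include <string>
#include <vector>

// Assumption: missing values are marked with INT_MIN, mirroring how R stores
// NA in integer and logical vectors.
const int kMissing = INT_MIN;

// Hypothetical stand-in for a factor built from a logical vector.
struct LogicalFactor {
    std::vector<int> codes;           // 1 = FALSE, 2 = TRUE, kMissing = NA
    std::vector<std::string> levels;  // always {"FALSE", "TRUE"}
};

inline LogicalFactor logical_factor(const std::vector<int>& x) {
    LogicalFactor f;
    f.levels = {"FALSE", "TRUE"};
    f.codes.reserve(x.size());
    for (int value : x) {
        if (value == kMissing) {
            f.codes.push_back(kMissing);  // NA stays NA and gets no level
        } else {
            f.codes.push_back(value != 0 ? 2 : 1);
        }
    }
    return f;
}
```

The point is that missing values pass through as missing codes and never show up among the levels. The same check would be needed before the matching step for the other vector types.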
We can demonstrate a simple example of where this might be useful with tapply. tapply(x, group, FUN) is really just a wrapper to lapply( split(x, group), FUN ), and split relies on coercing ‘group’ to a factor. Beyond that coercion, split calls .Internal( split(x, group) ), and trying to do better than an internal C function is typically a bit futile. So, now that we’ve written this, we can test a couple of ways of performing a tapply-like function:
                                                 test elapsed relative
    2 unlist(lapply(split(x, fast_factor(gp)), mean))   0.292    1.000
    3              unlist(lapply(split(x, gp), mean))   1.042    3.568
    1                             tapply(x, gp, mean)   2.043    6.997
To be fair, tapply actually returns a 1-dimensional array rather than a vector, and also can operate on more general arrays. However, we still do see a modest speedup both for using lapply, and for taking advantage of our fast factor generator.
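To make the role of the factor codes in a tapply-like computation concrete, here is a small standard C++ sketch of a grouped mean. It is not what split or tapply do internally. It is a single accumulation pass, and group_means is a hypothetical name. It assumes the codes are 1-based and lie between 1 and n_levels, as they would for a factor.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Mean of x within each group. codes[i] is the 1-based group of x[i], as in a
// factor; n_levels is the number of distinct groups.
inline std::vector<double> group_means(const std::vector<double>& x,
                                       const std::vector<int>& codes,
                                       int n_levels) {
    std::vector<double> sums(n_levels, 0.0);
    std::vector<int> counts(n_levels, 0);
    for (std::size_t i = 0; i < x.size(); ++i) {
        sums[codes[i] - 1] += x[i];
        counts[codes[i] - 1] += 1;
    }

    std::vector<double> means(n_levels);
    for (int g = 0; g < n_levels; ++g) {
        // Groups with no observations get NaN rather than a made-up value.
        means[g] = counts[g] > 0
                       ? sums[g] / counts[g]
                       : std::numeric_limits<double>::quiet_NaN();
    }
    return means;
}
```

Once the grouping variable has been turned into integer codes, the per-group work is just array indexing, which is why a cheap way of producing those codes matters.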