Michael Kane and Scott Ritchie, written Mar 14, 2013
The bigmemory package allows users to create matrices that can be shared across
R sessions. These can either be stored in RAM, or stored on disk, allowing for
the matrices to be much larger than the system RAM. These objects are defined in
the big.matrix class, and are designed to have similar behaviour to native R
matrices. However, they are implemented in C++ and can be easily accessed and
manipulated from Rcpp.
This example demonstrates how to write functions in Rcpp that operate on
big.matrix objects. Here we implement the equivalent of colSums:
When accessing a big.matrix object in Rcpp, there are two objects you are
interested in creating. The first is the external pointer for the big.matrix,
XPtr<BigMatrix>, which also gives access to all of the attributes of the
big.matrix, including nrow(), ncol(), and matrix_type(). The second is the
MatrixAccessor, which allows you to access elements within the big.matrix.
When creating the MatrixAccessor you must declare the type of the BigMatrix,
resulting in the design pattern above to correctly handle all cases.
A BigMatrix object stores elements in a column major format, meaning that
values are accessed and filled in by column, rather than by row. The
MatrixAccessor implements the bracket operator, returning a pointer to the
first element of a column. As a result, for a MatrixAccessor mat, the element
in the i-th row and j-th column is accessed with mat[j][i] rather than
mat[i, j], which R users are more familiar with.
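To make the index order concrete, here is a minimal standalone sketch of column-major access. It does not use bigmemory: ColumnAccessor, element_at, and example_lookup are hypothetical stand-ins that only mimic the bracket behaviour described above, so the snippet compiles without R.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a column-major accessor (not the bigmemory class).
// Elements are stored column after column in one contiguous buffer.
template <typename T>
class ColumnAccessor {
public:
    ColumnAccessor(T* data, std::size_t nrow, std::size_t ncol)
        : data_(data), nrow_(nrow), ncol_(ncol) {}

    // Returns a pointer to the first element of column j.
    T* operator[](std::size_t j) const {
        assert(j < ncol_);
        return data_ + j * nrow_;
    }

    std::size_t nrow() const { return nrow_; }
    std::size_t ncol() const { return ncol_; }

private:
    T* data_;
    std::size_t nrow_;
    std::size_t ncol_;
};

// Reads the element in row i, column j: column index first, then row index.
template <typename T>
T element_at(const ColumnAccessor<T>& acc, std::size_t i, std::size_t j) {
    assert(i < acc.nrow());
    return acc[j][i];
}

// Example: a 3 x 2 matrix whose columns are (1, 2, 3) and (4, 5, 6).
inline int example_lookup(std::size_t i, std::size_t j) {
    std::vector<int> buf = {1, 2, 3, 4, 5, 6};
    ColumnAccessor<int> acc(buf.data(), 3, 2);
    return element_at(acc, i, j);
}
```

In this sketch, example_lookup(0, 1) reads the first row of the second column, which is the fourth value in the buffer, because the first column occupies the first three slots.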
The code above defines a function BigColSums. This is broken into two
components: a dispatch function, and a function that implements the logic. The
dispatch function takes as an argument a generic SEXP object, a container
object for all data objects in R. First, we tell Rcpp that the SEXP object is
an external pointer (XPtr) associated with a BigMatrix object. The dispatch
function then creates the appropriate MatrixAccessor depending on the type of
the data stored in the BigMatrix, as detected at runtime, and passes both the
external pointer and the MatrixAccessor for the BigMatrix to the
implementation of the logic for BigColSums.
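The split between a runtime dispatch function and a templated implementation can be sketched without R or bigmemory. Everything in this snippet is a hypothetical stand-in: ElemType, ErasedMatrix, total_impl, and total_dispatch are illustrative names, and the tags are not the values that matrix_type() returns.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical type tags; the real codes returned by matrix_type() are not
// reproduced here.
enum class ElemType { Char, Short, Int, Double };

// Hypothetical type-erased matrix: raw storage plus a runtime type tag.
struct ErasedMatrix {
    const void* data;
    std::size_t length;
    ElemType type;
};

// Logic, written once as a template over the element type.
template <typename T>
double total_impl(const T* values, std::size_t n) {
    double sum = 0.0;
    for (std::size_t k = 0; k < n; ++k) sum += values[k];
    return sum;
}

// Dispatch: inspect the runtime tag, then call the matching instantiation.
inline double total_dispatch(const ErasedMatrix& m) {
    assert(m.data != nullptr || m.length == 0);
    switch (m.type) {
        case ElemType::Char:
            return total_impl(static_cast<const char*>(m.data), m.length);
        case ElemType::Short:
            return total_impl(static_cast<const short*>(m.data), m.length);
        case ElemType::Int:
            return total_impl(static_cast<const int*>(m.data), m.length);
        case ElemType::Double:
            return total_impl(static_cast<const double*>(m.data), m.length);
    }
    // Only reached if the tag holds a value outside the enum.
    throw std::invalid_argument("unknown element type");
}
```

The caller only ever sees total_dispatch, which mirrors how a single exported function can hide the per-type template instantiations from R.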
Because the logic for BigColSums remains the same regardless of the underlying
representation of the data, we have simply implemented a generic template that
works the same for all four types (char, short, int, double). In the
implementation of BigColSums's logic, we first create a NumericVector to hold
the column sums, using the number of columns in the BigMatrix to define its
length. Next, the function loops through the columns. For each iteration of the
loop the values in a single column are accumulated and stored at the
appropriate location in the colSum vector. Finally, the column sums are
returned to R via the dispatch function.
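The loop described here can be sketched in plain C++ over a column-major buffer. This is an illustration under stated assumptions rather than the function from this article: column_sums is a hypothetical name, std::vector<double> stands in for NumericVector, and std::accumulate is one possible way to do the accumulation.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Sums each column of a column-major buffer with nrow rows and ncol columns.
// T is the element type; sums are accumulated as double (the 0.0 initial
// value) so that small integer types do not overflow.
template <typename T>
std::vector<double> column_sums(const T* data, std::size_t nrow,
                                std::size_t ncol) {
    assert(data != nullptr || nrow * ncol == 0);
    std::vector<double> sums(ncol);
    for (std::size_t j = 0; j < ncol; ++j) {
        const T* col = data + j * nrow;  // first element of column j
        sums[j] = std::accumulate(col, col + nrow, 0.0);
    }
    return sums;
}
```

Passing 0.0 rather than 0 as the initial value matters: std::accumulate uses the type of that argument for the running total, so an integer literal would accumulate in int even for a double matrix.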
The code below shows how to use the new Rcpp function. A big.matrix object is
created, named bigmat, with 10000 rows and 3 columns. Matrix elements are
stored on disk in a “backingfile” named bigmat.bk. After creating the
big.matrix object, the column values are filled in and then the big.matrix's
external pointer, which is referenced through the address slot, is passed to
the Rcpp BigColSums function. The corresponding R function is shown below so
that you can verify that our new function returns the correct value.
[1] -23.72 -182.13 -212.98
[1] TRUE