Using bigmemory with Rcpp

Michael Kane and Scott Ritchie — written Mar 14, 2013 — source

The bigmemory package allows users to create matrices that can be shared across R sessions. These can either be stored in RAM, or stored on disk, allowing for the matrices to be much larger than the system RAM. These objects are defined in the big.matrix class, and are designed to have similar behaviour to native R matrices. However, they are implemented in C++ and can be easily accessed and manipulated from Rcpp.

This example demonstrates how to write functions in Rcpp that operate on big.matrix objects. Here we implement the equivalent of colSums:

#include <Rcpp.h>

// The next line is all it takes to find the bigmemory
// headers -- thanks to the magic of Rcpp attributes, 
// and as bigmemory now accesses Boost headers from the BH package,
// we need to make sure we do so as well in this Rcpp::depends comment.
//
// Boost headers can generate some warning with the default compilation
// options for R.  To suppress these, we can enable C++11 mode which gets
// us 'long long' types.
//
// If your compiler is to old, just disable / remove the following line
// [[Rcpp::plugins(cpp11)]]

// Ensure R uses the headers for Boost (a dependency) and bigmemory 
//
// [[Rcpp::depends(BH, bigmemory)]]
#include <bigmemory/MatrixAccessor.hpp>

#include <numeric>

using namespace Rcpp;

// Logic for BigColSums.
template <typename T>
NumericVector BigColSums(XPtr<BigMatrix> pMat, MatrixAccessor<T> mat) {

    // Create the vector we'll store the column sums in.
    NumericVector colSums(pMat->ncol());
    for (size_t i=0; i < pMat->ncol(); ++i)
        colSums[i] = std::accumulate(mat[i], mat[i]+pMat->nrow(), 0.0);
    return colSums;
}

// Dispatch function for BigColSums
//
// [[Rcpp::export]]
NumericVector BigColSums(SEXP pBigMat) {
    // First we have to tell Rcpp what class to use for big.matrix objects.
    // This object stores the attributes of the big.matrix object passed to it
    // by R.
    XPtr<BigMatrix> xpMat(pBigMat);

    // To access values in the big.matrix, we need to create a MatrixAccessor
    // object of the appropriate type. Note that in every case we are still
    // returning a NumericVector: this is because big.matrix objects only store
    // numeric values in R, even if their type is set to 'char'. The types
    // simply correspond to the number of bytes used for each element.
    switch(xpMat->matrix_type()) {
      case 1:
        return BigColSums(xpMat, MatrixAccessor<char>(*xpMat));
      case 2:
        return BigColSums(xpMat, MatrixAccessor<short>(*xpMat));
      case 4:
        return BigColSums(xpMat, MatrixAccessor<int>(*xpMat));
      case 8:
        return BigColSums(xpMat, MatrixAccessor<double>(*xpMat));
      default:
        // This case should never be encountered unless the implementation of
        // big.matrix changes, but is necessary to implement shut up compiler
        // warnings.
        throw Rcpp::exception("unknown type detected for big.matrix object!");
    }
}

When accessing a big.matrix object in Rcpp, there are two objects you are interested in creating. First, the External Pointer for the big.matrix; XPtr<BigMatrix>, which also stores all of the attributes of the big.matrix, including nrow(), ncol(), and matrix_type(). The second is the MatrixAccessor which allows you to access elements within the big.matrix. When creating the MatrixAccessor you must declare the type of the BigMatrix, resulting in the design pattern above to correctly handle all cases.

A BigMatrix object stores elements in a column major format, meaning that values are accessed and filled in by column, rather than by row. The MatrixAccessor implements the bracket operator, returning a pointer to the first element of a column. As a result, for a MatrixAccessor mat, the i-th row and j-th column is accessed with ma[j][i] rather than m[i, j], which R users are more familiar with.

The code above defines a function BigColSums. This is broken into two components: a dispatch function, and function which implements the logic. The dispatch function takes as an argument a generic SEXP object, a container object for all data objects in R. First, we tell Rcpp that the SEXP object is an external pointer (XPtr) associated with a BigMatrix object. It then creates the appropriate MatrixAccessor depending on the type of the data stored in the BigMatrix, as detected at runtime, and dispatches both the External Pointer and Matrix Accessor objects for the BigMatrix to the implementation of the logic for BigColSums.

Because the logic for BigColSums remains the same regardless of the underlying representation of the data, we have simply implemented a generic template that works the same for all four types (char, short, int, double). In the implementation of BigColSums’s logic, first we create a NumericVector to hold the results of the column sums, using the number of columns in the BigMatrix to define the length of the results vector. Next, the function loops through the columns. For each iteration of the loop the values in a single column are accumulated and stored at the appropriate location in the colSum vector. Finally, the columns’ sums are returned to R via the dispatch function.

The code below shows how to use the new Rcpp function. A big.matrix object is created, named bigmat, with 10000 rows and 3 columns. Matrix elements are stored on disk in a “backingfile” named bigmat.bk. After creating the big.matrix object, the column values are filled in and then the big.matrix’s external pointer, which is references with the address slot, is passed to the Rcpp BigColSums function. The corresponding R function is shown below so that you can verify that our new function returns the correct value.

suppressMessages(require(bigmemory))

# set up big.matrix
nrows <- 10000
setwd("/tmp")
bkFile <- "bigmat.bk"
descFile <- "bigmatk.desc"
bigmat <- filebacked.big.matrix(nrow=nrows, ncol=3, type="double",
                                backingfile=bkFile, backingpath=".",
				descriptorfile=descFile,
				dimnames=c(NULL,NULL))
  
# Each column value with be the column number multiplied by 
# samples from a standard normal distribution.
set.seed(123)
for (i in 1:3) bigmat[,i] = rnorm(nrows)*i

# Call the Rcpp function.
res <- BigColSums(bigmat@address) 
print(res)
[1]  -23.72 -182.13 -212.98
# Verify that it is that same as running colSums on a matrix with equal values.
print(all.equal(res, colSums(bigmat[,])))  
[1] TRUE

tags: basics  boost  bigmemory 

Related Articles