Zach DeBruine and Dirk Eddelbuettel — written Apr 16, 2021 — source
It is no secret that sparse matrix operations are faster in C++ than in R.
RcppArmadillo and RcppEigen do a great job copying sparse matrices from R to C++ and back again.
But note the word “copy”.
Every time RcppArmadillo converts an R sparse matrix to an
arma::SpMat<T> object, it has to creates a deep copy due to the difference in representation between dense matrices (usually occupying
contiguous chunks of memory) and sparse matrices (which do not precisely because of sparsity).
Similarly, every time RcppEigen converts an R sparse matrix to an
Eigen::SparseMatrix<T> object, it also creates a deep copy.
The price of instantiating sparse matrix object is measurable in both time and memory.
But one of the advantages of types such as
Rcpp::NumericVector is that they simply re-use the underlying memory of the R object by wrapping around the underlying
SEXP representation and its
contiguous dense memory layout so no copy is created!
Can we create a sparse matrix class using
Rcpp::IntegerVector that uses them similarly as references rather than actual deep-copy of each element?
We will focus on the
dgCMatrix type, the most common form of compressed-sparse-column (CSC) matrix in the Matrix package.
dgCMatrix class is an S4 object with several slots:
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots ..@ i : int [1:12] 2 4 0 3 4 0 3 4 2 3 0 2 ..@ p : int [1:6] 0 2 5 8 10 12 ..@ Dim : int [1:2] 5 5 ..@ Dimnames:List of 2 .. ..$ : NULL .. ..$ : NULL ..@ x : num [1:12] 0.5 -1 0.83 -0.056 2.5 0.24 1.7 1.3 0.55 -1.7 -0.78 1.3 ..@ factors : list()
Here we have:
iis an integer vector giving the row indices of the non-zero values of the matrix
pis an integer vector giving the index of the first non-zero value of each column in
xgives the non-zero elements of the matrix corresponding to rows in
iand columns delineated by
p are integer vectors, and
x is a numeric vector (stored as a
double), corresponding to
That means that we can construct a sparse matrix class in C++ using Rcpp vectors!
dgCMatrix is simply an S4 object containing integer and double vectors, and Rcpp already has implemented exporters and wrappers for S4 objects, integer vectors, and numeric vectors.
That makes class construction easy:
In the above code, first we create a C++ class for a
dgCMatrix in the Rcpp namespace with public members corresponding to the slots in an R
Second, we add a constructor for the class that receives an S4 R object (which should be a valid
Third, we add a simple forward-only sparse column iterator with read/write access for convenient traversal of non-zero elements in the matrix.
Finally, we use
Rcpp::wrap for seamless conversion between R and C++ and back again.
We can now simply pull a
dgCMatrix into any Rcpp function thanks to our handy class methods and the magic of
And call the following from R:
Rcpp structures such as
NumericVector are wrappers around the existing R objects.
This means that if we modify our sparse matrix in C++, it will modify the original R object.
sum before squaring: 801.059
sum of original object after squaring: 1007.97
sum of assigned object after squaring: 1007.97
Notice how the original object AND the returned object are identical, yet they don’t point to the same memory address (because of copy on write):
You might further inspect memory addresses within the structure using
.Internal(inspect()) and indeed, you will see the memory addresses are not the same.
What this all means is that we can simply call the function in R and modify the object implicitly without an assignment operator.
sum_before = 793.861
sum_after = 970.174
This principle of course applies to other Rcpp classes such as
NumericVector as well.
But especially when working with very large sparse matrices, it is useful to avoid deep copies and pass by reference only.
If you do need to operate on a copy of your matrix in C++, there is no reason to be using an Rcpp container when you can be using RcppArmadillo or RcppEigen.
These libraries are extremely capable and well-documented—and generally give you access to specific sparse Matrix algorithms.
One of the best ways to read and/or write non-zero values in a sparse matrix is with an iterator. A basic column forward iterator with read/write access was presented in the declaration of our sparse matrix class. We can use this iterator in a very Armadillo-esque manner to compute things like column sums:
On the R end:
Great—but is it faster???
Unit: microseconds expr min lq mean median uq max neval cld Rcpp_colSums(spmat) 2.510 2.7885 3.32225 3.1355 3.2685 21.842 100 a Matrix::colSums(spmat) 11.068 11.5380 13.26691 11.7775 12.0525 79.324 100 b
There are many ways we can extend
For example, we can use
row_iterator and iterators that traverse all values in addition to col_iterator.
We can also add support for subview copies, dense copies of columns and rows, basic element-wise operations, cross-product, etc.
We have implemented and documented these methods, and hopefully there will be a Rcpp-extending CRAN package in the not-too-distant future that allows you to seamlessly interface with the dgCMatrix class. For now, see the Github repo for RcppSparse for the in-progress package.Tweet