Zach DeBruine and Dirk Eddelbuettel — written Apr 16, 2021 — source

It is no secret that sparse matrix operations are faster in C++ than in R.
RcppArmadillo and RcppEigen do a great job copying sparse matrices from R to C++ and back again.
But note the word “copy”.
Every time RcppArmadillo converts an R sparse matrix to an `arma::SpMat<T>`

object, it has to creates a *deep copy* due to the difference in representation between dense matrices (usually occupying
contiguous chunks of memory) and sparse matrices (which do not precisely because of *sparsity*).
Similarly, every time RcppEigen converts an R sparse matrix to an `Eigen::SparseMatrix<T>`

object, it also creates a deep copy.

The price of instantiating sparse matrix object is measurable in both time and memory.
But one of the advantages of types such as `Rcpp::NumericVector`

is that they simply re-use the underlying memory of the R object by wrapping around the underlying `SEXP`

representation and its
contiguous dense memory layout so no copy is created!
Can we create a sparse matrix class using `Rcpp::NumericVector`

and `Rcpp::IntegerVector`

that uses them similarly as references rather than actual deep-copy of each element?

`dgCMatrix`

We will focus on the `dgCMatrix`

type, the most common form of compressed-sparse-column (CSC) matrix in the Matrix package.
The `dgCMatrix`

class is an S4 object with several slots:

Formal class 'dgCMatrix' [package "Matrix"] with 6 slots ..@ i : int [1:12] 2 4 0 3 4 0 3 4 2 3 0 2 ..@ p : int [1:6] 0 2 5 8 10 12 ..@ Dim : int [1:2] 5 5 ..@ Dimnames:List of 2 .. ..$ : NULL .. ..$ : NULL ..@ x : num [1:12] 0.5 -1 0.83 -0.056 2.5 0.24 1.7 1.3 0.55 -1.7 -0.78 1.3 ..@ factors : list()

Here we have:

- Slot
`i`

is an integer vector giving the row indices of the non-zero values of the matrix - Slot
`p`

is an integer vector giving the index of the first non-zero value of each column in`i`

- Slot
`x`

gives the non-zero elements of the matrix corresponding to rows in`i`

and columns delineated by`p`

Realize that `i`

and `p`

are integer vectors, and `x`

is a numeric vector (stored as a `double`

), corresponding to `Rcpp::IntegerVector`

and `Rcpp::NumericVector`

.
That means that we can construct a sparse matrix class in C++ using Rcpp vectors!

A `dgCMatrix`

is simply an S4 object containing integer and double vectors, and Rcpp already has implemented exporters and wrappers for S4 objects, integer vectors, and numeric vectors.
That makes class construction easy:

In the above code, first we create a C++ class for a `dgCMatrix`

in the Rcpp namespace with public members corresponding to the slots in an R `Matrix::dgCMatrix`

.
Second, we add a constructor for the class that receives an S4 R object (which should be a valid `dgCMatrix`

object).
Third, we add a simple forward-only sparse column iterator with read/write access for convenient traversal of non-zero elements in the matrix.
Finally, we use `Rcpp::as`

and `Rcpp::wrap`

for seamless conversion between R and C++ and back again.

We can now simply pull a `dgCMatrix`

into any Rcpp function thanks to our handy class methods and the magic of `Rcpp::as`

.

And call the following from R:

[1] TRUE

Awesome!

Rcpp structures such as `NumericVector`

are wrappers around the existing R objects.
This means that if we modify our sparse matrix in C++, it will modify the original R object.
For instance:

used in

sum before squaring: 801.059

sum of original object after squaring: 1007.97

sum of assigned object after squaring: 1007.97

Notice how the original object AND the returned object are identical, yet they don’t point to the same memory address (because of *copy on write*):

[1] "<0x55f9609a3ae0>"

[1] "<0x55f960f449f0>"

You might further inspect memory addresses within the structure using `.Internal(inspect())`

and indeed, you will see the memory addresses are not the same.
What this all means is that we can simply call the function in R and modify the object implicitly without an assignment operator.

sum_before = 793.861

sum_after = 970.174

This principle of course applies to other Rcpp classes such as `NumericVector`

as well.
But especially when working with very large sparse matrices, it is useful to avoid deep copies and pass by reference only.
If you do need to operate on a copy of your matrix in C++, there is no reason to be using an Rcpp container when you can be using RcppArmadillo or RcppEigen.
These libraries are extremely capable and well-documented—and generally give you access to specific sparse Matrix algorithms.

One of the best ways to read and/or write non-zero values in a sparse matrix is with an iterator. A basic column forward iterator with read/write access was presented in the declaration of our sparse matrix class. We can use this iterator in a very Armadillo-esque manner to compute things like column sums:

On the R end:

[1] TRUE

Great—but is it faster???

Unit: microseconds expr min lq mean median uq max neval cld Rcpp_colSums(spmat) 2.510 2.7885 3.32225 3.1355 3.2685 21.842 100 a Matrix::colSums(spmat) 11.068 11.5380 13.26691 11.7775 12.0525 79.324 100 b

There are many ways we can extend `Rcpp::dgCMatrix`

.
For example, we can use `const_iterator`

, `row_iterator`

and iterators that traverse all values in addition to col_iterator.
We can also add support for subview copies, dense copies of columns and rows, basic element-wise operations, cross-product, etc.

We have implemented and documented these methods, and hopefully there will be a Rcpp-extending CRAN package in the not-too-distant future that allows you to seamlessly interface with the dgCMatrix class. For now, see the Github repo for RcppSparse for the in-progress package.

Tweet- Extending R with C++ and Fortran — Dirk Eddelbuettel and JBrandon Duck-Mayr
- Benchmarking Rcpp code with RcppClock — Zach DeBruine
- Simulation Smoother using RcppArmadillo — Tomasz Woźniak