Parallel Distance Matrix Calculation with RcppParallel
JJ Allaire and Jim Bullard —
written Jul 15, 2014 —
source
The RcppParallel package includes
high level functions for doing parallel programming with Rcpp. For example,
the parallelFor function can be used to convert the work of a standard
serial “for” loop into a parallel one.
This article describes using RcppParallel to compute pairwise distances for
each row in an input data matrix and return an n x n lower-triangular
distance matrix which can be used with clustering tools from within R, e.g.,
hclust.
Jensen-Shannon Distance
In this example, we compute the Jensen-Shannon
distance
(JSD); a metric not a part of base R. Calculating distance matrices is a
common practice in clustering applications (unsupervised learning). Certain
clustering methods, such as partitioning around medoids (PAM) and
hierarchical clustering, operate directly on this matrix.
A distance matrix stores the n*(n-1)/2 pairwise distances/similarities
between observations in an n x p matrix where n correspond to the independent
observational units and p represent the covariates measured on each
individual. As a result we are typically limited by the size of n as the
algorithm scales quadratically in both time and space in n.
Implementation in R
As a baseline we’ll start with the implementation of Jenson-Shannon distance
in plain R:
Implementation using Rcpp
Here is a re-implementation of js_distance using Rcpp. Note that this
doesn’t yet take advantage of parallel processing, but still yields an
approximately 50x speedup over the original R version on a 2.6GHz Haswell
MacBook Pro.
Abstractly, a Distance function will take two vectors in RJ and
return a value in R+. In this implementation, we don’t support
arbitrary distance metrics, i.e., the JSD code computes the values from
within the parallel kernel.
Our distance function kl_divergence is defined below and takes three
parameters: iterators to the beginning and end of vector 1 and an iterator to
the beginning of vector 2 (the end position of vector2 is implied by the end
position of vector1).
With the kl_distance function defined we can now iteratively apply it
to the rows of the input matrix to generate the distance matrix:
Parallel Version using RcppParallel
Adapting the serial version to run in parallel is straightforward. A few
notes about the implementation:
To implement a parallel version we need to create a function
object that can process
discrete chunks of work (i.e. ranges of input).
Since the parallel version will be called from background threads, we can’t
use R and Rcpp APIs directly. Rather, we use the threadsafe RMatrix
accessor class provided by RcppParallel to read and write to directly the
underlying matrix memory.
Other than organzing the code as a function object and using RMatrix, the
parallel code is almost identical to the serial code. The main difference is
that the outer loop starts with the begin index passed to the worker
function rather than 0.
Parallelizing in this case has a big payoff: we observe performance of about
5.5x the serial version on a 2.6GHz Haswell MacBook Pro with 4 cores (8 with
hyperthreading). Here is the definition of the JsDistance function object:
Now that we have the JsDistance function object we can pass it to
parallelFor, specifying an iteration range based on the number of rows in
the input matrix:
Benchmarks
We now compare the performance of the three different implementations: pure
R, serial Rcpp, and parallel Rcpp:
The serial Rcpp version yields a more than 50x speedup over straight R code.
The parallel Rcpp version provides another 5.5x speedup, amounting to a total
gain of over 300x compared to the original R version.