## Parallel Distance Matrix Calculation with RcppParallel

JJ Allaire and Jim Bullard — written Jul 15, 2014 — source

The RcppParallel package includes high level functions for doing parallel programming with Rcpp. For example, the `parallelFor` function can be used to convert the work of a standard serial “for” loop into a parallel one.

This article describes using RcppParallel to compute pairwise distances for each row in an input data matrix and return an n x n lower-triangular distance matrix which can be used with clustering tools from within R, e.g., hclust.

### Jensen-Shannon Distance

In this example, we compute the Jensen-Shannon distance (JSD); a metric not a part of base R. Calculating distance matrices is a common practice in clustering applications (unsupervised learning). Certain clustering methods, such as partitioning around medoids (PAM) and hierarchical clustering, operate directly on this matrix.

A distance matrix stores the n*(n-1)/2 pairwise distances/similarities between observations in an n x p matrix where n correspond to the independent observational units and p represent the covariates measured on each individual. As a result we are typically limited by the size of n as the algorithm scales quadratically in both time and space in n.

### Implementation in R

As a baseline we’ll start with the implementation of Jenson-Shannon distance in plain R:

### Implementation using Rcpp

Here is a re-implementation of `js_distance` using Rcpp. Note that this doesn’t yet take advantage of parallel processing, but still yields an approximately 50x speedup over the original R version on a 2.6GHz Haswell MacBook Pro.

Abstractly, a Distance function will take two vectors in RJ and return a value in R+. In this implementation, we don’t support arbitrary distance metrics, i.e., the JSD code computes the values from within the parallel kernel.

Our distance function `kl_divergence` is defined below and takes three parameters: iterators to the beginning and end of vector 1 and an iterator to the beginning of vector 2 (the end position of vector2 is implied by the end position of vector1).

With the `kl_distance` function defined we can now iteratively apply it to the rows of the input matrix to generate the distance matrix:

### Parallel Version using RcppParallel

Adapting the serial version to run in parallel is straightforward. A few notes about the implementation:

• To implement a parallel version we need to create a function object that can process discrete chunks of work (i.e. ranges of input).

• Since the parallel version will be called from background threads, we can’t use R and Rcpp APIs directly. Rather, we use the threadsafe `RMatrix` accessor class provided by RcppParallel to read and write to directly the underlying matrix memory.

• Other than organzing the code as a function object and using `RMatrix`, the parallel code is almost identical to the serial code. The main difference is that the outer loop starts with the `begin` index passed to the worker function rather than 0.

Parallelizing in this case has a big payoff: we observe performance of about 5.5x the serial version on a 2.6GHz Haswell MacBook Pro with 4 cores (8 with hyperthreading). Here is the definition of the `JsDistance` function object:

Now that we have the `JsDistance` function object we can pass it to `parallelFor`, specifying an iteration range based on the number of rows in the input matrix:

### Benchmarks

We now compare the performance of the three different implementations: pure R, serial Rcpp, and parallel Rcpp:

```                          test replications elapsed relative
3 rcpp_parallel_js_distance(m)            3   0.110    1.000
2          rcpp_js_distance(m)            3   0.618    5.618
1               js_distance(m)            3  35.560  323.273
```

The serial Rcpp version yields a more than 50x speedup over straight R code. The parallel Rcpp version provides another 5.5x speedup, amounting to a total gain of over 300x compared to the original R version.