Nino Hardt, Dicko Ahmadou, Benjamin Christoffersen — written Jul 13, 2013 — updated Feb 2, 2020 — source

The Multivariate Normal density function is used frequently for a number of problems. Especially for MCMC problems, fast evaluation is important. Multivariate Normal Likelihoods, Priors and mixtures of Multivariate Normals require numerous evaluations, thus speed of computation is vital. We show dramatic increases in speed by using efficient algorithms, RcppArmadillo, and some extra gain by using OpenMP. The code is based on the latest version of RcppArmadillo (0.9.800.1.0).

While the `dmvnorm()`

function from the `mvtnorm`

package is quite popular,
and in an earlier version of this article we demonstrated that an
Rcpp implementation would lead to faster computation.

Peter Rossi, author of `bayesm`

, called our attention to the `bayesm`

pure R
implementation which is much faster than `dmvnorm()`

.
The function `dMvn()`

is used internally by the mixture of normals model in
`bayesm`

. It is the matrix-equivalent version of `lndMvn`

:

Translating the vectorized approach into RcppArmadillo,
we precompute the inverse of the Cholesky decomposition of the covariance
matrix ahead of the main loop over the rows of `X`

.
The loop can easily be parallelized, and the code is easy to read and
manipulate. For instance, the inverse Cholesky decomposition can be put
inside the main loop, if varying covariance matrices are necessary.
It is worth remarking that multiplying with a precomputed inverse of the
Cholesky decomposition of
the covariance matrix is faster but less numerically stable compared to
a backsolve as in the `Mahalanobis`

function we will define later.

The use of `trimatu`

allows to exploit that of the Cholesky
decomposition the covariance matrix is an upper triangular matrix in the
inversion. The `dmvnrm_arma_old`

is an older version of the function used in
a previous version of this article. The new version differs mainly by

- using
`const &`

for the input parameters. - declaring
`z`

outside the loop. - using
`arma::dot`

instead of`arma::sum`

. - other minor things.

This turns out to be quite important for the computation times. The
`dmvnrm_arma_fast`

makes an inplace vector matrix product and exploits
that the matrix is an upper triangular matrix. One can use the
dtrmv
BLAS function instead. It is not available through the Armadillo library
though.

Additionally, we can make use of the OpenMP library to use multiple cores. For the OpenMP implementation, we need to enable OpenMP support. One way of doing so is by adding the required compiler and linker flags as follows:

Rcpp version 0.10.5 and later will also provide a plugin to set these variables for us:

When used in a package, the standard `src/Makevars`

supplied by the package
ensures the variable `$(SHLIB_OPENMP_CXXFLAGS)`

is used. It relies on the R
configuration to automatically add OpenMP support where available.

We also need to set the number of cores to be used before running the
compiled functions. One way is to use `detectCores()`

from the `parallel`

package.

Only two additional lines are needed to enable multicore processing. In this example, a dynamic schedule is used for OpenMP. A static schedule might be faster in some cases. However,this is left to further experimentation.

Likewise, it is easy to translate `dmvnorm`

from the `mvtnorm`

package into Rcpp:

We use `each_col`

, an appropriate overload of `arma::solve`

, and `for_each`

to do the computations without performing additional allocations after the
copy at `x_cen = x.t()`

.

Now we simulate some data for benchmarking:

And run the benchmark:

[1] "Using 12 cores for _mc versions"

Loading required package: rbenchmark

test replications elapsed relative 4 dmvnrm_arma_mc 100 0.875 1.000 6 dmvnrm_arma_fast 100 3.622 4.139 3 dmvnrm_arma 100 5.274 6.027 5 dmvnrm_arma_old 100 7.433 8.495 2 dmvnorm_arma 100 8.649 9.885 7 dMvn 100 12.563 14.358 1 dmvnorm 100 14.637 16.728

Lastly, we show that the functions yield the same results:

[1] TRUETweet

- Benchmarking Rcpp code with RcppClock — Zach DeBruine
- Simulation Smoother using RcppArmadillo — Tomasz Woźniak
- Constructing a Sparse Matrix Class in Rcpp — Zach DeBruine and Dirk Eddelbuettel