## Using RcppNT2 to Compute the Sum

Kevin Ushey — written Feb 1, 2016 — source

## Introduction

The Numerical Template Toolbox (NT2) collection of header-only C++ libraries that make it possible to explicitly request the use of SIMD instructions when possible, while falling back to regular scalar operations when not. NT2 itself is powered by Boost, alongside two proposed Boost libraries – `Boost.Dispatch`, which provides a mechanism for efficient tag-based dispatch for functions, and `Boost.SIMD`, which provides a framework for the implementation of algorithms that take advantage of SIMD instructions. `RcppNT2` wraps and exposes these libraries for use with `R`.

If you haven’t already, read the RcppNT2 introduction article to get acquainted with the RcppNT2 package.

## Computing the Sum

First, let’s review how we might use `std::accumulate()` to sum a vector of numbers. We explicitly pass in the `std::plus<double>()` functor, just to make it clear that the `std::accumulate()` algorithm expects a binary functor when accumulating values.

Now, let’s rewrite this to take advantage of RcppNT2. There are two main steps required to take advantage of RcppNT2 at a high level:

1. Write a functor, with a templated call operator, with the implementation written in a ‘`Boost.SIMD`-aware’ way;

2. Provide the functor as an argument to the appropriate SIMD algorithm.

Let’s follow these steps in implementing our SIMD sum.

As you can see, it’s quite simple to take advantage of `Boost.SIMD`. For very simple operations such as this, RcppNT2 provides a number of pre-defined functors, which can be accessed in the `RcppParallel::functor` namespace. The following is an equivalent way of defining the above function:

Behind the scenes of `simdReduce()`, `Boost.SIMD` will apply your templated functor to ‘packs’ of values when appropriate, and scalar values when not. In other words, there are effectively two kinds of template specializations being generated behind the scenes: one with `T = double`, and one with ```T = boost::simd::pack<double>```. The use of the packed representation is what allows `Boost.SIMD` to ensure vectorized instructions are used and generated. `Boost.SIMD` provides a host of functions and operator overloads that ensure that optimized instructions are used when possible over a packed object, while falling back to ‘default’ operations for scalar values when not.

Now, let’s compare the performance of these two implementations.

```             expr     min       lq     mean   median       uq      max
vectorSum(v) 870.894 887.1535 978.7471 987.1810 989.1060 1792.215
vectorSumSimd(v) 270.985 283.2315 297.8062 287.6565 298.7115  608.373
```

Perhaps surprisingly, the RcppNT2 solution is much faster – the gains are similar to what we might have seen when computing the sum in parallel. However, we’re still just using a single core; we’re just taking advantage of vectorized instructions provided by the CPU. In this particular case, on Intel CPUs, `Boost.SIMD` will ensure that we are using the `addpd` instruction, which is documented in the Intel Software Developer’s Manual [PDF].

Note that, for the naive serial sum, the compiler would likely generate similarly efficient code when the `-ffast-math` optimization flag is set. By default, the compiler is somewhat ‘pessimistic’ about the set of optimizations it can perform around floating point arithmetic. This is because it must respect the IEEE floating point standard, and this means respecting the fact that, for example, floating point operations are not assocative:

``` 1.110223e-16
```

Surprisingly, the above computation does not evaluate to zero!

In practice, you’re likely safe to take advantage of the `-ffast-math` optimizations, or `Boost.SIMD`, in your own work. However, be sure to test and verify!

This article provides just a taste of how RcppNT2 can be used. If you’re interested in learning more, please check out the RcppNT2 website.

tags: simd  parallel