## SIMD Map-Reduction with RcppNT2

Kevin Ushey — written Feb 2, 2016 — source

## Introduction

The Numerical Template Toolbox (NT2) collection of header-only C++ libraries that make it possible to explicitly request the use of SIMD instructions when possible, while falling back to regular scalar operations when not. NT2 itself is powered by Boost, alongside two proposed Boost libraries – `Boost.Dispatch`, which provides a mechanism for efficient tag-based dispatch for functions, and `Boost.SIMD`, which provides a framework for the implementation of algorithms that take advantage of SIMD instructions. `RcppNT2` wraps and exposes these libraries for use with `R`.

If you haven’t already, read the RcppNT2 introduction article to get acquainted with the RcppNT2 package.

## Map Reduce

MapReduce is the (infamous) buzzword that describes a class of problems that can be solved by splitting an an algorithm into a map (transform) step, and a reduction step. Although this scheme is typically adopted to help solve problems ‘at scale’ (e.g., with a large number of communicating machines), it is also a useful abstraction for many problems in the SIMD universe.

Take, for example, the dot product. This can be expressed in R code simply as:

``````sum(lhs * rhs)
``````

Here, we ‘map’ our vectors by multiplying them together element-wise, and ‘reduce’ the result through summation. Of course, behind the scenes, R is doing a bit more than it has to – it’s computing a new vector, `lhs * rhs`, which is of the same length as `lhs`, and then collapsing (reducing) that vector by adding each element up. It would be great if we could skip that large temporary vector allocation. RcppNT2 provides a function, `simdMapReduce()`, that makes expressing these kinds of problems very easy.

To make use of `simdMapReduce()`, you need to write a class that provides a number of templated methods:

• `U init()` — returns the initial (scalar) data state.

• `T map(const T&... ts)` — transforms the values

• `T combine(const T& lhs, const T& rhs)` — describes how results should be combined

• `U reduce(const T& t)` — describes how a SIMD pack should be reduced

You’ll notice that we play a little fast-and-loose with the terms, but it should still be relatively clear what each method accomplishes. With this infrastructure, the dot product could be implemented like so:

If you can ignore the C++ templates, it should hopefully be fairly clear what’s going on here. We transform elements by multiplying them together, and we combine + reduce by adding them up. (Unfortunately, although the `combine()` and `reduce()` functions are effectively doing the same thing, they need to be expressed separately, as the `reduce()` function is effectively our bridge from SIMD land to scalar land).

Now, let’s show how our map-reducer can be called.

Let’s also export a version that accepts IntegerVector, just to show that our class is generic enough to accept other integral types as well.

And let’s execute it from R, just to convince ourselves that it works.

Great! Of course, a large number of problems can be expressed with a ‘plus’, or ‘sum’ reduction, so RcppNT2 also provides a helper for that, so that you only need to implement the ‘map’ step. We can do this by writing a class that inherits from the `PlusReducer` class:

And, let’s convince ourselves it works:

And let’s use a quick microbenchmark to see if we’ve truly gained anything here:

```               test replications elapsed relative
2         n1 %*% n2          100   0.412    5.282
3   simdDot(n1, n2)          100   0.085    1.090
4 simdDotV2(n1, n2)          100   0.078    1.000
1      sum(n1 * n2)          100   0.307    3.936
```
```                test replications elapsed relative
2          i1 %*% i2          100   1.303   42.032
3 simdDotInt(i1, i2)          100   0.031    1.000
1       sum(i1 * i2)          100   0.571   18.419
```

You might be surprised how profound the speed improvements accrued by using SIMD instructions are. How does this happen?

Behind the scenes, `simdMapReduce()` is handling a number of things for us:

1. Iteration over the sequences used SIMD packs when possible, and scalars when not,

2. Optimized SIMD instructions are used to transform and combine packs of values,

3. Intermediate results are held in a SIMD register, rather than materializing a whole vector,

4. The SIMD register and scalar buffer are not combined until the very final step.

In the ‘double’ case, we can pack 2 values into a SIMD pack; in the ‘int’ case, we can pack 4 values (assuming 32bit ‘int’ and 128bit SSE registers, which is the common case on Intel processors at the time of this post). Assuming that it takes the number of clock cycles to execute a SIMD instruction as it does for the scalar equivalent, this should translate into ~2x and ~4x speedups – and that’s not even accounting for gains in efficient register use, cache efficiency, and the ability to avoid the large temporary allocation! That said, we are playing it a little fast and loose in the ‘int’ case: with larger numbers, we could easily overflow; depending on the type of data expected it may be more appropriate to accumulate values into a different data type.

In short – if you’re implementing an algorithm, or part of an algorithm, that can be expressed as:

• `sum(<transformation of variables>)`

then `simdMapReduce()` is worth looking at.

This article provides just a taste of how RcppNT2 can be used. If you’re interested in learning more, please check out the RcppNT2 website.

tags: simd  parallel