Kevin Ushey — written Feb 1, 2016 — source
The Numerical Template Toolbox (NT2)
collection of header-only C++ libraries that make it
possible to explicitly request the use of SIMD instructions
when possible, while falling back to regular scalar
operations when not. NT2 itself is powered
by Boost, alongside two proposed
Boost libraries –
Boost.Dispatch, which provides a
mechanism for efficient tag-based dispatch for functions,
Boost.SIMD, which provides a framework for the
implementation of algorithms that take advantage of SIMD
and exposes these libraries for use with
If you haven’t already, read the RcppNT2 introduction article to get acquainted with the RcppNT2 package.
First, let’s review how we might use
to sum a vector of numbers. We explicitly pass in the
std::plus<double>() functor, just to make it clear that
std::accumulate() algorithm expects a binary
functor when accumulating values.
Now, let’s rewrite this to take advantage of RcppNT2. There are two main steps required to take advantage of RcppNT2 at a high level:
Write a functor, with a templated call operator,
with the implementation written in a
Provide the functor as an argument to the appropriate SIMD algorithm.
Let’s follow these steps in implementing our SIMD sum.
As you can see, it’s quite simple to take advantage of
Boost.SIMD. For very simple operations such as this,
RcppNT2 provides a number of pre-defined functors,
which can be accessed in the
namespace. The following is an equivalent way of defining
the above function:
Behind the scenes of
apply your templated functor to ‘packs’ of values when
appropriate, and scalar values when not. In other words,
there are effectively two kinds of template
specializations being generated behind the scenes: one
T = double, and one with
boost::simd::pack<double>. The use of the packed
representation is what allows
Boost.SIMD to ensure
vectorized instructions are used and generated.
Boost.SIMD provides a host of functions and operator
overloads that ensure that optimized instructions are
used when possible over a packed object, while falling
back to ‘default’ operations for scalar values when not.
Now, let’s compare the performance of these two implementations.
expr min lq mean median uq max vectorSum(v) 870.894 887.1535 978.7471 987.1810 989.1060 1792.215 vectorSumSimd(v) 270.985 283.2315 297.8062 287.6565 298.7115 608.373
Perhaps surprisingly, the RcppNT2 solution is much
faster – the gains are similar to what we might have
seen when computing the sum in parallel. However, we’re
still just using a single core; we’re just taking
advantage of vectorized instructions provided by the CPU.
In this particular case, on Intel CPUs,
ensure that we are using the
addpd instruction, which
is documented in the Intel Software Developer’s Manual
Note that, for the naive serial sum, the compiler would
likely generate similarly efficient code when the
-ffast-math optimization flag is set. By default, the
compiler is somewhat ‘pessimistic’ about the set of
optimizations it can perform around floating point
arithmetic. This is because it must respect the
IEEE floating point standard,
and this means respecting the fact that, for example,
floating point operations are not assocative:
Surprisingly, the above computation does not evaluate to zero!
In practice, you’re likely safe to take advantage of the
-ffast-math optimizations, or
Boost.SIMD, in your own
work. However, be sure to test and verify!
This article provides just a taste of how RcppNT2 can be used. If you’re interested in learning more, please check out the RcppNT2 website.Tweet