James Joseph Balamuta — written Jun 8, 2016 — source

As is the case with the majority of posts normally born into existence, there was an interesting problem that arose recently on StackOverflow. Steffen, a scientist at an unnamed weather service, faced an issue with the amount of computational time required by a loop intensive operation in R. Specifically, Steffen needed to be able to sum over different continuous regions of elements within a 3D array structure with dimensions `2500 x 2500 x 50`

to acquire the amount of neighbors. The issue is quite common within geography information sciences and/or spatial statistics. What follows next is a modified version of the response that I gave providing additional insight into various solutions.

Consider an array of matrices that contain only 1 and 0 entries that are spread out over time with dimensions . Within each matrix, there are specific regions on the plane of a fixed time, , that must be summed over to obtain their neighboring elements. Each region is constrained to a four by four tile given by points inside , where , , , and .

Without a loss of generality, I’ve opted to downscale the problem to avoid a considerable amount of output while displaying a sample result. Thus, the sample result presented next is under the dimensions: . Therefore, the results of the neighboring clusters are:

**Note:** Both time points, `t <- 1`

and `t <- 2`

, are the same!!! More on this interesting pattern later…

There are many ways to approach a problem like this. The first is to try to optimize the initial code base within R. Secondly, and one of the primary reasons for this article is, one can port over the computational portion of the code and perhaps add some parallelization component with the hope that such a setup would decrease the amount of time required to perform said computations. Thirdly, one can try to understand the underlying structure of the arrays by searching for patterns that can be exploited to create a cache of values which would be able to be reused instead of individually computed for each time point.

Thus, there are really three different components within this post that will be addressed:

- Optimizing R code directly in hopes of obtaining a speedup in brute forcing the problem;
- Porting over the R code into C++ and parallelizing the computation using more brute force;
- Trying to determine different patterns in the data and exploit their weakness

The initial problem statement has something that all users of `R`

dread: a loop. However, it isn’t just one loop, it’s 3! As is known, one of the key downsides to `R`

is looping. This problem has a lot of coverage — my favorite being the straight, curly, or compiled as it sheds light on R’s parser — and is primarily one of the key reasons why `Rcpp`

is favored in loop heavy problems.

There are a few ways we can aim to optimize just the R code:

- Cache subsets within the loop.
- Parallelize the time computation stage.

To show differences between functions, I’ll opt to declare the base computational loop given above as:

The first order of business is to implement a cached value schema so that we avoid subsetting the different elements considerably.

Next up, let’s implement a way to parallelize the computation over time using R’s built in `parallel`

package.

After modifying a function’s initial behavior, it is a good idea to check whether or not the new function obtains the same values. If not, chances are there was a bug that crept into the function after the change.

[1] TRUE

[1] TRUE

Thankfully, all the modifications yielded no numerical change from the original.

As this **is** `Rcpp`

, benchmarks are king. Here I’ve opted to create a benchmark of the different functions after trying to optimize them within R.

**Output:**

Unit: milliseconds expr min lq mean median uq max neval r_orig 3.810462 4.616145 5.275087 4.948391 5.746963 9.232729 100 r_cached 3.104683 3.529720 3.965664 3.753478 4.053211 5.793070 100 r_cores1 146.083671 151.194330 155.888165 153.902970 157.469347 193.474503 100 r_cores2 283.613516 294.775359 302.326447 299.960622 306.691685 367.890747 100 r_cores3 429.393160 441.279938 449.346205 445.347866 452.774463 540.862980 100 r_cores4 575.001192 588.513541 602.381051 596.365266 608.556754 713.669543 100

Not surprisingly, the cached subset function performed better than the original R function. On the other hand, what was particularly surpising was that the parallelization option within R took a considerably longer amount of time as additional cores were added. This was partly due to the fact that the `a`

array had to be replicated out across the different `ncores`

number of `R`

processes spawned. In addition, there was also a time lag between spawning the processes and winding them down. This may prove to be a fatal flaw of the construction of `cube_r_parallel`

as a normal user may wish to make repeated calls within the same parallelized session. However, I digress as I feel like I’ve spent too much time describing ways to optimize the `R`

code.

To keep in the spirit of optimizing a generalized solution, we can quickly port over the R cached version of the computational loop to `Armadillo`

, C++ Linear Algebra Library, and use `OpenMP`

with `RcppArmadillo`

.

Why not use just `Rcpp`

though? The rationale for using `RcppArmadillo`

over `Rcpp`

is principally because of the native support for multidimensional arrays via arma::cube. In addition, it’s a bit painful in the current verison of `Rcpp`

to work with an array. Hence, I would rather face the cost of copying an object from `SEXP`

to `arma`

and back again than mess around with this.

Before we get started, it is very important to provide protection for systems that still lack `OpenMP`

by default in R (**cough** OS X **cough**). Though, by changing how R’s `~/.R/Makevars`

compiler flag is set, `OpenMP`

can be used within `R`

. Discussion of this approach is left to http://thecoatlessprofessor.com/programming/openmp-in-r-on-os-x/. Now, there still does exist the real need to protect users of the default R `~/.R/Makevars`

configuration by guarding the header inclusion of `OpenMP`

in the `cpp`

code using preprocessor directives:

Given the above code, we have effectively provided protection from the compiler throwing an error due to `OpenMP`

not being available on the system. **Note:** In the event that the system does not have `OpenMP`

, the process will be executed serially just like always.

With this being said, let’s look at the C++ port of the R function:

A few things to note here:

`a`

,`res`

,`xdim`

,`ydim`

and`ncores`

are**shared**across processes.`t`

is unique to each process.- We protect users that cannot use
`OpenMP`

!

To verify that this is an equivalent function, we opt to check the object equality:

[1] TRUE

Just as before, let’s check the benchmarks to see how well we did:

**Output:**

Unit: microseconds expr min lq mean median uq max neval r_orig 3641.936 4000.8825 4704.5218 4198.0920 4813.7600 40890.469 100 r_cached 2835.476 3056.9185 3300.9981 3195.7130 3328.6795 4466.824 100 cpp_core1 177.770 182.8920 202.1724 188.5620 199.7085 1076.910 100 cpp_core2 114.871 117.5255 136.1581 120.6395 132.3940 704.835 100 cpp_core3 83.156 87.1025 100.3414 93.1880 106.2620 207.876 100 cpp_core4 82.987 87.2925 106.9642 97.5210 121.6275 183.434 100

Wow! The C++ version of the parallelization really did wonders when compared to the previous R implementation of parallelization. The speed ups when compared to the R implementations are about 44x vs. original and 34x vs. the optimized R loop (using 3 cores). Note, with the addition of the 4th core, the parallelization option performs poorly as the system running the benchmarks only has four cores. Thus, one of the cores is trying to keep up with the parallelization while also having to work on operating system tasks and so on.

Now, this isn’t to say that there is no cap to parallelization benefits given infinite amounts of processing power. In fact, there are two laws that govern general speedups from parallelization: Amdahl’s Law (Fixed Problem Size) and Gustafson’s Law (Scaled Speedup). The details are better left for another time on this matter.

Previously, we made the assumption that the structure of the data within computation had no pattern. Thus, we opted to create a generalized algorithm to effectively compute each value. Within this section, we remove the assumption about no pattern being present. In this case, we opt to create a personalized solution to the problem at hand.

As a first step, notice how the array is constructed in this case with: `array(0:1, dims)`

. There seems to be some sort of pattern depending on the `xdim`

, `ydim`

, and `tdim`

of the distribution of 1s and 0s. If we can recognize the pattern in advance, the amount of computation required decreases. **However, this may also impact the ability to generalize the algorithm to other cases outside the problem statement.** Thus, the reason for this part of the post being at the terminal part of this article.

After some trial and error using different dimensions, the different patterns become recognizable and reproducible. Most notably, we have three different cases:

- Case 1: If
`xdim`

is even, then only the rows of a matrix alternate with rows containing**all**1s or 0s. - Case 2: If
`xdim`

is odd and`ydim`

is even, then only the rows alternate of a matrix alternate with rows containing a**combination of both**1 or 0. - Case 3: If
`xdim`

is odd and`ydim`

is odd, then rows alternate as well as the matrices alternate with a**combination of both**1 or 0.

Let’s see the cases described previously in action to observe the patterns. Please note that the dimensions of the example case arrays are small and would likely yield issues within the `for`

loop given above due to the indices being negative or zero triggering an out-of-bounds error.

**Case 1:**

**Output**:

**Case 2:**

**Output:**

, , 1 [,1] [,2] [,3] [,4] [1,] 0 1 0 1 [2,] 1 0 1 0 [3,] 0 1 0 1 , , 2 [,1] [,2] [,3] [,4] [1,] 0 1 0 1 [2,] 1 0 1 0 [3,] 0 1 0 1

**Case 3:**

**Output:**

, , 1 [,1] [,2] [,3] [,4] [1,] 0 1 0 1 [2,] 1 0 1 0 [3,] 0 1 0 1 , , 2 [,1] [,2] [,3] [,4] [1,] 0 1 0 1 [2,] 1 0 1 0 [3,] 0 1 0 1

Based on the above discussion, we opt to make a bit of code that exploits this unique pattern. The language that we can write this code in is either R or C++. Though, due to the nature of this website, we opt to proceed in the later. Nevertheless, as an exercise, feel free to backport this code into R.

Having acknowledged that, we opt to start off trying to create code that can fill a matrix with either an `even`

or an `odd`

column vector. e.g.

The *odd* vector is defined as having the initial value being given by `1`

instead of `0`

. This is used in heavily in both **Case 2** and **Case 3**.

In comparison to the *odd* vector, the *even* vector starts at 0. This is used principally in **Case 1** and then in **Case 2** and **Case 3** to alternate columns.

To obtain such an alternating vector that switches between two values, we opt to create a vector using the modulus operator while iterating through element positions in an length vector. Specifically, we opt to use the fact that when `i`

is an even number `i % 2`

must be `0`

and when `i`

is odd `i % 2`

gives `1`

. Thefore, we are able to create an alternating vector that switches between two different values with the following code:

With our ability to now generate **odd** and **even** vectors by column, we now need to figure out how to create the matrices described in each case. As mentioned above, there are three cases of matrix given as:

- The even,
- The bad odd,
- And the ugly odd.

Using a similar technique to obtain the alternating vector, we obtain the three cases of matrices:

Next, we need to create a computational loop to subset the appropriate continuous areas of the matrix to figure out the amount of neighbors. In comparison to the problem statement, note that this loop is without the `t`

as we no longer need to repeat calculations within this approach. Instead, we only need to compute the values once and then cache the result before duplicating it across the 3D array (e.g. `arma::cube`

).

*Whew*, that was a lot of work. But, by approaching the problem this way, we have:

- Created reusable code snippets.
- Decreased the size of the main call function.
- Improved clarity of each operation.

Now, the we are ready to write the glue that combines all the different components. As a result, we will obtain the desired neighbor information.

When put together, the code forms:

To verify, let’s quickly create similar cases and test them against the original R function:

[1] TRUE

[1] TRUE

[1] TRUE

With all of these different methods now thoroughly described, let’s do one last benchmark to figure out the best of the best.

**Output:**

Unit: microseconds expr min lq mean median uq max neval r_orig 3692.596 3977.0135 4350.60190 4106.0515 4857.9320 7182.160 100 r_cached 2851.516 3100.4145 3498.17482 3234.1710 3514.4660 11165.995 100 cpp_core1 175.402 179.6125 192.61909 188.2585 199.0685 273.855 100 cpp_core2 112.878 115.8420 126.66236 119.7010 130.9980 302.339 100 cpp_core3 81.540 84.8295 101.10615 89.4330 103.0835 400.662 100 cpp_cpat 40.582 43.1645 51.00269 50.7720 55.7760 96.967 100

As can be seen, the customized approach based on the data’s pattern provided the fastest speed up. Though, by customizing to the pattern, we lost the ability to generalize the solution without being exposed to new cases. Meanwhile, the code port into C++ yielded much better results than the optimized R version. Both of these pieces of code were highly general to summing over specific portions of an array. Lastly, the parallelization within R was simply too time consuming for a one-off computation.

**tags:**
armadillo
cube
openmp
parallel

- Performance considerations with sparse matrices in Armadillo — Hong Ooi
- Suppressing Call Stack Info in Rcpp-Generated Errors and Warnings — Michael Weylandt
- Introducing RcppArrayFire — Ralf Stubner