## Optimizing Code vs Recognizing Patterns with 3D Arrays

James Joseph Balamuta — written Jun 8, 2016 — source

## Intro

As is the case with the majority of posts normally born into existence, there was an interesting problem that arose recently on StackOverflow. Steffen, a scientist at an unnamed weather service, faced an issue with the amount of computational time required by a loop intensive operation in R. Specifically, Steffen needed to be able to sum over different continuous regions of elements within a 3D array structure with dimensions 2500 x 2500 x 50 to acquire the amount of neighbors. The issue is quite common within geography information sciences and/or spatial statistics. What follows next is a modified version of the response that I gave providing additional insight into various solutions.

## Problem Statement

Consider an array of matrices that contain only 1 and 0 entries that are spread out over time with dimensions $X \times Y \times T$. Within each matrix, there are specific regions on the $X-Y$ plane of a fixed time, $t$, that must be summed over to obtain their neighboring elements. Each region is constrained to a four by four tile given by points inside $(x-2, y-2), (x-2, y+2), (x+2, y+2), (x+2, y-2)$, where $x \in [3, X - 2]$, $y \in [3, Y - 2]$, $X \ge 5$, and $Y \ge 5$.

### Sample Result:

Without a loss of generality, I’ve opted to downscale the problem to avoid a considerable amount of output while displaying a sample result. Thus, the sample result presented next is under the dimensions: $6 \times 6 \times 2$. Therefore, the results of the neighboring clusters are:

Note: Both time points, t <- 1 and t <- 2, are the same!!! More on this interesting pattern later…

## Possible Solutions

There are many ways to approach a problem like this. The first is to try to optimize the initial code base within R. Secondly, and one of the primary reasons for this article is, one can port over the computational portion of the code and perhaps add some parallelization component with the hope that such a setup would decrease the amount of time required to perform said computations. Thirdly, one can try to understand the underlying structure of the arrays by searching for patterns that can be exploited to create a cache of values which would be able to be reused instead of individually computed for each time point.

Thus, there are really three different components within this post that will be addressed:

1. Optimizing R code directly in hopes of obtaining a speedup in brute forcing the problem;
2. Porting over the R code into C++ and parallelizing the computation using more brute force;
3. Trying to determine different patterns in the data and exploit their weakness

## Optimizing within R

The initial problem statement has something that all users of R dread: a loop. However, it isn’t just one loop, it’s 3! As is known, one of the key downsides to R is looping. This problem has a lot of coverage — my favorite being the straight, curly, or compiled as it sheds light on R’s parser — and is primarily one of the key reasons why Rcpp is favored in loop heavy problems.

There are a few ways we can aim to optimize just the R code:

1. Cache subsets within the loop.
2. Parallelize the time computation stage.

### Original

To show differences between functions, I’ll opt to declare the base computational loop given above as:

### Cached R

The first order of business is to implement a cached value schema so that we avoid subsetting the different elements considerably.

### Parallelized R

Next up, let’s implement a way to parallelize the computation over time using R’s built in parallel package.

### Validating Function Output

After modifying a function’s initial behavior, it is a good idea to check whether or not the new function obtains the same values. If not, chances are there was a bug that crept into the function after the change.

 TRUE

 TRUE


Thankfully, all the modifications yielded no numerical change from the original.

### Benchmarking

As this is Rcpp, benchmarks are king. Here I’ve opted to create a benchmark of the different functions after trying to optimize them within R.

Output:

Unit: milliseconds
expr        min         lq       mean     median         uq        max neval
r_orig   3.810462   4.616145   5.275087   4.948391   5.746963   9.232729   100
r_cached   3.104683   3.529720   3.965664   3.753478   4.053211   5.793070   100
r_cores1 146.083671 151.194330 155.888165 153.902970 157.469347 193.474503   100
r_cores2 283.613516 294.775359 302.326447 299.960622 306.691685 367.890747   100
r_cores3 429.393160 441.279938 449.346205 445.347866 452.774463 540.862980   100
r_cores4 575.001192 588.513541 602.381051 596.365266 608.556754 713.669543   100


Not surprisingly, the cached subset function performed better than the original R function. On the other hand, what was particularly surpising was that the parallelization option within R took a considerably longer amount of time as additional cores were added. This was partly due to the fact that the a array had to be replicated out across the different ncores number of R processes spawned. In addition, there was also a time lag between spawning the processes and winding them down. This may prove to be a fatal flaw of the construction of cube_r_parallel as a normal user may wish to make repeated calls within the same parallelized session. However, I digress as I feel like I’ve spent too much time describing ways to optimize the R code.

## Porting into C++

To keep in the spirit of optimizing a generalized solution, we can quickly port over the R cached version of the computational loop to Armadillo, C++ Linear Algebra Library, and use OpenMP with RcppArmadillo.

Why not use just Rcpp though? The rationale for using RcppArmadillo over Rcpp is principally because of the native support for multidimensional arrays via arma::cube. In addition, it’s a bit painful in the current verison of Rcpp to work with an array. Hence, I would rather face the cost of copying an object from SEXP to arma and back again than mess around with this.

Before we get started, it is very important to provide protection for systems that still lack OpenMP by default in R (**cough** OS X **cough**). Though, by changing how R’s ~/.R/Makevars compiler flag is set, OpenMP can be used within R. Discussion of this approach is left to http://thecoatlessprofessor.com/programming/openmp-in-r-on-os-x/. Now, there still does exist the real need to protect users of the default R ~/.R/Makevars configuration by guarding the header inclusion of OpenMP in the cpp code using preprocessor directives:

Given the above code, we have effectively provided protection from the compiler throwing an error due to OpenMP not being available on the system. Note: In the event that the system does not have OpenMP, the process will be executed serially just like always.

With this being said, let’s look at the C++ port of the R function:

A few things to note here:

1. a, res, xdim,ydim and ncores are shared across processes.
2. t is unique to each process.
3. We protect users that cannot use OpenMP!

To verify that this is an equivalent function, we opt to check the object equality:

 TRUE


### Timings

Just as before, let’s check the benchmarks to see how well we did:

Output:

Unit: microseconds
expr      min        lq      mean    median        uq       max neval
r_orig 3641.936 4000.8825 4704.5218 4198.0920 4813.7600 40890.469   100
r_cached 2835.476 3056.9185 3300.9981 3195.7130 3328.6795  4466.824   100
cpp_core1  177.770  182.8920  202.1724  188.5620  199.7085  1076.910   100
cpp_core2  114.871  117.5255  136.1581  120.6395  132.3940   704.835   100
cpp_core3   83.156   87.1025  100.3414   93.1880  106.2620   207.876   100
cpp_core4   82.987   87.2925  106.9642   97.5210  121.6275   183.434   100


Wow! The C++ version of the parallelization really did wonders when compared to the previous R implementation of parallelization. The speed ups when compared to the R implementations are about 44x vs. original and 34x vs. the optimized R loop (using 3 cores). Note, with the addition of the 4th core, the parallelization option performs poorly as the system running the benchmarks only has four cores. Thus, one of the cores is trying to keep up with the parallelization while also having to work on operating system tasks and so on.

Now, this isn’t to say that there is no cap to parallelization benefits given infinite amounts of processing power. In fact, there are two laws that govern general speedups from parallelization: Amdahl’s Law (Fixed Problem Size) and Gustafson’s Law (Scaled Speedup). The details are better left for another time on this matter.

## Detecting Patterns

Previously, we made the assumption that the structure of the data within computation had no pattern. Thus, we opted to create a generalized algorithm to effectively compute each value. Within this section, we remove the assumption about no pattern being present. In this case, we opt to create a personalized solution to the problem at hand.

As a first step, notice how the array is constructed in this case with: array(0:1, dims). There seems to be some sort of pattern depending on the xdim, ydim, and tdim of the distribution of 1s and 0s. If we can recognize the pattern in advance, the amount of computation required decreases. However, this may also impact the ability to generalize the algorithm to other cases outside the problem statement. Thus, the reason for this part of the post being at the terminal part of this article.

After some trial and error using different dimensions, the different patterns become recognizable and reproducible. Most notably, we have three different cases:

• Case 1: If xdim is even, then only the rows of a matrix alternate with rows containing all 1s or 0s.
• Case 2: If xdim is odd and ydim is even, then only the rows alternate of a matrix alternate with rows containing a combination of both 1 or 0.
• Case 3: If xdim is odd and ydim is odd, then rows alternate as well as the matrices alternate with a combination of both 1 or 0.

## Examples of Pattern Cases

Let’s see the cases described previously in action to observe the patterns. Please note that the dimensions of the example case arrays are small and would likely yield issues within the for loop given above due to the indices being negative or zero triggering an out-of-bounds error.

Case 1:

Output:

Case 2:

Output:

, , 1

[,1] [,2] [,3] [,4]
[1,]    0    1    0    1
[2,]    1    0    1    0
[3,]    0    1    0    1

, , 2

[,1] [,2] [,3] [,4]
[1,]    0    1    0    1
[2,]    1    0    1    0
[3,]    0    1    0    1


Case 3:

Output:

, , 1

[,1] [,2] [,3] [,4]
[1,]    0    1    0    1
[2,]    1    0    1    0
[3,]    0    1    0    1

, , 2

[,1] [,2] [,3] [,4]
[1,]    0    1    0    1
[2,]    1    0    1    0
[3,]    0    1    0    1


### Pattern Hacking

Based on the above discussion, we opt to make a bit of code that exploits this unique pattern. The language that we can write this code in is either R or C++. Though, due to the nature of this website, we opt to proceed in the later. Nevertheless, as an exercise, feel free to backport this code into R.

Having acknowledged that, we opt to start off trying to create code that can fill a matrix with either an even or an odd column vector. e.g.

#### Odd Vector

The odd vector is defined as having the initial value being given by 1 instead of 0. This is used in heavily in both Case 2 and Case 3.

#### Even

In comparison to the odd vector, the even vector starts at 0. This is used principally in Case 1 and then in Case 2 and Case 3 to alternate columns.

#### Creating Alternating Vectors

To obtain such an alternating vector that switches between two values, we opt to create a vector using the modulus operator while iterating through element positions in an $N$ length vector. Specifically, we opt to use the fact that when i is an even number i % 2 must be 0 and when i is odd i % 2 gives 1. Thefore, we are able to create an alternating vector that switches between two different values with the following code:

#### Creating the three cases of matrix

With our ability to now generate odd and even vectors by column, we now need to figure out how to create the matrices described in each case. As mentioned above, there are three cases of matrix given as:

1. The even,
3. And the ugly odd.

Using a similar technique to obtain the alternating vector, we obtain the three cases of matrices:

#### Calculation Engine

Next, we need to create a computational loop to subset the appropriate continuous areas of the matrix to figure out the amount of neighbors. In comparison to the problem statement, note that this loop is without the t as we no longer need to repeat calculations within this approach. Instead, we only need to compute the values once and then cache the result before duplicating it across the 3D array (e.g. arma::cube).

#### Call Main Function

Whew, that was a lot of work. But, by approaching the problem this way, we have:

1. Created reusable code snippets.
2. Decreased the size of the main call function.
3. Improved clarity of each operation.

Now, the we are ready to write the glue that combines all the different components. As a result, we will obtain the desired neighbor information.

#### Pieced together

When put together, the code forms:

#### Verification of Results

To verify, let’s quickly create similar cases and test them against the original R function:

##### Case 1:
 TRUE

##### Case 2:
 TRUE

##### Case 3:
 TRUE


## Closing Time - You don’t have to go home but you can’t stay here.

With all of these different methods now thoroughly described, let’s do one last benchmark to figure out the best of the best.

Output:

Unit: microseconds
expr      min        lq       mean    median        uq       max neval
r_orig 3692.596 3977.0135 4350.60190 4106.0515 4857.9320  7182.160   100
r_cached 2851.516 3100.4145 3498.17482 3234.1710 3514.4660 11165.995   100
cpp_core1  175.402  179.6125  192.61909  188.2585  199.0685   273.855   100
cpp_core2  112.878  115.8420  126.66236  119.7010  130.9980   302.339   100
cpp_core3   81.540   84.8295  101.10615   89.4330  103.0835   400.662   100
cpp_cpat   40.582   43.1645   51.00269   50.7720   55.7760    96.967   100


As can be seen, the customized approach based on the data’s pattern provided the fastest speed up. Though, by customizing to the pattern, we lost the ability to generalize the solution without being exposed to new cases. Meanwhile, the code port into C++ yielded much better results than the optimized R version. Both of these pieces of code were highly general to summing over specific portions of an array. Lastly, the parallelization within R was simply too time consuming for a one-off computation.