## Detecting a Time Series Change Point

Joshua French — written Jan 4, 2014 — source

In this example we will detect the change point in a time series of counts using Bayesian methodology. A natural solution to this problem utilizes a Gibbs sampler. We’ll first implement the sampler in R naively, then create a vectorized R implementation, and lastly create an implementation of the sampler using Rcpp and RcppArmadillo. We will compare these implementations using a famous coal mining data set. We will see that the speed of the sampler greatly improves with each implementation.

We will assume that we have a time series of `n` responses coming from a Poisson distribution. The first `k` observations are i.i.d. and come from a Poisson distribution with mean `lambda`, while the remaining observations are i.i.d. and come from a Poisson distribution with mean `phi`. We will assume the prior distribution for `lambda` is Gamma(`a`, `b`), the prior distribution for `phi` is Gamma(`c`, `d`), and the prior distribution for `k` is discrete uniform over the values 1,2,…,`n`. From this information, one can arrive at closed form expressions for the full conditional distribution of `lambda`, `phi`, and `k` (see here).

We can sample values from the full conditional distributions of `lambda` and `phi` using the `rgamma` function built into R. We can sample from the discrete set of values 1, 2, …, `n` using the `sample` function with the appropriate sampling probability for each value. Since the probability mass function (pmf) for the full conditional distribution of `k` is a bit complicated, we’ll code it up as its own function taking the observed data `y` and the parameters `lambda` and `phi`. The functions for the pmf will be named `kprob*`. Note that in the pmf we will need to calculate the natural logarithm of a sum of exponential values; this can sometimes create problems with numerical underflow or overflow. Consequently, we will also create a function `logsumexp` to do this efficiently. The Gibbs samplers for the three implementations will be named `gibbsloop`, `gibbsvec`, and `gibbscpp`.

The initial naive R implementation of the sampler is given below.

Looking through the first implemention, we realize that we can speed up our code by vectorizing when possible and not performing computations more often than we have to. Specifically, we are calculating `sum(y[1:i])` and `sum(y[min((k+1),n):n])` repeatedly in the loops. We can dramatically improve the speed of our implementation by calculating the sum of `y` and the cumulative sum of `y` and then using the stored results appropriately. We now reimplement the pmf and Gibbs sampler functions with this in mind.

We should be able to improve the speed of our sampler by implementing it in C++. We can reimplement the vectorized version of the sampler fairly easily using Rcpp and RcppArmadillo. Note that the parameterization of `rgamma` used by Rcpp is slightly different from base R.

We will now compare and apply the implementations to a coal mining data set.

Coal mining is a notoriously dangerous occupation. Consider the tally of coal mine disasters over a 112-year period (1851 to 1962) in the United Kingdom. The data have relatively high disaster counts in the early era, and relatively low counts in the later era. When did technology improvements and safety practices have an actual effect on the rate of serious coal mining disasters in the United Kingdom?

Let us first read the data into R:

For simplicity, we’ll refer to 1851 as year 1, 1852 as year 2, …, 1962 as year 112. We will set the initial values for `phi` and `k` to be 1 and 40, respectively. We will set the hyperparameters `a`, `b`, `c`, and `d` to be 4, 1, 1, and 2, respectively.

We will run the samplers for 10,000 cycles each. We set the random number seed for each sampler to be 1 for reproducible results. We then check that the results for the three samplers are identical.

``` TRUE
```

We now compare compare the relative speed of the three implementations using the `benchmark` function.

```                                                                                            test
3  gibbscpp(nsim = nsim, y = counts, a = 4, b = 1, c = 1, d = 2, kposs = 1:112, phi = 1, k = 40)
2  gibbsvec(nsim = nsim, y = counts, a = 4, b = 1, c = 1, d = 2, kposs = 1:112, phi = 1, k = 40)
1 gibbsloop(nsim = nsim, y = counts, a = 4, b = 1, c = 1, d = 2, kposs = 1:112, phi = 1, k = 40)
replications elapsed relative
3           10   2.207    1.000
2           10   5.611    2.542
1           10  68.118   30.865
```

The C++ version of the Gibbs sampler is a vast improvement over the looped R version and also quite a bit faster than the vectorized R version.

To complete this application, we analyze the samples from the posterior distribution of `k`. The median value of the posterior sample is 40, meaning the estimated change point year is 1890. To visualize the distribution of the data before and after estimated change point, we can create a lineplot of the time series with years 1851-1890 colored orange and the remaining years colored blue.

``` 40
```
``` 1890
```