## Getting indices of top elements from a vector using a priority queue

Romain Francois and Martin Morgan — written Sep 12, 2013 — source

This post is based on a question on Stack Overflow and more precisely on Martin Morgan’s answer.

The problem is to find the indices of top `n` elements from a vector.

An inefficient way of doing this is to run `order` on the vector and then only keep the last `n` values:

This is inefficient because it requires sorting the entire vector, which is expensive.

Instead, Martin ‘s idea is to use a priority queue, this is a container adapter from the Standard Template Libary which is designed so that its top element is always the greatest.

So Martin used a priority queue of `std::pair<double,int>`, using the default implementation of the “greater” operator for pairs which just does lexicographic ordering. The idea is then to only keep `n` such pairs in the queue and feed it by scanning the vector.

Here is Martin’s full code:

The version we present here uses the `Compare` template argument of the `priority_queue` to control the comparison. This way, instead of storing pairs of (value, index) we will only store the indices and implement the comparison between two indices by going back to the data.

For that purpose, we need to define the `IndexCompare` class, which we make a template to handle different input types (integer vector, numeric vector, character vector).

Then we abstract some of the concept out of main function by providing our own queue template class : `IndexQueue` which is templated by the type of data that is in the vector.

With these types, we can now implement the `top_index` template function, which once again is templated on the type of vector we deal with. Finally we have the attributes compatible `top_index` function that dispatches to the appropriate version using a simple `switch`

We will use the template implementation above for integer and numeric vectors. The `IndexCompare` keeps a reference to the internal data of the vector, as a raw pointer for better performance and defines the parenthesis operator using this information.

For character vectors, data is internally stored into an array of `SEXP` of type `CHARSXP`, which we can compare thanks to the implementation of the greater operator in the `String` class. We however need a specific implementation because we need a cast to `String` to use the operator.

The implementation of the operator takes into account ties. When there is a tie, we want to retain the value that has the smallest index. This is the reason for this part of the operator : `|| (data[i] == data[j] && j > i )`

The `IndexQueue` template embeds a `priority_queue` with the right parameters, and implements:

• `input` : this first compares the top of the queue and replaces it if necessary
• `pop` : delegates to `priority_queue`
• `push` : idem
• conversion to an `IntegerVector` for when we want to get the results.

With this, the implementation of `top_index` is straightforward. We handle the case where there are less than `n` data points which is not interesting, then we `push` the first `n` points into the queue, and finally we `input` the rest of the data.

Let’s check that we get what we want:

``` TRUE
```
```  17 18 19 20 21 22 23 24 25 26
```

And then let us benchmark:

```Loading required package: microbenchmark
```
```Unit: microseconds
expr       min         lq       mean   median         uq       max
R_order 28264.304 28461.3320 29190.3675 28602.28 29412.4410 42631.973
cpp1   745.554   755.0810   774.4438   762.33   777.0160  1030.393
cpp2   447.278   455.8515   478.6965   464.09   471.9845  1472.173
neval cld
100   b
100  a
100  a
```

tags: stl