Romain Francois and Martin Morgan — written Sep 12, 2013 — source
This post is based on a question on Stack Overflow and more precisely on Martin Morgan’s answer.
The problem is to find the indices of top n
elements from a vector.
An inefficient way of doing this is to run order
on the vector and then
only keep the last n
values:
This is inefficient because it requires sorting the entire vector, which is expensive.
Instead, Martin ‘s idea is to use a priority queue, this is a container adapter from the Standard Template Libary which is designed so that its top element is always the greatest.
So Martin used a priority queue of std::pair<double,int>
, using the
default implementation of the “greater” operator for pairs which just
does lexicographic ordering. The idea is then to only keep n
such pairs
in the queue and feed it by scanning the vector.
Here is Martin’s full code:
The version we present here uses the Compare
template argument of the
priority_queue
to control the comparison. This way, instead of storing
pairs of (value, index) we will only store the indices and implement
the comparison between two indices by going back to the data.
For that purpose, we need to define the IndexCompare
class, which we make a
template to handle different input types (integer vector, numeric vector, character
vector).
Then we abstract some of the concept out of main function by providing
our own queue template class : IndexQueue
which is templated by the
type of data that is in the vector.
With these types, we can now implement the top_index
template function, which
once again is templated on the type of vector we deal with. Finally we have
the attributes compatible top_index
function that dispatches to the
appropriate version using a simple switch
We will use the template implementation above for integer and numeric vectors. The
IndexCompare
keeps a reference to the internal data of the vector, as a
raw pointer for better performance and defines the parenthesis operator
using this information.
For character vectors, data is internally stored into an array of SEXP
of
type CHARSXP
, which we can compare thanks to the implementation of the
greater operator in the String
class. We however need a specific implementation
because we need a cast to String
to use the operator.
The implementation of the operator takes into account ties. When there is a tie, we
want to retain the value that has the smallest index. This is the reason for this part
of the operator : || (data[i] == data[j] && j > i )
The IndexQueue
template embeds a priority_queue
with the right parameters,
and implements:
input
: this first compares the top of the queue and replaces it if necessarypop
: delegates to priority_queue
push
: idemIntegerVector
for when we want to get the results.With this, the implementation of top_index
is straightforward. We handle the
case where there are less than n
data points which is not interesting, then
we push
the first n
points into the queue, and finally we input
the rest of the
data.
Let’s check that we get what we want:
[1] TRUE
[1] 17 18 19 20 21 22 23 24 25 26
And then let us benchmark:
Loading required package: microbenchmark
Unit: microseconds expr min lq mean median uq max R_order 28264.304 28461.3320 29190.3675 28602.28 29412.4410 42631.973 cpp1 745.554 755.0810 774.4438 762.33 777.0160 1030.393 cpp2 447.278 455.8515 478.6965 464.09 471.9845 1472.173 neval cld 100 b 100 a 100 a
tags: stl
Tweet