Using Rcpp with Boost.Regex for regular expressions

Dirk Eddelbuettel — written Mar 1, 2013 — updated Mar 4, 2018

Gabor asked about Rcpp use with regular expression libraries. This post shows a very simple example, based on one of the Boost.Regex examples.

There is one big difference between this example and other Boost examples, which typically use the header-only BH package. Here, we need to set linker options because Boost.Regex requires its compiled library. Similar restrictions apply to Boost.System, Boost.Filesystem and a few other Boost libraries.

Now, if your computer has these libraries installed (as would be common under Linux or on macOS), then this can be as simple as

Sys.setenv("PKG_LIBS"="-lboost_regex")

provided the corresponding library libboost_regex is indeed in one of the system library directories.

If so, the following example can be built:

// cf www.boost.org/doc/libs/1_53_0/libs/regex/example/snippets/credit_card_example.cpp

#include <Rcpp.h>

#include <string>
#include <boost/regex.hpp>

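// Returns true only if the whole string is four groups of four digits,
// with a hyphen or space after each of the first three groups.
bool validate_card_format(const std::string& s) {
   static const boost::regex e("(\\d{4}[- ]){3}\\d{4}");
   return boost::regex_match(s, e);
}

// \A and \z anchor the pattern to the start and end of the input;
// the separators are optional and the first group may have 3 or 4 digits.
const boost::regex e("\\A(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})\\z");
// sed-style format strings: \1 to \4 refer to the four capture groups.
const std::string machine_format("\\1\\2\\3\\4");
const std::string human_format("\\1-\\2-\\3-\\4");

// Rewrites a matching input without separators; other input is returned unchanged.
std::string machine_readable_card_number(const std::string& s) {
   return boost::regex_replace(s, e, machine_format, boost::match_default | boost::format_sed);
}

// Rewrites a matching input with hyphens between groups; other input is returned unchanged.
std::string human_readable_card_number(const std::string& s) {
   return boost::regex_replace(s, e, human_format, boost::match_default | boost::format_sed);
}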

// [[Rcpp::export]]
Rcpp::DataFrame regexDemo(std::vector<std::string> s) {
    int n = s.size();
    
    std::vector<bool> valid(n);
    std::vector<std::string> machine(n);
    std::vector<std::string> human(n);
    
    for (int i=0; i<n; i++) {
        valid[i]  = validate_card_format(s[i]);
        machine[i] = machine_readable_card_number(s[i]);
        human[i] = human_readable_card_number(s[i]);
    }
    return Rcpp::DataFrame::create(Rcpp::Named("input") = s,
                                   Rcpp::Named("valid") = valid,
                                   Rcpp::Named("machine") = machine,
                                   Rcpp::Named("human") = human);
}

We can test the function using the same input as the Boost example:

s <- c("0000111122223333", "0000 1111 2222 3333", "0000-1111-2222-3333", "000-1111-2222-3333")
regexDemo(s)
                input valid          machine               human
1    0000111122223333 FALSE 0000111122223333 0000-1111-2222-3333
2 0000 1111 2222 3333  TRUE 0000111122223333 0000-1111-2222-3333
3 0000-1111-2222-3333  TRUE 0000111122223333 0000-1111-2222-3333
4  000-1111-2222-3333 FALSE  000111122223333  000-1111-2222-3333
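
As an aside, readers without a compiled Boost.Regex library at hand may want to check the patterns themselves first. The following is a minimal sketch, assuming a C++11 compiler, that swaps in std::regex from the standard library, which needs no linker flags. It is not the Boost API used above: the function names with the _std suffix are hypothetical, and because the default ECMAScript grammar of std::regex has no \A and \z, the sketch anchors the pattern with ^ and $ instead.

```cpp
#include <regex>
#include <string>

// Hypothetical helper names: they mirror, but are not, the Boost-based functions above.

// True only if the whole string is four groups of four digits, with a
// hyphen or space after each of the first three groups.
bool validate_card_format_std(const std::string& s) {
    static const std::regex e("(\\d{4}[- ]){3}\\d{4}");
    return std::regex_match(s, e);  // regex_match requires a match of the entire string
}

// ECMAScript grammar has no \A or \z, so ^ and $ anchor the pattern here.
static const std::regex card_std("^(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})$");

// format_sed makes \1 to \4 in the format string refer to the capture groups.
std::string machine_readable_std(const std::string& s) {
    return std::regex_replace(s, card_std, std::string("\\1\\2\\3\\4"),
                              std::regex_constants::format_sed);
}

std::string human_readable_std(const std::string& s) {
    return std::regex_replace(s, card_std, std::string("\\1-\\2-\\3-\\4"),
                              std::regex_constants::format_sed);
}
```

For the four test strings above, these helpers should reproduce the valid, machine and human columns of the table. Boost.Regex remains the subject of this post; the sketch only serves to experiment with the expressions.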

tags: boost  basics 
