Using iconv via the R exported header

Dirk Eddelbuettel, written Mar 6, 2021

Introduction

Character encodings can be tricky and frustrating to deal with. Several newer languages such as Go or Julia default to native UTF-8 on all platforms, which greatly eases working with text in languages other than English. With R we are not quite as lucky. UTF-8 mostly works as desired on “operating systems with an x”, but sadly there are still a lot of Windows users out there for whom native UTF-8 is not quite in reach. A very detailed discussion of the issues involved was provided last summer on the R Developers Blog in this post.

More recently, another blog post covered character conversion. The useful idea presented in the post is to rely on the (public) header R_ext/Riconv.h which then transparently passes on to the iconv library R itself uses. (Strictly speaking this is an optional feature, see capabilities("iconv") to check your build of R.)

In order to test this, we wrapped up a little (GitHub-only) package RcppIconvExample so that we could toss it at the accessible Windows builders for tests (given that we do not have a physical Windows machine around). Together with an input file encoded in windows-1252 (taken from the uchardet CRAN package wrapping Mozilla’s uchardet library) we can then read and convert text in this ‘foreign’ encoding:

win1252file <- system.file("rawdata", "windows-1252.txt", package="RcppIconvExample")
win1252txt <- RcppIconvExample::read_file(win1252file, "windows-1252")
cat(win1252txt)
L’œuf de volaille est un produit agricole servant d'ingrédient entrant dans la
composition de nombreux plats, dans de nombreuses cultures gastronomiques du
monde.

Our implementation of read_file() follows. It refactors the two functions in the blog post into a single function with an optional encoding argument:

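// cf https://fishandwhistle.net/post/2021/using-rs-cross-platform-iconv-wrapper-from-cpp11
#include <Rcpp.h>
#include <R_ext/Riconv.h>       // R's public wrapper around iconv
#include <fstream>
#include <string>

// [[Rcpp::export]]
std::string read_file(std::string filename, std::string encoding = "") {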
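    const int len = 2048;       // size of the UTF-8 output buffer
    char buffer[len/2];         // raw input: only the first len/2 bytes are read

    std::ifstream file;
    file.open(filename.c_str(), std::ifstream::in | std::ifstream::binary);
    if (!file.is_open()) {      // fail early rather than convert an empty buffer
        Rcpp::stop("Can't open file '%s'", filename.c_str());
    }

    file.read(buffer, len/2);
    size_t n_read = file.gcount();
    file.close();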

    if (encoding == "") {       // no encoding given so return 'as is'
        return std::string(buffer, n_read);
    }

    std::string str_source(buffer, n_read);

    void* iconv_handle = Riconv_open("UTF-8", encoding.c_str());
    if (iconv_handle == ((void*) -1)) {
        Rcpp::stop("Can't convert from '%s' to 'UTF-8'", encoding.c_str());
    }

    const char* in_buffer = str_source.c_str();
    char utf8_buffer[len];
    char* utf8_buffer_mut = utf8_buffer;
    size_t in_bytes_left = n_read;
    size_t out_bytes_left = len;

    size_t result = Riconv(iconv_handle, &in_buffer, &in_bytes_left, &utf8_buffer_mut, &out_bytes_left);
    Riconv_close(iconv_handle);

    if (result == ((size_t) -1) || (in_bytes_left != 0)) {
        Rcpp::stop("Failed to convert file contents to UTF-8");
    }

    return std::string(utf8_buffer, len - out_bytes_left);
}

The entire function body is plain C++ code that sticks to the basic C++98 standard. It calls the C API of R to access iconv if a conversion is selected, and relies on Rcpp for the convenience of automating the interface and translating strings to SEXP objects and back.
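One limitation worth spelling out: with its fixed-size buffer, the function above reads at most len/2 = 1024 bytes of the input file. A minimal sketch of how the reading step could be lifted to whole files, in the same plain C++98 style, might look as follows. The helper name read_all_bytes is hypothetical and not part of RcppIconvExample, and it throws a standard exception where the packaged code would presumably call Rcpp::stop().

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper (an illustration, not part of RcppIconvExample):
// read an entire file as raw bytes, without a fixed-size buffer.
std::string read_all_bytes(const std::string& filename) {
    std::ifstream file(filename.c_str(), std::ios::in | std::ios::binary);
    if (!file) {
        throw std::runtime_error("Cannot open file: " + filename);
    }
    // Iterating over the stream buffer copies every byte unchanged,
    // including whitespace and embedded NUL bytes.
    return std::string(std::istreambuf_iterator<char>(file),
                       std::istreambuf_iterator<char>());
}
```

With a helper like this, the output buffer handed to Riconv would have to grow with the input. For windows-1252 specifically, every character lies in the Basic Multilingual Plane, so three output bytes per input byte should be a safe upper bound; other source encodings would need their own estimate.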

The function has also been implemented in a bare-bones C++ version on top of the C API of R (using the helper definitions in the tidyCpp package) in package dang as function readAndConvert().

Postscriptum: An astute reader pointed out that while iconv appears to be platform-independent, it is not. Results on Windows can differ from results on Linux. The only consistent cross-platform encoding conversions are offered by the stringi library (as used by package stringr), but that one is also known to be a pain to install (as it is not tiny).
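To make concrete what the conversion of the example file involves, and where implementations have room to disagree, here is a small hand-written sketch that converts windows-1252 to UTF-8 without iconv. It is an illustration only, not what R or iconv do internally, and the function names are made up for this example. Bytes below 0x80 and from 0xA0 upwards map to the code point of the same value, while the range 0x80 to 0x9F needs a lookup table. Five bytes in that range are unassigned in windows-1252; mapping them to U+FFFD is an assumption made here, and it is the kind of corner case where converters may behave differently.

```cpp
#include <cstddef>
#include <string>

// Append code point cp as UTF-8. Assumes cp <= 0xFFFF, which holds for
// every character that windows-1252 can represent.
static void append_utf8(std::string& out, unsigned int cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical stand-alone converter for windows-1252 input.
std::string cp1252_to_utf8(const std::string& in) {
    // Code points for bytes 0x80..0x9F. The entries 0xFFFD mark bytes that
    // are unassigned in windows-1252 (0x81, 0x8D, 0x8F, 0x90, 0x9D); using
    // the replacement character for them is an assumption of this sketch.
    static const unsigned int high[32] = {
        0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
        0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178
    };
    std::string out;
    out.reserve(in.size() * 3);   // at most three UTF-8 bytes per input byte
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        unsigned int cp = (b >= 0x80 && b <= 0x9F) ? high[b - 0x80] : b;
        append_utf8(out, cp);
    }
    return out;
}
```

In the sample text, the typographic apostrophe (byte 0x92) and the ligature œ (byte 0x9C) both fall into the table-driven range, whereas é (byte 0xE9) maps directly to U+00E9.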

tags: basics 
