Converting Ruby String to wchar_t* in C Extension

I’m struggling with how to convert a Ruby string to wchar_t* in a C Extension.

I need to handle UTF-8 file paths within a C Extension. I get the path from Ruby as a UTF-8 string, but it then needs to be converted to wchar_t* so I can pass it as an argument to a third-party C API.

Does anyone have any experience with this? I’ve been researching for a while now and am still coming up empty. I need a solution that works for both Windows and OSX.

So far, I have been using the function StringValueCStr(VALUE) to convert a Ruby object to a char* string, but is there something similar I can use to get a wchar_t* from the VALUE?
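Roughly what the extension looks like right now (simplified, with a made-up ThirdPartyOpen standing in for the real third-party call):

#include <ruby.h>

// Made-up stand-in for the real third-party API, which takes a wide path.
int ThirdPartyOpen(const wchar_t* path);

static VALUE my_open(VALUE self, VALUE rb_path)
{
  // StringValueCStr gives me the path as a NUL-terminated char* (UTF-8) ...
  char* utf8_path = StringValueCStr(rb_path);

  // ... but the third-party call wants a wchar_t*, and this is where I'm stuck.
  // ThirdPartyOpen(utf8_path);  // wrong type

  return Qnil;
}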

Thanks!

I believe that wchar_t is equivalent to UTF-16 … perhaps the Ruby Encoding::Converter class would help.
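From the C side of the extension, that conversion could look something like this (an untested sketch; it reinterprets the converted bytes as wchar_t, which only holds up where wchar_t is 16 bits, i.e. Windows):

#include <ruby.h>
#include <ruby/encoding.h>

#include <string>

// Untested sketch: let Ruby do the UTF-8 -> UTF-16LE conversion via its
// encoding machinery, then copy the resulting bytes into a std::wstring.
// Assumes sizeof(wchar_t) == 2.
static std::wstring ruby_string_to_wide(VALUE str)
{
  VALUE utf16 = rb_str_export_to_enc(str, rb_enc_find("UTF-16LE"));

  const wchar_t* data =
      reinterpret_cast<const wchar_t*>(RSTRING_PTR(utf16));
  long units = RSTRING_LEN(utf16) / static_cast<long>(sizeof(wchar_t));

  return std::wstring(data, data + units);
}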

Or you could manually expand the UTF-8 (such as [J][I][M]) into UTF-16 ([J]0[I]0[M]0) by adding extra zeros (nulls) after each UTF-8 character.

[EDITED: My answer is wildly inaccurate … for an explanation from someone who knows what they’re talking about, see @tt_su’s replies below]

You need to take the char* and convert it to wchar_t* yourself. I assume you also want to convert to UTF-16 (or something else compatible with the Win32 API?), because simply copying a char* into a wchar_t* buffer isn't going to change the encoding. I've never heard of anyone storing UTF-8 in wchar_t, so I assume a change of encoding is what you're after.

Here is what I use to convert from UTF-8 to UTF-16LE for my own extensions:

UtfString.h:

#pragma once

#include <string>


namespace EvilSoftwareEmpire {


// UTF-8
typedef std::string StringUtf8;

// UTF-16LE
//typedef std::u16string StringUtf16;
typedef std::wstring StringUtf16;


StringUtf16 ConvertUtf8ToUtf16(const StringUtf8& string);
StringUtf8 ConvertUtf16ToUtf8(const StringUtf16& string);


} // namespace EvilSoftwareEmpire

UtfString.cpp:

#include "UtfString.h"

#include <codecvt>
#include <locale>


namespace EvilSoftwareEmpire {


StringUtf16 ConvertUtf8ToUtf16(const StringUtf8& string)
{
  // http://codesnipers.com/?q=node/80
  // http://stackoverflow.com/a/4614838/486990
  // http://stackoverflow.com/a/14809553/486990
  //
  // http://stackoverflow.com/a/7235204/486990
  // The docs seem to indicate the default endian is BE, while Windows uses
  // LE - but the strings from this function seem to work with the Win32 API.
  std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
  return convert.from_bytes(string);
}


StringUtf8 ConvertUtf16ToUtf8(const StringUtf16& string)
{
  // http://stackoverflow.com/a/19009269/486990
  std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
  return convert.to_bytes(string);
}


} // namespace EvilSoftwareEmpire

I use std::string and std::wstring - but as you know, those are just the standard library wrappers on top of char* and wchar_t* (c_str() gets you the raw pointer back).
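To tie this back to your StringValueCStr question, the glue inside an extension method could look roughly like this (untested sketch, reusing the made-up ThirdPartyOpen placeholder from your post):

#include <ruby.h>

#include "UtfString.h"

// Placeholder for the real third-party call that expects a wide-character path.
int ThirdPartyOpen(const wchar_t* path);

static VALUE my_open(VALUE self, VALUE rb_path)
{
  // UTF-8 bytes from the Ruby string ...
  EvilSoftwareEmpire::StringUtf8 utf8(StringValueCStr(rb_path));

  // ... converted to UTF-16; c_str() hands the API its wchar_t*.
  EvilSoftwareEmpire::StringUtf16 wide =
      EvilSoftwareEmpire::ConvertUtf8ToUtf16(utf8);

  return INT2NUM(ThirdPartyOpen(wide.c_str()));
}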


I didn’t notice this comment previously. I want to follow up on this:

char and wchar_t do not carry any meaning of text encoding. char is an 8-bit type, and wchar_t is a 16-bit type on Windows (on most other platforms, OS X included, it is 32-bit). Either can hold text in any number of encodings.

However, if you have US-ASCII you don't use wchar_t, because US-ASCII is 7-bit and a 16-bit type would just waste bits.

UTF-8 and UTF-16 are both variable-length encodings, where a single character can be represented by a varying number of code units.

UTF-8 uses 1 to 4 bytes per character. Because of that you normally use char*, so the lower ranges don't carry excess bits. (The 1-byte UTF-8 characters match the US-ASCII characters.)

UTF-16 uses 2 or 4 bytes per character - so on Windows you typically use wchar_t* for this. But note that one wchar_t does not necessarily represent a whole character (or code point), so you cannot get the character count just by counting bytes or code units.
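A quick way to see this (a small standalone C++ snippet, nothing Ruby-specific; U+1F600 is an emoji outside the BMP):

#include <iostream>
#include <string>

int main()
{
  // U+1F600 encoded by hand as UTF-8: 4 bytes ...
  std::string utf8 = "\xF0\x9F\x98\x80";

  // ... and as UTF-16, where it becomes a surrogate pair: 2 code units.
  std::u16string utf16 = u"\U0001F600";

  std::cout << utf8.size()  << "\n";  // prints 4 (bytes)
  std::cout << utf16.size() << "\n";  // prints 2 (code units) - yet it is
                                      // still only one code point/character
  return 0;
}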

UTF-32 is 4 bytes per character - so this one is fixed length. But I don't know of much that uses this encoding. For Latin-based languages it wastes a lot of bytes; for other languages the difference is smaller, but I still think UTF-32 is rather rare.

That being said, Windows was one of the first to implement Unicode, originally in the form of UCS-2. I think this was a fixed 16 bits per character (so wchar_t fit well for that). These days the Win32 API uses UTF-16, which can be seen as a superset of UCS-2.


Thanks so much for the insight, guys!
I will be digging into this more later this week and will update you on my results.

I have used this to interact with the Win32 file system. Let me know if you run into issues.
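A path check, for example, ends up looking something like this (a simplified, Windows-only sketch; PathExists is just an illustrative name):

#include <windows.h>

#include "UtfString.h"

// Take a UTF-8 path (e.g. straight from StringValueCStr), convert it to
// UTF-16 and hand it to the wide-character Win32 API.
bool PathExists(const EvilSoftwareEmpire::StringUtf8& utf8_path)
{
  EvilSoftwareEmpire::StringUtf16 wide =
      EvilSoftwareEmpire::ConvertUtf8ToUtf16(utf8_path);

  return GetFileAttributesW(wide.c_str()) != INVALID_FILE_ATTRIBUTES;
}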

Thanks Thom!

Converting the string to UTF-16 was what I needed.