I’m struggling with how to convert a Ruby string to wchar_t* in a C Extension.
I need to handle UTF-8 file paths within a C Extension. I get the path from Ruby as a UTF-8 string, but it then needs to be converted to wchar_t* to be passed as an argument to a third-party C API.
Does anyone have any experience with this? I’ve been researching for a while now and am still coming up empty. I need a solution that works for both Windows and OS X.
So far, I have been using the function StringValueCStr(VALUE) to convert a Ruby object to a char* string, but is there something similar I can use to convert the VALUE to a wchar_t*?
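For reference, here’s roughly where I’m at - a minimal sketch, where set_path, MyExtension, and Init_my_extension are made-up names for illustration:

#include <ruby.h>

// Hypothetical extension method: receives a UTF-8 path from Ruby.
static VALUE set_path(VALUE self, VALUE path)
{
  char* utf8_path = StringValueCStr(path); // NUL-terminated char*
  // ... but the third-party API wants wchar_t* - this is the part I'm missing.
  return Qnil;
}

void Init_my_extension(void)
{
  VALUE mod = rb_define_module("MyExtension");
  rb_define_module_function(mod, "set_path", set_path, 1);
}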
You need to take the char* and convert it to wchar_t* yourself. In doing this I assume you also want to convert to UTF-16 (or something else compatible with the Win32 API?), because simply copying the char* bytes into a wchar_t* isn’t going to change the encoding. I’ve never heard of anyone storing UTF-8 in wchar_t, so I assume you’re also looking at changing the encoding.
#include "UtfString.h"

#include <codecvt>
#include <locale>

namespace EvilSoftwareEmpire {

// StringUtf8 and StringUtf16 are aliases for std::string and std::wstring
// (see the note below the code). Note that std::codecvt_utf8_utf16 was
// deprecated in C++17, though it still works.
StringUtf16 ConvertUtf8ToUtf16(const StringUtf8& string)
{
  // http://codesnipers.com/?q=node/80
  // http://stackoverflow.com/a/4614838/486990
  // http://stackoverflow.com/a/14809553/486990
  //
  // http://stackoverflow.com/a/7235204/486990
  // The docs seem to indicate the default endian is BE, while Windows uses
  // LE - but the strings from this function seem to work with the Win32 API.
  std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
  return convert.from_bytes(string);
}

StringUtf8 ConvertUtf16ToUtf8(const StringUtf16& string)
{
  // http://stackoverflow.com/a/19009269/486990
  std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
  return convert.to_bytes(string);
}

} // namespace EvilSoftwareEmpire
I use std::string and std::wstring - but as you know, that’s just the stdlib wrapper on top of char* and wchar_t*.
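To connect this back to the original question, here’s a minimal sketch of how it all fits together in the extension method - set_path and ThirdPartyOpenW are made-up names, so substitute your actual method and API:

#include <ruby.h>
#include "UtfString.h"

extern "C" void ThirdPartyOpenW(const wchar_t* path); // hypothetical third-party API

static VALUE set_path(VALUE self, VALUE path)
{
  char* utf8_path = StringValueCStr(path); // UTF-8 bytes from Ruby
  EvilSoftwareEmpire::StringUtf16 wide_path =
      EvilSoftwareEmpire::ConvertUtf8ToUtf16(utf8_path);
  ThirdPartyOpenW(wide_path.c_str()); // const wchar_t*, UTF-16 encoded
  return Qnil;
}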
I didn’t notice this comment previously. I want to follow up on this:
char and wchar_t do not carry any meaning of text encoding. char is an 8-bit type, and wchar_t is 16 bits on Windows (32 bits on most Unix-like systems, including OS X). Either can represent any number of encodings.
However, if you have US-ASCII you don’t use wchar_t, because US-ASCII is 7-bit; using a 16-bit type is wasting bits.
UTF-8 and UTF-16 are both variable-width encodings, where each character is represented by a varying number of code units.
UTF-8 can be from 1 to 4 bytes per character. Because of that you normally use char*, so you can represent the lower ranges without excess bits. (The 1-byte UTF-8 characters match the US-ASCII characters.)
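To make the variable width concrete, here’s a small sketch - each literal below holds a single character, written as explicit UTF-8 byte escapes so the source file’s encoding doesn’t matter:

#include <cstdio>
#include <cstring>

int main()
{
  printf("%zu\n", strlen("A"));                // 1 byte  (US-ASCII range)
  printf("%zu\n", strlen("\xC3\xA9"));         // 2 bytes (U+00E9, e with acute)
  printf("%zu\n", strlen("\xE2\x82\xAC"));     // 3 bytes (U+20AC, euro sign)
  printf("%zu\n", strlen("\xF0\x9F\x98\x80")); // 4 bytes (U+1F600, outside the BMP)
  return 0;
}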
UTF-16 can be from 2 to 4 bytes (one or two 16-bit code units) - so you typically use wchar_t* for this on Windows. But note that one wchar_t does not always represent a full character (or code point), so you cannot get a string’s character count just by counting units.
UTF-32 is 4 bytes per character - so this one is of fixed length. But I don’t know what might use this encoding. For Latin-based languages it wastes a lot of bytes; for other languages the difference might be smaller. But still, I think UTF-32 is rather rare.
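To show the difference in code units, here’s a small sketch using the C++11 u"" and U"" literals with a character outside the BMP (U+1F600):

#include <cstdio>
#include <string>

int main()
{
  // The same code point (U+1F600) in three encodings:
  std::string    utf8  = "\xF0\x9F\x98\x80"; // 4 bytes in UTF-8
  std::u16string utf16 = u"\U0001F600";      // 2 units - a surrogate pair
  std::u32string utf32 = U"\U0001F600";      // 1 unit  - fixed width
  printf("%zu %zu %zu\n", utf8.size(), utf16.size(), utf32.size()); // prints: 4 2 1
  return 0;
}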
That being said, Windows was one of the first platforms to implement Unicode, and originally it was in the form of UCS-2. I think this was a fixed 16 bits per character (so wchar_t fit well for that). These days the Win32 API uses UTF-16, which can be seen as a superset of UCS-2.