C++ - Unicode conversions
How to convert strings in C++ between Unicode encodings (UTF-8, UTF-16 and UTF-32)? Here are some code snippets showing the easiest ways I have found so far.
C++11 is required. You have to include:
#include <codecvt>
#include <locale>
#include <string>
It correctly handles special cases such as:
- non-BMP characters (2 code units in UTF-16)
- large code point values
First, a little helper function:
bool isLittleEndianSystem() {
    // Detects whether the low-order byte of a multi-byte value is stored first in memory.
    char16_t test = 0x0102;
    return (reinterpret_cast<char *>(&test))[0] == 0x02;
}
Between UTF-8 and UTF-16
std::u16string utf8_to_utf16(const std::string &s) {
    static bool littleEndian = isLittleEndianSystem();
    if (littleEndian) {
        std::wstring_convert<std::codecvt_utf8_utf16<char16_t, 0x10ffffU, std::codecvt_mode::little_endian>, char16_t> convert_le;
        return convert_le.from_bytes(s);
    } else {
        std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert_be;
        return convert_be.from_bytes(s);
    }
}

std::string utf16_to_utf8(const std::u16string &s) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.to_bytes(s);
}
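For example, a quick round trip with a non-BMP character (a minimal usage sketch, assuming the two functions above plus #include <cassert>; the byte values are taken from the test data at the end of this post):

// U+1D4C3 is encoded as 4 bytes in UTF-8 and as a surrogate pair (2 code units) in UTF-16.
const std::string utf8 = "\xF0\x9D\x93\x83";
const std::u16string utf16 = utf8_to_utf16(utf8);
assert(utf16 == u"\xD835\xDCC3");
assert(utf16_to_utf8(utf16) == utf8);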
Between UTF-8 and UTF-32
std::u32string utf8_to_utf32(const std::string &s) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
    return convert.from_bytes(s);
}

std::string utf32_to_utf8(const std::u32string &s) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
    return convert.to_bytes(s);
}
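A similar usage sketch (again assuming the functions above plus #include <cassert>): in UTF-32 every code point is exactly one code unit, so the length of the resulting u32string equals the number of code points:

// "\xE2\x84\xA6" is U+2126 (3 bytes in UTF-8, a single UTF-32 code unit).
const std::string utf8 = "\xE2\x84\xA6";
const std::u32string utf32 = utf8_to_utf32(utf8);
assert(utf32.size() == 1 && utf32[0] == U'\x2126');
assert(utf32_to_utf8(utf32) == utf8);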
Between UTF-16 and UTF-32
There are probably better ways to do this type of conversion (e.g. using UTF-8 as an intermediate encoding and chaining two of the previous code snippets - a sketch of that approach follows the two functions below).
std::u32string utf16_to_utf32(const std::u16string &s) {
    // Serialize each UTF-16 code unit as two bytes in big endian order
    // (the default byte order expected by codecvt_utf16).
    std::string bytes;
    bytes.reserve(s.size() * 2);
    for (const char16_t c : s) {
        bytes.push_back(static_cast<char>(c / 256));
        bytes.push_back(static_cast<char>(c % 256));
    }
    std::wstring_convert<std::codecvt_utf16<char32_t>, char32_t> convert;
    return convert.from_bytes(bytes);
}

std::u16string utf32_to_utf16(const std::u32string &s) {
    std::wstring_convert<std::codecvt_utf16<char32_t>, char32_t> convert;
    std::string bytes = convert.to_bytes(s);
    // Reassemble the big endian byte pairs into UTF-16 code units.
    std::u16string result;
    result.reserve(bytes.size() / 2);
    for (size_t i = 0; i < bytes.size(); i += 2) {
        result.push_back(static_cast<char16_t>(static_cast<unsigned char>(bytes[i]) * 256 + static_cast<unsigned char>(bytes[i + 1])));
    }
    return result;
}
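The alternative mentioned above - chaining two of the previous snippets with UTF-8 as the intermediate encoding - could look like this (a sketch; the _via_utf8 names are only for illustration, not part of the snippets above):

std::u32string utf16_to_utf32_via_utf8(const std::u16string &s) {
    // Avoids the manual byte packing at the cost of an extra conversion.
    return utf8_to_utf32(utf16_to_utf8(s));
}

std::u16string utf32_to_utf16_via_utf8(const std::u32string &s) {
    return utf8_to_utf16(utf32_to_utf8(s));
}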
Little and big endian
According to the C++ standard, for codecvt_utf8 and codecvt_utf8_utf16, the endianness specification should not influence the behavior.
However, in utf8_to_utf16, when compiling with -stdlib=libstdc++ (libstdc++ version 5.2.0), I had to specify how the bytes are stored in the underlying memory of the char16_t type:
- default (the third template argument of codecvt_x not specified) - big endian (there will be e.g. 0x00 0x61 in memory for the value 97)
- std::codecvt_mode::little_endian - little endian (there will be e.g. 0x61 0x00 in memory for the value 97)
So, on little endian systems, we have to specify std::codecvt_mode::little_endian for the char16_t value to really be 97 in that case.
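A quick way to check which byte layout you actually got is to look at the raw bytes of the first code unit (a small diagnostic sketch, assuming #include <cstdio>; the function name is only for illustration):

void dumpFirstCodeUnit(const std::u16string &s) {
    if (s.empty())
        return;
    const unsigned char *bytes = reinterpret_cast<const unsigned char *>(s.data());
    // For the value 97 this prints "61 00" with a little endian layout
    // and "00 61" with a big endian one.
    std::printf("%02x %02x\n", bytes[0], bytes[1]);
}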
When using -stdlib=libc++ (libc++ version 3.6.2), the endianness specification was ignored and did not influence the output in any way (in accordance with the standard). So the isLittleEndianSystem() distinction and the use of std::codecvt_mode::little_endian may be here only temporarily, to work around the current -stdlib=libstdc++ behavior.
Note that in utf16_to_utf32 and utf32_to_utf16 I set and read the bytes explicitly (as big endian, which is the default), so no little/big endian distinction based on the system had to be made there.
Why not wchar_t and std::wstring?
The types char (the element type of std::string), char16_t (the element type of std::u16string) and char32_t (the element type of std::u32string) have fixed sizes that correspond to the code unit size of the encoding we want to work with.
The wchar_t type (the element type of std::wstring) is platform specific, e.g.:
- 16 bits on Windows - used to encode UTF-16 (or, alternatively, UCS-2)
- 32 bits on Linux - used to encode UTF-32 (the same as UCS-4)
So wchar_t is not of much value when we are discussing specific Unicode encodings. If you want to use it, see e.g. codecvt_utf8 (the table at the bottom - rows UTF-16, UCS2/UCS4).
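A compile-time illustration of the difference (a sketch; the commented-out wchar_t checks describe typical platforms, not anything guaranteed by the standard):

#include <climits>

// char16_t and char32_t are guaranteed to hold at least 16 and 32 bits,
// so they map directly to UTF-16 and UTF-32 code units.
static_assert(sizeof(char16_t) * CHAR_BIT >= 16, "char16_t fits a UTF-16 code unit");
static_assert(sizeof(char32_t) * CHAR_BIT >= 32, "char32_t fits a UTF-32 code unit");

// No such portable statement can be made for wchar_t:
// static_assert(sizeof(wchar_t) == 2, "..."); // typically true on Windows
// static_assert(sizeof(wchar_t) == 4, "..."); // typically true on Linux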
What about UCS-2?
Note that UCS-2 cannot represent all Unicode code points. But it has the advantage of being a fixed-length encoding (as opposed to UTF-16). If you want to use it despite this disadvantage, here are the conversion functions:
std::u16string utf8_to_ucs2(const std::string &s) {
    static bool littleEndian = isLittleEndianSystem();
    if (littleEndian) {
        std::wstring_convert<std::codecvt_utf8<char16_t, 0x10ffff, std::codecvt_mode::little_endian>, char16_t> convert_le;
        return convert_le.from_bytes(s);
    } else {
        std::wstring_convert<std::codecvt_utf8<char16_t>, char16_t> convert_be;
        return convert_be.from_bytes(s);
    }
}

std::string ucs2_to_utf8(const std::u16string &s) {
    std::wstring_convert<std::codecvt_utf8<char16_t>, char16_t> convert;
    return convert.to_bytes(s);
}

std::u16string utf16_to_ucs2(const std::u16string &s) {
    // Serialize the UTF-16 code units as big endian bytes, then let
    // codecvt_utf16<char16_t> convert them to UCS-2.
    std::string bytes;
    bytes.reserve(s.size() * 2);
    for (const char16_t c : s) {
        bytes.push_back(static_cast<char>(c / 256));
        bytes.push_back(static_cast<char>(c % 256));
    }
    std::wstring_convert<std::codecvt_utf16<char16_t>, char16_t> convert;
    return convert.from_bytes(bytes);
}

std::u16string ucs2_to_utf16(const std::u16string &s) {
    std::wstring_convert<std::codecvt_utf16<char16_t>, char16_t> convert;
    std::string bytes = convert.to_bytes(s);
    // Reassemble the big endian byte pairs into UTF-16 code units.
    std::u16string result;
    result.reserve(bytes.size() / 2);
    for (size_t i = 0; i < bytes.size(); i += 2) {
        result.push_back(static_cast<char16_t>(static_cast<unsigned char>(bytes[i]) * 256 + static_cast<unsigned char>(bytes[i + 1])));
    }
    return result;
}
Be aware that if a code point being converted cannot be represented in UCS-2, the conversion can:
- produce invalid data (currently happens with -stdlib=libstdc++)
- throw std::range_error (currently happens with -stdlib=libc++)
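For example, feeding the non-BMP string from the test data into utf8_to_ucs2 shows both possible outcomes (a sketch, assuming the functions above plus #include <iostream> and #include <stdexcept>; which branch you hit depends on the standard library, as described above):

// U+1D4C3 does not fit into a single UCS-2 code unit.
const std::string utf8 = "\xF0\x9D\x93\x83";
try {
    const std::u16string ucs2 = utf8_to_ucs2(utf8);
    std::cout << "no exception - possibly invalid data (" << ucs2.size() << " code units)" << std::endl;
} catch (const std::range_error &) {
    std::cout << "range_error thrown" << std::endl;
}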
Testing
Tested on Linux:
- g++ (little endian system)
- clang++ -stdlib=libstdc++ (little endian system)
- clang++ -stdlib=libc++ (little endian and big endian systems)
This code was used to test the conversion functions:
// The test code additionally needs:
#include <iostream>
#include <utility>
#include <vector>

struct TestString {
    TestString(std::string utf8, std::u16string utf16, std::u32string utf32)
        : utf8(std::move(utf8))
        , utf16(std::move(utf16))
        , utf32(std::move(utf32))
    {
    }

    std::string utf8;
    std::u16string utf16;
    std::u32string utf32;
};
void testOneString(const TestString &s) {
    if (utf8_to_utf16(s.utf8) != s.utf16)
        std::cout << "utf8_to_utf16" << std::endl;
    if (utf16_to_utf8(s.utf16) != s.utf8)
        std::cout << "utf16_to_utf8" << std::endl;
    if (utf8_to_utf32(s.utf8) != s.utf32)
        std::cout << "utf8_to_utf32" << std::endl;
    if (utf32_to_utf8(s.utf32) != s.utf8)
        std::cout << "utf32_to_utf8" << std::endl;
    if (utf16_to_utf32(s.utf16) != s.utf32)
        std::cout << "utf16_to_utf32" << std::endl;
    if (utf32_to_utf16(s.utf32) != s.utf16)
        std::cout << "utf32_to_utf16" << std::endl;

    try {
        if (utf8_to_ucs2(s.utf8) != s.utf16)
            std::cout << "utf8_to_ucs2" << std::endl;
    } catch (const std::range_error &) {
        std::cout << "utf8_to_ucs2 - range error" << std::endl;
    }
    try {
        if (ucs2_to_utf8(s.utf16) != s.utf8)
            std::cout << "ucs2_to_utf8" << std::endl;
    } catch (const std::range_error &) {
        std::cout << "ucs2_to_utf8 - range error" << std::endl;
    }
    try {
        if (utf16_to_ucs2(s.utf16) != s.utf16)
            std::cout << "utf16_to_ucs2" << std::endl;
    } catch (const std::range_error &) {
        std::cout << "utf16_to_ucs2 - range error" << std::endl;
    }
    try {
        if (ucs2_to_utf16(s.utf16) != s.utf16)
            std::cout << "ucs2_to_utf16" << std::endl;
    } catch (const std::range_error &) {
        std::cout << "ucs2_to_utf16 - range error" << std::endl;
    }
}
int main() {
    std::vector<TestString> strings {
        TestString("\x61", u"\x0061", U"\x00000061"),
        TestString("\xEF\xBD\x81", u"\xFF41", U"\x0000FF41"),
        TestString("\xC4\x8D", u"\x010D", U"\x010D"),
        TestString("\x63\xCC\x8C", u"\x0063\x030C", U"\x00000063\x0000030C"),
        TestString("\xC4\xB3", u"\x0133", U"\x00000133"),
        TestString("\x69\x6A", u"\x0069\x006A", U"\x00000069\x0000006A"),
        TestString("\xCE\xA9", u"\x03A9", U"\x000003A9"),
        TestString("\xE2\x84\xA6", u"\x2126", U"\x00002126"),
        TestString("\xF0\x9D\x93\x83", u"\xD835\xDCC3", U"\x0001D4C3")
    };

    for (const TestString &string : strings) {
        std::cout << string.utf8 << std::endl;
        testOneString(string);
    }
}
Some special cases are tested among these strings, all of them described in this previous post.