How do you convert strings in C++ between Unicode encodings (UTF-8, UTF-16 and UTF-32)? Here are some code snippets showing the easiest ways I have found so far.
C++11 is required. You have to include:
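The original include list is not shown here; based on the facilities the snippets in this post use (std::wstring_convert and the std::codecvt_* facets), the headers plausibly needed are:

```cpp
#include <codecvt>  // std::codecvt_utf8, std::codecvt_utf8_utf16, std::codecvt_utf16
#include <locale>   // std::wstring_convert
#include <string>   // std::string, std::u16string, std::u32string
```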
It correctly converts special cases such as:
non-BMP characters (encoded as 2 code units in UTF-16)
large code point values
First, a little helper function:
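The original helper is not reproduced here; below is a minimal sketch of what an isLittleEndianSystem() check (the name used later in this post) might look like. It inspects the byte order of a char16_t value at runtime:

```cpp
#include <cstring>

// Returns true if this system stores the least significant byte
// of a multi-byte value first (little endian).
bool isLittleEndianSystem()
{
    char16_t value = 0x0102;
    char bytes[sizeof(value)];
    std::memcpy(bytes, &value, sizeof(value));
    return bytes[0] == 0x02;  // low byte stored first => little endian
}
```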
Between UTF-8 and UTF-16
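The original snippet is missing here; based on the description in this post (std::wstring_convert with std::codecvt_utf8_utf16, plus the std::codecvt_mode::little_endian workaround explained in the endianness section), it plausibly looked something like this sketch. The name utf16_to_utf8 is my assumption for the reverse direction:

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Local copy of the endianness helper so this snippet compiles on its own.
static bool isLittleEndianSystem()
{
    char16_t value = 0x0102;
    return *reinterpret_cast<char*>(&value) == 0x02;
}

std::u16string utf8_to_utf16(const std::string& s)
{
    if (isLittleEndianSystem()) {
        // Workaround needed with the libstdc++ version discussed in this post.
        std::wstring_convert<std::codecvt_utf8_utf16<char16_t, 0x10ffff,
            std::codecvt_mode::little_endian>, char16_t> converter;
        return converter.from_bytes(s);
    }
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
    return converter.from_bytes(s);
}

std::string utf16_to_utf8(const std::u16string& s)
{
    if (isLittleEndianSystem()) {
        std::wstring_convert<std::codecvt_utf8_utf16<char16_t, 0x10ffff,
            std::codecvt_mode::little_endian>, char16_t> converter;
        return converter.to_bytes(s);
    }
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
    return converter.to_bytes(s);
}
```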
Between UTF-8 and UTF-32
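The original snippet is missing here as well; a sketch using std::codecvt_utf8 with a char32_t element type (the function names utf8_to_utf32 and utf32_to_utf8 are my assumptions) could look like this:

```cpp
#include <codecvt>
#include <locale>
#include <string>

std::u32string utf8_to_utf32(const std::string& s)
{
    // codecvt_utf8<char32_t> converts between UTF-8 and UTF-32/UCS-4.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    return converter.from_bytes(s);
}

std::string utf32_to_utf8(const std::u32string& s)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    return converter.to_bytes(s);
}
```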
Between UTF-16 and UTF-32
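The original code is not included here; based on the endianness notes later in this post (the bytes of each UTF-16 code unit are set and read explicitly as big endian, the default byte order of std::codecvt_utf16), a sketch of utf16_to_utf32 and utf32_to_utf16 might look like this:

```cpp
#include <codecvt>
#include <cstddef>
#include <locale>
#include <string>

std::u32string utf16_to_utf32(const std::u16string& s)
{
    // Store each UTF-16 code unit explicitly as big endian
    // (high byte first), matching codecvt_utf16's default.
    std::string bytes;
    bytes.reserve(s.size() * 2);
    for (char16_t c : s) {
        bytes.push_back(static_cast<char>(c / 256));
        bytes.push_back(static_cast<char>(c % 256));
    }
    std::wstring_convert<std::codecvt_utf16<char32_t>, char32_t> converter;
    return converter.from_bytes(bytes);
}

std::u16string utf32_to_utf16(const std::u32string& s)
{
    std::wstring_convert<std::codecvt_utf16<char32_t>, char32_t> converter;
    std::string bytes = converter.to_bytes(s);  // big-endian UTF-16 byte string
    std::u16string result;
    result.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2)
        result.push_back(static_cast<char16_t>(
            static_cast<unsigned char>(bytes[i]) * 256
            + static_cast<unsigned char>(bytes[i + 1])));
    return result;
}
```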
There are probably better ways to do this last type of conversion (e.g. using UTF-8 as an intermediate encoding and combining two of the previous code snippets).
Little and big endian
According to the C++ standard, for codecvt_utf8 and codecvt_utf8_utf16, the endianness specification should not influence the behavior.
However, in utf8_to_utf16, when compiling with -stdlib=libstdc++ (libstdc++ version 5.2.0), I had to specify how the bytes would be stored in the underlying memory of the char16_t type:
std::codecvt_mode::little_endian - little endian (there will be e.g. 0x00 0x61 in memory for value 97)
default (the third template argument of codecvt_x not specified) - big endian (there will be e.g. 0x61 0x00 in memory for value 97)
So, on little endian systems, we have to specify std::codecvt_mode::little_endian so that the char16_t value really is 97 in that case.
When using -stdlib=libc++ (libc++ version 3.6.2), the endianness specification was ignored and did not influence the output in any way (in accordance with the standard).
So the isLittleEndianSystem() distinction and the use of std::codecvt_mode::little_endian may be only a temporary workaround for the current -stdlib=libstdc++.
Note that in utf16_to_utf32 and utf32_to_utf16 I set and get the bytes explicitly (as big endian, which is the default), so no little/big endian distinction based on the system had to be made.
Why not wchar_t and std::wstring?
The types char (element type in std::string), char16_t (element type in std::u16string) and char32_t (element type in std::u32string) have fixed sizes that match the code unit size of the encoding we want to work with.
The wchar_t type (element type in std::wstring) is platform-specific, e.g.:
16 bit on Windows - used to encode UTF-16 (or, alternatively, UCS-2)
32 bit on Linux - used to encode UTF-32 (the same as UCS-4)
So, wchar_t does not have much value when we are discussing specific Unicode encodings. If you want to use it, see e.g. codecvt_utf8 (the table at the bottom - rows UTF-16, UCS2/UCS4).
What about UCS-2?
Note that UCS-2 is not able to represent all Unicode code points. But it has the advantage of being a fixed-length encoding (as opposed to UTF-16, which is variable length).
If you want to use it despite its disadvantage, here are the conversion functions:
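The original functions are not shown here; a sketch using std::codecvt_utf8 with a char16_t element type (which converts between UTF-8 and UCS-2 when the element type is 16-bit) might look like this. The names utf8_to_ucs2 and ucs2_to_utf8 are my assumptions:

```cpp
#include <codecvt>
#include <locale>
#include <string>

std::u16string utf8_to_ucs2(const std::string& s)
{
    // codecvt_utf8<char16_t> converts between UTF-8 and UCS-2 (BMP only).
    std::wstring_convert<std::codecvt_utf8<char16_t>, char16_t> converter;
    return converter.from_bytes(s);
}

std::string ucs2_to_utf8(const std::u16string& s)
{
    std::wstring_convert<std::codecvt_utf8<char16_t>, char16_t> converter;
    return converter.to_bytes(s);
}
```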
Be aware that if a code point being converted cannot be represented in UCS-2, the conversion can:
produce invalid data (what currently happens with -stdlib=libstdc++)
throw std::range_error (what currently happens with -stdlib=libc++)
Tested on Linux:
g++ (little endian system)
clang++ -stdlib=libstdc++ (little endian system)
clang++ -stdlib=libc++ (little endian and big endian system)
This code was used to test the conversion functions:
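The original test code is not included here; a minimal self-contained sketch in its spirit might look like the following. It re-declares minimal versions of the conversion functions (without the endianness workaround) so it compiles on its own, and round-trips a string containing a non-BMP character; the name testConversions is my assumption:

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Minimal versions of the conversion functions, so this test compiles on its own.
std::u16string utf8_to_utf16(const std::string& s)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> c;
    return c.from_bytes(s);
}

std::string utf16_to_utf8(const std::u16string& s)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> c;
    return c.to_bytes(s);
}

std::u32string utf8_to_utf32(const std::string& s)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> c;
    return c.from_bytes(s);
}

std::string utf32_to_utf8(const std::u32string& s)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> c;
    return c.to_bytes(s);
}

// Returns true if all conversions behave as expected.
bool testConversions()
{
    // 'a', U+00E9 (2 UTF-8 bytes), U+1F600 (non-BMP, surrogate pair in UTF-16)
    const std::string utf8 = "a\xC3\xA9\xF0\x9F\x98\x80";

    return utf16_to_utf8(utf8_to_utf16(utf8)) == utf8  // round trips reproduce the input
        && utf32_to_utf8(utf8_to_utf32(utf8)) == utf8
        && utf8_to_utf16(utf8).size() == 4   // non-BMP char takes 2 UTF-16 code units
        && utf8_to_utf32(utf8).size() == 3;  // one UTF-32 code unit per character
}
```

Call testConversions() from main() and check its result; it exercises the round trips as well as the expected code unit counts.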