This post is mostly inspired by facts summarised at utf8everywhere.org.
Unicode is somewhat complicated. Two things in particular are commonly misunderstood:
- there is no simple one-to-one mapping between graphemes (the user-perceived characters) and code points (the “Unicode” characters)
- UTF-16 is not a fixed width encoding - some code points are encoded into 2 and some into 4 bytes
There are some examples of this:
|Example||Graphemes||Code points||UTF-8 bytes||UTF-16 bytes||UTF-32 bytes|
|1. Simple||a||61||61||00 61||00 00 00 61|
|2. Fullwidth||ａ||FF41||EF BD 81||FF 41||00 00 FF 41|
|3. Diacritic||č||10D||C4 8D||01 0D||00 00 01 0D|
|4. Diacritic - separate||č||63 30C||63 CC 8C||00 63 03 0C||00 00 00 63 00 00 03 0C|
|5. Ligature||ĳ||133||C4 B3||01 33||00 00 01 33|
|6. Separate||ij||69 6A||69 6A||00 69 00 6A||00 00 00 69 00 00 00 6A|
|7. Same grapheme||Ω||3A9||CE A9||03 A9||00 00 03 A9|
|8. Same grapheme||Ω||2126||E2 84 A6||21 26||00 00 21 26|
|9. Non-BMP||𝓃||1D4C3||F0 9D 93 83||D8 35 DC C3||00 01 D4 C3|
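The bytes in the table can be reproduced with a few lines of Python. This is only a small sketch: the `samples` dictionary and the `hex_bytes` helper are names I made up for illustration, and the labels simply mirror the table rows.

```python
# Encode each sample from the table and print its bytes in hexadecimal.
samples = {
    "1. Simple": "a",
    "2. Fullwidth": "\uff41",
    "3. Diacritic": "\u010d",
    "4. Diacritic - separate": "c\u030c",
    "5. Ligature": "\u0133",
    "6. Separate": "ij",
    "7. Same grapheme": "\u03a9",
    "8. Same grapheme": "\u2126",
    "9. Non-BMP": "\U0001d4c3",
}


def hex_bytes(text: str, encoding: str) -> str:
    """Return the encoded bytes as space-separated uppercase hex pairs."""
    return text.encode(encoding).hex(" ").upper()


for label, text in samples.items():
    # One hex number per code point, as in the "Code points" column.
    code_points = " ".join(f"{ord(ch):X}" for ch in text)
    print(
        label,
        code_points,
        hex_bytes(text, "utf-8"),
        hex_bytes(text, "utf-16-be"),  # big endian, like the table
        hex_bytes(text, "utf-32-be"),
        sep=" | ",
    )
```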
I looked up the details of the individual characters on unicode-table.com. The code points in the table link to the information about the character on that site.
Each code unit is underlined separately. Each code point is encoded as:
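- UTF-8 - 1-byte code units, 1 to 4 of them (the original design allowed up to 6, but RFC 3629 limits valid UTF-8 to 4)
- UTF-16 - 2-byte code units, 1 or 2 of them
- UTF-32 - 4-byte code units, always exactly 1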
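To check how many code units a given text needs, we can divide the encoded length by the size of one code unit. A minimal sketch, where `code_units` is a hypothetical helper name:

```python
def code_units(text: str) -> dict[str, int]:
    """Count the code units needed to encode text in each encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),           # 1 byte per code unit
        "utf-16": len(text.encode("utf-16-be")) // 2,  # 2 bytes per code unit
        "utf-32": len(text.encode("utf-32-be")) // 4,  # 4 bytes per code unit
    }


print(code_units("a"))           # ASCII letter
print(code_units("\uff41"))      # fullwidth a, still inside the BMP
print(code_units("\U0001d4c3"))  # non-BMP character
print(code_units(chr(0x10FFFF))) # the largest valid code point
```

The explicit `-be` variants are used so that Python does not prepend a byte order mark, which would otherwise be counted as an extra code unit.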
UTF-16 and UTF-32 can be encoded as big or little endian. The table shows only the big endian variant (the most significant byte first).
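The difference between the two byte orders is easy to see in Python. The sketch below encodes the character č (U+010D) both ways and then lets a byte order mark (BOM) tell the decoder which order was used. The variable names are mine.

```python
import codecs

text = "\u010d"  # č

big_endian = text.encode("utf-16-be")     # most significant byte first
little_endian = text.encode("utf-16-le")  # least significant byte first
print(big_endian.hex(" "), little_endian.hex(" "))

# With a BOM in front, the generic "utf-16" decoder detects the byte order
# itself and drops the BOM from the decoded text.
from_be = (codecs.BOM_UTF16_BE + big_endian).decode("utf-16")
from_le = (codecs.BOM_UTF16_LE + little_endian).decode("utf-16")
print(from_be == text, from_le == text)
```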
All numbers are shown in hexadecimal form, including the code points, which are conventionally written in hexadecimal anyway (for example U+010D).
1. Simple
One grapheme is one code point, encoded as 2 bytes in UTF-16, as we would expect.
2. Fullwidth
The same grapheme, but it looks slightly different (it is wider). Fullwidth forms are used next to East Asian characters so that Latin letters occupy the same width in fixed-width fonts.
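Python's `unicodedata` module exposes the width property of a character, and compatibility normalization (NFKC) folds the fullwidth letter back to the plain one. A short sketch, with variable names chosen for illustration:

```python
import unicodedata

fullwidth_a = "\uff41"

print(unicodedata.name(fullwidth_a))
# East Asian Width property: "F" means fullwidth, "Na" means narrow.
print(unicodedata.east_asian_width(fullwidth_a))
print(unicodedata.east_asian_width("a"))

# NFKC replaces compatibility characters with their plain equivalents.
folded = unicodedata.normalize("NFKC", fullwidth_a)
print(folded == "a")
```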
3. Diacritic
The same as example 1, but for a less common and more complicated grapheme (with a caron). We would expect this, too.
4. Diacritic - separate
Or we can use 2 separate code points to encode a character with a diacritic:
- one for the base character (c) - blue color
- one for the diacritical mark (the caron) - red color
Note that this is the same grapheme (as users perceive it) as in example 3, but encoded in a completely different way. This can cause problems when comparing texts - the user perceives them as equal, but a comparison of code points reports that the texts are different. A process called normalization converts the code points to the same form, so that they can be compared correctly.
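Here is how the comparison problem and its fix look in Python. This is a sketch: `same_text` is a hypothetical helper, and choosing NFC for the comparison is my assumption (NFD would work just as well, as long as both sides use the same form).

```python
import unicodedata

composed = "\u010d"     # č as a single code point
decomposed = "c\u030c"  # c followed by the combining caron

# A plain comparison works on code points, so the two differ.
print(composed == decomposed)
print(len(composed), len(decomposed))


def same_text(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the NFC form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_text(composed, decomposed))

# NFC composes, NFD decomposes; each maps one spelling onto the other.
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)
```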
5. Ligature
These are 2 graphemes, which correspond to only one code point.
6. Separate
Of course, these graphemes (i and j) can be encoded separately, too - as two code points (blue and red). The separate characters look almost the same as the ligature.
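The Unicode character database records how the ligature relates to the two separate letters. The sketch below reads that record with `unicodedata.decomposition`. The `<compat>` tag marks a compatibility mapping, which only the "K" normalization forms apply, so canonical normalization (NFC) leaves the ligature alone.

```python
import unicodedata

ligature = "\u0133"  # ĳ as one code point
separate = "ij"      # two code points

print(len(ligature), len(separate))

# The database entry for the ligature: a compatibility mapping to i + j.
print(unicodedata.decomposition(ligature))
# For comparison, č has a canonical mapping (no tag) to c + caron.
print(unicodedata.decomposition("\u010d"))

print(unicodedata.normalize("NFC", ligature) == ligature)   # unchanged
print(unicodedata.normalize("NFKC", ligature) == separate)  # split into i + j
```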
7./8. Same grapheme
These look exactly the same, but are different code points, with different meanings:
- Greek letter Omega
- Ohm sign
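Since the two characters cannot be told apart by eye, it helps to ask the Unicode database for their names. The sketch also shows that NFC normalization replaces the Ohm sign with the Greek letter, because the Ohm sign is defined as canonically equivalent to it.

```python
import unicodedata

omega = "\u03a9"
ohm = "\u2126"

print(unicodedata.name(omega))
print(unicodedata.name(ohm))

# Different code points, so a plain comparison fails.
print(omega == ohm)

# After normalization both become the Greek letter.
print(unicodedata.normalize("NFC", ohm) == omega)
```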
9. Non-BMP
This is the case where one code point is encoded as 4 UTF-16 bytes (instead of 2), that is, as a pair of 2-byte code units called a surrogate pair.
All characters which do not belong to the Basic Multilingual Plane (BMP) are encoded like this.
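The two UTF-16 code units can be computed by hand: subtract 10000 (hex) from the code point, then split the remaining 20 bits into two halves of 10 bits each. A minimal sketch, where `to_surrogates` is a name I made up:

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a non-BMP code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP use surrogate pairs")
    offset = code_point - 0x10000     # leaves a 20-bit number
    high = 0xD800 + (offset >> 10)    # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # lower 10 bits
    return high, low


high, low = to_surrogates(0x1D4C3)
print(f"{high:04X} {low:04X}")

# Cross-check against Python's own UTF-16 encoder.
encoded = "\U0001d4c3".encode("utf-16-be")
print(encoded.hex(" ").upper())
```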
You can find other commonly used non-BMP characters in this post at Stack Overflow - see the “trans-BMP code points” part.