Unicode examples

This post is mostly inspired by facts summarised at utf8everywhere.org.

Unicode is somewhat complicated. Two points in particular are commonly misunderstood:

there is no simple direct mapping between graphemes (the user-perceived characters) and code points (the “Unicode” characters)
UTF-16 is not a fixed width encoding - some code points are encoded into 2 and some into 4 bytes

Here are some examples:

	Graphemes	Code points	UTF-8 bytes	UTF-16 bytes	UTF-32 bytes
1. Simple	a	61	61	00 61	00 00 00 61
2. Fullwidth	ａ	FF41	EF BD 81	FF 41	00 00 FF 41
3. Diacritic	č	10D	C4 8D	01 0D	00 00 01 0D
4. Diacritic - separate	č	63 30C	63 CC 8C	00 63 03 0C	00 00 00 63 00 00 03 0C
5. Ligature	ĳ	133	C4 B3	01 33	00 00 01 33
6. Separate	ij	69 6A	69 6A	00 69 00 6A	00 00 00 69 00 00 00 6A
7. Same grapheme	Ω	3A9	CE A9	03 A9	00 00 03 A9
8. Same grapheme	Ω	2126	E2 84 A6	21 26	00 00 21 26
9. Non-BMP	𝓃	1D4C3	F0 9D 93 83	D8 35 DC C3	00 01 D4 C3
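
The rows of the table can be reproduced with a few lines of Python. This is only a sketch: the helper name table_row is made up for illustration, and it assumes Python 3.8 or newer (for the separator argument of bytes.hex).

```python
def table_row(text):
    """Return the hexadecimal columns of the table for one string."""
    return {
        # One hexadecimal number per code point in the string.
        "code points": " ".join(f"{ord(ch):X}" for ch in text),
        # The "-be" codecs produce big-endian output without a byte order mark.
        "UTF-8": text.encode("utf-8").hex(" ").upper(),
        "UTF-16": text.encode("utf-16-be").hex(" ").upper(),
        "UTF-32": text.encode("utf-32-be").hex(" ").upper(),
    }


if __name__ == "__main__":
    # Rows 3, 4 and 9 of the table.
    for sample in ("\u010d", "c\u030c", "\U0001D4C3"):
        print(table_row(sample))
```

The strings are written with escape sequences so that the example does not depend on how the source file itself is encoded.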

I looked up the information about the concrete characters using unicode-table.com. The code points in the table link to the information about the character on that site.

Each code unit is underlined separately. Each code point is encoded as:

UTF-8 - 1-byte code units, from 1 to 4 (the original design allowed up to 6, but RFC 3629 limits UTF-8 to 4)
UTF-16 - 2-byte code units, 1 or 2
UTF-32 - 4-byte code unit
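
To make the code unit counts concrete, here is a small Python sketch that counts the code units needed for a single code point in each encoding. The function name code_units is hypothetical; the counts are derived from the length of the encoded bytes.

```python
def code_units(ch):
    """Count the code units used to encode one code point."""
    return {
        "UTF-8": len(ch.encode("utf-8")),           # 1-byte code units
        "UTF-16": len(ch.encode("utf-16-be")) // 2,  # 2-byte code units
        "UTF-32": len(ch.encode("utf-32-be")) // 4,  # 4-byte code units
    }


if __name__ == "__main__":
    # Examples 1, 2 and 9 from the table.
    for ch in ("a", "\uff41", "\U0001D4C3"):
        print(f"U+{ord(ch):04X}", code_units(ch))
```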

UTF-16 and UTF-32 can be encoded as big or little endian. The table shows only the big endian variant (the most significant byte first).
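
A short Python sketch of the difference, using the fullwidth letter from example 2. The variable names are only illustrative. It also shows what happens when the bytes are read with the wrong byte order.

```python
text = "\uff41"  # example 2, code point FF41

big_endian = text.encode("utf-16-be")     # bytes FF 41, as in the table
little_endian = text.encode("utf-16-le")  # bytes 41 FF, the same unit reversed

# Decoding big-endian bytes as little-endian silently yields another code point.
misread = big_endian.decode("utf-16-le")

print(big_endian.hex(" "), little_endian.hex(" "), f"U+{ord(misread):04X}")
```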

All numbers are shown in hexadecimal form, including the code point numbers (which matches the usual U+XXXX notation).

1. Simple

One grapheme, stored as one code point, encoded as 2 bytes in UTF-16. As we would expect.

2. Fullwidth

The same grapheme, but it looks slightly different (it is wider). Fullwidth forms are used alongside East Asian characters so that Latin letters occupy the same width in fixed-width fonts.
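
Python's standard unicodedata module can show how the two letters are related. This sketch assumes that compatibility normalization (NFKC) is acceptable for the task at hand; it discards the width distinction.

```python
import unicodedata

narrow = "a"        # example 1
wide = "\uff41"     # example 2

print(unicodedata.name(wide))              # official character name
print(unicodedata.east_asian_width(wide))  # "F" means fullwidth

# NFKC folds the fullwidth form into the ordinary letter.
folded = unicodedata.normalize("NFKC", wide)
print(folded == narrow)
```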

3. Diacritic

The same, but for a less common and more complicated grapheme (with caron). We would expect this, too.

4. Diacritic - separate

Alternatively, we can use 2 separate code points to encode a character with a diacritic:

one for the base character (c) - blue color
one for the diacritical mark (the caron) - red color

Note that this is the same grapheme (as users perceive it), but encoded in a completely different way. This can cause problems when comparing texts: the user would perceive them as equal, but a comparison of code points would report that the texts are different. A process called normalization converts both texts to the same form, so that they can be compared correctly.
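
A minimal sketch of such a comparison in Python, using the standard unicodedata module. The helper name same_text is hypothetical, and the choice of NFC (composed form) is an assumption; NFD would work just as well as long as both sides use the same form.

```python
import unicodedata

composed = "\u010d"      # example 3: one code point
decomposed = "c\u030c"   # example 4: base letter + combining caron


def same_text(left, right):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", left) == unicodedata.normalize("NFC", right)


print(composed == decomposed)           # False: the code points differ
print(same_text(composed, decomposed))  # True: equal after normalization
```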

5. Ligature

These are 2 graphemes, which correspond to only one code point.

6. Separate

Of course, these graphemes (i and j) can be encoded separately, too - as two code points (blue and red). The separate characters look almost the same as the ligature.
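
The relation between the ligature and the separate letters can be checked in Python. As a sketch: canonical normalization (NFC) leaves the ligature alone, while compatibility normalization (NFKC) splits it into the two letters.

```python
import unicodedata

ligature = "\u0133"  # example 5: one code point
separate = "ij"      # example 6: two code points

canonical = unicodedata.normalize("NFC", ligature)       # unchanged
compatibility = unicodedata.normalize("NFKC", ligature)  # split into i + j

print(len(ligature), len(separate))
print(canonical == ligature, compatibility == separate)
```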

7./8. Same grapheme

These look exactly the same, but are different code points, with different meanings:

Greek letter Omega
Ohm sign
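
The two code points can be told apart by their official names, for example with Python's unicodedata module. The sketch below also shows that normalization to NFC replaces the Ohm sign with the Greek letter, so the distinction does not survive normalization.

```python
import unicodedata

omega = "\u03a9"  # example 7
ohm = "\u2126"    # example 8

print(unicodedata.name(omega))
print(unicodedata.name(ohm))

# The Ohm sign has a canonical decomposition to the Greek letter.
normalized_ohm = unicodedata.normalize("NFC", ohm)
print(normalized_ohm == omega)
```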

9. Non-BMP

This is the case where one code point is encoded as 4 UTF-16 bytes (instead of 2).

All characters which do not belong to the Basic Multilingual Plane (BMP) are encoded like this.
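
The two 2-byte code units are called a surrogate pair. The calculation can be sketched in Python as follows; the function name to_surrogates is made up for this example, and the result is checked against Python's own UTF-16 encoder.

```python
def to_surrogates(code_point):
    """Split a non-BMP code point into its high and low UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP use surrogates")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits
    return high, low


if __name__ == "__main__":
    high, low = to_surrogates(0x1D4C3)   # example 9
    print(f"{high:04X} {low:04X}")
    print("\U0001D4C3".encode("utf-16-be").hex(" ").upper())
```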

You can find other commonly used non-BMP characters in this post at Stack Overflow - see the “trans-BMP code points” part.

Written on July 21, 2015