Unicode and UTF-8 Tutorial

September 2021

Understanding everything about encoding isn’t an easy job. Most tutorials start by explaining the history of encoding, but you really don’t have to know about all that stuff to get a basic understanding of unicode and UTF-8. So here we go, I promise to make it as short and simple as humanly possible.

Unicode

You can think of unicode as a map from every possible character to a number. “Character” is meant in the broadest sense here, e.g. an emoji is also a character.

Examples:

A = 65
µ = 181
ż = 380
╬ = 9580
丠 = 20000
☠ = 9760
😍 = 128525

or if you prefer in hex, which is also very common:

A = U+41
µ = U+B5
ż = U+17C
╬ = U+256C
丠 = U+4E20
☠ = U+2620
😍 = U+1F60D

Unicode code points in hex are conventionally prefixed with U+, as above. You might also see a 0x prefix (the general machine-readable hex prefix). We’ll continue in this article with the decimal representation though.
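If you have Python at hand, you can check these numbers yourself; ord and chr are built in and do exactly this mapping:

ord("A")        # 65
ord("😍")       # 128525
hex(ord("😍"))  # '0x1f60d'
chr(9760)       # '☠'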

Note: Unicode represents the idea of a character, not what you see on the screen directly. For example, A or 😍 is displayed on your screen in a certain font, with the pixels arranged in a certain way. But what unicode represents is not that arrangement of pixels, it’s the idea of the “A” or the “love eyes emoji”. So depending on your font or your system, the pixel arrangement might be different.

Why is unicode great?

In short: it’s one single map that covers every character in the universe, and it’s the de facto standard. As long as two systems agree on unicode, they agree on which number means which character; everything else, like how to actually store those numbers, builds on top of that.

UTF-8 (Unicode Transformation Format, 8 bit)

So these numbers have to be stored somehow on your computer. Your computer only stores ones and zeroes in the end, and UTF-8 specifies exactly that: how to transform a sequence of unicode characters into binary and back.

Now you think, wait, I know how to transform a number into binary:

A = 65 = 1000001
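In Python, for example, that’s just:

bin(ord("A"))   # '0b1000001'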

The problem becomes evident if you try to put together two characters:

A😍 = 65128525 = 100000111111011000001101
        ↑               ↑
       here the new character starts,
       but it could also be read as one big number
       instead of two numbers.
       So we need a way to separate two numbers.
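You can see the ambiguity for yourself if you hand those 24 bits to Python as one number:

int("100000111111011000001101", 2)   # 8648205, one big number, the A and the 😍 are gone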

Ok, so we need a way to tell the two characters apart. We could for example just say each character gets 32 bits:

A😍 = 0000000000000000000000000100000100000000000000011111011000001101
                                      ↑
                    here starts the second character

As you can see we’re wasting quite a lot of disk space or memory here with so many unused zeroes, especially for the “A”. If on the other hand we just say we’re only using 16 bits, then there isn’t enough space for the beloved emojis and possible extensions to unicode.
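This fixed-width scheme actually exists by the way, it’s called UTF-32. A quick look in Python shows the waste (utf-32-be is the big-endian variant, so the bytes come out in the same order as written above):

"A😍".encode("utf-32-be").hex()   # '000000410001f60d', 8 bytes and mostly zeroes
len("A😍".encode("utf-32-be"))    # 8
len("A😍".encode("utf-8"))        # 5, spoiler: UTF-8 does better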

Enter UTF-8, a “smarter” encoding

I think it is easiest to just explain it with an example. Let’s encode A☠😍 in UTF-8!

01000001 11100010 10011000 10100000 11110000 10011111 10011000 10001101
|______| |________________________| |_________________________________|
   |                  |                              |
   A                  ☠                              😍

So we right away notice one thing: The characters with a smaller number in unicode (remember A=65, ☠=9760, 😍=128525) take up less space.
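You can verify this in Python: encode gives you exactly the bytes from the diagram above.

for ch in "A☠😍":
    encoded = ch.encode("utf-8")
    print(ch, [format(byte, "08b") for byte in encoded], "=", len(encoded), "byte(s)")
# A ['01000001'] = 1 byte(s)
# ☠ ['11100010', '10011000', '10100000'] = 3 byte(s)
# 😍 ['11110000', '10011111', '10011000', '10001101'] = 4 byte(s)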

But then, how does the computer know where one character stops and the next starts?

The trick is: Not all bits are used to encode the actual characters, some bits are used to encode how many bytes belong to a character instead!

Those “header bits” are all the bits up to and including the first zero in each byte, and they are to be read like this:

0_______ → this byte is a character all on its own (1 byte total)
110_____ → this byte starts a character that is 2 bytes long
1110____ → this byte starts a character that is 3 bytes long
11110___ → this byte starts a character that is 4 bytes long

Note: Even though this idea could be continued, the maximum bytes per character is four, so the list above is actually complete.

Now, there’s one with a special meaning:

10______ → this byte is not the start of a character, it belongs to the character started by one of the bytes before it (a continuation byte)
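In code, recognizing those header bits just means counting the leading ones of each byte. A tiny Python sketch (assuming the input is well-formed UTF-8):

def leading_ones(byte):
    bits = format(byte, "08b")                # the byte as 8 binary digits
    return len(bits) - len(bits.lstrip("1"))  # how many 1s come before the first 0

# 0 leading ones → a character all on its own
# 1 leading one  → a continuation byte
# 2, 3 or 4      → the first byte of a 2, 3 or 4 byte character
for byte in [0b01000001, 0b11100010, 0b10011000, 0b11110000]:
    print(format(byte, "08b"), "has", leading_ones(byte), "leading one(s)")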

The actual code numbers are then retrieved by dropping the header bits and putting together what belongs together.

Let’s try and visualize that, H means it’s a header bit:

H        HHHH     HH       HH       HHHHH    HH       HH       HH
01000001 11100010 10011000 10100000 11110000 10011111 10011000 10001101
|______| |________________________| |_________________________________|
   |                  |                              |
   A                  ☠                              😍

Notice how for each byte the last header bit is the first zero of the byte!

So it can be read as follows:

01000001 starts with the header 0, so this byte is a character on its own: drop the header bit and you’re left with 1000001 = 65 = A.
11100010 starts with the header 1110, so this byte and the next two belong to one character: drop the header bits of all three bytes (1110|0010, 10|011000, 10|100000) and glue the rest together: 0010 011000 100000 = 0010011000100000 = 9760 = ☠.
11110000 starts with the header 11110, so this byte and the next three belong to one character: drop the header bits of all four bytes (11110|000, 10|011111, 10|011000, 10|001101) and glue the rest together: 000 011111 011000 001101 = 128525 = 😍.

We’ve successfully read a binary sequence encoded in UTF-8! 🥳 (that’s U+1F973 = 11110000 10011111 10100101 10110011, by the way)
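Here’s that manual reading once more as a small Python sketch. It assumes well-formed input and skips all the validation a real decoder would do:

data = bytes([0b01000001,
              0b11100010, 0b10011000, 0b10100000,
              0b11110000, 0b10011111, 0b10011000, 0b10001101])

result = ""
i = 0
while i < len(data):
    bits = format(data[i], "08b")
    ones = len(bits) - len(bits.lstrip("1"))   # leading ones of the first byte
    length = 1 if ones == 0 else ones          # how many bytes belong to this character
    header = 1 if ones == 0 else ones + 1      # header bits, including the terminating 0
    payload = bits[header:]
    for byte in data[i + 1:i + length]:
        payload += format(byte, "08b")[2:]     # continuation bytes: drop the "10" header
    result += chr(int(payload, 2))             # the remaining bits are the unicode number
    i += length

print(result)   # A☠😍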

Now the historical reason why UTF-8 is exactly this way has some more interesting details to it, of which I’d just like to mention two:

1. The first 128 unicode characters are encoded in exactly the same single byte as in the old ASCII encoding, so every valid ASCII text is automatically also valid UTF-8.
2. Continuation bytes always start with 10 and starting bytes never do, so you can jump into the middle of a UTF-8 stream and still find where the next character begins.
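The first point is easy to see in Python:

"hello".encode("ascii") == "hello".encode("utf-8")   # True, for plain ASCII text the bytes are identical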

Conclusion

Understanding unicode is quite easy, UTF-8 is a little trickier. Unicode assigns a number to each character in the universe and is the de facto standard for doing so. UTF-8 is concerned with storing those numbers on your machine in a way that they don’t take up too much space, can still cover the huge number of unicode characters and even provide some backwards compatibility with ASCII. To do so it uses some header bits in each byte that tell you how many bytes belong to a character.
