

Handout - Utf 8 Encoding Explained (Step by Step For U+1f60a)

This handout explains how a Unicode code point, such as U+1F60A (😊), is converted into a UTF-8 byte sequence (F0 9F 98 8A). It outlines the differences between Unicode and UTF-8, the templates used for encoding, and provides a step-by-step algorithm for encoding a code point into UTF-8. Additionally, it includes examples, common pitfalls, and a quick reference for encoding various code points.

Uploaded by

senbeth11

UTF‑8 Encoding: Step‑by‑Step Handout (example: U+1F60A, 😊)
Goal: Explain clearly how a Unicode code point (U+1F60A) becomes the UTF‑8 byte sequence
F0 9F 98 8A. This is a self‑contained handout you can print or give to students.

1) Quick reminder: Unicode vs encodings


• Unicode assigns every character a code point (written U+xxxxxx, in hex). Example: U+1F60A is
the smiling‑face emoji (😊).
• UTF‑8 is one way to convert a Unicode code point into a sequence of bytes so computers can store or
send it.
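
As a quick illustration of the difference, Python's built‑in ord and str.encode expose both views of the same character. This is only a sketch using the standard library; the variable names are illustrative and not part of the handout.

```python
# The same character seen two ways: as a code point and as UTF-8 bytes.
ch = "\U0001F60A"                 # the smiling-face emoji, written as an escape

code_point = ord(ch)              # Unicode code point (an integer)
utf8_bytes = ch.encode("utf-8")   # UTF-8 encoding (a byte sequence)

print(f"U+{code_point:04X}")                       # U+1F60A
print(" ".join(f"{b:02X}" for b in utf8_bytes))    # F0 9F 98 8A
```

The code point is a single number; the encoding is the four bytes that represent that number on disk or on the wire.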

2) UTF‑8 byte formats (templates)


UTF‑8 uses different templates depending on the code point value.

Bytes  Template (bits)                       Code point range      Payload bits
1      0xxxxxxx                              U+0000 .. U+007F      7
2      110xxxxx 10xxxxxx                     U+0080 .. U+07FF      5 + 6 = 11
3      1110xxxx 10xxxxxx 10xxxxxx            U+0800 .. U+FFFF      4 + 6 + 6 = 16
4      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx   U+10000 .. U+10FFFF   3 + 6 + 6 + 6 = 21

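Important rules:
• Continuation bytes always start with 10 (those are the 10xxxxxx bytes).
• In a multi‑byte sequence, the number of leading 1 bits in the first byte (followed by a 0) indicates how many
bytes the character has (e.g. 11110 indicates 4 bytes total). A single‑byte character starts with 0.
• Always use the shortest template that fits the code point; encoding a small value with a longer template
(an "overlong" form) is invalid UTF‑8.
• The surrogate range U+D800 .. U+DFFF lies inside the 3‑byte range but is reserved for UTF‑16 and must not
be encoded.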
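
The leading‑bit rule can be sketched as a small function that reads the prefix of a first byte. The name sequence_length is hypothetical, and the check looks only at the prefix pattern; a full validator would also reject a few lead byte values that match a pattern but can never occur in valid UTF‑8.

```python
def sequence_length(lead_byte: int) -> int:
    """Return the total byte count announced by the first byte of a sequence."""
    if lead_byte >> 7 == 0b0:        # 0xxxxxxx
        return 1
    if lead_byte >> 5 == 0b110:      # 110xxxxx
        return 2
    if lead_byte >> 4 == 0b1110:     # 1110xxxx
        return 3
    if lead_byte >> 3 == 0b11110:    # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte; 11111xxx never appears in UTF-8.
    raise ValueError(f"0x{lead_byte:02X} cannot start a UTF-8 sequence")


print(sequence_length(0xF0))  # 4
```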

3) General UTF‑8 encoding algorithm (practical steps)


1. Get the Unicode code point (hex), e.g. U+1F60A.
2. Decide how many UTF‑8 bytes are needed using the ranges above.
3. Convert the hex code point to binary.
4. Pad the binary with leading zeros so the total number of bits equals the number of x slots in the chosen
template (7, 11, 16 or 21 bits).
5. Split the padded binary left to right into groups that match the x slot sizes in the template. For
example, for 4 bytes the groups are 3 | 6 | 6 | 6 bits.

6. Put each group into its place in the template, after the fixed prefix bits (0, 110, 1110, or 11110
on the first byte and 10 on each continuation byte).
7. Convert each resulting 8‑bit byte to hex; that gives the UTF‑8 byte sequence.
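
The seven steps can be sketched almost literally in Python by working on binary strings. This is a teaching sketch, not how production encoders are written; the names TEMPLATES and encode_utf8 are made up for this example, and the range and surrogate checks are additions that the steps above do not spell out.

```python
# Each entry: (largest code point covered, [(prefix bits, payload width), ...])
TEMPLATES = [
    (0x7F,     [("0", 7)]),
    (0x7FF,    [("110", 5), ("10", 6)]),
    (0xFFFF,   [("1110", 4), ("10", 6), ("10", 6)]),
    (0x10FFFF, [("11110", 3), ("10", 6), ("10", 6), ("10", 6)]),
]


def encode_utf8(code_point: int) -> bytes:
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")

    # Step 2: pick the shortest template whose range covers the code point.
    slots = next(s for limit, s in TEMPLATES if code_point <= limit)

    # Steps 3-4: binary string, left-padded to the template's payload capacity.
    capacity = sum(width for _, width in slots)
    bits = format(code_point, f"0{capacity}b")

    # Steps 5-6: split left to right and attach each fixed prefix.
    encoded = []
    position = 0
    for prefix, width in slots:
        encoded.append(int(prefix + bits[position:position + width], 2))
        position += width
    return bytes(encoded)


# Step 7: show the bytes as hex.
print(" ".join(f"{b:02X}" for b in encode_utf8(0x1F60A)))  # F0 9F 98 8A
```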

4) Worked example: encode U+1F60A (😊)


Step A: Start value
• Unicode: U+1F60A (hex)
• Decimal: 128522

Step B: Convert to binary
• Hex 1F60A → digit by digit: 1 F 6 0 A → binary nibbles: 0001 1111 0110 0000 1010.
• That is 20 bits as written (17 significant bits plus the three leading zeros of the first nibble). The UTF‑8
4‑byte template needs 21 bits, so pad on the left with one more 0 to make 21 bits.

21‑bit padded binary (grouped for clarity):

000  011111  011000  001010
^^^  ^^^^^^  ^^^^^^  ^^^^^^
3b   6b      6b      6b

(we grouped into 3 | 6 | 6 | 6 because the 4‑byte template has xxx + three xxxxxx groups)
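
The padding and grouping in Step B can be reproduced in a few lines of Python. This is a small sketch for checking the bit strings by machine; the variable names are illustrative.

```python
cp = 0x1F60A

significant = cp.bit_length()      # bits actually needed for the value
padded = format(cp, "021b")        # left-padded with zeros to 21 bits
groups = [padded[0:3], padded[3:9], padded[9:15], padded[15:21]]

print(significant)  # 17
print(padded)       # 000011111011000001010
print(groups)       # ['000', '011111', '011000', '001010']
```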

Step C: Place groups into the UTF‑8 4‑byte template

Template bits:

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Fill in the groups (left → right):
• xxx ← 000
• 1st xxxxxx ← 011111
• 2nd xxxxxx ← 011000
• 3rd xxxxxx ← 001010

So the bytes become (binary):

11110000 10011111 10011000 10001010

Breakdown:
• 11110000 → 0xF0
• 10011111 → 0x9F
• 10011000 → 0x98
• 10001010 → 0x8A

Result (UTF‑8 byte sequence): F0 9F 98 8A. Those are exactly the bytes sent or stored for 😊 in UTF‑8.
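
The same four bytes can also be computed with shifts and masks instead of string slicing, which is closer to how encoders are usually written. The sketch below assumes the code point is already known to be in the 4‑byte range (U+10000 .. U+10FFFF); the variable names are illustrative.

```python
cp = 0x1F60A

byte1 = 0b11110000 | (cp >> 18)            # prefix 11110 + top 3 payload bits
byte2 = 0b10000000 | ((cp >> 12) & 0x3F)   # prefix 10 + next 6 bits
byte3 = 0b10000000 | ((cp >> 6) & 0x3F)    # prefix 10 + next 6 bits
byte4 = 0b10000000 | (cp & 0x3F)           # prefix 10 + lowest 6 bits

result = bytes([byte1, byte2, byte3, byte4])
print(" ".join(f"{b:08b}" for b in result))  # 11110000 10011111 10011000 10001010
print(" ".join(f"{b:02X}" for b in result))  # F0 9F 98 8A
```

Shifting right by 18, 12, 6 and 0 bits selects the same 3 | 6 | 6 | 6 groups as the left‑to‑right split, and the mask 0x3F keeps only six bits.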

5) Visual grid (how the bits are packed)

Code point (hex):     1    F    6    0    A
Nibbles (binary):     0001 1111 0110 0000 1010   (20 bits)
Pad to 21 bits:       0 0001 1111 0110 0000 1010 -> 000011111011000001010
Split into 3|6|6|6:   000      | 011111   | 011000   | 001010
UTF-8 template:       11110xxx   10xxxxxx   10xxxxxx   10xxxxxx
Fill in groups:       11110000   10011111   10011000   10001010
                         ↓          ↓          ↓          ↓
Hex bytes:               F0         9F         98         8A

6) Extra notes & common pitfalls


• Padding is crucial. Always pad the code point on the left with zeros so the bit length matches the
template capacity (7, 11, 16, 21). Forgetting padding will misalign groups.
• Group left → right. Always take the most significant bits first and move to the least significant.
• Continuation bytes always start with 10. If you see a byte that starts with 10, it’s a continuation (not
a start) of a multi‑byte character.
• Endianness confusion: UTF‑8 is defined as a sequence of single bytes in a fixed order, so it has no
byte‑order (endianness) variants the way UTF‑16 does. A byte‑order mark (EF BB BF) may appear at the start of a
UTF‑8 file as a signature, but it is not needed to determine byte order.
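
One practical use of the continuation‑byte rule is counting characters in a byte stream without decoding it: every byte that does not start with 10 begins a new code point. The sketch below assumes the input is valid UTF‑8; the helper names are hypothetical.

```python
def is_continuation(byte: int) -> bool:
    # Keep the top two bits and compare them with the 10 prefix.
    return (byte & 0b11000000) == 0b10000000


def count_code_points(data: bytes) -> int:
    # Each non-continuation byte starts exactly one code point.
    return sum(1 for b in data if not is_continuation(b))


data = "a\u00A9\u20AC\U0001F60A".encode("utf-8")   # a, ©, €, 😊
print(len(data))                # 10  (1 + 2 + 3 + 4 bytes)
print(count_code_points(data))  # 4
```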

7) Quick reference: how many bits per template


• 1 byte: 0xxxxxxx → 7 bits available (for U+0000..U+007F)
• 2 bytes: 110xxxxx 10xxxxxx → 5 + 6 = 11 bits (for U+0080..U+07FF)
• 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx → 4 + 6 + 6 = 16 bits (for U+0800..U+FFFF)
• 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx → 3 + 6 + 6 + 6 = 21 bits (for
U+10000..U+10FFFF)
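
As a sanity check on these numbers, the largest code point of each range should fit in the bits its template provides. The following sketch verifies that with int.bit_length; the table RANGES and the function bytes_needed are names invented for this example.

```python
# (byte count, first code point, last code point, payload bits)
RANGES = [
    (1, 0x0000, 0x007F, 7),
    (2, 0x0080, 0x07FF, 11),
    (3, 0x0800, 0xFFFF, 16),
    (4, 0x10000, 0x10FFFF, 21),
]


def bytes_needed(code_point: int) -> int:
    for n_bytes, low, high, _capacity in RANGES:
        if low <= code_point <= high:
            return n_bytes
    raise ValueError("code point outside the Unicode range")


for n_bytes, low, high, capacity in RANGES:
    # The top of every range must fit into the template's payload bits.
    assert high.bit_length() <= capacity
    print(f"{n_bytes} byte(s): U+{low:04X}..U+{high:04X} fits in {capacity} bits")

print(bytes_needed(0x1F60A))  # 4
```

Note that 21 bits could hold values up to 0x1FFFFF, but Unicode itself stops at U+10FFFF, so the 4‑byte range ends there.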

8) Small exercises (try to encode, answers below)


1. Encode U+00A9 (©) → expected UTF‑8 bytes?
2. Encode U+20AC (€) → expected UTF‑8 bytes?
3. Encode U+1F44D (👍 thumbs up) → expected UTF‑8 bytes?

Answers:
1. U+00A9 → hex C2 A9
2. U+20AC → hex E2 82 AC
3. U+1F44D → hex F0 9F 91 8D
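
Hand‑encoded answers can be checked against Python's built‑in encoder. The sketch below compares the three answers above with the output of str.encode; the dictionary name answers and the list results are illustrative.

```python
answers = {
    0x00A9: "C2 A9",
    0x20AC: "E2 82 AC",
    0x1F44D: "F0 9F 91 8D",
}

results = []
for cp, expected in answers.items():
    actual = " ".join(f"{b:02X}" for b in chr(cp).encode("utf-8"))
    results.append(actual == expected)
    print(f"U+{cp:04X}: {actual}", "OK" if actual == expected else "MISMATCH")
```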

9) Short summary for students


• Unicode gives each character a code point (U+...).
• UTF‑8 packs those code points into 1–4 bytes using fixed templates.

• To encode: pick the template from the code point range, convert to binary, pad, split into groups that match
the template's x slots, insert into the template, then convert each 8‑bit byte to hex.
