Unicode & Character Encodings in Python: A Painless Guide

Table of Contents

  • What’s a Character Encoding?
    • The string Module
    • A Bit of a Refresher
    • We Need More Bits!
  • Covering All the Bases: Other Number Systems
  • Enter Unicode
    • Unicode vs UTF-8
    • Encoding and Decoding in Python 3
    • Python 3: All-In on Unicode
    • One Byte, Two Bytes, Three Bytes, Four
    • What About UTF-16 and UTF-32?
  • Python’s Built-In Functions
  • Python String Literals: Ways to Skin a Cat
  • Other Encodings Available in Python
  • You Know What They Say About Assumptions…
  • Odds and Ends: unicodedata
  • Wrapping Up
  • Resources

Handling character encodings in Python or any other language can at times seem painful. Places such as Stack Overflow have thousands of questions stemming from confusion over exceptions like UnicodeDecodeError and UnicodeEncodeError. This tutorial is designed to clear the Exception fog and illustrate that working with text and binary data in Python 3 can be a smooth experience. Python’s Unicode support is strong and robust, but it takes some time to master.

This tutorial is different because it’s not language-agnostic but instead deliberately Python-centric. You’ll still get a language-agnostic primer, but you’ll then dive into illustrations in Python, with text-heavy paragraphs kept to a minimum. You’ll see how to use concepts of character encodings in live Python code.

By the end of this tutorial, you’ll:

  • Get conceptual overviews on character encodings and numbering systems
  • Understand how encoding comes into play with Python’s str and bytes
  • Know about support in Python for numbering systems through its various forms of int literals
  • Be familiar with Python’s built-in functions related to character encodings and numbering systems

Character encoding and numbering systems are so closely connected that they need to be covered in the same tutorial or else the treatment of either would be totally inadequate.

Note: This article is Python 3-centric. Specifically, all code examples in this tutorial were generated from a CPython 3.7.2 shell, although all minor versions of Python 3 should behave (mostly) the same in their treatment of text.

If you’re still using Python 2 and are intimidated by the differences in how Python 2 and Python 3 treat text and binary data, then hopefully this tutorial will help you make the switch.

What’s a Character Encoding?

There are tens if not hundreds of character encodings. The best way to start understanding what they are is to cover one of the simplest character encodings, ASCII.

Whether you’re self-taught or have a formal computer science background, chances are you’ve seen an ASCII table once or twice. ASCII is a good place to start learning about character encoding because it is a small and contained encoding. (Too small, as it turns out.)

It encompasses the following:

  • Lowercase English letters: a through z
  • Uppercase English letters: A through Z
  • Some punctuation and symbols: "$" and "!", to name a couple
  • Whitespace characters: an actual space (" "), as well as a newline, carriage return, horizontal tab, vertical tab, and a few others
  • Some non-printable characters: characters such as backspace, "\b", that can’t be printed literally in the way that the letter A can

So what is a more formal definition of a character encoding?

At a very high level, it’s a way of translating characters (such as letters, punctuation, symbols, whitespace, and control characters) to integers and ultimately to bits. Each character can be encoded to a unique sequence of bits. Don’t worry if you’re shaky on the concept of bits, because we’ll get to them shortly.
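To make the character-to-integer-to-bits idea concrete, here is a minimal sketch using Python's built-in ord(), format(), and str.encode(). The helper name char_to_bits is hypothetical and exists only for this illustration, and the 8-bit display width is an assumption made for readability (ASCII itself needs only 7 bits):

```python
def char_to_bits(char: str) -> str:
    """Return the bits of a single ASCII character as an 8-character string."""
    code_point = ord(char)            # character -> integer
    return format(code_point, "08b")  # integer -> zero-padded binary string


letter = "a"
print(ord(letter))             # 97
print(char_to_bits(letter))    # 01100001
print(letter.encode("ascii"))  # b'a'
```

The same character shows up in three guises here: as text, as an integer, and as the bits that would actually be stored.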

The various categories outlined represent groups of characters. Each single character has a corresponding code point, which you can think of as just an integer. Characters are segmented into different ranges within the ASCII table:

Code Point Range    Class
0 through 31        Control/non-printable characters
32 through 64       Punctuation, symbols, numbers, and space
65 through 90       Uppercase English alphabet letters
91 through 96       Additional graphemes, such as [ and \
97 through 122      Lowercase English alphabet letters
123 through 126     Additional graphemes, such as { and |
127                 Control/non-printable character (DEL)
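The ranges in this table can be sketched as a small lookup function. This is only an illustration: ascii_class is a hypothetical helper, not something Python ships with, and the class labels are shortened versions of the table's wording:

```python
def ascii_class(char: str) -> str:
    """Map one character to its class from the code point range table."""
    code_point = ord(char)
    if code_point > 127:
        raise ValueError(f"{char!r} is outside the ASCII range")
    if code_point <= 31 or code_point == 127:
        return "control/non-printable"
    if code_point <= 64:
        return "punctuation, symbols, numbers, and space"
    if code_point <= 90:
        return "uppercase letter"
    if code_point <= 96:
        return "additional grapheme"
    if code_point <= 122:
        return "lowercase letter"
    return "additional grapheme"  # 123 through 126


for char in ("\t", "7", "Q", "[", "q", "~"):
    print(repr(char), ord(char), ascii_class(char))
```

Each line of output pairs a character with its code point and the range it falls into.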

The entire ASCII table contains 128 characters. This table captures the complete character set that ASCII permits. If you don’t see a character here, then you simply can’t express it as printed text under the ASCII encoding scheme.
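You can check this limit from Python itself. The sketch below assumes a hypothetical helper named is_ascii, built on the real str.encode() method and the UnicodeEncodeError it raises for characters outside the 128-character set:

```python
def is_ascii(text: str) -> bool:
    """Return True if every character in text can be encoded as ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


# Every one of the 128 ASCII code points decodes cleanly.
all_ascii = bytes(range(128)).decode("ascii")
print(len(all_ascii))      # 128
print(is_ascii("Python"))  # True
print(is_ascii("naïve"))   # False
```

On Python 3.7 and later, str.isascii() performs the same check without the try/except.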

ASCII Table

Code Point  Character (Name)                  Code Point  Character (Name)
0           NUL (Null)                        64          @
1           SOH (Start of Heading)            65          A
2           STX (Start of Text)               66          B
3           ETX (End of Text)                 67          C
4           EOT (End of Transmission)         68          D
5           ENQ (Enquiry)                     69          E
6           ACK (Acknowledgment)              70          F
7           BEL (Bell)                        71          G
8           BS (Backspace)                    72          H
9           HT (Horizontal Tab)               73          I
10          LF (Line Feed)                    74          J
11          VT (Vertical Tab)                 75          K
12          FF (Form Feed)                    76          L
13          CR (Carriage Return)              77          M
14          SO (Shift Out)                    78          N
15          SI (Shift In)                     79          O
16          DLE (Data Link Escape)            80          P
17          DC1 (Device Control 1)            81          Q
18          DC2 (Device Control 2)            82          R
19          DC3 (Device Control 3)            83          S
20          DC4 (Device Control 4)            84          T
21          NAK (Negative Acknowledgment)     85          U
22          SYN (Synchronous Idle)            86          V
23          ETB (End of Transmission Block)   87          W
24          CAN (Cancel)                      88          X
25          EM (End of Medium)                89          Y
26          SUB (Substitute)                  90          Z
27          ESC (Escape)                      91          [
28          FS (File Separator)               92          \
29          GS (Group Separator)              93          ]
30          RS (Record Separator)             94          ^
31          US (Unit Separator)               95          _
32          SP (Space)                        96          `
33          !                                 97          a
34          "                                 98          b
35          #                                 99          c
36          $                                 100         d
37          %                                 101         e
38          &                                 102         f
39          '                                 103         g
40          (                                 104         h
41          )                                 105         i
42          *                                 106         j
43          +                                 107         k
44          ,                                 108         l
45          -                                 109         m
46          .                                 110         n
47          /                                 111         o
48          0                                 112         p
49          1                                 113         q
50          2                                 114         r
51          3                                 115         s
52          4                                 116         t
53          5                                 117         u
54          6                                 118         v
55          7                                 119         w
56          8                                 120         x
57          9                                 121         y
58          :                                 122         z
59          ;                                 123         {
60          <                                 124         |
61          =                                 125         }
62          >                                 126         ~
63          ?                                 127         DEL (Delete)
The string Module

Python’s string module is a convenient one-stop shop for string constants that fall within ASCII’s character set.

Here’s the core of the module in all its glory:

    # From lib/python3.7/string.py

    whitespace = ' \t\n\r\v\f'
    ascii_lowercase = 'abcdefghijklmnopqrstuvwxyz'
    ascii_uppercase = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    ascii_letters = ascii_lowercase + ascii_uppercase
    digits = '0123456789'
    hexdigits = digits + 'abcdef' + 'ABCDEF'
    octdigits = '01234567'
    punctuation = r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""
    printable = digits + ascii_letters + punctuation + whitespace
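As a quick illustration of how these constants might be used, the sketch below filters a string down to its ASCII letters and confirms that everything in string.printable sits inside the 128-character ASCII range. The function only_ascii_letters is a hypothetical helper written for this example, not part of the string module:

```python
import string


def only_ascii_letters(text: str) -> str:
    """Keep only the characters found in string.ascii_letters."""
    return "".join(char for char in text if char in string.ascii_letters)


print(only_ascii_letters("R2-D2 & C-3PO"))                # RDCPO
print(len(string.printable))                              # 100
print(all(ord(char) < 128 for char in string.printable))  # True
```

The count of 100 comes from 10 digits, 52 letters, 32 punctuation characters, and 6 whitespace characters.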
