Unicode & Character Encodings In Python: A Painless Guide

Table of Contents

What’s a Character Encoding?
- The string Module
- A Bit of a Refresher
- We Need More Bits!
Covering All the Bases: Other Number Systems
Enter Unicode
- Unicode vs UTF-8
- Encoding and Decoding in Python 3
- Python 3: All-In on Unicode
- One Byte, Two Bytes, Three Bytes, Four
- What About UTF-16 and UTF-32?
Python’s Built-In Functions
Python String Literals: Ways to Skin a Cat
Other Encodings Available in Python
You Know What They Say About Assumptions…
Odds and Ends: unicodedata
Wrapping Up
Resources

Handling character encodings in Python or any other language can at times seem painful. Places such as Stack Overflow have thousands of questions stemming from confusion over exceptions like UnicodeDecodeError and UnicodeEncodeError. This tutorial is designed to clear the exception fog and illustrate that working with text and binary data in Python 3 can be a smooth experience. Python’s Unicode support is robust, but it takes some time to master.

This tutorial is different because it’s not language-agnostic but instead deliberately Python-centric. You’ll still get a language-agnostic primer, but you’ll then dive into illustrations in Python, with text-heavy paragraphs kept to a minimum. You’ll see how to use concepts of character encodings in live Python code.

By the end of this tutorial, you’ll:

Get a conceptual overview of character encodings and numbering systems
Understand how encoding comes into play with Python’s str and bytes
Know about support in Python for numbering systems through its various forms of int literals
Be familiar with Python’s built-in functions related to character encodings and numbering systems

Character encoding and numbering systems are so closely connected that they need to be covered in the same tutorial or else the treatment of either would be totally inadequate.

Note: This article is Python 3-centric. Specifically, all code examples in this tutorial were generated from a CPython 3.7.2 shell, although all minor versions of Python 3 should behave (mostly) the same in their treatment of text.

If you’re still using Python 2 and are intimidated by the differences in how Python 2 and Python 3 treat text and binary data, then hopefully this tutorial will help you make the switch.

What’s a Character Encoding?

There are tens if not hundreds of character encodings. The best way to start understanding what they are is to cover one of the simplest character encodings, ASCII.

Whether you’re self-taught or have a formal computer science background, chances are you’ve seen an ASCII table once or twice. ASCII is a good place to start learning about character encoding because it is a small and contained encoding. (Too small, as it turns out.)

It encompasses the following:

Lowercase English letters: a through z
Uppercase English letters: A through Z
Some punctuation and symbols: "$" and "!", to name a couple
Whitespace characters: an actual space (" "), as well as a newline, carriage return, horizontal tab, vertical tab, and a few others
Some non-printable characters: characters such as backspace, "\b", that can’t be printed literally in the way that the letter A can

So what is a more formal definition of a character encoding?

At a very high level, it’s a way of translating characters (such as letters, punctuation, symbols, whitespace, and control characters) to integers and ultimately to bits. Each character can be encoded to a unique sequence of bits. Don’t worry if you’re shaky on the concept of bits, because we’ll get to them shortly.
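The character-to-integer-to-bits translation described above can be sketched with Python's built-in ord() and format(). This is only an illustration: the helper name char_to_bits is hypothetical, and padding to 8 digits (one byte) is an assumption made here for display.

```python
def char_to_bits(char: str) -> str:
    """Return the bits for a single ASCII character (hypothetical helper)."""
    code_point = ord(char)  # character -> integer
    if code_point > 127:
        raise ValueError(f"{char!r} is outside the ASCII range")
    # integer -> bits, zero-padded to 8 digits (an assumed display width)
    return format(code_point, "08b")


for ch in ("a", "A", "$", " "):
    print(repr(ch), ord(ch), char_to_bits(ch))
# 'a' 97 01100001
# 'A' 65 01000001
# '$' 36 00100100
# ' ' 32 00100000
```

Each character maps to exactly one integer, and each integer to exactly one bit pattern, which is what makes the translation reversible.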

The various categories outlined represent groups of characters. Each single character has a corresponding code point, which you can think of as just an integer. Characters are segmented into different ranges within the ASCII table:

Code Point Range	Class
0 through 31	Control/non-printable characters
32 through 64	Punctuation, symbols, numbers, and space
65 through 90	Uppercase English alphabet letters
91 through 96	Additional graphemes, such as [ and \
97 through 122	Lowercase English alphabet letters
123 through 126	Additional graphemes, such as { and |
127	Control/non-printable character (DEL)
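As a rough sketch, the ranges in the table above can be written down as data and used to look up the class of any ASCII character. The names ASCII_CLASSES and ascii_class are hypothetical and exist only for this example.

```python
# Ranges copied from the table above: (first, last, class).
ASCII_CLASSES = [
    (0, 31, "Control/non-printable characters"),
    (32, 64, "Punctuation, symbols, numbers, and space"),
    (65, 90, "Uppercase English alphabet letters"),
    (91, 96, "Additional graphemes"),
    (97, 122, "Lowercase English alphabet letters"),
    (123, 126, "Additional graphemes"),
    (127, 127, "Control/non-printable character (DEL)"),
]


def ascii_class(char: str) -> str:
    """Return the table class for one ASCII character (hypothetical helper)."""
    code_point = ord(char)
    for first, last, label in ASCII_CLASSES:
        if first <= code_point <= last:
            return label
    raise ValueError(f"{char!r} (code point {code_point}) is not ASCII")


print(ascii_class("A"))   # Uppercase English alphabet letters
print(ascii_class("7"))   # Punctuation, symbols, numbers, and space
print(ascii_class("\n"))  # Control/non-printable characters
```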

The entire ASCII table contains 128 characters. This table captures the complete character set that ASCII permits. If you don’t see a character here, then you simply can’t express it as printed text under the ASCII encoding scheme.
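To see that limit in practice, here is a minimal sketch that tries to encode text as ASCII and reports whether it succeeds. The helper can_encode_ascii is a hypothetical name used only for illustration; str.encode() and UnicodeEncodeError are standard Python.

```python
def can_encode_ascii(text: str) -> bool:
    """Return True if every character in text fits in ASCII (hypothetical helper)."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


print("cafe".encode("ascii"))    # b'cafe'
print(can_encode_ascii("cafe"))  # True
print(can_encode_ascii("café"))  # False: 'é' has code point 233, beyond 127
```

The failure on "café" is the same UnicodeEncodeError mentioned at the start of this tutorial, raised because the character has no entry in the 128-character table.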

ASCII Table

Code Point	Character (Name)	Code Point	Character (Name)
0	NUL (Null)	64	@
1	SOH (Start of Heading)	65	A
2	STX (Start of Text)	66	B
3	ETX (End of Text)	67	C
4	EOT (End of Transmission)	68	D
5	ENQ (Enquiry)	69	E
6	ACK (Acknowledgment)	70	F
7	BEL (Bell)	71	G
8	BS (Backspace)	72	H
9	HT (Horizontal Tab)	73	I
10	LF (Line Feed)	74	J
11	VT (Vertical Tab)	75	K
12	FF (Form Feed)	76	L
13	CR (Carriage Return)	77	M
14	SO (Shift Out)	78	N
15	SI (Shift In)	79	O
16	DLE (Data Link Escape)	80	P
17	DC1 (Device Control 1)	81	Q
18	DC2 (Device Control 2)	82	R
19	DC3 (Device Control 3)	83	S
20	DC4 (Device Control 4)	84	T
21	NAK (Negative Acknowledgment)	85	U
22	SYN (Synchronous Idle)	86	V
23	ETB (End of Transmission Block)	87	W
24	CAN (Cancel)	88	X
25	EM (End of Medium)	89	Y
26	SUB (Substitute)	90	Z
27	ESC (Escape)	91	[
28	FS (File Separator)	92	\
29	GS (Group Separator)	93	]
30	RS (Record Separator)	94	^
31	US (Unit Separator)	95	_
32	SP (Space)	96	`
33	!	97	a
34	"	98	b
35	#	99	c
36	$	100	d
37	%	101	e
38	&	102	f
39	'	103	g
40	(	104	h
41	)	105	i
42	*	106	j
43	+	107	k
44	,	108	l
45	-	109	m
46	.	110	n
47	/	111	o
48	0	112	p
49	1	113	q
50	2	114	r
51	3	115	s
52	4	116	t
53	5	117	u
54	6	118	v
55	7	119	w
56	8	120	x
57	9	121	y
58	:	122	z
59	;	123	{
60	<	124	|
61	=	125	}
62	>	126	~
63	?	127	DEL (delete)
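The printable rows of this table can be regenerated with the built-in chr(), which goes from a code point back to its character. The sketch below assumes the printable range runs from 32 through 126, as in the table; ascii_rows is a hypothetical helper name.

```python
def ascii_rows(start: int = 32, stop: int = 127):
    """Yield (code point, character) pairs; defaults cover 32 through 126."""
    for code_point in range(start, stop):
        yield code_point, chr(code_point)  # integer -> character


rows = dict(ascii_rows())
print(len(rows))                                 # 95
print(rows[64], rows[65], rows[97], rows[126])   # @ A a ~

# ord() and chr() are inverses of each other over this range.
assert all(ord(char) == code_point for code_point, char in rows.items())
```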

The string Module

Python’s string module is a convenient one-stop shop for string constants that fall in ASCII’s character set.

Here’s the core of the module in all its glory:

```python
# From lib/python3.7/string.py

whitespace = ' \t\n\r\v\f'
ascii_lowercase = 'abcdefghijklmnopqrstuvwxyz'
ascii_uppercase = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
ascii_letters = ascii_lowercase + ascii_uppercase
digits = '0123456789'
hexdigits = digits + 'abcdef' + 'ABCDEF'
octdigits = '01234567'
punctuation = r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""
printable = digits + ascii_letters + punctuation + whitespace
```
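As a brief illustration of how these constants might be used, the sketch below checks that they stay within ASCII and strips punctuation from a sentence. The helper strip_punctuation is hypothetical; str.maketrans() and str.translate() are standard string methods.

```python
import string

# Every character in string.printable has a code point below 128.
print(all(ord(ch) < 128 for ch in string.printable))  # True

# 10 digits + 52 letters + 32 punctuation marks + 6 whitespace characters
print(len(string.printable))  # 100


def strip_punctuation(text: str) -> str:
    """Remove ASCII punctuation from text (hypothetical helper)."""
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)


print(strip_punctuation("Hello, world!"))  # Hello world
```

Note that string.punctuation covers ASCII punctuation only, so characters such as curly quotes would pass through this helper unchanged.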
