5.11. Character Encoding

5.11.2. Introduction to UTF-8

Most software is not designed to handle 16-bit or 32-bit characters, yet a universal character set requires more than 8 bits per character. Therefore, a special format called UTF-8 was developed to encode these potentially international characters in a format more easily handled by existing programs and libraries. UTF-8 is defined, among other places, in IETF RFC 3629 (which obsoletes RFC 2279), so it’s a well-defined standard that can be freely read and used. UTF-8 is a variable-width encoding; characters numbered 0 to 0x7f (127) encode to themselves as a single byte, while characters with larger values are encoded into 2 to 4 bytes (originally up to 6), depending on their value. The encoding has been specially designed to have the following nice properties (this information is from the RFC and the Linux utf-8 man page):

The classical US ASCII characters (0 to 0x7f) encode as themselves, so files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8. This is fabulous for backward compatibility with the many existing U.S. programs and data files.
All UCS characters beyond 0x7f are encoded as a multibyte sequence consisting only of bytes in the range 0x80 to 0xfd (0x80 to 0xf4 under the 4-byte limit of RFC 3629). This means that no ASCII byte can appear as part of another character. In many other encodings, a multibyte character can contain bytes that look like ASCII characters, such as an embedded NUL, causing programs to fail.
It’s easy to convert between UTF-8 and the 2-byte or 4-byte fixed-width representations of characters (called UCS-2 and UCS-4, respectively).
The lexicographic sorting order of UCS-4 strings is preserved, and the Boyer-Moore fast search algorithm can be used directly with UTF-8 data.
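All possible 2^31 UCS codes can be encoded using the original 6-byte form of UTF-8; RFC 3629 restricts the range to U+0000 through U+10FFFF, which needs at most 4 bytes.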
The first byte of a multibyte sequence which represents a single non-ASCII UCS character is always in the range 0xc0 to 0xfd (0xc2 to 0xf4 under RFC 3629) and indicates how long this multibyte sequence is. All further bytes in a multibyte sequence are in the range 0x80 to 0xbf. This allows easy resynchronization; if a byte is missing, it’s easy to skip forward to the “next” character, and it’s always easy to move forward or back to the “next” or “preceding” character.
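
To make the byte layout behind these properties concrete, here is a minimal sketch in Python. It assumes the 4-byte limit of RFC 3629, and the function names (utf8_encode, is_continuation, next_char_start, prev_char_start) are hypothetical names chosen for this illustration, not part of any library. In real programs you would normally use the language's built-in encoder instead.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 (sketch, RFC 3629 range only)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point under RFC 3629")
    if cp <= 0x7F:
        # ASCII encodes as itself in a single byte.
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def is_continuation(byte: int) -> bool:
    """True for bytes 0x80..0xbf, which never start a character."""
    return (byte & 0xC0) == 0x80


def next_char_start(data: bytes, i: int) -> int:
    """Return the index of the first character start after index i."""
    i += 1
    while i < len(data) and is_continuation(data[i]):
        i += 1
    return i


def prev_char_start(data: bytes, i: int) -> int:
    """Return the index of the nearest character start before index i."""
    i -= 1
    while i > 0 and is_continuation(data[i]):
        i -= 1
    return i
```

For example, utf8_encode(0x20AC) (the euro sign) yields the three bytes e2 82 ac, and starting from any byte inside that sequence, prev_char_start walks back to the e2 lead byte without needing any decoder state.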

In short, the UTF-8 transformation format is becoming a dominant method for exchanging international text information because it can support all of the world’s languages, yet it is backward compatible with U.S. ASCII files as well as having other nice properties. For many purposes I recommend its use, particularly when storing data in a “text” file.
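
As a small illustration of storing text as UTF-8, the following Python sketch writes a string to a temporary file with an explicitly named encoding and reads it back. The helper name write_and_read_utf8 is hypothetical and exists only for this example; naming the encoding explicitly, rather than relying on the platform default, is the point being shown.

```python
import os
import tempfile


def write_and_read_utf8(text: str):
    """Write text to a temp file as UTF-8; return (raw bytes, decoded text)."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # newline="" disables newline translation so the bytes match exactly.
        with open(path, "w", encoding="utf-8", newline="") as f:
            f.write(text)
        with open(path, "rb") as f:
            raw = f.read()
        with open(path, "r", encoding="utf-8", newline="") as f:
            decoded = f.read()
    finally:
        os.remove(path)
    return raw, decoded
```

With pure 7-bit ASCII input the raw bytes on disk are identical to the ASCII encoding of the same text, which is the backward-compatibility property described above.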
