5.11. Character Encoding
5.11.2. Introduction to UTF-8
Most software is not designed to handle 16-bit or 32-bit characters, yet creating a universal character set required more than 8 bits. Therefore, a special format called UTF-8 was developed to encode these potentially international characters in a format more easily handled by existing programs and libraries. UTF-8 is defined, among other places, in IETF RFC 3629 (which obsoletes RFC 2279), so it’s a well-defined standard that can be freely read and used. UTF-8 is a variable-width encoding; characters numbered 0 to 0x7f (127) encode to themselves as a single byte, while characters with larger values are encoded into 2 to 4 (originally 6) bytes of information (depending on their value). The encoding has been specially designed to have the following nice properties (this information is from the RFC and Linux utf-8 man page):
The classical US ASCII characters (0 to 0x7f) encode as themselves, so files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8. This is fabulous for backward compatibility with the many existing U.S. programs and data files.
All UCS characters beyond 0x7f are encoded as a multibyte sequence consisting only of bytes in the range 0x80 to 0xfd (0x80 to 0xf4 under the 4-byte limit of RFC 3629). This means that no ASCII byte can appear as part of another character. Many other encodings allow bytes such as NUL (0x00) to be embedded in a multibyte character, causing programs to fail.
It’s easy to convert between UTF-8 and a 2-byte or 4-byte fixed-width representation of characters (these are called UCS-2 and UCS-4 respectively).
The lexicographic sorting order of UCS-4 strings is preserved, and the Boyer-Moore fast search algorithm can be used directly with UTF-8 data.
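All possible 2^31 UCS codes can be encoded using the original (up to 6-byte) form of UTF-8; RFC 3629 restricts the encoding to the range U+0000 to U+10FFFF, which needs at most 4 bytes.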
The first byte of a multibyte sequence which represents a single non-ASCII UCS character is always in the range 0xc0 to 0xfd and indicates how long this multibyte sequence is. All further bytes in a multibyte sequence are in the range 0x80 to 0xbf. This allows easy resynchronization; if a byte is missing, it’s easy to skip forward to the “next” character, and it’s always easy to skip forward and back to the “next” or “preceding” character.
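The properties above can be checked with a short Python sketch. This is only an illustration, not part of the RFC or the man page: the helper names (utf8_sequence_length, is_continuation, next_boundary) are made up for this example, and the lead-byte ranges follow the 4-byte limit of RFC 3629 rather than the older 0xc0 to 0xfd range.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return the sequence length announced by a lead byte (RFC 3629 ranges)."""
    if lead_byte <= 0x7F:
        return 1          # ASCII encodes as itself
    if 0xC2 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3
    if 0xF0 <= lead_byte <= 0xF4:
        return 4
    raise ValueError(f"0x{lead_byte:02x} cannot start a UTF-8 sequence")


def is_continuation(byte: int) -> bool:
    """True for the bytes that may follow a lead byte (0x80 to 0xbf)."""
    return 0x80 <= byte <= 0xBF


def next_boundary(data: bytes, pos: int) -> int:
    """Skip forward from pos to the next byte that starts a character."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


# One sample character for each sequence length: U+0041, U+00E9, U+20AC, U+1F600.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    encoded = ch.encode("utf-8")
    # The lead byte alone tells us how long the sequence is.
    assert utf8_sequence_length(encoded[0]) == len(encoded)
    # Every following byte is a continuation byte, never an ASCII byte.
    assert all(is_continuation(b) for b in encoded[1:])
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')}")

# Pure ASCII text has the same bytes under ASCII and UTF-8.
assert "plain text".encode("ascii") == "plain text".encode("utf-8")

# Sorting by code point and sorting by UTF-8 bytes give the same order.
words = ["z", "\u00e9", "\u20ac", "\U0001F600", "a", "\u00e9a"]
assert sorted(words) == sorted(words, key=lambda s: s.encode("utf-8"))

# Resynchronization: drop the lead byte of the euro sign in "a<euro>b",
# then skip the orphaned continuation bytes to reach the next character.
damaged = "a\u20acb".encode("utf-8")[2:]
start = next_boundary(damaged, 0)
assert damaged[start:] == b"b"
```

The resynchronization step works only because continuation bytes occupy their own range; a decoder that lands in the middle of a character can recognize that fact from a single byte.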
In short, the UTF-8 transformation format is becoming a dominant method for exchanging international text information because it can support all of the world’s languages, yet it is backward compatible with U.S. ASCII files as well as having other nice properties. For many purposes I recommend its use, particularly when storing data in a “text” file.