Character Sets - SUNY Cortland

The starting point for the character sets we find on most computers was ASCII (American Standard Code for Information Interchange). ASCII is a 7-bit code. One bit (binary digit) is a single switch that can be on or off, zero or one. The single-byte character sets that followed ASCII in the US are generally 8-bit sets with 256 different characters, effectively doubling the ASCII set.

One bit can have 2 possible states: 2^1 = 2 (0 or 1). Two bits can have 4 possible states: 2^2 = 4 (00, 01, 10, 11, i.e. 0-3). Four bits can have 16 possible states: 2^4 = 16 (0000, 0001, 0010, 0011, etc., i.e. 0-15). Seven bits can have 128 possible states: 2^7 = 128 (0000000, 0000001, 0000010, etc., i.e. 0-127). Eight bits can have 256 possible states: 2^8 = 256 (00000000, 00000001, 00000010, etc., i.e. 0-255).
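
The doubling rule above can be checked with a few lines of Python. This is only an illustrative sketch; the helper name `state_count` is made up for this example.

```python
def state_count(bits: int) -> int:
    """Number of distinct values that `bits` binary digits can represent."""
    return 2 ** bits

# Print the count, the numeric range, and the highest bit pattern for each width.
for bits in (1, 2, 4, 7, 8):
    highest = state_count(bits) - 1
    print(f"{bits} bit(s): {state_count(bits)} states, "
          f"range 0-{highest}, highest pattern {highest:0{bits}b}")
```

Each additional bit doubles the number of states, which is why moving from 7 to 8 bits takes a set from 128 to 256 characters.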

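Eight bits are called a byte. One-byte character sets can contain 256 characters. The current standard, though, is Unicode, which aims to represent all characters in all writing systems in the world in a single set. Early versions of Unicode used two bytes per character; today the number of bytes depends on the encoding. UTF-8 uses one to four bytes per character, and UTF-16 uses two or four.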
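
How many bytes a Unicode character occupies depends on the encoding chosen. The sketch below, using only Python's built-in `str.encode`, compares UTF-8 and UTF-16 sizes for a few sample characters. The helper name `encoded_sizes` and the choice of samples are assumptions made for illustration.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the code point and the byte length of `ch` in UTF-8 and UTF-16."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf-8": len(ch.encode("utf-8")),
        # "utf-16-le" avoids counting the 2-byte byte-order mark.
        "utf-16": len(ch.encode("utf-16-le")),
    }

# A plain ASCII letter, an accented letter, the euro sign, and an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    print(ch, encoded_sizes(ch))
```

ASCII letters stay one byte in UTF-8, while characters outside the first 65,536 code points need four bytes in both encodings.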

The original ASCII was a 7-bit character set (128 possible characters) with no accented letters. This was used in teletype machines. (The eighth bit was originally used to check parity, a way to look for errors.) IBM and Apple both created extended character sets for their personal computers, using the eighth bit to double the number of characters. As competitors, they didn't use the same characters in the same positions in their sets. Thus 8-bit character sets and incompatibility were born. For example, old Microsoft DOS (code page 437) used character 130 for the é, but the old Macs used character 142. Character 130 on the Mac was Ç. Windows later added a third position, placing é at character 233. Today's standards have reduced such problems.
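
The incompatibility can be reproduced with Python's built-in codecs. This sketch assumes that "old DOS" corresponds to the `cp437` codec and "old Macs" to `mac_roman`; the helper `decode_byte` is hypothetical.

```python
def decode_byte(value: int, codec: str) -> str:
    """Interpret a single byte value under the given 8-bit character set."""
    return bytes([value]).decode(codec)

# The same byte value yields different characters on different platforms.
print(130, "in cp437     ->", decode_byte(130, "cp437"))
print(130, "in mac_roman ->", decode_byte(130, "mac_roman"))
print(142, "in mac_roman ->", decode_byte(142, "mac_roman"))
```

A file containing byte 130 would therefore show é on one machine and Ç on the other unless something translated it.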

On the early Internet, much of the mail infrastructure was designed to transmit only 7-bit codes. In order to send more complex data, encoding schemes were developed to translate that data (e.g. 8-bit text, binary files such as graphics) into something that could fit through a 7-bit pipeline. One such encoding scheme is MIME (Multipurpose Internet Mail Extensions), which actually comprises many different schemes. In order for MIME to work, two elements need to be defined for the content: a content format or character set (what characters or other content are to be represented) and an encoding scheme (what codes will be used to represent these characters).

A common code used for accented characters is Quoted-Printable. Any extended characters (higher than 127) are encoded using a string of three symbols: an equals sign followed by the two hexadecimal digits of the character's code. For example, é in ISO-8859-1 is =E9 (hexadecimal E9 is decimal 233). 8BIT (essentially unencoded 8-bit character data) is also a valid MIME code and is the most common way to send characters with accents today.
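
Python's standard `quopri` module implements Quoted-Printable, so the =E9 example can be tried directly. This is a minimal sketch; the wrapper names `to_quoted_printable` and `from_quoted_printable` are invented here, and the default character set of ISO-8859-1 is an assumption matching the example above.

```python
import quopri

def to_quoted_printable(text: str, charset: str = "iso-8859-1") -> bytes:
    """Encode text to bytes in `charset`, then to 7-bit-safe Quoted-Printable."""
    return quopri.encodestring(text.encode(charset))

def from_quoted_printable(data: bytes, charset: str = "iso-8859-1") -> str:
    """Reverse the process: undo Quoted-Printable, then decode the bytes."""
    return quopri.decodestring(data).decode(charset)

encoded = to_quoted_printable("caf\u00e9")
print(encoded)                         # only ASCII bytes remain
print(from_quoted_printable(encoded))  # round-trips to the original text
```

Note that the three-symbol code depends on the character set: the same é encoded from UTF-8 becomes two escaped bytes rather than one.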

To allow a code to function on two different kinds of machines with different operating systems and different built-in character sets, we all have to agree on standard character sets into which we will translate. The International Organization for Standardization (ISO) has established such standards. For instance, the standard 8-bit character set for Western European languages is ISO Latin-1 (ISO-8859-1). As long as a computer knows which character set is being used, it can be programmed to translate and display those characters, no matter what the computer's native character set may be. The é is character 233 (hexadecimal E9) in ISO Latin-1.
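
Translating between a machine's native set and an agreed standard amounts to decoding with one character set and re-encoding with another. The sketch below uses Python's codec names `mac_roman`, `cp437`, and `iso-8859-1` as stand-ins for the sets discussed here; the function `transcode` is a hypothetical helper, not part of any standard.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Reinterpret bytes from the `source` character set in the `target` set."""
    return data.decode(source).encode(target)

# An é typed on an old Mac (byte 142) and on old DOS (byte 130)
# both map to the same byte once translated to ISO-8859-1.
from_mac = transcode(bytes([142]), "mac_roman", "iso-8859-1")
from_dos = transcode(bytes([130]), "cp437", "iso-8859-1")
print(list(from_mac), list(from_dos))
```

In Python's `iso-8859-1` codec the é lands on byte 233 (hex E9), which is the value that Quoted-Printable writes as =E9.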

A MIME-compliant email program will use the email headers to keep track of which character set and which encoding scheme are applied to each email message. A web browser will do the same. This lets the program decode the characters and display them correctly on any given machine, so the entire encoding system is transparent to the user (the user doesn't notice it). For MIME Quoted-Printable in Western European languages, these headers might look like:

    X-Mailer: QUALCOMM Windows Eudora Version 5.1
    Mime-Version: 1.0
    Content-type: text/plain; charset=iso-8859-1
    Content-transfer-encoding: quoted-printable
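
As a rough illustration of how a mail program might use these headers, the sketch below parses a small message with Python's standard `email` package. The message text in `raw` and the helper `read_body` are invented for this example; a real client would handle multipart messages and missing or incorrect headers as well.

```python
from email import message_from_string

# A made-up message using the header values shown above.
raw = (
    "Mime-Version: 1.0\n"
    "Content-Type: text/plain; charset=iso-8859-1\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "caf=E9\n"
)

def read_body(raw_message: str) -> str:
    """Decode a single-part message body using its own MIME headers."""
    msg = message_from_string(raw_message)
    # Character set named in Content-Type; fall back to ASCII if absent.
    charset = msg.get_content_charset() or "us-ascii"
    # decode=True undoes the Content-Transfer-Encoding (here Quoted-Printable).
    payload = msg.get_payload(decode=True)
    return payload.decode(charset)

print(read_body(raw))
```

The two headers are applied in order: the transfer encoding is undone first to recover the raw bytes, and the character set then says how those bytes map to characters.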
