Between 1950 and 2000, each manufacturer and each operating system created its own 8-bit encoding. The problem is that 8 bits (256 code points) are not enough to store every character, so each encoding was tailored to the language of its users. Most 8-bit encodings can encode several languages, usually geographically close ones (e.g. ISO-8859-1 is intended for Western Europe).
It was difficult to exchange documents written in different languages, because decoding a document with the wrong encoding leads to mojibake.
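As a minimal sketch of how mojibake appears, the snippet below encodes a French word with ISO-8859-1 and then decodes the same bytes with KOI8-R. The sample word and the choice of KOI8-R as the "wrong" encoding are arbitrary assumptions made for illustration.

```python
# Encode text with ISO-8859-1, then decode the bytes with another 8-bit
# encoding. No error is raised: the accented letter is silently replaced
# by an unrelated character, which is what mojibake looks like.
text = "café"
data = text.encode("iso-8859-1")

garbled = data.decode("koi8_r")        # wrong encoding: mojibake
restored = data.decode("iso-8859-1")   # right encoding: original text

print(repr(garbled))
print(repr(restored))
```

Only the non-ASCII letter is damaged, because both encodings agree on the ASCII range.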
6.1. ASCII
The ASCII encoding is supported by all applications. A document encoded in ASCII can be decoded by almost any other encoding, because nearly all 7-bit and 8-bit encodings are supersets of ASCII, to stay compatible with it. A notable exception is the JIS X 0201 encoding: 0x5C is decoded to the yen sign (U+00A5, ¥) instead of a backslash (U+005C, \).
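The compatibility described above can be checked with a short sketch: the same ASCII bytes are decoded with several unrelated encodings and always give the same text. The list of encodings is an arbitrary selection of codecs shipped with Python, and the sample deliberately avoids the backslash because of the JIS exception.

```python
# Decode the same ASCII-only bytes with several encodings.
data = b"Hello, World! 123"
encodings = ["ascii", "iso-8859-1", "iso-8859-15", "cp1252",
             "koi8_r", "gbk", "euc_jp"]

decoded = {encoding: data.decode(encoding) for encoding in encodings}

for encoding, text in decoded.items():
    print(f"{encoding:12} {text!r}")
```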
ASCII is the smallest encoding: it contains only 128 codes, including 95 printable characters (letters, digits, punctuation signs and a few other characters) and 33 control codes. Control codes are used to control the terminal. For example, the “line feed” (code point 10, usually written "\n") marks the end of a line. Some control codes have a special effect. For example, the “bell” (code point 7, written "\a") is sent to ring a bell.
| | -0 | -1 | -2 | -3 | -4 | -5 | -6 | -7 | -8 | -9 | -a | -b | -c | -d | -e | -f |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0- | NUL | � | � | � | � | � | � | BEL | � | TAB | LF | � | � | CR | � | � |
| 1- | � | � | � | � | � | � | � | � | � | � | � | ESC | � | � | � | � |
| 2- |   | ! | " | # | $ | % | & | ' | ( | ) | * | + | , | - | . | / |
| 3- | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | : | ; | < | = | > | ? |
| 4- | @ | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O |
| 5- | P | Q | R | S | T | U | V | W | X | Y | Z | [ | \ | ] | ^ | _ |
| 6- | ` | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o |
| 7- | p | q | r | s | t | u | v | w | x | y | z | { | \| | } | ~ | DEL |
0x00-0x1F and 0x7F are control codes:
NUL (0x00): null character (U+0000, "\0")
BEL (0x07): sent to ring a bell (U+0007, "\a")
TAB (0x09): horizontal tabulation (U+0009, "\t")
LF (0x0A): line feed (U+000A, "\n")
CR (0x0D): carriage return (U+000D, "\r")
ESC (0x1B): escape (U+001B)
DEL (0x7F): delete (U+007F)
other control codes are displayed as � in this table
0x20 is a space.
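The counts given above (95 printable characters and 33 control codes) can be verified with a small sketch. It relies on the Unicode general category "Cc" (control) and on `str.isprintable()`, which treats the ASCII space as printable.

```python
import unicodedata

# All 128 ASCII code points, as one-character strings.
codes = [chr(value) for value in range(128)]

# Control codes have the Unicode general category "Cc".
control = [ch for ch in codes if unicodedata.category(ch) == "Cc"]
printable = [ch for ch in codes if ch.isprintable()]

print(len(control), len(printable))

# Escape sequences for two of the control codes mentioned above.
print(ord("\n"), ord("\a"))
```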
Note
The first 128 code points of the Unicode charset (U+0000-U+007F) are the ASCII charset: Unicode is a superset of ASCII.
6.2. ISO 8859 family
| Year | Standard | Description | Variant |
|---|---|---|---|
| 1987 | ISO 8859-1 | Western European: German, French, Italian, … | cp1252 |
| 1987 | ISO 8859-2 | Central European: Croatian, Polish, Czech, … | cp1250 |
| 1988 | ISO 8859-3 | South European: Turkish and Esperanto | |
| 1988 | ISO 8859-4 | North European | |
| 1988 | ISO 8859-5 | Latin/Cyrillic: Macedonian, Russian, … | KOI family |
| 1987 | ISO 8859-6 | Latin/Arabic: Arabic language characters | cp1256 |
| 1987 | ISO 8859-7 | Latin/Greek: modern Greek language | cp1253 |
| 1988 | ISO 8859-8 | Latin/Hebrew: modern Hebrew alphabet | cp1255 |
| 1989 | ISO 8859-9 | Turkish: largely the same as ISO 8859-1 | cp1254 |
| 1992 | ISO 8859-10 | Nordic: a rearrangement of Latin-4 | |
| 2001 | ISO 8859-11 | Latin/Thai: Thai language | TIS 620, cp874 |
| 1998 | ISO 8859-13 | Baltic Rim: Baltic languages | cp1257 |
| 1998 | ISO 8859-14 | Celtic: Gaelic, Breton | |
| 1999 | ISO 8859-15 | Revision of 8859-1: euro sign | cp1252 |
| 2001 | ISO 8859-16 | South-Eastern European | |
Note
ISO 8859-12 doesn’t exist.
6.2.1. ISO 8859-1
ISO/IEC 8859-1, also known as “Latin-1” or “ISO-8859-1”, is a superset of ASCII: it adds 128 code points, mostly Latin letters with diacritics, and 32 control codes. It is used in the USA and in Western Europe.
| | -0 | -1 | -2 | -3 | -4 | -5 | -6 | -7 | -8 | -9 | -a | -b | -c | -d | -e | -f |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0- | NUL | � | � | � | � | � | � | BEL | � | TAB | LF | � | � | CR | � | � |
| 1- | � | � | � | � | � | � | � | � | � | � | � | ESC | � | � | � | � |
| 2- |   | ! | " | # | $ | % | & | ' | ( | ) | * | + | , | - | . | / |
| 3- | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | : | ; | < | = | > | ? |
| 4- | @ | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O |
| 5- | P | Q | R | S | T | U | V | W | X | Y | Z | [ | \ | ] | ^ | _ |
| 6- | ` | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o |
| 7- | p | q | r | s | t | u | v | w | x | y | z | { | \| | } | ~ | DEL |
| 8- | � | � | � | � | � | � | � | � | � | � | � | � | � | � | � | � |
| 9- | � | � | � | � | � | � | � | � | � | � | � | � | � | � | � | � |
| a- | NBSP | ¡ | ¢ | £ | ¤ | ¥ | ¦ | § | ¨ | © | ª | « | ¬ | SHY | ® | ¯ |
| b- | ° | ± | ² | ³ | ´ | µ | ¶ | · | ¸ | ¹ | º | » | ¼ | ½ | ¾ | ¿ |
| c- | À | Á | Â | Ã | Ä | Å | Æ | Ç | È | É | Ê | Ë | Ì | Í | Î | Ï |
| d- | Ð | Ñ | Ò | Ó | Ô | Õ | Ö | × | Ø | Ù | Ú | Û | Ü | Ý | Þ | ß |
| e- | à | á | â | ã | ä | å | æ | ç | è | é | ê | ë | ì | í | î | ï |
| f- | ð | ñ | ò | ó | ô | õ | ö | ÷ | ø | ù | ú | û | ü | ý | þ | ÿ |
U+0000-U+001F, U+007F and U+0080-U+009F are control codes (displayed as � in this table). See the ASCII table for the U+0000-U+001F and U+007F control codes.
“NBSP” (U+00A0) is a non-breaking space and “SHY” (U+00AD) is a soft hyphen.
Note
The first 256 code points of the Unicode charset (U+0000-U+00FF) are the ISO-8859-1 charset: Unicode is a superset of ISO-8859-1.
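A short sketch can confirm this note: decoding every possible byte value with ISO-8859-1 should yield the Unicode code point with the same number, and the first half should match the ASCII decoder.

```python
# Decode all 256 byte values with ISO-8859-1 and look at the code points.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("iso-8859-1")
code_points = [ord(ch) for ch in decoded]

# The first 128 bytes decode identically with the ASCII codec.
ascii_part = bytes(range(128)).decode("ascii")

print(code_points == list(range(256)))
print(ascii_part == decoded[:128])
```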
6.2.2. cp1252
Windows code page 1252, best known as cp1252, is a variant of ISO 8859-1. It is the default encoding of all English and Western European Windows setups. Web browsers use it as a fallback if the web page doesn’t provide any encoding information (neither in HTML nor in HTTP).
cp1252 shares 224 code points with ISO-8859-1; only the range 0x80-0x9F (32 code points, 5 of which are not assigned in cp1252) is different. In ISO-8859-1, this range contains 32 control codes (not printable).
| Code point | ISO-8859-1 | cp1252 | Code point | ISO-8859-1 | cp1252 |
|---|---|---|---|---|---|
| 0x80 | U+0080 | € (U+20AC) | 0x90 | U+0090 | not assigned |
| 0x81 | U+0081 | not assigned | 0x91 | U+0091 | ‘ (U+2018) |
| 0x82 | U+0082 | ‚ (U+201A) | 0x92 | U+0092 | ’ (U+2019) |
| 0x83 | U+0083 | ƒ (U+0192) | 0x93 | U+0093 | “ (U+201C) |
| 0x84 | U+0084 | „ (U+201E) | 0x94 | U+0094 | ” (U+201D) |
| 0x85 | U+0085 | … (U+2026) | 0x95 | U+0095 | • (U+2022) |
| 0x86 | U+0086 | † (U+2020) | 0x96 | U+0096 | – (U+2013) |
| 0x87 | U+0087 | ‡ (U+2021) | 0x97 | U+0097 | — (U+2014) |
| 0x88 | U+0088 | ˆ (U+02C6) | 0x98 | U+0098 | ˜ (U+02DC) |
| 0x89 | U+0089 | ‰ (U+2030) | 0x99 | U+0099 | ™ (U+2122) |
| 0x8A | U+008A | Š (U+0160) | 0x9A | U+009A | š (U+0161) |
| 0x8B | U+008B | ‹ (U+2039) | 0x9B | U+009B | › (U+203A) |
| 0x8C | U+008C | Œ (U+0152) | 0x9C | U+009C | œ (U+0153) |
| 0x8D | U+008D | not assigned | 0x9D | U+009D | not assigned |
| 0x8E | U+008E | Ž (U+017D) | 0x9E | U+009E | ž (U+017E) |
| 0x8F | U+008F | not assigned | 0x9F | U+009F | Ÿ (U+0178) |
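The figures above (224 shared code points, 32 different ones, 5 of them unassigned) can be recomputed with a sketch like the following. `decode_byte` is a hypothetical helper written for this example, and the result assumes Python's `cp1252` codec, which rejects the five unassigned bytes.

```python
def decode_byte(value, encoding):
    """Decode a single byte; return None if the byte is not assigned."""
    try:
        return bytes([value]).decode(encoding)
    except UnicodeDecodeError:
        return None

same, different, unassigned = [], [], []
for value in range(256):
    latin1 = decode_byte(value, "iso-8859-1")
    cp1252 = decode_byte(value, "cp1252")
    if cp1252 is None:
        unassigned.append(value)      # byte has no meaning in cp1252
    elif cp1252 == latin1:
        same.append(value)            # both encodings agree
    else:
        different.append(value)       # printable in cp1252, control in Latin-1

print(len(same), len(different), [hex(value) for value in unassigned])
```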
6.2.3. ISO 8859-15
ISO/IEC 8859-15, also known as Latin-9 or ISO-8859-15, is a variant of ISO 8859-1. 248 code points are identical, 8 are different:
| Code point | ISO-8859-1 | ISO-8859-15 | Code point | ISO-8859-1 | ISO-8859-15 |
|---|---|---|---|---|---|
| 0xA4 | ¤ (U+00A4) | € (U+20AC) | 0xB8 | ¸ (U+00B8) | ž (U+017E) |
| 0xA6 | ¦ (U+00A6) | Š (U+0160) | 0xBC | ¼ (U+00BC) | Œ (U+0152) |
| 0xA8 | ¨ (U+00A8) | š (U+0161) | 0xBD | ½ (U+00BD) | œ (U+0153) |
| 0xB4 | ´ (U+00B4) | Ž (U+017D) | 0xBE | ¾ (U+00BE) | Ÿ (U+0178) |
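The euro sign is a convenient way to see the three Western European variants disagree. The sketch below encodes it with ISO-8859-15, cp1252 and ISO-8859-1; the variable names are illustrative only.

```python
euro = "\u20ac"  # euro sign

latin9 = euro.encode("iso-8859-15")   # 0xA4 in ISO-8859-15
cp1252 = euro.encode("cp1252")        # 0x80 in cp1252

# ISO-8859-1 has no euro sign at all: encoding fails.
try:
    euro.encode("iso-8859-1")
    latin1_can_encode = True
except UnicodeEncodeError:
    latin1_can_encode = False

# The byte 0xA4 means "currency sign" when read as ISO-8859-1.
as_latin1 = latin9.decode("iso-8859-1")

print(latin9, cp1252, latin1_can_encode, as_latin1)
```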
6.3. CJK: Asian encodings
6.3.1. Chinese encodings
GBK is a family of Chinese charsets using multibyte encodings:
GB 2312 (1980): includes 6,763 Chinese characters
GBK (1993) (code page 936)
GB 18030 (2005, last revision in 2006)
HZ (1989) (HZ-GB-2312)
Other encodings: Big5 (大五碼, Big Five Encoding, 1984), cp950.
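To give a feel for these multibyte encodings, the sketch below encodes a two-character sample with the codecs that Python ships for them. The sample text is an arbitrary choice, and no particular byte values are claimed beyond what the code prints.

```python
text = "中文"
encodings = ["gb2312", "gbk", "gb18030", "hz", "big5", "cp950"]

# Encode the same text with each Chinese encoding.
encoded = {encoding: text.encode(encoding) for encoding in encodings}

for encoding, data in encoded.items():
    print(f"{encoding:8} {len(data)} bytes  {data!r}")

# ASCII characters still take a single byte in a multibyte encoding.
ascii_in_gbk = "A".encode("gbk")
```

GB 2312, GBK and GB 18030 produce the same two bytes per character for this sample, while HZ wraps 7-bit bytes in shift sequences and Big5 uses different byte values.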
6.3.2. Japanese encodings
JIS is a family of Japanese encodings:
JIS X 0201 (1969): all code points are encoded to 1 byte
16 bits:
JIS X 0208 (first version in 1978: “JIS C 6226”, last revision in 1997): code points are encoded to 1 or 2 bytes
JIS X 0212 (1990), extends the JIS X 0208 charset: it is only a charset. Use EUC-JP or ISO 2022 to encode it.
JIS X 0213 (first version in 2000, last revision in 2004: EUC JIS X 2004), EUC JIS X 0213: it is only a charset, use EUC-JP, ISO 2022 or ShiftJIS 2004 to encode it.
JIS X 0211 (1994), based on ISO/IEC 6429
Microsoft encodings:
Shift JIS
Windows code page 932 (cp932): extension of Shift JIS
In strict mode (flags=MB_ERR_INVALID_CHARS), cp932 cannot decode bytes in the 0x81-0xA0 and 0xE0-0xFF ranges. By default (flags=0), 0x81-0x9F and 0xE0-0xFC are decoded as U+30FB (Katakana middle dot), 0xA0 as U+F8F0, 0xFD as U+F8F1, 0xFE as U+F8F2 and 0xFF as U+F8F3 (U+E000-U+F8FF is reserved for private use).
The JIS family causes mojibake on MS-DOS and Microsoft Windows because the yen sign (U+00A5, ¥) is encoded to 0x5C, which is a backslash (U+005C, \) in ASCII. For example, “C:\Windows\win.ini” is displayed “C:¥Windows¥win.ini”. The backslash is encoded to 0x81 0x5F.
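The path example can be reproduced with a toy decoder. `decode_jisx0201_roman` below is a hypothetical helper, not a real codec: it only models the 7-bit half of JIS X 0201, which matches ASCII except for 0x5C (yen sign) and 0x7E (overline, a second difference not discussed above).

```python
# Hypothetical, simplified model of the 7-bit (Roman) half of JIS X 0201.
JISX0201_ROMAN_OVERRIDES = {
    0x5C: "\u00a5",  # yen sign instead of backslash
    0x7E: "\u203e",  # overline instead of tilde
}

def decode_jisx0201_roman(data: bytes) -> str:
    """Decode 7-bit JIS X 0201 bytes; the katakana half is not modelled."""
    chars = []
    for byte in data:
        if byte >= 0x80:
            raise ValueError(f"byte 0x{byte:02X} is outside the 7-bit range")
        chars.append(JISX0201_ROMAN_OVERRIDES.get(byte, chr(byte)))
    return "".join(chars)

path = b"C:\\Windows\\win.ini"
as_ascii = path.decode("ascii")
as_jis = decode_jisx0201_roman(path)

print(as_ascii)
print(as_jis)
```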
To encode Japanese, there is also the ISO/IEC 2022 encoding family.
6.3.3. ISO 2022
ISO/IEC 2022 is an encoding family:
ISO-2022-JP: JIS X 0201-1976, JIS X 0208-1978, JIS X 0208-1983
ISO-2022-JP-1: JIS X 0212-1990
ISO-2022-JP-2: GB 2312-1980, KS X 1001-1992, ISO/IEC 8859-1, ISO/IEC 8859-7
ISO-2022-JP-3: JIS X 0201-1976, JIS X 0213-2000 (planes 1 and 2)
ISO-2022-JP-2004: JIS X 0213-2004
ISO-2022-KR: KS X 1001-1992
ISO-2022-CN: GB 2312-1980, CNS 11643-1992 (planes 1 and 2)
ISO-2022-CN-EXT: ISO-IR-165, CNS 11643-1992 (planes 3 through 7)
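The list above only names the charsets behind each encoding. As background not stated in this section, ISO 2022 encodings are 7-bit and switch between charsets with escape sequences. The sketch below uses Python's `iso2022_jp` codec to make those sequences visible; the sample text is arbitrary.

```python
ESC = b"\x1b"

text = "abc日本語"
data = text.encode("iso2022_jp")

# ESC $ B switches to JIS X 0208, ESC ( B switches back to ASCII.
switch_to_jis = ESC + b"$B"
switch_to_ascii = ESC + b"(B"

print(data)
print(data.count(ESC), "escape sequences")
```

Every byte of the result stays below 0x80, which is why such encodings could travel through 7-bit channels.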
6.3.4. Extended Unix Code (EUC)
EUC-CN: GB 2312
EUC-JP: JIS X 0208, JIS X 0212, JIS X 0201
EUC-KR: KS X 1001, KS X 1003
EUC-TW: CNS 11643 (16 planes)
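A brief sketch of three of these encodings with Python's codecs follows (Python ships no EUC-TW codec, so it is left out). The sample strings are arbitrary, and the fact that `euc_cn` is an alias of the `gb2312` codec is specific to Python.

```python
samples = {
    "euc_jp": "日本",
    "euc_kr": "한국",
    "euc_cn": "中国",
}

encoded = {encoding: text.encode(encoding) for encoding, text in samples.items()}

for encoding, data in encoded.items():
    # Each character of these samples takes two bytes with the high bit set.
    print(f"{encoding:7} {data!r}")

# In Python, EUC-CN and GB 2312 are the same codec.
same_as_gb2312 = encoded["euc_cn"] == samples["euc_cn"].encode("gb2312")
```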
6.4. Cyrillic
KOI family, “Код Обмена Информацией”:
KOI-7: oldest KOI encoding (ASCII + some characters)
KOI8-R: Russian
KOI8-U: Ukrainian
Variants: ECMA-Cyrillic, KOI8-Unified, cp1251, MacUkrainian, Bulgarian MIK, …
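The practical consequence of having several Cyrillic encodings is that the same word maps to different bytes in each. The sketch below compares a few of them using Python's codecs; the sample word and the selection of encodings are arbitrary.

```python
text = "Привет"
encodings = ["koi8_r", "koi8_u", "cp1251", "iso8859_5"]

encoded = {encoding: text.encode(encoding) for encoding in encodings}

for encoding, data in encoded.items():
    print(f"{encoding:10} {data.hex(' ')}")

# Reading KOI8-R bytes as cp1251 gives Cyrillic letters, but the wrong ones.
misread = encoded["koi8_r"].decode("cp1251")
print(misread)
```

KOI8-R and KOI8-U agree on Russian letters, so this sample encodes identically in both; they differ only for the few extra Ukrainian letters.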