I've got this set of text files that have a lot of C2 bytes in them, usually prefixing characters that I think of as 'extended ASCII', like 97 for an em dash. Suspecting this might be something to do with this new-fangled Unicode stuff, I had Firefox display one: it correctly rendered the 97s and hid the preceding C2s, and reported the encoding as UTF-8.
So I happily went off to look at a UTF-8 table, only to find that C2 97 is a control character, and not an em dash.
But this page has the correct translation.
So thinking about it really hard - till it hurts:
- In the old extended ASCII codepage 1252 (Windows-1252, often confused with ISO Latin-1, which has no printable characters at 80-9F), 97 hex is an em dash
- UTF-8 is one of the modern systems for encoding Unicode code points, which run from U+0000 to U+10FFFF (21 bits at most). There are several such systems:
  - UTF-32, which has a nice sensible invariant 4-byte-wide character
  - UTF-16 and UTF-8, which use variable-length characters, each aiming to give a certain subset of characters the shortest code
  - UTF-8 favours original ASCII characters 00-7F: they all get 1-byte characters, but higher codes take up to 4 bytes to express (the original design allowed up to 6)
- But we digress.
- When code point 97 is encoded in UTF-8, it translates to C2 97
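To make the size trade-off between the three encodings concrete, here is a small Python sketch. Python is just my choice for illustration, and `encoded_sizes` is a made-up helper name, not anything from the pages above. It reports how many bytes a single character takes in each encoding:

```python
def encoded_sizes(ch):
    """Return (utf8, utf16, utf32) byte counts for a single character."""
    # The "-le" codec variants are used so that no byte order mark is added.
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )


samples = [
    ("A", "A"),                  # plain ASCII
    ("U+0097", "\u0097"),        # the code point discussed here
    ("U+2014", "\u2014"),        # the Unicode em dash
    ("U+1F600", "\U0001F600"),   # a character outside the 16-bit range
]

for label, ch in samples:
    print(label, encoded_sizes(ch))
```

UTF-32 is always 4 bytes, UTF-16 is 2 or 4, and UTF-8 ranges from 1 to 4 depending on how large the code point is.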
Why does it translate thusly?
I've adapted this from http://sydney.edu.au/engineering/it/~graphapp/package/src/utility/utf8.c:

    Max Sig Bits  Pattern
    ------------  -------
     7            0xxxxxxx
    11            110xxxxx 10xxxxxx
    16            1110xxxx 10xxxxxx 10xxxxxx
    21            11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    26            111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    31            1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The 5- and 6-byte rows come from the original UTF-8 design; modern UTF-8 stops at 4 bytes, because Unicode itself stops at U+10FFFF.
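The table can be read as a lookup from "number of significant bits" to "number of bytes". A minimal Python sketch of that lookup, restricted to the four rows that modern UTF-8 still uses (`utf8_sequence_length` is a hypothetical name of my own, not a function from utf8.c):

```python
def utf8_sequence_length(code_point):
    """Number of bytes UTF-8 needs for a code point, read off the table."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range U+0000..U+10FFFF")
    significant_bits = code_point.bit_length()
    # (max significant bits, sequence length) for the 1- to 4-byte rows.
    for max_bits, length in ((7, 1), (11, 2), (16, 3), (21, 4)):
        if significant_bits <= max_bits:
            return length


for cp in (0x41, 0x97, 0x2014, 0x1F600):
    # Cross-check against Python's own encoder.
    print(hex(cp), utf8_sequence_length(cp), len(chr(cp).encode("utf-8")))
```

It ignores the surrogate range D800-DFFF, which fits the 3-byte row by bit count but is not allowed in UTF-8.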
So for example, 00-7F have 7 significant bits - xxxxxxx, or to put it more helpfully, abcdefg - and these are rendered 0abcdefg. So 00-7F translate to single bytes 00 to 7F.
But our character, 97 (1001 0111), has 8 significant bits, which means it falls in the 11-bit category, and is rendered 110xxxxx 10xxxxxx, or again to put it more helpfully, 110abcde 10fghijk. Thusly:

           9    7
        1001 0111
    000 1001 0111
    abc defg hijk
    abcde fghijk
    00010 010111
    110abcde 10fghijk
    11000010 10010111
    1100 0010 1001 0111
    C    2    9    7
Ta da!
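The same bit shuffling can be written as a couple of shifts and masks. This is only a sketch of the 2-byte row of the table, in Python, with a function name (`encode_two_byte`) that I made up for the purpose:

```python
def encode_two_byte(code_point):
    """Encode a code point in the 8- to 11-bit range as 110abcde 10fghijk."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only code points 0x80..0x7FF use the 2-byte form")
    lead = 0b11000000 | (code_point >> 6)            # 110 + top five bits (abcde)
    trail = 0b10000000 | (code_point & 0b00111111)   # 10 + low six bits (fghijk)
    return bytes([lead, trail])


encoded = encode_two_byte(0x97)
print(encoded.hex())                       # c297
print(encoded == chr(0x97).encode("utf-8"))  # True
```

The second print checks the hand-rolled result against Python's built-in UTF-8 encoder.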
The only remaining mystery is why the first chart I consulted listed C2 97 as the translation for a control character. That character is U+0097 'END OF GUARDED AREA', which is what code point 97 means in Unicode; the Unicode em dash proper is U+2014, which UTF-8 encodes as E2 80 94.
Perhaps it is because UTF-8 is just a coding scheme - a dumb algorithm for converting code point bit patterns to byte groups - and says nothing about which character a number stands for. UTF-8 doesn't know or care what 97 means, it just knows to translate it to C2 97.
In fact yes, that's it.
- When the text files in question were originally created, an em dash was turned into a codepage 1252 extended ASCII character 97.
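- The files were later UTF-8 encoded by something that took byte 97 at face value as code point U+0097, rather than mapping it to the Unicode em dash U+2014 first, and this turned 97 into C2 97
- My text editor rendered that as two characters - C2 (Â) and 97 (—) - because it only understands 8-bit extended ASCII
- But Firefox took C2 97 as a 2-byte encoding of 97, and duly rendered a single character, evidently showing code point 97 as the em dash it would be in codepage 1252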
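The whole chain can be replayed in a few lines of Python. This is a sketch under one assumption I can't verify from the files themselves: that whatever converted them read the bytes as Latin-1 (byte value equals code point) before writing UTF-8.

```python
original = "\u2014"                     # the em dash the author typed

legacy = original.encode("cp1252")      # stored as the single byte 97
# Assumed conversion step: bytes read as Latin-1, so 97 becomes U+0097.
misread = legacy.decode("latin-1")
stored = misread.encode("utf-8")        # written out as C2 97

# What an editor that only knows codepage 1252 shows for those two bytes.
editor_view = stored.decode("cp1252")

# Undo the damage: back to the raw byte, then decode with the right codepage.
repaired = stored.decode("utf-8").encode("latin-1").decode("cp1252")

print(legacy.hex(), stored.hex())
print(editor_view == "\u00c2\u2014")    # A-circumflex followed by an em dash
print(repaired == original)
```

The last step only works cleanly if every C2-prefixed byte in a file came from the same kind of conversion, so I'd try it on a copy first.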