Zoegond's notes: October 2014

Friday, October 31, 2014

Unicode thinking aloud

I've got this set of text files that have a lot of C2 bytes in them, usually prefixing characters that I think of as 'extended ASCII', like 97 for an em dash. Suspecting this might be something to do with this new-fangled Unicode stuff, I had Firefox display one: it correctly rendered the 97s and hid the preceding C2s, and reported the encoding as UTF-8.

So I happily went off to look at a UTF-8 table, only to find that C297 is a control character, and not an em dash.

But this page has the correct translation.

So thinking about it really hard - till it hurts:

In the old extended ASCII codepage 1252 (Windows-1252) character set, 97 hex is an em dash. Codepage 1252 often gets lumped in with ISO Latin-1, but in ISO 8859-1 proper, 80-9F are control codes.
UTF-8 is one of the modern systems for encoding Unicode characters, whose code points run from 0 to 10FFFF hex (21 bits at most). There are several such systems:
- UTF-32 which has a nice sensible invariant 4-byte-wide character
- UTF-16 and UTF-8 which use variable-length characters, each aiming to give a certain specific subset of characters the shortest code
- UTF-8 favours original ASCII characters 00-7F - they all get 1-byte characters, but higher codes take up to 4 bytes to express (the original design allowed up to 6)
- But we digress.
When 97 is encoded in UTF-8, it translates to C2 97
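
To check the points above, here is a small Python sketch (my own, not from any of the pages I consulted). It assumes the byte 97 can be read either as codepage 1252 or as ISO Latin-1, and shows what each reading turns into once UTF-8 encoded. The variable names are just for illustration.

```python
# One byte, 0x97, read under two different 8-bit character sets.
raw = bytes([0x97])

as_cp1252 = raw.decode("cp1252")    # Windows-1252: an em dash
as_latin1 = raw.decode("latin-1")   # ISO 8859-1: a C1 control code

# Show the Unicode code point and its UTF-8 bytes for each reading.
print(f"U+{ord(as_cp1252):04X}", as_cp1252.encode("utf-8").hex(" "))  # U+2014 e2 80 94
print(f"U+{ord(as_latin1):04X}", as_latin1.encode("utf-8").hex(" "))  # U+0097 c2 97
```

Only the Latin-1 reading gives C2 97, which suggests that is how the byte was interpreted when my files were converted.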

Why does it translate thusly?

I've adapted this from http://sydney.edu.au/engineering/it/~graphapp/package/src/utility/utf8.c:

Max Sig Bits  Pattern
------------  -------
           7  0xxxxxxx
          11  110xxxxx 10xxxxxx
          16  1110xxxx 10xxxxxx 10xxxxxx
          21  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
          26  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
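          31  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The 5- and 6-byte rows date from the original design. They are no longer valid UTF-8, since Unicode stops at 10FFFF, which fits in the 21-bit row.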

So for example, 00-7F have 7 significant bits - xxxxxxx, or to put it more helpfully, abcdefg - and these are rendered 0abcdefg. So 00-7F translate to single bytes 00 to 7F.
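
The table can be turned into code fairly directly. What follows is a minimal sketch of a hand-rolled encoder covering only the first four rows. The function name `utf8_encode` is my own invention, and it does not reject surrogates (D800-DFFF) the way a real encoder would, so treat it as an illustration rather than a replacement for the built-in.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand, following the bit patterns in the table."""
    if code_point < 0:
        raise ValueError("code point must not be negative")
    if code_point < 0x80:        # up to 7 bits: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x110000:    # up to 21 bits, capped at 10FFFF
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("beyond the Unicode range")


print(utf8_encode(0x41).hex(" "))  # 41
print(utf8_encode(0x97).hex(" "))  # c2 97
```

For any ordinary code point the result should match Python's own `chr(code_point).encode("utf-8")`.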

But our character, 97 (1001 0111) has 8 significant bits, which means it falls in the 11 bit category, and is rendered 110xxxxx 10xxxxxx, or again to put it more helpfully, 110abcde 10fghijk. Thusly:

             9    7
          1001 0111
      000 1001 0111
      abc defg hijk
      abcde   fghijk
      00010   010111
   110abcde 10fghijk
   11000010 10010111
  11000010  10010111
 1100 0010 1001 0111
    C    2    9    7

Ta da!
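
Going the other way is a good sanity check. This is a rough sketch, using nothing but bit masks, that takes the two bytes C2 97 and peels the payload bits back out. The names `lead` and `cont` are mine.

```python
lead, cont = 0xC2, 0x97

# Confirm the bytes have the 110xxxxx 10xxxxxx shape before trusting them.
assert lead & 0b11100000 == 0b11000000
assert cont & 0b11000000 == 0b10000000

# Keep the 5 payload bits of the lead byte and the 6 of the continuation byte.
code_point = ((lead & 0b00011111) << 6) | (cont & 0b00111111)

print(f"{lead:08b} {cont:08b} -> {code_point:011b} -> {code_point:02X}")
# 11000010 10010111 -> 00010010111 -> 97
```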

The only remaining mystery is why the first chart I consulted listed C2 97 as the translation for a control character. I wonder if there is some confusion with U+0097 'END OF GUARDED AREA'.

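Perhaps it is because UTF-8 is just a coding scheme - a dumb algorithm for converting code point numbers to byte groups - and says nothing about what a character means. UTF-8 doesn't know or care what 97 meant in codepage 1252, it just knows to translate the number 97 to C2 97. And in Unicode's own numbering, 97 really is that control character, U+0097. The em dash proper is U+2014.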

In fact yes, that's it.

When the text files in question were originally created, an em dash was turned into a codepage 1252 extended ASCII character 97.
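The files were UTF-8 encoded, presumably by something that took each byte value straight as a Unicode number, and this turned 97 into C2 97 (a conversion that knew about codepage 1252 would have produced U+2014, which is E2 80 94 in UTF-8)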
My text editor rendered that as two characters - C2 (Â) and 97 (—) - because it only understands 8-bit extended ASCII
But Firefox took C2 97 as a 2 byte encoding of 97, and duly rendered a single character 97 (—)
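
If that reading is right, the files can be repaired by mapping the stray control codes 80-9F back through codepage 1252. Here is a sketch under that assumption. The helper name `repair_c1` and the sample bytes are hypothetical, and I haven't tried it on anything beyond this kind of input.

```python
def repair_c1(text: str) -> str:
    """Reinterpret C1 control codes (U+0080-U+009F) as codepage 1252 characters."""
    out = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                pass  # a few bytes are unassigned in cp1252; leave those alone
        out.append(ch)
    return "".join(out)


sample = b"before\xc2\x97after"           # hypothetical line from one of the files

print(sample.decode("cp1252"))            # what my text editor shows: Â followed by a dash
fixed = repair_c1(sample.decode("utf-8"))
print(fixed.encode("utf-8").hex(" "))     # the dash is now e2 80 94
```

Everything outside 80-9F passes through untouched, so properly encoded text in the same file should survive.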
