Anti-nuisance lawsuit warning: The purpose of these notes is to remind me, Zoegond, of stuff or to help me work stuff out. They may contain mistakes.

Quick

  • ($a, $b....) = unpack("A2A7...", $packed)
  • push( array, list )

Friday, October 31, 2014

Unicode thinking aloud

I've got this set of text files that have a lot of C2 bytes in them, usually prefixing characters that I think of as 'extended ASCII', like 97 for an em dash. Suspecting this might be something to do with this new-fangled Unicode stuff, I had Firefox display one: it correctly rendered the 97s and hid the preceding C2s, and reported the encoding as UTF-8.

So I happily went off to look at a UTF-8 table, only to find that C2 97 is a control character, and not an em dash.

But this page has the correct translation.

So thinking about it really hard - till it hurts:

  • In the old 'extended ASCII' Windows codepage 1252, 97 hex is an em dash. (In ISO Latin-1 proper, 97 is a control code; codepage 1252 reuses the 80-9F range for printable characters.)
  • UTF-8 is one of the modern systems for encoding Unicode characters, whose code points run from 0 to 10FFFF hex (21 bits at most). There are several such systems:
    • UTF-32 which has a nice sensible invariant 4-byte-wide character
    • UTF-16 and UTF-8 which use variable-length characters, each aiming to give a certain specific subset of characters the shortest code
    • UTF-8 favours original ASCII characters 00-7F - they all get 1-byte characters, but higher codes take 2 to 4 bytes to express (the original design allowed up to 6, but the current standard stops at 4)
    • But we digress.
  • When 97 is encoded in UTF-8, it translates to C2 97
Why does it translate thusly?

I've adapted this from http://sydney.edu.au/engineering/it/~graphapp/package/src/utility/utf8.c:

Max Sig Bits  Pattern
------------  -------
           7  0xxxxxxx
          11  110xxxxx 10xxxxxx
          16  1110xxxx 10xxxxxx 10xxxxxx
          21  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
          26  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
          31  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
So for example, 00-7F have 7 significant bits - xxxxxxx, or to put it more helpfully, abcdefg - and these are rendered 0abcdefg. So 00-7F translate to single bytes 00 to 7F.
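
A quick way to see the first four rows of the table in action is to ask an existing encoder how many bytes it produces at each boundary. This is only a rough sketch in Python (my choice of language, nothing in these notes depends on it), and utf8_len is a made-up helper name:

```python
def utf8_len(code_point):
    """Number of bytes Python's built-in UTF-8 encoder uses for one code point."""
    return len(chr(code_point).encode("utf-8"))

# Values either side of the 7, 11, 16 and 21 bit boundaries in the table.
boundaries = [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]
lengths = {cp: utf8_len(cp) for cp in boundaries}

for cp, n in lengths.items():
    print(f"U+{cp:04X} needs {cp.bit_length():2d} bits -> {n} byte(s)")
```

The byte count steps up exactly where the number of significant bits crosses 7, 11 and 16.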

But our character, 97 (1001 0111) has 8 significant bits, which means it falls in the 11 bit category, and is rendered 110xxxxx 10xxxxxx, or again to put it more helpfully, 110abcde 10fghijk. Thusly:

             9    7
          1001 0111
      000 1001 0111
      abc defg hijk
      abcde   fghijk
      00010   010111
   110abcde 10fghijk
   11000010 10010111
  11000010  10010111
 1100 0010 1001 0111
    C    2    9    7
Ta da!
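
The same bit shuffling can be written as code. This is a minimal sketch in Python, not anybody's official implementation: utf8_encode is a name I made up, it only covers the 1 to 4 byte patterns, and it does not bother rejecting the surrogate range D800-DFFF the way a real encoder would.

```python
def utf8_encode(code_point):
    """Encode one code point by hand using the bit patterns in the table."""
    if code_point < 0x80:        # up to 7 bits:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point < 0x10000:     # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0x10FFFF:   # up to 21 bits: 11110xxx plus three 10xxxxxx
        return bytes([0b11110000 | (code_point >> 18),
                      0b10000000 | ((code_point >> 12) & 0b111111),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    raise ValueError("code point is beyond U+10FFFF")

encoded = utf8_encode(0x97)
print(" ".join(f"{b:02X}" for b in encoded))   # C2 97
```

Comparing utf8_encode(n) with chr(n).encode("utf-8") for a few values is an easy sanity check against Python's built-in encoder.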

The only remaining mystery is why the first chart I consulted listed C2 97 as the translation for a control character. I wonder if there is some confusion with U+0097 'END OF GUARDED AREA'.
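
One way to test that hunch is to ask a Unicode database directly. A small sketch using Python's standard unicodedata module (the variable names are just mine):

```python
import unicodedata

decoded = b"\xc2\x97".decode("utf-8")       # what a UTF-8 reader makes of C2 97
category = unicodedata.category(decoded)    # 'Cc' means a control character

em_dash = b"\x97".decode("cp1252")          # what codepage 1252 means by byte 97
em_dash_name = unicodedata.name(em_dash)
em_dash_utf8 = em_dash.encode("utf-8")      # UTF-8 bytes of the real em dash

print(f"C2 97 decodes to U+{ord(decoded):04X}, category {category}")
print(f"cp1252 byte 97 is U+{ord(em_dash):04X} {em_dash_name}, "
      f"UTF-8 {em_dash_utf8.hex()}")
```

So the chart was describing U+0097, while the em dash proper lives at U+2014 and has a different, 3-byte UTF-8 form.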

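Perhaps it is because UTF-8 is just a coding scheme - a dumb algorithm for converting code point numbers to byte groups - and takes no view on what those numbers mean. UTF-8 doesn't know or care what 97 means, it just knows to translate it to C2 97. It is Unicode that says code point 97 is a control character, and codepage 1252 that says byte 97 is an em dash.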

In fact yes, that's it.

  • When the text files in question were originally created, an em dash was turned into a codepage 1252 extended ASCII character 97.
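  • The files were later UTF-8 encoded, presumably by something that took each byte value as a code point number, and this turned 97 into C2 97 (a converter that knew about codepage 1252 would have produced E2 80 94, the UTF-8 for the real em dash U+2014)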
  • My text editor rendered that as two characters - C2 (Â) and 97 (—) - because it only understands 8-bit extended ASCII
  • But Firefox took C2 97 as a 2 byte encoding of 97, and duly rendered a single character 97 (—)
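
The chain of events can be replayed in a few lines of Python. This is a sketch under one big assumption: that the conversion step read the old bytes as ISO Latin-1 (byte value = code point). I don't actually know which tool did the conversion, and the variable names are mine.

```python
original = "\u2014"                          # a real em dash
cp1252_bytes = original.encode("cp1252")     # the single byte 97
misread = cp1252_bytes.decode("latin-1")     # assumption: read as code point U+0097
stored = misread.encode("utf-8")             # C2 97, as found in the files

editor_view = stored.decode("cp1252")        # an 8-bit editor sees two characters

# Undo the damage: back to the raw byte, then read it as codepage 1252.
repaired = stored.decode("utf-8").encode("latin-1").decode("cp1252")

print(cp1252_bytes.hex(), stored.hex(), repaired == original)
```

If that assumption holds, the last line is also a recipe for repairing the files, as long as every mangled character came from the same route.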
