Wednesday 4 September 2019

Unicode and UTF-8: A survivor's guide

UTF-8 is the most popular way of representing the comprehensive Unicode character set in those old-fashioned 8-bit bytes. But from the web developer's point of view it can introduce maddening inconsistency in the handling of strings and text documents. This article gives the facts and suggests how to avoid the pitfalls.

Unicode

There is only one Unicode. It's a set of characters, each with a unique 'code point', intended to cover all languages and scripts in current use and overseen by a consortium of industry leaders. Despite appearances there are no 'code pages' or anything like that. There are, however, several ways to represent Unicode characters in a document, and many different collations to confuse the picture.

UTF-8

The standard for encoding the Unicode character set in 8-bit bytes (from one to four of them per character) is called 'UTF-8'. There are also 'UTF-16', which uses one or two 16-bit units per character, and 'UTF-32', which uses a fixed four bytes per character, but we won't be looking at those. Note that the standard is to write the name in caps with a hyphen: 'UTF-8'. Given the industry's respect for standards you may encounter it written otherwise; phpMyAdmin uses 'utf8' for example. In these cases it's safest just to go along with it.
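To make the 'one to four bytes' claim concrete, here is a minimal Python sketch. The sample characters and the helper name utf8_length are my own choices for illustration, not part of any standard.

```python
# Four sample characters, written as escapes so the source file's own
# encoding cannot affect the result: A, e-acute, euro sign, grinning face.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

def utf8_length(ch: str) -> int:
    """Return how many bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} takes {utf8_length(ch)} byte(s): {encoded.hex()}")
```

The four samples come out at 1, 2, 3 and 4 bytes respectively; plain ASCII text costs exactly what it always did.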

In web pages

Normally the server should specify UTF-8 in the response headers (Content-Type: text/html; charset=utf-8), but to make a document portable include this line in the head section of the HTML5 page:

<meta charset="UTF-8">

These are only declarations however; they do no actual character set conversion, and don't guarantee that the characters are actually encoded that way in the first place. If you see odd characters on a web page look for your browser's function for viewing in a different encoding. My version of Opera has Page->Encoding on the main menu for instance.
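The 'odd characters' are easy to reproduce. This short Python sketch (the sample word is my own) takes bytes that really are UTF-8 and reads them as if they had been declared ISO-8859-1, which is roughly what a browser does when the declaration is wrong.

```python
text = "caf\u00e9"                     # the word 'cafe' with an acute accent
utf8_bytes = text.encode("utf-8")      # what is actually stored or sent

# A wrong declaration changes how the bytes are read, not the bytes themselves.
misread = utf8_bytes.decode("iso-8859-1")
correct = utf8_bytes.decode("utf-8")

print(utf8_bytes)   # five bytes: the accented letter takes two
print(misread)      # the accented letter shows up as two odd characters
print(correct)      # the original word
```

Seeing a capital A with a tilde followed by a stray symbol where an accented letter should be is the classic sign of UTF-8 being read as ISO-8859-1.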

In programming languages

Each programming language has its own functions for converting between encodings, and they'll have to be studied carefully. This is not the place for such study, but I'll mention that PHP has utf8_encode, which converts text from the popular ISO-8859-1 set to UTF-8 (note that it is deprecated as of PHP 8.2). For more general tasks mb_convert_encoding will be needed.
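For comparison, here is how the same two jobs might look in Python rather than PHP. The function names latin1_to_utf8 and convert_encoding are hypothetical helpers of my own, written only to mirror the two PHP functions in spirit; they are not part of Python's library.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8 (the job described for utf8_encode)."""
    return data.decode("iso-8859-1").encode("utf-8")

def convert_encoding(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """General conversion between any two encodings that Python knows."""
    return data.decode(source).encode(target)

# One byte for the accented letter in ISO-8859-1 becomes two in UTF-8.
print(latin1_to_utf8(b"caf\xe9"))

# Windows-1252 stores the euro sign as the single byte 0x80.
print(convert_encoding(b"\x80", "cp1252"))
```

The important point in any language is the same: you must know the source encoding, because the bytes alone will not tell you reliably.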

In authoring software and databases

When setting up a database you need to check that it is dealing in UTF-8. In phpMyAdmin go to the server main menu and choose the Variables option (it won't be there if you don't have the right privileges). Consult the Server System Variables documentation (click the ? icon) to find all mentions of 'utf8' and make changes accordingly. Check your code thoroughly with a variety of accented texts before committing.

Different collations

Collations are a big source of confusion; it looks like there are several different versions of the character set, a bit like the old 'code pages'. But a collation only refers to how strings are sorted rather than stored, so if you've picked the wrong one it doesn't matter very much. Just change it. The correct choice is this one:

utf8mb4_unicode_ci

This states that it belongs to the 'utf8mb4' character set, which stores encodings of up to four bytes (MySQL's plain 'utf8' stops at three), that it uses universal rather than language-specific sorting ('unicode'), and that sorting is case insensitive ('ci'). There are also 'general_ci' variants, but don't use them: they are a fraction more time efficient but at the expense of accuracy.
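The difference between sorting and storing can be shown with a toy Python example. This is only an analogy: the word list is my own, and str.casefold is a rough stand-in for a 'ci' collation, not the algorithm the database actually uses.

```python
stored = ["banana", "Zebra", "apple"]

# Plain code-point order: every capital letter sorts before any lower-case one.
by_code_point = sorted(stored)

# A rough stand-in for a case-insensitive collation: compare case-folded keys.
case_insensitive = sorted(stored, key=str.casefold)

print(by_code_point)      # ['Zebra', 'apple', 'banana']
print(case_insensitive)   # ['apple', 'banana', 'Zebra']
print(stored)             # the stored strings are unchanged either way
```

Changing the sort rule changes the order of the results and nothing else, which is why a wrong collation is easy to correct later.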

How a character is encoded

UTF-8 uses a sequence of bytes which is backwards compatible with 7-bit ASCII. All characters needing 8 bits or more (decimal 128+) now have their binary values spread over 2, 3 or 4 formatted bytes, zero-padded at the head where necessary, always with high order bits first. The first byte (the prefix) specifies the length and includes a few bits of the character; the rest are continuation bytes, each carrying 6 bits of the character code. The prefix is coded so that the total number of bytes used is the number of leading '1' bits, which are followed by a '0'. So:

Binary representations of UTF-8 prefix bytes:

10xx xxxx  (1 byte long, but invalid; it's used as a 'continuation' byte)
110x xxxx  2 bytes long, and 5 bits of encoding, 11 bits total
1110 xxxx  3 bytes long, and 4 bits of encoding, 16 bits total
1111 0xxx  4 bytes long, and 3 bits of encoding, 21 bits total

Note the first example beginning binary '10'. Following the pattern, it would be a single byte carrying 6 bits, which could only duplicate the first 64 ASCII characters, and redundant ('overlong') encodings like this are never allowed. Instead the '10' pattern marks continuation bytes, which store the rest of the character code in groups of 6 bits. Octal is sometimes used to write UTF-8 bytes because each 6-bit group splits neatly into two octal digits.
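The bit layout above can be checked by building the bytes by hand. The following is a minimal Python sketch under my own naming (encode_utf8 and sequence_length are illustrative helpers); in real code you would simply call str.encode, which the sketch uses as its reference.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one code point by hand, following the prefix table above."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:                  # 7 bits: plain ASCII, one byte
        return bytes([code_point])
    if code_point < 0x800:                 # up to 11 bits
        prefix, continuations = 0b1100_0000, 1
    elif code_point < 0x10000:             # up to 16 bits
        prefix, continuations = 0b1110_0000, 2
    else:                                  # up to 21 bits
        prefix, continuations = 0b1111_0000, 3
    out = []
    for _ in range(continuations):
        # Peel off the low 6 bits and mark them with the '10' pattern.
        out.append(0b1000_0000 | (code_point & 0b0011_1111))
        code_point >>= 6
    out.append(prefix | code_point)        # the remaining high bits go in the prefix
    return bytes(reversed(out))            # high order bits first

def sequence_length(first_byte: int) -> int:
    """Read the sequence length from the leading '1' bits of a prefix byte.

    This checks the bit pattern only; it does not reject the few prefix
    values that the standard forbids for other reasons.
    """
    if first_byte < 0x80:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    raise ValueError("continuation byte or invalid prefix")

for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert encode_utf8(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X}", [f"{b:08b}" for b in encode_utf8(cp)])
```

Printing the bytes in binary makes the prefixes and the '10' continuation markers easy to spot.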

Advantages of UTF-8

  • It is backwards compatible with ASCII.
  • Old-style 8-bit character subroutines (even string comparisons) work without alteration as long as they are given valid UTF-8 input strings.
  • Robustness. If a string is split in the middle of a multi-byte sequence it will be apparent, whether it is cut short at the end or beginning. On top of this, if input is given erroneously in non-UTF-8 it should become apparent (if the text is long enough) by the number of invalid bytes (typically continuation bytes without prefixes).
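Two of these claims are easy to try out. The Python sketch below uses a strict decode as the validity check (is_valid_utf8 is my own helper name, and the sample strings are arbitrary) and then confirms that comparing raw bytes gives the same order as comparing code points.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

whole = "na\u00efve".encode("utf-8")          # the accented letter takes two bytes
cut_at_end = whole[:3]                        # ends with a prefix and no continuation
cut_at_start = whole[3:]                      # starts with a continuation and no prefix
latin1 = "na\u00efve".encode("iso-8859-1")    # the same word in the wrong encoding

for label, data in [("whole", whole), ("cut at end", cut_at_end),
                    ("cut at start", cut_at_start), ("ISO-8859-1", latin1)]:
    print(label, is_valid_utf8(data))

# Old-style byte comparison sorts UTF-8 strings in code-point order.
words = ["z", "\u00e9", "\u20ac", "\U0001F600", "a"]
by_bytes = [b.decode("utf-8") for b in sorted(w.encode("utf-8") for w in words)]
print(by_bytes == sorted(words))
```

Only the whole string passes the check. A very short ISO-8859-1 text containing nothing but ASCII would pass too, which is why the text needs to be long enough for the problem to show.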

Disadvantages of UTF-8

Besides its relative inefficiency at encoding non-Western alphabets, the only real disadvantage of UTF-8 is that it works so well even with 'broken' input that it is tempting for a developer to overlook the occasional errors with accented characters. Don't put off fixing these errors: get to the root of the problem right away no matter how frustrating the process can be.
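The 'relative inefficiency' can be measured directly. This small Python sketch compares byte counts in UTF-8 and UTF-16 for two sample strings of my own choosing; other scripts will give different ratios.

```python
def byte_counts(text: str) -> tuple:
    """Return the encoded size of the text as (UTF-8 bytes, UTF-16 bytes)."""
    # 'utf-16-le' is used so that no byte order mark inflates the count.
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

english = "hello"
japanese = "\u65e5\u672c\u8a9e"    # three kanji meaning 'Japanese language'

print(byte_counts(english))    # (5, 10)
print(byte_counts(japanese))   # (9, 6)
```

English text is half the size in UTF-8, while these three kanji take half as much again as they would in UTF-16.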