Wednesday 4 September 2019

Unicode and UTF-8: A survivor's guide

UTF-8 is the most popular way of representing the comprehensive Unicode character set in those old-fashioned 8-bit bytes. But from the web-developer's point of view it can be a way of introducing maddening inconsistency in the handling of strings and text documents. This article gives the facts and suggests how to avoid the pitfalls.

Unicode

There is only one Unicode. It's a set of characters, each with a unique 'code point', intended to cover all languages and scripts in current use and overseen by a consortium of industry leaders. Despite appearances there are no 'code pages' or anything like that. There are, however, several ways to represent Unicode characters in a document, and many different collations to confuse the picture.

UTF-8

The standard for encoding the Unicode character set in 8-bit bytes (from one to four of them) is called 'UTF-8'. There are also 'UTF-16' and 'UTF-32', built on 16-bit and 32-bit units respectively (UTF-16 takes two or four bytes per character, UTF-32 always four), but we won't be looking at those. Note that the standard is to write the name in caps with a hyphen: 'UTF-8'. Given the industry's respect for standards you may encounter it written otherwise; phpMyAdmin uses 'utf8' for example. In these cases it's safest just to go along with it.
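To see the one-to-four-byte range in practice, here is a minimal sketch that counts the bytes each character occupies. It assumes an environment that provides the standard TextEncoder (current browsers and Node.js do); the helper name utf8Length is made up for illustration.

```javascript
// Hypothetical helper: how many bytes does a string occupy in UTF-8?
function utf8Length (text) {
  return new TextEncoder ().encode (text).length;
}

var samples = [
  'A',                            // U+0041, plain ASCII
  '\u00E9',                       // U+00E9, e with acute accent
  '\u20AC',                       // U+20AC, euro sign
  String.fromCodePoint (0x1F600)  // U+1F600, an emoji outside the 16-bit range
];

var lengths = samples.map (utf8Length);
console.log (lengths);            // [1, 2, 3, 4]
```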

In web pages

Normally the server should specify UTF-8 in the response headers, but to make a document portable include this line in the head of the HTML5 document:

<meta charset="UTF-8">

These are only declarations however; they do no actual character set conversion, and don't guarantee that the characters are actually encoded that way in the first place. If you see odd characters on a web page look for your browser's function for viewing in a different encoding. My version of Opera has Page->Encoding on the main menu for instance.

In programming languages

Each programming language has its own functions for converting between encodings, and they'll have to be studied carefully. This is not the place for such study, but I'll mention that PHP has utf8_encode (deprecated since PHP 8.2), which converts ISO-8859-1 text to UTF-8 and so helps when reading from web pages in that popular set. For more general tasks mb_convert_encoding will be needed.
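As an illustration of what such a conversion involves, here is a rough JavaScript sketch of the ISO-8859-1 to UTF-8 case. It is not the PHP function, only the same idea: every ISO-8859-1 byte value is also the Unicode code point of the character it stands for, so the bytes can be turned into a string one by one and then re-encoded. The name latin1ToUtf8 is hypothetical, and TextEncoder is assumed to be available.

```javascript
// Hypothetical helper: convert an array of ISO-8859-1 bytes to an array of UTF-8 bytes.
function latin1ToUtf8 (latin1Bytes) {
  var text = '';
  for (var i = 0; i < latin1Bytes.length; i++) {
    // In ISO-8859-1 the byte value is the same as the Unicode code point.
    text += String.fromCharCode (latin1Bytes[i]);
  }
  return Array.from (new TextEncoder ().encode (text));
}

var cafe = [0x63, 0x61, 0x66, 0xE9];   // 'café' as ISO-8859-1 bytes
var converted = latin1ToUtf8 (cafe);
console.log (converted);               // [99, 97, 102, 195, 169]
```

The single byte 0xE9 has become the two bytes 0xC3 0xA9, while the ASCII letters are unchanged.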

In authoring software and databases

When setting up a database you need to check that it is dealing in UTF-8. In phpMyAdmin go to the server main menu and choose the Variables option (it won't be there if you don't have the right privileges). Consult the Server System Variables documentation (click the ? icon) to find all mentions of 'utf8' and make changes accordingly. Check your code thoroughly with a variety of accented texts before committing.

Different collations

Collations are a big source of confusion; it looks like there are several different versions of the character set, a bit like the old 'code pages'. But a collation only refers to how strings are sorted rather than stored, so if you've picked the wrong one it doesn't matter very much. Just change it. The correct choice is this one:

utf8mb4_unicode_ci

This states that it uses encodings of up to four bytes ('mb4'), uses universal rather than language-specific sorting ('unicode'), and sorting is case insensitive ('ci'). There are also 'general_ci' variants, but don't use them: they are a fraction more time efficient but at the expense of accuracy.

How a character is encoded

UTF-8 uses a sequence of bytes which is backwards compatible with 7-bit ASCII. All characters needing 8 bits or more (decimal 128+) now have their binary values spread over 2, 3 or 4 formatted bytes, zero-padded at the head where necessary, always with high order bits first. The first byte (the prefix) specifies the length and includes a few bits of the character, the rest are continuation bytes including 6 bits of the character encoding. The prefix is coded so that the total bytes used is the number of leading '1' bits followed by a '0'. So:

Binary representations of UTF-8 prefix bytes:

10xx xxxx  (1 byte long, but invalid; it's used as a 'continuation' byte)
110x xxxx  2 bytes long, and 5 bits of encoding, 11 bits total
1110 xxxx  3 bytes long, and 4 bits of encoding, 16 bits total
1111 0xxx  4 bytes long, and 3 bits of encoding, 21 bits total

Note the first example beginning binary '10'. Following the pattern it would be a one-byte sequence carrying 6 bits, but that could only duplicate the first 64 ASCII characters, and redundant ('overlong') encodings like these are never allowed. Instead it's used to store the rest of the character code in groups of 6 bits. Sometimes octal is used to write UTF-8 bytes because each group of 6 bits splits neatly into two groups of 3.
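The bit layout described above can be sketched in a few lines of JavaScript. The helper names encodeCodePoint and toHex are made up for this example, and the sketch assumes it is given a valid code point: it makes no check for surrogates or for values above 0x10FFFF.

```javascript
// Hypothetical helper: encode one Unicode code point as an array of UTF-8 bytes.
function encodeCodePoint (cp) {
  if (cp < 0x80) {                       // up to 7 bits: plain ASCII, one byte
    return [cp];
  }
  if (cp < 0x800) {                      // up to 11 bits: 110xxxxx 10xxxxxx
    return [0xC0 | (cp >> 6),
            0x80 | (cp & 0x3F)];
  }
  if (cp < 0x10000) {                    // up to 16 bits: 1110xxxx and two continuation bytes
    return [0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)];
  }
  return [0xF0 | (cp >> 18),             // up to 21 bits: 11110xxx and three continuation bytes
          0x80 | ((cp >> 12) & 0x3F),
          0x80 | ((cp >> 6) & 0x3F),
          0x80 | (cp & 0x3F)];
}

// Hypothetical helper: show bytes as upper-case hex separated by spaces.
function toHex (bytes) {
  return bytes.map (function (b) { return b.toString (16).toUpperCase (); }).join (' ');
}

console.log (toHex (encodeCodePoint (0x20AC)));   // euro sign: E2 82 AC
```

Always choosing the shortest form that fits is what rules out overlong encodings.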

Advantages of UTF-8

  • It is backwards compatible with ASCII.
  • Old-style 8-bit character subroutines (even string comparisons) work without alteration as long as they are given valid UTF-8 input strings.
  • Robustness. If a string is split in the middle of a multi-byte sequence it will be apparent, whether it is cut short at the end or beginning. On top of this, if input is given erroneously in non-UTF-8 it should become apparent (if the text is long enough) by the number of invalid bytes (typically continuation bytes without prefixes).
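The robustness point can be tried out with a small sketch. It assumes the standard TextDecoder is available and relies on its 'fatal' option, which makes the decoder throw on malformed input instead of quietly substituting replacement characters. The helper name isValidUtf8 is made up for illustration.

```javascript
// Hypothetical helper: true if the given bytes form valid UTF-8.
function isValidUtf8 (bytes) {
  try {
    new TextDecoder ('utf-8', {fatal: true}).decode (new Uint8Array (bytes));
    return true;
  } catch (e) {
    return false;   // the decoder throws on malformed input
  }
}

var euro = [0xE2, 0x82, 0xAC];                          // the euro sign, three bytes

console.log (isValidUtf8 (euro));                       // true
console.log (isValidUtf8 (euro.slice (0, 2)));          // false: cut short at the end
console.log (isValidUtf8 (euro.slice (1)));             // false: continuation bytes with no prefix
console.log (isValidUtf8 ([0x63, 0x61, 0x66, 0xE9]));   // false: 'café' in ISO-8859-1
```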

Disadvantages of UTF-8

Besides its relative inefficiency at encoding non-Western alphabets, the only real disadvantage of UTF-8 is that it works so well even with 'broken' input that it is tempting for a developer to overlook the occasional errors with accented characters. Don't put off fixing these errors: get to the root of the problem right away no matter how frustrating the process can be.

Wednesday 28 August 2019

Closures in Javascript: a simple how and why

Closures are a simple concept, but often poorly understood. This may be because explanations are often mixed up with other concepts such as immediate functions, and then wrapped up in hard-to-unpick syntax. A good understanding of closures is needed when defining flexible event handlers for dynamic web pages. We create a new closure when we need to pass an argument to a function (such as an event handler) that we are defining now but will be called by the system later.

Take this simple case of a dynamic web page. We call a function setalerts to put up a sequence of alerts, each delayed by one second longer than the last, and each with a message stating how long the delay was:

  var message='Delay in seconds: ';

  function setalerts (maxcount) {
    var count;

    for (count=1; count<=maxcount; count++) {
      // Pass alert itself, with its argument worked out now; alert (message+count) here would run at once
      setTimeout (alert,count*1000,message+count);
    }
  }

  setalerts (3);

This works just fine, setting up as many alert boxes as we passed in the argument maxcount. All the work of creating the message string is done when we define the argument for each function to be called, so the same value of count is used for the message as for the delay.

We're going to sabotage this. Notice that message is a global variable, a simple representation of the changing real-world environment. Put this line after the call to setalerts:

  setTimeout (function () {message='Decay in sieverts:';},1500);

The function gives the same results as before, taking no account of the change to message made half a second after the first alert. So we try again, a bit wiser this time, and pass an anonymous function to setTimeout. It looks much the same as before, but will take the values of message and count current at run time:

  function setalerts (maxcount) {
    var count;

    for (count=1; count<=maxcount; count++) {
      setTimeout (function () {alert (message+count);},count*1000);
    }
  }
  setalerts (3);

  setTimeout (function () {message='Decay in sieverts:';},1500);

This works to reflect the change in the message alright, but gets the value for count wrong. It's always one greater than we passed as maxcount. A little reflection explains why. By the time the anonymous function is called by the setTimeout system the setalerts function has long since finished running and the value of count has passed its maximum. While we're trying to understand how to fix things, a little more reflection makes us wonder how we can access the value of count at all! Surely setalerts has finished running and all its local variables been destroyed? Trying to access count should only give a memory protection fault.

That's how things are with the most common, statically oriented languages. With Javascript things are different. Functions behave just like instances of other types of object: as long as a reference to one exists, it stays in memory. So, put simply, what makes up an instance of a function?

  • A reference to the text of the function, i.e. its statements etc.
  • Arguments and local variables. These are allocated afresh each time the function is called.
  • A reference back to the variables of its enclosing function. In turn the enclosing function may have a reference back to its own enclosing function, but in most cases there will simply be the one layer, the global variables. Note that this is only a reference, not copies of the variables.

Together these elements are called the closure. We can fix the problem with our function calls not accessing the expected values of variables simply by creating a new closure:

  function setalerts (maxcount) {
    var count;

    function nestedalert (nestedcount) {
      setTimeout (function () {alert (message+nestedcount);},count*1000);
    }

    for (count=1; count<=maxcount; count++) {
      nestedalert (count);
    }
  }
  setalerts (3);

  setTimeout (function () {message='Decay in sieverts:';},1500);

What we have done here is simply nested a new function inside our existing one, and used its argument nestedcount instead of count. At first this seems absurd as we only seem to be putting off the problem. But nevertheless the code works. This is because our anonymous function now has a new closure in the function nestedalert. We don't create new local variables when we declare the function and pass it to setTimeout, but we do create a new reference to its closure. This is how we pass parameters to a function now when it will only be run later.

The function nestedalert is called once every time the for loop executes, each time creating a new instance of the function, complete with a different value for its argument nestedcount. So each declaration of the anonymous function points to its own instance of nestedalert, with successive values for nestedcount of 1, 2, 3 etc., which will remain in memory as long as it is referenced.
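The same behaviour can be seen without timers or alerts. In this small sketch (makecounter and total are made-up names) each call to the outer function gets fresh local variables, and the inner functions keep a reference to those variables rather than a copy of them:

```javascript
// Hypothetical example: each call to makecounter creates a new, private 'total'.
function makecounter () {
  var total = 0;

  return {
    add: function () {total++;},         // changes the shared variable...
    read: function () {return total;}    // ...and this one sees the change
  };
}

var first = makecounter ();
var second = makecounter ();

first.add ();
first.add ();
second.add ();

console.log (first.read (), second.read ());   // 2 1
```

The two counters never interfere, and each total stays in memory for as long as something still refers to its add or read function.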

There's a more elegant way of coding the same idea, but its syntax can cause confusion. It's an extension of the idea of the anonymous function, saving us the bother of thinking up names (nestedalert etc.) for functions that we'll only ever call from one place. It's called an immediate function (also known as an immediately invoked function expression, or IIFE), a function declaration and call wrapped up in one, and is defined as simply as this:

  Instead of:

    <identifier> (<argumentlist>);

  We have:

    ( <function declaration> (<argumentlist>) );
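A minimal instance of that pattern, with a made-up variable name and nothing else going on, might look like this:

```javascript
// The function is declared and called in one go; only its result is kept.
var doubled = (
  function (n) {
    return n*2;
  }
  (21)
);

console.log (doubled);   // 42
```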

When seen in the real world it can be difficult to disentangle the mixture of brackets at first:

  function setalerts (maxcount) {
    var count;

    for (count=1; count<=maxcount; count++) {
      (
        function (nestedcount) {
          setTimeout (function () {alert (message+nestedcount);},count*1000);
        }
        (count)
      );
    }
  }

  setalerts (3);

  setTimeout (function () {message='Decay in sieverts:';},1500);

Here I've tried to make the syntax clearer at the cost of using more space. Notice how we've used exactly the same function nestedalert as before, only without including its name. After the definition is the argument list (count here but it may be empty), just the same as in a standard function call. This is all wrapped up in a pair of brackets. So where we had the word nestedalert in the for loop we now have the function declaration itself. That's basically all there is to an immediate function (plus the enclosing brackets of course).

To sum up the hows and whys of closures, use them when you want to pass an argument (such as an element ID which you have already calculated) to a function that will be called later on by the system. This function (set up an event handler to paste into an element with a given ID) works as expected:

  function addThisListener (thisID) {
    document.addEventListener("paste", function (e) { pasteIntoElement (thisID); }, false);
  }

But this function (set up an event handler to paste into ten elements via ten different IDs) does not work as expected because each instance will have the identical closure and thus reference the same (invalid, with i=11) element ID:

  function addThisListener (thisIDStem) {
    for (var i=1; i<=10; i++) {
      document.addEventListener("paste", function (e) { pasteIntoElement (thisIDStem+i); }, false);
    }
  }

Done correctly, via a new closure and immediate function, it becomes:

  function addThisListener (thisIDStem) {
    for (var i=1; i<=10; i++) {
      (
        function (compoundID) {
          document.addEventListener("paste", function (e) { pasteIntoElement (compoundID); }, false);
        }
          (thisIDStem+i)
      );
    }
  }
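The difference between the last two versions can be checked away from a web page. This sketch swaps document.addEventListener for a made-up register function that simply remembers each handler, and has the handlers return the ID instead of pasting anything. All the names here are hypothetical, and the loop is cut to three passes to keep the output short.

```javascript
// Hypothetical stand-in for document.addEventListener: just remember the handlers.
var handlers = [];

function register (handler) {
  handlers.push (handler);
}

// Every handler shares the one variable i, which has reached 4 by the time any of them runs.
function addSharedListeners (thisIDStem) {
  for (var i=1; i<=3; i++) {
    register (function () {return thisIDStem+i;});
  }
}

// Each pass through the loop calls the immediate function, giving its handler a private compoundID.
function addSeparateListeners (thisIDStem) {
  for (var i=1; i<=3; i++) {
    (
      function (compoundID) {
        register (function () {return compoundID;});
      }
      (thisIDStem+i)
    );
  }
}

addSharedListeners ('shared');
addSeparateListeners ('separate');

var ids = handlers.map (function (handler) {return handler ();});
console.log (ids);
// ['shared4', 'shared4', 'shared4', 'separate1', 'separate2', 'separate3']
```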

I hope this explanation of closures has been a help. For occasional Javascript programmers finding clear explanations of new concepts online can be a trial. I wrote this to get things clear for my own benefit, in frustration after having waded through a long long piece of simple-minded rubbish on Daily JS by someone calling himself 'Software engineering manager NY Times'. Totally useless, what an absolute plonker.