Perl, character encoding and the WWW

January 9, 2015

For most applications, UTF-8 is a good choice, since it can encode arbitrary Unicode codepoints. At the same time, English text (and that of most other European languages) is encoded very efficiently.

HTTP offers the Accept-Charset header, in which the client can tell the server which character encodings it can handle. But if you stick to common encodings like UTF-8 or Latin-1, nearly all user agents will understand them, so checking that header isn't really necessary.

HTTP headers themselves are strictly ASCII-only, so any information sent in HTTP headers (including cookies and URLs) needs to be encoded to ASCII if it contains non-ASCII characters.
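For example, a non-ASCII URL parameter can be made ASCII-safe by first encoding the text string to UTF-8 bytes and then percent-escaping those bytes. A minimal sketch using the Encode and URI::Escape modules:

```perl
use strict;
use warnings;
use Encode qw(encode);
use URI::Escape qw(uri_escape);

# Encode the text string to UTF-8 bytes, then percent-escape those
# bytes so the result is pure ASCII and safe to put into a URL:
my $value = "na\x{EF}ve";    # "naïve" as a text string
my $safe  = uri_escape(encode('UTF-8', $value));
print "$safe\n";             # na%C3%AFve
```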

For HTML files the header typically looks like this: Content-Type: text/html; charset=UTF-8. If you send such a header, you only have to escape those characters that have a special meaning in HTML: <, >, & and, in attributes, ".
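A sketch of that escaping with the common HTML::Entities module (the example input is made up; this assumes STDOUT has been set up for UTF-8 output):

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

binmode STDOUT, ':encoding(UTF-8)';
print "Content-Type: text/html; charset=UTF-8\r\n\r\n";

# With a declared UTF-8 charset, only the HTML metacharacters need
# escaping; everything else can be sent as literal UTF-8 text:
my $user_input = '<script>alert("hi")</script>';
print '<p>', encode_entities($user_input, '<>&"'), "</p>\n";
```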

Special care must be taken when reading POST or GET parameters with the function param in the module CGI. Older versions (prior to 3.29) always returned byte strings; newer versions return text strings if charset("UTF-8") has been called before, and byte strings otherwise.

CGI.pm also doesn't support character encodings other than UTF-8 through this mechanism. If you need a different encoding, don't use the charset routine; instead, explicitly decode the parameter strings yourself.
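A minimal sketch of decoding a parameter yourself with the core Encode module (here assuming the form was submitted as UTF-8; substitute whatever encoding you expect, and note that the parameter name is made up):

```perl
use strict;
use warnings;
use CGI;
use Encode qw(decode);

my $q = CGI->new;

# Don't call $q->charset('UTF-8'); take the raw byte string
# and decode it explicitly with the encoding you expect:
my $raw  = $q->param('name');        # byte string from the request
my $name = decode('UTF-8', $raw);    # text string
```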

To ensure that form contents in the browser are sent with a known charset, you can add the accept-charset attribute to the <form> tag.

<form method="post" accept-charset="utf-8" action="/script.pl">

If you use a template system, you should take care to choose one that knows how to handle character encodings. Good examples are Template::Alloy, HTML::Template::Compiled (since version 0.90 with the open_mode option), or Template Toolkit (with the ENCODING option in the constructor and an IO layer in the process method).
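For Template Toolkit, the setup might look roughly like this (a sketch under the assumptions above; the template filename and variable are made up for illustration):

```perl
use strict;
use warnings;
use Template;

# Decode template files as UTF-8 when they are read ...
my $tt = Template->new({ ENCODING => 'utf8' })
    or die Template->error;

# ... and encode the output again with an IO layer on process():
$tt->process('page.tt', { title => 'My Page' },
             \*STDOUT, binmode => ':encoding(UTF-8)')
    or die $tt->error;
```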

Modules

There are a plethora of Perl modules out there that handle text, so here are only a few notable ones, and what you have to do to make them Unicode-aware:

LWP::UserAgent and WWW::Mechanize

Use $response->decoded_content instead of just $response->content. That way, the character-encoding information sent in the HTTP response header is used to decode the body of the response.
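A minimal sketch (the URL is just a placeholder):

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua       = LWP::UserAgent->new;
my $response = $ua->get('http://example.com/');

# content() returns the raw bytes of the body; decoded_content()
# additionally undoes any Content-Encoding and decodes the bytes
# using the charset from the Content-Type header:
my $bytes = $response->content;
my $text  = $response->decoded_content;
```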

DBI

DBI leaves handling of character encodings to the DBD:: (driver) modules, so what you have to do depends on which database backend you are using. What most of them have in common is that UTF-8 is better supported than other encodings.

For MySQL and DBD::mysql, pass the mysql_enable_utf8 => 1 option to the DBI->connect call.

For PostgreSQL and DBD::Pg, set the pg_enable_utf8 attribute to 1.

For SQLite and DBD::SQLite, set the sqlite_unicode attribute to 1.
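Hedged sketches of the corresponding connect calls (the DSNs, database names and credentials are placeholders):

```perl
use strict;
use warnings;
use DBI;

# MySQL: decode results to text strings on this connection
my $dbh_mysql = DBI->connect(
    'dbi:mysql:database=test;host=localhost', 'user', 'password',
    { mysql_enable_utf8 => 1, RaiseError => 1 },
);

# PostgreSQL: same idea with DBD::Pg
my $dbh_pg = DBI->connect(
    'dbi:Pg:dbname=test', 'user', 'password',
    { pg_enable_utf8 => 1, RaiseError => 1 },
);

# SQLite: treat stored data as UTF-8
my $dbh_sqlite = DBI->connect(
    'dbi:SQLite:dbname=test.db', '', '',
    { sqlite_unicode => 1, RaiseError => 1 },
);
```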

Advanced Topics

With the basic charset and Perl knowledge you can get quite far. For example, you can make a web application “Unicode safe”, i.e. you can take care that all possible user inputs are displayed correctly, in any script the user happens to use.

But that’s not all there is to know on the topic. For example, the Unicode standard allows different ways to compose some characters, so you need to “normalize” them before you can compare two strings. You can read more about that in the Unicode normalization FAQ.
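The core module Unicode::Normalize does that normalization; a minimal sketch comparing a precomposed é with its decomposed form:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $composed   = "\x{E9}";     # é as a single codepoint
my $decomposed = "e\x{301}";   # e + U+0301 COMBINING ACUTE ACCENT

# String comparison sees two different sequences of codepoints ...
print $composed eq $decomposed ? "equal\n" : "different\n";   # different

# ... but after normalizing both to NFC, they compare equal:
print NFC($composed) eq NFC($decomposed)
    ? "equal\n" : "different\n";                              # equal
```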

To implement country-specific behaviour in programs, you should take a look at the locales system. For example, under Turkish casing rules the lower case of the capital letter I is ı, U+0131 LATIN SMALL LETTER DOTLESS I, while the upper case of i is İ, U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE.

A good place to start reading about locales is perldoc perllocale.

Philosophy

Many programmers who are confronted with encoding issues first react with “But shouldn’t it just work?”. Yes, it should just work. But too many systems are broken by design regarding character sets and encodings.

Broken by Design

“Broken by Design” most of the time means that a document format, an API, or a protocol allows multiple encodings, without a normative way to transport and store that encoding information out of band.

A classic example is Internet Relay Chat (IRC), which specifies that a character is one byte, but not which character encoding is used. This worked well in the Latin-1 days, but was bound to fail as soon as people from different continents started to use it.

Currently, many IRC clients try to autodetect character encodings and recode the text to whatever the user configured. This works quite well in some cases, but produces really ugly results where it doesn't.

Another Example: XML

The Extensible Markup Language, commonly known by its abbreviation XML, lets you specify the character encoding inside the file:

<?xml version="1.0" encoding="UTF-8" ?>

There are two reasons why this is insufficient:

  1. The encoding information is optional. The specification clearly states that the encoding must be UTF-8 if the encoding information is absent, but sadly many tool authors don’t seem to know that, and emit Latin-1. (This is of course only partly the fault of the specification).
  2. Any XML parser first has to autodetect the encoding to be able to parse the encoding information.

The second point is really important. You’d guess “Ah, that’s no problem, the preamble is just ASCII” – but many encodings don’t represent the ASCII range as single ASCII bytes (for example UTF-7, UCS-2 and UTF-16).

So although the encoding information is available, the parser first has to guess nearly correctly to extract it.

The appendix to the XML specification contains a detection algorithm that can handle all common cases, but for example lacks UTF-7 support.
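A much-simplified sketch of such a detection, looking only at a byte-order mark or the first bytes of the <?xml declaration (the function name is made up, and the real algorithm in the spec's appendix covers more cases):

```perl
use strict;
use warnings;

# Guess the encoding family of an XML document from its first bytes.
sub guess_xml_encoding {
    my ($bytes) = @_;
    return 'UTF-8'    if $bytes =~ /^\xEF\xBB\xBF/;  # UTF-8 BOM
    return 'UTF-16BE' if $bytes =~ /^\xFE\xFF/;      # UTF-16 BOM, big-endian
    return 'UTF-16LE' if $bytes =~ /^\xFF\xFE/;      # UTF-16 BOM, little-endian
    return 'UTF-16BE' if $bytes =~ /^\x00</;         # '<' encoded as 00 3C
    return 'UTF-16LE' if $bytes =~ /^<\x00/;         # '<' encoded as 3C 00
    return 'UTF-8';   # no BOM, ASCII-looking start: default per the spec
}

print guess_xml_encoding("\xFE\xFF\x00<"), "\n";          # UTF-16BE
print guess_xml_encoding('<?xml version="1.0"?>'), "\n";  # UTF-8
```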

Source: perlgeek.de