This page discusses how Whitebeam from version 1.3.30 onwards supports UTF-8 character sets transparently in Apache VirtualHosts. This applies to Whitebeam running under Apache 2.x onwards.
Traditionally browsers have used a variety of character sets to represent content. In the early days of Whitebeam, Latin-1, an 8 bit character set, was common and worked well with most
Latin-based languages. 8 bit character sets are too restricted to support non-Latin languages, each set being limited to 256 unique codes.
UCS (Universal Character Set) defines a 32 bit character address space supporting over 4 billion unique symbols. Rather than simply extend every character to 4 bytes
and create huge documents, a set of simple transformations allows the 32 bit characters to be represented as a sequence of either 16 bit (UTF-16) or 8 bit (UTF-8)
'words'. The most commonly adopted transformation is UTF-8, which provides a good compromise between document size and internationalisation.
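The size trade-off between the transformations can be seen with a short Python sketch. It is purely illustrative and not part of Whitebeam; the helper name encoded_sizes and the sample characters are assumptions chosen for the example.

```python
# Illustrative only: compare how many bytes one character needs in
# UTF-8, UTF-16 and UTF-32. The helper name is hypothetical.

def encoded_sizes(ch):
    """Return the number of bytes needed for one character in each encoding."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # The "-be" variants are used so that no byte order mark is counted.
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }

# An ASCII letter, an accented Latin letter, the euro sign and an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

Plain ASCII text stays at one byte per character in UTF-8, which is why it keeps mostly-Latin documents small while still being able to represent every other character.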
Whitebeam has generally been fairly agnostic about the characters it receives and simply treated a sequence of characters as a sequence of characters. SpiderMonkey, the JavaScript engine, works with 16 bit characters internally, and the conversion from 8 bits to
16 bits was to simply add a zero padding byte to each character. This can lead to some very interesting results!
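To see why zero padding gives surprising results, here is a small Python sketch. It is not Whitebeam or SpiderMonkey code; both function names are hypothetical, and the sketch only assumes that the incoming bytes happen to be UTF-8.

```python
# Illustrative only: widen 8 bit data to 16 bit characters in two ways.
# Neither function is taken from Whitebeam; the names are hypothetical.

def widen_by_zero_padding(raw: bytes) -> str:
    """Treat every byte as its own character, i.e. pad the high byte with zero."""
    return "".join(chr(b) for b in raw)

def widen_as_utf8(raw: bytes) -> str:
    """Decode the bytes as a UTF-8 sequence before widening."""
    return raw.decode("utf-8")

# The euro sign followed by the digit 5, as a browser would send it in UTF-8.
raw = "\u20ac5".encode("utf-8")

padded = widen_by_zero_padding(raw)
decoded = widen_as_utf8(raw)

print(len(raw), len(padded), len(decoded))
print([hex(ord(c)) for c in padded])
print([hex(ord(c)) for c in decoded])
```

Zero padding turns the three bytes of the euro sign into three separate characters, so string lengths, indexing and comparisons in script code no longer match what the user typed.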
With version 1.3.30 the way Whitebeam deals with characters is much more flexible.
By default, for backwards compatibility, the behaviour is as in previous releases.
By adding the 'RButf8 true' directive to a VirtualHost definition in your Apache configuration you enable the new UTF-8 support for that virtual host. If yours is a new installation,
simply add this directive at the top level to switch all virtual hosts to UTF-8 by default.
UTF-8 Mode in Whitebeam
The UTF-8 modifications to Whitebeam occur in two main locations: via Apache connections and via the Postgres database.
In UTF-8 mode Whitebeam will send only UTF-8 characters to Postgres and will expect the characters retrieved from the database to be encoded as a UTF-8 sequence (this is where you have to be careful when upgrading an existing site to UTF-8 with legacy data stored!).
Data received via Apache from a remote connection is intelligently decoded depending on the encoding headers sent with the HTTP request. If the incoming stream is signalled as UTF-8 then correct