This page discusses how Whitebeam from version 1.3.30 onwards supports UTF-8 character sets transparently in Apache VirtualHosts. This applies to Whitebeam running under Apache 2.x onwards.
Overview

Traditionally browsers have used a variety of character sets to represent content. In the early days of Whitebeam, Latin-1, an 8-bit character set, was common and worked well with most
Latin-based languages. 8-bit character sets are too restricted to support non-Latin languages, each set being limited to 256 unique codes. UCS (Universal Character Set) defines a 32-bit character address space with room for over 4 billion unique codes. Rather than simply extending every character to 4 bytes
and creating huge documents, a set of simple transformations allows the 32-bit characters to be represented as a sequence of either 16-bit (UTF-16) or 8-bit (UTF-8)
'words'. The most commonly adopted transformation is UTF-8, which provides a good compromise between document size and internationalisation.

Whitebeam has generally been fairly agnostic about the characters it receives and has simply treated them as an opaque sequence of characters. SpiderMonkey,
the JavaScript engine used at the heart of Whitebeam, natively uses 16-bit UTF-16 characters. Previously, the 8-bit stream from a browser was extended to 16 bits by simply adding a zero padding byte to each character. This can lead to some very interesting results: a multi-byte UTF-8 sequence, for example, becomes several unrelated 16-bit characters instead of one.

From version 1.3.30, the way Whitebeam deals with characters is much more flexible. By default, for backwards compatibility, the behaviour is as in previous releases. Adding the directive 'RButf8 true' to a VirtualHost definition in your Apache configuration enables the new UTF-8 support for that virtual host. If this is a new installation,
simply add this directive at the top level to switch all virtual hosts to UTF-8 by default.

UTF-8 Mode in Whitebeam

UTF-8 modifications to Whitebeam occur in two main locations: Apache connections and the Postgres database. In UTF-8 mode Whitebeam sends only UTF-8 characters to Postgres and expects the characters retrieved from the database to be encoded as UTF-8 sequences (this is where you have to be careful when upgrading an existing site with legacy data to UTF-8!).

Data received via Apache from a remote connection is decoded according to the encoding headers sent with the HTTP request. If the incoming stream is signalled as UTF-8 then the correct UTF-8 decoding rules are used to expand the stream to UTF-16 in JavaScript. The outbound stream to the browser is then always encoded as UTF-8.
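
The difference between the legacy zero-padding behaviour and UTF-8 mode can be sketched in a few lines. This is only an illustration written in Python, not Whitebeam's actual implementation; the function names `zero_padded_units`, `utf8_decoded_units` and `units_to_utf8` are hypothetical and exist purely for this example.

```python
# Illustrative sketch only: these helpers are hypothetical and are not part
# of Whitebeam. They model the two ways an 8-bit stream can be widened to
# the 16-bit units that the JavaScript engine works with.

def zero_padded_units(raw: bytes) -> list[int]:
    """Legacy behaviour: every incoming byte becomes one 16-bit unit."""
    return list(raw)

def utf8_decoded_units(raw: bytes) -> list[int]:
    """UTF-8 mode: decode the byte stream as UTF-8, then express it as UTF-16 units."""
    utf16 = raw.decode("utf-8").encode("utf-16-be")
    return [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]

def units_to_utf8(units: list[int]) -> bytes:
    """Outbound direction: re-encode UTF-16 units as a UTF-8 byte stream."""
    utf16 = b"".join(u.to_bytes(2, "big") for u in units)
    return utf16.decode("utf-16-be").encode("utf-8")

if __name__ == "__main__":
    # U+00E9 (e with acute accent) followed by U+20AC (euro sign), sent as UTF-8.
    incoming = "\u00e9\u20ac".encode("utf-8")
    print([hex(u) for u in zero_padded_units(incoming)])
    # ['0xc3', '0xa9', '0xe2', '0x82', '0xac']  (five unrelated characters)
    print([hex(u) for u in utf8_decoded_units(incoming)])
    # ['0xe9', '0x20ac']  (the two characters that were actually sent)
```

With zero padding, the five bytes of the UTF-8 stream turn into five separate 16-bit characters, which is the source of the "interesting results" mentioned earlier. Decoding the stream as UTF-8 first yields the two intended characters, and re-encoding them on the way out reproduces the original byte sequence.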