Supporting UTF-8 Virtual Hosts


This page discusses how Whitebeam, from version 1.3.30 onwards, transparently supports UTF-8 encoded content in Apache VirtualHosts. It applies to Whitebeam running under Apache 2.x or later.


Overview

Traditionally, browsers have used a variety of character sets to represent content. In the early days of Whitebeam, Latin-1, an 8-bit character set, was common and worked well with most Latin-based languages. However, 8-bit character sets are too restricted to support non-Latin languages, because each set is limited to 256 unique codes.

UCS (Universal Character Set) defines a 32-bit character address space supporting over 4 billion unique symbols. Rather than simply extending every character to 4 bytes and creating huge documents, a set of simple transformations allows the 32-bit characters to be represented as a sequence of either 16-bit (UTF-16) or 8-bit (UTF-8) units. The most commonly adopted transformation is UTF-8, which provides a good compromise between document size and internationalisation.
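The size trade-off between the transformations can be checked with a short sketch. It is written in Python purely for illustration (Whitebeam itself is not involved), and the sample strings and the helper name `encoded_sizes` are made up for this example.

```python
# Compare how many bytes the same text needs in each transformation format.
# The "-le" codec variants are used so that no byte order mark is counted.

def encoded_sizes(text):
    """Return the encoded byte length of text in UTF-8, UTF-16 and UTF-32."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

# Hypothetical sample strings chosen for illustration only.
samples = {
    "English": "Hello",
    "French": "café",
    "Japanese": "日本語",
}

for label, text in samples.items():
    print(label, encoded_sizes(text))
```

For mostly Latin text UTF-8 is the most compact of the three, while the fixed 4-byte form is always the largest.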

Whitebeam has generally been agnostic about the characters it receives, treating them simply as an opaque sequence. SpiderMonkey, the JavaScript engine at the heart of Whitebeam, natively uses 16-bit UTF-16 characters. Previously, the 8-bit stream from a browser was extended to 16 bits by simply adding a zero padding byte to each character. When that stream actually contains multi-byte UTF-8 sequences, each byte becomes a separate character, which can lead to some very interesting results!
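The effect of zero-padding can be reproduced outside Whitebeam. The following Python sketch is only an illustration of the general technique, not Whitebeam's actual code; `pad_to_16_bits` is a hypothetical helper name.

```python
# Widening each byte to a 16-bit code unit by zero-padding is the same as
# treating every byte value as its own character.

def pad_to_16_bits(raw):
    """Turn each 8-bit value into one character with the same code value."""
    return "".join(chr(byte) for byte in raw)

utf8_bytes = "é".encode("utf-8")        # two bytes: 0xC3 0xA9
padded = pad_to_16_bits(utf8_bytes)      # two unrelated characters
decoded = utf8_bytes.decode("utf-8")     # the single intended character

print(len(padded), len(decoded))
```

Plain ASCII survives the padding unchanged, which is why the problem only shows up once accented or non-Latin text arrives as UTF-8.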

From version 1.3.30 onwards, the way Whitebeam deals with characters is much more flexible.

By default, for backwards compatibility, the behaviour is as in previous releases.

Adding the 'RButf8 true' directive to a VirtualHost definition in your Apache configuration enables the new UTF-8 support for that virtual host. For a new installation, simply add this directive at the top level to switch all virtual hosts to UTF-8 by default.
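As a rough sketch, a per-host setup might look like the fragment below. Only the 'RButf8 true' directive comes from this page; the address, port and ServerName are placeholder assumptions, and the rest of a real VirtualHost definition is omitted.

```apache
# Placeholder virtual host: only the RButf8 line is taken from this page.
<VirtualHost *:80>
    ServerName www.example.org
    # Enable Whitebeam's UTF-8 handling for this virtual host only.
    RButf8 true
</VirtualHost>
```

To apply the setting to every virtual host, the same directive would instead be placed outside any VirtualHost block.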

UTF-8 Mode in Whitebeam

The UTF-8 modifications to Whitebeam take effect in two main places: Apache connections and the Postgres database.

In UTF-8 mode Whitebeam will send only UTF-8 characters to Postgres and will expect the characters retrieved from the database to be encoded as UTF-8 sequences. This is where you have to be careful when upgrading an existing site with stored legacy data to UTF-8!
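One way to see why legacy data needs care is to test whether stored bytes are valid UTF-8 at all. The Python sketch below is not a Whitebeam or Postgres tool. It assumes, purely for illustration, that the legacy data is Latin-1, and the helper names are hypothetical.

```python
# Check whether stored bytes already form valid UTF-8, and re-encode them
# from an assumed legacy encoding (Latin-1 here) when they do not.

def is_valid_utf8(raw):
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def legacy_to_utf8(raw, legacy_encoding="latin-1"):
    """Return raw unchanged if it is valid UTF-8, else convert from legacy."""
    if is_valid_utf8(raw):
        return raw
    return raw.decode(legacy_encoding).encode("utf-8")

legacy_value = "café".encode("latin-1")   # b'caf\xe9', not valid UTF-8
print(is_valid_utf8(legacy_value), legacy_to_utf8(legacy_value))
```

The validity check is only a heuristic: some Latin-1 byte sequences happen to be valid UTF-8 too, so a real migration should be verified against known content.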

Data received via Apache from a remote connection is decoded according to the encoding headers sent with the HTTP request. If the incoming stream is signalled as UTF-8, then the correct UTF-8 decoding rules are used to expand the stream to UTF-16 in JavaScript. The outbound stream to the browser is then always encoded as UTF-8.
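The general idea of header-driven decoding can be sketched as follows. This is a Python illustration of the technique, not Whitebeam's implementation. The function names are hypothetical, and the Latin-1 fallback for requests without a charset is an assumption made for the example.

```python
from email.message import Message

def charset_from_content_type(header_value, default="latin-1"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the lower-cased charset, or None if absent.
    return msg.get_content_charset() or default

def decode_request_body(raw, content_type):
    """Decode inbound bytes using the charset the client signalled."""
    return raw.decode(charset_from_content_type(content_type))

def encode_response_body(text):
    """Outbound content is always encoded as UTF-8."""
    return text.encode("utf-8")

body = decode_request_body("naïve".encode("utf-8"), "text/plain; charset=UTF-8")
print(body, encode_response_body(body))
```

Keeping the decode step separate from the encode step mirrors the behaviour described above: input depends on what the client declares, output does not.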

Whitebeam release 1.3.36