All talking the same language

Posted by Ben Rometsch on 29 May 2007

Last week we completed one of our most complex projects to date: putting the Amnesty International Report for 2007 online. Hundreds of pages of content in five different languages needed to be imported from a variety of sources and normalised to look and work perfectly on the web.

Java is great at working with UTF-8 encoded text. What we've discovered is that, when it comes to characters in languages like Russian or Arabic, this is not the end of the matter but merely the beginning! When working with UTF-8, you need to have UTF-8 capable "stuff" everywhere. Your O/S must be UTF-8. Your web server must be UTF-8. Your database, your database connection, your database GUI tool, your application server, your email client (for those last-minute fixes from the client), your web-based project management software... After 6 weeks I was beginning to worry that the bike I cycle to the office on wasn't UTF-8 aware, and that this was what was causing all the problems.

You get the idea.

Little did we realise how deep this would go. Not for the first time were the words "it really shouldn't be this hard" uttered by the development team as yet another trap was sprung. Are you sure you are running a UTF-8 database connection? Is that block of text your DB GUI tool is displaying actually UTF-8, or has the tool decoded some escaped HTML?
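One way to stop guessing what a GUI tool is showing you is to look at the raw bytes and check for escaped HTML yourself. The following is only a rough sketch of that idea, not the code we used on the project: the class and method names (TextInspector, containsNumericEntities, utf8Hex) are made up for illustration, and it only looks for numeric character references such as &#1055;, not named entities.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; none of these names come from the project.
class TextInspector {

    // Matches numeric HTML character references such as &#1055; or &#x41F;
    private static final Pattern NUMERIC_ENTITY =
            Pattern.compile("&#(x[0-9a-fA-F]+|[0-9]+);");

    // True if the text still contains escaped HTML rather than real characters.
    static boolean containsNumericEntities(String text) {
        return NUMERIC_ENTITY.matcher(text).find();
    }

    // Returns the UTF-8 bytes of the text as space-separated hex pairs.
    static String utf8Hex(String text) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // Cyrillic capital letter Pe, written as a Unicode escape so the
        // example does not depend on the encoding of this source file.
        String realCharacter = "\u041f";
        String escapedHtml = "&#1055;";

        System.out.println(utf8Hex(realCharacter));                 // d0 9f
        System.out.println(utf8Hex(escapedHtml));                   // 26 23 31 30 35 35 3b
        System.out.println(containsNumericEntities(realCharacter)); // false
        System.out.println(containsNumericEntities(escapedHtml));   // true
    }
}
```

Both strings can look identical once a browser or a helpful GUI tool has rendered them, but the bytes are completely different.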

The biggest lesson we learnt is that, from the outset, you need to make sure that absolutely every piece of software you touch is happy working in UTF-8. If you have a single weak link in the chain, you will start seeing those dreaded question marks where you were expecting a non-ASCII character. If you are starting a green-field development, make sure from the very start that you are UTF-8 compatible, and test everything to make sure that you are.
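To see where the question marks come from, here is a minimal sketch that pushes a Russian word through one "link" in the chain, first as UTF-8 and then as ISO-8859-1. The class and method names (WeakLinkDemo, roundTrip) are invented for this example, and ISO-8859-1 is just a stand-in for whichever component in your stack is not speaking UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration; not taken from the project.
class WeakLinkDemo {

    // Simulates one hop in the chain: the text is encoded to bytes with the
    // given charset and decoded again with the same charset on the other side.
    static String roundTrip(String text, Charset charset) {
        byte[] wire = text.getBytes(charset);
        return new String(wire, charset);
    }

    public static void main(String[] args) {
        // The Russian word for "hello", written with Unicode escapes so the
        // example does not depend on the encoding of this source file.
        String russian = "\u041f\u0440\u0438\u0432\u0435\u0442";

        // A UTF-8 link carries the text through unchanged.
        System.out.println(roundTrip(russian, StandardCharsets.UTF_8).equals(russian)); // true

        // ISO-8859-1 cannot represent Cyrillic, so each character is
        // replaced with '?' during encoding and the original is lost.
        System.out.println(roundTrip(russian, StandardCharsets.ISO_8859_1)); // ??????
    }
}
```

The nasty part is that the damage is silent and permanent: no exception is thrown, and once the question marks have been written there is no way to get the original characters back. A round-trip check like this at each boundary is a cheap way to test that a component really is UTF-8 clean.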