What's That Noise?! [Ian Kallen's Weblog]

Sunday December 12, 2004

l10n development practices I've used Java's handy-dandy ResourceBundles to do some proof-of-concept localizations. But my past proofs were limited; they only used other European languages that utilized ISO-8859-1 characters. I don't recall having to do anything special with the property files in those cases.

Working on a recent Japanese localization project was an eye opening experience. It turns out the java.util.Properties expects ISO-8859-1 characters. I guess that's the downside of having a super-simple file format. I got the localized display boostrapped by using native2ascii to get the UTF-8 localization text rendered as escaped unicode. On a one-off basis, that's easy enough. But collaborative development always begs the tools question, how do folks typically manage this?

Give the localization team instructions on how to use native2ascii on a command line basis?
Put it in the build system with the native2ascii ant task?
Use a light weight GUI tool like resourcebundleeditor (or a heavy weight GUI tool ...)?

What about input encoding? If there's an HTML form on a page and the input has multibyte characters in the query string (or POST data), are characters escaped to ISO-8859-1? My recollection was that HTTP headers must be ISO-8859-1.... but looking at the docs for PHP's mbstring and the encoding_translation parameter, it looks like server-side handling of the request needs to account for other character set encodings. Do browsers honor charset specification as a form attribute, like

<form action=... method=... accept-charset="UTF-8">

(looks like Struts supports this) or is it presumed that the browser always escapes unicode? Or perhaps they simply URL encode the characters so it's a non-issue? On the server side the must the request handling do this

request.setCharacterEncoding("UTF-8");
String raw = request.getParameter("foo");
String clean = new String(raw.getBytes("ISO-8859-1"), "UTF-8");

or is it all supposed to transparently just work (obviating String cleansing) if request.setCharacterEncoding("UTF-8") is used? ...for all of the hand-waving in the docs for ResourceBundle, etc establishing a clear practice for input String handling in a webapp remains murky.

As far as sending responses, is it safe to always just send UTF-8 and include "charset=UTF-8" in the Content-type header? Is it standard practice to presume that the client will send a request header Accept-Charset (which indicates what an acceptable response is)? If they send it and UTF-8 isn't on the list, must the server go through a big String re-writing exercise to encode response to the browser's preference or is UTF-8 presumed to be implicitly acceptable at all times?

So many questions... I'm still digging for anwers.

( Dec 12 2004, 11:51:01 PM PST ) Permalink

Sun	Mon	Tue	Wed	Thu	Fri	Sat
« December 2004 »
			1	2	3	4
5	6	7	8	9	10	11
12	13	14	15	16	17	18
19	20	21	22	23	24	25
26	27	28	29	30	31

Today

What's That Noise?! [Ian Kallen's Weblog]

Sunday December 12, 2004

Technorati

Who?

Twitter

links

calendar

search