I was recently stymied by an encoding error (the exception thrown was a UnicodeError) on a web page that was detected as utf-8. The W3 Validator said it was utf-8, but in all my efforts to get parsing classes derived from Python's SGMLParser to handle it, the parser consistently bombed out. I tried chardet:
>>> import chardet
>>> import urllib
>>> urlread = lambda url: urllib.urlopen(url).read()
>>> chardet.detect(urlread(theurl))
{'confidence': 0.98999999999999999, 'encoding': 'utf-8'}

...and yet the parser insisted that it had hit the "'ascii' codec can't decode byte XXXX in position YYYY: ordinal not in range(128)" error. WTF?!
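For what it's worth, that message is what Python's ASCII codec produces whenever it is asked to decode a byte above 0x7F. Here is a minimal sketch that reproduces it, with a made-up sample string standing in for the page data (the `sample` bytes are an assumption, not taken from the actual page):

```python
# Made-up sample: the UTF-8 bytes for "cafe" with an accented e.
# This is an assumption for illustration, not the real page data.
sample = b"caf\xc3\xa9"

try:
    # Decoding with the ASCII codec fails on the first byte above 0x7F.
    sample.decode("ascii")
    message = None
except UnicodeDecodeError as err:
    message = str(err)

# The same bytes decode cleanly when the codec matches the data.
text = sample.decode("utf-8")

print(message)
```

This only shows where the message comes from. It does not show which step inside the parser ended up asking for an ASCII decode.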
On a hunch, I decided to try forcing it to be treated as utf-16 and then coercing it back to utf-8, like this:

parser.feed(pagedata.encode("utf-16", "replace").encode("utf-8"))

That worked!
I hate it when I follow a hunch and it pans out, but I don't have any explanation as to why. I just don't know enough about the details of Python's character encoding behaviors to debug this further; most of my work is in those Curly Bracket languages :)
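The usual alternative to a double encode is to decode the raw bytes explicitly, using the encoding chardet reported, before they go anywhere near the parser. A minimal sketch, assuming the page data arrives as a byte string; `to_text` and the sample bytes are hypothetical, made up for illustration, and whether this would have cured this particular page is untested:

```python
def to_text(raw_bytes, encoding="utf-8"):
    """Decode raw page bytes to text.

    Undecodable bytes become U+FFFD (the replacement character)
    instead of raising UnicodeDecodeError.
    """
    return raw_bytes.decode(encoding, "replace")


# Made-up bytes (an assumption): valid UTF-8 plus one stray 0xFF byte.
sample_page = b"<p>caf\xc3\xa9 \xff</p>"
page_text = to_text(sample_page)

print(repr(page_text))
```

The decoded text could then be handed to `parser.feed(...)`. Whether SGMLParser copes with unicode input in every case is a separate question that this sketch does not answer.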
If any Python experts are having "OMG don't do that, here's why..." reactions, please let me know!
( Apr 16 2007, 11:28:31 AM PDT )