ASPN : Python Cookbook : Visualize unicode strings

accesine 2005-12-01


text=u"""Europython 2005
G\u00f6teborg, Sweden
\u8463\u5049\u696d
Hotel rates 100\N{euro sign}
"""

import codecs 

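def printu(ustr):
    # Show characters the console cannot display as \uXXXX escapes.
    print ustr.encode('raw_unicode_escape')

def saveu(ustr, filename='output.txt'):
    # Write a UTF-8 BOM followed by the UTF-8 encoded text.
    open(filename, 'wb').write(codecs.BOM_UTF8 + ustr.encode('utf8'))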

Discussion:

Someday all software, including the console and text editors, will fully support Unicode and display any language effortlessly. Until then we have to settle for consoles that work with 8-bit characters only. Here I show a few tricks to help display Unicode in Python.

First of all, I have defined a variable 'text' above as a sample text. It is a Unicode string that contains characters in several languages. In Python the 'u' or 'U' prefix denotes a Unicode string. Unicode characters outside of ASCII can be entered using the '\uXXXX' escape sequence or, by Unicode character name, the '\N{name}' notation.
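As a small illustration of the two notations, the sketch below writes the same character both ways and compares the results. The names by_code, by_name, same and euro_name are made up for this example, and the code is written so that it also runs on Python 3, where the 'u' prefix is accepted but optional.

```python
import unicodedata

# Hypothetical sample values, chosen only for illustration.
by_code = u"100\u20ac"          # euro sign written as a \uXXXX escape
by_name = u"100\N{EURO SIGN}"   # the same character written by name

# Both literals produce the identical four-character string.
same = (by_code == by_name)

# unicodedata.name() goes the other way: character -> official name.
euro_name = unicodedata.name(by_code[-1])
```

Character names in '\N{name}' are not case sensitive, which is why the recipe's '\N{euro sign}' works as well.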

If we just try 'print text', it will run into the dreaded UnicodeEncodeError. Since the console in general supports only ASCII characters, Python automatically encodes Unicode strings to ASCII before printing. Any character that falls outside the ASCII range, like \u8463, causes an exception.
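The failure can be reproduced without a console by encoding to ASCII explicitly, which is roughly what happens behind the print statement when the output encoding is ASCII. This is a minimal sketch; sample, failed_at and offending are names invented for the example.

```python
# Hypothetical sample string with two non-ASCII characters.
sample = u"G\u00f6teborg \u8463"

try:
    sample.encode('ascii')      # 'strict' is the default error handler
    failed_at = None
except UnicodeEncodeError as err:
    # err.start is the index of the first character that could not
    # be encoded.
    failed_at = err.start

offending = sample[failed_at] if failed_at is not None else None
```

The exception reports the first offending character only, here the 'ö' at index 1, even though later characters would fail too.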

One simple way to see at least some result is to use 'replace' as the error handling method, as opposed to the default 'strict', when encoding. For example,

>>> print text.encode('ascii','replace')
Europython 2005
G?teborg, Sweden
???
Hotel rates 100?

The characters that cannot be represented in ASCII are turned into '?'. The result is a corrupted string, but I still prefer this to not showing anything at all. Replacing non-ASCII characters with '?' is a quick and dirty trick, though, and sometimes you really need to know what the characters are. The printu() function uses a little-known internal encoding scheme, 'raw_unicode_escape', to render the string:

>>> printu(text)
Europython 2005
Göteborg, Sweden
\u8463\u5049\u696d
Hotel rates 100\u20ac

Characters that cannot be displayed in the console are shown as \u escape sequences, so you can verify that the euro sign U+20AC is correctly represented. The text can also be easily cut and pasted to form a string literal that reconstructs the string.
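The reconstruction can also be done programmatically, since 'raw_unicode_escape' works for decoding as well as encoding. The sketch below is only an illustration, with sample, escaped and restored as invented names. It also shows why 'ö' is not escaped: characters up to U+00FF are emitted as single Latin-1 bytes, and only higher code points become \uXXXX.

```python
# Hypothetical sample mixing Latin-1, CJK and the euro sign.
sample = u"G\u00f6teborg \u8463\u5049\u696d 100\u20ac"

# Encode: code points above U+00FF become literal \uXXXX sequences,
# everything else is written as one Latin-1 byte per character.
escaped = sample.encode('raw_unicode_escape')

# Decode: the same codec turns the escapes back into characters.
restored = escaped.decode('raw_unicode_escape')
```

The round trip is exact for this sample. A string that already contains a literal backslash followed by 'u' would need more care.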

To actually see the sample rendered we need to find some software that supports displaying Unicode. The good old vi will not do. I highly recommend EmEditor, a Windows shareware editor (http://www./). It is by far the best at handling various character encodings and fonts. Otherwise, web browsers are also very good at rendering Unicode text. First use saveu() to dump the string into a file:

>>> saveu(text)

Next open the file 'output.txt' with your browser. The characters should show there. If you do not have time to execute the examples, I have posted a copy of the output at http:///2005/sample_utf8.txt. saveu() writes the file using the common UTF-8 encoding. The codecs.BOM_UTF8 inserted is a 3-byte magic number that denotes the file as a Unicode text file encoded using UTF-8. The BOM is optional, but in this case it helps the browser detect the encoding correctly.
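To check what actually ends up on disk, the file can be read back in binary mode. This is a rough sketch under a few assumptions: it writes to a scratch file in a temporary directory rather than 'output.txt', the names sample, path, raw, has_bom and decoded are invented for the example, and it relies on the 'utf-8-sig' codec (available since Python 2.5) to strip the BOM when decoding.

```python
import codecs
import os
import tempfile

# Hypothetical sample text; the recipe uses the longer 'text' variable.
sample = u"G\u00f6teborg 100\u20ac\n"

# Hypothetical scratch location, so nothing in the current directory
# is overwritten.
path = os.path.join(tempfile.mkdtemp(), 'sample_utf8.txt')

# Same layout that saveu() produces: BOM first, then UTF-8 bytes.
with open(path, 'wb') as f:
    f.write(codecs.BOM_UTF8 + sample.encode('utf8'))

with open(path, 'rb') as f:
    raw = f.read()

# The BOM is the three bytes EF BB BF at the start of the file.
has_bom = raw.startswith(codecs.BOM_UTF8)

# 'utf-8-sig' drops a leading BOM; plain 'utf-8' would keep it as
# the character U+FEFF at the start of the decoded string.
decoded = raw.decode('utf-8-sig')
```

Decoding with 'utf-8-sig' gives back the original string, which is a convenient way to read such files again from Python.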