= How to use Unicode in Python =

''Last updated 2006-03-04.''

Here are Joel's notes about how to use Unicode in Python. I make no guarantees that the information below is correct. It is partly based on guesswork. If you think that anything is wrong, don't hesitate to contact me at joel@rosdahl.net or change the wiki page directly.

Python 2.4.1 (and also 2.3.5 in some cases) was used when testing. Platforms tested: Debian GNU/Linux testing (with en_GB.UTF-8 and en_GB.ISO-8859-1 locales) and Windows 2000 (in the command prompt and in IDLE). It would be nice to know about more platforms (like Mac OS etc.), but I haven't got access to any other systems right now.

== Nomenclature ==

I use these definitions in this document:

 * '''Unicode string''': Python's {{{unicode}}} type.
 * '''Byte string''': Python's {{{str}}} type.
 * '''Windows''': Windows-like operating system; at least Windows 2000 and Windows XP.
 * '''Unix''': Linux/GNU/Posix-like operating systems.

== General ==

Always keep text in Unicode strings internally in the application, and encode/decode properly at the boundaries: the outside world (I/O and UI) and libraries that are not Unicode-aware or Unicode-agnostic.

== String interpolation ==

Although

{{{
"%sbar" % u"foo"
}}}

and

{{{
u"%sbar" % u"foo"
}}}

yield the same result ({{{u"foobar"}}}), use the latter for clarity and performance reasons.

Python 2.3 has this bug:

{{{
class C:
    def __str__(self):
        return "str"
    def __unicode__(self):
        return "unicode"

print u"%s" % C()  # Prints "str", not "unicode".
}}}

The bug has been fixed in Python 2.4 ({{{unicode}}} is printed).

Surprising results may occur when byte format strings are used. For example, the expressions

{{{
"%s %s" % (str_value, x)
"%s %s" % (unicode_value, x)
}}}

do not format {{{x}}} the same way if {{{x}}} happens to have a {{{__unicode__}}} method. {{{__str__}}} is called (or {{{__repr__}}} if {{{__str__}}} is missing) in the first case. {{{__unicode__}}} is called in the second.
In fact, within {{{"%s %s %s %s %s..."}}}, the first Unicode value from left to right switches the result to Unicode mode for the remaining values. If the values are a mix of byte strings and Unicode strings held in variables, the formatting becomes a bit unpredictable. The easy solution is to always use Unicode format strings.

== File content input/output ==

Normally, we want to read/write file contents according to the user's default locale:

{{{
import codecs
import locale
encoding = locale.getpreferredencoding()
f = codecs.open("filename", "w", encoding)
}}}

Note that {{{codecs.open}}} also takes an {{{errors}}} parameter that determines what to do with code points that can't be expressed in the chosen encoding.

== Standard input/output/error ==

{{{sys.stdin}}} and {{{sys.stdout}}} have an {{{encoding}}} attribute if they refer to a terminal and Python has been able to figure out the encoding for the terminal. So, if the attribute exists, it should be used for decoding standard input and encoding standard output. If it doesn't exist, standard input/output is probably redirected from/to a file or pipe whose encoding is unknown, and we should use the same logic as for normal file content input/output as described above. Here is a way to make suitable wrappers:

{{{
import codecs
import locale
import sys

def get_file_encoding(f):
    # Use the stream's own encoding if Python could determine one;
    # otherwise fall back to the locale's preferred encoding.
    if hasattr(f, "encoding") and f.encoding:
        return f.encoding
    else:
        return locale.getpreferredencoding()

sys.stdin = codecs.getreader(get_file_encoding(sys.stdin))(sys.stdin)
sys.stdout = codecs.getwriter(get_file_encoding(sys.stdout))(sys.stdout)
sys.stderr = codecs.getwriter(get_file_encoding(sys.stderr))(sys.stderr)
}}}

XXX: Possible Python bug: {{{sys.stderr}}} often seems to lack an {{{encoding}}} attribute. This means that a program whose standard output is directed to a file or pipe, while standard error still refers to a terminal, can't find out the proper encoding of {{{sys.stderr}}}.

(The only place where I have found {{{sys.stdout.encoding}}} and {{{locale.getpreferredencoding()}}} to differ is in the Windows command shell.)
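To make the fallback logic and the {{{errors}}} parameter a bit more concrete, here is a minimal sketch. It is not part of the tested notes above: {{{FakeStream}}} and {{{pick_encoding}}} are hypothetical names made up for illustration, the fallback encoding is passed in explicitly so that the result doesn't depend on the locale of the machine running it, and an in-memory {{{io.BytesIO}}} buffer stands in for a redirected standard output.

```python
import codecs
import io


class FakeStream(object):
    """Hypothetical stand-in for a stream; may or may not have an encoding."""

    def __init__(self, encoding=None):
        if encoding is not None:
            self.encoding = encoding


def pick_encoding(f, fallback):
    # Same test as get_file_encoding, but with the fallback passed in
    # instead of taken from locale.getpreferredencoding().
    if hasattr(f, "encoding") and f.encoding:
        return f.encoding
    return fallback


terminal = FakeStream("ISO-8859-1")  # Pretends to be a terminal.
redirected = FakeStream()            # Pretends to be a pipe: no encoding.

terminal_encoding = pick_encoding(terminal, "ascii")
redirected_encoding = pick_encoding(redirected, "ascii")

# Wrap a byte buffer the same way sys.stdout is wrapped above. With
# errors="replace", code points that can't be encoded become "?"
# instead of raising UnicodeEncodeError.
buf = io.BytesIO()
writer = codecs.getwriter(redirected_encoding)(buf, errors="replace")
writer.write(u"caf\xe9")
result = buf.getvalue()
```

With {{{errors="strict"}}} (the default), the {{{write}}} call would raise {{{UnicodeEncodeError}}} instead. Which behaviour is preferable probably depends on whether losing characters silently is acceptable for the output in question.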
== File system ==

Since Python 2.3, file system operations (with some exceptions) do the right thing (using {{{sys.getfilesystemencoding()}}}) when given Unicode string arguments:

{{{
os.unlink(u"path")
# Et cetera.
}}}

On Unix, the file system encoding is taken from the locale settings. On Windows, the encoding is always {{{mbcs}}}, which indicates that the "wide" versions of API calls should be used.

Note: File system operations raise a {{{UnicodeEncodeError}}} if given a path that can't be encoded in the encoding returned by {{{sys.getfilesystemencoding()}}}.

== os.listdir ==

{{{os.listdir(u"path")}}} returns Unicode strings for names that can be decoded with {{{sys.getfilesystemencoding()}}} but silently returns byte strings for names that can't be decoded. That is, the return value of {{{os.listdir(u"path")}}} is potentially a mixed list of Unicode and byte strings.

== os.readlink ==

{{{os.readlink}}} chokes on Unicode strings that aren't coercible to the default encoding. The argument must therefore be a byte string. (Not applicable to Windows.)

== glob ==

{{{glob.glob(u"pattern")}}} does not return Unicode strings. XXX: Is this a bug?

== shutil ==

{{{shutil.copytree}}} in Python 2.3 fails with {{{UnicodeEncodeError}}} when given a Unicode string as the first argument and the specified directory tree has some filename that isn't decodable in {{{sys.getfilesystemencoding()}}}. Works OK in Python 2.4.

== os.path ==

On Unix, {{{os.path.abspath}}} raises {{{UnicodeDecodeError}}} when given a Unicode string with a relative path and {{{os.getcwd()}}} returns a non-ASCII byte string (or rather: a non-{{{sys.getdefaultencoding()}}}-encoded byte string). Therefore, the argument must be a byte string. On Windows, however, the argument must be a Unicode string so that the "wide" API calls are used.
This leads to the nasty situation that a work-around is needed, but only on Unix:

{{{
import os
import sys

if os.name == "posix":
    fs_enc = sys.getfilesystemencoding()
    def abspath(path):
        # Go via a byte string so that os.getcwd() can be joined safely.
        return os.path.abspath(path.encode(fs_enc)).decode(fs_enc)
else:
    abspath = os.path.abspath

x = abspath(u"path")
}}}

{{{os.path.realpath}}} behaves the same way as {{{os.path.abspath}}}.

XXX: Bugs, I presume?

== Program arguments ==

The program arguments in {{{sys.argv}}} come as byte strings. I think that {{{sys.getfilesystemencoding()}}} should be used for the decoding:

{{{
import sys
encoding = sys.getfilesystemencoding()
sys.argv = [x.decode(encoding) for x in sys.argv]
}}}

== Calling external programs ==

{{{os.system}}} and {{{os.popen}}} coerce Unicode strings to byte strings using Python's default encoding (returned by {{{sys.getdefaultencoding()}}} and typically {{{ascii}}}), which is not good. The argument to those functions must therefore be encoded explicitly.

On Unix, the {{{os.exec*}}} functions convert all parameters (path to executable and arguments) from Unicode strings to properly encoded byte strings. I don't know if {{{locale.getpreferredencoding()}}} or {{{sys.getfilesystemencoding()}}} is used, though. My guess is {{{sys.getfilesystemencoding()}}}.

On Windows 2000, the {{{os.exec*}}} functions also seem to handle Unicode strings properly.

On Unix, the {{{subprocess}}} module uses {{{os.execvp}}}, so the discussion above applies here.

On Windows 2000, the {{{subprocess}}} module does not handle Unicode strings, so properly encoded byte strings must be created for all arguments. XXX: Bug?

== os.path.expanduser ==

{{{os.path.expanduser}}} (on both Unix and Windows) doesn't handle Unicode when {{{~}}} expands to a non-ASCII path. Therefore, a byte string must be passed in and the result decoded:

{{{
home = os.path.expanduser("~").decode(sys.getfilesystemencoding())
}}}

XXX: Bug?

== os.environ ==

Environment variables in the {{{os.environ}}} dictionary are byte strings (both names and values).
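If the environment is needed as Unicode strings, something like the following sketch might do. Note the assumptions: {{{decode_mapping}}} is a hypothetical helper name, and I haven't established which encoding is the right one for environment variables, so the sketch guesses {{{sys.getfilesystemencoding()}}} in the same way as for {{{sys.argv}}} above. It also uses the {{{bytes}}} name and {{{b"..."}}} literals, which need a newer Python than the versions tested in these notes (2.6 or later).

```python
import os
import sys


def decode_mapping(mapping, encoding, errors="strict"):
    """Return a new dict with byte string keys and values decoded.

    Entries that are not byte strings are passed through unchanged.
    """
    def decode(value):
        if isinstance(value, bytes):
            return value.decode(encoding, errors)
        return value

    return dict((decode(k), decode(v)) for k, v in mapping.items())


# A made-up environment with a non-ASCII, UTF-8-encoded value.
sample = {b"HOME": b"/home/jo\xc3\xabl"}
decoded = decode_mapping(sample, "utf-8")

# Assumption: the file system encoding is a reasonable guess for the
# real environment. Fall back to ASCII if Python could not determine it.
environ = decode_mapping(os.environ, sys.getfilesystemencoding() or "ascii")
```

Passing {{{errors="replace"}}} trades exactness for robustness: the call no longer fails on undecodable values, but the decoded value can't be encoded back to the original bytes.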
== inspect ==

{{{inspect.getsource()}}} returns byte strings, possibly with non-ASCII characters in them, so mixing the result with Unicode strings may be problematic.

== Unicode arguments to exceptions ==

The Python interpreter may have problems printing exceptions if Unicode strings with non-ASCII characters are passed as exception arguments. For example, consider this program:

{{{
x = u":-)"
raise Exception(x)
}}}

When run, the interpreter prints this traceback:

{{{
Traceback (most recent call last):
  File "foo.py", line 2, in ?
    raise Exception(x)
Exception: :-)
}}}

Now, suppose there are non-ASCII characters in the exception argument:

{{{
x = u"\u263a"
raise Exception(x)
}}}

The interpreter will print the following (at least if {{{sys.getdefaultencoding()}}} is {{{ascii}}}):

{{{
Traceback (most recent call last):
  File "foo.py", line 2, in ?
    raise Exception(x)
}}}

That is, the last line is missing, so the value of {{{x}}} will not be displayed.

== PyGTK ==

PyGTK works with UTF-8-encoded byte strings. Setter methods seem to convert Unicode strings to UTF-8 byte strings, though. So:

{{{
widget.set_thing(unicode_string)
unicode_string = widget.get_thing().decode("utf-8")
}}}

== References ==

 * http://www.python.org/doc/current/lib/module-locale.html
 * http://www.python.org/doc/current/lib/module-codecs.html
 * http://www.python.org/doc/current/lib/module-sys.html#l2h-348
 * http://www.amk.ca/python/howto/unicode