How to use Unicode in Python

Last updated 2006-03-04.

Here are Joel's notes about how to use Unicode in Python.

I make no guarantees that the information below is correct. It is partly based on guesswork. If you think that anything is wrong, don't hesitate to contact me at joel@… or change the wiki page directly.

Python 2.4.1 (and also 2.3.5 in some cases) was used when testing. Platforms tested: Debian GNU/Linux testing (with the en_GB.UTF-8 and en_GB.ISO-8859-1 locales) and Windows 2000 (in the command prompt and in IDLE). It would be nice to know about more platforms (such as Mac OS), but I don't have access to any other systems right now.

Nomenclature

I use these definitions in this document:

  • Unicode string: Python's unicode type.
  • Byte string: Python's str type.
  • Windows: Windows-like operating system; at least Windows 2000 and Windows XP.
  • Unix: Linux/GNU/Posix-like operating systems.

General

Always keep text in Unicode strings inside the application. Decode input and encode output at the boundaries: I/O, the UI, and libraries that are neither Unicode-aware nor Unicode-agnostic.
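Here is a minimal sketch of that decode-early, encode-late pattern. The function name shout and the UTF-8 default are illustrative assumptions, and the b"..." literal needs Python 2.6 or later, which is newer than the versions tested for these notes.

```python
def shout(raw_bytes, encoding="utf-8"):
    """Upper-case encoded text, working on Unicode internally."""
    text = raw_bytes.decode(encoding)  # byte string -> Unicode at the input boundary
    text = text.upper()                # all processing happens on Unicode
    return text.encode(encoding)       # Unicode -> byte string at the output boundary

# "caf\xc3\xa9" is the UTF-8 encoding of "cafe" with an acute accent on the e.
result = shout(b"caf\xc3\xa9")
```

Upper-casing the raw byte string directly would not reliably touch the accented character, which is why the work is done on the decoded value.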

String interpolation

Although

"%sbar" % u"foo"

and

u"%sbar" % u"foo"

yield the same result (u"foobar"), use the latter for clarity and performance reasons.

Python 2.3 has this bug:

class C:
    def __str__(self):
        return "str"
    def __unicode__(self):
        return "unicode"

print u"%s" % C() # Prints "str", not "unicode".

The bug has been fixed in Python 2.4 ("unicode" is printed).

Surprising results may occur when byte format strings are used. For example, the expressions

"%s %s" % (str_value, x)
"%s %s" % (unicode_value, x)

do not format x the same way if x happens to have a __unicode__ method. In the first case, __str__ is called (or __repr__ if __str__ is missing). In the second, __unicode__ is called. In fact, within "%s %s %s %s %s...", the first Unicode value from left to right switches the result to Unicode mode for the remaining values. If the values are a mix of byte strings and Unicode strings held in variables, the formatting becomes unpredictable. The easy solution is to always use Unicode format strings.

File content input/output

Normally, we want to read/write file contents according to the user's default locale:

import codecs
import locale

encoding = locale.getpreferredencoding()
f = codecs.open("filename", "w", encoding)

Note that codecs.open also takes an errors parameter that determines what to do with code points that can't be expressed in the chosen encoding.
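As a rough illustration of the errors parameter, the sketch below writes a string containing a character that ISO-8859-1 can't express and uses errors="replace" so that the character is substituted instead of raising UnicodeEncodeError. The temporary file and the choice of ISO-8859-1 are assumptions made only for the example.

```python
import codecs
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # U+263A (a smiley) has no ISO-8859-1 representation; "replace" writes "?".
    f = codecs.open(path, "w", "iso-8859-1", errors="replace")
    f.write(u"sm\u00f6rg\u00e5s \u263a")
    f.close()

    # Read the raw bytes back to see what actually ended up in the file.
    f = open(path, "rb")
    data = f.read()
    f.close()
finally:
    os.remove(path)
```

With the default errors="strict", the write call would raise UnicodeEncodeError instead.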

Standard input/output/error

sys.stdin and sys.stdout have an encoding attribute if they refer to a terminal and Python has been able to figure out the terminal's encoding. If the attribute exists and is set, use it to decode standard input and encode standard output. If it doesn't exist, standard input/output is probably redirected from/to a file or pipe whose encoding is unknown, and we should use the same logic as for normal file content input/output described above.

Here is a way to make suitable wrappers:

import codecs
import locale
import sys

def get_file_encoding(f):
    # Prefer the encoding Python detected for the terminal, if any.
    if hasattr(f, "encoding") and f.encoding:
        return f.encoding
    else:
        return locale.getpreferredencoding()

sys.stdin = codecs.getreader(get_file_encoding(sys.stdin))(sys.stdin)
sys.stdout = codecs.getwriter(get_file_encoding(sys.stdout))(sys.stdout)
sys.stderr = codecs.getwriter(get_file_encoding(sys.stderr))(sys.stderr)

XXX: Possible Python bug: sys.stderr seems to often lack an encoding attribute. This means that a program that has standard output directed to a file or pipe but standard error still refers to a terminal can't find out the proper encoding of sys.stderr. (The only place where I have found sys.stdout.encoding and locale.getpreferredencoding() to differ is in the Windows command shell.)

File system

Since Python 2.3, file system operations (with some exceptions) do the right thing (using sys.getfilesystemencoding()) when given Unicode string arguments:

os.unlink(u"path")
# Et cetera.

On Unix, the file system encoding is taken from the locale settings. On Windows, the encoding is always mbcs, which indicates that the "wide" versions of API calls should be used.

Note: File system operations raise a UnicodeEncodeError if given a path that can't be encoded in the encoding returned by sys.getfilesystemencoding().

os.listdir

os.listdir(u"path") returns Unicode strings for names that can be decoded with sys.getfilesystemencoding() but silently returns byte strings for names that can't be decoded. That is, the return value of os.listdir(u"path") is potentially a mixed list of Unicode and byte strings.
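One possible way to cope with such a mixed list is to normalise it afterwards. This is only a sketch: the helper name to_unicode_names, the "replace" error handling, and the UTF-8 fallback are my assumptions, and the bytes name requires Python 2.6 or later.

```python
import os
import sys

def to_unicode_names(names, encoding=None):
    """Return names with every byte string decoded to a Unicode string."""
    if encoding is None:
        # Assumption: fall back to UTF-8 if Python reports no file system encoding.
        encoding = sys.getfilesystemencoding() or "utf-8"
    result = []
    for name in names:
        if isinstance(name, bytes):
            # Undecodable bytes become U+FFFD instead of raising an error.
            name = name.decode(encoding, "replace")
        result.append(name)
    return result

names = to_unicode_names(os.listdir(u"."))
```

Note that a name decoded with "replace" can no longer be mapped back to the original file name, so keep the byte string around if the file must be opened later.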

os.readlink

os.readlink chokes on Unicode strings that aren't coercible to the default encoding. The argument must therefore be a byte string. (Not applicable to Windows.)

glob

glob.glob(u"pattern") does not return Unicode strings. XXX: Is this a bug?

shutil

shutil.copytree in Python 2.3 fails with UnicodeEncodeError when given a Unicode string as the first argument and the specified directory tree contains a filename that can't be decoded with sys.getfilesystemencoding(). This works in Python 2.4.

os.path

On Unix, os.path.abspath raises UnicodeDecodeError when given a Unicode string containing a relative path while os.getcwd() returns a non-ASCII byte string (or rather, a byte string that can't be decoded with sys.getdefaultencoding()). Therefore, the argument must be a byte string. On Windows, however, the argument must be a Unicode string so that the "wide" API calls are used. The result is a nasty situation where a work-around is needed, but only on Unix:

import os
import sys

if os.name == "posix":
    fs_enc = sys.getfilesystemencoding()
    def abspath(path):
        # Encode first so that os.path.abspath only ever joins byte strings.
        return os.path.abspath(path.encode(fs_enc)).decode(fs_enc)
else:
    abspath = os.path.abspath
x = abspath(u"path")

os.path.realpath behaves the same way as os.path.abspath.

XXX: Bugs, I presume?

Program arguments

The program arguments in sys.argv come as byte strings. I think that sys.getfilesystemencoding() should be used for the decoding:

import sys

encoding = sys.getfilesystemencoding()
sys.argv = [x.decode(encoding) for x in sys.argv]

Calling external programs

os.system and os.popen coerce Unicode strings to byte strings using Python's default encoding (returned by sys.getdefaultencoding() and typically ascii), which is not good. The argument to those functions must therefore be encoded explicitly.
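A small sketch of explicit encoding follows. The helper name encode_command is hypothetical, and using locale.getpreferredencoding() as the default is my assumption; I have not verified which encoding the shell actually expects. The code is written so that it also runs on later Python versions (2.6+).

```python
import locale

def encode_command(command, encoding=None):
    """Encode a Unicode command line to a byte string explicitly."""
    if encoding is None:
        # Assumption: the locale's preferred encoding is what the shell expects.
        encoding = locale.getpreferredencoding()
    return command.encode(encoding)

encoded = encode_command(u"echo sm\u00f6rg\u00e5s", "utf-8")
# On Python 2, pass the byte string on, for example: os.system(encoded)
```

Encoding explicitly means that a failure shows up as a UnicodeEncodeError with a known encoding, not as an implicit coercion to ascii.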

On Unix, the os.exec* functions convert all parameters (path to executable and arguments) from Unicode strings to properly encoded byte strings. I don't know if locale.getpreferredencoding() or sys.getfilesystemencoding() is used, though. My guess is sys.getfilesystemencoding().

On Windows 2000, it also seems that the os.exec* functions take Unicode strings properly.

On Unix, the subprocess module uses os.execvp, so the discussion above applies here.

On Windows 2000, the subprocess module does not handle Unicode strings, so properly encoded byte strings must be created for all arguments. XXX: Bug?

os.path.expanduser

os.path.expanduser (on both Unix and Windows) doesn't handle Unicode when ~ expands to a non-ASCII path. Therefore, a byte string must be passed in and the result decoded:

import os, sys
home = os.path.expanduser("~").decode(sys.getfilesystemencoding())

XXX: Bug?

os.environ

Environment variables in the os.environ dictionary are byte strings (both names and values).

inspect

inspect.getsource() returns byte strings, possibly with non-ASCII characters in them, so mixing the result with Unicode may be problematic.

Unicode arguments to exceptions

The Python interpreter may have problems printing exceptions if Unicode strings with non-ASCII characters are passed as exception arguments. For example, consider this program:

x = u":-)"
raise Exception(x)

When run, the interpreter prints this traceback:

Traceback (most recent call last):
  File "foo.py", line 2, in ?
    raise Exception(x)
Exception: :-)

Now, suppose there are non-ASCII characters in the exception argument:

x = u"\u263a"
raise Exception(x)

The interpreter will print the following (at least if sys.getdefaultencoding() is ascii):

Traceback (most recent call last):
  File "foo.py", line 2, in ?
    raise Exception(x)

That is, the last line is missing, so the value of x will not be displayed.
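One possible work-around, which is my own suggestion and not something I have tested widely, is to pass the exception an ASCII-only byte string in which non-ASCII characters are escaped. The helper name ascii_safe is hypothetical.

```python
def ascii_safe(message):
    """Return an ASCII-only byte string with non-ASCII characters escaped."""
    return message.encode("ascii", "backslashreplace")

safe = ascii_safe(u"bad value: \u263a")
# On Python 2, raise Exception(safe) then ends the traceback with:
# Exception: bad value: \u263a
```

The escaped form is less readable than the real character, but the message is no longer dropped from the traceback.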

PyGTK

PyGTK works with UTF-8-encoded byte strings. Setter methods seem to convert Unicode strings to UTF-8 byte strings, though.

So:

widget.set_thing(unicode_string)
unicode_string = widget.get_thing().decode("utf-8")
