How to use Unicode in Python
Last updated 2006-03-04.
Here are Joel's notes about how to use Unicode in Python.
I make no guarantees that the information below is correct. It is partly based on guesswork. If you think that anything is wrong, don't hesitate to contact me at joel@rosdahl.net or change the wiki page directly.
Python 2.4.1 (and also 2.3.5 in some cases) was used when testing. Platforms tested: Debian GNU/Linux testing (with en_GB.UTF-8 and en_GB.ISO-8859-1 locales) and Windows 2000 (in the command prompt and in IDLE). It would be nice to know about more platforms (like Mac OS etc.), but I haven't got access to any other systems right now.
Nomenclature
I use these definitions in this document:
- Unicode string: Python's unicode type.
- Byte string: Python's str type.
- Windows: Windows-like operating system; at least Windows 2000 and Windows XP.
- Unix: Linux/GNU/Posix-like operating systems.
General
Always keep text in Unicode strings internally in the application and encode/decode properly to the outside world (I/O and UI) and libraries that are not Unicode aware or Unicode agnostic.
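As a minimal sketch of this rule: decode bytes as soon as they enter the program, work only on Unicode strings internally, and encode again at the point of output. The helper names decode_input and encode_output and the choice of UTF-8 are assumptions made for illustration; they are not part of any library.

```python
def decode_input(raw, encoding="utf-8"):
    """Turn bytes from the outside world into a Unicode string."""
    return raw.decode(encoding)

def encode_output(text, encoding="utf-8", errors="strict"):
    """Turn an internal Unicode string into bytes for the outside world."""
    return text.encode(encoding, errors)

# Stand-in for bytes read from a file, socket or non-Unicode-aware library.
raw = u"caf\u00e9".encode("utf-8")

text = decode_input(raw)    # boundary: bytes -> Unicode
shout = text.upper()        # internal processing happens on Unicode only
out = encode_output(shout)  # boundary: Unicode -> bytes
```

Keeping the conversions in two small functions makes it easy to see where the encoding decisions are made.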
String interpolation
Although
"%sbar" % u"foo"
and
u"%sbar" % u"foo"
yield the same result (u"foobar"), use the latter for clarity and performance reasons.
Python 2.3 has this bug:
class C:
    def __str__(self):
        return "str"
    def __unicode__(self):
        return "unicode"
print u"%s" % C() # Prints "str", not "unicode".
The bug has been fixed in Python 2.4 (unicode is printed).
Surprising results may occur when byte format strings are used. For example, the expressions
"%s %s" % (str_value, x)
"%s %s" % (unicode_value, x)
do not format x the same way if x has a __unicode__ method. In the first case, __str__ is called (or __repr__ if __str__ is missing); in the second, __unicode__ is called. In fact, within "%s %s %s %s %s...", the first Unicode value from left to right switches the result to Unicode mode for the remaining values. If the values are a mix of byte strings and Unicode strings computed at run time, the formatting becomes unpredictable. The easy solution is to always use Unicode format strings.
File content input/output
Normally, we want to read/write file contents according to the user's default locale:
import codecs
import locale
encoding = locale.getpreferredencoding()
f = codecs.open("filename", "w", encoding)
Note that codecs.open also takes an errors parameter that determines what to do with code points that can't be expressed in the chosen encoding.
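To illustrate the errors parameter, the following sketch writes a character that ASCII cannot express, using the "replace" error handler. The temporary file and the ASCII encoding are assumptions chosen to force the situation; "replace" is one of the standard handlers, alongside "strict" (the default) and "ignore".

```python
import codecs
import os
import tempfile

# Create a scratch file; its path is only used for this illustration.
fd, path = tempfile.mkstemp()
os.close(fd)

# U+263A cannot be expressed in ASCII, so "replace" substitutes "?".
f = codecs.open(path, "w", "ascii", errors="replace")
f.write(u"smile \u263a")
f.close()

# Read the raw bytes back to see what actually reached the file.
f = open(path, "rb")
data = f.read()
f.close()
os.remove(path)
```

With the default "strict" handler, the same write would raise UnicodeEncodeError instead.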
Standard input/output/error
sys.stdin and sys.stdout have an encoding attribute if they refer to a terminal and Python has been able to figure out the encoding for the terminal. So, if the attribute exists, it should be used when decoding standard input and encoding standard output. If it doesn't exist, standard input/output is probably redirected from/to a file or pipe whose encoding is unknown, and we should use the same logic as for normal file content input/output as described above.
Here is a way to make suitable wrappers:
import codecs
import locale
import sys
def get_file_encoding(f):
    # Prefer the encoding Python detected for the stream, if any.
    if hasattr(f, "encoding") and f.encoding:
        return f.encoding
    else:
        return locale.getpreferredencoding()
sys.stdin = codecs.getreader(get_file_encoding(sys.stdin))(sys.stdin)
sys.stdout = codecs.getwriter(get_file_encoding(sys.stdout))(sys.stdout)
sys.stderr = codecs.getwriter(get_file_encoding(sys.stderr))(sys.stderr)
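To show what such a wrapper does without touching the real standard streams, here is a small sketch that wraps an in-memory byte sink. The ByteSink class is a hypothetical stand-in for a byte-oriented stream such as a pipe, and UTF-8 is an assumed encoding.

```python
import codecs

class ByteSink(object):
    """Hypothetical stand-in for a byte-oriented stream such as a pipe."""
    def __init__(self):
        self.chunks = []
    def write(self, data):
        # Record exactly what the wrapper passes down to the stream.
        self.chunks.append(data)

sink = ByteSink()
writer = codecs.getwriter("utf-8")(sink)

# The caller writes Unicode; the wrapper encodes before calling sink.write.
writer.write(u"na\u00efve\n")
written = sink.chunks[0]
```

The underlying stream only ever sees encoded bytes, which is the effect the wrappers above have on sys.stdout and sys.stderr.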
XXX: Possible Python bug: sys.stderr seems to often lack an encoding attribute. This means that a program that has standard output directed to a file or pipe but standard error still refers to a terminal can't find out the proper encoding of sys.stderr. (The only place where I have found sys.stdout.encoding and locale.getpreferredencoding() to differ is in the Windows command shell.)
File system
Since Python 2.3, file system operations (with some exceptions) do the right thing (using sys.getfilesystemencoding()) when given Unicode string arguments:
os.unlink(u"path") # Et cetera.
On Unix, the file system encoding is taken from the locale settings. On Windows, the encoding is always mbcs, which indicates that the "wide" versions of API calls should be used.
Note: File system operations raise a UnicodeEncodeError if given a path that can't be encoded in the encoding returned by sys.getfilesystemencoding().
os.listdir
os.listdir(u"path") returns Unicode strings for names that can be decoded with sys.getfilesystemencoding() but silently returns byte strings for names that can't be decoded. That is, the return value of os.listdir(u"path") is potentially a mixed list of Unicode and byte strings.
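One way to cope with such a mixed list is to decode the leftover byte strings leniently. The sketch below is only an assumption about how one might do this: the helper name normalize_names is hypothetical, and errors="replace" is a lossy choice, so a name decoded this way can be displayed but not used to reopen the file.

```python
def normalize_names(names, encoding, errors="replace"):
    """Return all names as Unicode strings, decoding any byte strings."""
    unicode_type = type(u"")
    result = []
    for name in names:
        if not isinstance(name, unicode_type):
            # Byte string that os.listdir could not decode by itself.
            name = name.decode(encoding, errors)
        result.append(name)
    return result

# Simulated os.listdir(u"path") result: one Unicode name, one byte string.
mixed = [u"a.txt", u"caf\u00e9".encode("utf-8")]
names = normalize_names(mixed, "utf-8")
```

In real use the encoding argument would presumably be sys.getfilesystemencoding().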
os.readlink
os.readlink chokes on Unicode strings that aren't coercible to the default encoding. The argument must therefore be a byte string. (Not applicable to Windows.)
glob
glob.glob(u"pattern") does not return Unicode strings. XXX: Is this a bug?
shutil
shutil.copytree in Python 2.3 fails with UnicodeEncodeError when given a Unicode string as the first argument and the specified directory tree has some filename that isn't decodable in sys.getfilesystemencoding(). Works OK in Python 2.4.
os.path
On Unix, os.path.abspath throws UnicodeDecodeError when given a Unicode string with a relative path and os.getcwd() returns a non-ASCII binary string (or rather: a non-sys.getdefaultencoding()-encoded binary string). Therefore, the argument must be a byte string. On Windows, however, the argument must be a Unicode string so that the "wide" API calls are used. This leads to the nasty situation that a work-around is needed, but only on Unix:
import os
import sys
if os.name == "posix":
    fs_enc = sys.getfilesystemencoding()
    def abspath(path):
        # Go through byte strings on Unix, then decode the result again.
        return os.path.abspath(path.encode(fs_enc)).decode(fs_enc)
else:
    abspath = os.path.abspath
x = abspath(u"path")
os.path.realpath behaves the same way as os.path.abspath.
XXX: Bugs, I presume?
Program arguments
The program arguments in sys.argv come as byte strings. I think that sys.getfilesystemencoding() should be used for the decoding:
import sys
encoding = sys.getfilesystemencoding()
sys.argv = [x.decode(encoding) for x in sys.argv]
Calling external programs
os.system and os.popen coerce Unicode strings to byte strings using Python's default encoding (returned by sys.getdefaultencoding() and typically ascii), which is not good. The argument to those functions must therefore be encoded explicitly.
On Unix, the os.exec* functions convert all parameters (path to executable and arguments) from Unicode strings to properly encoded byte strings. I don't know if locale.getpreferredencoding() or sys.getfilesystemencoding() is used, though. My guess is sys.getfilesystemencoding().
On Windows 2000, it also seems that the os.exec* functions take Unicode strings properly.
On Unix, the subprocess module uses os.execvp, so the discussion above applies here.
On Windows 2000, the subprocess module does not handle Unicode strings, so properly encoded byte strings must be created for all arguments. XXX: Bug?
os.path.expanduser
os.path.expanduser (on both UNIX and Windows) doesn't handle Unicode when ~ expands to a non-ASCII path. Therefore, a byte string must be passed in and the result decoded:
home = os.path.expanduser("~").decode(sys.getfilesystemencoding())
XXX: Bug?
os.environ
Environment variables in the os.environ dictionary are byte strings (both names and values).
inspect
inspect.getsource() returns byte strings, possibly with non-ASCII characters in them, so mixing the result with Unicode may be problematic.
Unicode arguments to exceptions
The Python interpreter may have problems printing exceptions if Unicode strings with non-ASCII characters are passed as exception arguments. For example, consider this program:
x = u":-)"
raise Exception(x)
When run, the interpreter prints this backtrace:
Traceback (most recent call last):
  File "foo.py", line 2, in ?
    raise Exception(x)
Exception: :-)
Now, suppose there are non-ASCII characters in the exception argument:
x = u"\u263a"
raise Exception(x)
The interpreter will print the following (at least if sys.getdefaultencoding() is ascii):
Traceback (most recent call last):
  File "foo.py", line 2, in ?
    raise Exception(x)
That is, the last row is missing, so the value of x will not be displayed.
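One possible workaround, which is an assumption on my part and not something tested above, is to convert the message to an ASCII-only byte string before raising. The helper name ascii_safe is hypothetical; "backslashreplace" is a standard codec error handler that turns each non-ASCII character into an escape sequence.

```python
def ascii_safe(text):
    """Encode text to ASCII, escaping characters that ASCII cannot express."""
    return text.encode("ascii", "backslashreplace")

message = ascii_safe(u"smile \u263a")
# raise Exception(message) would then hand only ASCII bytes to the interpreter.
```

The escaped form is less readable than the original character, but the value is no longer dropped from the traceback.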
PyGTK
PyGTK works with UTF-8-encoded byte strings. Setter methods seem to convert Unicode strings to UTF-8 byte strings, though.
So:
widget.set_thing(unicode_string)
unicode_string = widget.get_thing().decode("utf-8")
Attachments
- unicodetest.py (2.0 kB): A program that prints some encodings determined by the environment
