sys.setdefaultencoding is evil
by Tarek Ziadé
I recently found some UnicodeDecodeError bugs in some products that some people couldn’t reproduce. The bug was due to a call to a CMF API that was doing a str() over the object right before using it.
This is perfectly fine in that case, because the object is supposed to be a ZODB id, so it has to be full ASCII.
So the bug looks like this :
>>> id = u'éou'
>>> str(id)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
The people who couldn’t reproduce it use an ugly hack, which consists of setting Python’s default encoding to utf8:
>>> import sys
>>> sys.setdefaultencoding('utf8')
>>> id = u'éou'
>>> str(id)
'\xc3\xa9ou'
This applies to the whole process, and Python itself dynamically removes the function from the module after its first use. From the official doc:
Set the current default string encoding used by the Unicode implementation. If name does not match any available encoding, LookupError is raised. This function is only intended to be used by the site module implementation and, where needed, by sitecustomize. Once used by the site module, it is removed from the sys module's namespace. New in version 2.0.
I can’t find the link again, but I once read that this built-in was to be removed because it should not be used outside site.py.
The problem is that people tend to add a sitecustomize.py to their environment, then work with str() and unicode() calls and forget about doing it right. The result is a major misuse of strings and unicodes, and the resulting code will be buggy on other computers.
So never ever use this in your code. If you get a UnicodeDecodeError, it probably means the function is expecting a string. If you get a UnicodeEncodeError, it should be unicode. In the same way, do not guess the encoding in your code. You should work with one type (str or unicode) and know exactly what its encoding is.
I think this misuse is partly due to a lack of warning here: http://www.diveintopython.org/xml_processing/unicode.html
That’s one of the first pages a developer finds when he tries to understand why he has such bugs.
See a similar entry on the topic from 2 years ago here: http://faassen.n--tree.net/blog/view/weblog/2005/08/02/0
so what is the solution for all the applications? 🙂
The best way to deal with unicode is to make sure that everything that enters your application (from the filesystem, from the web, or a database) is decoded into unicode, and everything that leaves your application is encoded (preferably to UTF-8).
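A minimal sketch of this decode-on-input, encode-on-output pattern. It is written in Python 3 syntax, where bytes and str play the roles of Python 2's str and unicode. The function names and the Latin-1 input are assumptions chosen for illustration.

```python
def read_input(raw, encoding):
    """Decode bytes arriving from outside into text, using a known encoding."""
    return raw.decode(encoding)


def write_output(text, encoding="utf-8"):
    """Encode text to bytes just before it leaves the application."""
    return text.encode(encoding)


# Hypothetical input: Latin-1 bytes, as they might come from a legacy file.
raw = b"\xe9ou"
text = read_input(raw, "latin-1")  # text only, inside the application
shouted = text.upper()             # all processing happens on text
out = write_output(shouted)        # bytes again, at the boundary
```

The point of the design is that the encoding is named exactly twice, once on the way in and once on the way out, and nothing in between relies on a default.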
Which is fine if you know what encoding everything is in, but of course at the moment it isn’t always obvious and sometimes the heuristics break down.
It’s still good advice to decode inputs and encode outputs, though. It will be even more important in 3.0, where Unicode strings will be all we have. There’s already been discussion about how to achieve the required decodings.
I wish there was an opposite of sys.setdefaultencoding. I want to make EVERY use of implicit encoding and decoding through unicode an exception.
(Of course, it wouldn’t make sense to put this in a sitecustomize. I want to be able to turn it on for a specific application.)
Steve, if you have to deal with encoded text and you don’t know the encoding, try to find it out! Perhaps use some heuristics once, when writing the application. Don’t use heuristics in the application itself to find encodings unless you have a very, very good reason for it (I have a hard time thinking of such reasons). It would be preferable to add an option for your users to specify the encoding themselves.
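One way such a user-facing option might look, sketched in Python 3 with the standard argparse module. The --encoding flag, its utf-8 default, and the helper names are assumptions for illustration, not taken from any application discussed here.

```python
import argparse


def build_parser():
    """Command-line parser with a hypothetical --encoding option."""
    parser = argparse.ArgumentParser(description="Import a text file.")
    parser.add_argument("path")
    parser.add_argument("--encoding", default="utf-8",
                        help="encoding of the input file (default: utf-8)")
    return parser


def load_text(path, encoding):
    """Read a file as bytes and decode it with the encoding the user named."""
    with open(path, "rb") as handle:
        raw = handle.read()
    # Bytes that are invalid in the given encoding raise UnicodeDecodeError
    # here, at the boundary, rather than somewhere deep in the application.
    return raw.decode(encoding)
```

A caller would then combine the two, for example `args = build_parser().parse_args()` followed by `load_text(args.path, args.encoding)`.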
Cory, there’s a way to use setdefaultencoding to make it complain all the time; I forget the exact details but perhaps setting it to None will do the trick. Unfortunately this breaks a lot of code that works perfectly well in Python, and that code is almost certainly going to be in libraries you use. You’d be surprised how much more code depends on automatic ascii to unicode translation than you’d initially think (dictionary accesses, for instance). So I don’t consider this to be a viable option.
Python 3.x will have the behavior you ask for. While I think this is a good language design decision, I also expect a lot of people will go through a lot of pain when they try to convert their working code to Python 3.x. It won’t be easy.
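For reference, a small sketch of that behavior as it exists in released Python 3: text (str) and bytes are distinct types, and mixing them raises a TypeError instead of triggering an implicit conversion. The variable names are only illustrative.

```python
text = "\u00e9ou"            # text (what Python 2 calls unicode)
data = text.encode("utf-8")  # bytes (what Python 2 calls str)

error = None
try:
    combined = text + data   # no implicit decoding is attempted
except TypeError as exc:
    error = exc              # mixing the two types is rejected outright

# An explicit decode is the only way to combine them.
combined = text + data.decode("utf-8")
```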
@Martinj, I have never had a case where the input encoding couldn’t be known. It can always be set, either by a user configuration like you said, or through the environment.
If a text doesn’t have the right encoding or an explicit one, most applications I know just fail: the input was simply not the right one. Trying to guess the encoding is just too dangerous imho.
Unfortunately it is very common to not know the encoding when dealing with input from various sources. Examples are people uploading text files to a CMS like Plone which are supposed to become content in there or variants thereof.
To give another example: I wrote a product (CMFBibliography) that supports the import of bibliographic data into Plone by uploading BibTeX, Medline, Endnote, RIS etc. files. For all practical purposes it is close to impossible to know the encoding in advance. You have to guess it based on some heuristics if you don’t want to fail and simply say “sorry”. Asking the users to provide the encoding in addition might seem like an option, but oftentimes it isn’t really, as they simply don’t know it. And it also doesn’t help if the file is imported via FTP or WebDAV (yes, Plone/CMFBib support this as well).
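To make "guess it based on some heuristics" concrete, here is one very simple heuristic, sketched in Python 3: try a short list of candidate encodings in order and keep the first that decodes cleanly. This is not how CMFBibliography does it; the function name and the candidate list are assumptions for illustration.

```python
def guess_decode(raw, candidates=("utf-8", "cp1252")):
    """Try each candidate encoding in order; return (text, encoding used)."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every byte to a code point, so this never fails,
    # although the result may be wrong for the actual data.
    return raw.decode("latin-1"), "latin-1"
```

Trying utf-8 first is deliberate: it rejects most non-UTF-8 input, whereas the single-byte encodings accept almost anything. That is also the weakness the earlier comments warn about, since a clean decode does not prove the guess was right.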
Other situations where I have seen this being an issue are large applications, frameworks, or entire stacks of frameworks where you simply don’t know what a specific API call will return (this may even vary depending on version/concrete install/input parameters/some configuration/etc.) and where it is oftentimes not even documented. If this comes from somewhere not under your control you can of course submit bug reports and wait, but typically you just want to move on, and so you try to hack around it in some way 😦
I want to ask: how can I use these applications?
One example that comes to my mind where in practice you have no idea what encoding the input is in is the User-Agent HTTP header.
Plus sys.setdefaultencoding is not THAT bad, if you happen to know the encoding you need to use.
But I concur that this is something the application, and not sitecustomize.py, should set (why? Because you cannot rely on sitecustomize.py to set it).
So a typical app usage would be:
import sys, locale
[…] Python 2 has a thing called “default encoding” to automagically encode Unicode strings when they are presented as byte strings. This is evil and has been discussed various times before. […]
> This is evil and has been discussed various times before
tell it to idiots who don’t support encoding attribute in function implementations but still use unicode() within them.
sys.setdefaultencoding might be evil, but it is sometimes the lesser of two evils. For instance, if you use a library that prints something to the terminal that you need to grab, you’re basically screwed. As long as the terminal has defined an encoding and it’s printing to screen, it works fine – but then pipe it to a file, and BOMB, out of luck, UnicodeEncodeErrors all over the place. And you can’t fix them because they’re in perfectly legitimate library code. Lame.
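One alternative to changing the default encoding in this situation is to wrap the output stream so that the encoding is explicit even when no terminal is attached. The sketch below uses Python 3's io.TextIOWrapper, with an in-memory io.BytesIO standing in for a pipe or redirected file. The helper name is hypothetical, and whether replacing a real stdout this way suits a given library is an assumption that has to be checked case by case.

```python
import io


def make_text_writer(byte_stream, encoding="utf-8"):
    """Wrap a byte stream so text written to it is encoded explicitly."""
    return io.TextIOWrapper(byte_stream, encoding=encoding,
                            newline="\n", write_through=True)


# io.BytesIO stands in for a pipe or a redirected stdout.
sink = io.BytesIO()
writer = make_text_writer(sink)
writer.write("\u00e9ou\n")  # non-ASCII text, no terminal encoding needed
writer.flush()
piped_bytes = sink.getvalue()
```

In Python 2 the comparable idiom is codecs.getwriter('utf-8') applied to sys.stdout, and setting the PYTHONIOENCODING environment variable is another option that leaves the code untouched.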
> This is evil and has been discussed various times before
It is more evil to avoid setdefaultencoding, because then every unicode library has to handle ascii/unicode conversion on each line of code, which is prone to cause more bugs.
If the id has to be ascii, I think you should not assign a non-ascii string to it in the first place. Instead, you need to perform some checking and throw something like IllegalIdException.
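A sketch of what such a check could look like, in Python 3 syntax. IllegalIdException is the hypothetical name from this comment, and check_id is an illustrative helper, not part of CMF or the ZODB.

```python
class IllegalIdException(ValueError):
    """Hypothetical exception for ids that are not pure ASCII."""


def check_id(candidate):
    """Return the id unchanged if it is pure ASCII, otherwise raise."""
    try:
        candidate.encode("ascii")
    except UnicodeEncodeError:
        raise IllegalIdException("id must be ASCII: %r" % (candidate,))
    return candidate
```

Validating when the id is created moves the failure to the place where the bad value enters, instead of a later str() call inside an API.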
I don’t understand why the default encoding should be ascii.
Even if the default encoding is utf-8, pure ascii strings still remain ascii. That makes no difference, because utf-8 includes ascii.
For people who handle non-ascii strings mostly, it’s definitely natural if the default encoding is utf-8. Otherwise I will have to make everything unicode, and encode them manually. It’s nonsense.
In my case, I know the string is probably utf-8. If it is not utf-8, there has to be some reason that I already know about, so I would handle those cases separately.