10 PEP 293: Codec Error Handling Callbacks

When encoding a Unicode string into a byte string, unencodable characters may be encountered. Until now, Python has allowed the error processing to be specified as either ``strict'' (raising UnicodeError), ``ignore'' (skipping the character), or ``replace'' (substituting a question mark for the character in the output string), with ``strict'' being the default behavior. It may be desirable to specify alternative processing of such errors, such as inserting an XML character reference or HTML entity reference into the converted string.
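As a brief illustration of the three built-in strategies described above, the following sketch encodes a string containing one character that ASCII cannot represent. It assumes a modern Python 3 interpreter, where str is the Unicode type and encode() returns bytes; the sample string and variable names are chosen only for this example.

```python
# Illustrative sketch (Python 3 syntax assumed): the three long-standing
# error-handling modes applied to a string with one unencodable character.
text = "caf\u00e9"  # the final character has no ASCII representation

# "strict" is the default: the unencodable character raises an exception.
try:
    text.encode("ascii")
    strict_outcome = "no error"
except UnicodeError as exc:
    strict_outcome = type(exc).__name__

# "ignore" silently drops the character.
ignored = text.encode("ascii", "ignore")

# "replace" substitutes a question mark.
replaced = text.encode("ascii", "replace")

print(strict_outcome)  # UnicodeEncodeError
print(ignored)         # b'caf'
print(replaced)        # b'caf?'
```

The exception raised in the strict case is UnicodeEncodeError, a subclass of UnicodeError, so code that catches UnicodeError keeps working.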

Python now has a flexible framework for adding different processing strategies. New error handlers can be registered under a name with codecs.register_error(), and codecs can then retrieve a handler by that name with codecs.lookup_error(). An equivalent C API has been added for codecs written in C. The error handler is passed an exception object carrying the necessary state information, such as the string being converted, the position in the string where the error was detected, and the target encoding. The handler can then either raise an exception or return a replacement string together with the position at which conversion should resume.
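A minimal sketch of such a handler follows, assuming Python 3 syntax. The handler name "hexplaceholder", the function hex_placeholder, and the "[U+XXXX]" output format are all invented for this illustration and are not part of the standard library; only codecs.register_error(), codecs.lookup_error(), and the exception attributes object, start, and end are real API.

```python
import codecs


def hex_placeholder(exc):
    """Hypothetical handler: replace each unencodable character with [U+XXXX]."""
    # This sketch only deals with encoding errors; anything else is re-raised.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # exc.object is the string being converted; exc.start and exc.end
    # delimit the slice that could not be encoded.
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("[U+%04X]" % ord(ch) for ch in bad)
    # Return the replacement text and the position where encoding resumes.
    return replacement, exc.end


# Register the handler under an arbitrary (made-up) name.
codecs.register_error("hexplaceholder", hex_placeholder)

# The name can now be passed wherever an error-handling mode is accepted.
encoded = "na\u00efve".encode("ascii", "hexplaceholder")
print(encoded)  # b'na[U+00EF]ve'

# Codecs retrieve the callable by name with lookup_error().
handler = codecs.lookup_error("hexplaceholder")
```

Returning exc.end as the resume position skips exactly the characters the codec reported as unencodable; a handler could return a different position if it wanted to consume more or less input.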

Two additional error handlers have been implemented using this framework: ``backslashreplace'' uses Python backslash quoting to represent unencodable characters, and ``xmlcharrefreplace'' emits XML character references.
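The two handlers can be compared side by side with a short sketch, again assuming Python 3 syntax; the sample string is chosen only for illustration.

```python
# Illustrative sketch: the two handlers added alongside the framework.
text = "\u20ac 5"  # a euro sign followed by " 5"

# backslashreplace emits a Python-style backslash escape.
backslashed = text.encode("ascii", "backslashreplace")

# xmlcharrefreplace emits a decimal XML character reference.
xml_ref = text.encode("ascii", "xmlcharrefreplace")

print(backslashed)  # b'\\u20ac 5'
print(xml_ref)      # b'&#8364; 5'
```

Both results are pure ASCII, so they can be written to an ASCII-only channel without losing the identity of the original character.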