NLTK comes with a Python 2.x/3.x compatibility layer, nltk.compat (which is loosely based on six):
>>> from nltk import compat >>> compat.PY3 False >>> compat.integer_types (<type 'int'>, <type 'long'>) >>> compat.string_types (<type 'basestring'>,) >>> # and so on
Under Python 2.x __str__ and __repr__ methods must return bytestrings.
@python_2_unicode_compatible decorator allows writing these methods in a way compatible with Python 3.x:
and they would be fixed under Python 2.x to return byte strings:
>>> from nltk.compat import python_2_unicode_compatible >>> @python_2_unicode_compatible ... class Foo(object): ... def __str__(self): ... return u'__str__ is called' ... def __repr__(self): ... return u'__repr__ is called' >>> foo = Foo() >>> foo.__str__().__class__ <type 'str'> >>> foo.__repr__().__class__ <type 'str'> >>> print(foo) __str__ is called >>> foo __repr__ is called
Original versions of __str__ and __repr__ are available as __unicode__ and unicode_repr:
>>> foo.__unicode__().__class__ <type 'unicode'> >>> foo.unicode_repr().__class__ <type 'unicode'> >>> unicode(foo) u'__str__ is called' >>> foo.unicode_repr() u'__repr__ is called'
There is no need to wrap a subclass with @python_2_unicode_compatible if it doesn't override __str__ and __repr__:
>>> class Bar(Foo): ... pass >>> bar = Bar() >>> bar.__str__().__class__ <type 'str'>
However, if a subclass overrides __str__ or __repr__, wrap it again:
>>> class BadBaz(Foo): ... def __str__(self): ... return u'Baz.__str__' >>> baz = BadBaz() >>> baz.__str__().__class__ # this is incorrect! <type 'unicode'> >>> @python_2_unicode_compatible ... class GoodBaz(Foo): ... def __str__(self): ... return u'Baz.__str__' >>> baz = GoodBaz() >>> baz.__str__().__class__ <type 'str'> >>> baz.__unicode__().__class__ <type 'unicode'>
Applying @python_2_unicode_compatible to a subclass shouldn't break methods that was not overridden:
>>> baz.__repr__().__class__ <type 'str'> >>> baz.unicode_repr().__class__ <type 'unicode'>
Under Python 3.x repr(unicode_string) doesn't have a leading "u" letter.
nltk.compat.unicode_repr function may be used instead of repr and "%r" % obj to make the output more consistent under Python 2.x and 3.x:
>>> from nltk.compat import unicode_repr >>> print(repr(u"test")) u'test' >>> print(unicode_repr(u"test")) 'test'
It may be also used to get an original unescaped repr (as unicode) of objects which class was fixed by @python_2_unicode_compatible decorator:
>>> @python_2_unicode_compatible ... class Foo(object): ... def __repr__(self): ... return u'<Foo: foo>' >>> foo = Foo() >>> repr(foo) '<Foo: foo>' >>> unicode_repr(foo) u'<Foo: foo>'
For other objects it returns the same value as repr:
>>> unicode_repr(5) '5'
It may be a good idea to use unicode_repr instead of %r string formatting specifier inside __repr__ or __str__ methods of classes fixed by @python_2_unicode_compatible to make the output consistent between Python 2.x and 3.x.