the snowman encoding

UTF-☃

PEP 263

  • "Defining Python Source Code Encodings"
  • Search for an "encoding cookie" in the first two lines
  • Decode the file contents using that encoding
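A minimal sketch of those steps using the standard library's tokenize.detect_encoding, which applies the PEP 263 rules to the first lines of a file. The helper name sniff_encoding and the sample sources are made up for illustration.

```python
import io
import tokenize


def sniff_encoding(source_bytes):
    """Hypothetical helper: report the encoding Python would pick for this source."""
    # detect_encoding wants a readline callable and reads at most two lines.
    encoding, _lines_read = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# Illustrative sources, not taken from the talk.
with_cookie = b"# -*- coding: cp1252 -*-\nprice = 1\n"
without_cookie = b"price = 1\n"

print(sniff_encoding(with_cookie))     # cp1252
print(sniff_encoding(without_cookie))  # utf-8 (the default when no cookie is found)
```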

what was happening

>>> from tokenize import cookie_re
>>> print(cookie_re.pattern)
^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)
>>> cookie_re.search('# Ennnncoding: utf-8').group(1)
'utf-8'
>>> cookie_re.search('# Ennnncoding: UTF^-_-^-_-^8').group(1)
'UTF'
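The capture group only accepts word characters, "-" and ".", so the captured name stops at the first character outside that set. A small sketch that recompiles the pattern printed above (the re.ASCII flag is assumed here to match how tokenize compiles it) and tries it on a few lines. The helper captured_name and the sample lines are illustrative.

```python
import re

# Same pattern as printed above; re.ASCII is an assumption about how tokenize compiles it.
COOKIE_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)", re.ASCII)


def captured_name(line):
    """Hypothetical helper: return the captured codec name, or None if no cookie matches."""
    match = COOKIE_RE.search(line)
    return match.group(1) if match else None


samples = [
    "# Ennnncoding: UTF^-_-^-_-^8",      # stops at '^', which is not in [-\w.]
    "# coding: UTF-☃",                   # '-' is allowed, the snowman is not
    "# vim: set fileencoding=latin-1 :",  # anything ending in 'coding=' counts
    "x = 1  # coding: utf-8",             # the comment must start the line
]
for line in samples:
    print(repr(line), "->", repr(captured_name(line)))
```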

ok the actual wat

all examples use python 3.x

for python 2 add a u prefix to the string

.encode() working

converts the string to bytes

>>> '☃'.encode('UTF-8')
b'\xe2\x98\x83'

.encode() failing

when an unknown codec name is passed, a LookupError is raised

>>> '☃'.encode('UTF-garbage')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: UTF-garbage
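One way to check a codec name up front is to ask codecs.lookup and catch the same LookupError. A sketch only: the helper name is_known_codec is made up for illustration.

```python
import codecs


def is_known_codec(name):
    """Hypothetical helper: True if Python can resolve this codec name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


print(is_known_codec("utf-8"))        # True
print(is_known_codec("UTF-garbage"))  # False
```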

... usually

🤔🤨🤔🤨🤔

>>> '☃'.encode('UTF-^-_-^-_-^8')
b'\xe2\x98\x83'

>>> '☃'.encode('UTF-☃')
b'\xe2\x98\x83'

explanation

>>> import encodings
>>> encodings.normalize_encoding('UTF-^-_-^-_-^8'.lower())
'utf_8'

>>> encodings.normalize_encoding('UTF-☃'.lower())
'utf'
>>> encodings.aliases.aliases['utf']
'utf_8'
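Putting the two steps together: runs of non-alphanumeric characters collapse into a single underscore, trailing ones are dropped, and whatever is left goes through the alias table. A rough sketch of just those two steps, not the full codec lookup (which also imports the codec module and caches results). The helper name resolve_name is made up for illustration.

```python
import encodings
import encodings.aliases


def resolve_name(name):
    """Hypothetical helper: normalize a codec name, then apply the alias table."""
    normalized = encodings.normalize_encoding(name.lower())
    # Names without an alias entry are returned as normalized.
    return encodings.aliases.aliases.get(normalized, normalized)


# 'UTF' stands in for what is left of 'UTF-☃' after normalization.
for name in ["UTF-^-_-^-_-^8", "UTF", "utf-8"]:
    print(repr(name), "->", repr(resolve_name(name)))
```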