
Unicode in Perl and Python

Xah Lee, 2005-01

Python

Source Code Interpretation

Python supports Unicode in source files through an encoding declaration placed on the first or second line of the file.

#-*- coding: utf-8 -*-
print "look chinese chars: 请你不要哭"

What this means is that the Python interpreter will decode your source code as UTF-8. It does not, however, mean that all the strings in your code are Unicode strings: in Python 2, a plain quoted string is still a sequence of bytes.

The “#-*- coding: utf-8 -*-” declaration in the first line is a convention adopted from the text editor Emacs. It tells any program reading the file that the file is encoded using a particular character set. For example, it serves a purpose similar to HTML's “<META HTTP-EQUIV="Content-Type" CONTENT="text/html;charset=utf-8">”.
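The difference between the encoding of the source file and the type of a string shows up when you compare lengths. The following is a minimal sketch, not part of the original examples, written so that it should run under both Python 2.7 and Python 3. The variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: the same text as a Unicode string and as UTF-8 bytes.

text = u"请你不要哭"          # Unicode string: a sequence of 5 characters
raw = text.encode("utf-8")   # byte string: the UTF-8 encoding of those characters

char_count = len(text)       # counts characters
byte_count = len(raw)        # counts bytes (3 per character for this text)

print(char_count)            # 5
print(byte_count)            # 15
```

In Python 2, a plain "..." literal in a UTF-8 source file behaves like `raw` here: its length is the byte count, not the character count.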

Text Processing With Unicode Strings

If you are going to do any processing on a Unicode string, such as extracting substrings or matching patterns, then you need to put “u” in front of the string literal. For example,

#-*- coding: utf-8 -*-
s = u"look Chinese chars: 请你不要哭"
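As a rough illustration of the substring extraction and pattern matching mentioned above, here is a short sketch that should run under both Python 2.7 and Python 3. The variable names and the choice of the CJK Unified Ideographs range (U+4E00 to U+9FFF) for the pattern are assumptions made for this example.

```python
# -*- coding: utf-8 -*-
import re

s = u"look Chinese chars: 请你不要哭"

# Slicing a Unicode string works on characters, not bytes.
last_five = s[-5:]

# Match runs of characters in the CJK Unified Ideographs block.
cjk_runs = re.findall(u"[\u4e00-\u9fff]+", s)

print(len(last_five))   # number of characters in the slice
print(len(cjk_runs))    # number of matched runs
```

Without the “u” prefix in Python 2, the same slice would cut the byte sequence in the middle of a character.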

Note, however, that in Python 2 identifiers cannot contain Unicode chars (Python 3 allows them). For example, variable names must be ASCII.

Sometimes when you print Unicode strings, you may get an error like this: “UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b1' in position 16: ordinal not in range(128)”.

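The solution is to convert explicitly between bytes and Unicode. When you read a file line by line, each line is just a sequence of bytes to Python: decode it to get a Unicode string for processing, and encode the Unicode string back to bytes before printing or writing it. For example: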

    myString = myString.decode("utf-8")  # bytes to Unicode string
    myString = myString.encode("utf-8")  # Unicode string to bytes

#-*- coding: utf-8 -*-
# python

alpha=u'α'

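# Bad: the implicit conversion can raise UnicodeEncodeError when the output encoding is ASCII
print u'Unicode alpha: ', alpha

# Good: encode explicitly before printing
print u'Unicode alpha: ', alpha.encode('utf-8')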
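The error and the fix can also be reproduced without depending on how the terminal is configured. The sketch below is an illustration rather than part of the original examples, and should run under both Python 2.7 and Python 3. It encodes the same character with the ASCII codec, which fails, and with UTF-8, which succeeds. The variable names are made up for this example.

```python
# -*- coding: utf-8 -*-
alpha = u"α"

# Encoding with the ASCII codec fails, because U+03B1 is outside range(128).
try:
    alpha.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding with UTF-8 succeeds and yields two bytes.
utf8_bytes = alpha.encode("utf-8")

# Decoding those bytes gives back the original Unicode string.
round_trip = utf8_bytes.decode("utf-8")

print(ascii_failed)
print(len(utf8_bytes))
```

This is the same failure that the “Bad” line above can trigger: Python 2 falls back to the ASCII codec when it has to turn a Unicode string into bytes and no encoding is known.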

Python Doc


Perl

use bytes; # Larry can take Unicode and shove it up his ass sideways. 
            # Perl 5.8.0 causes us to start getting incomprehensible 
            # errors about UTF-8 all over the place without this.

               —from the source code of WebCollage (1998),
                by Jamie W Zawinski (~b1971) 

In Perl, dealing with Unicode is quite different from Python. Perl's Unicode support started to be somewhat usable with Perl 5.8. Perl provides the “-C” command-line option, which changes Perl's input and output behavior to work with UTF-8. It is unnecessarily complex, because it has been hacked together over the years since Perl 5.6.

Perl 5.8 (2002-07) lets you use Unicode chars in variable names and function names. You need to say “use utf8;” in your code. Example:

# perl

use utf8; # necessary if you want to use unicode in function or var names

# processing unicode string
$s = 'I ★ you'; $s =~ s///;
print $s;

# variable with unicode char
$愛=4;  print $愛;

# function with unicode char
sub f愛 { return 2;}  print f愛();

Because the code above outputs Unicode strings as UTF-8, you need to run it with the -C option, for example: “perl -C7 myCode.pl”. The value 7 turns on UTF-8 for STDIN, STDOUT, and STDERR.

perldoc perluniintro

perldoc perlunicode


© 2005 by Xah Lee.