Python: Converting a File's Encoding


2005-03-07

Python

Starting with version 2.4, Python can be used to change a file's character encoding. Here's an example:

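# python 2 (uses the built-in unicode type)
path = 'infile.html'
path2 = 'outfile.html'

# read the raw bytes and decode them from GB18030 into a unicode string
f = open(path, 'rb')
content = unicode(f.read(), 'gb18030')
f.close()

# encode the text as UTF-16 and write the resulting bytes
f = open(path2, 'wb')
f.write(content.encode('utf-16'))
f.close()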
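
The script above relies on the Python 2 built-in unicode, which does not exist in Python 3. A minimal sketch of the same read, decode, encode, write steps for Python 3 follows. The function name convert_encoding and its parameters are illustrative assumptions, not part of the original script.

```python
# Python 3 sketch; convert_encoding and its parameter names are hypothetical.
def convert_encoding(src_path, dst_path, src_enc='gb18030', dst_enc='utf-16'):
    """Re-encode the file at src_path and write the result to dst_path."""
    # Read the raw bytes and decode them into a str.
    with open(src_path, 'rb') as f:
        text = f.read().decode(src_enc)
    # Encode the text in the target encoding and write the bytes out.
    with open(dst_path, 'wb') as f:
        f.write(text.encode(dst_enc))
```

Calling convert_encoding('infile.html', 'outfile.html') would mirror the script above. The with blocks close each file even if decoding or encoding raises an error.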

To test this script, you can find a GB18030-encoded file on just about any Chinese website (for example, cn.yahoo.com): save the page source and use it as the input file. Note that the HTML of many Chinese sites declares GB2312 as its character set. However, GB2312 is an old standard and is a subset of GB18030, so such pages can also be decoded as GB18030.
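
The subset relationship can be checked with a small sketch (Python 3 assumed). The sample string is an arbitrary choice for illustration. It encodes the text with the gb2312 codec and decodes the resulting bytes with the gb18030 codec.

```python
# A short Chinese string encoded with the older GB2312 standard.
text = '中文网页'
gb2312_bytes = text.encode('gb2312')

# The GB18030 codec decodes GB2312 bytes to the same text.
decoded = gb2312_bytes.decode('gb18030')
print(decoded == text)  # prints True

# For these characters the two encodings produce identical bytes.
print(text.encode('gb18030') == gb2312_bytes)  # prints True
```

The reverse does not hold in general: GB18030 covers characters that GB2312 cannot represent, so decoding as GB18030 is the safer choice when a page's declared charset is GB2312.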

See also: Convert File Encoding with Python for All Files in a Dir.

(Thanks to Andrew Clover, who provided help with this article.)

Perl

For converting charset encodings in Perl, you need the Encode module. It has been bundled with Perl since v5.8. In general, for Perl with Unicode support, see: Unicode in Perl & Python.

Other Tools

The GNU command-line tool “iconv” also does character encoding conversion.

For a Java solution, see convert charset encoding with Java.

If you use Emacs, you can open the file, then call “set-buffer-file-coding-system” with a value such as “utf-8” or “utf-16”, and save the file. See: Emacs and Unicode Tips. This can also be done programmatically for hundreds of files using Emacs Lisp.
