Converting a File's Encoding with Python

Xah Lee, 2005-03-07.

Python

I have a bunch of files encoded in gb18030 and I need to convert them to utf-16. Can I do that with Python?

Yes. This facility is built into Python as of version 2.4, released in November 2004. Here is some sample code.

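# python
# Python 2 script: convert a file from gb18030 to utf-16
path = 'infile.html'    # input file, encoded in gb18030
path2 = 'outfile.html'  # output file, to be written as utf-16

# read the raw bytes and decode them into a unicode string
f = open(path, 'rb')
content = unicode(f.read(), 'gb18030')
f.close()

# encode the unicode string as utf-16 and write out the bytes
f = open(path2, 'wb')
f.write(content.encode('utf-16'))
f.close()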
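
The script above relies on the Python 2 unicode() built-in, which does not exist in Python 3. For readers on Python 3, the same conversion might be sketched roughly as follows. The function name convert_encoding and its parameter names are made up for this illustration and are not part of any library; the default encodings simply mirror the example above.

```python
# Python 3 sketch of the same gb18030 -> utf-16 conversion.
# convert_encoding is a hypothetical helper name, not a library function.
def convert_encoding(src_path, dst_path, src_enc="gb18030", dst_enc="utf-16"):
    """Read src_path as src_enc and write the same text to dst_path as dst_enc."""
    # newline="" turns off newline translation, so line endings pass through unchanged
    with open(src_path, "r", encoding=src_enc, newline="") as src:
        content = src.read()
    with open(dst_path, "w", encoding=dst_enc, newline="") as dst:
        dst.write(content)
```

Calling convert_encoding('infile.html', 'outfile.html') would then do the same job as the script above. As in the original, the 'utf-16' codec writes a byte order mark at the start of the output file.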

To test this script, you can find a gb18030-encoded file on just about any Chinese website (for example http://image.yahoo.com.cn/ ): save the page source and use it as the input file. Note that the HTML of many Chinese sites declares gb2312 as its character set/encoding: “<meta http-equiv="content-type" content="text/html; charset=gb2312">”. However, GB2312 is an old standard and is wholly subsumed by gb18030 (i.e. it is a compatible subset).
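
The subset relationship can be checked with a small experiment. The following is a Python 3 sketch, and the sample string is an arbitrary choice for illustration: text encoded as GB2312 should decode to the same characters whether the bytes are read as gb2312 or as gb18030.

```python
# Bytes produced by the gb2312 codec decode unchanged with the gb18030 codec.
sample = "中文网页"  # arbitrary sample text; all four characters are in GB2312
gb2312_bytes = sample.encode("gb2312")

decoded_as_gb2312 = gb2312_bytes.decode("gb2312")
decoded_as_gb18030 = gb2312_bytes.decode("gb18030")

# Both decodings should give back the original text
print(decoded_as_gb2312 == decoded_as_gb18030 == sample)
```

This is why a page that declares charset=gb2312 can be fed to the conversion script with 'gb18030' as the source encoding.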

See also: Convert File Encoding For All Files In A Directory.

For info about unicode and gb18030, see Unicode↗, GB18030↗

(Thanks to Andrew Clover, who provided help with this article.)


Perl

For converting charset encodings in Perl, you need the Encode module. It has been bundled with Perl since v5.8. For Unicode support in Perl in general, see: Unicode in Perl and Python.


Note: the GNU command-line tool “iconv” also does character encoding conversion (for example: iconv -f GB18030 -t UTF-16 infile.html > outfile.html).

For a Java solution, see: Convert Charset Encoding with Java.



Page created: 2005-03.
© 2005 by Xah Lee.