Xah Lee, 2005-04-15
Often you need to split a line by a textual pattern. This page shows you how.
I have a file that is a translation of Chinese lyrics. It is formatted like this:
你是我最苦澀的等待 | you are my hardest wait
讓我歡喜又害怕未來 | giving me joy and also fear the future
The left side is Chinese, the right side is English. (See the file here: 哭沙 (Weeping Sand)) I want to write a program that splits each line, so that I get the whole Chinese part or the whole English part.
Here's the code:
# -*- coding: utf-8 -*-
# Python 2
import re

myText = ur'''你是我最苦澀的等待 | you are my hardest wait
讓我歡喜又害怕未來 | giving me joy and also fear the future'''

lines = myText.splitlines()  # or lines = re.split(r'\n', myText)

for ln in lines:
    # the third positional argument of re.split is maxsplit, so pass flags by keyword
    fracture = re.split(ur'\s*\|\s*', ln, flags=re.U)
    print fracture[0].encode('utf-8')  # prints just the Chinese column
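For readers on Python 3, where the ur prefix and the print statement no longer exist, the same split might look like the sketch below. It is a minimal sketch, not the original script: the names split_columns and separator are hypothetical, and the sample text is copied from the lyrics above. It collects both columns instead of printing only the Chinese one.

```python
import re

# Sample data copied from the lyrics above: "Chinese | English" on each line.
my_text = """你是我最苦澀的等待 | you are my hardest wait
讓我歡喜又害怕未來 | giving me joy and also fear the future"""

# A pipe with optional whitespace on either side; compiled once and reused.
separator = re.compile(r"\s*\|\s*")


def split_columns(text):
    """Return (chinese_lines, english_lines) for 'left | right' text.

    Hypothetical helper for illustration; lines without a separator are skipped.
    """
    chinese, english = [], []
    for line in text.splitlines():
        # Split on the first separator only, so a later "|" stays in the right column.
        parts = separator.split(line, maxsplit=1)
        if len(parts) != 2:
            continue
        chinese.append(parts[0])
        english.append(parts[1])
    return chinese, english


chinese, english = split_columns(my_text)
print("\n".join(chinese))  # the whole Chinese part
print("\n".join(english))  # the whole English part
```

In Python 3 every str is Unicode, so no encode call is needed before printing on a UTF-8 terminal.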
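Unicode characters can be included in regex patterns directly. Just make sure the pattern string starts with ur, and pass re.U as the flags argument so that matching works in Unicode mode. In re.split the third positional argument is maxsplit, not flags, so write flags=re.U (supported since Python 2.7). In re.search the third argument is flags, so it can be passed positionally. For example: re.search(ur'苦',mystring,re.U).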
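As a rough Python 3 illustration of a literal non-ASCII character in a pattern (an assumption about the reader's setup, since the page itself targets Python 2): str patterns are already Unicode there, so the u prefix is unnecessary and re.UNICODE is the default. The variable names are illustrative only.

```python
import re

line = "你是我最苦澀的等待 | you are my hardest wait"

# A non-ASCII character can appear literally in the pattern.
match = re.search(r"苦", line)
print(match.start())  # index of 苦 counted in characters, not bytes

# re.split takes maxsplit before flags, so flags are passed by keyword.
fields = re.split(r"\s*\|\s*", line, flags=re.UNICODE)
print(fields[0])  # the Chinese column
```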
Unicode can also be represented by \u followed by the character's four-digit hexadecimal code point. For example, to match the Greek letter alpha “α”, whose code point is U+03B1, do re.search(ur'\u03b1',mystring,re.U). (A Unicode character can also be included literally. Be sure that your string starts with “u”, like this: u"string with unicode char here", and that your source file starts with the line # -*- coding: utf-8 -*- if it is UTF-8 encoded.)
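The two ways of writing the character can be compared side by side. This is a small Python 3 sketch (the sample sentence is made up for illustration); there the re module itself understands \u followed by four hexadecimal digits, so a plain raw string is enough.

```python
import re

text = "the angle α is small"

# Escape form: \u plus exactly four hexadecimal digits (U+03B1 is α).
by_escape = re.search(r"\u03b1", text)

# Literal form: the character typed directly into the pattern.
by_literal = re.search("α", text)

# Both searches find the same character at the same position.
print(by_escape.start(), by_literal.start())
print(by_escape.group() == by_literal.group())
```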
See also: String Pattern Matching (regex) Documentation.
In Perl, to split a line into a list using a text pattern as the separator, use the function “split”. Here's a basic example:
# perl
$myText = '你是我最苦澀的等待 | you are my hardest wait
讓我歡喜又害怕未來 | giving me joy and also fear the future';

@lines = split(/\n/, $myText);
# use Data::Dumper;
# print @lines;

for $ln (@lines) {
    @fracture = split(/\s*\|\s*/, $ln);
    print "$fracture[0]\n";  # prints just the Chinese column
}