Xah Lee, 2007-10-29
This page gives an outline of how to use Emacs Lisp to do text processing, using a specific real-world problem as an example. If you don't know elisp, first take a look at Emacs Lisp Basics.
I want to write an elisp program that processes a list of given files. Each file is an HTML file. For each file, I want to remove the link to itself in its page navigation bar. More specifically, each file has a page navigation bar in this format:
<div class="pages">Goto Page: <a href="1.html">1</a>, <a href="2.html">2</a>, <a href="3.html">3</a>, ...</div>.
where the file names and link texts are arbitrary (not necessarily 1, 2, 3 as shown here). The link to the page itself needs to be removed.
My website has over 3 thousand HTML files; many of the pages are parts of a series. For example, I have an article on Algorithmic Mathematical Art, which is broken into 3 HTML pages. So, at the bottom of each page, I have a page navigation bar with code like this:
<div class="pages">Goto Page: <a href="20040113_cmaci_larcu.html">1</a>, <a href="cmaci_larcu2.html">2</a>, <a href="cmaci_larcu3.html">3</a></div>
In a browser, it would appear as the text “Goto Page: 1, 2, 3”, with each number a clickable link.
Note that the link to the page itself really shouldn't be a link.
There are a total of 134 pages scattered about in various directories that have this page navigation bar. I need some easy way to process these files and remove the self-link.
I programmed in Perl professionally, full time, from 1998 to 2002. Typically, for this task in Perl (or Python), I'd open each file, read it in, use a regex to do the replacement, then write the file back out. For replacements that span several lines, the regex needs to act on the whole file (as opposed to one line at a time). The regex can become quite complex or reach its limits. For a more robust solution, an XML/HTML parser package can be used to read the file into a structured representation, then process that. Using an HTML parser is a bit involved. For example, an HTML parser will likely reformat my HTML code, which is something I do not want. Then, as usual, one may need to create backups of the original files, and also preserve each file's meta info, such as its permission bits. In summary, if the required text manipulation is not simple, the coding gets fairly complex quickly, even if the job is trivial in principle.
With Emacs Lisp, the task is vastly simplified, because emacs reads a file into its buffer representation. With buffers, one can move a pointer back and forth, and search, delete, or insert text arbitrarily, using emacs lisp's entire suite of functions designed for processing text (literally over 13 thousand of them as of Emacs 22), as well as the entire emacs environment, which automatically deals with file handling (symbolic links, hard links, the auto-backup system, file meta-info maintenance, file locking, remote files, file copying, various file encodings, Unix/DOS/Mac line endings, etc., all handled AUTOMATICALLY).
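To make the buffer model concrete, here is a minimal sketch that works on a temporary buffer instead of a real file. The sample text and the variable name my-buffer-demo-result are made up for illustration; the functions themselves are standard Emacs Lisp.

```elisp
;; A minimal sketch: treat text as a buffer, move point, search, delete, insert.
;; The sample string and `my-buffer-demo-result' are made up for illustration.
(defvar my-buffer-demo-result
  (with-temp-buffer
    (insert "Goto Page: 1, 2, 3")
    (goto-char (point-min))                      ; move point to the buffer start
    (search-forward "2, ")                       ; point is now just after "2, "
    (delete-region (match-beginning 0) (point))  ; delete the matched text
    (insert "two, ")                             ; insert new text at point
    (buffer-string)))                            ; return the contents as a string
;; my-buffer-demo-result is now "Goto Page: 1, two, 3"
```

Nothing here touches the disk. The same calls work unchanged in a buffer that is visiting a file.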
We proceed to write elisp code to solve this problem.
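Here are the steps we need to do for each file: open the file, move the cursor to the page navigation text, move the cursor to the file's own name within it, delete the link tag that encloses the cursor, save the file, and close the buffer.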
We begin by writing test code to process a single file.
(defun xx ()
  "temp. experimental code"
  (interactive)
  (let (fpath fname mybuffer)
    (setq fpath "/Users/xah/test1.html")
    (setq fname (file-name-nondirectory fpath))
    (setq mybuffer (find-file fpath))
    (search-forward "<div class=\"pages\">Goto Page:")
    (search-forward fname)
    (sgml-delete-tag 1)
    (save-buffer)
    (kill-buffer mybuffer)))
First of all, create files test1.html, test2.html, test3.html in a temp directory for testing this code. Each file will contain this page navigation line:
<div class="pages">Goto Page: <a href="test1.html">XYZ Overview</a>, <a href="test2.html">Second Page</a>, <a href="test3.html">Summary Z</a></div>
Note that in actual files, the page-nav string may not be on a single line.
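The test files can also be generated from elisp. The following is only a sketch: the names my-test-nav and my-test-dir are hypothetical, and it writes into a fresh temporary directory, so the path in the test code would need to be adjusted to point there.

```elisp
;; A sketch that writes the three test files into a fresh temporary directory.
;; `my-test-nav' and `my-test-dir' are hypothetical names for this example.
(defvar my-test-nav
  (concat "<div class=\"pages\">Goto Page: "
          "<a href=\"test1.html\">XYZ Overview</a>, "
          "<a href=\"test2.html\">Second Page</a>, "
          "<a href=\"test3.html\">Summary Z</a></div>\n"))

;; A non-nil second argument makes `make-temp-file' create a directory.
(defvar my-test-dir (make-temp-file "pagenav-test" t))

(dolist (name '("test1.html" "test2.html" "test3.html"))
  ;; `with-temp-file' writes the temporary buffer's contents to the named file.
  (with-temp-file (expand-file-name name my-test-dir)
    (insert my-test-nav)))
```

After evaluating this, my-test-dir holds the directory that contains the three files.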
The elisp code above is fairly simple and self-explanatory. The file-opening function find-file is documented in the elisp manual section “Files”. The cursor-moving function search-forward is in “Searching and Matching”, and the functions for saving and closing buffers are in section “Buffers”.
Reference: Elisp Manual: Files.
Reference: Elisp Manual: Buffers.
Reference: Elisp Manual: Searching-and-Matching.
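As a quick illustration of two of these functions, here is a small sketch that can be evaluated in a scratch buffer. The sample HTML string is made up for the example.

```elisp
;; `file-name-nondirectory' strips the directory part of a path.
(file-name-nondirectory "/Users/xah/test1.html") ; returns "test1.html"

(with-temp-buffer
  (insert "<a href=\"test1.html\">XYZ Overview</a>")
  (goto-char (point-min))
  ;; On success, `search-forward' moves point to the end of the match
  ;; and returns that position.
  (search-forward "test1.html")
  ;; With a non-nil third argument, a failed search returns nil
  ;; instead of signaling an error.
  (search-forward "no-such-text" nil t))
```

Without that third argument, a failed search-forward signals an error, which is what stops the test code when a file lacks the page-nav string.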
The interesting part is the call to sgml-delete-tag. It is a function loaded by html-mode (which is automatically loaded when an HTML file is opened). What sgml-delete-tag does is delete the tag that encloses the cursor (both the opening and closing tags will be deleted). The cursor can be anywhere inside the opening tag or the closing tag. This function helps us tremendously.
Now, with the above code, our job is essentially done. All we need to do now is to feed it a bunch of file paths. First we clean up the code by making it take a path as its argument.
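To see the effect in isolation, here is a sketch that runs sgml-delete-tag on a temporary buffer. The sample HTML and the name my-delete-tag-demo are made up for illustration, and the temporary buffer is put into html-mode by hand, on the assumption that the function relies on that mode's syntax settings.

```elisp
(require 'sgml-mode) ; the library that defines `html-mode' and `sgml-delete-tag'

;; `my-delete-tag-demo' is a hypothetical name used only for this sketch.
(defvar my-delete-tag-demo
  (with-temp-buffer
    (html-mode)                    ; set up HTML syntax in the temporary buffer
    (insert "<p>See <a href=\"test1.html\">XYZ Overview</a> first.</p>")
    (goto-char (point-min))
    (search-forward "test1.html")  ; point is now inside the opening <a ...> tag
    (sgml-delete-tag 1)            ; delete that tag and its matching </a>
    (buffer-string)))
```

The resulting string should keep the link text “XYZ Overview” but no longer contain the surrounding a tag.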
(defun my-modfile-page-tag (fpath)
  "Modify the HTML file at fpath."
  (let (fname mybuffer)
    (setq fname (file-name-nondirectory fpath))
    (setq mybuffer (find-file fpath))
    (search-forward "<div class=\"pages\">Goto Page:")
    (search-forward fname)
    (sgml-delete-tag 1)
    (save-buffer)
    (kill-buffer mybuffer)))
Then, we test this modified code by evaluating the following code:
(my-modfile-page-tag "/Users/xah/test1.html")
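One thing to watch for when running over many files: if a file lacks the page-nav string, search-forward signals an error and the buffer is left open. Below is a sketch of a more defensive variant. It is not the code used in this article; the name my-modfile-page-tag-safe, the use of find-file-noselect, and the choice to limit the second search to the nav div are all assumptions made for illustration.

```elisp
(require 'sgml-mode) ; for `sgml-delete-tag'

;; Hypothetical defensive variant: skips files that have no self-link
;; in the nav bar, and always closes the buffer it opened.
(defun my-modfile-page-tag-safe (fpath)
  "Remove the self-link from the page-nav bar of the HTML file at FPATH.
Return t if the file was changed, nil otherwise."
  (let ((fname (file-name-nondirectory fpath))
        (buf (find-file-noselect fpath))) ; visit the file without displaying it
    (unwind-protect
        (with-current-buffer buf
          (goto-char (point-min))
          (when (search-forward "<div class=\"pages\">Goto Page:" nil t)
            ;; Look for the file name only up to the end of the nav div.
            (let ((nav-end (save-excursion (search-forward "</div>" nil t))))
              (when (search-forward fname nav-end t)
                (sgml-delete-tag 1)
                (save-buffer)
                t))))
      ;; Runs whether or not an error occurred above.
      (kill-buffer buf))))
```

It is called the same way, with a single file path, and its return value tells whether the file was modified.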
To complete our task, all we have to do now is get the list of files that contain the page-nav tag and feed them to my-modfile-page-tag.
To get the list of files that contain the page-nav tag, we can simply use Unix's “find” and “grep”, like this:
find . -name "*.html" -exec grep -l '<div class="pages">' {} \;
For each line in the output, we just wrap it in double quotes to make it a lisp string, possibly also inserting the full path by using string-rectangle, to construct the following code:
(mapcar 'my-modfile-page-tag
        (list
         "/Users/xah/web/3d/viz.html"
         "/Users/xah/web/3d/viz2.html"
         "/Users/xah/web/dinju/Khajuraho.html"
         "/Users/xah/web/dinju/Khajuraho2.html"
         "/Users/xah/web/dinju/Khajuraho3.html"
         ;... 100+ lines
         )
        )
mapcar is the lisp idiom for looping through a list. Its first argument is a function, which is applied to every element in the list. The single quote in front of the function name is necessary: it makes lisp pass the symbol itself, instead of trying to evaluate it as a variable.
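Here is a small sketch of mapcar at work, followed by one possible way to gather the file list from inside emacs instead of using find and grep. The helper my-files-with-page-nav is hypothetical and was not part of the process described above; it relies on directory-files-recursively, which needs Emacs 25 or later.

```elisp
;; `mapcar' calls the function named by the quoted symbol on each element
;; and collects the return values into a new list.
(mapcar 'file-name-nondirectory
        '("/Users/xah/web/3d/viz.html" "/Users/xah/web/3d/viz2.html"))
;; returns ("viz.html" "viz2.html")

;; Hypothetical helper: list the HTML files under DIR that contain the
;; page-nav marker, doing the work of find and grep from inside emacs.
(defun my-files-with-page-nav (dir)
  "Return the .html files under DIR that contain the page-nav div."
  (let (result)
    (dolist (file (directory-files-recursively dir "\\.html\\'"))
      (with-temp-buffer
        ;; Read the file without visiting it; point stays at the start.
        (insert-file-contents file)
        (when (search-forward "<div class=\"pages\">" nil t)
          (push file result))))
    (nreverse result)))
```

With such a helper, the hand-built list could be replaced by a call like (my-files-with-page-nav "/Users/xah/web/"), and the result passed as the second argument to mapcar.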
Emacs is beautiful!
Page created: 2007-10. © 2007 by Xah Lee.