Xah Lee, 2008-12-02
This page shows a real-world example of using emacs's regex to update HTML image tags in all files in a dir. You should be familiar with Elisp Language Basics.
I need to add proper “alt="image description"” to all image tags for a bunch of HTML files in a dir. The alt's value should be based on the image's file name.
Technically, this page shows you how to use emacs's regex, with an elisp function for the replacement string, to do find/replace on all files in a dir.
I have many HTML files in a dir. Many have an image tag like this:
<img src="paraboloid.png" alt="math surface" width="832" height="513">
Note that the “alt” value in all of them is just “math surface”. I want the alt value to be more descriptive, based on the file name. So, in this example, it should be just “alt="paraboloid"”.
All these files are inside a dir, most of them inside various subdirs. There are about 100 files. About 50 of them have “alt="math surface"” that needs to be fixed.
The simplest solution is to use regex with a custom replacement function. (The method described here can also be used if your image tags don't have “alt=” and you need to add it.)
To do regexp replace on a bunch of files, one can use the builtin command dired-do-query-replace-regexp. So, all we have to do is to go to dired of the dir, call that command, give the find string and replacement string, and we are done.
Since the files are in different subdirs, I use find-dired first, which gets me all the files I want in one dired listing. So, I type “Alt+x find-dired”, then give the dir name, then give “-name "*html"”. The result is all html files in that dir and its subdirs.
Then, I mark all the files I want by typing “% m”, which invokes dired-mark-files-regexp. Then I give the pattern “\.html”, which marks all html files.
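For comparison, the same file list can be gathered from Lisp rather than from a dired listing. This is only a sketch: the function name my-html-files is made up for illustration, and it relies on the builtin directory-files-recursively (available in Emacs 25 and later).

```elisp
;; Hypothetical helper: list the files that find-dired with -name "*html"
;; would show, as a plain Lisp list of absolute file names.
(defun my-html-files (dir)
  "Return all files under DIR, including subdirs, whose names end in \"html\"."
  ;; The second argument is a regex matched against each file name
  ;; (not the full path); \\' anchors it to the end of the name.
  (directory-files-recursively dir "html\\'"))
```

Calling (my-html-files "~/web/") on a dir of your own returns the list; the path here is a placeholder, not the one used for this task.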
The next job is to write the regex search pattern. This is simple:
<img src="\([^"]+\)" alt="math surface" width="\([0-9]+\)" height="\([0-9]+\)">
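Before running the pattern on real files, it can be checked against the sample tag in Lisp. The sketch below is not part of the original workflow; the names my-img-regex and my-img-parts are made up for illustration. Note that inside a Lisp string every backslash and double quote must be escaped, while at the minibuffer prompt you type the pattern exactly as shown above.

```elisp
;; The search pattern from above, written as a Lisp string.
(defvar my-img-regex
  "<img src=\"\\([^\"]+\\)\" alt=\"math surface\" width=\"\\([0-9]+\\)\" height=\"\\([0-9]+\\)\">"
  "Hypothetical variable holding the image tag search pattern.")

(defun my-img-parts (tag)
  "Return (SRC WIDTH HEIGHT) if TAG matches `my-img-regex', else nil.
Hypothetical helper, for checking the pattern only."
  (when (string-match my-img-regex tag)
    (list (match-string 1 tag)    ; group 1: the file name
          (match-string 2 tag)    ; group 2: the width
          (match-string 3 tag)))) ; group 3: the height

(my-img-parts
 "<img src=\"paraboloid.png\" alt=\"math surface\" width=\"832\" height=\"513\">")
;; ⇒ ("paraboloid.png" "832" "513")
```

A tag whose alt is something other than “math surface” returns nil, so already-fixed tags are left alone.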
The heart of this task is to write the elisp function that gives us the replacement string, where the alt part is the transformed version of the file name. This is surprisingly simple too. Here's the lisp expression we need:
(concat "<img src=\"" (match-string 1) "\" alt=\"" (replace-regexp-in-string "\\.png\\'" "" (replace-regexp-in-string "_" " " (match-string 1))) "\" width=\"" (match-string 2) "\" height=\"" (match-string 3) "\">")
The “match-string” simply gives us the matched values. The interesting part is the replace-regexp-in-string we used to generate the value for alt. First, we replace each “_” with a space, then we delete the trailing “.png”. The dot is escaped and the pattern ends with “\\'” (end of string), so that only the file extension is removed; an unescaped “.png” would also match text such as “ png” in the middle of a name. That's all there is to it.
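To try the replacement without touching any files, the same expression can be run over a string in a temporary buffer. This is a sketch under assumptions: the function names my-alt-from-file-name and my-fix-img-tags and the file name “hyperbolic_paraboloid.png” are made up for illustration, and the extension pattern is escaped and anchored so that only a trailing “.png” is removed.

```elisp
(defun my-alt-from-file-name (file-name)
  "Return alt text for FILE-NAME: underscores become spaces, \".png\" is dropped.
Hypothetical helper mirroring the expression above."
  (replace-regexp-in-string
   "\\.png\\'" ""
   (replace-regexp-in-string "_" " " file-name)))

(defun my-fix-img-tags (html)
  "Return HTML with each matching img tag's alt rebuilt from its file name."
  (with-temp-buffer
    (insert html)
    (goto-char (point-min))
    (while (re-search-forward
            "<img src=\"\\([^\"]+\\)\" alt=\"math surface\" width=\"\\([0-9]+\\)\" height=\"\\([0-9]+\\)\">"
            nil t)
      ;; replace-regexp-in-string saves and restores the match data,
      ;; so groups 2 and 3 are still valid after the alt text is built.
      (replace-match
       (concat "<img src=\"" (match-string 1)
               "\" alt=\"" (my-alt-from-file-name (match-string 1))
               "\" width=\"" (match-string 2)
               "\" height=\"" (match-string 3) "\">")
       t t)) ; fixed case, literal replacement text
    (buffer-string)))

(my-fix-img-tags
 "<img src=\"hyperbolic_paraboloid.png\" alt=\"math surface\" width=\"832\" height=\"513\">")
;; ⇒ "<img src=\"hyperbolic_paraboloid.png\" alt=\"hyperbolic paraboloid\" width=\"832\" height=\"513\">"
```

This is only a dry run on a string; the interactive command remains the way the files themselves get changed.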
Finally, we invoke dired-do-query-replace-regexp in the dired buffer (the keyboard shortcut is “Q”). Emacs will ask for the search string. We give the search string as above, and then emacs will prompt for the replacement string. We type “\,‹lisp expression›”, where the lisp expression is the above code (copy and paste works).
Then, emacs will start the search/replace and stop whenever it finds a match. Type “y” to replace it and continue, “n” to skip it, or “!” to do all replacements in the current file. To continue through the rest of the files, just hold down “!”. You can also type “,” to do the replacement without moving on to the next match, so that you can visually check whether it is right.
Once done, type “Alt+x ibuffer”, then type “* u” to mark all unsaved buffers, type “S” to save them all, then type “D” to close them all.
Without emacs, the above operation might take an hour or two, and is tedious and error prone. Even with expertise in perl or python scripting, the problem is the lack of interactive see-and-do. With emacs, the whole operation takes less than 5 minutes.
The project that required this task is this: Gallery of Famous Surfaces.
Suppose you are given a task where hundreds of valid HTML files in a dir need to be converted to valid XHTML. Note that XHTML has a slightly different syntax. For example, all tags such as <p> and <li> now need to be closed. Tags like <img>, <hr>, <br> etc need to be like <img ... />, <hr/>, <br/>. Also, tags are now case sensitive, so you need to lower case them. Also, image tags now must be wrapped inside a container tag, such as “<div>”. The DTD also needs to be changed, and there are many style oriented tags that need to be transformed. (see: XHTML#Valid XHTML documents)
This task seems daunting. You could try a perl script in one shot, but it would probably take you a whole day or days to develop, and if your script has a parsing or regex error, it'll delete parts of your files without you knowing it. You could do a trial and error approach, experimenting with one regex replacement at a time. Still, your script runs in batch. If you make a mistake, you'll have to revert all your files. With mastery of emacs, you can do the above transform using regex find/replace one by one, interactively and safely, cutting your time about tenfold.
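As an illustration of one such step, here is how the bare <br> and <hr> tags might be closed and lower-cased. This is a sketch only: the function name my-close-void-tags is made up, and tags that carry attributes (such as <img ...>) are deliberately left for a separate pass. Interactively, the equivalent would be a search for “<\(br\|hr\)>” with replacement “<\1/>”.

```elisp
(defun my-close-void-tags (html)
  "Return HTML with bare <br> and <hr> tags rewritten as <br/> and <hr/>.
Hypothetical helper; tags with attributes are deliberately left alone."
  (let ((case-fold-search t))           ; also match <BR> and <HR>
    (replace-regexp-in-string
     "<\\(br\\|hr\\)>"
     ;; The function receives the matched text; group 1 is the tag name.
     (lambda (tag) (concat "<" (downcase (match-string 1 tag)) "/>"))
     html t t)))

(my-close-void-tags "a<BR>b<hr>")
;; ⇒ "a<br/>b<hr/>"
```

Each of the other XHTML changes would need its own search and replacement, checked interactively in the same way.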