This page shows a simple practical elisp script for HTML tag transformation.
I want to batch transform the tag
<span class="w">xyz</span> to
<b>xyz</b>, for over a hundred files, and print a report of the changes so that I can scan it to make sure there are no errors (for example, in case an HTML file has a mismatched span tag).
In my English vocabulary and literature study projects, many interesting words are marked up by this tag:
<span class="w">xyz</span>.
With CSS, it is rendered in bold.
I think that markup is too elaborate, and I want to replace it simply with
<b>xyz</b>, for over a few hundred files.
The following is a little side note on why I had “span.w” in the first place. (You can skip this section.)
I have the following “span” markups: { “span.w”, “span.b”, “span.r” }. The “span.w” marks an interesting word that's new to me, rendered in bold. These are typically difficult words.
Sometimes college-level words are still interesting, and I want to highlight them too, for high school or ESL students and myself. Sometimes these are familiar words used in an uncommon sense (e.g. “seedy” hotel). I mark up these words with “span.b”. They are rendered in blue.
The “span.r” is for highlighting an interesting {word, phrase, sentence} of the work, not necessarily for vocabulary study purposes. For example: an interesting thought, a quotable passage, or an interesting writing style. They are rendered in red.
As an example of how I use these markups, here's an excerpt from Gulliver's Travels. PART I — A VOYAGE TO LILLIPUT. Quote:
The declivity was so small, that I walked near a mile before I got to the shore, which I conjectured was about eight o'clock in the evening. I then advanced forward near half a mile, but could not discover any sign of houses or inhabitants; at least I was in so weak a condition, that I did not observe them. I was extremely tired, and with that, and the heat of the weather, and about half a pint of brandy that I drank as I left the ship, I found myself much inclined to sleep. I lay down on the grass, which was very short and soft, where I slept sounder than ever I remembered to have done in my life, and, as I reckoned, about nine hours; for when I awaked, it was just day-light. I attempted to rise, but was not able to stir: for, as I happened to lie on my back, I found my arms and legs were strongly fastened on each side to the ground; and my hair, which was long and thick, tied down in the same manner. I likewise felt several slender ligatures across my body, from my arm-pits to my thighs. I could only look upwards; the sun began to grow hot, and the light offended my eyes.
Here's some annotated works you might find interesting:
Note that I'm not replacing all the “span.w” in the above projects. I AM doing it for my vocabulary collection project. Sample page: Writer's Words ₄. This project is like a dictionary. Each page has many entries, each entry is marked by <div class="ent">…</div>, and every “span.w” occurs inside one of those. There are no “span.b” or “span.r” here. It is for this project that I thought “b” would be better than “span.w”.
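Here's an outline of the steps: open each HTML file in the directory, search for the “span.w” tag, replace each match with a “b” tag while recording the word, and finally print the list of changed words.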
Here's the code:
;; -*- coding: utf-8 -*-
;; 2011-07-18
;; replace <span class="w">…</span> to <b>…</b>
;;
;; do this for all files in a dir.

(setq inputDir "~/web/xahlee_org/PageTwo_dir/Vocabulary_dir/" ) ; dir should end with a slash

(setq changedItems '())

(defun my-process-file (fPath)
  "Process the file at FPATH …"
  (let (myBuff myWord)
    (setq myBuff (find-file fPath))

    (widen)
    (goto-char 1) ;; in case buffer already open

    (while (search-forward-regexp "<span class=\"w\">\\([^<]+?\\)</span>" nil t)
      (setq myWord (match-string 1))
      (when (< (length myWord) 15) ; a little double check in case of possible mismatched tag
        (replace-match (concat "<b>" myWord "</b>" ) t)
        (setq changedItems (cons (substring-no-properties myWord) changedItems ) )
        ) )

    ;; close buffer if there's no change. Else leave it open.
    (when (not (buffer-modified-p myBuff)) (kill-buffer myBuff) )
    ) )

(require 'find-lisp)

(setq make-backup-files t)
(setq case-fold-search nil)
(setq case-replace nil)

(let (outputBuffer)
  (setq outputBuffer "*xah span.w to b replace output*" )
  (with-output-to-temp-buffer outputBuffer
    (mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$"))
    (print changedItems)
    (princ "Done deal!")
    )
  )
The above is fairly easy to understand. You might refresh elisp basics at: Text Processing with Emacs Lisp Batch Style and Emacs Lisp Idioms (for writing interactive commands).
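To see what the regex and the length check do without touching any files, here is a minimal sketch that applies the same search and replace to a string in a temporary buffer. The function name my-demo-span-w-to-b and the sample text are made up for illustration and are not part of the script. It also passes a second t to replace-match so the replacement is inserted literally, which the script does not do.

```elisp
;; Hypothetical helper for experimenting, not part of the script above.
(defun my-demo-span-w-to-b (text)
  "Return a cons (NEW-TEXT . WORDS) after replacing span.w tags in TEXT.
WORDS lists the replaced words in the order they were found."
  (let ((case-fold-search nil)
        (words '()))
    (with-temp-buffer
      (insert text)
      (goto-char (point-min))
      (while (re-search-forward "<span class=\"w\">\\([^<]+?\\)</span>" nil t)
        (let ((word (match-string 1)))
          ;; Same guard as the script: skip suspiciously long matches.
          (when (< (length word) 15)
            ;; First t: keep case as is. Second t: insert the text literally.
            (replace-match (concat "<b>" word "</b>") t t)
            (push word words))))
      (cons (buffer-string) (nreverse words)))))

;; A plain word is converted. A span that contains a nested tag is left
;; alone, because [^<] cannot match across the nested <i>.
(my-demo-span-w-to-b
 "a <span class=\"w\">declivity</span> and <span class=\"w\">slender <i>x</i></span>")
```

Evaluating the last form returns the converted text together with the list of replaced words, which is the same information the script gathers in changedItems.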
Here's the output: elisp_batch_html_tag_transform_bold_output.txt.
There are over 1k changes. The output is extremely useful because I can take a few seconds to glance at it and know there are no errors. Errors are possible because, whenever regex is used to parse HTML, a missing tag or even an unexpected nested tag can mean disaster.
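One extra check that could complement the report is counting the “span.w” opening tags that remain after the replacement, since a span with a nested tag or an over-long match is skipped silently. The following is only a sketch under that assumption. The function name my-demo-count-leftover-span-w is hypothetical and it works on a string, not on the files themselves.

```elisp
;; Hypothetical helper, not part of the original script.
(defun my-demo-count-leftover-span-w (text)
  "Return the number of span.w opening tags still present in TEXT."
  (let ((case-fold-search nil)
        (count 0))
    (with-temp-buffer
      (insert text)
      (goto-char (point-min))
      ;; Plain string search: count every remaining opening tag.
      (while (search-forward "<span class=\"w\">" nil t)
        (setq count (1+ count)))
      count)))

;; One span with a nested tag is still there, so the count is 1.
(my-demo-count-leftover-span-w
 "<b>ok</b> <span class=\"w\">a <i>b</i></span>")
```

A nonzero count for a page points to tags the regex did not convert, which are worth inspecting by hand.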