Emacs Lisp: Processing HTML: Transform Tags from <span> to <b>

This page shows a simple practical elisp script for HTML tag transformation.

Problem Description

Summary

I want to batch transform the tag <span class="w">xyz</span> to <b>xyz</b> in over a hundred files, and print a report of the changes so that I can scan it to make sure there are no errors (for example, in case an HTML file has a mismatched span tag).

Detail

In my English vocabulary and literature study projects, many interesting words are marked up with this tag: <span class="w">xyz</span>. With CSS, it is rendered in bold. I think that markup is too elaborate, and I want to replace it with the simpler <b>xyz</b> in a few hundred files.

Side note

The following is a little side note on why I had “span.w” in the first place. (You can skip this section.)

I have the following “span” markups: { “span.w”, “span.b”, “span.r” }. The “span.w” marks an interesting word that's new to me, rendered in bold. These are typically difficult words.

Sometimes college-level words are still interesting, and I want to highlight them too, for high school or ESL students and myself. Sometimes these are familiar words used in an uncommon sense (e.g. “seedy” hotel). I mark up these words with “span.b”. They are rendered in blue.

The “span.r” is for highlighting an interesting {word, phrase, sentence} of the work, not necessarily for vocabulary study purposes, e.g. an interesting thought, a quotable passage, an interesting writing style. They are rendered in red.

As an example of how I use these markups, here's an excerpt from Gulliver's Travels, PART I: A VOYAGE TO LILLIPUT. Quote:

The declivity was so small, that I walked near a mile before I got to the shore, which I conjectured was about eight o'clock in the evening. I then advanced forward near half a mile, but could not discover any sign of houses or inhabitants; at least I was in so weak a condition, that I did not observe them. I was extremely tired, and with that, and the heat of the weather, and about half a pint of brandy that I drank as I left the ship, I found myself much inclined to sleep. I lay down on the grass, which was very short and soft, where I slept sounder than ever I remembered to have done in my life, and, as I reckoned, about nine hours; for when I awaked, it was just day-light. I attempted to rise, but was not able to stir: for, as I happened to lie on my back, I found my arms and legs were strongly fastened on each side to the ground; and my hair, which was long and thick, tied down in the same manner. I likewise felt several slender ligatures across my body, from my arm-pits to my thighs. I could only look upwards; the sun began to grow hot, and the light offended my eyes.

Here are some annotated works you might find interesting:

Note that I'm not replacing all the “span.w” in the above projects. I am doing it only for my vocabulary collection project. Sample page: Writer's Words ₄. This project is like a dictionary. Each page has many entries, each entry is marked by <div class="ent">…</div>, and every “span.w” occurs inside those. There are no “span.b” or “span.r” here. It is here that I thought “b” would be better than “span.w”.

Solution

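Here's an outline of the steps. For every file ending in “.html” under the input directory: open the file, search for each <span class="w">…</span>, and replace it with <b>…</b> if the enclosed text is short (fewer than 15 characters). Each replaced word is added to a list. Files with no change are closed, and files with changes are left open. When all files are done, the list of changed words is printed in an output buffer.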

Here's the code:

;; -*- coding: utf-8 -*-
;; 2011-07-18
;; replace <span class="w">…</span> with <b>…</b>
;;
;; do this for all files in a dir.

(setq inputDir "~/web/xahlee_org/PageTwo_dir/Vocabulary_dir/" ) ; dir should end with a slash

(setq changedItems '())

(defun my-process-file (fPath)
  "Process the file at FPATH …"
  (let (myBuff myWord)
    (setq myBuff (find-file fPath))

    (widen) (goto-char 1) ;; in case buffer already open

    (while (search-forward-regexp "<span class=\"w\">\\([^<]+?\\)</span>" nil t)
      (setq myWord (match-string 1))
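      (when (< (length myWord) 15) ; a little double check in case of possible mismatched tag
        ;; 2nd arg t: do not alter letter case. 3rd arg t: insert the text
        ;; literally, so a backslash in myWord is not treated specially.
        (replace-match (concat "<b>" myWord "</b>") t t)
        (setq changedItems (cons (substring-no-properties myWord) changedItems))))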

    ;; close buffer if there's no change. Else leave it open.
    (when (not (buffer-modified-p myBuff)) (kill-buffer myBuff) )
    ) )

(require 'find-lisp)

(setq make-backup-files t)
(setq case-fold-search nil)
(setq case-replace nil)

(let (outputBuffer)
  (setq outputBuffer "*xah span.w to b replace output*" )
  (with-output-to-temp-buffer outputBuffer
    (mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$"))
    (print changedItems)
    (princ "Done deal!")
    )
  )

The above is fairly easy to understand. You might refresh elisp basics at: Text Processing with Emacs Lisp Batch Style and Emacs Lisp Idioms (for writing interactive commands).
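To try the search-and-replace step without touching any files, the same regex and length check can be run on a string in a temporary buffer. This is only a minimal sketch: the function name “my-span-w-to-b” and its return format are my own assumptions for illustration, not part of the script above.

```elisp
;; Hypothetical helper (not in the original script): apply the script's
;; regex and length guard to a string instead of a file.

(defun my-span-w-to-b (html)
  "Return (NEW-HTML . WORDS) for the string HTML.
NEW-HTML has each short <span class=\"w\">…</span> replaced by <b>…</b>.
WORDS is the list of replaced words, in document order."
  (with-temp-buffer
    (insert html)
    (goto-char (point-min))
    (let ((case-fold-search nil) ; match the tag case-sensitively
          (words '()))
      (while (re-search-forward "<span class=\"w\">\\([^<]+?\\)</span>" nil t)
        (let ((word (match-string 1)))
          ;; same guard as the script: skip suspiciously long matches
          (when (< (length word) 15)
            ;; FIXEDCASE and LITERAL are both t: insert the text verbatim
            (replace-match (concat "<b>" word "</b>") t t)
            (push word words))))
      ;; push builds the list backwards, so reverse it before returning
      (cons (buffer-string) (nreverse words)))))

;; Example call:
;; (my-span-w-to-b "<p><span class=\"w\">seedy</span> hotel</p>")
```

The first element of the result is the transformed text, and the rest is the list of words that were changed, which is the same information the batch script collects in “changedItems”.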

Here's the output: elisp_batch_html_tag_transform_bold_output.txt.

There are over 1k changes. The output is extremely useful because I can take a few seconds to glance at it and know there are no errors. Errors are possible because, whenever you use regex to parse HTML, a missing tag or even an unexpected nested tag can mean disaster.
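The report lists what was changed. A complementary check is to count the “span.w” tags that the regex would leave alone, such as those with a nested tag or a missing closing tag. The following is a rough sketch under my own assumptions: the function name “my-span-w-skipped” is hypothetical, and it works on a string rather than on the files.

```elisp
;; Hypothetical check (not in the original script): count the span.w tags
;; that the script's regex and length guard would NOT replace.

(defun my-span-w-skipped (html)
  "Return the number of <span class=\"w\"> tags in string HTML left alone.
A tag is left alone when its content is empty, contains another tag,
lacks a closing tag, or is 15 characters or longer."
  (let ((case-fold-search nil)
        (opened 0)
        (replaced 0)
        (start 0))
    ;; count every opening tag
    (while (string-match "<span class=\"w\">" html start)
      (setq opened (1+ opened)
            start (match-end 0)))
    ;; count the tags the script would actually replace
    (setq start 0)
    (while (string-match "<span class=\"w\">\\([^<]+?\\)</span>" html start)
      (when (< (length (match-string 1 html)) 15)
        (setq replaced (1+ replaced)))
      (setq start (match-end 0)))
    (- opened replaced)))

;; Example call, with a nested tag inside the span:
;; (my-span-w-skipped "<span class=\"w\">odd <i>nested</i> word</span>")
```

A result of 0 means every “span.w” in the text has the simple form the script expects. Any other number is a hint to look at that text by hand.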
