Emacs Lisp: Command to Replace HTML Entities with Unicode Characters

This page shows you how to write an Emacs Lisp command that replaces HTML entities such as &eacute; with the corresponding Unicode character é.

Problem Description

I have many HTML files from existing sources that contain many HTML entities. I want a command that automatically changes them to Unicode characters. For example, &eacute; should become é.

(For more about HTML entities, see: Character Sets and Encoding in HTML and HTML/XML Entities List.)

The command should work on the current paragraph, or text selection.

Solution

This is easy to write. One of the basic elisp idioms is find and replace on a region, like this:

(defun replace-html-chars-region (start end)
  "Replace some HTML entities in region …."
  (interactive "r")
  (save-restriction
    (narrow-to-region start end)

    ;; Each pass restarts from the top of the narrowed region.
    (goto-char (point-min))
    (while (search-forward "&lsquo;" nil t) (replace-match "‘" nil t))

    (goto-char (point-min))
    (while (search-forward "&rsquo;" nil t) (replace-match "’" nil t))

    (goto-char (point-min))
    (while (search-forward "&ldquo;" nil t) (replace-match "“" nil t))

    (goto-char (point-min))
    (while (search-forward "&rdquo;" nil t) (replace-match "”" nil t))

    (goto-char (point-min))
    (while (search-forward "&eacute;" nil t) (replace-match "é" nil t))
    ;; more here
    ))

The (interactive "r") tells Emacs that this is a command that can be called by execute-extended-command 【M-x】, and the "r" means Emacs will pass the beginning and ending positions of the text selection as the function's arguments.

There are several problems with the above simple code.

① The code requires you to make a text selection first. It'd be better if it automatically worked on the text selection when there is one, and on the current paragraph otherwise.

For a solution to this, see: Emacs Lisp Idioms (for writing interactive commands).
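One possible way to handle the "selection, else current paragraph" behavior is sketched below. This is only an illustration, not the code from the linked page: the command name my-replace-eacute-region-or-paragraph is made up for this example, it handles a single entity to stay short, and it assumes the paragraph boundaries found by backward-paragraph and forward-paragraph are what you want.

```elisp
(defun my-replace-eacute-region-or-paragraph ()
  "Replace &eacute; with é in the active region, else in the current paragraph.
Hypothetical example command; only one entity is handled."
  (interactive)
  (let* ((bounds (if (use-region-p)
                     ;; There is an active text selection: use it.
                     (cons (region-beginning) (region-end))
                   ;; Otherwise find the paragraph around point.
                   (cons (save-excursion (backward-paragraph) (point))
                         (save-excursion (forward-paragraph) (point)))))
         (start (car bounds))
         (end (cdr bounds)))
    (save-excursion
      (save-restriction
        (narrow-to-region start end)
        (goto-char (point-min))
        (while (search-forward "&eacute;" nil t)
          (replace-match "é" nil t))))))
```

Because the command takes no arguments, it no longer needs the "r" code in interactive. It decides by itself which part of the buffer to work on.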

② The code in replace-html-chars-region is too verbose. It'd be much better if we could write it like this:

(defun replace-html-named-entities ()
 …
  (replace-pairs-in-string inputStr
    [
     ["&lsquo;" "‘"]
     ["&rsquo;" "’"]
     ["&ldquo;" "“"]
     ["&rdquo;" "”"]
     ["&eacute;" "é"]
     ]
  ))

For a solution to this, see: Emacs Lisp: Multi-Pair String Replacement Function.
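To give a rough idea of what such a helper could look like, here is a minimal sketch. It is not the function from the linked page: the name my-replace-pairs-in-string and its argument order are assumptions made for this example. It takes a string and a vector of [FIND REPLACE] pairs, and applies the pairs one after another in a temporary buffer.

```elisp
(defun my-replace-pairs-in-string (str pairs)
  "Return a copy of STR with each [FIND REPLACE] pair in PAIRS applied in order.
PAIRS is a vector of two-element vectors of literal strings.
Hypothetical helper for illustration."
  (with-temp-buffer
    (insert str)
    ;; Entity names are case sensitive, so match case exactly.
    (let ((case-fold-search nil))
      (mapc (lambda (pair)
              ;; Restart from the top for every pair.
              (goto-char (point-min))
              (while (search-forward (aref pair 0) nil t)
                ;; FIXEDCASE and LITERAL are both t: insert the text as is.
                (replace-match (aref pair 1) t t)))
            pairs))
    (buffer-string)))
```

Example call:

```elisp
(my-replace-pairs-in-string
 "&ldquo;caf&eacute;&rdquo;"
 [["&ldquo;" "“"] ["&rdquo;" "”"] ["&eacute;" "é"]])
;; → "“café”"
```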

③ Replacing multiple pairs of strings one by one may produce incorrect behavior.

Tricky Issue with Sequential Replacement of Multi-Pairs

Suppose you are working on an HTML tutorial that discusses HTML entities. Suppose the file contains this string:

use “&amp;copy;” for &copy;

The intended display is: use “&copy;” for ©.

However, if you replace the entities sequentially, the &amp; part first becomes &, which turns &amp;copy; into &copy;. Then every &copy; becomes ©, so you get: use “©” for ©. WRONG!

When you have many replacement pairs, doing them one by one, each time starting from the top of the document, may introduce unexpected changes, because the output of one replacement can be matched by a later one. A solution is to first replace each search string with a unique intermediate value, then replace the intermediate values with the final values.
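The intermediate-value idea can be sketched as follows. This is only an illustration under stated assumptions, not the code in xah_elisp_util.el: the function name my-replace-pairs-two-pass is made up, and the placeholders are characters from the Unicode Private Use Area starting at U+E000, which the sketch assumes never occur in the input text.

```elisp
(defun my-replace-pairs-two-pass (str pairs)
  "Return STR with PAIRS applied through unique intermediate placeholders.
PAIRS is a vector of [FIND REPLACE] vectors of literal strings.
Hypothetical helper; assumes STR has no characters from U+E000 upward
in the Private Use Area."
  (with-temp-buffer
    (insert str)
    (let ((case-fold-search nil))
      ;; Pass 1: replace each FIND string with its own placeholder character.
      (dotimes (i (length pairs))
        (goto-char (point-min))
        (while (search-forward (aref (aref pairs i) 0) nil t)
          (replace-match (string (+ #xE000 i)) t t)))
      ;; Pass 2: replace each placeholder with the final REPLACE string.
      (dotimes (i (length pairs))
        (goto-char (point-min))
        (while (search-forward (string (+ #xE000 i)) nil t)
          (replace-match (aref (aref pairs i) 1) t t))))
    (buffer-string)))
```

Example call:

```elisp
(my-replace-pairs-two-pass
 "use “&amp;copy;” for &copy;"
 [["&amp;" "&"] ["&copy;" "©"]])
;; → "use “&copy;” for ©"
```

The & produced from &amp; is inserted only in the second pass, after all searching for &copy; is done, so it can no longer combine with the following text to form a new entity.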

For the final code of replace-html-named-entities that fixes these problems, get it at xah_elisp_util.el.

You'll need to install 2 elisp libraries:
