Xah Lee
This page shows you how to write an elisp command that replaces HTML entities such as &eacute; with the corresponding Unicode character é.
I have many HTML files from existing sources that contain many HTML entities. I want a command that automatically changes them to Unicode characters. Example:
&lsquo; ⇒ ‘
&rsquo; ⇒ ’
&ldquo; ⇒ “
&rdquo; ⇒ ”
&eacute; ⇒ é

(For more about HTML entities, see: Character Sets and Encoding in HTML ◇ HTML/XML Entities List.)
The command should work on the current paragraph, or text selection.
This is easy to write. One of the basic elisp idioms is find & replace on a region, like this:
(defun replace-html-chars-region (start end)
  "Replace some HTML entities in region …."
  (interactive "r")
  (save-restriction
    (narrow-to-region start end)
    (goto-char (point-min))
    (while (search-forward "&lsquo;" nil t) (replace-match "‘" nil t))
    (goto-char (point-min))
    (while (search-forward "&rsquo;" nil t) (replace-match "’" nil t))
    (goto-char (point-min))
    (while (search-forward "&ldquo;" nil t) (replace-match "“" nil t))
    (goto-char (point-min))
    (while (search-forward "&rdquo;" nil t) (replace-match "”" nil t))
    (goto-char (point-min))
    (while (search-forward "&eacute;" nil t) (replace-match "é" nil t))
    ;; more here
    ))
The (interactive "r") tells emacs that this is a command that can be called by execute-extended-command 【M-x】 and the "r" means emacs will feed the beginning and ending text selection positions to your function's parameters.
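To see what "r" does in isolation, here is a minimal sketch. The command name my-region-char-count is made up for illustration; it just reports how many characters lie between the two positions emacs passes in.

```elisp
;; Hypothetical demo command, not part of the article's final code.
(defun my-region-char-count (start end)
  "Report the number of characters between START and END."
  ;; "r" makes emacs pass the region's beginning and end as START and END
  ;; when the command is called interactively.
  (interactive "r")
  (message "Region has %d characters" (- end start)))
```

Called from lisp code, you pass the positions yourself, for example (my-region-char-count 1 5).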
There are several problems with the above simple code.
① The code requires you to make a text selection first. It'd be better if it automatically worked on the text selection if there is one, and otherwise on the current paragraph.
For a solution to this, see: Emacs Lisp Idioms (for writing interactive commands).
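One possible way to get the "selection if any, else current paragraph" behavior is sketched below. This is only an illustration under assumptions: the command name my-replace-eacute-dwim is hypothetical, it handles a single entity to stay short, and it takes the paragraph boundary from bounds-of-thing-at-point, which may differ from what the linked page does.

```elisp
(require 'thingatpt)

;; Hypothetical sketch: pick the region if active, else the current paragraph.
(defun my-replace-eacute-dwim ()
  "Replace &eacute; with é in the active region, else in the current paragraph."
  (interactive)
  (let ((bounds (if (use-region-p)
                    (cons (region-beginning) (region-end))
                  (bounds-of-thing-at-point 'paragraph))))
    ;; BOUNDS is nil when there is no paragraph at point (e.g. empty buffer).
    (when bounds
      (save-excursion
        (save-restriction
          (narrow-to-region (car bounds) (cdr bounds))
          (goto-char (point-min))
          (while (search-forward "&eacute;" nil t)
            (replace-match "é" nil t)))))))
```

With no text selection, only the paragraph containing the cursor is changed; other paragraphs are left alone.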
② The elisp code above is too verbose. It'd be much better if we could write it like this:
(defun replace-html-named-entities ()
  …
  (replace-pairs-in-string inputStr
    [
     ["&lsquo;" "‘"]
     ["&rsquo;" "’"]
     ["&ldquo;" "“"]
     ["&rdquo;" "”"]
     ["&eacute;" "é"]
     ]))
For a solution to this, see: Emacs Lisp: Multi-Pair String Replacement Function.
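To give a rough idea of what such a helper might look like, here is a minimal sketch. It is not the function from the linked page; the name my-replace-pairs-in-string is hypothetical, and it simply applies each pair in order, with literal, case-sensitive matching.

```elisp
;; Hypothetical helper for illustration; replaces pairs one after another.
(defun my-replace-pairs-in-string (str pairs)
  "Return STR with each [FROM TO] pair in PAIRS replaced, in order.
PAIRS is a sequence of two-element vectors of strings."
  (let ((case-fold-search nil))        ; match case-sensitively
    (mapc (lambda (pair)
            (setq str (replace-regexp-in-string
                       (regexp-quote (aref pair 0)) ; treat FROM literally
                       (aref pair 1)
                       str
                       t      ; FIXEDCASE: do not alter case of replacement
                       t)))   ; LITERAL: no special meaning for \ or & in TO
          pairs)
    str))

;; Usage:
;; (my-replace-pairs-in-string "caf&eacute;" [["&eacute;" "é"]])
;; → "café"
```

Because this sketch still replaces the pairs sequentially, it only shortens the code. It does not by itself guard against one replacement feeding into the next.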
③ Replacing multiple pairs of strings one by one may produce incorrect behavior.
Suppose you are working on an HTML tutorial that discusses HTML entities. Suppose the file contains this string:
use “&amp;copy;” for &copy;
The intended display is: use “&copy;” for ©.
However, if you replace the entities sequentially, the &amp; part first becomes &, which leaves &copy;, and that then becomes just ©. So you get: use “©” for ©. WRONG!
When you have many replacement pairs, doing them one by one, each time starting from the top of the document, may introduce unexpected changes, because the output of one replacement can be matched by a later one. A solution is to first replace each find string with a unique intermediate value, then replace the intermediate values with the final values.
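The intermediate-value idea can be sketched as follows. This is an illustration only, not the code in xah_elisp_util.el. The function name my-replace-pairs-via-placeholders is hypothetical, and the sketch assumes that neither the input nor the replacement strings contain the private-use characters U+E000 and U+E001, which are used here to build the placeholders.

```elisp
;; Hypothetical two-pass replacement using unique placeholders.
(defun my-replace-pairs-via-placeholders (str pairs)
  "Return STR with each [FROM TO] pair in PAIRS replaced in two passes.
Pass 1 maps each FROM to a unique placeholder; pass 2 maps each
placeholder to its TO, so no TO text is ever searched again."
  (let ((case-fold-search nil)
        (i 0)
        placeholders)
    ;; Pass 1: FROM -> placeholder. Assumes U+E000/U+E001 do not occur in STR.
    (mapc (lambda (pair)
            (let ((ph (format "\uE000%d\uE001" i)))
              (push (cons ph (aref pair 1)) placeholders)
              (setq str (replace-regexp-in-string
                         (regexp-quote (aref pair 0)) ph str t t))
              (setq i (1+ i))))
          pairs)
    ;; Pass 2: placeholder -> final value.
    (dolist (entry placeholders)
      (setq str (replace-regexp-in-string
                 (regexp-quote (car entry)) (cdr entry) str t t)))
    str))

;; Usage:
;; (my-replace-pairs-via-placeholders
;;  "use “&amp;copy;” for &copy;"
;;  [["&amp;" "&"] ["&copy;" "©"]])
;; → "use “&copy;” for ©"
```

The & produced from &amp; only appears in the second pass, after all searching for &copy; is done, so it cannot combine with the following copy; to form a new entity.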
The final code of replace-html-named-entities, which fixes these problems, is in xah_elisp_util.el.
You'll need to install 2 elisp libraries: