Elisp Tutorial: Writing htmlize-block Function for Syntax Coloring Code In HTML

Xah Lee, 2007-10

This page shows an example of writing an Emacs Lisp function that processes a block of text and syntax colors it with HTML tags. If you don't know elisp, first take a look at Emacs Lisp Basics.

The Problem

Summary

I want to write an elisp function such that, when invoked, the block of text the cursor is on gets HTML “<span class="xyz">” tags wrapped around its syntactic elements. This is for the purpose of publishing programming language code in HTML on the web.

Detail

I write a lot of computer programming tutorials for several computer languages. For example: Perl and Python tutorial, Java tutorial, Emacs Lisp tutorial, JavaScript tutorial. These tutorials often contain code snippets, and the snippets need to be syntax colored in HTML.

For example, here's an elisp code snippet:

(if (< 3 2)  (message "yes") )

Here's what i actually want as raw HTML:

(<span class="keyword">if</span> (&lt; 3 2)  (message <span class="string">"yes"</span>) )

Which should look like this in a web browser:

(if (< 3 2)  (message "yes") )

There is an emacs package that turns syntax-colored text in emacs into HTML. This is extremely nice. The package is called htmlize.el and is written (1997,...,2006) by Hrvoje Niksic, available at http://fly.srk.fer.hr/~hniksic/emacs/htmlize.el.

This package provides a few new emacs commands. Primarily, it has htmlize-region, htmlize-buffer, htmlize-file. The region and buffer commands output HTML code into a new buffer, and htmlize-file takes an input file name and writes the output to a file.

When i need to include a code snippet in my tutorial, typically i write the code in a separate file (e.g. “temp.java”, “temp.py”), run it to make sure the code is correct (compile, if necessary), then copy the file's content into the HTML tutorial page, inside a “pre” block. In this scheme, the best way for me to use htmlize.el is to run the “htmlize-buffer” command on my temp.java, then copy the htmlized output and paste it into my HTML tutorial file inside a “pre” block.

However, many of my tutorials were written haphazardly over the years, before i saw the need for syntax coloring, so most code already sits inside “pre” tags with no temp code file. So, in most cases, what i do is select the text inside the “pre” tag, paste it into a temp buffer, invoke the right mode for the language (so the text will be fontified correctly), run htmlize-buffer, copy the html output, then paste it back to replace the selected text.

This process is tedious. A page may have several code snippets. For each, i will need to select text, create a buffer, switch mode, do htmlize, select again, switch buffer, then paste. Each text-selection step involves multiple keystrokes with deliberate eye-balling or precision mousing. I have a few hundred pages for potential colorization.

It would be wonderful if i could place the cursor on a code block, press a button, and have emacs magically replace the code block with an htmlized version colorized for that language. We proceed to write this function.

Solution

For those elisp experts who have worked with emacs fontification, the solution would be to write a function that maps the string's fontification info into html tags. This is exactly what htmlize.el does. Since it is already written, an elisp expert might find the essential code in htmlize.el. (The code is licensed under the GPL.)

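Unfortunately, my lisp experience isn't so great. I spent maybe 30 minutes trying to look through htmlize.el in the hope of finding a function, something like htmlize-str, that is the essence, but wasn't successful. I figured it would actually be faster to take the dumb and inefficient approach: write elisp code that extracts the output of the htmlize-buffer command. Here's the outline of the plan of my function: grab the text inside the “pre” block; put it into a new temp buffer and set that buffer to the mode for the language; call htmlize-buffer on it; grab the htmlized text between the “<pre>” and “</pre>” tags of the generated output; kill the temp buffers; and insert the htmlized text back into the original buffer, replacing the original code.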

To achieve the above, i decided on 2 steps. A: Write a function “htmlize-string” that takes a string and mode name, and returns the htmlized string. B: Write a function “htmlize-block” that does the steps of grabbing text and pasting, and calls “htmlize-string” for the actual htmlization.

Here's the code of my htmlize-string function:

(require 'htmlize)

(defun htmlize-string (ccode mn)
  "Take string CCODE and return htmlized code, using mode MN.
MN is the name of a major mode command, given as a string.
This function requires htmlize.el by Hrvoje Niksic."
  (let (cur-buf temp-buf temp-buf2 x1 x2 resultS)
    (setq cur-buf (buffer-name))
    (setq temp-buf "xout-weewee")
    (setq temp-buf2 "*html*") ; the buffer that htmlize-buffer creates

    ;; put the code in a new buffer, set the mode
    (switch-to-buffer temp-buf)
    (insert ccode)
    (funcall (intern mn))
    (font-lock-fontify-buffer)

    (htmlize-buffer temp-buf)
    (kill-buffer temp-buf)
    (switch-to-buffer temp-buf2)

    ;; extract the core code
    (goto-char (point-min)) ; make sure the search starts at the top
    (setq x1 (re-search-forward "<pre>"))
    (setq x1 (+ x1 1)) ; skip the newline right after <pre>
    (re-search-forward "</pre>")
    (setq x2 (re-search-backward "</pre>"))
    (setq resultS (buffer-substring-no-properties x1 x2))
    (kill-buffer temp-buf2)

    (switch-to-buffer cur-buf)
    resultS))

The major part of this code is knowing how to create, switch, and kill buffers; then, how to set a mode; lastly, how to grab text in a buffer.

The current buffer's name is given by “buffer-name”. Creating or switching to a buffer is done by “switch-to-buffer”. Killing a buffer is “kill-buffer”. To activate a mode, the code is “(funcall (intern mn))”. “funcall” calls a function, passing it any remaining arguments. Its first argument must be a function, such as a lisp symbol that names one. Our variable “mn” evaluates to a string, so we use “intern”, which takes a string and returns the symbol with that name.
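To see “intern” and “funcall” working together outside of htmlize-string, here's a minimal sketch. The variable name “my-mode-name” and its value are just examples for illustration, and “with-temp-buffer” is used so that nothing in your current buffer is touched.

```elisp
;; a mode name held as a string, the way htmlize-string receives it
;; (my-mode-name is a hypothetical variable for this example)
(defvar my-mode-name "emacs-lisp-mode"
  "Example mode name, as a string.")

;; intern turns the string into the symbol emacs-lisp-mode
(intern my-mode-name)

;; funcall calls the function that the symbol names,
;; here inside a throwaway buffer
(with-temp-buffer
  (funcall (intern my-mode-name))
  major-mode) ; evaluates to the symbol emacs-lisp-mode
```

Evaluating the last form should show that the temp buffer's major mode was switched by a function we only knew by its name string.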

Grabbing the text is done by locating the desired beginning and ending positions with the re-search functions, then using buffer-substring-no-properties to actually extract the string.
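The same locate-then-extract idea can be tried on a small string, without htmlize involved. This is only a sketch: “my-extract-pre-content” is a made-up helper name for illustration, not something from htmlize.el.

```elisp
(defun my-extract-pre-content (html)
  "Return the text between <pre> and </pre> in the string HTML.
A hypothetical helper for illustration; not part of htmlize.el."
  (with-temp-buffer
    (insert html)
    (goto-char (point-min))
    (let (x1 x2)
      ;; re-search-forward returns the position just after the match
      (setq x1 (re-search-forward "<pre>"))
      ;; searching backward from the end of the buffer returns
      ;; the position where the match begins
      (goto-char (point-max))
      (setq x2 (re-search-backward "</pre>"))
      (buffer-substring-no-properties x1 x2))))

(my-extract-pre-content "<body><pre>(message \"yes\")</pre></body>")
;; returns "(message \"yes\")"
```

Note that both search functions signal an error when nothing matches, so a string without a “pre” block will stop the function with a search failure.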

Here, note the “no-properties” in “buffer-substring-no-properties”. An Emacs string can carry extra information called text properties, which is where the fontification information (such as faces) is kept. Here we only want the plain characters.
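A small sketch of the difference, using a string that is given a face property by hand instead of by font-lock. The function name “my-properties-demo” is made up for this example.

```elisp
(defun my-properties-demo ()
  "Return the text properties seen with and without the no-properties variant.
A hypothetical demo function for illustration."
  (with-temp-buffer
    ;; insert a string that carries a face property, as fontified code would
    (insert (propertize "hello" 'face 'bold))
    (list
     ;; buffer-substring keeps the properties
     (text-properties-at 0 (buffer-substring (point-min) (point-max)))
     ;; buffer-substring-no-properties drops them
     (text-properties-at 0 (buffer-substring-no-properties (point-min) (point-max))))))

(my-properties-demo)
;; returns ((face bold) nil)
```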

Reference: Elisp Manual: Buffers.

Reference: Elisp Manual: Text-Properties.

Here's the code of my htmlize-block function:

(defun htmlize-block ()
  "Replace the region enclosed by <pre> tag to htmlized code.
For example, if the cursor is somewhere inside the tag:

<pre class=\"elisp\">
codeXYZ...
</pre>

after calling, the “codeXYZ...” block of text will be htmlized.
That is, wrapped with many <span> tags.

The opening tag must be of the form <pre class=\"lang\">.
The “lang” determines what emacs mode is used to colorize
the code.
This function requires htmlize.el by Hrvoje Niksic."
  (interactive)
  (let (mycode tag-begin styclass code-begin code-end tag-end mymode)
    ;; locate the pre block and grab the code inside it
    (setq tag-begin (re-search-backward "<pre class=\"\\([A-Za-z-]+\\)\""))
    (setq styclass (match-string 1))
    (setq code-begin (re-search-forward ">"))
    (re-search-forward "</pre>")
    (setq code-end (re-search-backward "<"))
    (setq tag-end (re-search-forward "</pre>"))
    (setq mycode (buffer-substring-no-properties code-begin code-end))
    ;; map the class value to a mode name
    (cond
     ((equal styclass "elisp") (setq mymode "emacs-lisp-mode"))
     ((equal styclass "perl") (setq mymode "cperl-mode"))
     ((equal styclass "python") (setq mymode "python-mode"))
     ((equal styclass "java") (setq mymode "java-mode"))
     ((equal styclass "html") (setq mymode "html-mode"))
     ((equal styclass "haskell") (setq mymode "haskell-mode"))
     ;; stop here instead of passing nil to htmlize-string
     (t (error "No mode known for pre class \"%s\"" styclass)))
    ;; replace the code with its htmlized version
    (save-excursion
      (delete-region code-begin code-end)
      (goto-char code-begin)
      (insert (htmlize-string mycode mymode)))))

The outline of this function is to grab the text inside the “pre” block, call htmlize-string, then insert the result in place of the original text.

Originally, i planned to determine the extent of the code block by matching for “<pre>...</pre>” tags, then use some heuristics on the text to determine what language it is (by a simple regex match for certain strings particular to the lang), then call htmlize-string with the mode name passed to it. However, my html pages already have the language information as the pre tag's attribute: “<pre class="perl">” (for CSS reasons). So, now i search for text of that form, and use the “class” value to determine a mode.
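The class-to-mode step can also be written as a table lookup, which makes adding a language a one-line change. This is a sketch of an alternative, not the code used in htmlize-block; the names “my-htmlize-class-mode-alist” and “my-htmlize-class-to-mode” are made up for illustration.

```elisp
(defvar my-htmlize-class-mode-alist
  '(("elisp" . "emacs-lisp-mode")
    ("perl" . "cperl-mode")
    ("python" . "python-mode")
    ("java" . "java-mode")
    ("html" . "html-mode")
    ("haskell" . "haskell-mode"))
  "Hypothetical table mapping a pre tag's class value to a mode name.")

(defun my-htmlize-class-to-mode (styclass)
  "Return the mode name for STYCLASS, or nil if the class is unknown."
  ;; assoc compares keys with equal, so string keys work
  (cdr (assoc styclass my-htmlize-class-mode-alist)))

(my-htmlize-class-to-mode "perl")  ; returns "cperl-mode"
(my-htmlize-class-to-mode "cobol") ; returns nil
```

A caller would still need to decide what to do on nil, for example signal an error before any text is deleted.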

Emacs is beautiful.

Note: quote from htmlize.el: «htmlize supports three types of HTML output, selected by setting “htmlize-output-type”: “css”, “inline-css”, and “font”. ... “css” mode is the default.». My htmlize-block and htmlize-string assume the css mode too. This means you'll have to do a one-time manual process of grabbing the CSS from the htmlized output and placing it in your own CSS file.

If your html is in unicode, you might set these variables for htmlize:

(setq htmlize-convert-nonascii-to-entities nil)
(setq htmlize-html-charset "utf-8")

Piecemeal Process

Postscript:

The story given above is slightly simplified. For example, when i began my language notes and commentaries, they were not planned as a systematic or sizable tutorial. As the pages grew, more polish was added in the editorial process. So, plain un-colored code inside “pre” started to have language comment strings colorized (e.g. “<span class="cmt">#...</span>”), using simple elisp code that wraps a tag around them, with the function mapped to a shortcut key for easy execution. As pages and languages grew, i found that colorizing comments wasn't enough, and i started to look for a syntax-coloring html solution. There are solutions in Perl, Python, PHP, but I find the emacs solution best suits my needs, in particular because it is integrated with emacs's interactive nature, and my writing work is done in an accumulative, piecemeal, editorial process.

Once i found and decided to use htmlize.el, i used the commands htmlize-region and htmlize-buffer when writing new tutorial pages. Note that this is still a laborious process involving multiple deliberate copy-paste operations. Gradually i needed to colorize my existing tutorial pages. The problem is that many already contained my own «span class="cmt"» tags, and strings common in computer languages such as “<=” had already been transformed into the required html encoding “&lt;=”. So, the elisp code first needed to “un-htmlize” these, and initially my htmlize-block code contained lines to un-htmlize those specific tags and entity strings. After many months, when all my existing code had been newly colorized, the part of the code that transforms strings for un-htmlizing was no longer necessary, so it was taken out of htmlize-block, which returned to a cleaner state.

Also, htmlize-block went thru many revisions over the year. At some point in the recent past, i had one code wrapper for each language. For example, i had htmlize-me-perl, htmlize-me-python, htmlize-me-java, etc. The thought of unifying them into a single coherent wrapper hadn't materialized yet. In general, it is my experience, in particular in writing elisp customization for emacs, that tweaking code periodically thru the year is practical, because it adapts to the constant changes of requirements, environment, and work process. For example, eventually i might write my own htmlize.el, if i happen to need more flexibility, or if my elisp experience grows enough to make the job relatively easy.
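The un-htmlize lines themselves are no longer in htmlize-block, so what follows is only a sketch of what such a step could look like, not the code that was actually used. It assumes the only things to undo are the «span class="cmt"» wrapper and the three entities “&lt;”, “&gt;”, “&amp;”. The function name “my-un-htmlize-string” is made up for illustration.

```elisp
(defun my-un-htmlize-string (str)
  "Return STR with cmt span tags removed and basic HTML entities decoded.
A hypothetical helper for illustration."
  (let ((result str))
    ;; strip the comment span wrapper, keeping the text inside it
    (setq result (replace-regexp-in-string
                  "<span class=\"cmt\">\\(.*?\\)</span>" "\\1" result))
    ;; decode entities; &amp; goes last so that a literal
    ;; \"&amp;lt;\" in the source is not decoded twice
    (setq result (replace-regexp-in-string "&lt;" "<" result t t))
    (setq result (replace-regexp-in-string "&gt;" ">" result t t))
    (setq result (replace-regexp-in-string "&amp;" "&" result t t))
    result))

(my-un-htmlize-string "<span class=\"cmt\"># a &lt;= b</span>")
;; returns "# a <= b"
```

Since “.” does not match a newline in emacs regex, this sketch only unwraps spans that sit on a single line, which fits line comments like the “#...” example.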

Also note: a wholesale solution is to write a program in, say, Python, that processes html files and replaces the proper sections with htmlized strings. This is perhaps more efficient if all the existing html files are in some uniform format by a spec (even though all my HTML pages are already HTML 4 validated by w3c), and if this needs to be done only once. However, i need to work on my tutorials on a case-by-case basis, in part because some pages contain multiple languages or contain pseudo-code that i do not wish colorized. (For example, some pages contain code in the Mathematica language. Mathematica code is normally written in Mathematica's mathematical-typesetting-capable “front-end” IDE called “Notebook” and is not “syntax-colored” as such.)


Page created: 2007-10.
© 2007 by Xah Lee.