Generate a Web Links Report with Emacs Lisp


, 2008-01

This page shows a real-world example of using Emacs Lisp to process the HTML files in a directory, search them for a text pattern, and generate a report. If you don't know elisp, first take a look at Emacs Lisp Basics.

Problem Description

Summary

I want to write an elisp program that visits all HTML files in a directory (and all its sub-directories), extracts all links to a particular website, then prints a report of the links and the files each one is linked from. The expected number of files is about 3500.

In this lesson, you'll learn how to use elisp to traverse a directory and all its sub-directories, how to build a hash table, and an elisp idiom for programmatically processing a large number of files.

Detail

As of today, I have over 3500 links to Wikipedia on my site xahlee.org, scattered across about 2500 files. I want to know which Wikipedia articles I've linked to, and from which files. Once I've written a program to extract this info, I can create a report in HTML for easy reading, with clickable links. The report can use an HTML table where each row has 2 columns: the first is a link to the Wikipedia article, the second is a link to the file containing the Wikipedia link.

Normally, one can just do a grep like this: grep -r 'wikipedia\.org/' dir. This gives a nice overview, but grep's output also contains the adjacent text. If I want to know which links are duplicated, or to organize the output by the file names the links come from, a simple grep cannot solve the problem. Typically, one would write a Perl program, but as previously indicated (See: Text Processing with Emacs Lisp: Transforming Page Tag.), doing this in elisp is easier and more flexible.

Solution

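Here are the functional units we will need: open a file, search it for links to Wikipedia, record each link together with the file's path in a hash table, and repeat this for every HTML file in the directory and its sub-directories.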

Once we have the data in a hash-table, it is very flexible. We can then generate a report in plain text or HTML, or do other processing.

Hash Table

First, we'll need to know how to use a hash table in Emacs. Here's a simple example:

(let (myhash val)

  ;; create a hash table
  (setq myhash (make-hash-table :test 'equal))

  ;; add entries
  (puthash "mary" "19" myhash)
  (puthash "jane" "20" myhash)
  (puthash "carrie" "17" myhash)
  (puthash "liz" "21" myhash)

  ;; modify an entry's value
  (puthash "jane" "16" myhash)

  ;; remove an entry
  (remhash "liz" myhash)

  ;; get an entry's value
  (setq val (gethash "jane" myhash))

  ;; print it (pass the value as an argument, not as the format string)
  (message "%s" val))

For a detailed lesson on hash tables, see: Emacs Lisp Tutorial: Hash Table.
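
The report step will later need to visit every entry in the table. Here is a minimal sketch of how that could look with maphash. The function name my-hash-to-lines is made up for illustration and is not part of the final program.

```elisp
(defun my-hash-to-lines (table)
  "Return a sorted list of \"KEY is VALUE\" strings, one per entry in TABLE.
Hypothetical helper for illustration only."
  (let ((lines '()))
    ;; maphash calls the function once per entry, passing key and value
    (maphash (lambda (key val)
               (push (format "%s is %s" key val) lines))
             table)
    ;; a hash table has no useful order, so sort before printing
    (sort lines 'string<)))

(let ((myhash (make-hash-table :test 'equal)))
  (puthash "mary" "19" myhash)
  (puthash "jane" "16" myhash)
  (my-hash-to-lines myhash))
;; returns ("jane is 16" "mary is 19")
```

Collecting into a list first and sorting afterwards keeps the output stable from run to run.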

Process a Single File

In our situation, we want each key of the hash table to be a full URL string, and each value to be a list. Each element in the list is the full path of a file that contains the link.
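
To make the data structure concrete, here is a small sketch that adds file paths under a URL key. The helper my-add-link and the file paths are made up for illustration.

```elisp
(defun my-add-link (url file table)
  "Add FILE to the list of files stored under URL in TABLE.
Hypothetical helper, not part of the original program."
  ;; gethash returns nil for a missing key, so cons works in both cases
  (puthash url (cons file (gethash url table)) table))

(let ((table (make-hash-table :test 'equal)))
  (my-add-link "http://en.wikipedia.org/wiki/Emacs" "/web/a.html" table)
  (my-add-link "http://en.wikipedia.org/wiki/Emacs" "/web/b.html" table)
  (my-add-link "http://en.wikipedia.org/wiki/Lisp" "/web/a.html" table)
  ;; the Emacs entry now holds both files, most recent first
  (gethash "http://en.wikipedia.org/wiki/Emacs" table))
;; returns ("/web/b.html" "/web/a.html")
```

The table must be created with :test 'equal, because the keys are strings and should be compared by content.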

Here is the code that processes a single file. It opens the file and searches for URLs. For each URL found, it checks whether the URL already exists in the hash table. If not, it adds a new entry; otherwise, it prepends the file to the existing entry.

(defun ff ()
  "Test code to process a single file."
  (interactive)
  (let (myhash myBuff myfile url)
    (setq myfile "~/web/p/time_machine/tm.html") ; the file to process

    (setq myhash (make-hash-table :test 'equal)) ; create hash table

    (setq myBuff (find-file myfile)) ; open file
    (goto-char (point-min)) ; search from the top, even if the file was already open

    ;; repeat search for URL till not found
    (while
        (re-search-forward
         "href=\"\\(http://..\\.wikipedia\\.org/[^\"]+\\)\">\\([^<]+\\)</a>"
         nil t)

      (setq url (match-string 1)) ; set the url to matched string

      ;; if exist in hash, prepend to existing entry, else just add
      (if (gethash url myhash)
          (puthash url (cons myfile (gethash url myhash)) myhash)
        (puthash url (list myfile) myhash)))

    (kill-buffer myBuff) ; close file

    ;; print number of elements to see if this worked
    (message "%d" (hash-table-count myhash))))

The sample file processed in the above code is this: Time Machine.

The above code prints 2, since there are 2 Wikipedia links in that file. This means the code is working.

Note that in the above, we opened a file in a buffer, did some processing, then closed it. The form is this:

(setq myBuff (find-file myfile)) ; open file
…
(kill-buffer myBuff) ; close file

When Emacs opens a file, it performs several tasks behind the scenes. For example, it loads the proper mode, colorizes the text, and automatically uses the proper decoding to display the text. This is all good, but if you are going to programmatically process thousands of files, you usually don't need Emacs to load the language mode, do syntax coloring, or record undo info. So, what's the proper idiom to open a file for read-only processing without loading the mode or syntax coloring? Here's the answer:

;; an elisp idiom for read-only processing of a file without user interaction
(save-current-buffer
  (set-buffer (get-buffer-create " myTemp"))
  (insert-file-contents filePath nil nil nil t)
  ; process it …
  (kill-buffer " myTemp")
)

The key is to create a temp buffer, insert the file's content into that buffer, then work with that buffer.

In the above code, get-buffer-create creates the temp buffer. It takes a string as the new buffer's name. If the name starts with a space, no undo info is kept (good for us). set-buffer makes that buffer current without displaying it, so that all text manipulation commands target the temp buffer. insert-file-contents takes several optional parameters; the final t we pass as its fifth argument makes it replace any existing content in the buffer (see its inline doc for detail). save-current-buffer restores the original current buffer when its body finishes. (Without it, code that runs afterwards would still be operating on the invisible temp buffer.)
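
As a sketch of how the idiom and the search loop might fit together, the following function processes one file in a temp buffer. It uses the built-in macro with-temp-buffer, which creates a temp buffer (its name also starts with a space), makes it current for the body, and kills it afterwards. The function name my-links-in-file is an assumption for illustration, and the final program may be organized differently.

```elisp
(defun my-links-in-file (file table)
  "Record in TABLE every Wikipedia link found in FILE, then return TABLE.
Hypothetical helper; the regex is the one used in the function ff."
  (with-temp-buffer
    ;; no mode is loaded and no syntax coloring is done here
    (insert-file-contents file)
    (goto-char (point-min))
    (while (re-search-forward
            "href=\"\\(http://..\\.wikipedia\\.org/[^\"]+\\)\">\\([^<]+\\)</a>"
            nil t)
      (let ((url (match-string 1)))
        ;; prepend FILE to whatever is stored under URL (nil if nothing yet)
        (puthash url (cons file (gethash url table)) table))))
  table)

;; usage, with the sample file from the earlier example:
;; (my-links-in-file "~/web/p/time_machine/tm.html"
;;                   (make-hash-table :test 'equal))
```

Passing the hash table in as an argument lets the same table accumulate links from many files.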

Traverse A Dir

Now, we find out how to traverse a dir recursively in elisp. As of Emacs 22, elisp has the function directory-files, which returns a directory's content, but no built-in function to traverse a dir recursively. However, Emacs 22 is bundled with a package “find-lisp” that contains a function “find-lisp-find-files” that does it. Here's a usage example:

(require 'find-lisp)
(find-lisp-find-files "~/web/emacs/" "\\.html$")
;; returns a list of all HTML files in dir (including files in any nested subdirs)

So, we just need to apply our function to each file in the resulting list.
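
A minimal sketch of that step, assuming a per-file function that takes the file's full path as its only argument. The name my-process-dir is made up for illustration.

```elisp
(require 'find-lisp)

(defun my-process-dir (dir fn)
  "Call FN on each HTML file under DIR, including sub-directories.
Hypothetical helper; FN takes one argument, the file's full path."
  (dolist (file (find-lisp-find-files dir "\\.html$"))
    (funcall fn file)))

;; usage: count the HTML files under a directory
;; (let ((n 0))
;;   (my-process-dir "~/web/emacs/" (lambda (_file) (setq n (1+ n))))
;;   (message "%d HTML files" n))
```

Emacs 25.1 and later also provide the built-in function directory-files-recursively, so (directory-files-recursively "~/web/emacs/" "\\.html$") could be used in place of find-lisp-find-files there.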

Pretty Print Helpers

Here are some functions to help us pretty-print our report in HTML format.

(defun wikipedia-url-to-link (url)
  "Return the URL as HTML link string.

Example:
http://en.wikipedia.org/wiki/Emacs
becomes
<a href=\"http://en.wikipedia.org/wiki/Emacs\">Emacs</a>."
  (require 'gnus-util)
  (let ((linktext url))
    ;; decode percent-encoded characters in the URL
    (setq linktext (gnus-url-unhex-string linktext nil))
    ;; keep only the part after the last slash (the article name)
    (setq linktext (car (last (split-string linktext "/"))))
    ;; show underscores as spaces
    (setq linktext (replace-regexp-in-string "_" " " linktext))
    (concat "<a href=\"" url "\">" linktext "</a>")))

(defun get-html-file-title (fName)
  "Return HTML fName's <title> tag's text."
  (let (x1 x2)
    (save-current-buffer
      (set-buffer (get-buffer-create " tmp8293"))
      (insert-file-contents fName nil nil nil t)
      (goto-char (point-min)) ; search from the top of the new content

      (setq x1 (search-forward "<title>"))
      (search-forward "</title>")
      (setq x2 (search-backward "<"))
      (buffer-substring-no-properties x1 x2))))

Putting It All Together

We now put all of the above together. Here's the final source code: elisp_link_report.el.gz, and here's the final generated report: Links to Wikipedia from XahLee.org.
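
The linked source file is not reproduced here. As a rough sketch of the last step only, the following turns the hash table into the rows of the two-column HTML table described in the problem description. The function name my-report-rows is an assumption. To stay self-contained it prints the raw URL and the bare file name; wikipedia-url-to-link and get-html-file-title could be swapped in for nicer link text.

```elisp
(defun my-report-rows (table)
  "Return TABLE's entries as a string of HTML table rows, sorted by URL.
Hypothetical sketch; TABLE maps a URL string to a list of file paths."
  (let ((rows '()))
    (maphash
     (lambda (url files)
       ;; one row per (URL, file) pair
       (dolist (file files)
         (push (format "<tr><td><a href=\"%s\">%s</a></td><td><a href=\"%s\">%s</a></td></tr>"
                       url url file (file-name-nondirectory file))
               rows)))
     table)
    ;; each row starts with the URL, so sorting the rows sorts by URL
    (mapconcat 'identity (sort rows 'string<) "\n")))

;; usage
(let ((table (make-hash-table :test 'equal)))
  (puthash "http://en.wikipedia.org/wiki/Emacs" '("/web/a.html") table)
  (concat "<table>\n" (my-report-rows table) "\n</table>"))
```

The resulting string can then be inserted into a buffer and saved as the report file.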

