Text Processing: Emacs Lisp vs Perl

(How Perl's Lies Damaged the Computing Community)

[This article was originally posted to comp.lang.perl.misc, comp.lang.python, comp.lang.lisp, comp.emacs]

Xah Lee, 2007-10-29

PS I'm cross-posting this post, “Text Processing with Emacs Lisp”, to the perl and python groups because i find it to be a little-known fact that emacs lisp's power in the area of text processing is far beyond that of Perl (or Python).

... i worked as a professional perl programmer starting in 1998. I started to study elisp as a hobby in 2005. (i started to use emacs daily in 1998.) It is only today, while i was studying elisp's file and buffer related functions, that i realized how elisp can be used as a general text processing language, and in fact is a dedicated language for this task, with powers quite beyond Perl (or Python).

This realization surprised me, because it is well known that Perl is the de facto language for text processing, and emacs lisp's use for this is almost unknown (outside of elisp developers). The surprise was exacerbated by the fact that Emacs Lisp existed before Perl, by a couple of years (GNU Emacs appeared in 1985, Perl in 1987). (Albeit Emacs Lisp is not suitable for writing general applications.)

My study of lisp as a text processing tool today reminded me of an article i read in 2000: “Ilya Regularly Expresses”, an interview with Dr Ilya Zakharevich (author of cperl-mode.el and a major contributor to the Perl language). In the article, he mentioned something about Perl's lack of text processing primitives that are in emacs, which i did not fully understand at the time. (i didn't know elisp at the time.)

The article is at: http://www.perl.com/lpt/a/2000/09/ilya.html, last accessed 2007-10-30.

Here's the relevant excerpt:

Let me also mention that classifying the text handling facilities of Perl as “extremely agile” gives me the willies. Perl's regular expressions are indeed more convenient than in other languages. However, the lack of a lot of key text-processing ingredients makes Perl solutions for many averagely complicated tasks either extremely slow, or not easier to maintain than solutions in other languages (and in some cases both).

I wrote a (heuristic-driven) Perlish syntax parser and transformer in Emacs Lisp, and though Perl as a language is incomparably friendlier than Lisps, I would not be even able of thinking about rewriting this tool in Perl: there are just not enough text-handling primitives hardwired into Perl. I will need to code all these primitives first. And having these primitives coded in Perl, the solution would turn out to be (possibly) hundreds times slower than the built-in Emacs operations.

My current conjecture on why people classify Perl as an agile text-handler (in addition to obvious traits of false advertisements) is that most of the problems to handle are more or less trivial (“system maintenance”-type problems). For such problems Perl indeed shines. But between having simple solutions for simple problems and having it possible to solve complicated problems, there is a principle of having moderately complicated solutions for moderately complicated problems. There is no reason for Perl to be not capable of satisfying this requirement, but currently Perl needs improvement in this regard.


2008-01

[The following is an excerpt from a post on comp.lang.lisp]

I'm a perl expert, programming it in industry daily from 1998 to 2002. I started to learn elisp casually in 2005. (i've been a daily emacs user since 1997.) To my surprise, for tasks of text processing (and sys admin), elisp's power, ease, and convenience are almost an order of magnitude above Perl's. This is particularly painful for me to realize because, even being a Perl hater from the very beginning, fully aware of its rampant lies, i was nevertheless strongly deceived by the widespread understanding that Perl IS the most suitable language for text processing tasks. (Conversely, due to emacs lispers' lack of emacs lisp advocacy, partly due to the suppression by the Scheme and Common Lisp factions, emacs's power as a text-processing computer language system is basically unknown to programmers at large. Among programmers who have actually heard of lisp, emacs lisp is often understood as “just an extension niche language”, usually with demeaning connotations.) (See also: Modernization of Emacs Lisp)

Besides industrial programming, my personal use of perl is primarily text processing. So, i was quite pissed by the extent of damage perl's lies caused me in this regard. I'm currently, casually, doing all my text processing in emacs instead of Perl or Python. (e.g. typically a function that processes tens to tens of thousands of files.)


2008-01-29

Tim X wrote:

Personally, I'd use something like Perl or one of the many other scripting languages that are ideal for (and largely designed for) this sort of problem.

An interesting thing about wanting to use elisp to open a large file, for me, is this:

Recently i discovered that emacs lisp is probably the most powerful language for processing text, far more so than Perl. This is because in emacs there's the “buffer” infrastructure, which allows one to move a point back and forth, delete, insert, regex search, etc, with literally a few thousand built-in text-processing functions to help with the task.

In perl or python, by contrast, typically one either reads the file one line at a time and processes it one line at a time, or reads the whole file in one shot but basically still processes it one line at a time. The gist is that any function you might want to apply to the text is only applied to one line at a time, and it can't see what's before or after the line. (One could write it so that it “buffers” the neighboring lines, but that's rather unusual and involves more code. Alternatively, one could read in one char at a time and move the index back and forth, but that loses all the regex power, and dealing with files as raw bytes and file pointers is extremely painful.)
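The “buffering the neighboring lines” workaround mentioned above might look roughly like this in Python. It is only a minimal sketch: the function name `lines_with_context`, its `before` parameter, and the sample log lines are all made up for illustration.

```python
from collections import deque


def lines_with_context(lines, before=1):
    """Yield (previous_lines, current_line) pairs.

    `lines` is any iterable of strings, such as an open file object.
    `before` (a hypothetical parameter) is how many preceding lines to keep.
    """
    window = deque(maxlen=before)  # old lines fall off automatically
    for line in lines:
        yield list(window), line
        window.append(line)


# Made-up sample data standing in for a file read line by line.
sample = ["GET /a 200", "GET /b 500", "GET /c 200"]

# Report each failing request together with the request just before it.
report = [(prev, cur)
          for prev, cur in lines_with_context(sample)
          if cur.endswith("500")]
print(report)
```

Even this small amount of lookbehind needs its own helper, and looking ahead would need more bookkeeping still, which is the extra code the paragraph above complains about.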

The problem with processing one line at a time is that, for many kinds of data, the file is a tree structure (such as HTML/XML or Mathematica source code). In a tree structure such as XML, there is a root tag that opens at the beginning of the file and closes at the end, and most tree branches span multiple lines, so processing it line by line is almost useless. So, in perl, the typical solution is to read in the whole file and apply regex to the whole content. This really puts stress on the regex, and basically the regex won't work unless the processing needed is really simple.
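To make the difference concrete, here is a small Python sketch that applies the same regex once per line and once to the whole content. The HTML snippet and the pattern are invented for the example.

```python
import re

# Made-up input: one element that spans several lines.
html = """<div class="note">
  first line
  second line
</div>
<p>unrelated</p>
"""

pattern = re.compile(r'<div class="note">(.*?)</div>', re.DOTALL)

# Line at a time: no single line holds both the opening and closing tag,
# so the pattern never matches.
per_line_hits = [m.group(1)
                 for line in html.splitlines()
                 for m in pattern.finditer(line)]

# Whole content: re.DOTALL lets "." cross newlines, so the match can
# span several lines.
whole_hits = [m.group(1) for m in pattern.finditer(html)]

print(per_line_hits)
print(whole_hits)
```

The non-greedy `(.*?)` stops at the first `</div>` it sees, so a `div` nested inside the note would already break this pattern. That is the kind of stress on the regex described above.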

An alternative solution for processing a tree-structured file such as XML is to use a proper parser (e.g. javascript/DOM, or a library/module). However, when using a parser, the nature of the programming ceases to be text processing and becomes structural manipulation. In general, the program becomes more complex and difficult. Also, if one uses an XML parser and DOM, the original formatting of the file will typically be lost as well. (i.e. your original line endings, indentation, and spacing may be gone or changed)
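As a small illustration of the formatting issue, the sketch below makes the same one-word change twice: once through Python's standard `xml.etree.ElementTree` parser and once as plain text. The sample document is made up, and how much formatting is lost depends on the parser and serializer used.

```python
import xml.etree.ElementTree as ET

# Made-up input with some idiosyncratic formatting inside the tag
# (two spaces before the attribute, single-quoted value).
source = "<root>\n  <item  name='a'>x</item>\n</root>"

# Route 1: parse, change the text of <item>, serialize again.
tree = ET.fromstring(source)
tree.find("item").text = "y"
via_parser = ET.tostring(tree, encoding="unicode")

# Route 2: treat the file as text and touch only the characters we mean to.
via_text = source.replace(">x<", ">y<")

print(via_parser)
print(via_text)
```

With this particular library the whitespace between elements happens to survive, but the spacing and quoting inside the tag are rewritten by the serializer. The text route leaves every other byte of the file as it was.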

This is a major reason why i think emacs lisp is far more versatile: it can read the XML into emacs's buffer infrastructure, and then the programmer can move a point back and forth, freely using regex to search or replace text. For complex XML processing such as tree transformation (e.g. XSLT etc), an XML/DOM parser/model is still more suitable, but for most simple manipulation (such as processing HTML files), using elisp's buffer and treating the file as text is far easier and more flexible. Also, if one so wishes, she can use an XML/DOM parser/model written in elisp, just as in other languages.
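The “move a point and search from there” style can be imitated in Python, which may help readers who don't know elisp. This is only a rough analogy, not how Emacs implements buffers: the `point` variable is a plain integer index, and the task (strip `<b>` tags everywhere except inside `<pre>` blocks) and sample text are invented. In Emacs the corresponding steps would use functions such as `re-search-forward` and `replace-match` on a buffer.

```python
import re

# Made-up input: <b> tags should be removed, except inside <pre>...</pre>.
text = ("<p><b>one</b></p>\n"
        "<pre>\nkeep <b>this</b>\n</pre>\n"
        "<p><b>two</b></p>\n")

open_pre = re.compile(r"<pre>")
close_pre = re.compile(r"</pre>")
bold = re.compile(r"</?b>")

point = 0   # plays the role of the buffer's point
out = []

while True:
    m = open_pre.search(text, point)       # search forward from point
    if m is None:
        # No more <pre> blocks: clean the rest and stop.
        out.append(bold.sub("", text[point:]))
        break
    # Clean the stretch between point and the start of the <pre> block.
    out.append(bold.sub("", text[point:m.start()]))
    # Find the matching close, searching forward from the end of <pre>.
    end = close_pre.search(text, m.end())
    stop = end.end() if end else len(text)
    out.append(text[m.start():stop])       # copy the block untouched
    point = stop                           # move point past the block

result = "".join(out)
print(result)
```

The decision about each `<b>` depends on where the point is relative to the surrounding `<pre>` block, which a single global regex substitution can't easily express.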

So, last year i switched all new text processing tasks from Perl to elisp.

But now i have a problem, which i “discovered” this week. What to do when the file is huge? Normally, one can still just process huge files, since these days memory comes in a few gigs. But in my particular case, my file happens to be 0.5 gig, so large that i couldn't even open it in emacs (presumably because i need a 64-bit OS/hardware. Thanks). So, given the situation, i'm thinking, perhaps there is a way to use emacs lisp to read the file line by line, just as in perl or python. (The file is just an apache log file and can be processed line by line, can be split, can be fed to sed/awk/grep with pipes. The reason i want to open it in emacs and process it using elisp is more just an exploration, not really a practical need.)
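For comparison, this is roughly what the line-by-line style looks like in Python for a log file that doesn't fit comfortably in memory. It is a sketch under assumptions: the function name `count_status` is made up, the sample lines are invented, and the status code is assumed to be the ninth whitespace-separated field, as in the common Apache log format.

```python
import io
from collections import Counter


def count_status(lines):
    """Count HTTP status codes, reading one line at a time.

    `lines` is any iterable of strings; an open file object works,
    so only the current line needs to be held in memory.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 9:          # skip blank or malformed lines
            counts[fields[8]] += 1    # assumed position of the status code
    return counts


# Made-up sample standing in for a large log file on disk.
sample_log = io.StringIO(
    '127.0.0.1 - - [10/Oct/2007:13:55:36 -0700] "GET /a.html HTTP/1.0" 200 2326\n'
    '127.0.0.1 - - [10/Oct/2007:13:55:37 -0700] "GET /b.html HTTP/1.0" 404 209\n'
    '127.0.0.1 - - [10/Oct/2007:13:55:38 -0700] "GET /c.html HTTP/1.0" 200 512\n'
)

counts = count_status(sample_log)
print(counts)
```

With a real file one would pass the object returned by `open(path)` instead of the `StringIO` stand-in. On the elisp side, one possible starting point (not shown here) is that `insert-file-contents` accepts optional begin and end positions, so a file could in principle be pulled into a buffer one chunk at a time.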


Page created: 2007-10.
© 2007 by Xah Lee.