Problem of Calling Unix grep in Emacs

This page describes problems of calling unix grep in emacs.

The unix grep util is quite useful. In emacs, it's even better, because emacs has a command named grep, and others, that act as wrappers around unix grep, with the advantage that the found text in the output is colored and file names are linked. (➲ Emacs: Searching for Text in Files (grep, find))

However, unix grep has many problems. Calling it inside emacs makes them worse, whether directly by the shell-command command or indirectly by {grep, rgrep, lgrep, …}.

On Windows, External Program Problem

The first problem is that it's an external util. On Windows, this is a major pain. The user has to install Cygwin or something similar, and then every call from emacs goes thru several layers: emacs, Cygwin, Windows. Unicode, or text that contains quote chars or escapes, almost always gets screwed up along the way.

Unix Shell Quote Escape Pain

Today, i want to grep with this regex height="[0-9]+" /> for html image tags.

When calling emacs's grep command, i tried:

grep -ie -nH 'height\=\"[0\-9]\+\" \/\>' *html

and many variations with the backslash in different places, double backslash, single/double quotes. No go. Sometimes the error is about cygwin detecting DOS style slash. Sometimes it silently creates a file of 0 length named ' in your directory, due to your screwed escapes. (i wasn't aware till days or months later.)
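
For comparison, here is a minimal sketch of the same search done by a Python script, where the regex is written once as a raw string and never passes thru a shell. The function name grep_files and its parameters are made up for this illustration, and the files are assumed to be UTF-8.

```python
import re
from pathlib import Path

# The regex is a Python raw string. No shell is involved, so there is
# only one layer of quoting, and the quote chars need no backslashes.
PATTERN = re.compile(r'height="[0-9]+" />')


def grep_files(directory, glob="*html", pattern=PATTERN):
    """Return (file name, line number, line) for every matching line."""
    hits = []
    for path in sorted(Path(directory).glob(glob)):
        if not path.is_file():
            continue
        # Assumption: the files are UTF-8. Bad bytes are replaced, not fatal.
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((path.name, lineno, line))
    return hits


if __name__ == "__main__":
    for name, lineno, line in grep_files("."):
        # Same "file:line:text" shape as the output of grep -nH.
        print(f"{name}:{lineno}:{line}")
```

Run from emacs or anywhere else, the pattern that reaches the regex engine is exactly the one in the file.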

Unix Syntax Problem

The syntax of unix utils is incomprehensible to non-unix users, but emacs depends on them for basic features such as searching text in a dir. Users not familiar with the cryptic shell syntax won't be able to use these features.

For example, the emacs grep command prompts this: grep -nH -e ▮. For those who want to search for a string but don't have unix experience, how are they supposed to understand what it means? They will need to read man pages, which are again an external command. The man pages assume you are a unix admin; they are not a piece of documentation independent of unix.

It's not just the grep command for searching text. To list files in emacs dired, the only command is find-dired. Both depend on the unix find/xargs commands, with their cryptic syntax. (➲ Inconsistency of Emacs Text-Searching Features)

Grep Not Robust for Unicode String

Unix grep is not very robust for processing Unicode texts. On Windows with Cygwin, the char encoding in the stream gets messed up thru the various layers.

My search string usually has Unicode chars. (e.g. Sample Unicode Characters.) For example, grep fails when searching for “│” (U+2502). This is calling Cygwin grep from emacs on Windows. It's too complex to figure out exactly why it fails.

With Unicode, you have to deal with the “locale” environment variables. On Windows, there's a complex interplay of environment variables among {emacs, emacs's inferior shells, Cygwin, Windows} and/or translation of the meaning of locale between unix and Windows. (➲ Emacs and Microsoft Windows Tips).
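
A script can sidestep the locale question by stating the file encoding itself. The following is a rough sketch, assuming the pages are stored as UTF-8; the name lines_containing and its parameters are invented for the example.

```python
import sys
from pathlib import Path

# U+2502 BOX DRAWINGS LIGHT VERTICAL, written as an escape so that the
# script file itself stays plain ASCII.
NEEDLE = "\u2502"


def lines_containing(directory, needle=NEEDLE, glob="*html", encoding="utf-8"):
    """Yield (file name, line) for each line that contains needle."""
    for path in sorted(Path(directory).glob(glob)):
        if not path.is_file():
            continue
        # The encoding is named here, not taken from the locale.
        with path.open(encoding=encoding, errors="replace") as f:
            for line in f:
                if needle in line:
                    yield path.name, line.rstrip("\n")


if __name__ == "__main__":
    # Printing is the one place the environment still matters, so the
    # output encoding is set too (needs Python 3.7 or later).
    sys.stdout.reconfigure(encoding="utf-8")
    for name, line in lines_containing("."):
        print(f"{name}:{line}")
```

The search string never travels thru a command line, so none of the layers between emacs and the external program gets a chance to re-encode it.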

Problem with Long Search String

Often, my search string is long, containing 300 chars or more. (e.g. a snippet of HTML that contains JavaScript and spans multiple lines.) You could put your search string in a file and pass that to grep, but it is not convenient. Here's an example of a string i need to search today:

<div class="chtk"><script>ch_client="polyglut";ch_width=550;ch_height=90;ch_type="mpu";ch_sid="Chitika Default";ch_backfill=1;ch_color_site_link="#00C";ch_color_title="#00C";ch_color_border="#FFF";ch_color_text="#000";ch_color_bg="#FFF";</script><script src="http://scripts.chitika.net/eminimalls/amm.js"></script></div>

(btw, yes, i need this whole string. Because sometimes i've made changes and might have forgotten to make the change across all pages on my site. If some of them got an extra space, an extra newline, or some script tag has extra attributes such as type="text/javascript", then my various other scripts that do various other things will fail. (and, tech geekers, please don't start to tell me i need some content management system or other fancy tech. My site works fine and i am more efficient, more powerful, more flexible, in managing the 5k pages with emacs, unix grep problem or not, than with any fancy content management system.))
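
In a script, a long multi-line search string can simply sit in the file as a literal. A small sketch of that idea follows; the snippet below is a shortened stand-in, and split_by_snippet is a made-up name. It sorts the pages into those that contain the exact text and those that do not, and assumes the pages are UTF-8.

```python
from pathlib import Path

# Stand-in for the long snippet: paste the real text between the triple
# quotes. A raw string keeps backslashes and quote chars exactly as typed.
SNIPPET = r'''<div class="chtk"><script>
ch_client="polyglut";
</script></div>'''


def split_by_snippet(directory, snippet=SNIPPET, glob="*html"):
    """Return (names of files containing snippet, names of files lacking it)."""
    has, lacks = [], []
    for path in sorted(Path(directory).glob(glob)):
        if not path.is_file():
            continue
        # Read the whole file, so the snippet may span several lines.
        text = path.read_text(encoding="utf-8", errors="replace")
        # Plain substring test: no regex, so no char in the snippet is special.
        (has if snippet in text else lacks).append(path.name)
    return has, lacks


if __name__ == "__main__":
    has, lacks = split_by_snippet(".")
    print("missing the exact snippet:", lacks)
```

A page with one extra space or newline inside the snippet lands in the second list, which is the case that needs catching.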

Grep Not Flexible for Specifying Files in Directories

grep is not very flexible for working with all files in a directory. There's the -r option, but then you can't simply give a file pattern (e.g. *html) on the command line. (It is possible, with GNU grep's --include option, shell file globs, or “find … -exec”/xargs, but i find the trial-and-error man page loop with unix tools quite frustrating.)

Sometimes you need to work on a list of files, sometimes by a pattern, sometimes you want to exclude some files by list or by pattern, sometimes only the first 2 levels, or a combination of the above in a specific order. Some unix tools provide these options, sometimes by combination of tools (e.g. find/xargs), but their order and syntax is complex and tool specific. With a script in Perl, Python, elisp, it's much easier to control.
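
As an illustration of how such control looks in a script, here is a sketch that picks files by an include pattern, a list of exclude patterns, and a depth limit. The function select_files and its parameter names are hypothetical, not from any library.

```python
import fnmatch
import os


def select_files(root, include="*.html", exclude=(), max_depth=None):
    """Return paths under root whose names match include and no exclude pattern.

    max_depth=1 means only files directly in root, max_depth=2 adds one
    level of subdirectories, and so on. None means no limit.
    """
    root = os.path.abspath(root)
    selected = []
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 1 if rel == "." else rel.count(os.sep) + 2
        dirnames.sort()  # visit subdirectories in a stable order
        if max_depth is not None and depth >= max_depth:
            dirnames[:] = []  # tell os.walk not to descend any further
        for name in sorted(filenames):
            if not fnmatch.fnmatch(name, include):
                continue
            if any(fnmatch.fnmatch(name, pat) for pat in exclude):
                continue
            selected.append(os.path.join(dirpath, name))
    return selected


if __name__ == "__main__":
    # Example: html files in the first 2 levels, skipping temp files.
    for path in select_files(".", exclude=["tmp_*"], max_depth=2):
        print(path)
```

Each rule is one line of ordinary code, so changing the order of the rules, or adding a new one, does not require learning another tool's option syntax.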

Too Many Incompatible Versions of Grep

There are too many versions and varieties of grep. The primary 2 are BSD vs GNU. Mac OS X comes with BSD versions, but some of its utils are GNU versions. This makes it very painful. Linuxes typically come with GNU versions. The different versions accept different options. Also, GNU grep, for example, supports a variety of regex syntaxes (“--basic-regexp”, “--extended-regexp”, “--perl-regexp”.) It's too painful to figure them out and remember their details.

Unix grep and the associated tools (sort, wc, uniq, pipe, sed, awk, …) are not flexible. When your need is slightly more complex, unix shell tools can't handle it. For example, suppose i need to find all occurrences of html “img” tag that are not wrapped by a <div> tag. This is impossible with unix tools. (extending the limit of unix tools is how Perl was born in 1987.)
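
To give an idea of what the “img” example takes in a script, here is a sketch using the HTML parser from Python's standard library. It reads “not wrapped by a <div>” as “not inside any <div>, at any nesting level”, which is an assumption, and the names BareImgFinder and find_bare_imgs are invented for the example.

```python
from html.parser import HTMLParser


class BareImgFinder(HTMLParser):
    """Collect the (line, column) position of each <img> outside every <div>."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0   # how many <div> elements are currently open
        self.bare_imgs = []  # positions of <img> tags seen at depth 0

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "div":
            self.div_depth += 1
        elif tag == "img" and self.div_depth == 0:
            self.bare_imgs.append(self.getpos())

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1


def find_bare_imgs(html_text):
    """Return positions of <img> tags that are not inside any <div>."""
    finder = BareImgFinder()
    finder.feed(html_text)
    finder.close()
    return finder.bare_imgs
```

The parser keeps track of which <div> tags are open, which is exactly the state a line-by-line tool like grep has no place to keep.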

Advantage of Proper Script

When writing a script in Perl or Python, you can always write it so the script works as a command line script that takes options like unix command line tools. Or, you can leave the script without a command line interface. When you need to run the script, you open it with an editor, modify the parameters, save, then run it. (See: Python: Find & Replace, Perl: Find & Replace.)

I always prefer the latter. Because, that way i can edit the options much more comfortably, in an editor with full view instead of the command line. I can also view whatever doc the script has in the header, instead of doing some confusing “-help” or “-h”, “--help” or “man …” in the command line. And with emacs, i can run the script by a single key-press, and many other conveniences. Basically, a command line is nice if you are using other's code, because it's a blackbox with a (somewhat) standardized command line interface. But when i write my own custom text processing scripts, i prefer not to add a command line interface. Just use it inside emacs.

So, with my own script for grep (be it elisp or Perl or Python), i can make the script do exactly what i need and work everywhere with emacs.


2 months ago (2011 Feb) i wrote How to Write grep in Emacs Lisp. In it, i documented a few problems of calling unix grep within emacs. Today i ran into a problem again. Here's a more concrete example.

In my vocabulary page Wordy English — the Making of Belles-Lettres, i use the Unicode BOX DRAWINGS LIGHT VERTICAL “│” as a temp marker for processing the word list. Today i need to grep pages containing that character.

Calling 【Meta+x grep】 in emacs with grep -inH -e "│" *html returns an error:

-*- mode: grep; default-directory: "c:/Users/xah/web/xahlee_org/emacs/" -*-
Grep started at Tue Apr 05 15:37:47

grep "│" *html
warning: extra args ignored after 'grep "│\'

Grep finished with no matches found at Tue Apr 05 15:37:47

Starting shell in emacs (which runs Microsoft cmd.exe in Windows Vista) doesn't work either. (it works fine when grepping an ASCII string) Here's a session log:

Microsoft Windows [Version 6.0.6002]
Copyright (c) 2006 Microsoft Corporation.  All rights reserved.

c:\Users\xah\web\xahlee_org\emacs>grep "│" *html
grep "â\224? *html

It got stuck there. 【Ctrl+c Ctrl+c】 doesn't get out. I had to kill the buffer.

Calling msys-shell works. (msys-shell is bundled with ErgoEmacs. It calls bash in MinGW, which is a subset of Cygwin port.) Here's a log:

sh-3.2$ grep "│" *html
antonymous_synonyms.html:<li> cry, decry │ you can cry, as in crying out loud, but you can also decry, by crying out loud </li>
antonymous_synonyms.html:<li> linear, rectilinear │ linear algebra, rectilinear motion. Rectilinear is the linearness of motion.</li>
…

Calling it in Cygwin Bash running inside Windows Console also works.

So, this means, the problem isn't grep not understanding Unicode. It must be something that got screwed up when emacs talks to Cygwin. Though, what exactly is the problem? Well, i'm not about to spend a few hours to find out.

In PowerShell, it also works. e.g. with this command select-string -path *.html -pattern "│". However, calling PowerShell thru emacs does not work. (➲ Emacs PowerShell Modes)

Here's my system setup:

O, The Complexity & Tedium of Software Engineering.

Addendum: Adding the option -P also worked. e.g. call emacs “grep” command, then give grep -inH -e -P "│" *html. Thanks to “blandest” (gnu.emacs.help).
