Xah Lee, 2007-08, …, 2011-12-06
This page is a tutorial on emacs regex.
Emacs's regex is not based on Perl or Python's, but is very similar. In emacs regex, the parenthesis characters () are
literal. If you want to capture a pattern, you need to escape the
paren like this: \(myPattern\).
Here are some common patterns:
| Pattern | Matches |
|---|---|
| . | Any single character |
| \. | One period |
| [0-9]+ | Sequence of digits |
| [A-Za-z]+ | Sequence of letters |
| [-A-Za-z0-9]+ | Sequence of letter, digit, hyphen |
| [_A-Za-z0-9]+ | Sequence of letter, digit, underscore |
| [-_A-Za-z0-9]+ | Sequence of letter, digit, hyphen, underscore |
| [[:ascii:]]+ | Sequence of ASCII chars |
| [[:nonascii:]]+ | Sequence of non-ASCII chars (e.g. Unicode chars) |
| [\t\n ]+ | Sequence of {tab, newline, space} |
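
To see the character-set patterns above in action, here is a minimal sketch that collects every match of a pattern in a string. The helper name `my-all-matches` and the sample strings are made up for illustration; only `string-match`, `match-string`, and `match-end` are built-in. Backslashes are not needed in these two patterns, so nothing has to be doubled.

```elisp
;; `my-all-matches' is a hypothetical helper, not part of emacs.
;; It assumes REGEX never matches the empty string (true for the
;; `+' patterns in the table), otherwise the loop would not advance.
(defun my-all-matches (regex s)
  "Return a list of all non-overlapping matches of REGEX in string S."
  (let ((start 0)
        (result '()))
    (while (string-match regex s start)
      (push (match-string 0 s) result)
      (setq start (match-end 0)))
    (nreverse result)))

(my-all-matches "[0-9]+" "tel 555 1234")             ; => ("555" "1234")
(my-all-matches "[-_A-Za-z0-9]+" "foo-bar baz_2!")   ; => ("foo-bar" "baz_2")
```
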

| Pattern | Matches |
|---|---|
| "\([^"]+?\)" | capture text between double quotes (non-greedy) |
| “\([^”]+?\)” | capture text between curly double quotes (non-greedy; Unicode char) |
| (\([^)]+?\)) | capture text between parentheses (non-greedy) |
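
The capture patterns in this table can be tried from elisp roughly as follows. This is only a sketch: `my-capture` is a hypothetical helper and the sample strings are invented. Because the patterns are written as elisp strings here, each backslash is doubled and each double quote inside the string is escaped.

```elisp
;; `my-capture' is a hypothetical helper, not part of emacs.
(defun my-capture (regex s)
  "Return capture group 1 of the first match of REGEX in S, or nil."
  (when (string-match regex s)
    (match-string 1 s)))

;; Regex seen by the engine: "\([^"]+?\)"
(my-capture "\"\\([^\"]+?\\)\"" "say \"hello\" now")   ; => "hello"

;; Regex seen by the engine: (\([^)]+?\))
;; The bare ( and ) are literal; only \( and \) form the group.
(my-capture "(\\([^)]+?\\))" "f(x y) + g(z)")           ; => "x y"
```
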

| Quantifier | Meaning |
|---|---|
| + | means match previous pattern 1 or more times |
| * | means match previous pattern 0 or more times |
| ? | means match previous pattern 0 or 1 time |
| +? | means match previous pattern 1 or more times, but with minimal match (aka non-greedy) |
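
The difference between greedy and minimal matching is easiest to see side by side. The sketch below uses a made-up helper, `my-first-match`, and an invented sample string.

```elisp
;; `my-first-match' is a hypothetical helper, not part of emacs.
(defun my-first-match (regex s)
  "Return the text of the first match of REGEX in string S, or nil."
  (when (string-match regex s)
    (match-string 0 s)))

(my-first-match "<.+>" "<b>bold</b>")    ; greedy     => "<b>bold</b>"
(my-first-match "<.+?>" "<b>bold</b>")   ; non-greedy => "<b>"
(my-first-match "ab*" "a")               ; * accepts zero b's    => "a"
(my-first-match "ab?" "abb")             ; ? takes at most one b => "ab"
```
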

| Anchor | Matches |
|---|---|
| ^… | Beginning of {line, string, buffer} |
| …$ | End of {line, string, buffer} |
| \`… | Beginning of {string, buffer} |
| …\' | End of {string, buffer} |
| \b | word boundary marker |
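
The distinction between the line anchors (^ and $) and the string/buffer anchors (\` and \') shows up as soon as the text contains a newline. The following sketch wraps four `string-match` calls in a hypothetical function, `my-anchor-demo`, on an invented two-line string; `string-match` returns the index where the match starts, or nil.

```elisp
;; `my-anchor-demo' is a hypothetical name used only for this example.
(defun my-anchor-demo ()
  "Return match positions for four anchored patterns on a two-line string."
  (let ((s "one\ntwo"))
    (list (string-match "^two" s)       ; => 4, ^ also matches after a newline
          (string-match "\\`two" s)     ; => nil, \` matches only at string start
          (string-match "one$" s)       ; => 0, $ also matches before a newline
          (string-match "one\\'" s))))  ; => nil, \' matches only at string end

(my-anchor-demo)   ; => (4 nil 0 nil)
```
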
If you are familiar with Perl's regex, here are the main practical differences.

- Parentheses are literal. Use \(pattern\) to capture instead.
- Avoid [A-z], as its meaning is currently ambiguous. Use [A-Za-z].
- Perl's shorthand classes {\d, \w, \s} do not carry over as-is (\d, for example, is not a digit class). Use [[:digit:]], [[:word:]], [[:space:]] instead. For example, Perl's \d+ for one or more digits is emacs's [[:digit:]]+. Also, the meaning of a char class may depend on the current major mode's syntax table. For example, which chars count as whitespace in [[:space:]] depends on how whitespace is defined in the syntax table of the current major mode. Syntax tables are hard to work with, so it is best to put the chars you want explicitly in your regex.
- In an interactive prompt, you cannot type the escapes \t and \n. To enter a literal Tab, press 【Ctrl+q Tab】. To match a newline, press 【Ctrl+q Ctrl+j】. (For explanation, see: Emacs's Key Notations Explained (/r, ^M, C-m, RET, <return>, M-, meta)) In an elisp string, you can use {\t, \n}, with no need to double the backslash.
- Inside emacs, a newline is always \n. You do not need to worry whether the file has Unix, Windows, or Mac style line endings. Also, if you want to change the line ending convention of a file, call set-buffer-file-coding-system. Do NOT manually find and replace newline chars. (See: Emacs: Newline Representations ^M ^J ^L.)

Emacs has an interactive regex mode that shows matches as you type. To enter it, call regexp-builder.
Alternatively, you can call query-replace-regexp to test your pattern. I prefer this.
Regex is also used in elisp code, much as it is in Perl.
To test a regex in your elisp code, you can open an empty file, place the regex function at the top and the text you want to match below it, like this:
(search-forward-regexp "yourRegex")
whatever text here
Then, put your cursor to the right of the closing parenthesis and call eval-last-sexp 【Ctrl+x Ctrl+e】. If your regex matches, the cursor moves to the end of the matched text. If you get a lisp error saying search failed, then your regex didn't match. If you get a lisp syntax error, then you probably screwed up on the backslashes.
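
The same check can also be scripted instead of run by hand. The following is a sketch only: `my-buffer-search` is a hypothetical helper that inserts the text into a temporary buffer and searches from the top. It passes `t` as the third argument of `search-forward-regexp` so that a failed search returns nil rather than signaling the "search failed" error.

```elisp
;; `my-buffer-search' is a hypothetical helper, not part of emacs.
(defun my-buffer-search (regex text)
  "Return the first match of REGEX in TEXT, or nil if there is none."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    ;; Second arg nil: no search bound.  Third arg t: no error on failure.
    (when (search-forward-regexp regex nil t)
      (match-string 0))))

(my-buffer-search "[0-9]+" "abc 42 def")   ; => "42"
(my-buffer-search "xyz" "abc 42 def")      ; => nil
```
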
In a lisp function that takes a regex string (e.g. search-forward-regexp), you need to double each backslash. This is because the lisp reader processes the string first: inside an elisp string, a literal backslash must itself be written as two backslashes. Only the resulting string, with single backslashes, is passed to emacs's regex engine.
For example, suppose you have this text:
Sin[x] + Sin[y]
and you need to capture the bracketed argument, [x] or [y]. If you are calling a regex command such as query-replace-regexp, you can input in the prompt:
\(\[[a-z]\]\)
But in lisp code, you'll need to double the backslashes, like this:
(search-forward-regexp "\\(\\[[a-z]\\]\\)")
The regex engine really just got:
\(\[[a-z]\]\)
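
As a quick check of the doubling rule, the sketch below runs the same pattern with `string-match` on the sample text. The function name `my-bracket-capture` is made up for illustration. Note that group 1 includes the brackets, because the \( \) pair encloses them.

```elisp
;; `my-bracket-capture' is a hypothetical helper, not part of emacs.
(defun my-bracket-capture (text)
  "Return group 1 of the first bracketed letter in TEXT, or nil."
  (when (string-match "\\(\\[[a-z]\\]\\)" text)
    (match-string 1 text)))

(my-bracket-capture "Sin[x] + Sin[y]")   ; => "[x]"

;; The elisp string "\\(" holds just two characters, \ and (,
;; which is what the regex engine receives.
(length "\\(")                           ; => 2
```
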
The C-style escapes for newline (line feed) \n and tab \t must not have a doubled backslash in an elisp string, whether the string is a regex or not.
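
A small sketch of that last point, using the [\t\n ]+ pattern from the table of common patterns. `my-split-fields` is a hypothetical name and the input string is invented. The lisp reader turns \t and \n into real tab and newline characters before the regex engine sees the string, so a single backslash is correct here.

```elisp
;; `my-split-fields' is a hypothetical helper, not part of emacs.
(defun my-split-fields (s)
  "Split string S on runs of tab, newline, or space."
  ;; Third arg t drops empty strings from the result.
  (split-string s "[\t\n ]+" t))

(my-split-fields "a\tb\nc  d")   ; => ("a" "b" "c" "d")

(length "\n")    ; => 1, a single newline character
(length "\\n")   ; => 2, a backslash followed by the letter n
```
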