A regex pattern is a string used as a pattern to match other strings. For example, 'a+' matches repeated occurrences of the letter "a", such as in "aaahh!". For another example, r'\w+@[A-Za-z]+\.com' is a pattern that matches a common email form xyz@somewhere.com, useful for example when you want to replace all email addresses in a file with a dummy string.
In the example 'a+', the plus char “+” means “one or more” of the previous char. In regex, many chars have special meanings. Other chars that have special meaning include: [ ] ( ) \ . ^ $ * ? and more.
This page documents the meaning of all special characters and special constructs.
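As a quick illustration of the two introductory patterns, here is a minimal sketch. The sample text and the replacement string 'DUMMY' are made up for the example, and the email pattern assumes addresses of the simple form name@host.com.

```python
import re

# 'a+' matches one or more consecutive "a" characters.
first_run = re.search('a+', 'aaahh!').group(0)

# Hypothetical sample text; the pattern only covers the simple name@host.com form.
text = 'Write to joe@example.com or amy@site.com today.'
masked = re.sub(r'\w+@[A-Za-z]+\.com', 'DUMMY', text)

print(first_run)  # aaa
print(masked)     # Write to DUMMY or DUMMY today.
```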
\d
\D
\w
With the LOCALE flag set, \w will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale.
NOTE TO DOC WRITERS: need an explicit example here about locale, illustrating exactly where and how to set the locale and how it affects the code. Also, possibly include a link to the doc about locale.
If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.
NOTE TO DOC WRITERS: need an explicit example here.
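One possible illustration of the Unicode behavior of \w, assuming Python 3, where str patterns are Unicode-aware by default and the re.ASCII flag switches back to the plain [a-zA-Z0-9_] set. The sample string is invented for the example. LOCALE is not shown here, since its effect depends on the locale configured on the machine.

```python
import re

# Hypothetical sample: "naive" with i-diaeresis and "cafe" with e-acute.
sample = 'na\u00efve caf\u00e9'

# re.UNICODE is the default for str patterns in Python 3; it is spelled out for clarity.
unicode_words = re.findall(r'\w+', sample, re.UNICODE)

# re.ASCII restricts \w to [a-zA-Z0-9_], so the accented letters split the words.
ascii_words = re.findall(r'\w+', sample, re.ASCII)

print(unicode_words)
print(ascii_words)  # ['na', 've', 'caf']
```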
\W
\s
\S
[]
[] can also be used to represent the set of characters not listed inside it, by placing a caret '^' at the beginning. For example, '[^aeiou]' will match any character that is not a lowercase vowel.
A range of characters can be specified with a hyphen. For example, '[0-9]' matches any digit, '[a-z]' matches any lowercase letter, and '[A-Z]' matches any capital letter. Ranges can be combined, for example '[A-Za-z]' to mean any capital or lowercase letter.
Character classes such as '\w' or '\s' can also be used inside square brackets. For example, '[\w,. ]' will match any alphanumeric char or underscore, or one of comma, period, space.
Most characters that have special meanings in regex lose them when used inside []. For example, '[b+]' does not mean one or more b; it just matches 'b' or '+'.
To include characters such as bracket ']' or dash '-' or backslash '\' in a set, one can add a backslash before the char. For example, r'[\\a\-]' will match '\', 'a', or '-'. For historical reasons, if ']' or '-' appears as the first char in the bracket, it is treated literally. For example, '[]b]' is legal syntax. It will match ']' or 'b'.
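The character-set rules can be checked with a short sketch like the following. The input strings are arbitrary samples chosen for the example.

```python
import re

# Negated set: everything except lowercase vowels.
non_vowels = re.findall('[^aeiou]', 'audio')

# Combined ranges: any ASCII letter.
letters = re.findall('[A-Za-z]', 'R2-D2 x')

# A class shorthand inside a set, plus literal comma, period, and space.
mixed = re.findall(r'[\w,. ]', 'a, b!')

# '+' is literal inside a set.
b_or_plus = re.findall('[b+]', 'a+b')

# ']' placed first in the set is literal.
bracket_or_b = re.findall('[]b]', 'a]b')

# Backslash-escaped members: backslash, 'a', or '-'.
escaped = re.findall(r'[\\a\-]', r'a-\z')

print(non_vowels)    # ['d']
print(letters)       # ['R', 'D', 'x']
print(mixed)         # ['a', ',', ' ', 'b']
print(b_or_plus)     # ['+', 'b']
print(bracket_or_b)  # [']', 'b']
print(escaped)       # ['a', '-', '\\']
```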
Note that a pattern group can be used in front of * or any repetition qualifiers such as '+' or '?'. For example: 'a(xy)*b' will match 'ab', 'axyb', and also 'axyxyb' or 'axyxyxyb'.
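A small sketch of a repeated group, using re.fullmatch so that the whole string has to fit the pattern. The two non-matching candidates are added here for contrast.

```python
import re

pattern = re.compile('a(xy)*b')

# The last two candidates are counterexamples added for contrast.
candidates = ['ab', 'axyb', 'axyxyb', 'axyxyxyb', 'axb', 'ayxb']
results = {s: bool(pattern.fullmatch(s)) for s in candidates}

for s, matched in results.items():
    print(s, matched)
```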
*?, +?, ??
The qualifiers '*', '+', and '?' are greedy: they match as much text as possible. For example, if the regex '<.*>' is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. This greedy behavior is usually not what you want when capturing the title with a pattern such as '<.*>(.*)<.*>'.
One can specify non-greedy, minimal match behavior by adding "?" after the qualifier. For example, '<.*?>' will now match only '<H1>' in '<H1>title</H1>'.
To capture the title in '<H1>title</H1>', one can use either '<.+?>(.+?)<.+?>' or '<[^>]+>([^<]+)<'.
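The difference between greedy and non-greedy matching can be seen in a minimal sketch like this one, which runs the patterns from this section against the same sample string.

```python
import re

html = '<H1>title</H1>'

# Greedy: '.*' grabs as much as it can, so the match runs to the last '>'.
greedy = re.match('<.*>', html).group(0)

# Non-greedy: '.*?' stops at the first '>'.
lazy = re.match('<.*?>', html).group(0)

# Two ways to capture the text between the tags.
title_lazy = re.match('<.+?>(.+?)<.+?>', html).group(1)
title_negated = re.match('<[^>]+>([^<]+)<', html).group(1)

print(greedy)         # <H1>title</H1>
print(lazy)           # <H1>
print(title_lazy)     # title
print(title_negated)  # title
```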
{m}
{m,n}
For example, 'a{4,}b' will match 'aaaab' or a thousand "a" characters followed by a 'b', but not 'aaab'.
{m,n}?
For example, on the 6-character string 'aaaaaa', 'a{3,5}' will match 5 "a" characters, while 'a{3,5}?' will only match 3 characters.
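A short sketch of counted repetition. It assumes the pattern 'a{4,}b' for the "four or more" case, and compares the greedy and non-greedy forms of 'a{3,5}' on 'aaaaaa'.

```python
import re

# Assumed pattern for "at least four a's, then b".
at_least_four = re.compile('a{4,}b')
four = bool(at_least_four.fullmatch('aaaab'))
thousand = bool(at_least_four.fullmatch('a' * 1000 + 'b'))
three = bool(at_least_four.fullmatch('aaab'))

# Greedy takes the upper bound when it can; non-greedy stops at the lower bound.
greedy = re.match('a{3,5}', 'aaaaaa').group(0)
lazy = re.match('a{3,5}?', 'aaaaaa').group(0)

print(four, thousand, three)  # True True False
print(greedy)                 # aaaaa
print(lazy)                   # aaa
```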
In regex, many chars have special meaning, for example “()[]{}*+?.\” and more. Sometimes you want to search for these chars exactly. This can be done by adding a backslash in front of the char that has special meaning. For example, to match a string containing the question mark, use the regex r'\?'.
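A minimal sketch of escaping. Besides writing the backslash by hand, the standard re.escape() function can add the backslashes for arbitrary text. The sample strings are made up for the example.

```python
import re

# '?' escaped with a backslash matches a literal question mark.
literal_q = re.findall(r'\?', 'why? how?')

# re.escape() backslash-escapes the special characters in arbitrary text,
# so the text can be searched for exactly.
needle = '1+1=2 (really?)'
found = re.search(re.escape(needle), 'proof: 1+1=2 (really?) yes')

print(literal_q)       # ['?', '?']
print(found.group(0))  # 1+1=2 (really?)
```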
If a char does not have special meaning, adding a backslash in front may or may not represent the character literally. For example, many of the “character class” wildcards start with a backslash. e.g. \w \W \d \D. (see Wildcards section above.) Also, a backslash followed by a number represents the captured pattern. (see Captures section below.)
NOTE TO DOC WRITERS: the following are not clear in meaning.
Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser:
\a \b \f \n \r \t \v \x \\
Octal escapes are included in a limited form: If the first digit is a 0, or if there are three octal digits, it is considered an octal escape. Otherwise, it is a group reference. As for string literals, octal escapes are always at most three digits in length.
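The octal rule can be tried out with a sketch like the one below. It assumes Python 3 behavior and covers only the three simple cases: three octal digits, a leading zero, and a plain group reference.

```python
import re

# Three octal digits: \101 is octal for 65, the letter 'A'.
octal_three = re.match(r'\101', 'A')

# Leading zero: \07 is the character with value 7 (BEL), not a reference to group 7.
octal_zero = re.match(r'\07', '\x07')

# Otherwise a digit escape is a group reference: \1 repeats what group 1 matched.
backref = re.match(r'(a)\1', 'aa')

print(octal_three is not None)  # True
print(octal_zero is not None)   # True
print(backref.group(0))         # aa
```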
re.search(r'^aha', 'why not?\naha, i see.', re.MULTILINE)
This call returns a match object because 'aha' appears at the beginning of the second line. If the re.MULTILINE flag is not given, then None is returned, since 'aha' does not appear at the beginning of the string.
If the re.MULTILINE flag is given, then '$' also matches before any newline. For example, re.compile('gocha$', re.MULTILINE) will now also match 'gocha\nthis time'.
Regex constructs such as ^ and $ are called anchors, because they force a regex pattern to match at the start or end of a string or of a line.
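The effect of re.MULTILINE on both anchors can be compared side by side, as a minimal sketch using the sample strings from this section.

```python
import re

text = 'why not?\naha, i see.'

# '^' matches at the start of each line only when re.MULTILINE is given.
caret_multiline = re.search(r'^aha', text, re.MULTILINE)
caret_default = re.search(r'^aha', text)

target = 'gocha\nthis time'

# '$' matches before an interior newline only when re.MULTILINE is given.
dollar_multiline = re.search('gocha$', target, re.MULTILINE)
dollar_default = re.search('gocha$', target)

print(caret_multiline.span())   # (9, 12)
print(caret_default)            # None
print(dollar_multiline.span())  # (0, 5)
print(dollar_default)           # None
```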
\A
\Z
\b
What counts as an alphanumeric character for \b depends on the UNICODE and LOCALE flags. For example, if the unicode flag is set and the locale is set to Chinese, then Chinese chars are considered alphanumeric.
Inside a character set [], \b means backspace character.
\B
This is the opposite of \b, so it is also subject to the settings of LOCALE and UNICODE.
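A brief sketch of \b and \B, and of \b as a backspace inside a character set. The sample strings are invented for the example.

```python
import re

words = 'cat concat cats cat.'

# \b requires a word boundary on each side: only the standalone "cat" qualifies.
whole_word_starts = [m.start() for m in re.finditer(r'\bcat\b', words)]

# \B requires that there is no boundary: "cat" must start inside a word.
inside_word_starts = [m.start() for m in re.finditer(r'\Bcat', words)]

# Inside a set, \b is the backspace character (the middle character of 'a\bb').
backspace_pos = re.search(r'[\b]', 'a\bb').start()

print(whole_word_starts)   # [0, 16]
print(inside_word_starts)  # [7]
print(backspace_pos)       # 1
```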
Alternatives can be used inside capture groups as well (see Captures below).
To match the vertical bar | exactly, use \|.
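A minimal sketch of alternation at the top level, inside a group, and with an escaped bar. The patterns and sample strings are illustrative choices, not taken from the text above.

```python
import re

# Top-level alternation: either word.
pets = re.findall('cat|dog', 'cat, dog, bird')

# Alternation limited to a group: 'gray' or 'grey'.
spellings = [bool(re.fullmatch('gr(a|e)y', w)) for w in ('gray', 'grey', 'groy')]

# An escaped bar matches the '|' character itself.
literal_bar = re.findall(r'a\|b', 'a|b ab')

print(pets)         # ['cat', 'dog']
print(spellings)    # [True, True, False]
print(literal_bar)  # ['a|b']
```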
()
In the following example, the quote is captured and referenced as \1, and the source after the double dash is captured and referenced as \2.
newstr = re.sub(r'([^-]+)--(.+)$', r'\1--Me, not \2', '"what do you mean?" --A Sage')  # returns: "what do you mean?" --Me, not A Sage
To match parentheses literally, use \( and \).
\number
For example, r'(.+) \1' matches 'the the' or '55 55', but not 'the end'.
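A small sketch of a numbered backreference, assuming the pattern r'(.+) \1' and using re.match so that the match has to begin at the start of the string. With re.search, 'the end' would be found to contain 'e e', which is why the anchored form is used here.

```python
import re

doubled = re.compile(r'(.+) \1')

# re.match anchors at the start, so the whole first word must repeat.
outcomes = {s: bool(doubled.match(s)) for s in ('the the', '55 55', 'the end')}

# The same numbering works in a replacement string.
swapped = re.sub(r'(\w+)-(\w+)', r'\2-\1', 'left-right')

print(outcomes)  # {'the the': True, '55 55': True, 'the end': False}
print(swapped)   # right-left
```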
Note: For historical reasons, the number in \number can be from 1 to 99 only. That is to say, no more than 99 captures can be referenced this way.
NOTE TO DOC WRITER: what happens with '\562' for example? And what happens with '\09' for example? or '\23456'? Needs a clear explanation on this here.
(?P<name>...)
Naming captured groups has advantages. In particular, when a complex regex is edited with captures added or deleted, references using the named form will remain stable. To refer to a named captured group in a replacement string, use the form r'\g<name>'. To refer to a named captured group in a regex, use the form '(?P=name)'.
In the following example, a file name and link string are extracted from an HTML document's link anchor:
import re

# Named groups capture the href value and the link text.
patternObj = re.compile(r'([^<]+)<a href="(?P<fileName>[^"]+)">(?P<linkStr>[^<]+)</a>')
matchObj = patternObj.search('look: <a href="some.jpg">my cat</a>.')
print(matchObj.expand(r'file name is: \g<fileName>, link string is: \g<linkStr>'))
# prints: file name is: some.jpg, link string is: my cat
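The two reference forms for named groups can also be sketched separately. The group names first, second, and q below are hypothetical, chosen only for this example.

```python
import re

# \g<name> in a replacement string: swap two named words.
swapped = re.sub(r'(?P<first>\w+) (?P<second>\w+)', r'\g<second> \g<first>', 'hello world')

# (?P=name) inside the regex: the closing quote must equal the opening quote.
quoted = re.compile(r'(?P<q>["\']).*?(?P=q)')
same_quotes = quoted.fullmatch('"hi"')
mixed_quotes = quoted.fullmatch('"hi\'')

print(swapped)                 # world hello
print(same_quotes.group('q'))  # "
print(mixed_quotes)            # None
```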
(?P=name)
(?...)
(?iLmsux)
Note that the (?x) flag changes how the expression is parsed. It should be used first in the expression string, or after one or more whitespace characters. If there are non-whitespace characters before the flag, the results are undefined.
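A minimal sketch of inline flags, with each flag placed at the very start of the pattern. The patterns and sample strings are illustrative only.

```python
import re

# (?i) at the start of the pattern turns on case-insensitive matching.
greetings = re.findall(r'(?i)hello', 'Hello, HELLO, hello')

# (?x) placed first enables verbose mode: whitespace and # comments
# inside the pattern are ignored.
verbose = re.compile(r'''(?x)
    (\d{4})   # year
    -
    (\d{2})   # month
''')
year_month = verbose.match('2005-04').groups()

print(greetings)   # ['Hello', 'HELLO', 'hello']
print(year_month)  # ('2005', '04')
```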
(?:...)
(?#...)
(?=...)
For example, 'Isaac (?=Asimov)' will match 'Isaac ' only if it's followed by 'Asimov'.
(?!...)
For example, 'Isaac (?!Asimov)' will match 'Isaac ' only if it's not followed by 'Asimov'.
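The two lookahead forms can be compared with a sketch like this, assuming the patterns 'Isaac (?=Asimov)' and 'Isaac (?!Asimov)'. The lookahead text is checked but is not part of the match.

```python
import re

# Positive lookahead: match 'Isaac ' only when 'Asimov' follows.
pos_hit = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov')
pos_miss = re.search(r'Isaac (?=Asimov)', 'Isaac Newton')

# Negative lookahead: match 'Isaac ' only when 'Asimov' does not follow.
neg_hit = re.search(r'Isaac (?!Asimov)', 'Isaac Newton')
neg_miss = re.search(r'Isaac (?!Asimov)', 'Isaac Asimov')

print(repr(pos_hit.group(0)))  # 'Isaac '
print(pos_miss)                # None
print(repr(neg_hit.group(0)))  # 'Isaac '
print(neg_miss)                # None
```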
(?<=...)
>>> import re
>>> m = re.search('(?<=abc)def', 'abcdef')
>>> m.group(0)
'def'
This example looks for a word following a hyphen:
>>> m = re.search(r'(?<=-)\w+', 'spam-egg')
>>> m.group(0)
'egg'
(?<!...)
(?(id/name)yes-pattern|no-pattern)
For example, a pattern such as r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)' will match '<user@host.com>' as well as 'user@host.com', but not '<user@host.com'.
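A sketch of a conditional pattern that behaves as described. The pattern below is one illustrative choice (a deliberately simple email matcher), and re.match is used so that matching starts at the beginning of the string.

```python
import re

# Group 1 optionally matches '<'. The conditional (?(1)>) then requires
# a closing '>' only if group 1 took part in the match.
email = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)')

samples = ('<user@host.com>', 'user@host.com', '<user@host.com')
results = {s: bool(email.match(s)) for s in samples}

for s, matched in results.items():
    print(s, matched)
```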
Page created: 2005-04, by Xah Lee. For copyright and terms, see terms.html
