Regular Expressions/Regular Expression Syntaxes

From Wikibooks, the open-content textbooks collection


There are several variants of regular expressions, differing not only in their concrete syntax but also in their capabilities. These include traditional Unix regular expressions, POSIX regular expressions, and Perl regular expressions. Individual tools that support regular expressions also have their own peculiarities; the text editor Emacs is one example.


Traditional Unix regular expressions

The "basic" Unix regular expression syntax is now defined as obsolete by POSIX, but is still widely used for the purposes of backwards compatibility. Most regular-expression-aware Unix utilities, such as grep and sed, use it by default while providing support for extended regular expressions with command line arguments (see below).

In the basic syntax, most characters are treated as literals: they match only themselves (e.g. "a" matches "a", "(bc" matches "(bc", etc.). The exceptions are called metacharacters:

Operators
Operator Effect
. Matches any single character. Within [ ] this character has its literal meaning. For example, "a.cd" matches "abcd", "a..d" matches "abcd".
[ ] Matches a single character that is contained within the brackets. For example, [abc] matches "a", "b", or "c". [a-z] matches any lowercase letter. These can be mixed: [abcq-z] matches a, b, c, q, r, s, t, u, v, w, x, y, z, and so does [a-cq-z].

The '-' character is treated as a literal only if it is the first or the last character within the brackets: [abc-] or [-abc]. To match an '[' or ']' character, the easiest way is to make sure the closing bracket is first in the enclosing square brackets: [][ab] matches ']', '[', 'a' or 'b'.

[^ ] Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than "a", "b", or "c". [^a-z] matches any single character that is not a lowercase letter. As above, these can be mixed.

To avoid matching the ']' character, place it immediately after the '^' character: "[^]]" matches any single character that is not ']'.

^ Matches the start of the line (or any line, when applied in multiline mode)
$ Matches the end of the line (or any line, when applied in multiline mode)
( ) Defines a "marked subexpression". What the enclosed expression matched can be recalled later. See the next entry, \n. A "marked subexpression" is also a "block". This feature is not found in some instances of regex. In most Unix utilities (such as sed and vi) a backslash must precede the open and close parentheses.
\n Where n is a digit from 1 to 9; matches what the nth marked subexpression matched. This construct is theoretically irregular and has not been adopted in the extended regular expression syntax.
* A single character expression followed by "*" matches zero or more copies of the expression. For example, "ab*c" matches "ac", "abc", "abbbc" etc. "[xyz]*" matches "", "x", "y", "zx", "zyx", and so on.
  • \n*, where n is a digit from 1 to 9, matches zero or more iterations of what the nth marked subexpression matched. For example, "(a.)c\1*" (written "\(a.\)c\1*" in tools that require backslashed parentheses) matches all of "abcab" and "abcabab" but not all of "abcac".
  • An expression enclosed in "\(" and "\)" followed by "*" is deemed to be invalid. In some cases (e.g. /usr/bin/xpg4/grep of SunOS 5.8), it matches zero or more iterations of the string that the enclosed expression matches. In other cases (e.g. /usr/bin/grep of SunOS 5.8), it matches what the enclosed expression matches, followed by a literal "*".
{x,y} Match the last "block" at least x and not more than y times. In the basic syntax both braces must be preceded by a backslash. For example, "a\{3,5\}" matches "aaa", "aaaa" or "aaaaa". Note that this is not found in some instances of regex.

Note that particular implementations of regular expressions interpret backslash differently in front of some of the metacharacters. For example, egrep and Perl interpret unbackslashed parentheses and vertical bars as metacharacters, reserving the backslashed versions to mean the literal characters themselves. Old versions of grep did not support the alternation operator "|".
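
The bracket, backreference and interval rules from the table can be tried out with a short sketch. Python's re module is an assumption here, not one of the tools discussed above; it uses Perl-style syntax, so parentheses and braces are written without backslashes, but these particular constructs behave the same way.

```python
import re

# A leading ']' inside a bracket expression is literal, so "[][ab]"
# matches ']', '[', 'a' or 'b'.
brackets = re.compile(r"[][ab]")
bracket_hits = [ch for ch in "][abc" if brackets.fullmatch(ch)]

# A '-' placed last inside the brackets is literal.
dash = re.compile(r"[abc-]")
dash_hits = [ch for ch in "a-z" if dash.fullmatch(ch)]

# "\1" recalls what the first marked subexpression matched.
# The basic syntax would write this pattern as \(a.\)c\1*
backref = re.compile(r"(a.)c\1*")
backref_hits = [s for s in ("abcab", "abcabab", "abcac") if backref.fullmatch(s)]

# An interval: between 3 and 5 copies of 'a'.
# The basic syntax would write this pattern as a\{3,5\}
interval = re.compile(r"a{3,5}")
interval_hits = [s for s in ("aa", "aaa", "aaaaa", "aaaaaa") if interval.fullmatch(s)]

print(bracket_hits)   # [']', '[', 'a', 'b']
print(dash_hits)      # ['a', '-']
print(backref_hits)   # ['abcab', 'abcabab']
print(interval_hits)  # ['aaa', 'aaaaa']
```

fullmatch is used so that each candidate must be matched in its entirety; a search would also accept strings that merely contain a match.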

Examples:
Example Match
".at" any three-character string ending in "at", such as hat, cat or bat
"[hc]at" hat and cat
"[^b]at" all the matched strings from the regex ".at" except bat
"^[hc]at" hat and cat but only at the beginning of a line
"[hc]at$" hat and cat but only at the end of a line

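The examples in the table above can be checked with a few lines of Python. The word list and the helper name matching_words are made up for illustration, and Python's re module stands in for a tool such as grep.

```python
import re

# Hypothetical sample "lines", one word each.
WORDS = ["hat", "cat", "bat", "at", "chat"]

def matching_words(pattern, words=WORDS):
    """Return the words that contain a match for pattern, as grep would select lines."""
    return [w for w in words if re.search(pattern, w)]

print(matching_words(r".at"))      # ['hat', 'cat', 'bat', 'chat']
print(matching_words(r"[hc]at"))   # ['hat', 'cat', 'chat']
print(matching_words(r"[^b]at"))   # ['hat', 'cat', 'chat']
print(matching_words(r"^[hc]at"))  # ['hat', 'cat']
print(matching_words(r"[hc]at$"))  # ['hat', 'cat', 'chat']
```

Note that "chat" is selected by the unanchored patterns because it contains "hat", and that "at" alone is never selected because every pattern requires a character before "at".
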
Since many ranges of characters depend on the chosen locale setting (e.g., in some settings letters are organized as abc..yzABC..YZ while in some others as aAbBcC..yYzZ), the POSIX standard defines some classes or categories of characters as shown in the following table:

POSIX class   similar to (ASCII)   meaning
[:upper:]     [A-Z]                uppercase letters
[:lower:]     [a-z]                lowercase letters
[:alpha:]     [A-Za-z]             upper- and lowercase letters
[:alnum:]     [A-Za-z0-9]          digits, upper- and lowercase letters
[:digit:]     [0-9]                digits
[:xdigit:]    [0-9A-Fa-f]          hexadecimal digits
[:punct:]     [.,!?:...]           punctuation
[:blank:]     [ \t]                space and TAB characters only
[:space:]     [ \t\n\r\f\v]        whitespace characters
[:cntrl:]     [\x00-\x1F\x7F]      control characters
[:graph:]     [\x21-\x7E]          visible characters (printable, excluding space)
[:print:]     [\x20-\x7E]          visible characters and space

Example: [[:upper:]ab] matches only the uppercase letters and lowercase 'a' and 'b'.
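
Python's re module does not understand POSIX classes, so the following is only a rough sketch of what such a class stands for in the ASCII case. The table POSIX_ASCII and the helper expand_posix_classes are hypothetical names introduced for illustration; locale effects are ignored, only some classes are covered, and the helper blindly rewrites every [:name:] it sees rather than checking that it sits inside a bracket expression.

```python
import re

# ASCII-only stand-ins for some POSIX classes (an assumption; real
# POSIX classes depend on the locale).
POSIX_ASCII = {
    "upper": "A-Z",
    "lower": "a-z",
    "alpha": "A-Za-z",
    "alnum": "A-Za-z0-9",
    "digit": "0-9",
    "xdigit": "0-9A-Fa-f",
    "blank": r" \t",
    "space": r" \t\n\r\f\v",
}

def expand_posix_classes(pattern):
    """Replace each [:name:] with its ASCII range text; unknown names raise KeyError."""
    return re.sub(r"\[:(\w+):\]", lambda m: POSIX_ASCII[m.group(1)], pattern)

translated = expand_posix_classes("[[:upper:]ab]")
print(translated)  # [A-Zab]

matcher = re.compile(translated)
print([ch for ch in "Aabcz9" if matcher.fullmatch(ch)])  # ['A', 'a', 'b']
```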

It is generally agreed that [:print:] consists of [:graph:] plus the space character. However, in Perl regular expressions [:print:] matches [:graph:] union [:space:].

An additional non-POSIX class understood by some tools is [:word:], which is usually defined as [:alnum:] plus underscore. This reflects the fact that in many programming languages these are the characters that may be used in identifiers. The editor vim further distinguishes word and word-head classes (using the notation \w and \h) since in many programming languages the characters that can begin an identifier are not the same as those that can occur in other positions.

(For an ASCII chart color-coded to show the POSIX classes, see ASCII.)

Greedy expressions

Quantifiers in regular expressions are greedy: they match as much text as they can. This can be a significant problem. For example, someone wishing to find the first instance of an item in double-brackets in the text

Another whale explosion occurred on [[January 26]], [[2004]], in [[Tainan City]], [[Taiwan]].

would most likely use the pattern (\[\[.*\]\]), which seems correct (note that each square bracket is preceded by a backslash because it is to be interpreted as a literal character). However, this pattern will actually return [[January 26]], [[2004]], in [[Tainan City]], [[Taiwan]] instead of the expected [[January 26]]. This is because it returns everything between the first two left brackets, from [[January 26]], and the last two right brackets, from [[Taiwan]].

There are two ways to avoid this common problem. The first is to specify what is not to be matched rather than what is to be matched. In this case the ] is not to be matched, so the pattern would be (\[\[[^\]]*\]\]). However, this would fail to match at all on this string:

A B C [[D E] F G]]

Secondly, modern regular expression tools allow a quantifier to be specified as non-greedy, by putting a question mark after the quantifier: (\[\[.*?\]\]).
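
The three patterns discussed above can be compared directly. This is a minimal sketch using Python's re module, which is an assumption made here for illustration and is one of the tools that accept the non-greedy "?" suffix.

```python
import re

text = ("Another whale explosion occurred on [[January 26]], [[2004]], "
        "in [[Tainan City]], [[Taiwan]].")

greedy = re.search(r"(\[\[.*\]\])", text).group(1)
negated = re.search(r"(\[\[[^\]]*\]\])", text).group(1)
lazy = re.search(r"(\[\[.*?\]\])", text).group(1)

print(greedy)   # [[January 26]], [[2004]], in [[Tainan City]], [[Taiwan]]
print(negated)  # [[January 26]]
print(lazy)     # [[January 26]]

# The negated-class pattern cannot step over a single ']',
# whereas the non-greedy pattern can.
tricky = "A B C [[D E] F G]]"
negated_tricky = re.search(r"(\[\[[^\]]*\]\])", tricky)
lazy_tricky = re.search(r"(\[\[.*?\]\])", tricky).group(1)

print(negated_tricky)  # None
print(lazy_tricky)     # [[D E] F G]]
```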

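In PHP, the 'U' pattern modifier, placed at the end of the regex (just after the closing slash), inverts the greediness of every quantifier in the pattern: quantifiers become non-greedy by default, and a following question mark makes them greedy again. For example, /\[\[.*\]\]/U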

POSIX modern (extended) regular expressions

The more modern "extended" regular expressions can often be used with modern Unix utilities by including the command line flag "-E".

POSIX extended regular expressions are similar in syntax to the traditional Unix regular expressions, with some exceptions. The following metacharacters are added:

  • + — Match the last "block" one or more times - "ba+" matches "ba", "baa", "baaa" and so on
  • ? — Match the last "block" zero or one times - "ba?" matches "b" or "ba"
  • | — The choice (or set union) operator: match either the expression before or the expression after the operator - "abc|def" matches "abc" or "def".

Also, backslashes are removed: \{...\} becomes {...} and \(...\) becomes (...). Examples:

  • "[hc]+at" matches "hat", "cat", "hhat", "chat", "hcat", "ccchat" etc.
  • "[hc]?at" matches "hat", "cat" and "at"
  • "([cC]at)|([dD]og)" matches "cat", "Cat", "dog" and "Dog"
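
The three examples above can be checked with a short sketch. Python's re module is used as a stand-in here (an assumption, since it is a Perl-style engine rather than a POSIX one), because its syntax agrees with the extended syntax for these operators; full_matches is a hypothetical helper name.

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# "+" requires at least one of 'h' or 'c' before "at".
print(full_matches(r"[hc]+at", ["hat", "chat", "ccchat", "at"]))
# ['hat', 'chat', 'ccchat']

# "?" allows at most one of 'h' or 'c' before "at".
print(full_matches(r"[hc]?at", ["hat", "cat", "at", "chat"]))
# ['hat', 'cat', 'at']

# "|" chooses between the two parenthesised alternatives.
print(full_matches(r"([cC]at)|([dD]og)", ["cat", "Cat", "dog", "Dog", "cog"]))
# ['cat', 'Cat', 'dog', 'Dog']
```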

Since the characters '(', ')', '[', ']', '{', '|', '.', '*', '?', '+', '^' and '$' are used as special symbols, they have to be escaped if they are meant literally. This is done by preceding them with '\', which therefore also has to be escaped in the same way if it is meant literally. Example:

"a\.(\(|\))" matches the string "a.)" or "a.("

Perl-compatible regular expressions (PCRE)

TODO: Extend from Perl Programming/Regular Expressions Reference.

Perl has a richer and more predictable syntax than even the extended POSIX regexp. An example of its predictability is that \ always quotes a non-alphanumeric character. An example of something that can be specified in Perl but not in POSIX is whether a part of the match should be greedy. For instance, in the pattern /a.*b/, the .* will match as much as it can, while in the pattern /a.*?b/, the .*? will match as little as it can. So given the string "a bad dab", the first pattern will match the whole string, and the second will match only "a b".
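
The "a bad dab" example can be reproduced in any engine with Perl-style quantifiers. The sketch below uses Python's re module, which is an assumption made for illustration rather than Perl itself.

```python
import re

text = "a bad dab"

# ".*" takes as much as it can, so the match runs to the last 'b'.
greedy = re.search(r"a.*b", text).group(0)

# ".*?" takes as little as it can, so the match stops at the first 'b'.
lazy = re.search(r"a.*?b", text).group(0)

print(repr(greedy))  # 'a bad dab'
print(repr(lazy))    # 'a b'
```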

For these reasons, many other utilities and applications have adopted syntaxes that look a lot like Perl's — for example, Java, Ruby, Python, PHP, exim, BBEdit, and even Microsoft's .NET Framework all use regular expression syntax similar to Perl's. Not all "Perl-compatible" regular expression implementations are identical, and many implement only a subset of Perl's features.

Emacs regular expressions

A stub

  • "\s" does not mean whitespace, unlike in JavaScript, .NET and Perl. Instead, "\s-" matches whitespace.
  • There is no "\d" as in PCRE. Use [0-9] or [[:digit:]] instead.
  • The following metacharacters must be escaped using backslashes (unlike in PCRE): ( ) { } |
  • There is no lookahead and no lookbehind, unlike in PCRE.
  • Emacs regexp can match characters by syntax using mode-specific syntax tables ("\sc", "\s-", "\s ") or by categories ("\cc", "\cg").
