Regular Expressions/Print version
From Wikibooks, the open-content textbooks collection
Contents |
Introduction
A Regular Expression (regex) is a string of text written in a very concise language. Various software applications use regular expressions to match patterns in text. Using a regular expression and a tool which understands it, a user can simplify the search for a particular textual pattern. For example, you can use a regular expression to tell an editor to find the next occurrence of the word "Chapter" followed by one or more spaces followed by one or more digits. Or you can use a regular expression to tell the UNIX grep command to show only those lines of a file that contain the word "Wiki" followed by either the word "Books" or the word-fragment "pedia". We'll discuss the exact syntax of such regular expressions in the next chapter.
Regular expressions are a powerful text examination and manipulation tool. In a sense, a regular expression is a little computer program that finds or isolates a subset of a larger set of text. In the same way that an ordinary computer program needs a computer to execute it, a regular expression needs a software application to interpret it — to give it meaning. There are a variety of software applications that implement regular expressions. Let's look at the notable ones.
Supporting Software
Regular expressions are supported by various software tools, including command line tools, plain text editors and programming languages. Most of these tools are available for various computing platforms, including Linux, Windows and Mac OS X.
The tools:
- Command line tools
- grep
- egrep
- sed
- awk
- Plain text editors
- vi
- Emacs
- Programming languages
- Java
- JavaScript
- .NET
- Perl
- Ruby
- Tcl
The tools use slightly different styles syntax. Perl uses a form called Perl Compatible Regular Expressions (PCRE). TextPad uses POSIX style regexes. We will cover PCRE as an introduction and later discuss the difference between PCRE and POSIX styles.
Regular Expression Syntaxes
There are several variants of regular expressions, differing not only in their concrete syntax but also in their capabilities. These include traditional Unix regular expressions, POSIX regular expressions, and Perl regular expressions. Individual tools supporting regular expresions have their own peculiarities: that is also the case with the text editor Emacs.
Traditional Unix regular expressions
The "basic" Unix regular expression syntax is now defined as obsolete by POSIX, but is still widely used for the purposes of backwards compatibility. Most regular-expression-aware Unix utilities, such as grep and sed, use it by default while providing support for extended regular expressions with command line arguments (see below).
In the basic syntax, most characters are treated as literals — they match only themselves (i.e. "a" matches "a", "(bc" matches "(bc", etc). The exceptions are called metacharacters:
| Operator | Effect |
|---|---|
| . | Matches any single character. Into [ ] this character has its habitual meaning. For example, "a.cd" matches "abcd", "a..d" matches "abcd". |
| [ ] | Matches a single character that is contained within the brackets. For example, [abc] matches "a", "b", or "c". [a-z] matches any lowercase letter. These can be mixed: [abcq-z] matches a, b, c, q, r, s, t, u, v, w, x, y, z, and so does [a-cq-z].
The '-' character should be literal only if it is the last or the first character within the brackets: [abc-] or [-abc]. To match an '[' or ']' character, the easiest way is to make sure the closing bracket is first in the enclosing square brackets: [][ab] matches ']', '[', 'a' or 'b'. |
| [^ ] | Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than "a", "b", or "c". [^a-z] matches any single character that is not a lowercase letter. As above, these can be mixed.
To avoid matching the ']' character, place it immediately after the '^' character: "[^]]" matches any single character that is not ']'. |
| ^ | Matches the start of the line (or any line, when applied in multiline mode) |
| $ | Matches the end of the line (or any line, when applied in multiline mode) |
| ( ) | Defines a "marked subexpression". What the enclosed expression matched can be recalled later. See the next entry, \n. A "marked subexpression" is also a "block". This feature is not found in some instances of regex. In most Unix utilities (such as sed and vi) a backslash must precede the open and close parentheses. |
| \n | Where n is a digit from 1 to 9; matches what the nth marked subexpression matched. This construct is theoretically irregular and has not been adopted in the extended regular expression syntax. |
| * | A single character expression followed by "*" matches zero or more copies of the expression. For example, "ab*c" matches "ac", "abc", "abbbc" etc. "[xyz]*" matches "", "x", "y", "zx", "zyx", and so on.
|
| {x,y} | Match the last "block" at least x and not more than y times. For example, "a\{3,5}" matches "aaa", "aaaa" or "aaaaa". Note that this is not found in some instances of regex. |
Note that particular implementations of regular expressions interpret backslash differently in front of some of the metacharacters. For example, egrep and Perl interpret unbackslashed parentheses and vertical bars as metacharacters, reserving the backslashed versions to mean the literal characters themselves. Old versions of grep did not support the alternation operator "|".
| Example | Match |
|---|---|
| ".at" | any three-character string like hat, cat or bat |
| "[hc]at" | hat and cat |
| "[^b]at" | all the matched strings from the regex ".at" except bat |
| "^[hc]at" | hat and cat but only at the beginning of a line |
| "[hc]at$" | hat and cat but only at the end of a line |
Since many ranges of characters depends on the chosen locale setting (e.g., in some settings letters are organized as abc..yzABC..YZ while in some others as aAbBcC..yYzZ), the POSIX standard defines some classes or categories of characters as shown in the following table:
| POSIX class | similar to | meaning |
|---|---|---|
| [:upper:] | [A-Z] | uppercase letters |
| [:lower:] | [a-z] | lowercase letters |
| [:alpha:] | [A-Za-z] | upper- and lowercase letters |
| [:alnum:] | [A-Za-z0-9] | digits, upper- and lowercase letters |
| [:digit:] | [0-9] | digits |
| [:xdigit:] | [0-9A-Fa-f] | hexadecimal digits |
| [:punct:] | [.,!?:...] | punctuation |
| [:blank:] | [ \t] | space and TAB characters only |
| [:space:] | [ \t\n\r\f\v] | blank (whitespace) characters |
| [:cntrl:] | control characters | |
| [:graph:] | [^ \t\n\r\f\v] | printed characters |
| [:print:] | [^\t\n\r\f\v] | printed characters and space |
Example: [[:upper:]ab] should only match the uppercase letters and lowercase 'a' and 'b'.
It is generally agreed that [:print:] consists of [:graph:] plus the space character. However, in Perl regular expressions [:print:] matches [:graph:] union [:space:].
An additional non-POSIX class understood by some tools is [:word:], which is usually defined as [:alnum:] plus underscore. This reflects the fact that in many programming languages these are the characters that may be used in identifiers. The editor vim further distinguishes word and word-head classes (using the notation \w and \h) since in many programming languages the characters that can begin an identifier are not the same as those that can occur in other positions.
(For an ASCII chart color-coded to show the POSIX classes, see ASCII.)
Greedy expressions
Quantifiers in regular expressions match as much as they can; they are greedy (meaning they try to match the maximum available). This can be a significant problem. For example, someone wishing to find the first instance of an item in double-brackets in the text
- Another whale explosion occurred on [[January 26]], [[2004]], in [[Tainan City]], [[Taiwan]].
would most likely use the pattern (\[\[.*\]\]), which seems correct (note that the square bracket is preceded by a back slash as it is to be interpreted as a literal character). However, this pattern will actually return [[January 26]], [[2004]], in [[Tainan City]], [[Taiwan]] instead of the expected [[January 26]]. This is because it will return everything between the first 2 left brackets from [[January 26]] and the last 2 right brackets from [[Taiwan]].
There are two ways to avoid this common problem; firstly, rather than specifying what is to be matched, specify what is not to be matched, in this case the ] is not to be matched, so the pattern would be (\[\[[^\]]*\]\]). However, this would fail to match at all on this string:
- A B C [[D E] F G]]
Secondly, modern regular expression tools allow a quantifier to be specified as non-greedy, by putting a question mark after the quantifier: (\[\[.*?\]\]).
In PHP, you can allow a quantifier to be specified as non-greedy, by adding a 'U' at the end of the regex (just after the finishing slash). For example, /\[\[.*\]\]/U
POSIX modern (extended) regular expressions
The more modern "extended" regular expressions can often be used with modern Unix utilities by including the command line flag "-E".
POSIX extended regular expressions are similar in syntax to the traditional Unix regular expressions, with some exceptions. The following metacharacters are added:
- + — Match the last "block" one or more times - "ba+" matches "ba", "baa", "baaa" and so on
- ? — Match the last "block" zero or one times - "ba?" matches "b" or "ba"
- | — The choice (or set union) operator: match either the expression before or the expression after the operator - "abc|def" matches "abc" or "def".
Also, backslashes are removed: \{...\} becomes {...} and \(...\) becomes (...). Examples:
- "[hc]+at" matches with "hat", "cat", "hhat", "chat", "hcat", "ccchat" etc.
- "[hc]?at" matches "hat", "cat" and "at"
- "([cC]at)|([dD]og)" matches "cat", "Cat", "dog" and "Dog"
Since the characters '(', ')', '[', ']', '.', '*', '?', '+', '^' and '$' are used as special symbols they have to be escaped if they are meant literally. This is done by preceding them with '\' which therefore also has to be escaped this way if meant literally. Examples:
- "a\.(\(|\))" matches with the string "a.)" or "a.("
Perl-compatible regular expressions (PCRE)
Perl has a richer and more predictable syntax than even the extended POSIX regexp. An example of its predictability is that \ always quotes a non-alphanumeric character. An example of something that is possible to specify with Perl but not POSIX is whether part of the match wanted to be greedy or not. For instance in the pattern /a.*b/, the .* will match as much as it can, while in the pattern /a.*?b/, .*? will match as little. So given the string "a bad dab", the first pattern will match the whole string, and the second will only match "a b".
For these reasons, many other utilities and applications have adopted syntaxes that look a lot like Perl's — for example, Java, Ruby, Python, PHP, exim, BBEdit, and even Microsoft's .NET Framework all use regular expression syntax similar to Perl's. Not all "Perl-compatible" regular expression implementations are identical, and many implement only a subset of Perl's features.
Emacs regular expressions
A stub
- "\s" does not mean whitespace, unlike in JavaScript, .NET and Perl. Instead, "\s-" matches whitespace.
- It doesn't have "\d" like in PCRE. Use [0-9] or [[:digit:]]
- The following metacharacters must be escaped using backslashes (unlike in PCRE): ( ) { } |
- No lookahead and no lookbehind like in PCRE
- Emacs regexp can match characters by syntax using mode-specific syntax tables ("\sc", "\s-", "\s ") or by catetories ("\cc", "\cg").
Implementation
Implementations and running times
There are at least two different algorithms that decide if (and how) a given string matches a regular expression.
The oldest and fastest relies on a result in formal language theory that allows every nondeterministic Finite State Machine (NFA) to be transformed into a deterministic finite state machine (DFA). The algorithm performs or simulates this transformation and then runs the resulting DFA on the input string, one symbol at a time. The latter process takes time linear to the length of the input string. More precisely, an input string of size n can be tested against a regular expression of size m in time O(n+2m) or O(nm), depending on the details of the implementation. This algorithm is often referred to as DFA. It is fast, but can be used only for matching and not for recalling grouped subexpressions. There is a variant that can recall grouped subexpressions, but its running time slows down to O(n2m) [citation needed].
The other algorithm is to match the pattern against the input string by backtracking. (This algorithm is sometimes called NFA, but this terminology is highly confusing.) Its running time can be exponential, which simple implementations exhibit when matching against expressions like "(a|aa)*b" that contain both alternation and unbounded quantification and force the algorithm to consider an exponential number of subcases. More complex implementations identify and speed up various common cases where they would otherwise run slowly.
Even though backtracking implementations only give an exponential guarantee in the worst case, they allow much greater flexibility and provide more expressive power. For instance any implementation that allows the use of backreferences, or implements the various improvements that Perl introduced, must use a backtracking implementation.
Some implementations try to provide the best of both algorithms by first running a fast DFA match to see if the string matches the regular expression at all, and only in that case perform a potentially slower backtracking match.
Examples
A regular expression ( also "RegEx" or "regex" ) is a string that is used to describe or match a set of strings according to certain syntax rules. The specific syntax rules vary depending on the specific implementation, programming language, or library in use. Additionally, the functionality of regex implementations can vary between versions.
Despite this variability, and because regular expressions can be difficult to both explain and understand without examples, this article provides a basic description of some of the properties of regular expressions by way of illustration.
Conventions
The following conventions are used in the examples.[1]
metacharacter(s) ;; the metacharacters column specifies the regex syntax being demonstrated =~ m// ;; indicates a regex match operation in perl =~ s/// ;; indicates a regex substitution operation in perl
Also worth noting is that these regular expressions are all Perl-like syntax. Standard POSIX regular expressions are different.
Examples
Unless otherwise indicated, the following examples conform to the Perl programming language, release 5.8.8, January 31, 2006. The syntax and conventions used in these examples coincide with that of other programming environments as well (e.g., see Java in a Nutshell - Page 213, Python Scripting for Computational Science - Page 320, Programming PHP - Page 106 ).
| Metacharacter(s) | Description | Example Note that all the if statements return a TRUE value |
|---|---|---|
| . | Normally matches any character except a newline. Within square brackets the dot is literal. |
$string1 = "Hello World\n";
if ($string1 =~ m/...../) {
print "$string1 has length >= 5\n";
}
|
| ( ) | Groups a series of pattern elements to a single element. When you match a pattern within parentheses, you can use any of $1, $2, ... later to refer to the previously matched pattern. |
$string1 = "Hello World\n";
if ($string1 =~ m/(H..).(o..)/) {
print "We matched '$1' and '$2'\n";
}
Output:
|
| + | Matches the preceding pattern element one or more times. |
$string1 = "Hello World\n";
if ($string1 =~ m/l+/) {
print "There are one or more consecutive letter \"l\"'s in $string1\n";
}
Output:
|
| ? | Matches the preceding pattern element zero or one times. |
$string1 = "Hello World\n";
if ($string1 =~ m/H.?e/) {
print "There is an 'H' and a 'e' separated by ";
print "0-1 characters (Ex: He Hoe)\n";
}
|
| ? | Modifies the *, +, or {M,N}'d regexp that comes before to match as few times as possible. |
$string1 = "Hello World\n";
if ($string1 =~ m/(l.+?o)/) {
print "The non-greedy match with 'l' followed by one or ";
print "more characters is 'llo' rather than 'llo wo'.\n";
}
|
| * | Matches the preceding pattern element zero or more times. |
$string1 = "Hello World\n";
if ($string1 =~ m/el*o/) {
print "There is an 'e' followed by zero to many ";
print "'l' followed by 'o' (eo, elo, ello, elllo)\n";
}
|
| {M,N} | Denotes the minimum M and the maximum N match count. |
$string1 = "Hello World\n";
if ($string1 =~ m/l{1,2}/) {
print "There exists a substring with at least 1 ";
print "and at most 2 l's in $string1\n";
}
|
| [...] | Denotes a set of possible character matches. |
$string1 = "Hello World\n";
if ($string1 =~ m/[aeiou]+/) {
print "$string1 contains one or more vowels.\n";
}
|
| | | Separates alternate possibilities. |
$string1 = "Hello World\n";
if ($string1 =~ m/(Hello|Hi|Pogo)/) {
print "At least one of Hello, Hi, or Pogo is ";
print "contained in $string1.\n";
}
|
| \b | Matches a word boundary. |
$string1 = "Hello World\n";
if ($string1 =~ m/llo\b/) {
print "There is a word that ends with 'llo'\n";
} else {
print "There are no words that end with 'llo'\n";
}
|
| \w | Matches an alphanumeric character, including "_". |
$string1 = "Hello World\n";
if ($string1 =~ m/\w/) {
print "There is at least one alphanumeric ";
print "character in $string1 (A-Z, a-z, 0-9, _)\n";
}
|
| \W | Matches a non-alphanumeric character, excluding "_". |
$string1 = "Hello World\n";
if ($string1 =~ m/\W/) {
print "The space between Hello and ";
print "World is not alphanumeric\n";
}
|
| \s | Matches a whitespace character (space, tab, newline, form feed) |
$string1 = "Hello World\n";
if ($string1 =~ m/\s.*\s/) {
print "There are TWO whitespace characters, which may";
print " be separated by other characters, in $string1";
}
|
| \S | Matches anything BUT a whitespace. |
$string1 = "Hello World\n";
if ($string1 =~ m/\S.*\S/) {
print "There are TWO non-whitespace characters, which";
print " may be separated by other characters, in $string1";
}
|
| \d | Matches a digit, same as [0-9]. |
$string1 = "99 bottles of beer on the wall.";
if ($string1 =~ m/(\d+)/) {
print "$1 is the first number in '$string1'\n";
}
Output:
|
| \D | Matches a non-digit. |
$string1 = "Hello World\n";
if ($string1 =~ m/\D/) {
print "There is at least one character in $string1";
print " that is not a digit.\n";
}
|
| ^ | Matches the beginning of a line or string. |
$string1 = "Hello World\n";
if ($string1 =~ m/^He/) {
print "$string1 starts with the characters 'He'\n";
}
|
| $ | Matches the end of a line or string. |
$string1 = "Hello World\n";
if ($string1 =~ m/rld$/) {
print "$string1 is a line or string ";
print "that ends with 'rld'\n";
}
|
| \A | Matches the beginning of a string (but not an internal line). |
$string1 = "Hello\nWorld\n";
if ($string1 =~ m/\AH/) {
print "$string1 is a string ";
print "that starts with 'H'\n";
}
|
| \Z | Matches the end of a string (but not an internal line). |
$string1 = "Hello\nWorld\n";
if ($string1 =~ m/d\n\Z/) {
print "$string1 is a string ";
print "that ends with 'd\\n'\n";
}
|
| [^...] | Matches every character except the ones inside brackets. |
$string1 = "Hello World\n";
if ($string1 =~ m/[^abc]/) {
print "$string1 contains a character other than ";
print "a, b, and c\n";
}
|
Notes
- ↑ The character 'm' is not always required to specify a perl match operation. For example, m/[^abc]/ could also be rendered as /[^abc]/. The 'm' is only necessary if the user wishes to specify a match operation without using a forward-slash as the regex delimiter. Sometimes it is useful to specify an alternate regex delimiter in order to avoid "delimiter collision". See 'perldoc perlre' for more details.
Links
External Links
- Regex Tutorial Site
- Text and Data Manipulation with Regular Expressions in .NET Development — tutorial, reference and Ready to use Regular Expression patterns for VB.NET, C#.NET and ASP.NET development
- Easy Regex Tutorial
- Emacs regular expressions