Regular Expressions/Print version

From Wikibooks, open books for an open world
< Regular Expressions
Jump to: navigation, search

Contents

[edit] Introduction

[edit] What is a regular expression?

A regular expression is a method of representing a string matching pattern. Regular expressions enable strings that match a particular pattern within textual data records to be located and modified and they are often used within utility programs and programming languages that manipulate textual data. Regular expressions are extremely powerful.

[edit] Example applications

Various software applications use regular expressions to locate, select or modify particular sections text. For example, a regular expression could be used to:

  • replace the word "snake" with the word "serpent" throughout an entire piece of text
  • locate pieces of text containing the words "fox" and "sheep" on the same line

[edit] Regular expression components

Regular expressions are made up of three types of components:

  • anchors used to specify the position of the pattern in relation to a line of text.
  • character sets used to match one or more characters in a single position.
  • modifiers used to specify how many times a character set is repeated.

[edit] Syntax varies across application programs

The syntax of regular expressions varies across application programs. For example the shell uses a limited form of regular expression called shell regular expressions for filename substitution, whereas awk uses a superset of extended regular expression syntax.

[edit] Supporting Software

Regular expressions are supported by various software tools, including command line tools, plain text editors and programming languages. Most of these tools are available for various computing platforms, including Linux, Windows and Mac OS X. The tools use slightly different syntax styles. Let's look at some notable ones.

The tools:

  • Command line tools
    • grep
    • egrep
    • sed
  • Plain text editors
    • ed
    • vi
    • Emacs
  • Programming languages
    • Awk
    • Java
    • JavaScript
    • .NET
    • Perl
    • PHP
    • Ruby
    • Tcl
    • python

A regular expression can be considered to be a little computer program that finds or isolates a subset of a larger set of text. In the same way that an ordinary computer program needs a computer to execute it, a regular expression needs a software application to interpret it — to give it meaning.

For example, a regular expression can be used to tell an editor to find the next occurrence of the word "Chapter" followed by several spaces and digits. Or you can use a regular expression to tell the UNIX grep command to show only those lines of a file that contain the word "Wiki" followed by either the word "Books" or the word-fragment "pedia". We will discuss the exact syntax of such regular expressions in the next chapter.

[edit] syntax

There are several variants of regular expressions. These variants differ not only in their concrete syntax but also in their capabilities. Individual tools that support regular expressions also have their own peculiarities.

Since many ranges of characters depends on the chosen locale setting (e.g., in some settings letters are organized as abc..yzABC..YZ while in some others as aAbBcC..yYzZ).

[edit] Greedy expressions

Quantifiers in regular expressions match as much as they can; they are greedy (meaning they try to match the maximum available). This can be a significant problem. For example, someone wishing to find the first instance of an item in double-brackets in the text

Another whale explosion occurred on [[January 26]], [[2004]], in [[Tainan City]], [[Taiwan]].

would most likely use the pattern (\[\[.*\]\]), which seems correct (note that the square bracket is preceded by a back slash as it is to be interpreted as a literal character). However, this pattern will actually return [[January 26]], [[2004]], in [[Tainan City]], [[Taiwan]] instead of the expected [[January 26]]. This is because it will return everything between the first 2 left brackets from [[January 26]] and the last 2 right brackets from [[Taiwan]].

There are two ways to avoid this common problem; firstly, rather than specifying what is to be matched, specify what is not to be matched, in this case the ] is not to be matched, so the pattern would be (\[\[[^\]]*\]\]). However, this would fail to match at all on this string:

A B C [[D E] F G]]

In PHP, you can allow a quantifier to be specified as non-greedy, by adding a 'U' at the end of the regex (just after the finishing slash). For example, /\[\[.*\]\]/U

[edit] Implementation

[edit] Implementations and running times

There are at least two different algorithms that decide if (and how) a given string matches a regular expression.

The oldest and fastest relies on a result in formal language theory that allows every nondeterministic Finite State Machine (NFA) to be transformed into a deterministic finite state machine (DFA). The algorithm performs or simulates this transformation and then runs the resulting DFA on the input string, one symbol at a time. The latter process takes time linear to the length of the input string. More precisely, an input string of size n can be tested against a regular expression of size m in time O(n+2m) or O(nm), depending on the details of the implementation. This algorithm is often referred to as DFA. It is fast, but can be used only for matching and not for recalling grouped subexpressions. There is a variant that can recall grouped subexpressions, but its running time slows down to O(n2m)[citation needed].

The other algorithm is to match the pattern against the input string by backtracking. (This algorithm is sometimes called NFA, but this terminology is highly confusing.) Its running time can be exponential, which simple implementations exhibit when matching against expressions like "(a|aa)*b" that contain both alternation and unbounded quantification and force the algorithm to consider an exponential number of subcases. More complex implementations identify and speed up various common cases where they would otherwise run slowly.

Even though backtracking implementations only give an exponential guarantee in the worst case, they allow much greater flexibility and provide more expressive power. For instance any implementation that allows the use of backreferences, or implements the various improvements that Perl introduced, must use a backtracking implementation.

Some implementations try to provide the best of both algorithms by first running a fast DFA match to see if the string matches the regular expression at all, and only in that case perform a potentially slower backtracking match.

[edit] Examples

A regular expression ( also "RegEx" or "regex" ) is a string that is used to describe or match a set of strings according to certain syntax rules. The specific syntax rules vary depending on the specific implementation, programming language, or library in use. Additionally, the functionality of regex implementations can vary between versions.

Despite this variability, and because regular expressions can be difficult to both explain and understand without examples, this article provides a basic description of some of the properties of regular expressions by way of illustration.

[edit] Conventions

The following conventions are used in the examples.[1]

   metacharacter(s) ;; the metacharacters column specifies the regex syntax being demonstrated
   =~ m//           ;; indicates a regex match operation in perl    
   =~ s///          ;; indicates a regex substitution operation in perl

The regular expressions in these examples are all /perl compatible syntax. POSIX basic regular expressions use a different syntax.

[edit] Examples

Unless otherwise indicated, the following examples conform to the Perl programming language, release 5.8.8, January 31, 2006. The syntax and conventions used in these examples coincide with that of other programming environments as well (e.g., see Java in a Nutshell - Page 213, Python Scripting for Computational Science - Page 320, Programming PHP - Page 106 ).

Metacharacter(s) Description Example
Note that all the if statements return a TRUE value
. Normally matches any character except a newline. Within square brackets the dot is literal.
$string1 = "Hello World\n";
if ($string1 =~ m/...../) {
  print "$string1 has length >= 5\n";
}
( ) Groups a series of pattern elements to a single element. When you match a pattern within parentheses, you can use any of $1, $2, ... later to refer to the previously matched pattern.
$string1 = "Hello World\n";
if ($string1 =~ m/(H..).(o..)/) {
  print "We matched '$1' and '$2'\n";
}

Output:

We matched 'Hel' and 'o W';
+ Matches the preceding pattern element one or more times.
$string1 = "Hello World\n";
if ($string1 =~ m/l+/) {
  print "There are one or more consecutive letter \"l\"'s in $string1\n";
}
Output:

There are one or more consecutive letter "l"'s in Hello World
? Matches the preceding pattern element zero or one times.
$string1 = "Hello World\n";
if ($string1 =~ m/H.?e/) {
  print "There is an 'H' and a 'e' separated by ";
  print "0-1 characters (Ex: He Hoe)\n";
}
? Modifies the *, +, or {M,N}'d regexp that comes before to match as few times as possible.
$string1 = "Hello World\n";
if ($string1 =~ m/(l.+?o)/) {
  print "The non-greedy match with 'l' followed by one or ";
  print "more characters is 'llo' rather than 'llo wo'.\n";
}
* Matches the preceding pattern element zero or more times.
$string1 = "Hello World\n";
if ($string1 =~ m/el*o/) {
  print "There is an 'e' followed by zero to many ";
  print "'l' followed by 'o' (eo, elo, ello, elllo)\n";
}
{M,N} Denotes the minimum M and the maximum N match count.
$string1 = "Hello World\n";
if ($string1 =~ m/l{1,2}/) {
 print "There exists a substring with at least 1 ";
 print "and at most 2 l's in $string1\n";
}
[...] Denotes a set of possible character matches.
$string1 = "Hello World\n";
if ($string1 =~ m/[aeiou]+/) {
  print "$string1 contains one or more vowels.\n";
}
| Separates alternate possibilities.
$string1 = "Hello World\n";
if ($string1 =~ m/(Hello|Hi|Pogo)/) {
  print "At least one of Hello, Hi, or Pogo is ";
  print "contained in $string1.\n";
}
\b Matches a word boundary.
$string1 = "Hello World\n";
if ($string1 =~ m/llo\b/) {
  print "There is a word that ends with 'llo'\n";
} else {
  print "There are no words that end with 'llo'\n";
}
\w Matches an alphanumeric character, including "_".
$string1 = "Hello World\n";
if ($string1 =~ m/\w/) {
  print "There is at least one alphanumeric ";
  print "character in $string1 (A-Z, a-z, 0-9, _)\n";
}
\W Matches a non-alphanumeric character, excluding "_".
$string1 = "Hello World\n";
if ($string1 =~ m/\W/) {
  print "The space between Hello and ";
  print "World is not alphanumeric\n";
}
\s Matches a whitespace character (space, tab, newline, form feed)
$string1 = "Hello World\n";
if ($string1 =~ m/\s.*\s/) {
  print "There are TWO whitespace characters, which may";
  print " be separated by other characters, in $string1";
}
\S Matches anything BUT a whitespace.
$string1 = "Hello World\n";
if ($string1 =~ m/\S.*\S/) {
  print "There are TWO non-whitespace characters, which";
  print " may be separated by other characters, in $string1";
}
\d Matches a digit, same as [0-9].
$string1 = "99 bottles of beer on the wall.";
if ($string1 =~ m/(\d+)/) {
  print "$1 is the first number in '$string1'\n";
}

Output:

99 is the first number in '99 bottles of beer on the wall.'
\D Matches a non-digit.
$string1 = "Hello World\n";
if ($string1 =~ m/\D/) {
  print "There is at least one character in $string1";
  print " that is not a digit.\n";
}
^ Matches the beginning of a line or string.
$string1 = "Hello World\n";
if ($string1 =~ m/^He/) {
  print "$string1 starts with the characters 'He'\n";
}
$ Matches the end of a line or string.
$string1 = "Hello World\n";
if ($string1 =~ m/rld$/) {
  print "$string1 is a line or string ";
  print "that ends with 'rld'\n";
}
\A Matches the beginning of a string (but not an internal line).
$string1 = "Hello\nWorld\n";
if ($string1 =~ m/\AH/) {
  print "$string1 is a string ";
  print "that starts with 'H'\n";
}
\Z Matches the end of a string (but not an internal line).
$string1 = "Hello\nWorld\n";
if ($string1 =~ m/d\n\Z/) {
  print "$string1 is a string ";
  print "that ends with 'd\\n'\n";
}
[^...] Matches every character except the ones inside brackets.
$string1 = "Hello World\n";
if ($string1 =~ m/[^abc]/) {
  print "$string1 contains a character other than ";
  print "a, b, and c\n";
}

[edit] Notes

  1. The character 'm' is not always required to specify a perl match operation. For example, m/[^abc]/ could also be rendered as /[^abc]/. The 'm' is only necessary if the user wishes to specify a match operation without using a forward-slash as the regex delimiter. Sometimes it is useful to specify an alternate regex delimiter in order to avoid "delimiter collision". See 'perldoc perlre' for more details.

[edit] Links

[edit] External Links

Applications

Personal tools
Namespaces
Variants
Actions
Navigation
Community
Toolbox
In other languages
Sister projects
Print/export