Regular Expressions/syntax

From Wikibooks, open books for an open world

There are several variants of regular expressions. These variants differ not only in their concrete syntax but also in their capabilities. Individual tools that support regular expressions also have their own peculiarities.

Many ranges of characters depend on the chosen locale setting (e.g., in some settings letters are ordered as abc..yzABC..YZ, while in others as aAbBcC..yYzZ).

Greedy expressions

Quantifiers in regular expressions are greedy: they match as much text as they can. This can produce surprising results. For example, someone wishing to find the first instance of an item in double brackets in the text

Another whale explosion occurred on [[January 26]], [[2004]], in [[Tainan City]], [[Taiwan]].

would most likely use the pattern (\[\[.*\]\]), which seems correct (note that each square bracket is preceded by a backslash so that it is interpreted as a literal character). However, this pattern actually returns [[January 26]], [[2004]], in [[Tainan City]], [[Taiwan]] instead of the expected [[January 26]]. This is because .* matches as much as possible, so the match runs from the first two left brackets of [[January 26]] to the last two right brackets of [[Taiwan]].
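The greedy behaviour described above can be reproduced with a short sketch. Python's re module is used here purely for illustration (the original text does not name a tool for this pattern), and the variable names are arbitrary.

```python
import re

text = ("Another whale explosion occurred on [[January 26]], [[2004]], "
        "in [[Tainan City]], [[Taiwan]].")

# Greedy: .* first consumes the rest of the line, then backtracks only
# as far as needed, which is to the LAST "]]" in the text.
greedy = re.search(r"(\[\[.*\]\])", text)
print(greedy.group(1))
# prints: [[January 26]], [[2004]], in [[Tainan City]], [[Taiwan]]
```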

There are two ways to avoid this common problem. The first is to specify what is not to be matched rather than what is to be matched. In this case a ] must not appear inside the item, so the pattern would be (\[\[[^\]]*\]\]). However, this pattern fails to match at all when the item itself contains a single ], as in this string:

A B C [[D E] F G]]
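As a minimal sketch of the negated-class approach and its limitation, again assuming Python's re module for illustration only:

```python
import re

# [^\]]* matches any run of characters that are not "]".
negated = re.compile(r"(\[\[[^\]]*\]\])")

text = ("Another whale explosion occurred on [[January 26]], [[2004]], "
        "in [[Tainan City]], [[Taiwan]].")
first = negated.search(text)
print(first.group(1))   # prints: [[January 26]]

# The single "]" after "D E" stops the character class, and it is not
# followed by a second "]", so no match is found anywhere in the string.
tricky = "A B C [[D E] F G]]"
print(negated.search(tricky))   # prints: None
```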

The second way is to make the quantifier non-greedy. In PHP, you can do this by adding a 'U' modifier at the end of the regex (just after the closing slash), which makes the quantifiers in the pattern non-greedy by default. For example, /\[\[.*\]\]/U
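Python has no equivalent of PHP's 'U' modifier, so the sketch below swaps in a different technique with the same effect on this pattern: the lazy quantifier .*?, which matches as little as possible. This is an illustration under that assumption, not a translation of the PHP syntax.

```python
import re

# .*? is the non-greedy (lazy) form of .*: it stops at the first "]]".
lazy = re.compile(r"(\[\[.*?\]\])")

text = ("Another whale explosion occurred on [[January 26]], [[2004]], "
        "in [[Tainan City]], [[Taiwan]].")
print(lazy.search(text).group(1))   # prints: [[January 26]]

# Every bracketed item can now be collected separately.
items = lazy.findall(text)
print(items)

# Unlike the negated character class, the lazy form still matches
# when the item contains a single "]".
tricky = "A B C [[D E] F G]]"
print(lazy.search(tricky).group(1))   # prints: [[D E] F G]]
```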
