Regular Expressions/syntax
There are several variants of regular expressions. These variants differ not only in their concrete syntax but also in their capabilities. Individual tools that support regular expressions also have their own peculiarities.
- shell regular expressions - a limited form of regular expression used for pattern matching and filename substitution
- simple regular expressions - widely used for backwards compatibility, but deprecated on POSIX-compliant systems
- basic regular expressions - used by some Unix shell tools
- Emacs regular expressions - used by the Emacs editor
- non-POSIX basic regular expressions - provide additional character classes not supported by POSIX
- Perl compatible regular expressions - used by Perl and some application programs (especially those written in Perl)
- POSIX basic regular expressions - provide extensions for consistency between utility programs. These extensions are not supported by some traditional implementations of Unix tools.
- POSIX extended regular expressions - may be supported by some Unix utilities via the -E command line switch
Many ranges of characters depend on the chosen locale setting: in some locales letters are ordered as abc..yzABC..YZ, while in others they are ordered as aAbBcC..yYzZ. A range such as a-z may therefore cover different characters under different locales.
Greedy expressions
Quantifiers in regular expressions are greedy: they match as much text as they can. This can be a significant problem. For example, someone wishing to find the first instance of an item in double brackets in the text
- Another whale explosion occurred on [[January 26]], [[2004]], in [[Tainan City]], [[Taiwan]].
would most likely use the pattern (\[\[.*\]\]), which seems correct (note that each square bracket is preceded by a backslash so that it is interpreted as a literal character). However, this pattern will actually return [[January 26]], [[2004]], in [[Tainan City]], [[Taiwan]] instead of the expected [[January 26]]. This is because it returns everything between the first two left brackets, from [[January 26]], and the last two right brackets, from [[Taiwan]].
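The behaviour described above can be reproduced in most regex engines. The following is a minimal sketch using Python's re module, chosen here only for illustration; the variable names are arbitrary.

```python
import re

text = ("Another whale explosion occurred on [[January 26]], [[2004]], "
        "in [[Tainan City]], [[Taiwan]].")

# The greedy .* first consumes the rest of the line, then backtracks
# only as far as needed to find a closing "]]", which is the last one.
greedy = re.search(r"(\[\[.*\]\])", text)
print(greedy.group(1))
# [[January 26]], [[2004]], in [[Tainan City]], [[Taiwan]]
```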
There are two ways to avoid this common problem. The first is to specify what is not to be matched rather than what is to be matched. In this case the ] is not to be matched, so the pattern would be (\[\[[^\]]*\]\]). However, this would fail to match at all on this string:
- A B C [[D E] F G]]
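Both outcomes of the negated character class can be checked with a short sketch, again assuming Python's re module purely for illustration:

```python
import re

# [^\]]* matches any run of characters that are not "]".
pattern = re.compile(r"(\[\[[^\]]*\]\])")

whale = ("Another whale explosion occurred on [[January 26]], [[2004]], "
         "in [[Tainan City]], [[Taiwan]].")
first = pattern.search(whale)
print(first.group(1))  # [[January 26]]

# A single "]" inside the brackets stops the character class, and the
# text that follows it is not "]]", so there is no match at all.
tricky = "A B C [[D E] F G]]"
print(pattern.search(tricky))  # None
```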
The second way is to make the quantifier non-greedy. In PHP, adding a 'U' at the end of the regex (just after the finishing slash) makes its quantifiers non-greedy. For example, /\[\[.*\]\]/U
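Many other engines, including Perl, PCRE and Python, mark an individual quantifier as non-greedy by appending a ? to it. Python's re module has no ungreedy flag (its re.U flag means Unicode), so the sketch below uses .*? instead; it is an illustration of the same idea, not a translation of the PHP modifier.

```python
import re

whale = ("Another whale explosion occurred on [[January 26]], [[2004]], "
         "in [[Tainan City]], [[Taiwan]].")
tricky = "A B C [[D E] F G]]"

# .*? matches as little as possible, so it stops at the first "]]".
lazy = re.compile(r"\[\[.*?\]\]")

print(lazy.search(whale).group(0))   # [[January 26]]
print(lazy.search(tricky).group(0))  # [[D E] F G]]
```

Unlike the negated character class, the non-greedy pattern also handles the string containing a single ] between the brackets.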