Visual Basic/Regular Expressions

Sometimes the built-in string functions are not the most convenient or elegant solution to the problem at hand. If the task involves manipulating complicated patterns of characters, regular expressions can be a more effective tool than sequences of simple string functions.

Visual Basic has no built-in support for regular expressions, but it can use them through the Microsoft VBScript Regular Expressions library. If you have Internet Explorer installed, you almost certainly have the library. To use it, add a reference to the project: on the Project menu choose References and scroll down to Microsoft VBScript Regular Expressions. If more than one version is listed, choose the one with the highest version number, unless you have a particular reason to choose an older one, such as compatibility with the version on another machine.

Class outline

Outline of the VBScript.RegExp class:

  • Properties
    • RegExp.Pattern
    • RegExp.Global
    • RegExp.IgnoreCase
    • RegExp.MultiLine
  • Methods
    • RegExp.Test
    • RegExp.Replace
    • RegExp.Execute
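
The effect of the IgnoreCase and MultiLine properties can be illustrated with a small sketch. The names CountMatches and DemoProperties are hypothetical helpers introduced here for illustration, and the code assumes version 5.5 of the library, which is the version that provides MultiLine:

```vb
' Counts the matches of Pattern in Subject with the given property settings.
' CountMatches is a hypothetical helper, not part of the library.
Function CountMatches(ByVal Pattern As String, ByVal Subject As String, _
                      ByVal CaseInsensitive As Boolean, _
                      ByVal LineAnchors As Boolean) As Long
    Dim Re As Object
    Set Re = CreateObject("VBScript.RegExp")
    Re.Pattern = Pattern
    Re.Global = True                 ' find every match, not just the first
    Re.IgnoreCase = CaseInsensitive  ' True: "a" also matches "A"
    Re.MultiLine = LineAnchors       ' True: ^ and $ also match at line breaks
    CountMatches = Re.Execute(Subject).Count
End Function

Sub DemoProperties()
    Dim Subject As String
    Subject = "Apple" & vbLf & "apple"
    Debug.Print CountMatches("^apple$", Subject, False, False) ' 0 matches
    Debug.Print CountMatches("^apple$", Subject, False, True)  ' 1 match
    Debug.Print CountMatches("^apple$", Subject, True, True)   ' 2 matches
End Sub
```

Because the object is created with CreateObject, the sketch should not need a project reference.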

Constructing a regexp

One way of constructing a regular expression object, using late binding:

   Set Regexp = CreateObject("VBScript.RegExp")
   Regexp.Pattern = "[0-9][0-9]*"

Another way, which requires a reference to Microsoft VBScript Regular Expressions (in Excel, the reference is set in the VBA editor):

   Dim Re As RegExp 'A variable name other than RegExp avoids a clash with the class name
   Set Re = New RegExp
   Re.Pattern = "[0-9][0-9]*"

Testing for match

An example of testing whether a regular expression matches anywhere in a string:

   Set RegExp = CreateObject("VBScript.RegExp")
   RegExp.Pattern = "[0-9][0-9]*"
   If RegExp.Test("354647") Then
     MsgBox "Test 1 passed."
   End If
   If RegExp.Test("a354647") Then
     MsgBox "Test 2 passed." 'This one passes, as the matching is not a whole-string one
   End If
   If RegExp.Test("abc") Then
     MsgBox "Test 3 passed." 'This one does not pass
   End If

An example of testing for match in which the whole string has to match:

   Set RegExp = CreateObject("VBScript.RegExp")
   RegExp.Pattern = "^[0-9][0-9]*$"
   If RegExp.Test("354647") Then
     MsgBox "Test 1 passed."
   End If
   If RegExp.Test("a354647") Then
     MsgBox "Test 2 passed." 'This one does not pass
   End If

Finding matches

An example of iterating through the collection of all the matches of a regular expression in a string:

   Set Regexp = CreateObject("VBScript.RegExp")
   Regexp.Pattern = "a.*?z"
   Regexp.Global = True 'Without global, only the first match is found
   Set Matches = Regexp.Execute("aaz abz acz ad1z")
   For Each Match In Matches
     MsgBox "A match: " & Match
   Next

Finding groups

An example of accessing matched groups:

   Set Regexp = CreateObject("VBScript.RegExp")
   Regexp.Pattern = "(a+) *(b+)" '+ rather than *, so the pattern cannot match an empty string
   Regexp.Global = True
   Set Matches = Regexp.Execute("aaa bbb")
   For Each Match In Matches
     FirstGroup = Match.SubMatches(0) '=aaa
     SecondGroup = Match.SubMatches(1) '=bbb
   Next

Replacing

An example of replacing all sequences of dashes with a single dash:

   Set Regexp = CreateObject("VBScript.RegExp")
   Regexp.Pattern = "--*"
   Regexp.Global = True
   Result = Regexp.Replace("A-B--C----D", "-") '="A-B-C-D"

An example of replacing a doubled string with a single copy, using two kinds of backreference: \1 within the pattern and $1 in the replacement string:

   Set Regexp = CreateObject("VBScript.RegExp")
   Regexp.Pattern = "(.*)\1"
   Regexp.Global = True
   Result = Regexp.Replace("hellohello", "$1") '="hello"
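
When a pattern has more than one group, the replacement string can refer to each of them as $1, $2 and so on. The following is a minimal sketch that reorders two captured words; SwapNames is a hypothetical function name chosen for this illustration:

```vb
' Turns "Last, First" into "First Last".
' $1 and $2 in the replacement string refer to the first and second group.
Function SwapNames(ByVal FullName As String) As String
    Dim Re As Object
    Set Re = CreateObject("VBScript.RegExp")
    Re.Pattern = "(\w+), *(\w+)"
    SwapNames = Re.Replace(FullName, "$2 $1")
End Function
```

For example, SwapNames("Doe, John") returns "John Doe".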

Splitting

There is no direct support for splitting by a regular expression, but there is a workaround. If you can assume that the string to be split does not contain Chr(1), you can first replace every match of the separator regular expression with Chr(1), and then use the ordinary Split function on Chr(1).

An example of splitting by one or more spaces:

  SplitString = "a b  c   d"
  Set Regexp = CreateObject("VBScript.RegExp")
  Regexp.Pattern = "  *"
  Regexp.Global = True
  Result = Regexp.Replace(SplitString, Chr(1))
  SplitArray = Split(Result, Chr(1))
  For Each Element In SplitArray
    MsgBox Element 
  Next
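
The workaround can be wrapped in a reusable function. This is only a sketch: the name RegexSplit and the choice to raise error 5 when the input already contains Chr(1) are assumptions made for illustration:

```vb
' Splits Subject wherever SeparatorPattern matches, using Chr$(1) as a
' temporary marker. RegexSplit is a hypothetical helper name.
Function RegexSplit(ByVal Subject As String, _
                    ByVal SeparatorPattern As String) As Variant
    Dim Re As Object
    ' The workaround only holds if the marker does not occur in the input.
    If InStr(Subject, Chr$(1)) > 0 Then
        Err.Raise 5, "RegexSplit", "Input must not contain Chr(1)"
    End If
    Set Re = CreateObject("VBScript.RegExp")
    Re.Pattern = SeparatorPattern
    Re.Global = True   ' replace every separator, not just the first
    RegexSplit = Split(Re.Replace(Subject, Chr$(1)), Chr$(1))
End Function
```

Calling RegexSplit("a b  c   d", " +") returns a zero-based array with the four elements "a", "b", "c" and "d".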

Example application

For many beginning programmers, the ideas behind regular expressions are so foreign that it is worth presenting a simple example before discussing the theory. The example is in fact the beginning of an application for scraping web pages to retrieve source code, so it is relevant too.

Imagine that you need to parse a web page to pick up the major headings and the content to which the headings refer. Such a web page might look like this:

  <html>
    <head>
      <title>RegEx Example</title>
    </head>
    <body>
      <h1>RegEx Example</h1>
        aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
        <h2>Level Two in RegEx Example</h2>
          bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
      <h1>Level One</h1>
        cccccccccccccccccccccccccccccccccccccc
        <h2>Level Two in Level One</h2>
          dddddddddddddddddddddddddddddddddddd
    </body>
  </html>

We want to extract the text of the two h1 elements, all the text between the first h1 and the second h1, and all the text between the second h1 and the end of body tag.

We could store the results in an array that looks like this:

"RegEx Example" " aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\n<h2>Level Two in RegEx Example</h2>\nbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
"Level One" " cccccccccccccccccccccccccccccccccccccc\n<h2>Level Two in Level One</h2>\n dddddddddddddddddddddddddddddddddddd"

The \n character sequences represent end of line marks. These could be any of carriage return, line feed or carriage return followed by line feed.

A regular expression specifies patterns of characters to be matched and the result of the matching process is a list of sub-strings that match either the whole expression or some parts of the expression. An expression that does what we want might look like this:

 
  "<h1>\s*([\s\S]*?)\s*</h1>"
  

Actually, it doesn't quite do it, but it is close. With the Global property set to True, the result is a collection of matches in an object of type MatchCollection:

 
  Item 0
    .FirstIndex:89
    .Length:22
    .Value:"<h1>RegEx Example</h1>"
    .SubMatches:
      .Count:1
      Item 0
        "RegEx Example"
  Item 1
    .FirstIndex:265
    .Length:18
    .Value:"<h1>Level One</h1>"
    .SubMatches:
      .Count:1
      Item 0
        "Level One"
  
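
A listing like the one above can be produced by walking the MatchCollection. The following is a minimal sketch; DescribeMatches is a hypothetical helper, and its output format is simplified compared with the listing:

```vb
' Builds a text description of every match of Pattern in Subject.
' DescribeMatches is a hypothetical helper, not part of the library.
Function DescribeMatches(ByVal Pattern As String, _
                         ByVal Subject As String) As String
    Dim Re As Object, Matches As Object, M As Object
    Dim i As Long, Report As String

    Set Re = CreateObject("VBScript.RegExp")
    Re.Pattern = Pattern
    Re.Global = True   ' collect all matches

    Set Matches = Re.Execute(Subject)
    For Each M In Matches
        ' FirstIndex is zero-based: 0 means the first character.
        Report = Report & "FirstIndex:" & M.FirstIndex & _
                 " Length:" & M.Length & " Value:" & M.Value & vbCrLf
        For i = 0 To M.SubMatches.Count - 1
            Report = Report & "  SubMatch " & i & ":" & _
                     M.SubMatches(i) & vbCrLf
        Next i
    Next M
    DescribeMatches = Report
End Function
```

The returned text can be shown with MsgBox or written to the Immediate window with Debug.Print.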

The heading text is in the SubMatches of each item, but where is the text that follows each heading? To get it, we can use Mid$ together with the FirstIndex and Length properties of each match to find the start and finish of the text between the end of one h1 and the start of the next. However, as usual, there is a problem. The last match is not followed by another h1 element but by the end of body tag, so the text after the last match will include that tag and everything that follows the body. The solution is to use another expression to get just the body first:

 "<body>([\s\S]*)</body>"

This returns just one match with one sub-match, and the sub-match is everything between the body and end body tags. Now we can use our original expression on this new string and it should work.
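
Putting the two expressions together, the whole procedure might be sketched as follows. ExtractSections is a hypothetical function name, and the two-column array layout is an assumption based on the result array described earlier:

```vb
' Returns a two-dimensional array: (i, 0) holds the text of the i-th h1
' element and (i, 1) holds the text that follows it, up to the next h1 or
' the end of the body. Returns Empty if no body or no h1 is found.
Function ExtractSections(ByVal Html As String) As Variant
    Dim Re As Object, Matches As Object
    Dim Body As String, Result() As String
    Dim i As Long, StartPos As Long, EndPos As Long

    Set Re = CreateObject("VBScript.RegExp")

    ' Step 1: isolate everything between the body tags.
    Re.Pattern = "<body>([\s\S]*)</body>"
    Set Matches = Re.Execute(Html)
    If Matches.Count = 0 Then Exit Function
    Body = Matches(0).SubMatches(0)

    ' Step 2: find every h1 element within the body.
    Re.Pattern = "<h1>\s*([\s\S]*?)\s*</h1>"
    Re.Global = True
    Set Matches = Re.Execute(Body)
    If Matches.Count = 0 Then Exit Function

    ReDim Result(0 To Matches.Count - 1, 0 To 1)
    For i = 0 To Matches.Count - 1
        Result(i, 0) = Matches(i).SubMatches(0)
        ' FirstIndex is zero-based, whereas Mid$ positions are one-based.
        StartPos = Matches(i).FirstIndex + Matches(i).Length + 1
        If i < Matches.Count - 1 Then
            EndPos = Matches(i + 1).FirstIndex  ' last character before the next h1
        Else
            EndPos = Len(Body)                  ' last section runs to the end
        End If
        Result(i, 1) = Mid$(Body, StartPos, EndPos - StartPos + 1)
    Next i
    ExtractSections = Result
End Function
```

The caller can check the result with IsEmpty before reading the array. The section text is returned untrimmed, so it keeps any leading and trailing line breaks and indentation.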

Now that you have seen an example, here is a detailed description of the expressions used and of the property settings of the regular expression object.

A regular expression is simply a string of characters but some characters have special meanings. In this expression:

 "<body>([\s\S]*)</body>"

there are three principal parts:

 "<body>"
 "([\s\S]*)"
 "</body>"

Each of these parts is also a regular expression. The first and last are simple strings with no meaning beyond the identity of the characters; they will match any string that includes them as a substring.

The middle expression is rather more obscure. It matches absolutely any sequence of characters and also captures what it matches. Capturing is indicated by surrounding the expression with round brackets. The text that is captured is returned as one of the SubMatches of a match.

<body> matches just <body>
( begins a capture expression
[ begins a character class
\s specifies the character class that includes all white space characters
\S specifies the character class that includes all non-white space characters
] ends the character class
* means that zero or more instances of the preceding expression are to be matched, as many as possible
) ends the capture expression
</body> matches </body>
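
The h1 expression shown earlier uses *? rather than *. The question mark makes the quantifier match as few instances as possible instead of as many as possible. The difference can be seen in a small sketch, where FirstMatch and DemoGreedy are hypothetical helper names:

```vb
' Returns the text of the first match of Pattern in Subject, or "" if none.
Function FirstMatch(ByVal Pattern As String, ByVal Subject As String) As String
    Dim Re As Object, Matches As Object
    Set Re = CreateObject("VBScript.RegExp")
    Re.Pattern = Pattern
    Set Matches = Re.Execute(Subject)
    If Matches.Count > 0 Then FirstMatch = Matches(0).Value
End Function

Sub DemoGreedy()
    Const Subject As String = "<h1>One</h1><h1>Two</h1>"
    ' * takes as much as possible: the match runs to the last </h1>.
    Debug.Print FirstMatch("<h1>[\s\S]*</h1>", Subject)   ' <h1>One</h1><h1>Two</h1>
    ' *? takes as little as possible: the match stops at the first </h1>.
    Debug.Print FirstMatch("<h1>[\s\S]*?</h1>", Subject)  ' <h1>One</h1>
End Sub
```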

In the case studies section of this book there is a simple application that you can use to test regular expressions: Regular Expression Tester.
