75% developed

Applied Programming/RegEx

From Wikibooks, open books for an open world
Jump to navigation Jump to search

Overview[edit | edit source]

Regular Expression[edit | edit source]

In programming, regular expressions refers to a series of characters that can be used to specify a search function. They allow for robust and powerful ways of searching through a plethora of data without creating redundant or repetitive lines of code. Regular expressions, or regex(es) for short, behave like a ctrl+f for coding, allowing us to easily retrieve data patterns such as email addresses, domain names, phone numbers; virtually any data that we can attribute a pattern to.

Regex characters consist of a combination of special metacharacters, and certain standard characters used to specify what elements of the string to match.

Metacharacter[edit | edit source]

A metacharacter is a character that has a special meaning to a computer program, such as a shell interpreter or a regular expression (regex) engine.

Characters[1]

  • []

A set of characters

 import re
 
 txt = "The rain in Spain"
 
 #Find all lower case characters alphabetically between "a" and "m":
 
 x = re.findall("[a-m]", txt)
 print(x)

Output:

['h', 'e', 'a', 'i', 'i', 'a', 'i']
  • \

Signals a special sequence (can also be used to escape special characters)

 import re
 
 txt = "That will be 59 dollars"
 
 #Find all digit characters:
 
 x = re.findall("\d", txt)
 print(x)

Output:

['5', '9']
  • .

Any character (except newline character)

 import re
 
 txt = "hello world"
 
 #Search for a sequence that starts with "he", followed by two (any) characters, and an "o":
 
 x = re.findall("he..o", txt)
 print(x)

Output:

['hello']
  • ^

Starts with

 import re
 
 txt = "hello world"
 
 #Check if the string starts with 'hello':
 
 x = re.findall("^hello", txt)
 if x:
   print("Yes, the string starts with 'hello'")
 else:
   print("No match")

Output:

Yes, the string starts with 'hello'
  • $

Ends with

 import re
 
 txt = "hello world"
 
 #Check if the string ends with 'world':
 
 x = re.findall("world$", txt)
 if x:
   print("Yes, the string ends with 'world'")
 else:
   print("No match")

Output:

Yes, the string ends with 'world'
  • *

Zero or more occurrences

 import re
 
 txt = "The rain in Spain falls mainly in the plain!"
 
 #Check if the string contains "ai" followed by 0 or more "x" characters:
 
 x = re.findall("aix*", txt)
 
 print(x)
 
 if x:
   print("Yes, there is at least one match!")
 else:
   print("No match")

Output:

['ai', 'ai', 'ai', 'ai']
Yes, there is at least one match!
  • +

One or more occurrences

 import re
 
 txt = "The rain in Spain falls mainly in the plain!"
 
 #Check if the string contains "ai" followed by 1 or more "x" characters:
 
 x = re.findall("aix+", txt)
 
 print(x)
 
 if x:
   print("Yes, there is at least one match!")
 else:
   print("No match")

Output:

[]
No match
  • {}

Exactly the specified number of occurrences

 import re
 
 txt = "The rain in Spain falls mainly in the plain!"
 
 #Check if the string contains "a" followed by exactly two "l" characters:
 
 x = re.findall("al{2}", txt)
 
 print(x)
 
 if x:
   print("Yes, there is at least one match!")
 else:
   print("No match")

Output:

['all']
Yes, there is at least one match!
  • |

Either or

 import re
 
 txt = "The rain in Spain falls mainly in the plain!"
 
 #Check if the string contains either "falls" or "stays":
 
 x = re.findall("falls|stays", txt)
 
 print(x)
 
 if x:
   print("Yes, there is at least one match!")
 else:
   print("No match")

Output:

['falls']
Yes, there is at least one match!
Other Examples[2]

Some other characters may have special meaning in some environments.

  • In some Unix shells the semicolon (";") is a statement separator.
  • In XML and HTML, the ampersand ("&") introduces an HTML entity. It also has special meaning in MS-DOS/Windows Command Prompt.
  • In some Unix shells and MS-DOS/Windows Command Prompt, the less-than sign and greater-than sign ("<" and ">") are used for redirection and the grave accent/backquote ("`") is used for command substitution.
  • In many programming languages, strings are delimited using quotes (" or '). In some cases, escape characters (and other methods) are used to avoid delimiter collision, e.g. "He said, \"Hello\"".
  • In printf format strings, the percent sign ("%") is used to introduce format specifiers and must be escaped as "%%" to be interpreted literally. In SQL, the percent is used as a wildcard character.
  • In SQL, the underscore ("_") is used to match any single character.

Kleene Star[edit | edit source]

Greedy Algorithm[edit | edit source]

Greedy is an algorithmic paradigm that builds up a solution piece by piece, always choosing the next piece that offers the most obvious and immediate benefit. So the problems where choosing locally optimal also leads to global solution are best fit for Greedy.[3]


Here are the reasons for using the greedy approach:

  • The greedy approach has a few tradeoffs, which may make it suitable for optimization.
  • One prominent reason is to achieve the most feasible solution immediately. In the activity selection problem (Explained below), if more activities can be done before finishing the current activity, these activities can be performed within the same time.
  • Another reason is to divide a problem recursively based on a condition, with no need to combine all the solutions.
  • In the activity selection problem, the "recursive division" step is achieved by scanning a list of items only once and considering certain activities.[4]


Greedy algorithms have some advantages and disadvantages:

  1. It is quite easy to come up with a greedy algorithm (or even multiple greedy algorithms) for a problem.
  2. Analyzing the run time for greedy algorithms will generally be much easier than for other techniques (like Divide and conquer). For the Divide and conquer technique, it is not clear whether the technique is fast or slow. This is because at each level of recursion the size of gets smaller and the number of sub-problems increases.
  3. The difficult part is that for greedy algorithms you have to work much harder to understand correctness issues. Even with the correct algorithm, it is hard to prove why it is correct. Proving that a greedy algorithm is correct is more of an art than a science. It involves a lot of creativity.[5]

Activities[edit | edit source]

1) Write a program that uses Regex to parse through a text file containing individual e-mail addresses and sort them into a dictionary where the key is the e-mail address' handle and the value is the domain.

Example: A file containing 'johndoe@aol.com' and 'janedoh@yahoo.com' would result in ["johndoe": "aol", "janedoe": "yahoo"].

2) Write a program that uses Regex to find and replace all occurrences of vowels in a string and replaces them with a period. Only replace 'y' if it is the only vowel in the word.

Example: "The quick brown fox jumped over the lazy dog" would result in "Th. q..ck br.wn f.x j.mp.d .v.r th. l.zy d.g"

3) Write a program that uses Regex to convert a date in dd-mm-yyyy format to yy-mm-dd format.

Example: 06-04-1992 would result in 92-06-04.

4) Write a program that uses Regex to break down a URL and sort the information into a dictionary. The key should be the primary domain while the value should be a list of subpages.

Example: "https://www.wikipedia.org/wiki/Computer_programming" would result in ["wikipedia": ['wiki', 'Computer_programming']]

Key Terms[edit | edit source]

Approximate string-match algorithm - (also known as Fuzzy String Searching) searches for substrings of the input string. [6]

Escape character - A character used to bypass a symbol that may be a metacharacter and express the symbol literally.

Exact string-match algorithm - To find one, several, or all occurrences of a defined string (pattern) in a large string (text or sequences) such that each matching is perfect.[6]

Greedy algorithm - An algorithmic paradigm that builds up a solution piece by piece, always choosing the next piece that offers the most obvious and immediate benefit.[3]

Kleene star - A programming resource that offers outcomes related to the concatenation of a string set. Using the Kleene star, developers and others assess how to filter given results based on input.[7]

Metacharacter - A character that has a special meaning to a computer program, such as a shell interpreter or a regular expression (regex) engine.[8]

Monoid - A set equipped with an associative binary operation and an identity element.[9]

Pattern - A regular expression against which text is either matched or captured.[10]

Quantifier - A modifier that follows another regex token, enumerating the number of times you expect it to appear.[10]

RegEx processor - A processor that translates a regular expression in the above syntax into an internal representation which can be executed and matched against a string representing the text being searched.[10]

Regular expression - A regular expression (regex or regexp for short) is a special text string for describing a search pattern.[11]

Wildcard - A generic character that stands in for anything.[10]

References[edit | edit source]