REBOL Programming/Language Features/Parse

From Wikibooks, the open-content textbooks collection

About Parse

The PARSE function is one of the most powerful features in REBOL. Its capabilities range from simple string splitting to advanced parsing of blocks containing values of any datatype, and it covers the kind of pattern matching that other languages implement with regular expressions.

If you are asking yourself why REBOL has no regular expression implementation, PARSE is the answer.

PARSE enables REBOL to create powerful dialects that are used throughout the language. An example of this is VID or the Visual Interface Dialect, used for constructing graphical user interfaces in REBOL.

PARSE traverses a series (a block, a string or a binary) once. While it does so, you can run code, collect information from the series for use elsewhere, or modify the series itself.

Parsing a series is much like traversing it with the FOREACH function and acting (or not acting!) on the current element, except that you describe what you want in a briefer syntax.

A key element to the way it works is that you provide a set of rules and if the string or block does not adhere to the rules, the parsing stops. This is not as devastating as it may sound, as we'll see in the next chapter.

PARSE is used in roughly two ways: string parsing, which is similar to the use of regular expressions, and block parsing, where parsing is done at the level of REBOL values. Block parsing is used to create dialects, and this is one of the main characteristics of the language. However, it is also a major source of confusion for newcomers, who mistake REBOL dialects for the REBOL language itself.

A skeleton of a parsing operation:

parse <DATA> <RULES>

Simple String Parsing

Simple parsing involves string splitting:

>> parse "this is a string" none
== ["this" "is" "a" "string"]

When NONE is given instead of a rule, PARSE uses its default rule, which splits the string into a block of strings at the common delimiters: whitespace (space, tab, newline), quote, comma and semicolon.

Return Values from PARSE

As the example above shows, PARSE returns a block of strings when it is used for splitting.

When you supply your own rule block, PARSE returns TRUE only if it reaches the end of the string or block, which happens only if the input adheres to the rules you provide. Otherwise it returns FALSE. This is an important observation in rule making.

Because the NONE rule places no such requirements on the input, it always reaches the end of the string.
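The two kinds of return value can be compared side by side. This is a minimal sketch, assuming the REBOL 2 behaviour described in this chapter; the input strings are made up for illustration.

```rebol
; With a rule block, PARSE returns a logic value.
print parse "abc" ["abc"]    ; true: the rule consumed the whole string
print parse "abc" ["ab"]     ; false: "c" is left over, so the end was not reached

; With NONE as the rule, PARSE returns the split parts as a block.
probe parse "a b c" none     ; ["a" "b" "c"]
```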

Examples of splitting with the default rule:

nothing -- that is no default delimiters present in the source string

string: "redbluegreen"
parse string none
== ["redbluegreen"]

space

string: "red blue green"
parse string none
== ["red" "blue" "green"]

comma

string: "red,blue,green"
parse string none
== ["red" "blue" "green"]

tab (written as ^- inside a REBOL string)

string: "red^-blue^-green"
parse string none
== ["red" "blue" "green"]

semicolon

string: "red;blue;green"
parse string none
== ["red" "blue" "green"]

CSV

string: {"red","blue","green"}
parse string none
== ["red" "blue" "green"]

newline

string: {
red
blue  
green
}
parse string none
== ["red" "blue" "green"]

The default rule breaks down when:

  • You have different delimiter(s)
  • You want to keep a default delimiter in your field
  • You need something other than simple splitting.

If you have different delimiter(s) you can supply a string rule to PARSE containing your delimiters.

Examples:

delimiter: "#"
string: "red#blue#green"
parse string delimiter
== ["red" "blue" "green"]

delimiter: "#*"
string: "red#blue*green"
parse string delimiter
== ["red" "blue" "green"]

Note that the sequence of the characters in the delimiter string is not important.
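A quick way to check this is to split the same string with the delimiter characters in both orders. This is a small sketch, assuming REBOL 2 splitting behaviour as shown above.

```rebol
string: "red#blue*green"

; The delimiter string is treated as a set of characters, so order does not matter.
probe parse string "#*"    ; ["red" "blue" "green"]
probe parse string "*#"    ; ["red" "blue" "green"]
```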

Character Parsing

Sometimes you want to parse a string to see if it fits a specific format. This can be used for simple things like determining and validating the format of a phone number or an email address.

It can also be used for complex string parsing to determine a binary file format.

You can define a character range as a rule.
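As an illustration, a character set built with CHARSET can serve as such a rule. The sketch below assumes REBOL 2; the names digit and phone, and the phone number format itself, are hypothetical and chosen only for the example.

```rebol
; A character set matching any single decimal digit.
digit: charset "0123456789"

; Hypothetical format: three digits, a dash, four digits.
phone: [3 digit "-" 4 digit]

print parse "555-1234" phone     ; true
print parse "555-12345" phone    ; false: one digit too many
print parse "55A-1234" phone     ; false: "A" is not a digit
```

The integers in front of digit are repeat counts, so 3 digit means exactly three characters from the set.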

Full Series Parsing

Full series parsing means parsing a block of various values and acting on the current value of the series. You can parse a block containing elements of any datatype, or any other kind of series. The order in which the values occur, combined with the values themselves, can be used to trigger code.

We'll start with some very simple examples:

One Value in a Block

The rule is now a datatype instead of a string:

>> parse [5] [integer!]
== true

This returns TRUE because:

  • The datatype matched the element in the block.
  • We've reached the end of the block with one element.

Note that reaching the end of the series means you could use TAIL? on the series and it would return TRUE.

>> parse [25-Dec-2005] [date!]
== true
>> parse ["Hello"] [string!]
== true

Same thing, only with a date and a string.

Now we change the rule to a specific value. The block will only adhere to the rule if two conditions are met:

  • The datatype matches the element
  • The value matches the element

>> parse [Hi] ['Hi]
== true

Changing the value of the word will make the parsing fail, even if the datatype is the same:

>> parse [Hi] ['Bye]
== false

Note that when parsing for integer values, we have to specify a repeat count explicitly, since integers in parse rules are interpreted as the minimum and maximum number of times the following rule must match.

>> parse [-1] [1 1 -1]
== true
>> parse [-1] [-1]
== false

Multiple Values in a Block

Now we can see how PARSE traverses a block:

>> parse [Hi Bye] [word!]
== false

What goes wrong here? We are not only checking the datatype of a block element, we also need to check the count. PARSE fails because the rule only matches the first word in the block. As soon as that word has been matched, there are no more rules to parse with, but the block still contains an unparsed element.

Therefore the block hasn't been parsed to the end, which means the parse fails.

If we want it to go to the end, we need to make the rule check for word! datatypes a specific number of times. There are specific keywords to use for this:

  • ANY - Zero or more times
  • SOME - One or more times
  • OPT - Zero or one time
  • (no keyword) - Exactly one time
  • an integer - Use an integer to determine a specific number of times
  • two integers - Any number between the two integers to determine a number of times

If we add such a keyword:

>> parse [Hi Bye] [any word!]
== true

It returns TRUE, because:

  • The rule checks for word! datatypes for as long as it can by using the ANY keyword.
  • The block adheres to the parse rule.

By using an integer:

>> parse [Hi Bye] [2 word!]
== true

The rule checks for exactly two words, no more, no less.

>> parse [Hi Bye] [1 2 word!]
== true

The rule checks for not lower than 1 and not more than 2 words.

>> parse [Hi how are you? Bye] [0 5 word!]
== true

This block will be valid for between 0 and 5 words.
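The remaining counting keywords, SOME and OPT, can be tried in the same way. This is a minimal sketch, assuming REBOL 2 block parsing as described above; the blocks are made up for illustration.

```rebol
; SOME requires at least one match, ANY accepts none at all.
print parse [Hi Bye] [some word!]           ; true
print parse [] [some word!]                 ; false
print parse [] [any word!]                  ; true

; OPT makes the following rule optional.
print parse [Hi 36] [word! opt integer!]    ; true
print parse [Hi] [word! opt integer!]       ; true
```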

Because integers serve as counters, matching a literal integer value needs special care:

>> parse [36] [1 1 36]
== true

We need to give the minimum and maximum number of times that 36 should be matched.

If we add a different datatype to the block:

>> parse [Hi 36 Bye] [any word!]
== false

The parse mismatches as soon as it reaches 36 and stops. We need more rules!

Multiple Rules

To use multiple rules, simply append them to your rule block, but note that the sequence is important.

>> parse [Hi 36 Bye] [word! integer! word!]
== true

Now we can see how multiple rules will allow the parse to succeed, because:

  • The first value in the block is a word! datatype
  • The second value in the block is an integer! datatype
  • The third value in the block is a word! datatype

In other words, the block and the rule block are being traversed concurrently.

Let's add a little flexibility:

>> parse [Hi how are you? 36 Bye] [any word! integer! word!]
== true

The first rule will be run until it no longer matches. It reads 'Hi, 'how, 'are and 'you? as words. When it reaches the integer 36, it no longer matches and proceeds to the next rule and so forth.

There is an important difference between a mismatch and failure here:

  • Mismatch only happens if the rule has matched at least once.
  • Failure only happens if the rule has never matched the block element when a match was attempted.

An example of failure:

>> parse [36 Bye] [some word! integer! word!]
== false

This fails because the first rule never matched, and SOME requires at least one match. To make it work, change SOME to ANY:

>> parse [36 Bye] [any word! integer! word!]
== true

With ANY, words can occur zero or more times, so the first rule succeeds without matching anything.

You can also see here that there are more rules than block elements. This allows further flexibility: a rule that is allowed to match zero times consumes nothing, and the parse still returns TRUE.

Conditional Rules

Beyond the counting keywords, you can specify alternatives to try when one rule does not match the current element in the block. This is done by placing all the alternative rules in a block and separating them with the pipe sign:

[RULE-A | RULE-B | RULE-C]

Let's say you want to check both for integer! and decimal!. Example:

>> parse [36 Bye] [any word! [integer! | decimal!] word!]
== true
>> parse [37.2 Bye] [any word! [integer! | decimal!] word!]
== true

Each block behaves like one rule, so you can use the aforementioned counting keywords to specify counts for the rule block.
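For instance, a block of alternatives can be repeated with SOME or with an integer count. This is a sketch, assuming REBOL 2 block parsing; the sample blocks are made up for illustration.

```rebol
; One or more numbers of either kind, followed by a word.
print parse [1 2.5 3 Bye] [some [integer! | decimal!] word!]    ; true
print parse [Bye] [some [integer! | decimal!] word!]            ; false: no number present

; Exactly two numbers of either kind.
print parse [1 2.5] [2 [integer! | decimal!]]                   ; true
```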

Skipping data in the Parse Block

It's possible to use skip rules, which advance the position in the block to the first position that matches a given rule. We simply let PARSE ignore block elements until it finds the right one.

This is nice if you want to parse large amounts of data and don't care about the type or value of certain contents in the block.

Let's say we don't care about anything until we reach a word. That can be done with TO:

>> parse [37.2 38 Bye] [to word!]
== false

PARSE returns FALSE here: TO has advanced to the word (you can't tell from this example), but it stops in front of it, so the end of the block has not been reached. To finish the series, we have to proceed past the word by using THRU instead of TO.

>> parse [37.2 38 Bye] [thru word!]
== true

  • TO will ignore series elements up until the one element which fulfills the rule.
  • THRU will ignore series elements just past the one element which fulfills the rule.
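The difference between the two keywords can be made visible by marking the position that TO stops at. This is a minimal sketch, assuming REBOL 2; the variable name pos is hypothetical.

```rebol
; TO stops in front of the word, so a following word! rule can still consume it.
print parse [37.2 38 Bye] [to word! word!]    ; true

; Mark the position TO has reached: it is the word itself.
parse [37.2 38 Bye] [to word! pos: word!]
print first pos                               ; Bye
```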

Triggers

Each value in the parse block can trigger some REBOL code to be run. This is the power that allows you to use PARSE to build dialects.

To use code in a rule, put it in parentheses after the rule in question:

>> parse [Hi 36 Bye] [word! integer! (print "Thirtysix") word!]
Thirtysix
== true

Now, every time the rule has matched, the code is run. Let's see how this works for multiple matches of one rule:

>> parse [Hi 36 37 38 Bye] [word! any [integer! (print "Number Found")] word!]
Number Found
Number Found
Number Found
== true

You need to enclose the rule and the code in a block; otherwise ANY applies only to the rule and won't see the code.

For conditional rules:

parse [Hi 36 37.2 38 Bye] [
  word!
  any [integer! (print "Integer Found") | decimal! (print "Decimal Found")]
  word!
]
Integer Found
Decimal Found
Integer Found
== true

Using Data from the Parse Series

You can also use data from the series in your trigger code, both the values and the index of the current position. PARSE constantly keeps track of where it is in the block it's parsing.

This is done by placing a set-word (such as int:) in front of a rule: during parsing, the variable is set to the parse block at its current position.

parse [Hi 36 37.2 38 Bye] [
  word!
  any [int: integer! (print ["Integer" first int "Found"]) | decimal! (print "Decimal Found")]
  word!
]
Integer 36 Found
Decimal Found
Integer 38 Found
== true

We can apply the normal series functions to extract data from the series we are parsing.

parse [Hi 36 37.2 38 Bye] [
  any [
    int: integer! (print ["Integer" first int "Found"]) |
    dec: decimal! (print ["Decimal Found at position" index? dec]) |
    wrd: thru word! (print ["Word" first wrd "is near tail:" tail? next wrd])
  ]
]
Word Hi is near tail: false
Integer 36 Found
Decimal Found at position 3
Integer 38 Found
Word Bye is near tail: true
== true
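When only the matched value is needed rather than the position, the SET keyword of the PARSE dialect is a common alternative. It is not covered in the text above, so treat this as a hedged sketch assuming REBOL 2; the variable names greeting and number are hypothetical.

```rebol
; SET stores the value matched by the following rule in a variable.
parse [Hi 36] [set greeting word! set number integer!]

print [greeting number]    ; Hi 36
```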

Modifying a Parse Series

Since normal Series functions such as CHANGE, INSERT or REMOVE can be used on a parse series during a parse operation, it's also possible to manipulate the series during a parse. This can be useful for highly advanced search/replace functions.
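A simple search/replace can be sketched as follows, assuming REBOL 2. The names data and mark are hypothetical. The replacement has the same length as the text it replaces, so the parse position stays valid after the change.

```rebol
data: "a,b,c"

; MARK remembers where each comma starts; CHANGE overwrites it in place.
; SKIP steps over every character that is not a comma.
parse/all data [
    any [mark: "," (change mark ";") | skip]
]

print data    ; a;b;c
```

The /ALL refinement is used so that PARSE treats whitespace like any other character. Replacements that change the length of the series need extra care, because the parse position no longer lines up with the modified data.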

Complex Rules

It's possible to build very large rule sets, and when you do that, it can be nice to split it in smaller blocks and give them meaningful names.
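Rule blocks can be assigned to words and then referenced by name from other rules. The sketch below assumes REBOL 2; the rule names digit, number and addition are hypothetical and describe a made-up format of numbers joined by plus signs.

```rebol
digit: charset "0123456789"

; Each named rule builds on the ones before it.
number: [some digit]
addition: [number any ["+" number]]

print parse/all "1+22+333" addition    ; true
print parse/all "1+" addition          ; false: a number must follow "+"
```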

Troubleshooting

PARSE is a very powerful function, but it can also be troublesome, if you don't keep a good eye on what you are doing. If you are unlucky, PARSE can get stuck in an infinite loop, requiring you to restart REBOL.

But when exactly does it happen?

PARSE normally traverses through the block, but it's really the rules that make PARSE progress. If you specify a rule that won't make it progress, it will be stuck at the same point forever. Such a rule can be an empty block, an empty rule in a block with alternate rules or a NONE rule.

Examples:

>> parse "abc" [any []]
*** HANGS REBOL ***
>> parse "abc" [some ["a" | ]]
*** HANGS REBOL ***
>> parse "abc" [some [none]]
*** HANGS REBOL ***
>> parse "abc" [any char!]
*** HANGS REBOL ***

Note: to be able to escape from infinite loops, use () somewhere in the parse rules, for example:

 >> parse "abc" [any [()]]
 *** YOU CAN PRESS [ESC] NOW TO STOP THE LOOP ***
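One way to stay clear of such loops is to make sure every alternative inside a repeated block consumes at least one element. This is a minimal sketch, assuming REBOL 2; it uses SKIP as a fallback that always advances by one character.

```rebol
; Every iteration either matches "a" or skips one character,
; so the position always moves forward and the loop ends at the tail.
print parse "abc" [some ["a" | skip]]    ; true
```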

Debugging

Block Parsing Examples