A Parse Tutorial Sort of (Rebol 2).

Introduction

If you want to extract data from strings (like HTML, TXT, CSV, etc.), consider Parse. If you just want to check some user data against a specific format, consider Parse. If you want to validate a message written in your new dialect, use Parse. Parse is useful. Parse is quick. This document is a very rough, show-by-example description of Parse with a few warnings thrown in. I recommend you read it in conjunction with the section on Parse (section 14) of the REBOL Core User Guide.

Comments to brett at codeconscious.com please.

The two modes of parse

Parse probably has about five modes of operation, but two general categories stand out:

Parsing character string input in a string breakapart mode.
Parsing character strings or blocks using rules (parse dialect).

Breaking strings apart is a useful function of parse; it is a handy utility.

Parsing input according to rules is a more sophisticated use of Parse. In this use of Parse you are more likely to be interpreting the input in some way, overlaying it with new meaning. That is, you have a string or block and you are perhaps identifying fields of records, or tokens of a language, or even identifying sections of a message protocol.

Parsing character string input (or binary data)

First up you have to decide whether you want Parse to handle whitespace characters for you or whether you will handle them yourself. Parse will handle the whitespace for you by default. If you specify the /all refinement REBOL's whitespace handling will be turned off. You must use /all to get correct results if your character data is actually binary type data.

What characters constitute whitespace? Here's my list. I generated it by using a function that plugged each into parse and checked for an effect. Could be wrong but should be pretty close:

{^A^B^C^D^E^F^G^H^-^/^K^L^M^N^O^P^Q^R^S^T^U^V^W^X^Y^Z^[^\^]^!^_ ^~ }

Most of these are control characters, however you can also see the usual suspects of tab #"^-", newline #"^/" and space #" ".

Operations on strings

Parse can operate on your data in one of three ways. Here are the modes using my own terminology to describe them:

String Breakapart mode - default delimiters
String Breakapart mode - specified delimiters
Parse Instruction mode

String Breakapart mode - default delimiters

With this use of parse, all you do really is supply a string and parse will break it up according to predefined delimiters. These delimiters (I believe) are:

delimiting-characters: {",;}

In this mode double quoted text is recognised. If parse encounters a double quote it will delimit the text at the next double quote instead of, say, breaking at a comma.

If you omit the /all refinement then whitespace will also be considered as a delimiter. Doing this can be useful for working with some types of delimited files such as CSV (though to deal properly with CSV files as exported by MS Excel requires more work).

Examples:

>> parse {123 "456 789" 012} none
== ["123" "456 789" "012"]

>> parse {123,456,"789,012" 345} none
== ["123" "456" "789,012" "345"]

Using the /all refinement means that whitespace handling is turned off.

>> parse/all {123 "456 789" 012} none
== [{123 "456 789" 012}]

>> parse/all {123,456,"789,012" 345} none
== ["123" "456" "789,012" " 345"]

String Breakapart mode - specified delimiters

In this mode you specify your own delimiters. These will replace the predefined delimiters and you lose the double quote functionality.

Examples:

>> parse {123,456*789;012} "*"
== ["123,456" "789;012"]

>> parse/all {123,456*789;012} "*"
== ["123,456" "789;012"]

>> parse {123,"456 *789";012} "*"
== [{123,"456} "" {789";012}]

>> parse/all {123,"456 *789";012} "*"
== [{123,"456 } {789";012}]

>> parse {the quick brown fox} {aeiou}
== ["th" "q" "" "ck" "br" "wn" "f" "x"]

>> parse/all {the quick brown fox} {aeiou}
== ["th" " q" "" "ck br" "wn f" "x"]

Parse Dialect

In this mode you give parse a rule block containing instructions to follow. These instructions allow you to utilise parse to interpret custom external formats or protocols. These instructions can be as simple or as complex as you need. A simple example would be to check some input against a postal code format. A sophisticated example is REBOL's XML parser. It uses this mode of parse to load in simple XML documents. I've used this mode of parse to interpret MIME format email messages.

The instructions are written according to the parse dialect. The instructions tell parse how to read through your input. In actual fact, the instructions describe the patterns that the input should take. Parse attempts to match the input against your patterns. Parse will return a TRUE result if your instructions accurately describe the input. If your instructions fail to describe the input (or looking at it the other way, the input fails to follow your rules) parse will return FALSE. You also have the ability to carry out normal REBOL operations as parse traverses the input and your rules.

It is very important to realise that the words of the Parse dialect are interpreted by Parse in a specific way and should be considered as being different in meaning to REBOL words when used at the console.

Through this description/tutorial thingy I'll use examples assuming a string input.

Let's start at the end

>> input-string: {}
>> parse input-string [end]
== true

Ah success! Here I am parsing an empty string. My rule says to parse "check that we are at the end". The result is of course TRUE because the string was empty to begin with.

This is similar in normal REBOL script to:

>> tail? input-string
== true

Baby steps

Next up let's test that a string matches our expectations:

>> input-string: "fox"
== "fox"
>> parse input-string ["fox" end]
== true

We successfully tested that the input started with "fox" and then finished. Ok no big deal. But reflect a moment. This is a sequence - first "fox" then END. As parse traverses the input and your rule block, it keeps track of a current position for both. So at the start, the current position in the input is at the head of the string. After the rule "fox" was matched the current position in the input string will be directly after the "x" of "fox". In this example, this happens to be the tail of the string, so the very next match rule END will succeed.

We do not always have to supply an END in the rule block. You can omit it in the last example because Parse effectively slaps one on at the end anyway.

>> parse input-string ["fox"]
== true

While you can do this for simple examples, remember you'll likely need to add it in explicitly for more complex rules.
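
One case where the explicit END matters is a set of alternatives where one alternative is a prefix of another. The following is a small sketch (the input "foxes" and the result words are only illustrations). It relies on Parse taking the first alternative that matches and not coming back to try the later ones once the whole rule block has succeeded:

```rebol
; Without END the first alternative "fox" succeeds and the rule block is done.
; Parse then reports false because "es" has not been consumed.
without-end: parse/all "foxes" ["fox" | "foxes"]        ; false

; With END in each alternative, "fox" END fails (we are not at the tail),
; so Parse backtracks and tries the second alternative, which succeeds.
with-end: parse/all "foxes" ["fox" end | "foxes" end]   ; true
```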

Ok back to the example again. In an ordinary REBOL session the above example is similar to the following:

>> input-string: find/match input-string "fox"
== ""
>> tail? input-string
== true

Note that the ordinary REBOL code examples through this article are provided to help learn PARSE. There are enough important differences between the Parse examples and the ordinary code examples that you cannot always treat them as exactly equivalent.

Failures / challenges

For contrast let's look at an unsuccessful match:

>> input-string: "dog"
== "dog"
>> parse input-string ["fox"]
== false

The meaning of this is pretty obvious. Hang on though, what actually happens when Parse encounters a failure with one of the rules? Well it backtracks the input to the point it was at when the rule started. So in REBOL code what happens is actually more like this:

input-string: "dog"
if position: find/match input-string "fox" [input-string: position]
tail? input-string

Keep this little idea in the back of your mind, it becomes more meaningful with more complex rules.
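
Here is a minimal sketch of backtracking in action. It uses alternatives (the "|" separator, covered in the next section), and the rule is made up purely for illustration:

```rebol
; First alternative: "do" matches, then "x" fails against the "g".
; Parse returns to the head of the input before trying the second
; alternative, so "dog" is matched from the start of the string,
; not from where "do" left off.
result: parse/all "dog" ["do" "x" | "dog"]    ; true
```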

Optional matching and Compound Rules

What if we want to check for a number of common pet alternatives? Let's accept a "dog" or a "cat" or indeed a bird:

>> input-string: "dog"
== "dog"
>> parse input-string ["dog" | "cat" | "bird"]
== true

In ordinary REBOL this is like coding:

input-string: any [
    find/match input-string "dog"
    find/match input-string "cat"
    find/match input-string "bird"
]
tail? input-string

Now, REBOL can be pretty concise and the ANY function definitely helps in writing concise code, but you can see already that the parse dialect is looking to be better suited to matching than ordinary scripting.

Reflecting on this a bit. We have here a more interesting rule. In fact we have a compound rule. Our compound rule is composed of three sub rules. Each of the three sub rules here is very basic but they are allowed to be compound rules themselves. The basic rules perform the lowest level matching of the input, the compound rules check the overall pattern (structure/grammar) of your data.

Back to options. What about something that may or may not exist at all? Using OPT we can indicate that the dog could be black or just leave it out:

>>  input-string: "black dog"
== "black dog"
>>  parse input-string [opt "black" "dog"]
== true

>> input-string: "dog"
== "dog"
>> parse input-string [opt "black" "dog"]
== true

Repetition - known range of occurrences

Time for some more compound rules.

Here's how to check for exactly two dogs.

>> parse "dog dog" [2 "dog"]
== true

Pretty cool eh? You can check for exactly 30 dogs in the same way. Hang on, you may object, there's a space in between the two dogs! True, but whitespace handling is in effect. If you use the /all refinement whitespace handling is not used and the space becomes a valid character to check for:

>> parse/all "dog dog" [2 "dog"]
== false

But now we don't just have two dogs, we have a dog a space and a dog:

>> parse/all "dog dog" ["dog" #" " "dog"]
== true

For the rest of these introductory examples I'll leave whitespace handling on.

I can specify between 1 and 3 dogs (inclusive) too:

>>  parse "dog" [1 3 "dog"]
== true

>> parse "dog dog dog" [1 3 "dog"]
== true
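
For contrast, here is a sketch of inputs that fall outside the range. The result words are just illustrative names, and the last line assumes a minimum of zero is accepted in the same way as any other count:

```rebol
too-many: parse "dog dog dog dog" [1 3 "dog"]   ; false, a fourth dog is left over
too-few:  parse "" [1 3 "dog"]                  ; false, at least one dog is required
zero-ok:  parse "" [0 3 "dog"]                  ; true, zero dogs is within the range
```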

Repetition again :) - unknown number of occurrences

What if we grab a net and go prawning? We may not know how many prawns are caught by the net when we catch them:

>> input-string: {} insert/dup input-string {prawn } random 100
== ""
>> parse input-string [some "prawn"]
== true

Excellent, we have some prawns but we don't know how many. The SOME keyword means "match one or more of the following". Again it is a compound rule because I could have as easily done this if it was "raining cats and dogs":

>>  input-string: "dog dog cat dog cat cat cat dog cat dog"
== "dog dog cat dog cat cat cat dog cat dog"
>>  parse input-string [  some [ "dog" | "cat" ]  ]
== true

If it fines up:

>> input-string: {}
== ""
>> parse input-string [  some [ "dog" | "cat" ]  ]
== false

It returns false because SOME requires at least one instance to be matched. If however, we don't actually care whether we get some or not use ANY:

>> input-string: {}
== ""
>> parse input-string [  any [ "dog" | "cat" ]  ]
== true

>> input-string: {dog cat}
== "dog cat"
>> parse input-string [  any [ "dog" | "cat" ]  ]
== true

Here then also is an example of one of those REBOL words with a new meaning in the context of Parse. In ordinary REBOL, ANY is a function that returns the first value in the block it is given that is neither false nor none. In Parse, by contrast, ANY is a keyword that introduces a compound rule that means, "match zero, one or many of the following".

Repeated Repetition

Now that I've introduced repetition and compound rules, what happens if I create a compound rule made up of nested repetition rules? Hmm, tricky.

This next example puts Parse into a spin - an infinite loop. The escape key will not work - only try it if you know how to kill a process using your operating system (e.g. in NT4 use task manager). A version you can quit with the escape key will be given later:

input-string: {}
parse input-string [ any [ any "dog" ] ]

To understand why this infinite loop happens you need to know when the ANY rule returns success and when it completes.

Here's the major answer: ANY ALWAYS returns success. ANY will keep calling its subrule while that subrule returns success. ANY gives up on receipt of bad news (failure) but it itself always returns success. Now if ANY always receives a success because its subrule is in fact another ANY... Well, I think that explains it.

Remember OPT. It always returns success just like ANY. So putting an OPT inside an ANY is bound to lead to trouble as well.

The point then is that your repetition compound rules must be carefully written because of the possibility of creating these infinite loops. It is not a bug in REBOL, it is a consequence of having a flexible parse dialect.

Sometimes these infinite loops start only after traversing lots of other complex rules and therefore can become hard to catch. I create these loops less often now since I started considering how I want Parse's "point of attention" to move. When writing your rules consider how the input is consumed by the rules.

That's part of the reason why I've been demonstrating the REBOL code similar to the various Parse examples.

Not all combinations of repetition rules create infinite loops:

>>  input-string: {}
== ""
>>  parse input-string [ any [ some "dog" ] ]
== true

This last example is ok because the SOME does not always return success. If SOME does not have at least one success it returns a failure result. So you can see that at some point, given that we can assume that the input is finite, the overall rule must terminate.

Quoting Ladislav, "The dangerous rules are rules, that don't consume any input, yet they return success."
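
One way to apply that advice is to make sure every successful pass through a repeated block consumes something. A rough sketch follows (safe-rule and the result words are illustrative names only):

```rebol
; Risky: [any [opt "black"]] can succeed without consuming any input.
; Safer: each successful pass of the inner block must consume a "dog",
; so the loop has to stop once the input runs out.
safe-rule: [any [opt "black" "dog"]]

with-dogs:  parse "black dog dog" safe-rule   ; true
empty-case: parse "" safe-rule                ; true, zero passes are fine for ANY
```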

REBOL versions based on Core 2.5.3 and later have another way to handle this infinite loop scenario - the BREAK keyword. BREAK terminates the rule when it is encountered. See the REBOL change documentation for examples.

Nothing here much

Check this:

>>  parse {} [none]
== true

The NONE keyword does nothing but is always successful. Other than that, you may as well forget it until you really need it. Oh and wrap a NONE within an ANY or a SOME and you get....lots and lots of wasted CPU cycles.

All these characters

Charset. Stands for character set. It is a bitset which I believe makes it fast for pattern matching operations.

Let's say you only want to check that your input contains the digits 0 to 9.

digit: charset [#"0" - #"9"]

Now parse can use this directly as a pattern matching instruction. It will match one character only of those in the set 0 - 9.

>> parse {1} [digit]
== true

Naturally enough you can use these in compound rules too:

An Australian postcode consists of 4 numeric digits so:

parse {2069} [4 digit]
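
Wrapped up as a small validation function, that might look like the sketch below. The name postcode? is my own invention, and I've used /all so that stray spaces are not quietly skipped:

```rebol
digit: charset [#"0" - #"9"]

; postcode? is a hypothetical helper name, not part of REBOL.
; /all turns whitespace handling off, so " 2069" would be rejected.
postcode?: func [text [string!]] [
    parse/all text [4 digit]
]

postcode? "2069"     ; true
postcode? "20690"    ; false, one digit too many
postcode? "20a9"     ; false, "a" is not in the digit set
```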

Maybe you want everything but digits:

non-digit: complement digit

>> parse {1} [non-digit]
== false

Charsets (bitsets) are sets and you can apply the set operations union, intersection, exclude, etc on them:

letter: charset [#"a" - #"z" #"A" - #"Z"]
digit: charset [#"0" - #"9"]
letter-or-digit: union letter digit
valid-name: [letter any letter-or-digit]

>>  parse {1abc} valid-name
== false
>>  parse {rebol} valid-name
== true
>>  parse {xyz1234} valid-name
== true

Moving right along...

Sometimes we really couldn't care less what lies between things of interest.

This example does not "skip c"; it reads "match a, skip a character, match c, tail?".

parse {abc} ["a" skip "c" end]

You want to skip 5 characters? Use repetition:

parse {1234567890} ["123" 5 skip "90" end]

Sometimes we don't know how much is in between but we do know what is the next interesting bit:

>> input-string: {1234 fox}
== "1234 fox"
>> parse input-string [thru "fox" end]
== true

This is like the REBOL code of:

>> input-string: find/tail input-string "fox"
== ""
>> tail? input-string
== true

We can stop where fox starts:

>>  input-string: "1234 fox"
== "1234 fox"
>>  parse input-string [to "fox" "fox" end]
== true

And the REBOL code that performs similarly:

input-string: {1234 fox}
input-string: find input-string "fox"
input-string: find/match input-string "fox"
tail? input-string

We can skip to the end as well:

>>  parse {123456} ["123" to end end]
== true

This says "match 123, move to the tail, test tail". Pretty obvious we would get a true result if you think of it in these terms.

While we're here how about another repetition warning. The rule [to end] moves to the tail and reports success every time. Put an ANY or SOME around it and you can guess what will happen (hint read repeated repetition section repeatedly).

But I want some information from it!

Up to this point I've concentrated on the various matching functionality of Parse. Of course though you want to extract information from your data. The keyword of note for this purpose is COPY. Also of use is the ability to execute REBOL code within the parse rules and thereby set and maintain REBOL variables (e.g. counters) using that code.

Ok COPY.

Copy is really really simple really. It is a compound rule that takes two arguments: a variable and a subrule. Whatever input the subrule matches gets copied into the variable. If the subrule fails, COPY returns the failure and leaves the variable unchanged.

Here the subrule is to match an "A" which obviously fails.

>> parse {123} [copy some-text "A"]
== false
>> some-text
** Script Error: some-text has no value
** Where: halt-view
** Near: some-text

Here the subrule is a simple skip:

>> parse {123} [copy some-text skip end]
== false
>> some-text
== "1"

And here the subrule is to match nothing NONE which is always successful so copy copies that which was matched... Well perhaps it should have been an empty string, but this is what happens (at least in REBOL/View 1.2.1):

>> parse "123" [copy some-text none]
== false
>> some-text
== none
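
COPY earns its keep when combined with TO, THRU and SKIP. Here is a minimal sketch that splits a "key=value" string; the input and the words field-name and field-value are made up for the illustration:

```rebol
matched: parse/all "name=brett" [
    copy field-name to "="      ; everything up to, but not including, the "="
    skip                        ; step over the "=" itself
    copy field-value to end     ; the rest of the input
]
; matched is true, field-name is "name" and field-value is "brett"
```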

Bring on the code

Ordinary REBOL code can be used inside the parse dialect via the use of "(" and ")" i.e. a Paren! series:

>> parse {} [(print "some code just executed") end]
some code just executed
== true

Obviously this is very handy. Nicer is that it runs according to its placement in the rule. Though note that even if the rule ultimately fails your code may have already run:

>>  parse {123} [
[      "1" (print "Found 1!")
[      "2" (print "found 2!")
[      "A" (print "found an A!")
[      end
[     ]
Found 1!
found 2!
== false

So the upshot is you can maintain counters and take actions based on your parse rules.
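
A counter, for instance, could be kept like this (a sketch only; dog-count is just an illustrative name, and remember that code in an alternative that is later abandoned will still have run):

```rebol
dog-count: 0
all-pets?: parse "dog dog cat dog" [
    some [
        "dog" (dog-count: dog-count + 1)   ; runs each time a "dog" is matched
        | "cat"
    ]
]
; all-pets? is true and dog-count is now 3
```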

Another interesting use for the Paren! is to enable the Escape key to work in the infinite loop situation described earlier by adding one within the looping part.

Taking the earlier example and adding a Paren! to it gives:

input-string: {}
parse input-string [ any [ () any "dog" ] ]

This will spin around until you hit the Escape key (Esc).

So during development it might be useful to put print statements in these allowing you to see what is happening and use the Esc key if necessary. Note though it is possible this behaviour could change in later versions of REBOL.

The current index and manipulating it

Parse maintains a reference to the input. The reference is a series and so has a current index.

Some special parse dialect syntax allows you to get and set this reference: a set-word (such as mark:) captures the current position into a word, and a get-word (such as :mark) tells parse to move to the position held by that word.

In this example I set the word "mark" to the input series at parse's current index. Don't worry about the false result; it is just saying we didn't get all the way through the input:

>>  parse {123456} ["123" mark:]
== false
>> mark
== "456"

I can manipulate the current index that parse uses too:

>>  parse {1234567} ["123" mark: (mark: next next mark) :mark "67"]
== true

To explain. First "123" is matched, then the word mark is set to the reference. Then the REBOL code between the parentheses is evaluated. This code advances the reference we hold by two characters. I return this modified reference to parse using the get-word syntax. Parse, seeing the get-word syntax, knows that it must update its reference to that given. Finally I match the "67".
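
Because the captured reference is an ordinary series, the usual series functions work on it. As a rough sketch (the input and the word equals-at are illustrative), this records where a delimiter was found:

```rebol
; TO leaves the current position at the "=", so mark refers to "=def".
parse/all "abc=def" [to "=" mark: to end]
equals-at: index? mark    ; 4, because "=" is the fourth character
```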

Whitespace handling again

Whitespace handling is a bit like changing:

[ your-rule-here ]

to

[ [ any whitespace your-rule-here] ]

so that:

input-string: "fox dog"
parse input-string ["fox" "dog"]

is similar to:

ws:  charset {^A^B^C^D^E^F^G^H^-^/^K^L^M^N^O^P^Q^R^S^T^U^V^W^X^Y^Z^[^\^]^!^_ ^~ }
input-string: "fox dog"
parse/all input-string [ [any ws "fox"] [any ws "dog"] ]

Parsing loaded values

This mode is used when the value to be parsed is a block, not a string. You use this mode when you have already loaded data into REBOL values. You write parse instructions in a rule block using the parse dialect, much as described for parsing strings, except that the semantics for parsing blocks are different and you have a couple more keywords to use.

This is the mode of parse that deserves the attention of anyone using REBOL. The reason is that you are free to store your data in a form that is understandable by yourself and others and yet is still computer readable.

An example that shows what can be achieved is Carl Sassenrath's stock transaction example which you can see below. Now what if "sell 300 shares at $89.08" came in via email?

If you study this example you will see that Carl, in a very small space, has created a small interpreter that parses, validates and performs computations. This is very powerful technology that is easily underestimated because it is so small and simple.

rule: [
    set action ['buy | 'sell]
    set number integer!
    'shares 'at
    set price money!
    (either action = 'sell [
            print ["income" price * number]
            total: total + (price * number)
        ] [
            print ["cost" price * number]
            total: total - (price * number)
        ]
    )
]

total: 0
parse [sell 100 shares at $123.45] rule
print ["total:" total]

total: 0
parse [
    sell 300 shares at $89.08
    buy 100 shares at $120.45
    sell 400 shares at $270.89
] [some rule]
print ["total:" total]

Another powerful example of this is the VID dialect of REBOL/View. VID describes in an effective but simple way what should appear on screen. VID is actually a block using normal REBOL values such as words and strings. The LAYOUT function of REBOL/View takes a VID block as an argument to construct the visual objects. Layout uses parse to process the VID specification.
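
On a much smaller scale, here is a sketch of a made-up settings dialect, where each entry is a word followed by an integer. The dialect and the names settings, pair-rule, good-input and bad-input are all hypothetical:

```rebol
settings: copy []

; SET stores the matched value in a word; word! and integer! match by datatype.
pair-rule: [set key word! set val integer! (repend settings [key val])]

good-input: parse [width 100 height 50] [some pair-rule]
; good-input is true and settings is [width 100 height 50]

bad-input: parse [depth "deep"] [some pair-rule]
; bad-input is false: "deep" is a string!, not an integer!,
; so the paren is never reached and settings is left alone
```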

Special situations

When you do NOT want to match a pattern

One situation where you might do this is when you have a sub rule that might "consume" something needed by an enclosing rule.

I have come across this sort of problem a few times and I thank Ladislav for showing me a solution.

For my example, I'll parse a block rather than text but the concept still applies. I want to parse the following block, and print out every word, but if I encounter a "|" I'll print out the text "**********":

my-block: [ the quick brown fox | jumped | over the lazy]

This next bit of code will not work. If you try it you will see that there are no "*"s printed, instead you will see the "|":

single-word: [set item word! (print mold item)]
phrase: [some single-word]
parse my-block [ phrase some ['| (print "**********") phrase] ]

The thing to note is that "|" is a word too. Therefore the "|" is "consumed" by the rule called SINGLE-WORD. So one way to solve this is to give SINGLE-WORD some indigestion (make it fail) when it encounters a "|". To do this I will use a dynamic rule - a rule that is modified as parse is executing.

To force a rule to fail, make sure it cannot match anything any more. A way to ensure this is to try a skip after the end of the input. This can never work: if we are not at the end, the END will fail; if we are at the end, the skip will fail. So this rule is guaranteed to fail every time:

always-fails: [end skip]
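
A quick sketch to check both cases (the result words are just illustrative names):

```rebol
always-fails: [end skip]

at-head: parse "abc" always-fails   ; false, END fails because input remains
at-tail: parse "" always-fails      ; false, END matches but there is nothing to SKIP
```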

Using this I now wrap SINGLE-WORD with a rule I call WORD-EXCEPT-BAR. The purpose of this new rule is to fail if it finds the "|" word; otherwise it goes ahead and runs SINGLE-WORD. I also need to modify PHRASE to call WORD-EXCEPT-BAR. The dynamic rule I mentioned earlier is called WEB. Here are the rules, with the complex one split over multiple lines to improve readability:

phrase: [some word-except-bar]
word-except-bar: [
    [
        '| (web: :always-fails)   |    (web: :single-word)
    ]
    web
]

Another way to describe the PHRASE rule, as it is now, is "a rule that matches a series of words which does not contain the word |."

To finish off I'll create a function to call parse with the correct rule and wrap the whole lot in an object just to be tidy:

word-parsing-object: context [
    always-fails: [end skip]
    single-word: [set item word! (print mold item)]
    word-except-bar: [
        ['| (web: :always-fails) | (web: :single-word)]
        web
    ]
    phrase: [some word-except-bar]
    set 'parse-words func[ a-block [block!] ] [
        parse a-block [ phrase some ['| (print "**********") phrase] ]
    ]
]

Here is a test run:

>> parse-words [the quick brown fox | jumped | over the lazy]
the
quick
brown
fox
**********
jumped
**********
over
the
lazy
== true

In summary in this section I have demonstrated how one can match a specific pattern even when a more general pattern (that includes the specific pattern) gets to see the input first.

Why didn't you just write...

parse-words: func [a-block [block!]] [
    parse a-block [
        some [
            '| (print "**********") |
            set item word! (print mold item)
        ]
    ]
]

The point was to demonstrate [end skip] and dynamic rules because there are situations when you need them.

The BREAK keyword

From RT's changes document:

When the BREAK word is encountered within a rule block, the block is
immediately terminated regardless of the current input pointer.
Expressions that follow the BREAK within the same rule block will not
be evaluated.

In this example the SOME rule is exited early:

>> parse "X" [some [ (print "*Break*") break] "X"]
*Break*
== true

Here again the SOME rule is exited early just like the previous example. In this case the rule that SOME is processing is referred to by a word:

>> rule-to-break: [(print "*Break*") break]
== [(print "*Break*") break]
>> parse "X" [some rule-to-break "X"]
*Break*
== true

This case produces an infinite loop because the BREAK is within a sub-rule of the rule that SOME is processing. The BREAK does not affect success/failure status or the input pointer; it just exits a rule early:

>> rule-to-break: [(print "*Break*") break]
== [(print "*Break*") break]
>> parse "X" [some [rule-to-break] "X"]
*Break*
*Break*
...
*Break*
*Break*(escape)

Related toolset

I have written the "Parse Analysis Toolset" to help learn and analyse the way Parse works. The Explain-parse function of the toolset should help with learning Parse. The script has related documentation. You can find the script and a link to the documentation at:

parse-analysis.r (at REBOL.org Script Library)

Building on this toolset I've got a visual token highlighter. The script and the documentation for it are at:

parse-analysis-view.r (at REBOL.org Script Library)

Screenshot of Token Stepper highlighting an ABNF rule using abnf-parser.r:


One more program I've made can return a parse tree of your input:

load-parse-tree.r (at REBOL.org Script Library)

Comments

Parse is a key component of REBOL. REBOL is promoted as a messaging language. Messages can come in many formats (syntaxes). Parse allows you to define the syntax of a message so that you can interpret the message and transform it to something else or act on it directly. That may sound complex, but it isn't really.

What are messages? Lots of things can be considered as messages. Basically if you can put it into a file and the format of the file has some rule to it, then I think you have a message. You don't have to put it in a file though to use parse. REBOL's networking functions use parse to interpret many of the internet protocols that REBOL provides access to.

With REBOL you can define a mini-language (a language designed for a particular purpose). Parse helps you to validate and process such mini-languages. You might want to design a mini-language for creating web pages on your internet site. Or perhaps for controlling a special device you have attached to your computer.

Even if you don't go this far, parse's delimit mode will be useful for you just as a string-breakapart utility.