Regular Expressions
Pikt regular expressions follow the usual regular expression rules, with the clarifications and amplifications given below.
Here are the regular expression operators:
OPERATOR   MEANING
a =~ b     string b matches at least one substring within a
a =~~ b    like the above, but without case sensitivity
a !~ b     string b matches no substring within a
a !~~ b    like the above, but without case sensitivity
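As a rough cross-check of these semantics outside Pikt, the four operators can be mimicked with Python's `re` module. This is only a sketch: the helper names are hypothetical, and Python's regexp engine is assumed to agree with Pikt's for simple patterns like these.

```python
import re

def matches(a: str, b: str) -> bool:
    """Rough analogue of  a =~ b : pattern b matches some substring of a."""
    return re.search(b, a) is not None

def matches_nocase(a: str, b: str) -> bool:
    """Rough analogue of  a =~~ b : the same test, ignoring case."""
    return re.search(b, a, re.IGNORECASE) is not None

def no_match(a: str, b: str) -> bool:
    """Rough analogue of  a !~ b : pattern b matches no substring of a."""
    return not matches(a, b)

def no_match_nocase(a: str, b: str) -> bool:
    """Rough analogue of  a !~~ b : the same test, ignoring case."""
    return not matches_nocase(a, b)

subject = "Error: disk full"
print(matches(subject, "disk"))            # substring match, case-sensitive
print(matches(subject, "error"))           # fails: case differs
print(matches_nocase(subject, "error"))    # succeeds once case is ignored
print(no_match_nocase(subject, "memory"))  # no such substring in any case
```

Note that the subject string is always on the left and the pattern on the right, which is the reverse of the argument order `re.search` expects.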
For example, all of the following are true:
"this is a test" =~ "is"
"this is a test" =~~ "IS"
"this is a test" !~ "THIS"
"this is a test" !~~ "that"
"this is a test" =~ ""
"" !~ "this is a test"
These characters have special meaning within Pikt regular expressions:
CHARACTER(S)   MEANING
.              matches any single character
*              matches zero or more instances of the preceding
               character/pattern
?              matches zero or one instance(s) of the preceding
               character/pattern
+              matches one or more instances of the preceding
               character/pattern
{m,n}          matches as few as m, or as many as n, instances of the
               preceding character/pattern
( )            enclose a subexpression, or set of subexpressions
               separated by |
|              separates subexpressions (think of "or")
[ ]            enclose a set of characters/character ranges
^              as the first character in a [ ] subexpression, indicates
               set negation; as the first character in a regular
               expression, anchors to the beginning of the string
               expression on the left-hand side of the regexp operator
$              anchors to the end of the string expression on the
               left-hand side of the regexp operator
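The behavior of each special character can be tried out in any engine that shares this syntax. The sketch below uses Python's `re` module as a stand-in for Pikt's regexp engine (an assumption; the two agree on these basic constructs, but the test strings and the `cases` table are made up for illustration).

```python
import re

# (pattern, subject, expected result) triples, one or two per special character
cases = [
    (r"t.st",        "test",    True),   # .     any single character
    (r"fo*bar",      "fbar",    True),   # *     zero or more of the preceding
    (r"colou?r",     "color",   True),   # ?     zero or one of the preceding
    (r"fo+bar",      "fbar",    False),  # +     needs at least one "o"
    (r"^a{2,3}$",    "aaa",     True),   # {m,n} between 2 and 3 instances
    (r"^a{2,3}$",    "aaaa",    False),  #       four is too many
    (r"^(cat|dog)$", "dog",     True),   # ( ) and |  alternation
    (r"[0-9a-f]",    "xyz7",    True),   # [ ]   set with character ranges
    (r"^[^0-9]+$",   "abc",     True),   # ^     inside [ ] negates the set
    (r"^is",         "this is", False),  # ^     anchors to the beginning
    (r"is$",         "this is", True),   # $     anchors to the end
]

def found(pattern: str, subject: str) -> bool:
    """True if the pattern matches some substring of the subject."""
    return re.search(pattern, subject) is not None

all_ok = all(found(p, s) == expected for p, s, expected in cases)
print(all_ok)
```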
In addition to user-specified character classes, Pikt supports these built-in predefined character classes:
[[:alnum:]]   the set of alphanumeric characters
[[:alpha:]]   the set of letters
[[:blank:]]   tab and space
[[:cntrl:]]   the control characters
[[:digit:]]   the decimal digits
[[:graph:]]   the printable characters except space
[[:lower:]]   the lower-case letters
[[:print:]]   the printable characters
[[:punct:]]   the punctuation characters
[[:space:]]   whitespace characters
[[:upper:]]   the upper-case letters
Backslash escapes suppress a character's specialness. So, "\\*" is a literal asterisk, and the following are all true:
"fo*bar" !~ "fo*bar" // left side literal string,
// right side regexp
"fo*bar" !~ "fo\*bar"
"fo*bar" =~ "fo\\*bar"
"fo*bar" =~ "\\*"
"*" =~ "\\*"
In any of the above left-hand expressions, you could substitute "fo\*bar", and the statements would all still be true.
Usually, just a single backslash is required for this purpose. In Pikt, however, backslashes are a general escape character. If, for example, you want to output the literal text string "$x" without the $x being interpreted as a variable (which Pikt would attempt to resolve to a value), you would use "\$x". So, if you require a backslash in the final product, you must supply double backslashes going in. Again, see the sample config files for examples of double-backslash usage.
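The two layers of escaping can be pictured as two passes over the string. The Python sketch below is only a model of that idea: `strip_escape_layer` is a hypothetical stand-in for the general escape pass, not Pikt's actual implementation, and Python's `re` stands in for the regexp engine that sees the result.

```python
import re

def strip_escape_layer(text: str) -> str:
    """Hypothetical model of a general escape pass: each backslash is
    dropped and the character after it is kept literally."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            out.append(text[i + 1])   # keep the escaped character only
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

# Raw strings keep the backslashes exactly as they would be typed.
single = strip_escape_layer(r"fo\*bar")    # backslash consumed by the first pass
double = strip_escape_layer(r"fo\\*bar")   # one backslash survives for the regexp

print(single)   # what the regexp engine sees with one backslash
print(double)   # what the regexp engine sees with two backslashes

# Only the doubled form reaches the regexp engine as an escaped asterisk.
print(re.search(single, "fo*bar") is None)
print(re.search(double, "fo*bar") is not None)
```

Under this model, the single-backslash form loses its backslash before the regexp is compiled, which is why it behaves exactly like the unescaped pattern.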
Note that every time a regular expression containing matching parentheses is invoked, for example in any of the following situations
dat "([^:]*):([^:]*)"
if $line =~ "^([^:]*):([^:]*)"
do #split($rdline, "([^:]*):([^:]*)")
you can reference the first parentheses-enclosed matched subexpression with $1, the second with $2, and so on. $0 references the entire matched subexpression.
Note well: The $0, $1, and so on only persist until the next regexp pattern match. The next time you use =~ (or any of the other regexp operators), or the next time you invoke the #split() function (in any of its forms), any previous $0, $1, ... values get supplanted by the values in the latest regexp. You will encounter many strange bugs unless you keep this in mind!
Alternate forms for referencing regexp matches are: $[0], $[1], $[2], and so on. These make it possible to reference the matched expressions within for loops:
set #n = #split($rdlin)
for #i=1 #i<=#n #i+=1
output $[#i]
endfor
Here is a technique for saving $0, $1, ... before a subsequent regexp action:
set #n = #split($rdlin)
for #i=1 #i<=#n #i+=1
set $f[#i] = $[#i]
endfor
...
if $f[3] =~ "cantata|sonata|toccata" // wipes out
// $3 & $[3] value
output $f[3]
fi
Better still is to use the #split() function (with all three arguments required) this way:
do #split($f, $rdlin, " ")
...
if $f[3] =~ "cantata|sonata|toccata" // wipes out
// $3 & $[3] value
output $f[3]
fi
If you failed to save the previous regexp values in the $f[] array and simply referenced $3 or $[3] after the =~ test, that value would be undefined, because the regexp in that test has no third parenthesized subexpression. Even if you had put ( )'s around a subexpression there (around "toccata", say), your previous $3 value would still be lost.
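The save-before-the-next-match pattern can be modeled outside Pikt as well. In the Python sketch below, the `LastMatch` class is a hypothetical holder that plays the role of $0, $1, ...; the sample line is invented, and the choice to clear the stored values on a failed match is an assumption, not something the text above specifies.

```python
import re

class LastMatch:
    """Hypothetical holder for the most recent match (stands in for $0, $1, ...)."""

    def __init__(self):
        self.groups = []   # groups[0] plays the role of $0

    def test(self, subject: str, pattern: str) -> bool:
        m = re.search(pattern, subject)
        # Each test replaces whatever was stored before.
        # Assumption: a failed match clears the stored values too.
        self.groups = [m.group(0), *m.groups()] if m else []
        return m is not None

last = LastMatch()
last.test("bach:toccata:565", r"^([^:]*):([^:]*):([^:]*)")
saved = list(last.groups)   # copy the values before the next regexp action

# This second test has no parenthesized subexpressions,
# so only the whole-match slot is left afterwards.
last.test(saved[2], r"cantata|sonata|toccata")

print(saved[3])             # the saved third field is still available
print(len(last.groups))     # the live holder no longer has a third group
```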
For further coverage of regular expressions, see the GNU RX info pages.