StreamWeaver Modules
The Match module does regular-expression pattern matching on incoming
text, in a somewhat similar fashion to the command-line tool 'grep', but with some significant differences. [If you have no idea what a "regular expression" is, this is not the place for an in-depth review, but you may
get the idea from what follows.]
The matching process here is 'string-oriented', rather than line-oriented
as grep is. When grep finds a match, it returns the whole line containing
the match, and it can't match across lines. Match, on the other hand,
returns the exact string that was matched, and if you wish this can span
multiple lines. Or, if you prefer, you can slice out sections of the match
and output them in a different format.
You can choose whether to find the "longest" or "shortest"
matching string, and there are several other options (described under
'Settings'). Unlike any other common string-matching application, Match
is UTF8-aware: UTF8 multibyte characters can occur anywhere in the
regular expression or the text being matched.
The module is multiconnector, with one input (the incoming text), and
two outputs -- the text pass-through and the match results. The module's
panel has two text-entry fields, one for the regular expression to be
matched, the other to (optionally) determine the output format, and a 'Settings' menu. These are described in respective sections below.
Regular expressions in Match follow the same form as grep and other
applications such as awk or perl but, as each of these has its own quirks,
so does Match. It uses the set of special characters that has become
standard, but doesn't have extensions like 'interval expressions' ( "{m,n}")
or predefined range specifiers ("[:ALNUM:]"). A newline may be explicitly
or implicitly included in an expression, allowing it to match across lines.
Successive matches will never overlap -- the scan resumes at the
character following a match.
The special characters are . * + ? ^ | $ ( ) [ \
The dash - and right-bracket ] are special after
a left-bracket, but elsewhere they are simply themselves. All other
characters, including all extended UTF8 characters, are literal, representing
themselves.
Two particular 'literal' characters need special represntation, to be able
to be entered into the TextControl. One is of course 'newline' itself which
can be represented with the usual \n pair. The other is 'tab',
represented by \t . If you need any special character to
appear as a literal in the expression, precede it with a backslash.
Each Special Character has meaning as follows:
- Period '.'
- The period represents any single
character, except -- in the default case -- a newline. A menu
setting (see later) can affect this behaviour, so that it will match a
newline as well, enabling matches across multiple lines.
- Asterisk '*'
- The asterisk represents zero or more (i.e.
any number of) occurrences of the previous 'item'. An item is either
a single literal character or a sub-expression enclosed in parentheses.
- Plus '+'
- Represents one or more
occurrences of the preceding item.
- Question-mark '?'
- Represents exactly zero or
one occurrences of the preceding item.
- Vertical Bar '|'
- The Bar denotes 'alternatives'.
Either the expression preceding the bar or the one after (or
both of course) must match. Note that the alternatives are expressions,
not items: if you want just parts of the whole expression to participate
in the alternation, enclose the whole alternation section in parentheses.
- Caret '^'
- The caret should only be used as the
first character of an entire expression (or inside a set, where it has a
different meaning -- see below). It forces the match to be only at the
beginning of a line (the very beginning of the text will also match).
Note that it does not match an actual newline
at that position. It anchors the match immediately after it.
If you do want to include a newline character in the expression,
use '\n' at the appropriate point.
- Dollar '$'
- In contrast to the caret, the dollar
should appear only at the end of the expression. It anchors
the match to the end of a line (again not matching the newline itself).
You can choose (via the Settings menu) whether the very end of the
text (when end-of-data arrives) is considered end-of-line even if there
is no newline there.
- Parentheses '( )'
- Parentheses group segments
of an expression, either to form an 'item' for a repetition operator, to
enclose an alternation, or simply to mark a subsegment of the entire
match that can be extracted for output by the Template (below).
- Brackets '[ ]'
- Square brackets enclose a set of
characters, one of which must match the current scanned text character
for the match to succeed.. A range of characters may be specified
(ascii/unicode ordering) by giving the characters at each end of the
range separated with a dash (the bounding characters may be multibyte
UTF8). If the first character after the
left-bracket is a caret ( ^ ) the bracketed expression matches
any character not in the set; a caret anywhere else within the
brackets just represents itself. If you want a dash (or a right-bracket)
in the set, precede it with a backslash
- Backslash '\'
- A backslash can precede any
special character to change it into a literal. To get a literal backslash,
use a pair of them. As mentioned, \n and \t have
special meanings -- 'newline' and 'tab' respectively.
- hello
- ... matches the exact string "hello" wherever
it appears in the input stream.
- ^(hello|hi).*$
- ... matches -- and returns -- an entire
line (anchored to both beginning and end by ^ and $)
that begins with either "hello" or "hi" followed by any number of other
characters.
- [0-9.]+
- ...matches any string containing only digits
and periods (without regard to numeric rationality!) Be aware that matches
such as this will beave differerently if "Shortest" match is selected in
the Settings menu.
- Merry Christmas (Ho! )*
- ... finds that greeting,
ending with any number of "Ho! Ho! Ho!"s...
See 'Example Configurations' below for other possibilities.
By default, when a match to the regular expression is found in the
incoming text stream, it is simply sent verbatim out the 'Matches' output.
Other information about each match is available, however, such as its
byte position in the stream and the segments of the entire match that
correspond to groups in the expression. You may want to output a
formatted string containing some of this data, rather than just the match
itself. The Template field of the panel lets you do this. If you enter a
string here it will be used to format the output; deleting all text from the Template field returns to default unformatted match output, .
The template string can have both plain text and 'data selector' elements.
Each of the latter is a backslash followed by a single selection character.
When a match is found, each data selector is equated to its
appropriate value for that match, and the template is sent in sequence
to the output, with plain text going out as is, and each selector
replaced by its value.
You can use a particular selector more than once in the same template
if you need to. (There is an overall limit of 20 selectors, but that should
suffice...)
The possible data selectors are:
- \0
- (that's "zero", not "oh") represents the entire matched string.
- \1 ... \9
- These selectors evaluate to the strings
that match successive parenthesized subexpressions in the overall
regular expression. Order is determined by the order of the left
parentheses. Thus (with not-very-useful fixed expression elements
for demonstration), if the expression ((abc(de)) finds a match,
- \1 will contain abcde
- \2 will contain abc
- \3 will contain de
- \4 and beyond will be empty.
- \b
- This is set to the beginning of the match --
i.e the (integer) byte position in the text stream of its first character.
- \e
- This is set to the end of the match --
the byte position immediately after the last matched character.
- \n and \t
- let you output newline and tab
respectively. Be aware that the 'Newline' option in the Settings menu
is ignored if a template is in effect.
You can modify matching behaviour in a number of ways through
the 'Settings' menu, which contains a number of toggleable items
described below. All are initially set "off".
- Shortest
-
By default, each match found is the longest possible one at that point.
In many cases this isn't what you want, especially when matching
across multiple lines. For example: ^.*$, which you
might use intending it to match each full line, will instead find
all the text as the longest match if 'Newline is any' (below)
is set! To get the minimum length match, set 'Shortest'.
- Case Insensitive
-
By default, all literal characters must match exactly. Set this option
to ignore the case of characters.
- Period matches Newline
-
BY default, a period ('.') in a regular expression matches any character
except a newline. Setting this option makes it match newline
as well. If you set this, you will probably want to use 'Shortest' also.
- $ matches End-of-Text
-
If you want to ensure that an end-of-line is always seen when end-of-data
is received, even if no newline precedes it, set this option.
- Add Newline
-
By default, matches are output exactly as found, so there will be no
newline at the end unless it was matched. To force a newline after
each match, set this option.
When you change either the current Regular Expression or Template,
the module has to terminate any scan in progress to recompile things.
If it has already processed input in the current scan, it will send an
End-of-Data; otherwise it keeps quiet. You can change the Settings
menu at any time, though. Remember too that a changed entry in
either text field only takes effect when you either hit <Enter> or
click on another field.
Because Match is string- rather than line-oriented, it does not keep
track of input lines. Therefore, though byte position is available,
line position is not.
While it is determining a match, the module has to store all the pending
text. In the worst case this might mean holding on to the entire incoming
stream, but more usually it can be discarded once it has been output or
determined not to match.
Although Match is as immediate as possible in its output, the nature of
the algorithm means that it doesn't know it has a full match until the first
non-matching character arrives. This is not usually serious, but may
affect you if you are trying to catch specific characters on the fly.
The 'Shortest' option in the Settings menu determines whether the
overall matched string will be the longest or shortest possible, but
the spans of parenthesized subexpressions within the string are
always the longest possible (given the match to that point).
Thus (.*)X(.*)$
will match the line "abcXdefXpqr" with \1 as
"abcXdef" and \2 as "pqr", whether
or not 'Shortest' is set. (Unlike perl, there is no way of selecting "lazy"
or "greedy" matching withing an expression.)
Match is no speed demon compared to grep, though it is fast enough
for many purposes. If you have a lot of text to scan, and don't need
Match's special features, you may be better off using grep in a PipeStream
element. On the other hand, it doesn't pretend to supplant full-fledged
languages such as awk or perl, though you may find it hard to get
immediate output from those. Choose your weapon wisely.
Here's a very basic configuration intended to let you play with your
own Match expressions and templates. In the centre is a Match
element, fed from a StreamView where you can enter test text, and
a ReadFile to supply entire text files for scanning. Output just goes to
another StreamView for display. (Another StreamView is included
but with a closed window. This will show the original text being scanned
if you want to open it.) You can elaborate with additional elements as
you like.
As supplied, it has the very
minimal expression ".*" and no Template.
Settings options are 'Case Insensitive' and '$ Matches End-of-Text'.
The resulting action is that any line of text will be matched in its entirety
except for the terminating newline ("longest string of any character
except newline") and output as found.
To get each match on its own line, select the 'Add Newline' Settings option.
You can alternatively change the expression to ".*\n" so that the
newline gets matched as well, but you will then notice some of the vagaries
of the matching algorithm... it gets way behind in deciding that it actually
has a match! You can improve things by setting 'Shortest' as well, but you
will still see output one character behind. 'Shortest' is of course a futile
setting with the original string -- I hope you realize why.
This is just the previous configuration with a perhaps more useful Match
expression. It extracts comments from C or C++ source code.
I'll leave you to deconstruct the expression yourself -- it is just one
subexpression for each of the two comment types, joined as alternatives
by the vertical bar. The asterisks found in a C-style comment have to be
escaped, of course.
Notice the additional Settings, though. 'Shortest' and 'Period matches
Newline'
are now set, so that the scan will find the shortest string that satisfies the
expression, even if it is a C-style comment that extends over several lines.
'Add Newline' is also set because the match will not include any newlines
outside the comment itself.
Again the same configuration, but this time with both a Match expression
and a Template to format the results. This one finds all link tags in an
HTML document -- i.e. tags that begin "<A HREF=...>" and end with
"</A>".
These Settings options are selected this time:
'Case Insensitive', so that the tags can
be upper or lower case, and 'Shortest' and 'Period matches Newline',
as we want to span multiline tags. 'Add Newline and '$ matches End-of-Text'
are irrelevant with this match string and Template.
In addition to simply finding the links -- even if they extend over several
lines -- the example extracts their relevant segments and reorders these
for output. There are two segments of interest here: the string immediately
following "HREF=" which at least includes the URL being linked to (other
attributes may be present -- they get grabbed too), and the Link label
itself between the opening and closing tags. The Match expression is
set to extract these as sub-matches, and the Template outputs these
in reverse order: the label first, a separating "===>", and then the URL.
Here is the expression, in all its ugliness:
<a +href=([^>]*)>(.*)</\n?a\n?>
The first few characters are a straightforward match of the beginning
of the link tag, with possibly several spaces between the "a" and the "href".
Within the first set of parentheses is the first sub-match expression: this
is a 'negated set' -- any character except ">" -- repeated any number
of times. Beyond this comes the ">" tag closure that must be matched.
The second sub-match is all the characters up to but not including the
closing tag. The part of the expression representing what ought to be
simply "</A>" is more complicated than you would expect because
I believe I've seen HTML documents in which lines got wrapped here;
in any case allowing optional newlines ("\n?") here doesn't hurt,
and exercises another special character! There probably should be similar
protection on the opening tag, but that's an exercise for the reader.
[As a possible matter of iterest, I used a similar diagram to quickly
extract all the headers from this document and turn them into links and
anchors. A small amount of subsequent editing and some cutting and pasting
resulted in 'anchored' headers and the contents list at the top of the page.]
This last example adds some post-processing to the output of the Match
element. The Match expression and Template between them (deconstruction
left to the reader) attempt to separate the incoming text into "words",
and output each on its own line. This is fed to a 'sort' command (the '-u'
option discards duplicates) so at the end you have a sorted list of all
the words in the document. The example's idea of a "word" is not always
what yours would be, but it does its best. Try variations on the supplied
expression to see if you can improve things.
Another thing this example demonstrates is unfortunately that Match is
not lightning fast! Don't start off with a long document as input...
(Of course you won't see the sorted output until the very end, but you can
watch the words scrolling by in the other window.)
The Match module and this documentation are Copyright 1999 by
Peter Goodeve, All Rights Reserved.
The regular expression matching code is derived from original code
written by Rob Pike for the 'sam' editor. This has the following Copyright
notice:
/*
* The authors of this software are Rob Pike and Howard Trickey.
* Copyright (c) 1998 by Lucent Technologies.
* Permission to use, copy, modify, and distribute this software for any
* purpose without fee is hereby granted, provided that this entire notice
* is included in all copies of any software which is or includes a copy
* or modification of this software and in all copies of the supporting
* documentation for such software.
* THIS SOFTWARE IS BEING PROVIDED "AS IS", WITHOUT ANY EXPRESS OR IMPLIED
* WARRANTY. IN PARTICULAR, NEITHER THE AUTHORS NOR LUCENT TECHNOLOGIES MAKE ANY
* REPRESENTATION OR WARRANTY OF ANY KIND CONCERNING THE MERCHANTABILITY
* OF THIS SOFTWARE OR ITS FITNESS FOR ANY PARTICULAR PURPOSE.
*/
========================