StreamWeaver Modules

Match

Introduction

The Match module does regular-expression pattern matching on incoming text, in a somewhat similar fashion to the command-line tool 'grep', but with some significant differences. [If you have no idea what a "regular expression" is, this is not the place for an in-depth review, but you may get the idea from what follows.]

The matching process here is 'string-oriented', rather than line-oriented as grep is. When grep finds a match, it returns the whole line containing the match, and it can't match across lines. Match, on the other hand, returns the exact string that was matched, and if you wish this can span multiple lines. Or, if you prefer, you can slice out sections of the match and output them in a different format.

You can choose whether to find the "longest" or "shortest" matching string, and there are several other options (described under 'Settings'). Unlike any other common string-matching application, Match is UTF8-aware: UTF8 multibyte characters can occur anywhere in the regular expression or the text being matched.

The module is multiconnector, with one input (the incoming text), and two outputs -- the text pass-through and the match results. The module's panel has two text-entry fields, one for the regular expression to be matched, the other to (optionally) determine the output format, and a 'Settings' menu. These are described in respective sections below.

Regular Expressions

Regular expressions in Match follow the same form as grep and other applications such as awk or perl but, as each of these has its own quirks, so does Match. It uses the set of special characters that has become standard, but doesn't have extensions like 'interval expressions' ( "{m,n}") or predefined range specifiers ("[:ALNUM:]"). A newline may be explicitly or implicitly included in an expression, allowing it to match across lines. Successive matches will never overlap -- the scan resumes at the character following a match.

The special characters are . * + ? ^ | $ ( ) [ \
The dash - and right-bracket ] are special after a left-bracket, but elsewhere they are simply themselves. All other characters, including all extended UTF8 characters, are literal, representing themselves.

Two particular 'literal' characters need special represntation, to be able to be entered into the TextControl. One is of course 'newline' itself which can be represented with the usual \n pair. The other is 'tab', represented by \t . If you need any special character to appear as a literal in the expression, precede it with a backslash.

Each Special Character has meaning as follows:

Examples:
See 'Example Configurations' below for other possibilities.


Templates

By default, when a match to the regular expression is found in the incoming text stream, it is simply sent verbatim out the 'Matches' output. Other information about each match is available, however, such as its byte position in the stream and the segments of the entire match that correspond to groups in the expression. You may want to output a formatted string containing some of this data, rather than just the match itself. The Template field of the panel lets you do this. If you enter a string here it will be used to format the output; deleting all text from the Template field returns to default unformatted match output, .

The template string can have both plain text and 'data selector' elements. Each of the latter is a backslash followed by a single selection character. When a match is found, each data selector is equated to its appropriate value for that match, and the template is sent in sequence to the output, with plain text going out as is, and each selector replaced by its value. You can use a particular selector more than once in the same template if you need to. (There is an overall limit of 20 selectors, but that should suffice...)

The possible data selectors are:


Settings

You can modify matching behaviour in a number of ways through the 'Settings' menu, which contains a number of toggleable items described below. All are initially set "off".

Shortest
By default, each match found is the longest possible one at that point. In many cases this isn't what you want, especially when matching across multiple lines. For example: ^.*$, which you might use intending it to match each full line, will instead find all the text as the longest match if 'Newline is any' (below) is set! To get the minimum length match, set 'Shortest'.
Case Insensitive
By default, all literal characters must match exactly. Set this option to ignore the case of characters.
Period matches Newline
BY default, a period ('.') in a regular expression matches any character except a newline. Setting this option makes it match newline as well. If you set this, you will probably want to use 'Shortest' also.
$ matches End-of-Text
If you want to ensure that an end-of-line is always seen when end-of-data is received, even if no newline precedes it, set this option.
Add Newline
By default, matches are output exactly as found, so there will be no newline at the end unless it was matched. To force a newline after each match, set this option.

Subtleties and Cautions

When you change either the current Regular Expression or Template, the module has to terminate any scan in progress to recompile things. If it has already processed input in the current scan, it will send an End-of-Data; otherwise it keeps quiet. You can change the Settings menu at any time, though. Remember too that a changed entry in either text field only takes effect when you either hit <Enter> or click on another field.

Because Match is string- rather than line-oriented, it does not keep track of input lines. Therefore, though byte position is available, line position is not.

While it is determining a match, the module has to store all the pending text. In the worst case this might mean holding on to the entire incoming stream, but more usually it can be discarded once it has been output or determined not to match.

Although Match is as immediate as possible in its output, the nature of the algorithm means that it doesn't know it has a full match until the first non-matching character arrives. This is not usually serious, but may affect you if you are trying to catch specific characters on the fly.

The 'Shortest' option in the Settings menu determines whether the overall matched string will be the longest or shortest possible, but the spans of parenthesized subexpressions within the string are always the longest possible (given the match to that point). Thus (.*)X(.*)$ will match the line "abcXdefXpqr" with \1 as "abcXdef" and \2 as "pqr", whether or not 'Shortest' is set. (Unlike perl, there is no way of selecting "lazy" or "greedy" matching withing an expression.)

Match is no speed demon compared to grep, though it is fast enough for many purposes. If you have a lot of text to scan, and don't need Match's special features, you may be better off using grep in a PipeStream element. On the other hand, it doesn't pretend to supplant full-fledged languages such as awk or perl, though you may find it hard to get immediate output from those. Choose your weapon wisely.


Example Configurations

Basic Match
Here's a very basic configuration intended to let you play with your own Match expressions and templates. In the centre is a Match element, fed from a StreamView where you can enter test text, and a ReadFile to supply entire text files for scanning. Output just goes to another StreamView for display. (Another StreamView is included but with a closed window. This will show the original text being scanned if you want to open it.) You can elaborate with additional elements as you like.

As supplied, it has the very minimal expression ".*" and no Template. Settings options are 'Case Insensitive' and '$ Matches End-of-Text'. The resulting action is that any line of text will be matched in its entirety except for the terminating newline ("longest string of any character except newline") and output as found.

To get each match on its own line, select the 'Add Newline' Settings option. You can alternatively change the expression to ".*\n" so that the newline gets matched as well, but you will then notice some of the vagaries of the matching algorithm... it gets way behind in deciding that it actually has a match! You can improve things by setting 'Shortest' as well, but you will still see output one character behind. 'Shortest' is of course a futile setting with the original string -- I hope you realize why.

Comments
This is just the previous configuration with a perhaps more useful Match expression. It extracts comments from C or C++ source code.

I'll leave you to deconstruct the expression yourself -- it is just one subexpression for each of the two comment types, joined as alternatives by the vertical bar. The asterisks found in a C-style comment have to be escaped, of course.

Notice the additional Settings, though. 'Shortest' and 'Period matches Newline' are now set, so that the scan will find the shortest string that satisfies the expression, even if it is a C-style comment that extends over several lines. 'Add Newline' is also set because the match will not include any newlines outside the comment itself.

HTML Links
Again the same configuration, but this time with both a Match expression and a Template to format the results. This one finds all link tags in an HTML document -- i.e. tags that begin "<A HREF=...>" and end with "</A>".

These Settings options are selected this time: 'Case Insensitive', so that the tags can be upper or lower case, and 'Shortest' and 'Period matches Newline', as we want to span multiline tags. 'Add Newline and '$ matches End-of-Text' are irrelevant with this match string and Template.

In addition to simply finding the links -- even if they extend over several lines -- the example extracts their relevant segments and reorders these for output. There are two segments of interest here: the string immediately following "HREF=" which at least includes the URL being linked to (other attributes may be present -- they get grabbed too), and the Link label itself between the opening and closing tags. The Match expression is set to extract these as sub-matches, and the Template outputs these in reverse order: the label first, a separating "===>", and then the URL.

Here is the expression, in all its ugliness:

<a +href=([^>]*)>(.*)</\n?a\n?>

The first few characters are a straightforward match of the beginning of the link tag, with possibly several spaces between the "a" and the "href". Within the first set of parentheses is the first sub-match expression: this is a 'negated set' -- any character except ">" -- repeated any number of times. Beyond this comes the ">" tag closure that must be matched.

The second sub-match is all the characters up to but not including the closing tag. The part of the expression representing what ought to be simply "</A>" is more complicated than you would expect because I believe I've seen HTML documents in which lines got wrapped here; in any case allowing optional newlines ("\n?") here doesn't hurt, and exercises another special character! There probably should be similar protection on the opening tag, but that's an exercise for the reader.

[As a possible matter of iterest, I used a similar diagram to quickly extract all the headers from this document and turn them into links and anchors. A small amount of subsequent editing and some cutting and pasting resulted in 'anchored' headers and the contents list at the top of the page.]

Sort Words
This last example adds some post-processing to the output of the Match element. The Match expression and Template between them (deconstruction left to the reader) attempt to separate the incoming text into "words", and output each on its own line. This is fed to a 'sort' command (the '-u' option discards duplicates) so at the end you have a sorted list of all the words in the document. The example's idea of a "word" is not always what yours would be, but it does its best. Try variations on the supplied expression to see if you can improve things.

Another thing this example demonstrates is unfortunately that Match is not lightning fast! Don't start off with a long document as input... (Of course you won't see the sorted output until the very end, but you can watch the words scrolling by in the other window.)


Copyrights:

The Match module and this documentation are Copyright 1999 by Peter Goodeve, All Rights Reserved.

The regular expression matching code is derived from original code written by Rob Pike for the 'sam' editor. This has the following Copyright notice:

/*
 * The authors of this software are Rob Pike and Howard Trickey.
 *		Copyright (c) 1998 by Lucent Technologies.
 * Permission to use, copy, modify, and distribute this software for any
 * purpose without fee is hereby granted, provided that this entire notice
 * is included in all copies of any software which is or includes a copy
 * or modification of this software and in all copies of the supporting
 * documentation for such software.
 * THIS SOFTWARE IS BEING PROVIDED "AS IS", WITHOUT ANY EXPRESS OR IMPLIED
 * WARRANTY.  IN PARTICULAR, NEITHER THE AUTHORS NOR LUCENT TECHNOLOGIES MAKE ANY
 * REPRESENTATION OR WARRANTY OF ANY KIND CONCERNING THE MERCHANTABILITY
 * OF THIS SOFTWARE OR ITS FITNESS FOR ANY PARTICULAR PURPOSE.
 */
========================