Chief among the unsung heroes of Linux and Unix is m4. Unsung? Well, for
instance, although m4 has been a standard part of Unix since version 7,
no mention is made of it in that great O'Reilly & Associates book, Unix
Power Tools. What is it
about m4 that makes it so useful, and yet so overlooked? m4 -- a macro
processor -- unfortunately has a dry name that disguises a great
utility. A
macro processor is basically a program that scans text and looks for
defined symbols, which it replaces with other text or other symbols.
Thus, m4 is a powerful general-purpose utility that can be used to
automate many tasks people often end up doing in sed, awk, perl, and
even their favorite text editor. Even so, it still doesn't seem like a
macro processor is that big of a deal.
Unix developers already have a built-in macro processor, in the form
of the C preprocessor, in their compiler. Perhaps this is
what accounts for m4's relative neglect. Whatever the case may be, this
article
will show Linux users the power and usefulness of this software tool.
What is m4?
What is macro
processing, and what is it good for? In their seminal work,
Software Tools, Kernighan
and Plauger have a succinct
definition:
"Macros are
used to extend some underlying language -- to perform a
translation from one language to another."
Thus, symbolic
constants may be defined so that subsequent occurrences
of the name can be replaced by the defining string of characters,
regardless of the contents of the definition or its context. Such a
definition is called a macro, the replacement process is called
macro expansion, and the program for the process is called a macro
processor. The task performed by any macro processor is
the replacement of text by other text. A macro is defined either by
the m4 program (a built-in) or by the user. In addition to doing macro
expansion, m4 -- with functions that include other files, perform
integer arithmetic, manipulate text, and so forth -- is a perfect
example of the power of the Unix filter concept.
The contemporary implementation of m4 on a Linux system is GNU
m4, which follows System V Release 3 m4, with extensions. I am
aware of no other version of m4 that has been ported to Linux.
m4 implementations on BSD may differ slightly. However, m4 is m4, and
this article should be useful for other Unix users, too. The latest
version at the time of writing is 1.4, which was released in October 1994.
The scanning process
As m4 reads its input, it separates it into tokens. A token is
either a name, a quoted string, or any single character
that is not a part of either a name or a string. The input is then
scanned for recognized macros. This scanning process is recursive,
which means that the text a macro expands to is itself rescanned, and
scanning continues until no more macros are recognized. The
transformed input is written to the output. Macros can be built in or
user-defined. A list of built-in macros follows later in the article.
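A minimal sketch of the rescanning described above. The macro names greeting and name are made up for illustration, and dnl (covered later) is used only to swallow the newline after each definition.

```m4
define(`greeting', `hello, name')dnl
define(`name', `world')dnl
greeting
```

When greeting is expanded, its replacement text is scanned again, so name is expanded in turn and the output should read "hello, world".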
Defining macros
The most important of the built-in macros is define(), which
allows users to define their own macros. For example, define(author,
Paul Dunne) defines a macro "author" -- any occurrence of which will
expand to the string "Paul Dunne". m4 expands macro names into
their defining text as soon as it possibly can.
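The definition above can be tried out with a two-line input file. This is only a sketch: the sentence around the macro is invented, and the definition is written with quotes, a habit the next section explains.

```m4
define(`author', `Paul Dunne')dnl
This article was written by author.
```

Fed through m4, this should print "This article was written by Paul Dunne."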
Quoting
The m4 quote characters are ` (opening) and ' (closing). For example,
`this is quoted'.
It is often best to quote both macro name and
substitution text in a definition. This avoids any unwanted side
effects, such as an early expansion of another macro name. m4 uses commas as
argument separators; therefore, any definition that includes commas must be
quoted.
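A small sketch of both points, assuming GNU m4's default quote characters. The replacement text contains a comma, so it has to be quoted; and a quoted macro name in running text is passed through with the quotes stripped rather than expanded.

```m4
define(`author', `Dunne, Paul')dnl
author
`author'
```

The first use should print "Dunne, Paul" and the second just "author".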
Arguments
As mentioned earlier, arguments to macros are delimited by commas. They are
also, as we've seen, enclosed in parentheses. Inside a definition, the
arguments are referred to as $1, $2, and so on. A macro can also be
called with no arguments. This is common when we simply wish to
replace one string with another, as in the "author" example above.
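A minimal sketch of a macro that takes two arguments, using the positional parameters $1 and $2. The macro greet is hypothetical and exists only for this illustration.

```m4
define(`greet', `Hello, $1! Welcome to $2.')dnl
greet(`Paul', `Linux')
```

This should expand to "Hello, Paul! Welcome to Linux." Quoting each argument in the call keeps it from being expanded while the arguments are collected.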
Built-in functions
m4 provides a small set of useful built-in functions. We may group
them under the following headings:
Flow control functions
m4 provides the classic "if-then" programming construct, in two
related forms.
ifdef(a,b) expands to b if a is defined; an optional third argument
gives the text to use when a is not defined.
ifelse(a,b,c,d) compares the strings a and b. If they match, string c is
returned as the function value; if not, string d. Actually, ifelse is not
limited to four arguments; it can take any greater number, and it thus
provides a limited multiway decision-making capability. For example,
ifelse(a,b,c,d,e,f,g)
means that if a matches b, then c; else if d matches e, then f; else g.
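The two forms can be sketched as follows. The macro name debug and all of the strings are invented for the example; only ifdef and ifelse come from m4 itself.

```m4
define(`debug')dnl
ifdef(`debug', `debugging on', `debugging off')
ifelse(`apple', `apple', `same', `different')
ifelse(`a', `b', `first', `c', `c', `second', `neither')
```

The three lines should produce "debugging on", "same", and "second" respectively: debug is defined (with an empty value), the two apple strings match, and in the last call the first pair differs while the second pair matches.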
Arithmetic functions
There are three arithmetic built-ins.
incr, which increments its numeric argument by one.
decr, which decrements its numeric argument by one.
eval, which performs arbitrary integer arithmetic.
Its operators are:
Operator            Meaning
unary + and -       unary plus and minus
** or ^             exponentiation
+ -                 addition, subtraction
== != < <= > >=     equal, not equal, less than, less than or equal to,
                    greater than, greater than or equal to
!                   not
& or &&             logical and
| or ||             logical or
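A few calls to the arithmetic built-ins, as a quick sketch. The numbers are arbitrary; multiplication with * is also used, since eval supports the usual C-style arithmetic operators.

```m4
incr(41)
decr(43)
eval(2 + 3 * 4)
eval(2 ** 10)
eval(7 > 3)
```

These should print 42, 42, 14, 1024, and 1, one per line. Comparisons yield 1 for true and 0 for false.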
String functions
Here are three functions for simple operations on strings of characters.
len(a) returns the length of the string "a".
substr(s, m, n) returns a substring from the string "s", starting at
position m (counting from zero) and continuing for n characters.
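A short sketch of len and substr on an arbitrary sample string. Positions are counted from zero, and leaving out the third argument of substr returns everything to the end of the string.

```m4
len(`macro processor')
substr(`macro processor', 0, 5)
substr(`macro processor', 6)
```

The output should be 15, "macro", and "processor".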
As a more complicated example than those we've had so far, consider
this combination of ifelse, eval, and substr.
define(`len',`ifelse($1,,0,`eval(1+len(substr($1,1)))')')
Well now, what does this do? It is an implementation of the m4
built-in len in terms of other m4 built-ins! If the argument is empty,
the length is 0; otherwise it is one more than the length of the
argument with its first character removed (substr counts positions from
zero, so substr($1,1) is everything after the first character). Note the
two layers of quotes. The outer layer prevents all initial evaluation. We
want len defined as exactly what's in the second argument. The inner
layer protects the eval built-in from being evaluated while the
arguments for the ifelse are collected.
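To experiment with the recursive definition without overriding the real built-in, it can be given another name. The sketch below uses the hypothetical name my_len, quotes $1 a little more defensively than the article's version, and relies on m4's zero-based substr, so substr(`$1', 1) is the argument minus its first character.

```m4
define(`my_len', `ifelse(`$1', `', `0', `eval(1 + my_len(substr(`$1', 1)))')')dnl
my_len(`hello')
len(`hello')
```

Both lines should print 5. The recursion peels off one character per call until the argument is empty, then the nested eval calls add the ones back up.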
translit(s, f, t) returns the string "s" with all occurrences of the
characters listed in "f" replaced by those listed in "t". It functions
as a simpler version of the Linux command `tr'.
For example,
translit(s,abcdefghijklmnopqrstuvwxyz,nopqrstuvwxyzabcdefghijklm)
is the well-known rot-13, or Caesar, cipher (here for lowercase letters only).
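Wrapped in a macro, the rot-13 call can be tried on a sample word. The macro name rot13 is made up for this sketch, and only lowercase letters are mapped.

```m4
define(`rot13', `translit(`$1', `abcdefghijklmnopqrstuvwxyz', `nopqrstuvwxyzabcdefghijklm')')dnl
rot13(`hello')
rot13(rot13(`hello'))
```

The first call should give "uryyb"; applying rot-13 twice returns the original "hello".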
File functions
File functions, as the name suggests, are used for working with files.
include(filename) includes the contents of "filename" at the point in the
input stream at which it occurs. This is useful if we have a central
collection of standard m4 macros, which we can then use in another file
by simply adding an appropriate include call.
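A sketch of how such a shared file might be pulled in. The file name macros.m4 and its contents are assumptions made for this example; nothing by that name ships with m4.

```m4
dnl Assumes a hypothetical file macros.m4, in the current directory or on
dnl the -I search path, that defines a macro called author.
include(`macros.m4')dnl
Written by author.
```

If macros.m4 cannot be found, m4 reports an error, so the file has to exist before this is run.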
divert(n)
This is used to divert text from the input stream to an internal
file number. File number -1 is equivalent to discarding the text,
file number 0 is the normal output stream, and the positive-numbered
files are usually used for temporary storage.
divert(-1) is most commonly used to get rid of the extraneous
white space that is often generated by m4. For example,
divert(-1)
...
definitions
...
divert
ensures that no output is performed while the various definitions
between the ellipses are performed (the ellipses are not part of
m4 syntax). Otherwise, we would end up with a pack of newlines in our
output.
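The pattern can be sketched with two throwaway definitions. The macros author and title are only examples; divert(0) is written out in full here, which has the same effect as a bare divert.

```m4
divert(-1)
define(`author', `Paul Dunne')
define(`title', `m4 at work')
divert(0)dnl
title, by author
```

The only output should be the single line "m4 at work, by Paul Dunne". The newlines that follow each definition are discarded along with everything else sent to file -1.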
dnl
It's hard to categorize this one, so I've put it here. dnl is "delete
to newline." It was used as a comment character in the original m4. As the
name suggests, all characters up to and including the next newline are
deleted and never reach the output stream. GNU m4 also allows use of # as a
comment character, with the difference that such comments are passed
to the output stream. Any macro calls or definitions after the # are
ignored, however; the input is passed to the output exactly as is.
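The difference between the two kinds of comment can be seen in a small sketch. The macro x and the surrounding text are invented for the example.

```m4
define(`x', `expanded')dnl this text and the newline vanish
x # x is not expanded inside a comment
x dnl nor is this tail copied
done
```

The output should be two lines: "expanded # x is not expanded inside a comment" and "expanded done". The # comment is copied through untouched, while dnl removes the rest of its line together with the newline, so "done" ends up on the same output line.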
System functions
There is one system function worth noting here -- that is, one that
communicates with the underlying operating system.
esyscmd passes a command to the system interpreter, usually the Unix shell,
for execution. For example, esyscmd(date) returns today's date.
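A one-line sketch of esyscmd in use; the wording around the call is arbitrary. The output of date ends with a newline of its own, so dnl is used to drop the extra one from the input line.

```m4
Generated on esyscmd(`date')dnl
```

The result depends on the system clock and locale, so no fixed output can be shown.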
There are also some miscellaneous functions that have been added to
the original m4 function set:
changecom changes the m4 comment character (normally #).
traceon/traceoff turn tracing on and off. This is useful for debugging.
Usage
m4 is invoked the normal way, by simply typing m4.
It works as a classic Unix filter, reading from standard input if no
filename is given on the command line and writing to standard output.
Both input and output may be redirected in the shell or by commands
in the input file.
A full summary of m4 usage, available by typing m4 --help, provides:
Usage: m4 [OPTION]... [FILE]...
Mandatory or optional arguments attached to long options are mandatory and
optional for short options, too.
Operation modes:
      --help                   display this help and exit
      --version                output version information and exit
  -e, --interactive            unbuffer output, ignore interrupts
  -E, --fatal-warnings         stop execution after first warning
  -Q, --quiet, --silent        suppress some warnings for built-ins
  -P, --prefix-built-ins       force a `m4_' prefix to all built-ins
Preprocessor features:
  -I, --include=DIRECTORY      search this directory second for includes
  -D, --define=NAME[=VALUE]    enter NAME as having VALUE, or empty
  -U, --undefine=NAME          delete built-in NAME
  -s, --synclines              generate `#line NO "FILE"' lines
Limits control:
  -G, --traditional            suppress all GNU extensions
  -H, --hashsize=PRIME         set symbol lookup hash table size
  -L, --nesting-limit=NUMBER   change artificial nesting limit
Frozen state files:
  -F, --freeze-state=FILE      produce a frozen state on FILE at end
  -R, --reload-state=FILE      reload a frozen state from FILE at start
Debugging:
  -d, --debug=[FLAGS]          set debug level (no FLAGS implies `aeq')
  -t, --trace=NAME             trace NAME when it will be defined
  -l, --arglength=NUM          restrict macro tracing size
  -o, --error-output=FILE      redirect debug and trace output
FLAGS is any of:
  t   trace for all macro calls, not only traceon'ed
  a   show actual arguments
  e   show expansion
  q   quote values as necessary, with a or e flag
  c   show before collect, after collect, and after call
  x   add a unique macro call ID, useful with c flag
  f   say current input file name
  l   say current input line number
  p   show results of path searches
  i   show changes in input files
  V   shorthand for all of the above flags
If no FILE or if FILE is `-', standard input is read.
This is a formidable list of options. But we need use only a few.
In fact, most often m4 is run as just m4, with perhaps the -P flag
to specify that built-ins are preceded by m4_, e.g., m4_include
rather than include. Below is an example of a line I use in a makefile to
generate my HTML pages:
cat $*.m4 | htmlize | m4 -P > $*.html
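What an input file for m4 -P might look like is sketched below. The macro title and the HTML line are invented for the example; the point is that only the m4_ names are treated as built-ins.

```m4
m4_define(`title', `m4 at work')m4_dnl
define and include are ordinary words here.
<H1>title</H1>
```

Run through m4 -P, this should print the second line unchanged, followed by "<H1>m4 at work</H1>". Without -P, the m4_ names would not be recognized and would simply be copied to the output.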
m4 at work
So, we've had an overview of m4. Now, let's take a look at how it can be used
to do useful work.
Example: Generating HTML
I use m4, among other Linux software tools, to maintain my Web pages.
Rather than marking each page up in HTML -- a tiresome chore -- I have
written a set of definitions that translates m4 macros into HTML.
As well as being easier on the eye and simpler to write than HTML,
this has other advantages. For example, an often seen feature on
Websites is the navigational button bar, which has links to the
main parts of a site. Obviously, it is nicer not to have a link
from the button bar to our Linux page if that's where we already are,
for example. This can be automated using m4, so that the correct HTML
code is generated. The definition I use is as follows:
m4_define(
`_button_bar',
`<HR>
<P ALIGN="center">
m4_ifdef(`_index',[Home],_link(index.html, [Home]))
m4_ifdef(`_linux',[Linux],_link(linux.html, [Linux]))
m4_ifdef(`_writing',[Writing],_link(writing.html, [Writing]))
m4_ifdef(`_bookshop',[Bookshop],_link(bookshop/index.html, [Bookstore]))
</P>
<HR>'
)
Then, in the source file for linux.html, the macro _linux is defined, and so
when _button_bar is referenced later in that file, the button bar
code generated has no link to the Linux page; the Linux entry appears as
plain text (in effect, grayed out).
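A cut-down sketch of how this might fit together, meant to be run with m4 -P. The definition of _link is not shown in the article, so the one below is a guess at a plausible form; defining _linux stands in for whatever the Linux page's source actually does.

```m4
m4_divert(-1)
# Hypothetical definition of _link: wrap the text in an anchor tag.
m4_define(`_link', `<A HREF="$1">$2</A>')
# Defining _linux marks this page as the Linux page.
m4_define(`_linux')
m4_divert(0)m4_dnl
m4_ifdef(`_index', `[Home]', `_link(`index.html', `[Home]')')
m4_ifdef(`_linux', `[Linux]', `_link(`linux.html', `[Linux]')')
```

Under these assumptions the output should be a link for Home, `<A HREF="index.html">[Home]</A>`, followed by a plain `[Linux]` with no link. The third argument of each m4_ifdef is quoted here so that _link is expanded only when it is actually needed.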
Similarly, you can define your email address in the master file. Then,
if you should change your email address, there is no need to do a global
search-and-replace through all the files that constitute the site. A
simple make updates everything -- but that's the subject of another article.
Example: A Linux key-map
Maintenance of Linux keymap files is an interesting and imaginative use of
m4. I don't do this myself, since hacking an existing file is simplest
for me. We don't have the space to examine the file in any depth here.
If you take a look at /usr/lib/kbd/keymaps/i386/qwerty/hypermap.m4 on
your Linux system, you will see how using m4 makes defining a complicated
keymap quite a bit simpler and makes it easier to maintain.
Example: Sendmail config
m4's best-known application helps to demystify sendmail configuration
files. The sendmail source distribution comes with m4 macros that are
sufficient to generate a sendmail.cf for almost any site. At most, a
little tweaking of the resulting sendmail.cf file (whose syntax has
been memorably and justly compared to line noise) may be required.
For anyone who has tried to write a sendmail config file from scratch --
in the days before the m4 macros -- this is a godsend.
Differences between m4 versions
Inevitably, there are different versions of m4. This is not an
issue for the Linux user, as you will invariably be using GNU m4.
The main difference is that System V m4 supports multiple arguments for
`defn'. Since the usefulness of this is unclear to GNU m4's
maintainer (and indeed to me), this feature is not in GNU m4.
There are several other incompatibilities (which shouldn't surprise
anyone who's tried to use GNU make and then BSD's pmake, or vice
versa). None are too important, but those interested can read the
relevant info page (alas, no man page has been provided). As this article
is about m4 rather than GNU m4, I won't mention the various extensions
implemented in the GNU version -- those curious can see the list in the
info page.
Things to watch
Quoting can be cantankerous on occasion. Quoting problems can
usually be solved by changequote. For example, to
include one of the quote characters in a macro definition, using
changequote([[,]])
and then
define([[q]],[[`a quoted macro']])
will keep the quote characters in the definition of the macro (called q
here purely for illustration). Note that ` and ' can't be escaped, so we
have to do it this way.
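The changequote trick can be tried end to end with the sketch below. The macro name q is made up for the example, and the last line switches the quote characters back to the defaults.

```m4
changequote([[,]])dnl
define([[q]], [[`a quoted macro']])dnl
q
changequote(`,')dnl
```

While [[ and ]] are the quote characters, ` and ' are ordinary text, so q should expand to `a quoted macro' with both quote characters intact.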
Another thing to watch out for is that if you have the name of an m4
built-in in your text, m4 will interpret such names as calls to that
function, which is presumably not what you want.
This can be avoided by quoting, but that is inconvenient. GNU m4
offers us a better way. The -P command-line switch requires all built-ins
to be prefaced with the string m4_, which sets them apart from ordinary
text much as the # character does for C preprocessor directives.
Limitations
m4 is a useful
tool, but it can be overstrained. Although it can be
made to do most things with ingenuity, m4 is at its best when used
for straightforward text substitution, as with our HTML example.
In Software Tools, Kernighan and Plauger sum it up nicely:
"The main thing is to ensure that any operation -- macro call,
definition, other built-in -- can occur in the middle of any other one.
If this is possible, then in principle the macro processor is capable
of doing any computation, although it may well be hard to express....
In principle, macro [i.e. m4] is capable of performing any computing
task, but it is all too easy to write incomprehensible macros."
This article has
been an introduction to an often overlooked Linux
program. Hopefully, you'll now be able to go off and do some m4ing
yourselves.
Resources