Using the m4 Macro Processor

LinuxWorld.com 4/25/00

Paul Dunne, LinuxWorld.com

Chief among the unsung heroes of Linux and Unix is m4. Unsung? Well, for instance, although m4 has been a standard part of Unix since version 7, no mention is made of it in that great O'Reilly & Associates book, Unix Power Tools. What is it about m4 that makes it so useful, and yet so overlooked? m4 -- a macro processor -- unfortunately has a dry name that disguises a great utility. A macro processor is basically a program that scans text and looks for defined symbols, which it replaces with other text or other symbols. Thus, m4 is a powerful general-purpose utility that can be used to automate many tasks people often end up doing in sed, awk, perl, and even their favorite text editor. Even so, a macro processor may not seem like that big a deal. Unix developers already have a built-in macro processor, in the form of the C preprocessor, in their compiler. Perhaps this is what accounts for m4's relative neglect. Whatever the case may be, this article will show Linux users the power and usefulness of this software tool.

What is m4?

What is macro processing, and what is it good for? In their seminal work, Software Tools, Kernighan and Plauger have a succinct definition:

"Macros are used to extend some underlying language -- to perform a translation from one language to another."

Thus, symbolic constants may be defined so that subsequent occurrences of the name can be replaced by the defining string of characters, regardless of the contents of the definition or its context. Such a definition is called a macro, the replacement process is called macro expansion, and the program for the process is called a macro processor. The task performed by any macro processor is the replacement of text by other text. A macro is defined either by the m4 program (a built-in) or by the user. In addition to doing macro expansion, m4 -- with functions that include other files, perform integer arithmetic, manipulate text, and so forth -- is a perfect example of the power of the Unix filter concept.

The contemporary implementation of m4 on a Linux system is GNU m4, which follows System V Release 3 m4, with extensions. I am aware of no other version of m4 that has been ported to Linux. m4 implementations on BSD may differ slightly. However, m4 is m4, and this article should be useful for other Unix users, too. The latest version is 1.4, which was released in October 1994.

The scanning process

As m4 reads its input, it separates it into tokens. A token is either a name, a quoted string, or any single character that is not part of either a name or a string. Each name is checked against the defined macros, and a recognized macro is replaced by its expansion. This scanning process is recursive: the expansion is itself rescanned, and scanning continues until no more macros are recognized. The transformed input is written to the output. Macros can be built in or user-defined. A list of built-in macros follows later in the article.
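To see the rescanning at work, here is a minimal sketch for GNU m4. The macro names first and second are made up for the illustration, and the dnl at the end of each definition simply swallows the newline that would otherwise be copied to the output.

```m4
define(`first', `second')dnl
define(`second', `third')dnl
first
```

The single line of output is "third": first expands to second, and that text is scanned again and expands in turn.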

Defining macros

The most important of the built-in macros is define(), which allows users to define their own macros. For example, define(author, Paul Dunne) defines a macro "author" -- any occurrence of which will expand to the string "Paul Dunne". m4 expands macro names into their defining text as soon as it possibly can.
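A minimal sketch of that definition in use, assuming GNU m4 reading the lines below from a file or from standard input. Lines starting with dnl act as comments here.

```m4
dnl The definition itself expands to nothing.
define(author, Paul Dunne)dnl
This article was written by author.
```

The output is the single line "This article was written by Paul Dunne."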

Quoting

The m4 quote characters are ` and '. For example, `this is quoted'. It is often best to quote both macro name and substitution text in a definition. This avoids any unwanted side effects, such as an early expansion of another macro name. m4 uses commas as argument separators; therefore, any definition that includes commas must be quoted.
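The effect of quoting is easiest to see side by side. The following is a small sketch for GNU m4; the macro name byline is invented for the example.

```m4
define(`author', `Paul Dunne')dnl
dnl Unquoted, the name is expanded. Quoted, one layer of quotes is
dnl stripped and the text inside is left alone.
author
`author'
dnl The comma in this definition survives only because the text is quoted.
define(`byline', `Paul Dunne, LinuxWorld.com')dnl
byline
```

This prints "Paul Dunne", then "author", then "Paul Dunne, LinuxWorld.com".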

Arguments

As mentioned earlier, arguments to macros are delimited by commas. They are also, as we've seen, enclosed in parentheses. A macro can also be called with no arguments. This is common when we simply wish to replace one string with another, as in the "author" example above.
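Inside a definition, the arguments of a call are referred to as $1, $2, and so on (the same notation turns up in the len example later in this article). A short sketch, with greet as a made-up macro name:

```m4
dnl $1 and $2 are replaced by the first and second arguments of the call.
define(`greet', `Hello, $1 and $2!')dnl
greet(Alice, Bob)
dnl A plain substitution macro is called with no arguments at all.
define(`author', `Paul Dunne')dnl
author
```

The output is "Hello, Alice and Bob!" followed by "Paul Dunne". White space in front of an argument, such as the space before Bob, is discarded.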

Built-in functions

m4 provides a small set of useful built-in functions. We may group them under the following headings:

Flow control functions

m4 provides the classic "if-then" programming construct, in two related forms.

ifdef(a,b)

expands to b if a is defined (and to nothing otherwise, unless an optional third argument supplies the alternative text), and

ifelse(a,b,c,d)

compares the strings a and b. If they match, string c is returned as the function value; if not, string d. Actually, ifelse is not limited to four arguments; it can take any greater number, and it thus provides a limited multiway decision-making capability. For example,

ifelse(a,b,c,d,e,f,g)

means that if a matches b, then c; else if d matches e, then f; else g.
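The three forms can be tried out with a few lines like these. This is only a sketch for GNU m4; it uses the three-argument form of ifdef, which the button-bar definition later in this article also relies on.

```m4
define(`_linux')dnl
dnl _linux is defined, so the second argument is the result.
ifdef(`_linux', `on the Linux page', `somewhere else')
dnl Four arguments: compare the first two, then pick the third or fourth.
ifelse(`abc', `abc', `same', `different')
dnl Seven arguments: a chain of comparisons with a final default.
ifelse(`a', `b', `first', `c', `c', `second', `neither')
```

The output is "on the Linux page", "same", and "second", one per line.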

Arithmetic functions

There are three arithmetic built-ins.

incr, which increments its numeric argument by one.

decr, which decrements its numeric argument by one.

eval, which performs arbitrary integer arithmetic.

Its operators are:

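unary + and -         unary plus and minus
** or ^               exponentiation
* / %                 multiplication, division, modulo
+ -                   addition, subtraction
== != < <= > >=       equal, not equal, less than, less than or equal to, greater than, greater than or equal to
!                     not
& or &&               logical and
| or ||               logical or

This is the traditional list. In current GNU m4, exponentiation is written ** only, and ^, & and | are the bitwise exclusive-or, and, and or operators.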
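A few sample calls, as a sketch for GNU m4, show the arithmetic built-ins at work:

```m4
incr(41)
decr(10)
eval(2 + 3 * 4)
eval(2 ** 10)
eval(7 > 3 && 2 != 2)
```

These expand to 42, 9, 14, 1024, and 0 respectively. Multiplication binds more tightly than addition, and a comparison yields 1 for true and 0 for false.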

String functions

There are three functions for simple operations on strings of characters.

len(a)

returns the length of the string "a".

substr(s, m, n)

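returns a substring from the string "s", starting at position m and continuing for n characters. In GNU m4, positions count from zero, and if n is omitted the substring runs to the end of the string.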
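Here is a quick sketch of both built-ins under GNU m4, where the first character of a string is at position 0:

```m4
len(`macro processor')
substr(`macro processor', 0, 5)
substr(`macro processor', 6)
```

The three lines expand to 15, "macro", and "processor".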

As a more complicated example than those we've had so far, consider this combination of ifelse, eval, and substr.

define(`len',`ifelse($1,,0,`eval(1+len(substr($1,1)))')')

Well now, what does this do? It is an implementation of the m4 built-in len in terms of other m4 built-ins! Each call uses substr($1,1) to drop the first character of the string (GNU m4 numbers the first character 0) and adds one to the length of what is left, until the string is empty. Note the two layers of quotes. The outer layer prevents all initial evaluation. We want len defined as exactly what's in the second argument. The inner layer protects the eval built-in from being evaluated while the arguments for the ifelse are collected.

translit(s, f, t)

returns the string "s" with all occurrences of the characters listed in "f" replaced by the corresponding characters listed in "t". It functions as a simpler version of the Linux command `tr'. For example, translit(s,abcdefghijklmnopqrstuvwxyz, nopqrstuvwxyzabcdefghijklm) is the well-known rot-13, a Caesar cipher.
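Wrapped in a macro, the rot-13 call becomes reusable. The following is a sketch for GNU m4; the macro name rot13 is my own choice for the example.

```m4
define(`rot13', `translit(`$1', `abcdefghijklmnopqrstuvwxyz', `nopqrstuvwxyzabcdefghijklm')')dnl
rot13(`hello')
rot13(rot13(`hello'))
```

The first call gives "uryyb". The second gives back "hello", since applying rot-13 twice restores the original text.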

File functions

File functions, as the name suggests, are used for working with files.

include(filename)

includes the contents of "filename" at the point in the input stream at which it occurs. This is useful if we have a central collection of standard m4 macros, which we can then use in another file by simply adding an appropriate include call.

divert(n)

This is used to divert text from the input stream to an internal file number. File number -1 is equivalent to discarding the text, file number 0 is the normal output stream, and the positive-numbered files are used for temporary storage.

divert(-1) is most commonly used to get rid of the extraneous white space that is often generated by m4. For example,

divert(-1)
...
definitions
...
divert

ensures that no output is produced while the definitions between the ellipses are processed (the ellipses are not part of m4 syntax). The final divert, with no argument, switches back to the normal output stream. Otherwise, we would end up with a pack of newlines in our output.
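A complete version of that pattern might look like the sketch below (GNU m4 assumed; the macro name site is invented). Plain text inside the diverted region is discarded too, so it can double as commentary.

```m4
divert(-1)
Everything here is thrown away, including the newlines
that follow each definition.
define(`author', `Paul Dunne')
define(`site', `LinuxWorld.com')
divert(0)dnl
author, site
```

The only output is the line "Paul Dunne, LinuxWorld.com".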

dnl

It's hard to categorize this one, so I've put it here. dnl is "delete to newline." It was used as a comment character in the original m4. As the name suggests, all characters up to and including the next newline are deleted from the input, so they never reach the output stream. GNU m4 also allows use of # as a comment character, with the difference that such comments are passed to the output stream exactly as they are: any macro calls or definitions after the # are ignored.
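The difference between the two kinds of comment shows up in a three-line sketch (GNU m4 assumed):

```m4
define(`author', `Paul Dunne')dnl this text and the newline are deleted
# author is not expanded inside a comment
author is expanded here
```

The output has two lines. The # line is copied through exactly as typed, and the last line becomes "Paul Dunne is expanded here".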

System functions

There is one system function -- that is, one that communicates with the underlying operating system.

esyscmd

passes a command to the system interpreter, usually the Unix shell, for execution, and expands to the command's output. For example, esyscmd(date) expands to the output of the date command, i.e., the current date and time.
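As a sketch, assuming GNU m4 on a system whose shell provides the usual date and printf commands:

```m4
dnl The output of date changes from run to run.
Generated on esyscmd(`date')dnl
dnl printf writes no trailing newline, so the result fits inside a line.
esyscmd(`printf hello'), world
```

The first line of output ends with whatever date prints, including its newline, and the second line is "hello, world".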

There are also some miscellaneous functions that have been added to the original m4 function set:

changecom

changes the m4 comment character (normally #).

traceon/traceoff

turn tracing on and off. This is useful for debugging.
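Of these, changecom is the easier one to demonstrate, since tracing output goes to standard error. A small sketch for GNU m4 that switches the comment marker from # to two slashes:

```m4
define(`author', `Paul Dunne')dnl
# author stays as typed because this line is a comment
changecom(`//')dnl
# author is expanded now that comments start with two slashes
// author is left alone again
```

The first and third lines of output are copied through unchanged. In the second, # is now an ordinary character, so the line reads "# Paul Dunne is expanded now that comments start with two slashes".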

Usage

m4 is invoked the normal way, by simply typing m4. It works as a classic Unix filter, reading from standard input if no filename is given on the command line and writing to standard output. Both input and output may be redirected in the shell or by commands in the input file.

A full summary of m4 usage, available by typing m4 --help, provides:

Usage: m4 [OPTION]... [FILE]...
Mandatory or optional arguments attached to long options are mandatory and
optional for short options, too.

Operation modes:
      --help                  display this help and exit
      --version               output version information and exit
  -e, --interactive           unbuffer output, ignore interrupts
  -E, --fatal-warnings        stop execution after first warning
  -Q, --quiet, --silent       suppress some warnings for built-ins
  -P, --prefix-built-ins      force a `m4_' prefix to all built-ins

Preprocessor features:
  -I, --include=DIRECTORY     search this directory second for includes
  -D, --define=NAME[=VALUE]   enter NAME as having VALUE, or empty
  -U, --undefine=NAME         delete built-in NAME
  -s, --synclines             generate `#line NO "FILE"' lines

Limits control:
  -G, --traditional           suppress all GNU extensions
  -H, --hashsize=PRIME        set symbol lookup hash table size
  -L, --nesting-limit=NUMBER  change artificial nesting limit

Frozen state files:
  -F, --freeze-state=FILE     produce a frozen state on FILE at end
  -R, --reload-state=FILE     reload a frozen state from FILE at start

Debugging:
  -d, --debug=[FLAGS]         set debug level (no FLAGS implies `aeq')
  -t, --trace=NAME            trace NAME when it will be defined
  -l, --arglength=NUM         restrict macro tracing size
  -o, --error-output=FILE     redirect debug and trace output

FLAGS is any of:
  t   trace for all macro calls, not only traceon'ed
  a   show actual arguments
  e   show expansion
  q   quote values as necessary, with a or e flag
  c   show before collect, after collect, and after call
  x   add a unique macro call ID, useful with c flag
  f   say current input file name
  l   say current input line number
  p   show results of path searches
  i   show changes in input files
  V   shorthand for all of the above flags

If no FILE or if FILE is `-', standard input is read.

This is a formidable list of options. But we need use only a few.

In fact, most often m4 is run as just m4, with perhaps the -P flag to specify that built-ins are preceded by m4_, e.g., m4_include rather than include. Below is an example of a line I use in a makefile to generate my HTML pages:


cat $*.m4 | htmlize | m4 -P > $*.html

m4 at work

So, we've had an overview of m4. Now, let's take a look at how it can be used to do useful work.

Example: Generating HTML

I use m4, among other Linux software tools, to maintain my Web pages. Rather than marking each page up in HTML -- a tiresome chore -- I have written a set of definitions that translates m4 macros into HTML. As well as being easier on the eye and simpler to write than HTML, this has other advantages. For example, an often seen feature on Websites is the navigational button bar, which has links to the main parts of a site. Obviously, it is nicer not to have a link from the button bar to our Linux page if that's where we already are, for example. This can be automated using m4, so that the correct HTML code is generated. The definition I use is as follows:

m4_define(
`_button_bar',
`<HR>
<P ALIGN="center">
m4_ifdef(`_index',[Home],_link(index.html, [Home]))
m4_ifdef(`_linux',[Linux],_link(linux.html, [Linux]))
m4_ifdef(`_writing',[Writing],_link(writing.html, [Writing]))
m4_ifdef(`_bookshop',[Bookshop],_link(bookshop/index.html, [Bookstore]))
</P>
<HR>'
)

Then, in the source file for linux.html, the macro _linux is defined, and so when _button_bar is referenced later in that file, the button bar code generated has no link to the Linux page: the Linux entry appears as plain text instead.
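The mechanism can be tried in isolation with the cut-down sketch below, meant to be run with m4 -P. The article does not show the definition of _link, so the one here is a stand-in I have assumed purely for illustration; the real macro may well differ.

```m4
m4_dnl Hypothetical _link: first argument is the target, second the label.
m4_define(`_link', `<A HREF="$1">$2</A>')m4_dnl
m4_dnl Pretend we are in the source of the Linux page.
m4_define(`_linux')m4_dnl
m4_ifdef(`_index', [Home], `_link(index.html, [Home])')
m4_ifdef(`_linux', [Linux], `_link(linux.html, [Linux])')
```

With _linux defined and _index not, the first line of output is the link <A HREF="index.html">[Home]</A> and the second is the bare label [Linux].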

Again, you can define your email address in the master file. Then, if you should change your email address, there is no need to do a global search-and-replace through all the files that constitute the site. A simple make updates everything -- but that's the subject of another article.

Example: A Linux key-map

Maintenance of Linux keymap files is an interesting and imaginative use of m4. I don't do this myself, since hacking an existing file is simplest for me. We don't have the space to examine the file in any depth here. If you take a look at /usr/lib/kbd/keymaps/i386/qwerty/hypermap.m4 on your Linux system, you will see how using m4 makes defining a complicated keymap quite a bit simpler and makes it easier to maintain.

Example: Sendmail config

m4's best-known application helps to demystify sendmail configuration files. The sendmail source distribution comes with m4 macros that are sufficient to generate a sendmail.cf for almost any site. At most, a little tweaking of the resulting sendmail.cf file (whose syntax has been memorably and justly compared to line noise) may be required. For anyone who has tried to write a sendmail config file from scratch -- in the days before the m4 macros -- this is a godsend.

Differences between m4 versions

Inevitably, there are different versions of m4. This is not an issue for the Linux user, as you will invariably be using GNU m4.

The main difference is that System V m4 supports multiple arguments for `defn'. Since the usefulness of this is unclear to GNU m4's maintainer (and indeed to me), this feature is not in GNU m4.

There are several other incompatibilities (which shouldn't surprise anyone who's tried to use GNU make and then BSD's pmake, or vice versa). None are too important, but those interested can read the relevant info page (alas, no man page has been provided). As this article is about m4 rather than GNU m4, I won't mention the various extensions implemented in the GNU version -- those curious can see the list in the info page.

Things to watch

Quoting can be cantankerous on occasion. Quoting problems can usually be solved by changequote. For example, to include one of the quote characters in a macro definition, using

changequote([[,]])

and then

define([[phrase]], [[`a quoted macro']])

will keep the quote characters in the macro definition. Note that ` and ' can't be escaped, so we have to do it this way.
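Put together as something that can be run, the technique looks like this sketch for GNU m4. The macro name phrase is made up for the example, and the last line switches the quote characters back to the defaults.

```m4
changequote([[,]])dnl
define([[phrase]], [[`a quoted macro']])dnl
phrase
changequote(`,')dnl
```

The output is the single line `a quoted macro', with both quote characters intact.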

Another thing to watch out for is that if you have the name of an m4 built-in in your text, m4 will interpret such names as calls to that function, which is presumably not what you want.

This can be avoided by quoting, but that is inconvenient. GNU m4 offers us a better way. The -P command-line switch renames every built-in so that it starts with the string m4_, which marks macro calls much as the # character marks directives for the C preprocessor.
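A two-line sketch shows the difference; it is meant to be run with m4 -P, and the sentence is just sample text containing the name of a built-in.

```m4
m4_define(`author', `Paul Dunne')m4_dnl
author notes that roadworks may divert traffic.
```

With -P the output is "Paul Dunne notes that roadworks may divert traffic." Without -P (and with the m4_ prefixes removed from the first line), the word divert would be taken as a call to the built-in and would vanish from the sentence.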

Limitations

m4 is a useful tool, but it can be overstrained. Although it can be made to do most things with ingenuity, m4 is at its best when used for straightforward text substitution, as with our HTML example.

In Software Tools, Kernighan and Plauger sum it up nicely:

"The main thing is to ensure that any operation -- macro call, definition, other built-in -- can occur in the middle of any other one. If this is possible, then in principle the macro processor is capable of doing any computation, although it may well be hard to express.... In principle, macro [i.e. m4] is capable of performing any computing task, but it is all too easy to write incomprehensible macros."

This article has been an introduction to an often overlooked Linux program. Hopefully, you'll now be able to go off and do some m4ing yourselves.

Resources