The LAP library: LexerText

Derived from: Lexer

Declared in: LexerText.h


Overview

The LexerText object is meant to perform the lexical analysis on text files. Contrary to the binary lexer LexerBin, there is no notion of byte reversing or whatever. However, the line counter is supported by the class.

A NextToken() method is defined as virtual. This is the method to call to get the next lexeme/token. Strictly speaking, it should have been a pure virtual method, winding up with an abstract LexerText class. This is because lexical rules are so different from a language or file to another.
Note that this method is not declared for binary files. Anyway, this makes sense, doesn't it?

Lexical analysis in text files

The fact is that a text lexer is pretty tedious to implement, even though the Lexer::Charset.
As explained, the lexical analyzer can be represented by a DFA, each state being implemented by a function or procedure which uses the methods Fwd() and Bwd().
Everything depends on the lexical items you have to recognize. For instance, in C++, you have the lexemes "-", "->", "--" and "-=" which start with a minus character. The implementation in the lexer can be:
main routine:
  if (*LexmEnd == '-')
    return StateMinus();
  else if (Charset[*LexmEnd] == LX_CHAR_DIGIT)
    return StateIntegerOrFloat();
  else ...

StateMinus:
  FWD;    // Skip the minus sign
  if (*LexmEnd == '>')
  {
    FWD;    // Lexeme '->'
    return LX_TOKEN_ARROW;
  }
  else if (*LexmEnd == '-')
  {
    FWD;    // Lexeme '--'
    return LX_TOKEN_DEC;
  }
  else if (*LexmEnd == '=')
  {
    FWD;    // Lexeme '-='
    return LX_TOKEN_SUB;
  }
  return LX_TOKEN_MINUS;

StateIntegerOrFloat:
  ...
This is not a tough work, but it can be pretty long for complex languages/grammars.
For some lexemes, all the functions are provided. It's the case for integers, floats, words, ... These functions will be described shortly.

Keywords

For some other lexemes, the work is so much easier: the keywords. Keywords simply consist in a set of alphanumerical characters. When a string lexeme is recognized, the lexer has to check whether it is a keyword or a something else (a variable for instance).
"Rejouissez-vous!", keywords are supported by the class LexerText. After the construction of the object, you can add your lexemes to a table called -guess what- Lexemes. A few functions are provided and are: You also have to declare tokens for each of the keywords you need. Each token must be higher than 0xff, that is 255. For instance, in your MyLexerText.h file, you can write:
  enum
  {
    MY_TOKEN_PUBLIC = 0x100,
    MY_TOKEN_VOLATILE,
    MY_TOKEN_WHILE,
    ...
  };
Then, in your constructor, you can write:
  MyLexerText::MyLexerText()
  {
    AddLexeme(new Lexeme("public", MY_TOKEN_PUBLIC));
    AddLexeme(new Lexeme("volatile", MY_TOKEN_VOLATILE));
    AddLexeme(new Lexeme("while", MY_TOKEN_WHILE));
    ...
    SortLexemes();
  }
Have a look at Lexeme for Lexeme specific information.
All the keywords are declared in the constructor. Note the call to SortLexemes(), which -guess what- sorts the lexemes to speed up look-up operations. This has to be done after all the lexemes are declared and -obviously- before the first call to NextToken().

Symbols

Don't mingle keywords and symbols. Symbols have to be stored in the symbols table, and keywords in the lexemes table. For instance, in C++, symbols are: unions, classes, typedefs, methods, ... The corresponding lexemes are "union", "typedef", "class", , ...

This is much compiler oriented stuff and there's no need to get any deeper, unless you really want to explore these abysses (You'd rather buy a good book). If you're not discouraged, proceed to the next paragraph. Skip it otherwise (It won't be that long, anyway).

Other symbols include

What you would have to do is declare all your lexemes in MyLexerText::MyLexerText and build your symbol table as you encounter definitions of any type. Some of these definitions are built-in (int, float, read, write, ... & Co).
  // Built-in const
  AddSymbol(new Symbol("true", LX_KIND_BUILTIN_CONST, LX_TYPE_INTEGER, 1));
  AddSymbol(new Symbol("false", LX_KIND_BUILTIN_CONST, LX_TYPE_INTEGER, 0));
  ...
  // Built-in types
  AddSymbol(new Symbol("int", LX_KIND_BUILTIN_TYPE, LX_TYPE_NONE, 0));
  AddSymbol(new Symbol("float", LX_KIND_BUILTIN_TYPE, LX_TYPE_NONE, 0));
  ...
  // Built-in functions
  AddSymbol(new Symbol("read", LX_KIND_BUILTIN_PROC, LX_TYPE_NONE, 0));
  AddSymbol(new Symbol("strcat", LX_KIND_BUILTIN_PROC, LX_TYPE_NONE, 0));
  ...
Whenever you encounter a type or class definition (or whatever), you add a new symbol in the symbol table so that information is available when it is further referred to.
Well, I suppose you understand (or maybe you already knew, I guess it's better)


Constructor and D estructor


LexerText()

 
      LexerText(int32 BufferLength, int32 LexemeLength) 

Creates a LexerText object. Buffers are allocated through the call of Lexer::Lexer. As the allocations may have failed, the user should always check the status of the object after creation (see GetStatus()).


~LexerText()

 
      ~LexerText(void) 

The list of lexemes is flushed.


Member Functions


SearchLexeme()

 
      Lexeme* SearchLexeme(char* Name)

Search the lexeme whose string is given in input. The output is NULL if it was not found, true otherwise.


AddLexeme()

 
      bool AddLexeme(Lexeme* Lexm)

Attempts to add the lexeme given in parameter to the list of lexemes. This method fails (returns false) if a lexeme with the same name is already in the list. Otherwise, true is returned.


SortLexemes()

 
      void SortLexemes()

Sorts the lexemes stored in the lexemes table according to their token value. This is meant to speed up look-up operations. This method has to be invoked once all the lexemes are entered in the table, and before the first call to NextToken().

As you probably guess: THE LEXER WILL NOT WORK PROPERLY IF YOU DON'T DO THIS!!!


PrintLexeme()

 
      bool PrintLexeme(Lexeme* Lexem)

Prints out the lexeme given in parameter along with its token.
For debugging purposes, a priori.


InitNextToken()

 
      void InitNextToken()

Delete the allocation for the previous lexeme if necessary. This is used in NextToken().
For debugging purposes, a priori.


PrintLexemes()

 
      void PrintLexemes()

Prints out all the lexemes stored in the lexeme table, along with their token
For debugging purposes, a priori.


GetTokenLexeme()

 
      const char* GetTokenLexeme(uint32 Token)

Returns the lexeme corresponding to the Token given in parameter. If no such token is found, NULL is returned.

DON'T DELETE THE STRING WHICH IS OUTPUT, IT IS THE PROPERTY OF THE CORRESPONDING LEXEME!!!


GetTokenValueInt(), GetTokenValueFloat(), , GetTokenValueStr(), GetIllegalCharacter()

 
      int32   GetTokenValueInt()
      float   GetTokenValueFloat()
      char*   GetTokenValueStr()
      char    GetIllegalCharacter()

Retrieve the value of the last lexeme.


GetLineNumber()

 
      int32 GetLineNumber()

Hemmmmmm... let me check... That's it, returns the current line number in the file.

The int32 value may have to be changed to off_t, huh?


SetInput()

 
      virtual void SetInput(entry_ref& Token)

Sets the input file, reset the line number to 1, sets up various internal variables...


SkipSpaces()

 
      bool SkipSpaces()

Getting hot! This methods is meant to skip spaces, or at least, characters which are referred to as spaces in LexerText::Charset.
The line number is updated if necessary.
Never fails.

This function is meant to be called from NextToken().


SkipLine()

 
      bool SkipLine()

Well, skips a line, or reaches the end of the file if no '\n' character is encountered.
Never fails.


LexWord()

 
      bool LexWord()

Parse the current word, a word being a set of contiguous characters which is delimited by at least a space or an EOF character. Hence, the word can contain any set of characters within, but no spaces.
Never fails.

This function is meant to be called from NextToken().


LexNumber()

 
      bool LexNumber()

Parse the current number (either integer or float).

Never fails.

This function is meant to be called from NextToken().


LexString()

 
      bool LexString()

Parse the current string, that is a set of contigous characters referred to as alphabetical or numerical characters (LX_CHAR_ALPHA and LX_CHAR_DIGIT).

The parsed string is either a "simple" string (LX_TYPE_STRING), or a lexeme (LX_TYPE_STRING), or a symbol (LX_TYPE_SYMBOL)

Never fails.

This function is meant to be called from NextToken().


NextToken()

 
      virtual bool NextToken(uint32& OutputToken)

Returns the next token found in the input stream. This method can be seen as the start-state of the DFA which is evoqued in the documentation. From this "state", several branches (or transitions) are to be taken according to the current input character.

This function fails if it is called although the end of file has already been reached.

Basically, this method has to

As an example:
  bool
  MyLexerText::NextToken(uint32& Out)
  {
    InitNextToken(); // Preprocess: frees previous allocation

    // This loop is exited when a lexeme is found
    for (;;)
    {
      // Check for end of file
      if (EndOfFile)
      {
        Out = Token = LX_CHAR_EOF;
        return true;
      }

      // Prepare for a new lexeme
      LexmStart = LexmEnd;
      uint8  CharType = Charset[uint8(*LexmEnd)];

      // Parse an integer
      if (CharType == LX_CHAR_DIGIT)
      {
        if (!LexNumber())
          return false;
        Out = Token;
        return true;
      }

      // Parse a string
      else if (CharType == LX_CHAR_ALPHA)
      {
        if (!LexString())
          return false;
        Out = Token;
        return true;
      }

      // Skip spaces
      else if (CharType == LX_CHAR_SPACE)
      {
        if (!SkipSpaces())
          return false;
      }

      // Illegal characters generate an error
      else if (CharType == LX_CHAR_ILLEGAL)
      {
        Out = Token = LX_CHAR_ILLEGAL;
        IllegalCharacter = *LexmEnd;
        return false;
      }

      // it was a lexeme
      else
      {
        Out = Token = CharType;
        return true;
      }
    }
  }
Note the call to InitNextToken() which is mandatory at the very beginning of any of the NextToken methods (orelse you may notice memory leakages).
You can probably figure out that each of the cases checked in the main loop correspond to a particular "state" of an imaginary DFA.


The LAP library, V0.7, November 3rd 1997, First public release