Derived from: Lexer
Declared in: LexerText.h
The LexerText object is meant to perform the lexical analysis on text files. Contrary to the binary lexer LexerBin, there is no notion of byte reversing or whatever. However, the line counter is supported by the class.
A NextToken() method is defined as virtual. This is the method to call to
get the next lexeme/token. Strictly speaking, it should have been a pure virtual method, winding up with
an abstract LexerText class. This is because lexical rules are so different from a language or file
to another.
Note that this method is not declared for binary files. Anyway, this makes sense, doesn't it?
main routine: if (*LexmEnd == '-') return StateMinus(); else if (Charset[*LexmEnd] == LX_CHAR_DIGIT) return StateIntegerOrFloat(); else ... StateMinus: FWD; // Skip the minus sign if (*LexmEnd == '>') { FWD; // Lexeme '->' return LX_TOKEN_ARROW; } else if (*LexmEnd == '-') { FWD; // Lexeme '--' return LX_TOKEN_DEC; } else if (*LexmEnd == '=') { FWD; // Lexeme '-=' return LX_TOKEN_SUB; } return LX_TOKEN_MINUS; StateIntegerOrFloat: ...This is not a tough work, but it can be pretty long for complex languages/grammars.
enum { MY_TOKEN_PUBLIC = 0x100, MY_TOKEN_VOLATILE, MY_TOKEN_WHILE, ... };Then, in your constructor, you can write:
MyLexerText::MyLexerText() { AddLexeme(new Lexeme("public", MY_TOKEN_PUBLIC)); AddLexeme(new Lexeme("volatile", MY_TOKEN_VOLATILE)); AddLexeme(new Lexeme("while", MY_TOKEN_WHILE)); ... SortLexemes(); }Have a look at Lexeme for Lexeme specific information.
This is much compiler oriented stuff and there's no need to get any deeper, unless you really want to explore these abysses (You'd rather buy a good book). If you're not discouraged, proceed to the next paragraph. Skip it otherwise (It won't be that long, anyway).
Other symbols include
// Built-in const AddSymbol(new Symbol("true", LX_KIND_BUILTIN_CONST, LX_TYPE_INTEGER, 1)); AddSymbol(new Symbol("false", LX_KIND_BUILTIN_CONST, LX_TYPE_INTEGER, 0)); ... // Built-in types AddSymbol(new Symbol("int", LX_KIND_BUILTIN_TYPE, LX_TYPE_NONE, 0)); AddSymbol(new Symbol("float", LX_KIND_BUILTIN_TYPE, LX_TYPE_NONE, 0)); ... // Built-in functions AddSymbol(new Symbol("read", LX_KIND_BUILTIN_PROC, LX_TYPE_NONE, 0)); AddSymbol(new Symbol("strcat", LX_KIND_BUILTIN_PROC, LX_TYPE_NONE, 0)); ...Whenever you encounter a type or class definition (or whatever), you add a new symbol in the symbol table so that information is available when it is further referred to.
LexerText(int32 BufferLength, int32 LexemeLength)
Creates a LexerText object. Buffers are allocated through the call of Lexer::Lexer. As the allocations may have failed, the user should always check the status of the object after creation (see GetStatus()).
~LexerText(void)
The list of lexemes is flushed.
Lexeme* SearchLexeme(char* Name)
Search the lexeme whose string is given in input. The output is NULL if it was not found, true otherwise.
bool AddLexeme(Lexeme* Lexm)
Attempts to add the lexeme given in parameter to the list of lexemes. This method fails (returns false) if a lexeme with the same name is already in the list. Otherwise, true is returned.
void SortLexemes()
Sorts the lexemes stored in the lexemes table according to their token value. This is meant to speed up look-up operations. This method has to be invoked once all the lexemes are entered in the table, and before the first call to NextToken().
As you probably guess: THE LEXER WILL NOT WORK PROPERLY IF YOU DON'T DO THIS!!!
bool PrintLexeme(Lexeme* Lexem)
Prints out the lexeme given in parameter along with its token.
For debugging purposes, a priori.
void InitNextToken()
Delete the allocation for the previous lexeme if necessary. This is used in NextToken().
For debugging purposes, a priori.
void PrintLexemes()
Prints out all the lexemes stored in the lexeme table, along with their token
For debugging purposes, a priori.
const char* GetTokenLexeme(uint32 Token)
Returns the lexeme corresponding to the Token given in parameter. If no such token is found, NULL is returned.
DON'T DELETE THE STRING WHICH IS OUTPUT, IT IS THE PROPERTY OF THE CORRESPONDING LEXEME!!!
int32 GetTokenValueInt() float GetTokenValueFloat() char* GetTokenValueStr() char GetIllegalCharacter()
Retrieve the value of the last lexeme.
int32 GetLineNumber()
Hemmmmmm... let me check... That's it, returns the current line number in the file.
The int32 value may have to be changed to off_t, huh?
virtual void SetInput(entry_ref& Token)
Sets the input file, reset the line number to 1, sets up various internal variables...
bool SkipSpaces()
Getting hot! This methods is meant to skip spaces, or at least, characters which are
referred to as spaces in LexerText::Charset.
The line number is updated if necessary.
Never fails.
This function is meant to be called from NextToken().
bool SkipLine()
Well, skips a line, or reaches the end of the file if no '\n' character is
encountered.
Never fails.
bool LexWord()
Parse the current word, a word being a set of contiguous characters which is
delimited by at least a space or an EOF character. Hence, the word can contain
any set of characters within, but no spaces.
Never fails.
This function is meant to be called from NextToken().
bool LexNumber()
Parse the current number (either integer or float).
Never fails.
This function is meant to be called from NextToken().
bool LexString()
Parse the current string, that is a set of contigous characters referred to as alphabetical or numerical characters (LX_CHAR_ALPHA and LX_CHAR_DIGIT).
The parsed string is either a "simple" string (LX_TYPE_STRING), or a lexeme (LX_TYPE_STRING), or a symbol (LX_TYPE_SYMBOL)
Never fails.
This function is meant to be called from NextToken().
virtual bool NextToken(uint32& OutputToken)
Returns the next token found in the input stream. This method can be seen as the start-state of the DFA which is evoqued in the documentation. From this "state", several branches (or transitions) are to be taken according to the current input character.
This function fails if it is called although the end of file has already been reached.
Basically, this method has to
bool MyLexerText::NextToken(uint32& Out) { InitNextToken(); // Preprocess: frees previous allocation // This loop is exited when a lexeme is found for (;;) { // Check for end of file if (EndOfFile) { Out = Token = LX_CHAR_EOF; return true; } // Prepare for a new lexeme LexmStart = LexmEnd; uint8 CharType = Charset[uint8(*LexmEnd)]; // Parse an integer if (CharType == LX_CHAR_DIGIT) { if (!LexNumber()) return false; Out = Token; return true; } // Parse a string else if (CharType == LX_CHAR_ALPHA) { if (!LexString()) return false; Out = Token; return true; } // Skip spaces else if (CharType == LX_CHAR_SPACE) { if (!SkipSpaces()) return false; } // Illegal characters generate an error else if (CharType == LX_CHAR_ILLEGAL) { Out = Token = LX_CHAR_ILLEGAL; IllegalCharacter = *LexmEnd; return false; } // it was a lexeme else { Out = Token = CharType; return true; } } }Note the call to InitNextToken() which is mandatory at the very beginning of any of the NextToken methods (orelse you may notice memory leakages).