The best known interface for a lexical analyzer generator is the UNIX
program LEX. LEX for C and C++ exists on our system;
a quick web search will lead you to documentation for LEX as well as implementations
for languages like Java. The standard reference for LEX is Levine, J.R.,
Mason, T., and Brown, D. (1992), Lex and YACC. 2nd Edition. O'Reilly.
Here is a lexical analyzer description in LEX
format for a small token set. LEX input consists of three sections, separated
by lines containing only %%: one for named regular expressions, one for pairs
of regular expressions and code segments, and one for auxiliary C code.
From this, the LEX program generates
a file in C (or another language if you have the appropriate LEX implementation),
which contains the definition of a single routine, int yylex(void).
When called, this routine starts isolating tokens from the input file according
to the regular expressions in the second section, and for each token found
it executes the C code associated with it.
%{
/************************************************************************************/
/*** Lex input for the token set defined by the following regular expressions:    ***/
/***     integral_number    --> [0-9]+                                            ***/
/***     fixed_point_number --> [0-9]*'.'[0-9]+                                   ***/
/************************************************************************************/
#include <stdlib.h>                     /* malloc */
#include <string.h>                     /* strcpy, strlen */
#include "lex.h"
Token_Type Token;
int line_number = 1;
%}

whitespace              [ \t]
letter                  [a-zA-Z]
digit                   [0-9]
underscore              "_"
letter_or_digit         ({letter}|{digit})
underscored_tail        ({underscore}{letter_or_digit}+)
identifier              ({letter}{letter_or_digit}*{underscored_tail}*)
operator                [-+*/]
separator               [;,(){}]

%%

{digit}+                {return INTEGER;}
{identifier}            {return IDENTIFIER;}
{operator}|{separator}  {return yytext[0];}
#[^#\n]*#?              {/* ignore comment */}
{whitespace}            {/* ignore whitespace */}
\n                      {line_number++;}
.                       {return ERRONEOUS;}

%%

void start_lex(void) {}

void get_next_token(void)
{
    Token.class = yylex();              /* yylex() returns 0 at end of input */
    if (Token.class == 0)
    {
        Token.class = EOF; Token.repr = "<EOF>"; return;
    }
    Token.pos.line_number = line_number;
    /* yytext is overwritten by the next call of yylex(), so keep a copy */
    strcpy(Token.repr = (char*)malloc(strlen(yytext)+1),
           yytext);
}

int yywrap(void) {return 1;}
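
To make the effect of the rules more concrete, the following is a hand-written C sketch that mimics what yylex() does for this token set. It is not the code LEX generates (that is a table-driven automaton), and every name in it (SketchToken, sketch_next_token, the SK_ constants) is hypothetical. The real token classes come from lex.h, which is not shown here, so their values are assumed. The sketch reads from a string rather than from a file.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token classes; the real ones are defined in lex.h (not shown).
   SK_EOF is 0 because yylex() returns 0 at end of input. */
enum { SK_EOF = 0, SK_INTEGER = 256, SK_IDENTIFIER, SK_ERRONEOUS };

typedef struct {
    int cls;            /* token class, or the character itself for operators */
    const char *text;   /* start of the token inside the input string */
    size_t len;         /* number of characters in the token */
} SketchToken;

/* Isolate the next token starting at *pp and advance *pp past it.
   *line is incremented for every newline that is skipped. */
static SketchToken sketch_next_token(const char **pp, int *line)
{
    const char *p = *pp;
    SketchToken t;

    /* Rules with empty actions: whitespace, newlines and comments. */
    for (;;) {
        if (*p == ' ' || *p == '\t') { p++; continue; }     /* {whitespace} */
        if (*p == '\n') { (*line)++; p++; continue; }       /* \n           */
        if (*p == '#') {                                    /* #[^#\n]*#?   */
            p++;
            while (*p != '\0' && *p != '#' && *p != '\n') p++;
            if (*p == '#') p++;
            continue;
        }
        break;
    }

    t.text = p;
    if (*p == '\0') {
        t.cls = SK_EOF;
    } else if (isdigit((unsigned char)*p)) {                /* {digit}+     */
        while (isdigit((unsigned char)*p)) p++;
        t.cls = SK_INTEGER;
    } else if (isalpha((unsigned char)*p)) {                /* {identifier} */
        p++;
        while (isalnum((unsigned char)*p)) p++;
        /* underscored_tail: an underscore belongs to the identifier only
           if at least one letter or digit follows it. */
        while (*p == '_' && isalnum((unsigned char)p[1])) {
            p++;
            while (isalnum((unsigned char)*p)) p++;
        }
        t.cls = SK_IDENTIFIER;
    } else if (strchr("-+*/;,(){}", *p) != NULL) {          /* {operator}|{separator} */
        t.cls = (unsigned char)*p++;
    } else {                                                /* .            */
        p++;
        t.cls = SK_ERRONEOUS;
    }
    t.len = (size_t)(p - t.text);
    *pp = p;
    return t;
}
```

A caller would set a pointer to the start of the input and a line counter to 1, then call sketch_next_token repeatedly until it returns class SK_EOF. Note how the identifier rule accepts x_1 as one token but treats a leading, trailing, or doubled underscore as erroneous, exactly as the underscored_tail definition prescribes.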
A short paper on-line describing the use of LEX is at
http://www.cs.utexas.edu/users/novak/lexpaper.htm.