LEX

The best known interface for a lexical analyzer generator is the UNIX program LEX. LEX for C and C++ exists on our system; a quick web search will lead you to documentation for LEX as well as implementations for languages like Java. The standard reference for LEX is Levine, J.R., Mason, T., and Brown, D. (1992), Lex and YACC. 2nd Edition. O'Reilly.

Here is a lexical analyzer description in LEX format for a small token set. LEX input consists of three sections, separated by lines containing only %%: one for definitions (C declarations between %{ and %}, followed by named regular expressions), one for pairs of regular expressions and code segments, and one for auxiliary C code. From this, the LEX program generates a file in C (or another language if you have the appropriate LEX implementation), which contains the definition of a single routine, int yylex(void). When called, this routine starts isolating tokens from the input file according to the regular expressions in the second section, and for each token found it executes the C code associated with it. If that code executes a return statement, yylex() returns the value to its caller; otherwise scanning continues with the next token. When the input is exhausted, yylex() returns 0.

/************************************************************************************/
/***  Lex input for a small token set: integers, identifiers, single-character    ***/
/***  operators and separators; comments run from # to the next # or end of line. ***/
/************************************************************************************/

%{
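#include <stdio.h>      /* EOF */
#include <stdlib.h>     /* malloc */
#include <string.h>     /* strlen, strcpy */
#include "lex.h"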

Token_Type Token;
int line_number = 1;
%}

whitespace             [ \t]

letter                 [a-zA-Z]
digit                  [0-9]
underscore             "_"
letter_or_digit        ({letter}|{digit})
underscored_tail       ({underscore}{letter_or_digit}+)
identifier             ({letter}{letter_or_digit}*{underscored_tail}*)

operator               [-+*/]
separator              [;,(){}]

%%
{digit}+               {return INTEGER;}
{identifier}           {return IDENTIFIER;}
{operator}|{separator} {return yytext[0];}
#[^#\n]*#?             {/* ignore comment */}
{whitespace}           {/* ignore whitespace */}
\n                     {line_number++;}
.                      {return ERRONEOUS;}

%%
void start_lex(void) {}

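void get_next_token(void)
{
      Token.class = yylex();
      Token.pos.line_number = line_number;
      if (Token.class == 0)
      {
            /* yylex() returns 0 at end of input */
            Token.class = EOF; Token.repr = "<EOF>"; return;
      }
      /* copy the token text: yytext is overwritten by the next call to yylex() */
      Token.repr = (char*)malloc(strlen(yytext)+1);
      if (Token.repr == NULL) {perror("malloc"); exit(1);}
      strcpy(Token.repr, yytext);
}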

int yywrap(void) {return 1;}
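
To see what the generated scanner does with these rules, it can help to write the same behavior out by hand. The following is only a sketch under stated assumptions, not the code LEX produces: the contents of lex.h are not shown above, so the token class values and the Token_Type layout here are guesses; yylex() and yytext are hand-written stand-ins for the generated routine and buffer; and set_input(), take() and print_tokens() are hypothetical helpers introduced so the example can scan a string without a LEX installation.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumed contents of lex.h (not shown in the text): token classes start
   above 255 so they cannot collide with single-character tokens. */
enum { IDENTIFIER = 257, INTEGER, ERRONEOUS };
typedef struct { int line_number; } Position_Type;
typedef struct { int class; char *repr; Position_Type pos; } Token_Type;

Token_Type Token;
int line_number = 1;

/* Hand-written stand-ins for what LEX would generate. */
static const char *input = "";   /* unread part of the input */
static char yytext[256];         /* text of the most recent token */

/* Hypothetical helper: scan from a string instead of a file. */
void set_input(const char *text) { input = text; line_number = 1; }

/* Copy the next n input characters into yytext, consume them, and
   return the given token class. Overlong tokens are truncated. */
static int take(size_t n, int class)
{
    size_t kept = n < sizeof yytext ? n : sizeof yytext - 1;
    assert(n > 0);
    memcpy(yytext, input, kept);
    yytext[kept] = '\0';
    input += n;
    return class;
}

static int is_letter_or_digit(char c) { return isalnum((unsigned char)c); }

/* Follows the rules of the LEX description, one branch per rule. */
int yylex(void)
{
    for (;;) {
        char c = *input;
        size_t n = 1;
        if (c == '\0') return 0;                       /* end of input */
        if (c == ' ' || c == '\t') { input++; continue; }
        if (c == '\n') { line_number++; input++; continue; }
        if (c == '#') {                                /* #[^#\n]*#? */
            input++;
            while (*input != '\0' && *input != '#' && *input != '\n') input++;
            if (*input == '#') input++;
            continue;
        }
        if (isdigit((unsigned char)c)) {               /* {digit}+ */
            while (isdigit((unsigned char)input[n])) n++;
            return take(n, INTEGER);
        }
        if (isalpha((unsigned char)c)) {               /* {identifier} */
            while (is_letter_or_digit(input[n])) n++;
            while (input[n] == '_' && is_letter_or_digit(input[n + 1])) {
                n++;
                while (is_letter_or_digit(input[n])) n++;
            }
            return take(n, IDENTIFIER);
        }
        if (strchr("-+*/;,(){}", c) != NULL) return take(1, c);
        return take(1, ERRONEOUS);                     /* the "." rule */
    }
}

void get_next_token(void)
{
    Token.class = yylex();
    Token.pos.line_number = line_number;
    if (Token.class == 0) { Token.class = EOF; Token.repr = "<EOF>"; return; }
    Token.repr = malloc(strlen(yytext) + 1);
    if (Token.repr == NULL) { perror("malloc"); exit(EXIT_FAILURE); }
    strcpy(Token.repr, yytext);
}

/* Driver loop: print every token until the EOF token appears. */
void print_tokens(const char *text)
{
    set_input(text);
    for (get_next_token(); Token.class != EOF; get_next_token()) {
        printf("line %d: class %d, text \"%s\"\n",
               Token.pos.line_number, Token.class, Token.repr);
        free(Token.repr);
    }
}
```

A caller would invoke, for example, print_tokens("sum_1 + 42 # note #\n;") from its own main(). With a real LEX-generated scanner the driver loop stays the same; only the hand-written yylex(), yytext and set_input() are replaced by the generated code reading from the input file.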

A short paper on-line describing the use of LEX is at http://www.cs.utexas.edu/users/novak/lexpaper.htm.