The Hand-written Lexical Analyzer
The modules in the lexical analyzer
are:
- Source Handler and character reader: Opens the source file
and readies it for reading; includes a public method to return the next "significant"
character in the source file. Responsible for stripping out comments and
extraneous blanks, keeping track of line numbers and character positions,
buffering characters if necessary, etc. Note that a blank is a delimeter
and so must be returned to the caller as a significant character when
first encountered; only extra blanks beyond the first one found after a
token are considered extraneous.
- Tokenizer: Constructs tokens
from the characters provided from the character reader and assigns types
and values to the type-value pair. Keywords could be specified by separate
regular expressions in this phase, but this is usually inefficient. Instead,
the token assembler typically handles keywords as identifiers, and assignment
of the keyword type is handled by checking a table of keywords, and changing token type "identifier" to the appropriate keyword token if found.
Considerations
-
Tokens are constructed in a greedy fashion--i.e.,
the longest possible character sequence comprising a valid token is returned.
-
You need two-character lookahead to handle the doubledot
token, since "5.." is two tokens (the integer 5 and the doubledot) whereas
"5.3" is a constant. However, you do not know which is the case until you
read the third character in the sequence.
-
You must specify a maximum length for identifers
for your implementation.
-
There are three possible interpretations for plus
and minus signs: (1) a binary operator, with token type ADDOP; (2) a unary
operator; (3) part of a constant. The lexical analysis routine has to determine
which is the case. This can be handled as follows:
-
Upon locating a plus or minus sign in the source,
check the TYPE of the previous token identified. (This means that you must
have some way of knowing what the last token was. )
-
If the type of the last token scanned is RIGHTPAREN,
RIGHTBRACKET, IDENTIFIER, or CONSTANT, then assign the type ADDOP and the
appropriate value.
Otherwise, it is a unary operator. You can either
return it as such (i.e., as UNARYMINUS or UNARYPLUS) or check to see if
it is part of a constant. If it is, the next character in the input stream
will be a digit, and you can regard the plus or minus sign as part of a
constant and continue scanning to isolate the rest of the number.