Error Handling

The only errors that can occur in the lexical analyzer are (1) no regular expression matches the current input, and (2) an open comment character ({) appears inside a comment. In practice, most errors are of the first kind: an invalid character is encountered. The easiest way to handle this is to have the token assembler return the unrecognized character as a token of type "unknown character" or "erroneous". This token can simply be passed on to the parser (this is what the sample code does), which will report an error during syntax analysis because the token is not valid in any syntactic configuration. However, the resulting error message is unlikely to be very informative. Another strategy is to catch the error during lexical analysis, give a more sensible error message, and not pass the token on to the parser. This strategy also ensures that all later phases of the compiler can assume that tokens are correct, thus avoiding obscure error messages or, worse, a compiler crash.
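
The two strategies for an unmatched character can be sketched as follows. This is a minimal illustration, not the sample code referred to above: the token set, the `Token` type, the `tokenize` function, and the `pass_unknown` flag are all hypothetical, and comment handling is left out for brevity.

```python
import re
from typing import Iterator, List, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str
    line: int


# Hypothetical token set: a tiny stand-in for the real Pascal subset.
TOKEN_RE = re.compile(r"""
    (?P<WS>\s+)
  | (?P<NUMBER>\d+)
  | (?P<IDENT>[A-Za-z][A-Za-z0-9]*)
  | (?P<SYMBOL>:=|[-+*/();.,=])
""", re.VERBOSE)


def tokenize(source: str, errors: List[str],
             pass_unknown: bool = True) -> Iterator[Token]:
    """Yield tokens; on an unmatched character, apply one of two strategies."""
    pos, line = 0, 1
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            ch = source[pos]
            if pass_unknown:
                # Strategy 1: hand the bad character to the parser as a token.
                yield Token("UNKNOWN", ch, line)
            else:
                # Strategy 2: report it here and keep it away from the parser.
                errors.append(f"line {line}: illegal character {ch!r}")
            pos += 1  # skip the offending character and keep tokenizing
            continue
        if m.lastgroup != "WS":
            yield Token(m.lastgroup, m.group(), line)
        line += m.group().count("\n")
        pos = m.end()
```

With `pass_unknown=True` the input `x := 3 @ y` produces an UNKNOWN token for `@`; with `pass_unknown=False` the parser never sees `@`, and the message is recorded in `errors` instead. In both cases tokenization continues with the next character.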

All error messages should be generated by a single error routine that any module in the compiler calls whenever it detects an error. The caller can pass some indication of which error message is to be generated. The basic requirement for this assignment is that, when an error is detected, an error message is generated and tokenization continues.
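
One possible shape for such a routine is sketched below. The names `ErrorCode`, `ErrorReporter`, and `skip_comment` are assumptions made for illustration, and the message wording is invented. The caller indicates which message to generate by passing an error code, and the comment scanner shows a caller that reports an error and then carries on.

```python
import sys
from enum import Enum
from typing import List


class ErrorCode(Enum):
    # Hypothetical codes for the two lexical errors described above.
    ILLEGAL_CHARACTER = "illegal character"
    NESTED_COMMENT_OPEN = "'{' inside a comment"


class ErrorReporter:
    """Single place where every compiler module reports its errors."""

    def __init__(self) -> None:
        self.messages: List[str] = []

    def report(self, code: ErrorCode, line: int, detail: str = "") -> None:
        suffix = f": {detail}" if detail else ""
        message = f"line {line}: {code.value}{suffix}"
        self.messages.append(message)
        print(message, file=sys.stderr)


def skip_comment(source: str, pos: int, line: int,
                 reporter: ErrorReporter) -> int:
    """Skip a comment that starts at source[pos] == '{'.

    Returns the index just past the closing '}', or len(source) if the
    input ends first. A '{' inside the comment is reported, then skipped.
    """
    pos += 1
    while pos < len(source) and source[pos] != "}":
        if source[pos] == "{":
            reporter.report(ErrorCode.NESTED_COMMENT_OPEN, line)
        elif source[pos] == "\n":
            line += 1
        pos += 1
    return min(pos + 1, len(source))
```

Keeping the message text inside the routine means that callers only name the error, so wording and output format can be changed in one place.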

Depending on how you structure your lexical analyzer, there may be other errors that could occur. For example, real numbers in our subset of Pascal are required to have a digit both before and after the decimal point, and in reals with exponential notation, digits must follow the "e" (e.g., "5.", "5.3e.", and "5.3ea" are all illegal). In principle, since the dot is a legal token in our language, and since blanks are optional except around keywords, the lexical analyzer could treat a sequence like 5.3e. as three tokens: the real number 5.3, the identifier "e", and the dot. Obviously, this will lead to an error at syntax analysis time, so at some level it is not the lexical analyzer's problem. On the other hand, it is a fairly sure bet that the error is a badly formed number, and the error the parser will generate is likely to be meaningless. Whether to handle this in the lexical analyzer or not is up to you as a compiler designer.
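
If you do choose to catch badly formed numbers in the lexical analyzer, one approach is to match a strict pattern and a deliberately looser one at the same position, and to flag the number when the loose match runs further than the strict one. The sketch below is only an illustration: the function name `scan_number`, the token kinds, and the assumption that an exponent may carry an optional sign are not taken from the language definition.

```python
import re
from typing import Tuple

# Well-formed real: digits on both sides of the dot, and digits after the
# exponent letter. The optional sign on the exponent is an assumption.
STRICT_REAL = re.compile(r"\d+\.\d+(?:[eE][+-]?\d+)?")
INTEGER = re.compile(r"\d+")
# Loose pattern: also accepts a trailing dot or an unfinished exponent.
LOOSE_NUMBER = re.compile(r"\d+(?:\.\d*(?:[eE][+-]?\d*)?)?")


def scan_number(source: str, pos: int) -> Tuple[str, str]:
    """Classify the number starting at source[pos]; return (kind, lexeme)."""
    loose = LOOSE_NUMBER.match(source, pos)
    if loose is None:
        raise ValueError("scan_number must be called at a digit")
    strict = STRICT_REAL.match(source, pos)
    kind = "REAL"
    if strict is None:
        # No well-formed real here, so fall back to the leading integer.
        strict = INTEGER.match(source, pos)
        kind = "INTEGER"
    if loose.end() > strict.end():
        # The text looks like a number but runs past the well-formed part.
        return "BAD_NUMBER", loose.group()
    return kind, strict.group()
```

Under these assumptions, `scan_number("5.3e.", 0)` yields a BAD_NUMBER for the lexeme `5.3e`, leaving the final dot to be scanned as an ordinary token, while `5.3e+2` is accepted as a REAL. A lexical analyzer that prefers to leave the problem to the parser would simply use the strict patterns alone.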