CMPU 331: Compiler Design (Spring 2018)
1 Overview
2 Tokens
2.1 Token Types
2.2 Values
3 Hand-written Lexical Analyzer
Considerations
4 Error handling
4.1 Error types
4.2 Error recovery
5 Using a lexical analyzer generator: JLex
6 Submission

CMPU 331: Compiler Design (Spring 2018)

Lexical Analyzer

1 Overview

The lexical analyzer will be designed to isolate tokens from an input source file containing a program written in a subset of the Pascal language. The lexical analyzer will be set up to be accessed by the parser, which will be written later in the semester.

Lexical analyzers can be written by hand or generated automatically by tools such as JLex; in both cases they are based on a specification of the tokens as regular expressions. Generated lexical analyzers build large tables and are typically less efficient than hand-written ones; since hand-written analyzers are fairly easy to program, that is usually the method chosen. However, you are free to use a lexical analyzer generator for this portion of the project. Whichever method you choose, you will likely spend about the same amount of time and effort.

2 Tokens

Each time it is called, the Lexical Analyzer returns a type-value pair representing the next token found in the source. The token types in our language are listed below. For each token that is isolated, the lexical analyzer will determine its type and set the type specification to the appropriate value.

2.1 Token Types

PROGRAM
BEGIN
END
VAR
FUNCTION
PROCEDURE
RESULT
INTEGER
REAL
ARRAY
OF
IF
THEN
ELSE
WHILE
DO
NOT
IDENTIFIER
INTCONSTANT
REALCONSTANT
RELOP
MULOP
ADDOP
ASSIGNOP
COMMA
SEMICOLON
COLON
RIGHTPAREN
LEFTPAREN
RIGHTBRACKET
LEFTBRACKET
UNARYMINUS
UNARYPLUS
DOUBLEDOT
ENDMARKER (single dot)
ENDOFFILE

2.2 Values

The value portion of the type-value pair will vary depending on the type of token isolated.
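For example, a token might be represented as a small class pairing a type with a value. The sketch below shows one possible shape, assuming a Java implementation; the names Token and TokenType (and the abbreviated list of enum constants) are illustrative, not prescribed by this assignment.

```java
// One possible representation of a type-value pair.
// Only a few of the token types from Section 2.1 are listed here.
enum TokenType { PROGRAM, IDENTIFIER, INTCONSTANT, ADDOP, ENDOFFILE }

class Token {
    private final TokenType type;
    private final String value;   // e.g. the lexeme or an operator code

    Token(TokenType type, String value) {
        this.type = type;
        this.value = value;
    }

    TokenType getType() { return type; }
    String getValue()   { return value; }

    @Override
    public String toString() { return "[" + type + ", " + value + "]"; }
}
```

The value might instead be an integer code, a symbol-table reference, or absent entirely for keyword tokens; that choice is up to you.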

3 Hand-written Lexical Analyzer

The hand-written lexical analyzer should include a class called Tokenizer with a public method called getNextToken (or a suitably comparable name) that returns the next valid token in the input stream.
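A minimal sketch of such a class is shown below, assuming a Java implementation that scans an in-memory string (a real analyzer would read from the source file). Only whitespace, integer constants, and end-of-file are handled, just to show the shape of getNextToken; everything beyond the names Tokenizer and getNextToken is an illustrative assumption.

```java
// Minimal token representation, repeated here so the sketch is self-contained.
class Token {
    final String type;
    final String value;
    Token(String type, String value) { this.type = type; this.value = value; }
}

class Tokenizer {
    private final String source;
    private int pos = 0;

    Tokenizer(String source) { this.source = source; }

    // Returns the next valid token in the input, or an ENDOFFILE token
    // when the input is exhausted.
    public Token getNextToken() {
        // Skip whitespace between tokens.
        while (pos < source.length() && Character.isWhitespace(source.charAt(pos))) pos++;
        if (pos == source.length()) return new Token("ENDOFFILE", "");
        char c = source.charAt(pos);
        if (Character.isDigit(c)) {
            int start = pos;
            while (pos < source.length() && Character.isDigit(source.charAt(pos))) pos++;
            return new Token("INTCONSTANT", source.substring(start, pos));
        }
        // ... cases for identifiers/keywords, operators, etc. would go here ...
        throw new IllegalStateException("unexpected character: " + c);
    }
}
```

Each call consumes exactly one token, so the parser can later pull tokens on demand.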

Considerations

4 Error handling

4.1 Error types

Relatively few errors can occur in the lexical phase of compilation. These include, for example, encountering an invalid character or an ill-formed comment.

These errors should be caught during lexical analysis and a meaningful error message issued. Remember that the message is for the user (programmer), who knows nothing about the internal objects used by the compiler, such as tokens. Messages that refer to these objects will only confuse your user!

You must decide what you want to include in your error messages. If, for example, you want to include information such as the line number, character position, and the text of the line of code where the error occurred, you will have to retain this information during the course of lexical analysis. You may of course do as much or as little as you want, but your goal should be to make a useful and user-friendly compiler.

I suggest you create a CompilerError class that extends Exception (in Java) and then, for each phase of the compiler, create a class that extends it. For this phase, you would need a LexicalError class.
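One possible shape for this hierarchy is sketched below; the message format (line and column numbers) is an assumption, not something the assignment prescribes.

```java
// Base class for errors reported by any phase of the compiler.
class CompilerError extends Exception {
    CompilerError(String message) { super(message); }
}

// Errors detected during lexical analysis. The constructor builds a
// user-facing message that avoids compiler-internal terminology.
class LexicalError extends CompilerError {
    LexicalError(String message, int line, int column) {
        super("Lexical error at line " + line + ", column " + column + ": " + message);
    }
}
```

Later phases would add sibling classes (e.g. a parse-error class) that also extend CompilerError, so a driver can catch all compilation errors uniformly.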

4.2 Error recovery

The basic requirement is that the compiler simply stop when an error is found: issue a message and cease compilation.

Recovering from an error so that analysis can continue is difficult: if, for example, you encounter an invalid character and try to discard it and continue, it is unlikely that you can figure out what the programmer actually intended the token(s) to be. As a result, returning a token that represents your best guess will typically just cause a parsing error that is less meaningful than flagging the invalid character.

You are, however, free to implement some error recovery in the lexical analyzer if you wish. One strategy is to skip characters until a valid prefix of a token is found, in an attempt to find additional lexical errors in the same input so that they can all be corrected before the lexical analyzer is re-run. Some of these subsequent errors may not be real errors, however, but may instead be caused by the lexical analyzer skipping too few (or too many) characters after the first error. If, for example, the start of a comment is ill-formed, the lexical analyzer may try to interpret the contents of the comment as individual tokens; if the end of a comment is ill-formed, the lexical analyzer will read until the end of the next comment (if any) before continuing, and hence skip too much text.
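The skipping strategy above might be sketched as a small helper that advances past characters that cannot begin a token. The set of legal start characters used here is an assumption based on the token list in Section 2.1, and the name skipToTokenStart is illustrative.

```java
// Illustrative panic-mode recovery: advance past characters that cannot
// start a token, so scanning can resume at the next plausible token.
class RecoverySketch {
    // Symbols that can begin an operator or punctuation token (assumed).
    private static final String SYMBOL_STARTS = "+-*/=<>:;,.()[]";

    // Returns the index of the first character at or after 'from' that
    // could start a token, or input.length() if none remains.
    static int skipToTokenStart(String input, int from) {
        int i = from;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isLetterOrDigit(c) || SYMBOL_STARTS.indexOf(c) >= 0) {
                return i;
            }
            i++;   // discard a character that cannot start any token
        }
        return i;
    }
}
```

After skipping, the analyzer would report the discarded span and resume getNextToken at the returned index; as noted above, tokens found after recovery should be treated with suspicion.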

When the lexical analyzer finds an error, the consumer of the tokens that it produces cannot usually itself produce a valid result. However, other phases of the compiler may try to find other errors in the remaining input, again allowing the user to find several errors in one edit-compile cycle. Some of the subsequent errors may really be spurious errors caused by lexical error(s), so the user will have to guess at the validity of every error message except the first, as only the first error message is guaranteed to be a real error.

5 Using a lexical analyzer generator: JLex

If you choose to use the lexical analyzer generator JLex, supporting materials can be found here. Most useful is the User Manual.

6 Submission

Please follow the Submission instructions.