CMPU-331 Compilers (Spring 2019)

Overview

The lexical analyser (“lexer”) isolates tokens from an input source file containing a program written in the language it is designed to recognise. For this project, your lexer will recognise tokens in 'Vascal', a custom language which is a subset of Pascal. The lexer must be designed so that it can be called by the parser, which you will write later in the semester.

Requirements

  1. Your lexical analyser must be written from scratch: you may not use any external libraries.
  2. You may not use a regular expression analyser of any kind.
  3. The lexical analyser must have a function which returns a token, i.e. some form of type-value pair; this function must be publicly callable.
  4. You must implement some form of error handling.

Design

Lexical Conventions

Here is a link to the lexical conventions for Vascal. Please make sure you follow it closely.

Tokens

Each time the lexer is called, it returns a type-value pair representing the next token found in the input stream. For each token isolated, the lexer must determine its type and set the type specification to an appropriate value.
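One minimal way to model that pair is sketched below; the class, enum, and method names are illustrative choices, not requirements of the assignment.

  // A token is a type-value pair. TokenType would be an enum listing
  // every token type in the list below (PROGRAM, IDENTIFIER, RELOP, ...).
  public class Token {
      private final TokenType type;
      private final String value;   // null for keywords, UNARYPLUS, UNARYMINUS

      public Token(TokenType type, String value) {
          this.type = type;
          this.value = value;
      }

      public TokenType getType() { return type; }
      public String getValue()   { return value; }
  }

For simplicity this sketch stores every value as a String; you could instead store the operators' integer values (described under Values below) in an int or Object field.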

Here is a complete list of tokens in Vascal, grouped into keywords and all other token types.

Keywords:

PROGRAM  BEGIN  END  VAR  FUNCTION  PROCEDURE  RESULT  INTEGER  REAL
ARRAY  OF  NOT  IF  THEN  ELSE  WHILE  DO

Other tokens:

IDENTIFIER  INTCONSTANT  REALCONSTANT  RELOP  MULOP  ADDOP  ASSIGNOP
COMMA  SEMICOLON  COLON  LEFTPAREN  RIGHTPAREN  LEFTBRACKET  RIGHTBRACKET
DOUBLEDOT  UNARYMINUS  UNARYPLUS  ENDMARKER  ENDOFFILE

Values

The value portion of the pair will vary depending on the type of token.

  1. For keywords, UNARYPLUS and UNARYMINUS, value will be null.
  2. For identifiers and constants, value is the lexeme representing the identifier or constant.
  3. For RELOP, ADDOP and MULOP, value is an integer as shown in the table below.


RELOP                MULOP                ADDOP
Operator   Value     Operator   Value     Operator   Value
=          1         *          1         +          1
<>         2         /          2         -          2
<          3         DIV        3         OR         3
>          4         MOD        4
<=         5         AND        5
>=         6
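
For example (hypothetical output notation; the exact representation is up to you), scanning the input A <= 5.3 would yield the pairs:

  [IDENTIFIER, A]
  [RELOP, 5]
  [REALCONSTANT, 5.3]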

Considerations

  1. Tokens are constructed in a greedy fashion; i.e., the longest possible character sequence comprising a valid token is returned.
  2. You need two-character lookahead to handle the DOUBLEDOT token: 5.. is two tokens (the integer 5 followed by DOUBLEDOT), whereas 5.3 is a single real constant, and you do not know which is the case until you read the third character in the sequence. (See the sketch after this list.)
  3. You must specify a maximum length for identifiers and constants for your implementation. This is up to you as the compiler writer.
  4. There are three possible interpretations for a plus or minus sign: a binary operator (token type ADDOP), a unary operator, or part of a constant. The lexical analysis routine has to determine which is the case. This can be handled as follows (see the sketch after this list):
    1. Upon locating a plus or minus sign in the source, check the type of the previous token identified. (This means that you must have some way of knowing what the last token was.)
    2. If the type of the last token scanned is RIGHTPAREN, RIGHTBRACKET, IDENTIFIER, or CONSTANT, then assign the type ADDOP and the appropriate value.
    3. Otherwise, it is a unary operator. You can either return it as such (i.e. as UNARYMINUS or UNARYPLUS) or check to see if it is part of a constant. If it is, the next character in the input stream will be a digit, and you can regard the plus or minus sign as part of a constant and continue scanning to isolate the rest of the number.
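
Here is a minimal sketch of both techniques, using java.io.PushbackReader for the two-character lookahead; the class and method names, and the use of PushbackReader itself, are illustrative choices rather than requirements.

  import java.io.IOException;
  import java.io.PushbackReader;
  import java.io.StringReader;

  public class LookaheadSketch {

      // Called after the integer part of a number and a first '.' have been
      // read. Peeks one more character to decide between a real constant
      // ("5.3") and an integer followed by DOUBLEDOT ("5..").
      static boolean dotIsDoubleDot(PushbackReader in) throws IOException {
          int next = in.read();
          if (next == '.') {
              return true;           // "5..": emit INTCONSTANT, then DOUBLEDOT
          }
          if (next != -1) {
              in.unread(next);       // "5.3": keep scanning the real constant
          }
          return false;
      }

      // A '+' or '-' is a binary ADDOP only when the previous token could
      // end an operand; otherwise it is unary (or the sign of a constant).
      static boolean isBinaryAddop(String prevTokenType) {
          return prevTokenType.equals("RIGHTPAREN")
              || prevTokenType.equals("RIGHTBRACKET")
              || prevTokenType.equals("IDENTIFIER")
              || prevTokenType.equals("INTCONSTANT")
              || prevTokenType.equals("REALCONSTANT");
      }

      public static void main(String[] args) throws IOException {
          // "5.." case: after '5' and the first '.', the next '.' decides.
          PushbackReader in = new PushbackReader(new StringReader("5.."), 2);
          in.read();                                        // '5'
          in.read();                                        // first '.'
          System.out.println(dotIsDoubleDot(in));           // true

          System.out.println(isBinaryAddop("IDENTIFIER"));  // true: binary ADDOP
          System.out.println(isBinaryAddop("ASSIGNOP"));    // false: unary
      }
  }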

Resources

Here are four sample files and their expected output. Note that not all possible edge cases may be covered.

Error Handling

Types

Relatively few errors can occur in the lexical phase of compilation. These include:

  1. Invalid character in the input. This occurs when a character outside the Vascal character set is encountered. Please see the lexical conventions for a list of valid Vascal characters.
  2. Ill-formed constant. Real numbers in our subset of Pascal are required to have a digit both before and after the decimal point, and in reals with exponential notation, digits must follow the e (e.g. 5. , 5.3e. and 5.3ea are all illegal). In principle, since . is a legal token in our language, and since blanks are optional except around keywords, the lexical analyzer could treat a sequence like 5.3e. as three tokens: the real number 5.3, the identifier e, and the token . . Obviously this will lead to an error at syntax analysis time, so at some level it is not the lexical analyzer's problem. On the other hand, it is a fairly sure bet that the error is a badly formed number, and the error the parser will generate is likely to be meaningless. Whether to handle this in the lexical analyzer is up to you as a compiler designer.
  3. Ill-formed comment. A comment that is not closed with a corresponding } (right brace) should cause an error. Also note that the body of a comment may contain any characters except the end-of-comment character, so { this } } is not allowed. (See the sketch after this list.)
  4. Identifier too long. As the compiler writer you must decide what the maximum length of an identifier is, and issue an error when that length is exceeded.
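
A minimal sketch of comment handling, assuming the LexicalError class suggested further below; the method shape and its parameters are illustrative only.

  import java.io.IOException;
  import java.io.Reader;

  class CommentSketch {
      // Consumes the body of a comment whose opening '{' has just been read.
      // If end of file arrives first, the comment was never closed. A stray
      // '}' outside any comment is reported later, as an invalid character.
      static void skipComment(Reader in, int startLine)
              throws IOException, LexicalError {
          int c;
          while ((c = in.read()) != -1) {
              if (c == '}') {
                  return;            // end of comment reached
              }
          }
          throw new LexicalError(
              "comment opened on line " + startLine + " is never closed");
      }
  }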


These errors should be caught during lexical analysis and a meaningful error message issued. Remember that the message is for the user (programmer), and that he/she knows nothing about the internal objects used by the compiler, such as tokens. Messages that refer to these objects will only confuse your user!

You must decide what you want to include in your error messages. If, for example, you want to include information such as the line number, character position, and the text of the line of code where the error occurred, you will have to retain this information during the course of lexical analysis. You may of course do as much or as little as you want, but your goal should be to make a useful and user-friendly compiler.

I suggest you create a CompilerError class that extends Exception, and then create a class extending it for each phase of the compiler. Here, you would need a LexicalError class.
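
A sketch of that hierarchy follows; the constructor shape is an illustrative choice, and in Java each public class would go in its own source file.

  // Base class for all compiler errors. You might also carry the line
  // number, character position, and offending line of source text here,
  // as discussed above.
  public class CompilerError extends Exception {
      public CompilerError(String message) {
          super(message);
      }
  }

  // One subclass per compiler phase; the lexer throws this one.
  public class LexicalError extends CompilerError {
      public LexicalError(String message) {
          super(message);
      }
  }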

Recovery

The minimum (and acceptable) requirement is for the compiler simply to issue a message and stop compilation.

Recovering from an error so that the analysis can continue is difficult. For example, if you encounter an invalid character and try to discard it and continue, it is relatively unlikely that you can work out what the programmer actually intended the token(s) to be. As a result, if you return a token that represents your best guess, it will typically just cause a parsing error that is less meaningful than flagging the invalid character.

You are, however, free to implement some error recovery in the lexical analyzer if you wish. One strategy is to skip characters until a valid prefix of a token is found, in an attempt to find additional lexical errors in the same input so that they can be corrected by the user before re-running the lexical analyzer. Some of these subsequent errors may, however, not be real errors but may be caused by the lexical analyzer not skipping enough characters (or skipping too many) after the first error is found. If, for example, the start of a comment is ill-formed, the lexical analyzer may try to interpret the contents of the comment as individual tokens, and if the end of a comment is ill-formed, the lexical analyzer will read until the end of the next comment (if any) before continuing, hence skipping too much text.
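
If you do attempt recovery, a minimal panic-mode sketch might look like the following; the canStartToken helper and its character set are illustrative assumptions, not part of the assignment.

  import java.io.IOException;
  import java.io.PushbackReader;

  class RecoverySketch {
      // After reporting a lexical error, discard characters until one that
      // could begin a token is found, then hand control back to the scanner.
      static void recover(PushbackReader in) throws IOException {
          int c;
          while ((c = in.read()) != -1) {
              if (canStartToken((char) c)) {
                  in.unread(c);      // leave the good character for the scanner
                  return;
              }
          }
      }

      // Characters that can begin a Vascal token: letters and digits, plus
      // the operator, punctuation, and comment-opening characters.
      static boolean canStartToken(char c) {
          return Character.isLetterOrDigit(c)
              || "+-*/=<>(),;:.[]{".indexOf(c) >= 0;
      }
  }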

When the lexical analyzer finds an error, the consumer of the tokens it produces (the parser) usually cannot itself produce a valid result. However, other phases of the compiler may still try to find errors in the remaining input, again allowing the user to find several errors in one edit-compile cycle. Some of the subsequent errors may be spurious, caused by the earlier lexical error(s), so the user will have to judge the validity of every error message except the first; only the first error message is guaranteed to be a real error.

Submission