CMPU-331 Compilers (Fall 2018)

Overview

The function of the parser is to ensure that the stream of tokens conforms to the rules of the language; that is, that the input is syntactically correct. It consists of three components:

  1. Parser. This code calls the lexical analyser for the next token in the stream until ENDOFFILE is reached and manipulates the parse stack.
  2. RHS Table. This contains a list of right-hand sides (productions) for all non-terminals in the Vascal grammar.
  3. Parse table. This matrix is indexed by stack symbol (non-terminals and terminals) and the current symbol from the input stream (terminals only). Each cell contains an integer value which can represent a production number in the RHS table, an accept action, or an error.

Requirements

  1. Your parser must be written from scratch: you may not use any external libraries.
  2. Your parser must be able to work with the lexical analyser you have already written.
  3. You must implement a function which can be called during debugging which prints the current contents of the parse stack, e.g. DumpStack(). This must be able to be turned on and off, as the compiler will ultimately not generate such internal information for users.

Design

Parsing Algorithm

The parser works as follows. It examines the symbol on the top of the stack and the current input symbol (i.e. the token most recently passed to the parser by the lexical analyser). If the stack symbol is a terminal and matches the current input symbol, the stack top is popped and the parser requests another input token. If the stack symbol is a non-terminal, the parser consults the cell in the parse table corresponding to the stack symbol and the current input symbol: if that entry is accept, the parse has successfully completed; if it is error, the error-handling routine is called.

If the entry in the cell is a production number, the parser must replace the non-terminal on top of the stack with the RHS of the appropriate production. In other words, the production number is used as an index into the RHS table. Remember that you must push the symbols of the production onto the stack in reverse order.

Once the last symbol on the stack (this is usually ENDOFFILE) has been successfully matched and the stack is empty, the parse has been completed.

Initialisation

To initialise the parser, the stack must contain the distinguished symbol of the grammar, <Goal>, with the ENDOFFFILE terminal symbol underneath it. Once this is done, the parser is ready to go.

Suggestion

It is strongly recommended you do not hard-code the grammar and parse table, but instead write functions to load them from disc. This is, however, only a recommendation: your program is your own to design. :-)

Resources

Grammar Files

The grammar you will be using is here: Vascal grammar.

The associated parse table is here: Vascal parse table.

Sample Parses

  1. Source code –> Completed parse (69 steps).
  2. Source code –> Completed parse (1385 steps).

Error Handling

Overview

A few general principles apply to errors found regardless of parsing technique being used:

  • A parser should try to determine that an error has occurred as soon as possible. Waiting too long before declaring an error can cause the parser to lose the actual location of the error.
  • A suitable and comprehensive message should be reported. Missing semicolon on line 36 is helpful; Error in parse table for non-terminal A with input b is not.
  • A parser should avoid cascading errors, which is when one error generates a lengthy sequence of spurious error messages.

Recognizing that the input is not syntactically valid can be relatively straightforward. An error is detected in predictive parsing when the terminal on top of the stack does not match the next input symbol or when non-terminal A is on top of the stack, a is the next input symbol and the parsing table entry M[A, a] is empty.

Recovery

The basic requirement for the assignment is to abort after reporting the first error.

Recovering from errors and being able to resume and successfully parse is a bit more difficult.

Most users would like to find out more than one error per compilation, so you may want to implement some error recovery. The problem is how to fix the error in some way to allow parsing to continue, including finding a reasonable place to resume the parse.

Many errors are relatively minor and involve syntactic violations for which the parser has a correction that it believes is likely to be what the programmer intended. For example, a missing semicolon at the end of the line or a misspelled keyword can usually be recognized. For many minor errors, the parser can “fix” the program by guessing at what was intended and reporting a warning, but allowing compilation to proceed unhindered. The parser might skip what appears to be an erroneous token in the input or insert a necessary, but missing, token or change a token into the one expected (substituting BEGIN for BGEIN). For more major or complex errors, the parser may have no reliable correction. The parser will attempt to continue but will probably have to skip over part of the input or take some other exceptional action to do so.

Two common and fairly simple techniques are “panic mode” and phrase-level error recovery.

Panic Mode

Panic mode error recovery is a simple technique that just bails out of the current construct, looking for a safe symbol at which to restart parsing. The parser just discards input tokens until it finds what is called a synchronizing token. The set of synchronizing tokens are those that we believe confirm the end of the invalid statement (e.g. a semicolon) and allow us to pick up at the next piece of code. The basic procedure is this:

  • Skip tokens until you find the next semi-colon. If you hit ENDOFFILE, this is an unrecoverable error.
  • Pop the parse stack until you find the non-terminal <statement_list_tail>.
  • Re-populate the stack with the expected symbols when a semi-colon is encountered (<statement_list_tail> (if you removed the one you found), <statement>, SEMICOLON).
Phrase-Level

Phrase-level error recovery fills blank entries in the parse table with specific error actions. These actions may change, delete, or insert symbols in the input and/or issue appropriate error messages. At the very least you can provide a precise error message that describes the mismatch between expected and found.

Submission