class: center, middle

# Languages, Syntax, and Parsing

_CMPU 331 - Compilers_

---

# Recap: Stages of Compilation

1. Lexical Analysis
2. Parsing
3. Semantic Analysis
4. Optimization
5. Code Generation

![multistage pipeline](/~cs331/images/lectures/multistage.png)

(source: _Language Implementation Patterns_ by Terence Parr)

---

# Recap: Lexical Analysis

First step is to recognize the tokens (words):

```
(+ 3 4)
```

We could identify the useful tokens as:

```
OPENPAREN ADDOP INTEGER CLOSEPAREN
```

---

# Recap: Parsing

Second step is to understand the structure (syntax):

![scheme diagram](/~cs331/images/lectures/scheme_diagram.png)

---

# What does a lexer do?

* **Input:** string of characters
* **Output:** sequence of tokens

---

# What does a parser do?

* **Input:** sequence of tokens from lexer
* **Output:** parse tree of the program (or some other intermediate representation)

---

# What does a parser do?

* Not all sequences of tokens are valid programs
* Parser must distinguish
* Need
    * A way to describe valid sequences of tokens
    * A method to distinguish valid from invalid sequences of tokens

---

# Context-Free Grammars

* Programming languages have a recursive structure

```
if EXPR then EXPR else EXPR
while EXPR loop EXPR
...
```

* Context-free grammars are a natural notation for recursive structure

---

# Context-Free Grammars

A context-free grammar (CFG) consists of:

* A set of _terminals_ **_T_**
* A set of _non-terminals_ **_N_**
* A _start symbol_ **_S_** (a non-terminal)
* A set of _productions_

> _X_ → _Y_<sub>1</sub>_Y_<sub>2</sub> ...
_Y_<sub>_n_</sub>

where

* _X_ is a non-terminal
* Each _Y_<sub>_i_</sub> is a terminal or a non-terminal; the right-hand side may also be empty (the empty string ε)

---

# Notation

* In these slides
    * Non-terminals are written upper-case
    * Terminals are written lower-case
    * The start symbol is the left-hand side of the first production
* More about other notations later

---

# Examples of CFGs

Simple arithmetic expressions:

```
E → E * E | E + E | (E) | id
```

for example:

```
(id + id) * (id + id)
```

---

# Language of CFG

Read productions as rules:

> _X_ → _Y_<sub>1</sub> ... _Y_<sub>_n_</sub>

* Means _X_ can be replaced by _Y_<sub>1</sub> ... _Y_<sub>_n_</sub>

---

# Key Idea

* (1) Begin with a string consisting of the start symbol _S_
* (2) Replace any non-terminal _X_ in the string with the right-hand side of some production

    > _X_ → _Y_<sub>1</sub> ... _Y_<sub>_n_</sub>

* (3) Repeat (2) until there are no non-terminals in the string

---

# Terminals

* Terminals are called that because there are no rules for replacing them
* They are "terminal" in the sense of "the end"
* Terminals should be the tokens of the language

---

# Derivations and Parse Trees

A _derivation_ is a sequence of replacement steps, each applying one production

> _S_ → ... → ... → ...

A derivation can be drawn as a tree

* Start symbol is the tree's root
* For a production _X_ → _Y_<sub>1</sub> ... _Y_<sub>_n_</sub> add children _Y_<sub>1</sub> ...
_Y_<sub>_n_</sub> to node _X_

---

# Example

Grammar:

```
E → E * E | E + E | (E) | id
```

String:

```
(id + id) * (id + id)
```

---

# Example

Grammar:

```
E → E * E | E + E | (E) | id
```

![derivation](/~cs331/images/lectures/cfg_diagram.png)

---

# Notes on Derivations

* A parse tree has
    * Terminals at the leaves
    * Non-terminals at the interior nodes
* An in-order traversal of the leaves is the original input
* The parse tree shows the association of operations (the input string does not)

---

# Backus-Naur Form (BNF)

Another CFG notation, common in programming language descriptions

```
<expr> ::= <term> | <expr> + <term> | <expr> - <term>
<term> ::= <factor> | <term> * <factor> | <term> / <factor>
<factor> ::= <integer> | (<expr>)
<integer> ::= <digit> | <integer> <digit>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
```

---

# Extended Backus–Naur Form (EBNF)

Another CFG notation, common in programming language descriptions

```
expr = term | expr "+" term | expr "-" term.
term = factor | term "*" factor | term "/" factor.
factor = integer | "("expr")".
integer = digit | integer digit.
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9".
```

---

# Room to Explore

* Required material is limited (assignments/project)
* We have time to do more
* Interesting, fun, and can help you understand compilers
* Not assessed
* What are you curious about?
* What languages would you like to see?
* I'll build some end-to-end examples

---

# Esoteric Languages

* Experiment with language design and compilers
* Proof of concept
* Push the envelope
* ...or as a joke

---

# Esoteric Languages - Ook!

* Created by David Morgan-Mar
* All tokens are orangutan words

---

# Esoteric Languages - Ook!

Complete language description:

`Ook. Ook?` - Move the pointer to the right

`Ook? Ook.` - Move the pointer to the left

`Ook. Ook.` - Increment the memory cell under the pointer

`Ook! Ook!` - Decrement the memory cell under the pointer

`Ook!
Ook.` - Output the character signified by the cell at the pointer

`Ook. Ook!` - Input a character and store it in the cell at the pointer

`Ook! Ook?` - Jump past the matching `Ook? Ook!` if the cell under the pointer is 0

`Ook? Ook!` - Jump back to the matching `Ook! Ook?`

---

# Esoteric Languages - Ook!

Example program:

```
Ook. Ook. Ook. Ook. ... Ook! Ook.
```

1. Increment memory cell #0 to 65
    * increment by 1, repeated 65 times
2. Output the character that corresponds to the value of memory cell #0
    * ASCII character 65 is "A"

---

# Esoteric Languages - Ook!

The "Hello, world!" program:

```
Ook. Ook? Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook! Ook? Ook? Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook? Ook! Ook! Ook? Ook! Ook? Ook. Ook! Ook. Ook. Ook? Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook! Ook? Ook? Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook? Ook! Ook! Ook? Ook! Ook? Ook. Ook. Ook. Ook! Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook! Ook. Ook! Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook! Ook. Ook. Ook? Ook. Ook? Ook. Ook? Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook! Ook? Ook? Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook? Ook! Ook! Ook? Ook! Ook? Ook. Ook! Ook. Ook. Ook? Ook. Ook? Ook. Ook? Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook! Ook? Ook? Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook? Ook! Ook! Ook? Ook! Ook? Ook. Ook! Ook! Ook! Ook! Ook! Ook! Ook! Ook. Ook? Ook. Ook? Ook. Ook? Ook. Ook? Ook. Ook! Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook! Ook. Ook! Ook! Ook! Ook! Ook! Ook! Ook! Ook! Ook! Ook! Ook! Ook! Ook! Ook. Ook! Ook! Ook! Ook! Ook! Ook! Ook! Ook! Ook! Ook! Ook! Ook! Ook! Ook! Ook! Ook! Ook! Ook. Ook. Ook? Ook. Ook? Ook. Ook.
Ook! Ook.
```

---

# Esoteric Languages - Ook!

* Not human-friendly
* Not even orangutan-friendly
* Based on an earlier language, BF, by Urban Müller
    * Goal of BF: smallest possible compiler on Amiga OS (1993)
    * Result: 240-byte compiler
    * Beat the previous record: FALSE, with a 1024-byte compiler
* Turing-complete, can perform any calculation a universal Turing machine can
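---

# Esoteric Languages - Ook!

The eight-command description is small enough to sketch as a lexer, a loop-matching pass, and an interpreter. The Python below is a minimal sketch, not a reference implementation. The opcode names and the functions `lex`, `match_loops`, and `run` are hypothetical. The 30,000-cell tape and 8-bit wrap-around are assumptions borrowed from common BF conventions, not something these slides specify.

```python
import re

# Word pairs from the language description, mapped to hypothetical opcode names.
COMMANDS = {
    ("Ook.", "Ook?"): "RIGHT",
    ("Ook?", "Ook."): "LEFT",
    ("Ook.", "Ook."): "INC",
    ("Ook!", "Ook!"): "DEC",
    ("Ook!", "Ook."): "OUT",
    ("Ook.", "Ook!"): "IN",
    ("Ook!", "Ook?"): "LOOP_START",
    ("Ook?", "Ook!"): "LOOP_END",
}


def lex(source):
    """Lexer: string of characters in, sequence of command tokens out."""
    words = re.findall(r"Ook[.?!]", source)
    if len(words) % 2 != 0:
        raise SyntaxError("Ook! words must come in pairs")
    tokens = []
    for pair in zip(words[0::2], words[1::2]):
        if pair not in COMMANDS:
            raise SyntaxError("unknown command: " + " ".join(pair))
        tokens.append(COMMANDS[pair])
    return tokens


def match_loops(tokens):
    """Pair each LOOP_START with its LOOP_END; reject unbalanced programs."""
    jumps, stack = {}, []
    for pos, tok in enumerate(tokens):
        if tok == "LOOP_START":
            stack.append(pos)
        elif tok == "LOOP_END":
            if not stack:
                raise SyntaxError("loop end without a matching loop start")
            start = stack.pop()
            jumps[start], jumps[pos] = pos, start
    if stack:
        raise SyntaxError("loop start without a matching loop end")
    return jumps


def run(source, stdin=""):
    """Interpret an Ook! program and return its output as a string."""
    tokens = lex(source)
    jumps = match_loops(tokens)
    cells = [0] * 30000  # assumption: tape size, not given in the slides
    ptr = pc = 0
    chars = iter(stdin)
    out = []
    while pc < len(tokens):
        tok = tokens[pc]
        if tok == "RIGHT":
            ptr += 1
        elif tok == "LEFT":
            ptr -= 1
        elif tok == "INC":
            cells[ptr] = (cells[ptr] + 1) % 256  # assumption: 8-bit cells
        elif tok == "DEC":
            cells[ptr] = (cells[ptr] - 1) % 256
        elif tok == "OUT":
            out.append(chr(cells[ptr]))
        elif tok == "IN":
            cells[ptr] = ord(next(chars, "\0")) % 256  # assumption: 0 on end of input
        elif tok == "LOOP_START" and cells[ptr] == 0:
            pc = jumps[pc]  # land on the matching end, then step past it
        elif tok == "LOOP_END":
            pc = jumps[pc]  # jump back; the loop start re-tests the cell
            continue
        pc += 1
    return "".join(out)
```

Calling `run("Ook. Ook. " * 65 + "Ook! Ook.")` reproduces the 65-increment example and returns `"A"`. The stack in `match_loops` is where the recursive structure shows up: nested loops need balanced matching, which a flat token scan cannot check on its own.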