Lexical Analysis

class: center, middle

# Lexical Analysis

_CMPU 331 - Compilers_

---

# Lexical Analysis

The first stage of compilation

![multistage pipeline](/~cs331/images/lectures/multistage.png)

(source: _Language Implementation Patterns_ by Terence Parr)

---

# Lexical Analysis

The first stage of compilation

```
    (+ 3 4)
```

* The input is just a string of characters

* Split it into substrings, called tokens

---

# What is a Token?

A word in a particular "syntactic" category

* In English: noun, verb, adjective ...

* In a programming language: identifier, integer, operator...

---

# Tokens

Tokens correspond to sets of strings

* Integer: a non-empty string of digits
* Identifier: string of letters or digits, starting with a letter
* Operator: "+", "-"...
* Keyword: "if", "else" ...
* Whitespace: a non-empty sequence of spaces, tabs, newlines...

---

# Tokens

* Classify the substrings in a program

* Output of lexer is a stream of tokens...

* ...tokens are input to the parser

* Parser builds on the distinctions between tokens

---

# How to Design a Lexer - Step 1

Define a finite set of tokens

* Tokens describe all substrings of interest

* Choice of tokens depends on language, design of parser

---

# Example

For our simple Scheme program:

```
    (+ 3 4)
```

We could identify the useful tokens as:

```
    OPENPAREN
    ADDOP
    INTEGER
    CLOSEPAREN
```

---

# How to Design a Lexer - Step 2

Describe which strings belong to each token

* Integer: a non-empty string of digits
* Operator: "+", "-" ...
* Whitespace: a non-empty sequence of spaces, tabs, newlines...

---

# How to Implement a Lexer

An implementation must do two things

1. Recognize substrings corresponding to tokens
2. Also capture the value or _lexeme_ of the token

For example, the token `ADDOP` corresponds to the lexeme "`+`".

---

# How to Implement a Lexer

* The goal is to partition the entire input string (the program)

* Reading left-to-right, recognizing one token at a time

* Discard tokens irrelevant to parsing, like comments and whitespace

---

# Lookahead

* Looking at one character at a time isn't always enough

```
    <
    <=
```

* Is that `LTOP` followed by `ASSIGNOP`? No, it's `LEOP`.

```
    i
    if
```

* Is that a variable `i` followed by a variable `f`? No, it's the `if` keyword.

* Lookahead may be required to decide where one token ends and the next token begins

---

# How to Implement a Lexer

* You can write a lexer by hand
  * 3.2 (Watson) has a detailed example

* Modern compilers use lexer generators

* Useful to know how they work

---

# Regular Languages

* There are several formalisms for specifying tokens
* Regular languages are the most popular
  * Simple and useful theory
  * Easy to understand
  * Efficient implementations

---

# Languages

**Definition.** Let ∑ be a set of characters. A _language
over_ ∑ is a set of strings of characters drawn
from ∑.

Example 1:

* Alphabet = English characters
* Language = English sentences
* Not every string of English characters is an English sentence

Example 2:

* Alphabet = ASCII characters
* Language = C programs
* Not every string of ASCII characters is a C program

---

# Regular Expressions

* Languages are sets of strings

* Need some notation for specifying which sets we want

* The standard notation for regular languages is _regular expressions_

---

# Atomic Regular Expressions

* Single character

> 'c' = {"c"}

* Empty string

> ε = {""}

---

# Compound Regular Expressions

* Union

> A | B = { s : s ∈  A or s ∈  B }

* Concatenation

> AB = { ab : a ∈  A and b ∈  B }

* Iteration

> A\* = { s : repeated 0 or more times }

> B+ = { s : repeated 1 or more times }

---

# Example

Integer: a non-empty string of digits

> digit = '0 ' | '1' | ' 2 ' | '3' | ' 4 ' | '5' | '6 ' | '7 ' | '8' | '9 '  
> integer = digit+

---

# Example

Identifier: string of letters or digits, starting with a letter

> letter = ‘A’ | ... | ‘Z’ | ‘a’ | ... | ‘z’  
> identifier = letter (letter | digit)\*

Question: Would this have the same result?

> identifier = letter (letter\* | digit\*)

---

# Example

Email address: anyone@vassar.edu

> name = (letter | digit)+  
> domain = letter+  
> email = name '@' name '.' domain

---

# Regular Expressions

* Regular expressions are simple, but useful

* A regular expression R describes some set of strings L(R)
  * L(R) is the "language" defined by R
  * If a string _s_ is valid in a language, we say _s_ ∈ L(R)

* Regular languages are a specification, we need an implementation

---

# Regular Expressions ⇒ Lexer

* (1) Write a regular expression for the lexemes of each token:

> INTEGER = digit+  
> IDENTIFIER = letter (letter | digit)\*  
> OPENPAREN = '('

* (2) Construct R from all lexemes for all tokens:

> R = INTEGER | IDENTIFIER | OPENPAREN | ...

* (3) Take a substring of the input, _s_. Check that _s_ ∈ L(R).

* (4) If success, remove _s_ from the input, and repeat (3).

---

# Ambiguities

* How much input is used? ("<" vs "<=")
  * Rule: pick longest possible string in L(R)

* Which token is used? ("if" keyword, not identifier)
  * Rule: pick the one defined first

---

# Next Lecture

* Lexer generator tools for the project
* Brief introduction to Python
* Release first assignment and first project milestone