Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
courses:cs331-201701:lexer [2018/09/10 18:26]
sellis
courses:cs331-201701:lexer [2018/09/21 11:43] (current)
sellis [Table]
Line 25: Line 25:
 === Tokens === === Tokens ===
  
-Each time the lexer is called, it returns a **type-value pair** representing the next token found in the input stream. For each token isolated, the lexer must dcetermine ​its type and set the type specification to an appropriate value.+Each time the lexer is called, it returns a **type-value pair** representing the next token found in the input stream. For each token isolated, the lexer must determine ​its type and set the type specification to an appropriate value.
  
-Here is a complete list of tokens in Vascal:+Here is a complete list of tokens in Vascal. (Keywords are listed in bold type.)
  
-|  PROGRAM ​   |  RESULT ​     |  IF     ​|  INTCONSTANT ​  ​| ​ COMMA       ​| ​ LEFTBRACKET ​  | +|  ​**PROGRAM**    ​|  ​**RESULT**   |  ​**IF**      ​|  INTCONSTANT ​  ​| ​ COMMA       ​| ​ LEFTBRACKET ​  | 
-|  BEGIN      |  INTEGER ​    ​|  THEN   ​|  REALCONSTANT ​ |  SEMICOLON ​  ​| ​ RIGHTBRACKET ​ | +|  ​**BEGIN**      ​|  ​**INTEGER**  ​|  ​**THEN**    ​|  REALCONSTANT ​ |  SEMICOLON ​  ​| ​ RIGHTBRACKET ​ | 
-|  END        |  REAL        |  ELSE   ​|  RELOP         ​| ​ COLON       ​| ​ UNARYMINUS ​   | +|  ​**END**        ​|  ​**REAL**     |  ​**ELSE**    ​|  RELOP         ​| ​ COLON       ​| ​ UNARYMINUS ​   | 
-|  VAR        |  ARRAY       ​|  WHILE  |  MULOP         ​| ​ LEFTPAREN ​  ​| ​ UNARYPLUS ​    | +|  ​**VAR**        ​|  ​**ARRAY**    ​|  ​**WHILE**   |  MULOP         ​| ​ LEFTPAREN ​  ​| ​ UNARYPLUS ​    | 
-|  FUNCTION ​  ​| ​ OF          |  DO     ​|  ADDOP         ​| ​ RIGHTPAREN ​ |  ENDMARKER ​    | +|  ​**FUNCTION**   |  ​**OF**       |  ​**DO**      ​|  ADDOP         ​| ​ RIGHTPAREN ​ |  ENDMARKER ​    | 
-|  PROCEDURE ​ |  ​IDENTIFIER  ​|  ​NOT    ​|  ASSIGNOP ​     |  DOUBLEDOT ​  ​| ​ ENDOFFILE ​    |+|  ​**PROCEDURE**  ​|  ​**NOT** ​     ​|  ​IDENTIFIER  ​|  ASSIGNOP ​     |  DOUBLEDOT ​  ​| ​ ENDOFFILE ​    |
  
 === Values === === Values ===
Line 46: Line 46:
 <​html><​br/></​html>​ <​html><​br/></​html>​
  
-^  RELOP            ​^|      ^  MULOP            ​^|      ^  ADDOP            ​^^+^  RELOP            ​||      ^  MULOP            ​||      ^  ADDOP            ​||
 ^ Operator ​ ^ Value  | :::  ^ Operator ​ ^ Value  | :::  ^ Operator ​ ^ Value  ^ ^ Operator ​ ^ Value  | :::  ^ Operator ​ ^ Value  | :::  ^ Operator ​ ^ Value  ^
-|  =        |  1     ​| ​:::  ​|  *        |  1     ​| ​:::  ​|  +        |  1     |+|  =        |  1     ​| ​     |  *        |  1     ​| ​     |  +        |  1     |
 |  <> ​      ​| ​ 2     | :::  |  /        |  2     | :::  |  -        |  2     | |  <> ​      ​| ​ 2     | :::  |  /        |  2     | :::  |  -        |  2     |
 |  <        |  3     | :::  |  DIV      |  3     | :::  |  OR       ​| ​ 3     | |  <        |  3     | :::  |  DIV      |  3     | :::  |  OR       ​| ​ 3     |
Line 58: Line 58:
  
   - Tokens are constructed in a greedy fashion–i.e.,​ the longest possible character sequence comprising a valid token is returned.   - Tokens are constructed in a greedy fashion–i.e.,​ the longest possible character sequence comprising a valid token is returned.
-  - You need two-character lookahead to handle the doubledot ​token, since ''​5..''​ is two tokens (the integer 5 and the doubledot) whereas ''​5.3''​ is a constant. However, you do not know which is the case until you read the third character in the sequence. +  - You need two-character lookahead to handle the ''​DOUBLEDOT'' ​token, since ''​5..''​ is two tokens (the integer ​''​5'' ​and the ''​DOUBLEDOT''​) whereas ''​5.3''​ is a constant. However, you do not know which is the case until you read the third character in the sequence. 
-  - You must specify a maximum length for identifers ​and constants for your implementation. This is up to you as the compiler writer.+  - You must specify a maximum length for identifiers ​and constants for your implementation. This is up to you as the compiler writer.
   - There are three possible interpretations for plus and minus signs: a binary operator, with token type ''​ADDOP'',​ a unary operator, or part of a constant. The lexical analysis routine has to determine which is the case. This can be handled as follows:   - There are three possible interpretations for plus and minus signs: a binary operator, with token type ''​ADDOP'',​ a unary operator, or part of a constant. The lexical analysis routine has to determine which is the case. This can be handled as follows:
-    - Upon locating a plus or minus sign in the source, check the TYPE of the previous token identified. (This means that you must have some way of knowing what the last token was.)+    - Upon locating a plus or minus sign in the source, check the **type** ​of the previous token identified. (This means that you must have some way of knowing what the last token was.)
     - If the type of the last token scanned is ''​RIGHTPAREN'',​ ''​RIGHTBRACKET'',​ ''​IDENTIFIER'',​ or ''​CONSTANT'',​ then assign the type ''​ADDOP''​ and the appropriate value.     - If the type of the last token scanned is ''​RIGHTPAREN'',​ ''​RIGHTBRACKET'',​ ''​IDENTIFIER'',​ or ''​CONSTANT'',​ then assign the type ''​ADDOP''​ and the appropriate value.
     - Otherwise, it is a unary operator. You can either return it as such (i.e. as ''​UNARYMINUS''​ or ''​UNARYPLUS''​) or check to see if it is part of a constant. If it is, the next character in the input stream will be a digit, and you can regard the plus or minus sign as part of a constant and continue scanning to isolate the rest of the number.     - Otherwise, it is a unary operator. You can either return it as such (i.e. as ''​UNARYMINUS''​ or ''​UNARYPLUS''​) or check to see if it is part of a constant. If it is, the next character in the input stream will be a digit, and you can regard the plus or minus sign as part of a constant and continue scanning to isolate the rest of the number.
 +
 +==== Resources ====
 +
 +Here are four sample files and their expected output. **Note that not all possible edge cases may be covered.**
 +
 +^  Test 1                                                                            ^      ^  Test 2                                                                            ^   ​^ ​ Test 3                                                                            ^   ​^ ​ Test 4                                                                            ^    ^  Test 5                                                                           ^
 +|  [[https://​www.cs.vassar.edu/​~cs331/​proj/​01_lexer/​lextest_1.txt|Sample 1 input]] ​  ​| ​     |  [[https://​www.cs.vassar.edu/​~cs331/​proj/​01_lexer/​lextest_2.txt|Sample 2 input]] ​  ​| ​  ​| ​ [[https://​www.cs.vassar.edu/​~cs331/​proj/​01_lexer/​lextest_3.txt|Sample 3 input]] ​  ​| ​  | [[https://​www.cs.vassar.edu/​~cs331/​proj/​01_lexer/​lextest_4.txt|Sample 4 input]] ​   |    | [[https://​www.cs.vassar.edu/​~cs331/​proj/​01_lexer/​lextest_5.txt|Sample 5 input]] ​  |
 +|  [[https://​www.cs.vassar.edu/​~cs331/​proj/​01_lexer/​lexsoln_1.txt|Sample 1 output]] ​ | :::  |  [[https://​www.cs.vassar.edu/​~cs331/​proj/​01_lexer/​lexsoln_2.txt|Sample 2 output]] ​ |   ​| ​ [[https://​www.cs.vassar.edu/​~cs331/​proj/​01_lexer/​lexsoln_3.txt|Sample 3 output]] ​ |   ​| ​ [[https://​www.cs.vassar.edu/​~cs331/​proj/​01_lexer/​lexsoln_4.txt|Sample 4 output]] ​ |    | [[https://​www.cs.vassar.edu/​~cs331/​proj/​01_lexer/​lexsoln_5.txt|Sample 5 output]] ​ |
  
 ==== Error Handling ==== ==== Error Handling ====