Optimization by precomputation

Here is another option for increasing the efficiency of your lexical analyzer. You can see from the code fragments above that questions like is_letter(ch)are asked frequently in the token assembler. These questions have the property that their input paparmeters are from a finite set and their result depends on these parameters only. So, for a given set of parameters the answer is the same every time. If the finite set defined by the parameters is small enough, we can compute all the answers in advance, store the results in an array, and replace the routine and calls by simple array indexing. This technique is called precomputation, and the gains in speed are considerable. This is important for a lexical analyzer, which is one of the places in a compiler where most time is spent. However, note that this method is applicable to any problem where the conditions for its use (finite set of inputs, result dependent only on these values) are satisifed.

The input parameter of each of the macros used for character classification is an 8-bit character, which can have at most 256 values, and the outcome of the macro is one bit. This suggests representing the table of answers as an array A of 256 1-bit elements, in which element A[ch] contains the result for parameter ch. However, few languages offer 1-bit arrays, and if they do, accessing the elements on a byte-oriented machine is slow. We can sacrifice 7 x 256 bits and allocate an array of 256 bytes for the answers. Here is the relevant part of the table implementation of is_operator(), assuming that the ASCII character set is used. The answers are collected in the table is_operator_bit[]; the first 42 positions contain 0's, then there are some 1's in the proper ASCII positions and 208 more 0's fill up the rest of the array. One small complication that had to be overcome is that the ANSI C standard leaves it undefined whether the range of a char is 0 to 255 (unsigned char) or -128 to 127 (signed char). Since we want to use the input characters as indexes to arrays, we have to make sure the range is 0 to 255. Forcibly extracting the rightmost 8 bits by AND-ing with the octal number 03777 -- which reads 11111111 in binary -- solves the problem at the expense of one more operation.

This technique is typically called table lookup and, because it is implemented by array indexing, its cost is constant. The main drawback of this method is the size of the tables; for example, there are 12 classifying macros in the example above, each using 256 bytes for a total of 3072 bytes. This small amount of memory is not likely to exceed availability, but for a more complex language tables can be much larger. In these cases table compression is essential; but even in this simple case we can reduce the size of the tables by more than a factor of ten.

Table lookup can also be applied to token recognition. In this approach, a two-dimensional array is constructed which is indexed by state numbers along one axis and all possible significant input characters on the other. The token assembler operates by initializing its current state to 0, calling the routine to get the next significant character from the input, and using the current state value and the character just received as indeces into the array. The referenced element of the array will contain a number which gives the "next state"--i.e., the state the machine should move into. Thus this number replaces 0 as the current state, a new input character is called for, and the process repeats until an entry in the array signals "accept" (and provides an indication of which token type has just been recognized) or "error". The table-driven approach demands determining the contents of the array by, of course, constructing a finite state automaton that recognizes tokens and using this to fill in the array. This is in fact what programs like LEX actually do. The code for token recognition generated by LEX is very stream-lined and concise, but the problem of large table size is even greater than for character recognition.