The input parameter of each of the macros used for character classification is an 8-bit character, which can have at most 256 values, and the outcome of the macro is one bit. This suggests representing the table of answers as an array A of 256 1-bit elements, in which element A[ch] contains the result for parameter ch. However, few languages offer 1-bit arrays, and if they do, accessing the elements on a byte-oriented machine is slow. We can sacrifice 7 x 256 bits and allocate an array of 256 bytes for the answers. Here is the relevant part of the table implementation of is_operator(), assuming that the ASCII character set is used. The answers are collected in the table is_operator_bit[]; the first 42 positions contain 0's, then there are some 1's in the proper ASCII positions and 208 more 0's fill up the rest of the array. One small complication that had to be overcome is that the ANSI C standard leaves it undefined whether the range of a char is 0 to 255 (unsigned char) or -128 to 127 (signed char). Since we want to use the input characters as indexes to arrays, we have to make sure the range is 0 to 255. Forcibly extracting the rightmost 8 bits by AND-ing with the octal number 03777 -- which reads 11111111 in binary -- solves the problem at the expense of one more operation.
This technique is typically called table lookup
and, because it is implemented by array indexing, its cost is constant.
The main drawback of this method is the size of the tables; for example,
there are 12 classifying macros in the example above, each using 256 bytes
for a total of 3072 bytes. This small amount of memory is not likely
to exceed availability,
but for a more complex language tables can be much larger. In these cases table
compression is essential; but even in this simple case we can reduce
the size of the tables by more than a factor of ten.
Table lookup can also be applied to token recognition. In this approach,
a two-dimensional array is constructed which is indexed by state numbers
along one axis and all possible significant input characters on the other.
The token assembler operates by initializing its current state to 0, calling
the routine to get the next significant character from the input, and using
the current state value and the character just received as indeces into
the array. The referenced element of the array will contain a number which
gives the "next state"--i.e., the state the machine should move into. Thus
this number replaces 0 as the current state, a new input character is called
for, and the process repeats until an entry in the array signals "accept"
(and provides an indication of which token type has just been recognized)
or "error". The table-driven approach demands determining the contents
of the array by, of course, constructing a finite state automaton that
recognizes tokens and using this to fill in the array. This is in fact
what programs like LEX actually do. The code for token recognition generated
by LEX is very stream-lined and concise, but the problem of large table
size is even greater than for character recognition.