The left-most 7 bits of each byte in the arrays are always 0, and the
idea suggests itself to use these bits to store outcomes of some of the
other functions. The proer bit for a function can then be extracted by
AND-ing with a mask in which one bit is set to 1 at the proper position.
If there are 11 functions, we need 11 bit positions; however, among the
macros in the example, two test for one character only: is_end_of_input()
and is_comment_starter(). Replacing the simple comparison performed by
these macros would not gain anything, so they can be left as is. Two of
the other macros define their classes as combinations of existing classes:
is_letter() and is_letter_or_digit(). These can be implemented bycombining
the masks for these existing classes. That leaves 7 bit positions needed
for the remaining 7 macros, which fits comfortably in one byte. Here is
a representative part of the implementation:
/*** Efficient classification of characters (excerpt) ***/
#define UC_LETTER_MASK
(1<<1) /* a 1 bit, shifted left 1 pos. */
#define LC_LETTER_MASK
(1<<2) /* a 1 bit, shifted left 2 pos. */
#define OPERATOR_MASK
(1<<5)
#define LETTER_MASK (UC_LETTER_MASK | LC_LETTER_MASK)
#define bits_of(ch) (charbits[(ch)&0377])
#define is_end_of_input(ch) ((ch) == '\0')
#define is_uc_letter(ch) (bits_of(ch)
& UC_LETTER_MASK)
#define is_lc_letter(ch) (bits_of(ch)
& LC_LETTER_MASK)
#define is_letter(ch)
(bits_of(ch) & LETTER_MASK)
#define is_operator(ch) (bits_of(ch) & OPERATOR_MASK)
static const char charbits[256] =
{
0000,
/* position 0 */
...
0040,
/* '*', position 42 */
0040,
/* '+' */
...
0000,
/* position 64 */
0002,
/* 'A' */
0002,
/* 'B' */
0000,
/* position 96 */
0004,
/* 'a' */ 0004, /* 'b' */
...
0000
/* position 255 */
};