Input Buffering

Input Buffering

Before discussing the problem of recognizing lexemes in the input, let us examine some ways to speed up the simple but important task of reading the source program. This task is made difficult by the need to look at one or more characters beyond the next lexeme to be sure we have the right lexeme.

Buffer Pairs

Specialized buffering techniques have been developed to reduce the overhead required to process a single input character, due to the time taken to process characters and the large number of characters processed during the compilation of a large source program. An important scheme involves two buffers that are alternately reloaded.

  • Each buffer is of the same size (N), usually the size of a disk block (e.g., 4096 bytes).
  • Using one system read command, (N) characters are read into a buffer, rather than using one system call per character.
  • If fewer than (N) characters remain in the input file, a special character, represented by (\backslash 0), marks the end of the source file and is different from any possible character of the source program.

Maintaining Pointers

Two pointers to the input are maintained:

  1. Pointer lexemeBegin: Marks the beginning of the current lexeme, whose extent we are attempting to determine.
  2. Pointer forward: Scans ahead until a pattern match is found.

Advancing forward

Once the next lexeme is determined, forward is set to the character at its right end. After the lexeme is recorded as an attribute value of a token returned to the parser, lexemeBegin is set to the character immediately after the lexeme just found.

Advancing forward requires checking whether we have reached the end of one of the buffers. If so, we must reload the other buffer from the input and move forward to the beginning of the newly loaded buffer.

As long as we never need to look so far ahead of the actual lexeme that the sum of the lexeme’s length plus the distance we look ahead is greater than (N), we will never overwrite the lexeme in its buffer before determining it.

Sentinels

If we use buffer pairs, we must check each time we advance forward that we have not moved off one of the buffers. If we do, we must also reload the other buffer. Thus, for each character read, we make two tests: one for the end of the buffer and one to determine what character is read.

Combining Buffer-End Test

We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end. The sentinel is a special character that cannot be part of the source program, and a natural choice is the character EOF (end-of-file).

Leave a Reply

Your email address will not be published. Required fields are marked *