Tuesday, May 3, 2011

COMPILERS AND INTERPRETERS

With the development of FORTRAN came a new and more complex program for creating machine language
from a higher-level expression of a programmer’s intent. The new program was called a compiler, and the job
of a compiler is to translate a high-level programming language into machine code.
The input to a compiler is the source code written by the programmer in the high-level language. We will
show a simple example program from a seminal book in statistics entitled Multivariate Data Analysis, written
by William Cooley and Paul Lohnes in 1971. The book was remarkable in its time for its inclusion of many
FORTRAN programs in source code form.
In 1971 the input would have been from punched cards, and the output would have been to a printer. In the
read statements below we have replaced the card reader device ID of 7 with the asterisk character to allow the
program to read from the keyboard of the PC. Likewise, in the write statements, we have replaced the printer
device ID of 6 with an asterisk. This permits the output to go to the computer display.
FORTRAN of that time was a column-oriented language (newer FORTRAN standards have allowed “free
format” statements). Statement numbers appeared in columns 1–5, and statements were written in columns
7–72. Putting a number in column 6 meant that the line was a continuation of the previous line. Putting a C in
column 1 meant that the line was a comment, not a program statement.
Variables beginning with letters I, J, K, L, M, or N were integers, and all others were floating point (characters
could be read into integer-type variables).
If you are interested in trying out this program, a free FORTRAN compiler, called GNU FORTRAN G77, is
available from the Free Software Foundation. You can find the download link at http://kkourakis.tripod.com/.
The following program from Cooley and Lohnes was used to compute average scores on each of several tests
for each student in a study.
PROGRAM CLAVG
C
C COMPUTE AVERAGE OF M SCORES FOR EACH OF N SUBJECTS
C INPUT:
C FIRST CARD CONTAINS N IN COLS 1-5, AND M IN COLS 6-10
C FOLLOWING CARDS ARE SCORE CARDS, ONE SET PER SUBJECT.
C FIRST CARD PER SET CONTAINS ID IN COLS 1-5 AND UP TO
C 25 SCORES IN 3-COLUMN FIELDS. ID FIELD MAY BE
C BLANK IN ADDITIONAL CARDS IN A SET
C COOLEY, W & LOHNES, P, MULTIVARIATE DATA ANALYSIS, 1971
C
DIMENSION X(1000)
READ( *, 1 ) N, M
1 FORMAT( 2I5 )
WRITE( *, 2 ) M, N
2 FORMAT( 'AVERAGES ON ', I6, ' TESTS FOR EACH OF ', I6,
1’ SUBJECTS’ )
EM=M
DO 5 J=1, N
READ( *, 3 ) ID, ( X(K), K=1, M )
3 FORMAT( I5, 25F3.0/ (5X, 25F3.0) )
SUM = 0.0
DO 4 K=1, M
4 SUM = SUM + X(K)
AV = SUM / EM
5 WRITE( *, 6 ) J, ID, AV
6 FORMAT( I6, 3X, 'SUBJECT ', I6, 3X, 'AV= ', F9.2 )
STOP
END
Example input cards:
5 5
821 3 7 9 4 7
812 1 4 3 3 2
813 3 2 3 1 1
824 7 9 9 9 9
825 6 9 8 8 5
This program starts by reserving an array of 1000 elements for real numbers. Then it reads the first line of
input to get values for N and M, the number of students and the number of scores for each student. Then it writes
a message summarizing the task ahead.
The main loop starts next at the keyword DO. The loop starts there and ends at the statement numbered 5.
The work of the loop begins with reading another line of input, which the program expects to consist of a student
identifier and five test scores. Inside the main loop, there is a smaller loop that starts at the next DO and
continues just to the following line, numbered 4. The inner loop sums all the scores for that line of input. The
last work of the main loop is to divide the sum of test scores for the student by the number of tests in order to
compute an average score for that student. After writing the result for that student, the loop resumes with the next
student, until the scores for all the students have been averaged.
Obviously, translating such English-like or mathematical statements into machine code is much more challenging
than the work done by an assembler. Compilers process source code in a series of steps.
The first step is called “scanning” or “lexical analysis,” and the output is a stream of tokens. Tokens are the
words of the language, and “READ”, “FORMAT”, “AV”, “4”, and “3X” are all tokens in the example program.
Next, the compiler “parses” the token stream. This step is also called “syntax analysis.” Referring to the “grammar”
or rules of the language, the compiler uses a “parse tree” to verify that the statements in the source code comprise
legal statements in the language. It is at this step that the compiler will return error messages if a comma is missing,
for example, or a key word is misspelled. Later in this chapter we will return to the topics of parsing and parse trees.
If all of the statements are legal statements, the compiler proceeds with “semantic analysis.” In this phase,
the meaning of the statements is created. By meaning, we mean implementing the programmer’s intent in
executable code. Modern compilers often create a program in an intermediate language that is later converted
into machine code, but early compilers either created assembly language code that was then assembled by the
trusty assembler, or created machine language directly. The advantage of the modern approach is that compilers
for different languages can create intermediate code in the common form, which intermediate code can be fed
to a common machine language generator.
The result of compiling a program is a file of object code, which is the binary file of machine instructions that
will be executed when the program is run. Compilers create a special type of object code called relocatable code,
which is object code that can be loaded into any part of memory. When the program is loaded into memory to run,
addresses and references in relocatable files are adjusted to reflect the actual location of the program in memory.
With compilers, the translation of source code to executable code is accomplished once. Once the program is
compiled, executing the program requires no translation, and the program in machine code form executes swiftly.
Interpreters operate differently. Interpreters translate the source code to machine code one source code line
at a time, and they do this every time the program executes. The interpreter is always the program in control; it is the interpreter that is actually executing when a program in an interpreted language is run. BASIC is a language
that is usually implemented with an interpreter.
In general, a program executed by an interpreter will run more slowly than a program that is first compiled
into object code. The reason, of course, is that the interpreter must analyze each line and convert each line to
machine code each time the program runs.
On the other hand, interpreters have other compensating advantages in some situations. For instance, when
students are learning to program, the interactivity of an interpreter, and the savings of recompilation time on
long programs, can be more important than final execution speed. Interpreters often can provide better diagnostic
messages, too, since they work directly from the source code, line by line. In addition, with the continuing
increases in hardware computation speeds, speed of execution sometimes becomes less important to users than
other features of a language.
The distinctions are sometimes “fuzzy.” First of all, some languages are implemented both as interpreted
and compiled languages (e.g., BASIC, PERL, LISP). The modern Java language also blurs compiler/interpreter
boundaries, as we will now discuss.

No comments:

Post a Comment