Tuesday, May 3, 2011

LANGUAGE SYNTAX AND SEMANTICS

To prepare a user-written program for execution, the language processor must perform several tasks. In order,
computer scientists refer to these tasks as scanning (lexical analysis), parsing (syntax analysis), and semantic
analysis, which culminates in code generation.
Scanning, the first step, reads the character sequence that is a source code file and creates a sequence of
tokens of the language. Tokens are the words of a language, and tokens fall into several categories. A token
may be a key word like return, a predefined name like String, a special symbol like ‘+’, a variable name
or identifier like myCount, or a literal constant like the number 3.14 or the character string “Please enter
your name:”.
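As an illustrative sketch (not tied to any particular compiler), a scanner for simple expressions can be built from one regular expression per token category. The category names NUMBER, IDENT, OP, and SKIP are assumptions made for this example:

```python
import re

# One regular expression per token category. The category names
# (NUMBER, IDENT, OP, SKIP) are invented for this sketch, not part
# of any particular language definition.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),   # literal constants like 3.14
    ("IDENT",  r"[A-Za-z_]\w*"),    # variable names like myCount
    ("OP",     r"[+\-*/()]"),       # special symbols
    ("SKIP",   r"\s+"),             # whitespace, discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(source):
    """Turn a character string into a list of (category, text) tokens."""
    tokens = []
    for match in MASTER.finditer(source):
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
    return tokens

print(scan("X * 3 + 4"))
# [('IDENT', 'X'), ('OP', '*'), ('NUMBER', '3'), ('OP', '+'), ('NUMBER', '4')]
```

A real scanner would also compare each IDENT against a table of key words, so that return comes out tagged as a key word rather than as an identifier.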
After the scanner “tokenizes” the source code, the parser accepts the list of tokens as input and builds
a “parse tree” according to the syntax rules of the language. The parser tests the token stream against the syntax,
or grammar rules, of the language, and in the process finds many of the errors we programmers make.
The syntax of a language describes the allowable statements in the language. Following correct syntax does
not guarantee correct programming, but correct programming requires correct syntax. For instance, in English, the
sentence, “The octopus combed his hair” is syntactically correct, but foolish. On the other hand, the sentence, “The
mab ran after the bus” is not syntactically correct because the dictionary does not recognize the token ‘mab’. In
programming languages, as in English, many syntax errors occur because of misspellings and typographical errors.
Today, language syntax rules are usually expressed in Backus-Naur form (BNF), or extended Backus-Naur
form (EBNF), after John Backus (inventor of FORTRAN) and Peter Naur. BNF uses a set of rules or “productions”
to describe the grammar, or syntax.
On the left-hand side of a production, BNF shows a linguistic concept known as a “non-terminal.”
Examples of non-terminals from English include “verb-phrase” and “sentence.” In a programming language,
examples of non-terminals might be “term” or “expression.”
Non-terminals are so-called because they can be broken down into combinations of smaller concepts. For
instance, a verb-phrase can consist of a verb and a direct-object-phrase. Ultimately, the grammar defines the
units of the language that cannot be further reduced, the words of the language, and these are called “terminals.”
On the right-hand side of a production, BNF shows the possible combinations of non-terminals and/or terminals
that can be substituted for the higher-level non-terminal on the left-hand side. Here is a grammar for mathematical
expressions:
1 expression -> term | expression add_op term
2 term -> factor | term mult_op factor
3 factor -> identifier | number | - factor | (expression)
4 add_op -> + | -
5 mult_op -> * | /
The vertical lines mean “or.” To simplify the discussion so that we need not also supply rules for creating
“identifiers” and “numbers,” assume that identifiers are valid variable names and numbers are valid numbers.
We will treat them as terminals.
Production 1 says that an expression can consist either of a term, or of an expression followed by an add_op
(addition operator) followed by a term. Production 2 says that a term can be a factor, or it can be another term
followed by a mult_op (multiplication operator) followed by a factor.
For example, we can parse the following expression according to the grammar:
X * 3 + 4
We can, by rule 1, replace the original expression with another expression (X * 3), an add_op (+), and
a term (4). By rule 2 the single-token term (4) can be replaced by a factor, which can, by rule 3, be replaced
by a number (4), which is a terminal for us.
It remains for us to parse the expression (X * 3). At this point, by rule 1 the only legal substitution for
(X * 3) is a term. By rule 2 the term (X * 3) can be replaced by another term (X), a mult_op (*), and
a factor (3). Again, rule 3 says the factor (3) can be replaced by a number (3), which is a terminal.
By rule 2 the term (X) can be replaced by a factor (X), which by rule 3 can be replaced by an identifier
(X), which we said was a terminal for us.

Such decomposition of a more complex expression into its terminals according to the rules of the grammar
is called a derivation. The result of a successful derivation is a parse tree or syntax tree. Here is the parse tree
for the derivation we just completed:
expression                          (X * 3 + 4)
├── expression                      (X * 3)
│   └── term
│       ├── term
│       │   └── factor
│       │       └── identifier      X
│       ├── mult_op                 *
│       └── factor
│           └── number              3
├── add_op                          +
└── term
    └── factor
        └── number                  4
To compute the meaning of the expression, the parse tree can be traversed from the bottom up, computing
the multiplication first and then performing the addition.
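As an illustrative sketch, the bottom-up computation can be expressed by representing the parse tree as nested (operator, left, right) tuples; this representation, and the name evaluate_tree, are assumptions made for the example, not something the text specifies:

```python
# The parse tree for X * 3 + 4, written as nested (operator, left, right)
# tuples. A post-order traversal evaluates the children of each node
# before the node itself, so the multiplication happens first.
tree = ("+", ("*", "X", 3), 4)

def evaluate_tree(node, bindings):
    """Bottom-up (post-order) evaluation of a tuple-encoded parse tree."""
    if isinstance(node, tuple):
        op, left, right = node
        a = evaluate_tree(left, bindings)    # children first...
        b = evaluate_tree(right, bindings)
        return {"+": a + b, "-": a - b, "*": a * b, "/": a / b}[op]
    if isinstance(node, str):                # an identifier: look up its value
        return bindings[node]
    return node                              # a number is already a value

print(evaluate_tree(tree, {"X": 2}))         # 2 * 3 = 6, then 6 + 4 = 10
```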
If an expression can be parsed according to the grammar of the language, the expression conforms to
the syntax of the language. Once the parser creates the parse tree, the compiler can work from the bottom of the
tree to the top, creating the machine instructions to implement the expression. This last phase is called code
generation.
Today most descriptions of language syntax use a version (there are several) of EBNF. Some notational
changes simplify the representations of productions. In particular, EBNF uses curly brackets to denote “zero
or more occurrences of,” and it uses square brackets to denote optional parts of a production. EBNF uses
parentheses and vertical “or” separators to denote multiple-choice options for a single element. We can rewrite
the grammar above using this EBNF notation:
expression -> term { (+ | -) term }
term -> factor { (* | /) factor }
factor -> identifier | number | - factor | ( expression )
If it is not obvious that these rules agree with our earlier grammar, consider our earlier first rule for
expressions:
expression -> term | expression add_op term
From this rule, we can generate:
expression -> term
expression -> expression + term
expression -> expression + term + term
expression -> expression + term + term + term
...
expression -> term + term + term + term + ... + term
So, the EBNF notation says more simply:
expression -> term { (+ | -) term }
An expression is a term followed by zero, one, or many additive terms.
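The EBNF form translates almost directly into a recursive-descent evaluator: each non-terminal becomes a function, and each pair of curly brackets becomes a loop. In this sketch the tokens are assumed to arrive pre-split as a list of strings, and identifiers are looked up in an assumed bindings dictionary:

```python
def evaluate(tokens, bindings):
    """Recursive-descent evaluator for the EBNF grammar above.
    tokens is a list like ['X', '*', '3', '+', '4']."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def expression():                 # expression -> term { (+ | -) term }
        value = term()
        while peek() in ("+", "-"):   # the { ... } becomes a loop
            op = peek(); advance()
            value = value + term() if op == "+" else value - term()
        return value

    def term():                       # term -> factor { (* | /) factor }
        value = factor()
        while peek() in ("*", "/"):
            op = peek(); advance()
            value = value * factor() if op == "*" else value / factor()
        return value

    def factor():  # factor -> identifier | number | - factor | ( expression )
        tok = peek(); advance()
        if tok == "-":
            return -factor()
        if tok == "(":
            value = expression()
            advance()                 # consume the closing ')'
            return value
        if tok in bindings:
            return bindings[tok]      # identifier
        return float(tok)             # number

    return expression()

print(evaluate(["X", "*", "3", "+", "4"], {"X": 2}))   # -> 10.0
```

Because each loop accumulates its running value from the left, this evaluator is left-associative: evaluate(["9", "-", "4", "-", "2"], {}) yields 3.0, not 7.0.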
Here is an example of EBNF used to represent an optional element in a production:
if-statement -> if( expression ) statement [else statement]
This production says that an if-statement consists of the key word if, followed by an open parenthesis, followed by
an expression, followed by a closed parenthesis, followed by a program statement, optionally followed by the
key word else and another program statement.
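In a recursive-descent parser, an optional element in square brackets becomes a simple lookahead test. A minimal sketch, assuming tokens are plain strings and, for brevity, that the expression and each statement are single tokens:

```python
def parse_if(tokens):
    """Parse: if-statement -> if ( expression ) statement [else statement].
    For brevity, the expression and the statements are single tokens here."""
    pos = 0

    def expect(tok):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != tok:
            raise SyntaxError(f"expected {tok!r}")
        pos += 1

    def take():
        nonlocal pos
        tok = tokens[pos]; pos += 1
        return tok

    expect("if"); expect("(")
    condition = take()               # stands in for a full expression
    expect(")")
    then_part = take()               # stands in for a full statement
    else_part = None
    if pos < len(tokens) and tokens[pos] == "else":   # the optional [...] part
        pos += 1
        else_part = take()
    return (condition, then_part, else_part)

print(parse_if(["if", "(", "ready", ")", "go", "else", "wait"]))
# ('ready', 'go', 'wait')
print(parse_if(["if", "(", "ready", ")", "go"]))
# ('ready', 'go', None)
```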

Given an expression in the language, there must be one and only one valid derivation of that expression.
To illustrate an ambiguous grammar, consider this simplification of the grammar for mathematical
expressions:
1 expression -> expression operator expression | identifier |
number | - expression | ( expression )
2 operator -> + | - | * | /
We can again parse the expression (X * 3 + 4) proceeding from the left to the right, and the result will
be the same parse tree we derived from the more complex grammar. However, this simpler grammar would also
allow a rightmost approach, with the following result:
expression                          (X * 3 + 4)
├── expression                      X
│   └── identifier                  X
├── operator                        *
└── expression                      (3 + 4)
    ├── expression                  3
    │   └── number                  3
    ├── operator                    +
    └── expression                  4
        └── number                  4
The meaning of the second parsing is very different from the first, because in the rightmost parsing the
addition occurs before the multiplication. That is not the customary hierarchy of operations, and the second
parse tree will, in general, produce a different value for the expression than the first.
Because the simpler grammar can produce two different and valid parse trees for the same expression, the
grammar is ambiguous. Programming language grammars must be unambiguous.
Look again at the first grammar, the more complex example, and notice how the grammar enforces a hierarchy
of operations; multiplication and division occur before addition or subtraction. Correct grammars place higher
“precedence” operations lower in the cascade of productions.
Another key to a correctly specified grammar is the “associativity” of language elements. Does a mathematical
operator associate left to right, or right to left? This makes a difference with expressions like (9 - 4 - 2).
Left associativity of operators yields 3, while right associativity yields 7. How do the grammar rules express
associativity?
A production like this is left-associative:
expression -> term | expression add_op term
A production like this is right-associative:
expression -> term | term add_op expression
The significant difference is that the recursion (where an expression is part of an expression) is on the left
in the first case, and on the right in the second case.
Using the left-associative production to parse (9 - 4 - 2) results in this parse tree:
expression                          (9 - 4 - 2)
├── expression                      (9 - 4)
├── add_op                          -
└── term                            2

Using the right-associative production to parse the same expression results in this tree:
expression                          (9 - 4 - 2)
├── term                            9
├── add_op                          -
└── expression                      (4 - 2)
The result is 3 in the left-associative grammar, and 7 in the right-associative grammar.
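The two groupings can be sketched directly from the two productions: the left-recursive rule accumulates from the left (a loop), while the right-recursive rule recurses on the rest of the expression. Both functions below take an assumed token list of numbers separated by '-' symbols:

```python
def subtract_left(tokens):
    """expression -> term | expression add_op term : group from the left."""
    value = tokens[0]
    i = 1
    while i < len(tokens):                 # ((9 - 4) - 2)
        value = value - tokens[i + 1]      # tokens[i] is the '-' symbol
        i += 2
    return value

def subtract_right(tokens):
    """expression -> term | term add_op expression : group from the right."""
    if len(tokens) == 1:
        return tokens[0]
    return tokens[0] - subtract_right(tokens[2:])   # (9 - (4 - 2))

nums = [9, "-", 4, "-", 2]
print(subtract_left(nums))    # -> 3
print(subtract_right(nums))   # -> 7
```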
