SYNTAX equations are rules, of a special-purpose non-procedural language, designed to permit the specification of the syntax of a programming language and the transformation of programs written in that language into an intermediate list, tree, structure. This tree structure may then be processed by functions written in the GENERATOR language.

Any compiler may be thought of as implementing an idealized pseudo-machine whose order code and memory organization are reflected in the operators and data structures of the language processed by the compiler. The pseudo-machine specified by the SYNTAX language part of the compiler has four major components:

  1. An input stream, which contains the source code, characters, to be parsed.
  2. A push down accumulator, called the parse stack, which contains the parse tree or elements parsed by the SYNTAX language.
  3. The operator stack holds items that are later combined with parse stack items to form trees.
  4. The symbol table.

A language is composed of elements that exist on at least three hierarchical levels. In the context of natural language, these elements are sentences, words, and characters. In programming languages, these elements are statements or expressions, tokens, and characters. The SYNTAX language has equation forms for recognizing and processing each type of element; it distinguishes them by using a different first operator in the equation.

The language element at the lowest level is the character. Characters are used to form, tokens, the members of the next level.

Tokens are formed from a subset of the characters of the language. This subset is called a character class, and its operator, the colon, identifies the syntax equation that describes it (:). The characters that are members of the class are written on the right side of the definition. Each printable character is enclosed in primes. Nonprinting and special characters may by represented as constants. Characters are separated by a bar character (|) OR operator. The equation ends with a semicolon. For example, the equation

digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9';

Defines a digit as a 0 OR 1 OR 2, etc.

Character classes may be defined in terms of other character classes, as well as in terms of terminal characters. Thus the equation

HEXdigit: digit | 'A' | 'B' | 'C' | 'D' | 'E' | 'F';

Defines a hexadecimal digit as an A or B or C or D or E or an F or one of the ten characters identified as a digit. Character class definitions may not reference any syntax equation other then another character class definitions.

Basic language elements such as identifiers and numbers are built up from character classes. It is also necessary to save these elements for further processing. These functions are performed in the token making syntax equations, whose operator is the double dot (..). The sequence operator ($) is Also commonly employed to specify that zero or more elements are to be recognized. For example, the equation

number .. digit $digit;

Defines a number as a digit followed by a sequence of zero or more digits-- i.e., as one or more digits.

Leading skip class characters (blanks or characters that are treated as blanks) are skipped on the input stream until a non-skip character or a digit is encountered.

If the character is not a member of the character class digit the scan does not advance, the operation terminates the input stream is positioned back to the point when we entered number and number return failure to whatever syntax equation called it. If the character is a digit, it is put into a token buffer and the input scan advances.

Succeeding characters are scanned and placed into the buffer until a non-digit is encountered. The input stream is advanced as many positions as there where digits. The operation will return true even if no digits beyond the first were gathered, as the sequence operator specifies zero or more occurrences of digit. Once the sequence terminates, all characters gathered are formed into a token and placed onto the push down accumulator (the parse stack). The token is also entered into a dictionary so that all subsequent instances of the same token will reference the same dictionary entry.

Note skip class skipping will stop if a character is matched by the rule, allowing skip class characters to be used in a token rule. An example of this: suppose a line starting with an * is to be treated as a comment.

cr: 13;

CommentLine.. cr'*' $-cr;

cr (line separator) normally a skip class character would not be skipped and would terminate skipping.

It may be the case that a symbol is not the desired result. For example, a number should be retained as a number, rather than as a collection of digits. Functions, MakeNumber and MakeHex, which perform the conversion may be called as in the following definitions:

digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9';

HEXdigit: 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | digit;

decmal.. digit $digit MakeNumber[];

HEXnum.. '0' ('X' | 'x') HEXdigit $HEXdigit MakeHex[];

These conversion routines intercept the characters gathered in the token buffer, convert them, and place the result on top of the parse stack. No dictionary entry is made.

No special function, however, is required in the following case:

alpha: 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T' | 'U' | 'V' | 'W' | 'X' | 'Y' | 'Z' | 'a' |;'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z';

alphanum: alpha | digit;

ID.. alpha $alphanum;

Here the resultant token is precisely what is desired. As normal operation has not been interceded by MakeNumber or MakeHex, the token is placed on the parse stack and is entered into the dictionary via an implicit call to the system function MakeSymbol.

You may have observed that the following definition of ID is equivalent to the one above.

ID.. alpha $(alpha | digit);

However the earlier version results in more efficient code. As a general rule, better code results if "or" operation is avoided by making its components a character class.

Entities other than character classes may be referenced in token making equations. The programmer must specify whether or not he wishes to retain the characters as part of the token. For example, an identifier might be defined in this manner:

ID.. alpha $(alphanum | '_');

This specifies that, once the initial alpha is recognized succeeding elements may be either a member of the class alphanum or an "underscore". The alpha and alphanum will be retained if recognized; the underscore will not. It will be recognized and the input advanced, but the resultant token will appear in the dictionary exactly as if the underscore had not been coded. That is, This_Name will be precisely identical to ThisName. The operator to be employed if the character is to be kept is the plus sign (+):

ID.. alpha $(alphanum | +'_');

This is the definition of an identifier in the SLIC languages.

The functions ToUpper and ToLower maybe used to convert the token buffer to all upper case characters or all lower case characters respectivly. The comma operator may also be used to output characters or a variable to the token buffer. When the variable contains NULL no output to the token buffer occures.

ID.. alpha $(alphanum | '_',UnderBarOption);

When UnderBarOption contains '_', symbols will gathered into the token buffer with the underbars retained. But when UnderBarOption contains NULL, symbols will have underbars removed when gathered into the token buffer.

In some cases, it is necessary to specify that a character may not appear on the input. The "not" operator (-) is used for this purpose, as demonstrated hereafter in the definition of string. Note that, whether or not the proscribed character is present, the input will not advance. The input advances only when affirmative recognition of characters occurs.

string .. '''' $(-'''' any | '''''' ,'''') '''';

A string is defined as a sequence of characters of arbitrary length enclosed by a pair of primes; e.g., 'ABC'. If the enclosed characters are anything but primes, no conflict occurs. If a prime is to be one of the enclosed characters, two primes are written; they are recognized, and the comma operator appends a single prime to the token buffer. Thus, the word "ISN'T" would be coded 'ISN''T'.

The techniques applied to recognition of syntactic entities are well illustrated by the case of arithmetic expressions. The most important factor in recognition of arithmetic expressions is the definition of the operator hierarchy. The example adheres to tradition: on the highest level are exponentiation and unary minus, followed by multiplication and division, followed by addition and subtraction. Within each level the order of operation is from left to right, except in case of exponentiation. This hierarchy may be overridden by parenthesis.

Beginning with the left to right problem, let us look at the definition of an expression. An EXP is a TERM followed by zero or more instances of plus or minus TERM.

EXP = TERM $(('+' | '-') TERM);

This demonstrates how the use of the sequence operator permits the replacement of recursion in BNF by iteration in the syntax language.

TERM = FACTOR $(('*' | '/') FACTOR);

TERM has already been used as something that is added to or subtracted from another TERM. Now it can be seen that a TERM may be composed of multiplication or division of two or more FACTOR's. As an EXP is evaluated multiplication and division must be performed prior to addition and subtraction. Consider the expression.

A + B * C

Which means add A to the product of B and C. This implies the following parenthesis:

(A + (B * C))

FACTOR = '-' ITEM | ITEM ('^' FACTOR | empty);

The SYNTAX language primitive empty means that the element with which it alternates may or may not be present in a legal statement. It follows, from the definition of FACTOR, that the unary minus and exponentiation will take priority over other arithmetic operation. The part of the equation dealing with exponentiation defines a FACTOR as an ITEM raised to a FACTOR that is an item raised to a factor, etc. That is until the '^' factor fails and .empty is found, the compiler will continue to search for more exponentiation. Thus, the expression.

A**B**C

Would be implicitly parenthesized from right to left as

(A**(B**C))

Note that the infinite iteration problem is not caused by right recursion. The right recursive definition of FACTOR states that ITEM, unless it is preceded by a (minus), is always followed by another FACTOR or empty. Succeeding calls on FACTOR will recognize characters and so move the input stream until empty provides an escape. Neither an infinite loop nor incomplete recognition is possible. Finally, to complete the definition:

ITEM = VARIABLE | '(' EXP ')' | NUMBER;

VARIABLE = ID ('(' EXP $(',' EXP) ')' | .empty);

An ITEM is either a VARIABLE (simple or with any number of subscripts or arguments), a parenthesized expression, or a number.

The techniques covered to this point are sufficient to permit the writing of a recognizer--that is, a program whose sole function is to check its input for correct syntax. To produce a compiler it is necessary to specify the details of the transformation of the source code into a form comprehensible to GENERATOR functions. This form, similar to the LISP language, is most easily represented in the tree structure format. The programmer's task is to direct the building of this tree by specifying the elements at its branches and the nodes connecting them.

In its simplest form, a tree is a pictorial representation of a functional notation. An operator such as + or * is a node, from which zero or more branches are extended, representing the operands.

Two operators are employed to specify the building of the tree. The colon (:) followed by an identifier directs that a node be pushed on a special operator stack, above any previous nodes. The exclamation point (!) followed by a number directs that that number of items on the parse stack be removed, and connected in the order they were gathered, to the top node (i.e., the node at the top of the operator stack). That node is then removed from the operator stack. This subtree, formed by the node connected to its branches, is then placed on top of the parse stack.

Consider the simple arithmetic expression A - B. It is desired that A and B be placed on the stack and tied to a SUB (for subtract) node. Thus, EXP might be rewritten as

EXP = TERM $('+' TERM :ADD!2 | '-' TERM :SUB!2);

However, there is a simpler way-- that of factoring the equation:

EXP = TERM $(('+':ADD | '-':SUB) TERM !2);

For every iteration, a plus or minus sign produces the appropriate node (via the operator). When the second term is recognized, the tree is constructed from the node and the two operands. Note that both sets of parentheses in the syntax equation are necessary.

The remainder of the expression-- handling syntax equations will now appear as follows:

TERM = FACTOR $(('*':MPY | '/':DIV) FACTOR !2);

FACTOR = ('-' ITEM:UNARY!1 | ITEM)('**' FACTOR:EXPON!2 | empty);

ITEM = VARIABLE | '(' EXP ')' | NUMBER;

ITEM remains as previously coded: VARIABLE presents a problem because the number of subscripts is uncertain, as may be seen in its definition. Since the exclamation point operator specifies a precise number of entities to be attached to a node. How to handle such constructions, as parameter lists and subscripts, which contain a variable number of elements is accomplished by grouping the elements into a list structure. and attaching the list as a single branch to a node.

The following definition of VARIABLE employs the operator which perform this task:

VARIABLE = ID ('(' &ltEXP $(',' EXP)> ')' :SUBSC!2 | empty);

The use of the angle brackets (< and >) specifies that any expressions recognized within them are to be formed into a first in, first out list and placed on the parse stack as a single entity. The exclamation point operator attaches two elements to the subsc node: the identifier and the list of subscripts. The tree for the subscript expression a(i,j,k) might appear as follows:

SUBSC[a,[i,j,k]]

The same technique may be used to define a procedure or function call:

PROC_CALL = ID '(' <(EXP $(',' EXP) | empty)> ')':CALL !2;

An alternate coding for the same definition might appear as follows:

PROC_CALL = ID '(' (&ltEXP $(',' EXP)> | <>) ')':CALL !2;

 

The second version of PROC_CALL employs a pair of angle brackets with nothing between them; this is equivalent to LOAD[.NIL]. Both place the empty list on the parse stack.

Observe that angle brackets must be completely nested within parentheses and vice versa. It must also be emphasized that the parse stack must contain at least as many elements as the number specified with the exclamation point operator.

It is now possible to parse any arithmetic expression containing these operators. Consider the expression A + B - C * D(j,2). These are the steps:

  1. A and B are each TERM's (because they are FACTOR's, ITEM's VARIABLE's, and ID's), so their addition is all or part of an expression. They are placed in the parse stack, an ADD node is placed in the operator stack, and the tree that results from attaching the node to the operands replaces the operands on the stack.
  2. The iteration in EXP continues. The minus is recognized and a SUB node is placed on the operator stack. TERM is called, and it recognized C * D as a multiplication of two FACTOR's. The node building operation creates an MPL node, which is attached to C and D, placing onto the parse stack the tree:
  3. This tree is now the first element in the stack; the second is the tree headed by the ADD node created previously.
  4. The !2 in EXP now attaches the top operator stack entry (the SUB node) to the two top parse stack items (the trees created previously). a further iteration fails to turn up any more TERM's so the parse stack finally contains a single entity, the parse tree:

SUB[ADD[A,B],MPY[C,SUBSC[D,[j,2]]]]

There is a lot more:

If you have ever used bnf like syntax directed compilers you might have noticed that unlike others, SLIC will parse context sensitive languages like COBOL.

That's all for now (can't give away everything)

Back to SLIC main page