In this project you are to implement a scanner and a recursive descent parser, and a de-parser for a small programming language (letâs call it Hawk). Your program will take as input a program written in the provided language and produce a parse tree in the form of a table. In addition to generating the parse tree, your program must generate the proper errors when encountered. After generating the tree, you need to write a program that de-parses the tree into the original code that generated it.
The following is the grammar that you will use:
Rule 01: PROGRAM ï program DECL_SEC begin STMT_SEC end;
Rule 02: DECL_SEC ï DECL | DECL DECL_SEC
Rule 03: DECL ï ID_LIST : int ;
Rule 04: ID_LIST ï ID | ID , ID_LIST
Rule 05: ID ï a|b|c|â¦| z | A | ⦠| Z
Rule 06: STMT_SEC ï STMT | STMT STMT_SEC
Rule 07: STMT ï ASSIGN | IFSTMT | WHILESTMT | INPUT | OUTPUT
Rule 08: ASSIGN ï ID := EXPR ;
Rule 09: IFSTMT ï if COMP then STMT_SEC end if ; |
if COMP then STMT_SEC else STMT_SEC end if ;
Rule 10: WHILESTMT ï while COMP loop STMT_SEC end loop ;
Rule 11: INPUT ï input ID;
Rule 12: OUTPUT ï output ID_LIST ;
Rule 13: EXPR ï FACTOR | FACTOR + EXPR | FACTOR - EXPR
Rule 14: FACTOR ï OPERAND | OPERAND * FACTOR
Rule 15: OPERAND ï INT | ID | ( EXPR )
Rule 16: INT ï 0 | 1 | ⦠| 9
Rule 17: COMP ï ( OPERAND = OPERAND ) | ( OPERAND <> OPERAND ) |
( OPERAND > OPERAND ) | ( OPERAND < OPERAND )
This grammar has 17 rules. It also has reserved words in it indicating that they cannot be used for identifiers. Non-terminal symbols are those that are capitalized, and terminal symbols are those that are lowercase. Some rules have alternative choices, for example Rule 09 has two alternatives, one without an else, and one with and else. You do not need to worry about the alternatives for Rules 05 and 16 when generating the parse tree. However, which alternative for a rule that you produce (choose) should be indicated in the parse tree.
The following are the lexemes for the language:
⢠Reserved words: program, begin, end, if, then, else, input, output, int, while, loop.
⢠Operators: assignment (:=), less than (<), greater than (>), equals (=), not equals (<>), plus (+), minus (-) , multiply (*), and parentheses.
⢠The â;â is also used to terminate statements and the â,â and the â:â are used when declaring variables.
⢠Identifiers: only one-character identifiers a-z or A-Z and they are case insensitive.
⢠Integers: only one-digit integers 0-9.
⢠Note: The only types to be declared in the language are integers.
Upon encountering a variable in a declaration section, you must add it to the symbol table. This means that a redeclaration of a variable should produce an error and exit the program. The use of a variable before it is declared should also produce an error an exit the program.
Errors that could be generated upon scanning and parsing a program:
⢠Illegal symbol
⢠Illegal identifier
⢠Illegal number
⢠Parse errors encountered by expecting a symbol and not getting that symbol.
All errors must indicate the line number in the file. All error message must be descriptive as well indicating what happened precisely. Upon encountering an error, you must exit the program immediately after reporting the error and its location.
You will be provided with some test cases to use to test your program. However, your implementation should work with other test cases. So, you are encouraged to generate your own test cases to make them work with your parser.
In the assignment, you will be turning it in incremental. The first part should be only a scanner to print out the tokens and lexemes. The second part should be just a parser without the tree. The third should be a parser with the tree.
The fourth is going to be a de-parser that accepts the tree and produces the program that generated it. While the de-parser is a separate application than the parser (the parser by the way includes the scanner and the building of the tree), it will use a lot of the constructs in the parser. So, you will see a description of it after the parsing example.
â
The following is an example program and its expected output:
program x, y: int;
begin
input x,y;
y:=x+y;
output y;
end;
This happens to be a correct program, so the resulting output for it looks like this:
Node # Rule # Branch 1 Branch 2 Branch 3 Alternative # Id/Int Value
------ ------ -------- -------- -------- ------------- -----------
1 1 2 8 0 1
2 2 3 0 0 1
3 3 4 0 0 1
4 4 5 6 0 2
5 5 0 0 0 0 24
6 4 7 0 0 1
7 5 0 0 0 0 25
8 6 9 15 0 2
9 7 10 0 0 4
10 11 11 0 0 1
11 4 12 13 0 2
12 5 0 0 0 0 24
13 4 14 0 0 1
14 5 0 0 0 0 25
15 6 16 27 0 2
16 7 17 0 0 1
17 8 18 19 0 1
18 5 0 0 0 0 25
19 13 20 23 0 2
20 14 21 0 0 1
21 15 22 0 0 1
22 5 0 0 0 0 24
23 13 24 0 0 1
24 14 25 0 0 1
25 15 26 0 0 1
26 5 0 0 0 0 25
27 6 28 0 0 1
28 7 29 0 0 5
29 12 30 0 0 1
30 4 31 0 0 1
31 5 0 0 0 0 25
Output Description:
⢠Node # is pretty much a sequence number indicating the node of a parse tree.
⢠Rule # indicates which rule was applied during the construction of that node.
⢠The three branches are those that represent the nodes that happen to be the children of the current node. For example, Node 1 has two children and those are nodes 2 and node 8. Branch 3 for Node 1 doesnât exist and therefore it is a 0. For this example this actually represents that Non-terminal symbols on the RHS of every rule, so the PROGRAM node (node 1) has two children represented by DECL_SEC and STMT_SEC. The maximum number of branches a node can have is 3 and thatâs why we have 3 branches listed (if-else is the one that has the maximum.
⢠Alternative # represents which alternative to the production rule the building of the node chose. Notice that Rule 7 has 5 alternatives but only one branch, where rule 9 has two alternatives, the first of which has two branches, and the second has three. So, you need to indicate which alternative you used to build your node.
⢠Id/Int this one only has a value when you have either an Identifier or an int. For example, 24 is the 24th character, 25 is the 25th character and this according to the 26 characters alphabet. The rule number here helps you distinguish if you are representing an integer or a variable, so for example if youâre using rule 5 you produce the number 6 to represent the letter f, but if youâre using rule 16 the number 6 represents integer 6.
Programs can actually have as many whitespaces as possible. Programs do not need to be neatly formatted in order to parse correctly, so your code should adhere to this and ignore whitespaces whenever they are encountered.