CSE340 Fall 2018 Project 1: Lexical Analysis
Due: Friday, September 14, 2018 by 11:59 pm MST
1. Introduction
I will start with a high-level description of the project in this section and then go into a detailed description
on how to go about achieving these goals in subsequent sections.
The goal of this project is to implement the function getToken() that I described in class automatically for
any list of tokens. The input to your program will have two parts:
1. The first part of the input is a list of tokens separated by commas and terminated with the # (hash)
symbol. Each token in the list consists of a token name and a token description. The token description
is a regular expression for the token. The list has the following form:
t1_name t1_description , t2_name t2_description , ... , tk_name tk_description #
2. The second part of the input is a string of letters and digits and space characters.
Your program will read the list of tokens, represent them internally in appropriate data structures, and then
do lexical analysis on the input string to break it down into a sequence of tokens from the provided list and
corresponding lexemes. The output of the program will be this sequence of tokens and lexemes. If during the
processing of the input string, your program cannot identify a token to match for the next call to getToken(),
it outputs ERROR and stops.
More details about the input format and the expected output of your program are given in what follows.
The remainder of this document is organized as follows.
1. The second section describes the input format.
2. The third section describes the expected output.
3. The fourth section describes the requirements on your solution.
4. The fifth and largest section is a detailed explanation how to go about implementing a solution.
5. The sixth section gives instructions for this programming assignment and additional instructions that
apply to all programming assignments in this course.
2. Input Format
The code that we provided reads the input for you, so you do not need to do anything in terms or parsing the
input to your program. What you need to do is to store the information being parsed by the code that we
provided in appropriate data structures to allow your program to break down the input string into a list of
tokens and lexemes according to the provided list. I describe the format of the input for completeness, but, as
I said, reading the input is done for you by the provided code.
The input of your program is specified by the following context-free grammar:
1
input → tokens_section INPUT_TEXT
tokens_section → token_list HASH
token_list → token
token_list → token COMMA token_list
token → ID expr
expr → CHAR
expr → LPAREN expr RPAREN DOT LPAREN expr RPAREN
expr → LPAREN expr RPAREN OR LPAREN expr RPAREN
expr → LPAREN expr RPAREN STAR
expr → UNDERSCORE
Where
CHAR = a | b | ... | z | A | B | ... | Z | 0 | 1 | ... | 9
LETTER = a | b | ... | z | A | B | ... | Z
SPACE = ' ' | \n | \t
INPUT_TEXT = " (CHAR | SPACE)* "
COMMA = ','
LPAREN = '('
RPAREN = ')'
STAR = '*'
DOT = '.'
OR = '|'
UNDERSCORE = '_'
ID = LETTER . CHAR*
In the description of regular expressions, UNDERSCORE represents epsilon.
Examples
The following are examples of input.
1. t1 (a)|(b) , t2 (a).((a)*) , t3 (((a)|(b))*).(c) #
"a aa bb aab"
This input specifies three tokens t1, t2, and t3 and an INPUT_TEXT "a aa bb aab".
2. t1 (a)|(b) , t2 ((c)*).(b) #
"a aa bb aad aa"
This input specifies two tokens t1, t2, and an INPUT_TEXT "a aa bb aad aa".
3. t1 (a)|(b) , t2 (c).((a)*) , t3 (((a)|(b))*).(((c)|(d))*)#
"aaabbcaaaa"
This input specifies two tokens t1, t2, and an INPUT_TEXT "aaabbcaaaa".
4. tok (a).((b)|(_)) , toktok (a)|(_)#
"aaabbcaaaa"
This input specifies two tokens whose names are tok, toktok, and an INPUT_TEXT "aaabbcaaaa".
Recall that in the description of regular expressions, underscore represents epsilon, so the regular
expressions for the token tok is equivalent to (a).((b)|()) and the regular expressions for the token
toktok is equivalent to (a)|()
2
Note 1
None of the token in the list of tokens in the input to your program will have a regular expressions that can
generate the empty string.
Note 2
The code we provided breaks down the input to the program into tokens like ID, LPAREN, RPAREN and so
on. To read the input, the code we provide has an object called lexer and a function getToken() used in
reading the input according to the fixed list of tokens for the input to the program.
Your program will then have to break down the INPUT_TEXT string into a sequence of tokens according
to the list of token in the input to the program. In order not to confuse the function that breaks down
the INPUT_TEXT from the function getToken() in the code we provided, you should call your function
something else like my_getToken()
3. Output Format
The output will be a sequence of tokens and their corresponding lexemes according to the list of tokens
provided in the input or SYNTAX ERROR:
1. if the input to your program is in the correct format, the program should do lexical analysis on
INPUT_TEXT and should produce a sequence of tokens and lexemes in INPUT_TEXT according to the list
of tokens specified in the input to your program. Each token and lexeme should be printed on a separate
line. The output on a given line will be of the form
t , "lexeme"
where t is the name of a token and lexeme is the actual lexeme for the token t. If during lexical analysis
of INPUT_TEXT, a syntax error is encountered then ERROR is printed on a separate line and the program
exits.
In doing lexical analysis for INPUT_TEXT, SPACE is treated as a separator and is otherwise ignored.
2. if the input to your program is not in the correct format, the program should output SYNTAX ERROR
and nothing else. The code we provided already handles this case so you should make sure not to print
anything before the complete parsing of the input is done. If a syntax error is encountered, the code we
provided will print syntax error and exits the program. If the code we provided does not encounter
a syntax error in the input to the program, then your code will break down the INPUT_TEXT into
tokens and lexeme by repeatedly calling the function my_getToken() that you will write.
Examples
Each of the following examples gives an input and the corresponding expected output.
1. t1 (a)|(b) , t2 ((a)*).(a) , t3 (((a)|(b))*).(((c)*).(c)) #
"a aac bbc aabc"
This input specifies three tokens t1, t2, and t3 and an INPUT_TEXT "a aa bb aab". Since the input
is in the correct format, the output of your program should be the list tokens in the INPUT_TEXT:
t1 , "a"
t2 , "aac"
t3 , "bbc"
t3 , "aabc"
2. t1 (a)|(b) , t2 ((a)*).(a) , t3 (((a)|(b))*).(c) #
"a aa bbc aad aa"
3
Since the input is in the correct format, the output of your program should be the list tokens in the
INPUT_TEXT the output of the program should be
t1 , "a"
t2 , "aa"
t3 , "bbc"
t2 , "aa"
ERROR
Note that the input is in the correct format, but doing lexical analysis for INPUT_TEXT according to
the list of tokens produces ERROR after the second t2 token because there is no token that starts with
’d’.
3. t1a (a)|(b) , t2bc (a).((a)*) , t34 (((a)|(b))*).((c)|(d))#
"aaabbcaaaa"
This input specifies three tokens whose names are t1a, t2bc, and t34 and an input text "aaabbcaaaa".
Since the input is in the correct format, the output of your program should be the list tokens in the
INPUT_TEXT:
t34 , "aaabbc"
t2bc , "aaaa"
4. Requirements
You should write a program to generate the correct output for a given input as described above. You will
be provided with a number of test cases. Since this is the first project, the number of test cases provided
with the project will be relatively large. In your solution, you are not allowed to use any built-in
or library support for regular expressions in C/C++. This requirement will be enforced by
checking your code.
5. How to Implement a Solution
The main difficulty in coming up with a solution is to transform a given list of token names and their regular
expression descriptions into a my_getToken() function for the given list of tokens. This transformation will
be done in three high-level steps:
1. Transform regular expressions into REGs. The goal here is to parse a regular expression description
and generate a graph that represents the regular expression1
. The generated graph will have a specific
format and I will describe below how to generate it. I will call it a regular expression graph, or REG for
short.
2. Write a function match(r,s,p), where r is a REG , s is a string and p is a position in the string s.
The function match will return the longest possible lexeme starting from position p in the string s that
matches the regular expression of the graph r.
3. Write a class my_LexicalAnalyzer(list,s) where list is a list of structures: {token_name, reg_pointer}
and s is an input string. my_LexicalAnalyzer stores the list of structures and keeps track of the part
of the input string that has been processed. my_Lexical analyzer has a method my_getToken().
For every call of my_getToken(), match(r,s,p) is called for every REG r in the list starting from
the current position p maintained in my_LexicalAnalyzer. my_getToken() returns the token with
the longest matching prefix together with its lexeme and updates the current position. If the longest
matching prefix matches more than one token, the matched token that is listed first in the list of tokens
is returned.
1The graph is a representation of a non-deterministic finite state automaton
4
Figure 1: Regular expressions graphs for the base cases
Figure 2: Regular expression graph for the an expression obtained using the dot operator
In what follows I describe how a regular expression description can be transformed into a REG and how to
implement the function match(r,s,p).
Constructing REGs
The construction of REGs is done recursively. The construction we use is called Thompson’s construction.
Each REG has a one start node and one final node. For the base cases of epsilon and a, where a is
a character of the alphabet, the REGs are shown in Figure 1. For the recursive cases, the constructions
are shown in Figures 2, 3, and 4. An example REG for the regular expression ((a)*).((b)*) is shown in
Figure 5.
Data Structures and Code for REGs
In the construction of REGs, every node has at most two outgoing arrows. This will allow us to use a simple
representation of a REG node.
5
Figure 3: Regular expression graph for the an expression obtained using the or operator
Figure 4: Regular expression graph for the an expression obtained using the star operator
struct REG_node {
struct REG_node * first-neighbor;
char first_label;
struct REG_node * second_neighbor;
char second_label;
}
In the representation, first_neighbor is the first node pointed to by a node and second_neighbor is the
second node pointed to by a node. first_label and second_label are the labels of the arrows from the
node to its neighbors. If a node has only one neighbor, then second_neighbor will be NULL. If a node has
no neighbors, then both first_neighbor and second_neighbor will be NULL.
struct REG {
struct REG_node * start;
struct REG_node * accept;
}
6
Figure 5: Regular expression graph for the an expression obtained using concatenation and star operators
Figure 6: Data structure representation for the REG of ((a)*).((b*))
In the code we provided, there is a function called parse_expr() that parses regular expressions but returns
nothing (return type in the provided code is void). You should modify this function so that it not only parses
a regular expression, but also returns the REG of the regular expression that is parsed (you should change
the return type of the provided function so that it returns a pointer to a REG). The construction of REGs is
done recursively. An outline of the process is shown on the next page.
7
struct REG * parse_expr()
{
// if expression is UNDERSCORE or a CHAR, say 'a' for example
// create a REG for the expression and return a pointer to it
// (see Figure 1, for how the REG looks like)
// if expression is (R1).(R2)
//
// the program will call parse_expr() twice, once
// to parse R1 and once to parse R2
//
// Each of the two calls will return a REG, say they are
// r1 and r2
//
// construct a new REG r for (R1).(R2) using the
// two REGs r1 and r2
// (see Figure 2 for how the two REGs are combined)
//
// return r
//
// the cases for (R1)|(R2) and (R)* are similar and
// are omitted from the description
}
Detailed Examples for REG Construction
I consider the regular expression ((a)*).((b)*) and explain step by step how its REG is constructed
(Figure 5).
When parsing the ((a)*).((b)*), the first expression to be parsed is a and its REG is constructed according
to Figure 1. In Figure 5, the nodes for the REG of the regular expression a have numbers 1 and 2 to indicate
that they are the first two nodes to be created.
The second expression to be parsed when parsing ((a)*).((b)*) is (a)*. The REG for (a)* is obtained
from the REG for the regular expression a by adding two more nodes (3 and 4) and adding the appropriate
arrows as described in the general case in Figure 4. The starting node for the REG of (a)* is the newly
created node 3 and the accepting node is the newly created node 4.
The third regular expression to be parsed while parsing ((a)*).((b)*) is the regular expression b. The REG
for regular expression b is constructed as shown in Figure 1. The nodes for this REG are numbered 5 and 6.
The fourth regular expression to be parsed while parsing ((a)*).((b)*) is (b)*. The REG for (b)* is
obtained from the REG for the regular expression b by adding two more nodes (7 and 8) and adding the
appropriate arrows as described in the general case in Figure 4. The starting node for the REG of (b)* is the
newly created node 7 and the accepting node is the newly created node 8.
Finally, the last regular expression to be parsed is the regular expression ((a)*).((b)*). The REG of
((a)*).((b)*) is obtained from the REGs of (a)* and (b)* by creating a new REG whose initial node is
node 3 and whose accepting node is node 8 and adding an arrow from node 4 (the accepting node of the REG
of (a)*) to node 7 (the initial node for the REG of (b)*).
Another example for the REG of (((a)*).((b).(b)))|((a)*) is shown in Figures 8 and 9. In the next
section, I will use REG of (((a)*).((b).(b)))|((a)*) to illustrate how match(r,s,p) can be implemented.
8
set_of_nodes match_one_char(set_of_nodes S, char c)
{
// 1. find all nodes that can be reached from S by consuming c
//
// S' = empty set
// for every node n in S
// if ( (there is an edge from n to m labeled with c) &&
// ( m is not in S) ) {
// add m to S'
// }
//
// if (S' is empty)
// return empty set
//
// At this point, S' is not empty and it contains the nodes that
// can be reached from S by consuming the character c directly
//
//
// 2. find all nodes that can be reached from the resulting
// set S' by consuming no input
//
// changed = true
// S'' = empty set
// while (changed) {
// changed = false
// for every node n in S' {
// add n to S''
// for ever neighbor m of n {
// if ( (the edge from n to m labeled with '_') && ( m is not in S'') )
// add m to S''
// }
// }
// if (S' not equal to S'') {
// changed = true;
// S' = S''
// S'' = empty set
// }
// }
//
// at this point the set S' contains all nodes that can be reached
// from S by first consuming C, then traversing 0 or more epsilon
// edges
//
// return S'
}
Figure 7: Pseudocode for matching one character
.
Implementing match(r,s,p)
Given an REG r, a string s and a position p in the string s, we would like to determine the longest possible
lexeme that matches the regular expression for r.
As you will see in CSE355, a string w is in L(R) for a regular expression R with REG r if and only if there is
a path from the starting node of r to the accepting node of r such that w is equal to the concatenation of
all labels of the edges along the path. I will not into the details of the equivalence in this document. I will
describe how to find the longest possible substring w of s starting at position p such that there is a path from
the starting node of r to the accepting node of r that can be labeled with w.
9
Figure 8: Regular expression graph ((a)*).((b).(b))
Figure 9: Regular expression graph (((a)*).((b).(b)))|((a)*)
To implement match(r,s,p), we need to be able to determine for a given input character a and a set of nodes
S the set of nodes that can be reached from nodes in S by consuming a. To consume a we can traverse any
number of edges labeled ’_’, traverse one edge labeled a, then traverse any number of edges labeled ’_’. To
match one character, you will implement a function called match_one_char() shown in Figure 7. For a given
character a and a given set of nodes S, match_one_char() will find all the nodes that can be reached from S
by consuming the single character a.
In order to match a whole string, we need to match the characters of the strings one after another. At each
step, the solution will keep track of the set of nodes S that can be reached by consuming the prefix of the
input string that has been processed so far.
To implement match(r,s,p), we start with the set of nodes that can be reached from the starting node of r
by consuming no input. Then we repeatedly call match_one_char() for successive characters of the string s
starting from position p until the returned set of nodes S is empty or we run out of input. If at any point
during the repeated calls to match_one_char() the set S of nodes contains the accepting node, we note the
fact that the prefix of string s starting from position p up to the current position is matching. At the end of
the calls to match_one_char() when S is empty or the end of input is reached, the last matched prefix is
the one returned by match(r,s,p). If none of the prefixes are matched, then there is no match for r in s
10
starting at p.
Note.
? The algorithms given above are not the most efficient, but they are probably the simplest to implement
the matching functions.
? The algorithm uses sets, so you need to have a representation for a set of nodes and to do operations on
sets of nodes. The provided code includes a program that shows how one can define and manipulate
sets in C++.
Detailed Example for Implementing match(r,s,p)
In this section, I illustrate the steps of an execution of match(r,s,p) on the REG of (((a)*).((b).(b)))|((a)*)
shown in Figure 9. The input string we will consider is the string s = "aaba" and the initial position is p =
0.
1. Initially
The set of states that can reached by consuming no input starting from node 17 is S0 = {17, 3, 1, 4, 9, 15, 13, 16, 18}
Note that S0 contains node 18 which means that the empty string is a matching prefix.
2. Consuming a
The set of states that can be reached by consuming a starting from S0 is S1 = {2, 14}
The set of states that can be reached by consuming no input starting from S1 is S1_ = {2, 1, 4, 9, 14, 13, 16, 18}
Note that S1_ contains node 18, which means that the prefix "a" is a matching prefix.
3. Consuming a
The set of states that can be reached by consuming a starting from S1_ is S2 = {2, 14}
The set of states that can be reached by consuming no input starting from S2 is S2_ = {2, 1, 4, 9, 14, 13, 16, 18}
Note that S2_ contains node 18, which means that the prefix "aa" is a matching prefix.
4. Consuming b
The set of states that can be reached by consuming b starting from S2_ is S3 = {10}
The set of states that can be reached by consuming no input starting from S3 is S3_ = {10, 11}
Note that S3_ does not contain node 18 which means that "aab" is not a matching prefix, but is still a viable prefix.
5. Consuming a
The set of states that can be reached by consuming a starting from S3_ is S4 = {}
Since S4 is empty, "aaba" is not viable and we stop.
The longest matching prefix is aa. This is the lexeme that is returned. Note that the second call to
match(r,s,p) starting after "aa" will return ERROR.
6. Instructions
Follow these steps:
Download the lexer.cc, lexer.h, inputbuf.cc and inputbuf.h files accompanying this project
description. Note that these files might be a little different from the code you’ve seen in class or
elsewhere.
11
Compile your code using GCC version 4.8.5 compiler on CentOS 7. You will need to use the g++
command to compile your code in a terminal window. See section 4 for more details on how to compile
using GCC.
Note that you are required to compile and test your code on CentOS 7 using the GCC compiler
version 4.8.5. You are free to use any IDE or text editor on any platform, however, using tools available in
CentOS (or tools that you could install on CentOS) could save time in the development/compile/test cycle.
Test your code to see if it passes the provided test cases. You will need to extract the test cases from
the zip file and run the test script test1.sh. See section 5 for more details.
Submit your code on the course submission website before the deadline. You can submit as many times
as you need. Make sure your code is compiled correctly on the website, if you get a compiler error, fix
the problem and submit again.
Only the last version you submit is graded. There are no exception to this.
Keep in mind that
You should use C/C++, no other programming languages are allowed.
All programming assignments in this course are individual assignments. Students must complete the
assignments on their own.
You should submit your code on the course submission website, no other submission forms will be
accepted.
You should familiarize yourself with the CentOS environment and the GCC compiler. Programming
assignments in this course might be very different from what you are used to in other classes.
Evaluation
The submissions are evaluated based on the automated test cases on the submission website. Your grade will
be proportional to the number of test cases passing. If your code does not compile on the submission website,
you will not receive any points.
NOTE: The next two sections apply to all programming assignments.
You should use the instructions in the following sections to compile and test your programs for all programming
assignments in this course.
Compiling your code with GCC
You should compile your programs with the GCC compilers which are available in CentOS 7. GCC is a
collection of compilers for many programming languages. There are separate commands for compiling C and
C++ programs:
Use the gcc command to compile C programs
Use the g++ command to compile C++ programs
Here is an example of how to compile a simple C++ program:
$ g++ test_program.cpp
If the compilation is successful, it will generate an executable file named a.out in the same folder as the
program. You can change the output file name by specifying the -o option:
$ g++ test_program.cpp -o hello.out
12
To enable C++11 with g++, use the -std=c++11 option:
$ g++ -std=c++11 test_program.cpp -o hello.out
The following table summarizes some useful GCC compiler options:
Switch Can be used with Description
-o path gcc, g++ Change the filename of the generated artifact
-g gcc, g++ Generate debugging information
-ggdb gcc, g++ Generate debugging information for use by GDB
-Wall gcc, g++ Enable most warning messages
-w gcc, g++ Inhibit all warning messages
-std=c++11 g++ Compile C++ code using 2011 C++ standard
-std=c99 gcc Compile C code using ISO C99 standard
-std=c11 gcc Compile C code using ISO C11 standard
You can find a comprehensive list of GCC options in the following page:
https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/
Compiling projects with multiple files
If your program is written in multiple source files that should be linked together, you can compile and link all
files together with one command:
$ g++ file1.cpp file2.cpp file3.cpp
Or you can compile them separately and then link:
$ g++ -c file1.cpp
$ g++ -c file2.cpp
$ g++ -c file3.cpp
$ g++ file1.o file2.o file3.o
The files with the .o extension are object files but are not executable. They are linked together with the last
statement and the final executable will be a.out.
You can replace g++ with gcc in all examples listed above to compile C programs.
Testing your code on CentOS
Your programs should not explicitly open any file. You can only use the standard input e.g. std::cin in
C++, getchar(), scanf() in C and standard output e.g. std::cout in C++, putchar(), printf() in
C for input/output.
However, this restriction does not limit our ability to feed input to the program from files nor does it mean
that we cannot save the output of the program in a file. We use a technique called standard IO redirection to
achieve this.
Suppose we have an executable program a.out, we can run it by issuing the following command in a terminal
(the dollar sign is not part of the command):
$ ./a.out
If the program expects any input, it waits for it to be typed on the keyboard and any output generated by
the program will be displayed on the terminal screen.
To feed input to the program from a file, we can redirect the standard input to a file:
$ ./a.out < input_data.txt
13
Now, the program will not wait for keyboard input, but rather read its input from the specified file. We can
redirect the output of the program as well:
$ ./a.out > output_file.txt
In this way, no output will be shown in the terminal window, but rather it will be saved to the specified file.
Note that programs have access to another standard stream which is called standard error e.g. std::cerr in
C++, fprintf(stderr, ...) in C. Any such output is still displayed on the terminal screen. It is possible
to redirect standard error to a file as well, but we will not discuss that here.
Finally, it’s possible to mix both into one command:
$ ./a.out < input_data.txt > output_file.txt
Which will redirect standard input and standard output to input_data.txt and output_file.txt respectively.
Now that we know how to use standard IO redirection, we are ready to test the program with test cases.
Test Cases
A test case is an input and output specification. For a given input there is an expected output. A test case for
our purposes is usually represented by two files:
test_name.txt
test_name.txt.expected
The input is given in test_name.txt and the expected output is given in test_name.txt.expected.
To test a program against a single test case, first we execute the program with the test input data:
$ ./a.out < test_name.txt > program_output.txt
The output generated by the program will be stored in program_output.txt. To see if the program generated
the expected output, we need to compare program_output.txt and test_name.txt.expected. We do that
using a general purpose tool called diff:
$ diff -Bw program_output.txt test_name.txt.expected
The options -Bw tell diff to ignore whitespace differences between the two files. If the files are the same
(ignoring the whitespace differences), we should see no output from diff, otherwise, diff will produce a
report showing the differences between the two files.
We would simply consider the test passed if diff could not find any differences, otherwise we consider the
test failed.
Our grading system uses this method to test your submissions against multiple test cases. There is also a
test script accompanying this project test1.sh which will make your life easier by testing your code against
multiple test cases with one command.
Here is how to use test1.sh to test your program:
Store the provided test cases zip file in the same folder as your project source files
Open a terminal window and navigate to your project folder
Unzip the test archive using the unzip command: bash $ unzip test_cases.zip
NOTE: the actual file name is probably different, you should replace test_cases.zip with the correct file
name.
Store the test1.sh script in your project directory as well
Make the script executable: bash $ chmod +x test1.sh
Compile your program. The test script assumes your executable is called a.out
Run the script to test your code: bash $ ./test1.sh
14
The output of the script should be self explanatory. To test your code after you make changes, you will just
perform the last two steps (compile and run test1.sh).
版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。