联系方式

  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-21:00
  • 微信:codinghelp

您当前位置:首页 >> C/C++编程C/C++编程

日期:2018-11-21 09:59

CSE340 Fall 2018 Project 1: Lexical Analysis

Due: Friday, September 14, 2018 by 11:59 pm MST

1. Introduction

I will start with a high-level description of the project in this section and then go into a detailed description

on how to go about achieving these goals in subsequent sections.

The goal of this project is to implement the function getToken() that I described in class automatically for

any list of tokens. The input to your program will have two parts:

1. The first part of the input is a list of tokens separated by commas and terminated with the # (hash)

symbol. Each token in the list consists of a token name and a token description. The token description

is a regular expression for the token. The list has the following form:

t1_name t1_description , t2_name t2_description , ... , tk_name tk_description #

2. The second part of the input is a string of letters and digits and space characters.

Your program will read the list of tokens, represent them internally in appropriate data structures, and then

do lexical analysis on the input string to break it down into a sequence of tokens from the provided list and

corresponding lexemes. The output of the program will be this sequence of tokens and lexemes. If during the

processing of the input string, your program cannot identify a token to match for the next call to getToken(),

it outputs ERROR and stops.

More details about the input format and the expected output of your program are given in what follows.

The remainder of this document is organized as follows.

1. The second section describes the input format.

2. The third section describes the expected output.

3. The fourth section describes the requirements on your solution.

4. The fifth and largest section is a detailed explanation how to go about implementing a solution.

5. The sixth section gives instructions for this programming assignment and additional instructions that

apply to all programming assignments in this course.

2. Input Format

The code that we provided reads the input for you, so you do not need to do anything in terms or parsing the

input to your program. What you need to do is to store the information being parsed by the code that we

provided in appropriate data structures to allow your program to break down the input string into a list of

tokens and lexemes according to the provided list. I describe the format of the input for completeness, but, as

I said, reading the input is done for you by the provided code.

The input of your program is specified by the following context-free grammar:

1

input → tokens_section INPUT_TEXT

tokens_section → token_list HASH

token_list → token

token_list → token COMMA token_list

token → ID expr

expr → CHAR

expr → LPAREN expr RPAREN DOT LPAREN expr RPAREN

expr → LPAREN expr RPAREN OR LPAREN expr RPAREN

expr → LPAREN expr RPAREN STAR

expr → UNDERSCORE

Where

CHAR = a | b | ... | z | A | B | ... | Z | 0 | 1 | ... | 9

LETTER = a | b | ... | z | A | B | ... | Z

SPACE = ' ' | \n | \t

INPUT_TEXT = " (CHAR | SPACE)* "

COMMA = ','

LPAREN = '('

RPAREN = ')'

STAR = '*'

DOT = '.'

OR = '|'

UNDERSCORE = '_'

ID = LETTER . CHAR*

In the description of regular expressions, UNDERSCORE represents epsilon.

Examples

The following are examples of input.

1. t1 (a)|(b) , t2 (a).((a)*) , t3 (((a)|(b))*).(c) #

"a aa bb aab"

This input specifies three tokens t1, t2, and t3 and an INPUT_TEXT "a aa bb aab".

2. t1 (a)|(b) , t2 ((c)*).(b) #

"a aa bb aad aa"

This input specifies two tokens t1, t2, and an INPUT_TEXT "a aa bb aad aa".

3. t1 (a)|(b) , t2 (c).((a)*) , t3 (((a)|(b))*).(((c)|(d))*)#

"aaabbcaaaa"

This input specifies two tokens t1, t2, and an INPUT_TEXT "aaabbcaaaa".

4. tok (a).((b)|(_)) , toktok (a)|(_)#

"aaabbcaaaa"

This input specifies two tokens whose names are tok, toktok, and an INPUT_TEXT "aaabbcaaaa".

Recall that in the description of regular expressions, underscore represents epsilon, so the regular

expressions for the token tok is equivalent to (a).((b)|()) and the regular expressions for the token

toktok is equivalent to (a)|()

2

Note 1

None of the token in the list of tokens in the input to your program will have a regular expressions that can

generate the empty string.

Note 2

The code we provided breaks down the input to the program into tokens like ID, LPAREN, RPAREN and so

on. To read the input, the code we provide has an object called lexer and a function getToken() used in

reading the input according to the fixed list of tokens for the input to the program.

Your program will then have to break down the INPUT_TEXT string into a sequence of tokens according

to the list of token in the input to the program. In order not to confuse the function that breaks down

the INPUT_TEXT from the function getToken() in the code we provided, you should call your function

something else like my_getToken()

3. Output Format

The output will be a sequence of tokens and their corresponding lexemes according to the list of tokens

provided in the input or SYNTAX ERROR:

1. if the input to your program is in the correct format, the program should do lexical analysis on

INPUT_TEXT and should produce a sequence of tokens and lexemes in INPUT_TEXT according to the list

of tokens specified in the input to your program. Each token and lexeme should be printed on a separate

line. The output on a given line will be of the form

t , "lexeme"

where t is the name of a token and lexeme is the actual lexeme for the token t. If during lexical analysis

of INPUT_TEXT, a syntax error is encountered then ERROR is printed on a separate line and the program

exits.

In doing lexical analysis for INPUT_TEXT, SPACE is treated as a separator and is otherwise ignored.

2. if the input to your program is not in the correct format, the program should output SYNTAX ERROR

and nothing else. The code we provided already handles this case so you should make sure not to print

anything before the complete parsing of the input is done. If a syntax error is encountered, the code we

provided will print syntax error and exits the program. If the code we provided does not encounter

a syntax error in the input to the program, then your code will break down the INPUT_TEXT into

tokens and lexeme by repeatedly calling the function my_getToken() that you will write.

Examples

Each of the following examples gives an input and the corresponding expected output.

1. t1 (a)|(b) , t2 ((a)*).(a) , t3 (((a)|(b))*).(((c)*).(c)) #

"a aac bbc aabc"

This input specifies three tokens t1, t2, and t3 and an INPUT_TEXT "a aa bb aab". Since the input

is in the correct format, the output of your program should be the list tokens in the INPUT_TEXT:

t1 , "a"

t2 , "aac"

t3 , "bbc"

t3 , "aabc"

2. t1 (a)|(b) , t2 ((a)*).(a) , t3 (((a)|(b))*).(c) #

"a aa bbc aad aa"

3

Since the input is in the correct format, the output of your program should be the list tokens in the

INPUT_TEXT the output of the program should be

t1 , "a"

t2 , "aa"

t3 , "bbc"

t2 , "aa"

ERROR

Note that the input is in the correct format, but doing lexical analysis for INPUT_TEXT according to

the list of tokens produces ERROR after the second t2 token because there is no token that starts with

’d’.

3. t1a (a)|(b) , t2bc (a).((a)*) , t34 (((a)|(b))*).((c)|(d))#

"aaabbcaaaa"

This input specifies three tokens whose names are t1a, t2bc, and t34 and an input text "aaabbcaaaa".

Since the input is in the correct format, the output of your program should be the list tokens in the

INPUT_TEXT:

t34 , "aaabbc"

t2bc , "aaaa"

4. Requirements

You should write a program to generate the correct output for a given input as described above. You will

be provided with a number of test cases. Since this is the first project, the number of test cases provided

with the project will be relatively large. In your solution, you are not allowed to use any built-in

or library support for regular expressions in C/C++. This requirement will be enforced by

checking your code.

5. How to Implement a Solution

The main difficulty in coming up with a solution is to transform a given list of token names and their regular

expression descriptions into a my_getToken() function for the given list of tokens. This transformation will

be done in three high-level steps:

1. Transform regular expressions into REGs. The goal here is to parse a regular expression description

and generate a graph that represents the regular expression1

. The generated graph will have a specific

format and I will describe below how to generate it. I will call it a regular expression graph, or REG for

short.

2. Write a function match(r,s,p), where r is a REG , s is a string and p is a position in the string s.

The function match will return the longest possible lexeme starting from position p in the string s that

matches the regular expression of the graph r.

3. Write a class my_LexicalAnalyzer(list,s) where list is a list of structures: {token_name, reg_pointer}

and s is an input string. my_LexicalAnalyzer stores the list of structures and keeps track of the part

of the input string that has been processed. my_Lexical analyzer has a method my_getToken().

For every call of my_getToken(), match(r,s,p) is called for every REG r in the list starting from

the current position p maintained in my_LexicalAnalyzer. my_getToken() returns the token with

the longest matching prefix together with its lexeme and updates the current position. If the longest

matching prefix matches more than one token, the matched token that is listed first in the list of tokens

is returned.

1The graph is a representation of a non-deterministic finite state automaton

4

Figure 1: Regular expressions graphs for the base cases

Figure 2: Regular expression graph for the an expression obtained using the dot operator

In what follows I describe how a regular expression description can be transformed into a REG and how to

implement the function match(r,s,p).

Constructing REGs

The construction of REGs is done recursively. The construction we use is called Thompson’s construction.

Each REG has a one start node and one final node. For the base cases of epsilon and a, where a is

a character of the alphabet, the REGs are shown in Figure 1. For the recursive cases, the constructions

are shown in Figures 2, 3, and 4. An example REG for the regular expression ((a)*).((b)*) is shown in

Figure 5.

Data Structures and Code for REGs

In the construction of REGs, every node has at most two outgoing arrows. This will allow us to use a simple

representation of a REG node.

5

Figure 3: Regular expression graph for the an expression obtained using the or operator

Figure 4: Regular expression graph for the an expression obtained using the star operator

struct REG_node {

struct REG_node * first-neighbor;

char first_label;

struct REG_node * second_neighbor;

char second_label;

}

In the representation, first_neighbor is the first node pointed to by a node and second_neighbor is the

second node pointed to by a node. first_label and second_label are the labels of the arrows from the

node to its neighbors. If a node has only one neighbor, then second_neighbor will be NULL. If a node has

no neighbors, then both first_neighbor and second_neighbor will be NULL.

struct REG {

struct REG_node * start;

struct REG_node * accept;

}

6

Figure 5: Regular expression graph for the an expression obtained using concatenation and star operators

Figure 6: Data structure representation for the REG of ((a)*).((b*))

In the code we provided, there is a function called parse_expr() that parses regular expressions but returns

nothing (return type in the provided code is void). You should modify this function so that it not only parses

a regular expression, but also returns the REG of the regular expression that is parsed (you should change

the return type of the provided function so that it returns a pointer to a REG). The construction of REGs is

done recursively. An outline of the process is shown on the next page.

7

struct REG * parse_expr()

{

// if expression is UNDERSCORE or a CHAR, say 'a' for example

// create a REG for the expression and return a pointer to it

// (see Figure 1, for how the REG looks like)

// if expression is (R1).(R2)

//

// the program will call parse_expr() twice, once

// to parse R1 and once to parse R2

//

// Each of the two calls will return a REG, say they are

// r1 and r2

//

// construct a new REG r for (R1).(R2) using the

// two REGs r1 and r2

// (see Figure 2 for how the two REGs are combined)

//

// return r

//

// the cases for (R1)|(R2) and (R)* are similar and

// are omitted from the description

}

Detailed Examples for REG Construction

I consider the regular expression ((a)*).((b)*) and explain step by step how its REG is constructed

(Figure 5).

When parsing the ((a)*).((b)*), the first expression to be parsed is a and its REG is constructed according

to Figure 1. In Figure 5, the nodes for the REG of the regular expression a have numbers 1 and 2 to indicate

that they are the first two nodes to be created.

The second expression to be parsed when parsing ((a)*).((b)*) is (a)*. The REG for (a)* is obtained

from the REG for the regular expression a by adding two more nodes (3 and 4) and adding the appropriate

arrows as described in the general case in Figure 4. The starting node for the REG of (a)* is the newly

created node 3 and the accepting node is the newly created node 4.

The third regular expression to be parsed while parsing ((a)*).((b)*) is the regular expression b. The REG

for regular expression b is constructed as shown in Figure 1. The nodes for this REG are numbered 5 and 6.

The fourth regular expression to be parsed while parsing ((a)*).((b)*) is (b)*. The REG for (b)* is

obtained from the REG for the regular expression b by adding two more nodes (7 and 8) and adding the

appropriate arrows as described in the general case in Figure 4. The starting node for the REG of (b)* is the

newly created node 7 and the accepting node is the newly created node 8.

Finally, the last regular expression to be parsed is the regular expression ((a)*).((b)*). The REG of

((a)*).((b)*) is obtained from the REGs of (a)* and (b)* by creating a new REG whose initial node is

node 3 and whose accepting node is node 8 and adding an arrow from node 4 (the accepting node of the REG

of (a)*) to node 7 (the initial node for the REG of (b)*).

Another example for the REG of (((a)*).((b).(b)))|((a)*) is shown in Figures 8 and 9. In the next

section, I will use REG of (((a)*).((b).(b)))|((a)*) to illustrate how match(r,s,p) can be implemented.

8

set_of_nodes match_one_char(set_of_nodes S, char c)

{

// 1. find all nodes that can be reached from S by consuming c

//

// S' = empty set

// for every node n in S

// if ( (there is an edge from n to m labeled with c) &&

// ( m is not in S) ) {

// add m to S'

// }

//

// if (S' is empty)

// return empty set

//

// At this point, S' is not empty and it contains the nodes that

// can be reached from S by consuming the character c directly

//

//

// 2. find all nodes that can be reached from the resulting

// set S' by consuming no input

//

// changed = true

// S'' = empty set

// while (changed) {

// changed = false

// for every node n in S' {

// add n to S''

// for ever neighbor m of n {

// if ( (the edge from n to m labeled with '_') && ( m is not in S'') )

// add m to S''

// }

// }

// if (S' not equal to S'') {

// changed = true;

// S' = S''

// S'' = empty set

// }

// }

//

// at this point the set S' contains all nodes that can be reached

// from S by first consuming C, then traversing 0 or more epsilon

// edges

//

// return S'

}

Figure 7: Pseudocode for matching one character

.

Implementing match(r,s,p)

Given an REG r, a string s and a position p in the string s, we would like to determine the longest possible

lexeme that matches the regular expression for r.

As you will see in CSE355, a string w is in L(R) for a regular expression R with REG r if and only if there is

a path from the starting node of r to the accepting node of r such that w is equal to the concatenation of

all labels of the edges along the path. I will not into the details of the equivalence in this document. I will

describe how to find the longest possible substring w of s starting at position p such that there is a path from

the starting node of r to the accepting node of r that can be labeled with w.

9

Figure 8: Regular expression graph ((a)*).((b).(b))

Figure 9: Regular expression graph (((a)*).((b).(b)))|((a)*)

To implement match(r,s,p), we need to be able to determine for a given input character a and a set of nodes

S the set of nodes that can be reached from nodes in S by consuming a. To consume a we can traverse any

number of edges labeled ’_’, traverse one edge labeled a, then traverse any number of edges labeled ’_’. To

match one character, you will implement a function called match_one_char() shown in Figure 7. For a given

character a and a given set of nodes S, match_one_char() will find all the nodes that can be reached from S

by consuming the single character a.

In order to match a whole string, we need to match the characters of the strings one after another. At each

step, the solution will keep track of the set of nodes S that can be reached by consuming the prefix of the

input string that has been processed so far.

To implement match(r,s,p), we start with the set of nodes that can be reached from the starting node of r

by consuming no input. Then we repeatedly call match_one_char() for successive characters of the string s

starting from position p until the returned set of nodes S is empty or we run out of input. If at any point

during the repeated calls to match_one_char() the set S of nodes contains the accepting node, we note the

fact that the prefix of string s starting from position p up to the current position is matching. At the end of

the calls to match_one_char() when S is empty or the end of input is reached, the last matched prefix is

the one returned by match(r,s,p). If none of the prefixes are matched, then there is no match for r in s

10

starting at p.

Note.

? The algorithms given above are not the most efficient, but they are probably the simplest to implement

the matching functions.

? The algorithm uses sets, so you need to have a representation for a set of nodes and to do operations on

sets of nodes. The provided code includes a program that shows how one can define and manipulate

sets in C++.

Detailed Example for Implementing match(r,s,p)

In this section, I illustrate the steps of an execution of match(r,s,p) on the REG of (((a)*).((b).(b)))|((a)*)

shown in Figure 9. The input string we will consider is the string s = "aaba" and the initial position is p =

0.

1. Initially

The set of states that can reached by consuming no input starting from node 17 is S0 = {17, 3, 1, 4, 9, 15, 13, 16, 18}

Note that S0 contains node 18 which means that the empty string is a matching prefix.

2. Consuming a

The set of states that can be reached by consuming a starting from S0 is S1 = {2, 14}

The set of states that can be reached by consuming no input starting from S1 is S1_ = {2, 1, 4, 9, 14, 13, 16, 18}

Note that S1_ contains node 18, which means that the prefix "a" is a matching prefix.

3. Consuming a

The set of states that can be reached by consuming a starting from S1_ is S2 = {2, 14}

The set of states that can be reached by consuming no input starting from S2 is S2_ = {2, 1, 4, 9, 14, 13, 16, 18}

Note that S2_ contains node 18, which means that the prefix "aa" is a matching prefix.

4. Consuming b

The set of states that can be reached by consuming b starting from S2_ is S3 = {10}

The set of states that can be reached by consuming no input starting from S3 is S3_ = {10, 11}

Note that S3_ does not contain node 18 which means that "aab" is not a matching prefix, but is still a viable prefix.

5. Consuming a

The set of states that can be reached by consuming a starting from S3_ is S4 = {}

Since S4 is empty, "aaba" is not viable and we stop.

The longest matching prefix is aa. This is the lexeme that is returned. Note that the second call to

match(r,s,p) starting after "aa" will return ERROR.

6. Instructions

Follow these steps:

Download the lexer.cc, lexer.h, inputbuf.cc and inputbuf.h files accompanying this project

description. Note that these files might be a little different from the code you’ve seen in class or

elsewhere.

11

Compile your code using GCC version 4.8.5 compiler on CentOS 7. You will need to use the g++

command to compile your code in a terminal window. See section 4 for more details on how to compile

using GCC.

Note that you are required to compile and test your code on CentOS 7 using the GCC compiler

version 4.8.5. You are free to use any IDE or text editor on any platform, however, using tools available in

CentOS (or tools that you could install on CentOS) could save time in the development/compile/test cycle.

Test your code to see if it passes the provided test cases. You will need to extract the test cases from

the zip file and run the test script test1.sh. See section 5 for more details.

Submit your code on the course submission website before the deadline. You can submit as many times

as you need. Make sure your code is compiled correctly on the website, if you get a compiler error, fix

the problem and submit again.

Only the last version you submit is graded. There are no exception to this.

Keep in mind that

You should use C/C++, no other programming languages are allowed.

All programming assignments in this course are individual assignments. Students must complete the

assignments on their own.

You should submit your code on the course submission website, no other submission forms will be

accepted.

You should familiarize yourself with the CentOS environment and the GCC compiler. Programming

assignments in this course might be very different from what you are used to in other classes.

Evaluation

The submissions are evaluated based on the automated test cases on the submission website. Your grade will

be proportional to the number of test cases passing. If your code does not compile on the submission website,

you will not receive any points.

NOTE: The next two sections apply to all programming assignments.

You should use the instructions in the following sections to compile and test your programs for all programming

assignments in this course.

Compiling your code with GCC

You should compile your programs with the GCC compilers which are available in CentOS 7. GCC is a

collection of compilers for many programming languages. There are separate commands for compiling C and

C++ programs:

Use the gcc command to compile C programs

Use the g++ command to compile C++ programs

Here is an example of how to compile a simple C++ program:

$ g++ test_program.cpp

If the compilation is successful, it will generate an executable file named a.out in the same folder as the

program. You can change the output file name by specifying the -o option:

$ g++ test_program.cpp -o hello.out

12

To enable C++11 with g++, use the -std=c++11 option:

$ g++ -std=c++11 test_program.cpp -o hello.out

The following table summarizes some useful GCC compiler options:

Switch Can be used with Description

-o path gcc, g++ Change the filename of the generated artifact

-g gcc, g++ Generate debugging information

-ggdb gcc, g++ Generate debugging information for use by GDB

-Wall gcc, g++ Enable most warning messages

-w gcc, g++ Inhibit all warning messages

-std=c++11 g++ Compile C++ code using 2011 C++ standard

-std=c99 gcc Compile C code using ISO C99 standard

-std=c11 gcc Compile C code using ISO C11 standard

You can find a comprehensive list of GCC options in the following page:

https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/

Compiling projects with multiple files

If your program is written in multiple source files that should be linked together, you can compile and link all

files together with one command:

$ g++ file1.cpp file2.cpp file3.cpp

Or you can compile them separately and then link:

$ g++ -c file1.cpp

$ g++ -c file2.cpp

$ g++ -c file3.cpp

$ g++ file1.o file2.o file3.o

The files with the .o extension are object files but are not executable. They are linked together with the last

statement and the final executable will be a.out.

You can replace g++ with gcc in all examples listed above to compile C programs.

Testing your code on CentOS

Your programs should not explicitly open any file. You can only use the standard input e.g. std::cin in

C++, getchar(), scanf() in C and standard output e.g. std::cout in C++, putchar(), printf() in

C for input/output.

However, this restriction does not limit our ability to feed input to the program from files nor does it mean

that we cannot save the output of the program in a file. We use a technique called standard IO redirection to

achieve this.

Suppose we have an executable program a.out, we can run it by issuing the following command in a terminal

(the dollar sign is not part of the command):

$ ./a.out

If the program expects any input, it waits for it to be typed on the keyboard and any output generated by

the program will be displayed on the terminal screen.

To feed input to the program from a file, we can redirect the standard input to a file:

$ ./a.out < input_data.txt

13

Now, the program will not wait for keyboard input, but rather read its input from the specified file. We can

redirect the output of the program as well:

$ ./a.out > output_file.txt

In this way, no output will be shown in the terminal window, but rather it will be saved to the specified file.

Note that programs have access to another standard stream which is called standard error e.g. std::cerr in

C++, fprintf(stderr, ...) in C. Any such output is still displayed on the terminal screen. It is possible

to redirect standard error to a file as well, but we will not discuss that here.

Finally, it’s possible to mix both into one command:

$ ./a.out < input_data.txt > output_file.txt

Which will redirect standard input and standard output to input_data.txt and output_file.txt respectively.

Now that we know how to use standard IO redirection, we are ready to test the program with test cases.

Test Cases

A test case is an input and output specification. For a given input there is an expected output. A test case for

our purposes is usually represented by two files:

test_name.txt

test_name.txt.expected

The input is given in test_name.txt and the expected output is given in test_name.txt.expected.

To test a program against a single test case, first we execute the program with the test input data:

$ ./a.out < test_name.txt > program_output.txt

The output generated by the program will be stored in program_output.txt. To see if the program generated

the expected output, we need to compare program_output.txt and test_name.txt.expected. We do that

using a general purpose tool called diff:

$ diff -Bw program_output.txt test_name.txt.expected

The options -Bw tell diff to ignore whitespace differences between the two files. If the files are the same

(ignoring the whitespace differences), we should see no output from diff, otherwise, diff will produce a

report showing the differences between the two files.

We would simply consider the test passed if diff could not find any differences, otherwise we consider the

test failed.

Our grading system uses this method to test your submissions against multiple test cases. There is also a

test script accompanying this project test1.sh which will make your life easier by testing your code against

multiple test cases with one command.

Here is how to use test1.sh to test your program:

Store the provided test cases zip file in the same folder as your project source files

Open a terminal window and navigate to your project folder

Unzip the test archive using the unzip command: bash $ unzip test_cases.zip

NOTE: the actual file name is probably different, you should replace test_cases.zip with the correct file

name.

Store the test1.sh script in your project directory as well

Make the script executable: bash $ chmod +x test1.sh

Compile your program. The test script assumes your executable is called a.out

Run the script to test your code: bash $ ./test1.sh

14

The output of the script should be self explanatory. To test your code after you make changes, you will just

perform the last two steps (compile and run test1.sh).


版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。 站长地图

python代写
微信客服:codinghelp