Project CS3103 Operating Systems Version 2.1 (20201021)
1 CS3103 - Operating Systems
Project: Parallel Zip
Due Date: Tuesday, November 24, 2020 at 8 PM HKT. Late submissions will be penalized as per syllabus.
I. Project Instructions
Overview
In the earlier programming assignment, you implemented a simple compression tool based on run-length
encoding, known simply as czip. For this project, you'll implement something similar, except you'll use
threads to make a parallel version of czip. We'll call this version pzip.
There are three specific objectives to this project:
• To familiarize yourself with the Linux pthreads library for writing multi-threaded programs.
• To learn how to parallelize a program.
• To learn how to program for performance.
Project Organization
Each group should do the following pieces to complete the project. Each piece is explained below:
• Design
• Implementation
• Evaluation
30 points
30 points
40 points
You’re required to submit a project report which concisely describes your design, implementation, and
experimental results.
Design
The design space of building an effective compression tool is large. A practical zip that achieve a better
performance will require you to address (at least) the following issues (see part II. Project Description for
more details):
• How to parallelize the compression.
• How to determine the number of threads to create.
• How to efficiently compress multiple files in parallel.
• How to access the input file efficiently.
In your project report, please describe the main techniques or mechanisms proposed to parallelize the
compression. List and describe all the functions used in this project.
2 CS3103 - Operating Systems
Implementation
Your code should be nicely formatted with plenty of comments. Each function should be preceded by a
header comment that describes what the function does. The code should be easy to read, properly indented,
employ good naming standards, good structure, and should correctly implement the design. Your code
should match your design.
Evaluation
We provide 10 test cases for you to test your code. Before submitting your work, please run your pzip on the
test cases and check if your pzip zips input files correctly. Time limitation is set for each test case, and if this
limit is exceeded, your test will fail. For each test case, your grade will be calculated as follows:
• Correctness. Your code will first be measured for correctness, ensuring that it zips input files correctly.
You will receive full points if your solution passes the correctness tests performed by the test script. You
will receive zero points for this test case if your code is buggy.
• Performance. If you pass the correctness tests, your code will be tested for performance; Test script will
record the running time of your program for performance evaluation. Shorter time/higher performance
will lead to better scores.
In your project report, summary and analyze the results. You can also compare your solution with the provided
baseline implementation.
Tips: Keep a log of work you have done. You may wish to list optimizations you tried, what failed, etc.
Keeping a good log will make it easy to put together your final writeup.
Bonus
You’re encouraged to be creative and innovative, and this project award bonus points (up to 10 points) for
additional and/or exemplary work.
• New ideas/designs are welcome to fully explore the parallelism of the process of zipping files.
Additional test cases will be used to test your solution for performance evaluation.
• To encourage healthy competition and desire to improve, we'll provide a scoreboard that shows scores
for each group. The latest scores are displayed, rank ordered, on the scoreboard. We will award bonus
points for top 10 groups. Further details will be posted on Canvas soon.
Language/Platform
The project should be written in ANSI standard C.
This project can be done on Linux (recommended), MacOS, or Windows using Cygwin. Since grading of
this project will be done using gateway Linux server, students who choose to develop their code on any
other machine are strongly encouraged to run their programs on the gateway Linux server before turning it
in. There will be no points for programs that do not compile and run on gateway Linux server, even if they
run somewhere else.
3 CS3103 - Operating Systems
Handing In
The project can be done individually, in pairs, or in groups, where each group can have a maximum of three
members. All students are required to join one project group in Canvas: "People" section > "Groups" tab >
“Project” Group > Group 01–Group 40 or Individual 41–Individual 50. Contact TA to add additional groups
if necessary. Self sign-up is enabled for these groups. Instead of all students having to submit a solution to the
project, Canvas allows one person from each group to submit on behalf of their team. If you work with
partner(s), both you and your partner(s) will receive the same grade for the entire project unless you explicitly
specify each team member's contribution in your report. Please be sure to indicate who are in your group when
submitting the project report.
Before you hand in, make sure to add the requested identifying information about your project group, which
contains project group number, full name and e-mail address for each group member.
When you're ready to hand in your solution, go to the course site in Canvas, choose the "Assignments" section >
"Project" group > "Project" item > "Submit Assignment" button and upload your files, including the following:
1) A PDF document which concisely describes your design, implementation, and experimental results;
If you are working in a team, please also describe each team member's contribution.
2) The source file, i.e., pzip.c;
Academic Honesty
All work must be developed by each group separately. Please write your own code. All submitted source
code will be scanned by anti-plagiarism software. If the code does not work, please indicate in the report
clearly.
Questions?
If you have questions, please first post them on Canvas so others can get the benefit of the TA's answer.
Avoid posting code that will give away your solution or allow others to cheat. If this does not resolve your
issue, contact the TA (Mr. LI Jinheng<jinheng.li@my.cityu.edu.hk>).
Acknowledgements
This project is taken and modified from the OSTEP book written by Remzi and Andrea Arpaci-Dusseau at
the University of Wisconsin. This free OS book is available at http://www.ostep.org. Automated testing scripts
are from Kyle C. Hale at the Illinois Institute of Technology.
Disclaimer
The information in this document is subject to change with notice via Canvas. Be sure to download the latest
version from Canvas.
4 CS3103 - Operating Systems
II. Project Description
For this project, you will implement a parallel version of zip using threads. First, recall how zip works by
reading the description in Assignment 1 Part II. You'll use the same basic specification, with run-length
encoding (RLE) as the basic technique.
RLE is quite simple: when you encounter n characters of the same type in a row, the compression tool (pzip)
will turn that into the number n and a single instance of the character.
Thus, if we had a file with the following contents:
aaaaaaaaaabbbb
The tool would turn it (logically) into:
10a4b
However, the exact format of the compressed file is quite important; here, you will write out a 4-byte integer
in binary format followed by the single character in ASCII. Thus, a compressed file will consist of some
number of 5-byte entries, each of which is comprised of a 4-byte integer (the run length) and the single
character. To write out an integer in binary format (not ASCII), you should use fwrite(). Read the man
page for more details. For pzip, all output should be written to standard output (the stdout file stream).
Your pzip will externally look the same as czip. However, internally, the program will use POSIX threads
to parallelize the compression process. The general usage from the command line will be as follows:
$ ./pzip file.txt > file.z
Doing so effectively and with high performance will require you to address (at least) the following issues:
• How to parallelize the compression. The central challenge of this project is to parallelize the
compression process. You are required to think about whether the compression process can be separated
into several sub-processes, what sub-processes can be done in parallel, and what sub-processes must be
done serially by a single thread. Then, you are required to design your parallel zip as appropriate. For
example, does it possible to zip several small sub-files using multiple threads instead of zipping a large
file using only one thread? If it's possible, how to divide the large file? How to zip those small sub-files
using multiple threads? How to merge zipped results of several small sub-files? One interesting issue that
the "best" implementations will handle is this: what happens if one thread runs much slower than another?
Does the compression give more work to faster threads? This issue is often referred to as the straggler
problem.
• How to determine the number of threads to create. On Linux, the determination of the number of
threads may refer to some interfaces like get_nprocs() and get_nprocs_conf(); You are suggested
to read the man pages for more details. Then, you are required to create an appropriate number of threads
to match the number of CPUs available on whichever system your program is running.
5 CS3103 - Operating Systems
• How to efficiently compress multiple files in parallel. In previous issues, you may have completed the
parallel compression for one large file. Now you are required to think about how to parallelize the
compression processes of multiple files. A naïve way is to sequentially process the parallel compression
process of each file. However, this method cannot fully explore the parallelism of the compression
processes of multiple files. You are required to explore the parallelism between the compression processes
of multiple files and design an efficient and fast parallel method to compress multiple files. Note that
when the input contains directories, you can first obtain the paths of all files in the directories recursively
using readdir(), then compress them as multiple files.
• How to access the input file efficiently. On Linux, there are many ways to read from a file, including C
standard library calls like fread() and raw system calls like read(). One particularly efficient way is
to use memory-mapped files, available via mmap(). By mapping the input file into the address space, you
can then access bytes of the input file via pointers and do so quite efficiently.
To understand how to make tackle these problems, you should first understand the basics of thread creation,
and perhaps locking and signaling via mutex locks and condition variables. Review the tutorials and read the
following chapters from OSTEP book carefully in order to prepare yourself for this project.
• Intro to Threads
• Threads API
• Locks
• Using Locks
• Condition Variables
6 CS3103 - Operating Systems
III. Project Guidelines
1. Getting Started
The project is to be done on the CSLab SSH gateway server, to which you should already be able to log in.
As before, follow the same copy procedure as you did in the previous tutorials to get the project files (code
and test files). They are available in /public/cs3103/project/ on the gateway server. project.zip contains
the following files/directories:
/project
├── Makefile
├── pzip <- A sample solution (executable file).
├── pzip.c <- Modify and hand in pzip.c file.
├── README.txt
└── tests
├── bin
│ ├── generator.py
│ ├── generic-tester.py
│ ├── serialized_runner.py
│ └── test-pzip.csh
├── config
│ ├── 1.json
│ ├── ...
│ └── 10.json
├── stdout
│ ├── 1.out
│ ├── 1.rc
│ ├── ...
│ └── 10.rc
└── tests-pzip
├── 1
│ └── 1-0.in
├── 2
│ ├── 2-0.in
│ ├── 2-1.in
│ ├── ...
│ └── 2-11.in
├── ...
└── 10
├── 10-0
│ ├── 10-0-0.in
│ ├── 10-0-1.in
7 CS3103 - Operating Systems
│ └── 10-0-2.in
├── 10-1
│ ├── 10-1-0.in
│ ├── 10-1-1.in
│ └── 10-1-2.in
├── 10-2.in
├── ...
└── 10-7.in
Start by copying the provided files to a directory in which you plan to do your work. For example, copy
/public/cs3103/project/project.zip to your home directory, extract the files from the ZIP file with the unzip
command. Note that the file size of project.zip is only 10MB, but the uncompressed project directory has a
size of 5GB. It takes about 90 seconds to unzip the project.zip file on our gateway server. After the unzip
command extracts all files to the current directory, change to the project directory and take a look at the
directory contents:
$ cd ~
$ cp /public/cs3103/project/project.zip .
$ unzip project.zip
$ cd project
$ ls
A sample pzip is also provided (we only provide a single executable file without source code). This pzip
uses pthread to support the parallel compression of multiple files or directories. When the input files of the
compression process contain directories, the pzip first obtains the paths of all files in the directories
recursively using readdir(), then treats the compression process as the compression of multiple files. For
the multiple files’ compression, the pzip uses mmap to map files into multiple pages and compress all pages
in parallel. The parallel compression process can be treated as producer-consumer problem. The pzip uses
one thread of producer to map files and multiple threads of consumers to compress pages.
You can run and test pzip by using the make run command.
$ make run
TEST 1 - single file test, a small file of 10 MB (2 sec timeout)
Test finished in 0.045 seconds
RESULT passed
(content removed for brevity)
TEST 10 - mixed test 3, two directories that contain three small files of
10 MB, 20 MB, 10 MB and three large files of 200 MB, 100 MB, 200 MB, and
six small files outside directory of 10 MB, 20 MB, 10 MB, 20 MB, 10 MB,
20 MB (2 sec timeout)
Test finished in 0.252 seconds
RESULT passed
8 CS3103 - Operating Systems
You can also use this one as a baseline implementation for performance evaluation, which means you can
compare the execution time of your pzip with that of the provided one in final report. Note that after building
your own pzip (using make or make test), the provided pzip file will be overwritten. But don’t worry,
you can always copy it from /public/cs3103/project/pzip.
2. Writing your pzip program
The pzip.c is the file that you will be handing in and is the only file you should modify. Write your code from
scratch or simply borrow the code from your czip to implement this parallel version of zip. Again, it's a good
chance to learn (as a side effect) how to use a proper code editor such as vim1,2.
3. Building your program
A simple makefile that describes how to compile pzip is provided for you.
To compile your pzip.c and to generate the executable file, use the make command within the directory that
contains your project. It will display the command used to compile the pzip.
$ make
gcc -Wall -Werror -pthread -O pzip.c -o pzip
Note that the -Werror compiler flag is specified. It causes all warnings to be treated as build errors. It would
be better to fix the compiling issue instead of disabling -Werror flag.
If everything goes well, there would an executable file pzip in it:
$ ls
Makefile pzip pzip.c README.txt tests
If you make some changes in pzip.c later, you should re-compile the project by running make command again.
To remove any files generated by the last make, use the make clean command.
$ make clean
rm -f pzip
$ ls
Makefile pzip.c README.txt tests
4. Testing your C program
We also provide 10 test cases for you to test your code. You can find them in the directory tests/tests-pzip/.
The makefile could also trigger automated testing scripts, type make run (run testing only) or make test
(build your program and run testing).
1 Interactive Vim tutorial, https://www.openvim.com/
2 Vim Cheat Sheet, https://vim.rtorr.com
9 CS3103 - Operating Systems
$ make test
TEST 0 - clean build (program should compile without errors or warnings)
Test finished in 0.189 seconds
RESULT passed
TEST 1 - single file test, a small file of 10 MB (2 sec timeout)
Test finished in 0.045 seconds
RESULT passed
TEST 2 - multiple files test, twelve small files of 10 MB, 20 MB, 30 MB,
10 MB, 20 MB, 30 MB, 10 MB, 20 MB, 30 MB, 10 MB, 20 MB, 30 MB (2 sec
timeout)
Test finished in 0.101 seconds
RESULT passed
TEST 3 - empty file test (2 sec timeout)
Test finished in 0.013 seconds
RESULT passed
TEST 4 - no file test (2 sec timeout)
Test finished in 0.014 seconds
RESULT passed
TEST 5 - single large file test, a large file of 100 MB (2 sec timeout)
Test finished in 0.074 seconds
RESULT passed
TEST 6 - multiple large files test, six large files of 100 MB, 200 MB, 300
MB, 100 MB, 200 MB, 300 MB (2 sec timeout)
Test finished in 0.435 seconds
RESULT passed
TEST 7 - directory test, a directory that contains twelve small files of
10 MB, 20 MB, 30 MB, 40 MB, 10 MB, 20 MB, 30 MB, 40 MB, 10 MB, 20 MB, 30
MB, 40 MB (2 sec timeout)
Test finished in 0.135 seconds
RESULT passed
TEST 8 - mixed test 1, a directory that contains six small files of 10 MB,
20 MB, 10 MB, 20 MB, 10 MB, 20 MB and six large files outside directory
of 100 MB, 200 MB, 300 MB, 100 MB, 200 MB, 300 MB (2 sec timeout)
10 CS3103 - Operating Systems
Test finished in 0.410 seconds
RESULT passed
TEST 9 - mixed test 2, a directory that contains six large files of 100
MB, 200 MB, 100 MB, 200 MB, 100 MB, 200 MB, and six small files outside
directory of 30 MB, 40 MB, 30 MB, 40 MB, 30 MB, 40 MB (2 sec timeout)
Test finished in 0.374 seconds
RESULT passed
TEST 10 - mixed test 3, two directories that contain three small files of
10 MB, 20 MB, 10 MB and three large files of 200 MB, 100 MB, 200 MB, and
six small files outside directory of 10 MB, 20 MB, 10 MB, 20 MB, 10 MB,
20 MB (2 sec timeout)
Test finished in 0.252 seconds
RESULT passed
The job of those automated scripts is to orderly run your pzip on the test cases and check if your pzip zips
input files correctly. TEST 0 (available for make test) will fail if your program is compiled with errors or
warnings. Time limitation is set for each test case, and if this limit is exceeded, your test will fail. Besides, the
script will record the running time of your program for performance evaluation. Shorter time/higher
performance will lead to better scores.
Below is a brief description of each test case:
• Test case 1-6: Test cases 1, 2, are small files. For test case 3, the file is empty. For test case 4, there is no
file. if no files are specified, the program should exit with return code 1 and print "pzip: file1 [file2 ...]"
(followed by a newline).
Test case 1) single file test – a small file.
Test case 2) multiple files test – twelve small files of different file size.
Test case 3) empty file test.
Test case 4) no files test.
Test case 5) single large file test – a large file.
Test case 6) multiple large files test – six large files of different file size.
• Test case 7-10: Some files are stored in a directory, and you are required to compress the directory and
other files. Do note that if multiple files are passed to pzip, they are compressed into a single compressed
output. The information that multiple files were originally input into pzip is lost.
Test case 7) directory test – a directory that contains twelve small files of different file size.
Test case 8) mixed test 1 – a directory that contains six small files, and six large files outside
directory.
Test case 9) mixed test 2 – a directory that contains six large files, and six small files outside
directory.
11 CS3103 - Operating Systems
Test case 10) mixed test 3 – two directories that contain three large files and three small files, and
six small files outside directory.
Each test consists of 5 files with different filename extension:
• n.json (in tests/config/ directory): The configuration of test case n.
§ binary: Indicates the data type of input and output is binary.
§ filename: The test case number.
§ filesize: The file size of each file. All numbers are included in a list. If an item is a list of
numbers, it indicates it is a directory.
§ timeout: Time limitation (seconds).
§ seed: The seed used to generate the content of input files.
§ description: A short text description of the test.
§ preparation: Code to run before the test, to set something up.
• n.rc (in tests/stdout/ directory): The return code the program should return (usually 0 or 1).
• n.out (in tests/stdout/ directory): The standard output expected from the test.
• n.in (in tests/tests-pzip/ directory): The input file of the test case.
Just like in Assignment 1, you can run and test your pzip manually. For example, to run your pzip to
compress the input file 1-0.in in tests/tests-pzip and save the compressed file as 1.out, enter:
$ ./pzip ./tests/tests-pzip/1/1-0.in > 1.out
To run your pzip to compress multiple input files (the input files 2-0.in, 2-1.in, 2-2.in) in tests/tests-pzip, and
save the compressed file as 2.out, enter:
$ ./pzip tests/tests-pzip/2/2-0.in tests/tests-pzip/2/2-1.in tests/testspzip/2/2-2.in
> 2.out
To run your pzip to compress the input directory 7-0 in tests/tests-pzip and save the compressed file as 7.out,
enter:
$ ./pzip tests/tests-pzip/7/7-0 > 7.out
To run your pzip to compress the input directory 8-0 and some input files (the input files 8-0.in, 8-1.in) in
tests/tests-pzip and save the compressed file as 8.out, enter:
$ ./pzip tests/tests-pzip/8/8-0 tests/tests-pzip/8/8-0.in tests/tests-pzip/8/8-
1.in > 8.out
To run other test cases, please run as follows:
$ ./pzip tests/tests-pzip/3/3-0.in > 3.out
$ ./pzip tests/tests-pzip/5/5-0.in > 5.out
$ ./pzip tests/tests-pzip/9/9-0 tests/tests-pzip/9/9-0.in tests/tests-pzip/9/9-
1.in > 9.out
版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。