CMPT 215: Introduction to Computer Organization and Architecture
Assignment 4
March 20, 2024
Due Date: Friday April 5th, 6:00 pm – late submissions will not be accepted
Total points: 100
NOTE: For the non-programming questions, turn in a single .txt or .pdf file containing your answers. For the programming questions, turn in a separate .s file (code) and a .txt file (design document) for each question. All files (.txt/.pdf + .s) should be bundled into a single tarball (named '<nsid>-a4.tar') using the command 'tar -cvf <nsid>-a4.tar <my file1> <my file2> ...' on the department's Linux machines. Verify that your tar file is not corrupted (this can happen when the contents of an existing tarball are updated incorrectly) by extracting it and checking its contents before submitting to Canvas. Your code must include comments describing what it does, and how. Code with syntax errors will receive little or no credit. You should develop your solutions incrementally on Git and show your progress as you work; i.e., commit every single change you make as you work on your solution.
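For example, before extracting, you can list the archive contents as a quick sanity check (replace <nsid> with your NSID):

$ tar -tvf <nsid>-a4.tar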
Additional Requirements
Styling: To simulate the styling requirements typically found in coding projects, starting with this assignment all text files (source code, design docs, git log, etc.) must have lines no longer than 80 characters. To verify the maximum line length, you can use the wc utility (use man wc or wc --help to find out how). One or more hand-in files exceeding the 80-character line limit will result in a 5% penalty on the assignment.
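For example, on the department's Linux machines the GNU version of wc can report the length of the longest line in a file (this assumes GNU coreutils, which provides the -L option):

$ wc -L vector.c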
Version control: The use of Git for the assignment is now mandatory. Failure to use Git at all will result in a grade of 0.
Question 1 [6 points]
Suppose that a program does read operations on the following memory addresses (e.g., with “lb” or “lw” instructions): 96, 508, 1200, 500. Give the number of the memory block that each of these addresses belongs to, for each of the following memory block sizes.
(a) [2 points] Block size of one word
(b) [2 points] Block size of two words
(c) [2 points] Block size of four words
Question 2 [10 points]
Give the position (or set) in the cache that would be checked on each of the read operations of Question 1, for each of the following caches:
(a) [2 points] Direct-mapped cache with total capacity of 16 one-word blocks
(b) [2 points] Direct-mapped cache with total capacity of 8 two-word blocks
(c) [2 points] Direct-mapped cache with total capacity of 4 four-word blocks
(d) [2 points] 4-way set-associative cache with total capacity of 16 one-word blocks
(e) [2 points] 2-way set-associative cache with total capacity of 4 four-word blocks
Question 3 [10 points]
Suppose that immediately following the four read operations of Question 1, the program does read operations on the following memory addresses (e.g., with “lb” or “lw” instructions): 168, 500, 100, 232, and 168 (again). Assuming the cache was empty prior to the read operations of Question 1, and LRU replacement for the set associative caches, state which of these read operations will result in cache hits, if any, for each of the following cache configurations.
(a) [2 points] Direct-mapped cache with total capacity of 16 one-word blocks
(b) [2 points] Direct-mapped cache with total capacity of 8 two-word blocks
(c) [2 points] Direct-mapped cache with total capacity of 4 four-word blocks
(d) [2 points] 4-way set-associative cache with total capacity of 16 one-word blocks
(e) [2 points] 2-way set-associative cache with total capacity of 4 four-word blocks
Question 4 [6 points]
Consider a system with a 2 GHz clock rate and a base CPI of 1.7. The system has an L1 data cache with a miss ratio of 5% and an L1 instruction cache with a miss ratio of 0.5%. The cache access time (including hit detection) is 1 ns, and the main memory access time is 40 ns. Assume that the frequency of load/store instructions is 28%.
(a) [1 point] What is the L1 miss penalty (in clock cycles)?
(b) [2 points] What is the average memory access cycles per instruction (for data and instructions)?
(c) [2 points] What is the effective CPI of the system using this cache?
(d) [1 point] What is the execution time if 3 billion instructions were executed?
Question 5 [6 points]
Repeat Question 4, but now add an L2 cache (for both data and instructions) with a miss ratio of 4% and an access time of 6 ns (including hit detection).
(a) [1 point] What is the L1 miss penalty (in clock cycles)?
(b) [2 points] What is the average memory access cycles per instruction (for data and instructions)?
(c) [2 points] What is the effective CPI of the system using both caches?
(d) [1 point] Compared to the system in Question 4, what is the speedup?
Question 6 [2 points]
Given the following assignment of some program's virtual pages to physical pages in a system with 4 KiB pages, what physical memory address corresponds to virtual address 20000? (All values are given in decimal.)
Virtual page number  | 0 | 1  | 2 | 3 | 4
Physical page number | 3 | 11 | 4 | 5 | 7
Question 7 [2 points]
Consider a system using a 2-level page table tree for address translation, with 16 KiB pages, 8 byte page table entries, and such that each page table fits within a single page. What is the maximum size of the virtual memory address space, in GiB?
Question 8 [53 points]
Matrix multiplication
We will attempt to implement and optimize the generalized matrix multiplication algorithm presented in class. To recall, the implementation multiplies matrix A by matrix B and writes the product into matrix C:
fgemm(A, B, C, n):
for (i = 0; i < n; i++)
for (j = 0; j < n; j++)
C[i][j] = 0
for (k = 0; k < n; k++)
C[i][j] += A[i][k] * B[k][j]
An element of the matrix C located at row i and column j is computed by finding the “dot product” of the i’th row of matrix A and the j’th column of matrix B. We can rewrite the above algorithm as follows:
vector_product(A, B, n, i, j):
cij = 0
for (k = 0; k < n; ++k)
cij += A[i][k] * B[k][j]
return cij
fgemm(A, B, C, n):
for (i = 0; i < n; i++)
for (j = 0; j < n; j++)
C[i][j] = vector_product(A, B, n, i, j)
This allows us to separate the “dot product” operation and optimize it separately.
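As a purely illustrative sketch (the parameter names and exact organization are assumptions; the provided mat-mul.c may differ), the rewritten fgemm could look like the following in C, for row-major n x n matrices of float:

float vector_product(float *A, float *B, unsigned int n,
                     unsigned int i, unsigned int j);

void fgemm(float *A, float *B, float *C, unsigned int n)
{
    /* C[i][j] is stored at offset i * n + j in row-major order. */
    for (unsigned int i = 0; i < n; i++)
        for (unsigned int j = 0; j < n; j++)
            C[i * n + j] = vector_product(A, B, n, i, j);
}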
Starter code
The starter code for this question is located under /student/cmpt215/starter/a4/
8.1 C implementation of vector product [6 points]
In a file named vector.c, write a C implementation of the function vector_product that takes 5 arguments: (1) a pointer of type float pointing to the first element of the square matrix A, (2) a pointer of type float pointing to the first element of the square matrix B, (3) an unsigned integer n indicating the size of the matrices, (4) an unsigned integer i, and (5) an unsigned integer j. The function computes and returns the "dot product" of the i'th row of matrix A and the j'th column of matrix B, as outlined in the algorithm above.
Note: The matrices are in row-major order; i.e., A[x][y] can be referenced using *(A + x * n + y) or A[x * n + y].
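For instance (purely illustrative, using the names from the algorithm above), the two elements multiplied in one iteration of the k loop are located as follows:

/* Illustrative only: one term of the dot product, using the
 * row-major layout described in the note above. */
static float one_term(float *A, float *B, unsigned int n,
                      unsigned int i, unsigned int j, unsigned int k)
{
    return A[i * n + k] * B[k * n + j];   /* A[i][k] * B[k][j] */
}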
8.2 Header files [9 points]
(a) [6 points] Write a header file matrix.h for the function(s) found in the provided files mat-util.c and mat-mul.c.
(b) [3 points] Write a header file vector.h for the function(s) found in the file vector.c.
Note: Your headers must have include guards, and you must include the headers using angle brackets, not double quotes.
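As a purely illustrative sketch of the include-guard pattern (the guard macro name and the prototype shown are assumptions; yours must match the functions you actually declare):

/* vector.h -- illustrative sketch of the include-guard pattern. */
#ifndef VECTOR_H
#define VECTOR_H

float vector_product(float *A, float *B, unsigned int n,
                     unsigned int i, unsigned int j);

#endif /* VECTOR_H */

Note that including a header with angle brackets (e.g., #include <vector.h>) typically requires telling the compiler where your headers live, for example with the -I option.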
8.3 Makefile [12 points]
Write a Makefile that creates the following:
i) The object files main.o, mat-util.o, mat-mul.o, and vector.o (in the appropriate build directory) from their corresponding C files,
ii) the static library libmatrix.a (in the appropriate lib directory) containing the objects mat-util.o and mat-mul.o,
iii) the static library libvector.a (in the appropriate lib directory) containing the object vector.o,
iv) the executable object mmul (in the appropriate bin directory) by linking the object main.o, the matrix library, and the vector library, and
v) a symbolic link to the executable object mmul at the root of the project.
Notes:
1. You must only use utilities from the RISC-V toolchain (gcc, as, ar, ...).
2. Instead of using ld to link, you will need to use gcc (without the -c option) to also link the C library with the rest of your C code.
3. You must use -std=c99, -Wall, and -pedantic in your CFLAGS and resolve all errors/warnings.
4. Make sure to use the ‘M’ and ‘F’ instruction set extensions; i.e., -march=rv32imf.
5. The ABI (Application Binary Interface) should be set to ‘ilp32’; i.e., -mabi=ilp32.
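As a rough sketch only (the cross-compiler prefix riscv64-unknown-elf- and the directory names are assumptions that depend on the installed toolchain and your project layout; recall that recipe lines must start with a tab character):

CC     = riscv64-unknown-elf-gcc
AR     = riscv64-unknown-elf-ar
CFLAGS = -std=c99 -Wall -pedantic -march=rv32imf -mabi=ilp32

# Compile a C file into an object file in the build directory.
build/%.o: %.c
	$(CC) $(CFLAGS) -c $< -o $@

# Bundle objects into a static library with ar.
lib/libmatrix.a: build/mat-util.o build/mat-mul.o
	$(AR) rcs $@ $^

You will likely also need an -I flag in CFLAGS so that the angle-bracket includes resolve, plus the remaining rules for linking with gcc and creating the symbolic link.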
8.4 Design [8 points]
At this point you should be able to test your C implementation using the executable mmul. You will need to modify your run BASH script to allow passing command-line arguments to the program (hint: you can use shift twice and pass $@ in the script to QEMU). If the vector_product function is implemented correctly, you should be able to run your program and give the size of the matrix as a command-line argument. For example, a sample output from a test to multiply two 1024x1024 matrices could look like the following:
$ ./run --qemu mmul 1024
Allocating matrices
Generating randomized matrix data
Multiplying matrices
Finished multiplication in 30 seconds
Verifying product matrix
Product matrix verified successfully
In a design document, outline the design and implementation strategy for an optimized version of the vector_product procedure in RISC-V assembly. Think about the loop structure and memory access patterns in the procedure vector_product and document your ideas for producing an implementation in assembly that could be faster than the C implementation. You might also want to use the instruction fmadd.s to speed up your implementation even more. To see the assembly code gcc produces, try compiling the file vector.c using the -S option to get the produced vector.s. This could give you an idea of what can be done to improve the code.
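For reference, fmadd.s rd, rs1, rs2, rs3 computes rd = (rs1 * rs2) + rs3 in single precision, so it can fuse the multiply and the accumulate of the inner loop into one instruction. A purely illustrative fragment (the register choices and addressing are assumptions, not part of the starter code):

flw     fa1, 0(t0)              # load the current element of A
flw     fa2, 0(t1)              # load the current element of B
fmadd.s fa0, fa1, fa2, fa0      # fa0 += (A element) * (B element)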
8.5 Assembly implementation of vector product [18 points]
In a file named fast-vector.s, write the assembly function vector_product that takes the same 5 arguments as the C function and returns the "dot product" outlined previously.
Your code must use the “standard” conventions covered in class for passing parameters, returning results, and using the stack.
Update your Makefile to also create the following:
i) The object file fast-vector.o (in the appropriate build directory) from its corresponding .s file using the assembler as,
ii) the static library libfastvector.a (in the appropriate lib directory) containing the object fast-vector.o,
iii) the executable object mmul-fast (in the appropriate bin directory) by linking the object main.o, the matrix library, and the fastvector library, and
iv) a symbolic link to the executable object mmul-fast at the root of the project.
A sample output from a test to multiply two 1024x1024 matrices could look like the following:
$ ./run --qemu mmul-fast 1024
Allocating matrices
Generating randomized matrix data
Multiplying matrices
Finished multiplication in 20 seconds
Verifying product matrix
Product matrix verified successfully
Try different matrix sizes and document your observations in your design document.
Files to hand in
Your Makefile should build your code as shown previously. All source files must be on Git. The 5 points for git-log.txt are to evaluate your use of version control. Failure to use Git at all (i.e., a missing git-log.txt) will result in a grade of 0 on the assignment.
The following files (absolutely no sub-directories; submissions with sub-directories will receive a zero grade) are to be packaged into a tarball abc123-a4.tar (replacing 'abc123' with your actual NSID) and submitted via Canvas:
- git-log.txt [5 points]
- written.txt (or .pdf) [42 points]
- q8.design-doc.txt [8 points]
- matrix.h [6 points]
- vector.h [3 points]
- Makefile [12 points]
- vector.c [6 points]
- fast-vector.s [18 points]
- main.c (unmodified)
- mat-mul.c (unmodified)
- mat-util.c (unmodified)
- run (the QEMU BASH script)