
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING

UNIVERSITY OF BRITISH COLUMBIA

CPEN 211 Introduction to Microcomputers, Fall 2018

Lab 11: Caches, Performance Counters, and Floating-Point

The handin deadline is 9:59 PM the evening before your lab section during the week of Nov 26 to Nov 30.

1 Introduction

The ARM processor in your DE1-SoC has the Cortex-A9 microarchitecture. The specifications for the

Cortex-A9 include an 8-stage pipeline, an L1 instruction cache, a separate L1 data cache, and a unified L2

cache. In this lab we explore factors that impact program performance with a focus on the L1 data cache.

1.1 Caches inside the DE1-SoC

Both L1 caches hold 32KB and are 4-way set associative with 32-byte blocks and pseudo-random replacement.

Using the initialization code provided for this lab, addresses between 0x00000000 and 0x3FFFFFFF

are configured to be cached. Addresses larger than 0x3FFFFFFF are configured to “bypass the cache”,

meaning accesses to these addresses are not cached in L1 or L2 caches.

Why bypass a cache? One reason we may wish accesses to certain addresses not to be cached is that these

addresses correspond to registers in I/O peripherals. For example, consider what would happen if, when software on the DE1-SoC reads SW_BASE at address 0xFF200040, the value read from the control register were allowed to be cached in the L1 data cache: the first LDR instruction to read from address 0xFF200040 would cause a cache block to be allocated in the L1 data cache. This cache block would contain the value of the switches at the time this first LDR instruction was executed. Now, if the settings of the switches change after the first LDR executes but that cache block remains in the cache, subsequent LDR instructions reading from address 0xFF200040 will read the old or “stale” value for the switch settings held in the cache. Thus, it will seem to the software as if the switches have not changed even though they have.

Without an understanding of caches such behavior would be very surprising and hard to explain.

In addition, the initialization code provided for this lab configures the L1 data cache so that store instructions

(e.g., STR and FSTD) are handled as “write back with write allocate”. By write allocate we mean

that if the cache block accessed by the store was not in the L1 data cache then it will be brought into the

cache, possibly evicting another block. By write-back we mean that if a cache block in the L1 is written to by a store instruction, then only the copy of the block in the L1 is modified; lower levels of the memory hierarchy are updated only later, when the modified block is evicted.

1.2 Performance Counters

How can you increase the performance of a software program? One common approach is to “profile” the

program to identify which lines of code it spends the most time executing. Standard developer tools such as

Visual Studio include such basic profiling capabilities [1]. Using this form of profiling you can identify where

making “algorithmic” changes, such as using a hash table instead of a linked list, is worth the effort.

To obtain the highest performance it is also necessary to know about how a program interacts with the

microarchitecture of the CPU. One of the most important questions is “does the program incur many cache

misses?” The software that runs in datacenters, such as those operated by Google, Facebook, Amazon,

Microsoft and others, typically suffers many cache misses. Google reports that “Half of cycles [in their

datacenters] are spent stalled on caches” [2]. Most modern microprocessors include special counter registers

that can be configured to measure how often events such as cache misses occur. Special profiling tools such

as Intel’s VTune [3] can use these hardware performance counters. Hardware counters can also be used for

runtime optimization of software and to enable the operating system to select lower power modes.

[1] https://msdn.microsoft.com/en-CA/library/ms182372.aspx
[2] Kanev et al., Profiling a warehouse-scale computer, ACM/IEEE Int’l Symp. on Computer Architecture, 2015.
[3] https://software.intel.com/en-us/intel-vtune-amplifier-xe

CPEN 211 - Lab 11 1 of 10

The Cortex-A9 processor supports six hardware counters. Each counter can be configured to track one

of 58 different events. In this lab you will use these counters to measure clock cycles, load instructions

executed, and L1 data cache misses (caused either by loads or stores). You will use these three counters

to analyze the performance as you make changes to programs. These performance counters are a standard

feature of the ARMv7 architecture and are implemented as part of “coprocessor 15” (CP15). CP15 also

includes functionality for controlling the caches and virtual memory hardware. For this lab we provide

you ARM assembly code to enable the L1 data cache and L2 unified cache using CP15. (The L2 cache

is accessed when a load or store instruction does not find a matching cache block in the L1 data cache.)

Enabling the data caches on the Cortex-A9 also requires enabling virtual memory. So, the code we provide

for Lab 11 (pagetable.s) also does this for you using a one-level page table with 1MB pages (called

“sections” in ARM terminology). You do not need to know how virtual memory works to complete this lab.

However, for those who are interested, Bonus #1 and Bonus #2 ask you to modify pagetable.s.

In Part 1 of this lab you run an example assembly program that helps illustrate how to access the

performance counters. In Part 2, you write a matrix-multiply function using floating-point instructions and

study its cache behavior using the performance counters. In Part 3, you modify your matrix-multiply to

improve cache performance.

To enable the caches on your DE1-SoC, your assembly needs to call the function CONFIG_VIRTUAL_MEMORY defined inside pagetable.s. After virtual memory is enabled, the Altera Monitor Program will not correctly download changes to your code without first power cycling the DE1-SoC. To save time during debugging (e.g., in Parts 2 and 3), enable virtual memory only after you get your code working. Also, note that resetting the ARM processor through the Altera Monitor Program does not “flush” the contents of the caches. Thus, you will need to power cycle your DE1-SoC each time you want to make a new performance measurement.

The ARM coprocessor model was briefly described in Slide Set #13. The Cortex-A9 contains a Performance

Monitor Unit (PMU) inside of Coprocessor 15. While there are 58 different events that can be

tracked on the Cortex-A9, the PMU contains only six performance counters with which to track them. These

are called PMXEVCNTR0 through PMXEVCNTR5, which we will abbreviate to PMN0 through PMN5.

These counters are controlled through several additional registers inside the PMU.

The specific PMU registers you will need to use in this lab are listed in Table 1. Recall that the MCR,

or “move to coprocessor from an ARM register”, instruction moves a value to the coprocessor (i.e., PMU)

from an ARM register (R0-R14). The MRC, or “move to ARM register from a coprocessor” instruction

copies a value from a coprocessor (i.e., PMU) into an ARM register. Certain registers in the PMU are used

to configure the performance counter hardware before using the counters PMN0 through PMN5 to actually

count hardware events. The relationship between the different PMU registers, the hardware events and the

performance counter registers is partly illustrated in Figure 1. The operation of this hardware is described

below. You will measure the three events listed in Table 2. The other 55 possible events can be found in

ARM documents that are available on ARM’s website [4].

To use one of the performance counters you need to complete the following steps:

1. Select counter PMNx by putting the value x in a regular register (e.g., R0-R12) and then executing the ARM code shown in Table 1 for “Set PMSELR” (replacing <Rt> with the register you put the value x in, e.g., R0). This puts the value in <Rt> into the register labeled PMSELR in Figure 1, which controls the “demultiplexer” labeled 1.

2. Select the event that PMNx should count by putting the event number from the first column of Table 2

into a regular register (e.g., R0-R12) and then executing the ARM code in Table 1 for “Set PMXEVTYPER”

(replacing <Rt> with the register you put the value in). This causes the value of <Rt>

[4] ARM Architecture Reference Manual, ARMv7-A and ARMv7-R edition; Cortex-A9 Technical Reference Manual.


Figure 1: Hardware Organization of Cortex-A9 Performance Monitor Unit

to be placed into the corresponding register named PMXEVTYPER0 through PMXEVTYPER5 in

Figure 1. Which one of PMXEVTYPER0 through PMXEVTYPER5, is updated depends upon the

value in PMSELR set in the prior step.

3. Repeat steps 1 and 2 for up to six counters.

4. Enable each PMNx by setting bit x in a regular register (e.g., R0-R12) to 1 and then executing the

ARM code in Table 1 for “Set PMCNTENSET” (replace <Rt> with the register containing bits set to

1). This sets the register labeled PMCNTENSET in Figure 1.

5. Reset all counters and start those that are enabled by putting the value 3 into a regular register (e.g.,

R0-R12) and then executing the ARM code in Table 1 for “Set PMCR” (replacing <Rt> with the

register containing 3). In Figure 1, this step resets the counters PMN0 through PMN5 and allows them to begin counting the events passed through the multiplexers (labeled 5) that connect to the hardware event signals.

6. Run the code you wish to measure the performance of (e.g., matrix multiply). During this step the

counters PMN0 through PMN5 shown in Figure 1 will be incremented whenever a configured event

occurs.

7. Stop the performance counters by putting the value 0 into a regular register (e.g., R0-R12) and then executing the ARM code in Table 1 for “Set PMCR” (replacing <Rt> with the register containing 0).

8. For each counter you wish to read, follow steps 9 and 10 below.


Name / ARM Code (NOTE: replace <Rt>) / Function

Set PMSELR
  MCR p15, 0, <Rt>, c9, c12, 5
  Value in ARM register <Rt> specifies the performance counter (PMN0 through PMN5) that will either be configured using a PMXEVTYPER operation or read using a PMXEVCNTR operation.

Set PMXEVTYPER
  MCR p15, 0, <Rt>, c9, c13, 1
  The lower 8 bits of <Rt> configure which event increments the counter selected by PMSELR.

Set PMCNTENSET
  MCR p15, 0, <Rt>, c9, c12, 1
  A 1 in bit 0 through bit 5 of <Rt> enables performance counter 0 through 5, respectively.

Set PMCR
  MCR p15, 0, <Rt>, c9, c12, 0
  If bit 1 of <Rt> is 1, this instruction clears all six performance counters. If bit 0 of <Rt> is 1, it starts any performance counters enabled by PMCNTENSET. If bit 0 of <Rt> is 0, it stops all performance counters.

Read PMXEVCNTR
  MRC p15, 0, <Rt>, c9, c13, 2
  Copies the current value of the counter selected by PMSELR into <Rt>.

Table 1: ARM Cortex-A9 Performance Monitor Interface

Event number Event description

0x3 Level 1 data cache misses

0x6 Number of load instructions executed (counted if condition code passed)

0x11 CPU cycles

Table 2: Event Numbers

9. Select counter PMNx by putting the value x in a regular register (e.g., R0-R12) and then executing the

ARM code shown in Table 1 for “Set PMSELR” (replacing <Rt> with the register you put the value

x in).

10. Read PMNx by executing the ARM code shown in Table 1 for “Read PMXEVCNTR” after replacing

<Rt> with the register (e.g., R0-R12) you want to copy the performance counter value into. This

corresponds to reading the counters via the multiplexer 10 illustrated in Figure 1.

These steps are illustrated in the example in Figure 2, which is described in more detail in the following

section.

2 Lab Procedure

Follow the steps below.

2.1 Part 1 [4 marks]: Performance Measurement using Example Code

Run the ARM assembly code in Figure 2 on your DE1-SoC (this code must be run on real hardware).

Note you should not single step while collecting performance counters. Set a breakpoint on the line


.text
.global _start
_start:
    BL CONFIG_VIRTUAL_MEMORY
    // Step 1-3: configure PMN0 to count cycles
    MOV R0, #0                  // Write 0 into R0 then PMSELR
    MCR p15, 0, R0, c9, c12, 5  // Write 0 into PMSELR selects PMN0
    MOV R1, #0x11               // Event 0x11 is CPU cycles
    MCR p15, 0, R1, c9, c13, 1  // Write 0x11 into PMXEVTYPER (PMN0 measure CPU cycles)
    // Step 4: enable PMN0
    mov R0, #1                  // PMN0 is bit 0 of PMCNTENSET
    MCR p15, 0, R0, c9, c12, 1  // Setting bit 0 of PMCNTENSET enables PMN0
    // Step 5: clear all counters and start counters
    mov r0, #3                  // bits 0 (start counters) and 1 (reset counters)
    MCR p15, 0, r0, c9, c12, 0  // Setting PMCR to 3
    // Step 6: code we wish to profile using hardware counters
    mov r1, #0x00100000         // base of array
    mov r2, #0x100              // iterations of inner loop
    mov r3, #2                  // iterations of outer loop
    mov r4, #0                  // i=0 (outer loop counter)
L_outer_loop:
    mov r5, #0                  // j=0 (inner loop counter)
L_inner_loop:
    ldr r6, [r1, r5, LSL #2]    // read data from memory
    add r5, r5, #1              // j=j+1
    cmp r5, r2                  // compare j with 256
    blt L_inner_loop            // branch if less than
    add r4, r4, #1              // i=i+1
    cmp r4, r3                  // compare i with 2
    blt L_outer_loop            // branch if less than
    // Step 7: stop counters
    mov r0, #0
    MCR p15, 0, r0, c9, c12, 0  // Write 0 to PMCR to stop counters
    // Step 8-10: Select PMN0 and read out result into R3
    mov r0, #0                  // PMN0
    MCR p15, 0, R0, c9, c12, 5  // Write 0 to PMSELR
    MRC p15, 0, R3, c9, c13, 2  // Read PMXEVCNTR into R3
end: b end                      // wait here

Figure 2: Example 1 (NOTE: CONFIG_VIRTUAL_MEMORY is defined in pagetable.s)

“end: b end;” and run to it without single stepping. This code measures the number of cycles to execute

a nested loop that repeatedly iterates over elements of a one-dimensional array. You will notice that the

code in Figure 2 does not actually use the values loaded from memory by the line:

ldr r6, [r1, r5, LSL #2] // read data from memory

The reason is that in this example we are concerned only with how many cache hits or misses are generated by

a program that repeatedly reads values from an array.

Next, modify the code from Figure 2 to also measure L1 data cache misses and the number of load instructions executed. NOTE: Running the above code, the measured CPU cycles will decrease each time you run the program (e.g., if using Actions>Restart). This occurs because if you do NOT power cycle your DE1-SoC and download the program again, the cache blocks brought into the cache by one run of the code will remain valid in the cache, thus reducing subsequent cache misses.

#define N 128
double A[N][N], B[N][N], C[N][N];

void matrix_multiply(void)
{
    int i, j, k;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            double sum = 0.0;
            for (k = 0; k < N; k++) {
                sum = sum + A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}

Figure 3: Matrix Multiply C code

Measure all three performance counters and compute the three factors in the processor performance

equation discussed in Slide Set #14:

Execution Time = Instruction Count × CPI × Cycle Time (1)

CPI is the average cycles per instruction and can be obtained by dividing the cycle count by the instruction count. To obtain the cycle time you need to know the clock frequency, which is 800 MHz. Surprisingly, the ARM Cortex-A9 does not have a counter that measures all instructions executed (the ARMv7 documentation says this counter is mandatory; the Cortex-A9 documentation says it is not implemented!). So you will need to compute the instruction count by analyzing the program. Create a table (using your favorite document editor or spreadsheet program) to record the values measured by each of the performance counters. Note that hardware performance counters are usually not perfect and may slightly under- or over-count events versus what you expect.

Then, try increasing the value of the shift parameter "#2" in the following line to at least one other value

and repeat the measurements:

ldr r6, [r1, r5, LSL #2] // read data from memory

Your mark for Part 1 will be:

4/4 If you measure all three counters for two values of the left shift parameter, compute the three terms in

the processor performance equation (Equation 1) and can explain the results.

3/4 If you measure all three counters for at least two values of the left shift parameter, compute the three

terms in the processor performance equation, but have difficulty explaining the results to your TA.

2/4 If you measure all three counters for the default value of the left shift parameter.

1/4 If you measure at least two counter values.

2.2 Part 2 [4 marks]: Matrix Multiply

In this part you will write ARM assembly code equivalent to the C code shown in Figure 3.

This code multiplies matrix A by matrix B and puts the result in matrix C. Matrix multiplication is an important computational “kernel” in many applications today (e.g., machine learning algorithms such as deep belief networks used in speech recognition, self-driving cars, etc.). Note that the “+” and “*” operations in the above code should be double-precision floating-point (Slide Set 13). Two-dimensional C arrays are stored in memory in “row major” format: the elements in a row are placed adjacent in memory. For example, consider the array with 2 rows and 3 columns declared as follows:


address data

0x100 1.1

0x108 1.2

0x110 1.3

0x118 2.1

0x120 2.2

0x128 2.3

Figure 4: Layout of two dimensional array in memory

double my_array[2][3] = {{1.1, 1.2, 1.3}, {2.1, 2.2, 2.3}};

Drawn as a matrix, my_array looks like:

    1.1 1.2 1.3
    2.1 2.2 2.3

This means my_array[0][0] contains 1.1, my_array[0][1] contains 1.2, my_array[0][2] contains 1.3, my_array[1][0] contains 2.1, and so on. Assume the base address of “my_array” is 0x100. Then, the above six elements of “my_array” would be placed in memory as shown in Figure 4 (recall IEEE double-precision floating-point uses 64 bits, which is 8 bytes).

You can use the .double directive to initialize the contents of the array. For example, “my_array” can be given the initial contents shown above using ARM assembly as follows:

my_array: .double 1.1
          .double 1.2
          .double 1.3
          .double 2.1
          .double 2.2
          .double 2.3

To avoid conflicts with the memory used by CONFIG_VIRTUAL_MEMORY, place your arrays below address 0x01000000.

Use the above information about how arrays are placed in memory to help you compute the addresses to load from for “A[i][k]” and “B[k][j]” and the address to store to for “C[i][j]”. You need to use “N” in your address calculation. There is an example of ARM assembly code performing matrix multiply on pages 250-253 in Chapter 3 of COD4e (PDF on Canvas) with N=32. You can use this code as a starting point provided you add a citation to it in your .s file in a comment. Alternatively, you can write the code yourself. Either way, due to the limitations of the Altera Monitor Program noted below, you will need to hand-encode the floating-point instructions, and your assembly code should support arbitrary values of N.

Before enabling virtual memory and caches, run your code with a small value of N, and with A and B matrices with values of your choosing, to verify that the results of your matrix multiply code are correct. To check the results you will need to look at the values stored into memory for the output array using the memory tab in the Altera Monitor Program. NOTE: The Altera Monitor Program does not know how to display floating-point numbers. Instead, use the following URL to find the hexadecimal encoding

how to display floating-point numbers. Instead, use the following URL to find the hexadecimal encoding

for a double precision number: http://www.binaryconvert.com/result_double.html?decimal=049

Rerun the code with virtual memory and caches enabled, with N set to 128 and then N set to 16. Use the performance counters to help you compute the average CPI in both cases and be prepared to explain them. When using larger values of N (e.g., 16 and 128) to measure cache performance you do NOT need to explicitly initialize the input matrices unless you want to.


31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

cond 1 1 0 1 U 0 0 1 Rn Dd 1 0 1 1 Imm8

Figure 5: Floating-Point Double Precision Load (FLDD Dd, [Rn,#imm8]). If U (bit-23) is 1, then

imm8 is added to the contents of Rn to form the effective address. If U is 0 imm8 is subtracted from Rn.

See also documentation on LDC in COD4e Appendix B1 & B2 (copro=11).

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

cond 1 1 1 0 0 0 1 0 Dn Dd 1 0 1 1 0 0 0 0 Dm

Figure 6: FP Double Precision Multiply (FMULD Dd, Dn, Dm). See also CDP in COD4e Appendix B1

& B2 (op1=2, op2=0,coproc=11).

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

cond 1 1 1 0 0 0 1 1 Dn Dd 1 0 1 1 0 0 0 0 Dm

Figure 7: FP Double Precision Addition (FADDD Dd, Dn, Dm). See also CDP in COD4e Appendix B1

& B2 (op1=3, op2=0,coproc=11).

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

cond 1 1 0 1 U 0 0 1 Rn Dd 1 0 1 1 Imm8

Figure 8: Floating-Point Double Precision Store (FSTD Dd, [Rn,#imm8]). If U (bit-23) is 1, then

imm8 is added to the contents of Rn to form the effective address. If U is 0 imm8 is subtracted from Rn.

See also documentation on STC in COD4e Appendix B1 & B2 (copro=11).

A challenge you will encounter is that the Altera Monitor Program is not set up to support programs that use floating-point, even though the ARM Cortex-A9 on the DE1-SoC has very good support for floating-point. There are two issues: one is that the Altera Monitor Program does not show the contents of the floating-point registers (e.g., D0-D15). Another is that it will not compile floating-point assembly mnemonics such as “FMULD” (i.e., double-precision floating-point multiply). We will “work around” the lack of support for floating point in the Altera Monitor Program in the following way: you will manually assemble FLDD, FMULD, FADDD and FSTD instructions into 1’s and 0’s, then load them into memory at the appropriate location in your assembly program using the “.word” directive.

The encodings for these four instructions are summarized below in Figures 5-8. Recall that ARM

floating-point operations are implemented using the “coprocessor” model described in Slide Set 11 on Slide

13. The double precision floating-point coprocessor number is 11 (CP11).

For example, the instruction “FLDD D0, [R8]” can be specified using “.word 0xED980B00”.

Your mark for Part 2 will be:

4/4 If you write the code and you can show it correctly computes results for a small value of N that is not a power of 2, and you run it with N = 128 and N = 16, collect all three hardware counters, compute the average CPI for both cases, and can explain the results you get.

3/4 If you write the code and you can show it correctly computes results for a small value of N other than

32.

2/4 If you wrote code for this part and it runs without triggering an illegal instruction fault (jumping to address 0x00000004) on your hand-coded floating-point assembly, and it stores values for the output matrix in memory, but the result looks wrong.

1/4 If you did not do this part or your code will not compile, or it compiles but triggers an illegal instruction

fault and jumps to address 0x00000004 and/or it does not write the results to memory.

2.3 Part 3: Blocked Matrix Multiply

Look at the C code in Figure 5.21 in COD ARM edition (page 429). If you do not have the second textbook

it is available on short term loan in the library. This code performs a blocked matrix multiply which helps

improve performance by ensuring values are used multiple times after they are brought into the cache.

Implement this same strategy in assembly code and measure the difference in performance.

Your mark for Part 3 will be:

2/2 If you code up blocked matrix multiply in assembly and can show it computes the correct result for

small values of N and you use performance counters to verify it improves average CPI for N=128.

1/2 If you coded something up and it looks to the TA like it might have a chance of working but it does not

actually work.

2.4 Bonus #1 of 2 [4 marks]: Two-Level Page Table and TLB Performance Events

Both bonuses require knowledge of virtual memory, which you can gain by going through the last flipped lecture on Virtual Memory early. If you plan to attempt these bonus questions you will need to sign up on ARM’s website so you can download additional ARM documentation.

Modify the code in pagetable.s to create a working two-level page table implementation with 4KB pages.

You will likely need to consult the ARM Architecture Reference Manual ARMv7-A and ARMv7-R edition

available from the ARM website to complete this (you will need to register with ARM to access it). You

will also want to read ahead about virtual memory in the textbook (we will cover virtual memory in class

too, but not necessarily before your lab section). Once you think you have the two-level page table working

make sure you extend the testing approach in pagetable.s to verify that it does (you need to figure out how

to do this). Look up the event numbers for translation lookaside buffer (TLB) misses and measure them on the code from Part 2, but with N set to a large enough value to trigger TLB misses with 4KB pages. Your

mark for Bonus #1 will be:

4/4 If you complete all of the aspects described in the paragraph above.

3/4 If you don’t get the TLB performance counter part done but otherwise get everything done.

2/4 If you code up the two level page table and it runs but you don’t have any testing code or your test isn’t

convincing.

1/4 If you code up most of the changes needed for the two-level page table but it is not working.

2.5 Bonus #2 of 2 [4 marks]: Mini Operating System

Modify pagetable.s and combine it with the task switching code from Part 4 of Lab 10 to create a simple

operating system that provides virtual memory protection as well as preemptive multitasking for applications

that use floating-point. You may use code from another student’s Lab 10 Part 4 provided both you and they

have demoed and submitted Lab 10 using handin, they give you permission to do so, and you acknowledge

them in a CONTRIBUTIONS file that you submit with your code. Process 0 and Process 1 must each have

their own page table. For Process 0 and Process 1, virtual addresses between 0x00000000 and 0x0FFFFFFF should map to different physical locations. Other virtual addresses should be marked "invalid" in the page table. Be sure to consider the impact of the TLBs when virtual-to-physical mappings change. Not required (do this for fun): use the SWI instruction to enable your OS to expose I/O safely to software. Your mark for Bonus #2 will be:

4/4 If you complete the aspects described in the paragraph above and you can convince your TA your code works, or at your TA’s discretion otherwise.


3 Lab Submission

Submit all files by the deadline on Page 1 using handin. Use the same procedure outlined at the end of the

Lab 3 handout except that now you are submitting Lab 11, so use:

handin cpen211 Lab11-<section>

where <section> should be replaced by your lab section. Remember you can overwrite previous or trial

submissions using -o.

To ensure the demo proceeds quickly, your lab11.zip file should include all files including your assembly

source code AND your project files.

4 Lab Demonstration Procedure

As in prior labs we will be dividing each lab section into two one-hour sessions (details on cpen211.ece.ubc.ca). Your TA will have your submitted code with them and will have set up a “TA marking station” where you will go when it is your turn to be marked. Please bring your DE1-SoC.

