15-418/618, Spring 2019
Assignment 1
Exploring Multi-Core, Instruction-Level, and SIMD Parallelism
Event Registered students Waitlist students
Assigned: Mon., Jan. 14 Mon., Jan. 14
Due: Wed., Jan. 30, 11:59 pm Wed., Jan. 23, 11:59 pm
Last day to handin: Sat., Feb. 2 Wed., Jan. 23
Overview
In this assignment you will modify and experiment with code designed to exploit the three main forms of
parallelism available on modern processors: the multiple cores that can execute programs independently, the
multiple functional units that can operate in parallel, and the SIMD vector units that allow each processor to
perform some of its arithmetic and memory operations on vectors of data. (You will see some of the effects
of Intel hyperthreading, as well, where each core can execute the instruction streams of two programs
simultaneously.)
You will also gain experience measuring and reasoning about the performance of parallel programs, a challenging,
but important, skill you will use throughout this class. This assignment involves only a small
amount of programming, but a lot of analysis!
This is an individual project. All handins are electronic. Your submission will consist of the code files you
have modified, as well as a single document reporting your findings on the 5 problems described below. You
may use any document preparation system you choose, but the final result must be stored as a single file in
PDF format, named report.pdf. Make sure the report includes your name and Andrew Id. More details
on how to submit this information is provided at the end of this document.
Before you begin, please take the time to review the course policy on academic integrity at:
http://www.cs.cmu.edu/?418/academicintegrity.html
Getting started
You will need to run code on the machines in the Gates cluster for this assignment. Host names for these
machines are ghcX.ghc.andrew.cmu.edu, where X is between 26 and 46. Each of these machines
1
has an eight-core, 3.2 GHz Intel Core i7 processor (although dynamic frequency scaling can take them
to 3.8 GHz when the chip decides it is useful and possible to do so). Each core can execute AVX2 vector
instructions, supporting simultaneous execution of the same operation on multiple data values (8 in
the case of single-precision data). For the curious, a complete specification for this CPU can be found at
ark.intel.com/products/92985/Intel-Xeon-Processor-E5-1660-v4-20M-Cache-3 20-GHz.
You can log into these machines in the cluster, or you can reach them via ssh.
We will grade your analysis of code run on the Gates machines; however for your own interest, you may
also want to run these programs on other machines. To do this, you will first need to install the Intel
SPMD Program Compiler (ISPC) available at: ispc.github.com/. Feel free to include your findings
from running code on other machines in your report as well, just be very clear in your report to describe the
machine(s) you used.
ISPC is needed to compile many of the programs used in this assignment. ISPC is currently installed on the
Gates machines in the directory /usr/local/depot/ispc-v1.9.1-linux/. You will need to add
this directory to your system path.
We will distribute the assignment starter code via a repository hosted on Github. Clone the assignment 1
starter code using:
git clone https://github.com/cmu15418/asst1-s19.git
1 Problem 1: Parallel Fractal Generation Using Pthreads (15 points)
Build and run the code in the prob1_mandelbrot_threads directory of the Assignment 1 code base.
This program produces the image file mandelbrot-vV -serial.ppm, where V is the view index. This
image is a visualization of a famous set of complex numbers called the Mandelbrot set. As you can see
in the images below, the result is a familiar and beautiful fractal. Each pixel in the image corresponds to
a value in the complex plane, and the brightness of each pixel is proportional to the computational cost
of determining whether the value is contained in the Mandelbrot set—white pixels required the maximum
(256) number of iterations, dark ones only a few iterations, and colored pixels were somewhere in between.
(See function mandel() defined in mandelbrot.cpp.) You can learn more about the definition of the
Mandelbrot set at en.wikipedia.org/wiki/Mandelbrot set.
Use the command option “--view V ” for V between 0 and 6 to get the different images. You can click
the links below to see the different images on a browser. Take the time to do this—the images are quite
striking. (View 0 is not shown—it is all white.)
2
View 1 View 2 View 3
1× magnification 66× magnification 50× magnification
View 4 View 5 View 6
50× magnification 50× magnification 500× magnification
Your job is to parallelize the computation of the images using Pthreads. The command-line option “--threads T”
specifies that the computation is to be partitioned over T threads. In function mandelbrotThread(),
located in mandelbrot.cpp, the main application thread creates T1 additional thread using pthread_create().
It waits for these threads to complete using pthread_join(). Currently, neither the launched threads
nor the main thread do any computation, and so the program generates an error message. You should add
code to the workerThreadStart() function to accomplish this task. You will not need to use of any
other Pthread API calls in this assignment.
What you need to do:
1. Modify the code in mandelbrot.cpp to parallelize the Mandelbrot generation using two cores.
Specifically, compute the top half of the image in thread 0, and the bottom half of the image in thread
1. This type of problem decomposition is referred to as spatial decomposition since different spatial
regions of the image are computed by different processors.
2. Extend your code to utilize T threads for T ∈ {2, 4, 8, 16}, partitioning the image generation work
into the appropriate number of horizontal blocks. You will need to modify the code in function
workerThreadStart, to partition the work over the threads.
Note that the processor only has 8 cores, but each core supports two hyper-threads. Also, the active
images have 749 rows (with another row added to detect array overrun), and so you must handle the
case where the number of rows is not evenly divisible by the number of threads. In your write-up,
produce a graph of speedup compared to the reference sequential implementation as a function of the
number of cores used for views 0, 1, and 2. Is speedup linear in the number of cores used? In your
writeup hypothesize why this is (or is not) the case?
3
3. To confirm (or disprove) your hypothesis, measure the amount of time each thread requires to complete
its work by inserting timing code at the beginning and end of workerThreadStart(). How do your
measurements explain the speedup graph you previously created?
4. Modify the mapping of work to threads to improve speedup to at least 8× on view 0 and almost
8× on views 1 and 2 (if you’re close to 8× that’s fine, don’t sweat it). You may not use any synchronization
between threads. We expect you to come up with a single work decomposition policy
that will work well for all thread counts; hard coding a solution specific to each configuration is not
allowed! (Hint: There is a very simple static assignment that will achieve this goal, and no communication/synchronization
among threads is necessary.) In your writeup, describe your approach and
report the final 16-thread speedup obtained. Also comment on the difference in scaling behavior from
4 to 8 threads versus 8 to 16 threads.
As a bonus (to you, not extra credit), once you have your threaded code running properly, you can run
our interactive visualization tool of the Mandelbrot set. You will find this quite addictive. The program
is implemented as the file mviz.py in the main assignment directory.1
Invoke it using, as command-line
arguments, the command line you give to run the mandelbrot program. For example, you might give the
command
linux> ../mviz.py ./mandelbrot -t 16
When the program starts, it will display the list of single-character keyboard commands you can use to zoom
and pan around the set. You will notice the fractal self similarity property, where common patterns keep
occurring as you zoom deeper. You will also find that speedup you get from threading can greatly improve
the interactive response.
What you need to turn in:
1. Your report should contain the graphs, analyses, and answers specified above.
2. Your report should describe the decomposition strategy you used to maximize speedup.
3. The archive file you submit will contain your version of the file mandelbrot.cpp. This file should
contain the best performing code you created. Any modifications you made should follow good
coding conventions, in terms of indenting, variable names, and comments.
2 Problem 2: Vectorizing Code Using SIMD Intrinsics (20 points)
Take a look at the function clampedExpSerial() in prob2_vecintrin/functions.cpp of the
Assignment 1 code base. The function raises values[i] to the integer power given by exponents[i]
for all elements of the input array and clamps the resulting values at 4.18. The function computes x
p based
on the technique known as exponentiation by squaring. Whereas the usual technique of multiplying together
p copies of x requires p ? 1 multiplications, iterative squaring requires at most 2 log2 p multiplications. For
1The visualization program requires the Image and ImageTk packages from the Python PIL library. Unfortunately, these
have not yet been installed on the GHC machines. Hopefully, that will be fixed soon.
4
p = 1000, exponentiation by squaring requires less than 20 multiplications rather than 999. In Problem 2,
your job is to vectorize this piece of code so that it can be run on a machine with SIMD vector instructions.
We won’t ask you to craft an implementation using the SSE or AVX vector instrinsics that map to real
vector instructions on moderns CPUs. Instead, to make things a little easier, we’re asking you to implement
your version using 15-418’s “fake vector instrinics” defined in CMU418intrin.h. The CMU418intrin
library provides you with a set of vector instructions that operate on vector values and/or vector masks.
(These functions don’t translate to real vector instructions; instead we simulate these operations for you
in our library, and provide feedback that makes for easier debugging.) As an example of using the 15-
418 intrinsics, a vectorized version of the abs() function is given in functions.cpp. This example
contains some basic vector loads and stores and manipulates mask registers. Note that the abs() example
is only a simple example, and in fact the code does not correctly handle all inputs. (We will let you figure
out why.) You may wish to read through the comments and function definitions in CMU418intrin.h to
know what operations are available to you.
Here are a few hints to help you in your implementation:
1. Every vector instruction is subject to an optional mask parameter. The mask parameter defines which
lanes have their output “masked” for this operation. A 0 in the mask indicates a lane is masked, and
so its value will not be overwritten by the results of the vector operation. If no mask is specified in
the operation, no lanes are masked. (This is equivalent to providing a mask of all ones.) Your solution
will need to use multiple mask registers and various mask operations provided in the library.
2. You will find the _cmu418_cntbits() function to be helpful in this problem.
3. You must handle the case where the total number of loop iterations is not a multiple of SIMD vector
width. We suggest you test your code with ./vrun -s 3. You might find _cmu418_init_ones()
helpful.
4. Use command-line option -l to print a log of executed vector instruction at the end. Insert calls to
function addUserLog() in your vector code to add customized debugging information to the log.
The output of the program will tell you if your implementation generates correct output. If there are incorrect
results, the program will print the first one it finds and print out a table of function inputs and outputs.2
Your function’s output is after “output = ”, which should match with the results after “gold = ”.
The program also prints out a list of statistics describing utilization of the 15418 fake vector units. You
should consider the performance of your implementation to be the value “Total Vector Instructions”. “Vector
Utilization” shows the percentage of vector lanes that are enabled.
What you need to do:
1. Implement a vectorized version of clampedExpSerial() as the function clampedExpVector()
in file functions.cpp. Your implementation should work with any combination of input array
size N and vector width W.
2. Run ./vrun -s 10000 and sweep the vector width over the values {2, 4, 8, 16, 32}. Record the
resulting vector utilization. You can do this by changing the defined value of VECTOR_WIDTH in
2
If it finds a mismatch in row 749, that indicates that your program overran the image array.
5
CMU418intrin.h and recompiling the code each time. How much does the vector utilization
change as W changes? Explain the reason for these changes and the degree of sensitivity the utilization
has on the vector width. Explain how the total number of vector instructions varies with
W.
3. Extra credit: (1 point) Implement a vectorized version of arraySumSerial() as the function
arraySumVector() in file functions.cpp. Your implementation may assume that W is a
factor of the input array size N. Whereas the serial implementation has O(N) span, your implementation
should have at most O(N/W + log2 W) span. You may find the hadd and interleave
operations useful.
What you need to turn in:
1. Your report should contains tables giving the vector utilizations and total vector instructions for the
different values of W.
2. Your report should contain the analyses and answers to the questions listed above.
3. If you did the extra credit problem, state in the report whether or not your code passed the correctness
test.
4. The archive file you submit will contain your version of the file functions.cpp. This file should
contain the best performing code you created. Any modifications you made should follow good
coding conventions, in terms of indenting, variable names, and comments.
3 Problem 3: Using Instruction-level Parallelism in Fractal Generation (20
points)
Now that you have seen how SIMD execution can operate on multiple data values simultaneously, we will
explore how these principles can speed up a conventional C++ program. In this case, we will exploit the
parallel execution capability provided in the multiple, pipelined functional units of an out-of-order processor.
This exercise will give us the chance to analyze how well a program makes use of these units and how this
limits program performance.
You will want to refer to material in Computer Systems: A Programmer’s Perspective, third edition (CS:APP3e)
that you perhaps did not study carefully in whatever version of 15-213 you took. In particular, you will find
Sections 3.11 (floating-point code) and Sections 5.7–5.10 (out-of-order execution and instruction-level parallelism)
to be helpful. Copies of the book are on reserve in the Sorrell’s Library. Don’t try to get by with
an earlier edition of the book.
All of the code for this problem is provided in the directory prob3 mandelbrot ilp. Your job will be
to measure performance and understand it. Compile and run the code on view 0. This is a useful benchmark,
since all of the iterations hit the maximum limit of 256, yielding a minimum of overhead. You will find that
the reference implementation, as given by the function mandel ref requires around 10.7 clock cycles per
iteration.
6
# This is the inner loop of mandel_ref
# Parameters are passed to the function as follows:
# %xmm0: c_re
# %xmm1: c_im
# %edi: count
# Before entering the loop, the function sets registers
# to initialize local variables:
# %xmm2: z_re = c_re
# %xmm3: z_im = c_im
# %eax: i = 0
.L123:
vmulss %xmm2, %xmm2, %xmm4
vmulss %xmm3, %xmm3, %xmm5
vaddss %xmm5, %xmm4, %xmm6
vucomiss .LC0(%rip), %xmm6 # Compare to 4.f
ja .L126
vaddss %xmm2, %xmm2, %xmm2
addl $1, %eax
cmpl %edi, %eax # Set condition codes for jne below
vmulss %xmm3, %xmm2, %xmm3
vsubss %xmm5, %xmm4, %xmm2
vaddss %xmm3, %xmm1, %xmm3
vaddss %xmm2, %xmm0, %xmm2
jne .L123
Figure 1: Assembly code for inner loop of function mandel ref
7
Figure 1 shows the assembly code generated for the inner loop of mandel ref, with some comments
added to help you understand how the integer and floating-point registers are used. Add more annotation
to this code to describe how it relates to the original C++ code. (A copy of the code is provided in the file
mandel ref loop.s.) Explain how the expression 2.f * z re * z im is implemented?
Create a data-flow diagram indicating the data dependencies formed by successive iterations of the loop,
similar to Figure 5.14(b) of CS:APP3e. You need not show the integer operations, since these do not affect
the program performance.
The Gates-cluster machines are based on what Intel labels its “Broadwell” microarchitecture. This is very
similar to Haswell microarchitecture described in Section 5.7.2 of CS:APP3e, but with slightly more and
faster functional units. You can find out more about the microarchitecture in
www.agner.org/optimize/microarchitecture.pdf
Pages 140–141 of this document describe the functional units. In particular, the processor has two units
for floating-point arithmetic. These are fully pipelined, able to start new operations on every clock cycle.
For floating point, unit 0 can perform both multiplication and addition, while unit 1 can only perform
multiplication. Both operations have a latency of 3 clock cycles. A separate functional unit can perform the
floating-point comparison required by the vucomiss instruction.
Based on these latencies and the data-flow diagram you have created, determine the latency bound for
function mandel ref. How close does the measured performance come to reaching this bound?
It is difficult to see how to speed up the individual iterations through increased instruction-level parallelism.
However, it is possible to increase the level of activity in the loop by processing multiple candidate points.
This will enable increasing the throughput of the processing to yield better performance than is dictated by
the latency bound.
The file mandelbrot.cpp contains a macro MANDEL BODY and its instantiations to generate the functions
mandel parU for U ranging from 1 to 5. The idea is to process U points in parallel, using a style
of programming similar to SIMD execution. We represent the multiple data with arrays of size U, and
write conventional code that uses looping to process all U elements. As with SIMD, the loop exits only
when all points have completed their iterations, using a bit vector to determine which elements should be
updated. The compiler automatically unrolls all of these loops, and so the generated code is equivalent to
what would be obtained by explicitly writing the computation for all U elements. (You can see the generated
code in the file mandelbrot.s, generated when you compiled the code for this problem.) The function
mandelbrotParallel exploits this capability by computing U adjacent rows of the array in parallel.
Running the program on view 0, you will see that it reaches a peak performance of around 6.1 clock cycles
per iteration for U = 3. (That is, each execution of the loop requires around 18.3 cycles, but it performance
an iteration for three points.) That’s an improvement over the original performance, but perhaps not quite
as significant as one may hope.
Considering the floating-point operations, what is the highest throughput bound imposed by the functional
units? You should consider the bounds imposed by multiplication, addition, and the two in combination.
How close does the measured performance come to reaching this bound?
If you could modify the generated assembly code, do you think you could lower the throughput bound?
Extra Credit: (1 point) Modify the C++ code for both the reference and the parallel versions to (slightly)
8
improve the performance of the parallel version, while still passing the functionality tests. You may alter the
computed function, but only in a way that demonstrates how a better compiler could generate faster code.
As a bonus, you can run the interactive visualization tool mviz.py of the ILP-enhanced version the Mandelbrot
computation with U = 3. Run with the command line:
linux> ../mviz.py ./mandelbrot
What you need to turn in:
1. An annotated version of the assembly code for the main loop of mandel ref.
2. Data-flow diagram showing the loop-carried dependencies among the floating-point values.
3. An analysis of the latency bound of mandel ref and how the measured performance compares.
4. An analysis of the throughput bound for the parallel versions, and how the measured performance
compares.
5. Ideas for how a compiler could generate code that runs faster on this particular microarchitecture.
6. (Extra credit.) A demonstration of how the compiler could generate code that runs faster.
4 Problem 4: Parallel Fractal Generation Using ISPC (20 points)
The code for this problem is in the subdirectory prob4_mandelbrot_ispc. Now that you’re comfortable
with SIMD execution, we’ll return to parallel Mandelbrot fractal generation. As in Problem 1, Problem
4 computes a Mandelbrot fractal image, but it achieves even greater speedups by utilizing the SIMD execution
units within each of the 8 cores. We will also see that the ILP-enhancement techniques used in Problem
3 can, in principle, be applied to ISPC code.
In Problem 1, you parallelized image generation by creating one thread for each processing core in the
system. Then, you assigned parts of the computation to each of these concurrently executing threads. Instead
of specifying a specific mapping of computations to concurrently executing threads, Problem 4 uses ISPC
language constructs to describe independent computations. These computations may be executed in parallel
without violating program correctness. In the case of the Mandelbrot image, computing the value of each
pixel is an independent computation. With this information, the ISPC compiler and runtime system take on
the responsibility of generating a program that utilizes the CPUs collection of parallel execution resources
as efficiently as possible.
4.1 Problem 4, Part 1. A Few ISPC Basics (5 of 20 points)
When reading ISPC code, you must keep in mind that, although the code appears much like C/C++ code,
the ISPC execution model differs from that of standard C/C++. In contrast to C, multiple program instances
of an ISPC program are always executed in parallel on the CPU’s SIMD execution units. The number of
program instances executed simultaneously is determined by the compiler (and chosen specifically for the
9
underlying machine). This number of concurrent instances is available to the ISPC programmer via the
built-in variable programCount. ISPC code can reference its own program instance identifier via the
built-in programIndex. Thus, a call from C code to an ISPC function can be thought of as spawning a
group of concurrent ISPC program instances (referred to in the ISPC documentation as a gang). The gang
of instances runs to completion, then control returns back to the calling C code.
As an example, the following program uses a combination of regular C code and ISPC code to add two
1024-element vectors. As discussed in class, since each instance in a gang is independent and performs the
exact same program logic, execution can be accelerated via SIMD instructions.
A simple ISPC program is given below. First, the C program, which calls the ISPC-generated code:
------------------------------------------------------------------------
C program code: myprogram.cpp
------------------------------------------------------------------------
const int TOTAL_VALUES = 1024;
float a[TOTAL_VALUES];
float b[TOTAL_VALUES];
float c[TOTAL_VALUES]
// Initialize arrays a and b here.
. . .
sum(TOTAL_VALUES, a, b, c);
// Upon return from sumArrays, result of a + b is stored in c.
The function sum() called by the C code is generated by compiling the following ISPC code:
------------------------------------------------------------------------
ISPC code: myprogram.ispc
------------------------------------------------------------------------
export sum(uniform int N, uniform float* a, uniform float* b, uniform float* c)
{
// Assumes programCount divides N evenly.
for (int i=0; i<N; i+=programCount)
{
c[programIndex + i] = a[programIndex + i] + b[programIndex + i];
}
}
The ISPC program code above interleaves the processing of array elements among program instances. Note
the similarity to Problem 1, where you statically assigned parts of the image to threads.
However, rather than thinking about how to divide work among program instances (that is, how work is
mapped to execution units), it is often more convenient, and more powerful, to instead focus only on the
partitioning of a problem into independent parts. ISPCs foreach construct provides a mechanism to
express problem decomposition. Below, the foreach loop in the ISPC function sum2() defines an
iteration space where all iterations are independent and therefore can be carried out in any order. ISPC
10
handles the assignment of loop iterations to concurrent program instances. The difference between sum()
and sum2() below is subtle, but very important. sum() is imperative: it describes how to map work to
concurrent instances. The sum2() function below is declarative: it specifies only the set of work to be
performed.
-------------------------------------------------------------------------
ISPC code:
-------------------------------------------------------------------------
export sum2(uniform int N, uniform float* a, uniform float* b, uniform float* c)
{
foreach (i = 0 ... N)
{
c[i] = a[i] + b[i];
}
}
Before proceeding, you are encouraged to familiarize yourself with ISPC language constructs by reading
through the ISPC walkthrough available at http://ispc.github.com/example.html. The example
program in the walkthrough is almost exactly the same as Problem 4’s implementation of mandelbrot_ispc()
in mandelbrot.ispc. In the assignment code, we have changed the bounds of the foreach loop to
yield a more straightforward implementation.
What you need to do:
1. Compile and run the program mandelbrot.ispc. The ISPC compiler is configured to emit 8-
wide AVX vector instructions. What is the maximum speedup you expect given what you know
about these CPUs? Why might the number you observe be less than this ideal? Hint: Consider the
characteristics of the computation you are performing. What parts of the image present challenges for
SIMD execution? Comparing the performance of rendering the different views of the Mandelbrot set
may help confirm your hypothesis.
We remind you that for the code described in this subsection, the ISPC compiler maps gangs of
program instances to SIMD instructions executed on a single core. This parallelization scheme differs
from that of Problem 1, where speedup was achieved by running threads on multiple cores.
4.2 Problem 4, Part 2: Combining instruction-level and SIMD parallelism (7 of 20 points)
The floating-point functional units in the Gates cluster processors are fully vectorized. They can perform
8-wide additions and multiplications on single-precision data. Thus, the same techniques we used to exploit
instruction-level parallelism can be combined with the SIMD execution targeted by ISPC.
The function mandel par2 provides two-way parallelism of the Mandelbrot computation. (Unfortunately,
the ISPC compiler cannot handle the small loops and arrays we used in Problem 3, and so this code must
explicitly duplicate each line of code. It must also pass the point data as separate scalar values, rather than
as arrays.) Your job is to modify the function mandelbrot ispc par2 to make use of this parallel form.
As we did in Problem 3, your code should process two rows per pass. You should make use of the ISPC
11
foreach construct, but modifying it to only do half as many passes. You also should handle the case
where the height of the array is not a multiple of two.
How much speedup does this two-way parallelism give over the regular ISPC version? Is it worth the
effort?
4.3 Problem 4, Part 3: ISPC Tasks (8 of 20 points)
ISPC’s SPMD execution model and the foreach mechanism facilitate the creation of programs that utilize
SIMD processing. The language also provides the launch mechanism to utilize multiple cores in an ISPC
computation, via a lightweight form of threading known as tasks.
See the launch command in the function mandelbrot_ispc_withtasks() in the file mandelbrot.ispc.
This command launches multiple tasks (2 in the starter code). Each task defines a computation that will be
executed by a gang of ISPC program instances. As given by the function mandelbrot_ispc_task(),
each task computes a region of the final image. Similar to how the foreach construct defines loop iterations
that can be carried out in any order (and in parallel by ISPC program instances), the tasks created by
this launch operation can be processed in any order (and in parallel on different CPU cores).
What you need to do:
1. Run mandelbrot_ispc with the command-line option “--tasks.” What speedup do you observe
on view 1? What is the speedup over the version of mandelbrot_ispc that does not partition
that computation into tasks?
2. There is a simple way to improve the performance of mandelbrot_ispc --tasks by changing
the number of tasks the code creates. By only changing code in the function mandelbrot_ispc_withtasks(),
you should be able to achieve performance that exceeds the sequential version of the code by about
20–22 times! How did you determine how many tasks to create? Why does the number you chose
work best?
Note: Your code must correctly handle the case where the number of rows in the image is not divisible
by the number of tasks.
3. Extra Credit: (1 point) What are differences between the Pthread abstraction (used in Problem 1) and
the ISPC task abstraction? There are some obvious differences in semantics between the (create/join
and (launch/sync) mechanisms, but the implications of these differences are more subtle. Here’s a
thought experiment to guide your answer: what happens when you launch 10,000 ISPC tasks? What
happens when you launch 10,000 pthreads?
The smart-thinking student’s question: Hey wait! Why are there two different mechanisms (foreach and
launch) for expressing independent, parallelizable work to the ISPC system? Couldn’t the system just
partition the many iterations of foreach across all cores and also emit the appropriate SIMD code for the
cores? Answer: Great question! And there are a lot of possible answers. We’ll talk more in lecture.
As a bonus, you can run the interactive visualization tool mviz.py of the ISPC versions of the Mandelbrot
computation. Run with either the command line:
12
linux> ../mviz.py ./mandelbrot
or
linux> ../mviz.py ./mandelbrot -t
You will see how the speedup improves interactivity.
What you need to turn in:
1. Your report must contain answers to the questions listed above.
2. Your report should describe performance gains you get from SIMD, SIMD+ILP, and threaded parallelism.
3. Your answer to the extra credit question, if any.
4. The archive file you submit will contain your version of the file mandelbrot.ispc. This file
should contain the best performing code you created. Any modifications you made should follow
good coding conventions, in terms of indenting, variable names, and comments.
5 Problem 5: Iterative Square Root (10 points)
The code for this problem is in the subdirectory prob5_sqrt. Problem 5 concerns the program sqrt,
generated from an ISPC program that computes the square root of 20 million random numbers between 0
and 3. For value s, it uses the iterative Newton’s method (named after Isaac Newton) to solve the equation
1/x2 = s. This gives a value x ≈
p
1/s. Multiplying x by s gives an approximation to √
s.
The value 1.0 is used as the initial guess in this implementation. The graph below shows the number of
iterations required for the program to converge to an accurate solution for values in the open interval (0, 3).
(The implementation does not converge for inputs outside this range). Notice how the rate of convergence
depends on how close the solution is to the initial guess.
0
5
10
15
20
25
30
35
40
0.0 0.5 1.0 1.5 2.0 2.5 3.0
Itera&ons
InputValue
Itera&onstoConvergencebySquareRootFunc&on
13
What you need to do:
1. Build and run sqrt. Report the ISPC implementation speedup for a single CPU core (no tasks) and
when using all cores (with tasks). What is the speedup due to SIMD parallelization? What is the
speedup due to multi-core parallelization?
2. Modify the function initGood() in the file data.cpp to generate data that will yield a very high
relative speedup of the ISPC implementations. (You may generate any valid data.) Describe why
these input data will maximize speedup over the sequential version and report the resulting speedup
achieved (for both the with- and without-tasks ISPC implementations). You can test this version with
the command-line argument “--data g.” Does your modification improve SIMD speedup? Does it
improve multi-core speedup? Please explain why.
3. Modify the function initBad() in the file data.cpp to generate data that will yield a very low
(less than 1.0) relative speedup of the ISPC implementations. (You may generate any valid data.) Describe
why these input data will cause the SIMD code to have very poor speedup over the sequential
version and report the resulting speedup achieved (for both the with- and without-tasks ISPC implementations).
You can test this version with the command-line argument “--data b.” Does your
modification improve multi-core speedup? Please explain why.
Notes and comments: When running your “very-good-case input”, take a look at what the benefit of multicore
execution is. You might be surprised at how high it is. This is a hyper-threading effect.
What you need to handin:
1. Provide explanations and answers in your report.
2. The archive file you submit will contain your version of the file data.cpp.
Hand-in Instructions
As mentioned, you should have your report in a file report.pdf in the home directory for the assignment.
(Make sure the report includes your name and Andrew Id.) In this directory, execute the command
make handin.tar
This will create an archive file with your report, and some of the files you modified for the different problems.
Completing the handin then depends on your status in the class:
If you are a registered student, you should have a course Autolab account. Go to Autolab at:
https://autolab.andrew.cmu.edu/courses/15418-s19
and submit the file handin.tar.
14
If you are on the waitlist, you will not have an Autolab account. Instead, you should run the provided
program submit.py as follows:
./submit.py -u AndrewID
where AndrewID is your Andrew ID.
In either case, you can submit multiple times, but only your final submission will be graded, and the time
stamp of that submission will be used in determining any late penalties.
版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。