Date: 2019-03-21 08:33

p4-mapreduce

EECS 485 Project 4: Map Reduce

Due: 8pm on March 27th, 2019. This is a group project to be completed in groups of two to three.

Change Log

Introduction

In this project, you will implement a MapReduce server in Python. This will be a single machine,

multi-process, multi-threaded server that will execute user-submitted MapReduce jobs. It will run

each job to completion, handling failures along the way, and write the output of the job to a given

directory. Once you have completed this project, you will be able to run any MapReduce job on your

machine, using a MapReduce implementation you wrote!

There are two primary modules in this project. The Master listens for MapReduce jobs, manages the jobs, distributes work amongst the Workers, and handles faults. Worker modules register themselves with the Master and then await commands, performing map, reduce, or sorting (grouping) tasks based on instructions given by the Master.

You will not write MapReduce programs, but rather the MapReduce server. We have provided

several sample map/reduce programs that you can use to test your MapReduce server.

Refer to the P1 setup tutorial for setting up your development environment.

Refer to the Python processes, threads and sockets tutorial for background and examples.

Table of Contents

Project Structure

Starter files

Init script

Master Class

Worker Class

Libraries

MapReduce Server Specification

Worker Registration - [Master + Worker]

2019/3/17 EECS 485 Project 4: Map Reduce | p4-mapreduce

https://eecs485staff.github.io/p4-mapreduce/ 2/24

New Job Request - [Master]

Job Queue - [Master]

Input Partitioning - [Master]

Mapping - [Workers]

Grouping - [Master + Workers]

Reducing - [Workers]

Wrapping Up - [Master]

Shutdown - [Master + Worker]

Fault tolerance + Heartbeats - [Master + Worker]

Walk-through example

Test Case Descriptions

Testing

Submitting and grading

Further Reading

Project Structure

You will write a mapreduce Python package that includes master and worker modules. Launch a master

with the command line entry point mapreduce-master and a worker with mapreduce-worker . We’ve

also provided mapreduce-submit to send a new job to the master.

Starter files

We will start with a summary of the starter files.

mapreduce/master/ : Implement the MapReduce Master here.

mapreduce/worker/ : Implement the MapReduce Worker here.

mapreduce/utils.py : Common code shared by Master and Worker (optional).

tests/ : Public unit tests.

tests/input/ : Sample input files.

tests/exec/ : MapReduce programs. All use stdin and stdout.

Download the starter files and copy the contents into your project root directory.

$ pwd

/Users/awdeorio/src/eecs485/p4-mapreduce

$ wget https://eecs485staff.github.io/p4-mapreduce/starter_files.tar.gz

$ tar -xvzf starter_files.tar.gz

$ cp -R starter_files/* .

Create and activate a Python virtual environment. Install.

$ python3 -m venv env

$ source env/bin/activate

$ pip install -e .

Your code will go inside the mapreduce/master and mapreduce/worker packages, where you will

define the two classes (we got you started in mapreduce/master/__main__.py and

mapreduce/worker/__main__.py ). Since we are using Python packages, you may create new files as

you see fit inside each package. We have also provided a utils.py inside mapreduce/ which you

can use to house code common to both Worker and Master. We will only define the communication

specs for the Master and the Worker, but the actual implementation of the classes is entirely up to

you.

The starter code will run out of the box, but it won't do anything yet. At this point, you will be able to pass test_master_0 and test_worker_0 . The Master and the Worker run as separate processes, so you have to start them separately. The commands below start a master that listens on port 6000 using TCP, then start two workers, telling each one to communicate with the master on port 6000 and which port to listen on itself (6001 and 6002). The ampersand ( & ) means to start the process in the background.

$ mapreduce-master 6000 &

$ mapreduce-worker 6000 6001 &

$ mapreduce-worker 6000 6002 &

See your processes running in the background. Note: use pgrep -lf on OSX and pgrep -af on GNU/Linux systems.

$ pgrep -lf mapreduce-worker
15364 /usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/Resources/Pyth
15365 /usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/Resources/Pyth
$ pgrep -lf mapreduce-master
15353 /usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/Resources/Pyth

Stop your processes.

$ pkill -f mapreduce-master
$ pkill -f mapreduce-worker
$ pgrep -lf mapreduce-worker # no output, because no processes
$ pgrep -lf mapreduce-master # no output, because no processes

Lastly, we have also provided mapreduce/submit.py . It sends a job to the Master’s main TCP socket.

You can specify the job using command line arguments.

$ mapreduce-submit --help

Usage: mapreduce-submit [OPTIONS]

Top level command line interface.

Options:

-p, --port INTEGER Master port number, default = 6000

-i, --input DIRECTORY Input directory, default=./tests/input

-o, --output DIRECTORY Output directory, default=./output

-m, --mapper FILE Mapper executable, default=./tests/exec/wc_map.sh

-r, --reducer FILE Reducer executable,

default=./tests/exec/wc_reduce.sh

--nmappers INTEGER Number of mappers, default=4

--nreducers INTEGER Number of reducers, default=1

--help Show this message and exit.

Here’s how to run a job. Later, we’ll simplify starting the server using a bash script. Right now we

expect the job to fail because Master and Worker are not implemented.

$ pgrep -f mapreduce-master # check if you already started it

$ pgrep -f mapreduce-worker # check if you already started it

$ mapreduce-master 6000 &

$ mapreduce-worker 6000 6001 &

$ mapreduce-worker 6000 6002 &

$ mapreduce-submit --mapper tests/exec/wc_map.sh --reducer tests/exec/wc_reduce.sh

Init script

The MapReduce server is an example of a service (or daemon), a program that runs in the

background. We’ll write an init script to start, stop and check on the map reduce master and worker

processes. It should be a bash script named bin/mapreduce . Print the messages in the following

examples.

Be sure to follow the bash script best practices at https://eecs485staff.github.io/p1-insta485-static/#utility-scripts, e.g., starting with set -Eeuo pipefail .

Start server

Exit 1 if a master or worker is already running. Otherwise, execute the following commands.

mapreduce-master 6000 &

sleep 2

mapreduce-worker 6000 6001 &

mapreduce-worker 6000 6002 &

Example

$ ./bin/mapreduce start

starting mapreduce ...

Example: accidentally start server when it’s already running.

$ ./bin/mapreduce start

Error: mapreduce-master is already running

Stop server

Execute the following commands. Notice that || true will prevent a failed “nice” shutdown

message from causing the script to exit early. Also notice that we automatically figure out the

correct option for Netcat ( nc ).

if nc -V 2>&1 | grep -q GNU; then
  NC="nc -c" # Linux (GNU)
else
  NC="nc" # macOS (BSD)
fi

echo '{"message_type": "shutdown"}' | $NC localhost 6000 || true

sleep 2 # give the master time to receive signal and send to workers

Then, check if a master or worker is still running and kill the process. The following example is for

the master.

if pgrep -lf mapreduce-master; then
  echo "killing mapreduce master ..."
  pkill -f mapreduce-master
fi

Example 1, server responds to shutdown message.

$ ./bin/mapreduce stop

stopping mapreduce ...

Example 2, server doesn’t respond to shutdown message and process is killed.

$ ./bin/mapreduce stop

stopping mapreduce ...

killing mapreduce master ...

killing mapreduce worker ...

Server status

Example

$ ./bin/mapreduce start

starting mapreduce ...

$ ./bin/mapreduce status

master running

workers running

$ ./bin/mapreduce stop

stopping mapreduce ...

$ ./bin/mapreduce status

master not running

workers not running

Restart server

Example: restart server

$ ./bin/mapreduce restart

stopping mapreduce ...

starting mapreduce ...

NOTE: On the autograder, the “test_scripts” will run with your Master and Worker.

Master Class

The Master should accept only one command line argument.

port_number : The primary TCP port that the Master should listen on.

On startup, the Master should do the following:

Create a new folder tmp . This is where we will store all intermediate files used by the

MapReduce server.

If tmp already exists, keep it

Delete any old mapreduce job folders in tmp . HINT: see the Python glob module and use

"tmp/job-*" .

Create a new thread, which will listen for UDP heartbeat messages from the workers. This

should listen on ( port_number - 1 )

Create any additional threads or setup you think you may need. Another thread for fault

tolerance could be helpful.

Create a new TCP socket on the given port_number and call the listen() function.

Wait for incoming messages!
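The first startup steps above (create tmp, keep it if it exists, delete old job folders) can be sketched as follows. This is only an illustration: the function name prepare_tmp_dir and its tmp_dir parameter are assumptions, not part of the starter code.

```python
import glob
import os
import shutil


def prepare_tmp_dir(tmp_dir="tmp"):
    """Create tmp_dir if needed and delete stale job folders inside it."""
    # exist_ok=True keeps an existing tmp folder instead of raising an error.
    os.makedirs(tmp_dir, exist_ok=True)
    # Only folders matching job-* are removed; anything else is left alone.
    for job_dir in glob.glob(os.path.join(tmp_dir, "job-*")):
        if os.path.isdir(job_dir):
            shutil.rmtree(job_dir)
```

Calling prepare_tmp_dir() once at the top of the Master constructor, before any sockets are opened, keeps a restart from picking up files left by an earlier run.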

Worker Class

The Worker should accept two command line arguments.

master_port : The TCP socket that the Master is actively listening on (same as the port_number in

the Master constructor)

worker_port : The TCP socket that this worker should listen on to receive instructions from the

master

On initialization, each Worker should do a similar sequence of actions as the Master:

Get the process ID of the Worker. This will be the Worker’s unique ID, which it should then use

to register with the master.

Create a new TCP socket on the given worker_port and call the listen() function.

Send the register message to the Master

Upon receiving the register_ack message, create a new thread which will be responsible for

sending heartbeat messages to the Master.

NOTE: The Master should safely ignore any heartbeat messages from a Worker before that Worker

successfully registers with the Master.

Libraries

These are some of the libraries that we used in our implementation. We strongly recommend you

use these - they will save you an incredible amount of time, and code!

Python Multithreading

Python Sockets

Python SH Module

Python JSON Library

We have provided examples of threads and sockets in our Python processes, threads and sockets

tutorial.

You will use UDP for heartbeat messages and TCP for all other communications. In Python, you can

specify the maximum queue size to a socket so that messages aren’t ignored if you’re busy (look at

the argument for the listen() function when you get to it). We highly recommend reading

through Python TCP and UDP communication documentation.

Logging

The Python logging facility is helpful for monitoring multiple processes. For logging output similar

to the instructor solution in the walk-through example, try this.

import json
import logging
import os

logging.basicConfig(level=logging.DEBUG) # Don't hide debug messages

logging.getLogger('sh').setLevel(logging.ERROR) # Hide sh module debug messages

...

logging.info("Starting worker:%s", worker_port)

logging.info("Worker:%s PWD %s", worker_port, os.getcwd())

...

logging.debug(

"Worker:%s received\n%s",

worker_port,

json.dumps(message_dict, indent=2),

)

MapReduce Server Specification

Here, we describe the functionality of the MapReduce server. The fun part is that we are only

defining the functionality and the communication spec, the implementation is entirely up to you.

You must follow our exact specifications below, and the Master and the Worker should work

independently (i.e. do not add any more data or dependencies between the two classes). Remember

that the Master/Workers are listening on TCP/UDP sockets for all incoming messages. Note: To test

your server, we will test your worker with our Master and your Master with our Worker. You should

not rely on any communication other than the messages listed below.

As soon as the Master/Worker receives a message on its main TCP socket, it should handle that

message to completion before continuing to listen on the TCP socket. In this spec, let's say every message is handled in a function called handle_msg . When handle_msg returns, the Master continues listening for new messages in an infinite while loop. Each TCP

message should be communicated using a new TCP connection. Note: All communication in this

project will be strings formatted using JSON; sockets receive strings but your thread must parse it

into JSON.
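A minimal sketch of the receive loop described above, assuming one JSON message per TCP connection and a sender that closes the connection when it is done. The names recv_json, serve and should_stop are hypothetical; handle_msg is the handler named in this spec.

```python
import json
import socket


def recv_json(conn):
    """Read from a connected socket until the peer closes, then parse JSON."""
    chunks = []
    while True:
        data = conn.recv(4096)
        if not data:  # empty bytes means the sender closed the connection
            break
        chunks.append(data)
    return json.loads(b"".join(chunks).decode("utf-8"))


def serve(sock, handle_msg, should_stop):
    """Accept one connection per message until should_stop() returns True."""
    sock.settimeout(1)  # wake up once a second to re-check should_stop
    while not should_stop():
        try:
            conn, _ = sock.accept()
        except socket.timeout:
            continue
        with conn:
            message = recv_json(conn)
        handle_msg(message)  # handled to completion before the next accept
```

The timeout on accept() lets the loop notice a shutdown request without spinning, since the thread blocks inside accept() for most of each second.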

We put [Master/Worker] before the subsections below to identify which class should handle the

given functionality.

Worker Registration - [Master + Worker]

The Master should keep track of all Workers at any given time so that the work is only distributed

among the ready Workers. Workers can be in the following states:

ready : Worker is ready to accept work

busy : Worker is performing a job

dead : Worker has failed to ping for some amount of time

The Master must listen for registration messages from Workers. Once a Worker is ready to listen for

instructions, it should send a message like this to the Master

{

"message_type" : "register",

"worker_host" : string,

"worker_port" : int,

"worker_pid" : int

}

The Master will then respond with a message acknowledging the Worker has registered, formatted

like this. After this message has been received, the Worker should start sending heartbeats. More on

this later.

{

"message_type": "register_ack",

"worker_host": string,

"worker_port": int,

"worker_pid" : int

}

After the first Worker registers with the Master, the Master should check the job queue (described later) for any work it can assign to the Worker, because a job could have arrived at the Master before any Workers registered. If the Master is already executing a map/group/reduce, it can wait until the next phase, or until the entire current task completes, before assigning the Worker any tasks.

At this point, you should be able to pass test_master_1 and test_worker_1 .
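The registration exchange can be sketched with two small helpers. The message keys come from this spec; the function names and the layout of the workers dictionary are assumptions made for illustration.

```python
import os


def make_register(worker_host, worker_port):
    """Build the register message a Worker sends once it is listening."""
    return {
        "message_type": "register",
        "worker_host": worker_host,
        "worker_port": worker_port,
        "worker_pid": os.getpid(),
    }


def handle_register(workers, message):
    """Record the Worker as ready and build the register_ack reply."""
    # Hypothetical bookkeeping: one entry per Worker, keyed by its pid.
    workers[message["worker_pid"]] = {
        "host": message["worker_host"],
        "port": message["worker_port"],
        "state": "ready",
    }
    return {
        "message_type": "register_ack",
        "worker_host": message["worker_host"],
        "worker_port": message["worker_port"],
        "worker_pid": message["worker_pid"],
    }
```

The Master would send the returned dictionary, serialized with json.dumps, to the Worker's own TCP port over a new connection.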

New Job Request - [Master]

In the event of a new job, the Master will receive the following message on its main TCP socket:

{

"message_type": "new_master_job",

"input_directory": string,

"output_directory": string,

"mapper_executable": string,

"reducer_executable": string,

"num_mappers" : int,

"num_reducers" : int

}

In response to a job request, the Master will create a set of new directories where all of the

temporary files for the job will go, of the form tmp/job-{id} , where id is the current job counter

(starting at 0 just like all counters). The directory structure will resemble this example (you should

create 4 new folders for each job):

tmp

job-0/

mapper-output/

grouper-output/

reducer-output/

job-1/

mapper-output/

grouper-output/

reducer-output/

Remember, each MapReduce job occurs in 3 phases: mapping, grouping, reducing. Workers will do

the mapping and reducing using the given executable files independently, but the Master and

Workers will have to cooperate to do the grouping phase. After the directories are setup, the Master

should check if there are any Workers ready to work and check whether the MapReduce server is

currently executing a job. If the server is busy, or there are no available Workers, the job should be

added to an internal queue (described next) and end the function execution. If there are workers

and the server is not busy, then the Master can begin job execution.

At this point, you should be able to pass test_master_2 .
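Creating the per-job directory tree might look like the sketch below. The folder names are the ones listed in this spec; the function name create_job_dirs and its parameters are assumptions.

```python
from pathlib import Path


def create_job_dirs(tmp_dir, job_id):
    """Create tmp/job-{id}/ with its three output folders; return the job dir."""
    job_dir = Path(tmp_dir) / f"job-{job_id}"
    for name in ("mapper-output", "grouper-output", "reducer-output"):
        # parents=True also creates job-{id} itself on the first iteration.
        (job_dir / name).mkdir(parents=True, exist_ok=True)
    return job_dir
```

For job counter 0 and tmp_dir "tmp", this produces tmp/job-0 plus the mapper-output, grouper-output and reducer-output folders, four folders in total.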

Job Queue - [Master]

If a Master receives a new job while it is already executing one or when there were no ready

workers, it should accept the job, create the directories, and store the job in an internal queue until

the current one has finished. Note that this means that the current job’s map, group, and reduce

tasks must be complete before the next job’s map phase can begin. As soon as a job finishes, the

Master should process the next pending job if there is one (and if there are ready Workers) by

starting its Map stage. For simplicity, in this project, your MapReduce server will only execute one

MapReduce task at any time.

As noted earlier, when you see the first Worker register to work, you should check the job queue for

pending jobs.
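One possible shape for the internal queue is sketched here, assuming jobs are started strictly in arrival order. The class and method names are hypothetical; the spec only fixes the behavior, not the data structure.

```python
from collections import deque


class JobQueue:
    """Tracks the running job and the jobs waiting behind it."""

    def __init__(self):
        self.pending = deque()
        self.current = None

    def submit(self, job, workers_ready):
        """Queue the job, then return it if it can start right away."""
        self.pending.append(job)
        return self.start_next(workers_ready)

    def start_next(self, workers_ready):
        """Return the next job to run, or None if the server must wait."""
        if self.current is None and workers_ready and self.pending:
            self.current = self.pending.popleft()
            return self.current
        return None

    def finish(self, workers_ready):
        """Mark the running job as done and try to start the next one."""
        self.current = None
        return self.start_next(workers_ready)
```

The Master could call start_next when the first Worker registers and finish when the last reduce task of a job reports back.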

Input Partitioning - [Master]

To start off the Map Stage, the Master should scan the input directory and partition the input files in

‘X’ parts (where ‘X’ is the number of map tasks specified in the incoming job). After partitioning the

input, the Master needs to let each Worker know what work it is responsible for. Each Worker could

get zero, one, or many such tasks. The Master will send a JSON message of the following form to

each Worker (on each Worker’s specific TCP socket), letting them know that they have work to do:

{
  "message_type": "new_worker_job",
  "input_files": [list of strings],
  "executable": string,
  "output_directory": string,
  "worker_pid": int
}

Consider the case where there are 2 Workers available, 5 input files and 4 map tasks specified. The

master should create 4 tasks, 3 with one input file each and 1 with 2 input files. It would then

attempt to balance these tasks among all the workers. In this case, it would send 2 map tasks to

each worker. The master does not need to wait for a done message before it assigns more tasks to a

Worker - a Worker should be able to handle multiple tasks at the same time.
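The 5 files, 4 tasks, 2 Workers example above can be reproduced with a round-robin split. Round-robin is only one way to balance the work and is an assumption here, as are the function names.

```python
def partition_files(input_files, num_tasks):
    """Split input files into num_tasks lists, dealing them out in turn."""
    tasks = [[] for _ in range(num_tasks)]
    for index, name in enumerate(sorted(input_files)):
        tasks[index % num_tasks].append(name)
    return tasks


def assign_tasks(tasks, worker_pids):
    """Deal tasks out to Workers in turn; returns pid -> list of tasks."""
    assignment = {pid: [] for pid in worker_pids}
    for index, task in enumerate(tasks):
        assignment[worker_pids[index % len(worker_pids)]].append(task)
    return assignment
```

With five files and four tasks, the first task receives two files and the other three receive one each; with two Workers, each Worker then gets two tasks.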

Mapping - [Workers]

When a worker receives this new job message, its handle_msg will start execution of the given

executable over the specified input file, while directing the output to the given output_directory

(one output file per input file and you should run the executable on each input file). The input is

passed to the executable through standard in and is outputted to a specific file. The output file

names should be the same as the input file (overwrite file if it already exists). The output_directory

in the Map stage will always be the mapper-output folder (i.e. tmp/job-{id}/mapper-output/ ).

For example, if the Master specifies the input file data/input/file_001.txt and the output directory tmp/job-0/mapper-output/ , the output file is tmp/job-0/mapper-output/file_001.txt

Hint: See the command line package sh listed in the Libraries section. See sh.Command(...) , and

the _in and _out arguments in order to funnel the input and output easily.
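The sketch below shows the same idea using the standard library subprocess module instead of the sh package recommended above, so it runs without extra dependencies. The function name run_executable is an assumption; the behavior (one output file per input file, same file name, input fed through stdin) follows this section.

```python
import os
import subprocess


def run_executable(executable, input_files, output_directory):
    """Run executable once per input file; return the output file paths."""
    os.makedirs(output_directory, exist_ok=True)
    output_files = []
    for input_path in input_files:
        # Output keeps the input's file name and overwrites any old copy.
        output_path = os.path.join(output_directory, os.path.basename(input_path))
        with open(input_path, "rb") as infile, open(output_path, "wb") as outfile:
            # stdin comes from the input file, stdout goes to the output file.
            subprocess.run([executable], stdin=infile, stdout=outfile, check=True)
        output_files.append(output_path)
    return output_files
```

Because the function never looks at what the executable does, the same code serves both the map and the reduce stage.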

The Worker should be agnostic to map or reduce jobs. Regardless of the type of operation, the

Worker is responsible for running the specified executable over the input files one by one, and

piping to the output directory for each input file. Once a Worker has finished its job, it should send a

TCP message to the Master’s main socket of the form:

{
  "message_type": "status",
  "output_files" : [list of strings],
  "status": "finished",
  "worker_pid": int
}

At this point, you should be able to pass test_worker_3 , test_worker_4 , test_worker_5 .

Grouping - [Master + Workers]

Once all of the mappers have finished, the Master will start the “grouping” phase. This should begin

right after the LAST Worker finishes the Map stage (i.e. you will get a finished message from the last

Worker and the handle_msg handling that message will continue this grouping stage).

To start the group stage, the Master looks at all of the files created by the mappers, and assigns

Workers to sort and merge the files. Sorting in the group stage should happen by line not by key. If

there are more files than Workers, the Master should attempt to balance the files evenly among

them. If there are fewer files than Workers, it is okay if some Workers sit idle during this stage. Each

Worker will be responsible for merging some number of files into one larger file. The Master will

then take these files, merge them into one larger file, and then partition that file into the correct

number of files for the reducers. The messages sent to the Workers should look like this:

{

"message_type": "new_sort_job",

"input_files": [list of strings],

"output_file": string,

"worker_pid": int

}
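One way a Worker might handle the new_sort_job message above is to read every input file, sort all lines together, and write a single output file. This is a sketch that holds everything in memory; the function name sort_files is an assumption.

```python
def sort_files(input_files, output_file):
    """Merge input_files into output_file with all lines sorted."""
    lines = []
    for path in input_files:
        with open(path) as infile:
            for line in infile:
                # Make sure a final line without a newline still sorts and
                # prints as its own line.
                lines.append(line if line.endswith("\n") else line + "\n")
    lines.sort()  # sorts by whole line, not by key
    with open(output_file, "w") as outfile:
        outfile.writelines(lines)
```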

Once the Worker has finished, it should send back a message formatted as follows:

{
  "message_type": "status",
  "output_file" : string,
  "status": "finished",
  "worker_pid": int
}

The names of the intermediate files produced - the merged files each Worker creates and the single

large file the Master creates - are up to you. However, once the Master has split up the single input

file into the files used for reducing, they must be named reducex , where x is the reduce task

number. If there are 4 reduce jobs specified, the master should create reduce1, reduce2, reduce3,

reduce4 in the grouper output directory.
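The Master's merge and split step could be sketched as below. Two things are assumptions that this spec does not state: the key is taken to be the text before the first tab on each line, and keys are dealt out to the reduce files in round-robin order so that all lines with one key land in the same file. The function name is also hypothetical. For brevity the sketch streams the merged lines straight into the reduce files rather than writing the single large file first.

```python
import heapq
import os
from contextlib import ExitStack


def merge_and_partition(sorted_files, grouper_dir, num_reducers):
    """Merge already-sorted files and split them into reduce1..reduceN."""
    paths = [os.path.join(grouper_dir, f"reduce{i + 1}") for i in range(num_reducers)]
    with ExitStack() as stack:
        inputs = [stack.enter_context(open(path)) for path in sorted_files]
        outputs = [stack.enter_context(open(path, "w")) for path in paths]
        previous_key = None
        index = -1
        # heapq.merge lazily merges inputs that are each sorted already.
        for line in heapq.merge(*inputs):
            key = line.rstrip("\n").split("\t", 1)[0]  # assumed key format
            if key != previous_key:
                # A new key moves on to the next reduce file.
                index = (index + 1) % num_reducers
                previous_key = key
            outputs[index].write(line)
    return paths
```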

Reducing - [Workers]

To the Worker, this is the same as the map stage - it doesn’t need to know if it is running a map or

reduce task. The Worker just runs the executable it is told to run - the Master is responsible for

making sure it tells the Worker to run the correct map or reduce executable. The output_directory

in the reduce stage will always be the reducer-output folder. Again, use the same output file name

as the input file.

Once a Worker has finished its job, it should send a TCP message to the Master’s main socket of the

form:

{
  "message_type": "status",
  "output_files" : [list of strings],
  "status": "finished",
  "worker_pid": int
}

Wrapping Up - [Master]

As soon as the master has received the last “finished” message for the reduce tasks for a given job,

the Master should move the output files from the reducer-output directory to the final output

directory specified by the original job creation message (The value specified by the

output_directory key). In the final output directory, the files should be renamed finaloutputx ,

where x is the final output file number. If there are 4 final output files, the master should rename

them finaloutput1, finaloutput2, finaloutput3, finaloutput4 . Create the output directory if it

doesn’t already exist. Check the job queue for the next available job, or go back to listening for jobs

if there isn’t one currently in the job queue.
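A sketch of the wrap-up step, assuming the final files are numbered in the sorted order of the reducer output file names (the spec does not say which file gets which number). The function name is hypothetical.

```python
import os
import shutil


def collect_final_output(reducer_output_dir, output_directory):
    """Move reducer output files to output_directory as finaloutput1..N."""
    os.makedirs(output_directory, exist_ok=True)
    final_paths = []
    names = sorted(os.listdir(reducer_output_dir))
    for number, name in enumerate(names, start=1):
        destination = os.path.join(output_directory, f"finaloutput{number}")
        shutil.move(os.path.join(reducer_output_dir, name), destination)
        final_paths.append(destination)
    return final_paths
```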

Shutdown - [Master + Worker]

The Master can also receive a special message to initiate server shutdown. The shutdown message

will be of the following form and will be received on the main TCP socket:

{

"message_type": "shutdown"

}

The Master should forward this message to all of the Workers that have registered with it. The

Workers, upon receiving the shutdown message, should terminate as soon as possible. If the Worker

is already in the middle of executing a task, it is okay for it to complete that task before being able

to handle the shutdown message as both these happen inside a single thread.

After forwarding the message to all Workers, the Master should terminate itself.

At this point, you should be able to pass test_shutdown

Fault tolerance + Heartbeats - [Master + Worker]

Workers can die at any time and may not finish jobs that you send them. Your Master must accommodate this. If a Worker misses more than 5 pings in a row, you should assume that it has died and assign whatever work it was responsible for to another Worker.

Each Worker will have a heartbeat thread to send updates to Master via UDP. The messages should

look like this, and should be sent every 2 seconds:

{

"message_type": "heartbeat",

"worker_pid": int

}
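A heartbeat thread along these lines would send the message above over UDP every 2 seconds. The function names and the stop_event parameter are assumptions; the heartbeat port being the Master's port minus one comes from the Master Class section.

```python
import json
import os
import socket
import threading


def send_heartbeats(master_host, heartbeat_port, stop_event, interval=2.0):
    """Send a heartbeat datagram every interval seconds until stopped."""
    message = json.dumps({"message_type": "heartbeat", "worker_pid": os.getpid()})
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        while not stop_event.is_set():
            sock.sendto(message.encode("utf-8"), (master_host, heartbeat_port))
            # wait() sleeps without spinning and returns early on shutdown.
            stop_event.wait(interval)


def start_heartbeat_thread(master_host, master_port, stop_event):
    """Start the heartbeat thread; the Master listens on master_port - 1."""
    thread = threading.Thread(
        target=send_heartbeats,
        args=(master_host, master_port - 1, stop_event),
    )
    thread.start()
    return thread
```

Setting stop_event when the shutdown message arrives lets the thread exit promptly instead of waiting out the rest of its interval.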

If a Worker dies before completing all the tasks assigned to it, then all of those tasks (completed or

not) should be redistributed to live Workers. At each point of the execution (mapping, grouping,

reducing) the Master should attempt to evenly distribute work among available Workers. If a Worker

dies while it is executing a task, the Master will have to assign that task to another Worker. You

should mark the failed Worker as dead, but do not remove it from the Master’s internal data

structures. Removing a key from a Python dictionary while iterating over it raises a RuntimeError ("dictionary changed size during iteration"), so changing the Worker's state is safer than deleting its entry.

Your Master should attempt to maximize concurrency, but avoid duplication - that is, don’t send the

same job to different Workers until you know that the Worker who was previously assigned that task

has died.
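A fault tolerance thread could run a check like the one below on a timer. It assumes the Master records the time of each Worker's last heartbeat and treats a gap longer than 5 heartbeat intervals as "more than 5 missed pings"; that threshold, the dictionary layout and the function name are all assumptions.

```python
HEARTBEAT_INTERVAL = 2  # seconds between heartbeats, from this spec
MAX_MISSED = 5  # pings a Worker may miss before it counts as dead


def mark_dead_workers(workers, now):
    """Mark silent Workers as dead and return their tasks for reassignment.

    workers maps pid -> {"state": str, "last_heartbeat": float, "tasks": list}.
    """
    orphaned = []
    for worker in workers.values():
        if worker["state"] == "dead":
            continue
        if now - worker["last_heartbeat"] > MAX_MISSED * HEARTBEAT_INTERVAL:
            # Change the state only; the entry stays in the dictionary.
            worker["state"] = "dead"
            orphaned.extend(worker["tasks"])
            worker["tasks"] = []
    return orphaned
```

Passing the current time in as an argument, for example from time.time(), keeps the function easy to test without waiting for real timeouts.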

At this point, you should be able to pass test_master_3 , test_master_4 , test_worker_2 ,

test_integration_1 , test_integration_2 , and test_integration_3 .

Walk-through example

See a complete example here.

Testing

To aid in writing test cases we have included an IntegrationManager class which is similar to the

manager the autograder will use to test your submissions. You can find this in the starter file

tests/integration_manager.py .

In addition, we have provided a simple word count map and reduce example. You can use these

executables, as well as the sample data provided, and compare your server’s output with the result

obtained by running:

$ cat tests/input/* | ./tests/exec/wc_map.sh | sort | \
    ./tests/exec/wc_reduce.sh > correct.txt

This will generate a file called correct.txt with the final answers, and they must match your

server’s output, as follows:

$ cat tmp/job-{id}/reducer-output/* | sort > output.txt

$ diff output.txt correct.txt

Note that these executables can be in any language - your server should not limit us to running

map and reduce jobs written in python3! To help you test this, we have also provided you with a

word count solution written in bash (see section below).

Note that the autograder will swap out your Master for our Master in order to test the Worker (and

vice versa). Your code should have no other dependency besides the communication spec, and the

messages sent in your system must match those listed in this spec exactly.

Run the public unit tests.

$ pwd

/Users/awdeorio/src/eecs485/p4-mapreduce

$ pytest -sv

Note that the -s flag has been added to the pytest command in order to also show any messages

printed to stdout (such as the logging messages), to help with debugging.

Test for busy waiting

A solution that busy-waits may pass on your development machine and fail on the autograder due

to a timeout. Your laptop is probably much more powerful than the restricted autograder

environment, so you might not notice the performance problem locally. See the Processes, Threads

and Sockets in Python Tutorial for an explanation of busy-waiting.

To detect busy waiting, time a master without any workers. After a few seconds, kill it by pressing

Control - C several times. Ignore any errors or exceptions. We can tell that this solution busy-waits

because the user time is similar to the real time.

$ time mapreduce-master 6000

INFO:root:Starting master:6000

...

real 0m4.475s

user 0m4.429s

sys 0m0.039s

This example does not busy wait. Notice that the user time is small compared to the real time.

$ time mapreduce-master 6000

INFO:root:Starting master:6000

...

real 0m3.530s

user 0m0.275s

sys 0m0.036s
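The same comparison of CPU time against real time can be made inside Python. The sketch below is illustrative only and all names in it are hypothetical: it measures a wait that blocks on threading.Event next to one that spins in a loop.

```python
import threading
import time


def cpu_and_wall(func):
    """Run func and return (CPU seconds, wall-clock seconds) it took."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func()
    return time.process_time() - cpu_start, time.perf_counter() - wall_start


def blocking_wait(seconds):
    """Wait without using the CPU: the thread sleeps inside Event.wait."""
    threading.Event().wait(seconds)


def busy_wait(seconds):
    """Wait by spinning: the loop burns CPU for the whole duration."""
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        pass
```

Timing blocking_wait should show CPU time far below wall-clock time, while busy_wait should show the two close together, matching the two time outputs above.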

Testing Fault-Tolerance

Testing for fault tolerance is a major and tricky part of this project. This section will provide some

basic guidelines on how you can verify if your system handles fault-tolerance in the desired manner.

To simulate a dead Worker, a Worker must die while it is performing a task. In other words, the Master must notice that a Worker with a still incomplete task has missed more than 5 consecutive heartbeat messages, then declare the Worker dead and reassign the task. We have given you slow-running map and reduce executables in tests/exec/wc_map_slow.sh and tests/exec/wc_reduce_slow.sh . These scripts make use of sleep statements. You can choose a sleep time that gives you enough time to kill a Worker while it is executing a task.

The idea is to start your server and send a slow job to the Master. Once the task has been assigned

to a Worker, since there are sleep statements in the map/reduce scripts, you should have enough

time to manually kill a Worker and then see if the Master can still make forward progress (handle

the dead Worker and still produce the correct output).

For example, imagine a scenario where there are 2 Workers, each executing one slow map task. The second Worker then dies during execution (because you manually killed the process, which the sleep times in the map code make possible). In this scenario, how many mapping

tasks should the first Worker receive? How many mapping tasks should the second Worker have

received? How many sorting and reducing tasks should the first and the second Worker receive? If

your code gives expected answers to these questions, then you are in good shape.

Code Style

As in previous projects, all Python code should contain no errors or warnings from pycodestyle ,

pydocstyle , and pylint .

You may not use any external dependencies aside from what is provided in setup.py .

Test Case Descriptions

Many of the autograder test cases in this project are visible on the autograder, but the source code

is not published. We can’t publish the source code because many unit tests combine instructor code

(e.g., master) with your code, (e.g., worker). This section provides a description of each test case

lacking published source code.

test_master_1 :

1. Starts student master and one instructor worker

2. Verifies master received worker registration

3. Verifies master can send worker registration acknowledgement

test_master_2 :

1. Starts student master and one instructor worker

2. Submits a word count job to the master

input_directory: tests/input/

mapper_executable: tests/exec/wc_map.sh

reducer_executable: tests/exec/wc_reduce.sh

num_mappers: 2

num_reducers: 1

3. Verifies master created the correct job directory structure

test_master_3 :

1. Starts student master and one instructor worker

2. Submits a word count job to the master

input_directory: tests/input/

mapper_executable: tests/exec/wc_map.sh

reducer_executable: tests/exec/wc_reduce.sh

num_mappers: 2

num_reducers: 1

3. Verifies master created the correct job directory structure

4. Verifies master sent sort job to worker

5. Verifies output is correct

test_master_4 :

1. Starts student master and one instructor worker

2. Submits a word count job to the master

input_directory: tests/input/

mapper_executable: tests/exec/wc_map.sh

reducer_executable: tests/exec/wc_reduce.sh

num_mappers: 2

num_reducers: 1

3. Verifies master created the correct job directory structure

4. Verifies master sent a map job to the worker

5. Verifies master sent a sort job to the worker

6. Verifies master sent a reduce job to the worker

7. Verifies final output is correct

test_worker_1 :

1. Starts instructor master and one student worker

2. Verifies student worker process is running

3. Verifies instructor master received worker register message after 2 seconds

test_worker_2 :

1. Starts instructor master and one student worker

2. Verifies student worker process is running

3. Verifies instructor master received register message after 2 seconds

4. Verifies instructor master received heartbeat messages from worker

test_worker_3 :

1. Starts instructor master and one student worker

2. Verifies student worker process is running

3. Verifies instructor master received register message after 2 seconds

4. Submits a word count map job to worker

executable: tests/exec/wc_map.sh

input_files: input/file01

output_directory: tmp/test_worker_3/output/

5. Verifies instructor master received “finished” message from worker

test_worker_4 :

1. Starts instructor master and one student worker

2. Verifies student worker process is running

3. Verifies instructor master received register message after 2 seconds

4. Submits a word count map job to worker

executable: tests/exec/wc_map.sh

input_files: input/file02

output_directory: tmp/test_worker_4/output/

5. Verifies instructor master received “finished” message from worker

6. Diff check worker-generated output file for correctness

test_worker_5 :

1. Starts instructor master and one student worker

2. Verifies student worker process is running

3. Verifies instructor master received register message after 2 seconds

4. Submits a mapping word count job to worker

executable: tests/exec/wc_map.sh

input_files: input/file01, input/file02

output_directory: tmp/test_worker_5/output/

5. Verifies instructor master received “finished” message from worker

6. Diff check worker-generated output file for correctness

Submitting and grading

One team member should register your group on the autograder using the create new invitation

feature.

Submit a tarball to the autograder, which is linked from https://eecs485.org. Include the --disable-copyfile flag only on macOS.

$ tar \

--disable-copyfile \

--exclude '*__pycache__*' \

--exclude '*tmp*' \

-czvf submit.tar.gz \

setup.py \

bin \

mapreduce

Rubric

This is an approximate rubric.

Deliverable                                 Value
Public unit tests                           30%
Protected unit tests                        40%
Hidden unit tests run after the deadline    30%

Protected unit tests are visible on the autograder before the project deadline, but the source code is

not published. See test case descriptions above.

Further Reading

Google’s original MapReduce paper
