联系方式

  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-21:00
  • 微信:codinghelp

您当前位置:首页 >> Python编程Python编程

日期:2020-03-10 10:27

STSCI 5065 - Spring 2020 HOMEWORK 2

This assignment intends to give you an idea about the HDFS, Ambari and hands-on experience with

Hortonworks Hadoop Sandbox and Hadoop streaming.

Logistics:

• This assignment needs to be completed individually and is of total 100 marks.

• The assignment is due by March 10, 2020 at 11:59 PM.

• You will submit a zip file named as “HW2_submission.zip” containing the below

o solutions.pdf - This contains all your answers including screenshots with your NetID and

Name in the top right corner.

 Note about screenshots - Attach the screenshots in solutions.pdf below the

answer. Do not attach them separately as they will not be accepted that way.

o README.txt (optional) - It consists of anything which you want us to be aware of while

grading and name/link of the sources from where you may have referred while solving the

assignment.

o feedback.txt (optional) - Any feedback you want to provide about the assignment.

• For this assignment you should work on Hortonworks Hadoop Sandbox 2.5.

Cornell’s academic integrity policy is enforced. Anyone caught cheating or plagiarizing will be

handled according to Cornell’s Code of Academic Integrity.

Please follow the instructions stated in the questions and above; otherwise, you may be subject to

unwanted penalties.

Question 1 (total 18 points)

Let's assume that there exists a 5 GB file. For all below sub-parts, assume that transfer rate is 50 MB/s

and seek time is 3 milliseconds. Remember from class, seek time is the time to find the start position

of Each block to read the data from. For simplicity, assume that 1GB = 1000 MB, 1 MB = 1000 KB and

so on, and the complete file is stored on just one system.

a) If the file is stored on a simple file system with a block size of 100 KB, how much time does it

take to read the complete file? (5 points)

b) If the file is stored on a Hadoop file system with a block size of 100 MB, how much time does it

take to read the complete file? (5 points)

c) What is the percentage gain of time in the second case as compared to the first case? (5

points)

d) Give reasons and situations where block size in Question 1a is better than block size in

Question 1b, and vice versa. (3 points)

Xiaolong Yang  2020

2

Question 2 (total 12 points)

In this question we will be working with the command line interface on the Hortonworks Sandbox and

use commands for working in the Hadoop ecosystem. Remember the username to login is "root" and

password is "Hadoop"(for the first time). For each of the below sub-questions, write the command

you executed to perform the task.

a) Create an HDFS directory called “course-data” at the root of the Hadoop filesystem.

b) Create a normal directory “data-set” at the root of the normal Linux file system.

c) Download the files from the following locations in the directory data-set and unzip it. It may

take some time to download them.

http://data.gdeltproject.org/events/1990.zip

http://data.gdeltproject.org/events/1991.zip

d) After extraction of the above files, copy the files into the Hadoop filesystem.

e) Write the command to validate if the files are actually copied correctly in the system.

f) Write the command to examine the file storage statistics of your Hadoop filesystem for the

directory course-data. How many block(s) is/are allocated for it? Attach a screenshot of the

command’s output.

Question 3 (total 4 points)

Remember in Question 2, we copied two files into the Hadoop filesystem. For the below questions,

you will use file browser, via Apache Ambari, to view the files and perform operation.

a) Attach the screenshot of the file browser showing the two files copied in Hadoop filesystem.

The screenshot should show the file size and permissions as well.

b) Delete the file “1990.csv” using the file browser. Attach the screenshot of the file browser

after the operation. Please write down the Hadoop command that performs the same

operation.

Question 4 (total 66 points)

In this question, you will practice data analysis using MapReduce in Hadoop with the streaming

method. You will modify the mapper and reducer programs written in Python to make them more

general and robust and to add more functionality so that they do more than just counting the word

frequencies. In Hortonworks Sandbox, in CentOS (through localhost:4200), create a new directory

called HW2Q4 at the root, and save all the related files in this directory.

You are required to use vi to code your Python map and reduce programs. In your code, you must use

comments to include your name at the beginning of each program file and to explain what your code

blocks do in the program, such as the meaning of the variables you define and the task that your “for

loop” performs, etc. These comments are very important for the grader to understand your code. If

the comments are missing (or too simple to make sense), up to 10 points may be taken away. You

are required to list all the Linux or Hadoop commands and steps you use to perform the tasks below.

Your Python map and reduce programs should end with a .py extension.

a) (30 points) The map and the reduce programs (WDmapper.py and WDreducer.py)introduced

in the class produced the desired results when we tested them with some simple and short

Xiaolong Yang  2020

3

texts; however, when we apply these to complicated texts, you will find problems that some

words are pre-fixed and/or post-fixed with punctuation marks, parentheses and quotes, etc.

and the same words are treated as different words.

Modify WDmapper.py and WDreducer.py to fix the above problems. The input text file to use

is the shakespeare.txt file on Canvas. First, test your WDmapper.py and WDreducer.py files

with Linux shell scripting by forming a pipeline; attach a screenshot of this result. Second, run

your mapper and reducer in HDFS; download the reducer output file, part-00000, fromthe

HDFS system to your host OS (Windows OS or Mac OS) and then rename it to “Q4a_Reduceroutput.txt.”

Paste this in your solution file. Additionally, submit WDmapper.py,

WDreducer.py and Q4a_Reducer-output.txt files.

b) (26 points) Further modify WDmapper.py and WDreducer.py and rename them to

WDmapper1.py and WDreducer1.py respectively so that you only output the following (you

should not output the whole word list and word frequencies).

1) The total number of the lines of shakespeare.txt.

2) The 100 most frequently used words in shakespeare.txt, including the words and their

counts. (Hint: you should consider using the Python lambda operator/function and the

sort() function)

3) The total number of the words in shakespeare.txt.

4) The number of unique words in the shakespeare.txt.

Here the same words with different cases are treated as different words, e.g., “University” and

“university” are different words. Again, your output words should only contain the words

themselves, i.e., no punctuation marks, quotes, etc. at the beginning and/or the end (same

below). Download the reducer output file, part-00000, from the HDFS system to your host OS

and then rename it to “Q4b_Reducer-output.txt.”. Paste this in your solution file. Additionally,

submit WDmapper1.py, WDreducer1.py and Q4b_Reducer-output.txt files.

c) (10 points) If the case difference of the words is ignored, modify your above Python programs

and rename them to WDmapper2.py and WDreducer2.py. Repeat similar sub-questions as in

b):

1) The total number of the lines of shakespeare.txt.

2) The 100 most frequently used words in shakespeare.txt, including the words andtheir

counts.

3) The total number of the words of shakespeare.txt. Here you use the same case for all

the words in the text.

4) The number of unique words in the shakespeare.txt.

Download the reducer output file, part-00000, from the HDFS system to your host OS and

then rename it to “Q4c_Reducer-output.txt.” Paste the output in your solution file.

Additionally, submit WDmapper2.py, WDreducer2.py and Q4c_Reducer-output.txt files.

As an example (not a portion of the real output), your output should be something as shown on the

next page.

Xiaolong Yang  2020

4


版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。 站长地图

python代写
微信客服:codinghelp