Date: 2022-03-26 08:33

Assignment 1: Basics and Map-Reduce

Formative, Weight (15%), Learning objectives (1, 2, 3), Abstraction (4), Design (4), Communication (4), Data (5), Programming (5)

Due date: 11:59pm, 28 March 2022

1 Overview

This assignment must be done in groups of TWO students. You MUST use the groups created on the assignment page to submit your group’s work. Submissions made outside the group’s submission page may NOT be marked. Every group needs to make ONLY ONE submission. If you have problems or questions regarding grouping, or require assistance, please use the discussion forum or email the teaching assistants directly. Note that in special cases we might have groups of a single member only (after seeking approval from the course coordinator), but you will still need to follow the above instructions to create a group and submit using the group interface.

2 Assignment

Exercise 1 Suspected Pairs (15 points)

Using the information from the first lecture (or Section 1.2.3 in the textbook), what would be the number of suspected pairs if the following changes were made to the data? (Note: all changes are to be applied at the same time.)

• The number of days of observation was raised to 5000.
• The number of people observed was raised to 5 billion (and there were therefore 500,000 hotels).
• We only reported a pair as suspect if they were at the same hotel at the same time on four different days.
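The following is a minimal sketch of the style of calculation used in Section 1.2.3, run on the textbook's own baseline figures rather than on the modified values above. It assumes the textbook's setup (each person visits a hotel on 1% of days and picks a hotel uniformly at random) and the large-n approximation C(n, k) ≈ n^k / k!. The function name `expected_suspected_pairs` and its parameters are illustrative, not part of the assignment.

```python
from math import factorial


def expected_suspected_pairs(people, days, hotels, visit_prob=0.01, meetings=2):
    """Expected number of pairs seen at the same hotel on `meetings` different days.

    Assumes random, independent behaviour, as in the textbook example.
    """
    # Chance that two given people both visit a hotel on a given day
    # and that they happen to choose the same one.
    same_hotel_same_day = visit_prob ** 2 / hotels
    # Large-n approximations of C(people, 2) and C(days, meetings).
    pairs_of_people = people ** 2 / 2
    sets_of_days = days ** meetings / factorial(meetings)
    return pairs_of_people * sets_of_days * same_hotel_same_day ** meetings


# Textbook baseline: 10^9 people, 1000 days, 10^5 hotels, two meetings.
baseline = expected_suspected_pairs(people=10**9, days=1000, hotels=10**5)
print(round(baseline))  # about 250,000, the figure quoted in the textbook
```

The same function can be called with other values for `people`, `days`, `hotels` and `meetings`; whether the n^k / k! approximation is acceptable for a given answer is a judgment to justify in the write-up.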


COMP SCI 3306, COMP SCI 7306 Mining Big Data Semester 1, 2022

Exercise 2 TF-IDF (15 points)

• Q1: Explain what TF.IDF is and provide its formulation. (Note that you might see slightly different definitions in different sources; here the definition in the textbook is acceptable.)
• Q2: Suppose there is a repository of ten million documents. What (to the nearest integer) is the IDF for a word that appears in (a) 40 documents (b) 10,000 documents?
• Q3: Suppose there is a repository of ten million documents, and word w appears in 320 of them. In a particular document d, the maximum number of occurrences of a word is 15. Approximately what is the TF.IDF score for w if that word appears (a) once (b) five times?
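As a hedged illustration of the textbook definition (term frequency normalised by the most frequent word in the document, and IDF as the base-2 logarithm of the total number of documents divided by the number containing the word), the sketch below reproduces the textbook's own worked example, not the values in Q2 or Q3. The function names are illustrative.

```python
from math import log2


def term_frequency(count_in_doc, max_count_in_doc):
    """TF: occurrences of the word divided by the count of the most frequent word."""
    return count_in_doc / max_count_in_doc


def inverse_document_frequency(total_docs, docs_with_word):
    """IDF: log base 2 of (total documents / documents containing the word)."""
    return log2(total_docs / docs_with_word)


def tf_idf(count_in_doc, max_count_in_doc, total_docs, docs_with_word):
    """TF.IDF score: the product of the two quantities above."""
    return (term_frequency(count_in_doc, max_count_in_doc)
            * inverse_document_frequency(total_docs, docs_with_word))


# Textbook example: 2^20 documents, word w appears in 2^10 of them.
idf_w = inverse_document_frequency(2**20, 2**10)   # log2(2^10) = 10
score_max = tf_idf(20, 20, 2**20, 2**10)           # w is the most frequent word
score_once = tf_idf(1, 20, 2**20, 2**10)           # w appears once, maximum is 20
print(idf_w, score_max, score_once)
```

Other sources use a natural logarithm or a different TF normalisation, so scores computed this way are only comparable with others that follow the same definition.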

Exercise 3 Hadoop Basics (20 points)

For this exercise, you will need to set up and configure your system to use Hadoop in a virtual machine. Follow the instructions in the attached Hadoop document to set up the virtual machine as described in Section 1. Run the example program of Section 2 and carry out the different steps given in that section. Note that, depending on your system, you might face some hurdles in getting Hadoop running. You are expected to attempt to resolve the issues; if unsuccessful, you may use the discussion forum or the workshops to seek assistance from the teaching assistants. After you have Hadoop running according to the attached document, follow the instructions below:

• Run your job on the attached file 100-0.txt in standalone mode and pseudo-distributed mode and record the outputs. Describe every step you take to check the outputs in different modes.
• Describe what task the provided code is trying to achieve and how different the two modes are (i.e. standalone and pseudo-distributed). Do you see any difference in the outputs of these modes? Explain.

Exercise 4 Map-Reduce in Hadoop (30 points)

This exercise has 4 parts. In this exercise, you will be writing and implementing two separate MapReduce programs, described below. For each part, you will have to run your program in pseudo-distributed mode and record the output results.

Note 1: both problems below are NOT case sensitive, so you should transform the words to lowercase or uppercase first, to avoid counting duplicates.

Note 2: you may use the StringTokenizer to find the correct answers.

Part 1: Write a program that processes the FirstInputFile (pg100.txt) and the SecondInputFile (3399.txt) attached to the assignment. Your program will need to count the number of words with a specific number of letters in those files, for example, the number of words with 4 letters, 5 letters and so on. If a specific word is repeated 20 times in the text, count it individually 20 times.
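The map and reduce steps for Part 1 can be sketched in plain Python, outside Hadoop, to make the key/value flow concrete. This is only an illustration under stated assumptions: tokens are split on whitespace (roughly what a default StringTokenizer does), punctuation attached to a token is counted as part of its length, and the function names and sample lines are hypothetical. A real submission would express the same logic as Hadoop Mapper and Reducer classes.

```python
from collections import Counter


def map_word_lengths(line):
    """Map step: emit a (length, 1) pair for every whitespace-separated token."""
    for token in line.lower().split():
        yield len(token), 1


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key (word length)."""
    totals = Counter()
    for length, value in pairs:
        totals[length] += value
    return dict(totals)


def word_length_counts(lines):
    """Run the map step over all lines, then reduce the emitted pairs."""
    pairs = (pair for line in lines for pair in map_word_lengths(line))
    return reduce_counts(pairs)


# Hypothetical two-line input standing in for a real text file.
sample = ["The cat sat", "the Dog ran far away"]
counts = word_length_counts(sample)
print(counts)  # seven 3-letter tokens and one 4-letter token
```

Here the repeated word "the" is counted twice, which matches the requirement that every occurrence counts individually.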

Part 2: Answer Questions 1-6.

• Q1: How many words are there with length 10 in FirstInputFile?
• Q2: How many words are there with length 4 in FirstInputFile?
• Q3: What is the longest word length and what is its frequency in FirstInputFile?
• Q4: How many words are there with length 2 in SecondInputFile?
• Q5: How many words are there with length 5 in SecondInputFile?
• Q6: What is the most frequent length and what is its frequency in SecondInputFile?

Part 3: Write a second program that again processes the FirstInputFile (pg100.txt) and the SecondInputFile (3399.txt). However, in addition to counting the number of words with a specific number of letters, if one word is repeated several times, count it only once. So, your output will be the frequency of words with the same length, but count a repeated word once only (i.e. unique words).

Note: both solutions with one or two MapReduce jobs are accepted.
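One way to picture the two-job variant of Part 3 is sketched below in plain Python: a first pass deduplicates words (case-insensitively), and a second pass counts the surviving unique words by length. As before, whitespace tokenisation, the treatment of punctuation, the function name and the sample input are assumptions made for illustration, not requirements of the assignment.

```python
from collections import Counter


def unique_word_length_counts(lines):
    """Count distinct (case-insensitive) words by length."""
    # Conceptual job 1: key on the lowercased word so duplicates collapse.
    unique_words = {token for line in lines for token in line.lower().split()}
    # Conceptual job 2: key on the word length and count one per unique word.
    return dict(Counter(len(word) for word in unique_words))


# Hypothetical two-line input standing in for a real text file.
sample = ["The cat sat", "the Dog ran far away"]
unique_counts = unique_word_length_counts(sample)
print(unique_counts)  # six unique 3-letter tokens and one 4-letter token
```

Compared with counting every occurrence, "The" and "the" now contribute a single 3-letter word, which is why lowercasing before deduplication matters.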

Part 4: Answer Questions 7-12 below:

• Q7: How many words are there with length 10 in FirstInputFile?
• Q8: How many words are there with length 4 in FirstInputFile?
• Q9: What is the most frequent length and what is its frequency in FirstInputFile?
• Q10: How many words are there with length 5 in SecondInputFile?
• Q11: How many words are there with length 2 in SecondInputFile?
• Q12: What is the second-most frequent length and what is its frequency in SecondInputFile?

Exercise 5 Summary of Sections 2.4 and 2.5 (10 + 10 points)

For this exercise you will need to carefully read and understand Sections 2.3.9-2.3.11, 2.4, and 2.5 in Leskovec, Rajaraman, Ullman (third edition, 2020). Then:

• Q1: Summarize the contents of Section 2.4 in your own words (approx. 600 words).
• Q2: Summarize the contents of Section 2.5 in your own words (approx. 600 words).

Note: it is expected that you demonstrate understanding of the above Sections. You may do so by explaining the content in your own words (i.e. paraphrasing) and using diagrams/figures to help better convey the concept. The quality of the summary is important, not just the word count.

3 General assignment submission guidelines

As stated at the beginning of the assignment, work MUST be submitted using the group’s interface on MyUni, with ONLY a single submission per group. The submission will include the following, at minimum:

• A PDF file of your solutions for the theoretical exercises, and descriptions of the coding exercises or the results as requested in each exercise above.
• All source files, in the exact original form that is used on your system to run the program. This includes your code, all the Hadoop log files and all related project files.
• A README.txt file containing instructions to run the code, the names of the group members, student IDs, and email addresses.

Submissions which do not follow the above guidelines may lose points accordingly.

Please do not hesitate to reach out using the discussion forum, workshops, or the contact details of the teaching assistants on the home page of MyUni, should you have any questions or concerns.


