联系方式

  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-21:00
  • 微信:codinghelp

您当前位置:首页 >> Python编程Python编程

日期:2022-04-06 09:53

Assignment 2: Advanced Map-Reduce, Similar

Items and Data Streams

Formative, Weight (15%), Learning objectives (1, 2, 3),

Abstraction (4), Design (4), Communication (4), Data (5), Programming (5)

Due date: 11 : 59pm, 11 April, 2022

1 Overview

This assignment must be done in groups consisting of TWO students. You

MUST use the created groups for assignment 2, accessible from the ”People”

tab on MyUni to enrol yourself to a group before submission. Submissions

made without enrolling in a group on MyUni may NOT be marked. Every

group need to make ONLY ONE submission. If you have problems/questions

regarding grouping or require assistance, please use the discussion forum, or

email the teaching assistants directly.

Exercise 1 S-curve (exercise 3.4.1 in Leskovec, Rajaraman and Ullman) (15

points)

Write a program that evaluates and plots the S-curve 1 − (1 − s

r

)

b

for the

values s ∈ [0, 1], for the following values of r and b (one separate evaluation for

each of the three):

1. r=3 and b=10.

2. r=6 and b=20.

3. r=5 and b=50.

Note for this exercise you are allowed to use either of the following programming

languages: C++, Java or Python. Note the readibility of the plots is important

here, i.e. Make sure you have approperiate axes names and range of values in

your visualisations. Please submit both your code and include the visualisations

in your report.

1

COMP SCI 3306, COMP SCI 7306 Mining Big Data Semester 1, 2022

Exercise 2 Filtering Streams (similar to Exercises of 4.3 in Leskovec, Rajaraman

and Ullman) (15 points)

Solve the following problems:

1. For the situation of our running example of Section 4.3.1 with changed

conditions (10 billion bits, 2 billion members of the set S), calculate the

false-positive rate when using three hash functions. Do the same for four

hash functions.

2. As a function of n, the number of bits and m the number of members

in the set S, what number of hash functions minimizes the false-positive

rate?

Exercise 3 Map-Reduce: Friend Recommendation System (40 points)

Write a MapReduce program in Hadoop that implements a People You Might

Know social network friendship recommendation algorithm. The key idea is

that if two people have a lot of mutual friends, then the system will recommend

them to one another to get connected. You will have to run the program in

Hadoop, using the system setup completed in Assignment 1 in order to receive

points for this exercise.

Input File: Download the input file from the assignment page, to your

Friend Recommendation System. This input file contains the adjacency list of

friends and has multiple lines in the following format:

<User><TAB><Friends>

Here, <User> is a unique integer ID corresponding to a unique user and

<Friends> is a comma separated list of unique IDs corresponding to the friends

of the user with the unique ID <User>. Note that the friendships are mutual

(i.e., edges are undirected): if A is friend with B then B is also friend with A.

Algorithm: You will need to design an algorithm such that, for each user

U, the algorithm recommends N = 10 users who are not already friends with

U, but have the most number of mutual friends in common with U.

Expected Output: The output has to contain one line per user in the

following format:

<User><TAB><Recommendations>

where <User> is a unique ID corresponding to a user and <Recommendations>

2

COMP SCI 3306, COMP SCI 7306 Mining Big Data Semester 1, 2022

is a comma separated list of unique IDs corresponding to the algorithms recommendation

of people that <User> might know, ordered in decreasing number

of mutual friends. Even if a user has less than 10 second-degree friends, output

all of them in decreasing order of the number of mutual friends. If there are

recommended users with the same number of mutual friends, then output those

user IDs in numerically ascending order. Also, please provide a description of

how you are going to use MapReduce jobs to solve this problem.

Note: It is possible to solve this question with a single MapReduce job. But if

your solution requires multiple map reduce jobs, then that is fine too.

Your submission is expected to include the following:

• Include all your source code, in the exact system setup that you have run

the program. This includes all the log files, input/ outputs etc.

• Write-up: describe the algorithm you have designed to solve this problem

in enough details to understand your approach. Note this is different

to writing comments about your code. It should describe the high level

idea of your algorithm. It may include diagrams, graphs etc. for ease of

understanding.

• Results: in your write-up include the recommendations for the users with

following IDs: 924, 8941, 8942, 9019, 9020, 9021, 9022, 9990, 9992, 9993.

Exercise 4 Data streams (10 points)

Follow the scenario 1 and 2 below and answer the related questions regarding

the Flajolet-Martin Algorithm. The hash functions are of the form h(x) = ax+b

mod 32 for some a and b. You should treat the result as a 5-bit binary integer

1. Scenario 1: Suppose a data stream consists of the integers 3, 1, 4, 6, 5, 9.

Determine (a) the maximum tail length for each stream element and (b)

the resulting estimate of the number of distinct elements for the hash

functions in Question 1-3 below.

– Question 1: Hash function: h(x) = (2x + 1) mod 32

– Question 2: Hash function: h(x) = (3x + 7) mod 32

– Question 3: Hash function: h(x) = 4x mod 32

2. Scenario 2: Suppose a data stream consists of the integers 4, 5, 6, 7, 10, 15.

Determine (a) the maximum tail length for each stream element and (b)

the resulting estimate of the number of distinct elements for the hash

functions in Question 4-6 below.

3

COMP SCI 3306, COMP SCI 7306 Mining Big Data Semester 1, 2022

– Question 4: Hash function: h(x) = (6x + 2) mod 32

– Question 5: Hash function: h(x) = (2x + 5) mod 32

– Question 6: Hash function: h(x) = 2x mod 32

Exercise 5 Summary of 3.6 and 3.7 (20 points)

For this exercise you have to read Section 3.6 and 3.7 in Leskovec, Rajaraman,

Ullman (third edition, 2020).

1. Summarize the content of 3.6 in your own words (approx. 600 words).

2. Summarize the content of 3.7 in your own words (approx. 600 words).

Note: it is expected that you demonstrate understanding of the above Sections.

You may do so by explaining the content in your own words (i.e. paraphrasing)

and using digrams/ figures the convey the concept.

2 General assignment submission guidelines

As stated in the beginning of the assignment, work MUST be submited using

the group’s interface on MyUni, and a single submission per group, ONLY. The

submissions will include the following, at minimum:

• PDF file of your solutions for theoretical exercises, descriptions of the

coding exercises or the results as requested per each exercise above.

• all source files, in the exact original form that is used on your system to

run the program. This includes your code, all the Hadoop log files and all

related project files.

• a README.txt file containing instructions to run the code, the two group

members’ names, student IDs, and email addresses.

• the submissions that do not follow the above guidlines may lose points

accordingly.

Please do not hesitate to reach out using the discussion forum, workshops,

or the contact details of the teaching assistants on the home page of MyUni,

should you have any questions or concerns.

4


版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。 站长地图

python代写
微信客服:codinghelp