联系方式

  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-23:00
  • 微信:codinghelp

您当前位置:首页 >> Python编程Python编程

日期:2021-04-19 11:29

School of Computer Science

COMP5349: Cloud Computing Sem. 1/2021

Assignment 1: Data Analysis with Spark RDD API

Individual Work: 20% 01.04.2021

1 Introduction

This assignment tests your ability to implement simple data analytic workload using Spark

RDD API. The data set you will work on is adapted from Trending Youtube Video Statistics

data from Kaggle. There are two workloads you should design and implement against

the given data set. You are required to designed and implement both workloads using

ONLY basic Spark RDD API. You should not use Spark SQL or other advanced features.

2 Input Data Set Description

The dataset contains several months’ records of daily top trending YouTube video in the

following ten countries: Canada,France, Germany, India,Japan, Mexico, Russia, South Korea,

United Kingdom and United States of America. There are up to 200 trending videos

listed per day.

In the original data set, each country’s data is stored in a separate CSV file, with each

row representing a trending video record. If a video is listed as trending in multiple days,

each trending appearance has its own record. The category names are stored in a few

separate JSON files.

The following preprocessing have been done to ensure that you can focus on the main

workload design.

? Merge the 10 individual CSV files into a single CSV file;

? Add a column category to store the actual category name based on the mapping

? Add a column country to store the trending country, each country is represented by

two capital letter code.

? Remove rows with invalid video id values

? Remove most columns that are not relevant to the workloads

The results is a CSV file AllVideos.csv with 8 columns and no header row. The

columns are as follows. The trending date column has the date format: yy.dd.mm

video_id,trending_date,category,views,likes,dislikes,country

1

3 Analysis Workload Description

3.1 Controversial Trending Videos Identification

Listing a video as trending would help it attract more views. However, not all trending

videos are liked by viewers. For some video, listing it as trending would increase its

dislikes number more than the increase of its likes number. This workload aims to identify

such videos. Below are a few records of a particular video demonstrating the change

of various numbers over time:

video id trending date views likes dislikes country

QwZT7T-TXT0 2018-01-03 13305605 835378 629120 US

QwZT7T-TXT0 2018-01-04 23389090 1082422 1065772 US

QwZT7T-TXT0 ... ... ... ... US

QwZT7T-TXT0 2018-01-09 37539570 1402578 1674420 US

QwZT7T-TXT0 2018-01-03 13305605 835382 629123 GB

QwZT7T-TXT0 ... ... ... ... GB

QwZT7T-TXT0 2018-01-18 45349447 1572111 1944971 GB

The video has multiple trending appearances in US and GB. In both countries, its views,

likes and dislikes all increase over time with each trending appearance. As highlighted

in the table above, the dislikes number grows much faster than the likes numbers. In

both countries, the video ended with higher number of dislikes than likes albeit starting

with higher likes number.

In this workload, you are asked to find out the top 10 videos with fastest growth

of dislikes number between its first and last trending appearances. Here we measure

the growth of dislikes number by the difference of dislikes increase and likes increase

between the first and last trending appearances in the same country. For instance, the

dislikes growth of video QwZT7T-TXT0 in US is computed as follows:

(1674420 ? 629120) ? (1402578 ? 835378) = 478100

The result of this workload should show the video id, dislike growth value and country

code. Below is the sample results.

’QwZT7T-TXT0’, 579119, ’GB’

’QwZT7T-TXT0’, 478100, ’US’

’BEePFpC9qG8’, 365862, ’DE’

’RmZ3DPJQo2k’, 334390, ’KR’

’q8v9MvManKE’, 299044, ’IN’

’pOHQdIDds6s’, 160365, ’CA’

’ZGEoqPpJQLE’, 151913, ’RU’

’84LBjXaeKk4’, 134836, ’FR’

’84LBjXaeKk4’, 134834, ’DE’

’84LBjXaeKk4’, 121240, ’RU’

2

3.2 Category and Trending Correlation

Some videos are trending in multiple countries. We are interested to know if there is

any correlation between video category and trending popularity among countries. For

instance, we may expect to see a common set of trending music videos in many countries

and a distinctive set of trending political videos in each country. In this workload, you are

asked to find out the average country number for videos in each category.

The following sample data set contains five videos belonging to category “Sports”,

their trending data are as follows:

video id category trending date views country

1 Sports 18.17.02 700 US

1 Sports 18.18.02 1500 US

2 Sports 18.11.03 3000 US

2 Sports 18.11.03 2000 CA

2 Sports 18.11.03 5000 IN

2 Sports 18.12.03 7000 IN

3 Sports 18.17.04 2000 JP

4 Sports 18.16.04 3000 KR

4 Sports 18.17.04 9000 KR

5 Sports 18.16.04 4000 RU

We can see that video 1,3,4,5, each appears in one country; video 2 appears in three

countries; If they are the only videos in Sports category, the average country number

would be (1 + 3 + 1 + 1 + 1)/5 = 1.4 You should print out the final result sorted by the

average country number. The sample output of this work load is as follows.

(’Trailers’, 1.0),

(’Autos & Vehicles’, 1.0190448285965426),

(’News & Politics’, 1.052844979051223),

(’Nonprofits & Activism’, 1.057344064386318),

(’Education’, 1.0628976994615762),

(’People & Blogs’, 1.0640343760329336),

(’Pets & Animals’, 1.0707850707850708),

(’Howto & Style’, 1.0876256925918326),

(’Travel & Events’, 1.0929411764705883),

(’Gaming’, 1.0946163477016635),

(’Sports’, 1.1422245184146431),

(’Entertainment’, 1.1447534885477444),

(’Science & Technology’, 1.1626835588828102),

(’Film & Animation’, 1.1677314564158094),

(’Comedy’, 1.2144120659156503),

(’Movies’, 1.25),

(’Music’, 1.310898044427568),

(’Shows’, 1.614678899082569)

3

A small number of videos have more than one category name. The category name may

change over time. For instance video id “119YrPUNM28” has changed its category name from

“News & Politics” to “Entertainment”. A video may be given different category names

in different countries. For instance, video id “7klO0p092Y” is under category “People &

Blogs” in CA and DE but under category “Entertainment” in US. As the number is quite

small, you do not need to identify and handle them separately. The sample answer double

count them in all categories they appear.

4 Coding and Execution Requirement

Below are requirements on coding and Execution:

? You should implement both workloads in PySpark using Spark RDD API.

? You should implement both workloads in a single Jupyter notebook. There should

be clear indication which cells belong to which workload.

? You should not modify the input data file in any way and your code should read the

data file from the same directory as the notebook file.

5 Deliverable and Demo

The source code should be submitted as a single Jupyter notebook file. The due date is

week 7 Wednesday 21/04/21 23:59. Please name your notebook file as

<labCode>-<uniKey>-<firstName>-<lastName>.ipynb.

There will be a 10 minutes demo in week 7/8. You need to attend the demo to

receive mark for this assignment!

During the demo, the marker will run your notebook on their own environment to

check the correctness of the result. You should also have your environment ready to run

your code and to answer questions. The marker may ask you to explain the overall computation

graph or certain part of the implementation. You may be asked to add some

statement in your code to show the structure of an intermediate RDD, or to apply various

filters on intermediate RDDs to provide slightly different result.

4


版权所有:编程辅导网 2020 All Rights Reserved 联系方式:QQ:99515681 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。