Lab Course: Distributed Data Analytics
Exercise Sheet 10
Mohsan Jameel
Information Systems and Machine Learning Lab
University of Hildesheim
Submission deadline: Friday, July 13, 23:59 (on LearnWeb, course code: 3114)
Instructions
Please follow these instructions for solving and submitting the exercise sheet.
1. Submit a zip or tar file containing two things: a) Python scripts and b) a PDF document.
2. In the PDF document, explain your approach (i.e. how you solved a given problem) and
present your results in the form of graphs and tables.
3. The submission must be made before the deadline, and only through LearnWeb.
Distributed Computing with Apache Spark: Recommender Systems
In this lab you will build a recommender system. For this exercise you can work with the movielens100k,
movielens1m and/or movielens10m datasets available at https://grouplens.org/datasets/movielens/.
The MovieLens rating dataset consists of (user, movie, rating) tuples, where ratings are on a five-star scale.
Exercise 1: Recommender System from scratch (10 points)
You will build a basic recommender system using Apache Spark by implementing matrix factorization (MF)
without a bias term (you may add a bias term if you want, but it is not required for this task).
Matrix factorization approximates a rating matrix $R \in \mathbb{R}^{M \times N}$ by the product $U V^\top$
of two low-rank matrices $U \in \mathbb{R}^{M \times K}$ and $V \in \mathbb{R}^{N \times K}$, where $U$ is the
user latent matrix, $V$ is the movie latent matrix, $M$ is the number of users, $N$ is the number of movies
and $K$ is the number of latent dimensions. For more on matrix factorization, see references [3] and [4].
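For orientation, the factorization above is commonly fitted by minimizing a regularized squared error over the observed ratings. The sketch below is one such formulation, not a requirement of this sheet. It assumes a set $\Omega$ of observed (user, movie) pairs, a regularization weight $\lambda$ and a learning rate $\eta$, none of which are fixed by the exercise; $u_i$ and $v_j$ denote row $i$ of $U$ and row $j$ of $V$.

```latex
\begin{aligned}
&\min_{U,V} \sum_{(i,j)\in\Omega} \left(r_{ij} - u_i^\top v_j\right)^2
  + \lambda\left(\lVert U\rVert_F^2 + \lVert V\rVert_F^2\right) \\[4pt]
% One SGD step for a single observed rating r_{ij};
% constant factors are absorbed into \eta and \lambda.
&e_{ij} = r_{ij} - u_i^\top v_j \\
&u_i \leftarrow u_i + \eta\left(e_{ij}\, v_j - \lambda\, u_i\right) \\
&v_j \leftarrow v_j + \eta\left(e_{ij}\, u_i - \lambda\, v_j\right)
  \quad \text{(using the value of } u_i \text{ from before its update)}
\end{aligned}
```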
Your tasks are:
1. Use Apache Spark transformations and actions to implement MF using Stochastic Gradient Descent.
(Remember PSGD from Exercise Sheet 04).
(a) One strategy to implement MF with Spark is to use the map or mapPartitions function to create
multiple splits of the data and learn an independent MF model on each split. Once the individual
learning is done, you can aggregate the model parameters, i.e. U and V. See
http://apachesparkbook.blogspot.com/2015/11/mappartition-example.html
(b) Report the train and test RMSE scores. You can follow standard 3-fold cross-validation.
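The split-train-aggregate strategy in (a) can be sketched without a cluster. The NumPy example below is a minimal illustration under stated assumptions, not a reference solution. The function names (`local_sgd`, `psgd`, `rmse`, `synthetic_ratings`), the hyperparameters and the synthetic data are all hypothetical. Aggregation is assumed to be a plain average of U and V after every round, with each split restarting from the shared averaged model. In Spark, `local_sgd` is roughly what would run inside `mapPartitions`, followed by an aggregation step; that wiring is not shown here.

```python
import numpy as np

def rmse(ratings, U, V):
    """Root mean squared error of U[u] . V[i] against observed ratings."""
    users = ratings[:, 0].astype(int)
    items = ratings[:, 1].astype(int)
    pred = np.sum(U[users] * V[items], axis=1)
    return float(np.sqrt(np.mean((ratings[:, 2] - pred) ** 2)))

def local_sgd(split, U, V, lr, reg, rng):
    """One SGD epoch over a single split, starting from copies of the shared U, V."""
    U, V = U.copy(), V.copy()
    for idx in rng.permutation(len(split)):
        u, i, r = int(split[idx, 0]), int(split[idx, 1]), split[idx, 2]
        err = r - U[u] @ V[i]
        u_old = U[u].copy()  # keep the pre-update user vector for the item step
        U[u] += lr * (err * V[i] - reg * U[u])
        V[i] += lr * (err * u_old - reg * V[i])
    return U, V

def psgd(train, n_users, n_items, k=4, n_splits=3, rounds=60,
         lr=0.03, reg=0.05, seed=0):
    """Train on disjoint splits independently, then average U and V each round."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_items, k))
    splits = np.array_split(train[rng.permutation(len(train))], n_splits)
    for _ in range(rounds):
        models = [local_sgd(s, U, V, lr, reg, rng) for s in splits]
        U = np.mean([m[0] for m in models], axis=0)  # aggregate user factors
        V = np.mean([m[1] for m in models], axis=0)  # aggregate item factors
    return U, V

def synthetic_ratings(n_users=30, n_items=20, n_obs=400, seed=1):
    """Hypothetical stand-in for MovieLens: rows of (user, item, rating in 1..5)."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(n_users, 2))
    B = rng.normal(size=(n_items, 2))
    full = np.clip(np.rint(3 + A @ B.T), 1, 5)
    cells = rng.choice(n_users * n_items, size=n_obs, replace=False)
    users, items = np.divmod(cells, n_items)
    return np.column_stack([users, items, full[users, items]]).astype(float)

data = synthetic_ratings()
train, test = data[:300], data[300:]
U, V = psgd(train, n_users=30, n_items=20)
print("train RMSE:", rmse(train, U, V), "test RMSE:", rmse(test, U, V))
```

Averaging after every round, rather than once at the very end, is a design choice made here because independently initialized factor matrices are not aligned with each other, so a single final average can cancel out useful structure. Other aggregation schemes are equally valid for the exercise.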
Exercise 2: Recommender System using Apache Spark MLlib (10 points)
In this exercise you will perform matrix factorization using the built-in functions of the Apache Spark
MLlib library. You will experiment with the same dataset as in Exercise 1.
1. Implement a recommender system using Apache Spark MLlib functions.
2. Report the train and test RMSE scores. You can follow standard 3-fold cross-validation.
3. Compare your scores from Exercise 1 and Exercise 2.
4. Look at the results for the MovieLens datasets at http://www.mymedialite.net/examples/datasets.html
and compare them with your own results.
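Tasks 1 and 2 above could be approached along the following lines. This is a hedged sketch only. It assumes the DataFrame-based `pyspark.ml.recommendation.ALS` estimator is an acceptable reading of "MLlib functions" (the RDD-based `pyspark.mllib` package also provides an ALS). It also assumes the ratings are already loaded into a DataFrame with integer `user` and `item` columns and a numeric `rating` column. The helper names `fold_pairs` and `als_cross_validation`, the column names and the hyperparameter values are hypothetical.

```python
from functools import reduce

def fold_pairs(k):
    """For each held-out fold i, list the indices of the remaining training folds."""
    return [([j for j in range(k) if j != i], i) for i in range(k)]

def als_cross_validation(ratings_df, k=3, rank=10, max_iter=10,
                         reg_param=0.1, seed=42):
    """Return one (train_rmse, test_rmse) pair per fold."""
    # Imported inside the function so fold_pairs stays usable without Spark installed.
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.recommendation import ALS

    # Equal weights give k folds of roughly equal size.
    folds = ratings_df.randomSplit([1.0] * k, seed=seed)
    als = ALS(rank=rank, maxIter=max_iter, regParam=reg_param,
              userCol="user", itemCol="item", ratingCol="rating",
              coldStartStrategy="drop",  # skip users/items unseen in training
              seed=seed)
    evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                    predictionCol="prediction")
    scores = []
    for train_idx, test_idx in fold_pairs(k):
        train_df = reduce(lambda a, b: a.union(b), [folds[j] for j in train_idx])
        model = als.fit(train_df)
        train_rmse = evaluator.evaluate(model.transform(train_df))
        test_rmse = evaluator.evaluate(model.transform(folds[test_idx]))
        scores.append((train_rmse, test_rmse))
    return scores
```

Setting `coldStartStrategy="drop"` removes predictions for users or movies that appear only in the held-out fold; without it those rows yield NaN predictions and the RMSE becomes NaN. Averaging the per-fold pairs gives the numbers to compare against Exercise 1.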