
5M project: stochastic optimization

Big data analytics

Project outline

The purpose of this project is to learn the basics of contemporary stochastic optimization methods. You will learn how such methods operate by implementing them and by tuning their hyperparameters. The stochastic optimization algorithms considered in the project are stochastic gradient descent (SGD), stochastic gradient descent with momentum (MSGD), stochastic gradient descent with Nesterov accelerated gradient (NAGSGD), AdaGrad, RMSProp and Adam.

Several of these stochastic optimization approaches were developed to train convolutional neural networks. For the purposes of the project, linear regression will be used instead of deep learning models. By using a familiar model such as linear regression, your focus will be placed on understanding the stochastic optimization algorithms.

Inference for linear regression coefficients via stochastic optimization will be conducted for three different datasets. The first dataset contains 1,000 simulated data points and is associated with two regression coefficients, namely the intercept and slope of a simple linear regression model. Starting with simple linear regression and simulated data, you will develop the involved stochastic optimization algorithms. Once you have implemented the stochastic optimization methods, you will use them on two real datasets. The second dataset is available on Kaggle, pertains to weather in Szeged between 2006 and 2016, and contains 96,453 observations and six parameters. The third dataset is the million song dataset (MSD), which you have encountered in the first lab and is available on the UCI machine learning repository. Recall that MSD is a big dataset, containing 515,345 data points and 91 parameters, so you will have the chance to fit multiple linear regression on a big dataset using stochastic optimization.

R code template

A template folder is available on Moodle, which will help you get started with coding. It contains three R scripts, namely cost.r, optimizers.r and question1.r.

The script cost.r provides the definition of the cost function and its gradient for linear regression. This is the only cost function needed for the project, since linear regression is the only model applied to the three datasets.
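As a point of reference only, the sketch below shows one common way such a cost function and gradient could be written for linear regression (mean squared error with a 1/2 factor); the actual definitions are those supplied in cost.r, and the names cost() and cost_gradient() are illustrative assumptions rather than the names used in the template.

# Illustrative sketch only; cost.r in the template is the authoritative definition.
# Mean squared error cost J(theta, D) = ||y - X theta||^2 / (2 * nd) and its gradient.
cost <- function(theta, X, y) {
  nd <- nrow(X)
  residuals <- y - X %*% theta
  sum(residuals^2) / (2 * nd)
}

cost_gradient <- function(theta, X, y) {
  nd <- nrow(X)
  residuals <- y - X %*% theta
  -t(X) %*% residuals / nd
}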

The script optimizers.r provides the R function gd() for gradient descent (GD). You can use the gd() function for answering the relevant questions of the project and as a coding basis for developing the stochastic optimization algorithms. Moreover, optimizers.r contains R comments with function signatures for the required stochastic algorithms. You will need to complete the body of these functions to tackle the project.

The script question1.r sources cost.r and optimizers.r initially, and subsequently provides a template of R comments to help structure your answer to question 1. Towards the end of question1.r, commented lines give you directions for storing the numerical and visual output of your simulations for question 1.

The template folder also contains the data files simulated_data.csv, weather_data.csv and uci_data.csv, associated with questions 1, 2 and 3, respectively.


An introduction to stochastic optimization

This section outlines the stochastic optimization methods appearing in the project. Gradient descent (GD), which has been taught in lectures 4-5 and has been coded in lab 1, is the starting point for stochastic optimization.

Gradient descent (GD)

Let J(θ, D) be a cost function, where θ is an nθ-length parameter vector and D a dataset of nd samples. Moreover, let ∇θJ(θ, D) be the gradient of the cost function J(θ) := J(θ, D) with respect to θ. The approximation θ^(k) of θ at step k of GD, given the approximation θ^(k-1) of θ at step k-1, is defined as

θ^(k) := θ^(k-1) - a ∇θJ(θ^(k-1), D),   (1)

where the learning rate a is a positive real number.

In the case of a linear regression model, the data are set to D := {X, y}, where X is the nd × nθ design matrix and y is the nd-length output vector associated with linear regression.
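The gd() function supplied in optimizers.r is the reference implementation of this update. Purely as an illustration of equation (1) and of the return format described later, a minimal sketch might look as follows, reusing the hypothetical cost() and cost_gradient() helpers sketched above.

# Illustrative GD sketch; the provided gd() in optimizers.r is authoritative.
gd_sketch <- function(theta0, X, y, a, niters) {
  theta <- matrix(0, nrow = niters + 1, ncol = length(theta0))
  cost_values <- numeric(niters + 1)
  theta[1, ] <- theta0
  cost_values[1] <- cost(theta0, X, y)
  for (k in 2:(niters + 1)) {
    # theta^(k) := theta^(k-1) - a * gradient on the full dataset D, equation (1)
    theta[k, ] <- theta[k - 1, ] - a * drop(cost_gradient(theta[k - 1, ], X, y))
    cost_values[k] <- cost(theta[k, ], X, y)
  }
  return(list(theta = theta, cost = cost_values))
}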

Stochastic gradient descent (SGD)

For a big dataset D consisting of a large number nd of data points, the evaluation of the gradient ∇θJ(θ, D) can be computationally expensive. To reduce the computational complexity of GD, stochastic gradient descent (SGD) samples an s-sized subset D_s^(k) of the data D at step k and evaluates the gradient ∇θJ(θ, D_s^(k)) using the subset D_s^(k). Thus, the approximation θ^(k) of θ at step k of SGD, given the approximation θ^(k-1) of θ at step k-1, is given by

θ^(k) := θ^(k-1) - a ∇θJ(θ^(k-1), D_s^(k)).   (2)

SGD is more computationally efficient than GD in large-scale machine learning problems entailing large nd or large nθ.

As an implementation hint, use the sample() function in R to sample s out of nd samples at step k without replacement. Subsequently, make sure that you index the design matrix X and the output variable y appropriately.
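A minimal SGD sketch along these lines is shown below, following equation (2) and the sampling hint; the function signature expected in optimizers.r may differ, and cost() and cost_gradient() are the hypothetical helpers sketched earlier.

sgd_sketch <- function(theta0, X, y, a, s, niters) {
  nd <- nrow(X)
  theta <- matrix(0, nrow = niters + 1, ncol = length(theta0))
  cost_values <- numeric(niters + 1)
  theta[1, ] <- theta0
  cost_values[1] <- cost(theta0, X, y)
  for (k in 2:(niters + 1)) {
    # Sample an s-sized subset of row indices without replacement, then index X and y
    idx <- sample(nd, size = s, replace = FALSE)
    g <- drop(cost_gradient(theta[k - 1, ], X[idx, , drop = FALSE], y[idx]))
    theta[k, ] <- theta[k - 1, ] - a * g   # equation (2)
    cost_values[k] <- cost(theta[k, ], X, y)
  }
  return(list(theta = theta, cost = cost_values))
}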

The rest of the stochastic optimization methods in this project also use a random subset D_s^(k) of the data D at step k.

SGD with momentum (MSGD)

In SGD with momentum (MSGD), a ‘momentum’ term m is defined, which is a moving average of the gradient ∇θJ(θ, D_s^(k)), and the parameters are then updated using the momentum m instead of the gradient ∇θJ(θ, D_s^(k)).

The momentum m^(k) and parameter approximation θ^(k) at the k-th step of MSGD are defined recursively according to

m^(k) := b m^(k-1) + (1 - b) ∇θJ(θ^(k-1), D_s^(k)),   (3)
θ^(k) := θ^(k-1) - a m^(k).   (4)

MSGD has two hyperparameters, a momentum memory factor b with values in (0, 1) and a positive learning rate a. Furthermore, an initial momentum value m^(0) is required by equation (3) at step k = 1.

For values of b closer to one, the moving average equation (3) retains a stronger memory of the momentum history. For values of b closer to zero, the moving average equation (3) retains less of the past momentum history and places more emphasis on the current gradient of the cost function.

Variant representations of MSGD are available in the literature. Equations (3)-(4) have been preferred because they provide a clearer understanding of MSGD.
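A minimal MSGD sketch, assuming the same hypothetical helpers as before, is given below; a is the learning rate, b the momentum memory factor and m0 the initial momentum m^(0).

msgd_sketch <- function(theta0, X, y, a, b, s, m0, niters) {
  nd <- nrow(X)
  theta <- matrix(0, nrow = niters + 1, ncol = length(theta0))
  cost_values <- numeric(niters + 1)
  theta[1, ] <- theta0
  cost_values[1] <- cost(theta0, X, y)
  m <- m0
  for (k in 2:(niters + 1)) {
    idx <- sample(nd, size = s, replace = FALSE)
    g <- drop(cost_gradient(theta[k - 1, ], X[idx, , drop = FALSE], y[idx]))
    m <- b * m + (1 - b) * g              # equation (3)
    theta[k, ] <- theta[k - 1, ] - a * m  # equation (4)
    cost_values[k] <- cost(theta[k, ], X, y)
  }
  return(list(theta = theta, cost = cost_values))
}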


SGD with Nesterov accelerated gradient (NAGSGD)

SGD with Nesterov accelerated gradient (NAGSGD) differs from MSGD only in the way it calculates the gradient of the cost function in the moving average equation. More specifically, NAGSGD calculates the gradient of the cost function with respect to the approximate future position θ^(k-1) - a b m^(k-1) rather than with respect to the current parameter approximation θ^(k-1). Thus, step k of NAGSGD is defined as

m^(k) := b m^(k-1) + (1 - b) ∇θJ(θ^(k-1) - a b m^(k-1), D_s^(k)),   (5)
θ^(k) := θ^(k-1) - a m^(k).   (6)

Alternative expressions for NAGSGD can be found in the literature. Equations (5)-(6) have been chosen for the purpose of communicating clearly the moving average model for momentum and the resulting parameter update.
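A minimal NAGSGD sketch differs from the MSGD sketch only in evaluating the gradient at the look-ahead position θ^(k-1) - a b m^(k-1); the helpers and signature are the same illustrative assumptions as before.

nagsgd_sketch <- function(theta0, X, y, a, b, s, m0, niters) {
  nd <- nrow(X)
  theta <- matrix(0, nrow = niters + 1, ncol = length(theta0))
  cost_values <- numeric(niters + 1)
  theta[1, ] <- theta0
  cost_values[1] <- cost(theta0, X, y)
  m <- m0
  for (k in 2:(niters + 1)) {
    idx <- sample(nd, size = s, replace = FALSE)
    lookahead <- theta[k - 1, ] - a * b * m   # approximate future position
    g <- drop(cost_gradient(lookahead, X[idx, , drop = FALSE], y[idx]))
    m <- b * m + (1 - b) * g              # equation (5)
    theta[k, ] <- theta[k - 1, ] - a * m  # equation (6)
    cost_values[k] <- cost(theta[k, ], X, y)
  }
  return(list(theta = theta, cost = cost_values))
}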

AdaGrad

The learning rate a of SGD can be too small for one parameter and too large for another. This problem tends to manifest itself for a large number nθ of parameters. AdaGrad mitigates such a problem by scaling a different learning rate for each parameter adaptively and automatically.

Using the shorthand g^(k) := ∇θJ(θ^(k-1), D_s^(k)), consider the outer product g^(k) × g^(k) at step k, which is an nθ × nθ matrix. Summing over all steps up to k yields the nθ × nθ matrix

G^(k) := Σ_{i=1}^{k} g^(i) × g^(i).   (7)

At step k, AdaGrad performs the parameter update

θ^(k) := θ^(k-1) - (a / √(diag(G^(k-1)) + ε)) ⊙ g^(k),   (8)

where a is a positive learning rate, diag(G^(k-1)) is the diagonal of the matrix G^(k-1), ε is a positive smoothing term that avoids division by zero (usually of the order of 1e-8) and ⊙ denotes Hadamard (element-wise) operations, with the division and square root in equation (8) also applied element-wise. Step k = 1 requires an initial value G^(0).

To further clarify AdaGrad, equation (8) will be written below in an alternative form by using element-wise notation instead of matrix notation. Let g_j^(k) be the j-th partial derivative of the cost function at step k, i.e. the j-th coordinate of the gradient g^(k). The AdaGrad update of the j-th parameter θ_j is

θ_j^(k) := θ_j^(k-1) - (a / √(G_jj^(k-1) + ε)) g_j^(k),   (9)

where G_jj^(k-1) is the (j, j)-th diagonal element of G^(k-1). Equation (9) makes it clear that AdaGrad uses a differently scaled learning rate a / √(G_jj^(k-1) + ε) for each parameter θ_j.

As a hint to assist implementation, you may focus on equation (9). You do not need to compute the outer products appearing in equation (8). It suffices to calculate the diagonal of the matrix G^(k), which means that it suffices to calculate the sum of squared gradients

G_jj^(k) = Σ_{i=1}^{k} (g_j^(i))².   (10)

AdaGrad eliminates the need to manually tune the learning rate, since it tunes the learning rate for each parameter adaptively according to equation (9). However, the adaptive tuning of learning rates in AdaGrad comes with a weakness. Squared gradients accumulate in the denominator of the learning rate a / √(G_jj^(k-1) + ε), hence learning rates shrink during training and eventually AdaGrad is no longer able to perform meaningful parameter updates via equation (9).
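A minimal AdaGrad sketch following equations (9)-(10) is given below; only the diagonal of G is accumulated, as the implementation hint suggests, G0diag stands for the initial diag(G^(0)) and eps for the smoothing term ε, with the same illustrative helpers as before.

adagrad_sketch <- function(theta0, X, y, a, s, G0diag, eps, niters) {
  nd <- nrow(X)
  theta <- matrix(0, nrow = niters + 1, ncol = length(theta0))
  cost_values <- numeric(niters + 1)
  theta[1, ] <- theta0
  cost_values[1] <- cost(theta0, X, y)
  Gdiag <- G0diag
  for (k in 2:(niters + 1)) {
    idx <- sample(nd, size = s, replace = FALSE)
    g <- drop(cost_gradient(theta[k - 1, ], X[idx, , drop = FALSE], y[idx]))
    # Equation (9): per-parameter learning rate scaled by diag(G^(k-1))
    theta[k, ] <- theta[k - 1, ] - (a / sqrt(Gdiag + eps)) * g
    # Equation (10): accumulate squared gradients to obtain diag(G^(k))
    Gdiag <- Gdiag + g^2
    cost_values[k] <- cost(theta[k, ], X, y)
  }
  return(list(theta = theta, cost = cost_values))
}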


RMSProp

RMSProp attempts to alleviate the aforementioned weakness of AdaGrad by applying a moving average model to the squared gradients and by then using the output of the moving average model in the AdaGrad parameter update mechanism.

More concretely, the j-th coordinate v_j^(k) of the moving average of squared gradients and the j-th parameter θ_j^(k) at step k of RMSProp are updated according to

v_j^(k) := c v_j^(k-1) + (1 - c) (g_j^(k))²,   (11)
θ_j^(k) := θ_j^(k-1) - (a / √(v_j^(k) + ε)) g_j^(k).   (12)

The hyperparameter c is known as the memory factor for squared gradients, taking values in c ∈ (0, 1). Equation (11) requires an initial value v_j^(0) at step k = 1. Taking into account all coordinates j = 1, 2, ..., nθ, the initial value v^(0) for the moving average model of RMSProp is an nθ-length vector.
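A minimal RMSProp sketch following equations (11)-(12), with c the memory factor for squared gradients and v0 the initial moving average v^(0), is shown below under the same illustrative assumptions.

rmsprop_sketch <- function(theta0, X, y, a, c, s, v0, eps, niters) {
  nd <- nrow(X)
  theta <- matrix(0, nrow = niters + 1, ncol = length(theta0))
  cost_values <- numeric(niters + 1)
  theta[1, ] <- theta0
  cost_values[1] <- cost(theta0, X, y)
  v <- v0
  for (k in 2:(niters + 1)) {
    idx <- sample(nd, size = s, replace = FALSE)
    g <- drop(cost_gradient(theta[k - 1, ], X[idx, , drop = FALSE], y[idx]))
    v <- c * v + (1 - c) * g^2                               # equation (11)
    theta[k, ] <- theta[k - 1, ] - (a / sqrt(v + eps)) * g   # equation (12)
    cost_values[k] <- cost(theta[k, ], X, y)
  }
  return(list(theta = theta, cost = cost_values))
}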

Adam

Adaptive moment estimation (Adam) is a stochastic optimization method that computes a different learning rate for each parameter adaptively. Apart from retaining a moving average of squared gradients similarly to RMSProp, Adam additionally stores a moving average of gradients.

The Adam update of the j-th parameter θ_j at step k is given by

m_j^(k) := b m_j^(k-1) + (1 - b) g_j^(k),   (13)
v_j^(k) := c v_j^(k-1) + (1 - c) (g_j^(k))²,   (14)
m̂_j^(k) := m_j^(k) / (1 - b^k),   (15)
v̂_j^(k) := v_j^(k) / (1 - c^k),   (16)
θ_j^(k) := θ_j^(k-1) - (a / (√(v̂_j^(k)) + ε)) m̂_j^(k).   (17)

Adam has three hyperparameters, namely a positive learning rate a, a memory factor b of gradients with values in (0, 1) and a memory factor c of squared gradients with values in (0, 1). Moreover, Adam requires the initial values m^(0) and v^(0) at step k = 1.

Equations (13) and (14) provide the moving averages of gradients and of squared gradients, respectively. Adam takes its name from these first- and second-order estimates of the gradients. Due to m^(0) and v^(0) typically being initialized as vectors of zeros, the moving averages m^(k) and v^(k) are biased towards zero, especially during the early steps and when b and c are set to values close to one. To counteract such biases while estimating the moving averages of gradients and of squared gradients, the bias-corrected equations (15) and (16) are introduced.
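A minimal Adam sketch following equations (13)-(17) is given below, again under the same illustrative assumptions; b and c are the memory factors for gradients and squared gradients, and m0 and v0 are the initial moving averages.

adam_sketch <- function(theta0, X, y, a, b, c, s, m0, v0, eps, niters) {
  nd <- nrow(X)
  theta <- matrix(0, nrow = niters + 1, ncol = length(theta0))
  cost_values <- numeric(niters + 1)
  theta[1, ] <- theta0
  cost_values[1] <- cost(theta0, X, y)
  m <- m0
  v <- v0
  for (k in 2:(niters + 1)) {
    step <- k - 1   # optimization step index
    idx <- sample(nd, size = s, replace = FALSE)
    g <- drop(cost_gradient(theta[k - 1, ], X[idx, , drop = FALSE], y[idx]))
    m <- b * m + (1 - b) * g        # equation (13)
    v <- c * v + (1 - c) * g^2      # equation (14)
    m_hat <- m / (1 - b^step)       # equation (15), bias correction
    v_hat <- v / (1 - c^step)       # equation (16), bias correction
    theta[k, ] <- theta[k - 1, ] - a * m_hat / (sqrt(v_hat) + eps)   # equation (17)
    cost_values[k] <- cost(theta[k, ], X, y)
  }
  return(list(theta = theta, cost = cost_values))
}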

Question 1

The goal of question 1 is to develop the stochastic optimizers of this project using a simulated dataset and a cost function for simple linear regression.

To help you get started, the provided template folder contains an optimizers.r file with R code for gradient descent, a cost.r file with R code for the cost function for linear regression and its gradient, a question1.r file with some preliminary R code towards answering question 1, and the simulated_data.csv file on which simple linear regression will be fitted.

As part of your answer, you will edit the optimizers.r file to include your R code for the optimizers of this project, and you will also edit the file question1.r to include your R code for running the optimizers and simulations of question 1. You will not provide any output files or a report as part of your project submission.

Make sure that every optimizer in optimizers.r is a standalone R function with a return statement

return(list(theta=theta, cost=cost))

as in the provided gd() function, to store the iterations of the parameter estimates and of the cost function.

You need to make sure before submission that the R command source("question1.r") runs successfully and stores the required results in output files as explained further below.

question1.r starts with the lines

source("optimizers.r")
source("cost.r")

to load the R code for the cost function for linear regression and its gradient, as well as the R code for the optimizers. The subsequent line in question1.r loads the data for question 1:

simulated_data <- read.csv("simulated_data.csv", header=TRUE)

simulated_data.csv consists of two columns; the first column is labelled as y and represents the output variable, while the second column is labelled as covariates and plays the role of the covariate. The remaining lines in question1.r are R comments meant to guide you towards answering question 1.

Firstly, standardize the covariate and create the design matrix X, including a first column of ones for the intercept.

Fit a simple linear regression to the data using the standardized covariate and the lm() function in R. The model will have two parameters, namely a regression coefficient corresponding to the covariate and an intercept.
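As a minimal sketch of these two steps, assuming simulated_data has been loaded as shown earlier and using the column labels y and covariates; the object name lm_fit is an illustrative choice.

z <- scale(simulated_data$covariates)   # standardized covariate
X <- cbind(1, z)                        # design matrix with a first column of ones
y <- simulated_data$y

lm_fit <- lm(y ~ z)   # reference fit that the optimizers should approach
coef(lm_fit)          # intercept and slope estimates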

Use the same initial value θ^(0) = (7, -8)^T for all optimizers.

Run 100 iterations of GD, tuning the learning rate a of equation (1) to converge towards the regression coefficient estimates obtained via lm().

Run 100 iterations of SGD by sampling a subset of s = 100 data points per iteration. Tune the learning rate a of equation (2) to converge towards the regression coefficient estimates obtained via lm(). Notice that even for such a simple example, SGD obtains an approximation to the regression coefficient estimates known from lm(). Although SGD does not yield an exact solution to the optimization problem, it provides a reasonable approximation.

Run 100 iterations of MSGD by sampling a subset of s = 100 data points per iteration. Set the initial value for the momentum to be a vector of zeros, that is, set m^(0) = (0, 0)^T. Tune the learning rate a of equation (4) and the memory factor b of equation (3) to converge towards the regression coefficient estimates obtained via lm().

Run 100 iterations of NAGSGD by sampling a subset of s = 100 data points per iteration. Set the initial value for the momentum to be a vector of zeros, that is, set m^(0) = (0, 0)^T. Tune the learning rate a of equation (6) and the memory factor b of equation (5) to converge towards the regression coefficient estimates obtained via lm().

Run 100 iterations of AdaGrad by sampling a subset of s = 100 data points per iteration. Set diag(G^(0)) = (0, 0)^T and ε = 1e-8. Tune the learning rate a of equation (9) to converge towards the regression coefficient estimates obtained via lm().

Run 100 iterations of RMSProp by sampling a subset of s = 100 data points per iteration. Set v^(0) = (0, 0)^T and ε = 1e-8. Tune the learning rate a of equation (12) and the memory factor c of equation (11) to converge towards the regression coefficient estimates obtained via lm().

Run 100 iterations of Adam by sampling a subset of s = 100 data points per iteration. Set m^(0) = (0, 0)^T, v^(0) = (0, 0)^T and ε = 1e-8. Tune the learning rate a of equation (17), the memory factor b of equation (13) and the memory factor c of equation (14) to converge towards the regression coefficient estimates obtained via lm().

Your code in question1.r will store numerical summaries in CSV files and visual summaries in PDF files as follows:

(a) Save the design matrix X in the comma-separated value file answer1a.csv.

(b) Save the parameter estimates obtained via lm() and the final parameter estimates at the last iteration of each optimizer in the comma-separated value file answer1b.csv. The parameter estimates saved in answer1b.csv correspond to a 2 × 8 matrix, in which the first and second row represent the respective regression coefficient estimates θ0 and θ1, while the eight columns represent the parameter estimates obtained via lm(), GD, SGD, MSGD, NAGSGD, AdaGrad, RMSProp and Adam. R code in the form of comments has been provided in question1.r as guidance to facilitate the process of saving all parameter estimates.

(c) Save the value of the cost function at the last iteration of each optimizer in the comma-separated value file answer1c.csv. The cost function values saved in answer1c.csv correspond to a 1 × 7 matrix, in which the seven columns represent the cost function values obtained via GD, SGD, MSGD, NAGSGD, AdaGrad, RMSProp and Adam. R code in the form of comments has been provided in question1.r as guidance to facilitate the process of saving all cost function values.

(d) Plot the cost function of NAGSGD and of AdaGrad against the number of iterations in a single plot. Ensure that it is possible to distinguish between the overlaid cost function values of NAGSGD and of AdaGrad. Add a legend to the plot to make such a distinction clearer. Save the plot in the PDF file answer1d.pdf. R code in the form of comments has been provided in question1.r as guidance to facilitate the process of saving the plot of cost function values.

(e) Plot the successive iterations of parameter θ0 against the successive iterations of parameter θ1 for GD and RMSProp. Moreover, add a point in the plot indicating the parameter estimate obtained via lm(). All optimizers started from the same initial parameter value θ^(0) = (7, -8)^T, so the trajectories of GD and RMSProp must have the same starting point in the plot. Moreover, the end points of the trajectories of GD and RMSProp must be in near proximity to the single point representing the parameter estimate obtained via lm() if the optimizers were tuned appropriately. Ensure that it is possible to distinguish between the overlaid trajectories of GD and RMSProp. Add a legend to the plot to make such a distinction clearer. Save the plot in the PDF file answer1e.pdf. R code in the form of comments has been provided in question1.r as guidance to facilitate the process of saving the plot of parameter trajectories. A minimal saving sketch follows this list.
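The sketch below illustrates one way the outputs of items (a), (b) and (e) might be saved; the object names gd_out, rmsprop_out and lm_fit are illustrative assumptions, standing for result lists returned in the return(list(theta=theta, cost=cost)) format and for the lm() fit.

write.csv(X, file = "answer1a.csv", row.names = FALSE)   # item (a)

# Item (b): bind lm() estimates with the final iteration of each optimizer;
# extend the cbind() call with the remaining optimizers in the same way.
estimates <- cbind(coef(lm_fit),
                   tail(gd_out$theta, 1)[1, ],
                   tail(rmsprop_out$theta, 1)[1, ])
write.csv(estimates, file = "answer1b.csv", row.names = FALSE)

# Item (e): overlaid parameter trajectories with a legend.
pdf("answer1e.pdf")
plot(gd_out$theta[, 1], gd_out$theta[, 2], type = "l", col = "blue",
     xlab = expression(theta[0]), ylab = expression(theta[1]))
lines(rmsprop_out$theta[, 1], rmsprop_out$theta[, 2], col = "red")
points(coef(lm_fit)[1], coef(lm_fit)[2], pch = 19)
legend("topright", legend = c("GD", "RMSProp"), col = c("blue", "red"), lty = 1)
dev.off()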

Upload only the R scripts cost.r, optimizers.r and question1.r. Make sure that the R command source("question1.r") works free from bugs and that it generates the output files answer1a.csv, answer1b.csv, answer1c.csv, answer1d.pdf and answer1e.pdf. However, do not upload the output files themselves. I will assess your work by running the R command source("question1.r") and by looking into the generated output files.

Question 2

The goal of question 2 is to run the stochastic optimizers of this project to fit multiple linear regression on a real dataset of relatively small size. Given that you have already developed the R code for stochastic optimization to tackle question 1, the only additional work required for tackling question 2 is to tune the hyperparameters of the optimizers.

The files cost.r and optimizers.r should be available after having answered question 1. Create an R script named question2.r to include your R code for running the optimizers and simulations of question 2. You will not provide any output files or a report as part of your project submission.

You need to make sure before submission that the R command source("question2.r") runs successfully and stores the required results in output files as explained further below.

Similarly to question1.r, the file question2.r will start with the lines

source("optimizers.r")
source("cost.r")

to load the R code for the cost function for linear regression and its gradient, as well as the R code for the optimizers. All subsequent R code for answering question 2 should be saved in question2.r.

Start by loading the data file weather_data.csv, whose first line is the header containing the column labels. For the purposes of this project, it suffices to know that the column labelled apparent temperature is the output variable, while the columns labelled temperature, humidity, wind speed, wind bearing and pressure will be used as covariates in the multiple linear regression model.

The dataset weather_data.csv can be found on Kaggle at

https://www.kaggle.com/budincsevity/szeged-weather

There is no need to use Kaggle in order to address question 2. If you decide to look up the dataset on Kaggle, keep in mind that the original column labels have been altered to make R coding more user-friendly.

Firstly, standardize all five covariates and create the design matrix X, including a first column of ones for the intercept.

Fit a multiple linear regression to the data using the standardized covariates and the lm() function in R. The model will have six parameters, namely five regression coefficients corresponding to the covariates and an intercept.

Use the same initial value θ^(0) = (-5, -3, 4, 1, 10, -9)^T for all optimizers.

Run 70 iterations of GD, tuning the learning rate a of equation (1) to converge towards the regression coefficient estimates obtained via lm().

Run 200 iterations of SGD by sampling a subset of s = 1000 data points per iteration. Tune the learning rate a of equation (2) to converge towards the regression coefficient estimates obtained via lm(). Notice that even for such a simple example, SGD obtains an approximation to the regression coefficient estimates known from lm(). Although SGD does not yield an exact solution to the optimization problem, it provides a reasonable approximation.

Run 200 iterations of MSGD by sampling a subset of s = 1000 data points per iteration. Set the initial value for the momentum to be a vector of zeros, that is, set m^(0) = (0, 0, 0, 0, 0, 0)^T. Tune the learning rate a of equation (4) and the memory factor b of equation (3) to converge towards the regression coefficient estimates obtained via lm().

Run 200 iterations of NAGSGD by sampling a subset of s = 1000 data points per iteration. Set the initial value for the momentum to be a vector of zeros, that is, set m^(0) = (0, 0, 0, 0, 0, 0)^T. Tune the learning rate a of equation (6) and the memory factor b of equation (5) to converge towards the regression coefficient estimates obtained via lm().

Run 200 iterations of AdaGrad by sampling a subset of s = 1000 data points per iteration. Set diag(G^(0)) = (0, 0, 0, 0, 0, 0)^T and ε = 1e-8. Tune the learning rate a of equation (9) to converge towards the regression coefficient estimates obtained via lm().

Run 200 iterations of RMSProp by sampling a subset of s = 1000 data points per iteration. Set v^(0) = (0, 0, 0, 0, 0, 0)^T and ε = 1e-8. Tune the learning rate a of equation (12) and the memory factor c of equation (11) to converge towards the regression coefficient estimates obtained via lm().

Run 200 iterations of Adam by sampling a subset of s = 1000 data points per iteration. Set m^(0) = (0, 0, 0, 0, 0, 0)^T, v^(0) = (0, 0, 0, 0, 0, 0)^T and ε = 1e-8. Tune the learning rate a of equation (17), the memory factor b of equation (13) and the memory factor c of equation (14) to converge towards the regression coefficient estimates obtained via lm().

Your code in question2.r will store numerical summaries in CSV files and visual summaries in PDF files as follows:

(a) Save the design matrix X in the comma-separated value file answer2a.csv.

(b) Save the parameter estimates obtained via lm() and the final parameter estimates at the last iteration of each optimizer in the comma-separated value file answer2b.csv. The parameter estimates saved in answer2b.csv correspond to a 6 × 8 matrix, in which each row represents one of the six regression coefficient estimates, while the eight columns represent the parameter estimates obtained via lm(), GD, SGD, MSGD, NAGSGD, AdaGrad, RMSProp and Adam.


(c) Save the value of the cost function at the last iteration of each optimizer in the comma-separated value file answer2c.csv. The cost function values saved in answer2c.csv correspond to a 1 × 7 matrix, in which the seven columns represent the cost function values obtained via GD, SGD, MSGD, NAGSGD, AdaGrad, RMSProp and Adam.

(d) Plot the cost function of NAGSGD and of AdaGrad against the number of iterations in a single plot. Ensure that it is possible to distinguish between the overlaid cost function values of NAGSGD and of AdaGrad. Add a legend to the plot to make such a distinction clearer. Save the plot in the PDF file answer2d.pdf.

(e) Plot the successive iterations of parameter θ2 against the successive iterations of parameter θ3 for GD, MSGD and AdaGrad, recalling that the indexing notation for parameters has been set to start from zero for the intercept θ0. Moreover, add a point in the plot indicating the parameter estimate (θ2, θ3) obtained via lm(). All optimizers started from the same initial parameter value θ^(0) = (-5, -3, 4, 1, 10, -9)^T, so the trajectories of GD, MSGD and AdaGrad must have the same starting point in the plot. Moreover, the end points of the trajectories of GD, MSGD and AdaGrad must be in near proximity to the single point representing the parameter estimate obtained via lm() if the optimizers were tuned appropriately. Ensure that it is possible to distinguish between the overlaid trajectories of GD, MSGD and AdaGrad. Add a legend to the plot to make such a distinction clearer. Save the plot in the PDF file answer2e.pdf.

Upload only the R scripts cost.r, optimizers.r and question2.r. Make sure that the R command source("question2.r") works free from bugs and that it generates the output files answer2a.csv, answer2b.csv, answer2c.csv, answer2d.pdf and answer2e.pdf. However, do not upload the output files themselves. I will assess your work by running the R command source("question2.r") and by looking into the generated output files.

Question 3 (not assessed)

Question 3 will not be assessed; it is provided for educational purposes. If you do not tackle question 3, you will not be penalized. However, it is advised that you try to answer question 3 in order to experience the computational advantage of stochastic optimization over gradient descent in the case of big datasets.

The goal of question 3 is to run the stochastic optimizers of this project to fit multiple linear regression on a relatively big dataset. Given that you have already developed the R code for stochastic optimization to tackle question 1, the only additional work required for tackling question 3 is to handle data processing efficiently and to tune the hyperparameters of the optimizers.

The dataset uci_data.csv used in question 3 is the same subset of the million song dataset used in lab 1. It contains 515,345 songs (file rows) and 91 columns. The first column is the year of release, ranging from 1922 to 2011, which plays the role of the output variable. The remaining 90 columns are continuous predictors of the year of release. uci_data.csv is included in the template folder. More information about this subset of the million song dataset is available at

http://archive.ics.uci.edu/ml/datasets/YearPredictionMSD

You will observe the computational cost of fine-tuning optimization given a relatively big dataset. At the same time, you will see how stochastic optimization reduces runtime dramatically in comparison to gradient descent. Moreover, you will see that stochastic optimization attains reasonable parameter estimates in a problem somewhat less trivial in terms of data space and parameter space dimensionality (nd = 515,345 and nθ = 91).

The files cost.r and optimizers.r should be available after having answered question 1. Create an R script named question3.r to include your R code for running the optimizers and simulations of question 3.

The R command source("question3.r") will store the results in output files as explained further below. The anticipated runtime of the R command source("question3.r") is no more than fifteen minutes on a computer in one of the labs of the Mathematics and Statistics building.

Similarly to question1.r, the file question3.r will start with the lines

source("optimizers.r")
source("cost.r")

to load the R code for the cost function for linear regression and its gradient, as well as the R code for the optimizers. All subsequent R code for answering question 3 can be saved in question3.r.

Start by loading the data file uci_data.csv using the fread() function of the data.table package in R. Do not use the read.csv() function, which is prohibitively slow for the needs of question 3. The dataset uci_data.csv does not contain a header.
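A minimal loading sketch is given below; the column assignment assumes, as described above, that the first column is the year of release and the remaining 90 columns are the covariates, and the object names are illustrative.

library(data.table)

uci_data <- fread("uci_data.csv", header = FALSE)
y <- uci_data[[1]]                      # year of release (output variable)
covariates <- as.matrix(uci_data)[, -1] # 90 continuous predictors
X <- cbind(1, scale(covariates))        # standardize and add a first column of ones
npars <- ncol(X)                        # 91 parameters, used by the rnorm() call below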

Firstly, standardize all 90 covariates and create the design matrix X, including a first column of ones for the intercept.

Fit a multiple linear regression to the data using the standardized covariates and the lm() function in R. The model will have 91 parameters, namely 90 regression coefficients corresponding to the covariates and an intercept.

Randomly sample the initial value θ^(0) of the parameters by running the R command

theta0 <- rnorm(npars, mean=0, sd=10)

Use the same initial value θ^(0) for all optimizers.

Run 1,000 iterations of GD, tuning the learning rate a of equation (1) to converge towards the regression coefficient estimates obtained via lm().

Run 50,000 iterations of SGD by sampling a subset of s = 500 data points per iteration. Tune the learning rate a of equation (2) to converge towards the regression coefficient estimates obtained via lm().

Run 20,000 iterations of MSGD by sampling a subset of s = 500 data points per iteration. Set the initial value m^(0) for the momentum to be a vector of zeros. Tune the learning rate a of equation (4) and the memory factor b of equation (3) to converge towards the regression coefficient estimates obtained via lm().

Run 20,000 iterations of NAGSGD by sampling a subset of s = 500 data points per iteration. Set the initial value m^(0) for the momentum to be a vector of zeros. Tune the learning rate a of equation (6) and the memory factor b of equation (5) to converge towards the regression coefficient estimates obtained via lm().

Run 20,000 iterations of AdaGrad by sampling a subset of s = 500 data points per iteration. Set diag(G^(0)) to be a vector of zeros and ε = 1e-8. Tune the learning rate a of equation (9) to converge towards the regression coefficient estimates obtained via lm().

Run 20,000 iterations of RMSProp by sampling a subset of s = 500 data points per iteration. Set v^(0) to be a vector of zeros and ε = 1e-8. Tune the learning rate a of equation (12) and the memory factor c of equation (11) to converge towards the regression coefficient estimates obtained via lm().

Run 20,000 iterations of Adam by sampling a subset of s = 500 data points per iteration. Set each of m^(0) and v^(0) to be a vector of zeros and ε = 1e-8. Tune the learning rate a of equation (17), the memory factor b of equation (13) and the memory factor c of equation (14) to converge towards the regression coefficient estimates obtained via lm().

Your code in question3.r will store numerical summaries in CSV files and visual summaries in PDF files as follows:

(a) Save the design matrix X in the comma-separated value file answer3a.csv.

(b) Save the parameter estimates obtained via lm() and the final parameter estimates at the last iteration of each optimizer in the comma-separated value file answer3b.csv. The parameter estimates saved in answer3b.csv correspond to a 91 × 8 matrix, in which each row represents one of the regression coefficient estimates, while the eight columns represent the parameter estimates obtained via lm(), GD, SGD, MSGD, NAGSGD, AdaGrad, RMSProp and Adam.

(c) Save the value of the cost function at the last iteration of each optimizer in the comma-separated value file answer3c.csv. The cost function values saved in answer3c.csv correspond to a 1 × 7 matrix, in which the seven columns represent the cost function values obtained via GD, SGD, MSGD, NAGSGD, AdaGrad, RMSProp and Adam.


(d) Plot the cost function of RMSProp against the number of iterations. Save the plot in the PDF file answer3d.pdf.

(e) Plot the cost function of RMSProp for iterations between 15,000 and 20,000 against the number of iterations. Save the plot in the PDF file answer3e.pdf. Confine the vertical axis of the plot in answer3e.pdf to a narrower range than the vertical axis of the plot in answer3d.pdf, so that the former plot appears visibly less smooth and more volatile than the latter plot.

(f) Plot the cumulative mean of the cost function of RMSProp against the number of iterations. Save the plot in the PDF file answer3f.pdf.

(g) Plot the successive iterations of parameter θ0 against the successive iterations of parameter θ1 for GD, MSGD and RMSProp, recalling that the indexing notation for parameters has been set to start from zero for the intercept θ0. Moreover, add a point in the plot indicating the parameter estimate (θ0, θ1) obtained via lm(). All optimizers started from the same initial parameter value θ^(0), so the trajectories of GD, MSGD and RMSProp must have the same starting point in the plot. Moreover, the end points of the trajectories of GD, MSGD and RMSProp must be in near proximity to the single point representing the parameter estimate obtained via lm() if the optimizers were tuned appropriately. Ensure that it is possible to distinguish between the overlaid trajectories of GD, MSGD and RMSProp. Add a legend to the plot to make such a distinction clearer. Save the plot in the PDF file answer3g.pdf.

(h) Plot the successive iterations of parameter θ0 against the successive iterations of parameter θ9 for GD, NAGSGD and AdaGrad, recalling that the indexing notation for parameters has been set to start from zero for the intercept θ0. Moreover, add a point in the plot indicating the parameter estimate (θ0, θ9) obtained via lm(). All optimizers started from the same initial parameter value θ^(0), so the trajectories of GD, NAGSGD and AdaGrad must have the same starting point in the plot. Moreover, the end points of the trajectories of GD, NAGSGD and AdaGrad must be in near proximity to the single point representing the parameter estimate obtained via lm() if the optimizers were tuned appropriately. Ensure that it is possible to distinguish between the overlaid trajectories of GD, NAGSGD and AdaGrad. Add a legend to the plot to make such a distinction clearer. Save the plot in the PDF file answer3h.pdf.

(i) Record the runtime of each simulation for each optimizer using the system.time() function in R. system.time() returns three different times, labelled user, system and elapsed. Make use of the times labelled user to answer this sub-question. Create a 7 × 2 table, in which every row corresponds to one of the seven optimizers, while the first column provides the runtime of each optimizer. The second column will provide the relative speed-up of each optimizer, defined as the ratio of the runtime of the optimizer divided by the runtime of GD. Save this 7 × 2 table in the comma-separated value file answer3i.csv.
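A minimal timing sketch for item (i) is given below; the optimizer calls, their arguments and the tuning values a_gd and a_sgd are placeholders for your own implementations, not functions defined by the template.

gd_time  <- system.time(gd_out  <- gd(theta0, X, y, a = a_gd, niters = 1000))[["user.self"]]
sgd_time <- system.time(sgd_out <- sgd(theta0, X, y, a = a_sgd, s = 500, niters = 50000))[["user.self"]]
# ... time the remaining five optimizers in the same way ...

runtimes <- c(GD = gd_time, SGD = sgd_time)   # extend with the other optimizers
relative <- runtimes / runtimes["GD"]          # runtime of each optimizer divided by the GD runtime
answer3i <- cbind(runtime = runtimes, relative = relative)
write.csv(answer3i, file = "answer3i.csv")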

The R command source("question3.r") will generate the output files answer3a.csv,

answer3b.csv, answer3c.csv, answer3d.pdf, answer3e.pdf, answer3f.pdf, answer3g.pdf

and answer3i.csv. Do not upload the output files themselves. Moreover, you do not need

to upload source("question3.r") as part of your submission, since question 3 will not be

assessed.

Project submission

Your answers must be uploaded to the relevant section of the Moodle page for Big Data Analytics by 4:00pm on Wednesday 6 February 2019. You will upload a single folder, using your matriculation number as the name of the folder. The folder will contain the files question1.r, question2.r, optimizers.r, cost.r and declaration.pdf.

Do not upload any output files or any files other than the ones required.

I will assess your work by running the following sequence of commands:

source("question1.r")
source("question2.r")

Make sure that these commands work before submitting your answers. If any of these commands do not run, or if they do not generate the required output files as explained in the project description, marks will be deducted.

The declaration of originality form declaration.pdf is available in the template folder. Make sure that you sign declaration.pdf before uploading it.

Make sure that you submit your work on Moodle before the deadline. Late submissions will receive a penalty of 10%.

