
Computer Science Department - Rutgers University Spring 2024

Homework 1: MNIST Architecture and Hyperparameter Search 01:198:462

In this assignment, we’re taking a slightly backwards approach to building and training neural networks. Specifically, I want to look at the effects of tweaks, changes, and tricks we can apply to neural networks to get good performance, rather than the more standard problem of ‘here’s a data set, let’s fit a model to it’. As such, I am not holding to a rigorous standard of data validation and management here.

Note: All problems below are in the context of the MNIST data set, so you must load that as your testing and training set for all your models.

1 Breadth vs Depth

The most fundamental question when it comes to neural networks is ‘how many nodes and how many layers’. I want to try to address the tradeoff in this section. Here, we’ll be building networks to classify the MNIST data set, and look at the tradeoff in network shapes.

Problem 1: A network to classify the MNIST data set will have 28 × 28 = 784 input nodes, and 10 output nodes (one for each of the ten classes). How many parameters would a linear softmax model with no hidden layers have? If a model had k ≥ 1 hidden layers, and m nodes in each hidden layer, how many parameters would it have? Call this function Params(k, m).

Note: I am looking for a mathematical function here in terms of m and k, not a coded solution.
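Although the problem asks for a closed-form expression, a direct count in code can serve as a sanity check for whatever formula you derive. The following is a minimal sketch assuming fully connected layers with biases; the function name `params` is mine, mirroring Params(k, m):

```python
def params(k, m, n_in=784, n_out=10):
    """Weights + biases of a dense net with k hidden layers of m nodes each."""
    if k == 0:
        return n_in * n_out + n_out            # linear softmax baseline
    total = n_in * m + m                       # input -> first hidden layer
    total += (k - 1) * (m * m + m)             # hidden -> hidden layers
    total += m * n_out + n_out                 # last hidden -> output layer
    return total
```

For instance, `params(0, 0)` gives 7850 for the plain linear softmax model (784 × 10 weights plus 10 biases).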

Problem 2: For a given number of parameters P, what are the smallest and largest values of k such that Params(k, m) = P for some m? Call this max k value kP.

Note: What should you do when kP is not an integer?
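Since Params(k, m) is quadratic in m for fixed k ≥ 2 (and linear for k = 1), one way to invert it is the quadratic formula plus rounding. This is a sketch that assumes the bias-inclusive count, under which Params(k, m) = (k−1)m² + (n_in + n_out + k)m + n_out; the name `m_for` is mine:

```python
import math

def m_for(P, k, n_in=784, n_out=10):
    """Nodes per layer so that Params(k, m) is as close to P as possible."""
    a, b, c = k - 1, n_in + n_out + k, n_out - P
    if a == 0:                       # k = 1: Params is linear in m
        m = -c / b
    else:                            # k >= 2: positive root of the quadratic
        m = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)
    return max(1, round(m))          # round to the nearest whole node count
```

Rounding here is also one concrete answer to the note above: a non-integer solution gets snapped to the nearest valid layer width.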

One difficulty in comparing network shapes is making the comparison fair. We can say that a comparison between a network of k layers and m nodes per layer, and a network of k′ layers and m′ nodes per layer, is ‘fair’ if both networks have the same (or approximately the same) total number of parameters.


Problem 3: For a given number of parameters P, let mP (k) be the number of nodes per layer such that Params(k, m) = P (or as close to P as possible). A network with k layers and mP (k) nodes per layer should therefore have approximately P total parameters. For such a network, we can define trainLoss(k, P) and testLoss(k, P), for each k from 1 to kP .

• Identify 10 values of P, sufficiently distinct so that the resulting networks differ in shape and scale. How did you make your choice? Use the linear softmax model as a baseline.

  Note, because of the integer/rounding issues, your P values should be distinct enough so that you don’t accidentally create networks with the same m and k for two distinct P values.

• Plot (overlaying the curves for each P) trainLoss(k, P), with the x-axis as k/kP from 0 to 1.

  Note, you probably do not want to test every possible k value, due to time constraints. But test enough k values so that the trend is clear.

• Plot (overlaying the curves for each P) testLoss(k, P), with the x-axis as k/kP from 0 to 1.

• How do the results compare to the baseline performance of the linear softmax model?

• What do you notice about the underlying trends? Is there a point where layers become too narrow to be useful, and if so, where is it? What seems to be the sweet spot, if any, for network shape? How does it depend on P?

• Create a plot showing (overlaying the curves for each P) total training time in terms of passes through the data. Set the x-axis to go over k/kP for ease of comparison. Does this change your assessment of the network shape tradeoff at all?

Bonus:

• Is total parameters P a fair comparison point? Try to find a better one. Justify it.

• Does introducing regularization (weight decay) or normalization layers help?
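One way to run these experiments is to generate each network shape programmatically rather than by hand. A sketch in PyTorch; the choice of ReLU and the function name `make_mlp` are mine, not prescribed by the assignment:

```python
import torch.nn as nn

def make_mlp(k, m, n_in=784, n_out=10):
    """An MLP with k hidden layers of m nodes each (ReLU assumed)."""
    layers, width = [], n_in
    for _ in range(k):
        layers += [nn.Linear(width, m), nn.ReLU()]
        width = m
    layers.append(nn.Linear(width, n_out))  # raw logits; softmax lives in the loss
    return nn.Sequential(*layers)
```

Pairing this with your mP (k) function, a single loop over (P, k) pairs produces every model in the sweep.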

Problem 4: For a P of your choice and the optimal network shape as determined above, try to find an even better network shape (layers of unequal size, for instance) that gives better results for the same (approximate) total number of parameters. Is it better to have uniform layers? Layers of decreasing size? Increasing size? Experiment with it, and summarize your results. You may want to save the best model you find.

Bonus: Does regularization help?


2 Training Effects: Activation Functions, Optimizers, Batch Size

Problem 5: For the model structure you chose in Problem 4, consider the problem of training, but with different batch sizes. Should you train with a small batch size, or a large batch size? Experiment with ten batch sizes (from b = 1 to b = something sufficiently large), and plot (as a function of training time) training loss, testing loss, and clock time. What do you notice, and what does this say about a good choice of batch size?
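Measuring loss and wall-clock time per epoch at a given batch size might be sketched as below. For the real experiment you would pass the MNIST training set (e.g. via torchvision); here the helper works on any classification dataset, and the function name `epoch_stats` is mine:

```python
import time
import torch
from torch import nn
from torch.utils.data import DataLoader

def epoch_stats(model, dataset, batch_size, lr=0.1):
    """One SGD epoch at a given batch size; returns (mean loss, wall-clock seconds)."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    start, total = time.perf_counter(), 0.0
    for x, y in loader:
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
        total += loss.item() * len(x)   # weight by batch size for a true mean
    return total / len(dataset), time.perf_counter() - start
```

Calling this once per epoch, for each candidate batch size, yields the three curves the problem asks for.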

Problem 6: Repeat Problem 5, but comparing Adam optimization vs SGD optimization. What are the tradeoffs? What determines a good stepsize?
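To keep the comparison clean, only the optimizer construction should change between runs; everything else (model, data, batch size) stays fixed. A small sketch; as a rough, hedged starting point, step sizes around 1e-1 for SGD and 1e-3 for Adam are common on problems of this scale:

```python
import torch

def make_optimizer(model, name, lr):
    """Build the optimizer by name, keeping every other training choice fixed."""
    opts = {"sgd": torch.optim.SGD, "adam": torch.optim.Adam}
    return opts[name](model.parameters(), lr=lr)
```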

Problem 7: Consider Problem 5 again, but changing the underlying activation function for the model, in particular sigmoid vs tanh vs ReLU vs ELU. Can you draw any conclusions?
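The four activations can be swapped into the same architecture via a lookup table, so the shape is held constant across runs. A sketch; the default widths are placeholders, not the answer to Problem 4:

```python
from torch import nn

ACTIVATIONS = {"sigmoid": nn.Sigmoid, "tanh": nn.Tanh, "relu": nn.ReLU, "elu": nn.ELU}

def mlp_with(act_name, k=2, m=100, n_in=784, n_out=10):
    """The same network shape every run; only the nonlinearity changes."""
    act = ACTIVATIONS[act_name]
    layers, width = [], n_in
    for _ in range(k):
        layers += [nn.Linear(width, m), act()]
        width = m
    layers.append(nn.Linear(width, n_out))
    return nn.Sequential(*layers)
```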

Bonus: What changes if you consider these experiments but run on a GPU?

3 CNNs vs Dense Layers

Problem 8: Consider again the model you built in Problem 4. This was a relatively simple model with vanilla dense layers. Consider constructing a simple CNN model in the following way:

• Pass the input image (28 × 28 × 1) into a convolutional layer and an activation function.

• Flatten the result.

• Pass the flat result into a number of dense layers (and activation functions).

• Pass the result through a softmax layer to get class probabilities.

With a single convolutional layer (kernel size and number at your discretion), find the smallest model you can (in terms of total number of parameters) that ultimately matches or exceeds the performance of the model you found in Problem 4. How did you go about your neural architecture search to answer the question?

For the CNN architecture you find and the original architecture from Problem 4, plot the training and testing loss over training time for comparable batch sizes, step sizes, and optimizer (be clear about the choices you are making).
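The conv → activation → flatten → dense recipe might be sketched as follows in PyTorch. The channel count, kernel size, and dense width below are placeholders to vary during the architecture search, not a recommended answer:

```python
import torch
from torch import nn

class SmallCNN(nn.Module):
    """One conv layer, then a dense stack, matching the recipe in Problem 8."""
    def __init__(self, channels=8, kernel=5, hidden=32):
        super().__init__()
        self.conv = nn.Conv2d(1, channels, kernel)
        self.act = nn.ReLU()
        # No padding/stride tricks assumed: a kernel of size K on a 28x28
        # input leaves a (28 - K + 1) x (28 - K + 1) feature map.
        flat = channels * (28 - kernel + 1) ** 2
        self.dense = nn.Sequential(
            nn.Linear(flat, hidden), nn.ReLU(), nn.Linear(hidden, 10),
        )

    def forward(self, x):                # x: (N, 1, 28, 28)
        z = self.act(self.conv(x))
        return self.dense(z.flatten(1))  # logits; softmax applied by the loss
```

Counting `sum(p.numel() for p in model.parameters())` then gives the parameter budget to compare against Problem 4.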

Problem 9: Consider Problem 8, but you are allowed two stacked convolutional layers (of different kernel sizes / numbers). Can you beat the network from Problem 8, in terms of performance vs parameter count?

Bonus: For the dense model from Problem 4, and the model from Problem 9, find instances where the models fail (incorrectly classifying the image). Are the mistakes being made reasonable, to your eye? Are the models making different kinds of mistakes?


4 Auto-Encoders

An auto-encoder is a model structure composed of two parts (we will discuss this more in class):

• The Encoder: a network that maps the input down into some smaller dimensional bottleneck.

• The Decoder: a network that maps the output of the encoder back up to the same dimension and shape as the input.

We could think of implementing this in two ways: either (mathematically) a network F(x) = z that maps into some small dimension, where z is the ‘encoded’ version of x, and then another network G that maps G(z) = x, to ‘decode’ z back to x; or (in code) a network structure that outputs both the final output layer and the value of a smaller intermediate layer, together.

An auto-encoder is an example of unsupervised learning. There is no desired output or ‘task’ to solve - we want to learn an auto-encoder capable of encoding the input down and reconstructing it from its ‘compressed’ form. We train the model by minimizing the ‘reconstruction loss’,

\sum_{i=1}^{N} \| G(F(x_i)) - x_i \|^2 ,    (1)

comparing the ‘reconstructed’ input to the actual input.

The point of this is that by ‘compressing’ the input in a way that nonetheless preserves relevant information (hence decompression or decoding is possible), the encoder is learning something important or meaningful about the structure of the data. If for instance we cannot recover the original data when compressing it down into three dimensions - it suggests that the original data is more complex than three dimensional data. (Note: A circle for instance is a two dimensional shape on a plane, but is itself only one dimensional, since it can be thought of just in terms of a single angle.)

Once an encoder has been learned, the encoded vector F(x_i) = z_i can often be used as a meaningful replacement for x_i in this representative ‘latent space’.
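One possible concrete shape for F and G is sketched below, with the reconstruction loss of Equation (1) averaged over a batch. The hidden widths are placeholders (the encoder and decoder need not be symmetric), and the class and function names are mine:

```python
import torch
from torch import nn

class AutoEncoder(nn.Module):
    """F maps 784 -> k; G maps k -> 784 (layer widths are placeholders)."""
    def __init__(self, k, hidden=128):
        super().__init__()
        self.F = nn.Sequential(nn.Linear(784, hidden), nn.ReLU(), nn.Linear(hidden, k))
        self.G = nn.Sequential(nn.Linear(k, hidden), nn.ReLU(), nn.Linear(hidden, 784))

    def forward(self, x):
        return self.G(self.F(x))       # reconstruct x from its encoding

def reconstruction_loss(model, x):
    """Mean over the batch of ||G(F(x_i)) - x_i||^2, as in Equation (1)."""
    return ((model(x) - x) ** 2).sum(dim=1).mean()
```

Training then amounts to minimizing `reconstruction_loss` with any optimizer, with no labels involved.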

Problem 10: Consider training an auto-encoder model on the MNIST data, where the total number of parameters in F and G is P/2 (P as in Problem 4), and the dimension of the encoder output is k. You may structure the encoder and decoder layers as you like (they do not have to be symmetric), but the input layer needs to match the (flattened) MNIST data, as does the output layer (i.e., 784 dimensional).

Plot, as a function of k, the mean test and training reconstruction losses. What does this tell you about the ‘dimension’ of the MNIST data set, compared to its ‘raw’ dimension of 784?

As an example of why this might be useful, consider the following: once we have trained an auto-encoder, and specifically an encoder that captures ‘relevant information about the data’ (we know that it does, because reconstruction is possible), we can then utilize that encoded or latent representation elsewhere.


Problem 11: For the optimal k and trained F, G you found in Problem 10, consider building a network in the following way:

• An input layer.

• Compute F of the input.

• Pass the result through a single linear layer, with 10-dimensional output.

• Compute the softmax for classification probabilities.

Note that we are taking F as ‘frozen’ here; the only trainable parameters are meant to be in that last linear layer. How can you construct this in code, specifically with F taken as given, i.e., not trainable?

How does this model, trained for classification, compare to the baseline linear softmax model? Or the comparable parameter sized models you looked at for classification previously? Note with respect to application, we can discount the training of F, as that would’ve been done elsewhere or beforehand - we only have to worry about training that last layer.
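One common way to take F as given in PyTorch is to disable gradients on its parameters and build the optimizer over the new head only. A sketch, with names of my choosing:

```python
import torch
from torch import nn

def frozen_classifier(F, k):
    """Classifier head on a frozen encoder F, where k is F's output dimension."""
    for p in F.parameters():
        p.requires_grad = False      # F is fixed; no gradients flow into it
    head = nn.Linear(k, 10)          # the only trainable parameters
    model = nn.Sequential(F, head)   # softmax handled by CrossEntropyLoss
    return model, head
```

The optimizer would then be constructed as, e.g., `torch.optim.SGD(head.parameters(), lr=0.1)`, so only the head is ever updated.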

5 Bonus, For Fun

Consider a simple network for classifying MNIST, with one hidden layer (number of nodes on this layer at your discretion). Consider training the model with stochastic gradient descent, but with different stepsizes α0 on the parameters going into the hidden layer, and α1 on the parameters coming out of the hidden layer. Create a 2D plot, with α0 on the x-axis and α1 on the y-axis. For every point in the plot, color that point in the following way: color it black if SGD fails to converge, color it (something else) if SGD converges, where the strength of the color depends on the final loss converged to. Zoom in as best you can on the boundary between regions of convergence and regions of failure of convergence. What do you notice? Note, you’ll need to take great care with testing convergence to get good results!
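The plot can be driven by a grid sweep that records the final loss for each (α0, α1) pair, marking non-convergence as NaN (which matplotlib can render as black via a colormap's `set_bad`). Here `train_fn` is a hypothetical trainer you would supply, returning a final loss or a non-finite value on divergence:

```python
import numpy as np

def sweep(train_fn, alphas0, alphas1):
    """Final-loss grid over step-size pairs; NaN marks failure to converge."""
    grid = np.full((len(alphas1), len(alphas0)), np.nan)
    for i, a1 in enumerate(alphas1):
        for j, a0 in enumerate(alphas0):
            loss = train_fn(a0, a1)          # hypothetical: run SGD to completion
            if np.isfinite(loss):
                grid[i, j] = loss
    return grid
```

Zooming in on the convergence boundary is then just re-running `sweep` on a narrower range of α values.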


