Computer Science Department - Rutgers University Spring 2024
Homework 1: MNIST Architecture and Hyperparameter Search 01:198:462
In this assignment, we’re taking a slightly backwards approach to building and training neural networks. Specifically, I want to look at the effects of the tweaks, changes, and tricks we can apply to neural networks to get good performance, rather than the more standard problem of ‘here’s a data set, let’s fit a model to it’. As such, I am not holding to a rigorous standard of data validation and management here.
Note: All problems below are in the context of the MNIST data set, so you must load that as your testing and
training set for all your models.
1 Breadth vs Depth
The most fundamental question when it comes to neural networks is ‘how many nodes and how many layers’. I want
to try to address the tradeoff in this section. Here, we’ll be building networks to classify the MNIST data set, and
look at the tradeoff in network shapes.
Problem 1: A network to classify the MNIST data set will have 28 ∗ 28 = 784 input nodes, and 10 output
nodes (one for each of the ten classes). How many parameters would a linear softmax model with no hidden
layers have? If a model had k ≥ 1 hidden layers, and m nodes in each hidden layer, how many parameters
would it have? Call this function Params(k, m).
Note: I am looking for a mathematical function here in terms of m and k, not a coded solution.
Problem 2: For a given number of parameters P, what are the smallest and largest values of k such that Params(k, m) = P for some m? Call this maximum k value kP.
Note: What should you do when kP is not an integer?
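One way to sanity-check the formulas you derive for Problems 1 and 2 is a quick script. Below is a sketch in pure Python; the function names Params and m_P just mirror the assignment’s notation, and the counts assume a bias term on every layer:

```python
import math

def Params(k, m):
    """Parameter count for k hidden layers of m nodes each on MNIST (784 in, 10 out)."""
    if k == 0:
        return 784 * 10 + 10           # linear softmax baseline: 7850
    return (785 * m                    # input -> first hidden (weights + biases)
            + (k - 1) * (m * m + m)    # hidden -> hidden layers
            + 10 * m + 10)             # last hidden -> output

def m_P(k, P):
    """Nodes per layer so that Params(k, m) is as close to P as possible."""
    if k == 1:
        return round((P - 10) / 795)
    # For k >= 2, solve the quadratic (k-1)m^2 + (794+k)m + (10 - P) = 0 for m > 0,
    # then round to the nearest integer (the rounding issue the note warns about).
    a, b, c = k - 1, 794 + k, 10 - P
    return round((-b + math.sqrt(b * b - 4 * a * c)) / (2 * a))
```

Because of the rounding, Params(k, m_P(k)) only approximates P, which is exactly why distinct P values need to be spaced apart in Problem 3.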
One difficulty in comparing network shapes is making the comparison fair. We can say that a comparison between a network of k layers and m nodes per layer, and a network of k′ layers, is ‘fair’ if both networks have the same (or approximately the same) total number of parameters.
Problem 3: For a given number of parameters P, let mP (k) be the number of nodes per layer such that
Params(k, m) = P (or as close to P as possible). A network with k layers and mP (k) nodes per layer should
therefore have approximately P total parameters. For such a network, we can define trainLoss(k, P) and
testLoss(k, P), for each k from 1 to kP .
• Identify 10 values of P, sufficiently distinct so that the resulting networks vary in shape and scale. How did you make your choice? Use the linear softmax model as a baseline.
Note, because of the integer/rounding issues, your P values should be distinct enough so that
you don’t accidentally create networks with the same m and k for two distinct P values.
• Plot (overlaying the curves for each P) trainLoss(k, P), with the x-axis as k/kP from 0 to 1.
Note, you probably do not want to test every possible k value, due to time constraints. But test
enough k values so that the trend is clear.
• Plot (overlaying the curves for each P) testLoss(k, P), with the x-axis as k/kP from 0 to 1.
• How do the results compare to the baseline performance of the linear softmax model?
• What do you notice about the underlying trends? Is there a point where layers become too narrow to
be useful, and if so, where is it? What seems to be the sweet spot, if any, for network shape? How does
it depend on P?
• Create a plot showing (overlaying the curves for each P) total training time in terms of passes through
the data. Set the x-axis to go over k/kP for ease of comparison. Does this change your assessment of
the network shape tradeoff at all?
Bonus:
• Is total parameters P a fair comparison point? Try to find a better one. Justify it.
• Does introducing regularization (weight decay) or normalization layers help?
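A minimal sketch of how a k-layer, m-node network for this sweep might be constructed, assuming PyTorch and ReLU activations (the helper name make_mlp is ours; its parameter count can be checked against your Params(k, m) from Problem 1):

```python
import torch.nn as nn

def make_mlp(k, m):
    """k hidden layers of m nodes each, ReLU activations, for flattened MNIST."""
    layers = [nn.Flatten(), nn.Linear(784, m), nn.ReLU()]
    for _ in range(k - 1):
        layers += [nn.Linear(m, m), nn.ReLU()]
    layers.append(nn.Linear(m, 10))    # logits; softmax is folded into the loss
    return nn.Sequential(*layers)

# Cross-check against the closed-form count from Problem 1:
total = sum(p.numel() for p in make_mlp(3, 64).parameters())
```

With k = 3 and m = 64, the count works out to 785·64 + 2·(64² + 64) + 10·64 + 10 = 59210.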
Problem 4: For a P of your choice and the optimal network shape as determined above - try to find
an even better network shape (layers of unequal size, for instance) that gives better results for the same
(approximate) total number of parameters. Is it better to have uniform layers? Layers of decreasing size? Increasing size? Experiment with it, and summarize your results. You may want to save the best model you find.
Bonus: Does regularization help?
2 Training Effects: Activation Functions, Optimizers, Batch Size
Problem 5: For the model structure you chose in Problem 4, consider the problem of training, but with
different batch sizes. Should you train with a small batch size, or a large batch size? Experiment with ten
batch sizes (from b = 1 to b = something sufficiently large), and plot (as a function of training time), training
loss, testing loss, and clock time. What do you notice, and what does this say about a good choice of batch
size?
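One way such a batch-size experiment might be scaffolded, assuming PyTorch (a sketch with synthetic stand-in data; in the assignment you would swap in the MNIST loaders and your Problem 4 model):

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in data; replace with the flattened MNIST training set.
data = TensorDataset(torch.randn(512, 784), torch.randint(0, 10, (512,)))
model = torch.nn.Sequential(torch.nn.Linear(784, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 10))
loss_fn = torch.nn.CrossEntropyLoss()

timings = {}
for b in [1, 8, 64, 512]:                        # a subset of the ten batch sizes
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    start = time.perf_counter()
    for x, y in DataLoader(data, batch_size=b, shuffle=True):  # one epoch
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    timings[b] = time.perf_counter() - start     # wall-clock time per epoch
```

Recording train/test loss at each pass alongside these timings gives the three curves the problem asks for.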
Problem 6: Repeat Problem 5, but comparing Adam optimization vs SGD optimization. What are the
tradeoffs? What determines a good stepsize?
Problem 7: Consider Problem 5 again, but changing the underlying activation function for the model, in particular sigmoid vs tanh vs ReLU vs ELU. Can you draw any conclusions?
Bonus: What changes if you consider these experiments but run on a GPU?
3 CNNs vs Dense Layers
Problem 8: Consider again the model you built in Problem 4. This was a relatively simple model with
vanilla dense layers. Consider constructing a simple CNN model in the following way:
• Pass the input image (28 x 28 x 1) into a convolutional layer and an activation function.
• Flatten the result.
• Pass the flat result into a number of dense layers (and activation functions).
• Pass the result through a softmax layer to get class probabilities.
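The structure above might be sketched as follows, assuming PyTorch; the kernel size (3) and filter count (8) are placeholders for your own architecture search:

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3),    # 28x28x1 -> 26x26x8 (no padding)
    nn.ReLU(),
    nn.Flatten(),                      # -> 8 * 26 * 26 = 5408
    nn.Linear(8 * 26 * 26, 32),
    nn.ReLU(),
    nn.Linear(32, 10),                 # logits; in practice the softmax is
)                                      # folded into CrossEntropyLoss
out = cnn(torch.zeros(2, 1, 28, 28))   # a batch of 2 images -> shape (2, 10)
```

Note that the flatten dimension, and hence the dense layer’s parameter count, depends directly on the kernel size and filter count, which is where most of the parameter budget goes.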
With a single convolutional layer (kernel size and number at your discretion), find the smallest model you can
(in terms of total number of parameters) that ultimately matches or exceeds the performance of the model
you found in Problem 4. How did you go about your neural architecture search to answer the question?
For the CNN architecture you find and the original architecture from Problem 4, plot the training and testing
loss over training time for comparable batch sizes, step sizes, and optimizer (be clear about the choices you
are making).
Problem 9: Consider Problem 8, but you are allowed two stacked convolutional layers (of different kernel
sizes / numbers). Can you beat the network from Problem 8, in terms of performance vs parameter count?
Bonus: For the dense model from Problem 4, and the model from Problem 9, find instances where the models
fail (incorrectly classifying the image). Are the mistakes being made reasonable, to your eye? Are the models
making different kinds of mistakes?
4 Auto-Encoders
An auto-encoder is a model structure composed of two parts (we will discuss this more in class):
• The Encoder: a network that maps the input down into some smaller dimensional bottleneck.
• The Decoder: a network that maps the value of the output of the encoder back up to the same dimension and
shape as the input.
We could think of implementing this in two ways. Mathematically: a network F(x) = z that maps into some small dimension, where z is the ‘encoded’ version of x, and then another network G that maps G(z) = x, to ‘decode’ z back to x. Alternatively, in code: you can consider a network structure that outputs both the final output layer and the value of a smaller intermediate layer together.
An auto-encoder is an example of unsupervised learning. There is no desired output or ‘task’ to solve - we want
to learn an auto-encoder capable of encoding the input down and reconstructing it from its ‘compressed’ form. We
train the model by minimizing the ‘reconstruction loss’,
Σ_{i=1}^{N} ||G(F(x_i)) − x_i||^2, (1)
comparing the ‘reconstructed’ input to the actual input.
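A minimal sketch of this setup, assuming PyTorch; the bottleneck dimension k and the hidden widths (128) are illustrative placeholders, not a recommendation:

```python
import torch
import torch.nn as nn

k = 32   # bottleneck dimension (the k of Problem 10)

# Encoder F and decoder G as separate modules (they need not be symmetric).
F = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, k))
G = nn.Sequential(nn.Linear(k, 128), nn.ReLU(), nn.Linear(128, 784))

x = torch.rand(16, 784)                # a stand-in batch of flattened images
recon = G(F(x))                        # encode down to k dims, decode back up
loss = ((recon - x) ** 2).sum()        # the reconstruction loss of Eq. (1)
loss.backward()                        # trains F and G jointly
```

Training proceeds exactly as for the supervised models, except the target is the input itself.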
The point of this is that by ‘compressing’ the input in a way that nonetheless preserves relevant information (hence
decompression or decoding is possible), the encoder is learning something important or meaningful about the structure
of the data. If for instance we cannot recover the original data when compressing it down into three dimensions - it
suggests that the original data is more complex than three dimensional data. (Note: A circle for instance is a two
dimensional shape on a plane, but is itself only one dimensional, since it can be thought of just in terms of a single
angle.)
Once an encoder has been learned, the encoded vector F(x_i) = z_i can often be used as a meaningful replacement for x_i in this representative ‘latent space’.
Problem 10: Consider training an auto-encoder model on the MNIST data, where the total number of parameters in F and G is P/2 (P as in Problem 4), and the dimension of the encoder output is k. You
may structure the encoder and decoder layers as you like (they do not have to be symmetric), but the input
layer needs to match the (flattened) MNIST data, as does the output layer (i.e., 784 dimensional).
Plot, as a function of k, the mean test and training reconstruction losses. What does this tell you
about the ‘dimension’ of the MNIST data set, compared to its ‘raw’ dimension of 784?
As an example of why this might be useful, consider the following: once we have trained an auto-encoder, and specifically an encoder that captures ‘relevant information about the data’ (we know that it does, because reconstruction is possible), we can then utilize that encoded or latent representation elsewhere.
Problem 11: For the optimal k and trained F, G you found in Problem 10, consider building a network in
the following way:
• Input layer.
• Computes F of the input.
• Passes through single linear layer, with 10-dimensional output.
• Computes the softmax for classification probabilities.
Note that we are taking F as ‘frozen’ here: the only trainable parameters are meant to be in that last linear layer. How can you construct this in code, specifically with F taken as given, i.e., not trainable?
How does this model, trained for classification, compare to the baseline linear softmax model? Or
the comparable parameter sized models you looked at for classification previously? Note with respect to
application, we can discount the training of F, as that would’ve been done elsewhere or beforehand - we only
have to worry about training that last layer.
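One way the freezing might be done, assuming PyTorch (the encoder shown is a stand-in with illustrative widths; in the assignment, F would be the trained encoder from Problem 10):

```python
import torch
import torch.nn as nn

k = 32
F = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, k))

F.requires_grad_(False)                # freeze: no gradients flow into F's weights
head = nn.Linear(k, 10)                # the only trainable layer

model = nn.Sequential(F, head)         # softmax is folded into CrossEntropyLoss
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)   # optimizes only the head
```

Passing only the trainable parameters to the optimizer (rather than model.parameters()) makes the intent explicit and avoids any accidental updates to F.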
5 Bonus, For Fun
Consider a simple network for classifying MNIST, with one hidden layer (number of nodes on this layer at your
discretion). Consider training the model with stochastic gradient descent, but with different stepsizes α0 on the
parameters going into the hidden layer, and α1 on the parameters coming out of the hidden layer. Create a 2D plot,
with α0 on the x-axis and α1 on the y-axis. For every point in the plot, color that point in the following way: color
it black if SGD fails to converge, color it (something else) if SGD converges, where the strength of the color depends
on the final loss converged to. Zoom in as best you can on the boundary between regions of convergence and regions
of failure of convergence. What do you notice? Note, you’ll need to take great care with testing convergence to get
good results!
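Per-layer stepsizes like α0 and α1 can be set with SGD parameter groups, assuming PyTorch; a sketch (the names a0, a1 mirror the problem statement, and the values are arbitrary points in the sweep):

```python
import torch
import torch.nn as nn

hidden = nn.Linear(784, 64)            # parameters going into the hidden layer
out = nn.Linear(64, 10)                # parameters coming out of the hidden layer

a0, a1 = 0.05, 0.01                    # one (alpha_0, alpha_1) grid point
optimizer = torch.optim.SGD([
    {"params": hidden.parameters(), "lr": a0},
    {"params": out.parameters(), "lr": a1},
])
```

Sweeping (a0, a1) over a grid and recording the final loss (or a divergence flag) at each point gives the data for the 2D plot.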