COMP4702 Report
1. Objective
This report has two objectives:
1. Find a suitable regression problem that can be posed on the Rel 2 Nutrient_file dataset.
2. Develop a model with high predictive performance on an unseen dataset.
2. Finding a Regression Problem
We first need to understand the dataset and identify the features and target for the regression problem. The Rel 2 Nutrient_file dataset contains the energy, classification, and nutrient information of 1616 different foods. It has 293 columns, 288 of which contain nutrient information. Each of these columns corresponds to a specific nutrient and gives a numerical value representing the amount of that nutrient in each food. There are 2 columns for energy: one including dietary fibre and one excluding it.
Since energy is a continuous measurement, it is a good candidate for the target of our regression problem. We will arbitrarily choose the energy value that includes dietary fibre.
Figure 1. Distribution of Food Energies.
However, the distribution of the energy is skewed, which may make it difficult to evaluate the predictions (see Figure 1). For example, an error of 100 kJ would be small for a food with 3000 kJ of energy but very large for one with only 10 kJ. To address this, we will transform the energy with log(x + 1).
Figure 2. Distribution of Food Log Energy
Figure 2 shows the distribution of the energy after the log transformation. Although there are some outliers with lower log energy, the distribution is much better in terms of skewness. We will use this as the target, and mean squared error as the evaluation metric – a common choice for regression problems.
Not all of the 288 nutrient columns are good candidates for features, because of the large number of null values they contain. 58 of the nutrient columns have no null values whatsoever, so we will use these 58 columns as the features. 58 columns is a lot, so overfitting is an issue to keep in mind. If model performance is poor, we can attempt to incorporate some of the other columns.
For performance evaluation, we will use an 80-20 training and testing split. The test data will only be used at the end to evaluate our final model as an “unseen dataset”. It will not be used to choose the final model or to fine-tune hyperparameters; this avoids fitting the model to the test data, which would invalidate the performance evaluation as a measure of performance on unseen data.
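As a rough illustration, the data preparation might look like the sketch below. The file name and the energy column label are assumptions (the actual Rel 2 Nutrient_file labels may differ), and in practice the ID, classification, and second energy columns would also be excluded from the features.

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Hypothetical file and column names; adjust to the actual Rel 2 Nutrient_file labels.
    df = pd.read_csv("Release2_Nutrient_file.csv")
    target_col = "Energy, with dietary fibre (kJ)"

    # Keep the nutrient columns that contain no null values at all (58 columns in our case).
    feature_cols = [c for c in df.columns
                    if c != target_col and df[c].dtype != object and df[c].notna().all()]

    X = df[feature_cols].to_numpy(dtype=float)
    y = np.log1p(df[target_col].to_numpy(dtype=float))  # log(x + 1) transform of the energy

    # 80-20 split; the test set is only touched once, at the very end.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)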
3. Selecting the Best Model
To select the best hyperparameters for each model, and to select which model to use as our final model, we will use 10-fold cross-validation. These folds are created on the training dataset from the previous 80-20 split. We will compute the train and test MSE losses for each of the 10 iterations. The means of these train and test losses will be referred to as the cross-validation train and cross-validation test losses respectively. We will use the cross-validation test loss as the metric to choose the best model and the cross-validation train loss to analyse underfitting and overfitting.
We use this method rather than a single hold-out dataset because it gives a more accurate estimate of model performance, and the models we will look at are not too computationally expensive.
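A minimal sketch of this cross-validation procedure, assuming scikit-learn (the helper name cv_losses is ours, not part of the report's code):

    from sklearn.model_selection import KFold, cross_validate

    cv = KFold(n_splits=10, shuffle=True, random_state=0)

    def cv_losses(model, X, y):
        # Returns (cross-validation train loss, cross-validation test loss)
        # as the mean MSE over the 10 folds.
        scores = cross_validate(model, X, y, cv=cv,
                                scoring="neg_mean_squared_error", return_train_score=True)
        return -scores["train_score"].mean(), -scores["test_score"].mean()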
4. Linear Regression
As a baseline model, we will use a simple multivariate linear regression. It achieved a cross-validation train loss of 0.233 and a cross-validation test loss of 0.451. The train loss is lower than the test loss, as expected.
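Using the cv_losses helper sketched above, evaluating this baseline amounts to a couple of lines (a usage sketch, not the report's code):

    from sklearn.linear_model import LinearRegression

    train_loss, test_loss = cv_losses(LinearRegression(), X_train, y_train)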
Figure 3 shows how the test loss varied between the folds of cross-validation. The variation is quite large. Because linear regression has an exact analytical solution, the variation must come from the differences in the training and test data in each iteration of cross-validation. This suggests that, for a given iteration, the test loss from the hold-out fold may not be a good estimate of the true expected error on new data, which supports using cross-validation instead of another hold-out set when choosing the best model and hyperparameters.
Figure 3: Box plot of the test loss observed for the 10 folds of the cross-validation process.
5. Improving the Model by Regularization
In our linear regression model, there is a gap between the training and test losses. This may be due to overfitting and may be improved by incorporating regularization techniques.
Looking at the coefficients of the final trained linear model, we can see a huge variation in their magnitudes. Plotting them on a log scale (see Figure 4), we see that some are very small (~10^-5) while others are 100,000 times larger (~10^0). It may be that the scale of the input varies between different columns. We can check this by plotting the distribution of the sample standard deviations of each column (see Figure 5). The variation in the standard deviations is large, which may explain some of the variation in the coefficient magnitudes. To account for this, we can retrain the linear model on the training data with normalised features. Figure 6 shows that even with normalisation, some coefficients have much larger magnitudes than others. This suggests that L2 and L1 regularization can have some effect on the model, which may reduce overfitting.
Figure 4: Distribution of the log of non-regularized linear regression coefficients.
Figure 5: Distribution of the log sample standard deviations of the columns in the training data set.
Figure 6: Distribution of the log of non-regularized linear regression coefficients after normalizing the features. Most coefficients are between the orders of 10^-3 and 10^2.
L2 regularization tends to make coefficients smaller in magnitude because the regularization term (the sum of squared coefficients) is minimized alongside the loss. We also want to normalize the features so that the regularization does not unfairly penalize features with lower values. We use the L2 regularization implemented as Ridge in sklearn. To choose the optimal regularization strength alpha, we search over alpha and pick the best-performing value. The search was done on the interval [1, 1000], for the reason described below.
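A sketch of this search, reusing the cross-validation helper above with a normalization step inside a pipeline (the alpha grid itself is an assumption):

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    alphas = np.linspace(1, 1000, 200)
    losses = [cv_losses(make_pipeline(StandardScaler(), Ridge(alpha=a)), X_train, y_train)[1]
              for a in alphas]
    best_alpha = alphas[int(np.argmin(losses))]   # this loss curve is what Figure 7 plots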
Figure 7: The cross-validation test loss of L2 regularization with different regularization strengths.
Figure 7 shows that the cross-validation test loss achieves a minimum at a strength of 523. Because the loss appears to keep increasing for both larger and smaller strengths, we stop the search here.
Using a strength of 523, the regularized linear regression achieved a cross-validation test loss of 0.360, suggesting better generalization performance. This is an improvement over the non-regularized linear regression. Moreover, the coefficients did get smaller, taking values of order 10^-2 or less (see Figure 8) instead of 10^2 (see Figure 6).
Figure 8: Distribution of L2 linear regression coefficients on a log scale. The coefficients are of order 10^-2 or less.
L1 regularization is another type of regularization that may help the model generalize. It tends to favour sparse coefficients, which may have an impact on our model since it has a large number of coefficients. Unlike L2 regularization, L1 regularization seems to work better at lower regularization strengths. Figure 9 shows that a local minimum of the mean squared error is achieved at a strength of 0.04.
Figure 9: The cross-validation test loss of L1 regularization at various regularization strengths. A minimum is reached at strength 0.04.
With the optimal strength of 0.04, the cross-validation test loss was 0.410. This is lower than what was achieved without regularization, but higher than what L2 regularization achieved. Moreover, 19 of the 58 coefficients were exactly 0, showing that L1 regularization did make the coefficients sparser.
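The L1 variant follows the same pattern, swapping Ridge for Lasso; the sparsity can be checked by counting zero coefficients (a sketch, not the report's actual code):

    from sklearn.linear_model import Lasso
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.04, max_iter=10000))
    lasso.fit(X_train, y_train)
    n_zero = int((lasso.named_steps["lasso"].coef_ == 0).sum())  # 19 of 58 in our run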
6. Caveat with Linear Regression
Linear regression works well when there is a linear relationship between the features and the target. In our case, this is likely inaccurate for many of the features, which would lead to predictions with high bias, contributing to the mean squared error.
We can see how linear the relationships are by looking at scatter plots of the Energy against each feature. A subset is shown in Figure 10.
Figure 10: Scatter plots of the Energy against 3 feature values.
They don’t look very linear, so it is possible that more complex models will perform better than linear regression.
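A short matplotlib sketch for producing such scatter plots (the choice of three columns is a placeholder):

    import matplotlib.pyplot as plt

    cols_to_plot = feature_cols[:3]   # placeholder: any three feature columns
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    for ax, col in zip(axes, cols_to_plot):
        ax.scatter(df[col], y, s=5)
        ax.set_xlabel(col)
        ax.set_ylabel("log(Energy + 1)")
    plt.show()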
The error metrics also suggest that we may need a more complex model. The cross-validation training losses have not gone below 0.2, and it will be difficult for our model to achieve a test error below that without increasing the model complexity. One such model is a neural network.
7. Neural Network Training Process
To code the neural networks and their training and evaluation processes, we will use pytorch.
Normalization is known to help deep learning models train better. Because of this, we will normalize the features.
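A sketch of this normalization step, fitted on the training data only so the test data does not leak into it (the variable names are assumptions carried over from the earlier sketches):

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler().fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)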
To train the model, we will use a mini-batched gradient-based method to optimize the training mean squared error. Mini-batches are used to help the model avoid overfitting. The Adam optimizer is used, which adapts the learning rate based on the curvature of the loss function in addition to using the gradient.
In each epoch, all the training data is fed through the network. Because we are using batches, we feed one batch at a time and calculate the loss for each batch. The network weights are updated using the gradient of each batch's loss, which pytorch calculates automatically via backpropagation.
At the end of each epoch, the model is evaluated on the test data using the same loss function. Both the mean of the train losses over all batches and the test loss are recorded for the epoch.
This process is repeated for many epochs, ideally until the test loss converges to its minimum. We can analyse how the train losses change over the epochs to judge when convergence has occurred.
Figure 11 is the code used to do this training process.
Figure 11: Training and evaluation code.
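Since Figure 11 is not reproduced inline, the following is a minimal sketch of the kind of loop it describes; the function and variable names are our assumptions, not the report's actual code.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    def train_and_evaluate(model, X_tr, y_tr, X_te, y_te, epochs=200, lr=0.01, batch_size=8):
        loader = DataLoader(TensorDataset(X_tr, y_tr), batch_size=batch_size, shuffle=True)
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.MSELoss()
        train_history, test_history = [], []

        for epoch in range(epochs):
            model.train()
            batch_losses = []
            for xb, yb in loader:
                optimizer.zero_grad()
                loss = loss_fn(model(xb), yb)
                loss.backward()          # backpropagation computes the gradients
                optimizer.step()         # Adam update of the weights
                batch_losses.append(loss.item())
            train_history.append(sum(batch_losses) / len(batch_losses))  # mean loss over batches

            model.eval()
            with torch.no_grad():        # evaluate on the fold's test data at the end of the epoch
                test_history.append(loss_fn(model(X_te), y_te).item())

        return train_history, test_history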
Like the previous models, we will use cross-validation to evaluate this model and fine-tune the hyperparameters. However, sklearn's cross-validation functions do not support pytorch neural networks, so we have to write our own (see Figure 12).
Figure 12: Cross-validation code for pytorch neural networks.
This means the training process is run 10 times; each time, 9 folds are used as the training data and 1 as the test data. This can take a long time, leaving less time for hyperparameter tuning, but it is necessary because of the variation in test loss we could get for different segmentations of the test data, as we saw when training our linear model.
To analyse how the training and test losses change over the epochs, we can use the mean loss over all 10 models at each epoch. After all epochs have completed, the mean train and test losses over all 10 models are the cross-validation train and test losses of our final model.
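A sketch of what such a hand-written cross-validation routine might look like (Figure 12 is not reproduced inline; make_model is a hypothetical factory that builds a freshly initialized network for each fold, and train_and_evaluate is the loop sketched above):

    import numpy as np
    import torch
    from sklearn.model_selection import KFold

    def cross_validate_nn(make_model, X, y, epochs=200, lr=0.01, batch_size=8, n_splits=10):
        X_t = torch.tensor(X, dtype=torch.float32)
        y_t = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)
        train_curves, test_curves = [], []

        for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
            model = make_model()   # re-initialize the weights for every fold
            tr, te = train_and_evaluate(model, X_t[train_idx], y_t[train_idx],
                                        X_t[test_idx], y_t[test_idx],
                                        epochs=epochs, lr=lr, batch_size=batch_size)
            train_curves.append(tr)
            test_curves.append(te)

        # Mean over the 10 folds at each epoch; the final-epoch means are the CV train/test losses.
        return np.mean(train_curves, axis=0), np.mean(test_curves, axis=0)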
Unlike linear regression, optimizing a neural network generally has no closed-form solution. In addition, the complexity of neural networks makes the optimization process non-deterministic, and the results may vary each time the model is trained.
8. Single-Unit Neural Network
Unlike the previous methods, neural networks require a lot more code to be written and are more prone to bugs. To test this code, we will train a simple neural network with one layer containing a single unit and no activation. In theory, this should be equivalent to simple linear regression.
Figure 13: The diagram of a single-layer, single-unit neural network. The inputs are x1, …, xn and the prediction output is y. The parameters of the neuron are w1, …, wn for the weights and b for the bias.
The mathematical model for this network (diagram shown in Figure 13) is y = w1x1 + w2x2 + … + wnxn + b. Note that this is equivalent to the linear regression model. Thus, if the code is working as expected, it should be possible to achieve similar cross-validation losses to our previous linear model.
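In pytorch this model is just a single linear layer with no activation (a sketch; 58 is the number of features in our problem):

    import torch.nn as nn

    def make_single_unit(n_features=58):
        return nn.Linear(n_features, 1)   # computes w1*x1 + ... + wn*xn + b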
Using a batch size of 8, a learning rate of 0.01, and 50 epochs, we achieved a cross-validation train loss of 0.338 and a test loss of 0.377.
These losses are in the same ballpark as those of our linear model (0.233 train and 0.451 test), but they are noticeably different.
One possibility for the difference is that the folds used to train the two models were different, with each model seeing different combinations of data points in each fold. However, the training data in each cross-validation iteration is 90% of the training set, so it is probably difficult to get much variation in the training data regardless of how the folds are assigned.
Another possibility is that the model's training loss did not converge to the minimum. This would explain the higher train loss and the different test loss. However, the mean loss over the 10 folds at each epoch seems to have converged (see Figure 14).
Figure 14: Mean training and test loss over the epochs (mean over the 10 cross-validation iterations).
Although the means look converged, perhaps some of the 10 training processes did not converge. We can plot the training loss for each of the 10 training processes (see Figure 15).
Figure 15: Training loss for each training process over the epochs.
We can see that the first iteration starts with a high loss, while all other iterations start with a loss below 0.4. Moreover, the losses for later iterations seem lower than those for earlier iterations.
This plot helps us realise that in our cross-validation code, we forgot to re-initialize the weights at the start of each iteration. This means that each iteration starts with the weights obtained at the end of the previous iteration. Moreover, the trend of later iterations having lower training loss suggests that the model was still improving, so we need to train for more epochs.
By adding the re-initialization and increasing the number of epochs to 200, we get Figure 16.
Figure 16: The mean training and test loss (above) and the training loss for each cross-validation iteration (below) for the single-layer neural network.
Now all iterations start with a higher training loss. However, the mean train and test losses after 200 epochs are 0.333 and 0.394 respectively. This is still somewhat different from the 0.233 train and 0.451 test losses achieved by our linear model. Moreover, we can see more fluctuations in the loss. It is possible that the loss is not converging to the optimum because the learning rate is too high. It could also be due to the mini-batch process, which computes the gradients from small batches that may not be representative of the full dataset.
In addition, the 10 iterations each converge to a slightly different loss, which may be because of the slightly different makeup of the training and test sets in each iteration.
9. Multi-layer Neural Network
To increase the complexity of the model, we will first try a two-layer MLP. The first layer is composed of 30 hidden units, each with ReLU activation. The second layer is the output.
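A sketch of this architecture in pytorch (names are ours, not the report's code):

    import torch.nn as nn

    def make_two_layer(n_features=58, hidden=30):
        return nn.Sequential(
            nn.Linear(n_features, hidden),  # hidden layer with ReLU activation
            nn.ReLU(),
            nn.Linear(hidden, 1),           # output layer
        )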
Using a batch size of 8, a learning rate of 0.01, and 200 epochs, we get mean train and test losses of 0.107 and 0.209 respectively. This is a significant improvement over our linear models.
Figure 17: Mean training and test loss over the epochs for our two-layer neural network
The mean train loss seems to be flattening out towards epoch 200 (see Figure 17). It is unlikely to decrease significantly more, so we do not have to train longer. There is no clear sign of overfitting yet, so the model could be underfitting.
Because of the potential for underfitting, we will further increase the complexity by using 2 hidden layers, each with 30 units. All hidden units have ReLU activation. With all other hyperparameters kept the same as before, we get mean train and test losses of 0.038 and 0.183 respectively.
Figure 18: Mean training and testing losses over the epochs for our three-layer neural network
The model now seems to be overfitting. The training loss is continuing to go down, while the test loss is staying around 0.2 (see Figure 18).
To help reduce the overfitting while maintaining the complexity needed to achieve a low loss, we can incorporate L2 regularization, implemented by adding weight decay. Another regularization method is to add dropout to each of the hidden layers. With a dropout probability of 0.2 and a weight decay of 0.0001, we obtained 0.083 training and 0.167 test losses. This is an improvement in the test loss, and Figure 19 shows that there is less overfitting.
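A sketch of the regularized three-layer network: dropout follows each hidden layer, and the L2 penalty is supplied through the optimizer's weight_decay argument (which the training loop sketched earlier would need to accept as a parameter).

    import torch
    import torch.nn as nn

    def make_three_layer(n_features=58, hidden=30, p_drop=0.2):
        return nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, 1),
        )

    model = make_three_layer()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=1e-4)  # L2 via weight decay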
Figure 19: Loss curve for our three-layer neural network with regularization.
We can also try to reduce the complexity a little by lowering the number of units in each hidden layer. We do a grid search over the values [30, 26, 22, 18]. The lowest cross-validation test loss was 0.132, observed with 22 hidden units. This is the lowest cross-validation loss we have seen so far.
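A sketch of this search over hidden-layer sizes, reusing the cross-validation routine and model factory sketched above (the scaled training arrays are the ones from the normalization sketch):

    results = {}
    for hidden in [30, 26, 22, 18]:
        _, test_curve = cross_validate_nn(lambda h=hidden: make_three_layer(hidden=h),
                                          X_train_scaled, y_train)
        results[hidden] = test_curve[-1]   # cross-validation test loss at the final epoch

    best_hidden = min(results, key=results.get)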
Running cross-validation again with 22 hidden units, we get a cross-validation test loss of 0.145. This is still lower than all previous models, so we will stick with it. However, it shows how the randomness in initializing and training the neural network leads to the model converging to different local minima.
10. Evaluation of Final Model
We will use the 3-layer MLP with 22 hidden units in each hidden layer. This is trained on the entire training dataset with the same hyperparameter configurations as we used previously. We then evaluate the network on the hold-out test set that we made at the start.
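A sketch of this final step, reusing the earlier sketches: the model is trained once on the full (scaled) training set, and the hold-out set is only used for a single evaluation at the end.

    import torch
    import torch.nn as nn

    final_model = make_three_layer(hidden=22)
    X_tr = torch.tensor(X_train_scaled, dtype=torch.float32)
    y_tr = torch.tensor(y_train, dtype=torch.float32).reshape(-1, 1)

    # Reuse the training loop; the "test" arguments are just the training data here, so the
    # hold-out set is not touched during training. In practice the optimizer inside the loop
    # would also use weight_decay=1e-4, as described in the previous section.
    train_and_evaluate(final_model, X_tr, y_tr, X_tr, y_tr, epochs=200, lr=0.01, batch_size=8)

    final_model.eval()
    with torch.no_grad():
        X_te = torch.tensor(X_test_scaled, dtype=torch.float32)
        y_te = torch.tensor(y_test, dtype=torch.float32).reshape(-1, 1)
        test_mse = nn.MSELoss()(final_model(X_te), y_te).item()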
This results in a test MSE of 0.183. It is slightly higher than our cross-validation losses.
Perhaps this is due to the variation in the dataset, as we saw when doing cross-validation, or the randomness in getting the model to converge to the global minimum. Regardless, it is a good result and is significantly better than our baseline linear regression model.
11. Conclusion
In conclusion, we have found a good regression problem to work on and developed a model with high predictive performance on the unseen test set. Moreover, we were able to analyse various parts of the models and how they contribute to the models’ prediction performance.