CSC8630/CSC8635 - Machine Learning
Resit specification: Machine Learning project
Submission will be via Canvas
The learning objectives of this assignment are:
1. To learn about the design of machine learning analysis pipelines
2. To understand how to select appropriate methods given the dataset type
3. To learn how to conduct machine learning experimentation in a rigorous and effective manner
4. To critically evaluate the performance of the designed machine learning pipelines
5. To learn and practice the skills of reporting machine learning experiments
For this coursework you will be given a choice of four datasets of different kinds:
1. A tabular dataset (defined as a classification problem)
2. An image dataset
3. A text dataset
4. A time series dataset
Your job is easy to state: pick one of these four options and design a range of machine learning pipelines appropriate to the nature of the selected dataset. Overall, we expect you to perform a thorough investigation involving (whenever relevant) all parts of a machine learning pipeline (exploration, preprocessing, model training, model interpretation and evaluation), evaluating a range of options for each part of the pipeline, with proper hyperparameter tuning.
You will have to write a short report that presents the experiments you ran, their justification, and a detailed description of the performance of your designed pipelines using the most appropriate presentation tools (e.g., tables of results, plots). We expect you to present your work at a level of detail that would enable a fellow student to reproduce your steps.
1) Description and Requirements for the tabular dataset
The dataset, called FARS, is a collection of statistics on US road traffic accidents. The class label gives the severity of the accident. It has 20 features and over 100K examples. The dataset is available on Canvas as a CSV file, in which the last column contains the class labels: https://ncl.instructure.com/courses/53509/files/7652449/download?download_frd=1
Experiments on the tabular dataset will be relatively fast compared to the other three options. To compensate, we expect that you evaluate a very broad range of options for the design of your machine learning pipelines, including (but not limited to) data normalisation, feature/instance selection, class imbalance correction, several (appropriate) machine learning models, hyperparameter tuning and cross-validation evaluation.
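As one illustration of the cross-validation and class imbalance points above, the sketch below shows a stratified k-fold split written in plain Python, so that every fold keeps roughly the same class proportions. The function name stratified_kfold and the toy labels "minor" and "fatal" are illustrative assumptions, not part of the FARS dataset or of any required method; in practice a library implementation would normally be used.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs whose folds keep class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's examples round-robin so every fold gets a fair share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


# Toy labels (illustrative only, not FARS data): 90 majority, 10 minority examples.
toy_labels = ["minor"] * 90 + ["fatal"] * 10
splits = list(stratified_kfold(toy_labels, k=5))
```

Any resampling used to correct class imbalance would then be applied to the training indices of each fold only, so that the held-out fold still reflects the original class distribution.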
2) Description and Requirements for the image dataset
The CIFARTile dataset is an extension of the CIFAR10 dataset. Each image consists of four CIFAR10 images tiled in a grid. The task is to predict the label, which is the number of unique CIFAR10 image classes within the tiled image minus one. For example, in Figure 1 below there are two images of birds, one of a frog and one of an automobile. That makes three unique classes, so the label is 2. More details on the dataset can be found at
https://github.com/RobGeada/cvpr-nas-datasets. However, please download your data from:
Train
http://homepages.cs.ncl.ac.uk/stephen.mcgough/data/CIFARTile/train_x.npy
http://homepages.cs.ncl.ac.uk/stephen.mcgough/data/CIFARTile/train_y.npy
Validate
http://homepages.cs.ncl.ac.uk/stephen.mcgough/data/CIFARTile/valid_x.npy
http://homepages.cs.ncl.ac.uk/stephen.mcgough/data/CIFARTile/valid_y.npy
Test
http://homepages.cs.ncl.ac.uk/stephen.mcgough/data/CIFARTile/test_x.npy
http://homepages.cs.ncl.ac.uk/stephen.mcgough/data/CIFARTile/test_y.npy
Figure 1: Example image from the CIFARTile dataset, class label is 2
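The label definition above can be written down as a small function. This is only a sketch of how the label is derived from the four underlying class names, with tile_label as a hypothetical helper name; it is not a modelling approach, since the class names of the individual tiles are not something a pipeline is given.

```python
def tile_label(tile_classes):
    """Label = number of distinct CIFAR10 classes in the tile, minus one."""
    if len(tile_classes) != 4:
        raise ValueError("a CIFARTile image is made of exactly four CIFAR10 images")
    return len(set(tile_classes)) - 1


# The Figure 1 example: two birds, one frog, one automobile.
figure_1_label = tile_label(["bird", "bird", "frog", "automobile"])
print(figure_1_label)  # 2
```

With four tiles per image the label can therefore only take the values 0, 1, 2 or 3.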
Some hints:
− There’s a notebook (http://homepages.cs.ncl.ac.uk/stephen.mcgough/data/viewCIFARTile.ipynb) which shows you how to load and view the data.
− To speed up your work, here are some hints:
o Make sure you set the Runtime type to either GPU or TPU.
o Copy the data to your Google Drive so you don’t have to keep uploading it.
o As the dataset is large, you might want to do some of your initial testing on a subset of the data.
You might consider cutting the image up into four tiles and running each through a CIFAR10 classifier. This is not allowed and will score you zero for Methods.
3) Description and Requirements for the text dataset
Dataset: a sentiment analysis dataset. It includes a training set (https://ncl.instructure.com/files/7666186/download?download_frd=1), a development set (https://ncl.instructure.com/files/7666193/download?download_frd=1), and a test set (https://ncl.instructure.com/files/7666197/download?download_frd=1). Each sample in the dataset represents a tweet. Each tweet has a sentiment label (Positive, Negative, Neutral).
Task Description: Predict the sentiment of the test set by applying a combination of different approaches, including pre-processing techniques, shallow and deep classifiers, ensemble approaches and, if applicable, machine learning approaches beyond supervised learning and data augmentation. Try your best to improve the prediction results.
Main evaluation metric: F1 measure.
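For reference, the F1 measure for a three-class problem can be computed as sketched below. The specification does not say how the per-class scores are combined, so the macro average (the unweighted mean over the three sentiment labels) used here is an assumption, and the toy predictions are made up for illustration.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    return sum(f1_per_class(y_true, y_pred, label) for label in labels) / len(labels)


LABELS = ["Positive", "Negative", "Neutral"]

# Toy example (illustrative only, not taken from the tweet dataset).
y_true = ["Positive", "Positive", "Negative", "Negative", "Neutral", "Neutral"]
y_pred = ["Positive", "Negative", "Negative", "Negative", "Neutral", "Positive"]
score = macro_f1(y_true, y_pred, LABELS)
```

Here the per-class scores are 0.5 (Positive), 0.8 (Negative) and 2/3 (Neutral), so the macro F1 is roughly 0.656. Whichever averaging scheme is used, it should be stated in the report.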
4) Description and Requirements for the time series dataset
The Weather dataset is a time-series dataset collected by a Raspberry Pi computer at a home in Newcastle. It contains several weather-related features collected over a period of approximately 12 months. The features are as follows:
Column no | Feature
1 | Date and time in standard Linux format
2 | Temperature from the first internal sensor (Celsius)
3 | Outside temperature (Celsius)
4 | CPU Temperature (Celsius)
5 | Count (always 1)
6 | Temperature from the second internal sensor (Celsius)
7 | Air Pressure (mmHg)
8 | Humidity
Readings are taken at one-minute intervals between November 2021 and November 2022. Your task is to predict values 5, 10, 15 and 30 minutes ahead, and 1, 2, 6 and 12 hours ahead. You can do this for each of the 6 weather features (not date or count). You should separate out a test set consisting of the last 2 months of data (you need a continuous and separate test set to prevent leakage between training and testing).
The dataset can be downloaded from:
http://homepages.cs.ncl.ac.uk/stephen.mcgough/data/weather.csv
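The forecasting setup described above (fixed horizons and a continuous test period at the end) can be sketched in plain Python as follows. The helper names chronological_split and make_pairs, the toy readings and the cutoff value are all illustrative assumptions; the real cutoff depends on the actual timestamps in the file, and treating a horizon in minutes as a shift in rows assumes there are no missing readings.

```python
from datetime import datetime, timedelta

# Forecast horizons from the task, in minutes. With one reading per minute,
# a horizon in minutes equals a shift in rows (assuming no missing rows).
HORIZONS_MINUTES = [5, 10, 15, 30, 60, 120, 360, 720]


def chronological_split(timestamps, cutoff):
    """Return (train_idx, test_idx): readings before cutoff train, the rest test."""
    train_idx = [i for i, t in enumerate(timestamps) if t < cutoff]
    test_idx = [i for i, t in enumerate(timestamps) if t >= cutoff]
    return train_idx, test_idx


def make_pairs(values, horizon_steps):
    """Pair each reading with the reading horizon_steps rows later."""
    return [
        (values[i], values[i + horizon_steps])
        for i in range(len(values) - horizon_steps)
    ]


# Toy data (illustrative only): 100 one-minute readings.
start = datetime(2022, 1, 1)
timestamps = [start + timedelta(minutes=i) for i in range(100)]
values = list(range(100))

cutoff = start + timedelta(minutes=80)  # placeholder cutoff, not the real date
train_idx, test_idx = chronological_split(timestamps, cutoff)
train_values = [values[i] for i in train_idx]
test_values = [values[i] for i in test_idx]

# Build pairs inside each split so no training target comes from the test period.
train_pairs = make_pairs(train_values, horizon_steps=5)
test_pairs = make_pairs(test_values, horizon_steps=5)
```

Forming the input and target pairs separately within each split is one way to make sure that no training example uses a target value from the test period.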
Some hints:
- There is a notebook (http://homepages.cs.ncl.ac.uk/stephen.mcgough/data/weather.ipynb) which shows you how to load and view the data to get you started.
- To score top marks for this dataset you should demonstrate multiple models, at least one of which should not use Deep Learning.
- To speed up your work, here are some hints:
o Make sure you set the Runtime type to either GPU or TPU.
o Copy the data to your Google Drive so you don’t have to keep uploading it.
o As the dataset is large, you might want to do some of your initial testing on a subset of the data.
Marking criteria
• Writing Style, references, figures, etc. 10%
• Dataset exploration 10%
• Methods 30%
• Results of analysis 30%
• Discussion 20%
Deliverables
A finished report addressing the marking scheme above, together with the source code of your best pipelines for the selected dataset. The report should be 1000 to 2000 words long. The word count excludes references, tables, figures and section headers, and has a 10% leeway.