1 Introduction
This assignment is based on the Human Activity Recognition Using Smartphones dataset (see footnote 1). The data was collected from
a group of 30 volunteers aged 19-48 years. Each person performed six activities (walking, walking upstairs, walking
downstairs, sitting, standing, lying down) whilst wearing a smartphone on the waist. Using the phone’s embedded
accelerometer and gyroscope, data was captured relating to the 3-axial linear acceleration and 3-axial angular velocity at
a sample rate of 50 Hz. Video was then used to manually annotate the data.
Figure 1: Activity monitoring (Image: Universitat Politècnica de Catalunya, Catalonia (Spain))
The dataset presented here (and used in this assignment) is a sub-table of the original data and contains 1,200 data
instances (or objects), where a single activity is represented by a single data instance. There are 336 attributes (or features)
including the decision class label (‘ACTIVITY’), which can take the values 1 (walking), 2 (walking upstairs), 3 (walking
downstairs), 4 (sitting), 5 (standing), 6 (lying down). Each of these classes is represented more-or-less equally by 200
instances in the dataset. The goal is to predict the activity (walking, sitting, etc.), using the extracted sensor input values.
The sensor acceleration signal also has gravitational and body motion components. More info regarding the process and
data can be obtained from the link provided in the footnote. It is important to note that the dataset for this assignment is
a sub-table of the original data linked in the footnote and you should be aware that the distribution of the data has also
been modified. This means that any classifiers learned on the full dataset will not generalise well to the data for this
assignment.
Note also that some data instances have missing feature values. You need to be mindful of this when building classifiers.
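Before building any classifier, it can help to see how many values are missing and in which attributes. The following is a minimal sketch only, assuming a dense ARFF file with simple unquoted attribute names and no commas inside values; the function name count_missing and the toy attribute names are hypothetical and are not taken from the assignment data.

```python
def count_missing(arff_text):
    """Count '?' entries per attribute in a dense ARFF document.

    Assumes unquoted attribute names and comma-separated data rows.
    """
    names, counts, in_data = [], [], False
    for raw in arff_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        lower = line.lower()
        if not in_data:
            if lower.startswith("@attribute"):
                names.append(line.split()[1])
                counts.append(0)
            elif lower.startswith("@data"):
                in_data = True
            continue
        # Data section: one instance per line, '?' marks a missing value.
        for i, value in enumerate(line.split(",")):
            if value.strip() == "?":
                counts[i] += 1
    return dict(zip(names, counts))


# Toy ARFF content (made up for illustration).
sample = """@relation toy
@attribute acc_mean numeric
@attribute gyro_std numeric
@attribute ACTIVITY {1,2}
@data
0.1,?,1
?,?,2
0.3,0.7,1
"""
missing = count_missing(sample)
print(missing)
```

The same counts are visible per attribute in the WEKA Explorer preprocess tab; a script like this is only a convenience for scanning many attributes at once.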
2 General Guidelines
As with the previous assignment for this module, you should run experiments using WEKA (version 3.8.1) Explorer. If
you are working on a machine outside of the departmental network (e.g. your laptop), please ensure that you download
and install this exact version. This is very important as your results need to be reproducible on a departmental
machine.
When dealing with large datasets in WEKA, it is possible that you may encounter out-of-memory errors because the
JVM has not reserved enough memory for WEKA (i.e. the ‘heap size’ is too small). If this is the case, then there are a number
of ways to address this - see the appendix for more info.
In many cases, the WEKA Explorer allows you to modify the random seed that will be used. However, you should use the
default seed to aid reproducibility of your results. You will have to work out some details in applying WEKA Explorer,
including the meaning of certain terminologies (such as ‘random seed’ as referred to above).
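As a rough illustration of why a fixed seed aids reproducibility, the sketch below shuffles instance indices with Python's standard random module. It is not WEKA's random number generator, and the function name shuffled_indices is hypothetical; the point is only that the same seed always yields the same ordering.

```python
import random


def shuffled_indices(n, seed):
    """Return the indices 0..n-1 in a pseudo-random order fixed by the seed."""
    rng = random.Random(seed)  # private generator, independent of global state
    indices = list(range(n))
    rng.shuffle(indices)
    return indices


first_run = shuffled_indices(10, seed=1)
second_run = shuffled_indices(10, seed=1)
other_seed = shuffled_indices(10, seed=2)

# Same seed, same order: an experiment that depends on this shuffle can be repeated exactly.
print(first_run == second_run)
```

A different seed will typically produce a different ordering, which is why results obtained with a non-default seed may not match those obtained on another machine.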
1 https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
3 The Task
Essentially, there are three different subtasks for this assignment which are discussed in more detail in each section below.
3.1 Task 1 - Evaluating classifier performance (30%)
The first task is to examine how some of the classifiers provided in WEKA perform on the provided humanActivity.arff
dataset.
1. You should train and evaluate different classifiers on the dataset by first loading the humanActivity.arff dataset. In
this task, you will be using cross-validation (CV) on the dataset to evaluate the classifier performance. In order to
reduce computation time you may use 10-fold cross validation.
2. A good starting point for the analysis is to try both Naive Bayes (NaiveBayes) and C4.5 Decision Trees (J48). However,
for a complete analysis I would expect you to explore the data using at least four different learners including some
of the sub-symbolic learners that we covered in the lectures (e.g. SVM, NN, etc.). Compare the results (percent
correct, PC, etc.) for the different classifiers. What do you notice? What do you think would be a reasonable baseline
against which to compare the classification performance? You are not limited to any particular selection of classifier
learners and you may try others (as many as you like, in fact). However, you must provide a solid reasoning for
choosing those that we have not covered in the material for this module (and their inclusion in the report), and have
an understanding of how they work, as well as stating why you think they perform better/worse than others. It is
not sufficient to simply include lots of results for different classifiers without any meaningful analysis.
3. In the introduction section, it is mentioned that some instance-attribute values are missing. It is important to deal
with these as they may affect the performance of the learners applied to the data. One way of addressing this might
be to use one of the filters provided in WEKA to replace such values with something else. This may be appropriate
in some cases and not in others. It is important to bear this in mind however, and you should check the data carefully,
use your best judgement and clearly state any assumptions you make when performing any subsequent
experiments.
Using a filter, or manually editing the dataset by identifying the values to be changed, of course
results in a dataset that differs from the original. You should now re-analyse the modified dataset (or datasets
if you attempt more than one approach) and present your findings. What effect does this have on the models that
are learned? Again, as in the previous step, you should provide sound reasoning for any conclusions you draw and
classifiers you choose.
4. Returning to the missing attribute values: can you suggest another way, apart from the strategies employed by the
WEKA filters, to change the values? If so, you should describe it and present any experimental results, if you make
any changes to the data. Discuss the effects that these changes might have on the learner.
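The 10-fold cross-validation mentioned in step 1 can be pictured with the following sketch, which splits 1,200 instance indices into 10 folds while keeping the six classes balanced in each fold. It is an illustration under stated assumptions, not WEKA's own fold assignment: the function name stratified_folds and the seed value are hypothetical, and the labels are generated to match the 200-per-class description in the introduction.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k=10, seed=1):
    """Assign instance indices to k folds, spreading each class evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal the shuffled members of this class round-robin across the folds.
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds


# Six activities, 200 instances each, as described for the assignment data.
labels = [activity for activity in range(1, 7) for _ in range(200)]
folds = stratified_folds(labels)

# Each fold serves once as the test set; the other nine form the training set.
for fold in folds[:2]:
    print(len(fold), Counter(labels[i] for i in fold))
```

Each classifier is then trained 10 times, once per held-out fold, and the reported accuracy is aggregated over all 10 test folds.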
You should use the dataset you believe to be the most suitable, and explain your choice in your report. Discuss (in your
report) the analysis of the experiments performed.
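To make the idea of a replacement filter in step 3 concrete, the sketch below fills each missing numeric value with the mean of the observed values in the same attribute, which is the kind of substitution WEKA's ReplaceMissingValues filter performs for numeric attributes. This is a simplified stand-in, not the filter itself: the function name impute_column_means is hypothetical, missing values are represented as None, and the numbers are made up.

```python
def impute_column_means(rows):
    """Replace None entries with the mean of the observed values in that column."""
    n_cols = len(rows[0])
    means = []
    for col in range(n_cols):
        observed = [row[col] for row in rows if row[col] is not None]
        # A column with no observed values cannot be imputed and stays None.
        means.append(sum(observed) / len(observed) if observed else None)
    return [
        [means[col] if value is None else value for col, value in enumerate(row)]
        for row in rows
    ]


# Toy data: None stands for '?' in the .arff file.
rows = [[1.0, None], [3.0, 4.0], [None, 8.0]]
filled = impute_column_means(rows)
print(filled)
```

Note that mean replacement changes the distribution of the attribute (its variance shrinks), which is one of the effects worth discussing when comparing results on the original and modified datasets.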
3.2 Task 2 - Feature selection (40%)
In any dataset, there may be superfluous features - i.e. those that do not contribute to determining the classes of the data
instances. Indeed, often these attributes can be misleading and mean that classification performance can be negatively
affected as a result. One way in which we can address this issue is to perform attribute or feature selection.
You are asked to perform feature selection by carefully considering the properties of the original attributes (mean,
standard deviation, etc.), and applying any changes (removal or inclusion of attributes) for the supplied training set,
before then also applying the same changes to the test set.
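The workflow described above (inspect attribute statistics on the training set, decide which attributes to drop, then apply the identical change to the test set) can be sketched as follows. Everything here is illustrative: the helper names summarise and drop_columns are hypothetical, the data is made up, and the "zero standard deviation" rule is only a placeholder criterion, not a recommended selection strategy.

```python
import statistics


def summarise(column):
    """Basic descriptive statistics for one numeric attribute."""
    return {
        "mean": statistics.mean(column),
        "stdev": statistics.stdev(column),  # sample standard deviation
        "min": min(column),
        "max": max(column),
    }


def drop_columns(rows, drop):
    """Return a copy of rows without the column indices listed in drop."""
    return [[v for i, v in enumerate(row) if i not in drop] for row in rows]


# Toy data: two numeric attributes followed by the class label.
train = [[0.2, 5.0, 1], [0.4, 5.0, 2], [0.6, 5.0, 1]]
test = [[0.1, 5.0, 2], [0.9, 5.0, 1]]

# Statistics are computed on the training set only.
stats = [summarise([row[i] for row in train]) for i in range(2)]
constant = {i for i, s in enumerate(stats) if s["stdev"] == 0}

# The same column indices are removed from both datasets.
train_reduced = drop_columns(train, constant)
test_reduced = drop_columns(test, constant)
print(constant, train_reduced, test_reduced)
```

The key point is that the decision is made from the training data alone and that both files end up with the same attributes in the same order, otherwise a model trained on one cannot be evaluated on the other.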
The two datasets provided for this task are available on Blackboard: humanActivityTrain.arff and
humanActivityTest.arff. Note: you should not use an automatic feature selection method or mechanism such as those
included in WEKA for the first two parts of this task - I am interested in your analysis not whether you have chosen
an optimal subset of the data. The use of a feature selection algorithm will be obvious from the detail provided in
your report.
1. Examine each of the attributes in the training dataset (humanActivityTrain.arff) in detail using the visualisation tools
provided in the WEKA Explorer preprocess tab. Document/calculate some basic measures: mean, standard
deviation, maximum and minimum, and whatever other metrics you consider appropriate. What does this tell you
about each of the features of the dataset? Document your findings in the report.
2. Using the findings from 1) above, can you suggest a strategy for the inclusion or removal of certain features? If so,
perform a manual removal/inclusion of features on humanActivityTrain.arff using the preprocess tab in WEKA and
save the new dataset. Remove the same features for humanActivityTest.arff. Now, re-analyse the performance of the
classifiers that you chose for task 1. What do you notice? Is performance better or worse? Provide an analysis of
why you think that the performance has changed. You may include/exclude as many features as you like as long as
you have a sound reasoning for doing so. Also, you may generate more than one dataset with selected features and
perform analyses accordingly.
3. Very often as part of some classification algorithms, e.g. tree-based approaches, feature information gain scores are
used as a splitting criterion to perform classification. What can you say about the tree that is generated from the
humanActivityTrain.arff and the features that are used to split upon using the J48 classifier in WEKA? Is the tree
similar for the humanActivityTest.arff dataset? Document your findings in the report.
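For reference when reading the J48 tree in step 3, the sketch below computes plain information gain for a binary split of one numeric attribute at a threshold. It is a worked illustration with made-up values, and the function names are hypothetical; note also that C4.5 is usually described as using gain ratio, a normalised variant of this quantity, so the numbers here will not match anything WEKA prints.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on values <= threshold vs. > threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    total = len(labels)
    # Weighted entropy of the two branches (empty branches contribute nothing).
    remainder = sum(len(part) / total * entropy(part) for part in (left, right) if part)
    return entropy(labels) - remainder


# Toy attribute: low values for class 4 (sitting), high values for class 1 (walking).
values = [0.1, 0.2, 0.8, 0.9]
labels = [4, 4, 1, 1]

gain_good = information_gain(values, labels, threshold=0.5)  # separates the classes
gain_poor = information_gain(values, labels, threshold=0.1)  # leaves one branch mixed
print(gain_good, gain_poor)
```

A threshold that separates the classes cleanly yields a higher gain than one that leaves a branch mixed, which is why attributes near the root of a learned tree are often the most discriminative ones.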
3.3 Task 3 - Summary of results/findings (15%)
The final task is to draw upon the findings of Tasks 1 and 2 above and answer the following questions:
1. What is a good model for the full dataset using CV, and why? Provide a reasoned argument summarising the findings
in your report.
2. What effect does the approach you employed for dealing with missing-valued data (denoted ‘?’ in the .arff file) have
upon the dataset and the performance of the classifiers? State why you think that the approach you have used is
appropriate.
3. Which features are most useful in the dataset? Refer to your findings of the analysis of the individual features as well
as any methods such as decision trees (J48) which used feature information gain metrics. Is feature selection
appropriate for this classification problem/dataset and why?
4. Does a reduced dataset offer any advantages over the full dataset even if there are some negative changes in overall
performance?
4 Submission and Marking
You are required to submit two separate things:
1. A report in .pdf format via TurnItIn. You should aim to keep your answers concise, while also conveying the
important information. A report of 2,800 words max. (including references) is appropriate for this. You should
also include the TurnItIn word count in your document so that it can be verified. Please do not exceed the word
limit as this will delay the marking process.
2. The modified dataset(s) created in Task 2. These should be submitted via Blackboard, not TurnItIn.
Your assignment will be assessed according to the department’s assessment criteria for essays (see Student Handbook
Appendix AC) and marked based on your report and any supporting data you submit. In general it is advisable to
concentrate on appropriately sized selections or excerpts of data to support your discussion. It is not necessary to include
very large excerpts of data. However, if you feel the need to include such elements, please do not put them in the
main text; place them in an appendix instead so that they do not clutter the report. The following marking scheme will
be used to mark your submission:
• Task 1 - Evaluating classifier performance (see description above) (30%)
• Task 2 - Feature selection (see description above) (40%)
• Task 3 - Summary of Results/Findings (15%)
• Report - readability, correct formatting, layout and proper referencing (15%)
Appendix 1. - Increasing the Java Heap Size in WEKA
1. For Windows & Linux: On Windows only, you can change the heap size parameter in the RunWeka.ini file. In Linux
or Windows, you can also do so by navigating (using the command line) to the folder where the weka.jar file is
located on your machine and typing:
java -Xmx2048m -jar weka.jar
to start WEKA. The figure ‘2048’ is the maximum heap size in megabytes; increase it if a larger heap is
needed.
2. For MacOS: Open:
System Preferences→Java Control Panel→Java→ click on View.
Edit the Runtime args box for the user by including the parameter e.g. -Xmx2048m. Click Apply.
Note that the figure ‘2048’ is the maximum heap size in megabytes; increase it if a larger heap is
needed.