Date: 2020-08-05 10:39

1 Introduction

This assignment is based on the Human Activity Recognition Using Smartphones dataset (see footnote 1). This data was collected from

a group of 30 volunteers aged 19-48 years. Each person performed six activities (walking, walking upstairs, walking

downstairs, sitting, standing, lying down) whilst wearing a smartphone on the waist. Using the phone’s embedded

accelerometer and gyroscope, data was captured relating to the 3-axial linear acceleration and 3-axial angular velocity at

a sample rate of 50Hz. Video was then used to manually annotate the data.

Figure 1: Activity monitoring (Image: Universitat Politècnica de Catalunya, Catalonia (Spain))

The dataset presented here (and used in this assignment) is a sub-table of the original data and contains 1,200 data

instances (or objects), where a single activity is represented by a single data instance. There are 336 attributes (or features)

including the decision class label (‘ACTIVITY’), which can take the values 1 (walking), 2 (walking upstairs), 3 (walking

downstairs), 4 (sitting), 5 (standing), 6 (lying down). Each of these classes is represented more-or-less equally by 200

instances in the dataset. The goal is to predict the activity (walking, sitting, etc.), using the extracted sensor input values.

The sensor acceleration signal also has gravitational and body motion components. More info regarding the process and

data can be obtained from the link provided in the footnote. It is important to note that the dataset for this assignment is

a sub-table of the original data linked in the footnote and you should be aware that the distribution of the data has also

been modified. This means that any classifiers learned on the full dataset will not generalise well to the data for this

assignment.

Note also that some data instances have missing feature values. You need to be mindful of this when building classifiers.
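
As an illustration of one way to handle such values, the sketch below replaces each missing numeric entry with the mean of the observed values in the same column. This is a minimal sketch, not the required method: the function name impute_column_means and the toy data are hypothetical, and it assumes missing entries are marked with '?' as in the .arff file. WEKA's ReplaceMissingValues filter applies a similar mean/mode substitution.

```python
from statistics import mean

MISSING = "?"  # ARFF marks a missing value with a question mark

def impute_column_means(rows):
    """Replace missing entries in each numeric column with that column's mean.

    `rows` is a list of equal-length lists holding floats or MISSING.
    Returns a new list of rows; the input is left unchanged.
    """
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] != MISSING]
        # A column with no observed values cannot be imputed; keep it missing.
        col_means.append(mean(observed) if observed else MISSING)
    return [
        [col_means[j] if value == MISSING else value for j, value in enumerate(row)]
        for row in rows
    ]

# Hypothetical toy data: three instances, two numeric features.
toy = [[1.0, "?"], [3.0, 4.0], ["?", 8.0]]
print(impute_column_means(toy))  # [[1.0, 6.0], [3.0, 4.0], [2.0, 8.0]]
```

Mean substitution keeps every instance but shrinks the variance of the affected feature, which is one of the effects worth discussing when comparing approaches.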

2 General Guidelines

As with the previous assignment for this module, you should run experiments using WEKA (version 3.8.1) Explorer. If

you are working on a machine outside of the departmental network (e.g. your laptop), please ensure that you download

and install this exact version. This is very important as your results need to be reproducible on a departmental

machine.

When dealing with large datasets in WEKA, it is possible that you may encounter out-of-memory errors because the

JVM has not reserved enough memory for WEKA (i.e. the ‘heap size’ is too small). If this is the case, then there are a number

of ways to address this - see the appendix for more info.

In many cases, the WEKA Explorer allows you to modify the random seed that will be used. However, you should use the

default seed to aid reproducibility of your results. You will have to work out some details in applying WEKA Explorer,

including the meaning of certain terms (such as ‘random seed’, referred to above).

1 https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

3 The Task

This assignment has three subtasks, each discussed in more detail in its own section below.

3.1 Task 1 - Evaluating classifier performance (30%)

The first task is to examine how some of the classifiers provided in WEKA perform on the provided humanActivity.arff

dataset.

1. You should train and evaluate different classifiers on the dataset by first loading the humanActivity.arff dataset. In

this task, you will be using cross-validation (CV) on the dataset to evaluate the classifier performance. In order to

reduce computation time you may use 10-fold cross validation.

2. A good starting point for the analysis is to try both Naive Bayes (NaiveBayes) and C4.5 Decision Trees (J48). However,

for a complete analysis I would expect you to explore the data using at least four different learners including some

of the sub-symbolic learners that we covered in the lectures (e.g. SVM, NN, etc.). Compare the results (percent

correct, PC, etc.) for the different classifiers. What do you notice? What do you think would be a reasonable baseline

against which to compare the classification performance? You are not limited to any particular selection of classifier

learners and you may try others (as many as you like, in fact). However, you must provide a solid reasoning for

choosing those that we have not covered in the material for this module (and their inclusion in the report), and have

an understanding of how they work, as well as stating why you think they perform better/worse than others. It is

not sufficient to simply include lots of results for different classifiers without any meaningful analysis.

3. In the introduction section, it is mentioned that some instance-attribute values are missing. It is important to deal

with these as they may affect the performance of the learners applied to the data. One way of addressing this might

be to use one of the filters provided in WEKA to replace such values with something else. This may be appropriate

in some cases and not in others. It is important to bear this in mind however, and you should check the data carefully,

use your best judgement and clearly state any assumptions you make when performing any subsequent

experiments.

The use of a filter and/or manual manipulation of the dataset (by identifying any values to be changed) of

course results in a different dataset from the original. You should now re-analyse the modified dataset (or datasets

if you attempt more than one approach) and present your findings. What effect does this have on the models that

are learned? Again, as in the previous step, you should provide sound reasoning for any conclusions you draw and

classifiers you choose.

4. Returning to the missing attribute values: can you suggest another way, apart from the strategies employed by the

WEKA filters, to change the values? If so, you should describe it and present any experimental results, if you make

any changes to the data. Discuss the effects that these changes might have on the learner.

You should use the dataset you believe to be the most suitable, and explain your choice in your report. Discuss (in your

report) the analysis of the experiments performed.
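
To make the evaluation procedure concrete, the following sketch runs k-fold cross-validation for a majority-class predictor, which is one common choice of baseline (WEKA's ZeroR classifier works this way). It is an illustration under stated assumptions rather than a reproduction of WEKA's behaviour: the function names, the seed value and the label vector are hypothetical, and the folds here are not stratified.

```python
import random
from collections import Counter

def k_fold_indices(n, k, seed=1):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def majority_baseline_cv(labels, k=10, seed=1):
    """Cross-validated accuracy of always predicting the most common training label."""
    folds = k_fold_indices(len(labels), k, seed)
    correct = 0
    for test_fold in folds:
        held_out = set(test_fold)
        # Train on everything outside the held-out fold.
        train_labels = [labels[i] for i in range(len(labels)) if i not in held_out]
        prediction = Counter(train_labels).most_common(1)[0][0]
        correct += sum(1 for i in test_fold if labels[i] == prediction)
    return correct / len(labels)

# Hypothetical label vector matching the stated class balance: 6 classes x 200.
labels = [c for c in range(1, 7) for _ in range(200)]
print(f"baseline accuracy: {majority_baseline_cv(labels):.3f}")
```

With six equally represented classes, such a predictor cannot score above 1/6 (about 16.7%), so any learner worth reporting should clearly exceed that figure.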

3.2 Task 2 - Feature selection (40%)

In any dataset, there may be superfluous features - i.e. those that do not contribute to determining the classes of the data

instances. Indeed, often these attributes can be misleading and mean that classification performance can be negatively

affected as a result. One way in which we can address this issue is to perform attribute or feature selection.

You are asked to perform feature selection by carefully considering the properties of the original attributes (mean,

standard deviation, etc.), and applying any changes (removal or inclusion of attributes) for the supplied training set,

before then also applying the same changes to the test set.
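
The sketch below illustrates the two steps just described: summarising one attribute, then removing the same attributes from both the training and test data. It is only an illustration; the attribute names, values and the choice of which feature to drop are hypothetical, and '?' is assumed to mark a missing value.

```python
from statistics import mean, stdev

def summarise(column):
    """Mean, sample standard deviation, min and max of the observed values."""
    observed = [v for v in column if v != "?"]  # "?" marks a missing value
    return {
        "mean": mean(observed),
        "std": stdev(observed) if len(observed) > 1 else 0.0,
        "min": min(observed),
        "max": max(observed),
    }

def drop_columns(rows, names, to_drop):
    """Return (kept_names, new_rows) with the named columns removed."""
    keep = [j for j, name in enumerate(names) if name not in to_drop]
    return [names[j] for j in keep], [[row[j] for j in keep] for row in rows]

# Hypothetical attribute names and values, for illustration only.
names = ["feat_a", "feat_b", "ACTIVITY"]
train = [[0.1, 5.0, 1], [0.1, 7.0, 2], [0.1, "?", 1]]
test = [[0.1, 6.0, 2]]

for j, name in enumerate(names[:-1]):
    print(name, summarise([row[j] for row in train]))

# Suppose the analysis suggests feat_a carries no information (it is constant).
removed = {"feat_a"}
train_names, train_reduced = drop_columns(train, names, removed)
test_names, test_reduced = drop_columns(test, names, removed)
assert train_names == test_names  # both sets must keep identical attributes
```

Deciding what to remove from the training set alone, and then applying that fixed list to the test set, avoids letting test data influence the selection.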

The two datasets that are provided for this are available on Blackboard: humanActivityTrain.arff and

humanActivityTest.arff. Note: you should not use an automatic feature selection method or mechanism such as those

included in WEKA for the first two parts of this task - I am interested in your analysis, not in whether you have chosen

an optimal subset of the data. The use of a feature selection algorithm will be obvious from the detail provided in

your report.

1. Examine each of the attributes in the training dataset (humanActivityTrain.arff) in detail using the visualisation tools

provided in the WEKA Explorer preprocess tab. Document/calculate some basic measures: mean, standard

deviation, maximum and minimum, and whatever other metrics you consider appropriate. What does this tell you

about each of the features of the dataset? Document your findings in the report.

2. Using the findings from 1) above, can you suggest a strategy for the inclusion or removal of certain features? If so,

perform a manual removal/inclusion of features on humanActivityTrain.arff using the preprocess tab in WEKA and

save the new dataset. Remove the same features for humanActivityTest.arff. Now, re-analyse the performance of the

classifiers that you chose for task 1. What do you notice? Is performance better or worse? Provide an analysis of

why you think that the performance has changed. You may include/exclude as many features as you like as long as

you have a sound reasoning for doing so. Also, you may generate more than one dataset with selected features and

perform analyses accordingly.

3. Very often as part of some classification algorithms, e.g. tree-based approaches, feature information gain scores are

used as a splitting criterion to perform classification. What can you say about the tree that is generated from the

humanActivityTrain.arff and the features that are used to split upon using the J48 classifier in WEKA? Is the tree

similar for the humanActivityTest.arff dataset? Document your findings in the report.
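
For reference, the information gain of a single threshold split on a numeric feature can be sketched as below. This is a simplified illustration with hypothetical function names and data: it ignores missing values, and C4.5 normalises this quantity into a gain ratio when choosing splits, so the numbers will not match J48's internals exactly.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

# Hypothetical feature values for four instances of two activities.
values = [0.1, 0.2, 0.8, 0.9]
labels = [4, 4, 1, 1]
print(information_gain(values, labels, threshold=0.5))  # perfect split: 1.0 bit
```

A feature whose best threshold yields a gain near zero contributes little to separating the classes, which is one way to relate the tree's chosen splits to the manual analysis of individual features.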

3.3 Task 3 - Summary of results/findings (15%)

The final task is to draw upon the findings of tasks 1 and 2 above and answer the following questions:

1. What is a good model for the full dataset using CV, and why? Provide a reasoned argument summarising the findings

in your report.

2. What effect does the approach you employed for dealing with missing-valued data (denoted ‘?’ in the .arff file) have

upon the dataset and the performance of the classifiers? State why you think that the approach you have used is

appropriate.

3. Which features are most useful in the dataset? Refer to your findings of the analysis of the individual features as well

as any methods such as decision trees (J48) which used feature information gain metrics. Is feature selection

appropriate for this classification problem/dataset and why?

4. Does a reduced dataset offer any advantages over the full dataset even if there are some negative changes in overall

performance?

4 Submission and Marking

You are required to submit two separate things:

1. A report in .pdf format via TurnItIn. You should aim to keep your answers concise, while also conveying the

important information. A report of 2,800 words max. (including references) is appropriate for this. You should

also include the TurnItIn word count in your document so that it can be verified. Please do not exceed the word

limit as this will delay the marking process.

2. The modified dataset(s) created in Task 2. This should be submitted via Blackboard, not TurnItIn.

Your assignment will be assessed according to the department’s assessment criteria for essays (see Student Handbook

Appendix AC) and marked based on your report and any supporting data you submit. In general it is advisable to

concentrate on appropriately sized selections or excerpts of data to support your discussion. It is not necessary to include

very large excerpts of data. However, if you feel the need to include such elements please do not include them in the

main text, but instead as an appendix, in order not to clutter the format of the report. The following marking scheme will

be used to mark your submission:

• Task 1 - Evaluating classifier performance (see description above) (30%)

• Task 2 - Feature selection (see description above) (40%)

• Task 3 - Summary of Results/Findings (15%)

• Report - readability, correct formatting, layout and proper referencing (15%)

You should aim to keep your answers concise, while conveying the important information.

Appendix 1. - Increasing the Java Heap Size in WEKA

1. For Windows & Linux: You can change the heap size parameter in the RunWeka.ini file (Windows only). In Linux

or Windows, you can also do so by navigating (using the command line) to the folder where the weka.jar file is

located on your machine and typing:

java -Xmx2048m -jar weka.jar

to start WEKA. The figure ‘2048’ represents the maximum size of the heap. This can be increased to make heap size

larger if desired.

2. For MacOS: Open:

System Preferences→Java Control Panel→Java→ click on View.

Edit the Runtime args box for the user by including the parameter e.g. -Xmx2048m. Click Apply.

Note that the figure ‘2048’ represents the maximum size of the heap. This can be increased to make heap size larger

if desired.

