158.222-2019 Semester 1 Massey University
ASSIGNMENTS 2 AND 3 – PREDICTING HAPPINESS AND DOING OTHER STUFF
Deadline: Hand in by midnight 4 May 2019 (end of week 8)
Evaluation: Assignment 2: 10% of your final course grade.
Assignment 3: 10% of your final course grade.
Late Submission: Refer to the course guide.
Work: These assignments are to be done individually.
Purpose: Implement the entire data science/analytics workflow. Use regression techniques to solve real-world
problems. Gain skills in extracting data from the web using APIs and web scraping. Build on
the data wrangling, data visualization and introductory data analysis skills gained up to this point
as well as problem formulation and presentation of findings. Gain skills in kNN regression
modelling or supervised learning, and unsupervised learning.
Learning outcomes 1 - 5 from the course outline.
Please note that all data manipulation must be written in Python code in the Jupyter Notebook
environment. No marks will be awarded for any data wrangling that is completed in Excel.
Assignments 2 and 3 are related to each other and have the same due date. However, they are to be
submitted separately. Create a separate notebook for each assignment.
These assignments will take longer than you think, so…do not leave starting these assignments until
the last minute. You have the tools you need to start now.
As of the week 5 lecture, you will have been introduced to tools that will assist you in completing
assignment 2. By week 7 (before semester break) you will be able to complete most of assignment
3, except for task 3 (unsupervised learning), which you can complete after the week 8 lecture.
Terminology:
• Note that ‘feature’ and ‘variable’ refer to the same thing: an array of values that represent a
data attribute. For the purposes of your work, this will usually be in the form of a Pandas
series (column in a dataframe). These terms are used interchangeably in this specification.
• The World Bank refers to country attributes as ‘indicators’.
• ‘Predictors’, ‘explanatory variables/features’ and ‘input variables/features’ are terms used
interchangeably and refer to arrays of values that represent data attributes, which can be input
into a model and are distinct from the target variable (the value you are trying to predict).
Before commencing, read through both assignment specifications for context.
******************
*** Plagiarism ***
******************
It is mandatory that any assessment items that you submit during your University
study are your own work. Massey University takes a firm stance on academic
misconduct, such as plagiarism and any form of cheating.
Plagiarism is the copying or paraphrasing of another person’s work, whether
published or unpublished, without clearly acknowledging it. It includes copying the
work of other students and reusing work previously submitted by yourself for another
course. It also includes the copying of code from unacknowledged sources.
Academic integrity breaches impact all students, as they disadvantage honest students
and undermine the credibility of your qualification. Plagiarism and cheating in tests
and exams will be penalised; this is likely to lead to loss of marks for that item of
assessment and may lead to an automatic failing grade for the course and/or
exclusion from re-enrolment at the University.
Please see the Academic Integrity Guide for Students on the University website for
more information. The Guide steps you through the University Academic Integrity
Policy and Procedures. For example, you will find definitions of academic integrity
misconduct, such as plagiarism; how misconduct is determined and managed; and
where to find resources and assistance to help develop the skills of academic writing,
exam preparation and time management. These skills will help you approach
university study with academic integrity.
ASSIGNMENT 2: DATA ACQUISITION AND REGRESSION TO PREDICT HAPPINESS
In Assignment 2 you will be integrating data from two sources:
• the World Happiness Index
and one of:
• the World Bank API, or
• a web-scraped source of your choosing
Your goal is to build regression models for predicting happiness, following a good process, including:
• careful selection of explanatory variables (features) through engaging your critical thinking in choosing data
sources, exploratory data analysis and optional feature set expansion;
• good problem formulation;
• good model experimentation (including explanation of your experimentation approach); and
• thoughtful model interpretation.
TASK 1: DATA ACQUISITION AND INTEGRATION (25 MARKS)
a) Static Data: Import Table 2.1 of the World Happiness Report data (1 mark)
To begin building your feature set, download the “WHRData.xls” static dataset from Stream and read in the
data. The dataset is from the 2019 World Happiness Report. Learn more about this report here:
http://worldhappiness.report/ed/2019/
Data definitions and other feature documentation can be found here:
https://s3.amazonaws.com/happiness-report/2019/WHR19_Ch2A_Appendix1.pdf
You should familiarise yourself with the data documentation before proceeding. As a bare minimum, you will
need to identify which variable represents ‘Happiness’.
Note: if you are unable to meet the challenges laid out in Task 1 b) and c) you will still be able to continue with
Tasks 2 and 3 using the static dataset.
b) Dynamic data (14 marks)
To expand your feature set with dynamic data, do ONE of either option 1 or option 2.
OPTION 1:
API Data: Identify, import and wrangle indicators from the World Bank API
The World Bank API is briefly introduced in Lecture 5. Your task is to identify and import 5 or more World Bank
indicators (features) that you would like to have as options for inclusion in your models for predicting
happiness.
Identify: To identify 5 or more appropriate indicators, you will need to explore the World Bank API
documentation and figure out for yourself how to find which indicators are available and then how to
identify and request them. Finding your own way through the documentation is a deliberate part of this
challenge. Briefly explain your process and why you chose your indicators. These links will provide you
with a start:
https://datahelpdesk.worldbank.org/knowledgebase/articles/889392-api-documentation
https://datahelpdesk.worldbank.org/knowledgebase/articles/898599-api-indicator-queries
Import and wrangle your chosen indicators so that they are in the right shape for integration with the WHR
data. In Lecture 5, only one indicator is imported. Importing many indicators in a tidy fashion (i.e. without
repeating code) will involve the use of a loop and/or function, depending on your approach.
Note: By default, you may not be returned all data you require. You may have to set arguments to obtain
the full range (look out for the ‘per_page’ argument). Also note that you can specify a date range.
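For illustration, a minimal sketch of one looped approach follows, using the requests and pandas modules. The two indicator codes, the column names and the date range are placeholders only; you must identify and justify your own indicators.

import requests
import pandas as pd

# placeholder indicator codes -- substitute the indicators you identify
indicators = {"NY.GDP.PCAP.CD": "gdp_per_capita",
              "SP.DYN.LE00.IN": "life_expectancy"}

frames = []
for code, name in indicators.items():
    url = "http://api.worldbank.org/v2/country/all/indicator/" + code
    # per_page must be large enough to return every row in one request
    params = {"format": "json", "per_page": 20000, "date": "2005:2018"}
    records = requests.get(url, params=params).json()[1]  # element 0 is paging metadata
    df = pd.DataFrame([{"country": r["country"]["value"],
                        "year": int(r["date"]),
                        name: r["value"]} for r in records])
    frames.append(df.set_index(["country", "year"]))

# one row per country-year, one column per indicator
wb = pd.concat(frames, axis=1).reset_index()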
Task 1b) Option 1 marking:
• 6 marks for identifying indicators and explaining why you chose them. We expect curiosity and initiative in
exploring the World Bank API and figuring out how to use it to identify appropriate indicators. 0/6 marks
for simply importing a subset of the indicators that you have already been given codes for in Lecture 5.
• 8 marks for the import and wrangling of data – highest marks for elegant and tidy solutions.
OPTION 2:
Web-scraped data: Source, import, parse and wrangle web data to add to your feature set
Source: Go to the internet and find another data source with which to expand your feature set that:
o can be web-scraped (downloading a csv or excel file from a website is not web-scraping),
o you think may improve your predictive model, and
o can be meaningfully integrated with the WHR data and your World Bank data.
In case it is not obvious, you will be looking for data that can be linked on both country name and one
or more years of the data you have already acquired.
Import, parse, wrangle: Scrape the data and wrangle it into the shape required for later integration.
Explain: Give a brief explanation of source choice and wrangling process before your wrangling script.
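As a hedged illustration only (the URL below is hypothetical, and the wrangling required depends entirely on the page you choose), pandas can parse HTML tables straight from a page; for messier sources, requests plus BeautifulSoup gives finer control.

import pandas as pd

# hypothetical source URL -- substitute the page you settle on
url = "https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index"
tables = pd.read_html(url)  # parses every <table> on the page (needs lxml or html5lib)
print(len(tables))          # inspect the list to find the table you want
scraped = tables[0]
# from here: rename columns, coerce types, and keep only the country/year/value
# columns you need for integration with the WHR (and World Bank) data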
Task 1b) Option 2 marking:
• 3 marks for finding an appropriate and good quality data source.
• 8 marks for effective and tidy import/parse/wrangle code. Highest marks also go to the biggest
challenge taken on (some sources, such as Wikipedia tables, are easier to scrape than others).
• 3 marks for briefly explaining your source choice and wrangling process.
c) Integration: By whichever means appropriate, clean labels and integrate the two datasets from a) and b) into
one dataframe (10 marks)
Inspect and clean labels for integration: To integrate your data without losing rows, you must ensure
compatibility of labels in the features you are joining with. This may involve data cleaning/updating
using old-fashioned gruntwork. For instance, one country can have different names in different
datasets (e.g. Dem. People's Rep. of Korea vs North Korea). Do data checks pre and post-integration to
ensure you have not lost data. Outer joins combined with filtering may assist you in this process. Data
loss due to a country being present in one dataset but genuinely not present in another is acceptable.
Include a brief explanation of your process before your cleaning script.
Integrate your data into one dataframe.
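A minimal sketch of this check-clean-integrate process follows, assuming frames named whr (from a) and wb (from b), each with 'country' and 'year' columns; the rename mapping shown is an example only.

import pandas as pd

# pre-integration check: which labels do not line up across the two datasets?
print(sorted(set(whr["country"]) - set(wb["country"])))
print(sorted(set(wb["country"]) - set(whr["country"])))

# manually map the mismatches you find (the old-fashioned gruntwork)
renames = {"Korea, Dem. People's Rep.": "North Korea"}  # example mapping only
wb["country"] = wb["country"].replace(renames)

# an outer join keeps unmatched rows visible for post-integration checks
merged = pd.merge(whr, wb, on=["country", "year"], how="outer", indicator=True)
print(merged["_merge"].value_counts())  # rows not marked 'both' need attention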
Task 1c) Marking:
• 6 marks for checking label compatibility for integration (via scripting) and, if required,
cleaning/updating those labels
• 2 marks for briefly explaining your process
• 2 marks for the final integration. At this point the final integration should be a straightforward line (or
few lines) of code.
TASK 2: DATA CLEANING AND EXPLORATORY DATA ANALYSIS (EDA) (24 MARKS)
a) EDA – data quality inspection (12 marks)
Explore: Explore your integrated dataset with a view to looking for data quality issues. This could
involve looking at summary statistics, plots, and inspection of nulls and duplicates – whatever you think is
appropriate; there is no single correct way of doing this. Clean your data if and as required and save
the cleaned dataset to csv for use in assignment 3.
Inspect target variable: Do a visual inspection of the distribution of your target variable (Happiness).
Explain whether it needs transformation to conform to a normal distribution. Transform if required.
Explain: Include a brief explanation of your process before your quality inspection and cleaning script.
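A minimal sketch of such a quality pass follows, assuming the integrated frame is named merged and the target column is named 'happiness' (both names are assumptions).

import matplotlib.pyplot as plt

print(merged.describe())                                    # summary statistics
print(merged.isnull().sum())                                # nulls per column
print(merged.duplicated(subset=["country", "year"]).sum())  # duplicate country-years

merged["happiness"].hist(bins=30)  # visual check of the target's distribution
plt.show()

merged.to_csv("clean_happiness.csv", index=False)  # reused in assignment 3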
Task 2a) marking:
• 10 marks for your code/outputs (explore/inspect): Did you produce code and outputs appropriate for
inspecting and addressing (if and as required) data quality issues?
• 2 marks for briefly explaining your process and discussing the distribution of the target variable.
b) EDA – the search for good predictors (12 marks)
Explore: Explore your data with the goal of finding features that could be good predictors of your
target variable (a code sketch follows the notes below). This should include:
o Inspection of correlations between features
o Pairs plot/scatter matrix
o Any other visualisation that you deem appropriate
Explain: Include a brief explanation of your process before the script for exploring your predictors.
Discuss: Briefly discuss your findings, e.g. “I have chosen this subset of features as good candidates for
model predictors because…” (warning: do not copy and paste this text into your report, we will
deduct marks if you do.) It is also OK to choose features for reasons other than them being the best
predictors – perhaps you are curious as to whether a given feature would have any effect in a model.
Note: You are looking for features that are well correlated with the target variable. You are also looking out for
features that are highly correlated with each other. Be aware that while models can have predictive power
while including highly correlated predictors (multicollinearity), the effects of those correlated predictors will be
masked by each other. Where there is multicollinearity, interpretation of specific feature coefficients is
uncertain. Bear this in mind later when interpreting your models.
Note: You may find that your chosen predictors come from the same data source. That is OK.
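Here is the sketch promised above of one way to start the predictor search, again assuming the frame merged and a 'happiness' target; the candidate feature names are placeholders.

import pandas as pd
import matplotlib.pyplot as plt

corr = merged.corr()  # pairwise correlations between numeric features
print(corr["happiness"].sort_values(ascending=False))  # correlation with the target

# candidate predictor names are assumptions -- use the features you acquired
candidates = ["happiness", "gdp_per_capita", "life_expectancy"]
pd.plotting.scatter_matrix(merged[candidates], figsize=(10, 10))
plt.show()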
Task 2b) marking:
• 8 marks for your code/outputs (explore): Did you produce outputs appropriate for finding good
predictors? Is your code elegant and concise?
• 4 marks for your words (explain and discuss): Did you explain your process and discuss your findings?
**BONUS QUESTION**
Up to 5 marks will be awarded for feature set expansion via the creation of derived predictor/s that make a significant
and novel contribution to your final model. How you do this is up to you. Being an extension task, we will give no further
guidance, and a very high standard is set for achieving maximum marks. Ingenuity and initiative will be rewarded.
TASK 3: MODELLING (44 MARKS)
Build the best regression model you can, with Happiness as the target variable, within whichever bounds you set yourself
in your problem formulation.
Formulate a problem: You know ‘Happiness’ is your target variable, but what else are you interested in with
respect to this problem? Would you like to simply find the model with the most predictive power? Are you
interested in understanding how particular features of interest to you affect Happiness? Or perhaps you are
interested in finding the most parsimonious model possible, while still retaining predictive power? Another
approach is to look at models for a particular group or groups. Perhaps you would like to filter your dataset to
include only OECD countries? Or perhaps you would like to build different models for developed, developing
and underdeveloped countries? (the World Bank API has this data). Maybe you have some other ideas? Briefly
explain how you will be approaching this regression problem. This will help you to focus your experimentation.
Experiment and explain: Explore regression models in a way that is appropriate to your problem formulation.
Experiment with single variable (optional) and multiple variable (required) linear regression. Consider using a
step-wise algorithm. Optionally, experiment with polynomials. Explain your approach to experimentation.
o Do not use joint plots as a substitute for regression modelling. Zero marks will be given to any model
experimentation that relies on joint plots.
o Do use a module for modelling, and do not code up your regression model from scratch.
o Do consider ‘Year’ as a predictor to include in your model.
o Do display model statistics
Note: If you are interested in the predictive power of your model, your best model is likely to include multiple
explanatory variables so don’t waste effort bulking out the assignment with single variable models.
Note: when you have more than one predictor in your model, you will not be able to produce the regression plots
from Lecture 4 because they are two dimensional (target vs one predictor). That is OK. There are other ways to
visualise; for instance, you could plot summary statistics (like RMSE or R-squared) from your different models, which you
have collated into a dataframe.
Write elegant code: Experimenting with many different models will involve repetition of code, so employ loops
and functions for model creation and evaluation. Functions and loops = less code = easier to read reports and
easier and more effective experimentation.
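As one possible shape for such experimentation (a sketch only, using statsmodels formulas; the feature names are assumptions), you could loop over feature subsets and collate comparable statistics into a dataframe.

import itertools
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

features = ["gdp_per_capita", "life_expectancy", "year"]  # assumed names
results = []
for r in range(1, len(features) + 1):
    for subset in itertools.combinations(features, r):
        formula = "happiness ~ " + " + ".join(subset)
        fit = smf.ols(formula, data=merged).fit()
        results.append({"model": formula,
                        "adj_rsq": fit.rsquared_adj,
                        "rmse": np.sqrt(np.mean(fit.resid ** 2)),
                        "f_pvalue": fit.f_pvalue})

summary = pd.DataFrame(results).sort_values("adj_rsq", ascending=False)
print(summary)  # one row of comparable statistics per model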
Evaluate/interpret: To compare models, you will need to interpret model outputs. For instance, the
probability of the F statistic tells us whether there is a significant relationship between the response and
predictors as expressed by the model. R-Squared tells us about the strength of that relationship (and how good
our model would be for prediction). Consider the coefficients for your predictors – are they significant and
doing heavy lifting in the model, or are they surprisingly superfluous? Can the coefficients be interpreted or is
multicollinearity an issue? You may like to calculate RMSE and interpret that in context.
Present preferred/final model: settle on a preferred or final model for further inspection.
o Residuals: Produce a plot of residuals against fitted values (see the sketch after this list) and explain whether it is likely that this model
fulfils the necessary assumptions of homoscedasticity (homoscedastic residuals should not fan out) and
linearity (the residuals should randomly scatter around the fitted line and not follow a curved shape).
You could find code for this online, or you could look up the code in the exercise hints for Lecture 4.
For the purposes of this assignment you are not expected to analyse the residuals beyond a visual
inspection. We would usually inspect residuals before interpreting any model output, not just the final
model. That requirement is waived here to pare down the scope.
o Describe what the coefficients of the model mean, remembering to mention what units they are in as
appropriate (e.g. sealevel = 0.58*temp_celsius : ‘for every degree Celsius increase in average global
temperature, sea level rises by 58 centimetres’).
o Explain how reliable the model was. Was it a good fit and good for prediction? How did the residuals
look, do you think they conformed with assumptions? Could you recommend this model to a client?
o Optional - Plot the confidence intervals and prediction bands for that model and describe what they
tell you (there are no extra marks for this option)
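Here is the residuals sketch referred to in the list above, assuming final_model is a fitted statsmodels OLS results object.

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter(final_model.fittedvalues, final_model.resid, alpha=0.6)
ax.axhline(0, linestyle="--")  # residuals should scatter randomly about zero
ax.set_xlabel("Fitted values")
ax.set_ylabel("Residuals")
plt.show()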
Note: As we do not delve deeply into statistics in this course, and to keep the assignment scope manageable, we will
not be holding your work in this assignment to a high statistical standard (for instance, looking for outliers, high
leverage points, etc). We would like you to demonstrate curiosity, your ability to use the tools provided, and show
that you can select good predictive features, experiment and evaluate a model.
Task 3 Marking:
• 2 marks for problem formulation
• 8 marks for rich experimentation in modelling (maximum of half marks if no multiple-variable models included)
• 8 marks for producing appropriate outputs for model evaluation. Highest marks for creative approaches to this
(e.g. producing visualisations that display model statistics from multiple models for ease of comparison.)
• 8 marks for elegance of code (effective use of loops/functions)
• 8 marks for interpretation of outputs
• 10 marks for presentation of final/preferred model:
o Residuals plot – 4 marks
o Interpretation of residuals plot – 2 marks
o Coefficient explanation – 1 mark
o Discussion of model reliability – 3 marks
TASK 4: PRESENTATION - ‘REPORT-ERIZE’ YOUR WORK (7 MARKS)
Go back through what you have done and turn your Assignment 2 work into something that looks like a report that you
could hand to a client (a technically savvy client, as you still need to include/provide your scripting for marking). Include
a brief introduction that describes the modelling problem you formulated, and a brief description of the datasets that
you use. Add a conclusion. Use formatted markdown boxes that include headings and subheadings. Do also include
text/headings that break the ‘fourth wall’ to clearly delineate the different tasks of the assignment (e.g. ‘Task 1b’) and for
the sake of fulfilling requirements for marking (e.g. descriptions of process). Any formatting that makes the task of
marking easier would be most appreciated and ensures we do not overlook areas where marks should be specifically
awarded.
Put your name and ID on your report. It seems obvious, but this is a common omission. As an incentive, there will be a
3-mark deduction for reports that are missing a name and ID.
Clear out any unnecessary code and outputs that clutter your work. Run your text through a spell checker extension. See
the end of assignment 3 (‘Assignment 3 requirements’) for more tips on how to tidy up a report by hiding scripts.
HAND-IN:
Zip up all your notebooks, Python files and dataset(s) into a single file. Submit this file via Stream. Make sure that your
Jupyter notebook has been run with all outputs visible. Download an HTML version of your notebook (with outputs
showing) and include this in your zip file.
ASSIGNMENT 3 STARTS BELOW
ASSIGNMENT 3: KNN REGRESSION, SUPERVISED AND UNSUPERVISED LEARNING
PROJECT OUTLINE
In this project you will be producing a Jupyter Notebook report. You will apply techniques taught so far to either build
kNN regression models or supervised learning models. You will also build unsupervised learning models. You will use the
happiness dataset you built in assignment 2 that you may optionally expand. If you do supervised learning, you may
optionally choose a different dataset.
You do not need to repeat the analysis from assignment 2 - assignment 3 extends this work. You may nonetheless find
that further data wrangling and analysis is required. In that case, such work will be considered when marking.
TASK 1 – IMPORT THE CSV YOU SAVED IN TASK 2A) OF ASSIGNMENT 2 (NO MARKS FOR THIS)
TASK 2 – BUILD KNN REGRESSION MODELS OR SUPERVISED LEARNING MODELS (60 MARKS)
Choose ONE Option:
OPTION 1 – KNN REGRESSION MODELS
Formulate: Using your assignment 2 dataset, creatively formulate a problem that enables you to perform kNN
regression for prediction. It is acceptable if this problem is the same problem you explored in your regression
analysis in assignment 2. Describe this problem in your introduction.
Model: Experiment with models for this prediction containing different subsets of features.
Marking expectations (what we are specifically looking for in your modelling):
o Models with multiple input features
o Scaling of all input features
o train/test split for all models so that the models can be meaningfully evaluated (train with the training
data, evaluate with the testing data). This is not explicitly demonstrated in the kNN regression lecture,
but it is demonstrated in the supervised learning lecture. Some guidance on how to achieve this with
kNN regression (if you need it) is included in the appendix.
o Experimentation with input feature subsets
o Experimentation with model parameters - different distance metrics and different values of k
Evaluation and interpretation - Generate, interpret and compare evaluation metrics for your various models.
Ideally, this will involve some visualisations such as plotting metrics for different models. Consider questions
such as: which values of k are most robust for the size of your dataset and your problem domain?
Discussion - How reliable are your prediction models? Could you recommend any to a client? Would you expect
this model to preserve its accuracy on data beyond the range it was built on?
Note: As with assignment 2, there are plots in the kNN regression lecture that cannot be reproduced for multivariable
models. Do not let this prevent you from producing models with many features - they will give you the best
results. There are other plots you could produce, e.g. plotting metrics across different models to compare them.
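A minimal sketch of the expected split/scale/experiment pattern with scikit-learn follows; X (your chosen predictors) and y (happiness) are assumed inputs, and the metrics and values of k shown are only starting points.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=42)
scaler = StandardScaler().fit(X_train)  # fit the scaler on training data only
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

for metric in ["euclidean", "manhattan"]:  # experiment with distance metrics
    for k in [3, 5, 7, 11]:                # ...and with values of k
        knn = KNeighborsRegressor(n_neighbors=k, metric=metric).fit(X_train_s, y_train)
        rmse = np.sqrt(mean_squared_error(y_test, knn.predict(X_test_s)))
        print(metric, k, round(rmse, 3))   # collect these into a frame for plotting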
OPTION 2 – SUPERVISED LEARNING
Formulate: Using your assignment 2 dataset, or a dataset of your choosing, creatively formulate a classification
problem for which you can build supervised learning models. Describe this problem in your introduction.
Explore features: Explore the ability of features to discriminate between your chosen or derived class labels. For
instance, as in the lecture, you could plot histograms or box plots of different features by class label and see if
the distributions are noticeably different. Consider exploring other types of plots.
Model: Create models using different subsets of input features for prediction.
Marking expectations (what we are specifically looking for in your modelling):
o Models with multiple input features
o Scaling of all input features
o a train/test split for all models so that the models can be meaningfully evaluated (train with the
training data, evaluate with the testing data). There is guidance for this in Lecture 7.
o Experimentation with input feature subsets, feature selection and algorithms
o Evaluate and interpret - Generate, interpret and compare evaluation metrics for your models. This
should involve visualisations such as plotting metrics across different models. Consider cross-validation.
Note: Target/class labels: If you would like to use your assignment 2 dataset for Option 2, you will need categories to
predict. There are many ways of doing this – you could see whether there are appropriate categorical features from
the World Bank API or other data sources that you could integrate into the dataset. Alternatively, and more simply,
you could derive labels from an existing feature. For instance, you could create ‘high’, ‘medium’ and ‘low’ happiness
labels according to happiness score (or do something similar with any feature of your choosing that you would like to
predict). If you use an entirely new dataset, we would expect some EDA.
Note: Input features: Feel free to derive new input features.
Note: You can use Python’s scikit-learn module for machine learning or try other algorithms. There are many
other Python implementations of machine learning algorithms, such as neural networks (PyBrain), that are not
implemented in scikit-learn; you may use these if you wish.
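As a sketch of the derive-labels-then-classify idea (the column names and the choice of classifier are assumptions, not requirements):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

data = merged[["happiness", "gdp_per_capita", "life_expectancy"]].dropna()
# derive three class labels by binning the happiness score into tertiles
data["band"] = pd.qcut(data["happiness"], 3, labels=["low", "medium", "high"])

X = data[["gdp_per_capita", "life_expectancy"]]
y = data["band"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=42)
scaler = StandardScaler().fit(X_train)  # scale using training data only
clf = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(X_train), y_train)
print(classification_report(y_test, clf.predict(scaler.transform(X_test))))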
TASK 3 – BUILD UNSUPERVISED LEARNING MODELS (30 MARKS)
• Feature selection: Choose different subsets of input features from your assignment 2 dataset for clustering.
• Scale all the input features that you will be using.
• Perform cluster analyses with scikit-learn using the input feature sets. Multi-variable clustering models are
expected. Do not create models using the ‘coding from scratch’ algorithms in the lecture. Do create cluster
models using scikit-learn.
• Visualise, evaluate, interpret and discuss your results (there is some guidance for visualising clusters in the
appendix).
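A minimal sketch of one scaled, multi-variable cluster model with scikit-learn follows (the feature names are assumptions, and you should experiment with several subsets and cluster counts).

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = merged[["gdp_per_capita", "life_expectancy", "happiness"]].dropna()
X_scaled = StandardScaler().fit_transform(features)  # scale before clustering

km = KMeans(n_clusters=3, random_state=42).fit(X_scaled)
clustered = features.assign(cluster=km.labels_)
print(clustered.groupby("cluster").mean())  # per-cluster feature profiles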
TASK 4: PRESENTATION - ‘REPORT-ERIZE’ YOUR WORK (10 MARKS)
Refer to Task 4, Assignment 2 for what to do here.
Assignment 3 Requirements:
The Python code in the submitted notebooks must be entirely self-contained, and all the experiments and graphs
must be replicable. Do not use absolute paths; use relative paths if you need to. Consider hiding away some
of your Python code by moving it into .py files that you can import and call. This will help the
readability of your final notebook by keeping the Python code from distracting from your actual findings and discussions.
Do not dump dataframe contents in the notebook – show only 5-10 lines at a time – as this severely affects readability.
You may install and use any additional Python packages you wish that will help you with this project. When submitting
your project, include a README file that specifies what additional Python packages you have installed, in order to make
your project repeatable on my computer, should I need to install extra modules.
HAND-IN:
Zip up all your notebooks, Python files and dataset(s) into a single file. Submit this file via Stream. Make sure that your
Jupyter notebook has been run with all outputs visible. Download an HTML version of your notebook (with outputs
showing) and include this in your zip file.
Marking criteria follow below
Marking criteria - Marks will be awarded for different components of the project using the following rubric. In all cases
expect higher marks for elegant code (use of loops/functions). You will receive a total mark for each task, not marks per
the numbers in parentheses; these are indicative weightings provided to help focus your effort on what is important (to
your marker).
Task 2 – kNN regression option (60 marks)
• Problem formulation (2)
• Modelling:
o Implementation of a train/test split (8) (higher weighting here compared to the supervised learning
option, as this is not implemented in the kNN regression lecture)
o Scaling of input features (3)
o Quality of experimentation with:
- feature subsets containing multiple features (10)
- model parameters (10)
• Quality of evaluation and interpretation*:
o Quality of evaluation process and metrics, including visualisations (12)
o Quality of interpretations (10)
o Quality of final discussion (5)
*If you did not do a train/test split (train on the training set, evaluate with the testing set), you will not have
quality evaluations, so expect low marks here if that is the case.
Task 2 – Supervised learning option (60 marks)
• Problem formulation and creation/integration of class labels (6)
• Initial input feature exploration via visualisations (6)
• Modelling:
o Implementation of a train/test split (3)
o Scaling of input features (3)
o Quality of experimentation (20) (the more you can explore feature selection, different feature subsets
containing multiple features, and different algorithms, the richer your experimentation will be)
• Quality of evaluation and interpretation*:
o Quality of evaluation process and metrics, including visualisations (12)
o Quality of interpretations (10)
*If you did not do a train/test split (train on the training set, evaluate with the testing set), you will not have
quality evaluations, so expect low marks here if that is the case.
Task 3 – Unsupervised learning (30 marks)
• Feature subset selection and feature scaling (4)
• Quality of experimentation: creation of different cluster models with feature subsets containing multiple
features. Note the requirement to use scikit-learn for clustering. (6)
• Visualisation, evaluation, interpretation and discussion of cluster models (20)
Task 4 – Presentation (10 marks)
• Report structure (4)
• Tidy code and outputs (4)
• Spelling (2)
The appendix follows below
APPENDIX
TRAIN/TEST SPLIT WITH KNN REGRESSION
Do a train/test split of your data before doing any kNN modelling (you will get spurious metrics otherwise). To achieve
this, you would need to do something like this:
import numpy as np
from sklearn.model_selection import train_test_split  # cross_validation was removed from newer scikit-learn versions

X = df_std                          # the standardised explanatory variables
y = np.array(df['Target_feature'])  # the target column

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
If you plan to borrow the ‘calculate_regression_goodness_of_fit’ function for your analysis, you would need to change
this line:
y_mean = y.mean()
to this:
y_mean = ys.mean()
and then let your common sense guide you as to the other necessary changes to the lecture code. Think about where
the training sets should be used and where the testing sets should be used (hint, the training sets should be used for the
model fit, the testing sets should be used for prediction and goodness of fit calculation).
VISUALISING CLUSTERS
Look at 2 or 3 different features at a time in scatter plots with points coloured according to cluster and see if you can
discern which features were important in defining the clustering (there are other ways of doing this, but for the purposes
of this assignment we are satisfied if you simply look at some visualisations). There is no guarantee that you will be able
to see a clear difference but have a go and show what you have done. Try to describe the effect that each feature has on
the clustering, if it is discernible.
The examples below are artificial, but I provide them to give you the general idea. In the example on the left, the feature
Y is important in defining the clusters, not X. In the example on the right, X is important in defining the clusters, not Y:
You will likely need to iterate through subsets of your features and produce such plots to get an idea of which features
may have been important in defining your clusters (functions are your friend). In reality, it will be a combination of
them. If one feature is really dominant, you should double check that you have scaled your features.
Avoid univariate clustering.
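A sketch of such pairwise plots follows, assuming the clustered frame (with its 'cluster' column) from the Task 3 sketch.

import itertools
import matplotlib.pyplot as plt

cols = [c for c in clustered.columns if c != "cluster"]
for x_col, y_col in itertools.combinations(cols, 2):  # every pair of features
    fig, ax = plt.subplots()
    ax.scatter(clustered[x_col], clustered[y_col],
               c=clustered["cluster"], alpha=0.7)  # colour points by cluster label
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
plt.show()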