Introduction to Data Analysis
Final Project [30 points]
The final project is worth 30% of the grade.
The final project uses classification, a machine learning technique, to predict an outcome for a
bank marketing problem. A bank is planning a telemarketing campaign to increase the number of
term deposits it holds. Each data record includes the output target (whether the customer responded
positively: yes or no) and several candidate features. Your task is to use the data analysis techniques
you have learned in class to predict which customers are likely to respond positively to the campaign.
You will use any two techniques learned in class, compare the resulting models, and make a
recommendation. Based on your recommendation, the bank will use your chosen model to score
unseen data and target customers for the telemarketing campaign.
You will create a PowerPoint deck to report your findings and recommend which model to choose,
along with its likely impact.
Data set and related information:
The dataset is available in the UCI machine learning repository:
https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
Read through the dataset information and the attribute information.
Note that we will only use the more recent version of the dataset. That is, we will only use the
bank-additional.zip folder and files.
Also note that bank-additional-full.csv contains the full dataset of 41,188 records and bank-additional.csv
contains a 10% random sample of 4,119 examples. It is recommended that you do most of your work on
bank-additional.csv (the 10% random sample) and move to the full dataset only once your models are
working and you are trying to improve accuracy or other aspects.
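As a starting point, loading the sample file with pandas might look like the sketch below. The UCI files are semicolon-separated and the target column is named y; the helper name load_bank_data and the three in-memory rows are illustrative assumptions (the rows are made up so the snippet runs without the download).

```python
import io

import pandas as pd


def load_bank_data(source):
    """Read a Bank Marketing CSV (path or file-like); the files use ';' as separator."""
    return pd.read_csv(source, sep=";")


# Made-up rows in the same layout as the real file, for illustration only.
sample = io.StringIO(
    "age;job;marital;duration;y\n"
    '56;"housemaid";"married";261;"no"\n'
    '41;"blue-collar";"married";1575;"yes"\n'
    '25;"services";"single";50;"no"\n'
)

df = load_bank_data(sample)
print(df.shape)                                   # rows, columns
print(df["y"].value_counts(normalize=True))       # class balance of the target
```

With the real data, the same helper would be called with the path to bank-additional.csv inside the unzipped folder (the exact path depends on where you extract it).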
The following is a checklist of the contents for each slide.
Slide 1 [5 points]
• Name of presenter
• Description of the problem
• How you would apply data analytics to the problem
• What are the likely impacts of applying data analytics
Slide 2 [5 points]
• The methodology you will use in tackling the problem
Slides 3-8 [5 points]
• Create some basic plots and graphs (histograms, boxplots, scatterplots) of the data
• Also compute some statistics of the features that you think are important
• Plot some scatter plots showing the classes in different colors
Slides 9-11 [5 points]
• Describe your first choice for model building
• Justify your choice. How is it meaningful or relevant for the business problem at hand?
• Describe your model
• Report on performance metrics of your model
Slides 12-14 [5 points]
• Describe your second choice for model building
• Justify your choice. How is it meaningful or relevant for the business problem at hand?
• Describe your model
• Report on performance metrics of your model
Slide 15 [5 points]
• Make a recommendation on which model should be selected among your two models
• State your conclusion based on this data analytics exercise
• State what are the possible business outcomes
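The modeling and comparison steps in the checklist above could be sketched as follows. This is only one possible setup: logistic regression and a decision tree are assumed as the two techniques (use whichever two were covered in class), and a synthetic, imbalanced dataset from make_classification stands in for the encoded bank features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded features; the class imbalance is an assumption.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.89, 0.11], random_state=0)

# Hold out a stratified test set so both classes appear in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)                # hard class labels
    score = model.predict_proba(X_test)[:, 1]   # probability of the positive class
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "auc": roc_auc_score(y_test, score),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Because positive responses are rare, accuracy alone can look good for a model that rarely predicts "yes", which is why precision, recall, and AUC are reported alongside it.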
Some tips you will find useful
1. Converting categorical variables to numeric variables: This Stack Overflow page has some tips on
how to convert categorical variables to numeric variables:
https://stackoverflow.com/questions/32011359/convert-categorical-data-in-pandas-dataframe
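A minimal sketch of two common conversions in pandas follows: one-hot encoding with get_dummies and integer codes via the category dtype. The small DataFrame is made up for illustration; its column names mirror the dataset's job, marital, age, and y attributes.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "job": ["admin.", "technician", "admin.", "services"],
    "marital": ["married", "single", "single", "divorced"],
    "age": [34, 45, 29, 51],
    "y": ["no", "yes", "no", "no"],
})

# Map the binary target to 0/1.
target = df["y"].map({"no": 0, "yes": 1})

# One-hot encode the nominal features; numeric columns pass through unchanged.
features = pd.get_dummies(df.drop(columns="y"),
                          columns=["job", "marital"], dtype=int)

# Alternative: integer codes (categories are ordered alphabetically here).
job_codes = df["job"].astype("category").cat.codes

print(features.columns.tolist())
print(job_codes.tolist())
```

One-hot encoding avoids implying an order among categories, which matters for models such as logistic regression; integer codes are more compact and are often acceptable for tree-based models.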
2. Assessing model performance (in addition to the metrics that we have seen in class): ROC curves
and AUC
This scikit-learn help page contains some hints on creating ROC curves and computing the area under
the curve (AUC):
http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#
You can learn more about the ROC curve and Area under the curve here:
https://en.wikipedia.org/wiki/Receiver_operating_characteristic
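As a small worked example, the sketch below computes the points of a ROC curve and the AUC with scikit-learn's roc_curve, auc, and roc_auc_score. The four labels and scores are toy values chosen for illustration; in the project, y_score would be the predicted probability of a "yes" response on held-out data.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy labels (1 = responded "yes") and model scores, for illustration only.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# False positive rate and true positive rate at each score threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve, computed two equivalent ways.
area = auc(fpr, tpr)
area_direct = roc_auc_score(y_true, y_score)

print(area)  # 0.75
```

The value 0.75 means that in three of the four positive/negative pairs the positive example received the higher score. Plotting fpr against tpr (for example with matplotlib) gives the ROC curve itself.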
3. Plotting
You might find this page helpful in getting started with plotting:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.hist.html
You might also find these pandas pages helpful:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.boxplot.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.scatter.html
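The three pandas plotting calls linked above could be combined as in the sketch below, including a scatter plot with the classes in different colors. The data is randomly generated as a stand-in (the column names age, duration, and y mirror the dataset, but the values are not real), and the output file name is an arbitrary choice.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in data, for illustration only.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "duration": rng.exponential(250, size=n).round(),
    "y": rng.choice(["no", "yes"], size=n, p=[0.89, 0.11]),
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram of a single numeric feature.
df["age"].hist(bins=20, ax=axes[0])
axes[0].set_title("Histogram of age")

# Boxplot of a feature, split by the target class.
df.boxplot(column="duration", by="y", ax=axes[1])
axes[1].set_title("Duration by outcome")

# Scatter plot with one color per class: draw each class separately.
class_colors = {"no": "tab:blue", "yes": "tab:orange"}
for label, group in df.groupby("y"):
    group.plot.scatter(x="age", y="duration", ax=axes[2],
                       color=class_colors[label], label=label)
axes[2].set_title("Age vs. duration by outcome")

out_path = os.path.join(tempfile.gettempdir(), "eda_plots.png")
fig.savefig(out_path, dpi=100)
```

Summary statistics for the same features can be obtained with df.describe() or, per class, with df.groupby("y").describe().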
Final Term Project Rubric
Scoring levels: Exemplary (5), Proficient (3-4), Incomplete (1-2), Incorrect or Unacceptable (0-1)
Slide 1
• Exemplary (5): Name along with a clear description of the problem is given. A clear description of how data analytics can be applied to the problem along with the likely impacts is provided.
• Proficient (3-4): Name along with a clear description of the problem is given. Mostly clear description of how data analytics can be applied to the problem along with the likely impacts is provided.
• Incomplete (1-2): Name along with a clear description of the problem is given. An incomplete description of how data analytics can be applied to the problem along with the likely impacts is provided.
• Incorrect or Unacceptable (0-1): Description of problem is incorrect or missing. Description of how data analytics can be used to solve the problem is missing or incorrect.
Slide 2
• Exemplary (5): A clear methodology that is suitable for solving the problem is outlined.
• Proficient (3-4): A methodology that is mostly clear and is likely to be suitable for solving the problem is outlined.
• Incomplete (1-2): The description of the methodology or the suitability for the problem is incomplete.
• Incorrect or Unacceptable (0-1): The methodology is not suitable for the problem or the outline is missing.
Slides 3-8
• Exemplary (5): The basic plots are correct and complete. The statistics presented are relevant and important for the problem. Scatter plots with the different classes in different colors are correct.
• Proficient (3-4): The basic plots are correct and complete. The statistics presented are relevant and important for the problem. Scatter plots are mostly correct.
• Incomplete (1-2): The basic plots are mostly correct and almost complete. The statistics are relevant and almost complete. Scatter plots are mostly correct.
• Incorrect or Unacceptable (0-1): The basic plots are incorrect or incomplete. Statistics are missing or incorrect and so are the scatter plots with the different classes.
Slides 9-11
• Exemplary (5): A clear description of the first choice is presented along with a clear justification of why it is meaningful and relevant. The model description is complete and performance metrics are reported clearly.
• Proficient (3-4): A clear description of the first choice is presented along with a clear justification of why it is meaningful and relevant. The model description is complete and performance metrics are mostly correct.
• Incomplete (1-2): A clear description of the first choice is presented along with a clear justification of why it is meaningful and relevant. The model description and performance metrics report are incomplete.
• Incorrect or Unacceptable (0-1): The first choice for model is not clearly described and there is no clear justification for that choice.
Slides 12-14
• Exemplary (5): A clear description of the second choice is presented along with a clear justification of why it is meaningful and relevant. The model description is complete and performance metrics are reported clearly.
• Proficient (3-4): A clear description of the second choice is presented along with a clear justification of why it is meaningful and relevant. The model description is complete and performance metrics are mostly correct.
• Incomplete (1-2): A clear description of the second choice is presented along with a clear justification of why it is meaningful and relevant. The model description and performance metrics report are incomplete.
• Incorrect or Unacceptable (0-1): The second choice for model is not clearly described and there is no clear justification for that choice.
Slide 15
• Exemplary (5): A clear recommendation of which model should be selected is provided. The conclusions are stated clearly with evidence from the data analysis exercise. Possible business outcomes are stated clearly and are correct.
• Proficient (3-4): A clear recommendation of which model should be selected is provided. The conclusions are stated clearly with evidence from the data analysis exercise. Possible business outcomes are mostly correct.
• Incomplete (1-2): A clear recommendation of which model should be selected is provided. The conclusions are incomplete or evidence from the data analysis exercise is incomplete. Possible business outcomes may be incomplete.
• Incorrect or Unacceptable (0-1): No clear recommendation is provided. Conclusions are incorrect or incomplete.