Introduction to Data Analysis

Final Project [30 points]

The final project is worth 30% of the grade.

The final project uses classification, a machine learning technique, to predict the outcome of a
bank marketing problem. A bank is planning a telemarketing campaign to increase the number of
term deposits it holds. Each data record includes the output target (whether the customer responded
positively: yes or no) and several candidate features. Your task is to use the data analysis techniques
you have learned in class to predict which customers are likely to respond positively to the campaign.
You will use any two techniques learned in class, compare the resulting models, and make a
recommendation. Based on your recommendation, the bank will use your chosen model to score
unseen data and target customers for the telemarketing campaign.

You will create a PowerPoint deck to report your findings and recommend which model to choose,
along with its likely impact.

Data set and related information:

The dataset is available in the UCI machine learning repository:

https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

Read through the dataset information and the attribute information.

Note that we will use only the more recent version of the dataset, that is, only the
bank-additional.zip archive and the files it contains.

Also note that bank-additional-full.csv contains the full dataset of 41,188 records, and
bank-additional.csv contains a 10% random sample of 4,119 records. It is recommended that you do
most of your work on bank-additional.csv (the 10% random sample) and move to the full dataset only
once your models are working and you are trying to improve accuracy or other aspects.
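
As a starting point, the sample file can be loaded with pandas. The sketch below is only an illustration, not a required approach. It assumes that the UCI CSV files use a semicolon as the field separator and that the target column is named y. The helper name load_bank_data is hypothetical, and the few rows shown are made up so that the example runs without the download.

```python
import io

import pandas as pd


def load_bank_data(source):
    """Read a Bank Marketing CSV (assumed to be ';'-separated) into a DataFrame."""
    return pd.read_csv(source, sep=";")


# Made-up rows standing in for the real file, so the sketch is self-contained.
# In your own work, pass the path to bank-additional.csv instead of `sample`.
sample = io.StringIO(
    '"age";"job";"marital";"duration";"y"\n'
    '56;"housemaid";"married";261;"no"\n'
    '37;"services";"married";226;"no"\n'
    '40;"admin.";"single";151;"yes"\n'
    '25;"services";"single";50;"no"\n'
)

df = load_bank_data(sample)

# Quick sanity checks: size, column types, and the share of each target class.
print(df.shape)
print(df.dtypes)
class_share = df["y"].value_counts(normalize=True)
print(class_share)
```

Checking the class shares early is useful because an unbalanced target affects how accuracy should be interpreted later on.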

The following is a checklist of the contents for each slide.

Slide 1 [5 points]

- Name of presenter
- Description of the problem
- How you would apply data analytics to the problem
- What are the likely impacts of applying data analytics

Slide 2 [5 points]

- The methodology you will use in tackling the problem

Slides 3-8 [5 points]

- Create some basic plots and graphs (histograms, boxplots, scatterplots) of the data
- Also compute some statistics of the features that you think are important
- Plot some scatter plots showing the classes in different colors

Slides 9-11 [5 points]

- Describe your first choice for model building
- Justify your choice. How is it meaningful or relevant for the business problem at hand?
- Describe your model
- Report on performance metrics of your model

Slides 12-14 [5 points]

- Describe your second choice for model building
- Justify your choice. How is it meaningful or relevant for the business problem at hand?
- Describe your model
- Report on performance metrics of your model

Slide 15 [5 points]

- Make a recommendation on which model should be selected among your two models
- State your conclusion based on this data analytics exercise
- State the possible business outcomes
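
The model-building and comparison steps in slides 9-15 could be sketched as follows. This is an illustration under stated assumptions, not the expected solution: logistic regression and a decision tree are example choices only (use whichever two techniques you learned in class), and the data is synthetic, generated with make_classification so the code runs on its own. The class imbalance in the synthetic data is likewise an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the prepared feature matrix and target.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.89, 0.11], random_state=0,
)

# Hold out a test set; stratify keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Two example techniques; these are placeholders for your own two choices.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Evaluating both models on the same held-out split keeps the comparison fair. With an unbalanced target, precision and recall on the positive class are usually more informative than accuracy alone.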

Some tips you will find useful

1. Converting categorical variables to numeric variables: This Stack Overflow page has some tips on
how to convert categorical variables to numeric variables:

https://stackoverflow.com/questions/32011359/convert-categorical-data-in-pandas-dataframe
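
A minimal sketch of two common conversions with pandas is shown here, assuming a small made-up frame whose column names (job, marital, y) mirror the dataset's attribute list. One-hot encoding via get_dummies and integer codes via the category dtype are two options among several. Which one suits a given model is a judgment call.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "age": [56, 37, 40, 25],
    "job": ["housemaid", "services", "admin.", "services"],
    "marital": ["married", "married", "single", "single"],
    "y": ["no", "no", "yes", "no"],
})

# Binary target: map the two labels to 0 and 1.
target = df["y"].map({"no": 0, "yes": 1})

# Option 1: one-hot encoding creates one indicator column per category.
features = pd.get_dummies(df.drop(columns="y"), columns=["job", "marital"])

# Option 2: integer codes, one number per category (in sorted label order).
# Note that this imposes an ordering the categories may not really have.
job_codes = df["job"].astype("category").cat.codes

print(features.columns.tolist())
print(job_codes.tolist())
```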

2. Assessing model performance (in addition to the metrics that we have seen in class): ROC curves
and AUC

This scikit-learn help page contains some hints on creating ROC curves and computing the area under
the curve:

http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#

You can learn more about the ROC curve and Area under the curve here:

https://en.wikipedia.org/wiki/Receiver_operating_characteristic
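
For a binary target, the ROC curve and AUC can be computed with scikit-learn's roc_curve and roc_auc_score. The sketch below assumes synthetic data and uses logistic regression purely as an example classifier. Any model that produces a score or probability for the positive class could be substituted.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the example is self-contained.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ROC analysis needs scores, not hard labels: take the positive-class probability.
scores = model.predict_proba(X_test)[:, 1]

# One (false positive rate, true positive rate) point per decision threshold.
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)
print(f"AUC = {auc:.3f}")
```

To draw the curve, plot fpr on the x-axis against tpr on the y-axis, for example with matplotlib. Plotting both of your models on the same axes makes the comparison easy to read on a slide.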

3. Plotting

You might find this page helpful in getting started with plotting:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.hist.html

You might also find these pandas pages helpful:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.boxplot.html

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.scatter.html
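
The three plot types from the pages above can be combined in one figure, as sketched below. The data is randomly generated for illustration. The columns age, duration and y are assumptions modeled on the dataset's attribute list, and the colors are arbitrary. The Agg backend line is there only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in data; replace with the DataFrame loaded from the CSV file.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "duration": rng.exponential(250, size=n).round(),
    "y": rng.choice(["no", "yes"], size=n, p=[0.85, 0.15]),
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram of a single numeric feature.
df["age"].hist(bins=20, ax=axes[0])
axes[0].set_title("age")

# Boxplot of a numeric feature, split by the target class.
df.boxplot(column="duration", by="y", ax=axes[1])

# Scatter plot with one color per class: draw each class as its own layer.
class_colors = {"no": "tab:blue", "yes": "tab:orange"}
for label, group in df.groupby("y"):
    group.plot.scatter(x="age", y="duration", color=class_colors[label],
                       label=label, alpha=0.6, ax=axes[2])

# Summary statistics per class for the same features.
summary = df.groupby("y")[["age", "duration"]].describe()
print(summary)

# fig.savefig("eda_plots.png")  # uncomment to export the figure for a slide
```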

Final Term Project Rubric

Scoring levels: Exemplary (5), Proficient (3-4), Incomplete (1-2), Incorrect or Unacceptable (0-1)

Slide 1
- Exemplary (5): Name along with a clear description of the problem is given. A clear description of how data analytics can be applied to the problem along with the likely impacts is provided.
- Proficient (3-4): Name along with a clear description of the problem is given. Mostly clear description of how data analytics can be applied to the problem along with the likely impacts is provided.
- Incomplete (1-2): Name along with a clear description of the problem is given. An incomplete description of how data analytics can be applied to the problem along with the likely impacts is provided.
- Incorrect or Unacceptable (0-1): Description of problem is incorrect or missing. Description of how data analytics can be used to solve the problem is missing or incorrect.

Slide 2
- Exemplary (5): A clear methodology that is suitable for solving the problem is outlined.
- Proficient (3-4): A methodology that is mostly clear and is likely to be suitable for solving the problem is outlined.
- Incomplete (1-2): The description of the methodology or the suitability for the problem is incomplete.
- Incorrect or Unacceptable (0-1): The methodology is not suitable for the problem or the outline is missing.

Slides 3-8
- Exemplary (5): The basic plots are correct and complete. The statistics presented are relevant and important for the problem. Scatter plots with the different classes in different colors are correct.
- Proficient (3-4): The basic plots are correct and complete. The statistics presented are relevant and important for the problem. Scatter plots are mostly correct.
- Incomplete (1-2): The basic plots are mostly correct and almost complete. The statistics are relevant and almost complete. Scatter plots are mostly correct.
- Incorrect or Unacceptable (0-1): The basic plots are incorrect or incomplete. Statistics are missing or incorrect and so are the scatter plots with the different classes.

Slides 9-11
- Exemplary (5): A clear description of the first choice is presented along with a clear justification of why it is meaningful and relevant. The model description is complete and performance metrics are reported clearly.
- Proficient (3-4): A clear description of the first choice is presented along with a clear justification of why it is meaningful and relevant. The model description is complete and performance metrics are mostly correct.
- Incomplete (1-2): A clear description of the first choice is presented along with a clear justification of why it is meaningful and relevant. The model description and performance metrics report are incomplete.
- Incorrect or Unacceptable (0-1): The first choice for model is not clearly described and there is no clear justification for that choice.

Slides 12-14
- Exemplary (5): A clear description of the second choice is presented along with a clear justification of why it is meaningful and relevant. The model description is complete and performance metrics are reported clearly.
- Proficient (3-4): A clear description of the second choice is presented along with a clear justification of why it is meaningful and relevant. The model description is complete and performance metrics are mostly correct.
- Incomplete (1-2): A clear description of the second choice is presented along with a clear justification of why it is meaningful and relevant. The model description and performance metrics report are incomplete.
- Incorrect or Unacceptable (0-1): The second choice for model is not clearly described and there is no clear justification for that choice.

Slide 15
- Exemplary (5): A clear recommendation of which model should be selected is provided. The conclusions are stated clearly with evidence from the data analysis exercise. Possible business outcomes are stated clearly and are correct.
- Proficient (3-4): A clear recommendation of which model should be selected is provided. The conclusions are stated clearly with evidence from the data analysis exercise. Possible business outcomes are mostly correct.
- Incomplete (1-2): A clear recommendation of which model should be selected is provided. The conclusions are incomplete or evidence from the data analysis exercise is incomplete. Possible business outcomes may be incomplete.
- Incorrect or Unacceptable (0-1): No clear recommendation is provided. Conclusions are incorrect or incomplete.
