MATH2319 Machine Learning
Semester 1, 2020
Assignment 2
Assignment Rules: Please read carefully!
Assignments are to be treated as "limited open-computer" take-home exams. That is, you must not discuss your assignment solutions with anyone else (including your classmates, paid/unpaid tutors, friends, parents, relatives, etc.) and the submission you make must be your own work. In addition, no member of the teaching team will assist you with any issues that are directly related to your assignment solutions.
For other assignment Codes of Conduct, please refer to this web page on Canvas:
https://rmit.instructure.com/courses/67061/pages/assignments-summary-purpose-and-code-of-conduct
You must document all your work in Jupyter notebook format. Please submit one Jupyter notebook file & one HTML file per question. Specifically, you will need to upload the following 4 files for this assignment:
{StudentID}_A2_Q1.html
{StudentID}_A2_Q1.ipynb
{StudentID}_A2_Q2.html
{StudentID}_A2_Q2.ipynb
Please put your Honour Code at the top in your answer for the first question. At least one of your HTML files must contain the Honour Code.
Please make sure your online submission is consistent with the checklist below:
https://rmit.instructure.com/courses/67061/pages/online-submissions-checklist
For full Assignment Instructions and Summary of Penalties, please see this web page on Canvas:
https://rmit.instructure.com/courses/67061/pages/instructions-for-online-submission-assessments
Please note that penalties apply for any assignment instruction or specific question instruction that you do not follow.
Assignment Instructions
For this assignment, please follow the additional instructions below:
Textbook info can be found on Canvas at this link: https://rmit.instructure.com/courses/67061/pages/course-resources
You may not use any one of the classifiers in the Scikit-Learn module. Likewise, you may not use any one of the metrics in the Scikit-Learn module. You will need to show and explain all your solution steps without using the Scikit-Learn module. You will not receive any points for any work that uses Scikit-Learn. The reason for this restriction is so that you get to learn how some things work behind the scenes.
For this assignment, you do NOT have to use Python to work out any of your numerical answers.
Specifically, you are allowed to use Microsoft Excel for this assignment. If you like, you can mix and match some of your Excel solutions with Python.
Regardless of which tool you use:
Please keep your answers concise and to-the-point. For instance, please do not copy & paste excessive material from Excel (if that is what you are using).
You must write all your narrative in your Jupyter notebook and you must clearly explain all your steps in clear English.
You must pay attention to good presentation practices with section headers, correct spelling, etc.
At the end, for each question part, you must present your solutions as either appropriate text, or as the requested Pandas data frames in your Jupyter notebook. If you are using Excel, you can populate these Pandas data frames from Excel output.
However, you must do the plotting in Q2 using Python 3.0 or above (with Altair being preferred, but not required). Excel will not be accepted for the plot question in Q2.
Question 1
(50 points)
This question is inspired by Exercise 3 in Chapter 4 in the textbook. On Canvas, you will see a CSV file named "A2_Q1.csv". You will use this dataset for this question. In our version of the dataset, the levels of the annual income target variable have been renamed as low, mid, and high.
For this question, you will build a simple decision tree with depth 1 using this dataset for predicting the annual income target feature using the Gini Index split criterion. You will present your results as Pandas data frames.
You are allowed to use any Python code available on our website here. In fact, you are recommended to use some of this code.
Part A (5 points)
Compute the impurity of the target feature.
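As a rough illustration of the quantity involved, the Gini index of a categorical target is one minus the sum of its squared level proportions. The sketch below is a minimal example on a hypothetical toy column; the helper name gini_index and the toy values are assumptions for illustration and are not taken from A2_Q1.csv.

```python
import pandas as pd


def gini_index(labels: pd.Series) -> float:
    """Gini index: 1 minus the sum of squared level proportions."""
    proportions = labels.value_counts(normalize=True)
    return float(1.0 - (proportions ** 2).sum())


# Hypothetical toy target column, NOT the values in A2_Q1.csv.
toy_target = pd.Series(["low", "low", "mid", "high"])
toy_impurity = gini_index(toy_target)

# A column with a single level is perfectly pure.
pure_impurity = gini_index(pd.Series(["low", "low", "low"]))
```

Only Pandas is used here, which stays within the restriction on Scikit-Learn.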
Part B (30 points)
In this part, you will determine the root node for your decision tree. Please refer to the Chapter 4 slides on Canvas and Part (c) of this exercise question in the textbook for handling the continuous Age feature.
Your answer to this part needs to be a Pandas data frame and it needs to be called "df_splits". Also, it needs to have the following columns:
Split
Remainder
Information_Gain
Is_Optimal (True or False - only the optimal split's Is_Optimal flag needs to be True and the others need to be False)
You can populate this data frame line by line by referring to Cell 6 in our Pandas tutorial.
You will populate and display your df_splits data frame. As an example for your "df_splits" data frame, consider the spam prediction example in Table 4.2 in the textbook on page 121, which was also covered in lectures. The df_splits data frame would look something like the table below.
Split               Remainder   Information_Gain   Is_Optimal
suspicious words    ?           ?                  True
unknown sender      ?           ?                  False
contains images     ?           ?                  False
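One possible way to fill such a table for categorical features is sketched below: the remainder is the size-weighted average of the partition impurities, and the information gain is the target impurity minus that remainder. The toy spam data, the column names, and the helper names gini_index and split_remainder are all assumptions made for illustration; they are not the textbook's table or the assignment dataset.

```python
import pandas as pd


def gini_index(labels: pd.Series) -> float:
    """Gini index: 1 minus the sum of squared level proportions."""
    proportions = labels.value_counts(normalize=True)
    return float(1.0 - (proportions ** 2).sum())


def split_remainder(df: pd.DataFrame, feature: str, target: str) -> float:
    """Size-weighted average impurity of the partitions induced by a feature."""
    total = len(df)
    remainder = 0.0
    for _, partition in df.groupby(feature):
        remainder += len(partition) / total * gini_index(partition[target])
    return remainder


# Hypothetical toy data, NOT the textbook's Table 4.2.
toy = pd.DataFrame({
    "suspicious_words": [True, True, True, False, False, False],
    "unknown_sender": [True, False, True, False, True, False],
    "spam": [True, True, True, False, False, False],
})

target_impurity = gini_index(toy["spam"])

# Populate the data frame one row per candidate split.
rows = []
for feature in ["suspicious_words", "unknown_sender"]:
    remainder = split_remainder(toy, feature, "spam")
    rows.append({
        "Split": feature,
        "Remainder": remainder,
        "Information_Gain": target_impurity - remainder,
    })
df_splits = pd.DataFrame(rows)

# Flag exactly one row: the split with the largest information gain.
best_row = df_splits["Information_Gain"].idxmax()
df_splits["Is_Optimal"] = df_splits.index == best_row
```

Using idxmax() rather than comparing against the maximum keeps a single True flag even if two splits tie.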
PART B CLARIFICATION: In your "df_splits" data frame, please do not bundle all age splits together and call it "Age". Rather, please have a separate row for each age threshold value that qualifies as a split candidate. Please name these splits as "Age_{YY}" where {YY} represents a numerical age threshold (in years). For example, if there are only two threshold values that qualify as a split candidate, say 20 and 30, you would add two rows to "df_splits" with Split values of "Age_20" and "Age_30". Naturally, you will need to populate the rest of the columns with correct values for these two rows.
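A common way to find candidate thresholds for a continuous feature is to sort the rows by that feature and take the midpoint between adjacent rows whose target levels differ. The sketch below assumes that approach; the Chapter 4 slides remain the authority on the exact rule to use. The toy ages and incomes, the helper name candidate_thresholds, and the handling of tied ages (skipped here for simplicity) are illustrative assumptions.

```python
import pandas as pd


def candidate_thresholds(df: pd.DataFrame, feature: str, target: str) -> list:
    """Midpoints between adjacent sorted rows whose target levels differ."""
    ordered = df.sort_values(feature)
    values = ordered[feature].tolist()
    labels = ordered[target].tolist()
    thresholds = set()
    for i in range(len(values) - 1):
        # Simplification: adjacent rows with identical feature values are skipped.
        if labels[i] != labels[i + 1] and values[i] != values[i + 1]:
            thresholds.add((values[i] + values[i + 1]) / 2)
    return sorted(thresholds)


# Hypothetical toy data, NOT the values in A2_Q1.csv.
toy = pd.DataFrame({
    "Age": [40, 18, 28, 24, 34],
    "Income": ["high", "low", "mid", "low", "mid"],
})

thresholds = candidate_thresholds(toy, "Age", "Income")

# One split name per threshold, following the "Age_{YY}" naming pattern.
split_names = [f"Age_{t:g}" for t in thresholds]

# Each threshold induces a binary partition that can be scored like any other split.
above_first_threshold = toy["Age"] >= thresholds[0]
```

Each boolean partition such as above_first_threshold can then be treated as a two-level feature when computing the remainder and information gain.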
Part C (15 points)
In this part, you will assume the Education descriptive feature is at the root node (NOTE: This feature may or may not be the optimal root node, but you will just assume it is). Under this assumption, you will make predictions for the annual income target variable.
Your answer to this part needs to be a Pandas data frame and it needs to be called "df_prediction". Also, it needs to have the following columns:
Leaf_Condition
Low_Income_Prob (Probability)
Mid_Income_Prob
High_Income_Prob
Leaf_Prediction
Assuming that Education is the root node, you will populate and display your df_prediction data frame. As an example, continuing the spam prediction problem, assume the suspicious words descriptive feature is at the root node. The df_prediction data frame would look something like the table below.
Leaf_Condition               Spam_Prob   Ham_Prob   Leaf_Prediction
suspicious words == true
suspicious words == false
HINT: Your df_prediction data frame should have only 3 rows.
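For a depth-1 tree, each level of the root feature becomes a leaf, the leaf probabilities are the target level proportions within that partition, and the leaf prediction is the most probable level. A minimal sketch on the spam example follows; the toy rows and column names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data, NOT the textbook's table or the assignment dataset.
toy = pd.DataFrame({
    "suspicious_words": [True, True, True, False, False, False, False],
    "label": ["spam", "spam", "ham", "ham", "ham", "ham", "spam"],
})

rows = []
for level, partition in toy.groupby("suspicious_words"):
    # Proportion of each target level inside this leaf.
    probs = partition["label"].value_counts(normalize=True)
    rows.append({
        "Leaf_Condition": f"suspicious words == {level}",
        "Spam_Prob": float(probs.get("spam", 0.0)),
        "Ham_Prob": float(probs.get("ham", 0.0)),
        "Leaf_Prediction": probs.idxmax(),
    })
df_prediction = pd.DataFrame(rows)
```

The probs.get(level, 0.0) calls cover the case where a target level does not occur in a leaf at all.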
Question 1 Wrap-up
For marking purposes, please display your "df_splits" and "df_prediction" data frames (in separate cells!) as the last thing in your notebook for this question. Thank you.
Question 2
(50 points)
This question is inspired by Exercise 6 in Chapter 8 in the textbook. On Canvas, you will see a CSV file named "A2_Q2.csv". You will use this dataset for this question.
Part A (10 points)
Assume true is the positive target level. Using a score threshold of 0.5, work out the confusion matrix (using pd.crosstab() or as a Jupyter notebook table) with appropriate row and column labels.
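A labelled confusion matrix can be sketched with pd.crosstab() as below. The column names target and score, the toy values, and the convention that a score equal to the threshold counts as positive (>=) are assumptions for illustration; the actual layout of A2_Q2.csv may differ.

```python
import pandas as pd

# Hypothetical toy data, NOT the values in A2_Q2.csv.
toy = pd.DataFrame({
    "target": [True, True, True, False, False, False],
    "score": [0.9, 0.6, 0.3, 0.7, 0.4, 0.1],
})

threshold = 0.5
labels = {True: "true", False: "false"}

# Assumption: scores at or above the threshold are predicted as positive.
actual = toy["target"].map(labels)
predicted = (toy["score"] >= threshold).map(labels)

confusion = pd.crosstab(actual, predicted,
                        rownames=["Actual"], colnames=["Predicted"])

tp = int(confusion.loc["true", "true"])
fn = int(confusion.loc["true", "false"])
fp = int(confusion.loc["false", "true"])
tn = int(confusion.loc["false", "false"])
```

Mapping the boolean values to the strings "true" and "false" gives readable row and column labels in the displayed table.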
Part B (20 points, 4 points each)
Compute the following 5 metrics:
Error Rate
Precision
TPR (True Positive Rate) (also known as Recall)
F1-Score
FPR (False Positive Rate)
You will need to display your answers as a Pandas data frame called "df_metrics" (with 5 rows, one row for each metric) with the following 2 columns:
Metric
Value
Marking Note for Part B: If your confusion matrix is incorrect, you will not get full credit for a correct follow-through.
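All five metrics follow directly from the four confusion matrix counts. The sketch below uses hypothetical counts (tp, fn, fp, tn are made-up values, not results for A2_Q2.csv) and the standard definitions of each metric.

```python
import pandas as pd

# Hypothetical confusion matrix counts, NOT derived from A2_Q2.csv.
tp, fn, fp, tn = 2, 1, 1, 2
total = tp + fn + fp + tn

error_rate = (fp + fn) / total       # share of misclassified instances
precision = tp / (tp + fp)           # correct among predicted positives
tpr = tp / (tp + fn)                 # recall: correct among actual positives
f1_score = 2 * precision * tpr / (precision + tpr)  # harmonic mean
fpr = fp / (fp + tn)                 # false alarms among actual negatives

df_metrics = pd.DataFrame({
    "Metric": ["Error Rate", "Precision", "TPR", "F1-Score", "FPR"],
    "Value": [error_rate, precision, tpr, f1_score, fpr],
})
```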
Part C (15 points)
By varying the score threshold from 0.1 to 0.9 (both inclusive) in 0.1 increments, compute TPR and FPR values. You will need to display your answers as a Pandas data frame called "df_roc" with the following columns:
Threshold
TPR
FPR
HINT: Your df_roc data frame should have 9 rows. For this part and the next, you might find Cells #20 and #21 useful in our SK4 Tutorial here.
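The threshold sweep can be sketched as a loop that recomputes the predictions at each threshold and records TPR and FPR. The toy data and column names below are hypothetical, and treating a score equal to the threshold as positive (>=) is an assumption. Thresholds are generated as i / 10 rather than by repeatedly adding 0.1, which avoids accumulated floating-point drift.

```python
import pandas as pd

# Hypothetical toy data, NOT the values in A2_Q2.csv.
toy = pd.DataFrame({
    "target": [True, True, True, False, False, False],
    "score": [0.95, 0.65, 0.35, 0.75, 0.45, 0.15],
})

is_positive = toy["target"]
n_positive = int(is_positive.sum())
n_negative = int((~is_positive).sum())

rows = []
for i in range(1, 10):  # thresholds 0.1, 0.2, ..., 0.9
    threshold = i / 10
    # Assumption: scores at or above the threshold are predicted as positive.
    predicted = toy["score"] >= threshold
    tp = int((predicted & is_positive).sum())
    fp = int((predicted & ~is_positive).sum())
    rows.append({
        "Threshold": threshold,
        "TPR": tp / n_positive,
        "FPR": fp / n_negative,
    })
df_roc = pd.DataFrame(rows)
```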
Part D (5 points)
Using your answer in the above part, display an ROC curve with appropriate axes labels and a title.
Question 2 Wrap-up
For marking purposes, please display your "df_metrics" and "df_roc" data frames (in separate cells!) as the last thing in your notebook right before your plot for Part D. Thank you.
www.featureranking.com