Assignment Part A: Storytelling with data (30 marks)
Statistics is the science of learning from data. Turning data into information is a critical aspect of decision-making in business. In this world of big data, storytelling through data has emerged as an important aspect of all data analysis. Complex ideas can be understood easily through storytelling. In this part, we build your storytelling skills by improving your ability to visualise and communicate findings.
In this task, you are required to do the following. Please read carefully.
Problem background: House Prices
Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
Link: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques
Data definition:
We have selected a sample with fewer variables in the data file provided. Refer to the data file labelled Assignment Data.xlsx and open the worksheet labelled House Prices. In the worksheet, you are provided with both numeric and categorical data. Note that this data has already been cleaned for you, and any missing records are removed. The following table contains the data definition.
Column | Column Name | Data Definition
A | ID | Unique identifier for each property record
B | MSZoning | The general zoning classification: RH = Residential High Density; RL = Residential Low Density
C | LotArea | Lot size in square feet
D | Street | Type of road access: Grvl = Gravel; Pave = Paved
E | LotShape | General shape of property: Reg = Regular; IR1 = Slightly irregular; IR2 = Moderately irregular; IR3 = Irregular
F | LandContour | Flatness of the property: Lvl = Near flat/level; Bnk = Banked (quick and significant rise from street grade to building); HLS = Hillside (significant slope from side to side); Low = Depression
G | LandSlope | Slope of property
H | BldgType | Type of dwelling: 1Fam = Single-family detached; 2FmCon = Two-family conversion (originally built as one-family dwelling)
I | OverallQual | Overall material and finish quality: 10 = Very Excellent; 9 = Excellent; 8 = Very Good; 7 = Good; 6 = Above Average; 5 = Average; 4 = Below Average; 3 = Fair; 2 = Poor; 1 = Very Poor
J | OverallCond | Overall condition rating: 10 = Very Excellent; 9 = Excellent; 8 = Very Good; 7 = Good; 6 = Above Average; 5 = Average; 4 = Below Average; 3 = Fair; 2 = Poor; 1 = Very Poor
K | SalePrice | The property's sale price in dollars. This is the target variable that you are trying to predict.
Using the data set, answer the following questions.
Q1 About the data
In fewer than 100 words, write a summary of the data based on the following points:
• what is the data about (do not copy from the website)
• what information (variables) does it contain
• select one numerical and one categorical variable from your data set and provide full data classification for each variable you selected.
• explain your choice of classification for each variable selected above
(4 marks)
Q2 Pivot table
Instruction for Q2
Create a new column L labelled “Price Category”. Use the information provided in Table 1 to categorise Sale Price (column K), classifying each sale price as “High”, “Medium” or “Low”. Use the VLOOKUP function to complete this task. Once you have done this, filter the MSZoning variable (column B) to select the RM and RL categories. Based on the data, answer the following questions.
a. Construct a pivot table showing a grand total of counts with “Price Category” in rows and the two selected general zoning classifications (MSZoning) in columns. Label and format the pivot table accordingly. Marks will be deducted for poor presentation. (3 marks)
b. Using your pivot table from (a), provide one example of each of the following:
• Marginal probability
• Conditional probability
• Joint probability
In your answers, you are required to show the following for each of the three probabilities:
1. probability statement
2. workings
3. final answers
Your answers must be stated to 2 decimal places. (3 marks)
c. Provide a contextual interpretation for each of the probability values obtained above. (3 marks)
d. State two methods that can be used to investigate whether Price Category and Zoning Classification (MSZoning) are related. (2 marks)
e. Use the data provided and apply both methods stated in your answer to (d) to investigate whether the two variables are related. (6 marks)
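The categorisation and probability steps in Q2 can be sketched outside Excel to check the logic. The Python below is only an illustration under stated assumptions: the price bands stand in for Table 1, which is not reproduced here, and the ten records are made up rather than taken from Assignment Data.xlsx. The lookup mirrors VLOOKUP with approximate match, which returns the row with the largest lower bound that does not exceed the lookup value.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical lower bounds and labels standing in for Table 1 (not given here).
# As with VLOOKUP approximate match, the bounds must be sorted ascending.
LOWER_BOUNDS = [0, 150_000, 250_000]
LABELS = ["Low", "Medium", "High"]


def price_category(sale_price):
    """Return the label of the last band whose lower bound is <= sale_price.

    Assumes sale_price is non-negative.
    """
    return LABELS[bisect_right(LOWER_BOUNDS, sale_price) - 1]


# Hypothetical (MSZoning, SalePrice) records, not taken from the data file.
records = [
    ("RL", 210_000), ("RL", 180_000), ("RM", 130_000), ("RL", 305_000),
    ("RM", 118_000), ("RL", 143_000), ("RM", 165_000), ("RL", 260_000),
    ("RL", 139_000), ("RM", 98_000),
]

# Count each (Price Category, MSZoning) pair: these are the pivot table cells.
counts = Counter((price_category(price), zone) for zone, price in records)
total = sum(counts.values())  # grand total of the pivot table


def joint(category, zone):
    """P(category and zone): one cell divided by the grand total."""
    return counts[(category, zone)] / total


def marginal_zone(zone):
    """P(zone): a column total divided by the grand total."""
    return sum(n for (_, z), n in counts.items() if z == zone) / total


def conditional(category, zone):
    """P(category | zone): one cell divided by its column total."""
    return joint(category, zone) / marginal_zone(zone)


print(f"P(RL)         = {marginal_zone('RL'):.2f}")
print(f"P(Low and RM) = {joint('Low', 'RM'):.2f}")
print(f"P(Low | RM)   = {conditional('Low', 'RM'):.2f}")
```

The three functions differ only in the denominator: the grand total for joint and marginal probabilities, and a column total for a probability conditional on zoning. The same distinction applies when reading the Excel pivot table.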
Q3 Visualisation and Overall Summary
In this section, you are required to draw a visualisation and write an overall report on your observations.
a. Select at least two appropriate variables from your dataset and provide ONE suitable visualisation (graph). The variables you select here can be the same as or different from those you selected above. Be sure to select variables that tell an interesting story, as you are required to write a summary in part (b).
Instructions for this task
• You are required to use Microsoft Excel for this task.
• Your visualisation must be appropriately labelled and formatted. (4 marks)
b. Using the visualisation (graph) you provided in your answer above:
• briefly explain why you have chosen this type of visualisation (graph)
• write a summary describing the main findings and patterns visible from the visualisation.
The word limit here is strictly less than 150 words. (5 marks)