Date: 2022-12-21 10:31

BUSS6002 Assignment 1

Semester 2, 2022

Instructions

Due: at 23:59 on Friday, September 16, 2022 (end of week 7).

You must submit a Jupyter Notebook (.ipynb) file with the following filename format,

replacing STUDENTID with your own student ID: BUSS6002 A1 STUDENTID.ipynb.

There is a limit of 1000 words for your submission (excluding code, tables, and captions).

Do not include any more Python output than necessary and include only concise discussions.

Each task must be clearly labelled with the corresponding question (and sub-question)

number so that the marker can spot your solution easily.

The submitted .ipynb file must be free of any errors, and the results must be reproducible.

All figures must be appropriately sized (by setting figsize) and have readable axis labels

and legends (where applicable).

Use plt.show() instead of plt.savefig('plot.png') to display each figure.

Libraries needed: numpy, pandas, matplotlib, statsmodels.

You may submit multiple times but only your last submission will be marked.

A late penalty applies if you submit your assignment late without a successful special

consideration. See the Unit Outline for more details.

Rubric

This assignment is worth 20% of the unit’s marks. The assessment is designed to test your technical

ability and statistical knowledge in performing important basic tasks associated with an exploratory

data analysis (or EDA) of a real-world dataset.

Assessment Item     Goal                              Marks
Question 1          Overall summary of the dataset        7
Question 2          Univariate analysis                  14
Question 3          Multivariate analysis                18
Jupyter Notebook    Logical and clear presentation        1
Total                                                    40

Table 1: Assessment Items and Mark Allocation

Overview

Being able to accurately predict the sale prices of residential properties is crucial to many aspects

of the economy. Some companies base their entire business models on providing their clients

with predictions of property sale prices. As a data-scientist-in-training, you will analyse data on

residential home sales in Ames, a city in the US state of Iowa. The dataset

contains sale prices between 2006 and 2010 of all residential properties in Ames, as well as many

numerical and categorical features (i.e., variables) associated with each dwelling. The following

downloadable files are available on Canvas.

File                           Description
AmesHousing.txt                Data file containing 2,930 observations and 82 variables
DataDocumentation.txt          Data dictionary containing description of each variable
BUSS6002 A1 STUDENTID.ipynb    A Jupyter Notebook template for getting you started
AmesResidential.pdf            A map of Ames

Table 2: Files Provided

Question 1

Place the data file AmesHousing.txt in the same location (i.e., directory) as your Jupyter Notebook

file (.ipynb), and then read the data into a pandas DataFrame object using exactly the following

code.

import pandas as pd

data = pd.read_csv(
    'AmesHousing.txt',
    sep='\t',
    keep_default_na=False,
    na_values=[''])
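
To see what the last two arguments do, here is a minimal sketch on a made-up tab-separated snippet. The sample text and the name toy are invented for illustration and are not taken from AmesHousing.txt. With keep_default_na=False, pandas no longer treats strings such as 'NA' as missing, and na_values=[''] marks only empty fields as nan.

```python
import io
import pandas as pd

# Hypothetical three-row sample; NOT the real data file.
sample = "Alley\tLot Frontage\nNA\t80\nPave\t\nNA\t65\n"

toy = pd.read_csv(
    io.StringIO(sample),
    sep='\t',
    keep_default_na=False,   # do not treat 'NA', 'N/A', etc. as missing
    na_values=[''])          # only an empty field becomes nan

print(toy)
print(toy.isna().sum())      # missing count per column
```

In this toy sample the 'NA' entries in the first column stay as ordinary strings, while the empty field in the second column is read as nan.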

Once the data file is successfully read in, complete the following tasks.

(a) (3 marks) Write some code to automatically print out the column names of the variables

with missing values, as well as the number of missing observations associated with each of

those variables. The output should be sorted by the number of missing observations from

most to least. Note that a missing value is represented by the special numpy constant nan;

the ‘NA’ value of a categorical variable (e.g., ‘Alley’) is not considered missing. Hint: you

may find the .isna() method of a DataFrame object useful.

(b) (1 mark) Briefly discuss your finding in part (a).

(c) (3 marks) Construct a DataFrame that contains the five-number-summaries of all the

numerical variables in the dataset, excluding the variable ‘Order’. Round each value of the

DataFrame to its nearest integer. The resulting DataFrame should have a shape of (k, 5)

(i.e., k rows and 5 columns), where k is the number of numerical variables in the dataset. The

rows of your DataFrame should be indexed by variable names, and the columns should be

named as: min, 25%, 50%, 75%, and max, respectively. Print out the constructed DataFrame.
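
As a rough illustration of the pandas tools involved, the sketch below counts nan entries per column and builds a five-number summary on a small invented DataFrame. The data and the names toy, n_missing, numeric and summary are assumptions made for this example. It shows one possible approach, not a prescribed solution.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are invented.
toy = pd.DataFrame({
    'Order': [1, 2, 3, 4, 5],
    'Area': [10.0, np.nan, 30.0, np.nan, 50.0],
    'Price': [100.0, 200.0, np.nan, 400.0, 500.0],
    'Street': ['NA', 'Pave', 'NA', 'Grvl', 'Pave'],
})

# Count nan entries per column, keep only columns with at least one,
# and sort from most to least.
n_missing = toy.isna().sum()
n_missing = n_missing[n_missing > 0].sort_values(ascending=False)
print(n_missing)

# Five-number summary of the numerical columns (excluding 'Order'),
# one row per variable, rounded to the nearest integer.
numeric = toy.select_dtypes(include='number').drop(columns='Order')
summary = numeric.describe().loc[['min', '25%', '50%', '75%', 'max']].T
summary = summary.round().astype(int)
print(summary)
```

Note that describe() skips nan values when computing quantiles, so each row of summary is based only on the observed entries of that column.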

Question 2

(a) (4 marks) Graphically summarise the distributions of the variables ‘SalePrice’ and ‘Lot

Area’, one at a time, and briefly discuss the distributional characteristics of the two variables.

Your discussion should also connect the distributional characteristics to the domain-specific

context of these variables.

(b) (2 marks) Create two new Python variables (of pandas type Series), called log_saleprice

and log_lotarea, that contain the log-transformed values of ‘SalePrice’ and ‘Lot Area’,

respectively. To be clear, we say that a is a log-transformed value of b if a = log(b), where

log(·) is the natural logarithm function, that is, b = exp(a) := e^a.

(c) (3 marks) Graphically summarise the distributions of the new variables log_saleprice and

log_lotarea (created in part (b)), and briefly state the observed differences in distributions

between the log-transformed and the original variables.

(d) (1 mark) Create another new variable (of pandas type Series), called log_saleprice_01,

that contains the standardised values of log_saleprice such that log_saleprice_01 has

zero mean and unit variance. To confirm, print out the mean and variance of the new

variable and round the output to 2 decimal places.

(e) (2 marks) Create a Q-Q plot of the standardised variable log_saleprice_01 to check whether

the variable is normally distributed. Give your conclusion regarding normality based on the

Q-Q plot. Hint: you may find the qqplot function from the statsmodels library useful:

statsmodels.graphics.gofplots.qqplot. The documentation of this function can be accessed

via the URL: www.statsmodels.org/dev/generated/statsmodels.graphics.gofplots.qqplot.html.

(f) (2 marks) Graphically summarise the distribution of the variable ‘Neighborhood’ and briefly

discuss what you observe based on the graphical summary constructed.
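
The log transform and standardisation described in parts (b) and (d) can be sketched on synthetic data as follows. The sample is drawn from a log-normal distribution purely for illustration, and the names price, log_price and log_price_01 are hypothetical stand-ins rather than the variables required by the assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample standing in for a price-like variable.
rng = np.random.default_rng(0)
price = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=500), name='price')

# Natural-log transform: np.log is the natural logarithm, so exp undoes it.
log_price = np.log(price)

# Standardise: subtract the mean, then divide by the standard deviation.
# pandas uses ddof=1 for both .std() and .var(), so the two are consistent.
log_price_01 = (log_price - log_price.mean()) / log_price.std()

print(round(log_price_01.mean(), 2), round(log_price_01.var(), 2))

# Sample skewness before and after the transform.
print(round(price.skew(), 2), round(log_price.skew(), 2))
```

One detail worth keeping in mind: NumPy's np.var defaults to ddof=0, so mixing it with the pandas .std() above would give a variance slightly different from 1.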

Question 3

(a) (3 marks) Print out the correlation coefficient between ‘SalePrice’ and each of the other

numerical variables in the dataset, excluding the variable ‘Order’. The output should contain

both the variable names and their corresponding correlations. It should also be sorted by

the value of the correlation coefficient in descending order and rounded to 2 decimal places.

(b) (2 marks) Construct an appropriate plot that can help visualise the correlations in part (a).

(c) (1 mark) Briefly discuss the correlation coefficients in parts (a) and (b) in the context of

predicting ‘SalePrice’.

(d) (2 marks) Suppose that ‘Gr Liv Area’ is used to predict ‘SalePrice’. With the goal of

predicting ‘SalePrice’ in mind, construct an appropriate plot that can help visualise the

systematic relationship between these two variables.

(e) (2 marks) Briefly discuss the relationship between ‘Gr Liv Area’ and ‘SalePrice’ based

on the plot you created in part (d).

(f) (2 marks) Print out all the unique categories of the variable ‘Lot Shape’ together with the

number of observations falling into each category. This is called a frequency table. Based on

the obtained frequency table, briefly discuss why it could be a good idea to combine some of

the categories in ‘Lot Shape’.

(g) (2 marks) Create a new variable (of pandas type Series), called lotshape_binary, by

combining the categories {‘IR1’, ‘IR2’, ‘IR3’} of ‘Lot Shape’ into a single category named

‘IR’, so that the new variable has two categories {‘Reg’, ‘IR’}. Print out the frequency

table for lotshape_binary to confirm.

(h) (4 marks) Create a single plot that allows one to visually examine the effect of the new

variable lotshape_binary on the relationship between ‘Gr Liv Area’ and ‘SalePrice’,

and briefly discuss what you observe from the plot. Hint: see the “Spending and salary by

gender” exercise in the Week 4 tutorial.
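
For readers unfamiliar with the pandas calls behind sorted correlations and frequency tables, here is a small sketch on invented data. The DataFrame toy, its column names and the variables corr, freq and shape_binary are all assumptions made for this example and do not come from the Ames dataset.

```python
import pandas as pd

# Hypothetical toy data; names and values are invented for illustration.
toy = pd.DataFrame({
    'Price': [100, 150, 200, 250, 300, 350],
    'Area': [50, 70, 90, 120, 140, 160],
    'Age': [40, 35, 30, 20, 10, 5],
    'Shape': ['Reg', 'IR1', 'Reg', 'IR2', 'Reg', 'IR3'],
})

# Pearson correlation of each other numerical column with 'Price',
# sorted in descending order and rounded to 2 decimal places.
corr = (toy.select_dtypes(include='number')
           .corr()['Price']
           .drop('Price')
           .sort_values(ascending=False)
           .round(2))
print(corr)

# Frequency table: number of observations in each category.
freq = toy['Shape'].value_counts()
print(freq)

# Merge the sparse categories into a single one and recount.
shape_binary = toy['Shape'].replace(['IR1', 'IR2', 'IR3'], 'IR')
print(shape_binary.value_counts())
```

In this toy example each of the three irregular categories has a single observation, which is the kind of sparsity that merging categories is meant to address.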

