Coursework 1 – Exploratory data analysis and correlation
MATH20811 Practical Statistics: Coursework 1
The marks awarded for this coursework constitute 30% of the total assessment for the module.
Your solution to the coursework should be a concise report (max 10 pages) and it should take, on
average, about 15 hours to complete.
The submission deadline is 10am on Monday 28 October 2019.
Please note that this deadline is strict: a University-set penalty of 10% of the total marks
is applied for each day late, up to a maximum of five days, after which your mark for the
coursework will be zero.
Your submitted solutions should all be in one document. This must be prepared using LaTeX.
For each part of the question you should provide explanations as to how you completed what is
required, show your workings and also comment on computational results, where applicable.
When you include a plot, be sure to give it a title and label the axes correctly.
When you have written or used R code to answer any of the parts, then you should list this R code
after the particular written answer to which it applies. This may be the R code for a function you
have written and/or code you have used to produce numerical results, plots and tables. R code
should also be clearly annotated.
Avoid using screenshots of R code/output. Instead, to include R code use the verbatim environment
and summarise R output in tables using the table environment, as demonstrated in the solution of
Example Sheet 2.
Your file should be submitted through the module site on Blackboard to the Turnitin assessment
in the Coursework folder entitled “MATH20811 CW1” by the above time and date. The work
will be marked anonymously on Blackboard so please ensure that your filename is clear but that
it does not contain your name or student ID number. Similarly, do not include your name or
ID number in the document itself.
Turnitin will generate a similarity report for your submitted document and indicate matches to
other sources, including billions of internet documents (both live and archived), a subscription
repository of periodicals, journals and publications, as well as submissions from other students.
Please ensure that the document you upload represents your own work and is written in your own
words. The Turnitin report will be available for you to see shortly after the due date.
This coursework should hopefully help to reinforce some of the methodology you have been studying,
as well as the skills in R you have been developing in the module. Correct interpretation and
meaningful discussion of the results (i.e. attempt to put the results into context) are as important
as correct calculation of the results, in order to achieve a high mark for the coursework.
The data in red_wine.csv and white_wine.csv (Cortez et al., 2009) contain various measurements
on red and white variants of the Portuguese Vinho Verde wine. Import the data into R and
save them as objects red_wine and white_wine. Each object should contain measurements on
11 continuous variables: fixed.acidity, volatile.acidity, citric.acid, residual.sugar,
chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol
and one discrete variable: quality.
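As a minimal sketch of the import step, the code below reads one file and checks that the twelve variables listed above are present. The helper name load_wine and the vector wine_columns are hypothetical, and the code assumes the files are comma-separated, have a header row and sit in the current working directory; if the columns do not split correctly, a different sep value (for example ";") may be needed.

```r
# Variable names as listed in the coursework text.
wine_columns <- c("fixed.acidity", "volatile.acidity", "citric.acid",
                  "residual.sugar", "chlorides", "free.sulfur.dioxide",
                  "total.sulfur.dioxide", "density", "pH", "sulphates",
                  "alcohol", "quality")

# Hypothetical helper: read one wine file and check the expected columns.
# Assumption: comma-separated file with a header row.
load_wine <- function(path, sep = ",") {
  dat <- read.csv(path, header = TRUE, sep = sep)
  missing_cols <- setdiff(wine_columns, names(dat))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  dat
}

# Only attempt the import if both files are in the working directory.
if (file.exists("red_wine.csv") && file.exists("white_wine.csv")) {
  red_wine   <- load_wine("red_wine.csv")
  white_wine <- load_wine("white_wine.csv")
  str(red_wine)    # dimensions, column names and types
  str(white_wine)
}
```

Checking the column names immediately after import catches a wrong separator early, since a mis-split file typically arrives as a single column.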
1. Perform exploratory analysis of the data and report some interesting findings about the
data. Some suggestions include producing summary statistics of the data, comparing the
distributions of specific variables for each of the red and white variants using histograms or
box-plots (as appropriate) and exploring any associations between the variables, in particular
alcohol and quality. [10]
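The kinds of summaries and plots suggested in question 1 can be sketched as follows. The data frames toy_red and toy_white are simulated stand-ins, not the real wine data, and the particular plots shown are only one possible choice.

```r
set.seed(1)

# Simulated stand-ins for red_wine and white_wine (NOT real measurements).
toy_red   <- data.frame(alcohol = rnorm(100, mean = 10.4, sd = 1.0),
                        quality = sample(3:8, 100, replace = TRUE))
toy_white <- data.frame(alcohol = rnorm(150, mean = 10.5, sd = 1.2),
                        quality = sample(3:9, 150, replace = TRUE))

# Numerical summaries of a continuous variable for each variant.
summary(toy_red$alcohol)
summary(toy_white$alcohol)

# quality is discrete, so a frequency table is more natural than a histogram.
table(toy_white$quality)

# Side-by-side box plots comparing the two variants.
boxplot(list(red = toy_red$alcohol, white = toy_white$alcohol),
        main = "Alcohol by wine type (toy data)",
        xlab = "Wine type", ylab = "Alcohol")

# Association between a continuous and a discrete variable:
# one box of alcohol per quality level.
boxplot(alcohol ~ quality, data = toy_white,
        main = "Alcohol by quality, white wine (toy data)",
        xlab = "Quality", ylab = "Alcohol")
```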
2. Using the function cor, calculate both Pearson’s and Spearman’s correlation between:
• white_wine$chlorides and white_wine$alcohol
• log(white_wine$chlorides) and white_wine$alcohol
Comment on the results and give an explanation for any discrepancies between the various
correlation estimates. Hint: Inspecting the scatterplots for each pair might be useful. [5]
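For question 2, the method argument of cor selects the correlation type. The sketch below uses simulated variables x and y (not the wine data) in which y depends linearly on log(x), purely to illustrate the four calls and the side-by-side scatterplots mentioned in the hint.

```r
set.seed(2)

# Simulated data: x is positive and right-skewed, y is linear in log(x).
x <- rlnorm(200, meanlog = -3, sdlog = 0.5)
y <- 10 - 1.5 * log(x) + rnorm(200)

pearson_raw  <- cor(x, y, method = "pearson")
pearson_log  <- cor(log(x), y, method = "pearson")
spearman_raw <- cor(x, y, method = "spearman")
spearman_log <- cor(log(x), y, method = "spearman")

# Collect the four estimates in one small table.
round(rbind(raw = c(pearson = pearson_raw, spearman = spearman_raw),
            log = c(pearson = pearson_log, spearman = spearman_log)), 3)

# Scatterplots of each pair, side by side.
old_par <- par(mfrow = c(1, 2))
plot(x, y, main = "y against x (toy data)", xlab = "x", ylab = "y")
plot(log(x), y, main = "y against log(x) (toy data)", xlab = "log(x)", ylab = "y")
par(old_par)
```

Spearman's correlation depends only on ranks, so a strictly increasing transformation such as log leaves it unchanged, whereas Pearson's correlation measures linear association and generally does change.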
3. Let ρ1 be Pearson’s correlation between alcohol and density for the red wine dataset. Using
the function cor.test, test the hypothesis H0 : ρ1 = 0 vs HA : ρ1 ≠ 0 and report
your findings. Calculate (DIY) an approximate 95% confidence interval (CI) for ρ1
based on Fisher’s z-transform and verify your calculations agree with the CI produced by
cor.test. [5]
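The interval in question 3 is built on the transformed scale: z = atanh(r) is approximately Normal with standard error 1/sqrt(n − 3), and the endpoints are mapped back to the correlation scale with tanh. A minimal sketch, using simulated data rather than the red wine data and a hypothetical helper name fisher_ci:

```r
# Hypothetical helper: approximate CI for Pearson's correlation
# based on Fisher's z-transform.
fisher_ci <- function(x, y, conf_level = 0.95) {
  n  <- length(x)
  r  <- cor(x, y)                        # sample Pearson correlation
  z  <- atanh(r)                         # Fisher's z-transform of r
  se <- 1 / sqrt(n - 3)                  # approximate standard error of z
  q  <- qnorm(1 - (1 - conf_level) / 2)  # Normal quantile, 1.96 for 95%
  tanh(z + c(-1, 1) * q * se)            # back-transform the endpoints
}

# Simulated example data (not the wine data).
set.seed(3)
x <- rnorm(50)
y <- 0.6 * x + rnorm(50)

ci_diy <- fisher_ci(x, y)
ci_ref <- as.numeric(cor.test(x, y)$conf.int)  # drop the conf.level attribute

rbind(diy = ci_diy, cor.test = ci_ref)
```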
4. Perform (DIY) a hypothesis test for H0 : ρ1 = −0.5 vs HA : ρ1 > −0.5 at 2.5% significance
level, using Fisher’s z-transform. Compute the p-value and use it to decide whether to reject
the null hypothesis in favour of the alternative. [5]
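For question 4, the test statistic is the difference between atanh(r) and atanh(ρ0) multiplied by sqrt(n − 3), which is approximately standard Normal under H0; for a "greater than" alternative the p-value is the upper tail probability. The sketch below uses a hypothetical helper fisher_test_greater and made-up values of r and n, not values from the wine data.

```r
# Hypothetical helper: one-sided test of H0: rho = rho0 vs HA: rho > rho0
# using Fisher's z-transform.
fisher_test_greater <- function(r, n, rho0) {
  z_stat  <- (atanh(r) - atanh(rho0)) * sqrt(n - 3)
  p_value <- pnorm(z_stat, lower.tail = FALSE)  # upper tail for HA: rho > rho0
  list(statistic = z_stat, p.value = p_value)
}

# Made-up inputs for illustration only.
res <- fisher_test_greater(r = -0.3, n = 100, rho0 = -0.5)
res

# Decision at the 2.5% significance level.
reject_h0 <- res$p.value < 0.025
reject_h0
```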
5. Write a function in R to verify via simulation that the distribution of Fisher’s z-transform
statistic is approximately Normal. Your function should output a plot comparing
the sampling distribution of Fisher’s z-transform statistic and the appropriate Normal distribution
the statistic has under the null hypothesis. In your simulation, you may assume
the data pairs (x, y) come from independent Normal distributions and that the test statistic
corresponds to a test of zero correlation. [5]
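One possible shape for the simulation in question 5 is sketched below. The function name simulate_fisher_z and its arguments are hypothetical, and the default sample size and number of replications are arbitrary choices. Under zero correlation the standardised statistic sqrt(n − 3) · atanh(r) is compared with the standard Normal density.

```r
# Hypothetical function: simulate the Fisher z statistic under zero
# correlation and compare it with the N(0, 1) density.
simulate_fisher_z <- function(n = 30, n_sim = 5000, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)

  # Each replicate draws independent Normal x and y, so the true
  # correlation is zero, then standardises atanh(r) by its approximate SE.
  z_stats <- replicate(n_sim, {
    x <- rnorm(n)
    y <- rnorm(n)
    atanh(cor(x, y)) * sqrt(n - 3)
  })

  hist(z_stats, breaks = 50, freq = FALSE,
       main = paste("Fisher z statistic under H0, n =", n),
       xlab = "Standardised z statistic", ylab = "Density")
  curve(dnorm(x), add = TRUE, lwd = 2)  # standard Normal density overlay

  invisible(z_stats)                    # return the draws for further checks
}

z_draws <- simulate_fisher_z(n = 30, n_sim = 2000, seed = 1)
c(mean = mean(z_draws), sd = sd(z_draws))
```

Returning the simulated values invisibly makes it possible to follow the histogram with a Normal Q-Q plot (qqnorm) or numerical checks without re-running the simulation.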
References
[1] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data
mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009.
ISSN: 0167-9236.