STA220 Data Analysis Project Instructions
New Due Date: Tuesday June 11th
Submit in Tutorials at 6:10 p.m.
Purpose:
The objective of this project is to give you the opportunity to use some of the statistical techniques that you
have learned in this course for exploring a real data set.
Submission Format of Data Analysis Report:
You are required to answer the related questions, based on your analysis of these data, in the template
posted on our Quercus page. Please print your completed template and attach your R output to it.
You may work individually or in groups of no more than three students. Your group members can be from
different tutorial sections in our class. If you are working in a group, please create a team name
and note that you will submit only one report (the completed template with the R output of the data analysis
project) on the new due date in your tutorial (or your team member’s tutorial) on Tuesday June 11th at 6:10 p.m.
The way in which your data analysis project will be assessed is described in the section “Assessment of Data
Analysis Project” at the end of this document.
Context of Data:
The Organisation for Economic Co-operation and Development (OECD) gathers a variety of information
about OECD countries and their partners in order to promote policies that aim to improve the economic
and social well-being of people around the world (http://www.oecd.org/about/).
This agency collects quantitative information in many domains and makes the collected data available for
public use (e.g., by researchers) so that interested individuals can further investigate relationships among a
set of variables. A domain named “Social Protection and Well-being” includes a yearly data collection, the
“Better Life Index”, from OECD countries. This information can be retrieved from: http://stats.oecd.org
From the “Better Life Index” (BLI), using the most recent data (published in 2019 but collected in 2018), we
will analyze a quantitative variable named “Social Network Support”. Information regarding this variable can
be retrieved from:
http://www.oecd.org/statistics/OECD-Better-Life-Index-definitions-2019.pdf#_ga=2.145820212.1027110605.1559482147-696144184.1473183978
(note that this definitions document is posted on our Quercus page in the module Data Analysis Project).
This variable is a sub-component of the Social connections/Community component of the BLI. It reflects the
percentage of males and females aged 15 years and over in 36 OECD countries who perceive their social
network as including relatives or friends that they can count on for help in times of need or trouble.
The OECD indicates that it obtained and calculated this information based on a certain poll.
Let us recap the variables of interest in our data analysis:
1. Percentage of people (15 years of age and older) having social network support
2. Sex of the respondents, identified as Male or Female
I recommend that you read about this data here:
http://www.oecdbetterlifeindex.org/#/11111111111
Also, click on “Community” in the right-hand side menu to be directed to another web page:
http://www.oecdbetterlifeindex.org/topics/community/
Scroll down that page and you can click on each country to read about its social network support percentage.
R Activity:
1. Understanding and comparing the distributions of the percentages of males and females who reported
having social network support in the 36 OECD countries.
2. Examining the relationship between the percentages of males and females who reported having social
network support in the 36 OECD countries. That is, we aim to predict the percentage of males with social
network support from the percentage of females with social network support.
Overview of Steps:
1. Save the data file SocNet_BLI2019.txt on your computer (in your default R working directory).
2. Save the R script Support_Net_BLI2019.R.
3. Open RStudio. Go to File > Open File.
Search for the saved Support_Net_BLI2019.R script on your computer and open it.
4. Run each line of code step by step. Please make the necessary changes to the code as specified in the
R script. If you are working as a team, replace the specified variable names with your team name. Also,
make sure to give appropriate titles to the required plots. The code does not supply titles; it directs you to
write them. Copy your R output and paste it into a document (e.g., a Word document) to analyze the data.
5. Work on the related questions in Parts 1 to 3 below in order to interpret the results of your data analysis.
Part 1. Identify the Elements of Statistics and Method of Data Collection.
1. Who are the cases in this study?
2. Identify the population of interest in the context of this study.
3. Identify the sample in the context of this study.
4. Identify the population parameter(s) of interest in the context of this study.
5. What is/are the variable(s) of interest in this study? Identify their type and their scale of measurement.
6. Think about the purpose of this study. Why was this study conducted?
7. Where was this study conducted?
8. When was the study conducted?
9. How were the data for this study collected? Hint: Read the OECD web page on community:
http://www.oecdbetterlifeindex.org/topics/community/ to find the answer to this question.
Part 2. Compare percentages of perceived social network support between males and females.
1. Suppose that the researchers are interested in investigating the relationship between the percentage of
perceived social network support and the sex of the respondents in the 36 OECD countries. Identify the
response variable and the explanatory variable in the context of this study.
2. Use the side-by-side boxplots and the summary statistics to compare the distributions of the percentage of
perceived social network support for males and females in the OECD countries. That is, compare the
shapes, centres, and spreads of both distributions and identify any outliers.
3. Use the boxplot and summary statistics for the differences between females’ and males’ percentages of
perceived social network support (in each country) to describe what is apparent in this plot that is not
apparent in the other (the side-by-side boxplots of percentages of perceived social network support by sex).
Describe the shape, centre, and spread of this distribution. Indicate which countries are suspect outliers
(plotted as individual points on the boxplot) and what makes them unusual. That is, use the 1.5 × IQR rule
to determine whether the outlying points are suspect outliers. Also, find the number of standard deviations
that each potential outlier lies from the overall mean of this distribution. Discuss why this graph
(the boxplot of differences) is more useful for learning about differences between males and females in the
OECD countries.
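As a reminder of the arithmetic behind the 1.5 × IQR rule and the "number of standard deviations from the mean" in question 3, here is a minimal sketch. It is written in Python purely for illustration; your own analysis should come from the provided R script. The country labels and difference values below are made up and are not the OECD data, and the quartile method used here may differ slightly from the hinges that R's boxplot uses.

```python
import statistics

# Hypothetical female-minus-male differences (percentage points) for eight
# made-up countries. These are NOT the values in SocNet_BLI2019.txt.
differences = {
    "A": 1.0, "B": 2.0, "C": 0.0, "D": 3.0,
    "E": 1.5, "F": 2.5, "G": 0.5, "H": 10.0,
}

values = sorted(differences.values())

# Quartiles by linear interpolation ("inclusive" method). R's boxplot() uses
# hinges, so the fences in your R output may differ slightly.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

mean = statistics.mean(values)
sd = statistics.stdev(values)  # sample standard deviation (divides by n - 1)

# A point is a suspect outlier if it falls outside the fences; its z-score is
# the number of standard deviations it lies from the mean.
suspect = {
    country: (d - mean) / sd
    for country, d in differences.items()
    if d < lower_fence or d > upper_fence
}

for country, z in suspect.items():
    print(f"{country}: suspect outlier, {z:.2f} standard deviations from the mean")
```

With your real data, compare the fences computed this way against the individual points drawn on your R boxplot before naming any country as a suspect outlier.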
Part 3. Predict percentage of males’ perceived social network support from percentage of females’
perceived social network support.
1. Use the scatterplot of the percentage of males’ perceived social network support versus the percentage of
females’ perceived social network support to describe the relationship.
2. What is the estimated correlation coefficient? Interpret this value.
3. If we examined only those countries with a percentage of perceived social network support of over 90 for
both sexes, what would happen to the correlation? Explain why this would happen.
4. Fit a linear regression model relating the percentage of males’ perceived social network support to the
percentage of females’ perceived social network support. That is, fit a straight line for predicting the percentage
of males’ perceived social network support from the percentage of females’ perceived social network support.
What is the equation of the regression line?
5. What does the regression line tell us in the context of this study?
6. What does the slope of the regression line mean in the context of this study?
7. Note that the slope of the line does not differ much from 1.00. What would a slope of 1.00 indicate about
the nature of the relationship? If we fitted a model with the slope fixed at 1.00, what prediction equation
would you expect to get? (Hint: Refer to the summary statistics for males and females. Find the
mean percentage of perceived social network support for males and for females to answer this question.)
8. Can we meaningfully interpret the value of the y-intercept in the regression equation? Justify your answer.
9. What is the standard deviation of residuals? Interpret this value in the context of this problem.
10. Use the plots of residuals to assess the overall adequacy of the linear regression model fit to these data.
State the assumption(s) about the residuals that each constructed plot checks and determine whether the
assumption(s) is/are met.
11. In which country or countries do the male respondents have a “somewhat unusually” low percentage of
perceived social network support relative to the female respondents, according to the
regression model? Give the residual(s) to support your argument.
12. Give and interpret the R-squared value in the context of this study.
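The quantities asked for in Part 3 (the correlation coefficient, the least-squares line, the residuals, the standard deviation of residuals, and R-squared) are all connected by a few formulas. The sketch below illustrates them in Python on five made-up data points. It is not the OECD data and not a replacement for the R output you must attach; the variable names are chosen here for illustration only.

```python
import math
import statistics

# Hypothetical percentages for five made-up countries (NOT the OECD data).
female = [88.0, 90.0, 92.0, 94.0, 96.0]  # explanatory variable x
male = [86.0, 89.0, 90.0, 93.0, 95.0]    # response variable y

n = len(female)
x_bar = statistics.mean(female)
y_bar = statistics.mean(male)

# Sums of squares and cross-products about the means.
sxx = sum((x - x_bar) ** 2 for x in female)
syy = sum((y - y_bar) ** 2 for y in male)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(female, male))

# Correlation coefficient and least-squares estimates.
r = sxy / math.sqrt(sxx * syy)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residual = observed male percentage minus the percentage the line predicts.
residuals = [y - (intercept + slope * x) for x, y in zip(female, male)]
sse = sum(e ** 2 for e in residuals)

# Standard deviation of residuals with n - 2 degrees of freedom, the same
# divisor R uses for the "Residual standard error" in summary(lm(...)).
s_e = math.sqrt(sse / (n - 2))

# R-squared: share of the variation in y explained by the line.
r_squared = 1 - sse / syy

# If the slope is fixed at 1, least squares chooses the intercept as the mean
# of (y - x), which equals the difference of the two means.
intercept_fixed_slope = y_bar - x_bar

print(f"r = {r:.3f}, slope = {slope:.3f}, intercept = {intercept:.3f}")
print(f"s_e = {s_e:.3f}, R-squared = {r_squared:.3f}")
print(f"intercept with slope fixed at 1 = {intercept_fixed_slope:.3f}")
```

In simple linear regression R-squared equals the square of r, which is a quick consistency check on your own R output. A large negative residual marks a country whose male percentage is well below what the line predicts from its female percentage.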
Assessment of Data Analysis Project
Last Name of Student
1.________________________________________
2. ________________________________________
3. ________________________________________
Part 1, Question                                            Point(s)   Point(s) Received
1: Identify the cases in this study                         1
2: Identify population of interest in this study            1
3: Identify the sample in this study                        1
4: Identify population parameter(s) of interest             1
5: Identify variable(s) of interest in this study           2
6: Identify the purpose                                     1
7: Location (where) of this study                           1
8: Time (when) of this study                                1
9: Data collection (how)                                    1
Total                                                       10

Part 2, Question                                            Point(s)   Point(s) Received
1: Identify the response and explanatory variables          2
2: Interpretation of side-by-side boxplots                  6
3: Interpretation of boxplot of differences                 12
Total                                                       20

Part 3, Question                                            Point(s)   Point(s) Received
1: Interpretation of scatterplot                            1
2: Interpretation of correlation coefficient, r             1
3: Interpretation of restricting the range                  1
4: Identifying the equation of the regression line          1
5: Interpretation of the regression equation                1
6: Interpretation of the estimated slope                    1
7: Realization of fixing slope at 1                         2
8: Interpretation of the estimated y-intercept              1
9: Interpretation of the standard deviation of residuals    1
10: Diagnostic check of residuals using plots               2
11: Detection of unusual residual values                    2
12: Interpretation of the value of R-squared                1
Total                                                       15

Inclusion of R outputs: necessary modifications made
to variable names and titles of plots                       5

Total Points                                                50
Marked by TA:
Comment (if any):