Date: 2019-12-17 11:11

In-course assessment 3 (ICA 3), Autumn Term 2019 STAT0006

Overview of ICA 3

There are two tasks associated with this ICA. You must complete both. More details

are given below, but in brief:

1. Analyse the given dataset culminating in a linear model for life expectancy, writing

your findings as a short report.

2. Write a short supplement to the existing STAT0006 course notes on the implications

of heteroskedasticity of the error term in a linear regression model. This should

include a discussion of the methods available to reduce or eliminate heteroskedasticity,

and a demonstration of these methods in R. You should make sure that your

supplement is suitable for students on this course.

Task 1

The data

Data are available in the Excel file icadata.xlsx, and you should analyse it in any way you

see fit in order to address the task in hand.

Each row relates to a particular country. In the original data there were missing values

but these have been imputed for the purposes of this assessment. Your task is to investigate

the drivers of life expectancy in 2018 - that is, the mean number of years a newborn

would live if current mortality patterns were to stay the same (the variable lifeexp2018

in the dataset) - using the other variables in the dataset. Please see the associated data

dictionary for a description of each variable.

How do the factors in the given dataset collectively affect life expectancy? Your analysis

should include building a linear model for life expectancy. Please note that these are real

data. There is no ‘correct model’: it all depends on the assumptions you are willing to

make.

Write a report on your findings, which should include the following things:

• An initial exploratory analysis of the dataset. The aim of this is to give someone

who doesn’t have access to the data an overview of what the data are and a feel for

the variables in the dataset (e.g. summaries of each variable or simple relationships).

This should be non-technical.

• A description of how you approached the model-building phase. Don’t just show your

chosen final model. How did you choose your particular model? What processes did

you go through?

• How well does your final model fit the data? Note that you don’t need to write

about the fit of all models; you just need to convince me that your final model is a

reasonable fit for the data.

– Please note that these are real data.

– It can be the case that getting a linear model to fit well is a little tricky.

– This doesn’t mean that you shouldn’t care about how well the model fits your

data but you should be realistic in terms of how well a linear model will fit (i.e.

is the fit reasonable?).

• A brief description of the final model. What does it tell you about the drivers of life

expectancy?

• Conclusion, including a brief discussion of limitations of the data and model. Do

you think the model is reliable?

• Note that, unless you find the data to have an issue with heteroskedasticity, you don’t

need to apply the methods discussed in Task 2.

The maximum length for Task 1 is three sides of A4, which is to include plots/tables/

figures. Make sure that any plots/tables/figures, if applicable, are legible (i.e. don’t

squeeze these in if they are not readable - you will be penalised for this). The minimum

font size is 11pt. You may choose your own margin size. Given the maximum length, you

are strongly advised to select plots/tables/figures with care.

Task 2

Heteroskedasticity (unequal variances) of the error term violates one of the assumptions

we make in fitting regression models. As a group, investigate:

(a) methods for detecting heteroskedasticity;

(b) how heteroskedasticity impacts model estimates (regression coefficients and their

standard errors, variance of the error term);

(c) methods to overcome this issue, including advice on when to use these methods and

how to implement them in R;

(d) a demonstration in R of the impact of heteroskedasticity on model estimates and

a commentary on how the methods you discuss in (c) help in dealing with heteroskedasticity.
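
For item (a), a graphical check might be sketched as follows. This is only an illustration under assumptions of my own: the error standard deviation is taken to be 0.1 times x, and the names sd_ratio, lower and upper are hypothetical. The plot of residuals against fitted values is the main diagnostic; the ratio of residual spreads is a rough numerical companion, not a formal test.

```r
# Simulate data whose error SD grows with x (an assumed, illustrative mechanism)
set.seed(1)
n <- 1000
x <- runif(n, 0, 100)
error <- rnorm(n, mean = 0, sd = 0.1 * x)  # SD depends on x, so the errors are heteroskedastic
y <- -10 + 0.4 * x + error

# Fit the usual linear model by ordinary least squares
fit <- lm(y ~ x)

# Residuals against fitted values: a funnel shape suggests non-constant variance
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted values")
abline(h = 0, lty = 2)

# Rough numerical companion: residual spread in the upper half of x
# relative to the lower half (close to 1 under constant variance)
lower <- resid(fit)[x <= median(x)]
upper <- resid(fit)[x > median(x)]
sd_ratio <- sd(upper) / sd(lower)
sd_ratio
```

A ratio well above 1 agrees with a funnel that widens from left to right. With real data the pattern is usually less clean, so the plot should be read first and the ratio only second.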

The output should be a written document in the style of a supplement to the current

STAT0006 course notes. That is, your supplement should be accessible (understandable)

to STAT0006 students. Your supplement should not be overly mathematical. Instead,

your focus should be on explaining the practicalities of using the methods along

with their implications and interpretation.

This supplement should be no longer than four sides of A4. For this assignment you should

reference sources; the list of references is not included in the 4 page limit (i.e. you may

have additional pages with a list of references). However you should not quote directly

from any source, even if you put the text in quotation marks and reference it: everything

should be in your own words but acknowledging where you found the information. The

same goes for pictures or graphics: you may not copy and paste a picture/graphic that

you found elsewhere into your write-up. Minimum font size is 11pt, but you may choose

your margin size and font. Any graphics should be large enough to be easily readable,

adequately labelled, and captioned.

More details about what to include in your supplement for Task 2

I expect the supplement to contain the following for each of (a) through (d) listed above:

(a) An explanation of what heteroskedasticity is, and a discussion of how to detect such

problems. The emphasis of this section should be on the graphical methods as seen

in class though you are welcome to investigate other methods.

(b) A discussion of the implications of heteroskedasticity on the estimated regression

coefficients, their standard errors, and the variance of the error term (that is, how

is heteroskedasticity likely to affect these factors?).

(c) What methods exist to reduce or eliminate the effect of heteroskedasticity? When

should these methods be used? How can we implement these methods in R?

• You are free to choose any relevant methods, but you should discuss transformations

of variables, robust standard errors, and weighted least squares as

possible remedies. For the latter, you should include a discussion on how to

find appropriate ‘weights’. While you are welcome to mention other techniques

you will not get extra credit for them.

• You do not need to write your own R code to implement robust standard

errors or weighted least squares. I expect you to research existing packages

which contain relevant functions. You should explicitly state which packages

and functions can be used to implement these methods (there may be more

than one package for each of these methods, but you don’t need to write about

all packages that do the job - one will do!).

(d) Demonstrate the use of these methods in R on simulated data.

• You will need to simulate your own data in R, but please keep it simple! Having

just one numeric covariate will suffice and will enable you to plot results.

• Start with generating bivariate data which has no inherent heteroskedasticity

(see example code below, which you are welcome to use directly) and then

introduce heteroskedasticity.

– There are many ways of doing the latter and it’s worth spending time

thinking about what heteroskedasticity means in order to design some way

of introducing it into your data.

– I don’t mind how you do this as long as it’s sensible!

– Don’t over-complicate things: you can introduce heteroskedasticity in relatively

straightforward ways.

• I expect you to show the following:

– Explain how you simulated your heteroskedastic data (i.e. tell me the

format of the model that generated the data including the parameters used

- I give an example below, but this is for homoskedastic data).

– You can vary the ‘amount’ of heteroskedasticity in your data. This means

you can have one simulated dataset with a ‘mild’ or ‘moderate’ amount of

heteroskedasticity, and another where the data are even more heteroskedastic.

Keep it simple here - just two datasets with different amounts of heteroskedasticity

will do. Make sure that the data generating mechanism is

the same for both, other than the amount of heteroskedasticity (otherwise

you will not be able to see what happens as you increase the amount of

heteroskedasticity, holding everything else in the data-generating mechanism

the same).

– For your simulated data with ‘mild’ or ‘moderate’ heteroskedasticity, compare

the model estimates from fitting a model using the usual methods

(ordinary least squares) and, where possible, using the methods you discussed

in part (c).

– Do the same for your dataset where the heteroskedasticity is more pronounced.

– How do the methods compare? How well do the methods estimate the

regression coefficients and variance of the error term? What happens to

the standard errors of the regression coefficients in each case? How does

the amount of heteroskedasticity affect the results?

∗ Note: the purpose of simulating the data is that you know what the

‘true’ values of the regression coefficients and variance of error term are,

which wouldn’t be available to you with real data.

∗ You can compare these ‘true’ values with those you get from the various

analyses requested here.
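
To make the remedies in (c) concrete, here is a minimal sketch in base R. It rests on assumptions that are mine, not part of the brief: the error standard deviation is proportional to x, so weights of 1/x^2 are appropriate, and the robust standard errors are the basic White (HC0) form computed by hand. In practice, vcovHC() from the sandwich package together with coeftest() from the lmtest package is one common package route to robust standard errors. The hand computation is there to show the idea without a package dependency.

```r
set.seed(2)
n <- 1000
x <- runif(n, 1, 100)           # start at 1 so the weights 1 / x^2 stay finite
error <- rnorm(n, 0, 0.1 * x)   # assumed mechanism: error SD proportional to x
y <- -10 + 0.4 * x + error

# Ordinary least squares, ignoring the heteroskedasticity
ols <- lm(y ~ x)

# Heteroskedasticity-robust (White / HC0) covariance matrix by hand:
# (X'X)^{-1} X' diag(e_i^2) X (X'X)^{-1}
X <- model.matrix(ols)
e <- resid(ols)
bread <- solve(crossprod(X))
meat <- crossprod(X * e)        # row i of X is scaled by e_i, giving X' diag(e^2) X
robust_se <- sqrt(diag(bread %*% meat %*% bread))

# Weighted least squares: weights proportional to 1 / Var(error_i).
# Here Var(error_i) is proportional to x_i^2 by construction.
wls <- lm(y ~ x, weights = 1 / x^2)

# Standard errors of the intercept and slope under the three approaches
se_table <- cbind(
  OLS    = summary(ols)$coefficients[, "Std. Error"],
  Robust = robust_se,
  WLS    = summary(wls)$coefficients[, "Std. Error"]
)
se_table
```

The weights are known here only because the data were simulated. With real data they would have to be estimated or motivated, for example from a plot of the residuals, which is why the discussion of how to find appropriate weights matters.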

Note: heteroskedasticity is sometimes a consequence of violations of other assumptions.

In this assignment you may assume that all other assumptions have been met (that is, the

heteroskedasticity isn’t caused by another assumption being violated).
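
Under that assumption, one simple way of varying the amount of heteroskedasticity while holding everything else fixed might look like the sketch below. The helper simulate_y, the power k and the form 4 * (x / 50)^k for the error standard deviation are hypothetical choices made for illustration. Many other sensible mechanisms exist.

```r
set.seed(3)
n <- 1000
x <- runif(n, 1, 100)
z <- rnorm(n)   # one set of standard normal draws shared by every dataset

# Hypothetical helper: the error SD is 4 * (x / 50)^k.
# k = 0 gives a constant SD of 4 (homoskedastic); larger k means stronger heteroskedasticity.
simulate_y <- function(k) -10 + 0.4 * x + 4 * (x / 50)^k * z

y_mild   <- simulate_y(0.5)
y_strong <- simulate_y(1.5)

# Residual spread in the upper half of x relative to the lower half,
# used here as a rough measure of how heteroskedastic each dataset is
spread_ratio <- function(y) {
  r <- resid(lm(y ~ x))
  sd(r[x > median(x)]) / sd(r[x <= median(x)])
}

ratios <- c(none   = spread_ratio(simulate_y(0)),
            mild   = spread_ratio(y_mild),
            strong = spread_ratio(y_strong))
ratios
```

Because x and z are reused, the datasets differ only through k, so any change in the fitted models can be attributed to the amount of heteroskedasticity rather than to a fresh random sample.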

Example code for generating homoskedastic data

Note that the data generated by the code below are homoskedastic, not

heteroskedastic.

#Set sample size

n<-1000

#Generate a single covariate.

#I’m assuming the covariate follows a uniform[0, 100] distribution.

x<-runif(n,0,100)

#Generate an error term, which is normally distributed with mean 0 and SD 4

error<-rnorm(n,0,4)

#Choose parameters (intercept and slope) and generate the outcome:

y<- -10+0.4*x+error

I would explain the data generating process as follows.

One thousand bivariate observations were generated according to the following model. We

assumed that the covariate X was uniformly distributed between 0 and 100, and that the

error term, $\varepsilon$, was normally distributed with mean 0 and standard deviation of 4. The

outcome, $Y$, was then generated assuming:

$Y_i = -10 + 0.4 x_i + \varepsilon_i.$
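
As a quick check on data generated this way, the fitted model can be compared with the known parameter values. The sketch below regenerates the homoskedastic data with a fixed seed (the seed and the names fit, truth and estimates are my own additions) and sets the ordinary least squares estimates beside the true intercept, slope and error standard deviation.

```r
set.seed(4)
n <- 1000
x <- runif(n, 0, 100)
error <- rnorm(n, 0, 4)
y <- -10 + 0.4 * x + error

# Fit by ordinary least squares
fit <- lm(y ~ x)

# True values used in the simulation, next to their estimates
truth <- c("(Intercept)" = -10, x = 0.4, sigma = 4)
estimates <- c(coef(fit), sigma = summary(fit)$sigma)
round(cbind(truth, estimates), 3)
```

The same side-by-side comparison carries over to heteroskedastic data, with the caveat that a single 'true' error standard deviation no longer exists once the spread depends on x.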

Administrative details

Basic details

• This assessment counts for 50% of your final mark for STAT0006.

• You should work in groups of no more than 3 students. You may work on the project

alone if you wish but note that this is not efficient. It is up to you to form your own

groups. You should have already registered your group on Moodle.

In addition to the outputs from Tasks 1 and 2, all groups must submit an additional page

where each group member briefly describes their contribution to the project.

• You will need to agree this in your groups before submitting the report.

• If all group members agree that everyone contributed equally then it is sufficient

to write a single sentence to that effect, or alternatively you are very welcome to

describe your own personal contribution to the project.

• Note that I will not mark this page, nor allocate different marks to different group

members based on this. The purpose is to encourage you all to be mindful about

contributing to this piece of group-work.

• If you feel that one or more of your peers is not contributing fairly, please contact

me by email in the first instance BEFORE SUBMISSION of the report and as early

as possible.

You should insert student ID numbers of all students in your group on the report, but do

not write your names. Your report will be marked anonymously. This also applies to

the page with descriptions of contributions.

Please note: it would be very helpful if you could adhere to the following format when

submitting your work:

• Please only submit ONE document, which should include everything: the first three

pages should be allocated to Task 1, the next four to Task 2, and the final page

should have your declaration of contribution to the project.

• Please start Task 2 and the declaration of contributions on a fresh page.

• Do not provide a cover sheet.

• All pages should have your GROUP NUMBER and STUDENT NUMBERS printed

somewhere along the top of the page.

• Save your document as a pdf file.

• Name your file using your group name, e.g. Group ICA3 200.pdf

• Please see associated document (layout.pdf) for suggested layout.

How do I get help with this assignment?

You can ask for help from me during office hours. Please note that I will not provide

comments on draft reports. Note that it may not be appropriate for me to answer all your

questions.

You may also post to the Moodle forum to ask questions. Please do not email me with

statistical questions - if you do, I will ask you to post them to the Moodle forum instead.

This being said, you should email me immediately if you have any technical difficulties

with Moodle (e.g. with submitting your report).

Submitting your work

The outputs from both tasks should be submitted to Moodle by 12 noon on Monday

27th January 2020. A submission button will appear on Moodle a few days prior to this

date. Under no circumstances should you email me your submission - if you do this, I

will immediately delete your email.

How will the report be marked?

Your report will be marked out of 50, with allocation as follows:

• 25 marks for Task 1, split as follows:

– 18 marks for the content of the report, including whether you have selected

appropriate information and supporting evidence (e.g. plots, tables), whether

your interpretation of the results is accurate, etc.

– 7 marks for the presentation and clarity of the report overall, including clarity

of expression and how easy it is to read and understand, whether you have

structured the report sensibly, good use of plots/tables where appropriate, adequately

sized graphics with suitably informative captions and labelling, and

so on.

• 25 marks for Task 2, split as follows:

– 15 marks for technical accuracy of your supplement;

– 10 marks for overall presentation and clarity of the supplement, including suitability

for the intended audience.

The mark you will receive is your group mark - everyone in the group will be awarded the

same mark, unless there are exceptional circumstances (e.g. a member of a group did not

contribute to the project).

Elinor M Jones

December 2019
