Ludwig-Maximilians-Universität München – Institute of Statistics

Take-home exam for the lecture Statistische Software (R)

Summer Term 2020

Notes:

1. The exam problems are provided in English. You are free to hand in verbal solutions in English or

German. However, you need to stay consistent within one problem solution. For language clarification

use the Moodle forum.

2. The take-home exam features 7 problems. There are 90 marks to be achieved via solving all 7 problems.

Additionally, you can receive up to 10 marks for compliance with the style guide of Hadley Wickham. All

problems need to be solved either using R code or with verbal solutions.

3. By submitting the exam you register for the exam at the same time.

4. Teamwork is prohibited. Your work must be your own. We expect you to attach a signed declaration of

originality to the exam. If there is any suspicion of cheating, we will report any associated person to

the examination board (Prüfungsamt für Statistik). This may ultimately result in failing the exam

and/or additional disciplinary measures.

5. The exam submission will take place via the Moodle page of the lecture. We expect you to hand

in a ZIP-file containing the following:

- an .Rmd file with your solutions. Your solutions will consist of both text and code answers.

Use the template on the Moodle page.

- the compiled version (PDF) of your .Rmd

- all data sets used in the .Rmd to make sure that your code actually runs.

Please only supply external data sets that are not replicable with R code.

- a declaration of originality (PDF).

If any of these items is missing, the exam will be failed. This also holds if the .Rmd cannot be compiled.

6. Enter your complete name and your student ID into the header of the supplied template. The template

only covers the first few problems. You need to continue it individually.

7. For your files, use the following naming scheme: statsoft_firstname_lastname.Rmd, statsoft_firstname_lastname.pdf,

statsoft_firstname_lastname_declaration.pdf. Name your ZIP

file statsoft_firstname_lastname.zip. The naming of the data sets needs to be consistent with your R

code. When reading an external data set (typically .csv), you have to assume that the data set is in

your current directory. This means that you need to read it via, e.g., read_csv("data.csv").

8. The exam period will be from 05.08.2020 to 02.09.2020.

9. Final deadline: 02.09.2020 - 18:00 CEST (Central European Summer Time). Any exams submitted

after the deadline will not be considered. This means they will also not be counted as an attempt.

10. You are only allowed to use the base packages of R unless the problem explicitly asks for other packages. The base

packages are:


(.packages())

## [1] "stats" "graphics" "grDevices" "utils" "datasets" "methods"

## [7] "base"

11. Make sure that in the PDF output all of your solutions are displayed (echo = TRUE).

12. Limit yourself to only providing the requested solution. If you supply unnecessary code and answers,

this may result in mark deduction.

13. Make sure that your code is easily correctable, precise, efficient, and easily readable and understandable.

14. Make use of comments in your code if necessary. Document your functions when requested. Be precise

and short in answer and code.

15. By submitting the exam you consent that we may electronically check your exam for plagiarism.

16. Follow the instructions on the Moodle page regarding what needs to be in your declaration of originality.

17. Questions on the exam are answered in the Forum only to ensure that all students receive the same

information. You find the Forum on top of the Moodle page of the lecture.


Problem 1 – Tidyverse (25P)

This problem is a straight-forward data science problem solved by using R effectively. It is not too complex

but most likely something you will see over and over again when working with data. Use the tidyverse

package(s) for this task.

a) Download the following dataset from openml https://www.openml.org/d/187 as a .csv and read it

into R such that it is a tibble. Create a new column wine_type from the existing column class with

3 levels: Let 1 denote “red_wine”, 2 “white_wine” and 3 “other_wines”. Create boxplots using ggplot

of the alcohol content for the three different wine types using your new column wine_type. Rearrange

the data set so that wine_type is the first column. (3P)

b) You want to investigate a few key features of each wine type. Create a tibble with the name

wine_summary that contains the min and max alcohol content, the mean of Flavanoids, and the median

of color_intensity. Which wine type contains the wine with the maximum alcohol content? (1P)

c) Look for the three wines with the minimum alcohol content in each group of wines in the original wine

tibble from a). Report on their magnesium and Total_phenols. Make sure that for each wine_type

you only select the row from the wine tibble that contains the wine with the minimum alcohol content

from its corresponding group. (2P)

d) Is the wine data tidy? Explain your answer. (1P)

e) From a different data source, you know that wines with flavanoids between and equal to 1 and 2 are

from France and wines with flavanoids greater than 3 or smaller than 1 are from Germany. Create a

new column containing the regions. Create another column that contains 1 if the region is Germany

and 0 if the region is France. How many wines are from each region? How many NAs are there? Where

are the most white wines from? (3P)

f) Our information on the regions was insufficient as there are still NAs in the data. Fit a linear model on

the part of the wine data without NAs using the R function lm. As the dependent variable use one of

the region columns you have constructed in e) and as independent variables use Alcohol, Malic_acid,

and Ash. Use the linear model to predict the missing region for the other part of the wine data that

contains NAs.

Hints: Make sure that your dependent variable is numeric during fitting. Also make sure that the final

prediction fits the region variable (you may need some post processing after the model prediction). (5P)

If you did not succeed in predicting the missing values, you may use the RDS region_01.RDS on the Moodle

page for the subsequent problems.

g) Combine the part of the wine data on which you fitted the model in f) and the part of the wine data

with the predicted regions back into one wine dataset. For this, all columns need to be identical in

both tibbles, so make sure they are. Check if there are really no NAs left. (4P)

h) Now that we have a region for each wine in the dataset, we want to look at some summary statistics for

each region. Create a new data set that contains the mean and median observation of each numeric

column in the wine data for each region and wine_type. Name the new columns such that they have

the suffix "_mean" and "_median". (3P)

i) Create a scatter plot using ggplot of Color_intensity on the x-axis and Alcohol on the y-axis from

your wine data set from g). For each wine_type, choose a different color of points. Add a smooth line

including its confidence interval that fits all the data points in the plot. Color the smooth line in black.

Add another three smooth lines for each wine_type and color them in the same color as the corresponding

points. Do not display the confidence interval for these lines. How are Alcohol and Color_intensity

related to each other in this plot for the three wine types? (3P)


Problem 2 – Basic R (11P)

In this problem, we want you to demonstrate that you understood how R operates in the backend. Make sure

to invest enough time to actually give a good answer using the material from this lecture.

a) Explain the following outputs. Focus your answer on how a single function can be so flexible w.r.t.

its input. Note: We only want a vague verbal explanation, technical details are far beyond what we

taught you. (1P)

summary(iris)

## Sepal.Length Sepal.Width Petal.Length Petal.Width

## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100

## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300

## Median :5.800 Median :3.000 Median :4.350 Median :1.300

## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199

## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800

## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500

## Species

## setosa :50

## versicolor:50

## virginica :50

##

##

##

summary(lm(Petal.Width ~., data = iris))

##

## Call:

## lm(formula = Petal.Width ~ ., data = iris)

##

## Residuals:

## Min 1Q Median 3Q Max

## -0.59239 -0.08288 -0.01349 0.08773 0.45239

##

## Coefficients:

## Estimate Std. Error t value Pr(>|t|)

## (Intercept) -0.47314 0.17659 -2.679 0.00824 **

## Sepal.Length -0.09293 0.04458 -2.084 0.03889 *

## Sepal.Width 0.24220 0.04776 5.072 1.20e-06 ***

## Petal.Length 0.24220 0.04884 4.959 1.97e-06 ***

## Speciesversicolor 0.64811 0.12314 5.263 5.04e-07 ***

## Speciesvirginica 1.04637 0.16548 6.323 3.03e-09 ***

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## Residual standard error: 0.1666 on 144 degrees of freedom

## Multiple R-squared: 0.9538, Adjusted R-squared: 0.9522

## F-statistic: 594.9 on 5 and 144 DF, p-value: < 2.2e-16

b) Explain the following outputs. Note: For this answer, it matters which R version you have. We typically

expect you to have R >= 4.0.0. Thus, please indicate if you do not (e.g. via sessionInfo() at the

end of the document). (1P)


class(state.x77)

## [1] "matrix" "array"

typeof(state.x77)

## [1] "double"

class(as.data.frame(state.x77))

## [1] "data.frame"

typeof(as.data.frame(state.x77))

## [1] "list"

c) What do you find odd about the following outputs? Write code to make sure that the integer values

are identical to the factor values. (This means that the factors actually display ones and zeros.) (1P)

fct <- as.factor(c(0, 1, 0, 1, 1))

int <- as.integer(fct)

fct

## [1] 0 1 0 1 1

## Levels: 0 1

int

## [1] 1 2 1 2 2

d) Explain the following behaviour of R. Hint: infix. (1P)

`+`(3, 4)

## [1] 7

e) Based on d) write your own infix function which creates an integer sequence. The function should

take two inputs (from, to) and return an integer sequence. Name the function %to%. (2.5P)

3 %to% 11

## [1] 3 4 5 6 7 8 9 10 11

f) Define a positive semidefinite matrix in R. (1P)

g) Find an explanation for the following behaviour. (1P)

(0.1 + 0.2) == 0.3

## [1] FALSE

h) Compute the element-wise, matrix and cross product of the following matrices. Both X and Y should

be on the right-hand and left-hand side of the operation if meaningful. Also, invert X. (2.5P)

X <- matrix(c(3, 4, 0, 5, 6, 1, 0, -1, 7), nrow = 3)

Y <- matrix(c(7, 44, 0, -5, 16, -1, -4, -1, 0), nrow = 3)
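
The three products in h) are easy to confuse, so here is a minimal sketch on two small made-up matrices A and B (hypothetical names, deliberately not the X and Y from the problem). It assumes the usual base R tools: `*` for the element-wise product, `%*%` for the matrix product, crossprod() for t(A) %*% B, and solve() for the inverse.

```r
# Illustration only: A and B are invented 2 x 2 matrices, not the exam data.
A <- matrix(c(2, 0, 1, 3), nrow = 2)   # filled column-wise: rows (2, 1), (0, 3)
B <- matrix(c(1, 4, 2, 5), nrow = 2)   # rows (1, 2), (4, 5)

A * B              # element-wise product (commutative)
A %*% B            # matrix product; in general A %*% B differs from B %*% A
B %*% A
crossprod(A, B)    # same as t(A) %*% B
A_inv <- solve(A)  # inverse; exists because det(A) = 6 is non-zero
A %*% A_inv        # identity matrix up to rounding error
```

Since the matrix product is not commutative, the order of the operands matters for `%*%` and crossprod(), but not for `*`.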


Problem 3 – Functions from your daily life (8P + 1P)

We want you to implement some functions that you may know from your daily life. If you need some

additional input, feel free to consult the internet. Please cite the URL which you used here. Wikipedia

will be accepted as a reference. We only expect you to write code that works for scalar input. If your code

accepts vectors (and of course works) you will receive an extra mark. That means, for example, if your

function computes your blood alcohol concentration, it must work for one individual; however, we value it if

you manage to return a vector of blood alcohol levels for multi-dimensional input (i.e. multiple persons).

Some of your functions will need only one input, some will need multiple ones. Make a short documentation

(description, arguments, output) of your function. (See a).) State the resource you used for the function

in the documentation. We are only interested in your resource if your solution is wrong. Then, however, we

will only accept freely accessible German or English resources. Give your functions proper names. Also, make

sure that variable names are somewhat meaningful. In general, we only expect you to supply reasonable

input. Thus, your functions do not need to perform input checking. That means it is okay if unreasonable

input results in unreasonable output – as long as your function is correct.

a) Temperature conversion: Write a function that converts degrees Fahrenheit into Celsius. Also, write a

function converting Celsius into Fahrenheit. (0.5P)

#' Temperature conversion from Fahrenheit to Celsius

#' This function converts degrees F to C using the formula provided here:

#' https://www.metric-conversions.org/temperature/fahrenheit-to-celsius.htm

#' Arguments:

#' temp: temperature in Fahrenheit. A numeric vector.

#' Returns: a numeric vector indicating the temperature in Celsius.

b) Body-mass-index: Write a function that computes the BMI. (0.5P)

c) Risk assessment: Write a function that computes the Sharpe ratio (a financial market metric). Hints:

Make sure that you cast percentages as floats. The risk-free rate is the same for all portfolios. (0.5P)

d) Energy conservation of kinetic energy: Write a function that predicts the speed of impact (either in

km/h or m/s) of a physical body when being dropped from any given height. Ignore frictions etc. just

like in high school physics. Hint: Energy is conserved! (0.5P)

e) Speeding fines: Write a function that computes your speeding fine (in Euros) with a car (PKW, no

trailer!) in Germany. Neglect the tolerance of 3kph which is typically applied. (This resource may be

helpful. We are only interested in the fine and not other penalties. Despite the current legal issues, use

the normal 2020 penalties. Reported speeds are always rounded.) (2.5P)

f) Tax in 2020: Write a function that computes the income tax to be paid by a single individual in

Germany in 2020. (This resource may be helpful. We are only interested in singles. Thus the zVe is

identical to the real income. Only rounded incomes are considered by the tax.) (1.5P)

g) Infant language: Write a function that adds syllables to an existing word (a character): Each vowel

should be replaced by an “ellu”, e.g. “alle” becomes “ellullellu”. You only need to consider lower case

input. (2P)
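
As a sketch of how a documented, vectorised function for this problem could look, the following pairs the documentation format from a) with one possible implementation. The function name fahrenheit_to_celsius is an assumption; the formula C = (F - 32) * 5 / 9 is the standard conversion. Because it uses only arithmetic operators, it accepts vectors without extra work.

```r
#' Temperature conversion from Fahrenheit to Celsius
#' Arguments:
#'   temp: temperature in Fahrenheit. A numeric vector.
#' Returns: a numeric vector indicating the temperature in Celsius.
fahrenheit_to_celsius <- function(temp) {
  # Arithmetic operators are vectorised, so scalars and vectors both work.
  (temp - 32) * 5 / 9
}

# Freezing and boiling point of water as a quick plausibility check.
fahrenheit_to_celsius(c(32, 212))
```

Functions that branch on their input (for example fines or tax brackets) usually need ifelse(), cut() or logical indexing instead of if () to stay vectorised.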


Problem 4 – Linear regression & simulation (23P)

A linear model is typically estimated by the least-squares method. Using least-squares the Normal equation

from which the estimator $\hat{\beta}$ for the parameter vector $\beta$ can be derived looks as follows:

$$X^\top X \beta = X^\top y$$
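
To make the Normal equation concrete, here is a minimal sketch on a small simulated data set (all names and coefficient values are made up for illustration and unrelated to the exam data). Provided $X^\top X$ is invertible, solving for $\beta$ gives $\hat{\beta} = (X^\top X)^{-1} X^\top y$; in R it is numerically preferable to solve the linear system directly rather than to form the inverse.

```r
# Illustration only: toy data with invented coefficients (1 and 2).
set.seed(1)
x <- runif(50)
y_toy <- 1 + 2 * x + rnorm(50, sd = 0.1)

X_toy <- cbind(intercept = 1, x = x)   # design matrix with an intercept column

# Solve (X'X) beta = X'y without explicitly inverting X'X.
beta_hat <- solve(crossprod(X_toy), crossprod(X_toy, y_toy))

# lm() should give the same estimates up to numerical error.
cbind(normal_equation = drop(beta_hat), lm = coef(lm(y_toy ~ x)))
```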

You are interested in the association of children heights to other physical features. You observe the following

features:

- Bodyweight in kg

- Age in years

- The combined height of both biological parents in cm

- Sex (male / female)

We are going to work with simulated data in this problem. Note that the data is completely simulated and

hence there is no claim that the underlying relationships actually behave in the described manner. Make

yourself familiar with simulation functions in R such as runif() and sample().

a) Find a way to make sure that your results are reproducible. Which function in R should always be used

before simulating data (as long as results are supposed to be reproducible)? We expect the associated

code which makes your code reproducible as your answer below. Note: Only make use of the base

package. (0.5P)

b) We assume that there is (little or) no correlation between age, sex, and the combined parents’ heights.

Simulate 250 observations for each of the three covariates. Store them in separate vectors with proper

names. Make use of meaningful rounding. Assume that age is uniformly distributed in our sample. The

minimum age is – obviously – 0 and children are excluded on the day of their 15th birthday. Sex is

Bernoulli distributed with a 0.5 probability of being male. The combined height of both biological

parents is the sum of two independent normally distributed variables. The male part can be described

by a mean of 177 cm and a standard deviation of 10 cm. The female part can be described by a mean

of 164 cm and a standard deviation of 9 cm. We expect the code which simulates all three vectors as

your answer below. Make sure to write your code for the vectors so that it is easy to interpret and

understandable. (1.5P)

c) Summarise your data very briefly. Make use of one meaningful descriptive function for each vector.

(1P)

d) Of course, the bodyweight will likely depend – causally – on the child’s sex and the child’s age. There

will also be a correlation between the bodyweight of a child and its parents’ combined height. However,

this will most likely be because both are equally connected to the child’s height. Thus, we neglect this

correlation for now. Simulate the vector weight assuming that bodyweight is normally distributed.

The mean depends on the sex and age. For boys the mean is described by 9 + 6 * log(age + 1) +

(age > 5) * (-4 + 2.5 * age) + 2 * (age > 11) * sqrt(max(age - 5, 0)). For girls the mean

is described by 6 + (age <= 5) * 5 * log(age + 1) + (age > 5) * (-3 + 3 * age) + (age >=

11) * sqrt(age). The variance for both sexes is described by 2 * log(0.2 * m) where m is the

respective mean for each sex. Explain the three functions mathematically and intuitively (i) and

simulate the weight for all 250 observations (ii). (2P)

e) The height of the children can be deterministically described by a linear function. This function has

an intercept of -80 cm. For each additional cm of combined parents’ height, a child is on average 0.4

cm taller. Boys are on average 4 cm taller and children grow on average 9 cm per year. Next to this

deterministic function, the child’s height is subject to an additive random error which is normally

distributed with zero mean and a standard deviation of 7 cm. Simulate the variable height. (1P)


If you did not succeed in simulating, you can use the RDS objects height, weight, sex, and age supplied

on the Moodle page instead for the subsequent problems. You can also use these vectors for the previous

problems (e.g. e)) if you only failed in simulating some of the vectors.

f) Compute the correlation between height and weight. Also, compute a suitable measure of the

correlation between height and sex. (This might require some online search! Keyword: Point-biserial

correlation.) (4P)

g) Construct a data.frame from all vectors. Name it df. Explain the following output. (1P)

is.list(df)

## [1] TRUE

h) Construct a design matrix X from df. The design matrix is supposed to be used to model y, the

height. Name the matrix X. (1P)

i) Compute the right-hand side of the Normal Equation. Find a mathematical expression for β when

solving for it. Also compute β (or $\hat{\beta}$ to be precise). Explain how β from this question is related to

question e). (3P)

j) Fit a linear model using df to explain height. Report your results and compare them to i). (1P)

k) Re-simulate your data set but now with 1000 observations. Again, fit a linear model and compare the

results to j). Hint: You are allowed to solve this problem together with l). (2P)

l) Use your work from k) to write a function that simulates this specific data situation. Your function

should have two inputs: n, the number of observations to be simulated, and seed, a seed for random

number generation. Your function should return a data frame with all simulated columns. Note: If

you did not succeed in simulating the data previously, we also accept pseudo-code. (2P)

m) A linear model typically models the mean of a normally distributed random variable ($\mu_i$). In this case,

we model a link which is equal to the resulting mean (identity / linear link). However, imagine that a

linear model is used to model count data or the mean of a random variable from a Poisson distribution.

The mean of this variable is equal to its rate parameter ($\lambda_i$). We can simulate data following a Poisson

distribution like before in this problem. Here we use a log-link instead of an identity link. Explain why

the code below throws a warning and produces wrong results. Correct it so that it works. To check

whether you did the right thing, you can compare your simulated values with the histogram shown

below. Note: This problem can be solved independently from the previous ones. (3P)

set.seed(11)

beta0 <- 1

beta1 <- 2

beta2 <- -0.25

x1 <- runif(100, -1, 2)

x2 <- runif(100, 3, 7)

link <- beta0 + beta1 * x1 + beta2 * x2

y <- rpois(100, link)

## Warning in rpois(100, link): NAs produced


[Figure: Histogram of y (x-axis: y, y-axis: Frequency)]
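
One way to see what goes wrong in m): under a log link the linear predictor is the logarithm of the rate, so it may be negative, whereas rpois() needs a non-negative rate and returns NA (with a warning) otherwise. The sketch below reuses the names from the code above and adds the hypothetical names lambda and y_fixed; it maps the predictor back to the rate with exp(). Whether it reproduces the histogram exactly has not been checked here.

```r
set.seed(11)
beta0 <- 1
beta1 <- 2
beta2 <- -0.25
x1 <- runif(100, -1, 2)
x2 <- runif(100, 3, 7)
link <- beta0 + beta1 * x1 + beta2 * x2   # linear predictor, i.e. log(lambda)

any(link < 0)         # negative values are not valid Poisson rates

lambda <- exp(link)   # invert the log link: the rate is always positive
y_fixed <- rpois(100, lambda)
anyNA(y_fixed)        # no missing values once the rate is valid
```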

Problem 5 – String wrangling (6P)

Regular expressions are often used in programming languages when working with text data. In R there are

two dominant ways to work with regex: The base and the stringr package. You are allowed to use both in

this problem. Before you start working on this problem, make yourself familiar with regular expressions.

On the Moodle page, there is a newspaper article (text.RDS) which we aim to prepare for use in a

machine learning algorithm. So far, the article is just as it was scraped from the web page.
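
As a warm-up for the kind of pattern matching used below, here is a minimal base R sketch on an invented sentence (the string and the chosen character class are illustrative only, not the exam text or its solution). It relies on gsub(), tolower() and the POSIX class [:digit:].

```r
# Illustration only: a made-up string, not the article from Moodle.
raw <- "Call me at 555-0100, or at 555-0199!"

no_digits <- gsub("[[:digit:]]", "", raw)   # drop every digit
squeezed  <- gsub(" +", " ", no_digits)     # collapse repeated spaces, if any
tolower(squeezed)
```

In stringr, str_replace_all() and str_to_lower() play the roles of gsub() and tolower().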

a) The algorithm cannot deal with special characters except “.”, “,” and " ". Remove all other special

characters. Additionally, the text should be converted to lower case only. (3P)

Your solution should look like this:

substr(text, 1, 50)

## [1] "der von der griechischen justiz verfolgte frühere "

b) Change the German Umlaute (e.g. ?) into the international equivalent (e.g. ?). (1P)

c) On Moodle, we also provide a vector (top50.Rds) with the 50 most frequently used words in German.

Remove all words occurring in this vector from the text. Only remove the words reported in the vector

and no variants. (As you may have seen the vector has more than 50 entries. We already added some

variations of the most frequent German words.) (2P)


Problem 6 – Gradient descent (11P)

You are hiking and you want to start your descent. You are hiking with your stats friends and one of them

mentions that descending a mountain can be seen – simplified – as if you follow the slope of a quadratic

function from your point into the valley.

Using base R/graphics one can visualise the description as follows:

grid <- seq(-4, 4, length.out = 100L)

altitude <- 0.3 * grid ^ 2

plot(grid, altitude, type = "l", xlim = c(-5, 5), ylim = c(0, 5),

xlab = "Distance in Km", ylab = "Altitude in Km")

points(grid[95], altitude[95], col = "blue")

[Figure: Altitude in Km plotted against Distance in Km, with the hiker's position marked as a blue dot]

You as a hiker are the blue dot. A smart strategy would now be to go down the hill by its slope step by step.

The local slope changes. The local slope of a function is described by its derivative.
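
To illustrate the idea of following the local slope, here is a minimal sketch of a single downhill step on a different, made-up curve f(x) = (x - 1)^2. The function, the start point and the step size are assumptions chosen for illustration, not the hill from the plot. Moving against the sign of the derivative lowers the function value as long as the step is small enough.

```r
# Illustration only: a toy curve, not the exam's altitude function.
f <- function(x) (x - 1)^2         # toy "hill" with its valley at x = 1
df_dx <- function(x) 2 * (x - 1)   # analytical derivative of f

x_old <- 3
step <- 0.1                            # assumed step size
x_new <- x_old - step * df_dx(x_old)   # move downhill, against the slope

c(x_old = x_old, x_new = x_new, f_old = f(x_old), f_new = f(x_new))
```

Repeating this update inside a loop, and recomputing the derivative at every new position, is what turns a single step into a descent.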

a) Compute the local slope for all values of grid. Visualise the local slope using graphics. (1P)

b) Define the quadratic function as a function in R itself. It should have one input only. (1P)

c) Also, define the (analytical) derivative as a function in R. (1P)

d) Go down the hill by one step. Your step length is exactly one meter. The scale of grid is in kilometers,

though. Report the new location. Plot your new position in the previous graph. Hint: Use your

functions from b) and c). (2P)

e) This is a very slow descent. You decide to increase your step length to 1.2 m. However, this is still very

slow coding (if we have to update every step manually). You can also simulate your descent using a

for-loop. Go down the hill 180 steps with a step length of 1.2 m. Where are you now? Plot your new

position, too. (1P)

f) You are close to the valley! You feel very confident in your descending strategy. Thus, you do 1000

steps in the same direction. This means you do not adapt your slope at every step anymore. Where are

you now? Plot your new position, too. Explain what happened. (1P)

g) As your last idea failed, you go back to the previous descending strategy and do 4500 steps adapting

for your slope again at every single one. Again, plot your new location. Now it seems that you are in

the valley. (Right?) (1P)

h) Rewrite the descending strategy as a function using either the repeat function or a do-while loop. As

seen in g) you will not be able to arrive in y == 0 (exactly). Make sure that your loop stops when you

are in the valley with some tolerance. Name the function descent_hill. Your tolerance should refer

to the slope – a slope sufficiently close to zero indicates that you are in the valley (as you are basically

not descending anymore). We suggest a tolerance of 0.01. Your function should return the location of

the valley (x and y coordinates). Your function should accept the following inputs:

- Start point.

- Step size.

- The function defining the hill (here the quadratic function).

- Slope / derivative of this function.

- Tolerance in the valley. (3P)

Problem 7 – Outlier detection in the linear model (6P)

Now, we use a variation of the code from the end of Problem 4 to model a normally distributed dependent

variable:

set.seed(11)

beta0 <- 1

beta1 <- 2

beta2 <- -0.25

x1 <- runif(100, -1, 2)

x2 <- runif(100, 3, 7)

y <- beta0 + beta1 * x1 + beta2 * x2 + rnorm(100, 0, 1)

hist(y)

[Figure: Histogram of y (x-axis: y, y-axis: Frequency)]

a) Create a data.frame with all independent and dependent variables. Fit a linear model using all

covariates explaining y. (0.5P)

We are interested in finding the most influential or outlying observations. There are many different ways to

do this. Today we focus on two specific ones.

b) Fit 100 linear models leaving out one distinct observation in each. For each model compute the relative

change in the adjusted R-squared compared to the baseline model estimated in a). Refit the linear

model leaving out the five observations which decrease adjusted R-squared the most. (3P)

c) Alternatively, identify the five observations (out of all 100 observations) with the largest residuals. Refit

the model leaving these five observations out. (1.5P)

d) Argue which procedure worked better in the underlying case. (1P)
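
A leave-one-out loop like the one in b) can be sketched as follows on R's built-in cars data (columns speed and dist). The data set, the variable names and the choice of sapply() are assumptions for illustration, not the exam's data or a reference solution.

```r
# Illustration only: built-in cars data instead of the simulated exam data.
base_fit <- lm(dist ~ speed, data = cars)
base_adjr2 <- summary(base_fit)$adj.r.squared

# Refit the model n times, each time dropping exactly one row.
loo_adjr2 <- sapply(seq_len(nrow(cars)), function(i) {
  summary(lm(dist ~ speed, data = cars[-i, ]))$adj.r.squared
})

# Relative change in adjusted R-squared compared with the full model.
rel_change <- (loo_adjr2 - base_adjr2) / base_adjr2

# Row indices whose removal lowers adjusted R-squared the most.
head(order(rel_change), 5)
```

The residual-based alternative in c) needs no loop: order(abs(residuals(base_fit)), decreasing = TRUE) ranks the observations of a single fit.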
