Assignment #2 STA437H1S/2005H1S
due Friday February 17, 2023
Instructions: Solutions to problems 1–3 are to be submitted on Quercus (PDF files only).
1. Andrews curves (conceived by the University of Toronto’s own David Andrews) represent an
interesting approach to multivariate visualization. The idea is to represent each multivariate
observation $(x_{i1}, \dots, x_{ip})$ (which is possibly normalized) by a sinusoidal function on [0, 1]:
$$g_i(t) = \frac{x_{i1}}{\sqrt{2}} + x_{i2}\sin(2\pi t) + x_{i3}\cos(2\pi t) + x_{i4}\sin(4\pi t) + x_{i5}\cos(4\pi t) + \cdots$$
Observations that are similar will have similar Andrews curves while outlying observations
will often have curves that are distinctively different.
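To make the definition concrete, here is a minimal sketch that evaluates the curve for a single observation and plots a few simulated curves. The helper name `andrews_curve` and the simulated data are assumptions made for illustration only; this is not the `andrews` function supplied in andrews.txt.

```r
# Hypothetical helper (not the `andrews` function from andrews.txt):
# evaluates the Andrews curve of one observation x at the points t.
andrews_curve <- function(x, t) {
  p <- length(x)
  g <- rep(x[1] / sqrt(2), length(t))     # constant term x_1 / sqrt(2)
  if (p >= 2) {
    for (k in 2:p) {
      freq <- k %/% 2                     # k = 2,3 -> 1; k = 4,5 -> 2; ...
      basis <- if (k %% 2 == 0) sin(2 * pi * freq * t) else cos(2 * pi * freq * t)
      g <- g + x[k] * basis
    }
  }
  g
}

tgrid <- seq(0, 1, length.out = 201)
set.seed(1)
x <- matrix(rnorm(20 * 5), ncol = 5)      # 20 simulated 5-variate observations
x[1, ] <- x[1, ] + 4                      # shift one observation so it is outlying

# One column per observation: a 201 x 20 matrix of curve values.
curves <- apply(x, 1, andrews_curve, t = tgrid)
matplot(tgrid, curves, type = "l", lty = 1,
        col = c("red", rep("grey", 19)), xlab = "t", ylab = "g(t)")
```

In this simulated example the shifted observation is drawn in red so its curve can be compared with the others.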
On Quercus, there is a file andrews.txt, which contains a function andrews that computes
Andrews curves for a data matrix whose columns are variables and rows are observations;
for example,
> source("andrews.txt") # read the function into R
> x <- cbind(rnorm(100),rnorm(100),rnorm(100),rnorm(100),rnorm(100))
> r <- andrews(x,scale=T) # scales columns to have mean 0 and variance 1
The file testdata.txt contains 100 − k observations from a 10-variate normal distribution
and k outliers generated from another distribution (where k ≤ 15).
(a) Look at the data using Andrews curves. How many clear outliers do there seem to be?
(b) Using the information from the Andrews curves as well as pairwise scatterplots, principal
components, etc., give an estimate of how many outliers are in the data.
2. (a) If $\{g_i(t)\}$ are the Andrews curves defined in question 1, show that
$$2\int_0^1 [g_i(t) - g_j(t)]^2 \, dt = \sum_{k=1}^{p} (x_{ik} - x_{jk})^2.$$
(b) If $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, what is the Andrews curve of $\bar{x}$?
(c) Suppose that $x_k$ lies on a line between $x_i$ and $x_j$, that is, $x_k = \lambda x_i + (1 - \lambda) x_j$ for some
$0 < \lambda < 1$. What can you say about the Andrews curve of $x_k$ relative to those of $x_i$ and $x_j$?
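Before attempting the proof in part (a), it may help to check the identity numerically. The sketch below does this for two simulated observations using `integrate`; the helper `andrews_curve` is a hypothetical stand-in written for this check (it is not the `andrews` function from andrews.txt), and a numerical check is of course not a substitute for the derivation.

```r
# Hypothetical helper: Andrews curve of one observation x evaluated at t.
andrews_curve <- function(x, t) {
  g <- rep(x[1] / sqrt(2), length(t))
  if (length(x) >= 2) {
    for (k in 2:length(x)) {
      freq <- k %/% 2
      basis <- if (k %% 2 == 0) sin(2 * pi * freq * t) else cos(2 * pi * freq * t)
      g <- g + x[k] * basis
    }
  }
  g
}

set.seed(2)
xi <- rnorm(6)                            # two simulated 6-variate observations
xj <- rnorm(6)

# Left-hand side: twice the integrated squared difference of the two curves.
sq_diff <- function(t) (andrews_curve(xi, t) - andrews_curve(xj, t))^2
lhs <- 2 * integrate(sq_diff, lower = 0, upper = 1)$value

# Right-hand side: squared Euclidean distance between the observations.
rhs <- sum((xi - xj)^2)

print(c(lhs = lhs, rhs = rhs))
print(all.equal(lhs, rhs, tolerance = 1e-6))
```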
3. In Assignment #1, you looked at two-dimensional scatterplots of data on two species of
rock crabs; here, you will do a principal components analysis of these data.
As before, the data are in a file crabs.txt on Quercus; the columns of the file are species (B
or O), sex (M or F), index (1-50 within each species-sex combination), width of the frontal
lip (FL), the rear width of the shell (RW), length along the midline of the shell (CL), the
maximum width of the shell (CW), and the body depth (BD).
The data can be read into R using the following code:
> x <- scan("crabs.txt",skip=1,what=list("c","c",0,0,0,0,0,0))
> colour1 <- ifelse(x[[1]]=="B","blue","orange") # species colours
> colour2 <- ifelse(x[[2]]=="M","black","red") # sex colours
> sex <- x[[2]]
> FL <- x[[4]]
> RW <- x[[5]]
> CL <- x[[6]]
> CW <- x[[7]]
> BD <- x[[8]]
(a) Using the correlation matrix, do a principal component analysis of the 5 variables.
> r <- princomp(~FL+RW+CL+CW+BD,cor=T)
> summary(r,loadings=T)
Give an interpretation of the first two principal components based on their loadings.
(b) Look at pairwise scatterplots of the 5 principal components using colour1 to distinguish
the two species:
> pairs(r$scores,col=colour1)
Which pairs of principal components seem to separate the two species?
(c) Now look at pairwise scatterplots of the 5 principal components using colour2 to
distinguish the two sexes:
> pairs(r$scores,col=colour2)
Which pairs of principal components seem to separate the two sexes?
(d) Suppose you are given the following measurements for the 5 variables: FL = 18.7,
RW = 15.0, CL = 35.0, CW = 40.3, BD = 16.6. What is your prediction of the species and
sex of this crab?
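One possible way to approach part (d) is to compute the principal component scores of the new observation and see where it falls in the scatterplots from parts (b) and (c). The sketch below illustrates the mechanics with `predict` on a `princomp` fit. The data frame `sim` is simulated stand-in data (an assumption, since crabs.txt is not reproduced here), so the resulting scores say nothing about the real crabs; with the actual data, the fitted object `r` from part (a) would take the place of `fit`.

```r
# Simulated stand-in data with the same variable names as the crab data.
set.seed(3)
n <- 200
size <- rnorm(n, mean = 30, sd = 6)       # simulated common "size" factor
sim <- data.frame(
  FL = 0.50 * size + rnorm(n),
  RW = 0.40 * size + rnorm(n),
  CL = 1.00 * size + rnorm(n),
  CW = 1.15 * size + rnorm(n),
  BD = 0.45 * size + rnorm(n)
)

# Correlation-based PCA, as in part (a).
fit <- princomp(~ FL + RW + CL + CW + BD, data = sim, cor = TRUE)

# The measurements given in part (d).
new_crab <- data.frame(FL = 18.7, RW = 15.0, CL = 35.0, CW = 40.3, BD = 16.6)

# predict() centres and scales the new observation with the fitted
# means and standard deviations, then multiplies by the loadings.
new_scores <- predict(fit, newdata = new_crab)
print(new_scores)

# Overlay the new observation on the first two components.
plot(fit$scores[, 1:2])
points(new_scores[, 1], new_scores[, 2], pch = 19, col = "red")
```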