FINM 331: MULTIVARIATE DATA ANALYSIS
FALL 2018
PROBLEM SET 3
The required files for all problems can be found in:
http://www.stat.uchicago.edu/~lekheng/courses/331/hw3/
The file name indicates which problem the file is for (p1*.txt for Problem 1, etc.). You are welcome
to use any programming language or software packages you like.
1. (Factor Analysis) This is the same air quality data set we saw in the previous problem set but
this time we will only take four variables X1, X2, X5 and X6 by leaving out CO, NO, and HC
variables.
(a) Obtain the principal component solution to the factor model X = μ+LF+ε with number
of factors m = 1 and m = 2 using:
(i) the sample covariance matrix;
(ii) the sample correlation matrix.
In other words, you should find the matrix of factor loadings $L \in \mathbb{R}^{p \times m}$, the specific variances
$\psi_1, \dots, \psi_p \in \mathbb{R}$, and write down the proportions of variability (in percentages) due to the
factors.
(b) Find the angle between the first factor loading in (i) and the first factor loading in (ii).
(c) For the m = 2 case, compare the factor loadings obtained in (i) with those in (ii) using
orthogonal Procrustes analysis.
(d) Comment on your results.
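A minimal R sketch of the principal component solution and of the orthogonal Procrustes comparison in Problem 1 might look as follows. The data matrix `Z` is simulated and purely hypothetical (it is not the air quality data), and the helper names `pc_factor` and `procrustes_rotation` are illustrative, not functions from any package.

```r
# Principal component solution of the factor model for a covariance or
# correlation matrix S with m factors (hypothetical helper).
pc_factor <- function(S, m) {
  e <- eigen(S, symmetric = TRUE)                 # eigenvalues in decreasing order
  lam <- e$values[1:m]
  L <- e$vectors[, 1:m, drop = FALSE] %*% diag(sqrt(lam), nrow = m)  # loadings, p x m
  psi <- diag(S) - rowSums(L^2)                   # specific variances
  list(loadings = L, psi = psi, proportion = lam / sum(diag(S)))
}

# Orthogonal Q minimizing ||A - B Q||_F (orthogonal Procrustes, via the SVD of B'A)
procrustes_rotation <- function(A, B) {
  s <- svd(crossprod(B, A))
  s$u %*% t(s$v)
}

# Simulated stand-in data: 200 observations on 4 variables (an assumption)
set.seed(1)
mix <- matrix(c(2, 1, 0, 0,  0, 1, 1, 0,  0, 0, 1, 1,  0, 0, 0, 3), 4, 4)
Z <- matrix(rnorm(200 * 4), 200, 4) %*% mix

fit_cov <- pc_factor(cov(Z), m = 2)
fit_cor <- pc_factor(cor(Z), m = 2)

# Angle between the first loading vectors; eigenvector signs are arbitrary, hence abs()
l_cov <- fit_cov$loadings[, 1]
l_cor <- fit_cor$loadings[, 1]
cosang <- abs(sum(l_cov * l_cor)) / sqrt(sum(l_cov^2) * sum(l_cor^2))
angle_deg <- acos(min(1, cosang)) * 180 / pi

# Procrustes comparison of the two m = 2 loading matrices
Q <- procrustes_rotation(fit_cov$loadings, fit_cor$loadings)
resid_norm <- norm(fit_cov$loadings - fit_cor$loadings %*% Q, type = "F")

round(100 * fit_cov$proportion, 1)   # percent of total variance per factor
```

Taking the absolute value of the inner product is a design choice of this sketch: it reports the angle between the lines spanned by the loading vectors, which does not depend on the arbitrary sign returned by `eigen`.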
2. (Population Canonical Correlation Analysis) The 2 × 1 random vectors X and Y have joint
mean vector and joint covariance matrix
(a) Calculate the canonical correlation ρ1 (the largest), ρ2 (the second largest).
(b) Find the canonical correlation variables (U1, V1) and (U2, V2) corresponding to ρ1 and ρ2.
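(c) Let $U = [U_1, U_2]^T$ and $V = [V_1, V_2]^T$. Evaluate
$$E\begin{bmatrix} U \\ V \end{bmatrix} \quad \text{and} \quad \operatorname{Cov}\begin{bmatrix} U \\ V \end{bmatrix} = \begin{bmatrix} \Sigma_U & \Sigma_{UV} \\ \Sigma_{VU} & \Sigma_V \end{bmatrix}.$$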
(d) Comment on the correlation structure between and within U and V .
Date: November 5, 2018 (Version 1.0); due: December 3, 2018.
3. (Sample canonical correlation analysis) The data set for this problem is obtained by taking four
different measures of stiffness (shock, vibrate, static1, static2) for each of n = 30 boards.
The first measurement involves sending a shock wave down the board, the second measurement
is determined while vibrating the board, and the last two measurements are obtained from static
tests. The squared distances $d_j^2 = (x_j - \bar{x})^T S^{-1} (x_j - \bar{x})$ are also included as the last column in
the data matrix.
Let $X = [X_1, X_2]^T$ be the random vector representing the dynamic measures of stiffness, and let
$Y = [Y_1, Y_2]^T$ be the random vector representing the static measures of stiffness. Load the data
matrix p3.txt (R command: stiff = read.table("p3.txt")).
(a) Perform a canonical correlation analysis of these data by computing the singular value decomposition
of an appropriate matrix formed from the sample covariance matrices. You may
compare your result with that obtained by your software (if you use R, it is cancor(X, Y), where X and Y are the data matrices of the dynamic and static measures).
(b) Write the first canonical correlation variables U1 and V1 as linear combinations of shock,
vibrate, static1, static2.
(c) Produce two scatterplots of the data: one in the coordinate plane of the first canonical
correlation vectors, one in the plane of the second canonical correlation vectors.
(d) Based on the two plots and the values of the canonical correlations {ρ1, ρ2}, comment on
the correlation structure captured by each canonical pair.
(e) Repeat (a) with sample correlation matrices in place of sample covariance matrices and
verify that the pairs of canonical vectors obtained are related via scaling by the sample
standard deviation matrix.
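The SVD route in Problem 3(a) can be sketched in R as below. This is only an illustration under stated assumptions: the "appropriate matrix" is taken to be $S_X^{-1/2} S_{XY} S_Y^{-1/2}$, the data are simulated stand-ins for the stiffness measurements, and `inv_sqrt` and `cca_svd` are hypothetical helper names.

```r
# Inverse square root of a symmetric positive definite matrix (hypothetical helper)
inv_sqrt <- function(M) {
  e <- eigen(M, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values), nrow = nrow(M)) %*% t(e$vectors)
}

# Sample CCA through the SVD of Sxx^{-1/2} Sxy Syy^{-1/2}
cca_svd <- function(X, Y) {
  Sxx_ih <- inv_sqrt(cov(X))
  Syy_ih <- inv_sqrt(cov(Y))
  s <- svd(Sxx_ih %*% cov(X, Y) %*% Syy_ih)
  list(rho = s$d,              # sample canonical correlations
       a = Sxx_ih %*% s$u,     # column k holds the coefficients of U_k = a_k' X
       b = Syy_ih %*% s$v)     # column k holds the coefficients of V_k = b_k' Y
}

# Simulated stand-in for two dynamic and two static measures on 30 boards
set.seed(2)
n <- 30
common <- rnorm(n)
X <- cbind(common + rnorm(n, sd = 0.5), common + rnorm(n, sd = 0.8))
Y <- cbind(common + rnorm(n, sd = 0.6), rnorm(n))

fit <- cca_svd(X, Y)
ref <- cancor(X, Y)            # built-in reference for the canonical correlations

# Scores on the first canonical pair, e.g. for the scatterplot in part (c)
U1 <- drop(X %*% fit$a[, 1])
V1 <- drop(Y %*% fit$b[, 1])
```

With this scaling the scores `U1` and `V1` have unit sample variance. The coefficient vectors returned by `cancor` may be scaled differently, so only the correlations are compared here.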
4. (Canonical correlation analysis for angular measurements) Some observations are in the form
of angles. Here we will see how to deal with such data.
(a) Consider a bivariate random vector $X = [X_1, X_2]^T$ with a uniform distribution inside a
circle of radius 1 centered at some unknown point
$$\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} \in \mathbb{R}^2.$$
Then $E(X) = \mu$. A sample of n = 4 is taken. The observed values are
Compute the sample mean $\bar{x}$ and the sample covariance matrix. Is $\bar{x}$ a good estimator of $\mu$? Why
or why not?
(b) We consider an angular valued random variable θ; note that this can always be represented
as a random vector $Y = [\cos\theta, \sin\theta]^T$ that takes values on the circle. Show that
$$b^T Y = \sqrt{b_1^2 + b_2^2}\, \cos(\theta - \beta),$$
where $b_1/\sqrt{b_1^2 + b_2^2} = \cos\beta$ and $b_2/\sqrt{b_1^2 + b_2^2} = \sin\beta$. Here $b = [b_1, b_2]^T \in \mathbb{R}^2$ is a constant
vector.
(c) Let X be a random vector with a single component, i.e., just a random variable. Here
X is not angular valued. Show that the population canonical correlation is
$$\rho_1 = \max_{\beta} \operatorname{Corr}\bigl(X, \cos(\theta - \beta)\bigr)$$
and that selecting the population canonical correlation variable $V_1$ amounts to selecting a
new 'origin' or 'baseline' β for the angle θ.
(d) Let X be a random variable representing ozone (O3) levels and θ be an angular random variable
representing wind direction measured from the north. We make 19 observations to obtain
the sample correlation matrix
$$R = \begin{bmatrix} R_X & R_{X\theta} \\ R_{\theta X} & R_\theta \end{bmatrix}$$
with entries

          O3       cos θ    sin θ
O3        1.000    0.166    0.694
cos θ     0.166    1.000   -0.051
sin θ     0.694   -0.051    1.000

Find the sample canonical correlation $\hat{\rho}_1$ and the sample canonical correlation variable $\hat{V}_1$
representing the new origin β.
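(e) Let φ be another angular valued random variable and let $X = [\cos\varphi, \sin\varphi]^T$. Then similar
to (b), we get
$$a^T X = \sqrt{a_1^2 + a_2^2}\, \cos(\varphi - \alpha).$$
Now show that
$$\rho_1 = \max_{\alpha, \beta} \operatorname{Corr}\bigl(\cos(\varphi - \alpha), \cos(\theta - \beta)\bigr).$$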
(f) Let φ and θ be two angular random variables representing wind directions at 6:00 a.m. and
at 12:00 p.m. We make 21 measurements of X and Y (related to φ and θ as in (b) and
(d)). We obtain the sample correlation matrix
$$R = \begin{bmatrix} R_X & R_{XY} \\ R_{YX} & R_Y \end{bmatrix}$$
with entries

          cos φ    sin φ    cos θ    sin θ
cos φ     1.000    0.291    0.440    0.372
sin φ     0.291    1.000    0.205    0.243
cos θ     0.440    0.205    1.000    0.181
sin θ     0.372    0.243    0.181    1.000

Find the sample canonical correlation $\hat{\rho}_1$ and sample canonical correlation variables $\hat{U}_1$ and
$\hat{V}_1$.
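When only a partitioned correlation matrix is available, as in Problem 4(d) and 4(f), the first canonical pair can be read off an SVD. The R sketch below uses a made-up correlation matrix `R_hyp` (not the ozone or wind data), and `inv_sqrt` and `cca_from_corr` are hypothetical helper names. The baseline angle is computed from the coefficients of the standardized variables, which is an assumption of this sketch.

```r
# Inverse square root of a symmetric positive definite matrix (hypothetical helper)
inv_sqrt <- function(M) {
  e <- eigen(M, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values), nrow = nrow(M)) %*% t(e$vectors)
}

# First canonical correlation from a partitioned correlation matrix R,
# where the first p variables form X and the remaining ones form Y.
cca_from_corr <- function(R, p) {
  ix <- seq_len(p)
  iy <- (p + 1):nrow(R)
  Rx_ih <- inv_sqrt(R[ix, ix, drop = FALSE])
  Ry_ih <- inv_sqrt(R[iy, iy, drop = FALSE])
  s <- svd(Rx_ih %*% R[ix, iy, drop = FALSE] %*% Ry_ih)
  a <- drop(Rx_ih %*% s$u[, 1])
  b <- drop(Ry_ih %*% s$v[, 1])
  if (a[1] < 0) { a <- -a; b <- -b }   # fix the joint sign so that a[1] > 0
  list(rho1 = s$d[1], a1 = a, b1 = b)
}

# Made-up correlation matrix for (X, cos(theta), sin(theta))
R_hyp <- matrix(c(1.0, 0.3, 0.4,
                  0.3, 1.0, 0.0,
                  0.4, 0.0, 1.0), 3, 3, byrow = TRUE)
fit <- cca_from_corr(R_hyp, p = 1)

# Baseline angle beta with cos(beta), sin(beta) proportional to b1
beta <- atan2(fit$b1[2], fit$b1[1])
```

For a four-variable matrix partitioned as in part (f), the same helper would be called with `p = 2`.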
5. (Proofs behind CCA) Let $A \in \mathbb{R}^{p \times p}$ and $B \in \mathbb{R}^{q \times q}$ be symmetric positive definite matrices and
$C \in \mathbb{R}^{p \times q}$. Let
$$G = A^{-1/2} C B^{-1/2} \in \mathbb{R}^{p \times q}.$$
We shall write $\lambda_{\max}(M)$ for the largest eigenvalue of a matrix M.
(a) Suppose p = q. Show that the eigenvalues of $B^{-1}A$, $B^{-1/2}AB^{-1/2}$, and $AB^{-1}$ are all equal.
What are the relations between the eigenvectors?
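(b) Suppose p = q. Show that
$$\max_{x \in \mathbb{R}^p} \{x^T A x : x^T B x = 1\} = \max_{y \in \mathbb{R}^p} \{y^T B^{-1/2} A B^{-1/2} y : y^T y = 1\}.$$
By using Problem 7 in Homework 2, deduce that
$$\max_{x \in \mathbb{R}^p} \{x^T A x : x^T B x = 1\} = \lambda_{\max}(B^{-1/2} A B^{-1/2}),$$
$$\operatorname*{argmax}_{x \in \mathbb{R}^p} \{x^T A x : x^T B x = 1\} = q_{\max},$$
where $q_{\max} \in \mathbb{R}^p$ is the eigenvector of $B^{-1}A$ corresponding to the largest eigenvalue.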
(c) Show that if we fix $x \in \mathbb{R}^p$ and just maximize over all $y \in \mathbb{R}^q$, then
$$\max_{y \in \mathbb{R}^q} \{(x^T C y)^2 : y^T B y = 1\} = \max_{y \in \mathbb{R}^q} \{y^T [C^T x x^T C] y : y^T B y = 1\}$$
and deduce from (a) and (b) that
$$\max_{y \in \mathbb{R}^q} \{(x^T C y)^2 : y^T B y = 1\} = \lambda_{\max}(B^{-1} C^T x x^T C).$$
Show that the largest eigenvalue of a rank-1 matrix $ab^T$ is $b^T a$ and deduce that
$$\max_{y \in \mathbb{R}^q} \{(x^T C y)^2 : y^T B y = 1\} = x^T C B^{-1} C^T x.$$
(d) Using (a), (c), and Problem 7 in Homework 2, show that
$$\max_{x \in \mathbb{R}^p,\, y \in \mathbb{R}^q} \{(x^T C y)^2 : x^T A x = 1,\ y^T B y = 1\} = \lambda_{\max}(GG^T).$$
(e) Let $\sigma_1, \dots, \sigma_p \in \mathbb{R}$, $u_1, \dots, u_p \in \mathbb{R}^p$, $v_1, \dots, v_p \in \mathbb{R}^q$ be the singular values and left/right
singular vectors of G. By Problem 7 in Homework 2, show that
$$\max_{x \in \mathbb{R}^p} \{x^T G G^T x : x^T x = 1,\ u_i^T x = 0,\ i = 1, \dots, k-1\} = \sigma_k^2,$$
$$\operatorname*{argmax}_{x \in \mathbb{R}^p} \{x^T G G^T x : x^T x = 1,\ u_i^T x = 0,\ i = 1, \dots, k-1\} = u_k,$$
for $k = 1, \dots, p$. Hence deduce that
$$\max_{x \in \mathbb{R}^p,\, y \in \mathbb{R}^q} \{x^T C y : x^T A x = 1,\ y^T B y = 1,\ u_i^T A^{1/2} x = 0,\ i = 1, \dots, k-1\} = \sigma_k,$$
$$\operatorname*{argmax}_{x \in \mathbb{R}^p,\, y \in \mathbb{R}^q} \{x^T C y : x^T A x = 1,\ y^T B y = 1,\ u_i^T A^{1/2} x = 0,\ i = 1, \dots, k-1\} = (A^{-1/2} u_k,\ B^{-1/2} v_k),$$
for $k = 1, \dots, p$. Finally show that
$$\max_{x \in \mathbb{R}^p,\, y \in \mathbb{R}^q} \{x^T C y : x^T A x = 1,\ y^T B y = 1,\ u_i^T A^{1/2} x = 0,\ v_i^T B^{1/2} y = 0,\ i = 1, \dots, k-1\} = \sigma_k,$$
$$\operatorname*{argmax}_{x \in \mathbb{R}^p,\, y \in \mathbb{R}^q} \{x^T C y : x^T A x = 1,\ y^T B y = 1,\ u_i^T A^{1/2} x = 0,\ v_i^T B^{1/2} y = 0,\ i = 1, \dots, k-1\} = (A^{-1/2} u_k,\ B^{-1/2} v_k),$$
for $k = 1, \dots, p$.
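A numerical sanity check can help before attempting the proofs in Problem 5. The R sketch below draws random symmetric positive definite A and B and a random C (all hypothetical), forms $G = A^{-1/2} C B^{-1/2}$, and checks the k = 1 case of (d) and (e) as well as the eigenvalue claim in (a). It illustrates the statements and is not a proof; `rand_spd` and `inv_sqrt` are hypothetical helper names.

```r
set.seed(3)
p <- 3; q <- 2

# Random symmetric positive definite matrix (hypothetical helper)
rand_spd <- function(k) crossprod(matrix(rnorm(k * k), k, k)) + diag(k)

# Inverse square root of a symmetric positive definite matrix
inv_sqrt <- function(M) {
  e <- eigen(M, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values), nrow = nrow(M)) %*% t(e$vectors)
}

A <- rand_spd(p); B <- rand_spd(q); C <- matrix(rnorm(p * q), p, q)
G <- inv_sqrt(A) %*% C %*% inv_sqrt(B)
s <- svd(G)

# Candidate maximizers built from the first singular vectors
x1 <- inv_sqrt(A) %*% s$u[, 1]
y1 <- inv_sqrt(B) %*% s$v[, 1]

constraints <- c(drop(t(x1) %*% A %*% x1), drop(t(y1) %*% B %*% y1))  # both should be 1
objective   <- drop(t(x1) %*% C %*% y1)                               # should equal s$d[1]
lam_max     <- max(eigen(G %*% t(G), symmetric = TRUE)$values)        # should equal s$d[1]^2

# Crude random search over feasible (x, y): no value should exceed s$d[1]
best_random <- max(replicate(2000, {
  x <- rnorm(p); x <- x / sqrt(drop(t(x) %*% A %*% x))
  y <- rnorm(q); y <- y / sqrt(drop(t(y) %*% B %*% y))
  drop(t(x) %*% C %*% y)
}))

# Part (a) with p = q: the three matrices share their eigenvalues
A2  <- rand_spd(q)
ev1 <- sort(Re(eigen(solve(B) %*% A2)$values))
ev2 <- sort(eigen(inv_sqrt(B) %*% A2 %*% inv_sqrt(B), symmetric = TRUE)$values)
ev3 <- sort(Re(eigen(A2 %*% solve(B))$values))
```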
6. (Linear discriminant analysis) The admissions committee of a business school used GPA and
GMAT scores to make admission decisions. The values for the variable admit = 1,2,3 correspond
to admission decisions of yes, no, waitlist. Load the data set p6.txt and label its columns; helpful R
commands:
gsbdata = read.table("p6.txt"); colnames(gsbdata)=c("GPA", "GMAT","admit");
(a) Calculate $\bar{x}_i$, $\bar{x}$ and $S_{\text{pool}}$.
(b) Calculate the sample within groups matrix W, its inverse $W^{-1}$, and the sample between
groups matrix B. Find the eigenvalues and eigenvectors of $W^{-1}B$. (R command for $A^{-1}$
is solve(A)).
(c) Use the linear discriminants derived from these eigenvectors to classify the two new observations
$x = [3.21, 497]^T$ and $x = [3.22, 497]^T$.
(d) Scatterplot the original data set on the plane of the first two discriminants, labeled by
admission decisions. Comment on the results in (c). Is this a good admission policy?
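The quantities in Problem 6 can be sketched in R on simulated data as below. Several things here are assumptions rather than the course's definitions: the data are made up (not p6.txt), the between groups matrix is taken as $B = \sum_i n_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^T$, the eigenvectors are scaled so that $a^T S_{\text{pool}}\, a = 1$, and a new point is allocated to the group whose mean is nearest in discriminant coordinates. The function name `classify` is hypothetical.

```r
set.seed(4)
g <- rep(1:3, each = 20)                                   # group labels
X <- cbind(GPA  = rnorm(60, mean = c(3.4, 2.5, 3.0)[g], sd = 0.2),
           GMAT = rnorm(60, mean = c(560, 450, 500)[g], sd = 40))

n_g   <- as.vector(table(g))
means <- rowsum(X, g) / n_g          # one row of group means per group
xbar  <- colMeans(X)

# Within groups matrix W, pooled covariance, and between groups matrix B
W <- Reduce(`+`, lapply(1:3, function(k)
  crossprod(sweep(X[g == k, , drop = FALSE], 2, means[k, ]))))
S_pool <- W / (nrow(X) - 3)
B <- Reduce(`+`, lapply(1:3, function(k)
  n_g[k] * tcrossprod(means[k, ] - xbar)))

e <- eigen(solve(W) %*% B)           # W^{-1} B is not symmetric in general
V <- Re(e$vectors)
V <- sweep(V, 2, sqrt(diag(t(V) %*% S_pool %*% V)), "/")   # scale so a' S_pool a = 1

# Allocate x0 to the group whose mean is nearest in discriminant coordinates
classify <- function(x0) {
  d2 <- apply(means, 1, function(m) sum((t(V) %*% (x0 - m))^2))
  unname(which.min(d2))
}

classify(c(3.3, 550))                # group index for a hypothetical applicant
scores <- X %*% V                    # coordinates for a discriminant scatterplot
```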
7. (Correspondence Analysis) A client of a law firm would like to visualize the number of large
class-action lawsuits each year across different industries from 2011 to the first half of 2017. The
correspondence analysis provides a means of displaying or summarizing a set of categorical data
in two-dimensional graphical form. The data on class-action lawsuits are from annual reports
of Stanford Law School’s Securities Class Action Clearinghouse. To load the data in R, you can
use the following command:
CALaw = read.csv("/classaction lawsuit.csv",header=TRUE)
Notation: Denote X as a data matrix of the number of class action lawsuits for each industry and year.
$x_{i\cdot}$ denotes the row total (summing across all years for each industry). $x_{\cdot j}$ denotes the column
total (summing across all industries for each year). $x_{\cdot\cdot}$ denotes the grand total. Define
$D_r = \operatorname{diag}(x_{1\cdot}, \dots, x_{n\cdot})$ and $D_c = \operatorname{diag}(x_{\cdot 1}, \dots, x_{\cdot p})$.
(a) What are the dimensions, n and p, in this dataset?
(b) Show 1 is an eigenvalue of the matrices $D_r^{-1} X D_c^{-1} X^T$ and $D_c^{-1} X^T D_r^{-1} X$ and that the corresponding
eigenvectors are proportional to $\mathbf{1} = [1, \dots, 1]^T$.
(c) Transform the data as follows:
$$Y = \sqrt{x_{\cdot\cdot}}\; D_r^{-1/2} \left( X - \frac{ab^T}{x_{\cdot\cdot}} \right) D_c^{-1/2} \in \mathbb{R}^{n \times p},$$
where $a = D_r 1_n$ and $b = D_c 1_p$. Report the SVD of Y (both singular values and left/right
singular vectors). Is there another formula to compute the entries of the matrix Y?
(d) Write down the formula to compute row weight vectors and column weight vectors. How
many different row weight vectors and column weight vectors are there? Report all row
weight vectors and column weight vectors.
(e) Similar to PCA, make the following two plots:
- Scatterplot of the first two row weight vectors: Does this scatterplot inform us
about year or industry? What do you learn from this scatterplot?
- 2D biplot: What do you learn from the biplot?
(f) Write down the formula to calculate the Frobenius norm of Y . Compute the Frobenius
norm of Y . What is the relationship between the sum of squares of the singular values and
the Frobenius norm of Y ?
(g) Report the percentage of the original variance that each dimension in the row/column weight
vectors explains. How many singular values are needed to effectively summarize at least
90% of the variability in the data?
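The transformation in Problem 7(c) and the checks in parts (b), (f) and (g) can be tried on a small made-up table. In the R sketch below the 4 x 3 count matrix is hypothetical (not the class-action data), and the transformation is assumed to read $Y = \sqrt{x_{\cdot\cdot}}\, D_r^{-1/2} (X - ab^T / x_{\cdot\cdot}) D_c^{-1/2}$.

```r
# Hypothetical 4 x 3 table of counts (rows: industries, columns: years)
X <- matrix(c(12, 5,  3,
               4, 9,  6,
               2, 7, 15,
               8, 8,  4), nrow = 4, byrow = TRUE)

a <- rowSums(X)                     # row totals,    a = D_r 1_n
b <- colSums(X)                     # column totals, b = D_c 1_p
N <- sum(X)                         # grand total
E <- outer(a, b) / N                # a b' / x.. : expected counts under independence

# Assumed form of the transformation in part (c)
Y <- sqrt(N) * diag(1 / sqrt(a)) %*% (X - E) %*% diag(1 / sqrt(b))

# Entrywise alternative: Pearson residuals (observed - expected) / sqrt(expected)
Y_alt <- (X - E) / sqrt(E)

s    <- svd(Y)
frob <- sqrt(sum(Y^2))              # Frobenius norm of Y
pct  <- 100 * s$d^2 / sum(s$d^2)    # percent of variability per dimension
k90  <- which(cumsum(pct) >= 90)[1] # dimensions needed to reach 90 percent

# Part (b): the vector of ones is an eigenvector with eigenvalue 1
M <- diag(1 / a) %*% X %*% diag(1 / b) %*% t(X)
ones_image <- drop(M %*% rep(1, nrow(X)))   # should again be a vector of ones
```

Under this form of Y, the squared Frobenius norm coincides with the Pearson chi-square statistic of the table, which gives an independent way to check the computation.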
8. (Multidimensional Scaling) An investor looking to allocate his funds to different industries seeks
to visually understand the relationship between returns across different US industries. This
investor has a deep pocket but does not know statistics so he comes to seek your advice. As
a financial mathematician, multidimensional scaling first comes to your mind to answer this
investor’s question. To collect the data, the US industry returns can be downloaded from the
Industry Return sections in Kenneth French’s website at:
http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html#Research
For the purpose of this problem set, the dataset of monthly returns of 30 US industries is
downloaded and formatted. To read in the dataset in R, you may use
FF=read.csv("./FamaFrench30.csv", header=TRUE).
Each row of the data records how the industry in each column went up or down in a given
month. A value of 1 means that the industry in that column went up 1% in that month,
compared to the previous month.
(a) Report mean returns and standard deviation of five industries of your choice. Out of all
30 industries, which industry performs the best on average, which industry is the most
volatile?
(b) Let $R_t^i$ be the return of industry i at time t. Write a formula to compute the distance between
two industries. State what each subscript/superscript means and specify the range of
each subscript/superscript (i.e., explicitly state what you sum over). Write code to
compute the distance and report the distance for the following pairs of industries:
Autos – ElecEq
Autos – Trans
Autos – Oil
(c) Do you need to demean the data to compute the distance matrix? Why?
(d) Report the distance matrix of all industries. To conveniently compute distance, R has a
built in distance matrix command dist.
dist(data_matrix, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
The first input is the data matrix. The distance command will compute the Euclidean
distances between the rows of the data. (Hint: You may need to convert the result into a
matrix using the as.matrix command.)
(e) Multidimensional scaling: With the distance matrix in hand, you are now ready to perform
multidimensional scaling to visualize this data. The end goal is to plot the first two dimensions
after multidimensional scaling. To perform MDS, you first need Euclidean distance
matrix (EDM) from the previous part. Then, you would perform the following steps
Step 1: Form Gram matrix G from EDM. [Handout 9, equation 7.6]
Step 2: Perform EVD on G and recover X using $X = Q_p \Lambda_p^{1/2}$.
Report the result by plotting the first two dimensions after multidimensional scaling with
corresponding industry label for each data point. Does this plot have to be unique? Why?
(f) Interpret the results. What does closer/further in distance mean in this setting? Which
industry tends to co-move with the Games industry the most? List three industries whose
returns tend to move on their own.
(g) (Optional): What is your advice for an investor who put most of his money on stocks in
Telecom? [Think about diversification]
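The distance computation and the two MDS steps in Problem 8 can be sketched in R as follows. The returns are simulated (24 hypothetical months, 6 hypothetical industries named Ind1 to Ind6, not the Fama-French data). The Gram matrix is assumed to be obtained by double centering the squared distances, $G = -\tfrac{1}{2} H D^{(2)} H$ with $H = I - \tfrac{1}{n} 11^T$, which may or may not match the exact form of equation 7.6 in Handout 9.

```r
set.seed(5)
# Simulated returns: rows are months, columns are industries (hypothetical)
R <- matrix(rnorm(24 * 6), 24, 6, dimnames = list(NULL, paste0("Ind", 1:6)))

# dist() works on rows, and industries sit in columns, so transpose first
D <- as.matrix(dist(t(R), method = "euclidean"))

# Direct formula for one pair: square root of the sum over months t of (R_t^i - R_t^j)^2
d12 <- sqrt(sum((R[, "Ind1"] - R[, "Ind2"])^2))

# Step 1: Gram matrix from the Euclidean distance matrix by double centering (assumed form)
n <- nrow(D)
H <- diag(n) - matrix(1 / n, n, n)
G <- -0.5 * H %*% D^2 %*% H

# Step 2: EVD of G and recovery of the first p = 2 coordinates, X = Q_p Lambda_p^{1/2}
e  <- eigen(G, symmetric = TRUE)
X2 <- e$vectors[, 1:2] %*% diag(sqrt(e$values[1:2]))
rownames(X2) <- rownames(D)

# Labeled plot of the first two MDS dimensions (drawn only in an interactive session)
if (interactive()) {
  plot(X2, type = "n", xlab = "Dimension 1", ylab = "Dimension 2")
  text(X2, labels = rownames(X2))
}
```

The coordinates are determined only up to rotation and reflection, so a comparison with the built-in `cmdscale(D, k = 2)` should be made up to the sign of each column.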