Date posted: 2018-11-27 10:58

FINM 331: MULTIVARIATE DATA ANALYSIS

FALL 2018

PROBLEM SET 3

The required files for all problems can be found in:

http://www.stat.uchicago.edu/~lekheng/courses/331/hw3/

The file name indicates which problem the file is for (p1*.txt for Problem 1, etc.). You are welcome to use any programming language or software package you like.

1. (Factor Analysis) This is the same air quality data set we saw in the previous problem set, but this time we will take only the four variables X1, X2, X5, and X6, leaving out the CO, NO, and HC variables.

(a) Obtain the principal component solution to the factor model X = μ+LF+ε with number

of factors m = 1 and m = 2 using:

(i) the sample covariance matrix;

(ii) the sample correlation matrix.

In other words, you should find the matrix of factor loadings $L \in \mathbb{R}^{p \times m}$ and the specific variances $\psi_1, \dots, \psi_p \in \mathbb{R}$, and write down the proportions of variability (in percentages) due to the factors.

(b) Find the angle between the first factor loading in (i) and the first factor loading in (ii).

(c) For the m = 2 case, compare the factor loadings obtained in (i) with those obtained in (ii) using orthogonal Procrustes analysis.

(d) Comment on your results.
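
The computations in Problem 1 can be sketched as follows in Python with NumPy (any language is allowed). This is a minimal illustration on synthetic data, not the course's air-quality file; the function name `pc_factor_solution` and all variable names are hypothetical. It assumes the usual principal component solution, in which column j of L is sqrt(lambda_j) times the j-th unit eigenvector and the specific variances are the diagonal of the residual matrix.

```python
import numpy as np

def pc_factor_solution(M, m):
    """Principal component solution of the m-factor model for a p x p
    covariance or correlation matrix M."""
    eigvals, eigvecs = np.linalg.eigh(M)            # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]               # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    L = eigvecs[:, :m] * np.sqrt(eigvals[:m])       # column j is sqrt(lambda_j) * e_j
    psi = np.diag(M) - np.sum(L**2, axis=1)         # specific variances
    prop = eigvals[:m] / np.trace(M)                # share of total variance per factor
    return L, psi, prop

# Synthetic stand-in for the four variables (NOT the course data).
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))
S = np.cov(data, rowvar=False)
R = np.corrcoef(data, rowvar=False)

L_S, psi_S, prop_S = pc_factor_solution(S, 2)
L_R, psi_R, prop_R = pc_factor_solution(R, 2)

# (b) Angle between the first loading vectors; eigenvector signs are
# arbitrary, so the absolute value of the cosine is used.
u, v = L_S[:, 0], L_R[:, 0]
cos_angle = abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
angle_deg = np.degrees(np.arccos(np.clip(cos_angle, 0.0, 1.0)))

# (c) Orthogonal Procrustes: Q minimizing ||L_S - L_R Q||_F over orthogonal Q.
U, _, Vt = np.linalg.svd(L_R.T @ L_S)
Q = U @ Vt
residual = np.linalg.norm(L_S - L_R @ Q)

print("proportions (covariance), %: ", np.round(100 * prop_S, 1))
print("proportions (correlation), %:", np.round(100 * prop_R, 1))
print("angle (degrees):", round(float(angle_deg), 2))
print("Procrustes residual:", round(float(residual), 4))
```

Because loadings from the covariance matrix carry the units of the variables while those from the correlation matrix do not, the angle and the Procrustes residual are only rough comparison tools; part (d) is where that caveat belongs.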

2. (Population Canonical Correlation Analysis) The 2 × 1 random vectors X and Y have joint

mean vector and joint covariance matrix

(a) Calculate the canonical correlations ρ1 (the largest) and ρ2 (the second largest).

(b) Find the canonical correlation variables (U1, V1) and (U2, V2) corresponding to ρ1 and ρ2.

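(c) Let $U = [U_1, U_2]^T$ and $V = [V_1, V_2]^T$. Evaluate

$$E\begin{bmatrix} U \\ V \end{bmatrix} \quad \text{and} \quad \operatorname{Cov}\begin{bmatrix} U \\ V \end{bmatrix} = \begin{bmatrix} \Sigma_U & \Sigma_{UV} \\ \Sigma_{VU} & \Sigma_V \end{bmatrix}.$$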

(d) Comment on the correlation structure between and within U and V .

3. (Sample canonical correlation analysis) The data set for this problem is obtained by taking four

different measures of stiffness, shock, vibrate, static1, static2, for each of n = 30 boards.

The first measurement involves sending a shock wave down the board, the second measurement is determined while vibrating the board, and the last two measurements are obtained from static tests. The squared distances $d_j^2 = (x_j - \bar{x})^T S^{-1} (x_j - \bar{x})$ are also included as the last column in the data matrix.

Date: November 5, 2018 (Version 1.0); due: December 3, 2018.

Let $X = [X_1, X_2]^T$ be the random vector representing the dynamic measures of stiffness, and let $Y = [Y_1, Y_2]^T$ be the random vector representing the static measures of stiffness. Load the data matrix p3.txt (R command: stiff = read.table("p3.txt")).

(a) Perform a canonical correlation analysis of these data by computing the singular value decomposition

of an appropriate matrix formed from the sample covariance matrices. You may

compare your result with that obtained by your software (if you use R, it is cancor(X1,X2)).

(b) Write the first canonical correlation variables U1 and V1 as linear combinations of shock,

vibrate, static1, static2.

(c) Produce two scatterplots of the data: one in the coordinate plane of the first canonical

correlation vectors, one in the plane of the second canonical correlation vectors.

(d) Based on the two plots and the values of the canonical correlations {ρ1, ρ2}, comment on

the correlation structure captured by each canonical pair.

(e) Repeat (a) with sample correlation matrices in place of sample covariance matrices and

verify that the pairs of canonical vectors obtained are related via scaling by the sample

standard deviation matrix.
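
The SVD route described in part (a) can be sketched as follows. This is a minimal illustration on synthetic data (not the contents of p3.txt), assuming the "appropriate matrix" is $S_{XX}^{-1/2} S_{XY} S_{YY}^{-1/2}$; the helper names `inv_sqrt` and `cca_via_svd` are hypothetical.

```python
import numpy as np

def inv_sqrt(M):
    """Inverse symmetric square root of a symmetric positive definite matrix."""
    w, Q = np.linalg.eigh(M)
    return Q @ np.diag(w ** -0.5) @ Q.T

def cca_via_svd(X, Y):
    """Sample CCA from the SVD of Sxx^{-1/2} Sxy Syy^{-1/2}."""
    p = X.shape[1]
    S = np.cov(np.hstack([X, Y]), rowvar=False)
    Sxx, Sxy, Syy = S[:p, :p], S[:p, p:], S[p:, p:]
    Sxx_ih, Syy_ih = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, rho, Vt = np.linalg.svd(Sxx_ih @ Sxy @ Syy_ih)
    A = Sxx_ih @ U          # column k holds the coefficients of U_k
    B = Syy_ih @ Vt.T       # column k holds the coefficients of V_k
    return rho, A, B

# Synthetic stand-in for the 30 boards (NOT the course data).
rng = np.random.default_rng(0)
latent = rng.normal(size=(30, 1))
X = latent + 0.5 * rng.normal(size=(30, 2))    # plays the role of (shock, vibrate)
Y = latent + 0.5 * rng.normal(size=(30, 2))    # plays the role of (static1, static2)

rho, A, B = cca_via_svd(X, Y)
scores_U = (X - X.mean(axis=0)) @ A            # canonical variables U_1, U_2
scores_V = (Y - Y.mean(axis=0)) @ B            # canonical variables V_1, V_2
print("canonical correlations:", np.round(rho, 4))
```

The columns of `scores_U` and `scores_V` are the coordinates needed for the scatterplots in part (c): plot column k of one against column k of the other. Signs of canonical vectors are arbitrary, so a software routine may return them flipped.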

4. (Canonical correlation analysis for angular measurements) Some observations are in the form

of angles. Here we will see how to deal with such data.

(a) Consider a bivariate random vector $X = [X_1, X_2]^T$ with a uniform distribution inside a circle of radius 1 centered at some unknown point

$$\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} \in \mathbb{R}^2.$$

Then E(X) = μ. A sample of n = 4 is taken. The observed values are

Compute the sample mean $\bar{x}$ and the sample covariance matrix. Is $\bar{x}$ a good estimator of μ? Why or why not?

(b) We consider an angular-valued random variable θ; note that this can always be represented as a random vector $Y = [\cos\theta, \sin\theta]^T$ that takes values on the circle. Show that

$$b^T Y = \sqrt{b_1^2 + b_2^2}\,\cos(\theta - \beta),$$

where $b_1/\sqrt{b_1^2 + b_2^2} = \cos\beta$ and $b_2/\sqrt{b_1^2 + b_2^2} = \sin\beta$. Here $b = [b_1, b_2]^T \in \mathbb{R}^2$ is a constant vector.

(c) Let $\mathbf{X} = X$ be a random vector with a single component, i.e., just a random variable. Here X is not angular valued. Show that the population canonical correlation is

$$\rho_1 = \max_{\beta} \operatorname{Corr}\bigl(X, \cos(\theta - \beta)\bigr)$$

and that selecting the population canonical correlation variable $V_1$ amounts to selecting a new ‘origin’ or ‘baseline’ β for the angle θ.

(d) Let X be a random variable representing ozone (O3) levels and let θ be an angular random variable representing wind direction measured from the north. We make 19 observations to obtain the sample correlation matrix

$$R = \begin{bmatrix} R_X & R_{X\theta} \\ R_{\theta X} & R_\theta \end{bmatrix} = \begin{array}{c|ccc} & \mathrm{O}_3 & \cos\theta & \sin\theta \\ \hline \mathrm{O}_3 & 1.000 & 0.166 & 0.694 \\ \cos\theta & 0.166 & 1.000 & -0.051 \\ \sin\theta & 0.694 & -0.051 & 1.000 \end{array}$$

Find the sample canonical correlation $\hat{\rho}_1$ and the sample canonical correlation variable $\hat{V}_1$ representing the new origin β.

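(e) Let φ be another angular-valued random variable and let $X = [\cos\phi, \sin\phi]^T$. Then, similarly to (b), we get

$$a^T X = \sqrt{a_1^2 + a_2^2}\,\cos(\phi - \alpha).$$

Now show that

$$\rho_1 = \max_{\alpha, \beta} \operatorname{Corr}\bigl(\cos(\phi - \alpha), \cos(\theta - \beta)\bigr).$$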

(f) Let φ and θ be two angular random variables representing wind directions at 6:00 a.m. and at 12:00 p.m. We make 21 measurements of X and Y (related to φ and θ as in (b) and (d)). We obtain the sample correlation matrix

$$R = \begin{bmatrix} R_X & R_{XY} \\ R_{YX} & R_Y \end{bmatrix} = \begin{array}{c|cccc} & \cos\phi & \sin\phi & \cos\theta & \sin\theta \\ \hline \cos\phi & 1.000 & 0.291 & 0.440 & 0.372 \\ \sin\phi & 0.291 & 1.000 & 0.205 & 0.243 \\ \cos\theta & 0.440 & 0.205 & 1.000 & 0.181 \\ \sin\theta & 0.372 & 0.243 & 0.181 & 1.000 \end{array}$$

Find the sample canonical correlation $\hat{\rho}_1$ and the sample canonical correlation variables $\hat{U}_1$ and $\hat{V}_1$.
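
Two ingredients of Problem 4 can be checked numerically. The sketch below first verifies, on a grid of angles, that a linear combination of cos θ and sin θ is a shifted cosine, and then computes the first canonical correlation of a scalar against a two-component vector from correlation inputs. The correlation values are hypothetical (they are not the matrices given in the problem), the function names are illustrative, and the coefficient vector refers to standardized components, so converting it to an origin β for raw data would also need the standard deviations of cos θ and sin θ.

```python
import numpy as np

def amplitude_and_origin(b):
    """Return (r, beta) such that b1*cos(t) + b2*sin(t) = r*cos(t - beta)."""
    return np.hypot(b[0], b[1]), np.arctan2(b[1], b[0])

def first_cancorr_scalar(r_xy, R_yy):
    """rho_1 and coefficient vector b for a scalar X against a vector Y,
    given Corr(X, Y) = r_xy and Corr(Y) = R_yy."""
    b = np.linalg.solve(R_yy, r_xy)         # b is proportional to R_yy^{-1} r_xy
    rho = np.sqrt(r_xy @ b)                 # rho_1^2 = r_xy^T R_yy^{-1} r_xy
    return rho, b / rho                     # rescaled so that b^T R_yy b = 1

# Check the shifted-cosine identity on a grid of angles, for an arbitrary b.
b_demo = np.array([-1.5, 0.8])
r, beta = amplitude_and_origin(b_demo)
theta = np.linspace(0.0, 2.0 * np.pi, 181)
lhs = b_demo[0] * np.cos(theta) + b_demo[1] * np.sin(theta)
identity_gap = np.max(np.abs(lhs - r * np.cos(theta - beta)))

# Hypothetical correlations (NOT the matrix given in the problem).
r_xy = np.array([0.30, 0.60])
R_yy = np.array([[1.00, 0.10], [0.10, 1.00]])
rho1, b_hat = first_cancorr_scalar(r_xy, R_yy)

# Brute-force check: maximize Corr(X, c^T Y) over directions c = [cos t, sin t].
directions = np.stack([np.cos(theta), np.sin(theta)])      # shape 2 x 181
corr_grid = (r_xy @ directions) / np.sqrt(
    np.sum(directions * (R_yy @ directions), axis=0))

print("identity gap:", identity_gap)
print("rho_1:", round(float(rho1), 4), " grid maximum:", round(float(corr_grid.max()), 4))
```

The grid maximum approaches `rho1` from below as the grid is refined, which is the numerical counterpart of writing ρ1 as a maximum over the baseline angle.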

5. (Proofs behind CCA) Let $A \in \mathbb{R}^{p \times p}$ and $B \in \mathbb{R}^{q \times q}$ be symmetric positive definite matrices and $C \in \mathbb{R}^{p \times q}$. Let

$$G = A^{-1/2} C B^{-1/2} \in \mathbb{R}^{p \times q}.$$

We shall write $\lambda_{\max}(M)$ for the largest eigenvalue of a matrix M.

(a) Suppose p = q. Show that the eigenvalues of $B^{-1}A$, $B^{-1/2}AB^{-1/2}$, and $AB^{-1}$ are all equal. What are the relations between the eigenvectors?

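(b) Suppose p = q. Show that

$$\max_{x \in \mathbb{R}^p} \{ x^T A x : x^T B x = 1 \} = \max_{y \in \mathbb{R}^p} \{ y^T B^{-1/2} A B^{-1/2} y : y^T y = 1 \}.$$

By using Problem 7 in Homework 2, deduce that

$$\max_{x \in \mathbb{R}^p} \{ x^T A x : x^T B x = 1 \} = \lambda_{\max}(B^{-1/2} A B^{-1/2}),$$

$$\operatorname*{argmax}_{x \in \mathbb{R}^p} \{ x^T A x : x^T B x = 1 \} = q_{\max},$$

where $q_{\max} \in \mathbb{R}^p$ is the eigenvector of $B^{-1}A$ corresponding to the largest eigenvalue.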

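(c) Show that if we fix $x \in \mathbb{R}^p$ and just maximize over all $y \in \mathbb{R}^q$, then

$$\max_{y \in \mathbb{R}^q} \{ (x^T C y)^2 : y^T B y = 1 \} = \max_{y \in \mathbb{R}^q} \{ y^T [C^T x x^T C] y : y^T B y = 1 \}$$

and deduce from (a) and (b) that

$$\max_{y \in \mathbb{R}^q} \{ (x^T C y)^2 : y^T B y = 1 \} = \lambda_{\max}(B^{-1} C^T x x^T C).$$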

Show that the largest eigenvalue of a rank-1 matrix $ab^T$ is $b^T a$ and deduce that

$$\max_{y \in \mathbb{R}^q} \{ (x^T C y)^2 : y^T B y = 1 \} = x^T C B^{-1} C^T x.$$

(d) Using (a), (c), and Problem 7 in Homework 2, show that

$$\max_{x \in \mathbb{R}^p,\ y \in \mathbb{R}^q} \{ (x^T C y)^2 : x^T A x = 1,\ y^T B y = 1 \} = \lambda_{\max}(G G^T).$$

(e) Let $\sigma_1, \dots, \sigma_p \in \mathbb{R}$, $u_1, \dots, u_p \in \mathbb{R}^p$, $v_1, \dots, v_p \in \mathbb{R}^q$ be the singular values and left/right singular vectors of G. By Problem 7 in Homework 2, show that

$$\max_{x \in \mathbb{R}^p} \{ x^T G G^T x : x^T x = 1,\ u_i^T x = 0,\ i = 1, \dots, k-1 \} = \sigma_k^2,$$

$$\operatorname*{argmax}_{x \in \mathbb{R}^p} \{ x^T G G^T x : x^T x = 1,\ u_i^T x = 0,\ i = 1, \dots, k-1 \} = u_k,$$

for k = 1, . . . , p. Hence deduce that

$$\max_{x \in \mathbb{R}^p,\ y \in \mathbb{R}^q} \{ x^T C y : x^T A x = 1,\ y^T B y = 1,\ u_i^T A^{1/2} x = 0,\ i = 1, \dots, k-1 \} = \sigma_k,$$

$$\operatorname*{argmax}_{x \in \mathbb{R}^p,\ y \in \mathbb{R}^q} \{ x^T C y : x^T A x = 1,\ y^T B y = 1,\ u_i^T A^{1/2} x = 0,\ i = 1, \dots, k-1 \} = (A^{-1/2} u_k,\ B^{-1/2} v_k),$$

for k = 1, . . . , p. Finally show that

$$\max_{x \in \mathbb{R}^p,\ y \in \mathbb{R}^q} \{ x^T C y : x^T A x = 1,\ y^T B y = 1,\ u_i^T A^{1/2} x = 0,\ v_i^T B^{1/2} y = 0,\ i = 1, \dots, k-1 \} = \sigma_k,$$

$$\operatorname*{argmax}_{x \in \mathbb{R}^p,\ y \in \mathbb{R}^q} \{ x^T C y : x^T A x = 1,\ y^T B y = 1,\ u_i^T A^{1/2} x = 0,\ v_i^T B^{1/2} y = 0,\ i = 1, \dots, k-1 \} = (A^{-1/2} u_k,\ B^{-1/2} v_k),$$

for k = 1, . . . , p.
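
Before writing the proofs, the claims in Problem 5 can be sanity-checked numerically. The sketch below draws random matrices A, B, C with the stated properties (the sizes p = 3 and q = 4 and the random seed are arbitrary choices), forms G, and compares the constrained maxima with the singular values. It is a numerical check under these assumptions, not a proof.

```python
import numpy as np

def inv_sqrt(M):
    """Inverse symmetric square root of a symmetric positive definite matrix."""
    w, Q = np.linalg.eigh(M)
    return Q @ np.diag(w ** -0.5) @ Q.T

rng = np.random.default_rng(1)
p, q = 3, 4
Ma, Mb = rng.normal(size=(p, p)), rng.normal(size=(q, q))
A = Ma @ Ma.T + p * np.eye(p)          # random symmetric positive definite
B = Mb @ Mb.T + q * np.eye(q)
C = rng.normal(size=(p, q))

A_ih, B_ih = inv_sqrt(A), inv_sqrt(B)
G = A_ih @ C @ B_ih
U, sigma, Vt = np.linalg.svd(G)

# Part (d): lambda_max(G G^T) should equal sigma_1^2.
lam_max = np.linalg.eigvalsh(G @ G.T).max()

# Part (e): candidate maximizers x_k = A^{-1/2} u_k, y_k = B^{-1/2} v_k.
Xs, Ys = A_ih @ U, B_ih @ Vt.T[:, :p]
attained = np.array([Xs[:, k] @ C @ Ys[:, k] for k in range(p)])

# Part (c): for a fixed x, the maximum over y is x^T C B^{-1} C^T x,
# attained at y proportional to B^{-1} C^T x.
x = rng.normal(size=p)
bound_c = x @ C @ np.linalg.solve(B, C.T @ x)
y_star = np.linalg.solve(B, C.T @ x)
y_star /= np.sqrt(y_star @ B @ y_star)

# Random feasible pairs should never beat sigma_1.
best_random = 0.0
for _ in range(2000):
    xr, yr = rng.normal(size=p), rng.normal(size=q)
    xr /= np.sqrt(xr @ A @ xr)          # enforce x^T A x = 1
    yr /= np.sqrt(yr @ B @ yr)          # enforce y^T B y = 1
    best_random = max(best_random, abs(xr @ C @ yr))

print("singular values of G:", np.round(sigma, 4))
print("values attained at (x_k, y_k):", np.round(attained, 4))
print("best random feasible value:", round(float(best_random), 4))
```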

6. (Linear discriminant analysis) The admissions committee of a business school used GPA and

GMAT scores to make admission decisions. The values for the variable admit = 1,2,3 correspond

to admission decisions of yes, no, waitlist. Label the data set p6.txt — helpful R

commands:

gsbdata = read.table("p6.txt"); colnames(gsbdata)=c("GPA", "GMAT","admit");

(a) Calculate $\bar{x}_i$, $\bar{x}$ and $S_{\text{pool}}$.

(b) Calculate the sample within groups matrix W, its inverse $W^{-1}$, and the sample between groups matrix B. Find the eigenvalues and eigenvectors of $W^{-1}B$. (R command for $A^{-1}$ is solve(A)).

(c) Use the linear discriminants derived from these eigenvectors to classify the two new observations $x = [3.21, 497]^T$ and $x = [3.22, 497]^T$.

(d) Scatterplot the original data set on the plane of the first two discriminants, labeled by

admission decisions. Comment on the results in (c). Is this a good admission policy?
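
The quantities in Problem 6 can be sketched as follows on synthetic data (not the contents of p6.txt). Two assumptions are worth flagging: the between-groups matrix here is weighted by group size, whereas some texts use an unweighted sum of outer products, and the discriminant vectors are scaled so that $a^T S_{\text{pool}}\, a = 1$. The function names and the group centers are hypothetical.

```python
import numpy as np

def inv_sqrt(M):
    """Inverse symmetric square root of a symmetric positive definite matrix."""
    w, Q = np.linalg.eigh(M)
    return Q @ np.diag(w ** -0.5) @ Q.T

def lda_fit(X, labels):
    classes = np.unique(labels)
    n, p = X.shape
    xbar = X.mean(axis=0)                                   # overall mean
    means = np.array([X[labels == g].mean(axis=0) for g in classes])
    W = np.zeros((p, p))
    B = np.zeros((p, p))
    for g, m in zip(classes, means):
        Xg = X[labels == g]
        W += (Xg - m).T @ (Xg - m)                          # within-group scatter
        B += len(Xg) * np.outer(m - xbar, m - xbar)         # size-weighted between-group scatter
    S_pool = W / (n - len(classes))
    W_ih = inv_sqrt(W)
    lam, E = np.linalg.eigh(W_ih @ B @ W_ih)                # same eigenvalues as W^{-1} B
    order = np.argsort(lam)[::-1][: min(p, len(classes) - 1)]
    A = W_ih @ E[:, order] * np.sqrt(n - len(classes))      # eigenvectors of W^{-1} B, rescaled
    return classes, means, S_pool, W, B, lam[order], A

def lda_classify(x, classes, means, A):
    """Assign x to the group whose mean is nearest in discriminant coordinates."""
    d2 = np.sum(((x - means) @ A) ** 2, axis=1)
    return classes[np.argmin(d2)]

# Synthetic (GPA, GMAT)-like data with three groups; centers are made up.
rng = np.random.default_rng(0)
centers = np.array([[3.4, 560.0], [2.5, 450.0], [3.0, 450.0]])
X = np.vstack([c + rng.normal(size=(25, 2)) * [0.2, 30.0] for c in centers])
labels = np.repeat([1, 2, 3], 25)

classes, means, S_pool, W, B, lam, A = lda_fit(X, labels)
scores = (X - X.mean(axis=0)) @ A          # coordinates for the plot in part (d)
print("eigenvalues of W^{-1}B:", np.round(lam, 4))
print("label for [3.21, 497] under the synthetic data:",
      lda_classify(np.array([3.21, 497.0]), classes, means, A))
```

Going through $W^{-1/2} B W^{-1/2}$ keeps the eigenproblem symmetric, which is the same device used in Problem 5(a); each column of `A` is still an eigenvector of $W^{-1}B$.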

7. (Correspondence Analysis) A client of a law firm would like to visualize the number of large

class-action lawsuits each year across different industries from 2011 to the first half of 2017. The

correspondence analysis provides a means of displaying or summarizing a set of categorical data

in two-dimensional graphical form. The data on class-action lawsuits are from annual reports

of Stanford Law School’s Securities Class Action Clearinghouse. To load the data in R, you can

use the following command:

CALaw = read.csv("/classaction lawsuit.csv",header=TRUE)

Notation: Denote by X the data matrix of the number of class-action lawsuits for each industry-year pair. $x_{i\bullet}$ denotes the row total (summing across all years for each industry). $x_{\bullet j}$ denotes the column total (summing across all industries for each year). $x_{\bullet\bullet}$ denotes the grand total. Define $D_r = \operatorname{diag}(x_{1\bullet}, \dots, x_{n\bullet})$ and $D_c = \operatorname{diag}(x_{\bullet 1}, \dots, x_{\bullet p})$.

(a) What are the dimensions, n and p, in this dataset?

(b) Show that 1 is an eigenvalue of the matrices $D_r^{-1} X D_c^{-1} X^T$ and $D_c^{-1} X^T D_r^{-1} X$ and that the corresponding eigenvectors are proportional to $\mathbf{1} = [1, \dots, 1]^T$.

(c) Transform the data as follows:

$$Y = x_{\bullet\bullet}\, D_r^{-1/2} \left( X - \frac{a b^T}{x_{\bullet\bullet}} \right) D_c^{-1/2} \in \mathbb{R}^{n \times p},$$

where $a = D_r 1_n$ and $b = D_c 1_p$. Report the SVD of Y (both singular values and left/right singular vectors). Is there another formula to compute the entries of the matrix Y?

(d) Write down the formula to compute row weight vectors and column weight vectors. How

many different row weight vectors and column weight vectors are there? Report all row

weight vectors and column weight vectors.

(e) Similar to PCA, make the following two plots:

Scatterplot of the first two row weight vectors: Does this scatterplot inform us

about year or industry? What do you learn from this scatterplot?

2D biplot: What do you learn from the biplot?

(f) Write down the formula to calculate the Frobenius norm of Y . Compute the Frobenius

norm of Y . What is the relationship between the sum of squares of the singular values and

the Frobenius norm of Y ?

(g) Report the percentage of the original variance that each dimension in the row/column weight vectors explains. How many singular values are needed to effectively summarize at least 90% of the variability in the data?
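
The core computations of Problem 7 can be sketched as follows on a small made-up count table (not the class-action data). The transformed matrix is taken in the common standardized-residual form (observed minus expected, divided by the square root of expected), which should agree with the Y of part (c) up to a constant factor; that scaling, the function name, and the counts are assumptions. Row and column weight vectors are not computed here because their definition is course-specific.

```python
import numpy as np

def standardized_residuals(X):
    """Entries (x_ij - e_ij) / sqrt(e_ij), with e_ij = (row total * column total) / grand total."""
    X = np.asarray(X, dtype=float)
    expected = np.outer(X.sum(axis=1), X.sum(axis=0)) / X.sum()
    return (X - expected) / np.sqrt(expected)

# Hypothetical counts: 4 rows (industries) by 3 columns (years).
X = np.array([[12, 7, 3],
              [5, 9, 6],
              [8, 4, 10],
              [2, 6, 5]])
a, b = X.sum(axis=1), X.sum(axis=0)         # row and column totals

# Part (b): the all-ones vector is an eigenvector with eigenvalue 1.
M_rows = np.diag(1 / a) @ X @ np.diag(1 / b) @ X.T
ones_image = M_rows @ np.ones(len(a))

# Parts (c), (f), (g): SVD, Frobenius norm, and explained shares.
Y = standardized_residuals(X)
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
fro_sq = np.sum(Y ** 2)                     # squared Frobenius norm
explained = s ** 2 / np.sum(s ** 2)
k90 = int(np.searchsorted(np.cumsum(explained), 0.90) + 1)

print("singular values:", np.round(s, 4))
print("explained shares, %:", np.round(100 * explained, 1))
print("dimensions needed for 90%:", k90)
```

The smallest singular value is numerically zero because the residual matrix is annihilated by the vector of square roots of the column totals, so at most min(n, p) − 1 dimensions carry information.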

8. (Multidimensional Scaling) An investor looking to allocate his funds to different industries seeks

to visually understand the relationship between returns across different US industries. This

investor has a deep pocket but does not know statistics so he comes to seek your advice. As

a financial mathematician, multidimensional scaling first comes to your mind to answer this

investor’s question. To collect the data, the US industry returns can be downloaded from the

Industry Return sections in Kenneth French’s website at:

http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html#Research

For the purpose of this problem set, the dataset of monthly returns of 30 US industries is

downloaded and formatted. To read in the dataset in R, you may use

FF=read.csv("./FamaFrench30.csv", header=TRUE).

Each row of the data shows how the industry in each column goes up or down on a given date. A value of 1 means that the industry in a particular column goes up 1% in that month, compared to the previous month.

(a) Report mean returns and standard deviation of five industries of your choice. Out of all

30 industries, which industry performs the best on average, which industry is the most

volatile?

(b) Let $R^i_t$ be the return of industry i at time t. Write a formula to compute the distance between two industries. State what each subscript/superscript means and specify the range of each (i.e., state explicitly what you sum over). Write code to compute the distance and report the distance for the following pairs of industries:

Autos – ElecEq

Autos – Trans

Autos – Oil

(c) Do you need to demean the data to compute the distance matrix? Why?

(d) Report the distance matrix of all industries. To compute distances conveniently, R has a built-in distance matrix command, dist:

dist(data_matrix, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)

The first input is the data matrix. The dist command computes the Euclidean distances between the rows of the data. (Hint: You may need to convert the result into a matrix using the as.matrix command.)

(e) Multidimensional scaling: With the distance matrix in hand, you are now ready to perform multidimensional scaling to visualize this data. The end goal is to plot the first two dimensions after multidimensional scaling. To perform MDS, you first need the Euclidean distance matrix (EDM) from the previous part. Then perform the following steps:

Step 1: Form the Gram matrix G from the EDM. [Handout 9, equation 7.6]

Step 2: Perform EVD on G and recover X using $X = Q_p \Lambda_p^{1/2}$.

Report the result by plotting the first two dimensions after multidimensional scaling, with the corresponding industry label for each data point. Does this plot have to be unique? Why?

(f) Interpret the results. What does closer/further in distance mean in this setting? Which industry tends to co-move with the Games industry the most? List three industries whose returns tend to move on their own.

(g) (Optional): What is your advice for an investor who has put most of his money in Telecom stocks? [Think about diversification]
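
The distance and MDS steps of Problem 8 can be sketched as follows on synthetic returns (not FamaFrench30.csv). The Gram matrix is formed by the standard double-centering formula $G = -\tfrac{1}{2} J D^{(2)} J$ with $J = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T$, which is assumed to match the handout's equation; the function names are hypothetical and only five industry labels are used.

```python
import numpy as np

def pairwise_distances(R):
    """Euclidean distances between the columns (industries) of a T x N return matrix."""
    Z = R.T                                     # one row per industry
    sq = np.sum(Z ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    np.fill_diagonal(D2, 0.0)                   # remove rounding noise on the diagonal
    return np.sqrt(np.maximum(D2, 0.0))

def classical_mds(D, k=2):
    """Classical MDS: double-center the squared distances, then use the top-k eigenpairs."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n         # centering matrix
    G = -0.5 * J @ (D ** 2) @ J                 # Gram matrix
    w, Q = np.linalg.eigh(G)
    idx = np.argsort(w)[::-1][:k]
    return Q[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# Synthetic monthly returns in percent: 60 months by 5 industries.
rng = np.random.default_rng(0)
names = ["Autos", "ElecEq", "Trans", "Oil", "Games"]
returns = rng.normal(loc=1.0, scale=5.0, size=(60, len(names)))

D = pairwise_distances(returns)
coords2 = classical_mds(D, k=2)                       # coordinates to plot and label
coords_full = classical_mds(D, k=len(names) - 1)      # enough dimensions to reproduce D

for name, (c1, c2) in zip(names, coords2):
    print(f"{name:8s} {c1:9.3f} {c2:9.3f}")
```

Note the transpose inside `pairwise_distances`: industries are columns of the data, while distance routines such as R's dist work on rows. The embedding is determined only up to rotation and reflection, since each eigenvector's sign is arbitrary, which bears on the uniqueness question in part (e).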

