Date: 2025-04-21 09:33

STAT4602 Multivariate Data Analysis Assignment 2

Hand in solutions for ALL questions by 11:59 pm on April 23 (Wednesday), 2025.

1. The file IRIS.DAT gives a dataset containing 4 measurements for 3 species of iris. In the dataset, each row corresponds to one observation. The first 4 columns give the 4 measurements, and the last column takes values 1, 2, 3, corresponding to the 3 species of iris.

(a) Perform multivariate regression for each species separately, treating the two sepal measures (x1 and x2) as response variables and the two petal measures (x3 and x4) as independent variables. Report the fitted models.
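As a rough illustration of the per-species fit asked for in (a), the sketch below uses ordinary least squares via `numpy.linalg.lstsq`. The helper names `fit_multivariate_ols` and `fit_by_species` are hypothetical, and the data are synthetic stand-ins rather than the contents of IRIS.DAT.

```python
import numpy as np

def fit_multivariate_ols(X, Y):
    """Least-squares fit of Y = [1, X] B; row 0 of B holds the intercepts."""
    design = np.column_stack([np.ones(len(X)), X])
    B, *_ = np.linalg.lstsq(design, Y, rcond=None)
    return B

def fit_by_species(data):
    """data: columns x1, x2, x3, x4, species, as described in Question 1."""
    fits = {}
    for label in np.unique(data[:, 4]):
        rows = data[data[:, 4] == label]
        # Responses: sepal measures x1, x2. Predictors: petal measures x3, x4.
        fits[int(label)] = fit_multivariate_ols(rows[:, 2:4], rows[:, 0:2])
    return fits

# Synthetic, noise-free stand-in for one species (hypothetical numbers).
rng = np.random.default_rng(0)
true_B = np.array([[1.0, 2.0], [0.5, -0.3], [0.2, 0.4]])
petal = rng.normal(size=(30, 2))
sepal = np.column_stack([np.ones(30), petal]) @ true_B
demo = np.column_stack([sepal, petal, np.ones(30)])

fits = fit_by_species(demo)
print(fits[1])  # 3 x 2 matrix: rows (intercept, x3, x4), columns (x1, x2)
```

If IRIS.DAT is whitespace-delimited and purely numeric (an assumption about the file layout), `fit_by_species(np.loadtxt("IRIS.DAT"))` would apply the same fit to the real data.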

(b) For the species “versicolour” (serial number 2), test whether the two sets of regression coefficients (excluding intercepts) are the same in the regression equations for x1 and for x2.

(c) Consider a multivariate linear model as in (a), but incorporate the 3 species in the model with the aid of additional dummy variables. Specifically, introduce new variables:

• s ∈ {0, 1}: s = 1 if species = 1, and s = 0 otherwise.
• v ∈ {0, 1}: v = 1 if species = 2, and v = 0 otherwise.
• sx3 = s · x3: sx3 = x3 if species = 1, and sx3 = 0 otherwise.
• sx4 = s · x4: sx4 = x4 if species = 1, and sx4 = 0 otherwise.
• vx3 = v · x3: vx3 = x3 if species = 2, and vx3 = 0 otherwise.
• vx4 = v · x4: vx4 = x4 if species = 2, and vx4 = 0 otherwise.

Perform the regression and test the hypothesis that the 3 species have the same model.
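The dummy variables in (c) can be assembled into a design matrix as sketched below. The function name `build_design`, the column order, and the three example rows are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def build_design(x3, x4, species):
    """Columns: 1, x3, x4, s, v, sx3, sx4, vx3, vx4 (names follow part (c))."""
    x3 = np.asarray(x3, dtype=float)
    x4 = np.asarray(x4, dtype=float)
    species = np.asarray(species)
    s = (species == 1).astype(float)  # indicator of species 1
    v = (species == 2).astype(float)  # indicator of species 2
    return np.column_stack(
        [np.ones_like(x3), x3, x4, s, v, s * x3, s * x4, v * x3, v * x4]
    )

# One made-up observation per species, for illustration only.
design = build_design([1.4, 4.7, 6.0], [0.2, 1.4, 2.5], [1, 2, 3])
print(design)
```

With this layout, species 3 acts as the baseline: its rows have zeros in all six dummy columns, so the hypothesis that the 3 species share one model corresponds to the coefficients of those six columns being zero in both response equations.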

(d) For an input with species = 1, 2, 3, is the model obtained in (c) equivalent to the 3 separate multivariate regression models obtained in (a)?

2. Consider the data given by CORKDATA.sas in Question 1 of Assignment 1:

 N   E   S   W      N   E   S   W
72  66  76  77     91  79 100  75
60  53  66  63     56  68  47  50
56  57  64  58     79  65  70  61
41  29  36  38     81  80  68  58
32  32  35  36     78  55  67  60
30  35  34  26     46  38  37  38
39  39  31  27     39  35  34  37
42  43  31  25     32  30  30  32
37  40  31  25     60  50  67  54
33  29  27  36     35  37  48  39
32  30  34  28     39  36  39  31
63  45  74  63     50  34  37  40
54  46  60  52     43  37  39  50
47  51  52  45     48  54  57  43

(a) Find the principal components based on the covariance matrix. Interpret them if possible.

HKU STAT4602 (2024-25, Semester 2) 1

(b) How many principal components would you suggest to retain in summarizing the total variability of the data? Give reasons, including results of statistical tests if appropriate.

(c) Repeat (a) and (b) using the correlation matrix instead.

(d) Compare and comment on the two sets of results for covariance and correlation matrices. Recommend a set of results and explain why.
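One way to obtain both sets of principal components is an eigendecomposition of the sample covariance and correlation matrices. The sketch below transcribes the 28 rows of the table in Question 2, reading the left block of 14 rows first and then the right block (the row order is an assumption and does not affect the result). The helper `principal_components` is a hypothetical name.

```python
import numpy as np

# Cork data (columns N, E, S, W), transcribed from the table in Question 2.
cork = np.array([
    [72, 66, 76, 77], [60, 53, 66, 63], [56, 57, 64, 58], [41, 29, 36, 38],
    [32, 32, 35, 36], [30, 35, 34, 26], [39, 39, 31, 27], [42, 43, 31, 25],
    [37, 40, 31, 25], [33, 29, 27, 36], [32, 30, 34, 28], [63, 45, 74, 63],
    [54, 46, 60, 52], [47, 51, 52, 45], [91, 79, 100, 75], [56, 68, 47, 50],
    [79, 65, 70, 61], [81, 80, 68, 58], [78, 55, 67, 60], [46, 38, 37, 38],
    [39, 35, 34, 37], [32, 30, 30, 32], [60, 50, 67, 54], [35, 37, 48, 39],
    [39, 36, 39, 31], [50, 34, 37, 40], [43, 37, 39, 50], [48, 54, 57, 43],
], dtype=float)

def principal_components(matrix):
    """Eigenvalues in descending order and matching eigenvectors (as columns)."""
    values, vectors = np.linalg.eigh(matrix)
    order = np.argsort(values)[::-1]
    return values[order], vectors[:, order]

cov = np.cov(cork, rowvar=False)        # sample covariance, divisor n - 1
corr = np.corrcoef(cork, rowvar=False)  # sample correlation

for name, matrix in [("covariance", cov), ("correlation", corr)]:
    values, vectors = principal_components(matrix)
    print(name)
    print("  proportion of variance:", np.round(values / values.sum(), 3))
    print("  loadings (columns = components):")
    print(np.round(vectors, 3))
```

The sign of each eigenvector is arbitrary, so loadings may come out with flipped signs relative to other software. Formal tests on the eigenvalues, as mentioned in (b), are not covered by this sketch.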

3. Annual financial data are collected for bankrupt firms approximately 2 years prior to their bankruptcy and for financially sound firms at about the same time. The data on four variables, X1 = (cash flow) / (total debt), X2 = (net income) / (total assets), X3 = (current assets) / (current liabilities) and X4 = (current assets) / (net sales), are stored in the file FINANICALDATA.TXT. In addition, a categorical variable Y identifies whether a firm is bankrupt (Y = 1) or non-bankrupt (Y = 2).

(a) Apply linear discriminant analysis (LDA) to classify the firms into a bankrupt group and a non-bankrupt group. Calculate the error rates with cross-validation and report the results.

(b) Apply quadratic discriminant analysis (QDA) to classify the firms, perform cross-validation and report the results.
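A minimal sketch of leave-one-out cross-validation for Gaussian LDA and QDA follows. It assumes priors proportional to the group sizes and uses synthetic, well-separated data in place of the file; `discriminant_predict` and `loo_error_rate` are hypothetical helper names, and a statistics package would normally be used instead.

```python
import numpy as np

def discriminant_predict(X_train, y_train, x_new, quadratic=False):
    """Classify x_new by Gaussian LDA (pooled covariance) or QDA (per-class)."""
    classes = np.unique(y_train)
    n = len(y_train)
    groups = [X_train[y_train == c] for c in classes]
    pooled = sum((len(g) - 1) * np.cov(g, rowvar=False) for g in groups)
    pooled = pooled / (n - len(classes))
    scores = []
    for g in groups:
        cov = np.cov(g, rowvar=False) if quadratic else pooled
        diff = x_new - g.mean(axis=0)
        # Log prior (proportional to group size) minus half the squared
        # Mahalanobis distance to the group mean.
        score = np.log(len(g) / n) - 0.5 * diff @ np.linalg.solve(cov, diff)
        if quadratic:
            score -= 0.5 * np.linalg.slogdet(cov)[1]
        scores.append(score)
    return classes[int(np.argmax(scores))]

def loo_error_rate(X, y, quadratic=False):
    """Leave-one-out cross-validated misclassification rate."""
    errors = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        predicted = discriminant_predict(X[keep], y[keep], X[i], quadratic)
        errors += int(predicted != y[i])
    return errors / len(y)

# Synthetic two-group data (hypothetical, NOT the financial data file).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 4)),
               rng.normal(6.0, 1.0, size=(20, 4))])
y = np.repeat([1, 2], 20)

print("LDA leave-one-out error:", loo_error_rate(X, y))
print("QDA leave-one-out error:", loo_error_rate(X, y, quadratic=True))
```

Leave-one-out is only one form of cross-validation; a k-fold split would change the loop in `loo_error_rate` but not the classifier.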

4. The distances between pairs of five items are as follows:

Cluster the five items using the single linkage, complete linkage, and average linkage hierarchical methods. Compare the results.
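The three linkage rules differ only in how the distance between two clusters is computed from the item-level distances. The sketch below is a naive pure-Python implementation; the matrix `D` is a hypothetical example, not the distance matrix of this question, and "average" here means the mean over all between-cluster pairs.

```python
from itertools import combinations

def agglomerate(dist, linkage="single"):
    """Naive agglomerative clustering on a symmetric distance matrix.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    def between(a, b):
        pairs = [dist[i][j] for i in a for j in b]
        if linkage == "single":
            return min(pairs)
        if linkage == "complete":
            return max(pairs)
        if linkage == "average":
            return sum(pairs) / len(pairs)
        raise ValueError(f"unknown linkage: {linkage}")

    clusters = [(i,) for i in range(len(dist))]
    history = []
    while len(clusters) > 1:
        # Merge the closest pair of clusters under the chosen linkage.
        a, b = min(combinations(clusters, 2), key=lambda pair: between(*pair))
        history.append((a, b, between(a, b)))
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
    return history

# Hypothetical distances between five items (NOT the assignment's matrix).
D = [
    [0, 2, 6, 10, 9],
    [2, 0, 5, 9, 8],
    [6, 5, 0, 4, 5],
    [10, 9, 4, 0, 3],
    [9, 8, 5, 3, 0],
]

for method in ("single", "complete", "average"):
    print(method, agglomerate(D, method))
```

For this made-up matrix all three methods merge the items in the same order and differ only in the merge heights; that need not hold for other distance matrices.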

5. Consider multivariate linear regression with the following data structure:

individual   Y1    Y2    ...   Yp    X1    X2    ...   Xk
1            y11   y12   ...   y1p   x11   x12   ...   x1k
2            y21   y22   ...   y2p   x21   x22   ...   x2k
...
n            yn1   yn2   ...   ynp   xn1   xn2   ...   xnk

The regression model is given as

$$Y_{n \times p} = X_{n \times k} B_{k \times p} + U_{n \times p},$$

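where the matrices Y, X, B and U are given as follows:

Here, for i = 1, ..., n, the vector of errors of observation i is $\varepsilon_i = (\varepsilon_{i1}, \varepsilon_{i2}, \dots, \varepsilon_{ip})^\top$, and we assume that $\varepsilon_1, \dots, \varepsilon_n \overset{\text{iid}}{\sim} N_p(0, \Sigma)$.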

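(a) We know that the maximum likelihood estimators of B and Σ are

$$\hat{B} = (X^\top X)^{-1} X^\top Y, \qquad \hat{\Sigma} = \frac{1}{n} \hat{U}^\top \hat{U}, \quad \text{where } \hat{U} = Y - X\hat{B}.$$

Calculate the maximum value of the log-likelihood function

$$\begin{aligned}
\ell(B, \Sigma) &= -\frac{np}{2} \log(2\pi) - \frac{n}{2} \log|\Sigma| - \frac{1}{2} \operatorname{tr}\left[(Y - XB) \Sigma^{-1} (Y - XB)^\top\right] \\
&= -\frac{np}{2} \log(2\pi) - \frac{n}{2} \log|\Sigma| - \frac{1}{2} \operatorname{tr}\left[\Sigma^{-1} (Y - XB)^\top (Y - XB)\right].
\end{aligned}$$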
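A numerical sanity check can help when working through (a). The sketch below evaluates the log-likelihood above on random data (the dimensions and seed are arbitrary assumptions) and inspects the trace term at the maximum likelihood estimates. It is a check on the algebra, not a substitute for the derivation.

```python
import numpy as np

def log_likelihood(Y, X, B, Sigma):
    """Log-likelihood of Y = XB + U with iid N_p(0, Sigma) rows of U."""
    n, p = Y.shape
    R = Y - X @ B
    _, logdet = np.linalg.slogdet(Sigma)
    trace_term = np.trace(np.linalg.solve(Sigma, R.T @ R))
    return -0.5 * n * p * np.log(2 * np.pi) - 0.5 * n * logdet - 0.5 * trace_term

# Random data with arbitrary dimensions, for illustration only.
rng = np.random.default_rng(3)
n, k, p = 15, 3, 2
X = rng.normal(size=(n, k))
Y = rng.normal(size=(n, p))

B_hat = np.linalg.solve(X.T @ X, X.T @ Y)  # (X'X)^{-1} X'Y
U_hat = Y - X @ B_hat                      # residual matrix
Sigma_hat = U_hat.T @ U_hat / n            # MLE of Sigma

at_mle = log_likelihood(Y, X, B_hat, Sigma_hat)
trace_at_mle = np.trace(np.linalg.solve(Sigma_hat, U_hat.T @ U_hat))
print("log-likelihood at the MLE:", at_mle)
print("trace term at the MLE:", trace_at_mle, "with n * p =", n * p)
```

Comparing the printed trace term with n · p suggests how the third term of the log-likelihood simplifies at the maximum.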

(b) Plug in the definition of $\hat{B}$ and express $\hat{U}$ as a matrix calculated based on X and Y. Calculate $X^\top \hat{U}$ and $\hat{U}^\top X$.

(c) Prove the identity

$$(Y - XB)^\top (Y - XB) = (Y - X\hat{B})^\top (Y - X\hat{B}) + (X\hat{B} - XB)^\top (X\hat{B} - XB).$$

Hint: by definition, $Y - X\hat{B} = \hat{U}$, and we have

$$(Y - XB)^\top (Y - XB) = (Y - X\hat{B} + X\hat{B} - XB)^\top (Y - X\hat{B} + X\hat{B} - XB).$$
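Before writing the proof, the identity in (c) and the products in (b) can be checked numerically. The sketch below uses random matrices with arbitrary dimensions (an assumption made for illustration); agreement on one example is evidence, not a proof.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, p = 12, 3, 2
X = rng.normal(size=(n, k))
Y = rng.normal(size=(n, p))
B = rng.normal(size=(k, p))                # an arbitrary coefficient matrix

B_hat = np.linalg.solve(X.T @ X, X.T @ Y)  # (X'X)^{-1} X'Y
U_hat = Y - X @ B_hat                      # residual matrix

lhs = (Y - X @ B).T @ (Y - X @ B)
rhs = U_hat.T @ U_hat + (X @ B_hat - X @ B).T @ (X @ B_hat - X @ B)

print("X' U_hat is numerically zero:", np.allclose(X.T @ U_hat, 0.0))
print("identity holds numerically:", np.allclose(lhs, rhs))
```

The first check shows why the cross terms vanish when the right-hand side of the hint is expanded.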

6. Consider p random variables $X_1, \dots, X_p$. Suppose that $Y_1, \dots, Y_p$ are the first to the p-th population principal components of $X_1, \dots, X_p$.

(a) What are the population principal components of the random variables $Y_1, \dots, Y_p$? Why?

(b) Suppose that the population covariance matrix of $(X_1, \dots, X_p)^\top$ is $\Sigma$ and its eigenvalue decomposition is

$$\Sigma = \sum_{i=1}^{p} \lambda_i \alpha_i \alpha_i^\top,$$

where $\alpha_1, \dots, \alpha_p$ are orthogonal unit vectors. What is the covariance between $X_1$ and $Y_1$?
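The quantities in Question 6 can be explored numerically for a concrete covariance matrix. The sketch below uses a hypothetical 3 × 3 matrix `Sigma` and relies on the standard facts that, if the component vector is Y = A^T X with A = (α_1, ..., α_p), then Cov(Y) = A^T Σ A and Cov(X, Y) = Σ A.

```python
import numpy as np

# Hypothetical 3 x 3 population covariance matrix, for illustration only.
Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])

values, vectors = np.linalg.eigh(Sigma)
order = np.argsort(values)[::-1]
lam, alpha = values[order], vectors[:, order]  # lam[0] >= lam[1] >= lam[2]

cov_yy = alpha.T @ Sigma @ alpha  # covariance matrix of the components Y
cov_xy = Sigma @ alpha            # entry (i, j) is Cov(X_i, Y_j)

print("Cov(Y):")
print(np.round(cov_yy, 6))
print("Cov(X1, Y1):", cov_xy[0, 0])
print("lambda_1 * (first entry of alpha_1):", lam[0] * alpha[0, 0])
```

The printed Cov(Y) is diagonal, which is the starting point for (a), and the last two lines agree, which suggests the form of the answer to (b).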

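7. Consider a k-class classification task with $n_i$ observations in class i, i = 1, ..., k. Define matrices

$$H = \sum_{j=1}^{k} n_j (\bar{x}_{\cdot j} - \bar{x}_{\cdot\cdot})(\bar{x}_{\cdot j} - \bar{x}_{\cdot\cdot})^\top, \qquad E = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_{\cdot j})(x_{ij} - \bar{x}_{\cdot j})^\top, \qquad S = \frac{E}{n - k}.$$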

In LDA for multiclass classification, we consider the eigenvalue decomposition

$$E^{-1} H a_i = \ell_i a_i, \quad i = 1, \dots, s, \quad s = \operatorname{rank}(E^{-1} H),$$

where $a_1, \dots, a_s$ satisfy $a_i^\top S a_i = 1$ and $a_i^\top S a_{i'} = 0$ for all $i, i' = 1, \dots, s$, $i \neq i'$.

(a) While the above definitions were introduced in the case of multiclass classification (k > 2), we may check to what extent these definitions are reasonable in binary classification (k = 2). In this case, we have the sample means within class 1 and class 2 as $\bar{x}_{\cdot 1}$ and $\bar{x}_{\cdot 2}$ respectively. Can you calculate the overall mean $\bar{x}_{\cdot\cdot}$ based on $\bar{x}_{\cdot 1}$, $\bar{x}_{\cdot 2}$ and $n_1$, $n_2$?

(b) For k = 2, express H as a matrix calculated based on $\bar{x}_{\cdot 1}$, $\bar{x}_{\cdot 2}$ and $n_1$, $n_2$.

(c) What is the rank of the matrix H when k = 2?

(d) We mentioned in the lecture that we can simply use one Fisher discriminant function for binary classification. Can we adopt the definitions above to define more than one Fisher discriminant function for binary classification? Why?
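The between-class matrix H can be computed directly from its definition and inspected for k = 2. The sketch below uses random data with group sizes 10 and 15 (arbitrary assumptions); `between_class_matrix` is a hypothetical helper name.

```python
import numpy as np

def between_class_matrix(X, y):
    """H = sum_j n_j (xbar_j - xbar)(xbar_j - xbar)^T, as defined in Question 7."""
    overall = X.mean(axis=0)
    H = np.zeros((X.shape[1], X.shape[1]))
    for label in np.unique(y):
        group = X[y == label]
        d = group.mean(axis=0) - overall
        H += len(group) * np.outer(d, d)
    return H

# Random two-class data, for illustration only.
rng = np.random.default_rng(4)
X = rng.normal(size=(25, 3))
y = np.array([1] * 10 + [2] * 15)

H = between_class_matrix(X, y)
print(H)
print("numerical rank of H:", np.linalg.matrix_rank(H))
```

The numerical rank reported here is a useful cross-check for (c), and it bounds the number of nonzero eigenvalues of $E^{-1} H$ relevant to (d).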
