Student Number
Semester 2 Assessment, 2019
School of Mathematics and Statistics
MAST90083 Computational Statistics and Data Mining
Writing time: 3 hours
Reading time: 15 minutes
This is NOT an open book exam
This paper consists of 3 pages (including this page)
Authorised Materials
• Mobile phones, smart watches and internet or communication devices are forbidden.
• No handwritten or printed materials may be brought into the exam venue.
• This is a closed book exam.
• No calculators of any kind may be brought into the examination.
Instructions to Students
• You must NOT remove this question paper at the conclusion of the examination.
Instructions to Invigilators
• Students must NOT remove this question paper at the conclusion of the examination.
This paper must NOT be held in the Baillieu Library
MAST90083 Semester 2, 2019
Question 1 Suppose we have a model p(x, z | θ) where x is the observed dataset and z are the
latent variables.
(a) Suppose that q(z) is a distribution over z. Explain why the following
$$F(q, \theta) = \mathbb{E}_q\left[\log p(x, z \mid \theta) - \log q(z)\right]$$
is a lower bound on log p(x | θ).
(b) Show that F(q, θ) can be decomposed as follows
$$F(q, \theta) = -\mathrm{KL}\left(q(z) \,\|\, p(z \mid x, \theta)\right) + \log p(x \mid \theta)$$
where, for any two distributions p and q,
$$\mathrm{KL}(q \,\|\, p) = -\mathbb{E}_q\left[\log \frac{p(z)}{q(z)}\right]$$
is the Kullback-Leibler (KL) divergence.
(c) Describe the EM algorithm in terms of F(q, θ).
(d) Note that the KL divergence is always non-negative. Furthermore, it is zero if and only if
p = q. Conclude that, for fixed θ, the optimal q that maximises F(q, θ) is p(z | x, θ).
[10 + 10 + 5 + 5 = 30 marks]
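The quantities in Question 1 can be checked numerically on a toy model. The following is a minimal sketch, not a model answer: it assumes a two-component one-dimensional Gaussian mixture with unit variances (the mixture, the synthetic data, and all function names are illustrative assumptions, not part of the paper). It evaluates F(q, θ) for an arbitrary q, compares it with log p(x | θ) minus the KL term, and then runs EM as alternating maximisation of F over q and over θ.

```python
import numpy as np

def log_joint(x, weights, mu):
    """log p(x_i, z_i = k | theta), shape (n, K); unit variances are assumed."""
    return (np.log(weights)[None, :] - 0.5 * np.log(2 * np.pi)
            - 0.5 * (x[:, None] - mu[None, :]) ** 2)

def log_evidence(x, weights, mu):
    """log p(x | theta), summing out z with a stable log-sum-exp."""
    lj = log_joint(x, weights, mu)
    m = lj.max(axis=1, keepdims=True)
    return float(np.sum(m[:, 0] + np.log(np.exp(lj - m).sum(axis=1))))

def posterior(x, weights, mu):
    """p(z_i = k | x_i, theta), shape (n, K)."""
    lj = log_joint(x, weights, mu)
    r = np.exp(lj - lj.max(axis=1, keepdims=True))
    return r / r.sum(axis=1, keepdims=True)

def safe_log(a):
    # Avoids log(0); terms with q = 0 then contribute exactly 0 to the sums.
    return np.log(np.clip(a, 1e-300, None))

def free_energy(q, x, weights, mu):
    """F(q, theta) = E_q[log p(x, z | theta) - log q(z)] for a factorised q."""
    return float(np.sum(q * (log_joint(x, weights, mu) - safe_log(q))))

def kl(q, p):
    """KL(q || p), summed over data points."""
    return float(np.sum(q * (safe_log(q) - safe_log(p))))

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 50), rng.normal(3.0, 1.0, 50)])
weights, mu = np.array([0.5, 0.5]), np.array([-1.0, 1.0])

# An arbitrary distribution q over z (softmax of random scores).
scores = rng.normal(size=(x.size, 2))
q = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

F = free_energy(q, x, weights, mu)
gap = log_evidence(x, weights, mu) - F            # should be >= 0
decomp_error = F - (log_evidence(x, weights, mu)
                    - kl(q, posterior(x, weights, mu)))

# EM as coordinate ascent on F.
history = []
for _ in range(20):
    q = posterior(x, weights, mu)                  # E-step: maximise F over q
    history.append(free_energy(q, x, weights, mu)) # equals log p(x | theta) here
    weights = q.mean(axis=0)                       # M-step: maximise F over theta
    mu = (q * x[:, None]).sum(axis=0) / q.sum(axis=0)
```

In this sketch `gap` is non-negative, `decomp_error` is zero up to rounding, and `history` is non-decreasing, which is what parts (a) to (d) suggest should happen.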
Question 2 Let $\{(x_i, y_i)\}_{i=1}^{n}$ be our dataset, with $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$. Classic linear regression
can be posed as empirical risk minimisation, where the model is to predict y using a class of
functions $f(x) = w^T x$, parametrised by a vector $w \in \mathbb{R}^p$, using the squared loss, i.e. we minimise
$$\sum_{i=1}^{n} \left(y_i - w^T x_i\right)^2.$$
(a) Show that the optimal parameter vector is
$$\hat{w}_n = (X^T X)^{-1} X^T Y$$
where X is an n × p matrix with i-th row given by $x_i^T$, and Y is an n × 1 column vector with
i-th entry $y_i$.
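(b) Consider regularising the empirical risk by incorporating an $\ell_2$ penalty. That is, find w
minimising
$$\sum_{i=1}^{n} \left(y_i - w^T x_i\right)^2 + \lambda \|w\|_2^2.$$
Show that the optimal parameter is given by the ridge regression estimator
$$\hat{w}_n^{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T Y.$$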
(c) Suppose we now wish to introduce nonlinearities into the model, by transforming x to
φ(x). Let Φ be a matrix with i-th row given by $\phi(x_i)^T$.
(i) Show that the optimal parameters would be given by
$$\hat{w}_n^{\text{kernel}} = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T Y$$
(ii) Express the predicted y values on the training set, $\Phi \hat{w}_n^{\text{kernel}}$, only in terms of Y and
the Gram matrix $K = \Phi \Phi^T$, with $K_{ij} = \phi(x_i)^T \phi(x_j) = k(x_i, x_j)$, where k is some
kernel function. (This is known as the kernel trick.) Hint: You will find the following
matrix inversion formula useful:
(iii) Compute an expression for the value $y^*$ predicted by the model at an unseen test
vector $x^*$.
[5+5+5+10+5 = 30 marks]
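The estimators in Question 2 can be compared numerically. The sketch below is illustrative only and rests on several assumptions that are not in the paper: the objective is taken to be the sum of squared errors plus λ‖w‖², the feature map `phi` is a hypothetical choice (inputs and their squares), the data are synthetic, and the identity used is $(\Phi^T \Phi + \lambda I)^{-1} \Phi^T = \Phi^T (\Phi \Phi^T + \lambda I)^{-1}$, which may or may not be the formula the hint refers to. Under those assumptions, the primal (weight-space) and dual (Gram-matrix) forms should give the same fitted values and the same prediction at a test point.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 30, 3, 0.5

# Synthetic data (an assumption made purely for illustration).
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

def phi(X):
    """Hypothetical feature map: the original inputs plus their squares."""
    return np.hstack([X, X ** 2])

Phi = phi(X)                     # n x d design matrix in feature space
d = Phi.shape[1]

# Primal form: w = (Phi^T Phi + lam I)^{-1} Phi^T Y.
w_hat = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ Y)

# Dual form: alpha = (K + lam I)^{-1} Y, where K = Phi Phi^T is the Gram matrix.
K = Phi @ Phi.T
alpha = np.linalg.solve(K + lam * np.eye(n), Y)

# Fitted values on the training set, computed both ways.
fitted_primal = Phi @ w_hat
fitted_dual = K @ alpha

# Prediction at an unseen point x*, using only kernel evaluations k(x_i, x*).
x_star = rng.normal(size=(1, p))
k_star = Phi @ phi(x_star).ravel()
y_star_primal = float(phi(x_star).ravel() @ w_hat)
y_star_dual = float(k_star @ alpha)

print(np.max(np.abs(fitted_primal - fitted_dual)))
print(abs(y_star_primal - y_star_dual))
```

The dual form touches the features only through inner products, so `K` and `k_star` could be replaced by evaluations of any valid kernel function without forming φ explicitly.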
Total marks = 60
End of Exam