
COMS 4771 SU19 HW4

Due: Mon Aug 12, 2019 at 11:59pm

You are allowed to write up solutions in groups of (at most) two students. These group members do not necessarily have to be the same as for previous homeworks. Only one submission per group is required by the due date on Gradescope. The name and UNI of all group members must be clearly specified on the homework. No late homeworks are allowed. To receive credit, a typeset copy of the homework pdf must be uploaded to Gradescope by the due date. You must show your work to receive full credit. Discussing possible solutions for homework questions is encouraged on Piazza and with peers outside your group, but every group must write their own individual solutions. You must cite all external references you used (including the names of individuals you discussed the solutions with) to complete the homework.

1 [Estimating parameters with complete and incomplete data] Consider the data generation process for an observation pair (a, b) as follows:

– a is the outcome of an independent six-faced (possibly loaded) die roll. That is, the chance of rolling face '1' is p_1, of rolling face '2' is p_2, etc., with a total of six distinct possibilities.

– Given the outcome a, b is drawn independently from the exponential density q_a e^{-q_a b} (where q_a > 0).
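
For concreteness, here is a minimal Python sketch of this generative process; the particular values of p and q below are illustrative placeholders, not part of the problem:

import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.1, 0.1, 0.2, 0.2, 0.2, 0.2])   # die-face probabilities p_1..p_6 (sum to 1)
q = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])   # rates q_1..q_6, all > 0

def sample_pair():
    a = rng.choice(6, p=p) + 1                 # roll the (possibly loaded) die: a in {1,...,6}
    b = rng.exponential(scale=1.0 / q[a - 1])  # b has density q_a * exp(-q_a * b)
    return a, b

pairs = [sample_pair() for _ in range(1000)]   # n independent runs of the process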

(i) List all the parameters of this process. We shall denote the collection of all the parameters as the variable θ (the parameter vector).

(ii) Suppose we run this process n times independently, and get the sequence (a_1, b_1), (a_2, b_2), . . . , (a_n, b_n). What is the likelihood that this sequence was generated by a specific setting of the parameter vector θ?

(iii) What is the most likely setting of the parameter vector θ given the complete observation sequence (a_i, b_i)_{i=1}^n? That is, find the Maximum Likelihood Estimate of θ given the observations.

(iv) What is the probability of the partial (incomplete) observation b_1, b_2, . . . , b_n given a specific setting of the parameter vector θ?

(v) Derive the Expectation Maximization (EM) algorithm to estimate the parameters given the incomplete observation sequence (b_i)_{i=1}^n.
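
As a sanity check for whatever update equations you derive, the following Python sketch implements one common form of EM for this model (a six-component mixture of exponentials); the initialization below is an arbitrary assumption:

import numpy as np

def em(b, n_iters=100):
    # b: array of incomplete observations b_1, ..., b_n
    b = np.asarray(b, dtype=float)
    n, k = len(b), 6
    p = np.full(k, 1.0 / k)            # mixing weights (die-face probabilities)
    q = np.linspace(0.5, 3.0, k)       # exponential rates, arbitrary positive start
    for _ in range(n_iters):
        # E-step: responsibility of face j for observation i,
        # gamma_ij proportional to p_j * q_j * exp(-q_j * b_i)
        w = p * q * np.exp(-np.outer(b, q))       # shape (n, k)
        gamma = w / w.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from expected counts
        Nj = gamma.sum(axis=0)
        p = Nj / n
        q = Nj / (gamma * b[:, None]).sum(axis=0)
    return p, q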


2 [kernelizing k-means] k-means with the Euclidean distance metric assumes that each pair of clusters is linearly separable. This may not be the case. A classic example is where we have two clusters corresponding to data points on two concentric circles in R^2.

(i) Implement Lloyd's method for the k-means algorithm and show the resulting cluster assignment for the concentric-circles dataset described above. Give two more examples of datasets in R^2 for which the optimal k-means setting results in an undesirable clustering. Show the resulting cluster assignment for the two additional example datasets.

Let φ denote an explicit non-linear feature space mapping in some inner product space. We will show how one can derive an implicit kernel-based version of Lloyd's method for k-means, where we only operate on data as φ(x_i) · φ(x_j) = K(x_i, x_j).

(ii) Let z_ij be an indicator that is equal to 1 if φ(x_i) is currently assigned to the jth cluster and 0 otherwise (1 ≤ i ≤ n and 1 ≤ j ≤ k). Show that the jth cluster center c_j can be written as Σ_{i=1}^n α_ij φ(x_i), where α_ij only depends on the z_ij variables.

(iii) Given any two data points φ(x_i) and φ(x_j), show that the squared distance ‖φ(x_i) − φ(x_j)‖^2 can be computed using only (linear combinations of) inner products.

(iv) Given the results of parts (ii) and (iii), show how to compute the squared distance ‖φ(x_i) − c_j‖^2 using only (linear combinations of) inner products between the data points x_1, ..., x_n.

(v) Using the results of parts (ii), (iii), and (iv), propose an algorithm for a kernelized version of Lloyd's method for finding k-means.

(vi) Implement your proposed kernelized k-means method and run it on the three example datasets of part (i). Compare the resulting clusters for various choices of kernels (e.g. linear kernel, quadratic kernel, RBF kernel); one possible implementation is sketched after this problem.

(submit your datasets and kernelized code on Courseworks to receive full credit)
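
The sketch below (Python) shows one way such a kernelized Lloyd's method could look; the kernel choice, initialization, and stopping rule are illustrative assumptions, not requirements of the assignment. It relies on expanding ‖φ(x_i) − c_j‖^2 into K_ii − (2/|C_j|) Σ_{l∈C_j} K_il + (1/|C_j|^2) Σ_{l,m∈C_j} K_lm, in the spirit of parts (ii)-(iv).

import numpy as np

def rbf_gram(X, gamma=1.0):
    # Gram matrix K[i, l] = exp(-gamma * ||x_i - x_l||^2)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def kernel_kmeans(K, k, n_iters=50, seed=0):
    # Lloyd's method expressed purely in terms of the Gram matrix K
    n = K.shape[0]
    z = np.random.default_rng(seed).integers(0, k, size=n)  # random initial assignment
    for _ in range(n_iters):
        dist = np.full((n, k), np.inf)
        for j in range(k):
            idx = np.flatnonzero(z == j)
            if len(idx) == 0:
                continue   # empty cluster: leave its distances at infinity
            # squared distance ||phi(x_i) - c_j||^2 via the kernel expansion above
            dist[:, j] = (np.diag(K)
                          - 2.0 * K[:, idx].mean(axis=1)
                          + K[np.ix_(idx, idx)].mean())
        z_new = dist.argmin(axis=1)
        if np.array_equal(z_new, z):
            break          # assignments stable: converged
        z = z_new
    return z

With the linear kernel K(x_i, x_j) = x_i · x_j this reduces to ordinary Lloyd's method, while an RBF kernel with a suitable bandwidth can separate the concentric-circles data.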

3 [Randomized decision rules] Consider a "reward-based" prediction framework, where given a continuous state-space X and a discrete action-space A, a learning algorithm f : X → A decides to perform action a ∈ A when it is in state x ∈ X. The learner then receives a numerical (possibly negative) reward R(x, a) for performing action a ∈ A in state x ∈ X. (Note: such reward-based frameworks are common in robotics, where the learning agent, i.e., the robot, performs some (possibly randomized) action a in some situation x and receives reward R(x, a). The goal of the learning agent is to maximize its expected reward.)

(i) Assume that f is allowed to take a randomized action in any state x. That is, in any state x ∈ X, f chooses to perform action a with probability P[a|x]. Show that the average (over the states) expected reward received by f, i.e., E_{x,f(x)}[R(x, f(x))], is

∫ [ Σ_a R(x, a) P[a|x] ] P[x] dx.

(ii) Show that the expected reward is maximized by choosing P[a|x] = 1 for the action a associated with the maximal reward R(x, a) (for a given state x). This shows us that there is no benefit in randomizing the best decision rule. (A worked version of the key inequality appears after part (iii).)

(iii) Can one benefit from randomizing a suboptimal rule? Briefly explain.
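
For part (ii), the pointwise bound below is the heart of the argument (a standard step, written out here in LaTeX as a hint rather than a full solution):

\[
\sum_{a} R(x,a)\, P[a \mid x] \;\le\; \sum_{a} \Big( \max_{a'} R(x,a') \Big) P[a \mid x] \;=\; \max_{a'} R(x,a'),
\]

with equality exactly when P[·|x] puts all of its mass on an action attaining the maximum; since the bound holds pointwise in x, integrating against P[x] preserves it.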


4 [From distances to embeddings] Your friend from overseas is visiting you and asks you for the geographical locations of popular US cities on a map. Not having access to a US map, you realize that you cannot provide your friend accurate information. You recall that you have access to the relative distances between nine popular US cities, given by the following distance matrix D:

Distances (D)    BOS   NYC    DC   MIA   CHI   SEA    SF    LA   DEN
BOS                0   206   429  1504   963  2976  3095  2979  1949
NYC              206     0   233  1308   802  2815  2934  2786  1771
DC               429   233     0  1075   671  2684  2799  2631  1616
MIA             1504  1308  1075     0  1329  3273  3053  2687  2037
CHI              963   802   671  1329     0  2013  2142  2054   996
SEA             2976  2815  2684  3273  2013     0   808  1131  1307
SF              3095  2934  2799  3053  2142   808     0   379  1235
LA              2979  2786  2631  2687  2054  1131   379     0  1059
DEN             1949  1771  1616  2037   996  1307  1235  1059     0

Being a machine learning student, you believe that it may be possible to infer the locations of these cities from the distance data. To find an embedding of these nine cities on a two-dimensional map, you decide to solve it as an optimization problem as follows.

You associate a two-dimensional variable x_i as the unknown latitude and longitude value for each of the nine cities (that is, x_1 is the lat/lon value for BOS, x_2 is the lat/lon value for NYC, etc.). You write down the (unconstrained) optimization problem

minimize_{x_1,...,x_9} Σ_{i,j} ( ‖x_i − x_j‖ − D_ij )^2,

where the objective Σ_{i,j} ( ‖x_i − x_j‖ − D_ij )^2 denotes the embedding discrepancy function.

(i) What is the derivative of the discrepancy function with respect to a location x_i?

(ii) Write a program in your preferred language to find an optimal setting of locations x_1, . . . , x_9. You must submit your code to Courseworks to receive full credit. (A gradient-descent sketch in Python appears after part (iii).)

(iii) Plot the result of the optimization showing the estimated locations of the nine cities. (Here is sample code to plot the city locations in Matlab.)

>> cities={'BOS','NYC','DC','MIA','CHI','SEA','SF','LA','DEN'};
>> locs = [x1;x2;x3;x4;x5;x6;x7;x8;x9];
>> figure; text(locs(:,1), locs(:,2), cities);

What can you say about your estimated locations compared to the actual geographical locations of these cities?
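
For part (ii), here is a minimal Python sketch of one possible approach: plain gradient descent on the discrepancy function, using dF/dx_i = 4 Σ_j (‖x_i − x_j‖ − D_ij)(x_i − x_j)/‖x_i − x_j‖ for the ordered double sum (the constant factor depends on whether pairs are counted once or twice; it does not affect the minimizer). The step size, iteration count, and random initialization are arbitrary assumptions.

import numpy as np
import matplotlib.pyplot as plt

cities = ['BOS', 'NYC', 'DC', 'MIA', 'CHI', 'SEA', 'SF', 'LA', 'DEN']
D = np.array([
    [   0,  206,  429, 1504,  963, 2976, 3095, 2979, 1949],
    [ 206,    0,  233, 1308,  802, 2815, 2934, 2786, 1771],
    [ 429,  233,    0, 1075,  671, 2684, 2799, 2631, 1616],
    [1504, 1308, 1075,    0, 1329, 3273, 3053, 2687, 2037],
    [ 963,  802,  671, 1329,    0, 2013, 2142, 2054,  996],
    [2976, 2815, 2684, 3273, 2013,    0,  808, 1131, 1307],
    [3095, 2934, 2799, 3053, 2142,  808,    0,  379, 1235],
    [2979, 2786, 2631, 2687, 2054, 1131,  379,    0, 1059],
    [1949, 1771, 1616, 2037,  996, 1307, 1235, 1059,    0],
], dtype=float)

def grad(X):
    # gradient of sum_{i,j} (||x_i - x_j|| - D_ij)^2 with respect to each row x_i
    G = np.zeros_like(X)
    for i in range(len(X)):
        diff = X[i] - X                          # x_i - x_j for all j, shape (9, 2)
        dist = np.linalg.norm(diff, axis=1)      # ||x_i - x_j||
        coef = np.zeros_like(dist)
        nz = dist > 0                            # skip the j = i term
        coef[nz] = (dist[nz] - D[i, nz]) / dist[nz]
        G[i] = 4 * (coef[:, None] * diff).sum(axis=0)
    return G

rng = np.random.default_rng(0)
X = rng.normal(scale=1000.0, size=(9, 2))        # random start at roughly mile scale
for _ in range(5000):
    X -= 1e-3 * grad(X)                          # fixed small step size

plt.figure()
for (x, y), name in zip(X, cities):
    plt.text(x, y, name)                         # Python analogue of the Matlab snippet above
plt.xlim(X[:, 0].min() - 300, X[:, 0].max() + 300)
plt.ylim(X[:, 1].min() - 300, X[:, 1].max() + 300)
plt.show()

Since the discrepancy is invariant to translations, rotations, and reflections of the embedding, the recovered map may appear shifted, rotated, or mirrored relative to the true geography, which is worth keeping in mind for part (iii).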


