联系方式

  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-21:00
  • 微信:codinghelp

您当前位置:首页 >> Python编程Python编程

日期:2021-10-01 10:57

Instructions: Submit two files. One should be a write-up of all solutions and observations, as LastnameFirstnameSolution.pdf.

The second should be an archive LastnameFirstnameCode.zip containing code and any results

files.

1 [5 pts.] Irreducible data example

In class we discussed that not all datasets’ dimensionality can be successfully reduced using PCA.

(a)[1 pts] Discuss the cases when PCA will fail.

(b)[2 pts] How do we quantify that it fails?

(c)[2 pts] Provide an example dataset of 2D points (specify the points as vectors of numbers) in which PCA will

not work well for dimensionality reduction. Explain why. Hint: Think of 2D points and reduction to 1D.

2 [40 pts] Dimensionality reduction

For this question you can use the cloud.data.pdf from the UCI ML repository: https://archive.ics.uci.edu/ml/

datasets/Cloud. Read about it to get familiar with what is measured. Within the data, there are two datasets: DB

#1 and DB #2. For this homework, just use the 1024 vectors in DB #1. Use python for all your programming. You

will have to submit your code in LastnameFirstnameCode.zip together with the relevant write-up in the main

solution file LastnameFirstnameSolution.pdf.

(a)[5 pts] Load the data into a python program and center it. Note: there should be a function called center()

in your code that achieves this.

(b)[5 pts] Compute the covariance matrix of the data Σ. Hint: by using the definition of sample covariance, as a

matrix product or as a sum of outer products. See book for details. Use Numpy for linear algebra computations

(https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.linalg.html).As a result you should have a function

covar() in your code which does not use the built-in covariance functions.

(c)[5 pts] Compute the eigenvectors and eigenvalues of Σ. The numpy linear algebra module referenced above has a

function that can help.

(d)[10 pts] Plot the percentage of retained variance as a function of the number of principle components used.

Determine the number of principal components (PCs) r that will ensure 90% retained variance? How did you

compute this? Provide a function in your code that determines r based on an arbitrary percentage α of retained

variance.

(e)[10 pts] Compute the reduced dimension data matrix A with two dimensions by projection on the first two PCs.

Plot the points using a scatter plot (two dimensional diagram that places each sample i according to its new

dimensions ai1, ai2). Discuss the observations. Are there clusters of close-by points? Argue for or against whether

1

these are sufficient dimensions.

(f)[5 pts] Study the PCA implementation in pythons’ sklearn library https://scikit-learn.org/stable/

modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA. Do PCA using the library

on the same data. Do the eigenvalues approximately match to what you computed above?

3 [20 pts.] Kernel methods

Consider the problem of finding the most dissimilar diametric pair (MDDP): this is a pair of data points that are

dissimilar from the mean and also dissimilar from each other. Below is an algorithm that would find such a pair

given a data matrix D:

Algorithm 1: MDDP(D)

Result: a, b - the most dissimilar diametric points in D

Compute the data mean μ = mean(D);

s = +∞;

for i in (1 . . . n) do

for j in (i + 1 . . . n) do

temp = x

T

i μ + x

T

j μ + x

T

i xj ;

if temp < s then

s = temp;

a = xi

;

b = xj ;

end

end

end

The algorithm computes the sum of inner products x

T

i μ + x

T

j μ + x

T

i xj for each pair of points and returns the

pair with the lowest such quantity.

(a)[5 pts] Demonstrate the execution of this algorithm on the following data matrix of 2D instances: D=

?

???

0 1

1 3

5 0

2 4

?

???.

Show the steps and the resulting MDD pair of points.

(b)[15 pts] As we discussed in class sometimes we would like to kernelize methods to handle non-linearity in data.

Provide a pseudo-code for a kernel version of the MDDP algorithm above. The goal is to kernelize the algorithm

for an arbitrary kernel Hint: Assume that you can compute the kernel matrix K, corresponding to

some mapping φ() and then use the basic kernel operations we discussed in class and also in the

book, to derive the steps of MDDP in terms of elements in K.

4 [10 pts.] Orthogonality of Error in Regression:

Prove that Y? T

 = 0, where Y? is the predicted response and  = Y ?Y? is the error between the actual and predicted

response. Hint: Use the solution for the predicted response as a transformation of Y through the

hat matrix.

5 [25 pts.] Regression to understand CO2 pollution:

For this task, you will use data about CO2 emissions in European cities included as an excel sheet. The data has

total emissions per city (Column E) as well as other variables including number of airports, buildings, etc

(columns G and later). Use only data for cities (Column C: admin level=8). Feel free to prepare the data into

simpler text format before you use code.

The goal will be to train linear regression models of the total CO2 as a function of the predictor variables in

columns G and on.

2

(a) [7pts] Use scikitlearn’s linear regression: linear model.LinearRegression to learn an unregularized model.

Report the corresponding coefficient for each predictor variable.

(b) [7pts] Use scikitlearn’s ridge regression: linear model.Ridge to learn a regularized model. Try 2 values for

alpha: 1 and 10. Report the coefficients of the predictor variables for the two models.

(c) [11pts] Discuss your findings. Which are the most important factors determining the CO2 emissions? How

do the loading coefficients for factors differ between regularized and non-regularized models? How does the

residual error differ? Use visuazlizations as necessary to discuss your findings.

3


版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。 站长地图

python代写
微信客服:codinghelp