Questions 1-3 consider a set of data from a crowd-sourced lending service. It has attributes
of 5425 loans that were “Charged Off” (not paid in full) and an equal number that were paid
in full. There are 10 additional variables:
Funded_amnt: the amount of money lent.
Loan_amnt: the amount of money requested.
Dti: debt to income ratio, excluding mortgage and proposed loan.
Emp_length: the number of years the borrower has been employed.
Installment: the monthly payment.
Annual_inc: borrower’s annual income.
Revol_bal: the balance on all the borrower’s revolving credit accounts.
Earlyear: the year in which the borrower first borrowed money.
Proputil: the proportion of the borrower’s maximum revolving credit being utilized (this is
actually given as a percentage, a number between 0 and 100).
Open_acc: the number of credit accounts the borrower has opened over the years.
Q1 First we perform a principal components analysis of the 10 predictor variables, after
scaling.
a) 2 marks The eigenvalues of the correlation matrix are:
3.55 1.53 1.26 1.00 0.94 0.67 0.48 0.47 0.08 0.02
Sketch the scree plot.
b) 3 marks How many principal components do you suggest using? Explain your reasoning.
What proportion of total variability do they account for?
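The proportion of variability in part (b) follows directly from the eigenvalues in part (a): since the variables were scaled, the eigenvalues sum to the number of variables. A minimal sketch of that arithmetic is given here; the function name `cumulative_proportions` is illustrative only and is not part of the original paper.

```python
# Eigenvalues of the correlation matrix, copied from part (a).
eigenvalues = [3.55, 1.53, 1.26, 1.00, 0.94, 0.67, 0.48, 0.47, 0.08, 0.02]


def cumulative_proportions(values):
    """Return the cumulative share of total variance for the first k components."""
    total = sum(values)
    running = 0.0
    shares = []
    for value in values:
        running += value
        shares.append(running / total)
    return shares


cum_props = cumulative_proportions(eigenvalues)
for k, share in enumerate(cum_props, start=1):
    print(f"first {k} component(s): {share:.3f}")
```

Reading the printed table alongside the scree plot is one way to justify a choice of the number of components.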
STATS 302
Page 3 of 9
c) 2 marks The loadings of two different three component solutions, varimax rotated and
unrotated, have been given below. Loadings below 0.2 have been suppressed. What is the
purpose of rotation? Has that been achieved here?
Unrotated:
Rotated:
d) 1 mark In both sets of loadings, whenever “earlyear” appears, it has an opposite sign to
“emp_length.” Explain why this makes sense.
e) 2 marks Suppose we decide to go with a two component rather than three component
solution. Will the loadings of the first two components change for either the rotated or
unrotated solution? Explain why.
Q2 We now wish to perform MANOVA to test whether there are differences between the
charged off and fully paid groups.
a) 4 marks Consider the following diagnostics designed to evaluate the MANOVA
assumptions. State these assumptions, and your conclusion about how well these are satisfied,
referencing specific portions of the output.
b) 4 marks Below are two p-values for the observed Pillai’s trace: one from comparing it to
the appropriate F distribution, and one from comparing it to a permutation distribution. Is
either of them adequate for summarizing a test for a difference between the means of the two
groups? Explain.
Observed Pillai’s trace    P-value from F distribution    P-value from permutation
0.045973                   < 2.2e-16 ***                  0.001
Q3. A linear discriminant analysis is performed using the 10 variables used in the
MANOVA.
a) 4 marks Below, see a table with the predicted classifications from a leave-one-out cross
validation, compared to the true classifications. What is the error rate? What is the purpose
of performing cross validation? What are the advantages and disadvantages of leave-one-out
vs 10-fold cross validation for this dataset?
b) 4 marks A table showing the loadings (correlations) of the original variables with the
LDA score is given below. Do any variables appear to be more important than the others in
predicting whether a loan will be charged off? Explain your reasoning. Based on these
loadings, do you expect charged off (unpaid) loans to have higher or lower LDA scores?
Explain.
c) 2 marks Consider an individual with LDA score 0.10. The height of the implied density of
the LDA score is 0.38 for the charged off category, and 0.40 for the fully paid category. In
the relevant population, the frequency of charged off loans is 14%. What is the posterior
probability that this individual’s loan will be charged off?
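The calculation in part (c) is an application of Bayes' rule with two classes: each class density is weighted by its prior frequency and the result is normalised. A minimal sketch using the numbers stated in the question is given here; the function name `posterior` and its parameter names are illustrative assumptions.

```python
def posterior(prior, density_target, density_other):
    """Two-class Bayes' rule: P(target | score) from a prior and two density heights."""
    numerator = prior * density_target
    denominator = numerator + (1.0 - prior) * density_other
    return numerator / denominator


# Values from the question: prior 14% charged off, densities 0.38 and 0.40.
p_charged_off = posterior(prior=0.14, density_target=0.38, density_other=0.40)
print(round(p_charged_off, 3))
```

Note that the prior here is the population frequency (14%), not the 50/50 split of the sample described at the start of the paper.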
Questions 4-6 concern bird sightings at 36 locations in Borneo. Counts of X species are
recorded at each location. The locations are of three different types: “P” for pristine (never
logged), “Q,” logged 8 years previously, and “R,” logged 4 years previously.
Q4
a) 2 marks The counts have been fourth root transformed, and the Bray Curtis distances
between them have then been computed. In what circumstances do we prefer the Bray Curtis distance?
What is the purpose of taking the quarter root of the counts?
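As a reminder of the quantities involved in part (a), the Bray Curtis distance between two count vectors is the sum of absolute differences divided by the sum of all counts. The sketch below applies it with and without the fourth root transform. The two count vectors are made up for illustration and do not come from the Borneo data; the function names are likewise assumptions.

```python
def bray_curtis(x, y):
    """Bray Curtis distance: sum of |x_i - y_i| divided by sum of (x_i + y_i)."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    return numerator / denominator


def fourth_root(counts):
    """Apply the fourth root transform to each count."""
    return [c ** 0.25 for c in counts]


# Hypothetical counts of four species at two locations (not from the paper).
site_a = [16, 0, 1, 81]
site_b = [1, 16, 0, 81]

raw_distance = bray_curtis(site_a, site_b)
transformed_distance = bray_curtis(fourth_root(site_a), fourth_root(site_b))
print(raw_distance, transformed_distance)
```

In this toy example the one very abundant species dominates the raw distance, while after the transform the rarer species contribute relatively more.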
b) 2 marks Below you see a plot showing the stress for non-metric multidimensional scalings
on 2-11 axes. Which number of axes do you prefer? Explain your reasoning.
c) 2 marks Explain the difference between metric and non-metric multidimensional scaling.
In what case is metric scaling equivalent to principal components analysis?
d) 4 marks A permanova has been performed to compare the sites. Explain how this
analysis works. Are there any important assumptions or caveats? Output is given below;
what is your conclusion?
Q5 We now compute two distances between each pair of bird species, one based on the bird
sightings, using the Bray-Curtis distance on the quarter root counts, and one based on the
birds’ divergent characteristics (food source, nesting preference etc.) using the Manhattan
metric.
a) 2 marks Compute the Manhattan distance between the two bird species below based on
the characteristics given.
       Canopy  Bark  Gleaning  insectivore  Foliage  Frugivore  Nectivore  Sallying  Raptor
Sp. 1  1       1     1         1            0        0          0          0         0
Sp. 2  0       0     1         1            1        0          0          0         0
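The Manhattan distance in part (a) is the sum of absolute differences across the characteristics. A minimal sketch using the two rows of the table above is given here; the function name `manhattan` is illustrative.

```python
def manhattan(x, y):
    """Manhattan (city block) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


# Characteristic indicators for the two species, copied from the table.
species_1 = [1, 1, 1, 1, 0, 0, 0, 0, 0]
species_2 = [0, 0, 1, 1, 1, 0, 0, 0, 0]

distance = manhattan(species_1, species_2)
print(distance)
```

For 0/1 indicators this is simply the number of characteristics on which the two species differ.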
b) 3 marks Output from a Mantel test is given below. Explain briefly what this procedure
tests, and how it works. What is the conclusion here?
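One common form of the Mantel procedure correlates the entries of two distance matrices and builds a reference distribution by permuting the object labels of one matrix. The sketch below follows that form under stated assumptions: it uses the Pearson correlation, a one-sided p-value, and two small made-up distance matrices. It is not the software that produced the output in the paper, and all names are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def lower_triangle(dist, order):
    """Below-diagonal entries of dist after relabelling objects by `order`."""
    return [dist[order[i]][order[j]] for i in range(len(order)) for j in range(i)]


def mantel(d1, d2, n_perm=999, seed=0):
    """Return (observed correlation, one-sided permutation p-value)."""
    rng = random.Random(seed)
    identity = list(range(len(d1)))
    fixed = lower_triangle(d2, identity)
    observed = pearson(lower_triangle(d1, identity), fixed)
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(n_perm):
        perm = identity[:]
        rng.shuffle(perm)  # permute rows and columns of d1 together
        if pearson(lower_triangle(d1, perm), fixed) >= observed:
            hits += 1
    return observed, hits / (n_perm + 1)


# Toy data (not from the paper): five objects on a line, two related distances.
d_a = [[abs(i - j) for j in range(5)] for i in range(5)]
d_b = [[(i - j) ** 2 for j in range(5)] for i in range(5)]
observed_r, p_value = mantel(d_a, d_b)
print(observed_r, p_value)
```

Permuting whole rows and columns together, rather than individual entries, is what respects the dependence between distances that share an object.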
Q6 Using the Bray-Curtis distances between the occupancy profiles of the bird species that
are understory foliage gleaning insectivores, a dendrogram has been created using complete
linkage clustering.
a) 2 marks Describe what complete linkage clustering is.
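For reference, agglomerative clustering with complete linkage can be sketched in a few lines: at each step the two clusters with the smallest between-cluster distance are merged, where that distance is taken as the largest distance between any pair of members. The code below is a minimal sketch under that definition, run on made-up points on a line rather than the bird data; the function name and the `n_clusters` stopping rule are assumptions.

```python
def complete_linkage(dist, n_clusters):
    """Merge clusters until n_clusters remain, using complete (maximum) linkage."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: cluster distance is the LARGEST member distance.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy data (not from the paper): five points on a line.
points = [0, 1, 5, 6, 20]
dist = [[abs(p - q) for q in points] for p in points]
result = complete_linkage(dist, n_clusters=3)
print(result)
```

Replacing `max` with `min` in the marked line would give single linkage instead, which is a convenient way to compare the two rules on small examples.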
b) 2 marks Which two bird species have the most similar occupancy profiles, according to
the dendrogram? Would these two still be most similar under single linkage clustering?
Explain.
c) 3 marks Two genera, Malacopteron and Phaenicophaeus, make up the majority of these
birds. Cut the dendrogram so that three clusters are created. Tabulate the number of
Malacopteron and Phaenicophaeus in each cluster. Do you think bird species tend to occupy
the same sites as other species of the same genus? Explain.
END QUESTION PAPER