Questions 1-3 consider a set of data from a crowd-sourced lending service. It has attributes of 5425 loans that were “Charged Off” (not paid in full) and an equal number that were paid in full. There are 10 additional variables:

Funded_amnt: the amount of money lent.

Loan_amnt: the amount of money requested.

Dti: debt to income ratio, excluding mortgage and proposed loan.

Emp_length: the number of years the borrower has been employed.

Installment: the monthly payment.

Annual_inc: borrower’s annual income.

Revol_bal: the balance on all the borrower’s revolving credit accounts.

Earlyear: the year in which the borrower first borrowed money.

Proputil: the proportion of the borrower’s maximum revolving credit being utilized (this is actually given as a percentage, a number between 0 and 100).

Open_acc: the number of credit accounts the borrower has opened over the years.

Q1 First we perform a principal components analysis of the 10 predictor variables, after scaling.

a) 2 marks The eigenvalues of the correlation matrix are:

3.55 1.53 1.26 1.00 0.94 0.67 0.48 0.47 0.08 0.02

Sketch the scree plot.

b) 3 marks How many principal components do you suggest using? Explain your reasoning. What proportion of total variability do they account for?
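
The proportion of variability attributed to the first k components is the sum of their eigenvalues divided by the sum of all eigenvalues. A minimal sketch of that arithmetic, using the eigenvalues listed in part (a), is given here. The function name `cumulative_proportions` is illustrative only, and the sketch does not suggest how many components to keep.

```python
# Eigenvalues of the correlation matrix, copied from part (a).
eigenvalues = [3.55, 1.53, 1.26, 1.00, 0.94, 0.67, 0.48, 0.47, 0.08, 0.02]

# For a correlation matrix the eigenvalues sum to the number of variables (10 here).
total = sum(eigenvalues)


def cumulative_proportions(values):
    """Return the running share of total variance after each component."""
    grand_total = sum(values)
    running = 0.0
    shares = []
    for value in values:
        running += value
        shares.append(running / grand_total)
    return shares


cum = cumulative_proportions(eigenvalues)
for k, share in enumerate(cum, start=1):
    print(f"first {k} component(s): {share:.3f}")
```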

c) 2 marks The loadings of two different three-component solutions, varimax rotated and unrotated, are given below. Loadings below 0.2 have been suppressed. What is the purpose of rotation? Has that been achieved here?

Unrotated:

Rotated:

d) 1 mark In both sets of loadings, whenever “earlyear” appears, it has an opposite sign to “emp_length.” Explain why this makes sense.

e) 2 marks Suppose we decide to go with a two-component rather than a three-component solution. Will the loadings of the first two components change for either the rotated or the unrotated solution? Explain why.

Q2 We now wish to perform MANOVA to test whether there are differences between the charged off and fully paid groups.

a) 4 marks Consider the following diagnostics designed to evaluate the MANOVA assumptions. State these assumptions, and your conclusion about how well they are satisfied, referencing specific portions of the output.

b) 4 marks Below find two p-values, one generated by comparing Pillai’s trace to the appropriate F distribution, and one from comparing it to a permutation distribution. Is either of them adequate for summarizing a test for a difference between the means of the two groups? Explain.

Observed Pillai’s trace    P-value from F distribution    P-value from permutation
0.045973                   < 2.2e-16 ***                  0.001

Q3 A linear discriminant analysis is performed using the 10 variables used in the MANOVA.

a) 4 marks Below, see a table with the predicted classifications from a leave-one-out cross-validation, compared to the true classifications. What is the error rate? What is the purpose of performing cross-validation? What are the advantages and disadvantages of leave-one-out vs 10-fold cross-validation for this dataset?

b) 4 marks A table showing the loadings (correlations) of the original variables with the LDA score is given below. Do any variables appear to be more important than the others in predicting whether a loan will be charged off? Explain your reasoning. Based on these loadings, do you expect charged off (unpaid) loans to have higher or lower LDA scores? Explain.

c) 2 marks Consider an individual with LDA score 0.10. The height of the implied density of the LDA score is 0.38 for the charged off category, and 0.40 for the fully paid category. In the relevant population, the frequency of charged off loans is 14%. What is the posterior probability that this individual’s loan will be charged off?
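
For two classes, Bayes' rule gives the posterior as prior times density for the class of interest, divided by the sum of prior times density over both classes. A minimal sketch of that calculation with the figures quoted in part (c) follows. The function and argument names are illustrative assumptions, not part of any particular LDA library.

```python
def posterior(prior, density_given_class, density_given_other):
    """Two-class Bayes rule: P(class | score) from a prior and two density heights."""
    numerator = prior * density_given_class
    denominator = numerator + (1.0 - prior) * density_given_other
    return numerator / denominator


# Figures from part (c): prior 14% charged off, density heights 0.38 and 0.40.
p_charged_off = posterior(prior=0.14, density_given_class=0.38, density_given_other=0.40)
print(f"posterior probability of charge-off: {p_charged_off:.4f}")
```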

Questions 4-6 concern bird sightings at 36 locations in Borneo. Counts of X species are recorded at each location. The locations are of three different types: “P” for pristine (never logged), “Q,” logged 8 years previously, and “R,” logged 4 years previously.

Q4

a) 2 marks The counts have been fourth-root transformed, and the Bray-Curtis distance has then been computed between them. In what circumstances do we prefer the Bray-Curtis distance? What is the purpose of taking the fourth root of the counts?
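
As an illustration of the two steps named in part (a), the sketch below applies a fourth-root transform and then the Bray-Curtis dissimilarity, sum |x_i - y_i| / sum (x_i + y_i). The counts for `site_a` and `site_b` are hypothetical values invented for this example; they are not taken from the Borneo data.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two non-negative count vectors."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    return numerator / denominator


def fourth_root(counts):
    """Apply the fourth-root transform to each count."""
    return [c ** 0.25 for c in counts]


# Hypothetical counts of four species at two sites (not from the exam data).
site_a = [16, 0, 1, 81]
site_b = [1, 16, 0, 16]

raw = bray_curtis(site_a, site_b)
transformed = bray_curtis(fourth_root(site_a), fourth_root(site_b))
print(f"raw counts: {raw:.3f}, fourth-root counts: {transformed:.3f}")
```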

b) 2 marks Below you see a plot showing the stress for non-metric multidimensional scaling on 2-11 axes. Which number of axes do you prefer? Explain your reasoning.

c) 2 marks Explain the difference between metric and non-metric multidimensional scaling. In what case is metric scaling equivalent to principal components analysis?

d) 4 marks A PERMANOVA has been performed to compare the sites. Explain how this analysis works. Are there any important assumptions or caveats? Output is given below.

Q5 We now compute two distances between each pair of bird species, one based on the bird sightings, using the Bray-Curtis distance on the quarter root counts, and one based on the birds’ divergent characteristics (food source, nesting preference etc.) using the Manhattan metric.

a) 2 marks Compute the Manhattan distance between the two bird species below based on the characteristics given.

        Canopy  Bark  Gleaning  insectivore  Foliage  Frugivore  Nectivore  Sallying  Raptor
Sp. 1   1       1     1         1            0        0          0          0         0
Sp. 2   0       0     1         1            1        0          0          0         0
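
The Manhattan distance is the sum of absolute coordinate-wise differences. A minimal sketch of the calculation for the two rows above follows. It assumes the nine values in each row line up position by position; the result does not depend on how the column labels are assigned.

```python
def manhattan(x, y):
    """Manhattan (city-block) distance: sum of absolute coordinate differences."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y))


# Indicator values copied from the table, in the order given.
sp1 = [1, 1, 1, 1, 0, 0, 0, 0, 0]
sp2 = [0, 0, 1, 1, 1, 0, 0, 0, 0]

distance = manhattan(sp1, sp2)
print(f"Manhattan distance between Sp. 1 and Sp. 2: {distance}")
```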

b) 3 marks Output from a Mantel test is given below. Explain briefly what this procedure tests, and how it works. What is the conclusion here?
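
For orientation, a permutation-based Mantel test can be sketched as follows: correlate the off-diagonal entries of two distance matrices, then repeatedly relabel the objects of one matrix (permuting rows and columns together) to build a reference distribution. This is a simplified one-sided sketch using Pearson correlation. The matrices `d1` and `d2` are hypothetical, and the output referred to in the question may come from a different implementation with different options.

```python
import random


def upper_triangle(d):
    """Return the entries above the diagonal of a square matrix as a flat list."""
    n = len(d)
    return [d[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mantel(d1, d2, n_perm=999, seed=0):
    """One-sided Mantel test: observed correlation and permutation p-value."""
    rng = random.Random(seed)
    n = len(d1)
    x = upper_triangle(d1)
    observed = pearson(x, upper_triangle(d2))
    at_least_as_large = 0
    for _ in range(n_perm):
        order = list(range(n))
        rng.shuffle(order)
        # Relabel the objects of d2: rows and columns are permuted together.
        permuted = [[d2[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(x, upper_triangle(permuted)) >= observed:
            at_least_as_large += 1
    # The observed arrangement counts as one of the permutations.
    p_value = (at_least_as_large + 1) / (n_perm + 1)
    return observed, p_value


# Hypothetical example: five objects on a line, with d2 a monotone function of d1.
positions = [0, 1, 3, 6, 10]
d1 = [[abs(a - b) for b in positions] for a in positions]
d2 = [[(a - b) ** 2 for b in positions] for a in positions]

r_obs, p_val = mantel(d1, d2)
print(f"Mantel r = {r_obs:.3f}, permutation p-value = {p_val:.3f}")
```

With 999 permutations the smallest p-value this scheme can report is 1/1000.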

Q6 Using the Bray-Curtis distances between the occupancy profiles of the bird species that are understory foliage gleaning insectivores, a dendrogram has been created using complete linkage clustering.

a) 2 marks Describe what complete linkage clustering is.

b) 2 marks Which two bird species have the most similar occupancy profiles, according to the dendrogram? Would these two still be most similar under single linkage clustering? Explain.

c) 3 marks Two genera, Malacopteron and Phaenicophaeus, make up the majority of these birds. Cut the dendrogram so that three clusters are created. Tabulate the number of Malacopteron and Phaenicophaeus in each cluster. Do you think bird species tend to occupy the same sites as other species of the same genus? Explain.

END QUESTION PAPER