Midterm Exam: Web Scraping and FinTech Data Analysis
NOTE: Use all available resources. You can find a solution to most of the coding problems on the Internet. Google it if you are stuck in the middle.
NOTE2: Do your best to conduct the task!!!
PART I. (10 points)
Complete the task below. Copy your code onto the answer sheet.
Visit www.freelancer.com and choose any category. Extract the titles, descriptions, tags, and prices from the first page of search results. Create four corresponding vectors, named ‘titles’, ‘descriptions’, ‘tags’, and ‘prices’, with 'prices' specifically formatted as a numeric vector. Finally, assemble these vectors into a single consolidated data frame, named ‘freelancer’. Please note, it might be required to clean the data and use NA values to ensure the vectors are of equal length prior to merging them into a single data frame. |
PART II. (15 points)
Analyze the dataset named ‘fintech,’ obtained from a prominent FinTech company based in Hong Kong. This company specializes in lending money to loan applicants. Upon receiving an application, the company's decision engine automatically classifies each loan application into one of three categories: 'approve', 'reject', or 'manual review.' For applications marked as 'manual review,' reviewers undertake a subjective assessment to further reclassify each case as either 'approve' or 'reject.’
Field description:
Field |
Description |
id |
Loan application id |
loan_amount |
Requested amount in HKD |
tenor |
Requested repayment periods in months |
age |
|
month_of_service |
The employment period of current job |
residential_status |
Own, Rent, Others |
monthly_repayment |
Monthly repayment for other existing loans |
monthly_income |
Average monthly income for the last three months |
self-employed |
|
bankrupted |
Whether the applicant has a record of bankruptcy |
housewife |
|
currently_employed |
Whether the applicant is employed in a full-time job |
channel |
Loan application channel |
language |
tc (traditional Chinese), EN (English) |
manual_review |
t = manually reviewed, f = otherwise |
approved |
t = approved, f = rejected |
manual_approved |
t = manually approved, f = otherwise |
credit score |
Higher is better |
friends_facebook |
No. of Facebook friends (0 indicates either no friend or the account was not provided) |
time_application |
Time of the day when the application was submitted |
location |
The location where the application was submitted |
default |
Whether the repayment was overdue as of June 2017 |
Q1. What is the average monthly income of the whole sample? What is the average monthly income of the currently employed? (1 point)
Q2. Generate the histogram of “loan_amount.” Can you find any interesting patterns? Can you guess the reason why the graph has such a shape? (1 point)
Q3. Replace the value of “friends_facebook” with NA if the value is 0. What is the average number of Facebook friends of those who have provided their Facebook accounts? (2 points)
Q4. Generate the scatterplot of “month_of_service” and “credit_score”. Can you find any relationship between them? What about “monthly_income” and “credit_score”? Confirm the relationship with the correlation tests. (2 points)
Q5. Make a new variable, named “automatic_approved,” which has the value “t” if approved by the decision engine, “f” if rejected by the decision engine, and “NA” if reviewed manually. How many cases are approved or rejected by their decision engine? How many are classified as “manual review”? (2 points)
Hint:
fintech.df$automatic_approved=ifelse(fintech.df$approved=="t" & is.na(fintech.df$manual_approved)==TRUE, "t", fintech.df$automatic_approved) fintech.df$automatic_approved=ifelse(fintech.df$approved=="f" & is.na(fintech.df$manual_approved)==TRUE, "f", fintech.df$automatic_approved) |
Q6. Compare the automatically approved cases and the automatically rejected cases. Conduct statistical tests on variables available in the dataset to answer the following subquestions. (5 points)
1) Are they different in “loan_amount”?
2) Are they different in “tenor”?
3) Are they different in “age”?
4) Are they different in “month_of_service”?
5) Are they different in “residential_status”?
6) Are they different in “monthly_income”?
7) Are they different in “bankrupted”?
8) Are they different in “currently_employed”?
9) Are they different in “channel”?
10) Are they different in “language”?
11) Are they different in “credit_score”?
12) Are they different in “friends_facebook”?
13) Are they different in “location_application”?
Q7. Based on the analysis results above, provide the logic behind the decision engine to judge “approve”. (2 points)
Guideline
Submit 1) your answer sheet, 2) R-code used for the analysis, and 3) the signed declaration as a zip file to the LeranUs System. Please include your student number and name in the header of the answer sheet. Make your answer sheet formatted as follows: Times New Roman, 12-point font, double-spaced only (not 1.5), 1-inch margins all around 8.5 x 11-inch paper (or A4).
版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。