In-Class ML Competition
Econometrics II: ver. 2019 Spring Semester
Competition Outline
Date: 23 July; Time: 10:40 – 12:00 (10 mins for preparation, 70
mins for coding and submission.); Place: 3-901 (this room).
Download the data and submit your answer through the Course Navi.
The submission folder will be automatically closed at 12:00.
You are allowed to bring “anything” with you to the classroom,
including lecture slides, textbooks, pre-written R scripts, etc.
However, you are not allowed to communicate with others or to ask
me any technical questions.
Available data: X_train, Y_train, X_test.
Task: to predict the true value of Y_test as accurately as possible,
where Y_test is a dummy variable (i.e., a classification problem).
IMPORTANT: do NOT submit the predicted binary responses; instead,
submit the "classification scores" s(X_test), which are used to compute the AUC.
It is not necessary to submit the R code used.
Evaluation: The performance of your prediction algorithm will be
evaluated by the AUC score.
* This competition accounts for 40% of your final grade:
40 = a + b × your AUC score,
where a and b will be determined later.
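To see why scores rather than binary responses are requested, here is a minimal sketch in base R. It computes the AUC from ranks (the Mann-Whitney form) instead of using ROCR; the helper name auc_rank, the labels, and the scores are all made up for illustration.

```r
# Rank-based AUC: the share of (positive, negative) pairs in which the
# positive observation has the higher score (ties count as one half).
auc_rank <- function(s, y) {
  y  <- as.logical(y)
  n1 <- sum(y)    # number of positives
  n0 <- sum(!y)   # number of negatives
  (sum(rank(s)[y]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

y <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)   # hypothetical true labels
s <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.1)            # hypothetical continuous scores

auc_scores <- auc_rank(s, y)                    # 8/9: uses the full ranking
auc_binary <- auc_rank(as.numeric(s > 0.5), y)  # 2/3: scores thresholded at 0.5
```

In this example, thresholding throws away the ordering within each predicted class, so the AUC drops from 8/9 to 2/3.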
Cross Validation
The true Y_test is unobservable, so you cannot compute any performance
statistics, including accuracy and AUC, on the test set.
How can we compare alternative prediction models?
Cross Validation
Further split the training set into a reduced training set and a
"validation" set.
Train each model on the reduced training set, and select the best
model based on the results on the validation set.
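The single-split idea can be sketched as follows. Everything here is illustrative: the data are simulated, the 80/20 split is an arbitrary choice, and auc_rank is a hypothetical base-R helper standing in for an AUC function.

```r
# Rank-based AUC helper (hypothetical name; base R only)
auc_rank <- function(s, y) {
  y  <- as.logical(y)
  n1 <- sum(y); n0 <- sum(!y)
  (sum(rank(s)[y]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(1)
N  <- 200
df <- data.frame(x1 = rnorm(N), x2 = rnorm(N))           # simulated inputs
df$y <- rbinom(N, 1, plogis(0.5 * df$x1 - df$x2)) == 1   # simulated binary response

val_ids  <- sample(N, size = floor(0.2 * N))  # 20% held out as the validation set
train_df <- df[-val_ids, ]                    # reduced training set
val_df   <- df[val_ids, ]

# Two candidate models, both trained on the reduced training set only
m_small <- glm(y ~ x1,      data = train_df, family = binomial)
m_full  <- glm(y ~ x1 + x2, data = train_df, family = binomial)

# Compare the models by their AUC on the validation set
auc_small <- auc_rank(predict(m_small, newdata = val_df, type = "response"), val_df$y)
auc_full  <- auc_rank(predict(m_full,  newdata = val_df, type = "response"), val_df$y)
c(small = auc_small, full = auc_full)
```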
K-fold Cross Validation
The above-mentioned method runs the risk of overfitting to a
particular validation set (especially when the size of the training data
is small). The K-fold cross-validation approach:
1 Randomly split the training set into K equally sized subsets.
2 Keep the k-th subset as a validation set, and train the model on the
remaining K − 1 subsets. Compute the AUC on this validation set, AUC_k.
3 Repeat this process from k = 1 to k = K, and compute the average of
AUC_1, ..., AUC_K.
Finally, choose the best model in terms of the average AUC.
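When the sample size N is not a multiple of K, the subsets cannot be exactly equal in size. One possible way to handle this (a sketch, not necessarily the approach used in this course) is to assign a random fold label to every row, so that fold sizes differ by at most one and every observation is used for validation exactly once. The values of N and K below are arbitrary.

```r
set.seed(1)
N <- 103   # hypothetical sample size, not a multiple of K
K <- 10

fold <- sample(rep(1:K, length.out = N))  # random fold label for each row
fold_sizes <- table(fold)                 # sizes differ by at most one

val_ids_1 <- which(fold == 1)             # rows forming the first validation set
```

The k-th validation set is then the set of rows with fold == k, and the remaining rows form the reduced training set.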
K-fold Cross Validation (cont.)
A common choice for K is either 5 (80% for training and 20% for
validation in each round) or 10 (90% for training and 10% for validation),
but there is no formal theoretical justification for these numbers.
Repeated K-fold Cross Validation:
A single run of K-fold cross validation yields only K estimates of model
performance.
To obtain more estimates, reshuffle the data and run K-fold cross validation
multiple times.
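A minimal sketch of repeated K-fold cross validation is given below. It assumes simulated data, a single logistic regression as the model, and a hypothetical base-R helper auc_rank for the AUC; the number of repetitions (reps) is an arbitrary choice.

```r
# Rank-based AUC helper (hypothetical name; base R only)
auc_rank <- function(s, y) {
  y  <- as.logical(y)
  n1 <- sum(y); n0 <- sum(!y)
  (sum(rank(s)[y]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(1)
N  <- 200
df <- data.frame(x1 = rnorm(N), x2 = rnorm(N))           # simulated inputs
df$y <- rbinom(N, 1, plogis(0.5 * df$x1 - df$x2)) == 1   # simulated binary response

K    <- 5
reps <- 3                                     # number of repetitions
cv_auc <- matrix(NA_real_, nrow = reps, ncol = K)

for (r in 1:reps) {
  fold <- sample(rep(1:K, length.out = N))    # reshuffle the fold labels
  for (k in 1:K) {
    train_df <- df[fold != k, ]               # reduced training set
    val_df   <- df[fold == k, ]               # k-th validation set
    m <- glm(y ~ x1 + x2, data = train_df, family = binomial)
    s <- predict(m, newdata = val_df, type = "response")
    cv_auc[r, k] <- auc_rank(s, val_df$y)
  }
}

mean_auc <- mean(cv_auc)   # average over all reps * K estimates
```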
Sample Code
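library(ROCR)                        # Provides prediction() and performance()
setwd("C:/Rdataset")                 # Folder containing spam_train.csv
data <- read.csv("spam_train.csv")
data$type <- (data$type == "spam")   # TRUE for spam, FALSE otherwise
# AUC of the classification scores s against the true labels Y
AUC <- function(s, Y){
  pred <- ROCR::prediction(s, Y)
  auc <- ROCR::performance(pred, "auc")@y.values[[1]]
  return(auc)
}
K <- 10
N <- nrow(data) # Total sample size
n <- floor(N/K) # The size of each subset
data <- data[sample(N),] # Randomly shuffle the data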
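# Validation AUCs of the two models when the k-th subset is held out
CV <- function(k){
  ids <- (k - 1)*n + 1:n   # Row indices of the k-th subset
  test <- data[ids,]       # Validation set
  train <- data[-ids,]     # Reduced training set
  m1 <- lm(type ~., data = train)   # Linear probability model
  m2 <- glm(type ~., data = train, family = binomial(link = "logit"))   # Logit model
  s1 <- predict(m1, newdata = test)   # Fitted values used as scores
  s2 <- predict(m2, newdata = test)   # Linear predictor (log-odds) used as scores
  auc1 <- AUC(s1, test$type)
  auc2 <- AUC(s2, test$type)
  return(c(auc1,auc2))
}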
cvmat <- matrix(0,K,2) # Matrix of zeros of dimension (K, #models)
for(k in 1:K) cvmat[k,] <- CV(k)
colMeans(cvmat)
The above R code is available on Course Navi.
Pre-Competition
Infant Birth Weight
Birth weight data (1)
Training data: bweight_train.csv (including both X and Y)
Test data: X_test.csv and Y_test.csv, where Y_test.csv will not
be downloadable until 12:00.
Submission file: submission.csv
The CSV files are available on Course Navi (not on my website).
(1) Obtained from Wooldridge’s datasets:
http://fmwww.bc.edu/ec-p/data/wooldridge/datasets.list.html
Definitions of variables
Response variable
  lbw3000              TRUE if birth weight ≤ 3,000 (g), and FALSE otherwise.
Input variables
  xage, xeduc, xrace   x’s age, x’s education in years, and x’s race ("white",
                       "black" or "other"), respectively, where x = m for the
                       mother and x = f for the father.
  monpre               Month prenatal care began.
  npvis                Total number of prenatal visits.
  omaps, fmaps         One-minute and five-minute Apgar scores, respectively. (2)
  cigs                 Average cigarettes per day.
  drink                Average drinks per week.
(2) The Apgar score is the very first test performed on a newborn baby, at 1 and 5
minutes after birth.
The task:
Compute classification scores for all 500 individuals in the test set,
which are indexed by ID = 1, ..., 500, for the prediction of
{lbw3000 = TRUE}.
Submission process:
1 Using the training data, develop your prediction model.
2 Based on your model, compute the classification scores for the
observations in the test data. Typically, you can obtain them using the
predict() function. It is important to check that the obtained scores are
continuous values, not binary ones.
Submission process (cont.):
3 Load the submission.csv file, store the obtained classification scores
in the variable score, and overwrite the csv file:
submit <- read.csv("submission.csv")
submit$score <- s # s = classification score
write.csv(submit, "submission.csv", row.names = FALSE) # overwrite the file without adding a row-name column
4 Submit this file through Course Navi.
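The whole submission process can be rehearsed on toy data before the competition. The sketch below makes several assumptions that are not confirmed by the course materials: the data are simulated, the model is a simple logistic regression, a temporary file stands in for submission.csv, and the template is assumed to have the columns ID and score. Note that write.csv() writes row names as an extra first column unless row.names = FALSE is given.

```r
set.seed(1)

# Simulated stand-ins for the training and test data (not the course files)
train <- data.frame(x = rnorm(100))
train$y <- rbinom(100, 1, plogis(train$x)) == 1
test <- data.frame(x = rnorm(5))

# Stand-in for the submission template; the column names are assumptions
path <- tempfile(fileext = ".csv")
write.csv(data.frame(ID = 1:5, score = NA), path, row.names = FALSE)

# Steps 1 and 2: fit a model and compute classification scores
m <- glm(y ~ x, data = train, family = binomial)
s <- predict(m, newdata = test, type = "response")  # fitted probabilities
pred_binary <- s > 0.5                              # binary responses: NOT what to submit

stopifnot(length(unique(s)) > 2)   # the scores should be continuous, not binary

# Step 3: store the scores and overwrite the file
submit <- read.csv(path)
submit$score <- s
write.csv(submit, path, row.names = FALSE)

check <- read.csv(path)   # read the file back to inspect what would be submitted
names(check)
```

For a glm() fit, predict() without type = "response" returns the linear predictor (log-odds) instead of probabilities; both are continuous and give the same ranking of observations.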