Predicting the Price of Round Cut Diamonds
STP 494/STP 598: Machine Learning
Introduction
Data
  Data Source
  Data Preparation
Methods
Results
  Variable Selection Using Regsubsets
  Multiple Linear Regression
  Random Forest Regression
Discussion and Conclusion
Future Work
References
Appendix
  Code
Introduction
One of the biggest purchases a couple makes is the engagement ring. The engagement ring has been a staple
since the 19th century, and one of its main components is the diamond. With this being a very expensive
purchase, we as consumers should be well informed on how diamonds are priced so we don’t overpay! Our data
set contains the prices of round cut diamonds together with the attributes listed below. Our goal is to fit
a model that predicts the price of round cut diamonds using the best methods and the best set of variables
available.
Data
Data Source
This dataset contains 53,940 diamond observations with the following 10 variables:
Variable   Description
price      price in US dollars ($326-$18,823)
carat      weight of the diamond (0.2-5.01)
cut        quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color      diamond color (J (worst), I, H, G, F, E, D (best))
clarity    grade of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x          length in mm (0-10.74)
y          width in mm (0-58.9)
z          depth in mm (0-31.8)
depth      total depth percentage (43-79)
table      width of top of diamond relative to widest point (43-95)

NOTE (clarity): I = Included, SI = Slightly Included, VS = Very Slightly Included, VVS = Very, Very
Slightly Included, IF = Internally Flawless
Data Preparation
First, the data was loaded in Excel for an overview. There were no missing values. The first column, which
held only the index of each observation, carried no information and was deleted. The data was then loaded
into R and its structure was examined. The levels of each categorical variable (cut, color, and clarity) are
sorted alphabetically by default. Because the levels are ordinal, we releveled each variable according to
the order presented in the Data Source section above. Our data was then split into 75 percent training and
25 percent testing sets.
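The releveling and split described above can be sketched as follows. This is only a minimal illustration on a made-up data frame: the values and the names toy and cut_levels are assumptions for the example, not the project data.

```r
# Toy data frame standing in for the diamonds data (values are made up)
toy <- data.frame(
  price = c(500, 1200, 3400, 800, 2600, 950, 4100, 1500),
  cut   = c("Ideal", "Fair", "Premium", "Good", "Very Good", "Ideal", "Fair", "Good"),
  stringsAsFactors = FALSE
)

# Order the levels from worst to best instead of alphabetically
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
toy$cut <- factor(toy$cut, levels = cut_levels)
print(levels(toy$cut))

# 75/25 split: sample the training rows, keep the remaining rows for testing
set.seed(1)
n <- nrow(toy)
train_ind <- sample(seq_len(n), size = floor(0.75 * n))
train <- toy[train_ind, ]
test  <- toy[-train_ind, ]
```

Taking the test set as the rows left over after sampling the training rows guarantees that no observation appears in both sets.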
Methods
Multiple linear regression and random forest regression were used to fit models to predict the price of
diamonds. We started with a variable selection method (regsubsets) to select the “best” subset of predictors.
We then fitted the selected predictors using multiple linear regression and evaluated the model using the
root mean square error (RMSE) criterion. This was used as our base model. From there, we tried to improve on
it using random forest regression, which was presented in class as one of the most powerful tools available.
With random forest, we compared how the RMSE changes as mtry and ntree change, in an attempt to minimize the
RMSE. Since we have a large dataset and random forest is computationally expensive, we opted to resample our
dataset to 2000 observations.
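For reference, the RMSE criterion is the square root of the mean squared difference between actual and predicted prices. A minimal sketch, where the helper name calc_rmse and the example numbers are hypothetical:

```r
# Hypothetical helper: root mean square error of a set of predictions
calc_rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Made-up prices and predictions, for illustration only
actual    <- c(1000, 2000, 3000, 4000)
predicted <- c(1100, 1900, 3200, 3800)
print(calc_rmse(actual, predicted))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones, and the result is in the same units as the price (US dollars).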
Results
Variable Selection Using Regsubsets
Regsubsets was used to select the best variables to use.
[Figure: RMSE vs. Num Vars. x-axis: Num Vars (ticks 5 to 20); y-axis: RMSE (ticks 1050 to 1350)]
From the plot above, we can see that the RMSE reaches its lowest level at around 16 variables and plateaus
beyond that. Note that the 3 categorical variables were dummy coded using 0 and 1, and each dummy is counted
here as a separate variable. For example, colorD is coded as 1 if the color grade is D and 0 otherwise.
Since we want the model with the fewest variables, and the RMSE of the full model is nearly the same as that
of the 16-variable model, the 16-variable model is the best choice.
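The dummy coding described above can be illustrated with model.matrix from base R. This is a sketch on a made-up four-row data frame (toy and color_levels are names chosen for the example), not the project data:

```r
# Color grades ordered from worst (J) to best (D), as in the Data Source section
color_levels <- c("J", "I", "H", "G", "F", "E", "D")
toy <- data.frame(color = factor(c("D", "J", "G", "D"), levels = color_levels))

# model.matrix expands the factor into 0/1 columns, one per non-reference level
X <- model.matrix(~ color, data = toy)
print(colnames(X))
print(X[, "colorD"])
```

The first level (J) gets no column of its own, so a 7-level factor contributes 6 separate variables to the count.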
Multiple Linear Regression
The variables selected by regsubsets were used here on the full dataset to fit our multiple linear regression
model, which resulted in an RMSE of $1065.38. Our model is as follows:
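\[
\begin{aligned}
\text{price} ={}& 147.61 + 11242.57\,\text{carat} + 916.79\,\text{colorI} + 1399.00\,\text{colorH} + 1900.69\,\text{colorG} + 2098.07\,\text{colorF} \\
&+ 2164.11\,\text{colorE} + 2380.17\,\text{colorD} + 2870.17\,\text{claritySI2} + 3849.15\,\text{claritySI1} + 4473.61\,\text{clarityVS2} \\
&+ 4786.27\,\text{clarityVS1} + 5174.62\,\text{clarityVVS2} + 5242.24\,\text{clarityVVS1} + 5599.45\,\text{clarityIF} - 83.86\,\text{depth} - 1041.00\,x + \varepsilon
\end{aligned}
\]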
The reference levels for the categorical variables are:

Categorical Variable   Reference Level
color                  colorJ
clarity                clarityI1
These are the worst color grade and the worst clarity grade. They were made the reference levels
intentionally for easy comparison. We can see from our model that the coefficients for color and clarity
increase as the grade gets better. Intuitively, this makes sense since, on average, better quality means a
more expensive diamond.
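The ordering of the coefficients can be checked directly. The sketch below copies the color and clarity coefficients from the model above into named vectors (the vector names are chosen for this example), with 0 for each reference level:

```r
# Coefficients from the fitted model, worst to best grade (reference level = 0)
color_coef <- c(J = 0, I = 916.79, H = 1399.00, G = 1900.69,
                F = 2098.07, E = 2164.11, D = 2380.17)
clarity_coef <- c(I1 = 0, SI2 = 2870.17, SI1 = 3849.15, VS2 = 4473.61,
                  VS1 = 4786.27, VVS2 = 5174.62, VVS1 = 5242.24, IF = 5599.45)

# TRUE if every step up in grade increases the coefficient
print(all(diff(color_coef) > 0))
print(all(diff(clarity_coef) > 0))

# Difference in predicted price between color D and color G, all else equal
print(unname(color_coef["D"] - color_coef["G"]))
```

Since each coefficient is measured against the worst grade, the difference between two coefficients is the model's estimated price gap between those two grades with all other predictors held fixed.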
Random Forest Regression
For random forest regression, we tried mtry values of 2, 3, 4, and 5, and ntree values of 100, 200, 300,
and 400. The RMSE for each mtry and ntree combination is tabulated below.
mtry   ntree = 100   ntree = 200   ntree = 300   ntree = 400
2      $968.51       $955.21       $956.18       $960.16
3      $987.38       $994.53       $969.73       $989.85
4      $1025.51      $997.77       $1005.13      $1003.57
5      $1037.21      $1031.66      $1048.65      $1034.73
For visualization, we have plotted the RMSE against the mtry values separated by each value of ntree as
shown below:
[Figure: RMSE vs mtry. x-axis: mtry (2 to 5); y-axis: RMSE (930 to 1050); one line each for ntree = 100, 200, 300, and 400]
We can see that the “best” random forest regression model uses mtry = 2 and ntree = 200. For mtry = 2,
changing ntree from 200 to 300 to 400 has only a marginal effect, since the RMSEs are close to each other.
The slight increase in RMSE at larger ntree is small enough to be consistent with the run-to-run randomness
of the forest; adding more trees does not generally cause a random forest to overfit.
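The best combination can also be read off programmatically. A minimal sketch in base R, using the RMSE values tabulated above (the object names are chosen for this example):

```r
# RMSE table from above: rows are mtry, columns are ntree
rmse_grid <- matrix(
  c( 968.51,  955.21,  956.18,  960.16,
     987.38,  994.53,  969.73,  989.85,
    1025.51,  997.77, 1005.13, 1003.57,
    1037.21, 1031.66, 1048.65, 1034.73),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("2", "3", "4", "5"), c("100", "200", "300", "400"))
)

# Locate the smallest RMSE and the mtry/ntree pair that produced it
best <- which(rmse_grid == min(rmse_grid), arr.ind = TRUE)
best_mtry  <- rownames(rmse_grid)[best[1, 1]]
best_ntree <- colnames(rmse_grid)[best[1, 2]]
cat("best mtry =", best_mtry, "best ntree =", best_ntree, "\n")

# Spread of the RMSE across ntree = 200, 300, 400 at mtry = 2
spread_200_400 <- diff(range(rmse_grid["2", c("200", "300", "400")]))
print(spread_200_400)
```

The spread at mtry = 2 is only a few dollars, which is small next to the differences between rows of the table.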
Discussion and Conclusion
Multiple linear regression performed worse than random forest regression, with RMSEs of $1065.38 and $955.21
(using mtry = 2 and ntree = 200), respectively. Both RMSEs are tolerable considering that diamond prices
ranged from $326 to $18,823. However, when using mtry of 4 or 5 with ntree of 100, 200, 300, or 400, the
random forest RMSEs (roughly $1000 to $1050) were similar to the RMSE of the multiple linear regression
model. Overall, our results support the expectation that random forest would be the stronger method.
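To put the two RMSEs in perspective, the arithmetic below expresses the gap as a relative improvement and each RMSE as a share of the observed price range. It uses only the figures reported above; the variable names are chosen for this example.

```r
rmse_lm <- 1065.38   # multiple linear regression
rmse_rf <- 955.21    # random forest, mtry = 2, ntree = 200

# Relative reduction in RMSE when moving from linear regression to random forest
rel_improvement <- (rmse_lm - rmse_rf) / rmse_lm
print(round(100 * rel_improvement, 1))   # percent

# Each RMSE as a percentage of the price range ($326 to $18,823)
price_range <- 18823 - 326
print(round(100 * c(lm = rmse_lm, rf = rmse_rf) / price_range, 1))
```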
Future Work
We can further extend this work by looking into gradient-boosted trees as well as neural networks and deep
learning. We chose to focus on random forest in this project since it was presented in class as one of the
most powerful methods.
References
https://bluenile.v360.in/49/imaged/gia-1162408531/2/still.jpg
https://vincentarelbundock.github.io/Rdatasets/datasets.html
An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor
Hastie, and Robert Tibshirani
Machine Learning with R by Brett Lantz
Appendix
Code
# looking at data
diamond = read.csv("diamonds.csv")
str(diamond)
# reorganizing levels
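diamond$cut = factor(diamond$cut, levels = c("Fair", "Good", "Very Good", "Premium",
    "Ideal"))
# explicit worst-to-best order; read.csv in R >= 4.0 returns character columns by
# default, so levels(diamond$color) cannot be relied on here
diamond$color = factor(diamond$color, levels = c("J", "I", "H", "G", "F", "E", "D"))
diamond$clarity = factor(diamond$clarity, levels = c("I1", "SI2", "SI1", "VS2",
    "VS1", "VVS2", "VVS1", "IF"))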
str(diamond)
summary(diamond$price)
# resampling and splitting data into train and test sets
set.seed(99)
smp_train = 2000
train_ind <- sample(seq_len(nrow(diamond)), size = smp_train)
train <- diamond[train_ind, ]
smp_test = 0.25 * smp_train
# draw test rows only from rows not used for training so the two sets do not overlap
test_ind <- sample(setdiff(seq_len(nrow(diamond)), train_ind), size = smp_test)
test <- diamond[test_ind, ]
# using regsubsets to choose best model
library(leaps)
##--------------------------------------------------
## function to do rmse for k in 1:p
dovalbest = function(object, newdata, ynm) {
    form = as.formula(object$call[[2]])
    p = 23 #categorical variables split up denoted 0 or 1 for each level
    rmsev = rep(0, p)
    test.mat = model.matrix(form, newdata)
    for (k in 1:p) {
        coefk = coef(object, id = k)
        xvars = names(coefk)
        pred = test.mat[, xvars] %*% coefk
        rmsev[k] = sqrt(mean((newdata[[ynm]] - pred)^2))
    }
    return(rmsev)
}
##------------------------------------------------------------
## do validation approach several times
ntry = 100
p = 23
resmat = matrix(0, p, ntry) #each row for num vars, each col for new train/test draw
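for (i in 1:ntry) {
    # draw a new, non-overlapping train/validation pair for each repetition
    tr_i = sample(seq_len(nrow(diamond)), size = smp_train)
    va_i = sample(setdiff(seq_len(nrow(diamond)), tr_i), size = smp_test)
    regfit.best = regsubsets(price ~ ., data = diamond[tr_i, ], nvmax = 23, nbest = 1,
        method = "exhaustive")
    resmat[, i] = dovalbest(regfit.best, diamond[va_i, ], "price")
}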
mresmat = apply(resmat, 1, mean) #average across columns
##--------------------------------------------------
## plot results of repeated train/val
plot(mresmat, xlab = "Num Vars", ylab = "RMSE", type = "b", col = "blue", pch = 19,
    main = "RMSE vs. Num Vars")
##--------------------------------------------------
## Fit using number of vars chosen by train/validation and all the data.
kopt = 16 #optimal k=number of vars: chosen by eye-balling plot
regfit.best = regsubsets(price ~ ., data = diamond, nvmax = kopt, nbest = 1,
    method = "exhaustive")
xmat = model.matrix(price ~ ., diamond)
ddf = data.frame(xmat[, -1], price = diamond$price) #don't use intercept (-1) & y=price
nms = c(names(coef(regfit.best, kopt))[-1], "price")
ddfsub = ddf[, nms] #drop all vars except those names by the coef at kopt
thereg = lm(price ~ ., ddfsub)
print(summary(thereg))
# multiple linear regression using variables from best subset
fit0 = lm(price ~ carat + color + clarity + depth + x, data = train)
pred = predict(fit0, test)
rmse = sqrt(mean((test$price - pred)^2))
cat("The root mean square error is: ", rmse)
fit00 = lm(price ~ carat + color + clarity + depth + x, data = diamond)
summary(fit00)
# random forest
library(randomForest)
fit1 = randomForest(price ~ carat + color + clarity + depth + x, data = train,
    mtry = 2, ntree = 100)
pred1 = predict(fit1, test)
rmse1 = sqrt(mean((test$price - pred1)^2))
cat("The root mean square error is: ", rmse1)
fit2 = randomForest(price ~ carat + color + clarity + depth + x, data = train,
    mtry = 2, ntree = 200)
pred2 = predict(fit2, test)
rmse2 = sqrt(mean((test$price - pred2)^2))
cat("The root mean square error is: ", rmse2)
fit3 = randomForest(price ~ carat + color + clarity + depth + x, data = train,
    mtry = 2, ntree = 300)
pred3 = predict(fit3, test)
rmse3 = sqrt(mean((test$price - pred3)^2))
cat("The root mean square error is: ", rmse3)
fit4 = randomForest(price ~ carat + color + clarity + depth + x, data = train,
    mtry = 2, ntree = 400)
pred4 = predict(fit4, test)
rmse4 = sqrt(mean((test$price - pred4)^2))
cat("The root mean square error is: ", rmse4)
# grid search over mtry (2 to 5) and ntree (100 to 400)
rmeasure = c()
for (i in 2:5) {
    for (j in c(100, 200, 300, 400)) {
        fit <- randomForest(price ~ carat + color + clarity + depth + x, data = train,
            mtry = i, ntree = j)
        pred <- predict(fit, test)
        rmse <- sqrt(mean((test$price - pred)^2))
        # add to the end so results stay in loop order (mtry outer, ntree inner)
        rmeasure <- c(rmeasure, rmse)
        cat("mtry = ", i, "ntree = ", j, "RMSE = ", rmse, "\n")
    }
}
# plotting random forest RMSEs against mtry by ntree
x = c(2, 3, 4, 5)
y = c(968.506, 955.2064, 956.1764, 960.1649, 987.3767, 994.533, 969.7294, 989.8527,
1025.512, 997.7698, 1005.133, 1003.568, 1037.209, 1031.658, 1048.647, 1034.725)
ntree100 = c(968.506, 987.3767, 1025.512, 1037.209)
ntree200 = c(955.2064, 994.533, 997.7698, 1031.658)
ntree300 = c(956.1764, 969.7294, 1005.133, 1048.647)
ntree400 = c(960.1649, 989.8527, 1003.568, 1034.725)
plot(x, ntree100, type = "l", col = 2, xlab = "mtry", ylab = "RMSE", main = "RMSE vs mtry",
    ylim = c(930, 1070), axes = FALSE)
lines(x, ntree200, type = "l", col = 3)
lines(x, ntree300, type = "l", col = 4)
lines(x, ntree400, type = "l", col = 5)
legend("topleft", c("ntree = 100", "ntree = 200", "ntree = 300", "ntree = 400"),
    col = 2:5, lty = 1, cex = 0.7)
axis(side = 1, at = c(2:5))
axis(side = 2, at = c(930, 950, 970, 990, 1010, 1030, 1050, 1070))
box()