
Predicting the Price of Round Cut Diamonds

STP 494/STP 598: Machine Learning

Contents

Introduction
Data
    Data Source
    Data Preparation
Methods
Results
    Variable Selection Using Regsubsets
    Multiple Linear Regression
    Random Forest Regression
Discussion and Conclusion
Future Work
References
Appendix
    Code

Introduction

One of the biggest purchases a couple makes is the engagement ring, which has been a staple since the 19th century, and one of its main components is the diamond. Because this is such an expensive purchase, consumers should be well informed about how diamonds are priced so that they do not overpay. Our data set contains prices of round cut diamonds along with the attributes described below. Our goal is to fit a model that predicts the price of round cut diamonds using the best methods and the best set of variables available.

Data

Data Source

This dataset contains 53,940 diamond observations consisting of the following 10 variables:

Variable   Description
price      price in US dollars ($326-$18,823)
carat      weight of the diamond (0.2-5.01)
cut        quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color      diamond color (J (worst), I, H, G, F, E, D (best))
clarity    grade of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
           NOTE: I = Included, SI = Slightly Included, VS = Very Slightly Included,
           VVS = Very, Very Slightly Included, IF = Internally Flawless
x          length in mm (0-10.74)
y          width in mm (0-58.9)
z          depth in mm (0-31.8)
depth      total depth percentage (43-79)
table      width of top of diamond relative to widest point (43-95)

Data Preparation

First, the data was loaded into Excel for an overview. There were no missing values. However, the first column, which simply held the row index of each observation, added no information and was deleted. The data was then loaded into R and examined further by looking at its structure. The levels of each categorical variable (cut, color, and clarity) are sorted alphabetically by default. Because the levels are ordinal, we releveled each factor according to the order presented in the Data Source section above. Our data was then split into 75 percent training and 25 percent testing sets.
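A minimal sketch of this step, assuming the data frame is named diamond as in the Appendix (the 75/25 split here follows the text; the Appendix code instead draws a 2,000-observation subsample):

# relevel the ordinal factors from worst to best
diamond$cut = factor(diamond$cut, levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"))
diamond$color = factor(diamond$color, levels = c("J", "I", "H", "G", "F", "E", "D"))
diamond$clarity = factor(diamond$clarity, levels = c("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"))

# 75/25 train/test split
set.seed(99)
train_ind = sample(seq_len(nrow(diamond)), size = floor(0.75 * nrow(diamond)))
train = diamond[train_ind, ]
test = diamond[-train_ind, ]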

Methods

Multiple linear regression and random forest regression were used to fit models to predict the price of diamonds. We started by using a variable selection method (regsubsets) to select the "best" subset of predictors. After selecting which predictors to use, we fit them with multiple linear regression and evaluated the model using the root mean square error (RMSE) criterion; this served as our base model. From there, we tried to improve on it using random forest regression, which we were told in class is one of the most powerful tools available. With random forest, we compared how the RMSE changes as mtry and ntree change, in an attempt to minimize the RMSE. Since we have a large dataset and random forest is computationally expensive, we opted to resample our dataset down to 2,000 observations.
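For reference, the evaluation criterion can be written as a one-line helper; this rmse function is our own illustration, not part of the original code:

# root mean square error between observed and predicted prices
rmse = function(actual, predicted) sqrt(mean((actual - predicted)^2))
# example usage with a fitted model `fit` and a held-out set `test` (names assumed):
# rmse(test$price, predict(fit, test))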


Results

Variable Selection Using Regsubsets

Regsubsets was used to select the best subset of variables.

[Figure: RMSE vs. Num Vars (x-axis: Num Vars, y-axis: RMSE)]

From the plot above, we can see that the RMSE is at its lowest at around 16 variables and beyond. Note that there were 3 categorical variables, which were dummy coded using 0 and 1, and each dummy is shown here as a separate variable. For example, colorD is coded as 0 if the color grade is not D and 1 if it is D, and counts as a separate variable here. After 16 variables the curve plateaus. Since we want a model with as few variables as possible, and the RMSE of the full model and of the 16-variable model are nearly the same, the 16-variable model is the best choice.
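To illustrate how a single factor expands into several 0/1 columns, a small sketch using base R's model.matrix (assuming the diamond data frame with the releveled factors from the Data Preparation step):

# each non-reference level of color becomes its own 0/1 indicator column
head(model.matrix(price ~ color, data = diamond))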

Multiple Linear Regression

The variables selected by regsubsets were used here on the full dataset to obtain our multiple linear regression model, which resulted in an RMSE of $1065.38. Our model is as follows:

price = 147.61 + 11242.57carat + 916.79colorI + 1399.00colorH + 1900.69colorG + 2098.07colorF
      + 2164.11colorE + 2380.17colorD + 2870.17claritySI2 + 3849.15claritySI1 + 4473.61clarityVS2
      + 4786.27clarityVS1 + 5174.62clarityVVS2 + 5242.24clarityVVS1 + 5599.45clarityIF
      - 83.86depth - 1041.00x + ε

The reference levels for the following categorical variables are:


Categorical Variable    Reference Level
color                   colorJ
clarity                 clarityI1

These are the worst color grade and the worst clarity grade; they were made the reference levels intentionally for easy comparison. We can see from our model that the coefficients for the color and clarity grades increase as the grade improves. Intuitively, this makes sense, since on average better quality means more expensive.
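A sketch of setting these reference levels explicitly with base R's relevel (if the factors were already ordered worst to best, as in Data Preparation, J and I1 are the reference levels by default):

# make the worst grades the baselines so each coefficient is a premium over them
diamond$color = relevel(diamond$color, ref = "J")
diamond$clarity = relevel(diamond$clarity, ref = "I1")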

Random Forest Regression

For random forest regression, we tried mtry values of 2, 3, 4, and 5, and ntree values of 100, 200, 300, and 400. The RMSE for each mtry and ntree combination is tabulated below.

mtry    ntree = 100    ntree = 200    ntree = 300    ntree = 400
2       $968.51        $955.21        $956.18        $960.16
3       $987.38        $994.53        $969.73        $989.85
4       $1025.51       $997.77        $1005.13       $1003.57
5       $1037.21       $1031.66       $1048.65       $1034.73

For visualization, we have plotted the RMSE against the mtry values separated by each value of ntree as

shown below:

[Figure: RMSE vs mtry, one line per ntree value (100, 200, 300, 400); x-axis: mtry, y-axis: RMSE]

Clearly, the "best" random forest regression model uses mtry = 2 and ntree = 200. For mtry = 2, changing ntree from 200 to 300 to 400 has only a marginal effect, since the RMSEs were very close to each other. Adding more trees beyond 200 did not improve the RMSE (it increased slightly), although for random forests such small differences are more likely sampling noise than genuine overfitting.

Discussion and Conclusion

Multiple linear regression was far "worse" than random forest regression, with respective RMSEs of $1065.38 versus $955.21 (using mtry = 2 and ntree = 200). Both RMSEs are tolerable considering that the price of each diamond ranged from $326 to $18,823. However, when using mtry of 4 or 5 with ntree of 100, 200, 300, or 400, the RMSEs were similar to that of the multiple linear regression model, at roughly $1,000 to $1,050. Our results therefore support the expectation that random forest would once again be the dominant method, since it came out on top.

Future Work

We could further extend this work by looking into gradient boosted trees as well as neural networks and deep learning. However, we chose to focus on random forest in this project since it is widely regarded as one of the "best" general-purpose methods.


References

https://bluenile.v360.in/49/imaged/gia-1162408531/2/still.jpg

https://vincentarelbundock.github.io/Rdatasets/datasets.html

An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani

Machine Learning with R by Brett Lantz


Appendix

Code

# looking at data

diamond = read.csv("diamonds.csv")

str(diamond)

# reorganizing levels

diamond$cut = factor(diamond$cut, levels = c("Fair", "Good", "Very Good", "Premium",

"Ideal"))

diamond$color = factor(diamond$color, levels = rev(levels(diamond$color)))  # reverse alphabetical D..J to J (worst) .. D (best)

diamond$clarity = factor(diamond$clarity, levels = c("I1", "SI2", "SI1", "VS2",

"VS1", "VVS2", "VVS1", "IF"))

str(diamond)

summary(diamond$price)

# resampling and splitting data into train and test sets

set.seed(99)

smp_train = 2000

train_ind <- sample(seq_len(nrow(diamond)), size = smp_train)

train <- diamond[train_ind, ]

smp_test = 0.25 * smp_train

# sample test rows from observations not already used for training, so train and test do not overlap
test_ind <- sample(setdiff(seq_len(nrow(diamond)), train_ind), size = smp_test)

test <- diamond[test_ind, ]

# using regsubsets to choose best model

library(leaps)

##--------------------------------------------------

## function to do rmse for k in 1:p

dovalbest = function(object, newdata, ynm) {

form = as.formula(object$call[[2]])

p = 23 #categorical variables split up denoted 0 or 1 for each level

rmsev = rep(0, p)

test.mat = model.matrix(form, newdata)

for (k in 1:p) {

coefk = coef(object, id = k)

xvars = names(coefk)

pred = test.mat[, xvars] %*% coefk

rmsev[k] = sqrt(mean((newdata[[ynm]] - pred)^2))

}

return(rmsev)

}

##------------------------------------------------------------

## do validation approach several times

ntry = 100

p = 23

resmat = matrix(0, p, ntry) #each row for num vars, each col for new train/test draw

for (i in 1:ntry) {
    # redraw a new train/validation split on each repetition, as described in the comment above
    tr_ind = sample(seq_len(nrow(diamond)), size = smp_train)
    val_ind = sample(setdiff(seq_len(nrow(diamond)), tr_ind), size = smp_test)
    regfit.best = regsubsets(price ~ ., data = diamond[tr_ind, ], nvmax = 23, nbest = 1,
        method = "exhaustive")
    resmat[, i] = dovalbest(regfit.best, diamond[val_ind, ], "price")
}

mresmat = apply(resmat, 1, mean) #average across columns

##--------------------------------------------------

## plot results of repeated train/val

plot(mresmat, xlab = "Num Vars", ylab = "RMSE", type = "b", col = "blue", pch = 19,

main = "RMSE vs. Num Vars")

##--------------------------------------------------

## Fit using number of vars chosen by train/validation and all the data.

kopt = 16 #optimal k=number of vars: chosen by eye-balling plot

regfit.best = regsubsets(price ~ ., data = diamond, nvmax = kopt, nbest = 1,

method = "exhaustive")

xmat = model.matrix(price ~ ., diamond)

ddf = data.frame(xmat[, -1], price = diamond$price) #don't use intercept (-1) & y=price

nms = c(names(coef(regfit.best, kopt))[-1], "price")

ddfsub = ddf[, nms] #drop all vars except those named by the coefficients at kopt

thereg = lm(price ~ ., ddfsub)

print(summary(thereg))

# multiple linear regression using variables from best subset

fit0 = lm(price ~ carat + color + clarity + depth + x, data = train)

pred = predict(fit0, test)

rmse = sqrt(mean((test$price - pred)^2))

cat("The root mean square error is: ", rmse)

fit00 = lm(price ~ carat + color + clarity + depth + x, data = diamond)

summary(fit00)

# random forest

library(randomForest)

fit1 = randomForest(price ~ carat + color + clarity + depth + x, data = train,

mtry = 2, ntree = 100)

pred1 = predict(fit1, test)

rmse1 = sqrt(mean((test$price - pred1)^2))

cat("The root mean square error is: ", rmse1)

fit2 = randomForest(price ~ carat + color + clarity + depth + x, data = train,

mtry = 2, ntree = 200)

pred2 = predict(fit2, test)

rmse2 = sqrt(mean((test$price - pred2)^2))

cat("The root mean square error is: ", rmse2)

fit3 = randomForest(price ~ carat + color + clarity + depth + x, data = train,

mtry = 2, ntree = 300)

pred3 = predict(fit3, test)

rmse3 = sqrt(mean((test$price - pred3)^2))

cat("The root mean square error is: ", rmse3)

fit4 = randomForest(price ~ carat + color + clarity + depth + x, data = train,


mtry = 2, ntree = 400)

pred4 = predict(fit4, test)

rmse4 = sqrt(mean((test$price - pred4)^2))

cat("The root mean square error is: ", rmse4)

rmeasure = c()
for (i in 2:5) {
    for (j in c(100, 200, 300, 400)) {
        fit <- randomForest(price ~ carat + color + clarity + depth + x, data = train,
            mtry = i, ntree = j)
        pred <- predict(fit, test)
        rmse <- sqrt(mean((test$price - pred)^2))
        rmeasure <- c(rmeasure, rmse)  # collect RMSEs in order (mtry outer, ntree inner)
        cat("mtry = ", i, "ntree = ", j, "RMSE = ", rmse, "\n")
    }
}

# plotting random forest RMSEs against mtry by ntree

x = c(2, 3, 4, 5)

y = c(968.506, 955.2064, 956.1764, 960.1649, 987.3767, 994.533, 969.7294, 989.8527,

1025.512, 997.7698, 1005.133, 1003.568, 1037.209, 1031.658, 1048.647, 1034.725)

ntree100 = c(968.506, 987.3767, 1025.512, 1037.209)

ntree200 = c(955.2064, 994.533, 997.7698, 1031.658)

ntree300 = c(956.1764, 969.7294, 1005.133, 1048.647)

ntree400 = c(960.1649, 989.8527, 1003.568, 1034.725)


plot(x, ntree100, type = "l", col = 2, xlab = "mtry", ylab = "RMSE", main = "RMSE vs mtry",

ylim = c(930, 1070), axes = FALSE)

lines(x, ntree200, type = "l", col = 3)

lines(x, ntree300, type = "l", col = 4)

lines(x, ntree400, type = "l", col = 5)

legend("topleft", c("ntree = 100", "ntree = 200", "ntree = 300", "ntree = 400"),

col = c("2", "3", "4", "5"), lty = 1, cex = 0.7)

axis(side = 1, at = c(2:5))

axis(side = 2, at = c(930, 950, 970, 990, 1010, 1030, 1050, 1070))

box()



版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。 站长地图

python代写
微信客服:codinghelp