The code is presented and explained, and outputs are shown where appropriate. To save space and maintain clarity, only selected results are shown at the end. For further details, please see the executable R data file provided on the data storage device.
The first section of the code represents the model assumptions.
This line of code sets the number of cross-validation folds and could be lowered to reduce run time. The results were generated with 5 folds, which amounted to about 2 minutes of run time. Cross-validation is used to tune the machine-learning methods.
cv.folds<-5
The model assumes p = 10, 50, 150 and 290 features; these candidate values are stored in the vector p.vec. The number of observations n is kept constant at 300. The parameter cvfolds takes the value of cv.folds assigned above.
p.vec<-c(10, 50, 150, 290)
n<-300
cvfolds <- cv.folds
Next, the names of the four runs and alternatives are set up, which will later be used to label the rows and columns of the results data frames.
index<-paste("p=", p.vec, sep = "")
alt.names <- c("TF", "HB", "NC")
run.names <- c("run1", "run2","run3", "run4")
Lists are created to store results such as signal-to-noise ratios, accuracies, coefficients, confusion matrices, variable importances and tabulations of the y-variable.
details <- list()
coeff.logit <- list()
details1 <- list()
details3 <- list()
details4 <- list()
coeff.reg <- list()
coeff.reg.TF<- list()
coeff.reg.HB<- list()
coeff.reg.NC<- list()
coeff.lasso <- list()
coeff.lasso.TF <- list()
coeff.lasso.HB <- list()
coeff.lasso.NC <- list()
cm.logit <- list()
cm.reg <- list()
cm.lasso <- list()
cm.tree <- list()
cm.rf <- list()
cm.boost <- list()
rf_plot <- list()
gbm_plot <- list()
summ_fit_sample <- list()
summ_test_sample <- list()
summ_fit_lin_sample <- list()
summ_test_lin_sample <- list()
Finally, a for-loop iterates over the assigned values of p.
for (i in 1:length(p.vec)) {
p<-p.vec[i]
Next, the relevant coefficients are set to -5, 3 and 5. The variable sparce counts the number of non-relevant features. In the first run (p = 10), which serves as the running example, there are 7 non-relevant coefficients.
pr<--5 #price
qual<-3 #quality
pop<-5 #popularity
sparce<-p-(length(pr)+length(qual)+length(pop)) #count zero betas for sparsity
sparce
## [1] 7
Then a beta vector is created with the 3 relevant coefficients, while the rest are set to 0.
set.seed(1)
beta2<- c(pr, qual, pop, rep(0, sparce))
beta2
## [1] -5 3 5 0 0 0 0 0 0 0
In the next step, the price, quality and popularity means are assigned to the three alternatives (TF, HB and NC). The alternatives are modeled to be dissimilar, so each has a different magnitude of price, quality and popularity. The coefficient means of the first alternative are shown below.
prices<-c(3.90, 3.50, 1.50)
mu1=c(rep(prices[1] , length(pr)), rep(5 , length(qual)), rep(2 , length(pop)),
rep(0, sparce)) #TF
mu2=c(rep(prices[2] , length(pr)), rep(1 , length(qual)), rep(4 , length(pop)),
rep(0, sparce)) #HB
mu3=c(rep(prices[3] , length(pr)), rep(1 , length(qual)), rep(2 , length(pop)),
rep(0, sparce)) #NC
round(mu1, 1)
## [1] 3.9 5.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Next, the covariance structure of the x's is created, based on a diagonal matrix; the x's are therefore independent of each other.
set.seed(1)
covar=diag(0.2, p, p)
head(covar)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0.2 0.0 0.0 0.0 0.0 0.0 0 0 0 0
## [2,] 0.0 0.2 0.0 0.0 0.0 0.0 0 0 0 0
## [3,] 0.0 0.0 0.2 0.0 0.0 0.0 0 0 0 0
## [4,] 0.0 0.0 0.0 0.2 0.0 0.0 0 0 0 0
## [5,] 0.0 0.0 0.0 0.0 0.2 0.0 0 0 0 0
## [6,] 0.0 0.0 0.0 0.0 0.0 0.2 0 0 0 0
After that, the x's are drawn from the multivariate normal distribution using the means and the covariance structure.
X1=rmvnorm(n, mean=mu1, sigma=covar)
X2=rmvnorm(n, mean=mu2, sigma=covar)
X3=rmvnorm(n, mean=mu3, sigma=covar)
str(X1)
## num [1:300, 1:10] 3.62 4.58 4.31 4.51 3.83 ...
The created X-matrices have 300 rows (observations) and, in the first run, 10 columns (features).
Finally the x’s are packed into a matrix for each alternative and are labeled.
x1<-matrix(X1, nrow = n, ncol = p)
colnames(x1)<-paste(as.character(1:p), rep(".TF", p), sep = "")
x2<-matrix(X2, nrow = n, ncol = p)
colnames(x2)<-paste(as.character(1:p), rep(".HB", p), sep = "")
x3<-matrix(X3, nrow = n, ncol = p)
colnames(x3)<-paste(as.character(1:p), rep(".NC", p), sep = "")
To obtain the representative utilities, the beta vector is multiplied with the transposed X-matrices.
V1<-beta2%*%t(x1) #1st product
V2<-beta2%*%t(x2) #2nd product
V3<-beta2%*%t(x3) #3rd product
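In matrix notation, this computes $V_j = X_j \beta$ for each alternative $j \in \{TF, HB, NC\}$, so each entry of $V_j$ is the representative utility of that alternative for one observation.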
The resulting representative utilities are bound into a single matrix.
V<-cbind(t(V1), t(V2), t(V3))
summary(V)
## V1 V2 V3
## Min. :-6.176 Min. :-3.674 Min. :-3.395
## 1st Qu.: 3.288 1st Qu.: 3.197 1st Qu.: 3.805
## Median : 5.508 Median : 6.087 Median : 5.802
## Mean : 5.759 Mean : 5.718 Mean : 5.749
## 3rd Qu.: 8.054 3rd Qu.: 7.852 3rd Qu.: 7.943
## Max. :15.053 Max. :16.076 Max. :13.755
str(V)
## num [1:300, 1:3] 5.28 1.25 4.66 3.19 7.09 ...
Next, the error term is generated using the evd package for extreme-value distributions.
#install.packages("evd")
library(evd)
set.seed(1)
e1<-rgumbel(n*ncol(V), loc=0, scale=1.72)
e1<-matrix(e1,n,ncol(V),byrow=FALSE)
str(e1)
## num [1:300, 1:3] 0.483 -0.287 3.313 3.384 1.428 ...
The model is completed by adding the unobserved (random) values of utility.
z1<-V+e1
signal<-mean(var(V))
noise<-mean(var(e1))
noise
## [1] 1.64465
The noise level is consistent with the logit-model assumption of i.i.d. Gumbel errors.
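As a quick plausibility check (an aside, not part of the original script): a Gumbel draw with scale $\beta$ has variance $\pi^2\beta^2/6$, so with $\beta = 1.72$ each error column has a theoretical variance of about 4.87. Since mean(var(e1)) averages all nine entries of the 3-by-3 sample covariance matrix and the off-diagonal entries are near zero, the reported noise is roughly a third of that.
pi^2 * 1.72^2 / 6 #theoretical variance per error column, about 4.87
pi^2 * 1.72^2 / 18 #average over the 3x3 covariance matrix, about 1.62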
snr<-mean(var(V))/mean(var(e1))
snr
## [1] 2.236826
The signal-to-noise ratio is above 2.
pr1 = exp(z1)/apply(exp(z1), drop=F, 1, sum)
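This is the softmax transformation applied row-wise: for observation $i$ and alternative $j$, $P_{ij} = \exp(z_{ij}) / \sum_{k=1}^{3} \exp(z_{ik})$, mapping the simulated utilities z1 = V + e1 into choice probabilities.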
The probabilities have the necessary properties. They lie between 0 and 1.
specify_decimal <- function(x, k) trimws(format(round(x, k), nsmall=k))
head(specify_decimal(pr1, 2))
## [,1] [,2] [,3]
## [1,] "0.05" "0.95" "0.00"
## [2,] "0.01" "0.00" "0.98"
## [3,] "0.23" "0.32" "0.45"
## [4,] "0.40" "0.00" "0.60"
## [5,] "0.95" "0.05" "0.00"
## [6,] "0.00" "0.96" "0.04"
They sum to 1.
head(apply(pr1, drop=F, 1, sum))
## [1] 1 1 1 1 1 1
Next, the rmultinom function is applied row-wise, via apply, to the probability matrix generated in the previous step. One random vector is drawn for each row of the probability matrix, so that one alternative is chosen per observation. As the output shows, the column with the highest probability has the highest chance of taking the value 1.
mChoices1 = t(apply(pr1, 1, rmultinom, n = 1, size = 1))
head(mChoices1)
## [,1] [,2] [,3]
## [1,] 0 1 0
## [2,] 0 0 1
## [3,] 0 1 0
## [4,] 0 0 1
## [5,] 1 0 0
## [6,] 0 1 0
By replacing the 1s with their respective column positions, a categorical outcome results with the options y = 1, 2 or 3.
y1 = apply(mChoices1, 1, function(x) which(x==1))
head(y1)
## [1] 2 3 2 3 1 2
length(y1)
## [1] 300
tabulate(y1)
## [1] 103 109 88
A data frame is saved for the non-linear methods: logit, tree, random forest and boosting.
cereals<-data.frame(y1,x1,x2,x3)
str(cereals)
## 'data.frame': 300 obs. of 31 variables:
## $ y1 : int 2 3 2 3 1 2 2 3 1 1 ...
## $ X1.TF : num 3.62 4.58 4.31 4.51 3.83 ...
## $ X2.TF : num 5.08 5.17 5.35 4.95 4.89 ...
## $ X3.TF : num 1.63 1.72 2.03 2.17 2.31 ...
## $ X4.TF : num 0.7134 -0.9904 -0.8897 -0.0241 0.2489 ...
## $ X5.TF : num 0.147 0.503 0.277 -0.616 -0.308 ...
## $ X6.TF : num -0.3669 -0.0201 -0.0251 -0.1856 -0.3164 ...
## $ X7.TF : num 0.21798 -0.00724 -0.06967 -0.17633 0.16305 ...
## $ X8.TF : num 0.3302 0.4221 -0.6577 -0.0265 0.3437 ...
## $ X9.TF : num 0.2575 0.3673 -0.2138 0.4919 -0.0502 ...
## $ X10.TF: num -0.137 0.266 0.187 0.341 0.394 ...
## $ X1.HB : num 3.83 3.72 3.52 4.1 3.7 ...
## $ X2.HB : num 1.173 1.192 1.69 0.206 0.804 ...
## $ X3.HB : num 4.58 3.43 4.43 4.17 4.54 ...
## $ X4.HB : num -0.3594 1.0001 -0.8194 0.0188 -0.6364 ...
## $ X5.HB : num -0.7167 0.1485 0.0595 -0.0594 -0.0263 ...
## $ X6.HB : num 0.41736 -0.06229 -0.46215 -0.00624 0.09111 ...
## $ X7.HB : num 0.808 -0.329 -0.773 -0.164 0.332 ...
## $ X8.HB : num -0.0253 -1.2422 -0.5187 -0.047 -0.1689 ...
## $ X9.HB : num 0.843 -0.144 -0.623 0.742 1.064 ...
## $ X10.HB: num 0.706 -0.463 -0.448 0.695 -0.397 ...
## $ X1.NC : num 1.22 1.04 1.52 1.12 2.18 ...
## $ X2.NC : num 0.504 0.837 1.444 0.163 1.186 ...
## $ X3.NC : num 1.03 1.38 1.26 2.33 1.67 ...
## $ X4.NC : num -0.014 -0.2395 0.0635 -0.1548 -0.7202 ...
## $ X5.NC : num -0.116 0.123 -0.301 -0.623 -0.232 ...
## $ X6.NC : num 0.239 0.588 -0.645 -0.349 0.529 ...
## $ X7.NC : num -0.2502 -0.0762 0.3721 0.2759 0.0937 ...
## $ X8.NC : num 0.7193 0.6471 0.1148 -0.1233 -0.0928 ...
## $ X9.NC : num 0.249 0.737 -0.307 0.716 -0.498 ...
## $ X10.NC: num 0.08301 0.45325 -0.00687 -0.06702 0.82171 ...
For the linear methods, regression and LASSO, a second data frame is saved. It is identical to the first, except that it additionally contains the three simulated choice probabilities (X1, X2 and X3), which the linear models use as dependent variables.
cereals.lin<-data.frame(y1,pr1,x1,x2,x3)
names(cereals.lin)
## [1] "y1" "X1" "X2" "X3" "X1.TF" "X2.TF" "X3.TF"
## [8] "X4.TF" "X5.TF" "X6.TF" "X7.TF" "X8.TF" "X9.TF" "X10.TF"
## [15] "X1.HB" "X2.HB" "X3.HB" "X4.HB" "X5.HB" "X6.HB" "X7.HB"
## [22] "X8.HB" "X9.HB" "X10.HB" "X1.NC" "X2.NC" "X3.NC" "X4.NC"
## [29] "X5.NC" "X6.NC" "X7.NC" "X8.NC" "X9.NC" "X10.NC"
Format the y-column in the data frame as a factor and call it “choice”.
choice<-factor(cereals$y1)
head(choice)
## [1] 2 3 2 3 1 2
## Levels: 1 2 3
Name the levels of the factor-formatted choice column: 1 for Tiger Flakes (TF), 2 for Honey Bits (HB) and 3 for Nougat Crisps (NC).
levels<-c("TF", "HB", "NC")
levels(choice)<-levels
head(choice)
## [1] HB NC HB NC TF HB
## Levels: TF HB NC
Replace the “y” column with the “choice” column in the data frames.
cereals<-cbind(choice, cereals[,-1])
cereals.lin<-cbind(choice, cereals.lin[,-1])
The fitting of the six classical and machine learning models is presented next.
Find the split point in the data.
r<-nrow(cereals)/2
r
## [1] 150
cereals_fit<-cereals[1:r,]
cereals_test<-cereals[-(1:r),]
nrow(cereals_fit)
## [1] 150
The column means show that only the relevant features (X1 to X3) have means clearly different from 0.
round(sapply(cereals_fit[,-1], mean), 2)
## X1.TF X2.TF X3.TF X4.TF X5.TF X6.TF X7.TF X8.TF X9.TF X10.TF
## 3.86 5.08 2.01 -0.07 0.06 -0.03 -0.02 -0.03 -0.03 0.03
## X1.HB X2.HB X3.HB X4.HB X5.HB X6.HB X7.HB X8.HB X9.HB X10.HB
## 3.49 1.00 4.05 0.00 0.01 -0.02 -0.02 0.08 -0.01 -0.05
## X1.NC X2.NC X3.NC X4.NC X5.NC X6.NC X7.NC X8.NC X9.NC X10.NC
## 1.46 1.07 1.94 -0.02 -0.06 -0.02 -0.01 0.00 0.09 -0.01
Below is a tabulation of the selected alternatives in the train sample.
summ_cereals_fit<-summary(cereals_fit$choice)
summ_cereals_fit
## TF HB NC
## 51 57 42
summ_cereals_test<-summary(cereals_test$choice)
The mlogit package requires the data in a long format with one row per observation-alternative pair; mlogit.data performs the reshaping.
library(mlogit)
Cereal_fit <- mlogit.data(cereals_fit, varying = c(2:ncol(cereals_fit)),
shape = "wide", choice = "choice")
Cereal_test <- mlogit.data(cereals_test, varying = c(2:ncol(cereals_test)),
shape = "wide", choice = "choice")
last<-ncol(Cereal_fit)-2
k<-colnames(Cereal_fit)[-1]
k<-k[-1]
k<-k[-last]
factors <- k #only the feature names X1 to Xp remain
factors
## [1] "X1" "X2" "X3" "X4" "X5" "X6" "X7" "X8" "X9" "X10"
X<-as.formula(paste("choice~", paste(factors, collapse="+")))
X
## choice ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10
f1 <- mFormula(X)
f0 <- update(f1, .~.-1)
m<-mlogit(f0, data=Cereal_fit, seed = 1)
Make predictions. The logit predictions are in the form of probabilities.
library("caret")
set.seed(1)
logit.pred_in<-fitted(m, outcome = FALSE)
head(logit.pred_in)
## HB NC TF
## 1 0.80802081 0.00895389 0.183025301
## 2 0.15875477 0.71489982 0.126345407
## 3 0.95266582 0.01520610 0.032128083
## 4 0.03575981 0.85312309 0.111117101
## 5 0.36271514 0.01108054 0.626204322
## 6 0.70555287 0.29111293 0.003334203
The column names are used to label the alternatives.
y_hat = colnames(logit.pred_in)[apply(logit.pred_in, 1, which.max)]
head(y_hat)
## [1] "HB" "NC" "HB" "NC" "TF" "HB"
They are then compared to the observed alternatives from the train sample.
logit_in = confusionMatrix(cereals_fit$choice, as.factor(y_hat))
The accuracy is read from the generated confusion-matrix output.
logit_in[2]
## $table
## Reference
## Prediction HB NC TF
## HB 41 6 10
## NC 9 31 2
## TF 6 3 42
logit_in<-logit_in$overall[1]*100
logit_in
## Accuracy
## 76
logit_in<-logit_in[[1]]
logit_in
## [1] 76
Make predictions on the test sample.
logit.pred_out<-predict(m, newdata=Cereal_test)
Specify a decimal format for the predicted probabilities and reformat the predicted probabilities into a readable format.
specify_decimal <- function(x, k) trimws(format(round(x, k), nsmall=k))
head(specify_decimal(logit.pred_out, 2))
## HB NC TF
## 1 "0.01" "0.62" "0.38"
## 2 "0.02" "0.43" "0.55"
## 3 "0.00" "0.00" "1.00"
## 4 "0.87" "0.11" "0.02"
## 5 "0.04" "0.89" "0.06"
## 6 "0.26" "0.03" "0.71"
Verify that the predicted probabilities sum to 1.
head(apply(logit.pred_out, drop=F, 1, sum))
## 1 2 3 4 5 6
## 1 1 1 1 1 1
Record accuracy.
y_hat = colnames(logit.pred_out)[apply(logit.pred_out, 1, which.max)]
head(y_hat)
## [1] "NC" "TF" "TF" "HB" "NC" "TF"
logit_out_cm = confusionMatrix(as.factor(y_hat), cereals_test$choice)
logit_out<-logit_out_cm$overall[1]*100
logit_out<-logit_out[[1]]
logit<-round(cbind(logit_in, logit_out), 0)
logit
## logit_in logit_out
## [1,] 76 78
Find the split point in the data and split it.
r<-nrow(cereals.lin)/2
cereals_fit<-cereals.lin[1:r,]
cereals_test<-cereals.lin[-(1:r),]
names(cereals_fit)
## [1] "choice" "X1" "X2" "X3" "X1.TF" "X2.TF" "X3.TF"
## [8] "X4.TF" "X5.TF" "X6.TF" "X7.TF" "X8.TF" "X9.TF" "X10.TF"
## [15] "X1.HB" "X2.HB" "X3.HB" "X4.HB" "X5.HB" "X6.HB" "X7.HB"
## [22] "X8.HB" "X9.HB" "X10.HB" "X1.NC" "X2.NC" "X3.NC" "X4.NC"
## [29] "X5.NC" "X6.NC" "X7.NC" "X8.NC" "X9.NC" "X10.NC"
The test sample includes the logit probabilities. Record the summary statistics of the samples for later use.
summ_cereals_fit_lin<-summary(cereals_fit$choice)
summ_cereals_test_lin<-summary(cereals_test$choice)
Data frames are built using the logit probability and the features per alternative.
cereals.TF<-cbind(cereals_fit$X1, cereals_fit[(1+4):(5+p-1)]) #TF probability and the p TF features
str(cereals.TF)
## 'data.frame': 150 obs. of 11 variables:
## $ cereals_fit$X1: num 0.0477 0.0122 0.2296 0.4004 0.9462 ...
## $ X1.TF : num 3.62 4.58 4.31 4.51 3.83 ...
## $ X2.TF : num 5.08 5.17 5.35 4.95 4.89 ...
## $ X3.TF : num 1.63 1.72 2.03 2.17 2.31 ...
## $ X4.TF : num 0.7134 -0.9904 -0.8897 -0.0241 0.2489 ...
## $ X5.TF : num 0.147 0.503 0.277 -0.616 -0.308 ...
## $ X6.TF : num -0.3669 -0.0201 -0.0251 -0.1856 -0.3164 ...
## $ X7.TF : num 0.21798 -0.00724 -0.06967 -0.17633 0.16305 ...
## $ X8.TF : num 0.3302 0.4221 -0.6577 -0.0265 0.3437 ...
## $ X9.TF : num 0.2575 0.3673 -0.2138 0.4919 -0.0502 ...
## $ X10.TF : num -0.137 0.266 0.187 0.341 0.394 ...
cereals.HB<-cbind(cereals_fit$X2, cereals_fit[(5+p-1+1):(5+2*p-1)]) #HB probability and features
cereals.NC<-cbind(cereals_fit$X3, cereals_fit[(5+2*p-1+1):ncol(cereals_fit)]) #NC probability and features
The test data have the same structure as the train data.
cereals.TF.test<-cbind(cereals_test$X1, cereals_test[(1+4):(5+p-1)])
cereals.HB.test<-cbind(cereals_test$X2, cereals_test[(5+p-1+1):(5+2*p-1)])
cereals.NC.test<-cbind(cereals_test$X3, cereals_test[(5+2*p-1+1):ncol(cereals_fit)])
k<-colnames(cereals.TF[2:ncol(cereals.TF)])
TF<-as.formula(paste("log(cereals.TF[,1])~", paste(k, collapse="+")))
TF
## log(cereals.TF[, 1]) ~ X1.TF + X2.TF + X3.TF + X4.TF + X5.TF +
## X6.TF + X7.TF + X8.TF + X9.TF + X10.TF
k<-colnames(cereals.HB[2:ncol(cereals.HB)])
HB<-as.formula(paste("log(cereals.HB[,1])~", paste(k, collapse="+")))
HB
## log(cereals.HB[, 1]) ~ X1.HB + X2.HB + X3.HB + X4.HB + X5.HB +
## X6.HB + X7.HB + X8.HB + X9.HB + X10.HB
k<-colnames(cereals.NC[2:ncol(cereals.NC)])
NC<-as.formula(paste("log(cereals.NC[,1])~", paste(k, collapse="+")))
NC
## log(cereals.NC[, 1]) ~ X1.NC + X2.NC + X3.NC + X4.NC + X5.NC +
## X6.NC + X7.NC + X8.NC + X9.NC + X10.NC
form.TF <- update(TF, .~.-1)
form.HB <- update(HB, .~.-1)
form.NC <- update(NC, .~.-1)
ols.TF <- lm(form.TF, data=cereals.TF)
ols.HB <- lm(form.HB, data=cereals.HB)
ols.NC <- lm(form.NC, data=cereals.NC)
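Regressing the log probability on the features is sensible here because the simulated probabilities satisfy $\log P_{ij} = z_{ij} - \log \sum_{k} \exp(z_{ik})$; the subtracted term is constant across alternatives within an observation, so ranking alternatives by fitted log probability is equivalent to ranking them by utility.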
The predictions of the three regressions are compared.
TF<-fitted(ols.TF, outcome = FALSE)
HB<-fitted(ols.HB, outcome = FALSE)
NC<-fitted(ols.NC, outcome = FALSE)
util<-cbind(TF, HB, NC)
head(util, 4)
## TF HB NC
## 1 -2.848340 -2.811533 -4.560369
## 2 -5.846930 -5.872507 -2.778482
## 3 -5.301662 -2.086851 -4.101773
## 4 -5.686562 -6.390244 -2.695700
The alternative that has the highest utility is chosen.
wahl<-apply(util, 1, function(x) which.max(x))
head(wahl)
## 1 2 3 4 5 6
## 2 3 2 3 1 2
The column numbers serve as indicators for the alternatives. The predicted choices are compared with the observed choices in a confusion matrix.
y_hat<-as.numeric(wahl)
cm = table(wahl, cereals_fit$choice)
cm
##
## wahl TF HB NC
## 1 40 7 3
## 2 6 44 9
## 3 5 6 30
The accuracy is calculated.
reg_in<-sum(diag(cm))/sum(cm)*100
reg_in
## [1] 76
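Since this diagonal-over-total computation recurs for every method below, it could be wrapped in a small helper function; a minimal sketch (acc is a hypothetical name and is not used in the remainder of the script):
acc <- function(cm) sum(diag(cm))/sum(cm)*100 #overall accuracy in percent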
Make predictions on the test sample.
TF=predict(ols.TF, newdata =cereals.TF.test)
HB=predict(ols.HB, newdata =cereals.HB.test)
NC=predict(ols.NC, newdata =cereals.NC.test)
util<-cbind(TF, HB, NC)
wahl<-apply(util, 1, function(x) which.max(x))
reg_out_cm = table(wahl, cereals_test$choice)
rownames(reg_out_cm)<- alt.names
reg_out<-sum(diag(reg_out_cm))/sum(reg_out_cm)*100
lin.reg<-round(cbind(reg_in, reg_out), 0)
lin.reg
## reg_in reg_out
## [1,] 76 71
LASSO via the glmnet package requires the y and x variables to be coded as separate matrices.
Y.train.TF = log(cereals.TF[, 1]) # output variable
str(Y.train.TF)
## num [1:150] -3.0428 -4.4063 -1.4714 -0.9153 -0.0553 ...
X.train.TF = as.matrix(cereals.TF[,2:ncol(cereals.TF)]) # regressors
str(X.train.TF)
## num [1:150, 1:10] 3.62 4.58 4.31 4.51 3.83 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:150] "1" "2" "3" "4" ...
## ..$ : chr [1:10] "X1.TF" "X2.TF" "X3.TF" "X4.TF" ...
Y.train.HB = log(cereals.HB[, 1]) # output variable
X.train.HB = as.matrix(cereals.HB[,2:ncol(cereals.HB)]) # regressors
Y.train.NC = log(cereals.NC[, 1]) # output variable
X.train.NC = as.matrix(cereals.NC[,2:ncol(cereals.NC)]) # regressors
Y.test.TF = log(cereals.TF.test[, 1]) # output variable
X.test.TF = as.matrix(cereals.TF.test[,2:ncol(cereals.TF)]) # regressors
Y.test.HB = log(cereals.HB.test[, 1]) # output variable
X.test.HB = as.matrix(cereals.HB.test[,2:ncol(cereals.HB)]) # regressors
Y.test.NC = log(cereals.NC.test[, 1]) # output variable
X.test.NC = as.matrix(cereals.NC.test[,2:ncol(cereals.NC)]) # regressors
library(glmnet)
lasso.cv.TF=cv.glmnet(X.train.TF, Y.train.TF, family="gaussian",alpha=1)
lasso.cv.HB=cv.glmnet(X.train.HB, Y.train.HB, family="gaussian",alpha=1)
lasso.cv.NC=cv.glmnet(X.train.NC, Y.train.NC, family="gaussian",alpha=1)
bestlam.lasso.cv = lasso.cv.TF$lambda.min
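cv.glmnet selects the penalty by cross-validation, and lambda.min is the value that minimises the cross-validated error. The more conservative lambda.1se, the largest lambda within one standard error of the minimum, would be a common alternative; a commented-out sketch:
#bestlam.lasso.cv = lasso.cv.TF$lambda.1se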
reg.la.TF=glmnet(X.train.TF,Y.train.TF, family="gaussian", alpha=1,
lambda = bestlam.lasso.cv)
TF = predict(reg.la.TF, s= bestlam.lasso.cv, newx=X.train.TF)
str(TF)
## num [1:150, 1] -3.3 -5.6 -4.45 -5.16 -2.76 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:150] "1" "2" "3" "4" ...
## ..$ : chr "1"
The predictions are the fitted utilities.
bestlam.lasso.cv = lasso.cv.HB$lambda.min
reg.la.HB=glmnet(X.train.HB,Y.train.HB, family="gaussian", alpha=1,
lambda = bestlam.lasso.cv)
HB = predict(reg.la.HB, s= bestlam.lasso.cv, newx=X.train.HB)
bestlam.lasso.cv = lasso.cv.NC$lambda.min
reg.la.NC=glmnet(X.train.NC,Y.train.NC, family="gaussian", alpha=1,
lambda = bestlam.lasso.cv)
NC = predict(reg.la.NC, s= bestlam.lasso.cv, newx=X.train.NC)
W <- as.matrix(coef(reg.la.TF))
W
## s0
## (Intercept) -7.96959386
## X1.TF -2.89974525
## X2.TF 2.20885371
## X3.TF 2.28251390
## X4.TF 0.00000000
## X5.TF 0.00000000
## X6.TF 0.00000000
## X7.TF 0.05462683
## X8.TF 0.65283385
## X9.TF 0.00000000
## X10.TF 0.00000000
keep_X <- rownames(W)[W!=0]
keep_X <- keep_X[!keep_X == "(Intercept)"] #remove intercept
keep_X
## [1] "X1.TF" "X2.TF" "X3.TF" "X7.TF" "X8.TF"
x<- X.train.TF[,keep_X]
summ.lasso.TF<-lm(Y.train.TF~x -1)
#stargazer(summ.lasso.TF, type = "latex")
W <- as.matrix(coef(reg.la.HB))
keep_X <- rownames(W)[W!=0] #only keep non-zero coefficients
keep_X <- keep_X[!keep_X == "(Intercept)"] #drop the intercept
x<- X.train.HB[,keep_X] #only use selected features
summ.lasso.HB<-lm(Y.train.HB~x -1) #remove intercept in new model
W <- as.matrix(coef(reg.la.NC))
keep_X <- rownames(W)[W!=0]
keep_X <- keep_X[!keep_X == "(Intercept)"]
x<- X.train.NC[,keep_X]
summ.lasso.NC<-lm(Y.train.NC~x -1)
util<-cbind(TF, HB, NC)
colnames(util)<-levels
wahl<-apply(util, 1, function(x) which.max(x))
cm = table(wahl, cereals_fit$choice)
lasso_in<-sum(diag(cm))/sum(cm)*100
TF = predict(reg.la.TF, s= lasso.cv.TF$lambda.min, newx=X.test.TF)
HB = predict(reg.la.HB, s= lasso.cv.HB$lambda.min, newx=X.test.HB)
NC = predict(reg.la.NC, s= lasso.cv.NC$lambda.min, newx=X.test.NC)
util<-cbind(TF, HB, NC)
wahl<-apply(util, 1, function(x) which.max(x))
lasso_out_cm = table(wahl, cereals_test$choice)
rownames(lasso_out_cm)<-alt.names
lasso_out<-sum(diag(lasso_out_cm))/sum(lasso_out_cm)*100
lasso.cv<-round(cbind(lasso_in, lasso_out), 0)
lasso.cv
## lasso_in lasso_out
## [1,] 72 77
This is the same split as for the logit model, but it has to be done again because the variable names overlap.
r<-nrow(cereals)/2
cereals_fit<-cereals[1:r,]
cereals_test<-cereals[-(1:r),]
The tree is first grown with a small cp of 0.001 so that the full cp table can be inspected.
library(rpart)
library(rpart.plot)
tree <- rpart(choice ~ ., data = cereals_fit, method = "class", cp= 0.001)
The optimal cp, displayed below by the printcp() function for the first run, does not match the cp chosen in the code below. The chosen cp level, however, yields good overall results and remedies the poor classification that would occur in the fourth run if the minimal cp were used.
printcp(tree)
##
## Classification tree:
## rpart(formula = choice ~ ., data = cereals_fit, method = "class",
## cp = 0.001)
##
## Variables actually used in tree construction:
## [1] X1.TF X2.TF X3.HB X3.NC X3.TF X5.HB X7.TF
##
## Root node error: 93/150 = 0.62
##
## n= 150
##
## CP nsplit rel error xerror xstd
## 1 0.193548 0 1.00000 1.06452 0.062384
## 2 0.107527 1 0.80645 0.93548 0.064998
## 3 0.080645 2 0.69892 0.97849 0.064331
## 4 0.053763 4 0.53763 0.92473 0.065134
## 5 0.032258 5 0.48387 0.84946 0.065753
## 6 0.021505 6 0.45161 0.80645 0.065846
## 7 0.001000 7 0.43011 0.83871 0.065794
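For comparison, the cp minimising the cross-validated error (xerror) could also be read off the cp table automatically; a commented-out sketch (best.cp is a hypothetical name, and this variant is not the one used for the reported results):
#best.cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
#tree <- rpart(choice ~ ., data = cereals_fit, method = "class", cp = best.cp)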
tree <- rpart(choice ~ ., data = cereals_fit, method = "class", cp= 0.06451613)
Probabilities for each class are predicted.
fit.tree<-predict(tree, newdata=cereals_fit)
head(fit.tree, 4)
## TF HB NC
## 1 0.24324324 0.5135135 0.2432432
## 2 0.20000000 0.5600000 0.2400000
## 3 0.20000000 0.5600000 0.2400000
## 4 0.08695652 0.1304348 0.7826087
Choose the alternative with the maximum predicted class probability. Compare to the observed choices and calculate the accuracy from the confusion matrix.
fit.tree<-apply(fit.tree, 1, function(x) which.max(x))
head(fit.tree)
## 1 2 3 4 5 6
## 2 2 2 3 1 3
cm = table(fit.tree, cereals_fit$choice)
tree_in<-sum(diag(cm))/sum(cm)*100
tree.pred=predict(tree, newdata=cereals_test)
fit.tree<-apply(tree.pred, 1, function(x) which.max(x))
tree_out_cm = table(fit.tree, cereals_test$choice)
rownames(tree_out_cm)<-alt.names
tree_out<-sum(diag(tree_out_cm))/sum(tree_out_cm)*100
tree<-round(cbind(tree_in, tree_out), 0)
tree
## tree_in tree_out
## [1,] 67 49
library(randomForest)
set.seed(1)
The commented-out lines show manual fitting alternatives with the randomForest package; they were not used, in favor of the automatic tuning below.
#bag1 = randomForest(choice ~ ., data=cereals_fit, mtry = round(sqrt(p),0),
#ntree=500)
#bag1 = randomForest(choice ~ ., data=cereals_fit, mtry =23, ntree=500,
#importance=TRUE, proximity=TRUE)
#bag1 = randomForest(choice ~ ., data=cereals_fit, importance=TRUE,
#proximity=TRUE)
The caret package randomly divides the training data into folds, trains the model on them, and averages the resulting errors. This automated tuning is preferable to the manual fitting.
library("caret")
tc = trainControl(method = "repeatedcv", number = cvfolds)
bag1 = train(choice ~., data=cereals_fit, method="rf", trControl=tc)
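If finer control over the tuning were desired, an explicit grid for mtry, the tuning parameter of caret's "rf" method, could be supplied via the tuneGrid argument; a commented-out sketch with illustrative grid values:
#rf.grid <- expand.grid(mtry = c(2, 5, 10))
#bag1 = train(choice ~., data=cereals_fit, method="rf", trControl=tc, tuneGrid=rf.grid)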
If enabled, print(bag1) shows how the parameters were tuned. It is disabled for clarity and to save space. Please see the provided executable R data file for more details.
#print(bag1)
rf_var_plot <- varImp(bag1, top = 20) #only to be used with bag1 from caret package
pred.bag1 = predict(bag1, newdata =cereals_fit)
head(pred.bag1) #predictions are in the form of categories
## [1] HB NC HB NC TF HB
## Levels: TF HB NC
cm = table(pred.bag1, cereals_fit$choice)
rf_in<-sum(diag(cm))/sum(cm)*100
pred.bag1 = predict(bag1, newdata =cereals_test)
rf_out_cm = table(pred.bag1, cereals_test$choice)
rf_out<-sum(diag(rf_out_cm))/sum(rf_out_cm)*100
rf<-round(cbind(rf_in, rf_out), 0)
rf
## rf_in rf_out
## [1,] 100 63
The commented-out line shows the manual fitting method with the gbm package. This method was not used, in favor of the automatic tuning.
library("gbm")
set.seed(1)
#boost1=gbm(choice ~ ., data=cereals_fit, distribution= "multinomial")
The automatic tuning is done using the caret package.
tc = trainControl(method = "repeatedcv", number = cvfolds)
garbage <- capture.output(boost1 <- train(choice ~., data=cereals_fit, method="gbm",
trControl=tc))
If enabled, boost1 shows how the parameters were tuned. It is disabled for clarity and to save space. Please see the provided executable R data file for more details.
#boost1
var_imp_gbm<-plot(varImp(boost1))
pred.boost1=predict(boost1, newdata =cereals_fit)
cm = table(pred.boost1, cereals_fit$choice)
boost_in<-sum(diag(cm))/sum(cm)*100
pred.boost1=predict(boost1, newdata =cereals_test)
boost_out_cm = table(pred.boost1, cereals_test$choice)
rownames(boost_out_cm) <- alt.names
boost_out<-sum(diag(boost_out_cm))/sum(boost_out_cm)*100
boost<-round(cbind(boost_in, boost_out), 0)
boost
## boost_in boost_out
## [1,] 88 61
##Logit
coef<-c(coefficients(m)[1],
coefficients(m)[2],
coefficients(m)[3])
##Linear Regression
coef.reg<-c(coefficients(ols.TF)[1],
coefficients(ols.TF)[2],
coefficients(ols.TF)[3],
coefficients(ols.HB)[1],
coefficients(ols.HB)[2],
coefficients(ols.HB)[3],
coefficients(ols.NC)[1],
coefficients(ols.NC)[2],
coefficients(ols.NC)[3])
coef.lasso<-c(coefficients(summ.lasso.TF)[1],
coefficients(summ.lasso.TF)[2],
coefficients(summ.lasso.TF)[3],
coefficients(summ.lasso.HB)[1],
coefficients(summ.lasso.HB)[2],
coefficients(summ.lasso.HB)[3],
coefficients(summ.lasso.NC)[1],
coefficients(summ.lasso.NC)[2],
coefficients(summ.lasso.NC)[3])
details[[i]]<-cbind(coef)
coeff.logit[[i]]<-m
details1[[i]]<-cbind(logit,lin.reg,lasso.cv,tree,rf,boost)
details3[[i]]<-cbind(snr, noise)
details4[[i]]<-cbind(sum(rf_in), sum(rf_out))
coeff.reg[[i]] <-cbind(coef.reg)
coeff.reg.TF[[i]] <-ols.TF
coeff.reg.HB[[i]] <-ols.HB
coeff.reg.NC[[i]] <-ols.NC
coeff.lasso[[i]] <- cbind(coef.lasso)
coeff.lasso.TF[[i]] <- summ.lasso.TF
coeff.lasso.HB[[i]] <- summ.lasso.HB
coeff.lasso.NC[[i]] <- summ.lasso.NC
cm.logit[[i]] <- logit_out_cm[2]
cm.reg[[i]] <- reg_out_cm
cm.lasso[[i]] <- lasso_out_cm
cm.tree[[i]] <- tree_out_cm
cm.rf[[i]] <- rf_out_cm
cm.boost[[i]] <- boost_out_cm
rf_plot[[i]] <- rf_var_plot
gbm_plot[[i]] <- var_imp_gbm
summ_fit_sample[[i]] <- summ_cereals_fit
summ_test_sample[[i]] <-summ_cereals_test
summ_fit_lin_sample[[i]] <-summ_cereals_fit_lin
summ_test_lin_sample[[i]] <-summ_cereals_test_lin
} #end Loop
This chunk prepares the results: various data frames are built and their rows and columns are named. This preparation matters for the presentation of the results in the paper.
suppressMessages(library(stargazer))
suppressMessages(library(xtable))
details2<-details1
details2<-unlist(details2, recursive = TRUE, use.names = TRUE)
run1<-round(matrix(details2[1:12], ncol = 2, byrow = TRUE), 0)
run2<-round(matrix(details2[13:24], ncol = 2, byrow = TRUE), 0)
run3<-round(matrix(details2[25:36], ncol =2, byrow = TRUE),0)
run4<-round(matrix(details2[37:48], ncol =2, byrow = TRUE), 0)
names1<-c("in sample", "out of sample")
names2<-c("logit", "lin reg", "lasso.cv", "tree", "random forest", "boosting")
colnames(run1)<-names1
rownames(run1)<-names2
colnames(run2)<-names1
rownames(run2)<-names2
colnames(run3)<-names1
rownames(run3)<-names2
colnames(run4)<-names1
rownames(run4)<-names2
run1.table<-xtable(run1, caption = index[1], digits = 0)
run2.table<-xtable(run2, caption = index[2], digits = 0)
run3.table<-xtable(run3, caption = index[3], digits = 0)
run4.table<-xtable(run4, caption = index[4], digits = 0)
coeff<-t(matrix(unlist(details), ncol = 3, byrow = TRUE))
colnames(coeff)<-index
rownames(coeff)<-c("price","quality","popularity")
coeff.table<-xtable(coeff, caption = "Coefficients")
coeff.reg2<-t(matrix(unlist(coeff.reg), ncol = 9, byrow = TRUE))
colnames(coeff.reg2)<-index
rownames(coeff.reg2)<-c("X1.TF", "X2.TF", "X3.TF", "X1.HB", "X2.HB", "X3.HB",
"X1.NC", "X2.NC", "X3.NC")
coeff.table.reg<-xtable(coeff.reg2, caption = "Coefficients")
coef.lasso<-t(matrix(unlist(coeff.lasso), ncol = 9, byrow = TRUE))
colnames(coef.lasso)<-index
rownames(coef.lasso)<-c("X1.TF", "X2.TF", "X3.TF", "X1.HB", "X2.HB", "X3.HB",
"X1.NC", "X2.NC", "X3.NC")
coeff.table.lasso<-xtable(coef.lasso, caption = "Coefficients")
details.snr<-unlist(details3)
details.snr<-matrix(details.snr, ncol = 2, byrow = TRUE)
colnames(details.snr)<-c("Signal-to-Noise", "Noise")
rownames(details.snr)<-index
snr.table<-xtable(details.snr, caption= "Signal-to-Noise")
This section presents the code for the results. Only the performance comparison is presented to save space. The results match the Results section of the paper. For further details please see the provided R data file.
details.snr
## Signal-to-Noise Noise
## p=10 2.236826 1.64465
## p=50 3.017325 1.64465
## p=150 2.425730 1.64465
## p=290 2.836494 1.64465
index
## [1] "p=10" "p=50" "p=150" "p=290"
run1
## in sample out of sample
## logit 76 78
## lin reg 76 71
## lasso.cv 72 77
## tree 67 49
## random forest 100 63
## boosting 88 61
run2
## in sample out of sample
## logit 77 62
## lin reg 72 45
## lasso.cv 62 54
## tree 58 41
## random forest 100 45
## boosting 100 45
run3
## in sample out of sample
## logit 100 41
## lin reg 80 33
## lasso.cv 70 65
## tree 69 47
## random forest 100 51
## boosting 100 49
run4
## in sample out of sample
## logit 100 36
## lin reg 84 29
## lasso.cv 74 64
## tree 65 47
## random forest 100 50
## boosting 100 47
The logit coefficients are listed below. First, the relevant coefficients for all runs are listed. The next output lists all estimated logit coefficients for the first run.
coeff
## p=10 p=50 p=150 p=290
## price -3.096471 -3.354573 -28.90478 -12.353016
## quality 1.851797 1.918786 16.58122 6.582303
## popularity 3.212596 3.314025 31.08082 11.217060
summary(coeff.logit[[1]])
##
## Call:
## mlogit(formula = choice ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 +
## X8 + X9 + X10 - 1, data = Cereal_fit, seed = 1, method = "nr")
##
## Frequencies of alternatives:
## HB NC TF
## 0.38 0.28 0.34
##
## nr method
## 6 iterations, 0h:0m:0s
## g'(-H)^-1g = 0.000249
## successive function values within tolerance limits
##
## Coefficients :
## Estimate Std. Error z-value Pr(>|z|)
## X1 -3.0964715 0.4140316 -7.4788 7.505e-14 ***
## X2 1.8517973 0.2478674 7.4709 7.971e-14 ***
## X3 3.2125959 0.4196564 7.6553 1.932e-14 ***
## X4 -0.0039057 0.3015569 -0.0130 0.9897
## X5 -0.0257784 0.2895766 -0.0890 0.9291
## X6 -0.3328288 0.3096260 -1.0749 0.2824
## X7 0.0014206 0.3040154 0.0047 0.9963
## X8 0.2248430 0.2962448 0.7590 0.4479
## X9 -0.0603981 0.3093601 -0.1952 0.8452
## X10 0.2815895 0.3250167 0.8664 0.3863
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -91.375
The linear-regression coefficients are listed: first the relevant coefficients for all runs, then the estimated coefficients per alternative for the first run.
coeff.reg2
## p=10 p=50 p=150 p=290
## X1.TF -3.9066881 -4.3611262 -10.056788 -677.756368
## X2.TF 1.4512183 1.5410204 4.598484 634.968445
## X3.TF 2.1921421 2.9160312 6.152307 -386.862818
## X1.HB -4.3510830 -2.7276638 5.957411 -40.995783
## X2.HB 1.0891153 1.3686861 -8.140303 -49.804010
## X3.HB 2.6882814 1.1111883 -4.574937 43.408298
## X1.NC -4.8601217 -4.5298688 4.287643 -3.286762
## X2.NC 1.4941079 1.0705718 -6.531668 -7.961015
## X3.NC 0.9540236 0.7777074 -10.656929 4.203862
summary(coeff.reg.TF[[1]])
##
## Call:
## lm(formula = form.TF, data = cereals.TF)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.889 -1.379 0.512 1.771 5.318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## X1.TF -3.90669 0.43701 -8.940 2.06e-15 ***
## X2.TF 1.45122 0.33132 4.380 2.31e-05 ***
## X3.TF 2.19214 0.52916 4.143 5.91e-05 ***
## X4.TF -0.18126 0.55436 -0.327 0.7442
## X5.TF 0.03678 0.48089 0.076 0.9391
## X6.TF 0.06824 0.50473 0.135 0.8926
## X7.TF 0.22404 0.55816 0.401 0.6887
## X8.TF 1.18130 0.55561 2.126 0.0352 *
## X9.TF 0.20141 0.52722 0.382 0.7030
## X10.TF -0.08117 0.58449 -0.139 0.8898
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.911 on 140 degrees of freedom
## Multiple R-squared: 0.6675, Adjusted R-squared: 0.6437
## F-statistic: 28.1 on 10 and 140 DF, p-value: < 2.2e-16
summary(coeff.reg.HB[[1]])
##
## Call:
## lm(formula = form.HB, data = cereals.HB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.2658 -1.3238 0.3358 1.8578 6.2953
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## X1.HB -4.351083 0.436698 -9.964 < 2e-16 ***
## X2.HB 1.089115 0.547734 1.988 0.0487 *
## X3.HB 2.688281 0.375667 7.156 4.26e-11 ***
## X4.HB -0.025502 0.541557 -0.047 0.9625
## X5.HB 0.461733 0.566486 0.815 0.4164
## X6.HB 0.461519 0.552594 0.835 0.4050
## X7.HB 0.334463 0.568390 0.588 0.5572
## X8.HB 0.007566 0.548774 0.014 0.9890
## X9.HB -0.022064 0.563030 -0.039 0.9688
## X10.HB 0.204218 0.618313 0.330 0.7417
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.077 on 140 degrees of freedom
## Multiple R-squared: 0.6443, Adjusted R-squared: 0.6189
## F-statistic: 25.36 on 10 and 140 DF, p-value: < 2.2e-16
summary(coeff.reg.NC[[1]])
##
## Call:
## lm(formula = form.NC, data = cereals.NC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0819 -1.7971 0.0966 1.6785 5.6364
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## X1.NC -4.86012 0.45618 -10.654 <2e-16 ***
## X2.NC 1.49411 0.46432 3.218 0.0016 **
## X3.NC 0.95402 0.36686 2.601 0.0103 *
## X4.NC -0.28330 0.57875 -0.490 0.6252
## X5.NC -0.04915 0.52389 -0.094 0.9254
## X6.NC 0.30359 0.51397 0.591 0.5557
## X7.NC 0.54486 0.48957 1.113 0.2676
## X8.NC -0.34925 0.47651 -0.733 0.4648
## X9.NC 0.07644 0.53379 0.143 0.8863
## X10.NC -0.76902 0.51780 -1.485 0.1397
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.803 on 140 degrees of freedom
## Multiple R-squared: 0.7201, Adjusted R-squared: 0.7001
## F-statistic: 36.01 on 10 and 140 DF, p-value: < 2.2e-16
The LASSO coefficients are listed: first the relevant coefficients for all runs, then the estimated coefficients per alternative for the first run.
coef.lasso
## p=10 p=50 p=150 p=290
## X1.TF -3.9012652 -3.8703536 -4.1970812 -3.5156274
## X2.TF 1.4647826 0.7921983 1.4707380 0.7801829
## X3.TF 2.1496418 3.6677306 2.7068046 3.0942549
## X1.HB -4.3478177 -2.7278889 -3.5658226 -3.2054035
## X2.HB 1.1806458 0.8591900 0.9522660 1.6118244
## X3.HB 2.6579528 1.3227454 2.0299172 1.6411899
## X1.NC -4.8652362 -3.9992464 -5.9801915 -4.2136059
## X2.NC 1.4939412 1.0644524 0.9997719 0.8643321
## X3.NC 0.9650781 1.2361749 2.1236241 0.8196607
summary(coeff.lasso.TF[[1]])
##
## Call:
## lm(formula = Y.train.TF ~ x - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.699 -1.429 0.451 1.817 5.179
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## xX1.TF -3.9013 0.4266 -9.146 4.95e-16 ***
## xX2.TF 1.4648 0.3213 4.559 1.09e-05 ***
## xX3.TF 2.1496 0.5111 4.206 4.53e-05 ***
## xX7.TF 0.1994 0.5368 0.371 0.7109
## xX8.TF 1.1627 0.5361 2.169 0.0317 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.863 on 145 degrees of freedom
## Multiple R-squared: 0.6668, Adjusted R-squared: 0.6553
## F-statistic: 58.03 on 5 and 145 DF, p-value: < 2.2e-16
summary(coeff.lasso.HB[[1]])
##
## Call:
## lm(formula = Y.train.HB ~ x - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.6714 -1.3513 0.1692 1.9714 6.3899
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## xX1.HB -4.3478 0.4224 -10.292 < 2e-16 ***
## xX2.HB 1.1806 0.5253 2.248 0.0261 *
## xX3.HB 2.6580 0.3584 7.417 9.04e-12 ***
## xX5.HB 0.4698 0.5531 0.849 0.3971
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.026 on 146 degrees of freedom
## Multiple R-squared: 0.6414, Adjusted R-squared: 0.6315
## F-statistic: 65.27 on 4 and 146 DF, p-value: < 2.2e-16
summary(coeff.lasso.NC[[1]])
##
## Call:
## lm(formula = Y.train.NC ~ x - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1309 -1.7957 0.1941 1.6694 5.7915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## xX1.NC -4.8652 0.4353 -11.176 < 2e-16 ***
## xX2.NC 1.4939 0.4519 3.306 0.00119 **
## xX3.NC 0.9651 0.3506 2.752 0.00667 **
## xX7.NC 0.5452 0.4739 1.150 0.25192
## xX10.NC -0.8190 0.4958 -1.652 0.10076
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.764 on 145 degrees of freedom
## Multiple R-squared: 0.7181, Adjusted R-squared: 0.7083
## F-statistic: 73.86 on 5 and 145 DF, p-value: < 2.2e-16
The confusion matrices provided in the paper are shown below. Only the first one is printed here for clarity.
cm.logit[[1]]
## $table
## Reference
## Prediction TF HB NC
## TF 40 4 3
## HB 8 38 4
## NC 4 10 39
# cm.logit[[4]]
# cm.reg[[1]]
# cm.reg[[4]]
# cm.lasso[[1]]
# cm.lasso[[4]]
# cm.tree[[1]]
# cm.tree[[4]]
# cm.rf[[1]]
# cm.rf[[4]]
# cm.boost[[1]]
# cm.boost[[4]]
sum.stat.fit<- t(cbind(summ_fit_sample[[1]],
summ_fit_sample[[2]],
summ_fit_sample[[3]],
summ_fit_sample[[4]]))
rownames(sum.stat.fit)<- run.names
sum.stat.test<- t(cbind(summ_test_sample[[1]],
summ_test_sample[[2]],
summ_test_sample[[3]],
summ_test_sample[[4]]))
rownames(sum.stat.test)<- run.names
sum.stat.fit.lin <- t(cbind(summ_fit_lin_sample[[1]],
summ_fit_lin_sample[[2]],
summ_fit_lin_sample[[3]],
summ_fit_lin_sample[[4]]))
rownames(sum.stat.fit.lin)<- run.names
sum.stat.test.lin <- t(cbind(summ_test_lin_sample[[1]],
summ_test_lin_sample[[2]],
summ_test_lin_sample[[3]],
summ_test_lin_sample[[4]]))
rownames(sum.stat.test.lin)<- run.names
The y-variable is tabulated in order to determine the distribution of the three classes in the train and test samples. The cereals and cereals.lin data frames are identical with regard to the categorical y-variable.
sum.stat.fit
## TF HB NC
## run1 51 57 42
## run2 51 57 42
## run3 49 62 39
## run4 51 57 42
sum.stat.fit.lin
## TF HB NC
## run1 51 57 42
## run2 51 57 42
## run3 49 62 39
## run4 51 57 42
sum.stat.test
## TF HB NC
## run1 52 52 46
## run2 44 52 54
## run3 52 57 41
## run4 48 53 49
sum.stat.test.lin
## TF HB NC
## run1 52 52 46
## run2 44 52 54
## run3 52 57 41
## run4 48 53 49
Therefore they yield the same train samples and the same test samples. This shows that, although there are two data frames, the data are comparable.
sum.stat.fit==sum.stat.fit.lin
## TF HB NC
## run1 TRUE TRUE TRUE
## run2 TRUE TRUE TRUE
## run3 TRUE TRUE TRUE
## run4 TRUE TRUE TRUE
sum.stat.test==sum.stat.test.lin
## TF HB NC
## run1 TRUE TRUE TRUE
## run2 TRUE TRUE TRUE
## run3 TRUE TRUE TRUE
## run4 TRUE TRUE TRUE
The decision-tree split plot is not provided because the prp() output could not be saved in a data frame.
First, the variable importance of the random forest is presented for the first and last run, as in the paper.
plot(rf_plot[[1]], top = 20)
plot(rf_plot[[4]], top = 20)
Second, the variable importance of the boosted trees is presented for the first and last run, as in the paper.
gbm_plot[[1]]
gbm_plot[[4]]
The code chunk below shows which version of R the code was run with. Please see the R data file for more information.
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] xtable_1.8-4 stargazer_5.2.2 gbm_2.1.5
## [4] randomForest_4.6-14 rpart.plot_3.0.8 rpart_4.1-15
## [7] glmnet_3.0-1 Matrix_1.2-17 caret_6.0-86
## [10] ggplot2_3.2.1 lattice_0.20-38 mlogit_1.0-2
## [13] lmtest_0.9-37 zoo_1.8-5 Formula_1.2-3
## [16] evd_2.3-3 mvtnorm_1.0-11
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.1 lubridate_1.7.8 class_7.3-15
## [4] assertthat_0.2.1 digest_0.6.25 ipred_0.9-9
## [7] foreach_1.4.7 R6_2.4.0 plyr_1.8.4
## [10] stats4_3.6.0 e1071_1.7-3 evaluate_0.13
## [13] pillar_1.4.4 Rdpack_0.11-1 rlang_0.4.6
## [16] lazyeval_0.2.2 data.table_1.12.2 rmarkdown_2.3
## [19] splines_3.6.0 statmod_1.4.32 gower_0.2.1
## [22] stringr_1.4.0 munsell_0.5.0 compiler_3.6.0
## [25] xfun_0.7 pkgconfig_2.0.2 shape_1.4.4
## [28] htmltools_0.3.6 nnet_7.3-12 tidyselect_0.2.5
## [31] gridExtra_2.3 tibble_3.0.1 prodlim_2019.11.13
## [34] codetools_0.2-16 crayon_1.3.4 dplyr_0.8.5
## [37] withr_2.1.2 MASS_7.3-51.4 recipes_0.1.12
## [40] ModelMetrics_1.2.2.2 grid_3.6.0 nlme_3.1-144
## [43] gtable_0.3.0 lifecycle_0.2.0 magrittr_1.5
## [46] pROC_1.16.2 scales_1.0.0 bibtex_0.4.2.2
## [49] stringi_1.4.3 reshape2_1.4.3 timeDate_3043.102
## [52] ellipsis_0.3.0 vctrs_0.3.0 generics_0.0.2
## [55] lava_1.6.7 iterators_1.0.12 tools_3.6.0
## [58] glue_1.3.1 purrr_0.3.4 survival_2.44-1.1
## [61] yaml_2.2.0 colorspace_1.4-1 gbRd_0.4-11
## [64] knitr_1.23
search() shows the loaded packages.
search()
## [1] ".GlobalEnv" "package:xtable" "package:stargazer"
## [4] "package:gbm" "package:randomForest" "package:rpart.plot"
## [7] "package:rpart" "package:glmnet" "package:Matrix"
## [10] "package:caret" "package:ggplot2" "package:lattice"
## [13] "package:mlogit" "package:lmtest" "package:zoo"
## [16] "package:Formula" "package:evd" "package:mvtnorm"
## [19] "package:stats" "package:graphics" "package:grDevices"
## [22] "package:utils" "package:datasets" "package:methods"
## [25] "Autoloads" "package:base"
The warnings are not presented in order to save space; they can be viewed in the R data file.
warnings()