The code is presented and explained, and outputs are shown where appropriate. To save space and maintain clarity, only selected results are shown at the end. For further details, please see the executable R data file provided on the data storage device.
The first section of the code represents the model assumptions.
This line of code sets the number of cross-validation folds and could be lowered to reduce run time. The results were generated with 5 folds, which amounted to about 2 minutes of run time. Cross-validation is used to tune the machine-learning methods.
cv.folds<-5
The model assumes p = 10, 50, 150 and 290 features; these candidate values are stored in the vector p.vec. The number of observations n is kept constant at 300. The parameter cvfolds takes the value of cv.folds assigned above.
p.vec<-c(10, 50, 150, 290)
n<-300
cvfolds <- cv.folds
Next, the names of the four runs and alternatives are set up, which will later be used to label the rows and columns of the results data frames.
index<-paste("p=", p.vec, sep = "")
alt.names <- c("TF", "HB", "NC")
run.names <- c("run1", "run2","run3", "run4")
Lists are created to store results such as signal-to-noise ratios, accuracies, coefficients, confusion matrices, variable importances and tabulations of the y-variable.
details <- list()
coeff.logit <- list()
details1 <- list()
details3 <- list()
details4 <- list()
coeff.reg <- list()
coeff.reg.TF<- list()
coeff.reg.HB<- list()
coeff.reg.NC<- list()
coeff.lasso <- list()
coeff.lasso.TF <- list()
coeff.lasso.HB <- list()
coeff.lasso.NC <- list()
cm.logit <- list()
cm.reg <- list()
cm.lasso <- list()
cm.tree <- list()
cm.rf <- list()
cm.boost <- list()
rf_plot <- list()
gbm_plot <- list()
summ_fit_sample <- list()
summ_test_sample <- list()
summ_fit_lin_sample <- list()
summ_test_lin_sample <- list()
Finally, a for-loop iterates over the assigned values of p.
for (i in 1:length(p.vec)) {
p<-p.vec[i]
Next, the relevant coefficients are set to -5, 3 and 5. The variable sparce counts the number of non-relevant features. In the first run (p = 10), which serves as the running example, there are 7 non-relevant coefficients.
pr<--5 #price
qual<-3 #quality
pop<-5 #popularity
sparce<-p-(length(pr)+length(qual)+length(pop)) #count zero betas for sparsity
sparce
## [1] 7
Then a beta vector is created with the 3 relevant coefficients, while the rest are set to 0.
set.seed(1)
beta2<- c(pr, qual, pop, rep(0, sparce))
beta2
## [1] -5 3 5 0 0 0 0 0 0 0
In the next step, the price, quality and popularity means are assigned to the three alternatives (TF, HB and NC). The alternatives are modeled to be dissimilar, so each has a different magnitude of price, quality and popularity. The coefficient means of the first alternative are shown below.
prices<-c(3.90, 3.50, 1.50)
mu1=c(rep(prices[1] , length(pr)), rep(5 , length(qual)), rep(2 , length(pop)),
rep(0, sparce)) #TF
mu2=c(rep(prices[2] , length(pr)), rep(1 , length(qual)), rep(4 , length(pop)),
rep(0, sparce)) #HB
mu3=c(rep(prices[3] , length(pr)), rep(1 , length(qual)), rep(2 , length(pop)),
rep(0, sparce)) #NC
round(mu1, 1)
## [1] 3.9 5.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Next, the covariance structure of the x's is created, based on a diagonal matrix; the x's are therefore independent of each other.
set.seed(1)
covar=diag(0.2, p, p)
head(covar)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0.2 0.0 0.0 0.0 0.0 0.0 0 0 0 0
## [2,] 0.0 0.2 0.0 0.0 0.0 0.0 0 0 0 0
## [3,] 0.0 0.0 0.2 0.0 0.0 0.0 0 0 0 0
## [4,] 0.0 0.0 0.0 0.2 0.0 0.0 0 0 0 0
## [5,] 0.0 0.0 0.0 0.0 0.2 0.0 0 0 0 0
## [6,] 0.0 0.0 0.0 0.0 0.0 0.2 0 0 0 0
After that, the x's are drawn from the multivariate normal distribution using the means and the covariance structure.
X1=rmvnorm(n, mean=mu1, sigma=covar)
X2=rmvnorm(n, mean=mu2, sigma=covar)
X3=rmvnorm(n, mean=mu3, sigma=covar)
str(X1)
## num [1:300, 1:10] 3.62 4.58 4.31 4.51 3.83 ...
The created X-matrices have 300 rows (observations) and, in the first run, 10 columns (features).
Finally the x’s are packed into a matrix for each alternative and are labeled.
x1<-matrix(X1, nrow = n, ncol = p)
colnames(x1)<-paste(as.character(1:p), rep(".TF", p), sep = "")
x2<-matrix(X2, nrow = n, ncol = p)
colnames(x2)<-paste(as.character(1:p), rep(".HB", p), sep = "")
x3<-matrix(X3, nrow = n, ncol = p)
colnames(x3)<-paste(as.character(1:p), rep(".NC", p), sep = "")
To obtain the representative utilities, the beta vector is multiplied with the transposed X-matrices.
V1<-beta2%*%t(x1) #1st product
V2<-beta2%*%t(x2) #2nd product
V3<-beta2%*%t(x3) #3rd product
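In matrix notation, this computes $V_j = X_j \beta$ for each alternative $j \in \{TF, HB, NC\}$, so each entry of $V_j$ is the representative utility of that alternative for one observation.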
The resulting representative utilities are bound into a single matrix.
V<-cbind(t(V1), t(V2), t(V3))
summary(V)
## V1 V2 V3
## Min. :-6.176 Min. :-3.674 Min. :-3.395
## 1st Qu.: 3.288 1st Qu.: 3.197 1st Qu.: 3.805
## Median : 5.508 Median : 6.087 Median : 5.802
## Mean : 5.759 Mean : 5.718 Mean : 5.749
## 3rd Qu.: 8.054 3rd Qu.: 7.852 3rd Qu.: 7.943
## Max. :15.053 Max. :16.076 Max. :13.755
str(V)
## num [1:300, 1:3] 5.28 1.25 4.66 3.19 7.09 ...
Next, the error term is generated using the evd package for extreme-value distributions.
#install.packages("evd")
library(evd)
set.seed(1)
e1<-rgumbel(n*ncol(V), loc=0, scale=1.72)
e1<-matrix(e1,n,ncol(V),byrow=FALSE)
str(e1)
## num [1:300, 1:3] 0.483 -0.287 3.313 3.384 1.428 ...
The model is completed by adding the unobserved (random) values of utility.
z1<-V+e1
signal<-mean(var(V))
noise<-mean(var(e1))
noise
## [1] 1.64465
The noise level is consistent with the logit-model assumption of i.i.d. Gumbel errors.
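As a quick plausibility check (an aside, not part of the original script): a Gumbel draw with scale $\beta$ has variance $\pi^2\beta^2/6$, so with $\beta = 1.72$ each error column has a theoretical variance of about 4.87. Since mean(var(e1)) averages all nine entries of the 3-by-3 sample covariance matrix and the off-diagonal entries are near zero, the reported noise is roughly a third of that.
pi^2 * 1.72^2 / 6 #theoretical variance per error column, about 4.87
pi^2 * 1.72^2 / 18 #average over the 3x3 covariance matrix, about 1.62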
snr<-mean(var(V))/mean(var(e1))
snr
## [1] 2.236826
The signal-to-noise ratio is above 2.
pr1 = exp(z1)/apply(exp(z1), drop=F, 1, sum)
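This is the softmax transformation applied row-wise: for observation $i$ and alternative $j$, $P_{ij} = \exp(z_{ij}) / \sum_{k=1}^{3} \exp(z_{ik})$, mapping the simulated utilities z1 = V + e1 into choice probabilities.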
The probabilities have the necessary properties. They lie between 0 and 1.
specify_decimal <- function(x, k) trimws(format(round(x, k), nsmall=k))
head(specify_decimal(pr1, 2))
## [,1] [,2] [,3]
## [1,] "0.05" "0.95" "0.00"
## [2,] "0.01" "0.00" "0.98"
## [3,] "0.23" "0.32" "0.45"
## [4,] "0.40" "0.00" "0.60"
## [5,] "0.95" "0.05" "0.00"
## [6,] "0.00" "0.96" "0.04"
They sum to 1.
head(apply(pr1, drop=F, 1, sum))
## [1] 1 1 1 1 1 1
Next, the rmultinom function is applied row-wise, via apply, to the probability matrix generated in the previous step. One random vector is drawn for each row of the probability matrix, so that one alternative is chosen per observation. As the output shows, the column with the highest probability has the highest chance of taking the value 1.
mChoices1 = t(apply(pr1, 1, rmultinom, n = 1, size = 1))
head(mChoices1)
## [,1] [,2] [,3]
## [1,] 0 1 0
## [2,] 0 0 1
## [3,] 0 1 0
## [4,] 0 0 1
## [5,] 1 0 0
## [6,] 0 1 0
By replacing the 1s with their respective column positions, a categorical outcome results with the options y = 1, 2 or 3.
y1 = apply(mChoices1, 1, function(x) which(x==1))
head(y1)
## [1] 2 3 2 3 1 2
length(y1)
## [1] 300
tabulate(y1)
## [1] 103 109 88
A data frame is saved for the non-linear methods: logit, tree, random forest and boosting.
cereals<-data.frame(y1,x1,x2,x3)
str(cereals)
## 'data.frame': 300 obs. of 31 variables:
## $ y1 : int 2 3 2 3 1 2 2 3 1 1 ...
## $ X1.TF : num 3.62 4.58 4.31 4.51 3.83 ...
## $ X2.TF : num 5.08 5.17 5.35 4.95 4.89 ...
## $ X3.TF : num 1.63 1.72 2.03 2.17 2.31 ...
## $ X4.TF : num 0.7134 -0.9904 -0.8897 -0.0241 0.2489 ...
## $ X5.TF : num 0.147 0.503 0.277 -0.616 -0.308 ...
## $ X6.TF : num -0.3669 -0.0201 -0.0251 -0.1856 -0.3164 ...
## $ X7.TF : num 0.21798 -0.00724 -0.06967 -0.17633 0.16305 ...
## $ X8.TF : num 0.3302 0.4221 -0.6577 -0.0265 0.3437 ...
## $ X9.TF : num 0.2575 0.3673 -0.2138 0.4919 -0.0502 ...
## $ X10.TF: num -0.137 0.266 0.187 0.341 0.394 ...
## $ X1.HB : num 3.83 3.72 3.52 4.1 3.7 ...
## $ X2.HB : num 1.173 1.192 1.69 0.206 0.804 ...
## $ X3.HB : num 4.58 3.43 4.43 4.17 4.54 ...
## $ X4.HB : num -0.3594 1.0001 -0.8194 0.0188 -0.6364 ...
## $ X5.HB : num -0.7167 0.1485 0.0595 -0.0594 -0.0263 ...
## $ X6.HB : num 0.41736 -0.06229 -0.46215 -0.00624 0.09111 ...
## $ X7.HB : num 0.808 -0.329 -0.773 -0.164 0.332 ...
## $ X8.HB : num -0.0253 -1.2422 -0.5187 -0.047 -0.1689 ...
## $ X9.HB : num 0.843 -0.144 -0.623 0.742 1.064 ...
## $ X10.HB: num 0.706 -0.463 -0.448 0.695 -0.397 ...
## $ X1.NC : num 1.22 1.04 1.52 1.12 2.18 ...
## $ X2.NC : num 0.504 0.837 1.444 0.163 1.186 ...
## $ X3.NC : num 1.03 1.38 1.26 2.33 1.67 ...
## $ X4.NC : num -0.014 -0.2395 0.0635 -0.1548 -0.7202 ...
## $ X5.NC : num -0.116 0.123 -0.301 -0.623 -0.232 ...
## $ X6.NC : num 0.239 0.588 -0.645 -0.349 0.529 ...
## $ X7.NC : num -0.2502 -0.0762 0.3721 0.2759 0.0937 ...
## $ X8.NC : num 0.7193 0.6471 0.1148 -0.1233 -0.0928 ...
## $ X9.NC : num 0.249 0.737 -0.307 0.716 -0.498 ...
## $ X10.NC: num 0.08301 0.45325 -0.00687 -0.06702 0.82171 ...
For the linear methods, regression and LASSO, a second data frame is saved. It is identical to the first, except that it additionally contains the three simulated choice probabilities (X1, X2 and X3), which the linear models use as dependent variables.
cereals.lin<-data.frame(y1,pr1,x1,x2,x3)
names(cereals.lin)
## [1] "y1" "X1" "X2" "X3" "X1.TF" "X2.TF" "X3.TF"
## [8] "X4.TF" "X5.TF" "X6.TF" "X7.TF" "X8.TF" "X9.TF" "X10.TF"
## [15] "X1.HB" "X2.HB" "X3.HB" "X4.HB" "X5.HB" "X6.HB" "X7.HB"
## [22] "X8.HB" "X9.HB" "X10.HB" "X1.NC" "X2.NC" "X3.NC" "X4.NC"
## [29] "X5.NC" "X6.NC" "X7.NC" "X8.NC" "X9.NC" "X10.NC"
Format the y-column in the data frame as a factor and call it “choice”.
choice<-factor(cereals$y1)
head(choice)
## [1] 2 3 2 3 1 2
## Levels: 1 2 3
Name the levels of the factor-formatted choice column: 1 for Tiger Flakes (TF), 2 for Honey Bits (HB) and 3 for Nougat Crisps (NC).
levels<-c("TF", "HB", "NC")
levels(choice)<-levels
head(choice)
## [1] HB NC HB NC TF HB
## Levels: TF HB NC
Replace the “y” column with the “choice” column in the data frames.
cereals<-cbind(choice, cereals[,-1])
cereals.lin<-cbind(choice, cereals.lin[,-1])
The fitting of the six classical and machine learning models is presented next.
Find the split point in the data.
r<-nrow(cereals)/2
r
## [1] 150
cereals_fit<-cereals[1:r,]
cereals_test<-cereals[-(1:r),]
nrow(cereals_fit)
## [1] 150
The column means show that only the relevant features (X1 to X3) have means clearly different from 0.
round(sapply(cereals_fit[,-1], mean), 2)
## X1.TF X2.TF X3.TF X4.TF X5.TF X6.TF X7.TF X8.TF X9.TF X10.TF
## 3.86 5.08 2.01 -0.07 0.06 -0.03 -0.02 -0.03 -0.03 0.03
## X1.HB X2.HB X3.HB X4.HB X5.HB X6.HB X7.HB X8.HB X9.HB X10.HB
## 3.49 1.00 4.05 0.00 0.01 -0.02 -0.02 0.08 -0.01 -0.05
## X1.NC X2.NC X3.NC X4.NC X5.NC X6.NC X7.NC X8.NC X9.NC X10.NC
## 1.46 1.07 1.94 -0.02 -0.06 -0.02 -0.01 0.00 0.09 -0.01
Below is a tabulation of the selected alternatives in the train sample.
summ_cereals_fit<-summary(cereals_fit$choice)
summ_cereals_fit
## TF HB NC
## 51 57 42
summ_cereals_test<-summary(cereals_test$choice)
The mlogit package requires the data in a long format with one row per observation-alternative pair; mlogit.data performs the reshaping.
library(mlogit)
Cereal_fit <- mlogit.data(cereals_fit, varying = c(2:ncol(cereals_fit)),
shape = "wide", choice = "choice")
Cereal_test <- mlogit.data(cereals_test, varying = c(2:ncol(cereals_test)),
shape = "wide", choice = "choice")
last<-ncol(Cereal_fit)-2
k<-colnames(Cereal_fit)[-1]
k<-k[-1]
k<-k[-last]
factors <- k #only the feature names X1 to Xp remain
factors
## [1] "X1" "X2" "X3" "X4" "X5" "X6" "X7" "X8" "X9" "X10"
X<-as.formula(paste("choice~", paste(factors, collapse="+")))
X
## choice ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10
f1 <- mFormula(X)
f0 <- update(f1, .~.-1)
m<-mlogit(f0, data=Cereal_fit, seed = 1)
Make predictions. The logit predictions are in the form of probabilities.
library("caret")
set.seed(1)
logit.pred_in<-fitted(m, outcome = FALSE)
head(logit.pred_in)
## HB NC TF
## 1 0.80802081 0.00895389 0.183025301
## 2 0.15875477 0.71489982 0.126345407
## 3 0.95266582 0.01520610 0.032128083
## 4 0.03575981 0.85312309 0.111117101
## 5 0.36271514 0.01108054 0.626204322
## 6 0.70555287 0.29111293 0.003334203
The column names are used to label the alternatives.
y_hat = colnames(logit.pred_in)[apply(logit.pred_in, 1, which.max)]
head(y_hat)
## [1] "HB" "NC" "HB" "NC" "TF" "HB"
They are then compared to the observed alternatives from the train sample.
logit_in = confusionMatrix(cereals_fit$choice, as.factor(y_hat))
The accuracy is read from the generated confusion-matrix output.
logit_in[2]
## $table
## Reference
## Prediction HB NC TF
## HB 41 6 10
## NC 9 31 2
## TF 6 3 42
logit_in<-logit_in$overall[1]*100
logit_in
## Accuracy
## 76
logit_in<-logit_in[[1]]
logit_in
## [1] 76
Make predictions on the test sample.
logit.pred_out<-predict(m, newdata=Cereal_test)
Specify a decimal format for the predicted probabilities and reformat the predicted probabilities into a readable format.
specify_decimal <- function(x, k) trimws(format(round(x, k), nsmall=k))
head(specify_decimal(logit.pred_out, 2))
## HB NC TF
## 1 "0.01" "0.62" "0.38"
## 2 "0.02" "0.43" "0.55"
## 3 "0.00" "0.00" "1.00"
## 4 "0.87" "0.11" "0.02"
## 5 "0.04" "0.89" "0.06"
## 6 "0.26" "0.03" "0.71"
Verify that the predicted probabilities sum to 1.
head(apply(logit.pred_out, drop=F, 1, sum))
## 1 2 3 4 5 6
## 1 1 1 1 1 1
Record accuracy.
y_hat = colnames(logit.pred_out)[apply(logit.pred_out, 1, which.max)]
head(y_hat)
## [1] "NC" "TF" "TF" "HB" "NC" "TF"
logit_out_cm = confusionMatrix(as.factor(y_hat), cereals_test$choice)
logit_out<-logit_out_cm$overall[1]*100
logit_out<-logit_out[[1]]
logit<-round(cbind(logit_in, logit_out), 0)
logit
## logit_in logit_out
## [1,] 76 78
Find the split point in the data and split it.
r<-nrow(cereals.lin)/2
cereals_fit<-cereals.lin[1:r,]
cereals_test<-cereals.lin[-(1:r),]
names(cereals_fit)
## [1] "choice" "X1" "X2" "X3" "X1.TF" "X2.TF" "X3.TF"
## [8] "X4.TF" "X5.TF" "X6.TF" "X7.TF" "X8.TF" "X9.TF" "X10.TF"
## [15] "X1.HB" "X2.HB" "X3.HB" "X4.HB" "X5.HB" "X6.HB" "X7.HB"
## [22] "X8.HB" "X9.HB" "X10.HB" "X1.NC" "X2.NC" "X3.NC" "X4.NC"
## [29] "X5.NC" "X6.NC" "X7.NC" "X8.NC" "X9.NC" "X10.NC"
The test sample includes the logit probabilities. Record the summary statistics of the samples for later use.
summ_cereals_fit_lin<-summary(cereals_fit$choice)
summ_cereals_test_lin<-summary(cereals_test$choice)
Data frames are built using the logit probability and the features per alternative.
cereals.TF<-cbind(cereals_fit$X1, cereals_fit[(1+4):(5+p-1)]) #TF probability and the p TF features
str(cereals.TF)
## 'data.frame': 150 obs. of 11 variables:
## $ cereals_fit$X1: num 0.0477 0.0122 0.2296 0.4004 0.9462 ...
## $ X1.TF : num 3.62 4.58 4.31 4.51 3.83 ...
## $ X2.TF : num 5.08 5.17 5.35 4.95 4.89 ...
## $ X3.TF : num 1.63 1.72 2.03 2.17 2.31 ...
## $ X4.TF : num 0.7134 -0.9904 -0.8897 -0.0241 0.2489 ...
## $ X5.TF : num 0.147 0.503 0.277 -0.616 -0.308 ...
## $ X6.TF : num -0.3669 -0.0201 -0.0251 -0.1856 -0.3164 ...
## $ X7.TF : num 0.21798 -0.00724 -0.06967 -0.17633 0.16305 ...
## $ X8.TF : num 0.3302 0.4221 -0.6577 -0.0265 0.3437 ...
## $ X9.TF : num 0.2575 0.3673 -0.2138 0.4919 -0.0502 ...
## $ X10.TF : num -0.137 0.266 0.187 0.341 0.394 ...
cereals.HB<-cbind(cereals_fit$X2, cereals_fit[(5+p-1+1):(5+2*p-1)]) #HB probability and features
cereals.NC<-cbind(cereals_fit$X3, cereals_fit[(5+2*p-1+1):ncol(cereals_fit)]) #NC probability and features
The test data have the same structure as the train data.
cereals.TF.test<-cbind(cereals_test$X1, cereals_test[(1+4):(5+p-1)])
cereals.HB.test<-cbind(cereals_test$X2, cereals_test[(5+p-1+1):(5+2*p-1)])
cereals.NC.test<-cbind(cereals_test$X3, cereals_test[(5+2*p-1+1):ncol(cereals_fit)])
k<-colnames(cereals.TF[2:ncol(cereals.TF)])
TF<-as.formula(paste("log(cereals.TF[,1])~", paste(k, collapse="+")))
TF
## log(cereals.TF[, 1]) ~ X1.TF + X2.TF + X3.TF + X4.TF + X5.TF +
## X6.TF + X7.TF + X8.TF + X9.TF + X10.TF
k<-colnames(cereals.HB[2:ncol(cereals.HB)])
HB<-as.formula(paste("log(cereals.HB[,1])~", paste(k, collapse="+")))
HB
## log(cereals.HB[, 1]) ~ X1.HB + X2.HB + X3.HB + X4.HB + X5.HB +
## X6.HB + X7.HB + X8.HB + X9.HB + X10.HB
k<-colnames(cereals.NC[2:ncol(cereals.NC)])
NC<-as.formula(paste("log(cereals.NC[,1])~", paste(k, collapse="+")))
NC
## log(cereals.NC[, 1]) ~ X1.NC + X2.NC + X3.NC + X4.NC + X5.NC +
## X6.NC + X7.NC + X8.NC + X9.NC + X10.NC
form.TF <- update(TF, .~.-1)
form.HB <- update(HB, .~.-1)
form.NC <- update(NC, .~.-1)
ols.TF <- lm(form.TF, data=cereals.TF)
ols.HB <- lm(form.HB, data=cereals.HB)
ols.NC <- lm(form.NC, data=cereals.NC)
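Regressing the log probability on the features is sensible here because the simulated probabilities satisfy $\log P_{ij} = z_{ij} - \log \sum_{k} \exp(z_{ik})$; the subtracted term is constant across alternatives within an observation, so ranking alternatives by fitted log probability is equivalent to ranking them by utility.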
The predictions of the three regressions are compared.
TF<-fitted(ols.TF, outcome = FALSE)
HB<-fitted(ols.HB, outcome = FALSE)
NC<-fitted(ols.NC, outcome = FALSE)
util<-cbind(TF, HB, NC)
head(util, 4)
## TF HB NC
## 1 -2.848340 -2.811533 -4.560369
## 2 -5.846930 -5.872507 -2.778482
## 3 -5.301662 -2.086851 -4.101773
## 4 -5.686562 -6.390244 -2.695700
The alternative that has the highest utility is chosen.
wahl<-apply(util, 1, function(x) which.max(x))
head(wahl)
## 1 2 3 4 5 6
## 2 3 2 3 1 2
The column numbers serve as indicators for the alternatives. The predicted choices are compared with the observed choices in a confusion matrix.
y_hat<-as.numeric(wahl)
cm = table(wahl, cereals_fit$choice)
cm
##
## wahl TF HB NC
## 1 40 7 3
## 2 6 44 9
## 3 5 6 30
The accuracy is calculated.
reg_in<-sum(diag(cm))/sum(cm)*100
reg_in
## [1] 76
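Since this diagonal-over-total computation recurs for every method below, it could be wrapped in a small helper function; a minimal sketch (acc is a hypothetical name and is not used in the remainder of the script):
acc <- function(cm) sum(diag(cm))/sum(cm)*100 #overall accuracy in percent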
Make predictions on the test sample.
TF=predict(ols.TF, newdata =cereals.TF.test)
HB=predict(ols.HB, newdata =cereals.HB.test)
NC=predict(ols.NC, newdata =cereals.NC.test)
util<-cbind(TF, HB, NC)
wahl<-apply(util, 1, function(x) which.max(x))
reg_out_cm = table(wahl, cereals_test$choice)
rownames(reg_out_cm)<- alt.names
reg_out<-sum(diag(reg_out_cm))/sum(reg_out_cm)*100
lin.reg<-round(cbind(reg_in, reg_out), 0)
lin.reg
## reg_in reg_out
## [1,] 76 71
LASSO via the glmnet package requires the y and x variables to be coded as separate matrices.
Y.train.TF = log(cereals.TF[, 1]) # output variable
str(Y.train.TF)
## num [1:150] -3.0428 -4.4063 -1.4714 -0.9153 -0.0553 ...
X.train.TF = as.matrix(cereals.TF[,2:ncol(cereals.TF)]) # regressors
str(X.train.TF)
## num [1:150, 1:10] 3.62 4.58 4.31 4.51 3.83 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:150] "1" "2" "3" "4" ...
## ..$ : chr [1:10] "X1.TF" "X2.TF" "X3.TF" "X4.TF" ...
Y.train.HB = log(cereals.HB[, 1]) # output variable
X.train.HB = as.matrix(cereals.HB[,2:ncol(cereals.HB)]) # regressors
Y.train.NC = log(cereals.NC[, 1]) # output variable
X.train.NC = as.matrix(cereals.NC[,2:ncol(cereals.NC)]) # regressors
Y.test.TF = log(cereals.TF.test[, 1]) # output variable
X.test.TF = as.matrix(cereals.TF.test[,2:ncol(cereals.TF)]) # regressors
Y.test.HB = log(cereals.HB.test[, 1]) # output variable
X.test.HB = as.matrix(cereals.HB.test[,2:ncol(cereals.HB)]) # regressors
Y.test.NC = log(cereals.NC.test[, 1]) # output variable
X.test.NC = as.matrix(cereals.NC.test[,2:ncol(cereals.NC)]) # regressors
library(glmnet)
lasso.cv.TF=cv.glmnet(X.train.TF, Y.train.TF, family="gaussian",alpha=1)
lasso.cv.HB=cv.glmnet(X.train.HB, Y.train.HB, family="gaussian",alpha=1)
lasso.cv.NC=cv.glmnet(X.train.NC, Y.train.NC, family="gaussian",alpha=1)
bestlam.lasso.cv = lasso.cv.TF$lambda.min
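cv.glmnet selects the penalty by cross-validation, and lambda.min is the value that minimises the cross-validated error. The more conservative lambda.1se, the largest lambda within one standard error of the minimum, would be a common alternative; a commented-out sketch:
#bestlam.lasso.cv = lasso.cv.TF$lambda.1se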
reg.la.TF=glmnet(X.train.TF,Y.train.TF, family="gaussian", alpha=1,
lambda = bestlam.lasso.cv)
TF = predict(reg.la.TF, s= bestlam.lasso.cv, newx=X.train.TF)
str(TF)
## num [1:150, 1] -3.3 -5.6 -4.45 -5.16 -2.76 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:150] "1" "2" "3" "4" ...
## ..$ : chr "1"
The predictions are the fitted utilities.
bestlam.lasso.cv = lasso.cv.HB$lambda.min
reg.la.HB=glmnet(X.train.HB,Y.train.HB, family="gaussian", alpha=1,
lambda = bestlam.lasso.cv)
HB = predict(reg.la.HB, s= bestlam.lasso.cv, newx=X.train.HB)
bestlam.lasso.cv = lasso.cv.NC$lambda.min
reg.la.NC=glmnet(X.train.NC,Y.train.NC, family="gaussian", alpha=1,
lambda = bestlam.lasso.cv)
NC = predict(reg.la.NC, s= bestlam.lasso.cv, newx=X.train.NC)
W <- as.matrix(coef(reg.la.TF))
W
## s0
## (Intercept) -7.96959386
## X1.TF -2.89974525
## X2.TF 2.20885371
## X3.TF 2.28251390
## X4.TF 0.00000000
## X5.TF 0.00000000
## X6.TF 0.00000000
## X7.TF 0.05462683
## X8.TF 0.65283385
## X9.TF 0.00000000
## X10.TF 0.00000000
keep_X <- rownames(W)[W!=0]
keep_X <- keep_X[!keep_X == "(Intercept)"] #remove intercept
keep_X
## [1] "X1.TF" "X2.TF" "X3.TF" "X7.TF" "X8.TF"
x<- X.train.TF[,keep_X]
summ.lasso.TF<-lm(Y.train.TF~x -1)
#stargazer(summ.lasso.TF, type = "latex")
W <- as.matrix(coef(reg.la.HB))
keep_X <- rownames(W)[W!=0] #only keep non-zero coefficients
keep_X <- keep_X[!keep_X == "(Intercept)"] #drop the intercept
x<- X.train.HB[,keep_X] #only use selected features
summ.lasso.HB<-lm(Y.train.HB~x -1) #remove intercept in new model
W <- as.matrix(coef(reg.la.NC))
keep_X <- rownames(W)[W!=0]
keep_X <- keep_X[!keep_X == "(Intercept)"]
x<- X.train.NC[,keep_X]
summ.lasso.NC<-lm(Y.train.NC~x -1)
util<-cbind(TF, HB, NC)
colnames(util)<-levels
wahl<-apply(util, 1, function(x) which.max(x))
cm = table(wahl, cereals_fit$choice)
lasso_in<-sum(diag(cm))/sum(cm)*100
TF = predict(reg.la.TF, s= lasso.cv.TF$lambda.min, newx=X.test.TF)
HB = predict(reg.la.HB, s= lasso.cv.HB$lambda.min, newx=X.test.HB)
NC = predict(reg.la.NC, s= lasso.cv.NC$lambda.min, newx=X.test.NC)
util<-cbind(TF, HB, NC)
wahl<-apply(util, 1, function(x) which.max(x))
lasso_out_cm = table(wahl, cereals_test$choice)
rownames(lasso_out_cm)<-alt.names
lasso_out<-sum(diag(lasso_out_cm))/sum(lasso_out_cm)*100
lasso.cv<-round(cbind(lasso_in, lasso_out), 0)
lasso.cv
## lasso_in lasso_out
## [1,] 72 77
This is the same split as for the logit model, but it has to be done again because the variable names overlap.
r<-nrow(cereals)/2
cereals_fit<-cereals[1:r,]
cereals_test<-cereals[-(1:r),]
The tree is first grown with a small cp of 0.001 so that the full cp table can be inspected.
library(rpart)
library(rpart.plot)
tree <- rpart(choice ~ ., data = cereals_fit, method = "class", cp= 0.001)
The optimal cp, displayed below by the printcp() function for the first run, does not match the cp chosen in the code below. The chosen cp level, however, yields good overall results and remedies the poor classification that would occur in the fourth run if the minimal cp were used.
printcp(tree)
##
## Classification tree:
## rpart(formula = choice ~ ., data = cereals_fit, method = "class",
## cp = 0.001)
##
## Variables actually used in tree construction:
## [1] X1.TF X2.TF X3.HB X3.NC X3.TF X5.HB X7.TF
##
## Root node error: 93/150 = 0.62
##
## n= 150
##
## CP nsplit rel error xerror xstd
## 1 0.193548 0 1.00000 1.06452 0.062384
## 2 0.107527 1 0.80645 0.93548 0.064998
## 3 0.080645 2 0.69892 0.97849 0.064331
## 4 0.053763 4 0.53763 0.92473 0.065134
## 5 0.032258 5 0.48387 0.84946 0.065753
## 6 0.021505 6 0.45161 0.80645 0.065846
## 7 0.001000 7 0.43011 0.83871 0.065794
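For comparison, the cp minimising the cross-validated error (xerror) could also be read off the cp table automatically; a commented-out sketch (best.cp is a hypothetical name, and this variant is not the one used for the reported results):
#best.cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
#tree <- rpart(choice ~ ., data = cereals_fit, method = "class", cp = best.cp)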
tree <- rpart(choice ~ ., data = cereals_fit, method = "class", cp= 0.06451613)
Probabilities for each class are predicted.
fit.tree<-predict(tree, newdata=cereals_fit)
head(fit.tree, 4)
## TF HB NC
## 1 0.24324324 0.5135135 0.2432432
## 2 0.20000000 0.5600000 0.2400000
## 3 0.20000000 0.5600000 0.2400000
## 4 0.08695652 0.1304348 0.7826087
Choose the alternative with the maximum predicted class probability. Compare to the observed choices and calculate the accuracy from the confusion matrix.
fit.tree<-apply(fit.tree, 1, function(x) which.max(x))
head(fit.tree)
## 1 2 3 4 5 6
## 2 2 2 3 1 3
cm = table(fit.tree, cereals_fit$choice)
tree_in<-sum(diag(cm))/sum(cm)*100
tree.pred=predict(tree, newdata=cereals_test)
fit.tree<-apply(tree.pred, 1, function(x) which.max(x))
tree_out_cm = table(fit.tree, cereals_test$choice)
rownames(tree_out_cm)<-alt.names
tree_out<-sum(diag(tree_out_cm))/sum(tree_out_cm)*100
tree<-round(cbind(tree_in, tree_out), 0)
tree
## tree_in tree_out
## [1,] 67 49
library(randomForest)
set.seed(1)
The commented-out lines show manual fitting alternatives with the randomForest package; they were not used, in favor of the automatic tuning below.
#bag1 = randomForest(choice ~ ., data=cereals_fit, mtry = round(sqrt(p),0),
#ntree=500)
#bag1 = randomForest(choice ~ ., data=cereals_fit, mtry =23, ntree=500,
#importance=TRUE, proximity=TRUE)
#bag1 = randomForest(choice ~ ., data=cereals_fit, importance=TRUE,
#proximity=TRUE)
The caret package randomly divides the training data into folds, trains the model on them, and averages the resulting errors. This automated tuning is preferable to the manual fitting.
library("caret")
tc = trainControl(method = "repeatedcv", number = cvfolds)
bag1 = train(choice ~., data=cereals_fit, method="rf", trControl=tc)
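If finer control over the tuning were desired, an explicit grid for mtry, the tuning parameter of caret's "rf" method, could be supplied via the tuneGrid argument; a commented-out sketch with illustrative grid values:
#rf.grid <- expand.grid(mtry = c(2, 5, 10))
#bag1 = train(choice ~., data=cereals_fit, method="rf", trControl=tc, tuneGrid=rf.grid)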
If enabled, print(bag1) shows how the parameters were tuned. It is disabled for clarity and to save space. Please see the provided executable R data file for more details.
#print(bag1)
rf_var_plot <- varImp(bag1, top = 20) #only to be used with bag1 from caret package
pred.bag1 = predict(bag1, newdata =cereals_fit)
head(pred.bag1) #predictions are in the form of categories
## [1] HB NC HB NC TF HB
## Levels: TF HB NC
cm = table(pred.bag1, cereals_fit$choice)
rf_in<-sum(diag(cm))/sum(cm)*100
pred.bag1 = predict(bag1, newdata =cereals_test)
rf_out_cm = table(pred.bag1, cereals_test$choice)
rf_out<-sum(diag(rf_out_cm))/sum(rf_out_cm)*100
rf<-round(cbind(rf_in, rf_out), 0)
rf
## rf_in rf_out
## [1,] 100 63
The commented-out line shows the manual fitting method with the gbm package. This method was not used, in favor of the automatic tuning.
library("gbm")
set.seed(1)
#boost1=gbm(choice ~ ., data=cereals_fit, distribution= "multinomial")
The automatic tuning is done using the caret package.
tc = trainControl(method = "repeatedcv", number = cvfolds)
garbage <- capture.output(boost1 <- train(choice ~., data=cereals_fit, method="gbm",
trControl=tc))
If enabled, boost1 shows how the parameters were tuned. It is disabled for clarity and to save space. Please see the provided executable R data file for more details.
#boost1
var_imp_gbm<-plot(varImp(boost1))
pred.boost1=predict(boost1, newdata =cereals_fit)
cm = table(pred.boost1, cereals_fit$choice)
boost_in<-sum(diag(cm))/sum(cm)*100
pred.boost1=predict(boost1, newdata =cereals_test)
boost_out_cm = table(pred.boost1, cereals_test$choice)
rownames(boost_out_cm) <- alt.names
boost_out<-sum(diag(boost_out_cm))/sum(boost_out_cm)*100
boost<-round(cbind(boost_in, boost_out), 0)
boost
## boost_in boost_out
## [1,] 88 61
##Logit
coef<-c(coefficients(m)[1],
coefficients(m)[2],
coefficients(m)[3])
##Linear Regression
coef.reg<-c(coefficients(ols.TF)[1],
coefficients(ols.TF)[2],
coefficients(ols.TF)[3],
coefficients(ols.HB)[1],
coefficients(ols.HB)[2],
coefficients(ols.HB)[3],
coefficients(ols.NC)[1],
coefficients(ols.NC)[2],
coefficients(ols.NC)[3])
coef.lasso<-c(coefficients(summ.lasso.TF)[1],
coefficients(summ.lasso.TF)[2],
coefficients(summ.lasso.TF)[3],
coefficients(summ.lasso.HB)[1],
coefficients(summ.lasso.HB)[2],
coefficients(summ.lasso.HB)[3],
coefficients(summ.lasso.NC)[1],
coefficients(summ.lasso.NC)[2],
coefficients(summ.lasso.NC)[3])
details[[i]]<-cbind(coef)
coeff.logit[[i]]<-m
details1[[i]]<-cbind(logit,lin.reg,lasso.cv,tree,rf,boost)
details3[[i]]<-cbind(snr, noise)
details4[[i]]<-cbind(sum(rf_in), sum(rf_out))
coeff.reg[[i]] <-cbind(coef.reg)
coeff.reg.TF[[i]] <-ols.TF
coeff.reg.HB[[i]] <-ols.HB
coeff.reg.NC[[i]] <-ols.NC
coeff.lasso[[i]] <- cbind(coef.lasso)
coeff.lasso.TF[[i]] <- summ.lasso.TF
coeff.lasso.HB[[i]] <- summ.lasso.HB
coeff.lasso.NC[[i]] <- summ.lasso.NC
cm.logit[[i]] <- logit_out_cm[2]
cm.reg[[i]] <- reg_out_cm
cm.lasso[[i]] <- lasso_out_cm
cm.tree[[i]] <- tree_out_cm
cm.rf[[i]] <- rf_out_cm
cm.boost[[i]] <- boost_out_cm
rf_plot[[i]] <- rf_var_plot
gbm_plot[[i]] <- var_imp_gbm
summ_fit_sample[[i]] <- summ_cereals_fit
summ_test_sample[[i]] <-summ_cereals_test
summ_fit_lin_sample[[i]] <-summ_cereals_fit_lin
summ_test_lin_sample[[i]] <-summ_cereals_test_lin
} #end Loop
This chunk prepares the results: various data frames are built and their rows and columns are named. This preparation matters for the presentation of the results in the paper.
suppressMessages(library(stargazer))
suppressMessages(library(xtable))
details2<-details1
details2<-unlist(details2, recursive = TRUE, use.names = TRUE)
run1<-round(matrix(details2[1:12], ncol = 2, byrow = TRUE), 0)
run2<-round(matrix(details2[13:24], ncol = 2, byrow = TRUE), 0)
run3<-round(matrix(details2[25:36], ncol =2, byrow = TRUE),0)
run4<-round(matrix(details2[37:48], ncol =2, byrow = TRUE), 0)
names1<-c("in sample", "out of sample")
names2<-c("logit", "lin reg", "lasso.cv", "tree", "random forest", "boosting")
colnames(run1)<-names1
rownames(run1)<-names2
colnames(run2)<-names1
rownames(run2)<-names2
colnames(run3)<-names1
rownames(run3)<-names2
colnames(run4)<-names1
rownames(run4)<-names2
run1.table<-xtable(run1, caption = index[1], digits = 0)
run2.table<-xtable(run2, caption = index[2], digits = 0)
run3.table<-xtable(run3, caption = index[3], digits = 0)
run4.table<-xtable(run4, caption = index[4], digits = 0)
coeff<-t(matrix(unlist(details), ncol = 3, byrow = TRUE))
colnames(coeff)<-index
rownames(coeff)<-c("price","quality","popularity")
coeff.table<-xtable(coeff, caption = "Coefficients")
coeff.reg2<-t(matrix(unlist(coeff.reg), ncol = 9, byrow = TRUE))
colnames(coeff.reg2)<-index
rownames(coeff.reg2)<-c("X1.TF", "X2.TF", "X3.TF", "X1.HB", "X2.HB", "X3.HB",
"X1.NC", "X2.NC", "X3.NC")
coeff.table.reg<-xtable(coeff.reg2, caption = "Coefficients")
coef.lasso<-t(matrix(unlist(coeff.lasso), ncol = 9, byrow = TRUE))
colnames(coef.lasso)<-index
rownames(coef.lasso)<-c("X1.TF", "X2.TF", "X3.TF", "X1.HB", "X2.HB", "X3.HB",
"X1.NC", "X2.NC", "X3.NC")
coeff.table.lasso<-xtable(coef.lasso, caption = "Coefficients")
details.snr<-unlist(details3)
details.snr<-matrix(details.snr, ncol = 2, byrow = TRUE)
colnames(details.snr)<-c("Signal-to-Noise", "Noise")
rownames(details.snr)<-index
snr.table<-xtable(details.snr, caption= "Signal-to-Noise")
This section presents the code for the results. Only the performance comparison is presented to save space. The results match the Results section of the paper. For further details please see the provided R data file.
details.snr
## Signal-to-Noise Noise
## p=10 2.236826 1.64465
## p=50 3.017325 1.64465
## p=150 2.425730 1.64465
## p=290 2.836494 1.64465
index
## [1] "p=10" "p=50" "p=150" "p=290"
run1
## in sample out of sample
## logit 76 78
## lin reg 76 71
## lasso.cv 72 77
## tree 67 49
## random forest 100 63
## boosting 88 61
run2
## in sample out of sample
## logit 77 62
## lin reg 72 45
## lasso.cv 62 54
## tree 58 41
## random forest 100 45
## boosting 100 45
run3
## in sample out of sample
## logit 100 41
## lin reg 80 33
## lasso.cv 70 65
## tree 69 47
## random forest 100 51
## boosting 100 49
run4
## in sample out of sample
## logit 100 36
## lin reg 84 29
## lasso.cv 74 64
## tree 65 47
## random forest 100 50
## boosting 100 47
The logit coefficients are listed below. First, the relevant coefficients for all runs are listed. The next output lists all estimated logit coefficients for the first run.
coeff
## p=10 p=50 p=150 p=290
## price -3.096471 -3.354573 -28.90478 -12.353016
## quality 1.851797 1.918786 16.58122 6.582303
## popularity 3.212596 3.314025 31.08082 11.217060
summary(coeff.logit[[1]])
##
## Call:
## mlogit(formula = choice ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 +
## X8 + X9 + X10 - 1, data = Cereal_fit, seed = 1, method = "nr")
##
## Frequencies of alternatives:
## HB NC TF
## 0.38 0.28 0.34
##
## nr method
## 6 iterations, 0h:0m:0s
## g'(-H)^-1g = 0.000249
## successive function values within tolerance limits
##
## Coefficients :
## Estimate Std. Error z-value Pr(>|z|)
## X1 -3.0964715 0.4140316 -7.4788 7.505e-14 ***
## X2 1.8517973 0.2478674 7.4709 7.971e-14 ***
## X3 3.2125959 0.4196564 7.6553 1.932e-14 ***
## X4 -0.0039057 0.3015569 -0.0130 0.9897
## X5 -0.0257784 0.2895766 -0.0890 0.9291
## X6 -0.3328288 0.3096260 -1.0749 0.2824
## X7 0.0014206 0.3040154 0.0047 0.9963
## X8 0.2248430 0.2962448 0.7590 0.4479
## X9 -0.0603981 0.3093601 -0.1952 0.8452
## X10 0.2815895 0.3250167 0.8664 0.3863
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -91.375
The linear-regression coefficients are listed: first the relevant coefficients for all runs, then the estimated coefficients per alternative for the first run.
coeff.reg2
## p=10 p=50 p=150 p=290
## X1.TF -3.9066881 -4.3611262 -10.056788 -677.756368
## X2.TF 1.4512183 1.5410204 4.598484 634.968445
## X3.TF 2.1921421 2.9160312 6.152307 -386.862818
## X1.HB -4.3510830 -2.7276638 5.957411 -40.995783
## X2.HB 1.0891153 1.3686861 -8.140303 -49.804010
## X3.HB 2.6882814 1.1111883 -4.574937 43.408298
## X1.NC -4.8601217 -4.5298688 4.287643 -3.286762
## X2.NC 1.4941079 1.0705718 -6.531668 -7.961015
## X3.NC 0.9540236 0.7777074 -10.656929 4.203862
summary(coeff.reg.TF[[1]])
##
## Call:
## lm(formula = form.TF, data = cereals.TF)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.889 -1.379 0.512 1.771 5.318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## X1.TF -3.90669 0.43701 -8.940 2.06e-15 ***
## X2.TF 1.45122 0.33132 4.380 2.31e-05 ***
## X3.TF 2.19214 0.52916 4.143 5.91e-05 ***
## X4.TF -0.18126 0.55436 -0.327 0.7442
## X5.TF 0.03678 0.48089 0.076 0.9391
## X6.TF 0.06824 0.50473 0.135 0.8926
## X7.TF 0.22404 0.55816 0.401 0.6887
## X8.TF 1.18130 0.55561 2.126 0.0352 *
## X9.TF 0.20141 0.52722 0.382 0.7030
## X10.TF -0.08117 0.58449 -0.139 0.8898
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.911 on 140 degrees of freedom
## Multiple R-squared: 0.6675, Adjusted R-squared: 0.6437
## F-statistic: 28.1 on 10 and 140 DF, p-value: < 2.2e-16
summary(coeff.reg.HB[[1]])
##
## Call:
## lm(formula = form.HB, data = cereals.HB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.2658 -1.3238 0.3358 1.8578 6.2953
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## X1.HB -4.351083 0.436698 -9.964 < 2e-16 ***
## X2.HB 1.089115 0.547734 1.988 0.0487 *
## X3.HB 2.688281 0.375667 7.156 4.26e-11 ***
## X4.HB -0.025502 0.541557 -0.047 0.9625
## X5.HB 0.461733 0.566486 0.815 0.4164
## X6.HB 0.461519 0.552594 0.835 0.4050
## X7.HB 0.334463 0.568390 0.588 0.5572
## X8.HB 0.007566 0.548774 0.014 0.9890
## X9.HB -0.022064 0.563030 -0.039 0.9688
## X10.HB 0.204218 0.618313 0.330 0.7417
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.077 on 140 degrees of freedom
## Multiple R-squared: 0.6443, Adjusted R-squared: 0.6189
## F-statistic: 25.36 on 10 and 140 DF, p-value: < 2.2e-16
summary(coeff.reg.NC[[1]])
##
## Call:
## lm(formula = form.NC, data = cereals.NC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0819 -1.7971 0.0966 1.6785 5.6364
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## X1.NC -4.86012 0.45618 -10.654 <2e-16 ***
## X2.NC 1.49411 0.46432 3.218 0.0016 **
## X3.NC 0.95402 0.36686 2.601 0.0103 *
## X4.NC -0.28330 0.57875 -0.490 0.6252
## X5.NC -0.04915 0.52389 -0.094 0.9254
## X6.NC 0.30359 0.51397 0.591 0.5557
## X7.NC 0.54486 0.48957 1.113 0.2676
## X8.NC -0.34925 0.47651 -0.733 0.4648
## X9.NC 0.07644 0.53379 0.143 0.8863
## X10.NC -0.76902 0.51780 -1.485 0.1397
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.803 on 140 degrees of freedom
## Multiple R-squared: 0.7201, Adjusted R-squared: 0.7001
## F-statistic: 36.01 on 10 and 140 DF, p-value: < 2.2e-16
The LASSO coefficients are listed: first the relevant coefficients for all runs, then the estimated coefficients per alternative for the first run.
coef.lasso
## p=10 p=50 p=150 p=290
## X1.TF -3.9012652 -3.8703536 -4.1970812 -3.5156274
## X2.TF 1.4647826 0.7921983 1.4707380 0.7801829
## X3.TF 2.1496418 3.6677306 2.7068046 3.0942549
## X1.HB -4.3478177 -2.7278889 -3.5658226 -3.2054035
## X2.HB 1.1806458 0.8591900 0.9522660 1.6118244
## X3.HB 2.6579528 1.3227454 2.0299172 1.6411899
## X1.NC -4.8652362 -3.9992464 -5.9801915 -4.2136059
## X2.NC 1.4939412 1.0644524 0.9997719 0.8643321
## X3.NC 0.9650781 1.2361749 2.1236241 0.8196607
summary(coeff.lasso.TF[[1]])
##
## Call:
## lm(formula = Y.train.TF ~ x - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.699 -1.429 0.451 1.817 5.179
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## xX1.TF -3.9013 0.4266 -9.146 4.95e-16 ***
## xX2.TF 1.4648 0.3213 4.559 1.09e-05 ***
## xX3.TF 2.1496 0.5111 4.206 4.53e-05 ***
## xX7.TF 0.1994 0.5368 0.371 0.7109
## xX8.TF 1.1627 0.5361 2.169 0.0317 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.863 on 145 degrees of freedom
## Multiple R-squared: 0.6668, Adjusted R-squared: 0.6553
## F-statistic: 58.03 on 5 and 145 DF, p-value: < 2.2e-16
summary(coeff.lasso.HB[[1]])
##
## Call:
## lm(formula = Y.train.HB ~ x - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.6714 -1.3513 0.1692 1.9714 6.3899
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## xX1.HB -4.3478 0.4224 -10.292 < 2e-16 ***
## xX2.HB 1.1806 0.5253 2.248 0.0261 *
## xX3.HB 2.6580 0.3584 7.417 9.04e-12 ***
## xX5.HB 0.4698 0.5531 0.849 0.3971
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.026 on 146 degrees of freedom
## Multiple R-squared: 0.6414, Adjusted R-squared: 0.6315
## F-statistic: 65.27 on 4 and 146 DF, p-value: < 2.2e-16
summary(coeff.lasso.NC[[1]])
##
## Call:
## lm(formula = Y.train.NC ~ x - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1309 -1.7957 0.1941 1.6694 5.7915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## xX1.NC -4.8652 0.4353 -11.176 < 2e-16 ***
## xX2.NC 1.4939 0.4519 3.306 0.00119 **
## xX3.NC 0.9651 0.3506 2.752 0.00667 **
## xX7.NC 0.5452 0.4739 1.150 0.25192
## xX10.NC -0.8190 0.4958 -1.652 0.10076
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.764 on 145 degrees of freedom
## Multiple R-squared: 0.7181, Adjusted R-squared: 0.7083
## F-statistic: 73.86 on 5 and 145 DF, p-value: < 2.2e-16
The confusion matrices provided in the paper are shown below. Only the first one is printed here for clarity.
cm.logit[[1]]
## $table
## Reference
## Prediction TF HB NC
## TF 40 4 3
## HB 8 38 4
## NC 4 10 39
# cm.logit[[4]]
# cm.reg[[1]]
# cm.reg[[4]]
# cm.lasso[[1]]
# cm.lasso[[4]]
# cm.tree[[1]]
# cm.tree[[4]]
# cm.rf[[1]]
# cm.rf[[4]]
# cm.boost[[1]]
# cm.boost[[4]]
sum.stat.fit<- t(cbind(summ_fit_sample[[1]],
summ_fit_sample[[2]],
summ_fit_sample[[3]],
summ_fit_sample[[4]]))
rownames(sum.stat.fit)<- run.names
sum.stat.test<- t(cbind(summ_test_sample[[1]],
summ_test_sample[[2]],
summ_test_sample[[3]],
summ_test_sample[[4]]))
rownames(sum.stat.test)<- run.names
sum.stat.fit.lin <- t(cbind(summ_fit_lin_sample[[1]],
summ_fit_lin_sample[[2]],
summ_fit_lin_sample[[3]],
summ_fit_lin_sample[[4]]))
rownames(sum.stat.fit.lin)<- run.names
sum.stat.test.lin <- t(cbind(summ_test_lin_sample[[1]],
summ_test_lin_sample[[2]],
summ_test_lin_sample[[3]],
summ_test_lin_sample[[4]]))
rownames(sum.stat.test.lin)<- run.names
The y-variable is tabulated in order to determine the distribution of the three classes in the train and test samples. The cereals and cereals.lin data frames are identical with regard to the categorical y-variable.
sum.stat.fit
## TF HB NC
## run1 51 57 42
## run2 51 57 42
## run3 49 62 39
## run4 51 57 42
sum.stat.fit.lin
## TF HB NC
## run1 51 57 42
## run2 51 57 42
## run3 49 62 39
## run4 51 57 42
sum.stat.test
## TF HB NC
## run1 52 52 46
## run2 44 52 54
## run3 52 57 41
## run4 48 53 49
sum.stat.test.lin
## TF HB NC
## run1 52 52 46
## run2 44 52 54
## run3 52 57 41
## run4 48 53 49
Therefore they yield the same train samples and the same test samples. This shows that, although there are two data frames, the data are comparable.
sum.stat.fit==sum.stat.fit.lin
## TF HB NC
## run1 TRUE TRUE TRUE
## run2 TRUE TRUE TRUE
## run3 TRUE TRUE TRUE
## run4 TRUE TRUE TRUE
sum.stat.test==sum.stat.test.lin
## TF HB NC
## run1 TRUE TRUE TRUE
## run2 TRUE TRUE TRUE
## run3 TRUE TRUE TRUE
## run4 TRUE TRUE TRUE
The decision-tree split plot is not provided because the prp() output could not be saved in a data frame.
First, the variable importance of the random forest is presented for the first and last run, as in the paper.
plot(rf_plot[[1]], top = 20)
plot(rf_plot[[4]], top = 20)
Second, the variable importance of the boosted trees is presented for the first and last run, as in the paper.
gbm_plot[[1]]
gbm_plot[[4]]
The code chunk below shows which version of R the code was run with. Please see the R data file for more information.
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] xtable_1.8-4 stargazer_5.2.2 gbm_2.1.5
## [4] randomForest_4.6-14 rpart.plot_3.0.8 rpart_4.1-15
## [7] glmnet_3.0-1 Matrix_1.2-17 caret_6.0-86
## [10] ggplot2_3.2.1 lattice_0.20-38 mlogit_1.0-2
## [13] lmtest_0.9-37 zoo_1.8-5 Formula_1.2-3
## [16] evd_2.3-3 mvtnorm_1.0-11
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.1 lubridate_1.7.8 class_7.3-15
## [4] assertthat_0.2.1 digest_0.6.25 ipred_0.9-9
## [7] foreach_1.4.7 R6_2.4.0 plyr_1.8.4
## [10] stats4_3.6.0 e1071_1.7-3 evaluate_0.13
## [13] pillar_1.4.4 Rdpack_0.11-1 rlang_0.4.6
## [16] lazyeval_0.2.2 data.table_1.12.2 rmarkdown_2.3
## [19] splines_3.6.0 statmod_1.4.32 gower_0.2.1
## [22] stringr_1.4.0 munsell_0.5.0 compiler_3.6.0
## [25] xfun_0.7 pkgconfig_2.0.2 shape_1.4.4
## [28] htmltools_0.3.6 nnet_7.3-12 tidyselect_0.2.5
## [31] gridExtra_2.3 tibble_3.0.1 prodlim_2019.11.13
## [34] codetools_0.2-16 crayon_1.3.4 dplyr_0.8.5
## [37] withr_2.1.2 MASS_7.3-51.4 recipes_0.1.12
## [40] ModelMetrics_1.2.2.2 grid_3.6.0 nlme_3.1-144
## [43] gtable_0.3.0 lifecycle_0.2.0 magrittr_1.5
## [46] pROC_1.16.2 scales_1.0.0 bibtex_0.4.2.2
## [49] stringi_1.4.3 reshape2_1.4.3 timeDate_3043.102
## [52] ellipsis_0.3.0 vctrs_0.3.0 generics_0.0.2
## [55] lava_1.6.7 iterators_1.0.12 tools_3.6.0
## [58] glue_1.3.1 purrr_0.3.4 survival_2.44-1.1
## [61] yaml_2.2.0 colorspace_1.4-1 gbRd_0.4-11
## [64] knitr_1.23
search() shows the loaded packages.
search()
## [1] ".GlobalEnv" "package:xtable" "package:stargazer"
## [4] "package:gbm" "package:randomForest" "package:rpart.plot"
## [7] "package:rpart" "package:glmnet" "package:Matrix"
## [10] "package:caret" "package:ggplot2" "package:lattice"
## [13] "package:mlogit" "package:lmtest" "package:zoo"
## [16] "package:Formula" "package:evd" "package:mvtnorm"
## [19] "package:stats" "package:graphics" "package:grDevices"
## [22] "package:utils" "package:datasets" "package:methods"
## [25] "Autoloads" "package:base"
The warnings are not presented in order to save space; they can be viewed in the R data file.
warnings()