Term Paper: Statistical Programming

M.Sc. Business Administration

University of Hamburg

by Kim Wagner, Annika Zeyn and Georg Zhelev

Matriculation numbers: 6916663 6933827 7087423

Date: 01.07.2021

1. Distributions

First a continuous random variable $X$ with a density function

$$f_X(x)=\begin{cases}\frac{1}{4}\quad \text{if}\ 0\le x<2\\ \frac{1}{4}x\quad \text{if}\ 2\le x\le \sqrt{8}\\ 0\quad\text{otherwise}\end{cases}$$

is considered. To obtain the cumulative distribution function, the probability density function is integrated. For $x$ in $[0, 2)$ we have $F(x) = \int_0^x \frac{1}{4} \, dt = \frac{1}{4}x$, and for $x$ in $[2, \sqrt{8}]$ we have $F(x) = F(2) + \int_2^x \frac{1}{4}t \, dt = \frac{1}{2} + \frac{x^2-4}{8} = \frac{x^2}{8}$. Since $f(x)$ is $0$ outside of $[0, \sqrt{8}]$, we know $F(a) = P(X \le a) = 0$ for $a < 0$ and $F(a) = 1$ for $a > \sqrt{8}$. Therefore the cumulative distribution function can be written as

$$F_X(x)=\begin{cases}0\quad \text{if}\ x<0\\ \frac{1}{4}x\quad \text{if}\ 0\le x<2\\ \frac{x^2}{8}\quad \text{if}\ 2\le x\le \sqrt{8}\\ 1\quad\text{if}\ x >\sqrt{8}\end{cases}.$$

The above PDF and CDF are plotted. An interval on which the functions are plotted is set; it equals the interval on which the function is defined, but a bigger interval could also be used. The cumulative distribution function (CDF) can be obtained numerically by normalizing the PDF values (dividing each value on the grid by the sum of all values) and taking their cumulative sum, or analytically from the integration shown above. The PDF in the plot follows the equation above and is $0$ outside of the defined interval. The CDF shows the cumulative probability, which lies between $0$ and $1$.
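
The following minimal sketch shows how such a plot could be produced with numpy and matplotlib; the function names `pdf` and `cdf` and the plotting details are illustrative assumptions, not the original notebook code.

```python
import numpy as np
import matplotlib.pyplot as plt

def pdf(x):
    """Piecewise density f_X(x) as defined above."""
    return np.where((x >= 0) & (x < 2), 0.25,
           np.where((x >= 2) & (x <= np.sqrt(8)), 0.25 * x, 0.0))

def cdf(x):
    """Piecewise distribution function F_X(x) from the integration above."""
    return np.where(x < 0, 0.0,
           np.where(x < 2, 0.25 * x,
           np.where(x <= np.sqrt(8), x**2 / 8, 1.0)))

x = np.linspace(-0.5, np.sqrt(8) + 0.5, 500)   # slightly larger than the support
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(x, pdf(x)); ax1.set_title("PDF")
ax2.plot(x, cdf(x)); ax2.set_title("CDF")
plt.show()
```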

Next the inverse $F^{-1}_X:(0,1)\mapsto\mathbb{R}$ of the CDF is derived. On the interval $(0,1)$, $F^{-1}_X$ is found by inverting each case of the CDF, i.e. by swapping $x$ and $y$ and solving for $y$, which yields

$$F^{-1}_X(x)=\begin{cases} 4x\quad \text{if}\ 0\le x<\frac{1}{2}\\ \sqrt{8x} \quad \text{if}\ \frac{1}{2} \le x\le 1\end{cases}.$$

The inverse CDF is then defined according to the formula above and plotted. It takes a probability between $0$ and $1$ as input and returns the corresponding quantile of $X$, i.e. a value between $0$ and $\sqrt{8}$.
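
A short sketch of how this quantile function could be implemented, assuming the hypothetical name `inv_cdf`; it directly mirrors the piecewise formula above.

```python
import numpy as np
import matplotlib.pyplot as plt

def inv_cdf(p):
    """Quantile function F_X^{-1}(p) for p in (0, 1)."""
    p = np.asarray(p, dtype=float)
    return np.where(p < 0.5, 4 * p, np.sqrt(8 * p))

p = np.linspace(0.001, 1, 500)
plt.plot(p, inv_cdf(p))
plt.xlabel("p"); plt.ylabel("$F_X^{-1}(p)$")
plt.show()
```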

Next the population mean of the random variable $X$ is calculated. This can be done analytically by integration or numerically using the definition of the PDF. One way is to take the definite integral of $x \cdot f_X(x)$ over the support,

$$\mu_f := \int_{0}^{\sqrt{8}}x \cdot f_X(x) \ dx = \left[\frac{x^2}{8}\right]_0^2 + \left[\frac{x^3}{12}\right]_2^{\sqrt{8}} = \frac{1}{2} + \frac{8\sqrt{8}-8}{12} \approx \underline{\underline{1.719}}.$$

Per the definition of the PDF, the population mean can also be approximated numerically by summing $x \cdot f_X(x)$ over an equally spaced grid on the support, weighted by the grid spacing $\Delta x$ (equivalently, by normalizing the density values so that they sum to one):

$$\mu_f \approx \sum_{x}{x \cdot f_X(x)\,\Delta x} \approx 1.719$$

Next $100$ random numbers with distribution $F_X$ are simulated. The samples are drawn using the inversion method and the inverse CDF defined above. The population mean is estimated by the sample mean. The bias of this estimate is $0.073$. To decrease the bias, the sample has to be enlarged: a random number is appended to the sample until the absolute bias is lower than $0.01$. The augmented sample then contains $783$ observations. A random seed is used for reproducibility.
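
A sketch of the inversion method and the augmentation loop. The generator, seed and variable names are assumptions, so the resulting bias and sample size will differ from the $0.073$ and $783$ reported above, which depend on the original seed.

```python
import numpy as np

def inv_cdf(p):
    """Quantile function of F_X, as derived above."""
    return np.where(p < 0.5, 4 * p, np.sqrt(8 * p))

mu_pop = 0.5 + (8 * np.sqrt(8) - 8) / 12       # population mean (approx. 1.719)

rng = np.random.default_rng(42)                 # assumed seed, not the one used in the paper
sample = inv_cdf(rng.uniform(size=100))         # inversion method: U ~ Unif(0,1) -> F^{-1}(U)
print("bias:", abs(sample.mean() - mu_pop))

# append one random number at a time until the absolute bias drops below 0.01
while abs(sample.mean() - mu_pop) >= 0.01:
    sample = np.append(sample, inv_cdf(rng.uniform()))
print("augmented sample size:", sample.size)
```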

A decreasing bias is expected: the more random numbers from the population distribution are included, the richer the sample, so its empirical mean gets asymptotically close to the population mean. The length of the augmented sample shows how many new random numbers had to be sampled before the bias dropped below $0.01$.

The next graphic illustrates the convergence of the sample mean to the population mean. Because the same seed was used as above, the augmented sample requires $683$ additional random numbers to reach a bias of under $0.01$. The original sample had $100$ random numbers, giving a total size of $783$. As additional random numbers are added to the sample, the graphic below shows the convergence of the sample mean (blue line) to the population mean (dotted red line).

The empirical cumulative distribution function (ECDF) of the sample $S$ is plotted, evaluated on a grid of points $x$, and compared with the population distribution $F_X$. As a reminder, random numbers were added to $S$ until the bias was below $0.01$, so the empirical CDF is expected to be similar to the population CDF. The graphic below confirms this, although the match is not exact, due to randomness and the finite sample size. With a much bigger sample of more than $783$ random numbers, say $10{,}000$, the ECDF would approximate the population CDF almost exactly, as the additional graphic below shows.
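
A possible sketch of the ECDF comparison; the sample generation, grid and names are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

def cdf(x):
    return np.where(x < 0, 0.0, np.where(x < 2, 0.25 * x,
           np.where(x <= np.sqrt(8), x**2 / 8, 1.0)))

def ecdf(sample, x):
    """Share of observations less than or equal to each evaluation point."""
    return np.array([(sample <= xi).mean() for xi in x])

rng = np.random.default_rng(42)                    # assumed seed
u = rng.uniform(size=783)
sample = np.where(u < 0.5, 4 * u, np.sqrt(8 * u))  # inversion method, as above

x = np.linspace(0, np.sqrt(8), 200)
plt.step(x, ecdf(sample, x), where="post", label="ECDF of S")
plt.plot(x, cdf(x), "r--", label="population CDF $F_X$")
plt.legend(); plt.show()
```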

2. Hypothesis Tests

In the following assignment we cover the topic of hypothesis testing in two settings. In the first part we model the binomial distribution of an urn experiment and calculate a confidence interval at a given significance level. In the second part we analyse a housing price data set: first with descriptive statistics, and then by testing five scientific hypotheses with methods of inferential statistics.

2.1 Urn Experiment

Distribution Modeling

We consider an urn that contains three white and five black balls. A ball is drawn five times with replacement. The random variable $X$, defined as the number of white balls drawn, is of interest. To model the draws, a binomial distribution with two parameters is needed: the total number of draws $n$ and the probability of drawing a white ball $p=\frac{3}{8}$. The outcome, the number of white balls $k \in \{0,...,5\}$ in the sample of $n=5$ draws, can thus be modeled. The balls in the urn are represented by the following vector:

Of interest is the distribution that describes the experiment. In particular, we want to quantify the random variable $X$ that counts the number of white balls drawn in $5$ draws. The distribution that suits this experiment is the binomial distribution $B$, which describes a random variable $X$ counting the number of successes $k$ after $n$ independent draws with success probability $p$. In our model $n=5$ and $p=0.375$, and the outcome $k$ lies in $\{0,1,2,3,4,5\}$. For these parameters the true mean of the binomial distribution is $\text{mean}_{pop} = n \cdot p = 1.875$, which is the expected number of white balls drawn from the urn in one sample. In the next part we simulate the urn drawing with replacement.

We use the function random.choice from the numpy package to simulate the experiment, i.e. draw five balls with replacement. One simulation of the experiment gives one random realization of $X$. By setting np.random.seed(1234) to an arbitrary integer, we fix the built-in pseudo-random number generator. The random.choice function takes four arguments: the previously defined urn vector, the size of our sample ($n=5$), a boolean argument for the replacement, and a probability vector for drawing each item out of the urn. The next output is a sample of 5 urn drawings with replacement:

In the next step we generate $N=100$ independent observations of $X$: $X_1,\dots,X_{100}$. For this we replicate the previously defined experiment 100 times and save each result in a vector $S$. With the random.seed function we fix the pseudo-random seed so that every run of this cell produces the same "random" results. Then a custom function draw(N, urn), which simulates $N$ observations of $X$, is defined; it simply wraps the replication of random.choice for an arbitrary number of iterations $N$.
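
A sketch of how the urn, a single experiment and the draw(N, urn) wrapper could look; the 0/1 encoding of black and white balls is an assumption.

```python
import numpy as np

urn = np.array([1, 1, 1, 0, 0, 0, 0, 0])   # assumed encoding: 1 = white, 0 = black

np.random.seed(1234)
one_draw = np.random.choice(urn, size=5, replace=True)   # five draws with replacement
print(one_draw, "-> X =", one_draw.sum())                # one realization of X

def draw(N, urn):
    """Simulate N independent observations of X (number of white balls in 5 draws)."""
    return np.array([np.random.choice(urn, size=5, replace=True).sum() for _ in range(N)])
```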

With the custom function we simulate a sample of size $100$ and calculate the empirical mean (with np.random.seed(7)). We estimate the mean of the distribution by the sample mean of the simulation output. For $N=100$ we apply the draw function and generate the random vector $S$ with the results of 100 random samples of the urn experiment; random.seed again ensures the same draws when rerunning the cell. By measuring the absolute deviation of the sample mean of $S$ from the expected (theoretical) mean, we obtain the empirical bias of the simulated experiment. By the law of large numbers we expect this bias to be close to zero for large $N$.

We see that the bias of the sample mean relative to the true distribution mean of $1.875$ is low. This is due to randomness and the fact that a sample of $100$ iterations is large enough for a decent approximation. A colleague shows another random sample and asks if the underlying distribution has the same population mean as the random variable $X$. To answer this, we calculate the sample mean of the new sample and compare it with the true distribution mean.

We see that the new sample clearly has a higher sample mean. To determine whether this difference is statistically significant, a 90% asymptotic confidence interval for the population mean is built. Further information about the distribution of the data is not necessary, because the central limit theorem states that means of i.i.d. samples are approximately normally distributed for large $N$. For $N=100>30$ we can rely on this approximation and build the confidence interval for the mean of the known urn experiment distribution. We construct the symmetric CI with 90% confidence, so that 90% of all sample means of samples drawn from this distribution are expected to lie in this interval. If the sample mean of the new sample lies outside of this CI, we are 90% confident that the underlying distribution does not match our urn experiment. We build the confidence interval of the true distribution and check whether the new sample fits into it via $\mu\pm \frac{\sigma_n}{\sqrt{n}}q_z(0.90) = (1.74, 2.01)$.
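
A sketch of how this interval could be computed, using the quantile $q_z(0.90)$ stated in the text; scipy is assumed to be available and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

n, p = 5, 3 / 8
mu = n * p                                   # 1.875
sigma = np.sqrt(n * p * (1 - p))             # standard deviation of one observation of X
N = 100                                      # size of the new sample

z = stats.norm.ppf(0.90)                     # quantile used in the text
half_width = sigma / np.sqrt(N) * z
print((mu - half_width, mu + half_width))    # roughly (1.74, 2.01)
```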

The theoretical confidence interval for the mean of the urn experiment ($\mu = 1.875$) at 90% confidence level does not contain the sample mean of the new sample. Therefore, we reject the assumption that this sample was drawn by our urn experiment. Confidence Intervals are one part of inference statistics that can be used to test scientific hypotheses. In the next section we introduce further inference methods to test more hypotheses.

2.2 Boston Housing Data Set

Descriptive Statistics

In this part we analyse the given Boston housing data. We prepare the data to estimate the prices (response $y$) of houses in Boston with all regressors ($X$) available in the data set. We do this with the Python pandas package, preparing the $y$ and $X$ variables. An excerpt of the data frame is shown below.

As descriptive statistics for the Boston data set we consider a few standard methods to get familiar with the data. First, we inspect the dimensions of the data set, i.e. its row and column sizes. Second, we calculate the sample mean and standard deviation of each variable. Further, we look at the histogram of the median value of owner-occupied homes (MEDV), and lastly we inspect whether old houses have different prices than newer houses. For this last statistic we consider 50 years as the threshold that separates old and new.

The data set consists of 506 observations and 13 regressors. In the next chart the histogram of MEDV is displayed; one can see that the distribution does not look normal.

This result makes sense, because older houses require more investment in terms of renovation costs and are therefore cheaper on average.

Inference Statistics

In the last part we test five scientific hypotheses. Since Hypotheses 1, 2 and 4 can be formulated as tests for differences in the means of two groups, we apply an unpaired, independent t-test based on Student's t-distribution. The t-distribution is used because studentized sample means (and therefore also their differences) can be shown to be t-distributed. For Hypothesis 3 we apply a Kolmogorov-Smirnov test to check whether both samples were drawn from the same population. Hypothesis 5 is tested with a Shapiro-Wilk test to determine whether the sample is normally distributed.

First we test whether the mean of the median housing prices equals 20. We apply a one-sample t-test. Because the p-value is close to 0, the null hypothesis that the mean is 20 is rejected.
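
A sketch of this one-sample t-test with scipy; the file name is a placeholder and the column name MEDV is an assumption about how the data is loaded.

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("boston.csv")                 # hypothetical file name for the Boston data
t_stat, p_value = stats.ttest_1samp(df["MEDV"], popmean=20)
print(t_stat, p_value)                         # p close to 0 -> reject H0: mean(MEDV) = 20
```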

Second we test whether houses that are close to the Charles River are, on average, as expensive as houses that are not situated at the river. For this we apply a standard t-test as well as a Welch test, which allows for different variances in the two samples. Because in both cases the p-value is close to 0, the null hypothesis that the two groups of houses have identical average prices is rejected. Therefore houses close to the Charles River are, on average, more expensive than houses not situated at the river.

Third we test whether the distribution of house prices is affected by proximity to the Charles River. Using the Kolmogorov-Smirnov test, we test whether the two samples come from different distributions; the null hypothesis is that they come from the same distribution. The p-value of the test is 0.01, which means that at the 1% significance level we cannot reject equality of the distributions, but at the 5% and 10% levels we conclude that the distributions are different.
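
A sketch covering the tests of Hypotheses 2 and 3; the file name and the standard Boston column names CHAS and MEDV are assumptions.

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("boston.csv")                       # hypothetical file name
river = df.loc[df["CHAS"] == 1, "MEDV"]
no_river = df.loc[df["CHAS"] == 0, "MEDV"]

# Welch two-sample t-test for the difference in mean prices (Hypothesis 2)
print(stats.ttest_ind(river, no_river, equal_var=False))

# Kolmogorov-Smirnov test for equality of the two distributions (Hypothesis 3)
print(stats.ks_2samp(river, no_river))
```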

Fourth, it is tested whether the per capita crime rate by town ('CRIM') is, on average, higher in districts where the share of lower-status population ('LSTAT') is above 10%. A district is classified as "low status" if more than 10% of its population is lower status, and as "non-low status" otherwise. The one-sided t-test has a p-value close to 0, so the null hypothesis that the average per capita crime rate is the same in low-status and non-low-status districts is rejected. The crime rate is indeed higher in low-status districts.

Finally we test whether the assumption of normality is justified for the variable 'MEDV'. The test result shows that it is not, because the null hypothesis is rejected at the 5% significance level: the median housing price is not normally distributed.

In this assignment we learned how to implement basic statistical ideas from descriptive and inferential statistics. After modeling distributions and calculating confidence intervals for the sample mean, we analyzed a pandas data set and prepared it for regression modeling. Further, we tested a handful of scientific hypotheses and rejected the null hypotheses in cases of small p-values (p < 5%) that indicate statistical significance for the alternative hypotheses. We have shown that Python and its packages are a suitable tool for stochastic and statistical programming.

3. Functions and Simulations

Let $X_1,\dots,X_n$ be i.i.d. $\sim\mathcal{N}(\mu,\sigma^2)$, where $\mu$ and $\sigma^2$ are unknown. A confidence interval at significance level $\alpha$ (i.e. with coverage $1-\alpha$) for the unknown variance can be constructed by

$$\left(\frac{n\hat{s}^2_n}{\chi^2_{n-1,1-\alpha/2}},\frac{n\hat{s}^2_n}{\chi^2_{n-1,\alpha/2}}\right)$$

with $\hat{s}^2_n=\frac{1}{n}\sum_{i=1}^n(X_i - \bar{X}_n)^2$, where $\chi^2_{k,\alpha}$ denotes the $\alpha$-quantile of the $\chi^2$-distribution with $k$ degrees of freedom. This confidence interval can be used to test the hypothesis that the population variance equals some given value $\sigma_0^2$. First, the objective is to calculate a confidence interval at the 5% significance level (a 95% confidence interval) and check whether the true variance is included. Therefore, we set $\mu=0$ and $\sigma^2=4$, and $n=100$ observations of $X\sim\mathcal{N}(\mu,\sigma^2)$ are simulated. This process is repeated 20 times.

As can be seen, our confidence interval of (3.196, 5.595) does indeed include the true variance. The 20 calculated confidence intervals are visualized in the graphic above as blue lines with their respective widths, while the red line marks the true variance. With a confidence level of 95%, i.e. a 5% error rate, the confidence interval should fail to include the true population variance about $\frac{1}{20} \cdot 100 = 5$% of the time. In the graphic this happens for exactly one interval, the 13th, which does not include the population variance.
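
A sketch of this simulation. The seed and plotting details are assumptions, so the exact intervals, including the reported (3.196, 5.595) and the observation that the 13th interval misses the true variance, will not be reproduced exactly.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)                 # assumed seed
mu, sigma2, n, alpha, reps = 0, 4, 100, 0.05, 20

lowers, uppers = [], []
for _ in range(reps):
    x = rng.normal(mu, np.sqrt(sigma2), size=n)
    s2 = np.mean((x - x.mean()) ** 2)                          # biased variance estimate s^2_n
    lowers.append(n * s2 / stats.chi2.ppf(1 - alpha / 2, df=n - 1))
    uppers.append(n * s2 / stats.chi2.ppf(alpha / 2, df=n - 1))

for i, (lo, hi) in enumerate(zip(lowers, uppers)):
    plt.plot([lo, hi], [i, i], "b-")                           # one line per interval
plt.axvline(sigma2, color="red", label="true variance")
plt.legend(); plt.show()
```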

Next, we define a function $sigma\_test(X, \sigma_0^2, \alpha)$, which tests the hypothesis

$$H_0:\sigma^2=\sigma_0^2 \text{ against }H_1:\sigma^2\neq\sigma_0^2 $$

for some given level $\alpha$ and a given vector of observations $X$ (based on the confidence interval above). The function should return $1$ if the null hypothesis is rejected and $0$ otherwise.

The chi-squared test for the variance determines whether the variance of a variable in a particular sample equals a given population variance. When testing whether the population variance equals 4, for example, the test should return 0, because the null hypothesis cannot be rejected. When we repeat this test 1000 times in the next exercise, we expect the function to return 1 (reject) about 5% of the time, because we set alpha to 0.05. To check whether the test really has significance level $\alpha$, a simulation with $l=1000$ repetitions is run with $\mu=0, \sigma^2=\sigma_0^2=4, n=100$ and $\alpha=0.05$.
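
A sketch of the sigma_test function and the rejection-rate simulation, built directly on the confidence interval above; the seed is an assumption, so the empirical rate will only be close to, not exactly, the value reported below.

```python
import numpy as np
from scipy import stats

def sigma_test(X, sigma0_sq, alpha):
    """Return 1 if H0: sigma^2 = sigma0_sq is rejected at level alpha, else 0."""
    n = len(X)
    s2 = np.mean((X - X.mean()) ** 2)
    lower = n * s2 / stats.chi2.ppf(1 - alpha / 2, df=n - 1)
    upper = n * s2 / stats.chi2.ppf(alpha / 2, df=n - 1)
    return int(not (lower <= sigma0_sq <= upper))

rng = np.random.default_rng(0)                  # assumed seed
rejections = [sigma_test(rng.normal(0, 2, size=100), sigma0_sq=4, alpha=0.05)
              for _ in range(1000)]
print(np.mean(rejections))                      # empirical rejection rate, close to 0.05
```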

Indeed, testing whether the population variance equals the given value of 4 yields an empirical rejection rate of 0.051, close to the nominal level of alpha = 0.05. Removing the random seed shows that the rejection rate hovers around 5%. Increasing the number of repetitions from 1,000 to 10,000 reduces this variation and brings the rate closer to 5.0%. To assess the power of the test, the previous procedure is repeated, but now with $\sigma_0^2=2$ and $\sigma^2=4$.

The results show that if the alternative hypothesis is true, there is a 99.9% chance that we correctly reject the null hypothesis. This is the power of the test for this particular value of the variance: the probability of rejecting the null hypothesis when the alternative is true, $$\text{Power} = P(\text{reject } H_0 \mid H_1 \text{ true}).$$

In the above case of the tested variance, the power is 99.9%, which indicates a very strong test. The hypothesis test used above relies on the $\chi^2$-distribution of the following test statistic: $$T(X):=\frac{n\hat{s}^2_n}{\sigma^2}\sim\chi^2(n-1).$$ The objective now is to plot a histogram of the test statistic and compare it to the density of a suitable $\chi^2$-distribution. Furthermore, the quantiles $\chi^2_{n-1,1-\alpha/2}$ and $\chi^2_{n-1,\alpha/2}$, which specify the critical region, are marked by vertical lines. It is then explained why we reject the null hypothesis $H_0$ if the test statistic $T(X)$ lies to the left or to the right of these two quantiles.

The calculated value of the test statistic is compared with the critical region of the chi-squared distribution, determined by the degrees of freedom and the significance level alpha. Because we repeatedly calculated the test statistic on normally distributed random samples, the empirical distribution of the test statistic effectively approximates the chi-squared distribution. In the graphic above, the test statistic is shown as the light blue histogram and is comparable to the density of a $\chi^2$-distribution on the same interval $x$. If the test statistic for the variance of a specific sample falls to the left or to the right of the quantiles (into the critical region), we reject the null hypothesis that the sample has the stated population variance. With alpha set to 0.05, this happens only 5% of the time when the null hypothesis is true, so the test keeps its nominal 5% error rate.
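
A sketch of how the histogram and the chi-squared density with the critical quantiles could be plotted; seed and layout are assumptions.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)                  # assumed seed
n, sigma2, alpha, reps = 100, 4, 0.05, 1000

# simulate the test statistic T(X) = n * s^2_n / sigma^2 for many normal samples
T = np.array([n * np.mean((x - x.mean()) ** 2) / sigma2
              for x in rng.normal(0, np.sqrt(sigma2), size=(reps, n))])

grid = np.linspace(T.min(), T.max(), 300)
plt.hist(T, bins=40, density=True, alpha=0.5, label="test statistic $T(X)$")
plt.plot(grid, stats.chi2.pdf(grid, df=n - 1), "k-", label="$\\chi^2(n-1)$ density")
plt.axvline(stats.chi2.ppf(alpha / 2, df=n - 1), color="red", linestyle="--")
plt.axvline(stats.chi2.ppf(1 - alpha / 2, df=n - 1), color="red", linestyle="--")
plt.legend(); plt.show()
```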

4. US Job Corps program

In the last exercise, we consider a welfare policy experiment with a binary treatment $D$ (assignment), which was conducted in the mid 1990s to assess the publicly funded US Job Corps program. The program targets young individuals (aged 16–24 years) who legally reside in the USA and come from a low-income household. It provides participants with approximately 1200 hours of vocational training and education, housing and board over an average duration of 8 months. Participants also receive health education as well as health and dental care.

Data Preparation

To start the data preparation, a data frame $X$ is created that includes the covariates listed below. This data frame serves as the set of regressors to determine the impact on a specific outcome variable and is also used for classification.

'assignment', 'female', 'age', 'white', 'black', 'hispanic', 'educ', 'educmis', 'geddegree', 'hsdegree', 'english', 'cohabmarried', 'haschild', 'everwkd', 'mwearn', 'hhsize', 'hhsizemis', 'educmum', 'educmummis', 'educdad', 'educdadmis', 'welfarechild', 'welfarechildmis', 'health', 'healthmis', 'smoke', 'smokemis', 'alcohol', 'alcoholmis'.

As for the dependent variable, first a binary outcome Y_class is created, which is $1$ if the general health 30 months after assignment ($health30$) is stated to be excellent ($health30=1$) and $0$ otherwise. Calculating the share based on $Y_{class}$ shows that 30 months after assignment, 38.27 percent of the people report excellent health, while 61.73 percent do not.
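
A sketch of how Y_class and its share could be computed with pandas; the file name is a placeholder, while 'health30' is the variable named in the text.

```python
import pandas as pd

df = pd.read_csv("jobcorps.csv")               # hypothetical file name for the Job Corps data
Y_class = (df["health30"] == 1).astype(int)    # 1 = excellent general health 30 months after assignment
print(Y_class.mean())                          # share with excellent health (reported as 0.3827)
```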

Another variable $Y_{pred}$ is created, which is equal to the weekly earnings in the third year after assignment ($earny3$). $Y_{pred}$ will serve as a dependent variable in different regression approaches, because it is continuous. To determine whether the participation in the job program has an effect on the dependent variables (on the participants health as well as on their weekly earnings), the sample means of the dependent variables $Y_{class}$ and $Y_{pred}$ are calculated conditioned on assignment.

The sample mean of $Y_{class}$ conditioned on the group of participants who were assigned to the program (assignment = 1) is only slightly higher than the mean conditioned on participants who were not assigned (assignment = 0) (0.38 vs 0.37). Under this naive approach, there appears to be no effect of the job program on health, because the means are very close to each other. To actually test this, a hypothesis test is needed, because it accounts for the variation in the data, from which a standard error can be built. For $Y_{pred}$ the sample means conditioned on assignment show a larger difference between the groups. Participants who were assigned (assignment = 1) have average weekly earnings of 177.48 USD, while those who were not assigned (assignment = 0) have average weekly earnings of 166.01 USD. This indicates that the assignment had an impact on earnings, as average weekly earnings are higher for people who were assigned to the Job Corps program. But is this effect significant? To answer this question, a hypothesis test comparing the group means can be used.

Testing the hypothesis for the mean comparison of $Y_{class}$ shows that the health status of people who were assigned to the job program is not significantly different from that of people who were not (even at a 'minimum' significance level of 0.1). Thus, we cannot detect a significant effect of the assignment to the job program on health status after 30 months. In contrast, the hypothesis test for $Y_{pred}$ shows that the average weekly earnings of people who were assigned differ significantly from those of people who were not assigned (at significance level 0.01). Therefore, with 99% confidence, we conclude that assignment does have a significant effect on weekly earnings after three years. Although statistically significant, this effect is of limited economic relevance, because a difference of about 11.5 dollars per week is small.

Regression

In the next step, regression analysis with $Y_{pred}$ as dependent variable is applied to determine the effect of the covariates on weekly earnings after three years. Different regression approaches are applied and their performance is evaluated and compared. To prepare the data, a train-test split is applied: $(Y_{pred},X)$ is randomly split into train (80%) and test (20%) samples. The importance of sample splitting is discussed.
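
A sketch of the train-test split with scikit-learn; the file name and the random_state are assumptions, while the covariate list is the one given above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("jobcorps.csv")                # hypothetical file name
covariates = ['assignment', 'female', 'age', 'white', 'black', 'hispanic', 'educ',
              'educmis', 'geddegree', 'hsdegree', 'english', 'cohabmarried', 'haschild',
              'everwkd', 'mwearn', 'hhsize', 'hhsizemis', 'educmum', 'educmummis',
              'educdad', 'educdadmis', 'welfarechild', 'welfarechildmis', 'health',
              'healthmis', 'smoke', 'smokemis', 'alcohol', 'alcoholmis']
X, y_pred = df[covariates], df['earny3']

X_train, X_test, y_train, y_test = train_test_split(
    X, y_pred, test_size=0.2, random_state=42)  # 80% train, 20% test; assumed seed
```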

Sample splitting is important for the later analysis and evaluation of the models' performance. It allows us to train a model on one part of the data (the train sample) and then test it on an unseen part, the test sample. Since we still have the full information of the test sample, i.e. the true values of the outcome variable alongside the regressors, we can evaluate the model by comparing these true values with the model's predictions. If we trained and tested the model on the same data, the results could look too good, leading us to believe that the model performs very well or at least sufficiently. If a brand new sample were then acquired and predicted, the model might not show the same good performance. Therefore, we split the sample in order to detect whether the model is learning the data perfectly including the associated noise, called overfitting, or whether it actually recognizes the underlying patterns and can generalize to an unseen sample from the same distribution.

To evaluate the performance of our models, a function is implemented that outputs the root mean squared error (RMSE) for a given pair of vectors $y$ and $\hat{y}$, based on the equation

$$\text{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y_i})^2}.$$
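
A direct sketch of such a function, assuming the hypothetical name rmse:

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean squared error between observed y and predicted y_hat."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.sqrt(np.mean((y - y_hat) ** 2))
```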

In the first step the train and test root mean squared errors of the linear model $$Y_{pred} = X\beta + \epsilon$$ are calculated via OLS regression. The RMSE results are compared to the standard deviation of the outcome variable $Y_{pred}$.

First, we see that the RMSE of the train set is about the same as the RMSE of the test set, which indicates that our model is not overfitting. That the RMSE of the train set is slightly higher than that of the test set can be attributed to the random composition of the test and train samples. Second, we notice that the RMSE of the test set is lower than the standard deviation of the outcome variable y_pred. This is good, because it means the regression provides a better estimate than a naive estimator, i.e. using the mean of the outcome variable as the prediction. Still, the RMSE is close to the standard deviation of the outcome variable, which is partly explained by the mathematical similarity between calculating the RMSE and the standard deviation. Finally, we notice that the standard deviation of the predicted values is much lower than that of the outcome variable. Since the RMSE is large, this small spread in y_hat is not a good sign: as we saw, the regression estimate is hardly better than the naive mean estimator. We should therefore also expect a low R-squared value, which relates the explained spread of the estimate to the total spread in y.

The results of our linear regression are now compared to other regression methods, including lasso, ridge and elastic net regression. After comparing the performance of the respective models based on the RMSE, approaches to improve the prediction quality are discussed.

Ridge, lasso and elastic net are regression models that penalize the beta coefficients in the L1 norm (lasso), the L2 norm (ridge) or both (elastic net). Lasso and elastic net can therefore set beta coefficients to exactly zero when they have no meaningful explanatory impact, while ridge can only shrink coefficients towards zero. Because these models penalize beta coefficients that move too far from zero, their regularization parameters can be adjusted, and such adjustments influence model performance; tuning these parameters can therefore lead to better prediction performance. For further reductions in RMSE, exploratory data analysis (EDA) would be needed to determine which variable types and scales are included in the data set. For example, prediction quality could be improved by scaling variables such as age to the range 0-1, possibly separating age into bins, and taking the log of the earnings variables, which tend to be much larger than the other variables in the data set. Binning or grouping could help to capture heterogeneous groups, and removing outliers could be useful as well.
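
A sketch comparing the four regression approaches by test RMSE; the penalty strengths (alpha, l1_ratio) are illustrative, untuned values, and X_train, X_test, y_train, y_test are assumed to come from the split sketched above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet

def rmse(y, y_hat):
    return np.sqrt(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2))

models = {"ols": LinearRegression(),
          "lasso": Lasso(alpha=0.1),                     # illustrative, untuned penalties
          "ridge": Ridge(alpha=1.0),
          "elastic net": ElasticNet(alpha=0.1, l1_ratio=0.5)}

# X_train, X_test, y_train, y_test: assumed from the train-test split above
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, rmse(y_test, model.predict(X_test)))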

Finally, we are interested in the treatment effect of assignment to the Job Corps program. Therefore, the OLS regression is rerun on the whole sample without sample splitting to determine the effect of assignment on earnings and its significance. Looking at the results of the statsmodels API below, we see that, holding all other variables constant, assignment to the program increases weekly earnings after three years by 14 dollars and 22 cents. This is a significantly positive effect, because the p-value is close to zero. Considering the confidence interval, the positive effect on weekly earnings can vary between 8 and 20 dollars. The adjusted R-squared of the model is, however, only 12%, which means that only 12% of the total variation in the outcome is explained by the variables. This could be due to poor fit, because no data pre-processing was conducted, or because earnings are determined by factors not observed in this data set; data pre-processing often contributes substantially to improving model fit. Nevertheless, drawing conclusions about the treatment effect of program assignment on earnings is difficult based on this 'naive' linear regression. For a valid estimation of the treatment effect, one would need to control for all relevant covariates, so that no other influences on the dependent variable remain, preferably selecting them with a double-selection approach. We therefore observe a positive association between assignment to the Job Corps program and weekly earnings after three years, but causality cannot be interpreted further.
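
A sketch of the full-sample OLS with the statsmodels API; X and y_pred are the objects assumed from the data preparation above.

```python
import statsmodels.api as sm

# X (covariates) and y_pred (earny3) as prepared above; full sample, no split
X_const = sm.add_constant(X)                  # add an intercept
ols = sm.OLS(y_pred, X_const).fit()
print(ols.summary())                          # coefficient on 'assignment', its p-value and CI
```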

Classification

In the next part, different approaches are applied to classify the data. The variable $Y_{class}$ is the binary outcome variable for excellent health status. The data was randomly split into a train (80%) and a test (20%) sample.

Logistic Regression:

First, logistic regression was used to classify the data, and its performance was evaluated based on the accuracy on the test set. To improve performance, different tuning parameters are applied and interpreted. In general, logistic regression models the probability of a certain class. The penalty, the solver and the regularization strength are the most important tuning parameters for this classifier: the solver is the optimization algorithm, the penalty specifies the norm used in the penalization, and C is the inverse of the regularization strength. An l1 penalty, for example, can be used for feature selection, because it shrinks the coefficients of unimportant features to 0. The performance of both the l1 and l2 penalties is presented. For hyperparameter tuning we varied the regularization strength (C) and the penalty, using loops to test different combinations of hyperparameters. Overall, the best result was achieved when combining a C value of 0.1 with the l1 penalty. Using the solver 'newton-cg' could further improve the results. An overview of the results is provided at the end.
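
A sketch of such a hyperparameter loop; the C grid and the 'liblinear' solver (which supports both l1 and l2 penalties) are assumptions, and X_train, X_test, y_train, y_test refer to the classification split with Y_class described above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

results = {}
for penalty in ["l1", "l2"]:
    for C in [0.01, 0.1, 1, 10]:
        clf = LogisticRegression(penalty=penalty, C=C, solver="liblinear", max_iter=1000)
        clf.fit(X_train, y_train)
        results[(penalty, C)] = accuracy_score(y_test, clf.predict(X_test))

for params, acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(params, round(acc, 3))              # test accuracies, best combination first
```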

Support Vector Machines

Next, support vector machines were used with different kernels and other tuning parameters to minimize the empirical error rate. Support vector machines are based on the idea of finding a hyperplane that best divides a dataset into two classes. The SVM algorithms use a set of mathematical functions defined as the kernel; the kernel takes data as input and transforms it into the required form. There are different kernel functions: linear, nonlinear, polynomial, radial basis function (RBF), and sigmoid. The polynomial kernel represents the similarity of vectors via polynomials of the original variables, allowing the learning of non-linear models. The radial basis function is a function whose value depends on the Euclidean distance between the input and some fixed point. A sigmoid function has a domain of all real numbers and a range usually between 0 and 1 or -1 and 1. Here, the radial basis function, polynomial and sigmoid kernels were used, and we varied the degree of the polynomial kernel and the C values (where C is the inverse of the regularization strength). For the sigmoid kernel the results did not differ across C values.
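
A sketch of the kernel and C comparison with scikit-learn's SVC; the parameter grid is illustrative, the split variables are assumed from above, and feature scaling, which usually helps SVMs, is left out for brevity.

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# X_train, X_test, y_train, y_test as in the classification split above (assumed)
for kernel in ["rbf", "poly", "sigmoid"]:
    for C in [0.1, 1, 10]:
        clf = SVC(kernel=kernel, C=C, degree=3)   # degree only affects the polynomial kernel
        clf.fit(X_train, y_train)
        print(kernel, C, round(accuracy_score(y_test, clf.predict(X_test)), 3))
```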

$k$-nearest neighbor

Another classification approach that was used is $k$-nearest neighbors. Weights and the number of neighbors were varied to minimize the empirical error rate. First, the number of neighbors was varied with the weight set to 'uniform'. After examining the results, the number of neighbors that led to the best accuracy was also used with different weights ('uniform', 'distance') to examine the difference in performance between the weighting schemes at the same number of neighbors. The difference between the two weighting functions is that the first assigns the same weight to each neighbor, while the second weights neighbors closer to the query point more heavily. When varying the neighbors, the accuracy score is best for 100 neighbors. With a growing number of neighbors the predictions should become more stable up to a certain point, as the algorithm can 'choose' between more options, but pushing it too far (here neighbors = 1000, 2000) no longer improves the accuracy score. It is therefore important to adjust the number of neighbors to the problem at hand and not to increase complexity unnecessarily by setting the number of neighbors too high while performance stagnates. Also, an insufficient number of neighbors bears the danger of overfitting the model. The respective train accuracy should therefore also be calculated, to inspect the difference between performance on the test and train samples. For 100 neighbors, we could not find strong differences between the train and the test sample.
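
A sketch of the neighbor and weight variation; the grid of neighbor counts is illustrative, and the split variables are assumed from above.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# X_train, X_test, y_train, y_test as in the classification split above (assumed)
for n_neighbors in [5, 20, 50, 100, 500, 1000]:
    for weights in ["uniform", "distance"]:
        clf = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights)
        clf.fit(X_train, y_train)
        print(n_neighbors, weights, round(accuracy_score(y_test, clf.predict(X_test)), 3))
```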

Random Forest

Random forest is now applied in order to improve the results. Feature importances were also ascertained. The tuning parameters and how they affect the performance are explained. In general, random forest contains a large number of small decision trees, called estimators, with every tree making its own predictions. The random forest model combines the predictions of the estimators to produce a more accurate prediction.

The most important features for classifying the data are printed and also shown in the feature importances plot above. Health (at assignment) is the most important variable for predicting excellent health status 30 months after assignment, which is an obvious result. Other, less important variables listed above include: female, household size, age, mother's years of education at assignment, and average weekly gross earnings at assignment.

The tuning parameters of random forests are max_depth and n_estimators. The depth of the trees inside the ensemble determines how well the trees learn the data. If the depth is too high, this may lead to overfitting. The number of estimators is the amount of trees over which is averaged. Usually the more the better, because this leads to a more robust model, due to the variety of possible random splits. We kept the number of n_estimators constant while varying the number of max_depth, because changes in n_estimators did not further improve the accuracy score.
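
A sketch of the max_depth variation and the feature importances; n_estimators = 500 and the depth grid are illustrative choices, and the split variables are assumed from above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# X_train, X_test, y_train, y_test as in the classification split above (assumed)
for max_depth in [3, 5, 10, None]:
    clf = RandomForestClassifier(n_estimators=500, max_depth=max_depth, random_state=0)
    clf.fit(X_train, y_train)
    print(max_depth, round(accuracy_score(y_test, clf.predict(X_test)), 3))

# feature importances of the last fitted forest, largest first
importances = pd.Series(clf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
print(importances.head(10))
```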

Finally, the results are summarized and presented. The results table shows the different models and their test-set accuracies. Overall, there are no strong differences between the performances of the models. Within each model family, performance could be improved by adjusting the hyperparameters, but no outstanding differences were achieved. The best model appears to be the logistic regression, which achieved an accuracy of 0.648. When comparing the results within the logistic regression models, a small improvement could be achieved by changing the solver to 'newton-cg'.
We consider the performance of this model a sufficient estimator for the out-of-sample performance. It is better than chance (0.5). A complete accuracy of 1.00 is impossible, given the variety of factors that may influence health. Further, none of the models are overfitted, because they show similar results to one another and if the train accuracies were to be shown, they wouldn't be very much different. This is an important consideration when tuning, because one can for example tune a random forest to predict the train-set to 99%, but the test-set only by 60%.