First, a continuous random variable $X$ with the density function
$$f_X(x)=\begin{cases}\frac{1}{4}\quad \text{if}\ 0\le x<2\\ \frac{1}{4}x\quad \text{if}\ 2\le x\le \sqrt{8}\\ 0\quad\text{otherwise}\end{cases}$$is considered. To calculate the cumulative distribution function, the probability density function is integrated. For $x$ in $[0, 2]$ we have $F(x) = \int_0^x \frac{1}{4}\, dt = \frac{1}{4}x$, and for $x$ in $[2, \sqrt{8}]$ we have $F(x) = F(2) + \int_2^x \frac{t}{4}\, dt = \frac{1}{2} + \frac{x^2-4}{8} = \frac{x^2}{8}$. Since $f(x)$ is 0 outside of $[0, \sqrt{8}]$, we know $F(a) = P(X \le a) = 0$ for $a < 0$ and $F(a) = 1$ for $a > \sqrt{8}$. Therefore the cumulative distribution function can be written as
$$F_X(x)=\begin{cases}0\quad \text{if}\ x<0\\ \frac{1}{4}x\quad \text{if}\ 0\le x<2\\ \frac{x^2}{8}\quad \text{if}\ 2\le x\le \sqrt{8}\\ 1\quad\text{if}\ x >\sqrt{8}\end{cases}.$$The PDF and CDF are plotted below. First, an interval on which the functions are evaluated is set; it equals the interval on which the density is positive, but a larger interval could also be used. The cumulative distribution function (CDF) is then approximated from the PDF by normalizing the sampled density values (dividing each value by the sum of all sampled values) and taking the cumulative sum; alternatively, the closed-form CDF derived by the integration above can be used. In the plots, the PDF follows the piecewise definition above and is 0 outside of $[0, \sqrt{8}]$, while the CDF shows the cumulative probability, which lies between 0 and 1.
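As a quick numerical cross-check of this derivation (a sketch using scipy.integrate.quad; not part of the original notebook), the piecewise CDF can be compared with a direct integration of the density:
import numpy as np
from scipy import integrate

def f(t):
    #density as defined above
    if 0 <= t < 2:
        return 0.25
    if 2 <= t <= np.sqrt(8):
        return 0.25 * t
    return 0.0

for point in [1.0, 2.0, 2.5, np.sqrt(8)]:
    numeric, _ = integrate.quad(f, 0, point)             #integral of the density up to `point`
    analytic = point / 4 if point < 2 else point**2 / 8  #piecewise CDF derived above
    print(round(point, 3), round(numeric, 4), round(analytic, 4))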
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.distributions.empirical_distribution import ECDF
#b) define the density and the cumulative distribution function derived above
def pdf(x):
    if 0 <= x < 2:
        return 1/4
    elif 2 <= x <= np.sqrt(8):
        return 1/4*x
    else:
        return 0
def cdf(x):
    if x < 0:
        return 0
    elif 0 <= x < 2:
        return 1/4*x
    elif 2 <= x <= np.sqrt(8):
        return x**2/8
    else:
        return 1
x = np.linspace(0, np.sqrt(8), 1000)
fx = list(map(pdf, x))
#the CDF can be approximated from the PDF: normalize the sampled densities and take the cumulative sum
fx_nor = np.array(fx)/np.array(fx).sum()
Fx = np.cumsum(fx_nor)
#or use the defined function cdf(x)
#Fx = list(map(cdf, x))
#plot both
plt.subplot(1, 2, 1)
plt.plot(x,fx,"g-",label="pdf")
plt.xlabel("x")
plt.ylabel("probability density")
plt.legend()
plt.title("PDF")
plt.subplot(1, 2, 2)
plt.plot(x,fx,"bo",label="density")
plt.plot(x,Fx,"r-",label="cdf")
plt.xlabel("x")
plt.ylabel("probability")
plt.legend()
plt.title("CDF")
plt.tight_layout()
plt.show()
Next the inverse $F^{-1}_X:(0,1)\mapsto\mathbb{R}$ of the CDF is derived. On the interval $(0,1)$, $F^{-1}_X$ is found by inverting each case of the CDF: switching $x$ and $y$ and solving for $y$ yields
$$F^{-1}_X(x)=\begin{cases} 4x\quad \text{if}\ 0\le x<\frac{1}{2}\\ \sqrt{8x} \quad \text{if}\ \frac{1}{2} \le x\le 1\end{cases}.$$The inverse CDF is then defined according to the formula above and plotted. It takes a probability between 0 and 1 as input and returns the corresponding quantile of $X$ in $[0,\sqrt{8}]$.
#c) define the inverse CDF
def inv_cdf(x):
    if 0 <= x < 1/2:
        return 4*x
    elif 1/2 <= x <= 1:
        return np.sqrt(8*x)
    return 0.0
x_inv = np.linspace(0, 1, 1000)
Fx_inv = list(map(inv_cdf,x_inv))
plt.subplot(1, 2, 2)
plt.plot(x_inv, Fx_inv,"b",label="Inverse CDF")
#plt.plot(x,Fx,"r-",label="cdf")
plt.xlabel("x")
plt.ylabel("probability")
plt.legend()
plt.title("Inverse CDF")
plt.tight_layout()
plt.show()
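As a quick sanity check (a sketch that assumes the cdf and inv_cdf functions defined above), one can verify numerically that the inverse really undoes the CDF, i.e. that $F_X(F^{-1}_X(u)) = u$ on $(0,1)$:
#round-trip check of the inversion: F(F^{-1}(u)) should return u
u_grid = np.linspace(0.01, 0.99, 99)
roundtrip = np.array([cdf(inv_cdf(u)) for u in u_grid])
print("max |F(F^-1(u)) - u| =", np.abs(roundtrip - u_grid).max())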
Next the population mean of the random variable $X$ is calculated. This can be done analytically by integration, or numerically from the sampled PDF. The analytic way is to take the definite integral of $x$ times the density,
$$\mu_f := \int_{0}^{\sqrt{8}}x \cdot f_X(x) \ dx = \int_0^2 \frac{x}{4}\,dx + \int_2^{\sqrt{8}} \frac{x^2}{4}\,dx = \left[\frac{x^2}{8}\right]_0^2 + \left[\frac{x^3}{12}\right]_2^{\sqrt{8}} = \frac{1}{2} + \frac{4\sqrt{2}-2}{3} \approx \underline{\underline{1.719}}.$$Numerically, the sum of the products of the normalized probability densities and the corresponding $x$ values approximates the population mean,
$$\mu_f \approx \sum_i x_i \cdot f_{nor}(x_i) \approx 1.719.$$
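As a numerical cross-check (a sketch using scipy.integrate.quad on the pdf function defined above), the integral $\int_0^{\sqrt{8}} x f_X(x)\,dx$ can be evaluated directly and should reproduce the analytic value of roughly $1.719$:
from scipy import integrate
#numerical evaluation of the mean integral
mean_numeric, _ = integrate.quad(lambda t: t * pdf(t), 0, np.sqrt(8))
print("Population mean via numerical integration:", round(mean_numeric, 4))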
print("Population mean by taking integral:",1.5)
#calculating the integral of the pdf yields 1.5
#using population normalized pdf times interval all summed up
pop_mean=(fx_nor*x).sum()
print("Population mean per definiton of PDF:",pop_mean)
Population mean by taking the integral: 1.719 Population mean per definition of PDF: 1.7194336609982677
Next 100 random numbers with distribution $F_X$ are simulated. The samples are drawn using the inversion method and the inverse CDF function defined above. The population mean is estimated by the sample mean; the bias of this estimate is $0.073$. To decrease this bias the sample has to be enlarged: a random number is appended to the sample until the absolute bias is lower than $0.01$. The augmented sample then contains $783$ observations. A random seed is used to allow reproducibility.
A decreasing bias is expected, because the more random numbers from the population distribution are included, the richer the sample, and its empirical mean gets asymptotically close to the population mean. The length of the sample afterwards shows how many new random numbers were drawn before the bias dropped below $0.01$.
np.random.seed(1234)
sam = list(map(inv_cdf, np.random.random(100)))
len(sam)
#f
print("Sample mean:", np.mean(sam))
print("Bias:", np.absolute(np.mean(sam)-pop_mean))
#g
S=np.array(sam)
S = np.expand_dims(S, axis=1)
print("Length of S before:", len(S))
np.random.seed(1234)
while np.absolute(np.mean(S)-pop_mean) > 0.01:
    S = np.append(S, list(map(inv_cdf, np.random.random(1))))
else:
    print("Length of S after", len(S))
Sample mean: 1.7927925397287796 Bias: 0.07335887873051194 Length of S before: 100 Length of S after 783
The next graphic illustrates the convergence of the sample mean to the population mean. Because the same seeds were used as above, the augmented sample requires $683$ additional random numbers to reach a bias of under $0.01$. The original sample had $100$ random numbers in it, adding up to a total size $783$. As additional random numbers are added to the sample, one can observe in the graphic below the convergence of the sample mean (blue line) to the population mean (dotted red line).
#i
S=np.array(sam)
S = np.expand_dims(S, axis=1)
M = []
np.random.seed(1234)
while np.absolute(np.mean(S)-pop_mean) > 0.01:
    S = np.append(S, list(map(inv_cdf, np.random.random(1))))
    M = np.append(M, np.mean(S))
else:
    print("# Iterations:", len(M))
M = np.array(M)
fig=plt.figure()
plt.plot(M)
plt.axhline(y=pop_mean, color='r', linestyle='--', label="true value")
fig.suptitle("Convergence")
plt.ylabel("mean")
plt.xlabel("Loop Iterations")
plt.show()
# Iterations: 683
The empirical cumulative distribution function (ECDF) of sample S is plotted, evaluated on a grid of points x, and compared with the population distribution $F_X$. As a reminder, random numbers were added to sample S until the bias was below $0.01$, so the empirical CDF is expected to be similar to the population CDF. The graphic below shows that this is indeed the case, although the two do not match exactly, due to randomness and the finite sample size. If a much larger sample were taken, with more than the $783$ random numbers, say $10{,}000$, the ECDF would approximate the population CDF almost exactly. This can be seen in the additional graphic below.
#j)
x = np.linspace(0, np.sqrt(8), 1000)
def ecdf(D, x):
    y = list(map(ECDF(D), x))
    return y
plt.figure()
plt.plot(x, ecdf(S,x), label="ECDF sample")
plt.plot(np.sort(x),Fx,'r--', label="CDF pop.")
plt.xlabel("x")
plt.ylabel("cumulative probability")
plt.legend()
plt.title("ECDF compared to $CDF$")
# for a large sample
sam_large = list(map(inv_cdf, np.random.random(10000)))
plt.figure()
plt.plot(x, ecdf(sam_large,x), label="ECDF large sample")
plt.plot(np.sort(x),Fx,'r--', label="CDF pop.")
plt.xlabel("x")
plt.ylabel("cumulative probability")
plt.legend()
plt.title("ECDF compared to $CDF$ Large Sample")
In the following assignment we cover the topic of hypothesis testing in two settings. In the first part we model an urn experiment with a binomial distribution and calculate a confidence interval at a given significance level. In the second part we focus on a given housing price data set: first we analyse it using descriptive statistics, and then we test five scientific hypotheses on the data by applying methods of inferential statistics.
We consider an urn that contains three white and five black balls. A ball is drawn five times with replacement. The random variable $X$, defined as the number of white balls drawn, is of interest. To model the draws, a binomial distribution with two parameters is needed: the total number of draws $n=5$ and the probability of drawing a white ball $p=\frac{3}{8}$. The outcome, i.e. the number of white balls $k \in \{0,\dots,5\}$ in a sample of $n=5$ draws, can then be modeled. The balls in the urn are represented by the following vector:
urn = ['white','white','black','black','white','black','black','black']
print("Ball in the Urn:", urn)
Ball in the Urn: ['white', 'white', 'black', 'black', 'white', 'black', 'black', 'black']
Of interest is the distribution that describes the experiment; in particular we want to quantify the random variable $X$ that counts the number of white balls drawn out of 5 draws. The distribution that suits this experiment is the binomial distribution $B$: it describes a random variable $X$ that counts the number of successes $k$ in $n$ independent draws with success probability $p$. In our model $n=5$ and $p=0.375$, and the outcome $k$ lies in $\{0,1,2,3,4,5\}$. For these parameters the true mean of the binomial distribution is $\mu_{pop} = n \cdot p = 1.875$, i.e. the expected number of white balls drawn from the urn in such a sample. In the next part we simulate the urn drawing with replacement.
mean_pop=1.875
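As a cross-check (a sketch using scipy.stats.binom; the use of scipy here is an addition, not part of the original solution), the theoretical mean and the probabilities of each outcome $k$ can be obtained directly from the binomial distribution:
import numpy as np
from scipy.stats import binom
n, p = 5, 3/8
print("Theoretical mean n*p:", binom.mean(n, p))
print("P(X = k) for k = 0,...,5:", np.round(binom.pmf(np.arange(6), n, p), 4))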
We use the function random.choice from the numpy package to simulate the experiment, i.e. to draw five balls with replacement. One simulation of the experiment gives one random realization of $X$. By setting np.random.seed(1234) to an arbitrary integer, we fix the built-in pseudo-random number generation. The random.choice function takes four arguments: the previously defined urn vector, the size of our sample ($n=5$), a boolean argument for drawing with replacement and a probability vector for drawing each item from the urn. The next output is a sample of 5 urn drawings with replacement:
import numpy as np
np.random.seed(1234)
draw=np.random.choice(urn, size=5, replace=True, p=[1/8, 1/8, 1/8,1/8,1/8,1/8,1/8,1/8])
print("A single draw from the urn:", draw)
print("Count of white balls drawn:", "X =",list(draw).count("white"))
A single draw from the urn: ['white' 'white' 'black' 'black' 'black'] Count of white balls drawn: X = 2
In the next step we want to generate $N=100$ independent observations of $X$: $X_1,\dots,X_{100}$. For this we simply repeat the previously defined experiment 100 times and save each result in a vector $S$. With the random seed we fix the pseudo-random choices so that every run of this cell gives the same "random" results. A custom function named draw(N, urn) is defined; it wraps one replication of random.choice and returns a single realization of $X$, and it is then called $N$ times in a loop to generate the $N$ observations.
def draw(N, urn):
    #one replication of the experiment: draw 5 balls with replacement and count the white balls
    #(the argument N is kept for compatibility with the calls below but is not used)
    draw = np.random.choice(urn, size=5, replace=True, p=[1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8])
    X = list(draw).count("white")
    return X
With the custom function we can simulate a sample of size $100$ and calculate the empirical mean (setting np.random.seed(7)). The mean of the distribution is estimated by the sample mean of this simulation output: for $N=100$ we apply the draw function and collect the results of the 100 random samples of the urn experiment in the vector $S$. The random seed again makes sure we get the same draws when re-running the cell. By measuring the absolute deviation between the expected (theoretical) mean and the sample mean of $S$, we obtain the empirical bias of the simulated experiment. By the law of large numbers we expect this bias to get close to zero for large $N$.
N=100
S = np.empty(N)
np.random.seed(7)
for i in range(N):
    S[i] = draw(N, urn)
print("My Sample:", S)
print(np.mean(S))
bias=mean_pop-np.mean(S)
bias
My Sample: [1. 3. 1. 2. 3. 2. 1. 1. 0. 1. 1. 3. 2. 1. 1. 1. 1. 1. 1. 1. 2. 1. 3. 3. 3. 2. 2. 1. 3. 4. 3. 2. 1. 2. 2. 3. 3. 1. 2. 2. 2. 1. 2. 1. 3. 1. 2. 2. 1. 1. 3. 2. 2. 0. 2. 2. 3. 3. 4. 4. 2. 2. 0. 2. 2. 2. 3. 2. 1. 2. 2. 2. 1. 3. 1. 1. 0. 0. 1. 2. 0. 2. 1. 3. 0. 1. 1. 2. 4. 2. 1. 2. 0. 3. 2. 5. 1. 2. 3. 1.] 1.81
0.06499999999999995
We see that the bias of the sample mean relative to the true distribution mean of $1.875$ is small; the remaining deviation is due to randomness, and a sample of $100$ replications is large enough to deliver a decent approximation. A colleague shows another random sample and asks whether the underlying distribution has the same population mean as the random variable $X$. To answer that, we calculate the sample mean of the new sample and compare it with the true distribution mean.
new_sam=np.array([1, 3, 3, 1, 2, 2, 2, 3, 4, 1, 1, 1, 1, 2, 1, 4, 4, 1, 4, 3, 3, 3,
2, 1, 2, 3, 1, 4, 4, 2, 4, 1, 2, 2, 1, 0, 2, 2, 4, 3, 3, 2, 2, 1,
1, 2, 4, 3, 4, 1, 2, 3, 1, 2, 1, 5, 5, 3, 2, 1, 2, 2, 2, 2, 2, 1,
3, 0, 2, 2, 2, 1, 4, 1, 1, 0, 3, 3, 1, 1, 2, 2, 2, 4, 2, 3, 4, 4,
0, 2, 1, 2, 1, 2, 4, 3, 1, 3, 1, 3])
print("New Sample:", new_sam)
New Sample: [1 3 3 1 2 2 2 3 4 1 1 1 1 2 1 4 4 1 4 3 3 3 2 1 2 3 1 4 4 2 4 1 2 2 1 0 2 2 4 3 3 2 2 1 1 2 4 3 4 1 2 3 1 2 1 5 5 3 2 1 2 2 2 2 2 1 3 0 2 2 2 1 4 1 1 0 3 3 1 1 2 2 2 4 2 3 4 4 0 2 1 2 1 2 4 3 1 3 1 3]
print("Mean New Sample:", np.mean(new_sam), "\n", "Mean My Sample:", np.mean(S), "\n", "Mean of Population:", 5*3/8)
Mean New Sample: 2.19 Mean My Sample: 1.81 Mean of Population: 1.875
We see that the new sample has a noticeably higher sample mean. To determine whether this difference is statistically significant, a 90% asymptotic confidence interval for the population mean is constructed. No further information about the distribution of the data is necessary, since by the central limit theorem means of i.i.d. observations are approximately normally distributed for large $N$; for $N=100>30$ this approximation is reasonable, and we build the confidence interval for the mean of our known urn-experiment distribution. With 90% confidence, about 90% of all sample means of samples drawn from that distribution are expected to lie in this interval; if the sample mean of the new sample lies outside of it, we take this as evidence that the underlying distribution does not match our urn experiment. We build the confidence interval of the true distribution and check whether the new sample fits into it via $\mu \pm \frac{\sigma_n}{\sqrt{n}}\, z_{0.95} \approx (1.71,\ 2.04)$, where $z_{0.95}$ is the 0.95-quantile of the standard normal (so that 5% of the probability mass lies in each tail of the 90% interval).
import scipy.stats as stats
lower = 1.875 - np.sqrt(np.var(S)/len(S))*stats.norm.ppf(q = 0.95)
upper = 1.875 + np.sqrt(np.var(S)/len(S))*stats.norm.ppf(q = 0.95)
(round(lower, 3), round(upper, 3))
(1.706, 2.044)
The theoretical 90% confidence interval for the mean of the urn experiment ($\mu = 1.875$) does not contain the sample mean of the new sample. Therefore, we reject the assumption that this sample was generated by our urn experiment. Confidence intervals are one tool of inferential statistics that can be used to test scientific hypotheses; in the next section we introduce further inference methods to test more hypotheses.
In this part we analyse the given Boston housing data. We prepare the data to estimate the prices of houses in Boston (response $y$) from all the regressors ($X$) available in the data set. We do this using the pandas DataFrame package and prepare the $y$- and $X$-variables. An excerpt of the data frame is shown below.
from sklearn import datasets
import pandas as pd
import numpy as np
boston = datasets.load_boston()
regressors = pd.DataFrame(boston.data, columns=boston.feature_names)
outcome = pd.DataFrame(boston.target, columns=["MEDV"])
regressors.head()
 | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 |
1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 |
2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 |
As descriptive statistics for the Boston data set we consider a couple of standard methods to get familiar with the data. First, we inspect the dimensions of the data set, i.e. the number of rows and columns. Second, we calculate the sample mean and standard deviation of each variable. Further, we are interested in the histogram of the median value of owner-occupied homes (MEDV), and lastly we inspect whether old houses have different prices than newer houses; for this last statistic we consider 50 years to be the threshold that separates old and new.
print("1.Observations, Regressors:")
print(regressors.shape)
print("2.Means of Regressors:")
print(regressors.mean())
print("3. Standard Deviations of Regressors")
print(regressors.std())
print(" ")
print("The mean price of houses in the boston data set is: ", round(outcome.mean()[0],2))
print("The standard deviation of the boston housing data sets' prices is: ", round(outcome.std()[0],2))
1.Observations, Regressors: (506, 13) 2.Means of Regressors: CRIM 3.613524 ZN 11.363636 INDUS 11.136779 CHAS 0.069170 NOX 0.554695 RM 6.284634 AGE 68.574901 DIS 3.795043 RAD 9.549407 TAX 408.237154 PTRATIO 18.455534 B 356.674032 LSTAT 12.653063 dtype: float64 3. Standard Deviations of Regressors CRIM 8.601545 ZN 23.322453 INDUS 6.860353 CHAS 0.253994 NOX 0.115878 RM 0.702617 AGE 28.148861 DIS 2.105710 RAD 8.707259 TAX 168.537116 PTRATIO 2.164946 B 91.294864 LSTAT 7.141062 dtype: float64 The mean price of houses in the boston data set is: 22.53 The standard deviation of the boston housing data sets' prices is: 9.2
The data set consists of 506 observations and 13 regressors. The next chart displays the histogram of MEDV, and one can see that the distribution does not look normal.
import matplotlib.pyplot as plt
plt.hist(outcome, bins = 10)
plt.title('3.: Density of MEDV')
plt.ylabel('Frequency')
plt.xlabel('median value of owner-occupied homes (MEDV)')
plt.show()
outcome = pd.DataFrame(boston.target, columns=["MEDV"])
print("4.:")
print("Average Value of House over 50 years of age:", round(outcome.loc[regressors.AGE >= 50].MEDV.mean(),2))
print("Average Value of House under 50 years of age:", round(outcome.loc[regressors.AGE < 50].MEDV.mean(),2))
4.: Average Value of House over 50 years of age: 20.83 Average Value of House under 50 years of age: 26.69
These results make sense, because older houses require more investment in terms of renovation costs and are therefore, on average, cheaper.
In the last part we test five scientific hypotheses. Since Hypotheses 1, 2 and 4 can be formulated as tests about means (one one-sample test and two comparisons of group means), we apply t-tests based on Student's t-distribution; the t-distribution is used because studentized sample means (and therefore also their differences) can be shown to be t-distributed. For Hypothesis 3 we apply a Kolmogorov-Smirnov test to check whether both samples were drawn from the same distribution. Hypothesis 5 is tested with a Shapiro-Wilk test to determine whether the sample is normally distributed or not.
First we test whether the mean of the median housing prices equals 20, using a one-sample t-test. Because the p-value is close to 0, the null hypothesis that the mean is 20 is rejected.
import scipy.stats as stats
print(outcome.mean())
print(stats.ttest_1samp(outcome, 20))
MEDV 22.532806 dtype: float64 Ttest_1sampResult(statistic=array([6.19478358]), pvalue=array([1.21149443e-09]))
Second we test whether houses close to the Charles River are, on average, as expensive as houses not situated at the river. For this we apply a standard t-test and a Welch test, which allows for different variances in the two samples. Because in both cases the p-value is close to 0, the null hypothesis that the two groups of houses have identical average prices is rejected. Houses close to the Charles River are therefore, on average, more expensive than houses not situated at the river.
chas_hous=outcome.loc[regressors.CHAS == 1].MEDV
non_chas_hous=outcome.loc[regressors.CHAS == 0].MEDV
print("House located on Charles River:", chas_hous.mean())
print("House not located on Charles River:", np.round(non_chas_hous.mean()),2)
print(stats.ttest_ind(chas_hous, non_chas_hous, equal_var= True))
print(stats.ttest_ind(chas_hous, non_chas_hous, equal_var= False))
House located on Charles River: 28.44 House not located on Charles River: 22.09 Ttest_indResult(statistic=3.996437466090509, pvalue=7.390623170519905e-05) Ttest_indResult(statistic=3.113291312794837, pvalue=0.003567170098137517)
Third we test whether the distribution of house prices is affected by proximity to the Charles River. The Kolmogorov-Smirnov test checks the null hypothesis that the two samples come from the same distribution. The p-value of the test is about 0.01, so at the 1% significance level the null hypothesis of equal distributions cannot be rejected, while at the 5% and 10% levels it is rejected, i.e. the distributions differ.
stats.ks_2samp(chas_hous, non_chas_hous)
KstestResult(statistic=0.27788898999090084, pvalue=0.010198373575669883)
Fourth, it is tested whether the per capita crime rate by town ('CRIM') is, on average, higher in districts where the share of lower-status population ('LSTAT') is above 10%. A district is labeled "lower status" if more than 10% of its population is lower status, and "non-lower status" if at most 10% is. The one-sided t-test has a p-value close to 0, therefore the null hypothesis that the average per capita crime rate is the same in both groups is rejected: the crime rate is indeed higher in lower-status districts.
lower_stat=regressors.loc[regressors.LSTAT > 10].CRIM
non_lower_stat=regressors.loc[regressors.LSTAT <= 10].CRIM
print("Per capita crime rate by town when share of low status is above 10%:", lower_stat.mean())
print("Per capita crime rate by town when share of low status is equal or below 10%:", non_lower_stat.mean())
print(stats.ttest_ind(lower_stat, non_lower_stat, equal_var= True, alternative='greater'))
Per capita crime rate by town when share of low status is above 10%: 6.010182822299653 Per capita crime rate by town when share of low status is equal or below 10%: 0.47269611872146106 Ttest_indResult(statistic=7.563826404064134, pvalue=9.346525153507563e-14)
Finally we test whether the assumption of a normal distribution is justified for the variable 'MEDV'. The Shapiro-Wilk test rejects the null hypothesis of normality at the 5% significance level, so the assumption is not justified: the median housing price is not normally distributed.
print(stats.shapiro(outcome))
ShapiroResult(statistic=0.91717529296875, pvalue=4.940618243974614e-16)
In this assignment we learned how to implement basic statistical ideas from the fields of descriptive and inferential statistics. Besides modeling distributions and calculating confidence intervals for the sample mean, we analyzed a pandas data set and prepared it for regression modeling. Further, we tested a handful of scientific hypotheses and were able to reject the null hypotheses in the cases of small p-values (p<5%) that indicate statistical significance in favour of the alternative hypotheses. We have shown that Python and its packages are a feasible tool for this kind of stochastic and statistical analysis.
Let $X_1,\dots,X_n$ be i.i.d. $\sim\mathcal{N}(\mu,\sigma^2)$, where $\mu$ and $\sigma^2$ are unknown. A confidence interval at level $1-\alpha$ for the unknown variance can be constructed as
$$\left(\frac{n\hat{s}^2_n}{\chi^2_{n-1,1-\alpha/2}},\frac{n\hat{s}^2_n}{\chi^2_{n-1,\alpha/2}}\right)$$with $\hat{s}^2_n=\frac{1}{n}\sum_{i=1}^n(X_i - \bar{X}_n)^2$, where $\chi^2_{k,\alpha}$ denotes the $\alpha$-quantile of the $\chi^2$-distribution with $k$ degrees of freedom. This confidence interval can be used to test the hypothesis that the population variance equals some given value $\sigma_0^2$. First, the objective is to calculate a 95% confidence interval ($\alpha=0.05$) and check whether the true variance is included. We set $\mu=0$ and $\sigma^2=4$, simulate $n=100$ observations of $X\sim\mathcal{N}(\mu,\sigma^2)$, and repeat this process 20 times.
import scipy.stats as stats
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2
import math
np.random.seed(0)
sample = np.random.normal(0, np.sqrt(4), 100)
#true_var = np.var(sample)
#unbiased sample variance estimate from the first sample (the true variance is sigma^2 = 4)
true_var = np.mean((sample-np.mean(sample))**2) * len(sample)/(len(sample)-1)
print("Variance:", true_var)
print("Standard Deviation:", np.sqrt(true_var))
print('Confidence Interval :',(np.round((100*true_var)/stats.chi2.ppf(1-0.05 / 2, 100-1),3),np.round((100*true_var)/stats.chi2.ppf(0.05 / 2, 100-1) ,3)))
intervals = []
vars = []
np.random.seed(0)
for i in range(20):
    sample = np.random.normal(0, np.sqrt(4), 100)
    interval = ((100*np.var(sample))/stats.chi2.ppf(1-0.05 / 2, 100-1), (100*np.var(sample))/stats.chi2.ppf(0.05 / 2, 100-1))
    intervals.append(interval)
    var = np.var(sample)
    vars.append(var)
plt.figure(figsize=(10,4))
plt.errorbar(x=np.arange(0.1, 20, 1), y=vars, yerr=[(upper-lower)/2 for lower, upper in intervals], fmt='o')
plt.hlines(xmin=0, xmax=20, y=true_var, linewidth=2.0, color="red")
plt.title("Confidence Interval for the Variance")
plt.xlabel("Repetitions")
plt.ylabel("Width of Confidence Interval")
x_coordinates = [0, 19]
plt.show()
Variance: 4.1043499766259846 Standard Deviation: 2.0259195385370035 Confidence Interval : (3.196, 5.595)
As can be seen, the confidence interval of (3.196, 5.595) does indeed include the true variance of 4. The 20 calculated confidence intervals are visualized in the graphic above as blue error bars around the respective variance estimates, while the red line marks the variance estimate of the first sample (≈4.10), which is close to the true variance of 4. With a confidence level of 95%, i.e. a 5% error rate, a confidence interval should fail to cover the true population variance about $\frac{1}{20} \cdot 100 = 5$% of the time, i.e. roughly one interval out of 20. In the graphic this is the 13th calculated interval, which does not cover the variance.
Next, we define a function $\text{sigma\_test}(X, \sigma_0^2, \alpha)$, which tests the hypothesis
$$H_0:\sigma^2=\sigma_0^2 \text{ against }H_1:\sigma^2\neq\sigma_0^2 $$for some given level $\alpha$ and a given vector of observations $X$ (built on the confidence interval above). The function should return $1$ if the null hypothesis is rejected and $0$ otherwise.
def sigma_test(sample, sigma, alpha):
    interval = ((len(sample)*np.var(sample))/stats.chi2.ppf(1-alpha / 2, len(sample)-1),
                (len(sample)*np.var(sample))/stats.chi2.ppf(alpha / 2, len(sample)-1))
    if interval[0] > sigma or interval[1] < sigma:
        result = 1
    else:
        result = 0
    return result
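A brief illustrative check of the function (a sketch with an arbitrarily chosen seed, not part of the original exercise):
np.random.seed(42)
x_demo = np.random.normal(0, 2, 100)   #a sample whose true variance is 4
print(sigma_test(x_demo, 4, 0.05))     #expected to be 0 for most such samples (H0 is true)
print(sigma_test(x_demo, 1, 0.05))     #expected to be 1 (the true variance is far from 1)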
The chi-squared test for the variance is used to determine whether the variance of the population from which a sample was drawn equals a hypothesized value. When testing whether the population variance equals 4, for example, the test should usually return 0, because the null hypothesis cannot be rejected. When we repeat this test 1000 times in the next exercise, we expect the function to return 1 (i.e. reject the null hypothesis) in about 5% of the repetitions, because we set alpha to 0.05. To check whether the test really has significance level $\alpha$, a simulation with $l=1000$ repetitions is run with $\mu=0, \sigma^2=\sigma_0^2=4, n=100$ and $\alpha=0.05$.
count = 0
np.random.seed(123)
for i in range(1000):
    sample = np.random.normal(0, np.sqrt(4), 100)
    result = sigma_test(sample, 4, 0.05)
    count += result
print("percentage of rejected null hypothesis:", count/1000*100)
percentage of rejected null hypothesis: 5.1
Indeed, testing whether the population variance equals the given value of 4 yields an empirical rejection rate of 5.1%, close to the nominal significance level of $\alpha = 0.05$. Removing the random seed shows that the rejection rate hovers around 5%; increasing the number of repetitions from 1,000 to 10,000 would reduce this variation and bring the rate even closer to 5.0%. To assess the power of the test, the previous procedure is repeated, but now with $\sigma_0^2=2$ and $\sigma^2=4$.
count = 0
np.random.seed(0)
for i in range(1000):
    sample = np.random.normal(0, np.sqrt(4), 100)
    result = sigma_test(sample, 2, 0.05)  #test whether the variance equals 2
    count += result
print("percentage of rejected null hypothesis:", count/1000*100)
percentage of rejected null hypothesis: 99.9
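For comparison, the exact power implied by the $\chi^2$-distribution of the test statistic can be computed directly; the following sketch assumes $n=100$, $\sigma^2=4$, $\sigma_0^2=2$ and $\alpha=0.05$ and is not part of the original simulation:
from scipy.stats import chi2
n, sigma2_true, sigma2_0, alpha = 100, 4.0, 2.0, 0.05
q_lo, q_hi = chi2.ppf(alpha/2, n-1), chi2.ppf(1-alpha/2, n-1)
#under the truth, T = n*s2/sigma2_0 equals (sigma2_true/sigma2_0) times a chi2(n-1) variable
ratio = sigma2_true/sigma2_0
power_exact = chi2.cdf(q_lo/ratio, n-1) + 1 - chi2.cdf(q_hi/ratio, n-1)
print("Exact power of the test:", round(power_exact, 4))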
The results show that if the alternative hypothesis is true, the null hypothesis is correctly rejected in 99.9% of the repetitions. This is the (estimated) power of the test for this particular value of the variance: the probability of rejecting the null hypothesis when the alternative is true, $$\text{Power} = P(\text{reject } H_0 \mid H_1 \text{ true}).$$
In the case of the tested variance our estimated power is 99.9%, which indicates a very powerful test. The hypothesis test used above relies on the $\chi^2$-distribution of the following test statistic: $$T(X):=\frac{n\hat{s}^2_n}{\sigma^2}\sim\chi^2(n-1).$$ The objective now is to plot a histogram of the test statistic and compare it to the density of a suitable $\chi^2$-distribution. Furthermore, the quantiles $\chi^2_{n-1,1-\alpha/2}$ and $\chi^2_{n-1,\alpha/2}$, which delimit the critical region, are marked by vertical lines. It is then explained why we reject the null hypothesis $H_0$ if the test statistic $T(X)$ lies to the left or to the right of these two quantiles.
test_stats = []
df = 100 - 1
np.random.seed(0)
for i in range(1000):
    sample = np.random.normal(0, np.sqrt(4), 100)
    test_stat = (len(sample)*np.var(sample))/4   #T(X) = n * s_n^2 / sigma^2
    test_stats.append(test_stat)
x = np.linspace(chi2.ppf(0.01, df), chi2.ppf(0.99, df), 100) #choice
#x = np.arange(min(test_stats), max(test_stats), .05)
fig, ax = plt.subplots(1, 1)
ax.hist(test_stats, density=True, histtype='stepfilled', alpha=0.2) #choice
#ax.hist(test_stats, 50, weights=np.zeros_like(test_stats) +1 / 1000)
ax.plot(x, chi2.pdf(x, df),'r-', lw=5, alpha=0.6, label='chi2 pdf')
plt.axvline(stats.chi2.ppf(1-0.05 / 2, 100-1), label=r"upper quantile $\chi^2_{n-1,1-\alpha/2}$")
plt.axvline(stats.chi2.ppf(0.05 / 2, 100-1), label=r"lower quantile $\chi^2_{n-1,\alpha/2}$")
ax.legend(loc='best', frameon=False)
plt.show()
The calculated test statistic is compared with the critical region of the chi-squared distribution, which is determined by the degrees of freedom and the significance level alpha. Because we repeatedly calculated the test statistic on normally distributed random samples, the empirical distribution of the test statistic closely approximates the $\chi^2$-distribution: in the graphic above, the histogram of the test statistic (light blue) matches the $\chi^2$ density plotted on the same interval $x$. If the test statistic for the variance of a specific sample falls to the left of the lower quantile or to the right of the upper quantile (i.e. into the critical region), we reject the null hypothesis that the sample has the hypothesized population variance. With alpha set to 0.05 this happens only 5% of the time when the null hypothesis is true, so the probability of a false rejection is controlled at 5%.
In the last exercise, we consider a welfare policy experiment with a binary treatment $D$ (assignment), which was conducted in the mid 1990s to assess the publicly funded US Job Corps program. The program targets young individuals (aged 16–24 years) who legally reside in the USA and come from a low-income household. It provides participants with approximately 1200 hours of vocational training and education, housing and board over an average duration of 8 months. Participants also receive health education as well as health and dental care.
To start the data preparation process, a data frame $X$ is created that includes the covariates listed below. This data frame serves as the set of regressors used to determine the impact on a specific outcome variable and will also be used for classification.
'assignment', 'female', 'age', 'white', 'black', 'hispanic', 'educ', 'educmis', 'geddegree', 'hsdegree', 'english', 'cohabmarried', 'haschild', 'everwkd', 'mwearn', 'hhsize', 'hhsizemis', 'educmum', 'educmummis', 'educdad', 'educdadmis', 'welfarechild', 'welfarechildmis', 'health', 'healthmis', 'smoke', 'smokemis', 'alcohol', 'alcoholmis'.
As for the dependent variable, first a binary DataFrame Y_class is created, which is $1$ if the general health 30 months after assignment ($health30$) is stated to be excellent ($health30=1$) and $0$ otherwise. Calculating the share based on the data frame $Y_{class}$ shows that 30 months after assignment, 38.27 percent of the people report an excellent health status, while the remaining 61.73 percent do not.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import pandas as pd
import numpy as np
data = pd.read_csv('dataJC.csv', sep=',', na_values=".", index_col=None)
columnnames = ['assignment', 'female', 'age', 'white', 'black', 'hispanic', 'educ',
'educmis', 'geddegree', 'hsdegree', 'english', 'cohabmarried',
'haschild', 'everwkd', 'mwearn', 'hhsize', 'hhsizemis', 'educmum',
'educmummis', 'educdad', 'educdadmis', 'welfarechild',
'welfarechildmis', 'health', 'healthmis', 'smoke', 'smokemis',
'alcohol', 'alcoholmis']
X = pd.DataFrame(data, columns=columnnames)
y_class = pd.DataFrame(np.where(data.health30 == 1, 1, 0))
Share = y_class.sum()[0]/len(y_class)*100
print('Share of people with excellent health status:')
print(round(Share,2))
Share of people with excellent health status: 38.27
Another variable $Y_{pred}$ is created, which is equal to the weekly earnings in the third year after assignment ($earny3$). $Y_{pred}$ will serve as a dependent variable in different regression approaches, because it is continuous. To determine whether the participation in the job program has an effect on the dependent variables (on the participants health as well as on their weekly earnings), the sample means of the dependent variables $Y_{class}$ and $Y_{pred}$ are calculated conditioned on assignment.
y_pred=pd.DataFrame(data["earny3"])
y_pred=y_pred.rename(columns={"earny3": "y_pred"})
print("Health Status Treated:", round(y_class.loc[data.assignment == 1].mean()[0],3))
print("Health Status Not Treated:", round(y_class.loc[data.assignment == 0].mean()[0],3))
print("")
print("Weekly Earnings Treated:", round(y_pred.loc[data.assignment == 1].mean()[0],3))
print("Weekly Earnings Not Treated:", round(y_pred.loc[data.assignment == 0].mean()[0],3))
Health Status Treated: 0.387 Health Status Not Treated: 0.376 Weekly Earnings Treated: 177.477 Weekly Earnings Not Treated: 166.014
The sample mean of $Y_{class}$ for the group of participants who were assigned to the program (assignment = 1) is only slightly higher than the mean for those who were not assigned (assignment = 0) (0.387 vs. 0.376). From this naive comparison it looks like participation in the job program has no effect on health, because the means are very close to each other. To actually test this, a hypothesis test is needed, since it accounts for the variation in the data from which a standard error can be built. For $Y_{pred}$ the sample means conditioned on assignment show a larger difference between the groups: the average weekly earning of assigned participants (assignment = 1) is 177.48 USD, while non-assigned participants (assignment = 0) earn 166.01 USD per week on average. This indicates that the assignment had an impact on earnings, as the average weekly earnings are higher for those who were assigned to the Job Corps program. But is this effect significant? To answer this question, a hypothesis test can be used to compare the means of the two groups.
import scipy.stats as stats
data.loc[data['health30'] == 1, 'Y_class'] = 1
data.loc[data['health30'] !=1, 'Y_class'] = 0
Y_class = data.filter(['Y_class'], axis = 1)
Yclass1 = data.loc[data['assignment'] == 1, 'Y_class']
Yclass0 = data.loc[data['assignment'] == 0, 'Y_class']
Ypred1 = data.loc[data['assignment'] == 1, 'earny3']
Ypred0 = data.loc[data['assignment'] == 0, 'earny3']
print('Test Result for mean comparison of Y_class:')
sample_class_1 = Yclass1
sample_class_0 = Yclass0
stats.ttest_ind(sample_class_1,sample_class_0,equal_var= False)
print('\n')
print('Test Result for mean comparison of Y_pred:')
sample_pred_1 = Ypred1
sample_pred_0 = Ypred0
stats.ttest_ind(sample_pred_1,sample_pred_0,equal_var= False)
Test Result for mean comparison of Y_class:
Ttest_indResult(statistic=1.0412579548844745, pvalue=0.2977878613858475)
Test Result for mean comparison of Y_pred:
Ttest_indResult(statistic=3.328573871176786, pvalue=0.0008768832000587824)
Testing the mean difference for $Y_{class}$ shows that the health status of people who were assigned to the job program is not significantly different from that of those who were not (even at a 'minimum' significance level of 0.1). Thus, we cannot detect a significant effect of assignment to the job program on health status after 30 months. In contrast, the test for $Y_{pred}$ shows that the average weekly earnings of people who received the assignment differ significantly from those who did not (at significance level 0.01). Therefore, at the 1% significance level, we conclude that assignment has a significant effect on weekly earnings after 3 years. Although statistically significant, this effect is economically small, as a difference of about 11.5 dollars per week is minor.
In the next step, regression analysis is applied with $Y_{pred}$ as the dependent variable, to determine the effect of the respective covariates on weekly earnings after three years. Different regression approaches are applied and their performance is evaluated and compared. To prepare the data for the regression analysis, a train-test split is applied, where $(Y_{pred},X)$ is randomly split into train (80%) and test (20%) samples. The importance of sample splitting is discussed.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y_pred, test_size=0.2, random_state=7)
#print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
The process of sample splitting is important for the later analysis and evaluation of a model's performance. With sample splitting, we are able to train a model on one part of the data (the train sample) and then test it on an unseen part of the data, the test sample. As we still have the full information of the test sample, i.e. the values of the outcome variable corresponding to the regressors, we can evaluate our model on the test sample by comparing the true values with the predicted values. If we were to train and test the model on the same data, the danger is that the results would look too good, leading us to believe we have a model with very good, or at least sufficient, performance; if a brand-new sample were then acquired and predicted, the model might not perform nearly as well. Therefore, we split the sample to detect whether the model is merely memorizing the data, including the associated noise (overfitting), or whether it actually recognizes the underlying patterns and can generalize to an unseen sample from the same distribution.
To evaluate the performance of our models, a function is implemented that outputs the root mean squared error (RMSE) for a given pair of vectors $y$ and $\hat{y}$, based on the equation
$$\text{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y_i})^2}.$$In the first step, the train and test root mean squared errors of the linear model $$Y_{pred} = \beta X + \epsilon$$ are calculated via an OLS regression. The RMSE results are compared to the standard deviation of the outcome variable $Y_{pred}$.
from sklearn.linear_model import LinearRegression
def RMSE(y, y_hat):
    result = np.sqrt(np.mean((y-y_hat)**2))
    result = round(result, 2)
    result = pd.Series(result)
    return result
y_train = np.squeeze(y_train)
y_test = np.squeeze(y_test)
model = LinearRegression().fit(X_train, y_train)
preds_train = model.predict(X_train)
preds_test = model.predict(X_test)
ols = pd.concat((RMSE(y_train, preds_train), RMSE(y_test, preds_test)), axis=1)
print("RMSE Train-Set:", ols[0][0])
print("RMSE Test-Set:",ols[1][0])
print("Standard Deviation of Earnings (y_pred):", np.round(np.std(y_pred)[0],2))
print("RMSE Naive Estimator:", RMSE(np.mean(y_pred),y_pred)[0])
print("Standard Deviation of Earnings (preds_test):", np.round(np.std(preds_test),2))
RMSE Train-Set: 153.06 RMSE Test-Set: 150.52 Standard Deviation of Earnings (y_pred): 162.77 RMSE Naive Estimator: 162.77 Standard Deviation of Earnings (preds_test): 57.43
First, we see that the RMSE of the train set is about the same as the RMSE of the test set, which indicates that our model is not overfitting; that the train RMSE is even slightly higher than the test RMSE can be attributed to the random composition of the train and test samples. Second, the RMSE of the test set is lower than the standard deviation of the outcome variable y_pred, which is good: the regression provides a better estimate than a naive estimator that simply predicts the mean of the outcome variable. At the same time, the RMSE remains close to that standard deviation, which is no coincidence, given the mathematical similarity between the RMSE and the standard deviation. Finally, the standard deviation of the predicted values is much lower than that of the outcome variable. Since the RMSE is large, this small spread in $\hat{y}$ is not a good sign: the regression estimate is hardly better than the naive mean estimator. We should therefore also expect a low R-squared value, which relates the explained spread of the estimate to the total spread in $y$.
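A quick check of this expectation (a sketch using sklearn.metrics.r2_score together with the linear-model predictions computed above):
from sklearn.metrics import r2_score
#out-of-sample R-squared of the linear model fitted above
print("Test R^2:", round(r2_score(y_test, preds_test), 3))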
The results of our linear regression should now be compared to other regression methods, including lasso, ridge and elastic net regression. After comparing the performance of the respective models based on the RMSE, approaches to improve the prediction quality are discussed.
from sklearn.linear_model import LassoCV
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import ElasticNetCV
model = LassoCV().fit(X_train, y_train)
preds_train = model.predict(X_train)
preds_test = model.predict(X_test)
lasso = pd.concat((RMSE(y_train, preds_train), RMSE(y_test, preds_test)), axis=1)
model = RidgeCV().fit(X_train, y_train)
preds_train = model.predict(X_train)
preds_test = model.predict(X_test)
ridge = pd.concat((RMSE(y_train, preds_train), RMSE(y_test, preds_test)), axis=1)
model = ElasticNetCV().fit(X_train, y_train)
preds_train = model.predict(X_train)
preds_test = model.predict(X_test)
enet = pd.concat((RMSE(y_train, preds_train), RMSE(y_test, preds_test)), axis=1)
summary = pd.concat((ols, lasso, ridge, enet))
summary.index = ['ols','lasso', 'ridge',"elastic net"]
summary.columns = ['In Sample', 'Out of Sample']
summary
 | In Sample | Out of Sample |
---|---|---|
ols | 153.06 | 150.52 |
lasso | 153.73 | 151.21 |
ridge | 153.07 | 150.55 |
elastic net | 157.06 | 155.26 |
Ridge, lasso and elastic net are regression models that penalize the beta coefficients in the L1 norm (lasso), the L2 norm (ridge) or both (elastic net). Lasso and elastic net can therefore set beta coefficients without relevant explanatory impact exactly to zero, whereas ridge can only shrink them close to zero. Because these models penalize coefficients that move too far from zero, their regularization parameters can be adjusted, and such adjustments influence the performance of the model; tuning these parameters can therefore lead to better prediction performance. For further reductions in RMSE, exploratory data analysis (EDA) would need to be conducted to determine what variable types and scales are included in the data set. For example, the prediction quality could be improved by scaling variables such as age to the range 0-1, possibly separating age into bins, and taking the log of the earnings variables, which tend to be much larger than the rest of the variables in the data set. Binning/grouping could help to identify heterogeneous groups, and removing outliers could be useful as well.
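One way to act on these suggestions is to combine feature scaling with an explicit cross-validated grid over the regularization strength and the L1/L2 mix. The following sketch uses an assumed, illustrative parameter grid and is not a tuned result:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNetCV
#scale the regressors, then cross-validate the penalty strength (alphas) and the L1/L2 mix (l1_ratio)
enet_tuned = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], alphas=np.logspace(-3, 2, 50), cv=5)
)
enet_tuned.fit(X_train, y_train)
preds_test_enet = enet_tuned.predict(X_test)
print("RMSE tuned elastic net (test):", RMSE(y_test, preds_test_enet)[0])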
Finally, we are interested in the treatment effect of the assignment to the Job Corps program. Therefore, the OLS regression is rerun on the whole sample without sample splitting to determine the effect of program assignment on earnings and its significance. Looking at the results provided by the statsmodels API below, we see that, holding all other variables constant, assignment to the program increases weekly earnings after three years by 14 dollars and 22 cents. This is a significant positive effect, because the p-value is practically zero; according to the confidence interval, the positive effect on weekly earnings lies between roughly 8 and 20 dollars. The adjusted R-squared of the model is however only about 12%, which means that only 12% of the total variation in the outcome is explained by the variables. This could be due to poor fit, because no data pre-processing was conducted, or because earnings are determined by factors not observed in this data set; data pre-processing often contributes substantially to improving model fit. Nevertheless, drawing conclusions about the treatment effect of program assignment on earnings from this 'naive' linear regression is difficult. For a valid estimate of the treatment effect, all relevant covariates would need to be controlled for, so that no other influences on the dependent variable remain, preferably with the covariates selected by a double-selection approach. Therefore, we can observe a positive association between assignment to the Job Corps program and weekly earnings after 3 years, but causality cannot be inferred.
import statsmodels.api as sm
X_c = sm.add_constant(X)
model = sm.OLS(y_pred,X_c)
model = model.fit()
model.summary()
Dep. Variable: | y_pred | R-squared: | 0.122 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.119 |
Method: | Least Squares | F-statistic: | 44.24 |
Date: | Thu, 01 Jul 2021 | Prob (F-statistic): | 9.84e-235 |
Time: | 14:34:28 | Log-Likelihood: | -59562. |
No. Observations: | 9240 | AIC: | 1.192e+05 |
Df Residuals: | 9210 | BIC: | 1.194e+05 |
Df Model: | 29 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | -8.9592 | 24.676 | -0.363 | 0.717 | -57.330 | 39.411 |
assignment | 14.2214 | 3.264 | 4.357 | 0.000 | 7.823 | 20.620 |
female | -57.6810 | 3.451 | -16.713 | 0.000 | -64.446 | -50.916 |
age | 6.4151 | 0.953 | 6.732 | 0.000 | 4.547 | 8.283 |
white | 32.9201 | 7.119 | 4.624 | 0.000 | 18.965 | 46.876 |
black | -19.6873 | 6.817 | -2.888 | 0.004 | -33.051 | -6.324 |
hispanic | 10.0964 | 7.183 | 1.406 | 0.160 | -3.984 | 24.177 |
educ | 9.8711 | 1.492 | 6.617 | 0.000 | 6.947 | 12.795 |
educmis | 103.7038 | 38.645 | 2.683 | 0.007 | 27.950 | 179.458 |
geddegree | 30.0881 | 7.746 | 3.884 | 0.000 | 14.905 | 45.272 |
hsdegree | 8.2795 | 5.443 | 1.521 | 0.128 | -2.391 | 18.950 |
english | 2.4089 | 6.057 | 0.398 | 0.691 | -9.463 | 14.281 |
cohabmarried | 9.7798 | 6.902 | 1.417 | 0.157 | -3.750 | 23.310 |
haschild | 1.5944 | 4.521 | 0.353 | 0.724 | -7.267 | 10.456 |
everwkd | -27.4785 | 4.602 | -5.971 | 0.000 | -36.499 | -18.458 |
mwearn | 0.0954 | 0.018 | 5.230 | 0.000 | 0.060 | 0.131 |
hhsize | -1.2020 | 0.814 | -1.477 | 0.140 | -2.798 | 0.394 |
hhsizemis | -16.7310 | 25.278 | -0.662 | 0.508 | -66.282 | 32.820 |
educmum | 0.8241 | 0.791 | 1.042 | 0.298 | -0.727 | 2.375 |
educmummis | 1.1527 | 10.041 | 0.115 | 0.909 | -18.531 | 20.836 |
educdad | 0.7879 | 0.792 | 0.995 | 0.320 | -0.764 | 2.339 |
educdadmis | 1.3585 | 9.713 | 0.140 | 0.889 | -17.680 | 20.397 |
welfarechild | -9.6148 | 1.468 | -6.548 | 0.000 | -12.493 | -6.736 |
welfarechildmis | -28.8459 | 7.618 | -3.786 | 0.000 | -43.779 | -13.913 |
health | -11.7220 | 2.275 | -5.153 | 0.000 | -16.181 | -7.263 |
healthmis | 11.6371 | 38.980 | 0.299 | 0.765 | -64.772 | 88.046 |
smoke | -1.4975 | 2.289 | -0.654 | 0.513 | -5.985 | 2.990 |
smokemis | 11.7109 | 5.325 | 2.199 | 0.028 | 1.272 | 22.150 |
alcohol | 2.3361 | 1.746 | 1.338 | 0.181 | -1.087 | 5.759 |
alcoholmis | -6.4394 | 6.524 | -0.987 | 0.324 | -19.228 | 6.349 |
Omnibus: | 3583.727 | Durbin-Watson: | 1.962 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 29935.318 |
Skew: | 1.635 | Prob(JB): | 0.00 |
Kurtosis: | 11.189 | Cond. No. | 2.85e+03 |
In the next part, different approaches are used to classify the data. The variable $Y_{class}$ is the binary outcome variable indicating excellent health status. The data was randomly split into a train (80%) and a test (20%) sample.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn import svm
from sklearn import neighbors
from sklearn.ensemble import RandomForestClassifier
from matplotlib import pyplot
from sklearn import metrics
X_train, X_test, y_train, y_test = train_test_split(X, y_class, test_size=0.2, random_state=7)
y_train = np.squeeze(y_train)
y_test = np.squeeze(y_test)
First, logistic regression was used to classify the data; its performance was evaluated by the accuracy on the test set. In order to improve performance, different tuning parameters are applied and interpreted. In general, logistic regression models the probability of a certain class. The penalty, the solver and the regularization strength are the most important tuning parameters of this classifier: the solver is the optimization algorithm, the penalty specifies the norm used in the penalization, and C is the inverse of the regularization strength. An l1 penalty, for example, can be used for feature selection, because it shrinks the coefficients of unimportant features to 0. The performance of both the l1 and l2 penalties is presented. For hyperparameter tuning we varied the regularization strength (C) and the penalty, and loops were used to test different combinations of hyperparameters. Overall, the best result was achieved when combining a C value of 0.1 with the l1 penalty; using the solver 'newton-cg' could further improve the results. An overview of the results is provided at the end.
penalty_vary = ['l1', 'l2']
c_vary = [0.001, 0.01, 0.1, 1, 10]
for c in c_vary:
    for Penalty in penalty_vary:
        logisticRegr = LogisticRegression(max_iter = 10000, solver = 'liblinear', penalty = Penalty, C=c)
        model_log = logisticRegr.fit(X_train, y_train)
        pred3 = model_log.predict(X_test)
        print("accuracy score for c=", c, ',', 'Penalty=', Penalty, ':', round(metrics.accuracy_score(y_test, pred3),2))
pred1 = LogisticRegression(max_iter = 10000, solver="liblinear", penalty = "l2", C= 0.01).fit(X_train, y_train).predict(X_test)
pred2 = LogisticRegression(max_iter = 10000, solver="liblinear", penalty="l1", C = 0.1).fit(X_train, y_train).predict(X_test)
pred3 = LogisticRegression(max_iter = 10000, solver="newton-cg", penalty = "l2", C= 0.01).fit(X_train, y_train).predict(X_test)
logregl1 = pd.Series(accuracy_score(y_test,pred2))
logregl2 = pd.Series(accuracy_score(y_test,pred1))
logregcg = pd.Series(accuracy_score(y_test,pred3))
accuracy score for c= 0.001 , Penalty= l1 : 0.61 accuracy score for c= 0.001 , Penalty= l2 : 0.62 accuracy score for c= 0.01 , Penalty= l1 : 0.62 accuracy score for c= 0.01 , Penalty= l2 : 0.65 accuracy score for c= 0.1 , Penalty= l1 : 0.65 accuracy score for c= 0.1 , Penalty= l2 : 0.64 accuracy score for c= 1 , Penalty= l1 : 0.65 accuracy score for c= 1 , Penalty= l2 : 0.65 accuracy score for c= 10 , Penalty= l1 : 0.65 accuracy score for c= 10 , Penalty= l2 : 0.65
Next, support vector machines were used with different kernels and other tuning parameters to minimize the empirical error rate. Support vector machines are based on the idea of finding a hyperplane that best divides a data set into two classes. The SVM algorithms use a set of mathematical functions called the kernel; the kernel takes data as input and transforms it into the required form. There are different kernel functions: linear, polynomial, radial basis function (RBF) and sigmoid. The polynomial kernel represents the similarity of vectors over polynomials of the original variables, which allows learning non-linear models. The radial basis function is a function whose value depends on the Euclidean distance between the input and some fixed point. A sigmoid function maps all real numbers to a range usually between 0 and 1 or -1 and 1. Here the RBF, polynomial and sigmoid kernels were used, varying the degree of the polynomial kernel and the C values (where C is the inverse of the regularization strength). For the sigmoid kernel the results did not differ across C values.
#Different degrees for Polynomial Kernel
degree_vary = [1,2]
for Degree in degree_vary:
    pred_pol = svm.SVC(kernel= 'poly', degree = Degree, gamma='scale').fit(X_train, y_train).predict(X_test)
    print("accuracy score polynomial for degree=", Degree, ':', round(accuracy_score(y_test,pred_pol),2))
#Varying C in RBF, no difference in Sigmoid
c_vary = [0.01, 0.1, 0.5, 1, 10]
for c in c_vary:
    pred_rbf = svm.SVC(kernel = 'rbf', C = c, gamma='auto').fit(X_train, y_train).predict(X_test)
    print('accuracy score rbf for C=', c, ':', round(accuracy_score(y_test,pred_rbf),2))
pred_rbf = svm.SVC(kernel = 'rbf', C = 1, gamma='auto').fit(X_train, y_train).predict(X_test)
pred_sig = svm.SVC(kernel = 'sigmoid', C = 1, gamma='auto').fit(X_train, y_train).predict(X_test)
pred_pol = svm.SVC(kernel = 'poly', degree = 2, gamma='scale').fit(X_train, y_train).predict(X_test)
svm1 = pd.Series(accuracy_score(y_test,pred_rbf))
svm2 = pd.Series(accuracy_score(y_test,pred_sig))
svm3 = pd.Series(accuracy_score(y_test,pred_pol))
accuracy score polynomial for degree= 1 : 0.61 accuracy score polynomial for degree= 2 : 0.61 accuracy score rbf for C= 0.01 : 0.61 accuracy score rbf for C= 0.1 : 0.61 accuracy score rbf for C= 0.5 : 0.61 accuracy score rbf for C= 1 : 0.62 accuracy score rbf for C= 10 : 0.59
Another classification approach that was used is $k$-nearest neighbors. Weights and the number of neighbors were varied to minimize the empirical error rate. First, the number of neighbors was varied with the weight set to 'uniform'. The number of neighbors that gave the best accuracy was then also used with different weights ('uniform', 'distance') to examine the difference in performance between the weighting schemes for the same number of neighbors. The difference between the two weighting functions is that the first assigns the same weight to each neighbor, while the second weights neighbors closer to the query point more heavily. When varying the number of neighbors, the accuracy score is best for 100 neighbors. With a growing number of neighbors the predictions should become more stable up to a certain point, as the algorithm can 'choose' among more options, but pushing it too far (here neighbors = 1000, 2000) contributes no further improvement to the accuracy score. It is therefore important to adjust the number of neighbors to the problem at hand and not to increase complexity unnecessarily by setting the number of neighbors too high while the performance stagnates. An insufficient number of neighbors, on the other hand, carries the danger of overfitting the model. The train accuracy should therefore also be calculated, to inspect the difference between the performance on the train and test samples; for 100 neighbors we did not find strong differences between the train and the test sample.
#Vary the number of neighbors with uniform weights
N = [1, 10, 20, 30, 40, 50, 100, 1000, 2000]
for N_neighbors in N:
    neighb = neighbors.KNeighborsClassifier(n_neighbors=N_neighbors, weights='uniform')
    model_knn = neighb.fit(X_train, y_train)
    preds3 = model_knn.predict(X_test)
    print("Accuracy for neighbors =", N_neighbors, ":", round(metrics.accuracy_score(y_test, preds3), 2))
#Best number of neighbors (100): test accuracy for both weightings and train accuracy for 'uniform'
preds1 = neighbors.KNeighborsClassifier(n_neighbors=100, weights='uniform').fit(X_train, y_train).predict(X_test)
preds1_train = neighbors.KNeighborsClassifier(n_neighbors=100, weights='uniform').fit(X_train, y_train).predict(X_train)
preds2 = neighbors.KNeighborsClassifier(n_neighbors=100, weights='distance').fit(X_train, y_train).predict(X_test)
knn1 = pd.Series(accuracy_score(y_test, preds1))
knn1_train = pd.Series(accuracy_score(y_train, preds1_train))
knn2 = pd.Series(accuracy_score(y_test, preds2))
Accuracy for neighbors = 1 : 0.53
Accuracy for neighbors = 10 : 0.59
Accuracy for neighbors = 20 : 0.59
Accuracy for neighbors = 30 : 0.59
Accuracy for neighbors = 40 : 0.6
Accuracy for neighbors = 50 : 0.61
Accuracy for neighbors = 100 : 0.62
Accuracy for neighbors = 1000 : 0.61
Accuracy for neighbors = 2000 : 0.61
Random forest is now applied in order to improve the results, and the feature importances are also ascertained. The tuning parameters and how they affect the performance are explained below. In general, a random forest consists of a large number of small decision trees, called estimators, each making its own prediction; the random forest combines the predictions of the estimators into a single, more accurate prediction, as illustrated in the sketch below.
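A minimal sketch of this aggregation on toy data (generated with make_classification, not the project data): scikit-learn's RandomForestClassifier averages the per-tree class probabilities rather than taking hard votes, so averaging the trees' probabilities by hand reproduces the forest's prediction.
#Sketch on toy data: the forest's prediction equals the average of its trees' class probabilities
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
forest_toy = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0).fit(X_toy, y_toy)
#Average the per-tree class probabilities by hand ...
mean_probs = np.mean([tree.predict_proba(X_toy[:5]) for tree in forest_toy.estimators_], axis=0)
#... and compare with the forest's own prediction on the same rows
print(np.argmax(mean_probs, axis=1))
print(forest_toy.predict(X_toy[:5]))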
#Vary the maximum tree depth with a fixed number of estimators
depth = [1, 3, 7, 10, 20]
for Depth in depth:
    model_d = RandomForestClassifier(max_depth=Depth, random_state=0, n_estimators=1000).fit(X_train, y_train)
    preds_d = model_d.predict(X_test)
    print("Accuracy for max_depth =", Depth, ':', round(metrics.accuracy_score(y_test, preds_d), 2))
print('\n')
print('Importance of features')
#Refit the best model (max_depth = 7) and extract the feature importances
model = RandomForestClassifier(max_depth=7, random_state=0, n_estimators=1000).fit(X_train, y_train)
importance = model.feature_importances_
preds1 = model.predict(X_test)
preds2 = RandomForestClassifier(max_depth=3, random_state=0, n_estimators=1000).fit(X_train, y_train).predict(X_test)
rf1 = pd.Series(accuracy_score(y_test, preds1))
rf2 = pd.Series(accuracy_score(y_test, preds2))
Accuracy for max_depth = 1 : 0.61
Accuracy for max_depth = 3 : 0.61
Accuracy for max_depth = 7 : 0.64
Accuracy for max_depth = 10 : 0.63
Accuracy for max_depth = 20 : 0.63
Importance of features
#Print the features with an importance score above 0.04 and plot all importances
for i, v in enumerate(importance):
    if v > 0.04:
        print('Feature: %s, Score: %.5f' % (X_train.columns[i], v))
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()
Feature: female, Score: 0.06366
Feature: age, Score: 0.04833
Feature: educ, Score: 0.04097
Feature: mwearn, Score: 0.04412
Feature: hhsize, Score: 0.05416
Feature: educmum, Score: 0.04246
Feature: health, Score: 0.40998
[Bar chart of the 29 feature importances]
The most important features for classifying the data are printed above and shown in the feature importance plot. Health is by far the most important variable for predicting excellent health status 30 months after assignment, which is an unsurprising result. Other, less important variables are listed above and include: female, household size, age, mother's years of education at assignment, and average weekly gross earnings at assignment.
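An alternative way to inspect the same information would be to sort the importances (reusing the importance array and the columns of X_train from above); this is only a convenience sketch, not part of the original analysis.
#Sketch: list the feature importances in descending order
imp_sorted = pd.Series(importance, index=X_train.columns).sort_values(ascending=False)
print(imp_sorted.head(7))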
The main tuning parameters of a random forest are max_depth and n_estimators. The depth of the trees in the ensemble determines how closely the trees fit the data; if the depth is too high, this may lead to overfitting. The number of estimators is the number of trees whose predictions are averaged; more trees usually lead to a more robust model, because of the variety of possible random splits. We kept n_estimators constant while varying max_depth, because changing n_estimators did not further improve the accuracy score (a sketch of such a check is shown below).
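A sketch of how n_estimators could be varied in the same way as max_depth above, on the same X_train/X_test split (the resulting scores are not reproduced here):
#Sketch: vary the number of trees at the best depth found above
for n_trees in [100, 500, 1000, 2000]:
    model_n = RandomForestClassifier(max_depth=7, random_state=0, n_estimators=n_trees).fit(X_train, y_train)
    print("Accuracy for n_estimators =", n_trees, ':', round(metrics.accuracy_score(y_test, model_n.predict(X_test)), 2))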
#Collect the test accuracies of all models into one summary table
summary = pd.concat((logregl1, logregl2, logregcg, svm1, svm2, svm3, knn1, knn2, rf1, rf2))
summary = round(summary, 3)
summary.index = ['Logistic Regression liblinear l1', 'Logistic Regression liblinear l2', 'Logistic Regression newton-cg l2', 'Support Vector Machine RBF',
                 'Support Vector Machine Sigmoid', 'Support Vector Machine Poly', '100 K-Neighbors Uniform', '100 K-Neighbors Distance', 'Random Forest Depth 7', 'Random Forest Depth 3']
summary = summary.to_frame(name='Test Set Accuracy')
summary
| Model | Test Set Accuracy |
|---|---|
| Logistic Regression liblinear l1 | 0.647 |
| Logistic Regression liblinear l2 | 0.646 |
| Logistic Regression newton-cg l2 | 0.648 |
| Support Vector Machine RBF | 0.619 |
| Support Vector Machine Sigmoid | 0.612 |
| Support Vector Machine Poly | 0.613 |
| 100 K-Neighbors Uniform | 0.616 |
| 100 K-Neighbors Distance | 0.616 |
| Random Forest Depth 7 | 0.639 |
| Random Forest Depth 3 | 0.613 |
Finally, the results are summarized and presented. The results table shows the different models and their test set accuracies. Overall, there are no strong differences between the performances of the models. Within each model class, the performance could be improved somewhat by adjusting the hyperparameters, but no outstanding differences were achieved.
The best model appears to be the logistic regression, which achieved an accuracy of 0.648. Within the logistic regression models, a small improvement was achieved by changing the solver to 'newton-cg'.
We consider the performance of this model a sufficient estimate of the out-of-sample performance. It is better than chance (0.5). An accuracy of 1.00 is unrealistic, given the variety of factors that may influence health. Further, none of the models appear to be overfitted: they show similar results to one another, and if the train accuracies were shown, they would not differ much from the test accuracies. This is an important consideration when tuning, because one can, for example, tune a random forest to predict the train set with 99% accuracy while reaching only 60% on the test set.
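As a sketch of how such a train/test comparison could be reported for any of the models, where fitted_model is a placeholder for one of the classifier objects fitted above (e.g. the k-nearest neighbors or random forest model):
#Sketch: compare in-sample and out-of-sample accuracy of a fitted model
#(fitted_model is a placeholder for one of the classifiers fitted above)
print('train accuracy:', round(accuracy_score(y_train, fitted_model.predict(X_train)), 3))
print('test accuracy:', round(accuracy_score(y_test, fitted_model.predict(X_test)), 3))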