Chapter 5 Revision: Sample Variance and Sample Standard Deviation; Hypothesis testing and Confidence Intervals

In this seminar, we revise the concepts of the standard deviation, the sampling variance, hypothesis testing and confidence intervals. We will also learn some more data manipulation and working with the random number generator.

5.1 Seminar

Let’s remove all objects from our workspace and set the working directory.

rm(list=ls())
setwd("~/statistics1")

5.1.1 Sample Variance and Sample Standard Deviation

The sample variance and sample standard deviation inform us about the degree of variability of our data. Suppose, we were to roll the dice 5 times. We could then compute the mean value that we roll. The sample standard deviation measures by how much an average roll of the dice will deviate from the mean value.

We start rolling the dice, using R’s random number generator and the runif() function. The function randomly draws numbers from a uniform distribution. In a uniform distribution each value has the same probability of being drawn. All six sides of a die should be equally likely if the die is fair. Hence, the uniform distribution.

runif() takes three arguments. n is the number of values to be drawn. min is the minimum value and max is the maximum value.

# random draw of 10 values from a uniform distribution
dice <- runif(n = 10, min = 1, max = 7)
dice
 [1] 1.999577 1.230885 6.855672 3.836695 4.213343 1.285208 5.585894
 [8] 5.195535 3.157857 5.577164

We have indeed drawn 10 numbers but they are not integers as we would like—we want to simulate a die, so the values should be 1, 2, 3, 4, 5 or 6. We will return to this in a moment but for now let’s return to the randomness. Let’s draw 10 numbers again:

# random draw of 10 values from a uniform distribution
dice2 <- runif(n = 10, min = 1, max = 7)

# first draw
dice
 [1] 1.999577 1.230885 6.855672 3.836695 4.213343 1.285208 5.585894
 [8] 5.195535 3.157857 5.577164
# second draw
dice2
 [1] 2.746356 1.728629 5.308625 3.443253 6.641089 6.375807 2.991323
 [8] 2.320698 4.432030 6.385851

The numbers of the first and second roll differ because we have drawn values at random. To make our results replicate and to ensure that everyone in the seminar works with the same numbers, we set R’s random number generator with the set.seed() function. As argument we plug in some arbitrary value (it does not matter which but using a different one will lead to a different quasi-random draw).

# set random number generator
set.seed(123)

# random draw of 10 values from a uniform distribution
dice <- runif(n = 10, min = 1, max = 7)
dice
 [1] 2.725465 5.729831 3.453862 6.298104 6.642804 1.273339 4.168633
 [8] 6.354514 4.308610 3.739688

You should all have the same values. If not, run set.seed() again and then do the random draw once. If you do it more than once, the numbers will change. Let’s see how this works:

# set random number generator
set.seed(123)

# 1st random draw of 10 values from a uniform distribution
dice <- runif(n = 10, min = 1, max = 7)
dice
 [1] 2.725465 5.729831 3.453862 6.298104 6.642804 1.273339 4.168633
 [8] 6.354514 4.308610 3.739688
# 2nd random draw of 10 values from a uniform distribution
dice2 <- runif(n = 10, min = 1, max = 7)
dice2
 [1] 6.741000 3.720005 5.065424 4.435800 1.617548 6.398950 2.476526
 [8] 1.252357 2.967524 6.727022
# reset random number generator
set.seed(123)

# 3rd random draw of 10 values from a uniform distribution
dice3 <- runif(n = 10, min = 1, max = 7)
dice3
 [1] 2.725465 5.729831 3.453862 6.298104 6.642804 1.273339 4.168633
 [8] 6.354514 4.308610 3.739688
# 4th random draw of 10 values from a uniform distribution
dice4 <- runif(n = 10, min = 1, max = 7)
dice4
 [1] 6.741000 3.720005 5.065424 4.435800 1.617548 6.398950 2.476526
 [8] 1.252357 2.967524 6.727022

As you can see, the the draws from dice and dice3 are the same and the draws from dice2 and dice4 are the same as well. Let’s make the values integers with the as.integer() function which simply cuts off all decimal places.

# reset random number generator
set.seed(123)
# random draw of 10 numbers from a uniform distribution with minimum 1 and maximum 7
dice <- runif(10, 1, 7)
# cut off decimals places
dice <- as.integer(dice)
dice
 [1] 2 5 3 6 6 1 4 6 4 3
# frequency of dice rolls
table(dice)
dice
1 2 3 4 5 6 
1 1 2 2 1 3 

We have rolled a six relatively often. All sides should be equally likely but due to sampling variability, we have rolled the six most often. The expected value of a die is 3.5. That is: \[ 1 \times \frac{1}{6} + 2 \times \frac{1}{6} + 3 \times \frac{1}{6} + 4 \times \frac{1}{6} + 5 \times \frac{1}{6} + 6 \times \frac{1}{6} = 3.5\]

We compute the mean in our sample and the standard deviation. Let’s start with the mean. Do so yourself but do not use the in-built function.

dice.sum <- dice[1] + dice[2] + dice[3] + dice[4] + dice[5] + dice[6] + dice[7] + dice[8] + dice[9] + dice[10]
dice.mean <- dice.sum / 10  
dice.mean
[1] 4

We would have gotten the same result from the mean() function.

The sample standard deviation tells by how much an average roll of the dice differs from the estimated sample mean. Estimate the sample standard deviation on your own without using the sd() function. Do not copy. Type everything yourself.

numerator <- ( (dice[1] - dice.mean)^2 + (dice[2] - dice.mean)^2 + (dice[3] - dice.mean)^2  +
                 (dice[4] - dice.mean)^2 + (dice[5] - dice.mean)^2 + (dice[6] - dice.mean)^2 +
                 (dice[7] - dice.mean)^2 + (dice[8] - dice.mean)^2 + (dice[9] - dice.mean)^2 +
                 (dice[10] - dice.mean)^2 )
std.dev <- sqrt( (numerator / 9) )
std.dev
[1] 1.763834

An average deviation from the sample mean is 1.76.

5.1.2 T test for the sample mean

Our estimate of the mean is 4. The expected value is 3.5. Is this evidence that the die is loaded? The null hypothesis is that the die is fair. The alternative hypothesis is that the die is loaded.

Answer this question on your own by computing and interpreting the t value.

# standard error of the sample mean
std.err <- std.dev / sqrt(10)

# t value
t.value <- (4 - 3.5) / std.err
t.value  
[1] 0.8964215

Clearly, we cannot reject the null hypothesis. The estimated difference between our sample mean and the population mean is 0.5. This value is 0.9 standard errors away from the population mean under the null hypothesis (3.5). We do not know the critical value for the t distribution here because we are in a small sample. However, it must be bigger than 1.96. Our t value is clearly smaller, therefore, we cannot reject the null hypothesis. That is not surprising, given that we have drawn from a uniform distribution, i.e., our sampling process was fair by definition.

Let’s back up. The standard error quantifies the average difference between mean estimates that are due to sampling variability (chance). Put differently, the standard error approximates the average difference between the population mean and our sample estimate.

Now. Compute the variance of the mean on your own. Try yourself before you check the code.

# sample variance
variance <- numerator / 9

# variance of the mean
var.mean <- variance / 10
var.mean
[1] 0.3111111

The correct answer is 0.31. Now, compute the standard error of the mean from your variance of the mean. Try yourself.

\[ s^2_{\bar{X}} = \frac{s_{x}^2}{n} = (\sqrt{ \frac{s_{x}}{\sqrt{n}}})^2 \]

# variance of the mean
sqrt(var.mean)
[1] 0.5577734

Let’s estimate the p value from a t distribution with the correct amount of degrees of freedom.

p.value <- (1 - pt(t.value, df = 9))*2
p.value
[1] 0.3933733

The probability that we roll a fair die 10 times and get a sample mean of 4 is 39 percent. That is not unlikely at all. We are far from rejecting the null hypothesis.

We now construct the confidence interval around our mean estimate. To re-cap, we construct the confidence interval as:

\[ \bar{Y} \pm \mathrm{critical \: value} \times SE(\bar{Y}) \]

First, we need the critical value (for an alpha level of 0.05). We get the critical value from the quantile function of the t distribution with \(n-1\) degrees of freedom.

qt(p = 0.975, df = 9)
[1] 2.262157

where p is the cumulative probability and df is the degrees of freedom. The critical value in a t distribution with 9 degrees of freedom is 2.262157.

Construct the lower and upper bounds of the confidence interval yourself.

# lower bound
lb <- dice.mean - qt(p = 0.975, df = 9) * std.err
# upper bound
ub <- dice.mean + qt(p = 0.975, df = 9) * std.err

# confidence interval
lb
[1] 2.738229
ub
[1] 5.261771

With 95 percent probability (where the probability is the long-run relative frequency), the population mean is within the confidence interval.

Clearly, the population mean under the assumption that the null hypothesis is true (3.5), is within this interval.

Run a t test using the t.test() function yourself.

t.test(dice,
       mu = 3.5,
       conf.level = .95)

    One Sample t-test

data:  dice
t = 0.89642, df = 9, p-value = 0.3934
alternative hypothesis: true mean is not equal to 3.5
95 percent confidence interval:
 2.738229 5.261771
sample estimates:
mean of x 
        4 

The t.test() function is more convenient than estimating everything by hand. The results are the same.

5.1.3 T test for the difference in means

We will now test the difference between two dice. First, remove everything from the workspace on your own.

# remove everything from the workspace
rm(list = ls())

Set the random number generator to 1234.

set.seed(1234)

Generate a vector of 100 rolls of a fair dice on your own and show the results in a frequency table.

fair.die <- runif(n = 100, min = 1, max = 7)
fair.die <- as.integer(fair.die)
table(fair.die)
fair.die
 1  2  3  4  5  6 
21 28  6 20 13 12 

The absolute number are not that great to see the distribution of values at a glance. Create a table that shows the proportions of each outcome.

table(fair.die) / sum(table(fair.die))
fair.die
   1    2    3    4    5    6 
0.21 0.28 0.06 0.20 0.13 0.12 

Calculate the mean.

mean(fair.die)
[1] 3.12

Although, the proportions look like we rolled the lower numbers way too often, we are not too far away from the population mean (3.5).

Now, we create a loaded die. We draw randomly from a normal distribution with mean 4.5 and standard deviation 1.5.

loaded.die <- rnorm(n = 100, mean = 4.5, sd = 1.5)
loaded.die <- as.integer(loaded.die)
table(loaded.die)
loaded.die
 1  2  3  4  5  6  7  8 
 2 13 19 28 23  8  6  1 

Oops, we rolled 6 seven’s and 1 eight. Let’s change these to sixes. We use the square brackets to index the elements of the loaded.die vector. We use the which() function to identify the elements that are greater than 6. Finally, we change these values to 6.

loaded.die[ which(loaded.die > 6) ] <- 6
table(loaded.die)
loaded.die
 1  2  3  4  5  6 
 2 13 19 28 23 15 

Now, estimate the 2 means (of the fair die and the loaded die) yourself. Next, estimate the difference in means. You may use the mean() function.

# mean in the 2 groups
mean(fair.die)
[1] 3.12
mean(loaded.die)
[1] 4.02
# first difference (difference in means)
fd <- mean(fair.die) - mean(loaded.die)
fd
[1] -0.9

The difference suggests that we roll larger values with the loaded die. Is the difference statistically detectable? To find out, we carry out the t test for the difference in means. Our dependent interval scaled variable is the value of the roll of the dice. The our independent binary variable is whether the die is loaded or not.

Compute the standard error for the difference in means on your own.

\[ SE(Y_{X=0} - Y_{X=1}) = \sqrt{ \frac{s_{Y_{X=0}}^2}{n_{X=0}} + \frac{s_{Y_{X=1}}^2}{n_{X=1}}} \]

where \(s_{Y_{X=0}}^2\) is the variance in the first group and \(s_{Y_{X=1}}^2\) is the variance in the second group. The number of observations in the first group is \(n_{X=0}\) and the number of observations in the second group is \(n_{X=1}\).

The result is 0.2161462. Try until you get it. You may use the var() function.

# standard error of the first difference
se.fd <- sqrt( ((var(fair.die) / 100) + (var(loaded.die) / 100)) )
se.fd
[1] 0.2161462

Now, that we have the standard error of the difference in means, compute the t statistic on your own (without using the t.test() function).

# standard error of the first difference
t.value <- (mean(fair.die) - mean(loaded.die)) / se.fd
t.value
[1] -4.163848

The t value is large and our sample size is also large. We can take the critical from a normal distribution. For an alpha level of 0.05, the critical value is \(1.96\).

Construct the confidence interval for the difference in means yourself.

# lower bound
lb <- fd - 1.96 * se.fd
# upper bound
ub <- fd + 1.96 * se.fd

lb
[1] -1.323647
ub
[1] -0.4763534

The confidence interval for the difference in means ranges from -1.32 to -0.48. Clearly, the difference in means is smaller than 0. We can reject the null hypothesis that there is no difference between the loaded die and the fair die because 0 is not within the confidence interval.

Let’s estimate the p value: i.e., the probability that we estimate a difference in means of -0.9 given that there really is no difference between the fair die and the loaded die.

The result is 3.129287e-05. Try until you get it. You can use the pnrom() function to get the cumulative probability from a standard normal distribution.

# 1st way
pnorm( t.value ) * 2
[1] 0.00003129287
# 2nd way
(1 - pnorm( abs(t.value) )) *2 
[1] 0.00003129287

We can reject the null hypothesis. The loaded die is different from the fair die.

5.1.4 Exercises

  1. Create a vector of a fair die and a vector of a loaded die with (25) observations such that you cannot distinguish the dice with a difference in means test. Carry out the t-test.
  2. Re-create the dice but increase the number of observations to 1000. Does the result of the t test change?
  3. Ordinary Economic Voting Behavior in the Extraordinary Election of Adolf Hitler Download and then load who_voted_nazi_in_1932.csv. who voted nazi in 1932 data
    Codebook:
Variable Description
sharenazis Percent of the vote Nazis received in the district
nazivote Number of Nazi votes
nvoter Total number of eligible voters
shareblue Percent of blue-collar potential voters
sharewhite Percent of white-collar potential voters
shareself Percent of self-employed potential voters
sharedomestic Percent of domestically employed potential voters
shareunemployed Percent of unemployed potential voters
  1. Estimate the means and illustrate the distribution of potential voter shares by class.
  2. Estimate the mean vote shares for the Nazis in districts where the share of blue-collar voters was above the mean (30.82) and below it.
  3. Construct confidence intervals around the means.
  4. Are there differences between the groups? Use the appropriate statistical test to find out.
  5. Calculate t values and p values on your own (without using the t test)
  6. Interpret your results substantially.
  7. Estimate the mean vote shares for the Nazis in districts where the share of white-collar voters was above the mean (11.423) and below it.
  8. Construct confidence intervals around the means.
  9. Are there differences between the groups? Use the appropriate statistical test to find out.
  10. Calculate t values and p values on your own (without using the t test)
  11. Interpret your results substantially.