Chapter 3 Sampling and Distributions

3.1 Seminar

In today’s seminar, we work with missing data. We will turn a numerical variable into a nominal data type. We then turn to distributions.

# clear the working environment
rm(list = ls())
# set the working directory
setwd("~/PUBLG100")

3.1.1 Loading a Dataset in CSV Format

In this seminar, we load a file in comma-separated values format (.csv). The load() function from last week works only for the native R file format. To load our CSV file, we use the read.csv() function.

Our data comes from the Quality of Government Institute. Let’s have a look at the codebook:

Download the data here

Variable      Description
h_j           1 if the country has a free judiciary
wdi_gdpc      Per capita wealth in US dollars
undp_hdi      Human development index (higher values = higher quality of life)
wbgi_cce      Control of corruption index (higher values = more control of corruption)
wbgi_pse      Political stability index (higher values = more stable)
former_col    1 if the country was a colony once
lp_lat_abst   Latitude of the country's capital divided by 90

# load the Quality of Government data
world.data <- read.csv("QoG2012.csv")

Go ahead and (1) check the dimensions of world.data, (2) check the names of the variables in the dataset, and (3) print the first six rows of the dataset.

# the dimensions: rows (observations) and columns (variables) 
dim(world.data)
[1] 194   7
# the variable names
names(world.data) 
[1] "h_j"         "wdi_gdpc"    "undp_hdi"    "wbgi_cce"    "wbgi_pse"   
[6] "former_col"  "lp_lat_abst"
# top 6 rows of the data
head(world.data)
  h_j   wdi_gdpc undp_hdi   wbgi_cce   wbgi_pse former_col lp_lat_abst
1   0   628.4074       NA -1.5453584 -1.9343837          0   0.3666667
2   0  4954.1982    0.781 -0.8538115 -0.6026081          0   0.4555556
3   0  6349.7207    0.704 -0.7301510 -1.7336243          1   0.3111111
4  NA         NA       NA  1.3267342  1.1980436          0   0.4700000
5   0  2856.7517    0.381 -1.2065741 -1.4150945          1   0.1366667
6  NA 13981.9795    0.800  0.8624368  0.7084046          1   0.1892222

3.1.2 Missing Values

Let’s inspect the variable h_j. It is categorical, where 1 indicates that a country has a free judiciary. We use the table() function to find the frequency in each category.

table(world.data$h_j)

  0   1 
105  64 

We now know that 64 countries have a free judiciary and 105 countries do not.

Conceptually the variable is nominal. To see how the variable is stored in R, we can use the str() function.

str(world.data$h_j)
 int [1:194] 0 0 0 NA 0 NA 0 0 1 1 ...

The function returns 'int', which abbreviates 'integer', i.e., a numeric type. The function also shows us the first 10 realisations of the variable. We see zeroes and ones, which are the two categories. We also see NAs, which abbreviates 'not available': NAs are missing values. Values can be missing for different reasons. For instance, a coder may have forgotten to code whether a country had been colonised at some point in its history, or the country may be new and the categories, therefore, don't apply. What matters for us is that we cannot calculate with NAs.
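
A quick illustration of that last point: any arithmetic involving NA returns NA, so a single missing value is enough to break a calculation.

# arithmetic with a missing value returns a missing value
1 + NA
[1] NA
# the same happens when we take the mean of a variable that contains NAs
mean(world.data$h_j)
[1] NA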

There are different ways of dealing with NAs. We will always delete missing values. Our dataset must maintain its rectangular structure. Hence, when we delete a missing value from one variable, we delete it for the entire row of the dataset. Consider the following example.

Row Variable1 Variable2 Variable3 Variable4
1 15 22 100 65
2 NA 17 26 75
3 27 NA 58 88
4 NA NA 4 NA
5 75 45 71 18
6 18 16 99 91

If we delete missing values from Variable1, our dataset will look like this:

Row Variable1 Variable2 Variable3 Variable4
1 15 22 100 65
3 27 NA 58 88
5 75 45 71 18
6 18 16 99 91

The new dataset is smaller than the original one. Rows 2 and 4 have been deleted. When we drop missing values from one variable in our dataset, we lose information on other variables as well. Therefore, you only want to drop missing values on variables that you are interested in. Let’s drop the missing values on our variable h_j. We do this in several steps.

First, we introduce the is.na() function. We supply a vector to the function and it checks for every element whether it is missing or not. R returns TRUE or FALSE. Let's use the function on our variable.

is.na(world.data$h_j)
  [1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
 [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
 [23]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [67]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[100]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
[111] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
[122] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE
[133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
[144] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
[155] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
[166] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
[177] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[188] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE

To see the amount of missingness in the variable h_j, we can combine is.na() with the table() function.

table( is.na(world.data$h_j) )

FALSE  TRUE 
  169    25 

So, we have 25 missing values on h_j. Our dataset has 194 rows. Check your global environment to confirm this or use the nrow() function. That means that if we drop all missing values from h_j, our dataset world.data will lose 25 rows.
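
For example, nrow() returns the number of rows directly:

nrow(world.data)
[1] 194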

Before we drop the missings, we introduce the which() function. It returns the row indexes (the rows in the dataset) where some condition is true. So if we use which() and is.na(), we get the row numbers in the world.data dataset where values are missing on h_j.

which( is.na( world.data$h_j ) )
 [1]   4   6  11  22  23  67 100 108 112 120 123 129 130 131 141 146 147
[18] 148 149 150 154 165 173 179 191

We said that our dataset will lose 25 rows. Let’s use the length() function to confirm that this is the case.

length( which( is.na( world.data$h_j ) ) ) 
[1] 25

We have, indeed, identified 25 rows that we want to delete from our dataset.

The function is.na() returns “TRUE” if an observation is missing. We can use the ! operator so that the function returns “TRUE” if an observation is not missing. The ! means not.

Let’s confirm this:

# true = observation is missing
table( is.na(world.data$h_j) )

FALSE  TRUE 
  169    25 
# true = observations is NOT missing
table( !is.na(world.data$h_j) )

FALSE  TRUE 
   25   169 

We now drop the rows with missings on h_j by overwriting our original dataset with a new dataset that is a copy of the old without the missings. We use the square brackets to subset our dataset.

# keep only the rows where h_j is not missing
world.data <- world.data[ !is.na( world.data$h_j ) , ] 

Confirm that our new world.data dataset has only 169 rows remaining.
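
One quick way to do this:

# number of rows after dropping the missings on h_j
nrow(world.data)
[1] 169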

“But what if we want our original dataset back,” you ask. We have overwritten the original. It is no longer in our work environment. We have to reload the data set from the disk.

Let’s do that:

world.data <- read.csv("QoG2012.csv")

Right, we now have all observations back. This is important. Let's say we need to drop the missings on a variable. We do so. If a later analysis does not involve that variable, we want all the observations back. Otherwise we would have thrown away valuable information. The smaller our dataset, the less information it contains. Less information will make it harder for us to detect systematic correlations. We have two options. Either we reload the original dataset, or we create a copy of the original under a different name that we can use later on. Let's do the latter.

# keep a copy of the full dataset under a different name
full.dataset <- world.data

Let’s drop missings on h_j in the world.data dataset.

world.data <- world.data[ !is.na( world.data$h_j ) , ] 

Now, if we want the full dataset back, we can overwrite world.data with full.dataset. The code would be the following:

world.data <- full.dataset

If you ran this line, delete the missings from h_j in world.data again before moving on.

This data manipulation may seem boring, but it is really important that you know how to do it. Most of the work in data science is not running statistical models but data manipulation. Most of the datasets you will work with in your jobs, as research assistants, or for your dissertation won't be cleaned for you. You will have to do that work. It takes time and is sometimes frustrating. That's unfortunately the same for all of us.

3.1.3 Factor Variables

Categorical/nominal variables can be stored as numeric variables in R. However, the values do not imply an ordering or relative importance. We often store nominal variables as factor variables in R. A factor variable is a nominal data type. The advantage of turning a variable into a factor type is that we can assign labels to the categories and that R will not calculate with the values assigned to the categories.
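
As a minimal illustration (using made-up values, not our seminar data): R happily averages a numeric 0/1 vector, but refuses to do so once the same values are stored as a factor.

# numeric zeroes and ones: R computes the mean
mean( c(0, 1, 1) )
[1] 0.6666667
# the same values as a factor: R returns NA (with a warning)
mean( factor(c(0, 1, 1)) )
[1] NA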

The function factor() lets us turn the variable into a nominal data type. The first argument is the variable itself. The second argument supplies the category labels and the third the levels that those labels correspond to. To know how they correspond, we have to scroll up and look at the codebook.

We also overwrite the original numeric variable h_j with our nominal copy, using the assignment arrow <-.

# factorize judiciary variable
world.data$h_j <- factor(world.data$h_j, labels = c("controlled", "free"), levels = c(0,1))

# frequency table of judiciary variable
table(world.data$h_j)

controlled       free 
       105         64 

3.1.4 Renaming Variables

We want to rename h_j into something more meaningful. The new name should be free.judiciary. We can use the names() function to get a vector of variable names.

names(world.data)
[1] "h_j"         "wdi_gdpc"    "undp_hdi"    "wbgi_cce"    "wbgi_pse"   
[6] "former_col"  "lp_lat_abst"

We want to change the first element of that vector. We know that we can use square brackets to subset vectors. Let’s display the first element of the vector of variable names only.

names(world.data)[1]
[1] "h_j"

Now we simply change the name using the assignment arrow <-; our new variable name goes in quotes.

names(world.data)[1] <- "free.judiciary"

We now check the variable names to confirm that we successfully changed the name.

names(world.data)
[1] "free.judiciary" "wdi_gdpc"       "undp_hdi"       "wbgi_cce"      
[5] "wbgi_pse"       "former_col"     "lp_lat_abst"   

3.1.5 Distributions

A marginal distribution is the distribution of a variable by itself. Let’s look at the summary statistics of the United Nations Development Index undp_hdi using the summary() function.

summary(world.data$undp_hdi)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.2730  0.5272  0.7455  0.6946  0.8350  0.9560       9 

How nice. This returns the summary stats. We see the range (minimum to maximum). We see the interquartile range (1st quartile to 3rd quartile). We also see the mean and the median. Finally, we see the number of NAs.

Oh, we forgot. We said that when we drop the missings on one variable, we may lose information that we need when we work with another variable. Let's restore our dataset world.data to its original state.

world.data <- full.dataset

Now, we check the summary stats again.

summary(world.data$undp_hdi)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.2730  0.5390  0.7510  0.6982  0.8335  0.9560      19 

In the smaller dataset (where we had dropped the missings on h_j), we had 9 missings on undp_hdi. Now, we have 19. The difference is 10, i.e., 10 of the 25 rows we dropped were missing on undp_hdi anyway. The other 15 dropped rows, however, had valid values on undp_hdi. Therefore, we would have thrown away 15 good observations. That is not nothing. It's roughly 8 percent of our data.
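
We can verify this directly in R. The line below counts the rows that are missing on h_j but have a valid undp_hdi in the full dataset; given the counts above, it should return 15.

# rows missing on h_j but observed on undp_hdi
sum( is.na(full.dataset$h_j) & !is.na(full.dataset$undp_hdi) )
[1] 15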

Let's drop the missings on undp_hdi and rename the variable to hdi.

# keep only the rows where undp_hdi is not missing
world.data <- world.data[ which( !is.na(world.data$undp_hdi) ) , ]

Let’s change the name.

names(world.data)[3] <- "hdi"
names(world.data)
[1] "h_j"         "wdi_gdpc"    "hdi"         "wbgi_cce"    "wbgi_pse"   
[6] "former_col"  "lp_lat_abst"

Let’s take the mean of hdi.

hdi.mean <- mean( world.data$hdi )
hdi.mean
[1] 0.69824

This is the mean of hdi in our sample. We would like to know the mean of hdi in the population. Remember that sampling variability causes us to estimate a different mean every time we take a new sample.

We learned that the means follow a distribution if we take the mean repeatedly in different samples. In expectation, the sample mean equals the population mean. But how certain are we about our estimate? Well, we need to know what the sampling distribution looks like.
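
We cannot actually go out and take many new samples, but as a small illustration (not required for the seminar), we can mimic the process by resampling our own data with replacement, a so-called bootstrap, and looking at the spread of the resulting means.

# draw 1000 bootstrap samples and store the mean of each
boot.hdi.means <- replicate( 1000, mean( sample(world.data$hdi, replace = TRUE) ) )
# the standard deviation of these means previews the standard error we compute next
sd(boot.hdi.means)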

To find this out, we estimate the standard error of the mean. The standard error is the standard deviation of the sampling distribution. We call it the standard error rather than the standard deviation to indicate that we are talking about the distribution of a statistic (the mean) and not of a random variable.

The formula for the standard error of the mean is:

\[ s_{\bar{x}} = \frac{ \sigma }{ \sqrt{n} } \]

The \(\sigma\) is the real population standard deviation of the random variable hdi which is unknown to us. We replace the population standard deviation with our sample estimate of it.

\[ s_{\bar{x}} = \frac{ s }{ \sqrt{n} } \]

The standard error of the mean estimate is then

se.hdi <- sd(world.data$hdi) / sqrt( nrow(world.data) )
se.hdi
[1] 0.01362411

Okay, so the mean is 0.69824 and the standard error of the mean is 0.0136241.

We know that the sampling distribution is approximately normal. That means that 95 percent of all observations are within 1.96 standard deviations (standard errors) of the mean.

\[ \bar{x} \pm 1.96 \times s_{\bar{x}} \]

So what is that in our case?

lower.bound <- hdi.mean - 1.96 * se.hdi
lower.bound
[1] 0.6715367
upper.bound <- hdi.mean + 1.96 * se.hdi
upper.bound
[1] 0.7249432

That now means the following. Were we to take samples of hdi again and again and again, then 95 percent of the time, the mean would be in the range from 0.6715367 to 0.7249432.

What is a probability? “The long-run relative frequency,” you all scream in unison. Given that definition, you can say: “With 95 percent probability, the mean is in the range 0.6715367 to 0.7249432.”

Some people like the former way of phrasing this relationship better than the latter. In that case, you can tell them: "a probability is the long-run relative frequency of an outcome," so the two statements amount to the same thing.

Now, let’s visualise our sampling distribution. We haven’t actually taken many samples, so how could we visualise the sampling distribution? Well, we know the sampling distribution looks normal. We know that the mean is our mean estimate in the sample. And finally, we know that the standard deviation is the standard error of the mean.

We can randomly draw values from a normal distribution with mean 0.69824 and standard deviation 0.0136241. We do this with the rnorm() function. Its first argument is the number of values to draw at random from the normal distribution. The second argument is the mean and the third is the standard deviation.

Recall that a normal distribution has two parameters that characterise it completely: the mean and the standard deviation. So with those two we can draw the distribution.

draw.of.hdi.means <- rnorm( 1000, mean = hdi.mean, sd = se.hdi )

We have just drawn 1000 mean values at random from the distribution that looks like our sampling distribution.

plot(
 density( draw.of.hdi.means ),
 bty = "n",
 main = "Sampling Distribution of HDI means"
)

Beautiful! Let's add the 95 percent confidence interval around our mean estimate. The confidence interval quantifies our uncertainty. We said that 95 percent of the time, the mean would be in the interval from 0.6715367 to 0.7249432.

abline( v = lower.bound, lty = "dashed")
abline( v = upper.bound,  lty = "dashed")

You do not need to run the plot function again. You can just add to the existing plot. Check the help page of abline() to see what its arguments refer to.
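
For instance:

# open the help page for abline()
?abline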

Fantastic! You can see that values below and above our confidence interval are quite unlikely. Those values in the tails would not occur often.

Not often, but not impossible.

Let's say that we wish to know the probability of taking a sample whose estimate of the mean is greater than or equal to 0.74. To get there, we first need the probability of a mean that is smaller than or equal to 0.74, i.e., the integral over the distribution from \(-\infty\) to 0.74. Fortunately, R has a function that does this for us: pnorm(). It calculates the probability of a value that is smaller than or equal to the value we specify. In other words, it gives us the probability from the cumulative normal.

As its first argument, pnorm() wants the value, 0.74 in our case. The second and third arguments are the mean and the standard deviation that characterise the normal distribution.

pnorm(0.74, mean = hdi.mean, sd = se.hdi)
[1] 0.9989122

What!? The probability of drawing a mean of 0.74 is 99.9 percent!? That cannot be; the value is far in the tail of the distribution.

Well, this is the cumulative probability of drawing a value that is equal to or smaller than 0.74. All probabilities sum to 1. So if we want to know the probability of drawing a value that is greater than 0.74, we subtract the probability we just calculated from 1.

1 - pnorm(0.74, mean = hdi.mean, sd = se.hdi)
[1] 0.001087785

Right, so the probability of getting a sample mean of hdi that is 0.74 or larger is roughly 0.1 percent.
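
As an aside, pnorm() can also return this upper-tail probability directly through its lower.tail argument, which should give the same result as subtracting from 1.

# upper-tail probability: P(mean >= 0.74)
pnorm(0.74, mean = hdi.mean, sd = se.hdi, lower.tail = FALSE)
[1] 0.001087785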

3.1.6 Conditional Distributions

Let’s look at hdi by former_col. The variable former_col is 1 if a country is a former colony and 0 otherwise. The variable hdi is continuous.

Before we start, we plot the marginal pdf of hdi.

plot(
  density(world.data$hdi),
  bty = "n",
  main = "Marginal Distribution of HDI"
)

The distribution is bimodal. There is one peak at the higher development end and one peak at the lower development end. Could it be that these two peaks are conditional on whether a country was colonised or not? Let’s plot the conditional distributions.

plot(
  density(world.data$hdi[world.data$former_col == 0]),
  bty = "n",
  main = "Conditional Distributions of HDI"
)
lines(density(world.data$hdi[world.data$former_col == 1]), lty = "dashed")
legend("topleft", c("not colonised", "colonised"), lty = c("solid", "dashed"))

It's not quite what we expected. The distribution of human development for countries that were not colonised is shifted to the right of the distribution for colonised countries, and it is clearly narrower. In other words, the distribution for former colonies has a greater variance: evidently, some former colonies are doing very well and some are doing very poorly. It seems like knowing whether a country was colonised tells us something about its likely development, but not enough. We cannot, e.g., say that colonisation is the reason why countries do poorly. Probably, there are differences among the types of colonial institutions that the colonisers set up.

Let’s move on and examine the probability that a country has .8 or more on hdi given that it is a former colony.

We can get the cumulative probability with the ecdf() function. It returns the empirical cumulative distribution, i.e., the cumulative distribution of our data. We know that we can subset using square brackets. That’s all we need.

cumulative.p <- ecdf(world.data$hdi[ world.data$former_col == 1 ])
1 - cumulative.p(.8)
[1] 0.1666667

Okay, the probability that a former colony scores .8 or higher on the hdi is roughly 16.7 percent. Go ahead and figure out the corresponding probability for countries that are not former colonies on your own.

3.1.7 Exercises

  1. Create a script and call it assignment03. Save your script.
  2. Load the world.data dataset from your disk.
  3. Rename the variable wdi_gdpc into gdpc.
  4. Delete missing values from gdpc.
  5. Inspect former_col and delete missing values from it.
  6. Turn former_col into a factor variable with the appropriate labels.
  7. Compute the probability that a country is richer than 55 000 US dollars per capita.
  8. Compute the same probability given that a country is a former colony.
  9. Compute the conditional expectation of wealth (gdp per capita) for a former colony.
  10. Compute the conditional expectation of wealth for a country that is not a former colony.
  11. What is the probability that a former colony is 2 standard deviations below the mean wealth level?
  12. What is the corresponding probability for a country that has not been colonised?
  13. Compute the probability that a former colony is in the wealth interval from 25 000 to 31 000.
  14. Compute the probability that a country that is not a former colony is in the top 2.5 percent of the wealth distribution.
  15. At which wealth level is a country in the bottom 2.5 percent of the wealth distribution?