Statistics 1

1.2 Solutions

1.2.1 Exercise 3

Calculate the square root of 1369 using the sqrt() function.

sqrt(1369)

[1] 37

1.2.2 Exercise 4

Square the number 13 using the ^ operator.

13^2

[1] 169

1.2.3 Exercise 5

What is the result of summing all numbers from 1 to 100?

# sequence of numbers from 1 to 100 in steps of 1
numbers_1_to_100 <- seq(from = 1, to = 100, by = 1)
# sum over the vector
result <- sum(numbers_1_to_100)
# print the result
result

[1] 5050

The result is 5050.

1.2.4 Exercise 6

Create the variable income with the values form our Berlin sample in R.

# create the income variable using the c() function
income <- c(
  19395, 22698, 40587, 25705, 26292, 42150, 29609, 12349, 18131, 
  20543, 37240, 28598, 29007, 26106, 19441, 42869, 29978, 5333, 
  32013, 20272, 14321, 22820, 14739, 17711, 18749
)

1.2.5 Exercise 7

Describe Berlin income using the appropriate measures of central tendency and dispersion.

We use the mean for the central tendency of income. The variable is interval scaled and the mean is the appropriate measure of central tendency for interval scaled variables. Our income variable is also normally distributed. Income distributions in most countries are right skewed. Therefore, the central tendency of income is often described using the median.

When asked, e.g., in an exam, to describe the central tendency of an interval scaled variable, use the mean. You can also use the median if you tell us why.

# central tendency of income
mean(income)

[1] 24666.24

# dispersion
sd(income)

[1] 9467.383

Average income in our Berlin sample is 24666.24. The average difference from that value is 9467.38.

1.2.6 Exercise 8

Compute the average deviation without using the sd() function.

We do this in several steps. First, we compute the mean.

mean.income <- sum(income) / length(income)

# let's print the mean
mean.income

[1] 24666.24

Second, we take the differences between each individual realisation of income and the mean of income. The result must be a vector with the same amount of elements as the income vector.

# individual differences between each realisation of income and the mean of income
diffs.from.mean <- income - mean.income

# let's print the vector of differences
diffs.from.mean

 [1]  -5271.24  -1968.24  15920.76   1038.76   1625.76  17483.76   4942.76
 [8] -12317.24  -6535.24  -4123.24  12573.76   3931.76   4340.76   1439.76
[15]  -5225.24  18202.76   5311.76 -19333.24   7346.76  -4394.24 -10345.24
[22]  -1846.24  -9927.24  -6955.24  -5917.24

You may be surprised that this works. After all, income is a vector with 25 elements and mean.income is a scalar (only one value). R treats all variables as vectors. It notices that mean.income is a shorter vector than income. The former has 1 element and the latter 25. The vector mean.income is recycled, so that it has the same length as income where each element is the same: the mean of income. If you did not understand this don’t worry. The important thing is that it works.

Our next step is to square the differences from the mean.

# square each element in the diffs.from.mean vector
squared.diffs.from.mean <- diffs.from.mean^2

# print the squared vecto
squared.diffs.from.mean

 [1]  27785971   3873969 253470599   1079022   2643096 305681864  24430876
 [8] 151714401  42709362  17001108 158099441  15458737  18842197   2072909
[15]  27303133 331340472  28214794 373774169  53974882  19309345 107023991
[22]   3408602  98550094  48375363  35013729

We squared each individual element in the vector. Therefore, our new variable squared.diffs.from.mean still has 25 elements.

Squaring a value does two things. First, all values in our vector have become positive. Second, the marginal increase increases with distance, i.e., values that are close to the mean are only somewhat larger whereas values that are further from the mean become way larger. To see this, lets plot the square (we haven’t shown you the plot function yet, but we will do this next seminar).

# a vector of x values from negative 100 to positive 100
a <- seq(from = -100, to = 100, length.out = 200)

# the square of that vector
b <- a^2

# we plot the input vector a against b, where b is on the y-axis
plot(
  x = a, # x-axis values
  y = b, # y-axis values
  bty = "n", # no border around plot
  type = "l", # connect individual dots to a line
  xlab = "input values from vector a", # x axis label
  ylab = "b = a^2" # y axis label
)

In this plot, you should see that the slope of the line increases, the further we are from 0. We are taking individual differences from the mean. Hence, if a value is exactly at the mean, the difference is zero. The further, the value is from the mean (in any direction), the larger the output value.

We will sum over the individual elements in the next step. Hence, values that are further from the mean have a larger impact on the sum than values that are closer to the mean.

In the next step, we take the sum over our squared deviations from the mean

# sum over squared deviations vector
sum.of.squared.deviations <- sum(squared.diffs.from.mean)

# print the sum
sum.of.squared.deviations

[1] 2151152127

By summing over all elements of a vector, we end up with a scalar. The sum is 2151152126.56.

We divide the sum of squared deviations by \(n-1\). Recall, that \(n\) is the number of observations (elements in the vector) and \(-1\) is our sample adjustment.

# get the variance
var.income <- sum.of.squared.deviations / ( length(income) - 1 )

# print the variance
var.income

[1] 89631339

The squared average deviation from mean income is 89631338.61.

In the last step, we take the square root over the variance to return to our original units of income.

# get the standard deviation
sqrt(var.income)

[1] 9467.383

The average deviation from mean income in Berlin (24666.24) is 9467.38.

1.2.7 Exercise 9

What is the level of measurement of the variable in the Sunday Question?

The variable measures vote choice. The answers are categories, the parties, without any specific ordering. The level of measurement is called categorical or nominal.

1.2.8 Exercise 10

Take the most recent poll and describe what you see in terms of central tendency and dispersion.

The most recent poll was carried out by Infratest/dimap on Thursday, 6 September. The most common value, the mode, is the appropriate measure of central tendency. Christian Democrat (CDU/CSU) is the modal category. Dispersion of a categorical variable is the proportion in each category which we see displayed on the website:

Party	Proportion
CDU/CSU	0.29
SPD	0.18
GREEN	0.14
FDP	0.08
THE LEFT	0.10
AFD	0.16
other	0.05