For today’s discussion, we’ll be going over some important concepts from POL 211. You can access the discussion material for POL 211 through this link if you would like to review it.
The contemporary world is full of information, and it is becoming easier and easier to transform that information into data. Data analysis is about making good use of data for certain purposes, such as to predict or to make inferences about the world. For these tasks, a first step is to have a sound understanding of how the data was generated, i.e., which processes led to its creation. The analytical strategy for data analysis will be highly dependent on which processes have generated the data under scrutiny.
Overall, in the following methods sequence classes, we will only consider probability-based (stochastic) data generation processes. In a probability-based DGP, data is generated by a probabilistic process, i.e., each element of the population is selected with a (known) probability to form part of the sample data. One possible way to achieve this is by using a random mechanism when generating the data.
Example:
Research Question: What percentage of U.S. adults support the death penalty?
Steps in Statistical Investigation:
Produce Data: Determine what to measure, then collect the data. The poll selected 1,082 U.S. adults at random. Each adult answered this question: “Do you favor or oppose the death penalty for a person convicted of murder?”
Explore the Data: Analyze and summarize the data. In the sample, 65% favored the death penalty.
Draw a Conclusion: Use the data, probability, and statistical inference to draw a conclusion about the population.
However, there is a cost that has to be taken into account: by drawing a sample we have not observed all of the individuals in the population. It is necessary, then, to account for the uncertainty implied in the sampling process that generated the data. In addition to the sampling uncertainty, there are other sources of uncertainty that would lead us to assume our data is generated by a stochastic DGP: theoretical and fundamental uncertainty.
Let’s consider a scenario where we are not only interested in understanding the percentage of adults who support the death penalty in the U.S. but also whether there is a difference in support based on political affiliation, particularly Republicans. In this case, we have a theoretical framework that focuses on the relationship between a person’s political alignment and their stance on the death penalty. However, it’s possible that other factors, such as how individuals value a person’s life, could also influence their views on the death penalty. Unfortunately, we didn’t include these additional factors in our theory, which could potentially impact our data collection. Consequently, the results derived from our sample data might be subject to theoretical uncertainty. Additionally, there is the possibility of fundamental uncertainty within our sample data. This could arise when respondents pay minimal attention to the survey questions and opt for random responses regarding their opinions on the death penalty. Such factors are beyond the control of researchers and can introduce further uncertainty into our findings.
In summary, statistical inferences always involve an element of uncertainty. As a researcher, our aim is to account for this uncertainty by incorporating probability statements as an integral part of our conclusions in our findings.
How do we model a stochastic data generating process? Any time we want to answer a research question that involves using a sample to draw a conclusion about some larger population, we need to answer questions like “how likely is it that we get a sample proportion of 65% of people supporting the death penalty?” or “what is the probability of…?”. To answer such questions, we need to use probability to quantify the uncertainty of the outcomes of interest.
Just a reminder: when we employ a probability distribution to model a stochastic DGP, we are essentially playing God, asserting that the data for the variable of interest is generated by this probability distribution. However, in real life, what we can directly observe is the empirical distribution (or frequency distribution) derived from our available sample data. We make the assumption that this empirical distribution from the sample data reflects the underlying theoretical probability distribution.
Let’s now review some specially named discrete and continuous probability distributions, such as the Bernoulli distribution, binomial distribution, uniform distribution, and normal distribution.
The Bernoulli distribution is a type of discrete probability distribution for experiments that ask a question that can be answered only with yes or no. In other words, the random variable can be 1 with a probability \(p\) or it can be 0 with a probability \((1 - p)\). Such an experiment is called a Bernoulli trial. A pass-or-fail exam can be modeled by a Bernoulli distribution. The Bernoulli distribution can also be used to derive the binomial distribution, geometric distribution, and negative binomial distribution.
For example, suppose there is an experiment where you flip a coin that is fair. If the outcome of the flip is heads then you will win. This means that the probability of getting heads is \(p = \frac{1}{2}\). If \(X\) is the random variable following a Bernoulli Distribution, we get \(P(X = 1) = p = \frac{1}{2}\). We can also write it as \(X \sim Bernoulli(p)\) or just \(X \sim B(p)\).
Probability Mass Function (PMF)
A Bernoulli distribution is a discrete probability distribution where \(Y=\left \{0, 1 \right \}\), its PMF is defined by:
\[ f(y) = \begin{cases} 1 - p & \text{if } \ \ y = 0 \\ p & \text{if } \ \ y = 1 \\ \end{cases} \] A Bernoulli distribution only has one parameter, \(p\). It is a special case of the Binomial distribution when the number of trials = 1.
Cumulative Distribution Function (CDF)
\[ F(y) = \begin{cases} 0 & \text{if } \ \ y < 0 \\ 1-p & \text{if } \ \ 0 \leq y < 1\\ 1 & \text{if } \ \ y \geq 1 \end{cases} \]
Expected Value and Variance
Expected Value: \(E(Y) = p\)
Variance: \(Var(Y) = p(1-p)\)
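As a quick sanity check (a minimal sketch, not part of the original example), we can simulate Bernoulli draws in R with rbinom() using size = 1 and compare the sample mean and variance with the theoretical values \(p\) and \(p(1-p)\):

## Simulate 10,000 Bernoulli(p = 0.5) draws; a Bernoulli trial is a Binomial with one trial
set.seed(1234)
p <- 0.5
y <- rbinom(n = 10000, size = 1, prob = p)
mean(y) ## should be close to E(Y) = p = 0.5
var(y)  ## should be close to Var(Y) = p(1 - p) = 0.25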
Suppose we repeat a Bernoulli experiment \(n\) times independently and we add up the outcomes. That is, suppose that our random variable is \(Y = X_1 + X_2 + · · · + X_n\), where \(X_i \sim Bernoulli(p)\) and the \(X_i\) are independent. Then \(Y\) is said to have a Binomial distribution with sample size \(n\) and probability of success \(p\), denoted \(Y \sim Binomial(n, p)\). As you can see, a Binomial distribution has two parameters, \(n\) and \(p\).
Probability Mass Function (PMF)
\[ f(y) = \begin{cases} \binom{n}{y}p^y(1-p)^{n-y} & \text{if } \ \ y \in \left \{0, 1, 2, ..., n \right \} \\ 0 & \text{otherwise } \end{cases} \]
where:
\(n\) is the number of trials (occurrences)
\(y\) is the number of successful trials
\(p\) is the probability of success in a single trial
\(\binom{n}{y}\) is the combination of \(n\) and \(y\). A combination is the number of ways to choose a sample of \(y\) elements from a set of \(n\) distinct objects where order does not matter, and replacements are not allowed. Note that \(\binom{n}{y} = \frac{n!}{y!(n-y)!}\), where \(!\) is factorial (so, \(4! = 4 \times 3 \times 2 \times 1\)).
Cumulative Distribution Function (CDF)
\[F(y) = P(Y \leq y) = \sum_{k=0}^{y} \binom{n}{k}p^k(1-p)^{n-k} \]
Expected Value and Variance
Expected Value: \(E(Y) = np\)
Variance: \(Var(Y) = np(1-p)\)
Binomial Distribution in R
We can also use R to do the calculation for Exercise 2. Here are some base R functions to do it:
Function | Description |
---|---|
`dbinom(x, size, prob)` | Find the probability of getting a certain number of successes (`x`) in a certain number of trials (`size`) where the probability of success on each trial is fixed (`prob`). |
`pbinom(q, size, prob)` | Return the value of the cumulative distribution function (CDF) of the binomial distribution given a certain value `q`, number of trials (`size`), and probability of success on each trial (`prob`). |
`qbinom(p, size, prob)` | The inverse of the `pbinom()` function. It takes a probability value `p` and returns the quantile that corresponds to that probability. |
`rbinom(n, size, prob)` | Generate a vector of binomially distributed random variables given a vector length `n`, number of trials (`size`), and probability of success on each trial (`prob`). |
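For instance (a small illustration with made-up numbers, not tied to any particular exercise), suppose \(Y \sim Binomial(10, 0.5)\), e.g., the number of heads in 10 fair coin flips:

## P(Y = 6): same as choose(10, 6) * 0.5^6 * (1 - 0.5)^4
dbinom(x = 6, size = 10, prob = 0.5)

## P(Y <= 6)
pbinom(q = 6, size = 10, prob = 0.5)

## Simulated draws should roughly match E(Y) = np = 5 and Var(Y) = np(1 - p) = 2.5
set.seed(1234)
y <- rbinom(n = 10000, size = 10, prob = 0.5)
mean(y)
var(y)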
In statistics, uniform distribution is a term used to describe a form of probability distribution where every possible outcome has an equal likelihood of happening. The probability is constant since each variable has equal chances of being the outcome. There are two types of uniform distribution:
Discrete uniform distribution
A discrete uniform distribution is a statistical distribution where the probability of outcomes is equally likely and with finite values. A good example of a discrete uniform distribution would be the possible outcomes of rolling a 6-sided die. The possible values would be 1, 2, 3, 4, 5, or 6. In this case, each of the six numbers has an equal chance of appearing. Therefore, each time the 6-sided die is thrown, each side has a chance of \(\frac{1}{6}\). The number of values is finite. It is impossible to get a value of 1.3, 4.2, or 5.7 when rolling a fair die.
Continuous uniform distribution
Not all uniform distributions are discrete; some are continuous. A continuous uniform distribution (also referred to as rectangular distribution) is a statistical distribution with an infinite number of equally likely measurable values. Unlike discrete random variables, a continuous random variable can take any real value within a specified range.
A continuous uniform distribution usually comes in a rectangular shape. A good example of a continuous uniform distribution is an idealized random number generator. With continuous uniform distribution, just like discrete uniform distribution, every variable has an equal chance of happening. However, there is an infinite number of points that can exist. Today, we will focus on continuous uniform distribution.
Probability Density Function (PDF)
A continuous random variable \(X\) has a uniform distribution, denoted \(X \sim Uniform \left [a, b\right ]\), if its probability density function is:
\[ f(x) = \begin{cases} \frac{1}{b-a}& \text{if } x \in [a, b] \\ 0 & \text{otherwise} \end{cases} \]
for two constants \(a\) and \(b\), such that \(a<x<b\). A graph of the PDF looks like this:
Note that the length of the base of the rectangle is \((b-a)\), while the length of the height of the rectangle is \(\frac{1}{b-a}\). Therefore, as should be expected, the area under \(f(x)\) and between the endpoints \(a\) and \(b\) is 1. Additionally, \(f(x)>0\) over the support \(a<x<b\). Therefore, \(f(x)\) is a valid probability density function.
Cumulative Distribution Function (CDF)
The cumulative distribution function of a continuous uniform random variable is:
\[F(x) = \begin{cases} 0 & \text{if } \ \ x < a \\ \frac{x-a}{b-a}& \text{if } \ \ x \in [a, b] \\ 1 & \text{if } \ \ x > b \\ \end{cases}\] A graph of the CDF looks like this:
As the picture illustrates, \(F(x)=0\) when \(x\) is less than the lower endpoint of the support (\(a\), in this case) and \(F(x)=1\) when \(x\) is greater than the upper endpoint of the support (\(b\), in this case). The slope of the line between \(a\) and \(b\) is, of course, \(\frac{1}{b-a}\).
Expected Value and Variance
Expected Value: \(E(X) = \frac{(a+b)}{2}\)
Variance: \(Var(X) = \frac{(b-a)^2}{12}\)
Continuous Uniform Distribution in R
Function | Description |
---|---|
`dunif(x, min, max)` | Return the corresponding uniform probability density function (PDF) values of a certain vector of quantiles `x`. |
`punif(q, min, max)` | Return the value of the cumulative distribution function (CDF) of the uniform distribution given a certain value `q` and the interval. |
`qunif(p, min, max)` | The inverse of the `punif()` function. It takes a probability value `p` and returns the quantile that corresponds to that probability. |
`runif(n, min, max)` | Generate `n` random numbers within any interval, defined by the `min` and the `max` arguments. |
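As an illustration (a minimal sketch with arbitrary endpoints \(a = 0\) and \(b = 10\)), these functions line up with the formulas above:

## X ~ Uniform[0, 10]
dunif(x = 3, min = 0, max = 10) ## density = 1/(b - a) = 0.1
punif(q = 3, min = 0, max = 10) ## CDF = (x - a)/(b - a) = 0.3

## Simulated draws should roughly match E(X) = (a + b)/2 = 5 and Var(X) = (b - a)^2/12 = 8.33
set.seed(1234)
x <- runif(n = 10000, min = 0, max = 10)
mean(x)
var(x)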
The normal distribution is the most important probability distribution in statistics because so many real-world quantities approximately follow it, for example, the heights of a population, shoe sizes, IQ scores, and many more.
Probability Density Function (PDF)
The continuous random variable \(X\) follows a normal distribution if its probability density function is defined as:
\[f(x)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left \{ -\frac{1}{2} \left(\frac{x-\mu}{\sigma}\right)^2\right \}\]
for \(-\infty < x < \infty\), \(-\infty < \mu < \infty\), and \(0 < \sigma < \infty\). The mean of \(X\) is \(\mu\) and the variance of \(X\) is \(\sigma^2\). We say \(X \sim N(\mu, \sigma^2)\).
The probability density function of the normal distribution is a bell-shaped, symmetric curve. Its peak is at the mean of the distribution, with half of the values lying to the left of the mean and the other half to the right. The shape of any normal curve depends on its mean \(\mu\) and standard deviation \(\sigma\).
Cumulative Distribution Function (CDF)
We won’t focus on the CDF of normal distribution as it is ugly, but here it is.
\[F(x) = \Phi \left(\frac{x-\mu}{\sigma}\right)=\frac{1}{2}\left [ 1+\operatorname{erf}\left(\frac{x-\mu}{\sigma\sqrt{2}}\right) \right ]\]
Expected Value and Variance
Expected Value: \(E(x) = \mu\)
Variance: \(Var(x) = \sigma^2\)
Normal Distribution in R
Function | Description |
---|---|
`dnorm(x, mean, sd)` | Probability density function (PDF). |
`pnorm(q, mean, sd)` | Cumulative distribution function (CDF), which measures the probability that a random variable \(X\) takes a value less than or equal to `q`. |
`qnorm(p, mean, sd)` | The inverse of the `pnorm()` function. It takes a probability value `p` and returns the quantile that corresponds to that probability. |
`rnorm(n, mean, sd)` | Generate a vector of `n` random numbers which are normally distributed. |
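For example (a minimal sketch using the standard normal, \(X \sim N(0, 1)\)):

## Density at the mean (about 0.399)
dnorm(x = 0, mean = 0, sd = 1)

## P(X <= 1.96) is about 0.975
pnorm(q = 1.96, mean = 0, sd = 1)

## The 97.5th percentile is about 1.96 (the inverse of the line above)
qnorm(p = 0.975, mean = 0, sd = 1)

## Simulated draws should have mean close to 0 and sd close to 1
set.seed(1234)
x <- rnorm(n = 10000, mean = 0, sd = 1)
mean(x)
sd(x)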
The sampling distribution of a statistic is the distribution of all possible values taken by the statistic when all possible samples of a fixed size \(n\) are taken from the population. It is a theoretical idea – we do not actually build it.
Simply said, the sampling distribution of a statistic is the probability distribution of that statistic.
Suppose we have a sample with independent and identically distributed (iid) draws. We know that \(\bar{X}\) is a random variable with a sampling distribution.
Independent means that the sample items are all independent events. In other words, they are not connected to each other in any way; knowledge of the value of one variable gives no information about the value of the other and vice versa.
Identically distributed means that there are no overall trends – the distribution doesn’t fluctuate and all items in the sample are taken from the same probability distribution.
The expected value of the sampling distribution of \(\bar{X}\) is the population mean \(\mu\).
\[E(\bar{X}) = \mu\]
The expected value of the sampling distribution for \(\bar{X}\) is simply the mean of the population (DGP) distribution. This means that the sample mean \(\bar{X}\) is an unbiased estimator for the population (DGP) mean, \(\mu\).
The variance of the sampling distribution of \(\bar{X}\) is the population variance \(\sigma^2\) divided by the sample size \(n\). \[Var(\bar{X}) = \frac{\sigma^2}{n}\] Note the \(X_i\) are identically distributed, which means they have the same variance \(\sigma^2\). Therefore, we can replace \(Var(X_i)\) with the alternative notation \(\sigma^2\).
If we take the square root of variance of the sampling distribution we get the standard deviation of the sampling distribution, which is known as standard error.
\[SE(\bar{X}) = \sqrt{Var(\bar{X})} = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}\]
The standard error of a sampling distribution gives us a sense for how confident we should be in our estimate of the population mean. Our result indicates that as the sample size increases, the standard deviation of the sample mean decreases, which is good. It means that as long as our sample is large enough, we can be very confident about our estimate of the population mean based on the sample mean we have.
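For instance (a toy calculation with an arbitrary \(\sigma = 10\), just to make the shrinkage concrete):

sigma <- 10        ## an arbitrary population standard deviation for illustration
sigma / sqrt(10)   ## SE with n = 10: about 3.16
sigma / sqrt(100)  ## SE with n = 100: 1
sigma / sqrt(1000) ## SE with n = 1000: about 0.32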
In the following example, we use R to illustrate the sampling distribution of the sample mean for a hypothetical population, drawing each sample directly from the assumed population (DGP) distribution.
An instructor of an introduction to statistics course has 200 students (which is the population). The population mean of their scores out of 100 points is \(\mu = 71.18\), and the population standard deviation is \(\sigma = 10.73\). That is, these 200 students' scores \(X\) follow a normal distribution with parameters \(\mu = 71.18\) and \(\sigma = 10.73\), \(X \sim N(71.18, 10.73)\).
Let’s first draw one sample based on the population distribution, \(X \sim N(71.18, 10.73)\).
set.seed(1234) # to get replicable results
exam.sample <- rnorm(n = 10, mean = 71.18, sd = 10.73)
exam.sample
## [1] 58.22818 74.15682 82.81605 46.01066 75.78451 76.60998 65.01304 65.31464
## [9] 65.12343 61.62989
Now, let’s replicate the above process 1000 times, which will generate 1000 iid samples based on the population distribution, \(X \sim N(71.18, 10.73)\).
set.seed(1234)
exam.iid.sample <- replicate(1000, rnorm(n = 10, mean = 71.18, sd = 10.73))
exam.iid.sample <- as.data.frame(exam.iid.sample)
knitr::kable(exam.iid.sample[, c(1:5)], "simple", align = "ccccc")
V1 | V2 | V3 | V4 | V5 |
---|---|---|---|---|
58.22818 | 66.05972 | 72.61877 | 83.00765 | 86.73309 |
74.15682 | 60.46731 | 65.91494 | 66.07689 | 59.71346 |
82.81605 | 62.85080 | 66.45292 | 63.56771 | 62.00194 |
46.01066 | 71.87164 | 76.11139 | 65.80150 | 68.16892 |
75.78451 | 81.47537 | 63.73638 | 53.69983 | 60.51073 |
76.60998 | 69.99664 | 55.64076 | 58.65145 | 60.78784 |
65.01304 | 65.69687 | 77.34713 | 47.78817 | 59.29848 |
65.31464 | 61.40287 | 60.19617 | 56.79114 | 57.74619 |
65.12343 | 62.19715 | 71.01757 | 68.02223 | 65.55932 |
61.62989 | 97.10191 | 61.13727 | 66.18092 | 65.84880 |
Since we drew 1000 iid samples from the population, we should now have 1000 sample means. Let’s plot all these sample means in a histogram to observe their distribution, which is the sampling distribution.
Note the process of taking many different samples in an effort to see what the sampling distribution looks like is referred to as a “Monte Carlo simulation.” The “true” sampling distribution is the one where we take an infinite number of samples, which obviously we cannot achieve. Monte Carlo simulation is the term for the process of repeated sampling to approximate something. We’re using it here for sampling distributions, but it is used in other contexts as well.
# What is the mean of each sample?
sample.mean <- colMeans(exam.iid.sample)
sample.mean[1:5]
## V1 V2 V3 V4 V5
## 67.06872 69.91203 67.01733 62.95875 64.63688
# Plot the distribution of these sample means
library(tidyverse)
my.bins <- seq(62, 82, 0.5)
ggplot(data.frame(sample.mean = sample.mean), aes(x = sample.mean)) +
geom_histogram(breaks = my.bins, color = "black", fill = "gray85") +
geom_vline(xintercept = mean(sample.mean), linetype = "dashed",
color = "red", alpha = 0.5, lwd = 0.8) +
geom_vline(xintercept = 71.18, linetype = "dashed",
color = "blue", alpha = 0.5, lwd = 0.8) +
scale_x_continuous(limits = c(62, 82)) +
scale_y_continuous(limits = c(0, 250)) +
labs(x = "Sample mean (X bar)", y = "Frequency",
title = "1000 Samples with n=10") +
theme_bw() +
theme(panel.grid.minor = element_blank())
What are the mean and standard error of the sample mean?
mean(sample.mean)
## [1] 71.24562
sd(sample.mean)
## [1] 3.252043
As you can see, the mean and standard error based on all 1000 samples we drew are very close to the population mean of 71.18 and the standard deviation divided by the square root of the sample size (\(\frac{10.73}{\sqrt{10}}=3.393124\)). However, they are not exactly the same. The reason is that we can only simulate the sampling distribution of the exam scores but cannot simulate it for an infinite number of samples. That said, we can never reach the “true” sampling distribution.
Now let's look at the sampling distribution of the sample mean when \(n = 100\) for the exam scores data.
set.seed(1234)
exam.iid.sample <- replicate(1000, rnorm(n = 100, mean = 71.18, sd = 10.73))
exam.iid.sample <- as.data.frame(exam.iid.sample)
# What is the mean of each sample?
sample.mean <- colMeans(exam.iid.sample)
sample.mean[1:5]
## V1 V2 V3 V4 V5
## 69.49795 71.62254 72.83890 71.09303 70.94624
# Plot the distribution of these sample means
library(tidyverse)
ggplot(data.frame(sample.mean = sample.mean), aes(x = sample.mean)) +
geom_histogram(breaks = my.bins, color = "black", fill = "gray85") +
geom_vline(xintercept = mean(sample.mean), linetype = "dashed",
color = "red", alpha = 0.5, lwd = 0.8) +
geom_vline(xintercept = 71.18, linetype = "dashed",
color = "blue", alpha = 0.5, lwd = 0.8) +
scale_x_continuous(limits = c(62, 82)) +
scale_y_continuous(limits = c(0, 250)) +
labs(x = "Sample mean (X bar)", y = "Frequency",
title = "1000 Samples with n=100") +
theme_bw() +
theme(panel.grid.minor = element_blank())
What are the mean and standard error of the sample mean?
mean(sample.mean)
## [1] 71.21143
sd(sample.mean)
## [1] 1.07794
As you observe from the exam score example, if the population is normally distributed with mean \(\mu\) and standard deviation \(\sigma\), then the sampling distribution of the sample mean is also normally distributed no matter what the sample size is.
What happens when the sample comes from a population that is not normally distributed? This is where the Central Limit Theorem comes in.
For a large sample size (\(n > 30\)), \(\bar{X}\) is approximately normally distributed, regardless of the distribution of the population one samples from. If the population has mean \(\mu\) and standard deviation \(\sigma\), then \(\bar{X}\) follows an approximate normal distribution with mean \(\mu\) and standard deviation \(\frac{\sigma}{\sqrt{n}}\).
The Central Limit Theorem applies to a sample mean from any distribution. We could have a left-skewed or a right-skewed distribution. As long as the sample size is large, the distribution of the sample means will follow an approximate normal distribution.
Notes on the CLT:
If the population is skewed and the sample size you draw is small, then the distribution of the sample mean won’t be normal.
If the population is normal, then the distribution of sample mean looks normal even if \(n = 2\).
If the population is skewed, then the distribution of sample mean looks more and more normal when \(n\) gets larger.
Note that in all cases, the mean of the sample mean is close to the population mean and the standard error of the sample mean is close to \(\frac{\sigma}{\sqrt{n}}\).
Suppose \(X\) follows a chi-square distribution with 3 degrees of freedom, \(X \sim \chi^2(3)\). The population mean of a chi-square distribution equals its degrees of freedom; in this case, \(\mu = 3\). Please do the following:
This is what a chi-square distribution looks like:
Find the sampling distribution of the sample mean when \(n = 5\) for \(X\).
set.seed(1234)
iid.sample <- replicate(1000, rchisq(n = 5, df = 3, ncp = 0))
iid.sample <- as.data.frame(iid.sample)
# What is the mean of each sample?
sample.mean <- colMeans(iid.sample)
sample.mean[1:5]
## V1 V2 V3 V4 V5
## 1.944331 1.521077 2.798356 1.653387 2.474112
# Plot the distribution of these sample means
my.bins <- seq(0, 8, 0.1)
ggplot(data.frame(sample.mean = sample.mean), aes(x = sample.mean)) +
geom_histogram(breaks = my.bins, color = "black", fill = "gray85") +
geom_vline(xintercept = mean(sample.mean), linetype = "dashed",
color = "red", alpha = 0.5, lwd = 0.8) +
geom_vline(xintercept = 3, linetype = "dashed",
color = "blue", alpha = 0.5, lwd = 0.8) +
scale_x_continuous(limits = c(0, 8)) +
scale_y_continuous(limits = c(0, 100)) +
labs(x = "Sample mean (X bar)", y = "Frequency",
title = "1000 Samples with n=5") +
theme_bw() +
theme(panel.grid.minor = element_blank())
Find the sampling distribution of the sample mean when \(n = 30\) for \(X\).
set.seed(1234)
iid.sample <- replicate(1000, rchisq(n = 30, df = 3, ncp = 0))
iid.sample <- as.data.frame(iid.sample)
# What is the mean of each sample?
sample.mean <- colMeans(iid.sample)
sample.mean[1:5]
## V1 V2 V3 V4 V5
## 2.690281 2.510694 3.307291 2.776163 3.632522
# Plot the distribution of these sample means
ggplot(data.frame(sample.mean = sample.mean), aes(x = sample.mean)) +
geom_histogram(breaks = my.bins, color = "black", fill = "gray85") +
geom_vline(xintercept = mean(sample.mean), linetype = "dashed",
color = "red", alpha = 0.5, lwd = 0.8) +
geom_vline(xintercept = 3, linetype = "dashed",
color = "blue", alpha = 0.5, lwd = 0.8) +
scale_x_continuous(limits = c(0, 8)) +
scale_y_continuous(limits = c(0, 100)) +
labs(x = "Sample mean (X bar)", y = "Frequency",
title = "1000 Samples with n=30") +
theme_bw() +
theme(panel.grid.minor = element_blank())
The real power of statistics comes from applying the concepts of probability to situations where you have data but not necessarily the whole population. The results, called statistical inference, give you probability statements about the population of interest based on that set of data.
There are two types of statistical inferences:
Estimation: Use information from the sample to estimate (or predict) the parameter of interest. For instance, using the result of a poll about the president’s current approval rating to estimate (or predict) his or her true current approval rating nationwide.
Hypothesis Tests: Use information from the sample to determine whether a certain statement about the parameter of interest is true. For instance, suppose a news station claims that the President’s current approval rating is more than 75%. We want to determine whether that statement is supported by the poll data.
Let’s start with estimation and confidence intervals. Since the sampling distribution of \(\bar{X}\) is (approximately) normal with mean \(\mu\) and standard error \(\frac{\sigma}{\sqrt{n}}\), we have:

\[P[\bar{X}-z_{\frac{\alpha}{2}}(\frac{\sigma}{\sqrt{n}})\leq \mu \leq \bar{X}+z_{\frac{\alpha}{2}}(\frac{\sigma}{\sqrt{n}})]=1-\alpha\] This indicates that a \((1-\alpha)*100\%\) confidence interval for the mean \(\mu\) is:
\[[\bar{x} - z_{\frac{\alpha}{2}}(\frac{\sigma}{\sqrt{n}}), \ \ \bar{x} + z_{\frac{\alpha}{2}}(\frac{\sigma}{\sqrt{n}})]\]
If the population standard deviation (\(\sigma\)) is unknown, we can use sample standard deviation to estimate it:
\[[\bar{x} - z_{\frac{\alpha}{2}}(\frac{s_x}{\sqrt{n}}), \ \ \bar{x} + z_{\frac{\alpha}{2}}(\frac{s_x}{\sqrt{n}})]\] Suppose we take a large number of samples, say 1000. Then, we calculate a 95% confidence interval for each sample. Being “95% confident” means that we’d expect 95%, or 950, of the 1000 intervals to be correct, that is, to contain the actual unknown value \(\mu\).
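As a quick illustration (a minimal sketch with a simulated sample; the population values \(\mu = 50\) and \(\sigma = 10\) are made up), we can construct a 95% confidence interval for the mean in R:

## One simulated sample of size 100 from a hypothetical population
set.seed(1234)
x <- rnorm(n = 100, mean = 50, sd = 10)

x.bar <- mean(x)
z <- qnorm(0.975)              ## z_(alpha/2) for alpha = 0.05, about 1.96
se <- sd(x) / sqrt(length(x))  ## sample sd used to estimate sigma

c(lower = x.bar - z * se, upper = x.bar + z * se)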
Hypothesis testing is the process of making a choice between two conflicting hypotheses. First, a tentative assumption is made about the parameter or distribution. This assumption is called the null hypothesis and is denoted by \(H_0\). An alternative hypothesis (denoted \(H_a\)), which is the opposite of what is stated in the null hypothesis, is then defined. The hypothesis-testing procedure involves using sample data to determine whether or not \(H_0\) can be rejected. If \(H_0\) is rejected, the statistical conclusion is that the alternative hypothesis \(H_a\) is true.
Null hypothesis (\(H_0\)): The null hypothesis states the “status quo”. This hypothesis is assumed to be true until there is evidence to suggest otherwise.
Alternative hypothesis (\(H_a\)): This is the statement that one wants to conclude. It is also called the research hypothesis.
The goal of hypothesis testing is to see if there is enough evidence against the null hypothesis. In other words, to see if there is enough evidence to reject the null hypothesis. If there is not enough evidence, then we fail to reject the null hypothesis.
We want to know the answer to a research question. We determine our null and alternative hypotheses. Now it is time to make a decision.
The decision is either going to be:
reject the null hypothesis or …
fail to reject the null hypothesis.
Note: Why can’t we say we “accept the null”? The reason is that we are assuming the null hypothesis is true and trying to see if there is evidence against it. Therefore, the conclusion should be in terms of rejecting the null.
Consider the following table. The table shows the decision/conclusion of the hypothesis test and the unknown “reality”, or truth. We do not know if the null is true or if it is false. If the null is false and we reject it, then we made the correct decision. If the null hypothesis is true and we fail to reject it, then we made the correct decision.
What happens when we do not make the correct decision? When doing hypothesis testing, two types of mistakes may be made and we call them Type I error and Type II error. If we reject the null hypothesis when it is true, then we made a type I error. If the null hypothesis is false and we failed to reject it, we made another error called a Type II error.
The “reality”, or truth, about the null hypothesis is unknown and therefore we do not know if we have made the correct decision or if we committed an error. We can, however, define the likelihood of these events.
\(\alpha\): The probability of committing a Type I error. Also known as the significance level.
\(\beta\): The probability of committing a Type II error.
Power: Power is the probability the null hypothesis is rejected given that it is false (i.e., \(1-\beta\)).
\(\alpha\) and \(\beta\) are probabilities of committing an error so we want these values to be low. However, we cannot decrease both. As \(\alpha\) increases, \(\beta\) decreases.
Note: Type I error is also thought of as the event that we reject the null hypothesis GIVEN the null is true. In other words, Type I error is a conditional event and \(\alpha\) is a conditional probability. The same idea applies to Type II error and \(\beta\).
In today’s discussion, we will revisit some techniques covered in POL 211 concerning how to capture relationships within data. Additionally, we will delve deeper into the application of Ordinary Least Squares (OLS) regression to capture the specific relationships of interest.
When we want to describe the relationship between two sets of data, we can plot the data sets in a scatter plot and look at four characteristics:
Today, our focus will be on the first three characteristics. We will delve into covariance, the correlation coefficient, and linear regression. These are three fundamental statistics that allow us to describe the relationships between the variables of interest.
For today, we are using `Covid19.csv`. You can download this csv file from Canvas or from here.
## load packages and import data
library(tidyverse)
cf <- read.csv("/Users/yu-shiuanhuang/Desktop/method-sequence/data/Covid19.csv")
Let’s visualize the relationship between the politics of a county (`dem16vs`) and the share of people who say they always wear a mask (`mask_always`) in a scatter plot.

`dem16vs`: Democratic vote share in the 2016 presidential election.

`mask_always`: The estimated share of people in a county who would say they always wear a mask in public when they expect to be within six feet of another person.

ggplot(cf, aes(x = dem16vs, y = mask_always)) +
geom_point(color = "black", fill = "white",
stroke = 1, shape = 1) +
## Use the stroke aesthetic to modify the width of the border
## The shape of points can be adjusted by specifying the shape argument
xlab("Democratic Vote Share, 2016 Presidential") +
ylab("% Reporting Always Wearing Mask") +
ylim(0,1) +
xlim(0,1) +
labs(title = "NY Times Covid-19 Survey",
subtitle = "All Counties") +
theme_bw()
The plot above is hard to see since all the dots are on top of each other. So instead of plotting all the datapoints, let’s draw a random sample of 500 of the rows of the dataset and plot them.
set.seed(110) ## this allows us to draw the *same* random sample every time.
samp <- sample(1:nrow(cf), 500)
## This means randomly taking a sample of 500 from the elements of the total rows of cf
glimpse(samp)
## int [1:500] 2008 2790 2483 336 2403 2921 1659 772 1427 2931 ...
`samp` contains the rows we randomly sample from all the rows in `cf`. For example, the first element in `samp` is 2008, which means we’re going to draw the 2008th row out of the original dataset.
Now let’s replot the scatter plot again with only 500 datapoints we randomly drew from the whole dataset.
ggplot(cf[samp,], aes(x = dem16vs, y = mask_always)) +
## Specify the random sample within the bracket of cf and remember to put it on the row position!
geom_point(color = "black", fill = "white",
stroke = 1, shape = 1) +
xlab("Democratic Vote Share, 2016 Presidential") +
ylab("% Reporting Always Wearing Mask") +
ylim(0,1) +
xlim(0,1) +
labs(title = "NY Times Covid-19 Survey",
subtitle = "500 Randomly Selected Counties") +
theme_bw()
In addition to visualizing data using scatter plots to grasp relationships, how can we employ statistical measures to describe these relationships more formally?
Covariance
Covariance is a useful measure for describing the direction of the linear association between two quantitative variables.
\[Cov(x, y) = \frac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})\]
As we can see from the equation, the covariance sums the term \((x_i-\bar{x})(y_i-\bar{y})\) for each data point, where \(\bar{x}\) is the average \(x\) value, and \(\bar{y}\) is the average \(y\) value. The term becomes more positive if both \(x\) and \(y\) are larger than the average values in the data set, and becomes more negative if smaller. As the covariance accounts for every data point in the set, a positive covariance must mean that most, if not all, data points are in sync with respect to \(x\) and \(y\) (small \(y\) when \(x\) is small or large \(y\) when \(x\) is large). Conversely, a negative covariance must mean that most, if not all, data points are out of sync with respect to \(x\) and \(y\) (small \(y\) when \(x\) is large or large \(y\) when \(x\) is small).
Let’s start by using a fake dataset to practice how to compute covariance.
tf <- data.frame(x = c(3, 2, 1, 7, -1),
y = c(8, 10, 13, 2, 8))
knitr::kable(tf, "simple", align = "cc")
x | y |
---|---|
3 | 8 |
2 | 10 |
1 | 13 |
7 | 2 |
-1 | 8 |
## Covariance
tf$x.dev2 <- tf$x - mean(tf$x)
tf$y.dev2 <- tf$y - mean(tf$y)
cov.fake <- (1/(nrow(tf)-1))*sum(tf$x.dev2*tf$y.dev2)
cov.fake
## [1] -8.85
You can use the canned function `cov()` to double check your answer.
## Covariance
cov(tf$x, tf$y)
## [1] -8.85
Covariance is a useful measure for describing the direction of the linear association between two quantitative variables, but it has two weaknesses: a larger covariance does not always mean a stronger relationship, and we cannot compare covariances across different sets of relationships. For example, if we find that the covariance between variables \(x\) and \(y\) is larger than the covariance between \(x\) and \(z\), we can only conclude that both \(y\) and \(z\) are positively associated with \(x\). However, we cannot determine whether the relationship between \(x\) and \(y\) is stronger than the relationship between \(x\) and \(z\) based solely on their covariances.
Correlation Coefficient
To address these weaknesses, we normalize the covariance by the standard deviations of the \(x\) values and \(y\) values to get the correlation coefficient. The correlation coefficient is a value between -1 and 1, and measures both the direction and the strength of the linear association. One important distinction to note is that correlation does not measure the slope of the relationship; a large correlation only speaks to the strength of the relationship. The correlation coefficient is defined as:
\[r_{xy}=\frac{Cov(x, y)}{s_xs_y}\]
Please compute the correlation coefficient between \(x\) and \(y\) using the formula above with the fake dataset.
## Correlation
cor.fake <- cov.fake/(sd(tf$x)*sd(tf$y))
cor.fake
## [1] -0.7412154
You can also use the canned function `cor()` to double check your answers.
## Correlation
cor(tf$x, tf$y)
## [1] -0.7412154
Linear Regression
Correlation and covariance are quantitative measures of the strength and direction of the relationship between two variables, but they do not account for the slope of the relationship. In other words, we do not know how a change in one variable would impact the other variable. Regression is the technique that fills this void: it allows us to make the best guess at how one variable affects the other. The simplest linear regression, usually estimated by the ordinary least squares (OLS) method, allows us to fit a “line of best fit” to the scatter plot and use that line (or model) to describe the relationship between the two variables. The OLS approach aims to fit a line that minimizes the squared residuals. The equation for that line is:
\[\hat{y} = a + bx\] The equations of \(a\) and \(b\) are:
\[b = \frac{r_{xy}s_{y}}{s_{x}}=\frac{Cov(x, y)}{s^2_{x}}\]
\[a = \bar{y}-b\bar{x}\]
Note: A more detailed introduction regarding linear regression will be discussed in POL 212 and 213.
What are the intercept (\(a\)) and slope (\(b\)) between \(x\) and \(y\) in the fake dataset? Please compute them manually!
## Slope
b <- cov(tf$x, tf$y)/var(tf$x)
## Intercept
a <- mean(tf$y) - b * mean(tf$x)
You can also use the canned function `lm()` to check your answer.
reg <- lm(y ~ x, data = tf)
stargazer::stargazer(reg, type = "text", digits = 4)
##
## ===============================================
## Dependent variable:
## ---------------------------
## y
## -----------------------------------------------
## x -1.0057
## (0.5258)
##
## Constant 10.6136**
## (1.8813)
##
## -----------------------------------------------
## Observations 5
## R2 0.5494
## Adjusted R2 0.3992
## Residual Std. Error 3.1198 (df = 3)
## F Statistic 3.6578 (df = 1; 3)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Below is a population regression line equation, assuming that we are playing God and know the data-generating process of the relationship between \(X\) (independent/explanatory variable) and \(Y\) (dependent/response/outcome variable).
However, since we usually don’t have the population data but only have access to a sample of it, we can only use the sample dataset to estimate the population regression line. In the below figure, \(b_0\) is the estimate of the regression intercept \(\beta_0\), while \(b_1\) is the estimate of the regression coefficient/slope \(\beta_1\).
The Ordinary Least Squares (OLS) approach aims to fit a line that minimizes the sum of the squared residuals, which means that the goal is to find a pair of \(b_0\) and \(b_1\) that can minimize the difference between our observed \(Y_i\) and estimated \(\hat{Y_i}\).
\(S(b_0, b_1) = \Sigma E_i^2 = \Sigma (Y_i-\hat{Y_i})^2=\Sigma (Y_i-(b_0+b_1X_i))^2 = \Sigma (Y_i-b_0-b_1X_i)^2\)

Finding the pair of \(b_0\) and \(b_1\) that minimizes this difference between our observed \(Y_i\) and estimated \(\hat{Y_i}\) is an optimization problem. All we have to do is take the partial derivatives of the above equation with respect to \(b_0\) and \(b_1\), respectively, set them equal to zero, and then solve.
\(\frac{\partial S(b_0, b_1)}{\partial b_0}=\Sigma (-1)(2)(Y_i-b_0-b_1X_i)=0\)
\(\frac{\partial S(b_0, b_1)}{\partial b_1}=\Sigma (-X_i)(2)(Y_i-b_0-b_1X_i)=0\)
By solving the above two equations, you will get:
\(b_0 = \bar{Y}-b_1\bar{X}\)
\(b_1 = \frac{\Sigma (X_i-\bar{X})(Y_i-\bar{Y})}{\Sigma (X_i-\bar{X})^2}\)
With estimates of the intercept and slope of our linear regression line, we can then make statistical inferences to assess the likelihood that our estimates capture the true parameters. Initially, we need the standard errors of the intercept and slope to construct confidence intervals and conduct hypothesis tests. POL 213 will provide a more detailed discussion on how to derive these standard errors. For now, in 212, let’s assume these standard errors and use them for statistical inferences.
The residual standard error \(S_E\) (how closely the regression line we estimate fits the scatter of our data points) is: \(S_E = \hat{\sigma} = \sqrt{\frac{\Sigma E_i^2}{n-2}}\)
And the standard error of the sample intercept (\(b_0\)) and slope (\(b_1\)) are:
\(SE(b_0) = \hat{\sigma }\sqrt{\frac{\Sigma X_i^2}{n\Sigma (X_i-\bar{X})^2}}\)
\(SE(b_1) = \frac{\hat{\sigma }}{\sqrt{\Sigma (X_i-\bar{X})^2}}\)
With standard errors, we can construct a \(100(1-\alpha)\%\) confidence interval for our slope and perform hypothesis testing.
Confidence Interval
\(b_1 \pm t_{\frac{\alpha}{2}, df=n-2}SE(b_1)\)
Hypothesis Test
Two-tailed test (by default, the `lm()` function conducts a two-tailed test)
\(H_0: \beta = 0\)
\(H_1: \beta \neq 0\)
\(t=\frac{b_1-\beta_1}{SE(b_1)}=\frac{b_1-0}{SE(b_1)}\)
Compare this t-statistic with \(t_{\frac{\alpha}{2}, df=n-2}\) to see whether it falls beyond the critical values. If it is larger than the upper critical value or smaller than the lower critical value, we can reject \(H_0\) in favor of \(H_1\). In addition to comparing the t-statistic with the critical values, we can also calculate the p-value based on the t-statistic to perform hypothesis testing (p-value \(= 2*Pr(t \geq |t_c|)\) vs \(\alpha\)).
One-tailed test (if your hypotheses are directional, you can also conduct one-tailed test)
\(H_0: \beta = 0\)
\(H_1: \beta < 0\) or \(H_1: \beta > 0\)
\(t=\frac{b_1-\beta_1}{SE(b_1)}=\frac{b_1-0}{SE(b_1)}\)
When performing a one-tailed test, compare the t-statistic with \(t_{\alpha, df=n-2}\) without dividing \(\alpha\) by 2. P-value in one-tailed test is calculated as:
p-value \(= Pr(t \geq t_c)\) or \(Pr(t \leq t_c)\), depending on the direction of \(H_1\), vs \(\alpha\)
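If helpful, critical values for both kinds of tests can be pulled from the t distribution with qt(); the numbers below are a sketch assuming \(\alpha = 0.05\) and \(df = n - 2 = 498\):

## Two-tailed critical value, about 1.96
qt(p = 0.975, df = 498)

## One-tailed critical values, about 1.65 and -1.65
qt(p = 0.95, df = 498)
qt(p = 0.05, df = 498)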
Let’s now assume we know how exactly the data is generated and draw a sample out of the population we generate and then construct confidence intervals and perform hypothesis testing.
This is the example Hanno used in POL 211. Assume we study the relationship between money spent on coffee (\(X\)) and hours of sleep every night (\(Y\)) for UC Davis graduate students. The population of UC Davis graduate students is \(10000\), and the population distribution of coffee spending follows a normal distribution with mean 20 and standard deviation 8 (i.e., \(X \sim N(20, 8^2)\)).
We also know the exact population relationship between coffee spending and hours of sleep every night for UC Davis graduate students, which is defined as below:
set.seed(123)
coffee_spending <- rnorm(10000, 20, 8)
sleep <- 8 - 0.1*coffee_spending + rnorm(10000, 0, 3)
This is the population data we generated:
pop <- data.frame(coffee_spending, sleep)
head(pop)
## coffee_spending sleep
## 1 15.51619 13.560556
## 2 18.15858 5.683706
## 3 32.46967 7.533918
## 4 20.56407 4.239138
## 5 21.03430 6.571840
## 6 33.72052 8.023906
Now, let’s draw a sample with 500 observations from our population and estimate the relationship between coffee spending and hours of sleep every night.
set.seed(123)
samp <- sample(1:nrow(pop), 500)
glimpse(samp)
## int [1:500] 2463 2511 8718 2986 1842 9334 3371 4761 6746 9819 ...
`samp` contains the rows we randomly sample from all the rows in `pop`. For example, the first element in `samp` is 2463, which means we’re going to draw the 2463rd row out of the population dataset.
Let’s first plot the scatter plot with the sample we randomly drew from the population dataset.
ggplot(pop[samp,], aes(x = coffee_spending, y = sleep)) +
## Specify the random sample within the bracket of pop and remember to put it on the row position!
geom_point(color = "black", fill = "white",
stroke = 1, shape = 1) +
xlab("Money Spend on Coffee") +
ylab("Hours of Sleep Every Night") +
theme_bw()
It seems that there is a negative relationship between coffee spending and hours of sleep every night for UC Davis graduate students. Now, let’s further estimate the intercept and slope based on the sample we drew. To simplify the process of calculating the residual standard error and the standard errors of both the slope and intercept, I use the `lm()` function here to get all the estimates and statistics we need to construct confidence intervals and perform hypothesis tests.
mod <- lm(sleep ~ coffee_spending, data = pop[samp,])
summary(mod)
##
## Call:
## lm(formula = sleep ~ coffee_spending, data = pop[samp, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.763 -2.143 -0.100 1.919 9.844
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.26311 0.38665 18.785 < 2e-16 ***
## coffee_spending -0.06986 0.01783 -3.917 0.000102 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.053 on 498 degrees of freedom
## Multiple R-squared: 0.02989, Adjusted R-squared: 0.02794
## F-statistic: 15.34 on 1 and 498 DF, p-value: 0.0001022
Construct 95% confidence intervals for both the intercept and slope
Intercept: \((7.26311-1.96*0.38665, 7.26311+1.96*0.38665) = (6.505276, 8.020944)\)
slope: \((-0.06986-1.96*0.01783, -0.06986+1.96*0.01783) = (-0.1048068, -0.0349132)\)
Both the confidence intervals for the intercept and slope do not include 0. This implies that if we draw 100 samples from the population, 95% of the confidence intervals we construct will encompass the true parameters, indicating that both of these true parameters are significantly different from 0.
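As a quick check (using the built-in helper rather than the manual formula above), confint() reproduces these intervals directly from the fitted model; it uses the t rather than the z critical value, so the bounds are very close to, but not exactly, the manual ones:

## 95% confidence intervals for the intercept and slope from the fitted model
confint(mod, level = 0.95)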
Perform two-tailed hypothesis tests for both the intercept and slope
\(H_0: \beta_0 = 0\)
\(H_1: \beta_0 \neq 0\)
\(t=\frac{b_0-\beta_0}{SE(b_0)}=\frac{b_0-0}{SE(b_0)}=\frac{7.26311-0}{0.38665}=18.78471\)
p-value \(= 2*Pr(t \geq 18.78471)=6.57319e-60 < 0.05\)
# find p-value
2*pt(q = 18.78471, df = 498, lower.tail = FALSE)
## [1] 6.57319e-60
As the p-value is smaller than 0.05, we can reject the null hypothesis and conclude that the intercept is significantly different from 0.
\(H_0: \beta_1 = 0\)
\(H_1: \beta_1 \neq 0\)
\(t=\frac{b_1-\beta_1}{SE(b_1)}=\frac{b_1-0}{SE(b_1)}=\frac{-0.06986-0}{0.01783}=-3.918116\)
p-value \(= 2*Pr(t \leq -3.918116)=0.0001017134 < 0.05\)
# find p-value
2*pt(q = -3.918116, df = 498, lower.tail = TRUE)
## [1] 0.0001017134
As the p-value is smaller than 0.05, we can reject the null hypothesis and conclude that there is a negative relationship between coffee spending and hours of sleep every night for UC Davis graduate students.
ggplot(pop[samp,], aes(x = coffee_spending, y = sleep)) +
## Specify the random sample within the bracket of pop and remember to put it on the row position!
geom_point(color = "black", fill = "white",
stroke = 1, shape = 1) +
geom_abline(slope = -0.06986, intercept = 7.26311,
color = 'blue', lwd = 1.5, alpha = 0.7) +
xlab("Money Spend on Coffee") +
ylab("Hours of Sleep Every Night") +
theme_bw()
In the above exercise, we drew only one sample from the population. Now, let’s repeatedly draw samples 1000 times to simulate the sampling distribution of the intercept and slope. This will help us verify if we reach similar conclusions as we did when constructing confidence intervals and performing hypothesis tests.
Simulate Sampling Distribution of the Intercept
set.seed(1234)
par.est <- data.frame(est_intercept = NA,
est_slope = NA)
for (i in 1:1000){
samp <- sample(1:nrow(pop), 500)
mod <- lm(sleep ~ coffee_spending, data = pop[samp,])
par.est[i, 1] <- mod$coef[1]
par.est[i, 2] <- mod$coef[2]
}
head(par.est)
## est_intercept est_slope
## 1 7.977923 -0.10320265
## 2 8.107999 -0.10914730
## 3 7.766414 -0.08499736
## 4 7.760085 -0.08567993
## 5 8.376802 -0.12510784
## 6 6.947270 -0.04736407
Let’s plot the distribution of our 1000 estimated intercepts and slopes and compare that with the sampling distribution under the null hypothesis.
What is the sampling distribution under the null hypothesis? Remember that we know the DGP of the population, so we actually know the exact standard error of both the population intercept and slope.
mod_pop <- lm(sleep ~ coffee_spending, data = pop)
summary(mod_pop)
##
## Call:
## lm(formula = sleep ~ coffee_spending, data = pop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.4478 -2.0067 -0.0222 2.0426 11.3089
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.927428 0.080939 97.94 <2e-16 ***
## coffee_spending -0.097735 0.003761 -25.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.005 on 9998 degrees of freedom
## Multiple R-squared: 0.06326, Adjusted R-squared: 0.06317
## F-statistic: 675.2 on 1 and 9998 DF, p-value: < 2.2e-16
That being said, the sampling distribution under the null hypothesis for the intercept follows a normal distribution with a mean equal to 0 and a standard deviation equal to 0.080939.
int_null <- data.frame(x = seq(-1, 10, 0.01),
y = dnorm(seq(-1, 10, 0.01),
mean = 0, sd = 0.080939))
ggplot() +
geom_line(aes(x = int_null$x, y = int_null$y)) +
geom_histogram(aes(x = par.est$est_intercept, y = ..density..),
color = "black", fill = "gray85", bins = 80) +
geom_vline(xintercept = mean(par.est$est_intercept),
linetype = "dashed", color = "red", alpha = 0.5, lwd = 0.8) +
geom_vline(xintercept = 7.26311, linetype = "dashed",
color = "blue", alpha = 0.5, lwd = 0.8) +
annotate("text", x = 9.5, y = 3,
label = "mean of\nestimated\nintercepts", color = "red") +
annotate("text", x = 6.5, y = 3,
label = "one time\nestimated\nintercept", color = "blue") +
annotate("text", x = 1.5, y = 3,
label = "sampling\ndistribution\nunder the null", color = "black") +
scale_x_continuous(name = "Estimated Intercept", breaks = seq(-1, 10, 1)) +
labs(y = "Density", title = "1000 Samples with n=500") +
theme_bw() +
theme(panel.grid.minor = element_blank())
Simulate Sampling Distribution of the Slope
The sampling distribution under the null hypothesis for the slope follows a normal distribution with a mean equal to 0 and a standard deviation equal to 0.003761.
slo_null <- data.frame(x = seq(-0.15, 0.05, 0.001),
y = dnorm(seq(-0.15, 0.05, 0.001),
mean = 0, sd = 0.003761))
ggplot() +
geom_line(aes(x = slo_null$x, y = slo_null$y)) +
geom_histogram(aes(x = par.est$est_slope, y = ..density..),
color = "black", fill = "gray85", bins = 80) +
geom_vline(xintercept = mean(par.est$est_slope),
linetype = "dashed", color = "red", alpha = 0.5, lwd = 0.8) +
geom_vline(xintercept = -0.06986, linetype = "dashed",
color = "blue", alpha = 0.5, lwd = 0.8) +
annotate("text", x = -0.12, y = 60,
label = "mean of\nestimated\nislopes", color = "red") +
annotate("text", x = -0.05, y = 60,
label = "one time\nestimated\nslope", color = "blue") +
annotate("text", x = 0.02, y = 60,
label = "sampling\ndistribution\nunder the null", color = "black") +
scale_x_continuous(name = "Estimated Slope", breaks = seq(-0.15, 0.05, 0.02)) +
labs(y = "Density", title = "1000 Samples with n=500") +
theme_bw() +
theme(panel.grid.minor = element_blank())
The central difference between simple and multiple linear regression is that the slope coefficients for the explanatory variables in multiple regression are partial coefficients. That is, each slope represents the “effect” on the response variable of a one-unit increment in the corresponding explanatory variable, holding constant the values of the other explanatory variables.
\(Y_i=\beta_0+\beta_1X_{i1}+\beta_2X_{i2}+...+\beta_kX_{ik}+\varepsilon_i\)
Recall that the goal of the OLS approach is to minimize the discrepancy between our observed \(Y_i\) and estimated \(\hat{Y_i}\). To find a set of estimated \(\beta\)s that can minimize this discrepancy is an optimization problem. All we have to do is to take the partial derivatives of the above equation with respect to each coefficient, set them equal to zero, and then solve it. However, adding more and more \(X\) variables to our linear regression model makes the calculation more and more tedious. In order to find the estimated \(\beta\)s more efficiently, we use matrix algebra to help us do the calculation. POL 213 will show you how to do it in more details. In today’s discussion, we focus on how to do multivariate linear regression in R.
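To give a flavor of what the matrix solution looks like (a minimal sketch with simulated data and a made-up DGP, not the POL 213 derivation), the estimates \(b = (X'X)^{-1}X'Y\) can be computed with solve() and then compared with lm():

## Simulate a small dataset with two explanatory variables (hypothetical DGP)
set.seed(1234)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                 ## design matrix with a column of 1s for the intercept
b <- solve(t(X) %*% X) %*% t(X) %*% y ## (X'X)^(-1) X'Y
b

## The same estimates from lm()
coef(lm(y ~ x1 + x2))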
Assuming we are examining factors that explain U.S. public school expenditures, you can access the dataset `Anscombe.txt` here. This dataset measures U.S. State Public School Expenditures in 1981, and the variables are:

`education`: per capita education expenditures, dollars

`income`: per capita income, dollars

`under18`: proportion under 18 years old, per 1000

`urban`: proportion urban, per 1000

# load data
df <- read.delim("/Users/yu-shiuanhuang/Desktop/method-sequence/data/Anscombe.txt", sep = "")
ols <- lm(education ~ income + under18 + urban, data = df)
summary(ols)
##
## Call:
## lm(formula = education ~ income + under18 + urban, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.240 -15.738 -1.156 15.883 51.380
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.868e+02 6.492e+01 -4.418 5.82e-05 ***
## income 8.065e-02 9.299e-03 8.674 2.56e-11 ***
## under18 8.173e-01 1.598e-01 5.115 5.69e-06 ***
## urban -1.058e-01 3.428e-02 -3.086 0.00339 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.69 on 47 degrees of freedom
## Multiple R-squared: 0.6896, Adjusted R-squared: 0.6698
## F-statistic: 34.81 on 3 and 47 DF, p-value: 5.337e-12
Use the stargazer package to have a nicer output of the regression table.
# install.packages("stargazer")
library(stargazer)
stargazer(ols,
title = "Table 1: Factors Explaining U.S. Public School Expenditures",
dep.var.labels = "Education Expenditure",
covariate.labels = c("Income", "Under 18", "Urban"),
star.cutoffs = c(0.05, 0.01, 0.001),
digits = 2 , type = "text")
##
## Table 1: Factors Explaining U.S. Public School Expenditures
## =================================================
## Dependent variable:
## -----------------------------
## Education Expenditure
## -------------------------------------------------
## Income 0.08***
## (0.01)
##
## Under 18 0.82***
## (0.16)
##
## Urban -0.11**
## (0.03)
##
## Constant -286.84***
## (64.92)
##
## -------------------------------------------------
## Observations 51
## R2 0.69
## Adjusted R2 0.67
## Residual Std. Error 26.69 (df = 47)
## F Statistic 34.81*** (df = 3; 47)
## =================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
The regression output indicates a positive relationship between per capita income and per capita education expenditures, as well as between the proportion under 18 years old and education expenditures. To elaborate, holding all other variables constant, a one-unit increase in per capita income corresponds to a 0.08-unit increase in per capita education expenditures. Similarly, holding all other variables constant, a one-unit increase in the proportion under 18 years old corresponds to a 0.82-unit increase in per capita education expenditures.
Conversely, there is a negative relationship between the proportion urban and education expenditures. A one-unit increase in the proportion urban is associated with a decrease of 0.11 units in per capita education expenditures, while holding all other variables constant.
In today’s discussion, we will focus on some techniques for modeling.
Transforming variables helps us conform to regression assumptions, which amplifies predictive power and increases the overall quality of the model. There are four primary reasons why we might want to transform continuous variables:
To even out the variance of a variable if the assumption of homoscedasticity is violated (residuals should be independent with a mean of zero, constant variance, and randomly scattered).
To normalize a variable if the assumption of normality is violated (by visual inspection, probability plots, or normality tests such as Kolmogorov-Smirnov and Shapiro-Wilk).
To linearize the regression model if it seems individual variables are non-linear.
To reduce the impact of outliers and high-leverage observations.
As we haven’t yet delved into the Gauss-Markov assumptions, the set of assumptions under which the ordinary least squares (OLS) estimator is the best linear unbiased estimator (BLUE), today’s discussion will concentrate solely on points 3 and 4, demonstrating how variable transformation can enhance our model.
In many cases, the primary method of variable transformation is by applying a simple deterministic (non-random) mathematical function to one or more variables that you include in your model. Often, the transformation is a relatively simple algebraic function that is reversible (invertible). For example:
Power functions: \(x^p\) the most common of which is \(\sqrt{x} = x^{\frac{1}{2}}\)
Logarithmic functions: \(log_{10}x\), \(log_e x = ln(x)\)
Multiplicative inverse (reciprocal) functions: \(\frac{1}{x}\)
Variables that have been transformed are sometimes harder to interpret. If the measure is commonly known and understood, such as sales dollars, “the multiplicative inverse of sales dollars” can be more difficult to interpret and communicate.
Making meaningful predictions often means inverting the transformed variables back to their original form, which adds extra steps and increases the complexity of your model.
Oftentimes, selecting the best transformation is a process of trial and error.
Transformation can sometimes overshoot the goal and create issues in the other direction, so always verify and check the result. For example, if we apply too powerful a transformation to a heavily right-skewed variable, the variable may end up left-skewed.
When deciding whether to transform our variables, we should first examine the direction and magnitude of skewness in order to choose the appropriate technique.
For left-skewed variables, a reflection should be performed before transformation to make the distribution right-skewed: \((x_{max}+1)-x\) (add 1 to the maximum value of the original variable, then subtract each original value from this result).
Depending on the severity of skewness, we apply the square root transformation (\(\sqrt{x}\)) for mild skewness, the base-10 logarithm transformation (\(log_{10}x\)) for moderate skewness, or the reciprocal transformation (\(\frac{1}{x}\)) for severe skewness. Why so? Let’s see some examples to get the rationale.
x <- c(1, 3, 10, 19, 31, 132, 1477, 300459, 133000000)
data.frame(
x,
sqrtx = sqrt(x),
logx = log(x),
inversex = 1/x
)
## x sqrtx logx inversex
## 1 1 1.000000 0.000000 1.000000e+00
## 2 3 1.732051 1.098612 3.333333e-01
## 3 10 3.162278 2.302585 1.000000e-01
## 4 19 4.358899 2.944439 5.263158e-02
## 5 31 5.567764 3.433987 3.225806e-02
## 6 132 11.489125 4.882802 7.575758e-03
## 7 1477 38.431758 7.297768 6.770481e-04
## 8 300459 548.141405 12.613067 3.328241e-06
## 9 133000000 11532.562595 18.705860 7.518797e-09
As seen in the table, for large values the reciprocal compresses more strongly than the logarithm, which in turn compresses more strongly than the square root. This ordering corresponds to the severity of skewness each transformation can handle. In heavily right-skewed variables, where large outliers sit in the right tail, taking the reciprocal pulls them much closer to the bulk of the values; for moderate outliers, the square root or logarithm suffices.
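The same logic applies to left-skewed variables once they have been reflected. Here is a minimal sketch with a hypothetical left-skewed vector y (the values are made up for illustration):
# hypothetical left-skewed vector: most values bunch near the maximum
y <- c(2, 50, 80, 90, 95, 97, 98, 99, 100)
# reflect so that the long tail points to the right, then transform as usual
y_reflected <- (max(y) + 1) - y
data.frame(y, y_reflected, sqrt_reflected = sqrt(y_reflected))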
Let’s now use the Boston housing price dataset for some practice. Assuming we are examining factors that explain housing prices in Boston, you can access the dataset Boston.xlsx here.
Below is a description of each variable in the dataset. MEDV is the outcome variable.
MEDV: median value of owner-occupied homes in $1000’s
CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built prior to 1940
DIS: weighted distances to five Boston employment centers
RAD: index of accessibility to radial highways
TAX: full-value property-tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
LSTAT: % lower status of the population
library(readxl)
Boston <- read_excel("/Users/yu-shiuanhuang/Desktop/method-sequence/data/Boston.xlsx")
head(Boston)
## # A tibble: 6 × 14
## CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.00632 18 2.31 0 0.538 6.58 65.2 4.09 1 296 15.3 397.
## 2 0.0273 0 7.07 0 0.469 6.42 78.9 4.97 2 242 17.8 397.
## 3 0.0273 0 7.07 0 0.469 7.18 61.1 4.97 2 242 17.8 393.
## 4 0.0324 0 2.18 0 0.458 7.00 45.8 6.06 3 222 18.7 395.
## 5 0.0690 0 2.18 0 0.458 7.15 54.2 6.06 3 222 18.7 397.
## 6 0.0298 0 2.18 0 0.458 6.43 58.7 6.06 3 222 18.7 394.
## # ℹ 2 more variables: LSTAT <dbl>, MEDV <dbl>
Assume we are interested in the relationships between MEDV and INDUS, MEDV and DIS, and MEDV and CRIM, respectively. Let’s first look at the distribution of each explanatory variable.
library(tidyverse)
p1 <- ggplot(data = Boston, aes(x = INDUS)) +
geom_histogram(aes(y = ..density..), bins = 30, color = "black", fill = "gray85") +
labs(title = "mild skewed") +
theme_bw()
p2 <- ggplot(data = Boston, aes(x = DIS)) +
geom_histogram(aes(y = ..density..), bins = 30, color = "black", fill = "gray85") +
labs(title = "moderate skewed") +
theme_bw()
p3 <- ggplot(data = Boston, aes(x = CRIM)) +
geom_histogram(aes(y = ..density..), bins = 30, color = "black", fill = "gray85") +
labs(title = "severe skewed") +
theme_bw()
ggpubr::ggarrange(p1, p2, p3, ncol = 3, nrow = 1)
Now, let’s create scatterplots to visualize the relationships between MEDV and these variables and assess their linearity.
p1 <- ggplot(data = Boston, aes(x = INDUS, y = MEDV)) +
geom_point(color = "darkgray", shape = 1) +
geom_smooth(method = "lm", color = "dodgerblue2") +
geom_smooth(method = "loess", color = "coral2") +
ylim(c(-20, 50)) +
annotate("text", x = 25, y = 45, label = "OLS", color = "dodgerblue2") +
annotate("text", x = 25, y = 42, label = "LOESS", color = "coral2") +
labs(title = "mild skewed") + xlab("INDUS") + ylab("MEDV") + theme_bw()
p2 <- ggplot(data = Boston, aes(x = DIS, y = MEDV)) +
geom_point(color = "darkgray", shape = 1) +
geom_smooth(method = "lm", color = "dodgerblue2") +
geom_smooth(method = "loess", color = "coral2") +
ylim(c(-20, 50)) +
annotate("text", x = 11, y = 45, label = "OLS", color = "dodgerblue2") +
annotate("text", x = 11, y = 42, label = "LOESS", color = "coral2") +
labs(title = "moderate skewed") + xlab("DIS") + ylab("MEDV") + theme_bw()
p3 <- ggplot(data = Boston, aes(x = CRIM, y = MEDV)) +
geom_point(color = "darkgray", shape = 1) +
geom_smooth(method = "lm", color = "dodgerblue2") +
geom_smooth(method = "loess", color = "coral2") +
ylim(c(-20, 50)) +
annotate("text", x = 80, y = 43, label = "OLS", color = "dodgerblue2") +
annotate("text", x = 80, y = 40, label = "LOESS", color = "coral2") +
labs(title = "severe skewed") + xlab("CRIM") + ylab("MEDV") + theme_bw()
ggpubr::ggarrange(p1, p2, p3, ncol = 3, nrow = 1)
Loess stands for locally estimated scatterplot smoothing (lowess stands for locally weighted scatterplot smoothing) and is one of many non-parametric regression techniques, but arguably the most flexible. A smoothing function is a function that attempts to capture general patterns in X-Y relationships while reducing the noise and it makes minimal assumptions about the relationships among variables. The result of a loess application is a line through the moving central tendency of the X-Y relationship.
Loess is fairly straightforward. A specific width of points along the x-axis (the bandwidth or tension) is selected adjacent to the point being predicted, and a low-degree polynomial (often just linear) is fit through that subset of the data, with more weight given to points closest to the value being predicted. The resulting equation is then used to predict the value for the selected point. The data are then shifted one point to the right and the process continues, with a new prediction for the second point, and so on. The resulting points are then connected with a line. The user can control how wide a band of points is used: the smaller the bandwidth, the fewer points are used and the less smooth the final line. Users can also adjust the type of line-fitting that is used (weighted least squares is the most common) and what types of weights are used.
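If you want to experiment with the bandwidth directly rather than relying on geom_smooth()’s defaults, you can call loess() yourself. Below is a minimal sketch; the span of 0.5 and the linear local fit (degree = 1) are illustrative choices, not values used elsewhere in this discussion.
# fit a loess curve of MEDV on CRIM with a chosen bandwidth (span) and a linear local fit
lo <- loess(MEDV ~ CRIM, data = Boston, span = 0.5, degree = 1)
# predict along a grid of CRIM values to trace the smoothed curve
grid <- data.frame(CRIM = seq(min(Boston$CRIM), max(Boston$CRIM), length.out = 100))
grid$fit <- predict(lo, newdata = grid)
head(grid)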
As evident in the scatterplots above, the associations between MEDV and the three explanatory variables lack linearity. Notably, the OLS regression line and the loess line exhibit significant disparities, particularly in the MEDV-CRIM relationship. This is where variable transformation comes in to help improve our linear models.
Let’s first transform these variables based on the severity of skewness.
p1 <- ggplot(data = Boston, aes(x = INDUS)) +
geom_histogram(aes(y = ..density..), bins = 30, color = "black", fill = "gray85") +
labs(title = "Before Transformation") +
theme_bw()
p2 <- ggplot(data = Boston, aes(x = sqrt(INDUS))) +
geom_histogram(aes(y = ..density..), bins = 30, color = "black", fill = "gray85") +
labs(title = "After Transformation") +
theme_bw()
p3 <- ggplot(data = Boston, aes(x = DIS)) +
geom_histogram(aes(y = ..density..), bins = 30, color = "black", fill = "gray85") +
labs(title = "Before Transformation") +
theme_bw()
p4 <- ggplot(data = Boston, aes(x = log(DIS+1))) +
geom_histogram(aes(y = ..density..), bins = 30, color = "black", fill = "gray85") +
labs(title = "After Transformation") +
theme_bw()
p5 <- ggplot(data = Boston, aes(x = CRIM)) +
geom_histogram(aes(y = ..density..), bins = 30, color = "black", fill = "gray85") +
labs(title = "Before Transformation") +
theme_bw()
p6 <- ggplot(data = Boston, aes(x = (1/(CRIM+1)))) +
geom_histogram(aes(y = ..density..), bins = 30, color = "black", fill = "gray85") +
labs(title = "After Transformation") +
theme_bw()
ggpubr::ggarrange(p1, p2, p3, p4, p5, p6, ncol = 2, nrow = 3)
Post-transformation, the distribution of these variables becomes less skewed. It’s worth noting that for DIS and CRIM, I apply the logarithm and the reciprocal, respectively, while adding 1 to each value so that no value equals 0, since the logarithm and reciprocal of 0 cannot be computed.
Now, let’s replot the scatterplots between MEDV and these transformed explanatory variables.
p1 <- ggplot(data = Boston, aes(x = INDUS, y = MEDV)) +
geom_point(color = "darkgray", shape = 1) +
geom_smooth(method = "lm", color = "dodgerblue2") +
geom_smooth(method = "loess", color = "coral2") +
annotate("text", x = 25, y = 47, label = "OLS", color = "dodgerblue2") +
annotate("text", x = 25, y = 42, label = "LOESS", color = "coral2") +
labs(title = "Before Transformation") + xlab("INDUS") +
ylim(c(0, 50)) +
ylab("MEDV") + theme_bw()
p2 <- ggplot(data = Boston, aes(x = sqrt(INDUS), y = MEDV)) +
geom_point(color = "darkgray", shape = 1) +
geom_smooth(method = "lm", color = "dodgerblue2") +
geom_smooth(method = "loess", color = "coral2") +
annotate("text", x = 4.5, y = 47, label = "OLS", color = "dodgerblue2") +
annotate("text", x = 4.5, y = 42, label = "LOESS", color = "coral2") +
ylim(c(0, 50)) +
labs(title = "After Transfromation") + xlab("INDUS") +
ylab("MEDV") + theme_bw()
p3 <- ggplot(data = Boston, aes(x = DIS, y = MEDV)) +
geom_point(color = "darkgray", shape = 1) +
geom_smooth(method = "lm", color = "dodgerblue2") +
geom_smooth(method = "loess", color = "coral2") +
ylim(c(0, 50)) +
annotate("text", x = 11, y = 47, label = "OLS", color = "dodgerblue2") +
annotate("text", x = 11, y = 42, label = "LOESS", color = "coral2") +
labs(title = "Before Transfromation") + xlab("DIS") +
ylab("MEDV") + theme_bw()
p4 <- ggplot(data = Boston, aes(x = log(DIS+1), y = MEDV)) +
geom_point(color = "darkgray", shape = 1) +
geom_smooth(method = "lm", color = "dodgerblue2") +
geom_smooth(method = "loess", color = "coral2") +
ylim(c(0, 50)) +
annotate("text", x = 2.3, y = 47, label = "OLS", color = "dodgerblue2") +
annotate("text", x = 2.3, y = 42, label = "LOESS", color = "coral2") +
labs(title = "After Transfromation") + xlab("DIS") +
ylab("MEDV") + theme_bw()
p5 <- ggplot(data = Boston, aes(x = CRIM, y = MEDV)) +
geom_point(color = "darkgray", shape = 1) +
geom_smooth(method = "lm", color = "dodgerblue2") +
geom_smooth(method = "loess", color = "coral2") +
ylim(c(-20, 50)) +
annotate("text", x = 80, y = 46, label = "OLS", color = "dodgerblue2") +
annotate("text", x = 80, y = 40, label = "LOESS", color = "coral2") +
labs(title = "Before Transfromation") + xlab("CRIM") +
ylab("MEDV") + theme_bw()
p6 <- ggplot(data = Boston, aes(x = 1/(CRIM+1), y = MEDV)) +
geom_point(color = "darkgray", shape = 1) +
geom_smooth(method = "lm", color = "dodgerblue2") +
geom_smooth(method = "loess", color = "coral2") +
ylim(c(-20, 50)) +
annotate("text", x = 0.9, y = 46, label = "OLS", color = "dodgerblue2") +
annotate("text", x = 0.9, y = 40, label = "LOESS", color = "coral2") +
labs(title = "After Transfromation") + xlab("CRIM") +
ylab("MEDV") + theme_bw()
ggpubr::ggarrange(p1, p2, p3, p4, p5, p6, ncol = 2, nrow = 3)
As evident, linearity improves post-transformation, particularly for CRIM, which is severely skewed. However, it’s crucial to bear in mind the inherent trade-offs associated with variable transformations, which can affect interpretability.
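One way to see the payoff beyond the plots is to compare a simple model of MEDV on the raw predictor with one on the transformed predictor. A minimal sketch for CRIM (the model names are hypothetical):
# compare a model on the raw predictor with one on the transformed predictor
m_raw <- lm(MEDV ~ CRIM, data = Boston)
m_trans <- lm(MEDV ~ I(1/(CRIM + 1)), data = Boston)
# compare the R-squared values; a higher value for m_trans would indicate a better linear fit
summary(m_raw)$r.squared
summary(m_trans)$r.squared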
Note: some of the materials on variable transformation are adapted from here.
There are two primary reasons to rescale our variables: to enhance the interpretability of coefficients and to identify the relative importance of each variable in explaining the outcome of interest.
Assuming we are examining factors that explain U.S. public school expenditures, you can access the dataset Anscombe.txt here. This dataset measures U.S. State Public School Expenditures in 1981, and the variables are:
education = per capita education expenditures, dollars
income = per capita income, dollars
under18 = proportion under 18 years old, per 1000
urban = proportion urban, per 1000
# load data
df <- read.delim("/Users/yu-shiuanhuang/Desktop/method-sequence/data/Anscombe.txt", sep = "")
ols1 <- lm(education ~ income + under18 + urban, data = df)
library(stargazer)
stargazer(ols1,
title = "Table 1: Factors Explaining U.S. Public School Expenditures",
dep.var.labels = "Education Expenditure",
covariate.labels = c("Income", "Under 18", "Urban"),
star.cutoffs = c(0.05, 0.01, 0.001),
digits = 2 , type = "text")
##
## Table 1: Factors Explaining U.S. Public School Expenditures
## =================================================
## Dependent variable:
## -----------------------------
## Education Expenditure
## -------------------------------------------------
## Income 0.08***
## (0.01)
##
## Under 18 0.82***
## (0.16)
##
## Urban -0.11**
## (0.03)
##
## Constant -286.84***
## (64.92)
##
## -------------------------------------------------
## Observations 51
## R2 0.69
## Adjusted R2 0.67
## Residual Std. Error 26.69 (df = 47)
## F Statistic 34.81*** (df = 3; 47)
## =================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
As seen in the regression table, the coefficient for income is significantly positive, implying that, holding all the other variables constant, a one-dollar increase in per capita income corresponds to a 0.08-unit increase in a state’s per capita education expenditures. While reporting this finding on the original scale is acceptable, rescaling the variable can make it clearer and more impactful. For instance, we can divide the income variable by 1,000 (expressing it in thousands of dollars) so that the regression output provides a more tangible sense of the relationship.
df <- df %>% mutate(income1000 = (income/1000))
ols2 <- lm(education ~ income1000 + under18 + urban, data = df)
stargazer(ols1, ols2,
title = "Table 2: Factors Explaining U.S. Public School Expenditures With Rescaled Vriable",
dep.var.labels = "Education Expenditure",
covariate.labels = c("Income", "Income/1000", "Under 18", "Urban"),
star.cutoffs = c(0.05, 0.01, 0.001),
digits = 2 , type = "text")
##
## Table 2: Factors Explaining U.S. Public School Expenditures With Rescaled Variable
## ============================================================
## Dependent variable:
## ------------------------------
## Education Expenditure
## (1) (2)
## ------------------------------------------------------------
## Income 0.08***
## (0.01)
##
## Income/1000 80.65***
## (9.30)
##
## Under 18 0.82*** 0.82***
## (0.16) (0.16)
##
## Urban -0.11** -0.11**
## (0.03) (0.03)
##
## Constant -286.84*** -286.84***
## (64.92) (64.92)
##
## ------------------------------------------------------------
## Observations 51 51
## R2 0.69 0.69
## Adjusted R2 0.67 0.67
## Residual Std. Error (df = 47) 26.69 26.69
## F Statistic (df = 3; 47) 34.81*** 34.81***
## ============================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
After rescaling, notice that a $1,000 increase in per capita income corresponds to an 80.65-unit increase in a state’s per capita education expenditures. This increase equals 1.74 times the standard deviation of the outcome variable. It’s crucial to note that only the coefficient and standard error of income change after rescaling; everything else remains the same. This is because we only changed the unit of measurement of income.
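A quick check makes this concrete: the coefficient on income1000 is simply the original income coefficient multiplied by 1,000.
coef(ols1)["income"] * 1000 # original coefficient, re-expressed per $1,000 of income
coef(ols2)["income1000"] # coefficient from the rescaled model; the two match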
Having chosen a regression model with statistically significant independent variables, interpreting the results reveals the relationship between these variables and the dependent variable. Now, a natural question arises: “Which independent variable is the most important?”
Determining the most important variable is more complex than it initially seems. Firstly, you must define what “most important” means, considering the subject area and goals of your regression model. There is no one-size-fits-all definition, and the methods used for data collection and measurement can influence the perceived importance of independent variables.
The standard regression coefficients in your statistical output illustrate the relationship between the independent variables and the dependent variable. Each coefficient represents the mean change in the dependent variable for a one-unit shift in the corresponding independent variable. While larger coefficients imply a larger change, comparing absolute sizes alone may not accurately identify the most important variable. This is particularly true when the independent variables have different units, rendering direct comparisons of coefficients meaningless, such as when measuring time, pressure, and temperature.
While the coefficient can’t determine the importance of an independent variable, what about the variable’s p-value? Comparing p-values may seem logical, since we use them to select variables for the model. But do lower p-values signify more important variables? Unfortunately, p-values don’t measure importance; they reflect several properties of the estimate at once. A very small p-value doesn’t necessarily indicate practical importance. Factors like a precise estimate, low variability, or a large sample size can yield a tiny p-value for an independent variable. Hence, when evaluating statistical results, it’s crucial to assess both the statistical and practical significance of effect sizes.
So how can we compare the importance of independent variables in our model? Use standardized coefficients!
As previously mentioned, comparing regular regression coefficients is challenging because of their different scales. Standardized coefficients, however, share the same scale and enable meaningful comparisons. Standardizing involves subtracting the variable’s mean from each observed value and dividing by the variable’s standard deviation (similar to calculating z-values in statistics). This process puts every variable on a consistent scale for comparison.
\[ z = \frac{x_i-\bar{x}}{sd_x} \]
We then use standardized coefficients by running the regression model with standardized independent variables. These coefficients, on the same scale, show how much the dependent variable changes with a one standard deviation shift in an independent variable. They act as standardized effect sizes, revealing the relationship strength without relying on original data units. Measured in standard deviations, these effects help grasp the practical significance of the findings.
Let’s apply this technique to examine which factor explains U.S. public school expenditures the most. You can achieve standardization using the scale() function or compute it manually.
df <- df %>% mutate(income_s = scale(income),
under18_s = scale(under18),
urban_s = scale(urban))
ols3 <- lm(education ~ income_s + under18_s + urban_s, data = df)
stargazer(ols1, ols3,
title = "Table 3: Factors Explaining U.S. Public School Expenditures With Standardized Vriables",
dep.var.labels = "Education Expenditure",
covariate.labels = c("Income", "Under 18", "Urban",
"Income (standardized)", "Under 18 (standardized)", "Urban (standardized)"),
star.cutoffs = c(0.05, 0.01, 0.001),
digits = 2 , type = "text")
##
## Table 3: Factors Explaining U.S. Public School Expenditures With Standardized Variables
## ============================================================
## Dependent variable:
## ------------------------------
## Education Expenditure
## (1) (2)
## ------------------------------------------------------------
## Income 0.08***
## (0.01)
##
## Under 18 0.82***
## (0.16)
##
## Urban -0.11**
## (0.03)
##
## Income (standardized) 45.17***
## (5.21)
##
## Under 18 (standardized) 19.58***
## (3.83)
##
## Urban (standardized) -16.01**
## (5.19)
##
## Constant -286.84*** 196.31***
## (64.92) (3.74)
##
## ------------------------------------------------------------
## Observations 51 51
## R2 0.69 0.69
## Adjusted R2 0.67 0.67
## Residual Std. Error (df = 47) 26.69 26.69
## F Statistic (df = 3; 47) 34.81*** 34.81***
## ============================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
After standardizing each independent variable, we observe that income has the largest effect size. Holding all other variables constant, a one standard deviation increase in income is associated with a 45.17-unit increase in a state’s per capita education expenditures. This differs from what we would infer using the original scale of the variables to fit the model. Importantly, because we have now rescaled the variables not only in their unit of measurement but also in their range, the estimated intercept differs from the original model. With all independent variables standardized, the intercept indicates that the baseline per capita education expenditure is 196.31 when all standardized variables are at their means, which is 0.
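As a check on the scale() approach, manually z-scoring income (subtracting its mean and dividing by its standard deviation) yields the same standardized coefficient. A minimal sketch; income_z and ols3b are hypothetical names:
df <- df %>% mutate(income_z = (income - mean(income)) / sd(income))
ols3b <- lm(education ~ income_z + under18_s + urban_s, data = df)
coef(ols3b)["income_z"] # should match the 45.17 estimate for Income (standardized) in Table 3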
A summated rating scale is a type of assessment instrument comprising a series of statements measuring the same construct or variable, to which respondents indicate their degree of agreement or disagreement. The number of response options for each item varies, often from 5 to 7 points (e.g., from strongly agree to strongly disagree). The response values for individual items may be summed to obtain a total or average score that reflects a person’s general attitude toward the construct of interest.
For example, the table below is a Likert scale assessing how strong an individual’s partisan identity is (from strongly disagree with each statement to strongly agree).
| | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| When I speak about this party, I usually say “we” instead of “they.” | | | | | |
| I am interested in what other people think about this party. | | | | | |
| When people criticize this party, it feels like a personal insult. | | | | | |
| I have a lot in common with other supporters of this party. | | | | | |
| If this party does badly in opinion polls, my day is ruined. | | | | | |
| When I meet someone who supports this party, I feel connected with this person. | | | | | |
| When I speak about this party, I refer to them as “my party.” | | | | | |
| When people praise this party, it makes me feel good. | | | | | |
Since these eight statements all measure one’s level of partisan identity, instead of including all the variables based on these statements separately, we can use a summated rating scale to combine the information into one aggregated variable.
But before applying a summated rating scale, we should first check the reliability of the scale to see whether this set of items provides a consistent measurement of the concept. In the lecture, Chris suggested that we can use Cronbach’s alpha to assess such consistency. The higher the \(\alpha\) coefficient, the more the items share covariance and probably measure the same underlying concept. Here is the rule of thumb:
| Cronbach’s alpha | Internal consistency |
|---|---|
| \(\alpha \geq 0.9\) | Excellent |
| \(0.9 > \alpha \geq 0.8\) | Good |
| \(0.8 > \alpha \geq 0.7\) | Acceptable |
| \(0.7 > \alpha \geq 0.6\) | Questionable |
| \(0.6 > \alpha \geq 0.5\) | Poor |
| \(0.5 > \alpha\) | Unacceptable |
The \(\alpha\) coefficient can be computed using the alpha() function in the psych package.
Here I provide an example of constructing a summated rating scale of partisan social identity using Bankert, Huddy, and Rosema’s (2016) data, based on the questions outlined above.
pid <- read_csv("/Users/yu-shiuanhuang/Desktop/method-sequence/data/pid.csv")
head(pid)
## # A tibble: 6 × 8
## V1 V2 V3 V4 V5 V6 V7 V8
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 3 1 3 1 2 1 2
## 2 3 3 3 3 2 3 3 3
## 3 3 2 3 3 3 3 2 3
## 4 2 2 2 2 2 2 2 2
## 5 2 3 2 3 2 3 2 3
## 6 2 3 1 3 2 3 2 3
First, assess how consistently these eight variables/questions measure the concept of partisan social identity.
#install.packages("psych")
library(psych)
alpha(na.omit(pid))
##
## Reliability analysis
## Call: alpha(x = na.omit(pid))
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.9 0.9 0.9 0.53 9 0.0014 1.9 0.63 0.52
##
## 95% confidence boundaries
## lower alpha upper
## Feldt 0.9 0.9 0.9
## Duhachek 0.9 0.9 0.9
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## V1 0.89 0.89 0.88 0.54 8.1 0.0016 0.0048 0.54
## V2 0.89 0.89 0.89 0.55 8.4 0.0015 0.0056 0.53
## V3 0.89 0.89 0.88 0.53 7.8 0.0016 0.0070 0.52
## V4 0.89 0.89 0.88 0.53 7.9 0.0016 0.0062 0.52
## V5 0.89 0.89 0.89 0.55 8.4 0.0016 0.0050 0.54
## V6 0.88 0.88 0.87 0.51 7.3 0.0017 0.0052 0.52
## V7 0.88 0.88 0.87 0.52 7.5 0.0017 0.0052 0.52
## V8 0.88 0.88 0.88 0.52 7.5 0.0017 0.0060 0.52
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## V1 10974 0.75 0.74 0.70 0.65 1.6 0.86
## V2 10974 0.72 0.71 0.65 0.62 2.4 0.85
## V3 10974 0.77 0.78 0.73 0.69 1.6 0.78
## V4 10974 0.76 0.76 0.71 0.68 2.4 0.82
## V5 10974 0.69 0.71 0.64 0.61 1.6 0.70
## V6 10974 0.83 0.83 0.81 0.76 2.1 0.84
## V7 10974 0.81 0.81 0.79 0.74 1.6 0.81
## V8 10974 0.81 0.80 0.77 0.73 2.2 0.86
##
## Non missing response frequency for each item
## 1 2 3 4 miss
## V1 0.56 0.27 0.11 0.05 0
## V2 0.17 0.38 0.37 0.08 0
## V3 0.52 0.34 0.11 0.03 0
## V4 0.16 0.37 0.41 0.06 0
## V5 0.55 0.36 0.08 0.01 0
## V6 0.25 0.40 0.30 0.04 0
## V7 0.57 0.30 0.10 0.03 0
## V8 0.23 0.39 0.32 0.06 0
The Cronbach’s alpha test indicates high internal consistency among these eight items, with an \(\alpha\) of approximately \(0.90\). Given this reliability, we can create the summated rating scale of partisan social identity by simply averaging these eight variables, as they are all on the same scale. Note that if the variables you aggregate are on different scales, ensure you rescale them to a common scale before conducting the Cronbach’s alpha test and generating the summated index, as sketched below.
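For illustration, here is a minimal sketch of such a rescaling (min-max to a 0-1 range); rescale01 and pid_rescaled are hypothetical names, and this step is not needed here since the pid items already share the same scale. The index itself is computed from the original items right after.
# hypothetical helper: rescale a numeric vector to the 0-1 range
rescale01 <- function(v) (v - min(v, na.rm = TRUE)) / (max(v, na.rm = TRUE) - min(v, na.rm = TRUE))
# apply it to every item before averaging (shown only for illustration)
pid_rescaled <- as.data.frame(lapply(pid[, paste0("V", 1:8)], rescale01))
summary(pid_rescaled$V1)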
pid$psi <- (pid$V1 + pid$V2 + pid$V3 + pid$V4 + pid$V5 + pid$V6 + pid$V7 + pid$V8)/8
summary(pid$psi)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 1.500 1.875 1.943 2.375 4.000 883
The partisan social identity index (psi) we created ranges from 1 to 4. The higher the value, the stronger a person feels that the party they are affiliated with represents one of their social identities.
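Equivalently, since all eight items share the same scale, rowMeans() produces the same index in one step; a minimal sketch (psi2 is a hypothetical name):
# average the eight items row-wise; NAs propagate, just as in the sum-and-divide version
pid$psi2 <- rowMeans(pid[, paste0("V", 1:8)])
all.equal(pid$psi, pid$psi2)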
In today’s discussion, we will cover how to decompose a linear regression and how to use our estimated linear regression line to do prediction.
The Frisch-Waugh-Lovell Theorem (FWL; after the initial proof by Frisch and Waugh (1933), and later generalisation by Lovell (1963)) states that:
Any predictor’s regression coefficient in a multivariate model is equivalent to the regression coefficient estimated from a bivariate model in which the residualized outcome is regressed on the residualized component of the predictor; where the residuals are taken from models regressing the outcome and the predictor on all other predictors in the multivariate regression (separately).
Simply put, the FWL Theorem shows how to decompose a regression of \(Y\) on a set of variables \(X\) into two pieces. The coefficient of one of the variables in a multiple linear regression model can be obtained by netting out the effect of the other variables in the model from both the dependent variable and that independent variable.
Why is it important to know the FWL Theorem? It shows that we must partial one set of covariates out of the other in order to obtain a ceteris paribus effect. For example, in order to get the effect of \(X_1\) on \(Y\), \(X_2\) needs to be held constant, which involves removing the portion of \(X_1\) that is correlated with \(X_2\). Let’s understand it through the regression formula.
\(Y = a + b_1X_1 + b_2X_2 + e\) (assume this is the estimated regression line we have)
The value of \(b_1\) can be obtained by:
netting off the effect of \(X_2\) from \(Y\); and
\(Y = a + b_2X_2 + v_1\)
You can consider the residual \(v_1\) as the variation in \(Y\) that cannot be explained by \(X_2\).
netting off the effect of \(X_2\) from \(X_1\)
\(X_1 = a + b_2X_2 + v_2\)
You can consider the residual \(v_2\) as the variation in \(X_1\) that cannot be explained by \(X_2\).
then, we regress the variation in \(Y\) that cannot be explained by \(X_2\) on the variation in \(X_1\) that cannot be explained by \(X_2\).
\(v_1 = a' + kv_2 + e\)
By doing this, the \(k\) we estimate will be the same as the \(b_1\) we estimate by regressing \(Y\) on both \(X_1\) and \(X_2\) at the same time.
Let’s use the Auto dataset as an example to illustrate the FWL Theorem in practice.
library(ISLR2)
head(Auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
Here is the description of each variable in this dataset:
mpg: miles per gallon
cylinders: number of cylinders between 4 and 8
displacement: engine displacement (cu. inches)
horsepower: engine horsepower
weight: vehicle weight (lbs.)
acceleration: time to accelerate from 0 to 60 mph (sec.)
year: model year (modulo 100)
origin: origin of car (1. American, 2. European, 3. Japanese)
name: vehicle name
Assume we are investigating how the miles per gallon of a car can be explained by its origin (whether it is a U.S. or a foreign car) and its weight. That said, mpg is the outcome variable, while origin and weight are the explanatory variables.
Before beginning the analysis, let’s first create a new dummy variable named foreign based on origin to classify these cars as either foreign or U.S. cars.
Auto$foreign <- ifelse(Auto$origin == 1, 0, 1)
I will use foreign as my \(X_1\) variable and weight as the \(X_2\) variable. (\(X_1\) and \(X_2\) can include multiple variables, but the illustration will be clearer with just one each.) Let’s first run a regression of \(Y\), which is mpg, on both \(X_1\) and \(X_2\).
fit <- lm(mpg ~ foreign + weight, data = Auto)
stargazer::stargazer(fit, star.cutoffs = c(0.05, 0.01, 0.001),
digits = 3 , type = "text", omit.stat = c("F"))
##
## =================================================
## Dependent variable:
## -----------------------------
## mpg
## -------------------------------------------------
## foreign 1.638**
## (0.560)
##
## weight -0.007***
## (0.0003)
##
## Constant 43.929***
## (1.112)
##
## -------------------------------------------------
## Observations 392
## R2 0.699
## Adjusted R2 0.698
## Residual Std. Error 4.291 (df = 389)
## =================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
Now let’s start decomposing our regression following the FWL Theorem:
Step 1: Regress \(Y\) on \(X_2\) and take the residuals
fit1 <- lm(mpg ~ weight, data = Auto)
v1 <- fit1$residuals
Step 2: Regress \(X_1\) on \(X_2\) and take the residuals
fit2 <- lm(foreign ~ weight, data = Auto)
v2 <- fit2$residuals
Step 3: Regress the residuals from Step 1 on the residuals from Step 2
fit3 <- lm(v1 ~ v2)
stargazer::stargazer(fit, fit1, fit2, fit3,
star.cutoffs = c(0.05, 0.01, 0.001),
digits = 3 , type = "text", omit.stat = c("F"))
##
## =======================================================================================
## Dependent variable:
## -------------------------------------------------------------------
## mpg foreign v1
## (1) (2) (3) (4)
## ---------------------------------------------------------------------------------------
## foreign 1.638**
## (0.560)
##
## weight -0.007*** -0.008*** -0.0003***
## (0.0003) (0.0003) (0.00002)
##
## v2 1.638**
## (0.559)
##
## Constant 43.929*** 46.217*** 1.396*** -0.000
## (1.112) (0.799) (0.072) (0.216)
##
## ---------------------------------------------------------------------------------------
## Observations 392 392 392 392
## R2 0.699 0.693 0.361 0.022
## Adjusted R2 0.698 0.692 0.360 0.019
## Residual Std. Error 4.291 (df = 389) 4.333 (df = 390) 0.388 (df = 390) 4.286 (df = 390)
## =======================================================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
Notice that in the Step 3 regression, the coefficient on v2 is the same as the coefficient on foreign in the original regression. Why does it work? By netting off both the effect of weight on mpg and the effect of weight on foreign, we actually eliminate the correlation between foreign and weight.
First, let’s check how correlated foreign and weight are.
cor(Auto$foreign, Auto$weight)
## [1] -0.6009783
What happens after we eliminate the correlation between foreign and weight?
cor(Auto$weight, v2)
## [1] -3.310067e-17
Recall that the residual \(v_2\) can be seen as the variation in foreign that cannot be explained by weight. As shown above, \(v_2\) is not correlated with weight; the correlation coefficient is very close to 0. Now we can use \(v_2\) to check how much of the variation in \(v_1\), which is the variation in mpg that cannot be explained by weight, can be explained by \(v_2\).
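A direct numerical check confirms the theorem with the fitted objects we already have:
coef(fit)["foreign"] # coefficient on foreign from the full regression
coef(fit3)["v2"] # coefficient on v2 from the Step 3 regression
all.equal(unname(coef(fit)["foreign"]), unname(coef(fit3)["v2"])) # should be TRUE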
If you’d like to understand the FWL Theorem in more detail, please see here.
One of the key strengths of linear regression is its capability to make forecasts based on the relationships identified in the data. Here, I demonstrate how to make these predictions manually and using built-in functions in R.
The data we are using today is ANES 2022 pilot data, which can be downloaded here. The codebook can be downloaded here.
Assume we study whether those who are more ideologically conservative rate Donald Trump with higher liking scores.
library(haven) # for importing .dta format data
# load data
anes <- read_dta("/Users/yu-shiuanhuang/Desktop/method-sequence/data/anes_pilot_2022_stata_20221214.dta")
Remember to deal with the missing/NA values in the two variables we are going to analyze: fttrump and libcon. Before doing any analysis, always check the codebook of the dataset first to see what values in each variable you will have to address.
fttrump: How would you rate Donald Trump on a scale of 0-100?
libcon: We hear a lot of talk these days about liberals and conservatives. Here is a seven-point scale on which the political views that people might hold are arranged from extremely liberal to extremely conservative. Where would you place yourself on this scale, or haven’t you thought much about this?
# examine each variable first
table(anes$fttrump) # -7 means the respondent did not answer this question
##
## -7 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 1 360 111 35 29 20 26 16 9 17 9 10 6 8 6 7 25 7 4 5
## 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
## 8 10 3 4 5 4 5 3 2 2 4 11 6 4 8 9 6 8 5 2
## 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58
## 4 12 7 7 6 2 4 9 9 7 12 47 14 12 9 6 17 11 9 4
## 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78
## 5 18 8 6 10 6 10 11 6 8 6 32 6 14 13 8 18 6 7 4
## 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98
## 6 17 6 7 4 12 33 7 13 4 9 22 8 4 13 5 17 9 6 9
## 99 100
## 10 134
table(anes$libcon) # 997 means the respondent did answer the question but picked "haven't thought about it too much"
##
## 1 2 3 4 5 6 7 997
## 122 203 121 367 136 287 135 214
After examining the variables, make sure to recode -7 in fttrump and 997 in libcon to NAs.
# recode missing values
anes$fttrump[anes$fttrump == -7] <- NA
anes$libcon[anes$libcon == 997] <- NA
# examine the variables again just to double check
table(anes$fttrump)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## 360 111 35 29 20 26 16 9 17 9 10 6 8 6 7 25 7 4 5 8
## 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
## 10 3 4 5 4 5 3 2 2 4 11 6 4 8 9 6 8 5 2 4
## 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
## 12 7 7 6 2 4 9 9 7 12 47 14 12 9 6 17 11 9 4 5
## 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
## 18 8 6 10 6 10 11 6 8 6 32 6 14 13 8 18 6 7 4 6
## 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
## 17 6 7 4 12 33 7 13 4 9 22 8 4 13 5 17 9 6 9 10
## 100
## 134
table(anes$libcon)
##
## 1 2 3 4 5 6 7
## 122 203 121 367 136 287 135
Now, let’s run a simple linear regression of fttrump on libcon.
m1 <- lm(fttrump ~ libcon, data = anes)
stargazer::stargazer(m1,
title = "Table 1: Ideology and Feelings toward Trump",
dep.var.labels = "Feelings toward Trump",
covariate.labels = c("Ideology"),
star.cutoffs = c(0.05, 0.01, 0.001),
digits = 2 , type = "text")
##
## Table 1: Ideology and Feelings toward Trump
## =================================================
## Dependent variable:
## -----------------------------
## Feelings toward Trump
## -------------------------------------------------
## Ideology 14.03***
## (0.42)
##
## Constant -19.72***
## (1.93)
##
## -------------------------------------------------
## Observations 1,370
## R2 0.44
## Adjusted R2 0.44
## Residual Std. Error 28.42 (df = 1368)
## F Statistic 1,094.34*** (df = 1; 1368)
## =================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
Now that we have our simple OLS linear regression of fttrump on libcon, we can use this regression line (\(\hat{y} = -19.72 + 14.03x\)) to make predictions and evaluate the uncertainty of those predictions. How do we do it by hand?
Let’s first plot a scatterplot between libcon and fttrump along with the regression line!
ggplot(data = anes, aes(x = libcon, y = fttrump)) +
geom_point(alpha = 0.7, color = "darkgray") +
geom_smooth(method = "lm", color = "coral2", se = FALSE) +
scale_x_continuous(name = "Ideology (libcon)",
breaks = seq(1, 7, 1)) +
ylab("Feelings toward Trump (fttrump)") +
annotate("text", x = 5.5, y = 72,
label = "predicted values on \n the regression line",
color = "coral2") +
theme_bw()
Let’s calculate the predicted values of the regression line by hand!
# what is the intercept and slope?
summary(m1)
##
## Call:
## lm(formula = fttrump ~ libcon, data = anes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -78.480 -21.414 -0.465 20.314 105.695
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.7245 1.9254 -10.24 <2e-16 ***
## libcon 14.0292 0.4241 33.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.42 on 1368 degrees of freedom
## (215 observations deleted due to missingness)
## Multiple R-squared: 0.4444, Adjusted R-squared: 0.444
## F-statistic: 1094 on 1 and 1368 DF, p-value: < 2.2e-16
# we can call out the intercept and slope from the summary table by using
coef(m1)[1] ## intercept
## (Intercept)
## -19.72446
unname(coef(m1)[1])
## [1] -19.72446
coef(m1)[2] ## slope
## libcon
## 14.02919
unname(coef(m1)[2])
## [1] 14.02919
# let's say we'd like to know how much a person would like Trump if she places herself in the middle (4) of the ideology spectrum.
unname(coef(m1)[1]) + (unname(coef(m1)[2])*4) ## the predicted value is 36.3923
## [1] 36.39231
Now, instead of only predicting one value, let’s predict the whole ideology spectrum (1-7)!
# create a data frame to store predicted values
pred.df <- data.frame(libcon = c(1:7)) ## make sure that the variable name in this new created data frame is the same as the variable name you specified in your m1 model
pred.df
## libcon
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## 7 7
# make prediction based on the ideology var in pred.df
pred.df2 <- pred.df %>%
mutate(pred.fttrump = unname(coef(m1)[1]) + (unname(coef(m1)[2])*libcon))
pred.df2
## libcon pred.fttrump
## 1 1 -5.695269
## 2 2 8.333923
## 3 3 22.363115
## 4 4 36.392307
## 5 5 50.421499
## 6 6 64.450691
## 7 7 78.479883
We can also use the predict() function in R to do this more efficiently!
predict(m1, newdata = pred.df) ## these predicted values should be the same as what we did earlier!
## 1 2 3 4 5 6 7
## -5.695269 8.333923 22.363115 36.392307 50.421499 64.450691 78.479883
How confident can we be about our estimates of ideology (libcon)? Calculate confidence intervals for the estimates and the predicted values!
Use the confint() function to find the 95% confidence intervals of your coefficient estimates.
confint(m1, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) -23.50145 -15.94747
## libcon 13.19726 14.86113
summary(m1)
##
## Call:
## lm(formula = fttrump ~ libcon, data = anes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -78.480 -21.414 -0.465 20.314 105.695
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.7245 1.9254 -10.24 <2e-16 ***
## libcon 14.0292 0.4241 33.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.42 on 1368 degrees of freedom
## (215 observations deleted due to missingness)
## Multiple R-squared: 0.4444, Adjusted R-squared: 0.444
## F-statistic: 1094 on 1 and 1368 DF, p-value: < 2.2e-16
Use the predict() function to generate a 95% confidence interval for each predicted value.
# to display the 95% confidence intervals, specify the argument interval = "confidence"
predict(m1, newdata = pred.df, interval = "confidence", level = 0.95)
## fit lwr upr
## 1 -5.695269 -8.727516 -2.663021
## 2 8.333923 5.987091 10.680755
## 3 22.363115 20.572878 24.153352
## 4 36.392307 34.880187 37.904427
## 5 50.421499 48.762495 52.080504
## 6 64.450691 62.305391 66.595992
## 7 78.479883 75.680497 81.279270
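These interval bounds can also be reproduced by hand from the coefficient estimates and their variance-covariance matrix. A minimal sketch for libcon = 4 (x0, fit0, and se0 are hypothetical names):
x0 <- c(1, 4) # intercept term and libcon = 4
fit0 <- sum(x0 * coef(m1)) # point prediction (about 36.39)
se0 <- drop(sqrt(t(x0) %*% vcov(m1) %*% x0)) # standard error of the fitted value
fit0 + c(-1, 1) * qt(0.975, df = df.residual(m1)) * se0 # should match row 4 above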
Plot the 95% confidence interval of our regression line by setting se = TRUE and level = 0.95 in the geom_smooth() function.
ggplot(data = anes, aes(x = libcon, y = fttrump)) +
geom_point(alpha = 0.7, color = "darkgray") +
geom_smooth(method = "lm", color = "coral2", se = TRUE, level = 0.95) +
scale_x_continuous(name = "Ideology (libcon)",
breaks = seq(1, 7, 1)) +
ylab("Feelings toward Trump (fttrump)") +
annotate("text", x = 5.5, y = 72,
label = "predicted values on \n the regression line",
color = "coral2") +
theme_bw()
You can also use the function from the jtools package that Chris introduced in the lecture to plot it quickly. effect_plot() is also built with ggplot2.
jtools::effect_plot(m1, pred = "libcon", interval = TRUE,
plot.points = TRUE, jitter = 0.2, colors = "darkgray",
line.colors = "coral2", x.label = "Ideology (libcon)",
y.label = "Feelings toward Trump (fttrump)") +
annotate("text", x = 5.5, y = 72,
label = "predicted values on \n the regression line",
color = "coral2")
Now assume we study whether those who are more ideologically conservative rate Donald Trump with higher liking scores. However, instead of only including libcon in our model, let’s also control for some demographic variables (aka control variables), including age, gender, race, education level, and party ID.
What kind of control variables should you include? It depends on the research question you are interested in and what the related literature suggests.
Remember to also deal with the missing/NA values first in these control variables!
# likewise, examine each variable first
table(anes$birthyr_dropdown) ## birth year
##
## 1930 1932 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947
## 1 1 1 2 2 3 3 5 7 6 8 4 14 9 15 27
## 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963
## 22 24 28 26 36 31 37 27 24 43 65 39 52 40 40 41
## 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
## 38 27 27 25 18 35 18 12 22 13 21 16 14 13 20 23
## 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995
## 23 21 20 21 14 24 31 27 29 18 20 24 38 19 23 20
## 1996 1997 1998 1999 2000 2001 2002 2003 2004
## 18 25 23 23 29 17 15 24 14
table(anes$gender) ## male: 1; female: 2
##
## 1 2
## 707 878
table(anes$rwh) ## white: 1; non-white: 2
##
## 1 2
## 1262 323
table(anes$educ) ## education level: can be treated as a continuous var
##
## 1 2 3 4 5 6
## 75 445 357 155 363 190
table(anes$pid7) ## party ID
##
## -7 1 2 3 4 5 6 7 8
## 66 349 156 126 245 150 150 281 62
By examining each variable, it seems that we only have to deal with the missing values in pid7 (party ID). In pid7, according to the codebook, -7 means no answer and 8 means not sure, which means we have to first recode these values into NAs.
# recode missing values
anes$pid7[anes$pid7 == -7 | anes$pid7 == 8] <- NA ## "|" means "or"
# examine the variable again just to double check
table(anes$pid7)
##
## 1 2 3 4 5 6 7
## 349 156 126 245 150 150 281
Now, let’s generate control variables!
anes2 <- anes %>%
mutate(age = 2022 - birthyr_dropdown, ## the survey was conducted in 2022
female = ifelse(gender == 2, 1, 0),
white = ifelse(rwh == 1, 1, 0),
dem = ifelse(pid7 < 4, 1, 0),
rep = ifelse(pid7 > 4, 1, 0),
indep = ifelse(pid7 == 4, 1, 0)) ## create party ID dummies
Run a multiple linear regression of fttrump on libcon, controlling for the demographic variables.
m2 <- lm(fttrump ~ libcon + age + female + educ + white + dem + rep,
data = anes2) ## make indep as our reference group for party ID
What if we’d like to compare respondents’ liking of Trump between Democrats and Republicans, rather than between Democrats and Independents and between Republicans and Independents? Just change your reference group to Democrats or Republicans!
m3 <- lm(fttrump ~ libcon + age + female + educ + white + rep + indep,
data = anes2) ## make dem as our reference group for party ID
stargazer(m1, m2, m3,
title = "Table 2: Ideology and Feelings toward Trump",
dep.var.labels = "Feelings toward Trump",
covariate.labels = c("Ideology", "Age", "Female", "Education",
"White", "Democrat", "Republican", "Independent"),
star.cutoffs = c(0.05, 0.01, 0.001),
digits = 2 , type = "text")
##
## Table 2: Ideology and Feelings toward Trump
## ================================================================================================
## Dependent variable:
## ----------------------------------------------------------------------------
## Feelings toward Trump
## (1) (2) (3)
## ------------------------------------------------------------------------------------------------
## Ideology 14.03*** 7.38*** 7.38***
## (0.42) (0.58) (0.58)
##
## Age -0.17*** -0.17***
## (0.04) (0.04)
##
## Female -3.35* -3.35*
## (1.43) (1.43)
##
## Education -2.24*** -2.24***
## (0.48) (0.48)
##
## White 0.45 0.45
## (1.90) (1.90)
##
## Democrat -15.14***
## (2.33)
##
## Republican 21.43*** 36.57***
## (2.46) (2.28)
##
## Independent 15.14***
## (2.33)
##
## Constant -19.72*** 24.51*** 9.37*
## (1.93) (4.18) (3.65)
##
## ------------------------------------------------------------------------------------------------
## Observations 1,370 1,295 1,295
## R2 0.44 0.56 0.56
## Adjusted R2 0.44 0.56 0.56
## Residual Std. Error 28.42 (df = 1368) 25.48 (df = 1287) 25.48 (df = 1287)
## F Statistic 1,094.34*** (df = 1; 1368) 234.90*** (df = 7; 1287) 234.90*** (df = 7; 1287)
## ================================================================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
Table 2 shows that, holding the other control variables constant, a respondent who is one unit more ideologically conservative is predicted to rate Trump 7.38 units higher, a statistically significant difference. Why does the coefficient of ideology (libcon) become smaller after we control for demographic variables?
How to plot the regression line with control variables?
Recall that our research question is “whether those who are more ideologically conservative rate Donald Trump with higher liking scores.” This means that we are only interested in the relationship between one’s ideology and evaluation of Trump. Therefore, when we present the multiple linear regression line, we should hold the other control variables at constant values so as to show how a one-unit increment in our key independent variable (i.e., ideology (libcon)) affects the dependent variable (i.e., feelings toward Trump (fttrump)).
# create a data frame to store predicted values
libcon <- seq(1, 7, 0.01)
pred.dfm <- data.frame(libcon = libcon,
age = rep(51, length(libcon)), # set age at the mean
female = rep(1, length(libcon)), # female
educ = rep(4, length(libcon)), # 2-year college
white = rep(1, length(libcon)), # white
dem = rep(1, length(libcon)), # democrat
rep = rep(0, length(libcon))) # non-republican
# use predict() to generate predicted values
pred <- data.frame(libcon = libcon,
predict(m2, newdata = pred.dfm,
interval = "confidence", level = 0.95))
head(pred)
## libcon fit lwr upr
## 1 1.00 -3.794171 -7.017631 -0.5707099
## 2 1.01 -3.720327 -6.936792 -0.5038634
## 3 1.02 -3.646484 -6.855977 -0.4369918
## 4 1.03 -3.572641 -6.775188 -0.3700951
## 5 1.04 -3.498798 -6.694424 -0.3031729
## 6 1.05 -3.424955 -6.613685 -0.2362252
Plot the regression line with control variables (this line should be less steep than the simple regression line we plotted earlier).
ggplot() +
geom_point(data = anes, aes(x = libcon, y = fttrump),
alpha = 0.7, color = "darkgray") +
geom_errorbar(data = pred, aes(x = libcon, ymin = lwr, ymax = upr),
position = position_dodge(.9),
color = "gray", alpha = 0.5) +
geom_line(data = pred, aes(x = libcon, y = fit), color = "dodgerblue2") +
scale_x_continuous(name = "Ideology (libcon)",
breaks = seq(1, 7, 1)) +
ylab("Feelings toward Trump (fttrump)") +
annotate("text", x = 5.5, y = 60,
label = "predicted values on \n the regression line \n when holding all other \n control variables at constant",
color = "dodgerblue2") +
theme_bw()
The Gauss-Markov assumptions are necessary to ensure that OLS estimates have desirable properties. What are these assumptions? Rank these assumptions in order of their importance in terms of how violations would affect substantive inferences and our confidence in them. Justify your choice of ordering based on these potential effects on the properties of OLS estimates.
Use a Monte Carlo simulation approach to generate a dataset (with 100 observations, one y response variable, four X covariate predictors, and an error term) that possesses at least one of the following properties:
Provide the R code you use to generate the dataset, the OLS results from a “naïve” model of the data, and the output from any diagnostics you use to search for the problem.
Load the dataset midtermdata.csv. The data include measures of politicalinterest (a continuous variable representing individuals’ interest in news and politics), age (in years), education (a four-level factor variable), income (annual household income in thousands of $), female (a binary indicator variable), and treatment (a binary indicator variable representing whether the individual received treatment; in this case, whether they were selected to receive a series of nonpartisan political information pamphlets and booklets in the mail over the past year).
politicalinterest is a function of the other five covariates.
treatment.
treatment and the covariates? If so, identify and interpret.
education is recoded as a continuous (1-4) variable? How does this change the interpretation on the coefficient from when education was coded as a factor variable?
was coded as a factor variable?In today’s discussion, we will revisit the concepts of interaction and explore why it’s essential to incorporate interactions in a regression model. We will also engage in practical exercises to implement interaction models in R and conduct diagnostics to ensure that our interaction models adhere to assumptions. Additionally, we will review the application of tree-based methods to analyze our data and determine whether the inclusion of interactions is warranted.
An interaction occurs when an independent variable has a different effect on the outcome depending on the values of another independent variable.
Let’s look at some simulated examples. Suppose we investigate whether citizens are more inclined to trust information shared by their affiliated political party without any skepticism when they have higher levels of affective polarization, the extent of citizens’ liking and trusting their preferred party’s elites and supporters while disliking and distrusting those of opposing parties. In this case, affective polarization serves as the independent variable, while the level of trust in information shared by the in-party represents the dependent variable. Below, I outline the underlying data generating process (DGP) for this relationship in different scenarios. Here are the variables we will use for the simulation:
Levels of affective polarization (affpol) is a continuous variable which follows a normal distribution with a mean of 5 and a standard deviation of 1.5.
Political knowledge (pknow) is a dummy variable which follows a binomial distribution with one trial and a probability of 0.6 of having higher political knowledge.
One’s level of trust in information shared by the in-party (trust) is a continuous variable determined by both one’s level of affective polarization and political knowledge.
set.seed(1234)
affpol <- rnorm(1000, mean = 5, sd = 1.5)
pknow <- rbinom(1000, size = 1, prob = 0.6)
trust <- 2 + 3*affpol + 0*pknow + 0*affpol*pknow + rnorm(1000, mean = 3, sd = 6)
df <- data.frame(affpol, pknow, trust)
mod1 <- lm(trust ~ affpol, data = df)
p1 <- ggplot(data = df, aes(x = affpol, y = trust)) +
geom_point(alpha = 0.4, color = "darkgray") +
geom_smooth(method = "lm", color = "grey30", se = TRUE, level = 0.95) +
scale_x_continuous(name = "Affective Polarization", breaks = seq(0, 10, 1)) +
ylab("Level of Trust in Information Shared by the In-party") +
labs(title = "Scenario 1:", subtitle = "No effect of political knowledge") +
theme_bw()
p1
In Scenario 1, we show a standard simple linear model. However, let’s consider the scenario where we anticipate individuals with lower political knowledge to respond at a generally higher level compared to those with higher political knowledge. There are several ways this can manifest. For instance, if the discrepancy in trust response between those with lower and higher political knowledge remains consistent across different levels of affective polarization, we would anticipate a graph resembling the following:
set.seed(1234)
affpol <- rnorm(1000, mean = 5, sd = 1.5)
pknow <- rbinom(1000, size = 1, prob = 0.6)
trust <- 2 + 3*affpol - 5*pknow + rnorm(1000, mean = 3, sd = 6)
df <- data.frame(affpol, pknow, trust)
mod2 <- lm(trust ~ affpol + pknow, data = df)
p2 <- ggplot() +
geom_point(data = df[pknow == 1,], aes(x = affpol, y = trust), alpha = 0.2, color = "coral2") +
geom_smooth(data = df[pknow == 1,], aes(x = affpol, y = trust), method = "lm", color = "coral2", se = TRUE, level = 0.95) +
geom_point(data = df[pknow == 0,], aes(x = affpol, y = trust), alpha = 0.2, color = "dodgerblue2") +
geom_smooth(data = df[pknow == 0,], aes(x = affpol, y = trust), method = "lm", color = "dodgerblue2", se = TRUE, level = 0.95) +
scale_x_continuous(name = "Affective Polarization", breaks = seq(0, 10, 1)) +
ylab("Level of Trust in Information Shared by the In-party") +
labs(title = "Scenario 2:", subtitle = "Political knowledge has an effect, but no interaction") +
annotate("text", x = 3, y = 40, label = "lower political knowledge", color = "dodgerblue2") +
annotate("text", x = 3, y = 36, label = "higher political knowledge", color = "coral2") +
theme_bw()
p2
However, if those with lower political knowledge have a steeper trust response curve compared to those with higher political knowledge, we would expect a picture like this:
set.seed(1234)
affpol <- rnorm(1000, mean = 5, sd = 1.5)
pknow <- rbinom(1000, size = 1, prob = 0.6)
trust <- 2 + 3*affpol - 5*pknow - 2*affpol*pknow + rnorm(1000, mean = 3, sd = 6)
df <- data.frame(affpol, pknow, trust)
mod3 <- lm(trust ~ affpol + pknow + affpol*pknow, data = df)
p3 <- ggplot() +
geom_point(data = df[pknow == 1,], aes(x = affpol, y = trust), alpha = 0.2, color = "coral2") +
geom_smooth(data = df[pknow == 1,], aes(x = affpol, y = trust), method = "lm", color = "coral2", se = TRUE, level = 0.95) +
geom_point(data = df[pknow == 0,], aes(x = affpol, y = trust), alpha = 0.2, color = "dodgerblue2") +
geom_smooth(data = df[pknow == 0,], aes(x = affpol, y = trust), method = "lm", color = "dodgerblue2", se = TRUE, level = 0.95) +
scale_x_continuous(name = "Affective Polarization", breaks = seq(0, 10, 1)) +
ylab("Level of Trust in Information Shared by the In-party") +
labs(title = "Scenario 3:", subtitle = "Political knowledge has an effect with an interaction") +
annotate("text", x = 3, y = 40, label = "lower political knowledge", color = "dodgerblue2") +
annotate("text", x = 3, y = 36, label = "higher political knowledge", color = "coral2") +
theme_bw()
p3
Of these three graphs, the first suggests no distinction between individuals with lower and higher political knowledge. The second shows a consistent difference between the two groups, implying no need for an interaction term. The third graph depicts a scenario where the effect of affective polarization on trust in information shared by the in-party varies based on individuals’ political knowledge levels, indicating an interaction effect.
In terms of regression equations, we have:
Scenario 1: No effect of political knowledge:
\[Y = \alpha + \beta_1*affpol + 0*pknow + 0*affpol*pknow\] where Y represents the outcome (trust in information shared by in-party), \(\beta_1\) represents the effect of affective polarization (which is set as 3 in our simulation), and all other coefficients for the rest of the terms (effect of political knowledge and interaction term) are zero.
Scenario 2: Political knowledge has an effect, but no interaction:
\[Y = \alpha + \beta_1*affpol + \beta_2*pknow + 0*affpol*pknow\]
\(\beta_2\) represents the effect of political knowledge, which is set as -5 in our simulation.
Scenario 3: Political knowledge has an effect with an interaction:
\[Y = \alpha + \beta_1*affpol + \beta_2*pknow + \beta_3*affpol*pknow\]
\(\beta_3\) represents the interaction effect between affective polarization and political knowledge, which is set as -2 in our simulation.
Let’s consider how to interpret the effects of affective polarization and political knowledge in the presence of interaction terms.
If we focus on the first model (Scenario 1) above, where there’s no effect of political knowledge, the reporting is straightforward. There’s no impact of political knowledge, and the coefficient \(\beta_1 = 3.05\) signifies the effect of affective polarization. Specifically, \(\beta_1 = 3.05\) indicates how much one’s trust in information shared by their own party shifts for each one-unit change in their level of affective polarization.
| Variables | Scenario 1 Estimates | Scenario 1 C.I. (95%) | Scenario 2 Estimates | Scenario 2 C.I. (95%) | Scenario 3 Estimates | Scenario 3 C.I. (95%) |
|---|---|---|---|---|---|---|
| Intercept | 4.91 *** | 3.58 – 6.24 | 4.67 *** | 3.26 – 6.09 | 4.75 *** | 2.60 – 6.89 |
| Affective Polarization | 3.05 *** | 2.79 – 3.30 | 3.05 *** | 2.79 – 3.30 | 3.03 *** | 2.62 – 3.45 |
| Political Knowledge: Higher | | | -4.61 *** | -5.39 – -3.82 | -4.72 *** | -7.46 – -1.98 |
| Affective Polarization x Political Knowledge: Higher | | | | | -1.98 *** | -2.51 – -1.45 |
| Observations | 1000 | | 1000 | | 1000 | |
| R2 / R2 adjusted | 0.351 / 0.350 | | 0.403 / 0.402 | | 0.611 / 0.610 | |
If we consider the second model (Scenario 2), where there are effects of both affective polarization and political knowledge, interpretation remains straightforward: Since it does not depend on which level of political knowledge is being discussed (the effect is the same for those with higher political knowledge and those with lower political knowledge), \(\beta_1 = 3.05\) still represents the amount by which one’s trust in information shared by their own party changes for each unit change in their level of affective polarization. Similarly, \(\beta_2 = -4.61\) represents the effect of political knowledge, which is “additive” to the effect of affective polarization. Thus, to obtain the effect of both together for any level of affective polarization, we simply add the two individual effects.
Now, let’s consider the third model (Scenario 3) with an interaction term. Things become more intricate when an interaction term is introduced. There’s no longer a singular effect of affective polarization because it hinges on whether we’re discussing its impact on those with higher political knowledge or those with lower political knowledge. Likewise, the disparity between those with higher political knowledge and those with lower political knowledge is contingent upon the levels of affective polarization.
Consider first the effect of affective polarization: The question of “what is the effect of affective polarization” is not answerable until one knows the level of political knowledge being considered for a subject. The effect of affective polarization is \(\beta_1 = 3.03\) for individuals with lower political knowledge (coded as 0), as the interaction term becomes 0 in this case. Conversely, for individuals with higher political knowledge (coded as 1), the effect of affective polarization becomes \(\beta_1 + \beta_3 = 3.03 - 1.98 = 1.05\). This implies that for every one-unit increase in affective polarization, trust in information shared by the in-party increases by about 1.05 units for those with higher political knowledge, compared to about 3 units for those with lower political knowledge.
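As a quick check, we can recover these conditional slopes directly from mod3 (a minimal sketch, assuming the chunk above has been run so that mod3 is still in the workspace; R names the interaction coefficient affpol:pknow by default):
b <- coef(mod3)
b["affpol"]                      # slope of affpol for lower political knowledge (pknow = 0)
b["affpol"] + b["affpol:pknow"]  # slope of affpol for higher political knowledge (pknow = 1)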
This example considers a continuous variable combined with a dichotomous (dummy or indicator) variable. We can also consider interactions between two dummy variables, and between two continuous variables. The principles remain the same, although some technical details change slightly.
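For illustration, here is a minimal sketch of a continuous-by-continuous interaction using simulated data; the ideology variable and all coefficient values are hypothetical, chosen only to show the mechanics:
set.seed(1234)
df_cc <- data.frame(affpol = rnorm(1000, mean = 5, sd = 1.5),  # continuous predictor
                    ideology = rnorm(1000, mean = 0, sd = 2))  # hypothetical continuous moderator
df_cc$trust <- 2 + 3*df_cc$affpol + 1*df_cc$ideology +
  0.5*df_cc$affpol*df_cc$ideology + rnorm(1000, mean = 0, sd = 6)
mod_cc <- lm(trust ~ affpol * ideology, data = df_cc)
# The slope of affpol changes by 0.5 for each one-unit increase in ideology;
# evaluate the conditional slope at a few ideology values:
coef(mod_cc)["affpol"] + coef(mod_cc)["affpol:ideology"] * c(-2, 0, 2)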
Now that we grasp the importance of incorporating an interaction term in a regression model, let’s apply it using real data. We’ll use my POL 212 project titled “Electoral Outcomes, Presidential Power, and Satisfaction with Democracy” as our example. The data can be accessed here. All the variables are already coded, and observations with missing values have been removed.
Background: Scholars have amassed evidence showing that election outcomes affect how people think about democracy (Anderson & Guillory, 1997; Anderson & Tverdova, 2001; Anderson et al., 2005; Clarke & Acock, 1989; Curini et al., 2012). Compared to those who voted for losers, those who supported winners usually have a higher level of satisfaction and trust in democracy and have a more positive feeling of government responsiveness and political efficacy.
Research Question: However, does it mean that losers’ interests are then ignored by the government? Do people who did not vote for the winning party lose that much in the political game? In this project, I seek to understand how political institutions can shape citizens’ perception of the legitimacy of democracies. In particular, I ask how the extent of presidential power affects the winner-loser effect. Does presidential power amplify or dampen losers’ satisfaction with democracy?
Theory & Hypothesis: I argue that with different levels of presidential power, losers’ satisfaction with democracy should also be different. In countries with powerful presidents, losers should be more unsatisfied with how the government works, for being the political minority in the system causes them to be largely excluded from the policy-making process. In countries with weaker presidents, losers should be less unsatisfied, for their policy interests still can be possibly realized by the premier and the cabinet, which is determined by the congressional election.
Hypothesis: Losers will be more satisfied with democracy when the president is weaker.
Based on my theory, I am not only interested in the effect of winner-loser status on satisfaction with democracy but also how this effect is moderated by presidential power.
Data: The fourth module (2011-2016) of the Comparative Study of Electoral Systems (CSES) project. The CSES project consists of nationally representative post-election surveys conducted across democracies. The unit of analysis for this study is individuals. In particular, I analyze all presidential and semi-presidential countries that were rated as free by Freedom House at the time the election covered by the survey was held. This leaves the data with 15 countries covering Western and Eastern Europe, South and North America, and Asia.
# import data
library(tidyverse) # provides %>% and mutate()
df <- read.csv("/Users/yu-shiuanhuang/Desktop/method-sequence/data/data_con.csv") %>%
  mutate(female = ifelse(gender == "Female", 1, 0))
Country (country) | Election Year (eyear) | Presidential Power (prespow1) |
---|---|---|
Austria | 2013 | 0.0915 |
Brazil | 2014 | 0.4863 |
Bulgaria | 2014 | 0.1831 |
Finland | 2015 | 0.0499 |
France | 2012 | 0.1306 |
Ireland | 2011 | 0.0623 |
Peru | 2016 | 0.4197 |
Poland | 2011 | 0.2407 |
Portugal | 2015 | 0.1969 |
Republic of Korea | 2012 | 0.3488 |
Romania | 2014 | 0.2502 |
Slovakia | 2016 | 0.1894 |
Slovenia | 2011 | 0.1180 |
Taiwan | 2012 | 0.2707 |
United States of America | 2012 | 0.2889 |
Outcome variable: Satisfaction with Democracy (SWD)
In the CSES survey, the concept is measured by this question: “On the whole, are you (1) very satisfied, (2) fairly satisfied, (3) not very satisfied, or (4) not at all satisfied with the way democracy works in [COUNTRY]?” I reverse the scale so that higher values indicate greater satisfaction: (4) very satisfied, (3) fairly satisfied, (2) not very satisfied, and (1) not at all satisfied.
Explanatory variables:
– Loser (loser): I classified respondents as belonging to the political majority or minority using the CSES question that asks respondents which party they feel closest to. loser is coded 1 if the party a respondent feels closest to is not the current president’s party, and 0 otherwise.
– Presidential Power (prespow1): I use Doyle & Elgie’s (2015) continuous presidential power score dataset. The variable ranges from 0 to 1; higher scores indicate stronger presidents and lower scores indicate weaker presidents.
Control variables:
– Divided Government (divided): Past studies have shown that “unified government produces greater quantities of significant enactments and is more responsive to the public mood than is divided government” (Coleman, 1999: 821). I therefore expect the coefficient of divided to be negative, since individuals should be less satisfied with democracy under a less responsive government. divided is coded 1 if the executive-legislative relationship at the time of the survey fieldwork was a divided government, meaning that the president’s party did not control a majority of the legislature. The data used to construct divided come from the Database of Political Institutions 2017 (DPI 2017).
– Gender (gender): male vs. female.
– Age (age): continuous scale, standardized.
– Education (education): continuous scale.
Let’s first run the additive regression model and then add the interaction term between loser and prespow1 to the model, so we can compare how accounting for this interaction improves our model!
# additive model
m1 <- lm(SWD ~ loser + prespow1 + divided + female + education + age, data = df)
# interactive model
m2 <- lm(SWD ~ loser*prespow1 + divided + female + education + age, data = df)
| Variables | Additive Model Estimates | Additive Model C.I. (95%) | Interactive Model Estimates | Interactive Model C.I. (95%) |
|---|---|---|---|---|
| Intercept | 2.73 *** | 2.68 – 2.79 | 2.65 *** | 2.59 – 2.72 |
| Loser | -0.33 *** | -0.35 – -0.30 | -0.22 *** | -0.28 – -0.17 |
| Presidential Power | -0.50 *** | -0.59 – -0.40 | -0.16 | -0.34 – 0.01 |
| Divided Government | -0.04 * | -0.06 – -0.01 | -0.03 * | -0.06 – -0.01 |
| Female | 0.00 | -0.02 – 0.02 | 0.00 | -0.02 – 0.02 |
| Education Level | 0.03 *** | 0.02 – 0.03 | 0.03 *** | 0.02 – 0.03 |
| Age | 0.00 | -0.01 – 0.01 | 0.00 | -0.01 – 0.02 |
| Loser * Presidential Power | | | -0.45 *** | -0.65 – -0.25 |
| Observations | 18150 | | 18150 | |
| R2 / R2 adjusted | 0.047 / 0.047 | | 0.048 / 0.048 | |
How to interpret the interaction term?
Let’s first think it through the additive model we constructed (for simplification, I removed all the control variables first).
\[\hat{SWD} = \hat{\alpha} + \hat{\beta_1}Loser + \hat{\beta_2}Prespow\]
If we want to know the effect of winner-loser status on satisfaction with democracy, we are looking for the estimate of \(\beta_1\). Understanding it through the calculus, we are taking the partial derivative of \(\hat{SWD}\) with respect to \(Loser\).
\[\frac{\partial \hat{SWD}}{\partial Loser} = \hat{\beta_1}\]
Now let’s look at the interactive model.
\[\hat{SWD} = \hat{\alpha} + \hat{\beta_1}Loser + \hat{\beta_2}Prespow +\hat{\beta_3}(Loser*Prespow)\]
To know how the effect of winner-loser status on satisfaction with democracy is moderated by different levels of presidential power, let’s also take the partial derivative of \(\hat{SWD}\) with respect to \(Loser\), our key explanatory variable.
\[\frac{\partial \hat{SWD}}{\partial Loser} = \hat{\beta_1} + \hat{\beta_3}Prespow\]
This shows that the effect of winner-loser status on satisfaction with democracy now depends on presidential power: when presidential power increases by 1 unit, the effect of winner-loser status on satisfaction with democracy changes by \(\hat{\beta_3}\) units.
Apply this logic to the interactive model we just estimated. According to the above regression table, the model we estimated looks like this:
\[\hat{SWD} = 2.65 -0.22Loser -0.16Prespow -0.45(Loser*Prespow) -0.03Divided +0.001Female+0.03Education+0.003Age\]
Take a partial derivative of \(\hat{SWD}\) with respect to \(Loser\):
\[\frac{\partial \hat{SWD}}{\partial Loser} = -0.22 -0.45Prespow\]
Presidential Power | Effect of Winner-Loser Status on Satisfaction with Democracy |
---|---|
\(0.05\) (min) | \(-0.22-0.45*0.05 =-0.2425\) |
\(0.13\) (1st quartile) | \(-0.22-0.45*0.13=-0.2785\) |
\(0.24\) (median) | \(-0.22-0.45*0.24=-0.328\) |
\(0.29\) (3rd quartile) | \(-0.22-0.45*0.29=-0.3505\) |
\(0.49\) (max) | \(-0.22-0.45*0.49=-0.4405\) |
The above table suggests that when the president gets stronger in a country, individuals who supported the losing party will become less satisfied with how democracy works, compared to those who supported the winning party. In other words, the winner-loser gap in satisfaction with democracy becomes greater as presidential power becomes greater, which confirms my hypothesis.
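These conditional effects can also be computed directly from the fitted model m2 (a small sketch, assuming m2 from the chunk above is still in memory; because it uses the unrounded coefficients, the values will differ slightly from the hand calculations in the table):
b <- coef(m2)
prespow_vals <- c(0.05, 0.13, 0.24, 0.29, 0.49)  # min, 1st quartile, median, 3rd quartile, max
# conditional effect of loser at each level of presidential power
b["loser"] + b["loser:prespow1"] * prespow_vals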
Plot the predicted effects to show the pattern more clearly!
Using ggplot() to plot:
prespow1 <- seq(0, 1, by = 0.01)
# create a new data for the predicted SWD of losers under different presidential level
dfplot.l <- data.frame(loser = rep(1, length(prespow1)), # 1 means loser
prespow1 = prespow1,
divided = rep(1, length(prespow1)), # set at the median
female = rep(1, length(prespow1)), # set at the median
education = rep(4, length(prespow1)), # set at the median
age = rep(0, length(prespow1))) # set at the mean
pred.l <- data.frame(Status = rep("Loser", length(prespow1)),
prespow = prespow1,
pred = predict(m2, newdata = dfplot.l, interval = "confidence")[, 1],
lwr = predict(m2, newdata = dfplot.l, interval = "confidence")[, 2],
upr = predict(m2, newdata = dfplot.l, interval = "confidence")[, 3])
# create a new data for the predicted SWD of winners under different presidential level
dfplot.w <- data.frame(loser = rep(0, length(prespow1)), # 0 means winner
prespow1 = prespow1,
divided = rep(1, length(prespow1)), # set at the median
female = rep(1, length(prespow1)), # set at the median
education = rep(4, length(prespow1)), # set at the median
age = rep(0, length(prespow1))) # set at the mean
pred.w <- data.frame(Status = rep("Winner", length(prespow1)),
prespow = prespow1,
pred = predict(m2, newdata = dfplot.w, interval = "confidence")[, 1],
lwr = predict(m2, newdata = dfplot.w, interval = "confidence")[, 2],
upr = predict(m2, newdata = dfplot.w, interval = "confidence")[, 3])
dfplot <- bind_rows(pred.l, pred.w)
ggplot(data = dfplot) +
geom_line(aes(x = prespow, y = pred, linetype = Status)) +
geom_line(aes(x = prespow, y = lwr, linetype = Status)) +
geom_line(aes(x = prespow, y = upr, linetype = Status)) +
scale_linetype_manual(values = c("dashed", "solid")) +
scale_x_continuous(name = "Presidential Power",
breaks = seq(0, 1, 0.1)) +
scale_y_continuous(name = "Predicted Values of Satisfaction with Democracy",
breaks = seq(1.8, 2.8, 0.1)) +
theme_bw() +
theme(legend.title = element_blank(),
legend.position = "bottom")
The above figure demonstrates the predicted values of winners’ and losers’ satisfaction with democracy under different degrees of presidential power. According to this figure, indeed, the winner-loser gap in satisfaction with democracy becomes greater when there is a stronger president in the political system.
Using interact_plot() from the interactions package to plot:
– The pred argument specifies the variable to be used as the predictor variable on the x-axis of the plot.
– The modx argument specifies the moderator variable. The moderator is used to examine whether the relationship between the predictor and the outcome variable is conditional on its level; in the plot, a separate line is drawn for each value of the moderator.
– interval = TRUE specifies that confidence intervals should be drawn around the predicted values of the outcome.
Theoretically, the appropriate specification for the interact_plot() function in our case would be to set pred = loser and modx = prespow1, since loser is the key independent variable and prespow1 is the moderator. However, because loser is a binary variable, the plot is easier to read if we instead set pred = prespow1 and modx = loser, which gives us a continuous x-axis.
interactions::interact_plot(m2, pred = prespow1, modx = loser, interval = TRUE,
x.label = "Presidential Power",
y.label = "Predicted Values of Satisfaction with Democracy")
A classic linear interaction model is given by the following regression equation:
\[Y = \mu +\eta X+\alpha D+\beta (D\cdot X)+Z\gamma +\epsilon\]
In this model, \(Y\) is the outcome variable, \(D\) is the key independent variable of interest or so called “treatment,” \(X\) is the moderator - a variable that affects the direction and/or strength of the treatment effect (i.e., the effect of \(D\) on \(Y\)), \(D\cdot X\) is the interaction term between \(D\) and \(X\), \(Z\) is a vector of control variables, and \(\mu\) and \(\epsilon\) represent the constant (intercept) and error terms, respectively.
Let’s assume that our key independent variable, \(D\), is exogenous (i.e., \(Cov(D, \epsilon) = 0\)). In this abstract case, we are most interested in the effect of \(D\) on \(Y\). Taking the partial derivative of \(Y\) with respect to \(D\), the marginal effect of \(D\) on \(Y\) follows the functional form given by:
\[ME_{D}=\partial Y/\partial D = \alpha +\beta X\]
This equation is a linear function of the moderator \(X\), which suggests that the effect of \(D\) on \(Y\) can only linearly change with \(X\), so if \(X\) increases by one unit, the effect of \(D\) on \(Y\) changes by \(\beta\) units and this change in the effect is “constant across the whole range of \(X\).” We call this assumption linear interaction effect (LIE) assumption.
This is a strong assumption, because we often have little theoretical or empirical reason to believe that the heterogeneity in the effect of \(D\) on \(Y\) takes such a linear form. Instead, it might well be that the effect of \(D\) on \(Y\) is nonlinear or nonmonotonic. For example, the effect of \(D\) on \(Y\) might be small for low values of \(X\), large at medium values of \(X\), and then small again for high values of \(X\).
Considering the LIE assumption, it’d be better to do some diagnostic tests before applying a linear interaction model to our analysis - to examine whether our heterogeneous effect is indeed in a linear form.
Let’s continue with the example from my POL 212 project and assess whether the linear interaction model we previously estimated violates the linearity of the interaction effect assumption. We can achieve this by setting linearity.check = TRUE in interact_plot(). This will enable us to visualize whether the effect of winner-loser status remains linear across the range of the moderator.
interactions::interact_plot(m2, pred = prespow1, modx = loser, plot.points = TRUE,
jitter = 0.1, point.alpha = 0.8, linearity.check = TRUE,
x.label = "Presidential Power", y.label = "Satisfaction with Democracy")
In the above figure, the red line is the predicted loess line, whereas the black line is the predicted linear regression line. Recall that loess is a locally weighted regression: it estimates the outcome using only nearby observations, so the red line will curve if the relationship is nonlinear. What we’re looking for is whether the red loess line is roughly as straight as the black line. If the two lines are very different, the LIE assumption is violated. In our case, the predicted linear regression line is not too different from the loess line.
We can also examine the LIE assumption with a more flexible binning estimator by using interflex() in the interflex package developed by Hainmueller et al. (2019).
The binning estimator stays within the regression framework while relaxing the LIE assumption, flexibly allowing the conditional marginal effect to vary across values of the moderator. Simply put, the binning estimator breaks a continuous moderator (in our case, presidential power) into several bins (the default in interflex() is 3) represented by dummy variables and interacts these dummies with the treatment indicator (in our case, winner-loser status). By default, it picks an evaluation point within each bin, the median of the moderator in that bin, at which to estimate the conditional marginal effect of the key independent variable on the outcome variable.
library(interflex)
ols.interaction.binning <- interflex(estimator = "binning", Y = "SWD", D = "loser", X = "prespow1",
Z = c("divided","female","education", "age"), data = df, na.rm = TRUE)
## Baseline group not specified; choose treat = 0 as the baseline group.
plot(ols.interaction.binning)
The three estimates from the binning estimator in the above figure, labeled L, M, and H, are the median point of each bin. As this figure shows, the conditional effect estimates from the binning estimator and our linear interaction model are similar. The first two estimates, L and M, sit almost right on the estimated linear marginal-effect line. Only the third estimate, H, is a bit far from the line.
For the second example, let’s replicate Homola & Tavits’s (2018) findings on how people’s pre-existing partisan affinities condition the effect of contact with nonnatives on their immigration-related fears.
According to contact theory, close intergroup contact can foster more positive attitudes towards other groups, potentially reducing threat perceptions and prejudice towards immigrants. However, Homola & Tavits (2018) suggest that individuals may resist changing their pre-existing opinions, particularly influenced by their partisanship and political ideology. They hypothesize that exposure and positive contact with immigrants may decrease perceptions of immigrant-related threats among leftist voters but increase them among rightist voters. This difference arises because positive contact aligns with the leftist ideology of advocating social change and rejecting inequality, whereas it conflicts with the rightist ideology of resisting social change and accepting inequality.
In short, Homola & Tavits (2018) argue that the effect of contact with outgroup members on people’s perceptions of immigrant-related threats is heterogeneous across partisanship and the associated political ideology (leftist-rightist).
Outcome variable (\(Y\), threat): Homola & Tavits (2018) use the six measures they developed in their survey to build an overall threat score capturing perceptions of threat related to immigration.
Key independent variable (treatment, \(D\), contact): Homola & Tavits (2018) use two of their survey questions to construct a contact factor score capturing respondents’ repeated positive contact with immigrants.
Moderator (\(X\), PID): Homola & Tavits (2018) use two different measures to capture whether a respondent is a political leftist or rightist. First, they use the standard question “Generally speaking, do you usually think of yourself as a Democrat, a Republican, an Independent, or what?” and a follow-up question that asks which party respondents feel closer to. From these, they create a binary measure of partisan affinity (variable name Rightist), where 0 = leftist (i.e., Democrat) and 1 = rightist (i.e., Republican). They also use the full 7-point party affinity scale ranging from 1 = strong Democrat to 7 = strong Republican. Here, we will focus on the continuous measure of this moderator.
Control variables (\(Z\)): Homola & Tavits (2018) include female, age, education, and income as their control variables.
In Homola & Tavits’s (2018) model, they interact Contact with Rightist to examine whether the effect of Contact on Threat differs for leftists and rightists. They expect the coefficient of the interaction term Contact*Rightist to be significantly positive, meaning that for rightists, positive contact with nonnatives does much less to decrease their perceptions of threat toward immigrants.
Homola & Tavits’s (2018) data can be accessed here.
# import data
library(haven)
df <- read_dta("/Users/yu-shiuanhuang/Desktop/method-sequence/data/HomolaTavits2018.dta") %>%
mutate(PID = PID7) %>%
select(threat, contact, PID, female, age, education, income) %>%
na.omit()
df$education <- as.numeric(df$education)
df$income <- as.numeric(df$income)
Let’s first run the interaction model and plot the marginal effect of positive contact with nonnatives (contact) on people’s perceptions of threat toward immigrants (threat).
m <- lm(threat ~ contact*PID + female + age + education + income, data = df)
Dependent variable: Threat
| Variables | Estimates | C.I. (95%) |
|---|---|---|
| Intercept | -0.62 | -1.31 – 0.08 |
| Contact | -0.66 *** | -0.90 – -0.41 |
| Rightist (PID) | 0.46 *** | 0.42 – 0.51 |
| Female | 0.05 | -0.15 – 0.24 |
| Age | 0.01 *** | 0.01 – 0.02 |
| Education | -0.44 *** | -0.58 – -0.31 |
| Income | -0.11 ** | -0.18 – -0.04 |
| Contact * Rightist (PID) | 0.08 * | 0.02 – 0.14 |
| Observations | 1487 | |
| R2 / R2 adjusted | 0.306 / 0.303 | |
According to the regression table, as Homola & Tavits (2018) expected, the coefficient of Contact*Rightist is significantly positive. A one-unit increase in contact decreases an individual’s perceived threat toward immigrants by 0.66 units. However, for each additional unit to the right on the political ideology spectrum, the effect of Contact on Threat is offset by 0.08 units; for example, moving one unit to the right shrinks the overall effect of a one-unit increase in Contact on Threat to -0.58 units.
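To see the conditional effect across the whole PID scale, here is a minimal sketch (assuming the fitted model m above is in memory; the interaction coefficient follows R’s default contact:PID labeling):
b <- coef(m)
# conditional effect of contact at each value of the 7-point PID scale
data.frame(PID = 1:7,
           effect_of_contact = b["contact"] + b["contact:PID"] * (1:7))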
interactions::interact_plot(m, pred = contact, modx = PID, interval = TRUE)
The above figure shows the predicted perceptions of threat toward immigrants across contact levels for different values of political ideology. Indeed, the further right an individual is on the political ideology spectrum, the less positive contact with people from the outgroup reduces her perceived threat toward immigrants (the darker blue line is flatter than the lighter line).
Now, let’s examine whether our linear interaction model violates the LIE assumption.
interactions::interact_plot(m, pred = contact, modx = PID, plot.points = TRUE,
jitter = 0.1, point.alpha = 0.8, linearity.check = TRUE)
As the above figure shows, our predicted linear regression line is not too different from the loess line across different values of political ideology.
ols.interaction.binning <- interflex(estimator = "binning", Y = "threat", D = "contact", X = "PID",
Z = c("female", "age", "education", "income"), data = as.data.frame(df), na.rm = TRUE)
plot(ols.interaction.binning)
The more relaxed binning estimator approach also suggests that the conditional effect estimates and our linear interaction model are similar. The three estimates, L, M, and H, sit almost right on the estimated linear marginal-effect line.
What is a decision tree?
A decision tree is a flowchart-like structure made of nodes and branches. At each node, a split on the data is performed based on one of the input features, generating two or more branches as output. More and more splits are made in the upcoming nodes and increasing numbers of branches are generated to partition the original data. This continues until a node is generated where all or almost all of the data belong to the same class and further splits — or branches — are no longer possible.
As mentioned in class, tree-based methods are a good way to first parse out whether there are any noteworthy candidate interactions within our data (theory is intertwined between deduction and induction).
Let’s apply the single tree method to my POL212 project “Electoral Outcomes, Presidential Power, and Satisfaction with Democracy.”
library(rpart)
df2 <- read.csv("/Users/yu-shiuanhuang/Desktop/method-sequence/data/data_con.csv") %>%
select(SWD, loser, prespow1, divided, gender, education, age) ## only select the variables we will use
set.seed(1234)
fit <- rpart(SWD ~ ., data = df2, control = rpart.control(minsplit = 20, minbucket = 5))
## minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted.
## minbucket: the minimum number of observations in any terminal <leaf> node.
rsq.rpart(fit) # the R2 value is about 0.2
##
## Regression tree:
## rpart(formula = SWD ~ ., data = df2, control = rpart.control(minsplit = 20,
## minbucket = 5))
##
## Variables actually used in tree construction:
## [1] loser prespow1
##
## Root node error: 11488/18150 = 0.63294
##
## n= 18150
##
## CP nsplit rel error xerror xstd
## 1 0.036314 0 1.00000 1.00018 0.0090794
## 2 0.025665 1 0.96369 0.96398 0.0089336
## 3 0.021268 2 0.93802 0.93844 0.0088492
## 4 0.013994 3 0.91675 0.91727 0.0088237
## 5 0.010000 5 0.88877 0.88942 0.0087257
Plot your tree!
library(rpart.plot)
rpart.plot(fit)
How to interpret this single decision tree?
This tree has five internal nodes (rules):
loser = 1: whether one supported the losing party or the winning party
prespow1 >= 0.16: whether the presidential power in a country is larger than or equal to 0.16
prespow1 < 0.19: whether the presidential power in a country is smaller than 0.19
prespow1 < 0.26: whether the presidential power in a country is smaller than 0.26
prespow1 >= 0.32: whether the presidential power in a country is larger than or equal to 0.32
As you can observe from this decision tree, compared to losers in a country where the presidential power is weaker, losers in a country where the presidential power is stronger are less satisfied with democracy. In other words, there does seem to be an interaction effect between winner-loser status and presidential power as I argued in my hypothesis.
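If you prefer the splits written out as plain-language rules rather than read off the diagram, the rpart.plot package (version 3.0 and later) also provides rpart.rules(); a minimal call would be:
# print each terminal node's predicted SWD together with the rule that leads to it
rpart.plot::rpart.rules(fit)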
In today’s discussion, we’ll cover the Gauss-Markov theorem, highlighting why ordinary least squares is a reliable estimator for our relationship of interest. We’ll then explore diagnostic techniques to ensure our model adheres to its assumptions and address any violations. Additionally, we’ll discuss collinearity detection methods.
Gauss Markov theorem states that “if the errors are independently distributed with zero expectation and constant variance, then the least-squares estimator b is the most efficient linear unbiased estimator of \(\beta\). That is, of all unbiased estimators that are linear functions of the observations, the least-squares estimator has the smallest sampling variance and, hence, the smallest mean-squared error. For this reason, the least-squares estimator is sometimes termed BLUE, an acronym for best linear unbiased estimator” (Fox 2016: 212).
In other words, under certain conditions, the ordinary least squares (OLS) estimates are thought to be the best linear unbiased estimator (BLUE). Here, I review a set of assumptions (known as Gauss-Markov assumptions) that leads to a BLUE estimator.
Linearity: the expected (mean) value of the disturbance term is 0 (i.e., \(E(u_i)=0\)).
Fox (2016) explains that a violation of this assumption implies “that the model fails to capture the systematic pattern of relationship between the response and explanatory variables” (307). This assumption can be broken in two ways: the mean of the error could be a nonzero constant across all values of \(X_i\), or it could vary with \(X_i\). In the former case, the intercept term will be consistently off by the error amount; a uniform measurement error across all observations would cause such an error (Berry, 1993: 42). A common cause of the latter case is omitted variable bias, which deserves particular attention because it can make the OLS estimate both biased and inconsistent. Another common cause is the wrong functional form (e.g., specifying a linear relation instead of a quadratic relation). Thus, when nonlinearity is detected, we should explore not only variable transformations and different model specifications (Fox, 2016: 317), but also the possibility of omitted variables. In short, before running OLS, researchers should make sure that the variables included in the model and the functional form are reasonable and grounded in theory.
Nonstochastic regressors: \(X\) values are independent of the error term (i.e., \(cov(X_i, u_i)=0\)).
The violation of this assumption suggests that there is an endogeneity problem between the explanatory and outcome variables, which can arise from measurement error on \(X\), omitted confounders, or reciprocal causation between \(X\) and \(Y\). Again, if the main goal of the research is to provide an unbiased and consistent estimator to explain how a social phenomenon works in a real-life setting, a convincing explanation requires us to make sure that the effect of our explanatory variable is actually exogenous. Fortunately, scholars have developed many methods to address this issue. For example, researchers could employ an instrumental variable or matching to improve the unbiasedness of the estimated effect.
Homoskedasticity: constant error variance across values of \(X_i\) (i.e., \(Var(\varepsilon)=\sigma^2\)).
One instance where homoskedasticity is often violated is cross-sectional studies (Berry, 1993: 73). For example, because more developed countries have better data accuracy, measurement error could be correlated with the level of development. Thus, in a study that includes the level of development as one of the independent variables, the variance of the error term may not be constant (Berry, 1993: 73). In such circumstances, Fox (2016) suggests that we can run weighted-least-squares (WLS) regression to overcome this problem (304). Other solutions include transforming the dependent variable and correcting the coefficient standard errors for heteroskedasticity (e.g., robust standard errors and clustered standard errors) (Fox, 2016: 307).
Independence: the observations are sampled independently; there is no autocorrelation between disturbances (i.e., \(cov(u_i, u_j)=0\) for \(i \neq j\)).
The independence assumption is important for time-series analysis, where the assumption of spherical errors is often violated. To overcome this, Beck & Katz (1995) propose to use panel-correlated standard errors when analyzing time-series data.
Normality: errors are normally distributed (i.e., \(\varepsilon \sim N(0, \sigma^2)\)).
Assumptions 1 and 2 relate to the bias of the estimator, whereas assumptions 3 and 4 relate to its efficiency: when assumption 1 or 2 is violated, the OLS estimator is no longer unbiased; when assumption 3 or 4 is violated, the OLS estimator is no longer efficient. With this weaker set of assumptions (1-4), according to the Gauss-Markov theorem, the OLS estimator is BLUE.
With the fifth assumption, normality, the OLS estimator is the most efficient among all the unbiased estimators, not just linear estimators. Adding this fifth assumption makes the Gauss-Markov theorem a strong set. This last assumption is important when we aim to make a causal inference using the OLS estimator. With the normal distribution of the errors assumption, we can also infer that the OLS estimators also have normal sampling distributions. As a result, it allows us to apply statistical tests such as t-tests even for small sample size studies.
The table below presents the consequences of violating the Gauss-Markov assumptions and the corresponding suggested solutions.
Assumption | Violation | Solution |
---|---|---|
Linearity | Biased/inconsistent estimates | Transformations, polynomials, different model |
Nonstochastic regressors | Biased/inconsistent estimates | Instrumental variables |
Homoskedasticity | Biased standard errors | Robust standard errors |
Independence | Biased standard errors | Fixed effects |
Recall that the violation of linearity assumption implies “that the model fails to capture the systematic pattern of relationship between the response and explanatory variables” (Fox, 2016: 307). Let’s generate data based on this underlying idea!
set.seed(20230307) # for replication
N <- 1000 # set up the sample size
x1 <- rnorm(N, mean = 10, sd = 3)
x2 <- runif(N, min = -3, max = 3)
e <- rnorm(N, mean = 0, sd = 1) # set the errors to be normally distributed
y <- 7 + 3*x1 + 2*x1^2 + 3*x2 + e # set the true y
df <- data.frame(y = y, x1 = x1, x2 = x2, e = e)
In the above chunk, I specify the true \(y\) to be a function of \(x_1\), \(x_2\), and \(e\). The true values of the regression coefficients are 3 for \(x_1\) and 3 for \(x_2\). The term \(2*x_1^2\) represents a quadratic effect of \(x_1\). The intercept is 7. Let’s first plot the true relationship between \(x_1\) and \(y\)!
library(tidyverse)
ggplot(data = df, aes(y = y, x = x1)) +
geom_point(alpha = 0.8) +
geom_smooth() +
theme_bw()
As you can see in the above figure, in the true model we set up, the effect of \(x_1\) on \(y\) is quadratic (convex): as \(x_1\) increases, the marginal effect of \(x_1\) on \(y\) becomes larger!
Now, let’s run a model that violates the linearity assumption (i.e., when regressing \(y\) on \(x_1\) and \(x_2\), do not include the quadratic term I(x1^2) in the lm() call).
fit <- lm(y ~ x1 + x2, data = df)
Dependent variable: y
| Variables | Estimates | S.E. | C.I. (95%) |
|---|---|---|---|
| (Intercept) | -170.54 *** | 2.82 | -176.08 – -164.99 |
| x1 | 42.68 *** | 0.27 | 42.15 – 43.21 |
| x2 | 3.10 *** | 0.49 | 2.14 – 4.05 |
| Observations | 1000 | | |
| R2 / R2 adjusted | 0.962 / 0.962 | | |
As you can see from the above regression table, the estimated value of constant/intercept (-170.54) is very far from what we have set in the true model (7). The coefficient of \(x_1\) is also very biased (we set it to be 3 in our true model). Let’s add this estimated regression line onto our scatterplot of \(x_1\) and \(y\).
ggplot(data = df, aes(y = y, x = x1)) +
geom_point(alpha = 0.8) +
geom_smooth() +
geom_abline(intercept = -170.54, slope = 42.68, color = "coral", size = 1.2) +
theme_bw()
You can see that our estimated regression line is very different from the smoothed line. Our model did not capture the nonlinear relationship between \(x_1\) and \(y\).
Now, assess this nonlinearity by plotting residuals vs fitted values.
par(mfrow = c(2, 2))
plot(fit)
The linearity assumption can be checked by inspecting the Residuals vs Fitted plot (first plot).
This plot displays the predicted values of the outcome variable on the horizontal axis, and the residuals (i.e., the difference between the observed and predicted values of the outcome variable) on the vertical axis. Ideally, the residual plot will show no fitted pattern. That is, the red line should be approximately horizontal at zero, which suggests that the expected (mean) value of the disturbance term is 0. The presence of a pattern may indicate a problem with some aspect of the linear model.
As you can observe in the first plot, the red line shows that there is a nonlinear relationship existing between one of our explanatory variables and the outcome variable.
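To complement the visual check with a formal one, we can run Ramsey’s RESET test for omitted higher-order terms (a sketch, assuming the lmtest package is installed; a small p-value indicates that the linear specification misses nonlinearity):
library(lmtest)
# add squared and cubed fitted values to the model and test whether they matter
resettest(fit, power = 2:3, type = "fitted")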
What if we run the model without violating the linearity assumption (i.e., we specify the model so that it captures the quadratic relationship between \(x_1\) and \(y\))?
fit2 <- lm(y ~ x1 + I(x1^2) + x2, data = df)
| Variables | fit (y ~ x1 + x2) Estimates | S.E. | C.I. (95%) | fit2 (y ~ x1 + I(x1^2) + x2) Estimates | S.E. | C.I. (95%) |
|---|---|---|---|---|---|---|
| (Intercept) | -170.54 *** | 2.82 | -176.08 – -164.99 | 6.73 *** | 0.25 | 6.25 – 7.22 |
| x1 | 42.68 *** | 0.27 | 42.15 – 43.21 | 3.04 *** | 0.05 | 2.94 – 3.14 |
| x2 | 3.10 *** | 0.49 | 2.14 – 4.05 | 2.98 *** | 0.02 | 2.94 – 3.01 |
| x1^2 | | | | 2.00 *** | 0.00 | 1.99 – 2.00 |
| Observations | 1000 | | | 1000 | | |
| R2 / R2 adjusted | 0.962 / 0.962 | | | 1.000 / 1.000 | | |
par(mfrow = c(2, 2))
plot(fit2)
Let’s check the first plot again. After specifying the model correctly, the red line shows no fitted pattern, which indicates a linear relationship.
Recall that the homoskedasticity assumption is violated when the error variance is not constant across values of an independent variable. Again, let’s generate data based on this underlying idea!
set.seed(20230307) # for replication
N <- 1000 # set up the sample size
x1 <- 1:N
x2 <- rnorm(N, mean = 10, sd = 3)
e <- rnorm(N, mean = 0, sd = 2*x1) # errors are normal, but their standard deviation grows with x1
y <- 7 + 3*x1 + x2 + e # set the true y
df <- data.frame(y = y, x1 = x1, x2 = x2, e = e)
In the above chunk, I make the heteroskedasticity pronounced: as \(x_1\) gets larger, the standard deviation of the error grows proportionally (sd = 2*x1). You can see this relationship in the figure below.
ggplot(data = df, aes(x = x1, y = e)) +
geom_point(alpha = 0.8) +
theme_bw()
Let’s now run the regression and test it to see if it violates the homoskedasticity assumption.
fit3 <- lm(y ~ x1 + x2, data = df)
Dependent variable: y
| Variables | Estimates | S.E. | C.I. (95%) |
|---|---|---|---|
| (Intercept) | -16.56 | 143.39 | -297.93 – 264.81 |
| x1 | 2.92 *** | 0.13 | 2.67 – 3.18 |
| x2 | 3.02 | 12.15 | -20.82 – 26.86 |
| Observations | 1000 | | |
| R2 / R2 adjusted | 0.334 / 0.333 | | |
According to the regression table, you can see that the standard error of \(x_2\) is very large, which is due to the heteroskedasticity problem in our data. The estimate of the effect of \(x_2\), therefore, is very inefficient.
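We can confirm this formally with the Breusch-Pagan test (a sketch, assuming the lmtest package is installed; a small p-value indicates that the error variance depends on the predictors):
library(lmtest)
bptest(fit3)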
par(mfrow = c(2, 2))
plot(fit3)
Let’s take a look at the third plot. The Scale-Location plot is used to check the homogeneity of the residual variance (homoscedasticity). A horizontal line with equally spread points is a good indication of homoscedasticity. This is not the case in our example: the variability of the residuals increases with the fitted values of the outcome variable, suggesting non-constant error variance.
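One common remedy mentioned earlier is to correct the coefficient standard errors. A minimal sketch using heteroskedasticity-robust (HC1) standard errors, assuming the sandwich and lmtest packages are installed:
library(sandwich)
library(lmtest)
# re-test the coefficients of fit3 with a heteroskedasticity-consistent covariance matrix
coeftest(fit3, vcov = vcovHC(fit3, type = "HC1"))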
Recall that the normality assumption is violated when the errors are not normally distributed. If this assumption is violated, the OLS estimators may not have a normal sampling distribution, and statistical inference based on them may not be reliable. Again, let’s generate data based on this underlying idea!
set.seed(20230307) # for replication
N <- 1000 # set up the sample size
x1 <- rnorm(N, mean = 10, sd = 3)
x2 <- runif(N, min = -3, max = 3)
e <- runif(N, min = 0, max = 1) # set the errors to not be normally distributed
y <- 7 + 3*x1 + x2 + e # set the true y
df <- data.frame(y = y, x1 = x1, x2 = x2, e = e)
In the above chunk, I set the errors to not be normally distributed.
Let’s now run the regression and test it to see if it violates the normality assumption.
fit4 <- lm(y ~ x1 + x2, data = df)
Dependent variable: y
| Variables | Estimates | S.E. | C.I. (95%) |
|---|---|---|---|
| (Intercept) | 7.54 *** | 0.03 | 7.48 – 7.60 |
| x1 | 3.00 *** | 0.00 | 2.99 – 3.00 |
| x2 | 1.00 *** | 0.01 | 0.99 – 1.01 |
| Observations | 1000 | | |
| R2 / R2 adjusted | 0.999 / 0.999 | | |
library(car)
qqPlot(fit4, simulate = TRUE, line = "none", labels = FALSE)
## [1] 265 857
The QQ plot, or quantile-quantile plot, is a graphical tool that helps us assess whether a set of data plausibly came from some theoretical distribution, such as a normal or an exponential distribution. Since our analysis assumes normally distributed errors, we can use a normal QQ plot of the residuals to check this assumption.
A QQ plot is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles come from the same distribution, the points should form a roughly straight line, which is not the case in our example.
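We can also back up the visual check with a formal test, for example the Shapiro-Wilk test on the residuals (valid for sample sizes up to 5,000; a small p-value indicates a departure from normality):
shapiro.test(residuals(fit4))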
In multivariate linear regression, it is possible for two or more predictor variables to be correlated with each other. This scenario is referred to as collinearity. In extreme cases, where collinearity exists between three or more variables, even if no pair of variables has a particularly high correlation, this is known as multicollinearity. Multicollinearity indicates redundancy between predictor variables.
Note that when there is perfect multicollinearity (i.e., when two or more independent variables are perfectly correlated), we cannot even get an estimate using the OLS method. But don’t worry: when you run lm() in R, R will automatically drop one of the perfectly correlated variables in order to run the regression.
The presence of multicollinearity can cause the solution of the regression model to become unstable.
Therefore, it is important to detect and address multicollinearity before building a regression model to ensure the model’s reliability and accuracy.
In this example, our response variable is score, and our explanatory variables are STR (student-teacher ratio) and english (the share of English-learning students).
library(AER)
library(MASS)
data(CASchools)
# define variables
CASchools$STR <- CASchools$students/CASchools$teachers
CASchools$score <- (CASchools$read + CASchools$math)/2
If there’s high multicollinearity between variables and they’re perfectly correlated, R will automatically drop one of them to resolve the perfect multicollinearity issue. In cases of high multicollinearity where variables are not perfectly correlated, the lm() function will still generate estimates for the highly correlated variable, but the standard errors of coefficients will be notably high, indicating unstable estimates. Below, we illustrate examples when our model includes 1) perfectly correlated variables, 2) highly correlated variables, or 3) less highly correlated variables.
# generate a perfectly correlated variable
CASchools$english_pc <- 2*CASchools$english
cor(CASchools$english_pc, CASchools$english)
## [1] 1
# english_pc is perfectly correlated with english
set.seed(1234)
# generate a highly correlated variable
CASchools$english_hc1 <- 2*CASchools$english + rnorm(nrow(CASchools), 10, 5)
cor(CASchools$english_hc1, CASchools$english)
## [1] 0.9904646
# english_hc is highly but not perfectly correlated with english
set.seed(1234)
# generate a less highly correlated variable
CASchools$english_hc2 <- 2*CASchools$english + rnorm(nrow(CASchools), 10, 20)
cor(CASchools$english_hc2, CASchools$english)
## [1] 0.8662955
mod1 <- lm(score ~ STR + english , data = CASchools) # without highly correlated term
mod2 <- lm(score ~ STR + english + english_pc , data = CASchools) # with perfectly correlated term
mod3 <- lm(score ~ STR + english + english_hc1 , data = CASchools) # with highly correlated term
mod4 <- lm(score ~ STR + english + english_hc2 , data = CASchools) # with less highly correlated term
| Variables | Model 1 Est. | Model 1 S.E. | Model 1 C.I. (95%) | Model 2 Est. | Model 2 S.E. | Model 2 C.I. (95%) | Model 3 Est. | Model 3 S.E. | Model 3 C.I. (95%) | Model 4 Est. | Model 4 S.E. | Model 4 C.I. (95%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 686.03 *** | 7.41 | 671.46 – 700.60 | 686.03 *** | 7.41 | 671.46 – 700.60 | 687.24 *** | 7.57 | 672.37 – 702.12 | 686.40 *** | 7.43 | 671.80 – 701.00 |
| STR | -1.10 ** | 0.38 | -1.85 – -0.35 | -1.10 ** | 0.38 | -1.85 – -0.35 | -1.10 ** | 0.38 | -1.85 – -0.36 | -1.10 ** | 0.38 | -1.85 – -0.36 |
| english | -0.65 *** | 0.04 | -0.73 – -0.57 | -0.65 *** | 0.04 | -0.73 – -0.57 | -0.43 | 0.28 | -0.98 – 0.12 | -0.60 *** | 0.08 | -0.75 – -0.44 |
| english hc1 | | | | | | | -0.11 | 0.14 | -0.39 – 0.16 | | | |
| english hc2 | | | | | | | | | | -0.03 | 0.04 | -0.10 – 0.04 |
| Observations | 420 | | | 420 | | | 420 | | | 420 | | |
| R2 / R2 adjusted | 0.426 / 0.424 | | | 0.426 / 0.424 | | | 0.427 / 0.423 | | | 0.427 / 0.423 | | |
The regression output indicates that in model 2 there is a perfect multicollinearity problem; R automatically dropped english_pc to resolve it.
In models 3 and 4, there is an imperfect multicollinearity problem, which is reflected in the increased standard errors of the coefficient estimates. Specifically, the estimated effect of english becomes increasingly unstable as the correlation between english and the correlated term (which we artificially generated) becomes stronger.
Multicollinearity in a regression model can be assessed by computing the variance inflation factor (VIF) for each predictor variable. The VIF measures how much of the variation in one predictor is explained by the other predictors: it is computed by regressing that predictor on all of the other predictors and recording the R-squared of this auxiliary regression. The VIF is then one divided by the tolerance, where the tolerance is defined as one minus that R-squared.
A VIF value of 1 indicates an absence of multicollinearity, while values exceeding 5 or 10 suggest problematic collinearity (James et al., 2014). When multicollinearity is detected, it is recommended to remove one or more of the correlated variables from the model. This is because the presence of multicollinearity implies that the information provided by the correlated variables about the response variable is redundant in the presence of other variables (James et al., 2014; P. Bruce & Bruce, 2017).
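To see where these numbers come from, here is the formula applied by hand to the english coefficient in model 3 (a sketch using the same CASchools data as above; it should match the vif() output below):
# auxiliary regression: english on the other predictors in mod3
r2_english <- summary(lm(english ~ STR + english_hc1, data = CASchools))$r.squared
1 / (1 - r2_english)  # VIF for english in mod3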
The R function vif() in the car package can be used to detect multicollinearity in a regression model.
library(car)
vif(mod1) # without highly correlated term
## STR english
## 1.036495 1.036495
vif(mod2) # with perfectly correlated term
## Error in vif.default(mod2): there are aliased coefficients in the model
vif(mod3) # with highly correlated term
## STR english english_hc1
## 1.036565 52.750311 52.691167
vif(mod4) # with less highly correlated term
## STR english english_hc2
## 1.036565 4.049684 4.007776
As expected, when we introduce a highly correlated term into the model, both english and english_hc1 show exceedingly high VIF scores. Additionally, when a perfectly correlated term is included, R issues an error indicating “there are aliased coefficients in the model,” signaling a perfect multicollinearity issue.