For today’s discussion, we’ll be going over some important concepts from POL 211. You can access the discussion material for POL 211 through this link if you would like to review it.
The contemporary world is full of information, and it is becoming easier and easier to transform that information into data. Data analysis is about making good use of data for certain purposes, such as to predict or to make inferences about the world. For these tasks, a first step is to have a sound understanding of how the data was generated, i.e., which processes led to its creation. The analytical strategy for data analysis will be highly dependent on which processes have generated the data under scrutiny.
Overall, in the following methods sequence classes, we will only consider probability-based (stochastic) data generation processes. In a probability-based DGP, data is generated by a probabilistic process, i.e., each element of the population is selected with a (known) probability to form part of the sample data. One possible way to achieve this is by using a random mechanism when generating the data.
Example:
Research Question: What percentage of U.S. adults support the death penalty?
Steps in Statistical Investigation:
Produce Data: Determine what to measure, then collect the data. The poll selected 1,082 U.S. adults at random. Each adult answered this question: “Do you favor or oppose the death penalty for a person convicted of murder?”
Explore the Data: Analyze and summarize the data. In the sample, 65% favored the death penalty.
Draw a Conclusion: Use the data, probability, and statistical inference to draw a conclusion about the population.
However, there is a cost that has to be taken into account: by drawing a sample we have not observed all of the individuals in the population. It is necessary, then, to account for the uncertainty implied in the sampling process that generated the data. In addition to the sampling uncertainty, there are other sources of uncertainty that would lead us to assume our data is generated by a stochastic DGP: theoretical and fundamental uncertainty.
Let’s consider a scenario where we are not only interested in understanding the percentage of adults who support the death penalty in the U.S. but also whether there is a difference in support based on political affiliation, particularly Republicans. In this case, we have a theoretical framework that focuses on the relationship between a person’s political alignment and their stance on the death penalty. However, it’s possible that other factors, such as how individuals value a person’s life, could also influence their views on the death penalty. Unfortunately, we didn’t include these additional factors in our theory, which could potentially impact our data collection. Consequently, the results derived from our sample data might be subject to theoretical uncertainty. Additionally, there is the possibility of fundamental uncertainty within our sample data. This could arise when respondents pay minimal attention to the survey questions and opt for random responses regarding their opinions on the death penalty. Such factors are beyond the control of researchers and can introduce further uncertainty into our findings.
In summary, statistical inferences always involve an element of uncertainty. As a researcher, our aim is to account for this uncertainty by incorporating probability statements as an integral part of our conclusions in our findings.
How do we model a stochastic data generating process? Any time we want to answer a research question that involves using a sample to draw a conclusion about some larger population, we need to answer questions like “how likely is it that we get a sample proportion of 65% of people supporting the death penalty?” or “what is the probability of…?”. To answer such questions, we need to use probability to quantify the uncertainty of the outcomes of interest.
Just a reminder: when we employ a probability distribution to model a stochastic DGP, we are essentially playing God, asserting that the data for the variable of interest is generated by this probability distribution. However, in real life, what we can directly observe is the empirical distribution (or frequency distribution) derived from our available sample data. We make the assumption that this empirical distribution from the sample data reflects the underlying theoretical probability distribution.
Let’s now review some specially named discrete and continuous probability distributions, such as the Bernoulli distribution, binomial distribution, uniform distribution, and normal distribution.
The Bernoulli distribution is a type of discrete probability distribution for experiments that ask a question that can be answered only with yes or no. In other words, the random variable can be 1 with a probability \(p\) or it can be 0 with a probability \((1 - p)\). Such an experiment is called a Bernoulli trial. A pass-or-fail exam can be modeled by a Bernoulli distribution. The Bernoulli distribution can also be used to derive the binomial distribution, geometric distribution, and negative binomial distribution.
For example, suppose there is an experiment where you flip a coin that is fair. If the outcome of the flip is heads then you will win. This means that the probability of getting heads is \(p = \frac{1}{2}\). If \(X\) is the random variable following a Bernoulli Distribution, we get \(P(X = 1) = p = \frac{1}{2}\). We can also write it as \(X \sim Bernoulli(p)\) or just \(X \sim B(p)\).
Probability Mass Function (PMF)
A Bernoulli distribution is a discrete probability distribution where \(Y=\left \{0, 1 \right \}\), its PMF is defined by:
\[ f(y) = \begin{cases} 1 - p & \text{if } \ \ y = 0 \\ p & \text{if } \ \ y = 1 \\ \end{cases} \] A Bernoulli distribution only has one parameter, \(p\). It is a special case of the Binomial distribution when the number of trials = 1.
Cumulative Distribution Function (CDF)
\[ F(y) = \begin{cases} 0 & \text{if } \ \ y < 0 \\ 1-p & \text{if } \ \ 0 \leq y < 1\\ 1 & \text{if } \ \ y \geq 1 \end{cases} \]
Expected Value and Variance
Expected Value: \(E(Y) = p\)
Variance: \(Var(Y) = p(1-p)\)
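As a quick sanity check (a minimal sketch, not part of the original example), we can simulate Bernoulli draws in R with rbinom() using size = 1 and compare the sample mean and variance with the theoretical values \(p\) and \(p(1-p)\):

## Simulate 10,000 Bernoulli(p = 0.5) draws; a Bernoulli trial is a Binomial with one trial
set.seed(1234)
p <- 0.5
y <- rbinom(n = 10000, size = 1, prob = p)
mean(y) ## should be close to E(Y) = p = 0.5
var(y)  ## should be close to Var(Y) = p(1 - p) = 0.25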
Suppose we repeat a Bernoulli experiment \(n\) times independently and we add up the outcomes. That is, suppose that our random variable is \(Y = X_1 + X_2 + · · · + X_n\), where \(X_i \sim Bernoulli(p)\) and the \(X_i\) are independent. Then \(Y\) is said to have a Binomial distribution with sample size \(n\) and probability of success \(p\), denoted \(Y \sim Binomial(n, p)\). As you can see, a Binomial distribution has two parameters, \(n\) and \(p\).
Probability Mass Function (PMF)
\[ f(y) = \begin{cases} \binom{n}{y}p^y(1-p)^{n-y} & \text{if } \ \ y \in \left \{0, 1, 2, ..., n \right \} \\ 0 & \text{otherwise } \end{cases} \]
where:
\(n\) is the number of trials (occurrences)
\(y\) is the number of successful trials
\(p\) is the probability of success in a single trial
\(\binom{n}{y}\) is the combination of \(n\) and \(y\). A combination is the number of ways to choose a sample of \(y\) elements from a set of \(n\) distinct objects where order does not matter, and replacements are not allowed. Note that \(\binom{n}{y} = \frac{n!}{y!(n-y)!}\), where \(!\) is factorial (so, \(4! = 4 \times 3 \times 2 \times 1\)).
Cumulative Distribution Function (CDF)
\[F(y) = P(Y \leq y) = \sum_{k=0}^{y} \binom{n}{k}p^k(1-p)^{n-k} \]
Expected Value and Variance
Expected Value: \(E(Y) = np\)
Variance: \(Var(Y) = np(1-p)\)
Binomial Distribution in R
We can also use R to do the calculation for Exercise 2. Here are some base R functions to do it:
Function | Description |
---|---|
`dbinom(x, size, prob)` | Find the probability of getting a certain number of successes (`x`) in a certain number of trials (`size`) where the probability of success on each trial is fixed (`prob`). |
`pbinom(q, size, prob)` | Return the value of the cumulative distribution function (CDF) of the binomial distribution given a certain value `q`, number of trials (`size`), and probability of success on each trial (`prob`). |
`qbinom(p, size, prob)` | The inverse of the `pbinom()` function. It takes a probability value `p` and returns the quantile that corresponds to that probability. |
`rbinom(n, size, prob)` | Generate a vector of binomially distributed random variables given a vector length `n`, number of trials (`size`), and probability of success on each trial (`prob`). |
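For instance (a small illustration with made-up numbers, not tied to any particular exercise), suppose \(Y \sim Binomial(10, 0.5)\), e.g., the number of heads in 10 fair coin flips:

## P(Y = 6): same as choose(10, 6) * 0.5^6 * (1 - 0.5)^4
dbinom(x = 6, size = 10, prob = 0.5)

## P(Y <= 6)
pbinom(q = 6, size = 10, prob = 0.5)

## Simulated draws should roughly match E(Y) = np = 5 and Var(Y) = np(1 - p) = 2.5
set.seed(1234)
y <- rbinom(n = 10000, size = 10, prob = 0.5)
mean(y)
var(y)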
In statistics, uniform distribution is a term used to describe a form of probability distribution where every possible outcome has an equal likelihood of happening. The probability is constant since each variable has equal chances of being the outcome. There are two types of uniform distribution:
Discrete uniform distribution
A discrete uniform distribution is a statistical distribution where the probability of outcomes is equally likely and with finite values. A good example of a discrete uniform distribution would be the possible outcomes of rolling a 6-sided die. The possible values would be 1, 2, 3, 4, 5, or 6. In this case, each of the six numbers has an equal chance of appearing. Therefore, each time the 6-sided die is thrown, each side has a chance of \(\frac{1}{6}\). The number of values is finite. It is impossible to get a value of 1.3, 4.2, or 5.7 when rolling a fair die.
Continuous uniform distribution
Not all uniform distributions are discrete; some are continuous. A continuous uniform distribution (also referred to as rectangular distribution) is a statistical distribution with an infinite number of equally likely measurable values. Unlike discrete random variables, a continuous random variable can take any real value within a specified range.
A continuous uniform distribution usually comes in a rectangular shape. A good example of a continuous uniform distribution is an idealized random number generator. With continuous uniform distribution, just like discrete uniform distribution, every variable has an equal chance of happening. However, there is an infinite number of points that can exist. Today, we will focus on continuous uniform distribution.
Probability Density Function (PDF)
A continuous random variable \(X\) has a uniform distribution, denoted \(X \sim Uniform \left [a, b\right ]\), if its probability density function is:
\[ f(x) = \begin{cases} \frac{1}{b-a}& \text{if } x \in [a, b] \\ 0 & \text{otherwise} \end{cases} \]
for two constants \(a\) and \(b\), such that \(a<x<b\). A graph of the PDF looks like this:
Note that the length of the base of the rectangle is \((b-a)\), while the length of the height of the rectangle is \(\frac{1}{b-a}\). Therefore, as should be expected, the area under \(f(x)\) and between the endpoints \(a\) and \(b\) is 1. Additionally, \(f(x)>0\) over the support \(a<x<b\). Therefore, \(f(x)\) is a valid probability density function.
Cumulative Distribution Function (CDF)
The cumulative distribution function of a continuous uniform random variable is:
\[F(x) = \begin{cases} 0 & \text{if } \ \ x < a \\ \frac{x-a}{b-a}& \text{if } \ \ x \in [a, b] \\ 1 & \text{if } \ \ x > b \\ \end{cases}\] A graph of the CDF looks like this:
As the picture illustrates, \(F(x)=0\) when \(x\) is less than the lower endpoint of the support (\(a\), in this case) and \(F(x)=1\) when \(x\) is greater than the upper endpoint of the support (\(b\), in this case). The slope of the line between \(a\) and \(b\) is, of course, \(\frac{1}{b-a}\).
Expected Value and Variance
Expected Value: \(E(X) = \frac{(a+b)}{2}\)
Variance: \(Var(X) = \frac{(b-a)^2}{12}\)
Continuous Uniform Distribution in R
Function | Description |
---|---|
`dunif(x, min, max)` | Return the corresponding uniform probability density function (PDF) values of a certain vector of quantiles `x`. |
`punif(q, min, max)` | Return the value of the cumulative distribution function (CDF) of the uniform distribution given a certain value `q` and the interval. |
`qunif(p, min, max)` | The inverse of the `punif()` function. It takes a probability value `p` and returns the quantile that corresponds to that probability. |
`runif(n, min, max)` | Generate `n` random numbers within any interval, defined by the `min` and the `max` arguments. |
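As an illustration (a minimal sketch with arbitrary endpoints \(a = 0\) and \(b = 10\)), these functions line up with the formulas above:

## X ~ Uniform[0, 10]
dunif(x = 3, min = 0, max = 10) ## density = 1/(b - a) = 0.1
punif(q = 3, min = 0, max = 10) ## CDF = (x - a)/(b - a) = 0.3

## Simulated draws should roughly match E(X) = (a + b)/2 = 5 and Var(X) = (b - a)^2/12 = 8.33
set.seed(1234)
x <- runif(n = 10000, min = 0, max = 10)
mean(x)
var(x)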
The normal distribution is the most important probability distribution in statistics because so many real-world quantities approximately follow it, for example, the heights of a population, shoe sizes, IQ scores, and many more.
Probability Density Function (PDF)
The continuous random variable \(X\) follows a normal distribution if its probability density function is defined as:
\[f(x)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left \{ -\frac{1}{2} \left(\frac{x-\mu}{\sigma}\right)^2\right \}\]
for \(-\infty < x < \infty\), \(-\infty < \mu < \infty\), and \(0 < \sigma < \infty\). The mean of \(X\) is \(\mu\) and the variance of \(X\) is \(\sigma^2\). We say \(X \sim N(\mu, \sigma^2)\).
The probability density function of the normal distribution is a bell-shaped, symmetric curve. Its peak is at the mean of the distribution, with half of the values lying to the left of the mean and the other half to the right. The shape of any normal curve depends on its mean \(\mu\) and standard deviation \(\sigma\).
Cumulative Distribution Function (CDF)
We won’t focus on the CDF of normal distribution as it is ugly, but here it is.
\[F(x) = \Phi \left(\frac{x-\mu}{\sigma}\right)=\frac{1}{2}\left [ 1+\operatorname{erf}\left(\frac{x-\mu}{\sigma\sqrt{2}}\right) \right ]\]
Expected Value and Variance
Expected Value: \(E(x) = \mu\)
Variance: \(Var(x) = \sigma^2\)
Normal Distribution in R
Function | Description |
---|---|
`dnorm(x, mean, sd)` | Probability density function (PDF). |
`pnorm(q, mean, sd)` | Cumulative distribution function (CDF), which measures the probability that a random variable \(X\) takes a value less than or equal to `q`. |
`qnorm(p, mean, sd)` | The inverse of the `pnorm()` function. It takes a probability value `p` and returns the quantile that corresponds to that probability. |
`rnorm(n, mean, sd)` | Generate a vector of `n` random numbers which are normally distributed. |
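For example (a minimal sketch using the standard normal, \(X \sim N(0, 1)\)):

## Density at the mean (about 0.399)
dnorm(x = 0, mean = 0, sd = 1)

## P(X <= 1.96) is about 0.975
pnorm(q = 1.96, mean = 0, sd = 1)

## The 97.5th percentile is about 1.96 (the inverse of the line above)
qnorm(p = 0.975, mean = 0, sd = 1)

## Simulated draws should have mean close to 0 and sd close to 1
set.seed(1234)
x <- rnorm(n = 10000, mean = 0, sd = 1)
mean(x)
sd(x)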
The sampling distribution of a statistic is the distribution of all possible values taken by the statistic when all possible samples of a fixed size \(n\) are taken from the population. It is a theoretical idea – we do not actually build it.
Simply said, the sampling distribution of a statistic is the probability distribution of that statistic.
Suppose we have a sample with independent and identically distributed (iid) draws. We know that \(\bar{X}\) is a random variable with a sampling distribution.
Independent means that the sample items are all independent events. In other words, they are not connected to each other in any way; knowledge of the value of one variable gives no information about the value of the other and vice versa.
Identically distributed means that there are no overall trends – the distribution doesn’t fluctuate and all items in the sample are taken from the same probability distribution.
The expected value of the sampling distribution of \(\bar{X}\) is the population mean \(\mu\).
\[E(\bar{X}) = \mu\]
The expected value of the sampling distribution for \(\bar{X}\) is simply the mean of the population (DGP) distribution. This means that the sample mean \(\bar{X}\) is an unbiased estimator for the population (DGP) mean, \(\mu\).
The variance of the sampling distribution of \(\bar{X}\) is the population variance \(\sigma^2\) divided by the sample size \(n\). \[Var(\bar{X}) = \frac{\sigma^2}{n}\] Note the \(X_i\) are identically distributed, which means they have the same variance \(\sigma^2\). Therefore, we can replace \(Var(X_i)\) with the alternative notation \(\sigma^2\).
If we take the square root of variance of the sampling distribution we get the standard deviation of the sampling distribution, which is known as standard error.
\[SE(\bar{X}) = \sqrt{Var(\bar{X})} = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}\]
The standard error of a sampling distribution gives us a sense for how confident we should be in our estimate of the population mean. Our result indicates that as the sample size increases, the standard deviation of the sample mean decreases, which is good. It means that as long as our sample is large enough, we can be very confident about our estimate of the population mean based on the sample mean we have.
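For instance (a toy calculation with an arbitrary \(\sigma = 10\), just to make the shrinkage concrete):

sigma <- 10        ## an arbitrary population standard deviation for illustration
sigma / sqrt(10)   ## SE with n = 10: about 3.16
sigma / sqrt(100)  ## SE with n = 100: 1
sigma / sqrt(1000) ## SE with n = 1000: about 0.32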
In the following example, we use R to illustrate the sampling distribution of the sample mean for a hypothetical population, drawing each sample directly from the assumed population (DGP) distribution.
An instructor of an introduction to statistics course has 200 students (which is the population). The population mean of their scores out of 100 points is \(\mu = 71.18\), and the population standard deviation is \(\sigma = 10.73\). That is, these 200 students' scores \(X\) follow a normal distribution with parameters \(\mu = 71.18\) and \(\sigma = 10.73\), \(X \sim N(71.18, 10.73)\).
Let’s first draw one sample based on the population distribution, \(X \sim N(71.18, 10.73)\).
set.seed(1234) # to get replicable results
exam.sample <- rnorm(n = 10, mean = 71.18, sd = 10.73)
exam.sample
## [1] 58.22818 74.15682 82.81605 46.01066 75.78451 76.60998 65.01304 65.31464
## [9] 65.12343 61.62989
Now, let’s replicate the above process 1000 times, which will generate 1000 iid samples based on the population distribution, \(X \sim N(71.18, 10.73)\).
set.seed(1234)
exam.iid.sample <- replicate(1000, rnorm(n = 10, mean = 71.18, sd = 10.73))
exam.iid.sample <- as.data.frame(exam.iid.sample)
knitr::kable(exam.iid.sample[, c(1:5)], "simple", align = "ccccc")
V1 | V2 | V3 | V4 | V5 |
---|---|---|---|---|
58.22818 | 66.05972 | 72.61877 | 83.00765 | 86.73309 |
74.15682 | 60.46731 | 65.91494 | 66.07689 | 59.71346 |
82.81605 | 62.85080 | 66.45292 | 63.56771 | 62.00194 |
46.01066 | 71.87164 | 76.11139 | 65.80150 | 68.16892 |
75.78451 | 81.47537 | 63.73638 | 53.69983 | 60.51073 |
76.60998 | 69.99664 | 55.64076 | 58.65145 | 60.78784 |
65.01304 | 65.69687 | 77.34713 | 47.78817 | 59.29848 |
65.31464 | 61.40287 | 60.19617 | 56.79114 | 57.74619 |
65.12343 | 62.19715 | 71.01757 | 68.02223 | 65.55932 |
61.62989 | 97.10191 | 61.13727 | 66.18092 | 65.84880 |
Since we drew 1000 iid samples from the population, we should now have 1000 sample means. Let’s plot all these sample means in a histogram to observe their distribution, which is the sampling distribution.
Note the process of taking many different samples in an effort to see what the sampling distribution looks like is referred to as a “Monte Carlo simulation.” The “true” sampling distribution is the one where we take an infinite number of samples, which obviously we cannot achieve. Monte Carlo simulation is the term for the process of repeated sampling to approximate something. We’re using it here for sampling distributions, but it is used in other contexts as well.
# What is the mean of each sample?
sample.mean <- colMeans(exam.iid.sample)
sample.mean[1:5]
## V1 V2 V3 V4 V5
## 67.06872 69.91203 67.01733 62.95875 64.63688
# Plot the distribution of these sample means
library(tidyverse)
my.bins <- seq(62, 82, 0.5)
ggplot(data.frame(sample.mean = sample.mean), aes(x = sample.mean)) +
geom_histogram(breaks = my.bins, color = "black", fill = "gray85") +
geom_vline(xintercept = mean(sample.mean), linetype = "dashed",
color = "red", alpha = 0.5, lwd = 0.8) +
geom_vline(xintercept = 71.18, linetype = "dashed",
color = "blue", alpha = 0.5, lwd = 0.8) +
scale_x_continuous(limits = c(62, 82)) +
scale_y_continuous(limits = c(0, 250)) +
labs(x = "Sample mean (X bar)", y = "Frequency",
title = "1000 Samples with n=10") +
theme_bw() +
theme(panel.grid.minor = element_blank())
What are the mean and standard error of the sample mean?
mean(sample.mean)
## [1] 71.24562
sd(sample.mean)
## [1] 3.252043
As you can see, the mean and standard error based on all 1000 samples we drew are very close to the population mean of 71.18 and the standard deviation divided by the square root of the sample size (\(\frac{10.73}{\sqrt{10}}=3.393124\)). However, they are not exactly the same. The reason is that we can only simulate the sampling distribution of the exam scores but cannot simulate it for an infinite number of samples. That said, we can never reach the “true” sampling distribution.
Now let's look at the sampling distribution of the sample mean when \(n = 100\) for the exam scores data.
set.seed(1234)
exam.iid.sample <- replicate(1000, rnorm(n = 100, mean = 71.18, sd = 10.73))
exam.iid.sample <- as.data.frame(exam.iid.sample)
# What is the mean of each sample?
sample.mean <- colMeans(exam.iid.sample)
sample.mean[1:5]
## V1 V2 V3 V4 V5
## 69.49795 71.62254 72.83890 71.09303 70.94624
# Plot the distribution of these sample means
library(tidyverse)
ggplot(data.frame(sample.mean = sample.mean), aes(x = sample.mean)) +
geom_histogram(breaks = my.bins, color = "black", fill = "gray85") +
geom_vline(xintercept = mean(sample.mean), linetype = "dashed",
color = "red", alpha = 0.5, lwd = 0.8) +
geom_vline(xintercept = 71.18, linetype = "dashed",
color = "blue", alpha = 0.5, lwd = 0.8) +
scale_x_continuous(limits = c(62, 82)) +
scale_y_continuous(limits = c(0, 250)) +
labs(x = "Sample mean (X bar)", y = "Frequency",
title = "1000 Samples with n=100") +
theme_bw() +
theme(panel.grid.minor = element_blank())
What are the mean and standard error of the sample mean?
mean(sample.mean)
## [1] 71.21143
sd(sample.mean)
## [1] 1.07794
As you observe from the exam score example, if the population is normally distributed with mean \(\mu\) and standard deviation \(\sigma\), then the sampling distribution of the sample mean is also normally distributed no matter what the sample size is.
What happens when the sample comes from a population that is not normally distributed? This is where the Central Limit Theorem comes in.
For a large sample size (\(n > 30\)), \(\bar{X}\) is approximately normally distributed, regardless of the distribution of the population one samples from. If the population has mean \(\mu\) and standard deviation \(\sigma\), then \(\bar{X}\) follows an approximate normal distribution with mean \(\mu\) and standard deviation \(\frac{\sigma}{\sqrt{n}}\).
The Central Limit Theorem applies to a sample mean from any distribution. We could have a left-skewed or a right-skewed distribution. As long as the sample size is large, the distribution of the sample means will follow an approximate normal distribution.
Notes on the CLT:
If the population is skewed and the sample size you draw is small, then the distribution of the sample mean won’t be normal.
If the population is normal, then the distribution of sample mean looks normal even if \(n = 2\).
If the population is skewed, then the distribution of sample mean looks more and more normal when \(n\) gets larger.
Note that in all cases, the mean of the sample mean is close to the population mean and the standard error of the sample mean is close to \(\frac{\sigma}{\sqrt{n}}\).
Suppose \(X\) follows a chi-square distribution with 3 degrees of freedom, \(X \sim \chi^2(3)\). The population mean of a chi-square distribution equals its degrees of freedom; in this case, \(\mu = 3\). Please do the following:
This is what a chi-square distribution looks like:
Find the sampling distribution of the sample mean when \(n = 5\) for \(X\).
set.seed(1234)
iid.sample <- replicate(1000, rchisq(n = 5, df = 3, ncp = 0))
iid.sample <- as.data.frame(iid.sample)
# What is the mean of each sample?
sample.mean <- colMeans(iid.sample)
sample.mean[1:5]
## V1 V2 V3 V4 V5
## 1.944331 1.521077 2.798356 1.653387 2.474112
# Plot the distribution of these sample means
my.bins <- seq(0, 8, 0.1)
ggplot(data.frame(sample.mean = sample.mean), aes(x = sample.mean)) +
geom_histogram(breaks = my.bins, color = "black", fill = "gray85") +
geom_vline(xintercept = mean(sample.mean), linetype = "dashed",
color = "red", alpha = 0.5, lwd = 0.8) +
geom_vline(xintercept = 3, linetype = "dashed",
color = "blue", alpha = 0.5, lwd = 0.8) +
scale_x_continuous(limits = c(0, 8)) +
scale_y_continuous(limits = c(0, 100)) +
labs(x = "Sample mean (X bar)", y = "Frequency",
title = "1000 Samples with n=5") +
theme_bw() +
theme(panel.grid.minor = element_blank())
Find the sampling distribution of the sample mean when \(n = 30\) for \(X\).
set.seed(1234)
iid.sample <- replicate(1000, rchisq(n = 30, df = 3, ncp = 0))
iid.sample <- as.data.frame(iid.sample)
# What is the mean of each sample?
sample.mean <- colMeans(iid.sample)
sample.mean[1:5]
## V1 V2 V3 V4 V5
## 2.690281 2.510694 3.307291 2.776163 3.632522
# Plot the distribution of these sample means
ggplot(data.frame(sample.mean = sample.mean), aes(x = sample.mean)) +
geom_histogram(breaks = my.bins, color = "black", fill = "gray85") +
geom_vline(xintercept = mean(sample.mean), linetype = "dashed",
color = "red", alpha = 0.5, lwd = 0.8) +
geom_vline(xintercept = 3, linetype = "dashed",
color = "blue", alpha = 0.5, lwd = 0.8) +
scale_x_continuous(limits = c(0, 8)) +
scale_y_continuous(limits = c(0, 100)) +
labs(x = "Sample mean (X bar)", y = "Frequency",
title = "1000 Samples with n=30") +
theme_bw() +
theme(panel.grid.minor = element_blank())
The real power of statistics comes from applying the concepts of probability to situations where you have data but not necessarily the whole population. The results, called statistical inference, give you probability statements about the population of interest based on that set of data.
There are two types of statistical inferences:
Estimation: Use information from the sample to estimate (or predict) the parameter of interest. For instance, using the result of a poll about the president’s current approval rating to estimate (or predict) his or her true current approval rating nationwide.
Hypothesis Tests: Use information from the sample to determine whether a certain statement about the parameter of interest is true. For instance, suppose a news station claims that the President’s current approval rating is more than 75%. We want to determine whether that statement is supported by the poll data.
Let’s start with estimation and confidence intervals. Since the sampling distribution of \(\bar{X}\) is (approximately) normal with mean \(\mu\) and standard error \(\frac{\sigma}{\sqrt{n}}\), we have:

\[P[\bar{X}-z_{\frac{\alpha}{2}}(\frac{\sigma}{\sqrt{n}})\leq \mu \leq \bar{X}+z_{\frac{\alpha}{2}}(\frac{\sigma}{\sqrt{n}})]=1-\alpha\] This indicates that a \((1-\alpha)*100\%\) confidence interval for the mean \(\mu\) is:
\[[\bar{x} - z_{\frac{\alpha}{2}}(\frac{\sigma}{\sqrt{n}}), \ \ \bar{x} + z_{\frac{\alpha}{2}}(\frac{\sigma}{\sqrt{n}})]\]
If the population standard deviation (\(\sigma\)) is unknown, we can use sample standard deviation to estimate it:
\[[\bar{x} - z_{\frac{\alpha}{2}}(\frac{s_x}{\sqrt{n}}), \ \ \bar{x} + z_{\frac{\alpha}{2}}(\frac{s_x}{\sqrt{n}})]\] Suppose we take a large number of samples, say 1000. Then, we calculate a 95% confidence interval for each sample. Being “95% confident” means that we’d expect 95%, or 950, of the 1000 intervals to be correct, that is, to contain the actual unknown value \(\mu\).
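As a quick illustration (a minimal sketch with a simulated sample; the population values \(\mu = 50\) and \(\sigma = 10\) are made up), we can construct a 95% confidence interval for the mean in R:

## One simulated sample of size 100 from a hypothetical population
set.seed(1234)
x <- rnorm(n = 100, mean = 50, sd = 10)

x.bar <- mean(x)
z <- qnorm(0.975)              ## z_(alpha/2) for alpha = 0.05, about 1.96
se <- sd(x) / sqrt(length(x))  ## sample sd used to estimate sigma

c(lower = x.bar - z * se, upper = x.bar + z * se)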
Hypothesis testing is the process of making a choice between two conflicting hypotheses. First, a tentative assumption is made about the parameter or distribution. This assumption is called the null hypothesis and is denoted by \(H_0\). An alternative hypothesis (denoted \(H_a\)), which is the opposite of what is stated in the null hypothesis, is then defined. The hypothesis-testing procedure involves using sample data to determine whether or not \(H_0\) can be rejected. If \(H_0\) is rejected, the statistical conclusion is that the alternative hypothesis \(H_a\) is true.
Null hypothesis (\(H_0\)): The null hypothesis states the “status quo”. This hypothesis is assumed to be true until there is evidence to suggest otherwise.
Alternative hypothesis (\(H_a\)): This is the statement that one wants to conclude. It is also called the research hypothesis.
The goal of hypothesis testing is to see if there is enough evidence against the null hypothesis. In other words, to see if there is enough evidence to reject the null hypothesis. If there is not enough evidence, then we fail to reject the null hypothesis.
We want to know the answer to a research question. We determine our null and alternative hypotheses. Now it is time to make a decision.
The decision is either going to be:
reject the null hypothesis or …
fail to reject the null hypothesis.
Note: Why can’t we say we “accept the null”? The reason is that we are assuming the null hypothesis is true and trying to see if there is evidence against it. Therefore, the conclusion should be in terms of rejecting the null.
Consider the following table. The table shows the decision/conclusion of the hypothesis test and the unknown “reality”, or truth. We do not know if the null is true or if it is false. If the null is false and we reject it, then we made the correct decision. If the null hypothesis is true and we fail to reject it, then we made the correct decision.
What happens when we do not make the correct decision? When doing hypothesis testing, two types of mistakes may be made and we call them Type I error and Type II error. If we reject the null hypothesis when it is true, then we made a type I error. If the null hypothesis is false and we failed to reject it, we made another error called a Type II error.
The “reality”, or truth, about the null hypothesis is unknown and therefore we do not know if we have made the correct decision or if we committed an error. We can, however, define the likelihood of these events.
\(\alpha\): The probability of committing a Type I error. Also known as the significance level.
\(\beta\): The probability of committing a Type II error.
Power: Power is the probability the null hypothesis is rejected given that it is false (i.e., \(1-\beta\)).
\(\alpha\) and \(\beta\) are probabilities of committing an error so we want these values to be low. However, we cannot decrease both. As \(\alpha\) increases, \(\beta\) decreases.
Note: Type I error is also thought of as the event that we reject the null hypothesis GIVEN the null is true. In other words, Type I error is a conditional event and \(\alpha\) is a conditional probability. The same idea applies to Type II error and \(\beta\).
In today’s discussion, we will revisit some techniques covered in POL 211 concerning how to capture relationships within data. Additionally, we will delve deeper into the application of Ordinary Least Squares (OLS) regression to capture the specific relationships of interest.
When we want to describe the relationship between two sets of data, we can plot the data sets in a scatter plot and look at four characteristics:
Today, our focus will be on the first three characteristics. We will delve into covariance, the correlation coefficient, and linear regression. These are three fundamental statistics that allow us to describe the relationships between the variables of interest.
For today, we are using `Covid19.csv`. You can download this csv file from Canvas or from here.
## load packages and import data
library(tidyverse)
cf <- read.csv("/Users/yu-shiuanhuang/Desktop/method-sequence/data/Covid19.csv")
Let’s visualize the relationship between the politics of a county (`dem16vs`) and the share of people who say they always wear a mask (`mask_always`) in a scatter plot.

`dem16vs`: Democratic vote share in the 2016 presidential election.

`mask_always`: The estimated share of people in a county who would say they always wear a mask in public when they expect to be within six feet of another person.

ggplot(cf, aes(x = dem16vs, y = mask_always)) +
geom_point(color = "black", fill = "white",
stroke = 1, shape = 1) +
## Use the stroke aesthetic to modify the width of the border
## The shape of points can be adjusted by specifying the shape argument
xlab("Democratic Vote Share, 2016 Presidential") +
ylab("% Reporting Always Wearing Mask") +
ylim(0,1) +
xlim(0,1) +
labs(title = "NY Times Covid-19 Survey",
subtitle = "All Counties") +
theme_bw()
The plot above is hard to see since all the dots are on top of each other. So instead of plotting all the datapoints, let’s draw a random sample of 500 of the rows of the dataset and plot them.
set.seed(110) ## this allows us to draw the *same* random sample every time.
samp <- sample(1:nrow(cf), 500)
## This means randomly taking a sample of 500 from the elements of the total rows of cf
glimpse(samp)
## int [1:500] 2008 2790 2483 336 2403 2921 1659 772 1427 2931 ...
`samp` contains the rows we randomly sample from all the rows in `cf`. For example, the first element in `samp` is 2008, which means we’re going to draw the 2008th row out of the original dataset.
Now let’s replot the scatter plot again with only 500 datapoints we randomly drew from the whole dataset.
ggplot(cf[samp,], aes(x = dem16vs, y = mask_always)) +
## Specify the random sample within the bracket of cf and remember to put it on the row position!
geom_point(color = "black", fill = "white",
stroke = 1, shape = 1) +
xlab("Democratic Vote Share, 2016 Presidential") +
ylab("% Reporting Always Wearing Mask") +
ylim(0,1) +
xlim(0,1) +
labs(title = "NY Times Covid-19 Survey",
subtitle = "500 Randomly Selected Counties") +
theme_bw()
In addition to visualizing data using scatter plots to grasp relationships, how can we employ statistical measures to describe these relationships more formally?
Covariance
Covariance is a useful measure for describing the direction of the linear association between two quantitative variables.
\[Cov(x, y) = \frac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})\]
As we can see from the equation, the covariance sums the term \((x_i-\bar{x})(y_i-\bar{y})\) for each data point, where \(\bar{x}\) is the average \(x\) value, and \(\bar{y}\) is the average \(y\) value. The term becomes more positive if both \(x\) and \(y\) are larger than the average values in the data set, and becomes more negative if smaller. As the covariance accounts for every data point in the set, a positive covariance must mean that most, if not all, data points are in sync with respect to \(x\) and \(y\) (small \(y\) when \(x\) is small or large \(y\) when \(x\) is large). Conversely, a negative covariance must mean that most, if not all, data points are out of sync with respect to \(x\) and \(y\) (small \(y\) when \(x\) is large or large \(y\) when \(x\) is small).
Let’s start by using a fake dataset to practice how to compute covariance.
tf <- data.frame(x = c(3, 2, 1, 7, -1),
y = c(8, 10, 13, 2, 8))
knitr::kable(tf, "simple", align = "cc")
x | y |
---|---|
3 | 8 |
2 | 10 |
1 | 13 |
7 | 2 |
-1 | 8 |
## Covariance
tf$x.dev2 <- tf$x - mean(tf$x)
tf$y.dev2 <- tf$y - mean(tf$y)
cov.fake <- (1/(nrow(tf)-1))*sum(tf$x.dev2*tf$y.dev2)
cov.fake
## [1] -8.85
You can use the canned function `cov()` to double check your answer.
## Covariance
cov(tf$x, tf$y)
## [1] -8.85
Covariance is a useful measure for describing the direction of the linear association between two quantitative variables, but it has two weaknesses: a larger covariance does not always mean a stronger relationship, and we cannot compare covariances across different sets of relationships. For example, if we find that the covariance between variables \(x\) and \(y\) is larger than the covariance between \(x\) and \(z\), we can only conclude that both \(y\) and \(z\) are positively associated with \(x\). However, we cannot determine whether the relationship between \(x\) and \(y\) is stronger than the relationship between \(x\) and \(z\) based solely on their covariances.
Correlation Coefficient
To address these weaknesses, we normalize the covariance by the standard deviations of the \(x\) values and \(y\) values to get the correlation coefficient. The correlation coefficient is a value between -1 and 1, and measures both the direction and the strength of the linear association. One important distinction to note is that correlation does not measure the slope of the relationship; a large correlation only speaks to the strength of the relationship. The correlation coefficient is defined as:
\[r_{xy}=\frac{Cov(x, y)}{s_xs_y}\]
Please compute the correlation coefficient between \(x\) and \(y\) using the formula above with the fake dataset.
## Correlation
cor.fake <- cov.fake/(sd(tf$x)*sd(tf$y))
cor.fake
## [1] -0.7412154
You can also use the canned function `cor()` to double check your answers.
## Correlation
cor(tf$x, tf$y)
## [1] -0.7412154
Linear Regression
Correlation and covariance are quantitative measures of the strength and direction of the relationship between two variables, but they do not account for the slope of the relationship. In other words, we do not know how a change in one variable would impact the other variable. Regression is the technique that fills this void: it allows us to make the best guess at how one variable affects the other. The simplest linear regression, usually estimated by the ordinary least squares (OLS) method, allows us to fit a “line of best fit” to the scatter plot and use that line (or model) to describe the relationship between the two variables. The OLS approach aims to fit a line that minimizes the squared residuals. The equation for that line is:
\[\hat{y} = a + bx\] The equations of \(a\) and \(b\) are:
\[b = \frac{r_{xy}s_{y}}{s_{x}}=\frac{Cov(x, y)}{s^2_{x}}\]
\[a = \bar{y}-b\bar{x}\]
Note: A more detailed introduction regarding linear regression will be discussed in POL 212 and 213.
What are the intercept (\(a\)) and slope (\(b\)) between \(x\) and \(y\) in the fake dataset? Please compute them manually!
## Slope
b <- cov(tf$x, tf$y)/var(tf$x)
## Intercept
a <- mean(tf$y) - b * mean(tf$x)
You can also use the canned function `lm()` to check your answer.
reg <- lm(y ~ x, data = tf)
stargazer::stargazer(reg, type = "text", digits = 4)
##
## ===============================================
## Dependent variable:
## ---------------------------
## y
## -----------------------------------------------
## x -1.0057
## (0.5258)
##
## Constant 10.6136**
## (1.8813)
##
## -----------------------------------------------
## Observations 5
## R2 0.5494
## Adjusted R2 0.3992
## Residual Std. Error 3.1198 (df = 3)
## F Statistic 3.6578 (df = 1; 3)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Below is a population regression line equation, assuming that we are playing God and know the data-generating process of the relationship between \(X\) (independent/explanatory variable) and \(Y\) (dependent/response/outcome variable).
However, since we usually don’t have the population data but only have access to a sample of it, we can only use the sample dataset to estimate the population regression line. In the below figure, \(b_0\) is the estimate of the regression intercept \(\beta_0\), while \(b_1\) is the estimate of the regression coefficient/slope \(\beta_1\).
The Ordinary Least Squares (OLS) approach aims to fit a line that minimizes the sum of the squared residuals, which means that the goal is to find a pair of \(b_0\) and \(b_1\) that can minimize the difference between our observed \(Y_i\) and estimated \(\hat{Y_i}\).
\(S(b_0, b_1) = \Sigma E_i^2 = \Sigma (Y_i-\hat{Y_i})^2=\Sigma (Y_i-(b_0+b_1X_i))^2 = \Sigma (Y_i-b_0-b_1X_i)^2\)

Finding the pair of \(b_0\) and \(b_1\) that minimizes this difference between our observed \(Y_i\) and estimated \(\hat{Y_i}\) is an optimization problem. All we have to do is take the partial derivatives of the above equation with respect to \(b_0\) and \(b_1\), respectively, set them equal to zero, and then solve.
\(\frac{\partial S(b_0, b_1)}{\partial b_0}=\Sigma (-1)(2)(Y_i-b_0-b_1X_i)=0\)
\(\frac{\partial S(b_0, b_1)}{\partial b_1}=\Sigma (-X_i)(2)(Y_i-b_0-b_1X_i)=0\)
By solving the above two equations, you will get:
\(b_0 = \bar{Y}-b_1\bar{X}\)
\(b_1 = \frac{\Sigma (X_i-\bar{X})(Y_i-\bar{Y})}{\Sigma (X_i-\bar{X})^2}\)
With estimates of the intercept and slope of our linear regression line, we can then make statistical inferences to assess the likelihood that our estimates capture the true parameters. Initially, we need the standard errors of the intercept and slope to construct confidence intervals and conduct hypothesis tests. POL 213 will provide a more detailed discussion on how to derive these standard errors. For now, in 212, let’s assume these standard errors and use them for statistical inferences.
The residual standard error \(S_E\) (how closely the regression line we estimate fits the scatter of our data points) is: \(S_E = \hat{\sigma} = \sqrt{\frac{\Sigma E_i^2}{n-2}}\)
And the standard error of the sample intercept (\(b_0\)) and slope (\(b_1\)) are:
\(SE(b_0) = \hat{\sigma }\sqrt{\frac{\Sigma X_i^2}{n\Sigma (X_i-\bar{X})^2}}\)
\(SE(b_1) = \frac{\hat{\sigma }}{\sqrt{\Sigma (X_i-\bar{X})^2}}\)
With standard errors, we can construct a \(100(1-\alpha)\%\) confidence interval for our slope and perform hypothesis testing.
Confidence Interval
\(b_1 \pm t_{\frac{\alpha}{2}, df=n-2}SE(b_1)\)
Hypothesis Test
Two-tailed test (by default, the `lm()` function conducts a two-tailed test)
\(H_0: \beta = 0\)
\(H_1: \beta \neq 0\)
\(t=\frac{b_1-\beta_1}{SE(b_1)}=\frac{b_1-0}{SE(b_1)}\)
Compare this t-statistic with \(t_{\frac{\alpha}{2}, df=n-2}\) to see whether it falls beyond the critical values. If it is larger than the upper critical value or smaller than the lower critical value, we can reject \(H_0\) in favor of \(H_1\). In addition to comparing the t-statistic with the critical values, we can also calculate the p-value based on the t-statistic to perform hypothesis testing (p-value \(= 2*Pr(t \geq |t_c|)\) vs \(\alpha\)).
One-tailed test (if your hypotheses are directional, you can also conduct one-tailed test)
\(H_0: \beta = 0\)
\(H_1: \beta < 0\) or \(H_1: \beta > 0\)
\(t=\frac{b_1-\beta_1}{SE(b_1)}=\frac{b_1-0}{SE(b_1)}\)
When performing a one-tailed test, compare the t-statistic with \(t_{\alpha, df=n-2}\) without dividing \(\alpha\) by 2. P-value in one-tailed test is calculated as:
p-value \(= Pr(t \geq t_c)\) or \(Pr(t \leq t_c)\), depending on the direction of \(H_1\), vs \(\alpha\)
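If helpful, critical values for both kinds of tests can be pulled from the t distribution with qt(); the numbers below are a sketch assuming \(\alpha = 0.05\) and \(df = n - 2 = 498\):

## Two-tailed critical value, about 1.96
qt(p = 0.975, df = 498)

## One-tailed critical values, about 1.65 and -1.65
qt(p = 0.95, df = 498)
qt(p = 0.05, df = 498)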
Let’s now assume we know how exactly the data is generated and draw a sample out of the population we generate and then construct confidence intervals and perform hypothesis testing.
This is the example Hanno used in POL 211. Assume we study the relationship between money spent on coffee (\(X\)) and hours of sleep every night (\(Y\)) for UC Davis graduate students. The population of UC Davis graduate students is \(10000\), and the population distribution of coffee spending follows a normal distribution with mean 20 and standard deviation 8 (i.e., \(X \sim N(20, 8^2)\)).
We also know the exact population relationship between coffee spending and hours of sleep every night for UC Davis graduate students, which is defined as below:
set.seed(123)
coffee_spending <- rnorm(10000, 20, 8)
sleep <- 8 - 0.1*coffee_spending + rnorm(10000, 0, 3)
This is the population data we generated:
pop <- data.frame(coffee_spending, sleep)
head(pop)
## coffee_spending sleep
## 1 15.51619 13.560556
## 2 18.15858 5.683706
## 3 32.46967 7.533918
## 4 20.56407 4.239138
## 5 21.03430 6.571840
## 6 33.72052 8.023906
Now, let’s draw a sample with 500 observations from our population and estimate the relationship between coffee spending and hours of sleep every night.
set.seed(123)
samp <- sample(1:nrow(pop), 500)
glimpse(samp)
## int [1:500] 2463 2511 8718 2986 1842 9334 3371 4761 6746 9819 ...
`samp` contains the rows we randomly sample from all the rows in `pop`. For example, the first element in `samp` is 2463, which means we’re going to draw the 2463rd row out of the population dataset.
Let’s first plot the scatter plot with the sample we randomly drew from the population dataset.
ggplot(pop[samp,], aes(x = coffee_spending, y = sleep)) +
## Specify the random sample within the bracket of pop and remember to put it on the row position!
geom_point(color = "black", fill = "white",
stroke = 1, shape = 1) +
xlab("Money Spend on Coffee") +
ylab("Hours of Sleep Every Night") +
theme_bw()
It seems that there is a negative relationship between coffee spending and hours of sleep every night for UC Davis graduate students. Now, let’s further estimate the intercept and slope based on the sample we drew. To simplify the process of calculating the residual standard error and the standard errors of both the slope and intercept, I use the `lm()` function here to get all the estimates and statistics we need to construct confidence intervals and perform hypothesis tests.
mod <- lm(sleep ~ coffee_spending, data = pop[samp,])
summary(mod)
##
## Call:
## lm(formula = sleep ~ coffee_spending, data = pop[samp, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.763 -2.143 -0.100 1.919 9.844
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.26311 0.38665 18.785 < 2e-16 ***
## coffee_spending -0.06986 0.01783 -3.917 0.000102 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.053 on 498 degrees of freedom
## Multiple R-squared: 0.02989, Adjusted R-squared: 0.02794
## F-statistic: 15.34 on 1 and 498 DF, p-value: 0.0001022
Construct 95% confidence intervals for both the intercept and slope
Intercept: \((7.26311-1.96*0.38665, 7.26311+1.96*0.38665) = (6.505276, 8.020944)\)
slope: \((-0.06986-1.96*0.01783, -0.06986+1.96*0.01783) = (-0.1048068, -0.0349132)\)
Both the confidence intervals for the intercept and slope do not include 0. This implies that if we draw 100 samples from the population, 95% of the confidence intervals we construct will encompass the true parameters, indicating that both of these true parameters are significantly different from 0.
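As a quick check (using the built-in helper rather than the manual formula above), confint() reproduces these intervals directly from the fitted model; it uses the t rather than the z critical value, so the bounds are very close to, but not exactly, the manual ones:

## 95% confidence intervals for the intercept and slope from the fitted model
confint(mod, level = 0.95)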
Perform two-tailed hypothesis tests for both the intercept and slope
\(H_0: \beta_0 = 0\)
\(H_1: \beta_0 \neq 0\)
\(t=\frac{b_0-\beta_0}{SE(b_0)}=\frac{b_0-0}{SE(b_0)}=\frac{7.26311-0}{0.38665}=18.78471\)
p-value \(= 2*Pr(t \geq 18.78471)=6.57319e-60 < 0.05\)
# find p-value
2*pt(q = 18.78471, df = 498, lower.tail = FALSE)
## [1] 6.57319e-60
As the p-value is smaller than 0.05, we can reject the null hypothesis and conclude that the intercept is significantly different from 0.
\(H_0: \beta_1 = 0\)
\(H_1: \beta_1 \neq 0\)
\(t=\frac{b_1-\beta_1}{SE(b_1)}=\frac{b_1-0}{SE(b_1)}=\frac{-0.06986-0}{0.01783}=-3.918116\)
p-value \(= 2*Pr(t \leq -3.918116)=0.0001017134 < 0.05\)
# find p-value
2*pt(q = -3.918116, df = 498, lower.tail = TRUE)
## [1] 0.0001017134
As the p-value is smaller than 0.05, we can reject the null hypothesis and conclude that there is a negative relationship between coffee spending and hours of sleep every night for UC Davis graduate students.
ggplot(pop[samp,], aes(x = coffee_spending, y = sleep)) +
## Specify the random sample within the bracket of pop and remember to put it on the row position!
geom_point(color = "black", fill = "white",
stroke = 1, shape = 1) +
geom_abline(slope = -0.06986, intercept = 7.26311,
color = 'blue', lwd = 1.5, alpha = 0.7) +
xlab("Money Spend on Coffee") +
ylab("Hours of Sleep Every Night") +
theme_bw()
In the above exercise, we drew only one sample from the population. Now, let’s repeatedly draw samples 1000 times to simulate the sampling distribution of the intercept and slope. This will help us verify if we reach similar conclusions as we did when constructing confidence intervals and performing hypothesis tests.
Simulate Sampling Distribution of the Intercept
set.seed(1234)
par.est <- data.frame(est_intercept = NA,
est_slope = NA)
for (i in 1:1000){
samp <- sample(1:nrow(pop), 500)
mod <- lm(sleep ~ coffee_spending, data = pop[samp,])
par.est[i, 1] <- mod$coef[1]
par.est[i, 2] <- mod$coef[2]
}
head(par.est)
## est_intercept est_slope
## 1 7.977923 -0.10320265
## 2 8.107999 -0.10914730
## 3 7.766414 -0.08499736
## 4 7.760085 -0.08567993
## 5 8.376802 -0.12510784
## 6 6.947270 -0.04736407
Let’s plot the distribution of our 1000 estimated intercepts and slopes and compare that with the sampling distribution under the null hypothesis.
What is the sampling distribution under the null hypothesis? Remember that we know the DGP of the population, so we actually know the exact standard error of both the population intercept and slope.
mod_pop <- lm(sleep ~ coffee_spending, data = pop)
summary(mod_pop)
##
## Call:
## lm(formula = sleep ~ coffee_spending, data = pop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.4478 -2.0067 -0.0222 2.0426 11.3089
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.927428 0.080939 97.94 <2e-16 ***
## coffee_spending -0.097735 0.003761 -25.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.005 on 9998 degrees of freedom
## Multiple R-squared: 0.06326, Adjusted R-squared: 0.06317
## F-statistic: 675.2 on 1 and 9998 DF, p-value: < 2.2e-16
That being said, the sampling distribution under the null hypothesis for the intercept follows a normal distribution with a mean equal to 0 and a standard deviation equal to 0.080939.
int_null <- data.frame(x = seq(-1, 10, 0.01),
y = dnorm(seq(-1, 10, 0.01),
mean = 0, sd = 0.080939))
ggplot() +
geom_line(aes(x = int_null$x, y = int_null$y)) +
geom_histogram(aes(x = par.est$est_intercept, y = ..density..),
color = "black", fill = "gray85", bins = 80) +
geom_vline(xintercept = mean(par.est$est_intercept),
linetype = "dashed", color = "red", alpha = 0.5, lwd = 0.8) +
geom_vline(xintercept = 7.26311, linetype = "dashed",
color = "blue", alpha = 0.5, lwd = 0.8) +
annotate("text", x = 9.5, y = 3,
label = "mean of\nestimated\nintercepts", color = "red") +
annotate("text", x = 6.5, y = 3,
label = "one time\nestimated\nintercept", color = "blue") +
annotate("text", x = 1.5, y = 3,
label = "sampling\ndistribution\nunder the null", color = "black") +
scale_x_continuous(name = "Estimated Intercept", breaks = seq(-1, 10, 1)) +
labs(y = "Density", title = "1000 Samples with n=500") +
theme_bw() +
theme(panel.grid.minor = element_blank())
Simulate Sampling Distribution of the Slope
The sampling distribution under the null hypothesis for the slope follows a normal distribution with a mean equal to 0 and a standard deviation equal to 0.003761.
slo_null <- data.frame(x = seq(-0.15, 0.05, 0.001),
y = dnorm(seq(-0.15, 0.05, 0.001),
mean = 0, sd = 0.003761))
ggplot() +
geom_line(aes(x = slo_null$x, y = slo_null$y)) +
geom_histogram(aes(x = par.est$est_slope, y = ..density..),
color = "black", fill = "gray85", bins = 80) +
geom_vline(xintercept = mean(par.est$est_slope),
linetype = "dashed", color = "red", alpha = 0.5, lwd = 0.8) +
geom_vline(xintercept = -0.06986, linetype = "dashed",
color = "blue", alpha = 0.5, lwd = 0.8) +
annotate("text", x = -0.12, y = 60,
label = "mean of\nestimated\nislopes", color = "red") +
annotate("text", x = -0.05, y = 60,
label = "one time\nestimated\nslope", color = "blue") +
annotate("text", x = 0.02, y = 60,
label = "sampling\ndistribution\nunder the null", color = "black") +
scale_x_continuous(name = "Estimated Slope", breaks = seq(-0.15, 0.05, 0.02)) +
labs(y = "Density", title = "1000 Samples with n=500") +
theme_bw() +
theme(panel.grid.minor = element_blank())
The central difference between simple and multiple linear regression is that the slope coefficients for the explanatory variables in multiple regression are partial coefficients. That is, each slope represents the “effect” on the response variable of a one-unit increment in the corresponding explanatory variable, holding constant the values of the other explanatory variables.
\(Y_i=\beta_0+\beta_1X_{i1}+\beta_2X_{i2}+...+\beta_kX_{ik}+\varepsilon_i\)
Recall that the goal of the OLS approach is to minimize the discrepancy between our observed \(Y_i\) and estimated \(\hat{Y_i}\). To find a set of estimated \(\beta\)s that can minimize this discrepancy is an optimization problem. All we have to do is to take the partial derivatives of the above equation with respect to each coefficient, set them equal to zero, and then solve it. However, adding more and more \(X\) variables to our linear regression model makes the calculation more and more tedious. In order to find the estimated \(\beta\)s more efficiently, we use matrix algebra to help us do the calculation. POL 213 will show you how to do it in more details. In today’s discussion, we focus on how to do multivariate linear regression in R.
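To give a flavor of what the matrix solution looks like (a minimal sketch with simulated data and a made-up DGP, not the POL 213 derivation), the estimates \(b = (X'X)^{-1}X'Y\) can be computed with solve() and then compared with lm():

## Simulate a small dataset with two explanatory variables (hypothetical DGP)
set.seed(1234)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                 ## design matrix with a column of 1s for the intercept
b <- solve(t(X) %*% X) %*% t(X) %*% y ## (X'X)^(-1) X'Y
b

## The same estimates from lm()
coef(lm(y ~ x1 + x2))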
Assuming we are examining factors that explain U.S. public school expenditures, you can access the dataset `Anscombe.txt` here. This dataset measures U.S. State Public School Expenditures in 1981, and the variables are:

`education`: per capita education expenditures, dollars

`income`: per capita income, dollars

`under18`: proportion under 18 years old, per 1000

`urban`: proportion urban, per 1000

# load data
df <- read.delim("/Users/yu-shiuanhuang/Desktop/method-sequence/data/Anscombe.txt", sep = "")
ols <- lm(education ~ income + under18 + urban, data = df)
summary(ols)
##
## Call:
## lm(formula = education ~ income + under18 + urban, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.240 -15.738 -1.156 15.883 51.380
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.868e+02 6.492e+01 -4.418 5.82e-05 ***
## income 8.065e-02 9.299e-03 8.674 2.56e-11 ***
## under18 8.173e-01 1.598e-01 5.115 5.69e-06 ***
## urban -1.058e-01 3.428e-02 -3.086 0.00339 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.69 on 47 degrees of freedom
## Multiple R-squared: 0.6896, Adjusted R-squared: 0.6698
## F-statistic: 34.81 on 3 and 47 DF, p-value: 5.337e-12
Use the stargazer package to have a nicer output of the regression table.
# install.packages("stargazer")
library(stargazer)
stargazer(ols,
title = "Table 1: Factors Explaining U.S. Public School Expenditures",
dep.var.labels = "Education Expenditure",
covariate.labels = c("Income", "Under 18", "Urban"),
star.cutoffs = c(0.05, 0.01, 0.001),
digits = 2 , type = "text")
##
## Table 1: Factors Explaining U.S. Public School Expenditures
## =================================================
## Dependent variable:
## -----------------------------
## Education Expenditure
## -------------------------------------------------
## Income 0.08***
## (0.01)
##
## Under 18 0.82***
## (0.16)
##
## Urban -0.11**
## (0.03)
##
## Constant -286.84***
## (64.92)
##
## -------------------------------------------------
## Observations 51
## R2 0.69
## Adjusted R2 0.67
## Residual Std. Error 26.69 (df = 47)
## F Statistic 34.81*** (df = 3; 47)
## =================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
The regression output indicates a positive relationship between per capita income and per capita education expenditures, as well as between the proportion under 18 years old and education expenditures. To elaborate, holding all other variables constant, a one-unit increase in per capita income corresponds to a 0.08-unit increase in per capita education expenditures. Similarly, holding all other variables constant, a one-unit increase in the proportion under 18 years old corresponds to a 0.82-unit increase in per capita education expenditures.
Conversely, there is a negative relationship between the proportion urban and education expenditures. A one-unit increase in the proportion urban is associated with a decrease of 0.11 units in per capita education expenditures, while holding all other variables constant.
In today’s discussion, we will focus on some techniques for modeling.
Transforming variables helps us conform to regression assumptions, which amplifies predictive power and increases the overall quality of the model. There are four primary reasons why we might want to transform continuous variables:
To even out the variance of a variable if the assumption of homoscedasticity is violated (residuals should be independent with a mean of zero, constant variance, and randomly scattered).
To normalize a variable if the assumption of normality is violated (by visual inspection, probability plots, or normality tests such as Kolmogorov-Smirnov and Shapiro-Wilk).
To linearize the regression model if it seems individual variables are non-linear.
To reduce the impact of outliers and high-leverage observations.
As we haven’t yet delved into the Gauss-Markov assumptions, the set of assumptions under which the ordinary least squares (OLS) estimator is the best linear unbiased estimator (BLUE), today’s discussion will concentrate solely on points 3 and 4, demonstrating how variable transformation can enhance our model.
In many cases, the primary method of variable transformation is by applying a simple deterministic (non-random) mathematical function to one or more variables that you include in your model. Often, the transformation is a relatively simple algebraic function that is reversible (invertible). For example:
Power functions: \(x^p\) the most common of which is \(\sqrt{x} = x^{\frac{1}{2}}\)
Logarithmic functions: \(log_{10}x\), \(log_e x = ln(x)\)
Multiplicative inverse (reciprocal) functions: \(\frac{1}{x}\)
Variables that have been transformed are sometimes harder to interpret. If the measure is commonly known and understood, such as sales dollars, “the multiplicative inverse of sales dollars” can be more difficult to interpret and communicate.
Making meaningful predictions often means inverting the transformed variables back to their original form, which adds extra steps and increases the complexity of your model.
Oftentimes, selecting the best transformation is a process of trial and error.
Transformation can sometimes overshoot the goal and create issues in the other direction, so always verify and check the result. For example, if we apply too powerful a transformation to a heavily right-skewed variable, the variable may end up left-skewed.
When deciding whether to transform our variables, we should first examine the direction and magnitude of skewness in order to choose the appropriate technique.
For left-skewed variables, a reflection should be performed before transformation to make the distribution right-skewed: \((x_{max}+1)-x\) (add 1 to the maximum value of the original variable, then subtract each original value from this result).
Depending on the severity of skewness, we apply the square root transformation (\(\sqrt{x}\)) for mild skewness, the base-10 logarithm transformation (\(log_{10}x\)) for moderate skewness, or the reciprocal transformation (\(\frac{1}{x}\)) for severe skewness. Why so? Let’s see some examples to get the rationale.
x <- c(1, 3, 10, 19, 31, 132, 1477, 300459, 133000000)
data.frame(
x,
sqrtx = sqrt(x),
logx = log(x),
inversex = 1/x
)
## x sqrtx logx inversex
## 1 1 1.000000 0.000000 1.000000e+00
## 2 3 1.732051 1.098612 3.333333e-01
## 3 10 3.162278 2.302585 1.000000e-01
## 4 19 4.358899 2.944439 5.263158e-02
## 5 31 5.567764 3.433987 3.225806e-02
## 6 132 11.489125 4.882802 7.575758e-03
## 7 1477 38.431758 7.297768 6.770481e-04
## 8 300459 548.141405 12.613067 3.328241e-06
## 9 133000000 11532.562595 18.705860 7.518797e-09
As seen in the table, for large values the reciprocal compresses more strongly than the logarithm, which in turn compresses more strongly than the square root. This ordering corresponds to the severity of skewness each transformation can handle. In heavily right-skewed variables, where large outliers sit in the right tail, taking the reciprocal pulls them much closer to the bulk of the values; for moderate outliers, the square root or logarithm suffices.
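The same logic applies to left-skewed variables once they have been reflected. Here is a minimal sketch with a hypothetical left-skewed vector y (the values are made up for illustration):
# hypothetical left-skewed vector: most values bunch near the maximum
y <- c(2, 50, 80, 90, 95, 97, 98, 99, 100)
# reflect so that the long tail points to the right, then transform as usual
y_reflected <- (max(y) + 1) - y
data.frame(y, y_reflected, sqrt_reflected = sqrt(y_reflected))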
Let’s now use the Boston housing price dataset for some practice. Assuming we are examining factors that explain housing prices in Boston, you can access the dataset Boston.xlsx here.
Below is a description of each variable in the dataset. MEDV is the outcome variable.
MEDV: median value of owner-occupied homes in $1000’s
CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built prior to 1940
DIS: weighted distances to five Boston employment centers
RAD: index of accessibility to radial highways
TAX: full-value property-tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
LSTAT: % lower status of the population
library(readxl)
Boston <- read_excel("/Users/yu-shiuanhuang/Desktop/method-sequence/data/Boston.xlsx")
head(Boston)
## # A tibble: 6 × 14
## CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.00632 18 2.31 0 0.538 6.58 65.2 4.09 1 296 15.3 397.
## 2 0.0273 0 7.07 0 0.469 6.42 78.9 4.97 2 242 17.8 397.
## 3 0.0273 0 7.07 0 0.469 7.18 61.1 4.97 2 242 17.8 393.
## 4 0.0324 0 2.18 0 0.458 7.00 45.8 6.06 3 222 18.7 395.
## 5 0.0690 0 2.18 0 0.458 7.15 54.2 6.06 3 222 18.7 397.
## 6 0.0298 0 2.18 0 0.458 6.43 58.7 6.06 3 222 18.7 394.
## # ℹ 2 more variables: LSTAT <dbl>, MEDV <dbl>
Assume we are interested in the relationships between MEDV and INDUS, MEDV and DIS, and MEDV and CRIM, respectively. Let’s first look at the distribution of each explanatory variable.
library(tidyverse)
p1 <- ggplot(data = Boston, aes(x = INDUS)) +
geom_histogram(aes(y = ..density..), bins = 30, color = "black", fill = "gray85") +
labs(title = "mild skewed") +
theme_bw()
p2 <- ggplot(data = Boston, aes(x = DIS)) +
geom_histogram(aes(y = ..density..), bins = 30, color = "black", fill = "gray85") +
labs(title = "moderate skewed") +
theme_bw()
p3 <- ggplot(data = Boston, aes(x = CRIM)) +
geom_histogram(aes(y = ..density..), bins = 30, color = "black", fill = "gray85") +
labs(title = "severe skewed") +
theme_bw()
ggpubr::ggarrange(p1, p2, p3, ncol = 3, nrow = 1)
Now, let’s create scatterplots to visualize the relationships between MEDV and these variables and assess their linearity.
p1 <- ggplot(data = Boston, aes(x = INDUS, y = MEDV)) +
geom_point(color = "darkgray", shape = 1) +
geom_smooth(method = "lm", color = "dodgerblue2") +
geom_smooth(method = "loess", color = "coral2") +
ylim(c(-20, 50)) +
annotate("text", x = 25, y = 45, label = "OLS", color = "dodgerblue2") +
annotate("text", x = 25, y = 42, label = "LOESS", color = "coral2") +
labs(title = "mild skewed") + xlab("INDUS") + ylab("MEDV") + theme_bw()
p2 <- ggplot(data = Boston, aes(x = DIS, y = MEDV)) +
geom_point(color = "darkgray", shape = 1) +
geom_smooth(method = "lm", color = "dodgerblue2") +
geom_smooth(method = "loess", color = "coral2") +
ylim(c(-20, 50)) +
annotate("text", x = 11, y = 45, label = "OLS", color = "dodgerblue2") +
annotate("text", x = 11, y = 42, label = "LOESS", color = "coral2") +
labs(title = "moderate skewed") + xlab("DIS") + ylab("MEDV") + theme_bw()
p3 <- ggplot(data = Boston, aes(x = CRIM, y = MEDV)) +
geom_point(color = "darkgray", shape = 1) +
geom_smooth(method = "lm", color = "dodgerblue2") +
geom_smooth(method = "loess", color = "coral2") +
ylim(c(-20, 50)) +
annotate("text", x = 80, y = 43, label = "OLS", color = "dodgerblue2") +
annotate("text", x = 80, y = 40, label = "LOESS", color = "coral2") +
labs(title = "severe skewed") + xlab("CRIM") + ylab("MEDV") + theme_bw()
ggpubr::ggarrange(p1, p2, p3, ncol = 3, nrow = 1)
Loess stands for locally estimated scatterplot smoothing (lowess stands for locally weighted scatterplot smoothing) and is one of many non-parametric regression techniques, but arguably the most flexible. A smoothing function is a function that attempts to capture general patterns in X-Y relationships while reducing the noise and it makes minimal assumptions about the relationships among variables. The result of a loess application is a line through the moving central tendency of the X-Y relationship.
Loess is fairly straightforward. A specific width of points along the x-axis (the bandwidth or tension) is selected adjacent to the point being predicted, and a low-degree polynomial (often just linear) is fit through that subset of the data, with more weight given to points closest to the value being predicted. The resulting equation is then used to predict the value for the selected point. The data are then shifted one point to the right and the process continues, with a new prediction for the second point, and so on. The resulting points are then connected with a line. The user can control how wide a band of points is used: the smaller the bandwidth, the fewer points are used and the less smooth the final line. Users can also adjust the type of line-fitting that is used (weighted least squares is the most common) and what types of weights are used.
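If you want to experiment with the bandwidth directly rather than relying on geom_smooth()’s defaults, you can call loess() yourself. Below is a minimal sketch; the span of 0.5 and the linear local fit (degree = 1) are illustrative choices, not values used elsewhere in this discussion.
# fit a loess curve of MEDV on CRIM with a chosen bandwidth (span) and a linear local fit
lo <- loess(MEDV ~ CRIM, data = Boston, span = 0.5, degree = 1)
# predict along a grid of CRIM values to trace the smoothed curve
grid <- data.frame(CRIM = seq(min(Boston$CRIM), max(Boston$CRIM), length.out = 100))
grid$fit <- predict(lo, newdata = grid)
head(grid)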
As evident in the scatterplots above, the associations between MEDV and the three explanatory variables lack linearity. Notably, the OLS regression line and the loess line exhibit significant disparities, particularly in the MEDV-CRIM relationship. This is where variable transformation comes in to help improve our linear models.
Let’s first transform these variables based on the severity of skewness.
p1 <- ggplot(data = Boston, aes(x = INDUS)) +
geom_histogram(aes(y = ..density..), bins = 30, color = "black", fill = "gray85") +
labs(title = "Before Transformation") +
theme_bw()
p2 <- ggplot(data = Boston, aes(x = sqrt(INDUS))) +
geom_histogram(aes(y = ..density..), bins = 30, color = "black", fill = "gray85") +
labs(title = "After Transformation") +
theme_bw()
p3 <- ggplot(data = Boston, aes(x = DIS)) +
geom_histogram(aes(y = ..density..), bins = 30, color = "black", fill = "gray85") +
labs(title = "Before Transformation") +
theme_bw()
p4 <- ggplot(data = Boston, aes(x = log(DIS+1))) +
geom_histogram(aes(y = ..density..), bins = 30, color = "black", fill = "gray85") +
labs(title = "After Transformation") +
theme_bw()
p5 <- ggplot(data = Boston, aes(x = CRIM)) +
geom_histogram(aes(y = ..density..), bins = 30, color = "black", fill = "gray85") +
labs(title = "Before Transformation") +
theme_bw()
p6 <- ggplot(data = Boston, aes(x = (1/(CRIM+1)))) +
geom_histogram(aes(y = ..density..), bins = 30, color = "black", fill = "gray85") +
labs(title = "After Transformation") +
theme_bw()
ggpubr::ggarrange(p1, p2, p3, p4, p5, p6, ncol = 2, nrow = 3)
Post-transformation, the distribution of these variables becomes less skewed. It’s worth noting that for DIS and CRIM, I apply the logarithm and the reciprocal, respectively, while adding 1 to each value so that no value equals 0, since the logarithm and reciprocal of 0 cannot be computed.
Now, let’s replot the scatterplots between MEDV and these transformed explanatory variables.
p1 <- ggplot(data = Boston, aes(x = INDUS, y = MEDV)) +
geom_point(color = "darkgray", shape = 1) +
geom_smooth(method = "lm", color = "dodgerblue2") +
geom_smooth(method = "loess", color = "coral2") +
annotate("text", x = 25, y = 47, label = "OLS", color = "dodgerblue2") +
annotate("text", x = 25, y = 42, label = "LOESS", color = "coral2") +
labs(title = "Before Transformation") + xlab("INDUS") +
ylim(c(0, 50)) +
ylab("MEDV") + theme_bw()
p2 <- ggplot(data = Boston, aes(x = sqrt(INDUS), y = MEDV)) +
geom_point(color = "darkgray", shape = 1) +
geom_smooth(method = "lm", color = "dodgerblue2") +
geom_smooth(method = "loess", color = "coral2") +
annotate("text", x = 4.5, y = 47, label = "OLS", color = "dodgerblue2") +
annotate("text", x = 4.5, y = 42, label = "LOESS", color = "coral2") +
ylim(c(0, 50)) +
labs(title = "After Transfromation") + xlab("INDUS") +
ylab("MEDV") + theme_bw()
p3 <- ggplot(data = Boston, aes(x = DIS, y = MEDV)) +
geom_point(color = "darkgray", shape = 1) +
geom_smooth(method = "lm", color = "dodgerblue2") +
geom_smooth(method = "loess", color = "coral2") +
ylim(c(0, 50)) +
annotate("text", x = 11, y = 47, label = "OLS", color = "dodgerblue2") +
annotate("text", x = 11, y = 42, label = "LOESS", color = "coral2") +
labs(title = "Before Transfromation") + xlab("DIS") +
ylab("MEDV") + theme_bw()
p4 <- ggplot(data = Boston, aes(x = log(DIS+1), y = MEDV)) +
geom_point(color = "darkgray", shape = 1) +
geom_smooth(method = "lm", color = "dodgerblue2") +
geom_smooth(method = "loess", color = "coral2") +
ylim(c(0, 50)) +
annotate("text", x = 2.3, y = 47, label = "OLS", color = "dodgerblue2") +
annotate("text", x = 2.3, y = 42, label = "LOESS", color = "coral2") +
labs(title = "After Transfromation") + xlab("DIS") +
ylab("MEDV") + theme_bw()
p5 <- ggplot(data = Boston, aes(x = CRIM, y = MEDV)) +
geom_point(color = "darkgray", shape = 1) +
geom_smooth(method = "lm", color = "dodgerblue2") +
geom_smooth(method = "loess", color = "coral2") +
ylim(c(-20, 50)) +
annotate("text", x = 80, y = 46, label = "OLS", color = "dodgerblue2") +
annotate("text", x = 80, y = 40, label = "LOESS", color = "coral2") +
labs(title = "Before Transfromation") + xlab("CRIM") +
ylab("MEDV") + theme_bw()
p6 <- ggplot(data = Boston, aes(x = 1/(CRIM+1), y = MEDV)) +
geom_point(color = "darkgray", shape = 1) +
geom_smooth(method = "lm", color = "dodgerblue2") +
geom_smooth(method = "loess", color = "coral2") +
ylim(c(-20, 50)) +
annotate("text", x = 0.9, y = 46, label = "OLS", color = "dodgerblue2") +
annotate("text", x = 0.9, y = 40, label = "LOESS", color = "coral2") +
labs(title = "After Transfromation") + xlab("CRIM") +
ylab("MEDV") + theme_bw()
ggpubr::ggarrange(p1, p2, p3, p4, p5, p6, ncol = 2, nrow = 3)
As evident, linearity improves post-transformation, particularly for CRIM, which is severely skewed. However, it’s crucial to bear in mind the inherent trade-offs associated with variable transformations, which can affect interpretability.
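One way to see the payoff beyond the plots is to compare a simple model of MEDV on the raw predictor with one on the transformed predictor. A minimal sketch for CRIM (the model names are hypothetical):
# compare a model on the raw predictor with one on the transformed predictor
m_raw <- lm(MEDV ~ CRIM, data = Boston)
m_trans <- lm(MEDV ~ I(1/(CRIM + 1)), data = Boston)
# compare the R-squared values; a higher value for m_trans would indicate a better linear fit
summary(m_raw)$r.squared
summary(m_trans)$r.squared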
Note: some of the materials on variable transformation are adapted from here.
There are two primary reasons to rescale our variables: to enhance the interpretability of coefficients and to identify the relative importance of each variable in explaining the outcome of interest.
Assuming we are examining factors that explain U.S. public school expenditures, you can access the dataset Anscombe.txt here. This dataset measures U.S. State Public School Expenditures in 1981, and the variables are:
education = per capita education expenditures, dollars
income = per capita income, dollars
under18 = proportion under 18 years old, per 1000
urban = proportion urban, per 1000
# load data
df <- read.delim("/Users/yu-shiuanhuang/Desktop/method-sequence/data/Anscombe.txt", sep = "")
ols1 <- lm(education ~ income + under18 + urban, data = df)
library(stargazer)
stargazer(ols1,
title = "Table 1: Factors Explaining U.S. Public School Expenditures",
dep.var.labels = "Education Expenditure",
covariate.labels = c("Income", "Under 18", "Urban"),
star.cutoffs = c(0.05, 0.01, 0.001),
digits = 2 , type = "text")
##
## Table 1: Factors Explaining U.S. Public School Expenditures
## =================================================
## Dependent variable:
## -----------------------------
## Education Expenditure
## -------------------------------------------------
## Income 0.08***
## (0.01)
##
## Under 18 0.82***
## (0.16)
##
## Urban -0.11**
## (0.03)
##
## Constant -286.84***
## (64.92)
##
## -------------------------------------------------
## Observations 51
## R2 0.69
## Adjusted R2 0.67
## Residual Std. Error 26.69 (df = 47)
## F Statistic 34.81*** (df = 3; 47)
## =================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
As seen in the regression table, the coefficient for income is significantly positive, implying that, holding all the other variables constant, a one-dollar increase in per capita income corresponds to a 0.08-unit increase in a state’s per capita education expenditures. While reporting this finding on the original scale is acceptable, rescaling the variable can make it clearer and more impactful. For instance, we can divide the income variable by 1,000 (expressing it in thousands of dollars) so that the regression output provides a more tangible sense of the relationship.
df <- df %>% mutate(income1000 = (income/1000))
ols2 <- lm(education ~ income1000 + under18 + urban, data = df)
stargazer(ols1, ols2,
title = "Table 2: Factors Explaining U.S. Public School Expenditures With Rescaled Vriable",
dep.var.labels = "Education Expenditure",
covariate.labels = c("Income", "Income/1000", "Under 18", "Urban"),
star.cutoffs = c(0.05, 0.01, 0.001),
digits = 2 , type = "text")
##
## Table 2: Factors Explaining U.S. Public School Expenditures With Rescaled Variable
## ============================================================
## Dependent variable:
## ------------------------------
## Education Expenditure
## (1) (2)
## ------------------------------------------------------------
## Income 0.08***
## (0.01)
##
## Income/1000 80.65***
## (9.30)
##
## Under 18 0.82*** 0.82***
## (0.16) (0.16)
##
## Urban -0.11** -0.11**
## (0.03) (0.03)
##
## Constant -286.84*** -286.84***
## (64.92) (64.92)
##
## ------------------------------------------------------------
## Observations 51 51
## R2 0.69 0.69
## Adjusted R2 0.67 0.67
## Residual Std. Error (df = 47) 26.69 26.69
## F Statistic (df = 3; 47) 34.81*** 34.81***
## ============================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
After rescaling, notice that a $1,000 increase in per capita income corresponds to an 80.65-unit increase in a state’s per capita education expenditures. This increase equals 1.74 times the standard deviation of the outcome variable. It’s crucial to note that only the coefficient and standard error of income change after rescaling; everything else remains the same. This is because we only changed the unit of measurement of income.
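A quick check makes this concrete: the coefficient on income1000 is simply the original income coefficient multiplied by 1,000.
coef(ols1)["income"] * 1000 # original coefficient, re-expressed per $1,000 of income
coef(ols2)["income1000"] # coefficient from the rescaled model; the two match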
Having chosen a regression model with statistically significant independent variables, interpreting the results reveals the relationship between these variables and the dependent variable. Now, a natural question arises: “Which independent variable is the most important?”
Determining the most important variable is more complex than it initially seems. Firstly, you must define what “most important” means, considering the subject area and goals of your regression model. There is no one-size-fits-all definition, and the methods used for data collection and measurement can influence the perceived importance of independent variables.
The standard regression coefficients in your statistical output illustrate the relationship between the independent variables and the dependent variable. Each coefficient represents the mean change in the dependent variable for a one-unit shift in the corresponding independent variable. While larger coefficients imply a larger change, comparing absolute sizes alone may not accurately identify the most important variable. This is particularly true when the independent variables have different units, rendering direct comparisons of coefficients meaningless, such as when measuring time, pressure, and temperature.
While the coefficient can’t determine the importance of an independent variable, what about the variable’s p-value? Comparing p-values may seem logical, since we use them to select variables for the model. But do lower p-values signify more important variables? Unfortunately, p-values don’t measure importance; they reflect several properties of the estimate at once. A very small p-value doesn’t necessarily indicate practical importance. Factors like a precise estimate, low variability, or a large sample size can yield a tiny p-value for an independent variable. Hence, when evaluating statistical results, it’s crucial to assess both the statistical and practical significance of effect sizes.
So how can we compare the importance of independent variables in our model? Use standardized coefficients!
As previously mentioned, comparing regular regression coefficients is challenging because of their different scales. Standardized coefficients, however, share the same scale and enable meaningful comparisons. Standardizing involves subtracting the variable’s mean from each observed value and dividing by the variable’s standard deviation (similar to calculating z-values in statistics). This process puts every variable on a consistent scale for comparison.
\[ z = \frac{x_i-\bar{x}}{sd_x} \]
We then use standardized coefficients by running the regression model with standardized independent variables. These coefficients, on the same scale, show how much the dependent variable changes with a one standard deviation shift in an independent variable. They act as standardized effect sizes, revealing the relationship strength without relying on original data units. Measured in standard deviations, these effects help grasp the practical significance of the findings.
Let’s apply this technique to examine which factor explains U.S. public school expenditures the most. You can achieve standardization using the scale() function or compute it manually.
df <- df %>% mutate(income_s = scale(income),
under18_s = scale(under18),
urban_s = scale(urban))
ols3 <- lm(education ~ income_s + under18_s + urban_s, data = df)
stargazer(ols1, ols3,
title = "Table 3: Factors Explaining U.S. Public School Expenditures With Standardized Vriables",
dep.var.labels = "Education Expenditure",
covariate.labels = c("Income", "Under 18", "Urban",
"Income (standardized)", "Under 18 (standardized)", "Urban (standardized)"),
star.cutoffs = c(0.05, 0.01, 0.001),
digits = 2 , type = "text")
##
## Table 3: Factors Explaining U.S. Public School Expenditures With Standardized Variables
## ============================================================
## Dependent variable:
## ------------------------------
## Education Expenditure
## (1) (2)
## ------------------------------------------------------------
## Income 0.08***
## (0.01)
##
## Under 18 0.82***
## (0.16)
##
## Urban -0.11**
## (0.03)
##
## Income (standardized) 45.17***
## (5.21)
##
## Under 18 (standardized) 19.58***
## (3.83)
##
## Urban (standardized) -16.01**
## (5.19)
##
## Constant -286.84*** 196.31***
## (64.92) (3.74)
##
## ------------------------------------------------------------
## Observations 51 51
## R2 0.69 0.69
## Adjusted R2 0.67 0.67
## Residual Std. Error (df = 47) 26.69 26.69
## F Statistic (df = 3; 47) 34.81*** 34.81***
## ============================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
After standardizing each independent variable, we observe that income has the largest effect size. Holding all other variables constant, a one standard deviation increase in income is associated with a 45.17-unit increase in a state’s per capita education expenditures. This differs from what we would infer using the original scale of the variables to fit the model. Importantly, because we have now rescaled the variables not only in their unit of measurement but also in their range, the estimated intercept differs from the original model. With all independent variables standardized, the intercept indicates that the baseline per capita education expenditure is 196.31 when all standardized variables are at their means, which is 0.
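As a check on the scale() approach, manually z-scoring income (subtracting its mean and dividing by its standard deviation) yields the same standardized coefficient. A minimal sketch; income_z and ols3b are hypothetical names:
df <- df %>% mutate(income_z = (income - mean(income)) / sd(income))
ols3b <- lm(education ~ income_z + under18_s + urban_s, data = df)
coef(ols3b)["income_z"] # should match the 45.17 estimate for Income (standardized) in Table 3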
A summated rating scale is a type of assessment instrument comprising a series of statements measuring the same construct or variable, to which respondents indicate their degree of agreement or disagreement. The number of response options for each item varies, often from 5 to 7 points (e.g., from strongly agree to strongly disagree). The response values for individual items may be summed to obtain a total or average score that reflects a person’s general attitude toward the construct of interest.
For example, the table below is a Likert scale assessing how strong an individual’s partisan identity is (from strongly disagree with each statement to strongly agree).
| | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| When I speak about this party, I usually say “we” instead of “they.” | | | | | |
| I am interested in what other people think about this party. | | | | | |
| When people criticize this party, it feels like a personal insult. | | | | | |
| I have a lot in common with other supporters of this party. | | | | | |
| If this party does badly in opinion polls, my day is ruined. | | | | | |
| When I meet someone who supports this party, I feel connected with this person. | | | | | |
| When I speak about this party, I refer to them as “my party.” | | | | | |
| When people praise this party, it makes me feel good. | | | | | |
Since these eight statements all measure one’s level of partisan identity, instead of including all the variables based on these statements separately, we can use a summated rating scale to combine the information into one aggregated variable.
But before applying a summated rating scale, we should first check the reliability of the scale to see whether this set of items provides a consistent measurement of the concept. In the lecture, Chris suggested that we can use Cronbach’s alpha to assess such consistency. The higher the \(\alpha\) coefficient, the more the items share covariance and probably measure the same underlying concept. Here is the rule of thumb:
| Cronbach’s alpha | Internal consistency |
|---|---|
| \(\alpha \geq 0.9\) | Excellent |
| \(0.9 > \alpha \geq 0.8\) | Good |
| \(0.8 > \alpha \geq 0.7\) | Acceptable |
| \(0.7 > \alpha \geq 0.6\) | Questionable |
| \(0.6 > \alpha \geq 0.5\) | Poor |
| \(0.5 > \alpha\) | Unacceptable |
The \(\alpha\) coefficient can be computed using the alpha() function in the psych package.
Here I provide an example of constructing a summated rating scale of partisan social identity using Bankert, Huddy, and Rosema’s (2016) data, based on the questions outlined above.
pid <- read_csv("/Users/yu-shiuanhuang/Desktop/method-sequence/data/pid.csv")
head(pid)
## # A tibble: 6 × 8
## V1 V2 V3 V4 V5 V6 V7 V8
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 3 1 3 1 2 1 2
## 2 3 3 3 3 2 3 3 3
## 3 3 2 3 3 3 3 2 3
## 4 2 2 2 2 2 2 2 2
## 5 2 3 2 3 2 3 2 3
## 6 2 3 1 3 2 3 2 3
First, assess how consistently these eight variables/questions measure the concept of partisan social identity.
#install.packages("psych")
library(psych)
alpha(na.omit(pid))
##
## Reliability analysis
## Call: alpha(x = na.omit(pid))
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.9 0.9 0.9 0.53 9 0.0014 1.9 0.63 0.52
##
## 95% confidence boundaries
## lower alpha upper
## Feldt 0.9 0.9 0.9
## Duhachek 0.9 0.9 0.9
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## V1 0.89 0.89 0.88 0.54 8.1 0.0016 0.0048 0.54
## V2 0.89 0.89 0.89 0.55 8.4 0.0015 0.0056 0.53
## V3 0.89 0.89 0.88 0.53 7.8 0.0016 0.0070 0.52
## V4 0.89 0.89 0.88 0.53 7.9 0.0016 0.0062 0.52
## V5 0.89 0.89 0.89 0.55 8.4 0.0016 0.0050 0.54
## V6 0.88 0.88 0.87 0.51 7.3 0.0017 0.0052 0.52
## V7 0.88 0.88 0.87 0.52 7.5 0.0017 0.0052 0.52
## V8 0.88 0.88 0.88 0.52 7.5 0.0017 0.0060 0.52
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## V1 10974 0.75 0.74 0.70 0.65 1.6 0.86
## V2 10974 0.72 0.71 0.65 0.62 2.4 0.85
## V3 10974 0.77 0.78 0.73 0.69 1.6 0.78
## V4 10974 0.76 0.76 0.71 0.68 2.4 0.82
## V5 10974 0.69 0.71 0.64 0.61 1.6 0.70
## V6 10974 0.83 0.83 0.81 0.76 2.1 0.84
## V7 10974 0.81 0.81 0.79 0.74 1.6 0.81
## V8 10974 0.81 0.80 0.77 0.73 2.2 0.86
##
## Non missing response frequency for each item
## 1 2 3 4 miss
## V1 0.56 0.27 0.11 0.05 0
## V2 0.17 0.38 0.37 0.08 0
## V3 0.52 0.34 0.11 0.03 0
## V4 0.16 0.37 0.41 0.06 0
## V5 0.55 0.36 0.08 0.01 0
## V6 0.25 0.40 0.30 0.04 0
## V7 0.57 0.30 0.10 0.03 0
## V8 0.23 0.39 0.32 0.06 0
The Cronbach’s alpha test indicates high internal consistency among these eight items, with an \(\alpha\) of approximately \(0.90\). Given this reliability, we can create the summated rating scale of partisan social identity by simply averaging these eight variables, as they are all on the same scale. Note that if the variables you aggregate are on different scales, ensure you rescale them to a common scale before conducting the Cronbach’s alpha test and generating the summated index, as sketched below.
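For illustration, here is a minimal sketch of such a rescaling (min-max to a 0-1 range); rescale01 and pid_rescaled are hypothetical names, and this step is not needed here since the pid items already share the same scale. The index itself is computed from the original items right after.
# hypothetical helper: rescale a numeric vector to the 0-1 range
rescale01 <- function(v) (v - min(v, na.rm = TRUE)) / (max(v, na.rm = TRUE) - min(v, na.rm = TRUE))
# apply it to every item before averaging (shown only for illustration)
pid_rescaled <- as.data.frame(lapply(pid[, paste0("V", 1:8)], rescale01))
summary(pid_rescaled$V1)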
pid$psi <- (pid$V1 + pid$V2 + pid$V3 + pid$V4 + pid$V5 + pid$V6 + pid$V7 + pid$V8)/8
summary(pid$psi)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 1.500 1.875 1.943 2.375 4.000 883
The partisan social identity index (psi) we created ranges from 1 to 4. The higher the value, the stronger a person feels that the party they are affiliated with represents one of their social identities.
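Equivalently, since all eight items share the same scale, rowMeans() produces the same index in one step; a minimal sketch (psi2 is a hypothetical name):
# average the eight items row-wise; NAs propagate, just as in the sum-and-divide version
pid$psi2 <- rowMeans(pid[, paste0("V", 1:8)])
all.equal(pid$psi, pid$psi2)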
In today’s discussion, we will cover how to decompose a linear regression and how to use our estimated linear regression line to do prediction.
The Frisch-Waugh-Lovell Theorem (FWL; after the initial proof by Frisch and Waugh (1933), and later generalisation by Lovell (1963)) states that:
Any predictor’s regression coefficient in a multivariate model is equivalent to the regression coefficient estimated from a bivariate model in which the residualized outcome is regressed on the residualized component of the predictor; where the residuals are taken from models regressing the outcome and the predictor on all other predictors in the multivariate regression (separately).
Simply put, the FWL Theorem shows how to decompose a regression of \(Y\) on a set of variables \(X\) into two pieces. The coefficient of one of the variables in a multiple linear regression model can be obtained by netting out the effect of the other variables in the model from both the dependent variable and that independent variable.
Why is it important to know the FWL Theorem? It shows that we must partial one set of covariates out of the other in order to obtain a ceteris paribus effect. For example, in order to get the effect of \(X_1\) on \(Y\), \(X_2\) needs to be held constant, which involves removing the portion of \(X_1\) that is correlated with \(X_2\). Let’s understand it through the regression formula.
\(Y = a + b_1X_1 + b_2X_2 + e\) (assume this is the estimated regression line we have)
The value of \(b_1\) can be obtained by:
netting off the effect of \(X_2\) from \(Y\); and
\(Y = a + b_2X_2 + v_1\)
You can consider the residual \(v_1\) as the variation in \(Y\) that cannot be explained by \(X_2\).
netting off the effect of \(X_2\) from \(X_1\)
\(X_1 = a + b_2X_2 + v_2\)
You can consider the residual \(v_2\) as the variation in \(X_1\) that cannot be explained by \(X_2\).
then, we regress the variation in \(Y\) that cannot be explained by \(X_2\) on the variation in \(X_1\) that cannot be explained by \(X_2\).
\(v_1 = a' + kv_2 + e\)
By doing this, the \(k\) we estimate will be the same as the \(b_1\) we estimate by regressing \(Y\) on both \(X_1\) and \(X_2\) at the same time.
Let’s use the Auto dataset as an example to illustrate the FWL Theorem in practice.
library(ISLR2)
head(Auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
Here is the description of each variable in this dataset:
mpg: miles per gallon
cylinders: number of cylinders between 4 and 8
displacement: engine displacement (cu. inches)
horsepower: engine horsepower
weight: vehicle weight (lbs.)
acceleration: time to accelerate from 0 to 60 mph (sec.)
year: model year (modulo 100)
origin: origin of car (1. American, 2. European, 3. Japanese)
name: vehicle name
Assume we are investigating how the miles per gallon of a car can be explained by its origin (whether it is a U.S. or a foreign car) and its weight. That said, mpg is the outcome variable, while origin and weight are the explanatory variables.
Before beginning the analysis, let’s first create a new dummy variable named foreign based on origin to classify these cars as either foreign or U.S. cars.
Auto$foreign <- ifelse(Auto$origin == 1, 0, 1)
I will use foreign as my \(X_1\) variable and weight as the \(X_2\) variable. (\(X_1\) and \(X_2\) can include multiple variables, but the illustration will be clearer with just one each.) Let’s first run a regression of \(Y\), which is mpg, on both \(X_1\) and \(X_2\).
fit <- lm(mpg ~ foreign + weight, data = Auto)
stargazer::stargazer(fit, star.cutoffs = c(0.05, 0.01, 0.001),
digits = 3 , type = "text", omit.stat = c("F"))
##
## =================================================
## Dependent variable:
## -----------------------------
## mpg
## -------------------------------------------------
## foreign 1.638**
## (0.560)
##
## weight -0.007***
## (0.0003)
##
## Constant 43.929***
## (1.112)
##
## -------------------------------------------------
## Observations 392
## R2 0.699
## Adjusted R2 0.698
## Residual Std. Error 4.291 (df = 389)
## =================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
Now let’s start decomposing our regression following the FWL Theorem:
Step 1: Regress \(Y\) on \(X_2\) and take the residuals
fit1 <- lm(mpg ~ weight, data = Auto)
v1 <- fit1$residuals
Step 2: Regress \(X_1\) on \(X_2\) and take the residuals
fit2 <- lm(foreign ~ weight, data = Auto)
v2 <- fit2$residuals
Step 3: Regress the residuals from Step 1 on the residuals from Step 2
fit3 <- lm(v1 ~ v2)
stargazer::stargazer(fit, fit1, fit2, fit3,
star.cutoffs = c(0.05, 0.01, 0.001),
digits = 3 , type = "text", omit.stat = c("F"))
##
## =======================================================================================
## Dependent variable:
## -------------------------------------------------------------------
## mpg foreign v1
## (1) (2) (3) (4)
## ---------------------------------------------------------------------------------------
## foreign 1.638**
## (0.560)
##
## weight -0.007*** -0.008*** -0.0003***
## (0.0003) (0.0003) (0.00002)
##
## v2 1.638**
## (0.559)
##
## Constant 43.929*** 46.217*** 1.396*** -0.000
## (1.112) (0.799) (0.072) (0.216)
##
## ---------------------------------------------------------------------------------------
## Observations 392 392 392 392
## R2 0.699 0.693 0.361 0.022
## Adjusted R2 0.698 0.692 0.360 0.019
## Residual Std. Error 4.291 (df = 389) 4.333 (df = 390) 0.388 (df = 390) 4.286 (df = 390)
## =======================================================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
Notice that in the Step 3 regression, the coefficient on v2 is the same as the coefficient on foreign in the original regression. Why does it work? By netting off both the effect of weight on mpg and the effect of weight on foreign, we actually eliminate the correlation between foreign and weight.
First, let’s check how correlated foreign and weight are.
cor(Auto$foreign, Auto$weight)
## [1] -0.6009783
What happens after we eliminate the correlation between foreign and weight?
cor(Auto$weight, v2)
## [1] -3.310067e-17
Recall that the residual \(v_2\) can be seen as the variation in foreign that cannot be explained by weight. As shown above, \(v_2\) is not correlated with weight; the correlation coefficient is very close to 0. Now we can use \(v_2\) to check how much of the variation in \(v_1\), which is the variation in mpg that cannot be explained by weight, can be explained by \(v_2\).
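A direct numerical check confirms the theorem with the fitted objects we already have:
coef(fit)["foreign"] # coefficient on foreign from the full regression
coef(fit3)["v2"] # coefficient on v2 from the Step 3 regression
all.equal(unname(coef(fit)["foreign"]), unname(coef(fit3)["v2"])) # should be TRUE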
If you’d like to understand the FWL Theorem in more detail, please see here.
One of the key strengths of linear regression is its capability to make forecasts based on the relationships identified in the data. Here, I demonstrate how to make these predictions manually and using built-in functions in R.
The data we are using today is ANES 2022 pilot data, which can be downloaded here. The codebook can be downloaded here.
Assume we study whether those who are more ideologically conservative rate Donald Trump with higher liking scores.
library(haven) # for importing .dta format data
# load data
anes <- read_dta("/Users/yu-shiuanhuang/Desktop/method-sequence/data/anes_pilot_2022_stata_20221214.dta")
Remember to deal with the missing/NA values in the two variables we are going to analyze: fttrump and libcon. Before doing any analysis, always check the codebook of the dataset first to see what values in each variable you will have to address.
fttrump: How would you rate Donald Trump on a scale of 0-100?
libcon: We hear a lot of talk these days about liberals and conservatives. Here is a seven-point scale on which the political views that people might hold are arranged from extremely liberal to extremely conservative. Where would you place yourself on this scale, or haven’t you thought much about this?
# examine each variable first
table(anes$fttrump) # -7 means the respondent did not answer this question
##
## -7 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 1 360 111 35 29 20 26 16 9 17 9 10 6 8 6 7 25 7 4 5
## 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
## 8 10 3 4 5 4 5 3 2 2 4 11 6 4 8 9 6 8 5 2
## 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58
## 4 12 7 7 6 2 4 9 9 7 12 47 14 12 9 6 17 11 9 4
## 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78
## 5 18 8 6 10 6 10 11 6 8 6 32 6 14 13 8 18 6 7 4
## 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98
## 6 17 6 7 4 12 33 7 13 4 9 22 8 4 13 5 17 9 6 9
## 99 100
## 10 134
table(anes$libcon) # 997 means the respondent did answer the question but picked "haven't thought about it too much"
##
## 1 2 3 4 5 6 7 997
## 122 203 121 367 136 287 135 214
After examining the variables, make sure to recode -7 in fttrump and 997 in libcon to NAs.
# recode missing values
anes$fttrump[anes$fttrump == -7] <- NA
anes$libcon[anes$libcon == 997] <- NA
# examine the variables again just to double check
table(anes$fttrump)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## 360 111 35 29 20 26 16 9 17 9 10 6 8 6 7 25 7 4 5 8
## 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
## 10 3 4 5 4 5 3 2 2 4 11 6 4 8 9 6 8 5 2 4
## 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
## 12 7 7 6 2 4 9 9 7 12 47 14 12 9 6 17 11 9 4 5
## 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
## 18 8 6 10 6 10 11 6 8 6 32 6 14 13 8 18 6 7 4 6
## 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
## 17 6 7 4 12 33 7 13 4 9 22 8 4 13 5 17 9 6 9 10
## 100
## 134
table(anes$libcon)
##
## 1 2 3 4 5 6 7
## 122 203 121 367 136 287 135
Now, let’s run a simple linear regression of fttrump on libcon.
m1 <- lm(fttrump ~ libcon, data = anes)
stargazer::stargazer(m1,
title = "Table 1: Ideology and Feelings toward Trump",
dep.var.labels = "Feelings toward Trump",
covariate.labels = c("Ideology"),
star.cutoffs = c(0.05, 0.01, 0.001),
digits = 2 , type = "text")
##
## Table 1: Ideology and Feelings toward Trump
## =================================================
## Dependent variable:
## -----------------------------
## Feelings toward Trump
## -------------------------------------------------
## Ideology 14.03***
## (0.42)
##
## Constant -19.72***
## (1.93)
##
## -------------------------------------------------
## Observations 1,370
## R2 0.44
## Adjusted R2 0.44
## Residual Std. Error 28.42 (df = 1368)
## F Statistic 1,094.34*** (df = 1; 1368)
## =================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
Now that we have our simple OLS linear regression of fttrump on libcon, we can use this regression line (\(\hat{y} = -19.72 + 14.03x\)) to make predictions and evaluate the uncertainty of those predictions. How do we do it by hand?
Let’s first plot a scatterplot between libcon and fttrump along with the regression line!
ggplot(data = anes, aes(x = libcon, y = fttrump)) +
geom_point(alpha = 0.7, color = "darkgray") +
geom_smooth(method = "lm", color = "coral2", se = FALSE) +
scale_x_continuous(name = "Ideology (libcon)",
breaks = seq(1, 7, 1)) +
ylab("Feelings toward Trump (fttrump)") +
annotate("text", x = 5.5, y = 72,
label = "predicted values on \n the regression line",
color = "coral2") +
theme_bw()
Let’s calculate the predicted values of the regression line by hand!
# what is the intercept and slope?
summary(m1)
##
## Call:
## lm(formula = fttrump ~ libcon, data = anes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -78.480 -21.414 -0.465 20.314 105.695
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.7245 1.9254 -10.24 <2e-16 ***
## libcon 14.0292 0.4241 33.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.42 on 1368 degrees of freedom
## (215 observations deleted due to missingness)
## Multiple R-squared: 0.4444, Adjusted R-squared: 0.444
## F-statistic: 1094 on 1 and 1368 DF, p-value: < 2.2e-16
# we can call out the intercept and slope from the summary table by using
coef(m1)[1] ## intercept
## (Intercept)
## -19.72446
unname(coef(m1)[1])
## [1] -19.72446
coef(m1)[2] ## slope
## libcon
## 14.02919
unname(coef(m1)[2])
## [1] 14.02919
# let's say we'd like to know how much a person would like Trump if she places herself in the middle (4) of the ideology spectrum.
unname(coef(m1)[1]) + (unname(coef(m1)[2])*4) ## the predicted value is 36.3923
## [1] 36.39231
Now, instead of only predicting one value, let’s predict the whole ideology spectrum (1-7)!
# create a data frame to store predicted values
pred.df <- data.frame(libcon = c(1:7)) ## make sure that the variable name in this new created data frame is the same as the variable name you specified in your m1 model
pred.df
## libcon
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## 7 7
# make prediction based on the ideology var in pred.df
pred.df2 <- pred.df %>%
mutate(pred.fttrump = unname(coef(m1)[1]) + (unname(coef(m1)[2])*libcon))
pred.df2
## libcon pred.fttrump
## 1 1 -5.695269
## 2 2 8.333923
## 3 3 22.363115
## 4 4 36.392307
## 5 5 50.421499
## 6 6 64.450691
## 7 7 78.479883
We can also use the predict() function in R to do this more efficiently!
predict(m1, newdata = pred.df) ## these predicted values should be the same as what we did earlier!
## 1 2 3 4 5 6 7
## -5.695269 8.333923 22.363115 36.392307 50.421499 64.450691 78.479883
How confident can we be about our estimates of ideology (libcon)? Calculate confidence intervals for the estimates and the predicted values!
Use the confint() function to find the 95% confidence intervals of your coefficient estimates.
confint(m1, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) -23.50145 -15.94747
## libcon 13.19726 14.86113
summary(m1)
##
## Call:
## lm(formula = fttrump ~ libcon, data = anes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -78.480 -21.414 -0.465 20.314 105.695
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.7245 1.9254 -10.24 <2e-16 ***
## libcon 14.0292 0.4241 33.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.42 on 1368 degrees of freedom
## (215 observations deleted due to missingness)
## Multiple R-squared: 0.4444, Adjusted R-squared: 0.444
## F-statistic: 1094 on 1 and 1368 DF, p-value: < 2.2e-16
Use the predict() function to generate a 95% confidence interval for each predicted value.
# to display the 95% confidence intervals, specify the argument interval = "confidence"
predict(m1, newdata = pred.df, interval = "confidence", level = 0.95)
## fit lwr upr
## 1 -5.695269 -8.727516 -2.663021
## 2 8.333923 5.987091 10.680755
## 3 22.363115 20.572878 24.153352
## 4 36.392307 34.880187 37.904427
## 5 50.421499 48.762495 52.080504
## 6 64.450691 62.305391 66.595992
## 7 78.479883 75.680497 81.279270
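These interval bounds can also be reproduced by hand from the coefficient estimates and their variance-covariance matrix. A minimal sketch for libcon = 4 (x0, fit0, and se0 are hypothetical names):
x0 <- c(1, 4) # intercept term and libcon = 4
fit0 <- sum(x0 * coef(m1)) # point prediction (about 36.39)
se0 <- drop(sqrt(t(x0) %*% vcov(m1) %*% x0)) # standard error of the fitted value
fit0 + c(-1, 1) * qt(0.975, df = df.residual(m1)) * se0 # should match row 4 above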
Plot the 95% confidence interval of our regression line by setting se = TRUE and level = 0.95 in the geom_smooth() function.
ggplot(data = anes, aes(x = libcon, y = fttrump)) +
geom_point(alpha = 0.7, color = "darkgray") +
geom_smooth(method = "lm", color = "coral2", se = TRUE, level = 0.95) +
scale_x_continuous(name = "Ideology (libcon)",
breaks = seq(1, 7, 1)) +
ylab("Feelings toward Trump (fttrump)") +
annotate("text", x = 5.5, y = 72,
label = "predicted values on \n the regression line",
color = "coral2") +
theme_bw()
You can also use the function from the jtools package that Chris introduced in the lecture to plot it quickly. effect_plot() is also built with ggplot2.
jtools::effect_plot(m1, pred = "libcon", interval = TRUE,
plot.points = TRUE, jitter = 0.2, colors = "darkgray",
line.colors = "coral2", x.label = "Ideology (libcon)",
y.label = "Feelings toward Trump (fttrump)") +
annotate("text", x = 5.5, y = 72,
label = "predicted values on \n the regression line",
color = "coral2")
Now assume we study whether those who are more ideologically conservative rate Donald Trump with higher liking scores. However, instead of only including libcon in our model, let’s also control for some demographic variables (aka control variables), including age, gender, race, education level, and party ID.
What kind of control variables should you include? It depends on the research question you are interested in and what the related literature suggests.
Remember to also deal with the missing/NA values first in these control variables!
# likewise, examine each variable first
table(anes$birthyr_dropdown) ## birth year
##
## 1930 1932 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947
## 1 1 1 2 2 3 3 5 7 6 8 4 14 9 15 27
## 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963
## 22 24 28 26 36 31 37 27 24 43 65 39 52 40 40 41
## 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
## 38 27 27 25 18 35 18 12 22 13 21 16 14 13 20 23
## 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995
## 23 21 20 21 14 24 31 27 29 18 20 24 38 19 23 20
## 1996 1997 1998 1999 2000 2001 2002 2003 2004
## 18 25 23 23 29 17 15 24 14
table(anes$gender) ## male: 1; female: 2
##
## 1 2
## 707 878
table(anes$rwh) ## white: 1; non-white: 2
##
## 1 2
## 1262 323
table(anes$educ) ## education level: can be treated as a continuous var
##
## 1 2 3 4 5 6
## 75 445 357 155 363 190
table(anes$pid7) ## party ID
##
## -7 1 2 3 4 5 6 7 8
## 66 349 156 126 245 150 150 281 62
By examining each variable, it seems that we only have to deal with the missing values in pid7 (party ID). In pid7, according to the codebook, -7 means no answer and 8 means not sure, which means we have to first recode these values into NAs.
# recode missing values
anes$pid7[anes$pid7 == -7 | anes$pid7 == 8] <- NA ## "|" means "or"
# examine the variable again just to double check
table(anes$pid7)
##
## 1 2 3 4 5 6 7
## 349 156 126 245 150 150 281
Now, let’s generate control variables!
anes2 <- anes %>%
mutate(age = 2022 - birthyr_dropdown, ## the survey was conducted in 2022
female = ifelse(gender == 2, 1, 0),
white = ifelse(rwh == 1, 1, 0),
dem = ifelse(pid7 < 4, 1, 0),
rep = ifelse(pid7 > 4, 1, 0),
indep = ifelse(pid7 == 4, 1, 0)) ## create party ID dummies
Run a multiple linear regression of fttrump on libcon, controlling for the demographic variables.
m2 <- lm(fttrump ~ libcon + age + female + educ + white + dem + rep,
data = anes2) ## make indep as our reference group for party ID
What if we’d like to compare respondents’ liking of Trump between Democrats and Republicans, rather than between Democrats and Independents and between Republicans and Independents? Just change your reference group to Democrats or Republicans!
m3 <- lm(fttrump ~ libcon + age + female + educ + white + rep + indep,
data = anes2) ## make dem as our reference group for party ID
stargazer(m1, m2, m3,
title = "Table 2: Ideology and Feelings toward Trump",
dep.var.labels = "Feelings toward Trump",
covariate.labels = c("Ideology", "Age", "Female", "Education",
"White", "Democrat", "Republican", "Independent"),
star.cutoffs = c(0.05, 0.01, 0.001),
digits = 2 , type = "text")
##
## Table 2: Ideology and Feelings toward Trump
## ================================================================================================
## Dependent variable:
## ----------------------------------------------------------------------------
## Feelings toward Trump
## (1) (2) (3)
## ------------------------------------------------------------------------------------------------
## Ideology 14.03*** 7.38*** 7.38***
## (0.42) (0.58) (0.58)
##
## Age -0.17*** -0.17***
## (0.04) (0.04)
##
## Female -3.35* -3.35*
## (1.43) (1.43)
##
## Education -2.24*** -2.24***
## (0.48) (0.48)
##
## White 0.45 0.45
## (1.90) (1.90)
##
## Democrat -15.14***
## (2.33)
##
## Republican 21.43*** 36.57***
## (2.46) (2.28)
##
## Independent 15.14***
## (2.33)
##
## Constant -19.72*** 24.51*** 9.37*
## (1.93) (4.18) (3.65)
##
## ------------------------------------------------------------------------------------------------
## Observations 1,370 1,295 1,295
## R2 0.44 0.56 0.56
## Adjusted R2 0.44 0.56 0.56
## Residual Std. Error 28.42 (df = 1368) 25.48 (df = 1287) 25.48 (df = 1287)
## F Statistic 1,094.34*** (df = 1; 1368) 234.90*** (df = 7; 1287) 234.90*** (df = 7; 1287)
## ================================================================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
Table 2 shows that, holding the other control variables constant, a respondent who is one unit more ideologically conservative is predicted to rate Trump 7.38 units higher, a statistically significant difference. Why does the coefficient of ideology (libcon) become smaller after we control for demographic variables?
How to plot the regression line with control variables?
Recall that our research question is “whether those who are more ideologically conservative rate Donald Trump with higher liking scores.” This means that we are only interested in the relationship between one’s ideology and evaluation of Trump. Therefore, when we present the multiple linear regression line, we should hold the other control variables at constant values so as to show how a one-unit increment in our key independent variable (i.e., ideology (libcon)) affects the dependent variable (i.e., feelings toward Trump (fttrump)).
# create a data frame to store predicted values
libcon <- seq(1, 7, 0.01)
pred.dfm <- data.frame(libcon = libcon,
age = rep(51, length(libcon)), # set age at the mean
female = rep(1, length(libcon)), # female
educ = rep(4, length(libcon)), # 2-year college
white = rep(1, length(libcon)), # white
dem = rep(1, length(libcon)), # democrat
rep = rep(0, length(libcon))) # non-republican
# use predict() to generate predicted values
pred <- data.frame(libcon = libcon,
predict(m2, newdata = pred.dfm,
interval = "confidence", level = 0.95))
head(pred)
## libcon fit lwr upr
## 1 1.00 -3.794171 -7.017631 -0.5707099
## 2 1.01 -3.720327 -6.936792 -0.5038634
## 3 1.02 -3.646484 -6.855977 -0.4369918
## 4 1.03 -3.572641 -6.775188 -0.3700951
## 5 1.04 -3.498798 -6.694424 -0.3031729
## 6 1.05 -3.424955 -6.613685 -0.2362252
Plot the regression line with control variables (this line should be less steep than the simple regression line we plotted earlier).
ggplot() +
geom_point(data = anes, aes(x = libcon, y = fttrump),
alpha = 0.7, color = "darkgray") +
geom_errorbar(data = pred, aes(x = libcon, ymin = lwr, ymax = upr),
position = position_dodge(.9),
color = "gray", alpha = 0.5) +
geom_line(data = pred, aes(x = libcon, y = fit), color = "dodgerblue2") +
scale_x_continuous(name = "Ideology (libcon)",
breaks = seq(1, 7, 1)) +
ylab("Feelings toward Trump (fttrump)") +
annotate("text", x = 5.5, y = 60,
label = "predicted values on \n the regression line \n when holding all other \n control variables at constant",
color = "dodgerblue2") +
theme_bw()
The Gauss-Markov assumptions are necessary to ensure that OLS estimates have desirable properties. What are these assumptions? Rank these assumptions in order of their importance in terms of how violations would affect substantive inferences and our confidence in them. Justify your choice of ordering based on these potential effects on the properties of OLS estimates.
Use a Monte Carlo simulation approach to generate a dataset (with 100 observations, one y response variable, four X covariate predictors, and an error term) that possesses at least one of the following properties:
Provide the R code you use to generate the dataset, the OLS results from a “naïve” model of the data, and the output from any diagnostics you use to search for the problem.
Load the dataset midtermdata.csv. The data include measures of politicalinterest (a continuous variable representing individuals’ interest in news and politics), age (in years), education (a four-level factor variable), income (annual household income in thousands of $), female (a binary indicator variable), and treatment (a binary indicator variable representing whether the individual received treatment; in this case, whether they were selected to receive a series of nonpartisan political information pamphlets and booklets in the mail over the past year).
politicalinterest is a function of the other five covariates.
treatment.
treatment and the covariates? If so, identify and interpret.
education is recoded as a continuous (1-4) variable? How does this change the interpretation on the coefficient from when education was coded as a factor variable?
was coded as a factor variable?In today’s discussion, we will revisit the concepts of interaction and explore why it’s essential to incorporate interactions in a regression model. We will also engage in practical exercises to implement interaction models in R and conduct diagnostics to ensure that our interaction models adhere to assumptions. Additionally, we will review the application of tree-based methods to analyze our data and determine whether the inclusion of interactions is warranted.
An interaction occurs when an independent variable has a different effect on the outcome depending on the values of another independent variable.
Let’s look at some simulated examples. Suppose we investigate whether citizens are more inclined to trust information shared by their affiliated political party without any skepticism when they have higher levels of affective polarization, the extent of citizens’ liking and trusting their preferred party’s elites and supporters while disliking and distrusting those of opposing parties. In this case, affective polarization serves as the independent variable, while the level of trust in information shared by the in-party represents the dependent variable. Below, I outline the underlying data generating process (DGP) for this relationship in different scenarios. Here are the variables we will use for the simulation:
Levels of affective polarization (affpol) is a continuous variable which follows a normal distribution with a mean of 5 and a standard deviation of 1.5.
Political knowledge (pknow) is a dummy variable which follows a binomial distribution with one trial and a probability of 0.6 of having higher political knowledge.
One’s level of trust in information shared by the in-party (trust) is a continuous variable determined by both one’s level of affective polarization and political knowledge.
set.seed(1234)
affpol <- rnorm(1000, mean = 5, sd = 1.5)
pknow <- rbinom(1000, size = 1, prob = 0.6)
trust <- 2 + 3*affpol + 0*pknow + 0*affpol*pknow + rnorm(1000, mean = 3, sd = 6)
df <- data.frame(affpol, pknow, trust)
mod1 <- lm(trust ~ affpol, data = df)
p1 <- ggplot(data = df, aes(x = affpol, y = trust)) +
geom_point(alpha = 0.4, color = "darkgray") +
geom_smooth(method = "lm", color = "grey30", se = TRUE, level = 0.95) +
scale_x_continuous(name = "Affective Polarization", breaks = seq(0, 10, 1)) +
ylab("Level of Trust in Information Shared by the In-party") +
labs(title = "Scenario 1:", subtitle = "No effect of political knowledge") +
theme_bw()
p1
In Scenario 1, we show a standard simple linear model. However, let’s consider the scenario where we anticipate individuals with lower political knowledge to respond at a generally higher level compared to those with higher political knowledge. There are several ways this can manifest. For instance, if the discrepancy in trust response between those with lower and higher political knowledge remains consistent across different levels of affective polarization, we would anticipate a graph resembling the following:
set.seed(1234)
affpol <- rnorm(1000, mean = 5, sd = 1.5)
pknow <- rbinom(1000, size = 1, prob = 0.6)
trust <- 2 + 3*affpol - 5*pknow + rnorm(1000, mean = 3, sd = 6)
df <- data.frame(affpol, pknow, trust)
mod2 <- lm(trust ~ affpol + pknow, data = df)
p2 <- ggplot() +
geom_point(data = df[pknow == 1,], aes(x = affpol, y = trust), alpha = 0.2, color = "coral2") +
geom_smooth(data = df[pknow == 1,], aes(x = affpol, y = trust), method = "lm", color = "coral2", se = TRUE, level = 0.95) +
geom_point(data = df[pknow == 0,], aes(x = affpol, y = trust), alpha = 0.2, color = "dodgerblue2") +
geom_smooth(data = df[pknow == 0,], aes(x = affpol, y = trust), method = "lm", color = "dodgerblue2", se = TRUE, level = 0.95) +
scale_x_continuous(name = "Affective Polarization", breaks = seq(0, 10, 1)) +
ylab("Level of Trust in Information Shared by the In-party") +
labs(title = "Scenario 2:", subtitle = "Political knowledge has an effect, but no interaction") +
annotate("text", x = 3, y = 40, label = "lower political knowledge", color = "dodgerblue2") +
annotate("text", x = 3, y = 36, label = "higher political knowledge", color = "coral2") +
theme_bw()
p2
However, if those with lower political knowledge have a steeper trust response curve compared to those with higher political knowledge, we would expect a picture like this:
set.seed(1234)
affpol <- rnorm(1000, mean = 5, sd = 1.5)
pknow <- rbinom(1000, size = 1, prob = 0.6)
trust <- 2 + 3*affpol - 5*pknow - 2*affpol*pknow + rnorm(1000, mean = 3, sd = 6)
df <- data.frame(affpol, pknow, trust)
mod3 <- lm(trust ~ affpol + pknow + affpol*pknow, data = df)
p3 <- ggplot() +
geom_point(data = df[pknow == 1,], aes(x = affpol, y = trust), alpha = 0.2, color = "coral2") +
geom_smooth(data = df[pknow == 1,], aes(x = affpol, y = trust), method = "lm", color = "coral2", se = TRUE, level = 0.95) +
geom_point(data = df[pknow == 0,], aes(x = affpol, y = trust), alpha = 0.2, color = "dodgerblue2") +
geom_smooth(data = df[pknow == 0,], aes(x = affpol, y = trust), method = "lm", color = "dodgerblue2", se = TRUE, level = 0.95) +
scale_x_continuous(name = "Affective Polarization", breaks = seq(0, 10, 1)) +
ylab("Level of Trust in Information Shared by the In-party") +
labs(title = "Scenario 3:", subtitle = "Political knowledge has an effect with an interaction") +
annotate("text", x = 3, y = 40, label = "lower political knowledge", color = "dodgerblue2") +
annotate("text", x = 3, y = 36, label = "higher political knowledge", color = "coral2") +
theme_bw()
p3
Of these three graphs, the first suggests no distinction between individuals with lower and higher political knowledge. The second shows a consistent difference between the two groups, implying no need for an interaction term. The third graph depicts a scenario where the effect of affective polarization on trust in information shared by the in-party varies based on individuals’ political knowledge levels, indicating an interaction effect.
In terms of regression equations, we have:
Scenario 1: No effect of political knowledge:
\[Y = \alpha + \beta_1*affpol + 0*pknow + 0*affpol*pknow\] where Y represents the outcome (trust in information shared by in-party), \(\beta_1\) represents the effect of affective polarization (which is set as 3 in our simulation), and all other coefficients for the rest of the terms (effect of political knowledge and interaction term) are zero.
Scenario 2: Political knowledge has an effect, but no interaction:
\[Y = \alpha + \beta_1*affpol + \beta_2*pknow + 0*affpol*pknow\]
\(\beta_2\) represents the effect of political knowledge, which is set as -5 in our simulation.
Scenario 3: Political knowledge has an effect with an interaction:
\[Y = \alpha + \beta_1*affpol + \beta_2*pknow + \beta_3*affpol*pknow\]
\(\beta_3\) represents the interaction effect between affective polarization and political knowledge, which is set as -2 in our simulation.
Let’s consider how to interpret the effects of affective polarization and political knowledge in the presence of interaction terms.
If we focus on the first model (Scenario 1) above, where there’s no effect of political knowledge, the reporting is straightforward. There’s no impact of political knowledge, and the coefficient \(\beta_1 = 3.05\) signifies the effect of affective polarization. Specifically, \(\beta_1 = 3.05\) indicates how much one’s trust in information shared by their own party shifts for each one-unit change in their level of affective polarization.
| Variables | Scenario 1 Estimates | Scenario 1 C.I. (95%) | Scenario 2 Estimates | Scenario 2 C.I. (95%) | Scenario 3 Estimates | Scenario 3 C.I. (95%) |
|---|---|---|---|---|---|---|
| Intercept | 4.91 *** | 3.58 – 6.24 | 4.67 *** | 3.26 – 6.09 | 4.75 *** | 2.60 – 6.89 |
| Affective Polarization | 3.05 *** | 2.79 – 3.30 | 3.05 *** | 2.79 – 3.30 | 3.03 *** | 2.62 – 3.45 |
| Political Knowledge: Higher | | | -4.61 *** | -5.39 – -3.82 | -4.72 *** | -7.46 – -1.98 |
| Affective Polarization x Political Knowledge: Higher | | | | | -1.98 *** | -2.51 – -1.45 |
| Observations | 1000 | | 1000 | | 1000 | |
| R2 / R2 adjusted | 0.351 / 0.350 | | 0.403 / 0.402 | | 0.611 / 0.610 | |
If we consider the second model (Scenario 2), where there are effects of both affective polarization and political knowledge, interpretation remains straightforward: Since it does not depend on which level of political knowledge is being discussed (the effect is the same for those with higher political knowledge and those with lower political knowledge), \(\beta_1 = 3.05\) still represents the amount by which one’s trust in information shared by their own party changes for each unit change in their level of affective polarization. Similarly, \(\beta_2 = -4.61\) represents the effect of political knowledge, which is “additive” to the effect of affective polarization. Thus, to obtain the effect of both together for any level of affective polarization, we simply add the two individual effects.
Now, let’s consider the third model (Scenario 3) with an interaction term. Things become more intricate when an interaction term is introduced. There’s no longer a singular effect of affective polarization because it hinges on whether we’re discussing its impact on those with higher political knowledge or those with lower political knowledge. Likewise, the disparity between those with higher political knowledge and those with lower political knowledge is contingent upon the levels of affective polarization.
Consider first the effect of affective polarization: The question of “what is the effect of affective polarization” is not answerable until one knows the level of political knowledge being considered for a subject. The effect of affective polarization is \(\beta_1 = 3.03\) for individuals with lower political knowledge (coded as 0), as the interaction term becomes 0 in this case. Conversely, for individuals with higher political knowledge (coded as 1), the effect of affective polarization becomes \(\beta_1 + \beta_3 = 3.03 - 1.98 = 1.05\). This implies that for every one-unit increase in affective polarization, trust in information shared by the in-party increases by about 1.05 units for those with higher political knowledge, compared to about 3 units for those with lower political knowledge.
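As a quick check, we can recover these conditional slopes directly from mod3 (a minimal sketch, assuming the chunk above has been run so that mod3 is still in the workspace; R names the interaction coefficient affpol:pknow by default):
b <- coef(mod3)
b["affpol"]                      # slope of affpol for lower political knowledge (pknow = 0)
b["affpol"] + b["affpol:pknow"]  # slope of affpol for higher political knowledge (pknow = 1)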
This example considers a continuous variable combined with a dichotomous (dummy or indicator) variable. We can also consider interactions between two dummy variables, and between two continuous variables. The principles remain the same, although some technical details change slightly.
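For illustration, here is a minimal sketch of a continuous-by-continuous interaction using simulated data; the ideology variable and all coefficient values are hypothetical, chosen only to show the mechanics:
set.seed(1234)
df_cc <- data.frame(affpol = rnorm(1000, mean = 5, sd = 1.5),  # continuous predictor
                    ideology = rnorm(1000, mean = 0, sd = 2))  # hypothetical continuous moderator
df_cc$trust <- 2 + 3*df_cc$affpol + 1*df_cc$ideology +
  0.5*df_cc$affpol*df_cc$ideology + rnorm(1000, mean = 0, sd = 6)
mod_cc <- lm(trust ~ affpol * ideology, data = df_cc)
# The slope of affpol changes by 0.5 for each one-unit increase in ideology;
# evaluate the conditional slope at a few ideology values:
coef(mod_cc)["affpol"] + coef(mod_cc)["affpol:ideology"] * c(-2, 0, 2)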
Now that we grasp the importance of incorporating an interaction term in a regression model, let’s apply it using real data. We’ll use my POL 212 project titled “Electoral Outcomes, Presidential Power, and Satisfaction with Democracy” as our example. The data can be accessed here. All the variables are already coded, and observations with missing values have been removed.
Background: Scholars have amassed evidence showing that election outcomes affect how people think about democracy (Anderson & Guillory, 1997; Anderson & Tverdova, 2001; Anderson et al., 2005; Clarke & Acock, 1989; Curini et al., 2012). Compared to those who voted for losers, those who supported winners usually have a higher level of satisfaction and trust in democracy and have a more positive feeling of government responsiveness and political efficacy.
Research Question: However, does it mean that losers’ interests are then ignored by the government? Do people who did not vote for the winning party lose that much in the political game? In this project, I seek to understand how political institutions can shape citizens’ perception of the legitimacy of democracies. In particular, I ask how the extent of presidential power affects the winner-loser effect. Does presidential power amplify or dampen losers’ satisfaction with democracy?
Theory & Hypothesis: I argue that with different levels of presidential power, losers’ satisfaction with democracy should also be different. In countries with powerful presidents, losers should be more unsatisfied with how the government works, for being the political minority in the system causes them to be largely excluded from the policy-making process. In countries with weaker presidents, losers should be less unsatisfied, for their policy interests still can be possibly realized by the premier and the cabinet, which is determined by the congressional election.
Hypothesis: Losers will be more satisfied with democracy when the president is weaker.
Based on my theory, I am not only interested in the effect of winner-loser status on satisfaction with democracy but also how this effect is moderated by presidential power.
Data: The fourth module (2011-2016) of the Comparative Study of Electoral Systems (CSES) project. The CSES project consists of nationally representative post-election surveys conducted across democracies. The unit of analysis for this study is individuals. In particular, I analyze all presidential and semi-presidential countries that were rated as free by Freedom House at the time the election covered by the survey was held. This leaves the data with 15 countries covering Western and Eastern Europe, South and North America, and Asia.
# import data
library(tidyverse) # provides %>% and mutate()
df <- read.csv("/Users/yu-shiuanhuang/Desktop/method-sequence/data/data_con.csv") %>%
  mutate(female = ifelse(gender == "Female", 1, 0))
Country (country) | Election Year (eyear) | Presidential Power (prespow1) |
---|---|---|
Austria | 2013 | 0.0915 |
Brazil | 2014 | 0.4863 |
Bulgaria | 2014 | 0.1831 |
Finland | 2015 | 0.0499 |
France | 2012 | 0.1306 |
Ireland | 2011 | 0.0623 |
Peru | 2016 | 0.4197 |
Poland | 2011 | 0.2407 |
Portugal | 2015 | 0.1969 |
Republic of Korea | 2012 | 0.3488 |
Romania | 2014 | 0.2502 |
Slovakia | 2016 | 0.1894 |
Slovenia | 2011 | 0.1180 |
Taiwan | 2012 | 0.2707 |
United States of America | 2012 | 0.2889 |
Outcome variable: Satisfaction with Democracy (SWD)
In the CSES survey, the concept is measured by this question: “On the whole, are you (1) very satisfied, (2) fairly satisfied, (3) not very satisfied, or (4) not at all satisfied with the way democracy works in [COUNTRY]?” I reverse the scale so that higher values indicate greater satisfaction: (4) very satisfied, (3) fairly satisfied, (2) not very satisfied, and (1) not at all satisfied.
Explanatory variables:
– Loser (loser): I classified respondents as belonging to the political majority or minority using the CSES question that asks respondents which party they feel closest to. loser is coded 1 if the party a respondent feels closest to is not the current president’s party, and 0 otherwise.
– Presidential Power (prespow1): I use Doyle & Elgie’s (2015) continuous presidential power score dataset. The variable ranges from 0 to 1; higher scores indicate stronger presidents and lower scores indicate weaker presidents.
Control variables:
– Divided Government (divided): Past studies have shown that “unified government produces greater quantities of significant enactments and is more responsive to the public mood than is divided government” (Coleman, 1999: 821). I therefore expect the coefficient of divided to be negative, since individuals should be less satisfied with democracy under a less responsive government. divided is coded 1 if the executive-legislative relationship at the time of the survey fieldwork was a divided government, meaning that the president’s party did not control a majority of the legislature. The data used to construct divided come from the Database of Political Institutions 2017 (DPI 2017).
– Gender (gender): male vs. female.
– Age (age): continuous scale, standardized.
– Education (education): continuous scale.
Let’s first run the additive regression model and then add the interaction term between loser and prespow1 to the model, so we can compare how accounting for this interaction improves our model!
# additive model
m1 <- lm(SWD ~ loser + prespow1 + divided + female + education + age, data = df)
# interactive model
m2 <- lm(SWD ~ loser*prespow1 + divided + female + education + age, data = df)
| Variables | Additive Model Estimates | Additive Model C.I. (95%) | Interactive Model Estimates | Interactive Model C.I. (95%) |
|---|---|---|---|---|
| Intercept | 2.73 *** | 2.68 – 2.79 | 2.65 *** | 2.59 – 2.72 |
| Loser | -0.33 *** | -0.35 – -0.30 | -0.22 *** | -0.28 – -0.17 |
| Presidential Power | -0.50 *** | -0.59 – -0.40 | -0.16 | -0.34 – 0.01 |
| Divided Government | -0.04 * | -0.06 – -0.01 | -0.03 * | -0.06 – -0.01 |
| Female | 0.00 | -0.02 – 0.02 | 0.00 | -0.02 – 0.02 |
| Education Level | 0.03 *** | 0.02 – 0.03 | 0.03 *** | 0.02 – 0.03 |
| Age | 0.00 | -0.01 – 0.01 | 0.00 | -0.01 – 0.02 |
| Loser * Presidential Power | | | -0.45 *** | -0.65 – -0.25 |
| Observations | 18150 | | 18150 | |
| R2 / R2 adjusted | 0.047 / 0.047 | | 0.048 / 0.048 | |
How to interpret the interaction term?
Let’s first think it through the additive model we constructed (for simplification, I removed all the control variables first).
\[\hat{SWD} = \hat{\alpha} + \hat{\beta_1}Loser + \hat{\beta_2}Prespow\]
If we want to know the effect of winner-loser status on satisfaction with democracy, we are looking for the estimate of \(\beta_1\). Understanding it through the calculus, we are taking the partial derivative of \(\hat{SWD}\) with respect to \(Loser\).
\[\frac{\partial \hat{SWD}}{\partial Loser} = \hat{\beta_1}\]
Now let’s look at the interactive model.
\[\hat{SWD} = \hat{\alpha} + \hat{\beta_1}Loser + \hat{\beta_2}Prespow +\hat{\beta_3}(Loser*Prespow)\]
To know how the effect of winner-loser status on satisfaction with democracy is moderated by different levels of presidential power, let’s also take the partial derivative of \(\hat{SWD}\) with respect to \(Loser\), our key explanatory variable.
\[\frac{\partial \hat{SWD}}{\partial Loser} = \hat{\beta_1} + \hat{\beta_3}Prespow\]
This shows that the effect of winner-loser status on satisfaction with democracy now depends on presidential power: when presidential power increases by 1 unit, the effect of winner-loser status on satisfaction with democracy changes by \(\hat{\beta_3}\) units.
Apply this logic to the interactive model we just estimated. According to the above regression table, the model we estimated looks like this:
\[\hat{SWD} = 2.65 -0.22Loser -0.16Prespow -0.45(Loser*Prespow) -0.03Divided +0.001Female+0.03Education+0.003Age\]
Take a partial derivative of \(\hat{SWD}\) with respect to \(Loser\):
\[\frac{\partial \hat{SWD}}{\partial Loser} = -0.22 -0.45Prespow\]
Presidential Power | Effect of Winner-Loser Status on Satisfaction with Democracy |
---|---|
\(0.05\) (min) | \(-0.22-0.45*0.05 =-0.2425\) |
\(0.13\) (1st quartile) | \(-0.22-0.45*0.13=-0.2785\) |
\(0.24\) (median) | \(-0.22-0.45*0.24=-0.328\) |
\(0.29\) (3rd quartile) | \(-0.22-0.45*0.29=-0.3505\) |
\(0.49\) (max) | \(-0.22-0.45*0.49=-0.4405\) |
The above table suggests that when the president gets stronger in a country, individuals who supported the losing party will become less satisfied with how democracy works, compared to those who supported the winning party. In other words, the winner-loser gap in satisfaction with democracy becomes greater as presidential power becomes greater, which confirms my hypothesis.
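These conditional effects can also be computed directly from the fitted model m2 (a small sketch, assuming m2 from the chunk above is still in memory; because it uses the unrounded coefficients, the values will differ slightly from the hand calculations in the table):
b <- coef(m2)
prespow_vals <- c(0.05, 0.13, 0.24, 0.29, 0.49)  # min, 1st quartile, median, 3rd quartile, max
# conditional effect of loser at each level of presidential power
b["loser"] + b["loser:prespow1"] * prespow_vals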
Plot the predicted effects to show the pattern more clearly!
Using ggplot() to plot:
prespow1 <- seq(0, 1, by = 0.01)
# create a new data for the predicted SWD of losers under different presidential level
dfplot.l <- data.frame(loser = rep(1, length(prespow1)), # 1 means loser
prespow1 = prespow1,
divided = rep(1, length(prespow1)), # set at the median
female = rep(1, length(prespow1)), # set at the median
education = rep(4, length(prespow1)), # set at the median
age = rep(0, length(prespow1))) # set at the mean
pred.l <- data.frame(Status = rep("Loser", length(prespow1)),
prespow = prespow1,
pred = predict(m2, newdata = dfplot.l, interval = "confidence")[, 1],
lwr = predict(m2, newdata = dfplot.l, interval = "confidence")[, 2],
upr = predict(m2, newdata = dfplot.l, interval = "confidence")[, 3])
# create a new data for the predicted SWD of winners under different presidential level
dfplot.w <- data.frame(loser = rep(0, length(prespow1)), # 0 means winner
prespow1 = prespow1,
divided = rep(1, length(prespow1)), # set at the median
female = rep(1, length(prespow1)), # set at the median
education = rep(4, length(prespow1)), # set at the median
age = rep(0, length(prespow1))) # set at the mean
pred.w <- data.frame(Status = rep("Winner", length(prespow1)),
prespow = prespow1,
pred = predict(m2, newdata = dfplot.w, interval = "confidence")[, 1],
lwr = predict(m2, newdata = dfplot.w, interval = "confidence")[, 2],
upr = predict(m2, newdata = dfplot.w, interval = "confidence")[, 3])
dfplot <- bind_rows(pred.l, pred.w)
ggplot(data = dfplot) +
geom_line(aes(x = prespow, y = pred, linetype = Status)) +
geom_line(aes(x = prespow, y = lwr, linetype = Status)) +
geom_line(aes(x = prespow, y = upr, linetype = Status)) +
scale_linetype_manual(values = c("dashed", "solid")) +
scale_x_continuous(name = "Presidential Power",
breaks = seq(0, 1, 0.1)) +
scale_y_continuous(name = "Predicted Values of Satisfaction with Democracy",
breaks = seq(1.8, 2.8, 0.1)) +
theme_bw() +
theme(legend.title = element_blank(),
legend.position = "bottom")
The above figure demonstrates the predicted values of winners’ and losers’ satisfaction with democracy under different degrees of presidential power. According to this figure, indeed, the winner-loser gap in satisfaction with democracy becomes greater when there is a stronger president in the political system.
Using interact_plot() from the interactions package to plot:
– The pred argument specifies the variable to be used as the predictor variable on the x-axis of the plot.
– The modx argument specifies the moderator variable. The moderator is used to examine whether the relationship between the predictor and the outcome variable is conditional on its level; in the plot, a separate line is drawn for each value of the moderator.
– interval = TRUE specifies that confidence intervals should be drawn around the predicted values of the outcome.
Theoretically, the appropriate specification for the interact_plot() function in our case would be to set pred = loser and modx = prespow1, since loser is the key independent variable and prespow1 is the moderator. However, because loser is a binary variable, the plot is easier to read if we instead set pred = prespow1 and modx = loser, which gives us a continuous x-axis.
interactions::interact_plot(m2, pred = prespow1, modx = loser, interval = TRUE,
x.label = "Presidential Power",
y.label = "Predicted Values of Satisfaction with Democracy")
A classic linear interaction model is given by the following regression equation:
\[Y = \mu +\eta X+\alpha D+\beta (D\cdot X)+Z\gamma +\epsilon\]
In this model, \(Y\) is the outcome variable, \(D\) is the key independent variable of interest or so called “treatment,” \(X\) is the moderator - a variable that affects the direction and/or strength of the treatment effect (i.e., the effect of \(D\) on \(Y\)), \(D\cdot X\) is the interaction term between \(D\) and \(X\), \(Z\) is a vector of control variables, and \(\mu\) and \(\epsilon\) represent the constant (intercept) and error terms, respectively.
Let’s assume that our key independent variable, \(D\), is exogenous (i.e., \(Cov(D, \epsilon) = 0\)). In this abstract case, we are most interested in the effect of \(D\) on \(Y\). Taking the partial derivative of \(Y\) with respect to \(D\), the marginal effect of \(D\) on \(Y\) follows the functional form given by:
\[ME_{D}=\partial Y/\partial D = \alpha +\beta X\]
This equation is a linear function of the moderator \(X\), which suggests that the effect of \(D\) on \(Y\) can only linearly change with \(X\), so if \(X\) increases by one unit, the effect of \(D\) on \(Y\) changes by \(\beta\) units and this change in the effect is “constant across the whole range of \(X\).” We call this assumption linear interaction effect (LIE) assumption.
This is a strong assumption, because we often have little theoretical or empirical reason to believe that the heterogeneity in the effect of \(D\) on \(Y\) takes such a linear form. Instead, it might well be that the effect of \(D\) on \(Y\) is nonlinear or nonmonotonic. For example, the effect of \(D\) on \(Y\) might be small for low values of \(X\), large at medium values of \(X\), and then small again for high values of \(X\).
Considering the LIE assumption, it’d be better to do some diagnostic tests before applying a linear interaction model to our analysis - to examine whether our heterogeneous effect is indeed in a linear form.
Let’s continue with the example from my POL 212 project and assess whether the linear interaction model we previously estimated violates the linearity of the interaction effect assumption. We can achieve this by setting linearity.check = TRUE in interact_plot(). This will enable us to visualize whether the effect of winner-loser status remains linear across the range of the moderator.
interactions::interact_plot(m2, pred = prespow1, modx = loser, plot.points = TRUE,
jitter = 0.1, point.alpha = 0.8, linearity.check = TRUE,
x.label = "Presidential Power", y.label = "Satisfaction with Democracy")
In the above figure, the red line is the predicted loess line, whereas the black line is the predicted linear regression line. Recall that loess is a locally weighted regression: it estimates the outcome using only nearby observations, so the red line will curve if the relationship is nonlinear. What we’re looking for is whether the red loess line is roughly as straight as the black line. If the two lines are very different, the LIE assumption is violated. In our case, the predicted linear regression line is not too different from the loess line.
We can also examine the LIE assumption with a more flexible binning estimator by using interflex() in the interflex package developed by Hainmueller et al. (2019).
The binning estimator stays within the regression framework while relaxing the LIE assumption, flexibly allowing the conditional marginal effect to vary across values of the moderator. Simply put, the binning estimator breaks a continuous moderator (in our case, presidential power) into several bins (the default in interflex() is 3) represented by dummy variables and interacts these dummies with the treatment indicator (in our case, winner-loser status). By default, it picks an evaluation point within each bin, the median of the moderator in that bin, at which to estimate the conditional marginal effect of the key independent variable on the outcome variable.
library(interflex)
ols.interaction.binning <- interflex(estimator = "binning", Y = "SWD", D = "loser", X = "prespow1",
Z = c("divided","female","education", "age"), data = df, na.rm = TRUE)
## Baseline group not specified; choose treat = 0 as the baseline group.
plot(ols.interaction.binning)
The three estimates from the binning estimator in the above figure, labeled L, M, and H, are the median point of each bin. As this figure shows, the conditional effect estimates from the binning estimator and our linear interaction model are similar. The first two estimates, L and M, sit almost right on the estimated linear marginal-effect line. Only the third estimate, H, is a bit far from the line.
For the second example, let’s replicate Homola & Tavits’s (2018) findings on how people’s pre-existing partisan affinities condition the effect of contact with nonnatives on their immigration-related fears.
According to contact theory, close intergroup contact can foster more positive attitudes towards other groups, potentially reducing threat perceptions and prejudice towards immigrants. However, Homola & Tavits (2018) suggest that individuals may resist changing their pre-existing opinions, particularly influenced by their partisanship and political ideology. They hypothesize that exposure and positive contact with immigrants may decrease perceptions of immigrant-related threats among leftist voters but increase them among rightist voters. This difference arises because positive contact aligns with the leftist ideology of advocating social change and rejecting inequality, whereas it conflicts with the rightist ideology of resisting social change and accepting inequality.
In short, Homola & Tavits (2018) argue that the effect of contact with outgroup members on people’s perceptions of immigrant-related threats is heterogeneous across partisanship and the associated political ideology (leftist-rightist).
Outcome variable (\(Y\), threat): Homola & Tavits (2018) use the six measures they developed in their survey to build an overall threat score capturing perceptions of threat related to immigration.
Key independent variable (treatment, \(D\), contact): Homola & Tavits (2018) use two of their survey questions to construct a contact factor score capturing respondents’ repeated positive contact with immigrants.
Moderator (\(X\), PID): Homola & Tavits (2018) use two different measures to capture whether a respondent is a political leftist or rightist. First, they use the standard question “Generally speaking, do you usually think of yourself as a Democrat, a Republican, an Independent, or what?” and a follow-up question that asks which party respondents feel closer to. From these, they create a binary measure of partisan affinity (variable name Rightist), where 0 = leftist (i.e., Democrat) and 1 = rightist (i.e., Republican). They also use the full 7-point party affinity scale ranging from 1 = strong Democrat to 7 = strong Republican. Here, we will focus on the continuous measure of this moderator.
Control variables (\(Z\)): Homola & Tavits (2018) include female, age, education, and income as their control variables.
In Homola & Tavits’s (2018) model, they interact Contact with Rightist to examine whether the effect of Contact on Threat differs for leftists and rightists. They expect the coefficient of the interaction term Contact*Rightist to be significantly positive, meaning that for rightists, positive contact with nonnatives does much less to decrease their perceptions of threat toward immigrants.
Homola & Tavits’s (2018) data can be accessed here.
# import data
library(haven)
df <- read_dta("/Users/yu-shiuanhuang/Desktop/method-sequence/data/HomolaTavits2018.dta") %>%
mutate(PID = PID7) %>%
select(threat, contact, PID, female, age, education, income) %>%
na.omit()
df$education <- as.numeric(df$education)
df$income <- as.numeric(df$income)
Let’s first run the interaction model and plot the marginal effect of positive contact with nonnatives (contact) on people’s perceptions of threat toward immigrants (threat).
m <- lm(threat ~ contact*PID + female + age + education + income, data = df)
Dependent variable: Threat
| Variables | Estimates | C.I. (95%) |
|---|---|---|
| Intercept | -0.62 | -1.31 – 0.08 |
| Contact | -0.66 *** | -0.90 – -0.41 |
| Rightist (PID) | 0.46 *** | 0.42 – 0.51 |
| Female | 0.05 | -0.15 – 0.24 |
| Age | 0.01 *** | 0.01 – 0.02 |
| Education | -0.44 *** | -0.58 – -0.31 |
| Income | -0.11 ** | -0.18 – -0.04 |
| Contact * Rightist (PID) | 0.08 * | 0.02 – 0.14 |
| Observations | 1487 | |
| R2 / R2 adjusted | 0.306 / 0.303 | |
According to the regression table, as Homola & Tavits (2018) expected, the coefficient of Contact*Rightist is significantly positive. A one-unit increase in contact decreases an individual’s perceived threat toward immigrants by 0.66 units. However, for each additional unit to the right on the political ideology spectrum, the effect of Contact on Threat is offset by 0.08 units; for example, moving one unit to the right shrinks the overall effect of a one-unit increase in Contact on Threat to -0.58 units.
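To see the conditional effect across the whole PID scale, here is a minimal sketch (assuming the fitted model m above is in memory; the interaction coefficient follows R’s default contact:PID labeling):
b <- coef(m)
# conditional effect of contact at each value of the 7-point PID scale
data.frame(PID = 1:7,
           effect_of_contact = b["contact"] + b["contact:PID"] * (1:7))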
interactions::interact_plot(m, pred = contact, modx = PID, interval = TRUE)
The above figure shows the predicted perceptions of threat toward immigrants across contact levels for different values of political ideology. Indeed, the further right an individual is on the political ideology spectrum, the less positive contact with people from the outgroup reduces her perceived threat toward immigrants (the darker blue line is flatter than the lighter line).
Now, let’s examine whether our linear interaction model violates the LIE assumption.
interactions::interact_plot(m, pred = contact, modx = PID, plot.points = TRUE,
jitter = 0.1, point.alpha = 0.8, linearity.check = TRUE)
As the above figure shows, our predicted linear regression line is not too different from the loess line across different values of political ideology.
ols.interaction.binning <- interflex(estimator = "binning", Y = "threat", D = "contact", X = "PID",
Z = c("female", "age", "education", "income"), data = as.data.frame(df), na.rm = TRUE)
plot(ols.interaction.binning)
The more relaxed binning estimator approach also suggests that the conditional effect estimates and our linear interaction model are similar. The three estimates, L, M, and H, sit almost right on the estimated linear marginal-effect line.
What is a decision tree?
A decision tree is a flowchart-like structure made of nodes and branches. At each node, a split on the data is performed based on one of the input features, generating two or more branches as output. More and more splits are made in the upcoming nodes and increasing numbers of branches are generated to partition the original data. This continues until a node is generated where all or almost all of the data belong to the same class and further splits — or branches — are no longer possible.
As mentioned in class, tree-based methods are a good way to first parse out whether there are any noteworthy candidate interactions within our data (theory is intertwined between deduction and induction).
Let’s apply the single tree method to my POL212 project “Electoral Outcomes, Presidential Power, and Satisfaction with Democracy.”
library(rpart)
df2 <- read.csv("/Users/yu-shiuanhuang/Desktop/method-sequence/data/data_con.csv") %>%
select(SWD, loser, prespow1, divided, gender, education, age) ## only select the variables we will use
set.seed(1234)
fit <- rpart(SWD ~ ., data = df2, control = rpart.control(minsplit = 20, minbucket = 5))
## minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted.
## minbucket: the minimum number of observations in any terminal <leaf> node.
rsq.rpart(fit) # the R2 value is about 0.2
##
## Regression tree:
## rpart(formula = SWD ~ ., data = df2, control = rpart.control(minsplit = 20,
## minbucket = 5))
##
## Variables actually used in tree construction:
## [1] loser prespow1
##
## Root node error: 11488/18150 = 0.63294
##
## n= 18150
##
## CP nsplit rel error xerror xstd
## 1 0.036314 0 1.00000 1.00018 0.0090794
## 2 0.025665 1 0.96369 0.96398 0.0089336
## 3 0.021268 2 0.93802 0.93844 0.0088492
## 4 0.013994 3 0.91675 0.91727 0.0088237
## 5 0.010000 5 0.88877 0.88942 0.0087257
Plot your tree!
library(rpart.plot)
rpart.plot(fit)
How to interpret this single decision tree?
This tree has five internal nodes (rules):
loser = 1: whether one supported the losing party or the winning party
prespow1 >= 0.16: whether the presidential power in a country is larger than or equal to 0.16
prespow1 < 0.19: whether the presidential power in a country is smaller than 0.19
prespow1 < 0.26: whether the presidential power in a country is smaller than 0.26
prespow1 >= 0.32: whether the presidential power in a country is larger than or equal to 0.32
As you can observe from this decision tree, compared to losers in a country where the presidential power is weaker, losers in a country where the presidential power is stronger are less satisfied with democracy. In other words, there does seem to be an interaction effect between winner-loser status and presidential power as I argued in my hypothesis.
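If you prefer the splits written out as plain-language rules rather than read off the diagram, the rpart.plot package (version 3.0 and later) also provides rpart.rules(); a minimal call would be:
# print each terminal node's predicted SWD together with the rule that leads to it
rpart.plot::rpart.rules(fit)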
In today’s discussion, we’ll cover the Gauss-Markov theorem, highlighting why ordinary least squares is a reliable estimator for our relationship of interest. We’ll then explore diagnostic techniques to ensure our model adheres to its assumptions and address any violations. Additionally, we’ll discuss collinearity detection methods.
Gauss Markov theorem states that “if the errors are independently distributed with zero expectation and constant variance, then the least-squares estimator b is the most efficient linear unbiased estimator of \(\beta\). That is, of all unbiased estimators that are linear functions of the observations, the least-squares estimator has the smallest sampling variance and, hence, the smallest mean-squared error. For this reason, the least-squares estimator is sometimes termed BLUE, an acronym for best linear unbiased estimator” (Fox 2016: 212).
In other words, under certain conditions, the ordinary least squares (OLS) estimates are thought to be the best linear unbiased estimator (BLUE). Here, I review a set of assumptions (known as Gauss-Markov assumptions) that leads to a BLUE estimator.
Linearity: the expected (mean) value of the disturbance term is 0 (i.e., \(E(u_i)=0\)).
Fox (2016) explains that a violation of this assumption implies “that the model fails to capture the systematic pattern of relationship between the response and explanatory variables” (307). This assumption can be broken in two ways: the mean of the error could be a nonzero constant across all values of \(X_i\), or it could vary with \(X_i\). In the former case, the intercept term will be consistently off by the error amount; a uniform measurement error across all observations would cause such an error (Berry, 1993: 42). A common cause of the latter case is omitted variable bias, which deserves particular attention because it can make the OLS estimate both biased and inconsistent. Another common cause is the wrong functional form (e.g., specifying a linear relation instead of a quadratic relation). Thus, when nonlinearity is detected, we should explore not only variable transformations and different model specifications (Fox, 2016: 317), but also the possibility of omitted variables. In short, before running OLS, researchers should make sure that the variables included in the model and the functional form are reasonable and grounded in theory.
Nonstochastic regressors: \(X\) values are independent of the error term (i.e., \(cov(X_i, u_i)=0\)).
The violation of this assumption suggests that there is an endogeneity problem between the explanatory and outcome variables, which can arise from measurement error on \(X\), omitted confounders, or reciprocal causation between \(X\) and \(Y\). Again, if the main goal of the research is to provide an unbiased and consistent estimator to explain how a social phenomenon works in a real-life setting, a convincing explanation requires us to make sure that the effect of our explanatory variable is actually exogenous. Fortunately, scholars have developed many methods to address this issue. For example, researchers could employ an instrumental variable or matching to improve the unbiasedness of the estimated effect.
Homoskedasticity: constant error variance across values of \(X_i\) (i.e., \(Var(\varepsilon)=\sigma^2\)).
One instance where homoskedasticity is often violated is cross-sectional studies (Berry, 1993: 73). For example, because more developed countries have better data accuracy, measurement error could be correlated with the level of development. Thus, in a study that includes the level of development as one of the independent variables, the variance of the error term may not be constant (Berry, 1993: 73). In such circumstances, Fox (2016) suggests that we can run weighted-least-squares (WLS) regression to overcome this problem (304). Other solutions include transforming the dependent variable and correcting the coefficient standard errors for heteroskedasticity (e.g., robust standard errors and clustered standard errors) (Fox, 2016: 307).
Independence: the observations are sampled independently; there is no autocorrelation between disturbances (i.e., \(cov(u_i, u_j)=0\) for \(i \neq j\)).
The independence assumption is important for time-series analysis, where the assumption of spherical errors is often violated. To overcome this, Beck & Katz (1995) propose to use panel-correlated standard errors when analyzing time-series data.
Normality: errors are normally distributed (i.e., \(\varepsilon \sim N(0, \sigma^2)\)).
Assumptions 1 and 2 relate to the bias of the estimator, whereas assumptions 3 and 4 relate to its efficiency: when assumption 1 or 2 is violated, the OLS estimator is no longer unbiased; when assumption 3 or 4 is violated, the OLS estimator is no longer efficient. With this weaker set of assumptions (1-4), according to the Gauss-Markov theorem, the OLS estimator is BLUE.
With the fifth assumption, normality, the OLS estimator is the most efficient among all the unbiased estimators, not just linear estimators. Adding this fifth assumption makes the Gauss-Markov theorem a strong set. This last assumption is important when we aim to make a causal inference using the OLS estimator. With the normal distribution of the errors assumption, we can also infer that the OLS estimators also have normal sampling distributions. As a result, it allows us to apply statistical tests such as t-tests even for small sample size studies.
The table below presents the consequences of violating the Gauss-Markov assumptions and the corresponding suggested solutions.
Assumption | Violation | Solution |
---|---|---|
Linearity | Biased/inconsistent estimates | Transformations, polynomials, different model |
Nonstochastic regressors | Biased/inconsistent estimates | Instrumental variables |
Homoskedasticity | Biased standard errors | Robust standard errors |
Independence | Biased standard errors | Fixed effects |
Recall that the violation of linearity assumption implies “that the model fails to capture the systematic pattern of relationship between the response and explanatory variables” (Fox, 2016: 307). Let’s generate data based on this underlying idea!
set.seed(20230307) # for replication
N <- 1000 # set up the sample size
x1 <- rnorm(N, mean = 10, sd = 3)
x2 <- runif(N, min = -3, max = 3)
e <- rnorm(N, mean = 0, sd = 1) # set the errors to be normally distributed
y <- 7 + 3*x1 + 2*x1^2 + 3*x2 + e # set the true y
df <- data.frame(y = y, x1 = x1, x2 = x2, e = e)
In the above chunk, I specify the true \(y\) to be a function of \(x_1\), \(x_2\), and \(e\). The true values of the regression coefficients are 3 for \(x_1\) and 3 for \(x_2\). The term \(2*x_1^2\) represents a quadratic effect of \(x_1\). The intercept is 7. Let’s first plot the true relationship between \(x_1\) and \(y\)!
library(tidyverse)
ggplot(data = df, aes(y = y, x = x1)) +
geom_point(alpha = 0.8) +
geom_smooth() +
theme_bw()
As you can see in the above figure, in the true model we set up, the effect of \(x_1\) on \(y\) is quadratic (convex): as \(x_1\) increases, the marginal effect of \(x_1\) on \(y\) becomes larger!
Now, let’s run a model that violates the linearity assumption (i.e., when regressing \(y\) on \(x_1\) and \(x_2\), do not include the quadratic term I(x1^2) in the lm() call).
fit <- lm(y ~ x1 + x2, data = df)
Dependent variable: y
| Variables | Estimates | S.E. | C.I. (95%) |
|---|---|---|---|
| (Intercept) | -170.54 *** | 2.82 | -176.08 – -164.99 |
| x1 | 42.68 *** | 0.27 | 42.15 – 43.21 |
| x2 | 3.10 *** | 0.49 | 2.14 – 4.05 |
| Observations | 1000 | | |
| R2 / R2 adjusted | 0.962 / 0.962 | | |
As you can see from the above regression table, the estimated value of constant/intercept (-170.54) is very far from what we have set in the true model (7). The coefficient of \(x_1\) is also very biased (we set it to be 3 in our true model). Let’s add this estimated regression line onto our scatterplot of \(x_1\) and \(y\).
ggplot(data = df, aes(y = y, x = x1)) +
geom_point(alpha = 0.8) +
geom_smooth() +
geom_abline(intercept = -170.54, slope = 42.68, color = "coral", size = 1.2) +
theme_bw()
You can see that our estimated regression line is very different from the smoothed line. Our model did not capture the nonlinear relationship between \(x_1\) and \(y\).
Now, assess this nonlinearity by plotting residuals vs fitted values.
par(mfrow = c(2, 2))
plot(fit)
The linearity assumption can be checked by inspecting the Residuals vs Fitted plot (first plot).
This plot displays the predicted values of the outcome variable on the horizontal axis, and the residuals (i.e., the difference between the observed and predicted values of the outcome variable) on the vertical axis. Ideally, the residual plot will show no fitted pattern. That is, the red line should be approximately horizontal at zero, which suggests that the expected (mean) value of the disturbance term is 0. The presence of a pattern may indicate a problem with some aspect of the linear model.
As you can observe in the first plot, the red line shows that there is a nonlinear relationship existing between one of our explanatory variables and the outcome variable.
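To complement the visual check with a formal one, we can run Ramsey’s RESET test for omitted higher-order terms (a sketch, assuming the lmtest package is installed; a small p-value indicates that the linear specification misses nonlinearity):
library(lmtest)
# add squared and cubed fitted values to the model and test whether they matter
resettest(fit, power = 2:3, type = "fitted")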
What if we run the model without violating the linearity assumption (i.e., we specify the model so that it captures the quadratic relationship between \(x_1\) and \(y\))?
fit2 <- lm(y ~ x1 + I(x1^2) + x2, data = df)
| Variables | fit (y ~ x1 + x2) Estimates | S.E. | C.I. (95%) | fit2 (y ~ x1 + I(x1^2) + x2) Estimates | S.E. | C.I. (95%) |
|---|---|---|---|---|---|---|
| (Intercept) | -170.54 *** | 2.82 | -176.08 – -164.99 | 6.73 *** | 0.25 | 6.25 – 7.22 |
| x1 | 42.68 *** | 0.27 | 42.15 – 43.21 | 3.04 *** | 0.05 | 2.94 – 3.14 |
| x2 | 3.10 *** | 0.49 | 2.14 – 4.05 | 2.98 *** | 0.02 | 2.94 – 3.01 |
| x1^2 | | | | 2.00 *** | 0.00 | 1.99 – 2.00 |
| Observations | 1000 | | | 1000 | | |
| R2 / R2 adjusted | 0.962 / 0.962 | | | 1.000 / 1.000 | | |
par(mfrow = c(2, 2))
plot(fit2)
Let’s check the first plot again. After specifying the model correctly, the red line shows no fitted pattern, which indicates a linear relationship.
Recall that the homoskedasticity assumption is violated when the error variance is not constant across values of an independent variable. Again, let’s generate data based on this underlying idea!
set.seed(20230307) # for replication
N <- 1000 # set up the sample size
x1 <- 1:N
x2 <- rnorm(N, mean = 10, sd = 3)
e <- rnorm(N, mean = 0, sd = 2*x1) # errors are normal, but their standard deviation grows with x1
y <- 7 + 3*x1 + x2 + e # set the true y
df <- data.frame(y = y, x1 = x1, x2 = x2, e = e)
In the above chunk, I make the heteroskedasticity pronounced: as \(x_1\) gets larger, the standard deviation of the error grows proportionally (sd = 2*x1). You can see this relationship in the figure below.
ggplot(data = df, aes(x = x1, y = e)) +
geom_point(alpha = 0.8) +
theme_bw()
Let’s now run the regression and test it to see if it violates the homoskedasticity assumption.
fit3 <- lm(y ~ x1 + x2, data = df)
Dependent variable: y
| Variables | Estimates | S.E. | C.I. (95%) |
|---|---|---|---|
| (Intercept) | -16.56 | 143.39 | -297.93 – 264.81 |
| x1 | 2.92 *** | 0.13 | 2.67 – 3.18 |
| x2 | 3.02 | 12.15 | -20.82 – 26.86 |
| Observations | 1000 | | |
| R2 / R2 adjusted | 0.334 / 0.333 | | |
According to the regression table, you can see that the standard error of \(x_2\) is very large, which is due to the heteroskedasticity problem in our data. The estimate of the effect of \(x_2\), therefore, is very inefficient.
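We can confirm this formally with the Breusch-Pagan test (a sketch, assuming the lmtest package is installed; a small p-value indicates that the error variance depends on the predictors):
library(lmtest)
bptest(fit3)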
par(mfrow = c(2, 2))
plot(fit3)
Let’s take a look at the third plot. The Scale-Location plot is used to check the homogeneity of the residual variance (homoscedasticity). A horizontal line with equally spread points is a good indication of homoscedasticity. This is not the case in our example: the variability of the residuals increases with the fitted values of the outcome variable, suggesting non-constant error variance.
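One common remedy mentioned earlier is to correct the coefficient standard errors. A minimal sketch using heteroskedasticity-robust (HC1) standard errors, assuming the sandwich and lmtest packages are installed:
library(sandwich)
library(lmtest)
# re-test the coefficients of fit3 with a heteroskedasticity-consistent covariance matrix
coeftest(fit3, vcov = vcovHC(fit3, type = "HC1"))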
Recall that the normality assumption is violated when the errors are not normally distributed. If this assumption is violated, the OLS estimators may not have a normal sampling distribution, and statistical inference based on them may not be reliable. Again, let’s generate data based on this underlying idea!
set.seed(20230307) # for replication
N <- 1000 # set up the sample size
x1 <- rnorm(N, mean = 10, sd = 3)
x2 <- runif(N, min = -3, max = 3)
e <- runif(N, min = 0, max = 1) # set the errors to not be normally distributed
y <- 7 + 3*x1 + x2 + e # set the true y
df <- data.frame(y = y, x1 = x1, x2 = x2, e = e)
In the above chunk, I set the errors to not be normally distributed.
Let’s now run the regression and test it to see if it violates the normality assumption.
fit4 <- lm(y ~ x1 + x2, data = df)
Dependent variable: y
| Variables | Estimates | S.E. | C.I. (95%) |
|---|---|---|---|
| (Intercept) | 7.54 *** | 0.03 | 7.48 – 7.60 |
| x1 | 3.00 *** | 0.00 | 2.99 – 3.00 |
| x2 | 1.00 *** | 0.01 | 0.99 – 1.01 |
| Observations | 1000 | | |
| R2 / R2 adjusted | 0.999 / 0.999 | | |
library(car)
qqPlot(fit4, simulate = TRUE, line = "none", labels = FALSE)
## [1] 265 857
The QQ plot, or quantile-quantile plot, is a graphical tool that helps us assess whether a set of data plausibly came from some theoretical distribution, such as a normal or an exponential distribution. Since our analysis assumes normally distributed errors, we can use a normal QQ plot of the residuals to check this assumption.
A QQ plot is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles come from the same distribution, the points should form a roughly straight line, which is not the case in our example.
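We can also back up the visual check with a formal test, for example the Shapiro-Wilk test on the residuals (valid for sample sizes up to 5,000; a small p-value indicates a departure from normality):
shapiro.test(residuals(fit4))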
In multivariate linear regression, it is possible for two or more predictor variables to be correlated with each other. This scenario is referred to as collinearity. In extreme cases, where collinearity exists between three or more variables, even if no pair of variables has a particularly high correlation, this is known as multicollinearity. Multicollinearity indicates redundancy between predictor variables.
Note that when there is perfect multicollinearity (i.e., when two or more independent variables are perfectly correlated), we cannot even get an estimate using the OLS method. But don’t worry: when you run lm() in R, R will automatically drop one of the perfectly correlated variables in order to run the regression.
The presence of multicollinearity can cause the solution of the regression model to become unstable.
Therefore, it is important to detect and address multicollinearity before building a regression model to ensure the model’s reliability and accuracy.
In this example, our response variable is score, and our explanatory variables are STR (student-teacher ratio) and english (the share of English-learning students).
library(AER)
library(MASS)
data(CASchools)
# define variables
CASchools$STR <- CASchools$students/CASchools$teachers
CASchools$score <- (CASchools$read + CASchools$math)/2
If there’s high multicollinearity between variables and they’re perfectly correlated, R will automatically drop one of them to resolve the perfect multicollinearity issue. In cases of high multicollinearity where variables are not perfectly correlated, the lm() function will still generate estimates for the highly correlated variable, but the standard errors of coefficients will be notably high, indicating unstable estimates. Below, we illustrate examples when our model includes 1) perfectly correlated variables, 2) highly correlated variables, or 3) less highly correlated variables.
# generate a perfectly correlated variable
CASchools$english_pc <- 2*CASchools$english
cor(CASchools$english_pc, CASchools$english)
## [1] 1
# english_pc is perfectly correlated with english
set.seed(1234)
# generate a highly correlated variable
CASchools$english_hc1 <- 2*CASchools$english + rnorm(nrow(CASchools), 10, 5)
cor(CASchools$english_hc1, CASchools$english)
## [1] 0.9904646
# english_hc is highly but not perfectly correlated with english
set.seed(1234)
# generate a less highly correlated variable
CASchools$english_hc2 <- 2*CASchools$english + rnorm(nrow(CASchools), 10, 20)
cor(CASchools$english_hc2, CASchools$english)
## [1] 0.8662955
mod1 <- lm(score ~ STR + english , data = CASchools) # without highly correlated term
mod2 <- lm(score ~ STR + english + english_pc , data = CASchools) # with perfectly correlated term
mod3 <- lm(score ~ STR + english + english_hc1 , data = CASchools) # with highly correlated term
mod4 <- lm(score ~ STR + english + english_hc2 , data = CASchools) # with less highly correlated term
| Variables | Model 1 Est. | Model 1 S.E. | Model 1 C.I. (95%) | Model 2 Est. | Model 2 S.E. | Model 2 C.I. (95%) | Model 3 Est. | Model 3 S.E. | Model 3 C.I. (95%) | Model 4 Est. | Model 4 S.E. | Model 4 C.I. (95%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 686.03 *** | 7.41 | 671.46 – 700.60 | 686.03 *** | 7.41 | 671.46 – 700.60 | 687.24 *** | 7.57 | 672.37 – 702.12 | 686.40 *** | 7.43 | 671.80 – 701.00 |
| STR | -1.10 ** | 0.38 | -1.85 – -0.35 | -1.10 ** | 0.38 | -1.85 – -0.35 | -1.10 ** | 0.38 | -1.85 – -0.36 | -1.10 ** | 0.38 | -1.85 – -0.36 |
| english | -0.65 *** | 0.04 | -0.73 – -0.57 | -0.65 *** | 0.04 | -0.73 – -0.57 | -0.43 | 0.28 | -0.98 – 0.12 | -0.60 *** | 0.08 | -0.75 – -0.44 |
| english hc1 | | | | | | | -0.11 | 0.14 | -0.39 – 0.16 | | | |
| english hc2 | | | | | | | | | | -0.03 | 0.04 | -0.10 – 0.04 |
| Observations | 420 | | | 420 | | | 420 | | | 420 | | |
| R2 / R2 adjusted | 0.426 / 0.424 | | | 0.426 / 0.424 | | | 0.427 / 0.423 | | | 0.427 / 0.423 | | |
The regression output indicates that in model 2 there is a perfect multicollinearity problem; R automatically dropped english_pc to resolve it.
In models 3 and 4, there is an imperfect multicollinearity problem, which is reflected in the increased standard errors of the coefficient estimates. Specifically, the estimated effect of english becomes increasingly unstable as the correlation between english and the correlated term (which we artificially generated) becomes stronger.
Multicollinearity in a regression model can be assessed by computing the variance inflation factor (VIF) for each predictor variable. The VIF measures how much of the variation in one predictor is explained by the other predictors: it is computed by regressing that predictor on all of the other predictors and recording the R-squared of this auxiliary regression. The VIF is then one divided by the tolerance, where the tolerance is defined as one minus that R-squared.
A VIF value of 1 indicates an absence of multicollinearity, while values exceeding 5 or 10 suggest problematic collinearity (James et al., 2014). When multicollinearity is detected, it is recommended to remove one or more of the correlated variables from the model. This is because the presence of multicollinearity implies that the information provided by the correlated variables about the response variable is redundant in the presence of other variables (James et al., 2014; P. Bruce & Bruce, 2017).
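To see where these numbers come from, here is the formula applied by hand to the english coefficient in model 3 (a sketch using the same CASchools data as above; it should match the vif() output below):
# auxiliary regression: english on the other predictors in mod3
r2_english <- summary(lm(english ~ STR + english_hc1, data = CASchools))$r.squared
1 / (1 - r2_english)  # VIF for english in mod3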
The R function vif() in the car package can be used to detect multicollinearity in a regression model.
library(car)
vif(mod1) # without highly correlated term
## STR english
## 1.036495 1.036495
vif(mod2) # with perfectly correlated term
## Error in vif.default(mod2): there are aliased coefficients in the model
vif(mod3) # with highly correlated term
## STR english english_hc1
## 1.036565 52.750311 52.691167
vif(mod4) # with less highly correlated term
## STR english english_hc2
## 1.036565 4.049684 4.007776
As expected, when we introduce a highly correlated term into the model, both english and english_hc1 show exceedingly high VIF scores. Additionally, when a perfectly correlated term is included, R issues an error indicating “there are aliased coefficients in the model,” signaling a perfect multicollinearity issue.