Discussion I

For today’s discussion, we will review the distinction between parametric and nonparametric regression models, as well as some key properties of simple linear regression. If you wish to review materials from POL 211 & 212, you can access them through the following links: POL 211 Discussion, POL 212 Discussion.

  1. Parametric versus nonparametric Regression Models
  2. Simple Linear Regression
  3. Properties of Least Squares Estimator

1. Parametric versus nonparametric Regression Models

When it comes to regression analysis, choosing the right approach is crucial for accurate predictions and meaningful insights. Two common approaches are parametric methods, like linear regression, and semi-/nonparametric methods, like smoothing spline regression or kernel regression. Each has its own advantages and disadvantages, and the choice between them largely depends on the nature of the data and the underlying relationships.

a. What is parametric and semi/nonparametric regression?

Parametric Regression: Linear regression is a well-known parametric method that assumes a linear functional form for the relationship between the predictors (\(X\)) and the target variable (\(Y\)). This approach has several benefits, such as ease of estimation with a small number of coefficients. In linear regression, these coefficients have straightforward interpretations, and statistical significance tests are readily applicable. However, parametric methods come with a significant limitation: they rely on the assumption that the specified functional form is a close approximation to the true relationship. If this assumption is far from reality, linear regression can perform poorly and yield unreliable results.

Nonparametric Regression: On the other hand, nonparametric methods like K-Nearest Neighbors (KNN) regression do not make explicit assumptions about the functional form of the relationship between \(X\) and \(Y\). Instead, they provide a more flexible approach to regression. KNN regression identifies the K training observations closest to a prediction point and estimates the target variable by averaging their values. While this approach is more versatile and can handle complex relationships, it can suffer from high variance when K is small, leading to overfitting. Conversely, when K is large, KNN regression can underfit the data.

Example

Assume that we have an outcome variable \(Y\) and two explanatory variables, \(x_1\) and \(x_2\). In general, the regression model that describes the relationship can be written as:

\[Y = f_1(x_1) + f_2(x_2) + \epsilon\]

  • Some parametric regression models:

    • Multiple linear regression model: \(Y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \epsilon\)
    • Polynomial regression model of second order: \(Y = \beta_0 + \beta_{10}x_1 + \beta_{11}{x_1}^2 + \beta_{20}x_2 + \beta_{21}{x_2}^2 + \epsilon\)
    • Nonlinear regression model: \(Y = \beta_0 + \beta_1x_1 + \beta_2e^{\beta_3x_2} + \epsilon\)
    • Poisson regression, where \(Y\) is a count: \(\log(\mu) = \beta_0 + \beta_1x_1 + \beta_2x_2\)
  • If we do not know the \(f_1\) and \(f_2\) functions, we need to use a nonparametric regression model.

b. Nonparametric regression estimation methods: K-Nearest Neighbors (KNN) regression

K-Nearest Neighbors (KNN) regression is one of the simplest and best-known nonparametric methods. Given a value for \(K\) and a prediction point \(x_0\), KNN regression first identifies the \(K\) training observations that are closest to \(x_0\), represented by \(N_0\). It then estimates \(f(x_0)\) using the average of all the training responses in \(N_0\). In other words,

\[\hat{f}(x_0)=\frac{1}{K}\sum_{x_i \in N_0}y_i\]

When using KNN as a regressor with a continuous dependent variable, the training data points are scattered across the coordinate plane. When a new data point is introduced, its \(K\) nearest neighbors are identified using a distance metric; usually, the Euclidean distance is used. Once the neighbors are identified, the predicted value for the new data point is calculated as the average of those neighbors’ values.
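
To make the averaging step concrete, below is a minimal hand-rolled sketch in R for the one-predictor case (the function name knn_predict is purely illustrative; later we will use the FNN package to do this for us):

knn_predict <- function(x_train, y_train, x0, K) {
  d <- abs(x_train - x0)      # distance from x0 to every training point
  N0 <- order(d)[1:K]         # indices of the K nearest training observations
  mean(y_train[N0])           # average their responses to estimate f(x0)
}
# e.g., with training vectors x and y: knn_predict(x, y, x0 = 10, K = 5)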

The below figure illustrates two KNN fits on a data set with \(p = 2\) predictors. The fit with \(K = 1\) is shown in the left-hand panel, while the right-hand panel corresponds to \(K = 9\). We see that when \(K = 1\), the KNN fit perfectly interpolates the training observations, and consequently takes the form of a step function. When \(K = 9\), the KNN fit still is a step function, but averaging over nine observations results in much smaller regions of constant prediction, and consequently a smoother fit. In general, the optimal value for \(K\) will depend on the bias-variance tradeoff, which Chris introduced in POL 212. A small value for \(K\) provides the most flexible fit, which will have low bias but high variance. This variance is due to the fact that the prediction in a given region is entirely dependent on just one observation. In contrast, larger values of \(K\) provide a smoother and less variable fit; the prediction in a region is an average of several points, so changing one observation has a smaller effect. However, the smoothing may cause bias by masking some of the structure in \(f(X)\).

The bias-variance tradeoff is a fundamental concept in machine learning that deals with the balance between the bias (error from overly simplistic assumptions) and variance (sensitivity to fluctuations in the training data) of a model. In the context of KNN (K-Nearest Neighbors), adjusting the value of K can help control this tradeoff.

  • Low \(K\) (e.g., \(K=1\)): Low bias, high variance. The model closely follows the training data, leading to high variance and sensitivity to noise. It can capture complex patterns but may overfit the training data.

  • High \(K\) (e.g., large \(K\)): High bias, low variance. The model averages over more data points, resulting in smoother predictions with lower variance but potentially higher bias. It may underfit the data by oversimplifying the underlying patterns.

Choosing an appropriate value for \(K\) involves finding a balance between bias and variance to achieve optimal model performance. This tradeoff influences the model’s ability to generalize to unseen data.

There is no pre-defined statistical method for finding the most favorable value of \(K\), but a typical way to choose an optimal \(K\) is:

  1. Initialize with some value of \(K\) and start computing.
  2. Keep in mind that a small value of \(K\) leads to unstable, highly variable predictions, while a very large value of \(K\) smooths the fit but can mask real structure in the data.
  3. Compute the prediction error for \(K\) values over a defined range and plot the error rate against \(K\).
  4. Choose the \(K\) value with the minimum error rate.

c. When should we use parametric or nonparametric regression?

The key question is when to choose a parametric approach like linear regression over a non-parametric one such as KNN regression. The answer is straightforward: a parametric approach performs better when the chosen functional form is a close match to the true relationship, particularly in the presence of a linear relationship. If the specified functional form is far from the truth, and prediction accuracy is our goal, then the parametric method will perform poorly. For instance, if we assume a linear relationship between \(X\) and \(Y\) but the true relationship is far from linear, then the resulting model will provide a poor fit to the data, and any conclusions drawn from it will be suspect.

In contrast, non-parametric methods do not explicitly assume a parametric form for \(f(X)\), and thereby provide an alternative and more flexible approach for performing regression.

To illustrate this point, let’s consider a few scenarios and use R to simulate them:

  1. Linear Relationship: When the true relationship between \(X\) and \(Y\) is linear, linear regression outperforms nonparametric regression. Linear regression provides an almost perfect fit in this situation, as it closely matches the underlying relationship.

    set.seed(1234) # for replication
    N <- 1000 # set up the sample size
    x1 <- rnorm(N, mean = 10, sd = 3)
    e <- rnorm(N, mean = 0, sd = 3) # set the errors to be normally distributed 
    y <- 7 + 3*x1 + e # set the true y
    df <- data.frame(id = 1:N, y = y, x1 = x1, e = e)

    In the above chunk, I specify the true \(y\) to be a function of \(x_1\) and \(e\). The true value of the regression coefficient on \(x_1\) is 3, and the intercept is 7. Let’s first plot the true relationship between \(x_1\) and \(y\)!

    library(tidyverse)
    ggplot(data = df, aes(y = y, x = x1)) +
      geom_point(alpha = 0.6) +
      geom_smooth() +
      theme_bw()

    Let’s begin by splitting the generated data into a training set, comprising 80% of the data, and a test set, comprising the remaining 20%.

    set.seed(1234) # set the seed to make the partition reproducible
    train <- df %>% sample_frac(0.8) # should have 800 obs
    test <- anti_join(df, train, by = 'id') # should have 200 obs

    Now, let’s first apply the lm() function to our training set. We’ll then use the coefficients estimated from this model to predict the y-values in the test set, followed by computing the mean squared error (MSE).

    fit <- lm(y ~ x1, data = train)

     

    OLS
    Variables         Estimates   S.E.   C.I. (95%)
    (Intercept)       6.50 ***    0.35   5.81 – 7.19
    x1                3.06 ***    0.03   2.99 – 3.13
    Observations      800
    R2 / R2 adjusted  0.910 / 0.910
    * p<0.05   ** p<0.01   *** p<0.001

    test$y_hat <- predict(fit, newdata = data.frame(x1 = test$x1))
    
    # compute mean squared error (MSE)
    MSE_ols <- sum((test$y - test$y_hat)^2)/nrow(test)
    
    MSE_ols
    ## [1] 8.770813

    The mean squared error (MSE) obtained by using OLS regression to estimate the relationship between \(y\) and \(x_1\) in the training set and then using the estimated coefficients to predict \(y\) in the test set is 8.770813. Now, let’s proceed to use KNN regression with various values of \(K\) to observe how the MSE varies.

    We will use the knn.reg() function from the FNN package for KNN regression. This function requires four arguments to be specified:

    • train: the predictors of the training data
    • test: the predictor values, \(x\), at which we would like to make predictions
    • y: the response for the training data
    • k: the number of neighbors to consider
    # install.packages("FNN")
    library(FNN)
    train_no_y <- train %>% select(x1) 
    test_no_y <- test %>% select(x1) 
    train_y <- train %>% select(y) 
    
    yhat_k1 <- knn.reg(train = train_no_y, test = test_no_y, y = train_y, k = 1)
    yhat_k1
    ## Prediction:
    ##   [1] 29.381821 28.304426 65.609505 23.837209 34.330518 33.436274 21.824014
    ##   [8] 30.847990 32.289959 42.276634 48.391432 53.703199 31.595459 33.423440
    ##  [15] 32.935792 41.304789 26.486024 39.445252 52.119807 38.608137 44.000824
    ##  [22] 38.608137 37.159409 38.540173 33.079801 39.907787 44.810978 37.870818
    ##  [29] 34.348167 42.174368 40.401263 34.996959 27.478094 48.391432 33.978417
    ##  [36] 26.367731 56.238177 31.240796 37.950729 55.185947 34.850824 43.470660
    ##  [43] 44.483329 21.700094 49.108239 44.558484 41.288458 40.623165 31.965684
    ##  [50] 28.174148 35.309716 44.393660 40.512042 48.391432 26.800229 31.877765
    ##  [57] 45.551430 28.354259 39.391663 43.100323 33.168206 25.652004 37.857259
    ##  [64] 42.280182 31.905242 30.070676 45.434786 32.597936 55.713238 30.245094
    ##  [71] 35.747199 24.318503 35.309716 42.142535 44.687740 34.463743 41.607736
    ##  [78] 11.887226 36.074342 55.185947 33.106591 64.709783 20.566998 54.842595
    ##  [85] 45.434786 51.245243 39.073328 28.874519 41.177921 35.357025 30.071689
    ##  [92] 36.168858 35.095983 28.392045 31.241205 32.798233 29.709591 37.191013
    ##  [99] 38.608137 26.841422 44.442204 24.630053 26.367731 19.105273 33.290191
    ## [106] 40.401263 56.238177 21.708475 33.771862 38.654251 42.868435 49.161958
    ## [113] 33.168206 26.492572 27.573812 41.597145 30.479421 45.411737 40.623165
    ## [120] 37.208831 35.170920  8.539242 48.085100 44.483329 20.591604 21.824014
    ## [127] 45.068575 40.911258 35.964535 44.731334 38.432659 44.912912 38.520474
    ## [134] 33.390623 24.145923 22.517012 37.950729 41.465057 36.782189 31.241205
    ## [141] 33.423440 29.709591 28.827443 21.824014 39.739236 49.108239 41.337496
    ## [148] 51.204406 31.801673 31.406837 30.414360 32.690670 39.641522 28.351906
    ## [155] 44.206919 32.373809 48.402430 31.595459 67.724169 27.404886 36.782189
    ## [162] 42.020108 54.160505 36.074342 48.649611 34.330518 44.253326 42.549677
    ## [169] 36.377909 46.279074 27.157538 25.692918 30.150333 29.936407 50.422834
    ## [176] 27.404886 37.870818 33.978417 29.820616 32.646670 28.464324 26.137602
    ## [183] 29.774837 38.787417 29.728468 39.795625 54.842595 48.649611 38.169491
    ## [190] 38.146542 33.233128 36.284027 41.820483 35.838288 38.441584 32.682236
    ## [197] 28.521211 41.054304 52.592726 44.479609

    As observed from the above results, when setting \(K = 1\) in knn.reg(), it provides the predicted \(y\) values for the test set. Now, let’s calculate the MSE for this prediction.

    MSE_k1 <- sum((test$y - yhat_k1$pred)^2)/nrow(test)
    MSE_k1
    ## [1] 20.25002

    As you can see, the MSE when employing KNN regression with \(K = 1\) is significantly higher than the MSE observed with OLS regression. Let’s create a for loop and repeat this process for various values of \(K\) to determine whether the MSE improves or worsens compared to the MSE when using OLS.

    MSE_KNN <- c()
    
    for (i in seq(1, 100, by = 2)){
      yhat_k <- knn.reg(train = train_no_y, test = test_no_y, y = train_y, k = i)
      MSE_k <- sum((test$y - yhat_k$pred)^2)/nrow(test)
      MSE_KNN <- c(MSE_KNN, MSE_k)
    }
    
    MSE_KNN
    ##  [1] 20.250016 11.554227 10.170821 10.020766 10.002982  9.655185  9.468217
    ##  [8]  9.151752  9.058586  8.825635  8.704344  8.733093  8.672769  8.708741
    ## [15]  8.678995  8.627246  8.531880  8.596169  8.653730  8.691982  8.783636
    ## [22]  8.831042  8.858023  8.874769  8.833497  8.853275  8.927840  8.982527
    ## [29]  8.938121  8.948997  9.004996  9.012363  8.970862  8.969159  9.031482
    ## [36]  9.058308  9.110250  9.158928  9.255961  9.305864  9.352528  9.415657
    ## [43]  9.468395  9.539323  9.576483  9.650497  9.693073  9.724640  9.752314
    ## [50]  9.824734
    ggplot(data = data.frame(K = seq(1, 100, by = 2), MSE = MSE_KNN), 
       aes(y = MSE, x = K)) +
      geom_line(alpha = 0.4, color = "darkgreen") +
      geom_point(alpha = 0.3, color = "darkgreen") +
      geom_hline(yintercept = MSE_ols, color = "darkgray", 
             alpha = 0.8, linetype = "dashed") +
      annotate("text", x = 90, y = 18, label = "KNN", color = "darkgreen", alpha = 0.5) +
      annotate("text", x = 90, y = 17, label = "OLS", color = "darkgray") +
      theme_bw()
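
    To carry out the last step of the recipe for choosing \(K\) given earlier, we can pick the value of \(K\) with the smallest test MSE directly from this vector (a small addition, not part of the original code):

    ks <- seq(1, 100, by = 2)    # the grid of K values we looped over
    ks[which.min(MSE_KNN)]       # K with the smallest test MSE

    For the MSE values printed above, the minimum occurs at \(K = 33\).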

    After simulation, it’s evident that when the true relationship between the variables of interest is linear, KNN regression doesn’t offer superior predictions compared to OLS regression.

  2. Slight Non-Linearity: In cases of slight non-linearity, where the true relationship deviates slightly from linearity, KNN regression can perform nearly as well as linear regression. It still provides reasonable results without a substantial reduction in prediction accuracy.

  3. Strong Non-Linearity: However, in situations with a strong non-linear relationship, KNN regression outperforms linear regression. This is because KNN can adapt to complex relationships, providing more accurate predictions.

    set.seed(1234) # for replication
    N <- 1000 # set up the sample size
    x1 <- rnorm(N, mean = 1, sd = 10)
    e <- rnorm(N, mean = 3, sd = 10) # set the errors to be normally distributed 
    y <- 3 + 4*(x1+3)^2*(x1-1)^3 + e # set the true y
    df <- data.frame(id = 1:N, y = y, x1 = x1, e = e)

    For example, let’s now generate a relationship between \(x_1\) and \(y\) that is strongly non-linear. I specify the true \(y\) to be a function of \(x_1\) and \(e\). Overall, the formula, \(y = 3+4(x_1+3)^2(x_1-1)^3+e\), describes a relationship where \(y\) is influenced by \(x_1\) in a nonlinear way, with two distinct “bumps” or curves, along with some random variability represented by \(e\).

    library(tidyverse)
    ggplot(data = df, aes(y = y, x = x1)) +
      geom_point(alpha = 0.6) +
      geom_smooth() +
      theme_bw()

    Following our previous approach, let’s now compare the MSE obtained from OLS regression, used to estimate the relationship between \(y\) and \(x_1\), with the MSE derived from KNN regression, employing different values of \(K\).

    • Step 1: Split the Sample into Training and Test Sets

      set.seed(1234) # set the seed to make the partition reproducible
      train <- df %>% sample_frac(0.8) # should have 800 obs
      test <- anti_join(df, train, by = 'id') # should have 200 obs
    • Step 2: Run OLS Regression and Compute the MSE

      fit2 <- lm(y ~ x1, data = train)

       

      OLS
      Variables         Estimates      S.E.        C.I. (95%)
      (Intercept)       95970.43       389692.62   -668973.27 – 860914.13
      x1                734839.02 ***  38319.47    659620.15 – 810057.88
      Observations      800
      R2 / R2 adjusted  0.315 / 0.315
      * p<0.05   ** p<0.01   *** p<0.001

      test$y_hat <- predict(fit2, newdata = data.frame(x1 = test$x1))
      MSE_ols <- sum((test$y - test$y_hat)^2)/nrow(test)
      
      MSE_ols
      ## [1] 4.742095e+13
    • Step 3: Run KNN and Compute the MSE

      train_no_y <- train %>% select(x1) 
      test_no_y <- test %>% select(x1) 
      train_y <- train %>% select(y) 
      MSE_KNN <- c()
      
      for (i in seq(1, 100, by = 2)){
        yhat_k <- knn.reg(train = train_no_y, test = test_no_y, y = train_y, k = i)
        MSE_k <- sum((test$y - yhat_k$pred)^2)/nrow(test)
        MSE_KNN <- c(MSE_KNN, MSE_k)
      }
      
      MSE_KNN
      ##  [1] 1.006088e+11 1.791096e+11 1.840990e+11 5.863832e+11 9.814868e+11
      ##  [6] 4.548076e+11 1.081960e+12 1.129925e+12 1.487238e+12 2.374122e+12
      ## [11] 2.763030e+12 3.688595e+12 4.638701e+12 5.402272e+12 6.302316e+12
      ## [16] 6.098534e+12 6.924194e+12 7.720989e+12 8.508775e+12 9.279200e+12
      ## [21] 9.522820e+12 1.021639e+13 1.021119e+13 1.086252e+13 1.149660e+13
      ## [26] 1.209597e+13 1.238768e+13 1.296902e+13 1.354249e+13 1.352895e+13
      ## [31] 1.408380e+13 1.458944e+13 1.475359e+13 1.524066e+13 1.571534e+13
      ## [36] 1.618749e+13 1.664102e+13 1.702627e+13 1.745704e+13 1.775741e+13
      ## [41] 1.815320e+13 1.818130e+13 1.853354e+13 1.892895e+13 1.930142e+13
      ## [46] 1.966064e+13 2.001759e+13 2.037685e+13 2.072546e+13 2.106549e+13
    • Step 4: Compare the MSE obtained from OLS and with that from KNN

      ggplot(data = data.frame(K = seq(1, 100, by = 2), MSE = MSE_KNN), 
             aes(y = MSE, x = K)) +
        geom_line(alpha = 0.4, color = "darkgreen") +
        geom_point(alpha = 0.3, color = "darkgreen") +
        geom_hline(yintercept = MSE_ols, color = "darkgray", 
           alpha = 0.8, linetype = "dashed") +
        annotate("text", x = 90, y = 4e+13, label = "KNN", color = "darkgreen", alpha = 0.5) +
        annotate("text", x = 90, y = 3.7e+13, label = "OLS", color = "darkgray") +
        theme_bw()

      As observed, regardless of the choice of \(K\) in our KNN regression, the MSE obtained from KNN is consistently much smaller than that from OLS.

  4. Curse of Dimensionality: When dealing with high-dimensional data (i.e., when you have a lot of predictors), KNN regression may suffer from the “curse of dimensionality,” and its performance deteriorates significantly as the dimensionality of the data increases. As the number of dimensions grows, the distance between data points becomes less meaningful: in high-dimensional spaces the concept of distance becomes less discriminating, as most data points are located far apart from each other. Consequently, KNN may struggle to find relevant neighbors, leading to poor predictive performance and increased computational complexity. On the other hand, linear regression, with its fewer parameters, is less affected by this issue.

    Even in problems in which the dimension is small, we might prefer linear regression to KNN from an interpretability standpoint. If the test MSE of KNN is only slightly lower than that of linear regression, we might be willing to forego a little bit of prediction accuracy for the sake of a simple model that can be described in terms of just a few coefficients, and for which p-values are available.

2. Simple Linear Regression

Below is the population regression line equation, assuming that we are omniscient and know the data-generating process of the relationship between \(X\) (independent/explanatory variable) and \(Y\) (dependent/response/outcome variable):

\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\]

However, since we usually don’t have the population data but only have access to a sample of it, we can only use the sample dataset to estimate the population regression line. In the below figure, \(b_0\) is the estimate of the regression intercept \(\beta_0\), while \(b_1\) is the estimate of the regression coefficient/slope \(\beta_1\).

The Ordinary Least Squares (OLS) approach aims to fit a line that minimizes the sum of the squared residuals, which means that the goal is to find a pair of \(b_0\) and \(b_1\) that can minimize the difference between our observed \(Y_i\) and estimated \(\hat{Y_i}\).

\[S(A, B) = \Sigma E_i^2 = \Sigma (Y_i-\hat{Y_i})^2 = \Sigma (Y_i-(A+BX_i))^2\]

\[S(A, B) = \Sigma (Y_i-A-BX_i)^2\]

Finding the pair \(A\) and \(B\) (our estimates \(b_0\) and \(b_1\)) that minimizes this sum is an optimization problem. All we have to do is take the partial derivatives of the above equation with respect to \(A\) and \(B\), respectively, set them equal to zero, and then solve.

\[\frac{\partial S(A, B)}{\partial A}=\Sigma (-1)(2)(Y_i-A-BX_i)=0\]

\[\frac{\partial S(A, B)}{\partial B}=\Sigma (-X_i)(2)(Y_i-A-BX_i)=0\]

By solving the above two equations, you will get:

\[A = \bar{Y}-B\bar{X}\]

\[B = \frac{\Sigma (X_i-\bar{X})(Y_i-\bar{Y})}{\Sigma (X_i-\bar{X})^2}\]

a. Gauss-Markov Theorem

Gauss Markov theorem states that “if the errors are independently distributed with zero expectation and constant variance, then the least-squares estimator b is the most efficient linear unbiased estimator of \(\beta\). That is, of all unbiased estimators that are linear functions of the observations, the least-squares estimator has the smallest sampling variance and, hence, the smallest mean-squared error. For this reason, the least-squares estimator is sometimes termed BLUE, an acronym for best linear unbiased estimator” (Fox 2016: 212).

In other words, under certain conditions, the ordinary least squares (OLS) estimates are thought to be the best linear unbiased estimator (BLUE). Here, I review a set of assumptions (known as Gauss-Markov assumptions) that leads to a BLUE estimator.

  1. Linearity: the expected (mean) value of the disturbance term is 0 (i.e., \(E(u_i)=0\)).

    Fox (2016) explains that a violation of this assumption implies “that the model fails to capture the systematic pattern of relationship between the response and explanatory variables” (307). This assumption could be broken in two ways: one, the mean of the error could be a nonzero constant across all values of \(X_i\); two, it could vary with \(X_i\). In the former case, the intercept term will be consistently off by the error amount; a uniform measurement error across all observations would be a cause of such an error (Berry 1993: 42). A common cause of the latter case is omitted variable bias. Omitted variable bias deserves particular attention because it can cause the OLS estimate to be both biased and inconsistent. Another common cause is the wrong functional form (e.g., specifying a linear relation instead of a quadratic relation). Thus, when nonlinearity is detected, we should explore not only variable transformations and different model specifications (Fox, 2016: 317), but also the possibility of omitted variables. In short, before running OLS, researchers should make sure that the inclusion of relevant variables in the model and the specification of the functional form are reasonable and based on theoretical understanding.

  2. Nonstochastic regressors: \(X\) values are independent of the error term (i.e., \(cov(X_i, u_i)=0\)).

    The violation of this assumption suggests that there is an endogeneity problem between the explanatory and outcome variables, which can arise from measurement error on \(X\), omitted confounders, or reciprocal causation between \(X\) and \(Y\). Again, if the main goal of the research is to provide an unbiased and consistent estimator to explain how a social phenomenon works in a real-life setting, providing a convincing explanation requires us to make sure that the effect of our explanatory variable is actually exogenous. Fortunately, scholars have developed many methods to address this issue. For example, researchers could employ an instrumental variable or matching to improve the unbiasedness of the estimated effect.

  3. Homoskedasticity: constant error variance across values of \(X_i\) (i.e., \(Var(\varepsilon)=\sigma^2\)).

    One instance where the homoskedasticity assumption is often violated is cross-sectional studies (Berry, 1993: 73). For example, because more developed countries can have better data accuracy, measurement error could be correlated with the level of development. Thus, in a study that includes the level of development as one of the independent variables, the variance of the error term may not be constant (Berry, 1993: 73). In such circumstances, Fox (2016) suggests that we can run weighted-least-squares (WLS) regression to overcome this problem (304). Other solutions include transformation of the dependent variable and correction of coefficient standard errors (e.g., robust standard errors and clustered standard errors) for heteroskedasticity (Fox, 2016: 307).

  4. Independence: the observations are sampled independently; there is no autocorrelation between disturbances (i.e., \(cov(u_i, u_j)=0\) for \(i \neq j\)).

    The independence assumption is important for time-series analysis, where the assumption of spherical errors is often violated. To overcome this, Beck & Katz (1995) propose panel-corrected standard errors when analyzing time-series cross-section data.

  5. Normality: errors are distributed normally (i.e.,\(\varepsilon \sim N(0, \sigma^2)\)).

Assumptions 1 and 2 are related to the bias of the estimator, whereas assumptions 3 and 4 are related to its efficiency: when assumption 1 or 2 is violated, the OLS estimator is no longer unbiased; when assumption 3 or 4 is violated, the OLS estimator is no longer efficient. Under this weak set of assumptions (1-4), according to the Gauss-Markov theorem, the OLS estimator is BLUE.

(In statistics, efficiency is a measure of the quality of an estimator, which can be characterized by having a smaller possible variance.)

With the fifth assumption, normality, the OLS estimator is the most efficient among all unbiased estimators, not just linear ones. Adding this fifth assumption turns the weak set of Gauss-Markov assumptions into the strong set. This last assumption is important when we aim to make causal inferences using the OLS estimator. With the assumption of normally distributed errors, we can also infer that the OLS estimators have normal sampling distributions. As a result, it allows us to apply statistical tests such as t-tests even in small-sample studies.

One important caveat is that when there is perfect multicollinearity (i.e. when two or more independent variables are perfectly correlated), we cannot even get an estimate using the OLS method. However, even when there is high multicollinearity, the OLS estimate is still BLUE (Berry 1993: 27). In such cases, standard errors will be very high, making the estimates fluctuate considerably from sample to sample (Berry 1993: 27).

The below table presents the consequences of violating the Gauss-Markov assumptions and the corresponding suggested solutions.

Assumption violated       Consequence of violation         Suggested solution
Linearity                 Biased/inconsistent estimates    Transformations, polynomials, different model
Nonstochastic regressors  Biased/inconsistent estimates    Instrumental variables
Homoskedasticity          Biased standard errors           Robust standard errors
Independence              Biased standard errors           Fixed effects
Perfect collinearity      Cannot run OLS                   Omit one collinear term

b. Properties of Least Squares Estimator

The sample least squares coefficients are unbiased estimators of the population regression coefficients. (A small simulation check of the results derived below appears after the derivations.)

Here are some reviews before demonstrating the unbiasedness of the least-squares estimators \(A\) and \(B\) for \(\alpha\) and \(\beta\):

  • The assumed model: \(Y_i=\alpha+\beta X_i+\varepsilon_i\)
  • Linearity assumption: \(E(\varepsilon_i)=0\)
  • Normality assumption: \(\varepsilon_i\sim N(0, \sigma_\varepsilon^2)\)
  • Independence assumption: \(Cov(\varepsilon_i, \varepsilon_j)=0\) for \(i\neq j\)
  • The expectation of linear combination: \(E(a+bY)=a+bE(Y)\)
  • The variance of linear combination:
    • \(Var(a+bY)=b^2Var(Y)\)
    • \(Var(X+Y)=Var(X)+Var(Y)+2Cov(X, Y)\)
  • For the below demonstration, the \(X_i\) are assumed to be fixed, not random.
    • \(B=\frac{\Sigma(X_i-\bar{X})(Y_i-\bar{Y})}{\Sigma(X_i-\bar{X})^2}=\frac{\Sigma(X_i-\bar{X})Y_i-\Sigma(X_i-\bar{X})\bar{Y}}{\Sigma(X_i-\bar{X})^2}=\frac{\Sigma(X_i-\bar{X})Y_i-\bar{Y}\Sigma(X_i-\bar{X})}{\Sigma(X_i-\bar{X})^2}=\frac{\Sigma(X_i-\bar{X})Y_i}{\Sigma(X_i-\bar{X})^2}\)
    • \(\Sigma(X_i-\bar{X})^2=\Sigma(X_i-\bar{X})(X_i-\bar{X})=\Sigma X_i(X_i-\bar{X})-\bar{X}\Sigma(X_i-\bar{X})=\Sigma X_i(X_i-\bar{X})\)
  1. Demonstrate the unbiasedness of the least-squares estimators \(B\) for \(\beta\) in simple regression (i.e., \(E(B) = \beta\)).

    \(E(B)=E\left(\frac{\Sigma(X_i-\bar{X})(Y_i-\bar{Y})}{\Sigma(X_i-\bar{X})^2}\right)\)

    \(=E\left(\frac{\Sigma(X_i-\bar{X})Y_i}{\Sigma(X_i-\bar{X})^2}\right)\)

    \(=\frac{1}{\Sigma(X_i-\bar{X})^2}E(\Sigma(X_i-\bar{X})Y_i)\)

    \(=\frac{1}{\Sigma(X_i-\bar{X})^2}\Sigma(X_i-\bar{X})E(Y_i)\)

    \(=\frac{1}{\Sigma(X_i-\bar{X})^2}\Sigma(X_i-\bar{X})E(\alpha + \beta X_i+\varepsilon_i)\)

    \(=\frac{1}{\Sigma(X_i-\bar{X})^2}\Sigma(X_i-\bar{X})(\alpha +\beta X_i+E(\varepsilon_i))\)

    \(=\frac{1}{\Sigma(X_i-\bar{X})^2}\Sigma(X_i-\bar{X})(\alpha+\beta X_i)\)

    \(=\frac{1}{\Sigma(X_i-\bar{X})^2}(\Sigma(X_i-\bar{X})\alpha+\Sigma(X_i-\bar{X})\beta X_i)\)

    \(=\frac{\beta}{\Sigma(X_i-\bar{X})^2}(\Sigma(X_i-\bar{X})X_i)\)

    \(=\beta\)

  2. Demonstrate the unbiasedness of the least-squares estimators \(A\) for \(\alpha\) in simple regression (i.e., \(E(A) = \alpha\)).

    \(E(A)=E(\bar{Y}-B\bar{X})\)

    \(=E(\frac{1}{n}\Sigma Y_i-\frac{\Sigma(X_i-\bar{X})(Y_i-\bar{Y})}{\Sigma(X_i-\bar{X})^2}\bar{X})\)

    \(=E(\frac{1}{n}\Sigma Y_i-\frac{\Sigma(X_i-\bar{X})Y_i}{\Sigma(X_i-\bar{X})^2}\bar{X})\)

    \(=E(\Sigma(\frac{1}{n}-\frac{\bar{X}(X_i-\bar{X})}{\Sigma(X_i-\bar{X})^2})Y_i)\)

    \(=E(\Sigma(\frac{1}{n}-\frac{\bar{X}(X_i-\bar{X})}{\Sigma(X_i-\bar{X})^2})(\alpha +\beta X_i +\varepsilon_i))\)

    \(=\Sigma(\frac{1}{n}-\frac{\bar{X}(X_i-\bar{X})}{\Sigma(X_i-\bar{X})^2})E(\alpha +\beta X_i +\varepsilon_i)\) (the weights are constants because the \(X_i\) are fixed)

    \(=\Sigma(\frac{1}{n}-\frac{\bar{X}(X_i-\bar{X})}{\Sigma(X_i-\bar{X})^2})(\alpha+\beta X_i+E(\varepsilon_i))\)

    \(=\Sigma(\frac{1}{n}-\frac{\bar{X}(X_i-\bar{X})}{\Sigma(X_i-\bar{X})^2})(\alpha+\beta X_i)\)

    \(=\Sigma \frac{\alpha}{n}+\Sigma \beta \frac{X_i}{n}-\alpha \frac{\bar{X}\Sigma(X_i-\bar{X})}{\Sigma(X_i-\bar{X})^2}-\beta \frac{\bar{X}\Sigma(X_i-\bar{X})X_i}{\Sigma(X_i-\bar{X})^2}\)

    \(=\alpha + \beta \bar{X}-\alpha*0-\beta \bar{X}\)

    \(=\alpha\)

  3. Derive the sampling variance of \(B\) in a simple regression.

    \(Var(B)=Var(\frac{\Sigma(X_i-\bar{X})Y_i}{\Sigma(X_i-\bar{X})^2})\)

    \(=\frac{1}{(\Sigma(X_i-\bar{X})^2)^2}Var(\Sigma(X_i-\bar{X})Y_i)\)

    \(=\frac{1}{(\Sigma(X_i-\bar{X})^2)^2}Var(\Sigma(X_i-\bar{X})(\alpha + \beta X_i+\varepsilon_i))\)

    \(=\frac{1}{(\Sigma(X_i-\bar{X})^2)^2}Var(\Sigma(X_i-\bar{X})(\alpha + \beta X_i)+\Sigma(X_i-\bar{X})\varepsilon_i)\)

    \(=\frac{1}{(\Sigma(X_i-\bar{X})^2)^2}Var(\Sigma(X_i-\bar{X})\varepsilon_i)=\frac{1}{(\Sigma(X_i-\bar{X})^2)^2}\Sigma Var((X_i-\bar{X})\varepsilon_i)\)

    \(=\frac{1}{(\Sigma(X_i-\bar{X})^2)^2}(\Sigma (X_i-\bar{X})^2 Var(\varepsilon_i))\)

    \(=\frac{1}{(\Sigma(X_i-\bar{X})^2)^2}(\Sigma (X_i-\bar{X})^2 \sigma_\varepsilon^2)\)

    \(=\frac{\sigma_\varepsilon^2}{\Sigma(X_i-\bar{X})^2}\)

  4. Derive the sampling variance of \(A\) in a simple regression.

    \(Var(A)=Var(\bar{Y}-B\bar{X})\)

    \(=Var(\bar{Y})+Var(B\bar{X})-2Cov(\bar{Y},B\bar{X})\)

    \(=Var(\bar{Y})+Var(B\bar{X})\) (since \(Cov(\bar{Y}, B)=0\) when the \(X_i\) are fixed)

    \(=Var(\frac{\Sigma Y_i}{n})+\bar{X}^2Var(B)\)

    \(=\frac{1}{n^2}Var(\Sigma Y_i)+\bar{X}^2Var(B)\)

    \(=\frac{1}{n^2}\Sigma Var(Y_i)+\bar{X}^2Var(B)\)

    \(=\frac{1}{n^2}\Sigma Var(\alpha+\beta X_i+\varepsilon_i)+\bar{X}^2Var(B)\)

    \(=\frac{1}{n^2}\Sigma Var(\varepsilon_i)+\bar{X}^2Var(B)\)

    \(=\frac{1}{n^2}n\sigma_\varepsilon^2+\bar{X}^2(\frac{\sigma_\varepsilon^2}{\Sigma(X_i-\bar{X})^2})\)

    \(=\sigma_\varepsilon^2(\frac{1}{n}+\frac{\bar{X}^2}{\Sigma(X_i-\bar{X})^2})\)

    \(=\sigma_\varepsilon^2(\frac{\Sigma(X_i-\bar{X})^2 +n\bar{X}^2}{n\Sigma(X_i-\bar{X})^2})\)

    \(=\sigma_\varepsilon^2(\frac{\Sigma X_i^2-2\bar{X}\Sigma X_i+\Sigma \bar{X}^2+n\bar{X}^2}{n\Sigma(X_i-\bar{X})^2})\)

    \(=\sigma_\varepsilon^2(\frac{\Sigma X_i^2-2\bar{X}\Sigma X_i+2n \bar{X}^2}{n\Sigma(X_i-\bar{X})^2})\)

    \(=\sigma_\varepsilon^2(\frac{\Sigma X_i^2-2\frac{\Sigma X_i}{n}\Sigma X_i+2n \bar{X}^2}{n\Sigma(X_i-\bar{X})^2})\)

    \(=\sigma_\varepsilon^2(\frac{\Sigma X_i^2-\frac{2(\Sigma X_i)^2}{n}+2n\frac{(\Sigma X_i)^2}{n^2}}{n\Sigma(X_i-\bar{X})^2})\)

    \(=\frac{\sigma_\varepsilon^2 \Sigma X_i^2}{n\Sigma(X_i-\bar{X})^2}\)
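
As promised above, here is a small simulation sketch in R (with assumed values a_true, b_true, sig, and a fixed X_sim, all chosen purely for illustration) that numerically checks the unbiasedness of \(B\) and its sampling variance:

set.seed(1234)
n <- 50; a_true <- 2; b_true <- 0.5; sig <- 1
X_sim <- runif(n, 0, 10)                   # the X values are held fixed across replications
B_hat <- replicate(5000, {
  y_sim <- a_true + b_true * X_sim + rnorm(n, 0, sig)                 # draw a fresh set of errors
  sum((X_sim - mean(X_sim)) * y_sim) / sum((X_sim - mean(X_sim))^2)   # the estimator B
})
mean(B_hat)                              # should be close to b_true = 0.5
var(B_hat)                               # should be close to the theoretical variance below
sig^2 / sum((X_sim - mean(X_sim))^2)     # Var(B) from the derivation above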

c. Confidence Interval & Hypothesis Testing

As mentioned earlier, the fifth assumption, normality, allows us to apply statistical inference tests to assess the \(\beta\)s we estimate. Recall from the lecture that the residual standard error \(S_E\) (how closely the regression line we estimate fits the scatter of our data points) is:

\[\hat{\sigma } = SE(E_i) = \sqrt{\frac{\Sigma (E_i)^2}{n-2}} = \sqrt{\frac{\Sigma (Y_i-\hat{Y_i})^2}{n-2}}\]

And the standard error of the sample intercept (\(A\)) and slope (\(B\)) are:

\[SE(A) = \hat{\sigma }\sqrt{\frac{\Sigma X_i^2}{n\Sigma (X_i-\bar{X})^2}}\]

\[SE(B) = \frac{\hat{\sigma }}{\sqrt{\Sigma (X_i-\bar{X})^2}}\]

With standard errors, we can construct a \(100(1-\alpha)\%\) confidence interval for our slope and perform hypothesis testing.

  • Confidence Interval

    The \(100(1-\alpha)\%\) confidence interval for \(\beta\) is \(B \pm t_{\frac{\alpha}{2}, df=n-2}SE(B)\).

  • Hypothesis Test

    Two-tailed test (by default, the lm() function conducts a two-tailed test)

    \(H_0: \beta = 0\)

    \(H_1: \beta \neq 0\)

    \(t=\frac{B-\beta_1}{SE(B)}=\frac{B-0}{SE(B)}\)

    Compare this t-statistic with \(t_{\frac{\alpha}{2}, df=n-2}\) to see whether its absolute value exceeds the critical value. If it does, we can reject \(H_0\) in favor of \(H_1\). In addition to comparing the t-statistic with the critical value, we can also calculate a p-value based on the t-statistic to perform the hypothesis test (p-value \(= 2*Pr(t \geq |t_c|)\) vs \(\alpha\)).

    One-tailed test (if your hypotheses are directional, you can also conduct a one-tailed test)

    \(H_0: \beta = 0\)

    \(H_1: \beta < 0\) or \(H_1: \beta > 0\)

    \(t=\frac{B-\beta_1}{SE(B)}=\frac{B-0}{SE(B)}\)

    When performing a one-tailed test, compare the t-statistic with \(t_{\alpha, df=n-2}\), without dividing \(\alpha\) by 2. The p-value in a one-tailed test is calculated as:

    p-value \(= Pr(t \geq t_c)\) or \(Pr(t \leq t_c)\), compared against \(\alpha\)

    Let’s run a simple example using the prestige data from Problem Set 1. The prestige data consists of 10 observations on 2 variables. The descriptions of the variables are as follows:

    • prestige: The average prestige rating for the occupation.

    • education: The average number of years of education for occupational incumbents.

    # load data
    df <- read.csv("/Users/yu-shiuanhuang/Desktop/method-sequence/data/ps1_prestige.csv") 
    df
    ##    prestige education
    ## 1        82        86
    ## 2        83        76
    ## 3        90        92
    ## 4        76        90
    ## 5        90        86
    ## 6        87        84
    ## 7        93        93
    ## 8        90       100
    ## 9        52        87
    ## 10       88        86

    Let’s first calculate all the statistics we’ll need to perform hypothesis testing.

    1. Slope: \(B\)

      \(B=\frac{\Sigma(X_i-\bar{X})(Y_i-\bar{Y})}{\Sigma(X_i-\bar{X})^2}\)

    # calculate the slope b1
    B <- sum((df$education-mean(df$education))*(df$prestige-mean(df$prestige)))/
    sum((df$education-mean(df$education))^2)
    B
    ## [1] 0.3895028
    2. Intercept: \(A\)

      \(A=\bar{Y}-B\bar{X}\)

    # calculate the intercept b0
    A <- mean(df$prestige)-B*mean(df$education)
    A
    ## [1] 48.82376
    3. Residual Standard Error: \(\hat{\sigma} = SE(E_i)\)

      \(\hat{\sigma } = SE(E_i) = \sqrt{\frac{\Sigma (E_i)^2}{n-2}} = \sqrt{\frac{\Sigma (Y_i-\hat{Y_i})^2}{n-2}}\)

    # generate predicted Y
    df$pred <- A + B*df$education
    
    #calculate the residual standard error 
    se <- sqrt(sum((df$prestige-df$pred)^2)/(nrow(df)-2))
    se
    ## [1] 12.46986
    4. Standard Error of the Estimated Intercept and Slope: \(SE(A)\) & \(SE(B)\)

      \(SE(A) = \hat{\sigma }\sqrt{\frac{\Sigma X_i^2}{n\Sigma (X_i-\bar{X})^2}}\)

      \(SE(B) = \frac{\hat{\sigma }}{\sqrt{\Sigma (X_i-\bar{X})^2}}\)

    # calculate the standard error of the estimated intercept b0
    se_A <- se*sqrt(sum(df$education^2)/(nrow(df)*sum((df$education-mean(df$education))^2)))
    se_A
    ## [1] 57.80998
    # calculate the standard error of the estimated slope b1
    se_B <- se/sqrt(sum((df$education-mean(df$education))^2))
    se_B
    ## [1] 0.6554015

    Now, let’s perform two-tailed hypothesis testing of the slope.

    \(H_0: \beta = 0\)

    \(H_1: \beta \neq 0\)

    \(t=\frac{B-\beta_1}{SE(B)}=\frac{B-0}{SE(B)}=\frac{0.3895028-0}{0.6554015}=0.5942964\)

    # calculate the t-statistic
    t <- B/se_B
    t
    ## [1] 0.5942964

    What are the t critical values in both tails? Let’s say the level of significance, \(\alpha\), is 0.05 in this case.

    \(|t_{\frac{\alpha}{2}, df=n-2}|=|t_{\frac{0.05}{2}, df=10-2}|=2.306004\)

    # find t critical value
    qt(p = 0.025, df = 8, lower.tail = TRUE)
    ## [1] -2.306004
    qt(p = 0.025, df = 8, lower.tail = FALSE)
    ## [1] 2.306004

    Now, let’s compare our t-statistic of \(b_1\) with the t critical value.

    \(t=0.5942964 < |t_{\frac{0.05}{2}, df=8}|=2.306004\)

    We can also calculate the p-value.

    p-value \(= 2*Pr(t \geq 0.5942964)=0.5687352 > 0.05\)

    # find p-value
    2*pt(q = 0.5942964, df = 8, lower.tail = FALSE)
    ## [1] 0.5687352

    These all suggest that we cannot reject \(H_0\), which means that the positive effect of education on one’s prestige is not statistically significant.

    Let’s also try performing a one-tailed hypothesis test for this case.

    \(H_0: \beta = 0\)

    \(H_1: \beta > 0\)

    \(t=\frac{B-\beta_1}{SE(B)}=\frac{B-0}{SE(B)}=\frac{0.3895028-0}{0.6554015}=0.5942964\)

    What is the t critical value in the right tail?

    \(t_{\alpha, df=n-2}=t_{0.05, df=10-2}=1.859548\)

    # find t critical value
    qt(p = 0.05, df = 8, lower.tail = FALSE)
    ## [1] 1.859548

    Now again, let’s compare our t-statistic of \(B\) with the t critical value.

    \(t=0.5942964 < t_{0.05, df=8}=1.859548\)

    And the p-value in the one-tailed test is:

    p-value \(=Pr(t \geq 0.5942964)= 0.2843676 > 0.05\)

    # find p-value
    pt(q = 0.5942964, df = 8, lower.tail = FALSE)
    ## [1] 0.2843676

    In R, lm() does all the above calculation for us!

    # simple regression
    fit <- lm(prestige ~ education, data = df)
    summary(fit)
    ## 
    ## Call:
    ## lm(formula = prestige ~ education, data = df)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -30.7105   0.3157   4.9580   5.6238   7.9525 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)  48.8238    57.8100   0.845    0.423
    ## education     0.3895     0.6554   0.594    0.569
    ## 
    ## Residual standard error: 12.47 on 8 degrees of freedom
    ## Multiple R-squared:  0.04228,    Adjusted R-squared:  -0.07743 
    ## F-statistic: 0.3532 on 1 and 8 DF,  p-value: 0.5687
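
    As a quick check (not part of the original code), the confidence-interval formula given earlier can be evaluated with the quantities we computed by hand and compared with R’s built-in confint():

    # 95% CI for the slope: B plus/minus t(alpha/2, df = n - 2) * SE(B)
    B + c(-1, 1) * qt(p = 0.975, df = 8) * se_B
    # R computes the same interval from the fitted model
    confint(fit, level = 0.95)

    Because this interval contains 0, it agrees with the hypothesis-test conclusion above.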

d. Evaluating Least Squares Fit

We understand that there will be variability in the variable prestige across different occupations, with some having high prestige and others having low prestige. Some of this variability can be attributed to factors not included as variables in our regression model, such as economic conditions and other unmeasured variables. Additionally, differences in the variable education may explain some of the differences in prestige across occupations. For instance, occupations with a higher average number of years of education among incumbents tend to have higher prestige. However, it’s important to quantify how much of the variability in prestige can be explained by the variables included in our regression model, and how much remains unexplained by other factors that we have not accounted for.

  1. Explained Variability: Regression Sum of Squares (RegSS)

    The Regression Sum of Squares (RegSS) (sometimes referred to as the Sum of Squares Explained, SSE) is a measure of the variability in the outcome variable that is explained by the explanatory variables, i.e. the x-variables in your regression. It is given by the following sum:

    \(RegSS = \sum_{i=1}^{n}(\hat{y_i}-\bar{y})^2\)

  2. Residual or Unexplained Variability: Residual Sum of Squares (RSS)

    The Residual Sum of Squares (RSS) is a measure of the variability in the outcome variable that is not explained by your regression, and therefore is due to all the other factors that affect prestige besides education. It is given by the following sum:

    \(RSS = \sum_{i=1}^{n}(y_i-\hat{y_i})^2 = \sum_{i=1}^{n}e_i^2\)

  3. Total Variability: Total Sum of Squares (TSS)

    You can show mathematically that \(RSS + RegSS\) is equal to the following expression, which is referred to as the Total Sum of Squares (TSS):

    \(TSS = \sum_{i=1}^{n}(y_i-\bar{y})^2 = \sum_{i=1}^{n}(y_i-\hat{y_i})^2 + \sum_{i=1}^{n}(\hat{y_i}-\bar{y})^2 = RSS + RegSS\)

  4. Coefficient of Determination: R-Squared value

    The coefficient of determination, sometimes referred to as the R-Squared value, is a measure of what percentage of the variability in your outcome variable is explained by your explanatory variables (a short R check using the prestige example above follows this list). It is given by the expression,

    \(R^2 = \frac{RegSS}{TSS}\)
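
Using the prestige example from above (this assumes the fitted model fit and the data df are still in memory), these quantities can be computed directly in R:

RegSS <- sum((fitted(fit) - mean(df$prestige))^2)   # explained variability
RSS   <- sum(residuals(fit)^2)                      # unexplained variability
TSS   <- sum((df$prestige - mean(df$prestige))^2)   # total variability
all.equal(RegSS + RSS, TSS)                         # the decomposition holds
RegSS / TSS   # R-squared; matches the Multiple R-squared from summary(fit) above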

Discussion II

For today’s discussion, we will review concepts regarding how to find a set of \(\hat{\beta}\)s that can minimize the squared residual errors of our multivariate linear model using matrix algebra.

  1. Multivariate Linear Regression
  2. Find Coefficients using Matrix in R

1. Multivariate Linear Regression

The central difference between simple and multiple linear regression is that the slope coefficients for the explanatory variables in multiple regression are partial coefficients. That is, each one represents the “effect” on the response variable of a one-unit increment in the corresponding explanatory variable, holding constant the values of the other explanatory variables.

\[Y_i=\hat{\beta_0}+\hat{\beta_1}X_{i1}+\hat{\beta_2}X_{i2}+...+\hat{\beta_k}X_{ik}+\varepsilon_i\]

Recall that the goal of the OLS approach is to minimize the discrepancy between our observed \(Y_i\) and estimated \(\hat{Y_i}\). To find a set of estimated \(\beta\)s that can minimize this discrepancy is an optimization problem. All we have to do is to take the partial derivatives of the above equation with respect to each coefficient, set them equal to zero, and then solve it. However, as shown in the lecture, adding more and more \(X\) variables to our linear regression model makes the calculation more and more tedious. In order to find the estimated \(\beta\)s more efficiently, we use matrix algebra to help us do the calculation.

To do this, let’s first think of our data set as a matrix. The below equations show how each observation in our data set is generated.

\[Y_1=\hat{\beta_0}+\hat{\beta_1}X_{11}+\hat{\beta_2}X_{12}+...+\hat{\beta_k}X_{1k}+\varepsilon_1\]

\[Y_2=\hat{\beta_0}+\hat{\beta_1}X_{21}+\hat{\beta_2}X_{22}+...+\hat{\beta_k}X_{2k}+\varepsilon_2\]

\[\vdots\]

\[Y_n=\hat{\beta_0}+\hat{\beta_1}X_{n1}+\hat{\beta_2}X_{n2}+...+\hat{\beta_k}X_{nk}+\varepsilon_n\]

Rewrite them into a matrix form.
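
Stacking the \(n\) equations gives one matrix equation:

\[\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1k} \\ 1 & X_{21} & X_{22} & \cdots & X_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{nk} \end{bmatrix} \begin{bmatrix} \hat{\beta_0} \\ \hat{\beta_1} \\ \vdots \\ \hat{\beta_k} \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}\]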

The system of equations is summarized as the y vector equal to the product of the X matrix of data times the \(\hat{\beta}\) vector of parameters plus the vector of disturbances, \(\varepsilon\). That is, the fitted linear model we are trying to find is: \(y = X\hat{\beta}+\varepsilon\). In the lecture, Lauren showed how to apply what we did when finding the intercept and slope in simple linear regression to solve the optimization problem (i.e., minimize the RSS) and find the coefficient vector \(\hat{\beta}\) using matrix algebra.

\[ \begin{aligned} S(\hat{\beta}) &= \sum E_i^2 = \varepsilon'\varepsilon = (y- X\hat{\beta})'(y- X\hat{\beta})\\ &= y'y-y'X\hat{\beta}-\hat{\beta}'X'y+\hat{\beta}'X'X\hat{\beta} \\ &= y'y-(2y'X)\hat{\beta}+\hat{\beta}'(X'X)\hat{\beta} \end{aligned} \]

Instead of taking the derivative with respect to each coefficient one at a time, we can take the derivative directly with respect to the coefficient vector \(\hat{\beta}\).

\[ \begin{aligned} \frac{\partial S(\hat{\beta})}{\partial \hat{\beta}} &= \frac{\partial}{\partial \hat{\beta}}\left(y'y-(2y'X)\hat{\beta}+\hat{\beta}'(X'X)\hat{\beta}\right) \\ 0 &= -2X'y+ 2(X'X)\hat{\beta} \\ (X'X)\hat{\beta} &= X'y \\ \hat{\beta} &= (X'X)^{-1}X'y \end{aligned} \] (Here we use the matrix differentiation rules \(\frac{\partial\, a'\hat{\beta}}{\partial \hat{\beta}} = a\) and \(\frac{\partial\, \hat{\beta}'A\hat{\beta}}{\partial \hat{\beta}} = 2A\hat{\beta}\) for symmetric \(A\).) Therefore, the least squares estimator is:

\[\hat{\beta}=(X'X)^{-1}X'y\]

2. Find Coefficients using Matrix in R

Let’s use the Anscombe.txt data set as an example to find the estimated coefficients by using the above matrix equation.

  • education = per capita education expenditures, dollars
  • income = per capita income, dollars
  • under18 = proportion under 18 years old, per 1000
  • urban = proportion urban, per 1000
# load data
df2 <- read.delim("/Users/yu-shiuanhuang/Desktop/method-sequence/data/Anscombe.txt", sep = "")
head(df2)
##    education income under18 urban
## ME       189   2824   350.7   508
## NH       169   3259   345.9   564
## VT       230   3072   348.5   322
## MA       168   3835   335.3   846
## RI       180   3549   327.1   871
## CT       193   4256   341.0   774
# convert each variable to a vector
edu <- df2$education
intercept <- rep(1, nrow(df2)) # remember to create this vector of 1s!
inc <- df2$income
un18 <- df2$under18
urb <- df2$urban
X <- cbind(intercept, inc, un18, urb)
class(X) # this is a matrix
## [1] "matrix" "array"

We now have a vector edu and a matrix \(X\) containing the intercept column and our three explanatory variables (income, under18, urban). Let’s estimate our best-fit parameters \(\beta_0\), \(\beta_1\), \(\beta_2\), and \(\beta_3\) by applying the estimator we derived in the lecture: \(\hat{\beta} = (X'X)^{-1}X'y\).

beta.hat <- solve(t(X) %*% X) %*% t(X) %*% edu

t(beta.hat)
##      intercept        inc      un18        urb
## [1,] -286.8388 0.08065325 0.8173377 -0.1058062

Compare with the canned routine in R to confirm the matrix calculation was correct.

mod <- lm(edu ~ X-1) # the -1 drops lm()'s automatic intercept, because X already contains the column of 1s we created

beta.canned <- mod$coefficients
beta.canned
##    Xintercept          Xinc         Xun18          Xurb 
## -286.83876273    0.08065325    0.81733774   -0.10580623

How to come up with the predicted \(\hat{y}\)? \[\hat{y}=Xb\]

eduhat <- beta.hat[1] + df2$income*beta.hat[2] + 
    df2$under18*beta.hat[3] + df2$urban*beta.hat[4]
eduhat
##  [1] 173.8168 199.0526 211.7006 207.0077 174.5936 253.2396 223.9983 210.5846
##  [9] 179.8788 206.2476 213.3513 231.5918 233.2380 209.4856 211.0236 196.9737
## [17] 176.3859 188.1518 199.2828 195.3128 187.4341 250.0855 231.5108 252.0303
## [25] 182.3619 139.8510 169.8280 162.6433 176.5621 159.9772 156.6485 139.1353
## [33] 135.8968 148.7554 135.1561 174.0988 143.0524 175.0569 195.4585 171.6380
## [41] 205.2510 192.1739 197.6281 189.7953 190.1857 261.4657 212.7525 181.6204
## [49] 221.7760 355.7228 221.5298
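
Equivalently (a small addition, not in the original code), the fitted values can be obtained with a single matrix product, since \(\hat{y}=Xb\):

eduhat_mat <- as.vector(X %*% beta.hat)   # multiply the data matrix by the coefficient vector
all.equal(eduhat_mat, eduhat)             # same fitted values as the element-wise version above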

How to come up with the residuals? \[e=y-\hat{y}=y-Xb\]

e <- edu - eduhat
e
##  [1]  15.1831901 -30.0526055  18.2993676 -39.0077435   5.4064124 -60.2396370
##  [7]  37.0016640   3.4153951  21.1211678 -34.2476465 -19.3513498 -42.5918115
## [13]  -0.2380450  -0.4855509  50.9763644  37.0263340   0.6140641 -11.1518234
## [19] -12.2828447 -47.3127738   8.5658819  -2.0854980  15.4891839  -6.0302784
## [25]  -2.3619147   9.1490037 -14.8279988 -13.6433457 -20.5621494  31.0227604
## [31] -16.6485112  -2.1352962 -23.8967859 -18.7553867  -1.1561315 -12.0987770
## [37]  -8.0523620 -20.0569429  42.5415211  -1.6380384  32.7489708  -0.1738631
## [43]  29.3718741  17.2047442  10.8143055 -36.4656906   2.2475110  51.3796304
## [49]  51.2240417  16.2771794  -9.5297657

Now calculate the standard errors of the vector \(\hat{\beta}\). Recall the error variance is unknown but we have available the unbiased estimator \(S^2_E = (e'e)/(n-k-1)\), based on the residuals of our model fit. We can extract these residuals, calculate the estimator, and then use it to calculate the variance of the least squares coefficients \(\hat{V(b)} = S^2_E (X'X)^{-1}\).

  • As a quick refresher of concepts: the variance is a measure of a random variable’s “spread” or variation around its mean (a.k.a. its expected value), while the covariance measures how the variations of two random variables are correlated with each other.

  • The variance-covariance matrix is a square matrix (i.e., it has the same number of rows and columns). The elements that lie along its main diagonal (the one that goes from top-left to bottom-right) contain the variances, while all other elements contain the covariances. Thus, the variance-covariance matrix of the fitted coefficients of a regression model contains the variances of the coefficient estimates along its main diagonal, and the pairwise covariances between coefficient estimates in the off-diagonal elements.

  • Why do we calculate not only the variances of the regression coefficients but also their covariances?

    The covariance matrix can be very useful for certain model diagnostics. If two variables are highly correlated, one way to think about it is that the model is having trouble figuring out which variable is responsible for an effect (because they are so closely related). This can be helpful for a whole variety of cases, such as choosing subsets of covariates to use in a predictive model; if two variables are highly correlated, you may only want to use one of the two in your predictive model.

S2E <- as.numeric((t(e)%*%e)/(nrow(df2)-3-1))
S2E
## [1] 712.5394
v.beta.hat <- S2E*solve(t(X)%*%X)
v.beta.hat # the diagonal numbers are the variance of each coefficient
##              intercept           inc          un18           urb
## intercept 4214.5975090 -1.849526e-01 -9.7493677311 -0.1582902773
## inc         -0.1849526  8.646281e-05  0.0001387409 -0.0002162611
## un18        -9.7493677  1.387409e-04  0.0255327479  0.0002084764
## urb         -0.1582903 -2.162611e-04  0.0002084764  0.0011752674
sqrt(diag(v.beta.hat)) 
##    intercept          inc         un18          urb 
## 64.919931523  0.009298538  0.159789699  0.034282173

Compare with the canned routine in R to see if we got the same standard errors of coefficients.

  OLS
Variables Estimates S.E. C.I. (95%)
Xintercept -286.84 *** 64.92 -417.44 – -156.24
Xinc 0.08 *** 0.01 0.06 – 0.10
Xun18 0.82 *** 0.16 0.50 – 1.14
Xurb -0.11 ** 0.03 -0.17 – -0.04
Observations 51
R2 / R2 adjusted 0.984 / 0.982
  * p<0.05   ** p<0.01   *** p<0.001
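
The same variance-covariance matrix is also available directly from the canned fit (a brief check, not in the original code):

vcov(mod)              # variance-covariance matrix of the canned fit
sqrt(diag(vcov(mod)))  # standard errors, matching our hand calculation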

With the standard errors of each coefficient, we can perform hypothesis testing and construct confidence intervals to determine the significance and reliability of the estimated effects (similar to what we did in the case of simple linear regression).
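
For example, here is a minimal sketch (not in the original code) of the two-tailed tests for each coefficient, built from the hand-computed quantities:

t.stats <- as.vector(beta.hat) / sqrt(diag(v.beta.hat))                      # t = estimate / SE
p.vals <- 2 * pt(abs(t.stats), df = nrow(df2) - 3 - 1, lower.tail = FALSE)   # two-tailed p-values
cbind(t.stats, p.vals)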

  • What would happen when there is perfect multicollinearity in our data (i.e. when two or more independent variables are perfectly correlated)?

    # convert each variable to a vector
    edu <- df2$education
    intercept <- rep(1, nrow(df2)) # remember to create this vector of 1s!
    inc <- df2$income
    un18 <- df2$under18
    urb <- df2$urban
    pmc <- 2*inc
    cor(pmc, inc) # pmc is perfectly correlated with inc
    ## [1] 1
    X <- cbind(intercept, inc, un18, urb, pmc)
    class(X) # this is a matrix
    ## [1] "matrix" "array"
    beta.hat <- solve(t(X) %*% X) %*% t(X) %*% edu
    ## Error in solve.default(t(X) %*% X): Lapack routine dgesv: system is exactly singular: U[5,5] = 0

    As you can see in the result of the above chunk, the matrix calculation fails when there is perfect multicollinearity: \(X'X\) is singular and cannot be inverted. However, when you run lm() in R, R will automatically drop one of the perfectly collinear variables from the fit.

    mod2 <- lm(edu ~ X-1) # the -1 drops lm()'s automatic intercept, because X already contains the column of 1s we created

     

    OLS
    Variables         Estimates    S.E.    C.I. (95%)
    Xintercept        -286.84 ***  64.92   -417.44 – -156.24
    Xinc              0.08 ***     0.01    0.06 – 0.10
    Xun18             0.82 ***     0.16    0.50 – 1.14
    Xurb              -0.11 **     0.03    -0.17 – -0.04
    Observations      51
    R2 / R2 adjusted  0.984 / 0.982
    * p<0.05   ** p<0.01   *** p<0.001

    If there is high multicollinearity between variables that are not perfectly correlated, the matrix algebra still works, and lm() will still report estimates for the highly correlated variables. However, the standard errors of their coefficients will be very large.

    # convert each variable to a vector
    edu <- df2$education
    intercept <- rep(1, nrow(df2)) # remember to create this vector of 1s!
    inc <- df2$income
    un18 <- df2$under18
    urb <- df2$urban
    set.seed(1234)
    pmc <- 2*inc + rnorm(nrow(df2), 5, 1)
    cor(pmc, inc) # pmc is highly but not perfectly correlated with inc
    ## [1] 0.9999997
    X <- cbind(intercept, inc, un18, urb, pmc)
    class(X) # this is a matrix
    ## [1] "matrix" "array"
    beta.hat <- solve(t(X) %*% X) %*% t(X) %*% edu
    
    t(beta.hat)
    ##      intercept     inc      un18        urb       pmc
    ## [1,] -290.6956 -1.4651 0.8188972 -0.1038287 0.7726427
    eduhat <- beta.hat[1] + df2$income*beta.hat[2] + 
    df2$under18*beta.hat[3] + df2$urban*beta.hat[4] + pmc*beta.hat[5]
    e <- edu - eduhat
    S2E <- as.numeric((t(e)%*%e)/(nrow(df2)-4-1))
    v.beta.hat <- S2E*solve(t(X)%*%X)
    sqrt(diag(v.beta.hat))
    ##   intercept         inc        un18         urb         pmc 
    ## 69.35278085  9.01847688  0.16172179  0.03651251  4.50787061

     

    OLS               (without pmc)                              (with pmc)
    Variables         Estimates    S.E.    C.I. (95%)            Estimates    S.E.    C.I. (95%)
    Xintercept        -286.84 ***  64.92   -417.44 – -156.24     -290.70 ***  69.35   -430.30 – -151.10
    Xinc              0.08 ***     0.01    0.06 – 0.10           -1.47        9.02    -19.62 – 16.69
    Xun18             0.82 ***     0.16    0.50 – 1.14           0.82 ***     0.16    0.49 – 1.14
    Xurb              -0.11 **     0.03    -0.17 – -0.04         -0.10 **     0.04    -0.18 – -0.03
    Xpmc                                                         0.77         4.51    -8.30 – 9.85
    Observations      51                                         51
    R2 / R2 adjusted  0.984 / 0.982                              0.984 / 0.982
    * p<0.05   ** p<0.01   *** p<0.001

Discussion III

For today’s discussion, we will review some diagnostic techniques to detect influential data points that could affect our model and ensure our model adheres to Gauss-Markov assumptions, addressing any violations as needed.

  1. Unusual and Influential Data
  2. Diagnostics for Errors and Linearity

1. Unusual and Influential Data

When dealing with observational or experimental data, it’s common to encounter outliers, which are data points that behave differently than the rest of the data for some reason. Outliers can greatly influence the analysis and interpretation of the data. In today’s discussion, we’ll examine what outliers are, how they can affect data analysis and interpretation, and use simple linear regression as an example to explore situations where individual data points significantly impact the model. Lastly, we will also introduce added-variable plots for detecting unusual observations in multivariate linear regression models.

1.1 Leverage, discrepancy and influence

Although many researchers discuss outliers in a broad sense, it’s important to recognize that there are different types of unusual data points. A data point may be unusual in terms of its predictor behavior (the \(x\)-value in simple regression), its outcome (the \(y\)-value in simple regression), or both.

  • leverage: A data point that differs significantly from the rest of the dataset in terms of its predictor behavior is known as a leverage point. In the case of simple linear regression, this means that the x-value of the data point is much higher or lower than the mean of the predictor variable.

  • discrepancy: When a data point has an uncommon y-value for its corresponding x-value, it is said to have high discrepancy. Within the context of regression analysis, such an unusual point is called an outlier.

  • influence: Having either a high leverage or a high discrepancy does not necessarily make a data point influential in a linear model. In fact, a single data point’s influence is determined by leverage × discrepancy. Therefore, having only high leverage or high discrepancy may not be enough to significantly alter the model parameters.

The figure below illustrates the three characteristics discussed above. In each of the three panels, the red line represents the line of best fit without the data point in question (marked by the triangle), while the blue line shows the line of best fit with it.

  • In panel A, the data point with the triangle has a high leverage - its \(x\)-value is much higher than the rest.

  • In panel B, it has a high discrepancy - it lies far away from the line of best fit.

    However, neither of the above characteristics alone exerts a lot of influence on our model parameters, as the red line does not diverge significantly from the blue line.

  • In panel C, we see a data point that has both high leverage and high discrepancy, and as a result, it exerts high influence: the blue line is noticeably different from the red line.

Example Data

For today’s discussion, we will use a data set based on an example in Field, Miles, and Field (2012). The dataset includes eight samples, where the \(x\)-value represents the number of pubs within a borough (district) of London, and the \(y\)-value represents the number of deaths in that borough over a specific period of time. We are interested in investigating the relationship between the number of deaths and the number of pubs in each borough.

pubs <- data.frame(name = c(1, 2, 3, 4, 5, 6, 7, 8), 
                   pubs = c(10, 25, 40, 55, 70, 85, 100, 500), 
                   deaths = c(1043.822, 2086.934, 2951.086, 3992.459, 5088.003, 
                              6095.645, 6923.497, 10000.000))
pubs
##   name pubs    deaths
## 1    1   10  1043.822
## 2    2   25  2086.934
## 3    3   40  2951.086
## 4    4   55  3992.459
## 5    5   70  5088.003
## 6    6   85  6095.645
## 7    7  100  6923.497
## 8    8  500 10000.000

Let’s plot the line of best fit of this data:

library(tidyverse)
ggplot(data = pubs, aes(x = pubs, y = deaths, label = name)) +
    geom_point(alpha = 0.8) +
    geom_text(hjust = 2, vjust = 0) +
    geom_smooth(method = lm, se = FALSE, colour = "darkorange") +
    xlab("Number of Pubs") + 
    ylab("Deaths") +
    theme_bw()

Both the raw data and the linear regression plot indicate that data point 8 is significantly different from the other samples. This particular data point not only has a much higher number of pubs compared to the others, but it also appears to have a distinct relationship between the number of pubs and the number of deaths, unlike the other boroughs.

Since we have some concerns about this specific data point, let’s determine whether it exhibits a) high leverage, b) high discrepancy, and c) high influence.

1.2 Assessing Leverage

Recall that leverage quantifies the degree to which a predictor value differs from the other predictor values. In the case of simple linear regression, we can compute the distance between each predictor point (\(X_i\)) and the mean predictor value (\(\bar{X}\)) to measure leverage.

A standardized version of this distance is called hat-value and denoted by \(h_i\) (for \(i = {1,..., n}\)):

\[h_i = \frac{1}{n}+\frac{(X_i-\bar{X})^2}{\sum_{j=1}^{n}(X_j-\bar{X})^2}\]

The average hat value (\(\bar{h}\)) is defined as \(\frac{k+1}{n}\), in which \(k\) is the number of predictors and \(n\) the number of observations. Values of \(h\) are bound between \(\frac{1}{n}\) and 1, with 1 denoting highest leverage (highest distance from mean).

Based on our sample data, it’s obvious that data point 8 has the highest leverage among all the points. Let’s apply the formula above to compute the leverage value (\(h_8\)) for this data point.

# number of cases
n = nrow(pubs)

# distance to mean of point 8
numerator = (pubs$pubs[8] - mean(pubs$pubs))^2

# distance to mean of all the other points
denominator = sum((pubs$pubs - mean(pubs$pubs))^2)

# putting it together
h_8 = 1/n + numerator/denominator

h_8
## [1] 0.969302

The resulting hat-value is 0.969302. This is quite high - in fact, it’s very close to 1, the highest possible value.

You don’t need to manually calculate all of the hat values because R offers a convenient hatvalues function that can be applied to any linear model. To take advantage of this feature, we will begin by fitting a simple linear model using lm and then extracting the hat value of each data point.

# fitting linear model
mod <- lm(deaths ~ pubs, data = pubs)

# getting hatvalues and plotting them
plot(hatvalues(mod))

Remember that leverage alone does not mean a point exerts high influence, but it certainly means the point is worth investigating. Hat values are open to interpretation, but a common cut-off is two or three times the average \(\bar{h}\), meaning anything above that value should be examined more closely.

Let’s add a cut-off line into the above hatvalue figure.

# the average hat-value, (k+1)/n, with k = 1 predictor and n = 8 observations
mean_h <- (1+1)/8

# adding a cut-off line
plot(hatvalues(mod))
abline(h = 2*mean_h, lty = 2)
abline(h = 3*mean_h, lty = 2)

As illustrated in the figure above, the value of \(h_8\) is greater than both twice and three times the mean hat value (\(\bar{h}\)). Based on this observation, we can safely conclude that \(h_8\) is highly unusual.

After having assessed leverage, let’s look at discrepancy. How unusual is the \(y\)-value given its \(x\)-value?

1.3 Assessing Discrepancy

As we learned earlier, data points that do not fit well to the linear regression line are referred to as outliers, or high-discrepancy points. Typically, we assess the fit of the regression line using residuals, which measure the distance between a predicted value and the actual value. While we already know that point 8 has high leverage, it is worth noting that it lies quite close to the fitted line (see the left panel). Additionally, when we examine the residuals, point 8 does not appear to be further from the regression line than any other point (see the right panel).

Instead, we can look at standardized or studentized residuals.

  • Standardized residuals are formed by calculating:

    \[E_i'=\frac{E_i}{S_E\sqrt{1-h_i}}\]

    where \(S_E=\sqrt{\frac{\sum E_i^2}{n-k-1}}\)

    Let’s calculate the standardized residual for our example data by hand (the 8th data point is the one we’re interested in):

    # residual for data point in original model
    Ei <- as.numeric(residuals(mod)[8])
    
    # estimate of sigma (standard deviation) for residuals
    S_E <- summary(mod)$sigma
    
    # hatvalue for point 8
    hi <- as.numeric(hatvalues(mod)[8])
    
    # putting it together
    Eprime <- Ei/(S_E*sqrt(1-hi))
    Eprime
    ## [1] -2.447433

    You can also just let R do the maths by calling rstandard on the original model:

    rstandard(mod)
    ##           1           2           3           4           5           6 
    ## -1.44154418 -0.89769067 -0.48020238  0.04464274  0.59853447  1.09293034 
    ##           7           8 
    ##  1.47205817 -2.44743280

    Note that the last value is the same as the one we calculated by hand! Let’s compare the residuals and the standardized residuals in plots side by side:

    The standardized residuals provide a better measure of discrepancy and can reveal outliers that were not detected by the normal residual plot. In our case, by looking at the standardized residual plot below, we can confirm that point 8 has a high discrepancy compared to the other points.

  • Studentized residuals are calculated by fitting a model without the case for which the residual is calculated, and then scaling the resulting residual (\(E_i\)) by an estimate of the standard deviation of the residuals (\(S_{E(-i)}\)) and the point’s hat value (\(h_i\)):

    \[E_i^*=\frac{E_i}{S_{E(-i)}\sqrt{1-h_i}}\]

    Let’s calculate the studentized residual for our example data by hand (the 8th data point is the one we’re interested in):

    # model excluding point 8
    mod.red <- lm(deaths ~ pubs, data = pubs[-8,]) 
    
    # residual for data point in original model
    Ei <- as.numeric(residuals(mod)[8])
    
    # estimate of sigma (standard deviation) for residuals
    S_E <- summary(mod.red)$sigma
    
    # hatvalue for point 8
    hi <- as.numeric(hatvalues(mod)[8])
    
    # putting it together
    Estar <- Ei/(S_E*sqrt(1-hi))
    Estar
    ## [1] -54.5284

    You can also just let R do the maths by calling rstudent on the original model:

    rstudent(mod)
    ##            1            2            3            4            5            6 
    ##  -1.62765333  -0.88075356  -0.44703731   0.04075983   0.56346508   1.11482846 
    ##            7            8 
    ##   1.68127219 -54.52839824

    Note that the last value is the same as the one we calculated by hand! Let’s compare the residuals and the studentized residuals in plots side by side:

    The studentized residuals reveal clearly that point 8 has a high discrepancy, which was not visible in the ordinary residual plots. You can also perform a Bonferroni adjustment to test the statistical significance of \(E_8^*\) and check whether we can confidently say that point 8 is an outlier. Observations are considered outliers if their Bonferroni p-value is less than .05.

    library(car)
    outlierTest(mod)
    ##   rstudent unadjusted p-value Bonferroni p
    ## 8 -54.5284         3.9231e-08   3.1385e-07

    According to the above test, we can say that point 8 can be considered an outlier (high discrepancy).

1.4 Assessing Influence

As mentioned earlier, having high leverage or discrepancy alone does not necessarily mean a data point will have a high influence on the linear regression model. However, since our point of interest (point 8) has both high leverage and discrepancy, it is likely to have high influence on the model.

We’ll use something called Cook’s D, which relies on the standardized residuals to assess influence. A similar measure is DFFITS, which is instead based on studentized residuals. Another measure, which we won’t go into, is DFBETAS, which measures the influence on each individual parameter rather than on the overall model.

Observations that combine high leverage with a large studentized residual exert substantial influence on the regression coefficients. Cook’s D statistic provides a summary index of influence.

\(D_i=\frac{E_i'^2}{k+1}\times \frac{h_i}{1-h_i}\)

where the first term, based on the standardized residual, is a measure of discrepancy, and the second is a measure of leverage. The higher the leverage and residuals, the higher the Cook’s distance.
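
As a quick sanity check, here is a minimal sketch that computes \(D_8\) by hand from the quantities we already have (the model mod, with \(k = 1\) predictor); cooks.distance() returns the same value:

# Cook's D for point 8: (standardized residual)^2 / (k+1) * h / (1-h)
k <- 1
D_8 <- (rstandard(mod)[8]^2 / (k + 1)) * (hatvalues(mod)[8] / (1 - hatvalues(mod)[8]))
D_8
cooks.distance(mod)[8] # same value, roughly 94.57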

Several interpretations for Cook’s distance exist; there isn’t a universally accepted rule for cut-off points.

  • One interpretation is to investigate any point over \(\frac{4}{n}\), where \(n\) is the number of observations.
  • Other scholars suggest that any “large” \(D_i\) should be investigated. How large is “too large”? The consensus seems to be that a \(D_i\) value of more than 1 indicates an influential value, but you may want to look at values above 0.5.

Let’s use influencePlot() in the car package to check Cook’s D statistics!
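
The output below can be reproduced with a call along the following lines (a minimal sketch, assuming the mod object fitted above):

influencePlot(mod)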

##      StudRes       Hat      CookD
## 1  -1.627653 0.1813863  0.2302244
## 7   1.681272 0.1256287  0.1556728
## 8 -54.528398 0.9693020 94.5671657

1.5 Added-variable Plot (AV Plot)

All the above techniques can also be applied to multivariate linear regression models. Additionally, we introduce another very useful influence graph for visually detecting outliers in multivariate linear models, known as an added-variable plot or partial-regression plot.

A limitation of using the typical residual plots in a multivariate linear regression model, such as plotting residuals against the values of a predictor variable, is that they may not fully illustrate the “additional contribution” of a predictor variable when considering other variables already in the model. Added-variable plots offer a more refined visualization, providing graphic insights into the marginal importance of a predictor variable, taking into account the presence of other variables in the model. The following materials are sourced from Fox’s slides and lecture.

  • Let \(y_i^{(1)}\) represent the residuals from the least-squares regression of \(y\) on all of the \(x\)s with the exception \(x_1\), that is, the residuals from the fitted model:

    \[y_i = b_0^{(1)} + b_2^{(1)}x_{i2} + ...+ b_k^{(1)}x_{ik} + y_i^{(1)}\]

  • Likewise, \(x_i^{(1)}\) are residuals from the least-squares regression of \(x_1\) on the other \(x\)s:

    \[x_{1i} = c_0^{(1)} + c_2^{(1)}x_{i2} + ...+ c_k^{(1)}x_{ik} + x_i^{(1)}\]

  • The notation emphasizes the interpretation of the residuals \(y_i^{(1)}\) and \(x_i^{(1)}\) as the parts of \(y\) and \(x_1\) that remain when the linear dependence of these variables on \(x_2, ..., x_k\) is removed.

The AV plot for \(x_1\) is then the scatterplot of \(y_i^{(1)}\) versus \(x_i^{(1)}\), and we repeat the procedure for each \(x_j\), \(j =0, 1, ..., k\) (where \(x_0 = 1\)). In effect, the \((k+1)\)-dimensional scatterplot for \(y\) and \(x_1, ..., x_k\) is reduced to a sequence of \(k+1\) 2D AV plots. By doing so, the AV plots visualize leverage and influence on each of the regression coefficients.

The added-variable plot for \(x_1\) has the following very interesting properties:

  • The slope of the least-squares simple-regression line of \(y_i^{(1)}\) on \(x_i^{(1)}\) is the same as the least-squares slope \(b_1\) for \(x_1\) in the full multiple regression.

  • The residuals from this simple regression are the same as the residuals \(e_i\) from the full regression.

  • Consequently, the standard deviation of the residuals in the added-variable plot is the same as in the full multiple regression.

  • The standard error of \(b_1\) from this simple regression is also the same as the standard error from the full regression.

  • Because the \(x_i^{(1)}\) are residuals, they are less variable than \(x_1\) if \(x_1\) is correlated with the other \(x\)s. The added-variable plot therefore shows how collinearity can degrade the precision of estimation by decreasing the conditional variation of an \(x\).

NOTE: You may find these concepts very familiar because in POL 212 we covered the Frisch-Waugh-Lovell Theorem (you can review the materials of Discussion IV in POL 212 Discussion), in which we showed how to decompose a regression of \(Y\) on a set of variables \(X\) into two pieces. The coefficient of one variable in a multiple linear regression model can be obtained by netting off the effect of the other variables from both the dependent and independent variables. The FWL Theorem shows that it is important to partial one set of covariates out of the other in order to obtain the ceteris paribus effect. For example, to get the effect of \(X_1\) on \(Y\), \(X_2\) needs to be held constant, which involves removing the portion of \(X_1\) that is correlated with \(X_2\). Likewise, when identifying outliers among explanatory variables in a multivariate linear regression model, it’s crucial to first hold all other variables constant.

Example

For this example, we will use the Duncan data frame, which has 45 rows and 4 columns. This is a dataset on the prestige and other characteristics of 45 U.S. occupations in 1950.

library(car)
library(carData)

knitr::kable(head(Duncan)) # from carData package
type income education prestige
accountant prof 62 86 82
pilot prof 72 76 83
architect prof 75 92 90
author prof 55 90 76
chemist prof 64 86 90
minister prof 21 84 87

These are the description of each variable:

  • type: Type of occupation. A factor with the following levels: prof, professional and managerial; wc, white-collar; bc, blue-collar.

  • income: Percentage of occupational incumbents in the 1950 US Census who earned $3,500 or more per year (about $36,000 in 2017 US dollars).

  • education: Percentage of occupational incumbents in 1950 who were high school graduates (which, were we cynical, we would say is roughly equivalent to a PhD in 2017).

  • prestige: Percentage of respondents in a social survey who rated the occupation as “good” or better in prestige.

Let’s begin by regressing prestige on the explanatory variables income and education, then plot the added-variable plots to visually inspect for any potential outliers.

m.duncan <- lm(prestige ~ income + education, data = Duncan)
  OLS
Variables Estimates S.E. C.I. (95%)
(Intercept) -6.06 4.27 -14.69 – 2.56
income 0.60 *** 0.12 0.36 – 0.84
education 0.55 *** 0.10 0.35 – 0.74
Observations 45
R2 / R2 adjusted 0.828 / 0.820
  * p<0.05   ** p<0.01   *** p<0.001
avPlots(m.duncan)

Note that the slope of the blue regression line in both plots corresponds to the estimated coefficient of income and education in our multivariate regression model.
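
As a quick check of this property, here is a minimal sketch (assuming the Duncan data and the m.duncan model from above) in the spirit of the FWL Theorem: regressing the two sets of residuals on each other recovers the multiple-regression coefficient of income.

# residuals of prestige and of income after partialling out education
y_res <- resid(lm(prestige ~ education, data = Duncan))
x_res <- resid(lm(income ~ education, data = Duncan))
coef(lm(y_res ~ x_res))["x_res"] # matches coef(m.duncan)["income"], about 0.60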

In the AV plots, we can observe the impact (or lack thereof) of these unusual cases on the regression coefficients. Let’s first examine the left-hand side of the figure. We notice that the case of the minister pulls up on the left side, while the case of the conductor pulls down on the right side. This influential pair is widely separated, affecting the coefficient of income by making it smaller than it would otherwise be. The case of railroad engineers appears as a very high leverage point due to their unusual income given their education level, but it aligns more or less with the trend observed throughout the data. Similarly, the case of the reporter also follows the overall trend.

In the education AV plot, we observe a similar pattern, except that the minister’s case pulls up on the left side, leading to a larger coefficient of education than it would otherwise be. Again, the cases of railroad engineers and reporters seem to align with the general trend of the data, so they are of less concern.

Overall, based on the AV plots, we may speculate that the case of the minister could be an influential data point affecting our model’s estimation. We can proceed by updating our original regression model and then comparing the coefficients between the models to validate this hypothesis.

m.duncan2 <- lm(prestige ~ income + education, 
                data = subset(Duncan, !(rownames(Duncan) %in% c("minister"))))
  Model 1 Model 2
Variables Estimates S.E. C.I. (95%) Estimates S.E. C.I. (95%)
(Intercept) -6.06 4.27 -14.69 – 2.56 -6.63 3.89 -14.48 – 1.22
income 0.60 *** 0.12 0.36 – 0.84 0.73 *** 0.12 0.50 – 0.97
education 0.55 *** 0.10 0.35 – 0.74 0.43 *** 0.10 0.24 – 0.63
Observations 45 44
R2 / R2 adjusted 0.828 / 0.820 0.856 / 0.849
  * p<0.05   ** p<0.01   *** p<0.001

Indeed, after excluding the case of the minister, the coefficient of income increases from 0.6 to 0.73, while the coefficient of education decreases from 0.55 to 0.43.

We can also run influencePlot() to check whether the Cook’s distance for the minister’s case is indeed problematic.
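
The output below comes from a call along these lines (a minimal sketch, assuming the m.duncan object fitted above):

influencePlot(m.duncan)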

##                StudRes        Hat      CookD
## minister     3.1345186 0.17305816 0.56637974
## reporter    -2.3970224 0.05439356 0.09898456
## conductor   -1.7040324 0.19454165 0.22364122
## RR.engineer  0.8089221 0.26908963 0.08096807

2. Diagnostics for Linearity and Errors

2.1 Violation of Linearity Assumption

Recall that the linearity assumption is that the expected (mean) value of the disturbance term is 0 (i.e., \(E(u_i)=0\)). That the average regression error is 0 everywhere implies that the regression surface captures the dependency of the conditional mean of \(y\) on the \(x\)s. Violating the linearity assumption implies “that the model fails to capture the systematic pattern of relationship between the response and explanatory variables” (Fox, 2016: 307). For example, a partial relationship specified to be linear may be nonlinear, or two explanatory variables specified to have additive partial effects may interact in determining \(y\). While a fitted model may sometimes provide a useful approximation to the true regression surface \(E(y)\), there are instances where the model can be highly misleading. You can think of nonlinearity (fitting the wrong equation to the data) as potentially the most serious problem with a regression model.

When performing a linear regression with a single explanatory variable, a scatterplot of the response variable against the explanatory variable provides a good indication of the nature of the relationship. If there is more than one explanatory variable, things become more complicated. Although it can still be useful to generate scatterplots of the response variable against each of the explanatory variables, this does not take into account the effect of the other explanatory variables in the model. After all, our interest centers on the partial relationship between \(y\) and each \(x\), controlling for the other \(x\)s, not on the marginal relationship between \(y\) and a single \(x\).

Furthermore, while plotting residuals against each \(x\) is helpful for detecting departures from linearity, residual plots cannot distinguish between monotone and nonmonotone nonlinearity.

In the above figure from the Fox textbook, case (a) might be modeled by \(y = \beta_0 + \beta_1\sqrt{x} + \varepsilon\) (a transformation of \(x\)), but case (b) cannot be linearized by a transformation of \(x\) and might instead be dealt with by a quadratic regression, \(y = \beta_0 + \beta_1x + \beta_2x^2 + \varepsilon\).

Component + residual (C+R) plots, an extension of partial residual plots, are a good way to see whether each predictor included in the multivariate linear model has a linear relationship with the dependent variable. A partial residual plot plots, for one predictor, the least-squares residuals with that predictor’s linear contribution added back, against the predictor itself. The partial residual for the \(j\)th explanatory variable is defined as:

\[E_{i}^{(j)}= E_i+B_jX_{ij}\]

In words, add back the linear component of the partial relationship between \(Y\) and \(X_j\) to the least-squares residuals, which may include an unmodeled nonlinear component. Then plot \(E^{(j)}\) versus \(X_j\).

Example

Let’s do a simulation example to see how C+R plots can help distinguish monotone and nonmonotone nonlinearity.

set.seed(1234) # for replication
N <- 100 # set up the sample size
x1 <- rnorm(N, mean = 4, sd = 5)
x2 <- runif(N, min = -5, max = 5)
x3 <- rnorm(N, mean = 10, sd = 2)*(x1/2)
cor(x1, x3)
## [1] 0.9724877
e <- rnorm(N, mean = 0, sd = 1) # set the errors to be normally distributed 
y <- 7 + x1^2 + x2 + 5*x2^2 - 7*x3 + e # set the true y
df <- data.frame(y = y, x1 = x1, x2 = x2, x3 = x3, e = e)

In the above chunk, I define the true value of \(y\) as a function of \(x_1\), \(x_2\), \(x_3\), and the error term (\(e\)). Specifically, the true regression coefficients are set as follows: \(1\) for the squared transformation of \(x_1\), \(1\) for \(x_2\), and \(-7\) for \(x_3\). Additionally, the term \(5 \cdot x_2^2\) signifies a quadratic effect of \(x_2\), while the intercept is fixed at \(7\). Note that I specifically set \(x_3\) to be highly correlated with \(x_1\). To visualize the underlying relationship between \(x_1\), \(x_2\), \(x_3\), and \(y\), let’s start by plotting scatterplots for each pair of \(y\) and \(x\), then running a multivariate regression, and plotting residuals against each \(x\).

fit <- lm(y ~ x1 + x2 + x3, data = df)
df$residual <- fit$residuals

library(tidyverse)

p1 <- ggplot(data = df, aes(x = x1, y = y)) +
  geom_point(color = "gray37", alpha = 0.7) +
  geom_smooth(color = "salmon2", alpha = 0.3) +
  labs(subtitle = "Scatterplot between y and each x") +
  theme_bw()
  
p2 <- ggplot(data = df, aes(x = x1, y = residual)) +
  geom_point(color = "gray37", alpha = 0.7) +
  geom_smooth(method = "lm", color = "steelblue2", alpha = 0.3) +
  labs(subtitle = "Residuals against each x") +
  theme_bw()

p3 <- ggplot(data = df, aes(x = x2, y = y)) +
  geom_point(color = "gray37", alpha = 0.7) +
  geom_smooth(color = "salmon2", alpha = 0.3) +
  theme_bw()
  
p4 <- ggplot(data = df, aes(x = x2, y = residual)) +
  geom_point(color = "gray37", alpha = 0.7) +
  geom_smooth(method = "lm", color = "steelblue2", alpha = 0.3) +
  theme_bw()

p5 <- ggplot(data = df, aes(x = x3, y = y)) +
  geom_point(color = "gray37", alpha = 0.7) +
  geom_smooth(color = "salmon2", alpha = 0.3) +
  theme_bw()
  
p6 <- ggplot(data = df, aes(x = x3, y = residual)) +
  geom_point(color = "gray37", alpha = 0.7) +
  geom_smooth(method = "lm", color = "steelblue2", alpha = 0.3) +
  theme_bw()
  
ggpubr::ggarrange(p1, p2, p3, p4, p5, p6, ncol = 2, nrow = 3)

From the scatterplots of \(y\) against \(x_1\) and \(x_2\), as well as their respective residual plots, it’s evident that while plotting residual plots can indicate whether our model violates the linearity assumption, it doesn’t provide insight into the underlying reasons for the violation. Notably, the residuals plotted against both \(x_1\) and \(x_2\) exhibit a similar convex pattern. Moreover, scatterplots of \(y\) against \(x_1\) and \(x_2\) do not accurately depict their partial relationship. Simply plotting scatterplots between variables fails to consider the influence of other variables in the model, particularly when there are high correlations among predictors. Let’s do the C+R plots instead.

crPlots(fit)

The magenta line represents a smooth curve through the C+R plots, while the blue dashed line represents the regression coefficient of each \(x\) from the multivariate regression model. A notable disparity between the residual line and the component line suggests that the predictor does not exhibit a linear relationship with the response variable. Once again, the advantage of using C+R plots is that they provide information about the partial relationship between \(y\) and each \(x\) while controlling for all other variables. Additionally, they offer insight into how to address nonlinearity. To address the nonlinearity in \(x_1\), applying a square transformation often resolves the issue effectively. As for \(x_2\), we need to consider the quadratic relationship between \(y\) and \(x_2\).
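
To make the partial-residual formula concrete, here is a minimal sketch that rebuilds the C+R plot for \(x_2\) by hand from the fitted model fit (the crPlots() call above does this for every predictor at once):

# partial residuals for x2: add the linear component back to the residuals
df$pr_x2 <- residuals(fit) + coef(fit)["x2"] * df$x2
ggplot(df, aes(x = x2, y = pr_x2)) +
  geom_point(color = "gray37", alpha = 0.7) +
  geom_smooth(se = FALSE, color = "salmon2") +                # smooth through the partial residuals
  geom_abline(intercept = 0, slope = coef(fit)["x2"],
              color = "steelblue2", linetype = "dashed") +    # the linear component line
  labs(y = "Partial residual for x2") +
  theme_bw()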

df$x1_square <- (df$x1)^2
fit_update <- lm(y ~ x1_square + poly(x2, 2, raw = TRUE) + x3, data = df)
crPlots(fit_update)

2.2 Violation of Homoskedasticity Assumption

Note: the materials for this section draw from Penn State STAT 501 and here.

The method of ordinary least squares assumes that there is constant variance in the errors (which is called homoscedasticity). The method of weighted least squares can be used when the ordinary least squares assumption of constant variance in the errors is violated (which is called heteroscedasticity). The model under consideration is:

\[Y = X\beta + \epsilon^*\] where \(\epsilon^*\) is assumed to be (multivariate) normally distributed with mean vector 0 and nonconstant variance-covariance matrix.

\[\begin{equation*} \left(\begin{array}{cccc} \sigma^{2}_{1} & 0 & \ldots & 0 \\ 0 & \sigma^{2}_{2} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \sigma^{2}_{n} \\ \end{array} \right) \end{equation*}\]

If we define the reciprocal of each variance, \(\sigma_i^2\), as the weights, \(w_i = 1/\sigma_i^2\), then let matrix W be a diagonal matrix containing these weights:

\[\begin{equation*}\textbf{W}=\left( \begin{array}{cccc} w_{1} & 0 & \ldots & 0 \\ 0& w_{2} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0& 0 & \ldots & w_{n} \\ \end{array} \right) \end{equation*}\] The weighted least squares estimate is then:

\[\begin{align*} \hat{\beta}_{WLS}&=\arg\min_{\beta}\sum_{i=1}^{n}\epsilon_{i}^{*2}\\ &=(\textbf{X}^{T}\textbf{W}\textbf{X})^{-1}\textbf{X}^{T}\textbf{W}\textbf{Y} \end{align*}\] With this setting, we can make a few observations:

  • Since each weight is inversely proportional to the error variance, it reflects the information in that observation. So, an observation with a small error variance has a large weight since it contains relatively more information than an observation with a large error variance (small weight).
  • The weights have to be known (or more usually estimated) up to a proportionality constant.
  • Weighted least squares estimates of the coefficients will usually be nearly the same as the “ordinary” unweighted estimates. In cases where they differ substantially, the procedure can be iterated until estimated coefficients stabilize (often in no more than one or two iterations); this is called iteratively reweighted least squares.
  • The use of weights will (legitimately) impact the widths of statistical intervals.
Example

Let’s simulate the violation of homoskedasticity assumption and apply the method of weighted least squares to address it.

set.seed(1234) # for replication
N <- 1000 # set up the sample size
x1 <- 1:N
x2 <- rnorm(N, mean = 10, sd = 3)
sd <- runif(N, min = 0, max = 2) # (generated but not used below)
e <- rnorm(N, mean = 0, sd = 2*x1) # set the errors to be normally distributed but not at a constant variance
y <- 7 + 3*x1 + x2 + e # set the true y
df <- data.frame(y = y, x1 = x1, x2 = x2, e = e)

In the above chunk, I make the heteroskedasticity pronounced: as \(x_1\) gets larger, the standard deviation of the errors grows in proportion to \(x_1\). You can see this relationship in the figure below.

ggplot(data = df, aes(x = x1, y = e)) +
  geom_point(color = "gray37", alpha = 0.7) +
  theme_bw()

Let’s now run the regression and test it to see if it violates the homoskedasticity assumption.

fit <- lm(y ~ x1 + x2, data = df) 
  OLS
Variables Estimates S.E. C.I. (95%)
(Intercept) -65.82 144.57 -349.52 – 217.87
x1 3.13 *** 0.13 2.88 – 3.38
x2 6.73 12.46 -17.71 – 31.17
Observations 1000
R2 / R2 adjusted 0.371 / 0.370
  * p<0.05   ** p<0.01   *** p<0.001

According to the regression table, the standard error of \(x_2\) is very large, which is due to the heteroskedasticity problem in our data. The estimate of the effect of \(x_2\) is therefore very inefficient.

par(mfrow = c(2, 2))
plot(fit)

Let’s take a look at the third plot. The Scale-Location plot is used to check the homogeneity of variance of the residuals (homoscedasticity). A horizontal line with equally spread points is a good indication of homoscedasticity. This is not the case in our example, where we have a heteroscedasticity problem: the variability (variance) of the residuals increases with the value of the fitted outcome variable, indicating non-constant variance in the residual errors.

To formally test for heteroscedasticity, we can perform a Breusch-Pagan test:

# load lmtest package
library(lmtest)

# perform Breusch-Pagan test
bptest(fit)
## 
##  studentized Breusch-Pagan test
## 
## data:  fit
## BP = 168.84, df = 2, p-value < 2.2e-16

The Breusch-Pagan test uses the following null and alternative hypotheses:

  • Null Hypothesis (\(H_0\)): Homoscedasticity is present (the residuals are distributed with equal variance)
  • Alternative Hypothesis (\(H_A\)): Heteroscedasticity is present (the residuals are not distributed with equal variance)

Since the p-value from the test is < 2.2e-16, we will reject the null hypothesis and conclude that heteroscedasticity is a problem in this model.

Since heteroscedasticity is present, we will perform weighted least squares by defining the weights in such a way that the observations with lower variance are given more weight:

# define weights to use
wt <- 1 / lm(abs(fit$residuals) ~ fit$fitted.values)$fitted.values^2

head(lm(abs(fit$residuals) ~ fit$fitted.values)$fitted.values)
##        1        2        3        4        5        6 
## 22.48226 39.03961 48.75519 15.68057 45.26871 47.61134
head(wt)
##            1            2            3            4            5            6 
## 0.0019784279 0.0006561287 0.0004206862 0.0040670185 0.0004879820 0.0004411429
# perform weighted least squares regression
wls_fit <- lm(y ~ x1 + x2, data = df, weights = wt)
  No WLS WLS
Variables Estimates S.E. C.I. (95%) Estimates S.E. C.I. (95%)
(Intercept) -65.82 144.57 -349.52 – 217.87 2.14 20.99 -39.04 – 43.32
x1 3.13 *** 0.13 2.88 – 3.38 3.09 *** 0.07 2.94 – 3.23
x2 6.73 12.46 -17.71 – 31.17 0.79 2.55 -4.21 – 5.79
Observations 1000 1000
R2 / R2 adjusted 0.371 / 0.370 0.648 / 0.648
  * p<0.05   ** p<0.01   *** p<0.001

The weighted least squares model yields a coefficient for \(x_2\) much closer to its true value compared to the original model. Additionally, the residual standard error in the WLS model is smaller than that in the original multivariate linear regression model. These results suggest that the predicted values generated by the weighted least squares model are much closer to the actual observations than those produced by the original model.

The weighted least squares model also has an R-squared of 0.648 compared to 0.371 in the original multivariate linear regression model. This indicates that the weighted least squares model is able to explain more of the variance in y compared to the original multivariate linear regression model.
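
To connect the lm(..., weights = wt) call back to the formula \(\hat{\beta}_{WLS}=(\textbf{X}^{T}\textbf{W}\textbf{X})^{-1}\textbf{X}^{T}\textbf{W}\textbf{Y}\), here is a minimal sketch using the objects defined above; it should reproduce coef(wls_fit):

# WLS by matrix algebra
X <- cbind(1, df$x1, df$x2)  # design matrix with an intercept column
W <- diag(wt)                # diagonal matrix of weights
beta_wls <- solve(t(X) %*% W %*% X) %*% t(X) %*% W %*% df$y
t(beta_wls)                  # matches coef(wls_fit)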

Discussion IV

For today’s discussion, we will review what multicollinearity is, how to detect it, and introduce Principal Component Analysis (PCA) as one of the methods to handle it.

  1. (Multi)collinearity Problem
  2. Detecting Collinearity
  3. Principal Component Analysis (PCA)

1. (Multi)collinearity Problem

In multivariate linear regression, it is possible for two or more predictor variables to be correlated with each other; this scenario is referred to as collinearity. When collinearity exists among three or more variables, even if no single pair of variables is particularly highly correlated, it is known as multicollinearity. Multicollinearity indicates redundancy between predictor variables.

The presence of multicollinearity can cause the solution of the regression model to become unstable:

  • coefficient standard errors are large, reflecting imprecision
  • confidence intervals are broad
  • hypothesis tests have low power

Therefore, it is important to detect and address multicollinearity before building a regression model to ensure the model’s reliability and accuracy.

1.1 Example of Perfect Multicollinearity Problem

Let’s first review what Lauren discussed in the lecture regarding why it’s not feasible to include all the dummies of a categorical variable.

library(AER)
library(MASS)
data(CASchools)

# define variables
CASchools$STR <- CASchools$students/CASchools$teachers 
CASchools$score <- (CASchools$read + CASchools$math)/2 

# set seed for reproducibility
set.seed(1)

# generate artificial data on location
CASchools$direction <- sample(c("West", "North", "South", "East"), 420, replace = T)
head(CASchools[, c("score", "STR", "english", "direction")])
##    score      STR   english direction
## 1 690.80 17.88991  0.000000      West
## 2 661.20 21.52466  4.583333      East
## 3 643.60 18.69723 30.000002     South
## 4 647.70 17.35714  0.000000      West
## 5 640.85 18.67133 13.857677     North
## 6 605.55 21.40625 12.408759      West

In this example, our response variable is score, and the explanatory variables are STR (student-teacher ratio), english (the share of English-learning students), and direction (the artificial location of a school).

mod1 <- lm(score ~ STR + english + direction, data = CASchools)
  Score
Variables Estimates S.E. C.I. (95%)
(Intercept) 684.80 *** 7.54 669.98 – 699.63
STR -1.09 ** 0.38 -1.84 – -0.34
english -0.66 *** 0.04 -0.73 – -0.58
direction [North] 1.66 2.06 -2.38 – 5.71
direction [South] 0.72 2.06 -3.34 – 4.77
direction [West] 1.79 1.98 -2.10 – 5.69
Observations 420
R2 / R2 adjusted 0.428 / 0.421
  * p<0.05   ** p<0.01   *** p<0.001

In the regression table above, you’ll notice that R has included only three of the dummies for the direction variable and has designated the category East as the reference group in the model. What this means is that when interpreting the coefficients of the remaining three categories of the direction variable, we are comparing their effects to the category East. For instance, we can infer that schools located in the North have, on average, test scores 1.66 points higher than schools located in the East. This difference is not statistically significant, however, which is unsurprising given that the direction variable was artificially generated.

In addition, since the category East was designated as the reference group, the intercept reflects the baseline test score for a school with a student-teacher ratio of 0, no English-learning students, and a location in the East. The model predicts that the average test score of such schools is 684.80.

What would happen if we intentionally include four dummies of the direction variable in the model?

library(fastDummies)
CASchools <- dummy_cols(CASchools, select_columns = "direction")
head(CASchools[,c("score", "STR", "english", "direction_East", 
                  "direction_North", "direction_South", "direction_West")])
##    score      STR   english direction_East direction_North direction_South
## 1 690.80 17.88991  0.000000              0               0               0
## 2 661.20 21.52466  4.583333              1               0               0
## 3 643.60 18.69723 30.000002              0               0               1
## 4 647.70 17.35714  0.000000              0               0               0
## 5 640.85 18.67133 13.857677              0               1               0
## 6 605.55 21.40625 12.408759              0               0               0
##   direction_West
## 1              1
## 2              0
## 3              0
## 4              1
## 5              0
## 6              1
mod2 <- lm(score ~ STR + english + direction_North + direction_South + 
               direction_West + direction_East , data = CASchools)
  Model 1 Model 2
Variables Estimates S.E. C.I. (95%) Estimates S.E. C.I. (95%)
(Intercept) 684.80 *** 7.54 669.98 – 699.63 684.80 *** 7.54 669.98 – 699.63
STR -1.09 ** 0.38 -1.84 – -0.34 -1.09 ** 0.38 -1.84 – -0.34
english -0.66 *** 0.04 -0.73 – -0.58 -0.66 *** 0.04 -0.73 – -0.58
direction [North] 1.66 2.06 -2.38 – 5.71
direction [South] 0.72 2.06 -3.34 – 4.77
direction [West] 1.79 1.98 -2.10 – 5.69
direction North 1.66 2.06 -2.38 – 5.71
direction South 0.72 2.06 -3.34 – 4.77
direction West 1.79 1.98 -2.10 – 5.69
Observations 420 420
R2 / R2 adjusted 0.428 / 0.421 0.428 / 0.421
  * p<0.05   ** p<0.01   *** p<0.001

As you can observe in model 2, R automatically dropped one of the dummies of the direction variable (East). R does this to solve the perfect multicollinearity problem. Where is the perfect multicollinearity problem in this case?

Recall from the materials in Discussion II that we can think of our data set as a matrix. The equations below show how each observation in our data set is generated.

\(Y_1=\beta_0+\beta_1X_{11}+\beta_2X_{12}+...+\beta_kX_{1k}+\varepsilon_1\)

\(Y_2=\beta_0+\beta_1X_{21}+\beta_2X_{22}+...+\beta_kX_{2k}+\varepsilon_2\)

\(\vdots\)

\(Y_n=\beta_0+\beta_1X_{n1}+\beta_2X_{n2}+...+\beta_kX_{nk}+\varepsilon_n\)

Let’s rewrite them in matrix form.
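
A standard rendering of that matrix form, consistent with the description in the next paragraph, is:

\[\begin{equation*} \left(\begin{array}{c} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{array}\right) = \left(\begin{array}{ccccc} 1 & X_{11} & X_{12} & \ldots & X_{1k} \\ 1 & X_{21} & X_{22} & \ldots & X_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & X_{n2} & \ldots & X_{nk} \end{array}\right) \left(\begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{array}\right) + \left(\begin{array}{c} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{array}\right), \quad \text{i.e., } \textbf{y} = \textbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \end{equation*}\]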

The system of equations is summarized as the y vector equal to the product of the X matrix of data and the \(\beta\) vector of parameters, plus the vector of disturbances, \(\varepsilon\). In the lecture, Lauren showed how to apply what we did when finding the intercept and slope in simple linear regression to solve the optimization problem for the coefficient vector b using matrix algebra. Instead of taking the derivative with respect to each coefficient one at a time, we can take the derivative directly with respect to the coefficient vector b. Therefore, the least squares estimator is:

\(\hat{\beta}=(X'X)^{-1}X'y\)

In the matrix algebra for estimating the \(\beta\) coefficients using the above equation, the first column of the \(X\) matrix must be composed of 1s, because this column represents the intercept term in the regression model. Therefore, including all the dummies of a categorical variable in a regression model results in a perfect multicollinearity problem, because the dummies sum exactly to the intercept column; that is, they form a linear combination of the intercept term.

Let’s look at the CASchools data again.

head(CASchools[,c("score", "STR", "english", "direction_East", 
                  "direction_North", "direction_South", "direction_West")])
##    score      STR   english direction_East direction_North direction_South
## 1 690.80 17.88991  0.000000              0               0               0
## 2 661.20 21.52466  4.583333              1               0               0
## 3 643.60 18.69723 30.000002              0               0               1
## 4 647.70 17.35714  0.000000              0               0               0
## 5 640.85 18.67133 13.857677              0               1               0
## 6 605.55 21.40625 12.408759              0               0               0
##   direction_West
## 1              1
## 2              0
## 3              0
## 4              1
## 5              0
## 6              1

The sum of direction_East, direction_North, direction_South, and direction_West equals the intercept column (a vector of 1s), resulting in a perfect multicollinearity problem.

CASchools$intercept <- CASchools$direction_East + CASchools$direction_North + 
    CASchools$direction_South + CASchools$direction_West

head(CASchools[,c("score", "STR", "english", "direction_East", 
                  "direction_North", "direction_South", "direction_West", "intercept")])
##    score      STR   english direction_East direction_North direction_South
## 1 690.80 17.88991  0.000000              0               0               0
## 2 661.20 21.52466  4.583333              1               0               0
## 3 643.60 18.69723 30.000002              0               0               1
## 4 647.70 17.35714  0.000000              0               0               0
## 5 640.85 18.67133 13.857677              0               1               0
## 6 605.55 21.40625 12.408759              0               0               0
##   direction_West intercept
## 1              1         1
## 2              0         1
## 3              0         1
## 4              1         1
## 5              0         1
## 6              1         1

1.2 Example of Imperfect Multicollinearity Problem

If variables are highly but not perfectly correlated, lm() will still produce estimates for the highly correlated variables. However, the standard errors of the coefficients will be very large, which means that our estimates will be highly unstable.

set.seed(1234)

# generate a highly correlated variable
CASchools$english_hc1 <- 2*CASchools$english + rnorm(nrow(CASchools), 10, 5)

cor(CASchools$english_hc1, CASchools$english) # english_hc1 is highly but not perfectly correlated with english
## [1] 0.9904646
set.seed(1234)

# generate a less highly correlated variable
CASchools$english_hc2 <- 2*CASchools$english + rnorm(nrow(CASchools), 10, 20)

cor(CASchools$english_hc2, CASchools$english) 
## [1] 0.8662955
mod3 <- lm(score ~ STR + english , data = CASchools) # without highly correlated term
mod4 <- lm(score ~ STR + english + english_hc1 , data = CASchools) # with highly correlated term
mod5 <- lm(score ~ STR + english + english_hc2 , data = CASchools) # with less highly correlated term
  Model 1 Model 2 Model 3
Variables Estimates S.E. C.I. (95%) Estimates S.E. C.I. (95%) Estimates S.E. C.I. (95%)
(Intercept) 686.03 *** 7.41 671.46 – 700.60 687.24 *** 7.57 672.37 – 702.12 686.40 *** 7.43 671.80 – 701.00
STR -1.10 ** 0.38 -1.85 – -0.35 -1.10 ** 0.38 -1.85 – -0.36 -1.10 ** 0.38 -1.85 – -0.36
english -0.65 *** 0.04 -0.73 – -0.57 -0.43 0.28 -0.98 – 0.12 -0.60 *** 0.08 -0.75 – -0.44
english hc1 -0.11 0.14 -0.39 – 0.16
english hc2 -0.03 0.04 -0.10 – 0.04
Observations 420 420 420
R2 / R2 adjusted 0.426 / 0.424 0.427 / 0.423 0.427 / 0.423
  * p<0.05   ** p<0.01   *** p<0.001

The regression output indicates that in models 2 and 3, there is an imperfect multicollinearity problem, which is reflected by an increase in the standard error of the coefficient estimates. Specifically, the estimated effect of english becomes increasingly unstable as the correlation between english and the correlated term (which we artificially generated) becomes stronger.

2. Detecting Collinearity

Multicollinearity in a regression model can be assessed by computing the variance inflation factor (VIF) for each predictor variable. The VIF measures how much of the variation in one predictor is explained by the other predictors. It is obtained by regressing each predictor \(x_j\) on the remaining predictors; the VIF is then one divided by the tolerance, where the tolerance is defined as one minus the \(R^2\) from that regression.

\(VIF_j = \frac{1}{1-R_j^2}\)

A VIF value of 1 indicates an absence of multicollinearity, while values exceeding 5 or 10 suggest problematic collinearity (James et al., 2014). When multicollinearity is detected, it is recommended to remove one or more of the correlated variables from the model. This is because the presence of multicollinearity implies that the information provided by the correlated variables about the response variable is redundant in the presence of other variables (James et al., 2014; P. Bruce & Bruce, 2017).

The R function vif() in the car package can be used to detect multicollinearity in a regression model:

library(car)
vif(mod3) # without highly correlated term
##      STR  english 
## 1.036495 1.036495
vif(mod4) # with highly correlated term
##         STR     english english_hc1 
##    1.036565   52.750311   52.691167
vif(mod5) # with less highly correlated term
##         STR     english english_hc2 
##    1.036565    4.049684    4.007776

As anticipated, when we include a highly correlated term in the model, both english and english_hc1 exhibit extremely high VIF scores.
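
To connect these numbers back to the formula, here is a minimal sketch that reproduces the VIF of english in mod4 by hand, regressing english on the other predictors in that model:

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on the other predictors
r2_english <- summary(lm(english ~ STR + english_hc1, data = CASchools))$r.squared
1 / (1 - r2_english) # matches vif(mod4)["english"], about 52.75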

3. Principal Components Analysis (PCA)

Principal Components Analysis (PCA), which is an unsupervised machine learning technique, can help handle multicollinearity problems by creating new features that are orthogonal to each other, meaning that they have zero correlation. This eliminates the problem of having features that are redundant or that inflate the variance of your model coefficients.

The goal of PCA is to explain most of the variability in a dataset with fewer variables than the original dataset. PCA offers a way to find a low-dimensional representation of a dataset that captures as much of the variation in the data as possible. If we’re able to capture most of the variation in just two dimensions, we could project all of the observations in the original dataset onto a simple scatterplot.

The way we find the principal components is as follows:

Given a dataset with \(p\) predictors: \(X_1\), \(X_2\),…, \(X_p\), calculate \(Z_1\),…, \(Z_M\) to be the \(M\) linear combinations of the original \(p\) predictors where:

  • \(Z_m = \sum_{j=1}^{p} \Phi_{jm}X_j\) for some constants \(\Phi_{1m}\), \(\Phi_{2m}\),…, \(\Phi_{pm}\), \(m = 1, 2,...,M\).
  • \(Z_1\) is the linear combination of the predictors that captures the most variance possible.
  • \(Z_2\) is the next linear combination of the predictors that captures the most variance while being orthogonal (i.e. uncorrelated) to \(Z_1\).
  • \(Z_3\) is then the next linear combination of the predictors that captures the most variance while being orthogonal to \(Z_1\) and \(Z_2\).
  • And so on.

In practice, we use the following steps to calculate the linear combinations of the original predictors:

  1. Scale each of the variables to have a mean of 0 and a standard deviation of 1.

  2. Calculate the covariance matrix for the scaled variables.

  3. Calculate the eigenvalues of the covariance matrix.

Using linear algebra, it can be shown that the eigenvector that corresponds to the largest eigenvalue is the first principal component. In other words, this particular combination of the predictors explains the most variance in the data.

The eigenvector corresponding to the second largest eigenvalue is the second principal component, and so on.
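
These steps can be carried out by hand in a few lines of base R. The sketch below assumes a generic numeric predictor matrix X (for example, the crimeData.X object created in the example below); the prcomp() call used later performs the same computation.

# 1. center and scale each column; 2. covariance matrix; 3. eigen-decomposition
X_scaled <- scale(X)                        # X is a numeric predictor matrix (e.g., crimeData.X)
S <- cov(X_scaled)                          # covariance of the scaled data (= correlation matrix)
eig <- eigen(S)
eig$vectors[, 1]                            # loadings of the first principal component
eig$values / sum(eig$values)                # proportion of variance explained by each PC
scores <- X_scaled %*% eig$vectors          # component scores (prcomp()$x up to sign)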

Example

For this example, we will use the “Communities and Crime” dataset from the UCI Machine Learning Repository, created by Michael Redmond. This dataset includes various socio-economic, law enforcement, and crime data. A cleaned version is available in the mogavs package under the name crimeData, which has been simplified by removing all columns identified as redundant by the dataset’s author, reducing the number of columns from 128 to 123.

We will use this dataset from the mogavs package for our analysis. The outcome variable we are focusing on is violent crimes per capita, while the independent variables include aspects of law enforcement, socio-economic factors, and crime data. A detailed description of all these variables can be found here.

library(mogavs)
dim(crimeData)
## [1] 1994  123
colnames(crimeData)
##   [1] "x.V6"   "x.V7"   "x.V8"   "x.V9"   "x.V10"  "x.V11"  "x.V12"  "x.V13" 
##   [9] "x.V14"  "x.V15"  "x.V16"  "x.V17"  "x.V18"  "x.V19"  "x.V20"  "x.V21" 
##  [17] "x.V22"  "x.V23"  "x.V24"  "x.V25"  "x.V26"  "x.V27"  "x.V28"  "x.V29" 
##  [25] "x.V30"  "x.V31"  "x.V32"  "x.V33"  "x.V34"  "x.V35"  "x.V36"  "x.V37" 
##  [33] "x.V38"  "x.V39"  "x.V40"  "x.V41"  "x.V42"  "x.V43"  "x.V44"  "x.V45" 
##  [41] "x.V46"  "x.V47"  "x.V48"  "x.V49"  "x.V50"  "x.V51"  "x.V52"  "x.V53" 
##  [49] "x.V54"  "x.V55"  "x.V56"  "x.V57"  "x.V58"  "x.V59"  "x.V60"  "x.V61" 
##  [57] "x.V62"  "x.V63"  "x.V64"  "x.V65"  "x.V66"  "x.V67"  "x.V68"  "x.V69" 
##  [65] "x.V70"  "x.V71"  "x.V72"  "x.V73"  "x.V74"  "x.V75"  "x.V76"  "x.V77" 
##  [73] "x.V78"  "x.V79"  "x.V80"  "x.V81"  "x.V82"  "x.V83"  "x.V84"  "x.V85" 
##  [81] "x.V86"  "x.V87"  "x.V88"  "x.V89"  "x.V90"  "x.V91"  "x.V92"  "x.V93" 
##  [89] "x.V94"  "x.V95"  "x.V96"  "x.V97"  "x.V98"  "x.V99"  "x.V100" "x.V101"
##  [97] "x.V102" "x.V103" "x.V104" "x.V105" "x.V106" "x.V107" "x.V108" "x.V109"
## [105] "x.V110" "x.V111" "x.V112" "x.V113" "x.V114" "x.V115" "x.V116" "x.V117"
## [113] "x.V118" "x.V119" "x.V120" "x.V121" "x.V122" "x.V123" "x.V124" "x.V125"
## [121] "x.V126" "x.V127" "y"
# initialize
crimeData.y <- crimeData$y # set y as a vector out of the original data  
crimeData.X <- crimeData[, -123] # takeout y

By plotting the correlation matrix among predictors, we observe significant correlations between some variables, suggesting potential multicollinearity issues if all variables are included in the model.

res <- cor(crimeData.X, method = "pearson")
corrplot::corrplot(res, method= "color", 
                   order = "hclust", tl.pos = 'n')

As you compute the Variance Inflation Factor (VIF) for each predictor, you will notice that many variables have VIF scores exceeding 5 or 10, indicating potentially problematic multicollinearity.

fit <- lm(y ~ ., data = crimeData)
summary(fit)$r.squared
## [1] 0.6979036
round(vif(fit), 2)
##      x.V6      x.V7      x.V8      x.V9     x.V10     x.V11     x.V12     x.V13 
##    411.21     50.92     27.09     43.71      8.04     27.76     56.63     80.66 
##     x.V14     x.V15     x.V16     x.V17     x.V18     x.V19     x.V20     x.V21 
##    164.73     76.54    413.92     20.03    237.53     66.08      4.75     24.54 
##     x.V22     x.V23     x.V24     x.V25     x.V26     x.V27     x.V28     x.V29 
##     61.59     18.88      7.07    138.42    187.64    107.17      7.55      1.83 
##     x.V30     x.V31     x.V32     x.V33     x.V34     x.V35     x.V36     x.V37 
##      2.69      2.94      2.79     40.68     42.25     32.56     59.08     52.65 
##     x.V38     x.V39     x.V40     x.V41     x.V42     x.V43     x.V44     x.V45 
##     13.33     41.26      5.93      8.64     19.69     44.51    295.17     27.51 
##     x.V46     x.V47     x.V48     x.V49     x.V50     x.V51     x.V52     x.V53 
##    442.21   1319.59    111.72    166.85    165.81     22.10     16.23     11.73 
##     x.V54     x.V55     x.V56     x.V57     x.V58     x.V59     x.V60     x.V61 
##     14.03     18.61     16.77      6.46     23.16     54.16     64.93     32.00 
##     x.V62     x.V63     x.V64     x.V65     x.V66     x.V67     x.V68     x.V69 
##    137.85    460.77    732.28    465.34     45.15     37.96    285.81    302.77 
##     x.V70     x.V71     x.V72     x.V73     x.V74     x.V75     x.V76     x.V77 
##    306.56    116.33     49.29    732.40     40.51     17.87      3.60     14.91 
##     x.V78     x.V79     x.V80     x.V81     x.V82     x.V83     x.V84     x.V85 
##      5.69    711.20      3.02      3.90      6.24     11.52      3.67    347.53 
##     x.V86     x.V87     x.V88     x.V89     x.V90     x.V91     x.V92     x.V93 
##    777.26    203.14     37.64    249.43     82.63    133.85      5.08      6.59 
##     x.V94     x.V95     x.V96     x.V97     x.V98     x.V99    x.V100    x.V101 
##      3.50      5.35      2.71     75.55     12.98     19.82     11.48     12.03 
##    x.V102    x.V103    x.V104    x.V105    x.V106    x.V107    x.V108    x.V109 
##    292.31 558345.67    175.45    438.06     16.57     45.50     14.09 552634.18 
##    x.V110    x.V111    x.V112    x.V113    x.V114    x.V115    x.V116    x.V117 
##     12.67     14.35     55.13     51.63      8.50    106.64     16.77      3.99 
##    x.V118    x.V119    x.V120    x.V121    x.V122    x.V123    x.V124    x.V125 
##      4.57      3.63      5.15      3.95      9.26     51.25     19.37      3.65 
##    x.V126    x.V127 
##      2.67     68.40

Instead of arbitrarily dropping variables from our model to address multicollinearity, we can perform principal component analysis (PCA). This approach allows us to identify principal components that capture most of the variance in the data with fewer variables, thus avoiding multicollinearity and leading to more efficient estimates.

Now, the first step for PCA is to scale and center the data using the prcomp command. This command is both convenient and useful: it subtracts each variable’s mean from every observation and then divides the result by the variable’s standard deviation. Then, we can use the print and summary commands to see the rotation, the standard deviation, and the proportion of variance each component contributes to the dataset, and use fviz_eig in the factoextra package to plot a nicely formatted scree plot.

pca <- prcomp(crimeData.X, center = TRUE, scale. = TRUE)
summary(pca)
## Importance of components:
##                           PC1    PC2     PC3    PC4     PC5     PC6    PC7
## Standard deviation     5.1263 4.2852 3.14023 2.8484 2.52280 2.44983 2.1274
## Proportion of Variance 0.2154 0.1505 0.08083 0.0665 0.05217 0.04919 0.0371
## Cumulative Proportion  0.2154 0.3659 0.44674 0.5132 0.56542 0.61461 0.6517
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.92926 1.79021 1.56346 1.46739 1.39980 1.30804 1.28740
## Proportion of Variance 0.03051 0.02627 0.02004 0.01765 0.01606 0.01402 0.01359
## Cumulative Proportion  0.68222 0.70849 0.72852 0.74617 0.76223 0.77626 0.78984
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     1.26287 1.19862 1.14023 1.12564 1.05175 0.99584 0.97054
## Proportion of Variance 0.01307 0.01178 0.01066 0.01039 0.00907 0.00813 0.00772
## Cumulative Proportion  0.80291 0.81469 0.82535 0.83573 0.84480 0.85293 0.86065
##                           PC22    PC23    PC24    PC25    PC26    PC27    PC28
## Standard deviation     0.96633 0.92222 0.89262 0.85634 0.84920 0.82465 0.77804
## Proportion of Variance 0.00765 0.00697 0.00653 0.00601 0.00591 0.00557 0.00496
## Cumulative Proportion  0.86830 0.87527 0.88181 0.88782 0.89373 0.89930 0.90426
##                           PC29    PC30    PC31    PC32   PC33    PC34    PC35
## Standard deviation     0.76580 0.73427 0.72819 0.70541 0.6899 0.68415 0.67038
## Proportion of Variance 0.00481 0.00442 0.00435 0.00408 0.0039 0.00384 0.00368
## Cumulative Proportion  0.90907 0.91349 0.91784 0.92191 0.9258 0.92965 0.93334
##                           PC36    PC37    PC38    PC39    PC40    PC41    PC42
## Standard deviation     0.66397 0.62569 0.60702 0.58972 0.57603 0.56650 0.55393
## Proportion of Variance 0.00361 0.00321 0.00302 0.00285 0.00272 0.00263 0.00252
## Cumulative Proportion  0.93695 0.94016 0.94318 0.94603 0.94875 0.95138 0.95389
##                           PC43    PC44   PC45    PC46    PC47   PC48    PC49
## Standard deviation     0.54304 0.52869 0.5180 0.49787 0.49653 0.4687 0.45403
## Proportion of Variance 0.00242 0.00229 0.0022 0.00203 0.00202 0.0018 0.00169
## Cumulative Proportion  0.95631 0.95860 0.9608 0.96283 0.96486 0.9667 0.96835
##                           PC50    PC51    PC52    PC53   PC54    PC55   PC56
## Standard deviation     0.45164 0.44451 0.43885 0.42861 0.4130 0.41005 0.3987
## Proportion of Variance 0.00167 0.00162 0.00158 0.00151 0.0014 0.00138 0.0013
## Cumulative Proportion  0.97002 0.97164 0.97322 0.97472 0.9761 0.97750 0.9788
##                           PC57    PC58    PC59    PC60   PC61    PC62    PC63
## Standard deviation     0.38490 0.37296 0.37131 0.36040 0.3487 0.32963 0.32066
## Proportion of Variance 0.00121 0.00114 0.00113 0.00106 0.0010 0.00089 0.00084
## Cumulative Proportion  0.98001 0.98115 0.98228 0.98335 0.9843 0.98524 0.98608
##                           PC64    PC65    PC66   PC67    PC68    PC69    PC70
## Standard deviation     0.31633 0.30719 0.29810 0.2913 0.28788 0.27641 0.26830
## Proportion of Variance 0.00082 0.00077 0.00073 0.0007 0.00068 0.00063 0.00059
## Cumulative Proportion  0.98690 0.98767 0.98840 0.9891 0.98978 0.99040 0.99099
##                           PC71    PC72    PC73    PC74    PC75    PC76    PC77
## Standard deviation     0.25982 0.25450 0.25103 0.23801 0.23588 0.22980 0.22319
## Proportion of Variance 0.00055 0.00053 0.00052 0.00046 0.00046 0.00043 0.00041
## Cumulative Proportion  0.99155 0.99208 0.99259 0.99306 0.99351 0.99395 0.99435
##                           PC78    PC79    PC80    PC81    PC82    PC83    PC84
## Standard deviation     0.21832 0.21411 0.21056 0.20020 0.19450 0.18859 0.18256
## Proportion of Variance 0.00039 0.00038 0.00036 0.00033 0.00031 0.00029 0.00027
## Cumulative Proportion  0.99475 0.99512 0.99548 0.99581 0.99612 0.99641 0.99669
##                           PC85    PC86    PC87    PC88   PC89   PC90    PC91
## Standard deviation     0.17283 0.16885 0.16837 0.16166 0.1563 0.1556 0.14596
## Proportion of Variance 0.00024 0.00023 0.00023 0.00021 0.0002 0.0002 0.00017
## Cumulative Proportion  0.99693 0.99717 0.99740 0.99761 0.9978 0.9980 0.99819
##                           PC92    PC93    PC94    PC95    PC96    PC97    PC98
## Standard deviation     0.14342 0.13589 0.13405 0.13088 0.12553 0.11681 0.11606
## Proportion of Variance 0.00017 0.00015 0.00015 0.00014 0.00013 0.00011 0.00011
## Cumulative Proportion  0.99835 0.99851 0.99865 0.99879 0.99892 0.99903 0.99915
##                           PC99   PC100   PC101   PC102   PC103   PC104   PC105
## Standard deviation     0.11383 0.10552 0.10317 0.09484 0.09310 0.08754 0.08136
## Proportion of Variance 0.00011 0.00009 0.00009 0.00007 0.00007 0.00006 0.00005
## Cumulative Proportion  0.99925 0.99934 0.99943 0.99950 0.99957 0.99964 0.99969
##                          PC106   PC107   PC108   PC109   PC110   PC111   PC112
## Standard deviation     0.07377 0.07086 0.06453 0.06077 0.05462 0.05390 0.04860
## Proportion of Variance 0.00004 0.00004 0.00003 0.00003 0.00002 0.00002 0.00002
## Cumulative Proportion  0.99974 0.99978 0.99981 0.99984 0.99987 0.99989 0.99991
##                          PC113   PC114   PC115   PC116   PC117   PC118   PC119
## Standard deviation     0.04829 0.04163 0.03993 0.03839 0.03532 0.02948 0.02670
## Proportion of Variance 0.00002 0.00001 0.00001 0.00001 0.00001 0.00001 0.00001
## Cumulative Proportion  0.99993 0.99994 0.99996 0.99997 0.99998 0.99999 0.99999
##                          PC120   PC121     PC122
## Standard deviation     0.02461 0.02171 0.0009488
## Proportion of Variance 0.00000 0.00000 0.0000000
## Cumulative Proportion  1.00000 1.00000 1.0000000

In the summary output, the Proportion of Variance line shows what percentage of the variance each component explains: the first component explains 21.54% of the variability, the second explains 15.05%, and so on. The Cumulative Proportion line gives the percentage of variance explained by the first k components. For example, the PC41 column gives the value 0.95138, meaning that if we include the first 41 components, 95.14% of the variance is explained.
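
If you prefer to pull these numbers programmatically rather than reading them off the summary, the quick sketch below computes the proportion and cumulative proportion of variance directly from the standard deviations stored in the prcomp object (it mirrors the summary output rather than adding anything new).

# proportion of variance explained by each component, computed from pca$sdev
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# first component at which the cumulative proportion reaches 95%
which(cum_var >= 0.95)[1]  # returns 41, matching the PC41 column above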

Note that the principal components scores for each observation are stored in pca$x.

# display principal components
head(pca$x)
##            PC1        PC2         PC3        PC4        PC5       PC6
## [1,]  1.522743 -0.7916743 -3.28026359 -1.2753142  0.5157339  2.246354
## [2,] -1.015422  1.0858176 -3.62568762 -3.4731108  3.0243619 -0.354592
## [3,] -3.158742 -2.4840295  0.07796813  1.0533924 -1.5569032 -1.452501
## [4,]  2.289034  2.8487601  2.05171050  3.4190377 -2.7585249  3.700239
## [5,]  5.542577 -2.4608213  1.99899950  1.9062590 -1.0345593  2.313514
## [6,]  6.281678  7.1365764 -3.36227772 -0.4295065  2.8784476 -3.242882
##             PC7        PC8         PC9      PC10      PC11        PC12
## [1,] -2.7985542  0.9267431  0.58399391 0.6004072 0.9551796 -0.79934712
## [2,] -4.4201234  1.9166227 -1.72469679 0.1042599 0.1723816  0.04400804
## [3,] -0.3855891  3.8352238 -0.29908075 2.4358470 2.3923717  1.29196186
## [4,]  2.8111671  5.9181406 -0.77918858 2.7291999 0.3761453  4.26473995
## [5,] -0.6231897 -0.2395864  0.39365486 0.2781169 0.3233575  1.17753888
## [6,]  0.1920078  0.4265712  0.09238477 0.1389794 1.7694596  0.30840918
##            PC13       PC14       PC15       PC16        PC17        PC18
## [1,] -0.9666304  0.1303065 -0.4657864  1.2189250  0.06043401 0.729760127
## [2,] -2.8716352 -0.5208606  1.0407432  0.5515834  1.16689692 0.002392092
## [3,]  0.9311878 -0.8210312  2.1994493  0.7673565 -1.14019003 0.424345038
## [4,]  1.6327709 -2.8870481 -3.5170495  1.0626643 -1.28514836 1.244770149
## [5,] -0.1779068 -0.9654716 -0.2833528 -0.7469055 -0.47950830 0.936034210
## [6,] -2.9758081 -0.3608440  2.0664083  0.5216004  1.48223391 2.949611589
##             PC19       PC20       PC21        PC22          PC23       PC24
## [1,] -0.79921663  0.1319244  0.1489032  0.07092186 -0.3082014815 -0.8177185
## [2,] -0.62406584  0.1334736 -0.3755883  0.12341205  0.9769526455  0.6355450
## [3,] -0.07681935  0.7818357  0.0991699 -0.77126234  0.0405022795  0.2899994
## [4,] -1.60374672  0.9227981  0.1087099  0.35152183  0.0005690714  0.9002704
## [5,]  0.88507386  1.8341636 -0.9523725  0.57316432  1.3744618673  0.8042577
## [6,] -0.70898935 -0.6434944  0.1147150 -0.66398971  0.0099635183  0.5997810
##            PC25       PC26       PC27        PC28        PC29       PC30
## [1,] -0.3069298 -0.4152735  0.2167211  0.21327767 -0.14586468 -0.3158446
## [2,]  0.2369439  1.0503642  0.6764152  0.08486815  0.94130197  1.5350282
## [3,]  0.6215431 -1.3418481  1.8594000  0.31524477 -0.44893741 -0.1837356
## [4,]  0.4220371  1.2271946  1.2904089 -0.67079238  0.08942445 -0.5654088
## [5,] -0.4217490  1.1109450 -1.1807146  0.38248434 -0.11894801  0.6922435
## [6,]  1.2264617 -0.2096174 -0.5534602 -0.04476907  1.08860753  0.2460175
##            PC31        PC32       PC33       PC34        PC35       PC36
## [1,]  0.6630573 -1.02344207  0.3339815 -0.0757706  0.06264455  0.3350997
## [2,]  0.4915675  0.11530529 -0.2695809  0.2543846  0.38000284  0.6615344
## [3,]  0.8941770 -0.20630166 -0.2529353  0.2248543  0.68720542  0.2818908
## [4,]  1.0092294  0.87614189 -0.5055810 -0.6989796  1.60577512 -0.1289466
## [5,]  0.3785551  0.81712031 -0.2265443 -0.4916125  0.59171867 -0.2505556
## [6,] -0.0408304  0.09808626  1.0147548  0.3374860 -0.37652980 -0.4025329
##              PC37       PC38          PC39       PC40        PC41        PC42
## [1,]  0.259200542  0.2418568 -0.0004330426 -0.4746566 -0.14509983 -0.08296911
## [2,]  1.878980303  0.2291362 -0.2598100237 -0.1539732  0.74360461 -0.64468912
## [3,] -0.004971392  0.3541479 -0.4798498803  0.5575868  0.37614191  0.32161952
## [4,]  0.616734544  0.1541563  0.6113958888  0.3656620  0.57434140  1.03110160
## [5,] -0.357031178 -0.9109974 -0.7144578826  0.9458072 -0.09201943 -0.24641243
## [6,] -0.117014069 -0.2321068  0.8217520978  1.0968101 -1.18647049  1.68628074
##             PC43        PC44        PC45        PC46        PC47       PC48
## [1,] -0.20996256 -0.26979783 -0.02232234  0.19467958 -0.12852110  0.1186386
## [2,]  0.26534725  0.06124778 -0.14509262 -0.20542773  0.74906735  0.6352985
## [3,]  0.25466846  0.34548537  0.58932503 -0.04111156  0.23739990 -0.4620854
## [4,]  0.02436998 -0.72168563  0.51396741  0.85006150 -0.51750603 -0.5766830
## [5,]  1.06357153 -0.47374715 -0.46508358  0.17740669  0.07338641  0.2011894
## [6,]  0.64559540  0.76391268 -0.33056362 -1.40686319  0.23855990  0.5412779
##             PC49        PC50         PC51        PC52         PC53        PC54
## [1,] -0.47408426  0.36632820 -0.221114792  0.08659287  0.202921887 -0.56429022
## [2,]  0.27977737  0.06500505 -0.379707373  0.15280599 -0.158091159  0.10436843
## [3,]  0.31312759 -0.56695086 -0.372736267  1.03517280 -0.003103641 -0.18345652
## [4,] -0.14841188 -0.90026578  0.678598688 -0.42198488 -0.388367929  0.04813490
## [5,]  0.03735946 -0.13589903  0.005542966  0.25711781  0.443969205  0.08099316
## [6,]  0.57913087  0.36442670 -0.210534692  0.08908420 -0.223842488 -0.03031585
##             PC55        PC56       PC57       PC58        PC59        PC60
## [1,]  0.17788012 -0.09741465 -0.1584140 -0.3327349 -0.23543274  0.14137259
## [2,] -0.31194653  0.40355200 -0.8192633 -0.9560810 -0.03005856  0.27905650
## [3,]  0.51685549 -0.01196752  0.8944203  0.1020099  0.19358496  0.01860159
## [4,]  0.24529401 -0.26928696 -0.8567052  0.7879718  0.89692562 -0.15526450
## [5,]  0.07380586 -0.07165184  0.4605282 -0.7086372  0.02123262  0.15349342
## [6,] -0.04679451 -0.61733981  0.3886963  0.1180046  0.15272849 -0.74067126
##            PC61        PC62         PC63        PC64        PC65       PC66
## [1,]  0.5420337 -0.21522110  0.011337783 -0.17855066 -0.03791060  0.2101025
## [2,] -0.2090061  0.28012117  0.407462036 -0.08357081  0.04968771 -0.2213775
## [3,]  0.3609713  0.06828475 -0.003728144 -0.26821311 -0.02681214  0.3520063
## [4,] -0.3382821 -0.63885301  0.708558150 -0.26610003  0.48286393  0.4088681
## [5,]  0.0682367 -0.31584887 -0.551599596  0.10789834 -0.19562580 -0.0128477
## [6,] -0.7194117  0.55382257  0.514415967  0.10134009 -0.30701522 -0.4432713
##             PC67        PC68       PC69        PC70       PC71       PC72
## [1,] -0.51909689  0.28334891 0.09476049  0.19610574  0.1407269 -0.1277971
## [2,]  0.13757464  0.02999191 0.22958571  0.11785784  0.2285787  0.1875041
## [3,] -0.37177927 -0.29498899 0.21763723  0.01009910  0.4531931 -0.4678119
## [4,]  0.05099891  0.32802894 0.52196959 -0.08467014 -0.0582615 -0.2535461
## [5,]  0.04325537 -0.17818549 0.46153578  0.39289479 -0.3359191  0.2128533
## [6,] -0.21770012 -0.21199440 0.19043841  0.02908808 -0.4286533 -0.1956865
##            PC73        PC74        PC75         PC76        PC77          PC78
## [1,]  0.2428455  0.14955193  0.24772571 -0.278433174  0.32258807  0.0009070064
## [2,] -0.1455737  0.01297870 -0.31442006  0.325249630 -0.12592710  0.1187763524
## [3,]  0.1381617  0.02066202 -0.04140399  0.067418905  0.02249998 -0.2817578199
## [4,]  0.1924255 -0.24635625  0.17189179  0.242693556 -0.15010779 -0.5763854872
## [5,] -0.3613821 -0.39889662 -0.15685371 -0.005084836  0.01039949  0.0588177443
## [6,]  0.1170826 -0.10292304 -0.68971469  0.120030215  0.34315601 -0.2274151946
##            PC79        PC80        PC81        PC82         PC83          PC84
## [1,]  0.2072092  0.11045114  0.02290726  0.16893495 -0.106774335  0.0357632989
## [2,]  0.2019984 -0.13363946  0.22413684 -0.03204937  0.183270913  0.0562400806
## [3,] -0.1783240  0.06125581  0.13483458  0.28799251  0.240764564 -0.0007928556
## [4,]  0.1234795 -0.08927775  0.11526411  0.32315921  0.289921213  0.2614749976
## [5,] -0.0677084 -0.12118897  0.36184767 -0.29040697  0.062395112 -0.0646438834
## [6,] -0.3710100 -0.07073274 -0.02863557 -0.23347614 -0.008981002  0.0587340638
##               PC85        PC86        PC87        PC88         PC89
## [1,] -0.1132907008  0.19209582  0.01316851 -0.17345931 -0.006560923
## [2,] -0.1127852532 -0.01852039  0.09825591  0.09264179 -0.041461218
## [3,]  0.0007632933  0.17275546  0.09530554 -0.11461990  0.020477096
## [4,] -0.2483955280 -0.01237575 -0.01099696 -0.11966783 -0.229702465
## [5,] -0.0040571625 -0.44420163  0.02773361 -0.25533816  0.075757828
## [6,] -0.2745976443 -0.21179701  0.13560804 -0.09763305  0.093732040
##              PC90        PC91        PC92        PC93        PC94        PC95
## [1,] -0.003642731 -0.11289932  0.11608906 -0.07704524  0.06614833  0.11492631
## [2,]  0.024193468  0.03816543 -0.11774730  0.03284336  0.22907143 -0.04057139
## [3,]  0.073810100  0.18713160 -0.23797510  0.09322499 -0.10560470 -0.04332543
## [4,] -0.130182703  0.03867649 -0.26777218  0.05498199  0.12958923  0.21716528
## [5,]  0.061212379 -0.14136550  0.04223394  0.03579572 -0.11236475  0.19984461
## [6,] -0.160984487 -0.07197470 -0.02657317  0.15250861  0.15430530  0.08072158
##              PC96          PC97        PC98        PC99       PC100
## [1,] -0.258270737 -0.0362312424  0.03474901  0.08725786 -0.02848885
## [2,] -0.090128855 -0.0537359918 -0.09445251 -0.05327007  0.01088949
## [3,] -0.006743572  0.0120459893 -0.12821049 -0.08188170  0.12048035
## [4,]  0.037614628 -0.0006408959  0.19091908  0.17668717  0.03509228
## [5,] -0.007310014  0.0389636696  0.13666343  0.05765868 -0.10317519
## [6,] -0.088021629  0.1189092184 -0.09280172  0.14622446  0.18399030
##             PC101        PC102       PC103        PC104       PC105       PC106
## [1,] -0.013583776  0.048909193  0.00216378  0.025381950  0.05621436  0.02345814
## [2,] -0.057960400  0.016098193  0.07359925 -0.101541065 -0.06608601 -0.03191740
## [3,]  0.007994631 -0.004963888  0.02902918 -0.076526590 -0.05591075  0.11492641
## [4,]  0.203271380 -0.113055603 -0.16515165 -0.021816263 -0.02176457  0.07787187
## [5,]  0.138790887 -0.085911058  0.04003697 -0.009699707  0.01974313 -0.11142903
## [6,]  0.050447797  0.133785735 -0.10120763  0.155234520 -0.05841889 -0.03275102
##            PC107       PC108       PC109        PC110        PC111       PC112
## [1,] -0.03918637 -0.10011823  0.04367051 -0.078567824  0.061063148 -0.01210633
## [2,]  0.02430551 -0.02982626  0.10875893  0.009612063 -0.075154447 -0.04936937
## [3,] -0.03971283 -0.04659983  0.06379742  0.044902480 -0.053825635  0.04671148
## [4,]  0.13376939  0.01577792  0.20098914 -0.127069247  0.063238208 -0.03424504
## [5,]  0.06745432  0.07870918 -0.03788183 -0.074087792 -0.006017305 -0.02745800
## [6,] -0.08736603  0.17214458 -0.02495514 -0.034747967 -0.002569887  0.09188466
##             PC113       PC114        PC115       PC116        PC117
## [1,]  0.043628685  0.01734119  0.012264508  0.05154895  0.021639315
## [2,]  0.090416102 -0.04063677 -0.051208845  0.02021354 -0.010790308
## [3,]  0.009164668 -0.01981276  0.045332507  0.01182388 -0.085536545
## [4,]  0.018914629 -0.07159628  0.052524427 -0.02034317 -0.012929678
## [5,]  0.009371864  0.02543947 -0.070515577  0.02912714  0.002946295
## [6,] -0.049328896  0.01240519  0.002689908  0.05069294 -0.042221767
##              PC118        PC119         PC120        PC121         PC122
## [1,] -0.0008515265  0.005301082 -0.0004861255  0.005744919 -0.0003524091
## [2,]  0.0411841056  0.007092108 -0.0349009165 -0.070776929  0.0007109495
## [3,]  0.0040223169  0.012616585 -0.0145490756  0.004363147  0.0002528977
## [4,] -0.0304318336 -0.016297073  0.0377117417 -0.009217490  0.0003120880
## [5,]  0.0338064744 -0.030311422 -0.0581924432  0.011809583  0.0006055602
## [6,] -0.0182105313 -0.028032027 -0.0333198643 -0.021873913 -0.0009814977

Let’s revisit the correlation plot. Because the principal components are orthogonal, they exhibit no correlation with one another. Consequently, the correlation plot appears perfectly white, except along the diagonal, where each component is perfectly correlated with itself.

res2 <- cor(pca$x, method = "pearson")
corrplot::corrplot(res2, method= "color", 
                   order = "hclust", tl.pos = 'n')

One challenge of using PCA is deciding how many principal components to retain. Retaining too many may not sufficiently reduce dimensionality, potentially maintaining multicollinearity. Conversely, keeping too few can lead to significant information loss, reducing the accuracy of your model. A widely used method to determine the optimal number of principal components is the scree plot. This plot displays the eigenvalues of the principal components in descending order and identifies where an abrupt change in the slope occurs, signaling a drop in the importance of subsequent components. The ideal pattern is a steep curve, followed by a bend, and then a straight line. Use the components in the steep curve before the first point that starts the line trend.

factoextra::fviz_eig(pca, barfill = "gray73", barcolor = "gray73",
                     linecolor = "steelblue2", xlab = "Principal Component", ncp = 30)

As shown in the scree plot above, at first glance, it appears that the curve starts to level off from PC10, suggesting that including the first 9 components may be optimal. According to the summary results of the PCA object, including these first 9 components explains 70.85% of the variance.

Now let’s rerun the model with 9 principal components and compute the VIF scores for each component. As shown below, all the VIF scores are 1, confirming the absence of multicollinearity as previously indicated by the correlation matrix. Instead of including numerous correlated variables, which could lead to inefficient estimates, the revised model now offers a more efficient set of estimates. Moreover, it still explains a significant portion of the variance in the outcome variable, with an \(R^2\) of 0.6261, only slightly lower than the 0.6979 of the original model.

crimeData.pca <- as.data.frame(cbind(y = crimeData.y, pca$x[, 1:9]))
fit_pca <- lm(y ~ ., data = crimeData.pca)
summary(fit_pca)$r.squared
## [1] 0.6260937
round(vif(fit_pca), 2)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 
##   1   1   1   1   1   1   1   1   1

For a more stringent method to determine how many principal components should be included in the linear regression model, please see here for an example.

Another challenge of using PCA is interpreting the meaning of the principal components. Unlike the original features, which may have a clear definition and unit, the principal components are abstract and hard to explain. One way to interpret them is to look at the coefficients (loadings) of the linear combination that forms each component. These coefficients indicate how much each original feature contributes to the component and the direction of the relationship. For example, if a component has a large positive coefficient for a feature, the component increases with that feature.
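
As a quick illustration, the loadings are stored in pca$rotation; the sketch below pulls the ten variables with the largest absolute loadings on the first component, keeping their signs so the direction of each contribution stays visible.

# loadings (rotation matrix): one column per component, one row per variable
loadings_pc1 <- pca$rotation[, 1]

# ten largest contributors to PC1 by absolute loading, signs preserved
loadings_pc1[order(abs(loadings_pc1), decreasing = TRUE)][1:10]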

The fviz_contrib function from the factoextra package can be used to draw a bar plot of variable contributions. If your data contain many variables, you can choose to show only the top contributors. The R code below shows the top 50 variables contributing to the first principal component in this example:

factoextra::fviz_contrib(pca, choice = "var", axes = 1, top = 50, 
                         fill = "gray73", color = "gray73")

The red dashed line on the graph above indicates the expected average contribution. If the contribution of the variables were uniform, the expected value would be 1/length(variables) = 1/122 = 0.82%. For a given component, a variable with a contribution larger than this cutoff could be considered as important in contributing to the component.

For instance, in the linear regression model where we include principal components instead of the original variables, there is a negative relationship between our outcome variable, violent crimes per capita, and PC1. According to the contribution plot, variables such as x.V25, x.V18, x.V50, and others above the red dashed line contribute heavily to PC1; for those that load on PC1 with a positive sign, the relationship between our outcome variable and these variables is therefore likely negative as well.
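
To check this directly, we can look at the signs of these variables' loadings on PC1. This is only a sketch: it assumes the columns of crimeData.X carry the same names shown in the contribution plot (e.g., x.V25).

# signs of the PC1 loadings for a few of the top contributors; a positive
# loading here, combined with PC1's negative coefficient in the regression,
# implies a likely negative association with the outcome
pca$rotation[c("x.V25", "x.V18", "x.V50"), 1]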

Discussion V

Today’s discussion will begin by examining the critiques of utilizing ordinary least squares for regression modeling when the outcome variable is categorical rather than continuous. Subsequently, we will explore the appropriate methods for modeling a categorical outcome variable.

  1. Linear Probability Model
  2. Logistic Regression Model

1. Linear Probability Model

1.1 What is the Linear Probability Model?

In an ordinary least squares (OLS) linear model, the output is the conditional expectation of \(Y\) given \(X_i\), that is, \(E(Y|X_i)\), where \(Y\) is our continuous outcome variable and \(X_i\) are explanatory variables. If we use OLS to estimate effects on a dichotomous outcome variable, the original target, \(E(Y|X_i)\), becomes \(Pr(Y=1|X_i)\). Since the outcome variable can now take only the values 0 or 1, applying OLS in this setting intuitively amounts to estimating the proportion of 1s at different values of our explanatory variables. This is known as the linear probability model, which can be written as the following equation:

\[Pr(Y=1|X_i)=\alpha + \beta_1x_{1i} + \beta_2x_{2i} + ... + \beta_kx_{ki} + \varepsilon_i\]

1.2 Statistical Critiques about the Linear Probability Model

Critiques of using OLS to estimate effects on a dichotomous dependent variable can be classified into two branches.

  • Violation of Homoskedasticity Assumption

    First, applying OLS to a dichotomous outcome variable violates one of the Gauss-Markov assumptions: homoskedasticity. The homoskedasticity assumption requires that the error variance be constant across values of \(X_i\) (i.e., \(Var(\varepsilon_i)=\sigma^2\)). When this assumption is violated, the error variance is not constant but heteroskedastic. Although heteroskedasticity biases the standard errors rather than the estimates of the regression coefficients, the significance tests of those estimates are still affected: the OLS estimator is no longer efficient, and inflated standard errors lead us to fail to reject the null hypothesis more often than we should. The linear probability model frequently violates homoskedasticity because the conditional distribution of the error term \(\varepsilon_i\) in such a model is dichotomous. Recall the Chilean voters example from the lecture.

    # read in data
    mydata <- read.delim("/Users/yu-shiuanhuang/Desktop/year#3/SPRING/TA-POL213/data/Chile3.txt", header=TRUE, sep="\t")
    
    # recode voting intention
    mydata$vote2 <- 0
    mydata$vote2[mydata$vote == "Y"] <- 1
    mydata$vote2[mydata$vote == "N"] <- 0
    mydata$vote2[mydata$vote == "U"] <- NA
    mydata$vote2[mydata$vote == "A"] <- NA
    
    mydata2 <- subset(mydata, vote2 != "NA")
    # scatterplot of voting intention against support for status quo
    ggplot(data = mydata2, aes(x = statusquo, y = vote2)) +
      geom_jitter(width = 0.1, height = 0.1, alpha = 0.7, col = "gray66") +
      geom_hline(yintercept = 1, col = "salmon2", alpha = 0.7, linetype = "dashed") +
      geom_hline(yintercept = 0, col = "salmon2", alpha = 0.7, linetype = "dashed") +
      scale_y_continuous(breaks = seq(0, 1, 0.2), labels = seq(0, 1, 0.2)) +
      scale_x_continuous(breaks = seq(-1.5, 1.5, 0.5), labels = seq(-1.5, 1.5, 0.5)) +
      labs(x = "Support for Status Quo", y = "Voting Intention") +
      theme_bw()

    # linear probability model
    ols.chile <- lm(vote2 ~ statusquo, data = mydata2)

     

    OLS
    Variables          Estimates     S.E.   C.I. (95%)
    (Intercept)        0.49 ***      0.01   0.48 – 0.50
    statusquo          0.39 ***      0.01   0.38 – 0.41
    Observations       1754
    R2 / R2 adjusted   0.730 / 0.730
    * p<0.05   ** p<0.01   *** p<0.001

    # plot residuals
    mydata2$residuals <- ols.chile$residuals
    
    ggplot(data = mydata2, aes(x = statusquo, y = residuals)) +
      geom_jitter(width = 0.1, height = 0.1, alpha = 0.7, col = "gray66") +
      geom_hline(yintercept = 0, col = "salmon2", alpha = 0.7, linetype = "dashed") +
      scale_y_continuous(breaks = seq(-1, 1, 0.5), labels = seq(-1, 1, 0.5)) +
      scale_x_continuous(breaks = seq(-1.5, 1.5, 0.5), labels = seq(-1.5, 1.5, 0.5)) +
      labs(x = "Support for Status Quo", y = "Residuals") +
      theme_bw()

    Specifically, because \(Y_i\) is either 0 or 1, the conditional distribution of \(\varepsilon_i\) is also dichotomous.

    \(Y_i = 1:\) \(\varepsilon_i = 1-E(Y_i) = 1-(\alpha+\beta X)\)

    \(Y_i = 0:\) \(\varepsilon_i = 0-E(Y_i) = 0-(\alpha+\beta X)\)

  • Inaccurate Estimated Effects at the Extremes

    The second branch of critique is that the effects estimated by the linear probability model are inaccurate at the extremes. Recall that a dichotomous dependent variable can only take the values 1 or 0. If we use OLS to estimate the effect, the linear coefficients can generate meaningless results because the predicted values (i.e., probabilities) easily fall outside the interval between 0 and 1, which complicates the interpretation and implications of the findings. For example, how would we justify a hypothetical voter who has a 110% probability of voting for a candidate? Let’s recall the Chilean voters example again. The fitted values from the linear probability model show that some predictions lie outside the (0, 1) interval: the minimum fitted value is -0.18799, and the maximum is 1.16744.

    summary(ols.chile$fitted.values)
    ##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
    ## -0.18799  0.06449  0.42559  0.49373  0.95418  1.16744
    # scatterplot of voting intention against support for status quo and the fitted regression line
    mydata2$fitted.values <- ols.chile$fitted.values
    
    ggplot(data = mydata2, aes(x = statusquo, y = vote2)) +
      geom_jitter(width = 0.1, height = 0.1, alpha = 0.7, col = "gray66") +
      geom_hline(yintercept = 1, col = "salmon2", alpha = 0.7, linetype = "dashed") +
      geom_hline(yintercept = 0, col = "salmon2", alpha = 0.7, linetype = "dashed") +
      geom_line(data = mydata2, aes(x = statusquo, y = fitted.values), col = "steelblue2", alpha = 0.7, lwd = 1) +
      scale_y_continuous(breaks = seq(0, 1, 0.2), labels = seq(0, 1, 0.2)) +
      scale_x_continuous(breaks = seq(-1.5, 1.5, 0.5), labels = seq(-1.5, 1.5, 0.5)) +
      labs(x = "Support for Status Quo", y = "Voting Intention") +
      theme_bw()

2. Logistic Regression Model

    Due to the two main statistical critiques of the linear probability model above, we use the CDF of the logistic distribution to transform the linear predictor so that predictions are bounded between 0 and 1, and we estimate the parameters through a maximum likelihood approach (unlike the least squares approach we have used to solve the optimization problem for our models so far).

    The logistic regression model we are going to estimate can be written as the following equation:

    \[\pi_i \equiv Pr(Y_i=1|X_i) = \Lambda(\alpha+\beta_1 x_{1i}+...+\beta_k x_{ki}) = \frac{1}{1+exp(-(\alpha+\beta_1 x_{1i}+...+\beta_k x_{ki}))}=\frac{exp(\alpha+\beta_1 x_{1i}+...+\beta_k x_{ki})}{1+exp(\alpha+\beta_1 x_{1i}+...+\beta_k x_{ki})}\]

    in which \(\Lambda()\) is a function that can transform and map predicted values to probabilities by constraining them to the unit interval (0, 1). The above equation can be further linearized by the following transformation:

    \[logit(\pi_i)=log\frac{\pi_i}{1-\pi_i}=\alpha+\beta_1 x_{1i}+...+\beta_k x_{ki}\]

    The left-hand side is termed the logit, which stands for "logistic unit"; it is also known as the log odds. Our model thus produces values on the log-odds scale, and with the logistic equation above we can transform those values back to the 0-1 range. Now, the question remains: how do we find the best parameter estimates for our model (i.e., how do we find \(\alpha\) and the \(\beta\)s)? By applying the maximum likelihood framework!
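
    As a small aside, base R already provides this pair of transformations: qlogis() maps a probability to its log-odds and plogis() maps log-odds back to a probability (they are the quantile and distribution functions of the logistic distribution). A quick illustration:

    # probability -> log-odds and back, using the logistic quantile/CDF functions
    qlogis(0.75)       # log(0.75 / 0.25) = 1.098612
    plogis(1.098612)   # recovers 0.75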

    The best parameter estimates are the ones that maximize the likelihood of the statistical model actually producing the observed data. You can think of this as fitting a probability distribution to an observed data set: the parameters of the distribution should maximize the likelihood that the observed data came from that distribution.

    In logistic regression, the response variable is modeled with a binomial distribution or its special case Bernoulli distribution. For a response variable \(Y_i\) that takes on two values {0, 1} with probability \(\pi_i\) and \((1-\pi_i)\), we can summarize the probability distribution of \(Y_i\) simply as:

    \[p(y_i) \equiv Pr(Y_i=y_i)=\pi_i^{y_i}(1-\pi_i)^{1-y_i}\]

    For a sample of N independent observations, the joint probability for the data can be written as:

    \[p(y_1, y_2,...,y_N)=Pr(Y|\pi)=\prod_{i=1}^{N}\pi_i^{y_i}(1-\pi_i)^{1-y_i}\]

    substituting from the logit model (\(\pi_i\) is dependent on parameters \(\alpha\) and \(\beta\) and also on the values of predictor variables \(x_i\)):

    \[Pr(Y|\alpha, \beta)=\prod_{i=1}^{N}(\frac{1}{1+exp(-(\alpha+\beta_1 x_{1i}+...+\beta_k x_{ki}))})^{y_i}(1-\frac{1}{1+exp(-(\alpha+\beta_1 x_{1i}+...+\beta_k x_{ki}))})^{1-y_i}\]

    \[=\prod_{i=1}^{N}(1+exp(-(\alpha+\beta_1 x_{1i}+...+\beta_k x_{ki})))^{-y_i}(1+exp(\alpha+\beta_1 x_{1i}+...+\beta_k x_{ki}))^{y_i-1}\]

    We can find the parameters that are most likely to have produced the results you observed by maximizing the above joint probability function. Thinking of this equation as a function of the parameters while treating the data (\(y_1, y_2, ..., y_N\)) as fixed gives us the likelihood function for the logit model \(L(\alpha, \beta|Y)\), which is proportional to \(Pr(Y|\alpha, \beta)\). For an easier computation, let’s take a log of it to further transform it to a log-likelihood function (the logarithms are good to reduce the numerical value of big numbers; for example, they enable us to simplify calculations by reducing the exponent of a number, and also allow us to transform products into sums). The log-likelihood function can be written as:

    \[lnL(\alpha, \beta|y)=\Sigma_{i=1}^{N}[-y_i ln[1+exp(-(\alpha+\beta_1 x_{1i}+...+\beta_k x_{ki}))]-(1-y_i)ln[1+exp(\alpha+\beta_1 x_{1i}+...+\beta_k x_{ki})]]\]

    By employing the log-likelihood function with our dataset, we can utilize numerical computation in R to find the values of \(\alpha\) and \(\beta\) that maximize this function. Typically, the computation begins by forming an initial guess for the parameters and subsequently adjusting them based on the log-likelihood function’s output, aiming to reach the maximum. Consequently, multiple iterations of the numerical process are involved.
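
    To make this concrete, here is a minimal sketch of that numerical maximization using optim() on the Chilean voters data prepared above (mydata2, with vote2 and statusquo). In practice glm() does this for us, so the sketch is purely illustrative.

    # keep complete cases, mirroring what glm() does internally
    d <- na.omit(mydata2[c("vote2", "statusquo")])
    
    # negative log-likelihood for a logit with one predictor
    negLL <- function(par, y, x) {
      eta <- par[1] + par[2] * x            # linear predictor: alpha + beta * x
      -sum(y * eta - log(1 + exp(eta)))     # negative of the log-likelihood above
    }
    
    # start at (0, 0) and let optim() climb toward the maximum likelihood estimates
    fit.optim <- optim(par = c(0, 0), fn = negLL, y = d$vote2, x = d$statusquo)
    fit.optim$par  # should be close to the glm() estimates reported below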

    Example: Chilean voters

    Let’s now run a logistic regression model on our Chilean voters data.

    logit.chile <- glm(vote2 ~ statusquo, data = mydata2, 
                       family = binomial(link = "logit"))
    
    stargazer::stargazer(logit.chile, digits = 5, type = "text")
    ## 
    ## =============================================
    ##                       Dependent variable:    
    ##                   ---------------------------
    ##                              vote2           
    ## ---------------------------------------------
    ## statusquo                 3.20554***         
    ##                            (0.14310)         
    ##                                              
    ## Constant                   0.21531**         
    ##                            (0.09964)         
    ##                                              
    ## ---------------------------------------------
    ## Observations                 1,754           
    ## Log Likelihood            -376.29370         
    ## Akaike Inf. Crit.          756.58740         
    ## =============================================
    ## Note:             *p<0.1; **p<0.05; ***p<0.01

    What does the coefficient of statusquo mean? The regression output here is in such form:

    \(log \frac{\pi_i}{1-\pi_i}=\alpha + \beta X_i=0.21531+3.20554X_i\)

    where \(\pi_i\)=Pr(voted yes for supporting Pinochet)

    The coefficient of the “statusquo” variable is found to be significantly positive, suggesting that a one-unit increase in “statusquo” is associated with a 3.20554 unit increase in the log-odds. What does this even mean?

    For a more intuitive interpretation, we can transform the regression output into an odds ratio (i.e., \(\frac{\pi_i}{1-\pi_i}\); the odds that a voter supports Pinochet versus opposes him). How do we transform log-odds into an odds ratio? Exponentiate the log-odds!

    exp(coef(logit.chile))
    ## (Intercept)   statusquo 
    ##    1.240243   24.668900

    The odds ratio for the statusquo variable means that, for a one-unit increase in statusquo, we expect about a 2367% ((24.67-1)*100%) increase in the odds that a voter supports Pinochet versus opposes him; that is, a voter is more likely to vote for Pinochet when he/she is more inclined to maintain the status quo.

    We can also confirm this interpretation by looking at the predicted values using the estimated coefficients and then calculating the probability:

    \(log \frac{\pi_i}{1-\pi_i}=0.21531+3.20554*statusquo\)

    where \(\pi_i\)=Pr(voted yes for supporting Pinochet)

    \(\pi_i = \frac{exp(0.21531+3.20554*statusquo)}{1+exp(0.21531+3.20554*statusquo)}\)

    Let’s first see the range of the statusquo variable:

    summary(mydata2$statusquo)
    ##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
    ## -1.725940 -1.085260 -0.168955  0.003962  1.172380  1.713550

    For a voter whose statusquo value is 0, the probability of supporting Pinochet is 0.5536205 and the probability of opposing him is 0.4463795. The odds that such a voter supports Pinochet versus opposes him are then 1.240246.

    pi_0 <- exp(0.21531 + 3.20554*0) / (1 + exp(0.21531 + 3.20554*0))
    pi_0
    ## [1] 0.5536205
    1 - pi_0
    ## [1] 0.4463795
    or_0 <- pi_0 / (1 - pi_0); or_0
    ## [1] 1.240246

    For a voter whose statusquo value is 1, the probability of supporting Pinochet is 0.9683498 and the probability of opposing him is 0.03165017. The odds that such a voter supports Pinochet versus opposes him are then 30.59541.

    pi_1 <- exp(0.21531 + 3.20554*1) / (1 + exp(0.21531 + 3.20554*1))
    pi_1
    ## [1] 0.9683498
    1 - pi_1
    ## [1] 0.03165017
    or_1 <- pi_1 / (1 - pi_1); or_1
    ## [1] 30.59541

    By dividing or_1 by or_0, we find that a one-unit increase in statusquo is associated with a 2367% ((24.67-1)*100%) increase in the odds that a voter supports Pinochet rather than opposes him.

    or_1 / or_0
    ## [1] 24.66882

    Plot the fitted probabilities based on the logistic regression model.

    # scatterplot of voting intention against support for status quo and the fitted logistic regression line
    x <- seq(-1.7, 1.8, 0.1)
    predict.df <- data.frame(x = x,
                             prob = faraway::ilogit(0.21531 + 3.20554*x))
    
    ggplot(data = mydata2, aes(x = statusquo, y = vote2)) +
      geom_jitter(width = 0.1, height = 0.1, alpha = 0.7, col = "gray66") +
      geom_hline(yintercept = 1, col = "salmon2", alpha = 0.7, linetype = "dashed") +
      geom_hline(yintercept = 0, col = "salmon2", alpha = 0.7, linetype = "dashed") +
      geom_line(data = predict.df, aes(x = x, y = prob), col = "darkseagreen", alpha = 0.7, lwd = 1) +
      scale_y_continuous(breaks = seq(0, 1, 0.2), labels = seq(0, 1, 0.2)) +
      scale_x_continuous(breaks = seq(-1.5, 1.5, 0.5), labels = seq(-1.5, 1.5, 0.5)) +
      labs(x = "Support for Status Quo", y = "Voting Intention") +
      theme_bw()

    ilogit() computes the inverse logit transformation (i.e., \(\frac{exp(x)}{1+exp(x)}\)).

    You can also use the predict() function we discussed before to compute fitted probabilities based on your model.

    # scatterplot of voting intention against support for status quo and the fitted logistic regression line
    x <- data.frame(statusquo = seq(-1.7, 1.8, 0.1))
    predict.df2 <- data.frame(x = x,
                             prob = predict(logit.chile, newdata = x, type = "response", se.fit = TRUE))
    predict.df2 <- predict.df2 %>%
      mutate(upr = prob.fit + (1.96 * prob.se.fit),
             lwr = prob.fit - (1.96 * prob.se.fit))
    
    ggplot(data = mydata2, aes(x = statusquo, y = vote2)) +
      geom_jitter(width = 0.1, height = 0.1, alpha = 0.7, col = "gray66") +
      geom_hline(yintercept = 1, col = "salmon2", alpha = 0.7, linetype = "dashed") +
      geom_hline(yintercept = 0, col = "salmon2", alpha = 0.7, linetype = "dashed") +
      geom_line(data = predict.df2, aes(x = statusquo, y = prob.fit), col = "darkseagreen", alpha = 0.7, lwd = 1) +
      geom_line(data = predict.df2, aes(x = statusquo, y = upr), col = "darkseagreen", alpha = 0.7, linetype = "dashed") +
      geom_line(data = predict.df2, aes(x = statusquo, y = lwr), col = "darkseagreen", alpha = 0.7, linetype = "dashed") +
      scale_y_continuous(breaks = seq(0, 1, 0.2), labels = seq(0, 1, 0.2)) +
      scale_x_continuous(breaks = seq(-1.5, 1.5, 0.5), labels = seq(-1.5, 1.5, 0.5)) +
      labs(x = "Support for Status Quo", y = "Voting Intention") +
      theme_bw()

    Discussion VI

    In today’s discussion, we will demonstrate how the analysis techniques used in OLS regression can also be applied to logistic regression. We will then explore methods for evaluating the performance of different logistic regression models to determine which is superior. Finally, we will review ordered logistic regression, discussing its differences from binary logistic regression and OLS regression.

    1. Interactive Logistic Regression Model
    2. Model Selection
    3. Ordered Logit Model

    1. Interactive Logistic Regression Model

    Depending on your research question, similar to how an OLS model can be used, we can also investigate whether there are heterogeneous effects when applying a logistic regression model. Let’s use the data of my POL212 project “Electoral Outcomes, Presidential Power, and Satisfaction with Democracy” as an example.

    • Background: Scholars have amassed evidence showing that election outcomes affect how people think about democracy (Anderson & Guillory, 1997; Anderson & Tverdova, 2001; Anderson et al., 2005; Clarke & Acock, 1989; Curini et al., 2012). Compared to those who voted for losers, those who supported winners usually have a higher level of satisfaction and trust in democracy and have a more positive feeling of government responsiveness and political efficacy.

    • Research Question: However, does it mean that losers’ interests are then ignored by the government? Do people who did not vote for the winning party lose that much in the political game? In this project, I seek to understand how political institutions can shape citizens’ perceptions of the legitimacy of democracies. In particular, I ask how the extent of presidential power affects the winner-loser effect. Does presidential power amplify or dampen losers’ satisfaction with democracy?

    • Theory & Hypothesis: I argue that with different levels of presidential power, losers’ satisfaction with democracy should also be different. In countries with powerful presidents, losers should be more unsatisfied with how the government works, for being the political minority in the system causes them to be largely excluded from the policy-making process. In countries with weaker presidents, losers should be less unsatisfied, for their policy interests still can be possibly realized by the premier and the cabinet, which is determined by the congressional election.

      Hypothesis: Losers will be more satisfied with democracy when the president is weaker.

      Based on my theory, I am not only interested in the effect of winner-loser status on satisfaction with democracy but also how this effect is moderated by presidential power.

      • Outcome variable: satisfaction with democracy

      • Explainable variables: winner-loser status and presidential power

      • Control variables: divided government, gender, education level, age

      The data can be accessed here. All the variables are already coded, and observations with NAs have been removed.

      # load packages
      library(tidyverse)
      
      # import data 
      df <- read.csv("/Users/yu-shiuanhuang/Desktop/method-sequence/data/data_con.csv") %>%
      mutate(SWD_b = ifelse(SWD > 2, 1, 0))
    • Data: The fourth module (2011-2016) of the Comparative Study of Electoral Systems (CSES) project. The CSES project consists of nationally representative post-election surveys across democracies. The unit of analysis for this study is the individual. In particular, I analyze all presidential and semi-presidential countries that were rated as free by Freedom House at the time the election covered by the survey was held. This leaves the data with 15 countries covering Western and Eastern Europe, South and North America, and Asia.

      country                     eyear  prespow1
      Austria                     2013   0.0915
      Brazil                      2014   0.4863
      Bulgaria                    2014   0.1831
      Finland                     2015   0.0499
      France                      2012   0.1306
      Ireland                     2011   0.0623
      Peru                        2016   0.4197
      Poland                      2011   0.2407
      Portugal                    2015   0.1969
      Republic of Korea           2012   0.3488
      Romania                     2014   0.2502
      Slovakia                    2016   0.1894
      Slovenia                    2011   0.1180
      Taiwan                      2012   0.2707
      United States of America    2012   0.2889
    • Outcome variable: Satisfaction with Democracy (SWD)

      In the CSES survey, the concept is measured by this question: “On the whole, are you (1) very satisfied, (2) fairly satisfied, (3) not very satisfied, or (4) not at all satisfied with the way democracy works in [COUNTRY]?” I reversed the scale to (4) very satisfied, (3) fairly satisfied, (2) not very satisfied, and (1) not at all satisfied, then recoded “(4) very satisfied” and “(3) fairly satisfied” to 1, and 0 otherwise.

    • Explanatory variables:

      – Loser (loser): I classified respondents as belonging to the political majority and minority by a question in CSES asking respondents for which party they feel closest to. loser is coded 1 if the party respondents feel closest to is not in accordance with the current president’s party, and 0 otherwise

      – Presidential Power (prespow1): I use Doyle & Elgie’s (2015) continuous presidential power score dataset. The range of this variable is between 0 and 1. The higher score means the stronger president; the lower score means the weaker president.

    • Control variables:

      – Divided Government (divided): Past studies have shown that “unified government produces greater quantities of significant enactments and is more responsive to the public mood than is divided government” (Coleman, 1999: 821). In this case, I expect the coefficient of divided to be negative, for individuals will be more unsatisfied with democracy under a less responsive government. divided is coded 1 if the executive-legislative relationship at the survey date of fieldwork is a divided government, meaning that the president’s party did not control the majority of the legislature. The data I use to establish divided is from the Database of Political Institutions 2017 (DPI 2017).

      – Gender (gender): Male vs Female.

      – Age (age): continuous scale, standardized.

      – Education (education): continuous scale.

    Let’s first run the additive regression model and then add the interaction term between loser and prespow1, so we can compare how taking this interaction into account improves our model!

    # additive model
    logit.swd <- glm(SWD_b ~ as.factor(loser) + prespow1 + as.factor(divided) + 
                   as.factor(gender) + education + age, data = df, 
                   family = binomial(link = "logit"))
    
    # interactive model
    logit.swd.int <- glm(SWD_b ~ as.factor(loser)*prespow1 + as.factor(divided) + 
                   as.factor(gender) + education + age, data = df, 
                   family = binomial(link = "logit"))
    ## 
    ## The Effect of Electoral Outcomes and Presidential Power on Satisfaction with Democracy (Logistic Regression Model).
    ## ===========================================================
    ##                                  Dependent variable:       
    ##                            --------------------------------
    ##                              Satisfaction with Democracy   
    ##                            Additive Model Interactive Model
    ##                                 (1)              (2)       
    ## -----------------------------------------------------------
    ## Loser                       -0.83762***      -0.69264***   
    ##                              (0.03433)        (0.07499)    
    ##                                                            
    ## Presidential Power          -1.11516***      -0.65411**    
    ##                              (0.12401)        (0.24632)    
    ##                                                            
    ## Divided Government            -0.05990        -0.05880     
    ##                              (0.03788)        (0.03791)    
    ##                                                            
    ## Male                          0.00797          0.00855     
    ##                              (0.03048)        (0.03048)    
    ##                                                            
    ## Education Level              0.06027***      0.06064***    
    ##                              (0.00886)        (0.00886)    
    ##                                                            
    ## Age                           0.00954          0.00987     
    ##                              (0.01593)        (0.01593)    
    ##                                                            
    ## Loser * Presidential Power                    -0.60460*    
    ##                                               (0.27912)    
    ##                                                            
    ## Constant                     0.80402***      0.68884***    
    ##                              (0.06890)        (0.08686)    
    ##                                                            
    ## -----------------------------------------------------------
    ## Observations                   18,150          18,150      
    ## Log Likelihood             -12,127.18000    -12,124.83000  
    ## Akaike Inf. Crit.           24,268.35000    24,265.66000   
    ## ===========================================================
    ## Note:                         *p<0.05; **p<0.01; ***p<0.001

    What does the coefficient of Loser mean? The regression outputs for additive and interactive models here are in such form:

    Additive Model:

    \[ \begin{aligned} log \frac{\pi_i}{1-\pi_i} & =\alpha + \beta_{Loser} Loser_i + \beta_{Prespow} Prespow_i + \beta_{Male} Male_i + \beta_{Age} Age_i + \beta_{Education} Education_i + \beta_{DividedGov} DividedGov_i \\ &= 0.80402-0.83762Loser_i-1.11516Prespow_i+0.00797Male_i+0.00954Age+0.06027Education_i-0.05990DividedGov_i \end{aligned} \] Interactive Model:

    \[ \begin{aligned} log \frac{\pi_i}{1-\pi_i} &= \alpha + \beta_{Loser} Loser_i + \beta_{Prespow} Prespow_i + \beta_{Loser*Prespow} (Loser*Prespow)_i + \beta_{Male} Male_i + \beta_{Age} Age_i + \beta_{Education} Education_i + \beta_{DividedGov} DividedGov_i \\ &=0.68884-0.69264Loser_i-0.65411Prespow_i-0.60460(Loser*Prespow)_i + 0.00855Male_i+0.00987Age+0.06064Education_i-0.05880DividedGov_i \end{aligned} \] where \(\pi_i\)=Pr(Being satisfied with democracy)

    Let’s focus on interpreting the interactive model! The coefficient of the “Loser” variable is significantly negative, indicating that, compared to being a winner, being a loser is associated with a 0.69264 unit decrease in the log-odds. Additionally, for each one-unit increase in presidential power, the effect of being a loser is associated with an additional 0.60460 unit decrease in the log-odds.

    Now transform the regression output into odds ratio (i.e., \(\frac{\pi_i}{1-\pi_i}\); the odds of a voter feeling more satisfied with democracy versus feeling less satisfied with democracy). How to transform log-odds into odds ratio? Exponentiate the log-odds!

    exp(coef(logit.swd.int))
    ##                (Intercept)          as.factor(loser)1 
    ##                  1.9914124                  0.5002521 
    ##                   prespow1        as.factor(divided)1 
    ##                  0.5199028                  0.9428962 
    ##      as.factor(gender)Male                  education 
    ##                  1.0085916                  1.0625205 
    ##                        age as.factor(loser)1:prespow1 
    ##                  1.0099152                  0.5462903

    The odds ratio for the loser variable means that the odds of a voter feeling more satisfied with democracy versus less satisfied are 50% ((1-0.50)*100%) lower for a loser than for a winner; that is, a voter feels less satisfied with democracy when he/she supported a losing party. Furthermore, the odds ratio of the interaction term suggests that, for each one-unit increase in presidential power, the gap in these odds between losers and winners widens by an additional 45% ((1-0.55)*100%).
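
    Because of the interaction, the odds ratio for loser depends on the level of presidential power. A quick sketch using the estimated coefficients (with the coefficient names as they appear in the output above) makes this explicit:

    # conditional odds ratio for loser at a weak (prespow1 = 0) and a strong (prespow1 = 1) president
    b <- coef(logit.swd.int)
    exp(b["as.factor(loser)1"] + b["as.factor(loser)1:prespow1"] * c(0, 1))
    # roughly 0.50 when prespow1 = 0 and 0.27 when prespow1 = 1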

    Plot the predicted effects to show the pattern more straightforwardly!

    prespow1 <- seq(0, 1, by = 0.01) 
    
    # create a new data for the predicted SWD of losers under different presidential level
    dfplot.l <- data.frame(loser = rep(1, length(prespow1)), # 1 means loser
                           prespow1 = prespow1,
                           divided = rep(1, length(prespow1)), # set at the median
                           gender = rep("Female", length(prespow1)), # set at the median
                           education = rep(4, length(prespow1)), # set at the median
                           age = rep(0, length(prespow1))) # set at the mean
    
    pred.l <- predict(logit.swd.int, newdata = dfplot.l, type = "response", se.fit = TRUE)
    
    # create a new data for the predicted SWD of winners under different presidential level
    dfplot.w <- data.frame(loser = rep(0, length(prespow1)), # 0 means winner
                           prespow1 = prespow1,
                           divided = rep(1, length(prespow1)), # set at the median
                           gender = rep("Female", length(prespow1)), # set at the median
                           education = rep(4, length(prespow1)), # set at the median
                           age = rep(0, length(prespow1))) # set at the mean
    pred.w <- predict(logit.swd.int, newdata = dfplot.w, type = "response", se.fit = TRUE)
    # scatterplot of satisfaction with democracy for presidential power and the fitted probabilities 95% confidence interval
    predict.df <- data.frame(x = prespow1,
                             prob.w = pred.w$fit,
                             upr.w = pred.w$fit + (1.96 * pred.w$se.fit),
                             lwr.w = pred.w$fit - (1.96 * pred.w$se.fit),
                             prob.l = pred.l$fit,
                             upr.l = pred.l$fit + (1.96 * pred.l$se.fit),
                             lwr.l = pred.l$fit - (1.96 * pred.l$se.fit)
                             )
    ggplot() +
      geom_hline(yintercept = 1, col = "gray23", alpha = 0.7, linetype = "dashed") +
      geom_hline(yintercept = 0, col = "gray23", alpha = 0.7, linetype = "dashed") +
      geom_line(data = predict.df, aes(x = prespow1, y = prob.w), col = "dodgerblue2", alpha = 0.7, lwd = 1) +
      geom_line(data = predict.df, aes(x = prespow1, y = upr.w), col = "dodgerblue2", alpha = 0.7, linetype = "dashed") +
      geom_line(data = predict.df, aes(x = prespow1, y = lwr.w), col = "dodgerblue2", alpha = 0.7, linetype = "dashed") +
      geom_line(data = predict.df, aes(x = prespow1, y = prob.l), col = "salmon2", alpha = 0.7, lwd = 1) +
      geom_line(data = predict.df, aes(x = prespow1, y = upr.l), col = "salmon2", alpha = 0.7, linetype = "dashed") +
      geom_line(data = predict.df, aes(x = prespow1, y = lwr.l), col = "salmon2", alpha = 0.7, linetype = "dashed") +
      scale_y_continuous(breaks = seq(0, 1, 0.2), labels = seq(0, 1, 0.2)) +
      scale_x_continuous(breaks = seq(0, 1, 0.2), labels = seq(0, 1, 0.2)) +
      labs(x = "Presidential Power", y = "Satisfaction with Democracy") +
      annotate("text", x = 0.1, y = 0.3, label = "Winner", color = "dodgerblue2") +
      annotate("text", x = 0.1, y = 0.2, label = "Loser", color = "salmon2") +
      theme_bw()

    2. Model Selection

    Which model performs better?

    1. Likelihood-ratio Test

      The Likelihood-Ratio Test (LRT) is a statistical test used to compare the goodness of fit of two models based on the ratio of their likelihoods. Specifically, we will compare a full model with a nested null model. Fitting these models produces a likelihood for each, \(L_1\) and \(L_0\); the null is a specialized version of the full, so \(L_1 \geq L_0\).

      \[G^2_0=2ln(\frac{L_1}{L_0})=2(lnL_1-lnL_0)\] is the likelihood-ratio test statistic for the null; under the null, this statistic has a chi-square distribution with q degrees of freedom, where q is the number of coefficients restricted under the null (i.e., the difference in the number of parameters between the two models). If the null hypothesis (the null model) is supported by the observed data, there should be no statistically significant difference between the two likelihoods. Note that the likelihood-ratio test can only compare two models when one is nested within the other.

      Take the above additive and interaction models of predicting satisfaction with democracy as an example, in which the former is the nested null model and the latter is the full model. Let’s first calculate the \(lnL_1\) and \(lnL_0\) for the nested null model (additive model) and the full model (interactive model) and then compute the \(G^2_0\) by hand in R.

      # additive model (nested null model)
      logLik(logit.swd)
      ## 'log Lik.' -12127.18 (df=7)
      # interactive model (full model)
      logLik(logit.swd.int)
      ## 'log Lik.' -12124.83 (df=8)
      # g statistic
      g <- 2*(-12124.83 - -12127.18); g
      ## [1] 4.7
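
      The corresponding p-value can also be computed by hand from the chi-square distribution with 1 degree of freedom (the full model has one extra parameter); it should line up with the value reported by lrtest() below.

      # p-value for g under a chi-square distribution with 1 df
      pchisq(g, df = 1, lower.tail = FALSE)  # about 0.03, matching lrtest() below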

      To test if this g statistic is statistically significant, we can apply lrtest() in lmtest package. lrtest() is a generic function for carrying out likelihood ratio tests in R.

      library(lmtest)
      lrtest(logit.swd, logit.swd.int)
      ## Likelihood ratio test
      ## 
      ## Model 1: SWD_b ~ as.factor(loser) + prespow1 + as.factor(divided) + as.factor(gender) + 
      ##     education + age
      ## Model 2: SWD_b ~ as.factor(loser) * prespow1 + as.factor(divided) + as.factor(gender) + 
      ##     education + age
      ##   #Df LogLik Df  Chisq Pr(>Chisq)  
      ## 1   7 -12127                       
      ## 2   8 -12125  1 4.6921     0.0303 *
      ## ---
      ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

      The likelihood-ratio test shows that including the interaction term significantly improves the goodness of fit: the chi-square value of 4.69 indicates that the difference between the two models is statistically significant at the 0.05 level.

    2. Akaike Information Criterion (AIC)

      Information measures penalize the inclusion of predictors that do not significantly improve model fit. These measures allow one to compare different models, even non-nested ones. The best known of these measures is the Akaike Information Criterion (AIC), which judges a model by how close its fitted values tend to be to the true expected values. The optimal model is the one that tends to have fitted values closest to the true outcome probabilities. This is the model that minimizes:

      \(AIC=-2(lnL_M-P)=Dev_M+2P\) where \(P\) is the number of parameters in the model (including \(\alpha\))

      The smaller the AIC value, the better the fit. Since fit will improve with additional predictors, the \(2P\) term is a penalty for adding predictors to the model. The logistic regression output in R gives you the AIC by default. You can also use AIC() from the stats package to find it.

      AIC(logit.swd)
      ## [1] 24268.35
      AIC(logit.swd.int)
      ## [1] 24265.66
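
      As a quick check on the formula above, the same values can be recovered by hand from the log-likelihoods reported earlier (a minimal sketch; any small discrepancy is rounding):

      # AIC = -2*lnL + 2*P, with P = 7 and P = 8 parameters respectively
      -2 * as.numeric(logLik(logit.swd)) + 2 * 7
      -2 * as.numeric(logLik(logit.swd.int)) + 2 * 8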

      The AIC of the interactive model is smaller than the AIC of the additive model, with a difference of 2.69 (24,265.66 < 24,268.35). How do we decide whether such a difference counts as weak or strong evidence? Raftery (1995) suggests the following guide.

      Absolute Difference   Evidence
      0-2                   Weak
      2-6                   Positive
      6-10                  Strong
      >10                   Very strong

      By this guide, our difference of 2.69 counts as positive, but not strong, evidence in favor of the interactive model.

    3. Ordered Logit Model

    OLS regression assumes interval-level measurement for the dependent variable, which is not appropriate for modeling ordinal (rank-order) variables (e.g., strongly disagree 1 - strongly agree 4). Nevertheless, it is common for researchers to treat ordinal dependent variables as interval-level variables and model them with OLS regression. This approach assumes effects are linear and additive, which is inappropriate when the outcomes should be modeled as nonlinear and nonadditive. Estimates of effects can be biased as a result, and predictions can be inaccurate at the extremes. Additionally, statistical hypothesis tests may be unsound because standard OLS assumptions about the error term are not met (i.e., applying OLS to ordinal data yields non-normal, heteroskedastic errors).

    Best-case scenario for using OLS regression with an ordered dependent variable:

    • The underlying conceptual variable is interval-level in nature.

    • The ordinal representation of the variable uses many categories (e.g., 8 or more) and cases are distributed somewhat evenly across the categories.

    Ordered logit regression was developed to model the relative frequency distribution of cases across three or more ranked categories of the dependent variable. It predicts the expected distribution of cases across those ranked categories under any combination of values on the relevant independent variables. The effect of a predictor (\(X\)) shifts the distribution of cases across the categories of the dependent variable (\(Y\)) in a systematic direction: toward higher categories if \(X\) has a positive effect, and toward lower categories if \(X\) has a negative effect.
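
    For reference (this equation is added here for clarity and uses the parameterization that polr() in the MASS package reports), the ordered logit model for an outcome with \(J\) ranked categories can be written in terms of cumulative logits:

    \[ln\left(\frac{P(Y \leq j)}{P(Y > j)}\right) = \tau_j - (\beta_1X_1 + \cdots + \beta_kX_k), \quad j = 1, \ldots, J-1\]

    where the \(\tau_j\) are cutpoints (the “Intercepts” in the output below) and the slope coefficients \(\beta\) are constrained to be the same across all \(J-1\) cumulative logits.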

    Let’s use the data of my POL212 project “Electoral Outcomes, Presidential Power, and Satisfaction with Democracy” as an example again. Instead of generating a binary outcome variable, let’s use the original ordinal-level measurement.

    # import data 
    df <- read.csv("/Users/yu-shiuanhuang/Desktop/method-sequence/data/data_con.csv")
    # SWD is coded 1 (not at all satisfied) to 4 (very satisfied); polr() uses the
    # factor levels in this order, so 1 < 2 < 3 < 4
    df$SWD <- as.factor(df$SWD)
    library(MASS)  # for polr()
    ologit <- polr(SWD ~ loser + prespow1 + divided + 
                   gender + education + age, data = df, Hess = TRUE)
    ologit
    ## Call:
    ## polr(formula = SWD ~ loser + prespow1 + divided + gender + education + 
    ##     age, data = df, Hess = TRUE)
    ## 
    ## Coefficients:
    ##        loser     prespow1      divided   genderMale    education          age 
    ## -0.831051971 -1.196054612 -0.068692147  0.005675726  0.061671638  0.010370014 
    ## 
    ## Intercepts:
    ##       1|2       2|3       3|4 
    ## -2.604264 -0.817846  2.173226 
    ## 
    ## Residual Deviance: 41015.34 
    ## AIC: 41033.34

    Again, we can transform the log-odds coefficients to odds ratios by exponentiating them.

    exp(coef(ologit))
    ##      loser   prespow1    divided genderMale  education        age 
    ##  0.4355908  0.3023849  0.9336141  1.0056919  1.0636130  1.0104240

    If you want confidence intervals for the coefficients, you can use confint(); note that it returns intervals on the log-odds scale, so exponentiate the bounds to obtain intervals for the odds ratios.

    confint(ologit)
    ##                  2.5 %       97.5 %
    ## loser      -0.89378842 -0.768505180
    ## prespow1   -1.42200529 -0.970198536
    ## divided    -0.13625994 -0.001201165
    ## genderMale -0.04942995  0.060786448
    ## education   0.04562275  0.077741862
    ## age        -0.01848288  0.039225680
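
    For example, wrapping the same call in exp() puts the bounds on the odds-ratio scale (output omitted here):

    # odds-ratio confidence intervals
    exp(confint(ologit))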

    How should we interpret the effects of winner-loser status and presidential power?

    For those who supported a losing party, the odds of reporting a higher level of satisfaction with democracy are about 56% (\((1-0.44)\times100\%\)) lower than for those who supported the winning party. Similarly, a one-unit increase in presidential power significantly decreases the odds of reporting a higher level of satisfaction with democracy by about 70%.
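
    These percentages come straight from the odds ratios reported above; a quick arithmetic check (the values should be roughly -56% and -70%):

    # percentage change in the odds implied by the odds ratios for loser and prespow1
    round((exp(coef(ologit)[c("loser", "prespow1")]) - 1) * 100, 1)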

    Let’s calculate the predicted probabilities for each category of the outcome variable for two covariate profiles. In these profiles, only the winner-loser status differs, while all other variables are set at the mean or median.

    pro1 <- data.frame(loser = 1, 
                       prespow1 = 0.23147, # set at the median
                       divided = 1, # set at the median
                       gender = "Female", # set at the median
                       education = 4, # set at the median
                       age = 0) # set at the mean
    prob1 <- predict(ologit, newdata = pro1, type = "prob"); prob1
    ##          1          2          3          4 
    ## 0.15784431 0.37014199 0.42903556 0.04297814
    sum(prob1)
    ## [1] 1

    Note that the predicted probabilities across the four categories of the outcome variable should sum to 1!
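
    Under the hood, predict() is applying the cumulative-logit formula shown earlier: it computes the linear predictor \(\eta\) for the profile, evaluates \(P(Y \leq j)\) at each cutpoint, and then differences the cumulative probabilities. A minimal sketch for the loser profile (the result should match prob1 up to rounding):

    # reproduce prob1 by hand from the slopes and cutpoints of the fitted model
    eta <- sum(coef(ologit) * c(1, 0.23147, 1, 0, 4, 0))  # loser, prespow1, divided, genderMale, education, age
    cum <- plogis(ologit$zeta - eta)  # P(Y <= j) at cutpoints 1|2, 2|3, 3|4
    diff(c(0, cum, 1))                # probabilities for categories 1, 2, 3, 4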

    pro2 <- data.frame(loser = 0, 
                       prespow1 = 0.23147, # set at the median
                       divided = 1, # set at the median
                       gender = "Female", # set at the median
                       education = 4, # set at the median
                       age = 0) # set at the mean
    prob2 <- predict(ologit, newdata = pro2, type = "prob"); prob2
    ##          1          2          3          4 
    ## 0.07547995 0.25213556 0.57892288 0.09346161

    Profile   Not at all satisfied Pr(Y=1)   Not very satisfied Pr(Y=2)   Fairly satisfied Pr(Y=3)   Very satisfied Pr(Y=4)
    Loser     0.1578                         0.3701                       0.4290                     0.0430
    Winner    0.0755                         0.2521                       0.5789                     0.0935

    Now, let’s create a plot showing the fitted probabilities for each category of the outcome variable, considering different levels of presidential power, while keeping all other variables at their mean or median values.

    x <- seq(0, 1, 0.01)
    pro <- data.frame(loser = rep(1, length(x)), # set at the median
                      prespow1 = x,
                      divided = rep(1, length(x)), # set at the median
                      gender = rep("Female", length(x)), # set at the median
                      education = rep(4, length(x)), # set at the median
                      age = rep(0, length(x))) # set at the mean
    prob <- predict(ologit, newdata = pro, type = "prob")
    # collect the fitted probabilities for each outcome category into a data frame for plotting
    predict.df <- data.frame(prespow1 = x, prob)
    names(predict.df) <- c("prespow1" ,"Y1", "Y2", "Y3", "Y4")
    
    ggplot() +
      geom_hline(yintercept = 1, col = "gray23", alpha = 0.7, linetype = "dashed") +
      geom_hline(yintercept = 0, col = "gray23", alpha = 0.7, linetype = "dashed") +
      geom_line(data = predict.df, aes(x = prespow1, y = Y1), col = "dodgerblue2", alpha = 0.7, lwd = 1) +
      geom_line(data = predict.df, aes(x = prespow1, y = Y2), col = "darkseagreen3", alpha = 0.7, lwd = 1) +
      geom_line(data = predict.df, aes(x = prespow1, y = Y3), col = "salmon2", alpha = 0.7, lwd = 1) +
      geom_line(data = predict.df, aes(x = prespow1, y = Y4), col = "darkorchid3", alpha = 0.7, lwd = 1) +
      scale_y_continuous(breaks = seq(0, 1, 0.2), labels = seq(0, 1, 0.2)) +
      scale_x_continuous(breaks = seq(0, 1, 0.2), labels = seq(0, 1, 0.2)) +
      labs(x = "Presidential Power", y = "Satisfaction with Democracy") +
      annotate("text", x = 0.85, y = 0.95, label = "Not at all satisfied", color = "dodgerblue2") +
      annotate("text", x = 0.85, y = 0.9, label = "Not very satisfied", color = "darkseagreen3") +
      annotate("text", x = 0.85, y = 0.85, label = "Fairly satisfied", color = "salmon2") +
      annotate("text", x = 0.85, y = 0.8, label = "Very satisfied", color = "darkorchid3") +
      theme_bw()

    One assumption must hold when using ordered logit models: the parallel regression assumption (also known as the proportional odds assumption). It assumes that the relationship between the predictors and the response is the same for each pair of outcome categories, i.e., each predictor has a single coefficient that applies to all of the cumulative logits. If the assumption does not hold, one or more predictors may have a different effect over different categories or different ranges of the response, meaning a predictor can have a stronger effect on some values of the dependent variable and a weaker effect on others.

    To test whether the assumption of equal coefficients across different logits is justified, let’s perform the Brant test.

    brant::brant(ologit)
    ## -------------------------------------------- 
    ## Test for X2  df  probability 
    ## -------------------------------------------- 
    ## Omnibus      180.98  12  0
    ## loser        1.66    2   0.44
    ## prespow1     37.96   2   0
    ## divided      115.06  2   0
    ## genderMale   24.12   2   0
    ## education    0.34    2   0.84
    ## age          19.21   2   0
    ## -------------------------------------------- 
    ## 
    ## H0: Parallel Regression Assumption holds

    The result shows that our main model violates the parallel regression assumption: the p-values for the omnibus test and for presidential power, divided government, gender, and age are all below 0.05. These violations suggest that a simple ordered logit model might not be a good fit for these data. To address this, we can re-estimate the model with a generalized ordered logit model, which relaxes the parallel regression assumption and allows each cumulative logit of the response variable to have its own coefficients. One caveat is that, although the generalized ordered logit model is more flexible than an ordered logit model, it is less efficient because it estimates many more parameters. Therefore, when encountering this violation, a common practice is to do nothing.
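
    If we did want to relax the assumption, one option is a generalized ordered logit, i.e., a cumulative logit model without the parallel constraint. The sketch below is a hypothetical illustration using the VGAM package rather than part of the original analysis; it converts SWD to an ordered factor and lets every predictor have its own coefficient in each of the three cumulative logits.

    # a minimal, illustrative sketch of a non-parallel cumulative logit model
    library(VGAM)
    df$SWD <- ordered(df$SWD)  # treat the response as an ordered factor
    gologit <- vglm(SWD ~ loser + prespow1 + divided + gender + education + age,
                    family = cumulative(parallel = FALSE), data = df)
    summary(gologit)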