9  Introduction to Correlation and Regression Analysis

9.1 Introduction

The distinction between correlation and causation represents a fundamental challenge in statistical analysis. While correlation measures the statistical association between variables, causation implies a direct influence of one variable on another.

Statistical relationships form the backbone of data-driven decision making across disciplines—from economics and public health to psychology and environmental science. Understanding when a relationship indicates mere association versus genuine causality is crucial for valid inference and effective policy recommendations.

9.2 Covariance

Covariance measures how two variables vary together: its sign indicates the direction of their linear relationship, while its magnitude depends on the measurement scales of the variables (which is why it is later standardized into the correlation coefficient).

Formula: \text{cov}(X,Y) = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{n - 1}

Where:

  • x_i and y_i are individual data points
  • \bar{x} and \bar{y} are the means of variables X and Y
  • n is the number of observations
  • We divide by (n-1) for sample covariance (Bessel’s correction)

Step-by-Step Manual Calculation Process

Example 1: Student Study Hours vs. Test Scores

Data:

  • Study Hours (X): 2, 4, 6, 8, 10
  • Test Scores (Y): 65, 70, 80, 85, 95
Step Description Calculation
1 Calculate means \bar{x} = \frac{2+4+6+8+10}{5} = 6 hours
\bar{y} = \frac{65+70+80+85+95}{5} = 79 points
2 Calculate deviations (x_i - \bar{x}): -4, -2, 0, 2, 4
(y_i - \bar{y}): -14, -9, 1, 6, 16
3 Calculate products (x_i - \bar{x})(y_i - \bar{y}):
(-4)(-14) = 56
(-2)(-9) = 18
(0)(1) = 0
(2)(6) = 12
(4)(16) = 64
4 Sum the products \sum = 56 + 18 + 0 + 12 + 64 = 150
5 Divide by (n-1) \text{cov}(X,Y) = \frac{150}{5-1} = \frac{150}{4} = 37.5

R Verification:

# Define the data
study_hours <- c(2, 4, 6, 8, 10)
test_scores <- c(65, 70, 80, 85, 95)

# Calculate covariance
cov_manual <- cov(study_hours, test_scores)
print(paste("Covariance:", round(cov_manual, 2)))
[1] "Covariance: 37.5"
# Verify step by step
x_mean <- mean(study_hours)
y_mean <- mean(test_scores)
x_dev <- study_hours - x_mean
y_dev <- test_scores - y_mean
products <- x_dev * y_dev
cov_calculated <- sum(products) / (length(study_hours) - 1)

# Display calculation steps
data.frame(
  X = study_hours,
  Y = test_scores,
  X_deviation = x_dev,
  Y_deviation = y_dev,
  Product = products
)
   X  Y X_deviation Y_deviation Product
1  2 65          -4         -14      56
2  4 70          -2          -9      18
3  6 80           0           1       0
4  8 85           2           6      12
5 10 95           4          16      64

Interpretation: The positive covariance (37.5) indicates that study hours and test scores tend to increase together.

Practice Problem with Solution

Calculate covariance manually for:

  • Temperature (°F): 32, 50, 68, 86, 95
  • Ice Cream Sales ($): 100, 200, 400, 600, 800

Solution:

Step Calculation
1. Means \bar{x} = \frac{32+50+68+86+95}{5} = 66.2°F
\bar{y} = \frac{100+200+400+600+800}{5} = 420
2. Deviations X: -34.2, -16.2, 1.8, 19.8, 28.8
Y: -320, -220, -20, 180, 380
3. Products 10944, 3564, -36, 3564, 10944
4. Sum 28980
5. Covariance \frac{28980}{4} = 7245
# Verify practice problem
temp <- c(32, 50, 68, 86, 95)
sales <- c(100, 200, 400, 600, 800)
print(paste("Covariance for practice problem:", round(cov(temp, sales), 2)))
[1] "Covariance for practice problem: 7245"

9.3 Correlation Coefficient

The correlation coefficient standardizes covariance to eliminate scale dependency, producing values between -1 and +1.

Interpretation Guidelines

Correlation Value Strength Interpretation Example
±0.90 to ±1.00 Very Strong Almost perfect relationship Height of parents and children
±0.70 to ±0.89 Strong Highly related variables Study time and grades
±0.50 to ±0.69 Moderate Moderately related Exercise and weight loss
±0.30 to ±0.49 Weak Weakly related Shoe size and reading ability
±0.00 to ±0.29 Very Weak/None Little to no relationship Birth month and intelligence

Types of Correlations Visualization

# Load the plotting package used below
library(ggplot2)

# Generate sample data with different correlation patterns
n <- 100

# Positive linear correlation
set.seed(123)
years_education <- rnorm(n, 14, 3)
annual_income <- 15000 + 3500 * years_education + rnorm(n, 0, 10000)
pos_cor <- round(cor(years_education, annual_income), 3)

# Negative linear correlation
screen_time <- runif(n, 1, 8)
sleep_hours <- 9 - 0.5 * screen_time + rnorm(n, 0, 0.5)
neg_cor <- round(cor(screen_time, sleep_hours), 3)

# No correlation
x_random <- rnorm(n, 50, 10)
y_random <- rnorm(n, 50, 10)
no_cor <- round(cor(x_random, y_random), 3)

# Non-linear correlation (quadratic)
hours_studied <- runif(n, 0, 12)
test_score <- -2 * (hours_studied - 6)^2 + 90 + rnorm(n, 0, 5)
nonlin_cor <- round(cor(hours_studied, test_score), 3)

# Create data frames with correlation values
positive_data <- data.frame(
  x = years_education, 
  y = annual_income,
  label = paste0("Positive Linear (r = ", pos_cor, ")")
)

negative_data <- data.frame(
  x = screen_time, 
  y = sleep_hours,
  label = paste0("Negative Linear (r = ", neg_cor, ")")
)

no_corr_data <- data.frame(
  x = x_random, 
  y = y_random,
  label = paste0("No Correlation (r = ", no_cor, ")")
)

nonlinear_data <- data.frame(
  x = hours_studied, 
  y = test_score,
  label = paste0("Non-linear (Pearson r = ", nonlin_cor, ")")
)

# Combine data
all_data <- rbind(positive_data, negative_data, no_corr_data, nonlinear_data)

# Create faceted plot
ggplot(all_data, aes(x = x, y = y)) +
  geom_point(alpha = 0.6, color = "darkblue", size = 2) +
  geom_smooth(method = "lm", color = "red", se = TRUE, alpha = 0.3) +
  facet_wrap(~ label, scales = "free", ncol = 2) +
  labs(
    title = "Different Types of Correlations",
    subtitle = "Linear regression line shown in red with confidence band",
    x = "", 
    y = ""
  ) +
  theme_minimal() +
  theme(
    strip.text = element_text(size = 10, face = "bold"),
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 10, color = "gray50")
  )
`geom_smooth()` using formula = 'y ~ x'

9.4 Pearson Correlation

Formula: r = \frac{\text{cov}(X,Y)}{s_X s_Y} = \frac{\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n(x_i - \bar{x})^2} \cdot \sqrt{\sum_{i=1}^n(y_i - \bar{y})^2}}

Complete Manual Calculation Example

Using our study hours example:

  • Study Hours (X): 2, 4, 6, 8, 10
  • Test Scores (Y): 65, 70, 80, 85, 95

Detailed Calculation Steps:

Step Description Calculation
1 Calculate covariance From above: \text{cov}(X,Y) = 37.5
2 Calculate deviations squared
For X (x_i - \bar{x})^2: 16, 4, 0, 4, 16
Sum = 40
For Y (y_i - \bar{y})^2: 196, 81, 1, 36, 256
Sum = 570
3 Calculate standard deviations
s_X = \sqrt{\frac{40}{4}} = \sqrt{10} = 3.162
s_Y = \sqrt{\frac{570}{4}} = \sqrt{142.5} = 11.937
4 Calculate correlation r = \frac{37.5}{3.162 \times 11.937}
r = \frac{37.5}{37.75} \approx 0.993
# Manual calculation verification
study_hours <- c(2, 4, 6, 8, 10)
test_scores <- c(65, 70, 80, 85, 95)

# Calculate Pearson correlation
r_value <- cor(study_hours, test_scores, method = "pearson")
print(paste("Pearson correlation coefficient:", round(r_value, 3)))
[1] "Pearson correlation coefficient: 0.993"
# Detailed calculation
x_mean <- mean(study_hours)
y_mean <- mean(test_scores)
x_dev <- study_hours - x_mean
y_dev <- test_scores - y_mean

# Show calculation table
calc_table <- data.frame(
  X = study_hours,
  Y = test_scores,
  X_dev = x_dev,
  Y_dev = y_dev,
  X_dev_sq = x_dev^2,
  Y_dev_sq = y_dev^2,
  Product = x_dev * y_dev
)
print(calc_table)
   X  Y X_dev Y_dev X_dev_sq Y_dev_sq Product
1  2 65    -4   -14       16      196      56
2  4 70    -2    -9        4       81      18
3  6 80     0     1        0        1       0
4  8 85     2     6        4       36      12
5 10 95     4    16       16      256      64
# Summary statistics
cat("\nSummary:")

Summary:
cat("\nSum of (X-mean)²:", sum(x_dev^2))

Sum of (X-mean)²: 40
cat("\nSum of (Y-mean)²:", sum(y_dev^2))

Sum of (Y-mean)²: 570
cat("\nSum of products:", sum(x_dev * y_dev))

Sum of products: 150
cat("\nStandard deviation of X:", sd(study_hours))

Standard deviation of X: 3.162278
cat("\nStandard deviation of Y:", sd(test_scores))

Standard deviation of Y: 11.93734
# Calculate confidence interval and p-value
cor_test <- cor.test(study_hours, test_scores, method = "pearson")
print(cor_test)

    Pearson's product-moment correlation

data:  study_hours and test_scores
t = 15, df = 3, p-value = 0.0006431
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8994446 0.9995859
sample estimates:
      cor 
0.9933993 

Interpretation: r = 0.993 indicates an almost perfect positive linear relationship between study hours and test scores. The p-value < 0.05 suggests this relationship is statistically significant.

9.5 Spearman Rank Correlation

Spearman correlation measures monotonic relationships using ranks instead of raw values.

Formula: \rho = 1 - \frac{6 \sum_{i=1}^n d_i^2}{n(n^2 - 1)}

Where d_i is the difference between ranks for observation i.

Complete Manual Example

Data: Math and English Scores

Student Math Score English Score
A 78 82
B 85 90
C 88 88
D 92 94
E 95 96

Ranking and Calculation:

Student Math Score Math Rank English Score English Rank d = (Math Rank - English Rank) d²
A 78 1 82 1 0 0
B 85 2 90 3 -1 1
C 88 3 88 2 1 1
D 92 4 94 4 0 0
E 95 5 96 5 0 0
Sum of d²: 2

Calculation: \rho = 1 - \frac{6 \times 2}{5(25-1)} = 1 - \frac{12}{120} = 1 - 0.1 = 0.9

# Data
math_scores <- c(78, 85, 88, 92, 95)
english_scores <- c(82, 90, 88, 94, 96)

# Show ranks
rank_table <- data.frame(
  Student = LETTERS[1:5],
  Math_Score = math_scores,
  Math_Rank = rank(math_scores),
  English_Score = english_scores,
  English_Rank = rank(english_scores)
)

rank_table$d <- rank_table$Math_Rank - rank_table$English_Rank
rank_table$d_squared <- rank_table$d^2

print(rank_table)
  Student Math_Score Math_Rank English_Score English_Rank  d d_squared
1       A         78         1            82            1  0         0
2       B         85         2            90            3 -1         1
3       C         88         3            88            2  1         1
4       D         92         4            94            4  0         0
5       E         95         5            96            5  0         0
cat("\nSum of d²:", sum(rank_table$d_squared))

Sum of d²: 2
# Calculate Spearman correlation
rho <- cor(math_scores, english_scores, method = "spearman")
cat("\nSpearman correlation (R):", round(rho, 3))

Spearman correlation (R): 0.9
# Manual calculation
n <- length(math_scores)
rho_manual <- 1 - (6 * sum(rank_table$d_squared)) / (n * (n^2 - 1))
cat("\nSpearman correlation (manual):", round(rho_manual, 3))

Spearman correlation (manual): 0.9

9.6 Cross-tabulation and Categorical Data

Cross-tabulation shows relationships between categorical variables.

# Create more realistic sample data
set.seed(123)
n_total <- 120

# Create education and employment data with realistic relationship
education <- factor(
  c(rep("High School", 40), 
    rep("College", 50), 
    rep("Graduate", 30)),
  levels = c("High School", "College", "Graduate")
)

# Employment status with education-related probabilities
employment <- factor(
  c(# High School: 75% employed
    sample(c(rep("Employed", 30), rep("Unemployed", 10)), 40),
    # College: 90% employed  
    sample(c(rep("Employed", 45), rep("Unemployed", 5)), 50),
    # Graduate: 95% employed
    sample(c(rep("Employed", 28), rep("Unemployed", 2)), 30)),
  levels = c("Employed", "Unemployed")
)

# Create contingency table
contingency_table <- table(education, employment)
print("Contingency Table:")
[1] "Contingency Table:"
print(contingency_table)
             employment
education     Employed Unemployed
  High School       30         10
  College           45          5
  Graduate          28          2
# Calculate row percentages
print("\nRow Percentages (% within each education level):")
[1] "\nRow Percentages (% within each education level):"
print(round(prop.table(contingency_table, 1) * 100, 1))
             employment
education     Employed Unemployed
  High School     75.0       25.0
  College         90.0       10.0
  Graduate        93.3        6.7
# Chi-square test for independence
chi_test <- chisq.test(contingency_table)
Warning in chisq.test(contingency_table): Chi-squared approximation may be
incorrect
print(chi_test)

    Pearson's Chi-squared test

data:  contingency_table
X-squared = 5.9623, df = 2, p-value = 0.05073

Note: The warning above arises because one expected cell count (Graduate × Unemployed, about 4.25) is below 5, so the chi-squared approximation should be interpreted with caution.

9.7 Practical Exercises with Solutions

Exercise 1: Calculate Pearson Correlation Manually

Data:

  • Height (inches): 66, 68, 70, 72, 74
  • Weight (pounds): 140, 155, 170, 185, 200

Solution:

# Data
height <- c(66, 68, 70, 72, 74)
weight <- c(140, 155, 170, 185, 200)

# Step 1: Calculate means
h_mean <- mean(height)  # 70
w_mean <- mean(weight)  # 170

# Step 2: Calculate deviations and products
calculation_df <- data.frame(
  Height = height,
  Weight = weight,
  H_dev = height - h_mean,
  W_dev = weight - w_mean,
  H_dev_sq = (height - h_mean)^2,
  W_dev_sq = (weight - w_mean)^2,
  Product = (height - h_mean) * (weight - w_mean)
)

print(calculation_df)
  Height Weight H_dev W_dev H_dev_sq W_dev_sq Product
1     66    140    -4   -30       16      900     120
2     68    155    -2   -15        4      225      30
3     70    170     0     0        0        0       0
4     72    185     2    15        4      225      30
5     74    200     4    30       16      900     120
# Step 3: Calculate correlation
cov_hw <- sum(calculation_df$Product) / (length(height) - 1)
sd_h <- sqrt(sum(calculation_df$H_dev_sq) / (length(height) - 1))
sd_w <- sqrt(sum(calculation_df$W_dev_sq) / (length(weight) - 1))
r_manual <- cov_hw / (sd_h * sd_w)

cat("\nManual calculation:")

Manual calculation:
cat("\nCovariance:", round(cov_hw, 2))

Covariance: 75
cat("\nSD Height:", round(sd_h, 2))

SD Height: 3.16
cat("\nSD Weight:", round(sd_w, 2))

SD Weight: 23.72
cat("\nCorrelation:", round(r_manual, 3))

Correlation: 1
# Verify with R
cat("\n\nR calculation:", round(cor(height, weight), 3))


R calculation: 1

Exercise 2: Calculate Spearman Correlation Manually

Data:

  • Student rankings in Math: 1, 3, 2, 5, 4
  • Student rankings in Science: 2, 4, 1, 5, 3

Solution:

# Rankings (already ranked)
math_rank <- c(1, 3, 2, 5, 4)
science_rank <- c(2, 4, 1, 5, 3)

# Calculate differences
d <- math_rank - science_rank
d_squared <- d^2

# Create table
rank_df <- data.frame(
  Student = 1:5,
  Math_Rank = math_rank,
  Science_Rank = science_rank,
  d = d,
  d_squared = d_squared
)

print(rank_df)
  Student Math_Rank Science_Rank  d d_squared
1       1         1            2 -1         1
2       2         3            4 -1         1
3       3         2            1  1         1
4       4         5            5  0         0
5       5         4            3  1         1
# Calculate Spearman correlation
n <- length(math_rank)
sum_d_sq <- sum(d_squared)
rho_manual <- 1 - (6 * sum_d_sq) / (n * (n^2 - 1))

cat("\nSum of d²:", sum_d_sq)

Sum of d²: 4
cat("\nSpearman correlation (manual):", round(rho_manual, 3))

Spearman correlation (manual): 0.8
cat("\nSpearman correlation (R):", round(cor(math_rank, science_rank, method = "spearman"), 3))

Spearman correlation (R): 0.8

Exercise 3: Interpretation Practice

Interpret these correlation values:

  1. r = 0.85 between hours of practice and performance score

    • Answer: Strong positive relationship. As practice hours increase, performance scores tend to increase substantially.
  2. r = -0.72 between outside temperature and heating costs

    • Answer: Strong negative relationship. As temperature increases, heating costs decrease substantially.
  3. r = 0.12 between shoe size and intelligence

    • Answer: Very weak/no meaningful relationship. Shoe size and intelligence are essentially unrelated.

9.8 Important Points to Remember

  1. Correlation measures relationship strength: Values range from -1 to +1
  2. Correlation ≠ Causation: High correlation doesn’t prove one variable causes another
  3. Choose the right method:
    • Pearson: For linear relationships in continuous data
    • Spearman: For monotonic relationships or ranked data
  4. Check assumptions:
    • Pearson assumes linear relationship and normal distribution
    • Spearman only assumes monotonic relationship
  5. Watch for outliers: Extreme values can greatly affect Pearson correlation
  6. Visualize your data: Always plot before calculating correlation

9.9 Summary: Decision Tree for Correlation Analysis


CHOOSING THE RIGHT CORRELATION METHOD:

Is your data numerical?
├─ YES → Is the relationship linear?
│   ├─ YES → Use PEARSON correlation
│   └─ NO → Is it monotonic?
│       ├─ YES → Use SPEARMAN correlation
│       └─ NO → Consider non-linear methods
└─ NO → Is it ordinal (ranked)?
    ├─ YES → Use SPEARMAN correlation
    └─ NO → Use CROSS-TABULATION for categorical data

Quick Reference Card

Measure Use When Formula Range
Covariance Initial exploration of relationship \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{n-1} -∞ to +∞
Pearson r Linear relationships, continuous data \frac{\text{cov}(X,Y)}{s_X s_Y} -1 to +1
Spearman ρ Monotonic relationships, ranked data 1 - \frac{6\sum d_i^2}{n(n^2-1)} -1 to +1
Cross-tabs Categorical variables Frequency counts N/A

9.10 Understanding Ordinary Least Squares (OLS): A Quick-start Guide


Introduction: What is Regression Analysis?

Regression analysis helps us understand and measure relationships between things we can observe. It provides mathematical tools to identify patterns in data that help us make predictions.

Consider these research questions:

  • How does study time affect test scores?
  • How does experience affect salary?
  • How does advertising spending influence sales?

Regression gives us systematic methods to answer these questions with real data.

The Starting Point: A Simple Example

Let’s begin with something concrete. You’ve collected data from 20 students in your class:

Student Study Hours Exam Score
Alex 2 68
Beth 4 74
Carlos 6 85
Diana 8 91

When you plot this data, you get a scatter plot with dots all over. Your goal: find the straight line that best describes the relationship between study hours and exam scores.

But what does “best” mean? That’s what we’ll discover.

Why Real Data Doesn’t Fall on a Perfect Line

Before diving into the math, let’s understand why data points don’t line up perfectly.

Deterministic vs. Stochastic Models

Deterministic Models describe relationships with no uncertainty. Think of physics equations: \text{Distance} = \text{Speed} \times \text{Time}

If you drive at exactly 60 mph for exactly 2 hours, you’ll always travel exactly 120 miles. No variation, no exceptions.

Stochastic Models acknowledge that real-world data contains natural variation. The fundamental structure is: Y = f(X) + \epsilon

Where:

  • Y is what we’re trying to predict (exam scores)
  • f(X) is the systematic pattern (how study hours typically affect scores)
  • \epsilon (epsilon) represents all the random stuff we can’t measure

In our example, two students might study for 5 hours but get different scores because:

  • One slept better the night before
  • One is naturally better at test-taking
  • One had a noisy roommate during the exam
  • Pure chance in which questions were asked

This randomness is natural and expected - that’s what \epsilon captures.

The Simple Linear Regression Model

We express the relationship between study hours and exam scores as: Y_i = \beta_0 + \beta_1X_i + \epsilon_i

Let’s decode this:

  • Y_i = exam score for student i
  • X_i = study hours for student i
  • \beta_0 = the intercept (baseline score with zero study hours)
  • \beta_1 = the slope (points gained per study hour)
  • \epsilon_i = everything else affecting student i’s score

Key insight: We never know the true values of \beta_0 and \beta_1. Instead, we use our data to estimate them, calling our estimates \hat{\beta}_0 and \hat{\beta}_1 (the “hats” mean “estimated”).

Understanding Residuals: How Wrong Are Our Predictions?

Once we draw a line through our data, we can make predictions. For each student:

  1. Actual score (y_i): What they really got
  2. Predicted score (\hat{y}_i): What our line says they should have gotten
  3. Residual (e_i): The difference = Actual - Predicted

Visual Example:

Diana: Studied 8 hours, scored 91
Our line predicts: 88 points
Residual: 91 - 88 = +3 points (we underestimated)

Eric: Studied 5 hours, scored 70
Our line predicts: 79 points  
Residual: 70 - 79 = -9 points (we overestimated)

The Key Insight: Why Square the Residuals?

Here’s a puzzle. Consider these residuals from four students:

  • Student A: +5 points
  • Student B: -5 points
  • Student C: +3 points
  • Student D: -3 points

If we just add them: (+5) + (-5) + (+3) + (-3) = 0

This suggests perfect predictions, but every prediction was wrong! The positive and negative errors canceled out.

The solution: Square each residual before adding:

  • Student A: (+5)^2 = 25
  • Student B: (-5)^2 = 25
  • Student C: (+3)^2 = 9
  • Student D: (-3)^2 = 9
  • Total squared error: 68

Why squaring works:

  1. No more cancellation: All squared values are positive
  2. Bigger errors matter more: A 10-point error counts 4× as much as a 5-point error
  3. Mathematical convenience: Squared functions are smooth and differentiable
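
The same arithmetic in R, using the four hypothetical residuals from the puzzle above:

# Hypothetical residuals for students A-D from the example above
residuals_example <- c(5, -5, 3, -3)

# Raw residuals cancel out and hide the errors
sum(residuals_example)      # 0

# Squared residuals cannot cancel, and larger errors are penalized more
residuals_example^2         # 25 25 9 9
sum(residuals_example^2)    # 68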

The Ordinary Least Squares Method

OLS finds the line that minimizes the Sum of Squared Errors (SSE):

\text{SSE} = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2

Expanding this: \min_{\beta_0, \beta_1} \sum_{i=1}^n (y_i - (\beta_0 + \beta_1x_i))^2

In plain English: “Find the intercept and slope that make the total squared prediction error as small as possible.”

The Mathematical Solution (Formal Derivation)

To minimize SSE, we use calculus. Taking partial derivatives and setting them to zero:

\frac{\partial SSE}{\partial \beta_0} = -2\sum_{i=1}^n (y_i - \beta_0 - \beta_1x_i) = 0

\frac{\partial SSE}{\partial \beta_1} = -2\sum_{i=1}^n x_i(y_i - \beta_0 - \beta_1x_i) = 0

Solving this system of equations yields:

\hat{\beta}_1 = \frac{\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n(x_i - \bar{x})^2} = \frac{\text{Cov}(X, Y)}{\text{Var}(X)}

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}

Where \bar{x} and \bar{y} are the sample means.

What this tells us:

  • The slope depends on how X and Y vary together (covariance) relative to how much X varies alone (variance)
  • The line always passes through the center point (\bar{x}, \bar{y})
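
As a quick check of these formulas, here is a minimal R sketch using the four students tabulated earlier (an illustrative subset of the hypothetical class): it computes the slope as the covariance of X and Y divided by the variance of X, the intercept from the means, and compares the result to R's lm().

# Study hours and exam scores for Alex, Beth, Carlos, and Diana (from the table above)
x <- c(2, 4, 6, 8)
y <- c(68, 74, 85, 91)

# Closed-form OLS estimates
beta1_hat <- cov(x, y) / var(x)             # slope = Cov(X, Y) / Var(X)
beta0_hat <- mean(y) - beta1_hat * mean(x)  # intercept: line passes through (x-bar, y-bar)
c(intercept = beta0_hat, slope = beta1_hat)

# Same estimates from R's built-in fitting function
coef(lm(y ~ x))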

Making Sense of Variation: How Good Is Our Line?

To evaluate our model’s performance, we break down the variation in exam scores:

Total Sum of Squares (SST)

“How much do exam scores vary overall?” SST = \sum_{i=1}^n(y_i - \bar{y})^2

This measures how spread out the scores are from the class average.

Regression Sum of Squares (SSR)

“How much variation does our line explain?” SSR = \sum_{i=1}^n(\hat{y}_i - \bar{y})^2

This measures how much better our predictions are than just guessing the average for everyone.

Error Sum of Squares (SSE)

“How much variation is left unexplained?” SSE = \sum_{i=1}^n(y_i - \hat{y}_i)^2

This is the variation our model couldn’t capture (the squared residuals).

The Fundamental Equation

SST = SSR + SSE

\text{Total Variation} = \text{Explained Variation} + \text{Unexplained Variation}
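
A short R sketch of this decomposition, again on the illustrative four-student data, confirms that the three sums of squares add up as claimed:

# Four-student example data (hypothetical illustration)
x <- c(2, 4, 6, 8)
y <- c(68, 74, 85, 91)
fit <- lm(y ~ x)

SST <- sum((y - mean(y))^2)            # total variation in Y
SSR <- sum((fitted(fit) - mean(y))^2)  # variation explained by the regression line
SSE <- sum(residuals(fit)^2)           # unexplained (residual) variation

c(SST = SST, SSR = SSR, SSE = SSE, SSR_plus_SSE = SSR + SSE)  # SST equals SSR + SSE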

R-Squared: The Model Report Card

The coefficient of determination (R²) tells us what percentage of variation our model explains:

R^2 = \frac{\text{Explained Variation}}{\text{Total Variation}} = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}

How to interpret R²:

  • R² = 0.75: “Study hours explain 75% of the variation in exam scores”
  • R² = 0.30: “Our model captures 30% of what makes scores different”
  • R² = 1.00: Perfect prediction (never happens with real data)
  • R² = 0.00: Our line is no better than guessing the average

Important reality check: In social sciences, R² = 0.30 might be excellent. In engineering, R² = 0.95 might be the minimum acceptable. Context matters.

Interpreting Your Results

When you run OLS and get \hat{\beta}_0 = 60 and \hat{\beta}_1 = 4:

The Slope (\hat{\beta}_1 = 4):

  • “Each additional hour of study is associated with 4 more points on the exam”
  • This is an average effect across all students
  • It’s not a guarantee for any individual student

The Intercept (\hat{\beta}_0 = 60):

  • “A student who studies 0 hours is predicted to score 60”
  • Often this is just a mathematical anchor point
  • May not make practical sense (who studies 0 hours?)

The Prediction Equation: \text{Predicted Score} = 60 + 4 \times \text{Study Hours}

So a student studying 5 hours: Predicted score = 60 + 4(5) = 80 points
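
A tiny R sketch of this prediction rule, assuming the illustrative estimates \hat{\beta}_0 = 60 and \hat{\beta}_1 = 4 used above:

# Prediction rule built from the assumed estimates above (intercept 60, slope 4)
predict_score <- function(hours) 60 + 4 * hours

predict_score(5)           # 80 points for a student studying 5 hours
predict_score(c(0, 2, 8))  # predictions for several hypothetical students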

Effect Size and Practical Significance

Statistical significance tells us whether an effect exists. Practical significance tells us whether it matters. Understanding both is crucial for proper interpretation.

Calculating and Interpreting Raw Effect Sizes

The raw (unstandardized) effect size is simply your slope coefficient \hat{\beta}_1.

Example: If \hat{\beta}_1 = 4 points per hour:

  • This is the raw effect size
  • Interpretation: “One hour of additional study yields 4 exam points”

To assess practical significance, consider:

  1. Scale of the outcome: 4 points on a 100-point exam (4%) vs. 4 points on a 500-point exam (0.8%)
  2. Cost of the intervention: Is one hour of study time worth 4 points?
  3. Context-specific thresholds: Does 4 points change a letter grade?

Calculating Standardized Effect Sizes

Standardized effects allow comparison across different scales and studies.

Formula for standardized coefficient (beta weight): \beta_{std} = \hat{\beta}_1 \times \frac{s_X}{s_Y}

Where:

  • s_X = standard deviation of X (study hours)
  • s_Y = standard deviation of Y (exam scores)

Step-by-step calculation:

  1. Calculate the standard deviation of X: s_X = \sqrt{\frac{\sum(x_i - \bar{x})^2}{n-1}}
  2. Calculate the standard deviation of Y: s_Y = \sqrt{\frac{\sum(y_i - \bar{y})^2}{n-1}}
  3. Multiply: \beta_{std} = \hat{\beta}_1 \times \frac{s_X}{s_Y}

Example calculation:

  • Suppose s_X = 2.5 hours and s_Y = 12 points
  • With \hat{\beta}_1 = 4: \beta_{std} = 4 \times \frac{2.5}{12} = 0.83
  • Interpretation: “A one standard deviation increase in study hours (2.5 hours) is associated with 0.83 standard deviations increase in exam score”
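
The same calculation in R, plugging in the assumed values from the example above (a slope of 4, s_X = 2.5, s_Y = 12):

# Assumed values from the worked example above
beta1_hat <- 4    # raw slope: exam points per study hour
s_x <- 2.5        # standard deviation of study hours
s_y <- 12         # standard deviation of exam scores

# Standardized (beta-weight) effect size
beta_std <- beta1_hat * (s_x / s_y)
round(beta_std, 2)   # 0.83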

Cohen’s Guidelines for Effect Sizes

For standardized regression coefficients:

  • Small effect: |β| ≈ 0.10 (explains ~1% of variance)
  • Medium effect: |β| ≈ 0.30 (explains ~9% of variance)
  • Large effect: |β| ≈ 0.50 (explains ~25% of variance)

For R² (proportion of variance explained):

  • Small effect: R² ≈ 0.02
  • Medium effect: R² ≈ 0.13
  • Large effect: R² ≈ 0.26

Important: These are general benchmarks. Field-specific standards often differ:

  • Psychology/Education: R² = 0.10 might be meaningful
  • Physics/Engineering: R² < 0.90 might be unacceptable
  • Economics: R² = 0.30 might be excellent

Calculating Confidence Intervals for Effect Sizes

To quantify uncertainty in your effect size:

For the raw coefficient: CI = \hat{\beta}_1 \pm t_{critical} \times SE(\hat{\beta}_1)

Where:

  • t_{critical} = critical value from t-distribution (usually ≈ 2 for 95% CI)
  • SE(\hat{\beta}_1) = standard error of the slope

Practical interpretation: If 95% CI = [3.2, 4.8], we can say: “We’re 95% confident that each study hour adds between 3.2 and 4.8 exam points.”
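
In practice these intervals come straight from the fitted model. A minimal sketch using the illustrative four-student data from earlier (so the numbers will not reproduce the hypothetical [3.2, 4.8] interval quoted above):

# Confidence interval for the slope from a fitted model
x <- c(2, 4, 6, 8)
y <- c(68, 74, 85, 91)
fit <- lm(y ~ x)

summary(fit)$coefficients    # estimates, standard errors, t statistics, p-values
confint(fit, level = 0.95)   # 95% confidence intervals for the intercept and slope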

Making Decisions About Practical Significance

To determine if an effect is practically significant, consider:

  1. Minimum meaningful difference: What’s the smallest effect that would matter?
    • In education: Often 0.25 standard deviations
    • In medicine: Determined by clinical relevance
    • In business: Based on cost-benefit analysis
  2. Number needed to treat (NNT) analog: How much X must change for meaningful Y change?
    • If passing requires 10 more points and \hat{\beta}_1 = 4
    • Students need 2.5 more study hours to pass
  3. Cost-effectiveness ratio: \text{Efficiency} = \frac{\text{Effect Size}}{\text{Cost of Intervention}}

Example practical significance assessment:

  • Effect: 4 points per study hour
  • Passing threshold: 70 points
  • Current average: 68 points
  • Conclusion: 30 minutes of extra study could change fail to pass
  • Decision: Practically significant for borderline students

Understanding Uncertainty: Nothing Is Perfect

Your estimates come from a sample, not the entire population. This creates uncertainty.

Why We Have Uncertainty

  • You studied 20 students, not all students ever
  • Your sample might be slightly unusual by chance
  • Measurement isn’t perfect (did students report hours accurately?)

Confidence Intervals: Being Honest About Uncertainty

Instead of saying “the effect is exactly 4 points per hour,” we say:

  • “We estimate 4 points per hour”
  • “We’re 95% confident the true effect is between 3.2 and 4.8 points”

This range (3.2 to 4.8) is called a 95% confidence interval.

What it means: If we repeated this study many times with different samples, 95% of the intervals we calculate would contain the true effect.

What it doesn’t mean: There’s a 95% chance the true value is in this specific interval (it either is or isn’t).

Testing If There’s Really a Relationship

The big question: “Is there actually a relationship, or did we just get lucky with our sample?”

We test this by asking: “If study hours truly had zero effect on scores, how likely would we be to see a pattern this strong just by chance?”

The process (simplified):

  1. Assume there’s no relationship (the “null hypothesis”)
  2. Calculate how unlikely our data would be if that were true
  3. If it’s very unlikely (typically less than 5% chance), we conclude there probably is a relationship

P-values in plain English:

  • p = 0.03: “If study hours didn’t matter at all, there’s only a 3% chance we’d see a pattern this strong by luck”
  • p = 0.40: “This pattern could easily happen by chance even if there’s no real relationship”

Rule of thumb: p < 0.05 → “statistically significant” (probably a real relationship)

When Things Go Wrong: Model Diagnostics

Quick Visual Checks

  1. Plot your data first: Does it look roughly linear?
  2. Plot residuals vs. predicted values: Should look like a random cloud
  3. Look for outliers: Any points way off from the others?

Warning Signs Your Model Might Be Misleading

Pattern in residuals: If residuals show a curve or trend, you’re missing something

Increasing spread: If residuals get more spread out as predictions increase, standard errors might be wrong

Influential outliers: One or two weird points can drag your whole line off

Missing variables: If you forgot something important (like prior knowledge), your estimates might be biased
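
A minimal sketch of these checks on simulated data (the data-generating line below is made up purely for illustration):

# Simulate data with a known linear pattern plus noise
set.seed(1)
x <- runif(40, 0, 10)
y <- 60 + 4 * x + rnorm(40, 0, 5)
fit <- lm(y ~ x)

# 1. Plot the raw data: does a straight line look reasonable?
plot(x, y)
abline(fit, col = "red")

# 2. Residuals vs. predicted values: should look like a patternless cloud around zero
plot(fitted(fit), residuals(fit), xlab = "Predicted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# 3. Flag potential outliers (large standardized residuals)
which(abs(rstandard(fit)) > 2)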

Key Assumptions: When OLS Works Well

  1. Linearity: The true relationship is approximately straight
    • Check: Look at your scatter plot
  2. Independence: Each observation is separate
    • Check: Make sure students didn’t work together or copy
  3. Constant variance: The spread of residuals is similar everywhere
    • Check: Residual plot shouldn’t fan out
  4. No perfect multicollinearity: (For multiple regression) Predictors aren’t perfectly related
    • Check: Make sure you didn’t include the same variable twice
  5. Random sampling: Your data represents the population you care about
    • Check: Did you sample fairly?

Summary: Your OLS Toolkit

What OLS Does:

  • Finds the straight line that minimizes squared prediction errors
  • Estimates how much Y changes when X changes by one unit
  • Tells you how much variation your model explains (R²)
  • Quantifies uncertainty in your estimates

Your Step-by-Step Process:

  1. Plot your data - does a line make sense?
  2. Run OLS to get \hat{\beta}_0 and \hat{\beta}_1
  3. Check R² - how much variation do you explain?
  4. Calculate effect sizes (raw and standardized)
  5. Assess practical significance using context-specific criteria
  6. Look at confidence intervals - how uncertain are you?
  7. Check residuals - any obvious problems?
  8. Make decisions based on both statistical and practical significance

Key Interpretations

Default model: OLS regression Y=\beta_0+\beta_1 X+\varepsilon (or multiple regression: Y=\beta_0+\beta_1 X_1+\cdots+\beta_p X_p+\varepsilon).

  • Slope (\beta_1): When X increases by 1 unit (ceteris paribus), the expected value of Y changes by \beta_1 units.

  • Standardized slope \big(\beta_{1}^{(\mathrm{std})}\big). Definition:

    \beta_{1}^{(\mathrm{std})} \;=\; \beta_1 \cdot \frac{s_X}{s_Y},

    where s_X and s_Y are the standard deviations of X and Y. For a 1 standard deviation (SD) increase in X, the expected value of Y changes by \beta_{1}^{(\mathrm{std})} SDs of Y. Note: In simple regression, \beta_{1}^{(\mathrm{std})} = r (Pearson).

  • Coefficient of determination (R^2). Definition:

    R^2 \;=\; 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}}.

    The model explains 100\times R^2% of the variance in Y relative to the intercept-only model (in-sample). With multiple predictors, consider the adjusted R^2.

  • P-value. Formally:

    p \;=\; \Pr\!\big(\,|T|\ge |t_{\mathrm{obs}}| \mid H_0\,\big),

    where T follows a t-distribution under H_0. Assuming H_0 and the model assumptions hold, p is the probability of obtaining a test statistic at least as extreme as the one observed.

  • Confidence interval (e.g., for \beta_1). Construction:

    \hat{\beta}_1 \;\pm\; t_{1-\alpha/2,\ \mathrm{df}} \cdot \mathrm{SE}\!\left(\hat{\beta}_1\right).

    Strict interpretation: over many repetitions, 95% of intervals constructed this way would contain the true \beta_1; for our data, the estimate lies within [\text{lower},\ \text{upper}]. Teaching shorthand: “We are 95% confident that \beta_1 lies in [\text{lower},\ \text{upper}].”

Common Pitfalls

  • p is not the probability that H_0 is true.
  • A 95% CI does not contain 95% of the observations (that is what a prediction interval is for).
  • A high R^2 does not imply causality. Always check residual diagnostics, the size of the effect, and out-of-sample fit.

Critical Reminders:

  • Association does not imply causation
  • Statistical significance does not guarantee practical importance
  • Every model is wrong, but some are useful
  • Always visualize your data and residuals
  • Consider both effect size and uncertainty when making decisions

OLS provides a principled, mathematical approach to finding patterns in real-world data. While it cannot provide perfect predictions, it offers the best linear approximation possible along with honest assessments of that approximation’s quality and uncertainty.

9.11 Complete Manual OLS Calculation: A Step-by-Step Example

A professor wants to understand the relationship between hours spent studying and exam scores. She collects data from 6 students:

Student Study Hours (X) Exam Score (Y)
A 1 65
B 2 70
C 3 75
D 4 85
E 5 88
F 6 95

Our goal: Find the best-fitting line \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1X using OLS.

9.12 Step 1: Calculate the Means

First, we need the mean of X and Y.

For X (study hours): \bar{X} = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = \frac{21}{6} = 3.5

For Y (exam scores): \bar{Y} = \frac{65 + 70 + 75 + 85 + 88 + 95}{6} = \frac{478}{6} = 79.67

9.13 Step 2: Calculate Deviations from Means

For each observation, calculate (X_i - \bar{X}) and (Y_i - \bar{Y}):

Student X_i Y_i X_i - \bar{X} Y_i - \bar{Y}
A 1 65 1 - 3.5 = -2.5 65 - 79.67 = -14.67
B 2 70 2 - 3.5 = -1.5 70 - 79.67 = -9.67
C 3 75 3 - 3.5 = -0.5 75 - 79.67 = -4.67
D 4 85 4 - 3.5 = 0.5 85 - 79.67 = 5.33
E 5 88 5 - 3.5 = 1.5 88 - 79.67 = 8.33
F 6 95 6 - 3.5 = 2.5 95 - 79.67 = 15.33

9.14 Step 3: Calculate Products and Squares

Now calculate (X_i - \bar{X})(Y_i - \bar{Y}) and (X_i - \bar{X})^2:

Student (X_i - \bar{X})(Y_i - \bar{Y}) (X_i - \bar{X})^2
A (-2.5)(-14.67) = 36.68 (-2.5)² = 6.25
B (-1.5)(-9.67) = 14.51 (-1.5)² = 2.25
C (-0.5)(-4.67) = 2.34 (-0.5)² = 0.25
D (0.5)(5.33) = 2.67 (0.5)² = 0.25
E (1.5)(8.33) = 12.50 (1.5)² = 2.25
F (2.5)(15.33) = 38.33 (2.5)² = 6.25
Sum 107.03 17.50

9.15 Step 4: Calculate the Slope (\hat{\beta}_1)

Using the OLS formula: \hat{\beta}_1 = \frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{\sum(X_i - \bar{X})^2} = \frac{107.03}{17.50} = 6.12

Interpretation: Each additional hour of study is associated with a 6.12-point increase in exam score.

9.16 Step 5: Calculate the Intercept (\hat{\beta}_0)

\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X} = 79.67 - (6.12 \times 3.5) = 79.67 - 21.42 = 58.25

Interpretation: A student who studies 0 hours is predicted to score 58.25 points.

9.17 Step 6: Write the Regression Equation

\hat{Y} = 58.25 + 6.12X

9.18 Step 7: Calculate Predicted Values and Residuals

Using our equation to predict each student’s score:

Student X_i Y_i \hat{Y}_i = 58.25 + 6.12X_i Residual e_i = Y_i - \hat{Y}_i
A 1 65 58.25 + 6.12(1) = 64.37 65 - 64.37 = 0.63
B 2 70 58.25 + 6.12(2) = 70.49 70 - 70.49 = -0.49
C 3 75 58.25 + 6.12(3) = 76.61 75 - 76.61 = -1.61
D 4 85 58.25 + 6.12(4) = 82.73 85 - 82.73 = 2.27
E 5 88 58.25 + 6.12(5) = 88.85 88 - 88.85 = -0.85
F 6 95 58.25 + 6.12(6) = 94.97 95 - 94.97 = 0.03

Check: Sum of residuals = 0.63 - 0.49 - 1.61 + 2.27 - 0.85 + 0.03 ≈ 0 ✓

9.19 Step 8: Calculate Sum of Squares

Total Sum of Squares (SST)

How much total variation exists in exam scores?

Student Y_i (Y_i - \bar{Y})^2
A 65 (65 - 79.67)² = (-14.67)² = 215.21
B 70 (70 - 79.67)² = (-9.67)² = 93.51
C 75 (75 - 79.67)² = (-4.67)² = 21.81
D 85 (85 - 79.67)² = (5.33)² = 28.41
E 88 (88 - 79.67)² = (8.33)² = 69.39
F 95 (95 - 79.67)² = (15.33)² = 235.01
Sum SST = 663.34

Regression Sum of Squares (SSR)

How much variation does our model explain?

Student \hat{Y}_i (\hat{Y}_i - \bar{Y})^2
A 64.37 (64.37 - 79.67)² = (-15.30)² = 234.09
B 70.49 (70.49 - 79.67)² = (-9.18)² = 84.27
C 76.61 (76.61 - 79.67)² = (-3.06)² = 9.36
D 82.73 (82.73 - 79.67)² = (3.06)² = 9.36
E 88.85 (88.85 - 79.67)² = (9.18)² = 84.27
F 94.97 (94.97 - 79.67)² = (15.30)² = 234.09
Sum SSR = 655.44

Error Sum of Squares (SSE)

How much variation is unexplained?

Student Residual e_i e_i^2
A 0.63 0.40
B -0.49 0.24
C -1.61 2.59
D 2.27 5.15
E -0.85 0.72
F 0.03 0.00
Sum SSE = 9.10

Verification: SST = SSR + SSE 663.34 ≈ 655.44 + 9.10 = 664.54 ✓ (small rounding difference)

9.20 Step 9: Calculate R-Squared

R^2 = \frac{SSR}{SST} = \frac{655.44}{663.34} = 0.988

Alternative formula: R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{9.10}{663.34} = 1 - 0.014 = 0.986

(Small difference due to rounding)

Interpretation: Study hours explain 98.8% of the variation in exam scores. This is an extremely strong relationship.

9.21 Step 10: Calculate Effect Sizes

Raw Effect Size

The raw effect size is simply the slope: 6.12 points per hour

Standardized Effect Size

First, calculate standard deviations:

For X: s_X = \sqrt{\frac{\sum(X_i - \bar{X})^2}{n-1}} = \sqrt{\frac{17.50}{5}} = \sqrt{3.50} = 1.87

For Y: s_Y = \sqrt{\frac{\sum(Y_i - \bar{Y})^2}{n-1}} = \sqrt{\frac{663.34}{5}} = \sqrt{132.67} = 11.52

Standardized coefficient: \beta_{std} = \hat{\beta}_1 \times \frac{s_X}{s_Y} = 6.12 \times \frac{1.87}{11.52} = 6.12 \times 0.162 = 0.99

Interpretation: A one standard deviation increase in study hours (1.87 hours) is associated with a 0.99 standard deviation increase in exam score.

According to Cohen’s guidelines:

  • Small effect: |β| = 0.10
  • Medium effect: |β| = 0.30
  • Large effect: |β| = 0.50

Our standardized effect of 0.99 is nearly twice Cohen’s “large effect” threshold.

9.22 Step 11: Practical Significance Assessment

Context Analysis

  1. Scale consideration:
    • Effect: 6.12 points per hour
    • Exam scale: 0-100 points
    • Percentage impact: 6.12% per hour
  2. Practical thresholds:
    • Letter grade difference: Often 10 points
    • Time to improve one letter grade: 10/6.12 = 1.63 hours
    • Conclusion: Less than 2 hours of extra study could change a letter grade
  3. Cost-benefit analysis:
    • Benefit: 6.12 points per hour
    • Cost: 1 hour of time
    • Decision: Highly favorable return on investment

9.23 Summary of Results

Regression equation: \hat{Y} = 58.25 + 6.12X

Key statistics:

  • Slope (\hat{\beta}_1): 6.12 points/hour
  • Intercept (\hat{\beta}_0): 58.25 points
  • R²: 0.988 (98.8% of variance explained)
  • Standardized effect: 0.99 (very large effect)

Practical interpretation:

  • Each hour of study adds about 6 points to exam score
  • The model fits extremely well (R² near 1)
  • The effect is both statistically and practically significant
  • Students can meaningfully improve grades with modest increases in study time

9.24 Verification Check

To verify our calculations, let’s check that the regression line passes through (\bar{X}, \bar{Y}):

\hat{Y} = 58.25 + 6.12(3.5) = 58.25 + 21.42 = 79.67 = \bar{Y}

The calculation confirms our regression line passes through the point of means, as it should.

9.25 R Code to Verify Manual Calculations

Below is R code that checks all our manual calculations. You can run this code to confirm every step.

# Step 1: Create the data
study_hours <- c(1, 2, 3, 4, 5, 6)  # X variable
exam_scores <- c(65, 70, 75, 85, 88, 95)  # Y variable
n <- length(study_hours)  # Sample size

# Create a data frame for easier handling
data <- data.frame(
  Student = LETTERS[1:6],
  X = study_hours,
  Y = exam_scores
)
print("Original Data:")
[1] "Original Data:"
print(data)
  Student X  Y
1       A 1 65
2       B 2 70
3       C 3 75
4       D 4 85
5       E 5 88
6       F 6 95
# Step 2: Calculate means
x_bar <- mean(study_hours)
y_bar <- mean(exam_scores)
cat("\nMeans:\n")

Means:
cat(sprintf("Mean of X (study hours): %.2f\n", x_bar))
Mean of X (study hours): 3.50
cat(sprintf("Mean of Y (exam scores): %.2f\n", y_bar))
Mean of Y (exam scores): 79.67
# Step 3: Calculate deviations from means
data$x_dev <- data$X - x_bar  # X deviations
data$y_dev <- data$Y - y_bar  # Y deviations
cat("\nDeviations from means:\n")

Deviations from means:
print(data[, c("Student", "x_dev", "y_dev")])
  Student x_dev      y_dev
1       A  -2.5 -14.666667
2       B  -1.5  -9.666667
3       C  -0.5  -4.666667
4       D   0.5   5.333333
5       E   1.5   8.333333
6       F   2.5  15.333333
# Step 4: Calculate products and squares for OLS formula
data$xy_product <- data$x_dev * data$y_dev  # (Xi - X̄)(Yi - Ȳ)
data$x_dev_sq <- data$x_dev^2  # (Xi - X̄)²

# Sum of products and squares
sum_xy_product <- sum(data$xy_product)
sum_x_dev_sq <- sum(data$x_dev_sq)
cat("\nSum of (Xi - X̄)(Yi - Ȳ):", round(sum_xy_product, 2), "\n")

Sum of (Xi - X̄)(Yi - Ȳ): 107 
cat("Sum of (Xi - X̄)²:", sum_x_dev_sq, "\n")
Sum of (Xi - X̄)²: 17.5 
# Step 5: Calculate slope (beta_1) manually
beta_1_manual <- sum_xy_product / sum_x_dev_sq
cat("\nSlope (β₁) calculated manually:", round(beta_1_manual, 2), "\n")

Slope (β₁) calculated manually: 6.11 
# Step 6: Calculate intercept (beta_0) manually
beta_0_manual <- y_bar - beta_1_manual * x_bar
cat("Intercept (β₀) calculated manually:", round(beta_0_manual, 2), "\n")
Intercept (β₀) calculated manually: 58.27 
# Step 7: Compare with R's lm() function
model <- lm(Y ~ X, data = data)
cat("\n--- Verification with R's lm() function ---\n")

--- Verification with R's lm() function ---
print(summary(model)$coefficients[, 1:2])
             Estimate Std. Error
(Intercept) 58.266667  1.4045278
X            6.114286  0.3606495
# Step 8: Calculate predicted values and residuals
data$Y_hat <- beta_0_manual + beta_1_manual * data$X  # Predicted values
data$residual <- data$Y - data$Y_hat  # Residuals

cat("\nPredicted values and residuals:\n")

Predicted values and residuals:
print(data[, c("Student", "X", "Y", "Y_hat", "residual")])
  Student X  Y    Y_hat    residual
1       A 1 65 64.38095  0.61904762
2       B 2 70 70.49524 -0.49523810
3       C 3 75 76.60952 -1.60952381
4       D 4 85 82.72381  2.27619048
5       E 5 88 88.83810 -0.83809524
6       F 6 95 94.95238  0.04761905
cat("Sum of residuals (should be ≈ 0):", round(sum(data$residual), 4), "\n")
Sum of residuals (should be ≈ 0): 0 
# Step 9: Calculate Sum of Squares
# Total Sum of Squares (SST)
SST <- sum((data$Y - y_bar)^2)
cat("\n--- Sum of Squares Calculations ---\n")

--- Sum of Squares Calculations ---
cat("SST (Total Sum of Squares):", round(SST, 2), "\n")
SST (Total Sum of Squares): 663.33 
# Regression Sum of Squares (SSR)
SSR <- sum((data$Y_hat - y_bar)^2)
cat("SSR (Regression Sum of Squares):", round(SSR, 2), "\n")
SSR (Regression Sum of Squares): 654.23 
# Error Sum of Squares (SSE)
SSE <- sum(data$residual^2)
cat("SSE (Error Sum of Squares):", round(SSE, 2), "\n")
SSE (Error Sum of Squares): 9.1 
# Verify that SST = SSR + SSE
cat("\nVerification: SST = SSR + SSE\n")

Verification: SST = SSR + SSE
cat(sprintf("%.2f = %.2f + %.2f\n", SST, SSR, SSE))
663.33 = 654.23 + 9.10
cat("Difference (due to rounding):", round(SST - (SSR + SSE), 4), "\n")
Difference (due to rounding): 0 
# Step 10: Calculate R-squared
R_squared_method1 <- SSR / SST
R_squared_method2 <- 1 - (SSE / SST)
R_squared_lm <- summary(model)$r.squared

cat("\n--- R-squared Calculations ---\n")

--- R-squared Calculations ---
cat("R² (Method 1: SSR/SST):", round(R_squared_method1, 4), "\n")
R² (Method 1: SSR/SST): 0.9863 
cat("R² (Method 2: 1 - SSE/SST):", round(R_squared_method2, 4), "\n")
R² (Method 2: 1 - SSE/SST): 0.9863 
cat("R² (from lm function):", round(R_squared_lm, 4), "\n")
R² (from lm function): 0.9863 
# Step 11: Calculate Effect Sizes
# Raw effect size (just the slope)
cat("\n--- Effect Size Calculations ---\n")

--- Effect Size Calculations ---
cat("Raw effect size (slope):", round(beta_1_manual, 2), "points per hour\n")
Raw effect size (slope): 6.11 points per hour
# Standard deviations for standardized effect
sd_x <- sd(data$X)  # Standard deviation of X
sd_y <- sd(data$Y)  # Standard deviation of Y
cat("\nStandard deviations:\n")

Standard deviations:
cat(sprintf("SD of X: %.2f\n", sd_x))
SD of X: 1.87
cat(sprintf("SD of Y: %.2f\n", sd_y))
SD of Y: 11.52
# Standardized effect size
beta_std <- beta_1_manual * (sd_x / sd_y)
cat(sprintf("\nStandardized effect size: %.2f\n", beta_std))

Standardized effect size: 0.99
# Correlation coefficient (should equal sqrt(R²) for simple regression)
correlation <- cor(data$X, data$Y)
cat("\nPearson correlation coefficient:", round(correlation, 4), "\n")

Pearson correlation coefficient: 0.9931 
cat("Square root of R²:", round(sqrt(R_squared_method1), 4), "\n")
Square root of R²: 0.9931 
cat("These should be equal (within rounding error)\n")
These should be equal (within rounding error)
# Step 12: Create visualization
cat("\n--- Creating Visualization ---\n")

--- Creating Visualization ---
# Plot the data and regression line
plot(data$X, data$Y, 
     main = "Study Hours vs Exam Scores with OLS Regression Line",
     xlab = "Study Hours",
     ylab = "Exam Score",
     pch = 19, col = "blue", cex = 1.5)

# Add the regression line
abline(a = beta_0_manual, b = beta_1_manual, col = "red", lwd = 2)

# Add the mean point
points(x_bar, y_bar, pch = 15, col = "green", cex = 2)

# Add vertical lines for residuals
for(i in 1:nrow(data)) {
  segments(data$X[i], data$Y[i], data$X[i], data$Y_hat[i], 
           col = "gray", lty = 2)
}

# Add legend
legend("bottomright", 
       legend = c("Observed Data", "Regression Line", "Mean Point", "Residuals"),
       col = c("blue", "red", "green", "gray"),
       pch = c(19, NA, 15, NA),
       lty = c(NA, 1, NA, 2),
       lwd = c(NA, 2, NA, 1))

# Add the equation to the plot
equation_text <- sprintf("Y = %.2f + %.2f*X", beta_0_manual, beta_1_manual)
r2_text <- sprintf("R² = %.3f", R_squared_method1)
text(1.5, 92, equation_text, pos = 4)
text(1.5, 89, r2_text, pos = 4)

# Final summary
cat("\n========== FINAL SUMMARY ==========\n")

========== FINAL SUMMARY ==========
cat("Regression Equation: Y =", round(beta_0_manual, 2), "+", 
    round(beta_1_manual, 2), "* X\n")
Regression Equation: Y = 58.27 + 6.11 * X
cat("R-squared:", round(R_squared_method1, 4), 
    sprintf("(%.1f%% of variance explained)\n", R_squared_method1 * 100))
R-squared: 0.9863 (98.6% of variance explained)
cat("Standardized Effect Size:", round(beta_std, 2), "(Very Large Effect)\n")
Standardized Effect Size: 0.99 (Very Large Effect)
cat("\nInterpretation:\n")

Interpretation:
cat("- Each additional hour of study increases exam score by", 
    round(beta_1_manual, 2), "points\n")
- Each additional hour of study increases exam score by 6.11 points
cat("- Study hours explain", sprintf("%.1f%%", R_squared_method1 * 100), 
    "of the variation in exam scores\n")
- Study hours explain 98.6% of the variation in exam scores
cat("- The relationship is extremely strong (correlation =", 
    round(correlation, 3), ")\n")
- The relationship is extremely strong (correlation = 0.993 )

Running the Code

To run this R code:

  1. Copy the entire code block above
  2. Paste it into RStudio or any R console
  3. Execute the code
  4. Compare the output with our manual calculations

The code will:

  • Recreate all our manual calculations step by step
  • Verify results using R’s built-in lm() function
  • Generate a visualization of the data with the regression line
  • Display all intermediate calculations with clear labels

Expected Output Highlights

When you run this code, you should see:

  • Slope: 6.11 (our manual value of 6.12 differs only because of rounded intermediate steps)
  • Intercept: 58.27 (manual value: 58.25, again a rounding difference)
  • R²: 0.986 (manual value: 0.988)
  • Standardized effect: 0.99 (matching our manual calculation)
  • A plot showing the data points, regression line, and residuals

This verification confirms that our pen-and-paper calculations were correct, apart from small rounding differences.

9.26 The Linear Regression Model

Regression analysis provides a statistical framework for modeling relationships between a dependent variable and one or more independent variables. This methodology enables researchers to quantify relationships, test hypotheses, and make predictions based on observed data.

Simple Linear Regression

The simple linear regression model expresses the relationship between a dependent variable and a single independent variable:

Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i

Where: - Y_i represents the dependent variable for observation i - X_i represents the independent variable for observation i - \beta_0 is the intercept parameter - \beta_1 is the slope parameter - \varepsilon_i is the error term for observation i

Multiple Linear Regression

The multiple linear regression model extends this framework to incorporate k independent variables:

Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + ... + \beta_k X_{ki} + \varepsilon_i

This formulation allows for the simultaneous analysis of multiple predictors and their respective contributions to the dependent variable.
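
As a brief illustration, the sketch below fits a multiple regression with lm() on simulated data; the variable names and coefficients are invented for this example.

# Simulate two predictors and an outcome that depends on both
set.seed(123)
n <- 100
study_hours <- runif(n, 0, 10)
sleep_hours <- rnorm(n, 7, 1)
exam_score  <- 50 + 4 * study_hours + 2 * sleep_hours + rnorm(n, 0, 5)

# Multiple linear regression: each slope is the partial effect, holding the other predictor fixed
multi_fit <- lm(exam_score ~ study_hours + sleep_hours)
summary(multi_fit)$coefficients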

Ordinary Least Squares Estimation

Defining the Optimization Criterion

The estimation of regression parameters requires a criterion for determining the “best” fit. Consider three potential approaches for defining the optimal line through a set of data points:

Approach 1: Minimizing the Sum of Residuals

\min \sum_{i=1}^{n} (Y_i - \hat{Y}_i) = \min \sum_{i=1}^{n} e_i

This approach is fundamentally flawed. For any line passing through the data, we can always find another line where positive and negative residuals sum to zero. In fact, infinitely many lines satisfy \sum e_i = 0. This criterion fails to uniquely identify an optimal solution. Moreover, a horizontal line through the mean of Y would achieve zero sum of residuals while ignoring the relationship with X entirely.

Approach 2: Minimizing the Sum of Absolute Residuals

\min \sum_{i=1}^{n} |Y_i - \hat{Y}_i| = \min \sum_{i=1}^{n} |e_i|

This criterion, known as Least Absolute Deviations (LAD), addresses the cancellation problem by taking absolute values. It produces estimates that are more robust to outliers than OLS. However, this approach presents significant challenges:

  • The absolute value function is not differentiable at zero, complicating analytical solutions
  • Multiple solutions may exist (the objective function may have multiple minima)
  • No closed-form solution exists; iterative numerical methods are required
  • Statistical inference is more complex, lacking the elegant properties of OLS estimators
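
To make the contrast concrete, the sketch below minimizes the sum of absolute residuals with a general-purpose optimizer and compares the result to OLS on simulated data containing one large outlier. This is only a rough numerical illustration; dedicated quantile-regression routines are preferred in practice.

# Simulated data with a known line plus one large outlier
set.seed(7)
x <- runif(30, 0, 10)
y <- 5 + 2 * x + rnorm(30, 0, 2)
y[30] <- y[30] + 40   # contaminate one observation

# LAD objective: sum of absolute residuals (no closed-form minimizer)
sad <- function(b) sum(abs(y - b[1] - b[2] * x))
lad_fit <- optim(par = c(0, 0), fn = sad)   # iterative numerical search

# Compare the two fits; the LAD line is typically less affected by the outlier
rbind(LAD = lad_fit$par, OLS = coef(lm(y ~ x)))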

Approach 3: Minimizing the Sum of Squared Residuals

\min \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \min \sum_{i=1}^{n} e_i^2

The Ordinary Least Squares (OLS) approach minimizes the sum of squared residuals. This criterion offers several advantages:

  • Prevents cancellation of positive and negative errors
  • Provides a unique solution (except in cases of perfect multicollinearity)
  • Yields closed-form analytical solutions through differentiation
  • Produces estimators with optimal statistical properties under classical assumptions
  • Facilitates straightforward statistical inference

Visualizing the Sum of Squared Errors

The OLS method can be understood geometrically through the following conceptual framework:

  1. Each error appears as a vertical line from the data point to the regression line
  2. Each of these vertical lines represents a residual (e_i)
  3. We square each residual, which can be visualized as creating a square area
  4. The sum of all these squared areas is what OLS minimizes
# Visualization of squared errors as geometric areas
library(ggplot2)

# Generate sample data with clear pattern
set.seed(42)
n <- 8  # Small number for clarity
x <- seq(2, 16, length.out = n)
y <- 10 + 1.5*x + rnorm(n, 0, 3)
df <- data.frame(x = x, y = y)

# Fit regression model
model <- lm(y ~ x, df)
df$fitted <- fitted(model)
df$residual <- residuals(model)

# Create visualization with actual squares
p <- ggplot(df, aes(x = x, y = y)) +
  # Add regression line first (bottom layer)
  geom_smooth(method = "lm", se = FALSE, color = "darkblue", linewidth = 1.2) +
  
  # Add squares for each residual
  # For positive residuals
  geom_rect(data = subset(df, residual > 0),
            aes(xmin = x - abs(residual)/2, 
                xmax = x + abs(residual)/2, 
                ymin = fitted, 
                ymax = fitted + abs(residual)),
            fill = "red", alpha = 0.25, color = "red", linewidth = 0.5) +
  
  # For negative residuals  
  geom_rect(data = subset(df, residual < 0),
            aes(xmin = x - abs(residual)/2, 
                xmax = x + abs(residual)/2, 
                ymin = fitted - abs(residual), 
                ymax = fitted),
            fill = "red", alpha = 0.25, color = "red", linewidth = 0.5) +
  
  # Add residual lines
  geom_segment(aes(xend = x, yend = fitted), 
               color = "red", linewidth = 0.8, linetype = "solid") +
  
  # Add data points
  geom_point(size = 3.5, color = "black") +
  
  # Add text annotations for selected squared values
  geom_text(data = subset(df, abs(residual) > 2),
            aes(x = x, 
                y = fitted + sign(residual) * abs(residual)/2,
                label = paste0("e²=", round(residual^2, 1))),
            size = 3, color = "darkred", fontface = "italic") +
  
  # Styling
  theme_minimal() +
  theme(
    panel.grid.minor = element_blank(),
    plot.title = element_text(face = "bold"),
    plot.subtitle = element_text(face = "italic")
  ) +
  coord_equal() +  # Ensures squares appear as squares
  labs(
    title = "Geometric Visualization of Sum of Squared Errors",
    subtitle = paste("SSE =", round(sum(df$residual^2), 1), 
                     "- Red squares represent squared residuals"),
    x = "Independent Variable (X)",
    y = "Dependent Variable (Y)"
  ) +
  # Add SSE annotation
  annotate("rect", 
           xmin = max(df$x) - 3, xmax = max(df$x) - 0.5,
           ymin = min(df$y) - 2, ymax = min(df$y),
           fill = "lightyellow", alpha = 0.8, color = "gray40") +
  annotate("text", 
           x = max(df$x) - 1.75, y = min(df$y) - 1,
           label = paste("Σe² =", round(sum(df$residual^2), 1)),
           size = 4, fontface = "bold", color = "darkred")

print(p)
`geom_smooth()` using formula = 'y ~ x'

Mathematical Derivation of OLS Estimators

For simple linear regression, the OLS estimators are obtained by minimizing the sum of squared residuals:

SSE = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2

Taking partial derivatives with respect to \beta_0 and \beta_1 and setting them equal to zero yields the normal equations:
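
\frac{\partial SSE}{\partial \beta_0} = -2\sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 X_i) = 0

\frac{\partial SSE}{\partial \beta_1} = -2\sum_{i=1}^{n}X_i(Y_i - \beta_0 - \beta_1 X_i) = 0

Solving this system of two equations in the two unknowns produces: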

\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = \frac{Cov(X,Y)}{Var(X)}

\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}

Properties of OLS Estimators

The OLS procedure guarantees several important properties:

  1. Zero sum of residuals: \sum_{i=1}^{n} e_i = 0
  2. Orthogonality of residuals and predictors: \sum_{i=1}^{n} X_i e_i = 0
  3. The fitted regression line passes through the point (\bar{X}, \bar{Y})
  4. Zero covariance between fitted values and residuals: \sum_{i=1}^{n} \hat{Y}_i e_i = 0
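
These properties can be confirmed numerically for any fitted model; a small sketch on simulated data:

# Numerical check of the OLS residual properties
set.seed(1)
x   <- rnorm(30)
y   <- 1 + 2*x + rnorm(30)
fit <- lm(y ~ x)
e   <- residuals(fit)

c(sum_e        = sum(e),                # property 1: essentially zero
  sum_x_e      = sum(x * e),            # property 2: essentially zero
  sum_fitted_e = sum(fitted(fit) * e))  # property 4: essentially zero

# Property 3: the fitted line passes through (mean(x), mean(y))
all.equal(unname(coef(fit)[1] + coef(fit)[2] * mean(x)), mean(y))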

Classical Linear Model Assumptions

Core Assumptions

For OLS estimators to possess desirable statistical properties, the following assumptions must hold:

Assumption 1: Linearity in Parameters

The relationship between the dependent and independent variables is linear in the parameters: Y_i = \beta_0 + \beta_1 X_{1i} + ... + \beta_k X_{ki} + \varepsilon_i

Assumption 2: Strict Exogeneity

The error term has zero conditional expectation given all values of the independent variables: E[\varepsilon_i | X] = 0

This assumption implies that the independent variables contain no information about the mean of the error term. It is stronger than contemporaneous exogeneity and rules out feedback from past errors to current regressors. This assumption is critical for unbiased estimation and is often violated in time series contexts with lagged dependent variables or in the presence of omitted variables.

This assumption is particularly important for our discussion of spurious correlations. Violations of the exogeneity assumption lead to endogeneity problems, which we will discuss later.

Assumption 3: No Perfect Multicollinearity

In multiple regression, no independent variable can be expressed as a perfect linear combination of other independent variables. The matrix X'X must be invertible.

Assumption 4: Homoscedasticity

The variance of the error term is constant across all observations: Var(\varepsilon_i | X) = \sigma^2

This assumption ensures that the precision of the regression does not vary systematically with the level of the independent variables.

Assumption 5: No Autocorrelation

The error terms are uncorrelated with each other: Cov(\varepsilon_i, \varepsilon_j | X) = 0 \text{ for } i \neq j

Assumption 6: Normality of Errors (for inference)

The error terms follow a normal distribution: \varepsilon_i \sim N(0, \sigma^2)

This assumption is not required for the unbiasedness or consistency of OLS estimators but is necessary for exact finite-sample inference.

Gauss-Markov Theorem

Under Assumptions 1-5, the OLS estimators are BLUE (Best Linear Unbiased Estimators):

  • Best: Minimum variance among the class of linear unbiased estimators
  • Linear: The estimators are linear functions of the dependent variable
  • Unbiased: E[\hat{\beta}] = \beta

Visualization of OLS Methodology

Geometric Interpretation

# Comprehensive visualization of OLS regression
library(ggplot2)

# Generate sample data
set.seed(42)
n <- 50
x <- runif(n, 0, 100)
epsilon <- rnorm(n, 0, 15)
y <- 20 + 0.8*x + epsilon

# Create data frame
data <- data.frame(x = x, y = y)

# Fit regression model
model <- lm(y ~ x, data = data)
data$fitted <- fitted(model)
data$residuals <- residuals(model)

# Create comprehensive plot
ggplot(data, aes(x = x, y = y)) +
  # Add confidence interval
  geom_smooth(method = "lm", se = TRUE, alpha = 0.15, fill = "blue") +
  # Add regression line
  geom_line(aes(y = fitted), color = "blue", linewidth = 1.2) +
  # Add residual segments
  geom_segment(aes(xend = x, yend = fitted), 
               color = "red", alpha = 0.5, linewidth = 0.7) +
  # Add observed points
  geom_point(size = 2.5, alpha = 0.8) +
  # Add fitted values
  geom_point(aes(y = fitted), color = "blue", size = 1.5, alpha = 0.6) +
  # Annotations
  theme_bw() +
  theme(
    panel.grid.minor = element_blank(),
    axis.title = element_text(size = 11),
    plot.title = element_text(size = 12, face = "bold")
  ) +
  labs(
    title = "Ordinary Least Squares Regression",
    subtitle = sprintf("Estimated equation: Y = %.2f + %.3f X  (R² = %.3f, RSE = %.2f)",
                      coef(model)[1], coef(model)[2], 
                      summary(model)$r.squared, 
                      summary(model)$sigma),
    x = "Independent Variable (X)",
    y = "Dependent Variable (Y)"
  ) +
  annotate("text", x = min(x) + 5, y = max(y) - 5,
           label = sprintf("SSE = %.1f", sum(residuals(model)^2)),
           hjust = 0, size = 3.5)
`geom_smooth()` using formula = 'y ~ x'

Visualization of Squared Residuals

# Demonstrate why squaring is necessary
library(ggplot2)
library(gridExtra)

# Generate example with clear pattern
set.seed(123)
n <- 20
x <- seq(1, 10, length.out = n)
y <- 2 + 1.5*x + rnorm(n, 0, 2)
df <- data.frame(x = x, y = y)

# Fit model
model <- lm(y ~ x, df)
df$fitted <- fitted(model)
df$residual <- residuals(model)

# Plot 1: Raw residuals
p1 <- ggplot(df, aes(x = x)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_segment(aes(y = 0, yend = residual, xend = x), 
               color = ifelse(df$residual > 0, "darkgreen", "darkred"),
               linewidth = 1) +
  geom_point(aes(y = residual), size = 3,
             color = ifelse(df$residual > 0, "darkgreen", "darkred")) +
  theme_minimal() +
  labs(title = "Residuals (ei)",
       subtitle = sprintf("Sum = %.2f (not meaningful)", sum(df$residual)),
       x = "X", y = "Residual") +
  ylim(c(-6, 6))

# Plot 2: Squared residuals
p2 <- ggplot(df, aes(x = x)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_segment(aes(y = 0, yend = residual^2, xend = x), 
               color = "darkred", linewidth = 1) +
  geom_point(aes(y = residual^2), size = 3, color = "darkred") +
  theme_minimal() +
  labs(title = "Squared Residuals (ei²)",
       subtitle = sprintf("Sum = %.2f (minimized by OLS)", sum(df$residual^2)),
       x = "X", y = "Squared Residual") +
  ylim(c(0, 36))

# Combine plots
grid.arrange(p1, p2, ncol = 2)

Diagnostic Analysis

Residual Diagnostics

Assessment of model assumptions requires careful examination of residual patterns:

# Generate diagnostic plots
par(mfrow = c(2, 2))

# Residuals vs Fitted Values
plot(model, which = 1)
# Tests linearity and homoscedasticity assumptions

# Normal Q-Q Plot
plot(model, which = 2)
# Tests normality assumption

# Scale-Location Plot
plot(model, which = 3)
# Tests homoscedasticity assumption

# Cook's Distance
plot(model, which = 4)
# Identifies influential observations

Testing Assumptions Formally

# Formal statistical tests
library(lmtest)
library(car)

# Test for heteroscedasticity
# Breusch-Pagan test
bptest(model)

    studentized Breusch-Pagan test

data:  model
BP = 0.37876, df = 1, p-value = 0.5383
# Test for autocorrelation
# Durbin-Watson test
dwtest(model)

    Durbin-Watson test

data:  model
DW = 2.1781, p-value = 0.5599
alternative hypothesis: true autocorrelation is greater than 0
# Test for normality
# Shapiro-Wilk test on residuals
shapiro.test(residuals(model))

    Shapiro-Wilk normality test

data:  residuals(model)
W = 0.98221, p-value = 0.9593
# Test for linearity
# Rainbow test
raintest(model)

    Rainbow test

data:  model
Rain = 1.2139, df1 = 10, df2 = 8, p-value = 0.3995

Extensions and Alternatives

When OLS Assumptions Fail

When classical assumptions are violated, alternative approaches may be necessary:

  1. Heteroscedasticity: Weighted Least Squares (WLS) or robust standard errors
  2. Autocorrelation: Generalized Least Squares (GLS) or Newey-West standard errors
  3. Non-normality: Bootstrap inference or robust regression methods
  4. Multicollinearity: Ridge regression or LASSO
  5. Endogeneity: Instrumental Variables (IV) or Two-Stage Least Squares (2SLS)
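
As a brief illustration of options 1 and 2, robust standard errors can be attached to an existing lm fit. A minimal sketch, assuming the sandwich and lmtest packages are installed and reusing the model object fitted above:

# Robust inference for an existing lm fit
# (a sketch assuming the sandwich and lmtest packages are installed)
library(sandwich)
library(lmtest)

# Heteroscedasticity-consistent (HC) standard errors
coeftest(model, vcov = vcovHC(model, type = "HC3"))

# Heteroscedasticity- and autocorrelation-consistent (Newey-West) standard errors
coeftest(model, vcov = NeweyWest(model))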

Robust Regression Methods

When outliers are present, robust alternatives to OLS include:

  • M-estimators (Huber regression)
  • Least Trimmed Squares (LTS)
  • MM-estimators
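
A minimal sketch of the first option, assuming the MASS package is installed (its rlm() function performs M-estimation with the Huber psi function by default):

# Huber M-estimation vs. OLS with a single large outlier
# (a sketch assuming the MASS package is installed)
library(MASS)

set.seed(7)
x <- 1:30
y <- 5 + 2*x + rnorm(30, 0, 3)
y[30] <- y[30] + 60                # inject one large outlier

huber_fit <- rlm(y ~ x)
rbind(OLS   = coef(lm(y ~ x)),
      Huber = coef(huber_fit))     # the Huber fit is typically pulled far less by the outlier
round(tail(huber_fit$w, 3), 2)     # the outlier is automatically down-weighted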

Practical Implementation

Complete Analysis Example

# Load required libraries
library(tidyverse)
library(broom)
library(car)

# Generate dataset
set.seed(2024)
n <- 200
data <- data.frame(
  x1 = rnorm(n, 50, 10),
  x2 = rnorm(n, 30, 5),
  x3 = rbinom(n, 1, 0.5)
)
data$y <- 10 + 2*data$x1 + 3*data$x2 + 15*data$x3 + rnorm(n, 0, 10)

# Fit multiple regression model
full_model <- lm(y ~ x1 + x2 + x3, data = data)

# Model summary
summary(full_model)

Call:
lm(formula = y ~ x1 + x2 + x3, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-25.6507  -6.9674  -0.7472   6.3670  30.4959 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept) 13.59158    5.97565   2.274               0.024 *  
x1           2.05837    0.06876  29.934 <0.0000000000000002 ***
x2           2.76233    0.15259  18.103 <0.0000000000000002 ***
x3          14.80771    1.42654  10.380 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.897 on 196 degrees of freedom
Multiple R-squared:  0.8744,    Adjusted R-squared:  0.8725 
F-statistic: 454.9 on 3 and 196 DF,  p-value: < 0.00000000000000022
# ANOVA table
anova(full_model)
Analysis of Variance Table

Response: y
           Df Sum Sq Mean Sq F value                Pr(>F)    
x1          1  82189   82189  839.04 < 0.00000000000000022 ***
x2          1  40936   40936  417.90 < 0.00000000000000022 ***
x3          1  10555   10555  107.75 < 0.00000000000000022 ***
Residuals 196  19200      98                                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Confidence intervals for parameters
confint(full_model, level = 0.95)
                2.5 %    97.5 %
(Intercept)  1.806741 25.376410
x1           1.922758  2.193977
x2           2.461401  3.063262
x3          11.994367 17.621053
# Variance Inflation Factors (multicollinearity check)
vif(full_model)
      x1       x2       x3 
1.009487 1.044192 1.038735 
# Model diagnostics
par(mfrow = c(2, 2))
plot(full_model)

# Tidy output
tidy(full_model, conf.int = TRUE)
# A tibble: 4 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)    13.6     5.98        2.27 2.40e- 2     1.81     25.4 
2 x1              2.06    0.0688     29.9  4.90e-75     1.92      2.19
3 x2              2.76    0.153      18.1  1.06e-43     2.46      3.06
4 x3             14.8     1.43       10.4  2.14e-20    12.0      17.6 
glance(full_model)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.874         0.872  9.90      455. 5.21e-88     3  -740. 1490. 1507.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Conclusion

Ordinary Least Squares regression remains a fundamental tool in statistical analysis. The method’s mathematical elegance, combined with its optimal properties under the classical assumptions, explains its widespread application. However, practitioners must carefully verify assumptions and consider alternatives when these conditions are not met. Understanding both the theoretical foundations and practical limitations of OLS is essential for proper statistical inference and prediction.

Visualizing OLS Through Different Regression Lines

A simple but effective way to visualize the concept of “best fit” is to compare multiple lines and their resulting SSE values:

# Load required libraries
library(ggplot2)
library(gridExtra)

# Create sample data
set.seed(123)
x <- 1:20
y <- 2 + 3*x + rnorm(20, 0, 5)
data <- data.frame(x = x, y = y)

# Fit regression model
model <- lm(y ~ x, data = data)
coef <- coefficients(model)

# Define different lines: optimal and sub-optimal with clearer differences
lines <- data.frame(
  label = c("Best Fit (OLS)", "Line A", "Line B", "Line C"),
  intercept = c(coef[1], coef[1] - 8, coef[1] + 8, coef[1] - 4),
  slope = c(coef[2], coef[2] - 1.2, coef[2] + 0.8, coef[2] - 0.7)
)

# Calculate SSE for each line
lines$sse <- sapply(1:nrow(lines), function(i) {
  predicted <- lines$intercept[i] + lines$slope[i] * x
  sum((y - predicted)^2)
})

# Add percentage increase over optimal SSE
lines$pct_increase <- round((lines$sse / lines$sse[1] - 1) * 100, 1)
lines$pct_text <- ifelse(lines$label == "Best Fit (OLS)", 
                         "Optimal", 
                         paste0("+", lines$pct_increase, "%"))

# Assign distinct colors for better visibility
line_colors <- c("Best Fit (OLS)" = "blue", 
                "Line A" = "red", 
                "Line B" = "darkgreen", 
                "Line C" = "purple")

# Create data for mini residual plots
mini_data <- data.frame()
for(i in 1:nrow(lines)) {
  line_data <- data.frame(
    x = x,
    y = y,
    predicted = lines$intercept[i] + lines$slope[i] * x,
    residuals = y - (lines$intercept[i] + lines$slope[i] * x),
    line = lines$label[i]
  )
  mini_data <- rbind(mini_data, line_data)
}

# Create main comparison plot with improved visibility
p1 <- ggplot(data, aes(x = x, y = y)) +
  # Add background grid for reference
  theme_minimal() +
  theme(
    panel.grid.minor = element_line(color = "gray90"),
    panel.grid.major = element_line(color = "gray85"),
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 13),
    axis.title = element_text(size = 13, face = "bold"),
    axis.text = element_text(size = 12)
  ) +
  # Add data points
  geom_point(size = 2.5, alpha = 0.8) +
  # Add lines with improved visibility
  geom_abline(data = lines, 
              aes(intercept = intercept, slope = slope, 
                  color = label, linetype = label == "Best Fit (OLS)"),
              linewidth = 1.2) +
  # Use custom colors
  scale_color_manual(values = line_colors) +
  scale_linetype_manual(values = c("TRUE" = "solid", "FALSE" = "dashed"), guide = "none") +
  # Better legends
  labs(title = "Comparing Different Regression Lines",
       subtitle = "The OLS line minimizes the sum of squared errors",
       x = "X", y = "Y",
       color = "Regression Line") +
  guides(color = guide_legend(override.aes = list(size = 2)))
# Create mini residual plots with improved visibility
p_mini <- list()

for(i in 1:nrow(lines)) {
  line_data <- subset(mini_data, line == lines$label[i])
  
  p_mini[[i]] <- ggplot(line_data, aes(x = x, y = residuals)) +
    # Add reference line
    geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.8, color = "gray50") +
    # Add residual points with line color
    geom_point(color = line_colors[lines$label[i]], size = 2.5) +
    # Add squares to represent squared errors
    geom_rect(aes(xmin = x - 0.3, xmax = x + 0.3,
                  ymin = 0, ymax = residuals),
              fill = line_colors[lines$label[i]], alpha = 0.2) +
    # Improved titles
    labs(title = lines$label[i],
         subtitle = paste("SSE =", round(lines$sse[i], 1), 
                          ifelse(i == 1, " (Optimal)", 
                                 paste0(" (+", lines$pct_increase[i], "%)"))),
         x = NULL, y = NULL) +
    theme_minimal() +
    theme(
      plot.title = element_text(size = 12, face = "bold", color = line_colors[lines$label[i]]),
      plot.subtitle = element_text(size = 10),
      panel.grid.minor = element_blank()
    )
}

# Create SSE comparison table with better visibility
sse_df <- data.frame(
  x = rep(1, nrow(lines)),
  y = nrow(lines):1,
  label = paste0(lines$label, ": SSE = ", round(lines$sse, 1), " (", lines$pct_text, ")"),
  color = line_colors[lines$label]
)

sse_table <- ggplot(sse_df, aes(x = x, y = y, label = label, color = color)) +
  geom_text(hjust = 0, size = 5, fontface = "bold") +
  scale_color_identity() +
  theme_void() +
  xlim(1, 10) +
  ylim(0.5, nrow(lines) + 0.5) +
  labs(title = "Sum of Squared Errors (SSE) Comparison") +
  theme(plot.title = element_text(hjust = 0, face = "bold", size = 14))

# Arrange the plots with better spacing
grid.arrange(
  p1, 
  arrangeGrob(p_mini[[1]], p_mini[[2]], p_mini[[3]], p_mini[[4]], 
              ncol = 2, padding = unit(1, "cm")),
  sse_table, 
  ncol = 1, 
  heights = c(4, 3, 1)
)

Comparing different regression lines

Key Learning Points

  • The Sum of Squared Errors (SSE) is what Ordinary Least Squares (OLS) regression minimizes
  • Each residual contributes its squared value to the total SSE
  • The OLS line has a lower SSE than any other possible line
  • Large residuals contribute disproportionately to the SSE due to the squaring operation
  • This is why outliers can have such a strong influence on regression lines

Step-by-Step SSE Minimization

To illustrate the process of finding the minimum SSE, we can create a sequence that passes through the optimal point, showing how the SSE first decreases to a minimum and then increases again:

# Create sample data
set.seed(123)
x <- 1:20
y <- 2 + 3*x + rnorm(20, 0, 5)
data <- data.frame(x = x, y = y)

# Fit regression model
model <- lm(y ~ x, data = data)
coef <- coefficients(model)

# Create a sequence of steps that passes through the optimal OLS line
steps <- 9  # Use odd number to have a middle point at the optimum
step_seq <- data.frame(
  step = 1:steps,
  intercept = seq(coef[1] - 8, coef[1] + 8, length.out = steps),
  slope = seq(coef[2] - 1.5, coef[2] + 1.5, length.out = steps)
)

# Mark the middle step (optimal OLS solution)
optimal_step <- ceiling(steps/2)

# Calculate SSE for each step
step_seq$sse <- sapply(1:nrow(step_seq), function(i) {
  predicted <- step_seq$intercept[i] + step_seq$slope[i] * x
  sum((y - predicted)^2)
})

# Create a "journey through the SSE valley" plot
p2 <- ggplot(data, aes(x = x, y = y)) +
  geom_point(size = 2, alpha = 0.7) +
  geom_abline(data = step_seq, 
              aes(intercept = intercept, slope = slope, 
                  color = sse, group = step),
              linewidth = 1) +
  # Highlight the optimal line
  geom_abline(intercept = step_seq$intercept[optimal_step], 
              slope = step_seq$slope[optimal_step],
              color = "green", size = 1.5) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(title = "Journey Through the SSE Valley",
       subtitle = "The green line represents the OLS solution with minimum SSE",
       color = "SSE Value") +
  theme_minimal()

# Create an SSE valley plot
p3 <- ggplot(step_seq, aes(x = step, y = sse)) +
  geom_line(linewidth = 1) +
  geom_point(size = 3, aes(color = sse)) +
  scale_color_gradient(low = "blue", high = "red") +
  # Highlight the optimal point
  geom_point(data = step_seq[optimal_step, ], aes(x = step, y = sse), 
             size = 5, color = "green") +
  # Add annotation
  annotate("text", x = optimal_step, y = step_seq$sse[optimal_step] * 1.1, 
           label = "Minimum SSE", color = "darkgreen", fontface = "bold") +
  labs(title = "The SSE Valley: Decreasing Then Increasing",
       subtitle = "The SSE reaches its minimum at the OLS solution",
       x = "Step",
       y = "Sum of Squared Errors") +
  theme_minimal() +
  theme(legend.position = "none")

# Display both plots
grid.arrange(p2, p3, ncol = 1, heights = c(3, 2))

SSE minimization visualization

In R, the lm() function fits linear regression models:

model <- lm(y ~ x, data = data_frame)
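
Once fitted, the model object can be queried with a handful of standard accessor functions. A quick reference sketch, reusing the placeholder names from the line above:

# Standard accessors for a fitted lm object (model and data_frame are the placeholders above)
coef(model)                            # estimated coefficients
confint(model)                         # confidence intervals for the coefficients
fitted(model)                          # fitted values for each observation
residuals(model)                       # residuals
summary(model)                         # coefficient table, R-squared, F-statistic
predict(model, newdata = data_frame)   # predictions for (new) data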

Model Interpretation: A Beginner’s Guide

Let’s create a simple dataset to understand regression output better. Imagine we’re studying how years of education affect annual income:

# Create a simple dataset - this is our Data Generating Process (DGP)
set.seed(123) # For reproducibility
education_years <- 10:20  # Education from 10 to 20 years
n <- length(education_years)

# True parameters in our model - using more realistic values for Poland
true_intercept <- 3000   # Base monthly income with no education (in PLN)
true_slope <- 250        # Each year of education increases monthly income by 250 PLN

# Generate monthly incomes with some random noise
income <- true_intercept + true_slope * education_years + rnorm(n, mean=0, sd=300)

# Create our dataset
education_income <- data.frame(
  education = education_years,
  income = income
)

# Let's visualize our data
library(ggplot2)
ggplot(education_income, aes(x = education, y = income)) +
  geom_point(size = 3, color = "blue") +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  scale_y_continuous(limits = c(5000, 8500), 
                     breaks = seq(5000, 8500, by = 500),
                     labels = scales::comma) +
  scale_x_continuous(breaks = 10:20) +
  labs(
    title = "Relationship between Education and Income in Poland",
    x = "Years of Education",
    y = "Monthly Income (PLN)",
    subtitle = "Red line shows the estimated linear relationship"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    panel.grid.minor = element_blank(),
    plot.title = element_text(face = "bold"),
    axis.title = element_text(face = "bold")
  ) +
  annotate("text", x = 11, y = 8000, 
           label = "Each point represents\none person's data", 
           hjust = 0, size = 4)
`geom_smooth()` using formula = 'y ~ x'

Fitting the Model

Now let’s fit a linear regression model to this data:

# Fit a simple regression model
edu_income_model <- lm(income ~ education, data = education_income)

# Display the results
model_summary <- summary(edu_income_model)
model_summary

Call:
lm(formula = income ~ education, data = education_income)

Residuals:
    Min      1Q  Median      3Q     Max 
-427.72 -206.04  -38.12  207.32  460.78 

Coefficients:
            Estimate Std. Error t value  Pr(>|t|)    
(Intercept)   3095.3      447.6   6.915 0.0000695 ***
education      247.2       29.2   8.467 0.0000140 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 306.3 on 9 degrees of freedom
Multiple R-squared:  0.8885,    Adjusted R-squared:  0.8761 
F-statistic: 71.69 on 1 and 9 DF,  p-value: 0.00001403

Understanding the Regression Output Step by Step

Let’s break down what each part of this output means in simple terms:

1. The Formula

At the top, you see income ~ education, which means we’re predicting income based on education.

2. Residuals

These show how far our predictions are from the actual values. Ideally, they should be centered around zero.

3. Coefficients Table
Coefficient Estimates

              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)   3095.27    447.63       6.91      0.0000695
education      247.23     29.20       8.47      0.0000140

Intercept (\beta_0):

  • Value: Approximately 3095

  • Interpretation: This is the predicted monthly income for someone with 0 years of education

  • Note: Sometimes the intercept isn’t meaningful in real-world terms, especially if x=0 is outside your data range

Education (\beta_1):

  • Value: Approximately 247

  • Interpretation: For each additional year of education, we expect monthly income to increase by this amount in PLN

  • This is our main coefficient of interest!

Standard Error:

  • Measures how precise our estimates are

  • Smaller standard errors mean more precise estimates

  • Think of it as “give or take how much” for our coefficients

t value:

  • This is the coefficient divided by its standard error

  • It tells us how many standard errors away from zero our coefficient is

  • Larger absolute t values (above 2) suggest the effect is statistically significant
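
  • For the education coefficient, this is 247.23 / 29.20 ≈ 8.47, matching the t value reported above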

p-value:

  • The probability of seeing our result (or something more extreme) if there was actually no relationship

  • Typically, p < 0.05 is considered statistically significant

  • For education, p = 0.000014, which is significant!

4. Model Fit Statistics
Model Fit Statistics

Statistic            Value
R-squared            0.888
Adjusted R-squared   0.876
F-statistic          71.686
p-value              0.0000140

R-squared:

  • Value: 0.888

  • Interpretation: 89% of the variation in income is explained by education

  • Higher is better, but be cautious of very high values (could indicate overfitting)

F-statistic:

  • Tests whether the model as a whole is statistically significant

  • A high F-statistic with a low p-value indicates a significant model

Visualizing the Model Results

Let’s visualize what our model actually tells us:

# Predicted values
education_income$predicted <- predict(edu_income_model)
education_income$residuals <- residuals(edu_income_model)

# Create a more informative plot
ggplot(education_income, aes(x = education, y = income)) +
  # Actual data points
  geom_point(size = 3, color = "blue") +
  
  # Regression line
  geom_line(aes(y = predicted), color = "red", linewidth = 1.2) +
  
  # Residual lines
  geom_segment(aes(xend = education, yend = predicted), 
               color = "darkgray", linetype = "dashed") +
  
  # Set proper scales
  scale_y_continuous(limits = c(5000, 8500), 
                     breaks = seq(5000, 8500, by = 500),
                     labels = scales::comma) +
  scale_x_continuous(breaks = 10:20) +
  
  # Annotations
  annotate("text", x = 19, y = 7850, 
           label = paste("Slope =", round(coef(edu_income_model)[2]), "PLN per year"),
           color = "red", hjust = 1, fontface = "bold") +
  
  annotate("text", x = 10.5, y = 5500, 
           label = paste("Intercept =", round(coef(edu_income_model)[1]), "PLN"),
           color = "red", hjust = 0, fontface = "bold") +
  
  annotate("text", x = 14, y = 8200, 
           label = paste("R² =", round(model_summary$r.squared, 2)),
           color = "black", fontface = "bold") +
  
  # Labels
  labs(
    title = "Interpreting the Education-Income Regression Model",
    subtitle = "Red line shows predicted income for each education level",
    x = "Years of Education",
    y = "Monthly Income (PLN)",
    caption = "Gray dashed lines represent residuals (prediction errors)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    panel.grid.minor = element_blank(),
    plot.title = element_text(face = "bold"),
    axis.title = element_text(face = "bold")
  )

Real-World Interpretation

  • A person with 16 years of education (a college graduate) would be predicted to earn about: \hat{Y} = 3095.3 + 247.2 \times 16 \approx 7051 \text{ PLN monthly} (verified with predict() in the sketch after this list)

  • The model suggests that each additional year of education is associated with a 247 PLN increase in monthly income.

  • Our model explains approximately 89% of the variation in income in our sample.

  • The relationship is statistically significant (p < 0.001), meaning a relationship this strong would be very unlikely to arise by chance if there were no true association between education and income in the population.
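
The first prediction above can be reproduced, together with a 95% prediction interval, by calling predict() on the edu_income_model object fitted earlier; a minimal sketch:

# Predicted monthly income for 16 years of education, with a 95% prediction interval
predict(edu_income_model,
        newdata  = data.frame(education = 16),
        interval = "prediction")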

Important Cautions for Beginners

  1. Correlation ≠ Causation: Our model shows association, not necessarily causation

  2. Omitted Variables: Other factors might influence both education and income

  3. Extrapolation: Be careful predicting outside the range of your data

  4. Linear Relationship: We’ve assumed the relationship is linear, which may not always be true

9.27 Regression Analysis and Ordinary Least Squares (*)

Foundations of Regression Analysis

Regression analysis constitutes a fundamental statistical methodology for examining relationships between variables. At its core, regression provides a systematic framework for understanding how changes in one or more independent variables influence a dependent variable.

The primary objectives of regression analysis include:

  • Quantifying relationships between variables
  • Making predictions based on observed patterns
  • Testing hypotheses about variable associations
  • Understanding the proportion of variation explained by predictors

Deterministic versus Stochastic Models

Statistical modeling encompasses two fundamental approaches:

Deterministic models assume precise, invariant relationships between variables. Given specific inputs, these models yield identical outputs without variation. Consider the physics equation:

\text{Distance} = \text{Speed} \times \text{Time}

This relationship exhibits no randomness; identical inputs always produce identical outputs.

Stochastic models, in contrast, acknowledge inherent variability in real-world phenomena. Regression analysis employs stochastic modeling through the fundamental equation:

Y = f(X) + \epsilon

Where:

  • Y represents the outcome variable
  • f(X) captures the systematic relationship between predictors and outcome
  • \epsilon represents random variation inherent in the data

This formulation recognizes that real-world relationships contain both systematic patterns and random variation.

9.28 Simple Linear Regression Model

Model Specification

Simple linear regression models the relationship between a single predictor variable and an outcome variable through a linear equation:

Y_i = \beta_0 + \beta_1X_i + \epsilon_i

The model components represent:

  • Y_i: The dependent variable for observation i
  • X_i: The independent variable for observation i
  • \beta_0: The population intercept parameter
  • \beta_1: The population slope parameter
  • \epsilon_i: The random error term for observation i

Interpretation of Parameters

The parameters possess specific interpretations:

  • Intercept (\beta_0): The expected value of Y when X = 0. This represents the baseline level of the outcome variable.
  • Slope (\beta_1): The expected change in Y for a one-unit increase in X. This quantifies the strength and direction of the linear relationship.
  • Error term (\epsilon_i): Captures all factors affecting Y not explained by X, including measurement error, omitted variables, and inherent randomness.

Estimation versus True Parameters

The distinction between population parameters and sample estimates proves crucial:

  • Population parameters (\beta_0, \beta_1) represent true, unknown values
  • Sample estimates (\hat{\beta}_0, \hat{\beta}_1) represent our best approximations based on available data
  • The hat notation (^) consistently denotes estimated values

The fitted regression equation becomes:

\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1X_i

9.29 The Ordinary Least Squares Method

The Fundamental Challenge

Given a dataset with observations (X_i, Y_i), we need a systematic method to determine the “best” values for \hat{\beta}_0 and \hat{\beta}_1. The challenge lies in defining what constitutes “best” and developing a practical method to find these values.

Consider that for any given line through the data, each observation will have a prediction error or residual:

e_i = Y_i - \hat{Y}_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1X_i)

These residuals represent how far our predictions deviate from actual values. A good fitting line should make these residuals as small as possible overall.

Why Minimize the Sum of Squared Residuals?

The Ordinary Least Squares method determines optimal parameter estimates by minimizing the sum of squared residuals. This choice requires justification, as we could conceivably minimize other quantities. The rationale for squaring residuals includes:

Mathematical tractability: Squaring creates a smooth, differentiable function that yields closed-form solutions through calculus. The derivatives of squared terms lead to linear equations that can be solved analytically.

Equal treatment of positive and negative errors: Simply summing raw residuals would allow positive and negative errors to cancel, potentially yielding a sum of zero even when predictions are poor. Squaring ensures all deviations contribute positively to the total error measure.

Penalization of large errors: Squaring gives progressively greater weight to larger errors. An error of 4 units contributes 16 to the sum, while an error of 2 units contributes only 4. This property encourages finding a line that avoids extreme prediction errors.

Statistical optimality: Under certain assumptions (including normally distributed errors), OLS estimators possess desirable statistical properties, including being the Best Linear Unbiased Estimators (BLUE) according to the Gauss-Markov theorem.

Connection to variance: The sum of squared deviations directly relates to variance, a fundamental measure of spread in statistics. Minimizing squared residuals thus minimizes the variance of prediction errors.

The OLS Optimization Problem

The OLS method formally seeks values of \hat{\beta}_0 and \hat{\beta}_1 that minimize:

\text{SSE} = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^n (Y_i - (\hat{\beta}_0 + \hat{\beta}_1X_i))^2

This optimization problem can be solved using calculus by:

  1. Taking partial derivatives with respect to \hat{\beta}_0 and \hat{\beta}_1
  2. Setting these derivatives equal to zero
  3. Solving the resulting system of equations

Derivation of OLS Estimators

The minimization yields closed-form solutions:

\hat{\beta}_1 = \frac{\sum_{i=1}^n(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n(X_i - \bar{X})^2} = \frac{\text{Cov}(X,Y)}{\text{Var}(X)}

\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}

These formulas reveal that:

  • The slope estimate depends on the covariance between the variables relative to the predictor’s variance
  • The intercept ensures the regression line passes through the point of means (\bar{X}, \bar{Y})
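
These closed-form expressions can be checked numerically against lm(); a small sketch on simulated data:

# Check that the closed-form OLS formulas match lm()
set.seed(11)
x <- runif(40, 0, 10)
y <- 3 + 2*x + rnorm(40, 0, 1.5)

b1 <- cov(x, y) / var(x)          # slope: Cov(X, Y) / Var(X)
b0 <- mean(y) - b1 * mean(x)      # intercept: mean of Y minus slope times mean of X

c(b0 = b0, b1 = b1)
coef(lm(y ~ x))                   # should agree with the manual formulas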

Properties of OLS Estimators

OLS estimators possess several desirable properties:

  • Unbiasedness: Under appropriate conditions, E[\hat{\beta}_j] = \beta_j
  • Efficiency: OLS provides minimum variance among linear unbiased estimators
  • Consistency: As sample size increases, estimates converge to true values
  • The regression line passes through the centroid: The point (\bar{X}, \bar{Y}) always lies on the fitted line

Extension to Multiple Regression

While this guide focuses on simple linear regression with one predictor, the OLS framework extends naturally to multiple regression with several predictors:

Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + ... + \beta_kX_{ki} + \epsilon_i

The same principle applies: we minimize the sum of squared residuals, though the mathematics involves matrix algebra rather than simple formulas. The fundamental logic—finding parameter values that minimize prediction errors—remains unchanged.

9.30 Understanding Variance Decomposition

The Baseline Model Concept

Before introducing predictors, consider the simplest possible model: predicting every observation using the overall mean \bar{Y}. This baseline model represents our best prediction in the absence of additional information.

The baseline model’s predictions: \hat{Y}_i^{\text{baseline}} = \bar{Y} \text{ for all } i

This model serves as a reference point for evaluating improvement gained through incorporating predictors. The baseline model essentially asks: “If we knew nothing about the relationship between X and Y, what would be our best constant prediction?”
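
In R, this baseline corresponds to an intercept-only model. A small sketch on simulated data, showing that its single estimated coefficient is exactly the sample mean:

# The intercept-only (baseline) model predicts mean(y) for every observation
set.seed(5)
y <- rnorm(25, 50, 10)

baseline <- lm(y ~ 1)
c(intercept = unname(coef(baseline)), mean_y = mean(y))   # the two values coincide
unique(round(fitted(baseline), 10))                       # every fitted value equals the mean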

Components of Total Variation

The total variation in the outcome variable decomposes into three fundamental components:

Total Sum of Squares (SST)

SST quantifies the total variation in the outcome variable relative to its mean:

\text{SST} = \sum_{i=1}^n (Y_i - \bar{Y})^2

Interpretation: SST represents the total variation that requires explanation. It measures the squared prediction error incurred when using only the mean as our model, that is, the error sum of squares of the baseline (mean-only) model. This is the starting point: the total amount of variation we hope to explain by introducing predictors.

Regression Sum of Squares (SSR)

SSR measures the variation explained by the regression model:

\text{SSR} = \sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2

Interpretation: SSR quantifies the improvement in prediction achieved by incorporating the predictor variable. It represents the reduction in prediction error relative to the baseline model—the portion of total variation that our regression line successfully captures.

Error Sum of Squares (SSE)

SSE captures the unexplained variation remaining after regression:

\text{SSE} = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2

Interpretation: SSE represents the residual variation that the model cannot explain, reflecting the inherent randomness and effects of omitted variables. This is the variation that remains even after using our best-fitting line.

The Fundamental Decomposition Identity

These components relate through the fundamental equation:

\text{SST} = \text{SSR} + \text{SSE}

This identity demonstrates that:

  • Total variation equals the sum of explained and unexplained components
  • The regression model partitions total variation into systematic and random parts
  • Model improvement can be assessed by comparing SSR to SST

Conceptual Framework for Variance Decomposition

# Demonstration of Variance Decomposition
library(ggplot2)

# Generate sample data
set.seed(42)
n <- 50
x <- runif(n, 1, 10)
y <- 3 + 2*x + rnorm(n, 0, 2)
data <- data.frame(x = x, y = y)

# Fit model
model <- lm(y ~ x, data = data)
y_mean <- mean(y)
y_pred <- predict(model)

# Calculate components
SST <- sum((y - y_mean)^2)
SSR <- sum((y_pred - y_mean)^2)
SSE <- sum((y - y_pred)^2)

# Display decomposition
cat("Variance Decomposition\n")
cat("======================\n")
cat("Total SS (SST):", round(SST, 2), 
    "- Total variation from mean\n")
cat("Regression SS (SSR):", round(SSR, 2), 
    "- Variation explained by model\n")
cat("Error SS (SSE):", round(SSE, 2), 
    "- Unexplained variation\n")
cat("\nVerification: SST = SSR + SSE\n")
cat(round(SST, 2), "=", round(SSR, 2), "+", 
    round(SSE, 2), "\n")

9.31 The Coefficient of Determination (R²)

Definition and Calculation

The coefficient of determination, denoted R², quantifies the proportion of total variation explained by the regression model:

R^2 = \frac{\text{SSR}}{\text{SST}} = \frac{\text{Explained Variation}}{\text{Total Variation}}

Alternatively expressed as:

R^2 = 1 - \frac{\text{SSE}}{\text{SST}} = 1 - \frac{\text{Unexplained Variation}}{\text{Total Variation}}

R² directly answers the question: “What proportion of the total variation in Y (relative to the baseline mean model) does our regression model explain?”

Interpretation Guidelines

R² values range from 0 to 1, with specific interpretations:

  • R² = 0: The model explains no variation beyond the baseline mean model. The regression line provides no improvement over simply using \bar{Y}.
  • R² = 0.25: The model explains 25% of total variation. Three-quarters of the variation remains unexplained.
  • R² = 0.75: The model explains 75% of total variation. This represents substantial explanatory power.
  • R² = 1.00: The model explains all variation (perfect fit). All data points fall exactly on the regression line.

Contextual Considerations

The interpretation of R² requires careful consideration of context:

Field-specific standards: Acceptable R² values vary dramatically across disciplines:

  • Physical sciences often expect R² > 0.90 due to controlled conditions
  • Social sciences may consider R² = 0.30 meaningful given human complexity
  • Biological systems typically show intermediate values due to natural variation

Sample size effects: Small samples can artificially inflate R², leading to overly optimistic assessments of model fit.

Model complexity: In multiple regression, additional predictors mechanically increase R², even if they lack true explanatory power.

Practical significance: Statistical fit should align with substantive importance. A model with R² = 0.95 may be less useful than one with R² = 0.60 if the latter addresses more relevant questions.

Adjusted R² for Multiple Regression

When extending to multiple regression, adjusted R² accounts for the number of predictors:

R^2_{\text{adj}} = 1 - \frac{\text{SSE}/(n-p)}{\text{SST}/(n-1)}

Where:

  • n = sample size
  • p = number of parameters (including the intercept)

Adjusted R² penalizes model complexity, providing a more conservative measure when comparing models with different numbers of predictors.

Measures of Fit

  1. R-squared (R^2): R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}

  2. Root Mean Square Error (RMSE): RMSE = \sqrt{\frac{SSE}{n}}

  3. Mean Absolute Error (MAE): MAE = \frac{1}{n}\sum_{i=1}^n |Y_i - \hat{Y}_i|
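
A small sketch computing these measures (plus adjusted R² for comparison) from a model fitted to simulated data:

# Computing R-squared, RMSE, and MAE from a fitted lm object
set.seed(21)
n <- 60
x <- runif(n, 0, 10)
y <- 5 + 1.5*x + rnorm(n, 0, 2)

fit <- lm(y ~ x)
e   <- residuals(fit)
SST <- sum((y - mean(y))^2)
SSE <- sum(e^2)

c(R2     = 1 - SSE / SST,                          # matches summary(fit)$r.squared
  adj_R2 = 1 - (SSE / (n - 2)) / (SST / (n - 1)),  # p = 2 parameters here
  RMSE   = sqrt(mean(e^2)),
  MAE    = mean(abs(e)))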

9.32 Practical Implementation

Complete Regression Analysis Example

# Comprehensive regression analysis
library(ggplot2)

# Generate educational data
set.seed(123)
n <- 100
study_hours <- runif(n, 0, 10)
exam_scores <- 50 + 4*study_hours + rnorm(n, 0, 5)
education_data <- data.frame(
  study_hours = study_hours,
  exam_scores = exam_scores
)

# Fit regression model
model <- lm(exam_scores ~ study_hours, data = education_data)

# Extract key statistics
y_mean <- mean(exam_scores)
y_pred <- predict(model)

# Variance decomposition
SST <- sum((exam_scores - y_mean)^2)
SSR <- sum((y_pred - y_mean)^2)
SSE <- sum((exam_scores - y_pred)^2)
r_squared <- SSR/SST

# Display results
cat("OLS Regression Results\n")
OLS Regression Results
cat("======================\n")
======================
cat("\nParameter Estimates:\n")

Parameter Estimates:
cat("Intercept (β₀):", round(coef(model)[1], 2), "\n")
Intercept (β₀): 49.96 
cat("Slope (β₁):", round(coef(model)[2], 2), "\n")
Slope (β₁): 3.96 
cat("\nInterpretation:\n")

Interpretation:
cat("- Expected score with 0 study hours:", 
    round(coef(model)[1], 2), "\n")
- Expected score with 0 study hours: 49.96 
cat("- Score increase per study hour:", 
    round(coef(model)[2], 2), "\n")
- Score increase per study hour: 3.96 
cat("\nVariance Decomposition:\n")

Variance Decomposition:
cat("Total variation (SST):", round(SST, 2), "\n")
Total variation (SST): 14880.14 
cat("Explained by model (SSR):", round(SSR, 2), "\n")
Explained by model (SSR): 12578.3 
cat("Unexplained (SSE):", round(SSE, 2), "\n")
Unexplained (SSE): 2301.84 
cat("\nModel Performance:\n")

Model Performance:
cat("R-squared:", round(r_squared, 4), "\n")
R-squared: 0.8453 
cat("Interpretation: The model explains", 
    round(r_squared * 100, 1), 
    "% of the variation in exam scores\n")
Interpretation: The model explains 84.5 % of the variation in exam scores
cat("beyond what the mean alone could explain.\n")
beyond what the mean alone could explain.

Visualization of Key Concepts

# Create comprehensive visualization
library(ggplot2)
library(gridExtra)

# Plot 1: Baseline model (mean only)
p1 <- ggplot(education_data, aes(x = study_hours, y = exam_scores)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = y_mean, color = "blue", linewidth = 1) +
  geom_segment(aes(xend = study_hours, yend = y_mean), 
               alpha = 0.3, color = "red") +
  labs(title = "Baseline Model: Predicting with Mean",
       subtitle = paste("Total variation (SST) =", round(SST, 1)),
       x = "Study Hours", y = "Exam Scores") +
  theme_minimal() +
  annotate("text", x = 8, y = y_mean + 1, 
           label = "Mean of Y", color = "blue")

# Plot 2: Regression model
p2 <- ggplot(education_data, aes(x = study_hours, y = exam_scores)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  geom_segment(aes(xend = study_hours, yend = y_pred), 
               alpha = 0.3, color = "red") +
  labs(title = "Regression Model: Improved Predictions",
       subtitle = paste("Unexplained variation (SSE) =", 
                       round(SSE, 1), 
                       "| R² =", round(r_squared, 3)),
       x = "Study Hours", y = "Exam Scores") +
  theme_minimal() +
  annotate("text", x = 8, y = max(y_pred) + 1, 
           label = "Regression Line", color = "blue")

# Plot 3: Variance components
variance_data <- data.frame(
  Component = c("Total (SST)", "Explained (SSR)", "Unexplained (SSE)"),
  Value = c(SST, SSR, SSE),
  Type = c("Total", "Explained", "Unexplained")
)

p3 <- ggplot(variance_data, aes(x = Component, y = Value, fill = Type)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("Total" = "gray50", 
                               "Explained" = "green4", 
                               "Unexplained" = "red3")) +
  labs(title = "Variance Decomposition",
       y = "Sum of Squares") +
  theme_minimal() +
  theme(legend.position = "none")

# Combine plots
grid.arrange(p1, p2, p3, layout_matrix = rbind(c(1, 2), c(3, 3)))
`geom_smooth()` using formula = 'y ~ x'

9.33 Summary and Key Insights

Core Concepts Review

Regression analysis models relationships between variables using stochastic frameworks that acknowledge inherent variation. The simple linear regression model expresses this relationship as Y_i = \beta_0 + \beta_1X_i + \epsilon_i.

Ordinary Least Squares provides optimal parameter estimates by minimizing the sum of squared residuals. This choice of minimizing squared errors stems from mathematical tractability, equal treatment of positive and negative errors, appropriate penalization of large errors, and desirable statistical properties.

Variance decomposition partitions total variation into:

  • Total variation relative to the baseline mean model (SST)
  • Variation explained by the regression model (SSR)
  • Unexplained residual variation (SSE)

The fundamental identity SST = SSR + SSE shows how regression improves upon the baseline model.

The coefficient of determination (R²) quantifies model performance as the proportion of total variation explained, providing a standardized measure of how much better the regression model performs than simply using the mean.

Critical Considerations

The baseline model (predicting with the mean) serves as the fundamental reference point. All regression improvement is measured relative to this simple model. SST represents the total variance requiring explanation when we start with no predictors.

Parameter estimates (\hat{\beta}_0, \hat{\beta}_1) represent sample-based approximations of unknown population values (\beta_0, \beta_1). The distinction between population parameters and sample estimates remains crucial for proper inference.

Model assessment requires considering both statistical fit (R²) and practical significance. A model with modest R² may still provide valuable insights, while high R² does not guarantee causation or practical utility.

Extensions and Applications

While this guide focuses on simple linear regression, the framework extends naturally to:

  • Multiple regression with several predictors
  • Polynomial regression for nonlinear relationships
  • Interaction terms to capture conditional effects
  • Categorical predictors through appropriate coding

The fundamental principles—minimizing prediction errors, decomposing variation, and assessing model fit—remain consistent across these extensions. The mathematical complexity increases, but the conceptual foundation established here continues to apply.

9.34 Key Assumptions of Linear Regression

Strict Exogeneity: The Fundamental Assumption

The most crucial assumption in regression is strict exogeneity:

E[\varepsilon|X] = 0

This means:

  1. The error term has zero mean conditional on X
  2. X contains no information about the average error
  3. There are no systematic patterns in how our predictions are wrong

Let’s visualize when this assumption holds and when it doesn’t:

# Generate data
set.seed(789)
x <- seq(1, 10, by = 0.2)

# Case 1: Exogenous errors
y_exog <- 2 + 3*x + rnorm(length(x), 0, 2)

# Case 2: Non-exogenous errors: an omitted quadratic term means E[error | x] != 0
# under the linear fit, so the residuals show a systematic pattern in x
y_nonexog <- 2 + 3*x + 0.4*(x - mean(x))^2 + rnorm(length(x), 0, 2)

# Create datasets
data_exog <- data.frame(
  x = x,
  y = y_exog,
  type = "Exogenous Errors\n(Assumption Satisfied)"
)

data_nonexog <- data.frame(
  x = x,
  y = y_nonexog,
  type = "Non-Exogenous Errors\n(Assumption Violated)"
)

data_combined <- rbind(data_exog, data_nonexog)

# Create plots with residuals
plot_residuals <- function(data, title) {
  model <- lm(y ~ x, data = data)
  data$predicted <- predict(model)
  data$residuals <- residuals(model)
  
  p1 <- ggplot(data, aes(x = x, y = y)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE, color = "red") +
    theme_minimal() +
    labs(title = title)
  
  p2 <- ggplot(data, aes(x = x, y = residuals)) +
    geom_point() +
    geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
    theme_minimal() +
    labs(y = "Residuals")
  
  list(p1, p2)
}

# Generate plots
plots_exog <- plot_residuals(data_exog, "Exogenous Errors")
plots_nonexog <- plot_residuals(data_nonexog, "Non-Exogenous Errors")

# Arrange plots
gridExtra::grid.arrange(
  plots_exog[[1]], plots_exog[[2]],
  plots_nonexog[[1]], plots_nonexog[[2]],
  ncol = 2
)
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
Figure 9.1: Exogeneity vs. Non-Exogeneity Examples

Linearity: The Form Assumption

The relationship between X and Y should be linear in parameters:

E[Y|X] = \beta_0 + \beta_1X

Note that this doesn’t mean X and Y must have a straight-line relationship - we can transform variables. Let’s see different types of relationships:

# Generate data
set.seed(101)
x <- seq(1, 10, by = 0.1)

# Different relationships
data_relationships <- data.frame(
  x = rep(x, 3),
  y = c(
    # Linear
    2 + 3*x + rnorm(length(x), 0, 2),
    # Quadratic
    2 + 0.5*x^2 + rnorm(length(x), 0, 2),
    # Exponential
    exp(0.3*x) + rnorm(length(x), 0, 2)
  ),
  type = rep(c("Linear", "Quadratic", "Exponential"), each = length(x))
)

# Plot
ggplot(data_relationships, aes(x = x, y = y)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  geom_smooth(se = FALSE, color = "blue") +
  facet_wrap(~type, scales = "free_y") +
  theme_minimal() +
  labs(subtitle = "Red: linear fit, Blue: true relationship")
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Figure 9.2: Linear and Nonlinear Relationships

Understanding Violations and Solutions

When linearity is violated:

  1. Transform variables:
    • Log transformation: for exponential relationships
    • Square root: for moderate nonlinearity
    • Power transformations: for more complex relationships
# Generate exponential data
set.seed(102)
x <- seq(1, 10, by = 0.2)
y <- exp(0.3*x + rnorm(length(x), 0, 0.2))  # multiplicative noise keeps y positive for log()

# Create datasets
data_trans <- data.frame(
  x = x,
  y = y,
  log_y = log(y)
)
# Original scale plot
p1 <- ggplot(data_trans, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  theme_minimal() +
  labs(title = "Original Scale")

# Log scale plot
p2 <- ggplot(data_trans, aes(x = x, y = log_y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  theme_minimal() +
  labs(title = "Log-Transformed Y")

gridExtra::grid.arrange(p1, p2, ncol = 2)
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
Figure 9.3: Effect of Variable Transformations

9.35 Spurious Correlation: Causes and Examples

Spurious correlation occurs when variables appear related but the relationship is not causal. These misleading correlations arise from several sources:

  • Random coincidence (chance)
  • Confounding variables (hidden third factors)
  • Selection biases
  • Improper statistical analysis
  • Reverse causality
  • Endogeneity problems (including simultaneity)

Random Coincidence (Chance)

With sufficient data mining or small sample sizes, seemingly meaningful correlations can emerge purely by chance. This is especially problematic when researchers conduct multiple analyses without appropriate corrections for multiple comparisons, a practice known as “p-hacking.”

# Create a realistic example of spurious correlation based on actual country data
# Using country data on chocolate consumption and Nobel prize winners
# This example is inspired by a published correlation (Messerli, 2012)
set.seed(123)
countries <- c("Switzerland", "Sweden", "Denmark", "Belgium", "Austria", 
               "Norway", "Germany", "Netherlands", "United Kingdom", "Finland", 
               "France", "Italy", "Spain", "Poland", "Greece", "Portugal")

# Create realistic data: Chocolate consumption correlates with GDP per capita
# Higher GDP countries tend to consume more chocolate and have better research funding
gdp_per_capita <- c(87097, 58977, 67218, 51096, 53879, 89154, 51860, 57534, 
                    46510, 53982, 43659, 35551, 30416, 17841, 20192, 24567)

# Normalize GDP values to make them more manageable
gdp_normalized <- (gdp_per_capita - min(gdp_per_capita)) / 
                 (max(gdp_per_capita) - min(gdp_per_capita))

# More realistic chocolate consumption - loosely based on real consumption patterns
# plus some randomness, but influenced by GDP
chocolate_consumption <- 4 + 8 * gdp_normalized + rnorm(16, 0, 0.8)

# Nobel prizes - also influenced by GDP (research funding) with noise
# The relationship is non-linear, but will show up as correlated
nobel_prizes <- 2 + 12 * gdp_normalized^1.2 + rnorm(16, 0, 1.5)

# Create dataframe
country_data <- data.frame(
  country = countries,
  chocolate = round(chocolate_consumption, 1),
  nobel = round(nobel_prizes, 1),
  gdp = gdp_per_capita
)

# Fit regression model - chocolate vs nobel without controlling for GDP
chocolate_nobel_model <- lm(nobel ~ chocolate, data = country_data)

# Better model that reveals the confounding
full_model <- lm(nobel ~ chocolate + gdp, data = country_data)

# Plot the apparent relationship
ggplot(country_data, aes(x = chocolate, y = nobel)) +
  geom_point(color = "darkblue", size = 3, alpha = 0.7) +
  geom_text(aes(label = country), hjust = -0.2, vjust = 0, size = 3) +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  labs(
    title = "Apparent Correlation: Chocolate Consumption vs. Nobel Prizes",
    subtitle = "Demonstrates how confounding variables create spurious correlations",
    x = "Chocolate Consumption (kg per capita)",
    y = "Nobel Prizes per 10M Population"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

# Show regression results
summary(chocolate_nobel_model)

Call:
lm(formula = nobel ~ chocolate, data = country_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9080 -1.4228  0.0294  0.5962  3.2977 

Coefficients:
            Estimate Std. Error t value   Pr(>|t|)    
(Intercept)  -4.0518     1.3633  -2.972     0.0101 *  
chocolate     1.3322     0.1682   7.921 0.00000154 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.626 on 14 degrees of freedom
Multiple R-squared:  0.8176,    Adjusted R-squared:  0.8045 
F-statistic: 62.75 on 1 and 14 DF,  p-value: 0.000001536
# Demonstrate multiple testing problem
p_values <- numeric(100)
for(i in 1:100) {
  # Generate two completely random variables with n=20
  x <- rnorm(20)
  y <- rnorm(20)
  # Test for correlation and store p-value
  p_values[i] <- cor.test(x, y)$p.value
}

# How many "significant" results at alpha = 0.05?
sum(p_values < 0.05)
[1] 3
# Visualize the multiple testing phenomenon
hist(p_values, breaks = 20, main = "P-values from 100 Tests of Random Data",
     xlab = "P-value", col = "lightblue", border = "white")
abline(v = 0.05, col = "red", lwd = 2, lty = 2)
text(0.15, 20, paste("Approximately", sum(p_values < 0.05),
                     "tests are 'significant'\nby random chance alone!"), 
     col = "darkred")

This example demonstrates how seemingly compelling correlations can emerge between unrelated variables due to confounding factors and chance. The correlation between chocolate consumption and Nobel prizes appears significant (p < 0.05) when analyzed directly, even though it’s explained by a third variable - national wealth (GDP per capita).

Wealthier countries typically consume more chocolate and simultaneously invest more in education and research, leading to more Nobel prizes. Without controlling for this confounding factor, we would mistakenly conclude a direct relationship between chocolate and Nobel prizes.

The multiple testing demonstration further illustrates why spurious correlations appear so frequently in research. When conducting 100 statistical tests on completely random data, we expect approximately 5 “significant” results at α = 0.05 purely by chance. In real research settings where hundreds of variables might be analyzed, the probability of finding false positive correlations increases dramatically.

This example underscores three critical points:

  1. Small sample sizes (16 countries) are particularly vulnerable to chance correlations
  2. Confounding variables can create strong apparent associations between unrelated factors
  3. Multiple testing without appropriate corrections virtually guarantees finding “significant” but meaningless patterns

Such findings explain why replication is essential in research and why most initial “discoveries” fail to hold up in subsequent studies.
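
One standard safeguard is a multiple-comparison correction. Applied to the p_values vector simulated above, it removes most or all of the chance "discoveries"; here is a minimal sketch using base R's p.adjust():

# Adjust the 100 simulated p-values for multiple testing
p_bonferroni <- p.adjust(p_values, method = "bonferroni")  # family-wise error control
p_bh         <- p.adjust(p_values, method = "BH")          # false discovery rate control

sum(p_bonferroni < 0.05)  # chance findings rarely survive the correction
sum(p_bh < 0.05)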

Confounding Variables (Hidden Third Factors)

Confounding occurs when an external variable influences both the predictor and outcome variables, creating an apparent relationship that may disappear when the confounder is accounted for.

# Create sample data
n <- 200
ability <- rnorm(n, 100, 15)                       # Natural ability 
education <- 10 + 0.05 * ability + rnorm(n, 0, 2)  # Education affected by ability
income <- 10000 + 2000 * education + 100 * ability + rnorm(n, 0, 5000)  # Income affected by both

omitted_var_data <- data.frame(
  ability = ability,
  education = education,
  income = income
)

# Model without accounting for ability
model_naive <- lm(income ~ education, data = omitted_var_data)

# Model accounting for ability
model_full <- lm(income ~ education + ability, data = omitted_var_data)

# Show results
summary(model_naive)

Call:
lm(formula = income ~ education, data = omitted_var_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-14422.9  -3362.1    142.7   3647.7  14229.6 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept)  18982.0     2410.5   7.875    0.000000000000221 ***
education     2050.9      158.7  12.926 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5066 on 198 degrees of freedom
Multiple R-squared:  0.4576,    Adjusted R-squared:  0.4549 
F-statistic: 167.1 on 1 and 198 DF,  p-value: < 0.00000000000000022
summary(model_full)

Call:
lm(formula = income ~ education + ability, data = omitted_var_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-12739.9  -3388.7    -41.1   3572.1  14976.8 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept) 13203.84    3018.85   4.374            0.0000198 ***
education    1871.43     166.03  11.272 < 0.0000000000000002 ***
ability        85.60      27.87   3.071              0.00243 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4961 on 197 degrees of freedom
Multiple R-squared:  0.4824,    Adjusted R-squared:  0.4772 
F-statistic: 91.81 on 2 and 197 DF,  p-value: < 0.00000000000000022
# Create visualization with ability shown through color
ggplot(omitted_var_data, aes(x = education, y = income, color = ability)) +
  geom_point(alpha = 0.8) +
  scale_color_viridis_c(name = "Ability Score") +
  geom_smooth(method = "lm", color = "red", se = FALSE) + 
  labs(
    title = "Income vs. Education, Colored by Ability",
    subtitle = "Visualizing the confounding variable",
    x = "Years of Education",
    y = "Annual Income (PLN)"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

This example illustrates omitted variable bias: without accounting for ability, the estimated effect of education on income is exaggerated (about 2,051 PLN per additional year of education in the naive model vs. about 1,871 PLN once ability is controlled for). The confounding occurs because ability influences both education and income, adding a spurious component to the observed association.

Classic Example: Ice Cream and Drownings

A classic example of confounding involves the correlation between ice cream sales and drowning incidents, both influenced by temperature:

# Create sample data
n <- 100
temperature <- runif(n, 5, 35)  # Temperature in Celsius

# Both ice cream sales and drownings are influenced by temperature
ice_cream_sales <- 100 + 10 * temperature + rnorm(n, 0, 20)
drownings <- 1 + 0.3 * temperature + rnorm(n, 0, 1)

confounding_data <- data.frame(
  temperature = temperature,
  ice_cream_sales = ice_cream_sales,
  drownings = drownings
)

# Model without controlling for temperature
model_naive <- lm(drownings ~ ice_cream_sales, data = confounding_data)

# Model controlling for temperature
model_full <- lm(drownings ~ ice_cream_sales + temperature, data = confounding_data)

# Show results
summary(model_naive)

Call:
lm(formula = drownings ~ ice_cream_sales, data = confounding_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8163 -0.7597  0.0118  0.7846  2.5797 

Coefficients:
                 Estimate Std. Error t value            Pr(>|t|)    
(Intercept)     -1.503063   0.370590  -4.056              0.0001 ***
ice_cream_sales  0.028074   0.001205  23.305 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.088 on 98 degrees of freedom
Multiple R-squared:  0.8471,    Adjusted R-squared:  0.8456 
F-statistic: 543.1 on 1 and 98 DF,  p-value: < 0.00000000000000022
summary(model_full)

Call:
lm(formula = drownings ~ ice_cream_sales + temperature, data = confounding_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.85074 -0.61169  0.01186  0.60556  2.01776 

Coefficients:
                 Estimate Std. Error t value      Pr(>|t|)    
(Intercept)      1.243785   0.530123   2.346         0.021 *  
ice_cream_sales -0.002262   0.004839  -0.467         0.641    
temperature      0.317442   0.049515   6.411 0.00000000524 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9169 on 97 degrees of freedom
Multiple R-squared:  0.8926,    Adjusted R-squared:  0.8904 
F-statistic: 403.2 on 2 and 97 DF,  p-value: < 0.00000000000000022
# Create visualization
ggplot(confounding_data, aes(x = ice_cream_sales, y = drownings, color = temperature)) +
  geom_point(alpha = 0.8) +
  scale_color_viridis_c(name = "Temperature (°C)") +
  geom_smooth(method = "lm", color = "red") +
  labs(
    title = "The Ice Cream and Drownings Correlation",
    subtitle = "Temperature as a confounding variable",
    x = "Ice Cream Sales",
    y = "Drowning Incidents"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

The naive model shows a statistically significant relationship between ice cream sales and drownings. However, once temperature is included in the model, the coefficient for ice cream sales decreases substantially and becomes statistically insignificant. This demonstrates how failing to account for confounding variables can lead to spurious correlations.
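
The same conclusion can be reached with a partial correlation: residualize both variables on temperature (reusing confounding_data from above) and correlate what is left.

# Partial correlation of drownings and ice cream sales, controlling for temperature
resid_drownings <- residuals(lm(drownings ~ temperature, data = confounding_data))
resid_sales     <- residuals(lm(ice_cream_sales ~ temperature, data = confounding_data))

cor(confounding_data$drownings, confounding_data$ice_cream_sales)  # strong raw correlation
cor(resid_drownings, resid_sales)                                  # near zero after adjustment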

Reverse Causality

Reverse causality occurs when the assumed direction of causation is incorrect. Consider this example of anxiety and relaxation techniques:

# Create sample data
n <- 200
anxiety_level <- runif(n, 1, 10)  # Anxiety level (1-10)

# People with higher anxiety tend to use more relaxation techniques
relaxation_techniques <- 1 + 0.7 * anxiety_level + rnorm(n, 0, 1)

reverse_data <- data.frame(
  anxiety = anxiety_level,
  relaxation = relaxation_techniques
)

# Fit models in both directions
model_incorrect <- lm(anxiety ~ relaxation, data = reverse_data)
model_correct <- lm(relaxation ~ anxiety, data = reverse_data)

# Show regression results
summary(model_incorrect)

Call:
lm(formula = anxiety ~ relaxation, data = reverse_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9651 -0.7285 -0.0923  0.7247  3.7996 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept) -0.09482    0.21973  -0.432               0.667    
relaxation   1.15419    0.04105  28.114 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.182 on 198 degrees of freedom
Multiple R-squared:  0.7997,    Adjusted R-squared:  0.7987 
F-statistic: 790.4 on 1 and 198 DF,  p-value: < 0.00000000000000022
summary(model_correct)

Call:
lm(formula = relaxation ~ anxiety, data = reverse_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.15178 -0.51571 -0.00222  0.55513  2.04334 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept)  1.05726    0.15286   6.917      0.0000000000624 ***
anxiety      0.69284    0.02464  28.114 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9161 on 198 degrees of freedom
Multiple R-squared:  0.7997,    Adjusted R-squared:  0.7987 
F-statistic: 790.4 on 1 and 198 DF,  p-value: < 0.00000000000000022
# Visualize
ggplot(reverse_data, aes(x = relaxation, y = anxiety)) +
  geom_point(alpha = 0.6, color = "darkblue") +
  geom_smooth(method = "lm", color = "red") +
  labs(
    title = "Anxiety and Relaxation Techniques",
    subtitle = "Example of reverse causality",
    x = "Use of Relaxation Techniques (frequency/week)",
    y = "Anxiety Level (1-10 scale)"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

Both regression models show statistically significant relationships, but they imply different causal mechanisms. The incorrect model suggests that relaxation techniques increase anxiety, while the correct model reflects the true data generating process: anxiety drives the use of relaxation techniques.
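
A useful reminder is that the data alone cannot settle the direction: the correlation is symmetric, and the two fitted slopes are tied together through it. A quick check, reusing the two models fitted above:

# The correlation is direction-free, and the two OLS slopes multiply to R-squared
cor(reverse_data$anxiety, reverse_data$relaxation)

b_anx_on_relax <- coef(model_incorrect)["relaxation"]  # slope of anxiety ~ relaxation
b_relax_on_anx <- coef(model_correct)["anxiety"]       # slope of relaxation ~ anxiety
b_anx_on_relax * b_relax_on_anx                        # equals R-squared (about 0.80)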

Collider Bias (Selection Bias)

Collider bias occurs when conditioning on a variable that is affected by both the independent and dependent variables of interest, creating an artificial relationship between variables that are actually independent.

# Create sample data
n <- 1000

# Generate two independent variables (no relationship between them)
intelligence <- rnorm(n, 100, 15)  # IQ score
family_wealth <- rnorm(n, 50, 15)  # Wealth score (independent from intelligence)
  
# True data-generating process: admission depends on both intelligence and wealth
admission_score <- 0.4 * intelligence + 0.4 * family_wealth + rnorm(n, 0, 10)
admitted <- admission_score > median(admission_score)  # Binary admission variable

# Create full dataset
full_data <- data.frame(
  intelligence = intelligence,
  wealth = family_wealth,
  admission_score = admission_score,
  admitted = admitted
)

# Regression in full population (true model)
full_model <- lm(intelligence ~ wealth, data = full_data)
summary(full_model)

Call:
lm(formula = intelligence ~ wealth, data = full_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-49.608 -10.115   0.119  10.832  55.581 

Coefficients:
             Estimate Std. Error t value            Pr(>|t|)    
(Intercept) 101.42330    1.73139   58.58 <0.0000000000000002 ***
wealth       -0.02701    0.03334   -0.81               0.418    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.41 on 998 degrees of freedom
Multiple R-squared:  0.0006569, Adjusted R-squared:  -0.0003444 
F-statistic: 0.656 on 1 and 998 DF,  p-value: 0.4182
# Get just the admitted students
admitted_only <- full_data[full_data$admitted, ]

# Regression in admitted students (conditioning on the collider)
admitted_model <- lm(intelligence ~ wealth, data = admitted_only)
summary(admitted_model)

Call:
lm(formula = intelligence ~ wealth, data = admitted_only)

Residuals:
    Min      1Q  Median      3Q     Max 
-38.511  -9.064   0.721   8.965  48.267 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept) 115.4750     2.6165  44.133 < 0.0000000000000002 ***
wealth       -0.1704     0.0462  -3.689              0.00025 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.91 on 498 degrees of freedom
Multiple R-squared:  0.0266,    Adjusted R-squared:  0.02464 
F-statistic: 13.61 on 1 and 498 DF,  p-value: 0.0002501
# Additional analysis - regression with the collider as a control variable
# This demonstrates how controlling for a collider introduces bias
collider_control_model <- lm(intelligence ~ wealth + admitted, data = full_data)
summary(collider_control_model)

Call:
lm(formula = intelligence ~ wealth + admitted, data = full_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-44.729  -8.871   0.700   8.974  48.044 

Coefficients:
              Estimate Std. Error t value             Pr(>|t|)    
(Intercept)  102.90069    1.56858  65.601 < 0.0000000000000002 ***
wealth        -0.19813    0.03224  -6.145        0.00000000116 ***
admittedTRUE  14.09944    0.94256  14.959 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.93 on 997 degrees of freedom
Multiple R-squared:  0.1838,    Adjusted R-squared:  0.1822 
F-statistic: 112.3 on 2 and 997 DF,  p-value: < 0.00000000000000022
# Plot for full population
p1 <- ggplot(full_data, aes(x = wealth, y = intelligence)) +
  geom_point(alpha = 0.3, color = "darkblue") +
  geom_smooth(method = "lm", color = "red") +
  labs(
    title = "Full Population",
    subtitle = paste("Correlation:", round(cor(full_data$intelligence, full_data$wealth), 3)),
    x = "Family Wealth Score",
    y = "Intelligence Score"
  ) +
  theme_minimal()

# Plot for admitted students
p2 <- ggplot(admitted_only, aes(x = wealth, y = intelligence)) +
  geom_point(alpha = 0.3, color = "darkgreen") +
  geom_smooth(method = "lm", color = "red") +
  labs(
    title = "Admitted Students Only",
    subtitle = paste("Correlation:", round(cor(admitted_only$intelligence, admitted_only$wealth), 3)),
    x = "Family Wealth Score",
    y = "Intelligence Score"
  ) +
  theme_minimal()

# Display plots side by side
library(gridExtra)
grid.arrange(p1, p2, ncol = 2)
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

This example demonstrates collider bias in three ways:

  1. In the full population, intelligence and wealth are unrelated (coefficient near zero at -0.03, p-value = 0.42)
  2. Among admitted students (conditioning on the collider), a significant negative relationship appears (coefficient = -0.17, p-value < 0.001)
  3. When admission status is added as a control in the regression, a spurious relationship is likewise introduced (coefficient = -0.20, p-value < 0.001)

The collider bias creates relationships between variables that are truly independent. This can be represented in a directed acyclic graph (DAG):

\text{Intelligence} \rightarrow \text{Admission} \leftarrow \text{Wealth}

When we condition on admission (the collider), we create a spurious association between intelligence and wealth.

Improper Analysis

Inappropriate statistical methods can produce spurious correlations. Common issues include using linear models for non-linear relationships, ignoring data clustering, or mishandling time series data.

# Generate data with a true non-linear relationship
n <- 100
x <- seq(-3, 3, length.out = n)
y <- x^2 + rnorm(n, 0, 1)  # Quadratic relationship

improper_data <- data.frame(x = x, y = y)

# Fit incorrect linear model
wrong_model <- lm(y ~ x, data = improper_data)

# Fit correct quadratic model
correct_model <- lm(y ~ x + I(x^2), data = improper_data)

# Show results
summary(wrong_model)

Call:
lm(formula = y ~ x, data = improper_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2176 -2.1477 -0.6468  2.4365  7.3457 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  3.14689    0.28951  10.870 <0.0000000000000002 ***
x            0.08123    0.16548   0.491               0.625    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.895 on 98 degrees of freedom
Multiple R-squared:  0.002453,  Adjusted R-squared:  -0.007726 
F-statistic: 0.2409 on 1 and 98 DF,  p-value: 0.6246
summary(correct_model)

Call:
lm(formula = y ~ x + I(x^2), data = improper_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.81022 -0.65587  0.01935  0.61168  2.68894 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  0.12407    0.14498   0.856               0.394    
x            0.08123    0.05524   1.470               0.145    
I(x^2)       0.98766    0.03531  27.972 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9664 on 97 degrees of freedom
Multiple R-squared:   0.89, Adjusted R-squared:  0.8877 
F-statistic: 392.3 on 2 and 97 DF,  p-value: < 0.00000000000000022
# Visualize
ggplot(improper_data, aes(x = x, y = y)) +
  geom_point(alpha = 0.6, color = "darkblue") +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  geom_smooth(method = "lm", formula = y ~ x + I(x^2), color = "green", se = FALSE) +
  labs(
    title = "Improper Analysis Example",
    subtitle = "Linear model (red) vs. Quadratic model (green)",
    x = "Variable X",
    y = "Variable Y"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

The linear model incorrectly suggests no relationship between x and y (slope near zero, p-value = 0.62), while the quadratic model reveals the true relationship (R^2 = 0.89). This demonstrates how model misspecification can create spurious non-correlations, masking real relationships that take a different functional form.
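
A simple diagnostic catches this kind of misspecification: plot the residuals of the linear fit against x; a systematic U-shape signals the omitted curvature.

# Residuals of the misspecified linear model reveal the quadratic pattern
plot(improper_data$x, residuals(wrong_model),
     xlab = "x", ylab = "Residuals (linear fit)",
     main = "Residuals of the Linear Model vs. x")
abline(h = 0, col = "red", lty = 2)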

Endogeneity and Its Sources

Endogeneity occurs when an explanatory variable is correlated with the error term in a regression model. This violates the exogeneity assumption of OLS regression and leads to biased estimates. There are several sources of endogeneity:

Omitted Variable Bias

As shown in the education-income example, when important variables are omitted from the model, their effects are absorbed into the error term, which becomes correlated with included variables.

Measurement Error

When variables are measured with error, the observed values differ from true values, creating correlation between the error term and the predictors.
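
A short simulation sketch (variable names are illustrative, not from the chapter's datasets) shows the classic consequence, attenuation bias: the estimated slope is pulled toward zero.

# Sketch: classical measurement error attenuates the OLS slope toward zero
set.seed(42)
n <- 1000
true_study_time <- rnorm(n, 20, 5)
exam_score <- 40 + 2 * true_study_time + rnorm(n, 0, 5)   # true slope = 2

# Observed (self-reported) study time = true value + reporting error
reported_study_time <- true_study_time + rnorm(n, 0, 5)

coef(lm(exam_score ~ true_study_time))["true_study_time"]         # close to 2
coef(lm(exam_score ~ reported_study_time))["reported_study_time"] # roughly 1: attenuated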

Simultaneity (Bidirectional Causality)

When the dependent variable also affects the independent variable, creating a feedback loop. Let’s demonstrate this:

# Create sample data with mutual influence
n <- 100

# Initialize variables
economic_growth <- rnorm(n, 2, 1)
employment_rate <- rnorm(n, 60, 5)

# Create mutual influence through iterations
for(i in 1:3) {
  economic_growth <- 2 + 0.05 * employment_rate + rnorm(n, 0, 0.5)
  employment_rate <- 50 + 5 * economic_growth + rnorm(n, 0, 2)
}

simultaneity_data <- data.frame(
  growth = economic_growth,
  employment = employment_rate
)

# Model estimating effect of growth on employment
model_growth_on_emp <- lm(employment ~ growth, data = simultaneity_data)

# Model estimating effect of employment on growth
model_emp_on_growth <- lm(growth ~ employment, data = simultaneity_data)

# Show results
summary(model_growth_on_emp)

Call:
lm(formula = employment ~ growth, data = simultaneity_data)

Residuals:
   Min     1Q Median     3Q    Max 
-3.603 -1.500 -0.099  1.387  5.673 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  49.9665     2.0717   24.12 <0.0000000000000002 ***
growth        5.0151     0.3528   14.22 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.045 on 98 degrees of freedom
Multiple R-squared:  0.6734,    Adjusted R-squared:  0.6701 
F-statistic: 202.1 on 1 and 98 DF,  p-value: < 0.00000000000000022
summary(model_emp_on_growth)

Call:
lm(formula = growth ~ employment, data = simultaneity_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.11417 -0.20626 -0.02185  0.22646  0.72941 

Coefficients:
             Estimate Std. Error t value             Pr(>|t|)    
(Intercept) -4.801257   0.749557  -6.405        0.00000000523 ***
employment   0.134283   0.009446  14.216 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3346 on 98 degrees of freedom
Multiple R-squared:  0.6734,    Adjusted R-squared:  0.6701 
F-statistic: 202.1 on 1 and 98 DF,  p-value: < 0.00000000000000022
# Visualize
ggplot(simultaneity_data, aes(x = growth, y = employment)) +
  geom_point(alpha = 0.6, color = "darkblue") +
  geom_smooth(method = "lm", color = "red") +
  labs(
    title = "Simultaneity Between Economic Growth and Employment",
    x = "Economic Growth (%)",
    y = "Employment Rate (%)"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

The true data generating process is a system of simultaneous equations:

\text{Growth}_i = \alpha_0 + \alpha_1 \text{Employment}_i + u_i

\text{Employment}_i = \beta_0 + \beta_1 \text{Growth}_i + v_i

Standard OLS regression cannot consistently estimate either equation because each explanatory variable is correlated with the error term in its respective equation.

Selection Bias

When the sample is not randomly selected from the population, the selection process can introduce correlation between the error term and the predictors. The collider bias example demonstrates a form of selection bias.

The consequences of endogeneity include:

  • Biased coefficient estimates
  • Incorrect standard errors
  • Invalid hypothesis tests
  • Misleading causal interpretations

Addressing endogeneity requires specialized methods such as instrumental variables, system estimation, panel data methods, or experimental designs.
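
As a brief illustration of the instrumental-variables idea, the following sketch simulates an endogenous regressor together with a valid instrument (all names are made up) and runs two-stage least squares by hand; in practice a dedicated routine such as ivreg() from the AER package would be used so that standard errors account for the two-stage procedure.

# Two-stage least squares (2SLS) sketch on simulated data; names are illustrative
set.seed(123)
n <- 500
z <- rnorm(n)                              # instrument: shifts x, no direct effect on y
confounder <- rnorm(n)                     # unobserved variable causing endogeneity
x_endog <- 0.8 * z + confounder + rnorm(n)
y_out   <- 2 * x_endog + 3 * confounder + rnorm(n)   # true causal effect of x is 2

coef(lm(y_out ~ x_endog))["x_endog"]       # naive OLS: biased upward

# Stage 1: project x on the instrument; Stage 2: regress y on the projection
x_hat <- fitted(lm(x_endog ~ z))
coef(lm(y_out ~ x_hat))["x_hat"]           # approximately 2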

Understanding Endogeneity in Regression

Endogeneity is a critical concept in statistical analysis that occurs when an explanatory variable in a regression model is correlated with the error term. This creates challenges for accurately understanding cause-and-effect relationships in research. Let’s examine the three main types of endogeneity and how they affect research outcomes.

Omitted Variable Bias (OVB)

Omitted Variable Bias occurs when an important variable that affects both the dependent and independent variables is left out of the analysis. This omission leads to incorrect conclusions about the relationship between the variables we’re studying.

Consider a study examining the relationship between education and income:

Example: Education and Income The observed relationship shows that more education correlates with higher income. However, an individual’s inherent abilities affect both their educational attainment and their earning potential. Without accounting for ability, we may overestimate education’s direct effect on income.

The statistical representation shows why this matters:

y_i = \beta_0 + \beta_1x_i + \beta_2z_i + \epsilon_i (Complete model)

y_i = \beta_0 + \beta_1x_i + u_i (Incomplete model)

When we omit an important variable, our estimates of the remaining relationships become biased and unreliable.

Simultaneity

Simultaneity occurs when two variables simultaneously influence each other, making it difficult to determine the direction of causation. This creates a feedback loop that complicates statistical analysis.

Common Examples of Simultaneity:

Academic Performance and Study Habits represent a clear case of simultaneity. Academic performance influences how much time students dedicate to studying, while study time affects academic performance. This two-way relationship makes it challenging to measure the isolated effect of either variable.

Market Dynamics provide another example. Prices influence demand, while demand influences prices. This concurrent relationship requires special analytical approaches to understand the true relationships.

Measurement Error

Measurement error occurs when we cannot accurately measure our variables of interest. This imprecision can significantly impact our analysis and conclusions.

Common Sources of Measurement Error:

Self-Reported Data presents a significant challenge. When participants report their own behaviors or characteristics, such as study time, the reported values often differ from actual values. This discrepancy affects our ability to measure true relationships.

Technical Limitations also contribute to measurement error through imprecise measuring tools, inconsistent measurement conditions, and recording or data entry errors.

Addressing Endogeneity in Research

Identification Strategies

# Example of controlling for omitted variables
model_simple <- lm(income ~ education, data = df)
model_full <- lm(income ~ education + ability + experience + region, data = df)

# Compare coefficients
summary(model_simple)
summary(model_full)
  1. Include Additional Variables: Collect data on potentially important omitted variables and include relevant control variables in your analysis. For example, including measures of ability when studying education’s effect on income.

  2. Use Panel Data: Collect data across multiple time periods to control for unobserved fixed characteristics and analyze changes over time (see the sketch after this list).

  3. Instrumental Variables: Find variables that affect your independent variable but influence the dependent variable only through it, allowing you to isolate the relationship of interest.
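
A minimal sketch of the panel-data idea from point 2, using simulated data (all variable names are illustrative): person dummies absorb unobserved, time-constant characteristics, so the education effect is identified only from within-person variation.

# Sketch: person fixed effects absorb unobserved, time-constant ability
set.seed(7)
n_people <- 100
n_years  <- 4
person   <- factor(rep(1:n_people, each = n_years))
ability  <- rep(rnorm(n_people), each = n_years)             # unobserved, constant per person
education <- 12 + 1.5 * ability + rnorm(n_people * n_years)  # varies across people and years
income    <- 20000 + 2000 * education + 8000 * ability +
             rnorm(n_people * n_years, 0, 2000)               # true education effect = 2000

panel_df <- data.frame(person, education, income)

coef(lm(income ~ education, data = panel_df))["education"]           # pooled OLS: biased upward
coef(lm(income ~ education + person, data = panel_df))["education"]  # fixed effects: near 2000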

Improving Measurement

  1. Multiple Measurements: Take several measurements of key variables, use averaging to reduce random error, and compare different measurement methods (a sketch follows this list).

  2. Better Data Collection: Use validated measurement instruments, implement quality control procedures, and document potential sources of error.
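
A brief simulation (illustrative names only) of the averaging idea from point 1: averaging repeated noisy measurements shrinks the error variance, so the attenuation of the estimated slope is milder.

# Sketch: averaging repeated noisy measurements reduces attenuation bias
set.seed(42)
n <- 1000
true_x <- rnorm(n, 20, 5)
y <- 40 + 2 * true_x + rnorm(n, 0, 5)                        # true slope = 2

single_measure <- true_x + rnorm(n, 0, 5)                    # one noisy measurement
avg_of_four    <- true_x + rowMeans(matrix(rnorm(4 * n, 0, 5), ncol = 4))

coef(lm(y ~ single_measure))["single_measure"]               # attenuated, about 1.0
coef(lm(y ~ avg_of_four))["avg_of_four"]                     # about 1.6, closer to 2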

Best Practices for Researchers

Research Design fundamentally shapes your ability to address endogeneity. Plan for potential endogeneity issues before collecting data, include measures for potentially important control variables, and consider using multiple measurement approaches.

Analysis should include testing for endogeneity when possible, using appropriate statistical methods for your specific situation, and documenting assumptions and limitations.

Reporting must clearly describe potential endogeneity concerns, explain how you addressed these issues, and discuss implications for your conclusions.

Formal Derivation of OLS Estimators: A Complete Mathematical Treatment

Objective and Setup

We seek to find the values of \hat{\beta}_0 and \hat{\beta}_1 that minimize the sum of squared residuals:

SSE = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - (\hat{\beta}_0 + \hat{\beta}_1x_i))^2

This is an unconstrained optimization problem where we treat SSE as a function of two variables: SSE(\hat{\beta}_0, \hat{\beta}_1).

Mathematical Prerequisites

Chain Rule for Composite Functions: For f(g(x)), the derivative is: \frac{d}{dx}[f(g(x))] = f'(g(x)) \cdot g'(x)

In our context:

  • Outer function: f(u) = u^2 with derivative f'(u) = 2u
  • Inner function: u = y_i - (\hat{\beta}_0 + \hat{\beta}_1x_i)

First-Order Conditions: At a minimum, both partial derivatives equal zero: \frac{\partial SSE}{\partial \hat{\beta}_0} = 0 \quad \text{and} \quad \frac{\partial SSE}{\partial \hat{\beta}_1} = 0

Derivation of \hat{\beta}_0

Step 1: Take the partial derivative with respect to \hat{\beta}_0:

\frac{\partial SSE}{\partial \hat{\beta}_0} = \frac{\partial}{\partial \hat{\beta}_0} \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1x_i)^2

Step 2: Apply the chain rule to each term: = \sum_{i=1}^n 2(y_i - \hat{\beta}_0 - \hat{\beta}_1x_i) \cdot \frac{\partial}{\partial \hat{\beta}_0}(y_i - \hat{\beta}_0 - \hat{\beta}_1x_i)

= \sum_{i=1}^n 2(y_i - \hat{\beta}_0 - \hat{\beta}_1x_i) \cdot (-1)

= -2\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1x_i)

Step 3: Set equal to zero and solve: -2\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1x_i) = 0

Dividing by -2 and expanding: \sum_{i=1}^n y_i - n\hat{\beta}_0 - \hat{\beta}_1\sum_{i=1}^n x_i = 0

Step 4: Isolate \hat{\beta}_0: n\hat{\beta}_0 = \sum_{i=1}^n y_i - \hat{\beta}_1\sum_{i=1}^n x_i

\boxed{\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}}

Interpretation: The intercept adjusts to ensure the regression line passes through the point of means (\bar{x}, \bar{y}).

Derivation of \hat{\beta}_1

Step 1: Take the partial derivative with respect to \hat{\beta}_1:

\frac{\partial SSE}{\partial \hat{\beta}_1} = \sum_{i=1}^n 2(y_i - \hat{\beta}_0 - \hat{\beta}_1x_i) \cdot (-x_i)

= -2\sum_{i=1}^n x_i(y_i - \hat{\beta}_0 - \hat{\beta}_1x_i)

Step 2: Substitute the expression for \hat{\beta}_0: = -2\sum_{i=1}^n x_i(y_i - (\bar{y} - \hat{\beta}_1\bar{x}) - \hat{\beta}_1x_i)

= -2\sum_{i=1}^n x_i((y_i - \bar{y}) - \hat{\beta}_1(x_i - \bar{x}))

Step 3: Set equal to zero and expand: \sum_{i=1}^n x_i(y_i - \bar{y}) - \hat{\beta}_1\sum_{i=1}^n x_i(x_i - \bar{x}) = 0

Step 4: Use the algebraic identity \sum_{i=1}^n x_i(y_i - \bar{y}) = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}):

This identity holds because: \sum_{i=1}^n x_i(y_i - \bar{y}) = \sum_{i=1}^n (x_i - \bar{x} + \bar{x})(y_i - \bar{y}) = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) + \bar{x}\sum_{i=1}^n(y_i - \bar{y}) = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) + 0

Similarly, \sum_{i=1}^n x_i(x_i - \bar{x}) = \sum_{i=1}^n (x_i - \bar{x})^2.

Step 5: Solve for \hat{\beta}_1: \hat{\beta}_1\sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})

\boxed{\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{\text{Cov}(X,Y)}{\text{Var}(X)}}
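
As a quick numerical sanity check on a small made-up dataset, the closed-form expressions reproduce exactly what lm() computes:

# Verify the closed-form estimators against lm() (illustrative data)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

b1 <- cov(x, y) / var(x)       # slope: Cov(X, Y) / Var(X)
b0 <- mean(y) - b1 * mean(x)   # intercept: line passes through (x-bar, y-bar)
c(b0 = b0, b1 = b1)
coef(lm(y ~ x))                # lm() returns the same values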

Verification of Minimum (Second-Order Conditions)

To confirm we have found a minimum (not a maximum or saddle point), we examine the Hessian matrix of second partial derivatives:

Second partial derivatives:

\frac{\partial^2 SSE}{\partial \hat{\beta}_0^2} = 2n > 0

\frac{\partial^2 SSE}{\partial \hat{\beta}_1^2} = 2\sum_{i=1}^n x_i^2 > 0

\frac{\partial^2 SSE}{\partial \hat{\beta}_0 \partial \hat{\beta}_1} = 2\sum_{i=1}^n x_i

Hessian matrix: \mathbf{H} = 2\begin{bmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{bmatrix}

Positive definiteness check:

  • First leading principal minor: 2n > 0
  • Second leading principal minor (determinant): \det(\mathbf{H}) = 4\left(n\sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2\right) = 4n\sum_{i=1}^n(x_i - \bar{x})^2 > 0

Since the Hessian is positive definite, we have confirmed a minimum.
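
The optimization view can also be checked numerically: minimizing SSE directly with a general-purpose optimizer lands on essentially the same coefficients as the closed-form solution (a sketch on made-up data):

# Minimize SSE numerically and compare with the analytical OLS solution
sse <- function(b, x, y) sum((y - b[1] - b[2] * x)^2)

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

optim(par = c(0, 0), fn = sse, x = x, y = y)$par  # numerical minimizer (b0, b1)
coef(lm(y ~ x))                                   # closed-form OLS coefficients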

Geometric Interpretation

# Visualizing the optimization surface
library(tidyverse)
library(plotly)

Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':

    last_plot
The following object is masked from 'package:stats':

    filter
The following object is masked from 'package:graphics':

    layout
# Generate sample data
set.seed(42)
x <- runif(20, 1, 8)
y <- 2 + 3*x + rnorm(20, 0, 1)

# Create grid of beta values
beta0_seq <- seq(0, 4, length.out = 50)
beta1_seq <- seq(2, 4, length.out = 50)
grid <- expand.grid(beta0 = beta0_seq, beta1 = beta1_seq)

# Calculate SSE for each combination
grid$SSE <- apply(grid, 1, function(params) {
  sum((y - (params[1] + params[2]*x))^2)
})

# Create contour plot
ggplot(grid, aes(x = beta0, y = beta1, z = SSE)) +
  geom_contour_filled(aes(fill = after_stat(level))) +
  geom_point(x = coef(lm(y ~ x))[1], 
             y = coef(lm(y ~ x))[2], 
             color = "red", size = 3) +
  labs(title = "SSE Surface in Parameter Space",
       subtitle = "Red point shows the OLS minimum",
       x = expression(hat(beta)[0]),
       y = expression(hat(beta)[1])) +
  theme_minimal()


9.36 Appendix A.1: Understanding Correlation Measures: A Self-Study Tutorial (Using Anxiety Level and Cognitive Performance Data)

Dataset Overview

data <- data.frame(
  anxiety_level = c(8, 5, 11, 14, 7, 10),
  cognitive_performance = c(85, 90, 62, 55, 80, 65)
)

1. Covariance Calculation

Step 1: Calculate Means

Variable Calculation Result
Mean Anxiety (\bar{x}) (8 + 5 + 11 + 14 + 7 + 10) ÷ 6 9.17
Mean Cognitive (\bar{y}) (85 + 90 + 62 + 55 + 80 + 65) ÷ 6 72.83

Step 2: Calculate Deviations and Products

i x_i y_i (x_i - \bar{x}) (y_i - \bar{y}) (x_i - \bar{x})(y_i - \bar{y})
1 8 85 -1.17 12.17 -14.24
2 5 90 -4.17 17.17 -71.60
3 11 62 1.83 -10.83 -19.82
4 14 55 4.83 -17.83 -86.12
5 7 80 -2.17 7.17 -15.56
6 10 65 0.83 -7.83 -6.50
Sum 55 437 0.00 0.00 -213.84

Step 3: Calculate Covariance

\text{Cov}(X,Y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n-1} = \frac{-213.84}{5} = -42.77

2. Pearson Correlation Coefficient

Step 1: Calculate Squared Deviations

i (x_i - \bar{x}) (y_i - \bar{y}) (x_i - \bar{x})^2 (y_i - \bar{y})^2
1 -1.17 12.17 1.37 148.11
2 -4.17 17.17 17.39 294.81
3 1.83 -10.83 3.35 117.29
4 4.83 -17.83 23.33 317.91
5 -2.17 7.17 4.71 51.41
6 0.83 -7.83 0.69 61.31
Sum 0.00 0.00 50.84 990.84

Step 2: Calculate Standard Deviations

Measure Formula Calculation Result
s_x \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}} \sqrt{\frac{50.84}{5}} 3.19
s_y \sqrt{\frac{\sum (y_i - \bar{y})^2}{n-1}} \sqrt{\frac{990.84}{5}} 14.08

Step 3: Calculate Pearson Correlation

r = \frac{\text{Cov}(X,Y)}{s_x s_y} = \frac{-42.77}{3.19 \times 14.08} = -0.95

3. Spearman Rank Correlation

Step 1: Assign Ranks

i x_i y_i Rank x_i Rank y_i d_i d_i^2
1 8 85 3 5 -2 4
2 5 90 1 6 -5 25
3 11 62 5 2 3 9
4 14 55 6 1 5 25
5 7 80 2 4 -2 4
6 10 65 4 3 1 1
Sum 68

Step 2: Calculate Spearman Correlation

\rho = 1 - \frac{6\sum d_i^2}{n(n^2-1)} = 1 - \frac{6(68)}{6(36-1)} = 1 - \frac{408}{210} = -0.94

(Both variables must be ranked in the same direction, here from smallest to largest; doing so yields a negative coefficient, consistent with the sign of the Pearson correlation.)

Verification using R

# Calculate correlations using R
cor(data$anxiety_level, data$cognitive_performance, method = "pearson")
[1] -0.9527979
cor(data$anxiety_level, data$cognitive_performance, method = "spearman")
[1] -0.9428571

Interpretation

  1. The strong negative Pearson correlation (r = -0.95) indicates a very strong negative linear relationship between anxiety level and cognitive performance.

  2. The strong negative Spearman correlation (ρ = -0.94) shows that the relationship is also strongly monotonic: higher anxiety ranks consistently pair with lower performance ranks.

  3. The closeness of the Pearson and Spearman coefficients suggests that the relationship is nearly linear as well as monotonic.

Exercise

  1. Verify each calculation step in the tables above.
  2. Try calculating these measures with a modified dataset:
    • Add one outlier and observe how it affects both correlation coefficients
    • Change one pair of values and recalculate

9.37 Appendix A.2: Calculating Covariance, Pearson, Spearman Correlation, and OLS - worked example

A political science student is investigating the relationship between district magnitude (DM) and Gallagher’s disproportionality index (GH) in parliamentary elections across 10 randomly selected democracies.

Data on electoral district magnitudes (\text{DM}) and Gallagher index:

\text{DM} (X) Gallagher (Y)
2 18.2
3 16.7
4 15.8
5 15.3
6 15.0
7 14.8
8 14.7
9 14.6
10 14.55
11 14.52

Step 1: Calculate Basic Statistics

Calculation of means:

For \text{DM} (X): \bar{X} = \frac{\sum_{i=1}^n X_i}{n}

Detailed calculation:

2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + 11 = 65 \bar{x} = \frac{65}{10} = 6.5

For Gallagher index (Y): \bar{Y} = \frac{\sum_{i=1}^n Y_i}{n}

Detailed calculation:

18.2 + 16.7 + 15.8 + 15.3 + 15.0 + 14.8 + 14.7 + 14.6 + 14.55 + 14.52 = 154.17 \bar{y} = \frac{154.17}{10} = 15.417

Step 2: Detailed Covariance Calculations

Complete working table showing all calculations:

i X_i Y_i (X_i - \bar{X}) (Y_i - \bar{Y}) (X_i - \bar{X})(Y_i - \bar{Y}) (X_i - \bar{X})^2 (Y_i - \bar{Y})^2
1 2 18.2 -4.5 2.783 -12.5235 20.25 7.7451
2 3 16.7 -3.5 1.283 -4.4905 12.25 1.6461
3 4 15.8 -2.5 0.383 -0.9575 6.25 0.1467
4 5 15.3 -1.5 -0.117 0.1755 2.25 0.0137
5 6 15.0 -0.5 -0.417 0.2085 0.25 0.1739
6 7 14.8 0.5 -0.617 -0.3085 0.25 0.3807
7 8 14.7 1.5 -0.717 -1.0755 2.25 0.5141
8 9 14.6 2.5 -0.817 -2.0425 6.25 0.6675
9 10 14.55 3.5 -0.867 -3.0345 12.25 0.7517
10 11 14.52 4.5 -0.897 -4.0365 20.25 0.8047
Sum 65 154.17 0 0 -28.085 82.5 12.8442

Covariance calculation: \text{Cov}(X,Y) = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}

\text{Cov}(X,Y) = \frac{-28.085}{9} = -3.120556

Step 3: Standard Deviation Calculations

For \text{DM} (X): \sigma_X = \sqrt{\frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n-1}}

\sigma_x = \sqrt{\frac{82.5}{9}} = \sqrt{9.1667} = 3.02765

For Gallagher (Y): \sigma_Y = \sqrt{\frac{\sum_{i=1}^n (Y_i - \bar{Y})^2}{n-1}}

\sigma_y = \sqrt{\frac{12.8442}{9}} = \sqrt{1.4271} = 1.194612

Step 4: Pearson Correlation Calculation

r = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}

r = \frac{-3.120556}{3.02765 \times 1.194612} = \frac{-3.120556}{3.61687} \approx -0.8628

Step 5: Spearman Rank Correlation Calculation

Complete ranking table with all calculations:

i X_i Y_i Rank X_i Rank Y_i d_i d_i^2
1 2 18.2 1 10 -9 81
2 3 16.7 2 9 -7 49
3 4 15.8 3 8 -5 25
4 5 15.3 4 7 -3 9
5 6 15.0 5 6 -1 1
6 7 14.8 6 5 1 1
7 8 14.7 7 4 3 9
8 9 14.6 8 3 5 25
9 10 14.55 9 2 7 49
10 11 14.52 10 1 9 81
Sum 330

Spearman correlation calculation: \rho = 1 - \frac{6\sum d_i^2}{n(n^2-1)}

\rho = 1 - \frac{6 \times 330}{10(100 - 1)} = 1 - \frac{1980}{990} = 1 - 2 = -1

Step 6: R Verification

# Create vectors
DM <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
GH <- c(18.2, 16.7, 15.8, 15.3, 15.0, 14.8, 14.7, 14.6, 14.55, 14.52)

# Calculate covariance
cov(DM, GH)
[1] -3.120556
# Calculate correlations
cor(DM, GH, method = "pearson")
[1] -0.8627742
cor(DM, GH, method = "spearman")
[1] -1

Step 7: Basic Visualization

library(ggplot2)

# Create data frame
data <- data.frame(DM = DM, GH = GH)

# Create scatter plot
ggplot(data, aes(x = DM, y = GH)) +
  geom_point(size = 3, color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "District Magnitude vs Gallagher Index",
    x = "District Magnitude (DM)",
    y = "Gallagher Index (GH)"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

OLS Estimation and Goodness-of-Fit Measures

Step 1: Calculate OLS Estimates

Using previously calculated values:

  • \sum(X_i - \bar{X})(Y_i - \bar{Y}) = -28.085
  • \sum(X_i - \bar{X})^2 = 82.5
  • \bar{X} = 6.5
  • \bar{Y} = 15.417

Calculate slope (\hat{\beta_1}): \hat{\beta_1} = \frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{\sum(X_i - \bar{X})^2}

\hat{\beta_1} = \frac{-28.085}{82.5} = -0.3404

Calculate intercept (\hat{\beta_0}): \hat{\beta_0} = \bar{Y} - \hat{\beta_1}\bar{X}

\hat{\beta_0} = 15.417 - (-0.3404 \times 6.5) = 15.417 + 2.2126 = 17.6296

Therefore, the OLS regression equation is: \hat{Y} = 17.6296 - 0.3404X

Step 2: Calculate Fitted Values and Residuals

Complete table showing all calculations:

i X_i Y_i \hat{Y}_i e_i = Y_i - \hat{Y}_i e_i^2 (Y_i - \bar{Y})^2 (\hat{Y}_i - \bar{Y})^2
1 2 18.2 16.9488 1.2512 1.5655 7.7451 2.3404
2 3 16.7 16.6084 0.0916 0.0084 1.6461 1.4241
3 4 15.8 16.2680 -0.4680 0.2190 0.1467 0.7225
4 5 15.3 15.9276 -0.6276 0.3939 0.0137 0.2601
5 6 15.0 15.5872 -0.5872 0.3448 0.1739 0.0289
6 7 14.8 15.2468 -0.4468 0.1996 0.3807 0.0290
7 8 14.7 14.9064 -0.2064 0.0426 0.5141 0.2610
8 9 14.6 14.5660 0.0340 0.0012 0.6675 0.7241
9 10 14.55 14.2256 0.3244 0.1052 0.7517 1.4184
10 11 14.52 13.8852 0.6348 0.4030 0.8047 2.3439
Sum 65 154.17 154.17 0 3.2832 12.8442 9.5524

Calculations for fitted values:

For X = 2:
Ŷ = 17.6296 + (-0.3404 × 2) = 16.9488

For X = 3:
Ŷ = 17.6296 + (-0.3404 × 3) = 16.6084

[... continue for all values]

Step 3: Calculate Goodness-of-Fit Measures

Sum of Squared Errors (SSE): SSE = \sum e_i^2

SSE = 3.2832

Sum of Squared Total (SST): SST = \sum(Y_i - \bar{Y})^2

SST = 12.8442

Sum of Squared Regression (SSR): SSR = \sum(\hat{Y}_i - \bar{Y})^2

SSR = 9.5524

Verify decomposition: SST = SSR + SSE

12.8442 = 9.5524 + 3.2832 (within rounding error)

R-squared calculation: R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}

R² = 9.5524 ÷ 12.8442 ≈ 0.7437

(R reports 0.7444 below because it uses the unrounded sums of squares.)

Step 4: R Verification

# Fit linear model
model <- lm(GH ~ DM, data = data)

# View summary statistics
summary(model)

Call:
lm(formula = GH ~ DM, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.62764 -0.46274 -0.08615  0.26624  1.25109 

Coefficients:
            Estimate Std. Error t value       Pr(>|t|)    
(Intercept) 17.62976    0.50121  35.174 0.000000000467 ***
DM          -0.34042    0.07053  -4.827        0.00131 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6406 on 8 degrees of freedom
Multiple R-squared:  0.7444,    Adjusted R-squared:  0.7124 
F-statistic:  23.3 on 1 and 8 DF,  p-value: 0.00131
# Calculate R-squared manually
SST <- sum((GH - mean(GH))^2)
SSE <- sum(residuals(model)^2)
SSR <- SST - SSE
R2_manual <- SSR/SST
R2_manual
[1] 0.7443793

Step 5: Residual Analysis

# Create residual plots
par(mfrow = c(2, 2))
plot(model)

Step 6: Predicted vs Actual Values Plot

# Create predicted vs actual plot
ggplot(data.frame(
  Actual = GH,
  Predicted = fitted(model)
), aes(x = Predicted, y = Actual)) +
  geom_point(color = "blue", size = 3) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  labs(
    title = "Predicted vs Actual Values",
    x = "Predicted Gallagher Index",
    y = "Actual Gallagher Index"
  ) +
  theme_minimal()

Log-Transformed Models

Step 1: Data Transformation

First, calculate natural logarithms of variables:

i X_i Y_i \ln(X_i) \ln(Y_i)
1 2 18.2 0.6931 2.9014
2 3 16.7 1.0986 2.8154
3 4 15.8 1.3863 2.7600
4 5 15.3 1.6094 2.7278
5 6 15.0 1.7918 2.7081
6 7 14.8 1.9459 2.6946
7 8 14.7 2.0794 2.6878
8 9 14.6 2.1972 2.6810
9 10 14.55 2.3026 2.6777
10 11 14.52 2.3979 2.6757

Step 2: Compare Different Model Specifications

We estimate three alternative specifications:

  1. Log-linear model: \ln(Y_i) = \beta_0 + \beta_1 X_i + \epsilon_i
  2. Linear-log model: Y_i = \beta_0 + \beta_1\ln(X_i) + \epsilon_i
  3. Log-log model: \ln(Y_i) = \beta_0 + \beta_1\ln(X_i) + \epsilon_i
# Create transformed variables
data$log_DM <- log(data$DM)
data$log_GH <- log(data$GH)

# Fit models
model_linear <- lm(GH ~ DM, data = data)
model_loglinear <- lm(log_GH ~ DM, data = data)
model_linearlog <- lm(GH ~ log_DM, data = data)
model_loglog <- lm(log_GH ~ log_DM, data = data)

# Compare R-squared values
models_comparison <- data.frame(
  Model = c("Linear", "Log-linear", "Linear-log", "Log-log"),
  R_squared = c(
    summary(model_linear)$r.squared,
    summary(model_loglinear)$r.squared,
    summary(model_linearlog)$r.squared,
    summary(model_loglog)$r.squared
  )
)

# Display comparison
models_comparison
       Model R_squared
1     Linear 0.7443793
2 Log-linear 0.7670346
3 Linear-log 0.9141560
4    Log-log 0.9288088

Step 3: Visual Comparison

# Create plots for each model
p1 <- ggplot(data, aes(x = DM, y = GH)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Linear Model") +
  theme_minimal()

p2 <- ggplot(data, aes(x = DM, y = log_GH)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Log-linear Model") +
  theme_minimal()

p3 <- ggplot(data, aes(x = log_DM, y = GH)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Linear-log Model") +
  theme_minimal()

p4 <- ggplot(data, aes(x = log_DM, y = log_GH)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Log-log Model") +
  theme_minimal()

# Arrange plots in a grid
library(gridExtra)
grid.arrange(p1, p2, p3, p4, ncol = 2)
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

Step 4: Residual Analysis for Best Model

Based on R-squared values, analyze residuals for the best-fitting model:

# Residual plots for best model
par(mfrow = c(2, 2))
plot(model_linearlog)

Step 5: Interpretation of Best Model

The linear-log model coefficients:

summary(model_linearlog)

Call:
lm(formula = GH ~ log_DM, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.40702 -0.30207 -0.04907  0.22905  0.60549 

Coefficients:
            Estimate Std. Error t value        Pr(>|t|)    
(Intercept)  19.0223     0.4079   46.64 0.0000000000494 ***
log_DM       -2.0599     0.2232   -9.23 0.0000153880425 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3712 on 8 degrees of freedom
Multiple R-squared:  0.9142,    Adjusted R-squared:  0.9034 
F-statistic: 85.19 on 1 and 8 DF,  p-value: 0.00001539

Interpretation:

  • \hat{\beta_0} represents the expected Gallagher Index when ln(DM) = 0 (i.e., when DM = 1)
  • \hat{\beta_1} represents the change in the Gallagher Index associated with a one-unit increase in ln(DM)
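
Concretely, the fitted linear-log model implies that doubling district magnitude changes the Gallagher Index by about \hat{\beta_1} \times \ln(2) \approx -2.06 \times 0.693 \approx -1.4 points. A small check using the fitted model object:

# Effect of doubling DM implied by the linear-log model
b1_loglin <- coef(model_linearlog)["log_DM"]
b1_loglin * log(2)     # predicted change in GH when DM doubles (about -1.43)

# Predicted Gallagher Index at DM = 4 and DM = 8
predict(model_linearlog, newdata = data.frame(log_DM = log(c(4, 8))))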

Step 6: Model Predictions

# Create prediction plot for best model
ggplot(data, aes(x = log_DM, y = GH)) +
  geom_point(color = "blue", size = 3) +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(
    title = "Linear-log Model: Gallagher Index vs ln(District Magnitude)",
    x = "ln(District Magnitude)",
    y = "Gallagher Index"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

Step 7: Elasticity Analysis

For the log-log model, coefficients represent elasticities directly. Calculate average elasticity for the linear-log model:

# Calculate elasticity at means
mean_DM <- mean(data$DM)
mean_GH <- mean(data$GH)
beta1 <- coef(model_linearlog)[2]
elasticity <- beta1 * (1/mean_GH)
elasticity
    log_DM 
-0.1336136 

This represents the percentage change in the Gallagher Index for a 1% change in District Magnitude.

9.38 Appendix A.3: Understanding Pearson, Spearman, and Kendall

Dataset

data <- data.frame(
  x = c(2, 4, 5, 3, 8),
  y = c(3, 5, 4, 4, 7)
)

Pearson Correlation

r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}

Step-by-Step Calculations:

i x_i y_i x_i - \bar{x} y_i - \bar{y} (x_i - \bar{x})(y_i - \bar{y}) (x_i - \bar{x})^2 (y_i - \bar{y})^2
1 2 3 -2.4 -1.6 3.84 5.76 2.56
2 4 5 -0.4 0.4 -0.16 0.16 0.16
3 5 4 0.6 -0.6 -0.36 0.36 0.36
4 3 4 -1.4 -0.6 0.84 1.96 0.36
5 8 7 3.6 2.4 8.64 12.96 5.76
Sum 22 23 0 0 12.8 21.2 9.2

\bar{x} = 4.4 \bar{y} = 4.6

r = \frac{12.8}{\sqrt{21.2 \times 9.2}} = \frac{12.8}{\sqrt{195.04}} = \frac{12.8}{13.97} = 0.92

Spearman Correlation

\rho = 1 - \frac{6\sum d_i^2}{n(n^2-1)}

Step-by-Step Calculations:

i x_i y_i Rank x_i Rank y_i d_i d_i^2
1 2 3 1 1 0 0
2 4 5 3 4 -1 1
3 5 4 4 2.5 1.5 2.25
4 3 4 2 2.5 -0.5 0.25
5 8 7 5 5 0 0
Sum 3.5

\rho = 1 - \frac{6(3.5)}{5(25-1)} = 1 - \frac{21}{120} = 0.825 \approx 0.82

(With tied ranks this shortcut formula is only approximate; R computes Spearman's coefficient as the Pearson correlation of the ranks, which also gives 0.82 here.)

Kendall’s Tau

\tau = \frac{\text{number of concordant pairs} - \text{number of discordant pairs}}{\frac{1}{2}n(n-1)}

Step-by-Step Calculations:

Pair (i,j) x_i,x_j y_i,y_j x_j-x_i y_j-y_i Result
(1,2) 2,4 3,5 +2 +2 C
(1,3) 2,5 3,4 +3 +1 C
(1,4) 2,3 3,4 +1 +1 C
(1,5) 2,8 3,7 +6 +4 C
(2,3) 4,5 5,4 +1 -1 D
(2,4) 4,3 5,4 -1 -1 C
(2,5) 4,8 5,7 +4 +2 C
(3,4) 5,3 4,4 -2 0 T (tied in y)
(3,5) 5,8 4,7 +3 +3 C
(4,5) 3,8 4,7 +5 +3 C

Number of concordant pairs = 8, number of discordant pairs = 1, and one pair is tied in y. Ignoring the tie, \tau_a = \frac{8-1}{10} = 0.70. Applying the tie correction that R uses, \tau_b = \frac{C - D}{\sqrt{(n_0 - n_1)(n_0 - n_2)}} = \frac{7}{\sqrt{10 \times 9}} \approx 0.74, where n_0 = \frac{1}{2}n(n-1) = 10, n_1 = 0 is the number of tied pairs in x, and n_2 = 1 is the number of tied pairs in y.

Verification in R

cat("Pearson:", round(cor(data$x, data$y, method="pearson"), 2), "\n")
Pearson: 0.92 
cat("Spearman:", round(cor(data$x, data$y, method="spearman"), 2), "\n")
Spearman: 0.82 
cat("Kendall:", round(cor(data$x, data$y, method="kendall"), 2), "\n")
Kendall: 0.74 

Interpretation of Results

  1. Pearson Correlation (r = 0.92)
    • Strong positive linear correlation
    • Indicates a very strong linear relationship between variables
  2. Spearman Correlation (ρ = 0.82)
    • Also strong positive correlation
    • Slightly lower than Pearson’s, suggesting some deviations from monotonicity
  3. Kendall’s Tau (τ = 0.74)
    • Lowest of the three values, but still indicates strong association
    • More robust to outliers

Comparison of Measures

  1. Differences in Values:
    • Pearson (0.92) - highest value, strong linearity
    • Spearman (0.82) - considers only ranking
    • Kendall (0.74) - most conservative measure
  2. Practical Application:
    • All measures confirm strong positive association
    • Differences between measures indicate slight deviations from perfect linearity
    • Kendall provides the most conservative estimate of relationship strength

Exercises

  1. Change y[3] from 4 to 6 and recalculate all three correlations
  2. Add an outlier (x=10, y=2) and recalculate correlations
  3. Compare which measure is most sensitive to changes in the data

Key Points to Remember

  1. Pearson Correlation:
    • Measures linear relationship
    • Most sensitive to outliers
    • Requires interval or ratio data
  2. Spearman Correlation:
    • Measures monotonic relationship
    • Less sensitive to outliers
    • Works with ordinal data
  3. Kendall’s Tau:
    • Measures ordinal association
    • Most robust to outliers
    • Best for small samples and tied ranks

9.39 Appendix B: Bias in OLS Estimation with Endogenous Regressors

In this tutorial, we will explore the bias in Ordinary Least Squares (OLS) estimation when the error term is correlated with the explanatory variable, a situation known as endogeneity. We will first derive the bias mathematically and then illustrate it using a simulated dataset in R.

9.40 Theoretical Derivation

Consider a data generating process (DGP) where the true relationship between x and y is:

y = 2x + e

However, there is an endogeneity problem because the error term e is correlated with x in the following way:

e = 1x + u

where u is an independent error term.

If we estimate the simple linear model y = \hat{\beta_0} + \hat{\beta_1}x + \varepsilon using OLS, the OLS estimator of \hat{\beta_1} will be biased due to the endogeneity issue.

To understand the bias, let’s derive the expected value of the OLS estimator \hat{\beta}_1:

\begin{align*} E[\hat{\beta}_1] &= E[(X'X)^{-1}X'y] \\ &= E[(X'X)^{-1}X'(2x + 1x + u)] \\ &= E[(X'X)^{-1}X'(3x + u)] \\ &= 3 + E[(X'X)^{-1}X'u] \end{align*}

Because u is independent of x by construction, E[(X'X)^{-1}X'u] = 0, so E[\hat{\beta}_1] = 3. In other words, OLS converges to the reduced-form coefficient on x, which bundles the structural effect (2) with the effect transmitted through the endogenous part of the error (1).

Relative to the structural coefficient \beta_{1,\text{true}} = 2, the OLS estimator is therefore biased:

\begin{align*} \text{Bias}(\hat{\beta}_1) &= E[\hat{\beta}_1] - \beta_{1,\text{true}} \\ &= 3 - 2 \\ &= 1 \end{align*}

More generally, the direction and magnitude of the bias depend on the correlation between x and the composite error e. If x and e are positively correlated, the bias is positive and OLS overestimates the true coefficient; if they are negatively correlated, the bias is negative and OLS underestimates it.

9.41 Simulation in R

Let’s create a simple dataset with 10 observations where x is in the interval 1:10, and generate y values based on the given DGP: y = 2x + e, where e = 1x + u, and u is a random error term.

set.seed(123)  # for reproducibility
x <- 1:10
u <- rnorm(10, mean = 0, sd = 1)
e <- 1*x + u
# e <- 1*x
y <- 2*x + e

# Generate the data frame
data <- data.frame(x = x, y = y)

# Estimate the OLS model
model <- lm(y ~ x, data = data)

# Print the model summary
summary(model)

Call:
lm(formula = y ~ x, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.1348 -0.5624 -0.1393  0.3854  1.6814 

Coefficients:
            Estimate Std. Error t value      Pr(>|t|)    
(Intercept)   0.5255     0.6673   0.787         0.454    
x             2.9180     0.1075  27.134 0.00000000367 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9768 on 8 degrees of freedom
Multiple R-squared:  0.9893,    Adjusted R-squared:  0.9879 
F-statistic: 736.3 on 1 and 8 DF,  p-value: 0.000000003666

In this example, the true relationship is y = 2x + e, where e = 1x + u. However, when we estimate the OLS model, we get:

\hat{y} = 0.5255 + 2.9180x

The estimated coefficient on x is 2.9180, which is biased upward from the structural value of 2 and close to the reduced-form value of 3 derived above. The bias is due to the correlation between the error term e and the explanatory variable x.

To visualize the bias using ggplot2, we can plot the true relationship (y = 2x) and the estimated OLS relationship:

library(ggplot2)

ggplot(data, aes(x = x, y = y)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 2, color = "blue", linewidth = 1, linetype = "dashed") +
  geom_abline(intercept = coef(model)[1], slope = coef(model)[2], color = "red", linewidth = 1) +
  labs(title = "True vs. Estimated Relationship", x = "x", y = "y") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_color_manual(name = "Lines", values = c("blue", "red"), 
                     labels = c("True", "OLS"))

The plot will show that the estimated OLS line (red) is steeper than the true relationship line (blue), illustrating the upward bias in the estimated coefficient.
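
To confirm that this is a systematic bias rather than a quirk of a single draw, a short Monte Carlo sketch (using the same DGP as above) repeats the simulation and averages the estimates:

# Monte Carlo sketch: the OLS slope centers near 3, a bias of about +1
# relative to the structural coefficient of 2
set.seed(123)
beta_hats <- replicate(1000, {
  x <- 1:10
  u <- rnorm(10, mean = 0, sd = 1)
  y <- 2 * x + (1 * x + u)      # same DGP: y = 2x + e with e = x + u
  coef(lm(y ~ x))[2]
})
mean(beta_hats)   # approximately 3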

9.42 Conclusion

In summary, when the error term is correlated with the explanatory variable (endogeneity), the OLS estimator will be biased. The direction and magnitude of the bias depend on the nature of the correlation between the error term and the explanatory variable. This tutorial demonstrated the bias both mathematically and through a simulated example in R, using ggplot2 for visualization.


9.43 Appendix C. OLS - Worked Examples

9.44 Example 1. Practice Hours vs. Skill Rating: Visual Guide to OLS and R-squared

library(ggplot2)
library(dplyr)
library(gridExtra)

# Our simple dataset
practice <- c(1, 2, 3, 4, 5, 6)
skill <- c(3, 7, 5, 8, 10, 9)

# Create data frame
data <- data.frame(
  Student = 1:6,
  Practice = practice,
  Skill = skill
)

print(data)
  Student Practice Skill
1       1        1     3
2       2        2     7
3       3        3     5
4       4        4     8
5       5        5    10
6       6        6     9

Manual Calculations

Step 1: Calculate the Means

The mean is the average value, calculated by summing all observations and dividing by the number of observations.

Mean of Practice Hours (X̄):

\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = \frac{21}{6} = 3.5

Mean of Skill Ratings (Ȳ):

\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n} = \frac{3 + 7 + 5 + 8 + 10 + 9}{6} = \frac{42}{6} = 7

# Calculate means
mean_x <- mean(practice)
mean_y <- mean(skill)

cat("Mean Practice Hours:", mean_x, "\n")
Mean Practice Hours: 3.5 
cat("Mean Skill Rating:", mean_y, "\n")
Mean Skill Rating: 7 

Step 2: Calculate Variance and Standard Deviation

Variance measures how spread out the data is from the mean. We use the sample variance formula (dividing by n-1).

Variance of X (Practice Hours):

First, calculate deviations from the mean (X_i - \bar{X}):

Student X_i X_i - \bar{X} (X_i - \bar{X})^2
1 1 1 - 3.5 = -2.5 (-2.5)^2 = 6.25
2 2 2 - 3.5 = -1.5 (-1.5)^2 = 2.25
3 3 3 - 3.5 = -0.5 (-0.5)^2 = 0.25
4 4 4 - 3.5 = 0.5 (0.5)^2 = 0.25
5 5 5 - 3.5 = 1.5 (1.5)^2 = 2.25
6 6 6 - 3.5 = 2.5 (2.5)^2 = 6.25
Sum 17.5

s^2_X = \frac{\sum(X_i - \bar{X})^2}{n-1} = \frac{17.5}{5} = 3.5

s_X = \sqrt{3.5} = 1.871

Variance of Y (Skill Ratings):

Student Y_i Y_i - \bar{Y} (Y_i - \bar{Y})^2
1 3 3 - 7 = -4 (-4)^2 = 16
2 7 7 - 7 = 0 (0)^2 = 0
3 5 5 - 7 = -2 (-2)^2 = 4
4 8 8 - 7 = 1 (1)^2 = 1
5 10 10 - 7 = 3 (3)^2 = 9
6 9 9 - 7 = 2 (2)^2 = 4
Sum 34

s^2_Y = \frac{\sum(Y_i - \bar{Y})^2}{n-1} = \frac{34}{5} = 6.8

s_Y = \sqrt{6.8} = 2.608

# Verify variance and standard deviation
cat("Variance of Practice:", var(practice), "\n")
Variance of Practice: 3.5 
cat("SD of Practice:", sd(practice), "\n")
SD of Practice: 1.870829 
cat("Variance of Skill:", var(skill), "\n")
Variance of Skill: 6.8 
cat("SD of Skill:", sd(skill), "\n")
SD of Skill: 2.607681 

Step 3: Calculate Covariance

Covariance measures how two variables vary together. A positive covariance indicates that as one variable increases, the other tends to increase.

s_{XY} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n-1}

Let’s calculate the products for each observation:

Student (X_i - \bar{X}) (Y_i - \bar{Y}) (X_i - \bar{X})(Y_i - \bar{Y})
1 -2.5 -4 (-2.5) \times (-4) = 10.0
2 -1.5 0 (-1.5) \times (0) = 0.0
3 -0.5 -2 (-0.5) \times (-2) = 1.0
4 0.5 1 (0.5) \times (1) = 0.5
5 1.5 3 (1.5) \times (3) = 4.5
6 2.5 2 (2.5) \times (2) = 5.0
Sum 21.0

s_{XY} = \frac{21.0}{5} = 4.2

# Verify covariance
cat("Covariance:", cov(practice, skill), "\n")
Covariance: 4.2 

Step 4: Calculate Pearson Correlation Coefficient

The correlation coefficient standardizes the covariance to a scale from -1 to +1:

r = \frac{s_{XY}}{s_X \cdot s_Y} = \frac{4.2}{1.871 \times 2.608} = \frac{4.2}{4.879} = 0.861

This gives us a correlation of 0.861, indicating a strong positive relationship between practice hours and skill rating.

# Verify correlation
cat("Correlation:", cor(practice, skill), "\n")
Correlation: 0.8609161 

Step 5: Calculate OLS Regression Coefficients

The Ordinary Least Squares (OLS) method finds the values of \beta_0 and \beta_1 that minimize the sum of squared errors.

Slope estimator:

\hat{\beta_1} = \frac{s_{XY}}{s^2_X} = \frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{\sum(X_i - \bar{X})^2}

Using our calculated values:

\hat{\beta_1} = \frac{4.2}{3.5} = 1.2

Intercept estimator:

\hat{\beta_0} = \bar{Y} - \hat{\beta_1}\bar{X}

\hat{\beta_0} = 7 - (1.2 \times 3.5) = 7 - 4.2 = 2.8

Our regression equation is:

\hat{Y} = 2.8 + 1.2X

This means:

  • When practice hours = 0, predicted skill = 2.8

  • For each 1-hour increase in practice, skill rating increases by 1.2 points

# Fit the model
model <- lm(skill ~ practice)
coef_model <- coef(model)

cat("Intercept (β₀):", coef_model[1], "\n")
Intercept (β₀): 2.8 
cat("Slope (β₁):", coef_model[2], "\n")
Slope (β₁): 1.2 
# Get predictions
data$predicted <- predict(model)
data$residual <- residuals(model)

Step 6: Calculate Predicted Values and Residuals

Using \hat{Y} = 2.8 + 1.2X:

Student X_i Y_i \hat{Y}_i = 2.8 + 1.2X_i Residual (Y_i - \hat{Y}_i)
1 1 3 2.8 + 1.2(1) = 4.0 3 - 4.0 = -1.0
2 2 7 2.8 + 1.2(2) = 5.2 7 - 5.2 = 1.8
3 3 5 2.8 + 1.2(3) = 6.4 5 - 6.4 = -1.4
4 4 8 2.8 + 1.2(4) = 7.6 8 - 7.6 = 0.4
5 5 10 2.8 + 1.2(5) = 8.8 10 - 8.8 = 1.2
6 6 9 2.8 + 1.2(6) = 10.0 9 - 10.0 = -1.0
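
The fitted values and residuals stored earlier (data$predicted and data$residual) let us check this table directly in R:

# Check the manual table against the model's fitted values and residuals
data.frame(
  Student = data$Student,
  X = practice,
  Y = skill,
  Predicted = round(data$predicted, 1),
  Residual = round(data$residual, 1)
)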

Step 7: Calculate Sum of Squares

SST (Total Sum of Squares) - Total variation in Y:

SST = \sum(Y_i - \bar{Y})^2

From our earlier variance calculation:

SST = (n-1) \times s^2_Y = 5 \times 6.8 = 34

SSE (Sum of Squared Errors) - Unexplained variation:

SSE = \sum(Y_i - \hat{Y}_i)^2

Student Y_i \hat{Y}_i (Y_i - \hat{Y}_i) (Y_i - \hat{Y}_i)^2
1 3 4.0 -1.0 1.00
2 7 5.2 1.8 3.24
3 5 6.4 -1.4 1.96
4 8 7.6 0.4 0.16
5 10 8.8 1.2 1.44
6 9 10.0 -1.0 1.00
Sum 8.80

SSE = 8.80

SSR (Sum of Squares Regression) - Explained variation:

SSR = SST - SSE = 34 - 8.80 = 25.20

Step 8: Calculate R-squared

R-squared tells us what proportion of the total variation in Y is explained by our model:

R^2 = \frac{SSR}{SST} = \frac{25.20}{34} = 0.741

Alternative formula:

R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{8.80}{34} = 1 - 0.259 = 0.741

Or simply:

R^2 = r^2 = (0.861)^2 = 0.741

This means our model explains 74.1% of the variation in skill ratings!

# Verify sum of squares and R-squared
SST <- sum((skill - mean_y)^2)
SSR <- sum((data$predicted - mean_y)^2)
SSE <- sum(data$residual^2)
r_squared <- SSR / SST

cat("SST (Total):", SST, "\n")
SST (Total): 34 
cat("SSR (Explained):", SSR, "\n")
SSR (Explained): 25.2 
cat("SSE (Unexplained):", SSE, "\n")
SSE (Unexplained): 8.8 
cat("R-squared:", r_squared, "\n")
R-squared: 0.7411765 
cat("R-squared (from correlation):", cor(practice, skill)^2, "\n")
R-squared (from correlation): 0.7411765 

Visualization 1: The OLS Best-Fit Line

This plot shows how OLS minimizes the sum of squared residuals (vertical distances from points to the line).

ggplot(data, aes(x = Practice, y = Skill)) +
  geom_hline(yintercept = mean_y, linetype = "dashed", 
             color = "gray50", linewidth = 0.8, alpha = 0.7) +
  geom_segment(aes(xend = Practice, yend = predicted), 
               color = "red", linetype = "dotted", linewidth = 1) +
  geom_smooth(method = "lm", se = FALSE, color = "blue", linewidth = 1.2) +
  geom_point(size = 5, color = "darkblue") +
  geom_point(aes(y = predicted), size = 3, color = "blue", shape = 17) +
  annotate("text", x = 1.5, y = mean_y + 0.3, 
           label = paste0("Mean (ȳ = ", mean_y, ")"), 
           color = "gray30", size = 4) +
  annotate("text", x = 5, y = 4, 
           label = "Residuals (errors)\nOLS minimizes Σ(residuals²)", 
           color = "red", size = 3.5, hjust = 0) +
  annotate("text", x = 2, y = 10.5, 
           label = paste0("ŷ = ", round(coef_model[1], 1), 
                         " + ", round(coef_model[2], 1), "x"), 
           color = "blue", size = 5, fontface = "bold") +
  labs(
    title = "OLS Regression: Minimizing Squared Residuals",
    subtitle = "Blue triangles are predicted values; red lines show residuals",
    x = "Practice Hours",
    y = "Skill Rating"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold", size = 14))
`geom_smooth()` using formula = 'y ~ x'

Key Insight: OLS finds the unique line that makes the sum of squared red distances as small as possible!

Visualization 2: Variance Decomposition (SST = SSR + SSE)

This shows how total variation is split into explained and unexplained components.

# Create the decomposition plot
ggplot(data, aes(x = Practice)) +
  # Total deviation (SST)
  geom_segment(aes(y = mean_y, yend = Skill, xend = Practice), 
               color = "purple", linewidth = 1.2, alpha = 0.6,
               arrow = arrow(length = unit(0.15, "inches"))) +
  
  # Explained deviation (SSR)
  geom_segment(aes(y = mean_y, yend = predicted, xend = Practice), 
               color = "green", linewidth = 1.5, alpha = 0.8) +
  
  # Unexplained deviation (SSE)
  geom_segment(aes(y = predicted, yend = Skill, xend = Practice), 
               color = "red", linewidth = 1.2, alpha = 0.8,
               linetype = "dashed") +
  
  # Mean line
  geom_hline(yintercept = mean_y, linetype = "solid", 
             color = "gray40", linewidth = 1) +
  
  # Regression line
  geom_smooth(aes(y = Skill), method = "lm", se = FALSE, 
              color = "blue", linewidth = 1) +
  
  # Points
  geom_point(aes(y = Skill), size = 5, color = "darkblue") +
  geom_point(aes(y = predicted), size = 3, color = "blue", shape = 15) +
  
  # Annotations
  annotate("text", x = 6.5, y = mean_y, 
           label = "Mean", color = "gray40", size = 4, hjust = 0) +
  annotate("text", x = 6.5, y = 9.5, 
           label = "Regression Line", color = "blue", size = 4, hjust = 0) +
  
  # Legend
  annotate("segment", x = 0.5, xend = 0.5, y = 2, yend = 3.5,
           color = "purple", linewidth = 1.2, 
           arrow = arrow(length = unit(0.12, "inches"))) +
  annotate("text", x = 0.7, y = 2.75, 
           label = "Total (SST)", color = "purple", 
           size = 3.5, hjust = 0, fontface = "bold") +
  
  annotate("segment", x = 0.5, xend = 0.5, y = 4.5, yend = 5.5,
           color = "green", linewidth = 1.5) +
  annotate("text", x = 0.7, y = 5, 
           label = "Explained (SSR)", color = "green", 
           size = 3.5, hjust = 0, fontface = "bold") +
  
  annotate("segment", x = 0.5, xend = 0.5, y = 6.5, yend = 7.5,
           color = "red", linewidth = 1.2, linetype = "dashed") +
  annotate("text", x = 0.7, y = 7, 
           label = "Unexplained (SSE)", color = "red", 
           size = 3.5, hjust = 0, fontface = "bold") +
  
  labs(
    title = "Variance Decomposition: SST = SSR + SSE",
    subtitle = "Purple = total deviation | Green = explained by model | Red = residual error",
    x = "Practice Hours",
    y = "Skill Rating"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold", size = 14)) +
  coord_cartesian(xlim = c(0.3, 7.5), ylim = c(2, 11))
`geom_smooth()` using formula = 'y ~ x'

Mathematical Decomposition:

Summed over all observations: \sum(Y_i - \bar{Y})^2 = \sum(\hat{Y}_i - \bar{Y})^2 + \sum(Y_i - \hat{Y}_i)^2. The identity holds for the sums, not for each observation individually, because the cross-product term only vanishes when summed.

  • Purple arrows: Total deviation from mean = (Y_i - \bar{Y})

  • Green bars: Model’s explanation = (\hat{Y}_i - \bar{Y})

  • Red dashed: What’s left over = (Y_i - \hat{Y}_i)
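
Why does the decomposition hold? Because for an OLS fit with an intercept the cross-product term \sum(\hat{Y}_i - \bar{Y})(Y_i - \hat{Y}_i) is zero. A quick numerical check with the objects defined above:

# The cross-product term vanishes, which is what makes SST = SSR + SSE
sum((data$predicted - mean_y) * data$residual)  # essentially 0 (floating-point noise)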

Visualization 3: R-squared as a Proportion

# Calculate sum of squares
SST <- sum((skill - mean_y)^2)
SSR <- sum((data$predicted - mean_y)^2)
SSE <- sum(data$residual^2)
r_squared <- SSR / SST

# Create bar chart data
ss_data <- data.frame(
  Component = c("Total (SST)", "Explained (SSR)", "Unexplained (SSE)"),
  Value = c(SST, SSR, SSE),
  Color = c("purple", "green", "red")
)

p1 <- ggplot(ss_data, aes(x = Component, y = Value, fill = Component)) +
  geom_bar(stat = "identity", alpha = 0.7, color = "black") +
  geom_text(aes(label = round(Value, 2)), vjust = -0.5, size = 5, fontface = "bold") +
  scale_fill_manual(values = c("Total (SST)" = "purple", "Explained (SSR)" = "green", "Unexplained (SSE)" = "red")) +
  labs(
    title = "Sum of Squares Breakdown",
    y = "Sum of Squares",
    x = ""
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

# Create proportion visualization
prop_data <- data.frame(
  Component = c("Explained\n(SSR)", "Unexplained\n(SSE)"),
  Proportion = c(SSR/SST, SSE/SST),
  Percentage = c(SSR/SST * 100, SSE/SST * 100)
)

p2 <- ggplot(prop_data, aes(x = "", y = Proportion, fill = Component)) +
  geom_bar(stat = "identity", width = 1, color = "white", linewidth = 2) +
  coord_polar(theta = "y") +
  scale_fill_manual(values = c("green3", "red3")) +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), 
            position = position_stack(vjust = 0.5), 
            size = 6, fontface = "bold", color = "white") +
  labs(title = paste0("R² = ", round(r_squared, 3))) +
  theme_void(base_size = 12) +
  theme(plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
        legend.position = "bottom",
        legend.title = element_blank())

grid.arrange(p1, p2, ncol = 2)

R-squared Formulas:

R^2 = \frac{SSR}{SST} = \frac{25.20}{34} = 0.74

R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{8.80}{34} = 0.74

Interpretation: Our model explains 74% of the variation in skill ratings!

Visualization 4: R² as Correlation Between Deviations

This shows the geometric interpretation of R²: how well predicted deviations from the mean match actual deviations.

# Calculate deviations
data$actual_dev <- skill - mean_y
data$predicted_dev <- data$predicted - mean_y

# Side-by-side comparison
p1 <- ggplot(data, aes(x = Practice)) +
  geom_hline(yintercept = 0, color = "black", linewidth = 0.8) +
  geom_segment(aes(y = 0, yend = actual_dev, xend = Practice), 
               color = "purple", linewidth = 2, alpha = 0.7,
               arrow = arrow(length = unit(0.15, "inches"))) +
  geom_point(aes(y = actual_dev), size = 4, color = "purple") +
  labs(
    title = "Actual Deviations from Mean",
    subtitle = expression(Y[i] - bar(Y)),
    x = "Practice Hours",
    y = "Deviation from Mean"
  ) +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))

p2 <- ggplot(data, aes(x = Practice)) +
  geom_hline(yintercept = 0, color = "black", linewidth = 0.8) +
  geom_segment(aes(y = 0, yend = predicted_dev, xend = Practice), 
               color = "green", linewidth = 2, alpha = 0.7,
               arrow = arrow(length = unit(0.15, "inches"))) +
  geom_point(aes(y = predicted_dev), size = 4, color = "green") +
  labs(
    title = "Predicted Deviations from Mean",
    subtitle = expression(hat(Y)[i] - bar(Y)),
    x = "Practice Hours",
    y = "Deviation from Mean"
  ) +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))

grid.arrange(p1, p2, ncol = 2)

Visualization 5: Scatterplot of Deviations (R² Interpretation)

ggplot(data, aes(x = predicted_dev, y = actual_dev)) +
  geom_abline(slope = 1, intercept = 0, 
              linetype = "dashed", color = "gray50", linewidth = 1) +
  geom_point(size = 5, color = "darkblue", alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "blue", 
              fill = "lightblue", alpha = 0.3) +
  annotate("text", x = -3, y = 2.5, 
           label = paste0("If perfect fit:\nall points on this line\n(R² = 1)"), 
           color = "gray40", size = 3.5, hjust = 0) +
  annotate("text", x = 1.5, y = -3, 
           label = paste0("Correlation = ", round(sqrt(r_squared), 3), 
                         "\nR² = ", round(r_squared, 3)), 
           color = "blue", size = 5, fontface = "bold") +
  labs(
    title = "R² Measures How Well Predicted Deviations Match Actual Deviations",
    subtitle = "Each point compares model's prediction to reality (both relative to mean)",
    x = expression(paste("Predicted Deviation: ", hat(Y)[i] - bar(Y))),
    y = expression(paste("Actual Deviation: ", Y[i] - bar(Y)))
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold", size = 13))
`geom_smooth()` using formula = 'y ~ x'

Key Insight:

R² is literally the square of the correlation between:

  • What the model predicts (relative to mean)

  • What actually happened (relative to mean)

If R^2 = 1: all points fall exactly on the diagonal (perfect predictions)

If R^2 = 0: no relationship between predicted and actual deviations
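
This claim is easy to verify numerically with the deviation columns created above:

# The squared correlation between predicted and actual deviations equals R-squared
cor(data$predicted_dev, data$actual_dev)^2  # ≈ 0.741, identical to r_squared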

Summary Table

# Create summary table
summary_stats <- data.frame(
  Statistic = c("Sample Size (n)", 
                "Mean Practice (X̄)", 
                "Mean Skill (Ȳ)",
                "Correlation (r)",
                "Intercept (β₀)",
                "Slope (β₁)",
                "R-squared",
                "SST (Total)",
                "SSR (Explained)",
                "SSE (Unexplained)"),
  Value = c(6,
            round(mean_x, 2),
            round(mean_y, 2),
            round(cor(practice, skill), 3),
            round(coef_model[1], 2),
            round(coef_model[2], 2),
            round(r_squared, 3),
            round(SST, 2),
            round(SSR, 2),
            round(SSE, 2))
)

knitr::kable(summary_stats, align = c("l", "r"))
Statistic Value
Sample Size (n) 6.000
Mean Practice (X̄) 3.500
Mean Skill (Ȳ) 7.000
Correlation (r) 0.861
Intercept (β₀) 2.800
Slope (β₁) 1.200
R-squared 0.741
SST (Total) 34.000
SSR (Explained) 25.200
SSE (Unexplained) 8.800

Key Takeaways

OLS Regression:

  • Finds the line that minimizes the sum of squared residuals

  • Produces unbiased estimates with minimum variance (BLUE)

R-squared (0.74) means:

  1. 74% of variation in skill is explained by practice hours

  2. The correlation between predicted and actual deviations is \sqrt{0.74} = 0.86

  3. SSR is 74% of SST; SSE is 26% of SST

Geometric Interpretation:

  • Total variation = the squared distances of the points from the mean (SST)

  • The regression line accounts for 74% of this variation (SSR)

  • The remaining 26% is unexplained (SSE, the residuals)

Practical Implication:

Each additional hour of practice is associated with a 1.2-point increase in expected skill, and this relationship explains most (but not all) of the variation we observe!
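
As an illustration, the fitted model can be used with predict(); the value of 7 practice hours is chosen arbitrarily and lies outside the observed 1-6 range, so the result should be read as an extrapolation:

# Predicted skill rating for 7 hours of practice (illustrative extrapolation)
predict(model, newdata = data.frame(practice = 7))  # 2.8 + 1.2 * 7 = 11.2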

9.45 Example 2. Income and Voter Turnout

Background

In preparation for upcoming municipal elections, the Electoral Commission of a mid-sized European city conducted research on voter participation patterns across different city districts. A key question emerged:

Does economic prosperity of a district correlate with civic engagement, specifically voter turnout?

Data Collection

Sample: 5 representative districts

Time Period: Data from recent municipal elections

Variables:

  • Income: Average annual household income per capita (thousands €)
  • Turnout: Percentage of registered voters who voted in the election

Initial R Output for Reference

# Data
income <- c(50, 45, 56, 40, 60)  # thousands €
turnout <- c(60, 56, 70, 50, 75) # %

# Full model check
model <- lm(turnout ~ income)
summary(model)

Call:
lm(formula = turnout ~ income)

Residuals:
      1       2       3       4       5 
-1.9486  0.3359  0.5100  0.6204  0.4824 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.89647    3.96731  -0.226 0.835748    
income       1.25690    0.07822  16.068 0.000524 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.263 on 3 degrees of freedom
Multiple R-squared:  0.9885,    Adjusted R-squared:  0.9847 
F-statistic: 258.2 on 1 and 3 DF,  p-value: 0.0005243

Dispersion Measures

Means:

\bar{X} = \frac{\sum_{i=1}^n X_i}{n} = \frac{50 + 45 + 56 + 40 + 60}{5} = \frac{251}{5} = 50.2

\bar{Y} = \frac{\sum_{i=1}^n Y_i}{n} = \frac{60 + 56 + 70 + 50 + 75}{5} = \frac{311}{5} = 62.2

# Verification
mean(income)  # 50.2
[1] 50.2
mean(turnout) # 62.2
[1] 62.2

Variances:

s^2_X = \frac{\sum(X_i - \bar{X})^2}{n-1}

Deviations for X: (-0.2, -5.2, 5.8, -10.2, 9.8)

s^2_X = \frac{0.04 + 27.04 + 33.64 + 104.04 + 96.04}{4} = \frac{260.8}{4} = 65.2

Deviations for Y: (-2.2, -6.2, 7.8, -12.2, 12.8)

s^2_Y = \frac{4.84 + 38.44 + 60.84 + 148.84 + 163.84}{4} = \frac{416.8}{4} = 104.2

# Verification
var(income)  # 65.2
[1] 65.2
var(turnout) # 104.2
[1] 104.2

Covariance and Correlation

Covariance:

s_{XY} = \frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{n-1}

Products of deviations:

(-0.2 \times -2.2) = 0.44
(-5.2 \times -6.2) = 32.24
(5.8 \times 7.8) = 45.24
(-10.2 \times -12.2) = 124.44
(9.8 \times 12.8) = 125.44

Sum = 327.8

s_{XY} = \frac{327.8}{4} = 81.95

# Verification
cov(income, turnout) # 81.95
[1] 81.95

Correlation:

r_{XY} = \frac{s_{XY}}{\sqrt{s^2_X}\sqrt{s^2_Y}} = \frac{81.95}{\sqrt{65.2}\sqrt{104.2}} = 0.994

# Verification
cor(income, turnout) # 0.994
[1] 0.9942402

OLS Regression (\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X)

Slope coefficient:

\hat{\beta_1} = \frac{s_{XY}}{s^2_X} = \frac{81.95}{65.2} = 1.2569018

Intercept:

\hat{\beta_0} = \bar{Y} - \hat{\beta_1}\bar{X}

Step by step:

  1. 1.2569018 \times 50.2 = 63.0964724
  2. \hat{\beta_0} = 62.2 - 63.0964724 = -0.8964724
# Verification
coef(model)  # Exact coefficients from R
(Intercept)      income 
 -0.8964724   1.2569018 

Detailed Decomposition of Variance and R-squared

Step 1: Calculate predicted values (\hat{Y}):

\hat{Y} = -0.8964724 + 1.2569018X

The predicted values \hat{Y} for each X value:

For X = 50:

\hat{Y} = -0.8964724 + 1.2569018 \times 50 = -0.8964724 + 62.845090 = 61.948618

For X = 45:

\hat{Y} = -0.8964724 + 1.2569018 \times 45 = -0.8964724 + 56.560581 = 55.664109

For X = 56:

\hat{Y} = -0.8964724 + 1.2569018 \times 56 = -0.8964724 + 70.386501 = 69.490028

For X = 40:

\hat{Y} = -0.8964724 + 1.2569018 \times 40 = -0.8964724 + 50.276072 = 49.379600

For X = 60:

\hat{Y} = -0.8964724 + 1.2569018 \times 60 = -0.8964724 + 75.414108 = 74.517636

# Verification of predicted values
y_hat <- -0.8964724 + 1.2569018 * income
data.frame(
  X = income,
  Y = turnout,
  Y_hat = y_hat,
  row.names = 1:5
)
   X  Y    Y_hat
1 50 60 61.94862
2 45 56 55.66411
3 56 70 69.49003
4 40 50 49.37960
5 60 75 74.51764

Step 2: Calculate SST (Total Sum of Squares)

SST = \sum(Y_i - \bar{Y})^2 \text{ where } \bar{Y} = 62.2

(60 - 62.2)^2 = (-2.2)^2 = 4.84
(56 - 62.2)^2 = (-6.2)^2 = 38.44
(70 - 62.2)^2 = (7.8)^2 = 60.84
(50 - 62.2)^2 = (-12.2)^2 = 148.84
(75 - 62.2)^2 = (12.8)^2 = 163.84

SST = 4.84 + 38.44 + 60.84 + 148.84 + 163.84 = 416.8

Step 3: Calculate SSR (Regression Sum of Squares)

SSR = \sum(\hat{Y}_i - \bar{Y})^2

(61.948618 - 62.2)^2 = (-0.251382)^2 = 0.06319
(55.664109 - 62.2)^2 = (-6.535891)^2 = 42.71787
(69.490028 - 62.2)^2 = (7.290028)^2 = 53.14451
(49.379600 - 62.2)^2 = (-12.820400)^2 = 164.36266
(74.517636 - 62.2)^2 = (12.317636)^2 = 151.72416

SSR = 412.0124

Step 4: Calculate SSE (Error Sum of Squares)

SSE = \sum(Y_i - \hat{Y}_i)^2

(60 - 61.948618)^2 = (-1.948618)^2 = 3.79711
(56 - 55.664109)^2 = (0.335891)^2 = 0.11282
(70 - 69.490028)^2 = (0.509972)^2 = 0.26007
(50 - 49.379600)^2 = (0.620400)^2 = 0.38490
(75 - 74.517636)^2 = (0.482364)^2 = 0.23268

SSE = 4.7876

Step 5: Verify decomposition

SST = SSR + SSE: 416.8 = 412.0124 + 4.7876
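
These totals can be verified directly from the fitted model:

# Verification of the sums of squares
SST <- sum((turnout - mean(turnout))^2)  # 416.8
SSE <- sum(residuals(model)^2)           # ≈ 4.79
SSR <- SST - SSE                         # ≈ 412.01
c(SST = SST, SSR = SSR, SSE = SSE)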

Step 6: Calculate R-squared

R^2 = \frac{SSR}{SST} = \frac{412.0124}{416.8} = 0.9885

# Verification
summary(model)$r.squared  # Should match our calculation
[1] 0.9885135

Visualization

library(ggplot2)
df <- data.frame(income = income, turnout = turnout)

ggplot(df, aes(x = income, y = turnout)) +
  geom_point(color = "blue", size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Voter Turnout vs Income per Capita",
    x = "Income per Capita (thousands €)",
    y = "Voter Turnout (%)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    text = element_text(size = 12)
  )

Interpretation

The analysis shows:

  1. A very strong positive correlation (r = 0.994) between income and voter turnout

  2. The regression equation \hat{Y} = -0.8965 + 1.2569X indicates that:

    • For each €1,000 increase in income, turnout increases by about 1.26 percentage points
    • The intercept (-0.8965) has little practical meaning, as no district has an income of zero
  3. The R-squared of 0.9885 indicates that 98.85% of the variance in turnout is explained by income

9.46 Example 3. Anxiety Levels and Cognitive Performance: A Laboratory Study

Data and Context

In a psychology experiment, researchers measured the relationship between anxiety levels (measured by galvanic skin response, GSR) and cognitive performance (score on a working memory task).

# Data
anxiety <- c(2.1, 3.4, 4.2, 5.1, 5.8, 6.4, 7.2, 8.0)  # GSR readings
performance <- c(92, 88, 84, 78, 74, 70, 65, 62)      # Working memory scores

# Initial model check
model <- lm(performance ~ anxiety)
summary(model)

Call:
lm(formula = performance ~ anxiety)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8993 -0.6660  0.2162  0.6106  1.5262 

Coefficients:
            Estimate Std. Error t value      Pr(>|t|)    
(Intercept) 105.3248     1.3189   79.86 0.00000000026 ***
anxiety      -5.4407     0.2359  -23.06 0.00000043549 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.236 on 6 degrees of freedom
Multiple R-squared:  0.9888,    Adjusted R-squared:  0.987 
F-statistic: 531.9 on 1 and 6 DF,  p-value: 0.0000004355

Descriptive Statistics

Means: \bar{X} = \frac{2.1 + 3.4 + 4.2 + 5.1 + 5.8 + 6.4 + 7.2 + 8.0}{8} = \frac{42.2}{8} = 5.275

\bar{Y} = \frac{92 + 88 + 84 + 78 + 74 + 70 + 65 + 62}{8} = \frac{613}{8} = 76.625

# Verification
mean(anxiety)
[1] 5.275
mean(performance)
[1] 76.625

Variances: s^2_X = \frac{\sum(X_i - \bar{X})^2}{n-1}

Deviations for X:

  • (2.1 - 5.275) = -3.175
  • (3.4 - 5.275) = -1.875
  • (4.2 - 5.275) = -1.075
  • (5.1 - 5.275) = -0.175
  • (5.8 - 5.275) = 0.525
  • (6.4 - 5.275) = 1.125
  • (7.2 - 5.275) = 1.925
  • (8.0 - 5.275) = 2.725

Squared deviations:

10.08063 + 3.51563 + 1.15563 + 0.03063 + 0.27563 + 1.26563 + 3.70563 + 7.42563 = 27.45500

s^2_X = \frac{27.45500}{7} = 3.922143

Similarly for Y: Deviations:

15.375, 11.375, 7.375, 1.375, -2.625, -6.625, -11.625, -14.625

s^2_Y = \frac{236.391 + 129.391 + 54.391 + 1.891 + 6.891 + 43.891 + 135.141 + 213.891}{7} = \frac{821.875}{7} = 117.4107

# Verification
var(anxiety)
[1] 3.922143
var(performance)
[1] 117.4107

Covariance and Correlation

Covariance: s_{XY} = \frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{n-1}

Products of deviations:

(-3.175 × 15.375) = -48.815625

(-1.875 × 11.375) = -21.328125

(-1.075 × 7.375) = -7.928125

(-0.175 × 1.375) = -0.240625

(0.525 × -2.625) = -1.378125

(1.125 × -6.625) = -7.453125

(1.925 × -11.625) = -22.378125

(2.725 × -14.625) = -39.853125

Sum = -149.375

s_{XY} = \frac{-149.375}{7} = -21.33929

# Verification
cov(anxiety, performance)
[1] -21.33929

Correlation: r_{XY} = \frac{s_{XY}}{\sqrt{s^2_X}\sqrt{s^2_Y}} = \frac{-21.33929}{\sqrt{3.922143}\sqrt{117.4107}} = -0.9944

# Verification
cor(anxiety, performance)
[1] -0.9944073

OLS Regression (\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X)

Slope coefficient: \hat{\beta_1} = \frac{s_{XY}}{s^2_X} = \frac{-21.33929}{3.922143} = -5.4407

Intercept: \hat{\beta_0} = \bar{Y} - \hat{\beta_1}\bar{X} Steps:

  1. -5.4407 × 5.275 = -28.6997
  2. \hat{\beta_0} = 76.625 - (-28.6997) = 105.3247
# Verification
coef(model)
(Intercept)     anxiety 
 105.324804   -5.440721 

R-squared Calculation

Step 1: Calculate predicted values (\hat{Y}): \hat{Y} = 105.3247 - 5.4407X

# Predicted values
y_hat <- 105.3247 - 5.4407 * anxiety
data.frame(
  Anxiety = anxiety,
  Performance = performance,
  Predicted = y_hat,
  row.names = 1:8
)
  Anxiety Performance Predicted
1     2.1          92  93.89923
2     3.4          88  86.82632
3     4.2          84  82.47376
4     5.1          78  77.57713
5     5.8          74  73.76864
6     6.4          70  70.50422
7     7.2          65  66.15166
8     8.0          62  61.79910

Step 2: Sum of Squares

SST = \sum(Y_i - \bar{Y})^2 = 821.875

SSR = \sum(\hat{Y}_i - \bar{Y})^2 = 812.708

SSE = \sum(Y_i - \hat{Y}_i)^2 = 9.167

R-squared: R^2 = \frac{SSR}{SST} = \frac{812.708}{821.875} = 0.9888

# Verification
summary(model)$r.squared
[1] 0.9888459
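
The individual sums of squares can also be read off R's ANOVA table:

# Verification of the sums of squares via the ANOVA table
anova_tab <- anova(model)
anova_tab$`Sum Sq`       # regression (≈ 812.7) and residual (≈ 9.17) sums of squares
sum(anova_tab$`Sum Sq`)  # SST ≈ 821.9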

Visualization

library(ggplot2)

ggplot(data.frame(anxiety, performance), aes(x = anxiety, y = performance)) +
  geom_point(color = "blue", size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Cognitive Performance vs. Anxiety Levels",
    x = "Anxiety (GSR)",
    y = "Performance Score"
  ) +
  theme_minimal()

Interpretation

  1. Strong negative correlation (r = -0.994) between anxiety and cognitive performance
  2. For each unit increase in GSR (anxiety), performance decreases by approximately 5.44 points
  3. The model explains 98.88% of the variance in performance scores
  4. The relationship appears to be strongly linear, suggesting a reliable anxiety-performance relationship
  5. The intercept (105.32) is the predicted performance at zero anxiety, an extrapolation beyond the observed GSR range

Study Limitations

  1. Small sample size (n=8)
  2. Possible other confounding variables
  3. Limited range of anxiety levels
  4. Cross-sectional rather than longitudinal data

9.47 Example 4. Anxiety vs. Performance: Correlation and Regression Analysis

In this tutorial, we’ll explore the relationship between test anxiety levels and exam performance among university students. Research suggests that while a small amount of anxiety can be motivating, excessive anxiety typically impairs performance through reduced concentration, working memory interference, and physical symptoms (Yerkes-Dodson law). We’ll analyze data from 8 students to understand this relationship mathematically.

9.48 Data Presentation

The Dataset

We collected data from 8 students, measuring:

  • X: Test anxiety score (1-10 scale, where 1 = very low, 10 = very high)
  • Y: Exam performance (percentage score)
# Our data
anxiety <- c(2.5, 3.2, 4.1, 4.8, 5.6, 6.3, 7.0, 7.9)  # Anxiety scores
performance <- c(80, 85, 78, 82, 77, 74, 68, 72)      # Exam scores (%)

# Create a data frame for easy viewing
data <- data.frame(
  Student = 1:8,
  Anxiety = anxiety,
  Performance = performance
)
print(data)
  Student Anxiety Performance
1       1     2.5          80
2       2     3.2          85
3       3     4.1          78
4       4     4.8          82
5       5     5.6          77
6       6     6.3          74
7       7     7.0          68
8       8     7.9          72

Initial Visualization

Let’s first visualize our data to get an intuitive sense of the relationship:

library(ggplot2)

# Create scatterplot
ggplot(data, aes(x = Anxiety, y = Performance)) +
  geom_point(size = 4, color = "darkblue") +
  labs(
    title = "Test Anxiety vs. Exam Performance",
    x = "Test Anxiety Score",
    y = "Exam Performance (%)"
  ) +
  theme_minimal() +
  scale_x_continuous(breaks = seq(2, 8, 1)) +
  scale_y_continuous(breaks = seq(65, 90, 5))

9.49 Summary Statistics

Calculating the Means

The mean is the average value, calculated by summing all observations and dividing by the number of observations.

Mean of Anxiety Scores (X): \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{2.5 + 3.2 + 4.1 + 4.8 + 5.6 + 6.3 + 7.0 + 7.9}{8}

\bar{X} = \frac{41.4}{8} = 5.175

Mean of Performance Scores (Y): \bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n} = \frac{80 + 85 + 78 + 82 + 77 + 74 + 68 + 72}{8}

\bar{Y} = \frac{616}{8} = 77

# Verify our calculations
mean_x <- mean(anxiety)
mean_y <- mean(performance)
cat("Mean Anxiety:", mean_x, "\n")
Mean Anxiety: 5.175 
cat("Mean Performance:", mean_y, "\n")
Mean Performance: 77 

Calculating Variance and Standard Deviation

Variance measures how spread out the data is from the mean. We use the sample variance formula (dividing by n-1).

Variance of X:

First, calculate deviations from the mean (X_i - \bar{X}):

Student X_i X_i - \bar{X} (X_i - \bar{X})^2
1 2.5 2.5 - 5.175 = -2.675 (-2.675)^2 = 7.1556
2 3.2 3.2 - 5.175 = -1.975 (-1.975)^2 = 3.9006
3 4.1 4.1 - 5.175 = -1.075 (-1.075)^2 = 1.1556
4 4.8 4.8 - 5.175 = -0.375 (-0.375)^2 = 0.1406
5 5.6 5.6 - 5.175 = 0.425 (0.425)^2 = 0.1806
6 6.3 6.3 - 5.175 = 1.125 (1.125)^2 = 1.2656
7 7.0 7.0 - 5.175 = 1.825 (1.825)^2 = 3.3306
8 7.9 7.9 - 5.175 = 2.725 (2.725)^2 = 7.4256
Sum 24.555

s^2_X = \frac{\sum(X_i - \bar{X})^2}{n-1} = \frac{24.555}{7} = 3.5079

s_X = \sqrt{3.5079} = 1.8730

Variance of Y:

Student Y_i Y_i - \bar{Y} (Y_i - \bar{Y})^2
1 80 80 - 77 = 3 (3)^2 = 9
2 85 85 - 77 = 8 (8)^2 = 64
3 78 78 - 77 = 1 (1)^2 = 1
4 82 82 - 77 = 5 (5)^2 = 25
5 77 77 - 77 = 0 (0)^2 = 0
6 74 74 - 77 = -3 (-3)^2 = 9
7 68 68 - 77 = -9 (-9)^2 = 81
8 72 72 - 77 = -5 (-5)^2 = 25
Sum 214

s^2_Y = \frac{\sum(Y_i - \bar{Y})^2}{n-1} = \frac{214}{7} = 30.5714

s_Y = \sqrt{30.5714} = 5.5291

# Verify variance and standard deviation
cat("Variance of Anxiety:", var(anxiety), "\n")
Variance of Anxiety: 3.507857 
cat("SD of Anxiety:", sd(anxiety), "\n")
SD of Anxiety: 1.872927 
cat("Variance of Performance:", var(performance), "\n")
Variance of Performance: 30.57143 
cat("SD of Performance:", sd(performance), "\n")
SD of Performance: 5.529144 

9.50 Covariance and Correlation

Calculating Covariance

Covariance measures how two variables vary together. A negative covariance indicates that as one variable increases, the other tends to decrease.

s_{XY} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n-1}

Let’s calculate the products for each observation:

Student (X_i - \bar{X}) (Y_i - \bar{Y}) (X_i - \bar{X})(Y_i - \bar{Y})
1 -2.675 3 (-2.675) \times (3) = -8.025
2 -1.975 8 (-1.975) \times (8) = -15.800
3 -1.075 1 (-1.075) \times (1) = -1.075
4 -0.375 5 (-0.375) \times (5) = -1.875
5 0.425 0 (0.425) \times (0) = 0
6 1.125 -3 (1.125) \times (-3) = -3.375
7 1.825 -9 (1.825) \times (-9) = -16.425
8 2.725 -5 (2.725) \times (-5) = -13.625
Sum -60.2

s_{XY} = \frac{-60.2}{7} = -8.6

Calculating Pearson Correlation Coefficient

The correlation coefficient standardizes the covariance to a scale from -1 to +1:

r = \frac{s_{XY}}{s_X \cdot s_Y} = \frac{-8.6}{1.8730 \times 5.5291} = \frac{-8.6}{10.3560} = -0.831

This gives us a correlation of -0.831, indicating a strong negative relationship between anxiety and performance.

# Verify correlation
actual_cor <- cor(anxiety, performance)
cat("Pearson correlation coefficient:", actual_cor, "\n")
Pearson correlation coefficient: -0.8304618 

9.51 Simple OLS Regression

The Problem of Modeling Relationships

When we observe a relationship between two variables, we want to find a mathematical model that:

  1. Describes the relationship
  2. Allows us to make predictions
  3. Quantifies the strength of the relationship

The simplest model is a straight line: Y = \beta_0 + \beta_1 X + \epsilon

Where:

  • Y is the outcome variable (performance)
  • X is the predictor variable (anxiety)
  • \beta_0 is the intercept (performance when anxiety = 0)
  • \beta_1 is the slope (change in performance per unit change in anxiety)
  • \epsilon is the error term (unexplained variation)

The Idea of Sum of Squared Errors (SSE)

Why Do We Need a Criterion?

Imagine trying to draw a line through our data points. There are infinitely many lines we could draw! Some would go through the middle of the points, others might be too high or too low, too steep or too flat. We need a systematic way to determine which line is “best.”

# Visualize multiple possible lines
library(ggplot2)

# Create different possible lines
ggplot(data.frame(anxiety, performance), aes(x = anxiety, y = performance)) +
  geom_point(size = 4, color = "darkblue") +
  # Bad line 1: Too steep
  geom_abline(intercept = 100, slope = -5, color = "gray", linetype = "dashed", alpha = 0.5) +
  # Bad line 2: Too flat
  geom_abline(intercept = 78, slope = -0.5, color = "gray", linetype = "dashed", alpha = 0.5) +
  # Bad line 3: Wrong direction
  geom_abline(intercept = 65, slope = 2, color = "gray", linetype = "dashed", alpha = 0.5) +
  # Good line (we'll calculate this properly)
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Which Line Best Fits Our Data?",
    subtitle = "Gray dashed lines show poor fits, red line shows the optimal fit",
    x = "Test Anxiety Score",
    y = "Exam Performance (%)"
  ) +
  theme_minimal()

What Are Errors (Residuals)?

For any line we draw, each data point will have an error or residual - the vertical distance from the point to the line. This represents how “wrong” our prediction is for that point.

  • Positive error: The actual value is above the predicted value (we underestimated)
  • Negative error: The actual value is below the predicted value (we overestimated)
# Visualize errors for the regression line
model <- lm(performance ~ anxiety)
predicted <- predict(model)

ggplot(data.frame(anxiety, performance, predicted), aes(x = anxiety, y = performance)) +
  geom_point(size = 4, color = "darkblue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  geom_segment(aes(xend = anxiety, yend = predicted), 
               color = "gray50", linetype = "dashed", alpha = 0.7) +
  # Add labels for a few errors
  geom_text(aes(x = 3.2, y = 86, label = "Error = +3.2"), size = 3, color = "gray40") +
  geom_text(aes(x = 7.0, y = 69, label = "Error = -4.5"), size = 3, color = "gray40") +
  labs(
    title = "Visualizing Errors (Residuals) in Regression",
    subtitle = "Dashed lines show the vertical distance from each point to the regression line",
    x = "Test Anxiety Score",
    y = "Exam Performance (%)"
  ) +
  theme_minimal()

Why Square the Errors?

Simply adding up the errors won’t work because positive and negative errors cancel out. We could use absolute values, but squaring has several advantages:

  1. Mathematical convenience: Squared functions are differentiable, making it easier to find the minimum using calculus
  2. Penalizes large errors more: A few large errors are worse than many small errors
    • Error of 4: squared = 16
    • Two errors of 2: squared = 4 + 4 = 8
    • Four errors of 1: squared = 1 + 1 + 1 + 1 = 4
  3. Creates a smooth, bowl-shaped function: This guarantees a unique minimum

The SSE Formula

The Sum of Squared Errors is: SSE = \sum_{i=1}^{n}(Y_i - \hat{Y_i})^2 = \sum_{i=1}^{n}(Y_i - (\beta_0 + \beta_1 X_i))^2

\min_{\beta}\ \sum_{i=1}^n\bigl(Y_i-\hat{Y}_i(\beta)\bigr)^2, \quad\text{where } \hat{Y}_i(\beta)=\beta_0+\beta_1X_i.

This formula:

  • Takes each actual value (Y_i)
  • Subtracts the predicted value (\hat{Y}_i(\beta)=\beta_0+\beta_1X_i)
  • Squares the difference
  • Adds them all up

Finding the Best Line

The “best” line is the one that minimizes SSE. Using calculus (taking derivatives with respect to \beta_0 and \beta_1 and setting them to zero), we get the OLS formulas.
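
Before applying those formulas, it helps to see the criterion at work. The sketch below defines an SSE function and evaluates it for the three "poor fit" lines drawn in the earlier plot and for a line close to the OLS solution derived below:

# Sketch: compare SSE across candidate lines (coefficients from the earlier plot)
sse <- function(b0, b1) sum((performance - (b0 + b1 * anxiety))^2)

sse(100, -5)       # too steep: SSE = 292
sse(78, -0.5)      # too flat: SSE ≈ 180
sse(65, 2)         # wrong direction: SSE ≈ 575
sse(89.69, -2.45)  # close to the OLS line: SSE ≈ 66.4, the smallest of the four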

OLS Estimators

The Ordinary Least Squares (OLS) method finds the values of \beta_0 and \beta_1 that minimize SSE:

Slope estimator: \hat{\beta_1} = \frac{s_{XY}}{s^2_X} = \frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{\sum(X_i - \bar{X})^2}

Intercept estimator: \hat{\beta_0} = \bar{Y} - \hat{\beta_1}\bar{X}

Calculating the OLS Parameters

Using our calculated values:

  • s_{XY} = -8.6
  • s^2_X = 3.5079
  • \bar{X} = 5.175
  • \bar{Y} = 77

Step 1: Calculate the slope \hat{\beta_1} = \frac{s_{XY}}{s^2_X} = \frac{-8.6}{3.5079} = -2.451

Step 2: Calculate the intercept \hat{\beta_0} = \bar{Y} - \hat{\beta_1}\bar{X} = 77 - (-2.451 \times 5.175) = 77 + 12.684 = 89.684

# Verify with R
model <- lm(performance ~ anxiety)
coef(model)
(Intercept)     anxiety 
  89.687233   -2.451639 

Our regression equation is: \hat{Y} = 89.684 - 2.451X

This means:

  • When anxiety = 0, predicted performance = 89.68%
  • For each 1-point increase in anxiety, performance decreases by 2.45%

9.52 Plotting the Model

# Create comprehensive plot
ggplot(data.frame(anxiety, performance), aes(x = anxiety, y = performance)) +
  geom_point(size = 4, color = "darkblue") +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) +
  geom_text(aes(label = paste0("(", round(anxiety, 1), ", ", performance, ")")),
            vjust = -1, size = 3) +
  annotate("text", x = 3, y = 70, 
           label = paste0("ŷ = ", round(coef(model)[1], 2), " - ", 
                         abs(round(coef(model)[2], 2)), "x"),
           size = 5, color = "red") +
  labs(
    title = "Regression Line: Performance vs. Anxiety",
    subtitle = "Higher anxiety is associated with lower exam performance",
    x = "Test Anxiety Score",
    y = "Exam Performance (%)"
  ) +
  theme_minimal() +
  scale_x_continuous(breaks = seq(2, 8, 1)) +
  scale_y_continuous(breaks = seq(65, 90, 5))

9.53 Model Evaluation

Variance Decomposition

The total variation in Y can be decomposed into two parts:

SST = SSE + SSR

Where:

  • SST (Total Sum of Squares): Total variation in Y
  • SSE (Sum of Squared Errors): Unexplained variation
  • SSR (Sum of Squares Regression): Explained variation
Tip

Imagine you’re trying to understand why people have different salaries at a company. Some people make $40,000, others make $80,000, and some make $120,000. There’s variation in salaries - they’re not all the same.

The Total Variation (SST)

This is simply asking: “How spread out are all the salaries from the average salary?”

If the average salary is $70,000, then SST measures how far each person’s salary differs from $70,000, squares those differences (to make them all positive), and adds them up. It’s the total amount of variation we’re trying to understand.

The Explained Variation (SSR)

Now suppose we build a model that predicts salary based on years of experience. Our model might say:

  • 2 years experience → predicts $50,000
  • 5 years experience → predicts $70,000
  • 10 years experience → predicts $100,000

SSR measures how much these predictions vary from the average. It’s the variation that our model successfully “explains” through the relationship with experience. It’s like saying “this much of the salary differences between people is because they have different amounts of experience.”

The Unexplained Variation (SSE)

This is what’s left over - the part our model can’t explain.

Maybe two people both have 5 years of experience, but one makes $65,000 and another makes $75,000. Our model predicted $70,000 for both. These differences from our predictions (the errors) represent variation due to other factors we haven’t captured - maybe education, performance, negotiation skills, or just random luck.

9.54 The Key Insight

The beautiful thing is that these three always relate as: Total Variation = Explained Variation + Unexplained Variation

It’s like having a pie chart of “why salaries differ”:

  • One slice is “differences explained by experience” (SSR)
  • The other slice is “differences due to other stuff” (SSE)
  • Together they make the whole pie (SST)

9.55 Why This Matters

This decomposition lets us calculate R-squared (R²), which is simply: R^2 = \frac{SSR}{SST} = \frac{\text{Explained Variation}}{\text{Total Variation}}

If R² = 0.70, it means our model explains 70% of why the Y values differ from each other. The remaining 30% is due to factors we haven’t captured or random noise.

Think of it like solving a mystery: SST is the total mystery to solve, SSR is how much of the mystery you’ve solved, and SSE is what remains unsolved!

Let’s calculate each:

Step 1: Calculate predicted values

Using \hat{Y} = 89.684 - 2.451X:

i X_i Y_i \hat{Y}_i = 89.684 - 2.451X_i
1 2.5 80 89.684 - 2.451(2.5) = 83.556
2 3.2 85 89.684 - 2.451(3.2) = 81.841
3 4.1 78 89.684 - 2.451(4.1) = 79.635
4 4.8 82 89.684 - 2.451(4.8) = 77.919
5 5.6 77 89.684 - 2.451(5.6) = 75.958
6 6.3 74 89.684 - 2.451(6.3) = 74.243
7 7.0 68 89.684 - 2.451(7.0) = 72.527
8 7.9 72 89.684 - 2.451(7.9) = 70.321

Step 2: Calculate sum of squares

SST (Total variation): SST = \sum(Y_i - \bar{Y})^2

From our earlier variance calculation: SST = (n-1) \times s^2_Y = 7 \times 30.5714 = 214

SSE (Unexplained variation): SSE = \sum(Y_i - \hat{Y}_i)^2

i Y_i \hat{Y}_i Y_i - \hat{Y}_i (Y_i - \hat{Y}_i)^2
1 80 83.556 -3.556 12.645
2 85 81.841 3.159 9.979
3 78 79.635 -1.635 2.673
4 82 77.919 4.081 16.655
5 77 75.958 1.042 1.086
6 74 74.243 -0.243 0.059
7 68 72.527 -4.527 20.494
8 72 70.321 1.679 2.819
Sum 66.410

SSR (Explained variation): SSR = SST - SSE = 214 - 66.410 = 147.590

# Verify calculations
anova_table <- anova(model)
cat("SSR (Regression):", anova_table$`Sum Sq`[1], "\n")
SSR (Regression): 147.5887 
cat("SSE (Residual):", anova_table$`Sum Sq`[2], "\n")
SSE (Residual): 66.41132 
cat("SST (Total):", sum(anova_table$`Sum Sq`), "\n")
SST (Total): 214 

R-squared (Coefficient of Determination)

R-squared tells us what proportion of the total variation in Y is explained by our model:

R^2 = \frac{SSR}{SST} = \frac{147.590}{214} = 0.690

This means our model explains 69.0% of the variation in exam performance.

Alternative formula using correlation: R^2 = r^2 = (-0.831)^2 = 0.691

# Verify R-squared
cat("R-squared:", summary(model)$r.squared, "\n")
R-squared: 0.6896667 
cat("Correlation squared:", cor(anxiety, performance)^2, "\n")
Correlation squared: 0.6896667 

9.56 Effect Size

The slope coefficient \hat{\beta_1} = -2.451 is our effect size. It tells us:

  • Magnitude: Each 1-point increase in anxiety is associated with a 2.45% decrease in performance
  • Practical significance: A student moving from low anxiety (3) to high anxiety (7) would see an expected performance decrease of 2.451 \times 4 = 9.80\%

Standardized effect size (correlation coefficient): The correlation r = -0.831 indicates a strong negative relationship.

Cohen’s guidelines for interpreting correlation:

  • Small effect: |r| = 0.10
  • Medium effect: |r| = 0.30
  • Large effect: |r| = 0.50

Our |r| = 0.831 represents a large effect size.

9.57 Confidence Intervals and Statistical Significance

Standard Error of the Regression

First, we need the standard error of the residuals:

s_e = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{66.410}{6}} = \sqrt{11.068} = 3.327

Standard Error of the Slope

The standard error of \hat{\beta_1} is:

SE(\hat{\beta_1}) = \frac{s_e}{\sqrt{\sum(X_i - \bar{X})^2}} = \frac{3.327}{\sqrt{24.555}} = \frac{3.327}{4.955} = 0.671
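
Both quantities can be computed directly from these formulas and checked against the regression summary shown below:

# Standard error of the regression and of the slope, from the formulas above
SSE <- sum(residuals(model)^2)
s_e <- sqrt(SSE / (length(anxiety) - 2))                  # ≈ 3.327
se_slope <- s_e / sqrt(sum((anxiety - mean(anxiety))^2))  # ≈ 0.671
c(s_e = s_e, se_slope = se_slope)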

95% Confidence Interval for the Slope

In plain English: A confidence interval gives us a range of plausible values for our true slope. If we repeated this study many times, 95% of the intervals we calculate would contain the true slope value.

The formula uses a critical value (approximately 2.45 for 6 degrees of freedom):

CI = \hat{\beta_1} \pm (critical\_value \times SE(\hat{\beta_1}))

CI = -2.451 \pm (2.45 \times 0.671) = -2.451 \pm 1.644 = [-4.095, -0.807]

Interpretation: We are 95% confident that the true change in performance per unit change in anxiety is between -4.10% and -0.81%.
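
The critical value comes from the t-distribution with n - 2 = 6 degrees of freedom, and the interval can be reproduced by hand:

# Critical value and manual 95% confidence interval for the slope
t_crit <- qt(0.975, df = 6)                         # ≈ 2.447
coef(model)["anxiety"] + c(-1, 1) * t_crit * 0.671  # ≈ [-4.09, -0.81]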

Statistical Significance

To test if the relationship is statistically significant (i.e., not due to chance), we calculate a t-statistic:

t = \frac{\hat{\beta_1}}{SE(\hat{\beta_1})} = \frac{-2.451}{0.671} = -3.653

In plain English: This t-value tells us how many standard errors our slope is away from zero. An absolute value of 3.65 is quite large (typically, values beyond ±2.45 are considered significant for our sample size), providing strong evidence of a real negative relationship between anxiety and performance.

# Verify calculations with R
summary(model)

Call:
lm(formula = performance ~ anxiety)

Residuals:
   Min     1Q Median     3Q    Max 
-4.526 -2.116  0.400  2.050  4.081 

Coefficients:
            Estimate Std. Error t value    Pr(>|t|)    
(Intercept)  89.6872     3.6682  24.450 0.000000308 ***
anxiety      -2.4516     0.6714  -3.652      0.0107 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.327 on 6 degrees of freedom
Multiple R-squared:  0.6897,    Adjusted R-squared:  0.6379 
F-statistic: 13.33 on 1 and 6 DF,  p-value: 0.01069
confint(model)
                2.5 %     97.5 %
(Intercept) 80.711582 98.6628838
anxiety     -4.094474 -0.8088048

9.58 Summary and Interpretation

What We’ve Learned

  1. There is a negative relationship between test anxiety and exam performance (r = -0.831)
  2. The relationship is moderately strong - anxiety explains 69.0% of the variation in performance
  3. The effect is substantial - each 1-point increase in anxiety is associated with about a 2.5% decrease in performance
  4. The relationship is statistically significant - very unlikely to be due to chance

Practical Implications

Our analysis suggests that high test anxiety impairs performance, possibly through:

  • Cognitive interference (worrying thoughts compete for working memory)
  • Physical symptoms (sweating, rapid heartbeat) that distract from the task
  • Negative self-talk reducing confidence and motivation
  • Test-taking behaviors (rushing, second-guessing answers)

Recommendations Based on Findings

Given the strong negative relationship, interventions might include:

  • Teaching anxiety management techniques (deep breathing, progressive muscle relaxation)
  • Cognitive restructuring to address catastrophic thinking
  • Study skills training to increase preparation confidence
  • Practice tests to reduce fear of the unknown

Limitations of Our Analysis

  1. Small sample size (n = 8) limits generalizability
  2. Linear assumption - the relationship might be curved (optimal anxiety might exist)
  3. Other variables not considered (preparation time, ability, sleep, etc.)
  4. Causation vs. correlation - we cannot prove anxiety causes poor performance
  5. Self-reported anxiety - subjective measures may not reflect physiological arousal

Complete R Code

# Complete analysis code
anxiety <- c(2.5, 3.2, 4.1, 4.8, 5.6, 6.3, 7.0, 7.9)
performance <- c(80, 85, 78, 82, 77, 74, 68, 72)

# Descriptive statistics
mean(anxiety); sd(anxiety)
mean(performance); sd(performance)

# Correlation
cor(anxiety, performance)

# Regression
model <- lm(performance ~ anxiety)
summary(model)
confint(model)

# Visualization
library(ggplot2)
ggplot(data.frame(anxiety, performance), aes(x = anxiety, y = performance)) +
  geom_point(size = 4) +
  geom_smooth(method = "lm", se = TRUE) +
  theme_minimal()

# Diagnostics
plot(model)

# ANOVA table
anova(model)

9.59 Example 5. District Magnitude and Electoral Disproportionality: A Comparative Analysis

Data Generating Process

Let’s set up a DGP where:

\begin{aligned} & Y_{\text{Gallagher}} = 12 - 0.8X_{\text{DM}} + \varepsilon \\ & \varepsilon \sim \mathcal{N}(0, 1) \\ & X_{\text{DM}} \in \{3, 5, 7, 10, 12, 15\} \end{aligned}

# DGP
magnitude <- c(3, 5, 7, 10, 12, 15)
epsilon <- rnorm(6, mean = 0, sd = 1)
gallagher <- 12 - 0.8 * magnitude + epsilon

# For reproducibility, use one fixed draw from this DGP, rounded to one decimal place
gallagher <- round(c(9.0, 7.8, 9.2, 4.1, 2.5, 1.7), 1)

# Show data
data.frame(
  District_Magnitude = magnitude,
  Gallagher_Index = gallagher
)
  District_Magnitude Gallagher_Index
1                  3             9.0
2                  5             7.8
3                  7             9.2
4                 10             4.1
5                 12             2.5
6                 15             1.7
# Initial model check
model <- lm(gallagher ~ magnitude)
summary(model)

Call:
lm(formula = gallagher ~ magnitude)

Residuals:
      1       2       3       4       5       6 
-0.6516 -0.4628  2.3260 -0.6908 -0.9020  0.3813 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  11.7349     1.3034   9.003 0.000843 ***
magnitude    -0.6944     0.1359  -5.110 0.006934 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.368 on 4 degrees of freedom
Multiple R-squared:  0.8672,    Adjusted R-squared:  0.834 
F-statistic: 26.11 on 1 and 4 DF,  p-value: 0.006934

Descriptive Statistics

Means: \bar{X} = \frac{3 + 5 + 7 + 10 + 12 + 15}{6} = \frac{52}{6} = 8.6667

\bar{Y} = \frac{9.0 + 7.8 + 9.2 + 4.1 + 2.5 + 1.7}{6} = \frac{34.3}{6} = 5.7167

# Verification
mean(magnitude)
[1] 8.666667
mean(gallagher)
[1] 5.716667

Variances: s^2_X = \frac{\sum(X_i - \bar{X})^2}{n-1}

Deviations for X:

  • (3 - 8.6667) = -5.6667
  • (5 - 8.6667) = -3.6667
  • (7 - 8.6667) = -1.6667
  • (10 - 8.6667) = 1.3333
  • (12 - 8.6667) = 3.3333
  • (15 - 8.6667) = 6.3333

Squared deviations:

32.1115 + 13.4445 + 2.7779 + 1.7777 + 11.1109 + 40.1107 = 101.3332

s^2_X = \frac{101.3332}{5} = 20.2666

For Y: Deviations: 3.2833, 2.0833, 3.4833, -1.6167, -3.2167, -4.0167

s^2_Y = \frac{56.3483}{5} = 11.2697

# Verification
var(magnitude)
[1] 20.26667
var(gallagher)
[1] 11.26967

Covariance and Correlation

Covariance: s_{XY} = \frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{n-1}

Products of deviations:

  • (-5.6667 × 3.2833) = -18.6057
  • (-3.6667 × 2.0833) = -7.6387
  • (-1.6667 × 3.4833) = -5.8056
  • (1.3333 × -1.6167) = -2.1556
  • (3.3333 × -3.2167) = -10.7223
  • (6.3333 × -4.0167) = -25.4391

Sum = -70.3670

s_{XY} = \frac{-70.3670}{5} = -14.0734

# Verification
cov(magnitude, gallagher)
[1] -14.07333

Correlation: r_{XY} = \frac{s_{XY}}{\sqrt{s^2_X}\sqrt{s^2_Y}} = \frac{-14.0734}{\sqrt{20.2666}\sqrt{11.2697}} = -0.9312

# Verification
cor(magnitude, gallagher)
[1] -0.9312157

OLS Regression (\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X)

Slope coefficient: \hat{\beta_1} = \frac{s_{XY}}{s^2_X} = \frac{-14.0734}{20.2666} = -0.6944

Intercept: \hat{\beta_0} = \bar{Y} - \hat{\beta_1}\bar{X} Steps:

  1. -0.6944 × 8.6667 = -6.0181
  2. \hat{\beta_0} = 5.7167 - (-6.0181) = 11.7348
# Verification
coef(model)
(Intercept)   magnitude 
 11.7348684  -0.6944079 

R-squared Calculation

Step 1: Calculate predicted values (\hat{Y}):

\hat{Y} = 11.7348 - 0.6944X

# Predicted values
y_hat <- 11.7348 - 0.6944 * magnitude
data.frame(
  Magnitude = magnitude,
  Gallagher = gallagher,
  Predicted = y_hat,
  row.names = 1:6
)
  Magnitude Gallagher Predicted
1         3       9.0    9.6516
2         5       7.8    8.2628
3         7       9.2    6.8740
4        10       4.1    4.7908
5        12       2.5    3.4020
6        15       1.7    1.3188

Step 2: Sum of Squares

SST = \sum(Y_i - \bar{Y})^2 = 56.3483

SSR = \sum(\hat{Y}_i - \bar{Y})^2 = 48.8632

SSE = \sum(Y_i - \hat{Y}_i)^2 = 7.4852

R-squared: R^2 = \frac{SSR}{SST} = \frac{48.8632}{56.3483} = 0.8672

# Verification
summary(model)$r.squared
[1] 0.8671626

Visualization - True vs. Estimated Parameters

  • True DGP: Y = 12 - 0.8X + ε
  • Estimated Model: Y = 11.7348 - 0.6944X
library(ggplot2)

# Create data frame with original data
df <- data.frame(
  magnitude = magnitude,
  gallagher = gallagher
)

# Create sequence for smooth lines
x_seq <- seq(min(magnitude), max(magnitude), length.out = 100)

# Calculate predicted values for both lines
true_dgp <- 12 - 0.8 * x_seq
estimated <- 11.7348 - 0.6944 * x_seq

# Combine into a data frame for plotting
lines_df <- data.frame(
  magnitude = rep(x_seq, 2),
  value = c(true_dgp, estimated),
  Model = rep(c("True DGP", "Estimated"), each = length(x_seq))
)

# Create plot
ggplot() +
  geom_line(data = lines_df, 
            aes(x = magnitude, y = value, color = Model, linetype = Model),
            size = 1) +
  geom_point(data = df, 
             aes(x = magnitude, y = gallagher),
             color = "black", 
             size = 3) +
  scale_color_manual(values = c("True DGP" = "red", "Estimated" = "blue")) +
  scale_linetype_manual(values = c("True DGP" = "dashed", "Estimated" = "solid")) +
  labs(
    title = "True DGP vs. Estimated Regression Line",
    subtitle = "Black points show observed data with random noise",
    x = "District Magnitude",
    y = "Gallagher Index",
    caption = "True DGP: Y = 12 - 0.8X + ε\nEstimated: Y = 11.73 - 0.69X"
  ) +
  theme_minimal() +
  theme(
    legend.position = "top",
    plot.caption = element_text(hjust = 0)
  )

Observations about Model Fit

  1. Slope Comparison
    • True slope: -0.8
    • Estimated slope: -0.69
    • The estimated slope is reasonably close to the true parameter
  2. Intercept Comparison
    • True intercept: 12
    • Estimated intercept: 11.73
    • The estimated intercept very closely approximates the true value
  3. Visual Patterns
    • The lines are nearly parallel, showing good slope recovery
    • Points scatter around both lines due to the random error term (ε)
    • The small sample size (n=6) leads to some imprecision in estimation
    • The estimated line (blue) provides a good approximation of the true DGP (red dashed)
  4. Impact of Random Error
    • The scatter of points around the true DGP line reflects the N(0,1) error term
    • This noise leads to the slight differences in estimated parameters
    • With a larger sample, we would expect even closer convergence to true parameters
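
A small sketch illustrating the last point, redrawing from the same DGP with a much larger sample (the seed and sample size are arbitrary choices):

# Re-estimate the model on a larger sample from the same DGP
set.seed(42)
n_big <- 1000
x_big <- sample(c(3, 5, 7, 10, 12, 15), n_big, replace = TRUE)
y_big <- 12 - 0.8 * x_big + rnorm(n_big, mean = 0, sd = 1)
coef(lm(y_big ~ x_big))  # estimates should land close to the true values 12 and -0.8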

Interpretation

  1. Strong negative correlation (r = -0.93) between district magnitude and electoral disproportionality
  2. For each unit increase in district magnitude, the Gallagher index decreases by approximately 0.69 points
  3. The model explains 86.72% of the variance in disproportionality
  4. The relationship appears strongly linear with moderate scatter
  5. The intercept (11.73) is the model's prediction at a district magnitude of zero; for a single-member district (magnitude 1) the predicted disproportionality is about 11.0

Study Context

  • Data represents simulated observations from a DGP with moderate noise
  • Sample shows how increasing district magnitude tends to reduce disproportionality
  • Random component reflects other institutional and political factors affecting disproportionality

Limitations

  1. Small sample size (n=6)
  2. Simulated rather than real-world data
  3. Assumes linear relationship
  4. Does not account for other institutional features

9.60 Appendix D: OLS Regression in Matrix Form - Manual Calculations (*)

A.1 Essential Linear Algebra Concepts

Before diving into OLS regression, we need to understand some fundamental linear algebra concepts.

A.1.1 Matrix Transpose

The transpose of a matrix \mathbf{A}, denoted \mathbf{A}^T, is obtained by swapping rows and columns:

\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix} \implies \mathbf{A}^T = \begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \end{bmatrix}

Properties:

  • (A^T)^T = A

  • (AB)^T = B^T A^T

  • (A + B)^T = A^T + B^T

A.1.2 Matrix Multiplication

For matrices \mathbf{A} (of size m \times n) and \mathbf{B} (of size n \times p), the product \mathbf{AB} is an m \times p matrix where element (i,j) is:

(\mathbf{AB})_{ij} = \sum_{k=1}^n a_{ik}b_{kj}

Example:

\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 1(5)+2(7) & 1(6)+2(8) \\ 3(5)+4(7) & 3(6)+4(8) \end{bmatrix} = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}
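
The same product can be checked in R with the %*% operator (byrow = TRUE fills each matrix row by row, matching the layout above):

# Verify the 2x2 multiplication example
A <- matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE)
B <- matrix(c(5, 6, 7, 8), nrow = 2, byrow = TRUE)
A %*% B  # rows (19, 22) and (43, 50)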

A.1.3 Identity Matrix

The identity matrix \mathbf{I}_n is an n \times n matrix with 1’s on the diagonal and 0’s elsewhere:

\mathbf{I}_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

Property: \mathbf{AI} = \mathbf{IA} = \mathbf{A}

A.1.4 Determinant

The determinant is a scalar value that characterizes certain properties of a square matrix. It indicates whether a matrix is invertible.

For a 2 \times 2 matrix:

\det\begin{bmatrix} a & b \\ c & d \end{bmatrix} = ad - bc

Example:

\det\begin{bmatrix} 3 & 8 \\ 4 & 6 \end{bmatrix} = 3(6) - 8(4) = 18 - 32 = -14
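
In R, det() performs the same computation:

# Verify the determinant example
det(matrix(c(3, 8, 4, 6), nrow = 2, byrow = TRUE))  # -14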

For a 3 \times 3 matrix (cofactor expansion along first row):

\det\begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} = a\begin{vmatrix} e & f \\ h & i \end{vmatrix} - b\begin{vmatrix} d & f \\ g & i \end{vmatrix} + c\begin{vmatrix} d & e \\ g & h \end{vmatrix}

Properties:

  • If \det(\mathbf{A}) = 0, the matrix is singular (not invertible)

  • If \det(\mathbf{A}) \neq 0, the matrix is non-singular (invertible)

  • \det(\mathbf{AB}) = \det(\mathbf{A})\det(\mathbf{B})

  • \det(\mathbf{A}^T) = \det(\mathbf{A})

A.1.5 Matrix Inverse

The inverse of matrix \mathbf{A}, denoted \mathbf{A}^{-1}, satisfies:

\mathbf{AA}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}

For a 2 \times 2 matrix:

\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad-bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}

For larger matrices, the inverse can be computed as:

\mathbf{A}^{-1} = \frac{1}{\det(\mathbf{A})}\text{adj}(\mathbf{A})

where \text{adj}(\mathbf{A}) is the adjugate (transpose of the cofactor matrix).

Properties:

  • (A^{-1})^{-1} = A

  • (AB)^{-1} = B^{-1}A^{-1}

  • (A^T)^{-1} = (A^{-1})^T
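
A quick numeric check with solve(), reusing the 2 × 2 matrix from the determinant example:

# Invert the matrix and confirm that M %*% M^{-1} returns the identity
M <- matrix(c(3, 8, 4, 6), nrow = 2, byrow = TRUE)
M_inv <- solve(M)
round(M %*% M_inv, 10)  # 2x2 identity matrix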

A.1.6 Matrix Rank

The rank of a matrix is the maximum number of linearly independent rows (or columns). A matrix \mathbf{X} of size n \times p has:

  • Full column rank if \text{rank}(\mathbf{X}) = p (all columns are independent)

  • Rank deficiency if \text{rank}(\mathbf{X}) < p (multicollinearity exists)

Full column rank is required for \mathbf{X}^T\mathbf{X} to be invertible in regression.

A.1.7 Quadratic Forms

A quadratic form is an expression of the type:

\mathbf{x}^T\mathbf{A}\mathbf{x} = \sum_{i=1}^n\sum_{j=1}^n a_{ij}x_ix_j

For a vector \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} and matrix \mathbf{A} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}:

\mathbf{x}^T\mathbf{A}\mathbf{x} = a_{11}x_1^2 + (a_{12}+a_{21})x_1x_2 + a_{22}x_2^2

This is crucial for understanding the OLS objective function.
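
In R a quadratic form is simply t(x) %*% A %*% x; a minimal sketch with small illustrative values:

x <- c(1, 2)
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # symmetric: a11 = 2, a12 = a21 = 1, a22 = 3
drop(t(x) %*% A %*% x)                 # 2(1)^2 + 2(1)(2) + 3(2)^2 = 18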

A.2 OLS Minimization Problem and Derivation

A.2.1 The Objective Function

In matrix notation, the linear regression model is:

\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}

where:

  • \mathbf{y} is an n \times 1 vector of observed responses

  • \mathbf{X} is an n \times (p+1) design matrix of predictors (including intercept)

  • \boldsymbol{\beta} is a (p+1) \times 1 vector of regression coefficients

  • \boldsymbol{\epsilon} is an n \times 1 vector of errors

The sum of squared residuals (also called the residual sum of squares, RSS) is:

S(\boldsymbol{\beta}) = \sum_{i=1}^n (y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2

In matrix form, this becomes:

S(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})

A.2.2 Expanding the Objective Function

Let’s expand the expression:

S(\boldsymbol{\beta}) = (\mathbf{y}^T - \boldsymbol{\beta}^T\mathbf{X}^T)(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})

= \mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}

Note that \mathbf{y}^T\mathbf{X}\boldsymbol{\beta} is a scalar, so it equals its transpose:

\mathbf{y}^T\mathbf{X}\boldsymbol{\beta} = (\mathbf{y}^T\mathbf{X}\boldsymbol{\beta})^T = \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y}

Therefore:

S(\boldsymbol{\beta}) = \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}

A.2.3 Taking the Derivative

To minimize S(\boldsymbol{\beta}), we take the derivative with respect to \boldsymbol{\beta} and set it equal to zero.

Matrix calculus rules needed:

  • \frac{\partial}{\partial \boldsymbol{\beta}}(\mathbf{a}^T\boldsymbol{\beta}) = \mathbf{a}

  • \frac{\partial}{\partial \boldsymbol{\beta}}(\boldsymbol{\beta}^T\mathbf{A}\boldsymbol{\beta}) = 2\mathbf{A}\boldsymbol{\beta} (when \mathbf{A} is symmetric)

Applying these rules:

\frac{\partial S}{\partial \boldsymbol{\beta}} = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}

A.2.4 Setting the Derivative to Zero (Normal Equations)

Set \frac{\partial S}{\partial \boldsymbol{\beta}} = \mathbf{0}:

-2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{0}

Simplify:

\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}

These are called the normal equations.

A.2.5 Solving for \hat{\boldsymbol{\beta}}

Assuming \mathbf{X}^T\mathbf{X} is invertible (full rank assumption), multiply both sides by (\mathbf{X}^T\mathbf{X})^{-1}:

\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}

This is the OLS estimator.
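
In R the estimator can be coded in one line. The helper below (ols_matrix is an illustrative name, not a standard function) solves the normal equations directly with solve(crossprod(X), crossprod(X, y)), which is numerically more stable than explicitly inverting \mathbf{X}^T\mathbf{X}; both forms give the same answer for well-conditioned problems.

# Illustrative helper: OLS via the normal equations X'X beta = X'y
ols_matrix <- function(X, y) {
  solve(crossprod(X), crossprod(X, y))   # crossprod(X) is t(X) %*% X
}
# Equivalent but less numerically stable:
# solve(t(X) %*% X) %*% t(X) %*% y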

A.2.6 Verifying This is a Minimum

To confirm this is a minimum (not maximum or saddle point), we check the second derivative:

\frac{\partial^2 S}{\partial \boldsymbol{\beta} \partial \boldsymbol{\beta}^T} = 2\mathbf{X}^T\mathbf{X}

This is the Hessian matrix. Since \mathbf{X}^T\mathbf{X} is positive definite (when \mathbf{X} has full column rank), the Hessian is positive definite, confirming we have a minimum.

A.3 Case 1: Simple Linear Regression (One Predictor)

A.3.1 Setting Up the Problem

Consider the following dataset with n = 5 observations:

Observation x y
1 1 2
2 2 4
3 3 5
4 4 4
5 5 5

Our model is: y_i = \beta_0 + \beta_1 x_i + \epsilon_i

A.3.2 Constructing the Design Matrix

The design matrix \mathbf{X} includes a column of ones for the intercept:

\mathbf{X} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \\ 1 & 5 \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} 2 \\ 4 \\ 5 \\ 4 \\ 5 \end{bmatrix}

A.3.3 Step 1: Calculate \mathbf{X}^T\mathbf{X}

\mathbf{X}^T = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 & 5 \end{bmatrix}

\mathbf{X}^T\mathbf{X} = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 & 5 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \\ 1 & 5 \end{bmatrix}

Element-by-element calculation:

  • (1,1) element: 1(1) + 1(1) + 1(1) + 1(1) + 1(1) = 5

  • (1,2) element: 1(1) + 1(2) + 1(3) + 1(4) + 1(5) = 15

  • (2,1) element: 1(1) + 2(1) + 3(1) + 4(1) + 5(1) = 15

  • (2,2) element: 1(1) + 2(2) + 3(3) + 4(4) + 5(5) = 1 + 4 + 9 + 16 + 25 = 55

\mathbf{X}^T\mathbf{X} = \begin{bmatrix} 5 & 15 \\ 15 & 55 \end{bmatrix}

A.3.4 Step 2: Calculate (\mathbf{X}^T\mathbf{X})^{-1}

For a 2 \times 2 matrix \mathbf{A} = \begin{bmatrix} a & b \\ c & d \end{bmatrix}, the inverse is:

\mathbf{A}^{-1} = \frac{1}{ad - bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}

Calculate the determinant:

\det(\mathbf{X}^T\mathbf{X}) = 5(55) - 15(15) = 275 - 225 = 50

Calculate the inverse:

(\mathbf{X}^T\mathbf{X})^{-1} = \frac{1}{50}\begin{bmatrix} 55 & -15 \\ -15 & 5 \end{bmatrix} = \begin{bmatrix} 1.1 & -0.3 \\ -0.3 & 0.1 \end{bmatrix}

A.3.5 Step 3: Calculate \mathbf{X}^T\mathbf{y}

\mathbf{X}^T\mathbf{y} = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 & 5 \end{bmatrix} \begin{bmatrix} 2 \\ 4 \\ 5 \\ 4 \\ 5 \end{bmatrix}

Element-by-element calculation:

  • First element: 1(2) + 1(4) + 1(5) + 1(4) + 1(5) = 20

  • Second element: 1(2) + 2(4) + 3(5) + 4(4) + 5(5) = 2 + 8 + 15 + 16 + 25 = 66

\mathbf{X}^T\mathbf{y} = \begin{bmatrix} 20 \\ 66 \end{bmatrix}

A.3.6 Step 4: Calculate \hat{\boldsymbol{\beta}}

\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \begin{bmatrix} 1.1 & -0.3 \\ -0.3 & 0.1 \end{bmatrix} \begin{bmatrix} 20 \\ 66 \end{bmatrix}

Element-by-element calculation:

  • \hat{\beta}_0: 1.1(20) + (-0.3)(66) = 22 - 19.8 = 2.2

  • \hat{\beta}_1: (-0.3)(20) + 0.1(66) = -6 + 6.6 = 0.6

\hat{\boldsymbol{\beta}} = \begin{bmatrix} 2.2 \\ 0.6 \end{bmatrix}

Result: The fitted regression line is \hat{y} = 2.2 + 0.6x
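
R verification (a minimal sketch): the matrix formula and lm() should both reproduce the hand-calculated coefficients.

x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
X <- cbind(1, x)                      # design matrix with intercept column

t(X) %*% X                            # [5 15; 15 55]
t(X) %*% y                            # [20; 66]
solve(t(X) %*% X) %*% t(X) %*% y      # (2.2, 0.6)

coef(lm(y ~ x))                       # intercept 2.2, slope 0.6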

A.4 Case 2: Multiple Linear Regression (Two Predictors)

A.4.1 Setting Up the Problem

Consider a dataset with n = 5 observations and two predictors:

Observation x_1 x_2 y
1 1 3 3
2 2 2 5
3 3 5 7
4 4 4 9
5 5 7 10

Our model is: y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \epsilon_i

A.4.2 Constructing the Design Matrix

\mathbf{X} = \begin{bmatrix} 1 & 1 & 3 \\ 1 & 2 & 2 \\ 1 & 3 & 5 \\ 1 & 4 & 4 \\ 1 & 5 & 7 \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} 3 \\ 5 \\ 7 \\ 9 \\ 10 \end{bmatrix}

A.4.3 Step 1: Calculate \mathbf{X}^T\mathbf{X}

\mathbf{X}^T = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 & 5 \\ 3 & 2 & 5 & 4 & 7 \end{bmatrix}

\mathbf{X}^T\mathbf{X} = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 & 5 \\ 3 & 2 & 5 & 4 & 7 \end{bmatrix} \begin{bmatrix} 1 & 1 & 3 \\ 1 & 2 & 2 \\ 1 & 3 & 5 \\ 1 & 4 & 4 \\ 1 & 5 & 7 \end{bmatrix}

Element-by-element calculation:

Row 1:

  • (1,1): 1 + 1 + 1 + 1 + 1 = 5

  • (1,2): 1 + 2 + 3 + 4 + 5 = 15

  • (1,3): 3 + 2 + 5 + 4 + 7 = 21

Row 2:

  • (2,1): 1 + 2 + 3 + 4 + 5 = 15

  • (2,2): 1 + 4 + 9 + 16 + 25 = 55

  • (2,3): 3 + 4 + 15 + 16 + 35 = 73

Row 3:

  • (3,1): 3 + 2 + 5 + 4 + 7 = 21

  • (3,2): 3 + 4 + 15 + 16 + 35 = 73

  • (3,3): 9 + 4 + 25 + 16 + 49 = 103

\mathbf{X}^T\mathbf{X} = \begin{bmatrix} 5 & 15 & 21 \\ 15 & 55 & 73 \\ 21 & 73 & 103 \end{bmatrix}

A.4.4 Step 2: Calculate (\mathbf{X}^T\mathbf{X})^{-1}

For a 3 \times 3 matrix, we use: \mathbf{A}^{-1} = \frac{1}{\det(\mathbf{A})}\text{adj}(\mathbf{A})

Step 2a: Calculate the determinant using cofactor expansion along the first row:

\det(\mathbf{X}^T\mathbf{X}) = 5\begin{vmatrix}55 & 73 \\ 73 & 103\end{vmatrix} - 15\begin{vmatrix}15 & 73 \\ 21 & 103\end{vmatrix} + 21\begin{vmatrix}15 & 55 \\ 21 & 73\end{vmatrix}

Calculate each 2 \times 2 determinant:

  • \begin{vmatrix}55 & 73 \\ 73 & 103\end{vmatrix} = 55(103) - 73(73) = 5665 - 5329 = 336

  • \begin{vmatrix}15 & 73 \\ 21 & 103\end{vmatrix} = 15(103) - 73(21) = 1545 - 1533 = 12

  • \begin{vmatrix}15 & 55 \\ 21 & 73\end{vmatrix} = 15(73) - 55(21) = 1095 - 1155 = -60

\det(\mathbf{X}^T\mathbf{X}) = 5(336) - 15(12) + 21(-60) = 1680 - 180 - 1260 = 240

Step 2b: Calculate the cofactor matrix

The cofactor C_{ij} is (-1)^{i+j} times the determinant of the matrix obtained by deleting row i and column j.

C_{11} = (+1)\begin{vmatrix}55 & 73 \\ 73 & 103\end{vmatrix} = 336

C_{12} = (-1)\begin{vmatrix}15 & 73 \\ 21 & 103\end{vmatrix} = -12

C_{13} = (+1)\begin{vmatrix}15 & 55 \\ 21 & 73\end{vmatrix} = -60

C_{21} = (-1)\begin{vmatrix}15 & 21 \\ 73 & 103\end{vmatrix} = -(15 \times 103 - 21 \times 73) = -(1545 - 1533) = -12

C_{22} = (+1)\begin{vmatrix}5 & 21 \\ 21 & 103\end{vmatrix} = 5(103) - 21(21) = 515 - 441 = 74

C_{23} = (-1)\begin{vmatrix}5 & 15 \\ 21 & 73\end{vmatrix} = -(5 \times 73 - 15 \times 21) = -(365 - 315) = -50

C_{31} = (+1)\begin{vmatrix}15 & 21 \\ 55 & 73\end{vmatrix} = 15(73) - 21(55) = 1095 - 1155 = -60

C_{32} = (-1)\begin{vmatrix}5 & 21 \\ 15 & 73\end{vmatrix} = -(5 \times 73 - 21 \times 15) = -(365 - 315) = -50

C_{33} = (+1)\begin{vmatrix}5 & 15 \\ 15 & 55\end{vmatrix} = 5(55) - 15(15) = 275 - 225 = 50

Cofactor matrix:

\mathbf{C} = \begin{bmatrix} 336 & -12 & -60 \\ -12 & 74 & -50 \\ -60 & -50 & 50 \end{bmatrix}

Step 2c: Calculate the adjugate (transpose of cofactor matrix):

\text{adj}(\mathbf{X}^T\mathbf{X}) = \mathbf{C}^T = \begin{bmatrix} 336 & -12 & -60 \\ -12 & 74 & -50 \\ -60 & -50 & 50 \end{bmatrix}

Step 2d: Calculate the inverse:

(\mathbf{X}^T\mathbf{X})^{-1} = \frac{1}{240}\begin{bmatrix} 336 & -12 & -60 \\ -12 & 74 & -50 \\ -60 & -50 & 50 \end{bmatrix} = \begin{bmatrix} 1.4 & -0.05 & -0.25 \\ -0.05 & 0.308\overline{3} & -0.208\overline{3} \\ -0.25 & -0.208\overline{3} & 0.208\overline{3} \end{bmatrix}

For simplicity in calculations, we’ll use:

(\mathbf{X}^T\mathbf{X})^{-1} \approx \begin{bmatrix} 1.400 & -0.050 & -0.250 \\ -0.050 & 0.308 & -0.208 \\ -0.250 & -0.208 & 0.208 \end{bmatrix}

A.4.5 Step 3: Calculate \mathbf{X}^T\mathbf{y}

\mathbf{X}^T\mathbf{y} = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 & 5 \\ 3 & 2 & 5 & 4 & 7 \end{bmatrix} \begin{bmatrix} 3 \\ 5 \\ 7 \\ 9 \\ 10 \end{bmatrix}

Element-by-element calculation:

  • First: 3 + 5 + 7 + 9 + 10 = 34

  • Second: 3 + 10 + 21 + 36 + 50 = 120

  • Third: 9 + 10 + 35 + 36 + 70 = 160

\mathbf{X}^T\mathbf{y} = \begin{bmatrix} 34 \\ 120 \\ 160 \end{bmatrix}

A.4.6 Step 4: Calculate \hat{\boldsymbol{\beta}}

\hat{\boldsymbol{\beta}} = \begin{bmatrix} 1.400 & -0.050 & -0.250 \\ -0.050 & 0.308 & -0.208 \\ -0.250 & -0.208 & 0.208 \end{bmatrix} \begin{bmatrix} 34 \\ 120 \\ 160 \end{bmatrix}

Element-by-element calculation:

  • \hat{\beta}_0 = 1.400(34) - 0.050(120) - 0.250(160) = 47.6 - 6.0 - 40.0 = 1.6

  • \hat{\beta}_1 = -0.050(34) + 0.308(120) - 0.208(160) = -1.7 + 36.96 - 33.28 = 1.98 \approx 2.0

  • \hat{\beta}_2 = -0.250(34) - 0.208(120) + 0.208(160) = -8.5 - 24.96 + 33.28 = -0.18 \approx -0.2

\hat{\boldsymbol{\beta}} = \begin{bmatrix} 1.6 \\ 2.0 \\ -0.2 \end{bmatrix}

Result: The fitted regression is \hat{y} = 1.6 + 2.0x_1 - 0.2x_2

(Note: The small discrepancies arise from rounding (\mathbf{X}^T\mathbf{X})^{-1} to three decimals. With exact fractions, \hat{\boldsymbol{\beta}} = \begin{bmatrix} 8/5 \\ 59/30 \\ -1/6 \end{bmatrix} \approx \begin{bmatrix} 1.600 \\ 1.967 \\ -0.167 \end{bmatrix}; the rounded version simply illustrates the manual calculation process.)
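
R verification (a minimal sketch): computing the estimator without rounding recovers the exact coefficients, and lm() agrees.

x1 <- c(1, 2, 3, 4, 5)
x2 <- c(3, 2, 5, 4, 7)
y  <- c(3, 5, 7, 9, 10)
X  <- cbind(1, x1, x2)

solve(t(X) %*% X) %*% t(X) %*% y      # 1.600, 1.967, -0.167 (to 3 decimals)
coef(lm(y ~ x1 + x2))                 # same values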

A.5 The Gauss-Markov Theorem

A.5.1 Statement of the Theorem

Under the classical linear regression assumptions, the OLS estimator \hat{\boldsymbol{\beta}} is the Best Linear Unbiased Estimator (BLUE) of \boldsymbol{\beta}.

“Best” means it has the minimum variance among all linear unbiased estimators.

A.5.2 Required Assumptions

  1. Linearity: \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}

  2. Strict Exogeneity: E[\boldsymbol{\epsilon}|\mathbf{X}] = \mathbf{0}

  3. No Perfect Multicollinearity: \mathbf{X} has full column rank, i.e., \text{rank}(\mathbf{X}) = p+1

  4. Spherical Errors: \text{Var}(\boldsymbol{\epsilon}|\mathbf{X}) = \sigma^2\mathbf{I}_n

    • Homoscedasticity: \text{Var}(\epsilon_i) = \sigma^2 for all i

    • No autocorrelation: \text{Cov}(\epsilon_i, \epsilon_j) = 0 for i \neq j

A.5.3 Proof that OLS is Unbiased

Starting with: \hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}

Substitute \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}:

\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon})

= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon}

= \mathbf{I}\boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon}

= \boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon}

Taking expectations:

E[\hat{\boldsymbol{\beta}}|\mathbf{X}] = \boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E[\boldsymbol{\epsilon}|\mathbf{X}]

Since E[\boldsymbol{\epsilon}|\mathbf{X}] = \mathbf{0}:

E[\hat{\boldsymbol{\beta}}|\mathbf{X}] = \boldsymbol{\beta}

Therefore, OLS is unbiased.
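
Unbiasedness can also be illustrated by simulation (a sketch with assumed values: true coefficients (2, 0.5), n = 50, standard normal errors, and a fixed seed). Averaging the OLS estimates across many simulated samples should recover the true coefficients.

set.seed(123)                               # assumed seed for reproducibility
n <- 50
beta_true <- c(2, 0.5)                      # assumed true intercept and slope
X <- cbind(1, runif(n, 0, 10))              # fixed design matrix

estimates <- replicate(5000, {
  y <- X %*% beta_true + rnorm(n)           # errors satisfy E[eps | X] = 0
  drop(solve(t(X) %*% X, t(X) %*% y))       # OLS estimate for this sample
})
rowMeans(estimates)                         # approximately (2, 0.5)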

A.5.4 Variance of the OLS Estimator

From \hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon}:

\text{Var}(\hat{\boldsymbol{\beta}}|\mathbf{X}) = \text{Var}[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon}|\mathbf{X}]

Since (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T is non-random conditional on \mathbf{X}:

= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \text{Var}(\boldsymbol{\epsilon}|\mathbf{X}) \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}

With spherical errors \text{Var}(\boldsymbol{\epsilon}|\mathbf{X}) = \sigma^2\mathbf{I}:

= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T (\sigma^2\mathbf{I}) \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}

= \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}

= \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}

This is the variance-covariance matrix of \hat{\boldsymbol{\beta}}.

A.5.5 Example: Variance Calculation for Simple Regression

From Case 1, we had:

(\mathbf{X}^T\mathbf{X})^{-1} = \begin{bmatrix} 1.1 & -0.3 \\ -0.3 & 0.1 \end{bmatrix}

We estimate \sigma^2 using the residual sum of squares:

\hat{\sigma}^2 = \frac{\text{RSS}}{n - (p + 1)} = \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n - 2}

For our example, the fitted values are: \hat{y}_i = 2.2 + 0.6x_i

i y_i \hat{y}_i y_i - \hat{y}_i (y_i - \hat{y}_i)^2
1 2 2.8 -0.8 0.64
2 4 3.4 0.6 0.36
3 5 4.0 1.0 1.00
4 4 4.6 -0.6 0.36
5 5 5.2 -0.2 0.04

\text{RSS} = 0.64 + 0.36 + 1.00 + 0.36 + 0.04 = 2.4

\hat{\sigma}^2 = \frac{2.4}{5 - 2} = \frac{2.4}{3} = 0.8

The variance-covariance matrix of \hat{\boldsymbol{\beta}} is:

\text{Var}(\hat{\boldsymbol{\beta}}) = 0.8 \begin{bmatrix} 1.1 & -0.3 \\ -0.3 & 0.1 \end{bmatrix} = \begin{bmatrix} 0.88 & -0.24 \\ -0.24 & 0.08 \end{bmatrix}

Standard errors:

  • \text{SE}(\hat{\beta}_0) = \sqrt{0.88} \approx 0.938

  • \text{SE}(\hat{\beta}_1) = \sqrt{0.08} \approx 0.283

These standard errors are used for hypothesis testing and confidence intervals.
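
R verification (a minimal sketch): the estimated error variance, the variance-covariance matrix \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}, and the standard errors can be checked against vcov() applied to an lm() fit.

x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
X <- cbind(1, x)

beta_hat   <- solve(t(X) %*% X) %*% t(X) %*% y   # (2.2, 0.6)
res        <- y - X %*% beta_hat                 # residuals
sigma2_hat <- sum(res^2) / (length(y) - 2)       # RSS / (n - 2) = 0.8

vcov_beta <- sigma2_hat * solve(t(X) %*% X)      # [0.88 -0.24; -0.24 0.08]
sqrt(diag(vcov_beta))                            # SEs: 0.938 and 0.283

vcov(lm(y ~ x))                                  # same variance-covariance matrix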

A.5.6 Why OLS is “Best” (Sketch of Proof)

The Gauss-Markov theorem states that among all linear unbiased estimators, OLS has the minimum variance.

Consider any other linear unbiased estimator:

\tilde{\boldsymbol{\beta}} = \mathbf{C}\mathbf{y}

where \mathbf{C} is some matrix. For unbiasedness:

E[\tilde{\boldsymbol{\beta}}] = \mathbf{C}E[\mathbf{y}] = \mathbf{C}\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta}

This requires \mathbf{C}\mathbf{X} = \mathbf{I}.

We can write \mathbf{C} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{D} for some matrix \mathbf{D} where \mathbf{D}\mathbf{X} = \mathbf{0}.

The variance of \tilde{\boldsymbol{\beta}} is:

\text{Var}(\tilde{\boldsymbol{\beta}}) = \sigma^2\mathbf{C}\mathbf{C}^T

= \sigma^2[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{D}][(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{D}]^T

= \sigma^2(\mathbf{X}^T\mathbf{X})^{-1} + \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}(\mathbf{D}\mathbf{X})^T + \sigma^2\mathbf{D}\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} + \sigma^2\mathbf{D}\mathbf{D}^T

Because \mathbf{D}\mathbf{X} = \mathbf{0}, the two cross terms vanish:

= \sigma^2(\mathbf{X}^T\mathbf{X})^{-1} + \sigma^2\mathbf{D}\mathbf{D}^T

Since \mathbf{D}\mathbf{D}^T is positive semi-definite:

\text{Var}(\tilde{\boldsymbol{\beta}}) - \text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2\mathbf{D}\mathbf{D}^T \geq \mathbf{0}

Therefore, OLS has the smallest variance among all linear unbiased estimators.

A.6 Key Takeaways

  1. Matrix notation provides an elegant and scalable framework for regression that extends naturally from simple to multiple regression.

  2. The OLS minimization derives from setting \frac{\partial S}{\partial \boldsymbol{\beta}} = \mathbf{0}, yielding the normal equations \mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}.

  3. The OLS solution \hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} requires that \mathbf{X}^T\mathbf{X} be invertible (no perfect multicollinearity).

  4. Under the Gauss-Markov assumptions, OLS is BLUE: Best (minimum variance), Linear, Unbiased Estimator.

  5. The variance of OLS estimates is \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}, which can be used for inference.

  6. Key linear algebra tools include matrix transpose, multiplication, determinant, inverse, and rank—all essential for understanding OLS.

  7. Assumption violations (e.g., heteroscedasticity, autocorrelation) don’t make OLS biased, but they invalidate the “best” property and standard inference procedures.