1  Introduction to Statistics and Data Analysis for Political Science

1.1 What is Statistics?

Statistics is the science of learning from data in the presence of uncertainty. More specifically, statistics provides:

  • Methods for collecting data systematically and without bias
  • Tools for describing and summarizing what we observe in our data
  • Techniques for making inferences about populations based on samples
  • Frameworks for quantifying uncertainty in our conclusions
  • Approaches for modeling relationships between variables

In political science, statistics helps us move beyond anecdotal evidence and personal impressions to make rigorous, evidence-based claims about political phenomena.

1.2 Key Concepts: Parameters, Statistics, and Estimates

Parameters vs. Statistics

A fundamental distinction in statistics is between parameters and statistics:

Population Parameters

  • Numerical characteristics of the entire population
  • Usually unknown and what we want to learn about
  • Denoted by Greek letters: \mu (mu) for mean, \sigma (sigma) for standard deviation, \pi (pi) for proportion
  • Examples: The true percentage of all Americans who support universal healthcare

Sample Statistics

  • Numerical characteristics calculated from sample data
  • What we actually observe and calculate
  • Denoted by Roman letters: \bar{x} for sample mean, s for sample standard deviation, \hat{p} for sample proportion
  • Examples: The percentage of 1,000 survey respondents who support universal healthcare

The Inference Process: From Statistics to Parameters

The core of statistical inference involves using sample statistics to make educated guesses about population parameters:

\text{Sample Statistic} \xrightarrow{\text{Statistical Inference}} \text{Population Parameter}

Example: If 52% of our sample (\hat{p} = 0.52) supports a candidate, we use this statistic to estimate the population parameter (\pi) representing true support among all voters.

Estimates and Estimators

An estimator is the method or formula used to approximate a parameter. An estimate is the specific numerical result from applying that estimator to a particular sample.

  • Estimator: The sample mean \bar{x} = \frac{\sum x_i}{n}
  • Estimate: \bar{x} = 6.3 years of education (the actual number from our data)
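
In code, the distinction is the difference between a rule and the number that rule returns for one particular sample. A minimal sketch with hypothetical data:

# Estimator: the rule, here the sample mean
# Estimate: the value the rule produces for this one sample
education_years <- c(8, 12, 16, 12, 10, 14)  # hypothetical sample
mean(education_years)                        # the estimate, x-bar = 12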

1.3 The Soup Analogy: Understanding Statistical Inference

Imagine you’re a chef making a large pot of soup for 1,000 people. You want to know if the soup has the right amount of salt, but you can’t taste all of it. Instead, you take a small spoonful to taste.

The Population: The entire pot of soup (1,000 servings)

The Sample: Your spoonful

The Parameter: The true saltiness of the entire pot (unknown)

The Statistic: The saltiness of your spoonful (what you can measure)

Statistical Inference: Using the spoonful’s saltiness to draw conclusions about the entire pot

Key Insights from the Soup Analogy:

  1. Random sampling matters: You must stir the soup first and take your spoonful from a random location. If you always sample from the top, you might miss that the salt settled to the bottom.

  2. Sample size affects precision: A bigger spoonful gives you a better sense of the overall saltiness than a tiny sip.

  3. Uncertainty is inherent: Even with good sampling, your spoonful might not perfectly represent the whole pot. There’s always some uncertainty.

  4. Systematic bias ruins everything: If someone secretly added extra salt to just your spoonful, your inference about the whole pot would be wrong. This represents systematic measurement bias: taking more spoonfuls the same way would not fix it.

  5. Inference has limits: You can estimate the average saltiness, but your spoonful can’t tell you if some portions are saltier than others (variability within the population).

This analogy captures the essence of statistical thinking: we use small, carefully selected samples to learn about much larger populations, always acknowledging the uncertainty inherent in this process.

1.4 A Real-World Example: What Predicts Electoral Success?

Let’s start with a question that gets to the heart of political science: What makes politicians win elections?

Imagine you’re a campaign manager trying to understand why some incumbents win by landslides while others barely scrape by. You have data on 200 recent congressional elections, including each incumbent’s approval rating, the state of the local economy, and their victory margin.

# Create realistic electoral data
library(ggplot2)  # plotting package used throughout

set.seed(123)
n_elections <- 200

# Generate correlated predictors (realistic scenario)
approval_rating <- runif(n_elections, 35, 85)
economic_growth <- rnorm(n_elections, 2.5, 1.5)
campaign_spending <- rnorm(n_elections, 800000, 200000)

# Create victory margin with realistic relationships
victory_margin <- -15 + 
  0.6 * approval_rating +           # Strong approval effect
  2.5 * economic_growth +           # Economic voting
  0.000003 * campaign_spending +    # Money helps, but less than you'd think
  rnorm(n_elections, 0, 8)          # Random factors

# Create dataset
election_data <- data.frame(
  district = 1:n_elections,
  approval = approval_rating,
  econ_growth = economic_growth,
  spending = campaign_spending,
  victory_margin = victory_margin,
  won = victory_margin > 0
)

# Quick visualization
p1 <- ggplot(election_data, aes(x = approval, y = victory_margin)) +
  geom_point(alpha = 0.6, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.7) +
  labs(title = "Approval Rating vs. Victory Margin",
       x = "Approval Rating (%)",
       y = "Victory Margin (percentage points)",
       subtitle = "Points above the dashed line represent wins")

print(p1)

# Run the regression
simple_model <- lm(victory_margin ~ approval, data = election_data)
summary(simple_model)

Call:
lm(formula = victory_margin ~ approval, data = election_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-22.1083  -6.6037   0.2415   5.9329  26.9635 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.60413    2.82575  -1.629    0.105    
approval     0.55908    0.04569  12.237   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.83 on 198 degrees of freedom
Multiple R-squared:  0.4306,    Adjusted R-squared:  0.4277 
F-statistic: 149.7 on 1 and 198 DF,  p-value: < 2.2e-16

Figure Note: This scatter plot shows the relationship between approval ratings (x-axis) and electoral victory margins (y-axis). Each point represents one election. The red line shows the “line of best fit” from linear regression, with the gray band indicating uncertainty. Points above the dashed horizontal line (y=0) represent electoral victories.

What we just discovered: Each 1-point increase in approval rating is associated with about a 0.56-point increase in victory margin. The fitted line crosses zero at an approval rating of about 8.2%, but that value lies far below any rating in our data (the minimum is about 35%), which is why almost every incumbent in this simulated dataset wins.

But wait—is approval rating the whole story? Let’s see what happens when we consider multiple factors:

# Multiple regression model
library(dplyr)       # mutate() and the %>% pipe
library(broom)       # tidy() model summaries
library(knitr)       # kable() tables
library(kableExtra)  # kable_styling()

full_model <- lm(victory_margin ~ approval + econ_growth + spending, data = election_data)

# Clean presentation of results
model_results <- tidy(full_model) %>%
  mutate(
    estimate = round(estimate, 4),
    p.value = round(p.value, 3),
    significant = ifelse(p.value < 0.05, "Yes", "No")
  )

kable(model_results, 
      col.names = c("Variable", "Effect Size", "Std Error", "t-statistic", "p-value", "Significant?"),
      caption = "What Really Drives Electoral Success?") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

What Really Drives Electoral Success?

Variable      Effect Size   Std Error   t-statistic   p-value   Significant?
(Intercept)   -13.0053      3.6930474   -3.5215743    0.001     Yes
approval        0.5719      0.0410961   13.9168686    0.000     Yes
econ_growth     2.7816      0.3879416    7.1700375    0.000     Yes
spending        0.0000      0.0000028    0.2881961    0.774     No

The story gets more interesting: When we account for multiple factors simultaneously, we see that:

  • Approval rating remains the strongest predictor
  • Economic growth also matters significantly
  • Campaign spending has a much smaller effect than many assume

This is the power of regression analysis—it helps us disentangle complex relationships and understand what really matters in politics.

By the end of this course, you’ll understand:

  • How this analysis works and what assumptions it requires

  • When we can interpret these relationships as causal vs. merely correlational

  • How to assess the reliability and practical significance of our findings

  • What could go wrong and how to avoid common pitfalls

Now let’s build the foundation to understand how we got these results and what they really mean.

1.5 Why Statistics for Political Science?

1.6 The Political World is Full of Data

Political science has evolved from a primarily theoretical discipline to one that increasingly relies on empirical evidence. Whether we’re studying:

  • Election outcomes: Why do people vote the way they do?

  • Public opinion: What shapes attitudes toward immigration or climate policy?

  • International relations: What factors predict conflict between nations?

  • Policy effectiveness: Did a new education policy actually improve outcomes?

We need systematic ways to analyze data and draw conclusions that go beyond anecdotes and personal impressions.

1.7 From Intuition to Evidence

Consider this question: “Does democracy lead to economic growth?”

Your intuition might suggest yes—democratic countries tend to be wealthier. But is this causation or correlation? Are there exceptions? How confident can we be in our conclusions?

Statistics provides the tools to move from hunches to evidence-based answers, helping us distinguish between what seems true and what actually is true.

1.8 The Statistical Mindset

Developing statistical thinking means learning to:

  1. Embrace uncertainty: We never know population values exactly, and that’s okay

  2. Think about variation: Why do things differ? What patterns exist?

  3. Question relationships: Does A cause B, or are they just related?

  4. Be skeptical: Could this pattern have happened by chance?

1.9 Core Concepts: Building Blocks of Statistical Thinking

1.10 Population vs. Sample: The Foundation of Inference

The Fundamental Challenge

In political science, we’re often interested in understanding entire populations—the complete set of units we want to study. However, studying entire populations is usually impossible, impractical, or unnecessary.

What Can Be a Population?

A population in political science can consist of various types of units:

Individuals

  • Population: All 240 million American adults
  • Sample: 1,000 randomly selected adults in a survey
  • Research question: What percentage support universal healthcare?

Countries

  • Population: All 195 sovereign nations in the world
  • Sample: 50 countries from different regions and development levels
  • Research question: Does democracy correlate with economic growth?

Subnational Units

  • Population: All 3,143 U.S. counties
  • Sample: 200 randomly selected counties
  • Research question: How does unemployment affect crime rates?

Organizations

  • Population: All NGOs registered with the United Nations
  • Sample: 100 NGOs working in different policy areas
  • Research question: What factors predict NGO effectiveness?

Events or Time Periods

  • Population: All elections held in Europe since 1945
  • Sample: 300 elections from different countries and decades
  • Research question: How do economic conditions affect incumbent vote share?

Legislative Units

  • Population: All bills introduced in Congress from 2000-2020
  • Sample: 500 randomly selected bills
  • Research question: What predicts whether a bill becomes law?

The Sample Solution and Key Insight

A sample is a subset of the population we actually observe and measure. The key insight of statistics is that we can learn about populations by studying samples—if we’re careful about how we choose them.

From our sample, we want to make inferences about the population:

\text{Sample Statistic} \rightarrow \text{Population Parameter}

For example: If 52% of our sample supports Candidate A, what can we say about support in the entire population?

The fundamental principle: random selection gives every unit in the population an equal chance of being included, preventing systematic bias.

Visualizing Sampling

Let’s see how different sample sizes affect our estimates:

# Simulate sampling from a population
set.seed(2024)  # seed added for reproducibility
population_size <- 1000000
true_proportion <- 0.60  # True population parameter

# Take different sized samples
sample_sizes <- c(100, 500, 1000, 5000)
results <- data.frame()

for (size in sample_sizes) {
  for (i in 1:20) {
    sample_result <- rbinom(1, size, true_proportion) / size
    results <- rbind(results, 
                     data.frame(size = size, 
                               trial = i,
                               estimate = sample_result))
  }
}

# Visualize
ggplot(results, aes(x = factor(size), y = estimate)) +
  geom_point(alpha = 0.6, size = 2, color = "steelblue") +
  geom_hline(yintercept = true_proportion, color = "red", 
             linetype = "dashed", size = 1) +
  labs(title = "How Sample Size Affects Accuracy",
       subtitle = "Red line shows true population value (60%)",
       x = "Sample Size",
       y = "Sample Estimate") +
  theme_minimal() +
  scale_y_continuous(labels = scales::percent)

Figure Note: This scatter plot demonstrates how sample size affects the accuracy of estimates. Each blue dot represents one sample estimate. Notice how larger samples (right side) cluster more tightly around the true population value (red dashed line), illustrating reduced sampling variability.

Notice how larger samples cluster more tightly around the true value—this illustrates the law of large numbers.

The Representation Problem

Not all samples are created equal. Consider these sampling methods:

  1. Convenience Sample: Surveying students in your political science class
    • Problem: Not representative of all voters
  2. Voluntary Response Sample: Online poll on a news website
    • Problem: Self-selection bias
  3. Random Sample: Each unit has equal probability of selection
    • Solution: Best chance of representative sample

1.11 Randomness: The Foundation of Statistical Inference

What is Randomness?

In statistics, randomness doesn’t mean chaos. It means structured uncertainty: a process whose individual outcomes are unpredictable, but whose long-run pattern follows known probabilities.

Randomness has two key properties:

  1. Unpredictability in individual cases: We can’t know if a specific voter will turn out

  2. Predictability in aggregate: We can estimate that 60% of registered voters will turn out
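
A quick simulation makes the second property concrete (a sketch assuming the 60% turnout rate above):

# Each voter's decision is an unpredictable coin flip...
set.seed(1)
votes <- rbinom(10000, size = 1, prob = 0.6)  # 1 = turns out, 0 = stays home
# ...but the running turnout rate settles near 60%
running_rate <- cumsum(votes) / seq_along(votes)
round(running_rate[c(10, 100, 1000, 10000)], 3)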

Why Randomness Matters

Randomness appears in political science in two crucial ways:

Random Sampling

  • Prevents systematic bias in surveys
  • Allows us to quantify uncertainty
  • Foundation for statistical inference

Random Assignment (in experiments)

  • Ensures treatment and control groups are comparable
  • Allows causal inference
  • Eliminates confounding

The Power of Random Sampling

Here’s the remarkable fact: by embracing randomness in our sampling, we gain the ability to make precise statements about populations.

For example: If we randomly sample 1,000 voters and find 55% support a candidate, statistics tells us that:

  • The true population support is probably close to 55%
  • We can calculate exactly how close (typically within about 3 percentage points)
  • We can state our confidence level (usually 95%)

This seems like magic, but it works because randomness follows predictable patterns in large samples.

# Demonstrate why random sampling works
library(purrr)  # map_dfr()

set.seed(42)

# Create a "population" with known characteristics
population_support <- c(rep("Candidate A", 5200), rep("Candidate B", 4800))
true_support_A <- mean(population_support == "Candidate A")

# Function to take a random sample and calculate support
take_sample <- function(n) {
  sample_result <- sample(population_support, n)
  return(mean(sample_result == "Candidate A"))
}

# Take many samples of different sizes
sample_sizes <- c(50, 100, 500, 1000)
results <- map_dfr(sample_sizes, function(n) {
  estimates <- replicate(100, take_sample(n))
  data.frame(
    sample_size = n,
    estimate = estimates,
    true_value = true_support_A
  )
})

ggplot(results, aes(x = factor(sample_size), y = estimate)) +
  geom_boxplot(alpha = 0.7, fill = "lightblue") +
  geom_hline(yintercept = true_support_A, color = "red", linetype = "dashed", size = 1) +
  labs(
    title = "Random Sampling Gets Closer to Truth with Larger Samples",
    subtitle = "Red line shows true population value (52%)",
    x = "Sample Size",
    y = "Estimated Support for Candidate A",
    caption = "Each box shows 100 random samples of that size"
  ) +
  scale_y_continuous(labels = scales::percent_format())

Figure Note: Box plots summarize the distribution of estimates across multiple samples. The box shows the middle 50% of estimates (25th to 75th percentile), with the dark line indicating the median. “Whiskers” extend to show the range, and any outliers appear as individual points. Notice how boxes become narrower (less variable) as sample size increases.

Key insight: Random sampling allows us to make valid inferences about populations, even when we can’t observe everyone.

1.12 Measurement: Turning Concepts into Numbers

The Measurement Challenge in Political Science

Political scientists face a unique challenge: many of our most important concepts resist easy measurement:

  • How do you measure “democracy”?
  • What number captures “political ideology”?
  • How do you quantify “institutional strength”?
  • How do you measure “political participation”?

Types of Measurement

Nominal (Categories without order)

  • Party affiliation: Democrat, Republican, Independent
  • Country: USA, UK, Germany
  • Vote choice: Candidate A, Candidate B, Did not vote
  • Mathematical operations: Only counting/frequencies

Ordinal (Ordered categories)

  • Education level: High school < Bachelor’s < Master’s < PhD
  • Survey responses: Strongly disagree < Disagree < Neutral < Agree < Strongly agree
  • Agreement: Strongly disagree, Disagree, Neutral, Agree, Strongly agree
  • Mathematical operations: Ordering, but not meaningful distances

Interval (Numeric with consistent intervals)

  • Years: The difference between 2020 and 2021 equals the difference between 2023 and 2024
  • Temperature in Celsius: 0°C is an arbitrary zero, not “no temperature”
  • Mathematical operations: Addition, subtraction, averaging (but not ratios)

Ratio (Interval with true zero)

  • Vote count: 0 votes means no votes
  • GDP: Can meaningfully say one country’s GDP is twice another’s
  • Age: 18, 19, 20, … years
  • Income: $25,000, $50,000, $75,000
  • Mathematical operations: All operations including ratios
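
R’s data types mirror these measurement levels; a brief sketch with hypothetical values:

# Nominal: unordered categories
party <- factor(c("Democrat", "Republican", "Independent"))

# Ordinal: ordered categories; comparisons are meaningful, distances are not
agreement <- factor(c("Disagree", "Agree"),
                    levels = c("Strongly disagree", "Disagree", "Neutral",
                               "Agree", "Strongly agree"),
                    ordered = TRUE)
agreement[1] < agreement[2]  # TRUE: the ordering is defined

# Ratio: numeric with a true zero, so ratios make sense
gdp <- c(1000, 2000)
gdp[2] / gdp[1]  # one economy is twice the size of the other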

Measurement Error: The Inevitable Companion

Every measurement contains some error. Consider measuring “democracy”:

\text{Observed Democracy Score} = \text{True Democracy Level} + \text{Measurement Error}

Think of it this way:

What we observe = True value + Error

There are two types of error:

Systematic Error (Bias)

  • Consistently pushes results in one direction
  • Doesn’t get better with more data
  • Example: A survey question worded as “Don’t you agree that taxes are too high?” will systematically overestimate anti-tax sentiment

Random Error

  • Unpredictable fluctuations up and down
  • Averages out with more data
  • Example: Some people might misunderstand a question or accidentally check the wrong box, but these errors go in both directions

The key difference: We can reduce random error by collecting more data, but systematic error requires fixing our measurement approach.
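
A two-line simulation shows the difference in behavior (a sketch with made-up error distributions):

# Random error centers on zero, so it averages away; systematic error does not
set.seed(3)
mean(rnorm(100000, mean = 0, sd = 1))    # close to 0: random error washes out
mean(rnorm(100000, mean = 0.5, sd = 1))  # close to 0.5: the bias remains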

# Illustrate measurement error
set.seed(101)
n_countries <- 50
true_democracy <- runif(n_countries, 0, 10)
measurement_error <- rnorm(n_countries, 0, 1)
observed_democracy <- true_democracy + measurement_error

measurement_data <- data.frame(
  country = 1:n_countries,
  true_value = true_democracy,
  observed_value = observed_democracy,
  error = measurement_error
)

ggplot(measurement_data, aes(x = true_value, y = observed_value)) +
  geom_point(alpha = 0.7, size = 2) +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  labs(
    title = "Measurement Error in Democracy Scores",
    subtitle = "Points should fall on red line if measurement were perfect",
    x = "True Democracy Level",
    y = "Observed Democracy Score"
  ) +
  coord_equal()

Figure Note: This scatter plot illustrates measurement error. If measurement were perfect, all points would fall exactly on the red diagonal line (observed = true value). Deviation from this line represents measurement error - the difference between what we observe and the true underlying value.

1.13 Variables and Variation

What Makes a Variable?

A variable is any characteristic that can take different values across units of observation. In political science:

  • Units: Countries, individuals, elections, policies, years
  • Variables: GDP, voting preference, democracy score, conflict occurrence

The Fundamental Model

The core of statistical thinking can be expressed as:

Y = f(X) + \text{error}

This says: Our outcome (Y) is some function of our predictors (X), plus unpredictable variation.

Components:

  • Y = Dependent variable (what we’re trying to explain)
  • X = Independent variable(s) (what we think explains Y)
  • f() = The relationship (often assumed linear)
  • error = Everything else we can’t explain

This model is the foundation for all statistical analysis—from simple correlations to complex machine learning.
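
Every simulation in this chapter is this model written in code. A minimal version, with an assumed linear f and made-up coefficients:

# Y = f(X) + error, with f(X) = 2 + 3X
set.seed(7)
x <- runif(100)            # predictor
error <- rnorm(100, 0, 1)  # everything we can't explain
y <- 2 + 3 * x + error     # outcome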

1.14 Statistical Error and Uncertainty

Types of Statistical Error

Sampling Error

This is the uncertainty that comes from studying a sample instead of the whole population.

  • Gets smaller with bigger samples
  • We can calculate it mathematically
  • Example: A poll of 1,000 people has about ±3% margin of error

Non-sampling Error

Everything else that can go wrong:

  • Biased questions
  • People lying or misremembering
  • Missing important groups
  • Data entry mistakes
  • Non-response bias: Certain groups don’t respond to surveys
  • Selection bias: Our sample isn’t representative

Important limitation: Bigger samples don’t fix non-sampling error!

Quantifying Uncertainty: Standard Error and Confidence Intervals

When we estimate something from a sample (like the proportion supporting a candidate), we can calculate our uncertainty.

Standard Error: Measures how much our estimate might vary if we repeated the sampling

  • Smaller standard error = more precise estimate
  • Gets smaller with larger samples

Confidence Interval: A range where we’re fairly confident the true value lies

  • 95% confidence interval: We’re 95% confident the truth is in this range
  • Wider interval = more uncertainty
  • Example: “52% support ± 3%” means we’re confident true support is between 49% and 55%

We express uncertainty through confidence intervals:

\text{Estimate} \pm \text{Margin of Error}

Example: Political Polling

# Poll of 1000 voters
n <- 1000
p_hat <- 0.52

# Calculate uncertainty
se <- sqrt(p_hat * (1 - p_hat) / n)
margin <- 1.96 * se

# Create visualization of multiple polls
polls <- data.frame(
  poll = LETTERS[1:5],
  estimate = c(0.52, 0.51, 0.53, 0.50, 0.54),
  se = rep(se, 5)
)

ggplot(polls, aes(x = poll, y = estimate)) +
  geom_point(size = 3, color = "darkblue") +
  geom_errorbar(aes(ymin = estimate - 1.96*se, 
                    ymax = estimate + 1.96*se),
                width = 0.2, color = "darkblue") +
  geom_hline(yintercept = 0.5, linetype = "dashed", color = "red") +
  labs(title = "Five Polls of the Same Race",
       subtitle = "Error bars show 95% confidence intervals",
       x = "Poll",
       y = "Support for Candidate") +
  theme_minimal() +
  scale_y_continuous(labels = scales::percent, limits = c(0.45, 0.58))

Figure Note: This plot shows point estimates (dots) with error bars representing 95% confidence intervals. Each poll provides a single estimate, but the error bars show the range where we’re 95% confident the true value lies. The overlapping intervals suggest the polls are consistent with each other.

All five polls tell the same story: the race is close. The confidence intervals help us avoid overinterpreting small differences.

# Demonstrate confidence intervals
set.seed(123)
true_prop <- 0.52
n_samples <- 50
sample_size <- 1000

# Generate sample proportions
sample_props <- rbinom(n_samples, sample_size, true_prop) / sample_size

# Calculate 95% confidence intervals
se <- sqrt(sample_props * (1 - sample_props) / sample_size)
ci_lower <- sample_props - 1.96 * se
ci_upper <- sample_props + 1.96 * se

ci_data <- data.frame(
  sample_id = 1:n_samples,
  estimate = sample_props,
  ci_lower = ci_lower,
  ci_upper = ci_upper,
  contains_truth = (ci_lower <= true_prop) & (true_prop <= ci_upper)
)

ggplot(ci_data[1:20, ], aes(x = sample_id, y = estimate, color = contains_truth)) +
  geom_point(size = 2) +
  geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper), width = 0.2) +
  geom_hline(yintercept = true_prop, color = "red", linetype = "dashed", size = 1) +
  scale_color_manual(values = c("FALSE" = "red", "TRUE" = "blue"),
                     labels = c("FALSE" = "Misses true value", "TRUE" = "Contains true value")) +
  labs(
    title = "95% Confidence Intervals",
    subtitle = "About 95% of intervals should contain the true value (red line)",
    x = "Sample Number",
    y = "Estimated Support",
    color = "Interval Performance"
  ) +
  scale_y_continuous(labels = scales::percent_format()) +
  theme(legend.position = "bottom")

Figure Note: This plot demonstrates the behavior of 95% confidence intervals. Each horizontal line represents one confidence interval from a different sample. About 95% of intervals (blue) contain the true population value (red dashed line), while about 5% (red) miss it. This illustrates what “95% confidence” means.

1.15 Statistical Significance: Making Sense of Uncertain Evidence

1.16 Starting with Intuition: The “Innocent Until Proven Guilty” Analogy

Think of statistical significance like a courtroom trial:

  • Null hypothesis (H_0): “The defendant is innocent” (no real effect exists)
  • Alternative hypothesis (H_1): “The defendant is guilty” (a real effect exists)
  • Evidence: Our data and statistical test
  • Verdict: Reject H_0 (find significance) or fail to reject H_0 (no significance)

Just like in court, we need strong evidence to reject the presumption of innocence (no effect).

1.17 What is Statistical Significance?

When we observe a difference in our data, we face a fundamental question: Is this difference “real” (reflecting something true about the population) or just “noise” (random variation from sampling)?

Statistical significance helps us answer:

“Is what we observed likely due to a real effect, or could it just be random luck?”

We’re distinguishing between:

  • Signal: Real patterns that reflect true relationships
  • Noise: Random variation that doesn’t mean anything

1.18 The Logic of Hypothesis Testing

The null hypothesis is our default assumption—usually that nothing interesting is happening:

  • There’s no difference between groups
  • There’s no relationship between variables
  • The treatment has no effect

We maintain this skeptical stance until the data convinces us otherwise.

1.19 Understanding p-values: Three Ways to Think About It

The p-value is probably the most misunderstood concept in statistics. Here are three ways to think about it:

1. The Surprise Level

“How surprised should I be to see this data if nothing was really going on?”

  • Small p-value (< 0.05) = Very surprised = Maybe something IS going on
  • Large p-value (> 0.05) = Not surprised = Probably just random variation

2. The Coin Flip Analogy

Imagine you suspect a coin is unfair. You flip it 10 times and get 8 heads.

  • The p-value asks: “If the coin were actually fair, how often would I get 8 or more heads in 10 flips?”
  • If this rarely happens with a fair coin, we might conclude the coin is biased
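
That question is a one-line computation in R (a quick check of the analogy):

# Probability of 8 or more heads in 10 flips of a fair coin
sum(dbinom(8:10, size = 10, prob = 0.5))  # about 0.055
# binom.test(8, 10, p = 0.5) runs the corresponding formal test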

3. The Formal Definition

A p-value answers this specific question: if the null hypothesis were true (there were really no effect in the population), what is the probability of getting data at least as extreme as what we observed?

A Visual Understanding of p-values

# Simulate what happens under the null hypothesis
set.seed(789)
null_distribution <- rnorm(10000, mean = 0, sd = 1)

# Our observed value
observed <- 2.1

# Create visualization
hist_data <- data.frame(values = null_distribution)

ggplot(hist_data, aes(x = values)) +
  geom_histogram(aes(y = ..density..), bins = 50, 
                 fill = "lightblue", color = "black", alpha = 0.7) +
  geom_density(color = "darkblue", size = 1) +
  geom_vline(xintercept = observed, color = "red", size = 1.5) +
  geom_vline(xintercept = -observed, color = "red", size = 1.5) +
  geom_area(stat = "function", fun = dnorm, 
            xlim = c(observed, 4), fill = "red", alpha = 0.3) +
  geom_area(stat = "function", fun = dnorm, 
            xlim = c(-4, -observed), fill = "red", alpha = 0.3) +
  labs(title = "What the p-value Measures",
       subtitle = "If there were no effect (null hypothesis true), how often would we see results this extreme?",
       x = "Possible Results Under Null Hypothesis",
       y = "Probability") +
  theme_minimal() +
  annotate("text", x = 2.5, y = 0.1, label = "p-value:\nProbability of\nthis or more\nextreme", 
           color = "red", fontface = "bold")

Figure Note: This histogram shows the distribution of possible results if the null hypothesis were true (no real effect). The red vertical lines mark our observed result, and the red shaded areas represent the p-value - the probability of getting results this extreme or more extreme under the null hypothesis.

The red areas show the p-value—the probability of seeing something as extreme as our result (red lines) if the null hypothesis were true.

1.20 Examples: Understanding p-values in Context

Example 1: Do Campaign Ads Really Work?

Scenario: A candidate runs TV ads in 20 randomly selected cities, but not in 20 other similar cities. After the campaign:

  • Ad cities: 58% vote for the candidate
  • No-ad cities: 54% vote for the candidate
  • Difference: 4 percentage points

The Question: Is this 4% difference real, or just random variation?

# Simulate the campaign ad example
set.seed(123)

# Create data for the example
ad_cities <- c(rep("With Ads", 20), rep("No Ads", 20))
vote_share <- c(
  rnorm(20, 0.58, 0.08),  # Cities with ads
  rnorm(20, 0.54, 0.08)   # Cities without ads
)

campaign_data <- data.frame(
  treatment = ad_cities,
  vote_share = vote_share
)

# Calculate the difference
mean_with_ads <- mean(campaign_data$vote_share[campaign_data$treatment == "With Ads"])
mean_no_ads <- mean(campaign_data$vote_share[campaign_data$treatment == "No Ads"])
observed_diff <- mean_with_ads - mean_no_ads

# Perform t-test
t_test_result <- t.test(vote_share ~ treatment, data = campaign_data)
p_val <- t_test_result$p.value

# Visualize the data
ggplot(campaign_data, aes(x = treatment, y = vote_share, fill = treatment)) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(width = 0.2, alpha = 0.6, size = 2) +
  stat_summary(fun = mean, geom = "point", shape = 23, size = 3, fill = "red") +
  labs(
    title = "Do Campaign Ads Increase Vote Share?",
    subtitle = paste0("Difference: ", round(observed_diff, 3), 
                     " (", round(observed_diff*100, 1), " percentage points), p = ", round(p_val, 3)),
    x = "Treatment Condition",
    y = "Vote Share",
    caption = "Red diamonds show group averages"
  ) +
  scale_y_continuous(labels = scales::percent_format()) +
  theme(legend.position = "none")

Figure Note: Box plots compare distributions between two groups. The box shows the middle 50% of observations, with the median as the central line. The red diamonds show group means. Jittered points show individual data values. The similar box heights suggest comparable variability within each group.

Interpretation: suppose the test returns p = 0.02.

Since p < 0.05: “If ads had no real effect, we’d see a difference this large only 2% of the time by chance alone. This is unlikely enough that we conclude the effect is real.”

Had p been > 0.05: “This difference could easily happen by chance, so we can’t conclude the ads worked.”

Example 2: Voter Turnout and Weather

Scenario: Does rain decrease voter turnout? We compare turnout on rainy vs. sunny election days:

# Simulate weather and turnout data
set.seed(456)

weather_data <- data.frame(
  weather = c(rep("Rainy", 25), rep("Sunny", 25)),
  turnout = c(
    rnorm(25, 0.62, 0.08),  # Rainy days
    rnorm(25, 0.68, 0.08)   # Sunny days
  )
)

# Calculate statistics
rain_turnout <- mean(weather_data$turnout[weather_data$weather == "Rainy"])
sunny_turnout <- mean(weather_data$turnout[weather_data$weather == "Sunny"])
weather_diff <- sunny_turnout - rain_turnout

# Statistical test
weather_test <- t.test(turnout ~ weather, data = weather_data)
weather_p <- weather_test$p.value

# Visualization
ggplot(weather_data, aes(x = weather, y = turnout, fill = weather)) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(width = 0.2, alpha = 0.6, size = 2) +
  stat_summary(fun = mean, geom = "point", shape = 23, size = 3, fill = "black") +
  labs(
    title = "Does Weather Affect Voter Turnout?",
    subtitle = paste0("Sunny days: ", round(sunny_turnout*100, 1), "%, Rainy days: ", 
                     round(rain_turnout*100, 1), "%, Difference: ", 
                     round(weather_diff*100, 1), " points, p = ", round(weather_p, 3)),
    x = "Weather Condition",
    y = "Voter Turnout Rate"
  ) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_fill_manual(values = c("Rainy" = "lightblue", "Sunny" = "gold")) +
  theme(legend.position = "none")

Figure Note: This box plot compares voter turnout distributions between rainy and sunny election days. The black diamonds indicate group averages. The separation between boxes suggests a meaningful difference, with sunny days showing higher average turnout than rainy days.

Result: p = 0.075

What this means:

  • We found a 4.4 percentage point difference in turnout
  • If weather had no real effect, we’d see a difference this large about 7.5% of the time just by chance
  • Since p > 0.05, we cannot conclude that weather significantly affects turnout

Example 3: When Results Are NOT Significant

Scenario: Does social media use affect political knowledge?

# Simulate a case with no significant effect
set.seed(999)

social_media_data <- data.frame(
  social_media_hours = runif(150, 0, 8),
  political_knowledge = rnorm(150, 50, 15)  # No relationship with social media
)

# Add tiny relationship (not detectable with this sample size)
social_media_data$political_knowledge <- social_media_data$political_knowledge + 
  0.2 * social_media_data$social_media_hours + rnorm(150, 0, 15)

sm_model <- lm(political_knowledge ~ social_media_hours, data = social_media_data)
sm_summary <- summary(sm_model)
sm_p <- sm_summary$coefficients[2, 4]

ggplot(social_media_data, aes(x = social_media_hours, y = political_knowledge)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(
    title = "Social Media Use and Political Knowledge",
    subtitle = paste0("Effect: ", round(coef(sm_model)[2], 2), 
                     " points per hour, p = ", round(sm_p, 3),
                     " (NOT statistically significant)"),
    x = "Social Media Hours per Day",
    y = "Political Knowledge Score",
    caption = "Large confidence interval suggests high uncertainty"
  )

Figure Note: This scatter plot shows the relationship between social media hours and political knowledge scores. The red line represents the fitted regression line, with the gray band showing the 95% confidence interval. The wide confidence band indicates high uncertainty in the relationship, consistent with the non-significant p-value.


Result: p = 0.542 (not significant)

What this means:

  • We cannot conclude that social media use affects political knowledge
  • This doesn’t prove there’s no effect—just that we can’t detect one with confidence
  • Possible reasons: Effect is too small, sample too small, or no real effect exists

The 0.05 Threshold: Convention, Not Magic

We often use p < 0.05 as our cutoff for “statistical significance.” But why 0.05?

  • It’s just a convention established by statistician Ronald Fisher
  • It means: “If nothing were going on, we’d see this less than 5% of the time”
  • It’s not a magic number—p = 0.049 isn’t meaningfully different from p = 0.051

Think of it this way: If you ran 100 studies where nothing was really happening, about 5 would show “significant” results just by chance. The 0.05 threshold accepts this 5% false positive rate.
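
We can check that claim by simulation: a sketch in which every “study” compares two groups drawn from the same population, so the null hypothesis is true by construction.

# 100 studies where nothing is really happening
set.seed(42)
p_vals <- replicate(100, t.test(rnorm(50), rnorm(50))$p.value)
sum(p_vals < 0.05)  # false positives; in the long run, about 5 per 100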

# Visualize different significance levels
p_values <- c(0.001, 0.01, 0.03, 0.05, 0.08, 0.15, 0.4)
evidence_strength <- c("Very Strong", "Strong", "Moderate", "Weak", "Weak", "Very Weak", "No Evidence")
significance <- ifelse(p_values < 0.05, "Significant", "Not Significant")

p_data <- data.frame(
  p_value = p_values,
  strength = factor(evidence_strength, levels = c("No Evidence", "Very Weak", "Weak", "Moderate", "Strong", "Very Strong")),
  significant = significance
)

ggplot(p_data, aes(x = reorder(paste0("p = ", p_value), p_value), y = 1, fill = significant)) +
  geom_col() +
  geom_text(aes(label = strength), vjust = -0.5, size = 3) +
  labs(
    title = "Interpreting Different p-values",
    subtitle = "Convention: p < 0.05 is considered 'statistically significant'",
    x = "p-value",
    y = "",
    fill = "Statistical Significance"
  ) +
  scale_fill_manual(values = c("Not Significant" = "lightcoral", "Significant" = "lightblue")) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.x = element_text(angle = 45, hjust = 1))


Figure Note: This bar chart illustrates how different p-values correspond to different levels of evidence strength. The conventional p < 0.05 threshold divides results into “significant” (blue) and “not significant” (red) categories, but the actual strength of evidence varies continuously.

1.21 Common Misconceptions About p-values

❌ Wrong Interpretations:

  1. “p = 0.03 means there’s a 3% chance our hypothesis is wrong”
    • Why it’s wrong: p-values don’t tell us the probability our hypothesis is correct
  2. “p = 0.07 means our effect is smaller than p = 0.02”
    • Why it’s wrong: p-values reflect uncertainty, not effect size
  3. “Non-significant results mean there’s no effect”
    • Why it’s wrong: Could be due to small sample size or large measurement error

✅ Correct Interpretations:

  1. “p = 0.03 means: if there were no real effect, we’d see data this extreme only 3% of the time”

  2. “p = 0.07 means we don’t have strong enough evidence to reject the null hypothesis”

  3. “Non-significant results mean we can’t confidently distinguish the signal from the noise”

1.22 Walking Through an Example: Gender and Political Knowledge

Let’s walk through hypothesis testing with a concrete example:

Research Question: Do men and women differ in political knowledge?

Step 1: Set Up Hypotheses

  • Null hypothesis (H_0): Men and women have equal political knowledge
  • Alternative hypothesis (H_1): There is a difference

Step 2: Collect Data

  • We survey 200 people (100 men, 100 women)
  • Give them a 10-question political knowledge quiz
  • Men average 6.5 correct, women average 5.8 correct
  • Difference = 0.7 questions

Step 3: Ask the Key Question

“If men and women truly have equal knowledge (null hypothesis), how likely is it we’d see a 0.7 question difference in our sample?”

Step 4: Calculate the p-value

Using statistical formulas (which we’ll learn later), we find p = 0.12

Step 5: Interpret

  • p = 0.12 means: If there were truly no difference, we’d see a gap this large about 12% of the time
  • Since 12% > 5% (our threshold), we don’t reject the null hypothesis
  • Conclusion: We don’t have strong evidence of a gender difference in political knowledge

Important: This doesn’t prove men and women have equal knowledge! It just means our evidence isn’t strong enough to conclude they differ.
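
The formula behind that p-value is simple enough to sketch from the summary statistics alone (the common standard deviation of about 3.2 quiz points is an assumed value; the example does not report it):

# Two-sample t-test from summary statistics
m_men <- 6.5; m_women <- 5.8  # group means on the 10-question quiz
n <- 100                      # respondents per group
s <- 3.2                      # assumed common standard deviation
se_diff <- sqrt(s^2 / n + s^2 / n)
t_stat <- (m_men - m_women) / se_diff
2 * pt(-abs(t_stat), df = 2 * n - 2)  # two-sided p, approximately 0.12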

1.23 Confidence Intervals: An Alternative Approach

Instead of just asking “is there a difference?”, we can ask “how big is the difference?”

Using our example above:

  • Observed difference: 0.7 questions
  • 95% Confidence Interval: -0.2 to 1.6 questions

This means:

  • We’re 95% confident the true difference is between -0.2 (women score higher) and 1.6 (men score higher)
  • Since this interval includes 0 (no difference), we can’t rule out equality
  • But it also tells us the maximum plausible difference is 1.6 questions

Confidence intervals give us more information than p-values alone!
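
With the same assumed standard deviation as in the sketch above, the interval follows directly:

# 95% confidence interval for the difference in means (s = 3.2 assumed, as before)
se_diff <- sqrt(3.2^2 / 100 + 3.2^2 / 100)
0.7 + c(-1, 1) * qt(0.975, df = 198) * se_diff  # roughly -0.2 to 1.6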

1.24 Putting It All Together: A Step-by-Step Guide

When you encounter a statistical test result:

Step 1: Look at the effect size

  • How big is the difference/relationship?
  • Is it practically meaningful?

Step 2: Check the p-value

  • p < 0.05: Evidence against “no effect”
  • p > 0.05: Insufficient evidence for an effect

Step 3: Consider the context

  • Sample size (small samples make significance harder to achieve)
  • Study design (well-designed studies are more trustworthy)
  • Prior knowledge (does this make theoretical sense?)

Step 4: Interpret carefully

  • Significant ≠ important
  • Non-significant ≠ no effect
  • Always consider confidence intervals

Remember: Statistical significance is about the reliability of evidence, not the importance of findings.

1.25 Regression: The Workhorse of Political Science

1.26 What is Regression?

Regression analysis is the most important statistical tool in political science. It models relationships between variables and operationalizes our Fundamental Model:

Y = f(X) + \epsilon

Regression helps us answer questions like:

  • How much does education increase political participation?
  • What factors predict electoral success?
  • Do democratic institutions promote economic growth?

1.27 Simple Linear Regression

The basic regression equation:

Y_i = \alpha + \beta X_i + \epsilon_i

Where:

  • Y_i = outcome for observation i
  • X_i = predictor for observation i
  • \alpha = intercept (expected value of Y when X = 0)
  • \beta = slope (change in Y for one-unit change in X)
  • \epsilon_i = error term

Example: Education and Political Participation

Question: Does education increase political participation?

# Create simulated data
set.seed(789)
n <- 200

education <- rnorm(n, 14, 3)  # Years of education
participation <- 0.3 + 0.05 * education + rnorm(n, 0, 0.5)  # Political participation index

# Keep participation between 0 and 1
participation <- pmax(0, pmin(1, participation))

participation_data <- data.frame(
  education = education,
  participation = participation
)

# Fit regression
model <- lm(participation ~ education, data = participation_data)

# Plot with regression line
ggplot(participation_data, aes(x = education, y = participation)) +
  geom_point(alpha = 0.6, color = "steelblue") +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  labs(
    title = "Education and Political Participation",
    x = "Years of Education",
    y = "Political Participation Index",
    subtitle = paste0("Estimated effect: ", round(coef(model)[2], 3), 
                     " (p = ", round(summary(model)$coefficients[2,4], 3), ")")
  )

# Display results
summary(model)

Call:
lm(formula = participation ~ education, data = participation_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.8313 -0.1324  0.1262  0.1913  0.3813 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.455184   0.098061   4.642 6.28e-06 ***
education   0.024150   0.006926   3.487 0.000602 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2951 on 198 degrees of freedom
Multiple R-squared:  0.05786,   Adjusted R-squared:  0.0531 
F-statistic: 12.16 on 1 and 198 DF,  p-value: 0.0006016

Figure Note: This scatter plot shows individual observations (points) with a fitted regression line (red) and confidence band (gray). The regression line represents the best linear fit through the data points, showing how political participation tends to increase with education levels.

Interpretation: Each additional year of education is associated with about a 0.024 increase in political participation.

Interpreting Regression Output

Key elements:

Coefficients: Effect sizes

  • The intercept (\alpha): Expected value of Y when X = 0
  • The slope (\beta): Change in Y for one-unit change in X

Standard errors: Uncertainty in estimates

t-statistics: Coefficient / Standard Error

p-values: Test of H_0: \beta = 0

R-squared: Proportion of variance explained

In our example:

  • Intercept: When education = 0, expected participation is 0.455
  • Slope: Each year of education increases participation by 0.024 points
  • R-squared: Education explains 5.8% of variation in participation

1.28 Multiple Regression: Controlling for Confounders

Real-world relationships are complex. Multiple regression estimates effects while controlling for other variables:

Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_k X_{ki} + \epsilon_i

Each \beta_j represents the effect of X_j holding all other variables constant.

Example: What Influences Electoral Success?

Let’s return to our opening example but dive deeper:

# Use our election data from the introduction
# Run logistic regression for binary outcome
turnout_model <- glm(won ~ approval + econ_growth + spending, 
                     data = election_data, 
                     family = binomial)

# For easier interpretation, also run linear regression on victory margin
linear_model <- lm(victory_margin ~ approval + econ_growth + spending, 
                   data = election_data)

# Clean presentation of results
model_results <- tidy(linear_model) %>%
  mutate(
    estimate = round(estimate, 4),
    p.value = round(p.value, 3),
    significant = ifelse(p.value < 0.05, "Yes", "No")
  )

kable(model_results, 
      col.names = c("Variable", "Effect Size", "Std Error", "t-statistic", "p-value", "Significant?"),
      caption = "Multiple Regression: Predictors of Electoral Victory Margin") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Multiple Regression: Predictors of Electoral Victory Margin

Variable      Effect Size   Std Error   t-statistic   p-value   Significant?
(Intercept)   -13.0053      3.6930474   -3.5215743    0.001     Yes
approval        0.5719      0.0410961   13.9168686    0.000     Yes
econ_growth     2.7816      0.3879416    7.1700375    0.000     Yes
spending        0.0000      0.0000028    0.2881961    0.774     No
# Visualize the multiple regression
# Create partial regression plots
library(car)
par(mfrow = c(2, 2))
avPlots(linear_model, main = "Partial Regression Plots")

Key insights from multiple regression:

  1. Approval rating has the strongest effect: Each percentage point increase in approval adds about 0.57 percentage points to victory margin

  2. Economic growth also matters: Each percentage point of economic growth adds about 2.78 percentage points to victory margin

  3. Campaign spending has a smaller effect than many assume: Each additional $1 million in spending is associated with only about a 0.8-point gain in victory margin, and the estimate is not statistically significant

  4. Controlling matters: The estimated effect of each variable can change once we hold the others constant, so a single-predictor model can mislead

1.29 Key Assumptions and Limitations

What regression assumes (in simple terms):

  1. Linear relationship: The effect is constant (one more year of education always has the same effect)

  2. Independence: Each observation is separate (one person’s vote doesn’t affect another’s in our data)

  3. Random sampling: Our sample represents the population

  4. No perfect predictors: We can’t perfectly predict the outcome from the inputs

  5. Homoscedasticity: The error variance is constant across observations

  6. Normal residuals: The errors are roughly normally distributed

The Big Limitation: Regression finds patterns, not necessarily causes!

Just because education correlates with voting doesn’t mean education causes voting. Maybe:

  • Educated people tend to be wealthier (wealth causes voting)
  • Politically interested people seek more education (reverse causation)
  • Some third factor causes both

Always ask: “What else could explain this relationship?”

1.30 Practical Example: Putting Regression to Work

Let’s work through a complete regression analysis using our electoral data:

# Complete analysis of electoral success
# 1. Explore the data
summary(election_data)
    district         approval      econ_growth         spending      
 Min.   :  1.00   Min.   :35.03   Min.   :-0.5799   Min.   : 267815  
 1st Qu.: 50.75   1st Qu.:48.61   1st Qu.: 1.5722   1st Qu.: 685246  
 Median :100.50   Median :59.10   Median : 2.3801   Median : 821597  
 Mean   :100.50   Mean   :60.32   Mean   : 2.5097   Mean   : 806963  
 3rd Qu.:150.25   3rd Qu.:71.67   3rd Qu.: 3.4157   3rd Qu.: 942685  
 Max.   :200.00   Max.   :84.71   Max.   : 7.3616   Max.   :1314292  
 victory_margin      won         
 Min.   :-3.491   Mode :logical  
 1st Qu.:20.720   FALSE:2        
 Median :29.288   TRUE :198      
 Mean   :29.119                  
 3rd Qu.:37.390                  
 Max.   :58.823                  
# 2. Create scatter plots
library(GGally)
ggpairs(election_data[, c("victory_margin", "approval", "econ_growth", "spending")],
        title = "Relationships Among Electoral Variables")

# 3. Fit the model
full_model <- lm(victory_margin ~ approval + econ_growth + spending, data = election_data)

# 4. Check model assumptions
par(mfrow = c(2, 2))
plot(full_model, main = "Regression Diagnostics")

# 5. Interpret results
model_summary <- summary(full_model)
cat("Model Summary:\n")
Model Summary:
cat("R-squared:", round(model_summary$r.squared, 3), "\n")
R-squared: 0.549 
cat("Adjusted R-squared:", round(model_summary$adj.r.squared, 3), "\n")
Adjusted R-squared: 0.542 
cat("Overall model p-value:", round(pf(model_summary$fstatistic[1], 
                                      model_summary$fstatistic[2], 
                                      model_summary$fstatistic[3], 
                                      lower.tail = FALSE), 3), "\n")
Overall model p-value: 0 
# 6. Substantive interpretation
coef_table <- tidy(full_model) %>%
  mutate(
    effect_size = case_when(
      term == "approval" ~ paste0(round(estimate, 2), " point increase in victory margin per 1% approval increase"),
      term == "econ_growth" ~ paste0(round(estimate, 2), " point increase in victory margin per 1% economic growth"),
      term == "spending" ~ paste0(round(estimate*1000000, 2), " point increase in victory margin per $1M spent"),
      TRUE ~ "Baseline victory margin when all predictors = 0"
    )
  )

kable(coef_table[, c("term", "estimate", "p.value", "effect_size")],
      col.names = c("Variable", "Coefficient", "p-value", "Substantive Interpretation"),
      caption = "Substantive Interpretation of Electoral Success Model") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Substantive Interpretation of Electoral Success Model

Variable      Coefficient   p-value     Substantive Interpretation
(Intercept)   -13.0053408   0.0005336   Baseline victory margin when all predictors = 0
approval        0.5719288   0.0000000   0.57 point increase in victory margin per 1% approval increase
econ_growth     2.7815556   0.0000000   2.78 point increase in victory margin per 1% economic growth
spending        0.0000008   0.7735012   0.8 point increase in victory margin per $1M spent

What we learned:

  1. Model fit: The model explains 54.9% of the variation in victory margins

  2. Key predictors: Approval rating and economic growth are the strongest predictors

  3. Practical significance: A 10-point increase in approval rating predicts about a 5.7-point increase in victory margin

  4. Limitations: We still can’t explain 45.1% of the variation—politics is complex!

1.31 Causation: The Challenge of Causal Inference

1.32 Correlation is Not Causation

Just because two variables are related doesn’t mean one causes the other. Consider:

  1. Reverse causation: Does democracy cause growth, or growth cause democracy?

  2. Common cause: Ice cream sales correlate with crime (both caused by temperature)

  3. Coincidence: Spurious correlations in large datasets

1.33 The Fundamental Problem of Causal Inference

To know if something causes an effect, we’d ideally want to see:

  • What happens WITH the cause
  • What happens WITHOUT the cause
  • For the same unit at the same time

The problem: We can’t observe both! A country either has democracy or it doesn’t. A voter either sees an ad or doesn’t.

This is why causal inference is so challenging—we only see one version of reality, not the counterfactual (what would have happened otherwise).
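
A tiny potential-outcomes table makes the problem concrete (all numbers hypothetical):

# Y1 = outcome with the "treatment" (e.g., democracy), Y0 = outcome without it
counterfactuals <- data.frame(
  unit    = c("Country A", "Country B", "Country C"),
  treated = c(TRUE, FALSE, TRUE),
  Y1      = c(3.0, 2.5, 1.8),  # hypothetical growth with democracy
  Y0      = c(2.0, 2.2, 1.5)   # hypothetical growth without democracy
)
counterfactuals$observed <- ifelse(counterfactuals$treated,
                                   counterfactuals$Y1, counterfactuals$Y0)
counterfactuals$effect <- counterfactuals$Y1 - counterfactuals$Y0  # never observable

For each unit we see only the observed column; the effect column is exactly what the fundamental problem says we can never compute directly from data.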

1.34 Solutions for Causal Inference

1. Randomized Experiments

Gold standard for causation:

  • Randomly assign treatment
  • Compare average outcomes
  • Difference = causal effect

Why it works: Randomization ensures groups are identical except for treatment.

2. Natural Experiments

When randomization happens naturally:

  • Close elections (regression discontinuity)
  • Policy changes affecting some units but not others
  • Natural disasters or other shocks

3. Statistical Control

Use regression to “control” for confounders:

  • Include potential confounders as control variables
  • Interpret coefficient on treatment as causal effect
  • Key limitation: Can only control for observed variables

4. Panel/Longitudinal Methods

Follow same units over time:

  • Control for time-invariant characteristics
  • Difference-in-differences
  • Fixed effects models

Example: Campaign Effects

# Simulate an experiment
set.seed(456)
n_districts <- 100

# Random assignment to treatment (campaign ads)
treatment <- sample(c(0, 1), n_districts, replace = TRUE)

# Potential outcomes (only one observed)
y0 <- rnorm(n_districts, mean = 50, sd = 5)  # Turnout without ads
y1 <- y0 + 3  # Turnout with ads (true effect = 3)

# Observed outcome
turnout <- ifelse(treatment == 1, y1, y0)

# Analysis
experiment_data <- data.frame(treatment, turnout)
t.test(turnout ~ treatment, data = experiment_data)

    Welch Two Sample t-test

data:  turnout by treatment
t = -3.2015, df = 93.469, p-value = 0.001869
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -5.010781 -1.174459
sample estimates:
mean in group 0 mean in group 1 
       49.86662        52.95924 
# Visualize
ggplot(experiment_data, aes(x = factor(treatment), y = turnout, fill = factor(treatment))) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Randomized Experiment: Campaign Ad Effects",
       x = "Treatment Group",
       y = "Voter Turnout (%)") +
  scale_x_discrete(labels = c("Control", "Saw Ads")) +
  scale_fill_manual(values = c("lightblue", "coral")) +
  theme_minimal() +
  theme(legend.position = "none")

Figure Note: This box plot compares outcomes between randomly assigned treatment groups. The clear separation between boxes suggests the treatment (seeing campaign ads) had a meaningful effect on voter turnout. Random assignment ensures the groups are comparable except for the treatment.

The Challenge: When Experiments Aren’t Possible

Most political science questions can’t be answered with randomized experiments:

  • Can’t randomly assign countries to be democratic
  • Can’t randomly assign people to different social classes
  • Can’t randomly start wars to study their effects

Solution: Use clever research designs that approximate experiments:

Regression Discontinuity: Exploit arbitrary cutoffs

  • Example: Compare politicians who barely won vs. barely lost elections
  • Assumption: These groups are very similar except for winning

Difference-in-Differences: Compare changes over time

  • Example: Study policy changes that affect some states but not others
  • Compare how outcomes change in treated vs. untreated states

Instrumental Variables: Find variables that affect the treatment but influence the outcome only through the treatment

  • Example: Use rainfall to study effect of economic growth on conflict
  • Logic: Rainfall affects growth but doesn’t directly cause war
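
A minimal instrumental-variables sketch in the spirit of the rainfall example, assuming the AER package is installed (all data simulated):

# Instrumental variables on simulated data (requires the AER package)
library(AER)
set.seed(7)
n <- 500
rainfall <- rnorm(n)                             # Instrument: shifts growth only
confound <- rnorm(n)                             # Unobserved confounder
growth   <- 0.5 * rainfall + confound + rnorm(n)
conflict <- -0.3 * growth + confound + rnorm(n)  # True effect of growth = -0.3

coef(lm(conflict ~ growth))["growth"]                # OLS: biased by the confounder
coef(ivreg(conflict ~ growth | rainfall))["growth"]  # IV: close to -0.3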

1.35 Example: Do Negative Ads Reduce Turnout?

The Challenge: In real elections, we can’t randomly assign some voters to see negative ads and others not to.

Observational Study Approach:

# Simulate observational data on negative ads and turnout
set.seed(123)
n_areas <- 200

# Some areas see more negative ads (not random)
competitiveness <- runif(n_areas, 0, 1)  # How competitive the race is
negative_ads <- 10 + 30 * competitiveness + rnorm(n_areas, 0, 5)  # More ads in competitive areas

# Turnout is affected by both ads and competitiveness
turnout_obs <- 65 - 0.2 * negative_ads + 10 * competitiveness + rnorm(n_areas, 0, 5)

obs_data <- data.frame(
  area = 1:n_areas,
  negative_ads = negative_ads,
  competitiveness = competitiveness,
  turnout = turnout_obs
)

# Naive analysis (ignoring competitiveness)
naive_model <- lm(turnout ~ negative_ads, data = obs_data)

# Proper analysis (controlling for competitiveness)
controlled_model <- lm(turnout ~ negative_ads + competitiveness, data = obs_data)

# Compare results
comparison <- data.frame(
  Model = c("Naive (no controls)", "Controlled"),
  Ad_Effect = c(coef(naive_model)[2], coef(controlled_model)[2]),
  P_Value = c(summary(naive_model)$coefficients[2,4], 
              summary(controlled_model)$coefficients[2,4]),
  True_Effect = c(-0.2, -0.2)  # We know the true effect from our simulation
)

kable(comparison, 
      digits = 3,
      caption = "Comparing Naive vs. Controlled Analysis") %>%
  kable_styling()
Comparing Naive vs. Controlled Analysis

Model                  Ad_Effect   P_Value   True_Effect
Naive (no controls)        0.009     0.825          -0.2
Controlled                -0.184     0.015          -0.2
# Visualize the confounding
ggplot(obs_data, aes(x = negative_ads, y = turnout, color = competitiveness)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "red", linetype = "dashed") +
  scale_color_gradient(low = "blue", high = "red", name = "Competitiveness") +
  labs(
    title = "The Confounding Problem",
    subtitle = "Red line shows naive correlation; true effect requires controlling for competitiveness",
    x = "Number of Negative Ads",
    y = "Voter Turnout (%)"
  )

Key insight: The naive analysis gives us the wrong answer! Without controlling for competitiveness, we’d conclude negative ads have essentially no effect on turnout (estimate ≈ 0.01, p ≈ 0.83), when they actually decrease it. Competitiveness confounds the relationship: competitive races attract both more negative ads and higher turnout.

1.36 Common Pitfalls and How to Avoid Them

1.37 1. The Ecological Fallacy

Mistake: Inferring individual behavior from aggregate data

Example: “Rich states vote Democratic, therefore rich people vote Democratic”

  • Reality: Within states, wealthier individuals often vote Republican

Solution: Match level of analysis to research question

# Illustrate the ecological fallacy
set.seed(456)

# Create state-level data
states <- data.frame(
  state = 1:50,
  median_income = runif(50, 40000, 80000),
  dem_vote_share = runif(50, 0.3, 0.7)  # Placeholder; overwritten below
)

# Induce a positive income-vote-share correlation at the state level
states$dem_vote_share <- 0.2 + 0.000005 * states$median_income + rnorm(50, 0, 0.1)

# Create individual-level data
individuals <- data.frame()
for(i in 1:50) {
  n_people <- sample(100:200, 1)
  state_income <- states$median_income[i]
  state_dem <- states$dem_vote_share[i]
  
  # Individual incomes around state median
  income <- rnorm(n_people, state_income, 15000)
  
  # Within states, higher income predicts Republican voting
  vote_dem <- rbinom(n_people, 1, plogis(state_dem - 0.00001 * (income - state_income)))
  
  state_data <- data.frame(
    state = i,
    income = income,
    vote_dem = vote_dem
  )
  
  individuals <- rbind(individuals, state_data)
}

# State-level correlation
state_cor <- cor(states$median_income, states$dem_vote_share)

# Individual-level correlation  
ind_cor <- cor(individuals$income, individuals$vote_dem)

cat("State-level correlation:", round(state_cor, 3), "\n")
State-level correlation: 0.604 
cat("Individual-level correlation:", round(ind_cor, 3), "\n")
Individual-level correlation: -0.051 
cat("These have opposite signs - this is the ecological fallacy!")
These have opposite signs - this is the ecological fallacy!

1.38 2. Selection Bias

Mistake: Non-random samples that systematically exclude certain groups

Example: Surveying only likely voters misses preferences of habitual non-voters

Solution: Define population carefully, acknowledge sampling limitations
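
A small simulation of this problem, with made-up numbers, in which policy supporters are less likely to pass a likely-voter screen:

# Selection bias: likely-voter screens can distort measured support (simulated)
set.seed(5)
n <- 10000
supports_policy <- rbinom(n, 1, 0.5)                        # True support: 50%
# Supporters are less likely to be classified as likely voters
likely_voter <- rbinom(n, 1, ifelse(supports_policy == 1, 0.4, 0.7))

mean(supports_policy)                      # Population support: about 0.50
mean(supports_policy[likely_voter == 1])   # Likely-voter sample: about 0.36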

1.39 3. Overfitting

Mistake: Models too complex for available data

Example: Including 50 variables with 100 observations

  • The model memorizes your specific sample rather than learning general patterns

Solution: Keep models simple, focus on key variables
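
A quick simulated illustration of the “50 variables, 100 observations” problem, where the outcome is pure noise (all names hypothetical):

# Overfitting: many predictors, few observations (simulated noise)
set.seed(42)
n <- 100
X <- as.data.frame(matrix(rnorm(n * 50), n, 50))
X$y <- rnorm(n)                    # Outcome unrelated to every predictor

train <- X[1:70, ]
test  <- X[71:100, ]

big_model <- lm(y ~ ., data = train)
summary(big_model)$r.squared                       # Large in-sample R-squared from pure noise
cor(predict(big_model, newdata = test), test$y)^2  # Near zero out of sample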

1.40 4. The Multiple Testing Problem

The Jellybean Problem: Imagine testing whether 20 different colors of jellybeans cause acne. Even if no jellybeans actually cause acne, you’ll probably find that one color shows a “significant” effect just by chance.

Why this happens: If you run many tests, some will be significant by pure luck

  • With 20 tests at p < 0.05 level, you expect 1 false positive
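
A quick simulation of the jellybean problem, where the null hypothesis is true in every single test (everything simulated):

# Run 20 tests with no real effect in any of them
set.seed(20)
p_values <- replicate(20, t.test(rnorm(50), rnorm(50))$p.value)
sum(p_values < 0.05)   # Typically at least one "significant" result by pure chance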

Solution:

  • Decide what you’re testing before looking at the data
  • Be cautious when you see one significant result among many tests
  • Report all tests you ran, not just significant ones

1.41 5. Ignoring Uncertainty

Mistake: Treating point estimates as exact

Example: “Support is 52%” vs. “Support is 52% ± 3%”

Solution: Always report and interpret confidence intervals
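
The “± 3%” comes straight from the standard error of a proportion; here is a quick check with hypothetical numbers:

# Margin of error for "support is 52%" in a poll of 1,000 (numbers hypothetical)
p_hat <- 0.52
n <- 1000
margin <- 1.96 * sqrt(p_hat * (1 - p_hat) / n)
round(c(lower = p_hat - margin, upper = p_hat + margin), 3)  # About 0.49 to 0.55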

1.42 6. Confusing Statistical and Practical Significance

Mistake: Assuming statistically significant results are always meaningful

Example: A study of 10,000 people finds that negative ads decrease turnout by 0.01 percentage points (p = 0.03)

Questions to ask:

  • Is 0.01 percentage points a meaningful difference?
  • Would this affect election outcomes?
  • Is the effect large enough to matter for policy?

Solution: Always consider effect sizes alongside p-values
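
A simulated illustration of the point: with a very large sample, even a substantively trivial effect clears the p < 0.05 bar (all numbers invented):

# Statistical significance without practical significance (simulated)
set.seed(3)
n <- 1000000
x <- rnorm(n)
y <- 0.005 * x + rnorm(n)                   # True effect: tiny
summary(lm(y ~ x))$coefficients["x", ]      # Highly "significant", substantively negligible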

1.43 Practical Applications in Political Science

1.44 1. Polling and Elections

Key concepts applied:

  • Sample vs. population: Voters vs. poll respondents
  • Sampling error: Margin of error in polls
  • Confidence intervals: “52% ± 3%”
  • Statistical significance: Is the lead meaningful?
  • Regression: Modeling vote choice as function of candidate characteristics

Example: Poll Analysis

# Simulate poll data
set.seed(789)
poll_n <- 1200
candidate_support <- 0.52

# Add realistic complications
party_id <- sample(c("Democrat", "Republican", "Independent"), 
                   poll_n, replace = TRUE, prob = c(0.35, 0.33, 0.32))

# Voting intention depends on party ID
vote_prob <- case_when(
  party_id == "Democrat" ~ 0.85,
  party_id == "Republican" ~ 0.15,
  party_id == "Independent" ~ 0.52
)

will_vote_dem <- rbinom(poll_n, 1, vote_prob)

# Calculate results with uncertainty
dem_support <- mean(will_vote_dem)
se <- sqrt(dem_support * (1 - dem_support) / poll_n)
margin_error <- 1.96 * se

cat("Poll Results:\n")
Poll Results:
cat("Democratic candidate support:", round(dem_support * 100, 1), "%\n")
Democratic candidate support: 51.5 %
cat("Margin of error: ±", round(margin_error * 100, 1), "%\n")
Margin of error: ± 2.8 %
cat("95% Confidence interval:", 
    round((dem_support - margin_error) * 100, 1), "% to",
    round((dem_support + margin_error) * 100, 1), "%\n")
95% Confidence interval: 48.7 % to 54.3 %
if(dem_support - margin_error > 0.5) {
  cat("Democratic candidate has statistically significant lead\n")
} else if(dem_support + margin_error < 0.5) {
  cat("Republican candidate has statistically significant lead\n")
} else {
  cat("Race is too close to call\n")
}
Race is too close to call

1.45 2. Comparative Politics

Research question: Do democratic institutions promote economic growth?

Statistical challenges:

  • Measurement: How do we measure “democracy”?
  • Sampling: Which countries/years to include?
  • Causation: Do institutions cause growth, or vice versa?

Regression approach:

# Simulate country-year data
set.seed(101)
n_countries <- 150
n_years <- 20

country_data <- expand_grid(
  country = 1:n_countries,
  year = 2000:(2000 + n_years - 1)
) %>%
  mutate(
    # Countries have fixed characteristics
    democracy_score = rep(runif(n_countries, 1, 10), each = n_years),
    
    # GDP growth affected by democracy, but also other factors
    gdp_growth = 1 + 0.3 * democracy_score + 
                 0.2 * rnorm(n(), 0, 1) +  # Random shocks
                 rnorm(n(), 0, 2)          # Measurement error
  )

# Run regression
democracy_model <- lm(gdp_growth ~ democracy_score, data = country_data)

# Visualize
ggplot(country_data, aes(x = democracy_score, y = gdp_growth)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", color = "red") +
  labs(
    title = "Democracy and Economic Growth",
    subtitle = paste0("Effect: ", round(coef(democracy_model)[2], 2), 
                     " percentage points per democracy point"),
    x = "Democracy Score (1-10)",
    y = "GDP Growth Rate (%)"
  )

summary(democracy_model)

Call:
lm(formula = gdp_growth ~ democracy_score, data = country_data)

Residuals:
   Min     1Q Median     3Q    Max 
-6.939 -1.347 -0.027  1.397  6.696 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      0.84477    0.08556   9.873   <2e-16 ***
democracy_score  0.33372    0.01376  24.259   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.012 on 2998 degrees of freedom
Multiple R-squared:  0.1641,    Adjusted R-squared:  0.1638 
F-statistic: 588.5 on 1 and 2998 DF,  p-value: < 2.2e-16

Interpretation challenges:

  • Is this relationship causal?
  • What about reverse causation (growth → democracy)?
  • What about confounding variables (education, natural resources)?

1.46 3. Public Opinion Research

Example study: Effect of negative campaign ads on voter turnout

Design considerations:

  • Random assignment to treatment (seeing ads vs. not)
  • Measurement of turnout intention vs. actual turnout
  • Controlling for confounding variables (partisanship, interest)

# Simulate experimental study of negative ads
set.seed(202)
n_participants <- 500

experimental_data <- data.frame(
  participant = 1:n_participants,
  age = sample(18:80, n_participants, replace = TRUE),
  party_id = sample(c("Dem", "Rep", "Ind"), n_participants, replace = TRUE),
  political_interest = runif(n_participants, 1, 7),
  treatment = sample(c("Control", "Negative Ad"), n_participants, replace = TRUE)
)

# Turnout intention affected by treatment and individual characteristics
experimental_data <- experimental_data %>%
  mutate(
    baseline_turnout = 0.4 + 0.05 * political_interest + 
                      ifelse(party_id == "Ind", -0.1, 0) +
                      0.002 * (age - 18),
    
    # Negative ads reduce turnout by 5 percentage points
    turnout_prob = baseline_turnout + ifelse(treatment == "Negative Ad", -0.05, 0),
    
    # Add noise
    turnout_prob = turnout_prob + rnorm(n_participants, 0, 0.1),
    turnout_prob = pmax(0, pmin(1, turnout_prob)),  # Keep between 0 and 1
    
    will_turnout = rbinom(n_participants, 1, turnout_prob)
  )

# Analyze treatment effect
treatment_effect <- experimental_data %>%
  group_by(treatment) %>%
  summarise(
    n = n(),
    turnout_rate = mean(will_turnout),
    se = sqrt(turnout_rate * (1 - turnout_rate) / n)
  )

# Statistical test
t.test(will_turnout ~ treatment, data = experimental_data)

    Welch Two Sample t-test

data:  will_turnout by treatment
t = 1.4077, df = 496.45, p-value = 0.1598
alternative hypothesis: true difference in means between group Control and group Negative Ad is not equal to 0
95 percent confidence interval:
 -0.02417005  0.14633164
sample estimates:
    mean in group Control mean in group Negative Ad 
                0.6521739                 0.5910931 
# Visualize
ggplot(experimental_data, aes(x = treatment, y = will_turnout, fill = treatment)) +
  geom_bar(stat = "summary", fun = "mean", alpha = 0.7) +
  geom_errorbar(data = treatment_effect, 
                aes(x = treatment, y = turnout_rate, 
                    ymin = turnout_rate - 1.96 * se,
                    ymax = turnout_rate + 1.96 * se),
                width = 0.2, inherit.aes = FALSE) +
  labs(
    title = "Effect of Negative Ads on Turnout Intention",
    x = "Experimental Condition",
    y = "Proportion Intending to Vote"
  ) +
  theme(legend.position = "none")

1.47 Moving Forward: Building Statistical Intuition

1.48 Key Principles to Remember

The Statistical Mindset

  1. Always think about uncertainty: Every statistic comes with error

  2. Distinguish correlation from causation: Association ≠ causal effect

  3. Consider practical significance: Statistical significance isn’t everything

  4. Question your measurements: How well do our proxies capture what we care about?

  5. Think about selection: Who/what is in our sample, and who/what is missing?

The Fundamental Tools You’ve Learned

  • Sampling: How to learn about many from studying few
  • Measurement: How to turn political concepts into numbers
  • Description: How to summarize what we see in data
  • Inference: How to draw conclusions beyond our sample
  • Regression: How to model relationships between variables
  • Significance testing: How to distinguish real patterns from noise
  • Causation: Why correlation doesn’t equal causation

1.49 Next Steps in Your Training

Immediate Next Steps

  1. Practice with R or Stata: Apply these concepts with real data

  2. Read research critically: Can you identify the population, sample, and key assumptions?

  3. Take a methods course: Build on these foundations

Future Learning

  • Probability theory: The mathematical foundations (usually second year)
  • Advanced regression: Logistic regression, interactions, non-linear relationships
  • Causal inference: More sophisticated ways to identify causes
  • Survey methodology: Designing good questionnaires and samples
  • Panel data methods: Following units over time
  • Machine learning: Prediction-focused approaches to analyzing political data

1.50 Practical Advice for Political Science Research

1. Start with Theory

Statistics is a tool, not a substitute for thinking:

  • What relationship do you expect and why?
  • What would falsify your hypothesis?
  • What alternative explanations exist?

2. Know Your Data

Before any analysis:

# Essential diagnostic steps (variable names are placeholders)
summary(data)                        # Basic statistics for every column
table(data$variable)                 # Frequency table for one variable
hist(data$variable)                  # Distribution of one variable
plot(data$education, data$turnout)   # Scatterplot of two variables
cor(data[sapply(data, is.numeric)])  # Correlation matrix (numeric columns only)

3. Match Method to Question

  • Describing: Means, proportions, distributions
  • Predicting: Regression, machine learning
  • Causal inference: Experiments, quasi-experiments, panel methods

4. Interpret Substantively

Always translate statistics back to political science:

  • What does a one-unit change mean substantively?
  • Is the effect politically meaningful?
  • What are the policy implications?

5. Be Transparent

  • Report all analyses, not just significant results
  • Share data and code when possible
  • Acknowledge limitations
  • Describe robustness checks

1.51 Essential R Code for Getting Started

# Reading data
data <- read.csv("yourfile.csv")     # Load a CSV file

# Basic exploration
summary(data)                         # See basic statistics for all variables
head(data)                           # Look at first few rows
table(data$party)                    # Count how many in each category

# Simple analysis
mean(data$age)                       # Calculate average age
cor(data$income, data$turnout)      # Correlation between two variables

# Basic visualization
hist(data$age)                       # Histogram of age distribution
plot(data$education, data$turnout)  # Scatterplot of two variables

# Difference between groups
t.test(income ~ gender, data = data) # Compare average income by gender

# Simple regression
model <- lm(turnout ~ education, data = data)  # Run regression
summary(model)                                  # See results

# Multiple regression
model2 <- lm(turnout ~ education + age + income, data = data)
summary(model2)

# Create nice plots with ggplot2
library(ggplot2)
ggplot(data, aes(x = education, y = turnout)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Education and Turnout",
       x = "Years of Education", 
       y = "Voter Turnout")

1.52 Resources for Continued Learning

Textbooks for Beginners

  • Kellstedt & Whitten: The Fundamentals of Political Science Research - Written specifically for political science students
  • Imai: Quantitative Social Science: An Introduction - Great examples, includes R code
  • Freedman, Pisani & Purves: Statistics - Classic intro text, very intuitive

Online Resources

R for Beginners

  • Swirl: Interactive R lessons in your console
  • RStudio Primers: https://rstudio.cloud/learn/primers

Statistical Concepts

  • Khan Academy Statistics: Free video lessons
  • Crash Course Statistics: YouTube series

Political Science Methods

  • ICPSR Summer Program: Training in quantitative methods
  • MethodSpace: https://www.methodspace.com

Getting Help

  • Your university’s statistics tutoring center
  • Office hours (use them!)
  • Study groups with classmates
  • Stack Overflow (for coding questions)

1.53 Final Thoughts

Statistics is not just a tool—it’s a way of thinking about evidence, uncertainty, and inference. As citizens and scholars, developing statistical intuition helps us:

  • Critically evaluate political claims
  • Design better research
  • Make more informed decisions
  • Understand the limits of what we can know

Remember: Every number tells a story, but not every story told by numbers is true. Your job is to develop the skills to tell the difference.

The goal isn’t to become a statistician, but to become a political scientist who can evaluate and produce rigorous evidence. Statistics helps us move from hunches to hypotheses to evidence-based conclusions about the political world.

As you continue your journey in political science, always remember that behind every statistical analysis are real people, real policies, and real consequences. The tools you’ve learned here will help you contribute to our understanding of politics and hopefully make the world a bit better informed.