1  Foundations of Statistics and Demography


1.1 Introduction

Statistics is the science of learning from data under uncertainty.

Statistics is a way to learn about the world from data. It teaches how to collect data wisely, spot patterns, estimate population parameters, and make predictions—stating how wrong we might be.

Note

Statistics is the science of collecting, organizing, analyzing, interpreting, and presenting data. It encompasses both the methods for working with data and the theoretical foundations that justify these methods.

But statistics is more than just numbers and formulas—it’s a way of thinking about uncertainty and variation in the world around us.

What is Data?

Data: Information collected during research – this includes survey responses, experimental results, economic indicators, social media content, or any other measurable observations.

A data distribution describes how values spread across possible outcomes (what values a variable takes and how often it takes them). Distributions tell us which values are common, which are rare, and what patterns exist in our data.


Demography is the scientific study of human populations, focusing on their size, structure, distribution, and changes over time. It’s essentially the statistical analysis of people - who they are, where they live, how many there are, and how these characteristics evolve.

Statistics and demography are interconnected disciplines that provide powerful tools for understanding populations, their characteristics, and the patterns that emerge from data.

Rounding and Scientific Notation

Main Rule: Unless otherwise specified, round the decimal part of a number to at least 2 significant figures. In statistics, we often work with long decimal parts and very small numbers — don’t round excessively in intermediate steps; round at the end of calculations.

Rounding in Statistical Context

The decimal part consists of digits after the decimal point. In statistics, it’s particularly important to maintain appropriate precision:

Descriptive statistics:

  • Mean: \bar{x} = 15.847693... \rightarrow 15.85
  • Standard deviation: s = 2.7488... \rightarrow 2.75
  • Correlation coefficient: r = 0.78432... \rightarrow 0.78

Very small numbers (p-values, probabilities):

  • p = 0.000347... \rightarrow 0.00035 or 3.5 \times 10^{-4}
  • P(X > 2) = 0.0000891... \rightarrow 0.000089 or 8.9 \times 10^{-5}

Significant Figures in Decimal Parts

In the decimal part, significant figures are all digits except leading zeros:

  • .78432 has 5 significant figures → round to .78 (2 s.f.)
  • .000347 has 3 significant figures → round to .00035 (2 s.f.)
  • .050600 has 5 significant figures → round to .051 (2 s.f.)

Rounding Rules in Statistics

  1. Round only the decimal part to at least 2 significant figures
  2. The integer part remains unchanged
  3. In long calculations keep 3-4 digits in the decimal part until the final step
  4. NEVER round to zero - small values have interpretive significance
  5. For very small numbers use scientific notation when it improves readability
  6. P-values often require greater precision — keep 2-3 significant figures

Scientific Notation in Statistics

In statistics, we often encounter very small numbers. Use scientific notation when it improves readability:

P-values and probabilities:

  • p = 0.000347 = 3.47 \times 10^{-4} (better: 3.5 \times 10^{-4})
  • P(Z > 3.5) = 0.000233 = 2.33 \times 10^{-4}

Large numbers (rare in basic statistics):

  • N = 1\,234\,567 = 1.23 \times 10^6

When in doubt: Better to keep an extra digit than to round too aggressively
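These rules are easy to apply in R: signif() rounds to significant figures and format() controls scientific notation. A quick sketch (base R only):

# For numbers below 1, signif() matches the decimal-part rule above
signif(0.78432, 2)     # 0.78
signif(0.000347, 2)    # 0.00035
signif(0.050600, 2)    # 0.051

# For numbers with an integer part, round the decimal places instead
round(15.847693, 2)    # 15.85
round(2.7488, 2)       # 2.75

# Scientific notation for very small numbers
format(0.000347, scientific = TRUE)    # "3.47e-04"
format(0.0000891, scientific = TRUE)   # "8.91e-05"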


What is Statistics For in Social and Political Science?

Statistics is essential in social and political science for several key purposes:

Understanding Social Phenomena: Measuring inequality, poverty, unemployment, political participation; describing demographic patterns and social trends; quantifying attitudes, beliefs, and behaviors in populations.

Testing Theories: Political scientists theorize about democracy, voting behavior, conflict, and institutions. Sociologists develop theories about social mobility, inequality, and group dynamics. Statistics allows us to test whether these theories match reality.

Causal Inference: Social scientists want to answer “why” questions—Does education increase income? Do democracies go to war less often? Does social media affect political polarization? Statistics helps separate causation from mere correlation.

Policy Evaluation: Assessing whether interventions work—Does a job training program reduce unemployment? Did election reform increase voter turnout? Are anti-poverty programs effective? Statistics provides tools to evaluate what works and what doesn’t.

Public Opinion Research: Election polls and forecasting; measuring public support for policies; understanding how opinions vary across demographic groups; tracking attitude changes over time.

Making Generalizations: We can’t survey everyone, so we sample and use statistics to make inferences about entire populations. A poll of 1,000 people can tell us about a nation of millions (with known uncertainty).

Dealing with Complexity: Human societies are messy—many factors influence outcomes simultaneously. Statistics helps us control for confounding variables, isolate specific effects, and make sense of multivariate relationships.

The Uniqueness of Social Sciences: Unlike natural sciences, social sciences study human behavior, which is highly variable and context-dependent. Statistics provides the tools to find patterns and draw conclusions despite this inherent uncertainty.


When working with data, statisticians use two different approaches: exploration and confirmation/verification (inferential statistics). First, we examine the data to understand its characteristics and identify patterns. Then, we use formal methods to test specific hypotheses and draw conclusions.

EDA vs. Inferential Statistics

Statistics can be viewed as two complementary phases:

  • Exploratory Data Analysis (EDA): combines descriptive statistics and visualization methods to explore data, uncover patterns, check assumptions, and generate hypotheses.
  • Inferential Statistics: uses probability models to test hypotheses and draw conclusions that generalize beyond the observed data.
Percent vs Percentage Points (pp)

When news reports say “unemployment decreased by 2,” do they mean 2 percentage points (pp) or 2 percent?

These are not the same:

  • 2 pp (absolute change): e.g., 10% → 8% (−2 pp).
  • 2% (relative change): multiply the old rate by 0.98; e.g., 10% → 9.8% (−0.2 pp).

Always ask:

  1. What is the baseline (earlier rate)?
  2. Is the change absolute (pp) or relative (%)?
  3. Could this be sampling error / random variation?
  4. How was unemployment measured (survey vs. administrative), when, and who’s included?

Rule of thumb

  • Use percentage points (pp) when comparing rates directly (unemployment, turnout).
  • Use percent (%) for relative changes (proportional to the starting value).

Tiny lookup table

| Starting rate | “Down 2%” (relative) | “Down 2 pp” (absolute) |
|---|---|---|
| 6% | 6% × 0.98 = 5.88% (−0.12 pp) | 4% |
| 8% | 8% × 0.98 = 7.84% (−0.16 pp) | 6% |
| 10% | 10% × 0.98 = 9.8% (−0.2 pp) | 8% |

Note: 2% is not the same as 2 percentage points (pp).
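A three-line R check of the distinction, using the 10% starting rate from the table:

rate <- 0.10          # starting unemployment rate: 10%
rate - 0.02           # down 2 percentage points -> 0.08 (8%)
rate * (1 - 0.02)     # down 2 percent (relative) -> 0.098 (9.8%)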

1.2 Exploratory Data Analysis (EDA)

What is EDA? Exploratory Data Analysis is the initial step where we examine data systematically to understand its structure and characteristics. This phase does not involve formal hypothesis testing—it focuses on discovering what the data contains.

Why do we do EDA?

  • Find interesting patterns you didn’t expect
  • Spot mistakes or unusual values in your data
  • Get ideas about what questions to ask
  • Understand what your data looks like before doing formal tests. Many statistical methods only work properly when certain requirements hold, and EDA helps check whether the data meets them: for example, (1) some tests require the data to follow a normal (bell-shaped) distribution, (2) a relationship assumed to be linear should actually look linear, and (3) variances should be homogeneous across groups and outliers should be identified.
The EDA Approach

When conducting EDA, we begin without predetermined hypotheses. Instead, we examine data from multiple perspectives to discover patterns and generate questions for further investigation.

Simple Tools for Exploring Data

1. Summary Numbers (Descriptive Statistics)

These are basic calculations that describe your data:

Finding the “Typical” Value:

  • Arithmetic Mean (Average): Add up all values and divide by how many you have. Example: If 5 students scored 70, 80, 85, 90, and 100 on a test, the average is 85.
  • Median (Middle): The value in the middle when you line up all numbers from smallest to largest. In our test example, the median is also 85.
  • Mode (Most Common): The value that appears most often. If ten families have 1, 2, 2, 2, 2, 3, 3, 3, 4, and 5 children, the mode is 2 children.

Understanding Spread:

  • Range: Just subtract the smallest number from the biggest. If students’ ages go from 18 to 24, the range is 6 years.
  • Standard Deviation: Shows how spread out your data is from the average. A small standard deviation means most values are close to the average; a large one means they’re more spread out.
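All of these summaries are one-liners in R; a quick sketch using the small examples above:

scores <- c(70, 80, 85, 90, 100)       # the five test scores
mean(scores)                           # 85
median(scores)                         # 85
sd(scores)                             # standard deviation
max(scores) - min(scores)              # range

children <- c(1, 2, 2, 2, 2, 3, 3, 3, 4, 5)   # the ten families
names(which.max(table(children)))      # mode: "2" (base R has no built-in mode for data)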

2. Visual Exploration

Graphical methods help reveal patterns that numerical summaries alone might not show:

  • Population Pyramids: Show how many people are in each age group, split by males and females. Helps you see if a population is young or old.
  • Box Plots: Show the middle of your data and help spot unusual values
  • Scatter Plots: Display relationships between two variables (such as hours studied versus test scores)
  • Time (Series) Graphs: Show how something changes over time (like temperature throughout the year)
  • Histograms: A histogram is a graphical representation of the frequency distribution of a dataset. It consists of adjacent bars (with no gaps between them), where each bar represents a range of values (called a bin or class interval) and its height shows how many data points (or what proportion of them) fall within that range. Histograms are used to visualize the shape, spread, and central tendency of numerical data.

Example population pyramid (Łódź, Poland): https://commons.wikimedia.org/wiki/File:%C5%81%C3%B3d%C5%BA_population_pyramid.svg

3. Looking for Connections/Associations:

  • Do two variables move together? (When one goes up, does the other go up too?)
  • Can you draw a line (regression line) that roughly fits your data points?
  • Do you see any clear patterns or trends?
Using the Same Techniques for Different Purposes

Many statistical techniques serve both exploratory and confirmatory functions:

Exploring: We calculate correlations or fit regression lines to understand what relationships exist in the data. The focus is on discovering patterns.

Confirming: We apply statistical tests to determine whether observed patterns are statistically significant or could have occurred by chance. The focus is on formal hypothesis testing.

The same technique can serve different purposes depending on the research phase.

4. Good Questions to Ask While Exploring:

  • What does the shape of my data look like?
  • Are there any weird or unusual values?
  • Do I see any patterns?
  • Is any data missing?
  • Do different groups show different patterns?

1.3 Inferential Statistics

After exploring, you might want to make formal conclusions. Inferential statistics helps you do this.

The Basic Idea: You have data from some people (a sample), but you want to know about everyone (the population). Inferential statistics helps you make educated guesses about the bigger group based on your smaller group.

Note

A random sample requires that each member has a known, non-zero chance of being selected, not necessarily an equal chance.

When every member has an equal chance of selection, that’s specifically called a simple random sample - which is the most basic type.

A Soup-Tasting Analogy

Consider a chef preparing soup for 100 people who needs to assess its flavor without consuming the entire batch:

Population: The entire pot of soup (100 servings)
Sample: A single spoonful for tasting
Population Parameter: The true average saltiness of the complete pot (unknown)
Sample Statistic: The saltiness level detected in the spoonful (observable, a point estimate)
Statistical Inference: Using the spoonful’s characteristics to draw conclusions about the entire pot

Key points

1. Random sampling is essential. The cook should stir thoroughly or sample from random locations. Skimming only the surface can miss seasonings that settled to the bottom, introducing systematic bias.

2. Sample size drives precision. A larger ladle — or more spoonfuls (larger n) — reduces random error and gives a more stable estimate of the “average taste,” though cost and time limit how much you can increase n.

3. Uncertainty is unavoidable. Even with proper sampling, a single spoonful may not perfectly represent the whole pot; there is always random variability.

4. Systematic bias undermines inference. If someone secretly adds salt only where you sample, conclusions about the whole pot will be distorted — a classic case of sampling bias.

5. One sample is limited. A single taste can tell you the average saltiness, but not how much it varies across the pot. To assess variability, you need multiple independent samples.

Note: Increasing n improves precision (less noise) but does not remove bias; eliminating bias requires fixing the sampling design.

This analogy captures the essence of statistical reasoning: using carefully selected samples to learn about larger populations while explicitly acknowledging and quantifying the inherent uncertainty in this process.


Statistical Thinking

Key concepts (at a glance)

Pipeline: Research question → Estimand (population quantity) → Parameter (true, unknown value) → Estimator (sample rule/statistic; random) → Estimate (the single number from your data)

What we want to know:

  • Estimand — the population quantity we aim to learn (the formal target), as distinct from the research question as worded.
    Example: “Population mean age at first birth in Poland in 2023.”
  • Parameter (\theta) — the true but unknown value of that estimand in the population (fixed, not random).
    Example: The true mean \mu (e.g., \mu=29.4 years).

How we estimate (3 steps):

  1. Sample statistic — any function of the sample (a rule), e.g.
    \displaystyle \bar{X}=\frac{1}{n}\sum_{i=1}^n X_i
  2. Estimator — that statistic chosen to estimate a specific parameter (depends on a random sample, so it’s random).
    Example: Use \bar{X} as an estimator of \mu.
  3. Estimate (\hat\theta) — the numerical result after applying the estimator to your observed data (x_1,\dots,x_n).
    Example: \hat\mu=\bar{x}=29.1 years.

Analogy:

Statistic = tool → Estimator = tool chosen for a goal → Estimate = the finished output (your concrete result)


Common estimators

| Target parameter (goal) | Estimator (statistic) | Formula | Note |
|---|---|---|---|
| Population mean \mu | Sample mean | \bar X=\frac{1}{n}\sum_{i=1}^n X_i | Unbiased estimator. The estimator \bar X is a random variable; a specific calculated value (e.g., \bar x = 5.2) is called an estimate. |
| Population proportion p | Sample proportion | \hat p=\frac{K}{n}, where K=\sum_{i=1}^n Y_i for Y_i\in\{0,1\} | Equivalent to \bar Y when encoding outcomes as 0/1. Here K counts the number of successes in n trials. |
| Population variance \sigma^2 | Sample variance | s^2=\frac{1}{n-1}\sum_{i=1}^n (X_i-\bar X)^2 | The n-1 divisor (Bessel’s correction) makes this unbiased for \sigma^2. Using n would give a biased estimator. |

Every estimator is a statistic, but not every statistic is an estimator — until you tie it to a target (an estimand), it’s “just” a statistic.


Further reading on estimator quality: https://allmodelsarewrong.github.io/mse.html
Quality Criteria: Bias, Variance, MSE, Efficiency (*)

How do we assess if an estimator (“method”) is good?

  • Bias — does our method give true results “on average”?

    Imagine we want to know the average height of adult Poles (true value: 172 cm). We draw 100 different samples of 500 people each and calculate the mean for each one.

    Unbiased estimator: Those 100 means will differ (169 cm, 173 cm, 171 cm…), but their average will be close to 172 cm. Sometimes we overestimate, sometimes underestimate, but there’s no systematic error.

    Biased estimator: If we accidentally always excluded people over 180 cm, all our 100 means would be too low (e.g., oscillating around 168 cm). That’s systematic bias.


  • Variance — how much do results differ between samples?

    We have two methods for estimating the same parameter. Both give good results “on average,” but:

    • Method A: from 10 samples we get: 171, 172, 173, 171, 172, 173, 172, 171, 173, 172 cm
    • Method B: from 10 samples we get: 165, 179, 168, 176, 171, 174, 169, 175, 167, 176 cm

    Method A has lower variance — results are more concentrated, predictable. In practice, you prefer Method A because you can be more confident in a single result.

    Key principle: Larger sample = lower variance. With a sample of 100 people, the mean will “jump around” more than with a sample of 1,000 people.


  • Mean Squared Error (MSE) — what matters more: unbiasedness or stability?

    Sometimes we face a dilemma:

    • Estimator A: Unbiased (average 172 cm), but very unstable (results from 160 to 184 cm)
    • Estimator B: Slightly biased (average 171 cm instead of 172 cm), but very stable (results from 169 to 173 cm)

    MSE says: Estimator B is better — a small systematic underestimation of 1 cm is less problematic than the huge spread of results in Estimator A (see the simulation sketch after this list).


  • Efficiency — which unbiased estimator to choose?

    You have data on incomes of 500 people. You want to know the “typical” income. Two options:

    • Arithmetic mean: typically gives results in the range 4,800–5,200 PLN
    • Median: gives results in the range 4,500–5,500 PLN

    If both methods are unbiased, choose the one with smaller spread (the mean is more efficient for normally distributed data).
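To make these criteria concrete, here is a minimal simulation sketch. The numbers are hypothetical, chosen to mimic estimators A and B above (A unbiased but noisy, B slightly biased but stable), and MSE is computed from its decomposition MSE = bias² + variance:

set.seed(42)
true_value <- 172                     # true mean height in cm
n_samples  <- 10000                   # number of repeated samples

est_A <- rnorm(n_samples, mean = 172, sd = 6)   # unbiased, high variance
est_B <- rnorm(n_samples, mean = 171, sd = 1)   # biased by -1 cm, low variance

bias_A <- mean(est_A) - true_value;  var_A <- var(est_A)
bias_B <- mean(est_B) - true_value;  var_B <- var(est_B)

c(MSE_A = bias_A^2 + var_A,           # about 36
  MSE_B = bias_B^2 + var_B)           # about 2 -- B wins despite its small bias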

Example of Statistical Thinking

Your university is considering keeping the library open 24/7. The administration needs to know: What proportion of students support this change?

Note

Ideal world: Ask all 20,000 students → Get the exact answer (\theta parameter)
Real world: Survey 100 students → Get an estimate (\hat{\theta}) with uncertainty

Bias vs. Random Error

Statistical (prediction) error can be decomposed into two main components: bias (systematic error) and random error (unpredictable variation).

Bias is like a miscalibrated scale that consistently reads 2 kg too high—every measurement is wrong in the same direction. It’s systematic error.

Random error is the unpredictable variation in your observations, like:

  • A dart player aiming at the bullseye—each throw lands in a slightly different spot due to hand tremor, air currents, tiny muscle variations
  • Measuring someone’s height multiple times and getting 174.8 cm, 175.0 cm, 175.3 cm—small fluctuations from posture changes, breathing, how you read the scale, and natural body variations
  • A weather model that’s sometimes 2°C too high, sometimes 1°C too low, sometimes spot on
  • Opinion polls showing 52%, 49%, 51% support across different surveys—each random sample gives slightly different results, but they cluster around the true value

Random error is measured by variance—the average squared deviation of observations from their mean. It quantifies how much your data points (predictions) scatter.

Random error is like asking 5 friends to estimate how many jellybeans are in a jar—they’ll all give different answers just due to chance, but those differences scatter randomly around the truth rather than all being wrong in the same direction.

Polling example: Bias is like polling only at the gym at 6am—you’ll always get more health-conscious, early-rising, employed people and always miss night-shift workers, people with young kids, etc. The poll is broken in a predictable way. Or: only counting responses from people who actually answer unknown phone calls—you’ll systematically miss everyone (especially younger people) who screens their calls.

Key difference: Averaging more observations reduces random error but never fixes bias. You can’t average your way out of a miscalibrated scale—or a biased sampling method!

Two Approaches to the Same Data

Imagine you survey 100 random students and find that 60 support the 24/7 library hours.

❌ Without Statistical Thinking

“60 out of 100 students said yes.”

Conclusion: “Exactly 60% of all students support it.”

Decision: “Since it’s over 50%, we have clear majority support.”

Problem: Ignores that a different sample might give 55% or 65%

✅ With Statistical Thinking

“60 out of 100 students said yes.”

Conclusion: “We estimate 60% support, with a margin of error of ±10 pp.”

Decision: “True support is likely between 50% and 70%—we need more data to be certain of majority support.”

Advantage: Acknowledges uncertainty and informs better decisions

How sample size affects precision:

| Sample Size | Observed Result | Margin of Error (95%) | Range of Plausible Values | Interpretation |
|---|---|---|---|---|
| n = 100 | 60% | ±10 pp | 50% to 70% | Uncertain about majority |
| n = 400 | 60% | ±5 pp | 55% to 65% | Likely majority support |
| n = 1,000 | 60% | ±3 pp | 57% to 63% | Clear majority support |
| n = 1,600 | 60% | ±2.5 pp | 57.5% to 62.5% | Strong majority support |
| n = 10,000 | 60% | ±1 pp | 59% to 61% | Very precise estimate |

The Diminishing Returns Principle: Notice that quadrupling the sample size from 100 to 400 cuts the margin of error in half, but increasing from 1,600 to 10,000 (a 6.25× increase) only reduces it by 1.5 percentage points. To halve your margin of error, you must quadruple your sample size.

This is why most polls stop around 1,000–1,500 respondents—the gains in precision beyond that point rarely justify the additional cost and effort.


Sample Size and Uncertainty (Random Error)

Suppose we take a random sample of n=1000 voters and observe \hat p = 0.55 (e.g., 55% support for a candidate in upcoming elections—550 out of 1,000 respondents). Then:

  • Our best single-number estimate (point estimate) of the population proportion is \hat p = 0.55.

  • A typical “range of plausible values” (at the 95\% confidence level) around \hat p can be approximated by \hat p \pm \text{Margin of Error}, i.e., \hat p \;\pm\; 2\sqrt{\frac{\hat p(1-\hat p)}{n}} \;=\; 0.55 \;\pm\; 2\sqrt{\frac{0.55\cdot 0.45}{1000}} \approx 0.55 \pm 0.031, giving roughly (interval estimate) 52\% to 58\% (approximately \pm 3.1 percentage points).

Note: The factor of 2 is a convenient rounding of 1.96, the critical value from the standard normal distribution for 95% confidence.

  • The width of this interval shrinks predictably with sample size: \text{Margin of Error} \;\propto\; \frac{1}{\sqrt{n}}. For example, increasing n from 1,000 to 4,000 cuts the margin of error approximately in half (from \pm 3.1\% to \pm 1.6\%).
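A short R sketch of the 1/\sqrt{n} behaviour, using the same \hat p = 0.55 and the 1.96 critical value:

p_hat <- 0.55
n     <- c(100, 400, 1000, 4000, 10000)

moe <- 1.96 * sqrt(p_hat * (1 - p_hat) / n)   # 95% margin of error
round(100 * moe, 1)                           # in percentage points:
# 9.8  4.9  3.1  1.5  1.0  -- quadrupling n halves the margin of error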

Note

Fundamental Principle: Statistics does not eliminate uncertainty—it helps us measure, manage, and communicate it effectively.


Historical Example: the 1936 Literary Digest Poll

In 1936, Literary Digest ran one of the largest opinion polls ever — about 2.4 million mailed responses — yet it completely misjudged the U.S. presidential election.

| Candidate | Prediction | Actual result | Error |
|---|---|---|---|
| Landon | 57% | 36.5% | ≈20 pp |
| Roosevelt | 43% | 60.8% | ≈18 pp |

What went wrong?

Even with millions of responses, the poll was badly biased — not random, but systematic.


Systematic vs. Random Error

Imagine a bathroom scale that adds +2.3 kg to everyone’s weight:

  • Random error (no bias): Each time you step on, your balance shifts a little. Readings jump around your true weight — say 68.0–68.5 kg. Averaging them gives the right result (≈68 kg). More readings reduce the scatter.
  • Systematic error (bias): The scale’s zero point is wrong. Every reading shows +2.3 kg too much. Weigh yourself once: 70.3 kg. Weigh yourself 1,000 times: still ~70.3 kg — precisely wrong.

That was Literary Digest’s problem: a miscalibrated “instrument” for measuring public opinion. Millions of biased responses only produced false confidence.


Where did the bias come from?

Two biases both worked in favor of Alf Landon:

  1. Coverage (selection) bias — who could be contacted

    • The poll used telephone books, car registration lists, and magazine subscribers.
    • During the Great Depression, these lists mostly included wealthier Americans, who leaned Republican.
    • Result: systematic underrepresentation of poorer, pro-Roosevelt voters.
  2. Nonresponse bias — who chose to reply

    • Only about one in four people (≈24%) who were contacted returned their ballot.
    • Those who responded were more politically active and more likely to oppose Roosevelt.

Together, these created a huge systematic bias that no large sample could fix.


Why sample size couldn’t save the poll

Taking 2.4 million responses from a biased list is like weighing an entire country on a faulty scale.

  • The maximum possible (worst-case scenario) margin of error (at the 95\% confidence level) for a sample of that size, if it had been a true random sample, would have been: \text{MoE}_{95\%} \approx 1.96\sqrt{\frac{0.25}{2{,}400{,}000}} \approx \pm 0.06 \text{ percentage points}, which is tiny.
  • That formula only captures random error, not bias.
  • The real error was about ±18–20 percentage points — hundreds of times larger.

Lesson: Precision without representativeness is useless. A huge biased sample can be worse than a small, carefully chosen one.


Modern Polling: Smaller but Smarter

The Literary Digest disaster transformed polling practice:

  • Probability sampling: every voter has a known, non-zero chance of selection.
  • Weighting: adjust for groups that reply too often or too rarely.
  • Total survey error mindset: consider coverage, nonresponse, measurement, and processing errors — not just sampling error.

Bottom line: How you sample matters far more than how many you sample.


1.4 Understanding Randomness

A random experiment is any process whose result cannot be predicted with certainty, such as tossing a coin or rolling a die.

An outcome is a single possible result of that experiment—for example, getting “heads” or rolling a “5”.

Sample space is the set of all possible outcomes of a random experiment. It is typically denoted by the symbol S or Ω (omega).

An event is a set of one or more outcomes that we’re interested in; it could be a simple event (like rolling exactly a 3) or a compound event (like rolling an even number, which includes the outcomes 2, 4, and 6).

Probability is a way of measuring how likely something is to happen. It’s a number between 0 and 1 (or 0% and 100%) that represents the chance of an event occurring.

A probability distribution is a mathematical function/rule that describes the likelihood of different possible outcomes in a random experiment.

If something has a probability of 0, it’s impossible - it will never happen. If something has a probability of 1, it’s certain - it will definitely happen. Most things fall somewhere in between.

For example, when you flip a fair coin, there’s a 0.5 (or 50%) probability it will land on heads, because there are two equally likely outcomes and heads is one of them.

Probability helps us make sense of uncertainty and randomness in the world.


In statistics, randomness is an orderly way to describe uncertainty. While each individual outcome is unpredictable, stable patterns (more formally, empirical distributions of outcomes converge to probability distributions) emerge over many repetitions.

Example: Flip a fair coin:

  • Single flip: Completely unpredictable—you can’t know if it’ll be heads or tails
  • 100 flips: You’ll get close to 50% heads (maybe 48 or 53)
  • 10,000 flips: Almost certainly very close to 50% heads (perhaps 49.8%)

The same applies to dice: you can’t predict your next roll, but roll 600 times and each number (1-6) will appear close to 100 times. This predictable long-run behavior from unpredictable individual events is the essence of statistical randomness.
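This stabilizing pattern is easy to see by simulation; a minimal R sketch (the exact numbers will differ from run to run):

set.seed(1)
flips <- sample(c("H", "T"), size = 10000, replace = TRUE)   # fair coin

mean(flips[1:10]  == "H")    # a short run: can easily be 0.3 or 0.7
mean(flips[1:100] == "H")    # usually within a few points of 0.50
mean(flips == "H")           # 10,000 flips: very close to 0.50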

Types of Randomness

Epistemic vs. Ontological Randomness:

  • Epistemic randomness (due to incomplete knowledge): We treat an outcome as random because not all determinants are observed or conditions are not controlled. The system itself is deterministic—it follows fixed rules—but we lack the information needed to predict the outcome.

    • Coin toss: The trajectory of the coin is governed entirely by classical mechanics. If we knew the exact initial position, force, angular momentum, air resistance, and surface properties, we could theoretically predict whether it lands on heads or tails. The “randomness” exists only because we cannot measure these conditions with sufficient precision.

    • Poll responses: An individual’s answer to a survey question is determined by their beliefs, experiences, and context, but we don’t have access to this complete psychological state, so we model it as random.

    • Measurement error: Limited instrument precision means the “true” value exists, but we observe it with uncertainty.

  • Ontological randomness (intrinsic indeterminacy): Even complete knowledge of all conditions does not remove outcome uncertainty. The randomness is fundamental to the nature of reality itself, not just a gap in our knowledge.

    • Radioactive decay: The exact moment when a particular atom will decay is fundamentally unpredictable, even in principle. Quantum mechanics tells us only the probability distribution, not the precise timing.

    • Quantum measurements: The outcome of measuring a quantum particle’s position or spin is inherently probabilistic, not determined by hidden variables we simply haven’t discovered yet.

The Coin Toss Paradox

While we treat coin tosses as producing 50-50 random outcomes, research by mathematician Persi Diaconis has shown that with a mechanical coin-flipping machine that precisely controls initial conditions, you can reliably bias the outcome toward a chosen side. This confirms that coin tosses are epistemically, not ontologically, random—the apparent randomness comes from our inability to control and measure conditions, not from any fundamental indeterminacy in physics.

1.5 Populations and Samples

Understanding the distinction between populations and samples is crucial for proper statistical analysis.

Population

A population is the complete set of individuals, objects, or measurements about which we wish to draw conclusions. The key word here is “complete”—a population includes every single member of the group we’re studying.

Examples of Populations in Demography:

  • All residents of India as of January 1, 2024: This includes every person living in India on that specific date—approximately 1.4 billion people.
  • All births in Sweden during 2023: Every baby born within Swedish borders during that calendar year—roughly 100,000 births.
  • All households in Tokyo: Every residential unit where people live, cook, and sleep separately from others—about 7 million households.
  • All deaths from COVID-19 worldwide in 2020: Every death where COVID-19 was listed as a cause—several million deaths.

Populations can be:

Finite: Having a countable number of members (all current U.S. citizens, all Polish municipalities in 2024)

Infinite: Theoretical or uncountably large (all possible future births, all possible coin tosses or dice rolls)

Fixed: Defined at a specific point in time (all residents on census day)

Dynamic: Changing over time (the population of a city that experiences births, deaths, and migration daily)

Sample

A sample is a subset of the population that is actually observed or measured. We study samples because examining entire populations is often impossible, impractical, or unnecessary.

Why We Use Samples:

Practical Impossibility: Imagine testing every person in China for a disease. By the time you finished testing 1.4 billion people, the disease situation would have changed completely, and some people tested early would need retesting.

Cost Considerations: The 2020 U.S. Census cost approximately $16 billion. Conducting such complete enumerations frequently would be prohibitively expensive. A well-designed sample survey can provide accurate estimates at a fraction of the cost.

Time Constraints: Policy makers often need information quickly. A sample survey of 10,000 people can be completed in weeks, while a census takes years to plan, execute, and process.

Destructive Measurement: Some measurements destroy what’s being measured. Testing the lifespan of light bulbs or the breaking point of materials requires using samples.

Greater Accuracy: Surprisingly, samples can sometimes be more accurate than complete enumerations. With a sample, you can afford better training for interviewers, more careful data collection, and more thorough quality checks.

Example of Sample vs. Population:

Let’s say we want to know the average household size in New York City:

  • Population: All 3.2 million households in NYC
  • Census approach: Attempt to contact every household (expensive, time-consuming, some will be missed)
  • Sample approach: Randomly select 5,000 households, carefully measure their sizes, and use this to estimate the average for all households
  • Result: The sample might find an average of 2.43 people per household with a margin of error of ±0.05, meaning we’re confident the true population average is between 2.38 and 2.48

Overview of Sampling Methods

Sampling involves selecting a subset of the population to estimate its characteristics. The sampling frame (list from which we sample) should ideally contain each member exactly once. Frame problems: undercoverage, overcoverage, duplication, and clustering.

Probability Sampling (Statistical Inference Possible)

  • Simple Random Sampling (SRS): Every possible sample of size n has equal probability of selection (sampling without replacement). Gold standard of probability methods.

    • Formal definition: Each of the \binom{N}{n} possible samples has probability \frac{1}{\binom{N}{n}}.
    • Inclusion probability for a unit:
      • Question: In how many samples does a specific person (e.g., student John) appear?
      • If John is already in the sample (that’s fixed), we need to select n-1 more people from the remaining N-1 people (everyone except John).
      • Number of samples containing John: \binom{N-1}{n-1}
      • Probability:

    P(\text{John in sample}) = \frac{\text{samples with John}}{\text{all samples}} = \frac{\binom{N-1}{n-1}}{\binom{N}{n}} = \frac{n}{N}

    • Numerical example: N=5 people {A,B,C,D,E}, we sample n=3. All samples: \binom{5}{3}=10. Samples with person A: {ABC, ABD, ABE, ACD, ACE, ADE} = \binom{4}{2}=6 samples. Probability: 6/10 = 3/5 = n/N (see the R check after this list)

  • Systematic Sampling: Selection of every k-th element, where k = N/n (sampling interval).

    • How it works: Randomly select a starting point r from \{1, 2, ..., k\}, then select: r, r+k, r+2k, r+3k, ...
    • Example: N=1000, n=100, so k=10. If r=7, we select: 7, 17, 27, 37, …, 997.
    • Advantages: Very simple, ensures even coverage of the population.
    • Periodicity problem: If the list has a pattern repeating every k elements, the sample can be severely biased.
      • Example (bad): Apartment list: 101, 102, 103, 104 (corner), 201, 202, 203, 204 (corner), … If k=4, we might sample only corner apartments!
      • Example (bad): Daily production data with 7-day cycle. If k=7, we might sample only Mondays.
      • Example (good): Alphabetical list of surnames - usually no periodicity.
  • Cluster Sampling: Selection of entire groups (clusters) instead of individual units. Cost-effective for geographically dispersed populations (e.g., sampling schools instead of students), but typically less precise than SRS (design effect: DEFF = Variance(cluster)/Variance(SRS)). Can be single- or multi-stage.
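The inclusion-probability identity from the SRS example can be verified numerically; a minimal R check with N = 5 and n = 3:

N <- 5; n <- 3
choose(N - 1, n - 1) / choose(N, n)   # samples containing John / all samples = 0.6
n / N                                 # the same value, 0.6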

Non-Probability Sampling (Limited Statistical Inference)

  • Convenience Sampling: Selection based on ease of access (e.g., passersby in city center). Useful in pilot/exploratory studies, but likely serious selection bias.

  • Purposive/Judgmental Sampling: Deliberate selection of typical, extreme, or information-rich cases. Valuable in qualitative research and studying rare populations.

  • Quota Sampling: Matching population proportions (e.g., 50% women), but without random selection. Quick and inexpensive, but hidden selection bias and no ability to calculate sampling error.

  • Snowball Sampling: Participants recruit others from their networks. Essential for hard-to-reach populations (drug users, undocumented immigrants), but biased toward well-connected individuals.

Fundamental Principle: Probability sampling enables valid statistical inference and calculation of sampling error; non-probability methods may be necessary for practical or ethical reasons, but limit the ability to generalize results to the entire population.


1.6 Superpopulation and Data Generating Process (DGP) (*)


Superpopulation

A superpopulation is a theoretical infinite population from which your finite population is considered to be one random sample.

Think of it in three levels:

  1. Superpopulation: An infinite collection of possible values (theoretical)
  2. Finite population: The actual population you could theoretically census (e.g., all 50 US states, all 10,000 firms in an industry)
  3. Sample: The subset you actually observe (e.g., 30 states, 500 firms)

Why do we need this concept?

Consider the 50 US states. You might measure unemployment rate for all 50 states—a complete census, no sampling needed. But you still want to:

  • Test if unemployment is related to education levels
  • Predict next year’s unemployment rates
  • Determine if differences between states are “statistically significant”

Without the superpopulation concept, you’re stuck—you have all the data, so what’s left to infer? The answer: treat this year’s 50 values as one draw from an infinite superpopulation of possible values that could occur under similar conditions.

Mathematical representation:

  • Finite population value: Y_i (state i’s unemployment rate)
  • Superpopulation model: Y_i = \mu + \epsilon_i where \epsilon_i \sim (0, \sigma^2)
  • The 50 observed values are one realization of this process

Data Generating Process: The True Recipe

The Data Generating Process (DGP) is the actual mechanism that creates your data—including all factors, relationships, and random elements.

An intuitive example: Suppose student test scores are truly generated by:

\text{Score}_i = 50 + 2(\text{StudyHours}_i) + 3(\text{SleepHours}_i) - 5(\text{Stress}_i) + 1.5(\text{Breakfast}_i) + \epsilon_i

This is the TRUE DGP. But you don’t know this! You might estimate:

\text{Score}_i = \alpha + \beta(\text{StudyHours}_i) + u_i

Your model is simpler than reality. You’re missing variables (sleep, stress, breakfast), so your estimates might be biased. The u_i term captures everything you missed.

Key insight: We never know the true DGP. Our statistical models are always approximations, trying to capture the most important parts of the unknown, complex truth.
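A minimal simulation sketch of this point. All numbers are hypothetical, the true DGP is simplified (breakfast omitted), and SleepHours is deliberately made to be correlated with StudyHours so that omitting it visibly biases the estimated study effect:

set.seed(2024)
n      <- 1000
study  <- rnorm(n, mean = 10, sd = 3)               # hours studied
sleep  <- 0.5 * study + rnorm(n, mean = 4, sd = 1)  # correlated with study
stress <- rnorm(n, mean = 5, sd = 2)

# "True" (simulated) DGP
score <- 50 + 2 * study + 3 * sleep - 5 * stress + rnorm(n, sd = 5)

coef(lm(score ~ study))                    # slope on study is about 3.5: biased upward
coef(lm(score ~ study + sleep + stress))   # slope on study is close to the true 2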


Two Approaches to Statistical Inference

When analyzing data, especially from surveys or samples, we can take two philosophical approaches:

1. Design-Based Inference

  • Philosophy: The population values are fixed numbers. Randomness comes ONLY from which units we happened to sample.
  • Focus: How we selected the sample (simple random, stratified, cluster sampling, etc.)
  • Example: The mean income of California counties is a fixed number. We sample 10 counties. Our uncertainty comes from which 10 we randomly selected.
  • No models needed: We don’t assume anything about the population values’ distribution

2. Model-Based Inference

  • Philosophy: The population values themselves are realizations from some probability model (superpopulation)
  • Focus: The statistical model generating the population values
  • Example: Each California county’s income is drawn from: Y_i = \mu + \epsilon_i where \epsilon_i \sim N(0, \sigma^2)
  • Models required: We make assumptions about how the data were generated

Which is better?

  • Large populations, good random samples: Design-based works well
  • Small populations (like 50 states): Model-based often necessary
  • Complete enumeration: Only model-based allows inference
  • Modern practice: Often combines both approaches

Practical Example: Analyzing State Education Spending

Suppose you collect education spending per pupil for all 50 US states.

Without superpopulation thinking:

  • You have all 50 values—that’s it
  • The mean is the mean, no uncertainty
  • You can’t test hypotheses or make predictions

With superpopulation thinking:

  • This year’s 50 values are one realization from a superpopulation
  • Model: \text{Spending}_i = \mu + \beta(\text{StateIncome}_i) + \epsilon_i
  • Now you can:
    • Test if spending relates to state income (\beta \neq 0?)
    • Predict next year’s values
    • Calculate confidence intervals

The key insight: Even with complete data, the superpopulation framework enables statistical inference by treating observed values as one possible outcome from an underlying stochastic process.


Summary

  • Superpopulation: Treats your finite population as one draw from an infinite possibility space—essential when your finite population is small or completely observed

  • DGP: The true (unknown) process creating your data—your models try to approximate it


1.7 Understanding Data, Data Distributions, and Data Typologies

What is Data?
Data is a collection of facts, observations, or measurements that we gather to answer questions or understand phenomena. In statistics and data analysis, data represents information in a structured format that can be analyzed.

Data Points
A data point is a single observation or measurement in a dataset. For example, if we measure the height of 5 students, each individual height measurement is a data point.

Variables
A variable is a characteristic or attribute that can take different values across observations. Variables can be:

  • Categorical (e.g., color, gender, country)
  • Numerical (e.g., age, temperature, income)

Data Distribution
Data distribution describes what values a variable takes and how often each value occurs in the dataset. Understanding distribution helps us see patterns, central tendencies, and variability in our data.

Frequency Distribution Tables
A frequency distribution table organizes data by showing each unique value (or range of values) and the number of times it appears:

| Value | Frequency | Relative Frequency |
|---|---|---|
| A | 15 | 0.30 (30%) |
| B | 25 | 0.50 (50%) |
| C | 10 | 0.20 (20%) |
| Total | 50 | 1.00 (100%) |

This table allows us to quickly see which values are most common and understand the overall distribution pattern.
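In R, table() and prop.table() build such a table directly; a minimal sketch with 50 hypothetical observations matching the counts above:

x <- rep(c("A", "B", "C"), times = c(15, 25, 10))   # 50 observations

freq     <- table(x)           # absolute frequencies: A 15, B 25, C 10
rel_freq <- prop.table(freq)   # relative frequencies: 0.30, 0.50, 0.20
freq
rel_freq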


Understanding Different Types of Data Structures (Data Sets) and Their Formats

Cross-sectional Data

Observations for variables (columns in a database) collected at a single point in time across multiple entities/individuals:

| Individual | Age | Income | Education |
|---|---|---|---|
| 1 | 25 | 50000 | Bachelor’s |
| 2 | 35 | 75000 | Master’s |
| 3 | 45 | 90000 | PhD |

Time Series Data

Observations of a single entity tracked over multiple time points:

| Year | GDP (in billions) | Unemployment Rate |
|---|---|---|
| 2018 | 20,580 | 3.9% |
| 2019 | 21,433 | 3.7% |
| 2020 | 20,933 | 8.1% |

Panel Data (Longitudinal Data)

Observations of multiple entities tracked over time:

| Country | Year | GDP per capita | Life Expectancy |
|---|---|---|---|
| USA | 2018 | 62,794 | 78.7 |
| USA | 2019 | 65,118 | 78.8 |
| Canada | 2018 | 46,194 | 81.9 |
| Canada | 2019 | 46,194 | 82.0 |

Time-series Cross-sectional (TSCS) Data

A special case of panel data where:

  • Number of time points > Number of entities
  • Similar structure to panel data but with emphasis on temporal depth
  • Common in political science and economics research

Data Formats

Wide Format

Each row represents an entity; columns represent variables/time points:

| Country | GDP_2018 | GDP_2019 | LE_2018 | LE_2019 |
|---|---|---|---|---|
| USA | 62,794 | 65,118 | 78.7 | 78.8 |
| Canada | 46,194 | 46,194 | 81.9 | 82.0 |

Long Format

Each row represents a unique entity-time-variable combination:

| Country | Year | Variable | Value |
|---|---|---|---|
| USA | 2018 | GDP per capita | 62,794 |
| USA | 2019 | GDP per capita | 65,118 |
| USA | 2018 | Life Expectancy | 78.7 |
| USA | 2019 | Life Expectancy | 78.8 |
| Canada | 2018 | GDP per capita | 46,194 |
| Canada | 2019 | GDP per capita | 46,194 |
| Canada | 2018 | Life Expectancy | 81.9 |
| Canada | 2019 | Life Expectancy | 82.0 |

Note: Long format is generally preferred for:

  • Data manipulation in R and Python
  • Statistical analysis
  • Data visualization
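As an illustration, the wide-format example above can be reshaped into long format with the tidyr package (a sketch; assumes tidyr is installed):

library(tidyr)

wide <- data.frame(
  Country  = c("USA", "Canada"),
  GDP_2018 = c(62794, 46194),
  GDP_2019 = c(65118, 46194),
  LE_2018  = c(78.7, 81.9),
  LE_2019  = c(78.8, 82.0)
)

long <- pivot_longer(
  wide,
  cols      = -Country,                 # every column except Country
  names_to  = c("Variable", "Year"),    # split names such as "GDP_2018"
  names_sep = "_",
  values_to = "Value"
)
long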

Understanding data types and distributions is fundamental to choosing appropriate analyses and interpreting results correctly.

Types of Data

Data consists of collected observations or measurements. The type of data determines what mathematical operations (e.g. multiplication) are meaningful and what statistical methods apply.

Quantitative Data

Continuous Data can take any value within a range:

Examples:

  • Age: Can be 25.5 years, 25.51 years, 25.514 years (precision limited only by measurement)
  • Body Mass Index: 23.7 kg/m²
  • Fertility Rate: 1.73 children per woman
  • Population Density: 4,521.3 people per km²
  • Voter turnout: 60%

Properties:

  • Can perform all arithmetic operations
  • Can calculate means, standard deviations

Discrete Data can only take specific values:

Examples:

  • Number of Children: 0, 1, 2, 3… (can’t have 2.5 children)
  • Number of Marriages: 0, 1, 2, 3…
  • Household Size: 1, 2, 3, 4… people
  • Number of Doctor Visits: 0, 1, 2, 3… per year
  • Electoral District Magnitude: 1, 2, 3, …

Qualitative/Categorical Data

Nominal Data represents categories with no inherent order:

Examples:

  • Country of Birth: USA, China, India, Brazil…
  • Religion: Christian, Muslim, Hindu, Buddhist, None…
  • Marital Status: Single, Married, Divorced, Widowed
  • Cause of Death: Heart disease, Cancer, Accident, Stroke…
  • Blood Type: A, B, AB, O

What We Can Do:

  • Count frequencies
  • Calculate proportions
  • Find mode

What We Cannot Do:

  • Calculate mean (average religion makes no sense)
  • Order categories meaningfully
  • Compute distances between categories

Ordinal Data represents ordered categories:

Examples:

  • Education Level: None < Primary < Secondary < Tertiary
  • Socioeconomic Status: Low < Middle < High
  • Self-Rated Health: Poor < Fair < Good < Excellent
  • Agreement Scale: Strongly Disagree < Disagree < Neutral < Agree < Strongly Agree

The Challenge: Intervals between categories aren’t necessarily equal. The “distance” from Poor to Fair health may not equal the distance from Good to Excellent.

Frequency, Relative Frequency, and Density

When we analyze data, we’re often interested in how many times each value (or range of values) appears. This leads us to three related concepts:

(Absolute) Frequency is simply the count of how many times a particular value or category occurs in your data. If 15 students scored between 70-80 points on an exam, the frequency for that range is 15.

Relative frequency expresses frequency as a proportion or percentage of the total. It answers the question: “What fraction of all observations fall into this category?” Relative frequency is calculated as:

\text{Relative Frequency} = \frac{\text{Frequency}}{\text{Total Number of Observations}}

If 15 out of 100 students scored 70-80 points, the relative frequency is 15/100 = 0.15 or 15%. Relative frequencies always sum to 1 (or 100%), making them useful for comparing distributions with different sample sizes.

Tip

The probability of an event is a number between 0 and 1; the larger the probability, the more likely an event is to occur.

Density (probability per unit length) measures how concentrated observations are per unit of measurement. When grouping continuous data (like time or unemployment rate) into intervals of different widths, we need density to ensure fair comparison—wider intervals naturally contain more observations simply because they’re wider, not because values are more concentrated there. Density is calculated as:

\text{Density} = \frac{\text{Relative Frequency}}{\text{Interval Width}}

This standardization allows fair comparison between intervals—wider intervals don’t appear artificially more important just because they’re wider.

Density is particularly important for continuous variables because it ensures that the total area under the distribution equals 1, which allows us to interpret areas as probabilities.

Cumulative frequency tells us how many observations fall at or below a certain value.

Instead of asking “how many observations are in this category?”, cumulative frequency answers “how many observations are in this category or any category below it?” It’s calculated by adding up all frequencies from the lowest value up to and including the current value.

Similarly, cumulative relative frequency expresses this as a proportion of the total, answering “what percentage of observations fall at or below this value?” For example, if the cumulative relative frequency at score 70 is 0.40, this means 40% of students scored 70 or below.

Distribution Tables

A frequency distribution table organizes data by showing how observations are distributed across different values or intervals. Here’s an example with exam scores:

| Score Range | Frequency | Relative Frequency | Cumulative Frequency | Cumulative Relative Frequency | Density |
|---|---|---|---|---|---|
| 0-50 | 10 | 0.10 | 10 | 0.10 | 0.002 |
| 50-70 | 30 | 0.30 | 40 | 0.40 | 0.015 |
| 70-90 | 45 | 0.45 | 85 | 0.85 | 0.0225 |
| 90-100 | 15 | 0.15 | 100 | 1.00 | 0.015 |
| Total | 100 | 1.00 | - | - | - |

This table reveals that most students scored in the 70-90 range, while very few scored below 50 or above 90. The cumulative columns show us that 40% of students scored below 70, and 85% scored below 90. Such tables are invaluable for getting a quick overview of your data before conducting more complex analyses.
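The derived columns follow directly from the raw frequencies; a minimal R sketch reproducing them:

score_range <- c("0-50", "50-70", "70-90", "90-100")
freq        <- c(10, 30, 45, 15)       # counts per score range
width       <- c(50, 20, 20, 10)       # widths of the (unequal) intervals

rel_freq <- freq / sum(freq)           # relative frequency
cum_freq <- cumsum(freq)               # cumulative frequency
cum_rel  <- cumsum(rel_freq)           # cumulative relative frequency
dens     <- rel_freq / width           # density = relative frequency / width

data.frame(score_range, freq, rel_freq, cum_freq, cum_rel, dens)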

Visualizing Distributions: Histograms

A histogram is a graphical representation of a frequency distribution. It displays data using bars where:

  • The x-axis shows the values or intervals (bins)
  • The y-axis can show frequency, relative frequency, or density
  • The height of each bar represents the count, proportion, or density for that interval
  • Bars touch each other (no gaps) for continuous variables

Choosing bin widths: The number and width of bins significantly affects how your histogram looks. Too few bins hide important patterns, while too many bins create “noise” and make patterns hard to see.

In statistics, noise is unwanted random variation that obscures the pattern (the “signal”) we’re trying to find. Think of it like static on a radio: it makes the music harder to hear. In data, noise comes from measurement error, random fluctuations, or the inherent variability of what we’re studying.

Several approaches help determine appropriate bin widths (*):

  • Sturges’ rule: Use k = 1 + \log_2(n) bins, where n is the sample size. This works well for roughly symmetric distributions.

  • Square root rule: Use k = \sqrt{n} bins. A simple, reasonable default for many situations.

In R, you can specify bins in several ways:

# Generate exam scores data
set.seed(123)  # For reproducibility
exam_scores <- c(
  rnorm(80, mean = 75, sd = 12),  # Most students cluster around 75
  runif(15, 50, 65),               # Some lower performers
  runif(5, 85, 95)                 # A few high achievers
)

# Keep scores within valid range (0-100)
exam_scores <- pmin(pmax(exam_scores, 0), 100)

# Round to whole numbers
exam_scores <- round(exam_scores)

# Suggest the number of bins (hist() treats a single number as a suggestion only)
hist(exam_scores, breaks = 10)

# Specify exact break points
hist(exam_scores, breaks = seq(0, 100, by = 10))

# Let R choose automatically (uses Sturges' rule by default)
hist(exam_scores)

The best approach is often to experiment with different bin widths to find what best reveals your data’s pattern. Start with a default, then try fewer and more bins to see how the story changes.

Defining bin boundaries: When creating bins for a frequency table, you must decide how to handle values that fall exactly on the boundaries. For example, if you have bins 0-10 and 10-20, which bin does the value 10 belong to?

The solution is to use interval notation to specify whether each boundary is included or excluded:

  • Closed interval [a, b] includes both endpoints: a \leq x \leq b
  • Open interval (a, b) excludes both endpoints: a < x < b
  • Half-open interval [a, b) includes the left endpoint but excludes the right: a \leq x < b
  • Half-open interval (a, b] excludes the left endpoint but includes the right: a < x \leq b

Standard convention: Much statistical software uses left-closed, right-open intervals [a, b) for all bins except the last one, which is fully closed [a, b]. (Note that R’s hist() and cut() default to the opposite, right-closed convention (a, b], with the lowest value included in the first bin; set right = FALSE to get left-closed bins.) Under the left-closed convention:

  • The value at the lower boundary is included in the bin
  • The value at the upper boundary belongs to the next bin
  • The very last bin includes both boundaries to capture the maximum value

For example, with bins 0-20, 20-40, 40-60, 60-80, 80-100:

| Score Range | Interval Notation | Values Included |
|---|---|---|
| 0-20 | [0, 20) | 0 ≤ score < 20 |
| 20-40 | [20, 40) | 20 ≤ score < 40 |
| 40-60 | [40, 60) | 40 ≤ score < 60 |
| 60-80 | [60, 80) | 60 ≤ score < 80 |
| 80-100 | [80, 100] | 80 ≤ score ≤ 100 |

This convention ensures that:

  • Every value is counted exactly once (no double-counting)
  • No values fall through the cracks
  • The bins partition the entire range completely

When presenting frequency tables in reports, you can simply write “0-20, 20-40, …” and note that bins are left-closed, right-open, or explicitly show the interval notation if precision is important.

Frequency histogram shows the raw counts:

# R code example
hist(exam_scores, 
     breaks = seq(0, 100, by = 10),
     main = "Distribution of Exam Scores",
     xlab = "Score",
     ylab = "Frequency",
     col = "lightblue")

Relative frequency histogram shows proportions (useful when comparing groups of different sizes):

# Convert counts to proportions so the bar heights show relative frequencies
# (freq = FALSE alone would plot density, not proportions)
h <- hist(exam_scores, breaks = seq(0, 100, by = 10), plot = FALSE)
h$counts <- h$counts / sum(h$counts)
plot(h,
     main = "Distribution of Exam Scores",
     xlab = "Score",
     ylab = "Relative Frequency",
     col = "lightgreen")

Density histogram adjusts for interval width and is used with density curves:

hist(exam_scores, 
     breaks = seq(0, 100, by = 10),
     freq = FALSE,  # Creates density scale
     main = "Distribution of Exam Scores",
     xlab = "Score",
     ylab = "Density",
     col = "lightcoral")

Density Curves

A density curve is a smooth line that approximates/models the shape of a distribution. Unlike histograms that show actual data in discrete bins, density curves show the overall pattern as a continuous function. The area under the entire curve always equals 1, and the area under any portion of the curve represents the proportion of observations in that range.

# Adding a density curve to a histogram
hist(exam_scores, 
     freq = FALSE,
     main = "Exam Scores with Density Curve",
     xlab = "Score",
     ylab = "Density",
     col = "lightblue",
     border = "white")
lines(density(exam_scores), 
      col = "darkred", 
      lwd = 2)

Density curves are particularly useful for:

  • Identifying the shape of the distribution (symmetric, skewed, bimodal)
  • Comparing multiple distributions on the same plot
  • Understanding the theoretical (true) distribution underlying your data
Tip

In statistics, a percentile indicates the relative position of a data point within a dataset by showing the percentage of observations that fall at or below that value. For example, if a student scores at the 90th percentile on a test, their score is equal to or higher than 90% of all other scores.

Quartiles are special percentiles that divide data into four equal parts: the first quartile (Q1, 25th percentile), second quartile (Q2, 50th percentile, also the median), and third quartile (Q3, 75th percentile). If Q1 = 65 points, then 25% of students scored 65 or below.

More generally, quantiles are values that divide data into equal-sized groups—percentiles divide into 100 parts, quartiles into 4 parts, deciles into 10 parts, and so on.
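In R, quantile() returns any percentile directly; for the exam_scores generated earlier in this chapter:

quantile(exam_scores, probs = c(0.25, 0.50, 0.75))   # quartiles Q1, Q2 (median), Q3
quantile(exam_scores, probs = 0.90)                  # 90th percentile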

Visualizing Cumulative Frequency (*)

Cumulative frequency plots, also called ogives (pronounced “oh-jive”), display how frequencies accumulate across values. These plots use lines rather than bars and always increase from left to right, eventually reaching the total number of observations (for cumulative frequency) or 1.0 (for cumulative relative frequency).

Cumulative frequency plots are excellent for:

  • Finding percentiles and quartiles visually
  • Determining what proportion of data falls below or above a certain value
  • Comparing distributions of different groups
# Creating cumulative frequency data
score_breaks <- seq(0, 100, by = 10)
freq_counts <- hist(exam_scores, breaks = score_breaks, plot = FALSE)$counts
cumulative_freq <- cumsum(freq_counts)

# Plotting cumulative frequency
plot(score_breaks[-1], cumulative_freq,
     type = "b",  # both points and lines
     main = "Cumulative Frequency of Exam Scores",
     xlab = "Score",
     ylab = "Cumulative Frequency",
     col = "darkblue",
     lwd = 2,
     pch = 19)
grid()

For cumulative relative frequency (which is more commonly used):

# Cumulative relative frequency
cumulative_rel_freq <- cumulative_freq / length(exam_scores)

plot(score_breaks[-1], cumulative_rel_freq,
     type = "b",
     main = "Cumulative Relative Frequency of Exam Scores",
     xlab = "Score",
     ylab = "Cumulative Relative Frequency",
     col = "darkred",
     lwd = 2,
     pch = 19,
     ylim = c(0, 1))
grid()
abline(h = c(0.25, 0.5, 0.75), lty = 2, col = "gray")  # Quartile lines

The cumulative relative frequency curve makes it easy to read percentiles. For example, if you draw a horizontal line at 0.75 and see where it intersects the curve, the corresponding x-value is the 75th percentile—the score below which 75% of students fall.
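
The same reading can be done numerically by interpolating along the cumulative curve; a small sketch reusing the objects created above (the result may differ slightly from quantile(), since it interpolates between bin boundaries):

# Approximate the 75th percentile from the cumulative relative frequency curve
approx(x = cumulative_rel_freq, y = score_breaks[-1], xout = 0.75)$y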


Discrete vs. Continuous Distributions

The type of variable you’re analyzing determines how you visualize its distribution:

Discrete distributions apply to variables that can only take specific, countable values. Examples include number of children in a family (0, 1, 2, 3…), number of customer complaints per day, or responses on a 5-point Likert scale.

For discrete data, we typically use:

  • Bar charts (with gaps between bars) rather than histograms
  • Frequency or relative frequency on the y-axis
  • Each distinct value gets its own bar
# Example: Number of children per family
children <- c(0, 1, 2, 2, 1, 3, 0, 2, 1, 4, 2, 1, 0, 2, 3)
barplot(table(children),
        main = "Distribution of Number of Children",
        xlab = "Number of Children",
        ylab = "Frequency",
        col = "skyblue")

Continuous distributions apply to variables that can take any value within a range. Examples include temperature, response time, height, or turnout percentage.

For continuous data, we use:

  • Histograms (with touching bars) that group data into intervals
  • Density curves to show the smooth pattern
  • Density on the y-axis when using density curves
# Generate response time data (in seconds)
set.seed(456)  # For reproducibility
response_time <- rgamma(200, shape = 2, scale = 1.5)

# Example: Response time distribution
hist(response_time, 
     breaks = 15,
     freq = FALSE,
     main = "Distribution of Response Time",
     xlab = "Response Time (seconds)",
     ylab = "Density",
     col = "lightgreen",
     border = "white")
lines(density(response_time), 
      col = "darkgreen", 
      lwd = 2)

The key difference is that discrete distributions show probability at specific points, while continuous distributions show probability density across ranges. For continuous variables, the probability of any exact value is essentially zero—instead, we talk about the probability of falling within an interval.

Understanding whether your variable is discrete or continuous guides your choice of visualization and statistical methods, ensuring your analysis accurately represents the nature of your data.

Describing Distributions

Shape Characteristics:

Symmetry vs. Skewness:

  • Symmetric: Mirror image around center (example: heights in homogeneous population)
  • Right-skewed (positive skew): Long tail to right (example: income, wealth)
  • Left-skewed (negative skew): Long tail to left (example: age at death in developed countries)

Example of Skewness Impact:

Income distribution in the U.S.:

  • Median household income: ~$70,000
  • Mean household income: ~$100,000
  • Mean > Median indicates right skew
  • A few very high incomes pull the mean up

Modality:

  • Unimodal: One peak (example: test scores)
  • Bimodal: Two peaks (example: height when mixing males and females)
  • Multimodal: Multiple peaks (example: age distribution in a college town—peaks at college age and middle age)

Important Probability Distributions:

Normal (Gaussian) Distribution:

  • Bell-shaped, symmetric
  • Characterized by mean (\mu) and standard deviation (\sigma)
  • About 68% of values within \mu \pm \sigma
  • About 95% within \mu \pm 2\sigma
  • About 99.7% within \mu \pm 3\sigma

Demographic Applications:

  • Heights within homogeneous populations
  • Measurement errors
  • Sampling distributions of means (Central Limit Theorem)
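
The 68–95–99.7 coverage figures above can be checked directly with the standard normal CDF in R:

# Probability within 1, 2, and 3 standard deviations of the mean
pnorm(1) - pnorm(-1)   # ~0.6827
pnorm(2) - pnorm(-2)   # ~0.9545
pnorm(3) - pnorm(-3)   # ~0.9973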

Binomial Distribution:

  • Number of successes in n independent trials
  • Each trial has probability p of success
  • Mean = np, Variance = np(1-p)

Example: Number of male births out of 100 births (p \approx 0.512)
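
A small sketch of this example in R; the queries for exactly 55 and for 60 or more male births are illustrative:

n <- 100
p <- 0.512                                           # probability a birth is male
n * p                                                # mean = np
n * p * (1 - p)                                      # variance = np(1 - p)
dbinom(55, size = n, prob = p)                       # P(exactly 55 male births)
pbinom(59, size = n, prob = p, lower.tail = FALSE)   # P(60 or more male births)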

Poisson Distribution:

  • Count of events in fixed time/space
  • Mean = Variance = \lambda
  • Good for rare events

Demographic Applications:

  • Number of deaths per day in small town
  • Number of births per hour in hospital
  • Number of accidents at intersection per month
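
A minimal sketch in R, assuming an illustrative rate of λ = 2 deaths per day for the small town:

lambda <- 2                              # assumed average number of deaths per day
dpois(0, lambda)                         # P(no deaths on a given day)
dpois(3, lambda)                         # P(exactly 3 deaths)
ppois(4, lambda, lower.tail = FALSE)     # P(5 or more deaths)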

Visualizing Frequency Distributions (*)

Histogram: For continuous data, shows frequency with bar heights.

  • X-axis: Value ranges (bins)
  • Y-axis: Frequency or density
  • No gaps between bars (continuous data)
  • Bin width affects appearance

Bar Chart: For categorical data, shows frequency with separated bars.

  • X-axis: Categories
  • Y-axis: Frequency
  • Gaps between bars (discrete categories)
  • Order may or may not matter

Cumulative Distribution Function (CDF): Shows the proportion of values less than or equal to each point.

  • Always increases (or stays flat)
  • Starts at 0, ends at 1
  • Steep slopes indicate common values
  • Flat areas indicate rare values

Box Plot (Box-and-Whisker Plot): A visual summary that displays the distribution’s key statistics using five key values.

The Five-Number Summary:

  • Minimum: Leftmost whisker end (excluding outliers)
  • Q1 (First Quartile): Left edge of the box (25th percentile)
  • Median (Q2): Line inside the box (50th percentile)
  • Q3 (Third Quartile): Right edge of the box (75th percentile)
  • Maximum: Rightmost whisker end (excluding outliers)

What It Reveals:

  • Skewness: If median line is off-center in the box, or whiskers are unequal
  • Spread: Wider boxes and longer whiskers indicate more variability
  • Outliers: Immediately visible as separate points
  • Symmetry: Equal whisker lengths and a centered median suggest a roughly symmetric distribution

Quick Interpretation:

  • Narrow box = consistent data
  • Long whiskers = wide range of values
  • Many outliers = potential data quality issues or interesting extreme cases
  • Median closer to Q1 = right-skewed data (tail extends right)
  • Median closer to Q3 = left-skewed data (tail extends left)

Box plots are especially useful for comparing multiple groups side-by-side!
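
A minimal side-by-side sketch in base R, using simulated scores for two hypothetical groups:

# Compare two simulated groups with box plots
set.seed(123)
group_a <- rnorm(100, mean = 70, sd = 10)   # hypothetical scores, group A
group_b <- rnorm(100, mean = 75, sd = 15)   # hypothetical scores, group B
boxplot(list(A = group_a, B = group_b),
        main = "Comparing Two Groups with Box Plots",
        ylab = "Score",
        col = c("lightblue", "lightgreen"))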


1.8 Variables and Measurement Scales

A variable is any characteristic that can take different values across units of observation.

Measurement: Transforming Concepts into Numbers

The Political World is Full of Data

Political science has evolved from a primarily theoretical discipline to one that increasingly relies on empirical evidence. Whether we’re studying:

  • Election outcomes: Why do people vote the way they do?
  • Public opinion: What shapes attitudes toward immigration or climate policy?
  • International relations: What factors predict conflict between nations?
  • Policy effectiveness: Did a new education policy actually improve outcomes?

We need systematic ways to analyze data and draw conclusions that go beyond anecdotes and personal impressions.

Consider this question: “Does democracy lead to economic growth?”

Your intuition might suggest yes—democratic countries tend to be wealthier. But is this causation or correlation? Are there exceptions? How confident can we be in our conclusions?

Statistics provides the tools to move from hunches to evidence-based answers, helping us distinguish between what seems true and what actually is true.

The Challenge of Measurement in Social Sciences

In social sciences, we often struggle with the fact that key concepts do not translate directly into numbers:

  • How do we measure “democracy”?
  • What number captures “political ideology”?
  • How do we quantify “institutional strength”?
  • How do we measure “political participation”?

🔍 Correlation ≠ Causation: Understanding Spurious Relationships

The Fundamental Distinction

Correlation measures how two variables move together:

  • Positive: Both increase together (study hours ↑, grades ↑)
  • Negative: One increases while other decreases (TV hours ↑, grades ↓)
  • Measured by correlation coefficient: r \in [-1, 1]

Causation means one variable directly influences another:

  • X \rightarrow Y: Changes in X directly cause changes in Y
  • Requires: (1) correlation, (2) temporal precedence, (3) no alternative explanations

The Danger: Spurious Correlation

A spurious correlation occurs when two variables appear related but are actually both influenced by a third variable (a confounder).

Classic Example:

  • Observed: Ice cream sales correlate with drowning deaths
  • Spurious conclusion: Ice cream causes drowning (❌)
  • Reality: Summer weather (confounder) causes both:
    • Summer → More ice cream sales
    • Summer → More swimming → More drownings

Mathematical representation:

  • Observed correlation: \text{Cor}(X,Y) \neq 0
  • But the true model: X = \alpha Z + \epsilon_1 and Y = \beta Z + \epsilon_2
  • Where Z is the confounding variable causing both

Confounding: The Hidden Influence

A confounding variable (confounder):

  1. Affects both the presumed cause and effect
  2. Creates an illusion of direct causation
  3. Must be controlled for valid causal inference

Research Example:

  • Observed: Coffee consumption correlates with heart disease
  • Potential confounder: Smoking (coffee drinkers more likely to smoke)
  • True relationships:
    • Smoking → Heart disease (causal)
    • Smoking → Coffee consumption (association)
    • Coffee → Heart disease (spurious without controlling for smoking)

How to Identify Causal Relationships

  1. Randomized Controlled Trials (RCTs): Random assignment breaks confounding
  2. Natural Experiments: External events create “as-if” random variation
  3. Statistical Control: Include confounders in regression models
  4. Instrumental Variables: Find variables affecting X but not Y directly

Key Takeaway

Finding correlation is easy. Establishing causation is hard. Always ask: “What else could explain this relationship?”

Remember: The most dangerous phrase in empirical research is “our data shows that X causes Y” when all you’ve measured is correlation.


For each scenario, identify whether the relationship is likely causal or spurious:

  1. Cities with more churches have more crime
    • Answer: Spurious (confounder: population size)
  2. Smoking leads to lung cancer
    • Answer: Causal (established through multiple study designs)
  3. Students with more books at home get better grades
    • Answer: Likely spurious (confounders: parental education, income)
  4. Countries with higher chocolate consumption have more Nobel laureates
    • Answer: Spurious (confounder: wealth/development level)

Types of Variables

Quantitative Variables represent amounts or quantities and can be:

Continuous Variables: Can take any value within a range, limited only by measurement precision.

  • Age (22.5 years, 22.51 years, 22.514 years…)
  • Income ($45,234.67)
  • Height (175.3 cm)
  • Population density (432.7 people per square kilometer)

Discrete Variables: Can only take specific values, usually counts.

  • Number of children in a family (0, 1, 2, 3…)
  • Number of marriages (0, 1, 2…)
  • Number of rooms in a dwelling (1, 2, 3…)
  • Number of migrants entering a country per year

Qualitative Variables represent categories or qualities and can be:

Nominal Variables: Categories with no inherent order.

  • Country of birth (USA, Mexico, Canada…)
  • Religion (Christian, Muslim, Hindu, Buddhist…)
  • Blood type (A, B, AB, O)
  • Cause of death (heart disease, cancer, accident…)

Ordinal Variables: Categories with a meaningful order but unequal intervals.

  • Education level (no schooling, primary, secondary, tertiary)
  • Satisfaction with healthcare (very dissatisfied, dissatisfied, neutral, satisfied, very satisfied)
  • Socioeconomic status (low, middle, high)
  • Self-rated health (poor, fair, good, excellent)

Measurement Scales

Understanding measurement scales is crucial because they determine which statistical methods are appropriate:

Nominal Scale: Categories only—we can count frequencies but cannot order or perform arithmetic. Example: We can say 45% of residents were born locally, but we cannot calculate an “average birthplace.”

Ordinal Scale: Order matters but differences between values are not necessarily equal. Example: The difference between “poor” and “fair” health may not equal the difference between “good” and “excellent” health.

Interval Scale: Equal intervals between values but no true zero point. Example: Temperature in Celsius—the difference between 20°C and 30°C equals the difference between 30°C and 40°C, but 0°C doesn’t mean “no temperature.”

Ratio Scale: Equal intervals with a true zero point, allowing all mathematical operations. Example: Income—$40,000 is twice as much as $20,000, and $0 means no income.


1.9 Parameters, Statistics, Estimands, Estimators, and Estimates

Statistical inference is the process of learning unknown features of a population from finite samples. This section introduces five core ideas.

Quick comparison (summary table)

Term      | What is it?                                      | Random? | Typical notation               | Example
Estimand  | Precisely defined target quantity                | No      | words (specification)          | “Median household income in CA on 2024-01-01.”
Parameter | The true population value of that quantity       | No*     | \theta,\ \mu,\ p,\ \beta       | True mean age at first birth in France (2023)
Estimator | A rule/formula mapping data to an estimate       | Yes     | \hat\theta = g(X_1,\dots,X_n)  | \bar X, \hat p = X/n, OLS \hat\beta
Statistic | Any function of the sample (includes estimators) | Yes     | \bar X,\ s^2,\ r               | Sample mean from n=500 births
Estimate  | The numerical value obtained from the estimator  | No      | a number                       | \hat p = 0.433

*Fixed for the population/time frame you define; it can differ across places/times.


Parameter

A parameter is a numerical characteristic of a population—fixed but unknown.

  • Common parameters: \mu (mean), \sigma^2 (variance), p (proportion), \beta (regression effect), \lambda (rate).

Example. The true mean age at first birth for all women in France, 2023, is a parameter \mu. We do not know it without full population data.

Note

Notation. A common convention is Greek letters for population parameters and Roman letters for sample statistics. Consistency matters more than the specific symbols chosen.


Statistic

A statistic is any function of sample data. Statistics vary from sample to sample.

  • Examples: \bar x (sample mean), s^2 (sample variance), \hat p (sample proportion), r (sample correlation), b (sample regression slope).

Example. From a random sample of 500 births, \bar x = 30.9 years; a different sample might give 31.4.


Estimand

The estimand is the target quantity—specified clearly enough that two researchers would compute the same number from the same full population.

  • Well-specified estimands

    • “Median household income in California on 2024-01-01.”
    • “Male–female difference in life expectancy for births in Sweden, 2023.”
    • “Proportion of 25–34 year-olds in urban areas with tertiary education.”
Warning

Why precise definitions matter. “Unemployment rate” is ambiguous unless you specify (i) who counts as unemployed, (ii) age range, (iii) geography, (iv) time window. Different definitions lead to different parameters (e.g., U-1 … U-6 in the U.S.).


Estimator

An estimator is the rule that turns data into an estimate.

  • Common estimators

    \hat\mu=\bar X=\frac{1}{n}\sum_{i=1}^n X_i

    \hat p=\frac{X}{n}\quad\text{(with $X$ successes)}

    s^2=\frac{1}{n-1}\sum_{i=1}^n (X_i-\bar X)^2

Note

Why n-1? Bessel’s correction makes s^2 unbiased for the population variance when the mean is estimated from the same data.


Judging estimators: bias, variance, MSE, efficiency

Bias — is the estimator centered on the truth? If the same study were repeated many times, an unbiased estimator would average to the true value. A biased estimator would systematically miss it (too high or too low).

Variance — how much do estimates differ across samples? Even without bias, repeated samples will not give exactly the same number. Lower variance means more stable results from sample to sample.

Mean Squared Error (MSE) — overall accuracy in one measure. MSE combines both components: \mathrm{MSE}(\hat\theta)=\mathrm{Var}(\hat\theta)+\big(\mathrm{Bias}(\hat\theta)\big)^2. Lower MSE is better. An estimator with a small bias but much lower variance can have a lower MSE than an unbiased but highly variable one.

Efficiency — comparative precision among estimators. Among unbiased estimators that target the same parameter with the same data, the more efficient estimator has the smaller variance. When small bias is allowed, compare using MSE instead.

Sources of precision (common cases)
  • Sample mean (simple random sample): \operatorname{Var}(\bar X)=\frac{\sigma^2}{n},\qquad \mathrm{SE}(\bar X)=\frac{\sigma}{\sqrt{n}}. Larger n reduces SE at the rate 1/\sqrt{n}.
  • Sample proportion: \operatorname{Var}(\hat p)=\frac{p(1-p)}{n},\qquad \mathrm{SE}(\hat p)=\sqrt{\frac{\hat p(1-\hat p)}{n}}.
  • Design effects: clustering, stratification, and weights can change variance. Match your SE method to the sampling design.
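
These ideas can be illustrated with a small simulation; a sketch comparing the sample mean and the sample median as estimators of a normal population mean (the population values below are assumptions for illustration):

# Repeated sampling: bias, variance, and MSE of two estimators of mu
set.seed(2024)
mu <- 50; sigma <- 10; n <- 25
reps <- 10000
means   <- replicate(reps, mean(rnorm(n, mu, sigma)))
medians <- replicate(reps, median(rnorm(n, mu, sigma)))

mean(means)   - mu          # bias of the sample mean: ~0
mean(medians) - mu          # bias of the sample median: ~0 (symmetric population)
var(means)                  # close to sigma^2 / n = 4
var(medians)                # larger: the mean is more efficient here
mean((means   - mu)^2)      # MSE of the sample mean
mean((medians - mu)^2)      # MSE of the sample median
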
Tip

Practical guidance

  • Define the estimand precisely (population, time, unit, and definition).
  • Select an estimator that directly targets that estimand.
  • Among unbiased options, prefer lower variance (greater efficiency).
  • When bias–variance trade-offs are relevant, compare MSE.
  • Report the estimate and its uncertainty (SE or CI), and state key assumptions.

Estimate

An estimate is the numerical value obtained after applying the estimator to the data.

Worked example

  1. Estimand: Approval share among all U.S. adults today.
  2. Parameter: p (unknown true approval).
  3. Estimator: \hat p = X/n.
  4. Sample: n=1{,}500, approvals X=650.
  5. Estimate: \hat p = 650/1500 = 0.433 (43.3%).
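
The same steps in R, adding the estimated standard error from the proportion formula above:

x <- 650; n <- 1500
p_hat  <- x / n                              # estimate: 0.433
se_hat <- sqrt(p_hat * (1 - p_hat) / n)      # estimated standard error
c(estimate = p_hat, se = se_hat)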

Common confusions and clarifications

  • Parameter vs statistic: Population quantity vs sample-derived quantity.
  • Estimator vs estimate: Procedure vs numerical result.
  • Time index: Parameters can change over time (e.g., Q2 vs Q3).
  • Definition first: Specify the estimand before choosing the estimator.

Understanding Different Types of Unpredictability

Not all uncertainty is the same. Understanding different sources of unpredictability helps us choose appropriate statistical methods and interpret results correctly.

Concept | What is it? | Source of unpredictability | Example
Randomness | Individual outcomes are uncertain, but the probability distribution is known or modeled. | Fluctuations across realizations; lack of information about a specific outcome. | Dice roll, coin toss, polling sample
Chaos | Deterministic dynamics highly sensitive to initial conditions (butterfly effect). | Tiny initial differences grow rapidly → large trajectory divergences. | Weather forecasting, double pendulum, population dynamics
Entropy | A measure of uncertainty/dispersion (information-theoretic or thermodynamic). | Larger when outcomes are more evenly distributed (less predictive information). | Shannon entropy in data compression
“Haphazardness” (colloquial) | A felt lack of order without an explicit model; a mixture of mechanisms. | No structured description or stable rules; overlapping processes. | Traffic patterns, social media trends
Quantum randomness | A single outcome is not determined; only the distribution is specified (Born rule). | Fundamental (ontological) indeterminacy of individual measurements. | Electron spin measurement, photon polarization

Key Distinctions for Statistical Practice

Deterministic chaos ≠ statistical randomness: A chaotic system is fully deterministic yet practically unpredictable due to extreme sensitivity to initial conditions. Statistical randomness, by contrast, models uncertainty via probability distributions where individual outcomes are genuinely uncertain.

Why this matters: In statistics, we typically model phenomena as random processes, assuming we can specify probability distributions even when individual outcomes are unpredictable. This assumption underlies most statistical inference.

Quantum Mechanics and Fundamental Randomness

In the Copenhagen interpretation, randomness is fundamental (ontological): a single outcome cannot be predicted, but the probability distribution is given by the Born rule.

This represents true randomness at the most basic level of nature.


1.10 Statistical Error and Uncertainty

Introduction: Why Uncertainty Matters

No measurement or estimate is perfect. Understanding different types of error is crucial for interpreting results and improving study design.

The Central Challenge

Every time we use a sample to learn about a population, we introduce uncertainty. The key is to:

  1. Quantify this uncertainty honestly
  2. Distinguish between different sources of error
  3. Communicate results transparently

Types of Error

Random Error

Random error represents unpredictable fluctuations that vary from observation to observation without a consistent pattern. These errors arise from various sources of natural variability in the data collection and measurement process.

Key Characteristics
  • Unpredictable Direction: Sometimes too high, sometimes too low
  • No Consistent Pattern: Varies randomly across observations
  • Averages to Zero: Over many measurements, positive and negative errors cancel out
  • Quantifiable: Can be estimated and reduced through appropriate methods

Random error encompasses several subtypes:

Sampling Error

Sampling error is the most common type of random error—it arises because we observe a sample rather than the entire population. Different random samples from the same population will yield different estimates purely by chance.

Key properties:

  • Decreases with sample size: \propto 1/\sqrt{n}
  • Quantifiable using probability theory
  • Inevitable when working with samples

Example: Internet Access Survey

Imagine surveying 100 random households about internet access:

The variation around the true value (red line) represents sampling error. With larger samples, estimates would cluster more tightly.
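
This pattern can be reproduced with a short simulation; the 70% true access rate below is an assumed value for illustration:

# Simulate 1,000 surveys, each asking 100 random households
set.seed(123)
true_rate <- 0.70                                   # assumed true proportion with access
estimates <- rbinom(1000, size = 100, prob = true_rate) / 100

hist(estimates,
     main = "Estimates from 1,000 Surveys of 100 Households",
     xlab = "Estimated proportion with internet access",
     col = "lightblue")
abline(v = true_rate, col = "red", lwd = 2)         # true value (red line)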

Measurement Error

Measurement error is random variation in the measurement process itself—even when measuring the same thing repeatedly.

Examples:

  • Slight variations when reading a thermometer due to parallax
  • Random fluctuations in electronic instruments
  • Inconsistencies in human judgment when coding qualitative data

Unlike sampling error (which comes from who/what we observe), measurement error comes from how we observe.

Other Sources of Random Error
  • Processing error: Random mistakes in data entry, coding, or computation
  • Model specification error: When the true relationship is more complex than assumed
  • Temporal variation: Natural day-to-day fluctuations in the phenomenon being measured

Systematic Error (Bias)

Systematic error represents consistent deviation in a particular direction. Unlike random error, it doesn’t average out with repeated sampling or measurement—it persists and pushes results consistently away from the truth.

Selection bias: The sampling method systematically excludes certain groups.

Example: Phone surveys during business hours underrepresent employed people.

Measurement bias: The measurement instrument consistently over- or under-measures.

Example: Scales that always read 2 pounds heavy; survey questions that lead respondents toward particular answers.

Response bias: Respondents systematically misreport.

Example: People underreport alcohol consumption, overreport voting, or give socially desirable answers.

Non-response bias: Non-responders differ systematically from responders.

Example: Very sick and very healthy people less likely to respond to health surveys, leaving only those with moderate health.

Survivorship bias: Only the “survivors” of some process are observed.

Example: During WWII, the military analyzed returning bombers to determine where to add armor. Planes showed the most damage on wings and tail sections. Abraham Wald realized the flaw: they should armor where there weren’t bullet holes—the engine and cockpit. Planes hit in those areas never made it back to be analyzed. They were only studying the survivors.

Observer (interviewer) bias: Observers or interviewers systematically influence results.

Example: Interviewers unconsciously prompting certain responses or recording observations that confirm their expectations.

The Bias-Variance Decomposition

Mathematically, total error (Mean Squared Error) decomposes into:

\mathrm{MSE}(\hat\theta) = \underbrace{\mathrm{Var}(\hat\theta)}_{\text{random error}} + \underbrace{\big(\mathrm{Bias}(\hat\theta)\big)^2}_{\text{systematic error}}

Critical Insight

A large biased sample gives a precisely wrong answer.

  • Increase n → reduces random error (specifically sampling error)
  • Improve study design → reduces systematic error
  • Better instruments → reduces measurement error

Different combinations of bias and variance in estimation

Intuitive analogy: Think of trying to hit a bullseye:

  • Random error = scattered shots around a target (sometimes left, sometimes right, sometimes high, sometimes low)
  • Systematic error = consistently hitting the same wrong spot (all shots clustered, but away from the bullseye)
  • Ideal = shots tightly clustered at the bullseye center

Quantifying Uncertainty

Standard Error

The standard error (SE) quantifies how much an estimate varies across different possible samples. It measures sampling error specifically.

For a Proportion:

SE(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}

For a Mean:

SE(\bar{x}) = \frac{s}{\sqrt{n}}

For a Difference:

SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}

What SE Tells Us

Standard error quantifies sampling error only. It does not account for systematic errors (bias), measurement error, or other sources of uncertainty.

Margin of Error

The margin of error (MOE) represents the expected maximum difference between sample estimate and true parameter.

\text{MOE} = \text{Critical Value} \times \text{Standard Error}

For 95% confidence, we use 1.96 (often simplified to 2). This ensures that ~95% of intervals constructed this way will contain the true parameter.

  • 90% confidence: z = 1.645
  • 95% confidence: z = 1.96
  • 99% confidence: z = 2.576

Confidence Intervals

A confidence interval provides a range of plausible values:

\text{CI} = \text{Estimate} \pm (\text{Critical Value} \times \text{Standard Error})

Important Limitation

Confidence intervals quantify sampling uncertainty but assume no systematic error. A perfectly precise estimate (narrow CI) can still be biased if the study design is flawed.
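
A quick sketch applying these formulas to the polling example discussed below (52% support from n = 1,000 respondents):

p_hat <- 0.52
n <- 1000
se  <- sqrt(p_hat * (1 - p_hat) / n)     # standard error of the proportion
moe <- 1.96 * se                         # margin of error, 95% confidence
ci  <- p_hat + c(-1, 1) * moe            # confidence interval: about 0.49 to 0.55
round(c(se = se, moe = moe, lower = ci[1], upper = ci[2]), 3)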


Practical Application: Opinion Polling

Case Study: Political Polls

When a poll reports “Candidate A: 52%, Candidate B: 48%”, this is incomplete without uncertainty quantification.

The Golden Rule of Polling

With ~1,000 randomly selected respondents:

  • Margin of error: ±3 percentage points (95% confidence)
  • Interpretation: A reported 52% means true support likely between 49% and 55%
  • What this covers: Only random sampling error—assumes no systematic bias
Critical Distinction

The ±3% margin of error quantifies sampling uncertainty only. It does not account for:

  • Coverage bias (who’s excluded from the sampling frame)
  • Non-response bias (who refuses to participate)
  • Response bias (people misreporting their true views)
  • Timing effects (opinions changing between poll and election)

Sample Size and Precision

Sample Size | Margin of Error (95%) | Use Case
n = 100     | ± 10 pp               | Broad direction only
n = 400     | ± 5 pp                | General trends
n = 1,000   | ± 3 pp                | Standard polls
n = 2,500   | ± 2 pp                | High precision
n = 10,000  | ± 1 pp                | Very high precision
Law of Diminishing Returns

To halve the margin of error, you need four times the sample size because \text{MOE} \propto 1/\sqrt{n}

This applies only to sampling error. Doubling your sample size from 1,000 to 2,000 won’t fix systematic problems like biased question wording or unrepresentative sampling methods.
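
The table values follow directly from the MOE formula with \hat p = 0.5 (the most conservative case); a quick check in R:

n <- c(100, 400, 1000, 2500, 10000)
moe <- 1.96 * sqrt(0.5 * 0.5 / n)    # maximum margin of error at 95% confidence
round(100 * moe, 1)                  # in percentage points: ~9.8, 4.9, 3.1, 2.0, 1.0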

What Quality Polls Should Report

A transparent poll discloses:

  • Field dates: When was data collected?
  • Population and sampling method: Who was surveyed and how were they selected?
  • Sample size: How many people responded?
  • Response rate: What proportion of contacted people participated?
  • Weighting procedures: How was the sample adjusted to match population characteristics?
  • Margin of sampling error: Quantification of sampling uncertainty
  • Question wording: Exact text of questions asked
The Reporting Gap

Most news reports mention only the topline numbers and occasionally the margin of error. They rarely discuss potential systematic biases, which can be much larger than sampling error.


Visualization: Sampling Variability

The following simulation demonstrates how confidence intervals behave across repeated sampling:

Show simulation code
library(ggplot2)
set.seed(42)

# Parameters
n_polls      <- 20
n_people     <- 100
true_support <- 0.50

# Simulate independent polls
support <- rbinom(n_polls, n_people, true_support) / n_people

# Calculate standard errors and margins of error
se   <- sqrt(support * (1 - support) / n_people)
moe  <- 2 * se  # Simplified multiplier for clarity

# Create confidence intervals
lower <- pmax(0, support - moe)
upper <- pmin(1, support + moe)

# Check coverage
covers <- (lower <= true_support) & (upper >= true_support)
n_cover <- sum(covers)

results <- data.frame(
  poll = seq_len(n_polls),
  support, se, moe, lower, upper, covers
)

# Create visualization
ggplot(results, aes(x = poll, y = support, color = covers)) +
  geom_errorbar(aes(ymin = lower, ymax = upper), 
                width = 0.3, alpha = 0.8, size = 1) +
  geom_point(size = 3) +
  geom_hline(yintercept = true_support, 
             linetype = "dashed", 
             color = "black",
             alpha = 0.7) +
  scale_color_manual(
    values = c("TRUE" = "forestgreen", "FALSE" = "darkorange"),
    labels = c("TRUE" = "Covers truth", "FALSE" = "Misses truth"),
    name   = NULL
  ) +
  scale_y_continuous(labels = scales::percent,
                     limits = c(0, 1)) +
  labs(
    title    = "Sampling Variability in 20 Independent Polls",
    subtitle = paste0(
      "Each poll: n = ", n_people, " | True value = ",
      scales::percent(true_support),
      " | Coverage: ", n_cover, "/", n_polls,
      " (", round(100 * n_cover / n_polls), "%)"
    ),
    x = "Poll Number",
    y = "Estimated Support",
    caption = "Error bars show approximate 95% confidence intervals"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position = "top",
    panel.grid.minor = element_blank(),
    plot.title = element_text(face = "bold")
  )

Key Observation

Most intervals capture the true value, but some “miss” purely due to sampling randomness. This is expected and quantifiable—it’s the nature of random sampling error.

Important: This simulation assumes no systematic bias. In real polling, systematic errors (non-response bias, coverage problems, question wording effects) can shift all estimates in the same direction, making them consistently wrong even with large samples.


Common Misconceptions

Misconception #1: Margin of Error Covers All Uncertainty

Myth: “The true value is definitely within the margin of error”

Reality:

  • With 95% confidence, there’s still a 5% chance the true value falls outside the interval due to sampling randomness alone
  • More importantly, margin of error only covers sampling error, not systematic biases
  • Real polls often have larger errors from non-response bias, question wording, or coverage problems than from sampling error
Misconception #2: Larger Samples Fix Everything

Myth: “If we just survey more people, we’ll eliminate all error”

Reality:

  • Larger samples reduce random error (particularly sampling error): more precise estimates
  • Larger samples do NOT reduce systematic error: bias remains unchanged
  • A poll of 10,000 people with 70% response rate and biased sampling frame will give a precisely wrong answer
  • Better to have 1,000 well-selected respondents than 10,000 poorly selected ones
Misconception #3: Random = Careless

Myth: “Random error means someone made mistakes”

Reality:

  • Random error is inherent in sampling and measurement—it’s not a mistake
  • Even with perfect methodology, different random samples yield different results
  • Random errors are predictable in aggregate even though unpredictable individually
  • The term “random” refers to the pattern (no systematic direction), not to carelessness
Misconception #4: Confidence Intervals are Guarantees

Myth: “95% confidence means there’s a 95% chance the true value is in this specific interval”

Reality:

  • The true value is fixed (but unknown)—it either is or isn’t in the interval
  • “95% confidence” means: if we repeated this process many times, about 95% of the intervals we construct would contain the true value
  • Each specific interval either captures the truth or doesn’t—we just don’t know which
Misconception #5: Bias Can Be Calculated Like Random Error

Myth: “We can calculate the bias just like we calculate standard error”

Reality:

  • Random error is quantifiable using probability theory because we know the sampling process
  • Systematic error is usually unknown and unknowable without external validation
  • You can’t use the sample itself to detect bias—you need independent information about the population
  • This is why comparing polls to election results is valuable: it reveals biases that weren’t quantifiable beforehand

Real-World Example: Polling Failures

Case Study: When Polls Mislead

Consider a scenario where 20 polls all show Candidate A leading by 3-5 points, with margins of error around ±3%. The polls seem consistent, but Candidate B wins.

What happened?

  • Not sampling error: All polls agreed—unlikely if only random variation
  • Likely systematic error:
    • Non-response bias: Certain voters consistently refused to participate
    • Social desirability bias: Some voters misreported their true preference
    • Turnout modeling error: Wrong assumptions about who would actually vote
    • Coverage bias: Sampling frame (e.g., phone lists) systematically excluded certain groups

The lesson: Consistency among polls doesn’t guarantee accuracy. All polls can share the same systematic biases, giving false confidence in wrong estimates.


Key Takeaways

Essential Points

Understanding Error Types:

  1. Random error is unpredictable variation that averages to zero
    • Sampling error: From observing a sample, not the whole population
    • Measurement error: From imperfect measurement instruments or processes
    • Reduced by: larger samples, better instruments, more measurements
  2. Systematic error (bias) is consistent deviation in one direction
    • Selection bias, measurement bias, response bias, non-response bias, etc.
    • Reduced by: better study design, not larger samples

Quantifying Uncertainty:

  1. Standard error measures typical sampling variability (one type of random error)

  2. Margin of error ≈ 2 × SE gives a range for 95% confidence about sampling uncertainty

  3. Sample size and sampling error precision follow: \text{SE} \propto 1/\sqrt{n}

    • Quadrupling sample size halves sampling error
    • Diminishing returns as n increases
  4. Confidence intervals provide plausible ranges but assume no systematic bias

Critical Insights:

  1. A precisely wrong answer (large biased sample) is often worse than an imprecisely right answer (small unbiased sample)

  2. Always consider both sampling error AND potential systematic biases—published margins of error typically ignore the latter

  3. Transparency matters: Report methodology, response rates, and potential biases, not just point estimates and margins of error

  4. Validation is essential: Compare estimates to known values whenever possible to detect systematic errors

The Practitioner’s Priority

When designing studies:

First: Minimize systematic error through careful design

  • Representative sampling methods
  • High response rates
  • Unbiased measurement tools
  • Proper question wording

Then: Optimize sample size to achieve acceptable precision

  • Larger samples help only after bias is addressed
  • Balance cost vs. precision improvement
  • Remember diminishing returns

Finally: Report uncertainty honestly

  • State assumptions clearly
  • Acknowledge potential biases
  • Don’t let precise estimates create false confidence

1.11 Sampling and Sampling Methods (*)

Sampling is the process of selecting a subset of individuals from a population to estimate characteristics of the whole population. The way we sample profoundly affects what we can conclude from our data.

The Sampling Frame

Before discussing methods, we must understand the sampling frame—the list or device from which we draw our sample. The frame should ideally include every population member exactly once.

Common Sampling Frames:

  • Electoral rolls (for adult citizens)
  • Telephone directories (increasingly problematic due to mobile phones and unlisted numbers)
  • Address lists from postal services
  • Birth registrations (for newborns)
  • School enrollment lists (for children)
  • Tax records (for income earners)
  • Satellite imagery (for dwellings in remote areas)

Frame Problems:

  • Undercoverage: Frame missing population members (homeless individuals not on address lists)
  • Overcoverage: Frame includes non-population members (deceased people still on voter rolls)
  • Duplication: Same unit appears multiple times (people with multiple phone numbers)
  • Clustering: Multiple population members per frame unit (multiple families at one address)

Probability Sampling Methods

Probability sampling gives every population member a known, non-zero probability of selection. This allows us to make statistical inferences about the population.

Simple Random Sampling (SRS)

Every possible sample of size n has equal probability of selection. It’s the gold standard for statistical theory but often impractical for large populations.

How It Works:

  1. Number every unit in the population from 1 to N
  2. Use random numbers to select n units
  3. Each unit has probability n/N of selection

Example: To sample 50 students from a school of 1,000:

  • Assign each student a number from 1 to 1,000
  • Generate 50 random numbers between 1 and 1,000
  • Select students with those numbers
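
In R, this selection is a one-liner; a minimal sketch:

# Simple random sample of 50 student IDs out of 1,000 (without replacement)
set.seed(1)
selected_ids <- sample(1:1000, size = 50)
sort(selected_ids)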

Advantages:

  • Statistically optimal
  • Easy to analyze
  • No need for additional information about population

Disadvantages:

  • Requires complete sampling frame
  • Can be expensive (selected units might be far apart)
  • May not represent important subgroups well by chance

Systematic Sampling

Select every kth element from an ordered sampling frame, where k = N/n (the sampling interval).

How It Works:

  1. Calculate sampling interval k = N/n
  2. Randomly select starting point between 1 and k
  3. Select every kth unit thereafter

Example: To sample 100 houses from 5,000 on a street listing:

  • k = 5,000/100 = 50
  • Random start: 23
  • Sample houses: 23, 73, 123, 173, 223…
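
A minimal sketch of the same procedure in R:

# Systematic sample of 100 houses from 5,000
N <- 5000; n <- 100
k <- N / n                             # sampling interval: 50
set.seed(7)
start <- sample(1:k, 1)                # random start between 1 and k
selected_houses <- seq(from = start, by = k, length.out = n)
head(selected_houses)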

Advantages:

  • Simple to implement in field
  • Spreads sample throughout population

Disadvantages:

  • Can introduce bias if there’s periodicity in the frame

Hidden Periodicity Example: Sampling every 10th apartment in buildings where corner apartments (numbers ending in 0) are all larger. This would bias our estimate of average apartment size.

Stratified Sampling

Divide population into homogeneous subgroups (strata) before sampling. Sample independently within each stratum.

How It Works:

  1. Divide population into non-overlapping strata
  2. Sample independently from each stratum
  3. Combine results with appropriate weights

Example: Studying income in a city with distinct neighborhoods:

  • Stratum 1: High-income neighborhood (10% of population) - sample 100
  • Stratum 2: Middle-income neighborhood (60% of population) - sample 600
  • Stratum 3: Low-income neighborhood (30% of population) - sample 300
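
A small sketch of proportional allocation and within-stratum selection; the stratum sizes below are hypothetical:

# Proportional allocation for a total sample of 1,000
N_strata <- c(high = 10000, middle = 60000, low = 30000)   # hypothetical stratum sizes
n_total  <- 1000
n_strata <- round(n_total * N_strata / sum(N_strata))      # 100, 600, 300
n_strata

# Simple random sample within each stratum (unit IDs 1..N_h)
set.seed(99)
samples <- lapply(names(N_strata),
                  function(h) sample(1:N_strata[h], n_strata[h]))
names(samples) <- names(N_strata)
sapply(samples, length)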

Types of Allocation:

Proportional: Sample size in each stratum proportional to stratum size

  • If stratum has 20% of population, it gets 20% of sample

Optimal (Neyman): Larger samples from more variable strata

  • If income varies more in high-income areas, sample more there

Equal: Same sample size per stratum regardless of population size

  • Useful when comparing strata is primary goal

Advantages:

  • Ensures representation of all subgroups
  • Can increase precision substantially
  • Allows different sampling methods per stratum
  • Provides estimates for each stratum

Disadvantages:

  • Requires information to create strata
  • Can be complex to analyze

Cluster Sampling

Select groups (clusters) rather than individuals. Often used when population is naturally grouped or when creating a complete frame is difficult.

Single-Stage Cluster Sampling:

  1. Divide population into clusters
  2. Randomly select some clusters
  3. Include all units from selected clusters

Two-Stage Cluster Sampling:

  1. Randomly select clusters (Primary Sampling Units)
  2. Within selected clusters, randomly select individuals (Secondary Sampling Units)

Example: Surveying rural households in a large country:

  • Stage 1: Randomly select 50 villages from 1,000 villages
  • Stage 2: Within each selected village, randomly select 20 households
  • Total sample: 50 × 20 = 1,000 households
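
A minimal two-stage sketch in R, with made-up village and household labels:

# Stage 1: select 50 of 1,000 villages; Stage 2: 20 households per selected village
set.seed(11)
villages <- sample(1:1000, 50)                        # primary sampling units
households <- lapply(villages, function(v) {
  n_hh <- sample(80:300, 1)                           # hypothetical number of households
  paste0("village", v, "_hh", sample(1:n_hh, 20))     # 20 households in village v
})
length(unlist(households))                            # total sample: 1,000 households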

Multi-Stage Example: National health survey:

  • Stage 1: Select states
  • Stage 2: Select counties within selected states
  • Stage 3: Select census blocks within selected counties
  • Stage 4: Select households within selected blocks
  • Stage 5: Select one adult within selected households

Advantages:

  • Doesn’t require complete population list
  • Reduces travel costs (units clustered geographically)
  • Can use different methods at different stages
  • Natural for hierarchical populations

Disadvantages:

  • Less statistically efficient than SRS
  • Complex variance estimation
  • Larger samples needed for same precision

Design Effect: Cluster sampling typically requires larger samples than SRS. The design effect (DEFF) quantifies this:

\text{DEFF} = \frac{\text{Variance(cluster sample)}}{\text{Variance(SRS)}}

If DEFF = 2, you need twice the sample size to achieve the same precision as SRS.

Non-Probability Sampling Methods

Non-probability sampling doesn’t guarantee known selection probabilities. While limiting statistical inference, these methods may be necessary or useful in certain situations.

Convenience Sampling

Selection based purely on ease of access. No attempt at representation.

Examples:

  • Surveying students in your class about study habits
  • Interviewing people at a shopping mall about consumer preferences
  • Online polls where anyone can participate
  • Medical studies using volunteers who respond to advertisements

When It Might Be Acceptable:

  • Pilot studies to test survey instruments
  • Exploratory research to identify issues
  • When studying processes believed to be universal

Major Problems:

  • No basis for inference to population
  • Severe selection bias likely
  • Results may be completely misleading

Real Example: Literary Digest’s 1936 U.S. presidential poll surveyed 2.4 million people (huge sample!) but used telephone directories and club memberships as frames during the Depression, dramatically overrepresenting wealthy voters and incorrectly predicting Landon would defeat Roosevelt.

Purposive (Judgmental) Sampling

Deliberate selection of specific cases based on researcher judgment about what’s “typical” or “interesting.”

Examples:

  • Selecting “typical” villages to represent rural areas
  • Choosing specific age groups for a developmental study
  • Selecting extreme cases to understand range of variation
  • Picking information-rich cases for in-depth study

Types of Purposive Sampling:

Typical Case: Choose average or normal examples

  • Studying “typical” American suburbs

Extreme/Deviant Case: Choose unusual examples

  • Studying villages with unusually low infant mortality to understand success factors

Maximum Variation: Deliberately pick diverse cases

  • Selecting diverse schools (urban/rural, rich/poor, large/small) for education research

Critical Case: Choose cases that will be definitive

  • “If it doesn’t work here, it won’t work anywhere”

When It’s Useful:

  • Qualitative research focusing on depth over breadth
  • When studying rare populations
  • Resource constraints limit sample size severely
  • Exploratory phases of research

Problems:

  • Entirely dependent on researcher judgment
  • No statistical inference possible
  • Different researchers might select different “typical” cases

Quota Sampling

Selection to match population proportions on key characteristics. Like stratified sampling but without random selection within groups.

How Quota Sampling Works:

  1. Identify key characteristics (age, sex, race, education)
  2. Determine population proportions for these characteristics
  3. Set quotas for each combination
  4. Interviewers fill quotas using convenience methods

Detailed Example: Political poll with quotas:

Population proportions:

  • Male 18-34: 15%
  • Male 35-54: 20%
  • Male 55+: 15%
  • Female 18-34: 16%
  • Female 35-54: 19%
  • Female 55+: 15%

For a sample of 1,000:

  • Interview 150 males aged 18-34
  • Interview 200 males aged 35-54
  • And so on…

Interviewers might stand on street corners approaching people who appear to fit needed categories until quotas are filled.

Why It’s Popular in Market Research:

  • Faster than probability sampling
  • Cheaper (no callbacks for specific individuals)
  • Ensures demographic representation
  • No sampling frame needed

Why It’s Problematic for Statistical Inference:

Hidden Selection Bias: Interviewers approach people who look approachable, speak the language well, aren’t in a hurry—systematically excluding certain types within each quota cell.

Example of Bias: An interviewer filling a quota for “women 18-34” might approach women at a shopping mall on Tuesday afternoon, systematically missing:

  • Women who work during weekdays
  • Women who can’t afford to shop at malls
  • Women with young children who avoid malls
  • Women who shop online

Even though the final sample has the “right” proportion of young women, they’re not representative of all young women.

No Measure of Sampling Error: Without selection probabilities, we can’t calculate standard errors or confidence intervals.

Historical Cautionary Tale: Quota sampling was standard in polling until the 1948 U.S. presidential election, when polls using quota sampling incorrectly predicted Dewey would defeat Truman. The failure led to adoption of probability sampling in polling.

Snowball Sampling

Participants recruit additional subjects from their acquaintances. The sample grows like a rolling snowball.

How It Works:

  1. Identify initial participants (seeds)
  2. Ask them to refer others with required characteristics
  3. Ask new participants for further referrals
  4. Continue until sample size reached or referrals exhausted

Example: Studying undocumented immigrants:

  • Start with 5 immigrants you can identify
  • Each refers 3 others they know
  • Those 15 each refer 2-3 others
  • Continue until you have 100+ participants

When It’s Valuable:

Hidden Populations: Groups without sampling frames

  • Drug users
  • Homeless individuals
  • People with rare diseases
  • Members of underground movements

Socially Connected Populations: When relationships matter

  • Studying social network effects
  • Researching community transmission of diseases
  • Understanding information diffusion

Trust-Dependent Research: When referrals increase participation

  • Sensitive topics where trust is essential
  • Closed communities suspicious of outsiders

Major Limitations:

  • Samples biased toward cooperative, well-connected individuals
  • Isolated members of population missed entirely
  • Statistical inference generally impossible
  • Can reinforce social divisions (chains rarely cross social boundaries)

Advanced Version - Respondent-Driven Sampling (RDS):

Attempts to make snowball sampling more rigorous by:

  • Tracking who recruited whom
  • Limiting number of referrals per person
  • Weighting based on network size
  • Using mathematical models to adjust for bias

Still controversial whether RDS truly allows valid inference.


1.12 Probability Concepts for Statistical Analysis

While this is primarily a statistics course, understanding basic probability is essential for statistical inference.

Basic Probability

Probability quantifies uncertainty on a scale from 0 (impossible) to 1 (certain).

Classical Probability: P(\text{event}) = \frac{\text{Number of favorable outcomes}}{\text{Total possible outcomes}}

Example: Probability a randomly selected person is female \approx 0.5

Empirical Probability: Based on observed frequencies

Example: In a village, 423 of 1,000 residents are female, so P(\text{female}) \approx 0.423

Conditional Probability

Conditional Probability is the probability of event A given that event B has occurred: P(A|B)

Demographic Example: Probability of dying within a year given current age:

  • P(\text{death within year} | \text{age 30}) \approx 0.001
  • P(\text{death within year} | \text{age 80}) \approx 0.05

These conditional probabilities form the basis of life tables.

Independence

Events A and B are independent if P(A|B) = P(A).

Testing Independence in Demographic Data:

Are education and fertility independent?

  • P(\text{3+ children}) = 0.3 overall
  • P(\text{3+ children} | \text{college degree}) = 0.15
  • Different probabilities indicate dependence

Law of Large Numbers

As sample size increases, sample statistics converge to population parameters.

Demonstration: Estimating sex ratio at birth:

  • 10 births: 7 males (70% - very unstable)
  • 100 births: 53 males (53% - getting closer to ~51.2%)
  • 1,000 births: 515 males (51.5% - quite close)
  • 10,000 births: 5,118 males (51.18% - very close)

Visualizing the Law of Large Numbers: Coin Flips

Let’s see this in action with coin flips. A fair coin has a 50% chance of landing heads, but individual flips are unpredictable.

# Simulate coin flips and show convergence
library(ggplot2)  # for plotting
set.seed(42)
n_flips <- 1000
flips <- rbinom(n_flips, 1, 0.5)  # 1 = heads, 0 = tails

# Calculate cumulative proportion of heads
cumulative_prop <- cumsum(flips) / seq_along(flips)

# Create data frame for plotting
lln_data <- data.frame(
  flip_number = 1:n_flips,
  cumulative_proportion = cumulative_prop
)

# Plot the convergence
ggplot(lln_data, aes(x = flip_number, y = cumulative_proportion)) +
  geom_line(color = "steelblue", alpha = 0.7) +
  geom_hline(yintercept = 0.5, color = "red", linetype = "dashed", size = 1) +
  geom_hline(yintercept = c(0.45, 0.55), color = "red", linetype = "dotted", alpha = 0.7) +
  labs(
    title = "Law of Large Numbers: Coin Flip Proportions Converge to 0.5",
    x = "Number of coin flips",
    y = "Cumulative proportion of heads",
    caption = "Red dashed line = true probability (0.5)\nDotted lines = ±5% range"
  ) +
  scale_y_continuous(limits = c(0.3, 0.7), breaks = seq(0.3, 0.7, 0.1)) +
  theme_minimal()

What this shows:

  • Early flips show wild variation (first 10 flips might be 70% or 30% heads)
  • As we add more flips, the proportion stabilizes around 50%
  • The “noise” of individual outcomes averages out over time

The Mathematical Statement

Let A denote an event of interest (e.g., “heads on a coin flip”, “vote for party X”, “sum of dice equals 7”). If P(A) = p and we observe n independent trials with the same distribution (i.i.d.), then the sample frequency of A:

\hat{p}_n = \frac{\text{number of occurrences of } A}{n}

converges to p as n increases.

Examples in Different Contexts

Dice example: The event “sum = 7” with two dice has probability 6/36 ≈ 16.7\%, while “sum = 4” has 3/36 ≈ 8.3\%. Over many throws, a sum of 7 appears about twice as often as a sum of 4.

Election polling: If population support for a party equals p, then under random sampling of size n, the observed frequency \hat{p}_n will approach p as n grows (assuming random sampling and independence).

Quality control: If 2% of products are defective, then in large batches, approximately 2% will be found defective (assuming independent production).

Why This Matters for Statistics

Bottom line: Randomness underpins statistical inference by turning uncertainty in individual outcomes into predictable distributions for estimates. The Law of Large Numbers guarantees that the “noise” of individual outcomes averages out, allowing us to:

  • Predict long-run frequencies
  • Quantify uncertainty (margins of error)
  • Draw reliable inferences from samples
  • Make probabilistic statements about populations

This principle works in surveys, experiments, and even quantum phenomena (in the frequentist interpretation).


Central Limit Theorem (CLT)

The Central Limit Theorem states that the distribution of sample means approaches a normal distribution as sample size increases, regardless of the shape of the original population distribution. This holds true even for highly skewed or non-normal populations.

Key Insights

  • Sample Size Threshold: Sample sizes of n ≥ 30 are typically sufficient for the CLT to apply
  • Standard Error: The standard deviation of sample means equals σ/√n, where σ is the population standard deviation
  • Statistical Foundation: We can make inferences about population parameters using normal distribution properties, even when the underlying data is non-normal

Why This Matters in Practice

Consider income data, which is typically right-skewed with a long tail of high earners. While individual incomes don’t follow a normal distribution, something remarkable happens when we repeatedly take samples and calculate their means:

What “normally distributed sample means” actually means:

  1. If you take many different groups of 30+ people and calculate each group’s average income
  2. These group averages will form a bell-shaped pattern when plotted
  3. Most group averages will cluster near the true population mean
  4. The probability of getting a group average far from the population mean becomes predictable

This predictable pattern (normal distribution) allows us to:

  • Calculate confidence intervals using normal distribution properties
  • Perform statistical hypothesis tests
  • Make predictions about sample means with known probability

Concrete Example: Imagine a city where individual incomes range from $20,000 to $10,000,000, heavily skewed right. If you:

  • Randomly select 100 people and calculate their mean income: maybe $75,000
  • Repeat this 1000 times (1000 different groups of 100 people)
  • Plot these 1000 group means: they’ll form a bell curve centered around the true population mean
  • About 95% of these group means will fall within a predictable range
  • This happens even though individual incomes are extremely skewed!
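
A minimal simulation of this thought experiment, using a right-skewed log-normal distribution as a stand-in for incomes (the parameters are assumptions for illustration):

# Sampling distribution of the mean from a highly skewed "income" distribution
set.seed(2025)
population <- rlnorm(100000, meanlog = 11, sdlog = 1)     # skewed individual incomes

sample_means <- replicate(1000, mean(sample(population, 100)))

par(mfrow = c(1, 2))
hist(population, breaks = 100, col = "lightcoral",
     main = "Skewed Population (Incomes)", xlab = "Income")
hist(sample_means, breaks = 30, col = "lightblue",
     main = "Means of 1,000 Samples (n = 100)", xlab = "Sample mean income")
par(mfrow = c(1, 1))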

Mathematical Foundation

For a population with mean μ and finite variance σ²:

  • Sampling distribution of the mean: \bar{X} \sim N(\mu, \frac{\sigma^2}{n}) as n \to \infty
  • Standard error of the mean: SE_{\bar{X}} = \frac{\sigma}{\sqrt{n}}
  • Standardized sample mean: Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1) for large n

Key Takeaways

  1. Universal Application: The CLT applies to any distribution with finite variance
  2. Convergence to Normality: The approximation to normal distribution improves as sample size increases
  3. Foundation for Inference: Most parametric statistical tests rely on the CLT
  4. Sample Size Considerations: While n ≥ 30 is a common guideline, highly skewed distributions may require larger samples for accurate approximation

1.13 Statistical Significance: A Quick Start Guide

Imagine you flip a coin 10 times and get 8 heads. Is the coin biased, or did you just get lucky? This is the core question that statistical significance testing, a key tool of statistical inference, helps us answer.

Statistical significance tells us whether patterns in our data likely reflect something real or could have happened by pure chance.

Statistical significance is a judgment, based on the p-value, about whether the patterns observed in our sample would be unlikely if chance alone were at work. When a result is statistically significant (typically p-value < 0.05), it means that data at least as extreme as ours would be very unlikely if there were no real effect.

The Courtroom Analogy

Statistical hypothesis testing works like a criminal trial:

  • Null Hypothesis (H_0): The defendant is innocent (no effect exists)
  • Alternative Hypothesis (H_1): The defendant is guilty (an effect exists)
  • The Evidence: Your data and test results
  • The Verdict: “Guilty” (reject H_0) or “Not Guilty” (fail to reject H_0)

Crucial distinction: “Not guilty” ≠ “Innocent”

  • A “not guilty” verdict means insufficient evidence to convict
  • Similarly, “not statistically significant” means insufficient evidence for an effect, NOT proof of no effect

Start with Skepticism (Presumption of Innocence)

In statistics, we always start by assuming nothing special is happening:

  • Null Hypothesis (H_0): “There’s no effect”
    • The coin is fair
    • The new drug doesn’t work
    • Study time doesn’t affect grades
  • Alternative Hypothesis (H_1): “There IS an effect”
    • The coin is biased
    • The drug works
    • More study time improves grades

Key principle: We maintain the null hypothesis (innocence) unless our data provides strong evidence against it—“beyond a reasonable doubt” in legal terms, or “p < 0.05” in statistical terms.

1.14 The p-value: Your “Surprise Meter”

The p-value answers one specific question:

“If nothing special were happening (null hypothesis is true), how surprising would our results be?”

A p-value is the probability of observing the study’s results, or more extreme results, if the null hypothesis (a statement of no effect or no difference) is true.

Three Ways to Think About p-values

1. The Surprise Scale

  • p < 0.01: Very surprising! (Strong evidence against H_0)
  • p < 0.05: Pretty surprising (Moderate evidence against H_0)
  • p > 0.05: Not that surprising (Insufficient evidence against H_0)

2. Concrete Example: The Suspicious Coin

You flip a coin 10 times and get 8 heads. What’s the p-value?

The calculation: If the coin were fair, the probability of getting 8 or more heads is: p = P(≥8 \text{ heads in 10 flips}) \approx 0.055 \approx 5.5\%

P(X \geq 8) = \sum_{k=8}^{10} \binom{10}{k} 0.5^{10} = \frac{56}{1024} \approx 0.0547

Interpretation: There’s a 5.5% chance of getting results this extreme with a fair coin. That’s somewhat unusual but not shocking.
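
You can verify this tail probability directly in R:

sum(dbinom(8:10, size = 10, prob = 0.5))               # 0.0546875
pbinom(7, size = 10, prob = 0.5, lower.tail = FALSE)   # same value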

3. The Formal Definition

A p-value is the probability of getting results at least as extreme as what you observed, assuming the null hypothesis is true.

Warning

Common Mistake: The p-value is NOT the probability that the null hypothesis is true! It assumes the null is true and tells you how unusual your data would be in that world.

1.15 The Prosecutor Fallacy: A Warning

The Fallacy Explained

Imagine this courtroom scenario:

Prosecutor: “If the defendant were innocent, there’s only a 1% chance we’d find his DNA at the crime scene. We found his DNA. Therefore, there’s a 99% chance he’s guilty!”

This is WRONG! The prosecutor confused:

  • P(Evidence | Innocent) = 0.01 ← What we know
  • P(Innocent | Evidence) = ? ← What we want to know (but can’t get from the p-value alone!)

When we get p = 0.01, it’s tempting to think:

WRONG: “There’s only a 1% chance the null hypothesis is true”
WRONG: “There’s a 99% chance our treatment works”

CORRECT: “If the null hypothesis were true, there’s only a 1% chance we’d see data this extreme”

Why This Matters: A Simple Medical Testing Example

Imagine a rare disease test that’s 99% accurate:

  • If you have the disease, the test is positive 99% of the time
  • If you don’t have the disease, the test is negative 99% of the time (so 1% false positive rate)

Here’s the key: Suppose only 1 in 1000 people actually have this disease.

Now let’s test 10,000 people:

  • 10 people have the disease → 10 test positive (rounded)
  • 9,990 people don’t have the disease → about 100 test positive by mistake (1% of 9,990)
  • Total positive tests: 110

If you test positive, what’s the chance you actually have the disease?

  • Only 10 out of 110 positive tests are real
  • That’s about 9%, not 99%!
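
The same arithmetic as a quick R check (using the hypothetical prevalence and accuracy figures above):

prevalence  <- 0.001   # 1 in 1000 have the disease
sensitivity <- 0.99    # P(positive | disease)
specificity <- 0.99    # P(negative | no disease)

p_positive <- prevalence * sensitivity + (1 - prevalence) * (1 - specificity)
prevalence * sensitivity / p_positive   # about 0.09: only ~9% of positives are real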

The Research Analogy

The same thing happens in research:

  • When we test many hypotheses (like testing many potential drugs)
  • Most don’t work (like most people don’t have the rare disease)
  • Even with “significant” results (like a positive test), most findings might be false positives

Important

A p-value tells you how surprising your data would be IF the null hypothesis were true. It doesn’t tell you the probability that the null hypothesis IS true.

Think of it like this: a p-value gives you P(Data | Null is true), not P(Null is true | Data). These are as different as P(Wet ground | Rain) and P(Rain | Wet ground)—the ground could be wet from a sprinkler!


1.16 Introduction to Regression Analysis: Modeling Relationships Between Variables

Before we begin our discussion of regression analysis, we need to understand what we mean by a model in scientific inquiry. A model is a simplified, abstract representation of a real-world phenomenon or system. Models deliberately omit details to focus on the essential relationships we are trying to understand. They are not meant to capture every aspect of reality—which would be impossibly complex—but rather to serve as tools that help us identify patterns, make predictions, test hypotheses, and communicate our ideas clearly. The statistician George Box captured this idea perfectly when he noted that “all models are wrong, but some are useful.” In other words, while we know our models don’t perfectly represent reality, they can still provide valuable insights into the phenomena we study.

Regression analysis is a fundamental statistical method for modeling the relationship between variables. Specifically, it helps us understand how one or more independent variables (also called predictors or explanatory variables) are related to a dependent variable (the outcome or response variable we want to explain or predict). The goal of regression analysis is to quantify these relationships and, when appropriate, to predict values of the dependent variable based on the independent variables.

In its simplest form, called simple linear regression, we model the relationship between a single independent variable X and a dependent variable Y using the equation:

Y = \beta_0 + \beta_1 X + \varepsilon

where \beta_0 represents the intercept, \beta_1 represents the slope (showing how much Y changes for each unit change in X), and \varepsilon represents the error term—the part of Y that our model cannot explain.


One of the most powerful tools in statistical analysis is regression analysis—a method for understanding and quantifying relationships between variables.

The core idea is simple: How does one thing relate to another, and can we use that relationship to make predictions?

The One-Sentence Summary: Regression helps us understand how things relate to each other in a messy, complicated world where everything affects everything else.

What is Regression Analysis?

Imagine you’re curious about the relationship between education and income. You notice that people with more education tend to earn more money, but you want to understand this relationship more precisely:

  • How much does each additional year of education increase income, on average?
  • How strong is this relationship?
  • Are there other factors we should consider?
  • Can we predict someone’s likely income if we know their education level?

Regression analysis provides systematic answers to these questions. It’s like finding the “best-fitting story” that describes how variables relate to each other.

Variables and Variation

A variable is any characteristic that can take different values across units of observation. In political science:

  • Units of analysis: Countries, individuals, elections, policies, years
  • Variables: GDP, voting preference, democracy score, conflict occurrence

💡 In Plain English: A variable is anything that changes. If everyone voted the same way, “voting preference” wouldn’t be a variable—it would be a constant. We study variables because we want to understand why things differ.


Note

Consider a typical pre-election news headline: “Candidate Smith’s approval rating reaches 68%.” Your immediate inference likely suggests favorable electoral prospects for Smith—not guaranteed victory, but a strong position. You naturally understand that higher approval ratings tend to predict better electoral performance, even though the relationship is not perfect.

This intuitive assessment exemplifies the core logic of regression analysis. You used one piece of information (approval rating) to make a prediction about another outcome (electoral success). Moreover, you recognized both the relationship between these variables and the uncertainty inherent in your prediction.

While such informal reasoning serves us well in daily life, it has important limitations. How much better are Smith’s chances at 68% approval compared to 58%? What happens when we need to consider multiple factors simultaneously—approval ratings, economic conditions, and incumbency status? How confident should we be in our predictions?

Regression analysis provides a systematic framework for addressing these questions. It transforms our intuitive understanding of relationships into precise mathematical models that can be tested and refined. Through regression analysis, researchers can:

  • Generate precise predictions: Move beyond general assessments to specific numerical estimates—for instance, predicting not just that Smith will “probably win,” but estimating the expected vote share and range of likely outcomes.

  • Identify which factors matter most: Determine the relative importance of different variables—perhaps discovering that economic conditions influence elections more strongly than approval ratings.

  • Quantify uncertainty in predictions: Explicitly measure how confident we should be in our predictions, distinguishing between near-certain outcomes and educated guesses.

  • Test theoretical propositions with empirical data: Evaluate whether our beliefs about cause-and-effect relationships hold up when examined systematically across many observations.

In essence, regression analysis systematizes the pattern recognition we perform intuitively, providing tools to make our predictions more accurate, our comparisons more meaningful, and our conclusions more reliable.


The Fundamental Model

A model represents an object, person, or system in an informative way. Models divide into physical representations (such as architectural models) and abstract representations (such as mathematical equations describing atmospheric dynamics).

The core of statistical thinking can be expressed as:

Y = f(X) + \text{error}

This equation states that our outcome (Y) equals some function of our predictors (X), plus unpredictable variation.

Components:

  • Y = Dependent variable (the phenomenon we seek to explain)
  • X = Independent variable(s) (explanatory factors)
  • f() = The functional relationship (often assumed linear)
  • error (\epsilon) = Unexplained variation

💡 What This Really Means: Think of it like a recipe. Your grade in a class (Y) depends on study hours (X), but not perfectly. Two students studying 10 hours might get different grades because of test anxiety, prior knowledge, or just luck (the error term). Regression finds the average relationship.

This model provides the foundation for all statistical analysis—from simple correlations to complex machine learning algorithms.

Regression helps answer fundamental questions such as:

  • How much does education increase political participation?
  • What factors predict electoral success?
  • Do democratic institutions promote economic growth?

The Basic Idea: Drawing the Best Line Through Points

Simple Linear Regression

Let’s start with the simplest case: the relationship between two variables. Suppose we plot education (years of schooling) on the x-axis and annual income on the y-axis for 100 people. We’d see a cloud of points, and regression finds the straight line that best represents the pattern in these points.

What makes a line “best”? The regression line minimizes the total squared vertical distances from all points to the line. Think of it as finding the line that makes the smallest total prediction error.

The equation of this line is: Y = a + bX + \text{error}

Or in our example: \text{Income} = a + b \times \text{Education} + \text{error}

Where:

  • a (intercept) = predicted income with zero education
  • b (slope) = change in income per additional year of education
  • error (e) = difference between actual and predicted income

Interpreting the Results:

If our analysis finds: \text{Income} = 15,000 + 4,000 \times \text{Education}

This tells us:

  • Someone with 0 years of education is predicted to earn $15,000
  • Each additional year of education is associated with $4,000 more income
  • Someone with 12 years of education is predicted to earn: $15,000 + (4,000 × 12) = $63,000
  • Someone with 16 years (bachelor’s degree) is predicted to earn: $15,000 + (4,000 × 16) = $79,000
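
A quick arithmetic check in R, plugging the two education levels into the fitted equation above:

intercept <- 15000
slope     <- 4000
intercept + slope * c(12, 16)   # 63000 79000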

Understanding Relationships vs. Proving Causation

A crucial distinction: regression shows association, not necessarily causation. Our education-income regression shows they’re related, but doesn’t prove education causes higher income. Other explanations are possible:

  • Reverse causation: Maybe wealthier families can afford more education for their children
  • Common cause: Perhaps intelligence or motivation affects both education and income
  • Coincidence: In small samples, patterns can appear by chance

Example of Spurious Correlation: A regression might show that ice cream sales strongly predict drowning deaths. Does ice cream cause drowning? No! Both increase in summer, which is the common cause (a confounding variable).


Multiple Regression: Controlling for Other Factors

Real life is complicated—many factors influence outcomes simultaneously. Multiple regression lets us examine one relationship while “controlling for” or “holding constant” other variables.

The Power of Statistical Control

Returning to education and income, we might wonder: Is the education effect just because educated people tend to be from wealthier families, or live in cities? Multiple regression can separate these effects:

\text{Income} = a + b_1 \times \text{Education} + b_2 \times \text{Age} + b_3 \times \text{Urban} + b_4 \times \text{Parent Income} + \text{error}

Now b_1 represents the education effect after accounting for age, location, and family background. If b_1 = 3,000, it means: “Comparing people of the same age, location, and family background, each additional year of education is associated with $3,000 more income.”
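
In R, a model like this is typically fit with lm(); the data frame and variable names below are hypothetical:

fit <- lm(income ~ education + age + urban + parent_income, data = survey_df)
summary(fit)   # coefficients, standard errors, p-values, R-squared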

Demographic Example: Fertility and Women’s Education

Researchers studying fertility might find: \text{Children} = 4.5 - 0.3 \times \text{Education}

This suggests each year of women’s education is associated with 0.3 fewer children. But is education the cause, or are educated women different in other ways? Adding controls:

\text{Children} = a - 0.15 \times \text{Education} - 0.2 \times \text{Urban} + 0.1 \times \text{Husband Education} - 0.4 \times \text{Contraceptive Access}

Now we see education’s association is weaker (-0.15 instead of -0.3) after accounting for urban residence and contraceptive access. This suggests part of education’s apparent effect operates through these other pathways.

Types of Variables in Regression

Outcome (Dependent) Variable

This is what we’re trying to understand or predict:

  • Income in our first example
  • Number of children in our fertility example
  • Life expectancy in health studies
  • Migration probability in population studies

Predictor (Independent) Variables

These are factors we think might influence the outcome:

  • Quantitative: Age, years of education, income, distance
  • Qualitative (categorical): Gender, race, marital status, region
  • Binary (Dummy): Urban/rural, employed/unemployed, married/unmarried

Handling Categorical Variables: We can’t directly put “religion” into an equation. Instead, we create binary variables:

  • Christian = 1 if Christian, 0 otherwise
  • Muslim = 1 if Muslim, 0 otherwise
  • Hindu = 1 if Hindu, 0 otherwise
  • (One category becomes the reference group)
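
In R, storing the variable as a factor does this automatically; model.matrix() shows the dummy columns that lm() or glm() would build (toy data for illustration):

religion <- factor(c("Christian", "Muslim", "Hindu", "Christian"))
model.matrix(~ religion)   # Christian (the first level) becomes the reference group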

Different Types of Regression for Different Outcomes

The basic regression idea adapts to many situations:

Linear Regression

For continuous outcomes (income, height, blood pressure): Y = a + b_1X_1 + b_2X_2 + … + \text{error}

Logistic Regression

For binary outcomes (died/survived, migrated/stayed, married/unmarried):

Instead of predicting the outcome directly, we predict the probability: \log\left(\frac{p}{1-p}\right) = a + b_1X_1 + b_2X_2 + …

Where p is the probability of the event occurring.

Example: Predicting migration probability based on age, education, and marital status. The model might find young, educated, unmarried people have 40% probability of migrating, while older, less educated, married people have only 5% probability.
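
A sketch of the corresponding R call (variable names are hypothetical; migrated is coded 0/1):

fit <- glm(migrated ~ age + education + married, family = binomial, data = migration_df)
predict(fit, type = "response")   # predicted migration probabilities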

Poisson Regression

For count outcomes (number of children, number of doctor visits): \log(\text{expected count}) = a + b_1X_1 + b_2X_2 + …

Example: Modeling number of children based on women’s characteristics. Useful because it ensures predictions are never negative (can’t have -0.5 children!).
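
The analogous sketch for a count outcome (again with made-up variable names):

fit <- glm(children ~ education + urban + age, family = poisson, data = fertility_df)
exp(coef(fit))   # multiplicative effects on the expected number of children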

Survival (Cox model)/Hazard Regression

What it’s for: Predicting when something will happen, not just if it will happen.

The challenge: Imagine you’re studying how long marriages last. You follow 1,000 couples for 10 years, but by the end of your study:

  • 400 couples divorced (you know exactly when)
  • 600 couples are still married (you don’t know if/when they’ll divorce)

Regular regression can’t handle this “incomplete story” problem—those 600 ongoing marriages contain valuable information, but we don’t know their endpoints yet.

How Cox models help: Instead of trying to predict the exact timing, they focus on relative risk—who’s more likely to experience the event sooner. Think of it like asking “At any given moment, who’s at higher risk?” rather than “Exactly when will this happen?”

Real-world applications:

  • Medical research: Who responds to treatment faster?
  • Business: Which customers cancel subscriptions sooner?
  • Social science: What factors make life events happen earlier/later?
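
A sketch using the survival package, with hypothetical marriage data (years_observed is follow-up time; divorced is 1 if the marriage ended during the study, 0 if still intact at the end, i.e. censored):

library(survival)
fit <- coxph(Surv(years_observed, divorced) ~ age_at_marriage + education, data = marriages)
summary(fit)   # hazard ratios: who is at higher risk of divorce at any given moment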

Interpreting Regression Results

Coefficients

The coefficient tells us the expected change in outcome for a one-unit increase in the predictor, holding other variables constant.

Examples of Interpretation:

Linear regression for income:

  • “Each additional year of education is associated with $3,500 higher annual income, controlling for age and experience”

Logistic regression for infant mortality:

  • “Each additional prenatal visit is associated with 15% lower odds of infant death, controlling for mother’s age and education”

Multiple regression for life expectancy:

  • “Each $1,000 increase in per-capita GDP is associated with 0.4 years longer life expectancy, after controlling for education and healthcare access”

Statistical Significance

The regression also tests whether relationships could be due to chance:

  • p-value < 0.05: Relationship unlikely due to chance (statistically significant)
  • p-value > 0.05: Relationship could plausibly be random variation

But remember: Statistical significance ≠ practical importance. With large samples, tiny effects become “significant.”

Confidence Intervals for Coefficients

Just as we have confidence intervals for means or proportions, we have them for regression coefficients:

“The effect of education on income is $3,500 per year, 95% CI: [$2,800, $4,200]”

This means we’re 95% confident the true effect is between $2,800 and $4,200.
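
For a model fit with lm() (such as lm_model in the appendix), these intervals come from confint():

confint(lm_model, level = 0.95)   # 95% confidence intervals for the coefficients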

R-squared: How Well Does the Model Fit?

R^2 (R-squared) measures the proportion of variation in the outcome explained by the predictors:

  • R^2 = 0: Predictors explain nothing
  • R^2 = 1: Predictors explain everything
  • R^2 = 0.3: Predictors explain 30% of variation

Example: A model of income with only education might have R^2 = 0.15 (education explains 15% of income variation). Adding age, experience, and location might increase R^2 to 0.35 (together they explain 35%).

Assumptions and Limitations

Regression makes assumptions that may not hold:

Exogeneity (No Hidden Relationships)

The most fundamental assumption: predictors must not be correlated with errors. In simple terms, there shouldn’t be hidden factors that affect both your predictors and outcome.

Example: If studying education’s effect on income but omitting “ability,” your results are biased - ability affects both education level and income. This assumption is written as: E[\varepsilon | X] = 0

Why it matters: Without it, all your coefficients are wrong, even with millions of observations!

Linearity

Assumes straight-line relationships. But what if education’s effect on income is stronger at higher levels? We can add polynomial terms: \text{Income} = a + b_1 \times \text{Education} + b_2 \times \text{Education}^2
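
In R the squared term is added with I(), for example (hypothetical data frame):

fit <- lm(income ~ education + I(education^2), data = survey_df)

A positive coefficient on the squared term would suggest the effect strengthens at higher education levels.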

Independence

Assumes observations are independent. But family members might be similar, repeated measures on the same person are related, and neighbors might influence each other. Special methods handle these dependencies.

Homoscedasticity

Assumes error variance is constant. But prediction errors might be larger for high-income people than low-income people. Diagnostic plots help detect this.

Normality

Assumes errors follow normal distribution. Important for small samples and hypothesis tests, less critical for large samples.

Note: The first assumption (exogeneity) is about getting the right answer. The others are mostly about precision and statistical inference. Violating exogeneity means your model is fundamentally wrong; violating the others means your confidence intervals and p-values might be off.

Common Statistical Pitfalls

  1. Endogeneity (omitted variable bias): Forgetting about hidden factors that affect both X and Y, violating the fundamental exogeneity assumption. Example: Studying education→income without accounting for ability.

  2. Simultaneity/Reverse causality: When X and Y determine each other at the same time. Simple regression assumes one-way causation, but reality is often bidirectional. Example: Price affects demand AND demand affects price simultaneously.

  3. Confounding: Failing to account for variables that affect both predictor and outcome, leading to spurious relationships. Example: Ice cream sales correlate with drownings (both caused by summer).

  4. Selection bias: Non-random samples that systematically exclude certain groups, making results ungeneralizable. Example: Surveying only smartphone users about internet usage.

  5. Ecological fallacy: Assuming group-level patterns apply to individuals. Example: Rich countries have lower birth rates ≠ rich people have fewer children.

  6. P-hacking (data dredging): Testing multiple hypotheses until finding significance, or tweaking analysis until p < 0.05. With 20 tests, you expect 1 false positive by chance alone!

  7. Overfitting: Building a model too complex for your data - perfect on training data, useless for prediction. Remember: With enough parameters, you can fit an elephant.

  8. Survivorship bias: Analyzing only “survivors” while ignoring failures. Example: Studying successful companies while ignoring those that went bankrupt.

  9. Overgeneralization: Extending findings beyond the studied population, time period, or context. Example: Results from US college students ≠ universal human behavior.

Remember: The first three are forms of endogeneity - they violate E[\varepsilon|X]=0 and make your coefficients fundamentally wrong. The others make results misleading or non-representative.


Applications in Demography

Fertility Analysis

Understanding what factors influence fertility decisions: \text{Children} = f(\text{Education, Income, Urban, Religion, Contraception, …})

Helps identify policy levers for countries concerned about high or low fertility.

Policy levers are the tools and methods that governments and organizations use to influence events and achieve specific goals by affecting behavior and outcomes.

Mortality Modeling

Predicting life expectancy or mortality risk: \text{Mortality Risk} = f(\text{Age, Sex, Smoking, Education, Healthcare Access, …})

Used by insurance companies, public health officials, and researchers.

Migration Prediction

Understanding who migrates and why: P(\text{Migration}) = f(\text{Age, Education, Employment, Family Ties, Distance, …})

Helps predict population flows and plan for demographic change.

Marriage and Divorce

Analyzing union formation and dissolution: P(\text{Divorce}) = f(\text{Age at Marriage, Education Match, Income, Children, Duration, …})

Informs social policy and support services.

Common Pitfalls and How to Avoid Them

Overfitting

Including too many predictors can make the model fit perfectly in your sample but fail with new data. Like memorizing exam answers instead of understanding concepts.

Solution: Use simpler models, cross-validation, or reserve some data for testing.

Multicollinearity

When predictors are highly correlated (e.g., years of education and degree level), the model can’t separate their effects.

Solution: Choose one variable or combine them into an index.

Omitted Variable Bias

Leaving out important variables can make other effects appear stronger or weaker than they really are.

Example: The relationship between ice cream sales and crime rates disappears when you control for temperature.

Extrapolation

Using the model outside the range of observed data.

Example: If your data includes education from 0-20 years, don’t predict income for someone with 30 years of education.

Making Regression Intuitive

Think of regression as a sophisticated averaging technique:

  • Simple average: “The average income is $50,000”
  • Conditional average: “The average income for college graduates is $70,000”
  • Regression: “The average income for 35-year-old college graduates in urban areas is $78,000”

Each added variable makes our prediction more specific and (hopefully) more accurate.

Regression in Practice: A Complete Example

Research Question: What factors influence age at first birth?

Data: Survey of 1,000 women who have had at least one child

Variables:

  • Outcome: Age at first birth (years)
  • Predictors: Education (years), Urban (0/1), Income (thousands), Religious (0/1)

Simple Regression Result: \text{Age at First Birth} = 18 + 0.8 \times \text{Education}

Interpretation: Each year of education associated with 0.8 years later first birth.

Multiple Regression Result: \text{Age at First Birth} = 16 + 0.5 \times \text{Education} + 2 \times \text{Urban} + 0.03 \times \text{Income} - 1.5 \times \text{Religious}

Interpretation:

  • Education effect reduced but still positive (0.5 years per education year)
  • Urban women have first births 2 years later
  • Each $1,000 income associated with 0.03 years (11 days) later
  • Religious women have first births 1.5 years earlier
  • R^2 = 0.42 (model explains 42% of variation)

This richer model helps us understand that education’s effect partly operates through urban residence and income.

Warning

Regression is a gateway to advanced statistical modeling. Once you understand the basic concept—using variables to predict outcomes and quantifying relationships—you can explore:

  • Interaction effects: When one variable’s effect depends on another
  • Non-linear relationships: Curves, thresholds, and complex patterns
  • Multilevel models: Accounting for grouped data (students in schools, people in neighborhoods)
  • Time series regression: Analyzing change over time
  • Machine learning extensions: Random forests, neural networks, and more

The key insight remains: We’re trying to understand how things relate to each other in a systematic, quantifiable way.

1.17 Data Quality and Sources

No analysis is better than the data it’s based on. Understanding data quality issues is crucial for demographic and social research.

Dimensions of Data Quality

Accuracy: How close are measurements to true values?

Example: Age reporting often shows “heaping” at round numbers (30, 40, 50) because people round their ages.

Completeness: What proportion of the population is covered?

Example: Birth registration completeness varies widely:

  • Developed countries: >99%
  • Some developing countries: <50%

Timeliness: How current is the data?

Example: Census conducted every 10 years becomes increasingly outdated, especially in rapidly changing areas.

Consistency: Are definitions and methods stable over time and space?

Example: Definition of “urban” varies by country, making international comparisons difficult.

Accessibility: Can researchers and policy makers actually use the data?

Common Data Sources in Demography

Census: Complete enumeration of population

Advantages:

  • Complete coverage (in theory)
  • Small area data available
  • Baseline for other estimates

Disadvantages:

  • Expensive and infrequent
  • Some populations hard to count
  • Limited variables collected

Sample Surveys: Detailed data from population subset

Examples:

  • Demographic and Health Surveys (DHS)
  • American Community Survey (ACS)
  • Labour Force Surveys

Advantages:

  • Can collect detailed information
  • More frequent than census
  • Can focus on specific topics

Disadvantages:

  • Sampling error present
  • Small areas not represented
  • Response burden may reduce quality

Administrative Records: Data collected for non-statistical purposes

Examples:

  • Tax records
  • School enrollment
  • Health insurance claims
  • Mobile phone data

Advantages:

  • Already collected (no additional burden)
  • Often complete for covered population
  • Continuously updated

Disadvantages:

  • Coverage may be selective
  • Definitions may not match research needs
  • Access often restricted

Data Quality Issues Specific to Demography

Age Heaping: Tendency to report ages ending in 0 or 5

Detection: Calculate Whipple’s Index or Myers’ Index

Impact: Affects age-specific rates and projections

Digit Preference: Reporting certain final digits more than others

Example: Birth weights often reported as 3,000g, 3,500g rather than precise values

Recall Bias: Difficulty remembering past events accurately

Example: “How many times did you visit a doctor last year?” Often underreported for frequent visitors, overreported for rare visitors.

Proxy Reporting: Information provided by someone else

Challenge: Household head reporting for all members may not know everyone’s exact age or education

1.18 Ethical Considerations in Statistical Demographics

Statistics isn’t just about numbers—it involves real people and has real consequences.

Confidentiality and Privacy

Statistical Disclosure Control: Protecting individual identity in published data

Methods include:

  • Suppressing small cells (e.g., “<5” instead of “2”)
  • Geographic aggregation

Example: In a table of occupation by age by sex for a small town, there might be only one female doctor aged 60-65, making her identifiable.

Representation and Fairness

Who’s Counted?: Decisions about who to include affect representation

  • Prisoners: Where are they counted—prison location or home address?
  • Homeless: How to ensure coverage?
  • Undocumented immigrants: Include or exclude?

Differential Privacy: Mathematical framework for privacy protection while maintaining statistical utility

Trade-off: More privacy protection = less accurate statistics

Misuse of Statistics

Cherry-Picking: Selecting only favorable results

Example: Reporting decline in teen pregnancy from peak year rather than showing full trend

P-Hacking: Manipulating analysis to achieve statistical significance

Ecological Fallacy: Inferring individual relationships from group data

Example: Counties with more immigrants have higher average incomes ≠ immigrants have higher incomes

Responsible Reporting

Uncertainty Communication: Always report confidence intervals or margins of error

Context Provision: Include relevant comparison groups and historical trends

Limitation Acknowledgment: Clearly state what data can and cannot show

1.19 Common Misconceptions in Statistics

Understanding what statistics is NOT is as important as understanding what it is.

Misconception 1: “Statistics Can Prove Anything”

Reality: Statistics can only provide evidence, never absolute proof. And proper statistics, honestly applied, constrains conclusions significantly.

Example: A study finds correlation between ice cream sales and drowning deaths. Statistics doesn’t “prove” ice cream causes drowning—both are related to summer weather.

Misconception 2: “Larger Samples Are Always Better”

Reality: Beyond a certain point, larger samples add little precision but may add bias.

Example: Online survey with 1 million responses may be less accurate than probability sample of 1,000 due to self-selection bias.

Diminishing Returns:

  • n = 100: Margin of error \approx 10 pp.
  • n = 1,000: Margin of error \approx 3.2 pp.
  • n = 10,000: Margin of error \approx 1 pp.
  • n = 100,000: Margin of error \approx 0.32 pp.

The jump from 10,000 to 100,000 barely improves precision but costs 10\times more.
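
These figures follow from the conservative rule of thumb for a 95% margin of error on a proportion, roughly 1/√n (worst case p = 0.5). A quick R check:

moe <- function(n) 1 / sqrt(n)                       # approximate 95% margin of error
round(100 * moe(c(100, 1000, 10000, 100000)), 2)     # 10, 3.16, 1, 0.32 percentage points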

Misconception 3: “Statistical Significance = Practical Importance”

Reality: With large samples, tiny differences become “statistically significant” even if meaningless.

Example: Study of 100,000 people finds men are 0.1 cm taller on average (p < 0.001). Statistically significant but practically irrelevant.

Misconception 4: “Correlation Implies Causation”

Reality: Correlation is necessary but not sufficient for causation.

Classic Examples:

  • Cities with more churches have more crime (both correlate with population size)
  • Countries with more TV sets have longer life expectancy (both correlate with development)

Misconception 5: “Random Means Haphazard”

Reality: Statistical randomness is carefully controlled and systematic.

Example: Random sampling requires careful procedure, not just grabbing whoever is convenient.

Misconception 6: “Average Represents Everyone”

Reality: Averages can be misleading when distributions are skewed or multimodal.

Example: Average income of bar patrons is $50,000. Bill Gates walks in. Now average is $1 million. Nobody’s actual income changed.

Misconception 7: “Past Patterns Guarantee Future Results”

Reality: Extrapolation assumes conditions remain constant.

Example: Linear population growth projection from 1950-2000 would badly overestimate 2050 population because it misses fertility decline.

1.20 Applications in Demography

These statistical foundations enable sophisticated demographic analyses. Let’s explore key applications.

Population Estimation and Projection

Intercensal Estimates: Estimating population between censuses

Components Method: P(t+1) = P(t) + B - D + I - E

Where:

  • P(t) = Population at time t
  • B = Births
  • D = Deaths
  • I = Immigration
  • E = Emigration

Each component estimated from different sources with different error structures.
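
A hypothetical worked example: a region with P(t) = 1,000,000 that records 15,000 births, 9,000 deaths, 5,000 immigrants, and 3,000 emigrants over the year gives P(t+1) = 1,000,000 + 15,000 - 9,000 + 5,000 - 3,000 = 1,008,000.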

Population Projections: Forecasting future population

Cohort Component Method:

  1. Project survival rates by age
  2. Project fertility rates
  3. Project migration rates
  4. Apply to base population
  5. Aggregate results

Uncertainty increases with projection horizon.

Demographic Rate Calculation

Crude Rates: Events per 1,000 population

\text{Crude Birth Rate} = \frac{\text{Births}}{\text{Mid-year Population}} \times 1,000
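
Hypothetical example: 14,000 births in a mid-year population of 1,000,000 give a crude birth rate of \frac{14,000}{1,000,000} \times 1,000 = 14 births per 1,000 population.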

Age-Specific Rates: Control for age structure

\text{Age-Specific Fertility Rate} = \frac{\text{Births to women aged } x}{\text{Women aged } x} \times 1,000

Standardization: Compare populations with different structures

Direct Standardization: Apply the population’s age-specific rates to a standard age structure

Indirect Standardization: Apply standard rates to the population’s age structure

Life Table Analysis

Life tables summarize mortality experience of a population.

Key Columns:

  • q_x: Probability of dying between age x and x+1
  • l_x: Number surviving to age x (from 100,000 births)
  • d_x: Deaths between age x and x+1
  • L_x: Person-years lived between age x and x+1
  • e_x: Life expectancy at age x

Example Interpretation: If q_{65} = 0.015, then 1.5% of 65-year-olds die before reaching 66. If e_{65} = 18.5, then 65-year-olds average 18.5 more years of life.

Fertility Analysis

Total Fertility Rate (TFR): Average children per woman given current age-specific rates

\text{TFR} = \sum (\text{ASFR} \times \text{age interval width})

Example: If each 5-year age group from 15-49 has ASFR = 20 per 1,000: \text{TFR} = 7 \text{ age groups} \times \frac{20}{1,000} \times 5 \text{ years} = 0.7 \text{ children per woman}

This very low TFR indicates below-replacement fertility.

Migration Analysis

Net Migration Rate: \text{NMR} = \frac{\text{Immigrants} - \text{Emigrants}}{\text{Population}} \times 1,000

Migration Effectiveness Index: \text{MEI} = \frac{|\text{In} - \text{Out}|}{\text{In} + \text{Out}}

  • Values near 0: High turnover, little net change
  • Values near 1: Mostly one-way flow

Population Health Metrics

Disability-Adjusted Life Years (DALYs): Years of healthy life lost

DALY = Years of Life Lost (YLL) + Years Lived with Disability (YLD)

Healthy Life Expectancy: Expected years in good health

Combines mortality and morbidity information.

1.21 Software and Tools

Modern demographic and social statistics relies heavily on computational tools.

Statistical Software Packages

R: Free, open-source, extensive demographic packages

  • Packages: demography, popReconstruct, bayesPop
  • Advantages: Reproducible research, cutting-edge methods
  • Disadvantages: Steep learning curve

Stata: Widely used in social sciences

  • Strengths: Survey data analysis, panel data
  • Common in: Economics, epidemiology

SPSS: User-friendly interface

  • Strengths: Point-and-click interface
  • Common in: Social sciences, market research

Python: General programming language with statistical libraries

  • Libraries: pandas, numpy, scipy, statsmodels
  • Advantages: Integration with other applications

1.22 Conclusion

Key Terms Summary

Statistics: The science of collecting, organizing, analyzing, interpreting, and presenting data to understand phenomena and support decision-making

Descriptive Statistics: Methods for summarizing and presenting data in meaningful ways without extending conclusions beyond the observed data

Inferential Statistics: Techniques for drawing conclusions about populations from samples, including estimation and hypothesis testing

Population: The complete set of individuals, objects, or measurements about which conclusions are to be drawn

Sample: A subset of the population that is actually observed or measured to make inferences about the population

Superpopulation: A theoretical infinite population from which observed finite populations are considered to be samples

Parameter: A numerical characteristic of a population (usually unknown and denoted by Greek letters)

Statistic: A numerical characteristic calculated from sample data (known and denoted by Roman letters)

Estimator: A rule or formula for calculating estimates of population parameters from sample data

Estimand: The specific population parameter targeted for estimation

Estimate: The numerical value produced by applying an estimator to observed data

Random Error (Sampling Error): Unpredictable variation arising from the sampling process that decreases with larger samples

Systematic Error (Bias): Consistent deviation from true values that cannot be reduced by increasing sample size

Sampling: The process of selecting a subset of units from a population for measurement

Sampling Frame: The list or device from which a sample is drawn, ideally containing all population members

Probability Sampling: Sampling methods where every population member has a known, non-zero probability of selection

Simple Random Sampling: Every possible sample of size n has equal probability of selection

Systematic Sampling: Selection of every kth element from an ordered sampling frame

Stratified Sampling: Division of population into homogeneous subgroups before sampling within each

Cluster Sampling: Selection of groups (clusters) rather than individuals

Non-probability Sampling: Sampling methods without guaranteed known selection probabilities

Convenience Sampling: Selection based purely on ease of access

Purposive Sampling: Deliberate selection based on researcher judgment

Quota Sampling: Selection to match population proportions on key characteristics without random selection

Snowball Sampling: Participants recruit additional subjects from their acquaintances

Standard Error: The standard deviation of the sampling distribution of a statistic

Margin of Error: Maximum expected difference between estimate and parameter at specified confidence

Confidence Interval: Range of plausible values for a parameter at specified confidence level

Confidence Level: Probability that the confidence interval method produces intervals containing the parameter

Data: Collected observations or measurements

Quantitative Data: Numerical measurements (continuous or discrete)

Qualitative Data: Categorical information (nominal or ordinal)

Data Distribution: Description of how values spread across possible outcomes

Frequency Distribution: Summary showing how often each value occurs in data

Absolute Frequency: Count of observations for each value

Relative Frequency: Proportion of observations in each category

Cumulative Frequency: Running total of frequencies up to each value


1.23 Appendix A: Visualizations for Statistics & Demography

## ============================================
## Visualizations for Statistics & Demography
## Chapter 1: Foundations
## ============================================

# Load required libraries
library(ggplot2)
library(dplyr)
library(tidyr)
library(gridExtra)
library(scales)
library(patchwork)  # for combining plots

# Set theme for all plots
theme_set(theme_minimal(base_size = 12))

# Color palette for consistency
colors <- c("#2E86AB", "#A23B72", "#F18F01", "#C73E1D", "#6A994E")


# ==================================================
# 1. POPULATION vs SAMPLE VISUALIZATION
# ==================================================

# Create a population and sample visualization
set.seed(123)

# Generate population data (e.g., ages of 10,000 people)
population <- data.frame(
  id = 1:10000,
  age = round(rnorm(10000, mean = 40, sd = 15))
)
population$age[population$age < 0] <- 0
population$age[population$age > 100] <- 100

# Take a random sample
sample_size <- 500
sample_data <- population[sample(nrow(population), sample_size), ]

# Create visualization
p1 <- ggplot(population, aes(x = age)) +
  geom_histogram(binwidth = 5, fill = colors[1], alpha = 0.7, color = "white") +
  geom_vline(xintercept = mean(population$age), 
             color = colors[2], linetype = "dashed", size = 1.2) +
  labs(title = "Population Distribution (N = 10,000)",
       subtitle = paste("Population mean (μ) =", round(mean(population$age), 2), "years"),
       x = "Age (years)", y = "Frequency") +
  theme(plot.title = element_text(face = "bold"))

p2 <- ggplot(sample_data, aes(x = age)) +
  geom_histogram(binwidth = 5, fill = colors[3], alpha = 0.7, color = "white") +
  geom_vline(xintercept = mean(sample_data$age), 
             color = colors[4], linetype = "dashed", size = 1.2) +
  labs(title = paste("Sample Distribution (n =", sample_size, ")"),
       subtitle = paste("Sample mean (x̄) =", round(mean(sample_data$age), 2), "years"),
       x = "Age (years)", y = "Frequency") +
  theme(plot.title = element_text(face = "bold"))

# Combine plots
population_sample_plot <- p1 / p2
print(population_sample_plot)

# ==================================================
# 2. TYPES OF DATA DISTRIBUTIONS
# ==================================================

# Generate different distribution types
set.seed(456)
n <- 5000

# Normal distribution
normal_data <- rnorm(n, mean = 50, sd = 10)

# Right-skewed distribution (income-like)
right_skewed <- rgamma(n, shape = 2, scale = 15)

# Left-skewed distribution (age at death in developed country)
left_skewed <- 90 - rgamma(n, shape = 3, scale = 5)
left_skewed[left_skewed < 0] <- 0

# Bimodal distribution (e.g., height of mixed male/female population)
nf <- n %/% 2; nm <- n - nf   # split n between the two modes so all columns have equal length
bimodal <- c(rnorm(nf, mean = 164, sd = 5),
             rnorm(nm, mean = 182, sd = 5))


# Create data frame
distributions_df <- data.frame(
  Normal = normal_data,
  `Right Skewed` = right_skewed,
  `Left Skewed` = left_skewed,
  Bimodal = bimodal,
  check.names = FALSE   # keep spaces in column names for nicer facet labels
) %>%
  pivot_longer(everything(), names_to = "Distribution", values_to = "Value")

# Plot distributions
distributions_plot <- ggplot(distributions_df, aes(x = Value, fill = Distribution)) +
  geom_histogram(bins = 30, alpha = 0.7, color = "white") +
  facet_wrap(~Distribution, scales = "free", nrow = 2) +
  scale_fill_manual(values = colors[1:4]) +
  labs(title = "Types of Data Distributions",
       subtitle = "Common patterns in demographic data",
       x = "Value", y = "Frequency") +
  theme(plot.title = element_text(face = "bold", size = 14),
        legend.position = "none")

print(distributions_plot)

# ==================================================
# 3. NORMAL DISTRIBUTION WITH 68-95-99.7 RULE
# ==================================================

# Generate normal distribution data
set.seed(789)
mean_val <- 100
sd_val <- 15
x <- seq(mean_val - 4*sd_val, mean_val + 4*sd_val, length.out = 1000)
y <- dnorm(x, mean = mean_val, sd = sd_val)
df_norm <- data.frame(x = x, y = y)

# Create the plot
normal_plot <- ggplot(df_norm, aes(x = x, y = y)) +
  # Fill areas under the curve
  geom_area(data = subset(df_norm, x >= mean_val - sd_val & x <= mean_val + sd_val),
            aes(x = x, y = y), fill = colors[1], alpha = 0.3) +
  geom_area(data = subset(df_norm, x >= mean_val - 2*sd_val & x <= mean_val + 2*sd_val),
            aes(x = x, y = y), fill = colors[2], alpha = 0.2) +
  geom_area(data = subset(df_norm, x >= mean_val - 3*sd_val & x <= mean_val + 3*sd_val),
            aes(x = x, y = y), fill = colors[3], alpha = 0.1) +
  # Add the curve
  geom_line(size = 1.5, color = "black") +
  # Add vertical lines for standard deviations
  geom_vline(xintercept = mean_val, linetype = "solid", size = 1, color = "black") +
  geom_vline(xintercept = c(mean_val - sd_val, mean_val + sd_val), 
             linetype = "dashed", size = 0.8, color = colors[1]) +
  geom_vline(xintercept = c(mean_val - 2*sd_val, mean_val + 2*sd_val), 
             linetype = "dashed", size = 0.8, color = colors[2]) +
  geom_vline(xintercept = c(mean_val - 3*sd_val, mean_val + 3*sd_val), 
             linetype = "dashed", size = 0.8, color = colors[3]) +
  # Add labels
  annotate("text", x = mean_val, y = max(y) * 0.5, label = "68%", 
           size = 5, fontface = "bold", color = colors[1]) +
  annotate("text", x = mean_val, y = max(y) * 0.3, label = "95%", 
           size = 5, fontface = "bold", color = colors[2]) +
  annotate("text", x = mean_val, y = max(y) * 0.1, label = "99.7%", 
           size = 5, fontface = "bold", color = colors[3]) +
  # Labels
  scale_x_continuous(breaks = c(mean_val - 3*sd_val, mean_val - 2*sd_val, 
                                mean_val - sd_val, mean_val, 
                                mean_val + sd_val, mean_val + 2*sd_val, 
                                mean_val + 3*sd_val),
                     labels = c("μ-3σ", "μ-2σ", "μ-σ", "μ", "μ+σ", "μ+2σ", "μ+3σ")) +
  labs(title = "Normal Distribution: The 68-95-99.7 Rule",
       subtitle = "Proportion of data within standard deviations from the mean",
       x = "Value", y = "Probability Density") +
  theme(plot.title = element_text(face = "bold", size = 14))

print(normal_plot)

# ==================================================
# 4. SIMPLE LINEAR REGRESSION
# ==================================================

# Load required libraries
library(ggplot2)
library(scales)

# Color palette (redefined here so this section can be run on its own)
colors <- c("#2E86AB", "#A23B72", "#F18F01", "#C73E1D", "#592E83")

# Generate data for regression example (Education vs Income)
set.seed(2024)
n_reg <- 200
education <- round(rnorm(n_reg, mean = 14, sd = 3))
education[education < 8] <- 8
education[education > 22] <- 22

# Create income with linear relationship plus noise
income <- 15000 + 4000 * education + rnorm(n_reg, mean = 0, sd = 8000)
income[income < 10000] <- 10000

reg_data <- data.frame(education = education, income = income)

# Fit linear model
lm_model <- lm(income ~ education, data = reg_data)

# Create subset of data for residual lines
subset_indices <- sample(nrow(reg_data), 20)
subset_data <- reg_data[subset_indices, ]
subset_data$predicted <- predict(lm_model, newdata = subset_data)

# Create regression plot
regression_plot <- ggplot(reg_data, aes(x = education, y = income)) +
  # Add points
  geom_point(alpha = 0.6, size = 2, color = colors[1]) +
  
  # Add regression line with confidence interval
  geom_smooth(method = "lm", se = TRUE, color = colors[2], fill = colors[2], alpha = 0.2) +
  
  # Add residual lines for a subset of points to show the concept
  geom_segment(data = subset_data,
               aes(x = education, xend = education, 
                   y = income, yend = predicted),
               color = colors[4], alpha = 0.5, linetype = "dotted") +
  
  # Add equation to plot (adjusted position based on data range)
  annotate("text", x = min(reg_data$education) + 1, y = max(reg_data$income) * 0.9, 
           label = paste("Income = $", format(round(coef(lm_model)[1]), big.mark = ","), 
                        " + $", format(round(coef(lm_model)[2]), big.mark = ","), " × Education",
                        "\nR² = ", round(summary(lm_model)$r.squared, 3), sep = ""),
           hjust = 0, size = 4, fontface = "italic") +
  
  # Labels and formatting
  scale_y_continuous(labels = dollar_format()) +
  labs(title = "Simple Linear Regression: Education and Income",
       subtitle = "Each year of education associated with higher income",
       x = "Years of Education", 
       y = "Annual Income") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14))

print(regression_plot)

# ==================================================
# 5. SAMPLING ERROR AND SAMPLE SIZE
# ==================================================

# Show how standard error decreases with sample size
set.seed(111)
sample_sizes <- c(10, 25, 50, 100, 250, 500, 1000, 2500, 5000)
n_simulations <- 1000

# True population parameters
true_mean <- 50
true_sd <- 10

# Run simulations for each sample size
se_results <- data.frame()
for (n in sample_sizes) {
  sample_means <- replicate(n_simulations, mean(rnorm(n, true_mean, true_sd)))
  se_results <- rbind(se_results, 
                      data.frame(n = n, 
                                se_empirical = sd(sample_means),
                                se_theoretical = true_sd / sqrt(n)))
}

# Create the plot
se_plot <- ggplot(se_results, aes(x = n)) +
  geom_line(aes(y = se_empirical, color = "Empirical SE"), size = 1.5) +
  geom_point(aes(y = se_empirical, color = "Empirical SE"), size = 3) +
  geom_line(aes(y = se_theoretical, color = "Theoretical SE"), 
            size = 1.5, linetype = "dashed") +
  scale_x_log10(breaks = sample_sizes) +
  scale_color_manual(values = c("Empirical SE" = colors[1], 
                               "Theoretical SE" = colors[2])) +
  labs(title = "Standard Error Decreases with Sample Size",
       subtitle = "The precision of estimates improves with larger samples",
       x = "Sample Size (log scale)", 
       y = "Standard Error",
       color = "") +
  theme(plot.title = element_text(face = "bold", size = 14),
        legend.position = "top")

print(se_plot)

# ==================================================
# 6. CONFIDENCE INTERVALS VISUALIZATION
# ==================================================

# Simulate multiple samples and their confidence intervals
set.seed(999)
n_samples <- 20
sample_size_ci <- 100
true_mean_ci <- 50
true_sd_ci <- 10

# Generate samples and calculate CIs
ci_data <- data.frame()
for (i in 1:n_samples) {
  sample_i <- rnorm(sample_size_ci, true_mean_ci, true_sd_ci)
  mean_i <- mean(sample_i)
  se_i <- sd(sample_i) / sqrt(sample_size_ci)
  ci_lower <- mean_i - 1.96 * se_i
  ci_upper <- mean_i + 1.96 * se_i
  contains_true <- (true_mean_ci >= ci_lower) & (true_mean_ci <= ci_upper)
  
  ci_data <- rbind(ci_data,
                   data.frame(sample = i, mean = mean_i, 
                             lower = ci_lower, upper = ci_upper,
                             contains = contains_true))
}

# Create CI plot
ci_plot <- ggplot(ci_data, aes(x = sample, y = mean)) +
  geom_hline(yintercept = true_mean_ci, color = "red", 
             linetype = "dashed", size = 1) +
  geom_errorbar(aes(ymin = lower, ymax = upper, color = contains), 
                width = 0.3, size = 0.8) +
  geom_point(aes(color = contains), size = 2) +
  scale_color_manual(values = c("TRUE" = colors[1], "FALSE" = colors[4]),
                    labels = c("Misses true value", "Contains true value")) +
  coord_flip() +
  labs(title = "95% Confidence Intervals from 20 Different Samples",
       subtitle = paste("True population mean = ", true_mean_ci, 
                       " (red dashed line)", sep = ""),
       x = "Sample Number", 
       y = "Sample Mean with 95% CI",
       color = "") +
  theme(plot.title = element_text(face = "bold", size = 14),
        legend.position = "bottom")

print(ci_plot)

# ==================================================
# 7. SAMPLING DISTRIBUTIONS (CENTRAL LIMIT THEOREM)
# ==================================================

# ---- Setup ----
library(tidyverse)
library(ggplot2)
theme_set(theme_minimal(base_size = 13))
set.seed(2025)

# Skewed population (Gamma); change if you want another DGP
Npop <- 100000
population <- rgamma(Npop, shape = 2, scale = 10)  # skewed right
mu    <- mean(population)
sigma <- sd(population)

# ---- CLT: sampling distribution of the mean ----
sample_sizes <- c(1, 5, 10, 30, 100)
B <- 2000  # resamples per n

clt_df <- purrr::map_dfr(sample_sizes, \(n) {
  tibble(n = n,
         mean = replicate(B, mean(sample(population, n, replace = TRUE))))
})

# Normal overlays: N(mu, sigma/sqrt(n))
clt_range <- clt_df |>
  group_by(n) |>
  summarise(min_x = min(mean), max_x = max(mean), .groups = "drop")

normal_df <- clt_range |>
  rowwise() |>
  mutate(x = list(seq(min_x, max_x, length.out = 200))) |>
  unnest(x) |>
  mutate(density = dnorm(x, mean = mu, sd = sigma / sqrt(n)))

clt_plot <- ggplot(clt_df, aes(mean)) +
  geom_histogram(aes(y = after_stat(density), fill = factor(n)),
                 bins = 30, alpha = 0.6, color = "white") +
  geom_line(data = normal_df, aes(x, density), linewidth = 0.8) +
  geom_vline(xintercept = mu, linetype = "dashed") +
  facet_wrap(~ n, scales = "free", ncol = 3) +
  labs(
    title = "CLT: Sampling distribution of the mean → Normal(μ, σ/√n)",
    subtitle = sprintf("Skewed population: Gamma(shape=2, scale=10).  μ≈%.2f, σ≈%.2f; B=%d resamples each.", mu, sigma, B),
    x = "Sample mean", y = "Density"
  ) +
  guides(fill = "none")

clt_plot

# ==================================================
# 8. TYPES OF SAMPLING ERROR
# ==================================================

# Create data to show random vs systematic error
set.seed(321)
n_measurements <- 100
true_value <- 50

# Random error only
random_error <- rnorm(n_measurements, mean = true_value, sd = 5)

# Mostly systematic error (bias), with a little random noise so points remain visible
systematic_error <- rep(true_value + 10, n_measurements) + rnorm(n_measurements, 0, 0.5)

# Both errors
both_errors <- rnorm(n_measurements, mean = true_value + 10, sd = 5)

error_data <- data.frame(
  measurement = 1:n_measurements,
  `Random Error Only` = random_error,
  `Systematic Error Only` = systematic_error,
  `Both Errors` = both_errors,
  check.names = FALSE  # keep spaces in column names so the facet labels read cleanly
) %>%
  pivot_longer(-measurement, names_to = "Error_Type", values_to = "Value")

# Create error visualization
error_plot <- ggplot(error_data, aes(x = measurement, y = Value, color = Error_Type)) +
  geom_hline(yintercept = true_value, linetype = "dashed", linewidth = 1, color = "black") +
  geom_point(alpha = 0.6, size = 1) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1.2) +
  facet_wrap(~Error_Type, nrow = 1) +
  scale_color_manual(values = colors[1:3]) +
  labs(title = "Random Error vs Systematic Error (Bias)",
       subtitle = paste("True value = ", true_value, " (black dashed line)", sep = ""),
       x = "Measurement Number", 
       y = "Measured Value") +
  theme(plot.title = element_text(face = "bold", size = 14),
        legend.position = "none")

print(error_plot)

# ==================================================
# 9. DEMOGRAPHIC PYRAMID
# ==================================================

# Create age pyramid data
set.seed(777)
age_groups <- c("0-4", "5-9", "10-14", "15-19", "20-24", "25-29", 
               "30-34", "35-39", "40-44", "45-49", "50-54", 
               "55-59", "60-64", "65-69", "70-74", "75-79", "80+")

# Illustrative data shaped like a typical developing-country age structure (not real figures)
male_pop <- c(12, 11.5, 11, 10.5, 10, 9.5, 9, 8.5, 8, 7.5, 7, 
             6, 5, 4, 3, 2, 1.5)
female_pop <- c(11.8, 11.3, 10.8, 10.3, 9.8, 9.3, 8.8, 8.3, 7.8, 
               7.3, 6.8, 5.8, 4.8, 3.8, 2.8, 2.2, 2)

pyramid_data <- data.frame(
  Age = factor(rep(age_groups, 2), levels = rev(age_groups)),
  Population = c(-male_pop, female_pop),  # Negative for males
  Sex = c(rep("Male", length(male_pop)), rep("Female", length(female_pop)))
)

# Create population pyramid
pyramid_plot <- ggplot(pyramid_data, aes(x = Age, y = Population, fill = Sex)) +
  geom_bar(stat = "identity", width = 1) +
  scale_y_continuous(labels = function(x) paste0(abs(x), "%")) +
  scale_fill_manual(values = c("Male" = colors[1], "Female" = colors[3])) +
  coord_flip() +
  labs(title = "Population Pyramid",
       subtitle = "Age and sex distribution (typical developing country pattern)",
       x = "Age Group", 
       y = "Percentage of Population") +
  theme(plot.title = element_text(face = "bold", size = 14),
        legend.position = "top")

print(pyramid_plot)

# ==================================================
# 10. REGRESSION RESIDUALS AND DIAGNOSTICS
# ==================================================

# Use the previous regression model for diagnostics
reg_diagnostics <- data.frame(
  fitted = fitted(lm_model),
  residuals = residuals(lm_model),
  standardized_residuals = rstandard(lm_model),
  education = reg_data$education,
  income = reg_data$income
)

# Create diagnostic plots
# 1. Residuals vs Fitted
p_resid_fitted <- ggplot(reg_diagnostics, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.5, color = colors[1]) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_smooth(method = "loess", se = TRUE, color = colors[2], size = 0.8) +
  labs(title = "Residuals vs Fitted Values",
       subtitle = "Check for homoscedasticity",
       x = "Fitted Values", y = "Residuals")

# 2. Q-Q plot
p_qq <- ggplot(reg_diagnostics, aes(sample = standardized_residuals)) +
  stat_qq(color = colors[1]) +
  stat_qq_line(color = "red", linetype = "dashed") +
  labs(title = "Normal Q-Q Plot",
       subtitle = "Check for normality of residuals",
       x = "Theoretical Quantiles", y = "Standardized Residuals")

# 3. Histogram of residuals
p_hist_resid <- ggplot(reg_diagnostics, aes(x = residuals)) +
  geom_histogram(bins = 30, fill = colors[3], alpha = 0.7, color = "white") +
  geom_vline(xintercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Distribution of Residuals",
       subtitle = "Should be approximately normal",
       x = "Residuals", y = "Frequency")

# 4. Residuals vs Predictor
p_resid_x <- ggplot(reg_diagnostics, aes(x = education, y = residuals)) +
  geom_point(alpha = 0.5, color = colors[4]) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_smooth(method = "loess", se = TRUE, color = colors[2], size = 0.8) +
  labs(title = "Residuals vs Predictor",
       subtitle = "Check for patterns",
       x = "Education (years)", y = "Residuals")

# Combine diagnostic plots into a 2x2 grid (patchwork syntax: + places side by side, / stacks)
library(patchwork)
diagnostic_plots <- (p_resid_fitted + p_qq) / (p_hist_resid + p_resid_x)
print(diagnostic_plots)

# ==================================================
# 11. SAVE ALL PLOTS (Optional)
# ==================================================

# Uncomment to save plots as high-resolution images
# ggsave("population_sample.png", population_sample_plot, width = 10, height = 8, dpi = 300)
# ggsave("distributions.png", distributions_plot, width = 12, height = 8, dpi = 300)
# ggsave("normal_distribution.png", normal_plot, width = 10, height = 6, dpi = 300)
# ggsave("regression.png", regression_plot, width = 10, height = 7, dpi = 300)
# ggsave("standard_error.png", se_plot, width = 10, height = 6, dpi = 300)
# ggsave("confidence_intervals.png", ci_plot, width = 10, height = 8, dpi = 300)
# ggsave("central_limit_theorem.png", clt_plot, width = 14, height = 5, dpi = 300)
# ggsave("error_types.png", error_plot, width = 14, height = 5, dpi = 300)
# ggsave("population_pyramid.png", pyramid_plot, width = 8, height = 8, dpi = 300)
# ggsave("regression_diagnostics.png", diagnostic_plots, width = 12, height = 10, dpi = 300)

1.25 Appendix B: Central Limit Theorem (CLT)

The Central Limit Theorem states that the distribution of sample means approaches a normal distribution as sample size increases, regardless of the shape of the original population distribution.

Key Insights

  • Sample Size Threshold: Sample sizes of n ≥ 30 are typically sufficient for the CLT to apply
  • Standard Error: The standard deviation of sample means equals σ/√n, where σ is the population standard deviation
  • Statistical Foundation: We can make inferences about population parameters using normal distribution properties

1.26 Visual Demonstration: Step-by-Step Progression

The most effective way to understand the CLT is to watch the distribution of the sample mean transform as the number of dice increases. Starting from a single die, whose outcomes follow a discrete uniform distribution, each increase in sample size pushes the distribution of the mean closer to a normal shape.

library(ggplot2)
library(dplyr)

set.seed(123)

The Progressive Transformation

# Sample sizes to demonstrate
sample_sizes <- c(1, 2, 5, 10, 30, 50)
num_simulations <- 10000

# Simulate for each sample size
all_data <- data.frame()

for (n in sample_sizes) {
  means <- replicate(num_simulations, {
    dice <- sample(1:6, n, replace = TRUE)
    mean(dice)
  })
  
  temp_df <- data.frame(
    mean = means,
    n = n,
    label = paste(n, ifelse(n == 1, "die", "dice"))
  )
  all_data <- rbind(all_data, temp_df)
}

# Create ordered factor
all_data$label <- factor(all_data$label, 
                         levels = paste(sample_sizes, 
                                       ifelse(sample_sizes == 1, "die", "dice")))

# Plot the progression
ggplot(all_data, aes(x = mean)) +
  geom_histogram(aes(y = after_stat(density)), 
                 bins = 40, fill = "#3b82f6", color = "white", alpha = 0.7) +
  facet_wrap(~label, scales = "free", ncol = 3) +
  labs(
    title = "Central Limit Theorem: Step-by-Step Progression",
    subtitle = sprintf("Each panel shows %s simulations demonstrating the convergence to normality", 
                      format(num_simulations, big.mark = ",")),
    x = "Mean Value",
    y = "Density"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 11, color = "gray40"),
    strip.text = element_text(face = "bold", size = 12),
    strip.background = element_rect(fill = "#f0f0f0", color = NA)
  )

Analysis of Progressive Stages:

  • 1 die: Uniform (discrete) distribution - all values 1 to 6 equally probable
  • 2 dice: Triangular tendency - central values more frequent
  • 5 dice: Emergent bell-shaped pattern - observable clustering around 3.5
  • 10 dice: Distinctly normal - narrow Gaussian curve forming
  • 30 dice: Normal distribution - practical demonstration of CLT
  • 50 dice: Near-ideal normal distribution - strong concentration around mean

The distribution exhibits decreasing variability and increasingly pronounced bell-shaped characteristics as n increases.

Comparative Analysis

A cleaner side-by-side comparison of the key stages:

key_sizes <- all_data %>%
  filter(n %in% c(1, 2, 5, 10, 30))

ggplot(key_sizes, aes(x = mean)) +
  geom_histogram(aes(y = after_stat(density)), 
                 bins = 40, fill = "#3b82f6", color = "white", alpha = 0.7) +
  facet_wrap(~label, scales = "free_x", nrow = 1) +
  labs(
    title = "CLT Evolution: From Uniform to Normal",
    x = "Mean Value",
    y = "Density"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    strip.text = element_text(face = "bold", size = 11),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank()
  )

Superimposed Distributions

An alternative visualization method displaying all distributions simultaneously:

comparison_data <- all_data %>%
  filter(n %in% c(1, 5, 10, 30))

ggplot(comparison_data, aes(x = mean, fill = label, color = label)) +
  geom_density(alpha = 0.3, linewidth = 1.2) +
  scale_fill_manual(values = c("#991b1b", "#ea580c", "#ca8a04", "#16a34a")) +
  scale_color_manual(values = c("#991b1b", "#ea580c", "#ca8a04", "#16a34a")) +
  labs(
    title = "CLT Progression: Superimposed Distributions",
    subtitle = "Systematic narrowing and convergence to normal form",
    x = "Mean Value",
    y = "Density",
    fill = "Sample Size",
    color = "Sample Size"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    legend.position = "right"
  )

Key Observation: As sample size increases, the distribution exhibits:

  1. Increased symmetry (bell-shaped form)
  2. Greater concentration around the population mean (3.5)
  3. Improved conformity to the normal distribution

Standard Error Convergence

The dispersion (standard deviation) decreases according to the relationship SE = σ/√n:

variance_data <- all_data %>%
  group_by(n, label) %>%
  summarise(
    observed_sd = sd(mean),
    theoretical_se = sqrt(35/12) / sqrt(n),
    .groups = "drop"
  )

ggplot(variance_data, aes(x = n)) +
  geom_line(aes(y = observed_sd, color = "Observed SD"), 
            linewidth = 1.5) +
  geom_point(aes(y = observed_sd, color = "Observed SD"), 
             size = 4) +
  geom_line(aes(y = theoretical_se, color = "Theoretical SE"), 
            linewidth = 1.5, linetype = "dashed") +
  geom_point(aes(y = theoretical_se, color = "Theoretical SE"), 
             size = 4) +
  scale_color_manual(values = c("Observed SD" = "#3b82f6", 
                                "Theoretical SE" = "#ef4444")) +
  scale_x_continuous(breaks = sample_sizes) +
  labs(
    title = "Standard Error Decreases as Sample Size Increases",
    subtitle = "Following the SE = σ/√n relationship",
    x = "Sample Size (n)",
    y = "Standard Deviation / Standard Error",
    color = NULL
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    legend.position = "top",
    legend.text = element_text(size = 11)
  )

Numerical Summary

summary_stats <- all_data %>%
  group_by(label) %>%
  summarise(
    n = first(n),
    Observed_Mean = round(mean(mean), 3),
    Observed_SD = round(sd(mean), 3),
    Theoretical_Mean = 3.5,
    Theoretical_SE = round(sqrt(35/12) / sqrt(first(n)), 3),
    Range = paste0("[", round(min(mean), 2), ", ", round(max(mean), 2), "]")
  ) %>%
  select(-label)

knitr::kable(summary_stats, 
             caption = "Observed vs Theoretical Values Across Sample Sizes")
Observed vs Theoretical Values Across Sample Sizes

  n   Observed_Mean   Observed_SD   Theoretical_Mean   Theoretical_SE   Range
  1   3.470           1.716         3.5                1.708            [1, 6]
  2   3.503           1.213         3.5                1.208            [1, 6]
  5   3.494           0.764         3.5                0.764            [1, 6]
 10   3.507           0.537         3.5                0.540            [1.7, 5.4]
 30   3.500           0.311         3.5                0.312            [2.27, 4.63]
 50   3.498           0.239         3.5                0.242            [2.68, 4.3]

Observations:

  • The population mean remains constant at 3.5 (independent of sample size)
  • The standard error exhibits systematic decline as n increases (SE ∝ 1/√n)
  • The range narrows considerably with increasing sample size

1.27 Mathematical Foundation

For a population with mean μ and finite variance σ²:

\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \text{ as } n \to \infty

Standard error of the mean:

SE_{\bar{X}} = \frac{\sigma}{\sqrt{n}}

For a fair die: μ = 3.5, σ² = 35/12 ≈ 2.917
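
As a quick numerical check of these constants, a minimal base-R sketch computes the fair-die mean and variance directly from the six equally likely faces:

# Check the fair-die constants used above (base R only)
faces   <- 1:6
mu_die  <- mean(faces)                  # 3.5
var_die <- mean((faces - mu_die)^2)     # population variance = 35/12 ≈ 2.917
se_n30  <- sqrt(var_die) / sqrt(30)     # theoretical SE of the mean at n = 30 (≈ 0.312)
c(mu = mu_die, var = round(var_die, 3), se_n30 = round(se_n30, 3))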

1.28 Key Takeaways

  1. Initial Condition: A single die exhibits a uniform (discrete) distribution
  2. Progressive Transformation: As the number of observations increases, the distribution shape systematically evolves
  3. Convergence to Normality: At n=30, a distinct normal distribution is observable
  4. Variance Reduction: The distribution demonstrates increasing concentration around the expected value
  5. Universality: The theorem applies to any population distribution with finite variance

1.29 Practical Significance

This distributional transformation enables:

  • Application of normal distribution tables and properties for statistical inference
  • Construction of confidence intervals with specified confidence levels
  • Execution of hypothesis tests (t-tests, z-tests)
  • Formulation of predictions about sample means with known probability

Essential Property of CLT: Although individual die rolls follow a uniform distribution, the distribution of means of multiple dice converges to a normal distribution in a predictable, mathematically described way. This convergence is what makes classical statistical inference possible.


1.30 Appendix C: Standard Errors and Margins of Error: Means, Proportions, Variance, and Covariance

Key Insight: A Proportion IS a Mean

A proportion is simply the mean of a binary (0/1) variable. If you code “success” as 1 and “failure” as 0, then:

\hat{p} = \bar{x} = \frac{\sum x_i}{n}

For example, if 6 out of 10 people support a policy (coded as 1=support, 0=don’t support):

  • Proportion: \hat{p} = 0.6
  • Mean: \bar{x} = \frac{1+1+1+1+1+1+0+0+0+0}{10} = 0.6

They’re identical! The special formulas for proportions are just the general formulas applied to binary data.
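
The same point as a minimal R sketch, coding the ten hypothetical responses as 0/1:

# Ten hypothetical responses: six support (1), four do not (0)
x <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0)
mean(x)             # 0.6 -- the proportion is just the mean of the 0/1 variable
sum(x) / length(x)  # the same number written as a proportion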

The Universal Formula for Means

Both proportions and continuous means use the same fundamental formula for standard error:

SE = \frac{SD}{\sqrt{n}}

The Margin of Error (for 95% confidence) is then:

MoE = 1.96 \times SE = 1.96 \times \frac{SD}{\sqrt{n}}

Calculating SE and MoE for Proportions

For a sample proportion \hat{p}, the standard deviation is derived from the binomial distribution:

SD = \sqrt{p(1-p)}

Therefore:

SE_p = \sqrt{\frac{p(1-p)}{n}}

MoE_p = 1.96\sqrt{\frac{p(1-p)}{n}}

Example: Political Poll

If 60% of voters support a candidate (p = 0.6) with n = 400:

  • SD = \sqrt{0.6 \times 0.4} = \sqrt{0.24} = 0.490
  • SE = \frac{0.490}{\sqrt{400}} = \frac{0.490}{20} = 0.0245 (or 2.45%)
  • MoE = 1.96 \times 0.0245 = 0.048 (or ±4.8%)
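
The same calculation as a short R sketch, plugging in the numbers above:

# Poll example: p = 0.6, n = 400
p <- 0.6; n <- 400
se_p  <- sqrt(p * (1 - p) / n)   # 0.0245
moe_p <- 1.96 * se_p             # about 0.048, i.e. roughly ±4.8 percentage points
c(SE = round(se_p, 4), MoE = round(moe_p, 3))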

Calculating SE and MoE for Typical Means

For a continuous variable like height, weight, or test scores:

SE_{\bar{x}} = \frac{SD}{\sqrt{n}}

MoE_{\bar{x}} = 1.96 \times \frac{SD}{\sqrt{n}}

Example: Mean Height

If measuring height with SD = 10 cm and n = 100:

  • SE = \frac{10}{\sqrt{100}} = \frac{10}{10} = 1.0 cm
  • MoE = 1.96 \times 1.0 = ±1.96 cm
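
And the continuous-mean version, assuming SD = 10 cm and n = 100 as above:

# Height example: SD = 10 cm, n = 100
sd_h <- 10; n <- 100
se_h  <- sd_h / sqrt(n)   # 1.0 cm
moe_h <- 1.96 * se_h      # ±1.96 cm
c(SE = se_h, MoE = moe_h)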

Why Proportions Often Require Larger Samples

The perception that proportions need larger samples arises from several factors:

1. Maximum Variance at p = 0.5

The variance p(1-p) is maximized when p = 0.5, giving:

SD_{max} = \sqrt{0.5 \times 0.5} = 0.5

On the 0-1 scale, a standard deviation of 0.5 is half the entire range, so proportion estimates are noisy relative to the scale of measurement. For the “maximum uncertainty” case (p = 0.5), the required sample size for a given MoE is:

n = \left(\frac{1.96 \times 0.5}{MoE}\right)^2 = \frac{0.9604}{MoE^2}

Sample size requirements for different margins of error (at p = 0.5):

Desired MoE     Required n
±1% (0.01)      9,604
±2% (0.02)      2,401
±3% (0.03)      1,068
±5% (0.05)      385
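
A small helper (the function name n_required is ours, not a standard one) reproduces these sample sizes for any p and desired MoE, rounding up as is conventional:

# Required n for a proportion at 95% confidence (ceiling = round up)
n_required <- function(p, moe, z = 1.96) ceiling(z^2 * p * (1 - p) / moe^2)

# Reproduce the table above at p = 0.5
sapply(c(0.01, 0.02, 0.03, 0.05), function(m) n_required(p = 0.5, moe = m))
# 9604 2401 1068  385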

2. Context of Precision

The desired precision differs by context:

  • Proportions: Political polls typically want ±3-4 percentage points
  • Height: ±0.5 cm might suffice (only 5% of a 10 cm SD)
  • Test scores: ±2 points might be acceptable (depends on scale)

These represent different levels of relative precision.

3. Scale Matters

For a proportion measured as ±0.02 (2 percentage points):

  • This is 2% of the full 0-1 scale
  • Relatively speaking, this is very precise

For height measured as ±2 cm with SD = 10 cm:

  • This is only 20% of one standard deviation
  • Less stringent requirement

4. Rare Events

When estimating rare proportions (e.g., p = 0.01), you need enough sample to actually observe the events:

  • For p = 0.01 with n = 100, you expect only 1 success
  • Need n \approx 1,500 for ±0.5% precision
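
Using the n_required helper sketched above, the rare-event case works out to roughly the figure quoted:

# Rare-event case: p = 0.01, MoE = 0.005 (±0.5 percentage points)
n_required(p = 0.01, moe = 0.005)   # 1522, i.e. roughly 1,500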

Margin of Error and Sample Size for Variance

Variance estimation is more complex because sample variance does not follow a normal distribution - it follows a scaled chi-squared distribution (for normally distributed data).


Standard Error of Variance

For a normally distributed variable, the standard error of the sample variance s^2 is:

SE(s^2) = s^2\sqrt{\frac{2}{n-1}}

Example: Height Variance

If height has s^2 = 100 cm² (so s = 10 cm) with n = 101:

  • SE(s^2) = 100\sqrt{\frac{2}{100}} = 100 \times 0.1414 = 14.14 cm²
  • MoE = 1.96 \times 14.14 = ±27.7 cm²
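
The same numbers in a short R sketch, together with the exact chi-squared interval (assuming normally distributed heights) to show how asymmetric the CI really is:

# Variance example: s^2 = 100 cm^2, n = 101 (normality assumed)
s2 <- 100; n <- 101
se_s2 <- s2 * sqrt(2 / (n - 1))                        # 14.14 cm^2
c(SE = round(se_s2, 2), MoE = round(1.96 * se_s2, 1))

# Exact 95% CI from the chi-squared distribution -- note the asymmetry
ci_exact <- (n - 1) * s2 / qchisq(c(0.975, 0.025), df = n - 1)
round(ci_exact, 1)   # roughly (77.2, 134.7) rather than the symmetric 100 ± 27.7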

Important considerations:

  1. Confidence intervals are asymmetric: Because the chi-squared distribution is skewed, exact CIs should use chi-squared quantiles rather than the ±1.96 approach
  2. Normality assumption matters: The formula assumes underlying normality
  3. Larger samples needed: The relative SE of s^2 shrinks at the same 1/\sqrt{n} rate as for a mean, but with a much larger constant, so variance estimates need considerably more data for the same relative precision
  4. Sample size for given precision: To get MoE = 0.1 × s^2 (10% precision):

n \approx 1 + 2\left(\frac{1.96}{0.1}\right)^2 = 769

This is much larger than for means!

Standard Error of Standard Deviation

For the standard deviation itself (using delta method):

SE(s) \approx \frac{s}{\sqrt{2(n-1)}}

The relative standard error of s is roughly half the relative standard error of s^2.

Margin of Error and Sample Size for Covariance

Covariance estimation is even more complex because it depends on the joint distribution of two variables.

Standard Error of Covariance

For two variables X and Y from a bivariate normal distribution:

SE(Cov(X,Y)) \approx \sqrt{\frac{1}{n}\left[\sigma_X^2\sigma_Y^2 + \sigma_{XY}^2\right]}

Where:

  • \sigma_X^2, \sigma_Y^2 are the population variances
  • \sigma_{XY} is the population covariance

In practice, these are estimated from the sample.

Example: Covariance of Height and Weight

Suppose:

  • s_X = 10 cm (height SD)
  • s_Y = 15 kg (weight SD)
  • s_{XY} = 80 cm·kg (sample covariance)
  • n = 100

SE(s_{XY}) \approx \sqrt{\frac{1}{100}[(10^2)(15^2) + 80^2]} = \sqrt{\frac{1}{100}[22,500 + 6,400]}

= \sqrt{289} = 17.0 \text{ cm·kg}

MoE = 1.96 \times 17.0 = ±33.3 \text{ cm·kg}
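
A sketch of the same arithmetic in R, using the assumed sample quantities:

# Covariance example: sx = 10 cm, sy = 15 kg, sxy = 80 cm*kg, n = 100
sx <- 10; sy <- 15; sxy <- 80; n <- 100
se_cov <- sqrt((sx^2 * sy^2 + sxy^2) / n)   # sqrt(289) = 17
c(SE = se_cov, MoE = round(1.96 * se_cov, 1))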

Standard Error of Correlation

For Pearson’s correlation coefficient r, a large-sample approximation (least accurate when \rho is close to ±1) is:

SE(r) \approx \frac{(1-r^2)}{\sqrt{n}}

For r = 0.5 and n = 100:

SE(r) = \frac{1-0.25}{\sqrt{100}} = \frac{0.75}{10} = 0.075

MoE = 1.96 \times 0.075 = ±0.147
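
For comparison, a Fisher z-transform interval for the same r = 0.5 and n = 100 (the approach recommended in the notes below), sketched in base R:

# Approximate SE and a Fisher z-transform CI for r = 0.5, n = 100
r <- 0.5; n <- 100
se_r <- (1 - r^2) / sqrt(n)                     # 0.075
z    <- atanh(r)                                # Fisher z ≈ 0.549
ci_r <- tanh(z + c(-1.96, 1.96) / sqrt(n - 3))  # back-transformed 95% CI
round(ci_r, 3)   # roughly (0.34, 0.63): slightly asymmetric around r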

Important notes for variance/covariance:

  1. Non-normal sampling distributions: These statistics don’t follow normal distributions, especially for small samples
  2. Fisher’s z-transformation: For correlation, CIs are typically computed using Fisher’s z-transform for better coverage
  3. Bootstrap methods recommended: For complex scenarios, bootstrap confidence intervals often perform better
  4. Larger samples required: Variance and covariance require substantially larger samples than means for equivalent precision

Comparative Example: Sample Size Requirements

To achieve 10% relative precision (MoE = 10% of the estimate):

For a Mean:

  • Need: n = \left(\frac{1.96 \times CV}{0.10}\right)^2 where CV = SD/\mu
  • If CV = 0.5: n = 96

For a Proportion at p = 0.5:

  • Need: n = \left(\frac{1.96 \times 0.5}{0.05}\right)^2 = 384.2 \approx 385
  • (Here 10% relative precision means ±0.05 absolute on the 0-1 scale)

For a Variance:

  • Need: n \approx 1 + 2\left(\frac{1.96}{0.10}\right)^2 = 769

Variance requires approximately 8× the sample size of a mean for equivalent relative precision!
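
The three sample sizes above can be verified in one line each (a sketch assuming the 95% z value of 1.96):

# 10% relative precision at 95% confidence
(1.96 * 0.5 / 0.10)^2     # mean with CV = 0.5    -> 96.04
(1.96 * 0.5 / 0.05)^2     # proportion at p = 0.5 -> 384.16 (round up to 385)
1 + 2 * (1.96 / 0.10)^2   # variance              -> 769.3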

Key Takeaways

For Means (including proportions):

  • Proportions ARE means - just means of 0/1 data
  • Standard errors decrease as 1/\sqrt{n}
  • Normal approximation works well for moderate samples
  • Symmetric confidence intervals appropriate

For Variance and Covariance:

  • Standard errors decrease as 1/\sqrt{n} but with larger constants
  • Sampling distributions are skewed (chi-squared for variance)
  • Require substantially larger samples for equivalent precision
  • Asymmetric confidence intervals often needed
  • Bootstrap or exact methods recommended over simple ±1.96 approach

Why proportions seem to need larger samples:

  1. The scale of measurement (0-1 vs. unbounded)
  2. The relative precision desired (±3% is stringent on 0-1 scale)
  3. The maximum SD for proportions (0.5) being large relative to range
  4. Contextual standards in polling and surveys

When comparing equivalent relative precision, sample size requirements for means and proportions are comparable!