11  Fundamentals of Univariate Descriptive Statistics

Descriptive statistics are fundamental tools in social science research, providing a concise summary of data characteristics. They condense raw observations into a few interpretable measures of central tendency, variability, relative position, and distribution shape, which are the topics of this chapter.

11.1 Introduction to Sigma Notation (Σ)

  • What is Sigma summation notation? Sigma (Σ) is a mathematical operator that instructs us to sum (add) a sequence of terms - it functions as a directive to perform addition of all elements within a specified range.
  • Purpose: Provides a concise way to write sums of many similar terms using a single symbol, avoiding lengthy addition expressions.

Basic Formula

  • The general form of sigma notation is: \sum_{i=a}^{b} f(i)
  • Summation index: i
  • Lower bound: a
  • Upper bound: b
  • Function: f(i)

Examples of Sigma Notation Applications

Simple Example: Sum of Natural Numbers

  • Suppose you want to add the first five positive integers: \sum_{i=1}^{5} i = 1 + 2 + 3 + 4 + 5 = 15
  • The above notation adds the first five positive integers.

Sum of Squares

  • Suppose you want to sum the squares of the first four positive integers: \sum_{i=1}^{4} i^2 = 1^2 + 2^2 + 3^2 + 4^2 = 1 + 4 + 9 + 16 = 30
  • This is the sum of squares of the first four positive integers.

Sum of a Constant Value

  • Summing a constant value c for n terms: \sum_{i=1}^{n} c = c + c + c + ... + c \text{ (n times)} = n \cdot c
  • Example: Sum of five fives: \sum_{i=1}^{5} 5 = 5 + 5 + 5 + 5 + 5 = 5 \cdot 5 = 25
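These sums are easy to verify in R, where sum() plays the role of Σ:

sum(1:5)        # 1 + 2 + 3 + 4 + 5 = 15
sum((1:4)^2)    # 1 + 4 + 9 + 16 = 30
sum(rep(5, 5))  # five fives: 5 * 5 = 25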

Simple Examples in Statistical Context

\sum_{i=1}^{n} x_i

  • Summation index: i (typically denotes a specific observation in a dataset)
  • Lower bound: 1 (we usually start from the first observation)
  • Upper bound: n (total number of observations in our dataset)
  • Expression: x_i (value of the ith observation)

Summing Observation Values

  • We have a dataset: 5, 8, 12, 15, 20
  • Sum of all values: \sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5 = 5 + 8 + 12 + 15 + 20 = 60
  • This sum is a key element when calculating the arithmetic mean.

Sum of Deviations from the Mean

  • For the same dataset (5, 8, 12, 15, 20), the mean is \bar{x} = 60/5 = 12
  • Sum of deviations from the mean: \sum_{i=1}^{5} (x_i - \bar{x}) = (5-12) + (8-12) + (12-12) + (15-12) + (20-12) = -7 + (-4) + 0 + 3 + 8 = 0
  • Important observation: The sum of deviations from the mean always equals 0, which is a fundamental property of the arithmetic mean.
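A quick R check of both results (the sum itself and the zero-sum property of the deviations):

x <- c(5, 8, 12, 15, 20)
sum(x)            # 60
mean(x)           # 12
sum(x - mean(x))  # 0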

Summary

  • Sigma Notation (Σ) allows for concise expression of key statistical formulas
  • The most important applications include calculating:
    • Arithmetic mean
    • Variance and standard deviation
    • Various sums of squares used in regression analysis

Summation (Σ) and Product (Π) Operators

Sigma (Σ) Operator

\sum is a summation operator that instructs us to add terms:

\sum_{i=1}^{n} x_i = x_1 + x_2 + ... + x_n

where:

  • i is the index variable
  • The lower value under Σ (here i = 1) is the starting point
  • The upper value (here n) is the ending point

Pi (Π) Operator

\prod is a product operator that instructs us to multiply terms:

\prod_{i=1}^{n} x_i = x_1 \times x_2 \times ... \times x_n

where:

  • i is the index variable
  • The lower value under Π (here i = 1) is the starting point
  • The upper value (here n) is the ending point

Example of Σ

\sum_{i=1}^{4} i = 1 + 2 + 3 + 4 = 10

Example of Π

\prod_{i=1}^{4} i = 1 \times 2 \times 3 \times 4 = 24

Key Differences
  • Σ represents repeated addition
  • Π represents repeated multiplication
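In R, these two operators correspond directly to sum() and prod():

sum(1:4)   # 1 + 2 + 3 + 4 = 10
prod(1:4)  # 1 * 2 * 3 * 4 = 24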

11.2 Types of Data Distributions

Important

A data distribution tells us which values a variable takes and how often each value occurs.

Understanding data distributions is crucial for data analysis and visualization. In this document, we’ll explore various types of distributions and how to visualize them using ggplot2 in R.

Normal Distribution

The normal distribution, also known as the Gaussian distribution, is symmetric and bell-shaped.

# Load the plotting library
library(ggplot2)

# Generate normal distribution data
normal_data <- data.frame(x = rnorm(1000))

# Plot
ggplot(normal_data, aes(x)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "skyblue", color = "black") +
  geom_density(color = "red") +
  labs(title = "Normal Distribution", x = "Value", y = "Density")

Uniform Distribution

In a uniform distribution, all values have an equal probability of occurrence.

# Generate uniform distribution data
uniform_data <- data.frame(x = runif(1000))

# Plot
ggplot(uniform_data, aes(x)) +
  geom_histogram(aes(y = ..density..), bins = 30, fill = "lightgreen", color = "black") +
  geom_density(color = "red") +
  labs(title = "Uniform Distribution", x = "Value", y = "Density")

Skewed Distributions

Skewed distributions are asymmetric, with one tail longer than the other.

# Generate right-skewed data
right_skewed <- data.frame(x = rlnorm(1000))

# Plot
ggplot(right_skewed, aes(x)) +
  geom_histogram(aes(y = ..density..), bins = 30, fill = "lightyellow", color = "black") +
  geom_density(color = "red") +
  labs(title = "Right-Skewed Distribution", x = "Value", y = "Density")

Bimodal Distribution

A bimodal distribution has two peaks, indicating two distinct subgroups in the data.

# Generate bimodal data
bimodal_data <- data.frame(x = c(rnorm(500, mean = -2), rnorm(500, mean = 2)))

# Plot
ggplot(bimodal_data, aes(x)) +
  geom_histogram(aes(y = ..density..), bins = 30, fill = "lightpink", color = "black") +
  geom_density(color = "red") +
  labs(title = "Bimodal Distribution", x = "Value", y = "Density")

| Distribution | Key Properties | Examples |
|--------------|----------------|----------|
| Symmetric (Normal) | Symmetric, bell-shaped, most values close to the mean | Adult height in a population, IQ test scores, measurement errors, standardized exam results |
| Uniform | Equal probability across the entire range | Last digit of phone numbers, random day-of-the-week selection, position of a pointer after spinning a wheel of fortune |
| Bimodal | Two distinct peaks, suggests presence of subgroups | Age structure in university towns (students and permanent residents), opinions on strongly polarizing topics, traffic intensity hours (morning and afternoon peak) |
| Right-skewed (Positively skewed) | Extended “tail” on the right side, most values less than the mean | Queue waiting time, commute time to work, age at first marriage |
| Heavy-tailed skewed (Log-normal) | Strong right asymmetry, values cannot be negative, long “fat tail” | Personal income, housing prices, household size |
| Extreme-tailed skewed (Power law) | Extreme asymmetry, “rich get richer” effect, no characteristic scale | Wealth of the richest individuals, city populations, number of followers on social media, number of citations of scientific publications |

11.3 Visualizing Real-World Data Distributions

Let’s use the palmerpenguins dataset to explore data distributions.

Histogram and Density Plot

Understanding Histograms and Density

⭐ A histogram is a special graph for numerical data where:

  • Data is grouped into ranges (called “bins”)
  • Bars touch each other (unlike bar charts!) because the data is continuous
  • Each bar’s height shows how many values fall into that range

Think of density as showing how common or concentrated certain values are in your data:

  • A higher point on a density curve (or taller bar in a histogram) means those values appear more frequently in your data
  • A lower point means those values are less common

Just like a crowded area has more people per space (higher density), a taller part of the graph shows values that appear more often in your dataset!

library(palmerpenguins)

ggplot(penguins, aes(x = flipper_length_mm)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightblue", color = "black") +
  geom_density(color = "red") +
  labs(title = "Distribution of Penguin Flipper Lengths", 
       x = "Flipper Length (mm)", 
       y = "Density")
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_density()`).

Box Plot

Box plots are useful for comparing distributions across categories.

ggplot(penguins, aes(x = species, y = body_mass_g, fill = species)) +
  geom_boxplot() +
  labs(title = "Distribution of Penguin Body Mass by Species", 
       x = "Species", 
       y = "Body Mass (g)")
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Violin Plot

Violin plots combine box plot and density plot features.

ggplot(penguins, aes(x = species, y = body_mass_g, fill = species)) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.1, fill = "white") +
  labs(title = "Distribution of Penguin Body Mass by Species", 
       x = "Species", 
       y = "Body Mass (g)")
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_ydensity()`).
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Ridgeline Plot

Ridgeline plots are useful for comparing multiple distributions.

library(ggridges)

ggplot(penguins, aes(x = flipper_length_mm, y = species, fill = species)) +
  geom_density_ridges(alpha = 0.6) +
  labs(title = "Distribution of Flipper Length by Penguin Species",
       x = "Flipper Length (mm)",
       y = "Species")
Picking joint bandwidth of 2.38
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_density_ridges()`).

Conclusion

Understanding and visualizing data distributions is crucial in data analysis. ggplot2 provides a flexible and powerful toolkit for creating various types of distribution plots. By exploring different visualization techniques, we can gain insights into the underlying patterns and characteristics of our data.

11.4 Understanding Outliers

Before diving into specific measures, it’s crucial to understand the concept of outliers, as they can significantly impact many descriptive statistics.

Outliers are data points that differ significantly from other observations in the dataset. They can occur due to:

  • Measurement or recording errors
  • Genuine extreme values in the population

Outliers can have a substantial effect on many statistical measures, especially those based on means or sums of squared deviations. Therefore, it’s essential to:

  1. Identify outliers through both statistical methods and domain knowledge
  2. Investigate the cause of outliers
  3. Make informed decisions about whether to include or exclude them in analyses

Throughout this guide, we’ll discuss how different descriptive measures are affected by outliers.

11.5 Statistical Symbols and Notations - Summary

| Measure | Population Parameter | Sample Statistic | Alternative Notations | Usage Notes |
|---------|----------------------|------------------|-----------------------|-------------|
| Size | N | n | - | Total count of observations |
| Mean | \mu | \bar{x}, m | M, E(X) | E(X) used in probability theory |
| Variance | \sigma^2 | s^2 | \text{Var}(X), V(X) | Squared deviations from mean |
| Standard Deviation | \sigma | s | \text{SD}, \text{std} | Square root of variance |
| Proportion | \pi, P | \hat{p} | \text{prop} | Relative frequencies |
| Correlation | \rho | r | \text{corr}(x,y) | Ranges from -1 to +1 |
| Standard Error | \sigma_{\bar{x}} | s_{\bar{x}} | \text{SE} | Standard error of mean |
| Sum | \sum | \sum | \sum_{i=1}^n | With indexing |
| Individual Value | X_i | x_i | - | ith observation |
| Covariance | \sigma_{xy} | s_{xy} | \text{Cov}(X,Y) | Joint variation |
| Median | \eta | \text{Med} | M | Central value |
| Range | R | r | \text{max}(X) - \text{min}(X) | Spread measure |
| Mode | \text{Mo} | \text{mo} | \text{mod} | Most frequent value |
| Skewness | \gamma_1 | g_1 | \text{SK} | Distribution asymmetry |
| Kurtosis | \gamma_2 | g_2 | \text{KU} | Distribution tailedness |

Additional useful notations:

  • Sample moments: m_k = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^k
  • Population moments: \mu_k = E[(X - \mu)^k]
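As a small illustration of the sample-moment formula, here is a minimal R sketch; the helper name central_moment is ours, not a built-in function:

# k-th sample central moment: m_k = (1/n) * sum((x - xbar)^k)
central_moment <- function(x, k) mean((x - mean(x))^k)

x <- c(2, 4, 4, 5, 5, 7, 9)
central_moment(x, 2)  # second central moment (variance with divisor n, approx. 4.41)
central_moment(x, 3)  # third central moment (enters the skewness numerator)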

11.6 Measures of Central Tendency

Measures of central tendency aim to identify the “typical” or “central” value in a dataset. The three primary measures are mean, median, and mode.

Arithmetic Mean

The arithmetic mean is the sum of all values divided by the number of values.

Formula: \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i

Important Property: The mean is a balancing point in the data. The sum of deviations from the mean is always zero:

\sum_{i=1}^n (x_i - \bar{x}) = 0

This property makes the mean useful in many statistical calculations.

Understanding Mean as a Balance Point 🎯

Let’s consider a dataset X = \{1, 2, 6, 7, 9\} on a number line, imagining it as a seesaw:

https://www.gastonsanchez.com/matrix4sl/mean-as-a-balancing-point.html

The mean (\mu) acts as the perfect balance point of this seesaw. For our data:

\mu = \frac{1 + 2 + 6 + 7 + 9}{5} = 5

What happens at different support points? 🤔

  1. Support point at 6 (too high):
    • Left side: Values (1, 2) are below
    • Right side: Values (7, 9) are above
    • \sum distances from left = (6-1) + (6-2) = 9
    • \sum distances from right = (7-6) + (9-6) = 4
    • The seesaw tilts left! ⬅️ because 9 > 4
  2. Support point at 4 (too low):
    • Left side: Values (1, 2) are below
    • Right side: Values (6, 7, 9) are above
    • \sum distances from left = (4-1) + (4-2) = 5
    • \sum distances from right = (6-4) + (7-4) + (9-4) = 10
    • The seesaw tilts right! ➡️ because 5 < 10
  3. Support point at mean (5) (perfect balance):
    • \sum distances below = \sum distances above
    • ((5-1) + (5-2)) = ((6-5) + (7-5) + (9-5))
    • 7 = 7 ✨ Perfect balance!

This shows why the mean is the unique balance point, where:

\sum_{i=1}^n (x_i - \mu) = 0

The seesaw will always tilt unless the support point is placed exactly at the mean! 🎪
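The seesaw check can be reproduced in R by summing signed deviations: a negative sum means the data pull toward smaller values, a positive sum toward larger ones. The tilt() helper is just an illustrative name:

x <- c(1, 2, 6, 7, 9)
tilt <- function(support) sum(x - support)  # net signed pull on the seesaw

tilt(6)        # -5: tilts toward the smaller values (support too high)
tilt(4)        #  5: tilts toward the larger values (support too low)
tilt(mean(x))  #  0: perfect balance at the mean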

Mean as a Balance Point

This visualization shows how the arithmetic mean (5) acts as a balance point between clustered points on the left and dispersed points on the right:

Left side of the mean:

  • Points with values 2 and 3
  • Close together (difference of 1 unit)
  • Distances from mean: 3 and 2 units
  • Sum of “pull” = 5 units

Right side of the mean:

  • Points with values 6 and 9
  • More spread out (difference of 3 units)
  • Distances from mean: 1 and 4 units
  • Sum of “pull” = 5 units

Key observations:

  1. The mean (5) is a balance point, even though:
    • Points on the left are clustered (2,3)
    • Points on the right are dispersed (6,9)
    • Green arrows show distances from the mean
  2. Balance is maintained because:
    • Sum of distances balances out: (5-2) + (5-3) = (6-5) + (9-5)
    • Total sum of distances = 5 units on each side

Manual Calculation Example:

Let’s calculate the mean for the dataset: 2, 4, 4, 5, 5, 7, 9

| Step | Description | Calculation |
|------|-------------|-------------|
| 1 | Sum all values | 2 + 4 + 4 + 5 + 5 + 7 + 9 = 36 |
| 2 | Count the number of values | n = 7 |
| 3 | Divide the sum by n | 36 / 7 ≈ 5.14 |

R calculation:

data <- c(2, 4, 4, 5, 5, 7, 9)
mean(data)
[1] 5.142857

Pros:

  • Easy to calculate and understand
  • Uses all data points
  • Useful for further statistical calculations

Cons:

  • Sensitive to outliers
  • Not ideal for skewed distributions

Example with outlier:

data_with_outlier <- c(2, 4, 4, 5, 5, 7, 100)
mean(data_with_outlier)
[1] 18.14286

As we can see, the outlier (100) drastically affects the mean.

Median

The median is the middle value when the data is ordered.

Manual Calculation Example:

Using the same dataset: 2, 4, 4, 5, 5, 7, 9

| Step | Description | Result |
|------|-------------|--------|
| 1 | Order the data | 2, 4, 4, 5, 5, 7, 9 |
| 2 | Find the middle value | 5 |

For an even number of values, take the average of the two middle values.

R calculation:

data <- c(2, 4, 4, 5, 5, 7, 9)
median(data)
[1] 5
median(data_with_outlier)
[1] 5

Pros:

  • Not affected by extreme outliers
  • Better for skewed distributions

Cons:

  • Doesn’t use all data points
  • Less useful for further statistical calculations
Warning

To find the position of the median in a dataset:

  1. First sort the data in ascending order

  2. If n is odd:

    • Median position = \frac{n + 1}{2}
  3. If n is even:

    • First median position = \frac{n}{2}
    • Second median position = \frac{n}{2} + 1
    • Median = \frac{\text{value at }\frac{n}{2} + \text{value at }(\frac{n}{2}+1)}{2}

For example:

  • Odd n=7: position = \frac{7+1}{2} = 4th value
  • Even n=8: positions = \frac{8}{2} = 4th and 4+1 = 5th value
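These position rules can be checked in R; the even-n vector below simply extends the running example with one extra value (11) for illustration:

# Odd n = 7: median is the (7 + 1)/2 = 4th ordered value
x_odd <- c(2, 4, 4, 5, 5, 7, 9)
sort(x_odd)[4]  # 5
median(x_odd)   # 5

# Even n = 8: median is the average of the 4th and 5th ordered values
x_even <- c(2, 4, 4, 5, 5, 7, 9, 11)
mean(sort(x_even)[c(4, 5)])  # 5
median(x_even)               # 5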

Mode

The mode is the most frequently occurring value.

Manual Calculation Example:

Using the dataset: 2, 4, 4, 5, 5, 7, 9

| Value | Frequency |
|-------|-----------|
| 2 | 1 |
| 4 | 2 |
| 5 | 2 |
| 7 | 1 |
| 9 | 1 |

The mode is 4 and 5 (bimodal).

R calculation:

library(modeest)
mfv(data)  # Most frequent value
[1] 4 5

Pros:

  • Only measure of central tendency for nominal data
  • Can identify multiple peaks in the data

Cons:

  • Not always uniquely defined
  • Not useful for continuous data

Weighted (arithmetic) Mean (*)

The weighted mean is used when some data points are more important than others. There are two variants: one with non-normalized (raw) weights and one with normalized weights that sum to 1.

Weighted Mean with Non-Normalized Weights

This is the standard form of the weighted mean, where weights can be any positive numbers representing the importance of each data point.

Formula: \bar{x}_w = \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i}

Manual Calculation Example:

Let’s calculate the weighted mean for the dataset: 2, 4, 5, 7 with weights 1, 2, 3, 1

| Step | Description | Calculation |
|------|-------------|-------------|
| 1 | Multiply each value by its weight and sum | (2 × 1) + (4 × 2) + (5 × 3) + (7 × 1) = 2 + 8 + 15 + 7 = 32 |
| 2 | Sum the weights | 1 + 2 + 3 + 1 = 7 |
| 3 | Divide the result from step 1 by the result from step 2 | 32 / 7 ≈ 4.57 |

R calculation:

x <- c(2, 4, 5, 7)
w <- c(1, 2, 3, 1)
weighted.mean(x, w)
[1] 4.571429

Weighted Mean with Normalized Weights (Fractions)

In this case, the weights are fractions that sum to 1, representing the proportion of importance for each data point.

Formula: \bar{x}_w = \sum_{i=1}^n w_i x_i, where \sum_{i=1}^n w_i = 1

Manual Calculation Example:

Let’s calculate the weighted mean for the dataset: 2, 4, 5, 7 with normalized weights 0.1, 0.3, 0.4, 0.2

| Step | Description | Calculation |
|------|-------------|-------------|
| 1 | Multiply each value by its weight | (2 × 0.1) + (4 × 0.3) + (5 × 0.4) + (7 × 0.2) |
| 2 | Sum the results | 0.2 + 1.2 + 2.0 + 1.4 = 4.8 |

R calculation:

x <- c(2, 4, 5, 7)
w_normalized <- c(0.1, 0.3, 0.4, 0.2)  # Note: these sum to 1
sum(x * w_normalized)
[1] 4.8

Pros of Weighted Means:

  • Account for varying importance of data points
  • Useful in survey analysis with different sample sizes or importance levels
  • Can adjust for unequal probabilities in sampling designs

Cons of Weighted Means:

  • Require justification for weights
  • Can be misused to manipulate results
  • May be less intuitive to interpret than simple arithmetic mean

11.7 Measures of Variability

These measures describe how spread out the data is. They are crucial for understanding the dispersion of data points around the central tendency.

Understanding Variance
Figure 11.1: Three dot plots showing increasing variance with constant mean

The three dot plots above demonstrate how variance measures the spread of data around a central value:

  • All distributions have the same mean (μ = 10), shown by the dashed line
  • Low Variance (σ² = 1): Points cluster tightly around the mean
  • Medium Variance (σ² = 4): Points show moderate spread
  • High Variance (σ² = 9): Points spread widely around the mean
Understanding Different Levels of Variability

This visualization shows three normal distributions with the same mean (μ = 10) but different levels of variability:

  1. Low Variability (σ = 0.5)
    • Data points cluster tightly around the mean
    • The density curve is tall and narrow
    • Most observations fall within ±0.5 units of the mean
  2. Medium Variability (σ = 2.0)
    • Data points spread out more from the mean
    • The density curve is lower and wider
    • Most observations fall within ±2 units of the mean
  3. High Variability (σ = 4.0)
    • Data points spread widely from the mean
    • The density curve is much flatter and wider
    • Most observations fall within ±4 units of the mean

Range

The range is the difference between the maximum and minimum values.

Formula: R = x_{max} - x_{min}

Manual Calculation Example:

Using the dataset: 2, 4, 4, 5, 5, 7, 9

| Step | Description | Calculation |
|------|-------------|-------------|
| 1 | Find the maximum value | 9 |
| 2 | Find the minimum value | 2 |
| 3 | Subtract minimum from maximum | 9 - 2 = 7 |

R calculation:

data <- c(2, 4, 4, 5, 5, 7, 9)
range(data)
[1] 2 9
max(data) - min(data)
[1] 7

Pros:

  • Simple to calculate and understand
  • Gives an immediate sense of data spread

Cons:

  • Extremely sensitive to outliers
  • Doesn’t provide information about the distribution between extremes

Interquartile Range (IQR)

The IQR is the difference between the 75th and 25th percentiles.

Formula: IQR = Q_3 - Q_1

To find quartiles manually:

  1. For odd number of values:
    • Q2 (median) is the middle value
    • Q1 is the median of the lower half (excluding the median of all observations)
    • Q3 is the median of the upper half (excluding the median of all observations)
  2. For even number of values:
    • Q2 is the average of the two middle values
    • Q1 is the median of the lower half (excluding the median of all observations)
    • Q3 is the median of the upper half (excluding the median of all observations)

Manual Calculation Example:

Using the dataset: 2, 4, 4, 5, 5, 7, 9

| Step | Description | Calculation |
|------|-------------|-------------|
| 1 | Order the data | 2, 4, 4, 5, 5, 7, 9 |
| 2 | Find Q2 (median) | 5 |
| 3 | Find Q1 (median of lower half) | 4 |
| 4 | Find Q3 (median of upper half) | 7 |
| 5 | Calculate IQR | Q3 - Q1 = 7 - 4 = 3 |

R calculation:

data <- c(2, 4, 4, 5, 5, 7, 9)
print(data)
[1] 2 4 4 5 5 7 9
quantile(data, type = 1)
  0%  25%  50%  75% 100% 
   2    4    5    7    9 
IQR(data, type = 1)
[1] 3

Pros:

  • Robust to outliers
  • Provides information about the spread of the middle 50% of the data

Cons:

  • Ignores the tails of the distribution
  • Less efficient than standard deviation for normal distributions

Variance

Variance measures the average squared deviation from the mean.

Formula: s^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n - 1}

Variance: Understanding Average Squared Deviations

What is Variance? Variance measures how “spread out” numbers are from their mean - it’s the average of squared deviations from the mean.

Formula: s^2 = \frac{\sum(x_i - \bar{x})^2}{n-1}

Simple Example: Consider the numbers 2, 4, 6, 8, 10. The mean is \bar{x} = 6.

Calculating Deviations:

| Value | Deviation from mean | Square of deviation |
|-------|---------------------|---------------------|
| 2 | -4 | 16 |
| 4 | -2 | 4 |
| 6 | 0 | 0 |
| 8 | +2 | 4 |
| 10 | +4 | 16 |

Variance = \frac{16 + 4 + 0 + 4 + 16}{4} = 10

Key Points:

  1. Mean acts as a reference line (blue dashed line)
  2. Deviations show distance from mean (red dotted lines)
  3. Squaring makes all deviations positive (blue bars)
  4. Larger deviations contribute more to variance
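A one-line R check of this example:

x <- c(2, 4, 6, 8, 10)
sum((x - mean(x))^2) / (length(x) - 1)  # 40 / 4 = 10
var(x)                                  # 10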

Manual Calculation Example:

Using the dataset: 2, 4, 4, 5, 5, 7, 9

| Step | Description | Calculation |
|------|-------------|-------------|
| 1 | Calculate the mean | \bar{x} = 5.14 |
| 2 | Subtract the mean from each value and square the result | (2 - 5.14)^2 = 9.86; (4 - 5.14)^2 = 1.30; (4 - 5.14)^2 = 1.30; (5 - 5.14)^2 = 0.02; (5 - 5.14)^2 = 0.02; (7 - 5.14)^2 = 3.46; (9 - 5.14)^2 = 14.90 |
| 3 | Sum the squared differences | 30.86 |
| 4 | Divide by (n - 1), i.e. by the number of observations minus 1 | 30.86 / 6 = 5.14 |

R calculation:

var(data)
[1] 5.142857

Pros:

  • Uses all data points
  • Foundation for many statistical tests

Cons:

  • Units are squared, making interpretation less intuitive
  • Sensitive to outliers
Bessel’s Correction: Why We Divide by (n-1) And Not by n

The Key Insight:

When we calculate deviations from the mean, they must sum to zero. This is a mathematical fact: \sum(x_i - \bar{x}) = 0

Think of it Like This:

If you have 5 numbers and their mean:

  • Once you calculate 4 deviations from the mean
  • The 5th deviation MUST be whatever makes the sum zero
  • You don’t really have 5 independent deviations
  • You only have 4 truly “free” deviations

Simple Example:

Numbers: 2, 4, 6, 8, 10

  • Mean = 6
  • Deviations: -4, -2, 0, +2, +4
  • Notice they sum to zero
  • If you know any 4 deviations, the 5th is predetermined!

This is Why:

  • When calculating variance: s^2 = \frac{\sum(x_i - \bar{x})^2}{n-1}
  • We divide by (n-1) not n
  • Because only (n-1) deviations are truly independent
  • The last one is determined by the others

Degrees of Freedom:

  • n = number of observations
  • 1 = constraint (deviations must sum to zero)
  • n-1 = degrees of freedom = number of truly independent deviations

When to Use It:

  • When calculating sample variance
  • When calculating sample standard deviation

When NOT to Use It:

  • Population calculations (when you have all data)

Remember:

  • It’s not just a statistical trick
  • Deviations from the mean must sum to zero
  • This constraint costs us one degree of freedom
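The effect of the correction is easy to see on the example above: dividing by n gives a smaller value than R's var(), which divides by n - 1:

x <- c(2, 4, 6, 8, 10)
n <- length(x)
sum((x - mean(x))^2) / n        # 8  (divisor n: biased, too small on average)
sum((x - mean(x))^2) / (n - 1)  # 10 (divisor n - 1: Bessel-corrected)
var(x)                          # 10 (R's var() uses n - 1)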

Standard Deviation

The standard deviation is the square root of the variance and measures the average dispersion of the data about their arithmetic mean. In contrast to the variance, it has the advantage of being expressed in the same units as the original measurements, making its interpretation more intuitive.

Formula: s = \sqrt{\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n - 1}}

Manual Calculation Example:

Using the dataset: 2, 4, 4, 5, 5, 7, 9

| Step | Description | Calculation |
|------|-------------|-------------|
| 1 | Calculate the variance | s^2 = 5.14 (from the previous calculation) |
| 2 | Take the square root | s = \sqrt{5.14} ≈ 2.27 |

R calculation:

sd(data)
[1] 2.267787

Pros:

  • In same units as original data
  • Widely used and understood

Cons:

  • Still sensitive to outliers
  • Assumes data is roughly “normally” distributed

Coefficient of Variation (*)

The coefficient of variation is the standard deviation divided by the mean, often expressed as a percentage.

Formula: CV = \frac{s}{\bar{x}} \times 100\%

Manual Calculation Example:

Using the dataset: 2, 4, 4, 5, 5, 7, 9

| Step | Description | Calculation |
|------|-------------|-------------|
| 1 | Calculate the mean | \bar{x} = 5.14 |
| 2 | Calculate the standard deviation | s = 2.27 |
| 3 | Divide s by the mean and multiply by 100 | (2.27 / 5.14) × 100 ≈ 44.16% |

R calculation:

(sd(data) / mean(data)) * 100
[1] 44.09586

Pros:

  • Allows comparison of variability between datasets with different units or means
  • Useful in fields like finance for risk assessment

Cons:

  • Not meaningful for data with both positive and negative values
  • Can be misleading when mean is close to zero
Limitations of Coefficient of Variation (CV)

The coefficient of variation, calculated as (σ/μ) × 100%, has two important limitations:

Not meaningful for data with both positive and negative values

  • The mean could be close to zero due to positive and negative values cancelling out
  • Example: Dataset {-5, -3, 2, 6} has mean = 0
    • CV = (std dev / 0) × 100%
    • This leads to division by zero
    • Even if mean isn’t exactly zero, the CV doesn’t represent true relative variability when data cross zero
  • The CV assumes a natural zero point and meaningful ratios between values

Misleading when mean is close to zero

  • Since CV = (σ/μ) × 100%, as μ approaches zero:
    • The denominator becomes very small
    • Results in extremely large CV values
    • These large values don’t meaningfully represent relative variability
  • Example:
    • Dataset A: {0.001, 0.002, 0.003} has mean = 0.002
    • Even small standard deviations will produce very large CVs
    • The resulting large CV might suggest extreme variability when the data are actually quite close together
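Both failure modes are easy to reproduce in R with small illustrative datasets:

# Mixed-sign data: the mean cancels to exactly 0, so the CV is undefined
x <- c(-5, -3, 2, 6)
mean(x)                # 0
sd(x) / mean(x) * 100  # Inf (division by zero)

# Mean close to zero: the CV blows up despite a tiny absolute spread
y <- c(-0.01, 0.00, 0.02, 0.05)
sd(y) / mean(y) * 100  # approx. 176%, far larger than the data's spread suggests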

Best Use Cases

CV is most useful for:

  • Strictly positive data
  • Data measured on a ratio scale
  • Data with means well above zero
  • Comparing variability between datasets with different units or scales

11.8 Measures of Relative Position (Standing)

Understanding where values sit within a dataset is crucial for data analysis. Let’s explore these concepts step by step.

Quartiles (Q): The Basics

Think of quartiles as special numbers that split your ordered data into four equal parts.

Doane, D. P., & Seward, L. W. (2016). Applied statistics in business and economics. Mcgraw-Hill.

What Are Quartiles?

First Quartile (Q1):

  • Separates the lowest 25% of data from the rest
  • Also called the 25th percentile
  • Example: If Q1 = 50 in a test score dataset, 25% of students scored below 50

Second Quartile (Q2):

  • The median - splits data in half
  • Also called the 50th percentile
  • Example: If Q2 = 70, half the students scored below 70

Third Quartile (Q3):

  • Separates the highest 25% of data from the rest
  • Also called the 75th percentile
  • Example: If Q3 = 85, 75% of students scored below 85

How to Calculate Quartiles (Step by Step) - Two Methods

Let’s examine student test scores using both common quartile calculation methods:

Example 1: Odd Number Case (11 scores)
60, 65, 70, 72, 75, 78, 80, 82, 85, 88, 90

Step 1: Find Q2 (median) - Same for both methods

  • With n = 11 values (odd)
  • Median position = (n + 1)/2 = 6
  • Q2 = 78

Step 2: Find Q1

  • Tukey’s Method:
    • Look at lower half: 60, 65, 70, 72, 75
    • Q1 = median of lower half = 70
  • Interpolation Method:
    • Position = (n + 1)/4 = (11 + 1)/4 = 3
    • Q1 = 70 (3rd value)

Step 3: Find Q3

  • Tukey’s Method:
    • Look at upper half: 80, 82, 85, 88, 90
    • Q3 = median of upper half = 85
  • Interpolation Method:
    • Position = 3(n + 1)/4 = 3(12)/4 = 9
    • Q3 = 85 (9th value)

Example 2: Even Number Case (10 scores)
60, 65, 70, 72, 75, 78, 80, 82, 85, 90

Step 1: Find Q2 (median) - Same for both methods

  • With n = 10 values (even)
  • Median positions = 5 and 6
  • Q2 = (75 + 78)/2 = 76.5

Step 2: Find Q1

  • Tukey’s Method:
    • Look at lower half: 60, 65, 70, 72, 75
    • Q1 = median of lower half = 70
  • Interpolation Method:
    • Position = (10 + 1)/4 = 2.75
    • Q1 = 65 + 0.75(70 - 65) = 68.75

Step 3: Find Q3

  • Tukey’s Method:
    • Look at upper half: 78, 80, 82, 85, 90
    • Q3 = median of upper half = 82
  • Interpolation Method:
    • Position = 3(10 + 1)/4 = 8.25
    • Q3 = 82 + 0.25(85 - 82) = 82.75

Important Notes:

  1. Tukey’s Method:

    • First find the median (Q2)
    • Split the data into lower and upper halves
    • Find Q1 as the median of the lower half
    • Find Q3 as the median of the upper half
    • When n is odd, the median is not included in either half
  2. Interpolation Method:

    • Uses positions (n+1)/4 for Q1 and 3(n+1)/4 for Q3
    • When position falls between values, uses linear interpolation
    • Doesn’t require splitting data into halves

Both methods give the same results for simple positions (Example 1) but can differ when interpolation is needed (Example 2).
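In R, the two methods correspond approximately to fivenum() (Tukey's hinges) and quantile() with type = 6 (positions (n+1)/4 and 3(n+1)/4). For the even-n example both match the hand calculations above, though hinge definitions can differ slightly from the textbook split for some sample sizes:

scores <- c(60, 65, 70, 72, 75, 78, 80, 82, 85, 90)  # Example 2, n = 10

# Tukey's method: fivenum() returns min, lower hinge, median, upper hinge, max
fivenum(scores)                            # hinges: Q1 = 70, Q3 = 82

# Interpolation method: quantile type 6 uses positions p(n + 1)
quantile(scores, c(0.25, 0.75), type = 6)  # Q1 = 68.75, Q3 = 82.75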

Manual Construction of Tukey Boxplot

Step 1: Calculate Key Components

  1. Find quartiles: Q_1, Q_2 (median), Q_3
  2. Calculate Interquartile Range: IQR = Q_3 - Q_1

Step 2: Determine Whisker Boundaries

  • Lower fence: Q_1 - 1.5 \times IQR
  • Upper fence: Q_3 + 1.5 \times IQR

Step 3: Identify Outliers Data points are outliers if they are:

  • Below lower fence: x < Q_1 - 1.5 \times IQR
  • Above upper fence: x > Q_3 + 1.5 \times IQR

Example: Given data: 2, 4, 6, 8, 9, 10, 11, 12, 14, 16, 50

  1. Find quartiles:

    • Q_1 = 6
    • Q_2 = 10
    • Q_3 = 14
  2. Calculate IQR:

    • IQR = 14 - 6 = 8
  3. Calculate fences:

    • Lower: 6 - (1.5 \times 8) = -6
    • Upper: 14 + (1.5 \times 8) = 26
  4. Identify outliers:

    • 50 > 26, therefore 50 is an outlier

Graphical Elements:

  1. Box: Draw from Q_1 to Q_3
  2. Line inside box: Draw at Q_2
  3. Whiskers: Extend to most extreme non-outlier points
  4. Points: Plot outliers individually beyond whiskers
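A sketch of these steps in R for the example data; note that quantile type 1 happens to reproduce the hand-calculated quartiles here, while other quantile types may give slightly different values:

x <- c(2, 4, 6, 8, 9, 10, 11, 12, 14, 16, 50)

q <- quantile(x, c(0.25, 0.50, 0.75), type = 1)  # 6, 10, 14
iqr <- unname(q[3] - q[1])                       # 8
lower_fence <- q[1] - 1.5 * iqr                  # -6
upper_fence <- q[3] + 1.5 * iqr                  # 26

x[x < lower_fence | x > upper_fence]             # 50 is the only outlier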

Percentiles: A More Precise Measure of Relative Standing (*)

What Are Percentiles?

Percentiles give us a more detailed view by dividing data into 100 equal parts. Unlike quartiles, percentiles use linear interpolation for more precise measurements.

Key Points:

  • The 25th percentile equals Q1
  • The 50th percentile equals Q2 (median)
  • The 75th percentile equals Q3

Calculating Percentiles

The Formula: P_k = \frac{k(n+1)}{100}

Where:

  • P_k is the position for the kth percentile
  • k is the percentile we want (1-100)
  • n is the number of observations

Example 3: Finding the 60th Percentile. Let’s use student homework scores: 72, 75, 78, 80, 82, 85, 88, 90, 92, 95

Step 1: Calculate position

  • n = 10 scores
  • For 60th percentile: P_{60} = \frac{60(10+1)}{100} = 6.6

Step 2: Find surrounding values

  • Position 6: score of 85
  • Position 7: score of 88

Step 3: Interpolate (important: percentiles use linear interpolation)

  • We need to go 0.6 of the way between 85 and 88 P_{60} = 85 + 0.6(88-85) P_{60} = 85 + 0.6(3) P_{60} = 85 + 1.8 = 86.8

What this means: 60% of students scored 86.8 or below.
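R's quantile() reproduces this result with type = 6, which uses the k(n+1)/100 positioning from the formula above:

scores <- c(72, 75, 78, 80, 82, 85, 88, 90, 92, 95)
quantile(scores, 0.60, type = 6)  # 86.8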

Percentile Ranks (PR) (*)

What is a Percentile Rank?

While percentiles tell us the value at a certain position, percentile rank tells us what percentage of values fall below a specific score. Think of it as answering the question “What percentage of the class did I score higher than?”

PR = \frac{\text{number of values below } + 0.5 \times \text{number of equal values}}{\text{total number of values}} \times 100

Example 4: Finding a Percentile Rank. Consider these exam scores:

65, 70, 70, 75, 75, 75, 80, 85, 85, 90

Let’s find the PR for a score of 75.

Step 1: Count carefully

  • Values below 75: 65, 70, 70 (3 values)
  • Values equal to 75: 75, 75, 75 (3 values)
  • Total values: 10

Step 2: Apply the formula

PR = \frac{3 + 0.5(3)}{10} \times 100 PR = \frac{3 + 1.5}{10} \times 100 PR = \frac{4.5}{10} \times 100 = 45\%

Interpretation: A score of 75 is higher than 45% of the class scores.

Remark:

Q1: “Why do we use 0.5 for equal values in PR?”

A1: This is because we’re assuming people with the same score are evenly spread across that position. It’s like saying they share the position equally.
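The PR formula is straightforward to implement; percentile_rank below is our own helper, not a built-in R function:

# Percentile rank: share of values below x, counting ties as half
percentile_rank <- function(data, x) {
  (sum(data < x) + 0.5 * sum(data == x)) / length(data) * 100
}

scores <- c(65, 70, 70, 75, 75, 75, 80, 85, 85, 90)
percentile_rank(scores, 75)  # 45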

Understanding and Interpreting Box Plots

Box plots (also known as box-and-whisker plots) are powerful visualization tools for understanding data distributions. In this section, we’ll explore how to construct and interpret box plots using height measurements from two groups.

Construction of the Tukey Box Plot

The box plot was introduced by John Tukey as part of his exploratory data analysis toolkit. It provides a standardized way of displaying the distribution of data based on a five-number summary.

The Five-Number Summary

A box plot represents five key statistical values:

  1. Minimum: The smallest value in the dataset (excluding outliers)
  2. First Quartile (Q1): The 25th percentile, below which 25% of observations fall
  3. Median (Q2): The 50th percentile, which divides the dataset into two equal halves
  4. Third Quartile (Q3): The 75th percentile, below which 75% of observations fall
  5. Maximum: The largest value in the dataset (excluding outliers)
Box Plot Components
Figure 11.2: Boxplot diagram showing its key components.

The components of a box plot include:

  1. The Box:
    • Represents the interquartile range (IQR), containing the middle 50% of the data
    • Lower edge represents Q1
    • Upper edge represents Q3
    • Line inside the box represents the median (Q2)
  2. The Whiskers:
    • Extend from the box to show the range of non-outlier data
    • In a Tukey box plot, whiskers extend up to 1.5 × IQR from the box edges:
      • Lower whisker: extends to the minimum value ≥ (Q1 - 1.5 × IQR)
      • Upper whisker: extends to the maximum value ≤ (Q3 + 1.5 × IQR)
  3. Outliers:
    • Points that fall beyond the whiskers
    • Individually plotted as dots or symbols
    • Values that are < (Q1 - 1.5 × IQR) or > (Q3 + 1.5 × IQR)
Key Features to Observe

When interpreting box plots, look for these characteristics:

  1. Central Tendency: Location of the median line within the box
  2. Dispersion: Width of the box (IQR) and length of the whiskers
  3. Skewness:
    • Symmetrical data: median is approximately in the middle of the box, whiskers are roughly equal in length
    • Right (positive) skew: median is closer to the bottom of the box, upper whisker is longer
    • Left (negative) skew: median is closer to the top of the box, lower whisker is longer
  4. Outliers: Presence of individual points beyond the whiskers

Case Study: Comparing Heights Between Groups

Let’s apply our understanding of box plots to a real dataset. We have height measurements (in centimeters) from two groups of 25 students each.

# Create the dataset
data_height <- data.frame(
  group_1 = c(150, 160, 165, 168, 172, 173, 175, 176, 177, 178, 179, 180, 180, 181, 181, 182, 182, 183, 183, 184, 186, 188, 190, 191, 200),
  group_2 = c(138, 140, 148, 152, 164, 164, 165, 165, 166, 166, 170, 175, 175, 175, 182, 182, 182, 182, 182, 182, 183, 183, 183, 188, 210)
)

# Transform dataset from wide to long format
data_height_l <- gather(data = data_height, key = "Group_number", value = "height", group_1:group_2)

# Display the first few rows
head(data_height_l)
  Group_number height
1      group_1    150
2      group_1    160
3      group_1    165
4      group_1    168
5      group_1    172
6      group_1    173

Let’s calculate some summary statistics for each group:

# Calculate summary statistics for each group
group1_stats <- summary(data_height$group_1)
group2_stats <- summary(data_height$group_2)

# Calculate IQR
group1_iqr <- IQR(data_height$group_1)
group2_iqr <- IQR(data_height$group_2)

# Create a comparison table
stats_table <- rbind(
  group1_stats,
  group2_stats
)
rownames(stats_table) <- c("Group 1", "Group 2")

# Display the table
stats_table
        Min. 1st Qu. Median Mean 3rd Qu. Max.
Group 1  150     175    180  179     183  200
Group 2  138     165    175  172     182  210
# Display IQR values
cat("IQR for Group 1:", group1_iqr, "\n")
IQR for Group 1: 8 
cat("IQR for Group 2:", group2_iqr, "\n")
IQR for Group 2: 17 

Visualizing the Height Data

Now, let’s visualize the data using box plots and density plots:

# Create horizontal boxplots
ggplot(data = data_height_l) + 
  geom_boxplot(aes(x = Group_number, y = height, colour = Group_number), notch = FALSE) + 
  coord_flip() + 
  scale_y_continuous(breaks = seq(130, 210, 5)) + 
  theme_pubr() + 
  grids(linetype = "dashed") +
  labs(title = "Height Distribution by Group",
       x = "Group",
       y = "Height (cm)")
Figure 11.3: Box plots comparing height distributions between groups.

To complement our box plots, let’s also look at the density distributions:

# Create density plots
ggplot(data = data_height_l) + 
  geom_density(aes(x = height, fill = Group_number), alpha = 0.5) + 
  facet_grid(~ Group_number) + 
  scale_x_continuous(breaks = seq(130, 210, 10)) +
  labs(title = "Height Density by Group",
       x = "Height (cm)",
       y = "Density")
Figure 11.4: Density plots showing the height distributions for each group.

Box Plot Interpretation Exercise

Based on the box plots and density plots above, determine whether each of the following statements is True or False. For each statement, provide a brief explanation based on evidence from the visualizations.

Exercise Questions
  1. Students from group 2 (G2) in the studied sample are, on average, taller than those from group 1 (G1).

  2. Group 1 (G1) height measurements are more dispersed/spread out than group 2 (G2).

  3. The lowest person is in group 2 (G2).

  4. Both data sets are negatively (left) skewed.

  5. Half of the students in group 2 (G2) measure at least 175 cm.

Hints for Interpretation

When answering these questions, consider:

  • The position of the median line within each box
  • The relative sizes of the boxes (IQR)
  • The positions of the minimum and maximum values
  • The symmetry of the distributions (balanced or skewed)
  • The lengths of the whiskers

For each statement, determine whether it is True or False and provide your explanation:

  1. Students from G2 are, on average, taller than G1: [True/False]
    • Explanation:
  2. G1 height is more dispersed/spread out: [True/False]
    • Explanation:
  3. The lowest person is in G2: [True/False]
    • Explanation:
  4. Both data sets are negatively (left) skewed: [True/False]
    • Explanation:
  5. Half of G2 measure at least 175 cm: [True/False]
    • Explanation:

Let’s review the answers to our box plot interpretation questions:

  1. Students from G2 are, on average, taller than G1: False
    • Explanation: The median height (middle line in the boxplot) for G1 is higher than G2.
  2. G1 height is more dispersed/spread out: False
    • Explanation: G2 shows greater dispersion. This is visible in the boxplot where G2 has a larger interquartile range (IQR) of 17 cm compared to G1’s 8 cm (the values computed above). G2 also has a wider range from minimum to maximum values.
  3. The lowest person is in G2: True
    • Explanation: The minimum value in G2 is 138 cm, which is lower than the minimum value in G1 (150 cm).
  4. Both data sets are negatively (left) skewed: True
    • Explanation: In both groups, the median line is positioned toward the upper part of the box, and the lower whisker is longer than the upper whisker. This indicates that there’s a longer tail on the left side of the distribution, which means negative skewness.
  5. Half of G2 measure at least 175 cm: True
    • Explanation: The median (middle line in the boxplot) for G2 is 175 cm, which means that 50% of the values are greater than or equal to 175 cm.

R Code Reference

Here’s the complete R code used in this section:

# Load required packages
library(tidyr)
library(ggplot2)
library(ggpubr)

# Set display options
options(scipen = 999, digits = 3)

# Create the dataset
data_height <- data.frame(
  group_1 = c(150, 160, 165, 168, 172, 173, 175, 176, 177, 178, 179, 180, 180, 181, 181, 182, 182, 183, 183, 184, 186, 188, 190, 191, 200),
  group_2 = c(138, 140, 148, 152, 164, 164, 165, 165, 166, 166, 170, 175, 175, 175, 182, 182, 182, 182, 182, 182, 183, 183, 183, 188, 210)
)

# Transform dataset from wide to long format
data_height_l <- gather(data = data_height, key = "Group_number", value = "height", group_1:group_2)

# Display the first few rows
head(data_height_l)

# Calculate summary statistics for each group
group1_stats <- summary(data_height$group_1)
group2_stats <- summary(data_height$group_2)

# Calculate IQR
group1_iqr <- IQR(data_height$group_1)
group2_iqr <- IQR(data_height$group_2)

# Create horizontal boxplots
ggplot(data = data_height_l) + 
  geom_boxplot(aes(x = Group_number, y = height, colour = Group_number), notch = FALSE) + 
  coord_flip() + 
  scale_y_continuous(breaks = seq(130, 210, 5)) + 
  theme_pubr() + 
  grids(linetype = "dashed") +
  labs(title = "Height Distribution by Group",
       x = "Group",
       y = "Height (cm)")

# Create density plots
ggplot(data = data_height_l) + 
  geom_density(aes(x = height, fill = Group_number), alpha = 0.5) + 
  facet_grid(~ Group_number) + 
  scale_x_continuous(breaks = seq(130, 210, 10)) +
  labs(title = "Height Density by Group",
       x = "Height (cm)",
       y = "Density")

11.9 Shape Measures

Skewness

Definition

Skewness quantifies the asymmetry of a data distribution. It indicates whether data tends to cluster more on one side of the mean than the other.

Mathematical Expression

SK = \frac{n}{(n-1)(n-2)} \sum_{i=1}^n \left(\frac{x_i - \bar{x}}{s}\right)^3

where:

  • n is the sample size
  • x_i is the i-th observation
  • \bar{x} is the sample mean
  • s is the sample standard deviation

Simplified Numerical Example

library(moments)

Attaching package: 'moments'
The following object is masked from 'package:modeest':

    skewness
library(ggplot2)
library(tidyverse)
library(gridExtra)

Attaching package: 'gridExtra'
The following object is masked from 'package:dplyr':

    combine
# Three example datasets with different types of skewness
# 1. Positive skewness (right tail)
positive_skew_data <- c(2, 3, 4, 4, 5, 5, 5, 6, 6, 7, 8, 12, 15, 20)
# 2. Negative skewness (left tail)
negative_skew_data <- c(1, 5, 10, 13, 14, 15, 16, 16, 17, 17, 18, 18, 19, 20)
# 3. Near-zero skewness (symmetry)
symmetric_data <- c(1, 3, 5, 7, 9, 10, 11, 12, 13, 15, 17, 19, 21)

# Calculating skewness
positive_skewness <- skewness(positive_skew_data)
negative_skewness <- skewness(negative_skew_data)
symmetric_skewness <- skewness(symmetric_data)

# Summary of results
skewness_data <- data.frame(
  "Distribution Type" = c("Positive skewness", "Negative skewness", "Symmetric distribution"),
  "Skewness value" = round(c(positive_skewness, negative_skewness, symmetric_skewness), 3),
  "Interpretation" = c(
    "Longer right tail (majority of data on the left side)",
    "Longer left tail (majority of data on the right side)",
    "Data distributed symmetrically"
  )
)

# Display table
skewness_data
       Distribution.Type Skewness.value
1      Positive skewness           1.42
2      Negative skewness          -1.33
3 Symmetric distribution           0.00
                                         Interpretation
1 Longer right tail (majority of data on the left side)
2 Longer left tail (majority of data on the right side)
3                        Data distributed symmetrically

Visualizations of Skewness Types

# Create a data frame for all sets
df_skewness <- rbind(
  data.frame(value = positive_skew_data, type = "Positive skewness", 
             skewness = round(positive_skewness, 2)),
  data.frame(value = negative_skew_data, type = "Negative skewness", 
             skewness = round(negative_skewness, 2)),
  data.frame(value = symmetric_data, type = "Symmetric distribution", 
             skewness = round(symmetric_skewness, 2))
)

# Histograms for three types of skewness
p1 <- ggplot(df_skewness, aes(x = value)) +
  geom_histogram(bins = 10, fill = "skyblue", color = "darkblue", alpha = 0.7) +
  facet_wrap(~type, scales = "free_x") +
  geom_vline(data = df_skewness %>% group_by(type) %>% summarise(mean = mean(value)),
            aes(xintercept = mean), color = "red", linetype = "dashed") +
  geom_vline(data = df_skewness %>% group_by(type) %>% summarise(median = median(value)),
            aes(xintercept = median), color = "darkgreen", linetype = "dashed") +
  geom_text(data = unique(df_skewness[, c("type", "skewness")]),
           aes(x = Inf, y = Inf, label = paste("SK =", skewness)),
           hjust = 1.1, vjust = 1.5, size = 3.5) +
  labs(
    title = "Histograms showing different types of skewness",
    subtitle = "Red line: mean, Green line: median",
    x = "Value",
    y = "Frequency"
  ) +
  theme_minimal()

# Box plots
p2 <- ggplot(df_skewness, aes(x = type, y = value, fill = type)) +
  geom_boxplot() +
  scale_fill_manual(values = c("skyblue", "lightgreen", "lightsalmon")) +
  labs(
    title = "Box plots for different types of skewness",
    x = "Distribution type",
    y = "Value"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

# Display plots
grid.arrange(p1, p2, nrow = 2)

Example: Voter Turnout Analysis

# Generate three datasets reflecting different types of skewness
set.seed(123)

# 1. Positive skewness - typical for turnout in regions with low engagement
positive_turnout <- c(
  runif(50, min = 20, max = 30),  # Small group with low turnout
  rbeta(200, shape1 = 2, shape2 = 5) * 50 + 30  # Majority of results shifted to the left
)

# 2. Negative skewness - typical for regions with high political engagement
negative_turnout <- c(
  rbeta(200, shape1 = 5, shape2 = 2) * 30 + 50,  # Majority of results shifted to the right
  runif(50, min = 40, max = 50)  # Small group with lower turnout
)

# 3. Symmetric distribution - typical for regions with uniform engagement
symmetric_turnout <- rnorm(250, mean = 65, sd = 8)

# Create data frame
df_turnout <- rbind(
  data.frame(turnout = positive_turnout, region = "Region A: Positive skewness"),
  data.frame(turnout = negative_turnout, region = "Region B: Negative skewness"),
  data.frame(turnout = symmetric_turnout, region = "Region C: Symmetric distribution")
)

# Calculate skewness for each region
region_skewness <- df_turnout %>%
  group_by(region) %>%
  summarise(skewness = round(skewness(turnout), 2))

# Histogram of turnout by region
p3 <- ggplot(df_turnout, aes(x = turnout)) +
  geom_histogram(bins = 20, fill = "skyblue", color = "darkblue", alpha = 0.7) +
  facet_wrap(~region, ncol = 1) +
  geom_vline(data = df_turnout %>% group_by(region) %>% summarise(mean = mean(turnout)),
            aes(xintercept = mean), color = "red", linetype = "dashed") +
  geom_vline(data = df_turnout %>% group_by(region) %>% summarise(median = median(turnout)),
            aes(xintercept = median), color = "darkgreen", linetype = "dashed") +
  geom_text(data = region_skewness,
           aes(x = 25, y = 20, label = paste("SK =", skewness)),
           size = 3.5) +
  labs(
    title = "Voter turnout in different regions",
    subtitle = "Showing three types of skewness",
    x = "Voter turnout (%)",
    y = "Number of districts"
  ) +
  theme_minimal()

# Box plot
p4 <- ggplot(df_turnout, aes(x = region, y = turnout, fill = region)) +
  geom_boxplot() +
  labs(
    title = "Comparison of turnout distributions across regions",
    x = "Region",
    y = "Voter turnout (%)"
  ) +
  theme_minimal() +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))

# Display plots
grid.arrange(p3, p4, ncol = 2, widths = c(2, 1))

Interpretation Guide

  • Positive Skewness (> 0): Distribution has a longer right tail - most values are concentrated on the left side
  • Negative Skewness (< 0): Distribution has a longer left tail - most values are concentrated on the right side
  • Zero Skewness: Distribution is approximately symmetric - values are evenly distributed around the mean

Kurtosis

Definition

Kurtosis measures the “tailedness” of a distribution, indicating the presence of extreme values compared to a normal distribution.

Mathematical Expression

K = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^n \left(\frac{x_i - \bar{x}}{s}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}

Note that this formula gives the sample excess kurtosis, which is approximately 0 for a normal distribution. The kurtosis() function from the moments package used in the examples below instead returns raw (Pearson) kurtosis, for which the normal reference value is 3; this is why the interpretation labels use K > 3, K < 3, and K ≈ 3.

Simplified Numerical Example

# Three example datasets with different levels of kurtosis
# 1. Leptokurtic distribution (high kurtosis, "heavy tails")
leptokurtic_data <- c(
  rnorm(80, mean = 50, sd = 5),  # Most data clustered around the mean
  c(20, 25, 30, 70, 75, 80)      # A few extreme values
)

# 2. Platykurtic distribution (low kurtosis, "flat")
platykurtic_data <- c(
  runif(50, min = 30, max = 70)  # Uniform distribution of values
)

# 3. Mesokurtic distribution (normal kurtosis)
mesokurtic_data <- rnorm(50, mean = 50, sd = 10)

# Calculate kurtosis
kurtosis_lepto <- kurtosis(leptokurtic_data)
kurtosis_platy <- kurtosis(platykurtic_data)
kurtosis_meso <- kurtosis(mesokurtic_data)

# Summary of results
kurtosis_data <- data.frame(
  "Distribution Type" = c("Leptokurtic", "Platykurtic", "Mesokurtic"),
  "Kurtosis value" = round(c(kurtosis_lepto, kurtosis_platy, kurtosis_meso), 3),
  "Interpretation" = c(
    "Many values near the mean, but also more extreme values",
    "Values more uniformly distributed - flat distribution",
    "Similar to normal distribution"
  )
)

# Display table
kurtosis_data
  Distribution.Type Kurtosis.value
1       Leptokurtic           7.39
2       Platykurtic           1.85
3        Mesokurtic           2.25
                                           Interpretation
1 Many values near the mean, but also more extreme values
2   Values more uniformly distributed - flat distribution
3                          Similar to normal distribution

Visualizations of Kurtosis Levels

# Create a data frame for all sets
df_kurtosis <- rbind(
  data.frame(value = leptokurtic_data, type = "Leptokurtic (K > 3)", 
             kurtosis = round(kurtosis_lepto, 2)),
  data.frame(value = platykurtic_data, type = "Platykurtic (K < 3)", 
             kurtosis = round(kurtosis_platy, 2)),
  data.frame(value = mesokurtic_data, type = "Mesokurtic (K ≈ 3)", 
             kurtosis = round(kurtosis_meso, 2))
)

# Histograms for three types of kurtosis
p5 <- ggplot(df_kurtosis, aes(x = value)) +
  geom_histogram(bins = 15, fill = "lightgreen", color = "darkgreen", alpha = 0.7) +
  facet_wrap(~type, scales = "free_y") +
  geom_text(data = unique(df_kurtosis[, c("type", "kurtosis")]),
           aes(x = Inf, y = Inf, label = paste("K =", kurtosis)),
           hjust = 1.1, vjust = 1.5, size = 3.5) +
  labs(
    title = "Histograms showing different levels of kurtosis",
    x = "Value",
    y = "Frequency"
  ) +
  theme_minimal()

# Box plots
p6 <- ggplot(df_kurtosis, aes(x = type, y = value, fill = type)) +
  geom_boxplot() +
  scale_fill_manual(values = c("lightgreen", "lightsalmon", "skyblue")) +
  labs(
    title = "Box plots for different levels of kurtosis",
    x = "Distribution type",
    y = "Value"
  ) +
  theme_minimal() +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))

# Display plots
grid.arrange(p5, p6, nrow = 2)

Example: Parliamentary Voting Analysis

# Generate three datasets reflecting different levels of kurtosis
set.seed(456)

# 1. Leptokurtic distribution - typical for votes with strong party discipline
lepto_voting <- c(
  rnorm(150, mean = 75, sd = 3),  # Most votes with high agreement
  c(20, 25, 30, 35, 40, 95, 96, 97, 98, 99)  # A few outlier votes
)

# 2. Platykurtic distribution - typical for controversial votes
platy_voting <- c(
  runif(80, min = 40, max = 60),  # Votes with moderate agreement
  runif(80, min = 60, max = 80)   # Votes with higher agreement
)

# 3. Mesokurtic distribution - typical for normal votes
meso_voting <- rnorm(160, mean = 65, sd = 10)

# Create data frame
df_voting <- rbind(
  data.frame(agreement = lepto_voting, bill_type = "Bills A: Leptokurtic"),
  data.frame(agreement = platy_voting, bill_type = "Bills B: Platykurtic"),
  data.frame(agreement = meso_voting, bill_type = "Bills C: Mesokurtic")
)

# Calculate kurtosis for each bill type
bill_kurtosis <- df_voting %>%
  group_by(bill_type) %>%
  summarise(kurtosis = round(kurtosis(agreement), 2))

# Histogram of voting agreement
p7 <- ggplot(df_voting, aes(x = agreement)) +
  geom_histogram(bins = 20, fill = "lightgreen", color = "darkgreen", alpha = 0.7) +
  facet_wrap(~bill_type, ncol = 1) +
  geom_text(data = bill_kurtosis,
           aes(x = Inf, y = Inf, label = paste("K =", kurtosis)),
           hjust = 1.1, vjust = 1.5, size = 3.5) +
  labs(
    title = "Voting agreement for different types of bills",
    subtitle = "Showing three levels of kurtosis",
    x = "Voting agreement index (%)",
    y = "Number of votes"
  ) +
  theme_minimal()

# Box plot
p8 <- ggplot(df_voting, aes(x = bill_type, y = agreement, fill = bill_type)) +
  geom_boxplot() +
  labs(
    title = "Comparison of voting agreement distributions",
    x = "Bill type",
    y = "Voting agreement index (%)"
  ) +
  theme_minimal() +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))

# Display plots
grid.arrange(p7, p8, ncol = 2, widths = c(2, 1))

Interpretation Guide

  • Leptokurtic (K > 3): “Slender” distribution with heavy tails - more extreme values than in a normal distribution
  • Platykurtic (K < 3): “Flat” distribution - fewer extreme values than in a normal distribution
  • Mesokurtic (K ≈ 3): Distribution similar to normal in terms of extreme values

11.10 Exercise 1. Center and dispersion of data

Data

We have salary data (in thousands of euros) from two small European companies:

| Index | Company X | Company Y |
|-------|-----------|-----------|
| 1 | 2 | 3 |
| 2 | 2 | 3 |
| 3 | 2 | 4 |
| 4 | 3 | 4 |
| 5 | 3 | 4 |
| 6 | 3 | 4 |
| 7 | 3 | 4 |
| 8 | 3 | 4 |
| 9 | 3 | 5 |
| 10 | 4 | 5 |
| 11 | 4 | 5 |
| 12 | 4 | 5 |
| 13 | 4 | 5 |
| 14 | 4 | 5 |
| 15 | 5 | 6 |
| 16 | 5 | 6 |
| 17 | 5 | 6 |
| 18 | 5 | 7 |
| 19 | 20 | 7 |
| 20 | 35 | 8 |

This table presents the data for both Company X and Company Y side by side, with an index column for easy reference.

Measures of Central Tendency

Mean

The mean is the average of all values in a dataset.

Formula: \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}

This formula can also be written as:

\bar{x} = \frac{\sum_{i=1}^{k} x_i f_i}{n}

where f_i is the absolute frequency (number of occurrences, absolute weight) of the i-th value, and k is the number of distinct values of the variable.

Using relative frequencies:

\bar{x} = \sum_{i=1}^{k} x_i p_i

where p_i is the relative frequency (fraction, normalized weight) of the i-th value, and k is the number of distinct values of the variable.

Manual Calculation for Company X
| Value (x_i) | Frequency (f_i) | x_i \cdot f_i |
|-------------|-----------------|---------------|
| 2 | 3 | 6 |
| 3 | 6 | 18 |
| 4 | 5 | 20 |
| 5 | 4 | 20 |
| 20 | 1 | 20 |
| 35 | 1 | 35 |
| Total | n = 20 | Sum = 119 |

\bar{x} = \frac{119}{20} = 5.95

Manual Calculation for Company Y
| Value (y_i) | Frequency (f_i) | y_i \cdot f_i |
|-------------|-----------------|---------------|
| 3 | 2 | 6 |
| 4 | 6 | 24 |
| 5 | 6 | 30 |
| 6 | 3 | 18 |
| 7 | 2 | 14 |
| 8 | 1 | 8 |
| Total | n = 20 | Sum = 100 |

\bar{y} = \frac{100}{20} = 5

R Verification
X <- c(2,2,2,3,3,3,3,3,3,4,4,4,4,4,5,5,5,5,20,35)
Y <- c(3,3,4,4,4,4,4,4,5,5,5,5,5,5,6,6,6,7,7,8)

mean(X)
[1] 5.95
mean(Y)
[1] 5

Median

The median is the middle value when the data is ordered.

Manual Calculation for Company X

Ordered data: [2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 20, 35]

n = 20 (even), so we take the average of the 10th and 11th values:

Median = \frac{4 + 4}{2} = 4

Manual Calculation for Company Y

Ordered data: [3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8]

n = 20 (even), so we take the average of the 10th and 11th values:

Median = \frac{5 + 5}{2} = 5

R Verification
median(X)
[1] 4
median(Y)
[1] 5

Mode

The mode is the most frequent value in the dataset.

For Company X, the mode is 3 (appears 6 times). For Company Y, there are two modes: 4 and 5 (both appear 6 times).

# Function to calculate the mode (returns only the first mode found)
get_mode <- function(x) {
  unique_x <- unique(x)
  unique_x[which.max(tabulate(match(x, unique_x)))]
}

get_mode(X)
[1] 3
get_mode(Y)
[1] 4
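Note that get_mode() returns only the first mode it finds, which is why it reports 4 (but not also 5) for Company Y. A sketch of a variant that returns every mode (get_modes is our own helper, not a base R function):

# Return all values that attain the maximum frequency
get_modes <- function(x) {
  tab <- table(x)
  as.numeric(names(tab)[tab == max(tab)])
}

get_modes(X)  # 3
get_modes(Y)  # 4 5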

Measures of Dispersion

Variance

The variance measures the average squared deviation from the mean.

Formula: s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}

Bessel's correction is applied when computing the variance from a sample in order to obtain an unbiased estimator of the population variance: in the standard sample variance formula we divide by (n-1) rather than by n.

Modified formulas for grouped data (a frequency series):

The formula can also be written as:

s^2 = \frac{1}{n-1} \sum_{i=1}^{k} f_i(x_i - \bar{x})^2

where f_i is the absolute frequency (number of occurrences) of the ith value.

When the calculation uses relative frequencies p_i = f_i/n, where:

  • f_i is the frequency (number of occurrences)
  • n is the total sample size

the variance formula with Bessel's correction takes the form:

s^2 = \frac{n}{n-1} \sum_{i=1}^{k} p_i(x_i - \bar{x})^2

where:

  • s^2 is the sample variance
  • n is the sample size
  • p_i is the relative frequency of the ith value
  • x_i is the ith value of the variable
  • \bar{x} is the arithmetic mean
  • k is the number of distinct values of the variable

The key point is that with relative frequencies the whole sum is multiplied by the factor \frac{n}{n-1}, which applies Bessel's correction.
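A quick R check of the relative-frequency form against var(), using Company X's frequency table from the next subsection:

# Company X as a frequency table
x_vals <- c(2, 3, 4, 5, 20, 35)
f <- c(3, 6, 5, 4, 1, 1)
n <- sum(f)                               # 20
p <- f / n
xbar <- sum(x_vals * p)                   # 5.95

n / (n - 1) * sum(p * (x_vals - xbar)^2)  # 61.21, with Bessel's correction
var(rep(x_vals, f))                       # same value from the raw data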

Manual Calculation for Company X
x_i f_i x_i - \bar{x} (x_i - \bar{x})^2 f_i(x_i - \bar{x})^2
2 3 -3.95 15.6025 46.8075
3 6 -2.95 8.7025 52.215
4 5 -1.95 3.8025 19.0125
5 4 -0.95 0.9025 3.61
20 1 14.05 197.4025 197.4025
35 1 29.05 843.9025 843.9025
Total 20 1162.95

s^2 = \frac{1162.95}{19} = 61.21

Manual Calculation for Company Y
y_i f_i y_i - \bar{y} (y_i - \bar{y})^2 f_i(y_i - \bar{y})^2
3 2 -2 4 8
4 6 -1 1 6
5 6 0 0 0
6 3 1 1 3
7 2 2 4 8
8 1 3 9 9
Total 20 34

s^2 = \frac{34}{19} = 1.79

R Verification
var(X)
[1] 61.2
var(Y)
[1] 1.79

Standard Deviation

The standard deviation is the square root of the variance.

Formula: s = \sqrt{s^2}

  • For Company X: s = \sqrt{61.21} = 7.82
  • For Company Y: s = \sqrt{1.79} = 1.34
R Verification
sd(X)
[1] 7.82
sd(Y)
[1] 1.34

Quartiles

Quartiles divide the dataset into four equal parts.

Manual Calculation for Company X

Ordered data: [2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 20, 35]

  • Q1 (25th percentile): median of first 10 numbers = 3
  • Q2 (50th percentile, median): 4
  • Q3 (75th percentile): median of last 10 numbers = 5

Manual Calculation for Company Y

Ordered data: [3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8]

  • Q1 (25th percentile): median of first 10 numbers = 4
  • Q2 (50th percentile, median): 5
  • Q3 (75th percentile): median of last 10 numbers = 6

R Verification

quantile(X)
  0%  25%  50%  75% 100% 
   2    3    4    5   35 
quantile(Y)
  0%  25%  50%  75% 100% 
   3    4    5    6    8 

IQR

  • IQR_x = 5 - 3 = 2
  • IQR_y = 6 - 4 = 2
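R's IQR() function (which uses quantile()'s default method) returns the same values:

IQR(X)  # 2
IQR(Y)  # 2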

Tukey Box Plot

A Tukey box plot visually represents the distribution of data based on quartiles. We’ll use ggplot2 to create the plot.

library(ggplot2)
library(tidyr)

# Prepare the data
data <- data.frame(
  Company = rep(c("X", "Y"), each = 20),
  Salary = c(X, Y)
)

# Create the box plot
ggplot(data, aes(x = Company, y = Salary, fill = Company)) +
  geom_boxplot() +
  labs(title = "Salary Distribution in Companies X and Y",
       x = "Company",
       y = "Salary (thousands of euros)") +
  theme_minimal() +
  scale_fill_manual(values = c("X" = "#69b3a2", "Y" = "#404080"))

# Same plot with outliers hidden (outliers = FALSE requires ggplot2 >= 3.5.0)
ggplot(data, aes(x = Company, y = Salary, fill = Company)) +
  geom_boxplot(outliers = FALSE) +
  labs(title = "Salary Distribution in Companies X and Y",
       x = "Company",
       y = "Salary (thousands of euros)") +
  theme_minimal() +
  scale_fill_manual(values = c("X" = "#69b3a2", "Y" = "#404080"))

Interpreting the Box Plot

  1. The box represents the interquartile range (IQR) from Q1 to Q3.
  2. The line inside the box is the median (Q2).
  3. Whiskers extend to the most extreme data points within 1.5 × IQR of the box edges (Q1 and Q3).
  4. Points beyond the whiskers are considered outliers; see the fence calculation below.
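A minimal sketch of this rule for Company X, using base R (the fences follow from Q1 = 3 and Q3 = 5 computed earlier):

# Tukey's fences for Company X
q1 <- unname(quantile(X, 0.25))                      # 3
q3 <- unname(quantile(X, 0.75))                      # 5
iqr <- q3 - q1                                       # 2
c(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)    # 0 and 8
X[X < q1 - 1.5 * iqr | X > q3 + 1.5 * iqr]           # 20 and 35 are flagged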

Comparison of Results

Measure Company X Company Y
Mean 5.95 5.00
Median 4 5
Mode 3 4 and 5
Variance 61.21 1.79
Standard Deviation 7.82 1.34
Q1 3 4
Q3 5 6

Key Observations:

  1. Central Tendency: Company X has a higher mean but lower median than Company Y, indicating a right-skewed distribution for Company X.
  2. Dispersion: Company X shows much higher variance and standard deviation, suggesting greater salary disparities.
  3. Distribution Shape: Company Y’s salaries are more tightly clustered, while Company X has extreme values (potential outliers) that significantly affect its mean and variance.
  4. Quartiles: The two companies have the same interquartile range (IQR = 2), but Company X’s overall range is far larger than Company Y’s.

11.11 Exercise 2. Comparing Electoral District Size Variation Between Countries

Data

We have electoral district size data from two countries:

x <- c(1, 3, 5, 7, 9, 11, 13, 15, 17, 19)  # Country high variance
y <- c(8, 9, 9, 10, 10, 11, 11, 12, 12, 13)  # Country low variance

kable(data.frame(
  "Country X (High var.)" = x,
  "Country Y (Low var.)" = y,
  check.names = FALSE
))
Country X (High var.) Country Y (Low var.)
1 8
3 9
5 9
7 10
9 10
11 11
13 11
15 12
17 12
19 13

Measures of Central Tendency

Arithmetic Mean

Formula: \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}

Calculations for Country X
Element Value
1 1
2 3
3 5
4 7
5 9
6 11
7 13
8 15
9 17
10 19
Sum 100

\bar{x} = \frac{100}{10} = 10

mean_x <- mean(x)
c("Manual" = 10, "R" = mean_x)
Manual      R 
    10     10 
Calculations for Country Y
Element Value
1 8
2 9
3 9
4 10
5 10
6 11
7 11
8 12
9 12
10 13
Sum 105

\bar{y} = \frac{105}{10} = 10.5

mean_y <- mean(y)
c("Manual" = 10.5, "R" = mean_y)
Manual      R 
  10.5   10.5 

Median

The median is the middle value in an ordered dataset.

Calculations for Country X

Ordered data: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19

For n = 10 (even number of observations), the middle positions are 5 and 6, with middle values 9 and 11:

Median = \frac{9 + 11}{2} = 10

median_x <- median(x)
c("Manual" = 10, "R" = median_x)
Manual      R 
    10     10 
Calculations for Country Y

Ordered data: 8, 9, 9, 10, 10, 11, 11, 12, 12, 13

For n = 10 (even number of observations), the middle positions are 5 and 6, with middle values 10 and 11:

Median = \frac{10 + 11}{2} = 10.5

median_y <- median(y)
c("Manual" = 10.5, "R" = median_y)
Manual      R 
  10.5   10.5 

Mode

Calculations for Country X
Value Frequency
1 1
3 1
5 1
7 1
9 1
11 1
13 1
15 1
17 1
19 1

Conclusion: No mode (all values occur once)

Calculations for Country Y
Value Frequency
8 1
9 2
10 2
11 2
12 2
13 1

Conclusion: Four modes: 9, 10, 11, 12 (each occurs twice)

# Frequency tables
table_x <- table(x)
table_y <- table(y)

list(
  "Country X" = table_x,
  "Country Y" = table_y
)
$`Country X`
x
 1  3  5  7  9 11 13 15 17 19 
 1  1  1  1  1  1  1  1  1  1 

$`Country Y`
y
 8  9 10 11 12 13 
 1  2  2  2  2  1 

Variance

Variance measures the average squared deviation from the mean.

Formula: s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}

Calculations for Country X
x_i (x_i - \bar{x}) (x_i - \bar{x})^2
1 -9 81
3 -7 49
5 -5 25
7 -3 9
9 -1 1
11 1 1
13 3 9
15 5 25
17 7 49
19 9 81
Sum 330

s^2_X = \frac{330}{9} = 36.67

var_x <- var(x)
c("Manual" = 36.67, "R" = var_x)
Manual      R 
 36.67  36.67 
Calculations for Country Y
y_i (y_i - \bar{y}) (y_i - \bar{y})^2
8 -2.5 6.25
9 -1.5 2.25
9 -1.5 2.25
10 -0.5 0.25
10 -0.5 0.25
11 0.5 0.25
11 0.5 0.25
12 1.5 2.25
12 1.5 2.25
13 2.5 6.25
Sum 22.5

s^2_Y = \frac{22.5}{9} = 2.5

var_y <- var(y)
c("Manual" = 2.5, "R" = var_y)
Manual      R 
   2.5    2.5 

Standard Deviation

Standard deviation is the square root of variance. It measures variability in the same units as the data.

Formula: s = \sqrt{s^2}

Calculations for Country X

Using previously calculated variance: s^2_X = 36.67

Calculate square root: s_X = \sqrt{36.67} \approx 6.06

Step Calculation Result
1. Variance s^2_X 36.67
2. Square root \sqrt{36.67} 6.06
sd_x <- sd(x)
c("Manual" = 6.06, "R" = sd_x)
Manual      R 
 6.060  6.055 
Calculations for Country Y

Using previously calculated variance: s^2_Y = 2.5

Calculate square root: s_Y = \sqrt{2.5} \approx 1.58

Step Calculation Result
1. Variance s^2_Y 2.5
2. Square root \sqrt{2.5} 1.58
sd_y <- sd(y)
c("Manual" = 1.58, "R" = sd_y)
Manual      R 
 1.580  1.581 

Interpretation:

  • Country X: A typical deviation from the mean is about 6 seats
  • Country Y: A typical deviation from the mean is about 1.6 seats

Coefficient of Variation (CV)

The coefficient of variation is the ratio of standard deviation to mean, expressed as a percentage.

Formula: CV = \frac{s}{\bar{x}} \times 100\%
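Base R has no built-in function for the CV, so a one-line helper is handy (cv() below is our own definition, not part of base R):

# Coefficient of variation in percent (helper defined here)
cv <- function(v) sd(v) / mean(v) * 100
cv(x)  # about 60.6
cv(y)  # about 15.1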

Calculations for Country X

CV_X = \frac{6.06}{10} \times 100\% = 60.6\%

Component Value
Standard deviation (s) 6.06
Mean (\bar{x}) 10
CV 60.6%
cv_x <- sd(x) / mean(x) * 100
c("Manual" = 60.6, "R" = cv_x)
Manual      R 
 60.60  60.55 

Calculations for Country Y

CV_Y = \frac{1.58}{10.5} \times 100\% = 15.0\%

Component Value
Standard deviation (s) 1.58
Mean (\bar{x}) 10.5
CV 15.0%
cv_y <- sd(y) / mean(y) * 100
c("Manual" = 15.0, "R" = cv_y)
Manual      R 
 15.00  15.06 

Quartiles and Interquartile Range (IQR)

Methods for Calculating Quartiles

There are different methods for calculating quartiles. In our manual calculations, we’ll use the median-excluding method:

  1. Split the ordered series at the median
  2. The median itself is not included in either half
  3. The median of the lower half is Q1; the median of the upper half is Q3

Calculations for Country X

Ordered data: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19

Median = 10 (not included in quartile calculations)

Lower half: 1, 3, 5, 7, 9, so Q1 = median of the lower half = 5

Upper half: 11, 13, 15, 17, 19, so Q3 = median of the upper half = 15

IQR = Q3 - Q1 = 15 - 5 = 10

Calculations for Country Y

Ordered data: 8, 9, 9, 10, 10, 11, 11, 12, 12, 13

Median = 10.5 (not included in quartile calculations)

Lower half: 8, 9, 9, 10, 10, so Q1 = median of the lower half = 9

Upper half: 11, 11, 12, 12, 13, so Q3 = median of the upper half = 12

IQR = Q3 - Q1 = 12 - 9 = 3

# Comparison of different quartile calculation methods in R
methods_comparison <- data.frame(
  Method = c("Manual (excl. median)", 
             "R type=1", "R type=2", "R type=7 (default)"),
  "Q1 Country X" = c(5, 
                    quantile(x, 0.25, type=1),
                    quantile(x, 0.25, type=2),
                    quantile(x, 0.25, type=7)),
  "Q3 Country X" = c(15,
                    quantile(x, 0.75, type=1),
                    quantile(x, 0.75, type=2),
                    quantile(x, 0.75, type=7)),
  "Q1 Country Y" = c(9,
                    quantile(y, 0.25, type=1),
                    quantile(y, 0.25, type=2),
                    quantile(y, 0.25, type=7)),
  "Q3 Country Y" = c(12,
                    quantile(y, 0.75, type=1),
                    quantile(y, 0.75, type=2),
                    quantile(y, 0.75, type=7)),
  check.names = FALSE
)

kable(methods_comparison, digits = 2,
      caption = "Comparison of different quartile calculation methods")
Comparison of different quartile calculation methods
Method Q1 Country X Q3 Country X Q1 Country Y Q3 Country Y
Manual (excl. median) 5.0 15.0 9.00 12.00
R type=1 5.0 15.0 9.00 12.00
R type=2 5.0 15.0 9.00 12.00
R type=7 (default) 5.5 14.5 9.25 11.75

Explanation of Different Quartile Calculation Methods

  1. Manual method (excluding median):
    • Splits data into two parts
    • Excludes median
    • Finds median of each part
  2. R type=1:
    • Inverse of the empirical cumulative distribution function
    • Always returns an observed data value
    • No interpolation
  3. R type=2:
    • Like type=1, but averages the two neighboring observations at discontinuities
    • Returns only observed values or their midpoints
  4. R type=7 (default):
    • Default method in R
    • Linear interpolation between order statistics
    • Type 7 in the Hyndman and Fan (1996) classification

Results Comparison

summary_df <- data.frame(
  Measure = c("Mean", "Median", "Mode", "Range", "Variance", 
              "Std. Dev.", "Q1", "Q3", "IQR", "CV (%)"),
  "Country X" = c(10, 10, "none", 18, 36.67, 6.06, 5, 15, 10, 60.6),
  "Country Y" = c(10.5, 10.5, "9,10,11,12", 5, 2.5, 1.58, 9, 12, 3, 15.0)
)

kable(summary_df, 
      caption = "Summary of all statistical measures",
      align = c('l', 'r', 'r'))
Summary of all statistical measures
Measure Country X Country Y
Mean 10 10.5
Median 10 10.5
Mode none 9,10,11,12
Range 18 5
Variance 36.67 2.5
Std. Dev. 6.06 1.58
Q1 5 9
Q3 15 12
IQR 10 3
CV (%) 60.6 15

Comparison using Box Plot

df_long <- data.frame(
  country = rep(c("X", "Y"), each = 10),
  size = c(x, y)
)

# Basic plot
p <- ggplot(df_long, aes(x = country, y = size, fill = country)) +
  geom_boxplot(outlier.shape = NA) +  # Disable default outlier points
  geom_jitter(width = 0.2, alpha = 0.5) +  # Add points with transparency
  scale_fill_manual(values = c("X" = "#FFA07A", "Y" = "#98FB98")) +
  labs(
    title = "Comparison of Electoral District Size Variation",
    subtitle = paste("CV: Country X =", round(cv_x, 1), "%, Country Y =", round(cv_y, 1), "%"),
    x = "Country",
    y = "District Size"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

# Add quartile annotations
p + annotate(
  "text", 
  x = c(1, 1, 1, 2, 2, 2), 
  y = c(max(x)+1, mean(x), min(x)-1, max(y)+1, mean(y), min(y)-1),
  label = c(
    paste("Q3 =", quantile(x, 0.75, type=1)),
    paste("M =", median(x)),
    paste("Q1 =", quantile(x, 0.25, type=1)),
    paste("Q3 =", quantile(y, 0.75, type=1)),
    paste("M =", median(y)),
    paste("Q1 =", quantile(y, 0.25, type=1))
  ),
  size = 3
)

Methodological Notes

  1. Quartile Calculations:
    • The median-excluding method used may give different results than R’s default functions
    • Differences in calculation methods don’t affect overall conclusions
    • Always important to specify the method used in reports
  2. Visualization:
    • Box plot effectively shows differences in distributions
    • Additional points show actual values
    • Annotations facilitate interpretation

Application Notes

  1. Using the Analysis:
    • All calculations can be reproduced using the provided R code
    • Code chunks are self-contained and documented
    • Data format requirements are clearly specified
  2. Customization:
    • Analysis can be adapted for different district size datasets
    • Visualization parameters can be adjusted for different presentation needs
    • Statistical methods can be modified based on specific requirements

Conclusion

Summary Statistics Comparison

Measure Country X Country Y Relative Difference
Mean 10.0 10.5 Similar
Median 10.0 10.5 Similar
Mode None Multiple (9,10,11,12) -
Range 18 5 3.6× larger in X
Variance 36.67 2.5 14.7× larger in X
IQR 10 3 3.3× larger in X
CV 60.6% 15.0% 4.0× larger in X

Distribution Characteristics

Country X:

  • Uniform distribution pattern
  • No dominant district size (no mode)
  • Wide range: 1 to 19 seats
  • High variability (CV = 60.6%)
  • Even spread of values across the range

Country Y:

  • Clustered distribution pattern
  • Multiple common sizes (four modes)
  • Narrow range: 8 to 13 seats
  • Low variability (CV = 15.0%)
  • Values concentrated around the mean

Box Plot Interpretation

The box plot visualization reveals:

Structure Elements:

  • Box: Shows interquartile range (IQR)
  • Lower edge: First quartile (Q1)
  • Upper edge: Third quartile (Q3)
  • Internal line: Median (Q2)
  • Whiskers: Extend to the most extreme values within 1.5 × IQR of the box
  • Points: Individual district sizes

Key Visual Findings:

  1. Box Size:
    • Country X: Large box indicates wide spread of the middle 50%
    • Country Y: Small box shows tight clustering of middle values
  2. Whisker Length:
    • Country X: Long whiskers indicate a broad overall distribution
    • Country Y: Short whiskers show limited total spread
  3. Point Distribution:
    • Country X: Points widely dispersed
    • Country Y: Points densely clustered

Key Observations

  1. Central Tendency:

    • Similar average district sizes
    • Different distribution patterns
    • Distinct approaches to standardization
  2. Variability Measures:

    • All metrics show Country X with 3-15 times more variation
    • Consistent pattern across different statistical measures
    • Systematic difference in district design
  3. System Design:

    • Country X: Flexible, varied approach
    • Country Y: Standardized, uniform approach
    • Different philosophical approaches to representation
  4. Representative Implications:

    • Country X: Variable voter-to-representative ratios
    • Country Y: More consistent representation levels
    • Different approaches to democratic representation

This analysis demonstrates fundamental differences in electoral system design between the two countries, with Country X adopting a more varied approach and Country Y maintaining greater uniformity in district sizes.

11.12 Exercise 3. Voter Participation and Economic Prosperity

Analysis of the relationship between economic prosperity and voter turnout across Amsterdam districts, based on data from the 2022 municipal elections.

Data

The sample covers five representative districts:

District Income (thousands €) Turnout (%)
A 50 60
B 45 56
C 56 70
D 40 50
E 60 75
# Load libraries
library(tidyverse)

# Create the dataset (Polish variable names kept from the original code)
dane <- data.frame(
  dzielnica = LETTERS[1:5],
  dochod = c(50, 45, 56, 40, 60),
  frekwencja = c(60, 56, 70, 50, 75)
)

Part 1: Descriptive Statistics

# Statistics for income
mean(dane$dochod)
[1] 50.2
median(dane$dochod)
[1] 50
sd(dane$dochod)
[1] 8.075
range(dane$dochod)
[1] 40 60
# Statistics for turnout
mean(dane$frekwencja)
[1] 62.2
median(dane$frekwencja)
[1] 60
sd(dane$frekwencja)
[1] 10.21
range(dane$frekwencja)
[1] 50 75

Part 2: Correlation Analysis

# Pearson correlation
cor.test(dane$dochod, dane$frekwencja)

    Pearson's product-moment correlation

data:  dane$dochod and dane$frekwencja
t = 16, df = 3, p-value = 0.0005
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9117 0.9996
sample estimates:
   cor 
0.9942 

Part 3: OLS Regression Model

# Fit the OLS model
model <- lm(frekwencja ~ dochod, data = dane)

# Model summary
summary(model)

Call:
lm(formula = frekwencja ~ dochod, data = dane)

Residuals:
     1      2      3      4      5 
-1.949  0.336  0.510  0.620  0.482 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.8965     3.9673   -0.23  0.83575    
dochod        1.2569     0.0782   16.07  0.00052 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.26 on 3 degrees of freedom
Multiple R-squared:  0.989, Adjusted R-squared:  0.985 
F-statistic:  258 on 1 and 3 DF,  p-value: 0.000524
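The slope can be cross-checked against the descriptive statistics from Parts 1 and 2, since in simple OLS the estimate satisfies b_1 = r \cdot s_y / s_x. A minimal sketch using the objects already defined above:

# Reconstruct the slope from the correlation and standard deviations
r <- cor(dane$dochod, dane$frekwencja)
b1 <- r * sd(dane$frekwencja) / sd(dane$dochod)
b1  # about 1.257, matching the 'dochod' coefficient above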

Visualization

# Scatter plot with regression line
ggplot(dane, aes(x = dochod, y = frekwencja)) +
  geom_point(size = 4, color = "blue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  geom_text(aes(label = dzielnica), vjust = -1) +
  labs(
    title = "Income vs voter turnout",
    x = "Income (thousands €)",
    y = "Voter turnout (%)"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

Conclusions

The analysis shows a strong positive relationship between a district's economic prosperity and its voter turnout: residents of higher-income districts participate in municipal elections more often.

Note: the small sample size (n = 5) limits how far these results can be generalized.

11.13 Exercise 4. Understanding Boxplots Through Life Expectancy Data

library(tidyverse)
library(gapminder)

# Prepare data
data_2007 <- gapminder %>%
  filter(year == 2007)

11.14 Introduction to Boxplots

A boxplot (also known as a box-and-whisker plot) reveals key statistics about your data; a short computational sketch follows this list:

  • Median: The middle line in the box (50th percentile)
  • First quartile (Q1): Bottom of the box (25th percentile)
  • Third quartile (Q3): Top of the box (75th percentile)
  • Interquartile Range (IQR): The height of the box (Q3 - Q1)
  • Whiskers: Extend to the most extreme non-outlier values (Tukey’s method: 1.5 × IQR)
  • Outliers: Individual points beyond the whiskers
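Before reading the plot, it can help to compute these statistics by hand. A minimal sketch for one continent (Africa, 2007), assuming data_2007 from the chunk above:

# Boxplot statistics by hand for Africa (2007)
africa <- data_2007 %>% filter(continent == "Africa") %>% pull(lifeExp)

q1 <- unname(quantile(africa, 0.25))
q3 <- unname(quantile(africa, 0.75))
c(median = median(africa), Q1 = q1, Q3 = q3, IQR = q3 - q1)

# Tukey fences: values outside these limits are drawn as outlier points
c(lower = q1 - 1.5 * (q3 - q1), upper = q3 + 1.5 * (q3 - q1))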

Visualizing Life Expectancy

ggplot(data_2007, aes(x = reorder(continent, lifeExp, FUN = median), y = lifeExp)) +
  geom_boxplot(fill = "lightblue", alpha = 0.7, outlier.shape = 24, 
               outlier.fill = "red", outlier.alpha = 0.6, outlier.size = 4) +
  geom_jitter(width = 0.2, alpha = 0.4, color = "darkblue") +
  labs(title = "Life Expectancy by Continent (2007)",
       subtitle = "Individual points show raw data; red points indicate outliers",
       x = "Continent",
       y = "Life Expectancy (years)") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 14)
  ) +
  scale_y_continuous(breaks = seq(40, 85, by = 5))

11.15 Understanding the Data

Median and Distribution

Answer True or False:

  1. 50% of African countries have life expectancy below 54 years
  2. The median life expectancy in Europe is approximately 78 years
  3. More than 75% of countries in Oceania have life expectancy above 74 years
  4. 25% of Asian countries have life expectancy below 65 years
  5. The middle 50% of life expectancies in Europe fall between 74 and 80 years

Spread and Variation

Answer True or False:

  1. Asia shows the largest spread (IQR) in life expectancy
  2. Europe has the smallest IQR among all continents
  3. The variation in Africa’s life expectancy is greater than in the Americas
  4. Oceania shows the least variation in life expectancy
  5. The range (excluding outliers) in Asia is approximately 20 years

Outliers and Extremes

Answer True or False:

  1. Africa has two countries with unusually low life expectancy
  2. There are no outliers in Oceania’s distribution
  3. Asia has both high and low outliers

11.16 Changes Over Time

time_comparison <- gapminder %>%
  filter(year %in% c(1957, 2007)) %>%
  mutate(year = factor(year))

ggplot(time_comparison, aes(x = continent, y = lifeExp, fill = year)) +
  geom_boxplot(alpha = 0.7, position = "dodge", outlier.shape = 21,
               outlier.alpha = 0.6) +
  labs(title = "Life Expectancy: 1957 vs 2007",
       subtitle = "Comparing distribution changes over 50 years",
       x = "Continent",
       y = "Life Expectancy (years)",
       fill = "Year") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 10)
  ) +
  scale_fill_brewer(palette = "Set2") +
  scale_y_continuous(breaks = seq(30, 85, by = 5))

Time Comparison Questions

Answer True or False:

  1. The median life expectancy increased in all continents between 1957 and 2007
  2. The variation in life expectancy (IQR) decreased in most continents over time
  3. Africa showed the smallest improvement in median life expectancy
  4. The spread of life expectancies in Asia decreased substantially from 1957 to 2007
  5. Oceania maintained the highest median life expectancy in both time periods

Statistical Summary

# Calculate summary statistics
summary_stats <- gapminder %>%
  filter(year %in% c(1957, 2007)) %>%
  group_by(continent, year) %>%
  summarise(
    median = median(lifeExp),
    q1 = quantile(lifeExp, 0.25),
    q3 = quantile(lifeExp, 0.75),
    iqr = IQR(lifeExp),
    n_outliers = sum(lifeExp < (q1 - 1.5 * iqr) | lifeExp > (q3 + 1.5 * iqr))
  ) %>%
  arrange(continent, year)
`summarise()` has grouped output by 'continent'. You can override using the
`.groups` argument.
knitr::kable(summary_stats, digits = 1,
             caption = "Summary Statistics by Continent and Year")
Summary Statistics by Continent and Year
continent year median q1 q3 iqr n_outliers
Africa 1957 40.6 37.4 44.8 7.4 1
Africa 2007 52.9 47.8 59.4 11.6 0
Americas 1957 56.1 48.6 62.6 14.0 0
Americas 2007 72.9 71.8 76.4 4.6 1
Asia 1957 48.3 41.9 54.1 12.2 0
Asia 2007 72.4 65.5 75.6 10.2 1
Europe 1957 67.7 65.0 69.2 4.2 2
Europe 2007 78.6 75.0 79.8 4.8 0
Oceania 1957 70.3 70.3 70.3 0.0 0
Oceania 2007 80.7 80.5 81.0 0.5 0

11.17 Key Learning Points

  1. Distribution Center:
    • Median shows the typical life expectancy
    • Changes in median reflect overall improvements
  2. Spread and Variation:
    • IQR (box height) indicates data dispersion
    • Wider boxes suggest more inequality in life expectancy
  3. Outliers and Extremes:
    • Outliers often represent countries with unique circumstances
  4. Time Comparison:
    • Shows both absolute improvements and changes in variation
    • Highlights persistent regional disparities
    • Reveals different rates of progress across continents

11.18 Appendix: Summary Tables for Data Types and Applicable Statistical Measures

Table 1: Pros and Cons of Various Statistical Measures

Measures of Center

Mean
  • Pros: uses all data points; allows further statistical calculations; ideal for normally distributed data
  • Cons: sensitive to outliers; not ideal for skewed distributions; not meaningful for nominal data
  • Applicable to: Interval, Ratio, some Discrete, Continuous

Median
  • Pros: not affected by outliers; good for skewed distributions; can be used with ordinal data
  • Cons: ignores the actual values of most data points; less useful for further statistical analyses
  • Applicable to: Ordinal, Interval, Ratio, Discrete, Continuous

Mode
  • Pros: can be used with any data type; good for finding the most common category
  • Cons: may not be unique (multimodal); not useful for many types of analyses; ignores the magnitude of differences between values
  • Applicable to: All types

Measures of Variability

Range
  • Pros: simple to calculate and understand; gives a quick idea of data spread
  • Cons: very sensitive to outliers; ignores all data between the extremes; not useful for further statistical analyses
  • Applicable to: Ordinal, Interval, Ratio, Discrete, Continuous

Interquartile Range (IQR)
  • Pros: not affected by outliers; good for skewed distributions
  • Cons: ignores 50% of the data; less intuitive than range
  • Applicable to: Ordinal, Interval, Ratio, Discrete, Continuous

Variance
  • Pros: uses all data points; basis for many statistical procedures
  • Cons: sensitive to outliers; units are squared (less intuitive)
  • Applicable to: Interval, Ratio, some Discrete, Continuous

Standard Deviation
  • Pros: uses all data points; same units as the original data; widely used and understood
  • Cons: sensitive to outliers; assumes a roughly normal distribution for interpretation
  • Applicable to: Interval, Ratio, some Discrete, Continuous

Coefficient of Variation
  • Pros: allows comparison between datasets with different units or means
  • Cons: can be misleading when means are close to zero; not meaningful for data with negative values
  • Applicable to: Ratio, some Interval

Measures of Correlation/Association

Pearson’s r
  • Pros: measures linear relationships; widely used and understood
  • Cons: assumes normal distribution; sensitive to outliers; only captures linear relationships
  • Applicable to: Interval, Ratio, Continuous

Spearman’s rho
  • Pros: can be used with ordinal data; captures monotonic relationships; less sensitive to outliers
  • Cons: loses information by converting to ranks; may miss some types of relationships
  • Applicable to: Ordinal, Interval, Ratio

Kendall’s tau
  • Pros: can be used with ordinal data; more robust than Spearman’s for small samples; has a nice interpretation (probability of concordance)
  • Cons: loses information by considering only order; computationally more intensive
  • Applicable to: Ordinal, Interval, Ratio

Chi-square
  • Pros: can be used with nominal data; tests independence of categorical variables
  • Cons: requires large sample sizes; sensitive to sample size; doesn’t measure the strength of association
  • Applicable to: Nominal, Ordinal

Cramér’s V
  • Pros: can be used with nominal data; provides a measure of the strength of association; normalized to the [0,1] range
  • Cons: interpretation can be subjective; may overestimate association in small samples
  • Applicable to: Nominal, Ordinal
Statistical Measures Applicability / Zastosowanie miar statystycznych
Measure (EN) Miara (PL) Nominal Ordinal Interval Ratio
Central Tendency / Tendencja centralna:
Mode Dominanta ✓ ✓ ✓ ✓
Median Mediana - ✓ ✓ ✓
Arithmetic Mean Średnia arytmetyczna - - ✓* ✓
Geometric Mean Średnia geometryczna - - - ✓
Harmonic Mean Średnia harmoniczna - - - ✓
Dispersion / Rozproszenie:
Range Rozstęp - ✓ ✓ ✓
Interquartile Range Rozstęp międzykwartylowy - ✓ ✓ ✓
Mean Absolute Deviation Średnie odchylenie bezwzględne - - ✓ ✓
Variance Wariancja - - ✓* ✓
Standard Deviation Odchylenie standardowe - - ✓* ✓
Coefficient of Variation Współczynnik zmienności - - - ✓
Association / Współzależność:
Chi-square Chi-kwadrat ✓ ✓ ✓ ✓
Spearman Correlation Korelacja Spearmana - ✓ ✓ ✓
Kendall’s Tau Tau Kendalla - ✓ ✓ ✓
Pearson Correlation Korelacja Pearsona - - ✓* ✓
Covariance Kowariancja - - ✓* ✓

* Theoretically problematic but commonly used in practice

Notes:

  1. Measurement Scales:
  • Nominal: Categories without order
  • Ordinal: Ordered categories
  • Interval: Equal intervals, arbitrary zero
  • Ratio: Equal intervals, absolute zero
  2. Practical Considerations:
  • Some measures marked with ✓* are commonly used for interval data despite theoretical issues
  • The choice of measure should weigh both theoretical appropriateness and practical utility
  • More restrictive scales (ratio) allow all measures from less restrictive scales