11 Fundamentals of Univariate Descriptive Statistics
Descriptive statistics are fundamental tools in social science research, providing a concise summary of data characteristics. They serve several crucial functions:
Summarizing large datasets into manageable information
Identifying patterns and trends in data
Detecting potential anomalies or outliers
Providing a foundation for further statistical analysis
11.1 Introduction to Sigma Notation (Σ)
What is Sigma summation notation? Sigma (Σ) is a mathematical operator that instructs us to sum (add) a sequence of terms - it functions as a directive to perform addition of all elements within a specified range.
Purpose: Provides a concise way to write sums of many similar terms using a single symbol, avoiding lengthy addition expressions.
Basic Formula
The general form of sigma notation is: \sum_{i=a}^{b} f(i)
Summation index: i
Lower bound: a
Upper bound: b
Function: f(i)
Examples of Sigma Notation Applications
Simple Example: Sum of Natural Numbers
Suppose you want to add the first five positive integers: \sum_{i=1}^{5} i = 1 + 2 + 3 + 4 + 5 = 15
The above notation adds the first five positive integers.
Sum of Squares
Suppose you want to sum the squares of the first four positive integers: \sum_{i=1}^{4} i^2 = 1^2 + 2^2 + 3^2 + 4^2 = 1 + 4 + 9 + 16 = 30
This is the sum of squares of the first four positive integers.
Sum of a Constant Value
Summing a constant value c for n terms: \sum_{i=1}^{n} c = c + c + c + ... + c \text{ (n times)} = n \cdot c
Example: Sum of five fives: \sum_{i=1}^{5} 5 = 5 + 5 + 5 + 5 + 5 = 5 \cdot 5 = 25
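For readers following the examples in R, the same sums can be reproduced with the built-in sum() function; a small illustration:
sum(1:5)        # 1 + 2 + 3 + 4 + 5 = 15
sum((1:4)^2)    # 1 + 4 + 9 + 16 = 30
sum(rep(5, 5))  # five fives: 5 * 5 = 25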
Simple Examples in Statistical Context
\sum_{i=1}^{n} x_i
Summation index: i (typically denotes a specific observation in a dataset)
Lower bound: 1 (we usually start from the first observation)
Upper bound: n (total number of observations in our dataset)
Expression: x_i (value of the i-th observation)
Summing Observation Values
We have a dataset: 5, 8, 12, 15, 20
Sum of all values: \sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5 = 5 + 8 + 12 + 15 + 20 = 60
This sum is a key element when calculating the arithmetic mean.
Sum of Deviations from the Mean
For the same dataset (5, 8, 12, 15, 20), the mean is \bar{x} = 60/5 = 12
Sum of deviations from the mean: \sum_{i=1}^{5} (x_i - \bar{x}) = (5-12) + (8-12) + (12-12) + (15-12) + (20-12)= -7 + (-4) + 0 + 3 + 8 = 0
Important observation: The sum of deviations from the mean always equals 0, which is a fundamental property of the arithmetic mean.
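This property is easy to verify in R for the same small dataset (a quick check using base R only):
x <- c(5, 8, 12, 15, 20)
sum(x - mean(x))  # 0: deviations from the mean always cancel out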
Summary
Sigma Notation (Σ) allows for concise expression of key statistical formulas
The most important applications include calculating:
Arithmetic mean
Variance and standard deviation
Various sums of squares used in regression analysis
Summation (Σ) and Product (Π) Operators
Sigma (Σ) Operator
\sum is a summation operator that instructs us to add terms:
\sum_{i=1}^{n} x_i = x_1 + x_2 + ... + x_n
where:
i is the index variable
The lower value under Σ (here i = 1) is the starting point
The upper value (here n) is the ending point
Pi (Π) Operator
\prod is a product operator that instructs us to multiply terms:
\prod_{i=1}^{n} x_i = x_1 \cdot x_2 \cdot ... \cdot x_n
where the index i, the lower bound, and the upper bound play the same roles as in sigma notation.
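Base R has a direct counterpart to sum() for the product operator, prod(); a small illustration:
prod(1:4)          # 1 * 2 * 3 * 4 = 24
prod(c(2, 5, 10))  # 2 * 5 * 10 = 100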
Data Distributions
A data distribution describes what values a variable takes and how often each value occurs.
Understanding data distributions is crucial for data analysis and visualization. In this document, we’ll explore various types of distributions and how to visualize them using ggplot2 in R.
Normal Distribution
The normal distribution, also known as the Gaussian distribution, is symmetric and bell-shaped.
# Generate normal distribution data
normal_data <- data.frame(x = rnorm(1000))

# Plot
ggplot(normal_data, aes(x)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "skyblue", color = "black") +
  geom_density(color = "red") +
  labs(title = "Normal Distribution", x = "Value", y = "Density")
Uniform Distribution
In a uniform distribution, all values have an equal probability of occurrence.
# Generate uniform distribution data
uniform_data <- data.frame(x = runif(1000))

# Plot
ggplot(uniform_data, aes(x)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightgreen", color = "black") +
  geom_density(color = "red") +
  labs(title = "Uniform Distribution", x = "Value", y = "Density")
Skewed Distributions
Skewed distributions are asymmetric, with one tail longer than the other.
# Generate right-skewed data
right_skewed <- data.frame(x = rlnorm(1000))

# Plot
ggplot(right_skewed, aes(x)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightyellow", color = "black") +
  geom_density(color = "red") +
  labs(title = "Right-Skewed Distribution", x = "Value", y = "Density")
Bimodal Distribution
A bimodal distribution has two peaks, indicating two distinct subgroups in the data.
# Generate bimodal data
bimodal_data <- data.frame(x = c(rnorm(500, mean = -2), rnorm(500, mean = 2)))

# Plot
ggplot(bimodal_data, aes(x)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightpink", color = "black") +
  geom_density(color = "red") +
  labs(title = "Bimodal Distribution", x = "Value", y = "Density")
| Distribution | Key Properties | Examples |
|---|---|---|
| Symmetric (Normal) | Symmetric, bell-shaped, most values close to the mean | Adult height in a population, IQ test scores, measurement errors, standardized exam results |
| Uniform | Equal probability across the entire range | Last digit of phone numbers, random day of the week selection, position of a pointer after spinning a wheel of fortune |
| Bimodal | Two distinct peaks, suggests presence of subgroups | Age structure in university towns (students and permanent residents), opinions on strongly polarizing topics, traffic intensity hours (morning and afternoon peak) |
| Right-skewed (positively skewed) | Extended "tail" on the right side, most values less than the mean | Queue waiting time, commute time to work, age at first marriage |
| Heavy-tailed skewed (log-normal) | Strong right asymmetry, values cannot be negative, long "fat tail" | Personal income, housing prices, household size |
| Extreme-tailed skewed (power law) | Extreme asymmetry, "rich get richer" effect, no characteristic scale | Wealth of the richest individuals, city populations, number of followers on social media, number of citations of scientific publications |
11.3 Visualizing Real-World Data Distributions
Let’s use the palmerpenguins dataset to explore data distributions.
Histogram and Density Plot
Understanding Histograms and Density
⭐ A histogram is a special graph for numerical data where:
Data is grouped into ranges (called “bins”)
Bars touch each other (unlike bar charts!) because the data is continuous
Each bar’s height shows how many values fall into that range
Think of density as showing how common or concentrated certain values are in your data:
A higher point on a density curve (or taller bar in a histogram) means those values appear more frequently in your data
A lower point means those values are less common
Just like a crowded area has more people per space (higher density), a taller part of the graph shows values that appear more often in your dataset!
ggplot(penguins, aes(x = flipper_length_mm)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightblue", color = "black") +
  geom_density(color = "red") +
  labs(title = "Distribution of Penguin Flipper Lengths", x = "Flipper Length (mm)", y = "Density")
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_density()`).
Box Plot
Box plots are useful for comparing distributions across categories.
ggplot(penguins, aes(x = species, y = body_mass_g, fill = species)) +
  geom_boxplot() +
  labs(title = "Distribution of Penguin Body Mass by Species", x = "Species", y = "Body Mass (g)")
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Violin Plot
Violin plots combine box plot and density plot features.
ggplot(penguins, aes(x = species, y = body_mass_g, fill = species)) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.1, fill = "white") +
  labs(title = "Distribution of Penguin Body Mass by Species", x = "Species", y = "Body Mass (g)")
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_ydensity()`).
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Ridgeline Plot
Ridgeline plots are useful for comparing multiple distributions.
library(ggridges)

ggplot(penguins, aes(x = flipper_length_mm, y = species, fill = species)) +
  geom_density_ridges(alpha = 0.6) +
  labs(title = "Distribution of Flipper Length by Penguin Species",
       x = "Flipper Length (mm)",
       y = "Species")
Picking joint bandwidth of 2.38
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_density_ridges()`).
Conclusion
Understanding and visualizing data distributions is crucial in data analysis. ggplot2 provides a flexible and powerful toolkit for creating various types of distribution plots. By exploring different visualization techniques, we can gain insights into the underlying patterns and characteristics of our data.
11.4 Understanding Outliers
Before diving into specific measures, it’s crucial to understand the concept of outliers, as they can significantly impact many descriptive statistics.
Outliers are data points that differ significantly from other observations in the dataset. They can occur due to:
Measurement or recording errors
Genuine extreme values in the population
Outliers can have a substantial effect on many statistical measures, especially those based on means or sums of squared deviations. Therefore, it’s essential to:
Identify outliers through both statistical methods and domain knowledge
Investigate the cause of outliers
Make informed decisions about whether to include or exclude them in analyses
Throughout this guide, we’ll discuss how different descriptive measures are affected by outliers.
The Mean as a Balance Point (Seesaw Analogy)
Imagine placing the values of a small dataset (1, 2, 6, 7, 9) as equal weights along a seesaw. The mean (\mu) acts as the perfect balance point of this seesaw. For our data:
\mu = \frac{1 + 2 + 6 + 7 + 9}{5} = 5
What happens at different support points? 🤔
Support point at 6 (too high):
Left side: Values (1, 2) are below
Right side: Values (7, 9) are above
\sum distances from left = (6-1) + (6-2) = 9
\sum distances from right = (7-6) + (9-6) = 4
The seesaw tilts left! ⬅️ because 9 > 4
Support point at 4 (too low):
Left side: Values (1, 2) are below
Right side: Values (6, 7, 9) are above
\sum distances from left = (4-1) + (4-2) = 5
\sum distances from right = (6-4) + (7-4) + (9-4) = 10
The seesaw tilts right! ➡️ because 5 < 10
Support point at mean (5) (perfect balance):
\sum distances below = \sum distances above
((5-1) + (5-2)) = ((6-5) + (7-5) + (9-5))
7 = 7 ✨ Perfect balance!
This shows why the mean is the unique balance point, where:
\sum_{i=1}^n (x_i - \mu) = 0
The seesaw will always tilt unless the support point is placed exactly at the mean! 🎪
Mean as a Balance Point
This visualization shows how the arithmetic mean (5) acts as a balance point between clustered points on the left and dispersed points on the right:
Left side of the mean:
Points with values 2 and 3
Close together (difference of 1 unit)
Distances from mean: 3 and 2 units
Sum of "pull" = 5 units
Right side of the mean:
Points with values 6 and 9
More spread out (difference of 3 units)
Distances from mean: 1 and 4 units
Sum of "pull" = 5 units
Key observations:
The mean (5) is a balance point, even though:
Points on the left are clustered (2,3)
Points on the right are dispersed (6,9)
Green arrows show distances from the mean
Balance is maintained because:
Sum of distances balances out: (5-2) + (5-3) = (6-5) + (9-5)
Total sum of distances = 5 units on each side
Manual Calculation Example:
Let's calculate the mean for the dataset: 2, 4, 4, 5, 5, 7, 9
\bar{x} = \frac{2 + 4 + 4 + 5 + 5 + 7 + 9}{7} = \frac{36}{7} \approx 5.14
If a single extreme value of 100 is appended to this dataset, the mean jumps to \frac{136}{8} = 17. As we can see, the outlier (100) drastically affects the mean.
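A short R check of both means (the dataset with the outlier is assumed here to be the same seven values with 100 appended, as described above):
data <- c(2, 4, 4, 5, 5, 7, 9)
mean(data)                          # 5.142857

data_with_outlier <- c(data, 100)   # assumed: the same data plus the outlier 100
mean(data_with_outlier)             # 17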
Median
The median is the middle value when the data is ordered.
Manual Calculation Example:
Using the same dataset: 2, 4, 4, 5, 5, 7, 9
| Step | Description | Result |
|---|---|---|
| 1 | Order the data | 2, 4, 4, 5, 5, 7, 9 |
| 2 | Find the middle value | 5 |
For even number of values, take the average of the two middle values.
R calculation:
data <- c(2, 4, 4, 5, 5, 7, 9)
median(data)
[1] 5
data_with_outlier <- c(2, 4, 4, 5, 5, 7, 9, 100)  # assumed definition: the same data plus the outlier 100
median(data_with_outlier)
[1] 5
Pros:
Not affected by extreme outliers
Better for skewed distributions
Cons:
Doesn’t use all data points
Less useful for further statistical calculations
Warning
To find the position of the median in a dataset:
First sort the data in ascending order
If n is odd:
Median position = \frac{n + 1}{2}
If n is even:
First median position = \frac{n}{2}
Second median position = \frac{n}{2} + 1
Median = \frac{\text{value at }\frac{n}{2} + \text{value at }(\frac{n}{2}+1)}{2}
For example:
Odd n=7: position = \frac{7+1}{2} = 4th value
Even n=8: positions = \frac{8}{2} = 4th and 4+1 = 5th value
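These position rules can be checked with R's median() function; a small illustration (the even-length vector simply extends the earlier example dataset with one extra value):
odd_n  <- c(2, 4, 4, 5, 5, 7, 9)      # n = 7, median is the 4th ordered value
median(odd_n)                          # 5

even_n <- c(2, 4, 4, 5, 5, 7, 9, 10)  # n = 8, average of the 4th and 5th ordered values
median(even_n)                         # (5 + 5) / 2 = 5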
Mode
The mode is the most frequently occurring value.
Manual Calculation Example:
Using the dataset: 2, 4, 4, 5, 5, 7, 9
| Value | Frequency |
|---|---|
| 2 | 1 |
| 4 | 2 |
| 5 | 2 |
| 7 | 1 |
| 9 | 1 |

The modes are 4 and 5 (the dataset is bimodal).
R calculation:
library(modeest)
mfv(data)  # Most frequent value
[1] 4 5
Pros:
Only measure of central tendency for nominal data
Can identify multiple peaks in the data
Cons:
Not always uniquely defined
Not useful for continuous data
Weighted (arithmetic) Mean (*)
The weighted mean is used when some data points are more important than others. There are two variants: one with non-normalized weights and one with normalized weights.
Weighted Mean with Non-Normalized Weights
This is the standard form of the weighted mean, where the weights can be any positive numbers representing the importance of each data point:
\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}
With normalized weights (weights that sum to 1), the denominator disappears and the formula reduces to \bar{x}_w = \sum_{i=1}^{n} w_i x_i.
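Base R implements this directly in weighted.mean(); a minimal sketch with hypothetical course grades weighted by credit points:
grades  <- c(4.0, 3.5, 5.0)  # hypothetical grades
credits <- c(6, 3, 1)        # non-normalized weights (credit points)

weighted.mean(grades, credits)        # sum(grades * credits) / sum(credits) = 3.95
sum(grades * credits / sum(credits))  # the same result written with normalized weights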
Variance
The variance measures dispersion as the average squared deviation of the observations from their arithmetic mean; for a sample it is defined as s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}. The accompanying visualization illustrates three ideas:
Deviations show distance from the mean (red dotted lines)
Squaring makes all deviations positive (blue bars)
Larger deviations contribute more to the variance
Manual Calculation Example:
Using the dataset: 2, 4, 4, 5, 5, 7, 9
| Step | Description | Calculation |
|---|---|---|
| 1 | Calculate the mean | \bar{x} = 5.14 |
| 2 | Subtract the mean from each value and square the result | (2 - 5.14)^2 = 9.86; (4 - 5.14)^2 = 1.30; (4 - 5.14)^2 = 1.30; (5 - 5.14)^2 = 0.02; (5 - 5.14)^2 = 0.02; (7 - 5.14)^2 = 3.46; (9 - 5.14)^2 = 14.90 |
| 3 | Sum the squared differences | 30.86 |
| 4 | Divide by (n - 1), i.e. the number of observations minus 1 | 30.86 / 6 = 5.14 |
R calculation:
var(data)
[1] 5.142857
Pros:
Uses all data points
Foundation for many statistical tests
Cons:
Units are squared, making interpretation less intuitive
Sensitive to outliers
Bessel’s Correction: Why We Divide by (n-1) And Not by n
The Key Insight:
When we calculate deviations from the mean, they must sum to zero. This is a mathematical fact: \sum(x_i - \bar{x}) = 0
Think of it Like This:
If you have 5 numbers and their mean:
Once you calculate 4 deviations from the mean
The 5th deviation MUST be whatever makes the sum zero
You don’t really have 5 independent deviations
You only have 4 truly “free” deviations
Simple Example:
Numbers: 2, 4, 6, 8, 10
Mean = 6
Deviations: -4, -2, 0, +2, +4
Notice they sum to zero
If you know any 4 deviations, the 5th is predetermined!
This is Why:
When calculating variance: s^2 = \frac{\sum(x_i - \bar{x})^2}{n-1}
We divide by (n-1) not n
Because only (n-1) deviations are truly independent
The last one is determined by the others
Degrees of Freedom:
n = number of observations
1 = constraint (deviations must sum to zero)
n-1 = degrees of freedom = number of truly independent deviations
When to Use It:
When calculating sample variance
When calculating sample standard deviation
When NOT to Use It:
Population calculations (when you have all data)
Remember:
It’s not just a statistical trick
Deviations from the mean must sum to zero
This constraint costs us one degree of freedom
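The effect of the correction is easy to see by computing both versions of the variance by hand and comparing them with var(); a brief illustration using the example numbers above:
x <- c(2, 4, 6, 8, 10)
dev <- x - mean(x)             # -4, -2, 0, 2, 4; they sum to zero

sum(dev^2) / length(x)         # 8:  dividing by n
sum(dev^2) / (length(x) - 1)   # 10: dividing by n - 1 (Bessel's correction)
var(x)                         # 10: R's var() uses n - 1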
Standard Deviation
The standard deviation is the square root of the variance and measures the average dispersion of the data about their arithmetic mean. In contrast to the variance, it has the advantage of being expressed in the same units as the original measurements, making its interpretation more intuitive.
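In R, sd() returns the sample standard deviation, the square root of var(); continuing with the earlier example dataset:
data <- c(2, 4, 4, 5, 5, 7, 9)
var(data)        # 5.142857 (squared units)
sd(data)         # 2.267798 (original units)
sqrt(var(data))  # identical to sd(data)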
Percentiles: A More Precise Measure of Relative Standing (*)
What Are Percentiles?
Percentiles give us a more detailed view by dividing the data into 100 equal parts. Unlike the simpler quartile rules introduced earlier, the percentile calculation below uses linear interpolation between neighboring observations, which gives more precise values.
Key Points:
The 25th percentile equals Q1
The 50th percentile equals Q2 (median)
The 75th percentile equals Q3
Calculating Percentiles
The Formula: P_k = \frac{k(n+1)}{100}
Where:
P_k is the position for the kth percentile
k is the percentile we want (1-100)
n is the number of observations
Example 3: Finding the 60th Percentile Let’s use student homework scores: 72, 75, 78, 80, 82, 85, 88, 90, 92, 95
Step 1: Calculate position
n = 10 scores
For 60th percentile: P_{60} = \frac{60(10+1)}{100} = 6.6
Step 2: Find surrounding values
Position 6: score of 85
Position 7: score of 88
Step 3: Interpolate (important: percentiles use linear interpolation)
We need to go 0.6 of the way between 85 and 88:
P_{60} = 85 + 0.6(88 - 85) = 85 + 0.6(3) = 85 + 1.8 = 86.8
What this means: 60% of students scored 86.8 or below.
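R's quantile() offers several interpolation rules; type = 6 positions the kth percentile at k(n + 1)/100, so it reproduces the hand calculation above, while the default rule (type = 7) gives a slightly different value:
scores <- c(72, 75, 78, 80, 82, 85, 88, 90, 92, 95)

quantile(scores, probs = 0.60, type = 6)  # 86.8, matching the manual interpolation
quantile(scores, probs = 0.60)            # 86.2 with R's default method (type = 7)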
Percentile Ranks (PR) (*)
What is a Percentile Rank?
While percentiles tell us the value at a certain position, percentile rank tells us what percentage of values fall below a specific score. Think of it as answering the question “What percentage of the class did I score higher than?”
PR = \frac{\text{number of values below } + 0.5 \times \text{number of equal values}}{\text{total number of values}} \times 100
Example 4: Finding a Percentile Rank Consider these exam scores:
Interpretation: A score of 75 is higher than 45% of the class scores.
Remark:
Q1: “Why do we use 0.5 for equal values in PR?”
A1: This is because we’re assuming people with the same score are evenly spread across that position. It’s like saying they share the position equally.
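The PR formula is easy to wrap in a small helper function; a minimal sketch (the example scores below are hypothetical, not the class data referred to above):
# Percentile rank of a score within a vector of values:
# values strictly below count fully, ties count half
percentile_rank <- function(values, score) {
  below <- sum(values < score)
  equal <- sum(values == score)
  (below + 0.5 * equal) / length(values) * 100
}

scores <- c(60, 65, 70, 75, 75, 80, 85, 90, 95, 100)  # hypothetical exam scores
percentile_rank(scores, 75)  # 40: 3 values below + 0.5 * 2 ties, out of 10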
Understanding and Interpreting Box Plots
Box plots (also known as box-and-whisker plots) are powerful visualization tools for understanding data distributions. In this section, we’ll explore how to construct and interpret box plots using height measurements from two groups.
Construction of the Tukey Box Plot
The box plot was introduced by John Tukey as part of his exploratory data analysis toolkit. It provides a standardized way of displaying the distribution of data based on a five-number summary.
The Five-Number Summary
A box plot represents five key statistical values:
Minimum: The smallest value in the dataset (excluding outliers)
First Quartile (Q1): The 25th percentile, below which 25% of observations fall
Median (Q2): The 50th percentile, which divides the dataset into two equal halves
Third Quartile (Q3): The 75th percentile, below which 75% of observations fall
Maximum: The largest value in the dataset (excluding outliers)
Box Plot Components
Figure 11.2: Boxplot diagram showing its key components.
The components of a box plot include:
The Box:
Represents the interquartile range (IQR), containing the middle 50% of the data
Lower edge represents Q1
Upper edge represents Q3
Line inside the box represents the median (Q2)
The Whiskers:
Extend from the box to show the range of non-outlier data
In a Tukey box plot, whiskers extend up to 1.5 × IQR from the box edges:
Lower whisker: extends to the minimum value ≥ (Q1 - 1.5 × IQR)
Upper whisker: extends to the maximum value ≤ (Q3 + 1.5 × IQR)
Outliers:
Points that fall beyond the whiskers
Individually plotted as dots or symbols
Values that are < (Q1 - 1.5 × IQR) or > (Q3 + 1.5 × IQR); a short R sketch of these fences follows this list
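A minimal sketch of how these fences could be computed for an arbitrary numeric vector (the data and variable names here are illustrative only):
values <- c(150, 160, 165, 170, 172, 175, 178, 180, 185, 210)  # illustrative data

q1  <- quantile(values, 0.25)
q3  <- quantile(values, 0.75)
iqr <- q3 - q1

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

values[values < lower_fence | values > upper_fence]  # points a box plot would draw as outliers (here: 210)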
Key Features to Observe
When interpreting box plots, look for these characteristics:
Central Tendency: Location of the median line within the box
Dispersion: Width of the box (IQR) and length of the whiskers
Skewness:
Symmetrical data: median is approximately in the middle of the box, whiskers are roughly equal in length
Right (positive) skew: median is closer to the bottom of the box, upper whisker is longer
Left (negative) skew: median is closer to the top of the box, lower whisker is longer
Outliers: Presence of individual points beyond the whiskers
Case Study: Comparing Heights Between Groups
Let’s apply our understanding of box plots to a real dataset. We have height measurements (in centimeters) from two groups of 25 students each.
# Create the dataset
data_height <- data.frame(
  group_1 = c(150, 160, 165, 168, 172, 173, 175, 176, 177, 178, 179, 180, 180, 181, 181,
              182, 182, 183, 183, 184, 186, 188, 190, 191, 200),
  group_2 = c(138, 140, 148, 152, 164, 164, 165, 165, 166, 166, 170, 175, 175, 175, 182,
              182, 182, 182, 182, 182, 183, 183, 183, 188, 210)
)

# Transform dataset from wide to long format
data_height_l <- gather(data = data_height, key = "Group_number", value = "height", group_1:group_2)

# Display the first few rows
head(data_height_l)
Figure 11.3: Box plots comparing height distributions between groups.
To complement our box plots, let’s also look at the density distributions:
# Create density plots
ggplot(data = data_height_l) +
  geom_density(aes(x = height, fill = Group_number), alpha = 0.5) +
  facet_grid(~ Group_number) +
  scale_x_continuous(breaks = seq(130, 210, 10)) +
  labs(title = "Height Density by Group",
       x = "Height (cm)",
       y = "Density")
Figure 11.4: Density plots showing the height distributions for each group.
Box Plot Interpretation Exercise
Based on the box plots and density plots above, determine whether each of the following statements is True or False. For each statement, provide a brief explanation based on evidence from the visualizations.
Exercise Questions
Students from group 2 (G2) in the studied sample are, on average, taller than those from group 1 (G1).
Group 1 (G1) height measurements are more dispersed/spread out than group 2 (G2).
The lowest person is in group 2 (G2).
Both data sets are negatively (left) skewed.
Half of the students in group 2 (G2) measure at least 175 cm.
Hints for Interpretation
When answering these questions, consider:
The position of the median line within each box
The relative sizes of the boxes (IQR)
The positions of the minimum and maximum values
The symmetry of the distributions (balanced or skewed)
The lengths of the whiskers
For each statement, determine whether it is True or False and provide your explanation:
Answer Template
Students from G2 are, on average, taller than G1: [True/False]
Explanation:
G1 height is more dispersed/spread out: [True/False]
Explanation:
The lowest person is in G2: [True/False]
Explanation:
Both data sets are negatively (left) skewed: [True/False]
Explanation:
Half of G2 measure at least 175 cm: [True/False]
Explanation:
Let’s review the answers to our box plot interpretation questions:
Solutions
Students from G2 are, on average, taller than G1: False
Explanation: The median height (middle line in the boxplot) for G1 is higher than G2.
G1 height is more dispersed/spread out: False
Explanation: G2 shows greater dispersion. This is visible in the boxplot, where G2 has a larger interquartile range (about 17 cm versus 8 cm for G1, using R's default IQR() calculation). G2 also has a wider range from minimum to maximum values.
The lowest person is in G2: True
Explanation: The minimum value in G2 is 138 cm, which is lower than the minimum value in G1 (150 cm).
Both data sets are negatively (left) skewed: True
Explanation: In both groups, the median line is positioned toward the upper part of the box, and the lower whisker is longer than the upper whisker. This indicates that there’s a longer tail on the left side of the distribution, which means negative skewness.
Half of G2 measure at least 175 cm: True
Explanation: The median (middle line in the boxplot) for G2 is 175 cm, which means that 50% of the values are greater than or equal to 175 cm.
R Code Reference
Here’s the complete R code used in this section:
# Load required packages
library(tidyr)
library(ggplot2)
library(ggpubr)

# Set display options
options(scipen = 999, digits = 3)

# Create the dataset
data_height <- data.frame(
  group_1 = c(150, 160, 165, 168, 172, 173, 175, 176, 177, 178, 179, 180, 180, 181, 181,
              182, 182, 183, 183, 184, 186, 188, 190, 191, 200),
  group_2 = c(138, 140, 148, 152, 164, 164, 165, 165, 166, 166, 170, 175, 175, 175, 182,
              182, 182, 182, 182, 182, 183, 183, 183, 188, 210)
)

# Transform dataset from wide to long format
data_height_l <- gather(data = data_height, key = "Group_number", value = "height", group_1:group_2)

# Display the first few rows
head(data_height_l)

# Calculate summary statistics for each group
group1_stats <- summary(data_height$group_1)
group2_stats <- summary(data_height$group_2)

# Calculate IQR
group1_iqr <- IQR(data_height$group_1)
group2_iqr <- IQR(data_height$group_2)

# Create horizontal boxplots
ggplot(data = data_height_l) +
  geom_boxplot(aes(x = Group_number, y = height, colour = Group_number), notch = FALSE) +
  coord_flip() +
  scale_y_continuous(breaks = seq(130, 210, 5)) +
  theme_pubr() +
  grids(linetype = "dashed") +
  labs(title = "Height Distribution by Group",
       x = "Group",
       y = "Height (cm)")

# Create density plots
ggplot(data = data_height_l) +
  geom_density(aes(x = height, fill = Group_number), alpha = 0.5) +
  facet_grid(~ Group_number) +
  scale_x_continuous(breaks = seq(130, 210, 10)) +
  labs(title = "Height Density by Group",
       x = "Height (cm)",
       y = "Density")
11.9 Shape Measures
Skewness
Definition
Skewness quantifies the asymmetry of a data distribution. It indicates whether data tends to cluster more on one side of the mean than the other.
Mathematical Expression
SK = \frac{n}{(n-1)(n-2)} \sum_{i=1}^n \left(\frac{x_i - \bar{x}}{s}\right)^3
where:
n is the sample size
x_i is the i-th observation
\bar{x} is the sample mean
s is the sample standard deviation
Note that the skewness() function from the moments package used below computes the simpler uncorrected coefficient g_1 = \frac{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2\right)^{3/2}}; for moderate sample sizes the two versions are close.
Simplified Numerical Example
library(moments)
Attaching package: 'moments'
The following object is masked from 'package:modeest':
skewness
The following object is masked from 'package:dplyr':
combine
# Three example datasets with different types of skewness
# 1. Positive skewness (right tail)
positive_skew_data <- c(2, 3, 4, 4, 5, 5, 5, 6, 6, 7, 8, 12, 15, 20)

# 2. Negative skewness (left tail)
negative_skew_data <- c(1, 5, 10, 13, 14, 15, 16, 16, 17, 17, 18, 18, 19, 20)

# 3. Near-zero skewness (symmetry)
symmetric_data <- c(1, 3, 5, 7, 9, 10, 11, 12, 13, 15, 17, 19, 21)

# Calculating skewness
positive_skewness <- skewness(positive_skew_data)
negative_skewness <- skewness(negative_skew_data)
symmetric_skewness <- skewness(symmetric_data)

# Summary of results
skewness_data <- data.frame(
  "Distribution Type" = c("Positive skewness", "Negative skewness", "Symmetric distribution"),
  "Skewness value" = round(c(positive_skewness, negative_skewness, symmetric_skewness), 3),
  "Interpretation" = c(
    "Longer right tail (majority of data on the left side)",
    "Longer left tail (majority of data on the right side)",
    "Data distributed symmetrically"
  )
)

# Display table
skewness_data
Distribution.Type Skewness.value
1 Positive skewness 1.42
2 Negative skewness -1.33
3 Symmetric distribution 0.00
Interpretation
1 Longer right tail (majority of data on the left side)
2 Longer left tail (majority of data on the right side)
3 Data distributed symmetrically
Visualizations of Skewness Types
# Create a data frame for all sets
df_skewness <- rbind(
  data.frame(value = positive_skew_data, type = "Positive skewness", skewness = round(positive_skewness, 2)),
  data.frame(value = negative_skew_data, type = "Negative skewness", skewness = round(negative_skewness, 2)),
  data.frame(value = symmetric_data, type = "Symmetric distribution", skewness = round(symmetric_skewness, 2))
)

# Histograms for three types of skewness
p1 <- ggplot(df_skewness, aes(x = value)) +
  geom_histogram(bins = 10, fill = "skyblue", color = "darkblue", alpha = 0.7) +
  facet_wrap(~type, scales = "free_x") +
  geom_vline(data = df_skewness %>% group_by(type) %>% summarise(mean = mean(value)),
             aes(xintercept = mean), color = "red", linetype = "dashed") +
  geom_vline(data = df_skewness %>% group_by(type) %>% summarise(median = median(value)),
             aes(xintercept = median), color = "darkgreen", linetype = "dashed") +
  geom_text(data = unique(df_skewness[, c("type", "skewness")]),
            aes(x = Inf, y = Inf, label = paste("SK =", skewness)),
            hjust = 1.1, vjust = 1.5, size = 3.5) +
  labs(
    title = "Histograms showing different types of skewness",
    subtitle = "Red line: mean, Green line: median",
    x = "Value",
    y = "Frequency"
  ) +
  theme_minimal()

# Box plots
p2 <- ggplot(df_skewness, aes(x = type, y = value, fill = type)) +
  geom_boxplot() +
  scale_fill_manual(values = c("skyblue", "lightgreen", "lightsalmon")) +
  labs(
    title = "Box plots for different types of skewness",
    x = "Distribution type",
    y = "Value"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

# Display plots
grid.arrange(p1, p2, nrow = 2)
Example: Voter Turnout Analysis
# Generate three datasets reflecting different types of skewness
set.seed(123)

# 1. Positive skewness - typical for turnout in regions with low engagement
positive_turnout <- c(
  runif(50, min = 20, max = 30),                 # Small group with low turnout
  rbeta(200, shape1 = 2, shape2 = 5) * 50 + 30   # Majority of results shifted to the left
)

# 2. Negative skewness - typical for regions with high political engagement
negative_turnout <- c(
  rbeta(200, shape1 = 5, shape2 = 2) * 30 + 50,  # Majority of results shifted to the right
  runif(50, min = 40, max = 50)                  # Small group with lower turnout
)

# 3. Symmetric distribution - typical for regions with uniform engagement
symmetric_turnout <- rnorm(250, mean = 65, sd = 8)

# Create data frame
df_turnout <- rbind(
  data.frame(turnout = positive_turnout, region = "Region A: Positive skewness"),
  data.frame(turnout = negative_turnout, region = "Region B: Negative skewness"),
  data.frame(turnout = symmetric_turnout, region = "Region C: Symmetric distribution")
)

# Calculate skewness for each region
region_skewness <- df_turnout %>%
  group_by(region) %>%
  summarise(skewness = round(skewness(turnout), 2))

# Histogram of turnout by region
p3 <- ggplot(df_turnout, aes(x = turnout)) +
  geom_histogram(bins = 20, fill = "skyblue", color = "darkblue", alpha = 0.7) +
  facet_wrap(~region, ncol = 1) +
  geom_vline(data = df_turnout %>% group_by(region) %>% summarise(mean = mean(turnout)),
             aes(xintercept = mean), color = "red", linetype = "dashed") +
  geom_vline(data = df_turnout %>% group_by(region) %>% summarise(median = median(turnout)),
             aes(xintercept = median), color = "darkgreen", linetype = "dashed") +
  geom_text(data = region_skewness,
            aes(x = 25, y = 20, label = paste("SK =", skewness)),
            size = 3.5) +
  labs(
    title = "Voter turnout in different regions",
    subtitle = "Showing three types of skewness",
    x = "Voter turnout (%)",
    y = "Number of districts"
  ) +
  theme_minimal()

# Box plot
p4 <- ggplot(df_turnout, aes(x = region, y = turnout, fill = region)) +
  geom_boxplot() +
  labs(
    title = "Comparison of turnout distributions across regions",
    x = "Region",
    y = "Voter turnout (%)"
  ) +
  theme_minimal() +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))

# Display plots
grid.arrange(p3, p4, ncol = 2, widths = c(2, 1))
Interpretation Guide
Positive Skewness (> 0): Distribution has a longer right tail - most values are concentrated on the left side
Negative Skewness (< 0): Distribution has a longer left tail - most values are concentrated on the right side
Zero Skewness: Distribution is approximately symmetric - values are evenly distributed around the mean
Kurtosis
Definition
Kurtosis measures the “tailedness” of a distribution, indicating the presence of extreme values compared to a normal distribution.
Mathematical Expression
K = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^n \left(\frac{x_i - \bar{x}}{s}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}
This bias-corrected formula measures excess kurtosis (approximately 0 for a normal distribution). The kurtosis() function from the moments package used below instead reports raw kurtosis, for which a normal distribution has a value of about 3; that convention is the one used in the interpretation guide at the end of this subsection.
Simplified Numerical Example
# Three example datasets with different levels of kurtosis
# 1. Leptokurtic distribution (high kurtosis, "heavy tails")
leptokurtic_data <- c(
  rnorm(80, mean = 50, sd = 5),   # Most data clustered around the mean
  c(20, 25, 30, 70, 75, 80)       # A few extreme values
)

# 2. Platykurtic distribution (low kurtosis, "flat")
platykurtic_data <- c(
  runif(50, min = 30, max = 70)   # Uniform distribution of values
)

# 3. Mesokurtic distribution (normal kurtosis)
mesokurtic_data <- rnorm(50, mean = 50, sd = 10)

# Calculate kurtosis
kurtosis_lepto <- kurtosis(leptokurtic_data)
kurtosis_platy <- kurtosis(platykurtic_data)
kurtosis_meso <- kurtosis(mesokurtic_data)

# Summary of results
kurtosis_data <- data.frame(
  "Distribution Type" = c("Leptokurtic", "Platykurtic", "Mesokurtic"),
  "Kurtosis value" = round(c(kurtosis_lepto, kurtosis_platy, kurtosis_meso), 3),
  "Interpretation" = c(
    "Many values near the mean, but also more extreme values",
    "Values more uniformly distributed - flat distribution",
    "Similar to normal distribution"
  )
)

# Display table
kurtosis_data
Distribution.Type Kurtosis.value
1 Leptokurtic 7.39
2 Platykurtic 1.85
3 Mesokurtic 2.25
Interpretation
1 Many values near the mean, but also more extreme values
2 Values more uniformly distributed - flat distribution
3 Similar to normal distribution
Visualizations of Kurtosis Levels
# Create a data frame for all sets
df_kurtosis <- rbind(
  data.frame(value = leptokurtic_data, type = "Leptokurtic (K > 3)", kurtosis = round(kurtosis_lepto, 2)),
  data.frame(value = platykurtic_data, type = "Platykurtic (K < 3)", kurtosis = round(kurtosis_platy, 2)),
  data.frame(value = mesokurtic_data, type = "Mesokurtic (K ≈ 3)", kurtosis = round(kurtosis_meso, 2))
)

# Histograms for three types of kurtosis
p5 <- ggplot(df_kurtosis, aes(x = value)) +
  geom_histogram(bins = 15, fill = "lightgreen", color = "darkgreen", alpha = 0.7) +
  facet_wrap(~type, scales = "free_y") +
  geom_text(data = unique(df_kurtosis[, c("type", "kurtosis")]),
            aes(x = Inf, y = Inf, label = paste("K =", kurtosis)),
            hjust = 1.1, vjust = 1.5, size = 3.5) +
  labs(
    title = "Histograms showing different levels of kurtosis",
    x = "Value",
    y = "Frequency"
  ) +
  theme_minimal()

# Box plots
p6 <- ggplot(df_kurtosis, aes(x = type, y = value, fill = type)) +
  geom_boxplot() +
  scale_fill_manual(values = c("lightgreen", "lightsalmon", "skyblue")) +
  labs(
    title = "Box plots for different levels of kurtosis",
    x = "Distribution type",
    y = "Value"
  ) +
  theme_minimal() +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))

# Display plots
grid.arrange(p5, p6, nrow = 2)
Example: Parliamentary Voting Analysis
# Generate three datasets reflecting different levels of kurtosis
set.seed(456)

# 1. Leptokurtic distribution - typical for votes with strong party discipline
lepto_voting <- c(
  rnorm(150, mean = 75, sd = 3),               # Most votes with high agreement
  c(20, 25, 30, 35, 40, 95, 96, 97, 98, 99)    # A few outlier votes
)

# 2. Platykurtic distribution - typical for controversial votes
platy_voting <- c(
  runif(80, min = 40, max = 60),   # Votes with moderate agreement
  runif(80, min = 60, max = 80)    # Votes with higher agreement
)

# 3. Mesokurtic distribution - typical for normal votes
meso_voting <- rnorm(160, mean = 65, sd = 10)

# Create data frame
df_voting <- rbind(
  data.frame(agreement = lepto_voting, bill_type = "Bills A: Leptokurtic"),
  data.frame(agreement = platy_voting, bill_type = "Bills B: Platykurtic"),
  data.frame(agreement = meso_voting, bill_type = "Bills C: Mesokurtic")
)

# Calculate kurtosis for each bill type
bill_kurtosis <- df_voting %>%
  group_by(bill_type) %>%
  summarise(kurtosis = round(kurtosis(agreement), 2))

# Histogram of voting agreement
p7 <- ggplot(df_voting, aes(x = agreement)) +
  geom_histogram(bins = 20, fill = "lightgreen", color = "darkgreen", alpha = 0.7) +
  facet_wrap(~bill_type, ncol = 1) +
  geom_text(data = bill_kurtosis,
            aes(x = Inf, y = Inf, label = paste("K =", kurtosis)),
            hjust = 1.1, vjust = 1.5, size = 3.5) +
  labs(
    title = "Voting agreement for different types of bills",
    subtitle = "Showing three levels of kurtosis",
    x = "Voting agreement index (%)",
    y = "Number of votes"
  ) +
  theme_minimal()

# Box plot
p8 <- ggplot(df_voting, aes(x = bill_type, y = agreement, fill = bill_type)) +
  geom_boxplot() +
  labs(
    title = "Comparison of voting agreement distributions",
    x = "Bill type",
    y = "Voting agreement index (%)"
  ) +
  theme_minimal() +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))

# Display plots
grid.arrange(p7, p8, ncol = 2, widths = c(2, 1))
Interpretation Guide
Leptokurtic (K > 3): “Slender” distribution with heavy tails - more extreme values than in a normal distribution
Platykurtic (K < 3): “Flat” distribution - fewer extreme values than in a normal distribution
Mesokurtic (K ≈ 3): Distribution similar to normal in terms of extreme values
11.10 Exercise 1. Center and dispersion of data
Data
We have salary data (in thousands of euros) from two small European companies:
| Index | Company X | Company Y |
|---|---|---|
| 1 | 2 | 3 |
| 2 | 2 | 3 |
| 3 | 2 | 4 |
| 4 | 3 | 4 |
| 5 | 3 | 4 |
| 6 | 3 | 4 |
| 7 | 3 | 4 |
| 8 | 3 | 4 |
| 9 | 3 | 5 |
| 10 | 4 | 5 |
| 11 | 4 | 5 |
| 12 | 4 | 5 |
| 13 | 4 | 5 |
| 14 | 4 | 5 |
| 15 | 5 | 6 |
| 16 | 5 | 6 |
| 17 | 5 | 6 |
| 18 | 5 | 7 |
| 19 | 20 | 7 |
| 20 | 35 | 8 |
This table presents the data for both Company X and Company Y side by side, with an index column for easy reference.
Measures of Central Tendency
Mean
The mean is the average of all values in a dataset.
Formula: \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
This formula can also be written as:
\bar{x} = \frac{\sum_{i=1}^{k} x_i f_i}{n}
where f_i is the absolute frequency (the number of occurrences, an absolute weight) of the i-th value, and k is the number of distinct values of the variable.
Using relative frequencies:
\bar{x} = \sum_{i=1}^{k} x_i p_i
where p_i is the relative frequency (proportion, normalized weight) of the i-th value, and k is the number of distinct values of the variable.
Manual Calculation for Company X
| Value (x_i) | Frequency (f_i) | x_i \cdot f_i |
|---|---|---|
| 2 | 3 | 6 |
| 3 | 6 | 18 |
| 4 | 5 | 20 |
| 5 | 4 | 20 |
| 20 | 1 | 20 |
| 35 | 1 | 35 |
| Total | n = 20 | Sum = 119 |
\bar{x} = \frac{119}{20} = 5.95
Manual Calculation for Company Y
| Value (x_i) | Frequency (f_i) | x_i \cdot f_i |
|---|---|---|
| 3 | 2 | 6 |
| 4 | 6 | 24 |
| 5 | 6 | 30 |
| 6 | 3 | 18 |
| 7 | 2 | 14 |
| 8 | 1 | 8 |
| Total | n = 20 | Sum = 100 |
\bar{y} = \frac{100}{20} = 5
R Verification
X <- c(2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 20, 35)
Y <- c(3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8)
mean(X)
[1] 5.95
mean(Y)
[1] 5
Median
The median is the middle value of the ordered data. For Company X the median is 4 and for Company Y it is 5 (both datasets have n = 20, so the median is the average of the 10th and 11th ordered values).
Variance and Bessel's Correction
Bessel's correction is applied when computing the sample variance in order to obtain an unbiased estimator of the population variance: in the standard sample variance formula we divide by (n - 1) instead of n. For grouped data (a frequency series), the same formula is applied with each squared deviation weighted by its frequency.
Quartiles
Company X: Q1 (25th percentile) = median of the first 10 ordered values = 3; Q2 (median) = 4; Q3 (75th percentile) = median of the last 10 ordered values = 5
Company Y: Q1 (25th percentile) = median of the first 10 ordered values = 4; Q2 (median) = 5; Q3 (75th percentile) = median of the last 10 ordered values = 6
R Verification
quantile(X)
0% 25% 50% 75% 100%
2 3 4 5 35
quantile(Y)
0% 25% 50% 75% 100%
3 4 5 6 8
IQR
IQR_x = 5 - 3 = 2
IQR_y = 6 - 4 = 2
Tukey Box Plot
A Tukey box plot visually represents the distribution of data based on quartiles. We’ll use ggplot2 to create the plot.
library(ggplot2)
library(tidyr)

# Prepare the data
data <- data.frame(
  Company = rep(c("X", "Y"), each = 20),
  Salary = c(X, Y)
)

# Create the box plot
ggplot(data, aes(x = Company, y = Salary, fill = Company)) +
  geom_boxplot() +
  labs(title = "Salary Distribution in Companies X and Y",
       x = "Company",
       y = "Salary (thousands of euros)") +
  theme_minimal() +
  scale_fill_manual(values = c("X" = "#69b3a2", "Y" = "#404080"))
# Create the box plot again, this time without drawing the outlier points
ggplot(data, aes(x = Company, y = Salary, fill = Company)) +
  geom_boxplot(outliers = FALSE) +
  labs(title = "Salary Distribution in Companies X and Y",
       x = "Company",
       y = "Salary (thousands of euros)") +
  theme_minimal() +
  scale_fill_manual(values = c("X" = "#69b3a2", "Y" = "#404080"))
Interpreting the Box Plot
The box represents the interquartile range (IQR) from Q1 to Q3.
The line inside the box is the median (Q2).
Whiskers extend to the smallest and largest values that lie within 1.5 × IQR of the box edges (Q1 and Q3).
Points beyond the whiskers are considered outliers.
Comparison of Results
Measure
Company X
Company Y
Mean
5.95
5.00
Median
4
5
Mode
3
4 and 5
Variance
61.21
1.79
Standard Deviation
7.82
1.34
Q1
3
4
Q3
5
6
Key Observations:
Central Tendency: Company X has a higher mean but lower median than Company Y, indicating a right-skewed distribution for Company X.
Dispersion: Company X shows much higher variance and standard deviation, suggesting greater salary disparities.
Distribution Shape: Company Y’s salaries are more tightly clustered, while Company X has extreme values (potential outliers) that significantly affect its mean and variance.
Quartiles: The interquartile ranges (Q3 - Q1) of the two companies are equal (both 2), but Company Y's overall range is much smaller than Company X's.
11.11 Exercise 2. Comparing Electoral District Size Variation Between Countries
Data
We have electoral district size data from two countries:
x <- c(1, 3, 5, 7, 9, 11, 13, 15, 17, 19)    # Country with high variance
y <- c(8, 9, 9, 10, 10, 11, 11, 12, 12, 13)  # Country with low variance

kable(data.frame(
  "Country X (High var.)" = x,
  "Country Y (Low var.)" = y
))
| Country X (High var.) | Country Y (Low var.) |
|---|---|
| 1 | 8 |
| 3 | 9 |
| 5 | 9 |
| 7 | 10 |
| 9 | 10 |
| 11 | 11 |
| 13 | 11 |
| 15 | 12 |
| 17 | 12 |
| 19 | 13 |
Measures of Central Tendency
Arithmetic Mean
Formula: \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
Calculations for Country X
| Element | Value |
|---|---|
| 1 | 1 |
| 2 | 3 |
| 3 | 5 |
| 4 | 7 |
| 5 | 9 |
| 6 | 11 |
| 7 | 13 |
| 8 | 15 |
| 9 | 17 |
| 10 | 19 |
| Sum | 100 |
\bar{x} = \frac{100}{10} = 10
mean_x <- mean(x)
c("Manual" = 10, "R" = mean_x)
Manual R
10 10
Calculations for Country Y
| Element | Value |
|---|---|
| 1 | 8 |
| 2 | 9 |
| 3 | 9 |
| 4 | 10 |
| 5 | 10 |
| 6 | 11 |
| 7 | 11 |
| 8 | 12 |
| 9 | 12 |
| 10 | 13 |
| Sum | 105 |
\bar{y} = \frac{105}{10} = 10.5
mean_y <- mean(y)
c("Manual" = 10.5, "R" = mean_y)
Manual R
10.5 10.5
Median
The median is the middle value in an ordered dataset.
Calculations for Country X
Ordered data: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19
For n = 10 (even number of observations): Middle positions: 5 and 6 Middle values: 9 and 11
Median = \frac{9 + 11}{2} = 10
median_x <- median(x)
c("Manual" = 10, "R" = median_x)
Manual R
10 10
Calculations for Country Y
Ordered data: 8, 9, 9, 10, 10, 11, 11, 12, 12, 13
For n = 10 (even number of observations): middle positions 5 and 6, middle values 10 and 11
Median = \frac{10 + 11}{2} = 10.5
Visualization
# Coefficients of variation (sd / mean * 100); these are referenced in the plot subtitle
# but their definition is not shown earlier in the chunk
cv_x <- sd(x) / mean(x) * 100
cv_y <- sd(y) / mean(y) * 100

df_long <- data.frame(
  country = rep(c("X", "Y"), each = 10),
  size = c(x, y)
)

# Basic plot
p <- ggplot(df_long, aes(x = country, y = size, fill = country)) +
  geom_boxplot(outlier.shape = NA) +      # Disable default outlier points
  geom_jitter(width = 0.2, alpha = 0.5) + # Add points with transparency
  scale_fill_manual(values = c("X" = "#FFA07A", "Y" = "#98FB98")) +
  labs(
    title = "Comparison of Electoral District Size Variation",
    subtitle = paste("CV: Country X =", round(cv_x, 1), "%, Country Y =", round(cv_y, 1), "%"),
    x = "Country",
    y = "District Size"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

# Add quartile annotations
p + annotate("text",
             x = c(1, 1, 1, 2, 2, 2),
             y = c(max(x) + 1, mean(x), min(x) - 1, max(y) + 1, mean(y), min(y) - 1),
             label = c(
               paste("Q3 =", quantile(x, 0.75, type = 1)),
               paste("M =", median(x)),
               paste("Q1 =", quantile(x, 0.25, type = 1)),
               paste("Q3 =", quantile(y, 0.75, type = 1)),
               paste("M =", median(y)),
               paste("Q1 =", quantile(y, 0.25, type = 1))
             ),
             size = 3)
Methodological Notes
Quartile Calculations:
The median-excluding method used may give different results than R’s default functions
Differences in calculation methods don’t affect overall conclusions
Always important to specify the method used in reports
Visualization:
Box plot effectively shows differences in distributions
Additional points show actual values
Annotations facilitate interpretation
Application Notes
Using the Analysis:
All calculations can be reproduced using the provided R code
Code chunks are self-contained and documented
Data format requirements are clearly specified
Customization:
Analysis can be adapted for different district size datasets
Visualization parameters can be adjusted for different presentation needs
Statistical methods can be modified based on specific requirements
Conclusion
Summary Statistics Comparison
| Measure | Country X | Country Y | Relative Difference |
|---|---|---|---|
| Mean | 10.0 | 10.5 | Similar |
| Median | 10.0 | 10.5 | Similar |
| Mode | None | Multiple (9, 10, 11, 12) | - |
| Range | 18 | 5 | 3.6× larger in X |
| Variance | 36.67 | 2.5 | 14.7× larger in X |
| IQR | 10 | 3 | 3.3× larger in X |
| CV | 60.6% | 15.0% | 4.0× larger in X |
Distribution Characteristics
Country X:
Uniform distribution pattern
No dominant district size (no mode)
Wide range: 1 to 19 seats
High variability (CV = 60.6%)
Even spread of values across the range
Country Y:
Clustered distribution pattern
Multiple common sizes (four modes)
Narrow range: 8 to 13 seats
Low variability (CV = 15.0%)
Values concentrated around the mean
Box Plot Interpretation
The box plot visualization reveals:
Structure Elements:
Box: Shows interquartile range (IQR)
Lower edge: First quartile (Q1)
Upper edge: Third quartile (Q3)
Internal line: Median (Q2)
Whiskers: Extend up to 1.5 × IQR beyond the box edges
Points: Individual district sizes
Key Visual Findings:
Box Size:
Country X: Large box indicates wide spread of middle 50%
Country Y: Small box shows tight clustering of middle values
Whisker Length:
Country X: Long whiskers indicate broad overall distribution
Country Y: Short whiskers show limited total spread
Point Distribution:
Country X: Points widely dispersed
Country Y: Points densely clustered
Key Observations
Central Tendency:
Similar average district sizes
Different distribution patterns
Distinct approaches to standardization
Variability Measures:
All metrics show Country X with 3-15 times more variation
Consistent pattern across different statistical measures
Systematic difference in district design
System Design:
Country X: Flexible, varied approach
Country Y: Standardized, uniform approach
Different philosophical approaches to representation
Representative Implications:
Country X: Variable voter-to-representative ratios
Country Y: More consistent representation levels
Different approaches to democratic representation
This analysis demonstrates fundamental differences in electoral system design between the two countries, with Country X adopting a more varied approach and Country Y maintaining greater uniformity in district sizes.
11.12 Exercise 3. Voter Participation and Economic Prosperity
An analysis of the relationship between a district's economic prosperity and voter turnout in Amsterdam's districts, based on data from the 2022 municipal elections.
Pearson's product-moment correlation
data: dane$dochod and dane$frekwencja
t = 16, df = 3, p-value = 0.0005
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9117 0.9996
sample estimates:
cor
0.9942
Part 3: OLS Regression Model
# Fit the OLS model
model <- lm(frekwencja ~ dochod, data = dane)

# Model summary
summary(model)
Call:
lm(formula = frekwencja ~ dochod, data = dane)
Residuals:
1 2 3 4 5
-1.949 0.336 0.510 0.620 0.482
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.8965 3.9673 -0.23 0.83575
dochod 1.2569 0.0782 16.07 0.00052 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.26 on 3 degrees of freedom
Multiple R-squared: 0.989, Adjusted R-squared: 0.985
F-statistic: 258 on 1 and 3 DF, p-value: 0.000524
Visualization
# Scatter plot with regression line
ggplot(dane, aes(x = dochod, y = frekwencja)) +
  geom_point(size = 4, color = "blue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  geom_text(aes(label = dzielnica), vjust = -1) +
  labs(
    title = "Income vs voter turnout",
    x = "Income (thousands of €)",
    y = "Voter turnout (%)"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Conclusions
The analysis shows a strong positive relationship between a district's economic prosperity and voter turnout: residents of higher-income districts participate in local elections more often.
Note: the very small sample size (n = 5) limits how far these results can be generalized.
11.13 Exercise 4. Understanding Boxplots Through Life Expectancy Data
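The summary table below is based on life expectancy data by continent for 1957 and 2007. The chunk that builds summary_stats is not shown in this section; a minimal sketch of how it could be constructed, assuming the gapminder dataset and dplyr, is:
library(gapminder)
library(dplyr)

# Assumed reconstruction: quartile summary plus a Tukey outlier count,
# grouped by continent and year
summary_stats <- gapminder %>%
  filter(year %in% c(1957, 2007)) %>%
  group_by(continent, year) %>%
  summarise(
    median = median(lifeExp),
    q1 = quantile(lifeExp, 0.25),
    q3 = quantile(lifeExp, 0.75),
    iqr = IQR(lifeExp),
    n_outliers = sum(lifeExp < q1 - 1.5 * iqr | lifeExp > q3 + 1.5 * iqr)
  )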
`summarise()` has grouped output by 'continent'. You can override using the
`.groups` argument.
knitr::kable(summary_stats, digits = 1,
             caption = "Summary Statistics by Continent and Year")
Summary Statistics by Continent and Year
| continent | year | median | q1 | q3 | iqr | n_outliers |
|---|---|---|---|---|---|---|
| Africa | 1957 | 40.6 | 37.4 | 44.8 | 7.4 | 1 |
| Africa | 2007 | 52.9 | 47.8 | 59.4 | 11.6 | 0 |
| Americas | 1957 | 56.1 | 48.6 | 62.6 | 14.0 | 0 |
| Americas | 2007 | 72.9 | 71.8 | 76.4 | 4.6 | 1 |
| Asia | 1957 | 48.3 | 41.9 | 54.1 | 12.2 | 0 |
| Asia | 2007 | 72.4 | 65.5 | 75.6 | 10.2 | 1 |
| Europe | 1957 | 67.7 | 65.0 | 69.2 | 4.2 | 2 |
| Europe | 2007 | 78.6 | 75.0 | 79.8 | 4.8 | 0 |
| Oceania | 1957 | 70.3 | 70.3 | 70.3 | 0.0 | 0 |
| Oceania | 2007 | 80.7 | 80.5 | 81.0 | 0.5 | 0 |
11.17 Key Learning Points
Distribution Center:
Median shows the typical life expectancy
Changes in median reflect overall improvements
Spread and Variation:
IQR (box height) indicates data dispersion
Wider boxes suggest more inequality in life expectancy
Outliers and Extremes:
Outliers often represent countries with unique circumstances
Time Comparison:
Shows both absolute improvements and changes in variation
Highlights persistent regional disparities
Reveals different rates of progress across continents
11.18 Appendix: Summary Tables for Data Types and Applicable Statistical Measures
Table 1: Pros and Cons of Various Statistical Measures
Measures of Center
| Measure | Pros | Cons | Applicable to |
|---|---|---|---|
| Mean | Uses all data points; allows for further statistical calculations; ideal for normally distributed data | Sensitive to outliers; not ideal for skewed distributions; not meaningful for nominal data | Interval, Ratio, some Discrete, Continuous |
| Median | Not affected by outliers; good for skewed distributions; can be used with ordinal data | Ignores the actual values of most data points; less useful for further statistical analyses | Ordinal, Interval, Ratio, Discrete, Continuous |
| Mode | Can be used with any data type; good for finding the most common category | May not be unique (multimodal); not useful for many types of analyses; ignores magnitude of differences between values | All types |
Measures of Variability
| Measure | Pros | Cons | Applicable to |
|---|---|---|---|
| Range | Simple to calculate and understand; gives a quick idea of data spread | Very sensitive to outliers; ignores all data between the extremes; not useful for further statistical analyses | Ordinal, Interval, Ratio, Discrete, Continuous |
| Interquartile Range (IQR) | Not affected by outliers; good for skewed distributions | Ignores 50% of the data; less intuitive than the range | Ordinal, Interval, Ratio, Discrete, Continuous |
| Variance | Uses all data points; basis for many statistical procedures | Sensitive to outliers; units are squared (less intuitive) | Interval, Ratio, some Discrete, Continuous |
| Standard Deviation | Uses all data points; same units as original data; widely used and understood | Sensitive to outliers; assumes roughly normal distribution for interpretation | Interval, Ratio, some Discrete, Continuous |
| Coefficient of Variation | Allows comparison between datasets with different units or means | Can be misleading when means are close to zero; not meaningful for data with negative values | Ratio, some Interval |
Measures of Correlation/Association
| Measure | Pros | Cons | Applicable to |
|---|---|---|---|
| Pearson's r | Measures linear relationship; widely used and understood | Assumes normal distribution; sensitive to outliers; only captures linear relationships | Interval, Ratio, Continuous |
| Spearman's rho | Can be used with ordinal data; captures monotonic relationships; less sensitive to outliers | Loses information by converting to ranks; may miss some types of relationships | Ordinal, Interval, Ratio |
| Kendall's tau | Can be used with ordinal data; more robust than Spearman's for small samples; has a nice interpretation (probability of concordance) | Loses information by only considering order; computationally more intensive | Ordinal, Interval, Ratio |
| Chi-square | Can be used with nominal data; tests independence of categorical variables | Requires large sample sizes; sensitive to sample size; doesn't measure strength of association | Nominal, Ordinal |
| Cramér's V | Can be used with nominal data; provides measure of strength of association; normalized to [0,1] range | Interpretation can be subjective; may overestimate association in small samples | Nominal, Ordinal |
Statistical Measures Applicability / Zastosowanie miar statystycznych
| Measure (EN) | Miara (PL) | Nominal | Ordinal | Interval | Ratio |
|---|---|---|---|---|---|
| Central Tendency / Tendencja centralna: | | | | | |
| Mode | Dominanta | ✓ | ✓ | ✓ | ✓ |
| Median | Mediana | - | ✓ | ✓ | ✓ |
| Arithmetic Mean | Średnia arytmetyczna | - | - | ✓* | ✓ |
| Geometric Mean | Średnia geometryczna | - | - | - | ✓ |
| Harmonic Mean | Średnia harmoniczna | - | - | - | ✓ |
| Dispersion / Rozproszenie: | | | | | |
| Range | Rozstęp | - | ✓ | ✓ | ✓ |
| Interquartile Range | Rozstęp międzykwartylowy | - | ✓ | ✓ | ✓ |
| Mean Absolute Deviation | Średnie odchylenie bezwzględne | - | - | ✓ | ✓ |
| Variance | Wariancja | - | - | ✓* | ✓ |
| Standard Deviation | Odchylenie standardowe | - | - | ✓* | ✓ |
| Coefficient of Variation | Współczynnik zmienności | - | - | - | ✓ |
| Association / Współzależność: | | | | | |
| Chi-square | Chi-kwadrat | ✓ | ✓ | ✓ | ✓ |
| Spearman Correlation | Korelacja Spearmana | - | ✓ | ✓ | ✓ |
| Kendall's Tau | Tau Kendalla | - | ✓ | ✓ | ✓ |
| Pearson Correlation | Korelacja Pearsona | - | - | ✓* | ✓ |
| Covariance | Kowariancja | - | - | ✓* | ✓ |
* Theoretically problematic but commonly used in practice / Teoretycznie problematyczne, ale powszechnie stosowane w praktyce
Notes / Uwagi:
Measurement Scales / Skale pomiarowe:
Nominal: Categories without order / Kategorie bez uporządkowania
Ordinal: Ordered categories / Kategorie uporządkowane
Interval: Equal intervals, arbitrary zero / Równe interwały, umowne zero
Ratio: Equal intervals, absolute zero / Równe interwały, absolutne zero
Practical Considerations / Aspekty praktyczne:
Some measures marked with ✓* are commonly used for interval data despite theoretical issues / Niektóre miary oznaczone ✓* są powszechnie stosowane dla danych przedziałowych pomimo problemów teoretycznych
Choice of measure should consider both theoretical appropriateness and practical utility / Wybór miary powinien uwzględniać zarówno poprawność teoretyczną jak i użyteczność praktyczną
More restrictive scales (ratio) allow all measures from less restrictive scales / Bardziej restrykcyjne skale (ilorazowe) pozwalają na wszystkie miary z mniej restrykcyjnych skal