3  Understanding Data Types in Social Sciences

In social science research, understanding the nature of our data is crucial for selecting appropriate analysis methods and drawing valid conclusions.

3.1 Foundations in Number Sets

Before diving into data types, it’s essential to understand the basic number sets that form the foundation of our understanding of data.

Basic Number Sets

  1. Natural Numbers (ℕ): The counting numbers {0, 1, 2, 3, …}
  2. Integers (ℤ): Includes natural numbers, their negatives, and zero {…, -2, -1, 0, 1, 2, …}
  3. Rational Numbers (ℚ): Numbers that can be expressed as a fraction of two integers
  4. Real Numbers (ℝ): All numbers on the number line, including rationals and irrationals

Properties of Sets

  1. Countable Sets: Sets whose elements can be put in a one-to-one correspondence with the natural numbers. For example, the set of integers is countable.

  2. Uncountable Sets: Sets that are not countable. The set of real numbers is uncountable.

  3. Discrete Sets: Sets where each element is separated from other elements by a finite gap. The integers form a discrete set.

  4. Dense Sets: Sets where between any two elements, there is always another element of the set. The rational numbers and real numbers are dense sets.

Note

Understanding these set properties is crucial for grasping the nature of different data types in social sciences.

3.2 Discrete vs. Continuous Data

In data science and statistics, we often categorize variables as either discrete or continuous. However, the distinction is not always clear-cut, and some variables exhibit characteristics of both types. This section explores the concepts of discrete and continuous data, their differences, and the interesting cases of variables that can be treated as both or challenge our intuitive understanding.

https://individual-psychometrics.rbind.io/

Discrete Data

Discrete data can only take on specific, countable values. These values are often (but not always) integers.

Characteristics of Discrete Data:

  • Countable
  • Often represented by integers
  • Can be finite or infinite
  • No values between two adjacent data points

Examples:

  1. Number of students in a class
  2. Number of cars sold by a dealership
  3. Shoe sizes

Continuous Data

Continuous data can take on any value within a given range, including fractional and decimal values. It’s important to note that continuity is not solely determined by uncountability, but also by density.

Characteristics of Continuous Data:

  • Can be uncountable (like real numbers) or dense (like rational numbers)
  • Can be measured to any level of precision (theoretically)
  • Represented by real numbers or dense subsets of real numbers
  • There are always values between any two data points

Examples:

  1. Height
  2. Weight
  3. Temperature
  4. Percentages (explained further below)

The Discrete-Continuous Spectrum

In practice, some variables that are mathematically discrete are often treated as if they are continuous. This dual nature provides flexibility in how these variables can be analyzed and interpreted.

Reasons for Treating Discrete Data as Continuous:

  1. Dense Granularity
    • When a discrete variable has a large number of possible values within a range, it can approximate continuity.
    • Example: Income measured in individual cents. While technically discrete, the large number of possible values makes it behave similarly to a continuous variable.
  2. Analytical Convenience
    • Continuous methods often yield reasonable and useful results even for dense discrete variables.
    • It’s often easier to use existing statistical tools if continuity is assumed, as this allows the use of calculus-based methods.
  3. Approximation of Underlying Phenomena
    • In some cases, a discrete measurement might be an approximation of an underlying continuous process.
    • Example: While we measure time in discrete units (seconds, minutes, hours), time itself is continuous.

Examples of Variables with Dual Discrete-Continuous Nature:

  1. Age
    • Discrete: Typically measured in whole years
    • Continuous: Can be considered as a continuous variable in many analyses, especially when dealing with large populations
  2. Price and Income
    • Discrete: Prices and incomes are actually measured in discrete units (e.g., cents or smallest currency unit)
    • Continuous: In economic models and many analyses, prices and incomes are treated as continuous variables due to their dense nature and analytical convenience
  3. Test Scores
    • Discrete: Often given as whole numbers
    • Continuous: In statistical analyses, test scores might be treated as continuous, especially when the range of possible scores is large

Special Case: Percentages and Rational Numbers

Percentages present an interesting case in the discrete-continuous spectrum:

  1. Rational Nature: Percentages are essentially fractions (m/100), making them rational numbers.
  2. Dense but Countable: The set of rational numbers is dense (between any two rationals, there’s another rational) but also countable.
  3. Practical Continuity: In most practical applications, percentages are treated as continuous due to their dense nature.
  4. Finite Precision: In reality, percentages are often reported to a limited number of decimal places, creating a finite set of possible values.
Percentages: Bridging Discrete and Continuous

Variables measured in percentages, such as unemployment rates or voter turnout, challenge our intuitive understanding of discreteness and continuity:

  1. They are rational numbers (fractions with denominator 100), which are technically countable.
  2. They form a dense set within their range (0% to 100%), allowing for values between any two percentages.
  3. In practice, they are often treated as continuous variables due to their dense nature and analytical convenience.
  4. The precision of measurement (e.g., reporting to one or two decimal places) can impose a discrete structure on what is conceptually a dense set.

This duality allows for flexible analytical approaches, depending on the specific research context and required precision.

Implications for Data Analysis

Understanding the nuanced nature of variables as discrete, continuous, or somewhere in between has important implications for data analysis:

  1. Flexibility in Modeling: It allows for the use of a wider range of statistical techniques.
  2. Simplified Calculations: Treating dense discrete data as continuous can simplify calculations and make certain analyses more tractable.
  3. Improved Interpretability: In some cases, treating discrete data as continuous can lead to more intuitive or useful interpretations of results.
  4. Potential for Error: It’s important to be aware of when approximations are appropriate and when they might lead to misleading results.
  5. Theoretical vs. Practical Considerations: While the mathematical nature of the data is important, practical considerations in measurement and analysis often guide how we treat variables.

Conclusion

The distinction between discrete and continuous data is not always rigid in social sciences. Many variables, including those involving money, percentages, or dense measurements, can be viewed through both discrete and continuous lenses. The choice of treatment should be guided by the nature of the data, the goals of the analysis, and the potential implications of the choice. This flexibility, when used thoughtfully, provides powerful tools for social science researchers to gain insights from their data.

Discrete vs. Continuous Numerical Data: A Language-Based Analogies

The Language Connection

Think about how you naturally ask questions about quantities:

  • “How many cookies are in the jar?” (counting)
  • “How much water is in the glass?” (measuring)

This natural language distinction reflects the two fundamental types of numerical data:

Discrete Data = “How Many?” Questions

  • Like counting whole objects (countable nouns)

  • Takes specific values with gaps between them

  • Examples:

    • Number of pets: 0, 1, 2, 3… (can’t have 2.5 pets)
    • Dice rolls: 1, 2, 3, 4, 5, 6
    • Students in a class: 20, 21, 22…

🤔 Self-Check: Can you find a value between 2 and 3 students? Why not?

Continuous Data = “How Much?” Questions

  • Like measuring quantities (uncountable nouns)

  • Can take any value within a range

  • Examples:

    • Height: 1.7231… meters
    • Temperature: 36.8325… °C
    • Time: 3.5792… hours

🤔 Self-Check: Write down three different values between 1.72 and 1.73 meters

Quick Recognition Guide

  • If you naturally ask “How many?” → Discrete
  • If you naturally ask “How much?” → Continuous
  • If you can measure it more precisely → Continuous
  • If you can only use whole numbers → Discrete

✍️ Practice: Classify these quantities as discrete or continuous

  • Your age in years: _____
  • Your height: _____
  • Number of songs in a playlist: _____
  • Volume of water: _____

3.3 Introduction to Stevens’ Data Typology

Stanley S. Stevens, an American psychologist, introduced a classification system for scales of measurement in his 1946 paper “On the Theory of Scales of Measurement.” This system, known as Stevens’ data typology or levels of measurement, has become fundamental in understanding how different types of data should be analyzed and interpreted.

Stevens proposed four levels of measurement:

  1. Nominal
  2. Ordinal
  3. Interval
  4. Ratio

Each level has specific properties and allows for different types of statistical operations and analyses.

https://individual-psychometrics.rbind.io/

Nominal Scale

Definition

The nominal scale is the most basic level of measurement. It uses labels or categories to classify data without any quantitative value or order.

Properties

  • Categories are mutually exclusive
  • No inherent order among categories
  • No meaningful arithmetic operations can be performed

Examples

  • Nationality (Polish, English, …)
  • Blood types (A, B, AB, O)
  • Eye color (Blue, Brown, Green, Hazel)
  • Binary variables (“Success” versus “Failure”)

Ordinal Scale

Definition

The ordinal scale categorizes data into ordered categories, but the intervals between categories are not necessarily equal or meaningful.

Properties

  • Categories have a defined order
  • Differences between categories are not quantifiable
  • Arithmetic operations on the numbers are not meaningful

Examples

  • Education levels (High School, Bachelor’s, Master’s, PhD)
  • Likert scales (Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree)
  • Socioeconomic status (Low, Medium, High)

Interval Scale

Definition

The interval scale has ordered categories with equal intervals between adjacent categories. However, it lacks a true zero point.

Properties

  • Equal intervals between adjacent categories
  • No true zero point (zero is arbitrary)
  • Ratios between values are not meaningful

Examples

  • Temperature in Celsius or Fahrenheit
  • Calendar years
  • pH scale (the difference between pH 4 and 5 represents the same change in hydrogen ion concentration as between pH 6 and 7)
  • Elevation above sea level

Ratio Scale

Definition

The ratio scale is the highest level of measurement. It has all the properties of the interval scale plus a true zero point, making ratios between values meaningful.

Properties

  • All properties of interval scales
  • True zero point
  • Ratios between values are meaningful

Examples

  • Height
  • Weight
  • Age
  • Income
Why Some Statistics Work (and Others Don’t) for Interval Scales

Key Idea

An interval scale is one where the distances between values are meaningful, but the zero point is arbitrary. For interval scales (e.g., temperature):

  • Allowed: Addition/subtraction of values and multiplication/division by constants.
  • Not allowed: Multiplication/division of values from the scale by each other, as this leads to results without physical interpretation.

Properties of Interval Scales

  1. Equal intervals represent the same differences:
    • The difference between 20°C and 25°C (5°C) represents the same change as between 30°C and 35°C.
    • Proportions of differences are preserved: 10°C is twice the change of 5°C.
  2. The zero point is arbitrary:
    • 0°C is the freezing point of water, not the absence of temperature.
    • The same physical state has different values in different scales: 0°C = 32°F.
  3. Linear transformation:
    • General formula: y = ax + b, where a \neq 0.
    • For temperature: F = C \times \frac{9}{5} + 32.

Why the Arithmetic Mean Works

The arithmetic mean works because it relies on addition and division by a constant, which are allowed on interval scales. Example:

Data: 20°C and 30°C

Method 1: Mean in Celsius, then conversion
1. Mean: (20°C + 30°C) ÷ 2 = 25°C
2. Conversion: 25°C × (9/5) + 32 = 77°F

Method 2: Conversion to °F, then mean
1. Conversion: 20°C → 68°F, 30°C → 86°F
2. Mean: (68°F + 86°F) ÷ 2 = 77°F

Both methods give the same result! ✓

Mathematical proof of correctness: \begin{align} \bar{F} &= \frac{F_1 + F_2}{2} \\ &= \frac{(C_1 \times \frac{9}{5} + 32) + (C_2 \times \frac{9}{5} + 32)}{2} \\ &= \frac{(C_1 + C_2) \times \frac{9}{5} + 64}{2} \\ &= \left(\frac{C_1 + C_2}{2}\right) \times \frac{9}{5} + 32 \\ &= \bar{C} \times \frac{9}{5} + 32 \end{align}

Why Variance is Problematic

Variance is problematic because it relies on squared differences, leading to squared units (e.g., °C²) without clear physical interpretation. Example:

Same temperatures: 20°C and 30°C

Method 1: Variance in Celsius
1. Mean: 25°C
2. Deviations: (20 - 25)°C = -5°C, (30 - 25)°C = 5°C
3. Squared deviations: (-5°C)² = 25(°C)², (5°C)² = 25(°C)²
4. Mean: (25 + 25)(°C)² ÷ 2 = 25(°C)²

Method 2: Variance in Fahrenheit
1. Conversion: 20°C → 68°F, 30°C → 86°F
2. Mean: 77°F
3. Deviations: (68 - 77)°F = -9°F, (86 - 77)°F = 9°F
4. Squared deviations: (-9°F)² = 81(°F)², (9°F)² = 81(°F)²
5. Mean: (81 + 81)(°F)² ÷ 2 = 81(°F)²

Problem: 25(°C)² and 81(°F)² are not equivalent!

Mathematical analysis of the problem: \begin{align} (F_i - \bar{F})^2 &= [(C_i \times \frac{9}{5} + 32) - (\bar{C} \times \frac{9}{5} + 32)]^2 \\ &= [(C_i - \bar{C}) \times \frac{9}{5}]^2 \\ &= (C_i - \bar{C})^2 \times \left(\frac{9}{5}\right)^2 \end{align}

Theoretical Conclusions

  1. Allowed operations:
    • Addition/subtraction (preserves differences).
    • Multiplication/division by constants (scaling).
    • Arithmetic means.
    • Comparing temperature differences.
  2. Not allowed operations:
    • Multiplying temperatures by each other.
    • Dividing temperatures by each other.
    • Geometric means.
    • Coefficient of variation.
  3. Practical implications:
    • Variance and standard deviation require careful interpretation.
    • Better to use measures based on differences (e.g., MAD - mean absolute deviation).
    • When comparing variability, it is advisable to standardize the data.

Practical Rule

If your calculations involve multiplying values from an interval scale by each other, be particularly cautious in interpreting the results!

Proportions in Measurement Scales: The Case of Temperature

Two Types of Proportions

Proportions of values (DO NOT hold for interval scales):
Take 80°C and 20°C:
In Celsius: 80°C/20°C = 4
In Fahrenheit: 176°F/68°F ≈ 2.59
In Kelvin: 353.15K/293.15K ≈ 1.20

The same temperatures give different proportions! 
→ Proportions of values DO NOT make sense on interval scales; they only make sense on ratio scales.
Proportions of differences (hold for interval scales):
Take two pairs of differences:
Pair 1: 30°C - 20°C = 10°C
Pair 2: 80°C - 60°C = 20°C

Proportion of differences in Celsius:
20°C/10°C = 2

The same temperatures in Fahrenheit:
Pair 1: 86°F - 68°F = 18°F
Pair 2: 176°F - 140°F = 36°F

Proportion of differences in Fahrenheit:
36°F/18°F = 2

The proportion of differences is the same! ✓

Mathematical Explanation

For the transformation F = \frac{9}{5}C + 32:

  1. Proportions of values DO NOT hold: [ = ]

  2. Proportions of differences hold: [ = = ]

Why This Matters

  1. For values:
    • In Celsius: 40°C is not “twice as hot” as 20°C.
    • In Fahrenheit: 100°F is not “twice as hot” as 50°F.
    • Only in Kelvin do proportions of values have physical meaning.
  2. For differences:
    • An increase of 20°C is always twice as large as an increase of 10°C.
    • An increase of 36°F is always twice as large as an increase of 18°F.
    • Proportions of differences are scale-independent.

Implications for Statistics

  1. Operations based on differences (WORK):
    • Arithmetic mean.
    • Mean absolute deviation.
    • Range.
  2. Operations based on proportions of values (DO NOT WORK):
    • Geometric mean.
    • Coefficient of variation.
    • Variance (because it uses squared values).

Importance in Research and Analysis

Understanding Stevens’ data typology is crucial for several reasons:

  1. Choosing appropriate statistical tests: The level of measurement determines which statistical analyses are appropriate for a given dataset.

  2. Interpreting results: The meaning of statistical results depends on the level of measurement of the variables involved.

  3. Designing measurement instruments: When creating surveys or other measurement tools, researchers must consider the level of measurement they want to achieve.

  4. Data transformation: Sometimes, data can be transformed from one level to another, but this must be done carefully to avoid misinterpretation.

Controversies and Limitations

While Stevens’ typology is widely used, it has faced some criticisms:

  1. Rigidity: Some argue that the typology is too rigid and that many real-world measurements fall between these categories.

  2. Treatment of ordinal data: There’s ongoing debate about when it’s appropriate to treat ordinal data as interval for certain analyses.

  3. Psychological scaling: Some psychological constructs (like intelligence) are difficult to categorize definitively within this system.

Conclusion

Stevens’ data typology provides a fundamental framework for understanding different types of data and their properties. By recognizing the level of measurement of their variables, researchers can make informed decisions about data collection, analysis, and interpretation. However, it’s important to remember that while this typology is a useful guide, real-world data often requires nuanced consideration and may not always fit neatly into these categories.

pH as an Interval Scale

pH is considered an interval scale because:

  1. It has ordered categories: Lower pH values indicate higher acidity, while higher values indicate higher alkalinity.

  2. The intervals between adjacent pH values are equal in terms of hydrogen ion concentration:

    • Each whole number change in pH represents a tenfold change in hydrogen ion concentration.
    • For example, the difference in acidity between pH 4 and pH 5 is the same as the difference between pH 7 and pH 8.
  3. It lacks a true zero point:

    • pH 0 does not represent a complete absence of hydrogen ions.
    • Negative pH values and values above 14 are possible in extreme conditions.
  4. Ratios are not meaningful:

    • A pH of 4 is not “twice as acidic” as a pH of 2.
    • The ratio of hydrogen ion concentrations, not pH values, indicates relative acidity.

These characteristics align with the definition of an interval scale, where the differences between values are meaningful and consistent, but ratios are not interpretable.

3.4 Common Ordinal Scales in Behavioural Research

Likert Scales

Likert scales are widely used in psychology and social sciences to measure attitudes, opinions, and perceptions. Named after psychologist Rensis Likert, these scales typically consist of a series of statements or questions that respondents rate on a scale, often from “Strongly Disagree” to “Strongly Agree.”

https://individual-psychometrics.rbind.io/

Why Likert Scales are Ordinal Variables

Likert scales are considered ordinal variables for several reasons:

  1. Order without equal intervals: While the responses have a clear order (e.g., “Strongly Disagree” < “Disagree” < “Neutral” < “Agree” < “Strongly Agree”), the intervals between these categories are not necessarily equal.

  2. Subjective interpretation: The difference between “Strongly Disagree” and “Disagree” may not be the same as the difference between “Agree” and “Strongly Agree” for all respondents.

  3. Lack of true zero point: Likert scales typically don’t have a true zero point, which is a characteristic of interval or ratio scales.

IQ and Other Psychological Variables as Ordinal Measures

Many psychological measures, including IQ, are often treated as interval scales but are, in fact, ordinal. Here’s why:

  1. IQ Scores:

    • While IQ scores are presented as numbers, the difference between an IQ of 100 and 110 may not represent the same cognitive difference as between 130 and 140.
    • The scale is normalized and adjusted over time, making it difficult to claim true interval properties.
  2. Other Psychological Measures:

    • Depression scales (e.g., Beck Depression Inventory)
    • Anxiety measures (e.g., State-Trait Anxiety Inventory)
    • Personality assessments (e.g., Big Five Inventory)

These measures often use summed Likert-type items or other scoring methods that don’t guarantee equal intervals between scores.

Implications for Analysis

Recognizing these measures as ordinal has important implications for data analysis:

  1. Appropriate statistical tests: Use non-parametric tests (e.g., Mann-Whitney U, Kruskal-Wallis) instead of parametric ones.

  2. Correlation analysis: Use Spearman’s rank correlation instead of Pearson’s correlation.

  3. Central tendency: Report median and mode rather than mean.

  4. Data visualization: Use methods appropriate for ordinal data, such as bar plots or stacked bar charts.

Conclusion

While Likert scales and many behavioural measures are often treated as interval data for practical reasons, it’s crucial to remember their ordinal nature.

https://individual-psychometrics.rbind.io/
Exercise: Identifying Measurement Scales

For each of the following variables, determine the most appropriate scale of measurement (Nominal, Ordinal, Interval, or Ratio). Also evaluate whether the variable is discrete or continuous.

  1. Gender: nominal level of measurement, and discrete;
  2. Customer satisfaction: Poor, Fair, Good, Excellent
  3. Height (questionnaire): “I am: very short, short, average, tall, very tall”
  4. Height (inches)
  5. Reaction time (milliseconds)
  6. Postal codes: e.g., 61548, 61761, 62461, 47424, 65233
  7. Age (years)
  8. Nationality
  9. Street addresses
  10. Military ranks
  11. Left-Right political scale placement
  12. Family size: 1 child, 2 children, 3 children, …
  13. IQ score
  14. Shirt size (S, M, L, …)
  15. Movie ratings (1 star, 2 stars, 3 stars)
  16. Temperature (Celsius)
  17. Temperature (Kelvin)
  18. Blood types: A, B, AB, O
  19. Income categories: low, medium, high
  20. Voter turnout
  21. Political party affiliation
  22. Electoral district magnitude

Remember to justify your choices for each variable.

For instance: In Stevens’ typology of measurement scales, street addresses are nominal data. This is because:

They serve purely as labels/identifiers. They have no inherent ordering (123 Main St isn’t “more than” 23 Oak St). You can’t perform meaningful mathematical operations on them.The only valid operation is testing for equality/inequality (is this the same address or different?)

Statistical Measures Applicability / Zastosowanie miar statystycznych
Measure (EN) Miara (PL) Nominal Ordinal Interval Ratio
Central Tendency / Tendencja centralna:
Mode Dominanta
Median Mediana -
Arithmetic Mean Średnia arytmetyczna - -
Geometric Mean Średnia geometryczna - - -
Harmonic Mean Średnia harmoniczna - - -
Dispersion / Rozproszenie:
Range Rozstęp -
Interquartile Range Rozstęp międzykwartylowy -
Mean Absolute Deviation Średnie odchylenie bezwzględne - -
Variance Wariancja - - ✓*
Standard Deviation Odchylenie standardowe - - ✓*
Coefficient of Variation Współczynnik zmienności - - -
Association / Współzależność:
Chi-square Chi-kwadrat
Spearman Correlation Korelacja Spearmana -
Kendall’s Tau Tau Kendalla -
Pearson Correlation Korelacja Pearsona - - ✓*
Covariance Kowariancja - - ✓*

* Theoretically problematic but commonly used in practice / Teoretycznie problematyczne, ale powszechnie stosowane w praktyce

Notes / Uwagi:

  1. Measurement Scales / Skale pomiarowe:
  • Nominal: Categories without order / Kategorie bez uporządkowania
  • Ordinal: Ordered categories / Kategorie uporządkowane
  • Interval: Equal intervals, arbitrary zero / Równe interwały, umowne zero
  • Ratio: Equal intervals, absolute zero / Równe interwały, absolutne zero
  1. Practical Considerations / Aspekty praktyczne:
  • Some measures marked with ✓* are commonly used for interval data despite theoretical issues / Niektóre miary oznaczone ✓* są powszechnie stosowane dla danych przedziałowych pomimo problemów teoretycznych
  • Choice of measure should consider both theoretical appropriateness and practical utility / Wybór miary powinien uwzględniać zarówno poprawność teoretyczną jak i użyteczność praktyczną
  • More restrictive scales (ratio) allow all measures from less restrictive scales / Bardziej restrykcyjne skale (ilorazowe) pozwalają na wszystkie miary z mniej restrykcyjnych skal