```{r setup, include=FALSE}
# Attach packages
library(here) # For simple handling of relative file paths
library(tidyverse) # To use the tidy packages
library(modelsummary) # To make pretty tables
library(ggpmisc) # To annotate plot fits
library(car) # For `linearHypothesis`
library(ggrepel) # For text annotations
library(kableExtra) # For table formatting
# Set global options
knitr::opts_chunk$set(
echo = TRUE,
warning = FALSE,
message = FALSE
)
# Turn off scientific notation
options(scipen=999)
options(modelsummary_html = list(
"html" = list(
"table.css" = "width: 100%; font-size: 8px;" # Adjust width and font size as needed
)
))
options(datasummary_html = list(
"html" = list(
"table.css" = "width: 100%; font-size: 8px;" # Adjust width and font size as needed
)
))
```
## Academic honesty statement
I have been academically honest in all of my work and will not tolerate academic dishonesty of others, consistent with [UGA's Academic Honesty Policy](https://honesty.uga.edu/Academic-Honesty-Policy/).
Sign the academic honesty statement by typing your name on the **Signature** line.
**Signature**: Spencer Katzman
We will not accept submissions that omit a signed Academic Honesty statement.
# Introduction
## Overview
This project examines pay differences between males and females using data from
the March 2022 edition of the Current Population Survey (CPS). My analysis
focuses on working-age individuals, their demographic and household
characteristics, and their earnings. I identify a statistically significant
wage gap where men earn more than women on average and show that the gap increases at a
decreasing rate over a career. The gap persists when education and demographic
controls are used in estimations.
# Data
## March 2022 CPS
The March 2022 CPS surveyed about 54,000 households. The monthly CPS collects
labor force information about households' employment and demographics.
The ASEC supplement adds data on income, work experience, poverty, and other
variables. Importantly ASEC data is based on the previous year while CPS data is
current.
## March 2022 CPS Extract
```{r read_data, eval = TRUE}
cpsmar_e <- read_csv(here("data", "cpsmar_e.csv"))
```
Age, earnings, hours, weeks, race, marital status,
education, and job type were extracted from the person file, and region, county, and city
from the household file. Rename was used to simplify variable names, and
mutate to create indicators for categorical data. Households with children under six were
identified using group_by and mutate. After filtering for full-time workers only and
merging the files based on household sequence number, 52,097 observations of 20 variables remained.
## Analysis sample
```{r btl1, eval = TRUE}
cpsmar_a <- cpsmar_e %>%
filter(
age >= 23,
age <= 62,
earnings > 0
) %>%
mutate(
gender = ifelse(female == 1, "Female", "Male"),
wage = earnings/(hours*weeks),
lwage = log(wage),
Black = case_when(race==2~1, TRUE ~ 0),
south = case_when(region==3~1, TRUE ~ 0),
married = case_when((marital==1 | marital==2 | marital==3)~1, TRUE ~ 0),
age_centered = age - 23
)
```
The analysis sample includes 46,194 observations of individuals with positive
earnings who are between 23 and 62 years old.
# Baseline earnings distributions
## Plotting earnings distributions
```{r btl2, eval = TRUE}
figure1 <- ggplot(cpsmar_a, aes(x = earnings, group = gender, fill = gender)) +
geom_density(alpha = 0.4) +
labs(
title="Figure 1. Distribution of earnings by gender",
x="Earnings",
y="Density"
)+
theme_minimal()
earnings_fvm <- cpsmar_a %>%
group_by(gender) %>%
summarize(avg_earnings = round(mean(earnings, na.rm = TRUE),0))
avg_earnings_f <- earnings_fvm %>%
filter(gender == "Female") %>%
pull(avg_earnings) # `pull` extracts the "avg_earnings" value for "Female" from earnings_fvm, a single value since the data only record two genders.
avg_earnings_m <- earnings_fvm %>%
filter(gender == "Male") %>%
pull(avg_earnings) # `pull` extracts the "avg_earnings" value for "Male" from earnings_fvm, a single value since the data only record two genders.
```
## Distribution of earnings by gender
```{r btl2.5, echo=FALSE, fig.align='center', out.width='90%', eval = TRUE}
figure1
```
## Baseline comparisons
The most important fact communicated by Figure 1 is that average male earnings are higher
than average female earnings. Female average earnings are $`r avg_earnings_f`,
while male average earnings are $`r avg_earnings_m`. The dollar gap is
$`r avg_earnings_m - avg_earnings_f`, which translates to an approximate
`r round((avg_earnings_m - avg_earnings_f) / avg_earnings_f * 100)`% pay gap.
The figure also shows that male earnings have greater
variability than female earnings.
# The career gender gap
## Wages and hours differences
```{r mi1, echo=FALSE, eval = TRUE}
datasummary(
(wage + hours) ~ gender * (N + Mean + SD),
data=cpsmar_a,
title="Table 1. Wages and hours by gender"
)
```
## Documenting the differences
Table 1 shows that there are more men than women in the sample. It also shows that
men, on average, have higher hourly wages ($37.91 v. $30.65), work more hours
(44.03 v. 42.26), and that their hours and wages are more variable than those for women.
## Plotting career log wage profiles
```{r mi2, eval = TRUE}
cef_fvm_w <- cpsmar_a %>%
group_by(age_centered, gender) %>%
summarize(avg_lwage = mean(lwage, na.rm = TRUE))
figure2 <- ggplot(cef_fvm_w, aes(x = age_centered, y = avg_lwage, color = gender, linetype = gender, linewidth = gender)) +
geom_point() +
geom_line() +
scale_linetype_manual(values = c("Female" = "longdash", "Male" = "solid")) +
scale_linewidth_manual(values = c("Female" = 0.7, "Male" = 0.5)) +
guides(linewidth = "none") +
labs(
title="Figure 2. Career log-wage profiles for women and men",
x="Year",
y="Average log wage"
)+
theme_minimal()
```
## Career log wage profiles
```{r mi2.5, echo=FALSE, fig.align='center', out.width='90%', eval = TRUE}
figure2
```
## Estimating wage differences over a career
```{r mi3, eval = TRUE}
males <- cef_fvm_w %>%
filter(gender == "Male") %>%
rename(avg_lwage_male = avg_lwage) %>%
select(-gender)
females <- cef_fvm_w %>%
filter(gender == "Female") %>%
rename(avg_lwage_female = avg_lwage) %>%
select(-gender)
diff_fvm <- inner_join(males, females, by = "age_centered") %>%
filter(age_centered <= 30) %>%
mutate(
diff = avg_lwage_male - avg_lwage_female,
age_group = cut(
age_centered,
breaks = c(-1, 10, 20, 30),
labels = c("1-10", "11-20", "21-30"))
) %>%
group_by(age_group) %>%
summarize(mean_diff = mean(diff)*100)
table2 <- kable(
diff_fvm,
digits = 2,
col.names = c("Year Range", "Avg Pct Difference"),
align = "cc",
caption = "Table 2. Percent wage differences, first 30 years",
) %>%
kable_styling(position = "center")
```
## Evolution of the gender wage gap
```{r mi3.5, echo=FALSE, eval = TRUE}
table2
```
## Discussing the gender wage gap evolution
Figure 2 and Table 2 show that the male-female wage gap in average log wages
increases over the course of a career. Table 2 shows the gap in ten year
increments whereas Figure 2 displays the quadratic nature of the gap increase more incrementally.
# Explaining the gender wage gap
## Fitting the log wage profiles
```{r reg1, eval = TRUE}
formula <- y ~ x + I(x^2)
figure3 <- figure2 +
geom_smooth(
method = "lm",
formula = formula,
aes(group = gender),
se = FALSE
) +
stat_poly_eq(
aes(label = after_stat(eq.label)),
formula = formula,
parse = TRUE
) +
labs(
title="Figure 3. Career log-wage profiles with quadratic fits",
x="Year",
y="Average log wage"
)+
theme_minimal() +
theme(legend.position = "bottom")
```
## Log wage profiles with quadratic fits
```{r reg1.5, echo=FALSE, fig.align='center', out.width='90%', eval = TRUE}
figure3
```
## Gender differences in education
```{r ed_vars, echo=FALSE, eval = TRUE}
datasummary(
formula = (HSGrad + SomeColl + CollDeg) ~ gender * (N + Mean + SD),
data = cpsmar_a,
title = "Table 3. Educational attainment by gender"
)
```
## Gender differences in demographics
```{r demo_vars, echo=FALSE, eval = TRUE}
datasummary(
formula = (Black + hisp + south + city + married + child_u6) ~ gender * (N + Mean + SD),
data = cpsmar_a,
title = "Table 4. Demographic characteristics by gender"
)
```
## Documenting differences in characteristics
Table 3 shows that females in the sample are less likely to have a high school degree than
males, about equally likely to have some college, but more likely to have a college degree.
Table 4 shows that males in the sample are more likely to be Hispanic while females
are more likely to be Black. Table 4 also shows that males are more likely to
be married and have at least one child under six in the household.
Finally, residence in the South and in urban areas shows little to no gender difference.
## Controlling for education and demographic characteristics
```{r reg2a, eval = TRUE}
singles <- cpsmar_a %>%
filter(
married==0,
child_u6==0
)
models <- list(
"Baseline" = lm(lwage ~ female +
age_centered + I(age_centered^2),
data = cpsmar_a),
"Add Education" = lm(lwage ~ female +
age_centered + I(age_centered^2) + HSGrad + SomeColl + CollDeg,
data = cpsmar_a),
"Add Person" = lm(lwage ~ female +
age_centered + I(age_centered^2) + HSGrad + SomeColl + CollDeg +
Black + hisp + south + city,
data = cpsmar_a),
"Add Household" = lm(lwage ~ female +
age_centered + I(age_centered^2) + HSGrad + SomeColl + CollDeg +
Black + hisp + south + city +
married + child_u6,
data = cpsmar_a),
"Only Singles" = lm(lwage ~ female +
age_centered + I(age_centered^2) + HSGrad + SomeColl + CollDeg +
Black + hisp + south + city,
data = singles)
)
```
## Reporting the results
```{r reg2b, eval = TRUE}
cm <- c(
'female' = 'Female',
'age_centered' = 'Age',
'I(age_centered^2)' = 'Age$^2$',
'(Intercept)' = 'Constant'
)
gm <- tibble::tribble(
~raw, ~clean, ~fmt,
"nobs", "$N$", 0,
"r.squared", "$R^2$", 2
)
rows <- tribble(~term, ~Baseline, ~Add_Education, ~Add_Person, ~Add_Household, ~Only_Singles,
'Education controls', ' ', 'X', 'X', 'X', 'X',
'Demographic controls', ' ', ' ', 'X', 'X', 'X',
'Household controls', ' ', ' ', ' ', 'X', 'X'
)
attr(rows, 'position') <- c(9, 10, 11) # Positions where you want these rows to appear
table5 <- modelsummary(
models,
add_rows = rows,
coef_map = cm,
gof_map = gm,
vcov = c("robust","robust","robust","robust","robust"),
title = "Table 5. OLS estimates of the gender wage gap",
notes = "Robust standard errors in parentheses.",
escape = FALSE
)
```
## Explaining the gender wage gap
```{r reg2.5, echo=FALSE, eval = TRUE}
table5
```
## Documenting the findings
The baseline regression model shows a gender gap of 16.2%. This estimate increases
to 23.6% when educational controls are added, indicating that the baseline model
likely suffers from omitted variable bias. Estimates decrease slightly to 22.7% as personal
demographic controls are added and yet more to 21.4% when household
controls are used. Restricting the analysis to singles shows a much lower gap
of 12.3%. All estimates of the female coefficient are statistically significant at the 99% confidence level.
# Conclusion
## Summary
This project examines gender pay differences
using a March 2022 CPS sample of workers age 23 to 62 who
made positive earnings. Statistically significant regression estimates show that men
earn more than women. These estimates range from 12.3% for singles to 23.6%
for the entire sample when only educational controls are used.
The gap estimate lessens to 21.4% when all controls are used. It is also shown that the gap increases at a
decreasing rate over the course of a career, almost doubling between the first
and second 10 years, and continuing to grow throughout.
# Appendix
## Data documentation
```{r var_doc, eval = TRUE}
# Define the variables and their descriptions
variables <- data.frame(
Variable = c(
"age",
"earnings",
"hours",
"race",
"marital",
"HSGrad",
"SomeColl",
"CollDeg",
"region",
"female",
"hisp",
"fulltime"
),
Definition = c(
"years; capped at 85",
"earnings; greater than 0",
"hours worked per week",
"respondent’s race (1 = White only, 2 = Black only, 3 = AI only, 4 = Asian only, 5 = Hawaiian/Pacific Islander only (HP), 6 = White-Black, 7 = White-AI, 8 = White-Asian, 9 = White-HP, 11 = Black-Asian, 12 = Black-HP, 13 = AI-Asian, 14 = AI-HP, 15 = Asian-HP, 16 = White-Black-AI, 17 = White-Black-Asian, 18 = White-Black-HP, 19 = White-AI-Asian, 20 = White-AI-HP, 21 = White-Asian-HP, 22 = Black-AI-Asian, 23 = White-Black-AI-Asian, 24 = White-AI-Asian-HP, 25 = White-Black-AI-Asian-HP, 25 = Other 3 race comb., 26 = Other 4 or 5 race comb.)",
"marital status (1 = Married civilian, 2 = Married AF, 3 = Married absent, 4 = Widowed, 5 = Divorced, 6 = Separated, 7 = Never married)",
"= 1, if high school graduate",
"= 1, if some college",
"= 1, if college degree",
"household region (1 = Northeast, 2 = Midwest, 3 = South, 4 = West)",
"= 1, if female",
"= 1, if Hispanic, Spanish, or Latino",
"= 1, if works full time"
)
)
```
## List of main variables with definitions
This is a list of the main variables used in this project with their definitions.
```{r var_doc_table, echo=FALSE, eval = TRUE}
# Render the table
knitr::kable(variables, format = "simple",
col.names = c("Variable", "Definition"))
```