1 Introduction

1.1 Description

Information on incidents of gun violence documented in the United States from January 2013 through March 2018. Note that incidents from 2013 are incomplete: only 279 incidents from that year appear in this data set. Some notable incidents are also absent, including the October 2017 Las Vegas mass shooting. The data is contained in gunViolenceGeo.csv.

1.2 Source

Data was obtained from the GitHub repo https://github.com/jamesqo/gun-violence-data, which in turn sourced it from the Gun Violence Archive's website. From the organization's description:

1.3 Source description

Gun Violence Archive (GVA) is a not for profit corporation formed in 2013 to provide free online public access to accurate information about gun-related violence in the United States. GVA will collect and check for accuracy, comprehensive information about gun-related violence in the U.S. and then post and disseminate it online.

Although there are possible limitations in its comprehensiveness, research supports the utility of the data provided by the Gun Violence Archive for public health research and quantifying the amount of gun violence occurring throughout the United States.

1.4 Load R Packages

library(tidyverse)
library(gapminder)
library(skimr)
library(ggthemes)
library(flextable)
library(dplyr)
library(naniar)
library(lubridate)
library(tidyr)
library(sf)
library(mapview)
library(maps)
library(plotly) 
library(ggpubr)
library(ggcorrplot)
library(scales)
library(tidymodels)
library(car)
library(infer)
library(htmltools)
library(leaflet)

1.5 Importing data

gunViolence <- read_csv("gunViolenceGeo.csv")
census_df <- read_csv("https://raw.githubusercontent.com/dilernia/STA418-518/main/Data/census_data_state_2008-2021.csv")
county_census_df <- read_csv(
  "https://raw.githubusercontent.com/dilernia/STA418-518/main/Data/census_data_county_2009-2021.csv")

2 Gun Violence Data Dictionary

# Create the data dictionary
dataDictionary <- tibble(
  Variable = colnames(gunViolence),
  Description = c(
    "Geographic ID",
    "gunviolencearchive.org ID for incident",
    "date of occurrence",
    "state",
    "city or county",
    "address where incident took place",
    "number of people killed",
    "number of people injured",
    "link to gunviolencearchive.org webpage containing details of incident",
    "link to online news story concerning incident",
    "ignore, always False",
    "congressional district",
    "key: gun ID, value: 'Unknown' or 'Stolen'",
    "key: gun ID, value: description of gun type",
    "list of incident characteristics",
    "latitude",
    "description of location where incident took place",
    "longitude",
    "number of guns involved",
    "additional notes about the incident",
    "key: participant ID",
    "key: participant ID, value: description of age group, e.g. 'Adult 18+'",
    "key: participant ID, value: 'Male' or 'Female'",
    "key: participant ID",
    "key: participant ID, value: relationship of participant to other participants",
    "key: participant ID, value: 'Arrested', 'Killed', 'Injured', or 'Unharmed'",
    "key: participant ID, value: 'Victim' or 'Subject-Suspect'",
    "links to online news stories concerning incident",
    "state house district",
    "state senate district",
    "full address where incident took place"
  ),
  Type = map_chr(gunViolence, .f = function(x) { typeof(x)[1] }),
  Class = map_chr(gunViolence, .f = function(x) { class(x)[1] })
)

# Print the data dictionary
data_dict_ft <- flextable(dataDictionary) |>
  add_header_row(values = "Table 1: Data Dictionary", colwidths = 4) |>
  padding(padding = 3, part = "all") |>
  add_footer_lines(values = "Source: Gun Violence Archive") |>
  set_table_properties(width = 1, layout = "autofit") |>
  colformat_double(j = 2:4, digits = 2) |>
  theme_vanilla() |>
  autofit()

data_dict_ft

Table 1: Data Dictionary

Variable                     Description                                                                     Type       Class
geoid                        Geographic ID                                                                   character  character
incident_id                  gunviolencearchive.org ID for incident                                          double     numeric
date                         date of occurrence                                                              double     Date
state                        state                                                                           character  character
city_or_county               city or county                                                                  character  character
address                      address where incident took place                                               character  character
n_killed                     number of people killed                                                         double     numeric
n_injured                    number of people injured                                                        double     numeric
incident_url                 link to gunviolencearchive.org webpage containing details of incident           character  character
source_url                   link to online news story concerning incident                                   character  character
incident_url_fields_missing  ignore, always False                                                            logical    logical
congressional_district       congressional district                                                          double     numeric
gun_stolen                   key: gun ID, value: 'Unknown' or 'Stolen'                                       character  character
gun_type                     key: gun ID, value: description of gun type                                     character  character
incident_characteristics     list of incident characteristics                                                character  character
lat                          latitude                                                                        double     numeric
location_description         description of location where incident took place                               character  character
long                         longitude                                                                       double     numeric
n_guns_involved              number of guns involved                                                         double     numeric
notes                        additional notes about the incident                                             character  character
participant_age              key: participant ID                                                             character  character
participant_age_group        key: participant ID, value: description of age group, e.g. 'Adult 18+'          character  character
participant_gender           key: participant ID, value: 'Male' or 'Female'                                  character  character
participant_name             key: participant ID                                                             character  character
participant_relationship     key: participant ID, value: relationship of participant to other participants   character  character
participant_status           key: participant ID, value: 'Arrested', 'Killed', 'Injured', or 'Unharmed'      character  character
participant_type             key: participant ID, value: 'Victim' or 'Subject-Suspect'                       character  character
sources                      links to online news stories concerning incident                                character  character
state_house_district         state house district                                                            double     numeric
state_senate_district        state senate district                                                           double     numeric
address_full                 full address where incident took place                                          character  character

Source: Gun Violence Archive

2.1 Exploring high-level data characteristics

skim(gunViolence)
Data summary
Name gunViolence
Number of rows 239677
Number of columns 31
_______________________
Column type frequency:
character 20
Date 1
logical 1
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
geoid 8025 0.97 5 5 0 2821 0
state 0 1.00 4 20 0 51 0
city_or_county 0 1.00 3 46 0 12821 0
address 16497 0.93 2 97 0 197115 0
incident_url 0 1.00 48 50 0 239677 0
source_url 468 1.00 2 255 0 213974 0
gun_stolen 99498 0.58 8 5488 0 349 0
gun_type 99451 0.59 5 5488 0 2502 0
incident_characteristics 326 1.00 8 761 0 18126 0
location_description 197588 0.18 1 100 0 27605 0
notes 81017 0.66 1 261 0 136435 0
participant_age 92298 0.61 3 517 0 18951 0
participant_age_group 42119 0.82 11 1536 0 898 0
participant_gender 36362 0.85 6 803 0 873 0
participant_name 122253 0.49 4 2158 0 113402 0
participant_relationship 223903 0.07 8 366 0 284 0
participant_status 27626 0.88 8 1500 0 2150 0
participant_type 24863 0.90 8 1311 0 259 0
sources 609 1.00 8 3015 0 217280 0
address_full 7935 0.97 23 224 0 191632 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2013-01-01 2018-03-31 2016-04-05 1725

Variable type: logical

skim_variable n_missing complete_rate mean count
incident_url_fields_missing 0 1 0 FAL: 239677

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
incident_id 0 1.00 559334.35 293128.68 92114.00 308545.00 543587.00 817228.00 1083472.00 ▇▇▆▆▆
n_killed 0 1.00 0.25 0.52 0.00 0.00 0.00 0.00 50.00 ▇▁▁▁▁
n_injured 0 1.00 0.49 0.73 0.00 0.00 0.00 1.00 53.00 ▇▁▁▁▁
congressional_district 11944 0.95 8.00 8.48 0.00 2.00 5.00 10.00 53.00 ▇▂▁▁▁
lat 7923 0.97 37.55 5.13 19.11 33.90 38.57 41.44 71.34 ▁▇▅▁▁
long 7923 0.97 -89.34 14.36 -171.43 -94.16 -86.25 -80.05 97.43 ▁▇▁▁▁
n_guns_involved 99451 0.59 1.37 4.68 1.00 1.00 1.00 1.00 400.00 ▇▁▁▁▁
state_house_district 38772 0.84 55.45 42.05 1.00 21.00 47.00 84.00 901.00 ▇▁▁▁▁
state_senate_district 32335 0.87 20.48 14.20 1.00 9.00 19.00 30.00 94.00 ▇▆▂▁▁

3 Data Cleaning

I began by pruning the data down to its most relevant and complete portion. As stated in the description, the data is incomplete for 2013 and 2018, so I filtered to the four complete years in between (2014-2017). I also converted the date column to R's Date class, making it easier to extract chronological information such as year and month.

# Extract the year from the date column using stringr
gunViolence <- gunViolence |>
  mutate(year = as.integer(str_sub(date, 1, 4)))

# Filter data for the years 2014 to 2017
gunViolence <- gunViolence |>
  filter(year >= 2014 & year <= 2017)

# Ensure the date column is of Date class
gunViolence <- gunViolence |>
  mutate(date = as.Date(date, format = "%Y-%m-%d"))

# Extract the month from the date column
gunViolence <- gunViolence |>
  mutate(
    month = format(date, "%m")
  )

# Use stringr to remove text within parentheses in the city_or_county column
gunViolence <- gunViolence |>
  mutate(city_or_county = str_replace_all(city_or_county, "\\s*\\([^\\)]+\\)", ""))

# Trim any leading or trailing whitespace
gunViolence <- gunViolence |>
  mutate(city_or_county = str_trim(city_or_county))

3.1 Data Missingness

3.1.1 Plotting missingness for all data features

I began by examining the null values within the data and how to address them. Given the robust amount of information provided about these gun violence incidents, as well as the other datasets available to me regarding U.S. census data at the county, city, and state levels, I knew that location variables were important. I plotted all missing data and then visualized the missingness in relation to location.

gg_miss_var(gunViolence, show_pct = TRUE)

3.2 Plotting geolocation feature missingness

This visualization provides a closer look at location-level information by state. After creating the heatmap of the percentage of rows with missing location information, I observed that missing lat and long coordinates went hand in hand with missing geoid values, which are extremely salient for analysis. This may suggest that incident reporting in certain states was less diligent about capturing location data. The states pushing the upper bounds of location missingness were:

  • Idaho
  • Indiana
  • South Dakota
  • Virginia
  • West Virginia
gunViolence |>
  dplyr::select(
    geoid,
    lat,
    long,
    state
  ) |>
  gg_miss_fct(fct = state) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

3.3 Addressing important missing geoid data

To address geoid missingness, I compared incidents without IDs to ones that had them, by city, state, and year. Where a match existed, I copied the geoid over and merged it back into the dataset. Here is an examination of missingness after the merge: we went from over 8,000 missing geoid values down to 443, a completeness rate of 99.8% for the variable.

# Identify rows with missing geoid
missing_geoid <- gunViolence |>
  filter(is.na(geoid))

# Identify rows with non-missing geoid
non_missing_geoid <- gunViolence |>
  filter(!is.na(geoid)) |>
  group_by(state, city_or_county, year) |>
  summarise(geoid = first(geoid), .groups = 'drop')

# Perform the matching and fill in the missing geoid values
filled_data <- missing_geoid |>
  left_join(non_missing_geoid, by = c("state", "city_or_county", "year")) |>
  mutate(geoid = if_else(is.na(geoid.x), geoid.y, geoid.x)) |>
  select(-geoid.x, -geoid.y)

# Combine the filled rows with the original data excluding the previously missing rows
final_data <- gunViolence |>
  filter(!is.na(geoid)) |>
  bind_rows(filled_data)

skim(final_data$geoid)
Data summary
Name final_data$geoid
Number of rows 225597
Number of columns 1
_______________________
Column type frequency:
character 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
data 443 1 5 5 0 2800 0

3.4 Reviewing missingness after cleaning

This improved my location accuracy immensely. Hawaii, Idaho, Vermont, and West Virginia were the only states still facing higher proportions of missingness, topping out at just over 1%. As we will see later, these states account for comparatively few incidents and deaths.

final_data |>
  dplyr::select(
    geoid,
    state
  ) |>
  gg_miss_fct(fct = state) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

# Calculate count of missing values for each state
missing_counts <- final_data |>
  dplyr::select(geoid, state) |>
  group_by(state) |>
  summarise(missing_count = sum(is.na(geoid))) |>
  ungroup()

# Create a bar plot to show the count of missing values
ggplot(missing_counts, aes(x = reorder(state, -missing_count), y = missing_count)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  labs(
    x = "State",
    y = "Count of Missing geoid",
    title = "Count of Missing geoid by State",
    subtitle = "Based on final_data"
  )

3.5 Missingness Reflection

My method of matching locations from the same year to cross-reference geographic IDs substantially reduced the number of missing values. As we can see, the highest counts of remaining null values occur in high-incident states, so in the grand scheme of things this should not significantly affect the accuracy of our analysis.
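
As a quick sanity check, here is a minimal sketch comparing the missing geoid counts before and after the merge (using the gunViolence and final_data objects created above):

# Missing geoid counts before (gunViolence) and after (final_data) the
# city/state/year matching step, plus the resulting completeness rate
tibble(
  missing_before = sum(is.na(gunViolence$geoid)),
  missing_after  = sum(is.na(final_data$geoid)),
  complete_rate  = mean(!is.na(final_data$geoid))
)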

4 EDA

4.1 Country-Wide Year-Level Aggregate Counts

# Calculate the count of incidents per year
yearly_counts <- final_data |>
  group_by(year) |>
  summarise(
    incidents = n(),
    n_killed = sum(n_killed),
    n_injured = sum(n_injured)
  )

# Convert the year column to a character to avoid comma formatting
yearly_counts$year <- as.character(yearly_counts$year)

# Display the yearly counts using flextable
yearly_counts_ft <- yearly_counts |>
  flextable() |>
  set_header_labels(
    year = "Year",
    incidents = "Incidents",
    n_killed = "Killed",
    n_injured = "Injured"
  ) |>
  add_header_row(
    values = "Table 2: Aggregate Sum of U.S. Gun Violence Incidents, Deaths, and Injuries per Year", 
    colwidths = 4
  ) |>  
  padding(padding = 3, part = "all") |>
  theme_vanilla() |>
  autofit()

# Print the flextable
yearly_counts_ft

Table 2: Aggregate Sum of U.S. Gun Violence Incidents, Deaths, and Injuries per Year

Year  Incidents  Killed  Injured
2014  51,854     12,557  23,002
2015  53,579     13,484  26,967
2016  58,763     15,066  30,580
2017  61,401     15,511  30,703

4.2 Country-Wide Year-Level Summary Statistics

# Calculate summary statistics for incidents, n_killed, and n_injured per year
summary_year <- yearly_counts |>
  reframe(
    Statistic = c("Minimum", "Q1", "Median", "Mean (μ)", "Q3", "Maximum", "SD (σ)"),
    Incidents = c(
      min(incidents, na.rm = TRUE),
      quantile(incidents, 0.25, na.rm = TRUE),
      median(incidents, na.rm = TRUE),
      mean(incidents, na.rm = TRUE),
      quantile(incidents, 0.75, na.rm = TRUE),
      max(incidents, na.rm = TRUE),
      sd(incidents, na.rm = TRUE)
    ),
    Killed = c(
      min(n_killed, na.rm = TRUE),
      quantile(n_killed, 0.25, na.rm = TRUE),
      median(n_killed, na.rm = TRUE),
      mean(n_killed, na.rm = TRUE),
      quantile(n_killed, 0.75, na.rm = TRUE),
      max(n_killed, na.rm = TRUE),
      sd(n_killed, na.rm = TRUE)
    ),
    Injured = c(
      min(n_injured, na.rm = TRUE),
      quantile(n_injured, 0.25, na.rm = TRUE),
      median(n_injured, na.rm = TRUE),
      mean(n_injured, na.rm = TRUE),
      quantile(n_injured, 0.75, na.rm = TRUE),
      max(n_injured, na.rm = TRUE),
      sd(n_injured, na.rm = TRUE)
    )
  ) |>
  mutate_if(is.numeric, round, digits = 2)

summary_year_ft <- flextable(summary_year) |>
  add_header_row(values = "Table 3: U.S. Summary Statistics of Gun Violence Incidents, Deaths, and Injuries per Year (2014-2017)", colwidths = 4) |>
  padding(padding = 3, part = "all") |>
  add_footer_lines(values = "Source: Gun Violence Archive") |>
  colformat_double(j = 2:4, digits = 2)|>
  theme_vanilla() |>
  autofit()


summary_year_ft

Table 3: U.S. Summary Statistics of Gun Violence Incidents, Deaths, and Injuries per Year (2014-2017)

Statistic  Incidents  Killed     Injured
Minimum    51,854.00  12,557.00  23,002.00
Q1         53,147.75  13,252.25  25,975.75
Median     56,171.00  14,275.00  28,773.50
Mean (μ)   56,399.25  14,154.50  27,813.00
Q3         59,422.50  15,177.25  30,610.75
Maximum    61,401.00  15,511.00  30,703.00
SD (σ)     4,442.89   1,375.08   3,645.54

Source: Gun Violence Archive

Here are the summary statistics of gun violence incidents, deaths, and injuries for the entire U.S. across our years of interest.

4.3 Visualization: Bar Chart of Monthly Incidents per Year

# Create the faceted bar chart of monthly incident counts
ggplot(final_data, aes(x = month)) +
  geom_bar(fill = "steelblue") +
  facet_grid(year ~ .) +
  labs(
    title = "Monthly Gun Violence Incidents",
    x = "Month",
    y = "Number of Incidents",
    caption = "Source: Gun Violence Archive"
  ) +
  theme_minimal() +
  scale_x_discrete(labels = month.abb)  # Use abbreviated month names

I next broke each year of gun violence incidents down by month, and an interesting trend emerges from this chart. With the exception of a peak in January, incidents start to increase as the warmer months approach.
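
To put a number on that seasonal pattern, here is a minimal sketch (using the year and month columns created during cleaning) that averages monthly incident counts across 2014-2017:

# Average number of incidents per calendar month across 2014-2017
final_data |>
  count(year, month) |>
  group_by(month) |>
  summarise(avg_incidents = mean(n)) |>
  arrange(desc(avg_incidents))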

4.4 Gun Violence - State Level

I began analyzing this data by examining gun violence at the state level for 2017, the most recent complete year in the filtered data and the one with the highest number of incidents.

4.4.1 State-level Summary Statistics for Top 15 States

# Summarize the number of incidents per state
state_counts_all_years <- final_data |>
  filter(state != "District of Columbia") |>
  group_by(state, year) |>
  summarise(
    count = n(),
    n_killed = sum(n_killed),
    n_injured = sum(n_injured)
    ) |>
  arrange(desc(count))

# Calculate the average count for each state over all years
state_avg_counts <- state_counts_all_years %>%
  group_by(state) %>%
  summarize(avg_count = mean(count, na.rm = TRUE)) %>%
  arrange(desc(avg_count))

# Select the top 15 states by average count
top_15_states <- state_avg_counts %>% slice(1:15)

# Filter the original data for these top 15 states
top_15_data <- state_counts_all_years %>% filter(state %in% top_15_states$state)

# Generate the summary statistics for each of the top 15 states
summary_stats <- top_15_data %>%
  group_by(state) %>%
  summarize(
    min_count = min(count, na.rm = TRUE),
    q1_count = quantile(count, 0.25, na.rm = TRUE),
    median_count = median(count, na.rm = TRUE),
    mean_count = mean(count, na.rm = TRUE),
    q3_count = quantile(count, 0.75, na.rm = TRUE),
    max_count = max(count, na.rm = TRUE),
    sd_count = sd(count, na.rm = TRUE),
    min_killed = min(n_killed, na.rm = TRUE),
    q1_killed = quantile(n_killed, 0.25, na.rm = TRUE),
    median_killed = median(n_killed, na.rm = TRUE),
    mean_killed = mean(n_killed, na.rm = TRUE),
    q3_killed = quantile(n_killed, 0.75, na.rm = TRUE),
    max_killed = max(n_killed, na.rm = TRUE),
    sd_killed = sd(n_killed, na.rm = TRUE),
    min_injured = min(n_injured, na.rm = TRUE),
    q1_injured = quantile(n_injured, 0.25, na.rm = TRUE),
    median_injured = median(n_injured, na.rm = TRUE),
    mean_injured = mean(n_injured, na.rm = TRUE),
    q3_injured = quantile(n_injured, 0.75, na.rm = TRUE),
    max_injured = max(n_injured, na.rm = TRUE),
    sd_injured = sd(n_injured, na.rm = TRUE)
  )

# Round all summary statistics
summary_stats <- summary_stats %>%
  mutate(across(where(is.numeric), ~ round(.x, digits = 0)))

# Split the summary statistics into three different data frames
summary_incidents <- summary_stats %>%
  select(state, min_count, q1_count, median_count, mean_count, q3_count, max_count, sd_count)

summary_killed <- summary_stats %>%
  select(state, min_killed, q1_killed, median_killed, mean_killed, q3_killed, max_killed, sd_killed)

summary_injured <- summary_stats %>%
  select(state, min_injured, q1_injured, median_injured, mean_injured, q3_injured, max_injured, sd_injured)

# Convert each summary statistics data frame to a flextable
ft_incidents <- flextable(summary_incidents) %>%
  set_header_labels(
    state = "State",
    min_count = "Min Incident", q1_count = "Q1 Incident", median_count = "Median Incident", 
    mean_count = "Mean Incident", q3_count = "Q3 Incident", max_count = "Max Incident", 
    sd_count = "SD Incident"
  ) %>%
  add_header_row(values = "Table 4a: Summary Statistics of Gun Violence Incidents by state (2017)", colwidths = 8) %>%
  padding(padding = 3, part = "all") %>%
  add_footer_lines(values = "Source: Gun Violence Archive") %>%
  theme_vanilla() %>%
  autofit()

ft_killed <- flextable(summary_killed) %>%
  set_header_labels(
    state = "State",
    min_killed = "Min Killed", q1_killed = "Q1 Killed", median_killed = "Median Killed", 
    mean_killed = "Mean Killed", q3_killed = "Q3 Killed", max_killed = "Max Killed", 
    sd_killed = "SD Killed"
  ) %>%
  add_header_row(values = "Table 4b: Summary Statistics of Gun Violence Deaths by state (2017)", colwidths = 8) %>%
  padding(padding = 3, part = "all") %>%
  add_footer_lines(values = "Source: Gun Violence Archive") %>%
  theme_vanilla() %>%
  autofit()

ft_injured <- flextable(summary_injured) %>%
  set_header_labels(
    state = "State",
    min_injured = "Min Injured", q1_injured = "Q1 Injured", median_injured = "Median Injured", 
    mean_injured = "Mean Injured", q3_injured = "Q3 Injured", max_injured = "Max Injured", 
    sd_injured = "SD Injured"
  ) %>%
  add_header_row(values = "Table 4c: Summary Statistics of Gun Violence Injuries by state (2017)", colwidths = 8) %>%
  padding(padding = 3, part = "all") %>%
  add_footer_lines(values = "Source: Gun Violence Archive") %>%
  theme_vanilla() %>%
  autofit()

# Print the flextables
ft_incidents

Table 4a: Summary Statistics of Gun Violence Incidents per Year by State (2014-2017)

State           Min Incident  Q1 Incident  Median Incident  Mean Incident  Q3 Incident  Max Incident  SD Incident
California      3,234         3,521        3,674            3,793          3,946        4,588         571
Florida         2,702         3,029        3,647            3,549          4,167        4,201         748
Georgia         1,708         1,922        2,013            2,125          2,216        2,767         452
Illinois        3,095         3,366        4,256            4,174          5,064        5,089         1,048
Louisiana       1,692         1,852        1,946            1,914          2,008        2,070         162
Massachusetts   968           1,246        1,441            1,403          1,598        1,761         337
Missouri        1,272         1,482        1,627            1,564          1,709        1,730         210
New York        1,903         1,999        2,276            2,340          2,617        2,903         461
North Carolina  1,856         2,040        2,121            2,066          2,147        2,165         142
Ohio            2,132         2,309        2,405            2,411          2,507        2,701         234
Pennsylvania    1,789         2,076        2,178            2,103          2,206        2,267         214
South Carolina  1,490         1,618        1,678            1,642          1,702        1,721         104
Tennessee       1,590         1,720        1,818            1,810          1,909        2,014         179
Texas           2,875         3,068        3,204            3,222          3,358        3,606         305
Virginia        1,273         1,313        1,400            1,412          1,499        1,578         139

Source: Gun Violence Archive

ft_killed

Table 4b: Summary Statistics of Gun Violence Deaths per Year by State (2014-2017)

State           Min Killed  Q1 Killed  Median Killed  Mean Killed  Q3 Killed  Max Killed  SD Killed
California      1,204       1,250      1,292          1,303        1,346      1,423       93
Florida         796         854        912            905          963        1,001       90
Georgia         522         542        572            581          611        658         60
Illinois        637         675        817            808          949        959         169
Louisiana       465         492        514            510          532        549         36
Massachusetts   101         107        110            112          116        128         11
Missouri        383         465        504            500          538        609         93
New York        347         384        406            406          428        463         48
North Carolina  487         492        524            528          561        576         45
Ohio            494         526        582            590          646        700         92
Pennsylvania    508         537        552            562          576        635         53
South Carolina  344         374        386            378          389        395         23
Tennessee       352         384        432            428          476        495         66
Texas           1,094       1,126      1,152          1,178        1,205      1,313       95
Virginia        284         309        351            346          388        398         54

Source: Gun Violence Archive

ft_injured

Table 4c: Summary Statistics of Gun Violence Injuries per Year by State (2014-2017)

State           Min Injured  Q1 Injured  Median Injured  Mean Injured  Q3 Injured  Max Injured  SD Injured
California      1,607        1,614       1,756           1,768         1,910       1,953        182
Florida         1,463        1,526       1,674           1,664         1,812       1,846        188
Georgia         817          847         957             957           1,067       1,096        140
Illinois        2,276        2,686       3,226           3,216         3,756       4,137        828
Louisiana       898          915         1,012           1,026         1,123       1,184        139
Massachusetts   315          365         419             405           459         468          71
Missouri        647          793         882             833           921         921          129
New York        965          1,060       1,184           1,213         1,336       1,521        241
North Carolina  884          1,043       1,124           1,084         1,166       1,205        141
Ohio            1,120        1,258       1,380           1,349         1,470       1,517        177
Pennsylvania    1,064        1,157       1,204           1,190         1,237       1,288        94
South Carolina  626          701         732             720           750         791          69
Tennessee       857          980         1,092           1,060         1,172       1,198        155
Texas           1,157        1,306       1,460           1,423         1,577       1,614        210
Virginia        578          731         860             838           967         1,051        205

Source: Gun Violence Archive

4.4.2 Visualization: Bar graph of Incidents by state

# Filtering for the year 2017
gun_data_2017 <- final_data |>
  filter(year == 2017)

# Summarize the number of incidents per state
state_counts <- gun_data_2017 |>
  group_by(state) |>
  summarise(count = n()) |>
  arrange(desc(count))

# Convert the state column to a factor with levels ordered by count
gun_data_2017 <- gun_data_2017 |>
  mutate(state = factor(state, levels = state_counts$state))

gun_data_2017 |>
  ggplot(
    aes(
      x = state
    )
  ) +
  geom_bar(fill = "steelblue") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  labs(
    y = "Number of Incidents",
    x = "State",
    title = "Total number of gun-violence incidents by state",
    subtitle = "Year 2017",
    caption = "Source: Gun Violence Archive"
  ) + 
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)
  )

Looking at this initial visualization, we can see staggering levels of gun violence in certain states. However, most of the states with high incident counts are also among the most populous in the country, so I decided to analyze the same data while controlling for population.

4.4.3 Census Data Cleaning

census_df <- census_df |>
  filter(year >= 2014, year <= 2017)

# Filtering census data for 2017
census_df_2017 <- census_df |>
  filter(year == 2017)

# Aggregating population by state
census_df_2017_pop_by_state <- census_df_2017 |>
  group_by(county_state, year, geoid) |>
  summarise(population = sum(population), .groups = 'drop')

4.4.4 Examining high-level data characteristics

skim(census_df)
Data summary
Name census_df
Number of rows 208
Number of columns 26
_______________________
Column type frequency:
character 2
numeric 24
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
geoid 0 1 2 2 0 52 0
county_state 0 1 4 20 0 52 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2015.50 1.12 2014.00 2014.75 2015.50 2016.25 2017.00 ▇▇▁▇▇
population 0 1 6263911.70 7109820.85 579315.00 1791128.50 4278116.50 7027585.00 39536653.00 ▇▂▁▁▁
median_income 0 1 56577.62 11095.19 18626.00 49491.25 55088.00 63116.00 82372.00 ▁▁▇▅▂
median_monthly_rent_cost 0 1 921.28 213.41 433.00 769.75 853.50 1049.50 1573.00 ▁▇▅▂▁
median_monthly_home_cost 0 1 1057.44 344.69 251.00 826.25 968.00 1269.25 1977.00 ▁▇▆▃▂
prop_female 0 1 0.51 0.01 0.47 0.50 0.51 0.51 0.53 ▁▁▆▇▂
prop_male 0 1 0.49 0.01 0.47 0.49 0.49 0.50 0.53 ▂▇▆▁▁
prop_white 0 1 0.76 0.13 0.25 0.68 0.78 0.85 0.95 ▁▁▃▇▇
prop_black 0 1 0.11 0.11 0.00 0.03 0.08 0.16 0.49 ▇▃▁▁▁
prop_native 0 1 0.02 0.03 0.00 0.00 0.00 0.01 0.15 ▇▁▁▁▁
prop_asian 0 1 0.04 0.05 0.00 0.02 0.03 0.05 0.38 ▇▁▁▁▁
prop_hawaiin_islander 0 1 0.00 0.01 0.00 0.00 0.00 0.00 0.10 ▇▁▁▁▁
prop_other_race 0 1 0.03 0.03 0.00 0.01 0.02 0.05 0.17 ▇▂▁▁▁
prop_multi_racial 0 1 0.03 0.03 0.01 0.02 0.03 0.03 0.24 ▇▁▁▁▁
prop_highschool 0 1 0.24 0.04 0.15 0.22 0.24 0.26 0.35 ▁▅▇▃▁
prop_GED 0 1 0.04 0.01 0.02 0.04 0.04 0.05 0.07 ▃▇▆▃▂
prop_some_college 0 1 0.06 0.01 0.01 0.06 0.06 0.07 0.09 ▁▁▇▇▂
prop_college_no_degree 0 1 0.15 0.02 0.09 0.13 0.15 0.16 0.21 ▂▃▇▅▁
prop_associates 0 1 0.09 0.02 0.03 0.08 0.08 0.10 0.14 ▁▂▇▃▁
prop_bachelors 0 1 0.19 0.03 0.12 0.17 0.19 0.21 0.26 ▂▅▇▆▂
prop_masters 0 1 0.08 0.03 0.05 0.07 0.08 0.09 0.21 ▇▅▁▁▁
prop_professional 0 1 0.02 0.01 0.01 0.02 0.02 0.02 0.08 ▇▁▁▁▁
prop_doctoral 0 1 0.01 0.01 0.01 0.01 0.01 0.02 0.05 ▇▂▁▁▁
prop_poverty 0 1 0.15 0.05 0.07 0.11 0.14 0.16 0.46 ▇▅▁▁▁

4.4.5 Visualization: Bar Graph of Incidents by state per 100,000 people

# Summarize the number of incidents per state
state_counts <- gun_data_2017 |>
  group_by(state) |>
  summarise(count = n()) |>
  arrange(desc(count))

# Merge with population data
state_counts <- state_counts |>
  left_join(census_df_2017_pop_by_state, by = c("state" = "county_state"))

# Calculate incidents per 100,000 people
state_counts <- state_counts |>
  mutate(incidents_per_100k = (count / population) * 100000)

# Convert the state column to a factor with levels ordered by incidents_per_100k
state_counts <- state_counts |>
  mutate(state = factor(state, levels = state[order(incidents_per_100k, decreasing = TRUE)]))

# Create the bar chart
ggplot(state_counts, aes(x = state, y = incidents_per_100k)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  labs(
    y = "Incidents per 100,000 People",
    x = "State",
    title = "Gun Violence Incidents per 100,000 People by State",
    subtitle = "Year 2017",
    caption = "Source: Gun Violence Archive & U.S. Census Bureau"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)
  )

After controlling for population using the Census Bureau dataset, DC has the most gun violence incidents per 100,000 people in the country for this year. However, DC is not a state but a federal district, more akin to a city. For the sake of comparing like with like, I will omit it from the analysis for now.

4.4.6 Visualization: Bar Graph of Incidents by state per 100,000 people - Removing DC

# Summarize the number of incidents per state
state_counts <- gun_data_2017 |>
  filter(state != "District of Columbia") |>
  group_by(state) |>
  summarise(count = n()) |>
  arrange(desc(count))

# Merge with population data
state_counts <- state_counts |>
  left_join(census_df_2017_pop_by_state, by = c("state" = "county_state"))

# Calculate incidents per 100,000 people
state_counts <- state_counts |>
  mutate(incidents_per_100k = (count / population) * 100000)

# Convert the state column to a factor with levels ordered by incidents_per_100k
state_counts <- state_counts |>
  mutate(state = factor(state, levels = state[order(incidents_per_100k, decreasing = TRUE)]))

# Create the bar chart

ggplot(state_counts, aes(x = state, y = incidents_per_100k)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  labs(
    y = "Incidents per 100,000 People",
    x = "State",
    title = "Gun Violence Incidents per 100,000 People by State",
    subtitle = "Year 2017",
    caption = "Source: Gun Violence Archive & U.S. Census Bureau"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
    legend.position = "none"
  )

After removing the District of Columbia, we get a better comparison between states. I was surprised to see Alaska taking the top spot. After some Google searching, I obtained records of gun ownership by state.

According to World Population Review, gun ownership rates in Alaska are estimated to be 64.5%. This could potentially be a correlative factor in gun violence incidents within the state. It was also surprising to see Delaware above a state like Illinois, considering the anecdotes I’ve always heard regarding the violence in Chicago.

Interestingly, Delaware and Illinois show the inverse of Alaska's pattern with respect to gun ownership. Delaware has the 10th-lowest gun ownership rate among states at 34.4%, according to the World Population Review, and Illinois ranks even lower, 7th, at 27.8%. Gun ownership seems like an interesting feature to explore, but given these inconsistencies, I chose to press on.
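
As a rough illustration, this sketch lines up the ownership rates quoted above (transcribed from the World Population Review figures, not derived from this data) against the 2017 incident rates computed earlier:

# Hypothetical comparison: quoted ownership rates vs. 2017 incident rates
ownership <- tribble(
  ~state,     ~ownership_rate,
  "Alaska",   0.645,
  "Delaware", 0.344,
  "Illinois", 0.278
)

state_counts |>
  mutate(state = as.character(state)) |>
  inner_join(ownership, by = "state") |>
  select(state, incidents_per_100k, ownership_rate)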

4.5 Visualization: Interactive Map of Incidents by State

Here I mapped each state's rate of incidents per 100,000 people onto a leaflet to highlight the most gun-violence-prone states in 2017.

# Downloading state-level shape files from US Census Bureau
if(!file.exists("cb_2018_us_state_500k.zip")) {
  download.file(url = "https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_state_500k.zip",
                destfile = "cb_2018_us_state_500k.zip")
}

# Create directory for geospatial files
if(!dir.exists("GeoFiles")) {
  dir.create("GeoFiles")
}

# Unzipping files
utils::unzip("cb_2018_us_state_500k.zip",
             exdir = "GeoFiles")

# Loading the shapefiles
state_shape <- st_read("GeoFiles/cb_2018_us_state_500k.shp",
                       quiet = TRUE)

# Ensure the geoid column in state_counts is properly formatted as GEOID
state_counts <- state_counts |>
  mutate(GEOID = str_pad(geoid, width = 2, side = "left", pad = "0"))

# Merging state_counts with state_shape shapefile
state_shape_full <- state_shape |> 
  left_join(state_counts, by = c("GEOID" = "GEOID"))

# Remove rows with NA values
cleaned_state_shape <- state_shape_full |>
  drop_na()

# Ensure cleaned_state_shape is an sf object if not already done
if (!inherits(cleaned_state_shape, "sf")) {
  cleaned_state_shape <- st_as_sf(cleaned_state_shape)
}

# Transform to WGS84 CRS to be compatible with leaflet
cleaned_state_shape <- st_transform(cleaned_state_shape, crs = 4326)

# Check for and handle missing values in key columns
cleaned_state_shape <- cleaned_state_shape |>
  filter(!is.na(state) & !is.na(population) & !is.na(count) & !is.na(incidents_per_100k))

# Defining pop-up labels
labels <- cleaned_state_shape |>
  mutate(labels = paste0(
    "<strong>", str_to_title(state), "</strong><br/>",
    "Population: ", population, "<br/>",
    "Incidents: ", count, "<br/>",
    "Incidents per 100k: ", round(incidents_per_100k, 2))) |>
  pull(labels) |>
  lapply(htmltools::HTML)

# Quantitative variable to color areal units (states) by
myVariable <- cleaned_state_shape$incidents_per_100k

# Defining nice cut points for coloring
bins <- seq(min(myVariable, na.rm = TRUE), max(myVariable, na.rm = TRUE), length.out = 6)

# Selecting color palette
pal <- colorBin("PuBu", domain = myVariable, bins = bins)

# Create the choropleth map
myLeaflet <- leaflet(cleaned_state_shape) |>
  setView(lng = -96, lat = 37.8, zoom = 2.4) |>
  addProviderTiles("OpenStreetMap.Mapnik",
                   options = providerTileOptions(
                     id = "mapbox.light",
                     accessToken = Sys.getenv('MAPBOX_ACCESS_TOKEN'))) |>
  addPolygons(
    fillColor = ~pal(myVariable),
    weight = 0.80,
    opacity = 1,
    color = "black",
    fillOpacity = 1,
    highlightOptions = highlightOptions(
      color = "#666"),
    label = labels,
    labelOptions = labelOptions(textsize = "15px")) |>
  addLegend(pal = pal, 
            values = ~ myVariable, 
            opacity = 0.7, 
            title = "Gun Violence Incidents per 100k (2017)",
            position = "bottomright")

# Display the map
myLeaflet

4.6 Visualization: Boxplot Showing Range of Incidents Per Year from 2014-2017

# Arrange by incidents_per_100k in descending order and extract the top 15 states
top_15_states <- state_counts |>
  arrange(desc(incidents_per_100k)) |>
  slice(1:15) |>
  pull(state)

# Merge with population data
state_counts_all_years <- state_counts_all_years |>
  left_join(census_df, by = c("state" = "county_state",
                              "year" = "year"))

# Calculate incidents per 100,000 people
state_counts_all_years <- state_counts_all_years |>
  mutate(incidents_per_100k = (count / population) * 100000)

# Filter the state_counts_all_years DataFrame to include only the top 15 states
filtered_data <- state_counts_all_years |>
  filter(state %in% top_15_states)

# Create the box plot
ggplot(filtered_data, aes(x = fct_reorder(state, incidents_per_100k, .desc=FALSE ) , y = incidents_per_100k)) +
  geom_boxplot(fill = "steelblue", alpha = 0.9) +
  stat_boxplot(geom = "errorbar", width = 0.3, coef = 1.5) +
  labs(
    y = "Incidents per 100,000 People",
    x = "State",
    title = "Range of Gun Violence Incidents per 100,000 People by State (2014-2017)",
    subtitle = "Top 15 States",
    caption = "Source: Gun Violence Archive & U.S. Census Bureau"
  ) +
  
  coord_flip() +
  theme_minimal() +
  theme(
    axis.text.x = element_text(hjust = .5, vjust = 0.5)
  )

Here we can see how the gun incident rates per 100,000 people ranged from 2014 to 2017 for the top 15 states.
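
For a numeric view of the same spread, here is a small sketch computing the range (maximum minus minimum) of each state's annual rates:

# Range of annual incident rates per 100k for the top 15 states
filtered_data |>
  group_by(state) |>
  summarise(rate_range = max(incidents_per_100k, na.rm = TRUE) -
              min(incidents_per_100k, na.rm = TRUE)) |>
  arrange(desc(rate_range))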

4.7 Gun Violence - City Level

Having thoroughly examined the data at the state level, it became evident that the broader trends and anomalies could benefit from a more granular analysis. Thus, I decided to adjust my aperture toward the city level. By zooming in on the urban centers, we can gain deeper insights into the dynamics of gun violence. Cities often present unique challenges and opportunities for intervention, and understanding the variations within these metropolitan areas will be crucial for formulating more effective strategies. Let’s delve into the city-level data to uncover these nuanced patterns.

4.7.1 City Level Summary Statistics

# Group by state, geoid, and city_or_county, then summarise and arrange
sorted_city_rates_2017 <- gun_data_2017 |>
  group_by(geoid, state, city_or_county) |>
  summarise(count = n(), .groups = 'drop') |>
  arrange(desc(count))

# filter county data for 2017
county_census_2017 <- county_census_df |>
  filter(
    year == 2017
  )

# Split the county_state column into county and state columns
county_census_2017 <- county_census_2017 |>
  separate(county_state, into = c("city_or_county", "state"), sep = ", ")

# Merge with population data
grouped_data <- county_census_2017  |>
  left_join(sorted_city_rates_2017, by = c("geoid" = "geoid"))

grouped_data <- grouped_data |>
  drop_na()

grouped_data <- grouped_data |>
  rename(
    county = city_or_county.x,
    city = city_or_county.y,
    state = state.x
  ) |>
  select(-state.y)

# Calculate incidents per 100,000 people
grouped_data <- grouped_data |>
  mutate(incidents_per_100k = (count / population) * 100000)

# Getting state abbreviations for plotting
state_abbr_lookup <- data.frame(
  state = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "District of Columbia", "Florida", "Georgia", 
            "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland", 
            "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada", 
            "New Hampshire", "New Jersey", "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio", "Oklahoma", 
            "Oregon", "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", "Tennessee", "Texas", "Utah", 
            "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming"),
  state_abbr = c("AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC","FL", "GA", 
                 "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", 
                 "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", 
                 "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", 
                 "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY")
)
# Join with the state abbreviation lookup table
grouped_data <- grouped_data |>
  left_join(state_abbr_lookup, by = "state")

# Create a unique identifier by combining city and state abbreviation
grouped_data <- grouped_data |>
  mutate(city_state = paste(city, state_abbr, sep = ", "))

top_15_cities <- grouped_data |>
  arrange(desc(count)) |>  # Sort by count in descending order
  slice_head(n = 15) 

# Split the county_state column into county and state columns
county_census_df <- county_census_df |>
  separate(county_state, into = c("city_or_county", "state"), sep = ", ")

# Group by state, geoid, and city_or_county, then summarise and arrange
sorted_city_rates_all_years <- final_data |>
  group_by(geoid, state, city_or_county, year) |>
  summarise(
    count = n(),
    n_killed = sum(n_killed),
    n_injured = sum(n_injured),
    .groups = 'drop'
    ) |>
  arrange(desc(count), city_or_county, state )

# Merge with population data
sorted_city_rates_all_years <- sorted_city_rates_all_years  |>
  left_join(county_census_df, by = c("geoid" = "geoid",
                                     "year" = "year"))

sorted_city_rates_all_years <- sorted_city_rates_all_years |>
  rename(state = state.x, city = city_or_county.x, county = city_or_county.y )|>
  group_by(
    city,
    state,
    year
  ) |>
  summarise(
    count = sum(count),
    n_killed = sum(n_killed),
    n_injured = sum(n_injured),
    .groups = 'drop'
  )

# Join with the state abbreviation lookup table
sorted_city_rates_all_years <- sorted_city_rates_all_years |>
  left_join(state_abbr_lookup, by = "state")

# Create a unique identifier by combining city and state abbreviation
sorted_city_rates_all_years <- sorted_city_rates_all_years |>
  mutate(city_state = paste(city, state_abbr, sep = ", "))


# Filter sorted_city_rates_all_years to only include the top 15 cities
sorted_city_rates_all_years <- sorted_city_rates_all_years %>%
  filter(city_state %in% top_15_cities$city_state)

# Generate the summary statistics for each of the top 15 cities
city_summary_stats <- sorted_city_rates_all_years %>%
  group_by(city) %>%
  summarize(
    min_count = min(count, na.rm = TRUE),
    q1_count = quantile(count, 0.25, na.rm = TRUE),
    median_count = median(count, na.rm = TRUE),
    mean_count = mean(count, na.rm = TRUE),
    q3_count = quantile(count, 0.75, na.rm = TRUE),
    max_count = max(count, na.rm = TRUE),
    sd_count = sd(count, na.rm = TRUE),
    min_killed = min(n_killed, na.rm = TRUE),
    q1_killed = quantile(n_killed, 0.25, na.rm = TRUE),
    median_killed = median(n_killed, na.rm = TRUE),
    mean_killed = mean(n_killed, na.rm = TRUE),
    q3_killed = quantile(n_killed, 0.75, na.rm = TRUE),
    max_killed = max(n_killed, na.rm = TRUE),
    sd_killed = sd(n_killed, na.rm = TRUE),
    min_injured = min(n_injured, na.rm = TRUE),
    q1_injured = quantile(n_injured, 0.25, na.rm = TRUE),
    median_injured = median(n_injured, na.rm = TRUE),
    mean_injured = mean(n_injured, na.rm = TRUE),
    q3_injured = quantile(n_injured, 0.75, na.rm = TRUE),
    max_injured = max(n_injured, na.rm = TRUE),
    sd_injured = sd(n_injured, na.rm = TRUE)
  ) %>%
  arrange(desc(mean_count))

# Round all summary statistics
city_summary_stats <- city_summary_stats %>%
  mutate(across(where(is.numeric), ~ round(.x, digits = 0)))


# Split the summary statistics into three different data frames
city_summary_incidents <- city_summary_stats %>%
  select(city, min_count, q1_count, median_count, mean_count, q3_count, max_count, sd_count)

city_summary_killed <- city_summary_stats %>%
  select(city, min_killed, q1_killed, median_killed, mean_killed, q3_killed, max_killed, sd_killed)

city_summary_injured <- city_summary_stats %>%
  select(city, min_injured, q1_injured, median_injured, mean_injured, q3_injured, max_injured, sd_injured)

# Convert each summary statistics data frame to a flextable
ft_incidents_city <- flextable(city_summary_incidents) %>%
  set_header_labels(
    city = "City",
    min_count = "Min Incident", q1_count = "Q1 Incident", median_count = "Median Incident", 
    mean_count = "Mean Incident", q3_count = "Q3 Incident", max_count = "Max Incident", 
    sd_count = "SD Incident"
  ) %>%
  add_header_row(values = "Table 5a: City Summary Statistics of Gun Violence Incidents per Year (2014-2017)", colwidths = 8) %>%
  padding(padding = 3, part = "all") %>%
  add_footer_lines(values = "Source: Gun Violence Archive") %>%
  theme_vanilla() %>%
  autofit()

ft_killed_city <- flextable(city_summary_killed) %>%
  set_header_labels(
    city = "City",
    min_killed = "Min Killed", q1_killed = "Q1 Killed", median_killed = "Median Killed", 
    mean_killed = "Mean Killed", q3_killed = "Q3 Killed", max_killed = "Max Killed", 
    sd_killed = "SD Killed"
  ) %>%
  add_header_row(values = "Table 5b: City Summary Statistics of Gun Violence Deaths per Year (2014-2017)", colwidths = 8) %>%
  padding(padding = 3, part = "all") %>%
  add_footer_lines(values = "Source: Gun Violence Archive") %>%
  theme_vanilla() %>%
  autofit()

ft_injured_city <- flextable(city_summary_injured) %>%
  set_header_labels(
    city = "City",
    min_injured = "Min Injured", q1_injured = "Q1 Injured", median_injured = "Median Injured", 
    mean_injured = "Mean Injured", q3_injured = "Q3 Injured", max_injured = "Max Injured", 
    sd_injured = "SD Injured"
  ) %>%
  add_header_row(values = "Table 5c: City Summary Statistics of Gun Violence Injuries per Year (2014-2017)", colwidths = 8) %>%
  padding(padding = 3, part = "all") %>%
  add_footer_lines(values = "Source: Gun Violence Archive") %>%
  theme_vanilla() %>%
  autofit()

# Print the flextables
ft_incidents_city

Table 5a: City Summary Statistics of Gun Violence Incidents per Year (2014-2017)

City          Min Incident  Q1 Incident  Median Incident  Mean Incident  Q3 Incident  Max Incident  SD Incident
Chicago       2,043         2,316        2,723            2,770          3,177        3,591         685
Baltimore     760           873          916              928            971          1,117         146
Washington    496           525          686              752            913          1,142         301
New Orleans   660           702          744              750            792          850           81
Philadelphia  533           561          610              698            748          1,039         232
Saint Louis   517           546          643              635            732          736           115
Milwaukee     479           496          522              590            617          839           168
Houston       550           559          582              580            604          606           28
Memphis       421           497          583              574            660          709           128
Jacksonville  327           355          570              563            778          784           251
Indianapolis  425           446          463              462            479          497           31
Cleveland     284           370          450              435            515          554           119
Boston        330           338          399              404            466          490           81
Detroit       266           286          314              332            360          433           73
Peoria        144           152          184              223            256          380           109

Source: Gun Violence Archive

ft_killed_city

Table 5b: City Summary Statistics of Gun Violence Deaths per Year (2014-2017)

City          Min Killed  Q1 Killed  Median Killed  Mean Killed  Q3 Killed  Max Killed  SD Killed
Chicago       414         434        538            553          658        722         150
Baltimore     172         241        264            251          274        304         56
Washington    91          98         103            105          110        122         13
New Orleans   154         162        170            169          177        182         12
Philadelphia  186         206        217            211          222        223         17
Saint Louis   171         218        234            235          250        300         53
Milwaukee     91          105        122            120          136        144         24
Houston       217         254        278            266          290        292         35
Memphis       110         120        146            148          174        189         37
Jacksonville  88          92         95             98           100        113         11
Indianapolis  136         139        148            148          157        163         13
Cleveland     85          104        122            117          134        139         24
Boston        34          37         40             41           44         49          6
Detroit       126         130        137            140          147        160         15
Peoria        4           6          10             9            13         14          5

Source: Gun Violence Archive

ft_injured_city

Table 5c: City Summary Statistics of Gun Violence Injuries per Year (2014-2017)

City          Min Injured  Q1 Injured  Median Injured  Mean Injured  Q3 Injured  Max Injured  SD Injured
Chicago       1,895        2,252       2,662           2,674         3,084       3,478        689
Baltimore     442          550         600             574           624         655          92
Washington    298          328         339             330           341         345          22
New Orleans   381          419         467             454           502         503          59
Philadelphia  431          448         470             546           568         812          179
Saint Louis   353          385         439             436           490         514          75
Milwaukee     336          346         370             375           399         423          40
Houston       222          242         260             284           302         394          76
Memphis       298          345         412             402           469         486          88
Jacksonville  234          250         272             272           295         311          35
Indianapolis  233          253         268             273           287         324          38
Cleveland     176          236         298             293           355         398          97
Boston        108          115         134             139           158         180          33
Detroit       169          170         197             227           253         344          82
Peoria        51           56          64              71            78          105          24

Source: Gun Violence Archive

# Convert the city_state column to a factor with levels ordered by incidents_per_100k
top_15_cities <- top_15_cities |>
  mutate(city_state = factor(city_state, levels = city_state[order(incidents_per_100k, decreasing = FALSE)]))

# Create the bar chart
ggplot(top_15_cities, aes(x = city_state, y = incidents_per_100k)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  labs(
    y = "Incidents per 100,000 People",
    x = "City",
    title = "Gun Violence Incidents per 100,000 People by city",
    subtitle = "Year 2017",
    caption = "Source: Gun Violence Archive & U.S. Census Bureau"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(vjust = 0.5)
  ) +
  coord_flip()

n_killed_2017 <- gun_data_2017 |>
  group_by(geoid) |>
  summarise(n_killed = sum(n_killed), .groups = 'drop')

top_15_cities <- top_15_cities |>
  left_join(n_killed_2017, by = "geoid")

# Calculate the proportion of deaths to incidents
top_15_cities <- top_15_cities |>
  mutate(prop_deaths = (n_killed / count))

# Calculate gun deaths per 100,000 people
top_15_cities <- top_15_cities |>
  mutate(deaths_per_100k = (n_killed / population) * 100000)

top_15_cities <- top_15_cities |>
  mutate(city_state = factor(city_state, levels = city_state[order(deaths_per_100k, decreasing = FALSE)]))

# Create the bar chart
ggplot(top_15_cities, aes(x = city_state, y = deaths_per_100k)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  labs(
    y = "Deaths per 100,000 People",
    x = "City",
    title = "Gun Violence Deaths per 100,000 People by city",
    subtitle = "Year 2017",
    caption = "Source: Gun Violence Archive & U.S. Census Bureau"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(vjust = 0.5)
  ) +
  coord_flip()

As we look at the data more granularly, a different picture starts to emerge. While Alaska seems to experience the most gun violence at the state level, no Alaskan city is within the top 15 most incident-prone cities. Viewing those top 15 cities through the lens of deadliness, Saint Louis, MO stands out, with gun deaths equal to roughly 36% of its incident count in 2017. Let's dive a bit deeper to uncover more insights.
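
A minimal sketch makes that lethality comparison explicit, using the prop_deaths column computed above:

# Rank the top-15 cities by deaths as a share of 2017 incidents
top_15_cities |>
  arrange(desc(prop_deaths)) |>
  select(city_state, count, n_killed, prop_deaths)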

4.7.2 Visualization: Correlation plots

With all the additional demographic data from the census dataset, I wanted to see whether any of the added features correlate with the deaths and incidents in our dataset. From here on out, I will continue to look at the 15 cities with the highest levels of gun violence incidents in 2017.

I added three features to analyze:

  • rent_percentage: The proportion of median annual rent cost to the median annual income.

  • home_percentage: The proportion of median annual mortgage cost to the median annual income.

  • prop_deaths: The ratio of gun violence deaths to gun violence incidents

I added these to see if these scores, in addition to any other macroeconomic data, might give us a better signal to correlate with deaths and incidents.

# Adding rent_percentage and home_percentage to top_15_cities
top_15_cities <- top_15_cities |>
  mutate(
    # rent_percentage: The proportion of median annual rent cost to the median annual income
    rent_percentage = (median_monthly_rent_cost * 12) / median_income,
    
    # home_percentage: The proportion of median annual home cost to the median annual income
    home_percentage = (median_monthly_home_cost * 12) / median_income
  )

# Function to rearrange columns by data type
rearrange_by_data_type <- function(df) {
  # Split the columns by data type
  num_cols <- df |> select(where(is.numeric))
  factor_cols <- df |> select(where(is.factor))
  char_cols <- df |> select(where(is.character))
  logical_cols <- df |> select(where(is.logical))
  date_cols <- df |> select(where(is.Date))
  
  # Combine the columns back in the desired order
  df_rearranged <- bind_cols(char_cols, factor_cols, date_cols, num_cols, logical_cols)
  
  return(df_rearranged)
}

# Apply the function to rearrange columns by data type
top_15_cities <- rearrange_by_data_type(top_15_cities)

# selecting numeric columns to perform correlation matrix on
numeric_columns <- top_15_cities |>
  select(population:home_percentage)

# Calculate the correlation matrix for the selected numeric columns
correlation_matrix <- cor(numeric_columns, use = "complete.obs")

# Extract correlations related to incidents_per_100k and deaths_per_100k
correlation_incidents <- correlation_matrix["incidents_per_100k", ]
correlation_deaths <- correlation_matrix["deaths_per_100k", ]

# Convert the extracted correlations to a data frame
correlation_incidents_df <- as.data.frame(correlation_incidents)
correlation_incidents_df$variable <- rownames(correlation_incidents_df)
rownames(correlation_incidents_df) <- NULL

# Create a diagonal matrix to visualize with ggcorrplot
diag_matrix <- diag(0, ncol(correlation_matrix))
rownames(diag_matrix) <- colnames(diag_matrix) <- colnames(correlation_matrix)
diag_matrix["incidents_per_100k", ] <- correlation_matrix["incidents_per_100k", ]
diag_matrix[, "incidents_per_100k"] <- correlation_matrix[, "incidents_per_100k"]
diag_matrix <- as.data.frame(diag_matrix)
diag_matrix <- diag_matrix |> select(incidents_per_100k)
diag_matrix <- t(diag_matrix)

# Plot the filtered correlation matrix using ggcorrplot
ggcorrplot(as.matrix(diag_matrix),
           hc.order = FALSE,
           outline.color = "white",
           lab = TRUE,
           lab_size = 3,
           title = "Correlation with Incidents per 100k") +
  theme(
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 10),
    axis.text.y = element_text(size = 10),
    plot.title = element_text(size = 15),
    plot.subtitle = element_text(size = 10),
    plot.margin = unit(c(1, 1, 1, 1), "cm")
  ) +
  coord_flip() 

At the incidents-per-100k level, the strongest signals come from population, deaths per 100,000 people, and the proportion of deaths per incident, which is unsurprising given that the latter two are derived from the same underlying counts. We also see a very small signal from the gender demographic: since males tend to be more aggressive, it seems plausible that gun violence decreases slightly where they are less represented within the population.

# Convert the extracted correlations to a data frame
correlation_deaths_df <- as.data.frame(correlation_deaths)
correlation_deaths_df$variable <- rownames(correlation_deaths_df)
rownames(correlation_deaths_df) <- NULL

# Create a diagonal matrix to visualize with ggcorrplot
diag_matrix <- diag(0, ncol(correlation_matrix))
rownames(diag_matrix) <- colnames(diag_matrix) <- colnames(correlation_matrix)
diag_matrix["deaths_per_100k", ] <- correlation_matrix["deaths_per_100k", ]
diag_matrix[, "deaths_per_100k"] <- correlation_matrix[, "deaths_per_100k"]
diag_matrix <- as.data.frame(diag_matrix)
diag_matrix <- diag_matrix |> select(deaths_per_100k)
diag_matrix <- t(diag_matrix)

# Plot the filtered correlation matrix using ggcorrplot
ggcorrplot(as.matrix(diag_matrix),
           hc.order = FALSE,
           outline.color = "white",
           lab = TRUE,
           lab_size = 3,
           title = "Correlation with Deaths per 100k") +
  theme(
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 10),
    axis.text.y = element_text(size = 10),
    plot.title = element_text(size = 15),
    plot.subtitle = element_text(size = 10),
    plot.margin = unit(c(1, 1, 1, 1), "cm")
  ) +
  coord_flip()

When correlating our features with deaths per 100,000 people, a different picture emerges: poverty shows a moderate association with the deadliness of an American city. Our added cost-of-living scores are largely a function of median income and/or the poverty proportion, so I chose to focus on poverty itself to see what insight its effects might offer.

5 Function of poverty

5.1 Visualization: Scatter plots with linear regression lines

Here, I plotted incidents per 100,000 people against the proportion of each city's population living in poverty, with a fitted regression line.

p <- top_15_cities |>
  ggplot(
    aes(
      x = incidents_per_100k,
      y = prop_poverty
    )
  ) +
  geom_point(
    aes(
      text = paste(
        "City:", city_state,
        "<br>Incidents per 100k:", round(incidents_per_100k, digits = 1),
        "<br>Proportion in Poverty:", percent(prop_poverty, accuracy = 0.1)
        )
      )
    ) +
  # Add a linear regression line, colored red
  geom_smooth(
    method = "lm", # linear model
    se = FALSE, # hide the standard error ribbon
    size = 1, 
    color = "red"
  )  +
  labs(
    title = "Incidents per 100k by proportion of poverty",
    subtitle = "Top 15 cities in 2017",
    x = "Incidents per 100k",   
    y = "Proportion of city population in poverty",
    caption = "Source: Gun Violence Archive & U.S. Census Bureau"
  ) +
  scale_y_continuous(labels = percent, breaks = seq(0, 0.5, by = 0.02)) +
  theme_bw()

# Convert the ggplot2 plot to an interactive plotly plot with custom tooltips
p_plotly <- ggplotly(p, tooltip = "text")

# Manually add the subtitle and caption
p_plotly <- p_plotly |>
  layout(
    title = list(
      text = "Incidents per 100k by proportion of poverty<br><sup>Top 15 cities in 2017</sup>"
    ),
    annotations = list(
      list(
        x = 0, 
        y = -0.13,  # Adjust the y position to move the caption up slightly
        text = "Source: Gun Violence Archive & U.S. Census Bureau", 
        showarrow = FALSE, 
        xref = 'paper', 
        yref = 'paper', 
        xanchor = 'left', 
        yanchor = 'auto', 
        xshift = 0, 
        yshift = 0,
        font = list(size = 8)
      )
    ),
    margin = list(t = 80, b = 100)  # Add extra margin to accommodate the caption
  )

# Display the interactive plot
p_plotly

The regression line is nearly flat, directly reflecting the weak correlation observed in the previous section.
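To put a number on that flatness, the correlation and fitted slope can be inspected directly (a minimal sketch using the same top_15_cities data):

# Correlation between incident rate and poverty among the top 15 cities
cor(top_15_cities$incidents_per_100k, top_15_cities$prop_poverty)

# Slope and p-value of the line drawn in the plot above
lm(prop_poverty ~ incidents_per_100k, data = top_15_cities) |>
  summary()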

q <- top_15_cities |>
  ggplot(
    aes(
      x = deaths_per_100k,
      y = prop_poverty
    )
  ) +
  geom_point(
    aes(
      text = paste(
        "City:", city_state,
        "<br>Deaths per 100k:", round(deaths_per_100k, digits = 1),
        "<br>Proportion in Poverty:", percent(prop_poverty, accuracy = 0.1)
        )
      )
    ) +
  # Add a linear regression line, colored red
  geom_smooth(
    method = "lm", # linear model
    se = FALSE, # hide the standard error ribbon
    size = 1, 
    color = "red"
  )  +
  labs(
    title = "Deaths per 100k by proportion of poverty",
    subtitle = "Top 15 cities in 2017",
    x = "Deaths per 100k",   
    y = "Proportion of city population in poverty",
    caption = "Source: Gun Violence Archive & U.S. Census Bureau"
  ) +
  scale_y_continuous(labels = percent, breaks = seq(0, 0.5, by = 0.02)) +
  theme_bw()

# Convert the ggplot2 plot to an interactive plotly plot with custom tooltips
q_plotly <- ggplotly(q, tooltip = "text")

# Manually add the subtitle and caption
q_plotly <- q_plotly |>
  layout(
    title = list(
      text = "Deaths per 100k by proportion of poverty<br><sup>Top 15 cities in 2017</sup>"
    ),
    annotations = list(
      list(
        x = 0, 
        y = -0.13,  # Adjust the y position to move the caption up slightly
        text = "Source: Gun Violence Archive & U.S. Census Bureau", 
        showarrow = FALSE, 
        xref = 'paper', 
        yref = 'paper', 
        xanchor = 'left', 
        yanchor = 'auto', 
        xshift = 0, 
        yshift = 0,
        font = list(size = 8)
      )
    ),
    margin = list(t = 80, b = 100)  # Add extra margin to accommodate the caption
  )

# Display the interactive plot
q_plotly

When we plot deaths per 100,000 people against poverty, the regression line changes dramatically, now showing a clear positive slope. Note, however, that only four cities sit above a 20% poverty rate. Although the pattern looks strong, I want to establish statistical significance before treating it as support for a hypothesis.
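As a quick first read before the formal group comparison, a correlation test on the raw rates gives a rough sense of significance (a minimal sketch):

# Test whether the correlation between death rate and poverty is nonzero
cor.test(top_15_cities$deaths_per_100k, top_15_cities$prop_poverty)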

5.2 Monte Carlo Methods of Inference

5.2.1 Hypothesis

Given the relationship between poverty and gun violence observed in the preliminary analysis, I hypothesize that cities with higher poverty levels experience a greater number of gun violence deaths per 100,000 people. Specifically:

\[H_0: \mu_1 \le \mu_2 \text{ vs. } H_a: \mu_1 > \mu_2\]

Where \(\mu_1\) is the average number of gun violence deaths per 100,000 people in high poverty cities, and \(\mu_2\) is the average number of gun violence deaths per 100,000 people in low poverty cities.

# Divide cities into low and high poverty groups at the median poverty rate
prop_poverty_threshold <- median(top_15_cities$prop_poverty)
low_poverty_cities <- top_15_cities |> filter(prop_poverty <= prop_poverty_threshold)
high_poverty_cities <- top_15_cities |> filter(prop_poverty > prop_poverty_threshold)

# Threshold as a percentage, for reporting in the text
val <- round(prop_poverty_threshold * 100, digits = 1)

# Drop identifier and demographic columns, keeping city_state and deaths_per_100k
low_poverty_cities <- low_poverty_cities |>
  select(
    -(1:5),
    -(7:18),
    -(20:21)
    )

high_poverty_cities <- high_poverty_cities |>
  select(
    -(1:5),
    -(7:18),
    -(20:21)
    )

# Label each group before stacking, so the labels cannot drift
# if the group sizes change
stacked_data <- bind_rows(
  low_poverty_cities |> mutate(Poverty_Level = "Low poverty cities"),
  high_poverty_cities |> mutate(Poverty_Level = "High poverty cities")
) |>
  select(city_state, Poverty_Level, deaths_per_100k)

To test this hypothesis, I divided the top 15 cities into low and high poverty groups based on the median proportion (19.8%) of the population living in poverty.

stacked_ft <- flextable(stacked_data) %>%
  set_header_labels(
    city_state = "City, State",
    Poverty_Level = "Poverty Level",
    deaths_per_100k = "Deaths per 100k"
  ) %>%
  add_header_row(values = "Table 6: 15 Deadliest Cities by Poverty Rate (2017)", colwidths = 3) %>%
  padding(padding = 3, part = "all") %>%
  add_footer_lines(values = "Source: Gun Violence Archive & U.S. Census Bureau") %>%
  theme_vanilla() %>%
  autofit()

stacked_ft

Table 6: 15 Deadliest Cities by Poverty Rate (2017)

City, State         Poverty Level          Deaths per 100k
Chicago, IL         Low poverty cities           13.687017
Washington, DC      Low poverty cities           14.872299
Jacksonville, FL    Low poverty cities           12.609055
Houston, TX         Low poverty cities            6.695365
Cleveland, OH       Low poverty cities           11.293136
Boston, MA          Low poverty cities            6.660817
Indianapolis, IN    Low poverty cities           16.702767
Peoria, IL          Low poverty cities            6.983803
Baltimore, MD       High poverty cities          48.080336
Milwaukee, WI       High poverty cities          11.603766
New Orleans, LA     High poverty cities          39.414501
Memphis, TN         High poverty cities          18.553133
Saint Louis, MO     High poverty cities          67.330016
Philadelphia, PA    High poverty cities          14.206925
Detroit, MI         High poverty cities          10.375197

Source: Gun Violence Archive & U.S. Census Bureau

5.2.2 Visualization: Comparison of Gun Violence Deaths

The first visualization compares the number of gun violence deaths per 100,000 people between low poverty and high poverty cities. This box plot displays the distribution of gun violence deaths for both groups.

stacked_data |>
  ggplot(
    aes(
      x = Poverty_Level,
      y = deaths_per_100k,
      fill = Poverty_Level
    )
  ) +
  stat_boxplot(geom = "errorbar", width = 0.2, coef = 1.5) +
  stat_boxplot(geom = "boxplot", width = 0.5, coef = 1.5,
               outlier.shape = 8) +
  stat_summary(fun = "mean", geom = "point", shape = 23, fill = "black",
               color = "white") +
  scale_fill_manual(values = c("#009E73", "#56B4E9")) +
  coord_flip() +
  labs(
    y = "Number of Deaths per 100,000 people",
    title = "Comparison of Gun Violence Deaths",
    caption = "Source: Gun Violence Archive & U.S. Census Bureau"
    ) +
  theme(legend.position = "none")

Key observations from the box plot:

Low Poverty Cities: The box plot shows a relatively narrow range of deaths per 100,000 people, indicating less variability within this group. The median number of deaths is lower, and the whiskers suggest fewer extreme values.

High Poverty Cities: The box plot for high poverty cities reveals a much wider range of deaths per 100,000 people, with a higher median and more variability. This indicates a greater spread of gun violence incidents in these cities, including some with very high death rates.

This visualization supports the hypothesis that cities with higher poverty levels experience more gun violence deaths, and motivates a closer look at the impact of socioeconomic factors on gun violence.

5.2.3 Visualization: Empirical Density Curves and Dot Plot

The second visualization provides an empirical density curve and dot plot comparing the number of gun violence deaths per 100,000 people between low poverty and high poverty cities. This visualization helps to highlight the distribution and density of gun violence deaths in both groups.

stacked_data |> 
  ggplot(aes(x = deaths_per_100k,
             fill = Poverty_Level)) +
  geom_dotplot() +
  geom_density(color = "black", alpha = 0.6) +
    scale_fill_manual(values = c("#009E73", "#56B4E9")) +
  facet_grid(Poverty_Level ~ .) +
    labs(x = "Number of Deaths per 100,000 people",
         title = "Empirical Density Curves and Dot Plot",
         y = "Density") +
  theme(legend.position = "none") +
  ylim(0, 0.09)

Key observations from the density curves and dot plot:

Low Poverty Cities: The density curve for low poverty cities shows a higher peak, indicating that most cities in this group have a lower number of deaths per 100,000 people. The dot plot reinforces this by showing a concentration of points at lower death rates.

High Poverty Cities: The density curve for high poverty cities is relatively flat, suggesting a wider spread of death rates within this group. The dot plot shows that these cities have a broader range of gun violence death rates, with some cities experiencing significantly higher death rates.

5.2.4 Assumption Checking for T-Tests

5.2.4.1 Checking Normality

Before proceeding with the t-test, it is essential to verify whether the data follow a normal distribution. The Shapiro-Wilk test was used to assess the normality of gun violence death rates per 100,000 people in both the low and high poverty groups.

# Check for normality
shapiro_low <- shapiro.test(stacked_data$deaths_per_100k[stacked_data$Poverty_Level == "Low poverty cities"])
shapiro_high <- shapiro.test(stacked_data$deaths_per_100k[stacked_data$Poverty_Level == "High poverty cities"])

# Extract the results into a data frame
shapiro_low_df <- data.frame(
  Statistic = shapiro_low$statistic,
  p_value = shapiro_low$p.value,
  Method = shapiro_low$method,
  Data = "Low poverty cities"
)

shapiro_high_df <- data.frame(
  Statistic = shapiro_high$statistic,
  p_value = shapiro_high$p.value,
  Method = shapiro_high$method,
  Data = "High poverty cities"
)

shapiro_low_df |> flextable() |>
  set_caption("Result from Shapiro-Wilk Normality Test for Low poverty cities") |>
  padding(padding = 3, part = "all") %>%
  theme_vanilla() %>%
  autofit()
Result from Shapiro-Wilk Normality Test for Low poverty cities

Statistic    p_value      Method                         Data
0.887914     0.2237709    Shapiro-Wilk normality test    Low poverty cities

shapiro_high_df |> flextable() |>
  set_caption("Result from Shapiro-Wilk Normality Test for High poverty cities") |>
  padding(padding = 3, part = "all") %>%
  theme_vanilla() %>%
  autofit()
Result from Shapiro-Wilk Normality Test for High poverty cities

Statistic    p_value      Method                         Data
0.8616306    0.1565586    Shapiro-Wilk normality test    High poverty cities

Results:

Low Poverty Cities: The Shapiro-Wilk test statistic is 0.89 with a p-value of 0.22, indicating that we do not reject the null hypothesis of normality. Therefore, the data for low poverty cities can be considered normally distributed.

High Poverty Cities: The Shapiro-Wilk test statistic is 0.86 with a p-value of 0.16, indicating that we do not reject the null hypothesis of normality. Thus, the data for high poverty cities can also be considered normally distributed.

Both groups meet the normality assumption necessary for conducting the t-test. However, with small group sizes (n = 8 low poverty and n = 7 high poverty cities), the power of the normality test is limited, and the results should be interpreted with caution.
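Because the formal test has limited power at these sizes, a visual check is a useful complement. Below is a minimal sketch using ggplot2's built-in Q-Q geoms on the same stacked_data:

# Normal Q-Q plots of death rates, one panel per poverty group
stacked_data |>
  ggplot(aes(sample = deaths_per_100k)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~ Poverty_Level, scales = "free") +
  labs(title = "Normal Q-Q Plots by Poverty Group",
       x = "Theoretical quantiles",
       y = "Sample quantiles") +
  theme_bw()

Points hugging the reference line support the normality assumption; strong departures in the tails would argue for the permutation approach used later.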

5.2.4.2 Checking Equal Variance

Next, I assess the assumption of equal variance between the two groups using the standard deviations and variances of gun violence death rates per 100,000 people.

# Calculating standard deviations and variances for each group
stacked_data |> 
  group_by(Poverty_Level) |> 
  summarize(Mean = mean(deaths_per_100k),
            n = n(),
            SD = sd(deaths_per_100k),
            Variance = var(deaths_per_100k)) |> 
  flextable() |> 
  colformat_double(digits = 3) |> 
  padding(padding = 3, part = "all") %>%
  theme_vanilla() %>%
  autofit()

Poverty_Level          Mean      n    SD        Variance
High poverty cities    29.938    7    22.034    485.507
Low poverty cities     11.188    8    3.975     15.802

Results:

High Poverty Cities: The mean death rate is 29.9 with a standard deviation of 22.0 and a variance of 485.5.

Low Poverty Cities: The mean death rate is 11.2 with a standard deviation of 4.0 and a variance of 15.8.

The variances are clearly unequal: the high poverty group's variance is roughly 30 times that of the low poverty group, so the assumption of equal variance is violated.
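For a formal check that is robust to mild non-normality, Levene's test from the car package loaded earlier could be applied; a minimal sketch, coercing the grouping variable to a factor as leveneTest expects:

# Levene's test for homogeneity of variance between the two poverty groups
car::leveneTest(deaths_per_100k ~ factor(Poverty_Level), data = stacked_data)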

Given these results, we will use Welch’s t-test, which does not assume equal variances, to compare the gun violence death rates between low and high poverty cities.

5.2.5 Welch’s t-test

# Perform Welch's t-test. With the default alphabetical factor ordering,
# "High poverty cities" is group 1, so alternative = "greater" tests
# whether the high poverty mean exceeds the low poverty mean.
welch_t_result <- t.test(deaths_per_100k ~ Poverty_Level, data = stacked_data, alternative = "greater", var.equal = FALSE)

# Extract the results into a one-row data frame
# (the estimate component holds both group means)
welch_t_result_df <- data.frame(
  Statistic = welch_t_result$statistic,
  t_df = welch_t_result$parameter,
  p_value = welch_t_result$p.value,
  Alternative = welch_t_result$alternative,
  Mean_High = welch_t_result$estimate[1],
  Mean_Low = welch_t_result$estimate[2],
  Lower_CI = welch_t_result$conf.int[1],
  Upper_CI = welch_t_result$conf.int[2]
)

# Convert the data frame to a flextable
welch_t_result_ft <- welch_t_result_df |>
  flextable() |>
  set_caption("Results of Welch's t-test comparing rates of gun violence deaths between the low poverty cities and high poverty cities groups.") |>
  padding(padding = 3, part = "all") %>%
  theme_vanilla() %>%
  autofit()

# Print the flextable
welch_t_result_ft
Results of Welch's t-test comparing rates of gun violence deaths between the low poverty cities and high poverty cities groups.

Statistic    t_df        p_value       Alternative    Mean_High    Mean_Low    Lower_CI    Upper_CI
2.219969     6.342214    0.03290321    greater        29.93770     11.18803    2.494976    Inf

Results:

  • Welch's t-test statistic is 2.22 on 6.34 degrees of freedom (t_df).

  • The p-value is 0.033, indicating a statistically significant difference in death rates between the two groups at the 0.05 level.

  • The alternative hypothesis specifies that the death rate is greater in high poverty cities.

  • The mean death rate is 29.9 for high poverty cities and 11.2 for low poverty cities.

  • The one-sided 95% confidence interval for the difference in means is (2.49, ∞).

These results indicate that gun violence death rates are significantly higher in high poverty cities compared to low poverty cities, supporting the hypothesis that poverty levels are associated with increased gun violence deaths.
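To convey the magnitude of the difference rather than just its significance, a rough effect size can be computed. This sketch uses Cohen's d with the average of the two group variances, since they are unequal; it is an illustration added here, not part of the original test output:

# Split death rates by poverty group
grp <- split(stacked_data$deaths_per_100k, stacked_data$Poverty_Level)

# Cohen's d using the mean of the two group variances
mean_diff <- mean(grp$`High poverty cities`) - mean(grp$`Low poverty cities`)
avg_sd <- sqrt((var(grp$`High poverty cities`) + var(grp$`Low poverty cities`)) / 2)
mean_diff / avg_sd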

5.2.6 Permutation Testing

To further validate the results of Welch's t-test, we performed a permutation test. This method assesses the robustness of our findings by generating a distribution of t-statistics under the null hypothesis from 1,000 random relabelings of the poverty groups.

# Set the number of permutations
nperms <- 1000

# Initialize a vector to store the permuted test statistics
permTs <- vector(length = nperms)

# Calculate the observed Welch's t-test statistic
observed_t <- t.test(
  deaths_per_100k ~ Poverty_Level, 
  data = stacked_data, 
  alternative = "greater", 
  var.equal = FALSE
  )$statistic

# Perform the permutation test
set.seed(1986) # For reproducibility

for (p in 1:nperms) {
  permuted_data <- stacked_data %>%
    mutate(
      Poverty_Level = sample(Poverty_Level, replace = FALSE)
      )
  permTs[p] <- t.test(
    deaths_per_100k ~ Poverty_Level, 
    data = permuted_data, 
    alternative = "greater", 
    var.equal = FALSE
    )$statistic
}

# Calculate the p-value
p_value <- mean(permTs >= observed_t)

# Create a data frame for the p-value result
p_value_df <- data.frame(
  Statistic = c("Observed t-statistic", "Permutation p-value"),
  Value = c(round(observed_t, 2), round(p_value, 3))
)

# Display the p-value result using flextable
p_value_ft <- p_value_df |>
  flextable() |>
  set_caption("Permutation Test Results") |>
  padding(padding = 3, part = "all") %>%
  theme_vanilla() %>%
  autofit() 

# Print the flextable
p_value_ft
Permutation Test Results

Statistic

Value

Observed t-statistic

2.220

Permutation p-value

0.011

The permutation test p-value is 0.011: only about 1% of the 1,000 permuted t-statistics were as large as the observed value, indicating that the observed statistic is rare under the null hypothesis.
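The same permutation test can also be expressed declaratively with the infer package loaded earlier; a minimal sketch (results will vary slightly with the random draws):

# Permutation distribution of the t-statistic via infer
set.seed(1986)
null_dist <- stacked_data |>
  specify(deaths_per_100k ~ Poverty_Level) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "t",
            order = c("High poverty cities", "Low poverty cities"))

# One-sided p-value against the observed t-statistic
null_dist |>
  get_p_value(obs_stat = observed_t, direction = "greater")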

# Tally how many permuted t-statistics meet or exceed the observed one
tally_results <- janitor::tabyl(permTs >= observed_t)

# Convert to a data frame for plotting
tally_df <- as.data.frame(tally_results)

# Create a bar plot to visualize the tally results
ggplot(tally_df, aes(x = `permTs >= observed_t`, y = n)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(aes(label = n), vjust = -0.3) +
  labs(
    title = "Tally of Permuted t-Statistics Greater Than or Equal to Observed t-Statistic",
    x = "Permuted t-Statistic >= Observed t-Statistic",
    y = "Count"
  ) +
  theme_minimal()

The bar plot visualizes the tally of permutations greater than or equal to the observed t-statistic, reinforcing the significance of our original t-test results.

# Visualize the permutation distribution
perm_df <- data.frame(permTs = permTs)

# Headroom for the annotation above the tallest histogram bar
max_y <- max(table(cut(permTs, breaks = seq(min(permTs), max(permTs), by = 0.1)))) + 2

ggplot(perm_df, aes(x = permTs)) +
  geom_histogram(binwidth = 0.1, fill = "steelblue", color = "white") +
  geom_vline(aes(xintercept = observed_t), color = "dodgerblue", linetype = "dashed") +
  geom_vline(xintercept = quantile(permTs, probs = 0.95), color = "red") +
  labs(title = "Permutation Distribution of t-Statistics",
       x = "Permuted t-Statistics",
       y = "Frequency") +
  annotate("text", x = observed_t, y = max_y,
           label = paste("Observed t-statistic:", round(observed_t, 2)),
           color = "dodgerblue", hjust = 0.9) +
  theme_minimal()

The histogram shows the null distribution of permuted t-statistics; the observed statistic (dashed blue line) falls beyond the 95th percentile of the permuted values (solid red line).

Statistical Decision

Given the evidence from Welch’s t-test and the permutation test, we reject the null hypothesis (\(H_0: \mu_1 \le \mu_2\)) in favor of the alternative hypothesis (\(H_a: \mu_1 > \mu_2\)). This decision is supported by:

  • A p-value of 0.0329 from Welch’s t-test, which is less than the common alpha level of 0.05.

  • A permutation p-value of 0.011, indicating that the observed t-statistic is unlikely to occur by chance.

Conclusion: There is statistically significant evidence to conclude that cities with higher poverty levels experience higher gun violence death rates compared to cities with lower poverty levels.

This analysis underscores the critical impact of socioeconomic factors on gun violence and highlights the need for targeted interventions in high poverty areas to reduce gun violence deaths.

5.3 Bootstrapping

To further analyze the data, we performed a bootstrapping procedure to estimate the distribution of the median number of deaths per 100,000 people in high poverty cities. Bootstrapping allows us to generate multiple samples from the original dataset and compute the statistic of interest, providing a robust estimate of its distribution.

high_poverty <- stacked_data |>
  filter(Poverty_Level == "High poverty cities")

# Number of bootstrap samples
B <- 10000

# Instantiating a matrix for the bootstrap samples
# (no seed is set here, so exact results vary from run to run)
boots <- matrix(NA, nrow = nrow(high_poverty), ncol = B)

# Sampling with replacement B times
for (b in 1:B) {
  boots[, b] <- high_poverty |>
    slice_sample(prop = 1, replace = TRUE) |>
    dplyr::pull(deaths_per_100k)
}

# Instantiating a vector for the bootstrap medians
boot_medians <- vector(length = B)

# Calculating the median of each bootstrap sample
for (b in 1:B) {
  boot_medians[b] <- median(boots[, b])
}

tibble(Median = boot_medians) |>
  ggplot(aes(x = Median)) +
  geom_histogram(fill = "steelblue", color = "white") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  labs(title = "Nonparametric bootstrap distribution for the median",
       x = "Test statistic value",
       y = "Frequency") +
  geom_vline(xintercept = quantile(boot_medians, probs = c(0.025, 0.975)),
             color = "red", linetype = "dotted") +
  theme_minimal()

Results:

  • The histogram shows the distribution of bootstrap medians.

  • The vertical dotted lines represent the 2.5th and 97.5th percentiles of the bootstrap distribution, providing a 95% confidence interval for the median.

# Estimating the standard error
standard_error <- sd(boot_medians)

# Calculating the 95% confidence interval
confidence_interval <- quantile(boot_medians, probs = c(0.025, 0.975))

# Create a data frame for the results
results_df <- data.frame(
  Statistic = c("Standard Error", "2.5th Percentile", "97.5th Percentile"),
  Value = c(standard_error, confidence_interval[1], confidence_interval[2])
)

# Display the results using flextable
results_ft <- results_df |>
  flextable() |>
  set_caption("Bootstrap Estimates for the Median") |>
  padding(padding = 3, part = "all") |>
  theme_vanilla() |>
  autofit()

# Print the flextable
results_ft
Bootstrap Estimates for the Median

Statistic            Value
Standard Error       13.35475
2.5th Percentile     11.60377
97.5th Percentile    48.08034

Bootstrap Confidence Interval Findings:

We are 95% confident that the median gun violence death rate among high poverty cities lies between 11.6 and 48.1 deaths per 100,000 people. The interval is wide, reflecting both the small group size (n = 7) and the pronounced right skew of this group's death rates.
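As an aside, the loop-based bootstrap above can be written more compactly with replicate(); a minimal sketch, with an arbitrary seed that was not part of the original analysis:

# Compact equivalent of the loop-based bootstrap of the median
set.seed(42)  # arbitrary seed, chosen here only for reproducibility
boot_medians_alt <- replicate(
  10000,
  median(sample(high_poverty$deaths_per_100k, replace = TRUE))
)

# Percentile 95% confidence interval for the median
quantile(boot_medians_alt, probs = c(0.025, 0.975))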

6 Conclusion

Gun ownership, legislation, and violence in America have always been complex and polarizing topics. Although gun ownership is woven into the fabric of the American identity through the Constitution and is seen as an essential right, that right is viewed by some as conflicting with efforts to solve the problem of gun violence.

This analysis gets closer to untangling some of the mystery behind America's gun violence issues, but there is still more work to be done. Many more features could be added to get a better sense of what is driving the rise in gun violence deaths in America. With more time and research, we may be able to reach a consensus on how to legislate and address the issue.

This project was not of my choosing, and I had no preconceived notions about what evidence I would find, if any at all. I simply let the data lead my line of questioning. While I suspected that poverty had something to do with gun violence, I didn't know to what extent, and I did not anticipate the noticeable disparity between its effects on gun violence incidents versus deaths. Through this exercise, it became clear to me that the lens through which we view gun legislation is often flawed. In this analysis, gun violence deaths are clearly a dependent variable, an outcome: they are a function of other features of American culture. Trying to solve a problem by focusing on the output instead of the input seems only to perpetuate the issue.

Moreover, through this process and generally looking over some of the previously cited data regarding gun ownership in this country, an interesting pattern emerged that seemed worth exploring. I noticed that many of the states with the deadliest cities had some of the very lowest rates of gun ownership in the country. Despite this being anecdotal, it is a curious question that deserves a solid answer. The devil is in the details, and this multifaceted problem has no shortage of them.

Regarding the statistical significance of poverty as a driver of lethality among cities, one might posit that a lack of healthcare infrastructure in impoverished communities affects health outcomes after gun violence occurs. Underfunded hospitals, a shortage of hospitals, or slow emergency response times could be factors worth examining when thinking about how to improve the situation. Other social and psychological factors might also be informative, but they would be arduous to collect reliably and difficult to quantify.

Being from Baltimore, Maryland, the second deadliest city in America in 2017, it is hard to witness the poverty and violence firsthand. Knowing that right around the corner my fellow Americans are living in a war zone is unsettling, to say the least. I would personally like to see more careful analysis and concerted effort among legislators to pinpoint effective ways to reduce gun violence in America. Only through thorough research, open dialogue, and targeted interventions can we hope to make meaningful progress on this critical issue.