1 Introduction

1.1 Description

Information on incidents of gun violence documented in the United States from January 2013 through March 2018. Note that incidents from 2013 are incomplete: only 279 incidents from that year appear in this data set. Some notable incidents are also absent, including the October 2017 Las Vegas mass shooting. The data is contained in gunViolenceGeo.csv.

1.2 Source

Data was obtained from the GitHub repo https://github.com/jamesqo/gun-violence-data, which in turn sourced it from the Gun Violence Archive's website. From the organization's description:

1.3 Source description

Gun Violence Archive (GVA) is a not for profit corporation formed in 2013 to provide free online public access to accurate information about gun-related violence in the United States. GVA will collect and check for accuracy, comprehensive information about gun-related violence in the U.S. and then post and disseminate it online.

Although there are possible limitations in its comprehensiveness, research supports the utility of the data provided by the Gun Violence Archive for public health research and quantifying the amount of gun violence occurring throughout the United States.

1.4 Load R Packages

library(tidyverse)
library(gapminder)
library(skimr)
library(ggthemes)
library(flextable)
library(dplyr)
library(naniar)
library(lubridate)
library(tidyr)
library(sf)
library(mapview)
library(maps)
library(plotly) 
library(ggpubr)
library(ggcorrplot)
library(scales)
library(tidymodels)
library(car)
library(infer)
library(htmltools)
library(leaflet)

1.5 Importing data

gunViolence <- read_csv("gunViolenceGeo.csv")
census_df <- read_csv("https://raw.githubusercontent.com/dilernia/STA418-518/main/Data/census_data_state_2008-2021.csv")
county_census_df <- read_csv(
  "https://raw.githubusercontent.com/dilernia/STA418-518/main/Data/census_data_county_2009-2021.csv")

2 Gun Violence Data Dictionary

# Create the data dictionary
dataDictionary <- tibble(
  Variable = colnames(gunViolence),
  Description = c(
    "Geographic ID",
    "gunviolencearchive.org ID for incident",
    "date of occurrence",
    "state",
    "city or county",
    "address where incident took place",
    "number of people killed",
    "number of people injured",
    "link to gunviolencearchive.org webpage containing details of incident",
    "link to online news story concerning incident",
    "ignore, always False",
    "congressional district",
    "key: gun ID, value: 'Unknown' or 'Stolen'",
    "key: gun ID, value: description of gun type",
    "list of incident characteristics",
    "latitude",
    "description of location where incident took place",
    "longitude",
    "number of guns involved",
    "additional notes about the incident",
    "key: participant ID",
    "key: participant ID, value: description of age group, e.g. 'Adult 18+'",
    "key: participant ID, value: 'Male' or 'Female'",
    "key: participant ID",
    "key: participant ID, value: relationship of participant to other participants",
    "key: participant ID, value: 'Arrested', 'Killed', 'Injured', or 'Unharmed'",
    "key: participant ID, value: 'Victim' or 'Subject-Suspect'",
    "links to online news stories concerning incident",
    "state house district",
    "state senate district",
    "full address where incident took place"
  ),
  Type = map_chr(gunViolence, .f = function(x) { typeof(x)[1] }),
  Class = map_chr(gunViolence, .f = function(x) { class(x)[1] })
)

# Print the data dictionary
data_dict_ft <- flextable(dataDictionary) |>
  add_header_row(values = "Table 1: Data Dictionary", colwidths = 4) |>
  padding(padding = 3, part = "all") |>
  add_footer_lines(values = "Source: Gun Violence Archive") |>
  set_table_properties(width = 1, layout = "autofit") |>
  colformat_double(j = 2:4, digits = 2) |>
  theme_vanilla() |>
  autofit()

data_dict_ft

Table 1: Data Dictionary

Variable                     Description                                                                     Type       Class
geoid                        Geographic ID                                                                   character  character
incident_id                  gunviolencearchive.org ID for incident                                          double     numeric
date                         date of occurrence                                                              double     Date
state                        state                                                                           character  character
city_or_county               city or county                                                                  character  character
address                      address where incident took place                                               character  character
n_killed                     number of people killed                                                         double     numeric
n_injured                    number of people injured                                                        double     numeric
incident_url                 link to gunviolencearchive.org webpage containing details of incident           character  character
source_url                   link to online news story concerning incident                                   character  character
incident_url_fields_missing  ignore, always False                                                            logical    logical
congressional_district       congressional district                                                          double     numeric
gun_stolen                   key: gun ID, value: 'Unknown' or 'Stolen'                                       character  character
gun_type                     key: gun ID, value: description of gun type                                     character  character
incident_characteristics     list of incident characteristics                                                character  character
lat                          latitude                                                                        double     numeric
location_description         description of location where incident took place                               character  character
long                         longitude                                                                       double     numeric
n_guns_involved              number of guns involved                                                         double     numeric
notes                        additional notes about the incident                                             character  character
participant_age              key: participant ID                                                             character  character
participant_age_group        key: participant ID, value: description of age group, e.g. 'Adult 18+'          character  character
participant_gender           key: participant ID, value: 'Male' or 'Female'                                  character  character
participant_name             key: participant ID                                                             character  character
participant_relationship     key: participant ID, value: relationship of participant to other participants   character  character
participant_status           key: participant ID, value: 'Arrested', 'Killed', 'Injured', or 'Unharmed'      character  character
participant_type             key: participant ID, value: 'Victim' or 'Subject-Suspect'                       character  character
sources                      links to online news stories concerning incident                                character  character
state_house_district         state house district                                                            double     numeric
state_senate_district        state senate district                                                           double     numeric
address_full                 full address where incident took place                                          character  character

Source: Gun Violence Archive

2.1 Exploring high-level data characteristics

skim(gunViolence)
Data summary
Name gunViolence
Number of rows 239677
Number of columns 31
_______________________
Column type frequency:
character 20
Date 1
logical 1
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
geoid 8025 0.97 5 5 0 2821 0
state 0 1.00 4 20 0 51 0
city_or_county 0 1.00 3 46 0 12821 0
address 16497 0.93 2 97 0 197115 0
incident_url 0 1.00 48 50 0 239677 0
source_url 468 1.00 2 255 0 213974 0
gun_stolen 99498 0.58 8 5488 0 349 0
gun_type 99451 0.59 5 5488 0 2502 0
incident_characteristics 326 1.00 8 761 0 18126 0
location_description 197588 0.18 1 100 0 27605 0
notes 81017 0.66 1 261 0 136435 0
participant_age 92298 0.61 3 517 0 18951 0
participant_age_group 42119 0.82 11 1536 0 898 0
participant_gender 36362 0.85 6 803 0 873 0
participant_name 122253 0.49 4 2158 0 113402 0
participant_relationship 223903 0.07 8 366 0 284 0
participant_status 27626 0.88 8 1500 0 2150 0
participant_type 24863 0.90 8 1311 0 259 0
sources 609 1.00 8 3015 0 217280 0
address_full 7935 0.97 23 224 0 191632 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2013-01-01 2018-03-31 2016-04-05 1725

Variable type: logical

skim_variable n_missing complete_rate mean count
incident_url_fields_missing 0 1 0 FAL: 239677

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
incident_id 0 1.00 559334.35 293128.68 92114.00 308545.00 543587.00 817228.00 1083472.00 ▇▇▆▆▆
n_killed 0 1.00 0.25 0.52 0.00 0.00 0.00 0.00 50.00 ▇▁▁▁▁
n_injured 0 1.00 0.49 0.73 0.00 0.00 0.00 1.00 53.00 ▇▁▁▁▁
congressional_district 11944 0.95 8.00 8.48 0.00 2.00 5.00 10.00 53.00 ▇▂▁▁▁
lat 7923 0.97 37.55 5.13 19.11 33.90 38.57 41.44 71.34 ▁▇▅▁▁
long 7923 0.97 -89.34 14.36 -171.43 -94.16 -86.25 -80.05 97.43 ▁▇▁▁▁
n_guns_involved 99451 0.59 1.37 4.68 1.00 1.00 1.00 1.00 400.00 ▇▁▁▁▁
state_house_district 38772 0.84 55.45 42.05 1.00 21.00 47.00 84.00 901.00 ▇▁▁▁▁
state_senate_district 32335 0.87 20.48 14.20 1.00 9.00 19.00 30.00 94.00 ▇▆▂▁▁

3 Data Cleaning

I began by pruning the data down to its most relevant and complete portion. As stated in the description, the data is incomplete for 2013 and 2018, so I filtered to the four complete years in between (2014-2017). I also converted the date column to R's Date class, making it easier to extract chronological information such as year and month.

# Extract the year from the date column using stringr
gunViolence <- gunViolence |>
  mutate(year = as.integer(str_sub(date, 1, 4)))

# Filter data for the years 2014 to 2017
gunViolence <- gunViolence |>
  filter(year >= 2014 & year <= 2017)

# Ensure the date column is of Date class
gunViolence <- gunViolence |>
  mutate(date = as.Date(date, format = "%Y-%m-%d"))

# Extract the month from the date column
gunViolence <- gunViolence |>
  mutate(
    month = format(date, "%m")
  )

# Use stringr to remove text within parentheses in the city_or_county column
gunViolence <- gunViolence |>
  mutate(city_or_county = str_replace_all(city_or_county, "\\s*\\([^\\)]+\\)", ""))

# Trim any leading or trailing whitespace
gunViolence <- gunViolence |>
  mutate(city_or_county = str_trim(city_or_county))

3.1 Data Missingness

3.1.1 Plotting missingness for all data features

I began by examining the null values within the data and how to address them. Given the robust amount of information provided about these gun violence incidents, as well as the other datasets available to me regarding U.S. census data at the county, city, and state levels, I knew that location variables were important. I plotted all missing data and then visualized the missingness in relation to location.

gg_miss_var(gunViolence, show_pct = TRUE)

3.2 Plotting geolocation feature missingness

This visualization provides a closer look at location-level information by state. After creating the heatmap of the percentage of rows with missing location information, I observed that missing lat and long coordinates went hand in hand with missing geoid values, which are extremely salient for analysis. This may suggest that incident reporting in certain states was less diligent about capturing location data. The states pushing the upper bounds of location missingness were:

  • Idaho
  • Indiana
  • South Dakota
  • Virginia
  • West Virginia
gunViolence |>
  dplyr::select(
    geoid,
    lat,
    long,
    state
  ) |>
  gg_miss_fct(fct = state) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

3.3 Addressing important missing geoid data

To address geoid missingness, I compared incidents without IDs to ones that had them, by city, state, and year. Where a match existed, I copied the geoid over and merged it back into the dataset. Here is an examination of missingness after the merge: we went from over 8,000 missing geoid values down to 443, a completeness rate of 99.8% for the variable.

# Identify rows with missing geoid
missing_geoid <- gunViolence |>
  filter(is.na(geoid))

# Identify rows with non-missing geoid
non_missing_geoid <- gunViolence |>
  filter(!is.na(geoid)) |>
  group_by(state, city_or_county, year) |>
  summarise(geoid = first(geoid), .groups = 'drop')

# Perform the matching and fill in the missing geoid values
filled_data <- missing_geoid |>
  left_join(non_missing_geoid, by = c("state", "city_or_county", "year")) |>
  mutate(geoid = if_else(is.na(geoid.x), geoid.y, geoid.x)) |>
  select(-geoid.x, -geoid.y)

# Combine the filled rows with the original data excluding the previously missing rows
final_data <- gunViolence |>
  filter(!is.na(geoid)) |>
  bind_rows(filled_data)

skim(final_data$geoid)
Data summary
Name final_data$geoid
Number of rows 225597
Number of columns 1
_______________________
Column type frequency:
character 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
data 443 1 5 5 0 2800 0

3.4 Reviewing missingness after cleaning

This improved my location accuracy immensely. Hawaii, Idaho, Vermont, and West Virginia were the only states still facing higher proportions of missingness, topping out at just over 1%. As we will see later, these states account for comparatively few incidents and deaths.

final_data |>
  dplyr::select(
    geoid,
    state
  ) |>
  gg_miss_fct(fct = state) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

# Calculate count of missing values for each state
missing_counts <- final_data |>
  dplyr::select(geoid, state) |>
  group_by(state) |>
  summarise(missing_count = sum(is.na(geoid))) |>
  ungroup()

# Create a bar plot to show the count of missing values
ggplot(missing_counts, aes(x = reorder(state, -missing_count), y = missing_count)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  labs(
    x = "State",
    y = "Count of Missing geoid",
    title = "Count of Missing geoid by State",
    subtitle = "Based on final_data"
  )

3.5 Missingness Reflection

My method of matching locations from the same year to cross-reference geographic IDs substantially reduced the number of missing values. As we can see, the highest counts of remaining null values occur in high-incident states, so in the grand scheme of things this should not significantly affect the accuracy of our analysis.
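
As a quick sanity check, here is a minimal sketch comparing the missing geoid counts before and after the merge (using the gunViolence and final_data objects created above):

# Missing geoid counts before (gunViolence) and after (final_data) the
# city/state/year matching step, plus the resulting completeness rate
tibble(
  missing_before = sum(is.na(gunViolence$geoid)),
  missing_after  = sum(is.na(final_data$geoid)),
  complete_rate  = mean(!is.na(final_data$geoid))
)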

4 EDA

4.1 Country-Wide Year-Level Aggregate Counts

# Calculate the count of incidents per year
yearly_counts <- final_data |>
  group_by(year) |>
  summarise(
    incidents = n(),
    n_killed = sum(n_killed),
    n_injured = sum(n_injured)
  )

# Convert the year column to a character to avoid comma formatting
yearly_counts$year <- as.character(yearly_counts$year)

# Display the yearly counts using flextable
yearly_counts_ft <- yearly_counts |>
  flextable() |>
  set_header_labels(
    year = "Year",
    incidents = "Incidents",
    n_killed = "Killed",
    n_injured = "Injured"
  ) |>
  add_header_row(
    values = "Table 2: Aggregate Sum of U.S. Gun Violence Incidents, Deaths, and Injuries per Year", 
    colwidths = 4
  ) |>  
  padding(padding = 3, part = "all") |>
  theme_vanilla() |>
  autofit()

# Print the flextable
yearly_counts_ft

Table 2: Aggregate Sum of U.S. Gun Violence Incidents, Deaths, and Injuries per Year

Year  Incidents  Killed  Injured
2014  51,854     12,557  23,002
2015  53,579     13,484  26,967
2016  58,763     15,066  30,580
2017  61,401     15,511  30,703

4.2 Country-Wide Year-Level Summary Statistics

# Calculate summary statistics for incidents, n_killed, and n_injured per year
summary_year <- yearly_counts |>
  reframe(
    Statistic = c("Minimum", "Q1", "Median", "Mean (μ)", "Q3", "Maximum", "SD (σ)"),
    Incidents = c(
      min(incidents, na.rm = TRUE),
      quantile(incidents, 0.25, na.rm = TRUE),
      median(incidents, na.rm = TRUE),
      mean(incidents, na.rm = TRUE),
      quantile(incidents, 0.75, na.rm = TRUE),
      max(incidents, na.rm = TRUE),
      sd(incidents, na.rm = TRUE)
    ),
    Killed = c(
      min(n_killed, na.rm = TRUE),
      quantile(n_killed, 0.25, na.rm = TRUE),
      median(n_killed, na.rm = TRUE),
      mean(n_killed, na.rm = TRUE),
      quantile(n_killed, 0.75, na.rm = TRUE),
      max(n_killed, na.rm = TRUE),
      sd(n_killed, na.rm = TRUE)
    ),
    Injured = c(
      min(n_injured, na.rm = TRUE),
      quantile(n_injured, 0.25, na.rm = TRUE),
      median(n_injured, na.rm = TRUE),
      mean(n_injured, na.rm = TRUE),
      quantile(n_injured, 0.75, na.rm = TRUE),
      max(n_injured, na.rm = TRUE),
      sd(n_injured, na.rm = TRUE)
    )
  ) |>
  mutate_if(is.numeric, round, digits = 2)

summary_year_ft <- flextable(summary_year) |>
  add_header_row(values = "Table 3: U.S. Summary Statistics of Gun Violence Incidents, Deaths, and Injuries per Year (2014-2017)", colwidths = 4) |>
  padding(padding = 3, part = "all") |>
  add_footer_lines(values = "Source: Gun Violence Archive") |>
  colformat_double(j = 2:4, digits = 2)|>
  theme_vanilla() |>
  autofit()


summary_year_ft

Table 3: U.S. Summary Statistics of Gun Violence Incidents, Deaths, and Injuries per Year (2014-2017)

Statistic  Incidents  Killed     Injured
Minimum    51,854.00  12,557.00  23,002.00
Q1         53,147.75  13,252.25  25,975.75
Median     56,171.00  14,275.00  28,773.50
Mean (μ)   56,399.25  14,154.50  27,813.00
Q3         59,422.50  15,177.25  30,610.75
Maximum    61,401.00  15,511.00  30,703.00
SD (σ)     4,442.89   1,375.08   3,645.54

Source: Gun Violence Archive

Here are the summary statistics of gun violence incidents, deaths, and injuries for the entire U.S. across our years of interest.

4.3 Visualization: Bar Chart of Monthly Incidents per Year

# Create the faceted bar chart of monthly incident counts
ggplot(final_data, aes(x = month)) +
  geom_bar(fill = "steelblue") +
  facet_grid(year ~ .) +
  labs(
    title = "Monthly Gun Violence Incidents",
    x = "Month",
    y = "Number of Incidents",
    caption = "Source: Gun Violence Archive"
  ) +
  theme_minimal() +
  scale_x_discrete(labels = month.abb)  # Use abbreviated month names

I next broke each year of gun violence incidents down by month, and an interesting trend emerges from this chart. With the exception of a peak in January, incidents start to increase as the warmer months approach.
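
To put a number on that seasonal pattern, here is a minimal sketch (using the year and month columns created during cleaning) that averages monthly incident counts across 2014-2017:

# Average number of incidents per calendar month across 2014-2017
final_data |>
  count(year, month) |>
  group_by(month) |>
  summarise(avg_incidents = mean(n)) |>
  arrange(desc(avg_incidents))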

4.4 Gun Violence - State Level

I began analyzing this data by examining gun violence at the state level for 2017, the most recent complete year in the filtered data and the one with the highest number of incidents.

4.4.1 State-level Summary Statistics for Top 15 States

# Summarize the number of incidents per state
state_counts_all_years <- final_data |>
  filter(state != "District of Columbia") |>
  group_by(state, year) |>
  summarise(
    count = n(),
    n_killed = sum(n_killed),
    n_injured = sum(n_injured)
    ) |>
  arrange(desc(count))

# Calculate the average count for each state over all years
state_avg_counts <- state_counts_all_years %>%
  group_by(state) %>%
  summarize(avg_count = mean(count, na.rm = TRUE)) %>%
  arrange(desc(avg_count))

# Select the top 15 states by average count
top_15_states <- state_avg_counts %>% slice(1:15)

# Filter the original data for these top 15 states
top_15_data <- state_counts_all_years %>% filter(state %in% top_15_states$state)

# Generate the summary statistics for each of the top 15 states
summary_stats <- top_15_data %>%
  group_by(state) %>%
  summarize(
    min_count = min(count, na.rm = TRUE),
    q1_count = quantile(count, 0.25, na.rm = TRUE),
    median_count = median(count, na.rm = TRUE),
    mean_count = mean(count, na.rm = TRUE),
    q3_count = quantile(count, 0.75, na.rm = TRUE),
    max_count = max(count, na.rm = TRUE),
    sd_count = sd(count, na.rm = TRUE),
    min_killed = min(n_killed, na.rm = TRUE),
    q1_killed = quantile(n_killed, 0.25, na.rm = TRUE),
    median_killed = median(n_killed, na.rm = TRUE),
    mean_killed = mean(n_killed, na.rm = TRUE),
    q3_killed = quantile(n_killed, 0.75, na.rm = TRUE),
    max_killed = max(n_killed, na.rm = TRUE),
    sd_killed = sd(n_killed, na.rm = TRUE),
    min_injured = min(n_injured, na.rm = TRUE),
    q1_injured = quantile(n_injured, 0.25, na.rm = TRUE),
    median_injured = median(n_injured, na.rm = TRUE),
    mean_injured = mean(n_injured, na.rm = TRUE),
    q3_injured = quantile(n_injured, 0.75, na.rm = TRUE),
    max_injured = max(n_injured, na.rm = TRUE),
    sd_injured = sd(n_injured, na.rm = TRUE)
  )

# Round all summary statistics
summary_stats <- summary_stats %>%
  mutate(across(where(is.numeric), ~ round(.x, digits = 0)))

# Split the summary statistics into three different data frames
summary_incidents <- summary_stats %>%
  select(state, min_count, q1_count, median_count, mean_count, q3_count, max_count, sd_count)

summary_killed <- summary_stats %>%
  select(state, min_killed, q1_killed, median_killed, mean_killed, q3_killed, max_killed, sd_killed)

summary_injured <- summary_stats %>%
  select(state, min_injured, q1_injured, median_injured, mean_injured, q3_injured, max_injured, sd_injured)

# Convert each summary statistics data frame to a flextable
ft_incidents <- flextable(summary_incidents) %>%
  set_header_labels(
    state = "State",
    min_count = "Min Incident", q1_count = "Q1 Incident", median_count = "Median Incident", 
    mean_count = "Mean Incident", q3_count = "Q3 Incident", max_count = "Max Incident", 
    sd_count = "SD Incident"
  ) %>%
  add_header_row(values = "Table 4a: Summary Statistics of Gun Violence Incidents by state (2017)", colwidths = 8) %>%
  padding(padding = 3, part = "all") %>%
  add_footer_lines(values = "Source: Gun Violence Archive") %>%
  theme_vanilla() %>%
  autofit()

ft_killed <- flextable(summary_killed) %>%
  set_header_labels(
    state = "State",
    min_killed = "Min Killed", q1_killed = "Q1 Killed", median_killed = "Median Killed", 
    mean_killed = "Mean Killed", q3_killed = "Q3 Killed", max_killed = "Max Killed", 
    sd_killed = "SD Killed"
  ) %>%
  add_header_row(values = "Table 4b: Summary Statistics of Gun Violence Deaths by state (2017)", colwidths = 8) %>%
  padding(padding = 3, part = "all") %>%
  add_footer_lines(values = "Source: Gun Violence Archive") %>%
  theme_vanilla() %>%
  autofit()

ft_injured <- flextable(summary_injured) %>%
  set_header_labels(
    state = "State",
    min_injured = "Min Injured", q1_injured = "Q1 Injured", median_injured = "Median Injured", 
    mean_injured = "Mean Injured", q3_injured = "Q3 Injured", max_injured = "Max Injured", 
    sd_injured = "SD Injured"
  ) %>%
  add_header_row(values = "Table 4c: Summary Statistics of Gun Violence Injuries by state (2017)", colwidths = 8) %>%
  padding(padding = 3, part = "all") %>%
  add_footer_lines(values = "Source: Gun Violence Archive") %>%
  theme_vanilla() %>%
  autofit()

# Print the flextables
ft_incidents

Table 4a: Summary Statistics of Gun Violence Incidents per Year by State (2014-2017)

State           Min Incident  Q1 Incident  Median Incident  Mean Incident  Q3 Incident  Max Incident  SD Incident
California      3,234         3,521        3,674            3,793          3,946        4,588         571
Florida         2,702         3,029        3,647            3,549          4,167        4,201         748
Georgia         1,708         1,922        2,013            2,125          2,216        2,767         452
Illinois        3,095         3,366        4,256            4,174          5,064        5,089         1,048
Louisiana       1,692         1,852        1,946            1,914          2,008        2,070         162
Massachusetts   968           1,246        1,441            1,403          1,598        1,761         337
Missouri        1,272         1,482        1,627            1,564          1,709        1,730         210
New York        1,903         1,999        2,276            2,340          2,617        2,903         461
North Carolina  1,856         2,040        2,121            2,066          2,147        2,165         142
Ohio            2,132         2,309        2,405            2,411          2,507        2,701         234
Pennsylvania    1,789         2,076        2,178            2,103          2,206        2,267         214
South Carolina  1,490         1,618        1,678            1,642          1,702        1,721         104
Tennessee       1,590         1,720        1,818            1,810          1,909        2,014         179
Texas           2,875         3,068        3,204            3,222          3,358        3,606         305
Virginia        1,273         1,313        1,400            1,412          1,499        1,578         139

Source: Gun Violence Archive

ft_killed

Table 4b: Summary Statistics of Gun Violence Deaths per Year by State (2014-2017)

State           Min Killed  Q1 Killed  Median Killed  Mean Killed  Q3 Killed  Max Killed  SD Killed
California      1,204       1,250      1,292          1,303        1,346      1,423       93
Florida         796         854        912            905          963        1,001       90
Georgia         522         542        572            581          611        658         60
Illinois        637         675        817            808          949        959         169
Louisiana       465         492        514            510          532        549         36
Massachusetts   101         107        110            112          116        128         11
Missouri        383         465        504            500          538        609         93
New York        347         384        406            406          428        463         48
North Carolina  487         492        524            528          561        576         45
Ohio            494         526        582            590          646        700         92
Pennsylvania    508         537        552            562          576        635         53
South Carolina  344         374        386            378          389        395         23
Tennessee       352         384        432            428          476        495         66
Texas           1,094       1,126      1,152          1,178        1,205      1,313       95
Virginia        284         309        351            346          388        398         54

Source: Gun Violence Archive

ft_injured

Table 4c: Summary Statistics of Gun Violence Injuries per Year by State (2014-2017)

State           Min Injured  Q1 Injured  Median Injured  Mean Injured  Q3 Injured  Max Injured  SD Injured
California      1,607        1,614       1,756           1,768         1,910       1,953        182
Florida         1,463        1,526       1,674           1,664         1,812       1,846        188
Georgia         817          847         957             957           1,067       1,096        140
Illinois        2,276        2,686       3,226           3,216         3,756       4,137        828
Louisiana       898          915         1,012           1,026         1,123       1,184        139
Massachusetts   315          365         419             405           459         468          71
Missouri        647          793         882             833           921         921          129
New York        965          1,060       1,184           1,213         1,336       1,521        241
North Carolina  884          1,043       1,124           1,084         1,166       1,205        141
Ohio            1,120        1,258       1,380           1,349         1,470       1,517        177
Pennsylvania    1,064        1,157       1,204           1,190         1,237       1,288        94
South Carolina  626          701         732             720           750         791          69
Tennessee       857          980         1,092           1,060         1,172       1,198        155
Texas           1,157        1,306       1,460           1,423         1,577       1,614        210
Virginia        578          731         860             838           967         1,051        205

Source: Gun Violence Archive

4.4.2 Visualization: Bar graph of Incidents by state

# Filtering for the year 2017
gun_data_2017 <- final_data |>
  filter(year == 2017)

# Summarize the number of incidents per state
state_counts <- gun_data_2017 |>
  group_by(state) |>
  summarise(count = n()) |>
  arrange(desc(count))

# Convert the state column to a factor with levels ordered by count
gun_data_2017 <- gun_data_2017 |>
  mutate(state = factor(state, levels = state_counts$state))

gun_data_2017 |>
  ggplot(
    aes(
      x = state
    )
  ) +
  geom_bar(fill = "steelblue") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  labs(
    y = "Number of Incidents",
    x = "State",
    title = "Total number of gun-violence incidents by state",
    subtitle = "Year 2017",
    caption = "Source: Gun Violence Archive"
  ) + 
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)
  )

Looking at this initial visualization, we can see staggering levels of gun violence in certain states. However, most of the states with high incident counts are also among the most populous in the country, so I decided to analyze the same data while controlling for population.

4.4.3 Census Data Cleaning

census_df <- census_df |>
  filter(year >= 2014, year <= 2017)

# Filtering census data for 2017
census_df_2017 <- census_df |>
  filter(year == 2017)

# Aggregating population by state
census_df_2017_pop_by_state <- census_df_2017 |>
  group_by(county_state, year, geoid) |>
  summarise(population = sum(population), .groups = 'drop')

4.4.4 Examining high-level data characteristics

skim(census_df)
Data summary
Name census_df
Number of rows 208
Number of columns 26
_______________________
Column type frequency:
character 2
numeric 24
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
geoid 0 1 2 2 0 52 0
county_state 0 1 4 20 0 52 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2015.50 1.12 2014.00 2014.75 2015.50 2016.25 2017.00 ▇▇▁▇▇
population 0 1 6263911.70 7109820.85 579315.00 1791128.50 4278116.50 7027585.00 39536653.00 ▇▂▁▁▁
median_income 0 1 56577.62 11095.19 18626.00 49491.25 55088.00 63116.00 82372.00 ▁▁▇▅▂
median_monthly_rent_cost 0 1 921.28 213.41 433.00 769.75 853.50 1049.50 1573.00 ▁▇▅▂▁
median_monthly_home_cost 0 1 1057.44 344.69 251.00 826.25 968.00 1269.25 1977.00 ▁▇▆▃▂
prop_female 0 1 0.51 0.01 0.47 0.50 0.51 0.51 0.53 ▁▁▆▇▂
prop_male 0 1 0.49 0.01 0.47 0.49 0.49 0.50 0.53 ▂▇▆▁▁
prop_white 0 1 0.76 0.13 0.25 0.68 0.78 0.85 0.95 ▁▁▃▇▇
prop_black 0 1 0.11 0.11 0.00 0.03 0.08 0.16 0.49 ▇▃▁▁▁
prop_native 0 1 0.02 0.03 0.00 0.00 0.00 0.01 0.15 ▇▁▁▁▁
prop_asian 0 1 0.04 0.05 0.00 0.02 0.03 0.05 0.38 ▇▁▁▁▁
prop_hawaiin_islander 0 1 0.00 0.01 0.00 0.00 0.00 0.00 0.10 ▇▁▁▁▁
prop_other_race 0 1 0.03 0.03 0.00 0.01 0.02 0.05 0.17 ▇▂▁▁▁
prop_multi_racial 0 1 0.03 0.03 0.01 0.02 0.03 0.03 0.24 ▇▁▁▁▁
prop_highschool 0 1 0.24 0.04 0.15 0.22 0.24 0.26 0.35 ▁▅▇▃▁
prop_GED 0 1 0.04 0.01 0.02 0.04 0.04 0.05 0.07 ▃▇▆▃▂
prop_some_college 0 1 0.06 0.01 0.01 0.06 0.06 0.07 0.09 ▁▁▇▇▂
prop_college_no_degree 0 1 0.15 0.02 0.09 0.13 0.15 0.16 0.21 ▂▃▇▅▁
prop_associates 0 1 0.09 0.02 0.03 0.08 0.08 0.10 0.14 ▁▂▇▃▁
prop_bachelors 0 1 0.19 0.03 0.12 0.17 0.19 0.21 0.26 ▂▅▇▆▂
prop_masters 0 1 0.08 0.03 0.05 0.07 0.08 0.09 0.21 ▇▅▁▁▁
prop_professional 0 1 0.02 0.01 0.01 0.02 0.02 0.02 0.08 ▇▁▁▁▁
prop_doctoral 0 1 0.01 0.01 0.01 0.01 0.01 0.02 0.05 ▇▂▁▁▁
prop_poverty 0 1 0.15 0.05 0.07 0.11 0.14 0.16 0.46 ▇▅▁▁▁

4.4.5 Visualization: Bar Graph of Incidents by state per 100,000 people

# Summarize the number of incidents per state
state_counts <- gun_data_2017 |>
  group_by(state) |>
  summarise(count = n()) |>
  arrange(desc(count))

# Merge with population data
state_counts <- state_counts |>
  left_join(census_df_2017_pop_by_state, by = c("state" = "county_state"))

# Calculate incidents per 100,000 people
state_counts <- state_counts |>
  mutate(incidents_per_100k = (count / population) * 100000)

# Convert the state column to a factor with levels ordered by incidents_per_100k
state_counts <- state_counts |>
  mutate(state = factor(state, levels = state[order(incidents_per_100k, decreasing = TRUE)]))

# Create the bar chart
ggplot(state_counts, aes(x = state, y = incidents_per_100k)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  labs(
    y = "Incidents per 100,000 People",
    x = "State",
    title = "Gun Violence Incidents per 100,000 People by State",
    subtitle = "Year 2017",
    caption = "Source: Gun Violence Archive & U.S. Census Bureau"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)
  )

After controlling for population using the Census Bureau dataset, DC has the most gun violence incidents per 100,000 people in the country for this year. However, DC is not a state but a federal district, more akin to a city. For the sake of comparing like with like, I will omit it from the analysis for now.

4.4.6 Visualization: Bar Graph of Incidents by state per 100,000 people - Removing DC

# Summarize the number of incidents per state
state_counts <- gun_data_2017 |>
  filter(state != "District of Columbia") |>
  group_by(state) |>
  summarise(count = n()) |>
  arrange(desc(count))

# Merge with population data
state_counts <- state_counts |>
  left_join(census_df_2017_pop_by_state, by = c("state" = "county_state"))

# Calculate incidents per 100,000 people
state_counts <- state_counts |>
  mutate(incidents_per_100k = (count / population) * 100000)

# Convert the state column to a factor with levels ordered by incidents_per_100k
state_counts <- state_counts |>
  mutate(state = factor(state, levels = state[order(incidents_per_100k, decreasing = TRUE)]))

# Create the bar chart

ggplot(state_counts, aes(x = state, y = incidents_per_100k)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  labs(
    y = "Incidents per 100,000 People",
    x = "State",
    title = "Gun Violence Incidents per 100,000 People by State",
    subtitle = "Year 2017",
    caption = "Source: Gun Violence Archive & U.S. Census Bureau"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
    legend.position = "none"
  )

After removing the District of Columbia, we get a better comparison between states. I was surprised to see Alaska taking the top spot. After some Google searching, I obtained records of gun ownership by state.

According to World Population Review, gun ownership rates in Alaska are estimated to be 64.5%. This could potentially be a correlative factor in gun violence incidents within the state. It was also surprising to see Delaware above a state like Illinois, considering the anecdotes I’ve always heard regarding the violence in Chicago.

Interestingly, Delaware and Illinois show the inverse of Alaska's pattern with respect to gun ownership. Delaware has the 10th-lowest gun ownership rate among states at 34.4%, according to the World Population Review, and Illinois ranks even lower, 7th, at 27.8%. Gun ownership seems like an interesting feature to explore, but given these inconsistencies, I chose to press on.
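
As a rough illustration, this sketch lines up the ownership rates quoted above (transcribed from the World Population Review figures, not derived from this data) against the 2017 incident rates computed earlier:

# Hypothetical comparison: quoted ownership rates vs. 2017 incident rates
ownership <- tribble(
  ~state,     ~ownership_rate,
  "Alaska",   0.645,
  "Delaware", 0.344,
  "Illinois", 0.278
)

state_counts |>
  mutate(state = as.character(state)) |>
  inner_join(ownership, by = "state") |>
  select(state, incidents_per_100k, ownership_rate)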

4.5 Visualization: Interactive Map of Incidents by State

Here I mapped each state's rate of incidents per 100,000 people onto a leaflet to highlight the most gun-violence-prone states in 2017.

# Downloading state-level shape files from US Census Bureau
if(!file.exists("cb_2018_us_state_500k.zip")) {
  download.file(url = "https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_state_500k.zip",
                destfile = "cb_2018_us_state_500k.zip")
}

# Create directory for geospatial files
if(!dir.exists("GeoFiles")) {
  dir.create("GeoFiles")
}

# Unzipping files
utils::unzip("cb_2018_us_state_500k.zip",
             exdir = "GeoFiles")

# Loading the shapefiles
state_shape <- st_read("GeoFiles/cb_2018_us_state_500k.shp",
                       quiet = TRUE)

# Ensure the geoid column in state_counts is properly formatted as GEOID
state_counts <- state_counts |>
  mutate(GEOID = str_pad(geoid, width = 2, side = "left", pad = "0"))

# Merging state_counts with state_shape shapefile
state_shape_full <- state_shape |> 
  left_join(state_counts, by = c("GEOID" = "GEOID"))

# Remove rows with NA values
cleaned_state_shape <- state_shape_full |>
  drop_na()

# Ensure cleaned_state_shape is an sf object if not already done
if (!inherits(cleaned_state_shape, "sf")) {
  cleaned_state_shape <- st_as_sf(cleaned_state_shape)
}

# Transform to WGS84 CRS to be compatible with leaflet
cleaned_state_shape <- st_transform(cleaned_state_shape, crs = 4326)

# Check for and handle missing values in key columns
cleaned_state_shape <- cleaned_state_shape |>
  filter(!is.na(state) & !is.na(population) & !is.na(count) & !is.na(incidents_per_100k))

# Defining pop-up labels
labels <- cleaned_state_shape |>
  mutate(labels = paste0(
    "<strong>", str_to_title(state), "</strong><br/>",
    "Population: ", population, "<br/>",
    "Incidents: ", count, "<br/>",
    "Incidents per 100k: ", round(incidents_per_100k, 2))) |>
  pull(labels) |>
  lapply(htmltools::HTML)

# Quantitative variable to color areal units (states) by
myVariable <- cleaned_state_shape$incidents_per_100k

# Defining nice cut points for coloring
bins <- seq(min(myVariable, na.rm = TRUE), max(myVariable, na.rm = TRUE), length.out = 6)

# Selecting color palette
pal <- colorBin("PuBu", domain = myVariable, bins = bins)

# Create the choropleth map
myLeaflet <- leaflet(cleaned_state_shape) |>
  setView(lng = -96, lat = 37.8, zoom = 2.4) |>
  addProviderTiles("OpenStreetMap.Mapnik",
                   options = providerTileOptions(
                     id = "mapbox.light",
                     accessToken = Sys.getenv('MAPBOX_ACCESS_TOKEN'))) |>
  addPolygons(
    fillColor = ~pal(myVariable),
    weight = 0.80,
    opacity = 1,
    color = "black",
    fillOpacity = 1,
    highlightOptions = highlightOptions(
      color = "#666"),
    label = labels,
    labelOptions = labelOptions(textsize = "15px")) |>
  addLegend(pal = pal, 
            values = ~ myVariable, 
            opacity = 0.7, 
            title = "Gun Violence Incidents per 100k (2017)",
            position = "bottomright")

# Display the map
myLeaflet

4.6 Visualization: Boxplot Showing Range of Incidents Per Year from 2014-2017

# Arrange by incidents_per_100k in descending order and extract the top 15 states
top_15_states <- state_counts |>
  arrange(desc(incidents_per_100k)) |>
  slice(1:15) |>
  pull(state)

# Merge with population data
state_counts_all_years <- state_counts_all_years |>
  left_join(census_df, by = c("state" = "county_state",
                              "year" = "year"))

# Calculate incidents per 100,000 people
state_counts_all_years <- state_counts_all_years |>
  mutate(incidents_per_100k = (count / population) * 100000)

# Filter the state_counts_all_years DataFrame to include only the top 15 states
filtered_data <- state_counts_all_years |>
  filter(state %in% top_15_states)

# Create the box plot
ggplot(filtered_data, aes(x = fct_reorder(state, incidents_per_100k, .desc=FALSE ) , y = incidents_per_100k)) +
  geom_boxplot(fill = "steelblue", alpha = 0.9) +
  stat_boxplot(geom = "errorbar", width = 0.3, coef = 1.5) +
  labs(
    y = "Incidents per 100,000 People",
    x = "State",
    title = "Range of Gun Violence Incidents per 100,000 People by State (2014-2017)",
    subtitle = "Top 15 States",
    caption = "Source: Gun Violence Archive & U.S. Census Bureau"
  ) +
  
  coord_flip() +
  theme_minimal() +
  theme(
    axis.text.x = element_text(hjust = .5, vjust = 0.5)
  )

Here we can see how the gun incident rates per 100,000 people ranged from 2014 to 2017 for the top 15 states.
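
For a numeric view of the same spread, here is a small sketch computing the range (maximum minus minimum) of each state's annual rates:

# Range of annual incident rates per 100k for the top 15 states
filtered_data |>
  group_by(state) |>
  summarise(rate_range = max(incidents_per_100k, na.rm = TRUE) -
              min(incidents_per_100k, na.rm = TRUE)) |>
  arrange(desc(rate_range))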

4.7 Gun Violence - City Level

Having thoroughly examined the data at the state level, it became evident that the broader trends and anomalies could benefit from a more granular analysis. Thus, I decided to adjust my aperture toward the city level. By zooming in on the urban centers, we can gain deeper insights into the dynamics of gun violence. Cities often present unique challenges and opportunities for intervention, and understanding the variations within these metropolitan areas will be crucial for formulating more effective strategies. Let’s delve into the city-level data to uncover these nuanced patterns.

4.7.1 City Level Summary Statistics

# Group by state, geoid, and city_or_county, then summarise and arrange
sorted_city_rates_2017 <- gun_data_2017 |>
  group_by(geoid, state, city_or_county) |>
  summarise(count = n(), .groups = 'drop') |>
  arrange(desc(count))

# filter county data for 2017
county_census_2017 <- county_census_df |>
  filter(
    year == 2017
  )

# Split the county_state column into county and state columns
county_census_2017 <- county_census_2017 |>
  separate(county_state, into = c("city_or_county", "state"), sep = ", ")

# Merge with population data
grouped_data <- county_census_2017  |>
  left_join(sorted_city_rates_2017, by = c("geoid" = "geoid"))

grouped_data <- grouped_data |>
  drop_na()

grouped_data <- grouped_data |>
  rename(
    county = city_or_county.x,
    city = city_or_county.y,
    state = state.x
  ) |>
  select(-state.y)

# Calculate incidents per 100,000 people
grouped_data <- grouped_data |>
  mutate(incidents_per_100k = (count / population) * 100000)

# Getting state abbreviations for plotting
state_abbr_lookup <- data.frame(
  state = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "District of Columbia", "Florida", "Georgia", 
            "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland", 
            "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada", 
            "New Hampshire", "New Jersey", "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio", "Oklahoma", 
            "Oregon", "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", "Tennessee", "Texas", "Utah", 
            "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming"),
  state_abbr = c("AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC","FL", "GA", 
                 "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", 
                 "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", 
                 "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", 
                 "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY")
)
# Join with the state abbreviation lookup table
grouped_data <- grouped_data |>
  left_join(state_abbr_lookup, by = "state")

# Create a unique identifier by combining city and state abbreviation
grouped_data <- grouped_data |>
  mutate(city_state = paste(city, state_abbr, sep = ", "))

top_15_cities <- grouped_data |>
  arrange(desc(count)) |>  # Sort by count in descending order
  slice_head(n = 15) 

# Split the county_state column into county and state columns
county_census_df <- county_census_df |>
  separate(county_state, into = c("city_or_county", "state"), sep = ", ")

# Group by state, geoid, and city_or_county, then summarise and arrange
sorted_city_rates_all_years <- final_data |>
  group_by(geoid, state, city_or_county, year) |>
  summarise(
    count = n(),
    n_killed = sum(n_killed),
    n_injured = sum(n_injured),
    .groups = 'drop'
    ) |>
  arrange(desc(count), city_or_county, state )

# Merge with population data
sorted_city_rates_all_years <- sorted_city_rates_all_years  |>
  left_join(county_census_df, by = c("geoid" = "geoid",
                                     "year" = "year"))

sorted_city_rates_all_years <- sorted_city_rates_all_years |>
  rename(state = state.x, city = city_or_county.x, county = city_or_county.y )|>
  group_by(
    city,
    state,
    year
  ) |>
  summarise(
    count = sum(count),
    n_killed = sum(n_killed),
    n_injured = sum(n_injured),
    .groups = 'drop'
  )

# Join with the state abbreviation lookup table
sorted_city_rates_all_years <- sorted_city_rates_all_years |>
  left_join(state_abbr_lookup, by = "state")

# Create a unique identifier by combining city and state abbreviation
sorted_city_rates_all_years <- sorted_city_rates_all_years |>
  mutate(city_state = paste(city, state_abbr, sep = ", "))


# Filter sorted_city_rates_all_years to only include the top 15 cities
sorted_city_rates_all_years <- sorted_city_rates_all_years %>%
  filter(city_state %in% top_15_cities$city_state)

# Generate the summary statistics for each of the top 15 cities
city_summary_stats <- sorted_city_rates_all_years %>%
  group_by(city) %>%
  summarize(
    min_count = min(count, na.rm = TRUE),
    q1_count = quantile(count, 0.25, na.rm = TRUE),
    median_count = median(count, na.rm = TRUE),
    mean_count = mean(count, na.rm = TRUE),
    q3_count = quantile(count, 0.75, na.rm = TRUE),
    max_count = max(count, na.rm = TRUE),
    sd_count = sd(count, na.rm = TRUE),
    min_killed = min(n_killed, na.rm = TRUE),
    q1_killed = quantile(n_killed, 0.25, na.rm = TRUE),
    median_killed = median(n_killed, na.rm = TRUE),
    mean_killed = mean(n_killed, na.rm = TRUE),
    q3_killed = quantile(n_killed, 0.75, na.rm = TRUE),
    max_killed = max(n_killed, na.rm = TRUE),
    sd_killed = sd(n_killed, na.rm = TRUE),
    min_injured = min(n_injured, na.rm = TRUE),
    q1_injured = quantile(n_injured, 0.25, na.rm = TRUE),
    median_injured = median(n_injured, na.rm = TRUE),
    mean_injured = mean(n_injured, na.rm = TRUE),
    q3_injured = quantile(n_injured, 0.75, na.rm = TRUE),
    max_injured = max(n_injured, na.rm = TRUE),
    sd_injured = sd(n_injured, na.rm = TRUE)
  ) %>%
  arrange(desc(mean_count))

# Round all summary statistics
city_summary_stats <- city_summary_stats %>%
  mutate(across(where(is.numeric), ~ round(.x, digits = 0)))


# Split the summary statistics into three different data frames
city_summary_incidents <- city_summary_stats %>%
  select(city, min_count, q1_count, median_count, mean_count, q3_count, max_count, sd_count)

city_summary_killed <- city_summary_stats %>%
  select(city, min_killed, q1_killed, median_killed, mean_killed, q3_killed, max_killed, sd_killed)

city_summary_injured <- city_summary_stats %>%
  select(city, min_injured, q1_injured, median_injured, mean_injured, q3_injured, max_injured, sd_injured)

# Convert each summary statistics data frame to a flextable
ft_incidents_city <- flextable(city_summary_incidents) %>%
  set_header_labels(
    city = "City",
    min_count = "Min Incident", q1_count = "Q1 Incident", median_count = "Median Incident", 
    mean_count = "Mean Incident", q3_count = "Q3 Incident", max_count = "Max Incident", 
    sd_count = "SD Incident"
  ) %>%
  add_header_row(values = "Table 5a: City Summary Statistics of Gun Violence Incidents per Year (2014-2017)", colwidths = 8) %>%
  padding(padding = 3, part = "all") %>%
  add_footer_lines(values = "Source: Gun Violence Archive") %>%
  theme_vanilla() %>%
  autofit()

ft_killed_city <- flextable(city_summary_killed) %>%
  set_header_labels(
    city = "City",
    min_killed = "Min Killed", q1_killed = "Q1 Killed", median_killed = "Median Killed", 
    mean_killed = "Mean Killed", q3_killed = "Q3 Killed", max_killed = "Max Killed", 
    sd_killed = "SD Killed"
  ) %>%
  add_header_row(values = "Table 5b: City Summary Statistics of Gun Violence Deaths per Year (2014-2017)", colwidths = 8) %>%
  padding(padding = 3, part = "all") %>%
  add_footer_lines(values = "Source: Gun Violence Archive") %>%
  theme_vanilla() %>%
  autofit()

ft_injured_city <- flextable(city_summary_injured) %>%
  set_header_labels(
    city = "City",
    min_injured = "Min Injured", q1_injured = "Q1 Injured", median_injured = "Median Injured", 
    mean_injured = "Mean Injured", q3_injured = "Q3 Injured", max_injured = "Max Injured", 
    sd_injured = "SD Injured"
  ) %>%
  add_header_row(values = "Table 5c: City Summary Statistics of Gun Violence Injuries per Year (2014-2017)", colwidths = 8) %>%
  padding(padding = 3, part = "all") %>%
  add_footer_lines(values = "Source: Gun Violence Archive") %>%
  theme_vanilla() %>%
  autofit()

# Print the flextables
ft_incidents_city

Table 5a: City Summary Statistics of Gun Violence Incidents per Year (2014-2017)

City          Min Incident  Q1 Incident  Median Incident  Mean Incident  Q3 Incident  Max Incident  SD Incident
Chicago       2,043         2,316        2,723            2,770          3,177        3,591         685
Baltimore     760           873          916              928            971          1,117         146
Washington    496           525          686              752            913          1,142         301
New Orleans   660           702          744              750            792          850           81
Philadelphia  533           561          610              698            748          1,039         232
Saint Louis   517           546          643              635            732          736           115
Milwaukee     479           496          522              590            617          839           168
Houston       550           559          582              580            604          606           28
Memphis       421           497          583              574            660          709           128
Jacksonville  327           355          570              563            778          784           251
Indianapolis  425           446          463              462            479          497           31
Cleveland     284           370          450              435            515          554           119
Boston        330           338          399              404            466          490           81
Detroit       266           286          314              332            360          433           73
Peoria        144           152          184              223            256          380           109

Source: Gun Violence Archive

ft_killed_city

Table 5b: City Summary Statistics of Gun Violence Deaths per Year (2014-2017)

City          Min Killed  Q1 Killed  Median Killed  Mean Killed  Q3 Killed  Max Killed  SD Killed
Chicago       414         434        538            553          658        722         150
Baltimore     172         241        264            251          274        304         56
Washington    91          98         103            105          110        122         13
New Orleans   154         162        170            169          177        182         12
Philadelphia  186         206        217            211          222        223         17
Saint Louis   171         218        234            235          250        300         53
Milwaukee     91          105        122            120          136        144         24
Houston       217         254        278            266          290        292         35
Memphis       110         120        146            148          174        189         37
Jacksonville  88          92         95             98           100        113         11
Indianapolis  136         139        148            148          157        163         13
Cleveland     85          104        122            117          134        139         24
Boston        34          37         40             41           44         49          6
Detroit       126         130        137            140          147        160         15
Peoria        4           6          10             9            13         14          5

Source: Gun Violence Archive

ft_injured_city

Table 5c: City Summary Statistics of Gun Violence Injuries per Year (2014-2017)

City          Min Injured  Q1 Injured  Median Injured  Mean Injured  Q3 Injured  Max Injured  SD Injured
Chicago       1,895        2,252       2,662           2,674         3,084       3,478        689
Baltimore     442          550         600             574           624         655          92
Washington    298          328         339             330           341         345          22
New Orleans   381          419         467             454           502         503          59
Philadelphia  431          448         470             546           568         812          179
Saint Louis   353          385         439             436           490         514          75
Milwaukee     336          346         370             375           399         423          40
Houston       222          242         260             284           302         394          76
Memphis       298          345         412             402           469         486          88
Jacksonville  234          250         272             272           295         311          35
Indianapolis  233          253         268             273           287         324          38
Cleveland     176          236         298             293           355         398          97
Boston        108          115         134             139           158         180          33
Detroit       169          170         197             227           253         344          82
Peoria        51           56          64              71            78          105          24

Source: Gun Violence Archive

# Convert the city_state column to a factor with levels ordered by incidents_per_100k
top_15_cities <- top_15_cities |>
  mutate(city_state = factor(city_state, levels = city_state[order(incidents_per_100k, decreasing = FALSE)]))

# Create the bar chart
ggplot(top_15_cities, aes(x = city_state, y = incidents_per_100k)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  labs(
    y = "Incidents per 100,000 People",
    x = "City",
    title = "Gun Violence Incidents per 100,000 People by city",
    subtitle = "Year 2017",
    caption = "Source: Gun Violence Archive & U.S. Census Bureau"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(vjust = 0.5)
  ) +
  coord_flip()

n_killed_2017 <- gun_data_2017 |>
  group_by(geoid) |>
  summarise(n_killed = sum(n_killed), .groups = 'drop')

top_15_cities <- top_15_cities |>
  left_join(n_killed_2017, by = "geoid")

# Calculate the proportion of deaths to incidents
top_15_cities <- top_15_cities |>
  mutate(prop_deaths = (n_killed / count))

# Calculate gun deaths per 100,000 people
top_15_cities <- top_15_cities |>
  mutate(deaths_per_100k = (n_killed / population) * 100000)

top_15_cities <- top_15_cities |>
  mutate(city_state = factor(city_state, levels = city_state[order(deaths_per_100k, decreasing = FALSE)]))

# Create the bar chart
ggplot(top_15_cities, aes(x = city_state, y = deaths_per_100k)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  labs(
    y = "Deaths per 100,000 People",
    x = "City",
    title = "Gun Violence Deaths per 100,000 People by city",
    subtitle = "Year 2017",
    caption = "Source: Gun Violence Archive & U.S. Census Bureau"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(vjust = 0.5)
  ) +
  coord_flip()

As we look at the data more granularly, a different picture starts to emerge. While Alaska seems to experience the most gun violence at the state level, no Alaskan city is within the top 15 most incident-prone cities. Viewing those top 15 cities through the lens of deadliness, Saint Louis, MO stands out, with gun deaths equal to roughly 36% of its incident count in 2017. Let's dive a bit deeper to uncover more insights.
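
A minimal sketch makes that lethality comparison explicit, using the prop_deaths column computed above:

# Rank the top-15 cities by deaths as a share of 2017 incidents
top_15_cities |>
  arrange(desc(prop_deaths)) |>
  select(city_state, count, n_killed, prop_deaths)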

4.7.2 Visualization: Correlation plots

With all the additional demographic data from the census dataset, I wanted to see whether any of the added features correlate with the deaths and incidents in our dataset. From here on out, I will continue to look at the 15 cities with the highest levels of gun violence incidents in 2017.

I added three features to analyze:

  • rent_percentage: The proportion of median annual rent cost to the median annual income.

  • home_percentage: The proportion of median annual mortgage cost to the median annual income.

  • prop_deaths: The ratio of gun violence deaths to gun violence incidents

I added these to see if these scores, in addition to any other macroeconomic data, might give us a better signal to correlate with deaths and incidents.

# Adding rent_percentage and home_percentage to top_15_cities
top_15_cities <- top_15_cities |>
  mutate(
    # rent_percentage: The proportion of median annual rent cost to the median annual income
    rent_percentage = (median_monthly_rent_cost * 12) / median_income,
    
    # home_percentage: The proportion of median annual home cost to the median annual income
    home_percentage = (median_monthly_home_cost * 12) / median_income
  )

# Function to rearrange columns by data type
rearrange_by_data_type <- function(df) {
  # Split the columns by data type
  num_cols <- df |> select(where(is.numeric))
  factor_cols <- df |> select(where(is.factor))
  char_cols <- df |> select(where(is.character))
  logical_cols <- df |> select(where(is.logical))
  date_cols <- df |> select(where(is.Date))
  
  # Combine the columns back in the desired order
  df_rearranged <- bind_cols(char_cols, factor_cols, date_cols, num_cols, logical_cols)
  
  return(df_rearranged)
}

# Apply the function to rearrange columns by data type
top_15_cities <- rearrange_by_data_type(top_15_cities)

# selecting numeric columns to perform correlation matrix on
numeric_columns <- top_15_cities |>
  select(population:home_percentage)

# Calculate the correlation matrix for the selected numeric columns
correlation_matrix <- cor(numeric_columns, use = "complete.obs")

# Extract correlations related to incidents_per_100k and deaths_per_100k
correlation_incidents <- correlation_matrix["incidents_per_100k", ]
correlation_deaths <- correlation_matrix["deaths_per_100k", ]

# Convert the extracted correlations to a data frame
correlation_incidents_df <- as.data.frame(correlation_incidents)
correlation_incidents_df$variable <- rownames(correlation_incidents_df)
rownames(correlation_incidents_df) <- NULL

# Create a diagonal matrix to visualize with ggcorrplot
diag_matrix <- diag(0, ncol(correlation_matrix))
rownames(diag_matrix) <- colnames(diag_matrix) <- colnames(correlation_matrix)
diag_matrix["incidents_per_100k", ] <- correlation_matrix["incidents_per_100k", ]
diag_matrix[, "incidents_per_100k"] <- correlation_matrix[, "incidents_per_100k"]
diag_matrix <- as.data.frame(diag_matrix)
diag_matrix <- diag_matrix |> select(incidents_per_100k)
diag_matrix <- t(diag_matrix)

# Plot the filtered correlation matrix using ggcorrplot
ggcorrplot(as.matrix(diag_matrix),
           hc.order = FALSE,
           outline.color = "white",
           lab = TRUE,
           lab_size = 3,
           title = "Correlation with Incidents per 100k") +
  theme(
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 10),
    axis.text.y = element_text(size = 10),
    plot.title = element_text(size = 15),
    plot.subtitle = element_text(size = 10),
    plot.margin = unit(c(1, 1, 1, 1), "cm")
  ) +
  coord_flip() 

At the incidents-per-100k level, the strongest signals come from population, deaths per 100,000 people, and the proportion of deaths per incident, which is unsurprising given that the latter two are derived from the same underlying counts. We also see a very small signal from the gender demographic: since males tend to be more aggressive, it seems plausible that gun violence decreases slightly where they are less represented within the population.

# Convert the extracted correlations to a data frame
correlation_deaths_df <- as.data.frame(correlation_deaths)
correlation_deaths_df$variable <- rownames(correlation_deaths_df)
rownames(correlation_deaths_df) <- NULL

# Create a diagonal matrix to visualize with ggcorrplot
diag_matrix <- diag(0, ncol(correlation_matrix))
rownames(diag_matrix) <- colnames(diag_matrix) <- colnames(correlation_matrix)
diag_matrix["deaths_per_100k", ] <- correlation_matrix["deaths_per_100k", ]
diag_matrix[, "deaths_per_100k"] <- correlation_matrix[, "deaths_per_100k"]
diag_matrix <- as.data.frame(diag_matrix)
diag_matrix <- diag_matrix |> select(deaths_per_100k)
diag_matrix <- t(diag_matrix)

# Plot the filtered correlation matrix using ggcorrplot
ggcorrplot(as.matrix(diag_matrix),
           hc.order = FALSE,
           outline.color = "white",
           lab = TRUE,
           lab_size = 3,
           title = "Correlation with Deaths per 100k") +
  theme(
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 10),
    axis.text.y = element_text(size = 10),
    plot.title = element_text(size = 15),
    plot.subtitle = element_text(size = 10),
    plot.margin = unit(c(1, 1, 1, 1), "cm")
  ) +
  coord_flip()

When correlating our features with deaths per 100,000 people, a different picture emerges: poverty shows a moderate association with the deadliness of an American city. Our added cost-of-living scores are largely a function of median income and/or the poverty proportion, so I chose to focus on poverty itself to see what insight its effects might offer.

5 Function of poverty

5.1 Visualization: Scatter plots with linear regression lines

Here, I plotted incidents per 100,000 people against the proportion of each city's population living in poverty, with a fitted regression line.

p <- top_15_cities |>
  ggplot(
    aes(
      x = incidents_per_100k,
      y = prop_poverty
    )
  ) +
  geom_point(
    aes(
      text = paste(
        "City:", city_state,
        "<br>Incidents per 100k:", round(incidents_per_100k, digits = 1),
        "<br>Proportion in Poverty:", percent(prop_poverty, accuracy = 0.1)
        )
      )
    ) +
  # Add a linear regression line, colored red
  geom_smooth(
    method = "lm", # linear model
    se = FALSE, # hide the standard error ribbon
    size = 1, 
    color = "red"
  )  +
  labs(
    title = "Incidents per 100k by proportion of poverty",
    subtitle = "Top 15 cities in 2017",
    x = "Incidents per 100k",   
    y = "Proportion of city population in poverty",
    caption = "Source: Gun Violence Archive & U.S. Census Bureau"
  ) +
  scale_y_continuous(labels = percent, breaks = seq(0, 0.5, by = 0.02)) +
  theme_bw()

# Convert the ggplot2 plot to an interactive plotly plot with custom tooltips
p_plotly <- ggplotly(p, tooltip = "text")

# Manually add the subtitle and caption
p_plotly <- p_plotly |>
  layout(
    title = list(
      text = "Incidents per 100k by proportion of poverty<br><sup>Top 15 cities in 2017</sup>"
    ),
    annotations = list(
      list(
        x = 0, 
        y = -0.13,  # Adjust the y position to move the caption up slightly
        text = "Source: Gun Violence Archive & U.S. Census Bureau", 
        showarrow = FALSE, 
        xref = 'paper', 
        yref = 'paper', 
        xanchor = 'left', 
        yanchor = 'auto', 
        xshift = 0, 
        yshift = 0,
        font = list(size = 8)
      )
    ),
    margin = list(t = 80, b = 100)  # Add extra margin to accommodate the caption
  )

# Display the interactive plot
p_plotly

The regression line is nearly flat, directly reflecting the weak correlation observed in the previous section.
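To put a number on that flatness, the correlation and fitted slope can be inspected directly (a minimal sketch using the same top_15_cities data):

# Correlation between incident rate and poverty among the top 15 cities
cor(top_15_cities$incidents_per_100k, top_15_cities$prop_poverty)

# Slope and p-value of the line drawn in the plot above
lm(prop_poverty ~ incidents_per_100k, data = top_15_cities) |>
  summary()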

q <- top_15_cities |>
  ggplot(
    aes(
      x = deaths_per_100k,
      y = prop_poverty
    )
  ) +
  geom_point(
    aes(
      text = paste(
        "City:", city_state,
        "<br>Deaths per 100k:", round(deaths_per_100k, digits = 1),
        "<br>Proportion in Poverty:", percent(prop_poverty, accuracy = 0.1)
        )
      )
    ) +
  # Add a linear regression line, colored red
  geom_smooth(
    method = "lm", # linear model
    se = FALSE, # hide the standard error ribbon
    size = 1, 
    color = "red"
  )  +
  labs(
    title = "Deaths per 100k by proportion of poverty",
    subtitle = "Top 15 cities in 2017",
    x = "Deaths per 100k",   
    y = "Proportion of city population in poverty",
    caption = "Source: Gun Violence Archive & U.S. Census Bureau"
  ) +
  scale_y_continuous(labels = percent, breaks = seq(0, 0.5, by = 0.02)) +
  theme_bw()

# Convert the ggplot2 plot to an interactive plotly plot with custom tooltips
q_plotly <- ggplotly(q, tooltip = "text")

# Manually add the subtitle and caption
q_plotly <- q_plotly |>
  layout(
    title = list(
      text = "Deaths per 100k by proportion of poverty<br><sup>Top 15 cities in 2017</sup>"
    ),
    annotations = list(
      list(
        x = 0, 
        y = -0.13,  # Adjust the y position to move the caption up slightly
        text = "Source: Gun Violence Archive & U.S. Census Bureau", 
        showarrow = FALSE, 
        xref = 'paper', 
        yref = 'paper', 
        xanchor = 'left', 
        yanchor = 'auto', 
        xshift = 0, 
        yshift = 0,
        font = list(size = 8)
      )
    ),
    margin = list(t = 80, b = 100)  # Add extra margin to accommodate the caption
  )

# Display the interactive plot
q_plotly

When we plot deaths per 100,000 people against poverty, the regression line changes dramatically, now showing a clear positive slope. Note, however, that only four cities sit above a 20% poverty rate. Although the pattern looks strong, I want to establish statistical significance before treating it as support for a hypothesis.
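As a quick first read before the formal group comparison, a correlation test on the raw rates gives a rough sense of significance (a minimal sketch):

# Test whether the correlation between death rate and poverty is nonzero
cor.test(top_15_cities$deaths_per_100k, top_15_cities$prop_poverty)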

5.2 Monte Carlo Methods of Inference

5.2.1 Hypothesis

Given the relationship between poverty and gun violence observed in the preliminary analysis, I hypothesize that cities with higher poverty levels experience a greater number of gun violence deaths per 100,000 people. Specifically:

\[H_0: \mu_1 \le \mu_2 \text{ vs. } H_a: \mu_1 > \mu_2\]

Where \(\mu_1\) is the average number of gun violence deaths per 100,000 people in high poverty cities, and \(\mu_2\) is the average number of gun violence deaths per 100,000 people in low poverty cities.

# Divide cities into low and high poverty groups at the median poverty rate
prop_poverty_threshold <- median(top_15_cities$prop_poverty)
low_poverty_cities <- top_15_cities |> filter(prop_poverty <= prop_poverty_threshold)
high_poverty_cities <- top_15_cities |> filter(prop_poverty > prop_poverty_threshold)

# Threshold as a percentage, for reporting in the text
val <- round(prop_poverty_threshold * 100, digits = 1)

# Drop identifier and demographic columns, keeping city_state and deaths_per_100k
low_poverty_cities <- low_poverty_cities |>
  select(
    -(1:5),
    -(7:18),
    -(20:21)
    )

high_poverty_cities <- high_poverty_cities |>
  select(
    -(1:5),
    -(7:18),
    -(20:21)
    )

# Label each group before stacking, so the labels cannot drift
# if the group sizes change
stacked_data <- bind_rows(
  low_poverty_cities |> mutate(Poverty_Level = "Low poverty cities"),
  high_poverty_cities |> mutate(Poverty_Level = "High poverty cities")
) |>
  select(city_state, Poverty_Level, deaths_per_100k)

To test this hypothesis, I divided the top 15 cities into low and high poverty groups based on the median proportion (19.8%) of the population living in poverty.

stacked_ft <- flextable(stacked_data) %>%
  set_header_labels(
    city_state = "City, State",
    Poverty_Level = "Poverty Level",
    deaths_per_100k = "Deaths per 100k"
  ) %>%
  add_header_row(values = "Table 6: 15 Deadliest Cities by Poverty Rate (2017)", colwidths = 3) %>%
  padding(padding = 3, part = "all") %>%
  add_footer_lines(values = "Source: Gun Violence Archive & U.S. Census Bureau") %>%
  theme_vanilla() %>%
  autofit()

stacked_ft

Table 6: 15 Deadliest Cities by Poverty Rate (2017)

City, State         Poverty Level          Deaths per 100k
Chicago, IL         Low poverty cities           13.687017
Washington, DC      Low poverty cities           14.872299
Jacksonville, FL    Low poverty cities           12.609055
Houston, TX         Low poverty cities            6.695365
Cleveland, OH       Low poverty cities           11.293136
Boston, MA          Low poverty cities            6.660817
Indianapolis, IN    Low poverty cities           16.702767
Peoria, IL          Low poverty cities            6.983803
Baltimore, MD       High poverty cities          48.080336
Milwaukee, WI       High poverty cities          11.603766
New Orleans, LA     High poverty cities          39.414501
Memphis, TN         High poverty cities          18.553133
Saint Louis, MO     High poverty cities          67.330016
Philadelphia, PA    High poverty cities          14.206925
Detroit, MI         High poverty cities          10.375197

Source: Gun Violence Archive & U.S. Census Bureau

5.2.2 Visualization: Comparison of Gun Violence Deaths

The first visualization compares the number of gun violence deaths per 100,000 people between low poverty and high poverty cities. This box plot displays the distribution of gun violence deaths for both groups.

stacked_data |>
  ggplot(
    aes(
      x = Poverty_Level,
      y = deaths_per_100k,
      fill = Poverty_Level
    )
  ) +
  stat_boxplot(geom = "errorbar", width = 0.2, coef = 1.5) +
  stat_boxplot(geom = "boxplot", width = 0.5, coef = 1.5,
               outlier.shape = 8) +
  stat_summary(fun = "mean", geom = "point", shape = 23, fill = "black",
               color = "white") +
  scale_fill_manual(values = c("#009E73", "#56B4E9")) +
  coord_flip() +
  labs(
    y = "Number of Deaths per 100,000 people",
    title = "Comparison of Gun Violence Deaths",
    caption = "Source: Gun Violence Archive & U.S. Census Bureau"
    ) +
  theme(legend.position = "none")

Key observations from the box plot:

Low Poverty Cities: The box plot shows a relatively narrow range of deaths per 100,000 people, indicating less variability within this group. The median number of deaths is lower, and the whiskers suggest fewer extreme values.

High Poverty Cities: The box plot for high poverty cities reveals a much wider range of deaths per 100,000 people, with a higher median and more variability. This indicates a greater spread of gun violence incidents in these cities, including some with very high death rates.

This visualization supports the hypothesis that cities with higher poverty levels experience more gun violence deaths, and motivates a closer look at the impact of socioeconomic factors on gun violence.

5.2.3 Visualization: Empirical Density Curves and Dot Plot

The second visualization provides an empirical density curve and dot plot comparing the number of gun violence deaths per 100,000 people between low poverty and high poverty cities. This visualization helps to highlight the distribution and density of gun violence deaths in both groups.

stacked_data |> 
  ggplot(aes(x = deaths_per_100k,
             fill = Poverty_Level)) +
  geom_dotplot() +
  geom_density(color = "black", alpha = 0.6) +
    scale_fill_manual(values = c("#009E73", "#56B4E9")) +
  facet_grid(Poverty_Level ~ .) +
    labs(x = "Number of Deaths per 100,000 people",
         title = "Empirical Density Curves and Dot Plot",
         y = "Density") +
  theme(legend.position = "none") +
  ylim(0, 0.09)

Key observations from the density curves and dot plot:

Low Poverty Cities: The density curve for low poverty cities shows a higher peak, indicating that most cities in this group have a lower number of deaths per 100,000 people. The dot plot reinforces this by showing a concentration of points at lower death rates.

High Poverty Cities: The density curve for high poverty cities is relatively flat, suggesting a wider spread of death rates within this group. The dot plot shows that these cities have a broader range of gun violence death rates, with some cities experiencing significantly higher death rates.

5.2.4 Assumption Checking for T-Tests

5.2.4.1 Checking Normality

Before proceeding with the t-test, it is essential to verify whether the data follow a normal distribution. The Shapiro-Wilk test was used to assess the normality of gun violence death rates per 100,000 people in both the low and high poverty groups.

# Check for normality
shapiro_low <- shapiro.test(stacked_data$deaths_per_100k[stacked_data$Poverty_Level == "Low poverty cities"])
shapiro_high <- shapiro.test(stacked_data$deaths_per_100k[stacked_data$Poverty_Level == "High poverty cities"])

# Extract the results into a data frame
shapiro_low_df <- data.frame(
  Statistic = shapiro_low$statistic,
  p_value = shapiro_low$p.value,
  Method = shapiro_low$method,
  Data = "Low poverty cities"
)

shapiro_high_df <- data.frame(
  Statistic = shapiro_high$statistic,
  p_value = shapiro_high$p.value,
  Method = shapiro_high$method,
  Data = "High poverty cities"
)

shapiro_low_df |> flextable() |>
  set_caption("Result from Shapiro-Wilk Normality Test for Low poverty cities") |>
  padding(padding = 3, part = "all") %>%
  theme_vanilla() %>%
  autofit()
Result from Shapiro-Wilk Normality Test for Low poverty cities

Statistic    p_value      Method                         Data
0.887914     0.2237709    Shapiro-Wilk normality test    Low poverty cities

shapiro_high_df |> flextable() |>
  set_caption("Result from Shapiro-Wilk Normality Test for High poverty cities") |>
  padding(padding = 3, part = "all") %>%
  theme_vanilla() %>%
  autofit()
Result from Shapiro-Wilk Normality Test for High poverty cities

Statistic    p_value      Method                         Data
0.8616306    0.1565586    Shapiro-Wilk normality test    High poverty cities

Results:

Low Poverty Cities: The Shapiro-Wilk test statistic is 0.89 with a p-value of 0.22, indicating that we do not reject the null hypothesis of normality. Therefore, the data for low poverty cities can be considered normally distributed.

High Poverty Cities: The Shapiro-Wilk test statistic is 0.86 with a p-value of 0.16, indicating that we do not reject the null hypothesis of normality. Thus, the data for high poverty cities can also be considered normally distributed.

Both groups meet the normality assumption necessary for conducting the t-test. However, with small group sizes (n = 8 low poverty and n = 7 high poverty cities), the power of the normality test is limited, and the results should be interpreted with caution.
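Because the formal test has limited power at these sizes, a visual check is a useful complement. Below is a minimal sketch using ggplot2's built-in Q-Q geoms on the same stacked_data:

# Normal Q-Q plots of death rates, one panel per poverty group
stacked_data |>
  ggplot(aes(sample = deaths_per_100k)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~ Poverty_Level, scales = "free") +
  labs(title = "Normal Q-Q Plots by Poverty Group",
       x = "Theoretical quantiles",
       y = "Sample quantiles") +
  theme_bw()

Points hugging the reference line support the normality assumption; strong departures in the tails would argue for the permutation approach used later.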

5.2.4.2 Checking Equal Variance

Next, I assess the assumption of equal variance between the two groups using the standard deviations and variances of gun violence death rates per 100,000 people.

# Calculating standard deviations and variances for each group
stacked_data |> 
  group_by(Poverty_Level) |> 
  summarize(Mean = mean(deaths_per_100k),
            n = n(),
            SD = sd(deaths_per_100k),
            Variance = var(deaths_per_100k)) |> 
  flextable() |> 
  colformat_double(digits = 3) |> 
  padding(padding = 3, part = "all") %>%
  theme_vanilla() %>%
  autofit()

Poverty_Level          Mean      n    SD        Variance
High poverty cities    29.938    7    22.034    485.507
Low poverty cities     11.188    8    3.975     15.802

Results:

High Poverty Cities: The mean death rate is 29.9 with a standard deviation of 22.0 and a variance of 485.5.

Low Poverty Cities: The mean death rate is 11.2 with a standard deviation of 4.0 and a variance of 15.8.

The variances are clearly unequal: the high poverty group's variance is roughly 30 times that of the low poverty group, so the assumption of equal variance is violated.
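For a formal check that is robust to mild non-normality, Levene's test from the car package loaded earlier could be applied; a minimal sketch, coercing the grouping variable to a factor as leveneTest expects:

# Levene's test for homogeneity of variance between the two poverty groups
car::leveneTest(deaths_per_100k ~ factor(Poverty_Level), data = stacked_data)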

Given these results, we will use Welch’s t-test, which does not assume equal variances, to compare the gun violence death rates between low and high poverty cities.

5.2.5 Welch’s t-test

# Perform Welch's t-test. With the default alphabetical factor ordering,
# "High poverty cities" is group 1, so alternative = "greater" tests
# whether the high poverty mean exceeds the low poverty mean.
welch_t_result <- t.test(deaths_per_100k ~ Poverty_Level, data = stacked_data, alternative = "greater", var.equal = FALSE)

# Extract the results into a one-row data frame
# (the estimate component holds both group means)
welch_t_result_df <- data.frame(
  Statistic = welch_t_result$statistic,
  t_df = welch_t_result$parameter,
  p_value = welch_t_result$p.value,
  Alternative = welch_t_result$alternative,
  Mean_High = welch_t_result$estimate[1],
  Mean_Low = welch_t_result$estimate[2],
  Lower_CI = welch_t_result$conf.int[1],
  Upper_CI = welch_t_result$conf.int[2]
)

# Convert the data frame to a flextable
welch_t_result_ft <- welch_t_result_df |>
  flextable() |>
  set_caption("Results of Welch's t-test comparing rates of gun violence deaths between the low poverty cities and high poverty cities groups.") |>
  padding(padding = 3, part = "all") %>%
  theme_vanilla() %>%
  autofit()

# Print the flextable
welch_t_result_ft
Results of Welch's t-test comparing rates of gun violence deaths between the low poverty cities and high poverty cities groups.

Statistic    t_df        p_value       Alternative    Mean_High    Mean_Low    Lower_CI    Upper_CI
2.219969     6.342214    0.03290321    greater        29.93770     11.18803    2.494976    Inf

Results:

  • Welch's t-test statistic is 2.22 on 6.34 degrees of freedom (t_df).

  • The p-value is 0.033, indicating a statistically significant difference in death rates between the two groups at the 0.05 level.

  • The alternative hypothesis specifies that the death rate is greater in high poverty cities.

  • The mean death rate is 29.9 for high poverty cities and 11.2 for low poverty cities.

  • The one-sided 95% confidence interval for the difference in means is (2.49, ∞).

These results indicate that gun violence death rates are significantly higher in high poverty cities compared to low poverty cities, supporting the hypothesis that poverty levels are associated with increased gun violence deaths.
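To convey the magnitude of the difference rather than just its significance, a rough effect size can be computed. This sketch uses Cohen's d with the average of the two group variances, since they are unequal; it is an illustration added here, not part of the original test output:

# Split death rates by poverty group
grp <- split(stacked_data$deaths_per_100k, stacked_data$Poverty_Level)

# Cohen's d using the mean of the two group variances
mean_diff <- mean(grp$`High poverty cities`) - mean(grp$`Low poverty cities`)
avg_sd <- sqrt((var(grp$`High poverty cities`) + var(grp$`Low poverty cities`)) / 2)
mean_diff / avg_sd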

5.2.6 Permutation Testing

To further validate the results of Welch's t-test, we performed a permutation test. This method assesses the robustness of our findings by generating a distribution of t-statistics under the null hypothesis from 1,000 random relabelings of the poverty groups.

# Set the number of permutations
nperms <- 1000

# Initialize a vector to store the permuted test statistics
permTs <- vector(length = nperms)

# Calculate the observed Welch's t-test statistic
observed_t <- t.test(
  deaths_per_100k ~ Poverty_Level, 
  data = stacked_data, 
  alternative = "greater", 
  var.equal = FALSE
  )$statistic

# Perform the permutation test
set.seed(1986) # For reproducibility

for (p in 1:nperms) {
  permuted_data <- stacked_data %>%
    mutate(
      Poverty_Level = sample(Poverty_Level, replace = FALSE)
      )
  permTs[p] <- t.test(
    deaths_per_100k ~ Poverty_Level, 
    data = permuted_data, 
    alternative = "greater", 
    var.equal = FALSE
    )$statistic
}

# Calculate the p-value
p_value <- mean(permTs >= observed_t)

# Create a data frame for the p-value result
p_value_df <- data.frame(
  Statistic = c("Observed t-statistic", "Permutation p-value"),
  Value = c(round(observed_t, 2), round(p_value, 3))
)

# Display the p-value result using flextable
p_value_ft <- p_value_df |>
  flextable() |>
  set_caption("Permutation Test Results") |>
  padding(padding = 3, part = "all") %>%
  theme_vanilla() %>%
  autofit() 

# Print the flextable
p_value_ft
Permutation Test Results

Statistic

Value

Observed t-statistic

2.220

Permutation p-value

0.011

The permutation test p-value is 0.011: only about 1% of the 1,000 permuted t-statistics were as large as the observed value, indicating that the observed statistic is rare under the null hypothesis.
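The same permutation test can also be expressed declaratively with the infer package loaded earlier; a minimal sketch (results will vary slightly with the random draws):

# Permutation distribution of the t-statistic via infer
set.seed(1986)
null_dist <- stacked_data |>
  specify(deaths_per_100k ~ Poverty_Level) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "t",
            order = c("High poverty cities", "Low poverty cities"))

# One-sided p-value against the observed t-statistic
null_dist |>
  get_p_value(obs_stat = observed_t, direction = "greater")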

# Tally how many permuted t-statistics meet or exceed the observed one
tally_results <- janitor::tabyl(permTs >= observed_t)

# Convert to a data frame for plotting
tally_df <- as.data.frame(tally_results)

# Create a bar plot to visualize the tally results
ggplot(tally_df, aes(x = `permTs >= observed_t`, y = n)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(aes(label = n), vjust = -0.3) +
  labs(
    title = "Tally of Permuted t-Statistics Greater Than or Equal to Observed t-Statistic",
    x = "Permuted t-Statistic >= Observed t-Statistic",
    y = "Count"
  ) +
  theme_minimal()

The bar plot visualizes the tally of permutations greater than or equal to the observed t-statistic, reinforcing the significance of our original t-test results.

# Visualize the permutation distribution
perm_df <- data.frame(permTs = permTs)

# Headroom for the annotation above the tallest histogram bar
max_y <- max(table(cut(permTs, breaks = seq(min(permTs), max(permTs), by = 0.1)))) + 2

ggplot(perm_df, aes(x = permTs)) +
  geom_histogram(binwidth = 0.1, fill = "steelblue", color = "white") +
  geom_vline(aes(xintercept = observed_t), color = "dodgerblue", linetype = "dashed") +
  geom_vline(xintercept = quantile(permTs, probs = 0.95), color = "red") +
  labs(title = "Permutation Distribution of t-Statistics",
       x = "Permuted t-Statistics",
       y = "Frequency") +
  annotate("text", x = observed_t, y = max_y,
           label = paste("Observed t-statistic:", round(observed_t, 2)),
           color = "dodgerblue", hjust = 0.9) +
  theme_minimal()

The histogram shows the null distribution of permuted t-statistics; the observed statistic (dashed blue line) falls beyond the 95th percentile of the permuted values (solid red line).

Statistical Decision

Given the evidence from Welch’s t-test and the permutation test, we reject the null hypothesis (\(H_0: \mu_1 \le \mu_2\)) in favor of the alternative hypothesis (\(H_a: \mu_1 > \mu_2\)). This decision is supported by:

  • A p-value of 0.0329 from Welch’s t-test, which is less than the common alpha level of 0.05.

  • A permutation p-value of 0.011, indicating that the observed t-statistic is unlikely to occur by chance.

Conclusion: There is statistically significant evidence to conclude that cities with higher poverty levels experience higher gun violence death rates compared to cities with lower poverty levels.

This analysis underscores the critical impact of socioeconomic factors on gun violence and highlights the need for targeted interventions in high poverty areas to reduce gun violence deaths.

5.3 Bootstrapping

To further analyze the data, we performed a bootstrapping procedure to estimate the distribution of the median number of deaths per 100,000 people in high poverty cities. Bootstrapping allows us to generate multiple samples from the original dataset and compute the statistic of interest, providing a robust estimate of its distribution.

high_poverty <- stacked_data |>
  filter(Poverty_Level == "High poverty cities")

# Number of bootstrap samples
B <- 10000

# Instantiating a matrix for the bootstrap samples
# (no seed is set here, so exact results vary from run to run)
boots <- matrix(NA, nrow = nrow(high_poverty), ncol = B)

# Sampling with replacement B times
for (b in 1:B) {
  boots[, b] <- high_poverty |>
    slice_sample(prop = 1, replace = TRUE) |>
    dplyr::pull(deaths_per_100k)
}

# Instantiating a vector for the bootstrap medians
boot_medians <- vector(length = B)

# Calculating the median of each bootstrap sample
for (b in 1:B) {
  boot_medians[b] <- median(boots[, b])
}

tibble(Median = boot_medians) |>
  ggplot(aes(x = Median)) +
  geom_histogram(fill = "steelblue", color = "white") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  labs(title = "Nonparametric bootstrap distribution for the median",
       x = "Test statistic value",
       y = "Frequency") +
  geom_vline(xintercept = quantile(boot_medians, probs = c(0.025, 0.975)),
             color = "red", linetype = "dotted") +
  theme_minimal()

Results:

  • The histogram shows the distribution of bootstrap medians.

  • The vertical dotted lines represent the 2.5th and 97.5th percentiles of the bootstrap distribution, providing a 95% confidence interval for the median.

# Estimating the standard error
standard_error <- sd(boot_medians)

# Calculating the 95% confidence interval
confidence_interval <- quantile(boot_medians, probs = c(0.025, 0.975))

# Create a data frame for the results
results_df <- data.frame(
  Statistic = c("Standard Error", "2.5th Percentile", "97.5th Percentile"),
  Value = c(standard_error, confidence_interval[1], confidence_interval[2])
)

# Display the results using flextable
results_ft <- results_df |>
  flextable() |>
  set_caption("Bootstrap Estimates for the Median") |>
  padding(padding = 3, part = "all") |>
  theme_vanilla() |>
  autofit()

# Print the flextable
results_ft
Bootstrap Estimates for the Median

Statistic            Value
Standard Error       13.35475
2.5th Percentile     11.60377
97.5th Percentile    48.08034

Bootstrap Confidence Interval Findings:

We are 95% confident that the median gun violence death rate among high poverty cities lies between 11.6 and 48.1 deaths per 100,000 people. The interval is wide, reflecting both the small group size (n = 7) and the pronounced right skew of this group's death rates.
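As an aside, the loop-based bootstrap above can be written more compactly with replicate(); a minimal sketch, with an arbitrary seed that was not part of the original analysis:

# Compact equivalent of the loop-based bootstrap of the median
set.seed(42)  # arbitrary seed, chosen here only for reproducibility
boot_medians_alt <- replicate(
  10000,
  median(sample(high_poverty$deaths_per_100k, replace = TRUE))
)

# Percentile 95% confidence interval for the median
quantile(boot_medians_alt, probs = c(0.025, 0.975))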

6 Conclusion

Gun ownership, legislation, and violence in America have always been complex and polarizing topics. Although gun ownership is woven into the fabric of the American identity through the Constitution and is seen as an essential right, that right is viewed by some as conflicting with efforts to solve the problem of gun violence.

This analysis gets closer to untangling some of the mystery behind America's gun violence issues, but there is still more work to be done. Many more features could be added to get a better sense of what is driving the rise in gun violence deaths in America. With more time and research, we may be able to reach a consensus on how to legislate and address the issue.

This project was not of my choosing, and I had no preconceived notions about what evidence I would find, if any at all. I simply let the data lead my line of questioning. While I suspected that poverty had something to do with gun violence, I didn't know to what extent, and I did not anticipate the noticeable disparity between its effects on gun violence incidents versus deaths. Through this exercise, it became clear to me that the lens through which we view gun legislation is often flawed. In this analysis, gun violence deaths are clearly a dependent variable, an outcome: they are a function of other features of American culture. Trying to solve a problem by focusing on the output instead of the input seems only to perpetuate the issue.

Moreover, through this process and generally looking over some of the previously cited data regarding gun ownership in this country, an interesting pattern emerged that seemed worth exploring. I noticed that many of the states with the deadliest cities had some of the very lowest rates of gun ownership in the country. Despite this being anecdotal, it is a curious question that deserves a solid answer. The devil is in the details, and this multifaceted problem has no shortage of them.

Regarding the statistical significance of poverty as a driver of lethality among cities, one might posit that a lack of healthcare infrastructure in impoverished communities affects health outcomes after gun violence occurs. Underfunded hospitals, a shortage of hospitals, or slow emergency response times could be factors worth examining when thinking about how to improve the situation. Other social and psychological factors might also be informative, but they would be arduous to collect reliably and difficult to quantify.

Being from Baltimore, Maryland, the second deadliest city in America in 2017, it is hard to witness the poverty and violence firsthand. Knowing that right around the corner my fellow Americans are living in a war zone is unsettling, to say the least. I would personally like to see more careful analysis and concerted effort among legislators to pinpoint effective ways to reduce gun violence in America. Only through thorough research, open dialogue, and targeted interventions can we hope to make meaningful progress on this critical issue.