1 Introduction

1.1 Explanation of Project

Grand Valley’s Machine Learning course, CIS 678, challenged students to identify underrepresented bird species in data collected across the United States from eBird, a platform where citizen scientists record their sightings. The dataset includes 85 species, with total observation counts ranging from more than 160,000 for the most common species to just 300 for the rarest. We explore this further in our Exploratory Analysis.

Our team implemented a K-Nearest Neighbors (KNN) algorithm, a non-parametric supervised learning method. KNN compares a new observation to its nearest neighbors in the dataset and classifies it according to the majority category among those neighbors. This project used Mean Absolute Error (MAE) as the primary performance metric.

A key challenge was identifying underrepresented common birds in the testing data, as these may be overshadowed by rarer or more exciting species. To address this, we tested various hyperparameters, distance metrics, and decay functions. Our algorithm consistently performed best using the Manhattan distance metric. This preference may arise because Manhattan distance measures absolute differences across features, making it robust to outliers and well-suited for high-dimensional datasets with uneven scales.
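
To make the mechanics concrete, the sketch below shows the core KNN idea under simplifying assumptions: checklists stored as numeric columns, Manhattan distance, and a plain unweighted average of the neighbors. The function name and default k are illustrative only; they are not our final implementation, which is developed throughout the report.

# Minimal KNN sketch (illustrative; not the report's final implementation)
knn_sketch <- function(train_mat, query, k = 5) {
  # Manhattan distance from the query checklist to every training checklist
  distances <- apply(train_mat, 2, function(col) sum(abs(query - col)))
  # Indices of the k closest training checklists
  nearest <- order(distances)[1:k]
  # Simple unweighted prediction: average the neighboring checklists
  rowMeans(train_mat[, nearest, drop = FALSE])
}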

1.2 Applying HPC Principles

We leveraged the doParallel and parallelly packages to execute tasks in parallel across multiple cores, significantly improving efficiency. Distributing the work across cores allowed faster iteration during testing and analysis, reducing run times from approximately 10 minutes to about 4 minutes.
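
As a minimal illustration of the pattern (simplified; the real workload in our scripts is the distance and MAE calculations shown later), the back end is created with parallelly and doParallel, and independent iterations are dispatched with foreach's %dopar%:

library(doParallel)
library(parallelly)

# Size the cluster to the machine, leaving one core free
ncores <- availableCores() - 1
cl <- makeCluster(ncores)
registerDoParallel(cl)

# Each iteration runs independently on a worker; sqrt(i) stands in for one unit of work
results <- foreach(i = 1:100, .combine = c) %dopar% {
  sqrt(i)
}

stopCluster(cl)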

2 Exploratory Analysis

# Data Wrangling
# Renaming the first column to "Species":
training_set <- training_set |>
  rename(Species = ...1)

test_set <- test_set |>
  rename(Species = ...1)


# Changing underscores to spaces for cleaner labels
training_set$Species <- gsub("_", " ", training_set$Species)
test_set$Species <- gsub("_", " ", test_set$Species)


# Summary statistics: total observations per species
bird_totals <- training_set |>
  rowwise() |>
  mutate(Bird_Totals = sum(c_across(where(is.numeric)))) |>
  select(Species, Bird_Totals) |>
  arrange(desc(Bird_Totals))

# For comparing the most and least common species
top_bird_totals <- bird_totals |>
  arrange(desc(Bird_Totals)) |>
  mutate(group = "Top 10") |>
  head(10)

bottom_bird_totals <- bird_totals |>
  arrange(Bird_Totals) |>
  mutate(group = "Bottom 10") |>
  head(10)

combined_bird_totals <- bind_rows(bottom_bird_totals, top_bird_totals) |>
  mutate(group = factor(group, levels = c("Top 10", "Bottom 10")))

# Load the precomputed (cached) version of this object
combined_bird_totals <- readRDS("combined_bird_totals.rds")

combined_bird_totals |>
  ggplot(aes(x = reorder(Species, Bird_Totals), y = Bird_Totals, fill = Bird_Totals)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_y_continuous(labels = comma) +
  scale_fill_viridis_c(option = "viridis", begin = 0.1, end = 0.8) +
  facet_wrap(~group, scales = "free", ncol = 1,
             labeller = as_labeller(c(
               "Top 10" = "10 Most Common Birds",
               "Bottom 10" = "10 Least Common Birds"))) +
  labs(x = "Species",
       y = "Total Count Observed",
       title = "Examining Bird Frequencies",
       subtitle = "Population Observations Vary Widely by Species") +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(color = "#440356",
                                  size = 16,
                                  face = "bold",
                                  hjust = 0.5),
        plot.subtitle = element_text(face = "italic", hjust = 0.5))

bird_totals <- readRDS("bird_totals.rds")

bird_summary <- psych::describe(bird_totals$Bird_Totals) %>% 
  select(n, mean, median, sd, min, max)

bird_summary_table <- bird_summary |>
  flextable()|>
  set_header_labels(
    n = "N",
    mean = "Mean",
    median = "Median",
    sd = "SD",
    min = "Min",
    max = "Max") |>
    add_header_lines("Bird Species Observations: Summary Statistics") |>
  theme_viridis_flex() |>
  autofit()

bird_summary_table

Bird Species Observations: Summary Statistics

  N     Mean        Median   SD          Min   Max
  85    22,176.48   8,284    34,324.25   300   168,864

2.1 Distribution of the Test Data

quantiles <- quantile(bird_totals$Bird_Totals, prob = seq(0, 1, length = 6), type = 5)

# Categorize each value into a quantile
bird_totals$Q <- cut(
  bird_totals$Bird_Totals,
  breaks = quantiles,
  labels = paste0("Q", 1:5), # Label each quantile group (e.g., Q1, Q2, ...)
  include.lowest = TRUE)

# Group the data by quantile and calculate summary statistics
quantile_summary <- bird_totals %>%
  group_by(Q) %>%
  summarise(
    count = n(),
    mean = mean(Bird_Totals, na.rm = TRUE),
    median = median(Bird_Totals, na.rm = TRUE),
    sd = sd(Bird_Totals, na.rm = TRUE),
    min = min(Bird_Totals, na.rm = TRUE),
    max = max(Bird_Totals, na.rm = TRUE))


quantile_table <- quantile_summary |>
  flextable() |>
  set_header_labels(
    Q = "Quantile",
    count = "Count",
    mean = "Mean",
    median = "Median",
    sd = "SD",
    min = "Min",
    max = "Max") |>
  add_header_lines("Bird Species Quintiles: Summary Statistics Comparison") |>
  theme_viridis_flex() |>
  autofit()
# Side-by-side box plots of bird totals by quintile
ggplot(bird_totals, aes(x = Q, y = Bird_Totals, fill = Q)) +
  geom_boxplot(alpha = 0.8) +      # Slight transparency
  scale_y_log10(labels = comma) +  # Logarithmic scale on the y-axis
  scale_fill_viridis_d(option = "viridis", begin = 0.01, end = 0.8) +
  labs(
    x = "Quintile Group",
    y = "Log-Scaled Bird Totals",
    title = "Bird Species Observations Across Quintiles",
    subtitle = "Largest Spread Observed Within Q5",
    caption = "*Logarithmic Scale for Better Comparison") +
  theme_viridis_plot() +
  theme(legend.position = "none")

# Display the quintile summary table
quantile_table

Bird Species Quintiles: Summary Statistics Comparison

  Quantile   Count   Mean          Median   SD            Min      Max
  Q1         17        918.9412     1,085     397.9556      300     1,343
  Q2         17      2,567.1765     2,203     914.7128    1,436     4,229
  Q3         17      8,364.8235     8,284   2,521.9425    4,481    12,697
  Q4         17     19,979.7647    20,244   4,978.1422   13,029    29,059
  Q5         17     79,051.7059    78,712  40,245.6395   29,649   168,864

The summary statistics reveal wide variability in species counts. When we analyze the data by quintiles, Q5 exhibits the largest standard deviation and overall spread, while the bottom quintile (Q1) shows the smallest spread, with values tightly clustered within a narrow range.

This builds the case for normalizing the dataset before feeding it into the KNN algorithm. For example, the Common Myna, with over 168,000 observations, could dominate the distance calculations and drown out the contributions of smaller-scale (Q1) species.

Re-scaling all features to a comparable range lets each feature contribute equally to the distance metric, improving the algorithm’s ability to predict missing birds accurately. We did, however, move forward with non-normalized data for the initial KNN iterations.

# The first column was renamed to Species above, so use it for the row names
rownames(training_set) <- training_set$Species
training_set <- training_set[, -1]  # Remove the Species column from the dataset

rownames(test_set) <- test_set$Species
test_set <- test_set[, -1]  # Remove the Species column from the dataset

2.2 Examining Data Sparsity

2.2.1 Value Distribution Analysis

################################################################################
### Function created to help us analyze data sparsity
### Analyze zero vs non-zero distribution
### Our goal with this is to help us identify what distance metric to use
################################################################################
analyze_zero_nonzero <- function(dataset, dataset_name) {
  
  # Drop the first (species) column and flatten the counts to a vector
  values <- as.vector(as.matrix(dataset[, -1]))
  
  # Count of zero values in the dataset
  zero_count <- sum(values == 0)
  
  # Count of non-zero values in the dataset
  non_zero_count <- sum(values != 0)
  
  # sums the two
  total <- zero_count + non_zero_count
  
  # creates data frame that shows proportionality
  distribution <- data.frame(
    Category = c("Zero", "Non-Zero"),
    Count = c(zero_count, non_zero_count),
    Percentage = c(zero_count / total * 100, non_zero_count / total * 100),
    Dataset = dataset_name
  )
  
  return(distribution)
}

# Combine data-sets into one for facet wrapping
combine_distributions <- function(train, test) {
  train$Dataset <- "Training Set"
  test$Dataset <- "Test Set"
  combined <- bind_rows(train, test)
  return(combined)
}
# Perform analysis on training and test sets
training_distribution <- analyze_zero_nonzero(training_set, "Training Set")
test_distribution <- analyze_zero_nonzero(test_set, "Test Set")
combined_distribution <- combine_distributions(training_distribution, test_distribution)
# Plot donut charts with facet wrap
ggplot(combined_distribution, aes(x = 2, y = Percentage, fill = Category)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar(theta = "y") +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), 
            position = position_stack(vjust = 0.5)) +
  theme_void() +
  labs(title = "Zero vs Non-Zero Distribution by Dataset", fill = "Category") +
  scale_fill_viridis_d(option = "viridis", begin = 0.7, end = 0.5) + # keeping the viridis theme
  theme(
    legend.position = "bottom",
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10),
    strip.text = element_text(size = 14),
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold", color = "#440356")) +
  facet_wrap(~ Dataset) +
  xlim(0.5, 2.5) # Adjust the x limits to create the donut hole

Given the sparsity and high dimensionality of our dataset, we are prioritizing Manhattan distance as our initial distance metric. Its ability to handle sparse data and maintain meaningful comparisons in high-dimensional spaces makes it a strong candidate for analysis.

2.3 Initial Optimization of K

As mentioned, our group decided to use Manhattan distance as the primary distance metric for optimizing \(k\) within our model.

We first plot the MAE for \(k\) values from 3 to 25 using Manhattan distance, with no weighting applied, on non-normalized data. This provides a baseline estimate of \(k\) before other hyperparameter tuning is applied.
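
The MAE values plotted below were recorded in our test journal. As a sketch of how such a sweep could be generated, assuming a hypothetical predict_and_score() helper that runs the unweighted Manhattan KNN for a given k and returns the MAE:

# Hypothetical sweep over k; predict_and_score() is a stand-in for one full KNN run
k_values <- 3:25
baseline_results <- data.frame(
  K   = k_values,
  MAE = sapply(k_values, predict_and_score)
)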

# Load the required sheets and add a column for the transformation type
sheet_2 <- read_xlsx("Test_Journal.xlsx", sheet = 2) %>%
  mutate(Transformation = "No Normalization")


# Plot the combined data with a color for each transformation type
ggplot(sheet_2, aes(x = K, y = MAE)) +
  geom_line(color = "#30738E", size = 1) +    # Add the line
  geom_point(color = "#440356", size = 2) +   # Add points for better visualization
  labs(
    title = "Effect of K on MAE: Using Manhattan Distance Without Weights",
    subtitle = "Optimal K Identified at ~19",
    x = "Number of Neighbors (K)",
    y = "Mean Absolute Error (MAE)") +
  theme_viridis_plot()

Along with selecting a distance metric, we are exploring neighbor weighting strategies. To establish a baseline for a decay function, we first examine the distance distribution between neighbors in the training set. Understanding this distribution will help design a weighting approach that balances influence among nearest neighbors.

# Cluster setup (not shown explicitly in the original script): size it to the available cores
ncores <- parallelly::availableCores() - 1
cl <- makeCluster(ncores)
registerDoParallel(cl)

set.seed(1986)

# Number of random samples
num_samples <- 1000

# Initialize a vector to store medians
medians <- numeric(num_samples)

medians <- foreach(i = 1:num_samples, .combine = c) %dopar% {
  # Randomly decide to sample from training or test set
  if (runif(1) < 0.5) {
    # Sample from training set (using double bracket to ensure a vector)
    sample_col <- training_set[[ sample(1:ncol(training_set), 1) ]]
    distances <- apply(training_set, 2, function(col) sum(abs(sample_col - col)))
  } else {
    # Sample from test set (using double bracket to ensure a vector)
    sample_col <- test_set[[ sample(1:ncol(test_set), 1) ]]
    distances <- apply(training_set, 2, function(col) sum(abs(sample_col - col)))
  }
  
  # Return the median distance
  median(distances)
}
# Stop the cluster after computation
stopCluster(cl)

We took 1,000 random samples of checklists from both the training and test sets. For each sample, we calculated the Manhattan distance to every other column in the training set. We then captured the median distance for each sample to better understand the overall distance distribution. This analysis helped us determine an appropriate value for a decay function.

medians <- readRDS("medians.rds")

medians_df <- data.frame(MedianDistance = medians)

ggplot(medians_df, aes(x = MedianDistance)) +
  geom_histogram(bins = 20, color = "black", fill = "#58BF6C", alpha = 0.7) +
  labs(
    title = "Distribution of Distance Medians",
    subtitle = "Most Observations are Clustered at Lower Values",
    x = "Median Distance",
    y = "Frequency") +
  theme_viridis_plot() 

# Load the saved summary from the RDS file
median_summary <- readRDS("median_summary.rds")

# Create the flextable
flex_table <- flextable(median_summary) |>
  add_header_lines(values = "Summary of Median Manhattan Distances") |>
  theme_viridis_flex() |>
  autofit()

# Print the flextable
flex_table

Summary of Median Manhattan Distances

  Min   1st Qu.   Median   Mean      3rd Qu.   Max
  96    118       145.5    170.604   206       523

Across the 1,000 random samples, the median of the sampled Manhattan distances was 145.5. Based on this, we will start with a sigma of 145.5 for exponential decay.

Additionally, we found that the 75th percentile of Manhattan distances from these samples was 206. If we choose to use triangular decay, this value will serve as our default radius (r = 206).

Thus, our default decay functions take the following forms (a short R sketch of both follows their definitions below):

3 Decay Functions Defined

3.1 Exponential Decay Function

\[ w_i = e^{-\frac{d_i}{\sigma}} \]

Explanation of the Terms:

  • \(w_i\): Weight for neighbor \(i\).
  • \(d_i\): Distance of neighbor \(i\) from the test point.
  • \(\sigma\): Decay parameter controlling the rate of decay.

3.2 Triangular Decay Function

\[ w_i = \max(0, 1 - \frac{d_i}{r}) \]

Explanation of the Terms:

  • \(w_i\): Weight for neighbor \(i\).
  • \(d_i\): Distance of neighbor \(i\) from the test point.
  • \(r\): Radius parameter defining the maximum distance for contributing weights.
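
A direct R translation of these two weighting schemes, using the default sigma and r identified above (a sketch; the function names are ours):

# Exponential decay: weight falls off smoothly with distance, controlled by sigma
exponential_decay <- function(d, sigma = 145.5) {
  exp(-d / sigma)
}

# Triangular decay: weight falls linearly to zero at the radius r
triangular_decay <- function(d, r = 206) {
  pmax(0, 1 - d / r)
}

# Example: weight given to a neighbor at distance 100
exponential_decay(100)  # ~0.50 with the default sigma
triangular_decay(100)   # ~0.51 with the default r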

4 Distance Function Defined

Manhattan Distance \[ d = \sum_{i=1}^{n} |x_i - y_i| \]

Note: Initial tests were also run with Euclidean and cosine distance metrics, but both were outperformed by Manhattan distance. Because they had little effect on our results, we omit their formulas here.

5 Hyperparameter Tuning

After finding a seemingly optimal value of k (19), we moved forward with evaluating exponential decay, since it provides smoother weighting that down-weights distant neighbors more gently. Testing with the default sigma of 145.5 yielded a modest improvement, decreasing our MAE by 0.001.

Upon further examination of the kNN output, and noting the distances between test columns and their nearest neighbors, we rarely saw any neighbor exceed a distance of 100. Looking back at our default decay functions, a distance of 100 under exponential decay with sigma = 145.5 corresponds to a weight of roughly 0.5, a clear sign that we needed to tighten our value of sigma. After testing with a value of 40, we saw a nice decrease in MAE to about 0.0487.
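
A quick check of the weights implied by those distances makes the issue clear:

# Weight assigned to a neighbor at distance 100 under each sigma
exp(-100 / 145.5)  # ~0.50: distant neighbors still carry about half weight
exp(-100 / 40)     # ~0.08: distant neighbors are now heavily down-weighted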

We then tested sigma values around 40, such as 10, 20, 30, and 50, to see whether our score improved further.

5.1 Testing Different Decay Functions and Values

5.1.1 Exponential Decay

The plot above helped us visualize how our distances would be weighted for our tested values of sigma.

decay_simulations = read_xlsx("Test_Journal.xlsx", sheet = 3)

ggplot(decay_simulations, aes(x = Sigma, y = MAE)) +
  geom_line(color = "#30738E", size = 1) +    # Add the line
  geom_point(color = "#440356", size = 3) +   # Add points for better visualization
  geom_hline(yintercept = 0.047, linetype = "dotted", color = "#58BF6C", size = 0.8) + # Add the dotted line with a better color
  annotate(
    "text", x = min(decay_simulations$Sigma), y = 0.047, 
    label = "Target MAE", hjust = 3, vjust = -0.5, size = 5, color = "#58BF6C"
  ) + # Add label near the dotted line
  labs(
    title = "Effect of Exponential Decay on MAE",
    subtitle = "Decreasing Sigma Improves Model Performance",
    x = "Sigma (σ)",
    y = "Mean Absolute Error (MAE)"
  ) +
  scale_x_reverse() + # Reverse the x-axis
  theme_viridis_plot()

Results

As a result, we were able to bring our MAE below the recommended threshold for this problem simply by reasoning about the properties of our data with respect to the distance formula, selecting an appropriate number of neighbors, and tailoring the decay function to the observed distances between test and training columns. Our lowest score was achieved with a sigma value of 10, which yielded an MAE of 0.046.

5.1.2 Triangular Decay

In a similar fashion, we visualize the impact the triangular weightings have on varying distances within the dataset. This allows for examination of what a potential \(r\) value should be.

decay_simulations = read_xlsx("Test_Journal.xlsx", sheet = 4)


ggplot(decay_simulations, aes(x = r, y = MAE)) +
  geom_line(color = "#30738E", size = 1) +    # Add the line
  geom_point(color = "#440356", size = 3) +   # Add points for better visualization
  geom_hline(yintercept = 0.047, linetype = "dotted", color = "#58BF6C", size = 0.8) + # Add the dotted line with a better color
  annotate(
    "text", x = min(decay_simulations$r), y = 0.047, 
    label = "Target MAE", hjust = -2.3, vjust =-1, size = 5, color = "#58BF6C"
  ) + # Add label near the dotted line
  labs(
    title = "Effect of Triangular Decay on MAE",
    subtitle = "Increasing Radius Stabilizes Model Performance But Not to Ideal Levels",
    x = "r (Radius)",
    y = "Mean Absolute Error (MAE)") +
  theme_viridis_plot()

With this in mind, we acknowledge that too low an \(r\) value would produce unrealistic predictions if many checklists fall outside the required distance. We therefore hypothesize that an appropriate \(r\) lies somewhere between 50 and 250. We plot the MAEs for \(r\) values within this range and obtain the following:

Although increasing \(r\) improves the MAE, triangular decay still did not perform as well as exponential decay, which we hypothesize provides a more realistic smoothing and weighting effect. Therefore, we continue our analysis with a focus on exponential decay.

5.2 Soft Zero Weighting

We added another layer to the algorithm by incorporating soft zero weighting, which applies a conditional scalar to adjust the impact of missing values in the test column.

Essentially, we implemented this in the predict_knn_weighted() function by supplying the algorithm a scalar weight in the range [0, 1], which is passed to the calculate_distance() function. That function calls generate_weights(), which examines the test column and creates a vector of weights of the same length.

The function applies a conditional scalar, ensuring that zeros in the test column receive the assigned soft zero weight (s), while nonzero values retain full weight:

\[ w_i = \begin{cases} s, & \text{if } x_i = 0 \\ 1, & \text{otherwise} \end{cases} \]

This vector of weights is then returned to calculate_distance(). From there, distance_func() passes the test column, training column, and weight vector to the manhattan_distance() function, which computes the distance between the test and training columns.

The distance is adjusted by multiplying each feature’s absolute difference by its corresponding weight. The final sum of weighted distances is returned to the main algorithm, ensuring that missing values (zero entries) contribute proportionally while preserving the integrity of observed values.
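
A compact sketch of this weighting chain is shown below. The helper names echo the description above, but this is illustrative code, not our exact implementation:

# Weight vector for a test column: zeros receive the soft zero scalar s, non-zeros receive 1
generate_soft_zero_weights <- function(test_col, s) {
  ifelse(test_col == 0, s, 1)
}

# Weighted Manhattan distance between a test column and a training column
weighted_manhattan <- function(test_col, train_col, w) {
  sum(w * abs(test_col - train_col))
}

# Example with soft zero weight s = 0.6
w <- generate_soft_zero_weights(c(0, 3, 0, 1), s = 0.6)
weighted_manhattan(c(0, 3, 0, 1), c(2, 1, 0, 4), w)  # 6.2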

We then began hyperparameter testing for our established decay-function weighting as well as our new soft zero weighting. Testing combinations of two hyperparameters in our main script created a new problem to solve: even with parallelized processing in our workflow, nesting two loops over hyperparameters increased our runtime by orders of magnitude.

We circumvented this by offloading the distance calculations to a separate code block, so the distances between test and training columns did not have to be recalculated on every iteration. In addition, we created an 80/20 split of the training data to perform internal cross-validation instead of validating our MAE scores on Kaggle. This change greatly increased efficiency, allowing us to generate hundreds of MAE scores from multidimensional hyperparameter tests in minutes. The only drawback is a slight modification to the algorithm’s order of operations: in our original approach, nearest neighbors were selected after all weighting and distances were applied; in this optimized version, we first compute distances, identify the nearest neighbors, and then apply weighting.

This fact, along with using a shortened dataset, caused differences between our internally generated MAE values and those calculated by Kaggle. However, solutions that minimized MAE internally also minimized MAE externally, demonstrating the effectiveness of this methodology.
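
A sketch of those two efficiency changes, assuming checklists are stored as columns (names simplified relative to our actual scripts):

set.seed(1986)

# Internal 80/20 split of the training checklists
n_cols         <- ncol(training_set)
holdout_idx    <- sample(n_cols, size = floor(0.2 * n_cols))
internal_train <- training_set[, -holdout_idx]
internal_test  <- training_set[, holdout_idx]

# Precompute all test-to-train Manhattan distances once; the hyperparameter loops
# then reuse this matrix instead of recalculating distances on every iteration
dist_matrix <- sapply(seq_len(ncol(internal_test)), function(j) {
  apply(internal_train, 2, function(col) sum(abs(internal_test[, j] - col)))
})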

We tested our hyperparameters iteratively, increasing resolution around the minimum MAE results. For the first pass we tested the following (a sketch of the grid-search pattern appears after the list):

  • Sigma [5:50] (increments of 5)
  • Scalar (Soft-zero weights) [.1:1] (increments of .1)
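
A sketch of the grid-search pattern for this pass, where evaluate_mae() is a hypothetical stand-in that scores one Sigma/Scalar combination against the internal 80/20 split using the precomputed distances:

# Hyperparameter grid for the first pass
grid <- expand.grid(Sigma  = seq(5, 50, by = 5),
                    Scalar = seq(0.1, 1, by = 0.1))

# Score each combination in parallel; evaluate_mae() is hypothetical
results_df <- foreach(i = seq_len(nrow(grid)), .combine = rbind) %dopar% {
  data.frame(Sigma  = grid$Sigma[i],
             Scalar = grid$Scalar[i],
             MAE    = evaluate_mae(sigma = grid$Sigma[i], scalar = grid$Scalar[i]))
}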

The following three-dimensional plot shows our results:

# Load the results dataframe
results_df <- readRDS("results.rds")  

results_df$Sigma <- as.numeric(as.character(results_df$Sigma))

# Create custom hover text
results_df$HoverText <- paste(
  "Scalar: ", results_df$Scalar,
  "<br>Sigma: ", results_df$Sigma,
  "<br>MAE: ", round(results_df$MAE, 4)
)

# Create a 3D scatter plot with custom hover labels
plot_ly(
  data = results_df,
  x = ~Scalar,
  y = ~Sigma,
  z = ~MAE,
  text = ~HoverText,  # Use the custom hover text
  hoverinfo = "text",  # Display only the custom text on hover
  color = ~Sigma,
  colors = viridisLite::viridis(100),
  type = "scatter3d",
  mode = "markers"
) %>%
  layout(
    title = list(
      text = "3D Plot #1: Sigma and Scalar Combination Testing",
      
      x = 0,  # Align the title to the left
      xanchor = "left",  # Ensure alignment is set to the left
      font = list(
        size = 18,  # Title font size
        weight = "bold"  # Make the font bold
      )
    ),
    scene = list(
      xaxis = list(title = "Scalar"),
      yaxis = list(title = "Sigma"),
      zaxis = list(title = "MAE")
    ),
    annotations = list(
      list(
        x = 0.5,  # Center horizontally
        y = -0.2,  # Position below the plot area
        text = "MAE across Sigma and Scalar values calculated using an 80/20 split of the training data.",
        showarrow = FALSE,  # No arrow pointing to the caption
        xref = "paper",  # Position relative to the entire plot area
        yref = "paper",  # Position relative to the entire plot area
        font = list(
          size = 12,
          color = "gray"
        ),
        align = "center"
      )
    ),
    margin = list(b = 80)
  )

Result: This yielded an optimal Scalar of 0.6 and a Sigma of 10, which produced a Kaggle MAE of 0.0447, a reduction from our previous MAE of 0.0467 with sigma = 10 alone.

The second pass narrowed the domain and range to increase our resolution around a local minimum.

  • Sigma [5:12] (increments of .25)
  • Scalar (Soft-zero weights) [.15: .75] (increments of .05)

The following three-dimensional plot shows our results:

# Load the results dataframe
g_results_df <- readRDS("granular_results.rds")  # Ensure the path is correct

g_results_df$Sigma <- as.numeric(as.character(g_results_df$Sigma))

# Create custom hover text
g_results_df$HoverText <- paste(
  "Scalar: ", g_results_df$Scalar,
  "<br>Sigma: ", g_results_df$Sigma,
  "<br>MAE: ", round(g_results_df$MAE, 4)
)

# Create a 3D scatter plot with custom hover labels
plot_ly(
  data = g_results_df,
  x = ~Scalar,
  y = ~Sigma,
  z = ~MAE,
  text = ~HoverText,  # Use the custom hover text
  hoverinfo = "text",  # Display only the custom text on hover
  color = ~Sigma,
  colors = viridisLite::viridis(100),
  type = "scatter3d",
  mode = "markers"
) %>%
  layout(
    title = list(
      text = "3D Plot #2: Sigma and Scalar Combination Testing",
      
      x = 0,  # Align the title to the left
      xanchor = "left",  # Ensure alignment is set to the left
      font = list(
        size = 18,  # Title font size
        weight = "bold"  # Make the font bold
      )
    ),
    scene = list(
      xaxis = list(title = "Scalar"),
      yaxis = list(title = "Sigma"),
      zaxis = list(title = "MAE")
    ),
    annotations = list(
      list(
        x = 0.5,  # Center horizontally
        y = -0.2,  # Position below the plot area
        text = "MAE across Sigma and Scalar values calculated using an 80/20 split of the training data.",
        showarrow = FALSE,  # No arrow pointing to the caption
        xref = "paper",  # Position relative to the entire plot area
        yref = "paper",  # Position relative to the entire plot area
        font = list(
          size = 12,
          color = "gray"
        ),
        align = "center"
      )
    ),
    margin = list(b = 80)
  )

Result: This pass found its minimum at a Scalar of 0.4 and a Sigma of 9, which produced a Kaggle MAE of 0.0444, a reduction from our previous MAE of 0.0447.

The third pass increased our resolution yet again.

  • Sigma [7:9.75] (increments of .05)
  • Scalar (Soft-zero weights) [.3: .5] (increments of .01)

The following three-dimensional plot shows our results:

# Load the results dataframe
results_df_v3 <- readRDS("results_df_v3.rds")  # Ensure the path is correct

results_df_v3$Sigma <- as.numeric(as.character(results_df_v3$Sigma))

# Create custom hover text
results_df_v3$HoverText <- paste(
  "Scalar: ", results_df_v3$Scalar,
  "<br>Sigma: ", results_df_v3$Sigma,
  "<br>MAE: ", round(results_df_v3$MAE, 4)
)

# Create a 3D scatter plot with custom hover labels
plot_ly(
  data = results_df_v3,
  x = ~Scalar,
  y = ~Sigma,
  z = ~MAE,
  text = ~HoverText,  # Use the custom hover text
  hoverinfo = "text",  # Display only the custom text on hover
  color = ~Sigma,
  colors = viridisLite::viridis(100),
  type = "scatter3d",
  mode = "markers"
) %>%
  layout(
    title = list(
      text = "3D Plot #3: Sigma and Scalar Combination Testing",
      
      x = 0,  # Align the title to the left
      xanchor = "left",  # Ensure alignment is set to the left
      font = list(
        size = 18,  # Adjust the font size 
        weight = "bold"  # Make the font bold
      )
    ),
    scene = list(
      xaxis = list(title = "Scalar"),
      yaxis = list(title = "Sigma"),
      zaxis = list(title = "MAE")
    ),
    annotations = list(
      list(
        x = 0.5,  # Center horizontally
        y = -0.2,  # Position below the plot area
        text = "MAE across Sigma and Scalar values calculated using an 80/20 split of the training data.",
        showarrow = FALSE,  # No arrow pointing to the caption
        xref = "paper",  # Position relative to the entire plot area
        yref = "paper",  # Position relative to the entire plot area
        font = list(
          size = 12,
          color = "gray"
        ),
        align = "center"
      )
    ),
    margin = list(b = 80)
  )

This pass revealed an interesting structure, with troughs near sigma values of about 9 and 7.75, but the minimum MAE remained at the previous iteration’s hyperparameter values.

6 Data Normalization

Our group decided to try two different normalization techniques to see if they improved the robustness and accuracy of the KNN recommendations in conjunction with exponential decay and soft zero weighting. The two techniques we utilized were min-max scaling and log transformation, as discussed in the GeeksForGeeks article “How to Normalize and Standardize Data in R?”.

6.1 Min-Max Scaling

As discussed in the article, min-max scaling rescales the data to a range from 0 to 1 using the following formula:

\[ x_{\text{scaled}} = \frac{x - \min(x)}{\max(x) - \min(x)} \]

# Min-max helper implementing the formula above (definition assumed; not shown in the original report)
minMax <- function(x) (x - min(x)) / (max(x) - min(x))

# Defining train and test in terms of this method:
min_max_train <- as.data.frame(lapply(training_set, minMax))
min_max_test  <- as.data.frame(lapply(test_set, minMax))

6.2 Log-Transformation

Another common technique is to apply a log transformation, which reduces skewness and allows the data to follow a more normal distribution. We add a constant of 1 to every observation so that zeros do not produce an arithmetic error when taking the log (GeeksForGeeks).

\[ x_{\text{log\_transformed}} = \log(x + 1) \]
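
The log_train and log_test objects used in the next code block can be produced by applying this formula column-wise (a sketch; the original code for this step is not shown in the report):

# Apply log(x + 1) to every column of both sets
log_train <- as.data.frame(lapply(training_set, function(x) log(x + 1)))
log_test  <- as.data.frame(lapply(test_set, function(x) log(x + 1)))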

6.3 Distance Value Distribution Analysis of Normalized Data

In a similar fashion, we can observe the distance distributions of various checklists from the testing and training set:

cl <- makeCluster(ncores)
registerDoParallel(cl)

set.seed(1986)

# Number of random samples
num_samples <- 1000

# Parallel loop to calculate medians for Min-Max Transformation
min_max_medians <- foreach(i = 1:num_samples, .combine = c, .packages = "base") %dopar% {
  # Randomly decide to sample from training or test set
  if (runif(1) < 0.5) {
    # Sample from training set
    sample_col <- min_max_train[, sample(1:ncol(min_max_train), 1)]
    distances <- apply(min_max_train, 2, function(col) sum(abs(sample_col - col)))
  } else {
    # Sample from test set
    sample_col <- min_max_test[, sample(1:ncol(min_max_test), 1)]
    distances <- apply(min_max_train, 2, function(col) sum(abs(sample_col - col)))
  }
  
  # Return the median distance
  median(distances)
}

# Parallel loop to calculate medians for Log Transformation
log_medians <- foreach(i = 1:num_samples, .combine = c, .packages = "base") %dopar% {
  # Randomly decide to sample from training or test set
  if (runif(1) < 0.5) {
    # Sample from training set
    sample_col <- log_train[, sample(1:ncol(log_train), 1)]
    distances <- apply(log_train, 2, function(col) sum(abs(sample_col - col)))
  } else {
    # Sample from test set
    sample_col <- log_test[, sample(1:ncol(log_test), 1)]
    distances <- apply(log_train, 2, function(col) sum(abs(sample_col - col)))
  }
  
  # Return the median distance
  median(distances)
}

# Stop the cluster after computation
stopCluster(cl)

# Combine the results into a data frame for ggplot
medians_df <- data.frame(
  MedianDistance = c(min_max_medians, log_medians),
  Dataset = c(rep("Min-Max", length(min_max_medians)), rep("Log Transformation", length(log_medians)))
)

# Visualize the distance distributions
distance_plot<- ggplot(medians_df, aes(x = MedianDistance, fill = Dataset)) +
  geom_histogram(color = "black", bins = 20, alpha = 0.6) +  
  scale_fill_manual(values = c("Min-Max" = "#30738E", "Log Transformation" = "#58BF6C")) + 
  labs(
    title = "Distribution of Distance Medians",
    subtitle = "Log Transformation Provides a More Balanced Distribution",
    x = "Median Distance",
    y = "Frequency",
    fill = "Dataset") +
  theme_viridis_plot()

# Display the plot
distance_plot

From this plot, we can see that the median distances under min-max scaling span a much narrower range than those under the log transformation, so the two transformations may require different hyperparameter values for optimal results.

6.4 Optimizing for K With Normalized Data

We optimize for \(K\) in a similar fashion and find an optimal value of 17:

sheet_5 <- read_xlsx("Test_Journal.xlsx", sheet = 5) %>%
  mutate(Transformation = "Min-Max Scaling")

sheet_6 <- read_xlsx("Test_Journal.xlsx", sheet = 6) %>%
  mutate(Transformation = "Log Transformation")

simulations <- bind_rows(sheet_5, sheet_6)

# Plotting data for both transformations: 

ggplot(simulations, aes(x = K, y = MAE, color = Transformation, group = Transformation)) +
  scale_color_manual(values = c("Min-Max Scaling" = "#30738E", "Log Transformation" = "#58BF6C")) + 
  geom_line(size = 1) +                  
  geom_point(size = 2) +              
  labs(
    title = "K Unweighted Manhattan Distance Effects on  on MAE",
    subtitle = "Log Transformation Scored Lower MAE",
    x = "Number of Neighbors (K)",
    y = "Mean Absolute Error (MAE)",
    color = "Transformation") +
  theme_viridis_plot()

From this, we found that a log transformation performed significantly better than min-max scaling (and even non-normalized data recorded earlier in the report). Therefore, we moved forward with a log-transformation of the data.

6.5 Normalized Data Hyperparameter Testing

After normalizing, we deployed the same testing strategy using our best value of nearest neighbors, 17. We calculated over 2,000 MAEs across a wide range of combinations for exponential decay and soft zero weights, including tests for combinations with no weights. The results are shown here:

# Load the results dataframe
results_df_normalization <- readRDS("results_df_normalization.rds")  # Ensure the path is correct

results_df_normalization$Sigma <- as.numeric(as.character(results_df_normalization$Sigma))

# Create custom hover text
results_df_normalization$HoverText <- paste(
  "Scalar: ", results_df_normalization$Scalar,
  "<br>Sigma: ", results_df_normalization$Sigma,
  "<br>MAE: ", round(results_df_normalization$MAE, 4)
)

# Create a 3D scatter plot with custom hover labels
plot_ly(
  data = results_df_normalization,
  x = ~Scalar,
  y = ~Sigma,
  z = ~MAE,
  text = ~HoverText,  # Use the custom hover text
  hoverinfo = "text",  # Display only the custom text on hover
  color = ~Sigma,
  colors = viridisLite::viridis(100),
  type = "scatter3d",
  mode = "markers"
) %>%
  layout(
    title = list(
      text = "3D Plot #4: Sigma and Scalar Combination Testing with Normalized Data",
      x = 0,  # Align the title to the left
      xanchor = "left",  # Ensure alignment is set to the left
      font = list(
        size = 18,  # Title font size
        weight = "bold"  # Make the font bold
      )
    ),
    scene = list(
      xaxis = list(title = "Scalar"),
      yaxis = list(title = "Sigma"),
      zaxis = list(title = "MAE")
    ),
    annotations = list(
      list(
        x = 0.5,  # Center horizontally
        y = -0.2,  # Position below the plot area
        text = "MAE across Sigma and Scalar values calculated using an 80/20 split of the training data.",
        showarrow = FALSE,  # No arrow pointing to the caption
        xref = "paper",  # Position relative to the entire plot area
        yref = "paper",  # Position relative to the entire plot area
        font = list(
          size = 12,
          color = "gray"
        ),
        align = "center"
      )
    ),
    margin = list(b = 80)
  )

Result:

As combinations without exponential decay performed significantly worse, they were excluded. Data normalization compressed distances, which in turn compressed the ranges for both soft zero and exponential decay. However, our three-dimensional plots retained a similar trend: as the value of our exponential decay curve decreased, MAE scores improved. The scores hit an inflection point at a Sigma value of about 3, with optimal results around 1, after which MAE increased rapidly.

For soft zero weighting, the impact was less pronounced but still noticeable. MAE scores consistently decreased as the scalar value was reduced, with the lowest values occurring between 0.2 and 0.6.

Through this analysis, we identified an optimal Sigma value of 1.5 and a Scalar value of 0.25, yielding our lowest internal MAE score of 0.04068677, a 14.7% reduction from our previous minimum without normalization. Since these tests used an 80/20 split of the training data and a slightly modified algorithm, we validated the chosen values using the provided test and training sets.

Upon submission to Kaggle, the model trained on normalized data with exponential decay and soft zero weighting achieved an MAE of 0.0362, an 18.5% reduction from our previous score. This placed our results 23% below the recommended MAE threshold.

7 Future Considerations for Optimization

To further refine the KNN algorithm, one potential avenue is engineering a new feature derived from K-means clustering. Incorporating relationships between birds that are often spotted together, or absent together, could improve predictions by enhancing the weighting in the KNN algorithm during grid search.

In addition to the K-means clustering extension, our team discussed dimensionality reduction, such as principal component analysis (PCA), to prepare the data for K-means clustering. Using PCA first can help identify the components that explain the most variance for specific types of birds. After this reduction, K-means could be run with an appropriately chosen number of clusters. One must also be mindful of which normalization technique and distance metric to use within K-means: because K-means captures more global structure and handles data sparsity better than KNN, Euclidean distance and a different normalization technique may be more appropriate for improving clustering results.
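
As a rough illustration of that idea (the component and cluster counts below are placeholders, not tuned values), base R’s prcomp() and kmeans() could be chained as follows:

# Reduce the checklist dimensions with PCA, then cluster species that tend to co-occur
pca_fit    <- prcomp(training_set, center = TRUE)
pca_scores <- pca_fit$x[, 1:10]   # keep the first 10 principal components (placeholder)

set.seed(1986)
species_clusters <- kmeans(pca_scores, centers = 8)  # kmeans uses Euclidean distance

# Each species' cluster label could then be engineered into the KNN weighting
head(species_clusters$cluster)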

Following this, one can perform model stacking to combine the benefits of both methods to further refine model performance.

Although we began exploring this approach, the significant improvements achieved through our methodology led us to prioritize those efforts.