🌌 Abstract Neural Evolution

This project harnesses deep learning techniques to predict Antibody-Derived Tag (ADT) expression profiles. By analyzing RNA sequencing data, our model predicts the expression of 25 ADT proteins. It captures complex, non-linear relationships between RNA and protein expression, allowing it to generalize effectively and infer protein abundance with high accuracy. A fully connected feedforward neural network drives this approach, and its performance is compared against a traditional multiple linear regression model.

  • Goal: Achieve highly accurate ADT value predictions using advanced machine learning techniques and strategic hyperparameter tuning.

  • Evaluation: Model performance is primarily assessed using Pearson’s correlation, the key evaluation metric (a brief sketch of how it is summarized follows below).
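
For intuition, here is a minimal sketch (toy data only, not our real predictions) of how a per-protein Pearson correlation can be summarized in R; the evaluation helpers used later in this report follow the same pattern:

# Toy matrices with proteins in rows (25 x 100); values are made up for illustration
set.seed(1)
truth_demo <- matrix(rnorm(25 * 100), nrow = 25)
pred_demo  <- truth_demo + matrix(rnorm(25 * 100, sd = 0.5), nrow = 25)

# One Pearson correlation per protein, summarized by the median
per_protein_cor <- sapply(seq_len(nrow(pred_demo)), function(i) {
  cor(pred_demo[i, ], truth_demo[i, ], method = "pearson")
})
median(per_protein_cor)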


Neural Network Mechanics:

To achieve a highly accurate and robust final prediction, we implemented a neural network incorporating the components detailed in the sections below.

After rigorous testing, we selected a four-layer neural network optimized with Adam, visualized below, which achieved a 0.858 Pearson correlation.

While the deeper network achieved the best Kaggle score, our analysis found that a shallower network provides more consistent predictions. Bayesian optimization confirmed this, showing a lower loss range. This will be discussed further in the report.

Exploratory Data Analysis

library(tidyverse)
library(ggplot2)
library(plotly)
library(flextable)
library(viridis)
library(officer)
library(tibble)
library(dplyr)
library(ggridges)    # for geom_density_ridges_gradient()
library(hrbrthemes)

source("themes/plot_themes.r") # custom themes, available in our git hub repo

# Loading in the data: 
test_set_rna <- read.csv("Data/test_set_rna.csv", header=TRUE, row.names=1)
training_set_rna <- read.csv("Data/training_set_rna.csv", header=TRUE, row.names=1)
training_set_adt <- read.csv("Data/training_set_adt.csv", header=TRUE, row.names=1)

The Response Variables: ADT

training_set_adt: consists of 25 ADT proteins measured across 4,000 cells.

# Entered manually to work around table formatting issues
adt_features <- data.frame(
  Col1 = c("CD8a", "CD27", "HLA.DR", "CD28", "CD57"),
  Col2 = c("CD4", "CD38", "CD11A", "CD278-ICC", "CD79B"),
  Col3 = c("CD3", "CD45A", "CD45R0", "CD16", "CD197-CC"),
  Col4 = c("CD14", "CS127-1L7", "CD69", "CD34", "CD56"),
  Col5 = c("CD11C", "CD19", "CD123", "CD161", "CD25"))

# Create and style flextable
adt_table <- flextable(adt_features) %>%
  set_header_labels(Col1 = "", Col2 = "", Col3 = "", Col4 = "", Col5 = "") %>% 
  bold(bold = TRUE, part = "body") %>% 
  cells_flex()

adt_table

CD8a      CD4         CD3         CD14        CD11C
CD27      CD38        CD45A       CS127-1L7   CD19
HLA.DR    CD11A       CD45R0      CD69        CD123
CD28      CD278-ICC   CD16        CD34        CD161
CD57      CD79B       CD197-CC    CD56        CD25


The Training Variables: RNA

# Perform PCA
set.seed(1986)
pca_result <- prcomp(training_set_rna[, 1:256], scale. = TRUE)

# Convert PCA results into a dataframe
pca_df <- data.frame(PC1 = pca_result$x[, 1], PC2 = pca_result$x[, 2])

# Plot PCA
ggplot(pca_df, aes(PC1, PC2)) +
  geom_point(
    alpha = 0.5,
    color = "#00B4da",
    size = 2,
    shape = 16) +  
  cells_plot() +
  labs(
    title = "Dense clustering on the right suggests \n that many RNA features share common expression patterns",
    x = "PC1",
    y = "PC2",
    caption = "The scatter plot illustrates relationships between RNA expression profiles \n and the top two principal components."
  ) +
  theme(plot.caption = element_text(
    size = 10,
    hjust = 0.5,
    face = "italic",
    color = "black"))  # caption styling

pca_summary <- summary(pca_result)
# Extracting PC1, PC2, and PC3 from PCA summary
pcas <- data.frame(
  Metric = c("STD", "Prop. Variance", "Cuml. Variance"),
  PC1 = round(pca_summary$importance[1:3, 1], 2),
  PC2 = round(pca_summary$importance[1:3, 2], 2),
  PC3 = round(pca_summary$importance[1:3, 3], 2))

# Create a styled flextable
pca_table <- flextable(pcas) %>%
  set_header_labels(
    Metric = "Metric",
    PC1 = "PC1",
    PC2 = "PC2",
    PC3 = "PC3") %>%
 cells_flex()

# Print the table
pca_table

Metric            PC1      PC2      PC3
STD               12.80    3.43     1.74
Prop. Variance    0.64     0.05     0.01
Cuml. Variance    0.64     0.69     0.70

  • training_set_rna: 4,000 cells × 639 gene-expression features (the raw file stores features as rows and cells as columns, hence the transposition in the regression section below).

  • test_set_rna: 1,000 cells × 639 gene-expression features only, with no corresponding protein expression.

Which of the 256 RNA features share similarities for predicting ADT expression? Each RNA feature has 4,000 dimensions, making it nearly impossible to visualize directly in a meaningful way. Interpretation of the principal components is therefore based on identifying which variables are most strongly correlated with each component.

The summary(pca_result) revealed that the first principal component (PC1) has a standard deviation of 12.7993, much larger than the others. It explains ~63.99% of the total variance in the dataset, whereas the next component, PC2, has a standard deviation of 3.43 and explains only about 5% of the variance, contributing far less to the overall structure.

Interpretations and Application: We could potentially reduce the features by selecting only the most influential principal components, improving computational efficiency while preserving most of the variance in the data. Feature reduction could also improve our neural network’s performance by reducing noise and overfitting; however, given the results achieved through network tuning, we did not deem this step necessary.
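
For illustration only (we did not use this reduction in our final pipeline), a minimal sketch of selecting the leading principal components by a cumulative-variance cutoff; the 90% threshold is an arbitrary example:

# Number of components needed to reach ~90% cumulative variance (illustrative threshold)
cum_var <- summary(pca_result)$importance["Cumulative Proportion", ]
k <- which(cum_var >= 0.90)[1]

# Scores on the first k components could then stand in for the original features
reduced_scores <- pca_result$x[, 1:k, drop = FALSE]
dim(reduced_scores)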



Traditional Statistical Methods Using Regression

The initial approach to predicting ADT protein expression from RNA data involved implementing a multiple linear regression model using matrix algebra.

In matrix form, the linear regression equation can be expressed as:

\[ \mathbf{Y} = \mathbf{X} \mathbf{B} + \mathbf{\epsilon} \]

Where:

  • Y = The response matrix (ADT protein expression), representing the observed protein levels.

  • X = The design matrix (RNA expression features), containing gene expression values for each cell.

  • \(\varepsilon\) = The error term, capturing the unexplained variance and noise in the model.

Fitting the model amounts to estimating \(\hat{\mathbf{B}}\) so that: \[ \mathbf{ADT_{train}} \approx \mathbf{RNA_{train}} \times \hat{\mathbf{B}} \]

# Transpose so that rows are samples and columns are genes/proteins:

train_rna <- t(as.matrix(training_set_rna))  # (4000 × 639)
train_adt <- t(as.matrix(training_set_adt))  # (4000 × 25)
test_rna <- t(as.matrix(test_set_rna))  # (1000 × 639)

The ordinary least squares equation allows us to solve for \(\hat{\mathbf{B}}\), which represents the regression coefficients we need:

\[ \hat{\mathbf{B}} = (\mathbf{RNA_{train}}^T \mathbf{RNA_{train}})^{-1} \mathbf{RNA_{train}}^T \mathbf{ADT_{train}} \]

In the function below, a corresponds to the inverse term \((\mathbf{RNA_{train}}^T \mathbf{RNA_{train}})^{-1}\) shown above, while b represents the transformed response term \(\mathbf{RNA_{train}}^T \mathbf{ADT_{train}}\).

Once we have \(\hat{\mathbf{B}}\), we can compute the predicted ADT values using:

\[ \mathbf{ADT_{test}} = \mathbf{RNA_{test}} \times \hat{\mathbf{B}} \]

Baseline MLR Algorithm:

predict_adt <- function(train_rna, train_adt, test_rna) {
  # Compute the inverse term
  a <- solve(t(train_rna) %*% train_rna)
  b <- t(train_rna) %*% train_adt
  
  # Compute the coefficient estimates
  b_hat <- a %*% b
  
  # Compute the predicted ADT values for the test set
  predicted_values <- test_rna %*% b_hat
  
  # Compute training set predictions
  predicted_train <- train_rna %*% b_hat
  
  # Compute residuals
  residuals_matrix <- train_adt - predicted_train
  
  # Compute R-squared values
  y_mean <- colMeans(train_adt)
  ss_tot <- colSums((train_adt - y_mean)^2)  # Total Sum of Squares
  ss_res <- colSums(residuals_matrix^2)      # Residual Sum of Squares
  r_squared <- 1 - (ss_res / ss_tot)         # Compute R^2
  
  # Compute F-test p-values
  n <- nrow(train_adt)
  p <- ncol(train_rna)
  f_statistic <- ((ss_tot - ss_res) / p) / (ss_res / (n - p - 1))
  p_values_f <- pf(f_statistic, df1 = p, df2 = n - p - 1, lower.tail = FALSE)
  
  # Compute Mean Squared Error (MSE)
  mse_values <- colMeans((train_adt - predicted_train)^2)
  
  
  # Convert to named vectors
  names(r_squared) <- colnames(train_adt)
  names(p_values_f) <- colnames(train_adt)
  names(mse_values) <- colnames(train_adt)

  # Return predictions along with statistics
  return(list(
    predictions = as.data.frame(predicted_values),
    residuals = residuals_matrix,
    r_squared = r_squared,
    mse_values = mse_values,
    f_test_p_values = p_values_f,
    coefficients = b_hat
  ))
}

predictions <- predict_adt(train_rna, train_adt, test_rna)
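
For reference, the same coefficients can be computed without forming an explicit inverse, which is generally more numerically stable; this sketch is equivalent to the function above under the same assumptions:

# Equivalent to solve(t(X) %*% X) %*% t(X) %*% Y, but lets solve() handle the linear system directly
b_hat_stable <- solve(crossprod(train_rna), crossprod(train_rna, train_adt))
predicted_values_stable <- test_rna %*% b_hat_stable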

Key Findings:

The baseline multiple linear regression model provided a strong prediction of ADT values, achieving a 0.8022 Pearson Correlation Coefficient. Our group also tested R’s lm() function as a baseline estimate, which yielded a 0.80225 Pearson Correlation Coefficient.

However, the difference between the two baselines was negligible (0.80225 vs. 0.8022), so we opted to proceed with the matrix-based implementation. This approach serves as the foundation for subsequent analyses involving deep learning.


Gradient Descent

Function-Oriented Approach

We initially built our first neural network using a function-oriented paradigm, which served as the foundation for our understanding of the complex inner workings of neural networks. We approached the project incrementally—acquiring learning resources, focusing on one aspect at a time, and then implementing it in code. Our objective was to replicate the functionality of frameworks like TensorFlow or Keras by starting with hard-coded neural networks and progressively enhancing modularity to increase our model’s flexibility for further hyperparameter tuning. We implemented:

  • Stopping Criteria

  • Layer Independent Dropout

  • Dynamic Dropout

  • L2 Regularization

  • Layer Independent Activation Functions

  • Mini-Batching

  • ADAM Optimization

  • Gradient Clipping

  • Graphical Plotting for Training

While some of the implementations fell short and would have required more time and research to fully resolve, we were able to push our Kaggle score above 0.84 by tuning a handful of options, including batch size, L2 regularization, dropout rate, and learning rate.

(We are hiding the forward and backward propagation implementations to align with course requirements.)

library(reshape2)

# Activation functions:
swish <- function(x) {
  x * sigmoid(x)
}
relu <- function(x) {
  x[x < 0] <- 0
  return(x)
}
softplus <- function(x) {
  log(1 + exp(x))
}
sigmoid <- function(x) {
  1 / (1 + exp(-x))
}

# Their derivatives:
sigmoid_derivative <- function(x) {
  s <- sigmoid(x)
  s * (1 - s)
}
swish_derivative <- function(x) {
  s <- sigmoid(x)
  s + x * s * (1 - s)
}
relu_derivative <- function(x) {
  ifelse(x > 0, 1, 0)
}
softplus_derivative <- function(x) {
  sigmoid(x)  # because d/dx softplus = sigmoid(x)
}
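
As a quick sanity check (illustrative only, not part of the training pipeline), the analytic derivatives above can be compared against a central finite-difference approximation:

# Central finite-difference check of the swish derivative at a few points
x_check <- c(-2, -0.5, 0, 1, 3)
h <- 1e-5
numeric_grad <- (swish(x_check + h) - swish(x_check - h)) / (2 * h)
max(abs(numeric_grad - swish_derivative(x_check)))  # should be very small (~1e-8 or less)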

#' Initialize Model Parameters
#'
#' Initializes the weights and biases for a neural network with a given 
#' layer structure. Weights are initialized using He initialization.
#'
#' @param num_layers A numeric vector specifying the number of units in each 
#'   layer, including the input and output layers.
#'
#' @return A list containing the initialized weight matrices and bias vectors 
#'   for each layer in the network.
#'
#' @examples
#' # Initialize a network with input size 3, two hidden layers, and output 1
#' parameters <- initialize_parameters(c(3, 5, 4, 1))
#'
initialize_parameters <- function(num_layers) {
  set.seed(1986)
  parameters <- list()
  num_layers_count <- length(num_layers)
  
  for (layer in 1:(num_layers_count - 1)) {
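    # He-style scaling: Uniform(-1, 1) draws scaled by sqrt(2 / fan_in)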
    parameters[[paste0("weights", layer)]] <- matrix(
      runif(num_layers[layer + 1] * num_layers[layer], -1, 1) * sqrt(2 / num_layers[layer]),
      nrow = num_layers[layer + 1],
      ncol = num_layers[layer]
    )
    # Bias for each neuron; one bias per unit (column vector)
    parameters[[paste0("bias", layer)]] <- rep(0, num_layers[layer + 1])
  }
  return(parameters)
}
#' Mean Squared Error (MSE) Loss with L2 Regularization
#'
#' Computes the Mean Squared Error (MSE) loss between the target outputs 
#' and predictions. Optionally includes L2 regularization on the weights.
#'
#' @param target_output A numeric matrix containing the true target values.
#' @param predictions A numeric matrix containing the predicted values.
#' @param parameters A list of model parameters, including weight matrices 
#'   used for regularization.
#' @param lambda A numeric value (default: 0) specifying the L2 
#'   regularization strength. If 0, no regularization is applied.
#'
#' @return A numeric value representing the computed MSE loss, optionally 
#'   including the L2 regularization term.
#'
#' @examples
#' target_output <- matrix(runif(10), ncol = 1)
#' predictions <- matrix(runif(10), ncol = 1)
#' parameters <- list(weights1 = matrix(runif(10), ncol = 2))
#'
#' # Compute MSE loss without regularization
#' loss <- mse_loss(target_output, predictions, parameters)
#'
#' # Compute MSE loss with L2 regularization (lambda = 0.1)
#' loss_reg <- mse_loss(target_output, predictions, parameters, lambda = 0.1)
mse_loss <- function(target_output,
                     predictions,
                     parameters,
                     lambda = 0) {
  mse <- mean((target_output - predictions)^2)
  if (lambda > 0) {
    num_layers_count <- length(parameters) / 2
    reg_term <- 0
    for (layer in 1:num_layers_count) {
      reg_term <- reg_term + sum(parameters[[paste0("weights", layer)]]^2)
    }
    mse <- mse + (lambda / (2 * ncol(target_output))) * reg_term
  }
  return(mse)
}

#' Compute R-Squared (R²) Score
#'
#' Calculates the coefficient of determination (R²) to measure the proportion 
#' of variance in the ground truth values explained by the predictions.
#'
#' @param predictions A numeric vector containing predicted values.
#' @param ground_truth A numeric vector containing actual target values.
#'
#' @return A numeric value representing the R² score, where:
#'   - `1` indicates perfect predictions,
#'   - `0` indicates that predictions are as good as the mean of ground truth,
#'   - Negative values indicate that the model performs worse than a constant 
#'     baseline.
compute_r2 <- function(predictions, ground_truth) {
  # Flatten and aggregate across features
  predictions <- as.vector(predictions)
  ground_truth <- as.vector(ground_truth)
  
  ss_res <- sum((ground_truth - predictions)^2)
  ss_tot <- sum((ground_truth - mean(ground_truth))^2)
  r2 <- 1 - (ss_res / ss_tot)
  return(r2)
}

#' Compute Mean Absolute Error (MAE)
#'
#' Calculates the Mean Absolute Error (MAE) between predicted values and 
#' ground truth values. MAE represents the average absolute difference 
#' between predictions and actual values, providing an intuitive measure 
#' of prediction error.
#'
#' @param predictions A numeric vector containing predicted values.
#' @param ground_truth A numeric vector containing actual target values.
#'
#' @return A numeric value representing the MAE, where lower values indicate 
#'   better model performance.
compute_mae <- function(predictions, ground_truth) {
  predictions <- as.vector(predictions)
  ground_truth <- as.vector(ground_truth)
  
  mae <- mean(abs(predictions - ground_truth))
  return(mae)
}

#' Compute Root Mean Squared Error (RMSE)
#'
#' Calculates the Root Mean Squared Error (RMSE) between predicted values 
#' and ground truth values. RMSE provides a measure of error magnitude, 
#' with higher penalties for large errors due to squaring.
#'
#' @param predictions A numeric vector containing predicted values.
#' @param ground_truth A numeric vector containing actual target values.
#'
#' @return A numeric value representing the RMSE, where lower values indicate 
#'   better model performance.
compute_rmse <- function(predictions, ground_truth) {
  predictions <- as.vector(predictions)
  ground_truth <- as.vector(ground_truth)
  
  rmse <- sqrt(mean((predictions - ground_truth)^2))
  return(rmse)
}

#' Compute Explained Variance Score (EVS)
#'
#' Calculates the explained variance score (EVS) between predicted values 
#' and ground truth values. EVS measures how much of the variance in the 
#' target variable is explained by the predictions. A score close to 1 
#' indicates that most of the variance is captured, while a lower score 
#' suggests poor performance.
#'
#' @param predictions A numeric vector containing predicted values.
#' @param ground_truth A numeric vector containing actual target values.
#'
#' @return A numeric value representing the explained variance score, 
#'   where higher values indicate better model performance.
compute_explained_variance <- function(predictions, ground_truth) {
  predictions <- as.vector(predictions)
  ground_truth <- as.vector(ground_truth)
  
  variance_residuals <- var(predictions - ground_truth)
  variance_target <- var(ground_truth)
  evs <- 1 - (variance_residuals / variance_target)
  return(evs)
}

#' Compute Cosine Similarity
#'
#' Computes the cosine similarity between predicted values and ground 
#' truth values. Cosine similarity measures the cosine of the angle 
#' between two nonzero vectors, indicating how similar they are. 
#' A value close to 1 suggests high similarity, while a value near 0 
#' indicates dissimilarity.
#'
#' @param predictions A numeric vector containing predicted values.
#' @param ground_truth A numeric vector containing actual target values.
#'
#' @return A numeric value representing the cosine similarity, ranging 
#'   from -1 (opposite direction) to 1 (identical direction).
compute_cosine_similarity <- function(predictions, ground_truth) {
  predictions <- as.vector(predictions)
  ground_truth <- as.vector(ground_truth)
  
  dot_product <- sum(predictions * ground_truth)
  magnitude_pred <- sqrt(sum(predictions^2))
  magnitude_true <- sqrt(sum(ground_truth^2))
  cosine_sim <- dot_product / (magnitude_pred * magnitude_true)
  return(cosine_sim)
}
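
For illustration, a quick toy example (made-up vectors) showing how these metric helpers are called:

# Hypothetical predicted and observed values
truth_demo <- c(1.0, 2.0, 3.0, 4.0)
pred_demo  <- c(1.1, 1.9, 3.2, 3.8)

compute_r2(pred_demo, truth_demo)
compute_mae(pred_demo, truth_demo)
compute_rmse(pred_demo, truth_demo)
compute_explained_variance(pred_demo, truth_demo)
compute_cosine_similarity(pred_demo, truth_demo)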
#' Update Model Parameters
#'
#' Updates the weights and biases of a neural network using the 
#' computed gradients and a specified learning rate.
#'
#' @param parameters A list containing the model's weight and bias 
#'   parameters.
#' @param gradients A list containing the computed gradients for 
#'   weights and biases.
#' @param learning_rate Numeric; the step size for updating 
#'   parameters.
#'
#' @return A list of updated weight and bias parameters.
update_parameters <- function(parameters, gradients, learning_rate) {
  num_layers_count <- length(parameters) / 2  # Number of layers
  
  for (layer in 1:num_layers_count) {
    parameters[[paste0("weights", layer)]] <- parameters[[paste0("weights", layer)]] - learning_rate * gradients[[paste0("weights_gradient", layer)]]
    parameters[[paste0("bias", layer)]] <- parameters[[paste0("bias", layer)]] - learning_rate * matrix(gradients[[paste0("bias_gradient", layer)]], ncol = 1)
  }
  return(parameters)
}

#' Adam Optimizer Update
#'
#' Updates model parameters using the Adam optimization algorithm.
#'
#' @param parameters List of current model parameters (weights and biases).
#' @param gradients List of gradients computed via backpropagation.
#' @param adam_state List containing the current Adam state (first and second moment estimates).
#' @param t Current time step (epoch or iteration count).
#' @param learning_rate Learning rate for the update.
#' @param beta1 Exponential decay rate for the first moment estimates.
#' @param beta2 Exponential decay rate for the second moment estimates.
#' @param epsilon Small constant to prevent division by zero.
#'
#' @return A list with updated parameters and the new Adam state.
update_parameters_adam <- function(parameters, gradients, adam_state, t, learning_rate,
                                     beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8) {
  num_layers <- length(parameters) / 2
  
  for (layer in 1:num_layers) {
    # Update biased first moment estimate for weights and biases
    adam_state[[paste0("vW", layer)]] <- beta1 * adam_state[[paste0("vW", layer)]] +
      (1 - beta1) * gradients[[paste0("weights_gradient", layer)]]
    adam_state[[paste0("vb", layer)]] <- beta1 * adam_state[[paste0("vb", layer)]] +
      (1 - beta1) * gradients[[paste0("bias_gradient", layer)]]
    
    # Update biased second raw moment estimate for weights and biases
    adam_state[[paste0("sW", layer)]] <- beta2 * adam_state[[paste0("sW", layer)]] +
      (1 - beta2) * (gradients[[paste0("weights_gradient", layer)]]^2)
    adam_state[[paste0("sb", layer)]] <- beta2 * adam_state[[paste0("sb", layer)]] +
      (1 - beta2) * (gradients[[paste0("bias_gradient", layer)]]^2)
    
    # Compute bias-corrected first moment estimate
    vW_corrected <- adam_state[[paste0("vW", layer)]] / (1 - beta1^t)
    vb_corrected <- adam_state[[paste0("vb", layer)]] / (1 - beta1^t)
    
    # Compute bias-corrected second raw moment estimate
    sW_corrected <- adam_state[[paste0("sW", layer)]] / (1 - beta2^t)
    sb_corrected <- adam_state[[paste0("sb", layer)]] / (1 - beta2^t)
    
    # Update parameters for weights and biases
    parameters[[paste0("weights", layer)]] <- parameters[[paste0("weights", layer)]] -
      learning_rate * vW_corrected / (sqrt(sW_corrected) + epsilon)
    parameters[[paste0("bias", layer)]] <- parameters[[paste0("bias", layer)]] -
      learning_rate * vb_corrected / (sqrt(sb_corrected) + epsilon)
  }
  
  return(list("parameters" = parameters, "adam_state" = adam_state))
}
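
For reference, the updates implemented above follow the standard Adam equations, where \(g_t\) is the gradient at step \(t\), \(v_t\) and \(s_t\) are the first- and second-moment estimates (vW/vb and sW/sb in the code), and \(\eta\) is the learning rate:

\[ \begin{aligned} v_t &= \beta_1 v_{t-1} + (1 - \beta_1)\, g_t, \qquad s_t = \beta_2 s_{t-1} + (1 - \beta_2)\, g_t^2, \\ \hat{v}_t &= \frac{v_t}{1 - \beta_1^{t}}, \qquad \hat{s}_t = \frac{s_t}{1 - \beta_2^{t}}, \qquad \theta_t = \theta_{t-1} - \eta\, \frac{\hat{v}_t}{\sqrt{\hat{s}_t} + \epsilon} \end{aligned} \]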
#' Train a Neural Network Model
#'
#' Trains a neural network using forward and backward propagation, with 
#' optional regularization, dropout, gradient clipping, and early stopping.
#'
#' @param input_data Matrix; training input data (features x samples).
#' @param target_output Matrix; expected output values (targets x samples).
#' @param num_layers Vector; specifies the number of neurons in each layer.
#' @param learning_rate Numeric; step size for parameter updates.
#' @param epochs Integer; total number of training iterations.
#' @param lambda Numeric; L2 regularization strength (default: NULL).
#' @param alpha Numeric; weight for relative loss improvement in composite 
#'   stopping metric.
#' @param beta Numeric; weight for normalized gradient norm in composite 
#'   stopping metric.
#' @param dropout Logical; whether to apply dropout.
#' @param dropout_rate Numeric; probability of dropout (default: NULL).
#' @param dynamic_dropout Logical; enable dynamic dropout rate adjustment.
#' @param patience Integer; number of epochs to wait before early stopping.
#' @param tolerance Numeric; threshold for stopping based on loss improvement.
#' @param grad_tolerance Numeric; threshold for stopping based on gradient norm.
#' @param comp_threshold Numeric; threshold for stopping based on composite 
#'   metric.
#' @param clip_gradients Logical; whether to apply gradient clipping.
#' @param simple_stop Logical; whether to use simple early stopping.
#' @param adam Logical; whether to use Adam optimization.
#' @param activation_functions List; activation functions per layer (default: 
#'   NULL).
#' @param activation_derivatives List; activation function derivatives per 
#'   layer (default: NULL).
#'
#' @return A list containing the trained parameters, loss history, gradient 
#'   norm history, and composite metric history.
train_model <- function(input_data, target_output, num_layers, learning_rate, epochs, 
                        lambda = NULL,    # L2 regularization strength
                        alpha = NULL,      # Weight for relative loss improvement in composite metric
                        beta = NULL,       # Weight for normalized gradient norm in composite metric
                        dropout = NULL,
                        dropout_rate = NULL,
                        dynamic_dropout = NULL,
                        patience = 10, 
                        tolerance = 1e-5,
                        grad_tolerance = 1e-6, 
                        comp_threshold = 0.5, 
                        clip_gradients = NULL,
                        simple_stop = NULL,
                        adam = NULL,
                        activation_functions = NULL, # list of functions for each layer
                        activation_derivatives = NULL  # corresponding list of derivative functions
                        ) {
  
  parameters <- initialize_parameters(num_layers)
  cost_history <- c()
  grad_norm_history <- c()
  composite_history <- c()
  patience_counter <- 0
  batch_size <- 50  
  num_batches <- ceiling(ncol(input_data) / batch_size)
  best_loss <- Inf
  
  # Initialize adam_state once before the training loop:
  adam_state <- list()
  num_layers_count <- length(num_layers) - 1
  for (layer in 1:num_layers_count) {
    adam_state[[paste0("vW", layer)]] <- matrix(0, nrow = num_layers[layer + 1], ncol = num_layers[layer])
    adam_state[[paste0("vb", layer)]] <- matrix(0, nrow = num_layers[layer + 1], ncol = 1)
    adam_state[[paste0("sW", layer)]] <- matrix(0, nrow = num_layers[layer + 1], ncol = num_layers[layer])
    adam_state[[paste0("sb", layer)]] <- matrix(0, nrow = num_layers[layer + 1], ncol = 1)
  }

  
  # For composite metric normalization
  init_grad_norm <- NULL
  prev_cost <- Inf
  
  cat("\n========================================\n")
  cat(" Neural Network Training Started\n")
  cat("========================================\n")
  cat("Model Configuration:\n")
  cat(" - Layers:", length(num_layers) - 1, "\n")
  cat(" - Neurons:", paste(num_layers, collapse=" ➝ "), "\n")
  cat(" - Learning Rate:", learning_rate, "\n")
  cat(" - Epochs:", epochs, "\n")
  cat(" - Batch Size:", batch_size, "\n")
  cat(" - Regularization Lambda:", lambda, "\n")
  if (dropout) {
      if (dynamic_dropout) {
        cat(" - Dropout Rate: Dynamic\n")
      } else {
      cat(" - Dropout Rate:", dropout_rate, "\n")
      }
  }
  if (!simple_stop){
    cat(" - Alpha (loss improvement weight):", alpha, "\n")
    cat(" - Beta (gradient norm weight):", beta, "\n")
  }
  cat("----------------------------------------\n")
  
  start_time <- Sys.time()
  
  # cat("Alpha: ", alpha, "| Beta: ", beta, "\n")
  
  for (epoch in 1:epochs) {
    for (batch in 1:num_batches) {
      start_idx <- (batch - 1) * batch_size + 1
      end_idx <- min(batch * batch_size, ncol(input_data))
      batch_x <- input_data[, start_idx:end_idx, drop = FALSE]
      batch_y <- target_output[, start_idx:end_idx, drop = FALSE]
      
      if (dropout) {
        if (dynamic_dropout) {
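          # Dynamic dropout: rate ramps from ~0 toward a 0.2 ceiling as training progresses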
          rate <- .2 * (1 - exp(-0.18 * epoch))
          dropout_rate <- min(0.2, rate)
        } else {
          # Fixed dropout rate
          dropout_rate <- dropout_rate
        }
      } else {
        # No dropout
        dropout_rate <- 0
      }
      
      # Use the specified dropout_rate during training
      forward_result <- forward_propagation(batch_x,
                                            parameters,
                                            num_layers,
                                            dropout_rate,
                                            activation_functions)
      cost <- mse_loss(batch_y, forward_result$A_final, parameters, lambda)
      
      gradients <- backward_propagation(
        batch_x,
        batch_y,
        forward_result$activations,
        parameters,
        num_layers,
        lambda = lambda,
        clip_gradients,
        activation_derivatives = activation_derivatives
      )
      
      if (adam) {
        current_iter <- epoch * num_batches + batch
        update_result <- update_parameters_adam(parameters,
                                                gradients,
                                                adam_state,
                                                t = current_iter,
                                                learning_rate = learning_rate)
        parameters <- update_result$parameters
        adam_state <- update_result$adam_state
      } else {
        parameters <- update_parameters(parameters, gradients, learning_rate)
      }
    }
    
    cost_history <- c(cost_history, cost)
    
    # Compute gradient norm (sum of squared gradients)
    grad_norm <- sum(sapply(gradients, function(g) sum(g^2)))
    grad_norm_history <- c(grad_norm_history, grad_norm)
    
    if (is.null(init_grad_norm)) {
      init_grad_norm <- grad_norm
      # cat("Initial Gradient Norm:", init_grad_norm, "\n")
    }
    
    # Calculate relative loss improvement
    if (prev_cost == Inf) {
      rel_loss_improve <- 1
    } else {
      rel_loss_improve <- (prev_cost - cost) / prev_cost
    }
    
    # Composite metric: lower values indicate convergence
    composite_score <- alpha * rel_loss_improve + beta * (grad_norm / init_grad_norm)
    
    # cat(sprintf("Epoch [%4d]: Loss Improve: %.6f | Grad Norm Ratio: %.6f | Composite: %.6f\n",
    #         epoch, rel_loss_improve, grad_norm / init_grad_norm, composite_score))
    
    composite_history <- c(composite_history, composite_score)
    prev_cost <- cost
    
    
    if (epoch %% 10 == 0 || epoch == epochs) {
      cat(
        sprintf(
          "Epoch [%4d/%4d] | Loss: %8.6f | Grad Norm: %10.6f | Composite: %8.6f | Patience: %2d\n",
          epoch,
          epochs,
          cost,
          grad_norm,
          composite_score,
          patience_counter
        )
      )
    }
    
    # Stopping Criteria Handler
    if (!simple_stop) {
      # 🛠️ Hybrid Stopping Criteria
      if (epoch > 100) {
        # Calculate recent loss improvement
        recent_losses <- tail(cost_history, 20)
        avg_loss_change <- mean(diff(recent_losses))
        
        # Calculate gradient norm variability
        recent_grads <- tail(grad_norm_history, 20)
        grad_variability <- sd(recent_grads)
        
        # Check stopping conditions
        if (abs(avg_loss_change) < 0.001 &&
            grad_variability < 0.05) {
          cat("\n🔴 Early Stopping Triggered: Loss improvement minimal & gradients stable.\n")
          break
        }
        
      }
      
      # Composite Score-Based Stopping
      if (composite_score < comp_threshold) {
        patience_counter <- patience_counter + 1
        if (patience_counter >= patience) {
          cat(
            "\n🛑 Early Stopping Triggered: Composite metric below threshold for",
            patience,
            "epochs.\n"
          )
          cat("Final Loss:", cost, "\n")
          break
        }
      } else {
        patience_counter <- 0
      }
      
    } else {
      # 🧪 Simple Stopping Criteria
      if (best_loss - cost < tolerance) {
        patience_counter <- patience_counter + 1
        if (patience_counter >= patience) {
          cat("\nEarly Stopping Triggered: No improvement for",
              patience,
              "epochs.\n")
          cat("Training stopped at epoch",
              epoch,
              "with final loss:",
              cost,
              "\n")
          break
        }
      } else {
        best_loss <- cost
        patience_counter <- 0
      }
      
      # Check Gradient Norm for Convergence
      if (grad_norm < grad_tolerance) {
        cat("\n✅ Convergence Reached: Gradient norm below threshold.\n")
        cat("Final Loss:", cost, "\n")
        break
      }
    }
  }
  
  end_time <- Sys.time()
  execution_time <- end_time - start_time
  
  cat("\n========================================\n")
  cat(" Training Completed\n")
  cat("========================================\n")
  cat("Final Loss:", cost, "\n")
  cat("Execution Time:", execution_time, "seconds\n")
  cat("----------------------------------------\n")
  
  return(list("parameters" = parameters, 
              "cost_history" = cost_history, 
              "grad_norm_history" = grad_norm_history,
              "composite_history" = composite_history))
}
#' Validate Model Performance
#'
#' Evaluates the model's performance on a validation dataset by performing 
#' forward propagation and computing the mean squared error (MSE) loss.
#'
#' @param val_x Matrix of input features for validation.
#' @param val_y Matrix of target values for validation.
#' @param parameters List of trained model parameters.
#' @param num_layers Vector specifying the number of neurons in each layer.
#' @param verbose Logical; if TRUE, prints validation details.
#'
#' @return A list containing:
#'   \item{val_loss}{The computed validation loss (MSE).}
#'   \item{execution_time}{The total execution time for validation.}
validate_model <- function(val_x,
                           val_y,
                           parameters,
                           num_layers,
                           verbose = TRUE) {
  start_time <- Sys.time()
  forward_result <- forward_propagation(val_x, parameters, num_layers, dropout_rate = 0)
  val_loss <- mse_loss(val_y, forward_result$A_final, parameters)
  end_time <- Sys.time()
  execution_time <- end_time - start_time


  cat("\n========================================\n")
  cat(" Validation Evaluation Completed\n")
  cat("========================================\n")
  cat("Validation Configuration:\n")
  cat(" - Number of Samples:", ncol(val_x), "\n")
  cat(" - Layers:", length(num_layers) - 1, "\n")
  cat(" - Neurons:", paste(num_layers, collapse=" ➝ "), "\n")
  cat("Final Validation Loss:", val_loss, "\n")
  cat("Execution Time:", execution_time, "seconds\n")
  cat("----------------------------------------\n")
  
  return(list("val_loss" = val_loss, "execution_time" = execution_time))
}

#' Generate Model Predictions
#'
#' Performs forward propagation on test data using the trained model parameters 
#' and returns the predicted outputs.
#'
#' @param test_data Matrix of input features for prediction.
#' @param parameters List of trained model parameters.
#' @param num_layers Vector specifying the number of neurons in each layer.
#'
#' @return A matrix containing the predicted outputs.
make_predictions <- function(test_data, parameters, num_layers) {
  forward_result <- forward_propagation(test_data, parameters, num_layers)
  return(forward_result$A_final)
}

#' Compute Pearson Correlation
#'
#' Calculates the Pearson correlation coefficient between predicted and actual 
#' values for each row and returns the median correlation across all rows.
#'
#' @param predictions Matrix of predicted values.
#' @param ground_truth Matrix of actual target values.
#'
#' @return The median Pearson correlation coefficient across all rows.
compute_pearson <- function(predictions, ground_truth) {
  cor_results <- numeric(nrow(predictions))
  for (i in 1:nrow(predictions)) {
    cor_results[i] <- cor(predictions[i, ], ground_truth[i, ], method = "pearson")
  }
  return(median(cor_results, na.rm = TRUE))
}

#' Validate Model with Pearson Correlation
#'
#' Evaluates the model on the validation set using forward propagation and 
#' computes the Pearson correlation between predictions and actual values.
#'
#' @param val_x Matrix of validation input features.
#' @param val_y Matrix of actual validation target values.
#' @param parameters List of trained model parameters.
#' @param num_layers Vector specifying the number of neurons in each layer.
#'
#' @return The median Pearson correlation coefficient between predictions and 
#' actual values.
validate_with_pearson <- function(val_x, val_y, parameters, num_layers) {
  forward_result <- forward_propagation(val_x, parameters, num_layers)
  predictions <- forward_result$A_final
  pearson_corr <- compute_pearson(predictions, val_y)
  
  cat("\n========================================\n")
  cat(" Pearson Correlation on Validation Set \n")
  cat("========================================\n")
  cat(sprintf("Median Pearson Correlation: %8.6f
", pearson_corr))
  cat("========================================\n")
  
  return(pearson_corr)
}

#' Validate Model with Multiple Metrics
#'
#' Evaluates the model's performance on the validation set by computing 
#' multiple statistical metrics, including R-squared, MAE, RMSE, 
#' explained variance, and cosine similarity.
#'
#' @param val_x Matrix of validation input features.
#' @param val_y Matrix of actual validation target values.
#' @param parameters List of trained model parameters.
#' @param num_layers Vector specifying the number of neurons in each layer.
#'
#' @return A list containing the following evaluation metrics:
#'   \item{r2}{R-squared (coefficient of determination).}
#'   \item{mae}{Mean Absolute Error.}
#'   \item{rmse}{Root Mean Squared Error.}
#'   \item{evs}{Explained Variance Score.}
#'   \item{cosine}{Cosine Similarity between predicted and true values.}
validate_with_metrics <- function(val_x, val_y, parameters, num_layers) {
  forward_result <- forward_propagation(val_x, parameters, num_layers)
  predictions <- forward_result$A_final
  
  # Calculate multiple metrics
  r2 <- compute_r2(predictions, val_y)
  mae <- compute_mae(predictions, val_y)
  rmse <- compute_rmse(predictions, val_y)
  evs <- compute_explained_variance(predictions, val_y)
  cosine_sim <- compute_cosine_similarity(as.vector(predictions), as.vector(val_y))
  
  cat("\n========================================\n")
  cat(" 📊 Model Performance Metrics\n")
  cat("========================================\n")
  cat(sprintf("R-Squared (R²):         %.6f   # Proportion of variance explained by the model.\n", r2))
  cat(sprintf("Mean Absolute Error:    %.6f   # Average absolute error across all predictions.\n", mae))
  cat(sprintf("Root Mean Squared Error:%.6f   # Penalizes large errors more than MAE.\n", rmse))
  cat(sprintf("Explained Variance:     %.6f   # Measures the proportion of variance captured by the model.\n", evs))
  cat(sprintf("Cosine Similarity:      %.6f   # Measures similarity between prediction & truth vectors.\n", cosine_sim))
  cat("========================================\n")
  
  return(list(
    r2 = r2,
    mae = mae,
    rmse = rmse,
    evs = evs,
    cosine = cosine_sim
  ))
}

# Load Data
rna_training_set <- as.matrix(read.csv("../../Data/training_set_rna.csv", header = TRUE, row.names = 1))
adt_training_set <- as.matrix(read.csv("../../Data/training_set_adt.csv", header = TRUE, row.names = 1))
rna_test_set  <- as.matrix(read.csv("../../Data/test_set_rna.csv", header = TRUE, row.names = 1))

# 80/20 Split

set.seed(1986)

train_indices <- sample(1:ncol(rna_training_set), size = 0.8 * ncol(rna_training_set))
test_indices <- setdiff(1:ncol(rna_training_set), train_indices)
rna_train <- rna_training_set[, train_indices, drop = FALSE]
adt_train <- adt_training_set[, train_indices, drop = FALSE]
rna_test  <- rna_training_set[, test_indices, drop = FALSE]
adt_test  <- adt_training_set[, test_indices, drop = FALSE]
num_layers <- c(nrow(rna_train), 256, 64, 25)
activation_functions <- list(relu, relu, relu)  
activation_derivatives <- list(
  relu_derivative, 
  relu_derivative, 
  relu_derivative)  

# Train the neural network with train_model() (documented above)
model <- train_model(rna_train, adt_train, num_layers, learning_rate = 0.01,
                     epochs = 5000,
                     lambda = 0,
                     alpha = .6,
                     beta = 0.4,
                     dropout = FALSE,
                     dropout_rate = 0.04,
                     dynamic_dropout = FALSE,
                     clip_gradients = FALSE,
                     simple_stop = TRUE,
                     adam = FALSE,
                     activation_functions = activation_functions,
                     activation_derivatives = activation_derivatives
                     )
# Validate using the test set
validation_results <- validate_model(rna_test, adt_test, model$parameters, num_layers)
pearson_corr <- validate_with_pearson(rna_test, adt_test, model$parameters, num_layers)
loss_metrics <- validate_with_metrics(rna_test, adt_test, model$parameters, num_layers)

#Make predictions on the test set
predictions <- make_predictions(rna_test_set, model$parameters, num_layers)

# Convert wide format (1000 × 25) to long format (25,000 × 2)
submission <- melt(predictions, variable.name = "protein", value.name = "Expected")

# Create the sequential ID column (ID_1 to ID_25000)
submission$ID <- paste0("ID_", 1:nrow(submission))

# Select only required columns
submission <- submission[, c("ID", "Expected")]

# Save to CSV in correct format
write.csv(submission, "submission_L2_dropout_test_256_64_25.csv", row.names = FALSE)

Object-Oriented Approach

Our next approach was to replicate the best aspects of our model implementation by refactoring our code within an object-oriented paradigm using the R6 package in R. Employing a class-based approach allowed us to simplify our codebase, making it more readable and easier to use.

(We are hiding the forward and backward propagation implementations to align with course requirements.)

library(R6)

#' Layer Class for Neural Networks
#'
#' Defines a single layer in a neural network, supporting forward propagation,
#' backward propagation, and parameter updates.
#'
#' @field weights A matrix representing the layer's weights.
#' @field bias A numeric vector representing the layer's bias.
#' @field activation A function representing the activation function.
#' @field activation_deriv A function representing the derivative of the activation function.
#' @field input The input data stored during the forward pass (used for backpropagation).
#' @field Z The linear transformation output before activation (`Z = W * input + b`).
#' @field A The activation output of the layer.
#' @field grad_w The gradient of the weights computed during backpropagation.
#' @field grad_b The gradient of the biases computed during backpropagation.
#'
#' @method initialize
#' @param input_size Integer, number of input features for the layer.
#' @param output_size Integer, number of output features for the layer.
#' @param activation Function, activation function applied to the layer's output.
#' @param activation_deriv Function, derivative of the activation function.
#' @param init_method String, weight initialization method ("he" for He initialization, default).
#'
#' @method forward
#' @param input_data Matrix, input data to the layer.
#' @return The activated output of the layer.
#'
#' @method backward
#' @param dA Matrix, gradient of the loss with respect to the layer’s output.
#' @param lambda Numeric, L2 regularization strength (default: 0).
#' @return The gradient of the loss with respect to the input (for previous layer).
#'
#' @method update_parameters
#' @param learning_rate Numeric, learning rate for parameter updates.
#'
Layer <- R6Class("Layer",
  public = list(
    weights = NULL,
    bias = NULL,
    activation = NULL,
    activation_deriv = NULL,
    input = NULL,   # To store the input during forward pass (for backprop)
    Z = NULL,       # Linear output (weights %*% input + bias)
    A = NULL,       # Activation output
    grad_w = NULL,  # Gradient of weights
    grad_b = NULL,  # Gradient of biases

    initialize = function(input_size, output_size, activation, activation_deriv, init_method = "he") {
      # Initialize weights and bias. Here we use He initialization by default.
      if(init_method == "he") {
        scale <- sqrt(2 / input_size)
      } else {
        scale <- sqrt(1 / input_size)
      }
      self$weights <- matrix(runif(output_size * input_size, -1, 1) * scale, nrow = output_size, ncol = input_size)
      self$bias <- rep(0, output_size)
      self$activation <- activation
      self$activation_deriv <- activation_deriv
    },

    update_parameters = function(learning_rate) {
      self$weights <- self$weights - learning_rate * self$grad_w
      self$bias <- self$bias - learning_rate * self$grad_b
    }
  )
)

#' Neural Network Class
#'
#' Implements a fully connected feedforward neural network using the R6 class system.
#'
#' @field layers List of layer objects, each representing a neural network layer.
#' @field cost_history Numeric vector storing the loss value at each training epoch.
#'
#' @method initialize
#' @param layer_dims Numeric vector specifying the number of neurons per layer.
#' @param activations List of activation functions, one per hidden/output layer.
#' @param activation_derivs List of activation function derivatives.
#'
#' @method forward
#' @param X Matrix of input features.
#' @param dropout_rate Numeric, dropout probability applied to hidden layers.
#' @return Matrix, network output after forward propagation.
#'
#' @method compute_loss
#' @param Y Matrix of true labels.
#' @param Yhat Matrix of predicted outputs.
#' @param lambda Numeric, L2 regularization strength (default: 0).
#' @return Numeric, computed loss value (MSE with optional L2 regularization).
#'
#' @method backward
#' @param X Matrix of input features.
#' @param Y Matrix of true labels.
#' @param lambda Numeric, L2 regularization strength (default: 0).
#'
#' @method update_parameters
#' @param learning_rate Numeric, learning rate for updating weights and biases.
#'
#' @method train
#' @param X Matrix of input features.
#' @param Y Matrix of true labels.
#' @param epochs Integer, number of training iterations.
#' @param learning_rate Numeric, learning rate for parameter updates.
#' @param lambda Numeric, L2 regularization strength (default: 0).
#' @param verbose Logical, whether to print training progress.
#'
NeuralNetwork <- R6Class("NeuralNetwork",
  public = list(
    layers = list(),
    cost_history = NULL,

    initialize = function(layer_dims, activations, activation_derivs) {
      # layer_dims: a numeric vector (e.g., c(input_size, hidden1, hidden2, output_size))
      # activations and activation_derivs: lists of functions for each layer (excluding the input layer)
      num_layers <- length(layer_dims) - 1
      for(i in seq_len(num_layers)) {
        layer <- Layer$new(
          input_size = layer_dims[i],
          output_size = layer_dims[i+1],
          activation = activations[[i]],
          activation_deriv = activation_derivs[[i]]
        )
        self$layers[[i]] <- layer
      }
    },

    compute_loss = function(Y, Yhat, lambda = 0) {
      # Mean squared error (MSE) with optional L2 regularization
      m <- ncol(Y)
      mse <- sum((Yhat - Y)^2) / m
      
      if(lambda > 0) {
        reg <- 0
        for(layer in self$layers) {
          reg <- reg + sum(layer$weights^2)
        }
        mse <- mse + (lambda / (2*m)) * reg
      }
      return(mse)
    },

    update_parameters = function(learning_rate) {
      for(layer in self$layers) {
        layer$update_parameters(learning_rate)
      }
    },

    train = function(X, Y, epochs = 1000, learning_rate = 0.01, lambda = 0, verbose = TRUE) {
      self$cost_history <- numeric(epochs)
      
      for(epoch in 1:epochs) {
        # Forward propagation
        Yhat <- self$forward(X)
        
        # Compute cost
        cost <- self$compute_loss(Y, Yhat, lambda)
        self$cost_history[epoch] <- cost
        
        # Backward propagation
        self$backward(X, Y, lambda)
        
        # Update parameters
        self$update_parameters(learning_rate)
        
        if(verbose && (epoch %% 100 == 0 || epoch == 1)) {
          cat(sprintf("Epoch %d/%d - Loss: %f\n", epoch, epochs, cost))
        }
      }
    }
  )
)
# Activation functions (same definitions as in the function-oriented implementation above)
relu <- function(x) { 
  x[x < 0] <- 0
  return(x)
}
relu_derivative <- function(x) { 
  ifelse(x > 0, 1, 0)
}

# You could also define softplus, sigmoid, etc.
softplus <- function(x) { log(1 + exp(x)) }
softplus_derivative <- function(x) { 1 / (1 + exp(-x)) }

# Define the architecture
layer_dims <- c(639, 64, 32, 25)  # 639 inputs, two hidden layers (64, 32), 25 outputs

# For simplicity, we use ReLU for all layers here (but you can mix and match)
activations <- list(relu, relu, relu)
activation_derivs <- list(relu_derivative, relu_derivative, relu_derivative)

# Create the Neural Network object
nn <- NeuralNetwork$new(layer_dims, activations, activation_derivs)

# rna_training_set <- as.matrix(read.csv("../../Data/training_set_rna.csv", header = TRUE, row.names = 1))
# adt_training_set <- as.matrix(read.csv("../../Data/training_set_adt.csv", header = TRUE, row.names = 1))
# rna_test_set  <- as.matrix(read.csv("../../Data/test_set_rna.csv", header = TRUE, row.names = 1))

# Train the network
nn$train(rna_training_set, adt_training_set, epochs = 1000, learning_rate = 0.001, lambda = 0.01)


Neural Network Using Torch

With a deeper understanding of the tools available for optimizing neural networks, we turned to a favorite among industry professionals to build on what we had learned from our exploration and research.

The torch library (similar to PyTorch in Python) provides a streamlined and efficient approach to implementing the neural network this project requires. Torch gives easy access to built-in functionality for forward and backward propagation and for tuning activation functions. Furthermore, Adam optimization can be enabled with a few function calls.
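
As a minimal sketch (the architecture and hyperparameters here are illustrative placeholders, not our final tuned configuration), the same kind of feedforward network can be expressed with torch in R:

library(torch)

# Feedforward network: 639 -> 256 -> 64 -> 25 (illustrative sizes)
net <- nn_module(
  "ADTNet",
  initialize = function() {
    self$fc1 <- nn_linear(639, 256)
    self$fc2 <- nn_linear(256, 64)
    self$fc3 <- nn_linear(64, 25)
  },
  forward = function(x) {
    x <- nnf_relu(self$fc1(x))
    x <- nnf_relu(self$fc2(x))
    self$fc3(x)
  }
)

model <- net()
optimizer <- optim_adam(model$parameters, lr = 0.001)  # illustrative learning rate

# Transpose so rows are cells and columns are features, matching the earlier transposition step
x_train <- torch_tensor(t(rna_training_set), dtype = torch_float())
y_train <- torch_tensor(t(adt_training_set), dtype = torch_float())

for (epoch in 1:100) {  # illustrative epoch count
  optimizer$zero_grad()
  loss <- nnf_mse_loss(model(x_train), y_train)
  loss$backward()
  optimizer$step()
  if (epoch %% 20 == 0) cat(sprintf("Epoch %d | MSE: %.4f\n", epoch, loss$item()))
}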


Cross Validation Methods

  • Our group implemented a 90/10 train-test split, chosen based on dataset size to ensure sufficient training data while maintaining a reliable evaluation set. This approach not only provided a comprehensive assessment of model performance but also facilitated robust training iterations, leading to more accurate predictions. Additionally, the split was complemented by carefully chosen network initialization, further enhancing the model’s predictive capability.

  • Various validation metrics were monitored during training, including Pearson correlation, R², and loss (MSE). Through strategic analysis, we found that higher R² validation scores correlated with better Kaggle performance compared to higher Pearson correlation scores. Consequently, our subsequent analyses prioritized network parameters that yielded the highest R² values. Another critical factor influencing performance was the gap between training and validation loss. Smaller gaps consistently led to improved Kaggle scores. These validation insights were integral to our hyperparameter tuning process.

Testing Schedule

We conducted extensive research as we began hyperparameter tuning and developed a systematic schedule. Our approach followed a research-driven methodology, aligning closely with best practices from Deep Learning Illustrated, which we referenced throughout the project. Expert insights from highly respected data scientists guided our optimization process, ensuring our approach was both methodologically sound and strategically informed.

We developed a testing schedule that refined the network step by step. Beginning with a baseline estimation, we progressively incorporated regularization and optimization techniques, carefully evaluating their impact to enhance performance.


Hyperparameter Tuning

Learning Rate

We began tuning the learning rate on a baseline architecture that seemed appropriate for the dataset at hand. The chosen architecture followed the structure 639 → 256 → 64 → 25. Tuning focused on the learning rate, with a classic ReLU activation applied to the first hidden layer following the input layer and a default batch size of 120. No dropout, regularization, or optimizer enhancements were applied at this stage.

df <- readRDS("Learning_Rate_Tuning.rds")
# Create the visualization using the aggregated data frame
p <- ggplot(df, aes(x = learning_rate)) +
  geom_point(aes(y = best_pearson, color = "Pearson"), size = 3) +
  geom_line(aes(y = best_pearson, color = "Pearson")) +
  geom_point(aes(y = best_r2, color = "R2"), size = 3) +
  geom_line(aes(y = best_r2, color = "R2")) +
  scale_x_log10(
    breaks = unique(df$learning_rate),
    labels = function(x)
      format(round(x, 4), nsmall = 4, scientific = FALSE)
  ) +
  labs(
    x = "Learning Rate",
    y = "Metric Value",
    title = "Tuning Results: Validation Pearson & R² vs. Learning Rate",
    caption = "Pearson correlation remains consistently high across learning rates, while R² improves with increasing rates. ",
  color = "Metric") +
  cells_plot() +
  scale_color_manual(values = c("Pearson" = "#00B4da", "R2" = "#DC57BB"))

p

Based on our analysis, learning rates of 0.01 and approximately 0.001 demonstrated strong performance when evaluating validation loss and R^2/Pearson correlation metrics. While the smallest learning rate produced solid results, further testing revealed that it significantly prolonged training time without offering substantial performance gains. As a result, we proceeded with the two best-performing learning rates for further optimization.

(One important consideration—further discussed in later sections—is that this initial tuning did not include a consistent random initialization of the training loader. As we progressed into more advanced tuning, we ensured proper implementation of this step. However, for this initial phase, we determined that re-running the analysis was unnecessary).


Architecture Tuning

With those two learning rates in mind, the testing moved on to analyzing architectures of varying depth and neuron sizes to obtain a robust representation of performance on differing sized architectures. The results are as follows:


Composite Score

The first visualization showcases our group’s use of a composite score, designed to evaluate model performance holistically. This score integrates multiple validation metrics, including \(\mathrm{R}^2\), Pearson correlation, the generalization gap between training and validation, and cosine similarity.

The composite score is calculated as follows:

\[ \begin{aligned} \text{Composite} = & \ (0.4 \times \mathrm{R}^2) + (0.2 \times \text{Pearson}) + (0.1 \times \text{Cosine Similarity}) \\ & - (0.1 \times \text{Validation Loss}) - (0.2 \times (\text{Validation Loss} - \text{Training Loss})) \end{aligned} \]
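
A minimal R sketch of this composite score, assuming the individual metrics have already been computed (all inputs are single numeric values):

# Composite score as defined above; weights follow the formula in the text
composite_model_score <- function(r2, pearson, cosine, val_loss, train_loss) {
  (0.4 * r2) + (0.2 * pearson) + (0.1 * cosine) -
    (0.1 * val_loss) - (0.2 * (val_loss - train_loss))
}

# Example with made-up values
composite_model_score(r2 = 0.80, pearson = 0.85, cosine = 0.97,
                      val_loss = 0.35, train_loss = 0.30)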

library(patchwork)
library(reshape2)


architectures <- list(
  arch1 = c(128),
  # 639 -> 128 -> 25
  arch2 = c(128, 64),
  arch3 = c(128, 64, 32),
  # 639 -> 128 -> 64 -> 25
  arch4 = c(256, 64),
  # 639 -> 256 -> 64 -> 25
  arch5 = c(256, 128),
  # 639 -> 256 -> 128 -> 25
  arch6 = c(256, 128, 64),
  # 639 -> 256 -> 128 -> 64 -> 25
  arch7 = c(256, 128, 64, 32),
  # 639 -> 256 -> 128 -> 64 -> 32 -> 25
  arch8 = c(512, 256, 128),
  # 639 -> 512 -> 256 -> 128 -> 25
  arch9 = c(512, 256, 128, 64),
  # 639 -> 512 -> 256 -> 128 -> 64 -> 25
  arch10 = c(512, 256, 128, 64, 32),
  # 639 -> 512 -> 256 -> 128 -> 64 -> 32 -> 25
  arch11 = c(1024, 512, 256, 128, 64, 32) # 639 -> 1024 -> 512 -> 256 -> 128 -> 64 -> 32 -> 25
)

df_br <- readRDS("Architecture_Tuning.rds")
# --- Plot 1: Line Plot of Validation Pearson by Architecture & Learning Rate ---
# Subset for validation Pearson values
mse_val <- df_br[, c("arch", "lr", "val_pearson")]
mse_val$arch <- factor(mse_val$arch, levels = paste0("arch", 1:11))
# Create architecture labels from hidden layer configurations
arch_labels <- sapply(architectures, function(x)
  paste(x, collapse = "-"))

p_line <- ggplot(mse_val,
                 aes(
                   x = arch,
                   y = val_pearson,
                   group = factor(lr),
                   color = factor(lr)
                 )) +
  geom_line(size = 1) +
  geom_point(size = 3) +
  geom_text(
    aes(label = round(val_pearson, 3)),
    vjust = -1,
    size = 3,
    show.legend = FALSE
  ) +
  scale_x_discrete(labels = arch_labels) +
  labs(
    x = "Architecture (Hidden Layers)",
    y = "Validation Pearson",
    title = "Validation Pearson by Architecture and Learning Rate",
    color = "Learning Rate"
  ) +
  cells_plot() +
  scale_color_manual(values = c("0.01" = "#00B4da", "0.002" = "#DC57BB")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))



# --- Plot 2: Bar Chart of Composite Score (Ordered Ascending) ---
# Create a new column combining the architecture label and learning rate
df_br$arch_lr <- paste0(arch_labels[df_br$arch], " (lr=", df_br$lr, ")")
df_br <- df_br %>% arrange(composite_score)
df_br$arch_lr <- factor(df_br$arch_lr, levels = unique(df_br$arch_lr))

p_bar <- ggplot(df_br, aes(x = arch_lr, y = composite_score, fill = factor(lr))) +
  geom_bar(stat = "identity") +
  labs(x = "Architecture (Hidden Layers & Learning Rate)",
       y = "Composite Score",
       title = "Composite Score by Architecture",
       fill = "Learning Rate") +
  cells_plot() +
  scale_fill_manual(values = c("0.01" = "#00B4da", "0.002" = "#DC57BB")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Combine the two plots (vertical arrangement) and save as a PNG file
combined_plot <- p_bar / p_line

combined_plot

From this, we find that deeper architectures with more neurons performed better in terms of our composite score. Several medium-sized architectures also performed quite well and remain promising candidates, given their relative simplicity and lower risk of overfitting.

In addition, our analysis of the second visualization shows that increasing depth generally correlates with higher validation Pearson scores. However, the learning rate plays a significant role in performance: the deepest network performed substantially worse with a 0.01 learning rate than with 0.002. With this in mind, we carried both learning rates into our subsequent tuning to ensure we captured each network’s behavior accurately.


Dropout Tuning

dropout_relu_df <- readRDS("Project Files/dropout_relu_df.rds")

unique_arch <- unique(dropout_relu_df$arch_name)
ordered_arch <- unique_arch[order(as.numeric(gsub("arch", "", unique_arch)))]

# Create a factor for arch_name using these ordered levels.
dropout_relu_df$arch_order <- factor(dropout_relu_df$arch_name, levels = ordered_arch)


# Create a mapping from arch names to hidden_layers string labels.
# This will be used to relabel the x-axis ticks.
label_map <- setNames(dropout_relu_df$hidden_layers, dropout_relu_df$arch_name)

# Build the ggplot heatmap.
p_val_loss <- ggplot(dropout_relu_df, aes(
  x = arch_order,
  y = dropout_rate,
  fill = best_loss,
  text = paste("Dropout Rate:", dropout_rate, "<br>",
               "Validation Loss:", best_loss, "<br>",
               "Hidden Layers:", hidden_layers)
)) +
  geom_tile() +
  facet_wrap(~ learning_rate) +   # Two grids side by side, one per learning_rate
  scale_x_discrete(labels = label_map) +  # Replace x tick labels with hidden_layers string
  scale_fill_viridis(option = "viridis", name = "Best Loss") +
  # scale_fill_gradientn(
  #   colors = c("#DC57BB", "#33FF99","#00B4da", "#222222"),
  #   name = "Best Loss") + # too ugly
  labs(
    title = "Heatmap of Best Loss by Hidden Layers (Arch) & Dropout Rate",
    x = "Hidden Layers",  # The tick labels come from the hidden_layers string
    y = "Dropout Rate"
  ) +
  cells_plot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
  strip.background = element_blank())

# Convert the ggplot to an interactive plotly object
interactive_val_loss <- ggplotly(p_val_loss, tooltip = c("text", "fill"))
interactive_val_loss

Despite poor training results from our smaller, shallower networks, and despite concerns about overfitting and unsatisfactory Kaggle scores from our deepest networks, we persisted with dropout tuning in the hope of improving the lackluster test outcomes. However, this implementation only reinforced our initial findings.

When testing all permutations of network architectures with dropout rates ranging from 0 to 0.50 across two different learning rates, we found that the optimal dropout rate for networks with a learning rate of 0.001 ranged from 0.20 to 0.40. Networks at this learning rate that exhibited the most pronounced decrease in training loss were the 512→256→128 and the 512→256→128→64 (i.e., “moderate”) architectures.

Conversely, dropout exhibited a narrower range of success for networks learning at a faster rate, with the lowest loss occurring at a dropout rate of 0.30 for our 512→256→128→64 neural network.

# Plot for Best Pearson
p_best_pearson <- ggplot(dropout_relu_df, aes(
  x = arch_order,
  y = dropout_rate,
  fill = best_pearson,
  text = paste("Dropout Rate:", dropout_rate, "<br>",
               "Best Pearson:", best_pearson, "<br>",
               "Hidden Layers:", hidden_layers)
)) +
  geom_tile() +
  facet_wrap(~ learning_rate) +
  scale_x_discrete(labels = label_map) +
  scale_fill_viridis(option = "viridis", name = "Best Pearson") +
  scale_y_continuous(breaks = sort(unique(dropout_relu_df$dropout_rate))) +
  labs(
    title = "Heatmap of Best Pearson by Hidden Layers (Arch) & Dropout Rate",
    x = "Hidden Layers",
    y = "Dropout Rate"
  ) +
  cells_plot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
  strip.background = element_blank())

interactive_best_pearson <- ggplotly(p_best_pearson, tooltip = c("text", "fill"))
interactive_best_pearson

Our validation Pearson’s correlation scores followed the same pattern, as we had intuitively expected. The same dropout rates exerted a significant influence on these metrics across the architectures and their respective learning rates. Many of these architectures experienced an increase of 0.02 to 0.05 in their validation Pearson’s scores with this straightforward implementation.

# Plot for Best R² (using superscript via Unicode \u00B2)
p_best_r2 <- ggplot(dropout_relu_df, aes(
  x = arch_order,
  y = dropout_rate,
  fill = best_r2,
  text = paste("Dropout Rate:", dropout_rate, "<br>",
               "Best R\u00B2:", best_r2, "<br>",
               "Hidden Layers:", hidden_layers)
)) +
  geom_tile() +
  facet_wrap(~ learning_rate) +
  scale_x_discrete(labels = label_map) +
  scale_fill_viridis(option = "viridis", name = "Best R\u00B2") +
  scale_y_continuous(breaks = sort(unique(dropout_relu_df$dropout_rate))) +
  labs(
    title = "Heatmap of Best R\u00B2 by Hidden Layers (Arch) & Dropout Rate",
    x = "Hidden Layers",
    y = "Dropout Rate"
  ) +
  cells_plot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
  strip.background = element_blank())

interactive_best_r2 <- ggplotly(p_best_r2, tooltip = c("text", "fill"))
interactive_best_r2

The experiment was replicated for our \(\mathrm{R}^2\) metric as well. This initial test, combined with the extremely poor performance of our small, shallow networks and the performance observations during the training of our larger, deeper networks, led us to converge on a smaller range of network architectures for subsequent work. The results also provided compelling evidence to unquestionably accept dropout as a viable and meaningful contributor to model improvement. Furthermore, we hypothesized that our smaller networks were not capturing enough detail from the data to make meaningful predictions, whereas our deeper networks were capturing excessive detail. Thus, a delicate balance in approach appears to be necessary for this data. Consequently, we focused our efforts in subsequent tests on moderately sized network architectures.

dropout_01 <- dropout_relu_df %>%
  filter(learning_rate == 0.01)

desired_order <- c(
  "128",
  "128->64",
  "128->64->32",
  "256->64",
  "256->128",
  "256->128->64",
  "256->128->64->32",
  "512->256->128",
  "512->256->128->64",
  "512->256->128->64->32",
  "1024->512->256->128->64->32"
)

# Pivot the data to long format
df_long <- dropout_01 %>%
  pivot_longer(
    cols = c(best_pearson, best_r2),
    names_to = "metric",
    values_to = "value"
  ) %>%
  # Extract the numeric part from arch_name and reorder the factor accordingly
  mutate(arch_num = as.numeric(gsub("arch", "", arch_name))) %>%
  arrange(arch_num) %>%
  mutate(arch_name = factor(arch_name, levels = unique(arch_name)))

# Order the hidden_layers facets according to the desired sequence
df_long <- df_long %>%
  mutate(
    hidden_layers = factor(hidden_layers, levels = desired_order)
  )

# Create the ggplot; faceting by hidden_layers keeps each architecture separate
p <- ggplot(df_long, aes(x = dropout_rate, y = value, color = metric)) +
  geom_line(size = 1) +
  facet_wrap(~ hidden_layers) +  # Will follow the factor levels
  labs(
    x = "Dropout Rate", 
    y = "Metric Value",
    title = "Best Pearson and R² as a Function of Dropout Rate",
    caption = "Learning rate of 0.01",
    color = "Metric"
  ) + 
  scale_color_manual(values = c("best_pearson" = "#00B4da", "best_r2" = "#DC57BB")) +  # Set line colors
  cells_plot() +
  theme(
    strip.background = element_blank(),
    plot.caption = element_text(size = 14)
  )
p

We also sought to analyze our performance metrics as a function of dropout rate. A contrasting effect emerged when comparing architectures ranging from shallow and narrow to deep and wide: the shallow, narrow networks showed a sharp increase in performance metrics as the dropout rate increased, whereas the deeper, wider networks showed a sharp decline with increasing dropout rate at both learning rates.

dropout_001 <- dropout_relu_df %>%
  filter(learning_rate == 0.001)

desired_order <- c(
  "128",
  "128->64",
  "128->64->32",
  "256->64",
  "256->128",
  "256->128->64",
  "256->128->64->32",
  "512->256->128",
  "512->256->128->64",
  "512->256->128->64->32",
  "1024->512->256->128->64->32"
)

# Pivot the data to long format
df_long <- dropout_001 %>%
  pivot_longer(
    cols = c(best_pearson, best_r2),
    names_to = "metric",
    values_to = "value"
  ) %>%
  # Extract the numeric part from arch_name and reorder the factor accordingly
  mutate(arch_num = as.numeric(gsub("arch", "", arch_name))) %>%
  arrange(arch_num) %>%
  mutate(arch_name = factor(arch_name, levels = unique(arch_name)))

# Order the hidden_layers facets according to the desired sequence
df_long <- df_long %>%
  mutate(
    hidden_layers = factor(hidden_layers, levels = desired_order)
  )

# Create the ggplot; faceting by hidden_layers keeps each architecture separate
p <- ggplot(df_long, aes(x = dropout_rate, y = value, color = metric)) +
  geom_line(size = 1) +
  facet_wrap(~ hidden_layers) +  # will follow the factor levels
  labs(
    x = "Dropout Rate", 
    y = "Metric Value",
    title = "Best Pearson and R² as a Function of Dropout Rate",
    caption = "Learning rate of .001",
    color = "Metric"
  ) + 
    scale_color_manual(values = c("best_pearson" = "#00B4da", "best_r2" = "#DC57BB")) +  # Set line colors
  cells_plot() +
  theme(
    strip.background = element_blank(),
    plot.caption = element_text(size = 14)
  )
p

The decline observed in networks with a slower learning rate becomes markedly more pronounced as we increase width, depth, and dropout probability. This effect is illustrated below.

library(tidyverse)

# Create the summary for each learning rate and add an identifier
dropout_01_avg_loss <- dropout_01 %>%
  group_by(dropout_rate) %>%
  summarize(avg_loss = mean(best_loss, na.rm = TRUE)) %>%
  mutate(learning_rate = ".01")

dropout_001_avg_loss <- dropout_001 %>%
  group_by(dropout_rate) %>%
  summarize(avg_loss = mean(best_loss, na.rm = TRUE)) %>%
  mutate(learning_rate = ".001")

# Combine the two summaries into one data frame
combined_loss <- bind_rows(dropout_01_avg_loss, dropout_001_avg_loss)

# Create the plot with faceting by learning_rate
ggplot(
  combined_loss,
  aes(
    x = dropout_rate,
    y = avg_loss,
    group = learning_rate,
    colour = learning_rate
  )
) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  facet_wrap( ~ learning_rate) +
  scale_color_manual(values = c(".001" = "#DC57BB", ".01" = "#00B4da")) +
  labs(
    title = "Average Validation Loss as a Function of Dropout Rate",
    caption = "Learning rates: .01 and .001",
    x = "Dropout Rate",
    y = "Average Loss"
  ) +
  cells_plot()

When averaging the loss across all networks over the observed dropout probability range, we observed that architectures with slower learning rates incurred a higher penalty for extreme values. Additionally, our results indicate that, on average, these architectures benefit most from employing a mid-range dropout probability across their layers.

Based on these insights, we advanced the best-performing models from our architectural testing phase into our dropout testing phase. From a validation standpoint, the top-performing models conformed to one of two structures:

  • 639→512→256→128→25
  • 639→512→256→128→64→25

Both architectures consistently dominated the upper echelon across various dropout configurations, excelling in all key model performance metrics.
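To make these architectures concrete, the sketch below shows how such a network constructor might be written with torch for R. Our actual net() helper (referenced later in the Bayesian optimization section) accepts the same arguments; the body here, including the placement of batch normalization, ReLU, and dropout in each hidden block, is an assumption for illustration only.

library(torch)

# Illustrative constructor for a fully connected network such as 639 -> 512 -> 256 -> 128 -> 25.
net <- nn_module(
  initialize = function(input_size, hidden_layers, output_size, dropout_prob = 0) {
    sizes <- c(input_size, hidden_layers)
    layers <- list()
    for (i in seq_along(hidden_layers)) {
      layers[[length(layers) + 1]] <- nn_linear(sizes[i], sizes[i + 1])   # dense layer
      layers[[length(layers) + 1]] <- nn_batch_norm1d(sizes[i + 1])       # batch normalization
      layers[[length(layers) + 1]] <- nn_relu()                           # activation
      layers[[length(layers) + 1]] <- nn_dropout(p = dropout_prob)        # regularization
    }
    layers[[length(layers) + 1]] <- nn_linear(sizes[length(sizes)], output_size)  # 25 ADT outputs
    self$net <- do.call(nn_sequential, layers)
  },
  forward = function(x) self$net(x)
)

# e.g., the shallower of the two finalist architectures:
model <- net(input_size = 639, hidden_layers = c(512, 256, 128),
             output_size = 25, dropout_prob = 0.30)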

Batch Size

Throughout our research, mini-batching was regarded as a standard practice for model optimization. Naturally, we adopted this approach during our early implementation and hyperparameter testing phases.

We conducted our experiments by applying our two model architectures across every previously tested dropout probability (ranging from 0 to 0.5 in 0.05 increments) and evaluating them using the following mini-batch sizes:

  • 32
  • 48
  • 64
  • 82
  • 120
  • 140
  • 180
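
For context, the sketch below shows how such a mini-batch loader can be built with torch for R. X_train and Y_train are assumed training matrices, and our actual initialize_loader() helper (used later in the Bayesian optimization code) follows the same idea; treat this as an illustrative stand-in rather than the exact implementation.

library(torch)

# Minimal sketch of a mini-batch loader; X_train / Y_train are assumed matrices.
initialize_loader <- function(batch_size) {
  ds <- tensor_dataset(
    torch_tensor(as.matrix(X_train), dtype = torch_float()),
    torch_tensor(as.matrix(Y_train), dtype = torch_float())
  )
  dataloader(ds, batch_size = batch_size, shuffle = TRUE)
}

# e.g., one of the batch sizes tested above:
train_loader <- initialize_loader(48)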

Our previous test established a benchmark with a top Pearson’s correlation of 0.869. Following our batch size tests, three hyperparameter combinations met or exceeded this benchmark.

  • 639->512->256->128->25
    • Learning Rate: 0.001
    • Dropout Rate: 0.30
    • Batch Size: 48
    • Pearson’s Correlation: 0.872
  • 639->512->256->128->25
    • Learning Rate: 0.01
    • Dropout Rate: 0.35
    • Batch Size: 82
    • Pearson’s Correlation: 0.87
  • 639->512->256->128->64->25
    • Learning Rate: 0.01
    • Dropout Rate: 0.35
    • Batch Size: 32
    • Pearson’s Correlation: 0.869

The following plots illustrate the changes in our metrics following the implementation of the corresponding mini-batch rates.

slope_df <- readRDS("Project Files/slope_df.rds")

best_dropout_rates_df <- dropout_relu_df %>%
  filter(best_pearson > 0.861 &
           arch_name %in% c("arch8", "arch9")) %>%
  select(-arch_order) %>%
  mutate(unique_id = paste0(
    arch_name,
    "_",
    sub("^0\\.", "", sprintf("%.2f", dropout_rate)),
    "_",
    sub("^0\\.", "", sprintf("%.3f", learning_rate))
  )) %>%
  rename(
    best_loss_old = best_loss,
    best_pearson_old = best_pearson,
    best_r2_old = best_r2
  )

batch_test_df <- readRDS("Project Files/batch_test_df_relu.rds")

best_batch_test_df <- batch_test_df %>%
  filter(arch_name %in% c("arch1", "arch2")) %>%
  mutate(
    arch_name = recode(arch_name, "arch1" = "arch8", "arch2" = "arch9"),
    unique_id = paste0(
      arch_name,
      "_",
      sub("^0\\.", "", sprintf("%.2f", dropout_rate)),
      "_",
      sub("^0\\.", "", sprintf("%.3f", learning_rate))
    )
  )
best_batch_test_df <- best_batch_test_df %>%
  filter(unique_id %in% best_dropout_rates_df$unique_id)

joined_df <- best_batch_test_df %>%
  inner_join(
    best_dropout_rates_df %>%
      select(unique_id, best_loss_old, best_pearson_old, best_r2_old),
    by = "unique_id"
  )

final_df <- joined_df %>%
  arrange(desc(best_pearson)) %>%
  filter(best_pearson_old < best_pearson &
           best_pearson > 0.86954206)

loss_df <- final_df %>%
  select(
    unique_id,
    batch_size,
    arch_name,
    hidden_layers,
    dropout_rate,
    best_loss_old,
    best_loss,
    learning_rate
  ) %>%
  pivot_longer(
    cols = c(best_loss_old, best_loss),
    names_to = "stage",
    values_to = "loss"
  ) %>%
  mutate(
    stage = recode(stage, best_loss_old = "Before", best_loss = "After"),
    stage = factor(stage, levels = c("Before", "After"))
  )


theme_set(theme_classic())

# 1) Convert config_before to factor and drop any unused levels
slope_df <- slope_df %>%
  mutate(config_before = droplevels(factor(config_before)))

p <- ggplot(slope_df) +
  # a) Lines (slope segments), color by config, but no legend
  geom_segment(
    aes(
      x = 1,
      xend = 2,
      y = loss_before,
      yend = loss_after,
      color = config_before,
      group = config_before
    ),
    size = 1.2,
    show.legend = FALSE  # <--- no legend for lines
  ) +
  
  # b) Points for Before & After, color by config
  #    We allow legend here so the color legend is created from points, not lines
  geom_point(
    aes(
      x = 1,
      y = loss_before,
      color = config_before,
      group = config_before
    ),
    shape = 19,
    size = 3
  ) +
  geom_point(
    aes(
      x = 2,
      y = loss_after,
      color = config_before,
      group = config_before
    ),
    shape = 19,
    size = 3
  ) +
  
  # c) Numeric labels at each point (rounded to 4 decimals)
  geom_text(
    aes(
      x = 1,
      y = loss_before,
      label = round(loss_before, 4),
      color = config_before
    ),
    vjust = -0.9,
    size = 3,
    show.legend = FALSE
  ) +
  geom_text(
    aes(
      x = 2,
      y = loss_after,
      label = round(loss_after, 4),
      color = config_before
    ),
    vjust = -0.7,
    hjust = -.2,
    size = 3,
    show.legend = FALSE
  ) +
  
  # d) Vertical dashed lines for clarity
  geom_vline(xintercept = 1,
             linetype = "dashed",
             size = 0.1) +
  geom_vline(xintercept = 2,
             linetype = "dashed",
             size = 0.1) +
  
  # e) X-axis for "Before" & "After"
  scale_x_continuous(
    limits = c(0.5, 2.5),
    breaks = c(1, 2),
    labels = c("Before", "After")
  ) +
  
  # line colors
  scale_color_manual(values = c("#00B4da",  "black", "#DC57BB")) +
  
  # f) Y-axis limits based on data range
  coord_cartesian(ylim = c(
    min(slope_df$loss_before, slope_df$loss_after) * 0.9,
    max(slope_df$loss_before, slope_df$loss_after) * 1.1
  )) +
  
  # g) Labels
  labs(
    title = "Validation Loss Comparison: Before vs. After Batch Size Tuning",
    subtitle = "Before = Dropout only applied | After = Dropoout and Batch Size Applied",
    x = NULL,
    y = "Validation Loss",
    color = "Config"
  ) +
  
  # h) Facet by hidden_layers
  facet_wrap( ~ hidden_layers, scales = "free_x", nrow = 1) +
  
  # i) Legend override so color legend uses circles
  guides(color = guide_legend(override.aes = list(shape = 16))) +
  
  # j) Theme adjustments
  theme(
    panel.background = element_blank(),
    panel.grid = element_blank(),
    axis.ticks = element_blank(),
    axis.text.x = element_text(size = 12),
    panel.border = element_blank(),
    plot.margin = unit(c(1, 1, 1, 1), "mm")
  )

print(p)

Introducing mini-batch processing for our shallower networks led to a noticeable decrease in training losses, whereas our deeper network exhibited virtually no change.

pearson_df_slope <- final_df %>%
  select(
    unique_id,
    batch_size,
    arch_name,
    hidden_layers,
    dropout_rate,
    best_pearson_old,
    best_pearson,
    learning_rate
  ) %>%
  pivot_longer(
    cols = c(best_pearson_old, best_pearson),
    names_to = "stage",
    values_to = "pearson"
  ) %>%
  mutate(
    stage = recode(stage, best_pearson_old = "Before", best_pearson = "After"),
    stage = factor(stage, levels = c("Before", "After")),
    # Create a configuration factor for coloring
    config = factor(paste0(
      "LR=", learning_rate, " | DR=", dropout_rate
    )),
    # Label is the Pearson value rounded to 4 decimals
    label = round(pearson, 4)
  )

# ----------------------------
# Split into Left and Right Halves and Rename Columns
# ----------------------------
left_label <- pearson_df_slope %>%
  filter(stage == "Before") %>%
  select(unique_id, hidden_layers, stage, pearson, config, label) %>%
  rename(
    pearson_before = pearson,
    config_before = config,
    label_before = label
  )

right_label <- pearson_df_slope %>%
  filter(stage == "After") %>%
  select(unique_id, hidden_layers, stage, pearson, label) %>%
  rename(pearson_after = pearson, label_after = label)

# ----------------------------
# Join the Two Halves by unique_id and hidden_layers
# ----------------------------
slope_df_pearson <- left_join(left_label, right_label, by = c("unique_id", "hidden_layers")) %>%
  mutate(config_before = droplevels(factor(config_before)))


p <- ggplot(slope_df_pearson) +
  # a) Lines (slope segments), color by config, but no legend
  geom_segment(
    aes(
      x = 1,
      xend = 2,
      y = pearson_before,
      yend = pearson_after,
      color = config_before,
      group = config_before
    ),
    size = 1.2,
    show.legend = FALSE  # <--- no legend for lines
  ) +
  
  # b) Points for Before & After, color by config
  #    We allow legend here so the color legend is created from points, not lines
  geom_point(
    aes(
      x = 1,
      y = pearson_before,
      color = config_before,
      group = config_before
    ),
    shape = 19,
    size = 3
  ) +
  geom_point(
    aes(
      x = 2,
      y = pearson_after,
      color = config_before,
      group = config_before
    ),
    shape = 19,
    size = 3
  ) +
  
  # c) Numeric labels at each point (rounded to 4 decimals)
  geom_text(
    aes(
      x = 1,
      y = pearson_before,
      label = round(pearson_before, 4),
      color = config_before
    ),
    vjust = -0.9,
    size = 3,
    show.legend = FALSE
  ) +
  geom_text(
    aes(
      x = 2,
      y = pearson_after,
      label = round(pearson_after, 4),
      color = config_before
    ),
    vjust = -0.7,
    hjust = -.2,
    size = 3,
    show.legend = FALSE
  ) +
  
  # d) Vertical dashed lines for clarity
  geom_vline(xintercept = 1,
             linetype = "dashed",
             size = 0.1) +
  geom_vline(xintercept = 2,
             linetype = "dashed",
             size = 0.1) +
  
  # e) X-axis for "Before" & "After"
  scale_x_continuous(
    limits = c(0.5, 2.5),
    breaks = c(1, 2),
    labels = c("Before", "After")
  ) +

  # line colors
  scale_color_manual(values = c("#00B4da",  "black", "#DC57BB")) +   
  # g) Labels
  labs(
    title = "Validation Loss Comparison: Before vs. After Batch Size Tuning",
    subtitle = "Before = Dropout only applied | After = Dropoout and Batch Size Applied",
    x = NULL,
    y = "Validation Pearson",
    color = "Config"
  ) +
  
  # h) Facet by hidden_layers
  facet_wrap( ~ hidden_layers, scales = "free_x", nrow = 1) +
  
  # i) Legend override so color legend uses circles
  guides(color = guide_legend(override.aes = list(shape = 16))) +
  
  # j) Theme adjustments
  theme(
    panel.background = element_blank(),
    panel.grid = element_blank(),
    axis.ticks = element_blank(),
    axis.text.x = element_text(size = 12),
    panel.border = element_blank(),
    plot.margin = unit(c(1, 1, 1, 1), "mm")
  )

print(p)

We observed significant increases in our Pearson’s correlation metrics for the same networks, with a modest improvement for our deeper network.

r2_df_slope <- final_df %>%
  select(
    unique_id,
    batch_size,
    arch_name,
    hidden_layers,
    dropout_rate,
    best_r2_old,
    best_r2,
    learning_rate
  ) %>%
  pivot_longer(
    cols = c(best_r2_old, best_r2),
    names_to = "stage",
    values_to = "r2"
  ) %>%
  mutate(
    # Recode stage: best_r2_old -> "Before", best_r2 -> "After"
    stage = recode(stage, best_r2_old = "Before", best_r2 = "After"),
    stage = factor(stage, levels = c("Before", "After")),
    # Create a configuration factor for coloring
    config = factor(paste0(
      "LR=", learning_rate, " | DR=", dropout_rate
    )),
    # Label is the R² value rounded to 4 decimals
    label = round(r2, 4)
  )

# ----------------------------
# Split into Left and Right Halves and Rename Columns
# ----------------------------
left_label <- r2_df_slope %>%
  filter(stage == "Before") %>%
  select(unique_id, hidden_layers, stage, r2, config, label) %>%
  rename(
    r2_before = r2,
    config_before = config,
    label_before = label
  )

right_label <- r2_df_slope %>%
  filter(stage == "After") %>%
  select(unique_id, hidden_layers, stage, r2, label) %>%
  rename(r2_after = r2, label_after = label)

# ----------------------------
# Join the Two Halves by unique_id and hidden_layers
# ----------------------------
slope_df_r2 <- left_join(left_label, right_label, by = c("unique_id", "hidden_layers")) %>%
  mutate(config_before = droplevels(factor(config_before)))

p <- ggplot(slope_df_r2) +
  # a) Lines (slope segments), color by config, but no legend
  geom_segment(
    aes(
      x = 1,
      xend = 2,
      y = r2_before,
      yend = r2_after,
      color = config_before,
      group = config_before
    ),
    size = 1.2,
    show.legend = FALSE  # <--- no legend for lines
  ) +
  
  # b) Points for Before & After, color by config
  #    We allow legend here so the color legend is created from points, not lines
  geom_point(
    aes(
      x = 1,
      y = r2_before,
      color = config_before,
      group = config_before
    ),
    shape = 19,
    size = 3
  ) +
  geom_point(
    aes(
      x = 2,
      y = r2_after,
      color = config_before,
      group = config_before
    ),
    shape = 19,
    size = 3
  ) +
  
  # c) Numeric labels at each point (rounded to 4 decimals)
  geom_text(
    aes(
      x = 1,
      y = r2_before,
      label = round(r2_before, 4),
      color = config_before
    ),
    vjust = -0.9,
    size = 3,
    show.legend = FALSE
  ) +
  geom_text(
    aes(
      x = 2,
      y = r2_after,
      label = round(r2_after, 4),
      color = config_before
    ),
    vjust = -0.7,
    hjust = -0.2,
    size = 3,
    show.legend = FALSE
  ) +
  
  # d) Vertical dashed lines for clarity
  geom_vline(xintercept = 1, linetype = "dashed", size = 0.1) +
  geom_vline(xintercept = 2, linetype = "dashed", size = 0.1) +
  
  # e) X-axis for "Before" & "After"
  scale_x_continuous(
    limits = c(0.5, 2.5),
    breaks = c(1, 2),
    labels = c("Before", "After")
  ) +

    # line colors
  scale_color_manual(values = c("#00B4da",  "black", "#DC57BB")) +  
  
  # g) Labels
  labs(
    title = "Validation R² Comparison: Before vs. After Batch Size Tuning",
    subtitle = "Before = Dropout only applied | After = Dropout and Batch Size Applied",
    x = NULL,
    y = "Validation R²",
    color = "Config"
  ) +
  
  # h) Facet by hidden_layers
  facet_wrap(~ hidden_layers, scales = "free_x", nrow = 1) +
  
  # i) Legend override so color legend uses circles
  guides(
    color = guide_legend(override.aes = list(shape = 16))
  ) +
  
  # j) Theme adjustments
  theme(
    panel.background = element_blank(),
    panel.grid = element_blank(),
    axis.ticks = element_blank(),
    axis.text.x = element_text(size = 12),
    panel.border = element_blank(),
    plot.margin = unit(c(1, 1, 1, 1), "mm")
  )

print(p)

The same dramatic increase occurred for validation \(\mathrm{R}^2\), demonstrating that introducing mini-batch processing explained more variance in our shallower networks, while our deeper network experienced a slight decline. Consequently, we opted to further test our most promising configuration: a 639→512→256→128→25 architecture with a learning rate of 0.001, a dropout rate of 0.3, and a batch size of 48.


Activation Functions

Within the literature cited, we found that, in addition to the classic ReLU, several other popular activation functions—sigmoid, tanh, and softplus—are commonly used. To evaluate their effectiveness, we first tested our baseline architecture mentioned in the learning rate section (without regularization or optimization) to determine whether applying an activation function to a specific layer produced meaningful improvements.
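As a point of reference, the torch for R modules corresponding to these activation functions are listed below; this snippet is illustrative rather than a reproduction of our tuning script.

library(torch)

# Activation modules compared in the layer-wise tests (illustrative list):
activations <- list(
  relu     = nn_relu(),
  sigmoid  = nn_sigmoid(),
  tanh     = nn_tanh(),
  softplus = nn_softplus()
)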

combined_results <- read_csv("Activation_Tuning.csv")
# Create a unique identifier for each activation function combination
combined_results$activation_combo <- factor(combined_results$activations,
                                            levels = unique(combined_results$activations))

# Create a label that combines Pearson and R² values
combined_results$pearson_r2_label <- paste0(round(combined_results$pearson_correlation, 3),
                                            " / ",
                                            round(combined_results$r2, 3))

# Get the model architecture for use in the subtitle (assumes one unique architecture)
model_architecture <- unique(combined_results$architecture)

# Plot: Validation MSE with Pearson / R² labels, grouped by activation function
mse_plot <- ggplot(combined_results, aes(x = activation_combo)) +
  geom_line(aes(y = validation_mse, group = 1),
            color = "#00B4da",
            size = 1.2) +  # Solid blue line for validation MSE
  geom_point(aes(y = validation_mse),
             color = "#DC57BB",
             size = 3) +  # Magenta points for validation MSE
  geom_text(
    aes(y = validation_mse, label = pearson_r2_label),
    vjust = -0.8,
    hjust = 0.5,
    size = 4,
    fontface = "bold"
  ) +  # Pearson / R² labels above points
  labs(
    title = "Validation MSE Across Different Activation Function Combinations",
    subtitle = paste("Model Architecture:", model_architecture),
    x = "Activation Function Combination",
    y = "Mean Squared Error (MSE)",
    caption = "Each point represents an activation function combination. Label shows Pearson / R²."
  ) +
  cells_plot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

mse_plot

From this analysis, we found that ReLU performed consistently well across layers and will therefore be used as the primary activation function in our subsequent experiments. We suspect that sigmoid may require additional preprocessing to better align with our regression task, while softplus showed some promise but did not provide a significant enough improvement to justify its use moving forward.

Batch Norm vs Layer Norm

# Load and preprocess the dataset
norm_test_df <- readRDS("Project Files/norm_test_df.rds") %>%
  select(-batch_size, -dropout_rate) %>%
  mutate(norm_method = c("batch", "layer"))

# Transpose and convert back to data frame
transposed_df <- as.data.frame(t(norm_test_df), stringsAsFactors = FALSE)


# Move row names to a column
transposed_df <- tibble::rownames_to_column(transposed_df, "Metric")

# Move 'norm_method' to the first row
transposed_df <- transposed_df %>%
  arrange(ifelse(Metric == "norm_method", 1, 2))

# Ensure metric names are mapped correctly
metric_labels <- c(
  "best_loss" = "Validation Loss",
  "training_loss" = "Training Loss",
  "generalization_gap" = "Generalization Gap",
  "best_pearson" = "Best Pearson",
  "best_r2" = "Best R²",
  "val_mse" = "Validation MSE",
  "val_mae" = "Validation MAE",
  "val_rmse" = "Validation RMSE",
  "val_r2" = "Validation R²",
  "val_explained_variance" = "Explained Variance",
  "val_cosine_similarity" = "Cosine Similarity"
)

# Convert Metric to character before recoding
cleaned_df <- transposed_df %>%
  filter(!Metric %in% c("arch_name", "hidden_layers", "learning_rate", "norm_method")) %>%
  mutate(Metric = recode(Metric, !!!metric_labels))

library(flextable)

norm_test_table <- flextable(cleaned_df) %>%
  set_header_labels(Metric = "Metric", V1 = "Batch Norm", V2 = "Layer Norm") %>%
  add_header_lines("Testing Batch Normalization and Layer Normalization") %>%  # Add Title
  add_footer_lines("512->256->128 | LR: .001 | BS: 48 | DR: .30") %>%         # Add Footer
  cells_flex() %>%                                                            # Apply the 'jest' theme
  align(align = "center", part = "footer") %>%                                # Center align footer cells
  set_table_properties(layout = "autofit") %>%                                # Adjust table layout
  # Bold everything in Batch Norm column except for Generalization Gap and Validation MAE:
  bold(i = ~ !(Metric %in% c("Generalization Gap", "Validation MAE")),
       j = "V1", bold = TRUE) %>%
  # Bold only Generalization Gap and Validation MAE in the Layer Norm column:
  bold(i = ~ Metric %in% c("Generalization Gap", "Validation MAE"),
       j = "V2", bold = TRUE)

norm_test_table

Testing Batch Normalization and Layer Normalization

Metric                 Batch Norm    Layer Norm
Validation Loss        0.1218628     0.1262403
Training Loss          0.08824008    0.09527829
Generalization Gap     0.03362277    0.03096197
Best Pearson           0.8725703     0.8675238
Best R²                0.734414      0.724874
Validation MSE         0.1241583     0.1277253
Validation MAE         0.2421115     0.2389239
Validation RMSE        0.3523611     0.3573867
Validation R²          0.7294114     0.7216376
Explained Variance     0.7295272     0.7239486
Cosine Similarity      0.9434834     0.9419619

512->256->128 | LR: .001 | BS: 48 | DR: .30

With batch normalization being such a widely used feature, we incorporated it alongside ADAM in our baseline testing. We also identified layer normalization as a viable alternative and an easy swap within our workflow. However, after testing, batch normalization consistently outperformed layer normalization, leading us to maintain its use in subsequent experiments.
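The swap itself amounts to exchanging the normalization module used in each hidden block; the sketch below is illustrative, with a hypothetical layer width of 256.

library(torch)

# Batch norm normalizes each feature across the mini-batch;
# layer norm normalizes across the features of each individual cell.
norm_batch <- nn_batch_norm1d(256)
norm_layer <- nn_layer_norm(256)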

L2 Tuning

l2_test_df <- readRDS("Project Files/l2_test_df_relu.rds")

theme_set(theme_classic())

# Prepare a tidy data frame with weight_decay as a factor
plot_data <- l2_test_df %>%
  select(weight_decay, best_loss, best_pearson, best_r2, generalization_gap) %>%
  mutate(weight_decay = factor(weight_decay)) %>%
  pivot_longer(
    cols = c(best_loss, best_pearson, best_r2, generalization_gap),
    names_to = "metric",
    values_to = "value"
  )

plot_data <- plot_data %>%
  mutate(metric = recode(metric,
                         best_loss = "Validation Loss",
                         best_pearson = "Pearson",
                         best_r2 = "R²",
                         generalization_gap = "Generalization Gap"))


# Plot: weight_decay on x-axis as discrete ticks, facets for each metric
l2_p <- ggplot(plot_data, aes(x = weight_decay, y = value, group = 1)) +
  geom_point(size = 3) +
  geom_line() +
  facet_wrap(~ metric, scales = "free_y") +
  labs(
    title = "Relationship Between Weight Decay and Model Metrics",
    x = "Weight Decay",
    y = "Metric Value"
  ) +
  theme_classic()
l2_p

L2 regularization was the next parameter we examined. We evaluated our model across a weight-decay range from 0 to 1e-2, and the results were intriguing. The primary function of L2 regularization is to mitigate overfitting, which is suggested by the gradual decrease in the generalization gap. However, this benefit came at the expense of other performance metrics: our \(\mathrm{R}^2\) and Pearson’s correlation coefficients decreased precipitously, while our validation loss escalated dramatically. Given these outcomes, we opted to forgo L2 regularization and retain our previously obtained configuration.
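In torch for R, this L2 penalty is applied through the optimizer's weight_decay argument; a minimal sketch, assuming model is the tuned network object, is shown below.

# Minimal sketch: L2 regularization via Adam's weight_decay argument.
optimizer_no_l2   <- optim_adam(model$parameters, lr = 0.001, weight_decay = 0)
optimizer_with_l2 <- optim_adam(model$parameters, lr = 0.001, weight_decay = 1e-3)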

🚀 Adam Tuning

adam_test_df <- readRDS("Project Files/adam_test_df_relu.rds")

# arrange df by best_pearson

adam_test_df <- adam_test_df %>%
  arrange(desc(best_pearson)) |>
  select(
    adam_beta1,
    adam_beta2,
    training_loss,
    generalization_gap,
    best_pearson, 
    best_r2
  )

ft <- flextable(adam_test_df) %>%
  set_header_labels(
    adam_beta1 = "\u03B2\u2081",
    adam_beta2 = "\u03B2\u2082",
    training_loss = "Training Loss",
    generalization_gap = "Generalization Gap",
    best_pearson = "Pearson",
    best_r2 = "R²"
  ) %>%
  add_header_lines(paste("Tuning Adam:", "\u03B2\u2081 and \u03B2\u2082")) %>%
  add_footer_lines("512 → 256 → 128 | LR: 0.001 | BS: 48 | DR: 0.30") %>%  # Add footer with # Add footer
  cells_flex() %>%  # Apply the 'jest' theme
  align(align = "center", part = "header") %>%  # Center align header (including title)
  align(j = NULL, align = "center", part = "footer") %>%  # Center align footer
  set_table_properties(layout = "autofit")  # Adjust layout



# Print the table
ft

Tuning Adam: β₁ and β₂

β₁     β₂      Training Loss   Generalization Gap   Pearson     R²
0.90   0.990   0.07844304      0.04331907           0.8727865   0.7346336
0.90   0.999   0.08824008      0.03362277           0.8725703   0.7344140
0.85   0.990   0.09257692      0.03122787           0.8690152   0.7301818
0.85   0.999   0.09414066      0.03276612           0.8661174   0.7234213

512 → 256 → 128 | LR: 0.001 | BS: 48 | DR: 0.30

Adam’s ability to drive convergence, reduce loss, and enhance performance was unparalleled. This optimizer proved to be an invaluable addition to our Torch-based model, as well as to our previous implementations. We sought to explore the effects of adjusting its hyperparameters, beginning with the default values for \(\beta_1\) and \(\beta_2\). We then decreased one beta at a time while holding the other constant. Our findings revealed that a slight decrease in \(\beta_1\) marginally reduced model performance, whereas a decrease in \(\beta_2\) resulted in a very modest improvement. However, the enhancement was not substantial enough to warrant adopting the adjusted \(\beta_2\) as the default hyperparameter setting.
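For reference, the beta adjustments amount to changing the betas argument of the optimizer; the sketch below assumes model is the tuned network object.

# Default moment-decay terms vs. the lowered β₂ discussed above (sketch).
optimizer_default   <- optim_adam(model$parameters, lr = 0.001, betas = c(0.90, 0.999))
optimizer_low_beta2 <- optim_adam(model$parameters, lr = 0.001, betas = c(0.90, 0.990))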



Trials & Tribulations

  • Random Seed: Setting a random seed during neural network training and tuning is essential for reproducible analysis. Early in this project, when implementing a neural network in base R, we relied on the set.seed() function, which ensured reproducibility across training splits and network initialization. We originally had not set this seed, and once we realized it was affecting our findings, we re-ran our baseline tuning.

However, when transitioning to Torch, we discovered an additional seed was required to maintain consistency across training iterations. The function torch_manual_seed() was necessary to ensure that all Torch-based neural network operations produced consistent results.

This addition was crucial, particularly for hyperparameter tuning, as it allowed us to accurately interpret performance changes. By setting the seed properly, the data maintained a consistent starting state, and updates followed the same trajectory each time the model was run, ensuring fair comparisons across experiments.
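In practice this came down to two calls at the top of every tuning script; the seed value shown matches the one used elsewhere in this report.

set.seed(1986)                  # base R: data splits, sampling
torch::torch_manual_seed(1986)  # torch: weight initialization, dropout masks, shuffling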

  • Stopping Criteria: Careful selection of stopping criteria was essential in preventing overfitting, ensuring the model generalized well to unseen data. Regularization techniques, such as dropout and weight decay, further contributed to stabilizing training and improving robustness.

Our stopping criterion is early stopping based on validation-loss improvement, with a patience of 10 epochs.

The model tracks validation loss at each epoch. If the loss improves, it resets the counter and saves the best model parameters. If the loss does not improve for 10 consecutive epochs (patience = 10), training stops.
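A minimal sketch of this early-stopping loop is shown below; max_epochs, train_one_epoch(), and compute_val_loss() are hypothetical placeholders for our training and validation passes, not functions from the actual pipeline.

best_loss <- Inf
patience  <- 10
wait      <- 0

for (epoch in seq_len(max_epochs)) {
  train_one_epoch(model, train_loader, criterion, optimizer)   # placeholder training pass
  val_loss <- compute_val_loss(model, val_data, criterion)     # placeholder validation pass

  if (val_loss < best_loss) {
    best_loss  <- val_loss
    best_state <- lapply(model$parameters, function(p) p$clone())  # save best weights
    wait <- 0
  } else {
    wait <- wait + 1
    if (wait >= patience) break   # stop after 10 epochs without improvement
  }
}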

  • Best Pearson Correlation & R² Tracking: The model also tracks the best Pearson correlation and R² score on the validation set. If a new best Pearson correlation or R² is achieved, the corresponding model parameters are saved.

  • Gradient Norm Tracking: Gradient norms are monitored, ensuring training stability.



Parallel Bayesian Optimization of Hyperparameters

Toward the end of our tuning process, we discovered the ParBayesianOptimization package. The premise behind this approach is to automate the often tedious and expert‐driven process of hyperparameter tuning by leveraging Bayesian optimization with bayesOpt(). Rather than manually guessing or exhaustively searching through hyperparameter values, we define a scoring function that trains a neural network using a specified set of hyperparameters (learning rate, dropout rate, batch size, and architecture) and returns a score based on the negative validation loss. By modeling the network’s performance as a sample from a Gaussian process, bayesOpt() efficiently leverages prior evaluations to suggest promising hyperparameter combinations for subsequent trials. This methodology not only streamlines the tuning process but also systematically identifies the optimal settings for the task at hand.

The code implementation reflects this process:

  • Defining a scoring function: The function converts hyperparameters to the correct types, selects the appropriate network architecture, and adjusts the batch size based on the provided ranges.

  • Training and evaluation: It trains the model on training data and evaluates it on a validation set, returning the negative best loss as the score.

  • Optimization loop: With defined bounds for each hyperparameter, the bayesOpt() function explores the search space—initially using a set number of random points, and then iteratively refining the search with the Gaussian process model.

This setup enables us to automatically and efficiently optimize hyperparameters, ensuring that the model’s performance is maximized without resorting to ad hoc or brute-force methods.

library(ParBayesianOptimization)

#' Define Scoring Function for Hyperparameter Optimization
#'
#' Trains a neural network model with specified hyperparameters and returns a
#' score based on the negative best loss. Input hyperparameters are first
#' converted and rounded appropriately; the architecture and batch size are
#' chosen based on these values. The model is then initialized, trained, and
#' evaluated on a validation set.
#'
#' @param learning_rate Numeric; the learning rate for the optimizer.
#' @param dropout_rate Numeric; the dropout probability applied during
#'   training.
#' @param batch_size_choice Numeric; a float value that is manually rounded
#'   to select a discrete batch size.
#' @param arch_index Numeric; a float value that determines the network
#'   architecture.
#'
#' @return A list containing a single element, \code{Score}, which is the
#' negative best loss achieved during training.
#' 
scoringFunction <- function(learning_rate, dropout_rate, batch_size_choice, arch_index) {
  # Convert inputs to the proper types (if needed)
  lr <- as.numeric(learning_rate)
  dp <- round(as.numeric(dropout_rate), 2)
  bs <- as.integer(round(batch_size_choice, 0))  # provisional; remapped to a discrete batch size below
  arch_choice <- as.integer(round(arch_index, 0))
  
  # Choose the architecture based on arch_choice:
  # 1 -> (512, 256, 128)
  # 2 -> (512, 256, 128, 64)
  if (arch_choice < 1.5) {
    hidden_layers <- c(512, 256, 128)
    arch_name <- "arch1"
  } else if (arch_choice < 2.5) {
    hidden_layers <- c(512, 256, 128, 64)
    arch_name <- "arch2"
  } else {
    stop("Invalid architecture choice.")
  }
  
  # Map the continuous batch_size_choice onto one of the discrete batch sizes
  if (batch_size_choice < 1.5) {
    bs <- 32
  } else if (batch_size_choice < 2.5) {
    bs <- 64
  } else if (batch_size_choice < 3.5) {
    bs <- 80
  } else if (batch_size_choice < 4.5) {
    bs <- 128
  } else if (batch_size_choice < 5.5) {
    bs <- 140
  } else if (batch_size_choice < 6.5) {
    bs <- 160
  } else if (batch_size_choice < 7.5) {
    bs <- 180
  } else if (batch_size_choice < 8.5) {
    bs <- 200
  } else if (batch_size_choice < 9.5) {
    bs <- 220
  } else if (batch_size_choice < 10.5) {
    bs <- 240
  } else if (batch_size_choice < 11.5) {
    bs <- 256
  } else {
    stop("Invalid batch size choice.")
  }

  
  cat("%.3f selected. Batch Size: ", batch_size_choice, bs, "\n")
  
  # Initialize your model using the provided dropout and fixed architecture.
  model <- net(input_size = 639, hidden_layers = hidden_layers,
               output_size = 25, dropout_prob = dp)
  
  # Create the data loader using the selected batch size
  train_loader <- initialize_loader(bs)
  
  # Set up your optimizer (here we fix weight_decay to 0)
  optimizer <- optim_adam(model$parameters, lr = lr, weight_decay = 0)
  
  # Use your predefined loss function (MSE loss)
  criterion <- nn_mse_loss()
  
  # Create your validation data list (X_val and Y_val are assumed to be defined globally)
  val_data <- list(X = X_val, Y = Y_val)
  
  # Train the model (up to 500 epochs, with early stopping at patience = 10).
  # You may want to wrap this in tryCatch for robustness.
  res <- train_model(model, train_loader, val_data, criterion, optimizer,
                     epochs = 500, patience = 10,
                     learning_rate = lr,
                     hidden_layers = hidden_layers,
                     experiment_key = "bayesOpt_trial",
                     dropout_prob = dp,
                     arch_name = arch_name)
  
  # For this example, let's use best_loss (lower is better) and return the negative value.
  Score <- res$best_loss * -1 
  
  # Optionally, you can return additional elements.
  return(list(Score = Score))
}
# Define hyperparameter bounds
bounds <- list(
  learning_rate = c(0.0001, 0.01),   # search between 0.0001 and 0.01
  dropout_rate  = c(0, 0.5),         # search between 0% and 50% dropout
  batch_size_choice = c(1, 10),      # index mapped to discrete batch sizes (32 to 240)
  arch_index    = c(1, 2)            # 1 for (512,256,128), 2 for (512,256,128,64)
)

#' Run Bayesian Hyperparameter Optimization
#'
#' Defines hyperparameter bounds for optimizing a neural network model
#' using Bayesian optimization via the bayesOpt function from the
#' ParBayesianOptimization package. The search spans learning_rate,
#' dropout_rate, batch_size_choice, and arch_index. A seed is set for
#' reproducibility, and additional iterations can be added after the
#' initial run. The best hyperparameter combination is extracted at the end.
#'
#' @return A list containing the best hyperparameter settings.
#' 
set.seed(1986)

Results <- bayesOpt(
  FUN = scoringFunction,
  bounds = bounds,
  initPoints = 10,
  iters.n = 200,
  gsPoints = 50,
  verbose = 1
)

# Optionally, add more iterations
Results <- addIterations(Results, iters.n = 10)

# Get the best hyperparameters
best_params <- getBestPars(Results, N = 1)
print(best_params)
score_summary <- readRDS("Project Files/par_bayes_opt_score_summary_df.rds")

color_map <- list(
  "arch8" = "#00B4da",
  "arch9" = "#DC57BB"
)

max_score <- min(score_summary$Score)                          # best (lowest) loss recorded
max_point <- score_summary[which.min(score_summary$Score), ]   # row corresponding to the best score

Iterations

p1 <- plot_ly(score_summary, 
              x = ~Iteration, 
              y = ~Score, 
              type = 'scatter', 
              mode = 'lines+markers',  # Explicitly setting mode first
              line = list(color = 'black', width = 1), 
              marker = list(size = 6, color = '#00B4da'),  
              showlegend = FALSE,  
              text = ~paste(
                "Architecture: ", arch_name, "<br>",
                "Learning Rate: ", round(learning_rate, 4), "<br>",
                "Dropout Rate: ", round(dropout_rate, 2), "<br>",
                "Score: ", round(Score, 4), "<br>",
                "Batch Size: ", batch_size
              ),
              hoverinfo = "text") %>%
  
  add_trace(
    data = max_point,
    x = ~Iteration,
    y = ~Score,
    type = 'scatter',
    mode = 'markers',  
    name = "Max Score",
    marker = list(color = 'gold', size = 12, line = list(color = 'black', width = 2)),
    text = ~paste(
      "🏆 Highest Score!<br>",
      "Architecture: ", arch_name, "<br>",
      "Learning Rate: ", round(learning_rate, 4), "<br>",
      "Dropout Rate: ", round(dropout_rate, 2), "<br>",
      "Score: ", round(Score, 4), "<br>",
      "Batch Size: ", batch_size
    ),
    hoverinfo = "text"
  ) %>%
  
  layout(
    title = list(text = "Training Loss over 200 Iterations", y = 0.98),
    xaxis = list(title = "Iteration"),
    yaxis = list(title = "Training Loss", autorange = "reversed"),  
    legend = list(
      title = list(text = "Architecture"),
      orientation = "h",
      xanchor = "center",
      x = 0.5,
      y = -0.15
    )
  )
p1

We gave the model 200 iterations to train, providing it with two neural network architecture options, a learning rate range from 0.0001 to 0.01, a dropout probability range from 0 to 0.5, and a batch size range from 32 to 256. This setup allowed the model to explore a multidimensional space and identify the optimal combination of parameters. Over the course of 200 runs, we observed a steady decrease in training loss, indicating that the Gaussian process was learning effectively. The best score was achieved on iteration 51 with a learning rate of 0.0013, a dropout rate of 0.19, and a batch size of 32, using the architecture 639->512->256->128->25, as indicated by the large gold point on the plot. The following plots explore the Gaussian process by dimension.

Learning Rate

p2 <- plot_ly(score_summary,
              x = ~learning_rate,
              y = ~Score,
              type = 'scatter',
              mode = 'markers',
              color = ~arch_name,
              colors = c("#00B4da", "#DC57BB"),
              marker = list(size = 6),
              text = ~paste(
                "Architecture: ", arch_name, "<br>",
                "Learning Rate: ", round(learning_rate, 4), "<br>",
                "Dropout Rate: ", round(dropout_rate, 2), "<br>",
                "Score: ", round(Score, 4), "<br>",
                "Batch Size: ", batch_size
              ),
              hoverinfo = "text",
              showlegend = TRUE) %>%
  add_trace(data = max_point,
            x = ~learning_rate,
            y = ~Score,
            type = 'scatter',
            mode = 'markers',
            name = "Best Score",
            showlegend = TRUE,
            marker = list(color = "#DC57BB", size = 12,
                          line = list(color = 'black', width = 1)),
            text = ~paste(
              "🏆 Highest Score!<br>",
              "Architecture: ", arch_name, "<br>",
              "Learning Rate: ", round(learning_rate, 4), "<br>",
              "Dropout Rate: ", round(dropout_rate, 2), "<br>",
              "Score: ", round(Score, 4), "<br>",
              "Batch Size: ", batch_size
            ),
            hoverinfo = "text") %>%
  layout(
    title = list(text = "Training Loss for 200 Learning Rate Tests", y = 0.98),
    xaxis = list(title = "Learning Rate"),
    yaxis = list(title = "Training Loss", autorange = "reversed"),
    legend = list(
      title = list(text = "Architecture"),
      orientation = "h",
      xanchor = "center",
      x = 0.5,
      y = -0.15
    )
  )
p2

With learning rate, the pattern is not as clear, as the points do not cluster heavily within a given region—perhaps due to the wide range of values. This dimension might have been best explored independently at first, to iteratively narrow the range and identify the optimal rate. Nonetheless, a global optimum was achieved.

Dropout Rate

p3 <- plot_ly(
  data = score_summary,
  x = ~dropout_rate,
  y = ~Score,
  type = 'scatter',
  mode = 'markers',
  color = ~arch_name,                       
  colors = c("#00B4da", "#DC57BB"),
  marker = list(size = 6),
  text = ~paste(
    "Architecture: ", arch_name, "<br>",
    "Learning Rate: ", round(learning_rate, 4), "<br>",
    "Dropout Rate: ", round(dropout_rate, 2), "<br>",
    "Score: ", round(Score, 4), "<br>",
    "Batch Size: ", batch_size
  ),
  hoverinfo = "text"
) %>%
  add_trace(
    data = max_point,
    x = ~dropout_rate,
    y = ~Score,
    type = 'scatter',
    mode = 'markers',
    showlegend = TRUE,
    name = "Best Score",
    marker = list(
      color = "#DC57BB",      # Dynamically pick the color from color_map
      size = 12,
      line = list(color = 'black', width = 1)
    ),
    text = ~paste(
      "🏆 Highest Score!<br>",
      "Architecture: ", arch_name, "<br>",
      "Learning Rate: ", round(learning_rate, 4), "<br>",
      "Dropout Rate: ", round(dropout_rate, 2), "<br>",
      "Score: ", round(Score, 4), "<br>",
      "Batch Size: ", batch_size
    ),
    hoverinfo = "text"
  ) %>%
  layout(
    title = list(text = "Training Loss for 200 Dropout Rate Tests", y = 0.98),
    xaxis = list(title = "Dropout Rate"),
    yaxis = list(title = "Training Loss", autorange = "reversed"),
    legend = list(title = list(text = "Architecture"),
      orientation = "h",   # Place legend horizontally
      xanchor = "center",
      x = 0.5,
      y = -0.15  )
  )

p3

Here, our earlier analysis of dropout is vindicated: we observe the same pattern, with the shallower network penalized more by lower dropout rates and the deeper network penalized more by higher dropout rates. We also observe clearer clustering and a more distinct bifurcation of the two architectures with respect to dropout rate. Moreover, the shallower network exhibits more data points above the 0.125 loss threshold, indicating that although the deeper network achieved our best loss score, the shallower network may more consistently produce a lower loss in the long run.

Batch Size

# Create a new grouping variable combining batch_size and arch_name
score_summary2 <- score_summary %>%
  filter(!is.na(batch_size), !is.na(Score), !is.na(arch_name)) %>%
  mutate(
    batch_size_factor = factor(batch_size),
    batch_arch = interaction(batch_size_factor, arch_name, sep = " - ")
  )

# Plot using the combined grouping variable
ggplot(score_summary2, aes(
    x = Score,
    y = batch_arch,
    fill = arch_name
  )) +
  geom_density_ridges_gradient(
    scale = 3,
    rel_min_height = 0.01
  ) +
   scale_fill_manual(
    values = c(
      "512->256->128" = "#00B4da",  # Architecture 1 color
      "512->256->128->64" = "#DC57BB"   # Architecture 2 color
    ),
    name = "Architecture" 
   ) +
  # Modify y-axis: extract the batch_size part from "batch - arch"
  scale_y_discrete(
    labels = function(x) sapply(strsplit(x, " - "), `[`, 1)
  ) +
  labs(
    title = "Loss Distributions by Batch Size and Architecture",
    subtitle = "Comparing loss distributions for each architecture within each batch size",
    x = "Loss (Score)",
    y = "Batch Size"
  ) +
  theme_ipsum() +
  theme(
    legend.position = "right",
    panel.spacing = unit(0.1, "lines")
  )

With batch size, the pattern is more subtle, but the Gaussian process yielded results consistent with our previous analysis. Within the loss distribution for a given batch size, the best score was an outlier at the lowest batch size; overall, the greatest improvement in our metrics came from smaller batches, though we also saw some success with batch sizes around 140. Given the effects of randomized initialization, backpropagation, and potential neuron saturation, that single best score might not be reproducible under a suboptimal hyperparameter range.

Network Architecture

p5 <- plot_ly(
  data = score_summary,
  y = ~arch_name,
  x = ~Score,
  type = 'box',
  orientation = 'h',
  boxpoints = FALSE,
  line = list(color = "#00B4da"),
  text = ~paste(
    "Architecture: ", arch_name, "<br>",
    "Learning Rate: ", round(learning_rate, 4), "<br>",
    "Dropout Rate: ", round(dropout_rate, 2), "<br>",
    "Score: ", round(Score, 4), "<br>",
    "Batch Size: ", batch_size
  ),
  hoverinfo = "text",
  name = "All Architectures"
) %>%
  add_trace(
    data = max_point,
    y = ~arch_name,
    x = ~Score,
    type = 'scatter',
    mode = 'markers',
    orientation = 'h',
    showlegend = TRUE,
    name = "Best Score",
    marker = list(
      color = "#DC57BB",
      size = 12,
      line = list(color = 'black', width = 1)
    ),
    text = ~paste(
      "🏆 Highest Score!<br>",
      "Architecture: ", arch_name, "<br>",
      "Learning Rate: ", round(learning_rate, 4), "<br>",
      "Dropout Rate: ", round(dropout_rate, 2), "<br>",
      "Score: ", round(Score, 4), "<br>",
      "Batch Size: ", batch_size
    ),
    hoverinfo = "text"
  ) %>%
  layout(
    title = list(text = "Training Loss Distribution by Architecture", y = 0.98),
    xaxis = list(
      title = "Score",
      autorange = "reversed",
      automargin = TRUE,
      tickfont = list(size = 10)
    ),
    yaxis = list(
      title = "Architecture",
      tickangle = 90
    ),
    legend = list(
      orientation = "h",
      xanchor = "center",
      x = 0.5,
      y = -0.15
    ),
    margin = list(l = 60, r = 20, b = 60, t = 60),
    boxmode = "group",
    boxgap = 0.2,
    boxgroupgap = 0.1
  )

p5

Lastly, we had the Bayesian optimizer examine the effect of architecture on our loss metric. As in the dropout rate analysis, the deeper network produced our single best score; however, the bottom quartile of the shallower network spans a much lower loss range. So although the deeper network attained the best score, the shallower network reduced loss more consistently and may be the better choice for reliably optimizing the desired outcome. This further validates our selection of the shallower model in our analysis.
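As a quick check of this consistency claim, the per-architecture quartiles can be summarized directly from score_summary (the same data frame used in the plots above; we assume here that each row is one trial and that Score records its loss).

score_summary %>%
  filter(!is.na(Score), !is.na(arch_name)) %>%
  group_by(arch_name) %>%
  summarise(
    q25    = quantile(Score, 0.25),  # bottom-quartile loss per architecture
    median = median(Score),
    .groups = "drop"
  )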

Ensemble Models

One interesting approach we explored was testing ensemble models. We trained several distinct models independently and then combined their predictions using a weighted average. This strategy enabled us to assess each model’s accuracy and level of detail separately, and subsequently tailor the weighting to achieve a more precise overall prediction.
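The weighting itself is simple; a minimal sketch is shown below, where pred_a and pred_b stand in for prediction matrices (cells by 25 ADT proteins) from two independently trained models, and the weights are illustrative values that would be tuned against each model's validation performance.

# Placeholder prediction matrices standing in for two trained models' output.
pred_a <- matrix(rnorm(4000 * 25), nrow = 4000, ncol = 25)
pred_b <- matrix(rnorm(4000 * 25), nrow = 4000, ncol = 25)

# Illustrative weights; in practice, chosen from each model's validation accuracy.
w_a <- 0.7
w_b <- 0.3
stopifnot(abs(w_a + w_b - 1) < 1e-8)  # keep the blend a convex combination

ensemble_pred <- w_a * pred_a + w_b * pred_b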

Key Insights and Future Considerations

In our initial analysis, the deepest and widest architecture produced the best results. However, after further analysis and a better understanding of overfitting, training/validation loss convergence, and model generalization, we discovered that more is not always better. The more complex models tended to overfit and miss the intrinsic structure of our data, whereas the simpler model better captured the underlying patterns. This insight underscores the need to prioritize generalizability in our future statistical and machine learning models.

To enhance our analysis, the composite score could have assigned a higher weight to the gap between training and validation loss. As noted, the deeper architectures performed poorly on Kaggle and exhibited larger disparities between training and validation loss, so a composite score that weighted this gap more heavily would have reflected generalization more accurately.
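We do not reproduce our exact composite score here, but the hypothetical sketch below illustrates the idea: up-weighting the train/validation gap penalizes architectures that fit the training data well yet generalize poorly, so a deeper network with a lower validation loss but a larger gap can end up scoring below a shallower, more consistent one.

# Hypothetical composite score: higher is better, and the generalization gap
# (|validation loss - training loss|) is weighted by w_gap.
composite_score <- function(val_loss, train_loss, w_gap = 2) {
  gap <- abs(val_loss - train_loss)
  -(val_loss + w_gap * gap)
}

# Illustrative numbers only: the deeper net has the lower validation loss but
# the larger gap, and scores worse once the gap is up-weighted.
composite_score(val_loss = 0.118, train_loss = 0.080)  # deeper net    -> -0.194
composite_score(val_loss = 0.125, train_loss = 0.115)  # shallower net -> -0.145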

Furthermore, tools such as the ParBayesianOptimization (Parallel Bayesian Optimization) package could have helped us map the hyperparameter space from the outset and guided our decision-making during testing and analysis. This would likely have accelerated hyperparameter tuning by narrowing our search ranges, much as we saw in the dropout rate analysis.
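A minimal sketch of that workflow, assuming the ParBayesianOptimization package, is shown below. The objective function here is a cheap stand-in for our real training routine, which would train a network with the proposed hyperparameters and return the negated validation loss (bayesOpt() maximizes Score).

library(ParBayesianOptimization)

# Stand-in objective; replace the body with model training + validation loss.
score_candidate <- function(learning_rate, dropout_rate, batch_size) {
  fake_loss <- (learning_rate - 0.001)^2 * 1e4 +
               (dropout_rate  - 0.3)^2 +
               batch_size / 1e4
  list(Score = -fake_loss)  # bayesOpt() maximizes, so negate the loss
}

bounds <- list(
  learning_rate = c(1e-4, 1e-2),
  dropout_rate  = c(0.1, 0.5),
  batch_size    = c(64L, 512L)  # integer bounds yield integer candidates
)

set.seed(1986)
opt_obj <- bayesOpt(
  FUN        = score_candidate,
  bounds     = bounds,
  initPoints = 6,   # random points used to seed the Gaussian process
  iters.n    = 10   # subsequent Bayesian-optimization iterations
)

getBestPars(opt_obj)  # the single best hyperparameter combination found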

The package also offers additional tools, such as the getLocalOptimums() function. Rather than seeking a single global optimum across a multidimensional hyperparameter space, this function identifies local optima, yielding several promising models and hyperparameter combinations instead of one solution. This matches our testing results, which indicated that there was never a single clearly optimal configuration, only a range of viable ones capable of producing good outcomes.

Additionally, using the getLocalOptimums() function would have enabled us to quickly converge on several models suitable for ensemble learning. This approach would allow us to combine the strengths of the best-performing models to generate robust predictions that generalize well and appropriately address the complexity of the data.
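Continuing the sketch above, getLocalOptimums() can be called on the same optimization object; each row of the result is a locally optimal hyperparameter combination, and the top few could each be trained and then blended with the weighted-average approach described earlier.

# Local optima found across the hyperparameter space (one row per optimum).
local_opts <- getLocalOptimums(opt_obj)
head(local_opts)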



References


  1. D. Jurafsky and J. H. Martin, Speech and Language Processing (3rd ed. draft), draft of Jan. 12, 2025, Chapter 7. Accessed Feb. 24, 2025.

  2. J. Krohn, G. Beyleveld, and A. Bassens, Deep Learning Illustrated: A Visual, Interactive Guide to Artificial Intelligence. Addison-Wesley, 2019, p. 147.

  3. J. Krohn, G. Beyleveld, and A. Bassens, Deep Learning Illustrated: A Visual, Interactive Guide to Artificial Intelligence. Addison-Wesley, 2019, p. 143.

  4. J. Krohn, G. Beyleveld, and A. Bassens, Deep Learning Illustrated: A Visual, Interactive Guide to Artificial Intelligence. Addison-Wesley, 2019, p. 138.

  5. “Lesson 11.4 - Principal Components,” STAT 505 Applied Multivariate Statistical Analysis, PennState Eberly College of Science. [Online]. Available: https://online.stat.psu.edu/stat505/lesson/11/11.4. [Accessed: Feb. 22, 2025].

  6. J. Krohn, G. Beyleveld, and A. Bassens, Deep Learning Illustrated: A Visual, Interactive Guide to Artificial Intelligence. Addison-Wesley, 2019, pp. 124-125.