1 Introduction

1.1 Explanation of Project

This project focused on analyzing over 10,000 wines across 64 grape varietals, distinct types of wine made primarily from a single grape variety. Each wine was associated with climate data from the 2022 vintage year, specifically daily minimum and maximum temperatures, rainfall, and sunshine levels spanning March 1 to November 1.

The objective was to build a machine learning model that predicts the probability of specific tasting note keywords appearing in a wine review, using only climate data and the varietal type as inputs. By modeling these relationships, the project explores how environmental factors influence the language used to describe wine—offering winemakers insights into how certain climate conditions may shape the sensory perception and market reception of their wines.

Students were evaluated based on the mean absolute error (MAE) of their model’s predicted keyword probabilities compared to the true probabilities from smoothed word count data.
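
For reference, the evaluation metric can be written as follows (our notation), with the average taken over every wine and keyword pair:

\[ \mathrm{MAE} = \frac{1}{NK} \sum_{i=1}^{N} \sum_{k=1}^{K} \left| \hat{p}_{ik} - p_{ik} \right| \]

where N is the number of wines, K is the number of keywords, and the two terms inside the absolute value are the predicted and smoothed true probabilities for wine i and keyword k.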

2 Exploratory Data Analysis

2.1 General Summary Statistics

The following table provides key insights into the wine IDs, climate modalities, and varietals for better context and understanding:

library(dplyr)
library(officer)


# Load data (replace with your actual paths)
train_climate <- read.csv("LGBM/data/climate_train.csv")
test_climate <- read.csv("LGBM/data/climate_test.csv")
train_words <- read.csv("LGBM/data/words_train2.csv")

# Core values
n_wines <- as.integer(nrow(train_climate))
n_test <- as.integer(nrow(test_climate))
n_varietals <- as.integer(length(unique(train_climate$varietal)))
n_words <- as.integer(ncol(train_words) - 1)
n_days <- as.integer(sum(grepl("^maxtemp_", colnames(train_climate))))
n_modalities <- 4  # Assuming maxtemp, mintemp, rain, sunshine
total_climate_features <- as.integer(n_days * n_modalities)
# Average number of words with probability > 0.1 per wine, and overall sparsity (%)
avg_words_per_wine <- mean(rowSums(train_words[, -1] > 0.1))
sparsity <- round(mean(train_words[, -1] == 0) * 100, 1)

# Extract all climate-related date columns
climate_cols <- grep("^(maxtemp_|mintemp_|rain_|sunshine_)", colnames(train_climate), value = TRUE)


library(dplyr)
library(gt)

style_table <- function(tbl) {
  tbl %>%
    gt() %>%
    fmt_number(columns = "Value", decimals = 6, use_seps = FALSE) %>%
    opt_table_outline(style = "solid", width = px(1.5), color = "#800020") %>%  # Burgundy border
    tab_options(
      table.background.color = "#FFFFFF",          # Clean white background
      table.border.top.color = "#800020",
      table.border.bottom.color = "#800020",
      column_labels.background.color = "#f3e5e5",  # Light burgundy-pink
      column_labels.font.weight = "bold",
      table.font.color = "#2c2c2c",
      table.font.names = "Arial",
      data_row.padding = px(6),
      heading.border.bottom.color = "#800020",
      column_labels.border.top.color = "#800020",
      column_labels.border.bottom.color = "#800020",
      table.width = px(600),                       # Wider table
      table.font.size = "13px",                    # Larger font
      table.align = "center"
    ) %>%
    cols_width(
      "Parameter" ~ px(350),                       # Wider columns
      "Value" ~ px(200)
    )
}

# ---- Your data setup ----
summary_table <- tibble(
  Metric = c(
    "Number of training wine samples",
    "Number of testing wine samples",
    "Number of unique tasting note keywords",
    "Number of climate days per modality",
    "Min climate date",
    "Max climate date"
  ),
  Value = c(
    9802,
    1001,
    n_words,
    n_days,
    "March 1, 2022",
    "November 1, 2022"
  )
)

# Rename column to match expected input
summary_table_renamed <- summary_table %>%
  rename(Parameter = Metric)

# Apply styling
styled_summary_table <- style_table(summary_table_renamed) %>%
  tab_header(title = "Climate & Word Data Summary")

# Return the styled table (no need for div() unless in HTML context)
styled_summary_table
Climate & Word Data Summary
Parameter Value
Number of training wine samples 9802
Number of testing wine samples 1001
Number of unique tasting note keywords 1416
Number of climate days per modality 246
Min climate date March 1, 2022
Max climate date November 1, 2022

2.2 Distribution Analysis in Conjunction With Normalization

As previously noted, the time series data includes four climate modalities, along with word probability training data for the corresponding wines. With this in mind, we examined the distributions of both datasets before and after normalization.

# Load necessary libraries
library(ggplot2)
library(dplyr)
library(gridExtra)
library(scales)
library(stringr)

# Set larger plot size for interactive viewing
options(repr.plot.width = 14, repr.plot.height = 12)  # For Jupyter/RStudio

# Modified theme with matching font sizes
theme_burgundy_min <- theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "plain", size = 12, hjust = 0.5, family = "Arial", color = "#2c2c2c"),
    axis.title = element_text(size = 12, family = "Arial", color = "#3c3c3c"),
    axis.text = element_text(size = 12, family = "Arial", color = "#4a4a4a"),
    panel.grid.major = element_line(color = "#e5e5e5", size = 0.4),
    panel.grid.minor = element_blank(),
    plot.margin = margin(15, 15, 15, 15)
  )

# Assuming 'train_climate' contains the data and 'dates' is already created
modalities <- c("maxtemp", "mintemp", "rain", "sunshine")

# Extract the climate data by modality
climate_data <- list()
for (mod in modalities) {
  climate_data[[mod]] <- train_climate[, grep(mod, names(train_climate))]
}

dates <- sort(unique(sub(".*_(.*)", "\\1", names(train_climate)[grepl("_", names(train_climate))])))
date_objs <- as.Date(dates, format = "%Y.%m.%d")

# 1. Distribution of Climate Modalities (Across All Wines)
plot_list <- list()

for (i in seq_along(modalities)) {
  mod <- modalities[i]
  climate_values <- as.vector(as.matrix(climate_data[[mod]]))
  
  x_label <- case_when(
    mod %in% c("maxtemp", "mintemp") ~ "Temperature (°C)",
    mod == "rain" ~ "Daily Rainfall (mm)",
    mod == "sunshine" ~ "Sunshine Duration (minutes)",
    TRUE ~ "Value"
  )

  p <- ggplot(data.frame(value = climate_values), aes(x = value)) +
    geom_histogram(bins = 50, fill = "#b03a2e", color = "#800020", alpha = 0.75) +
    labs(
      title = str_wrap(str_to_title(mod), width = 20),
      x = x_label,
      y = "Frequency"
    ) +
    theme_burgundy_min

  plot_list[[i]] <- p
}

# Arrange and save non-normalized plots
non_norm_plot <- grid.arrange(grobs = plot_list, ncol = 2)
ggsave("non_normalized_climate_plots.png", plot = non_norm_plot, width = 10, height = 8, units = "in", dpi = 300)

# 1.5. Distribution of Climate Modalities After Normalization
normalize_data <- function(x) {
  return((x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE))
}

log_transform <- function(x) {
  return(log1p(x))
}

# Apply transformations
norm_data <- list()
for (mod in modalities) {
  climate_values <- as.vector(as.matrix(climate_data[[mod]]))
  
  if (mod %in% c("maxtemp", "mintemp")) {
    norm_data[[mod]] <- normalize_data(climate_values)
  } else if (mod == "sunshine") {
    norm_data[[mod]] <- normalize_data(log_transform(climate_values))
  } else if (mod == "rain") {
    norm_data[[mod]] <- log_transform(climate_values)  # only log, no standard scale
  }
}

# Plot
norm_plot_list <- list()
for (i in seq_along(modalities)) {
  mod <- modalities[i]
  
  transformation <- case_when(
    mod == "rain" ~ "Log Transformed",
    mod == "sunshine" ~ "Standard Scaled + Log Transformed",
    TRUE ~ "Standard Scaled"
  )

  p <- ggplot(data.frame(value = norm_data[[mod]]), aes(x = value)) +
    geom_histogram(bins = 50, fill = "#c97c7c", color = "#800020", alpha = 0.75) +
    scale_y_log10() +
    labs(
      title = str_wrap(paste0(str_to_title(mod), " — ", transformation), width = 20),
      x = "Normalized Values",
      y = "Log Frequency"
    ) +
    theme_burgundy_min

  norm_plot_list[[i]] <- p
}

# norm_plot <- grid.arrange(grobs = norm_plot_list, ncol = 2)
# ggsave("images/normalized_climate_plots.png", plot = norm_plot, width = 10, height = 8, units = "in", dpi = 300)

3 Feature Engineering

This project involved extensive feature engineering, which was critical not only for optimizing model performance and efficiency but also for enhancing interpretability. Understanding which features influenced tasting note prediction probabilities allowed us to better contextualize the patterns observed in the machine learning outputs.

The table below summarizes the engineered features that were evaluated across the models used in this analysis:

# Load necessary libraries
library(gt)
library(dplyr)

# Define the style_table function (using gradient colors)
style_table <- function(tbl) {
  tbl %>%
    opt_table_outline(style = "solid", width = px(1.5), color = "#800020") %>%  # Burgundy border
    tab_options(
      table.background.color = "#FFFFFF",          # Clean white background
      table.border.top.color = "#800020",
      table.border.bottom.color = "#800020",
      column_labels.background.color = "#f3e5e5",  # Light burgundy-pink
      column_labels.font.weight = "bold",
      table.font.color = "black",
      table.font.names = "Arial",
      data_row.padding = px(6),
      heading.border.bottom.color = "#800020",
      column_labels.border.top.color = "#800020",
      column_labels.border.bottom.color = "#800020",
      table.width = px(1100),                      # Wider table
      table.font.size = "15px"                     # Slightly larger font for table
    ) %>%
    cols_width(
      Feature_Group ~ px(200),                     # Increased width
      Feature_Name ~ px(240),                      # Increased width
      Description ~ px(450),                       # Increased width
      Normalization ~ px(160)                      # Increased width
    ) %>%
    tab_style(
      style = cell_fill(color = "#c97c7c"),  # Light burgundy for Daily Climate
      locations = cells_body(rows = 1:4, columns = everything())
    ) %>%
    tab_style(
      style = cell_fill(color = "#d19b9b"),  # Lighter burgundy-pink for Varietal
      locations = cells_body(rows = 5, columns = everything())
    ) %>%
    tab_style(
      style = cell_fill(color = "#dababa"),  # Light pink-gray for Moving Average
      locations = cells_body(rows = 6:9, columns = everything())
    ) %>%
    tab_style(
      style = cell_fill(color = "#e3d8d8"),  # Very light pink-gray for Engineered
      locations = cells_body(rows = 10:20, columns = everything())
    ) %>%
    tab_style(
      style = cell_fill(color = "#ece6e6"),  # Near-white pink-gray for Word-Cluster and DistilBERT
      locations = cells_body(rows = 21:25, columns = everything())
    ) %>%
    tab_style(
      style = cell_fill(color = "#f3e5e5"),  # Light burgundy-pink for headers
      locations = cells_column_labels()
    ) %>%
    tab_style(
      style = cell_text(weight = "bold"),
      locations = cells_column_labels()
    ) %>%
    tab_style(
      style = cell_text(align = "left"),
      locations = cells_body(columns = c("Feature_Group", "Feature_Name", "Description", "Normalization"))
    ) %>%
    tab_style(
      style = cell_text(size = px(17)),  # Larger font for footnote
      locations = cells_footnotes()
    )
}

# Create the data frame for the table
table_data <- data.frame(
  Feature_Group = c(
    "Daily Climate Features\n(samples, days)", rep("", 3),
    "Varietal Features", 
    "Moving Average Features", rep("", 3),
    "Engineered Features", rep("", 10),
    "Word-Cluster Features", "", "",
    "", "DistilBERT Features"
  ),
  Feature_Name = c(
    "Maxtemp_[date]", "Mintemp_[date]", "Rain_[date]", "Sunshine_[date]",
    "Varietal_Embedding",
    "Maxtemp_MA_7/14/30", "Mintemp_MA_7/14/30", "Rain_MA_7/14/30", "Sunshine_MA_7/14/30",
    "Rainfall_Type_Dry", "Rainfall_Type_Moderate", "Rainfall_Type_Wet", "Is_Desert",
    "Num_Rainy_Days", "Max_Temp_Drop_Weekly", "Rain_Avg_Spring", "Rain_Avg_Summer",
    "Rain_Avg_Fall", "Rain_Spring_Interaction", "Rain_Maxtemp_Interaction",
    "Word_Cluster_0_prob to Word_Cluster_8_prob", "Word_Entropy",
    "[word]_prob (7 words)",
    "[word]_[climate_feature]_ratio (5 words)",
    "distilbert_pca_feature_0 to distilbert_pca_feature_4"
  ),
  Description = c(
    "Daily maximum temperature (°C) for each day",
    "Daily minimum temperature (°C) for each day",
    "Daily rainfall (mm) for each day",
    "Daily sunshine duration (minutes) for each day",
    "Integer encoding of varietal (for neural embedding)",
    "7/14/30-day moving average of max temperature (°C)",
    "7/14/30-day moving average of min temperature (°C)",
    "7/14/30-day moving average of rainfall (mm)",
    "7/14/30-day moving average of sunshine (minutes)",
    "Indicator for total rain < 400 mm",
    "Indicator for total rain 400-800 mm",
    "Indicator for total rain > 800 mm",
    "Indicator for avg max temp > 30°C and total rain < 250 mm",
    "Count of days with rain > 0.1 mm",
    "Largest 7-day temperature decrease (°C)",
    "Average rainfall (mm) for spring days (Mar-May)",
    "Average rainfall (mm) for summer days (Jun-Aug)",
    "Average rainfall (mm) for fall days (Sep-Nov)",
    "Product of num_rainy_days and avg spring rainfall",
    "Product of num_rainy_days and last day max temp",
    paste(
      "9 probabilities from K-means clustering of 1416 tasting note words",
      "(words_train2.csv). Clusters group words by co-occurrence.",
      "Probabilities are averages for each varietal’s samples.",
      "Test rows use training varietal probabilities.",
      "See following section for details."
    ),
    paste(
      "Shannon entropy of word probabilities per sample",
      "(words_train2.csv, smoothing=0.001, threshold=0.01).",
      "Test samples use training mean entropy.",
      "See following section for details."
    ),
    paste(
      "Probabilities of top 7 most frequent words from words_train2.csv,",
      "averaged by varietal in training. Test wines use training varietal averages."
    ),
    paste(
      "Ratios of top 5 word probabilities to climate features (e.g., rain_avg_spring),",
      "averaged by varietal in training, clipped to [0, 10].",
      "Test wines use training varietal averages."
    ),
    paste(
      "5 PCA components from DistilBERT embeddings of varietal-level pseudo-texts",
      "(words_train2.csv). Test wines use training varietal embeddings.",
      "See following section for details."
    )
  ),
  Normalization = c(
    "StandardScaler", "StandardScaler", "log1p", "log1p + StandardScaler",
    "None (integer)",
    "StandardScaler", "StandardScaler", "log1p + StandardScaler", "log1p + StandardScaler",
    "None (0 or 1)", "None (0 or 1)", "None (0 or 1)", "None (0 or 1)",
    "StandardScaler", "StandardScaler", "log1p + StandardScaler", "log1p + StandardScaler",
    "log1p + StandardScaler", "StandardScaler", "StandardScaler",
    "None (raw probabilities)", "None (raw entropy)",
    "None (raw probabilities)",
    "None (raw ratios)",
    "None"
  )
)

# Create the gt table
feature_table <- gt(table_data) %>%
  tab_header(
    title = "Feature Summary for Wine Dataset"
  ) %>%
  style_table() %>%
  tab_footnote(
    footnote = paste(
      "Note: Two CSVs are generated: 'wine_features_raw.csv' with raw values and",
      "'wine_features_normalized.csv' with normalized values as specified in the 'Normalization' column.",
      "Daily climate, moving average, and engineered features are raw in the raw CSV and normalized in the normalized CSV.",
      "Word-cluster probabilities, word entropy, top word probabilities, and interaction ratios remain unscaled (raw) in both CSVs."
    ),
    locations = cells_body(columns = "Feature_Group", rows = nrow(table_data))
  ) %>%
tab_source_note(
  source_note = md(
    paste(
      "<span style='font-size:17px; font-weight:bold'>Feature Categories</span>: ",
      "<u style='text-decoration-color:#c97c7c; text-decoration-thickness: 5px'><span style='color:black; font-size:17px'>Daily Climate Features</span></u>, ",
      "<u style='text-decoration-color:#d19b9b; text-decoration-thickness: 5px'><span style='color:black; font-size:17px'>Varietal Features</span></u>, ",
      "<u style='text-decoration-color:#dababa; text-decoration-thickness: 5px'><span style='color:black; font-size:17px'>Moving Average Features</span></u>, ",
      "<u style='text-decoration-color:#e3d8d8; text-decoration-thickness: 5px'><span style='color:black; font-size:17px'>Engineered Features</span></u>, ",
      "<u style='text-decoration-color:#ece6e6; text-decoration-thickness: 5px'><span style='color:black; font-size:17px'>Word-Cluster Features</span></u>, ",
      "<u style='text-decoration-color:#ece6e6; text-decoration-thickness: 5px'><span style='color:black; font-size:17px'>DistilBERT Features</span></u>"
    )
  )
)

# Save the table as an image
gtsave(feature_table, filename = "feature_summary_table.png", path = "images", vwidth = 1800, vheight = 1600)

3.1 Word Cluster Probabilities

The first set of features crucial to the performance of both primary machine learning models was the n_word_cluster_probabilities. These features used K-means clustering combined with varietal-level aggregation, effectively connecting similarities in word review probabilities to specific grape varietals.

Steps of computation:

  • Step 1: We reference the training data to review word probabilities for specific Wine IDs. Varietals (Cabernet Sauvignon, Pinot Noir, Chardonnay) are identified from associated climate data.

  • Step 2: We use KMeans to group the probabilities based on distributions. This organizes descriptors like Cherry, Oak, and Lemon into distinct clusters.

  • Step 3: Average probabilities for each cluster are calculated for the varietals. This produces a matrix of varietal-cluster probability values.

  • Step 4: A sample’s cluster probabilities are assigned based on its varietal’s matrix row. For example, a Pinot Noir sample receives its varietal’s cluster values.

The scree plot was generated to evaluate the optimal number of word clusters for KMeans clustering by plotting the inertia (within-cluster sum of squares) against the number of clusters (1 to 20). This visualization helps identify the “elbow point” where adding more clusters yields diminishing reductions in inertia, justifying the selection of 9 clusters for the wine dataset.
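
A minimal sketch of this workflow in Python, assuming scikit-learn and pandas; the file paths and the varietal column follow the data-loading code in Section 2.1, and the aggregation details are our reading of the steps above rather than the project code:

import pandas as pd
from sklearn.cluster import KMeans

words = pd.read_csv("LGBM/data/words_train2.csv", index_col=0)     # wines x 1416 word probabilities
climate = pd.read_csv("LGBM/data/climate_train.csv", index_col=0)  # assumes wine IDs align with words

# Scree / elbow search: cluster the words by their probability profiles across wines
word_profiles = words.T.values                                      # 1416 words x n_wines
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(word_profiles).inertia_
            for k in range(1, 21)]

# Fit the chosen 9-cluster solution, then aggregate: per-wine cluster probability,
# followed by the per-varietal averages used to impute the test wines
labels = KMeans(n_clusters=9, n_init=10, random_state=0).fit_predict(word_profiles)
cluster_probs = pd.DataFrame(
    {f"word_cluster_{c}_prob": words.loc[:, labels == c].mean(axis=1) for c in range(9)}
)
varietal_cluster_probs = cluster_probs.groupby(climate["varietal"]).mean()  # varietal x 9 matrix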

Subsequently, we visualize the resulting clusters and analyze the overarching thematic patterns of the words grouped together, facilitating a deeper understanding of the semantic relationships within the data.

3.2 DistilBERT PCA

Another significant word-related feature set was developed using the DistilBERT model from the Hugging Face Transformers library. This feature set effectively complemented the n_word_probabilities by introducing a semantic dimension to the data, capturing deeper contextual relationships between wine descriptions.

As outlined in the feature description table, the features were computed by first aggregating word probabilities at the varietal level using the training dataset (words_train2.csv). For each varietal, a pseudo-text was generated by selecting the top 10 most probable words, creating a concise representation of the varietal’s flavor profile. These pseudo-texts were then processed using DistilBERT’s feature extraction pipeline to obtain 768-dimensional embeddings, which encode the semantic meaning of the text. The test wines were imputed with the corresponding varietal embeddings to ensure consistency across the dataset.

To manage dimensionality, we applied Principal Component Analysis (PCA) using the scikit-learn library, reducing the embeddings to 5 components. The number of components was chosen by generating a scree plot for up to 50 components and selecting 5, as they captured a substantial portion of the variance while maintaining computational efficiency. This process was applied to the embeddings of all wines (10,803 in total), resulting in a compact, semantically rich feature set for downstream modeling.
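
A minimal sketch of this pipeline, assuming the Hugging Face transformers and scikit-learn packages; the distilbert-base-uncased checkpoint and the mean pooling over token embeddings are our assumptions, and the file paths follow Section 2.1:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from transformers import pipeline

words = pd.read_csv("LGBM/data/words_train2.csv", index_col=0)
varietal = pd.read_csv("LGBM/data/climate_train.csv", index_col=0)["varietal"]  # assumes aligned IDs

# One pseudo-text per varietal: its top-10 most probable tasting-note words
varietal_probs = words.groupby(varietal).mean()
pseudo_texts = varietal_probs.apply(lambda row: " ".join(row.nlargest(10).index), axis=1)

# 768-dim DistilBERT embeddings, mean-pooled over tokens, then reduced to 5 PCA components
extractor = pipeline("feature-extraction", model="distilbert-base-uncased")
embeddings = np.vstack([np.mean(extractor(text)[0], axis=0) for text in pseudo_texts])
distilbert_pca = PCA(n_components=5).fit_transform(embeddings)      # one row per varietal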

3.2.1 Factor Analysis

3.2.1.1 Eigenvalues and Variance Explained

In the interest of interpretability, we also performed a factor analysis on the DistilBERT PCA features. This analysis aimed to identify the underlying structure of the data and to understand how the features relate to one another. The results are summarized in the table below, which includes the eigenvalues and the percentage of variance explained by each factor.

Eigenvalues & Explained Variance
Factor Eigenvalue Explained Variance Cumulative %
Factor 1 0.531 0.413 0.413
Factor 2 0.219 0.170 0.583
Factor 3 0.137 0.107 0.690
Factor 4 0.080 0.062 0.752
Factor 5 0.072 0.056 0.808
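
A minimal sketch of how these quantities and the rotated loadings discussed next could be computed, assuming the factor_analyzer package and using the varietal-level embedding matrix from the Section 3.2 sketch as input (whether the analysis ran on the raw embeddings or the PCA scores is an assumption here):

import pandas as pd
from factor_analyzer import FactorAnalyzer

# embeddings: varietal-level DistilBERT matrix from the earlier sketch (assumed input)
fa = FactorAnalyzer(n_factors=5, rotation="promax")
fa.fit(embeddings)

eigenvalues, _ = fa.get_eigenvalues()
variance, proportion, cumulative = fa.get_factor_variance()     # per-factor and cumulative variance
factor_cols = [f"Factor{i + 1}" for i in range(5)]
loadings = pd.DataFrame(fa.loadings_, columns=factor_cols)      # loading of each embedding dimension
scores = pd.DataFrame(fa.transform(embeddings),
                      index=varietal_probs.index, columns=factor_cols)  # per-varietal factor scores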

We then performed a Promax rotation on our factor loadings to enhance interpretability and ensure that the DistilBERT embeddings functioned as intended. After the rotation, we aggregated the loadings by varietal and computed the top words for each factor. Several words appeared in 70% of the wine samples. These words were:

  • palate
  • flavors
  • finish
  • aromas
  • notes
  • wine

These words were not very informative and did not help distinguish one varietal from another. We removed them from the analysis and focused on words that could better differentiate the varietals within each factor.

3.2.1.2 Factor Loading Heatmap

# install.packages(c("readr","dplyr","tidyr","ggplot2","plotly","viridis"))
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(plotly)
library(viridis)

# 1. Load your tidy data
word_df <- read_csv("project_files/factor_top_words.csv")
# Expect columns: factor, word, count

# 2. Make sure every (word, factor) pair is present, filling missing with 0
word_df_full <- word_df %>%
  complete(word, factor, fill = list(count = 0)) %>%
  # keep your ordering
  mutate(
    factor = factor(factor, levels = unique(factor)),
    word   = factor(word,   levels = rev(unique(word)))
  )

# 3. Build the ggplot
p <- ggplot(word_df_full, aes(x = factor, y = word, fill = count)) +
  geom_tile(color = "white", width = 1, height = 1) +        # white borders, full cell
  geom_text(aes(label = count), size = 3) +                  # annotate counts
  scale_fill_distiller(
    name    = "Count",
    palette = "Reds",    # ColorBrewer Reds
    direction = 1,       # 1 = light→dark
    na.value  = "grey90",
    limits    = c(0, max(word_df_full$count))
  ) +
  labs(x = NULL, y = NULL, title = "Top Word Frequencies per Factor") +
  theme_minimal(base_size = 14) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    panel.grid   = element_blank()
  )

# 4. Convert to interactive Plotly
ggplotly(p, tooltip = c("x", "y", "fill"))

Thanks to the PCA and Promax rotation, we were able to establish a clear structure in the factor loadings. The first factor loaded heavily on notes like refreshing, clean, crisp, and white, a clue that it corresponds to drier, crisp white wines. The second factor was more fruity yet also refreshing, suggesting sweeter, fruity white or rosé wines. The third factor was more balanced and structured, common descriptors for many drinkable red wines. The fourth factor indicated a bold yet sweet and fruity red wine. The fifth factor's notes pointed to a more spicy, dark, and bold red wine.

One interesting point—the note acidity was highly represented across all wines, similar to the 70% of notes we removed. However, we chose to retain it in this analysis not to show its general presence, but rather to highlight its absence in factor five. This lack of acidity provides meaningful insight into the nature of the varietals in this factor and further differentiates it from the others, potentially contributing to stronger model predictability.

Factor Loadings for Wine Varietals (PCA with Promax rotation; factors group varietals by taste profile)

  • Factor 1 (Clean, Crisp & White): St. Pepin, Seyval Blanc, Edelweiss, Pinot Grigio, Itasca, Apple Cider, La Crescent, Brianna

  • Factor 2 (Refreshing & Fruity): Pinot Noir, Chardonnay, Merlot, Sparkling Wine, Rosé of Pinot Noir, Rose, Sauvignon Blanc, Apple Cider

  • Factor 3 (Balanced & Structured): Mead, Chardonnay, Norton, Blend, Roussanne, Apple Cider, Viognier, Traminette

  • Factor 4 (Red, Fruity & Sweet): Pinot Noir, Marquette, Concord, Hard Cider, Tempranillo, Barbera, Apple Cider, GSM

  • Factor 5 (Bold, Spicy & Dark): Merlot, Meritage, Proprietary Red Blend, Cabernet Sauvignon, Marechal Foch, Cabernet Sauvignon Blend, Norton, Malbec

After merging the rotation data with the varietal labels, we pulled the top Promax-weighted wines associated with each factor, which confirmed our intuition and theoretical understanding of the factors. This supports the conclusion that the DistilBERT embeddings were effective in capturing the semantic meaning of the words and provided our model with an interpretable semantic understanding of the underlying structure of the word data.

To demonstrate the effectiveness of the DistilBERT embeddings and the factor analysis, we provided a scatter plot of the first and fifth factors. In the interest of parsimony, we included only these two factors, as they are the most interpretable and distinct.

3.3 Feature Importance Analysis

Our group recognized the power of feature engineering and its potential to enhance both our understanding of the data and the model’s learning capability. However, we also wanted to ensure that any engineered feature was empirically justified. To this end, we performed a feature importance analysis using the sklearn.feature_selection module, specifically the mutual_info_regression function.

Given the cyclical nature of the temporal features and the complexity of the data we were providing to our model, we chose to use mutual information to examine relationships between the features and the target variable—specifically for its ability to capture non-linear relationships. Our findings are presented below. Note: The feature importance analysis was initially performed on a version of our dataset that included 33 features but excluded word cluster probabilities, DistilBERT features, and others. An identical analysis was later performed on a dataset that included those features; however, no features were removed, as doing so degraded model performance.

After this initial pass and an examination of the correlative nature of the features we created, it became clear that the word cluster probabilities and the varietal features were the most salient in terms of mutual information with our target words variable.

After applying a 90% feature importance threshold, we identified the most impactful features likely to benefit our model. These features, shown above, reduced our feature set from 33 to 21.
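
A minimal sketch of this selection step, assuming scikit-learn; thinning the 1416 target words before averaging the mutual information and the exact cumulative 90% rule are illustrative choices on our part, with X and Y standing in for the engineered feature matrix and the word-probability targets:

import numpy as np
from sklearn.feature_selection import mutual_info_regression

# X: (n_wines, 33) engineered features; Y: (n_wines, 1416) word probabilities (assumed arrays)
mi_per_feature = np.mean(
    [mutual_info_regression(X, Y[:, j], random_state=0) for j in range(0, Y.shape[1], 50)],
    axis=0,
)

# Keep the smallest set of features covering 90% of the total mutual information
order = np.argsort(mi_per_feature)[::-1]
cumulative = np.cumsum(mi_per_feature[order]) / mi_per_feature.sum()
selected_idx = order[: np.searchsorted(cumulative, 0.90) + 1]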

Before this elimination step, our model’s MAE was approximately 0.022. Running the model with the reduced feature set—without any hyperparameter tuning—yielded an MAE of ~0.020, a 9% improvement. More is not always better when it comes to model performance. Through the feature engineering process and feature importance analysis, we demonstrated that we can strengthen signal and improve efficiency by leveraging the right tools.

4 Modeling

4.1 Feed Forward NN

To explore baseline model behavior, we worked with a Feedforward Neural Network (FNN) (see [Appendix]) trained on the initial engineered input features (climate + varietal) to predict possible wine tasting notes. The network consists of two hidden layers (512 and 256 units, respectively), each with LayerNorm, ReLU, and 45% dropout. The output layer uses a sigmoid activation to produce probabilities for the tasting notes.

We chose this architecture to give the model enough depth to learn nonlinear relationships between climate/varietal features and the tasting notes. It’s not overly deep, which keeps training time manageable, but still expressive enough to handle the complexity of the task.

Given the extreme sparsity and imbalance within the labels (≈80% zeros), we applied focal loss during training to help the model focus on rarer positive labels, and added an L1 sparsity penalty on the predicted probabilities to discourage overconfident outputs.
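
A minimal PyTorch sketch of the architecture and loss described above; the focal-weighting form, gamma, and the L1 penalty weight are our assumptions rather than the exact appendix implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class WineNet(nn.Module):
    def __init__(self, input_dim, output_dim, hidden_dim1=512, hidden_dim2=256, dropout_rate=0.45):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim1), nn.LayerNorm(hidden_dim1), nn.ReLU(), nn.Dropout(dropout_rate),
            nn.Linear(hidden_dim1, hidden_dim2), nn.LayerNorm(hidden_dim2), nn.ReLU(), nn.Dropout(dropout_rate),
            nn.Linear(hidden_dim2, output_dim),
        )

    def forward(self, x):
        return self.net(x)                         # raw logits; sigmoid is applied at prediction time

def focal_sparse_loss(logits, targets, gamma=2.0, l1_lambda=1e-4):
    probs = torch.sigmoid(logits)
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    focal_weight = (targets * (1 - probs) + (1 - targets) * probs) ** gamma   # down-weight easy labels
    return (focal_weight * bce).mean() + l1_lambda * probs.abs().mean()       # L1 sparsity on predictions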

To further address overfitting and tune model capacity, we:

  • Conducted a randomized hyperparameter grid search across dropout rates, learning rates, batch sizes, and hidden dimensions

  • Used a dropout rate of 45%

  • Ran five-fold cross-validation during early experimentation to assess generalization

Training progress was tracked with a detailed set of metrics each epoch:

  • Validation MAE and validation F1 to assess prediction accuracy

  • Mean probability to monitor whether the model’s output probabilities matched the expected sparsity of the label space.

Conclusions: As expected, the FNN didn’t perform as well as more advanced models like BERT and LGBM on Kaggle. However, its simpler structure and deterministic input format made it a strong candidate for SHAP-based explanation. This allowed us to generate visual insights into which features (e.g., rainfall, temperature changes) most influenced the model’s predictions, laying the groundwork for deeper interpretability in future iterations.

4.2 Statistical Modeling with LGBM

We also implemented a Light Gradient Boosting Machine (LightGBM) model. LightGBM is an efficient gradient-boosting decision-tree algorithm known for its speed, scalability, accuracy, and interpretability, and it is particularly suited to handling large and complex datasets due to its histogram-based method and leaf-wise tree growth strategy (see the LightGBM documentation).

We configured this model as a regression task using a Tweedie loss function, which is effective when modeling continuous target variables that include many zeros or highly skewed distributions. The Tweedie loss is mathematically represented as:

\[ \mathcal{L}(y, \mu) = -\frac{y \, \mu^{1 - p}}{1 - p} + \frac{\mu^{2 - p}}{2 - p} \]

where y is the observed target, μ is the predicted mean, and p is the Tweedie variance power (between 1 and 2 for zero-inflated continuous targets).

To enhance generalization and predictive robustness, we implemented KMeans clustering over the word-level target matrix across multiple resolutions (e.g., 30, 50, 70, 90 clusters). Each cluster groups semantically similar words, and a separate LightGBM model is trained for each cluster at each resolution. This process forms a multi-resolution modeling strategy, which simulates an ensemble by averaging predictions across all cluster resolutions—capturing both coarse and fine-grained semantic patterns within wine reviews.
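
A minimal sketch of this multi-resolution strategy, assuming the lightgbm and scikit-learn packages; training each cluster model on the cluster's mean word probability and broadcasting its prediction back to the cluster's words is our reading of the approach, and the hyperparameter values shown are illustrative:

import numpy as np
import lightgbm as lgb
from sklearn.cluster import KMeans

# X_train, X_test: feature matrices; Y_train: (n_wines, 1416) word-probability targets (assumed arrays)
resolutions = [30, 50, 70, 90]
per_resolution_preds = []

for k in resolutions:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Y_train.T)   # cluster the words
    pred_k = np.zeros((X_test.shape[0], Y_train.shape[1]))
    for c in range(k):
        cols = np.where(labels == c)[0]
        model = lgb.LGBMRegressor(objective="tweedie", tweedie_variance_power=1.5,
                                  learning_rate=0.01, num_leaves=30, n_estimators=340)
        model.fit(X_train, Y_train[:, cols].mean(axis=1))      # one model per word cluster
        pred_k[:, cols] = model.predict(X_test)[:, None]       # broadcast cluster prediction to its words
    per_resolution_preds.append(pred_k)

final_pred = np.mean(per_resolution_preds, axis=0)             # average across cluster resolutions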

Additional performance improvements were introduced through feature scaling of the most important climate variables and the creation of interaction features (e.g., rain_spring_interaction, rain_maxtemp_interaction) to model non-linear dependencies. Post-prediction, we applied a sparsity factor and thresholding mechanism to better align with the distributional characteristics of the Kaggle test data, simulating real-world sparsity.

4.2.1 Results

4.2.1.1 Climate Only Model

The first model leveraged only the engineered climate features described in the Feature Engineering table of this report, excluding daily climate data. This streamlined approach enhanced efficiency and provided interpretability by enabling stakeholders to assess which aggregated climate attributes most significantly contribute to predicting wine review probabilities. Such insights empower stakeholders to strategically market or rank wines based on the climatic characteristics of their vintage year. This model achieved an MAE of 0.02541 on Kaggle:

library(gt)
library(tibble)
library(htmltools)

# Hyperparameters DataFrame
hparams_df <- tibble(
  Parameter = c(
    "Learning Rate", "Num Leaves", "Max Depth",
    "Num Estimators", "Num Clusters", "Num Word Clusters",
    "Use Varietal Features", "Use Varietal Correlation",
    "Sparsity Factor", "Sparsity Threshold", "Objective", "Final Val MAE"
  ),
  Value = c(
    "0.010817", "30", "24", "340", "30,60,90",
    "0", "False", "False", "1.0", "0.225085", "tweedie", .0256
  )
)

# Feature Importances DataFrame
feature_importances_df <- tibble(
  Feature = c(
    "Num Rainy Days", "Spring Rain Avg", "30-day Max Temp MA",
    "Weekly Max Temp Drop", "7-day Max Temp MA", "Fall Rain Avg",
    "7-day Min Temp MA", "30-day Min Temp MA", "Summer Rain Avg",
    "14-day Min Temp MA"
  ),
  Gain = c(
    4808.95, 2832.41, 2739.82, 2464.91, 2183.81, 
    2097.73, 2041.06, 1924.11, 1895.59, 1649.31
  )
)

# Custom compact styling function (your original burgundy theme)
compact_burgundy_theme <- function(tbl, title) {
  tbl %>%
    gt() %>%
    tab_header(title = md(title)) %>%
    opt_table_outline(style = "solid", width = px(1), color = "#800020") %>%
    tab_options(
      table.background.color = "#FFFFFF",
      table.border.top.color = "#800020",
      table.border.bottom.color = "#800020",
      column_labels.background.color = "#f3e5e5",
      column_labels.font.weight = "bold",
      table.font.color = "black",
      table.font.names = "Arial",
      data_row.padding = px(4),
      heading.border.bottom.color = "#800020",
      column_labels.border.top.color = "#800020",
      column_labels.border.bottom.color = "#800020",
      table.width = px(330),
      table.font.size = "12px",
      table.align = "center"
    )
}

# Styled Hyperparameters Table
hparams_table <- compact_burgundy_theme(hparams_df, "**Model Hyperparameters**") %>%
  cols_width(
    Parameter ~ px(160),
    Value ~ px(120)
  )

# Styled Feature Importances Table
feature_importances_table <- compact_burgundy_theme(feature_importances_df, "**Top 10 Features (Gain)**") %>%
  fmt_number(columns = "Gain", decimals = 2, use_seps = TRUE) %>%
  cols_width(
    Feature ~ px(180),
    Gain ~ px(100)
  )

# Display side by side (in R Markdown)
browsable(
  div(style = "display:flex; gap:20px; justify-content:center;",
      hparams_table, feature_importances_table)
)
Model Hyperparameters
Parameter Value
Learning Rate 0.010817
Num Leaves 30
Max Depth 24
Num Estimators 340
Num Clusters 30,60,90
Num Word Clusters 0
Use Varietal Features False
Use Varietal Correlation False
Sparsity Factor 1.0
Sparsity Threshold 0.225085
Objective tweedie
Final Val MAE 0.0256
Top 10 Features (Gain)
Feature Gain
Num Rainy Days 4,808.95
Spring Rain Avg 2,832.41
30-day Max Temp MA 2,739.82
Weekly Max Temp Drop 2,464.91
7-day Max Temp MA 2,183.81
Fall Rain Avg 2,097.73
7-day Min Temp MA 2,041.06
30-day Min Temp MA 1,924.11
Summer Rain Avg 1,895.59
14-day Min Temp MA 1,649.31

To understand which features most influenced predictions, we used LightGBM’s gain-based feature importance, which measures how much each feature reduces the model’s loss function across all splits. The feature with the highest gain was num_rainy_days, suggesting that the frequency of rain events during the growing season is a strong predictor of wine review vocabulary—likely due to its impact on grape development, disease risk, and harvest timing.

Other top features included:

  • spring_rain_avg: Critical rainfall during budding and flowering phases.

  • maxtemp_ma_30: A long-term max temperature average, capturing overall warmth.

  • max_temp_drop_weekly: Indicates weather volatility that may stress vines.

  • Minimum temperature and sunshine averages: Linked to ripening patterns and acidity balance.

These results highlight how seasonal and aggregated climate metrics meaningfully relate to the language used in wine reviews.

While we focused on gain, LightGBM also offers the split metric, which counts how often a feature is used to split the data. Though less tied to performance, it shows how frequently features are selected across trees. For more details on LightGBM feature importance, see GeeksforGeeks.
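
For completeness, a short sketch of how both metrics can be read off a fitted model through lightgbm's scikit-learn wrapper; model and feature_names are placeholders for a trained LGBMRegressor and its training columns:

import pandas as pd

importance = pd.DataFrame({
    "feature": feature_names,
    "gain": model.booster_.feature_importance(importance_type="gain"),
    "split": model.booster_.feature_importance(importance_type="split"),
}).sort_values("gain", ascending=False)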

4.2.1.2 Climate + Word Feature Model

This model incorporated a subset of the engineered features (see the appendix code section), excluding daily climate data to enhance efficiency and interpretability. It also introduced a suite of DistilBERT-derived features, capturing semantic richness from wine reviews. Unlike subsequent models that used per-varietal aggregation, this approach computed DistilBERT PCA embeddings, word entropy, and top word ratios on a per-wine basis. This adjustment yielded improved results with LightGBM, likely due to finer granularity in capturing wine-specific language patterns. To manage dimensionality, the DistilBERT embeddings were reduced using Principal Component Analysis (PCA), retaining the top 20 components per wine. This configuration allowed the model to effectively integrate structured climate variables with detailed textual descriptors while maintaining computational scalability. This model achieved an MAE of 0.02061 on Kaggle.

library(gt)
library(tibble)
library(htmltools)

# ---- Feature Importance Table ----
feature_importances_df <- tibble(
  Feature = c(
    "word_entropy", "word_cluster_8_prob", "word_cluster_4_prob",
    "distilbert_pca_feature_3", "distilbert_pca_feature_0",
    "distilbert_pca_feature_1", "distilbert_pca_feature_2",
    "word_cluster_3_prob", "distilbert_pca_feature_4",
    "distilbert_pca_feature_6"
  ),
  Importance = c(
    3665575.45, 2768614.30, 2422616.90,
    2198461.51, 2174008.85, 2140382.56,
    1972365.65, 1940113.69, 1920906.32,
    1888735.51
  )
)

# ---- Hyperparameters & Metrics Table ----
hyperparams_df <- tibble(
  Parameter = c(
    "Learning Rate", "Num Leaves", "Max Depth", "Num Estimators",
    "Feature Fraction", "Lambda L1", "Lambda L2",
    "Min Child Weight", "Min Split Gain",
    "Tweedie Variance Power", "Objective", "Use DistilBERT",
    "Cluster Resolutions", "Final Train MAE", "Final Val MAE"
  ),
  Value = c(
    0.028192, 70, 16, 1000,
    0.741177, 0.949204, 0.614408,
    0.000428, 0.014655,
    1.678497, "mae", "TRUE",
    "40, 80, 110", 0.018348, 0.020770
  )
)

# ---- Styling Theme (Your Burgundy Theme) ----
compact_burgundy_theme <- function(tbl, title) {
  tbl %>%
    gt() %>%
    tab_header(title = md(title)) %>%
    opt_table_outline(style = "solid", width = px(1), color = "#800020") %>%
    tab_options(
      table.background.color = "#FFFFFF",
      table.border.top.color = "#800020",
      table.border.bottom.color = "#800020",
      column_labels.background.color = "#f3e5e5",
      column_labels.font.weight = "bold",
      table.font.color = "black",
      table.font.names = "Arial",
      data_row.padding = px(4),
      heading.border.bottom.color = "#800020",
      column_labels.border.top.color = "#800020",
      column_labels.border.bottom.color = "#800020",
      table.width = px(350),
      table.font.size = "12px",
      table.align = "center"
    )
}

# ---- Apply Theme to Each Table ----
feature_table <- compact_burgundy_theme(feature_importances_df, "**Top 10 Feature Importances**") %>%
  fmt_number(columns = "Importance", decimals = 2, use_seps = TRUE) %>%
  cols_width(
    Feature ~ px(200),
    Importance ~ px(120)
  )

hyperparam_table <- compact_burgundy_theme(hyperparams_df, "**Hyperparameters & Results**") %>%
  fmt_number(columns = "Value", decimals = 6, use_seps = FALSE) %>%
  cols_width(
    Parameter ~ px(220),
    Value ~ px(120)
  )

# ---- Display Side by Side ----
browsable(
  div(style = "display: flex; gap: 25px; justify-content: center;",
      hyperparam_table, feature_table)
)
Hyperparameters & Results
Parameter Value
Learning Rate 0.028192
Num Leaves 70
Max Depth 16
Num Estimators 1000
Feature Fraction 0.741177
Lambda L1 0.949204
Lambda L2 0.614408
Min Child Weight 0.000428
Min Split Gain 0.014655
Tweedie Variance Power 1.678497
Objective mae
Use DistilBERT TRUE
Cluster Resolutions 40, 80, 110
Final Train MAE 0.018348
Final Val MAE 0.02077
Top 10 Feature Importances
Feature Importance
word_entropy 3,665,575.45
word_cluster_8_prob 2,768,614.30
word_cluster_4_prob 2,422,616.90
distilbert_pca_feature_3 2,198,461.51
distilbert_pca_feature_0 2,174,008.85
distilbert_pca_feature_1 2,140,382.56
distilbert_pca_feature_2 1,972,365.65
word_cluster_3_prob 1,940,113.69
distilbert_pca_feature_4 1,920,906.32
distilbert_pca_feature_6 1,888,735.51

The feature importances were again calculated using the gain metric. We hypothesize the values are much larger in this model compared to the climate-only version due to the higher number of informative, high-variance features, particularly those derived from text, which contribute more frequently and substantially during tree construction.

This model still offers interpretability through the visualizations in the Feature Engineering section, particularly the n-word cluster probabilities, which group semantically similar terms. Although a similar analysis could be done for the DistilBERT PCA components, we opted not to complete it, as later deep learning models using per-varietal aggregation performed better overall for this feature. Some key insights from the feature importances of this model include:

  • word_entropy: Measures the unpredictability or diversity of language used in a wine’s review. Higher entropy likely reflects wines with more complex or nuanced descriptions, which may correlate with distinctive sensory profiles or premium positioning.

  • word_cluster_8_prob (Red Fruit & Spicy): High importance suggests that the presence of red fruit and spice-related language (e.g., cherry, spice, tannins) is a strong signal in predicting review patterns—possibly due to their common association with popular and recognizable varietals.

  • word_cluster_4_prob (Citrus & Mineral): Indicates that mentions of citrus, minerality, and freshness (e.g., lemon, pear, floral) play a key role in model predictions, likely capturing characteristics tied to lighter white wines or crisp finishes appreciated by certain consumer segments.

4.3 Deep Learning with a PyTorch Encoder Transformer

As part of our exploration into deep learning approaches for the Kaggle Challenge Four, we deployed a transformer-based model called ClimateEncoderTransformer to predict the probabilities of tasting note keywords based on climate data and varietal information. This model draws inspiration from BERT’s architecture but is tailored for our multi-label regression task, leveraging embeddings, attention mechanisms, and a carefully designed structure to capture the relationships between environmental factors and wine sensory descriptors. Here’s a breakdown of how it works, its internal components, and the key design choices we made.

The ClimateEncoderTransformer takes two inputs: a varietal identifier (an integer encoding the grape type) and a vector of climate features (e.g., temperature, rainfall, sunshine) for each wine sample. Since we’re working with just these two inputs, the model processes a sequence of length 2—one token for the varietal and one for the climate data. Each input is first passed through an embedding layer to convert it into a dense representation in a shared vector space of dimension d_model=512 (as set in train.py). For the varietal, we used an nn.Embedding layer to map the integer ID to a dense vector, capturing latent characteristics of each grape type. For the climate data, we used a linear layer to project the climate features into the same d_model space, followed by LayerNorm and dropout for regularization.

To make the model aware of the positional relationship between the varietal and climate tokens, we added positional embeddings. We gave the model the option to use either learned embeddings (via nn.Embedding) or sinusoidal embeddings (inspired by the original Transformer paper), controlled by a hyperparameter use_learned_pos. For the final model, we set use_learned_pos=True (as in train.py), which is more akin to BERT’s learned positional embeddings, allowing the model to adapt the positional information to our short sequence length of 2.

The core of the model is a transformer encoder, implemented using PyTorch’s nn.TransformerEncoder with 2 layers (num_layers=2) and 4 attention heads (num_heads=4, as set in train.py). Each head processes a subspace of the d_model dimensions (512 / 4 = 128 dimensions per head), enabling the model to capture interactions between the varietal and climate tokens through self-attention. The feedforward network (FFN) within each transformer layer expands to dim_feedforward = d_model * mlp_mult = 512 * 8 = 4096 (with mlp_mult=8 from train.py). We used GELU activation in the FFN, matching BERT’s hidden_act, and applied dropout (attn_dropout=0.2, as set in train.py) to prevent overfitting within the attention and FFN sub-layers.

For the output, we extracted the representation of the last token (the climate token) after the transformer encoder, as we found it captured the combined influence of the varietal and climate through attention. This representation (a d_model-sized vector) is passed through a single linear layer (fc_out) to produce raw logits for the 1416 tasting note keywords. In our WinePredictor LightningModule, we apply a sigmoid activation to these logits (preds = torch.sigmoid(logits)) to constrain the predictions to [0, 1], matching the target range of the keyword probabilities. This setup ensures the model outputs probabilities directly, which is ideal for our regression task.
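
A minimal PyTorch sketch of the architecture described above; only the learned-positional-embedding path is shown, and anything not stated in the text (argument names, tensor shapes, constructor signature) is an assumption rather than the exact train.py implementation:

import torch
import torch.nn as nn

class ClimateEncoderTransformer(nn.Module):
    def __init__(self, n_varietals, n_climate_features, n_words=1416,
                 d_model=512, num_heads=4, num_layers=2, mlp_mult=8, attn_dropout=0.2):
        super().__init__()
        self.varietal_embed = nn.Embedding(n_varietals, d_model)
        self.climate_proj = nn.Sequential(
            nn.Linear(n_climate_features, d_model), nn.LayerNorm(d_model), nn.Dropout(attn_dropout)
        )
        self.pos_embed = nn.Embedding(2, d_model)              # learned positions for the 2-token sequence
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, dim_feedforward=d_model * mlp_mult,
            dropout=attn_dropout, activation="gelu", batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.fc_out = nn.Linear(d_model, n_words)

    def forward(self, varietal_id, climate):
        # Sequence of length 2: [varietal token, climate token]
        tokens = torch.stack([self.varietal_embed(varietal_id), self.climate_proj(climate)], dim=1)
        tokens = tokens + self.pos_embed(torch.arange(2, device=tokens.device))
        encoded = self.encoder(tokens)
        return self.fc_out(encoded[:, -1])                     # logits taken from the last (climate) token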

For the loss, we used Mean Absolute Error (MAE) via nn.L1Loss, as specified in model.py, which aligns with the Kaggle evaluation metric. We also added a mean penalty term (lambda_mean * |preds.mean() - target_mean|) to encourage the model’s predictions to match the dataset’s mean probability (around 0.0356), helping stabilize the output distribution. For optimization, we used AdamW with L2 regularization (weight_decay in train.py), which is a common choice for transformers, providing a balance of fast convergence and regularization to prevent overfitting.
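
A short sketch of that objective; the lambda_mean value and the AdamW settings shown here are placeholders rather than the tuned values:

import torch
import torch.nn as nn

l1 = nn.L1Loss()

def training_loss(logits, targets, lambda_mean=0.1, target_mean=0.0356):
    preds = torch.sigmoid(logits)                              # probabilities in [0, 1]
    return l1(preds, targets) + lambda_mean * (preds.mean() - target_mean).abs()

# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)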

This ClimateEncoderTransformer combines the power of embeddings to represent varietal and climate data, attention to model their interactions, and a streamlined output head to predict keyword probabilities. Its design, inspired by BERT but adapted for our task, allowed us to effectively capture the influence of climate on wine tasting notes, as evidenced by the feature importance analysis where varietal embeddings and climate features ranked highly.

Encoder Diagram

4.3.1 Hyperparameter Tuning

The transformer model was trained using a combination of hyperparameters:

  • d_model: The dimensionality of the input and output embeddings, set to 512.
  • mlp_mult: The multiplier for the feedforward network size.
  • num_heads: The number of attention heads in the multi-head attention mechanism.
  • num_layers: The number of transformer encoder layers.
  • use_learned_pos: A boolean indicating whether to use learned positional embeddings (True) or sinusoidal embeddings (False).
  • lambda_mean: A regularization parameter for the mean penalty term.
  • learning_rate: The learning rate for the AdamW optimizer.
  • temperature: A scaling factor for the logits before applying sigmoid, set to 1.0.
  • beta1, beta2: The exponential decay rates for the first and second moment estimates in the AdamW optimizer.
  • eps: A small constant for numerical stability in the AdamW optimizer.
  • weight_decay: The weight decay parameter for L2 regularization in the AdamW optimizer.

These were all tuned using Bayesian optimization with Ax/BoTorch. The correlation heatmap above shows how each hyperparameter correlated with the validation loss and the mean of the predicted probabilities. These findings were crucial in guiding our hyperparameter tuning process.

4.3.2 Conclusions and Interpretations

The model performed quite well, especially considering that no actual time series data was used to generate our predictions. Our model achieved an MAE of 0.01879. This model used a 2-layer transformer encoder with 4 attention heads and a 512-dimensional embedding. We then ensembled this model with our second-best performer, a 3-layer transformer encoder with 8 attention heads, also with a 512-dimensional embedding. Adding the additional model dropped our MAE to 0.01873.
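
A short sketch of the ensembling step; the model and tensor names are illustrative:

import torch

with torch.no_grad():
    preds_2layer = torch.sigmoid(model_2layer(varietal_ids, climate_feats))
    preds_3layer = torch.sigmoid(model_3layer(varietal_ids, climate_feats))
ensemble_preds = (preds_2layer + preds_3layer) / 2             # simple average of the two encoders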

4.4 Explainable AI

Normalized feature-engineered data fed to the neural net (first data iteration)

Why interpretability matters:
“To create a model, we make choices about what’s important enough to include, simplifying the world into a toy version that can be easily understood and from which we can infer important facts and actions. We expect it to handle only one job and accept that it will occasionally act like a clueless machine, one with enormous blind spots.” — Cathy O’Neil, Weapons of Math Destruction

Explainable AI helps us open up the “black box” of machine learning models and better understand how they arrive at predictions. As AI systems are increasingly used in high-stakes decision-making, interpretability tools like SHAP (SHapley Additive exPlanations) are becoming essential—not just for model validation, but for building transparency, trust, and accountability.

*Excludes varietal. High variance often indicates greater discriminating potential, as these features change meaningfully across samples, which makes them more likely to influence model predictions; we will see whether SHAP confirms their importance during model interpretation. Several rainfall features are right-skewed, which could affect model behavior, while temperature-related features like max_temp_drop_weekly are more normally distributed, which is often ideal for tree-based models.

Unlike traditional feature importance methods, which offer a global view of how features affect overall model performance, SHAP values estimate the individual impact of each feature on a specific prediction. They tell us whether a feature pushed the prediction higher or lower, and by how much. In the context of this project, that means we can explain why a specific wine was predicted to have a certain tasting note—not just which features matter overall.

Application: Feed Forward Neural Network (FNN)

(see [Appendix] for the model’s code)

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import random
import os
from scipy.stats import linregress
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.decomposition import PCA

## see appendix for data load and full model code
# importing the FFN from the module
from fnn_module import WineNet
model = WineNet(input_dim=1018, output_dim=1416, hidden_dim1=512, hidden_dim2=256, dropout_rate=0.45)
model.load_state_dict(torch.load("trained_model.pt", map_location="cpu"))
model.eval()

def model_predict(X_numpy):
    X_tensor = torch.tensor(X_numpy, dtype=torch.float32)
    with torch.no_grad():
        logits = model(X_tensor)
        probs = torch.sigmoid(logits).numpy()
    return probs

4.4.1 SHAP Analysis

Comparing Probability Predictions for Wine Notes #1 and #47:

To understand how our Feedforward Neural Network (FNN) makes decisions, we applied SHAP to the model’s output (formatted for Kaggle). The SHAP Summary Plots below show a comparison between how the model arrived at predictions for two different tasting notes based on specific features. These scores are likely overconfident due to label sparsity (~80% zeros), but still valuable for interpretability analysis.

  • Tasting Note 1 had a predicted probability of 0.82

  • Tasting Note 47 had a predicted probability of 0.41

import shap
shap.initjs() 

# Prepare SHAP data
background = X_train[:1000]  # Background dataset for SHAP
X_explain = X_val[:100]     # Samples to explain
feature_names = features_df.columns.tolist()  # Feature names from CSV

# Set device to GPU if available, otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define prediction function for first tasting note
def model_predict(inputs, output_idx=1): 
    model.eval()
    with torch.no_grad():
        inputs_tensor = torch.tensor(inputs, dtype=torch.float32).to(device)
        outputs = model(inputs_tensor)
        probs = torch.sigmoid(outputs[:, output_idx]).cpu().numpy()
    return probs

# Initialize KernelExplainer
explainer = shap.KernelExplainer(lambda x: model_predict(x, output_idx=1), background)

# Compute SHAP values
shap_values = explainer.shap_values(X_explain, nsamples=200)

# Visualize for the first tasting note
shap.summary_plot(shap_values, X_explain, feature_names=feature_names, plot_type="bar")
shap.summary_plot(shap_values, X_explain, feature_names=feature_names)
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[0], X_explain[0], feature_names=feature_names)

# to generate the plots for other notes, simply update the output_idx
Tasting Note 1: the model is less confident, and its SHAP values are more diffuse than in the plot for Tasting Note 47. Positive SHAP values push the prediction higher.

A positive SHAP value means the feature pushed the prediction for the tasting note higher (e.g., increased the probability). A negative SHAP value means the feature pulled the prediction lower.

Stronger SHAP signal for Tasting Note 47

The model is more confident in predicting this specific tasting note compared to the first. There are features (like num_rainy_days and varietal_embedding) with clearly high or low values consistently nudging the prediction up or down. SHAP values are larger, meaning this tasting note has a stronger signal in the data.

4.4.2 How to Read the Charts

Traditional feature importance provides a general sense of which features matter across the dataset. SHAP takes it further by showing how much each feature pushed a prediction up or down — and whether it was due to a high or low value for that specific wine. This level of detail is increasingly important in real-world applications, where decisions often need to be interpretable and justifiable at the individual level. (Source: “SHAP Values for Beginners” – A Data Odyssey).

For example:

  • A red dot = a high value for that feature

  • A blue dot = a low value

  • The position along the x-axis = the SHAP value (how much it moved the prediction)

In our wine modeling project, SHAP helps us understand how factors like rainfall, temperature drops, or varietal embeddings influence the predicted tasting notes of a wine. While the stakes in wine prediction are relatively low (after all, no one’s liberty hinges on the flavor profile of a Merlot), the techniques we use are the same ones applied in life-altering domains.

Take the COMPAS algorithm, for example. It’s used in the criminal justice system to predict a defendant’s likelihood of reoffending. COMPAS is a proprietary algorithm, meaning the public can’t see how it works. Defendants, lawyers, and even judges don’t know which variables influence the score or how heavily they’re weighted. This violates a basic principle of due process: the right to understand and challenge the evidence against you.

Tasting Note One (Row 1)

For Tasting Note 47, the model relies more heavily on a small set of aggregated features—especially varietal_embedding and num_rainy_days—compared to individual daily weather values. These engineered inputs consistently nudged predictions up or down, and had larger SHAP values than granular features like rain_2022-04-14.

SHAP helped confirm that meaningful feature engineering—rather than just throwing more raw data at the model—made a real difference in both performance and interpretability.

This model’s preference for aggregated features validates our decision to refine and reduce the variables up front, which:

  • Improved model accuracy

  • Reduced input dimensionality and computation

  • Made the model easier to interpret

In our wine model, we’re using proxy variables—engineered from climate data, seasonal trends, and varietal embeddings—to predict complex outcomes like sensory flavor notes. If we were to scale this up to product labeling, supply chain decisions, or even pricing strategies, understanding which features drive those predictions would become essential for fairness and transparency—even in commerce.

SHAP is one method of many that shines light into these models, whether we’re selecting tasting descriptors or guiding real-world policy. The stakes may differ, but as data practitioners, the responsibility to understand and interrogate our models remains.

5 Conclusions

This project demonstrated that multiple modeling strategies—ranging from statistical machine learning methods to deep learning—can effectively predict wine review probabilities. While models like LightGBM remained powerful and interpretable, our exploration of a Feedforward Neural Network (FNN) added value by offering a more flexible, non-linear approach to capturing interactions between engineered climate features and varietal encodings. With the transformer encoder, we were able to leverage the power of embeddings and attention, which elevated model performance by capturing complex relationships and adaptively learning to connect the data to the wine tasting notes.

In addition to implementing these models, we employed several statistical and empirical techniques to develop new features, assess feature importance, select the number of factors and clusters, and analyze them using traditional methods to ensure efficacy, interpretability, and robustness of the input data.

Through SHAP, we were able to dissect individual predictions and visualize how aggregated inputs—such as rainfall frequency and varietal embeddings—contributed to specific tasting note probabilities. These results reinforce the importance of thoughtful data preparation and confirm that both modern deep learning models and traditional methods can offer meaningful, interpretable insights when paired with explainable AI techniques.