This project focused on analyzing over 10,000 wines across 64 grape varietals—distinct types of wine made primarily from a single grape variety. Each wine was associated with climate data from the 2022 vintage year: daily minimum and maximum temperatures, rainfall, and sunshine from March 1 to November 1.
The objective was to build a machine learning model that predicts the probability of specific tasting note keywords appearing in a wine review, using only climate data and the varietal type as inputs. By modeling these relationships, the project explores how environmental factors influence the language used to describe wine—offering winemakers insights into how certain climate conditions may shape the sensory perception and market reception of their wines.
Students were evaluated based on the mean absolute error (MAE) of their model’s predicted keyword probabilities compared to the true probabilities from smoothed word count data.
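For reference, writing \(\hat{y}_{ij}\) for the predicted probability of keyword \(j\) in wine \(i\) and \(y_{ij}\) for the smoothed true probability, the metric is the standard mean absolute error, here written averaged over all wine-keyword pairs:

\[ \mathrm{MAE} = \frac{1}{NK} \sum_{i=1}^{N} \sum_{j=1}^{K} \left| \hat{y}_{ij} - y_{ij} \right| \]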
The following table summarizes the training and test samples, tasting note keywords, and climate date range for context:
library(dplyr)
library(officer)
# Load data (replace with your actual paths)
train_climate <- read.csv("LGBM/data/climate_train.csv")
test_climate <- read.csv("LGBM/data/climate_test.csv")
train_words <- read.csv("LGBM/data/words_train2.csv")
# Core values
n_wines <- as.integer(nrow(train_climate))
n_test <- as.integer(nrow(test_climate))
n_varietals <- as.integer(length(unique(train_climate$varietal)))
n_words <- as.integer(ncol(train_words) - 1)
n_days <- as.integer(sum(grepl("^maxtemp_", colnames(train_climate))))
n_modalities <- 4 # Assuming maxtemp, mintemp, rain, sunshine
total_climate_features <- as.integer(n_days * n_modalities)
avg_words_per_wine <- mean(rowSums(train_words[, -1] > 0.1)) # mean number of words per wine with probability > 0.1
sparsity <- round(mean(train_words[, -1] == 0) * 100, 1) # percent of zero entries in the word-probability matrix
# Extract all climate-related date columns
climate_cols <- grep("^(maxtemp_|mintemp_|rain_|sunshine_)", colnames(train_climate), value = TRUE)
library(dplyr)
library(gt)
style_table <- function(tbl) {
tbl %>%
gt() %>%
fmt_number(columns = "Value", decimals = 6, use_seps = FALSE) %>%
opt_table_outline(style = "solid", width = px(1.5), color = "#800020") %>% # Burgundy border
tab_options(
table.background.color = "#FFFFFF", # Clean white background
table.border.top.color = "#800020",
table.border.bottom.color = "#800020",
column_labels.background.color = "#f3e5e5", # Light burgundy-pink
column_labels.font.weight = "bold",
table.font.color = "#2c2c2c",
table.font.names = "Arial",
data_row.padding = px(6),
heading.border.bottom.color = "#800020",
column_labels.border.top.color = "#800020",
column_labels.border.bottom.color = "#800020",
table.width = px(600), # Wider table
table.font.size = "13px", # Larger font
table.align = "center"
) %>%
cols_width(
"Parameter" ~ px(350), # Wider columns
"Value" ~ px(200)
)
}
# ---- Your data setup ----
summary_table <- tibble(
Metric = c(
"Number of training wine samples",
"Number of testing wine samples",
"Number of unique tasting note keywords",
"Number of climate days per modality",
"Min climate date",
"Max climate date"
),
Value = c(
n_wines,
n_test,
n_words,
n_days,
"March 1, 2022",
"November 1, 2022"
)
)
# Rename column to match expected input
summary_table_renamed <- summary_table %>%
rename(Parameter = Metric)
# Apply styling
styled_summary_table <- style_table(summary_table_renamed) %>%
tab_header(title = "Climate & Word Data Summary")
# Return the styled table (no need for div() unless in HTML context)
styled_summary_table
| Climate & Word Data Summary | |
| Parameter | Value |
|---|---|
| Number of training wine samples | 9802 |
| Number of testing wine samples | 1001 |
| Number of unique tasting note keywords | 1416 |
| Number of climate days per modality | 246 |
| Min climate date | March 1, 2022 |
| Max climate date | November 1, 2022 |
As previously noted, the time series data includes four climate modalities, along with word probability training data for the corresponding wines. With this in mind, we examined the distributions of both datasets before and after normalization.
# Load necessary libraries
library(ggplot2)
library(dplyr)
library(gridExtra)
library(scales)
library(stringr)
# Set larger plot size for interactive viewing
options(repr.plot.width = 14, repr.plot.height = 12) # For Jupyter/RStudio
# Modified theme with matching font sizes
theme_burgundy_min <- theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "plain", size = 12, hjust = 0.5, family = "Arial", color = "#2c2c2c"),
axis.title = element_text(size = 12, family = "Arial", color = "#3c3c3c"),
axis.text = element_text(size = 12, family = "Arial", color = "#4a4a4a"),
panel.grid.major = element_line(color = "#e5e5e5", size = 0.4),
panel.grid.minor = element_blank(),
plot.margin = margin(15, 15, 15, 15)
)
# Assuming 'train_climate' contains the data and 'dates' is already created
modalities <- c("maxtemp", "mintemp", "rain", "sunshine")
# Extract the climate data by modality
climate_data <- list()
for (mod in modalities) {
climate_data[[mod]] <- train_climate[, grep(mod, names(train_climate))]
}
dates <- sort(unique(sub(".*_(.*)", "\\1", names(train_climate)[grepl("_", names(train_climate))])))
date_objs <- as.Date(dates, format = "%Y.%m.%d")
# 1. Distribution of Climate Modalities (Across All Wines)
plot_list <- list()
for (i in seq_along(modalities)) {
mod <- modalities[i]
climate_values <- as.vector(as.matrix(climate_data[[mod]]))
x_label <- case_when(
mod %in% c("maxtemp", "mintemp") ~ "Temperature (°C)",
mod == "rain" ~ "Daily Rainfall (mm)",
mod == "sunshine" ~ "Sunshine Duration (minutes)",
TRUE ~ "Value"
)
p <- ggplot(data.frame(value = climate_values), aes(x = value)) +
geom_histogram(bins = 50, fill = "#b03a2e", color = "#800020", alpha = 0.75) +
labs(
title = str_wrap(str_to_title(mod), width = 20),
x = x_label,
y = "Frequency"
) +
theme_burgundy_min
plot_list[[i]] <- p
}
# Arrange and save non-normalized plots
non_norm_plot <- grid.arrange(grobs = plot_list, ncol = 2)
ggsave("non_normalized_climate_plots.png", plot = non_norm_plot, width = 10, height = 8, units = "in", dpi = 300)
# 1.5. Distribution of Climate Modalities After Normalization
normalize_data <- function(x) {
return((x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE))
}
log_transform <- function(x) {
return(log1p(x))
}
# Apply transformations
norm_data <- list()
for (mod in modalities) {
climate_values <- as.vector(as.matrix(climate_data[[mod]]))
if (mod %in% c("maxtemp", "mintemp")) {
norm_data[[mod]] <- normalize_data(climate_values)
} else if (mod == "sunshine") {
norm_data[[mod]] <- normalize_data(log_transform(climate_values))
} else if (mod == "rain") {
norm_data[[mod]] <- log_transform(climate_values) # only log, no standard scale
}
}
# Plot
norm_plot_list <- list()
for (i in seq_along(modalities)) {
mod <- modalities[i]
transformation <- case_when(
mod == "rain" ~ "Log Transformed",
mod == "sunshine" ~ "Standard Scaled + Log Transformed",
TRUE ~ "Standard Scaled"
)
p <- ggplot(data.frame(value = norm_data[[mod]]), aes(x = value)) +
geom_histogram(bins = 50, fill = "#c97c7c", color = "#800020", alpha = 0.75) +
scale_y_log10() +
labs(
title = str_wrap(paste0(str_to_title(mod), " — ", transformation), width = 20),
x = "Normalized Values",
y = "Log Frequency"
) +
theme_burgundy_min
norm_plot_list[[i]] <- p
}
# norm_plot <- grid.arrange(grobs = norm_plot_list, ncol = 2)
# ggsave("images/normalized_climate_plots.png", plot = norm_plot, width = 10, height = 8, units = "in", dpi = 300)
This project involved extensive feature engineering, which was critical not only for optimizing model performance and efficiency but also for enhancing interpretability. Understanding which features influenced tasting note prediction probabilities allowed us to better contextualize the patterns observed in the machine learning outputs.
The table below summarizes the engineered features that were evaluated across the models used in this analysis:
# Load necessary libraries
library(gt)
library(dplyr)
# Define the style_table function (using gradient colors)
style_table <- function(tbl) {
tbl %>%
opt_table_outline(style = "solid", width = px(1.5), color = "#800020") %>% # Burgundy border
tab_options(
table.background.color = "#FFFFFF", # Clean white background
table.border.top.color = "#800020",
table.border.bottom.color = "#800020",
column_labels.background.color = "#f3e5e5", # Light burgundy-pink
column_labels.font.weight = "bold",
table.font.color = "black",
table.font.names = "Arial",
data_row.padding = px(6),
heading.border.bottom.color = "#800020",
column_labels.border.top.color = "#800020",
column_labels.border.bottom.color = "#800020",
table.width = px(1100), # Wider table
table.font.size = "15px" # Slightly larger font for table
) %>%
cols_width(
Feature_Group ~ px(200), # Increased width
Feature_Name ~ px(240), # Increased width
Description ~ px(450), # Increased width
Normalization ~ px(160) # Increased width
) %>%
tab_style(
style = cell_fill(color = "#c97c7c"), # Light burgundy for Daily Climate
locations = cells_body(rows = 1:4, columns = everything())
) %>%
tab_style(
style = cell_fill(color = "#d19b9b"), # Lighter burgundy-pink for Varietal
locations = cells_body(rows = 5, columns = everything())
) %>%
tab_style(
style = cell_fill(color = "#dababa"), # Light pink-gray for Moving Average
locations = cells_body(rows = 6:9, columns = everything())
) %>%
tab_style(
style = cell_fill(color = "#e3d8d8"), # Very light pink-gray for Engineered
locations = cells_body(rows = 10:20, columns = everything())
) %>%
tab_style(
style = cell_fill(color = "#ece6e6"), # Near-white pink-gray for Word-Cluster and DistilBERT
locations = cells_body(rows = 21:25, columns = everything())
) %>%
tab_style(
style = cell_fill(color = "#f3e5e5"), # Light burgundy-pink for headers
locations = cells_column_labels()
) %>%
tab_style(
style = cell_text(weight = "bold"),
locations = cells_column_labels()
) %>%
tab_style(
style = cell_text(align = "left"),
locations = cells_body(columns = c("Feature_Group", "Feature_Name", "Description", "Normalization"))
) %>%
tab_style(
style = cell_text(size = px(17)), # Larger font for footnote
locations = cells_footnotes()
)
}
# Create the data frame for the table
table_data <- data.frame(
Feature_Group = c(
"Daily Climate Features\n(samples, days)", rep("", 3),
"Varietal Features",
"Moving Average Features", rep("", 3),
"Engineered Features", rep("", 10),
"Word-Cluster Features", "", "",
"", "DistilBERT Features"
),
Feature_Name = c(
"Maxtemp_[date]", "Mintemp_[date]", "Rain_[date]", "Sunshine_[date]",
"Varietal_Embedding",
"Maxtemp_MA_7/14/30", "Mintemp_MA_7/14/30", "Rain_MA_7/14/30", "Sunshine_MA_7/14/30",
"Rainfall_Type_Dry", "Rainfall_Type_Moderate", "Rainfall_Type_Wet", "Is_Desert",
"Num_Rainy_Days", "Max_Temp_Drop_Weekly", "Rain_Avg_Spring", "Rain_Avg_Summer",
"Rain_Avg_Fall", "Rain_Spring_Interaction", "Rain_Maxtemp_Interaction",
"Word_Cluster_0_prob to Word_Cluster_8_prob", "Word_Entropy",
"[word]_prob (7 words)",
"[word]_[climate_feature]_ratio (5 words)",
"distilbert_pca_feature_0 to distilbert_pca_feature_4"
),
Description = c(
"Daily maximum temperature (°C) for each day",
"Daily minimum temperature (°C) for each day",
"Daily rainfall (mm) for each day",
"Daily sunshine duration (minutes) for each day",
"Integer encoding of varietal (for neural embedding)",
"7/14/30-day moving average of max temperature (°C)",
"7/14/30-day moving average of min temperature (°C)",
"7/14/30-day moving average of rainfall (mm)",
"7/14/30-day moving average of sunshine (minutes)",
"Indicator for total rain < 400 mm",
"Indicator for total rain 400-800 mm",
"Indicator for total rain > 800 mm",
"Indicator for avg max temp > 30°C and total rain < 250 mm",
"Count of days with rain > 0.1 mm",
"Largest 7-day temperature decrease (°C)",
"Average rainfall (mm) for spring days (Mar-May)",
"Average rainfall (mm) for summer days (Jun-Aug)",
"Average rainfall (mm) for fall days (Sep-Nov)",
"Product of num_rainy_days and avg spring rainfall",
"Product of num_rainy_days and last day max temp",
paste(
"9 probabilities from K-means clustering of 1416 tasting note words",
"(words_train2.csv). Clusters group words by co-occurrence.",
"Probabilities are averages for each varietal’s samples.",
"Test rows use training varietal probabilities.",
"See following section for details."
),
paste(
"Shannon entropy of word probabilities per sample",
"(words_train2.csv, smoothing=0.001, threshold=0.01).",
"Test samples use training mean entropy.",
"See following section for details."
),
paste(
"Probabilities of top 7 most frequent words from words_train2.csv,",
"averaged by varietal in training. Test wines use training varietal averages."
),
paste(
"Ratios of top 5 word probabilities to climate features (e.g., rain_avg_spring),",
"averaged by varietal in training, clipped to [0, 10].",
"Test wines use training varietal averages."
),
paste(
"5 PCA components from DistilBERT embeddings of varietal-level pseudo-texts",
"(words_train2.csv). Test wines use training varietal embeddings.",
"See following section for details."
)
),
Normalization = c(
"StandardScaler", "StandardScaler", "log1p", "log1p + StandardScaler",
"None (integer)",
"StandardScaler", "StandardScaler", "log1p + StandardScaler", "log1p + StandardScaler",
"None (0 or 1)", "None (0 or 1)", "None (0 or 1)", "None (0 or 1)",
"StandardScaler", "StandardScaler", "log1p + StandardScaler", "log1p + StandardScaler",
"log1p + StandardScaler", "StandardScaler", "StandardScaler",
"None (raw probabilities)", "None (raw entropy)",
"None (raw probabilities)",
"None (raw ratios)",
"None"
)
)
# Create the gt table
feature_table <- gt(table_data) %>%
tab_header(
title = "Feature Summary for Wine Dataset"
) %>%
style_table() %>%
tab_footnote(
footnote = paste(
"Note: Two CSVs are generated: 'wine_features_raw.csv' with raw values and",
"'wine_features_normalized.csv' with normalized values as specified in the 'Normalization' column.",
"Daily climate, moving average, and engineered features are raw in the raw CSV and normalized in the normalized CSV.",
"Word-cluster probabilities, word entropy, top word probabilities, and interaction ratios remain unscaled (raw) in both CSVs."
),
locations = cells_body(columns = "Feature_Group", rows = nrow(table_data))
) %>%
tab_source_note(
source_note = md(
paste(
"<span style='font-size:17px; font-weight:bold'>Feature Categories</span>: ",
"<u style='text-decoration-color:#c97c7c; text-decoration-thickness: 5px'><span style='color:black; font-size:17px'>Daily Climate Features</span></u>, ",
"<u style='text-decoration-color:#d19b9b; text-decoration-thickness: 5px'><span style='color:black; font-size:17px'>Varietal Features</span></u>, ",
"<u style='text-decoration-color:#dababa; text-decoration-thickness: 5px'><span style='color:black; font-size:17px'>Moving Average Features</span></u>, ",
"<u style='text-decoration-color:#e3d8d8; text-decoration-thickness: 5px'><span style='color:black; font-size:17px'>Engineered Features</span></u>, ",
"<u style='text-decoration-color:#ece6e6; text-decoration-thickness: 5px'><span style='color:black; font-size:17px'>Word-Cluster Features</span></u>, ",
"<u style='text-decoration-color:#ece6e6; text-decoration-thickness: 5px'><span style='color:black; font-size:17px'>DistilBERT Features</span></u>"
)
)
)
# Save the table as an image
gtsave(feature_table, filename = "feature_summary_table.png", path = "images", vwidth = 1800, vheight = 1600)
The first set of features crucial to the performance of both primary machine learning models was the n_word_cluster_probabilities. These features used K-means clustering combined with varietal-level aggregation, effectively connecting similarities in word review probabilities to specific grape varietals.
Steps of computation:
Step 1: We reference the training data to review word probabilities for specific Wine IDs. Varietals (Cabernet Sauvignon, Pinot Noir, Chardonnay) are identified from associated climate data.
Step 2: We use KMeans to group the probabilities based on distributions. This organizes descriptors like Cherry, Oak, and Lemon into distinct clusters.
Step 3: Average probabilities for each cluster are calculated for the varietals. This produces a matrix of varietal-cluster probability values.
Step 4: A sample’s cluster probabilities are assigned based on its varietal’s matrix row. For example, a Pinot Noir sample receives its varietal’s cluster values.
The scree plot was generated to evaluate the optimal number of word clusters for KMeans clustering by plotting the inertia (within-cluster sum of squares) against the number of clusters (1 to 20). This visualization helps identify the “elbow point” where adding more clusters yields diminishing reductions in inertia, justifying the selection of 9 clusters for the wine dataset.
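The computation of these cluster features can be sketched as follows. This is a minimal illustration assuming the file paths quoted in this report and a `wine_id`/`varietal` column layout; the exact column names and aggregation in our pipeline may differ slightly.

```python
# Sketch of Steps 1-4 plus the inertia curve behind the scree/elbow plot.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

words = pd.read_csv("LGBM/data/words_train2.csv")      # wine_id + 1,416 word-probability columns (assumed layout)
climate = pd.read_csv("LGBM/data/climate_train.csv")   # wine_id, varietal, daily climate columns (assumed layout)

word_cols = [c for c in words.columns if c != "wine_id"]
word_profiles = words[word_cols].T.values              # one row per word: its probability profile across wines

# Scree/elbow curve: inertia for k = 1..20 (slow on the full matrix; run once to justify k = 9).
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(word_profiles).inertia_
            for k in range(1, 21)]

# Step 2: cluster the words with the chosen k.
k = 9
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(word_profiles)

# Step 3: per-wine cluster probabilities (mean probability of the words in each cluster),
# then averaged within each varietal to form the varietal-by-cluster matrix.
cluster_probs = pd.DataFrame({
    f"word_cluster_{c}_prob": words[[w for w, l in zip(word_cols, labels) if l == c]].mean(axis=1)
    for c in range(k)
})
cluster_probs["varietal"] = climate.set_index("wine_id").loc[words["wine_id"], "varietal"].values
varietal_cluster_probs = cluster_probs.groupby("varietal").mean()

# Step 4: each (train or test) wine receives its varietal's row from varietal_cluster_probs.
```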
Subsequently, we visualize the resulting clusters and analyze the overarching thematic patterns of the words grouped together, facilitating a deeper understanding of the semantic relationships within the data.
Another significant word-related feature set was developed using the DistilBERT model from the Hugging Face Transformers library. This feature set effectively complemented the n_word_probabilities by introducing a semantic dimension to the data, capturing deeper contextual relationships between wine descriptions.
As outlined in the feature description table, the features were computed by first aggregating word probabilities at the varietal level using the training dataset (words_train2.csv). For each varietal, a pseudo-text was generated by selecting the top 10 most probable words, creating a concise representation of the varietal’s flavor profile. These pseudo-texts were then processed using DistilBERT’s feature extraction pipeline to obtain 768-dimensional embeddings, which encode the semantic meaning of the text. The test wines were imputed with the corresponding varietal embeddings to ensure consistency across the dataset.
To manage dimensionality, we applied Principal Component Analysis (PCA) using the scikit-learn library, reducing the embeddings to 5 components. The number of components was chosen by generating a scree plot for up to 50 components and selecting 5, as they captured a substantial portion of the variance while maintaining computational efficiency. This process was applied to the embeddings of all wines (10,803 in total), resulting in a compact, semantically rich feature set for downstream modeling.
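A condensed sketch of this pipeline is shown below, using the Hugging Face feature-extraction pipeline and scikit-learn PCA mentioned above. The `varietal_word_probs` and `wines` objects are illustrative stand-ins for the varietal-level aggregates and the combined train/test table, and mean-pooling the token states is an assumption rather than the exact pooling we used.

```python
# Varietal pseudo-texts -> DistilBERT embeddings -> PCA, as described above.
import numpy as np
from transformers import pipeline
from sklearn.decomposition import PCA

# varietal_word_probs: DataFrame indexed by varietal, columns = word probabilities
# averaged over that varietal's training wines (see the previous section).
pseudo_texts = {
    varietal: " ".join(row.sort_values(ascending=False).index[:10])   # top 10 most probable words
    for varietal, row in varietal_word_probs.iterrows()
}

extractor = pipeline("feature-extraction", model="distilbert-base-uncased")

def embed(text):
    tokens = np.array(extractor(text)[0])   # (n_tokens, 768) hidden states
    return tokens.mean(axis=0)              # mean-pool to a single 768-d vector

varietal_embeddings = {v: embed(t) for v, t in pseudo_texts.items()}

# Map every wine (train and test) to its varietal's embedding, then reduce to 5 components.
emb_matrix = np.vstack([varietal_embeddings[v] for v in wines["varietal"]])
distilbert_pca_features = PCA(n_components=5).fit_transform(emb_matrix)   # distilbert_pca_feature_0..4
```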
In the interest of interpretability, we also performed a factor analysis on the DistilBERT PCA features. This analysis aimed to identify the underlying structure of the data and to understand how the features relate to one another. The results are summarized in the table below, which includes the eigenvalues and the percentage of variance explained by each factor.
| Eigenvalues & Explained Variance | |||
| Factor | Eigenvalue | Explained Variance | Cumulative % |
|---|---|---|---|
| Factor 1 | 0.531 | 0.413 | 0.413 |
| Factor 2 | 0.219 | 0.170 | 0.583 |
| Factor 3 | 0.137 | 0.107 | 0.690 |
| Factor 4 | 0.080 | 0.062 | 0.752 |
| Factor 5 | 0.072 | 0.056 | 0.808 |
We then performed a Promax rotation on our factor loadings to enhance interpretability and ensure that the DistilBERT embeddings functioned as intended. After the rotation, we aggregated the loadings by varietal and computed the top words for each factor. Several words appeared in 70% of the wine samples. These words were:
These words were not very informative and did not help distinguish one varietal from another. We removed them from the analysis and focused on words that could better differentiate the varietals within each factor.
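The factor analysis itself can be reproduced along the following lines. The `factor_analyzer` package is one way to do this in Python; the report does not state the exact library used, and `X` / `wines` are illustrative names for the DistilBERT PCA matrix and the sample table.

```python
# Factor analysis with Promax rotation on the 5 DistilBERT PCA components.
import pandas as pd
from factor_analyzer import FactorAnalyzer

fa = FactorAnalyzer(n_factors=5, rotation="promax")
fa.fit(X)                                        # X: (n_wines, 5) DistilBERT PCA feature matrix

eigenvalues, _ = fa.get_eigenvalues()            # compare with the eigenvalue table above
variance, prop_var, cum_var = fa.get_factor_variance()

loadings = pd.DataFrame(
    fa.loadings_,
    index=[f"distilbert_pca_feature_{i}" for i in range(X.shape[1])],
    columns=[f"Factor{i + 1}" for i in range(5)],
)

# Aggregate factor scores by varietal to find the top Promax-weighted varietals per factor.
scores = pd.DataFrame(fa.transform(X), columns=loadings.columns)
scores["varietal"] = wines["varietal"].values
top_varietals_per_factor = scores.groupby("varietal").mean()
```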
# install.packages(c("readr","dplyr","tidyr","ggplot2","plotly","viridis"))
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(plotly)
library(viridis)
# 1. Load your tidy data
word_df <- read_csv("project_files/factor_top_words.csv")
# Expect columns: factor, word, count
# 2. Make sure every (word, factor) pair is present, filling missing with 0
word_df_full <- word_df %>%
complete(word, factor, fill = list(count = 0)) %>%
# keep your ordering
mutate(
factor = factor(factor, levels = unique(factor)),
word = factor(word, levels = rev(unique(word)))
)
# 3. Build the ggplot
p <- ggplot(word_df_full, aes(x = factor, y = word, fill = count)) +
geom_tile(color = "white", width = 1, height = 1) + # white borders, full cell
geom_text(aes(label = count), size = 3) + # annotate counts
scale_fill_distiller(
name = "Count",
palette = "Reds", # ColorBrewer Reds
direction = 1, # 1 = light→dark
na.value = "grey90",
limits = c(0, max(word_df_full$count))
) +
labs(x = NULL, y = NULL, title = "Top Word Frequencies per Factor") +
theme_minimal(base_size = 14) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_blank()
)
# 4. Convert to interactive Plotly
ggplotly(p, tooltip = c("x", "y", "fill"))
Thanks to the PCA and Promax rotation, we were able to establish a clear structure in the factor loadings. The first factor loaded heavily on notes like refreshing, clean, crisp, and white—a clue that it corresponds to drier, crisp white wines. The second factor was more fruity yet also refreshing, suggesting sweeter, fruity white or rosé wines. The third factor was more balanced and structured, common descriptors for many drinkable red wines. The fourth factor indicated a bold yet sweet and fruity red wine. The fifth factor’s notes pointed to a more spicy, dark, and bold red wine.
One interesting point—the note acidity was highly represented across all wines, much like the near-ubiquitous words we removed earlier. However, we chose to retain it in this analysis not to show its general presence, but rather to highlight its absence in factor five. This lack of acidity provides meaningful insight into the nature of the varietals in this factor and further differentiates it from the others, potentially contributing to stronger model predictability.
| Factor Loadings for Wine Varietals1 | ||||
| PCA with Promax Rotation | ||||
| Factor1: Clean, Crisp & White | Factor2: Refreshing & Fruity | Factor3: Balanced & Structured | Factor4: Red, Fruity & Sweet | Factor5: Bold, Spicy & Dark |
|---|---|---|---|---|
| St. Pepin, Seyval Blanc, Edelweiss, Pinot Grigio, Itasca, Apple Cider, La Crescent, Brianna | Pinot Noir, Chardonnay, Merlot, Sparkling Wine, Rosé of Pinot Noir, Rose, Sauvignon Blanc, Apple Cider | Mead, Chardonnay, Norton, Blend, Roussanne, Apple Cider, Viognier, Traminette | Pinot Noir, Marquette, Concord, Hard Cider, Tempranillo, Barbera, Apple Cider, GSM | Merlot, Meritage, Proprietary Red Blend, Cabernet Sauvignon, Marechal Foch, Cabernet Sauvignon Blend, Norton, Malbec |
| 1 Factors group varietals by taste profiles from PCA with promax rotation. | ||||
After merging the rotation data with the varietal labels, we pulled the top Promax-weighted wines associated with each factor, which confirmed our intuition and theoretical understanding of the factors. This supports that the DistilBERT embeddings were effective in capturing the semantic meaning of the words and provided our model with an interpretable semantic understanding of the underlying structure of the word data.
To demonstrate the effectiveness of the DistilBERT embeddings and the factor analysis, we provided a scatter plot of the first and fifth factors. In the interest of parsimony, we included only these two factors, as they are the most interpretable and distinct.
Our group recognized the power of feature engineering and its potential to enhance both our understanding of the data and the model’s learning capability. However, we also wanted to ensure that any engineered feature was empirically justified. To this end, we performed a feature importance analysis using the sklearn.feature_selection module, specifically the mutual_info_regression function.
Given the cyclical nature of the temporal features and the complexity of the data we were providing to our model, we chose to use mutual information to examine relationships between the features and the target variable—specifically for its ability to capture non-linear relationships. Our findings are presented below. Note: The feature importance analysis was initially performed on a version of our dataset that included 33 features but excluded word cluster probabilities, DistilBERT features, and others. An identical analysis was later performed on a dataset that included those features; however, no features were removed, as doing so degraded model performance.
After this initial pass and an examination of the correlative nature of the features we created, it became clear that the word cluster probabilities and the varietal features were the most salient in terms of mutual information with our target words variable.
After applying a 90% feature importance threshold, we identified the most impactful features likely to benefit our model. These features, shown above, reduced our feature set from 33 to 21.
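A sketch of this screening step is below. Because `mutual_info_regression` scores features against a single target, the scores here are averaged over the keyword columns, and the 90% rule is read as keeping the smallest feature set whose cumulative share of total mutual information reaches 90%; both are our reading of the procedure rather than the exact project code.

```python
# Mutual-information screening of the 33 engineered features against the word targets.
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# X: DataFrame of engineered features; Y: (n_wines, n_keywords) array of word probabilities.
mi = np.zeros(X.shape[1])
for j in range(Y.shape[1]):                 # slow over all 1,416 keywords; a frequent-word subset also works
    mi += mutual_info_regression(X, Y[:, j], random_state=0)
mi /= Y.shape[1]

importance = pd.Series(mi, index=X.columns).sort_values(ascending=False)
cum_share = importance.cumsum() / importance.sum()
selected = cum_share[cum_share <= 0.90].index.tolist()   # features covering ~90% of total MI
```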
Before this elimination step, our model’s MAE was approximately 0.022. Running the model with the reduced feature set—without any hyperparameter tuning—yielded an MAE of ~0.020, a 9% improvement. More is not always better when it comes to model performance. Through the feature engineering process and feature importance analysis, we demonstrated that we can strengthen signal and improve efficiency by leveraging the right tools.
To explore baseline model behavior, we worked with a Feedforward Neural Network (FNN) (see [Appendix]) trained on the initial engineered input features (climate + varietal) to predict possible wine tasting notes. The network consists of two hidden layers (512 and 256 units, respectively), each with LayerNorm, ReLU, and 45% dropout. The output layer uses a sigmoid activation to produce probabilities for the tasting notes.
We chose this architecture to give the model enough depth to learn nonlinear relationships between climate/varietal features and the tasting notes. It’s not overly deep, which keeps training time manageable, but still expressive enough to handle the complexity of the task.
Given the extreme sparsity and imbalance within the labels (≈80% zeros), we applied focal loss during training to help the model focus on rarer positive labels, along with an L1 sparsity penalty on the predicted probabilities to discourage overconfident outputs.
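A minimal sketch of this architecture and loss is given below. The layer sizes and dropout match the description above and the WineNet call used later in the SHAP section; the focal-loss and L1-penalty formulation is illustrative rather than the exact appendix code.

```python
# FNN baseline: 1018 inputs -> 512 -> 256 -> 1416 logits, LayerNorm + ReLU + 45% dropout.
import torch
import torch.nn as nn

class WineNet(nn.Module):
    def __init__(self, input_dim, output_dim, hidden_dim1=512, hidden_dim2=256, dropout_rate=0.45):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim1), nn.LayerNorm(hidden_dim1), nn.ReLU(), nn.Dropout(dropout_rate),
            nn.Linear(hidden_dim1, hidden_dim2), nn.LayerNorm(hidden_dim2), nn.ReLU(), nn.Dropout(dropout_rate),
            nn.Linear(hidden_dim2, output_dim),        # raw logits; sigmoid is applied outside the model
        )

    def forward(self, x):
        return self.net(x)

def focal_l1_loss(logits, targets, gamma=2.0, l1_weight=1e-4):
    """Focal-weighted binary cross-entropy plus an L1 penalty on the predicted probabilities."""
    probs = torch.sigmoid(logits)
    bce = nn.functional.binary_cross_entropy(probs, targets, reduction="none")
    p_t = probs * targets + (1 - probs) * (1 - targets)   # probability assigned to the observed label
    focal = ((1 - p_t) ** gamma * bce).mean()             # down-weights easy (mostly zero) labels
    return focal + l1_weight * probs.abs().mean()         # sparsity penalty on the outputs
```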
To further address overfitting and tune model capacity, we:
- Conducted a randomized hyperparameter grid search across dropout rates, learning rates, batch sizes, and hidden dimensions
- Used a dropout rate of 45%
- Ran five-fold cross-validation during early experimentation to assess generalization
Training progress was tracked with a detailed set of metrics each epoch:
- Validation MAE and validation F1 to assess prediction accuracy
- Mean probability to monitor whether the model’s output probabilities matched the expected sparsity of the label space
Conclusions: As expected, the FNN didn’t perform as well as more advanced models like BERT and LGBM on Kaggle. However, its simpler structure and deterministic input format made it a strong candidate for SHAP-based explanation. This allowed us to generate visual insights into which features (e.g., rainfall, temperature changes) most influenced the model’s predictions, laying the groundwork for deeper interpretability in future iterations.
We also implemented a Light Gradient Boosting Machine (LightGBM) model. LightGBM is an efficient gradient-boosting decision-tree algorithm known for its speed, scalability, accuracy, and interpretability, particularly suited to handling large and complex datasets due to its histogram-based method and leaf-wise tree growth strategy (see the LightGBM documentation).
We configured this model as a regression task using a Tweedie loss function, which is effective when modeling continuous target variables that include many zeros or highly skewed distributions. The Tweedie loss is mathematically represented as:
\[ \mathcal{L}(y, \mu) = -\frac{y\,\mu^{1-p}}{1-p} + \frac{\mu^{2-p}}{2-p} \]
Here \(y\) is the observed target, \(\mu\) is the predicted mean, and \(p\) is the Tweedie variance power (with \(1 < p < 2\) for zero-inflated, right-skewed targets).
To enhance generalization and predictive robustness, we implemented KMeans clustering over the word-level target matrix across multiple resolutions (e.g., 30, 50, 70, 90 clusters). Each cluster groups semantically similar words, and a separate LightGBM model is trained for each cluster at each resolution. This process forms a multi-resolution modeling strategy, which simulates an ensemble by averaging predictions across all cluster resolutions—capturing both coarse and fine-grained semantic patterns within wine reviews.
Additional performance improvements were introduced through feature scaling of the most important climate variables and the creation of interaction features (e.g., rain_spring_interaction, rain_maxtemp_interaction) to model non-linear dependencies. Post-prediction, we applied a sparsity factor and thresholding mechanism to better align with the distributional characteristics of the Kaggle test data, simulating real-world sparsity.
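The sketch below illustrates the multi-resolution strategy with the lightgbm scikit-learn API. How each per-cluster model's target is formed is not spelled out above, so here each model regresses the mean probability of its cluster's words; the hyperparameter values are placeholders (see the tuned values in the tables that follow).

```python
# Multi-resolution ensemble: cluster the word targets at several resolutions,
# fit one Tweedie LightGBM per cluster, and average predictions across resolutions.
import numpy as np
from sklearn.cluster import KMeans
from lightgbm import LGBMRegressor

RESOLUTIONS = (30, 60, 90)

def fit_multires(X, Y):
    """X: (n_wines, n_features); Y: (n_wines, n_words) word probabilities."""
    models = {}
    for k in RESOLUTIONS:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Y.T)   # cluster the words
        for c in range(k):
            cols = np.where(labels == c)[0]
            target = Y[:, cols].mean(axis=1)                                         # cluster-level target
            reg = LGBMRegressor(objective="tweedie", tweedie_variance_power=1.5,
                                learning_rate=0.01, num_leaves=30, n_estimators=340)
            models[(k, c)] = (reg.fit(X, target), cols)
    return models

def predict_multires(models, X, n_words):
    preds = np.zeros((len(RESOLUTIONS), X.shape[0], n_words))
    for r, k in enumerate(RESOLUTIONS):
        for c in range(k):
            reg, cols = models[(k, c)]
            preds[r][:, cols] = reg.predict(X)[:, None]    # broadcast the cluster prediction to its words
    return preds.mean(axis=0)                              # average over the cluster resolutions
```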
The first model leveraged only the engineered climate features described in the Feature Engineering table of this report, excluding daily climate data. This streamlined approach enhanced efficiency and provided interpretability by enabling stakeholders to assess which aggregated climate attributes most significantly contribute to predicting wine review probabilities. Such insights empower consumers to strategically market or rank wines based on the climatic characteristics of their vintage year. This model achieved an MAE of 0.02541 in Kaggle:
library(gt)
library(tibble)
library(htmltools)
# Hyperparameters DataFrame
hparams_df <- tibble(
Parameter = c(
"Learning Rate", "Num Leaves", "Max Depth",
"Num Estimators", "Num Clusters", "Num Word Clusters",
"Use Varietal Features", "Use Varietal Correlation",
"Sparsity Factor", "Sparsity Threshold", "Objective", "Final Val MAE"
),
Value = c(
"0.010817", "30", "24", "340", "30,60,90",
"0", "False", "False", "1.0", "0.225085", "tweedie", .0256
)
)
# Feature Importances DataFrame
feature_importances_df <- tibble(
Feature = c(
"Num Rainy Days", "Spring Rain Avg", "30-day Max Temp MA",
"Weekly Max Temp Drop", "7-day Max Temp MA", "Fall Rain Avg",
"7-day Min Temp MA", "30-day Min Temp MA", "Summer Rain Avg",
"14-day Min Temp MA"
),
Gain = c(
4808.95, 2832.41, 2739.82, 2464.91, 2183.81,
2097.73, 2041.06, 1924.11, 1895.59, 1649.31
)
)
# Custom compact styling function (your original burgundy theme)
compact_burgundy_theme <- function(tbl, title) {
tbl %>%
gt() %>%
tab_header(title = md(title)) %>%
opt_table_outline(style = "solid", width = px(1), color = "#800020") %>%
tab_options(
table.background.color = "#FFFFFF",
table.border.top.color = "#800020",
table.border.bottom.color = "#800020",
column_labels.background.color = "#f3e5e5",
column_labels.font.weight = "bold",
table.font.color = "black",
table.font.names = "Arial",
data_row.padding = px(4),
heading.border.bottom.color = "#800020",
column_labels.border.top.color = "#800020",
column_labels.border.bottom.color = "#800020",
table.width = px(330),
table.font.size = "12px",
table.align = "center"
)
}
# Styled Hyperparameters Table
hparams_table <- compact_burgundy_theme(hparams_df, "**Model Hyperparameters**") %>%
cols_width(
Parameter ~ px(160),
Value ~ px(120)
)
# Styled Feature Importances Table
feature_importances_table <- compact_burgundy_theme(feature_importances_df, "**Top 10 Features (Gain)**") %>%
fmt_number(columns = "Gain", decimals = 2, use_seps = TRUE) %>%
cols_width(
Feature ~ px(180),
Gain ~ px(100)
)
# Display side by side (in R Markdown)
browsable(
div(style = "display:flex; gap:20px; justify-content:center;",
hparams_table, feature_importances_table)
)
| Model Hyperparameters | |
| Parameter | Value |
|---|---|
| Learning Rate | 0.010817 |
| Num Leaves | 30 |
| Max Depth | 24 |
| Num Estimators | 340 |
| Num Clusters | 30,60,90 |
| Num Word Clusters | 0 |
| Use Varietal Features | False |
| Use Varietal Correlation | False |
| Sparsity Factor | 1.0 |
| Sparsity Threshold | 0.225085 |
| Objective | tweedie |
| Final Val MAE | 0.0256 |
| Top 10 Features (Gain) | |
| Feature | Gain |
|---|---|
| Num Rainy Days | 4,808.95 |
| Spring Rain Avg | 2,832.41 |
| 30-day Max Temp MA | 2,739.82 |
| Weekly Max Temp Drop | 2,464.91 |
| 7-day Max Temp MA | 2,183.81 |
| Fall Rain Avg | 2,097.73 |
| 7-day Min Temp MA | 2,041.06 |
| 30-day Min Temp MA | 1,924.11 |
| Summer Rain Avg | 1,895.59 |
| 14-day Min Temp MA | 1,649.31 |
To understand which features most influenced predictions, we used LightGBM’s gain-based feature importance, which measures how much each feature reduces the model’s loss function across all splits. The feature with the highest gain was num_rainy_days, suggesting that the frequency of rain events during the growing season is a strong predictor of wine review vocabulary—likely due to its impact on grape development, disease risk, and harvest timing.
Other top features included:
- spring_rain_avg: Critical rainfall during budding and flowering phases.
- maxtemp_ma_30: A long-term max temperature average, capturing overall warmth.
- max_temp_drop_weekly: Indicates weather volatility that may stress vines.
- Minimum temperature and sunshine averages: Linked to ripening patterns and acidity balance.
These results highlight how seasonal and aggregated climate metrics meaningfully relate to the language used in wine reviews.
While we focused on gain, LightGBM also offers the split metric, which counts how often a feature is used to split the data. Though less tied to performance, it shows how frequently features are selected across trees. For more details on LightGBM feature importance, see GeeksforGeeks.
This model incorporated a subset of the engineered features (see the appendix code section), excluding daily climate data to enhance efficiency and interpretability. It also introduced a suite of DistilBERT-derived features, capturing semantic richness from wine reviews. Unlike subsequent models that used per-varietal aggregation, this approach computed DistilBERT PCA embeddings, word entropy, and top word ratios on a per-wine basis. This adjustment yielded improved results with LightGBM, likely due to finer granularity in capturing wine-specific language patterns. To manage dimensionality, the DistilBERT embeddings were reduced using Principal Component Analysis (PCA), retaining the top 20 components per wine. This configuration allowed the model to effectively integrate structured climate variables with detailed textual descriptors while maintaining computational scalability. This model achieved an MAE of 0.02061 in Kaggle.
library(gt)
library(tibble)
library(htmltools)
# ---- Feature Importance Table ----
feature_importances_df <- tibble(
Feature = c(
"word_entropy", "word_cluster_8_prob", "word_cluster_4_prob",
"distilbert_pca_feature_3", "distilbert_pca_feature_0",
"distilbert_pca_feature_1", "distilbert_pca_feature_2",
"word_cluster_3_prob", "distilbert_pca_feature_4",
"distilbert_pca_feature_6"
),
Importance = c(
3665575.45, 2768614.30, 2422616.90,
2198461.51, 2174008.85, 2140382.56,
1972365.65, 1940113.69, 1920906.32,
1888735.51
)
)
# ---- Hyperparameters & Metrics Table ----
hyperparams_df <- tibble(
Parameter = c(
"Learning Rate", "Num Leaves", "Max Depth", "Num Estimators",
"Feature Fraction", "Lambda L1", "Lambda L2",
"Min Child Weight", "Min Split Gain",
"Tweedie Variance Power", "Objective", "Use DistilBERT",
"Cluster Resolutions", "Final Train MAE", "Final Val MAE"
),
Value = c(
0.028192, 70, 16, 1000,
0.741177, 0.949204, 0.614408,
0.000428, 0.014655,
1.678497, "mae", "TRUE",
"40, 80, 110", 0.018348, 0.020770
)
)
# ---- Styling Theme (Your Burgundy Theme) ----
compact_burgundy_theme <- function(tbl, title) {
tbl %>%
gt() %>%
tab_header(title = md(title)) %>%
opt_table_outline(style = "solid", width = px(1), color = "#800020") %>%
tab_options(
table.background.color = "#FFFFFF",
table.border.top.color = "#800020",
table.border.bottom.color = "#800020",
column_labels.background.color = "#f3e5e5",
column_labels.font.weight = "bold",
table.font.color = "black",
table.font.names = "Arial",
data_row.padding = px(4),
heading.border.bottom.color = "#800020",
column_labels.border.top.color = "#800020",
column_labels.border.bottom.color = "#800020",
table.width = px(350),
table.font.size = "12px",
table.align = "center"
)
}
# ---- Apply Theme to Each Table ----
feature_table <- compact_burgundy_theme(feature_importances_df, "**Top 10 Feature Importances**") %>%
fmt_number(columns = "Importance", decimals = 2, use_seps = TRUE) %>%
cols_width(
Feature ~ px(200),
Importance ~ px(120)
)
hyperparam_table <- compact_burgundy_theme(hyperparams_df, "**Hyperparameters & Results**") %>%
fmt_number(columns = "Value", decimals = 6, use_seps = FALSE) %>%
cols_width(
Parameter ~ px(220),
Value ~ px(120)
)
# ---- Display Side by Side ----
browsable(
div(style = "display: flex; gap: 25px; justify-content: center;",
hyperparam_table, feature_table)
)
| Hyperparameters & Results | |
| Parameter | Value |
|---|---|
| Learning Rate | 0.028192 |
| Num Leaves | 70 |
| Max Depth | 16 |
| Num Estimators | 1000 |
| Feature Fraction | 0.741177 |
| Lambda L1 | 0.949204 |
| Lambda L2 | 0.614408 |
| Min Child Weight | 0.000428 |
| Min Split Gain | 0.014655 |
| Tweedie Variance Power | 1.678497 |
| Objective | mae |
| Use DistilBERT | TRUE |
| Cluster Resolutions | 40, 80, 110 |
| Final Train MAE | 0.018348 |
| Final Val MAE | 0.02077 |
| Top 10 Feature Importances | |
| Feature | Importance |
|---|---|
| word_entropy | 3,665,575.45 |
| word_cluster_8_prob | 2,768,614.30 |
| word_cluster_4_prob | 2,422,616.90 |
| distilbert_pca_feature_3 | 2,198,461.51 |
| distilbert_pca_feature_0 | 2,174,008.85 |
| distilbert_pca_feature_1 | 2,140,382.56 |
| distilbert_pca_feature_2 | 1,972,365.65 |
| word_cluster_3_prob | 1,940,113.69 |
| distilbert_pca_feature_4 | 1,920,906.32 |
| distilbert_pca_feature_6 | 1,888,735.51 |
The feature importances were again calculated using the gain metric. We hypothesize the values are much larger in this model compared to the climate-only version due to the higher number of informative, high-variance features, particularly those derived from text, which contribute more frequently and substantially during tree construction.
This model still offers interpretability through the visualizations in the Feature Engineering section, particularly the n-word cluster probabilities, which group semantically similar terms. Although a similar analysis could be done for the DistilBERT PCA components, we opted not to complete it, as later deep learning models using per-varietal aggregation performed better overall for this feature. Some key insights from the feature importances of this model include:
- word_entropy: Measures the unpredictability or diversity of language used in a wine’s review. Higher entropy likely reflects wines with more complex or nuanced descriptions, which may correlate with distinctive sensory profiles or premium positioning (a minimal sketch of this computation follows the list).
- word_cluster_8_prob (Red Fruit & Spicy): High importance suggests that the presence of red fruit and spice-related language (e.g., cherry, spice, tannins) is a strong signal in predicting review patterns—possibly due to their common association with popular and recognizable varietals.
- word_cluster_4_prob (Citrus & Mineral): Indicates that mentions of citrus, minerality, and freshness (e.g., lemon, pear, floral) play a key role in model predictions, likely capturing characteristics tied to lighter white wines or crisp finishes appreciated by certain consumer segments.
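As referenced above, here is a minimal sketch of the word_entropy computation, using the smoothing (0.001) and threshold (0.01) values quoted in the feature table; the exact order of smoothing and thresholding in our pipeline is an assumption.

```python
# Shannon entropy of a wine's word-probability vector (one plausible treatment).
import numpy as np

def word_entropy(probs, smoothing=0.001, threshold=0.01):
    p = np.asarray(probs, dtype=float) + smoothing   # smooth away exact zeros
    p = p[p > threshold]                             # keep only non-negligible words
    p = p / p.sum()                                  # renormalize to a distribution
    return float(-(p * np.log(p)).sum())             # higher = more diverse vocabulary
```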
As part of our exploration into deep learning approaches for the Kaggle Challenge Four, we deployed a transformer-based model called ClimateEncoderTransformer to predict the probabilities of tasting note keywords based on climate data and varietal information. This model draws inspiration from BERT’s architecture but is tailored for our multi-label regression task, leveraging embeddings, attention mechanisms, and a carefully designed structure to capture the relationships between environmental factors and wine sensory descriptors. Here’s a breakdown of how it works, its internal components, and the key design choices we made.
The ClimateEncoderTransformer takes two inputs: a varietal identifier (an integer encoding the grape type) and a vector of climate features (e.g., temperature, rainfall, sunshine) for each wine sample. Since we’re working with just these two inputs, the model processes a sequence of length 2—one token for the varietal and one for the climate data. Each input is first passed through an embedding layer to convert it into a dense representation in a shared vector space of dimension d_model=512 (as set in train.py).
For the varietal, we used an nn.Embedding layer to map the integer ID to a dense vector, capturing latent characteristics of each grape type. For the climate data, we used a linear layer to project the climate features into the same d_model space, followed by LayerNorm and dropout for regularization.
To make the model aware of the positional relationship between the varietal and climate tokens, we added positional embeddings. We gave the model the option to use either learned embeddings (via nn.Embedding) or sinusoidal embeddings (inspired by the original Transformer paper), controlled by a hyperparameter use_learned_pos. For the final model, we set use_learned_pos=True (as in train.py), which is more akin to BERT’s learned positional embeddings, allowing the model to adapt the positional information to our short sequence length of 2.
The core of the model is a transformer encoder, implemented using PyTorch’s nn.TransformerEncoder with 2 layers (num_layers=2) and 4 attention heads (num_heads=4, as set in train.py). Each head processes a subspace of the d_model dimensions (512 / 4 = 128 dimensions per head), enabling the model to capture interactions between the varietal and climate tokens through self-attention. The feedforward network (FFN) within each transformer layer expands to dim_feedforward = d_model * mlp_mult = 512 * 8 = 4096 (with mlp_mult=8 from train.py). We used GELU activation in the FFN, matching BERT’s hidden_act, and applied dropout (attn_dropout=0.2, as set in train.py) to prevent overfitting within the attention and FFN sub-layers.
For the output, we extracted the representation of the last token (the climate token) after the transformer encoder, as we found it captured the combined influence of the varietal and climate through attention. This representation (a d_model-sized vector) is passed through a single linear layer (fc_out) to produce raw logits for the 1416 tasting note keywords. In our WinePredictor LightningModule, we apply a sigmoid activation to these logits (preds = torch.sigmoid(logits)) to constrain the predictions to [0, 1], matching the target range of the keyword probabilities. This setup ensures the model outputs probabilities directly, which is ideal for our regression task.
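Putting the pieces above together, an illustrative reconstruction of the architecture looks roughly as follows (d_model=512, two encoder layers, four heads, mlp_mult=8, learned positional embeddings, dropout 0.2); the actual model.py may differ in details.

```python
# Illustrative reconstruction of ClimateEncoderTransformer (not the exact model.py).
import torch
import torch.nn as nn

class ClimateEncoderTransformer(nn.Module):
    def __init__(self, n_varietals, climate_dim, n_words=1416,
                 d_model=512, num_layers=2, num_heads=4, mlp_mult=8, dropout=0.2):
        super().__init__()
        self.varietal_emb = nn.Embedding(n_varietals, d_model)        # token 0: varietal ID
        self.climate_proj = nn.Sequential(                            # token 1: climate feature vector
            nn.Linear(climate_dim, d_model), nn.LayerNorm(d_model), nn.Dropout(dropout)
        )
        self.pos_emb = nn.Embedding(2, d_model)                       # learned positions for the 2-token sequence
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, dim_feedforward=d_model * mlp_mult,
            dropout=dropout, activation="gelu", batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.fc_out = nn.Linear(d_model, n_words)

    def forward(self, varietal_ids, climate):
        tokens = torch.stack(
            [self.varietal_emb(varietal_ids), self.climate_proj(climate)], dim=1
        )                                                             # (batch, 2, d_model)
        tokens = tokens + self.pos_emb(torch.arange(2, device=tokens.device))
        encoded = self.encoder(tokens)
        return self.fc_out(encoded[:, -1])                            # logits from the last (climate) token
```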
For the loss, we used Mean Absolute Error (MAE) via nn.L1Loss, as specified in model.py, which aligns with the Kaggle evaluation metric. We also added a mean penalty term (lambda_mean * |preds.mean() - target_mean|) to encourage the model’s predictions to match the dataset’s mean probability (around 0.0356), helping stabilize the output distribution. For optimization, we used AdamW with L2 regularization (weight_decay in train.py), which is a common choice for transformers, providing a balance of fast convergence and regularization to prevent overfitting.
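In code, the objective and optimizer described here amount to something like the following; lambda_mean, the learning rate, and the weight decay are placeholders for the values found by the Bayesian search described below.

```python
# MAE loss with a mean-matching penalty, optimized with AdamW (values illustrative).
import torch
import torch.nn as nn

model = ClimateEncoderTransformer(n_varietals=64, climate_dim=21)   # climate_dim is illustrative
l1 = nn.L1Loss()
target_mean, lambda_mean = 0.0356, 0.1                              # lambda_mean illustrative

def training_loss(logits, targets):
    preds = torch.sigmoid(logits)
    return l1(preds, targets) + lambda_mean * (preds.mean() - target_mean).abs()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
```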
This ClimateEncoderTransformer combines the power of embeddings to represent varietal and climate data, attention to model their interactions, and a streamlined output head to predict keyword probabilities. Its design, inspired by BERT but adapted for our task, allowed us to effectively capture the influence of climate on wine tasting notes, as evidenced by the feature importance analysis where varietal embeddings and climate features ranked highly.
The transformer model was trained using the following hyperparameters:
- d_model: The dimensionality of the input and output embeddings, set to 512.
- mlp_mult: The multiplier for the feedforward network size.
- num_heads: The number of attention heads in the multi-head attention mechanism, set to 16.
- num_layers: The number of transformer encoder layers, set to 4.
- use_learned_pos: A boolean indicating whether to use learned positional embeddings (True) or sinusoidal embeddings (False).
- lambda_mean: A regularization parameter for the mean penalty term.
- learning_rate: The learning rate for the AdamW optimizer.
- temperature: A scaling factor for the logits before applying sigmoid, set to 1.0.
- beta1, beta2: The exponential decay rates for the first and second moment estimates in the AdamW optimizer.
- eps: A small constant for numerical stability in the AdamW optimizer.
- weight_decay: The weight decay parameter for L2 regularization in the AdamW optimizer.

These were all tuned using Bayesian optimization with Ax/BoTorch. The correlation heatmap above shows how each hyperparameter correlated with the validation loss and the mean of the predicted probabilities. These findings were crucial in guiding our hyperparameter tuning process.
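For completeness, a heavily hedged sketch of the Bayesian search loop is shown below, assuming Ax's service API (AxClient); the actual search script is not reproduced in this report, and the parameter ranges and the train_and_validate helper are placeholders.

```python
# Bayesian hyperparameter optimization with Ax/BoTorch (service API; ranges illustrative).
from ax.service.ax_client import AxClient
from ax.service.utils.instantiation import ObjectiveProperties

ax_client = AxClient()
ax_client.create_experiment(
    name="climate_transformer_tuning",
    parameters=[
        {"name": "learning_rate", "type": "range", "bounds": [1e-5, 1e-2], "log_scale": True},
        {"name": "lambda_mean", "type": "range", "bounds": [0.0, 1.0]},
        {"name": "weight_decay", "type": "range", "bounds": [1e-6, 1e-1], "log_scale": True},
    ],
    objectives={"val_mae": ObjectiveProperties(minimize=True)},
)

for _ in range(30):
    params, trial_index = ax_client.get_next_trial()
    val_mae = train_and_validate(**params)                 # placeholder: trains one model, returns val MAE
    ax_client.complete_trial(trial_index=trial_index, raw_data=val_mae)
```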
The model performed quite well, especially considering no actual time series data was used to generate our predictions. Our model achieved an MAE of 0.01879. This model used a two-layer transformer encoder with 4 attention heads and a 512-dimensional embedding. We then ensembled this model with our second-best performer, a three-layer transformer encoder with 8 attention heads, also with a 512-dimensional embedding. Adding the additional model dropped our MAE to 0.01873.
Why interpretability matters:
“To create a model, we make choices about what’s important enough to
include, simplifying the world into a toy version that can be easily
understood and from which we can infer important facts and actions. We
expect it to handle only one job and accept that it will occasionally
act like a clueless machine, one with enormous blind spots.” — Cathy
O’Neil, Weapons of Math Destruction
Explainable AI helps us open up the “black box” of machine learning models and better understand how they arrive at predictions. As AI systems are increasingly used in high-stakes decision-making, interpretability tools like SHAP (SHapley Additive exPlanations) are becoming essential—not just for model validation, but for building transparency, trust, and accountability.
Unlike traditional feature importance methods, which offer a global view of how features affect overall model performance, SHAP values estimate the individual impact of each feature on a specific prediction. They tell us whether a feature pushed the prediction higher or lower, and by how much. In the context of this project, that means we can explain why a specific wine was predicted to have a certain tasting note—not just which features matter overall.
Application: Feed Forward Neural Network (FNN)
(see [Appendix] for the model’s code)
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.decomposition import PCA
from scipy.stats import linregress
import datetime
import random
import os
## see appendix for data load and full model code
# importing the FFN from the module
from fnn_module import WineNet
model = WineNet(input_dim=1018, output_dim=1416, hidden_dim1=512, hidden_dim2=256, dropout_rate=0.45)
model.load_state_dict(torch.load("trained_model.pt", map_location="cpu"))
model.eval()
def model_predict(X_numpy):
X_tensor = torch.tensor(X_numpy, dtype=torch.float32)
with torch.no_grad():
logits = model(X_tensor)
probs = torch.sigmoid(logits).numpy()
return probs
Comparing Probability Predictions for Wine Notes #1 and #47:
To understand how our Feedforward Neural Network (FNN) makes decisions, we applied SHAP to the model’s output (formatted for Kaggle). The SHAP Summary Plots below show a comparison between how the model arrived at predictions for two different tasting notes based on specific features. These scores are likely overconfident due to label sparsity (~80% zeros), but still valuable for interpretability analysis.
- Tasting Note 1 had a predicted probability of 0.82
- Tasting Note 47 had a predicted probability of 0.41
import shap
shap.initjs()
# Prepare SHAP data
background = X_train[:1000] # Background dataset for SHAP
X_explain = X_val[:100] # Samples to explain
feature_names = features_df.columns.tolist() # Feature names from CSV
# Set device to GPU if available, otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Define prediction function for first tasting note
def model_predict(inputs, output_idx=1):
model.eval()
with torch.no_grad():
inputs_tensor = torch.tensor(inputs, dtype=torch.float32).to(device)
outputs = model(inputs_tensor)
probs = torch.sigmoid(outputs[:, output_idx]).cpu().numpy()
return probs
# Initialize KernelExplainer
explainer = shap.KernelExplainer(lambda x: model_predict(x, output_idx=1), background)
# Compute SHAP values
shap_values = explainer.shap_values(X_explain, nsamples=200)
# Visualize for the first tasting note
shap.summary_plot(shap_values, X_explain, feature_names=feature_names, plot_type="bar")
shap.summary_plot(shap_values, X_explain, feature_names=feature_names)
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[0], X_explain[0], feature_names=feature_names)
# to generate the plot for other notes, simply update the output_idx
A positive SHAP value means the feature pushed the prediction for the tasting note higher (e.g., increased the probability). A negative SHAP value means the feature pulled the prediction lower.
The model is more confident in predicting this specific tasting note compared to the first. There are features (like num_rainy_days and varietal_embedding) with clearly high or low values consistently nudging the prediction up or down. SHAP values are larger, meaning this tasting note has a stronger signal in the data.
Traditional feature importance provides a general sense of which features matter across the dataset. SHAP takes it further by showing how much each feature pushed a prediction up or down — and whether it was due to a high or low value for that specific wine. This level of detail is increasingly important in real-world applications, where decisions often need to be interpretable and justifiable at the individual level. (Source: “SHAP Values for Beginners” – A Data Odyssey).
For example:
- A red dot = a high value for that feature
- A blue dot = a low value
- The position along the x-axis = the SHAP value (how much it moved the prediction)
In our wine modeling project, SHAP helps us understand how factors like rainfall, temperature drops, or varietal embeddings influence the predicted tasting notes of a wine. While the stakes in wine prediction are relatively low (after all, no one’s liberty hinges on the flavor profile of a Merlot), the techniques we use are the same ones applied in life-altering domains.
Take the COMPAS algorithm, for example. It’s used in the criminal justice system to predict a defendant’s likelihood of reoffending. COMPAS is a proprietary algorithm, meaning the public can’t see how it works. Defendants, lawyers, and even judges don’t know which variables influence the score or how heavily they’re weighted. This violates a basic principle of due process: the right to understand and challenge the evidence against you.
Tasting Note One (Row1)
For Tasting Note 47, the model relies more heavily on a small set of aggregated features—especially varietal_embedding and num_rainy_days—compared to individual daily weather values. These engineered inputs consistently nudged predictions up or down, and had larger SHAP values than granular features like rain_2022-04-14.
SHAP helped confirm that meaningful feature engineering—rather than just throwing more raw data at the model—made a real difference in both performance and interpretability.
This model’s preference for aggregated features validates our decision to refine and reduce the variables upfront, which:
- Improved model accuracy
- Reduced input dimensionality and computation
- Made the model easier to interpret
In our wine model, we’re using proxy variables—engineered from climate data, seasonal trends, and varietal embeddings—to predict complex outcomes like sensory flavor notes. If we were to scale this up to product labeling, supply chain decisions, or even pricing strategies, understanding which features drive those predictions would become essential for fairness and transparency—even in commerce.
SHAP is one method of many that shines light into these models, whether we’re selecting tasting descriptors or guiding real-world policy. The stakes may differ, but as data practitioners, the responsibility to understand and interrogate our models remains.
This project demonstrated that multiple modeling strategies—ranging from statistical machine learning methods to deep learning—can effectively predict wine review probabilities. While models like LightGBM remained powerful and interpretable, our exploration of a Feedforward Neural Network (FNN) added value by offering a more flexible, non-linear approach to capturing interactions between engineered climate features and varietal encodings. With the transformer encoder, we were able to leverage the power of embeddings and attention, which elevated model performance by capturing complex relationships and adaptively learning to connect the data to the wine tasting notes.
In addition to implementing these models, we employed several statistical and empirical techniques to develop new features, assess feature importance, select the number of factors and clusters, and analyze them using traditional methods to ensure efficacy, interpretability, and robustness of the input data.
Through SHAP, we were able to dissect individual predictions and visualize how aggregated inputs—such as rainfall frequency and varietal embeddings—contributed to specific tasting note probabilities. These results reinforce the importance of thoughtful data preparation and confirm that both modern deep learning models and traditional methods can offer meaningful, interpretable insights when paired with explainable AI techniques.