A Real Dialogue with AI: How MCP Servers Enable Interactive Data Science
Author
Patricio Lobos, Software Engineer & AI Lead at Querex.no
Published
December 3, 2025
1 Executive Summary
Tip: Key Achievement
Our LightGBM model achieved RMSE 0.357 on the held-out test set, representing a 38.8% improvement over Julia Silge’s XGBoost benchmark (RMSE 0.583). The model explains 83.8% of variance in tornado magnitude predictions.
This analysis documents the complete journey of building a tornado magnitude prediction model using LightGBM with domain-enhanced features. More importantly, it demonstrates how Model Context Protocol (MCP) servers enable a genuine, interactive dialogue between a human and an AI assistant (Claude Opus 4.5 in VS Code via GitHub Copilot) to collaboratively solve complex data science problems.
2 Introduction: The Power of AI-Human Collaboration
About Querex: Querex develops mathematical and statistical tools for Large Language Models via the Model Context Protocol (MCP). The LightGBM and Statistics MCP servers used in this analysis are examples of how Querex enables AI assistants to perform real computations rather than just describe them.
2.2 About This Analysis
This document tells two stories:
A technical story: How we built a machine learning model to predict tornado magnitudes
A methodological story: How MCP servers transform the way humans and AI collaborate on data science
The goal wasn’t simply to “beat” a benchmark—it was to demonstrate a new paradigm of interactive, tool-augmented AI assistance where Claude Opus 4.5 can actually do data science, not just talk about it.
2.3 About Julia Silge and the Benchmark
Note: Who is Julia Silge?
Julia Silge is a data scientist and software engineer at Posit (formerly RStudio). She's widely known for her work on the tidymodels framework, for co-authoring the tidytext package and Text Mining with R, and for her long-running series of screencasts modeling TidyTuesday datasets.
Julia’s analysis achieved RMSE 0.583 and R² 0.578 using XGBoost with effect encoding for state variables. Our goal was not to “compete” with her work, but to use it as a well-documented reference point for evaluating our approach.
The dataset contains 68,693 tornado records from 1950-2022 with 27 variables including magnitude, path dimensions, casualties, and location data.
3 Understanding the Technology Stack
3.1 What is MCP (Model Context Protocol)?
Note: MCP Explained
Model Context Protocol (MCP) is an open standard that allows AI assistants like Claude to interact with external tools and data sources through a standardized interface.
Think of MCP as a universal translator between AI models and specialized software. Instead of the AI just describing how to do something, MCP lets the AI actually do it.
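Concretely, when an AI assistant calls a tool, its MCP client sends a JSON-RPC 2.0 message to the server. Below is a minimal sketch of such a request; the tool name `train_regression` and its arguments are hypothetical stand-ins for whatever the LightGBM server actually exposes:

```python
import json

# A hedged sketch of a single MCP tool invocation on the wire.
# MCP uses JSON-RPC 2.0 with a "tools/call" method; the tool name and
# arguments below are hypothetical, not the real server interface.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "train_regression",          # hypothetical tool name
        "arguments": {
            "data_path": "tornados_train.csv",
            "target": "mag",
            "params": {"boosting_type": "gbdt", "num_leaves": 500},
        },
    },
}
wire = json.dumps(request)  # what the MCP client sends to the server
```

The server executes the tool and replies with a JSON-RPC result that the assistant can read and act on, which is what turns the conversation from describing work into doing it.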
%%{init: {'theme': 'base', 'themeVariables': { 'background': '#5a6270', 'primaryColor': '#667eea', 'secondaryColor': '#51cf66', 'tertiaryColor': '#fcc419'}}}%%
flowchart TB
subgraph env["User Environment (VS Code)"]
A[Human User] <-->|Natural Language| B[Claude Opus 4.5]
end
subgraph servers["MCP Servers"]
C[LightGBM Server<br/>Train & Predict Models]
D[Statistics Server<br/>Statistical Analysis]
E[SQL Server<br/>Database Queries]
end
B <-->|MCP Protocol| C
B <-->|MCP Protocol| D
B <-->|MCP Protocol| E
style env fill:#5a6270,color:#4dabf7
style servers fill:#5a6270,color:#4dabf7
style A fill:#6c7a89,color:#4dabf7
style B fill:#667eea,color:#fff,stroke:#667eea,stroke-width:2px
style C fill:#51cf66,color:#fff,stroke:#51cf66,stroke-width:2px
style D fill:#fcc419,color:#000,stroke:#fcc419,stroke-width:2px
style E fill:#ff6b6b,color:#fff,stroke:#ff6b6b,stroke-width:2px
MCP Architecture: How Claude Connects to Tools
3.1.1 MCP Servers Used in This Analysis
MCP Servers Enabling This Analysis
| Server | Purpose | Key Capabilities |
|---|---|---|
| LightGBM MCP | Machine Learning | Train regression/classification models, make predictions, get feature importance |
Without MCP, an AI conversation about machine learning might look like:
Human: “Train a LightGBM model on this data”
AI: “Here’s the Python code you would write: lgb.train(params, train_data)...”
Human: copies code, runs it, gets error, asks for help…
With MCP, the conversation becomes:
Human: “Train a LightGBM model on this data”
AI: actually trains the model “Done! The model achieved RMSE 0.357. The top features were latitude and year. Want me to try different hyperparameters?”
This is the difference between talking about data science and doing data science together.
3.2 What is LightGBM?
Note: LightGBM Explained
LightGBM (Light Gradient Boosting Machine) is a gradient boosting framework developed by Microsoft that uses tree-based learning algorithms. It’s designed for speed and efficiency.
3.2.1 Gradient Boosting: The Core Idea
Gradient boosting builds models sequentially, where each new model corrects the errors of the previous ones:
\[
F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)
\]
Where:
\(F_m(x)\) is the model at iteration \(m\)
\(h_m(x)\) is a weak learner (decision tree) that predicts the residual errors
\(\gamma_m\) is the learning rate
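The update rule above can be demonstrated with a toy implementation that boosts one-variable decision stumps. This is a minimal sketch of the idea, not what LightGBM actually does internally:

```python
import math

def fit_stump(x, residuals):
    """Weak learner h_m: a one-split tree fit to the current residuals."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda xi: lmean if xi <= t else rmean

def gradient_boost(x, y, n_rounds=20, learning_rate=0.3):
    f0 = sum(y) / len(y)                 # F_0: constant initial prediction
    stumps, preds = [], [f0] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - p for yi, p in zip(y, preds)]  # errors of F_{m-1}
        h = fit_stump(x, residuals)                      # weak learner h_m
        stumps.append(h)
        # F_m(x) = F_{m-1}(x) + gamma * h_m(x)
        preds = [p + learning_rate * h(xi) for p, xi in zip(preds, x)]
    return lambda xi: f0 + learning_rate * sum(h(xi) for h in stumps)

# Toy data: a non-linear (quadratic) relationship
x = [i / 10 for i in range(40)]
y = [xi ** 2 for xi in x]
model = gradient_boost(x, y)
rmse = math.sqrt(sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(x))
```

Each round fits a new stump to the residuals of the running ensemble, so the training error shrinks as trees accumulate.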
flowchart LR
A[Tree 1<br/>Initial Prediction] --> B[Residuals]
B --> C[Tree 2<br/>Corrects Errors]
C --> D[Residuals]
D --> E[Tree 3<br/>Corrects Errors]
E --> F[...]
F --> G[Final Prediction<br/>Sum of All Trees]
style A fill:#e8f4f8
style C fill:#d4edda
style E fill:#fff3cd
style G fill:#667eea,color:#fff
How Gradient Boosting Builds Predictions
3.2.2 LightGBM vs XGBoost
LightGBM vs XGBoost Comparison
| Aspect | LightGBM | XGBoost |
|---|---|---|
| Tree Growth | Leaf-wise (best-first) | Level-wise (breadth-first) |
| Speed | Generally faster | Slightly slower |
| Memory | More efficient | Higher memory usage |
| Accuracy | Often similar | Often similar |
| Overfitting Risk | Higher with small data | More conservative |
| Categorical Features | Native support | Requires encoding |
Julia Silge used XGBoost in R with the tidymodels framework. We used LightGBM via MCP servers. Both are excellent choices—the key difference in our results comes from feature engineering and hyperparameter tuning, not the algorithm itself.
3.3 Understanding the Metrics
3.3.1 RMSE (Root Mean Square Error)
Note: RMSE Explained
RMSE measures the average magnitude of prediction errors, giving higher weight to large errors.
Interpretation: An RMSE of 0.357 means our predictions are, on average, about 0.36 magnitude units away from the true value. On a 0-5 scale, this is quite good!
3.3.2 R² (Coefficient of Determination)
Note: R² Explained
R² measures the proportion of variance in the target variable explained by the model.
Interpretation: An R² of 0.838 means our model explains 83.8% of the variance in tornado magnitudes. The remaining 16.2% is unexplained variation (noise, missing features, or inherent randomness).
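Both metrics are a few lines of arithmetic. A self-contained sketch on toy values (the numbers below are illustrative, not from the tornado model):

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error: average error magnitude, in target units."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
                     / len(y_true))

def r_squared(y_true, y_pred):
    """Proportion of variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy magnitudes and predictions
y_true = [0, 1, 1, 2, 3, 0, 1, 4]
y_pred = [0.2, 0.8, 1.3, 1.9, 2.5, 0.1, 1.2, 3.6]
```

Note that a perfect predictor gives RMSE 0 and R² exactly 1, which is a quick sanity check on any implementation.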
3.3.3 Why Both Metrics Matter
Metric Comparison
| Metric | Strengths | Limitations |
|---|---|---|
| RMSE | Same units as target; penalizes large errors | Scale-dependent; harder to interpret across datasets |
| R² | Scale-independent; intuitive percentage | Can be misleading with few observations; doesn't indicate direction of errors |
3.4 Why Train/Test Splits Matter
Warning: The Cardinal Sin of Machine Learning
Never evaluate your model on the same data you trained it on!
This leads to overfitting: the model memorizes the training data instead of learning generalizable patterns.
flowchart LR
A[Full Dataset<br/>67,937 records] --> B{Random Split}
B -->|80%| C[Training Set<br/>54,349 records]
B -->|20%| D[Test Set<br/>13,588 records]
C --> E[Model Training<br/>& Validation]
D --> F[Final Evaluation<br/>Unseen Data]
E --> G[Tune Hyperparameters]
G --> E
E -->|Best Model| F
style C fill:#51cf66,color:#fff
style D fill:#ff6b6b,color:#fff
style F fill:#667eea,color:#fff
Train/Test Split Strategy
3.4.1 The “Locked Box” Analogy
Think of your test set as a locked box that you can only open once:
Training data: Use freely for model development
Validation data: Use to tune hyperparameters and compare models
Test data: Open only once for final evaluation
If you peek at the test set during development, you’re “leaking” information and your final metrics will be overly optimistic.
3.4.2 Julia’s Wisdom on Stratification
Tip: Why Stratify?
Julia emphasized stratifying by magnitude when splitting data:
“Stratification when you’re doing resampling almost never hurts you and sometimes it really helps you. I suspect this is a situation when it would help you because the magnitude is distributed… where there’s lots and lots of zero magnitude tornadoes and very few high ones. So if we want to be able to predict those really high ones we need to make sure they’re in our testing sets and evenly split up.”
This ensures that rare high-magnitude tornadoes appear in both training and test sets, rather than randomly ending up mostly in one or the other.
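In R, tidymodels does this via `initial_split(strata = mag)`. A minimal stand-alone sketch of the same idea (the helper below is hypothetical, not the implementation either analysis used):

```python
import random
from collections import defaultdict

def stratified_split(records, key, test_frac=0.2, seed=123):
    """80/20 split that preserves the class proportions of `key`
    in both the training and test sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec in records:
        by_class[rec[key]].append(rec)
    train, test = [], []
    for group in by_class.values():
        rng.shuffle(group)                       # randomize within each class
        n_test = round(len(group) * test_frac)   # 20% of *this* class
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Toy records with an imbalanced magnitude distribution, like tornadoes:
# lots of EF0, few EF5.
records = ([{"mag": 0}] * 800 + [{"mag": 1}] * 150 + [{"mag": 5}] * 50)
train, test = stratified_split(records, "mag")
```

Because each class is split separately, the rare high-magnitude records are guaranteed to land in both sets in proportion, exactly the property Julia describes.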
4 The Prediction Challenge
4.1 Problem Definition
Predicting tornado magnitude on the Enhanced Fujita (EF) scale is inherently difficult:
Discrete ordinal target: EF0 through EF5, represented as integers 0-5
Class imbalance: Most tornadoes are EF0-EF1; violent EF4-EF5 tornadoes are rare
Complex interactions: Geographic, temporal, and meteorological factors combine non-linearly
Measurement challenges: Magnitude is assessed post-event based on damage
4.2 Julia Silge’s Modeling Philosophy
Note: From Julia's Video
“I grew up in Texas, North Texas, just in Tornado Alley, so this dataset has a lot of resonance for the natural extreme weather, natural disasters of my youth. But also I think this is a great dataset where we can really think about the modeling process—how we set it up and how the decisions that we make really impact the results that we get later on down the line.”
Julia explicitly walked through the modeling options, explaining why each has limitations:
4.2.1 Option 1: Multi-class Classification ❌
“I could treat this like a classification problem… The problem with this is that these classes are not really like red, green, blue—it’s more like small, medium, large where this has an order to it. A tornado that is truly a class five getting predicted as a class four is very different than it being predicted as a class one in terms of how wrong it is, and classification metrics don’t take advantage of that.”
4.2.2 Option 2: Ordinal Regression (MASS::polr) ❌
“This is definitely a good fit for our outcome, but this kind of model is linear and when we have a big dataset like this including complex interactions, a linear model often leaves a lot of possible model performance on the table.”
4.2.3 Option 3: Zero-Inflated Poisson ❌
“We could treat it like counts… either with extra zeros or not… Again, these are linear models. I am not aware of any implementation of these kinds of outcome modeling that are not linear.”
4.2.4 Option 4: Treat as Regression ✅
“What if I just treat it like it is a regression problem? That’s the example I’m going to walk through and we’re going to see what it does and does not do well so that you can understand when you have outcomes that are not a perfect fit for the different kinds of models.”
This was Julia’s key insight: sometimes a powerful non-linear model (XGBoost) treating an ordinal outcome as continuous can outperform “theoretically correct” linear approaches. We followed the same philosophy with LightGBM.
4.3 Why XGBoost/LightGBM for This Problem?
Julia explained her algorithm choice clearly:
Tip: Julia's Reasoning
“I am using XGBoost for just the reason that I said—this is pretty big data with things that I know are correlated with each other. I know injuries and fatalities are correlated with each other. I know length and width are correlated with each other. And it’s a big dataset. So when I see that kind of situation I think: XGBoost is gonna be my friend.”
This same reasoning applies to LightGBM—both are gradient boosting frameworks designed for:
Large datasets with many features
Correlated predictors
Complex non-linear interactions
Situations where feature importance matters
4.4 Reference: Julia Silge’s XGBoost Model
Julia’s approach using tidymodels in R:
Benchmark Performance
| Metric | Julia Silge's XGBoost |
|---|---|
| RMSE | 0.583 |
| R² | 0.578 |
5 Data & Feature Engineering
5.1 Dataset Overview
flowchart LR
A[Raw Tornado Data<br/>67,937 records] --> B[Train/Test Split<br/>80/20]
B --> C[Training Set<br/>54,349 rows]
B --> D[Test Set<br/>13,588 rows]
C --> E[Feature Engineering]
D --> E
E --> F[Domain-Enhanced<br/>20 Features]
Data Pipeline Architecture
5.2 Julia’s EDA Insights
Before building her model, Julia explored the data and shared key observations:
Note: State Variable Analysis
“Very few tornadoes in Alaska and they are very mild—very big contrast to say Arkansas which has a lot of tornadoes and also they’re more extreme. We can see here that we have quite dramatic differences in how extreme tornadoes are in these different states.”
Julia noted the state variable has 53 levels (high cardinality), presenting a choice:
Make 50+ dummy variables ❌
Use likelihood/effect encoding ✅
“What is happening here is that we’re going to make a little mini model as part of our feature engineering that maps the states to an effect on the outcome… instead of having Texas, Arkansas, Alaska we will have numbers from this little mini model that say how much of an effect on the outcome does this have.”
Note: Injuries as a Predictor
"Injuries and fatalities, I know, are of course going to have a strong relationship, right? Strong tornadoes cause more injuries and fatalities, certainly."
Julia visualized injuries by magnitude and found a power law relationship—dramatic increases in injuries as magnitude increases. This validates using inj and fat as predictors.
5.3 Feature Categories
Our 20 features fall into four categories:
5.3.1 1. Raw Physical Features (6 features)
Physical Features
| Feature | Description | Importance Rank |
|---|---|---|
| lat | Latitude of tornado touchdown | #1 🥇 |
| len | Path length (miles) | #5 |
| wid | Path width (yards) | #6 |
| inj | Injuries caused | #11 |
| fat | Fatalities caused | #13 |
| yr | Year of occurrence | #2 🥈 |
5.3.2 2. Engineered Features (4 features)
Note: Engineering Insight
The engineered ratios wid_len_ratio and area became top-5 predictors, demonstrating that feature engineering significantly enhanced model performance.
Engineered Features
| Feature | Formula | Importance Rank |
|---|---|---|
| wid_len_ratio | width / length | #3 🥉 |
| area | width × length | #4 |
| total_casualties | injuries + fatalities | #10 |
| st_encoded | State label encoding | #8 |
5.3.3 3. Cyclical Temporal Features (2 features)
Traditional month encoding (1-12) creates artificial discontinuity between December and January. We used cyclical encoding:
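A minimal sketch of the encoding. The exact phase convention (here, January at angle 0) is an assumption, but any consistent choice preserves the key property that December and January end up adjacent:

```python
import math

def cyclical_month(month):
    """Map month 1-12 onto the unit circle so December and January
    are neighbors instead of 11 units apart."""
    angle = 2 * math.pi * (month - 1) / 12  # January at angle 0 (assumption)
    return math.sin(angle), math.cos(angle)  # (mo_sin, mo_cos)

dec, jan, jul = cyclical_month(12), cyclical_month(1), cyclical_month(7)
```

With the naive 1-12 encoding, December and January are maximally far apart; on the circle they are close, while opposite seasons (January vs July) remain distant.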
Phase 1 of model development applied these features with LightGBM's DART booster.
Result: RMSE 0.725, R² 0.34 — worse than baseline!
Diagnosis: DART's dropout regularization was insufficient. The high num_leaves and low learning_rate with many iterations caused memorization of the training data.
6.2 Phase 2: GBDT with Regularization (Breakthrough)
Switching to GBDT with explicit L1/L2 regularization immediately improved results:
flowchart TD
A[DART Initial<br/>RMSE: 0.725] -->|Switch to GBDT| B[GBDT Base<br/>RMSE: 0.559]
B -->|Add Regularization| C[GBDT + L1/L2<br/>RMSE: 0.412]
C -->|Increase Depth| D[Deep GBDT<br/>RMSE: 0.339]
D -->|Tune Learning Rate| E[Optimized GBDT<br/>RMSE: 0.291]
E -->|Fine-tune All| F[Final Model<br/>RMSE: 0.277]
style A fill:#ff6b6b,color:#fff
style F fill:#51cf66,color:#fff
6.3 Systematic Hyperparameter Tuning
We ran more than 40 iterations of hyperparameter tuning, systematically exploring:
Learning rates: 0.015 → 0.22
Tree depth: 10 → 22
Number of leaves: 128 → 520
Regularization strength: Various L1/L2 combinations
Feature fraction: 0.6 → 0.98
Number of estimators: 500 → 2000
6.3.1 Julia’s Tuning Approach: Racing
Julia used a clever technique called racing to efficiently tune XGBoost:
Note: Racing Methods Explained
“The thing about XGBoost is that you don’t really know what these should be but often some of them turn out really bad right away—it’s like ‘oh this one’s clearly terrible’—so we can use one of these racing methods.”
“What will happen is we’ll use those resamples that we made and we will try all of the different hyperparameter combinations and see which ones turn out really bad after the first couple of resamples and then throw those away and not keep going with them. So it can be a really big time saver.”
Racing uses an ANOVA model to statistically determine if a hyperparameter configuration is significantly worse than others, eliminating it early.
Our MCP-based approach was different but similarly iterative—Claude could immediately see results and adjust parameters in real-time conversation, effectively “racing” through configurations with human-guided intuition.
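The racing idea itself is simple to sketch. The version below uses a fixed error margin to eliminate configurations; the finetune package Julia used applies an ANOVA test instead, and the evaluator here is a hypothetical stand-in:

```python
import random
import statistics

def race(configs, evaluate, n_resamples=10, burn_in=3, margin=0.05):
    """Simplified racing: score every surviving configuration on each
    resample; after a burn-in period, drop any whose mean error trails
    the current best by more than `margin`."""
    scores = {c: [] for c in configs}
    for i in range(n_resamples):
        for c in list(scores):
            scores[c].append(evaluate(c, i))
        if i + 1 >= burn_in:  # start eliminating after a few resamples
            best = min(statistics.mean(v) for v in scores.values())
            scores = {c: v for c, v in scores.items()
                      if statistics.mean(v) <= best + margin}
    return min(scores, key=lambda c: statistics.mean(scores[c]))

# Hypothetical evaluator: each config has a "true" RMSE plus resampling noise.
rng = random.Random(0)
def noisy_rmse(config, resample):
    return config + rng.gauss(0, 0.02)

winner = race([0.30, 0.45, 0.60], noisy_rmse)
```

Clearly bad configurations stop consuming compute after the first few resamples, which is exactly the time saving Julia describes.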
| Parameter | Value | Rationale |
|---|---|---|
| learning_rate | — | Higher than typical; compensated by regularization |
| max_depth | 21 | Deep trees capture complex interactions |
| num_leaves | 500 | < 2^21 to prevent overfitting |
| n_estimators | 1600 | Many iterations with regularization |
| feature_fraction | 0.79 | Column subsampling for diversity |
| lambda_l1 | 0.17 | L1 regularization (Lasso) |
| lambda_l2 | 0.35 | L2 regularization (Ridge) |
| min_data_in_leaf | 1 | Allow fine-grained splits |
8 Results
8.1 Performance Comparison
Model Performance Comparison
| Metric | Validation | Test | Julia Silge | Improvement |
|---|---|---|---|---|
| RMSE | 0.277 | 0.357 | 0.583 | 38.8% ↓ |
| R² | 0.904 | 0.838 | 0.578 | 45.0% ↑ |
| MAE | 0.198 | 0.242 | — | — |
| MAPE | — | 11.2% | — | — |
Tip: Generalization Gap
The validation-to-test performance drop (RMSE 0.277 → 0.357) indicates some overfitting, but test performance still dramatically exceeds the benchmark.
8.2 Julia’s Interpretation Framework
Julia provided excellent guidance for evaluating regression predictions on ordinal outcomes:
Note: What Julia Looked For
When Julia evaluated her XGBoost predictions, she checked two things:
1. Distribution of Predictions: "Look at this distribution of predictions—this is actually not so bad. Notice there's not a lot of values below zero and the range here is actually just right. We don't end up predicting tornadoes that have a magnitude of 10 or 20."
2. Predictions by True Class: "For things that have a real magnitude of zero, one, two, three, four, five—what's the distribution of predictions? We can see that the median for five is a little low so we're under-predicting the high ones and we've over-predicted the low ones. This is not perfect but this is actually not so bad."
This framework—checking that predictions stay in reasonable bounds and examining prediction distributions by true class—is how we validated that treating magnitude as continuous was a reasonable choice.
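Julia's second check reduces to grouping predictions by the true class and comparing group medians. A sketch on toy values (illustrative, not the model's actual predictions) that reproduces the pattern she described, under-prediction at the high end:

```python
import statistics
from collections import defaultdict

def predictions_by_true_class(y_true, y_pred):
    """Median predicted value for each true magnitude class."""
    groups = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        groups[t].append(p)
    return {cls: statistics.median(v) for cls, v in sorted(groups.items())}

# Toy example mirroring Julia's observation
y_true = [0, 0, 1, 1, 2, 5, 5]
y_pred = [0.2, 0.4, 0.9, 1.3, 1.8, 3.9, 4.2]
medians = predictions_by_true_class(y_true, y_pred)
```

If the median for class 5 sits well below 5 while the median for class 0 sits above 0, the model is compressing the extremes, exactly what a regression on an ordinal target tends to do.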
8.3 Feature Importance Analysis
8.3.1 Julia’s Feature Importance Findings
Julia’s XGBoost model found similar top features:
Note: Julia's Top Features
“Most important: injuries. Next: length—so how big is it. Length and width are both here. Year is here—there is a change with year because of climate change, like more extreme tornadoes as time goes on. Also notice state is in the top five most important predictors—so what that tells me is that it was worth it to do that feature engineering that we did. Months are also in here—May, April, June—tornado time of year.”
Our LightGBM model found latitude (#1) even more important than injuries, likely because we used raw lat while Julia used effect-encoded state which captures similar geographic information differently.
Tip: 1. GBDT over DART
For this dataset, GBDT with explicit L1/L2 regularization outperformed DART's dropout-based regularization. DART excels when you need to prevent overfitting without tuning regularization parameters, but explicit regularization gave finer control.
Tip: 2. Feature Engineering
Engineered features (wid_len_ratio, area) ranked #3 and #4 in importance. Simple transformations of raw features added significant predictive power.
Tip: 3. Cyclical Encoding
mo_cos and mo_sin outperformed binary month indicators. The continuous representation captured seasonal patterns more effectively.
Features like is_tornado_alley and is_peak_season added less value than expected. The model learned these patterns implicitly from lat, st_encoded, and cyclical month features.
Warning: 3. Very Low Learning Rates
Initial attempts with learning_rate=0.015 required too many iterations and still underperformed.
9.3 Insights
Note: Geography is Paramount
Latitude alone was the #1 predictor. Tornado magnitude correlates strongly with geographic location—likely reflecting the climatological conditions that produce severe tornadoes.
Note: Temporal Trends Matter
Year (yr) ranked #2, suggesting tornado magnitude patterns have changed over time—potentially due to improved detection, climate patterns, or measurement methodology changes.
Note: Morphology Predicts Severity
Physical dimensions (len, wid) and their ratio directly relate to magnitude. Wider, longer tornadoes tend to be more severe—an intuitive finding that validates our feature engineering approach.
10 Recommendations for Future Work
10.1 Model Improvements
Ensemble Methods: Combine LightGBM with CatBoost or XGBoost for potential gains
Target Encoding: Replace label encoding for states with target encoding
Interaction Features: Explicitly create lat×month or state×season interactions
Ordinal Regression: Treat magnitude as ordinal rather than continuous
10.2 Additional Data Sources
Radar Data: Use pre-tornado radar signatures if available
Time of Day: Add hour of occurrence (severe tornadoes peak in late afternoon)
Population Density: May correlate with damage assessment accuracy
10.3 Validation Improvements
Cross-Validation: Implement k-fold CV for more robust estimates
Temporal Validation: Train on earlier years, test on recent years
Geographic Validation: Ensure performance across different regions
11 Conclusion
11.1 Technical Achievements
This analysis demonstrates that domain-informed feature engineering combined with aggressive hyperparameter optimization can significantly exceed baseline performance. Our LightGBM model achieved:
38.8% lower RMSE than Julia Silge’s XGBoost benchmark
Interpretable insights about the importance of geography, morphology, and time
11.2 The Bigger Picture: AI-Human Collaboration
Tip: The Real Innovation
The most significant outcome of this project isn’t the model performance—it’s demonstrating a new way of doing data science.
Through MCP servers, Claude Opus 4.5 wasn’t just advising on machine learning—it was actively participating in the iterative process of building, evaluating, and refining models.
11.2.1 What Made This Collaboration Work
Human-AI Division of Labor
| Human Contributions | AI Contributions |
|---|---|
| Domain intuition (tornado meteorology) | Rapid iteration through hyperparameters |
| Strategic direction ("try DART", "go ELON MODE") | Statistical feature engineering |
| Quality judgment (is this result good enough?) | Tool execution (training, prediction, evaluation) |
| Business context (why does this matter?) | Documentation and explanation |
11.2.2 The Conversational Data Science Process
Our session demonstrated a natural dialogue:
Human: “Let’s use these domain features based on meteorological research”
AI: Validates features against data, creates engineered dataset