A Real Dialogue with AI: How MCP Servers Enable Interactive Data Science
Author
Patricio Lobos, Software Engineer & AI Lead at Querex.no
Published
December 3, 2025
1 Executive Summary
Tip: Key Achievement
Our LightGBM model achieved RMSE 0.357 on the held-out test set, representing a 38.8% improvement over Julia Silge’s XGBoost benchmark (RMSE 0.583). The model explains 83.8% of variance in tornado magnitude predictions.
This analysis documents the complete journey of building a tornado magnitude prediction model using LightGBM with domain-enhanced features. More importantly, it demonstrates how Model Context Protocol (MCP) servers enable a genuine, interactive dialogue between a human and an AI assistant (Claude Opus 4.5 in VS Code via GitHub Copilot) to collaboratively solve complex data science problems.
2 Introduction: The Power of AI-Human Collaboration
About Querex: Querex develops mathematical and statistical tools for Large Language Models via the Model Context Protocol (MCP). The LightGBM and Statistics MCP servers used in this analysis are examples of how Querex enables AI assistants to perform real computations rather than just describe them.
2.2 About This Analysis
This document tells two stories:
A technical story: How we built a machine learning model to predict tornado magnitudes
A methodological story: How MCP servers transform the way humans and AI collaborate on data science
The goal wasn’t simply to “beat” a benchmark—it was to demonstrate a new paradigm of interactive, tool-augmented AI assistance where Claude Opus 4.5 can actually do data science, not just talk about it.
2.3 About Julia Silge and the Benchmark
Note: Who is Julia Silge?
Julia Silge is a data scientist and software engineer at Posit (formerly RStudio). She's widely known for her work on the tidymodels framework, for co-authoring the tidytext package and Text Mining with R, and for her long-running series of screencasts modeling TidyTuesday datasets.
Julia’s analysis achieved RMSE 0.583 and R² 0.578 using XGBoost with effect encoding for state variables. Our goal was not to “compete” with her work, but to use it as a well-documented reference point for evaluating our approach.
The dataset contains 68,693 tornado records from 1950-2022 with 27 variables including magnitude, path dimensions, casualties, and location data.
3 Understanding the Technology Stack
3.1 What is MCP (Model Context Protocol)?
Note: MCP Explained
Model Context Protocol (MCP) is an open standard that allows AI assistants like Claude to interact with external tools and data sources through a standardized interface.
Think of MCP as a universal translator between AI models and specialized software. Instead of the AI just describing how to do something, MCP lets the AI actually do it.
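Concretely, when an AI assistant calls a tool, its MCP client sends a JSON-RPC 2.0 message to the server. Below is a minimal sketch of such a request; the tool name `train_regression` and its arguments are hypothetical stand-ins for whatever the LightGBM server actually exposes:

```python
import json

# A hedged sketch of a single MCP tool invocation on the wire.
# MCP uses JSON-RPC 2.0 with a "tools/call" method; the tool name and
# arguments below are hypothetical, not the real server interface.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "train_regression",          # hypothetical tool name
        "arguments": {
            "data_path": "tornados_train.csv",
            "target": "mag",
            "params": {"boosting_type": "gbdt", "num_leaves": 500},
        },
    },
}
wire = json.dumps(request)  # what the MCP client sends to the server
```

The server executes the tool and replies with a JSON-RPC result that the assistant can read and act on, which is what turns the conversation from describing work into doing it.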
%%{init: {'theme': 'base', 'themeVariables': { 'background': '#5a6270', 'primaryColor': '#667eea', 'secondaryColor': '#51cf66', 'tertiaryColor': '#fcc419'}}}%%
flowchart TB
subgraph env["User Environment (VS Code)"]
A[Human User] <-->|Natural Language| B[Claude Opus 4.5]
end
subgraph servers["MCP Servers"]
C[LightGBM Server<br/>Train & Predict Models]
D[Statistics Server<br/>Statistical Analysis]
E[SQL Server<br/>Database Queries]
end
B <-->|MCP Protocol| C
B <-->|MCP Protocol| D
B <-->|MCP Protocol| E
style env fill:#5a6270,color:#4dabf7
style servers fill:#5a6270,color:#4dabf7
style A fill:#6c7a89,color:#4dabf7
style B fill:#667eea,color:#fff,stroke:#667eea,stroke-width:2px
style C fill:#51cf66,color:#fff,stroke:#51cf66,stroke-width:2px
style D fill:#fcc419,color:#000,stroke:#fcc419,stroke-width:2px
style E fill:#ff6b6b,color:#fff,stroke:#ff6b6b,stroke-width:2px
MCP Architecture: How Claude Connects to Tools
3.1.1 MCP Servers Used in This Analysis
MCP Servers Enabling This Analysis
| Server | Purpose | Key Capabilities |
|---|---|---|
| LightGBM MCP | Machine Learning | Train regression/classification models, make predictions, get feature importance |
Without MCP, an AI conversation about machine learning might look like:
Human: “Train a LightGBM model on this data”
AI: “Here’s the Python code you would write: lgb.train(params, train_data)...”
Human: copies code, runs it, gets error, asks for help…
With MCP, the conversation becomes:
Human: “Train a LightGBM model on this data”
AI: actually trains the model “Done! The model achieved RMSE 0.357. The top features were latitude and year. Want me to try different hyperparameters?”
This is the difference between talking about data science and doing data science together.
3.2 What is LightGBM?
Note: LightGBM Explained
LightGBM (Light Gradient Boosting Machine) is a gradient boosting framework developed by Microsoft that uses tree-based learning algorithms. It’s designed for speed and efficiency.
3.2.1 Gradient Boosting: The Core Idea
Gradient boosting builds models sequentially, where each new model corrects the errors of the previous ones:
\[
F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)
\]
Where:
\(F_m(x)\) is the model at iteration \(m\)
\(h_m(x)\) is a weak learner (decision tree) that predicts the residual errors
\(\gamma_m\) is the learning rate
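The update rule above can be demonstrated with a toy implementation that boosts one-variable decision stumps. This is a minimal sketch of the idea, not what LightGBM actually does internally:

```python
import math

def fit_stump(x, residuals):
    """Weak learner h_m: a one-split tree fit to the current residuals."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda xi: lmean if xi <= t else rmean

def gradient_boost(x, y, n_rounds=20, learning_rate=0.3):
    f0 = sum(y) / len(y)                 # F_0: constant initial prediction
    stumps, preds = [], [f0] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - p for yi, p in zip(y, preds)]  # errors of F_{m-1}
        h = fit_stump(x, residuals)                      # weak learner h_m
        stumps.append(h)
        # F_m(x) = F_{m-1}(x) + gamma * h_m(x)
        preds = [p + learning_rate * h(xi) for p, xi in zip(preds, x)]
    return lambda xi: f0 + learning_rate * sum(h(xi) for h in stumps)

# Toy data: a non-linear (quadratic) relationship
x = [i / 10 for i in range(40)]
y = [xi ** 2 for xi in x]
model = gradient_boost(x, y)
rmse = math.sqrt(sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(x))
```

Each round fits a new stump to the residuals of the running ensemble, so the training error shrinks as trees accumulate.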
flowchart LR
A[Tree 1<br/>Initial Prediction] --> B[Residuals]
B --> C[Tree 2<br/>Corrects Errors]
C --> D[Residuals]
D --> E[Tree 3<br/>Corrects Errors]
E --> F[...]
F --> G[Final Prediction<br/>Sum of All Trees]
style A fill:#e8f4f8
style C fill:#d4edda
style E fill:#fff3cd
style G fill:#667eea,color:#fff
How Gradient Boosting Builds Predictions
3.2.2 LightGBM vs XGBoost
LightGBM vs XGBoost Comparison
| Aspect | LightGBM | XGBoost |
|---|---|---|
| Tree Growth | Leaf-wise (best-first) | Level-wise (breadth-first) |
| Speed | Generally faster | Slightly slower |
| Memory | More efficient | Higher memory usage |
| Accuracy | Often similar | Often similar |
| Overfitting Risk | Higher with small data | More conservative |
| Categorical Features | Native support | Requires encoding |
Julia Silge used XGBoost in R with the tidymodels framework. We used LightGBM via MCP servers. Both are excellent choices—the key difference in our results comes from feature engineering and hyperparameter tuning, not the algorithm itself.
3.3 Understanding the Metrics
3.3.1 RMSE (Root Mean Square Error)
Note: RMSE Explained
RMSE measures the average magnitude of prediction errors, giving higher weight to large errors.
Interpretation: An RMSE of 0.357 means our predictions are, on average, about 0.36 magnitude units away from the true value. On a 0-5 scale, this is quite good!
3.3.2 R² (Coefficient of Determination)
Note: R² Explained
R² measures the proportion of variance in the target variable explained by the model.
Interpretation: An R² of 0.838 means our model explains 83.8% of the variance in tornado magnitudes. The remaining 16.2% is unexplained variation (noise, missing features, or inherent randomness).
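Both metrics are a few lines of arithmetic. A self-contained sketch on toy values (the numbers below are illustrative, not from the tornado model):

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error: average error magnitude, in target units."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
                     / len(y_true))

def r_squared(y_true, y_pred):
    """Proportion of variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy magnitudes and predictions
y_true = [0, 1, 1, 2, 3, 0, 1, 4]
y_pred = [0.2, 0.8, 1.3, 1.9, 2.5, 0.1, 1.2, 3.6]
```

Note that a perfect predictor gives RMSE 0 and R² exactly 1, which is a quick sanity check on any implementation.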
3.3.3 Why Both Metrics Matter
Metric Comparison
| Metric | Strengths | Limitations |
|---|---|---|
| RMSE | Same units as target; penalizes large errors | Scale-dependent; harder to interpret across datasets |
| R² | Scale-independent; intuitive percentage | Can be misleading with few observations; doesn't indicate direction of errors |
3.4 Why Train/Test Splits Matter
Warning: The Cardinal Sin of Machine Learning
Never evaluate your model on the same data you trained it on!
This leads to overfitting: the model memorizes the training data instead of learning generalizable patterns.
flowchart LR
A[Full Dataset<br/>67,937 records] --> B{Random Split}
B -->|80%| C[Training Set<br/>54,349 records]
B -->|20%| D[Test Set<br/>13,588 records]
C --> E[Model Training<br/>& Validation]
D --> F[Final Evaluation<br/>Unseen Data]
E --> G[Tune Hyperparameters]
G --> E
E -->|Best Model| F
style C fill:#51cf66,color:#fff
style D fill:#ff6b6b,color:#fff
style F fill:#667eea,color:#fff
Train/Test Split Strategy
3.4.1 The “Locked Box” Analogy
Think of your test set as a locked box that you can only open once:
Training data: Use freely for model development
Validation data: Use to tune hyperparameters and compare models
Test data: Open only once for final evaluation
If you peek at the test set during development, you’re “leaking” information and your final metrics will be overly optimistic.
3.4.2 Julia’s Wisdom on Stratification
Tip: Why Stratify?
Julia emphasized stratifying by magnitude when splitting data:
“Stratification when you’re doing resampling almost never hurts you and sometimes it really helps you. I suspect this is a situation when it would help you because the magnitude is distributed… where there’s lots and lots of zero magnitude tornadoes and very few high ones. So if we want to be able to predict those really high ones we need to make sure they’re in our testing sets and evenly split up.”
This ensures that rare high-magnitude tornadoes appear in both training and test sets, rather than randomly ending up mostly in one or the other.
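In R, tidymodels does this via `initial_split(strata = mag)`. A minimal stand-alone sketch of the same idea (the helper below is hypothetical, not the implementation either analysis used):

```python
import random
from collections import defaultdict

def stratified_split(records, key, test_frac=0.2, seed=123):
    """80/20 split that preserves the class proportions of `key`
    in both the training and test sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec in records:
        by_class[rec[key]].append(rec)
    train, test = [], []
    for group in by_class.values():
        rng.shuffle(group)                       # randomize within each class
        n_test = round(len(group) * test_frac)   # 20% of *this* class
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Toy records with an imbalanced magnitude distribution, like tornadoes:
# lots of EF0, few EF5.
records = ([{"mag": 0}] * 800 + [{"mag": 1}] * 150 + [{"mag": 5}] * 50)
train, test = stratified_split(records, "mag")
```

Because each class is split separately, the rare high-magnitude records are guaranteed to land in both sets in proportion, exactly the property Julia describes.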
4 The Prediction Challenge
4.1 Problem Definition
Predicting tornado magnitude on the Enhanced Fujita (EF) scale is inherently difficult:
Discrete ordinal target: EF0 through EF5, represented as integers 0-5
Class imbalance: Most tornadoes are EF0-EF1; violent EF4-EF5 tornadoes are rare
Complex interactions: Geographic, temporal, and meteorological factors combine non-linearly
Measurement challenges: Magnitude is assessed post-event based on damage
4.2 Julia Silge’s Modeling Philosophy
Note: From Julia's Video
“I grew up in Texas, North Texas, just in Tornado Alley, so this dataset has a lot of resonance for the natural extreme weather, natural disasters of my youth. But also I think this is a great dataset where we can really think about the modeling process—how we set it up and how the decisions that we make really impact the results that we get later on down the line.”
Julia explicitly walked through the modeling options, explaining why each has limitations:
4.2.1 Option 1: Multi-class Classification ❌
“I could treat this like a classification problem… The problem with this is that these classes are not really like red, green, blue—it’s more like small, medium, large where this has an order to it. A tornado that is truly a class five getting predicted as a class four is very different than it being predicted as a class one in terms of how wrong it is, and classification metrics don’t take advantage of that.”
4.2.2 Option 2: Ordinal Regression (MASS::polr) ❌
“This is definitely a good fit for our outcome, but this kind of model is linear and when we have a big dataset like this including complex interactions, a linear model often leaves a lot of possible model performance on the table.”
4.2.3 Option 3: Zero-Inflated Poisson ❌
“We could treat it like counts… either with extra zeros or not… Again, these are linear models. I am not aware of any implementation of these kinds of outcome modeling that are not linear.”
4.2.4 Option 4: Treat as Regression ✅
“What if I just treat it like it is a regression problem? That’s the example I’m going to walk through and we’re going to see what it does and does not do well so that you can understand when you have outcomes that are not a perfect fit for the different kinds of models.”
This was Julia’s key insight: sometimes a powerful non-linear model (XGBoost) treating an ordinal outcome as continuous can outperform “theoretically correct” linear approaches. We followed the same philosophy with LightGBM.
4.3 Why XGBoost/LightGBM for This Problem?
Julia explained her algorithm choice clearly:
Tip: Julia's Reasoning
“I am using XGBoost for just the reason that I said—this is pretty big data with things that I know are correlated with each other. I know injuries and fatalities are correlated with each other. I know length and width are correlated with each other. And it’s a big dataset. So when I see that kind of situation I think: XGBoost is gonna be my friend.”
This same reasoning applies to LightGBM—both are gradient boosting frameworks designed for:
Large datasets with many features
Correlated predictors
Complex non-linear interactions
Situations where feature importance matters
4.4 Reference: Julia Silge’s XGBoost Model
Julia’s approach using tidymodels in R:
Benchmark Performance
| Metric | Julia Silge's XGBoost |
|---|---|
| RMSE | 0.583 |
| R² | 0.578 |
5 Data & Feature Engineering
5.1 Dataset Overview
flowchart LR
A[Raw Tornado Data<br/>67,937 records] --> B[Train/Test Split<br/>80/20]
B --> C[Training Set<br/>54,349 rows]
B --> D[Test Set<br/>13,588 rows]
C --> E[Feature Engineering]
D --> E
E --> F[Domain-Enhanced<br/>20 Features]
Data Pipeline Architecture
5.2 Julia’s EDA Insights
Before building her model, Julia explored the data and shared key observations:
Note: State Variable Analysis
“Very few tornadoes in Alaska and they are very mild—very big contrast to say Arkansas which has a lot of tornadoes and also they’re more extreme. We can see here that we have quite dramatic differences in how extreme tornadoes are in these different states.”
Julia noted the state variable has 53 levels (high cardinality), presenting a choice:
Make 50+ dummy variables ❌
Use likelihood/effect encoding ✅
“What is happening here is that we’re going to make a little mini model as part of our feature engineering that maps the states to an effect on the outcome… instead of having Texas, Arkansas, Alaska we will have numbers from this little mini model that say how much of an effect on the outcome does this have.”
Note: Injuries as a Predictor
"Injuries and fatalities, I know, are of course going to have a strong relationship, right? Strong tornadoes cause more injuries and fatalities, certainly."
Julia visualized injuries by magnitude and found a power law relationship—dramatic increases in injuries as magnitude increases. This validates using inj and fat as predictors.
5.3 Feature Categories
Our 20 features fall into four categories:
5.3.1 1. Raw Physical Features (6 features)
Physical Features
| Feature | Description | Importance Rank |
|---|---|---|
| lat | Latitude of tornado touchdown | #1 🥇 |
| len | Path length (miles) | #5 |
| wid | Path width (yards) | #6 |
| inj | Injuries caused | #11 |
| fat | Fatalities caused | #13 |
| yr | Year of occurrence | #2 🥈 |
5.3.2 2. Engineered Features (4 features)
Note: Engineering Insight
The engineered ratios wid_len_ratio and area became top-5 predictors, demonstrating that feature engineering significantly enhanced model performance.
Engineered Features
| Feature | Formula | Importance Rank |
|---|---|---|
| wid_len_ratio | width / length | #3 🥉 |
| area | width × length | #4 |
| total_casualties | injuries + fatalities | #10 |
| st_encoded | State label encoding | #8 |
5.3.3 3. Cyclical Temporal Features (2 features)
Traditional month encoding (1-12) creates artificial discontinuity between December and January. We used cyclical encoding:
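A minimal sketch of the encoding. The exact phase convention (here, January at angle 0) is an assumption, but any consistent choice preserves the key property that December and January end up adjacent:

```python
import math

def cyclical_month(month):
    """Map month 1-12 onto the unit circle so December and January
    are neighbors instead of 11 units apart."""
    angle = 2 * math.pi * (month - 1) / 12  # January at angle 0 (assumption)
    return math.sin(angle), math.cos(angle)  # (mo_sin, mo_cos)

dec, jan, jul = cyclical_month(12), cyclical_month(1), cyclical_month(7)
```

With the naive 1-12 encoding, December and January are maximally far apart; on the circle they are close, while opposite seasons (January vs July) remain distant.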
Phase 1 of model development applied these features with LightGBM's DART booster.
Result: RMSE 0.725, R² 0.34 — worse than baseline!
Diagnosis: DART's dropout regularization was insufficient. The high num_leaves and low learning_rate with many iterations caused memorization of the training data.
6.2 Phase 2: GBDT with Regularization (Breakthrough)
Switching to GBDT with explicit L1/L2 regularization immediately improved results:
flowchart TD
A[DART Initial<br/>RMSE: 0.725] -->|Switch to GBDT| B[GBDT Base<br/>RMSE: 0.559]
B -->|Add Regularization| C[GBDT + L1/L2<br/>RMSE: 0.412]
C -->|Increase Depth| D[Deep GBDT<br/>RMSE: 0.339]
D -->|Tune Learning Rate| E[Optimized GBDT<br/>RMSE: 0.291]
E -->|Fine-tune All| F[Final Model<br/>RMSE: 0.277]
style A fill:#ff6b6b,color:#fff
style F fill:#51cf66,color:#fff
6.3 Systematic Hyperparameter Tuning
We ran more than 40 iterations of hyperparameter tuning, systematically exploring:
Learning rates: 0.015 → 0.22
Tree depth: 10 → 22
Number of leaves: 128 → 520
Regularization strength: Various L1/L2 combinations
Feature fraction: 0.6 → 0.98
Number of estimators: 500 → 2000
6.3.1 Julia’s Tuning Approach: Racing
Julia used a clever technique called racing to efficiently tune XGBoost:
Note: Racing Methods Explained
“The thing about XGBoost is that you don’t really know what these should be but often some of them turn out really bad right away—it’s like ‘oh this one’s clearly terrible’—so we can use one of these racing methods.”
“What will happen is we’ll use those resamples that we made and we will try all of the different hyperparameter combinations and see which ones turn out really bad after the first couple of resamples and then throw those away and not keep going with them. So it can be a really big time saver.”
Racing uses an ANOVA model to statistically determine if a hyperparameter configuration is significantly worse than others, eliminating it early.
Our MCP-based approach was different but similarly iterative—Claude could immediately see results and adjust parameters in real-time conversation, effectively “racing” through configurations with human-guided intuition.
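The racing idea itself is simple to sketch. The version below uses a fixed error margin to eliminate configurations; the finetune package Julia used applies an ANOVA test instead, and the evaluator here is a hypothetical stand-in:

```python
import random
import statistics

def race(configs, evaluate, n_resamples=10, burn_in=3, margin=0.05):
    """Simplified racing: score every surviving configuration on each
    resample; after a burn-in period, drop any whose mean error trails
    the current best by more than `margin`."""
    scores = {c: [] for c in configs}
    for i in range(n_resamples):
        for c in list(scores):
            scores[c].append(evaluate(c, i))
        if i + 1 >= burn_in:  # start eliminating after a few resamples
            best = min(statistics.mean(v) for v in scores.values())
            scores = {c: v for c, v in scores.items()
                      if statistics.mean(v) <= best + margin}
    return min(scores, key=lambda c: statistics.mean(scores[c]))

# Hypothetical evaluator: each config has a "true" RMSE plus resampling noise.
rng = random.Random(0)
def noisy_rmse(config, resample):
    return config + rng.gauss(0, 0.02)

winner = race([0.30, 0.45, 0.60], noisy_rmse)
```

Clearly bad configurations stop consuming compute after the first few resamples, which is exactly the time saving Julia describes.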
| Parameter | Value | Rationale |
|---|---|---|
| learning_rate | — | Higher than typical; compensated by regularization |
| max_depth | 21 | Deep trees capture complex interactions |
| num_leaves | 500 | < 2^21 to prevent overfitting |
| n_estimators | 1600 | Many iterations with regularization |
| feature_fraction | 0.79 | Column subsampling for diversity |
| lambda_l1 | 0.17 | L1 regularization (Lasso) |
| lambda_l2 | 0.35 | L2 regularization (Ridge) |
| min_data_in_leaf | 1 | Allow fine-grained splits |
8 Results
8.1 Performance Comparison
Model Performance Comparison
| Metric | Validation | Test | Julia Silge | Improvement |
|---|---|---|---|---|
| RMSE | 0.277 | 0.357 | 0.583 | 38.8% ↓ |
| R² | 0.904 | 0.838 | 0.578 | 45.0% ↑ |
| MAE | 0.198 | 0.242 | — | — |
| MAPE | — | 11.2% | — | — |
Tip: Generalization Gap
The validation-to-test performance drop (RMSE 0.277 → 0.357) indicates some overfitting, but test performance still dramatically exceeds the benchmark.
8.2 Julia’s Interpretation Framework
Julia provided excellent guidance for evaluating regression predictions on ordinal outcomes:
Note: What Julia Looked For
When Julia evaluated her XGBoost predictions, she checked two things:
1. Distribution of Predictions: "Look at this distribution of predictions—this is actually not so bad. Notice there's not a lot of values below zero and the range here is actually just right. We don't end up predicting tornadoes that have a magnitude of 10 or 20."
2. Predictions by True Class: "For things that have a real magnitude of zero, one, two, three, four, five—what's the distribution of predictions? We can see that the median for five is a little low so we're under-predicting the high ones and we've over-predicted the low ones. This is not perfect but this is actually not so bad."
This framework—checking that predictions stay in reasonable bounds and examining prediction distributions by true class—is how we validated that treating magnitude as continuous was a reasonable choice.
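Julia's second check reduces to grouping predictions by the true class and comparing group medians. A sketch on toy values (illustrative, not the model's actual predictions) that reproduces the pattern she described, under-prediction at the high end:

```python
import statistics
from collections import defaultdict

def predictions_by_true_class(y_true, y_pred):
    """Median predicted value for each true magnitude class."""
    groups = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        groups[t].append(p)
    return {cls: statistics.median(v) for cls, v in sorted(groups.items())}

# Toy example mirroring Julia's observation
y_true = [0, 0, 1, 1, 2, 5, 5]
y_pred = [0.2, 0.4, 0.9, 1.3, 1.8, 3.9, 4.2]
medians = predictions_by_true_class(y_true, y_pred)
```

If the median for class 5 sits well below 5 while the median for class 0 sits above 0, the model is compressing the extremes, exactly what a regression on an ordinal target tends to do.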
8.3 Feature Importance Analysis
8.3.1 Julia’s Feature Importance Findings
Julia’s XGBoost model found similar top features:
Note: Julia's Top Features
“Most important: injuries. Next: length—so how big is it. Length and width are both here. Year is here—there is a change with year because of climate change, like more extreme tornadoes as time goes on. Also notice state is in the top five most important predictors—so what that tells me is that it was worth it to do that feature engineering that we did. Months are also in here—May, April, June—tornado time of year.”
Our LightGBM model found latitude (#1) even more important than injuries, likely because we used raw lat while Julia used effect-encoded state which captures similar geographic information differently.
Tip: 1. GBDT over DART
For this dataset, GBDT with explicit L1/L2 regularization outperformed DART's dropout-based regularization. DART excels when you need to prevent overfitting without tuning regularization parameters, but explicit regularization gave finer control.
Tip: 2. Feature Engineering
Engineered features (wid_len_ratio, area) ranked #3 and #4 in importance. Simple transformations of raw features added significant predictive power.
Tip: 3. Cyclical Encoding
mo_cos and mo_sin outperformed binary month indicators. The continuous representation captured seasonal patterns more effectively.
Features like is_tornado_alley and is_peak_season added less value than expected. The model learned these patterns implicitly from lat, st_encoded, and cyclical month features.
Warning: 3. Very Low Learning Rates
Initial attempts with learning_rate=0.015 required too many iterations and still underperformed.
9.3 Insights
Note: Geography is Paramount
Latitude alone was the #1 predictor. Tornado magnitude correlates strongly with geographic location—likely reflecting the climatological conditions that produce severe tornadoes.
Note: Temporal Trends Matter
Year (yr) ranked #2, suggesting tornado magnitude patterns have changed over time—potentially due to improved detection, climate patterns, or measurement methodology changes.
Note: Morphology Predicts Severity
Physical dimensions (len, wid) and their ratio directly relate to magnitude. Wider, longer tornadoes tend to be more severe—an intuitive finding that validates our feature engineering approach.
10 Recommendations for Future Work
10.1 Model Improvements
Ensemble Methods: Combine LightGBM with CatBoost or XGBoost for potential gains
Target Encoding: Replace label encoding for states with target encoding
Interaction Features: Explicitly create lat×month or state×season interactions
Ordinal Regression: Treat magnitude as ordinal rather than continuous
10.2 Additional Data Sources
Radar Data: Use pre-tornado radar signatures if available
Time of Day: Add hour of occurrence (severe tornadoes peak in late afternoon)
Population Density: May correlate with damage assessment accuracy
10.3 Validation Improvements
Cross-Validation: Implement k-fold CV for more robust estimates
Temporal Validation: Train on earlier years, test on recent years
Geographic Validation: Ensure performance across different regions
11 Conclusion
11.1 Technical Achievements
This analysis demonstrates that domain-informed feature engineering combined with aggressive hyperparameter optimization can significantly exceed baseline performance. Our LightGBM model achieved:
38.8% lower RMSE than Julia Silge’s XGBoost benchmark
Interpretable insights about the importance of geography, morphology, and time
11.2 The Bigger Picture: AI-Human Collaboration
Tip: The Real Innovation
The most significant outcome of this project isn’t the model performance—it’s demonstrating a new way of doing data science.
Through MCP servers, Claude Opus 4.5 wasn’t just advising on machine learning—it was actively participating in the iterative process of building, evaluating, and refining models.
11.2.1 What Made This Collaboration Work
Human-AI Division of Labor
| Human Contributions | AI Contributions |
|---|---|
| Domain intuition (tornado meteorology) | Rapid iteration through hyperparameters |
| Strategic direction ("try DART", "go ELON MODE") | Statistical feature engineering |
| Quality judgment (is this result good enough?) | Tool execution (training, prediction, evaluation) |
| Business context (why does this matter?) | Documentation and explanation |
11.2.2 The Conversational Data Science Process
Our session demonstrated a natural dialogue:
Human: “Let’s use these domain features based on meteorological research”
AI: Validates features against data, creates engineered dataset