So You Want to Win a Kaggle Competition?

Machine learning models are growing in prevalence and efficacy in environmental science. What began in 1949 as an attempt to write a computer program that could play checkers [1] has grown into one of the fastest-developing fields today. Machine learning is used across multiple disciplines: from image recognition models designed to speed up medical diagnoses [2] to the program that filters spam out of your inbox [3], its reach has spread far and wide. The study of our natural world and its processes is no exception.

Machine learning has allowed major advances in the environmental field as well. As data collection techniques become more refined, frequent, and numerous, we need processing techniques that can match the scale of that data.

As part of a final project in Dr. Matteo Robbins's Machine Learning in Environmental Science course, my class was tasked with optimizing a machine learning model to predict dissolved inorganic carbon (DIC) content in a sample of seawater based on a number of other associated characteristics. This was held as a Kaggle competition where the evaluation metric was root mean squared error (RMSE). The winner was determined by which model had the smallest RMSE on the private leaderboard. In other words, the competition assessed how accurately each model predicts DIC when generalizing to unseen data.
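RMSE is just the square root of the mean of the squared prediction errors. A quick sketch with made-up DIC values (not competition data):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: large misses are penalized quadratically."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Errors of -10, +10, and -20 give sqrt((100 + 100 + 400) / 3) = sqrt(200)
print(rmse([2100.0, 2200.0, 2300.0], [2110.0, 2190.0, 2320.0]))  # → ~14.14
```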

I will walk through my winning model for this competition: an extreme gradient boosted model (XGB) with Bayesian hyperparameter optimization.

The Model Breakdown

eXtreme Gradient Boosting

Let’s go word by word through an extreme gradient boosted model and explain each piece.

  • Boosted: Boosting is an ensemble method in which multiple weak decision trees are trained sequentially. In simpler terms, you train many “short” decision trees one after another, using the residual error from the previous trees to fit the next one. This lets the model improve iteratively, with each new tree correcting the mistakes of the ensemble so far.
  • Gradient: Gradient refers to gradient descent, the optimization technique used to minimize the loss function. This starts to get into the math weeds, so I’ll link some resources for those who’d like more detail [4] [5]. But at its most conceptual, imagine our parameter space as a hill we’re standing on top of. Looking down, there are numerous slopes and valleys marking the terrain between the peak and the bottom of the hill. This “terrain variation” can be thought of as the unique parameter space of our model. Now, imagine I drop 100 ping pong balls from the top of this hill and want to know which ball reaches the bottom fastest. The path of the fastest ping pong ball is akin to the negative gradient of our loss function. That is, gradient descent repeatedly steps in the direction of steepest decrease to minimize the loss.
  • Extreme: Now that we have a conceptual understanding of gradient descent, there are many different ways you can set up how your model finds its optimum. The term “extreme” comes from the popular xgboost library, which is designed to be especially efficient and flexible. There are other flavors of gradient descent, like batch gradient descent or stochastic gradient descent [6], but XGB is a very common method due to its high performance, built-in regularization, and parallel computing capabilities.
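To make the “boosted” and “gradient” pieces concrete, here is a minimal hand-rolled sketch (not how xgboost is implemented internally) where each shallow tree is fit to the residuals of the ensemble so far. For squared-error loss, those residuals are exactly the negative gradient:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

# Start from a constant prediction, then let each shallow tree fit the
# residuals (the negative gradient of squared-error loss) left by the
# ensemble so far, shrunk by a learning rate.
pred = np.full_like(y, y.mean())
learning_rate = 0.1
for _ in range(100):
    residuals = y - pred
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred += learning_rate * tree.predict(X)

print(np.sqrt(np.mean((y - pred) ** 2)))  # training RMSE after boosting
```

Real XGBoost layers second-order gradient information and regularization on top of this basic loop, which is part of what makes it “extreme.”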

hyperopt: Bayesian Hyperoptimization

Now that we have our model itself established, let’s talk about how I selected parameters for it with hyperopt. hyperopt is a Python library that uses Bayesian optimization to find the best parameters. It has three main parts: an objective function, a domain space, and a search algorithm [7].

  • Bayesian optimization: This is another area that gets into the weeds [8], but it can be thought of as a probabilistic, model-based technique to minimize a function. It’s quicker than a random search of parameters because it uses the posterior distribution to establish which parameter spaces are most worth exploring. In this way, the future parameter combinations are informed by the previous ones.
    • Objective function: This is the function we want our Bayesian optimizer to minimize. It takes hyperparameter values from our domain space and outputs the validation metric (in our case, RMSE). Here, the objective function wraps the XGB model discussed above: we want to minimize its cross-validated error, so we optimize the hyperparameters against that exact model construction.
    • Domain space: The set of hyperparameters and their input values over which we want to search.
    • Optimization algorithm: The optimization algorithm used in this model is Tree of Parzen Estimators (TPE). This is where the Bayesian optimization discussed above actually happens.

At the end of the hyperopt process, we have a set of parameters that returns the smallest RMSE. Then, we can train our model on the best parameters.

Data

The data used in this model comes courtesy of Dr. Erin Satterthwaite of the California Cooperative Oceanic Fisheries Investigations (CalCOFI).

Metadata

  • Lat_Dec: Observed Latitude in decimal degrees
  • Lon_Dec: Observed Longitude in decimal degrees
  • NO2uM: Micromoles Nitrite per liter of seawater
  • NO3uM: Micromoles Nitrate per liter of seawater
  • NH3uM: Micromoles Ammonia per liter of seawater
  • R_TEMP: Reported (Potential) Temperature in degrees Celsius
  • R_Depth: Reported Depth (from pressure) in meters
  • R_Sal: Reported Salinity (from Specific Volume Anomaly, m³/kg)
  • R_DYNHT: Reported Dynamic Height in units of dynamic meters (work per unit mass)
  • R_Nuts: Reported Ammonium concentration
  • R_Oxy_micromol.Kg: Reported Oxygen micromoles/kilogram
  • PO4uM: Micromoles Phosphate per liter of seawater
  • SiO3uM: Micromoles Silicate per liter of seawater
  • TA1.x: Total Alkalinity micromoles per kilogram solution
  • Salinity1: Salinity (Practical Salinity Scale 1978)
  • Temperature_degC: Water temperature in degrees Celsius
  • DIC: Dissolved Inorganic Carbon micromoles per kilogram solution

The Coding Breakdown

# Load basic libraries
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import statistics as stats
import time

# XGB libraries
from sklearn.model_selection import train_test_split,RandomizedSearchCV, cross_val_score, KFold
import xgboost as xgb
from hyperopt import hp, fmin, tpe, Trials, STATUS_OK
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from scipy.stats import uniform, randint
from sklearn.preprocessing import StandardScaler

Import Data and Explore

# Import data
train_df = pd.read_csv("~/MEDS/website/haylee360.github.io/posts/2025-03-30-kaggle/data/train.csv")
test_df = pd.read_csv("~/MEDS/website/haylee360.github.io/posts/2025-03-30-kaggle/data/test.csv")

# Fix column name error
test_df = test_df.rename(columns={'TA1':'TA1.x'})

# Remove NA column from training data
train_df = train_df.drop(columns='Unnamed: 12')
# Get a feel for feature summary stats
train_df.describe()
| Feature | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 1454 | 727.5 | 419.878 | 1.0 | 364.25 | 727.5 | 1090.75 | 1454.0 |
| Lat_Dec | 1454 | 33.271 | 0.891 | 30.418 | 32.655 | 33.421 | 34.151 | 34.663 |
| Lon_Dec | 1454 | -120.216 | 1.720 | -124.001 | -121.845 | -120.025 | -118.630 | -117.309 |
| NO2uM | 1454 | 0.062 | 0.285 | 0.0 | 0.0 | 0.014 | 0.050 | 8.190 |
| NO3uM | 1454 | 18.886 | 14.414 | 0.0 | 1.878 | 22.6 | 31.5 | 42.0 |
| NH3uM | 1454 | 0.085 | 0.191 | 0.0 | 0.0 | 0.010 | 0.090 | 2.75 |
| R_TEMP | 1454 | 10.883 | 3.702 | 1.25 | 8.185 | 9.90 | 13.668 | 22.75 |
| R_Depth | 1454 | 193.452 | 347.486 | 1.0 | 30.0 | 101.0 | 252.0 | 3595.0 |
| R_Sal | 1454 | 224.528 | 88.428 | 44.9 | 149.475 | 202.0 | 299.075 | 485.9 |
| R_DYNHT | 1454 | 0.375 | 0.365 | 0.003 | 0.107 | 0.294 | 0.578 | 3.226 |
| R_Nuts | 1454 | 0.085 | 0.191 | 0.0 | 0.0 | 0.010 | 0.090 | 2.75 |
| R_Oxy_micromol.Kg | 1454 | 146.508 | 92.421 | 0.0 | 59.171 | 136.267 | 244.636 | 332.348 |
| PO4uM | 1454 | 1.645 | 1.024 | 0.17 | 0.49 | 1.82 | 2.56 | 4.28 |
| SiO3uM | 1454 | 29.171 | 28.629 | 0.0 | 3.585 | 24.15 | 45.675 | 175.2 |
| TA1.x | 1454 | 2256.054 | 35.215 | 2181.57 | 2230.033 | 2244.02 | 2279.175 | 2433.71 |
| Salinity1 | 1454 | 33.764 | 0.398 | 32.84 | 33.417 | 33.747 | 34.149 | 34.676 |
| Temperature_degC | 1454 | 10.901 | 3.685 | 1.52 | 8.215 | 9.91 | 13.668 | 22.75 |
| DIC | 1454 | 2150.469 | 113.164 | 1948.85 | 2025.819 | 2166.63 | 2252.658 | 2367.8 |
# Check NAs
train_df.isna().sum()
id                   0
Lat_Dec              0
Lon_Dec              0
NO2uM                0
NO3uM                0
NH3uM                0
R_TEMP               0
R_Depth              0
R_Sal                0
R_DYNHT              0
R_Nuts               0
R_Oxy_micromol.Kg    0
PO4uM                0
SiO3uM               0
TA1.x                0
Salinity1            0
Temperature_degC     0
DIC                  0
dtype: int64
# Visualize feature relationships
sns.pairplot(train_df, y_vars=['DIC'], x_vars= train_df.columns[1:-1], diag_kind='kde')

Model Selection: XGB with Hyperoptimization

The relationships look mostly linear, but we’re working with a lot of features. I figured gradient boosting would be a good approach.

# Assign features
X = train_df.drop(columns=['id', 'DIC'])
y = train_df['DIC']
X_test = test_df.drop(columns=['id'])

# Scale the data
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
# For predictions later on...
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

Define Objective Function

# Set up kfold cross validation
kf = KFold(n_splits=5, shuffle=True, random_state=808)

# Define objective function to minimize
def objective(params):
    model = XGBRegressor(
        n_estimators=int(params["n_estimators"]),
        learning_rate=params["learning_rate"],
        max_depth=int(params["max_depth"]),
        min_child_weight=params["min_child_weight"],
        subsample=params["subsample"],
        colsample_bytree=params["colsample_bytree"],
        gamma=params["gamma"],
        reg_alpha=params["reg_alpha"],
        reg_lambda=params["reg_lambda"],
        random_state=808
    )
    
    # Perform cross-validation
    scores = -cross_val_score(model, X_scaled, y, cv=kf, scoring='neg_root_mean_squared_error', n_jobs=-1)

    # Average RMSE across folds
    rmse = np.mean(scores)

    return {'loss': rmse, 'status': STATUS_OK}

Create Domain space

# Create hyperparameter space
space = {
    "n_estimators": hp.quniform("n_estimators", 100, 1200, 10),
    "learning_rate": hp.uniform("learning_rate", 0.005, 0.3),
    "max_depth": hp.quniform("max_depth", 3, 20, 1),
    "min_child_weight": hp.uniform("min_child_weight", 1, 10),
    "subsample": hp.uniform("subsample", 0.5, 1.0),
    "colsample_bytree": hp.uniform("colsample_bytree", 0.5, 1.0),
    "gamma": hp.uniform("gamma", 0, 10),  
    "reg_alpha": hp.uniform("reg_alpha", 0, 1),  
    "reg_lambda": hp.uniform("reg_lambda", 0, 1),  
}

Run Optimization Algorithm

# Run hyperopt
trials = Trials()
best_params = fmin(
    fn=objective, 
    space=space,      
    algo=tpe.suggest, 
    max_evals=200,
    trials=trials,       
    rstate=np.random.default_rng(808)  
)

Train the Model on the Best Parameters

Now that we’ve optimized all of our relevant parameters, we can train our XGB model. We use **best_params to unpack the best parameters from before and initialize an XGBRegressor model.

# hp.quniform returns floats, so cast integer hyperparameters to int to avoid a type error
best_params["n_estimators"] = int(best_params["n_estimators"])
best_params["max_depth"] = int(best_params["max_depth"])

# Initialize best hyperopt model
xgb_hyper = XGBRegressor(**best_params, eval_metric='rmse', random_state=808)

# Fit model
xgb_hyper.fit(X_scaled, y)

# Predict on test data
y_pred_hyper = xgb_hyper.predict(X_test_scaled)
# Get feature importance
feat_imp_hyper = pd.DataFrame({'Feature': X_scaled.columns, 'Importance': xgb_hyper.feature_importances_})

# Sort by importance
feat_imp_hyper = feat_imp_hyper.sort_values(by="Importance", ascending=False)
feat_imp_hyper
| Feature | Importance |
|---|---|
| SiO3uM | 0.439688 |
| PO4uM | 0.375247 |
| R_Oxy_micromol.Kg | 0.099113 |
| R_Sal | 0.032475 |
| NO3uM | 0.017117 |
| TA1.x | 0.015256 |
| Salinity1 | 0.012926 |
| R_Depth | 0.004667 |
| NO2uM | 0.000781 |
| Temperature_degC | 0.000693 |
| R_TEMP | 0.000607 |
| R_DYNHT | 0.000389 |
| NH3uM | 0.000282 |
| Lat_Dec | 0.000281 |
| Lon_Dec | 0.000257 |
| R_Nuts | 0.000222 |

Now that we’ve generated our predictions on the test data, all we need to do is add those to their associated ID’s in the test_df and export to csv for submission to the competition.

# Add DIC to test dataset
test_df['DIC'] = y_pred_hyper
submission = test_df[['id', 'DIC']]
submission.head()
| | id | DIC |
|---|---|---|
| 0 | 1455 | 2170.5910 |
| 1 | 1456 | 2194.9880 |
| 2 | 1457 | 2326.0432 |
| 3 | 1458 | 1991.1729 |
| 4 | 1459 | 2147.3965 |
# Export for submission
submission.to_csv('submission.csv', index=False)

And just like that, you can have a competition-winning machine learning model! A very big thanks to Professor Robbins for his guidance in this course, Dr. Satterthwaite for her wonderful guest lecture, and Annie Adams for her assistance all quarter.

Citation

BibTeX citation:

@online{oyler2025,
  author = {Oyler, Haylee},
  title = {So {You} {Want} to {Win} a {Kaggle} {Competition?}},
  date = {2025-03-30},
  url = {https://haylee360.github.io/posts/2025-03-30-kaggle/},
  langid = {en}
}

For attribution, please cite this work as:

Oyler H. So You Want to Win a Kaggle Competition? 30 Mar 2025. Available: https://haylee360.github.io/posts/2025-03-30-kaggle/

References

1. Wiederhold G, McCarthy J. Arthur Samuel: Pioneer in machine learning. IBM Journal of Research and Development. 1992;36: 329–331. doi: 10.1147/rd.363.0329
2. Pinto-Coelho L. How artificial intelligence is shaping medical imaging technology: A survey of innovations and applications. Bioengineering. 2023;10: 1435. doi: 10.3390/bioengineering10121435
3. Dada EG, Bassi JS, Chiroma H, Abdulhamid SM, Adetunmbi AO, Ajibuwa OE. Machine learning for email spam filtering: Review, approaches and open research problems. Heliyon. 2019;5: e01802. doi: 10.1016/j.heliyon.2019.e01802
4. Kwiatkowski R. Gradient descent algorithm - a deep dive. Medium. TDS Archive; 2023. Available: https://medium.com/data-science/gradient-descent-algorithm-a-deep-dive-cf04e8115f21
6. IBM. What is gradient descent? 2025. Available: https://www.ibm.com/think/topics/gradient-descent
7. Banerjee P. Bayesian optimization using hyperopt. Kaggle; 2020. Available: https://www.kaggle.com/code/prashant111/bayesian-optimization-using-hyperopt
8. Nogueira F. Bayesian Optimization : Open source constrained global optimization tool for Python . 2014—. Available: https://github.com/bayesian-optimization/BayesianOptimization