UFC Match Outcome Predictor
A data-driven system that collects and integrates UFC match statistics, engineers historical and contextual features, and trains Random Forest models to predict match outcomes, reporting a vote-share confidence with each prediction.
Context
Predicting UFC match outcomes benefits from combining fighter profiles, historical performance, and detailed match statistics. This project consolidates disparate sources (UFC stats, event metadata, Wikipedia records) into a leakage-safe dataset for supervised learning.
The Problem
Raw match data is fragmented across sources, noisy, and temporally sensitive. Naively merging records risks mismatches and data leakage, while limited features hinder model performance and interpretability.
Approach
Built a reproducible pipeline for data collection, fuzzy name alignment, and feature engineering (rolling averages, historical performance, physical attributes). Trained and tuned Random Forest models (scikit-learn as the primary implementation, XGBoost's random forest mode as an alternative), evaluated them on a held-out test set, and exposed a notebook workflow for matchup predictions that reports the predicted class together with its vote-share probability.
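The leakage-safe rolling averages mentioned above can be illustrated with a minimal sketch. It is not the project's actual code: the function name, the record keys ("date", "fighter", "sig_strikes"), and the window size are all hypothetical, and it uses only the standard library so the ordering logic is explicit. In pandas the same idea is usually written as a per-fighter groupby followed by a shift and a rolling mean.

```python
from collections import defaultdict, deque

def add_prefight_rolling_mean(fights, stat, window=3):
    """Attach each fighter's mean of `stat` over their previous `window` fights.

    `fights` is a list of dicts with the hypothetical keys "date" (ISO string),
    "fighter", and `stat`. Only fights strictly before the current one
    contribute, so the feature never sees the bout it is describing.
    """
    history = defaultdict(lambda: deque(maxlen=window))
    out = []
    for row in sorted(fights, key=lambda r: r["date"]):  # chronological order
        past = history[row["fighter"]]
        feature = sum(past) / len(past) if past else None  # None for a debut
        out.append({**row, f"{stat}_prev_mean": feature})
        past.append(row[stat])  # update history only after computing the feature
    return out

# Illustrative, made-up records.
fights = [
    {"date": "2021-01-10", "fighter": "A", "sig_strikes": 40},
    {"date": "2021-06-05", "fighter": "A", "sig_strikes": 60},
    {"date": "2022-02-12", "fighter": "A", "sig_strikes": 20},
    {"date": "2021-03-20", "fighter": "B", "sig_strikes": 30},
]
rows = add_prefight_rolling_mean(fights, "sig_strikes", window=2)
for r in rows:
    print(r["date"], r["fighter"], r["sig_strikes_prev_mean"])
```

The key design choice is that the current fight's value is appended to the history only after the feature has been computed. Reversing those two steps would leak the bout's own statistics into its input features.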
Architecture
Python project organized around Jupyter notebooks:
data/collection holds the raw JSON/CSV files and the transformed training_data.csv.
data/analysis/feature_engineering.ipynb creates the features.
training/rfclassifier_training.ipynb trains the models and runs RandomizedSearchCV.
prediction/fight_prediction.ipynb handles inference.
Trained models are persisted as .pkl files in saved_models/. Core libraries: pandas, NumPy, scikit-learn, XGBoost, SciPy, thefuzz.
Results
Achieved 62.4% accuracy on the held-out test set with a scikit-learn Random Forest (80/20 train/test split, hyperparameters tuned with RandomizedSearchCV).
Engineered a final training dataset with 80 features spanning match stats, historical performance, and physical/contextual attributes.
Implemented leakage-safe rolling metrics and robust string matching for dataset alignment.
Provided an alternative XGBoost Random Forest configuration with comparable behavior.
Delivered a notebook-driven prediction interface that outputs class and vote-share confidence for new matchups.
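To make the "vote-share confidence" above concrete, here is a small sketch of how a share can be derived from individual tree votes. The function name and the vote labels are hypothetical, and the votes are made up. Note that scikit-learn's RandomForestClassifier.predict_proba averages each tree's class probabilities rather than counting hard votes. The two coincide when every leaf is pure, so the number is best read as a vote share and not necessarily as a calibrated probability.

```python
from collections import Counter

def vote_share(tree_votes):
    """Return (predicted_class, share) from a list of per-tree class votes."""
    if not tree_votes:
        raise ValueError("need at least one vote")
    counts = Counter(tree_votes)
    winner, n_votes = counts.most_common(1)[0]
    return winner, n_votes / len(tree_votes)

# Hypothetical forest of 200 trees: 130 vote for fighter A, 70 for fighter B.
votes = ["fighter_a"] * 130 + ["fighter_b"] * 70
prediction, confidence = vote_share(votes)
print(prediction, confidence)
```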
What I Learned
Ensuring temporal consistency and avoiding leakage are critical for combat-sports prediction. High-quality feature engineering and rigorous data cleaning (deduplication, missing value handling, fuzzy matching) materially impact model reliability. Probability outputs aid decision-making beyond raw class predictions.
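The fuzzy matching step can be sketched as follows. The project itself lists thefuzz as the matching library. This example deliberately swaps in the standard library's difflib.SequenceMatcher so it runs without extra dependencies. The helper names, the normalization rules, and the 0.85 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, drop periods, turn hyphens into spaces, collapse whitespace."""
    return " ".join(name.lower().replace(".", "").replace("-", " ").split())

def best_match(name, candidates, threshold=0.85):
    """Return the closest candidate name, or None if nothing clears the threshold."""
    target = normalize(name)
    scored = [
        (SequenceMatcher(None, target, normalize(c)).ratio(), c)
        for c in candidates
    ]
    score, match = max(scored, default=(0.0, None))
    return match if score >= threshold else None

roster = ["Khabib Nurmagomedov", "Conor McGregor", "Jon Jones"]
print(best_match("Connor Mcgregor", roster))  # misspelled variant of a roster name
print(best_match("Some Unknown", roster))     # no plausible match
```

Returning None below the threshold, instead of always taking the top-scoring candidate, keeps unmatched names visible for manual review so they are not silently merged into the wrong fighter's record.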
Impact Summary
Built an end-to-end ML pipeline that predicts UFC match outcomes using multi-source historical data and 80 engineered features, achieving 62.4% test accuracy with probability outputs.