Formula One Race Prediction

Machine learning pipeline to predict Formula One race winners using historical data from 1950–2020, achieving 87% prediction accuracy.

01. Project Overview

A data analysis and machine learning pipeline that uses historical Formula 1 race data from 1950 to 2020 to predict race outcomes and identify the factors that most influence them.


Methodology

  1. Data Collection: Retrieved historical race data from the Ergast Developer API, covering driver results, constructor standings, qualifying positions, and circuit metadata across 70+ years of racing
  2. Exploratory Data Analysis: Visualized driver consistency, constructor dominance eras, and the statistical relationship between qualifying position and race finish
  3. Feature Engineering: Constructed features including grid position, constructor points, driver age, and circuit-specific performance indicators
  4. Predictive Modeling: Built and compared multiple models:
    • Linear Regression and a Random Forest Regressor, with hyperparameter tuning
    • Logistic Regression and a Random Forest Classifier for podium prediction
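
A minimal sketch of how the podium classifiers in step 4 could be compared with Scikit-learn. This is an illustration under stated assumptions, not the project's actual code: the feature table is synthetic (randomly generated, not Ergast data), the column names `grid`, `constructor_points`, and `driver_age` are hypothetical stand-ins for the engineered features, and the rule that generates the podium labels is invented so the example runs end to end.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered feature table (NOT real race data).
rng = np.random.default_rng(seed=0)
n_rows = 2000
features = pd.DataFrame({
    "grid": rng.integers(1, 21, size=n_rows),                # starting grid position
    "constructor_points": rng.uniform(0, 600, size=n_rows),  # constructor points so far
    "driver_age": rng.uniform(19, 42, size=n_rows),          # driver age in years
})

# Invented label rule: better grid slots and stronger constructors
# reach the podium more often. Purely illustrative.
logit = 1.5 - 0.45 * features["grid"] + 0.004 * features["constructor_points"]
podium_prob = 1.0 / (1.0 + np.exp(-logit))
podium = (rng.random(n_rows) < podium_prob).astype(int)

# Hold out 25% of the rows for evaluation, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    features, podium, test_size=0.25, random_state=0, stratify=podium
)

models = {
    # Scaling helps the linear model converge; trees do not need it.
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Fit each model and record its accuracy on the held-out rows.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy = {scores[name]:.3f}")
```

Because the labels here come from an invented rule, the printed accuracies say nothing about the real dataset; the sketch only shows the shape of a side-by-side comparison on a shared train/test split.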

Key Findings

  • 87% prediction accuracy on race winner classification
  • 1.2M+ data points analyzed across the historical dataset
  • Strong relationship between starting grid position and win probability, which decays approximately logarithmically as grid position increases
  • Constructor momentum (recent points) proved to be a stronger predictor than individual driver statistics
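
The last two findings can be made concrete with a short sketch. It assumes the decay takes the form win_rate ≈ a - b * ln(grid), and it assumes "constructor momentum" means points scored over the previous three races; both are illustrative readings, not definitions taken from the project. All numbers are synthetic or placeholders (`team_a`, `team_b`), not measured results.

```python
import numpy as np
import pandas as pd

# --- Part 1: fit an assumed logarithmic decay, win_rate ~ a - b * ln(grid) ---
grid = np.arange(1, 11)
rng = np.random.default_rng(seed=1)
true_a, true_b = 0.45, 0.17  # assumed curve parameters, chosen for illustration
# Synthetic win rates: the assumed curve plus a little noise (NOT measured values).
win_rate = true_a - true_b * np.log(grid) + rng.normal(0, 0.01, grid.size)

# Least-squares line in ln(grid); the slope is -b and the intercept is a.
slope, intercept = np.polyfit(np.log(grid), win_rate, deg=1)
a_hat, b_hat = intercept, -slope
print(f"fitted a = {a_hat:.3f}, b = {b_hat:.3f}")

# --- Part 2: constructor momentum as points over the previous 3 races ---
results = pd.DataFrame({
    "constructor": ["team_a"] * 4 + ["team_b"] * 4,  # placeholder team names
    "round": [1, 2, 3, 4] * 2,
    "points": [25, 18, 0, 43, 10, 12, 15, 8],         # placeholder points
})
results = results.sort_values(["constructor", "round"])

# shift(1) excludes the current race, so the feature only uses past results
# and does not leak the outcome being predicted.
results["momentum"] = results.groupby("constructor")["points"].transform(
    lambda s: s.shift(1).rolling(window=3, min_periods=1).sum()
)
print(results)
```

The first race of each constructor has no history, so its momentum is NaN and would need to be imputed or dropped before modeling.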

Tech Stack

  • Python, Pandas, NumPy
  • Scikit-learn (Random Forest, Logistic Regression)
  • Matplotlib, Jupyter Notebooks
  • Ergast Developer API
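
As a rough sketch of the data collection side, the snippet below flattens an Ergast-style results payload into one row per driver per race. It assumes the nested JSON layout used by Ergast results responses (MRData, RaceTable, Races, Results), in which numeric fields arrive as strings. The payload is hand-written for illustration: the IDs `driver_a`, `team_a`, and `circuit_x` and all values are placeholders, and the helper name `flatten_results` is hypothetical. No network request is made.

```python
import json

# Hand-written payload in the shape of an Ergast results response.
# IDs and values are placeholders, not real race results.
SAMPLE_RESPONSE = """
{"MRData": {"RaceTable": {"Races": [
  {"season": "2020", "round": "1",
   "Circuit": {"circuitId": "circuit_x"},
   "Results": [
     {"position": "1", "grid": "2", "points": "25",
      "Driver": {"driverId": "driver_a"},
      "Constructor": {"constructorId": "team_a"}},
     {"position": "2", "grid": "1", "points": "18",
      "Driver": {"driverId": "driver_b"},
      "Constructor": {"constructorId": "team_b"}}
   ]}
]}}}
"""


def flatten_results(payload: dict) -> list[dict]:
    """Turn a nested results payload into one flat row per driver per race."""
    rows = []
    for race in payload["MRData"]["RaceTable"]["Races"]:
        for result in race["Results"]:
            rows.append({
                "season": int(race["season"]),
                "round": int(race["round"]),
                "circuit_id": race["Circuit"]["circuitId"],
                "driver_id": result["Driver"]["driverId"],
                "constructor_id": result["Constructor"]["constructorId"],
                # Numeric fields are strings in the payload, so cast them.
                "grid": int(result["grid"]),
                "position": int(result["position"]),
                "points": float(result["points"]),
            })
    return rows


rows = flatten_results(json.loads(SAMPLE_RESPONSE))
for row in rows:
    print(row)
```

A list of flat dictionaries like this can be passed directly to `pandas.DataFrame(rows)` for the analysis and feature engineering steps.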

Technologies

Python, Pandas, Scikit-Learn, Matplotlib, Jupyter, Ergast API

Role

Data Scientist

Timeline

Jun 2021 - Jul 2021

Category

Machine Learning / Sports Analytics