Formula One Race Prediction
Machine learning pipeline to predict Formula One race winners using historical data from 1950–2020, achieving 87% prediction accuracy.
01. Project Overview
Overview
A comprehensive data analysis and machine learning pipeline that leverages historical Formula 1 race data from 1950 to 2020 to predict race outcomes and identify key performance factors.
Methodology
- Data Collection: Retrieved historical race data from the Ergast Developer API, covering driver results, constructor standings, qualifying positions, and circuit metadata across 70+ years of racing
- Exploratory Data Analysis: Visualized driver consistency, constructor dominance eras, and the statistical relationship between qualifying position and race finish
- Feature Engineering: Constructed features including grid position, constructor points, driver age, and circuit-specific performance indicators
- Predictive Modeling: Built and compared multiple model architectures:
  - Linear Regression compared with a hyperparameter-tuned Random Forest Regressor
  - Logistic Regression compared with a Random Forest Classifier for podium prediction
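
The feature engineering and model comparison steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the data is synthetic rather than pulled from Ergast, and the column names, the 5-race momentum window, the 2010-onward points scale, the chronological split at race 160, and the ROC AUC metric are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
N_RACES, N_DRIVERS = 200, 20
POINTS = [25, 18, 15, 12, 10, 8, 6, 4, 2, 1]  # assumed scale (2010 onward)

# Synthetic stand-in for the results table (NOT real Ergast data).
# Two drivers per constructor share a hidden "car pace" term.
car_pace = np.repeat(rng.normal(0, 3, N_DRIVERS // 2), 2)
rows = []
for race in range(N_RACES):
    grid = rng.permutation(N_DRIVERS) + 1
    score = grid + car_pace + rng.normal(0, 4, N_DRIVERS)  # lower is better
    finish = score.argsort().argsort() + 1  # rank of each driver, 1 = winner
    for d in range(N_DRIVERS):
        rows.append({"race": race, "constructor": d // 2,
                     "grid": int(grid[d]), "finish": int(finish[d])})
results = pd.DataFrame(rows)
results["points"] = results["finish"].map(lambda f: POINTS[f - 1] if f <= 10 else 0)

# Constructor momentum: points over the previous 5 races.
# shift(1) excludes the current race so the feature does not leak the target.
team = results.groupby(["constructor", "race"], as_index=False)["points"].sum()
team["momentum"] = team.groupby("constructor")["points"].transform(
    lambda s: s.shift(1).rolling(5, min_periods=1).sum()
)
results = results.merge(team[["constructor", "race", "momentum"]],
                        on=["constructor", "race"])
results = results.dropna(subset=["momentum"])  # first race has no history
results["podium"] = (results["finish"] <= 3).astype(int)

# Chronological split: train on earlier races, evaluate on later ones.
features = ["grid", "momentum"]
train = results[results["race"] < 160]
test = results[results["race"] >= 160]

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(train[features], train["podium"])
    proba = model.predict_proba(test[features])[:, 1]
    scores[name] = roc_auc_score(test["podium"], proba)

for name, auc in scores.items():
    print(f"{name}: ROC AUC = {auc:.3f}")
```

Splitting by race order rather than at random keeps later races out of the training set, which matters whenever features such as momentum are built from past results.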
Key Findings
- 87% prediction accuracy on race winner classification
- 1.2M+ data points analyzed across the historical dataset
- Strong relationship between starting grid position and win probability, with win probability following a logarithmic decay as grid position increases
- Constructor momentum (recent points) proved to be a stronger predictor than individual driver statistics
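
One way to quantify the grid-position finding is to fit a curve of the form p = a - b * ln(grid) to per-position win rates. The sketch below assumes that functional form as one reading of "logarithmic decay" and uses synthetic win rates generated from made-up coefficients (`TRUE_A`, `TRUE_B`), so the numbers it prints say nothing about the project's actual results.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative win rates by grid position (synthetic, not the project's data),
# generated from p = a - b * ln(grid) plus a little noise.
TRUE_A, TRUE_B = 0.40, 0.14
grid = np.arange(1, 16)
win_rate = TRUE_A - TRUE_B * np.log(grid) + rng.normal(0, 0.01, grid.size)

# Least-squares fit of a straight line in ln(grid).
# np.polyfit with degree 1 returns [slope, intercept].
slope, intercept = np.polyfit(np.log(grid), win_rate, 1)
a_hat, b_hat = intercept, -slope


def predicted_win_prob(position):
    """Fitted win probability for a grid position, clipped to [0, 1]."""
    return float(np.clip(a_hat - b_hat * np.log(position), 0.0, 1.0))


print(f"a = {a_hat:.3f}, b = {b_hat:.3f}")
print(f"P(win | pole) = {predicted_win_prob(1):.3f}")
print(f"P(win | P10)  = {predicted_win_prob(10):.3f}")
```

With real data, `win_rate` would be the share of races won from each grid slot, and the fitted `b_hat` would summarize how quickly the chance of winning falls away from pole.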
Tech Stack
- Python, Pandas, NumPy
- Scikit-learn (Random Forest, Logistic Regression)
- Matplotlib, Jupyter Notebooks
- Ergast Developer API
Technologies
Python, Pandas, Scikit-Learn, Matplotlib, Jupyter, Ergast API
Role
Data Scientist
Timeline
Jun 2021 - Jul 2021
Category
Machine Learning / Sports Analytics