PlayerData
Athlete benchmark analytics framework for contextualizing collegiate wearable data. Status: completed.
What It Is
PlayerData is an athlete benchmark analytics platform developed as part of an Applied Data Science Consulting course at Indiana University Indianapolis. The system converts raw wearable session metrics into contextual performance insights by benchmarking individual athletes against statistical cohorts of similar athletes (by sport, gender, division, and age group).
The central value proposition: transform "your top speed was 27 km/h" into "your top speed was 27 km/h, placing you in the 68th percentile for female collegiate soccer athletes at your age."
This is a team project. The DE (Data Engineering) folder contains the data pipeline, Streamlit dashboard, and conversational chatbot, which were the primary contributions from the DE team.
Architecture
The system has four implemented layers:
- Data Ingestion and Validation: Reads raw athlete session CSVs, validates 21 required columns, coerces types, removes duplicates and nulls. Handles both real sample data and synthetic training data.
- Metrics Pipeline: Translates the DS team's R statistical pipeline into Python. Filters sessions by sport, gender, and minimum active minutes, computes age-group summaries (mean, standard deviation, 10th/90th percentiles) for key metrics (total distance, max speed, session load), and calculates within-cohort percent ranks for individual athletes.
- Streamlit Dashboard: A 5-tab analytics interface with sidebar filters for gender, sport, division, age range, minimum active minutes, and data source. Tabs cover cohort overview with KPI cards, age-group analysis, individual athlete drilldown with percentile radar charts, cross-sport benchmarking, and the conversational chatbot.
- Rule-Based Chatbot: A natural language interface that parses user queries into intents (stats, rank, compare, cohort) and entities (athlete ID, metric, division, sport, gender), then routes to the appropriate tool function. No LLM or external API required. Responds with formatted markdown tables and human-readable summaries.
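The metrics pipeline layer can be illustrated with a minimal pandas sketch. It is not the project's actual code. The column names (`sport`, `gender`, `age_group`, `active_minutes`, `total_distance`, `max_speed`, `session_load`) are hypothetical, rows are ranked per session for simplicity, and the percent-rank formula follows dplyr's `percent_rank()`, which is only an assumption about what the R pipeline used.

```python
import pandas as pd

# Hypothetical column names; the project's real schema is not shown in this document.
METRICS = ["total_distance", "max_speed", "session_load"]


def p10(s: pd.Series) -> float:
    """10th percentile of a metric within a group."""
    return s.quantile(0.10)


def p90(s: pd.Series) -> float:
    """90th percentile of a metric within a group."""
    return s.quantile(0.90)


def filter_cohort(df: pd.DataFrame, sport: str, gender: str,
                  min_active_minutes: float) -> pd.DataFrame:
    """Keep sessions for one sport and gender with enough active time."""
    mask = (
        (df["sport"] == sport)
        & (df["gender"] == gender)
        & (df["active_minutes"] >= min_active_minutes)
    )
    return df.loc[mask].copy()


def age_group_summary(cohort: pd.DataFrame) -> pd.DataFrame:
    """Mean, standard deviation, and 10th/90th percentiles per age group."""
    return cohort.groupby("age_group")[METRICS].agg(["mean", "std", p10, p90])


def add_percent_ranks(cohort: pd.DataFrame) -> pd.DataFrame:
    """Append <metric>_pct columns holding the percent rank within each age group.

    Uses (min_rank - 1) / (n - 1), the definition of dplyr's percent_rank().
    Assumption: the R pipeline used this definition. Groups of size 1 get NaN.
    """
    out = cohort.copy()
    grouped = cohort.groupby("age_group")
    for metric in METRICS:
        rank = grouped[metric].rank(method="min")
        n = grouped[metric].transform("count")
        out[f"{metric}_pct"] = (rank - 1) / (n - 1).where(n > 1)
    return out
```

Multiplying a `_pct` column by 100 gives a percentile figure of the kind quoted in the value proposition. A per-athlete variant would aggregate sessions to one row per athlete before ranking.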
Data Sources
- Sample Data: 7,868 sessions from 281 unique athletes across two sports (association football and American football), 95% Division I. 21 raw metrics per session covering active minutes, distances, speeds, accelerations, sprint events, and session load.
- Synthetic Data: 20,000 generated sessions across men's Division II and women's Division III with extended age range (16-37 years).
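The ingestion and validation step described under Architecture could look roughly like the following when applied to session CSVs such as these. This is a sketch under stated assumptions: `REQUIRED_COLUMNS` is a hypothetical subset standing in for the 21 required columns (the real list is not given here), and `validate_sessions` is an illustrative name, not the project's function.

```python
import pandas as pd

# Hypothetical stand-in for the 21 required columns; the real list is not given here.
REQUIRED_COLUMNS = [
    "athlete_id",
    "session_date",
    "active_minutes",
    "total_distance",
    "max_speed",
    "session_load",
]
NUMERIC_COLUMNS = ["active_minutes", "total_distance", "max_speed", "session_load"]


def validate_sessions(df: pd.DataFrame) -> pd.DataFrame:
    """Check required columns, coerce types, then drop nulls and duplicates."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    out = df.copy()
    # Values that cannot be parsed become NaN/NaT and are dropped below.
    for col in NUMERIC_COLUMNS:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    out["session_date"] = pd.to_datetime(out["session_date"], errors="coerce")

    out = out.dropna(subset=REQUIRED_COLUMNS)
    out = out.drop_duplicates()
    return out.reset_index(drop=True)


# Typical use with a CSV on disk (the path is a placeholder):
# sessions = validate_sessions(pd.read_csv("sessions.csv"))
```

Coercing before dropping nulls means a malformed value such as a non-numeric active-minutes entry removes that session instead of raising, which suits batch processing. A stricter pipeline might log or reject such rows instead.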
What It Does Not Do
- No real-time wearable device integration (batch CSV processing only).
- No LLM or cloud API dependency. The chatbot is entirely rule-based and runs locally.
- No position-level analysis (session type and athlete position labels not available in current data).
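To show what a rule-based chatbot with no LLM can look like, here is a minimal sketch of intent and entity parsing with keyword tables and a regular expression, followed by routing to a tool function. The keyword lists, metric aliases, athlete ID format, and stub tools are all assumptions made for illustration. The project's actual rules are not shown in this document, and the division, sport, and gender entities are omitted for brevity.

```python
import re
from dataclasses import dataclass, field

# Assumed keyword tables; the first matching intent wins (dicts keep insertion order).
INTENT_KEYWORDS = {
    "compare": ("compare", "versus", " vs "),
    "rank": ("rank", "percentile"),
    "cohort": ("cohort", "benchmark"),
    "stats": ("stats", "average", "summary"),
}
METRIC_ALIASES = {
    "top speed": "max_speed",
    "max speed": "max_speed",
    "distance": "total_distance",
    "load": "session_load",
}
# Assumed ID format: optional letters followed by digits, e.g. "42" or "A17".
ATHLETE_ID = re.compile(r"\bathlete\s+#?([A-Za-z]*\d+)", re.IGNORECASE)
HELP_TEXT = "Try asking for stats, rank, compare, or cohort."


@dataclass
class ParsedQuery:
    intent: str
    entities: dict = field(default_factory=dict)


def parse_query(text: str) -> ParsedQuery:
    """Map free text to an intent and a dict of entities using fixed rules."""
    lowered = f" {text.lower()} "
    intent = next(
        (name for name, words in INTENT_KEYWORDS.items()
         if any(word in lowered for word in words)),
        "unknown",
    )
    entities = {}
    athlete_ids = ATHLETE_ID.findall(text)
    if athlete_ids:
        entities["athlete_ids"] = athlete_ids
    for alias, metric in METRIC_ALIASES.items():
        if alias in lowered:
            entities["metric"] = metric
            break
    return ParsedQuery(intent, entities)


def make_stub(intent: str):
    """Build a placeholder tool; a real tool would query the metrics pipeline."""
    def tool(entities: dict) -> str:
        return f"[{intent}] would run with entities: {entities}"
    return tool


TOOLS = {name: make_stub(name) for name in INTENT_KEYWORDS}


def respond(text: str) -> str:
    """Parse the query and route it to the matching tool function."""
    parsed = parse_query(text)
    tool = TOOLS.get(parsed.intent)
    return tool(parsed.entities) if tool else HELP_TEXT
```

For example, `parse_query("What percentile is athlete 42 for top speed?")` yields the intent `rank` with athlete ID `42` and metric `max_speed`. Because every rule is a plain table lookup, behavior is deterministic and runs locally, at the cost of missing phrasings that the tables do not anticipate.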
Current State
The DE pipeline, Streamlit dashboard, and chatbot are complete and functional. The chatbot was the final deliverable, added in early April 2026. Phase II consulting deliverables (PPTX report and interactive HTML report) were delivered to the PlayerData client in March 2026.
A 17-column schema divergence between the sample and synthetic data is documented as a known limitation. Production scaling (from in-memory pandas to a columnar store) and session-type labeling are identified as Phase III priorities if the project continues beyond the course engagement.