Raj

Intent Quotient Cricket

in-progress

Quantifying batting intent in T20 cricket independent of outcome

3 progress reports
ETL, Research, Engineering, Modeling, Launch
Python 3, pandas, NumPy, matplotlib, seaborn, plotly, scikit-learn (TF-IDF, LogisticRegression
Overview

What It Is

The Intent Quotient (IQ) is a custom analytical framework that quantifies batting intent in T20 cricket. The core innovation is separating the decision to attack from the result of that decision, enabling context-aware analysis independent of raw batting outcomes. Traditional metrics conflate intent with success; IQ isolates intent as a tactical choice modulated by match situation, wickets in hand, phase of play, and scoring pressure.


Architecture

The system has four processing layers:

  1. Data Ingestion: Cricsheet CSV2 ball-by-ball data for IPL and four other T20 competitions (BBL, PSL, T20I, and The Hundred). A unified parser handles zip downloads, CSV parsing, metadata extraction, and player registry normalization.

  2. Commentary Scraping and Labeling: Ball-level commentary text is scraped from ESPNcricinfo and joined to delivery-level data. Aggressive shot attempts are identified via regex heuristics (40+ attack patterns) and refined with supervised ML (TF-IDF features with logistic regression and SVM classifiers).

  3. Metric Computation: Per-batter, per-phase computation of Attack Frequency (rate of aggressive attempts), Boundary Conversion (success rate on attacks with Laplace smoothing), and Intent Rating (weighted blend). Bayesian shrinkage regularizes estimates for batters with limited phase-specific sample sizes using season-wide performance as a prior.

  4. Analysis and Visualization: Phase-level IQ plots (team bar charts, player scatter plots with quadrant zones), comparative analysis across powerplay, middle overs, and death overs.

Key Features

Data Sources

  • Cricsheet: Ball-by-ball CSV2 data covering IPL seasons 2007 to 2024, plus BBL, PSL, T20I, and The Hundred.
  • ESPNcricinfo: Commentary HTML scraped via Playwright and Selenium for ball-level text annotations. A series URL catalog is maintained as JSON, mapping IPL seasons to Cricinfo identifiers.

Current State

The full IPL data pipeline (17 seasons), IQ metric with Bayesian shrinkage, commentary-to-delivery joining, regex and ML labeling, and phase-level visualization are all complete. Output includes team and player IQ plots for the 2024 IPL season. Planned extensions include cross-league normalization, player clustering by intent profile, and outcome-agnostic predictive models.

Progress Reports
Report #3 of 3
Intent Quotient Cricket: ML Finalization & Project Restructure
Dec 13, 2025

Devlog

Two months after the initial analysis sprint, I came back to the Intent Quotient project to finalize it as a standalone body of work. The metric design and visualization were done in October, but the repository was a collection of scripts and notebooks scattered at the root level with no clear structure or documentation. This session was about making the project self-contained and reproducible.

The SVM baseline model was serialized to a joblib file and run against the full 2007-2024 IPL history, producing predictions for every delivery with commentary coverage. The master dataset was consolidated into a single parquet file that combines deliveries, commentary text, metadata, and derived features into one source of truth. Having everything in parquet rather than scattered CSVs makes the dataset practical to work with: column pruning, predicate pushdown, and compression bring what would be gigabytes of CSV down to a manageable working set.
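As a rough illustration of the serialize-then-score workflow, here is a minimal sketch assuming a text-only pipeline along the lines of the TF-IDF plus calibrated LinearSVC baseline from Report #2. The toy commentary lines, the labels, and the file name `svm_baseline_example.joblib` are invented for the example, and the contextual features (over, runs, phase, team, bowler) are left out for brevity.

```python
import os
import tempfile

import joblib
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# Toy commentary lines and labels, invented for this example.
texts = [
    "lofted over long-on", "slogs across the line", "charges and smashes it",
    "steps out and launches", "solid forward defence", "left alone outside off",
    "blocked back to the bowler", "nudged gently to midwicket",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

model = Pipeline([
    ("tfidf", FeatureUnion([
        # Word 1- and 2-grams plus character 3- to 5-grams, as in Report #2.
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
        ("char", TfidfVectorizer(analyzer="char", ngram_range=(3, 5))),
    ])),
    # LinearSVC has no predict_proba; the wrapper adds calibrated probabilities.
    ("clf", CalibratedClassifierCV(LinearSVC(random_state=0), cv=2)),
])
model.fit(texts, labels)

# Round-trip through joblib, then score unseen text with the reloaded model.
path = os.path.join(tempfile.mkdtemp(), "svm_baseline_example.joblib")
joblib.dump(model, path)
reloaded = joblib.load(path)
proba = reloaded.predict_proba(["tries to go big over cover"])
```

Saving the whole pipeline rather than only the classifier keeps the fitted vectorizers with the model, so scoring the historical deliveries needs nothing beyond the one file.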

The repository restructuring moved files into proper module directories: `/data_processing` for Cricsheet ingestion and commentary joining, `/labeling` for the regex and NLP classification, `/scraping` for the Cricinfo crawlers, `/analysis` for the IQ computation and plotting scripts, and `/notebooks` for the exploratory work. The `.gitignore` was expanded to exclude raw Cricsheet downloads, cache files, and pycache directories.

I also wrote the README with the full project overview: the motivation for separating intent from outcome, the metric definition, the data pipeline, the labeling methodology, and the visualization showcase. This matters because the Intent Quotient is one of the foundational research pieces for the CoverDrive Cricket platform, and it needs to be understandable by someone who was not in the room when the metric was designed.

The project is now complete as a standalone research artifact. The metric, the data, the model, and the analysis are all documented and reproducible. The next step is not more work on this repo but rather integrating the IQ metric into CoverDrive Cricket as a live feature, which will happen in that project's codebase.

What's next: Integration of the Intent Quotient as a feature within the CoverDrive Cricket platform, where it will surface as a per-player and per-team metric in match previews.

Changelog

Added

  • Add serialized SVM baseline model (attempt_baseline_svm_textunion.joblib)
  • Add full 2007-2024 SVM predictions dataset (parquet)
  • Add master consolidated dataset (IPL_data_2007_2024.parquet) combining deliveries, commentary, and features
  • Add README with metric definition, pipeline documentation, and visualization showcase

Changed

  • Update repository structure: reorganize root-level files into /data_processing, /labeling, /scraping, /analysis, /notebooks
  • Update .gitignore to exclude raw Cricsheet downloads, cache, and pycache

Removed

  • Remove root-level script files (moved to proper module directories)
svm, model-serialization, joblib, parquet, project-restructure, documentation, readme, data-consolidation, ipl, cricket-analytics
Report #2 of 3
Intent Quotient Cricket: IQ Metric Design & Visualization
Oct 10 - Oct 13, 2025

Devlog

With the joined commentary-delivery dataset in place, I built the labeling pipeline and designed the Intent Quotient metric over three days.

The labeling problem is: given a ball's commentary text, did the batter attempt an aggressive shot? I built two approaches. The regex heuristic scans for 40+ attack patterns (lofted, slog, charge, smash, tonk, launch, and variations) while filtering out defensive false positives. It is fast and interpretable but misses subtler intent signals. The NLP baseline uses TF-IDF features (word 1- and 2-grams plus character 3- to 5-grams) combined with contextual features (over, runs, phase, team, bowler), fed into a LinearSVC with calibrated probability outputs. The SVM handles cases the regex misses, like "steps out and tries to go big" or "makes room to free the arms." Both approaches produce a boolean `attempt_attacking` label and a `success_boundary` flag that is set when the attack produced a four or six.
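A minimal sketch of the regex side of this, assuming a small subset of attack words and a hypothetical defensive veto list. The actual 40+ patterns and filtering rules are not reproduced here, and `label_delivery` is a name made up for the example.

```python
import re

# Illustrative subset only; the real pattern list is much longer.
ATTACK_RE = re.compile(
    r"\b(loft(?:s|ed|ing)?|slog(?:s|ged|ging)?|charg(?:e|es|ed|ing)|"
    r"smash(?:es|ed|ing)?|tonk(?:s|ed|ing)?|launch(?:es|ed|ing)?)\b",
    re.IGNORECASE,
)
# Hypothetical defensive cues used to veto false positives.
DEFENSIVE_RE = re.compile(
    r"\b(defend(?:s|ed|ing)?|block(?:s|ed|ing)?|leaves|shoulders arms)\b",
    re.IGNORECASE,
)


def label_delivery(text: str, runs_off_bat: int) -> dict:
    """Return the two boolean labels for one ball of commentary."""
    attempt = bool(ATTACK_RE.search(text)) and not DEFENSIVE_RE.search(text)
    return {
        "attempt_attacking": attempt,
        # A boundary only counts as a success when it followed an attack.
        "success_boundary": attempt and runs_off_bat in (4, 6),
    }
```

With these patterns, "Charges down the track and slogs, misses" is labeled as an attempt without a boundary, while "steps out and tries to go big" is not labeled at all, which is the kind of miss the SVM is meant to cover.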

The IQ metric itself has three components. Attack Frequency (AF) is the rate of aggressive attempts per legal ball faced in a given phase. Boundary Conversion (BC) is the success rate on those attempts, with Laplace smoothing to handle zero denominators. Intent Rating blends AF and BC. The key innovation is the Bayesian shrinkage layer: for batters with limited balls in a specific phase, the metric pulls toward a season-wide prior computed from their overall strike rate and average (z-scored against the league). This prevents a batter who faced 8 balls in the powerplay from getting an extreme IQ score based on a tiny sample.
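The three components and the shrinkage step can be sketched as follows. The equal blend weight (`w_af`), the prior strength (`k`), and the add-alpha form of the smoothing are assumptions chosen for illustration, and the prior is passed in as a number rather than derived from z-scored strike rate and average.

```python
from dataclasses import dataclass


@dataclass
class PhaseCounts:
    legal_balls: int   # legal balls faced in the phase
    attempts: int      # balls labelled attempt_attacking
    boundaries: int    # attempts that went for four or six


def attack_frequency(c: PhaseCounts) -> float:
    return c.attempts / c.legal_balls if c.legal_balls else 0.0


def boundary_conversion(c: PhaseCounts, alpha: float = 1.0) -> float:
    # Laplace (add-alpha) smoothing keeps the ratio defined when attempts == 0.
    return (c.boundaries + alpha) / (c.attempts + 2 * alpha)


def intent_rating(c: PhaseCounts, w_af: float = 0.5) -> float:
    # Assumed blend: a convex combination of AF and BC.
    return w_af * attack_frequency(c) + (1 - w_af) * boundary_conversion(c)


def shrink(raw: float, n: int, prior: float, k: float = 30.0) -> float:
    # Weighted average: small n leans on the prior, large n on the raw value.
    return (n * raw + k * prior) / (n + k)
```

For a batter with 8 powerplay balls, 6 attempts, and 3 boundaries, the raw rating under these assumptions is 0.625. With a prior of 0.4 and `k = 30`, `shrink` pulls it to about 0.45, so the small sample no longer produces an extreme score.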

The phase stratification turned out to be more revealing than I expected. Team-level powerplay IQ tells a completely different story than death-overs IQ. Some teams are aggressive early and conservative late; others are the opposite. The scatter plots of AF versus BC create natural quadrants: aggressive-and-successful (top right), aggressive-but-inefficient (bottom right), conservative-but-effective (top left), conservative-and-ineffective (bottom left). These quadrant positions are more stable across seasons than raw averages.
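The quadrant reading of the AF-versus-BC scatter can be written down directly. This sketch assumes AF on the x-axis, BC on the y-axis, and cut lines at the group medians; the actual plots may place the dividing lines differently.

```python
from statistics import median
from typing import Dict, Tuple


def quadrant(af: float, bc: float, af_cut: float, bc_cut: float) -> str:
    """Name the quadrant for a point with AF on x and BC on y."""
    aggressive = af >= af_cut
    effective = bc >= bc_cut
    if aggressive and effective:
        return "aggressive-and-successful"
    if aggressive:
        return "aggressive-but-inefficient"
    if effective:
        return "conservative-but-effective"
    return "conservative-and-ineffective"


def label_players(points: Dict[str, Tuple[float, float]]) -> Dict[str, str]:
    """Split players at the median AF and median BC of the group (assumed cut)."""
    af_cut = median(af for af, _ in points.values())
    bc_cut = median(bc for _, bc in points.values())
    return {name: quadrant(af, bc, af_cut, bc_cut) for name, (af, bc) in points.items()}
```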

I generated 13 publication-ready figures for the 2024 IPL season: team IQ bars and player AF-vs-BC scatters for each phase, plus aggregate views. The Statistics notebook also includes K-means clustering on batting profiles (strike rate, boundary percentage, dot ball percentage) which produces interpretable player archetypes.

What's next: Scaling the NLP model to the full 2007-2024 dataset, serializing the trained model, and restructuring the repository for long-term maintainability.

Changelog

Added

  • Add regex-based intent labeling with 40+ attack patterns and defensive filtering
  • Add NLP baseline: TF-IDF (word + char n-grams) + LinearSVC with calibrated probabilities
  • Add Intent Quotient metric: Attack Frequency, Boundary Conversion, Intent Rating
  • Add Bayesian shrinkage layer using season-wide player priors (z-scored SR and average)
  • Add phase-stratified IQ computation (powerplay, middle overs, death overs)
  • Add 13 publication-ready figures: team IQ bars + player AF-vs-BC scatters per phase
  • Add per-innings IQ scores export (IQ_Innings.csv)
  • Add player-season level intent metrics export (IQ_Player_Season.csv)
  • Add K-means clustering for batting profile archetypes in Statistics notebook
  • Add EDA notebook with batter/bowler feature engineering
  • Add first-innings specific analysis notebook
intent-quotient, bayesian-shrinkage, attack-frequency, boundary-conversion, tfidf, svm, logistic-regression, regex-labeling, nlp, phase-analysis, powerplay, death-overs, matplotlib, seaborn, plotly, cricket-analytics, clustering
Report #1 of 3
Intent Quotient Cricket: Data Pipeline & Commentary Integration
Oct 7 - Oct 10, 2025

Devlog

I spent the first four days building the data pipeline that the Intent Quotient metric needs: ball-by-ball delivery data joined with broadcast commentary text. The commentary is the key ingredient because it describes what the batter attempted, not just what happened. A dot ball where the batter charged down the pitch and missed is fundamentally different from a dot ball where the batter blocked defensively, and that distinction only exists in the commentary.

The delivery data comes from Cricsheet's CSV2 format. I wrote a unified parser that handles IPL, T20I, BBL, PSL, and The Hundred, normalizing the varying column layouts into a consistent schema with match key, season, date, venue, batting/bowling teams, innings, over, ball sequence, batter, bowler, runs, extras, wickets, and phase labels. The parser fetches zip archives directly from Cricsheet's download URLs and extracts both the per-delivery CSVs and the metadata/player registry files.
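One piece of that normalization, the over, ball sequence, and phase labels, might look like the sketch below. It assumes the ball position arrives as an `over.ball` string with a 0-based over (for example `0.1`) and uses the common 1-6 / 7-15 / 16-20 split for the phases. Both are assumptions, and `BallRef`, `parse_ball`, and `phase_for_over` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BallRef:
    over: int      # 1-based over number (1..20)
    ball_seq: int  # position within the over as written in the source
    phase: str     # "powerplay", "middle", or "death"


def phase_for_over(over: int) -> str:
    """Map a 1-based over number to a phase (assumed boundaries)."""
    if not 1 <= over <= 20:
        raise ValueError(f"over out of range for T20: {over}")
    if over <= 6:
        return "powerplay"
    if over <= 15:
        return "middle"
    return "death"


def parse_ball(ball: str) -> BallRef:
    """Parse an 'over.ball' string such as '0.1' (assumed 0-based over)."""
    over_str, seq_str = ball.split(".")
    over = int(over_str) + 1
    return BallRef(over=over, ball_seq=int(seq_str), phase=phase_for_over(over))
```

Reading the column as a string rather than a float avoids collapsing a tenth delivery (`0.10`) into the first (`0.1`). Rain-shortened matches with a reduced powerplay are not handled here.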

The commentary pipeline was the harder build. ESPNcricinfo does not have a public API for ball-by-ball commentary, so I built scrapers: first a series catalog generator that crawls the fixtures and results pages for each IPL season to extract match IDs and series IDs, then a batch Selenium/Firefox scraper that fetches commentary HTML for each match. I went through several iterations (single match, first innings only, batch with retry logic) before landing on a version that handles Cricinfo's rate limiting and 403 blocks gracefully.
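The retry behaviour can be sketched independently of the browser driver. In this minimal example the `fetch` callable stands in for whatever loads a page (Selenium in the actual scraper), and `BlockedError`, the attempt count, and the delays are all hypothetical.

```python
import time
from typing import Callable


class BlockedError(Exception):
    """Raised by the fetch callable when the site answers with a 403 or rate limit."""


def fetch_with_retry(
    fetch: Callable[[str], str],
    url: str,
    max_attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url), backing off exponentially after each BlockedError."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except BlockedError:
            if attempt == max_attempts:
                raise
            # Waits base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the function testable without real waiting, and a batch scraper can log the final `BlockedError` and move on to the next match instead of aborting the whole run.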

The join logic connects commentary text to deliveries using an event-indexed approach: matching on innings, over, and ball sequence with fuzzy fallbacks for OCR-style mismatches. Join coverage tracking flags low-quality matches where the commentary structure does not align well with the Cricsheet delivery data.
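A minimal sketch of an event-indexed join with a fallback and a coverage figure, assuming each record is a dict with `innings`, `over`, and `ball_seq` keys. The neighbouring-ball fallback and the `join_quality` field are illustrative choices, not a description of the actual fuzzy rules.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[int, int, int]  # (innings, over, ball_seq)


def build_index(commentary: List[dict]) -> Dict[Key, str]:
    """Index commentary text by (innings, over, ball_seq)."""
    return {(c["innings"], c["over"], c["ball_seq"]): c["text"] for c in commentary}


def match_text(index: Dict[Key, str], key: Key, max_shift: int = 1) -> Tuple[Optional[str], str]:
    """Return (text, how) where how is 'exact', 'fuzzy', or 'missing'."""
    if key in index:
        return index[key], "exact"
    innings, over, ball_seq = key
    # Fallback (assumption): try neighbouring ball numbers in the same over.
    for shift in range(1, max_shift + 1):
        for candidate in (ball_seq - shift, ball_seq + shift):
            text = index.get((innings, over, candidate))
            if text is not None:
                return text, "fuzzy"
    return None, "missing"


def join_innings(deliveries: List[dict], index: Dict[Key, str]) -> Tuple[List[dict], float]:
    """Attach commentary to deliveries and report the share that matched."""
    joined = []
    matched = 0
    for d in deliveries:
        text, how = match_text(index, (d["innings"], d["over"], d["ball_seq"]))
        matched += text is not None
        joined.append({**d, "commentary": text, "join_quality": how})
    coverage = matched / len(joined) if joined else 0.0
    return joined, coverage
```

This simple fallback can attach the same commentary line to two deliveries, so a low exact-match share is as much a warning sign as low coverage. Keeping `join_quality` on every row makes it possible to filter fuzzy matches out later.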

By October 10, I had the complete IPL dataset: 2007 through 2024, roughly 750 matches with both deliveries and commentary text linked at the ball level.

What's next: Building the labeling pipeline to classify which deliveries represent attacking attempts, and designing the IQ metric that will quantify batting intent from those labels.

Changelog

Added

  • Add Cricsheet CSV2 parser for 5 T20 competitions (IPL, T20I, BBL, PSL, Hundred)
  • Add ESPNcricinfo series catalog generator (IPL seasons 2008-2024)
  • Add Selenium/Firefox batch commentary scraper with rate limiting and retry logic
  • Add commentary-to-delivery join pipeline with event-indexed matching
  • Add join coverage tracking for quality assessment per innings
  • Add match index builder for IPL fixtures discovery
  • Add complete IPL dataset: 2007-2024 deliveries + commentary (750+ matches)
  • Add multi-format data exports (CSV, parquet)
cricsheet, espncricinfo, selenium, playwright, web-scraping, data-pipeline, commentary, ball-by-ball, parquet, csv, ipl, t20, python, pandas