DataSkrive Cohort
In progress: cohort audit and anomaly detection for a sports betting content platform
What It Is
A database reverse-engineering and cohort analytics project for DataSkrive, an NLP-driven sports betting content automation platform. The project has two sequential workstreams:
- Cohort Audit: Reverse-engineer DataSkrive's partially documented GCP/BigQuery database to understand the user cohort classification system, audit it against behavioral data, and propose reclassification improvements.
- Anomaly Detection: Build a multi-tier anomaly detection pipeline instrumenting key KPI signals with automated alerts and dashboarding.
All analysis runs against local CSV/JSON preview exports from BigQuery rather than live database connections.
Architecture
The underlying DataSkrive system follows a medallion pattern (Bronze, Silver, Gold):
- Bronze Layer: 10 raw ingestion tables via PySpark/Dataproc from AWS S3 (FanDuel snapshots) and BigQuery (real-time moment feeds). Covers user accounts, push notification dispatches, betting transactions, and notification metadata.
- Silver Layer: Transformed event tables with unified send-to-open-to-bet funnels and multi-tier attribution (30-minute, 2-hour, and 24-hour confidence windows; a sketch of the tiering logic follows this list).
- Gold Layer: Star schema reporting views with dimensions (user, moment, scenario, campaign, segment) and facts (daily KPIs, cohort performance, scenario summaries, user funnels).
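To make the attribution tiers concrete, here is a minimal pandas sketch of how an open-to-bet join with tiered confidence windows could work. The real silver-layer transform runs in PySpark, and every table, column, and tier name below is an illustrative assumption rather than the production schema.

```python
import pandas as pd

# Hypothetical open and bet events; the production data lives in the silver layer.
opens = pd.DataFrame({
    "user_id": [101, 101, 202],
    "open_ts": pd.to_datetime(["2026-01-05 18:02", "2026-01-06 19:30", "2026-01-05 20:00"]),
})
bets = pd.DataFrame({
    "user_id": [101, 202],
    "bet_ts": pd.to_datetime(["2026-01-05 18:20", "2026-01-06 10:00"]),
    "stake": [25.0, 10.0],
})

def attribution_tier(delta: pd.Timedelta) -> str:
    """Map open-to-bet latency onto the three confidence windows."""
    if delta <= pd.Timedelta(minutes=30):
        return "high_30min"
    if delta <= pd.Timedelta(hours=2):
        return "medium_2h"
    if delta <= pd.Timedelta(hours=24):
        return "low_24h"
    return "unattributed"

# For each bet, find the most recent prior open by the same user, then tier it.
merged = pd.merge_asof(
    bets.sort_values("bet_ts"),
    opens.sort_values("open_ts"),
    left_on="bet_ts", right_on="open_ts",
    by="user_id", direction="backward",
)
merged["tier"] = (merged["bet_ts"] - merged["open_ts"]).apply(attribution_tier)
print(merged[["user_id", "bet_ts", "open_ts", "tier"]])
```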
The project identified two coexisting cohort concepts: experiment groups (static A/B test assignment) and targeting cohorts (dynamic content-variant buckets assigned per notification).
Data Sources
- AWS S3 (FanDuel): User account snapshots, betting transactions (1.8B rows), experiment group assignments.
- Braze API: Push notification dispatches (340M sends) and open events (1.4M opens).
- BigQuery (Proven Verve): Real-time notification metadata, designed cohort stores.
- Local Previews: 30 CSV files sampled from production across bronze, silver, and gold layers.
Current State
Phases 1 through 4 are complete: schema archaeology, cohort discovery, cohort audit with four professional HTML reports, and strategic ML architecture analysis. Phase 5 (anomaly detection) is in progress with Tier 1 (Holt-Winters) delivered and Tiers 2-3 (Prophet, LSTM) pending. Target delivery: May 15, 2026.
Devlog
After a preparation period where the team studied time series methods (the learning folder from March 25 captures that ramp-up), I built and delivered Tier 1 of the anomaly detection pipeline using Holt-Winters triple exponential smoothing.
The approach was adapted from a Bing Ads infrastructure monitoring pattern. The algorithm uses additive seasonal decomposition with a 7-day period (day-of-week seasonality is the dominant pattern in sports betting engagement). Parameter tuning is done via grid search over alpha, beta, and gamma, selecting the combination that minimizes the mean absolute error on a bootstrap training window. Anomaly scoring uses a 3-point moving average of residuals compared against percentile-based bounds, with periodic re-tuning to handle distributional drift.
I built two variants. The first operates on daily cohort-level aggregates (153 days of Cohort 2 data from October 2025 through March 2026), monitoring 7 KPIs: sends, opens, bets, conversion rate, open rate, revenue per send, and attribution rate. The second variant operates at the user level, scanning individual user time series with relaxed parameters to handle sparser data.
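As a rough sketch of that shared core loop (not the delivered hw_anomaly_detector.py; the parameter grid, smoothing window, and percentile thresholds here are illustrative assumptions, and the periodic re-tuning for drift is omitted), the cohort-level variant can be expressed with statsmodels like this:

```python
import numpy as np
import pandas as pd
from itertools import product
from statsmodels.tsa.holtwinters import ExponentialSmoothing

def fit_holt_winters(train: pd.Series) -> ExponentialSmoothing:
    """Grid-search alpha/beta/gamma, keeping the fit with the lowest in-sample MAE."""
    best_fit, best_mae = None, np.inf
    grid = np.arange(0.1, 1.0, 0.2)
    for alpha, beta, gamma in product(grid, grid, grid):
        fit = ExponentialSmoothing(
            train, trend="add", seasonal="add", seasonal_periods=7
        ).fit(smoothing_level=alpha, smoothing_trend=beta, smoothing_seasonal=gamma)
        mae = np.mean(np.abs(train - fit.fittedvalues))
        if mae < best_mae:
            best_fit, best_mae = fit, mae
    return best_fit

def flag_anomalies(series: pd.Series, lower_pct=2.5, upper_pct=97.5) -> pd.Series:
    """Smooth residuals with a 3-point moving average and flag percentile breaches."""
    fit = fit_holt_winters(series)
    residuals = series - fit.fittedvalues
    smoothed = residuals.rolling(window=3, center=True).mean()
    lo, hi = np.nanpercentile(smoothed, [lower_pct, upper_pct])
    return (smoothed < lo) | (smoothed > hi)

# Usage sketch: a daily conversion-rate series indexed by date (illustrative data shape).
# kpi = daily_kpis.set_index("date")["conversion_rate"]
# anomalous_days = kpi.index[flag_anomalies(kpi)]
```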
The results validated all four anomaly signals I had flagged during the strategic analysis phase. The conversion rate collapse is real and persistent, not a seasonal artifact. The send volume volatility correlates with operational gaps (zero-send days during active betting periods). The population drift in Cohort 1 is not gradual rebalancing but shows discontinuous jumps that suggest upstream assignment logic changes.
Both scripts generate standalone HTML reports with per-KPI anomaly charts, flagged dates, and summary dashboards. The reports are large (the user-level report is nearly 10 MB) because they embed all the chart images for offline review. I also created interactive Jupyter notebooks for drilling into specific anomaly periods without regenerating the full reports.
The 48-cohort weighting system documentation was finalized alongside this work, capturing the full weight matrices and semantic descriptions for all defined cohort profiles.
What's next: Tier 2 (Prophet for trend change detection and automatic holiday seasonality) and Tier 3 (LSTM autoencoder if time permits), both targeting the May 15 delivery deadline. BigQuery integration and Power BI dashboarding will follow once the detection algorithms are validated.
Changelog
Added
- Add hw_anomaly_detector.py: Holt-Winters daily cohort KPI anomaly detection
- Add hw_user_anomaly_detector.py: Holt-Winters user-level anomaly detection
- Add HTML report: hw-anomaly-detection (Tier 1 cohort-level findings, 2.3 MB)
- Add HTML report: hw-user-anomaly-detection (user-level anomaly profiles, 9.9 MB)
- Add Jupyter notebook: hw_interactive_review.ipynb (cohort-level drill-down)
- Add Jupyter notebook: hw_user_interactive_review.ipynb (user-level drill-down)
- Add requirements-anomaly.txt for anomaly detection dependencies
- Add Holt-Winters summary document for team learning reference
- Add learning folder with reference materials for team onboarding
- Add 48-cohort weighting system full documentation in schema_map.py
- Add .gitignore for __pycache__, RAG runtime, IDE, and OS files
Changed
- Update schema_map.py with finalized cohort weight matrices and semantic descriptions
Devlog
With the discovery and strategic analysis phases complete, I shifted to infrastructure that would make the accumulated knowledge usable by the full team and set up the project management layer for the anomaly detection workstream.
The RAG pipeline was the bigger build. The project had generated a substantial body of documentation in three days: 75KB of schema inventory, 5 HTML reports, 6 analysis scripts, query results, and architectural notes. Rather than expecting team members to read everything, I built a local Q&A system using LangGraph with Ollama embeddings and a FAISS vector index. The pipeline has six nodes: ingestion (document loading and chunking), retrieval (similarity search against the FAISS index), context filtering (relevance scoring), summarization, generation (answer synthesis via Qwen3), and output formatting. It runs entirely locally through a Streamlit web interface, so no project data leaves the machine. Two model options are available depending on how much depth is needed: Qwen3 14B for fast answers and Qwen3 32B for deeper reasoning.
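A compressed sketch of how the retrieval and generation nodes can be wired with LangGraph is below. The delivered pipeline has all six nodes plus the Streamlit front end and conversation persistence; the embedding model, package imports, prompt, and model tags here are assumptions, not the exact configuration.

```python
from typing import TypedDict
from langchain_community.vectorstores import FAISS
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langgraph.graph import StateGraph, END

class QAState(TypedDict):
    question: str
    context: str
    answer: str

embeddings = OllamaEmbeddings(model="nomic-embed-text")        # assumed embedding model
index = FAISS.load_local("faiss_index", embeddings,
                         allow_dangerous_deserialization=True)  # prebuilt over project docs
llm = ChatOllama(model="qwen3:14b")                             # or qwen3:32b for deeper reasoning

def retrieve(state: QAState) -> dict:
    docs = index.similarity_search(state["question"], k=5)
    return {"context": "\n\n".join(d.page_content for d in docs)}

def generate(state: QAState) -> dict:
    prompt = (f"Answer using only this project documentation:\n{state['context']}\n\n"
              f"Question: {state['question']}")
    return {"answer": llm.invoke(prompt).content}

graph = StateGraph(QAState)
graph.add_node("retrieve", retrieve)
graph.add_node("generate", generate)
graph.set_entry_point("retrieve")
graph.add_edge("retrieve", "generate")
graph.add_edge("generate", END)
app = graph.compile()

# result = app.invoke({"question": "what does targeting_cohort_id actually control?"})
```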
The practical value is that anyone on the team can ask questions like "what does targeting_cohort_id actually control?" or "which cohorts have normalization issues?" and get sourced answers without digging through the schema map or reading all five reports. The RAG pipeline also supports multi-turn conversations with state persistence, so complex investigative queries can build on prior context.
On the project management side, I created a high-level plan and detailed task breakdown for the anomaly detection workstream (Workstream 2), exported as both HTML reports and a Trello-importable CSV. The three-tier roadmap (Holt-Winters, Prophet, LSTM) with the May 15 delivery deadline is now documented and trackable.
I also wrote the comprehensive README that frames the project for anyone picking it up cold: folder structure, table export references, workstream definitions, hard constraints, and reasoning standards. The repository is now self-documenting enough that a new team member can orient themselves without a walkthrough.
What's next: After a preparation period (team learning on time series methods), implementing Tier 1 anomaly detection using Holt-Winters exponential smoothing on the daily cohort KPIs.
Changelog
Added
- Add RAG pipeline: LangGraph + FAISS + Ollama (Qwen3 14B/32B) with Streamlit web UI
- Add vector index over all project documentation (schema map, reports, scripts, architecture notes)
- Add multi-turn conversation support with SQLite state persistence
- Add HTML report: rag-pipeline-architecture-guide
- Add HTML report: anomaly-detection-project-plan-highlevel
- Add HTML report: anomaly-detection-project-plan-tasks (detailed task breakdown)
- Add Trello-importable task CSV for anomaly detection workstream
- Add comprehensive README.md with folder structure, workstream definitions, and reasoning standards
- Add requirements-rag.txt for local RAG deployment dependencies
Devlog
With the cohort system mapped and audited, I dug into the ML architecture and the time series data to understand what the existing models actually say about cohort value, and whether the anomaly patterns I noticed yesterday hold up under scrutiny.
The ML architecture review confirmed what the audit suggested: cohort assignment has near-zero predictive importance. The existing BQML Boosted Tree v1 (AUC 0.885, trained on 194M TEST-only rows with 53 features) does not use cohort one-hot features at all. Behavioral segment dominates. The T-Learner uplift pipeline (329M rows across TEST and CONTROL) is through Phase 7 of 8, with productionization still pending. The strategic implication is clear: the cohort system as currently implemented is a content-variant selector, not a meaningful user segmentation layer. Investing in better cohort assignment will not move the needle unless the content variants themselves become more differentiated.
I also found a weight normalization bug in the cohort weighting system. Active cohorts have un-normalized weight sums: Cohort 1 sums to 3.29 (expected 1.0), Cohort 17 to 2.32, Cohort 2 to 1.93. If the downstream scoring logic is not compensating for this, active cohorts produce inflated relevance scores that make cross-cohort performance comparison unreliable.
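A minimal check for this, assuming the weight matrices are exposed as a plain dict in schema_map.py (the constant name and key layout are hypothetical simplifications of the real module):

```python
# Hypothetical shape: COHORT_WEIGHTS = {cohort_id: {feature_name: weight, ...}, ...}
from schema_map import COHORT_WEIGHTS

def check_normalization(tolerance: float = 0.01) -> None:
    """Report any cohort whose weights do not sum to 1.0."""
    for cohort_id, weights in sorted(COHORT_WEIGHTS.items()):
        total = sum(weights.values())
        if abs(total - 1.0) > tolerance:
            print(f"Cohort {cohort_id}: weight sum {total:.2f} (expected 1.00)")

def normalized(weights: dict) -> dict:
    """Rescale a cohort's weights so they sum to 1.0 before scoring."""
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}

check_normalization()   # expected to flag Cohorts 1, 17, and 2 per the audit
```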
The anomaly signal investigation validated four distinct patterns in the data. The system-wide conversion rate collapsed from 0.25 to 0.08 over 5 months. NBA halftime notifications went from a 0.22 CVR to 0.02 (destroying roughly $267K in value). Send volume shows extreme day-of-week volatility (coefficient of variation 1.19) with 13 complete zero-send days during active betting periods. And Cohort 1's population swings 85x within single months (from 12K to 1.06M), which means any per-cohort metric computed on that population is not comparable across time.
I wrote six analysis scripts to generate the supporting evidence and charts for these findings. Each script loads data through the shared loader utility and carries a STATUS annotation (VALIDATED or DRAFT) so the team can distinguish confirmed findings from exploratory work.
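The script convention is small enough to sketch in full; the constant name, loader function, and file and column names below are all assumptions based on how the scripts are described, not copies of the real code.

```python
# Skeleton of an analysis script (illustrative; names are assumed).
from utils.loader import load_preview   # hypothetical helper in utils/loader.py

STATUS = "VALIDATED"   # or "DRAFT" while a finding is still exploratory

def main() -> None:
    print(f"[{STATUS}] anomaly_kpi_signals")
    daily = load_preview("gold_daily_kpis.csv")   # assumed preview export name
    cvr = daily["bets"] / daily["sends"]          # assumed KPI columns
    print(cvr.describe())

if __name__ == "__main__":
    main()
```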
What's next: Building a RAG pipeline to make all this accumulated documentation queryable, and setting up the project management infrastructure for the anomaly detection workstream.
Changelog
Added
- Add HTML report: ml-architecture-strategic-analysis (BQML review, uplift pipeline status, 5 dashboard opportunities)
- Add analysis script: anomaly_kpi_signals.py (CVR collapse signal validation)
- Add analysis script: anomaly_cohort_population_drift.py (85x Cohort 1 population range)
- Add analysis script: scenario_performance_analysis.py (NBA halftime value destruction)
- Add analysis script: cohort1_weekday_lift.py (29 consecutive negative lift days)
- Add analysis script: cross_cohort_seasonal.py (day-of-week bifurcation)
- Add analysis script: anomaly_charts.py (4 anomaly signal PNG generation)
- Add 48-cohort weighting system documentation with weight matrices and semantic descriptions
- Add STATUS annotations (VALIDATED/DRAFT) to all analysis scripts
Fixed
- Fix cohort weight normalization documentation (Cohort 1 sum 3.29 vs expected 1.0 flagged)
Devlog
I kicked off the DataSkrive project today and went from zero context to a fully documented cohort system in a single day. The client runs an NLP-driven sports betting content automation platform on GCP/BigQuery, and the database was only partially documented. My job was to reverse-engineer the cohort classification system, audit it against observed behavioral data, and propose improvements.
I started by running 8 BigQuery queries against the preview data exports to map out the medallion architecture (Bronze, Silver, Gold layers). The schema archaeology surfaced a living inventory of 30+ tables and 250+ columns, which I captured in a Python dictionary file that serves as the canonical reference. The most important discovery was that the system has two distinct, coexisting cohort concepts that had been conflated in the existing documentation: an experiment group layer (static A/B test assignment: TEST/CONTROL/UNKNOWN) and a targeting cohort layer (dynamic content-variant buckets assigned per notification). The system defines 48 cohort profiles, but only 5 are active in production reporting.
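The dictionary file is schema_map.py; an entry might look roughly like the excerpt below, where the field layout and confidence vocabulary are assumptions rather than the real structure.

```python
# schema_map.py (illustrative excerpt; the real file covers 30+ tables and 250+ columns)
SCHEMA_MAP = {
    "bronze.push_dispatches": {
        "source": "Braze API",
        "layer": "bronze",
        "columns": {
            "user_id": {"type": "STRING", "confidence": "confirmed"},
            "targeting_cohort_id": {"type": "INT64", "confidence": "inferred"},
            "sent_at": {"type": "TIMESTAMP", "confidence": "confirmed"},
        },
        "notes": "Dynamic content-variant bucket assigned per notification.",
    },
}
```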
The audit uncovered seven findings, with the most critical being that the DS User Store's cohort field is 100% NULL across all 122 million rows. That is the table that was supposed to hold the stable cohort assignment. Other findings included 12.8% of sent user-days missing their targeting cohort ID, undocumented cohorts appearing in dispatch logs, and Cohort 2 representing 83.5% of the total population (making any A/B comparison against it statistically questionable).
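The headline numbers are straightforward to re-check against the preview exports; a minimal pandas version (file and column names assumed; the production checks ran as BigQuery queries) might look like:

```python
import pandas as pd

# Assumed preview file and column names, standing in for the BigQuery tables.
user_store = pd.read_csv("previews/bronze_ds_user_store.csv")
null_rate = user_store["cohort"].isna().mean()
print(f"DS User Store cohort NULL rate: {null_rate:.1%}")   # audit found 100%

sent_days = pd.read_csv("previews/silver_sent_user_days.csv")
missing_cohort = sent_days["targeting_cohort_id"].isna().mean()
print(f"Sent user-days missing targeting cohort: {missing_cohort:.1%}")   # audit found 12.8%

population = sent_days.groupby("targeting_cohort_id")["user_id"].nunique()
print(population / population.sum())   # Cohort 2 was 83.5% of the total population
```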
The value analysis quantified $857M in total attributed revenue across all cohorts and showed that behavioral segment is 114 times more predictive of conversion than cohort assignment. Scenario selection (which notification template fires) showed a 183x conversion rate spread, making it the largest optimization lever available.
I closed the day with four reclassification proposals: activating the empty user store with behavioral tier assignments, rebalancing the cohort population, adding sport-affinity segmentation, and introducing temporal-context cohorts.
What's next: Deeper ML architecture review to understand the existing BQML models and uplift pipeline, and validating the anomaly signals I spotted in the time series data.
Changelog
Added
- Add schema_map.py: living inventory of 30+ tables, 250+ columns with confidence levels
- Add 8 BigQuery query result exports covering user scope, sends, opens, bets, and cohort assignments
- Add HTML report: cohort-id-meaning (two-layer cohort architecture documented)
- Add HTML report: cohort-system-audit (7 findings, 3 critical/high severity)
- Add HTML report: cohort-value-analysis ($857M attributed revenue, 114x segment leverage)
- Add HTML report: reclassification-proposals (4 concrete proposals with SQL sketches)
- Add reclassification_v1.md with 5-tier behavioral assignment proposal
- Add data loader utility (utils/loader.py) for standardized CSV access