
Medicare Provider Cost Analysis

in-progress

Predicting Medicare allowed amounts using provider-level features across 103M records

4 progress reports
ETL · Research · Engineering · Modeling · Launch
Python 3.x · pandas · NumPy · PyArrow · scikit-learn · XGBoost · RAPIDS (cuML) · MLflow
Overview

What It Is

An end-to-end data pipeline and machine learning system designed to predict Medicare allowed amounts per service for healthcare providers. It uses CMS Physician & Practitioners data spanning 2013 to 2023 (approximately 103 million rows) with a medallion architecture for data processing and regional batch training for model scalability.

The project supports two stages: predicting allowed amounts from provider and service features (current), and predicting patient out-of-pocket costs using Medicare Current Beneficiary Survey data (planned).


Architecture

The system follows a Bronze-Silver-Gold medallion pattern with dual execution modes:

  • Production (Databricks): PySpark-based ingestion from S3 into Delta Lake tables, with SQL-based cleaning and feature engineering.
  • Local Development (pandas/PyArrow): State-partitioned parquet files processed through the same logical layers, enabling development on a single machine with GPU acceleration.

Data flows from CMS API downloads through CSV partitioning (by state and provider type), parquet conversion, bronze ingestion, silver cleaning (type casting, IQR outlier removal), and gold feature engineering (10 features including clinical bucketing and HCC risk scores).
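
The silver-layer cleaning is essentially a per-column IQR filter on the target. A minimal sketch is below; the target column name matches the CMS field used later in the project, but the 1.5x multiplier is an assumption rather than the pipeline's actual setting.

```python
import pandas as pd

def remove_iqr_outliers(df: pd.DataFrame, col: str = "Avg_Mdcr_Alowd_Amt",
                        k: float = 1.5) -> pd.DataFrame:
    """Drop rows whose target falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df[col].between(q1 - k * iqr, q3 + k * iqr)]
```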

Model training uses a regional batch strategy: data is split into Census regions, with models trained incrementally across regions to manage memory. MLflow tracks all experiments.

Key Features

Data Sources

  • CMS Medicare Physician & Practitioners (by Provider & Service): 11 years (2013-2023), approximately 10M rows per year. Fetched via CMS Data API with batch pagination.
  • CMS Medicare Physician & Practitioners (by Provider Summary): Provider-level HCC risk scores joined on NPI. Direct CSV downloads, approximately 1M rows per year.

Current State

Phase 2 is complete. The full pipeline (API pull, partitioning, medallion layers, feature engineering, risk score integration, LSTM sequence preparation) is operational. Random Forest achieves the best test performance on the national dataset. The GLM (SGD with Huber loss) diverged and needs tuning. Phase 3 (LSTM time-series forecasting on year-indexed sequences) is in planning, with input data already prepared.

Progress Reports
Report #4 of 4
Medicare: LSTM Training, OOP Training, and Next.js Web App
Apr 8 - Apr 8, 2026

Devlog

Today I closed out the entire pipeline, from training the last two models to deploying a production web app. The session started with the LSTM and OOP models still untrained (the code was written on April 7 but the GPU wasn't available), and ended with a six-page Next.js app live on Vercel, backed by pre-computed predictions in Supabase.

The first challenge was getting PyTorch with CUDA working. The Windows Anaconda environment had torch CPU-only, and WSL Ubuntu had no torch at all despite having the 5070 Ti visible via nvidia-smi. I copied the repo into WSL, created a venv, installed PyTorch with the cu128 index, and ran both the sequence creation (05_lstm_sequences_local.py) and training (train_lstm_local.py) from there. The LSTM hit R-squared of 0.886 on the 2022-2023 validation set with MAE of $8.84, early stopping at epoch 19 out of 100. Two bugs surfaced in the visualization code: the label_encoders.json stores lists (not dicts), and parquet deserializes the years column as numpy arrays (not Python lists), breaking `.index()` calls. Both were straightforward fixes.
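
For reference, a minimal reproduction of both fixes; the encoder key and values here are hypothetical, not the real label_encoders.json contents.

```python
import json
import numpy as np

# label_encoders.json stores each encoder as an ordered list of classes,
# so the encoded id is the list position rather than a dict lookup.
with open("label_encoders.json") as f:
    encoders = json.load(f)
state_id = encoders["state"].index("CA")  # hypothetical key and value

# Parquet round-trips the years column as a numpy array, which has no .index();
# np.where returns the matching positions instead.
years = np.asarray([2013, 2014, 2015])
pos = int(np.where(years == 2015)[0][0])
```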

OOP quantile regression ran next in WSL with the same venv. The P50 model scored R-squared of 0.40 with exact 50% coverage, and the P90 achieved 90% coverage. These are on synthetic data, so the R-squared matters less than the calibration.

Phase 6 was the big pivot. The original plan was Streamlit, but I switched to Next.js with Material UI and Supabase. The key architectural decision was pre-computing all predictions rather than running models at request time. I aggregated the 103M gold rows into 32,818 Stage 1 lookup groups (by specialty, bucket, state, place of service), the 10.3M synthetic OOP rows into 23,424 Stage 2 groups, and uploaded the 62,703 LSTM forecast rows directly. Total database: about 119K rows across 7 tables, well within the Supabase free tier. The Cost Estimator page does two Supabase queries (Stage 1 + Stage 2) and displays Medicare allowed amounts alongside patient OOP ranges with P10/P50/P90 breakdowns.

The honest friction point was bulk data upload. The Supabase MCP tool couldn't handle 2000-row INSERT statements, and two background agents both failed to push the data through. The fix was simple: use the Python supabase client with the service role key directly. Took about 5 minutes once I had the right key.
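
A minimal sketch of that workaround, assuming hypothetical env var names; the service role key bypasses row-level security, so a script like this only ever runs server-side.

```python
import os
from supabase import create_client  # supabase-py

client = create_client(os.environ["SUPABASE_URL"],
                       os.environ["SUPABASE_SERVICE_ROLE_KEY"])

def bulk_insert(table: str, rows: list[dict], chunk_size: int = 1000) -> None:
    """Insert rows in chunks small enough for a single PostgREST request."""
    for start in range(0, len(rows), chunk_size):
        client.table(table).insert(rows[start:start + chunk_size]).execute()
```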

What's next: Connect the Vercel project to the GitHub repo for auto-deploy on push, then shift focus to other projects.

Changelog

Added

  • Add Next.js 16 web app in `web/` with Material UI, Recharts, and Supabase client
  • Add 6 pages: Dashboard, Cost Estimator, Forecast Explorer, Model Comparison, Data Explorer, About
  • Add Supabase schema with 7 tables: lookup_labels, stage1_allowed_amounts, stage2_oop_estimates, lstm_forecasts, state_summary, model_metrics, feature_importances
  • Add `web/scripts/upload_all.py` for pre-computing aggregations and uploading to Supabase
  • Add `.claude/launch.json` for dev server preview
  • Add synthetic data disclaimer on Cost Estimator OOP results

Changed

  • Update `.gitignore` with web/node_modules, web/.next, web/.env.local
  • Update PROGRESS.md with Phase 3 training results (R-squared=0.886), Phase 5 training results, and full Phase 6 milestones
  • Update Phase 3 status from "code complete" to complete (LSTM trained in WSL with CUDA)
  • Update Phase 5 status from "code complete" to complete (OOP quantile regression trained)

Fixed

  • Fix `train_lstm_local.py` label_encoders handling (list-based, not dict-based)
  • Fix `train_lstm_local.py` numpy array `.index()` call replaced with `np.where()`
  • Fix `.env` Windows carriage return causing Databricks MLflow auth failure in WSL
nextjs · material-ui · supabase · vercel · pytorch-cuda · lstm · xgboost · wsl
Report #3 of 4
Medicare: LSTM Forecasting, MCBS Integration, and Stage 2 OOP Model
Apr 4 - Apr 7, 2026

Devlog

Three phases landed in a single session today, covering the LSTM forecasting architecture, MCBS beneficiary data integration, and the Stage 2 out-of-pocket prediction model. The two-stage pipeline vision is now fully wired: Stage 1 predicts what Medicare allows for a service, Stage 2 predicts what the patient pays out of pocket.

Phase 3 was the LSTM time-series forecaster. The model takes year-ordered sequences of allowed amounts grouped by specialty, HCPCS bucket, and state, then forecasts 2024-2026 using autoregressive rollout with MC Dropout for confidence bounds. The architecture uses static embeddings for the three group keys concatenated with the target value at each timestep, feeding into a 2-layer LSTM with a linear head. Teacher forcing is used during training, with a temporal split (train on 2013-2021, validate on 2022-2023). Training is deferred until the GPU is free, since other models are running on the 5070 Ti.
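
A minimal PyTorch sketch of that architecture as described; vocabulary sizes, dimensions, and dropout here are illustrative, not the values in train_lstm_local.py.

```python
import torch
import torch.nn as nn

class AllowedAmountLSTM(nn.Module):
    """Static embeddings for the three group keys are repeated across timesteps,
    concatenated with the previous year's target value, then fed through a
    2-layer LSTM and a linear head (one-step-ahead prediction per timestep)."""

    def __init__(self, n_specialty: int, n_bucket: int, n_state: int,
                 emb_dim: int = 8, hidden: int = 64, dropout: float = 0.2):
        super().__init__()
        self.emb_specialty = nn.Embedding(n_specialty, emb_dim)
        self.emb_bucket = nn.Embedding(n_bucket, emb_dim)
        self.emb_state = nn.Embedding(n_state, emb_dim)
        self.lstm = nn.LSTM(3 * emb_dim + 1, hidden, num_layers=2,
                            batch_first=True, dropout=dropout)
        self.head = nn.Linear(hidden, 1)

    def forward(self, keys: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # keys: (batch, 3) integer ids; targets: (batch, T) teacher-forced inputs
        static = torch.cat([self.emb_specialty(keys[:, 0]),
                            self.emb_bucket(keys[:, 1]),
                            self.emb_state(keys[:, 2])], dim=-1)
        static = static.unsqueeze(1).expand(-1, targets.size(1), -1)
        x = torch.cat([static, targets.unsqueeze(-1)], dim=-1)
        out, _ = self.lstm(x)
        return self.head(out).squeeze(-1)
```

For the MC Dropout bounds, dropout stays active at forecast time and the autoregressive rollout is repeated to get a spread of trajectories.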

Phase 4, the MCBS integration, is where the real surprises hit. I built the download script, Bronze ingest, and Silver cleaning pipeline expecting to join survey demographics with cost supplement spending data at the beneficiary level. Three constraints in the actual MCBS Public Use File forced a redesign. First, the PUF has no Census region (suppressed for privacy). Second, the Survey File and Cost Supplement use completely different PUF_IDs with zero overlap, so they cannot be joined. Third, the Survey File ships as three seasonal CSV rounds per year with entirely different columns that need merging. The Cost Supplement turned out to be self-contained with its own demographics, which simplified things once I stopped trying to force the survey join.

The solution was a dual-track architecture. Track A works with the real PUF data at a national aggregate level, which is honest but limited. Track B generates synthetic per-service OOP records with Census region by sampling beneficiary demographics from real MCBS distributions and modulating OOP amounts based on dual eligibility, supplemental insurance, and chronic condition count. The synthetic data is clearly labeled, and anyone with access to the actual MCBS Limited Data Set (which costs $600 per module and requires a Data Use Agreement) can drop in their real data and run the same pipeline unchanged.
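
Roughly, the Track B generator works like the sketch below. The marginals, the 20% coinsurance baseline, and every multiplier are illustrative stand-ins, not the values in generate_synthetic_mcbs.py.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Illustrative marginals; the real ones are estimated from the MCBS PUF.
marginals = {
    "dual_eligible": [0.8, 0.2],
    "supplemental_ins": [0.4, 0.6],
    "chronic_count": [0.15, 0.25, 0.25, 0.20, 0.10, 0.05],
}

def synthesize_oop(service_rows: pd.DataFrame) -> pd.DataFrame:
    """Sample beneficiary traits from MCBS-style marginals, then scale a base
    OOP share of the allowed amount by those traits."""
    n = len(service_rows)
    dual = rng.choice([0, 1], size=n, p=marginals["dual_eligible"])
    supp = rng.choice([0, 1], size=n, p=marginals["supplemental_ins"])
    chronic = rng.choice(np.arange(6), size=n, p=marginals["chronic_count"])
    base_share = 0.20  # Part B coinsurance as a starting point
    multiplier = (1 - 0.9 * dual) * (1 - 0.5 * supp) * (1 + 0.05 * chronic)
    out = service_rows.copy()
    out["oop_amount"] = out["Avg_Mdcr_Alowd_Amt"] * base_share * multiplier
    return out
```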

Phase 5 trained the Stage 2 quantile regression on the synthetic data: three separate XGBoost boosters for P10, P50, and P90. The P50 model hit MAE of $9.78 with exact 50.0% coverage, and the P90 model achieved 90.0% coverage. The key design choice was making the Stage 1 target (allowed amount) the primary input feature for Stage 2, which is what connects the two stages into a coherent prediction pipeline.
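
A compact sketch of the three-booster setup, assuming the built-in quantile objective available in xgboost >= 2.0 and placeholder data; the hyperparameters are illustrative.

```python
import numpy as np
import xgboost as xgb  # the quantile objective requires xgboost >= 2.0

# Placeholder data; in the real pipeline the Stage 1 allowed amount is the
# primary input column for Stage 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))
y = np.abs(rng.normal(loc=20, scale=10, size=1000))
X_train, X_test, y_train, y_test = X[:800], X[800:], y[:800], y[800:]

quantile_models = {}
for alpha in (0.10, 0.50, 0.90):
    model = xgb.XGBRegressor(
        objective="reg:quantileerror",
        quantile_alpha=alpha,
        n_estimators=300,
        max_depth=6,
        learning_rate=0.1,
    )
    model.fit(X_train, y_train)
    quantile_models[alpha] = model

# Calibration check: roughly 90% of true values should fall at or below P90.
p90_coverage = float((y_test <= quantile_models[0.90].predict(X_test)).mean())
```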

What's next: LSTM training once the GPU is available, then Phase 6 (Streamlit portfolio app combining both stages plus LSTM forecast display with uncertainty bands).

Changelog

Added

  • Add train_lstm_local.py: PyTorch LSTM with static embeddings, temporal split, MC Dropout autoregressive forecast
  • Add pull_mcbs_data.py: downloads MCBS Survey File (2015-2022) and Cost Supplement (2018-2023) from CMS
  • Add 06_mcbs_bronze_local.py: handles CMS directory structure, merges 3 seasonal survey rounds on PUF_ID
  • Add 07_mcbs_silver_local.py: Cost Supplement cleaning with real CMS column names (PAMTOOP, CSP_AGE, etc.)
  • Add 08_mcbs_crosswalk_local.py with dual-mode flag (--mode puf for national, --mode synthetic for regional)
  • Add generate_synthetic_mcbs.py: produces 10.3M synthetic per-service OOP records with Census region
  • Add train_oop_local.py: Stage 2 XGBoost quantile regression (P10/P50/P90) on 12 features
  • Add synthetic_metadata.json with provenance and drop-in replacement instructions

Changed

  • Update compare_models_local.py with LSTM in Stage 1 table and separate Stage 2 OOP section
  • Update CLAUDE.md with LSTM commands, MCBS pipeline, dual-track architecture, torch dependency
  • Update PROGRESS.md with Phases 3-5 milestones and PUF constraints documentation
pytorch · lstm · mcbs · xgboost · quantile-regression · synthetic-data · mlflow · time-series · oop-prediction · medicare
Report #2 of 4
Medicare: National Scale Pipeline & Model Training
Mar 31 - Apr 4, 2026

Devlog

With the infrastructure in place, I spent the next five days running the full pipeline end-to-end and training models at national scale. The most consequential change happened before any model training: I swapped the target variable.

The original target was `Avg_Mdcr_Pymt_Amt` (what Medicare actually pays). But the project's two-stage vision is to first predict what Medicare allows and then, in Stage 2, predict patient out-of-pocket costs. The allowed amount is the right Stage 1 target because it represents Medicare's determination of the service value before provider-specific adjustments. This required updating the IQR outlier bounds in the silver layer, the feature engineering in gold, and all three training scripts.

The bigger lesson was catching data leakage. The initial feature set included `Avg_Mdcr_Pymt_Amt` and `Avg_Mdcr_Stdzd_Amt` as features, both of which are derived from the target. With those columns in, the models hit R-squared of 0.9996, which looked impressive but was meaningless. Removing the leaky features and their ratio derivatives dropped performance to realistic levels and forced me to engineer genuinely predictive features.

The final feature set has 10 columns: three label-encoded categoricals (provider type, state, HCPCS code), a coarse clinical bucket (mapping CPT code ranges to 6 categories like surgery, radiology, lab), a facility/office flag, HCC risk scores from a separate CMS provider-level dataset (joined on NPI and year, with median imputation for missing values), log-transformed service and beneficiary counts, submitted charge amount, and a services-per-beneficiary ratio.
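
A condensed sketch of that gold-layer feature set; the CMS column names follow the published data dictionary but should be treated as assumptions here, and the CPT bucket boundaries are illustrative.

```python
import numpy as np
import pandas as pd

def engineer_gold_features(df: pd.DataFrame) -> pd.DataFrame:
    """Build the 10-column feature frame described above (sketch only)."""
    out = pd.DataFrame(index=df.index)
    # Label-encoded categoricals (persisted to label_encoders.json in the real
    # pipeline so every script encodes consistently)
    for col in ["Rndrng_Prvdr_Type", "Rndrng_Prvdr_State_Abrvtn", "HCPCS_Cd"]:
        out[col + "_enc"] = df[col].astype("category").cat.codes
    # Coarse clinical bucket from numeric CPT ranges (6 illustrative bins)
    cpt = pd.to_numeric(df["HCPCS_Cd"].str[:5], errors="coerce")
    out["hcpcs_bucket"] = pd.cut(
        cpt, bins=[0, 10021, 69990, 79999, 89398, 99199, 99999], labels=False)
    out["place_of_srvc_flag"] = (df["Place_Of_Srvc"] == "F").astype(int)
    out["Bene_Avg_Risk_Scre"] = df["Bene_Avg_Risk_Scre"].fillna(
        df["Bene_Avg_Risk_Scre"].median())
    out["log_srvcs"] = np.log1p(df["Tot_Srvcs"])
    out["log_benes"] = np.log1p(df["Tot_Benes"])
    out["Avg_Sbmtd_Chrg"] = df["Avg_Sbmtd_Chrg"]
    out["srvcs_per_bene"] = df["Tot_Srvcs"] / df["Tot_Benes"].clip(lower=1)
    return out
```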

Training at 103 million rows required a regional batch approach. I split the data into five Census regions and trained incrementally: XGBoost continues its booster across regions (125 rounds each), Random Forest uses sklearn's warm_start (125 trees per region, 625 total). This keeps memory manageable while still training on the full national dataset. XGBoost auto-detects CUDA for GPU acceleration when available.
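
In sketch form, the incremental strategy looks like this; `load_region` is a hypothetical loader for the per-region gold parquets, and the device and hyperparameter choices are assumptions.

```python
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor

REGIONS = ["Northeast", "South", "Midwest", "West", "Territories"]

# XGBoost: keep extending a single booster, 125 rounds per region
params = {"objective": "reg:squarederror", "device": "cuda"}  # cuda if available
booster = None
for region in REGIONS:
    X_r, y_r = load_region(region)            # hypothetical per-region loader
    dtrain = xgb.DMatrix(X_r, label=y_r)
    booster = xgb.train(params, dtrain, num_boost_round=125, xgb_model=booster)

# Random Forest: warm_start grows 125 new trees per region (625 total)
rf = RandomForestRegressor(n_estimators=0, warm_start=True, n_jobs=-1)
for region in REGIONS:
    X_r, y_r = load_region(region)
    rf.n_estimators += 125
    rf.fit(X_r, y_r)
```

The point of warm_start is that each call grows new trees on the current region's data instead of re-fitting the whole forest, which is what keeps only one region in memory at a time.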

Random Forest came out as the best performer on the test set. The GLM diverged and needs hyperparameter work. I also prepared the LSTM sequence data: 23,672 provider-service-state groups with year-ordered target vectors, ready for Phase 3.

What's next: Phase 3 LSTM time-series forecasting using the prepared sequences, training on 2013-2021 and validating against 2022-2023, with projections forward to 2024-2026.

Changelog

Added

  • Add pull_provider_data.py for CMS "by Provider" dataset (HCC risk scores, 11 years)
  • Add 5 new features: hcpcs_bucket, place_of_srvc_flag, Bene_Avg_Risk_Scre, log_srvcs, log_benes
  • Add regional batch training by Census region (Northeast, South, Midwest, West, Territories)
  • Add CUDA auto-detection for XGBoost GPU acceleration
  • Add cuML auto-detection for Random Forest full mode
  • Add LSTM sequence preparation script (23,672 groups, 10,540 with complete 11-year coverage)
  • Add model comparison harness with paired t-tests and feature importance extraction
  • Add label_encoders.json for cross-script encoding consistency
  • Add PROGRESS.md documenting Phase 2 completion and results

Changed

  • Update target variable from Avg_Mdcr_Pymt_Amt (payment) to Avg_Mdcr_Alowd_Amt (allowed amount)
  • Update silver layer IQR outlier bounds for new target
  • Update gold layer to per-state parquets instead of monolithic file
  • Update all training scripts for spec-aligned 10-feature set

Removed

  • Remove leaky features: Avg_Mdcr_Pymt_Amt, Avg_Mdcr_Stdzd_Amt, pymt_to_charge_ratio, stdz_to_pymt_ratio
scikit-learn · xgboost · random-forest · rapids · cuml · cuda · mlflow · databricks · medallion-pipeline · feature-engineering · regional-batch-training · hcc-risk-scores · data-leakage · label-encoding · parquet · medicare · healthcare-analytics · gpu-acceleration
Report #1 of 4
Medicare: Data Infrastructure & Pipeline Foundation
Mar 30 - Mar 30, 2026

Devlog

I set up the full data infrastructure for the Medicare provider cost analysis project in a single day. This is an H515 class project, but the scope is real: predicting Medicare allowed amounts per service using CMS Physician & Practitioners data spanning 2013 to 2023, roughly 103 million rows at national scale.

The first task was building the data pull. The CMS Data API serves the "by Provider & Service" dataset through paginated REST endpoints, one UUID per year. I wrote a client that fetches in 5,000-row batches, handles rate limiting, and writes raw CSVs to disk. Then a partitioning script splits each year's CSV by state and provider type, which is important because the full national file is too large to process in memory on a single machine. During partitioning, I inject the year column by extracting it from the filename via regex. This turned out to be a critical decision: the year column is not in the raw CMS data, and without it the LSTM time-series work I have planned for Phase 3 would require re-processing everything.
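
A minimal sketch of the partitioning step with the year injection; the paths and CMS column names here are assumptions, and the real script handles more edge cases.

```python
import re
from pathlib import Path
import pandas as pd

def partition_year_file(csv_path: str, out_dir: str = "partitioned_data") -> None:
    """Inject the year parsed from the filename, then split by state and
    provider type, appending to per-partition CSVs chunk by chunk."""
    year = int(re.search(r"(20\d{2})", Path(csv_path).name).group(1))
    for chunk in pd.read_csv(csv_path, chunksize=500_000):
        chunk["year"] = year  # the year column is not in the raw CMS extract
        for (state, ptype), part in chunk.groupby(
                ["Rndrng_Prvdr_State_Abrvtn", "Rndrng_Prvdr_Type"]):
            dest = Path(out_dir) / str(year) / f"{state}_{ptype}.csv"
            dest.parent.mkdir(parents=True, exist_ok=True)
            part.to_csv(dest, mode="a", header=not dest.exists(), index=False)
```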

The pipeline follows a medallion architecture with dual execution modes. The Databricks path (PySpark + Delta Lake on S3) is the production target, and I stubbed out all four notebooks: bronze ingest, silver cleaning, gold feature engineering, and EDA. The local path (pandas + PyArrow) mirrors the same logic for development on my machine. Both paths share the same logical transformations so results stay consistent.

I also stubbed out three model training scripts (GLM, Random Forest, XGBoost) and a comparison harness, all wired to log experiments to MLflow on Databricks. The initial target variable was set to `Avg_Mdcr_Pymt_Amt` (Medicare payment amount), though I already suspected this might need to change once I dug deeper into the distinction between what Medicare allows versus what it pays.
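
The MLflow wiring in each stub amounts to something like the minimal sketch below; the experiment path, run name, and metric values are assumptions, not the project's actual configuration.

```python
import mlflow

# Log to the Databricks-hosted tracking server (requires Databricks credentials).
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Users/<workspace-user>/medicare-cost-analysis")  # hypothetical path

with mlflow.start_run(run_name="xgboost_stub"):
    mlflow.log_param("target", "Avg_Mdcr_Pymt_Amt")  # initial target, later swapped
    mlflow.log_metric("rmse", 0.0)                   # placeholder until training runs
```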

What's next: Running the local pipeline end-to-end (bronze through gold), engineering the feature set, and training the first models at national scale.

Changelog

Added

  • Add CMS Data API client for "by Provider & Service" dataset (2013-2023, 11 years)
  • Add partition script splitting raw CSVs by state and provider type with year column injection
  • Add CSV-to-parquet conversion script for columnar storage
  • Add Databricks notebook stubs: bronze ingest, silver clean, gold features, EDA
  • Add local pipeline equivalents for pandas/PyArrow development
  • Add model training stubs: GLM (SGDRegressor), Random Forest, XGBoost
  • Add model comparison harness with MLflow logging to Databricks workspace
  • Add project proposal document
  • Add README with pipeline architecture overview
  • Add .gitignore excluding data/, partitioned_data/, local_pipeline/
cms-api · parquet · pyarrow · databricks · delta-lake · pyspark · medallion-architecture · data-pipeline · mlflow · aws-s3 · python · medicare · healthcare-analytics