Agricultural Data Analysis
in-progressInteractive USDA analytics dashboard with commodity price forecasting
What It Is
An interactive web-based dashboard for exploring long-term U.S. agricultural trends using USDA QuickStats data, combined with commodity price forecasting models. The platform surfaces production, yield, price, land use, labor, and economic data at the state and county level, and provides probabilistic forward-looking price forecasts that highlight when supply/demand fundamentals diverge from current futures pricing.
Architecture
A three-tier system:
- Frontend (Next.js on Vercel): Six analytical dashboards (crops, animals, land, labor, economics, predictions) with interactive charts (Recharts), geographic visualizations (Deck.gl choropleth maps), and browser-side parquet file reading (hyparquet) for cost-effective data delivery.
- Backend (FastAPI on EC2): Serves commodity price forecasts via REST endpoints. PostgreSQL on AWS RDS stores futures prices, WASDE reports, ERS production costs, and forecast outputs.
- Data Pipeline (EC2 cron): Automated ingestion on schedule: monthly USDA QuickStats via API, daily CME futures and FRED dollar index, monthly WASDE CSV downloads, and annual ERS production cost updates. Data lands in both S3 (parquet for frontend) and RDS (for forecasting models).
Data Sources
- USDA QuickStats: All 50 states plus counties, 2001 to 2025. Monthly API pull via CMS QuickStats endpoint.
- CME Futures: Daily corn, soybean, and wheat contract prices via yfinance.
- FRED API: US Dollar Index (DXY), updated daily.
- USDA WASDE: Monthly World Agricultural Supply and Demand Estimates.
- USDA ERS: Annual production cost estimates per bushel by commodity.
Current State
The frontend dashboard with five core views is complete. The commodity price forecasting module is fully implemented end-to-end (data ingestion, feature engineering, model training with walk-forward validation, API endpoints, and frontend predictions dashboard). EC2 cron scheduling and SNS alerting are operational.
Next priorities include web app performance improvements (replacing full national parquet fetches with targeted Athena queries), UX polish (search, mobile navigation, loading states), accessibility improvements, and additional forecasting modules for crop yield and planted acreage (tech specs written, models not yet trained).
Devlog
With the commodity price forecasting module live, I turned to two things that needed attention before moving to the next ML module: extending the data pipeline to county-level granularity, and cleaning up operational rough edges.
The county-level NASS data extension was the bigger piece. The existing pipeline pulled state-level aggregates from the QuickStats API, which is fine for national trends but too coarse for the crop yield prediction module I have planned next. County-level data is an order of magnitude larger (thousands of FIPS codes versus 50 states), so the ingestion needed to handle pagination more carefully and the parquet partitioning scheme needed updating. I kept the state-level files as the primary dashboard data source and added county-level files as a separate partition layout that the yield module will consume.
On the operational side, I fixed a redundant API call pattern in the pipeline where the same QuickStats query was being made multiple times during a single run. This was not causing data issues (the upsert logic handled duplicates), but it was wasting API quota and slowing down runs unnecessarily. I also tightened up the cron runner's mode handling and verified the systemd service restarts cleanly after WASDE inference updates the model artifacts.
The model artifact flow is now clean: train locally, upload pickles to S3, EC2 pulls from S3 on FastAPI startup (or falls back to local artifacts if S3 is unreachable). The inference script runs all 18 commodity/horizon combinations in a single batch, writes predictions to RDS, and the service restart picks up any updated model files.
This is not a flashy milestone, but the platform is now operationally solid. The pipeline runs reliably on its cron schedule, the prediction service auto-recovers, and the data layer supports both the state-level dashboard and the county-level work coming next.
What's next: The crop yield prediction module (Module 04), which will use the county-level NASS data alongside NOAA weather feeds and NASA solar radiation data. This module doubles as a Cloud Computing class deliverable, so the architecture needs to be fully documented and reproducible.
Changelog
Added
- Add county-level NASS data ingestion from USDA QuickStats API
- Add county-level parquet partition layout alongside existing state-level files
- Add batch inference script for all 18 commodity/horizon model combinations
Changed
- Update pipeline pagination to handle county-level data volumes
- Update parquet partitioning scheme to support dual granularity (state + county)
- Update cron_runner.sh mode handling for cleaner operational flow
Fixed
- Fix redundant API calls in QuickStats pipeline wasting quota and slowing runs
- Fix systemd service restart behavior after WASDE inference model updates
Devlog
This was the densest two days of the project. I built the complete commodity price forecasting module end-to-end, from database tables through model training to a live frontend dashboard. Nine implementation steps, all shipped.
The backend starts with four ETL scripts pulling data from different sources on different schedules: CME futures prices daily via yfinance (I had to switch from Nasdaq Data Link after their CHRIS/CME endpoint was retired), US Dollar Index from FRED, WASDE supply-and-demand reports monthly from USDA, and ERS production costs annually. Each script writes to its own RDS table with upsert logic so reruns are safe.
The feature engineering layer builds an 18-feature matrix from these sources: market signals (spot price, term spread, basis, open interest change), fundamental indicators (stocks-to-use ratio and its historical percentile, WASDE surprise magnitude), macro factors (DXY level and 30-day change), cost metrics (production cost per bushel, price-to-cost ratio), and interaction terms (corn-soy ratio, seasonal factor). Every feature is validated with a Pandera schema before it reaches the model.
The model itself is an ensemble: SARIMAX for capturing seasonality and trend, LightGBM with quantile regression for the probabilistic spread, a Ridge meta-learner that blends them, and an isotonic calibration layer for the probability estimates. I trained 18 model sets (3 commodities times 6 monthly horizons) using walk-forward validation with proper temporal splits. A Mahalanobis distance check flags regime anomalies at inference time -- if the current feature vector is far from the training distribution, the system defers to the futures curve rather than trusting the model.
The API serves four endpoints: the core forecast (p10/p50/p90), calibrated probability above a user-specified threshold, WASDE signal analysis, and historical forecast-vs-actual comparisons. The frontend PredictionsDashboard has a fan chart, probability gauge, WASDE signal card, key driver callout (top SHAP feature), and regime alert.
The whole thing runs on EC2 via cron: daily market data pulls, monthly WASDE-triggered inference, and a systemd service that auto-restarts the FastAPI server. Total infrastructure cost is about $22/month.
What's next: Extending the pipeline to county-level NASS data for finer geographic granularity, and operational polish before the yield prediction module.
Changelog
Added
- Add 5 RDS tables: futures_daily, wasde_releases, price_forecasts, ers_production_costs, dxy_daily
- Add ETL scripts: ingest_futures (yfinance), ingest_fred (DXY), ingest_wasde (USDA CSV), load_ers_costs (Excel)
- Add 18-feature engineering pipeline with Pandera schema validation
- Add PriceEnsemble model: SARIMAX + LightGBM quantile + Ridge meta-learner + isotonic calibration
- Add SHAP TreeExplainer for key driver identification per forecast
- Add Mahalanobis distance regime anomaly detection at inference time
- Add walk-forward training pipeline (2010-2019 train, 2020-2022 val, 2023-2024 test)
- Add 4 API endpoints: forecast, probability, wasde-signal, history
- Add PredictionsDashboard with fan chart, probability gauge, WASDE signal card, key driver callout
- Add PREDICTIONS tab to main dashboard ViewMode
- Add EC2 cron schedule: daily market data, monthly WASDE + inference, annual ERS costs
- Add systemd service for FastAPI auto-restart on EC2
- Add S3 model artifact storage and fallback loading
Changed
- Update cron_runner.sh with --daily, --monthly-wasde, --annual-ers modes
Devlog
With the Next.js frontend stable, I shifted to the backend infrastructure that the prediction modules will need. This was less about visible features and more about laying pipe: a FastAPI skeleton, database setup, and fixing a pipeline reliability problem that had been lurking since the Streamlit days.
The FastAPI backend is straightforward: async SQLAlchemy 2.0 with asyncpg connecting to a PostgreSQL instance on AWS RDS (db.t4g.micro, about $15/month). Alembic handles migrations. The router structure is set up for the prediction endpoints I will build next, with config management via pydantic-settings reading from environment variables. The entry point loads models at startup via a lifespan context manager, which will matter once trained model artifacts exist.
The more important fix was the EC2 OOM kill. The USDA QuickStats pipeline was trying to load the entire national dataset into memory for processing, which worked on my local machine but blew up the EC2 instance. I refactored the pipeline to process data incrementally: ingest by state, merge parquet files in a multi-step process, and never hold more than one state's worth of data in memory at a time. This also made the pipeline resumable -- if it fails on state 37, it picks up from 37 on the next run instead of starting over.
I also did a Phase 3 UX and accessibility pass on the web app: bug fixes, some analytics instrumentation, and accessibility improvements flagged during the Next.js migration. And I cleaned out the legacy Dash app code that was still sitting in the repo from an earlier prototype. The repository is now cleanly split between `web_app/` (Next.js frontend), `backend/` (FastAPI), and `pipeline/` (data ingestion).
What's next: Building the commodity price forecasting module -- the first ML system on this platform, using WASDE reports, CME futures, and macro indicators to produce probabilistic price forecasts.
Changelog
Added
- Add FastAPI backend skeleton with async SQLAlchemy 2.0 and asyncpg
- Add PostgreSQL on AWS RDS (db.t4g.micro) with Alembic migrations
- Add pydantic-settings config management for backend environment
- Add incremental state-by-state pipeline processing to prevent EC2 OOM
- Add multi-step parquet merge for resumable pipeline runs
- Add Phase 3 UX and accessibility improvements to web app
Changed
- Update pipeline to process data incrementally instead of full national load
- Update repository structure: clean split between web_app/, backend/, pipeline/
Fixed
- Fix EC2 OOM kill during national dataset processing
- Fix various web app bugs identified during Phase 3 review
Removed
- Remove legacy Dash app code from repository
Devlog
After two months away from this project (other work took priority over winter), I came back and rewrote the entire dashboard from Streamlit to Next.js 16 with React 19 and TypeScript. The Streamlit version worked for exploration, but the UX ceiling was low: no component-level state management, limited interactivity, and Streamlit's deployment model did not fit well with Vercel hosting.
The migration took about a week of focused work. Each of the five original views (Crops, Animals, Land, Labor, Economics) became its own React component. Charts moved from Matplotlib/Plotly to Recharts, which integrates naturally with React's rendering lifecycle. The geographic visualizations got a major upgrade: Deck.gl with MapLibre GL replaced the hexagonal map, giving me proper choropleth layers with county-level aggregation and interactive tooltips.
The most interesting architectural decision was browser-side parquet reading via hyparquet. Instead of standing up a backend data proxy, the frontend fetches parquet files directly from S3 and parses them in the browser. This means the data layer has zero backend cost for read-heavy dashboard usage. S3 serves the files, the browser does the parsing, and Athena is available as a fallback for complex aggregation queries that would be expensive to run client-side.
The build process was rougher than expected. I went through three rounds of build fixes on February 12 alone (type errors, missing dependencies, Next.js App Router quirks). The v2 data integrity pass on February 16 caught several cases where the parquet column names did not match the TypeScript interfaces I had defined.
I also established the design system during this migration: a dark-themed palette with consistent chart colors, responsive layout via Tailwind CSS 4, and a design token system that will make future theming straightforward.
What's next: Standing up a FastAPI backend for the prediction services and hardening the pipeline to handle larger data volumes without running out of memory on EC2.
Changelog
Added
- Add Next.js 16 frontend with React 19, TypeScript 5, and App Router
- Add Recharts 3.7 integration for interactive chart components
- Add Deck.gl 9.2 + MapLibre GL 5.18 for choropleth and aggregation layer maps
- Add hyparquet for browser-side parquet file reading (zero-backend data serving)
- Add S3-first data fetching strategy with local API fallback
- Add Athena query integration for complex aggregation queries
- Add design system with dark theme palette and consistent chart color tokens
- Add Vercel deployment configuration
Changed
- Update dashboard architecture from Streamlit (Python) to Next.js (React/TypeScript)
- Update chart library from Matplotlib/Plotly to Recharts
- Update geographic visualization from hexagonal map to Deck.gl choropleth layers
Fixed
- Fix multiple build errors from Next.js App Router migration (3 rounds of fixes)
- Fix parquet column name mismatches with TypeScript interfaces in v2 data integrity pass
Devlog
I stood up the first working version of the USDA agricultural analytics dashboard in about two weeks. The initial goal was straightforward: pull historical crop, livestock, land use, labor, and economic data from the USDA QuickStats API and present it in an interactive dashboard that makes state-level agricultural trends explorable.
The data pipeline fetches from the QuickStats API, cleans and validates the response, and writes partitioned parquet files to an S3 bucket. Parquet was the right call here because the dataset is wide (dozens of commodity types, metrics, and geographies) and read-heavy. The pipeline handles incremental updates so I can re-run it monthly without re-downloading the full history.
The dashboard itself is a Streamlit app with five views: Crops (production, yield, acreage), Animals (inventory, production), Land (use composition), Labor (employment, wages), and Economics (prices received, revenue). I added a hexagonal state map for geographic exploration, which looks better than a standard choropleth for agricultural data because it gives smaller states equal visual weight.
Most of the friction was around deployment. I built a Docker container and GitHub Actions CI/CD pipeline on December 6, then spent the next day debugging S3 connectivity issues and data loading strategies. The demo mode (single-state focus for testing) helped isolate problems without loading the full national dataset. I also ran into memory issues when loading all states at once, which I solved by switching to a lazy-loading strategy that only fetches the states a user actually navigates to.
By December 13, the MVP was documented, the README was in place, and the S3 environment was stable. The Streamlit approach worked for getting something up fast, but I already knew the UX limitations would push me toward a proper frontend framework.
What's next: Migrating the dashboard from Streamlit to Next.js for better interactivity, Vercel hosting, and component-level state management.
Changelog
Added
- Add USDA QuickStats API ingestion pipeline with incremental updates
- Add AWS S3 parquet storage with partitioned file layout
- Add Streamlit dashboard with 5 views (Crops, Animals, Land, Labor, Economics)
- Add hexagonal state map visualization for geographic exploration
- Add Dockerfile and GitHub Actions CI/CD pipeline
- Add demo mode for single-state testing and development
- Add detailed S3 error logging for debugging connectivity issues
- Add README documentation with screenshots
Changed
- Update data loading to lazy-fetch strategy (per-state on navigation)
Fixed
- Fix labor statistics data accuracy in pipeline
- Fix S3 environment variable configuration for deployment
Devlog
Two sessions today took Module 03 from code-on-disk to a fully trained, honestly evaluated, and benchmarked acreage prediction system.
The morning session was infrastructure: county data pipeline optimization (870K records ingested locally in 3.5 hours with batched API calls and 8 parallel workers), the full Module 03 code build (ensemble model, features, API, frontend), RDS connectivity fix (the subnet route table was missing an Internet Gateway route), and a complete ETL run loading 19K futures rows, 5K DXY observations, 132 WASDE releases, 75 ERS cost records, and 64 fertilizer price records into RDS.
The afternoon session was about training and finding out if the model was any good. The first pass trained on 14 national-level samples. It technically produced numbers, but the uncertainty bands were static (identical p10/p90 every year), soybean was systematically underestimated, wheat was overestimated, and CI coverage sat at 43% where 80% was expected. Quantile regression had collapsed to constants with so few samples. Not publishable.
I upgraded to state-panel data: fetched 1990-2000 historical NASS records via the QuickStats API (1,303 records), merged with existing 2001-2025 data, and retrained on 15 top states per commodity (375 samples instead of 14). I added LRU caching on DB queries (15x speedup since futures/cost queries are year-dependent, not state-dependent), fixed wheat double-counting (ALL CLASSES + WINTER were both summed), and switched from July wheat futures (unavailable 8 months out) to December.
The real value came from the evaluation layer: persistence baseline ("predict last year"), 5-year average baseline, leave-one-year-out CV (25 folds), and split conformal prediction for calibrated intervals. Soybean passes the baseline gate (5.63% MAPE vs 5.42% baseline, 91% val coverage). Corn and wheat do not: persistence at 6.24% and 5.73% beats the model. Acreage is inherently sticky, and market signals alone add noise at the state level.
Comparing against the real world drove the point home. USDA's March Prospective Plantings (74,000 farmer survey) reported corn at 95.3M, soybean at 84.7M, wheat at 43.8M. Reuters, Bloomberg, Farm Futures, and AgMarket.Net all landed within 1-2% of USDA. Our model: corn 87.9M (-7.8%), soybean 68.8M (-18.8%), wheat 31.3M (-28.6%). The gap isn't timing; it's that the top-15 state sum misses the long tail, and market signals don't capture physical constraints.
I spec'd out four Tier 1 feature sources to close that gap: CRP contract expirations (direct land supply signal), RMA crop insurance elections (closest thing to a farmer survey without surveying), Drought Monitor DSCI (physical planting constraints via a clean REST API), and FAS export commitments (demand pipeline). Full API endpoints, table schemas, and feature designs are in `research/acreage-tier1-features-spec.md`.
What's next: Integrate drought monitor and RMA insured acres (the two highest-impact, lowest-effort Tier 1 sources), retrain, and re-evaluate. If corn/wheat still fail the baseline gate, pivot to residual modeling: predict the delta from persistence rather than absolute levels.
Changelog
Added
- Add Module 03 code: ensemble model, 15-feature pipeline, 4 API endpoints, inference CLI, frontend with seasonal state machine
- Add Alembic migration 002 for acreage tables; 3-source fertilizer ETL pipeline
- Add `pipeline/fetch_nass_historical.py` for 1990-2000 state-level NASS data via QuickStats API (1,303 records)
- Add state-panel training in `acreage_model.py` (15 states x 26 years = 375 samples per commodity)
- Add persistence and 5-year average baselines with deployment gate in `acreage_model.py`
- Add leave-one-year-out cross-validation (25 folds per commodity)
- Add split conformal prediction for calibrated 80% coverage intervals
- Add `predict_batch()` for vectorized multi-row prediction
- Add rich metrics.json: MAPE, RMSE, coverage, baseline comparison, CV stats, top features
- Add LRU caching to feature query functions (futures, ERS costs, fertilizer)
- Add `research/acreage-tier1-features-spec.md` speccing CRP, RMA, Drought Monitor, FAS Export sources
Changed
- Update `train_acreage.py` to merge 1990-2000 historical data with 2001-2025 S3 parquets
- Update `acreage_inference.py` to predict per-state with national rollup (16 rows per commodity)
- Update `train_and_save()` to default to TOP_STATES instead of national-only
- Rewrite county fetch pipeline with batched stat categories (4x fewer API calls, 8 workers)
- Rewrite ERS cost parser for new "machine readable" Excel format
Fixed
- Fix RDS connectivity by adding Internet Gateway route to subnet route table
- Fix wheat acreage double-counting with `class_desc == 'ALL CLASSES'` filter
- Fix wheat futures feature: December contract instead of unavailable July contract
- Fix negative acreage predictions with floor of 0 in predict methods
- Fix numpy float64 serialization in DB upsert
- Fix all-NaN column handling in ensemble fit
Devlog
Module 04 went from zero to 60 trained models in a single session. The goal: county-level, weekly, in-season crop yield prediction for corn, soybeans, and wheat using weather data, soil characteristics, and historical NASS yield patterns.
The session started with infrastructure. I created the Alembic migration for three new RDS tables (soil_features, feature_weekly, yield_forecasts), wrote ORM models and Pydantic schemas, then built the full ETL stack: five new ingestion scripts for NOAA GHCN weather, NASA POWER solar/VPD, US Drought Monitor, NASS crop condition ratings, and PRISM precipitation normals. Each follows the existing common.py pattern with upsert semantics and S3 archival. I also wrote static data loaders for county centroids (Census Gazetteer, 3,222 counties), SSURGO soil features (NRCS SDA API, 2,391 counties with AWC and drainage class), and NOAA station-to-county mapping (Haversine nearest-station within 50km, 3,140 mapped, 82 NASA POWER fallback).
The real unlock was processing the GHCN bulk archive. Instead of rate-limited API calls (1,000/day for 3,200 counties), I processed the 7.5GB daily-summaries-latest.tar.gz directly: extract only the 3,105 mapped US stations, parse TMAX/TMIN/PRCP, convert units, and map to county FIPS. The result: 16 million daily weather records (2000-2025) processed in 4.3 minutes into a 71 MB parquet. That single file replaced what would have taken weeks via the API.
The wheat commodity required a targeted NASS fetch (286K rows across 30 states, 58 minutes) because the original county pipeline used "WINTER WHEAT" as the commodity name, but county-level data uses the aggregated "WHEAT" descriptor. A class_desc filter fix (accepting "WINTER" alongside "ALL CLASSES") resolved the training data gap.
First-pass training used only NASS-derived features (county mean yield, trend, prior year, std deviation, year). All corn and soybean models passed the 10% baseline gate, but features were identical across weeks since they didn't incorporate in-season weather. The retrain with weather features (GDD accumulated from planting, season precipitation, precipitation deficit vs normals, average max temperature, heat stress days) added real week-specific differentiation. Corn week 20 hit 14.15% val RRMSE (40% better than the 23.78% county-mean baseline), and soybean week 10 reached 16.50% (29% improvement). Wheat remains the hardest crop at 20-21% RRMSE, narrowly missing the 10% improvement gate. Its cross-year phenology (fall planting, summer harvest) makes a simple day-of-year season window less effective.
The frontend section integrates into the existing Predictions dashboard with commodity tabs, a week-of-season slider (1-20), confidence badges (low/medium/high by week), county summary stats, a map placeholder for Deck.gl county choropleth, forecast detail card with p10/p50/p90 ranges, a confidence strip showing how uncertainty narrows through the season, and a season accuracy chart.
What's next: Implement the Deck.gl GeoJsonLayer county choropleth using the downloaded TopoJSON, add drought and crop condition features to the training pipeline, and deploy the updated FastAPI backend to EC2 with the 60 model artifacts.
Changelog
Added
- Add Alembic migration 003 with `soil_features`, `feature_weekly`, `yield_forecasts` tables
- Add ORM models (`SoilFeature`, `FeatureWeekly`, `YieldForecast`) and 4 Pydantic response schemas
- Add 5 ETL scripts: `ingest_noaa.py`, `ingest_nasa_power.py`, `ingest_drought.py`, `ingest_crop_conditions.py`, `load_prism_normals.py`
- Add 3 static data loaders: `load_county_centroids.py` (3,222 counties), `load_ssurgo.py` (2,391 soil records), `build_station_map.py` (3,140 stations)
- Add `process_ghcn_bulk.py` for bulk GHCN archive processing (16M rows in 4.3 min)
- Add `fetch_wheat_county.py` for targeted wheat NASS county data (286K rows)
- Add `yield_features.py` with 7-feature vector (GDD, CCI, precip deficit, VPD stress, drought, soil AWC, soil drainage)
- Add `yield_model.py` (LightGBM quantile ensemble: p10/p50/p90) and `train_yield.py` (60 models, weather-enhanced, 10 features)
- Add `yield_inference.py` for weekly county-level prediction pipeline
- Add `yield_forecast.py` FastAPI router with 3 endpoints at `/api/v1/predict/yield`
- Add `useYieldForecast.ts` React hook and `YieldForecastSection.tsx` with 4 sub-components
- Add `--weekly-yield` mode to `cron_runner.sh` (Thursday 10 AM ET)
- Add US counties TopoJSON (842KB) for Deck.gl choropleth
Changed
- Update `main.py` to load 60 yield models at startup with S3 fallback, register yield router, bump to v0.3.0
- Update `config.py` with `NOAA_API_KEY` setting
- Update `PredictionsDashboard.tsx` to include `YieldForecastSection`
- Update `train_yield.py` to integrate GHCN weather features (GDD, precip, hot days) with week-specific computation
- Update `alembic/env.py` to register yield module ORM models
Fixed
- Fix `load_county_centroids.py` to handle Census Gazetteer ZIP format and strip trailing whitespace from column names
- Fix `load_ssurgo.py` to parse SDA API list-of-lists response format (was expecting dicts)
- Fix wheat county yield filter to accept `class_desc='WINTER'` alongside `'ALL CLASSES'`
Devlog
Today was the biggest single-day accuracy improvement across the prediction modules. The acreage model started the day with only soybean passing its deployment gate. By the end, three of four models pass: corn, soybean, and winter wheat.
The core problem was that the original 15 features were dominated by NASS historical acreage, with price and cost signals ranking low in importance. The models needed environmental constraints, farmer-decision signals, and demand context that market prices alone couldn't capture. I built four new ETL pipelines pulling from the US Drought Monitor (state-level DSCI via REST API), USDA RMA Summary of Business (crop insurance insured acreage from pipe-delimited ZIPs), FSA Conservation Reserve Program (enrollment and contract expirations from Excel), and FAS export sales (weekly commitments from ESRQS CSVs). That added about 12,300 rows across four new database tables and expanded the feature set from 15 to 24.
The data engineering had some friction. The FAS OpenData API was retired on April 2 (returning 403s on every endpoint), so I fell back to manual CSV downloads from the ESRQS query tool. FSA servers were too slow for automated Excel downloads, requiring manual retrieval. The USDM DSCI API returned lowercase JSON keys (`dsci`, `mapDate`) contrary to what their documentation suggested, which caused the first backfill run to parse zero rows before I caught the mismatch.
The real breakthrough came from two modeling changes layered on top of the new features. First, I split wheat into winter and spring sub-models with commodity-specific decision dates: August 1 for winter wheat (already in the ground by November) and March 1 for spring wheat. This required matching futures contracts to the decision date using `infer_contract_month()` rather than hardcoded forward contracts. The initial winter wheat results showed five NaN features because there was no historical futures data at August dates, so I backfilled 19,318 rows from Yahoo Finance going back to 2000. That single fix dropped winter wheat from 6.56% to 5.91% val MAPE.
Second, I switched from absolute acre targets to 3-year moving average residuals for soybean and wheat. The persistence baseline (just predict last year's acres) was already strong at 4-6% MAPE. Predicting raw deltas from prior year was too noisy, but the 3-year average smoothed enough to give the model a stable baseline to deviate from. Seven experiments confirmed the pattern: raw residuals hurt everything, 5-year averages flatten the trend too much, and 3-year is the sweet spot. Corn performed best with absolute targets, so the production config uses a per-commodity `RESIDUAL_CONFIG` dictionary: absolute for corn, 3yr residual for everything else.
The soybean result stands out. At 5.39% val MAPE, it is the first model in the entire project to pass the validation gate (not just the test gate). That means the model genuinely outperforms the 5-year-average baseline even on the volatile 2021-2023 period that includes COVID supply chain shocks and the Ukraine war.
What's next: Wheat spring (4.92% test vs 4.17% baseline) is the last holdout, limited by its 5-state, 125-row training set. Expanding the state list or adding spring-wheat-specific features like prevent-plant acreage could close that gap.
Changelog
Added
- Add Alembic migration 004 with four new tables: `drought_index`, `rma_insured_acres`, `crp_enrollment`, `export_commitments`
- Add `ingest_drought_dsci.py` ETL for USDM state-level DSCI (1,227 rows backfilled, 2000-2025)
- Add `ingest_rma.py` ETL for RMA Summary of Business crop insurance (3,188 rows, 2000-2025)
- Add `ingest_crp.py` ETL for FSA CRP enrollment and contract expirations (2,121 rows, 1986-2024)
- Add `ingest_fas_exports.py` ETL for FAS weekly export commitments (5,769 rows, 3 commodities)
- Add 9 Tier 1 features to `acreage_features.py`: drought (dsci_nov, dsci_fall_avg), insurance (insured_acres_prior, insured_acres_yoy_change), CRP (crp_expiring_acres, crp_pct_cropland), exports (export_outstanding_pct, export_pace_vs_5yr), plus prior_3yr_avg_acres
- Add winter/spring wheat model split with `DECISION_DATES` dict and commodity-specific futures contract matching
- Add 3-year moving average residual target modeling to `AcreageEnsemble` with `use_residual` flag and `_to_delta`/`_from_delta` methods
- Add `RESIDUAL_CONFIG` per-commodity config dict to `train_acreage.py`
- Add 6 new cron_runner.sh scheduling modes for Tier 1 data refresh
Changed
- Update feature set from 15 to 24 features in `acreage_features.py`
- Update training pipeline to produce 4 models (corn, soybean, wheat_winter, wheat_spring) instead of 3
- Update futures contract queries to use `infer_contract_month()` at decision date instead of hardcoded forward months
- Backfill 19,318 historical futures rows (2000-2026) via Yahoo Finance for all three commodities
- Widen `export_commitments.marketing_year` column from VARCHAR(9) to VARCHAR(20) for FAS marketing year format
- Improve corn test MAPE from 7.96% to 6.26% (passes deployment gate)
- Improve soybean test MAPE from 5.97% to 4.64% (passes both val and test gates)
- Add wheat winter model at 4.51% test MAPE (passes gate, new model)
Fixed
- Fix DSCI API parser to handle lowercase response keys (`dsci`, `mapDate`) instead of uppercase
- Fix winter wheat NaN futures by switching from forward-dated contract months to front-month matching at August decision dates
Devlog
I replaced the entire frontend of FieldPulse in a single session. The old dashboard was a 656-line monolith: dark-only, single-page, seven useState view modes, hardcoded hex colors everywhere, and an Inter font that had nothing to do with agriculture. The new one is a light-first, 6-tab application built on a proper design system with URL routing, seasonal behavior, and cross-cutting systems like glossary hover popovers, auto-generated chart captions, and citation blocks under every visualization.
The rebuild followed a five-wave plan. Wave 1 was the hardest: getting Tailwind v4's CSS processing to play nicely with custom properties took three attempts. The `@import` ordering in Tailwind v4 is strict, and putting the Google Fonts import after `@import "tailwindcss"` silently broke the entire stylesheet with no error in the terminal. Only the browser showed a blank page. Once I moved the font import above Tailwind's import, every token lit up correctly. MapLibre also refuses CSS `var()` references in paint properties, so all map layer colors had to be literal hex values.
The data layer was the pleasant surprise. The existing `processData.ts` (628 lines of transform functions) and `serviceData.ts` (S3 parquet fetching with fallback) survived the rebuild untouched. Every page reuses `filterData`, `getCommodityStory`, `detectAnomalies`, `getTopCrops`, and the rest. The old code was better than it looked: well-structured pure functions that just needed a modern rendering layer on top.
The seasonal behavior system turned out to be the most satisfying piece. A single `getAgSeason()` function drives dormant messages ("Crop conditions return May"), seasonal story card rotation on the Overview, and visibility gating on forecast panels. In April, the Forecasts page correctly shows acreage forecasts as live and yield forecasts as dormant until May 19.
What's next: Connect to the live FastAPI backend on EC2 to populate the Market tab (futures, WASDE, input costs) and Forecasts tab (acreage ensemble predictions, accuracy panels). The frontend is ready; it just needs data flowing.
Changelog
Added
- Add FieldPulse design token system with full light/dark mode via CSS custom properties and `[data-theme='dark']`
- Add 6-tab URL routing via Next.js App Router route groups (`/overview`, `/market`, `/forecasts`, `/crops`, `/land-economy`, `/livestock`)
- Add `useFilters` hook wrapping `useSearchParams` for URL-as-source-of-truth filter state
- Add `useTheme` hook with flash-free dark mode detection via inline script
- Add `useAgSeason` hook with seasonal override table driving UI behavior per month
- Add 10 shared components: BandShell (loading/error/empty/seasonal), KpiCard, DeltaChip, Sparkline, Term (glossary popover), CitationBlock, CommodityPicker, RangeChips, ExperimentalPill, ChartToolbar
- Add shell system: Header (6-tab nav, logo, theme toggle), FilterRail (state/year/commodity pills, freshness badge), SourceFooter
- Add Overview page with USChoropleth (MapLibre), HeroStrip (3 KPIs), StateFingerprint (donut + sparklines + peer bars), StoryCards (3 static + 1 seasonal)
- Add Crops page with YieldTrendChart (25yr line, anomaly dots, national overlay), ProfitChart (bidirectional bars), HarvestEfficiency, CropProgressStrip (seasonal visibility)
- Add Market page with MarketHero (64px Barlow price), PriceHistoryChart (range chips), WasdeCard (percentile bar, direction pill), RatioDial (corn/soy gauge), InputCostCard, DxyStrip
- Add Forecasts page with SeasonClock (12-month strip), AcreageCard (3 commodity cards), AccuracyPanel (walk-forward backtest charts), yield dormant panel
- Add Land & Economy page with RevenueLeaderboard (boom/decline callouts), FarmStructure (operations + avg farm size), LandUseMix (stacked area), LaborWages (wage trend + ranking)
- Add Livestock page with InventorySnapshot (6 KPI cards), ProductionCharts (cattle/hog/milk line charts)
- Add `captionTemplates.ts` with 20+ runtime caption templates for auto-generated chart descriptions
- Add `glossary.json` with 20 USDA terms and `peerStates.json` with 50-state peer lookup
- Add `anomalyContext.json` for 6 major drought/weather events (2002-2019)
- Add `useMarketData` hook wrapping all 7 FastAPI `/market/*` and `/predict/*` endpoints
Changed
- Rewrite `globals.css` from 2-token dark-only to 80+ token light-first design system with Tailwind `@theme inline` mapping
- Rewrite `layout.tsx` to load Plus Jakarta Sans, Barlow Condensed, JetBrains Mono via `next/font/google` with flash-free theme script
- Replace root `page.tsx` (656-line monolith) with server-side redirect to `/overview`
Removed
- Remove 12 old dashboard components (CropsDashboard, AnimalsDashboard, EconomicsDashboard, LaborDashboard, LandDashboard, PredictionsDashboard, AcreagePredictionSection, YieldForecastSection, StateInfoPanel, StateSingleMap, TopCropsChart, USMap)
- Remove `design.ts` (palette object replaced by CSS custom properties; format functions extracted to `lib/format.ts`)
- Remove `usePriceForecast.ts` (price forecast module cut per spec section 0.1)
Devlog
FieldPulse has sprawled across four subsystems (pipeline, ML backend, FastAPI, Next.js frontend), and it was time for a real audit before the class demo. I spent the first half of the session running parallel review agents across the code. They came back with twenty findings, ranked by severity. The next several hours were spent fixing them in one pass, with one deferred by operator decision (the committed .env holds placeholder values, so rotation can wait).
A few of the findings mattered more than the headline count suggests. The multipart S3 verification was silently skipping any parquet over 8 MB, so a corrupted file would have reached the browser with only a log warning. The `last_success` manifest stamp was landing before the S3 upload step, meaning a failed upload would permanently suppress re-ingestion on the next cron run. The Conformalized Quantile Regression calibration in the price model was fit on the same validation split used to report coverage, making the reported figure tautological. The acreage national rollup was dividing by the wrong z-score, understating state sigma by about twenty-two percent. None of these were showing up as errors; they were the quiet kind of wrong.
The interesting decision was on gate policy. The review had flagged that "baseline gates" across all three forecasting modules were cosmetic: a model that failed the gate still got written to S3. For price and acreage, I enforced the gate properly (S3 upload blocked unless `--allow-failed-gate` is set). For yield, the module doubles as a class project and needs to demo even when test RRMSE trails the baseline. So I took a different path: gates stay informational, and a new `/metadata` endpoint exposes the gate status so the frontend can render a performance banner. Honest about the limits, visible in the UI, not hidden in a log line.
The last addition was a `YieldSeasonReview` component behind a toggle on the Forecasts tab. Off-season, it shows the 2024 walk-forward test results across corn, soybean, and wheat, with a per-week RRMSE chart against the county five-year baseline. The page was blank before, except for a "yield forecasts begin May 19" placeholder.
What's next: Redeploy the backend so the breaking acreage schema rename (`forecast_acres_millions` to `forecast_acres`) takes effect in production. The frontend currently shows "No forecast available" for acreage because the deployed API still returns the old field name.
Changelog
Added
- Add `YieldSeasonReview` component to Forecasts tab with Current/Review toggle and per-crop walk-forward RRMSE chart
- Add `GET /api/v1/predict/yield/metadata` endpoint exposing per-crop gate status for frontend annotation
- Add HMAC-SHA256 pickle signing in `backend/models/_signing.py` gated by `MODEL_SIGNING_KEY` env var
- Add `--allow-failed-gate` CLI flag to price and acreage training with gate-failed artifacts now excluded from S3 by default
Changed
- Rename acreage API fields `forecast_acres_millions` to `forecast_acres` across schemas, router, and frontend (breaking, requires backend redeploy)
- Replace MD5-plus-ETag multipart verification with `ChecksumAlgorithm=SHA256` and size check in `pipeline/upload_to_s3.py`
- Split price model CQR calibration into separate calibration and measurement halves so reported coverage is no longer tautological
- Dedupe `/api/v1/predict/acreage/states` on latest `created_at` per `state_fips` since the unique constraint includes `model_ver`
Fixed
- Fix `yield_inference.py` to populate weather features at prediction time instead of silently zero-filling
- Fix acreage national rollup z-score from 1.645 to 1.2816 so state sigma reflects the 80 percent stored interval
- Fix `quickstats_ingest` manifest ordering so `record_counts` promotes only after a clean S3 upload
- Fix SQL injection surface in `backend/features/acreage_features.py` by gating dynamic column names behind frozenset allowlists
Devlog
FieldPulse had shipped, but the first real walkthrough against live data surfaced problems that a code review alone would never have caught. Indiana total farm sales were reading $448.3B with a +15487 percent growth chip, rice profit in Arkansas was clocking in at $32,336 per acre, and Colorado's revenue leaderboard crowned peaches at $120B. The audit for this session was to go tab by tab with screenshots and not stop at "that looks wrong": every issue had to be traced back to source code, a parquet query, or a live API call, so the fix plan could be written with confidence rather than intuition.
The audit ran across six tabs (Overview, Market, Forecasts, Crops, Land & Economy, Livestock) and produced twenty-two distinct findings. I read every page component, every shared utility, every backend router, and ran Python directly against the S3 parquets to confirm what the data looked like before the frontend got its hands on it. Most of the user-visible problems fell into three families: data correctness in the pipeline's enrichment step, missing or miswired data sources, and plain frontend bugs where keys were misspelled or values were hardcoded placeholders.
The one finding that collapsed a dozen symptoms into a single root cause was in `pipeline/quickstats_ingest.py`. The `enrich_dataframe` function was computing imputed revenue as `sum(PRICE_RECEIVED) * sum(PRODUCTION)`, grouped by (state, commodity, year) but summing blindly across every `class_desc` variant. For hay, that meant multiplying the sum of alfalfa plus non-alfalfa plus all-classes prices by the sum of all three productions. The resulting DERIVED rows were inflated by roughly 1000x. One row for Indiana hay in 2024 read $338 billion, against a real figure near $300 million. That same bug cascaded into the overview hero, the revenue mix donut, the Land and Economy boom/decline callouts, and the livestock inventory cards (cattle 1.1 billion head, hogs 602 million, layers 328 million, all for one Midwestern state).
The non-obvious finding was that the corn/soy price ratio endpoint was returning the reciprocal of what every farmer uses. The backend returned `corn / soy` (roughly 0.38), but the 2.2 and 2.5 decision thresholds embedded in the frontend copy were written for the industry-standard `soy / corn` convention (roughly 2.67). The ratio bar marker, which clamped against a `tenYearMin=2.0` floor, pinned to `left: 0%` for every reading and left the bar uncolored. Both sides were internally consistent; they were just speaking different dialects.
The open question from the audit is how to reconcile the soybean acreage forecast rollup. The model's per-state test MAPE of 4.64 percent is genuinely good, yet the top-15-state sum comes in at 71.7M acres against USDA Prospective Plantings of 83.5M. The math is right: 71.7 divided by the historical coverage ratio (~0.86) lands at 83.4M. The choice is between expanding training to 30 states (already attempted, val MAPE regressed from 8 percent to 16 percent because the tail states add noise) or applying a post-hoc scale-up from historical coverage ratios. The latter is what the fix plan settled on.
What's next: With the audit complete, the fix plan is ready to execute. First wave is the pipeline enrichment rewrite plus fresh state-level and county-level aggregates, which unblocks most of the cascading symptoms without any frontend changes. Then backend endpoints (price ratio flip, soybeans plural, wheat export pace, crops profit history, meta models). Then frontend tab by tab against the corrected data.
Changelog
Added
- Flag 7 Overview issues: inflated sales ($448B), rank placeholder, plain map, peer-comparison hardcoded zeros, revenue-mix inflation, flat sparklines, map layout gap
- Flag 4 Market issues: inverted corn/soy ratio, soybeans plural regex reject, empty wheat card slot, non-functional state filter
- Flag 5 Forecasts issues: invisible season-clock rail, hardcoded p10/p90 bar geometry, blobbed accuracy chart 1, ambiguous accuracy chart 2, low-contrast section headings
- Flag 5 Crops issues: tiny anomaly hit target, operations N/A (data gap), missing yield-trend legend, placeholder profit formula, state-agnostic commodity picker
- Flag 5 Land and Economy issues: inflated peaches leaderboard, no crop-type filter, land-use chart missing 4 of 5 categories, labor data key mismatch, operations section key mismatch, unlabeled boom/decline baseline
- Flag 3 Livestock issues: inflated inventories (cattle 1.1B / hogs 602M / layers 328M), charts preferred as tables, sparklines without tooltips
- Flag 2 cross-cutting issues: section-heading contrast failure across 30-plus usages, no About page explaining methodology
Changed
- Document root cause: `pipeline/quickstats_ingest.py::enrich_dataframe` sums PRICE RECEIVED x PRODUCTION across class_desc variants, producing 100-1000x inflated DERIVED rows that cascade into Overview, Land and Economy, and Livestock surfaces
- Document root cause: `backend/routers/acreage.py::get_price_ratio` computes corn/soy (0.38) while frontend thresholds (2.2, 2.5) assume soy/corn convention (2.67), clipping every reading to soy_favored
- Document root cause: NATIONAL.parquet contains only `state_alpha='US'` rows, so `getMapData` returns empty and MapLibre falls through to default fill on every state
- Document root cause: multiple frontend key mismatches where component props read keys that the util never returns (`d.count` vs `d.operations`, `d.stateWage` vs `d[stateCode]`), silently dropping every row
- Document root cause: 14 backend endpoints across 4 router files reject `commodity=soybeans` (plural) via regex, while frontend `CROP_COMMODITIES` uses plural
- Document root cause: county parquet at `partitioned_states_counties/{STATE}.parquet` exists and is populated, but no frontend path consumes it, leaving the choropleth at state grain only
- Document root cause: yield accuracy endpoint returns mixed state plus national rows, and the frontend plots all 245 rows on one line, producing the tangled zigzag
- Document root cause: profit chart uses literal formula `yield * 4.5 - 700` with no per-commodity price or cost lookup, so rice at 7400 lb/acre reads as $32K profit
Fixed
- No production fixes in this milestone. Findings were flagged, traced, and documented; fix plan landed next session.
Devlog
With the audit findings in hand, I went straight into the fix plan. The strategy was to fix data correctness first, so that every frontend rewire downstream could rely on honest numbers, and to do the pipeline work as one coherent pass rather than per-tab patches. The plan shipped in this session covered eleven backend items plus the full Overview tab; the remaining five tabs and the new About page are tracked in a separate pending-work log.
The pipeline rewrite was the load-bearing change. I rewrote `enrich_dataframe` with a tier-aware canonical aggregation: filter down to class_desc equals ALL CLASSES when USDA publishes one, else sum across disjoint sub-classes (melons don't publish an ALL CLASSES row, so watermelon plus cantaloupe sum correctly). Units are paired explicitly so `$/BU` price only ever multiplies `BU` production, never tons. Dollar totals and quantities aggregate with sum within the chosen tier; per-unit prices and yields aggregate with max. Then a rebuild script downloaded all 50 state parquets plus NATIONAL from S3, stripped the bad DERIVED rows, re-enriched with the corrected logic, backed up the originals under a timestamped S3 prefix, and uploaded the clean versions. The whole pipeline ran in about a minute. Indiana hay 2024 went from $338 billion to $296 million; Indiana's full 2024 farm sales went from $448 billion to $14.1 billion, matching the USDA figure of about $13 to 14 billion.
Two new aggregate parquets came out of the same pipeline wave. `state_totals.parquet` is 38 kilobytes and gives the frontend everything it needs for the choropleth, hero totals, rank, and peer comparison; `state_commodity_totals.parquet` is 670 kilobytes and powers the revenue mix, 25-year sparklines, and any crop-type filter. A third `county_metrics` parquet per state backs the Overview map's drill-down mode. All three live under `s3://usda-analysis-datasets/survey_datasets/overview/`.
Beyond the correctness layer, I shipped six new or rewritten backend surfaces. The corn/soy ratio flipped to the industry-standard `soy / corn` convention in `backend/routers/acreage.py`, and a new shared dependency accepts both `soybean` and `soybeans` across 14 endpoints in four router files. A wheat export-pace endpoint joins the existing FAS export_commitments table against a historical 5-year same-week average, so the Market tab's empty wheat slot has real data ready. A `/api/v1/meta/models` endpoint reads every `metrics.json` under `backend/artifacts/` and returns 16 models with gate-pass/fail/borderline classification, feeding the About page's live model strip. ERS cost-of-production ingest expanded from 3 commodities (corn, soy, wheat) to 9 (adding cotton, rice, peanut, sorghum, oats, barley), with per-acre columns added via migration 008. A new `/api/v1/crops/profit-history` endpoint joins NASS state price and yield with ERS per-acre costs to replace the placeholder formula that had been producing the $32K/acre rice profit. USDA ERS Major Land Uses got its own ingest script covering all six state-level land categories for 1945 through 2017, and BLS QCEW NAICS 111 and 112 covers state-level farm establishment counts and wages annually from 2014 through 2024.
One real surprise came from the acreage retrain. I had been confident that expanding the training panel from 15 states to all 30-plus NASS-reporting states would close the national-rollup gap cleanly. It did not. Val MAPE for corn regressed from 8.4 percent to 16.6 percent; the tail states added more noise than signal. I reverted the training panel to the original 15 and added a `NATIONAL_COVERAGE_MULTIPLIERS` constant (derived from the state-aggregate parquet we already produced) that scales the state-sum at inference time. The multiplier for corn is 1.019, soy is 1.076, wheat is 1.157. The scale-up is honest: it comes from actual historical coverage ratios, not a fudge factor.
The Overview tab rewire was the only frontend work that landed. The old page was reading raw state parquets and doing in-browser aggregation on every render; the new page fetches the three pre-computed aggregates and does no more than filtering and sorting. Hero numbers, rank, peer comparison, revenue mix, and sparklines now all land correctly. The map has a second mode: clicking a state zooms into that state and swaps the layer to county-level harvested acres with a faint surrounding-states outline and a "back to US" pill. The county GeoJSON came from us-atlas, pre-converted to GeoJSON and served from `/public` to dodge a Next.js ESM issue with the topojson-client package. CORS on the bucket was missing entirely and had to be configured from scratch.
What's next: Five tab rewires plus the About page and the cross-cutting section-heading migration. The pending-work log has the context each one needs.
Changelog
Added
- Add `pipeline/rebuild_enrichment.py` that downloads every state parquet from S3, strips bad DERIVED rows, re-enriches with tier-aware canonical aggregation, and uploads clean versions with timestamped backups
- Add `pipeline/build_overview_aggregates.py` producing `state_totals.parquet` and `state_commodity_totals.parquet` to `s3://usda-analysis-datasets/survey_datasets/overview/`
- Add `pipeline/build_county_aggregates.py` producing one parquet per state with canonical-filtered county metrics, used by the Overview map drill-down
- Add `backend/etl/ingest_ers_mlu.py` pulling all 6 ERS Major Land Uses state-year categories and emitting `overview/land_use.parquet`
- Add `backend/etl/ingest_bls_qcew.py` covering NAICS 111 and 112 state-year establishments, employment, and avg annual pay for 2014 through 2024
- Add `backend/routers/deps.py` with `commodity_param` and `crop_param` FastAPI dependencies that accept `soybeans` plural and normalize to `soybean` at 14 endpoint call sites
- Add `GET /api/v1/market/exports` endpoint computing wheat export commitments vs 5-year same-week average from FAS data
- Add `GET /api/v1/meta/models` endpoint returning 16 forecast models with pass/borderline/fail gate classification from artifact metrics.json files
- Add `GET /api/v1/crops/profit-history` endpoint joining NASS state price and yield with ERS per-acre cost for 9 commodities
- Add `backend/alembic/versions/008_ers_per_acre_costs.py`, `009_widen_yield_units.py`, `010_land_use_categories.py`, `011_bls_establishments.py`
- Add S3 bucket CORS policy allowing public GET and HEAD for cross-origin parquet fetches
- Add `NATIONAL_COVERAGE_MULTIPLIERS` in `backend/features/acreage_features.py` applied in `compute_national_forecast` so the top-15-state sum scales up to a USDA-comparable national total
- Add `--section-heading` design token in `web_app/src/app/globals.css` and a `SectionHeading` component for consistent contrast across all dashboards
- Add `/us-counties.geojson` static asset and county-choropleth mode to `USChoropleth` with state-BBox zoom and a "back to US" pill
- Add About tab entry to the top-nav TABS constant in preparation for the new page
Changed
- Rewrite `enrich_dataframe` in `pipeline/quickstats_ingest.py` to filter canonical rows (class_desc ALL CLASSES when available, prod_practice ALL, reference_period YEAR) and use tier-aware sum-or-max aggregation per stat, with a $50B sanity cap on derived revenue
- Flip `backend/routers/acreage.py::get_price_ratio` to return soy/corn instead of corn/soy and rewrite the percentile SQL to match
- Expand ERS cost ingest in `backend/etl/load_ers_costs.py` from 3 commodities to 9 (adding cotton, rice, peanut, sorghum, oats, barley) with per-acre columns and HTTP-failure fallback to cached files
- Rewrite `web_app/src/app/(tabs)/overview/page.tsx` to fetch pre-computed aggregates instead of raw state parquets, with 2022 Census-year baseline for growth calculations
- Rewrite `web_app/src/components/maps/USChoropleth.tsx` with dual-mode rendering (state choropleth, county choropleth inside a selected state) and lazy GeoJSON loading
- Rebuild all 50 state parquets plus NATIONAL with the new canonical aggregation and upload to S3 (originals backed up under timestamped `backups/` prefix)
- Retrain corn and soybean acreage models after reverting to the top-15-state panel (soy val MAPE 5.39 percent, corn 8.37 percent, both matching pre-expansion artifacts) and upload ensemble plus metrics to S3
Fixed
- Fix Indiana hay 2024 revenue from $338 billion to $296 million and cascade every overview and land-economy surface to honest numbers
- Fix rank placeholder so Indiana reads #10 instead of #0, computed from sorted state_totals
- Fix peer-comparison bars so all 4 peer states show real values instead of a single-bar chart
- Fix 25-year sparklines to aggregate across states for national view (national CORN reads 90.9M acres instead of one state's value)
- Fix Rules-of-Hooks violation in `USChoropleth` by moving `useMemo(viewState)` above the `!ready` early return
- Fix Overview layout so the map column stretches to match the fingerprint sidebar height instead of leaving empty space below
Devlog
With the Overview tab landed in the prior session, the remaining pending-frontend log tracked five tabs, one new route, and a cross-cutting design-token migration. The goal this session was to clear the whole list in one pass rather than leaving any tab in an intermediate state. The work went tab by tab with a verification checkpoint at the end of each, which kept the diff reviewable and caught real bugs at the moment they were introduced rather than at the end.
The fix that mattered most for user-visible correctness was on the Livestock tab. The existing code used a substring match on `commodity_desc.includes('CATTLE')` and summed every matching row, which pulled in CATTLE plus CATTLE, COWS plus CATTLE, CALVES and summed four quarterly inventories as if they were annual. For Iowa 2024 that produced "3.1B head of hogs." I switched to an exact `commodity_desc === 'HOGS'` match, added a `unit_desc === 'HEAD'` filter (the raw data also carries dollar-valuation rows under the same stat-cat, which was the other source of the inflation), and took MAX across quarterly reports rather than SUM. Iowa hogs now reads 25.2M head, which matches the NASS Hogs and Pigs report almost exactly.
Two backend data gaps showed up mid-implementation that the pending-work log did not anticipate. The `state_commodity_totals` aggregate parquet is supposed to carry `inventory_head` per (state, commodity, year), but the column is null for CATTLE, HOGS, and TURKEYS. I fell back to the raw state parquet for inventory (with the new sanitization filters doing the work the aggregate was supposed to do) and left a note in the check-in that the pipeline's `build_overview_aggregates.py` would need to populate that column for the aggregate to become load-bearing. Separately, the BLS QCEW parquet keys rows on `state_fips` and `naics` (not `state_alpha` and `naics_code` as the type I wrote assumed), so the Labor and Wages overlay silently rendered zero lines until I added a `STATE_ALPHA_TO_FIPS` lookup table and fixed the field names. Both were quick pivots but the kind of thing that only shows up when you actually run the code against production data.
The About page is the most visible addition: eight prose sections covering data sources, pipeline, architecture (iframed from the existing cloud diagram), per-module model descriptions for price, acreage, and yield, accuracy methodology, known limitations, and a glossary pointer. Underneath that sits a live inventory strip that fetches `/api/v1/meta/models` and renders one tile per model with a gate-status badge. The summary line reads from the endpoint, so when a model is retrained the page updates without a code change: 16 models as of today, 4 pass, 3 borderline, 9 fail. The SectionHeading migration at the end of the session took 19 of 24 candidate occurrences across 14 files and left the four that were intentionally not section headings (the component itself, a KpiCard internal label, a Callout with a dynamic color prop, and the dead ProductionCharts component that §5b replaced) as documented skips.
What's next: a single cleanup PR to delete `ProductionCharts.tsx` (flagged as spawn-task during the session), and re-verifying the Vercel deploy now that the `Tab` type includes `'about'`.
Changelog
Added
- Add ExportPaceCard for wheat in Market tab (fills the slot where the corn/soy RatioDial is hidden)
- Add /about route with 8 prose sections, iframed architecture diagram, and live model-metadata strip
- Add fetchBlsEstablishments fetcher and BlsEstablishmentRow type in overviewData.ts
- Add crop-type filter pills (All / Field Crops / Fruits / Vegetables / Livestock / Dairy / Poultry) to RevenueLeaderboard
- Add 80% interval glossary entry with conformal-calibration tooltip on AcreageCard
- Add ProductionSalesTable (11-year table with sparkline headers) replacing ProductionCharts on Livestock
- Add optional `years[]` and `unit` props to Sparkline for hover tooltips and first/last endpoint labels
- Add STATE_ALPHA_TO_FIPS lookup table in land-economy page for BLS QCEW joins
Changed
- Migrate 19 section heading occurrences across 14 files to the shared SectionHeading component
- Switch acreage accuracy chart to national-only rows pivoted by commodity (three lines with commodity colors)
- Switch yield accuracy chart to three shaded areas (corn, soybean, wheat) plus dashed baseline
- Replace hardcoded `left: 10%` on AcreageCard p10/p90 bar with positions computed against a p50 plus-or-minus 15% axis
- Wire Crops profit chart to /api/v1/crops/profit-history (replaces placeholder `yield * 4.5 - 700`)
- Filter Crops commodity picker to crops with AREA PLANTED greater than 0 in the last 3 years for the selected state
- Change Boom / Decline baseline from rolling 10-year to fixed 2012 Census
- Convert SeasonClock off-season yield cells to hatched repeating-linear-gradient for theme-independent contrast
- Replace anomaly ReferenceDot with custom shape containing a 12px transparent hit target over a 6px visible circle
Fixed
- Fix Vercel build error: add 'about' to the Tab type to match the TABS constant
- Fix Livestock inventory overcount: exact commodity match, unit_desc='HEAD' filter, MAX across quarterly reports (hogs 3.1B to 25.2M for Iowa)
- Fix Labor wage key mismatch: read `d[stateCode]` and `d['National Avg']` instead of non-existent `d.stateWage` and `d.nationalWage`
- Fix Farm Structure operations count: read `d.operations` from getOperationsTrend (was reading undefined `d.count`)
- Fix hide State filter pill on Market tab (Market is national-only)
- Fix AccuracyPanel crash on allSettled partial failure: use cancellation guard and per-fetch error handling
- Fix Crops Operations Census fallback: walk [year, year-1, 2022, 2017, 2012, 2007, 2002] and hide card when no data
Removed
- Drop the placeholder `yield * 4.5 - 700` profit formula from Crops page
Devlog
Three arcs of work this week, each building on the previous. The first arc closed most of the county-level NASS gaps the April 17 audit surfaced: a new `fill_county_gaps.py` wrapper re-audited local parquets, computed the residual gap set, and drove the main ingest one state at a time until new rows stopped arriving. Wheat (which had zero county rows before) landed 418K rows; total county coverage roughly doubled to 1.93M rows.
The second arc fixed three real math bugs on the Crops tab that were quietly lying to users. The Indiana corn panel showed 31M planted acres (true answer: 5.2M) because `filterData` summed every `reference_period_desc` variant that NASS publishes (YEAR, YEAR MAR ACREAGE, YEAR JUN ACREAGE, YEAR NOV FORECAST, and so on). A harvest-efficiency chart showed values up to 300% because of the same bug paired with a census-year SURVEY and CENSUS double-count. A percentile caption rendered literally as "-th percentile of the 25-year record" because the template was never wired. Fixing the filter, switching single-value aggregates from sum to max with unit gates, and computing the ordinal suffix client-side closed all three. A bonus fix for five-year averages skipping census-year gaps followed from the same investigation.
The third arc was the redesign the P0 fixes exposed as worth doing. I ported five enrichment ingests from aquifer-watch (NOAA nClimDiv precipitation, NASS Census irrigated acres, USDA IWMS water application, ERS revenue, EIA state electricity prices) and wired them nationwide. The Crops tab now renders a 92-county Indiana choropleth colored by yield anomaly, with click-to-drill into an Ogallala-style panel showing KPIs, real NOAA growing-season precipitation, and a nearest-peers table. Layout went through grid, sticky, and finally float-based with `clear: left` on the peers card, which gave a clean inverted-L wrap in both state and county modes.
What's next: Re-run the NASS irrigated-acres ingest at lower concurrency to pick up the five states that 403'd, and wire IWMS + ERS into a water-productivity KPI (bu per acre-foot) on the county drill.
Changelog
Added
- Add `pipeline/fill_county_gaps.py` resume-by-default gap filler, closing 1,192 of 1,764 residual (state, commodity, year) gaps across 2 rounds
- Add `pipeline/enrichments/` subdirectory with 5 ports from aquifer-watch covering NOAA nClimDiv, NASS irrigated acres, USDA IWMS, ERS revenue, EIA electricity
- Add `--layout {state,county}` flag to `pipeline/upload_to_s3.py`, decoupling state and county prefixes and guarding the manifest stamp
- Add `fetchCountyData`, `fetchCountyPrecip`, `fetchIrrigatedCountyAcres` helpers in `serviceData.ts`
- Add `CropsStateMap` component (SVG choropleth, hover tooltip, click-to-drill) with in-module county rollup
- Add `CropsCountyDrill` component with NOAA precipitation context block, KPIs, method signature
- Add `CropsPeers` component (top-5 / weakest-3 in state mode; 6 nearest-yield peers in county mode)
- Add `.crops-invL-wrap` float-based layout in `globals.css`, 720px mobile breakpoint
Changed
- Update `filterData` in `processData.ts` to drop `YEAR - *` reference periods and include `reference_period_desc` plus source / class / practice in the dedup key
- Update `getCommodityStory` to use max with unit gates instead of sum, so biotech PCT sub-rows can no longer inflate area totals
- Update `YieldTrendChart` to compute percentile ordinal client-side (96th, 91st, 22nd) and drop the literal "th" template suffix
- Update `crops/page.tsx` 5-year baseline to skip zero-yield years so census-year SURVEY gaps do not pull the average down
Fixed
- Fix Crops tab planted acres inflation (Indiana corn 31.0M -> 5.2M) by filtering interim / forecast reference periods
- Fix harvest efficiency bars exceeding 100% (peaks at 300% -> max at 100%) via unit-consistent rollup
- Fix broken percentile caption in yield trend (rendered "-th" -> renders "96th percentile of the 25-year record")