Agricultural Data Analysis
In progress: Interactive USDA analytics dashboard with commodity price forecasting
What It Is
An interactive web-based dashboard for exploring long-term U.S. agricultural trends using USDA QuickStats data, combined with commodity price forecasting models. The platform surfaces production, yield, price, land use, labor, and economic data at the state and county level, and provides probabilistic forward-looking price forecasts that highlight when supply/demand fundamentals diverge from current futures pricing.
Architecture
A three-tier system:
- Frontend (Next.js on Vercel): Six analytical dashboards (crops, animals, land, labor, economics, predictions) with interactive charts (Recharts), geographic visualizations (Deck.gl choropleth maps), and browser-side parquet file reading (hyparquet) for cost-effective data delivery.
- Backend (FastAPI on EC2): Serves commodity price forecasts via REST endpoints. PostgreSQL on AWS RDS stores futures prices, WASDE reports, ERS production costs, and forecast outputs.
- Data Pipeline (EC2 cron): Automated ingestion on schedule: monthly USDA QuickStats via API, daily CME futures and FRED dollar index, monthly WASDE CSV downloads, and annual ERS production cost updates. Data lands in both S3 (parquet for frontend) and RDS (for forecasting models).
Data Sources
- USDA QuickStats: All 50 states plus counties, 2001 to 2025. Monthly API pull via the NASS QuickStats endpoint.
- CME Futures: Daily corn, soybean, and wheat contract prices via yfinance.
- FRED API: US Dollar Index (DXY), updated daily.
- USDA WASDE: Monthly World Agricultural Supply and Demand Estimates.
- USDA ERS: Annual production cost estimates per bushel by commodity.
Current State
The frontend dashboard with five core views is complete. The commodity price forecasting module is fully implemented end-to-end (data ingestion, feature engineering, model training with walk-forward validation, API endpoints, and frontend predictions dashboard). EC2 cron scheduling and SNS alerting are operational.
Next priorities include web app performance improvements (replacing full national parquet fetches with targeted Athena queries), UX polish (search, mobile navigation, loading states), accessibility improvements, and additional forecasting modules for crop yield and planted acreage (tech specs written, models not yet trained).
Devlog
With the commodity price forecasting module live, I turned to two things that needed attention before moving to the next ML module: extending the data pipeline to county-level granularity, and cleaning up operational rough edges.
The county-level NASS data extension was the bigger piece. The existing pipeline pulled state-level aggregates from the QuickStats API, which is fine for national trends but too coarse for the crop yield prediction module I have planned next. County-level data is an order of magnitude larger (thousands of FIPS codes versus 50 states), so the ingestion needed to handle pagination more carefully and the parquet partitioning scheme needed updating. I kept the state-level files as the primary dashboard data source and added county-level files as a separate partition layout that the yield module will consume.
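A minimal sketch of the pattern, assuming the public QuickStats endpoint and illustrative helper names (`fetch_county_year`, `write_county_partition`) rather than the pipeline's actual code: one request per state and year keeps responses manageable, and county files land in their own partition tree.

```python
import pandas as pd
import requests

QUICKSTATS_URL = "https://quickstats.nass.usda.gov/api/api_GET/"

def fetch_county_year(api_key: str, state_alpha: str, year: int) -> pd.DataFrame:
    """Pull county-level rows for one state and year; splitting requests this
    finely keeps each response well under the API's per-request row limit."""
    params = {
        "key": api_key,
        "agg_level_desc": "COUNTY",
        "state_alpha": state_alpha,
        "year": year,
        "format": "JSON",
    }
    resp = requests.get(QUICKSTATS_URL, params=params, timeout=120)
    resp.raise_for_status()
    return pd.DataFrame(resp.json()["data"])

def write_county_partition(df: pd.DataFrame, root: str, state: str, year: int) -> None:
    # root is an s3:// prefix (or a pre-created local directory); county files
    # live in their own partition tree, separate from the state-level files
    # the dashboard keeps reading.
    df.to_parquet(f"{root}/agg=county/state={state}/year={year}/data.parquet", index=False)
```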
On the operational side, I fixed a redundant API call pattern in the pipeline where the same QuickStats query was being made multiple times during a single run. This was not causing data issues (the upsert logic handled duplicates), but it was wasting API quota and slowing down runs unnecessarily. I also tightened up the cron runner's mode handling and verified the systemd service restarts cleanly after WASDE inference updates the model artifacts.
The model artifact flow is now clean: train locally, upload pickles to S3, EC2 pulls from S3 on FastAPI startup (or falls back to local artifacts if S3 is unreachable). The inference script runs all 18 commodity/horizon combinations in a single batch, writes predictions to RDS, and the service restart picks up any updated model files.
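A rough sketch of that fallback logic, with the bucket, key, and function names as placeholders:

```python
import pickle
from pathlib import Path

import boto3
from botocore.exceptions import BotoCoreError, ClientError

def load_model(bucket: str, key: str, local_dir: str = "artifacts"):
    """Prefer the S3 copy of a pickled model; fall back to the local artifact
    if S3 is unreachable, and fail only when neither exists."""
    local_path = Path(local_dir) / Path(key).name
    try:
        Path(local_dir).mkdir(parents=True, exist_ok=True)
        boto3.client("s3").download_file(bucket, key, str(local_path))
    except (BotoCoreError, ClientError):
        if not local_path.exists():
            raise
    with open(local_path, "rb") as f:
        return pickle.load(f)
```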
This is not a flashy milestone, but the platform is now operationally solid. The pipeline runs reliably on its cron schedule, the prediction service auto-recovers, and the data layer supports both the state-level dashboard and the county-level work coming next.
What's next: The crop yield prediction module (Module 04), which will use the county-level NASS data alongside NOAA weather feeds and NASA solar radiation data. This module doubles as a Cloud Computing class deliverable, so the architecture needs to be fully documented and reproducible.
Changelog
Added
- Add county-level NASS data ingestion from USDA QuickStats API
- Add county-level parquet partition layout alongside existing state-level files
- Add batch inference script for all 18 commodity/horizon model combinations
Changed
- Update pipeline pagination to handle county-level data volumes
- Update parquet partitioning scheme to support dual granularity (state + county)
- Update cron_runner.sh mode handling for cleaner operational flow
Fixed
- Fix redundant API calls in QuickStats pipeline wasting quota and slowing runs
- Fix systemd service restart behavior after WASDE inference model updates
Devlog
This was the densest two days of the project. I built the complete commodity price forecasting module end-to-end, from database tables through model training to a live frontend dashboard. Nine implementation steps, all shipped.
The backend starts with four ETL scripts pulling data from different sources on different schedules: CME futures prices daily via yfinance (I had to switch from Nasdaq Data Link after their CHRIS/CME endpoint was retired), US Dollar Index from FRED, WASDE supply-and-demand reports monthly from USDA, and ERS production costs annually. Each script writes to its own RDS table with upsert logic so reruns are safe.
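The upsert pattern looks roughly like the sketch below (the table shape and column names are illustrative, not the actual schema): PostgreSQL's ON CONFLICT turns a rerun of the same day's pull into an in-place update instead of a duplicate-key failure.

```python
from sqlalchemy import Column, Date, Float, MetaData, String, Table
from sqlalchemy.dialects.postgresql import insert

metadata = MetaData()

# Illustrative shape of the futures table, not the project's exact schema.
futures_daily = Table(
    "futures_daily", metadata,
    Column("symbol", String, primary_key=True),
    Column("trade_date", Date, primary_key=True),
    Column("settle", Float),
    Column("open_interest", Float),
)

def upsert_futures(engine, rows: list[dict]) -> None:
    """Safe to re-run: existing (symbol, trade_date) rows are updated in place."""
    stmt = insert(futures_daily).values(rows)
    stmt = stmt.on_conflict_do_update(
        index_elements=["symbol", "trade_date"],
        set_={"settle": stmt.excluded.settle,
              "open_interest": stmt.excluded.open_interest},
    )
    with engine.begin() as conn:
        conn.execute(stmt)
```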
The feature engineering layer builds an 18-feature matrix from these sources: market signals (spot price, term spread, basis, open interest change), fundamental indicators (stocks-to-use ratio and its historical percentile, WASDE surprise magnitude), macro factors (DXY level and 30-day change), cost metrics (production cost per bushel, price-to-cost ratio), and interaction terms (corn-soy ratio, seasonal factor). Every feature is validated with a Pandera schema before it reaches the model.
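The validation gate is conceptually a Pandera schema like the sketch below; the column names and bounds shown are a small illustrative subset of the 18 features, not the real schema definition.

```python
import pandas as pd
import pandera as pa

feature_schema = pa.DataFrameSchema(
    {
        "spot_price": pa.Column(float, pa.Check.gt(0)),
        "term_spread": pa.Column(float),
        "stocks_to_use": pa.Column(float, pa.Check.in_range(0, 1)),
        "dxy_level": pa.Column(float, pa.Check.gt(0)),
        "price_to_cost_ratio": pa.Column(float, pa.Check.gt(0)),
    },
    strict=False,  # only a few of the 18 columns are shown here
)

def validate_features(df: pd.DataFrame) -> pd.DataFrame:
    """Raises pandera.errors.SchemaError before anything reaches the model."""
    return feature_schema.validate(df)
```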
The model itself is an ensemble: SARIMAX for capturing seasonality and trend, LightGBM with quantile regression for the probabilistic spread, a Ridge meta-learner that blends them, and an isotonic calibration layer for the probability estimates. I trained 18 model sets (3 commodities times 6 monthly horizons) using walk-forward validation with proper temporal splits. A Mahalanobis distance check flags regime anomalies at inference time -- if the current feature vector is far from the training distribution, the system defers to the futures curve rather than trusting the model.
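The regime check itself is simple; a sketch under the assumption that the training feature mean and covariance are stored alongside each model set (the threshold value here is illustrative):

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

def is_regime_anomaly(x: np.ndarray, train_mean: np.ndarray,
                      train_cov: np.ndarray, threshold: float = 5.0) -> bool:
    """True when the current feature vector sits far outside the training
    distribution, in which case the system defers to the futures curve."""
    dist = mahalanobis(x, train_mean, np.linalg.inv(train_cov))
    return dist > threshold
```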
The API serves four endpoints: the core forecast (p10/p50/p90), calibrated probability above a user-specified threshold, WASDE signal analysis, and historical forecast-vs-actual comparisons. The frontend PredictionsDashboard has a fan chart, probability gauge, WASDE signal card, key driver callout (top SHAP feature), and regime alert.
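A hedged sketch of what the core forecast endpoint's contract could look like; the route path, response fields, and query helper are assumptions for illustration, not the deployed API.

```python
from fastapi import APIRouter
from pydantic import BaseModel

router = APIRouter(prefix="/api/v1/predict")

class PriceForecast(BaseModel):
    commodity: str
    horizon_months: int
    p10: float
    p50: float
    p90: float

async def fetch_latest_forecast(commodity: str, horizon_months: int) -> dict:
    """Placeholder for the RDS query that returns the newest stored forecast."""
    raise NotImplementedError

@router.get("/price/{commodity}", response_model=PriceForecast)
async def get_price_forecast(commodity: str, horizon_months: int = 1) -> PriceForecast:
    row = await fetch_latest_forecast(commodity, horizon_months)
    return PriceForecast(**row)
```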
The whole thing runs on EC2: cron handles the daily market data pulls and the monthly WASDE-triggered inference, and a systemd service auto-restarts the FastAPI server. Total infrastructure cost is about $22/month.
What's next: Extending the pipeline to county-level NASS data for finer geographic granularity, and operational polish before the yield prediction module.
Changelog
Added
- Add 5 RDS tables: futures_daily, wasde_releases, price_forecasts, ers_production_costs, dxy_daily
- Add ETL scripts: ingest_futures (yfinance), ingest_fred (DXY), ingest_wasde (USDA CSV), load_ers_costs (Excel)
- Add 18-feature engineering pipeline with Pandera schema validation
- Add PriceEnsemble model: SARIMAX + LightGBM quantile + Ridge meta-learner + isotonic calibration
- Add SHAP TreeExplainer for key driver identification per forecast
- Add Mahalanobis distance regime anomaly detection at inference time
- Add walk-forward training pipeline (2010-2019 train, 2020-2022 val, 2023-2024 test)
- Add 4 API endpoints: forecast, probability, wasde-signal, history
- Add PredictionsDashboard with fan chart, probability gauge, WASDE signal card, key driver callout
- Add PREDICTIONS tab to main dashboard ViewMode
- Add EC2 cron schedule: daily market data, monthly WASDE + inference, annual ERS costs
- Add systemd service for FastAPI auto-restart on EC2
- Add S3 model artifact storage and fallback loading
Changed
- Update cron_runner.sh with --daily, --monthly-wasde, --annual-ers modes
Devlog
With the Next.js frontend stable, I shifted to the backend infrastructure that the prediction modules will need. This was less about visible features and more about laying pipe: a FastAPI skeleton, database setup, and fixing a pipeline reliability problem that had been lurking since the Streamlit days.
The FastAPI backend is straightforward: async SQLAlchemy 2.0 with asyncpg connecting to a PostgreSQL instance on AWS RDS (db.t4g.micro, about $15/month). Alembic handles migrations. The router structure is set up for the prediction endpoints I will build next, with config management via pydantic-settings reading from environment variables. The entry point loads models at startup via a lifespan context manager, which will matter once trained model artifacts exist.
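The startup pattern is roughly the sketch below, with `load_all_models` standing in for the eventual artifact loader:

```python
from contextlib import asynccontextmanager
from fastapi import FastAPI

def load_all_models() -> dict:
    """Placeholder for the artifact loader (S3 pull with local fallback)."""
    return {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load model artifacts once at startup instead of on every request.
    app.state.models = load_all_models()
    yield
    app.state.models.clear()

app = FastAPI(lifespan=lifespan)
```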
The more important fix was the EC2 OOM kill. The USDA QuickStats pipeline was trying to load the entire national dataset into memory for processing, which worked on my local machine but blew up the EC2 instance. I refactored the pipeline to process data incrementally: ingest by state, merge parquet files in a multi-step process, and never hold more than one state's worth of data in memory at a time. This also made the pipeline resumable -- if it fails on state 37, it picks up from 37 on the next run instead of starting over.
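The shape of the loop, with the fetch step abstracted away since the real script does more cleaning and validation:

```python
from pathlib import Path
from typing import Callable

import pandas as pd

def run_pipeline(states: list[str],
                 fetch_state: Callable[[str], pd.DataFrame],
                 out_dir: str) -> None:
    """Process one state at a time; skip states whose parquet already exists,
    so a failed run resumes where it stopped instead of starting over."""
    for state in states:
        out_path = Path(out_dir) / f"state={state}" / "data.parquet"
        if out_path.exists():
            continue
        df = fetch_state(state)  # only one state's rows are in memory at once
        out_path.parent.mkdir(parents=True, exist_ok=True)
        df.to_parquet(out_path, index=False)
```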
I also did a Phase 3 UX and accessibility pass on the web app: bug fixes, some analytics instrumentation, and accessibility improvements flagged during the Next.js migration. And I cleaned out the legacy Dash app code that was still sitting in the repo from an earlier prototype. The repository is now cleanly split between `web_app/` (Next.js frontend), `backend/` (FastAPI), and `pipeline/` (data ingestion).
What's next: Building the commodity price forecasting module -- the first ML system on this platform, using WASDE reports, CME futures, and macro indicators to produce probabilistic price forecasts.
Changelog
Added
- Add FastAPI backend skeleton with async SQLAlchemy 2.0 and asyncpg
- Add PostgreSQL on AWS RDS (db.t4g.micro) with Alembic migrations
- Add pydantic-settings config management for backend environment
- Add incremental state-by-state pipeline processing to prevent EC2 OOM
- Add multi-step parquet merge for resumable pipeline runs
- Add Phase 3 UX and accessibility improvements to web app
Changed
- Update pipeline to process data incrementally instead of full national load
- Update repository structure: clean split between web_app/, backend/, pipeline/
Fixed
- Fix EC2 OOM kill during national dataset processing
- Fix various web app bugs identified during Phase 3 review
Removed
- Remove legacy Dash app code from repository
Devlog
After two months away from this project (other work took priority over winter), I came back and rewrote the entire dashboard from Streamlit to Next.js 16 with React 19 and TypeScript. The Streamlit version worked for exploration, but the UX ceiling was low: no component-level state management, limited interactivity, and a deployment model that did not fit well with Vercel hosting.
The migration took about a week of focused work. Each of the five original views (Crops, Animals, Land, Labor, Economics) became its own React component. Charts moved from Matplotlib/Plotly to Recharts, which integrates naturally with React's rendering lifecycle. The geographic visualizations got a major upgrade: Deck.gl with MapLibre GL replaced the hexagonal map, giving me proper choropleth layers with county-level aggregation and interactive tooltips.
The most interesting architectural decision was browser-side parquet reading via hyparquet. Instead of standing up a backend data proxy, the frontend fetches parquet files directly from S3 and parses them in the browser. This means the data layer has zero backend cost for read-heavy dashboard usage. S3 serves the files, the browser does the parsing, and Athena is available as a fallback for complex aggregation queries that would be expensive to run client-side.
The build process was rougher than expected. I went through three rounds of build fixes on February 12 alone (type errors, missing dependencies, Next.js App Router quirks). The v2 data integrity pass on February 16 caught several cases where the parquet column names did not match the TypeScript interfaces I had defined.
I also established the design system during this migration: a dark-themed palette with consistent chart colors, responsive layout via Tailwind CSS 4, and a design token system that will make future theming straightforward.
What's next: Standing up a FastAPI backend for the prediction services and hardening the pipeline to handle larger data volumes without running out of memory on EC2.
Changelog
Added
- Add Next.js 16 frontend with React 19, TypeScript 5, and App Router
- Add Recharts 3.7 integration for interactive chart components
- Add Deck.gl 9.2 + MapLibre GL 5.18 for choropleth and aggregation layer maps
- Add hyparquet for browser-side parquet file reading (zero-backend data serving)
- Add S3-first data fetching strategy with local API fallback
- Add Athena query integration for complex aggregation queries
- Add design system with dark theme palette and consistent chart color tokens
- Add Vercel deployment configuration
Changed
- Update dashboard architecture from Streamlit (Python) to Next.js (React/TypeScript)
- Update chart library from Matplotlib/Plotly to Recharts
- Update geographic visualization from hexagonal map to Deck.gl choropleth layers
Fixed
- Fix multiple build errors from Next.js App Router migration (3 rounds of fixes)
- Fix parquet column name mismatches with TypeScript interfaces in v2 data integrity pass
Devlog
I stood up the first working version of the USDA agricultural analytics dashboard in about two weeks. The initial goal was straightforward: pull historical crop, livestock, land use, labor, and economic data from the USDA QuickStats API and present it in an interactive dashboard that makes state-level agricultural trends explorable.
The data pipeline fetches from the QuickStats API, cleans and validates the response, and writes partitioned parquet files to an S3 bucket. Parquet was the right call here because the dataset is wide (dozens of commodity types, metrics, and geographies) and read-heavy. The pipeline handles incremental updates so I can re-run it monthly without re-downloading the full history.
The dashboard itself is a Streamlit app with five views: Crops (production, yield, acreage), Animals (inventory, production), Land (use composition), Labor (employment, wages), and Economics (prices received, revenue). I added a hexagonal state map for geographic exploration, which looks better than a standard choropleth for agricultural data because it gives smaller states equal visual weight.
Most of the friction was around deployment. I built a Docker container and GitHub Actions CI/CD pipeline on December 6, then spent the next day debugging S3 connectivity issues and data loading strategies. The demo mode (single-state focus for testing) helped isolate problems without loading the full national dataset. I also ran into memory issues when loading all states at once, which I solved by switching to a lazy-loading strategy that only fetches the states a user actually navigates to.
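The lazy-loading idea, sketched with Streamlit's cache and a placeholder bucket path (reading s3:// paths like this assumes s3fs is installed):

```python
import pandas as pd
import streamlit as st

@st.cache_data
def load_state(state: str) -> pd.DataFrame:
    # Bucket and prefix are placeholders, not the project's actual layout.
    return pd.read_parquet(f"s3://example-bucket/quickstats/state={state}/data.parquet")

state = st.selectbox("State", ["IA", "IL", "NE", "KS", "MN"])
df = load_state(state)  # only the selected state is fetched, then cached
```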
By December 13, the MVP was documented, the README was in place, and the S3 environment was stable. The Streamlit approach worked for getting something up fast, but I already knew the UX limitations would push me toward a proper frontend framework.
What's next: Migrating the dashboard from Streamlit to Next.js for better interactivity, Vercel hosting, and component-level state management.
Changelog
Added
- Add USDA QuickStats API ingestion pipeline with incremental updates
- Add AWS S3 parquet storage with partitioned file layout
- Add Streamlit dashboard with 5 views (Crops, Animals, Land, Labor, Economics)
- Add hexagonal state map visualization for geographic exploration
- Add Dockerfile and GitHub Actions CI/CD pipeline
- Add demo mode for single-state testing and development
- Add detailed S3 error logging for debugging connectivity issues
- Add README documentation with screenshots
Changed
- Update data loading to lazy-fetch strategy (per-state on navigation)
Fixed
- Fix labor statistics data accuracy in pipeline
- Fix S3 environment variable configuration for deployment
Devlog
Module 04 went from zero to 60 trained models in a single session. The goal: county-level, weekly, in-season crop yield prediction for corn, soybeans, and wheat using weather data, soil characteristics, and historical NASS yield patterns.
The session started with infrastructure. I created the Alembic migration for three new RDS tables (soil_features, feature_weekly, yield_forecasts), wrote ORM models and Pydantic schemas, then built the full ETL stack: five new ingestion scripts for NOAA GHCN weather, NASA POWER solar/VPD, US Drought Monitor, NASS crop condition ratings, and PRISM precipitation normals. Each follows the existing common.py pattern with upsert semantics and S3 archival. I also wrote static data loaders for county centroids (Census Gazetteer, 3,222 counties), SSURGO soil features (NRCS SDA API, 2,391 counties with AWC and drainage class), and NOAA station-to-county mapping (Haversine nearest-station within 50km, 3,140 mapped, 82 NASA POWER fallback).
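The station-to-county matching is a plain Haversine nearest-neighbor search; a sketch assuming arrays of station coordinates and a single county centroid, with the 50 km cutoff from above:

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres; works on scalars or numpy arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

def nearest_station(county_lat, county_lon, station_lats, station_lons, max_km=50.0):
    """Index of the closest station, or None to fall back to NASA POWER."""
    dists = haversine_km(county_lat, county_lon, station_lats, station_lons)
    idx = int(np.argmin(dists))
    return idx if dists[idx] <= max_km else None
```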
The real unlock was processing the GHCN bulk archive. Instead of rate-limited API calls (1,000/day for 3,200 counties), I processed the 7.5GB daily-summaries-latest.tar.gz directly: extract only the 3,105 mapped US stations, parse TMAX/TMIN/PRCP, convert units, and map to county FIPS. The result: 16 million daily weather records (2000-2025) processed in 4.3 minutes into a 71 MB parquet. That single file replaced what would have taken weeks via the API.
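The bulk processing is conceptually a single streaming pass over the tarball; a sketch that assumes one CSV per station inside the archive and GHCN's tenths-of-units convention (both worth verifying against a sample member before trusting the output):

```python
import tarfile

import pandas as pd

def process_ghcn(archive_path: str, mapped_stations: set[str]) -> pd.DataFrame:
    """Keep only mapped US stations and the TMAX/TMIN/PRCP columns."""
    frames = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if not member.name.endswith(".csv"):
                continue
            station_id = member.name.split("/")[-1][:-4]
            if station_id not in mapped_stations:
                continue  # skip the vast majority of stations we never mapped
            df = pd.read_csv(tar.extractfile(member),
                             usecols=["STATION", "DATE", "TMAX", "TMIN", "PRCP"])
            # Assumes values are stored in tenths (0.1 deg C / 0.1 mm).
            for col in ("TMAX", "TMIN", "PRCP"):
                df[col] = df[col] / 10.0
            frames.append(df)
    return pd.concat(frames, ignore_index=True)
```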
The wheat commodity required a targeted NASS fetch (286K rows across 30 states, 58 minutes) because the original county pipeline used "WINTER WHEAT" as the commodity name, but county-level data uses the aggregated "WHEAT" descriptor. A class_desc filter fix (accepting "WINTER" alongside "ALL CLASSES") resolved the training data gap.
First-pass training used only NASS-derived features (county mean yield, trend, prior year, std deviation, year). All corn and soybean models passed the 10% baseline gate, but features were identical across weeks since they didn't incorporate in-season weather. The retrain with weather features (GDD accumulated from planting, season precipitation, precipitation deficit vs normals, average max temperature, heat stress days) added real week-specific differentiation. Corn week 20 hit 14.15% val RRMSE (40% better than the 23.78% county-mean baseline), and soybean week 10 reached 16.50% (29% improvement). Wheat remains the hardest crop at 20-21% RRMSE, narrowly missing the 10% improvement gate. Its cross-year phenology (fall planting, summer harvest) makes a simple day-of-year season window less effective.
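The weather features are straightforward once the daily records exist; a sketch using the standard base-10 °C / cap-30 °C GDD formula, with the heat-stress threshold and season-window handling as simplifying assumptions:

```python
import pandas as pd

def weekly_weather_features(wx: pd.DataFrame, plant_date: pd.Timestamp,
                            week_end: pd.Timestamp) -> dict:
    """wx holds one county-season of daily tmax/tmin (deg C) and prcp (mm)."""
    season = wx[(wx["date"] >= plant_date) & (wx["date"] <= week_end)]
    tmax = season["tmax"].clip(upper=30)   # cap at 30 deg C
    tmin = season["tmin"].clip(lower=10)   # floor at the 10 deg C base
    return {
        "gdd_accum": ((tmax + tmin) / 2 - 10).clip(lower=0).sum(),
        "precip_total": season["prcp"].sum(),
        "tmax_avg": season["tmax"].mean(),
        "heat_stress_days": int((season["tmax"] > 34).sum()),  # threshold is illustrative
    }
```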
The frontend section integrates into the existing Predictions dashboard: commodity tabs, a week-of-season slider (1-20), confidence badges (low/medium/high by week), county summary stats, a map placeholder for the Deck.gl county choropleth, a forecast detail card with p10/p50/p90 ranges, a confidence strip showing how uncertainty narrows through the season, and a season accuracy chart.
What's next: Implement the Deck.gl GeoJsonLayer county choropleth using the downloaded TopoJSON, add drought and crop condition features to the training pipeline, and deploy the updated FastAPI backend to EC2 with the 60 model artifacts.
Changelog
Added
- Add Alembic migration 003 with `soil_features`, `feature_weekly`, `yield_forecasts` tables
- Add ORM models (`SoilFeature`, `FeatureWeekly`, `YieldForecast`) and 4 Pydantic response schemas
- Add 5 ETL scripts: `ingest_noaa.py`, `ingest_nasa_power.py`, `ingest_drought.py`, `ingest_crop_conditions.py`, `load_prism_normals.py`
- Add 3 static data loaders: `load_county_centroids.py` (3,222 counties), `load_ssurgo.py` (2,391 soil records), `build_station_map.py` (3,140 stations)
- Add `process_ghcn_bulk.py` for bulk GHCN archive processing (16M rows in 4.3 min)
- Add `fetch_wheat_county.py` for targeted wheat NASS county data (286K rows)
- Add `yield_features.py` with 7-feature vector (GDD, CCI, precip deficit, VPD stress, drought, soil AWC, soil drainage)
- Add `yield_model.py` (LightGBM quantile ensemble: p10/p50/p90) and `train_yield.py` (60 models, weather-enhanced, 10 features)
- Add `yield_inference.py` for weekly county-level prediction pipeline
- Add `yield_forecast.py` FastAPI router with 3 endpoints at `/api/v1/predict/yield`
- Add `useYieldForecast.ts` React hook and `YieldForecastSection.tsx` with 4 sub-components
- Add `--weekly-yield` mode to `cron_runner.sh` (Thursday 10 AM ET)
- Add US counties TopoJSON (842KB) for Deck.gl choropleth
Changed
- Update `main.py` to load 60 yield models at startup with S3 fallback, register yield router, bump to v0.3.0
- Update `config.py` with `NOAA_API_KEY` setting
- Update `PredictionsDashboard.tsx` to include `YieldForecastSection`
- Update `train_yield.py` to integrate GHCN weather features (GDD, precip, hot days) with week-specific computation
- Update `alembic/env.py` to register yield module ORM models
Fixed
- Fix `load_county_centroids.py` to handle Census Gazetteer ZIP format and strip trailing whitespace from column names
- Fix `load_ssurgo.py` to parse SDA API list-of-lists response format (was expecting dicts)
- Fix wheat county yield filter to accept `class_desc='WINTER'` alongside `'ALL CLASSES'`