Agricultural Data Analysis
In progress: Interactive USDA analytics dashboard with commodity price forecasting
What It Is
An interactive web-based dashboard for exploring long-term U.S. agricultural trends using USDA QuickStats data, combined with commodity price forecasting models. The platform surfaces production, yield, price, land use, labor, and economic data at the state and county level, and provides probabilistic forward-looking price forecasts that highlight when supply/demand fundamentals diverge from current futures pricing.
Architecture
A three-tier system:
- Frontend (Next.js on Vercel): Six analytical dashboards (crops, animals, land, labor, economics, predictions) with interactive charts (Recharts), geographic visualizations (Deck.gl choropleth maps), and browser-side parquet file reading (hyparquet) for cost-effective data delivery.
- Backend (FastAPI on EC2): Serves commodity price forecasts via REST endpoints. PostgreSQL on AWS RDS stores futures prices, WASDE reports, ERS production costs, and forecast outputs.
- Data Pipeline (EC2 cron): Automated ingestion on schedule: monthly USDA QuickStats via API, daily CME futures and FRED dollar index, monthly WASDE CSV downloads, and annual ERS production cost updates. Data lands in both S3 (parquet for frontend) and RDS (for forecasting models).
Data Sources
- USDA QuickStats: All 50 states plus counties, 2001 to 2025. Monthly API pull via the NASS QuickStats endpoint.
- CME Futures: Daily corn, soybean, and wheat contract prices via yfinance.
- FRED API: US Dollar Index (DXY), updated daily.
- USDA WASDE: Monthly World Agricultural Supply and Demand Estimates.
- USDA ERS: Annual production cost estimates per bushel by commodity.
Current State
The frontend dashboard with five core views is complete. The commodity price forecasting module is fully implemented end-to-end (data ingestion, feature engineering, model training with walk-forward validation, API endpoints, and frontend predictions dashboard). EC2 cron scheduling and SNS alerting are operational.
Next priorities include web app performance improvements (replacing full national parquet fetches with targeted Athena queries), UX polish (search, mobile navigation, loading states), accessibility improvements, and additional forecasting modules for crop yield and planted acreage (tech specs written, models not yet trained).
Devlog
With the commodity price forecasting module live, I turned to two things that needed attention before moving to the next ML module: extending the data pipeline to county-level granularity, and cleaning up operational rough edges.
The county-level NASS data extension was the bigger piece. The existing pipeline pulled state-level aggregates from the QuickStats API, which is fine for national trends but too coarse for the crop yield prediction module I have planned next. County-level data is an order of magnitude larger (thousands of FIPS codes versus 50 states), so the ingestion needed to handle pagination more carefully and the parquet partitioning scheme needed updating. I kept the state-level files as the primary dashboard data source and added county-level files as a separate partition layout that the yield module will consume.
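A minimal sketch of the pattern, assuming the public QuickStats endpoint and illustrative helper names (`fetch_county_year`, `write_county_partition`) rather than the pipeline's actual code: one request per state and year keeps responses manageable, and county files land in their own partition tree.

```python
import pandas as pd
import requests

QUICKSTATS_URL = "https://quickstats.nass.usda.gov/api/api_GET/"

def fetch_county_year(api_key: str, state_alpha: str, year: int) -> pd.DataFrame:
    """Pull county-level rows for one state and year; splitting requests this
    finely keeps each response well under the API's per-request row limit."""
    params = {
        "key": api_key,
        "agg_level_desc": "COUNTY",
        "state_alpha": state_alpha,
        "year": year,
        "format": "JSON",
    }
    resp = requests.get(QUICKSTATS_URL, params=params, timeout=120)
    resp.raise_for_status()
    return pd.DataFrame(resp.json()["data"])

def write_county_partition(df: pd.DataFrame, root: str, state: str, year: int) -> None:
    # root is an s3:// prefix (or a pre-created local directory); county files
    # live in their own partition tree, separate from the state-level files
    # the dashboard keeps reading.
    df.to_parquet(f"{root}/agg=county/state={state}/year={year}/data.parquet", index=False)
```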
On the operational side, I fixed a redundant API call pattern in the pipeline where the same QuickStats query was being made multiple times during a single run. This was not causing data issues (the upsert logic handled duplicates), but it was wasting API quota and slowing down runs unnecessarily. I also tightened up the cron runner's mode handling and verified the systemd service restarts cleanly after WASDE inference updates the model artifacts.
The model artifact flow is now clean: train locally, upload pickles to S3, EC2 pulls from S3 on FastAPI startup (or falls back to local artifacts if S3 is unreachable). The inference script runs all 18 commodity/horizon combinations in a single batch, writes predictions to RDS, and the service restart picks up any updated model files.
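A rough sketch of that fallback logic, with the bucket, key, and function names as placeholders:

```python
import pickle
from pathlib import Path

import boto3
from botocore.exceptions import BotoCoreError, ClientError

def load_model(bucket: str, key: str, local_dir: str = "artifacts"):
    """Prefer the S3 copy of a pickled model; fall back to the local artifact
    if S3 is unreachable, and fail only when neither exists."""
    local_path = Path(local_dir) / Path(key).name
    try:
        Path(local_dir).mkdir(parents=True, exist_ok=True)
        boto3.client("s3").download_file(bucket, key, str(local_path))
    except (BotoCoreError, ClientError):
        if not local_path.exists():
            raise
    with open(local_path, "rb") as f:
        return pickle.load(f)
```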
This is not a flashy milestone, but the platform is now operationally solid. The pipeline runs reliably on its cron schedule, the prediction service auto-recovers, and the data layer supports both the state-level dashboard and the county-level work coming next.
What's next: The crop yield prediction module (Module 04), which will use the county-level NASS data alongside NOAA weather feeds and NASA solar radiation data. This module doubles as a Cloud Computing class deliverable, so the architecture needs to be fully documented and reproducible.
Changelog
Added
- Add county-level NASS data ingestion from USDA QuickStats API
- Add county-level parquet partition layout alongside existing state-level files
- Add batch inference script for all 18 commodity/horizon model combinations
Changed
- Update pipeline pagination to handle county-level data volumes
- Update parquet partitioning scheme to support dual granularity (state + county)
- Update cron_runner.sh mode handling for cleaner operational flow
Fixed
- Fix redundant API calls in QuickStats pipeline wasting quota and slowing runs
- Fix systemd service restart behavior after WASDE inference model updates
Devlog
This was the densest two days of the project. I built the complete commodity price forecasting module end-to-end, from database tables through model training to a live frontend dashboard. Nine implementation steps, all shipped.
The backend starts with four ETL scripts pulling data from different sources on different schedules: CME futures prices daily via yfinance (I had to switch from Nasdaq Data Link after their CHRIS/CME endpoint was retired), US Dollar Index from FRED, WASDE supply-and-demand reports monthly from USDA, and ERS production costs annually. Each script writes to its own RDS table with upsert logic so reruns are safe.
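The upsert pattern looks roughly like the sketch below (the table shape and column names are illustrative, not the actual schema): PostgreSQL's ON CONFLICT turns a rerun of the same day's pull into an in-place update instead of a duplicate-key failure.

```python
from sqlalchemy import Column, Date, Float, MetaData, String, Table
from sqlalchemy.dialects.postgresql import insert

metadata = MetaData()

# Illustrative shape of the futures table, not the project's exact schema.
futures_daily = Table(
    "futures_daily", metadata,
    Column("symbol", String, primary_key=True),
    Column("trade_date", Date, primary_key=True),
    Column("settle", Float),
    Column("open_interest", Float),
)

def upsert_futures(engine, rows: list[dict]) -> None:
    """Safe to re-run: existing (symbol, trade_date) rows are updated in place."""
    stmt = insert(futures_daily).values(rows)
    stmt = stmt.on_conflict_do_update(
        index_elements=["symbol", "trade_date"],
        set_={"settle": stmt.excluded.settle,
              "open_interest": stmt.excluded.open_interest},
    )
    with engine.begin() as conn:
        conn.execute(stmt)
```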
The feature engineering layer builds an 18-feature matrix from these sources: market signals (spot price, term spread, basis, open interest change), fundamental indicators (stocks-to-use ratio and its historical percentile, WASDE surprise magnitude), macro factors (DXY level and 30-day change), cost metrics (production cost per bushel, price-to-cost ratio), and interaction terms (corn-soy ratio, seasonal factor). Every feature is validated with a Pandera schema before it reaches the model.
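The validation gate is conceptually a Pandera schema like the sketch below; the column names and bounds shown are a small illustrative subset of the 18 features, not the real schema definition.

```python
import pandas as pd
import pandera as pa

feature_schema = pa.DataFrameSchema(
    {
        "spot_price": pa.Column(float, pa.Check.gt(0)),
        "term_spread": pa.Column(float),
        "stocks_to_use": pa.Column(float, pa.Check.in_range(0, 1)),
        "dxy_level": pa.Column(float, pa.Check.gt(0)),
        "price_to_cost_ratio": pa.Column(float, pa.Check.gt(0)),
    },
    strict=False,  # only a few of the 18 columns are shown here
)

def validate_features(df: pd.DataFrame) -> pd.DataFrame:
    """Raises pandera.errors.SchemaError before anything reaches the model."""
    return feature_schema.validate(df)
```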
The model itself is an ensemble: SARIMAX for capturing seasonality and trend, LightGBM with quantile regression for the probabilistic spread, a Ridge meta-learner that blends them, and an isotonic calibration layer for the probability estimates. I trained 18 model sets (3 commodities times 6 monthly horizons) using walk-forward validation with proper temporal splits. A Mahalanobis distance check flags regime anomalies at inference time -- if the current feature vector is far from the training distribution, the system defers to the futures curve rather than trusting the model.
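The regime check itself is simple; a sketch under the assumption that the training feature mean and covariance are stored alongside each model set (the threshold value here is illustrative):

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

def is_regime_anomaly(x: np.ndarray, train_mean: np.ndarray,
                      train_cov: np.ndarray, threshold: float = 5.0) -> bool:
    """True when the current feature vector sits far outside the training
    distribution, in which case the system defers to the futures curve."""
    dist = mahalanobis(x, train_mean, np.linalg.inv(train_cov))
    return dist > threshold
```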
The API serves four endpoints: the core forecast (p10/p50/p90), calibrated probability above a user-specified threshold, WASDE signal analysis, and historical forecast-vs-actual comparisons. The frontend PredictionsDashboard has a fan chart, probability gauge, WASDE signal card, key driver callout (top SHAP feature), and regime alert.
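A hedged sketch of what the core forecast endpoint's contract could look like; the route path, response fields, and query helper are assumptions for illustration, not the deployed API.

```python
from fastapi import APIRouter
from pydantic import BaseModel

router = APIRouter(prefix="/api/v1/predict")

class PriceForecast(BaseModel):
    commodity: str
    horizon_months: int
    p10: float
    p50: float
    p90: float

async def fetch_latest_forecast(commodity: str, horizon_months: int) -> dict:
    """Placeholder for the RDS query that returns the newest stored forecast."""
    raise NotImplementedError

@router.get("/price/{commodity}", response_model=PriceForecast)
async def get_price_forecast(commodity: str, horizon_months: int = 1) -> PriceForecast:
    row = await fetch_latest_forecast(commodity, horizon_months)
    return PriceForecast(**row)
```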
The whole thing runs on EC2: cron handles the daily market data pulls and the monthly WASDE-triggered inference, and a systemd service auto-restarts the FastAPI server. Total infrastructure cost is about $22/month.
What's next: Extending the pipeline to county-level NASS data for finer geographic granularity, and operational polish before the yield prediction module.
Changelog
Added
- Add 5 RDS tables: futures_daily, wasde_releases, price_forecasts, ers_production_costs, dxy_daily
- Add ETL scripts: ingest_futures (yfinance), ingest_fred (DXY), ingest_wasde (USDA CSV), load_ers_costs (Excel)
- Add 18-feature engineering pipeline with Pandera schema validation
- Add PriceEnsemble model: SARIMAX + LightGBM quantile + Ridge meta-learner + isotonic calibration
- Add SHAP TreeExplainer for key driver identification per forecast
- Add Mahalanobis distance regime anomaly detection at inference time
- Add walk-forward training pipeline (2010-2019 train, 2020-2022 val, 2023-2024 test)
- Add 4 API endpoints: forecast, probability, wasde-signal, history
- Add PredictionsDashboard with fan chart, probability gauge, WASDE signal card, key driver callout
- Add PREDICTIONS tab to main dashboard ViewMode
- Add EC2 cron schedule: daily market data, monthly WASDE + inference, annual ERS costs
- Add systemd service for FastAPI auto-restart on EC2
- Add S3 model artifact storage and fallback loading
Changed
- Update cron_runner.sh with --daily, --monthly-wasde, --annual-ers modes
Devlog
With the Next.js frontend stable, I shifted to the backend infrastructure that the prediction modules will need. This was less about visible features and more about laying pipe: a FastAPI skeleton, database setup, and fixing a pipeline reliability problem that had been lurking since the Streamlit days.
The FastAPI backend is straightforward: async SQLAlchemy 2.0 with asyncpg connecting to a PostgreSQL instance on AWS RDS (db.t4g.micro, about $15/month). Alembic handles migrations. The router structure is set up for the prediction endpoints I will build next, with config management via pydantic-settings reading from environment variables. The entry point loads models at startup via a lifespan context manager, which will matter once trained model artifacts exist.
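The startup pattern is roughly the sketch below, with `load_all_models` standing in for the eventual artifact loader:

```python
from contextlib import asynccontextmanager
from fastapi import FastAPI

def load_all_models() -> dict:
    """Placeholder for the artifact loader (S3 pull with local fallback)."""
    return {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load model artifacts once at startup instead of on every request.
    app.state.models = load_all_models()
    yield
    app.state.models.clear()

app = FastAPI(lifespan=lifespan)
```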
The more important fix was the EC2 OOM kill. The USDA QuickStats pipeline was trying to load the entire national dataset into memory for processing, which worked on my local machine but blew up the EC2 instance. I refactored the pipeline to process data incrementally: ingest by state, merge parquet files in a multi-step process, and never hold more than one state's worth of data in memory at a time. This also made the pipeline resumable -- if it fails on state 37, it picks up from 37 on the next run instead of starting over.
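The shape of the loop, with the fetch step abstracted away since the real script does more cleaning and validation:

```python
from pathlib import Path
from typing import Callable

import pandas as pd

def run_pipeline(states: list[str],
                 fetch_state: Callable[[str], pd.DataFrame],
                 out_dir: str) -> None:
    """Process one state at a time; skip states whose parquet already exists,
    so a failed run resumes where it stopped instead of starting over."""
    for state in states:
        out_path = Path(out_dir) / f"state={state}" / "data.parquet"
        if out_path.exists():
            continue
        df = fetch_state(state)  # only one state's rows are in memory at once
        out_path.parent.mkdir(parents=True, exist_ok=True)
        df.to_parquet(out_path, index=False)
```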
I also did a Phase 3 UX and accessibility pass on the web app: bug fixes, some analytics instrumentation, and accessibility improvements flagged during the Next.js migration. And I cleaned out the legacy Dash app code that was still sitting in the repo from an earlier prototype. The repository is now cleanly split between `web_app/` (Next.js frontend), `backend/` (FastAPI), and `pipeline/` (data ingestion).
What's next: Building the commodity price forecasting module -- the first ML system on this platform, using WASDE reports, CME futures, and macro indicators to produce probabilistic price forecasts.
Changelog
Added
- Add FastAPI backend skeleton with async SQLAlchemy 2.0 and asyncpg
- Add PostgreSQL on AWS RDS (db.t4g.micro) with Alembic migrations
- Add pydantic-settings config management for backend environment
- Add incremental state-by-state pipeline processing to prevent EC2 OOM
- Add multi-step parquet merge for resumable pipeline runs
- Add Phase 3 UX and accessibility improvements to web app
Changed
- Update pipeline to process data incrementally instead of full national load
- Update repository structure: clean split between web_app/, backend/, pipeline/
Fixed
- Fix EC2 OOM kill during national dataset processing
- Fix various web app bugs identified during Phase 3 review
Removed
- Remove legacy Dash app code from repository
Devlog
After two months away from this project (other work took priority over winter), I came back and rewrote the entire dashboard from Streamlit to Next.js 16 with React 19 and TypeScript. The Streamlit version worked for exploration, but the UX ceiling was low: no component-level state management, limited interactivity, and a deployment model that did not fit well with Vercel hosting.
The migration took about a week of focused work. Each of the five original views (Crops, Animals, Land, Labor, Economics) became its own React component. Charts moved from Matplotlib/Plotly to Recharts, which integrates naturally with React's rendering lifecycle. The geographic visualizations got a major upgrade: Deck.gl with MapLibre GL replaced the hexagonal map, giving me proper choropleth layers with county-level aggregation and interactive tooltips.
The most interesting architectural decision was browser-side parquet reading via hyparquet. Instead of standing up a backend data proxy, the frontend fetches parquet files directly from S3 and parses them in the browser. This means the data layer has zero backend cost for read-heavy dashboard usage. S3 serves the files, the browser does the parsing, and Athena is available as a fallback for complex aggregation queries that would be expensive to run client-side.
The build process was rougher than expected. I went through three rounds of build fixes on February 12 alone (type errors, missing dependencies, Next.js App Router quirks). The v2 data integrity pass on February 16 caught several cases where the parquet column names did not match the TypeScript interfaces I had defined.
I also established the design system during this migration: a dark-themed palette with consistent chart colors, responsive layout via Tailwind CSS 4, and a design token system that will make future theming straightforward.
What's next: Standing up a FastAPI backend for the prediction services and hardening the pipeline to handle larger data volumes without running out of memory on EC2.
Changelog
Added
- Add Next.js 16 frontend with React 19, TypeScript 5, and App Router
- Add Recharts 3.7 integration for interactive chart components
- Add Deck.gl 9.2 + MapLibre GL 5.18 for choropleth and aggregation layer maps
- Add hyparquet for browser-side parquet file reading (zero-backend data serving)
- Add S3-first data fetching strategy with local API fallback
- Add Athena query integration for complex aggregation queries
- Add design system with dark theme palette and consistent chart color tokens
- Add Vercel deployment configuration
Changed
- Update dashboard architecture from Streamlit (Python) to Next.js (React/TypeScript)
- Update chart library from Matplotlib/Plotly to Recharts
- Update geographic visualization from hexagonal map to Deck.gl choropleth layers
Fixed
- Fix multiple build errors from Next.js App Router migration (3 rounds of fixes)
- Fix parquet column name mismatches with TypeScript interfaces in v2 data integrity pass
Devlog
I stood up the first working version of the USDA agricultural analytics dashboard in about two weeks. The initial goal was straightforward: pull historical crop, livestock, land use, labor, and economic data from the USDA QuickStats API and present it in an interactive dashboard that makes state-level agricultural trends explorable.
The data pipeline fetches from the QuickStats API, cleans and validates the response, and writes partitioned parquet files to an S3 bucket. Parquet was the right call here because the dataset is wide (dozens of commodity types, metrics, and geographies) and read-heavy. The pipeline handles incremental updates so I can re-run it monthly without re-downloading the full history.
The dashboard itself is a Streamlit app with five views: Crops (production, yield, acreage), Animals (inventory, production), Land (use composition), Labor (employment, wages), and Economics (prices received, revenue). I added a hexagonal state map for geographic exploration, which looks better than a standard choropleth for agricultural data because it gives smaller states equal visual weight.
Most of the friction was around deployment. I built a Docker container and GitHub Actions CI/CD pipeline on December 6, then spent the next day debugging S3 connectivity issues and data loading strategies. The demo mode (single-state focus for testing) helped isolate problems without loading the full national dataset. I also ran into memory issues when loading all states at once, which I solved by switching to a lazy-loading strategy that only fetches the states a user actually navigates to.
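The lazy-loading idea, sketched with Streamlit's cache and a placeholder bucket path (reading s3:// paths like this assumes s3fs is installed):

```python
import pandas as pd
import streamlit as st

@st.cache_data
def load_state(state: str) -> pd.DataFrame:
    # Bucket and prefix are placeholders, not the project's actual layout.
    return pd.read_parquet(f"s3://example-bucket/quickstats/state={state}/data.parquet")

state = st.selectbox("State", ["IA", "IL", "NE", "KS", "MN"])
df = load_state(state)  # only the selected state is fetched, then cached
```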
By December 13, the MVP was documented, the README was in place, and the S3 environment was stable. The Streamlit approach worked for getting something up fast, but I already knew the UX limitations would push me toward a proper frontend framework.
What's next: Migrating the dashboard from Streamlit to Next.js for better interactivity, Vercel hosting, and component-level state management.
Changelog
Added
- Add USDA QuickStats API ingestion pipeline with incremental updates
- Add AWS S3 parquet storage with partitioned file layout
- Add Streamlit dashboard with 5 views (Crops, Animals, Land, Labor, Economics)
- Add hexagonal state map visualization for geographic exploration
- Add Dockerfile and GitHub Actions CI/CD pipeline
- Add demo mode for single-state testing and development
- Add detailed S3 error logging for debugging connectivity issues
- Add README documentation with screenshots
Changed
- Update data loading to lazy-fetch strategy (per-state on navigation)
Fixed
- Fix labor statistics data accuracy in pipeline
- Fix S3 environment variable configuration for deployment
Devlog
Module 04 went from zero to 60 trained models in a single session. The goal: county-level, weekly, in-season crop yield prediction for corn, soybeans, and wheat using weather data, soil characteristics, and historical NASS yield patterns.
The session started with infrastructure. I created the Alembic migration for three new RDS tables (soil_features, feature_weekly, yield_forecasts), wrote ORM models and Pydantic schemas, then built the full ETL stack: five new ingestion scripts for NOAA GHCN weather, NASA POWER solar/VPD, US Drought Monitor, NASS crop condition ratings, and PRISM precipitation normals. Each follows the existing common.py pattern with upsert semantics and S3 archival. I also wrote static data loaders for county centroids (Census Gazetteer, 3,222 counties), SSURGO soil features (NRCS SDA API, 2,391 counties with AWC and drainage class), and NOAA station-to-county mapping (Haversine nearest-station within 50km, 3,140 mapped, 82 NASA POWER fallback).
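The station-to-county matching is a plain Haversine nearest-neighbor search; a sketch assuming arrays of station coordinates and a single county centroid, with the 50 km cutoff from above:

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres; works on scalars or numpy arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

def nearest_station(county_lat, county_lon, station_lats, station_lons, max_km=50.0):
    """Index of the closest station, or None to fall back to NASA POWER."""
    dists = haversine_km(county_lat, county_lon, station_lats, station_lons)
    idx = int(np.argmin(dists))
    return idx if dists[idx] <= max_km else None
```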
The real unlock was processing the GHCN bulk archive. Instead of rate-limited API calls (1,000/day for 3,200 counties), I processed the 7.5GB daily-summaries-latest.tar.gz directly: extract only the 3,105 mapped US stations, parse TMAX/TMIN/PRCP, convert units, and map to county FIPS. The result: 16 million daily weather records (2000-2025) processed in 4.3 minutes into a 71 MB parquet. That single file replaced what would have taken weeks via the API.
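The bulk processing is conceptually a single streaming pass over the tarball; a sketch that assumes one CSV per station inside the archive and GHCN's tenths-of-units convention (both worth verifying against a sample member before trusting the output):

```python
import tarfile

import pandas as pd

def process_ghcn(archive_path: str, mapped_stations: set[str]) -> pd.DataFrame:
    """Keep only mapped US stations and the TMAX/TMIN/PRCP columns."""
    frames = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if not member.name.endswith(".csv"):
                continue
            station_id = member.name.split("/")[-1][:-4]
            if station_id not in mapped_stations:
                continue  # skip the vast majority of stations we never mapped
            df = pd.read_csv(tar.extractfile(member),
                             usecols=["STATION", "DATE", "TMAX", "TMIN", "PRCP"])
            # Assumes values are stored in tenths (0.1 deg C / 0.1 mm).
            for col in ("TMAX", "TMIN", "PRCP"):
                df[col] = df[col] / 10.0
            frames.append(df)
    return pd.concat(frames, ignore_index=True)
```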
The wheat commodity required a targeted NASS fetch (286K rows across 30 states, 58 minutes) because the original county pipeline used "WINTER WHEAT" as the commodity name, but county-level data uses the aggregated "WHEAT" descriptor. A class_desc filter fix (accepting "WINTER" alongside "ALL CLASSES") resolved the training data gap.
First-pass training used only NASS-derived features (county mean yield, trend, prior year, std deviation, year). All corn and soybean models passed the 10% baseline gate, but features were identical across weeks since they didn't incorporate in-season weather. The retrain with weather features (GDD accumulated from planting, season precipitation, precipitation deficit vs normals, average max temperature, heat stress days) added real week-specific differentiation. Corn week 20 hit 14.15% val RRMSE (40% better than the 23.78% county-mean baseline), and soybean week 10 reached 16.50% (29% improvement). Wheat remains the hardest crop at 20-21% RRMSE, narrowly missing the 10% improvement gate. Its cross-year phenology (fall planting, summer harvest) makes a simple day-of-year season window less effective.
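The weather features are straightforward once the daily records exist; a sketch using the standard base-10 °C / cap-30 °C GDD formula, with the heat-stress threshold and season-window handling as simplifying assumptions:

```python
import pandas as pd

def weekly_weather_features(wx: pd.DataFrame, plant_date: pd.Timestamp,
                            week_end: pd.Timestamp) -> dict:
    """wx holds one county-season of daily tmax/tmin (deg C) and prcp (mm)."""
    season = wx[(wx["date"] >= plant_date) & (wx["date"] <= week_end)]
    tmax = season["tmax"].clip(upper=30)   # cap at 30 deg C
    tmin = season["tmin"].clip(lower=10)   # floor at the 10 deg C base
    return {
        "gdd_accum": ((tmax + tmin) / 2 - 10).clip(lower=0).sum(),
        "precip_total": season["prcp"].sum(),
        "tmax_avg": season["tmax"].mean(),
        "heat_stress_days": int((season["tmax"] > 34).sum()),  # threshold is illustrative
    }
```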
The frontend section integrates into the existing Predictions dashboard: commodity tabs, a week-of-season slider (1-20), confidence badges (low/medium/high by week), county summary stats, a map placeholder for the Deck.gl county choropleth, a forecast detail card with p10/p50/p90 ranges, a confidence strip showing how uncertainty narrows through the season, and a season accuracy chart.
What's next: Implement the Deck.gl GeoJsonLayer county choropleth using the downloaded TopoJSON, add drought and crop condition features to the training pipeline, and deploy the updated FastAPI backend to EC2 with the 60 model artifacts.
Changelog
Added
- Add Alembic migration 003 with `soil_features`, `feature_weekly`, `yield_forecasts` tables
- Add ORM models (`SoilFeature`, `FeatureWeekly`, `YieldForecast`) and 4 Pydantic response schemas
- Add 5 ETL scripts: `ingest_noaa.py`, `ingest_nasa_power.py`, `ingest_drought.py`, `ingest_crop_conditions.py`, `load_prism_normals.py`
- Add 3 static data loaders: `load_county_centroids.py` (3,222 counties), `load_ssurgo.py` (2,391 soil records), `build_station_map.py` (3,140 stations)
- Add `process_ghcn_bulk.py` for bulk GHCN archive processing (16M rows in 4.3 min)
- Add `fetch_wheat_county.py` for targeted wheat NASS county data (286K rows)
- Add `yield_features.py` with 7-feature vector (GDD, CCI, precip deficit, VPD stress, drought, soil AWC, soil drainage)
- Add `yield_model.py` (LightGBM quantile ensemble: p10/p50/p90) and `train_yield.py` (60 models, weather-enhanced, 10 features)
- Add `yield_inference.py` for weekly county-level prediction pipeline
- Add `yield_forecast.py` FastAPI router with 3 endpoints at `/api/v1/predict/yield`
- Add `useYieldForecast.ts` React hook and `YieldForecastSection.tsx` with 4 sub-components
- Add `--weekly-yield` mode to `cron_runner.sh` (Thursday 10 AM ET)
- Add US counties TopoJSON (842KB) for Deck.gl choropleth
Changed
- Update `main.py` to load 60 yield models at startup with S3 fallback, register yield router, bump to v0.3.0
- Update `config.py` with `NOAA_API_KEY` setting
- Update `PredictionsDashboard.tsx` to include `YieldForecastSection`
- Update `train_yield.py` to integrate GHCN weather features (GDD, precip, hot days) with week-specific computation
- Update `alembic/env.py` to register yield module ORM models
Fixed
- Fix `load_county_centroids.py` to handle Census Gazetteer ZIP format and strip trailing whitespace from column names
- Fix `load_ssurgo.py` to parse SDA API list-of-lists response format (was expecting dicts)
- Fix wheat county yield filter to accept `class_desc='WINTER'` alongside `'ALL CLASSES'`