PySpark Workshop — Batch ML to Real-Time Prediction
A hands-on 2-hour PySpark workshop covering distributed data processing, MLlib pipelines, MLflow tracking, and real-time streaming on Databricks.
01. Project Overview
Overview
A structured 2-hour workshop that teaches essential PySpark skills through building a complete analytics system — from batch data processing to real-time prediction — using the TPC-H dataset on Databricks Free Edition.
Workshop Structure
The workshop is divided into four progressive parts:
Part 1 — PySpark Speedrun (20 min)
Introduction to Spark fundamentals: loading data, lazy transformations vs. eager actions (nothing executes until an action such as count or show is called), and core DataFrame operations.
Part 2 — Data Engineer Toolkit (35 min)
The practical 80/20 of data engineering: aggregations (groupBy, window functions), multi-table joins (inner vs. left), null handling, and saving to Delta Lake.
Part 3 — ML Capstone Pipeline (40 min)
End-to-end machine learning:
- Feature engineering with StringIndexer and VectorAssembler
- RandomForest model training to predict order value
- MLflow experiment tracking for parameter/metric logging
- Model serialization and reload
Part 4 — Real-Time Prediction (15 min)
Applying the trained batch model to a live simulated stream using Structured Streaming, with real-time predictions displayed as orders arrive.
Design Decisions
- Zero setup barrier — Built entirely for Databricks Free Edition (no credit card, no local installation)
- Built-in data — Uses the TPC-H dataset pre-loaded in every Databricks workspace
- Progressive complexity — Each part builds on the previous, creating a complete pipeline by the end
- Production patterns — Teaches Delta Lake, MLflow, and parameterized pipelines that transfer to production work
Materials Included
- 4 Jupyter notebooks (one per part)
- Complete workshop guide with timing and teaching tips
- Databricks Free Edition setup guide
- PySpark cheat sheet for quick reference
Technologies
Role
Workshop Designer & Instructor
Timeline
Nov 2025
Category
Data Engineering / Education