PySpark Workshop — Batch ML to Real-Time Prediction

A hands-on 2-hour PySpark workshop covering distributed data processing, MLlib pipelines, MLflow tracking, and real-time streaming on Databricks.

01. Project Overview

Overview

A structured 2-hour workshop that teaches essential PySpark skills through building a complete analytics system — from batch data processing to real-time prediction — using the TPC-H dataset on Databricks Free Edition.


Workshop Structure

The workshop is divided into four progressive parts:

Part 1 — PySpark Speedrun (20 min)

Introduction to Spark fundamentals: loading data, understanding lazy transformations vs. the actions that trigger execution, and core DataFrame operations.

Part 2 — Data Engineer Toolkit (35 min)

The practical 80/20 of data engineering: aggregations (groupBy, window functions), multi-table joins (inner vs. left), null handling, and saving to Delta Lake.

Part 3 — ML Capstone Pipeline (40 min)

End-to-end machine learning:

  • Feature engineering with StringIndexer and VectorAssembler
  • RandomForest model training to predict order value
  • MLflow experiment tracking for parameter/metric logging
  • Model serialization and reload

Part 4 — Real-Time Prediction (15 min)

Applying the trained batch model to a live simulated stream using Structured Streaming, with real-time predictions displayed as orders arrive.


Design Decisions

  • Zero setup barrier — Built entirely for Databricks Free Edition (no credit card, no local installation)
  • Built-in data — Uses the TPC-H sample dataset that ships with Databricks workspaces, so there is nothing to download
  • Progressive complexity — Each part builds on the previous, creating a complete pipeline by the end
  • Production patterns — Teaches Delta Lake, MLflow, and parameterized pipelines that transfer to production work

Materials Included

  • 4 Jupyter notebooks (one per part)
  • Complete workshop guide with timing and teaching tips
  • Databricks Free Edition setup guide
  • PySpark cheat sheet for quick reference

Technologies

PySpark, MLlib, MLflow, Structured Streaming, Databricks, Delta Lake

Role

Workshop Designer & Instructor

Timeline

Nov 2025

Category

Data Engineering / Education