A production-grade ML pipeline that forecasts daily sales for 3,049 retail SKUs, translates those forecasts into purchasing decisions, and serves everything through a REST API and interactive dashboard — built on the Walmart M5 competition dataset.
Retailers carry inventory under fundamental uncertainty about future demand. Get it wrong in either direction and you pay:
Getting the balance right requires knowing, as accurately as possible, how much of each product will sell over the next four weeks — and how uncertain that estimate is. This project builds the full system: from raw sales history to a live API that answers both questions on demand.
The M5 Forecasting Competition dataset contains hierarchical daily unit sales from Walmart stores across three U.S. states. The pipeline is demonstrated on CA_1 (California, Store 1) with the option to scale to all 10 stores in parallel.
| Dimension | Value |
|---|---|
| Store | CA_1 — Walmart California, Store 1 |
| SKUs modeled | 3,049 unique items |
| History per item | ~1,941 daily observations (2011–2016) |
| Total rows (long format) | ~5.9 million |
| Categories | FOODS, HOBBIES, HOUSEHOLD |
| Departments | 7 (FOODS_1/2/3, HOBBIES_1/2, HOUSEHOLD_1/2) |
| External signals | Calendar events, SNAP food-stamp flags, daily sell price |
Raw data arrives in wide format — one column per day, one row per item. The pipeline immediately melts it to long format (item × date rows), merges the calendar and price tables, and applies dtype optimization (categoricals, float32 lags) to keep memory under 2 GB.
The global approach has three advantages over per-series models:
LightGBM is configured with Tweedie loss rather than squared-error. Retail sales are zero-inflated and right-skewed — most (item, day) pairs sell 0–2 units, but occasionally an item sells 20+. Squared-error gets dominated by those outliers. Tweedie is the proper distribution for compound Poisson count data.
The Tweedie variance power p = 1.231 was found by Optuna (between Poisson p=1 and Gamma p=2), reflecting moderate overdispersion.
| Group | Features | What it encodes |
|---|---|---|
| Lag features | lag_7, lag_14, lag_28, lag_35 | Sales from 1/2/4/5 weeks ago — weekly seasonality signal |
| Rolling means | roll_mean_7/14/28 | Recent demand level — the model's "memory" of how fast this item sells |
| Rolling std | roll_std_7/14/28 | Demand volatility — high-std items need more safety stock |
| Price features | sell_price, price_rel_store_cat, price_pct_4w | Absolute price, price vs. category peers, recent price change |
| Calendar | wday_sin/cos, week_sin/cos, month, year | Cyclical encodings avoid discontinuity at week/year boundaries |
| Events | has_event, snap | Binary flags for known demand shocks |
| Identity | item_id, dept_id, cat_id | Categorical embeddings — model learns item-specific baselines |
roll_mean_14 alone contributes ~45% of total gain. The model primarily answers "how much did this item sell in the last 2 weeks?" before anything else.
Split count (right) — how many times a feature appears in the ensemble. item_id appears 70,000+ times — the model makes fine-grained item-level adjustments throughout every tree, even though each individual adjustment is small.
Insight: the dominant features are endogenous (derived from past sales), not exogenous. The model is a sophisticated autoregressive smoother. Price and events matter at the margin.
Rather than grid search, the pipeline uses Optuna TPE (Tree-structured Parzen Estimator) — a Bayesian optimization algorithm that builds a probabilistic model of the loss surface and proposes new configurations intelligently. 20 trials over 7 hyperparameters.
The benchmark is Naive Seasonal — which predicts that any day's sales will equal the same weekday from 4 weeks ago. This is a strong baseline for retail data; beating it requires genuinely capturing patterns the calendar can't explain.
RMSE — average forecast error in units, with larger errors penalized quadratically. MAE — average absolute error in units. MASE — MAE scaled by the naive baseline on training data; values near 1 mean the model performs like naive on training but outperforms on held-out validation — the expected pattern for a model that generalizes while per-item naive baselines overfit historical noise.
A point forecast alone is incomplete for inventory decisions. Three separate LightGBM models are trained with quantile (pinball) loss at the 10th, 50th, and 90th percentiles to bound uncertainty.
The recursive challenge: the model was trained on features like lag_7 (sales 7 days ago). On forecast day 8, "7 days ago" is itself a predicted value. The pipeline handles this with a rolling buffer — each step appends its own output and recomputes all lag/rolling features before predicting the next.
Weighted Root Mean Squared Scaled Error is the official M5 metric. It measures forecast accuracy, normalized by each item's historical variance, then weighted by that item's revenue contribution. A score of 1.0 means the model performs identically to the naïve seasonal baseline. Lower is better.
A forecast that lives only in a CSV is not useful. Phase 6 converts each item's demand forecast and uncertainty estimate into three concrete purchasing parameters using the continuous-review inventory model — the same framework used by major retailers worldwide.
Buffer inventory held to absorb demand uncertainty during the replenishment lead time. Higher uncertainty and longer lead times both require more buffer.
σ is derived from Phase 4's quantile spread: for a normal distribution, P90 − P10 ≈ 2.563 × σ. The quantile models do double duty — prediction intervals and volatility estimates for inventory.
The inventory level at which a replenishment order is triggered. Designed so that, in the average demand scenario, inventory won't reach zero before the order arrives.
How much to order each time. Balances the cost of ordering frequently (fixed $50 per order) against the cost of holding large quantities (capital tied up, warehouse space).
| Metric | Mean | Median | Max |
|---|---|---|---|
| Mean daily demand (units) | 1.52 | 0.79 | 55.2 |
| Safety stock (units) | 5.14 | 3.70 | 103.8 |
| Reorder point (units) | 15.76 | 9.20 | 490.3 |
| EOQ (units per order) | 445.5 | 378.9 | 3,174.5 |
| Days of supply in safety stock | 5.16 | 4.60 | 17.1 |
The median item needs fewer than 4 units of safety stock and triggers an order at 9 units on hand — appropriate for slow-moving SKUs. The top-volume items (ROP = 490) require daily monitoring and represent an operationally different challenge.
Evaluating on a single held-out period can be misleading — maybe the model got lucky on that particular window. Walk-forward (rolling-origin) validation tests the model on four consecutive future windows, retraining from scratch at each origin.
This simulates production: the model is retrained periodically as new data arrives, then evaluated on the next month it has never seen.
| Window | Validation Period | LightGBM RMSE | Naive RMSE | Reduction | LGB Wins (%) |
|---|---|---|---|---|---|
| W1 | Feb 2016 | 1.312 | 1.804 | −27% | 86.7% |
| W2 | Mar 2016 | 1.310 | 1.840 | −29% | 87.4% |
| W3 | Apr 2016 | 1.350 | 1.923 | −30% | 89.2% |
| W4 | May 2016 | 1.400 | 1.969 | −29% | 93.6% |
The slight upward drift in RMSE in later windows is matched by the baseline's own drift — the relative improvement stays flat. This means signal-to-noise in the data is stable; the model is not degrading.
The batch pipeline writes Parquet outputs once (or on schedule). The FastAPI server is the low-latency serving layer on top: reads outputs, caches DataFrames in memory with a 1-hour TTL, and answers queries in milliseconds.
| Method | Path | Purpose | Latency |
|---|---|---|---|
| GET | /health |
Liveness check — returns available stores and cache status | <1 ms |
| GET | /stores |
Lists all stores with completed pipeline outputs | <1 ms |
| GET | /stores/{store}/items |
Lists item IDs, filterable by category or department | <5 ms |
| GET | /forecast/{store}/{item_id} |
28-day point + quantile forecast (pre-computed from Parquet) | <5 ms |
| POST | /inventory |
Live inventory params — any lead time or service level | <1 ms |
| GET | /metrics/{store} |
Store-level WRMSSE and walk-forward CV | <2 ms |
Design note: /forecast reads from Parquet — not from the live model. This means sub-millisecond response times at the cost of freshness (refresh by re-running Phase 5). /inventory computes on-the-fly, letting users query any lead time or service level without re-running Phase 6.
GET /forecast/CA_1/FOODS_1_001_CA_1_evaluation { "store": "CA_1", "item_id": "FOODS_1_001_CA_1_evaluation", "n_days": 28, "forecasts": [ { "date": "2016-04-25", "pred_point": 0.7978, "pred_q10": 0.0, "pred_q50": 0.9599, "pred_q90": 2.5654, "actual_sales": 2.0 }, ... 27 more days ] }
POST /inventory { "store": "CA_1", "item_id": "FOODS_1_001_CA_1_evaluation", "lead_time": 7, "service_level": 0.95 } { "mean_daily_demand": 0.861, "demand_std_daily": 0.902, "lead_time_demand": 6.02, "safety_stock": 3.9, ← hold this much buffer "reorder_point": 9.9, ← order when stock hits this "eoq": 396.3, ← order this many units "days_of_supply_ss": 4.6 }
pandas, numpy, pyarrow, fastparquet
LightGBM, statsmodels (ARIMA), Prophet, scikit-learn
Optuna (TPE sampler), 20-trial Bayesian HPO
matplotlib, seaborn, plotly
FastAPI, uvicorn, Pydantic v2, TTL in-process cache
Streamlit (5-tab interactive app with live parameter sliders)
Melt 5.9M rows from wide to long, merge calendar + prices, dtype optimization (categoricals + float32), 5 diagnostic plots.
Tweedie regression on all 3,049 series simultaneously. Lag, rolling, price, and calendar features. RMSE −29% vs Naive Seasonal.
20 TPE trials across 7 hyperparameters. Best params saved to JSON and reused in downstream phases.
Three separate LightGBM models with pinball loss. 80% interval coverage: 83.4%. Quantile spread reused as volatility estimate in Phase 6.
Lag/rolling feature rollover loop. 28 sequential predictions per item. WRMSSE 0.911 (recursive) vs 0.877 (one-step) — only 3.9% degradation from compounding.
Safety stock, reorder point, EOQ per item. Fully vectorized pandas — all 3,049 items in under 1 second. Configurable lead time, service level, and cost parameters.
4 rolling-origin windows (Feb–May 2016), full model retrain at each origin. RMSE CV = 3.1% — stable. 27–30% reduction vs baseline across all windows.