Bakery Anomaly Detection
Production order error detection for an artisanal bakery — applied ML where simpler is necessary
2 months in production · Paper in preparation for SOMI @ ECML-PKDD 2026
Overview
An end-to-end anomaly detection system built for an artisanal bakery in northern Italy (~50 employees, ~200 products). Every morning, a bakery manager exports a CSV listing planned production quantities for the day. This system ingests that file, flags likely data-entry errors before they propagate to procurement and kitchen scheduling, and lets the operator review and correct them in seconds.
The operator is the bakery owner — not a data scientist.
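As a rough illustration of the ingestion step, the sketch below parses one daily export into quantities keyed by product and weekday. The column names `product` and `quantity` and the function name `load_daily_plan` are assumptions made for this example; the real export format is not described here.

```python
import csv
import io
from datetime import date


def load_daily_plan(csv_text, production_day):
    """Parse one daily export into {(product, weekday): quantity}.

    Assumes hypothetical columns named "product" and "quantity".
    """
    weekday = production_day.weekday()  # 0 = Monday ... 6 = Sunday
    plan = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        plan[(row["product"].strip(), weekday)] = float(row["quantity"])
    return plan


sample = "product,quantity\nCiabatta,48\nFocaccia,12.5\n"
plan = load_daily_plan(sample, date(2026, 3, 2))  # 2 March 2026 is a Monday
```

Keying by (product, weekday) from the start mirrors the way the detection rule treats each pair as its own series.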
The Core Problem
Standard anomaly detection methods (Isolation Forest, LOF, autoencoders) look impressive on benchmarks. This project documents why they are structurally inapplicable to a whole class of real SME deployments, and what the right tool actually is.
Five properties make trained ML models unsuitable:
- Day-of-week independence — Monday orders differ systematically from Saturday for every product. A DOW-aware model would need ~600 instances, each trained on 4 data points.
- No inter-product correlation — a mistyped quantity carries zero signal in any other product.
- Extreme data scarcity — a 4-week rolling window means 4 training observations per model.
- Continuous distribution shift — seasonal volumes, new products, discontinued items.
- Zero-maintenance requirement — no IT budget for retraining pipelines.
The Solution
A rolling statistical estimator with a conjunction rule: for each (product, day-of-week) pair, compute mean and standard deviation over the last W same-weekday observations (default W = 4). Flag an anomaly only when all three conditions hold simultaneously:
- Z-score > 7.0
- Percentage deviation > 30%
- Absolute deviation > volume-tier threshold
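The conjunction rule above can be sketched in a few lines. This is a minimal illustration, not the deployed code: the tier boundaries in `VOLUME_TIERS` are hypothetical (the real ones are not listed here), deviations are assumed to be absolute (both directions), and the handling of a flat history is an assumption.

```python
from statistics import mean, stdev

# Hypothetical volume tiers: (upper bound of baseline mean, minimum absolute deviation).
# The real tier boundaries are not given in this document.
VOLUME_TIERS = [(10.0, 5.0), (100.0, 20.0), (float("inf"), 50.0)]


def tier_threshold(baseline_mean):
    """Return the absolute-deviation threshold for the tier containing baseline_mean."""
    for upper_bound, min_abs_dev in VOLUME_TIERS:
        if baseline_mean <= upper_bound:
            return min_abs_dev
    return VOLUME_TIERS[-1][1]


def is_anomaly(history, quantity, z_min=7.0, pct_min=0.30):
    """Apply the three-way conjunction to one (product, day-of-week) series.

    history: the last W same-weekday quantities (W = 4 by default in the text).
    """
    if len(history) < 2:
        return False  # standard deviation is undefined with fewer than 2 points
    mu = mean(history)
    sigma = stdev(history)
    abs_dev = abs(quantity - mu)
    # Assumption: with a flat history (sigma == 0) any change counts as an infinite Z-score.
    if sigma > 0:
        z = abs_dev / sigma
    else:
        z = float("inf") if abs_dev > 0 else 0.0
    pct_dev = abs_dev / mu if mu > 0 else float("inf")
    # Flag only when all three conditions hold at once.
    return z > z_min and pct_dev > pct_min and abs_dev > tier_threshold(mu)
```

For example, `is_anomaly([42, 54, 48, 48], 500)` is flagged, while `is_anomaly([100, 100, 101, 100], 110)` is not: the Z-score is large because the history is almost flat, but the deviation is under 30%, which is the kind of false alarm the conjunction is meant to suppress.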
A fourth component — a year-over-year seasonality shield — suppresses false positives on seasonal products (Easter specialties, holiday items) by comparing against the same date one year prior.
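One way such a shield could work is sketched below. The tolerance band, the dictionary-based history, and both function names are assumptions for illustration; the text only states that the comparison is against the same date one year prior.

```python
from datetime import date


def same_date_last_year(day):
    """Return the same calendar date one year earlier (29 Feb maps to 28 Feb)."""
    try:
        return day.replace(year=day.year - 1)
    except ValueError:
        return day.replace(year=day.year - 1, day=28)


def shielded_by_last_year(day, quantity, yearly_history, tolerance=0.30):
    """Return True when last year's quantity on the same date explains the spike.

    yearly_history: dict mapping date -> quantity for one product (hypothetical shape).
    tolerance: hypothetical relative band around last year's value.
    """
    reference = yearly_history.get(same_date_last_year(day))
    if reference is None or reference <= 0:
        return False  # no usable history, so the flag stands
    return abs(quantity - reference) / reference <= tolerance


history = {date(2025, 12, 24): 300.0}
suppressed = shielded_by_last_year(date(2026, 12, 24), 320.0, history)
still_flagged = not shielded_by_last_year(date(2026, 12, 24), 900.0, history)
```

A flag raised by the conjunction rule would be dropped only when the shield returns True, so a product with no history from the previous year is never silently suppressed.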
Self-adapting by construction: no model to retrain, no artefact to version, no drift detector to maintain. New products are handled from their second observation.
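The self-adapting behaviour follows from keeping only a fixed-length window per series. The sketch below uses a bounded `deque` for this; the class name `RollingBaseline` is hypothetical, and it assumes a baseline becomes available once two observations are stored, which is one reading of "from their second observation".

```python
from collections import defaultdict, deque
from statistics import mean, stdev


class RollingBaseline:
    """Keep the last W same-weekday quantities per (product, weekday) key."""

    def __init__(self, window=4):
        # Each new key gets its own bounded window on first use.
        self._series = defaultdict(lambda: deque(maxlen=window))

    def baseline(self, product, weekday):
        """Return (mean, std), or None when fewer than 2 observations exist."""
        values = self._series[(product, weekday)]
        if len(values) < 2:
            return None
        return mean(values), stdev(values)

    def update(self, product, weekday, quantity):
        """Append a confirmed quantity; the oldest one falls out automatically."""
        self._series[(product, weekday)].append(quantity)


rb = RollingBaseline(window=4)
for q in [1, 2, 3, 4, 5, 6]:
    rb.update("Ciabatta", 0, q)
window_mean, window_std = rb.baseline("Ciabatta", 0)  # computed over 3, 4, 5, 6
```

Because old values simply age out of the window, there is nothing to retrain when volumes drift or a product is introduced or discontinued.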
Architecture
CSV upload (daily)
│
Streamlit frontend
│
Detection engine (Python)
├── Rolling Z-score per (product, DOW)
├── Conjunction rule
├── Volume-tier thresholds
└── YoY seasonality shield
│
Supabase PostgreSQL
├── Historical orders
├── Precomputed baselines
├── Audit log (append-only)
└── Auth (role-based)
Two user roles: an administrator view (full table with Z-score, deviation, time-series charts) and a simplified operator view (one review card per flag, business-language summaries: "You ordered 500 kg; the usual Monday order is 48 ± 6 kg").
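The business-language summary on the operator card is essentially a template over the baseline statistics. A minimal sketch, assuming a hypothetical helper named `operator_summary` and whole-number rounding of the baseline:

```python
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def operator_summary(quantity, baseline_mean, baseline_std, weekday, unit="kg"):
    """Render one flag as a plain-language sentence for the operator card.

    weekday: 0 = Monday ... 6 = Sunday, matching datetime.date.weekday().
    """
    return (
        f"You ordered {quantity:g} {unit}; the usual {WEEKDAYS[weekday]} order "
        f"is {baseline_mean:.0f} ± {baseline_std:.0f} {unit}"
    )


message = operator_summary(500, 48, 6, weekday=0)
# message == "You ordered 500 kg; the usual Monday order is 48 ± 6 kg"
```

The sentence exposes only the quantity, the weekday, and the usual range; the Z-score itself stays in the administrator view.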
Deployed on free tiers only: Streamlit Cloud, Supabase, Cloudflare DNS.
Total recurring infrastructure cost: €0/month.
Benchmark Results
496-anomaly labelled test set, constructed with the bakery's cooperation. Competing methods receive oracle threshold selection (best possible F1 on test set) — the deployed Z-score gets no oracle.
| Method | Window W | Precision | Recall | F1 |
|---|---|---|---|---|
| Z-score (deployed, no oracle) | 4 | ~81% | 90.9% | 85.7% |
| Z-score (oracle) | 4 | 83.0% | 91.1% | 86.9% |
| MAD (oracle) | 4 | 82.3% | 83.2% | 82.8% |
| Holt-Winters (oracle) | 4 | — | — | 0% (inapplicable) |
| Z-score (oracle) | 24 | 86.5% | 91.6% | 89.0% |
| MAD (oracle) | 24 | 83.7% | 92.8% | 88.0% |
| Holt-Winters (oracle) | 24 | 66.5% | 75.7% | 70.8% |
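Without any oracle selection, the deployed system beats oracle-tuned MAD at W = 4 and trails the oracle-tuned Z-score at the same window by only 1.2 pp of F1. Larger windows help the oracle baselines (89.0% F1 for Z-score at W = 24), but they require six times as much history per (product, day-of-week) pair. Holt-Winters is entirely inapplicable at W = 4 and plateaus ~18 pp below Z-score at W = 24, while being 65–130× slower per inference.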
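For readers unfamiliar with oracle threshold selection, the sketch below shows the general idea: sweep every candidate threshold over the test-set scores and keep the one with the highest F1. The function names are hypothetical and the sweep is a generic illustration; the exact procedure used in the benchmark is not described here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def oracle_threshold(scores, labels):
    """Return (threshold, f1) maximising F1 when flagging score >= threshold.

    scores: anomaly scores, higher means more anomalous.
    labels: True for a labelled anomaly, False otherwise.
    """
    best_threshold, best_f1 = None, -1.0
    for threshold in sorted(set(scores)):
        flagged = [s >= threshold for s in scores]
        tp = sum(f and y for f, y in zip(flagged, labels))
        fp = sum(f and not y for f, y in zip(flagged, labels))
        fn = sum(y and not f for f, y in zip(flagged, labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        score = f1_score(precision, recall)
        if score > best_f1:  # ties keep the lowest threshold
            best_threshold, best_f1 = threshold, score
    return best_threshold, best_f1


# Toy data, not taken from the benchmark.
toy_scores = [0.5, 1.0, 2.0, 8.0, 9.0, 12.0]
toy_labels = [False, False, True, False, True, True]
threshold, f1 = oracle_threshold(toy_scores, toy_labels)
```

Because the threshold is tuned on the same labels it is scored against, the result is an upper bound that no deployed method can rely on. As a consistency check, `round(f1_score(0.830, 0.911), 3)` gives 0.869, matching the oracle Z-score row at W = 4.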
Key Lessons
- Calibrate for precision over recall in human-in-the-loop deployments. Once more than ~20% of flags are false alarms, the operator stops reviewing them and effective detection drops to zero.
- Seasonality bites hard in production. The YoY shield was added after the first Easter, when legitimate demand spikes eroded operator trust within days.
- Zero-maintenance is a hard constraint. Any system requiring scheduled retraining would have been abandoned within weeks.
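- Explainability is structurally unavailable with Isolation Forest and LOF: neither can say why a quantity was flagged. A rolling Z-score decomposes every flag into interpretable business quantities (the order, the usual level, the usual spread).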
Academic Output
"When Simpler Is Necessary: Anomaly Detection for Production Orders in an Artisanal Bakery"
Lucio Baiocchi — MSc Data Science and Engineering, Politecnico di Torino
In preparation for the SOMI Workshop @ ECML-PKDD 2026 (Springer LNCS)
Main contribution: a problem characterisation framework — a decision tree that predicts, from the structural properties of a monitoring problem, whether a rolling statistical estimator or a trained ML model is the appropriate tool.
Stack: Python · Streamlit · Supabase (PostgreSQL) · Streamlit Cloud · Cloudflare