Bakery Anomaly Detection

Production order error detection for an artisanal bakery — applied ML where simpler is necessary

2 months in production · Paper in preparation for SOMI @ ECML-PKDD 2026

Overview

An end-to-end anomaly detection system built for an artisanal bakery in northern Italy (~50 employees, ~200 products). Every morning, a bakery manager exports a CSV listing planned production quantities for the day. This system ingests that file, flags likely data-entry errors before they propagate to procurement and kitchen scheduling, and lets the operator review and correct them in seconds.

The operator is the bakery owner — not a data scientist.

The Core Problem

Standard anomaly detection methods (Isolation Forest, LOF, autoencoders) look impressive on benchmarks. This project documents why they are structurally inapplicable to a whole class of real SME deployments, and what the right tool actually is.

Five properties make trained ML models unsuitable:

Day-of-week independence — Monday orders differ systematically from Saturday orders for every product. A DOW-aware model would need one instance per (product, weekday) pair, up to ~1,400 instances (~200 products × 7 weekdays), each trained on 4 data points.
No inter-product correlation — a mistyped quantity carries zero signal in any other product.
Extreme data scarcity — a 4-week rolling window means 4 training observations per model.
Continuous distribution shift — seasonal volumes, new products, discontinued items.
Zero-maintenance requirement — no IT budget for retraining pipelines.

The Solution

A rolling statistical estimator with a conjunction rule: for each (product, day-of-week) pair, compute mean and standard deviation over the last W same-weekday observations (default W = 4). Flag an anomaly only when all three conditions hold simultaneously:

Z-score > 7.0
Percentage deviation > 30%
Absolute deviation > volume-tier threshold
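
The conjunction rule above can be sketched in a few lines of Python. This is a minimal illustration, not the deployed code: the function names, the `VOLUME_TIERS` values, the use of the sample standard deviation, and the handling of a zero spread are all assumptions. Only the default window (W = 4) and the 7.0 and 30% thresholds come from the description.

```python
from statistics import mean, stdev

# Hypothetical volume tiers: (upper bound on baseline mean, minimum absolute deviation).
# The real tier values are not given in this document.
VOLUME_TIERS = [(10.0, 5.0), (100.0, 20.0), (float("inf"), 50.0)]


def tier_threshold(baseline_mean):
    """Return the absolute-deviation threshold of the tier containing baseline_mean."""
    for upper_bound, min_abs_dev in VOLUME_TIERS:
        if baseline_mean <= upper_bound:
            return min_abs_dev
    return VOLUME_TIERS[-1][1]


def is_anomaly(history, quantity, window=4, z_min=7.0, pct_min=0.30):
    """Apply the three-way conjunction rule to one (product, day-of-week) series.

    history:  past quantities for the same product and weekday, oldest first.
    quantity: today's planned quantity.
    """
    recent = history[-window:]
    if not recent:
        return False  # first observation of a new product: nothing to compare against
    mu = mean(recent)
    # Assumption: with a single past observation the spread is treated as zero.
    sigma = stdev(recent) if len(recent) >= 2 else 0.0
    abs_dev = abs(quantity - mu)
    if abs_dev == 0:
        return False
    # Assumption: a zero spread makes any non-zero deviation an unbounded Z-score,
    # so the percentage and absolute conditions decide the outcome.
    z = abs_dev / sigma if sigma > 0 else float("inf")
    pct_dev = abs_dev / mu if mu > 0 else float("inf")
    # Flag only when all three conditions hold simultaneously.
    return z > z_min and pct_dev > pct_min and abs_dev > tier_threshold(mu)
```

With a Monday history of `[42, 54, 48, 48]`, an entry of 500 satisfies all three conditions and is flagged, while 55 is not. A low-volume series such as `[2.0, 2.0, 2.0, 2.1]` with an entry of 3 has a large Z-score and percentage deviation but stays below the (hypothetical) absolute threshold, so the conjunction suppresses it.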

A fourth component — a year-over-year seasonality shield — suppresses false positives on seasonal products (Easter specialties, holiday items) by comparing against the same date one year prior.
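
One way to picture the seasonality shield is as a lookup against last year's order for the same calendar date. The sketch below is an assumption about how such a check could work: the function names, the `history_by_date` mapping, the 30% `tolerance`, and the 29 February fallback are hypothetical and not taken from the deployed system.

```python
from datetime import date


def same_date_last_year(d):
    """Return the same calendar date one year earlier."""
    try:
        return d.replace(year=d.year - 1)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; fall back to 28 February.
        return d.replace(year=d.year - 1, day=28)


def shielded_by_last_year(order_date, quantity, history_by_date, tolerance=0.30):
    """Return True when last year's quantity on the same date explains today's value.

    history_by_date: dict mapping date -> quantity for one product (hypothetical layout).
    tolerance:       maximum relative difference from last year's value (hypothetical).
    """
    reference = history_by_date.get(same_date_last_year(order_date))
    if reference is None or reference <= 0:
        return False  # no usable reference, so the shield does not apply
    return abs(quantity - reference) / reference <= tolerance


# Usage: a seasonal spike that matches last year's order is not reported.
history = {date(2025, 12, 24): 300.0}
suppress = shielded_by_last_year(date(2026, 12, 24), 320.0, history)
```

A flag raised by the three-way rule would be suppressed only when this check returns True, so products without a record one year back are unaffected.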

Self-adapting by construction: no model to retrain, no artefact to version, no drift detector to maintain. New products are handled from their second observation.

Architecture

CSV upload (daily)
       │
  Streamlit frontend
       │
  Detection engine (Python)
  ├── Rolling Z-score per (product, DOW)
  ├── Conjunction rule
  ├── Volume-tier thresholds
  └── YoY seasonality shield
       │
  Supabase PostgreSQL
  ├── Historical orders
  ├── Precomputed baselines
  ├── Audit log (append-only)
  └── Auth (role-based)

Two user roles: an administrator view (full table with Z-score, deviation, time-series charts) and a simplified operator view (one review card per flag, business-language summaries: "You ordered 500 kg; the usual Monday order is 48 ± 6 kg").
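
The operator-facing sentence can be produced directly from the baseline statistics. The helper below is a hypothetical sketch (its name, parameters, and rounding are assumptions) that reproduces the example message quoted above.

```python
def operator_summary(quantity, baseline_mean, baseline_std, weekday, unit="kg"):
    """Format one flag as a plain-language sentence for the operator view."""
    return (
        f"You ordered {quantity:g} {unit}; "
        f"the usual {weekday} order is {baseline_mean:.0f} ± {baseline_std:.0f} {unit}"
    )


message = operator_summary(500, 48, 6, "Monday")
```

Because the message uses only the quantity, the mean, and the standard deviation, every flag can be explained without exposing the Z-score itself.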

Deployed on free tiers only: Streamlit Cloud, Supabase, Cloudflare DNS.
Total recurring infrastructure cost: €0/month.

Benchmark Results

496-anomaly labelled test set, constructed with the bakery's cooperation. Competing methods receive oracle threshold selection (best possible F1 on test set) — the deployed Z-score gets no oracle.

Method	Window W	Precision	Recall	F1
Z-score (deployed) ← no oracle	4	81.0%	90.9%	85.7%
Z-score (oracle)	4	83.0%	91.1%	86.9%
MAD (oracle)	4	82.3%	83.2%	82.8%
Holt-Winters (oracle)	4	—	—	0% (inapplicable)
Z-score (oracle)	24	86.5%	91.6%	89.0%
MAD (oracle)	24	83.7%	92.8%	88.0%
Holt-Winters (oracle)	24	66.5%	75.7%	70.8%

Without oracle selection, the deployed system outperforms MAD and Holt-Winters at W = 4 and stays within 1.2 pp F1 of oracle Z-score at the same window. Oracle-tuned Z-score and MAD move ahead only with a 24-observation window (89.0% and 88.0% F1). Holt-Winters is entirely inapplicable at W = 4 and plateaus ~18 pp below Z-score at W = 24, while being 65–130× slower per inference.
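
For readers who want to check the table, F1 is the harmonic mean of precision and recall, and oracle threshold selection means picking the threshold that maximises F1 on the labelled test set itself. The sketch below illustrates both. The function names and the toy data are assumptions, not the benchmark code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def oracle_threshold(scores, labels):
    """Return (threshold, f1) maximising F1 when flagging scores >= threshold.

    scores: anomaly scores, one per test record.
    labels: True for labelled anomalies, False otherwise.
    """
    best_threshold, best_f1 = None, 0.0
    for threshold in sorted(set(scores)):
        predicted = [s >= threshold for s in scores]
        tp = sum(p and y for p, y in zip(predicted, labels))
        fp = sum(p and not y for p, y in zip(predicted, labels))
        fn = sum((not p) and y for p, y in zip(predicted, labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = f1_score(precision, recall)
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


# The oracle Z-score row at W = 4: precision 83.0%, recall 91.1%.
row_f1 = round(f1_score(0.830, 0.911), 3)

# Toy example with made-up scores and labels.
toy_best = oracle_threshold([1.0, 2.0, 8.0, 9.0, 3.0], [False, False, True, True, False])
```

Since the oracle tunes on the test labels, its F1 is an upper bound for that method on this data, which is why it serves as a generous baseline for the competing methods.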

Key Lessons

Calibrate for precision over recall in human-in-the-loop deployments. A false-positive rate above ~20% causes operator disengagement, at which point effective detection drops to zero.
Seasonality bites hard in production. The YoY shield was added after the first Easter, when legitimate demand spikes eroded operator trust within days.
Zero-maintenance is a hard constraint. Any system requiring scheduled retraining would have been abandoned within weeks.
Explainability is structurally unavailable with Isolation Forest and LOF, whereas the rolling Z-score decomposes every flag into interpretable business quantities.

Academic Output

"When Simpler Is Necessary: Anomaly Detection for Production Orders in an Artisanal Bakery"
Lucio Baiocchi — MSc Data Science and Engineering, Politecnico di Torino
Submitted to SOMI Workshop @ ECML-PKDD 2026 (Springer LNCS)

Main contribution: a problem characterisation framework — a decision tree that predicts, from the structural properties of a monitoring problem, whether a rolling statistical estimator or a trained ML model is the appropriate tool.

Stack: Python · Streamlit · Supabase (PostgreSQL) · Streamlit Cloud · Cloudflare