Bakery Anomaly Detection

Production order error detection for an artisanal bakery — applied ML where simpler is necessary

2 months in production · Paper in preparation for SOMI @ ECML-PKDD 2026

Overview

An end-to-end anomaly detection system built for an artisanal bakery in northern Italy (~50 employees, ~200 products). Every morning, a bakery manager exports a CSV listing planned production quantities for the day. This system ingests that file, flags likely data-entry errors before they propagate to procurement and kitchen scheduling, and lets the operator review and correct them in seconds.

The operator is the bakery owner — not a data scientist.

The Core Problem

Standard anomaly detection methods (Isolation Forest, LOF, autoencoders) look impressive on benchmarks. This project documents why they are structurally inapplicable to a whole class of real SME deployments, and what the right tool actually is.

Five properties make trained ML models unsuitable:

  1. Day-of-week independence — Monday orders differ systematically from Saturday for every product. A DOW-aware model would need ~600 instances, each trained on 4 data points.
  2. No inter-product correlation — a mistyped quantity carries zero signal in any other product.
  3. Extreme data scarcity — a 4-week rolling window means 4 training observations per model.
  4. Continuous distribution shift — seasonal volumes, new products, discontinued items.
  5. Zero-maintenance requirement — no IT budget for retraining pipelines.

The Solution

A rolling statistical estimator with a conjunction rule: for each (product, day-of-week) pair, compute mean and standard deviation over the last W same-weekday observations (default W = 4). Flag an anomaly only when all three conditions hold simultaneously:

  • Z-score > 7.0
  • Percentage deviation > 30%
  • Absolute deviation > volume-tier threshold
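
The three conditions above can be sketched in a few lines of Python. This is a minimal illustration, not the deployed code: the tier boundaries in VOLUME_TIERS are hypothetical (the document does not list them), the Z-score is assumed to be two-sided, and a constant history (zero standard deviation) is assumed to count as an infinite Z-score.

```python
from statistics import mean, stdev

# Hypothetical volume tiers: (upper bound on baseline mean, minimum absolute deviation).
# The real tier boundaries are not given in this document.
VOLUME_TIERS = [(10.0, 5.0), (100.0, 15.0), (float("inf"), 40.0)]


def abs_threshold(baseline_mean):
    """Return the absolute-deviation threshold of the tier containing baseline_mean."""
    for upper, threshold in VOLUME_TIERS:
        if baseline_mean <= upper:
            return threshold
    return VOLUME_TIERS[-1][1]


def is_anomaly(history, value, z_min=7.0, pct_min=0.30):
    """Apply the conjunction rule to `value` given the last W same-weekday quantities."""
    if len(history) < 2:
        return False  # sample standard deviation is undefined for fewer than 2 points
    mu = mean(history)
    sigma = stdev(history)
    abs_dev = abs(value - mu)
    if abs_dev == 0:
        return False
    # Assumption: a constant history (sigma == 0) yields an infinite Z-score.
    z = abs_dev / sigma if sigma > 0 else float("inf")
    pct = abs_dev / mu if mu > 0 else float("inf")
    # Flag only when all three conditions hold at once.
    return z > z_min and pct > pct_min and abs_dev > abs_threshold(mu)
```

With a Monday history of [48, 42, 54, 48], an entry of 500 passes all three tests and is flagged, while 55 is not. A history of [2, 2, 2, 2] with an entry of 3 passes the Z-score and percentage tests but fails the absolute test, which is the kind of low-volume noise the conjunction is meant to absorb.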

A fourth component — a year-over-year seasonality shield — suppresses false positives on seasonal products (Easter specialties, holiday items) by comparing against the same date one year prior.
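
One way the shield could work is sketched below. The function names, the dictionary lookup, and the 30% tolerance are assumptions made for illustration; the document only states that the comparison is against the same date one year prior.

```python
from datetime import date


def one_year_prior(d):
    """Same calendar date one year earlier; 29 Feb falls back to 28 Feb."""
    try:
        return d.replace(year=d.year - 1)
    except ValueError:
        return d.replace(year=d.year - 1, day=28)


def shielded(value, day, yearly_history, tolerance=0.30):
    """Return True if last year's quantity on the same date explains `value`.

    `yearly_history` maps a date to the quantity ordered on that date
    (hypothetical structure). `tolerance` is an assumed relative margin.
    """
    reference = yearly_history.get(one_year_prior(day))
    if reference is None or reference <= 0:
        return False  # no usable reference, so the flag stands
    return abs(value - reference) / reference <= tolerance
```

A flag raised by the conjunction rule would then be suppressed only when shielded(...) returns True, so products without a year of history are unaffected.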

Self-adapting by construction: no model to retrain, no artefact to version, no drift detector to maintain. New products are handled from their second observation.

Architecture

CSV upload (daily)
       │
  Streamlit frontend
       │
  Detection engine (Python)
  ├── Rolling Z-score per (product, DOW)
  ├── Conjunction rule
  ├── Volume-tier thresholds
  └── YoY seasonality shield
       │
  Supabase PostgreSQL
  ├── Historical orders
  ├── Precomputed baselines
  ├── Audit log (append-only)
  └── Auth (role-based)
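
As a rough sketch of the first step in the detection engine, the snippet below groups historical rows into one rolling window per (product, DOW) pair using only the standard library. The column names date, product and quantity are hypothetical, since the actual CSV layout is not described here.

```python
import csv
import io
from collections import defaultdict, deque
from datetime import date

W = 4  # rolling window of same-weekday observations


def build_windows(csv_text):
    """Map (product, weekday) to the last W quantities, oldest first.

    Assumes ISO dates (YYYY-MM-DD), so sorting the strings sorts chronologically.
    Weekday follows date.weekday(): Monday is 0, Sunday is 6.
    """
    windows = defaultdict(lambda: deque(maxlen=W))
    rows = sorted(csv.DictReader(io.StringIO(csv_text)), key=lambda r: r["date"])
    for row in rows:
        day = date.fromisoformat(row["date"])
        windows[(row["product"], day.weekday())].append(float(row["quantity"]))
    return windows
```

Because each deque has maxlen=W, older observations fall out automatically as new ones arrive, which is one way to get the "no model to retrain" behaviour with no extra bookkeeping.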

Two user roles: an administrator view (full table with Z-score, deviation, time-series charts) and a simplified operator view (one review card per flag, business-language summaries: "You ordered 500 kg; the usual Monday order is 48 ± 6 kg").
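
The operator-facing sentence can be produced directly from the same rolling statistics that drive the flag. The helper below is an illustrative sketch; its name, signature, and rounding are assumptions rather than the deployed implementation.

```python
from statistics import mean, stdev

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def operator_summary(value, history, weekday, unit="kg"):
    """Render a flag as a business-language sentence.

    `history` holds the last W same-weekday quantities (at least 2 values);
    `weekday` follows date.weekday(), with Monday as 0.
    """
    mu = mean(history)
    sigma = stdev(history)
    return (f"You ordered {value:.0f} {unit}; "
            f"the usual {WEEKDAYS[weekday]} order is {mu:.0f} ± {sigma:.0f} {unit}")
```

For example, operator_summary(500, [41, 48, 55, 48], 0) yields "You ordered 500 kg; the usual Monday order is 48 ± 6 kg".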

Deployed on free tiers only: Streamlit Cloud, Supabase, Cloudflare DNS.
Total recurring infrastructure cost: €0/month.

Benchmark Results

496-anomaly labelled test set, constructed with the bakery's cooperation. Competing methods receive oracle threshold selection (best possible F1 on test set) — the deployed Z-score gets no oracle.

Method                          Window W   Precision   Recall   F1
Z-score (deployed, no oracle)   4          89.0%       90.9%    85.7%
Z-score (oracle)                4          83.0%       91.1%    86.9%
MAD (oracle)                    4          82.3%       83.2%    82.8%
Holt-Winters (oracle)           4          0% (inapplicable)
Z-score (oracle)                24         86.5%       91.6%    89.0%
MAD (oracle)                    24         83.7%       92.8%    88.0%
Holt-Winters (oracle)           24         66.5%       75.7%    70.8%

Without any oracle selection, the deployed system reaches the highest precision of any method at either window size (89.0%). Its F1 trails the oracle Z-score at W = 4 by 1.2 pp. Holt-Winters is entirely inapplicable at W = 4 and plateaus ~18 pp of F1 below Z-score at W = 24, while being 65–130× slower per inference.
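
To make the "oracle" handicap concrete, the sketch below contrasts a fixed threshold with one picked after the fact to maximise F1 on the labelled test set. The scores and labels are toy values invented for illustration, not the benchmark data, and the function names are hypothetical.

```python
def prf(predicted, actual):
    """Return (precision, recall, F1) for two equal-length lists of booleans."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def oracle_threshold(scores, labels):
    """Pick the score threshold with the best F1 on the labelled set itself.

    This uses the test labels, which is exactly what a deployed system cannot do.
    Ties go to the lowest threshold.
    """
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: prf([s >= t for s in scores], labels)[2])


# Toy data: anomaly scores and ground-truth labels (illustrative only).
scores = [1.0, 2.0, 8.0, 9.0, 3.0, 12.0]
labels = [False, False, True, True, True, False]

fixed = prf([s >= 7.0 for s in scores], labels)      # threshold chosen in advance
best_t = oracle_threshold(scores, labels)            # threshold chosen with hindsight
oracle = prf([s >= best_t for s in scores], labels)
```

On this toy data the oracle picks a threshold of 3.0 and scores a higher F1 than the fixed threshold of 7.0, which is why a small gap between a fixed-threshold system and oracle-tuned baselines is a conservative comparison.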

Key Lessons

  • Precision calibration over recall in human-in-the-loop deployments. A false-positive rate above ~20% causes operator disengagement — effective detection drops to zero.
  • Seasonality bites hard in production. The YoY shield was added after the first Easter, when legitimate demand spikes eroded operator trust within days.
  • Zero-maintenance is a hard constraint. Any system requiring scheduled retraining would have been abandoned within weeks.
  • Explainability is structurally unavailable with IF/LOF. Rolling Z-score decomposes every flag into interpretable business quantities.

Academic Output

"When Simpler Is Necessary: Anomaly Detection for Production Orders in an Artisanal Bakery"
Lucio Baiocchi — MSc Data Science and Engineering, Politecnico di Torino
Submitted to SOMI Workshop @ ECML-PKDD 2026 (Springer LNCS)

Main contribution: a problem characterisation framework — a decision tree that predicts, from the structural properties of a monitoring problem, whether a rolling statistical estimator or a trained ML model is the appropriate tool.

Stack: Python · Streamlit · Supabase (PostgreSQL) · Streamlit Cloud · Cloudflare