AI models still struggle to read analog clocks. Here’s what that failure teaches about generalization risk in supply chain & procurement for energy and utilities.

AI Can’t Read Analog Clocks—Should You Trust It?
A multimodal AI model can summarize a contract, describe a substation photo, and draft an incident report—then misread an analog clock that a tired operator can parse at a glance. That’s not a quirky party trick. It’s a warning label.
IEEE Internet Computing published results (October 2025) showing that four multimodal large language models (MLLMs) initially failed to read analog clock times accurately, improved after additional training, and then regressed when tested on unfamiliar clock images. The study used a synthetic dataset spanning 43,000+ indicated times and added 5,000 more training images to fine-tune performance—only to see generalization break when the clock style changed.
For teams working on AI in supply chain & procurement inside energy and utilities, this matters because your models don’t live in tidy benchmark worlds. They face messy reality: different meter vendors, sensor replacements mid-year, camera angles at depots, handwritten BOLs, shifting supplier packaging, and edge devices with inconsistent lighting. If a model can’t reliably interpret “simple” spatial cues like clock hands across variations, it can also misinterpret “simple” operational cues like a mislabeled pallet, a flipped gauge photo, or a one-off SCADA screen layout.
Why “telling time” is a stress test for operational AI
Reading an analog clock forces a model to combine detection + geometry + reasoning. Humans do this automatically: identify the hands, map angles to numbers, and convert that to time. MLLMs often treat this as a loose pattern-matching problem—and that’s where things fall apart.
The clock task bundles three capabilities your supply chain AI depends on:
- Object identification: “Which line is the hour hand vs the minute hand?”
- Spatial orientation: “What’s the exact angle, direction, and relative position?”
- Symbolic conversion: “Translate that geometry into a discrete output (time).”
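Once perception succeeds, that last step is pure geometry. Here’s a minimal sketch in Python, assuming an upstream detector has already supplied the two hand angles (the step the study found most brittle):

```python
def angles_to_time(hour_angle_deg: float, minute_angle_deg: float) -> str:
    """Convert detected clock-hand angles (measured clockwise from 12)
    into a time string. Assumes perception has already told us which
    hand is which -- the part the MLLMs got wrong."""
    minute = round(minute_angle_deg / 6) % 60    # 360 degrees / 60 minutes
    hour = int(hour_angle_deg // 30) % 12 or 12  # floor to the last passed hour mark
    return f"{hour}:{minute:02d}"

# Hour hand a bit past 3 (95 degrees), minute hand at 60 degrees -> "3:10"
print(angles_to_time(95.0, 60.0))
```

The same decomposition applies to gauge needles and dial indicators: the conversion is mechanical; the risk lives in the perception that feeds it.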
In procurement and logistics analytics, those same steps show up constantly:
- Identifying an asset label or part number (object identification)
- Understanding orientation and context in an image (spatial orientation)
- Converting what’s seen into an actionable field in a system of record (symbolic conversion)
If any one step is wrong, errors cascade. The IEEE study observed exactly that: errors in identifying the clock hands compounded into larger spatial errors downstream. In operations, that’s the difference between a harmless typo and a chain reaction—wrong part, wrong work order, wrong crew dispatch, missed SLA.
A useful rule: If your model’s perception step is brittle, your “reasoning” step becomes confident nonsense.
The real issue: generalization fails right when conditions change
MLLMs can look competent on “known” data and still fail on new scenarios. The study’s pattern is familiar: training helped, test-on-similar improved, test-on-new styles degraded.
That’s not a minor technicality. In supply chain & procurement for utilities, change is constant:
- A supplier switches packaging artwork
- A warehouse changes lighting and camera mounting height
- Field techs use a new phone model with different image processing
- A meter vendor updates a display font
- A storm event drives abnormal inventory flows and work order patterns
This is why many AI pilots “work great” in a controlled proof-of-concept and then quietly underperform at rollout.
Synthetic training data helps—until it teaches the wrong shortcuts
The researchers used synthetic clock images, which is a smart way to scale labeled data. Energy supply chains can do the same with synthetic documents (POs, invoices), synthetic images (label variations), and simulated grid/asset telemetry.
But synthetic data has a catch: models can learn the generator’s style, not the underlying concept. If your synthetic labels always appear in the same corner, the model learns “corner = truth.” When a real photo violates that assumption, accuracy drops.
My take: synthetic data is worth using, but only if you pair it with brutal evaluation against “weird” real-world samples.
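One practical hedge is to randomize every nuisance variable the generator controls, so the model has nothing incidental to latch onto. A minimal sketch of a parameter sampler for a hypothetical synthetic label generator (all names and ranges below are illustrative, not from any specific tool):

```python
import random

def sample_label_scene() -> dict:
    """Sample randomized nuisance parameters for a hypothetical synthetic
    label generator, so shortcuts like 'corner = truth' never hold."""
    return {
        "position": (random.random(), random.random()),  # normalized x, y, anywhere in frame
        "rotation_deg": random.uniform(-30, 30),
        "scale": random.uniform(0.5, 1.5),
        "brightness": random.uniform(0.6, 1.4),
        "occlusion_frac": random.choice([0.0, 0.0, 0.1, 0.25]),  # mostly clean, sometimes covered
        "surface": random.choice(["flat", "curved", "wrinkled", "glossy"]),
    }

# Each rendered training image gets its own sampled scene.
print(sample_label_scene())
```

Then evaluate against real samples that violate even these randomized ranges.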
What “warped clocks” teach us about messy operational inputs
In the clock study, researchers distorted clock shapes and modified hands (like adding arrows). Humans still read the time. The models often couldn’t.
That’s the most relevant part for procurement and utility operations: your inputs are warped clocks. Not artistically—operationally.
Consider a few everyday “Dalí-clock” equivalents:
- Part labels on curved surfaces (cylinders, valves, drums)
- Perspective distortion in photos taken inside trucks or cramped storerooms
- Non-standard annotations (marker scribbles, tape, QR codes partially covered)
- Different iconography across OEM dashboards and HMI screens
- Document scans that are skewed, low-contrast, or cropped
A model that “knows” what a label looks like in ideal conditions can still fail when the label is glossy, wrinkled, rotated, or half-obscured.
The failure mode that should worry you most: cascading errors
The study suggests errors stack: mis-identify the hands → larger spatial mistakes → wrong time.
In supply chain AI, cascading looks like:
- OCR misreads one character in a part number → system matches the wrong item → reorder triggers for the wrong SKU → stockout for the right one
- Vision model mis-detects a hazard label → compliance workflow is skipped → shipment gets rejected → outage restoration parts arrive late
- Forecasting model overfits “normal” seasonal demand → misses an extreme weather-driven spike → procurement lead times can’t catch up
The lesson isn’t “don’t use AI.” It’s “design for failure, and treat distribution shift as normal.”
Practical safeguards for AI in supply chain & procurement (energy edition)
You don’t fix generalization with more training alone. You fix it with a system: data, evaluation, controls, and human-in-the-loop workflows that are built for critical infrastructure.
Here’s what works in practice.
1) Test like your suppliers (and the weather) want you to fail
Your evaluation set should be more diverse than your training set. If you only test on “clean” data, you’re grading the model on memorization.
Build a “torture test” pack that includes:
- New suppliers and new packaging designs
- Multiple camera devices and lighting conditions
- Rotated, blurred, occluded labels
- Rare but costly scenarios (storm surge inventory moves, emergency buys)
- Interface changes (ERP upgrade screens, new invoice templates)
Operational KPI suggestion (simple and extractable): Track model accuracy by scenario bucket, not just overall accuracy.
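A minimal sketch of that KPI, assuming each evaluation record is tagged with the scenario bucket it came from (field names are illustrative):

```python
from collections import defaultdict

def accuracy_by_bucket(results: list[dict]) -> dict[str, float]:
    """Compute accuracy per scenario bucket instead of one overall number."""
    tally = defaultdict(lambda: [0, 0])  # bucket -> [correct, total]
    for r in results:
        tally[r["bucket"]][0] += int(r["correct"])
        tally[r["bucket"]][1] += 1
    return {bucket: correct / total for bucket, (correct, total) in tally.items()}

results = [
    {"bucket": "clean_scan", "correct": True},
    {"bucket": "clean_scan", "correct": True},
    {"bucket": "rotated_label", "correct": True},
    {"bucket": "rotated_label", "correct": False},
]
print(accuracy_by_bucket(results))  # {'clean_scan': 1.0, 'rotated_label': 0.5}
```

An overall 0.9 can hide a 0.5 on exactly the bucket a storm will hand you.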
2) Add confidence thresholds that trigger verification, not auto-action
High automation with low certainty is how small errors become expensive incidents.
For supply chain and procurement AI, set explicit routing rules:
- If confidence ≥ 0.95 → auto-post (low-risk fields)
- If 0.80–0.95 → send to quick human verification
- If < 0.80 → fall back to rules/OCR baseline + manual review
This isn’t “slowing down.” It’s focusing human attention where it pays off.
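A minimal sketch of that routing, assuming your extraction model reports a per-field confidence score (the thresholds mirror the list above and should be tuned per field and risk level; queue names are illustrative):

```python
def route_extraction(confidence: float, low_risk: bool) -> str:
    """Route one extracted field based on model confidence."""
    if confidence >= 0.95 and low_risk:
        return "auto_post"            # straight into the system of record
    if confidence >= 0.80:
        return "human_quick_verify"   # one-click confirm in a review queue
    return "rules_fallback_manual"    # deterministic OCR/rules baseline + full review

# A part-number read at 0.88 confidence goes to quick verification.
print(route_extraction(0.88, low_risk=False))
```

Note that a high-confidence read on a high-risk field still routes to verification; confidence is not a risk score.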
3) Use redundancy: two models, two modalities, or a model + rules
Analog clock reading is hard partly because it’s a compound task. In operations, compound tasks should be decomposed.
Examples of redundancy patterns:
- Vision + metadata cross-check: label read must match PO item master constraints (valid prefixes, check digits)
- Model + deterministic geometry: if you’re interpreting gauge-like visuals, compute angles with classical vision and let the LLM handle only the narrative
- Two-pass extraction: first pass extracts candidates, second pass validates against business rules and supplier catalogs
A memorable one-liner I use internally: “Let models guess; let rules decide.”
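Here’s that one-liner as code: a minimal sketch where a model’s candidate part number is accepted only if it passes deterministic checks. The prefixes and the Luhn check digit are stand-ins for whatever your item master actually enforces:

```python
VALID_PREFIXES = {"VLV", "XFR", "MTR"}  # hypothetical item-master prefixes

def luhn_ok(digits: str) -> bool:
    """Luhn check digit -- a stand-in for your real part-number scheme."""
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:  # double every second digit from the right
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total % 10 == 0

def accept_part_number(candidate: str) -> bool:
    """Let models guess; let rules decide."""
    prefix, _, digits = candidate.partition("-")
    return prefix in VALID_PREFIXES and digits.isdigit() and luhn_ok(digits)

print(accept_part_number("VLV-79927398713"))  # True: valid prefix + check digit
print(accept_part_number("VLV-79927398718"))  # False: one misread character
```

An OCR misread that slips past the model gets caught by arithmetic, not by luck.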
4) Design for drift: monitoring, retraining cadence, and vendor change events
If performance drops when clock styles change, it will drop when:
- you onboard a new supplier
- you change a warehouse process
- you roll out new mobile devices
Treat those events as model risk events. Put them on the same calendar as system change management.
Minimum drift monitoring that’s actually doable:
- Weekly sample audits (small but consistent)
- Automated alerts for distribution shift (input length, image brightness, template changes)
- A retraining queue tied to supplier onboarding and major storms
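A minimal sketch of one such alert, assuming you already log a scalar per input (image brightness, document length); the 3-sigma rule is illustrative, and a production version might use PSI or a KS test instead:

```python
import statistics

def drift_alert(reference: list[float], recent: list[float],
                z_threshold: float = 3.0) -> bool:
    """Flag drift when the recent mean of a monitored input statistic
    moves far outside the reference distribution's spread."""
    mu = statistics.mean(reference)
    sigma = statistics.stdev(reference)
    return abs(statistics.mean(recent) - mu) > z_threshold * sigma

# Image brightness after a warehouse relamping project:
reference = [0.52, 0.55, 0.49, 0.51, 0.53, 0.50, 0.54]
recent = [0.78, 0.81, 0.76, 0.80]
print(drift_alert(reference, recent))  # True -> queue a sample audit
```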
5) Start with “decision support,” then earn your way to automation
For energy and utilities, the most dependable path to value is usually staged:
- Decision support (reduce analyst workload, flag anomalies)
- Guardrailed automation (auto-fill + verify)
- Selective auto-action (only for low-risk, high-confidence paths)
If your procurement AI is touching emergency buys, critical spares, or compliance documentation, default to conservative controls.
People also ask: “If AI can’t tell time, is it ready for critical infrastructure?”
Yes—if you stop expecting one model to do everything, and you engineer the surrounding system. The clock result doesn’t mean AI is useless. It means perception and generalization are fragile, especially across visual variations.
AI is ready for supply chain & procurement when you can answer these questions clearly:
- What happens when the model is wrong?
- How quickly do we detect it?
- How do we prevent repeat failures?
- Can we prove performance under realistic variability?
If those answers are vague, the risk isn’t theoretical.
A better way to trust AI: prove it under variation
Analog clock reading looks “easy” until you list the steps. That’s the point. Many supply chain workflows feel straightforward until you map every assumption: stable templates, consistent part labeling, predictable demand patterns, uniform field photo quality.
Energy supply chains and procurement teams are being asked to do more with fewer people heading into 2026: faster storm response, tighter spare parts strategy, better supplier risk management, and shorter outage windows. AI can help—especially with forecasting, invoice processing, supplier analytics, and predictive maintenance-related parts planning. But only if you treat generalization as a first-class requirement.
The question to ask your team (and any vendor) isn’t “What’s the benchmark accuracy?” It’s “What happens when the clock melts?”