AI Can’t Read a Clock—Here’s Why Utilities Should Care

AI in Supply Chain & Procurement · By 3L3C

AI can’t reliably read analog clocks. For utilities, that failure maps to real risks in time-sensitive supply chain and procurement workflows.

Tags: utilities, energy-ai, procurement, supply-chain, ai-governance, model-risk

A recent study tested multimodal large language models on a task most 8-year-olds can do: read an analog clock. The models didn’t just struggle—they failed, improved only after extra training, then fell apart again when the clocks changed.

That sounds like a quirky internet fact. It isn’t.

If you work in energy and utilities, or you’re responsible for AI in supply chain and procurement, this is a clean, easy-to-understand example of a bigger operational truth: general-purpose AI is often brittle when the “shape” of the real world shifts. And in mission-critical environments—grid operations, field maintenance, outage response, fuel scheduling, spares planning—that brittleness turns into cost, risk, and delayed decisions.

This post unpacks what the “AI can’t tell time” result actually means, and how to apply it to AI projects in utilities—especially those that touch procurement workflows and the supply chain.

What the clock-failure experiment reveals about multimodal AI

Key point: The clock study is a simple demonstration of a hard problem: combining perception (seeing) with reasoning (inferring) under real-world variation.

Researchers built a large dataset of synthetic analog clock images representing 43,000+ distinct times, then evaluated four multimodal LLMs. All four initially failed. After additional training with 5,000 more images, performance improved—until the models were tested on a new set of clocks they hadn’t seen. Then performance dropped again.

That pattern—train → improve → drift to a new scenario → fail—is exactly what catches enterprises off guard.

The “cascading failure” problem (and why it’s scary in operations)

Reading an analog clock forces a model to do several dependent steps correctly:

  1. Identify the clock hands
  2. Tell the hour hand from the minute hand
  3. Estimate angles precisely
  4. Convert those angles into a time

The study found something that shows up across real deployments: an early perception error cascades into a larger reasoning error. If the model misreads the hand style, its spatial estimation gets worse—and the final answer is wrong.
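
To make the cascade concrete, here is a small illustrative sketch (not code from the study) of the last two steps: turning hand angles into a time, and what happens when step 2 goes wrong and the hands are confused.

```python
def angles_to_time(hour_angle_deg: float, minute_angle_deg: float) -> str:
    """Convert clock-hand angles (measured clockwise from 12) into a time string."""
    minutes = round(minute_angle_deg / 6) % 60        # minute hand moves 6 degrees per minute
    hours = int(hour_angle_deg // 30) % 12 or 12      # hour hand moves 30 degrees per hour
    return f"{hours}:{minutes:02d}"

# Correct perception: hour hand at 95 degrees, minute hand at 60 degrees -> 3:10
print(angles_to_time(95, 60))   # "3:10"

# Step-2 failure: the hands are confused with each other.
# The same image now yields a confidently wrong answer.
print(angles_to_time(60, 95))   # "2:16", a small perception error, a large final error
```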

For utilities, this is the same failure mode you see when:

  • An inspection model mis-detects a component (perception), then the work order prioritization logic (reasoning) routes the job incorrectly.
  • A document AI tool misreads one field on a supplier certificate, then downstream compliance checks fail—or worse, pass incorrectly.
  • A forecasting model ingests misaligned timestamps, then planning outputs look “confident” while being operationally wrong.

Generalization isn’t optional in energy systems

The clock study also highlights a basic limitation: many models are good at patterns they’ve seen and unreliable when inputs shift (different clock designs, distortions, arrow-shaped hands).

In the utility world, input shift is constant:

  • Different OEM documentation formats
  • Changing tariff structures
  • Seasonal load patterns (and holiday-driven anomalies)
  • New sensor firmware and recalibrations
  • Asset replacements that change visual appearance
  • Extreme-weather operations (December storms are a perfect example)

A model that only works “when the clock looks familiar” is a model that will break the week you need it most.

Why “telling time” maps directly to utility AI risk

Key point: Time is the hidden backbone of grid operations and supply chain execution. If an AI system can’t reliably interpret time-related context, it can’t be trusted to run time-sensitive workflows.

Utilities are full of time-coupled decisions:

  • Dispatch windows for generation and storage
  • Switching schedules
  • Crew routing and restoration sequencing
  • SCADA event ordering
  • Protection and relay coordination studies
  • Delivery lead times and outage-driven expedite decisions

Even outside operations, procurement is time-sensitive:

  • Framework agreements expiring
  • Lead-time volatility for transformers, breakers, and switchgear
  • Seasonal constraints for construction and outages
  • SLA clocks for vendor response

A small “time interpretation” error rarely stays small. It becomes:

  • Wrong priorities
  • Missed windows
  • Overtime and expedite costs
  • Reliability hits
  • Audit and compliance exposure

Here’s the stance I’ll take: if your AI can’t explain its confidence around time-bound decisions, it shouldn’t be allowed to automate them. Assist, yes. Auto-approve, no.

The hidden costs of brittle AI in supply chain & procurement

Key point: The most expensive AI failures in procurement aren’t dramatic—they’re quiet, repetitive, and hard to trace.

When AI models fail in utility procurement and supply chain, the bill shows up as:

  • Excess inventory because risk buffers get inflated after unreliable forecasts
  • Stockouts of critical spares because the system “learned” a pattern that stopped being true
  • Supplier performance blind spots because data extraction breaks on a new template
  • Cycle time creep because teams stop trusting automation and re-check everything

Example: the “new template” supplier document failure

Imagine a document AI workflow that ingests vendor packing slips and auto-reconciles receipts. It works for months.

Then a strategic supplier rolls out a new ERP template. Columns shift. One field now includes units in parentheses. OCR still “reads” the text, but the extraction model mislabels a quantity field as a part number. The three-way match fails.

What happens next is predictable:

  • AP exceptions spike
  • Receiving queues get jammed
  • Planners lose visibility into what’s actually arrived
  • Expediting starts (often unnecessarily)

That’s the clock problem in a different outfit: a small perception shift triggers a cascading workflow failure.
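
If this scenario resonates, one cheap guardrail is a validation gate ahead of the three-way match. The sketch below is purely illustrative; the field names, patterns, and the swapped values are assumptions, not any particular parser’s output.

```python
import re

# Hypothetical extracted fields from a packing-slip parser; the values are swapped
# the way a template change might swap them.
line_item = {"part_number": "12 (EA)", "quantity": "TX-4500-B", "po_line": "00030"}

def looks_like_quantity(value: str) -> bool:
    """Accept '12', '12.0', or '12 (EA)'-style values; reject anything else."""
    return re.fullmatch(r"\d+(\.\d+)?\s*(\([A-Za-z]+\))?", value.strip()) is not None

def looks_like_part_number(value: str) -> bool:
    """Very rough check for an alphanumeric part-number pattern."""
    return re.fullmatch(r"[A-Z0-9][A-Z0-9\-]{3,}", value.strip()) is not None

def validate_line_item(item: dict) -> list[str]:
    issues = []
    if not looks_like_quantity(item["quantity"]):
        issues.append("quantity field does not parse as a number")
    if not looks_like_part_number(item["part_number"]):
        issues.append("part number does not match expected pattern")
    return issues

issues = validate_line_item(line_item)
if issues:
    # Route to human review instead of feeding bad fields into the three-way match.
    print("Hold for review:", issues)
```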

How utilities avoid these failures: practical guardrails that work

Key point: Utilities don’t need “more AI.” They need tighter AI systems: domain-tuned, measurable, and constrained by operational controls.

Here are guardrails I’ve found consistently reduce production surprises.

1. Use domain-tuned models for mission-critical workflows

If a workflow affects safety, reliability, or compliance, general-purpose multimodal AI shouldn’t be the core engine.

Instead, use:

  • Domain-tuned models trained on your asset classes, document types, and operating conditions
  • Hybrid architectures where deterministic rules handle the “hard constraints”
  • Narrow vision models for perception tasks (inspection, component detection)

This isn’t philosophical. It’s economic. A smaller, specialized model that behaves predictably is worth more than a larger model that surprises you.
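
As a rough sketch of that hybrid pattern, deterministic rules can enforce the hard constraints no matter what the model proposes. The policy limits and the model call below are placeholders, not any specific product’s API.

```python
# Hard constraints live in plain rules; the model only proposes within them.
MIN_CRITICAL_SPARES = 4      # illustrative policy floor for a critical spare
MAX_SINGLE_ORDER_QTY = 20    # illustrative single-order approval limit

def model_suggested_reorder_qty(part_id: str) -> int:
    # Placeholder for a forecasting model's suggestion.
    return 35

def constrained_reorder_qty(part_id: str, on_hand: int) -> int:
    suggestion = model_suggested_reorder_qty(part_id)
    # Rule 1: never let stock fall below the policy floor, regardless of the forecast.
    required = max(suggestion, MIN_CRITICAL_SPARES - on_hand)
    # Rule 2: cap any single order at the approved limit; larger buys need a human.
    return min(required, MAX_SINGLE_ORDER_QTY)

print(constrained_reorder_qty("breaker-52A", on_hand=1))  # 20, capped by rule 2
```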

2. Design for “unknown clock faces” (shift testing)

Most AI evaluations are too polite. They test on data that looks like training data.

For utilities and energy supply chains, testing must include:

  • New supplier templates
  • Alternate camera angles / lighting
  • Different device firmware and sampling rates
  • Regional differences (units, abbreviations, language variants)
  • Stress conditions (storm operations, peak demand weeks)

Treat this as a formal practice: out-of-distribution testing isn’t an academic phrase—it’s your insurance policy.
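
One way to formalize it is a small shift-test harness that runs the same evaluation across held-out variation sets and blocks the release if any set drops below a floor. The accuracy floor, dataset names, and evaluation function below are illustrative assumptions.

```python
# Minimal sketch of a pre-release shift-test gate; names and thresholds are illustrative.
ACCURACY_FLOOR = 0.95

def evaluate(model, dataset) -> float:
    """Placeholder scorer: fraction of items the model handles correctly."""
    correct = sum(1 for example in dataset if model(example["input"]) == example["expected"])
    return correct / len(dataset)

def shift_test(model, variation_sets: dict) -> bool:
    """Run the same evaluation on every variation set; fail if any falls below the floor."""
    passed = True
    for name, dataset in variation_sets.items():
        score = evaluate(model, dataset)
        status = "OK" if score >= ACCURACY_FLOOR else "FAIL"
        print(f"{name:30s} {score:.2%}  {status}")
        passed = passed and score >= ACCURACY_FLOOR
    return passed

# variation_sets might include "new_supplier_template", "storm_week_backlog",
# "regional_units", and "new_firmware_sampling", each a held-out dataset.
```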

3. Add confidence gates and human approvals where it matters

Clock-reading models often respond confidently even when wrong. That’s not unique to clocks.

So build workflows like this:

  • Auto-process only when confidence is above a strict threshold
  • Route low-confidence cases to a human reviewer
  • Track why the system was uncertain (missing field, poor scan, unseen format)
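
A minimal version of that gate might look like the sketch below; the field names and the 0.90 threshold are illustrative, not a recommendation for your workflow.

```python
CONFIDENCE_THRESHOLD = 0.90   # strict by design; tune per workflow

def log_exception(document_id: str, reason: str) -> None:
    # In practice this would feed an exception dashboard, not stdout.
    print(f"review queue <- {document_id}: {reason}")

def route_extraction(result: dict) -> str:
    """Auto-process only high-confidence extractions; everything else goes to a person."""
    if result["confidence"] >= CONFIDENCE_THRESHOLD:
        return "auto_process"
    # Record why the model was uncertain so patterns (new template, poor scan) surface.
    log_exception(result["doc_id"], result.get("uncertainty_reason", "unknown"))
    return "human_review"

decision = route_extraction(
    {"doc_id": "cert-0481", "confidence": 0.62, "uncertainty_reason": "unseen supplier template"}
)
# decision == "human_review"
```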

In procurement, this is especially effective for:

  • Contract clause extraction
  • Compliance documentation
  • Supplier onboarding
  • Invoice coding and approval recommendations

4. Monitor drift like you monitor the grid

Utilities already understand monitoring culture. Apply it to AI:

  • Define model KPIs (precision/recall, exception rates, rework time)
  • Set alert thresholds
  • Trigger retraining or rule updates when drift is detected

A strong operational approach: treat model performance drops like reliability events, with root cause analysis, mitigation, and prevention.
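
In code, that can be as simple as alarming on a rolling exception rate, the same way you would watch any other operational KPI. The window size and threshold below are placeholder assumptions.

```python
from collections import deque

class DriftMonitor:
    """Alarm when the rolling exception rate exceeds a threshold, like any other grid KPI."""

    def __init__(self, window: int = 500, max_exception_rate: float = 0.05):
        self.outcomes = deque(maxlen=window)       # True = exception, False = clean pass
        self.max_exception_rate = max_exception_rate

    def record(self, was_exception: bool) -> None:
        self.outcomes.append(was_exception)
        if len(self.outcomes) == self.outcomes.maxlen and self.exception_rate() > self.max_exception_rate:
            self.alarm()

    def exception_rate(self) -> float:
        return sum(self.outcomes) / len(self.outcomes)

    def alarm(self) -> None:
        # In practice: page the owning team, open an incident, and start root cause.
        print(f"DRIFT ALARM: exception rate {self.exception_rate():.1%} over last {len(self.outcomes)} documents")
```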

5. Prefer “decision support AI” over “decision replacement”

If you’re building AI for supply chain planning—demand forecasting, inventory optimization, supplier risk scoring—keep the AI’s job focused:

  • Summarize signals
  • Explain drivers
  • Propose scenarios
  • Quantify tradeoffs

Then let accountable leaders make the call.

That’s not anti-AI. It’s pro-outcomes.

What to do next if you’re deploying AI in utility procurement

Key point: Your next step is to map failure modes before you scale.

If you’re planning (or rescuing) an AI initiative in supply chain and procurement for a utility, start here:

  1. Identify the workflows where time sensitivity is non-negotiable (outage-related materials, emergency sourcing, crew readiness)
  2. List the input variations you’ll face (supplier templates, asset models, seasonal shifts)
  3. Decide what must be deterministic (rules, thresholds, compliance requirements)
  4. Create a shift-test suite you can run before every release
  5. Instrument exceptions so you can learn from failures instead of hiding them

A useful internal mantra: If a model can’t handle a different “clock face,” it’s not ready for the field.

Where this fits in the AI in Supply Chain & Procurement series

This topic series is about using AI to forecast demand, reduce supplier risk, and optimize procurement decisions. The clock-reading failure is a reminder that those benefits only materialize when AI is robust to real-world variation.

Energy and utilities operate in a world of constrained windows, safety standards, and high consequence. The right approach is not blind adoption. It’s disciplined engineering: specialized models, rigorous testing, and operational guardrails.

If you’re evaluating AI tools for procurement automation or supply chain planning, here’s a forward-looking question worth sitting with:

Where, exactly, would a confident-but-wrong AI answer cost you the most next quarter—and what guardrail would prevent it?