AI models struggle to tell time. For utilities, that weakness can derail demand forecasting, grid planning, and procurement. Learn how to build time-robust AI.

Most AI teams assume “time” is a solved problem. It isn’t.
A recent study highlighted an almost comical failure: multimodal large language models (MLLMs) struggled to read analog clocks, even after extra training. The funny part ends quickly when you map that weakness to energy and utilities. If a model can’t consistently interpret time in a simple visual setting, you shouldn’t trust it to reason about temporal patterns in grid operations, demand forecasting, or procurement planning.
And that’s the connection to this series on AI in Supply Chain & Procurement: utilities don’t just buy parts—they buy availability. Lead times, outage windows, storm restoration logistics, and seasonal demand all live and die on temporal accuracy. If your AI misreads time, it misprices risk.
The clock-reading failure is a warning about generalization
Key point: The clock problem isn’t “AI is dumb.” It’s that many models still don’t generalize well across unseen variations.
In the study summarized by IEEE Spectrum, researchers built a dataset of synthetic analog clock images spanning more than 43,000 indicated times and tested four MLLMs, all of which initially failed. With additional training (about 5,000 more images), performance improved. But when the models were tested on a different collection of clocks, performance dropped again.
That pattern is painfully familiar in operational AI:
- A model performs well in a pilot.
- It degrades after rollout because cameras, sensors, lighting, formats, or workflows differ.
- Teams “fix” it with more fine-tuning.
- It breaks again under a new operational edge case.
Why this matters in energy and utilities
Utilities run on “edge cases.” Holidays, cold snaps, heat domes, wildfire smoke, equipment derates, and sudden industrial load changes aren’t rare anomalies anymore—they’re the operating environment.
When an AI system is responsible for:
- load and price forecasting
- outage prediction and restoration ETAs
- battery dispatch and ancillary services
- spares planning for critical assets
…generalization isn’t optional. It’s the whole job.
Why “telling time” is harder than it looks—especially for AI
Key point: Reading an analog clock forces a model to combine multiple skills: detection, geometry, and rule-based composition.
Humans treat clock reading as trivial because we’ve internalized it: identify the hands, estimate their angles, map to numbers, and apply a rule (“minute hand tells minutes, hour hand tells hours, with partial movement”).
The study’s deeper finding is more interesting than “models failed.” It suggests failures cascade:
- If the model misidentifies the hands, it makes larger spatial orientation errors.
- If the clock face is distorted (think Dalí-style warping), humans adapt, but models struggle.
- If the hands change appearance (for example, arrows added to the tips), performance drops sharply.
That “cascade” is the part I want energy and procurement leaders to sit with.
The grid equivalent of confusing the hour and minute hand
In utilities, temporal reasoning often requires combining signals that look individually “simple,” but only make sense together:
- SCADA points + topology + switching logs
- AMI interval data + temperature + calendar effects
- asset health + duty cycle history + maintenance work orders
- supplier lead time + port congestion + storm season + inventory policies
If an AI system gets one component wrong—misaligns timestamps, misreads time zones, mishandles daylight saving transitions, or misinterprets a reporting interval—errors compound. The model doesn’t just get a little worse. It starts making confident, coherent-looking decisions off a broken timeline.
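To make the daylight-saving failure mode concrete, here's a minimal Python sketch using the standard-library `zoneinfo` module. The timestamps are illustrative, not from any real feed: it shows how a naive local reading taken during the US "spring forward" gap silently lands on a different wall-clock time after a round trip through UTC.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Illustrative example: 02:30 local time on 2024-03-10 never happened in
# US Eastern time -- clocks jumped from 02:00 EST straight to 03:00 EDT.
# A naive local timestamp for that reading cannot survive a UTC round trip.
eastern = ZoneInfo("America/New_York")
utc = ZoneInfo("UTC")

naive_reading = datetime(2024, 3, 10, 2, 30)      # nonexistent local time
localized = naive_reading.replace(tzinfo=eastern)  # silently "resolved"
round_trip = localized.astimezone(utc).astimezone(eastern)

# The wall clock has moved: the original 02:30 reading is unrepresentable,
# so every interval joined on that naive timestamp is now misaligned.
print(localized.strftime("%H:%M"), "->", round_trip.strftime("%H:%M"))
```

This is exactly the kind of "small" discrepancy that compounds once the misaligned series feeds a forecast.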
What energy supply chain and procurement teams should do differently
Key point: Treat time as a first-class risk in AI projects, not a formatting detail.
Procurement and supply chain in utilities is already shifting from “buy at lowest unit cost” to “buy resilience.” AI can help, but only if it’s engineered around temporal reality.
Here’s what works in practice.
1) Audit temporal data like you audit financial data
If you can’t trace money, you can’t close the books. Same logic: if you can’t trace time, you can’t trust forecasts.
A practical temporal audit checklist:
- One canonical time standard (usually UTC) at rest; local time only for display.
- Explicit handling for daylight saving time (duplicate hours, missing hours).
- A documented policy for late-arriving data (common in AMI and OT integrations).
- Versioned definitions for intervals (15-min vs 30-min vs 60-min), including when they changed historically.
- Automated tests for time alignment across systems (SCADA vs AMI vs OMS vs weather feeds).
This is unglamorous work. It’s also where many “mysterious” model errors come from.
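As a sketch of what an automated time-alignment test can look like, here's a small Python check for duplicated and missing intervals in a UTC series. The function name and the 15-minute default are illustrative assumptions, not taken from any specific utility stack:

```python
from datetime import datetime, timedelta, timezone

def audit_intervals(timestamps, expected=timedelta(minutes=15)):
    """Flag duplicate and missing readings in a UTC timestamp series."""
    issues = {"duplicates": [], "gaps": []}
    ts = sorted(timestamps)
    for prev, curr in zip(ts, ts[1:]):
        delta = curr - prev
        if delta == timedelta(0):
            issues["duplicates"].append(curr)   # same timestamp twice
        elif delta > expected:
            issues["gaps"].append((prev, curr))  # missing interval(s)
    return issues

# Example: a clean 15-minute series with one duplicated reading
# and one missing interval injected.
base = datetime(2024, 1, 1, tzinfo=timezone.utc)
series = [base + timedelta(minutes=15 * i) for i in range(6)]
series.append(series[2])   # duplicate the 00:30 reading
series.pop(4)              # drop the 01:00 reading
report = audit_intervals(series)
print(report)
```

Run a check like this at every system boundary (SCADA to historian, AMI to billing, weather feed to forecast pipeline), not just once at ingestion.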
2) Use domain-specific evaluation, not generic model benchmarks
The clock study shows why: a model can look capable in demos and still fail a simple operational task.
For utilities, evaluation should mirror decisions you actually make:
- Forecast accuracy by season and extreme events, not just overall MAPE.
- Performance under distribution shift: new feeder configurations, new DER penetration, new tariff structures.
- Robustness to missing data windows and sensor dropouts.
- Error cost weighted by operational impact (being wrong during peak is worse than being wrong at 3 a.m.).
In supply chain terms: don’t score the model like a school test. Score it like a reliability event.
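One way to encode "wrong at peak costs more than wrong at 3 a.m." is a weighted error metric. This is a minimal sketch: the 16:00-21:00 peak window and the 5x weight are illustrative assumptions you'd replace with your own tariff and reliability data.

```python
def cost_weighted_error(actual, forecast, hours, peak=range(16, 21), peak_weight=5.0):
    """Absolute percentage error where peak-hour misses count more.
    Peak window and weight are illustrative, not calibrated values."""
    num = den = 0.0
    for a, f, h in zip(actual, forecast, hours):
        w = peak_weight if h in peak else 1.0
        num += w * abs(a - f)
        den += w * abs(a)
    return num / den

# Two forecasts with identical *unweighted* error: one misses at 17:00,
# the other misses at 03:00.
hours = list(range(24))
actual = [100.0] * 24
wrong_at_peak = [110.0 if h == 17 else 100.0 for h in hours]
wrong_at_3am = [110.0 if h == 3 else 100.0 for h in hours]

peak_err = cost_weighted_error(actual, wrong_at_peak, hours)
offpeak_err = cost_weighted_error(actual, wrong_at_3am, hours)
print(peak_err, offpeak_err)  # the peak miss scores 5x worse
```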
3) Don’t force a language model to be your time-series brain
LLMs are excellent at text workflows—summarizing work orders, drafting supplier communications, extracting clauses from contracts. But many teams overreach and ask an LLM to do what specialized time-series models do better.
A cleaner architecture is usually:
- Time-series forecasting models (statistical, gradient boosting, deep time-series) produce forecasts + uncertainty.
- Optimization engines produce procurement or dispatch recommendations.
- LLMs sit on top to explain, document, and orchestrate workflows—without inventing the numbers.
When you do use an LLM in the loop, constrain it:
- tool-based retrieval (it can fetch, not guess)
- schema validation (it must output structured fields)
- hard guardrails around timestamps, units, and horizons
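A schema-validation guardrail can be as simple as rejecting any response that violates a structured contract. The field names below (`as_of`, `unit`, `horizon_days`) are an assumed schema for illustration, not a specific product's API:

```python
import json
from datetime import datetime

ALLOWED_UNITS = {"MW", "MWh", "units"}  # illustrative whitelist

def validate_llm_output(raw: str) -> dict:
    """Reject any model response that breaks the structured-output contract.
    Schema (field names, bounds) is an illustrative assumption."""
    data = json.loads(raw)                      # must be valid JSON
    ts = datetime.fromisoformat(data["as_of"])  # must parse as ISO 8601
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    if data["unit"] not in ALLOWED_UNITS:
        raise ValueError(f"unknown unit: {data['unit']}")
    if not 1 <= int(data["horizon_days"]) <= 365:
        raise ValueError("horizon out of bounds")
    return data

ok = validate_llm_output(
    '{"as_of": "2024-07-01T00:00:00+00:00", "unit": "MWh", "horizon_days": 14}'
)
```

The point is the failure mode: a response with a naive timestamp or an invented unit never reaches the planning system, no matter how fluent it looks.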
4) Build “temporal robustness” into procurement planning
Utilities buy transformers, breakers, relays, cable, sensors, and fleet parts with lead times that can stretch into months.
AI-supported procurement should explicitly model time-dependent risk:
- lead-time distributions (not single averages)
- seasonal demand bands
- storm/restoration surge scenarios
- vendor capacity and allocation risk
If your AI output is a single number—“order 12 units”—you’re leaving value on the table. The more useful output is:
- “Order 8 now, reserve 6 contingent units, and trigger a reorder if hurricane probability exceeds X by date Y.”
That requires uncertainty-aware time modeling, not just pattern recognition.
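Here's what lead-time-distribution thinking looks like in a few lines of Python: a Monte Carlo estimate of stockout risk from observed lead times rather than a single average. All numbers (lead-time samples, demand model, variability) are illustrative assumptions.

```python
import random

def stockout_probability(on_hand, daily_demand, lead_time_samples,
                         trials=10_000, seed=7):
    """Probability that demand during a *sampled* lead time exceeds stock.
    Demand is modeled as Gaussian per day; purely illustrative."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        lead = rng.choice(lead_time_samples)  # draw from the empirical distribution
        demand = sum(rng.gauss(daily_demand, 0.3 * daily_demand)
                     for _ in range(lead))
        if demand > on_hand:
            hits += 1
    return hits / trials

# Historical lead times in days: skewed, with a long tail a single
# average would hide.
observed_leads = [30, 32, 35, 35, 40, 45, 60, 90]
risk = stockout_probability(on_hand=400, daily_demand=10,
                            lead_time_samples=observed_leads)
print(f"stockout risk: {risk:.1%}")
```

The average lead time here (~46 days) looks comfortable against 40 days of stock; the distribution says otherwise, which is exactly why "order 12 units" is the wrong shape of answer.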
A simple playbook: testing AI the way utilities actually operate
Key point: If you want robust AI, test it like it’s going to break—because it will.
The clock researchers found that small visual changes (distorted shapes, altered hands) exposed brittleness. Utilities can apply the same philosophy with “scenario perturbations.”
Stress tests to run before production
- Calendar perturbations: holiday shifts, billing cycle changes, DST transitions.
- Weather perturbations: temperature forecast bias, sudden fronts, smoke/air quality impacts.
- Grid perturbations: feeder reconfiguration, DER ramp events, BESS outages.
- Data perturbations: missing AMI intervals, delayed SCADA points, duplicated timestamps.
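The data-perturbation bullet can be turned into a reusable harness: run the same evaluator against perturbed copies of the input and compare. This is a minimal sketch, and the perturbation fractions, the shift size, and the toy error metric are all illustrative assumptions.

```python
import random

def stress_suite(evaluate, clean_series, seed=0):
    """Score an evaluator on clean vs perturbed inputs.
    `evaluate` is any callable returning an error-like number."""
    rng = random.Random(seed)

    def drop(xs, frac=0.1):        # missing AMI intervals
        return [x for x in xs if rng.random() > frac]

    def duplicate(xs, frac=0.05):  # duplicated timestamps
        return [x for x in xs for _ in range(2 if rng.random() < frac else 1)]

    def delay(xs, k=4):            # delayed points, shifted by k intervals
        return xs[k:] + xs[:k]

    return {
        "clean": evaluate(clean_series),
        "missing": evaluate(drop(clean_series)),
        "duplicated": evaluate(duplicate(clean_series)),
        "delayed": evaluate(delay(clean_series)),
    }

def roughness(xs):
    """Toy stand-in for a real forecast error: mean jump between readings."""
    return sum(abs(a - b) for a, b in zip(xs, xs[1:])) / len(xs)

# One day of 15-minute readings, as a stand-in for a real series.
report = stress_suite(roughness, list(range(96)))
print(report)
```

Swap in your actual forecast-error function for `roughness` and the scenario deltas tell you, before production, which perturbations your pipeline shrugs off and which ones blow it up.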
What “good” looks like
- Forecasts degrade gracefully instead of collapsing.
- The system flags low-confidence periods and escalates to human review.
- Recommendations include uncertainty bands and operational constraints.
A model you can’t stress-test is a model you can’t trust.
What to do next (especially for 2026 planning)
Utilities are heading into another year where volatility is normal: electrification load growth, DER complexity, and tighter equipment supply all stack the deck against brittle automation.
If you’re investing in AI for demand forecasting, grid optimization, or supply chain planning, use the clock-reading failure as your shortcut lesson: temporal competence isn’t implied. It must be designed, trained, and validated.
For teams in supply chain and procurement, I’d start with two concrete moves this quarter:
- Run a temporal data quality sprint (time zones, intervals, DST, late data) across your forecasting and planning pipelines.
- Redesign evaluation around the moments that cost you real money: peak days, storm weeks, and constrained supply periods.
The question that matters heading into next year isn’t whether AI will get “smarter.” It’s whether your AI systems will stay reliable when the grid—and your supply chain—stops behaving like the training set.