Stop grading AI like a test-taker. Learn how energy and supply chain teams can validate real-world AI reliability for grids, maintenance, and procurement.
AI doesn’t fail in energy because it’s “not smart enough.” It fails because we often test it like a trivia contestant—then deploy it like an operator.
That’s the core warning behind Melanie Mitchell’s message at NeurIPS (the largest annual AI research conference): today’s AI benchmarks can look impressive while telling you very little about how a system will behave in messy, real operations. If you work in energy & utilities—or in supply chain & procurement supporting the energy transition—this isn’t academic nitpicking. It’s the difference between an AI pilot that demos well and a system you can trust when demand spikes, a transformer overheats, a supplier misses a shipment, or a storm breaks your forecasts.
Here’s the stance I’ll take: most organizations are testing AI the wrong way for critical operations. They’re over-weighting benchmark accuracy and under-weighting generalization, failure modes, and real-world robustness. Mitchell’s call for “better tests” maps cleanly to what energy companies need right now: evaluation protocols that resemble operations, not exams.
Why AI benchmarks mislead energy operations
Benchmarks reward the wrong behaviors. Standard AI evaluation often looks like “run the model on a fixed dataset and report accuracy.” Mitchell argues that even when systems crush benchmarks—sometimes exceeding human performance—results don’t reliably translate to real work.
Energy and utilities add two complications that make this worse:
- The environment shifts constantly. Grid conditions, weather patterns, asset health, and market behavior aren’t static. Models trained and tested on last year’s distribution can degrade quickly.
- The cost of failure is asymmetric. A slightly worse forecast isn’t “a lower score.” It can mean an avoidable outage, equipment damage, safety events, or regulatory exposure.
A practical example: a load forecasting model might score very well on a historical test set, but still fail during:
- a rare cold snap (extreme demand)
- abnormal industrial shutdowns or restarts
- DER behavior changes (new rooftop solar or batteries)
- price-driven demand response events
The reality? Accuracy averages hide operational risk. If you’re buying AI for grid optimization or predictive maintenance, you don’t just need “good performance.” You need known performance under stress.
The “bar exam problem” for AI
Mitchell uses a sharp analogy: an AI can ace the bar exam and still be a terrible lawyer. In energy terms: a model can ace your offline test and still be a terrible operator.
Offline tests tend to assume:
- the data is representative
- the inputs are clean and available
- the objective is stable
- the model won’t be gamed by correlations that don’t hold later
Operations tend to look like:
- missing SCADA tags
- sensor drift
- delayed meter reads
- changing market dispatch incentives
- new operating procedures
If your evaluation doesn’t simulate operational friction, you’re not validating performance—you’re validating optimism.
“Alien intelligence” and why it matters for field AI
Mitchell describes AI systems as a kind of “alien intelligence,” borrowing language used for understanding nonverbal minds like babies and animals. The point isn’t philosophical. It’s methodological.
When psychologists study babies or animals, they can’t just ask them what they meant. So they’ve learned to:
- design careful control conditions
- vary stimuli systematically
- test robustness to small changes
- focus on failure patterns as insight
This is exactly how energy teams should evaluate production AI—especially systems that influence operations, dispatch, maintenance scheduling, or procurement decisions.
In supply chain & procurement (the series this post sits within), the same “alien” issue shows up when you deploy AI for:
- supplier risk scoring
- lead-time prediction
- inventory optimization
- demand sensing
The models will “behave” in ways that look intelligent until one variable changes (a new supplier, a new SKU mix, a tariff, a port delay), and then you find out what they were actually learning.
Operational AI should be tested like a nonverbal agent: by probing behavior under controlled variations, not by celebrating high scores.
The Clever Hans lesson: your model might be reading the room
Mitchell brings up one of the most famous cautionary tales in experimental science: Clever Hans, the horse that appeared to do arithmetic by tapping its hoof.
It wasn’t a hoax, but it wasn’t math either. A psychologist ran control experiments (fitting the horse with blinders, blocking its view of the person asking) and showed that Hans was responding to subtle cues from the human questioner.
AI systems do the same thing all the time.
In energy and supply chain analytics, “Clever Hans models” often succeed by exploiting shortcuts such as:
- data leakage (future information sneaks into training)
- proxy variables (the model learns a stand-in that effectively encodes the label)
- overly tidy backtests (operations cleaned the data after the fact)
- correlated artifacts (e.g., a maintenance ticket timestamp strongly implying failure)
A field example: predictive maintenance leakage
I’ve seen versions of this in asset health modeling: teams include work-order fields that only exist because someone already suspected a failure. The model “predicts” failures brilliantly—because it’s reading the human signal, not the machine.
A better test looks like:
- only using features available at the decision time
- simulating real latency (e.g., 24-hour delay on lab results)
- auditing feature provenance (“how would we know this at runtime?”)
If you can’t explain how a feature exists before the event, it doesn’t belong in the benchmark.
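To make that audit concrete, here’s a minimal sketch in Python of a decision-time feature filter. The feature names, latencies, and sources are invented for illustration; the point is that every field carries an explicit “when would we actually know this?” answer.

```python
import pandas as pd

# Hypothetical feature catalog: when does each field actually exist at runtime?
# "availability_hours" = hours after the decision point before the value is known;
# zero or negative means it is genuinely available when the prediction is made.
feature_catalog = pd.DataFrame({
    "feature": [
        "vibration_rms_7d_avg",   # real-time SCADA aggregate
        "oil_lab_dga_result",     # lab result, arrives ~24h after sampling
        "work_order_priority",    # only exists once a human already suspects a failure
        "ambient_temp_forecast",  # available ahead of the decision
    ],
    "availability_hours": [0, 24, 72, -12],
    "source": ["SCADA", "lab", "CMMS", "weather"],
})

def decision_time_features(catalog: pd.DataFrame, latency_budget_hours: float = 0.0) -> list:
    """Keep only features that exist at the decision point, within an explicit,
    documented latency budget; report what gets excluded and why."""
    excluded = catalog[catalog["availability_hours"] > latency_budget_hours]
    for _, row in excluded.iterrows():
        print(f"EXCLUDE {row['feature']}: arrives {row['availability_hours']}h late (source: {row['source']})")
    usable = catalog[catalog["availability_hours"] <= latency_budget_hours]
    return usable["feature"].tolist()

allowed = decision_time_features(feature_catalog)
print("Allowed in the benchmark:", allowed)
```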
What better AI testing looks like for energy & utilities
Better testing is a protocol, not a metric. Mitchell’s main critique is that AI lacks strong experimental methodology—and many AI practitioners weren’t trained in it.
Here’s a practical evaluation stack that works well for grid AI, predictive maintenance, and operational supply chain models.
1) Replace “one test set” with scenario suites
One static dataset encourages narrow optimization. Energy operations demand scenario coverage.
Build a scenario suite with slices such as:
- extreme weather days (top/bottom 1% temperature)
- high-renewable penetration periods
- voltage/frequency disturbance windows
- peak and shoulder seasons (December heating, July cooling, and the milder months in between)
- rare but critical asset states (incipient fault signatures)
Then score separately per scenario. A model that is “2% better overall” but “20% worse on peak days” is not better.
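Here’s a minimal sketch of per-scenario scoring in Python, assuming a hypothetical backtest table with actual and predicted load plus a few context columns; the scenario definitions and thresholds are placeholders, not recommendations.

```python
import numpy as np
import pandas as pd

def mape(actual: pd.Series, predicted: pd.Series) -> float:
    """Mean absolute percentage error, in percent."""
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)

def score_by_scenario(backtest: pd.DataFrame, scenarios: dict) -> pd.DataFrame:
    """Score the same backtest separately on each scenario slice.
    `scenarios` maps a name to a function returning a boolean mask over the frame."""
    rows = []
    for name, mask_fn in scenarios.items():
        sliced = backtest[mask_fn(backtest)]
        rows.append({
            "scenario": name,
            "n_hours": len(sliced),
            "mape_pct": mape(sliced["actual_mw"], sliced["predicted_mw"]) if len(sliced) else float("nan"),
        })
    return pd.DataFrame(rows)

# Illustrative scenario suite; thresholds should come from your own operating history.
scenarios = {
    "overall":          lambda d: pd.Series(True, index=d.index),
    "extreme_cold":     lambda d: d["temp_c"] <= d["temp_c"].quantile(0.01),
    "extreme_heat":     lambda d: d["temp_c"] >= d["temp_c"].quantile(0.99),
    "high_renewables":  lambda d: d["renewable_share"] >= 0.6,
    "system_peak_days": lambda d: d["is_peak_day"],  # boolean flag in the backtest
}

# backtest = ...  # columns: actual_mw, predicted_mw, temp_c, renewable_share, is_peak_day
# print(score_by_scenario(backtest, scenarios))
```

The useful output isn’t the overall row; it’s the spread across the others.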
2) Use perturbation testing (small changes, big reveal)
Psychology-style stimulus variation translates cleanly to AI:
- add realistic noise to sensors
- drop a subset of telemetry tags
- shift timestamps by plausible delays
- simulate new DER adoption rates
- adjust commodity price signals
If performance collapses under mild perturbations, you’re not ready for production.
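A sketch of the same idea in code, assuming a hypothetical `predict` callable and the `mape` scorer from the previous sketch; the noise levels, dropped tags, and delays are placeholders you’d calibrate against your own telemetry.

```python
import numpy as np
import pandas as pd

def add_sensor_noise(df: pd.DataFrame, cols: list, sigma_pct: float = 2.0, seed: int = 42) -> pd.DataFrame:
    """Add zero-mean Gaussian noise scaled to a percentage of each column's std."""
    out = df.copy()
    rng = np.random.default_rng(seed)
    for c in cols:
        out[c] = out[c] + rng.normal(0.0, df[c].std() * sigma_pct / 100.0, size=len(df))
    return out

def drop_tags(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Simulate missing telemetry by nulling a subset of tags."""
    out = df.copy()
    out[cols] = np.nan
    return out

def delay_reads(df: pd.DataFrame, cols: list, periods: int = 4) -> pd.DataFrame:
    """Simulate delayed reads by lagging columns (4 periods = 1 hour at 15-minute data)."""
    out = df.copy()
    out[cols] = out[cols].shift(periods)
    return out

def perturbation_report(backtest: pd.DataFrame, predict, score, perturbations: dict) -> pd.DataFrame:
    """Compare the score on clean inputs against each perturbed variant."""
    baseline = score(backtest["actual_mw"], predict(backtest))
    rows = [{"case": "baseline", "score": baseline, "delta_pct": 0.0}]
    for name, perturb in perturbations.items():
        s = score(backtest["actual_mw"], predict(perturb(backtest)))
        rows.append({"case": name, "score": s, "delta_pct": 100.0 * (s - baseline) / baseline})
    return pd.DataFrame(rows)

# Illustrative usage (column names and the model object are assumptions):
# perturbations = {
#     "sensor_noise_2pct": lambda d: add_sensor_noise(d, ["temp_c", "irradiance"]),
#     "missing_weather":   lambda d: drop_tags(d, ["irradiance"]),
#     "meter_delay_1h":    lambda d: delay_reads(d, ["meter_kw"], periods=4),
# }
# print(perturbation_report(backtest, model.predict, mape, perturbations))
```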
3) Evaluate failure modes, not just success
Mitchell emphasizes studying failures because they reveal what’s going on.
For operational AI, define and track:
- false positives that trigger unnecessary truck rolls
- false negatives that miss safety-critical failures
- confident wrong predictions (high-risk)
- unstable outputs (small input change flips decisions)
A simple but powerful metric: “Cost-weighted error by failure type” (your finance and ops teams can help quantify real costs).
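Here’s a minimal sketch of that metric for a binary “intervene or not” decision; the dollar figures are placeholders your finance and ops teams would replace.

```python
import pandas as pd

# Placeholder per-event costs; replace with figures from finance and operations.
COSTS = {
    "false_positive": 2_500,    # unnecessary truck roll or inspection
    "false_negative": 250_000,  # missed failure: outage, damage, safety exposure
    "true_positive": 0,
    "true_negative": 0,
}

def cost_weighted_error(actual: pd.Series, flagged: pd.Series, costs: dict = COSTS) -> pd.DataFrame:
    """Break prediction outcomes down by failure type and weight each by its
    operational cost. `actual` and `flagged` are boolean Series (True = failure / alert)."""
    outcome = pd.Series("true_negative", index=actual.index)
    outcome[actual & flagged] = "true_positive"
    outcome[~actual & flagged] = "false_positive"
    outcome[actual & ~flagged] = "false_negative"
    summary = outcome.value_counts().rename("count").to_frame()
    summary["unit_cost"] = summary.index.map(costs)
    summary["total_cost"] = summary["count"] * summary["unit_cost"]
    return summary

# Two models with identical accuracy can have very different total_cost:
# print(cost_weighted_error(failure_labels, model_a_alerts)["total_cost"].sum())
# print(cost_weighted_error(failure_labels, model_b_alerts)["total_cost"].sum())
```

The design choice that matters is asymmetry: once false negatives cost 100x a truck roll, the “best” model is rarely the one with the highest accuracy.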
4) Require replication before scaling
Mitchell also calls out a cultural problem: replication is undervalued in AI research. In operations, replication is non-negotiable.
Before scaling an AI solution across the fleet or across business units, replicate:
- across regions (urban vs. rural feeders)
- across asset cohorts (age bands, vendors)
- across seasons (winter vs. summer)
- across market regimes (price volatility)
If the model only works “in the pilot territory,” treat it as unproven.
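A sketch of a replication gate, assuming a fleet-wide backtest with hypothetical cohort columns and the pilot’s score as the reference; the 15% tolerance is an illustrative default, not a standard.

```python
import pandas as pd

def replication_report(backtest: pd.DataFrame, score, group_cols: list,
                       pilot_score: float, max_degradation_pct: float = 15.0) -> pd.DataFrame:
    """Score each cohort separately and flag any cohort that degrades more than
    the tolerance relative to the pilot result (lower score = better, e.g. MAPE)."""
    rows = []
    for keys, cohort in backtest.groupby(group_cols):
        s = score(cohort["actual"], cohort["predicted"])
        degradation_pct = 100.0 * (s - pilot_score) / pilot_score
        rows.append({
            "cohort": keys,
            "n": len(cohort),
            "score": round(s, 3),
            "degradation_pct": round(degradation_pct, 1),
            "passes": degradation_pct <= max_degradation_pct,
        })
    return pd.DataFrame(rows)

# Illustrative gate before fleet-wide rollout (column names are assumptions):
# report = replication_report(backtest, mape, ["region", "asset_vintage", "season"], pilot_score=4.2)
# if not report["passes"].all():
#     print("Do not scale yet:\n", report[~report["passes"]])
```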
5) Promote “skeptic” to a job title
Mitchell notes that in AI, “skeptic” is treated as a negative label. In critical infrastructure, skepticism is professionalism.
A useful internal role is a rotating model challenger—someone whose explicit job is to find:
- shortcut learning
- leakage
- brittleness
- unsafe edge cases
Your goal isn’t to win a benchmark. Your goal is to earn operational trust.
How this ties back to AI in supply chain & procurement
Energy supply chains are under pressure in late 2025: long lead times for grid equipment, constrained transformer availability, battery supply volatility, and shifting regulatory and trade environments. Many teams are adopting AI for procurement forecasting and supplier risk.
Mitchell’s point lands here too: procurement AI often looks great until the world changes.
If you’re using AI to predict supplier performance or optimize inventory for critical spares, better tests include:
- stress-testing lead-time models on disruption periods
- validating predictions when onboarding new suppliers (cold-start)
- checking robustness when part numbers are substituted
- auditing whether “supplier risk” is just a proxy for region or size (a quick check is sketched right after this list)
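One way to run that last audit, sketched under assumed column names (region, annual revenue, and the model’s risk score): fit a deliberately naive regression that only sees region and size, and check how much of the score it reproduces.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def proxy_audit(suppliers: pd.DataFrame, score_col: str = "supplier_risk_score") -> float:
    """Fit a deliberately naive model that predicts the AI's risk score from
    region and size alone, and return its R^2. A high R^2 suggests the "risk"
    score is mostly restating those attributes rather than measuring risk."""
    X = pd.get_dummies(suppliers[["region"]], drop_first=True)
    X["log_revenue"] = np.log1p(suppliers["annual_revenue_usd"])
    y = suppliers[score_col]
    return LinearRegression().fit(X, y).score(X, y)

# r2 = proxy_audit(supplier_df)
# if r2 > 0.8:
#     print(f"Warning: region and size alone explain {r2:.0%} of the 'risk' score.")
```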
This matters because the grid is increasingly coupled to supply chain reality. A predictive maintenance model that flags replacements is only helpful if procurement can actually source parts—and if the AI didn’t learn the wrong signal.
A practical checklist: what to ask before you buy or deploy
Use these questions in vendor selection, internal model reviews, or pilot-to-scale gates:
- What happens on the worst 1% of days? Show performance specifically on extremes.
- What features are available at decision time? Prove there’s no leakage.
- How does performance change under realistic data loss? Demonstrate resilience.
- What are the top three failure modes and their operational cost? Quantify risk.
- Has it replicated across geographies or asset types? One region doesn’t count.
- How is the model monitored post-deploy? Drift detection, alerting, rollback (a minimal drift check is sketched below).
- Who owns the decision when the model is uncertain? Human-in-the-loop design.
If the answers are vague, that’s a signal in itself.
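For the monitoring question, drift detection can start simply. Here’s a sketch using the Population Stability Index, a common drift check; the column names are assumptions, and the 0.10/0.25 cut-offs are a rule of thumb, not a standard.

```python
import numpy as np
import pandas as pd

def population_stability_index(reference: pd.Series, live: pd.Series, bins: int = 10) -> float:
    """Population Stability Index between a feature's training-era distribution
    and its live distribution. Rule of thumb: <0.10 stable, 0.10-0.25 watch,
    >0.25 investigate before trusting model outputs."""
    ref = reference.dropna().to_numpy()
    cur = live.dropna().to_numpy()
    edges = np.quantile(ref, np.linspace(0, 1, bins + 1))
    # Widen the outer edges so live values beyond the training range still land in a bin.
    edges[0] = min(edges[0], cur.min()) - 1e-9
    edges[-1] = max(edges[-1], cur.max()) + 1e-9
    ref_frac = np.histogram(ref, bins=edges)[0] / len(ref)
    cur_frac = np.histogram(cur, bins=edges)[0] / len(cur)
    ref_frac = np.clip(ref_frac, 1e-6, None)  # avoid log(0) in sparse bins
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Illustrative monitoring loop over key model inputs:
# for col in ["load_mw", "temp_c", "renewable_share"]:
#     psi = population_stability_index(train_df[col], last_30_days[col])
#     if psi > 0.25:
#         print(f"Drift alert on {col}: PSI={psi:.2f}")
```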
What to do next
Better AI testing is the fastest way to reduce risk and speed adoption at the same time. When ops teams see evaluation that mirrors reality—messy inputs, shifting conditions, clear failure costs—trust grows quickly.
If you’re an energy leader rolling out AI for grid optimization, predictive maintenance, or procurement forecasting, start by upgrading the evaluation protocol before upgrading the model. You’ll save months of churn and avoid the painful pattern of “great pilot, rough rollout.”
The forward-looking question that matters for 2026 planning: Are you measuring your AI the way it will be used—or the way it’s easiest to score?