Stop grading AI like a test-taker. Learn how energy and supply chain teams can validate real-world AI reliability for grids, maintenance, and procurement.
AI doesn’t fail in energy because it’s “not smart enough.” It fails because we often test it like a trivia contestant—then deploy it like an operator.
That’s the core warning behind Melanie Mitchell’s message at NeurIPS (the largest annual AI research conference): today’s AI benchmarks can look impressive while telling you very little about how a system will behave in messy, real operations. If you work in energy & utilities—or in supply chain & procurement supporting the energy transition—this isn’t academic nitpicking. It’s the difference between an AI pilot that demos well and a system you can trust when demand spikes, a transformer overheats, a supplier misses a shipment, or a storm breaks your forecasts.
Here’s the stance I’ll take: most organizations are testing AI the wrong way for critical operations. They’re over-weighting benchmark accuracy and under-weighting generalization, failure modes, and real-world robustness. Mitchell’s call for “better tests” maps cleanly to what energy companies need right now: evaluation protocols that resemble operations, not exams.
Why AI benchmarks mislead energy operations
Benchmarks reward the wrong behaviors. Standard AI evaluation often looks like “run the model on a fixed dataset and report accuracy.” Mitchell argues that even when systems crush benchmarks—sometimes exceeding human performance—results don’t reliably translate to real work.
Energy and utilities add two complications that make this worse:
- The environment shifts constantly. Grid conditions, weather patterns, asset health, and market behavior aren’t static. Models trained and tested on last year’s distribution can degrade quickly.
- The cost of failure is asymmetric. A slightly worse forecast isn’t “a lower score.” It can mean an avoidable outage, equipment damage, safety events, or regulatory exposure.
A practical example: a load forecasting model might score very well on a historical test set, but still fail during:
- a rare cold snap (extreme demand)
- abnormal industrial shutdowns or restarts
- DER behavior changes (new rooftop solar or batteries)
- price-driven demand response events
The reality? Accuracy averages hide operational risk. If you’re buying AI for grid optimization or predictive maintenance, you don’t just need “good performance.” You need known performance under stress.
The “bar exam problem” for AI
Mitchell uses a sharp analogy: an AI can ace the bar exam and still be a terrible lawyer. In energy terms: a model can ace your offline test and still be a terrible operator.
Offline tests tend to assume:
- the data is representative
- the inputs are clean and available
- the objective is stable
- the model won’t be gamed by correlations that don’t hold later
Operations tend to look like:
- missing SCADA tags
- sensor drift
- delayed meter reads
- changing market dispatch incentives
- new operating procedures
If your evaluation doesn’t simulate operational friction, you’re not validating performance—you’re validating optimism.
“Alien intelligence” and why it matters for field AI
Mitchell describes AI systems as a kind of “alien intelligence,” borrowing language used for understanding nonverbal minds like babies and animals. The point isn’t philosophical. It’s methodological.
When psychologists study babies or animals, they can’t just ask them what they meant. So they’ve learned to:
- design careful control conditions
- vary stimuli systematically
- test robustness to small changes
- focus on failure patterns as insight
This is exactly how energy teams should evaluate production AI—especially systems that influence operations, dispatch, maintenance scheduling, or procurement decisions.
In supply chain & procurement (the series this post sits within), the same “alien” issue shows up when you deploy AI for:
- supplier risk scoring
- lead-time prediction
- inventory optimization
- demand sensing
The models will “behave” in ways that look intelligent until one variable changes (a new supplier, a new SKU mix, a tariff, a port delay), and then you find out what they were actually learning.
Operational AI should be tested like a nonverbal agent: by probing behavior under controlled variations, not by celebrating high scores.
The Clever Hans lesson: your model might be reading the room
Mitchell brings up one of the most famous cautionary tales in experimental science: Clever Hans, the horse that appeared to do arithmetic by tapping its hoof.
It wasn’t a hoax, but it wasn’t math either. A psychologist ran control experiments (fitting the horse with blinders, blocking its view of the person asking) and showed that Hans was responding to subtle cues from the human questioner.
AI systems do the same thing all the time.
In energy and supply chain analytics, “Clever Hans models” often succeed by exploiting shortcuts such as:
- data leakage (future information sneaks into training)
- proxy variables (the model learns a stand-in that effectively encodes the label)
- overly tidy backtests (operations cleaned the data after the fact)
- correlated artifacts (e.g., a maintenance ticket timestamp strongly implying failure)
A field example: predictive maintenance leakage
I’ve seen versions of this in asset health modeling: teams include work-order fields that only exist because someone already suspected a failure. The model “predicts” failures brilliantly—because it’s reading the human signal, not the machine.
A better test looks like:
- only using features available at the decision time
- simulating real latency (e.g., 24-hour delay on lab results)
- auditing feature provenance (“how would we know this at runtime?”)
If you can’t explain how a feature exists before the event, it doesn’t belong in the benchmark.
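To make that audit concrete, here’s a minimal sketch in Python of a decision-time feature filter. The feature names, latencies, and sources are invented for illustration; the point is that every field carries an explicit “when would we actually know this?” answer.

```python
import pandas as pd

# Hypothetical feature catalog: when does each field actually exist at runtime?
# "availability_hours" = hours after the decision point before the value is known;
# zero or negative means it is genuinely available when the prediction is made.
feature_catalog = pd.DataFrame({
    "feature": [
        "vibration_rms_7d_avg",   # real-time SCADA aggregate
        "oil_lab_dga_result",     # lab result, arrives ~24h after sampling
        "work_order_priority",    # only exists once a human already suspects a failure
        "ambient_temp_forecast",  # available ahead of the decision
    ],
    "availability_hours": [0, 24, 72, -12],
    "source": ["SCADA", "lab", "CMMS", "weather"],
})

def decision_time_features(catalog: pd.DataFrame, latency_budget_hours: float = 0.0) -> list:
    """Keep only features that exist at the decision point, within an explicit,
    documented latency budget; report what gets excluded and why."""
    excluded = catalog[catalog["availability_hours"] > latency_budget_hours]
    for _, row in excluded.iterrows():
        print(f"EXCLUDE {row['feature']}: arrives {row['availability_hours']}h late (source: {row['source']})")
    usable = catalog[catalog["availability_hours"] <= latency_budget_hours]
    return usable["feature"].tolist()

allowed = decision_time_features(feature_catalog)
print("Allowed in the benchmark:", allowed)
```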
What better AI testing looks like for energy & utilities
Better testing is a protocol, not a metric. Mitchell’s main critique is that AI lacks strong experimental methodology—and many AI practitioners weren’t trained in it.
Here’s a practical evaluation stack that works well for grid AI, predictive maintenance, and operational supply chain models.
1) Replace “one test set” with scenario suites
One static dataset encourages narrow optimization. Energy operations demand scenario coverage.
Build a scenario suite with slices such as:
- extreme weather days (top/bottom 1% temperature)
- high-renewable penetration periods
- voltage/frequency disturbance windows
- peak and shoulder seasons (December heating, July cooling, and the milder months in between)
- rare but critical asset states (incipient fault signatures)
Then score separately per scenario. A model that is “2% better overall” but “20% worse on peak days” is not better.
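Here’s a minimal sketch of per-scenario scoring in Python, assuming a hypothetical backtest table with actual and predicted load plus a few context columns; the scenario definitions and thresholds are placeholders, not recommendations.

```python
import numpy as np
import pandas as pd

def mape(actual: pd.Series, predicted: pd.Series) -> float:
    """Mean absolute percentage error, in percent."""
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)

def score_by_scenario(backtest: pd.DataFrame, scenarios: dict) -> pd.DataFrame:
    """Score the same backtest separately on each scenario slice.
    `scenarios` maps a name to a function returning a boolean mask over the frame."""
    rows = []
    for name, mask_fn in scenarios.items():
        sliced = backtest[mask_fn(backtest)]
        rows.append({
            "scenario": name,
            "n_hours": len(sliced),
            "mape_pct": mape(sliced["actual_mw"], sliced["predicted_mw"]) if len(sliced) else float("nan"),
        })
    return pd.DataFrame(rows)

# Illustrative scenario suite; thresholds should come from your own operating history.
scenarios = {
    "overall":          lambda d: pd.Series(True, index=d.index),
    "extreme_cold":     lambda d: d["temp_c"] <= d["temp_c"].quantile(0.01),
    "extreme_heat":     lambda d: d["temp_c"] >= d["temp_c"].quantile(0.99),
    "high_renewables":  lambda d: d["renewable_share"] >= 0.6,
    "system_peak_days": lambda d: d["is_peak_day"],  # boolean flag in the backtest
}

# backtest = ...  # columns: actual_mw, predicted_mw, temp_c, renewable_share, is_peak_day
# print(score_by_scenario(backtest, scenarios))
```

The useful output isn’t the overall row; it’s the spread across the others.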
2) Use perturbation testing (small changes, big reveal)
Psychology-style stimulus variation translates cleanly to AI:
- add realistic noise to sensors
- drop a subset of telemetry tags
- shift timestamps by plausible delays
- simulate new DER adoption rates
- adjust commodity price signals
If performance collapses under mild perturbations, you’re not ready for production.
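A sketch of the same idea in code, assuming a hypothetical `predict` callable and the `mape` scorer from the previous sketch; the noise levels, dropped tags, and delays are placeholders you’d calibrate against your own telemetry.

```python
import numpy as np
import pandas as pd

def add_sensor_noise(df: pd.DataFrame, cols: list, sigma_pct: float = 2.0, seed: int = 42) -> pd.DataFrame:
    """Add zero-mean Gaussian noise scaled to a percentage of each column's std."""
    out = df.copy()
    rng = np.random.default_rng(seed)
    for c in cols:
        out[c] = out[c] + rng.normal(0.0, df[c].std() * sigma_pct / 100.0, size=len(df))
    return out

def drop_tags(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Simulate missing telemetry by nulling a subset of tags."""
    out = df.copy()
    out[cols] = np.nan
    return out

def delay_reads(df: pd.DataFrame, cols: list, periods: int = 4) -> pd.DataFrame:
    """Simulate delayed reads by lagging columns (4 periods = 1 hour at 15-minute data)."""
    out = df.copy()
    out[cols] = out[cols].shift(periods)
    return out

def perturbation_report(backtest: pd.DataFrame, predict, score, perturbations: dict) -> pd.DataFrame:
    """Compare the score on clean inputs against each perturbed variant."""
    baseline = score(backtest["actual_mw"], predict(backtest))
    rows = [{"case": "baseline", "score": baseline, "delta_pct": 0.0}]
    for name, perturb in perturbations.items():
        s = score(backtest["actual_mw"], predict(perturb(backtest)))
        rows.append({"case": name, "score": s, "delta_pct": 100.0 * (s - baseline) / baseline})
    return pd.DataFrame(rows)

# Illustrative usage (column names and the model object are assumptions):
# perturbations = {
#     "sensor_noise_2pct": lambda d: add_sensor_noise(d, ["temp_c", "irradiance"]),
#     "missing_weather":   lambda d: drop_tags(d, ["irradiance"]),
#     "meter_delay_1h":    lambda d: delay_reads(d, ["meter_kw"], periods=4),
# }
# print(perturbation_report(backtest, model.predict, mape, perturbations))
```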
3) Evaluate failure modes, not just success
Mitchell emphasizes studying failures because they reveal what’s going on.
For operational AI, define and track:
- false positives that trigger unnecessary truck rolls
- false negatives that miss safety-critical failures
- confident wrong predictions (high-risk)
- unstable outputs (small input change flips decisions)
A simple but powerful metric: “Cost-weighted error by failure type” (your finance and ops teams can help quantify real costs).
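Here’s a minimal sketch of that metric for a binary “intervene or not” decision; the dollar figures are placeholders your finance and ops teams would replace.

```python
import pandas as pd

# Placeholder per-event costs; replace with figures from finance and operations.
COSTS = {
    "false_positive": 2_500,    # unnecessary truck roll or inspection
    "false_negative": 250_000,  # missed failure: outage, damage, safety exposure
    "true_positive": 0,
    "true_negative": 0,
}

def cost_weighted_error(actual: pd.Series, flagged: pd.Series, costs: dict = COSTS) -> pd.DataFrame:
    """Break prediction outcomes down by failure type and weight each by its
    operational cost. `actual` and `flagged` are boolean Series (True = failure / alert)."""
    outcome = pd.Series("true_negative", index=actual.index)
    outcome[actual & flagged] = "true_positive"
    outcome[~actual & flagged] = "false_positive"
    outcome[actual & ~flagged] = "false_negative"
    summary = outcome.value_counts().rename("count").to_frame()
    summary["unit_cost"] = summary.index.map(costs)
    summary["total_cost"] = summary["count"] * summary["unit_cost"]
    return summary

# Two models with identical accuracy can have very different total_cost:
# print(cost_weighted_error(failure_labels, model_a_alerts)["total_cost"].sum())
# print(cost_weighted_error(failure_labels, model_b_alerts)["total_cost"].sum())
```

The design choice that matters is asymmetry: once false negatives cost 100x a truck roll, the “best” model is rarely the one with the highest accuracy.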
4) Require replication before scaling
Mitchell also calls out a cultural problem: replication is undervalued in AI research. In operations, replication is non-negotiable.
Before scaling an AI solution across the fleet or across business units, replicate:
- across regions (urban vs. rural feeders)
- across asset cohorts (age bands, vendors)
- across seasons (winter vs. summer)
- across market regimes (price volatility)
If the model only works “in the pilot territory,” treat it as unproven.
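A sketch of a replication gate, assuming a fleet-wide backtest with hypothetical cohort columns and the pilot’s score as the reference; the 15% tolerance is an illustrative default, not a standard.

```python
import pandas as pd

def replication_report(backtest: pd.DataFrame, score, group_cols: list,
                       pilot_score: float, max_degradation_pct: float = 15.0) -> pd.DataFrame:
    """Score each cohort separately and flag any cohort that degrades more than
    the tolerance relative to the pilot result (lower score = better, e.g. MAPE)."""
    rows = []
    for keys, cohort in backtest.groupby(group_cols):
        s = score(cohort["actual"], cohort["predicted"])
        degradation_pct = 100.0 * (s - pilot_score) / pilot_score
        rows.append({
            "cohort": keys,
            "n": len(cohort),
            "score": round(s, 3),
            "degradation_pct": round(degradation_pct, 1),
            "passes": degradation_pct <= max_degradation_pct,
        })
    return pd.DataFrame(rows)

# Illustrative gate before fleet-wide rollout (column names are assumptions):
# report = replication_report(backtest, mape, ["region", "asset_vintage", "season"], pilot_score=4.2)
# if not report["passes"].all():
#     print("Do not scale yet:\n", report[~report["passes"]])
```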
5) Promote “skeptic” to a job title
Mitchell notes that in AI, “skeptic” is treated as a negative label. In critical infrastructure, skepticism is professionalism.
A useful internal role is a rotating model challenger—someone whose explicit job is to find:
- shortcut learning
- leakage
- brittleness
- unsafe edge cases
Your goal isn’t to win a benchmark. Your goal is to earn operational trust.
How this ties back to AI in supply chain & procurement
Energy supply chains are under pressure in late 2025: long lead times for grid equipment, constrained transformer availability, battery supply volatility, and shifting regulatory and trade environments. Many teams are adopting AI for procurement forecasting and supplier risk.
Mitchell’s point lands here too: procurement AI often looks great until the world changes.
If you’re using AI to predict supplier performance or optimize inventory for critical spares, better tests include:
- stress-testing lead-time models on disruption periods
- validating predictions when onboarding new suppliers (cold-start)
- checking robustness when part numbers are substituted
- auditing whether “supplier risk” is just a proxy for region or size (a quick check is sketched right after this list)
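One way to run that last audit, sketched under assumed column names (region, annual revenue, and the model’s risk score): fit a deliberately naive regression that only sees region and size, and check how much of the score it reproduces.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def proxy_audit(suppliers: pd.DataFrame, score_col: str = "supplier_risk_score") -> float:
    """Fit a deliberately naive model that predicts the AI's risk score from
    region and size alone, and return its R^2. A high R^2 suggests the "risk"
    score is mostly restating those attributes rather than measuring risk."""
    X = pd.get_dummies(suppliers[["region"]], drop_first=True)
    X["log_revenue"] = np.log1p(suppliers["annual_revenue_usd"])
    y = suppliers[score_col]
    return LinearRegression().fit(X, y).score(X, y)

# r2 = proxy_audit(supplier_df)
# if r2 > 0.8:
#     print(f"Warning: region and size alone explain {r2:.0%} of the 'risk' score.")
```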
This matters because the grid is increasingly coupled to supply chain reality. A predictive maintenance model that flags replacements is only helpful if procurement can actually source parts—and if the AI didn’t learn the wrong signal.
A practical checklist: what to ask before you buy or deploy
Use these questions in vendor selection, internal model reviews, or pilot-to-scale gates:
- What happens on the worst 1% of days? Show performance specifically on extremes.
- What features are available at decision time? Prove there’s no leakage.
- How does performance change under realistic data loss? Demonstrate resilience.
- What are the top three failure modes and their operational cost? Quantify risk.
- Has it replicated across geographies or asset types? One region doesn’t count.
- How is the model monitored post-deploy? Drift detection, alerting, rollback (a minimal drift check is sketched below).
- Who owns the decision when the model is uncertain? Human-in-the-loop design.
If the answers are vague, that’s a signal in itself.
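For the monitoring question, drift detection can start simply. Here’s a sketch using the Population Stability Index, a common drift check; the column names are assumptions, and the 0.10/0.25 cut-offs are a rule of thumb, not a standard.

```python
import numpy as np
import pandas as pd

def population_stability_index(reference: pd.Series, live: pd.Series, bins: int = 10) -> float:
    """Population Stability Index between a feature's training-era distribution
    and its live distribution. Rule of thumb: <0.10 stable, 0.10-0.25 watch,
    >0.25 investigate before trusting model outputs."""
    ref = reference.dropna().to_numpy()
    cur = live.dropna().to_numpy()
    edges = np.quantile(ref, np.linspace(0, 1, bins + 1))
    # Widen the outer edges so live values beyond the training range still land in a bin.
    edges[0] = min(edges[0], cur.min()) - 1e-9
    edges[-1] = max(edges[-1], cur.max()) + 1e-9
    ref_frac = np.histogram(ref, bins=edges)[0] / len(ref)
    cur_frac = np.histogram(cur, bins=edges)[0] / len(cur)
    ref_frac = np.clip(ref_frac, 1e-6, None)  # avoid log(0) in sparse bins
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Illustrative monitoring loop over key model inputs:
# for col in ["load_mw", "temp_c", "renewable_share"]:
#     psi = population_stability_index(train_df[col], last_30_days[col])
#     if psi > 0.25:
#         print(f"Drift alert on {col}: PSI={psi:.2f}")
```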
What to do next
Better AI testing is the fastest way to reduce risk and speed adoption at the same time. When ops teams see evaluation that mirrors reality—messy inputs, shifting conditions, clear failure costs—trust grows quickly.
If you’re an energy leader rolling out AI for grid optimization, predictive maintenance, or procurement forecasting, start by upgrading the evaluation protocol before upgrading the model. You’ll save months of churn and avoid the painful pattern of “great pilot, rough rollout.”
The forward-looking question that matters for 2026 planning: Are you measuring your AI the way it will be used—or the way it’s easiest to score?