AI Benchmarks Utilities Can Trust (Not Just Pass)

AI in Supply Chain & Procurement · By 3L3C

Better AI benchmarks can prevent costly failures in forecasting, maintenance, and grid ops. Learn how utilities can test AI for real-world reliability.

AI benchmarking · Utilities · Grid optimization · Predictive maintenance · Demand forecasting · Procurement analytics

A model that “beats the benchmark” can still fail on the one day your grid needs it most.

That’s the uncomfortable lesson behind Melanie Mitchell’s recent critique of how we test AI systems. Her point isn’t academic hand-wringing about AGI. It’s a practical warning: our evaluation habits reward score-chasing, not reliability. And reliability is exactly what energy, utilities, and supply chain teams can’t compromise on—especially heading into winter peak season and year-end planning cycles, when forecasting mistakes and equipment failures get expensive fast.

In the AI in Supply Chain & Procurement world, this shows up in predictable ways: demand forecasts that look great in a slide deck but miss regional spikes; supplier-risk models that don’t generalize when geopolitical conditions shift; predictive maintenance that works in the lab and then quietly degrades in the field.

If you’re deploying AI for grid optimization, DER orchestration, demand response, outage prediction, or spare-parts procurement, the real question isn’t “What’s your benchmark score?” It’s “What did you do to prove it won’t behave like Clever Hans?”

Why benchmark wins don’t translate to operational wins

Benchmarks are usually optimized for comparability, not for the real world. Mitchell calls out a core problem: we often evaluate AI systems by running them against a fixed set of tasks and reporting accuracy. It’s tidy, but it’s also fragile.

In energy and utilities, fragility looks like this:

  • A load forecasting model performs well on historical data, then breaks during a demand shock (cold snap, heat dome, industrial restart, EV charging cluster).
  • A grid anomaly detector flags the “usual suspects,” but misses rare multi-causal events (protection miscoordination plus communications dropouts plus inverter control interactions).
  • A supplier lead-time model assumes stable logistics patterns, then underestimates disruption when port congestion and tariff changes hit simultaneously.

Here’s the stance I take: benchmark-only evaluation is a procurement risk. If you buy or approve a model because it tops a leaderboard, you’re effectively purchasing a performance claim that may not be reproducible under your operating constraints.

The “bar exam problem” for utilities

Mitchell uses a clean analogy: an AI that aces the bar exam doesn’t necessarily make a good lawyer. In utilities, the equivalent is an AI that aces a forecasting test set but can’t handle:

  • missing SCADA/AMI intervals
  • telemetry drift
  • new asset configurations
  • market rule changes
  • data that arrives late (common in enterprise environments)

Passing the test is not the same as doing the job.

What energy AI can learn from “nonverbal minds”

Mitchell argues we should study modern AI more like psychologists study babies and animals—nonverbal agents where you can’t assume shared understanding or even shared motivations.

This maps surprisingly well to operational AI in utilities:

  • Your model doesn’t “know” what a feeder is; it knows patterns in representations.
  • Your model doesn’t “understand” scarcity; it learns correlations between purchase orders and delays.
  • Your model doesn’t “intend” to be robust; it optimizes for the training objective you gave it.

So evaluation needs to do something different: probe what the model actually uses to make decisions.

Clever Hans is alive and well in machine learning

The Clever Hans story matters because it’s not about fraud. It’s about unintended cueing. The horse wasn’t counting; it was reacting to subtle human signals.

In modern AI, Clever Hans shows up as:

  • data leakage (future information sneaks into features)
  • proxy learning (model uses an easy shortcut correlated with the label)
  • spurious patterns (weather station ID, meter firmware version, vendor naming conventions)

Utilities are especially exposed because datasets often contain “hidden helpers,” like:

  • maintenance codes that indirectly encode failure outcomes
  • work order close dates that correlate with asset health labels
  • market settlement fields that reflect operator interventions

If your evaluation doesn’t actively test for shortcut behavior, you’ll ship a model that looks smart—right up until the context changes.
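
One cheap control for these hidden helpers is a single-feature screen: if one administrative field can nearly recover the label on its own, it is usually leaking the outcome rather than describing the asset. Here’s a minimal sketch, assuming a pandas DataFrame with a binary failure label; the column names (close_code, asset_age_years) and the 0.90 AUC threshold are illustrative assumptions, not a standard.

```python
# Single-feature leakage screen: a lone administrative field that predicts the
# label almost perfectly is usually leaking the outcome, not describing the asset.
# Assumes a pandas DataFrame with a binary "failed" label; names are illustrative.
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def leakage_screen(df: pd.DataFrame, label: str, auc_threshold: float = 0.90) -> pd.DataFrame:
    """Score each feature alone against the label and flag suspiciously predictive ones."""
    rows = []
    y = df[label]
    for col in df.columns.drop(label):
        x = df[[col]]
        # Categorical fields (maintenance codes, vendor names) get integer codes.
        if not pd.api.types.is_numeric_dtype(x[col]):
            x = x[col].astype("category").cat.codes.to_frame(col)
        auc = cross_val_score(
            DecisionTreeClassifier(max_depth=3), x, y, cv=5, scoring="roc_auc"
        ).mean()
        rows.append({"feature": col, "solo_auc": auc, "suspect": auc >= auc_threshold})
    return pd.DataFrame(rows).sort_values("solo_auc", ascending=False)

# Tiny synthetic example: "close_code" is written after the failure is known,
# so it recovers the label almost perfectly and gets flagged as suspect.
rng = np.random.default_rng(0)
failed = rng.integers(0, 2, 500)
demo = pd.DataFrame({
    "asset_age_years": rng.normal(20, 8, 500),
    "load_factor": rng.uniform(0.3, 0.9, 500),
    "close_code": np.where(failed == 1, "EMERG_REPLACE", "ROUTINE"),
    "failed": failed,
})
print(leakage_screen(demo, label="failed"))
```

Anything flagged here deserves a conversation with the people who generate that field before it goes anywhere near a production model.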

A better evaluation playbook for grid and supply chain AI

Better testing means designing evaluations that punish shortcuts and reward generalization. Below is a pragmatic approach I’ve found works across energy AI, forecasting, and procurement analytics.

1) Replace single scores with a “capability profile”

A single accuracy number hides too much. Create a capability profile that includes at least:

  • Average performance (what you report today)
  • Tail performance (worst 1–5% of cases)
  • Stress performance (extreme weather, outages, price spikes)
  • Stability (variance across time windows, regions, asset classes)

In grid terms: don’t just ask “Is MAE low?” Ask “What happens on the top 10 peak days and the bottom 10 demand days?”
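
As a sketch of what a capability profile can look like in code, here’s one way to turn a table of hourly actuals and forecasts into the four numbers above instead of a single score. The column names, the “top 10 peak days” definition of stress, and region spread as a stability proxy are all assumptions to adapt to your own data.

```python
# Capability profile: report average, tail, stress, and stability instead of
# one headline number. Assumes hourly actual vs. forecast load with timestamp
# and region columns; names and the "stress = top peak days" rule are illustrative.
import numpy as np
import pandas as pd

def capability_profile(df: pd.DataFrame) -> dict:
    err = (df["forecast_mw"] - df["actual_mw"]).abs()
    daily = df.assign(abs_err=err).groupby(df["timestamp"].dt.date).agg(
        abs_err=("abs_err", "mean"), peak_mw=("actual_mw", "max"))
    stress_days = daily.nlargest(10, "peak_mw")                         # top 10 peak days
    by_region = df.assign(abs_err=err).groupby("region")["abs_err"].mean()
    return {
        "mae_overall": err.mean(),                                      # what gets reported today
        "mae_worst_5pct_hours": err[err >= err.quantile(0.95)].mean(),  # tail
        "mae_top_peak_days": stress_days["abs_err"].mean(),             # stress
        "mae_region_spread": by_region.max() - by_region.min(),         # stability
    }

# Tiny synthetic demo: one week of hourly data for two regions.
rng = np.random.default_rng(1)
ts = pd.date_range("2024-01-01", periods=24 * 7, freq="h")
demo = pd.DataFrame({
    "timestamp": np.tile(ts, 2),
    "region": np.repeat(["north", "south"], len(ts)),
    "actual_mw": rng.normal(500, 60, 2 * len(ts)),
    "forecast_mw": rng.normal(500, 60, 2 * len(ts)),
})
print(capability_profile(demo))
```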

2) Use controlled counterfactuals (your “blindfold and screen” tests)

Mitchell highlights psychology’s emphasis on control experiments. For energy AI, control experiments are how you prove you’re not being fooled.

Examples of counterfactual tests:

  • Feature removal tests: remove suspect proxies (operator tags, feeder ID, vendor code) and see if performance collapses.
  • Permutation tests: shuffle a feature that shouldn’t matter; if performance shifts, the model is leaning on a shortcut or leaked signal.
  • Delay simulations: simulate late-arriving AMI data; does the model degrade gracefully?
  • Topology change tests: evaluate before/after a reconfiguration; does the model generalize?

A model that only works when it can “see the question asker” is not production-ready.
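
Here’s a minimal sketch of the first two tests, assuming a fitted model with a scikit-learn-style fit/predict interface and a held-out evaluation frame; the suspect column (feeder_id in the usage comment) is a placeholder for whatever proxy you distrust.

```python
# Counterfactual controls: feature-removal and permutation tests.
# Assumes a fitted regressor with a scikit-learn-style fit/predict API and a
# held-out evaluation frame; "feeder_id" as the suspect proxy is a placeholder.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def permutation_test(model, X: pd.DataFrame, y, column: str, seed: int = 0) -> dict:
    """Shuffle one column; a large metric shift means the model leans on it."""
    baseline = mean_absolute_error(y, model.predict(X))
    X_shuf = X.copy()
    X_shuf[column] = np.random.default_rng(seed).permutation(X_shuf[column].values)
    return {"baseline_mae": baseline,
            "shuffled_mae": mean_absolute_error(y, model.predict(X_shuf))}

def removal_test(model_factory, X_train, y_train, X_eval, y_eval, column: str) -> dict:
    """Retrain without the suspect column; a collapse suggests a shortcut."""
    with_col = model_factory().fit(X_train, y_train)
    without = model_factory().fit(X_train.drop(columns=[column]), y_train)
    return {"mae_with_column": mean_absolute_error(y_eval, with_col.predict(X_eval)),
            "mae_without_column": mean_absolute_error(
                y_eval, without.predict(X_eval.drop(columns=[column])))}

# Usage (illustrative):
#   results = removal_test(lambda: RandomForestRegressor(n_estimators=100),
#                          X_train, y_train, X_eval, y_eval, column="feeder_id")
```

If removing a single ID-like column craters accuracy, you haven’t lost a capability; you’ve exposed one that was never there.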

3) Build benchmark variations, not benchmark worship

Mitchell points out that psychologists vary stimuli carefully to test robustness. Your internal benchmark suite should do the same.

For utilities and supply chain AI, create benchmark variations along dimensions like:

  • geography (territory A vs territory B)
  • seasonality (winter peaks vs shoulder months)
  • asset mix (overhead vs underground, old vs new)
  • market regime (high volatility vs stable pricing)
  • supplier base shifts (new vendors, alternate parts)

If you only test on one “canonical” dataset split, you’ll overfit your process—not just your model.
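
One lightweight way to run the same evaluation across several benchmark variations is to define each variation as a named filter over your evaluation data and report per-slice metrics side by side. The sketch below assumes the forecast is already attached to the evaluation frame; the slice definitions and column names are placeholders.

```python
# Benchmark variations: evaluate one model across labelled slices of the data
# instead of one canonical split. Assumes an evaluation DataFrame with the
# forecast already attached; slice definitions and column names are placeholders.
import pandas as pd
from sklearn.metrics import mean_absolute_error

VARIATIONS = {
    "territory_A": lambda df: df["territory"] == "A",
    "territory_B": lambda df: df["territory"] == "B",
    "winter_peak": lambda df: df["timestamp"].dt.month.isin([12, 1, 2]),
    "shoulder":    lambda df: df["timestamp"].dt.month.isin([4, 5, 9, 10]),
    "underground": lambda df: df["asset_type"] == "underground",
}

def benchmark_suite(eval_df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for name, mask_fn in VARIATIONS.items():
        subset = eval_df[mask_fn(eval_df)]
        if len(subset) == 0:   # a variation with no coverage is itself a finding
            rows.append({"variation": name, "n": 0, "mae": None})
            continue
        rows.append({"variation": name, "n": len(subset),
                     "mae": mean_absolute_error(subset["actual_mw"], subset["forecast_mw"])})
    return pd.DataFrame(rows)
```

The point isn’t these particular slices; it’s that the suite lives in version control next to the model and grows every time the real world surprises you.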

4) Treat failure analysis as a first-class deliverable

Mitchell notes that failures often teach more than successes. In energy AI, this is non-negotiable.

Require a failure analysis that answers:

  • Where does it fail? (which feeders, which substations, which part families)
  • When does it fail? (which weather bands, which load levels)
  • How does it fail? (systematic bias, delayed response, false confidence)
  • What’s the operational impact? (crew dispatch errors, inventory misallocation, market exposure)

This turns model evaluation into a risk conversation procurement and operations can actually use.
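
Here’s a sketch of the “where, when, how” portion of that deliverable, assuming forecast errors are already joined to operational context (substation, temperature, load); the bands, thresholds, and column names are illustrative.

```python
# Failure analysis as a deliverable: rank the segments where the model fails,
# not just the overall score. Assumes errors joined to operational context;
# the bands and column names are illustrative.
import pandas as pd

def failure_report(df: pd.DataFrame, top_n: int = 10) -> pd.DataFrame:
    df = df.assign(
        error=df["forecast_mw"] - df["actual_mw"],
        abs_error=(df["forecast_mw"] - df["actual_mw"]).abs(),
        load_band=pd.qcut(df["actual_mw"], 4, labels=["low", "mid", "high", "peak"]),
        temp_band=pd.cut(df["temp_c"], [-40, -10, 0, 15, 30, 50]),
    )
    report = (
        df.groupby(["substation", "load_band", "temp_band"], observed=True)
          .agg(n=("error", "count"), bias=("error", "mean"), worst=("abs_error", "max"))
          .reset_index()
    )
    # "How it fails": bias flags systematic over/under-forecast per segment;
    # worst flags occasional large misses. Both feed the operational-impact discussion.
    return report.sort_values("worst", ascending=False).head(top_n)
```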

Replication isn’t “incremental”—it’s how you de-risk vendors

Mitchell calls out an AI culture problem: replication is undervalued because it’s not “novel.” That mindset is fine for chasing conference acceptance. It’s terrible for infrastructure.

Energy and utilities should normalize a different standard:

If a model claim can’t be replicated under your data, your constraints, and your latency requirements, it doesn’t exist.

In procurement terms, replication is your acceptance test. It’s the difference between:

  • buying a capability
  • buying a story

Practical procurement requirement: the vendor replication pack

If you’re sourcing an AI solution (forecasting, maintenance, grid optimization), ask for a replication pack that includes:

  1. Data schema + lineage (what fields are required, how they’re generated)
  2. Training/evaluation protocol (splits, leakage controls, baselines)
  3. Robustness tests (stress scenarios, counterfactuals, missing data)
  4. Model monitoring plan (drift metrics, alert thresholds, retraining triggers)
  5. Rollback plan (how operations continue safely if the model degrades)

This is how you turn “AI benchmarking” into a procurement asset instead of a procurement liability.

FAQ-style answers your stakeholders will ask anyway

“Do we need AGI-level tests for utility AI?”

No. You need operations-level tests. The goal isn’t to measure some abstract intelligence. It’s to measure whether the system performs reliably under the conditions your grid, assets, and suppliers actually experience.

“What’s the fastest way to improve trust in our models?”

Run three things immediately:

  1. a leakage audit
  2. a stress test on extreme events
  3. a tail-performance report (worst-case behavior)

If any of those are uncomfortable, that’s useful information—not a failure.

“How does this connect to supply chain and procurement?”

AI forecasts drive purchasing. Predictive maintenance drives spares. Risk models drive supplier selection. Bad benchmarks create confident mistakes, and confident mistakes are what blow up budgets and reliability KPIs.

What to do next if you’re deploying AI in energy and utilities

Most companies get this wrong by treating evaluation as a final checkbox. The better approach is to treat AI benchmarking as an ongoing operational control, like protection settings or NERC compliance evidence.

Start small and concrete:

  1. Define your “real-world” scenarios (winter peak, storm response, DER volatility, supply disruption).
  2. Build a benchmark suite with variations (not one dataset split).
  3. Require counterfactual control tests to catch Clever Hans behavior.
  4. Operationalize monitoring so evaluation doesn’t stop at go-live.
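
For step 4, one commonly used drift metric is the population stability index (PSI) between a feature’s training distribution and what the model sees in production. The sketch below uses the rule-of-thumb alert threshold of 0.2; treat both the metric choice and the threshold as assumptions to tune for your own assets, not a prescribed standard.

```python
# Ongoing monitoring: population stability index (PSI) between a feature's
# training distribution and its live distribution. The 0.2 alert threshold is
# a common rule of thumb, not a standard; tune it to your own data.
import numpy as np

def psi(train_values, live_values, bins: int = 10) -> float:
    edges = np.quantile(train_values, np.linspace(0, 1, bins + 1))
    # Clip live data into the training range so out-of-range values land in the edge bins.
    live_clipped = np.clip(live_values, edges[0], edges[-1])
    train_pct = np.histogram(train_values, edges)[0] / len(train_values)
    live_pct = np.histogram(live_clipped, edges)[0] / len(live_values)
    train_pct = np.clip(train_pct, 1e-6, None)   # avoid log(0) on empty bins
    live_pct = np.clip(live_pct, 1e-6, None)
    return float(np.sum((live_pct - train_pct) * np.log(live_pct / train_pct)))

# Example: alert when the live load distribution has drifted from training
# (e.g., a cold snap or a new EV charging cluster shifts the mean and spread).
rng = np.random.default_rng(2)
train_load = rng.normal(500, 60, 10_000)
live_load = rng.normal(560, 80, 1_000)
score = psi(train_load, live_load)
print(f"PSI = {score:.3f} -> {'ALERT: review/retrain' if score > 0.2 else 'stable'}")
```

Run a check like this per feature and per territory on a schedule, and wire the alerts into the channels operations already watch.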

If your AI is making recommendations that affect reliability, safety, or procurement spend, it deserves the same skepticism you’d apply to a new relay setting or an unproven supplier.

Better AI tests won’t just improve model quality. They’ll improve decision quality. And in energy—where mistakes propagate fast—that’s the whole point.