Stop Trusting AI Benchmarks for Energy Decisions

AI in Supply Chain & Procurement · By 3L3C

AI benchmarks often miss real-world risk in energy and procurement. Learn a practical evaluation playbook for robust, reliable AI decisions.

Tags: AI evaluation · Energy analytics · Utilities operations · Supply chain procurement · Model governance · Demand forecasting

A model can score in the 90s on an internal benchmark and still fail the first time your load profile shifts, a supplier substitutes a component, or an ice storm takes out a substation and your data distribution changes overnight.

That mismatch—great benchmark results, disappointing real-world performance—is exactly what Melanie Mitchell has been warning the AI community about. In a recent NeurIPS keynote and interview, she argued that we’re often testing AI “intelligence” the wrong way: we reward pattern-matching on familiar tasks, then act surprised when systems don’t generalize.

For leaders working in energy and utilities—and especially for teams applying AI across supply chain and procurement—this isn’t an academic gripe. It’s a reliability issue. If you’re using AI for demand forecasting, outage risk prediction, inventory optimization, vendor risk scoring, or maintenance planning, your evaluation strategy is either building trust… or quietly manufacturing future incidents.

Benchmarks don’t measure what energy operations need

Answer first: Benchmarks measure performance on a specific test set, while energy operations need performance under shifting conditions, messy constraints, and high-consequence decisions.

Most AI evaluation still looks like this: pick a benchmark, train a model, report accuracy (or MAE/MAPE/F1), and declare progress. Mitchell’s point is simple: passing a test designed for humans—or for a static dataset—doesn’t prove the system can do the job in the wild.

Energy is a perfect stress test for this problem because it punishes fragile generalization:

  • Weather regimes change (and 2025 continues to show what “non-stationary climate” really means).
  • DER adoption increases volatility (EV charging clusters, behind-the-meter solar, demand response).
  • Regulatory and market rules shift (tariffs, interconnection timelines, capacity market tweaks).
  • Supply chains remain disrupted (lead times, substitution risk, and quality variance).

A model that performs well only when the world behaves like last year’s dataset is a liability.

The “bar exam problem” shows up in utilities too

Mitchell uses a memorable analogy: acing the bar exam doesn’t make an AI a good lawyer. Energy has its own version: a model can “ace” your historical forecasting backtest and still be a poor operator.

Why? Because operational usefulness includes things benchmarks often ignore:

  • Can the system explain drivers (temperature vs. price vs. curtailment vs. customer mix)?
  • Does it know when it’s out of distribution and should abstain or escalate?
  • Does it stay stable when SCADA tags drift, AMI clocks skew, or a vendor changes naming conventions?
  • Can planners simulate “what if we can’t procure that transformer for 28 weeks?”

Benchmarks don’t fail because they’re evil. They fail because they’re incomplete.

Treat AI like an “alien intelligence,” not a perfect employee

Answer first: You’ll evaluate AI more accurately if you assume it’s a nonverbal, nonhuman problem-solver—closer, for experimental-design purposes, to an infant or animal than to a human analyst.

Mitchell borrows the phrase “alien intelligences” (with quotes) to describe both modern AI systems and nonverbal biological minds like infants and animals. The framing is useful for energy teams because it breaks a common habit:

We keep grading models as if they think like us.

They don’t. They produce outputs that can look fluent or confident while relying on shortcuts. That’s not “bad”—it’s simply the nature of statistical systems trained on historical patterns.

If you’re deploying AI in grid operations or procurement workflows, the right posture is:

  • Curious about what the model can do
  • Skeptical about why it’s doing it
  • Relentless about alternative explanations

I strongly agree with Mitchell’s view that “skeptic” should be a compliment. In critical infrastructure, skepticism is a safety feature.

The Clever Hans lesson (and the energy equivalent)

Mitchell highlights the classic story of Clever Hans, the horse that appeared to do arithmetic by tapping its hoof. A psychologist eventually showed Hans was reading subtle cues from the questioner—he wasn’t solving math.

Energy AI has “Clever Hans” moments all the time:

  • A model predicts outages “amazingly well”… because it learned to pick up crew dispatch artifacts or post-event labels that leak the answer.
  • A procurement risk model flags suppliers… because it learned that certain suppliers use different invoice formats correlated with past disputes.
  • A load forecast looks great… because the dataset includes a feature that’s basically a proxy for the target (like a settlement field updated after the fact).

These are label leakage and shortcut learning problems. Without strong controls, you’ll celebrate a model that’s “reading your facial expressions.”

What psychology can teach AI teams in energy procurement

Answer first: Borrow experimental discipline from developmental and comparative psychology: controls, stimulus variation, failure analysis, and replication.

Mitchell points out that AI researchers often lack formal training in experimental methodology. Developmental and comparative psychologists have to study minds that can’t explain themselves, so they’ve developed rigorous techniques to avoid fooling themselves.

Here’s how to translate that into practice for AI in supply chain and procurement, and for AI in energy and utilities.

1) Design control tests that remove the obvious crutches

If a model’s performance collapses when you remove one “easy” feature, that’s not a bug—it’s diagnostic information.

Practical controls for energy/procurement AI (the blackout test is sketched in code after this list):

  • Time-travel audits: verify every feature would exist at prediction time, not after the event.
  • Feature blackout tests: retrain or re-evaluate with suspect features removed (dispatch codes, resolution notes, settlement fields).
  • Counterfactual substitutions: replace vendor names, plant IDs, or feeder IDs with randomized tokens to see if the model is learning identity rather than drivers.
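
To make the blackout test concrete, here is a minimal sketch assuming a pandas DataFrame of historical features, a binary outage target, and scikit-learn. The suspect column names (dispatch_code, crew_notes_len) are placeholders for whatever your own audit flags, and the features are assumed to be numerically encoded already.

```python
# Minimal sketch of a feature blackout test. If skill collapses when suspect
# features are removed, the model was likely leaning on leakage or shortcuts
# rather than genuine drivers. Column names here are illustrative.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

SUSPECT_FEATURES = ["dispatch_code", "crew_notes_len"]  # may leak post-event information

def blackout_test(df: pd.DataFrame, target: str = "outage_next_24h") -> dict:
    X, y = df.drop(columns=[target]), df[target]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y
    )

    def fit_and_score(cols: list) -> float:
        model = GradientBoostingClassifier(random_state=0)
        model.fit(X_tr[cols], y_tr)
        return roc_auc_score(y_te, model.predict_proba(X_te[cols])[:, 1])

    all_cols = list(X.columns)
    kept = [c for c in all_cols if c not in SUSPECT_FEATURES]
    return {
        "auc_all_features": fit_and_score(all_cols),  # the score you brag about
        "auc_without_suspects": fit_and_score(kept),  # the score you should trust
    }
```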

2) Vary the “stimuli” to test robustness, not memorization

Psychology experiments often use many small variations to check whether an effect is real. Energy AI should do the same.

Robustness checks that matter operationally:

  • Evaluate by seasonal slices (winter peaks vs. shoulder months)
  • Evaluate by event regimes (heatwaves, polar vortex, wildfire smoke days)
  • Evaluate by asset classes (pad-mount vs. overhead, different OEMs)
  • Evaluate by supplier change scenarios (substitution allowed vs. not allowed)

A single averaged metric can hide a model that’s great 10 months a year and dangerous in the other two.
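
A minimal sketch of sliced evaluation, assuming an evaluation table with y_true, y_pred, and a regime label; all column names are illustrative.

```python
# Sketch: report forecast error per operating regime instead of one global average.
import pandas as pd

def error_by_slice(results: pd.DataFrame, slice_col: str = "regime") -> pd.DataFrame:
    ape = (results["y_true"] - results["y_pred"]).abs() / results["y_true"].abs()
    grouped = results.assign(abs_pct_err=ape).groupby(slice_col)["abs_pct_err"]
    return pd.DataFrame({
        "mape": grouped.mean(),
        "p95_ape": grouped.quantile(0.95),  # the tail is where the operational pain lives
        "n_obs": grouped.size(),
    })

# Toy example: the heatwave slice exposes error that a single global MAPE would hide.
results = pd.DataFrame({
    "y_true": [100, 120, 310, 295],
    "y_pred": [98, 125, 250, 240],
    "regime": ["shoulder", "shoulder", "heatwave", "heatwave"],
})
print(error_by_slice(results))
```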

3) Study failures more than successes

Mitchell notes that failure modes often teach you more than a successful score. In operations, that’s doubly true because edge cases are where the cost lives.

A practical failure-analysis routine:

  1. Collect the top 50 worst errors each month.
  2. Categorize them (data quality, regime shift, rare event, wrong constraint, human override).
  3. Decide whether each category needs: better data, different model class, new constraints, or a workflow change.
  4. Feed those insights into the next evaluation cycle.

If your AI governance process doesn’t have a standing “worst errors review,” you’re flying blind.
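
A minimal sketch of steps 1 and 2 of that routine, assuming a scored pandas DataFrame with timestamp, y_true, and y_pred columns plus an analyst-assigned failure_category; all names are illustrative.

```python
# Sketch of a monthly "worst errors review": pull the largest misses, then tally
# the failure categories analysts assign (data quality, regime shift, rare event, ...).
import pandas as pd

def worst_errors(scored: pd.DataFrame, month: str, n: int = 50) -> pd.DataFrame:
    """scored needs datetime 'timestamp', 'y_true', 'y_pred' columns (illustrative names)."""
    monthly = scored[scored["timestamp"].dt.strftime("%Y-%m") == month].copy()
    monthly["abs_error"] = (monthly["y_true"] - monthly["y_pred"]).abs()
    return monthly.nlargest(n, "abs_error")

def failure_breakdown(reviewed: pd.DataFrame) -> pd.Series:
    """reviewed is the worst-errors table after analysts add a 'failure_category' column."""
    return reviewed["failure_category"].value_counts(normalize=True)
```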

4) Replication should be rewarded internally—even if conferences don’t

Mitchell criticizes the AI research culture for undervaluing replication. Energy companies can’t afford that attitude.

Internal replication is how you prevent “one brilliant notebook” from becoming a production incident.

What replication looks like in a utility or energy procurement org:

  • Another team reruns the experiment from scratch using the same data contract
  • Results must hold across different time windows and different operating regions
  • You publish an internal “model card” that states what changed, what didn’t, and what still breaks

This is also how you accelerate adoption: operational stakeholders trust systems that behave consistently.
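
One lightweight way to make this concrete is a tolerance check between the original build and the rebuild before anything moves toward production; the metrics and thresholds below are illustrative assumptions, not a standard.

```python
# Sketch: compare two independent rebuilds of the "same" model and flag metric
# deltas that exceed an agreed tolerance. Metrics and thresholds are illustrative.
TOLERANCE = {"mape": 0.01, "pinball_loss": 0.02}

def replication_check(original: dict, rebuild: dict, tolerance: dict = TOLERANCE) -> dict:
    report = {}
    for metric, tol in tolerance.items():
        delta = abs(original[metric] - rebuild[metric])
        report[metric] = {
            "original": original[metric],
            "rebuild": rebuild[metric],
            "delta": round(delta, 4),
            "replicated": delta <= tol,
        }
    return report

# Example: a second team's rebuild of a demand forecast, same data contract.
print(replication_check(
    {"mape": 0.062, "pinball_loss": 0.118},
    {"mape": 0.071, "pinball_loss": 0.121},
))
```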

A better evaluation playbook for AI in grid and supply chain workflows

Answer first: Replace single-score benchmarking with an evaluation stack: generalization tests, operational constraints, uncertainty, and monitoring.

If you’re using AI for demand forecasting, inventory optimization, supplier performance analytics, or predictive maintenance, here’s a practical playbook you can implement without waiting for a research breakthrough.

Evaluation Layer 1: Generalization tests (do they travel?)

Your model should prove it can handle change.

  • Temporal generalization: train on 2022–2024, test on 2025 months with major events
  • Geographic generalization: train on one territory/ISO zone, test on another
  • Supplier/asset generalization: train on OEM A, test on OEM B (or with substituted parts)

If the model can’t travel, don’t deploy it widely. Deploy it narrowly with guardrails.
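
A minimal sketch of the first two splits, assuming a pandas DataFrame with a datetime timestamp column and a territory column; the column names, cutoff date, and holdout territory are illustrative.

```python
# Sketch: build "does it travel?" splits instead of a single random holdout.
import pandas as pd

def temporal_split(df: pd.DataFrame, cutoff: str = "2025-01-01"):
    train = df[df["timestamp"] < cutoff]    # e.g. 2022-2024 history
    test = df[df["timestamp"] >= cutoff]    # 2025 months, including major events
    return train, test

def geographic_split(df: pd.DataFrame, holdout_territory: str = "TERRITORY_B"):
    train = df[df["territory"] != holdout_territory]
    test = df[df["territory"] == holdout_territory]  # a zone the model has never seen
    return train, test
```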

Evaluation Layer 2: Constraint realism (can it operate?)

Forecast accuracy is not enough if the output can’t be executed.

Examples:

  • Procurement recommendations must honor MOQ, lead time, approved vendor lists, contract terms
  • Maintenance scheduling must respect crew availability, outage windows, safety constraints
  • Grid optimization must respect thermal limits, voltage constraints, N-1 reliability criteria

A good evaluation includes simulated execution: “Given the model’s output, can the business actually act?”
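
A minimal sketch of that simulated-execution check for a procurement recommendation; the vendor list, MOQs, and lead times are invented placeholders, not real reference data.

```python
# Sketch of "simulated execution": a recommendation only counts if the business
# could actually act on it. All reference data below is an invented placeholder.
from dataclasses import dataclass

@dataclass
class Recommendation:
    vendor: str
    part: str
    quantity: int
    needed_in_weeks: int

APPROVED_VENDORS = {"VENDOR_A", "VENDOR_B"}
MIN_ORDER_QTY = {"transformer_50kva": 5}
LEAD_TIME_WEEKS = {("VENDOR_A", "transformer_50kva"): 28}

def is_executable(rec: Recommendation) -> bool:
    if rec.vendor not in APPROVED_VENDORS:
        return False                                   # not on the approved vendor list
    if rec.quantity < MIN_ORDER_QTY.get(rec.part, 1):
        return False                                   # below the MOQ
    lead = LEAD_TIME_WEEKS.get((rec.vendor, rec.part), 0)
    return lead <= rec.needed_in_weeks                 # lead time must fit the need date

rec = Recommendation("VENDOR_A", "transformer_50kva", quantity=6, needed_in_weeks=20)
print(is_executable(rec))  # False: a 28-week lead time cannot meet a 20-week need
```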

Evaluation Layer 3: Calibration and abstention (does it know when it’s wrong?)

Energy teams should demand models that can say “I don’t know” in a measurable way.

  • Track calibration (prediction intervals that actually contain the truth at the stated rate)
  • Implement abstention rules (escalate to human review when uncertainty is high)
  • Require reason codes that map to measurable drivers, not vague explanations

If a model always sounds confident, treat that as a defect.
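
A minimal sketch of both checks, assuming the model already emits lower and upper prediction bounds; the 25% relative-width threshold for abstention is an illustrative choice, not a recommendation.

```python
# Sketch: measure whether stated prediction intervals actually contain the truth,
# and abstain (escalate to a human) when the interval is too wide to act on.
import numpy as np

def interval_coverage(y_true, lower, upper) -> float:
    y_true, lower, upper = map(np.asarray, (y_true, lower, upper))
    return float(np.mean((y_true >= lower) & (y_true <= upper)))

def should_abstain(lower: float, upper: float, point: float,
                   max_rel_width: float = 0.25) -> bool:
    """Escalate when the interval spans more than 25% of the point forecast (illustrative)."""
    return (upper - lower) > max_rel_width * abs(point)

# If the "90% interval" only covers a third of actuals, that is a calibration defect.
print(interval_coverage(y_true=[100, 200, 150], lower=[90, 150, 160], upper=[110, 190, 180]))
print(should_abstain(lower=90, upper=140, point=100))  # True -> route to human review
```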

Evaluation Layer 4: Monitoring tied to business impact

Once deployed, monitoring should detect drift before customers do.

  • Data drift: sensor changes, missingness, vendor formatting shifts
  • Concept drift: new load patterns due to EV adoption or rate plan changes
  • Outcome drift: KPI changes even when accuracy looks stable

Tie monitoring to business outcomes: stockouts avoided, expedite fees reduced, SAIDI/SAIFI impact, forecast error during peaks, and maintenance deferrals avoided.
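
One simple data-drift signal is the population stability index (PSI) between a reference window and the live feed, sketched below; the 0.2 alert threshold is a common rule of thumb, not a standard, so tune it to your own false-alarm tolerance.

```python
# Sketch: population stability index (PSI) as one cheap data-drift signal.
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10, eps: float = 1e-6) -> float:
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf                # catch out-of-range live values
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference) + eps
    cur_frac = np.histogram(current, bins=edges)[0] / len(current) + eps
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
baseline = rng.normal(50, 10, 5000)   # e.g. last year's feeder-load feature
live = rng.normal(58, 14, 5000)       # EV adoption shifts the distribution
print(psi(baseline, live) > 0.2)      # True -> investigate before the KPI moves
```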

“People also ask” questions energy teams raise (and straight answers)

Do we need AGI-style tests for energy AI?

No. You need task-specific reliability under change, not a debate about whether a model is “generally intelligent.” Mitchell’s skepticism about AGI measurement is healthy: if you can’t define it crisply, you can’t evaluate it rigorously.

What’s the hidden cost of weak AI evaluation in procurement?

Expedites, excess inventory, and vendor disputes are the visible costs. The hidden ones are worse: decisions that degrade resilience—like concentrating orders with a supplier that looks “low risk” only because the model learned a shortcut.

Can LLMs be evaluated with the same rigor?

Yes, but not with vibe checks. For LLM-based procurement assistants (SOW drafting, supplier Q&A, invoice triage), build test suites with the following (a repeatability check is sketched after the list):

  • Adversarial prompts (jailbreak-style, ambiguity, conflicting constraints)
  • Ground-truth rubrics (what counts as acceptable, unsafe, or non-compliant)
  • Repeatability tests (does the answer change across runs?)
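
A minimal sketch of the repeatability check; ask_assistant stands in for whatever client wraps your deployed assistant, and the rubric is deliberately crude, only to show the shape of the test.

```python
# Sketch of a repeatability test for an LLM-based procurement assistant.
# The rubric below is deliberately crude and purely illustrative.
from collections import Counter
from typing import Callable

def passes_rubric(answer: str) -> bool:
    # Illustrative rubric: must reference the approved vendor list and must not
    # quote a dollar figure unless it is anchored to a contract.
    mentions_avl = "approved vendor" in answer.lower()
    unanchored_price = "$" in answer and "per contract" not in answer.lower()
    return mentions_avl and not unanchored_price

def repeatability(prompt: str, ask_assistant: Callable[[str], str], runs: int = 10) -> dict:
    answers = [ask_assistant(prompt) for _ in range(runs)]
    verdicts = Counter(passes_rubric(a) for a in answers)
    return {
        "pass_rate": verdicts[True] / runs,
        "distinct_answers": len(set(answers)),  # a wide spread signals unstable behavior
    }

# Stub assistant for demonstration only; wire in your real client here.
print(repeatability(
    "Which vendor should we use for 50 kVA pad-mount transformers?",
    ask_assistant=lambda p: "Use VENDOR_A from the approved vendor list.",
))
```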

What to do next if you own AI for energy supply chains

The teams getting value from AI in supply chain and procurement aren’t the ones with the flashiest models. They’re the ones with evaluation discipline that matches the stakes.

Start with three moves this quarter:

  1. Create a “Clever Hans checklist” for leakage, proxies, and shortcut features.
  2. Add generalization slices to every model report (season/event/territory/supplier).
  3. Institutionalize replication: require a second build before production, and track deltas.

If you’re responsible for reliable operations—whether that’s keeping the grid stable or keeping critical parts available—ask your team a blunt question: If this model fails on the worst day of the year, will we see it coming in our evaluation report?