Why General AI Fails at “Easy” Tasks Like Time

AI in Supply Chain & Procurement · By 3L3C

General AI can fail at “easy” tasks like telling time. Here’s what that teaches energy procurement teams about domain-specific AI and reliable automation.

Tags: AI reliability, Procurement analytics, Supply chain risk, Utilities AI, Multimodal AI, Data drift, Human-in-the-loop

A modern multimodal AI model can summarize a 40-page contract, caption an image, and draft an email your procurement team would actually send. Then you show it an analog clock—and it confidently gives you the wrong time.

That mismatch isn’t a funny party trick. It’s a warning label for anyone using AI in supply chain and procurement inside energy and utilities, where the “simple” step is often the one that breaks the workflow. If an AI system misreads a clock hand because it hasn’t seen that visual pattern before, it can just as easily misread a meter photo, misclassify a maintenance invoice line item, or confuse a substation asset tag because the label is smudged or the format is new.

The clock problem, reported in recent research on multimodal large language models (MLLMs), is a clean example of a bigger reality: general AI performs well on familiar patterns and fails hard when the input shifts. Energy supply chains don’t get to live in “familiar patterns.” They live in bad lighting, legacy formats, vendor variability, winter storm disruptions, and last-minute substitutions.

What the “AI can’t read a clock” result really proves

Answer first: The clock results prove that multimodal AI can struggle with basic spatial reasoning and generalization, and small perception errors can cascade into full task failure.

In the study highlighted by IEEE Spectrum, researchers generated a large synthetic dataset of analog clock images (tens of thousands of displayed times). Multiple MLLMs initially performed poorly. Fine-tuning with a few thousand additional examples improved performance—until the models were tested on a new collection of clock images, where accuracy dropped again.

That pattern matters more than the headline.

The core issue: brittle generalization

Clock reading looks trivial to humans because we’ve internalized:

  • The semantics of “hour hand vs minute hand”
  • The geometry of angles and relative positions
  • The convention that the hour hand moves continuously
  • The mapping between hand orientation and time

Many general-purpose AI systems don’t truly “understand” those rules. They recognize visual patterns that correlate with labels—until the pattern changes.

When researchers warped clock shapes or changed the appearance of the hands (for example, adding arrows), models struggled even more. That’s classic distribution shift: the model learned the dataset’s style, not the task’s underlying structure.

Cascading errors: the supply chain version

The study also points to an important failure mode: if the model misidentifies the hands, it becomes worse at estimating orientation, which then compounds into a wrong time.

This is exactly how procurement automation breaks in the real world:

  1. The model misreads a supplier part number (perception)
  2. That causes the wrong cross-reference match (interpretation)
  3. That triggers an incorrect reorder suggestion (decision)
  4. That creates stockouts or overstock (operations)

The problem isn’t one mistake. It’s that modern AI pipelines often treat early steps as “good enough,” then stack automations on top.
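The compounding effect is easy to underestimate. A minimal sketch, with hypothetical per-stage accuracies, shows how four steps that each look "good enough" still produce a meaningful end-to-end failure rate:

```python
# Toy model of cascading error: each pipeline stage has its own accuracy,
# and end-to-end reliability is (roughly) their product.
# All numbers are illustrative assumptions, not measured rates.
stages = {
    "perception (read part number)": 0.97,
    "interpretation (cross-reference match)": 0.98,
    "decision (reorder suggestion)": 0.99,
    "operations (fulfillment)": 0.995,
}

end_to_end = 1.0
for name, accuracy in stages.items():
    end_to_end *= accuracy
    print(f"{name}: {accuracy:.1%} -> cumulative {end_to_end:.1%}")

# Four individually "good enough" stages still leave roughly
# 1 in 16 orders wrong end to end.
print(f"End-to-end success: {end_to_end:.1%}")
```

The arithmetic is the point: stacking automations multiplies error rates, which is why the earliest (perception) stage deserves the most scrutiny.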

Why this matters right now for energy procurement teams (December 2025 reality)

Answer first: Winter operations amplify AI brittleness because inputs get messier, substitutes become common, and procurement decisions tighten.

Late December is a stress test for utilities: weather-driven demand swings, constrained field labor, tighter logistics capacity, and more emergency buys. The procurement signal-to-noise ratio gets worse.

If you’re using AI to support:

  • Demand forecasting for critical spares
  • Supplier risk scoring
  • Invoice-to-PO matching
  • Maintenance, repair, and operations (MRO) catalog classification
  • Outage-driven logistics coordination

…then you’re operating in a high-consequence environment where “mostly correct” can be operationally wrong.

A practical stance I’ll defend: general AI is best treated as a draft assistant until it’s wrapped in domain constraints and verified with domain-grade testing.

The lesson for AI in supply chain & procurement: constraints beat confidence

Answer first: Domain-specific AI succeeds in critical infrastructure because it adds constraints, structured context, and specialized evaluation—not just more generic training data.

When teams adopt a general MLLM and expect it to run procurement workflows end-to-end, they’re betting that the model will generalize across:

  • New supplier templates
  • New SKU formats
  • New site naming conventions
  • New regulatory terms
  • New asset families

That bet is usually wrong.

Here’s what works better in energy supply chain AI systems.

1) Ground the model in structured procurement context

Instead of asking the model to “figure it out” from raw documents, strong implementations anchor outputs to:

  • Approved vendor lists (AVL)
  • Contracted price books
  • Material master / item master taxonomy
  • Site/asset hierarchy (plant → system → component)
  • Standard work order codes

When the model has to choose between existing, validated entities, the “clock hand confusion” problem shrinks dramatically.
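One lightweight way to enforce that constraint is to snap free-text extraction onto the validated entity list and escalate anything without a close match. A minimal sketch (vendor names and the similarity cutoff are illustrative assumptions):

```python
# Constrain model output to validated entities: instead of trusting
# free-text extraction, match it against the approved vendor list (AVL)
# and flag anything without a close match for review.
from difflib import get_close_matches

APPROVED_VENDORS = [
    "Acme Transformer Supply",   # hypothetical AVL entries
    "Northline Breaker Co",
    "Granite Relay Systems",
]

def resolve_vendor(extracted_name: str, cutoff: float = 0.8):
    """Return (vendor, needs_review): snap to the AVL or escalate."""
    matches = get_close_matches(extracted_name, APPROVED_VENDORS, n=1, cutoff=cutoff)
    if matches:
        return matches[0], False   # validated entity, safe to use
    return extracted_name, True    # unknown vendor -> human review

# OCR dropped a letter; the match still lands on the validated entity.
vendor, review = resolve_vendor("Acme Transfomer Supply")
```

The design choice matters: the system never invents a vendor, it either picks from the validated list or admits it needs a human.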

2) Use hybrid pipelines: rules + ML + human checks

Pure end-to-end generative workflows are fragile. A more reliable pattern is:

  • Deterministic parsing where possible (tables, known fields)
  • Specialized models for extraction/classification
  • A language model for summarization and exception handling
  • Human-in-the-loop review for high-impact decisions

In procurement terms: let AI propose, but make approvals boring and auditable.
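That "propose, then gate" pattern can be sketched as a simple router. The policy threshold, confidence floor, and field names here are assumptions for illustration:

```python
# Hybrid gating: deterministic rules first, policy limits second,
# model confidence third; only then does anything auto-approve.
from dataclasses import dataclass

@dataclass
class LineItem:
    po_number: str
    amount: float
    model_confidence: float   # reported by the extraction model

APPROVAL_LIMIT = 10_000.0     # policy threshold (illustrative)
CONFIDENCE_FLOOR = 0.95       # uncertainty gate (illustrative)

def route(item: LineItem, known_po_numbers: set) -> str:
    if item.po_number not in known_po_numbers:
        return "reject: unknown PO"            # deterministic rule
    if item.amount > APPROVAL_LIMIT:
        return "human review: over limit"      # policy gate
    if item.model_confidence < CONFIDENCE_FLOOR:
        return "human review: low confidence"  # model uncertainty gate
    return "auto-approve"                      # the boring, auditable path

decision = route(LineItem("PO-1001", 2500.0, 0.98), {"PO-1001", "PO-1002"})
```

Every branch returns a labeled decision string, which makes the approval trail auditable by construction.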

3) Treat “new formats” as a first-class requirement

The clock study shows models stumble when the clock “looks different.” Procurement has the same problem:

  • A supplier changes invoice layout
  • A subcontractor uses different unit abbreviations
  • A new region introduces different tax lines
  • A freight carrier changes tracking event codes

So your evaluation plan must include format drift, not just average-case accuracy.

A good operational metric is not only “overall match rate,” but:

  • Accuracy on unseen suppliers
  • Accuracy on new templates
  • Accuracy under low-quality scans/photos
  • Error rate by commodity family
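Sliced metrics like these are straightforward to compute. A sketch with made-up evaluation records, just to show the shape of the report:

```python
# Slice accuracy by the conditions that actually break procurement AI,
# rather than reporting a single overall number.
# The records below are fabricated for illustration.
from collections import defaultdict

records = [
    # (slice_name, prediction_correct)
    ("seen_supplier", True), ("seen_supplier", True), ("seen_supplier", True),
    ("seen_supplier", True), ("seen_supplier", True),
    ("unseen_supplier", True), ("unseen_supplier", False),
    ("new_template", False), ("new_template", True),
    ("low_quality_scan", False), ("low_quality_scan", True),
]

totals = defaultdict(lambda: [0, 0])   # slice -> [correct, total]
for slice_name, correct in records:
    totals[slice_name][1] += 1
    if correct:
        totals[slice_name][0] += 1

by_slice = {s: c / t for s, (c, t) in totals.items()}
overall = sum(c for c, _ in totals.values()) / len(records)
# Overall accuracy blends the slices together; the unseen-supplier
# and new-template rows are where the real risk shows up.
```

In this toy data the "seen supplier" slice is perfect while unseen suppliers sit at 50%, which is exactly the gap an average-case metric hides.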

What to test so your procurement AI doesn’t “fail like a clock”

Answer first: Test for generalization, not memorization, by building stress tests that mimic real procurement messiness.

If you’re building or buying AI for supply chain and procurement in energy and utilities, include these in your acceptance criteria.

Generalization test suite (practical checklist)

  1. Template shift tests

    • Same invoice content, 5 different layouts
    • Same PO content, different column orders
  2. Appearance shift tests (scans and photos)

    • Blur, glare, skew, partial occlusion
    • Low-light phone images from the field
  3. Entity ambiguity tests

    • Similar supplier names (parent vs subsidiary)
    • Similar part numbers (one character off)
  4. Unit and quantity traps

    • EA vs FT vs M vs KG
    • Pack sizes, minimum order quantities, split shipments
  5. Workflow integrity tests (cascading errors)

    • Measure how a small extraction error changes downstream decisions

A simple rule I use: if an error can create a truck roll, a stockout of critical spares, or a contract compliance issue, it needs a deterministic backstop.
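A deterministic backstop can be as plain as a hard limit enforced outside the model. A sketch, with hypothetical part numbers and policy limits:

```python
# Deterministic backstop: whatever quantity the model proposes,
# critical-spare orders are clamped to policy limits in plain code.
# Part numbers and limits are hypothetical.
CRITICAL_SPARES_MAX_ORDER = {
    "XFMR-115KV-60MVA": 2,    # large power transformers
    "RELAY-SEL-751": 20,      # protection relays
}

def backstop(part_number: str, proposed_qty: int) -> int:
    """Clamp AI-proposed quantities for critical spares to policy limits."""
    limit = CRITICAL_SPARES_MAX_ORDER.get(part_number)
    if limit is not None and proposed_qty > limit:
        return limit   # a model mistake can never become a mega-order
    return proposed_qty

# Model hallucinated an order for 40 transformers; the backstop caps it.
qty = backstop("XFMR-115KV-60MVA", 40)
```

The backstop never improves the model; it bounds the blast radius when the model is wrong, which is the property that matters for truck rolls and critical spares.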

How energy-focused AI teams avoid these failures

Answer first: Energy-focused AI avoids “basic reasoning failures” by embedding operational physics, business rules, and domain ontologies into the system.

General AI tries to infer the world from patterns. Energy operations are better served by systems that know the rules.

Examples of domain knowledge that changes outcomes:

  • Grid operations constraints: ramp limits, contingency logic, outage windows
  • Asset criticality: lead times and failure consequences for transformers, breakers, protection relays
  • Procurement policy: thresholds, preferred suppliers, approval routing
  • Commodity taxonomy: consistent classes for MRO parts and services
  • Maintenance semantics: mapping work orders to materials and vendors

This is also where utilities are making progress with context-aware AI: models that don’t just read text, but operate inside a structured environment with validated options.

That’s the bridge from the clock story to energy procurement: you don’t need an AI that’s “smart.” You need an AI that’s constrained correctly.

A procurement AI that can’t say “I don’t know” is a liability.
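Making "I don't know" a first-class answer can be as simple as abstaining when the top prediction is weak or the margin over the runner-up is thin. A sketch (both thresholds are illustrative assumptions):

```python
# Abstention wrapper: return None (route to a human) unless the top
# prediction is both confident and clearly separated from the runner-up.
def classify_or_abstain(scores: dict,
                        min_confidence: float = 0.9,
                        min_margin: float = 0.2):
    """scores maps label -> model score; None means 'ask a human'."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top_label, top_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if top_score < min_confidence or top_score - runner_up < min_margin:
        return None          # abstain -> human review queue
    return top_label

confident = classify_or_abstain({"breaker": 0.97, "relay": 0.03})
unsure = classify_or_abstain({"breaker": 0.55, "relay": 0.45})
```

An abstention rate you can monitor is far cheaper than a silent-error rate you discover in the warehouse.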

Where this fits in an “AI in Supply Chain & Procurement” strategy

Answer first: The clock failure is a reminder that scaling AI in procurement is mostly an engineering discipline: data quality, domain modeling, and rigorous evaluation.

If this post is part of your broader series on AI forecasts, supplier management, risk reduction, and global supply chain optimization, here’s the narrative connection:

  • Forecasting and optimization only work if inputs are trustworthy.
  • Trustworthy inputs require robust perception and extraction.
  • Robust extraction requires domain-specific training and testing, not generic demos.

It’s tempting to start with shiny use cases—supplier chatbots, autonomous sourcing agents, automated negotiation summaries. Most teams get better ROI by starting one layer lower:

  • Clean item masters
  • High-confidence classification
  • Exception handling
  • Drift monitoring

That foundation is what keeps your AI from misreading the “clock hands” of your procurement data.

What to do next

If you’re evaluating AI for energy supply chain and procurement in 2026 planning cycles, take these next steps:

  1. Pick one workflow (invoice matching, MRO classification, supplier risk, demand forecasting) and map the failure cascade.
  2. Define a generalization benchmark that includes unseen suppliers and format drift.
  3. Require constraint mechanisms (approved entity lists, business rules, confidence thresholds, audit logs).
  4. Monitor drift monthly, and after major vendor/template changes.
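Step 4 can start very small. One common drift signal is the Population Stability Index (PSI) over a category mix, with PSI above roughly 0.2 treated as a rule-of-thumb alert threshold; the category names and proportions below are illustrative:

```python
# Lightweight drift monitor: compare this month's order-category mix
# against a baseline using the Population Stability Index (PSI).
import math

def psi(baseline: dict, current: dict, eps: float = 1e-6) -> float:
    """PSI across category proportions; higher means more drift."""
    total = 0.0
    for cat in set(baseline) | set(current):
        b = baseline.get(cat, 0.0) + eps   # eps guards log(0)
        c = current.get(cat, 0.0) + eps
        total += (c - b) * math.log(c / b)
    return total

baseline    = {"transformers": 0.2, "breakers": 0.5, "relays": 0.3}
storm_month = {"transformers": 0.5, "breakers": 0.3, "relays": 0.2}

drift = psi(baseline, storm_month)   # well above the 0.2 alert level
```

A storm month that shifts orders toward transformers lights up the monitor, which is precisely the "inputs got messier" moment when model accuracy should be re-checked.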

General AI will keep improving. But the teams that win in utilities won’t wait for “smarter models.” They’ll build systems that stay reliable when the real world gets weird.

Where in your procurement workflow would a small AI mistake turn into a big operational problem—and do you have a backstop there today?
