Local LLMs are finally practical on modern hardware. See what NPUs, unified memory, and hybrid architecture mean for secure, low-latency utility AI.

Local LLMs for Utilities: The Hardware Catch-Up
Most energy and utility teams I talk to want the same thing from AI: answers fast, on-site, and without shipping sensitive operational data to a third party. The problem is that, until very recently, “run it locally” was mostly wishful thinking—especially if “locally” meant a standard-issue laptop, a control-room workstation, or an edge box inside a substation.
That’s starting to change, and the laptop market is a surprisingly good indicator of what’s next for industrial AI infrastructure. The same forces pushing Windows PCs toward NPUs (neural processing units) and unified memory are also shaping how utilities should plan AI deployments across control centers, field operations, and data centers.
This post is part of our AI in Cloud Computing & Data Centers series, and it’s a reminder of a simple truth: model choice matters, but infrastructure design decides whether AI is reliable, secure, and affordable in production.
Why local AI is becoming a utility requirement (not a nice-to-have)
Answer first: Utilities are moving toward local LLMs and on-device inference because latency, resilience, and data governance are operational constraints—not preferences.
Cloud-hosted LLMs work well for general knowledge tasks, but grid operations and utility workflows aren’t “general.” They’re full of constraints:
- Outage resilience: If a cloud endpoint is unavailable—or connectivity is degraded during storms—AI assistance disappears right when operators need it.
- Latency: Some tasks don’t tolerate round-trip delays, especially when AI supports real-time decisioning (alarm triage, switching procedure checks, call center assist during peak events).
- Data privacy and sovereignty: SCADA context, protection settings, asset health signals, and customer data can be sensitive. Many teams won’t accept “send it to a black box.”
- Cost predictability: Cloud inference can be economical at first, then become hard to forecast when usage scales across thousands of employees and systems.
Here’s the stance I’ll take: for utilities, local AI isn’t about avoiding the cloud. It’s about reducing the cloud’s blast radius. Keep the cloud for training, fleet management, and heavy experimentation. Push more inference closer to where decisions are made.
Your laptop is the canary: what NPUs signal for enterprise AI
Answer first: NPUs are a signal that the industry is optimizing mainstream hardware for AI inference, which will spill over into enterprise endpoints and rugged edge systems.
A year-old “normal” laptop typically has a CPU, maybe an integrated GPU, and 16 GB of RAM. That’s fine for email and spreadsheets. It’s not fine for modern LLM inference. The shift underway is the addition of purpose-built AI acceleration:
- NPUs are designed to execute the matrix-heavy operations AI relies on.
- They’re generally more power efficient than discrete GPUs for many inference workloads.
- They’re improving quickly: mainstream laptop NPUs have moved from roughly 10 TOPS in 2023-era chips to 40–50 TOPS in newer ones, with some systems aiming far higher.
What TOPS means (and why utilities should care)
TOPS (trillions of operations per second) is an imperfect metric, but it’s useful as a directional indicator. For utility leaders, the practical takeaway is:
AI capacity is becoming a standard line item in endpoint procurement, just like RAM and storage.
That changes how you plan:
- Control-room PCs and engineering workstations will increasingly ship with AI acceleration by default.
- Field laptops used for inspections, switching, and maintenance will get AI features without requiring a discrete GPU.
- More inference workloads can shift from data center to edge—if your software stack is ready.
The real bottleneck isn’t compute. It’s memory.
Answer first: Local LLM performance is often limited by memory capacity and memory architecture more than raw compute.
LLMs and many other generative models need their weights (or large parts of them) resident in memory to run efficiently. Traditional PC designs split memory into separate pools:
- System memory (RAM) for the CPU
- Dedicated GPU memory (VRAM) for the GPU
That split made sense historically. For AI, it creates friction: moving data across buses burns power and time, and it makes “how much memory do I really have available to the model?” a messy question.
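To put numbers on that, here’s a back-of-the-envelope sketch (planning arithmetic, not a benchmark) of how much memory model weights alone consume at different quantization levels:

```python
def estimate_weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Rough estimate of LLM weight memory: parameter count x bytes per parameter."""
    bytes_total = params_billion * 1e9 * (bits_per_param / 8)
    return bytes_total / 1e9  # decimal GB; close enough for capacity planning

# A 7B-parameter model as an example:
for bits in (16, 8, 4):
    print(f"7B model @ {bits}-bit: ~{estimate_weight_memory_gb(7, bits):.1f} GB of weights")
# ~14 GB at FP16, ~7 GB at 8-bit, ~3.5 GB at 4-bit -- before the KV cache,
# activations, and everything else on the machine compete for the same RAM.
```

Once you add the KV cache, activations, and the rest of the OS and applications, a 16 GB endpoint has very little headroom for anything beyond small, heavily quantized models.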
Unified memory is the quiet architecture shift utilities should watch
Unified memory architectures allow CPU, GPU, and NPU to access a shared pool of memory. Consumer devices already proved the pattern; now PC vendors are bringing similar designs into laptops and small form-factor systems.
Why this matters in utility environments:
- Bigger effective memory pools make it realistic to run stronger local models for:
  - procedure guidance
  - safety checklists and lockout/tagout validation
  - interpreting maintenance logs
  - summarizing inspections with photos and notes
- Less copying between subsystems improves battery life and reliability in field work.
- More predictable performance reduces operator frustration—critical when AI is embedded in workflows.
A practical procurement implication: when your team asks for “AI PCs,” don’t stop at NPU specs. Ask about:
- maximum supported memory (64 GB, 96 GB, 128 GB)
- whether memory is unified/shared
- whether memory is upgradeable or soldered (trade-off: serviceability vs performance)
Cloud vs edge vs on-device: a deployment model that fits utilities
Answer first: The winning pattern for utilities is a tiered AI architecture that places inference where it best meets latency, privacy, and cost constraints.
In this series, we talk a lot about how cloud providers optimize infrastructure. Utilities can borrow the same thinking—then apply it to their own hybrid reality.
A simple 3-tier model for utility AI workloads
- On-device (laptops, operator workstations):
  - Best for private, interactive tasks
  - Works offline or during degraded connectivity
  - Example: an engineer asking a local LLM to summarize a protection relay manual and match it to internal standards
- On-site edge (substation or plant gateways, rugged servers):
  - Best for near-real-time analytics and local compliance constraints
  - Example: anomaly detection on transformer DGA trends or vibration signals with results served locally to technicians
- Cloud/data center (central platforms):
  - Best for fleet-wide learning, model training, large-scale evaluation, and governance
  - Example: retraining predictive maintenance models across the full asset fleet and distributing updated weights
What should stay local?
If you’re deciding what to run locally, here’s a practical filter I’ve found works:
- Keep it local when data sensitivity is high (SCADA, customer PII, critical infrastructure details).
- Keep it local when latency matters (operator assist during events).
- Keep it local when availability must be independent of WAN links (storm response, remote regions).
- Use cloud when the job needs massive compute or benefits from fleet-wide aggregation.
This isn’t ideological. It’s operational engineering.
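If it helps to see the filter written down as policy rather than prose, here’s a minimal sketch; the workload attributes, thresholds, and tier names are illustrative placeholders, not a standard:

```python
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    data_sensitivity: str        # "high" | "medium" | "low"
    max_latency_ms: int          # what the workflow actually tolerates
    must_survive_wan_outage: bool
    needs_fleet_aggregation: bool

def place(w: Workload) -> str:
    """Apply the keep-it-local filter: sensitivity, latency, availability, then scale."""
    if w.data_sensitivity == "high" or w.must_survive_wan_outage:
        return "on-device" if w.max_latency_ms < 500 else "on-site edge"
    if w.max_latency_ms < 200:
        return "on-site edge"
    if w.needs_fleet_aggregation:
        return "cloud"
    return "on-site edge"

print(place(Workload("operator assist during storm", "high", 300, True, False)))  # on-device
print(place(Workload("fleet-wide model retraining", "low", 60000, False, True)))  # cloud
```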
Software is catching up: why Windows “local AI stacks” matter
Answer first: Standardized local AI runtimes make it easier to deploy models across heterogeneous hardware—and that’s the difference between a pilot and a program.
One reason utilities have struggled to operationalize edge AI is the integration tax: different hardware, different drivers, different runtimes, different security postures. Increasingly, major OS ecosystems are trying to standardize local inference so applications can target a single interface and let the runtime decide whether a CPU, GPU, or NPU should do the work.
For utilities, this has two concrete benefits:
- Broader deployment: you can roll out AI-enabled tools without requiring every endpoint to have the same discrete GPU.
- Better governance: standardized runtimes can centralize policy controls (model allowlists, logging, update channels), which matters for regulated environments.
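As a hedged, concrete example of what “target a single interface” can look like: ONNX Runtime lets an application list execution providers in priority order and fall back to CPU when an accelerator isn’t present. The provider names and model path below are assumptions that depend on your hardware, drivers, and runtime build.

```python
import onnxruntime as ort

# Ask the runtime what acceleration is available on this endpoint.
available = ort.get_available_providers()
print("Available providers:", available)

# Prefer an NPU/GPU provider if present, otherwise fall back to CPU.
# Provider names vary by vendor and build (e.g., QNN for some NPUs, DirectML on Windows GPUs).
preferred = [p for p in ("QNNExecutionProvider", "DmlExecutionProvider") if p in available]
providers = preferred + ["CPUExecutionProvider"]

# "model.onnx" is a placeholder for whatever approved model your tooling distributes.
session = ort.InferenceSession("model.onnx", providers=providers)
print("Running on:", session.get_providers())
```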
Retrieval-augmented generation (RAG) belongs on-site
RAG is what makes an LLM useful in enterprise settings: it answers using your documents, procedures, asset history, and knowledge base.
Utilities should treat RAG architecture as a security and performance design, not just a data science feature:
- Put the document store and embeddings close to the users who need them (control center or regional ops).
- Keep sensitive retrieval inside your network boundary.
- Allow on-device inference for the conversational layer when feasible.
A strong pattern is: local inference + on-site retrieval + cloud governance.
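Here’s a minimal sketch of that pattern, assuming a locally hosted embedding model and a local inference endpoint; the document snippets, endpoint URL, and payload format are placeholders meant to show the shape of the flow, not a reference implementation.

```python
import numpy as np
import requests
from sentence_transformers import SentenceTransformer  # runs fully offline once downloaded

# 1) Embed internal documents on-site; nothing leaves the network boundary.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
docs = [  # illustrative snippets standing in for your procedures and standards
    "Switching procedure SP-114: verify isolation points before grounding.",
    "Relay standard RS-7: coordinate time dials per protection scheme B.",
]
doc_vectors = embedder.encode(docs, normalize_embeddings=True)

# 2) Retrieve the most relevant passage for a question.
question = "What must be verified before grounding?"
q_vec = embedder.encode([question], normalize_embeddings=True)[0]
context = docs[int(np.argmax(doc_vectors @ q_vec))]

# 3) Send question + retrieved context to a local inference endpoint.
# The URL and payload are placeholders; adapt them to whatever local runtime
# you standardize on -- the pattern, not the endpoint, is the point.
resp = requests.post(
    "http://localhost:8080/v1/completions",
    json={"prompt": f"Context: {context}\n\nQuestion: {question}\nAnswer:", "max_tokens": 128},
    timeout=30,
)
print(resp.json())
```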
What energy and utility teams should do in 2026 procurement and architecture
Answer first: Treat AI readiness as an infrastructure roadmap: endpoints, edge, and data center planning should align around measurable targets.
Here’s a practical checklist you can put into budgeting and vendor discussions for 2026.
1) Set “AI endpoint tiers” (and stop buying one-size-fits-all laptops)
Create 2–3 hardware profiles:
- Standard user: normal productivity
- AI-enabled field/ops: NPU-capable, higher RAM, rugged options
- AI power user: unified memory designs or workstation-class systems for local inference and multimodal workflows
Then map roles to tiers (operators, protection engineers, asset health teams, planners).
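One lightweight way to keep that mapping auditable is to store it as data that both procurement and endpoint-management tooling can read. The tier names and specs below are illustrative, not a recommendation:

```python
# Illustrative endpoint tiers and role mapping -- adjust specs to your own standards.
ENDPOINT_TIERS = {
    "standard":      {"min_ram_gb": 16, "npu_required": False, "unified_memory": False},
    "ai_field_ops":  {"min_ram_gb": 32, "npu_required": True,  "unified_memory": False, "rugged": True},
    "ai_power_user": {"min_ram_gb": 64, "npu_required": True,  "unified_memory": True},
}

ROLE_TO_TIER = {
    "customer_service":     "standard",
    "field_technician":     "ai_field_ops",
    "operator":             "ai_field_ops",
    "protection_engineer":  "ai_power_user",
    "asset_health_analyst": "ai_power_user",
}

def tier_for(role: str) -> dict:
    """Resolve the hardware profile a given role should be issued."""
    return ENDPOINT_TIERS[ROLE_TO_TIER[role]]

print(tier_for("protection_engineer"))
# {'min_ram_gb': 64, 'npu_required': True, 'unified_memory': True}
```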
2) Define your local inference use cases before you pick models
Utilities often start with “which model?” Better sequence:
- Where must AI run when the WAN is down?
- Which workflows touch sensitive data that can’t leave the premises?
- What response time does the workflow actually require?
Then select model sizes, quantization approaches, and deployment targets.
3) Treat memory as a first-class requirement
If your AI program is serious, bake these questions into standards:
- What’s the minimum RAM for AI-enabled endpoints (32 GB? 64 GB)?
- Do you require unified memory for certain roles?
- What’s your policy on non-upgradable memory (serviceability vs performance)?
4) Plan for the “model supply chain”
Running models locally doesn’t eliminate governance—it raises the stakes.
You’ll need:
- model approval workflows
- signed artifacts
- controlled distribution
- rollback plans
- monitoring for model drift and misuse
This is where your data center and cloud platform still matter: they become the command center for an increasingly distributed AI fleet.
Local AI turns endpoints into a fleet. Fleets require operations.
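As a small illustration of what “signed artifacts and controlled distribution” can mean at the endpoint, the sketch below checks a model file against a centrally distributed allowlist before the runtime is allowed to load it. This is hash-based allowlisting only; a production setup would layer on cryptographic signatures and key management, and the file paths and manifest format here are assumptions.

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash the model artifact in chunks so large files don't exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def is_approved(model_path: Path, manifest_path: Path) -> bool:
    """Check the artifact against a centrally distributed allowlist manifest."""
    manifest = json.loads(manifest_path.read_text())  # e.g. {"models": {"llm-7b-q4.onnx": "<sha256>"}}
    expected = manifest.get("models", {}).get(model_path.name)
    return expected is not None and expected == sha256_of(model_path)

# Paths are placeholders for wherever your distribution tooling stages artifacts.
if not is_approved(Path("models/llm-7b-q4.onnx"), Path("models/approved_manifest.json")):
    raise RuntimeError("Model artifact is not on the approved manifest; refusing to load.")
```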
Where this goes next for cloud computing & data centers
Local LLMs won’t replace cloud AI. They’ll change what cloud is for.
Cloud and data center teams will spend less time serving every token for every interaction and more time on:
- training and fine-tuning
- evaluation and red-teaming
- distributing policy-compliant models
- optimizing workload placement across hybrid environments
- reducing total energy cost per insight (which utilities care about more than most industries)
The laptop market’s shift toward NPUs and unified memory is a preview of a broader infrastructure reality: AI is becoming a built-in workload class, like databases or virtualization. For energy and utilities, that’s good news—because the most valuable AI is often the AI that keeps working during the messy, real-world moments.
If you’re planning grid modernization or predictive maintenance investments for 2026, here’s the forward-looking question worth debating internally: Which decisions must your AI support when connectivity is worst—and what hardware will be on hand when that happens?