AgentCore Policy and Evaluations help teams deploy trusted AI agents with enforceable controls and CloudWatch quality metrics—built for production governance.

AgentCore Policy & Evaluations: Trusted AI Agents
Most teams don’t fail to ship AI agents because the model is “not smart enough.” They fail because they can’t trust what the agent will do once it’s connected to real tools, real data, and real customers.
That trust gap gets expensive fast in cloud environments: an overly permissioned agent can trigger risky actions, pull sensitive data, or spin up workloads you didn’t intend, wasting compute and creating audit nightmares. In December 2025, AWS added two preview capabilities to Amazon Bedrock AgentCore that squarely target this problem: AgentCore Policy (to control what agents are allowed to do) and AgentCore Evaluations (to measure whether agents are doing the right thing).
This post is part of our AI in Cloud Computing & Data Centers series, where we track how AI moves from demos into infrastructure. AgentCore’s updates are a strong signal of where production AI is heading: policy-driven governance + continuous quality telemetry, treated as first-class cloud primitives.
Why “trusted AI agents” is really an infrastructure problem
Trusted AI agents aren’t just an app-layer concern. They’re an infrastructure governance concern.
Once an agent can call APIs, run code, query internal systems, and take action, it starts to look like a new kind of workload—one that makes decisions. That changes what “resource governance” means in the cloud:
- Security governance: What data/tools can the agent access, and under what conditions?
- Operational governance: How do you audit every tool call and decision path?
- Cost governance: How do you prevent accidental workload explosions (loops, excessive tool calls, unnecessary data pulls)?
- Quality governance: How do you detect drift in correctness, safety, or goal completion while users are interacting with it?
AWS is effectively saying: if you want agents in production, you need the same discipline you apply to microservices—policy enforcement, observability, and SLOs—but tuned for agent behavior.
AgentCore Policy: guardrails that sit outside the agent’s “brain”
Answer first: AgentCore Policy is valuable because it enforces permissions at the gateway, independent of the model or framework, so you can verify actions before they touch tools, data, or infrastructure.
A common anti-pattern in agent deployments is relying on prompt instructions like “don’t do refunds over $200” or “never access payroll data.” That’s not governance. That’s wishful thinking.
AgentCore Policy works differently: it intercepts tool calls in AgentCore Gateway before execution. So even if the agent decides to do something, the platform can still say “deny” based on explicit rules.
Why this matters for cloud security and data center operations
When agents become common, they’ll be integrated into operational workflows: ticket triage, incident response, capacity planning, FinOps recommendations, and internal self-service.
Policy controls align with resource governance because they can prevent:
- Unauthorized actions: e.g., calling a “terminate instance” tool without the right role.
- Sensitive data access: e.g., querying PII tables or finance systems.
- High-cost execution paths: e.g., restricting expensive tools or large queries unless an incident severity is high.
In practice, that’s how you keep agents from turning your cloud into a slot machine.
Natural language policy authoring (and why I’m cautiously optimistic)
AWS allows policies to be authored in natural language or directly in Cedar (an open-source policy language). Natural language authoring is interesting because it broadens who can participate:
- Security teams can review intent in plain English.
- Compliance can audit policy logic without reading application code.
- Developers can iterate faster during integration.
The critical detail: policy generation isn’t a generic “have an LLM translate English to code” step. It’s schema-aware (it understands your tool definitions) and includes automated checks to catch policies that are impossible, overly permissive, or overly restrictive.
That said, production teams should still treat generated policies like generated infrastructure-as-code: review, test, version, and roll out gradually.
A practical way to roll Policy into production
AgentCore supports a “log-only” mode before enforcement. Use it. Here’s a rollout approach I’ve found works across regulated teams:
- Start with log-only policies for the top 5 highest-risk tools (payments, user admin, data exports, infrastructure changes, code execution).
- Review denied/allowed decisions weekly. Look for surprising tool usage patterns.
- Add conditions tied to identity (OAuth scopes, JWT claims, roles) so rules map to your org structure.
- Graduate to enforcement only after you’ve observed real traffic.
- Create a policy change pipeline (approvals, staging gateway, automated tests, and audit logs).
Treat agent tool permissions the way you treat IAM: least privilege, gradual expansion, and constant review.
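For the weekly review step above, even a throwaway script over exported decision logs is enough to spot surprises. Here’s a minimal sketch, assuming a line-delimited JSON export with tool, decision, and principal fields; AgentCore’s actual audit format may differ, so adapt the field names to whatever you export:

```python
import json
from collections import Counter

# Hypothetical exported policy-decision log: one JSON object per line with
# "tool", "decision" ("allow"/"deny"), and "principal". Field names are
# placeholders; map them to your real export schema.
def summarize_decisions(path: str) -> None:
    by_tool = Counter()
    denies = Counter()
    with open(path) as fh:
        for line in fh:
            event = json.loads(line)
            by_tool[event["tool"]] += 1
            if event["decision"] == "deny":
                denies[(event["tool"], event["principal"])] += 1

    print("Tool call volume:", by_tool.most_common(10))
    print("Top denied (tool, principal) pairs:", denies.most_common(10))

# Run this weekly against the prior week's export to spot surprising tool usage
# before flipping policies from log-only to enforce.
```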
AgentCore Evaluations: quality telemetry you can alert on
Answer first: AgentCore Evaluations turns agent quality into measurable signals—like correctness, safety, and tool accuracy—published in CloudWatch so teams can set thresholds, alarms, and operational playbooks.
Teams often do a pre-launch evaluation, ship, and then rely on user complaints as monitoring. That’s backwards. Agent quality is dynamic:
- Knowledge and policies change.
- Tools evolve.
- User behavior shifts (especially seasonally).
- Models get swapped or tuned.
AgentCore Evaluations is designed for continuous evaluation based on real-world behavior, not just test suites.
What gets measured (and what you should actually care about)
Built-in evaluators include dimensions such as:
- Correctness and helpfulness
- Faithfulness (is the response supported by context)
- Safety (harmful content, stereotyping)
- Goal success rate
- Tool selection accuracy and tool parameter accuracy
For cloud and data center workloads, the tool-related metrics are the sleeper hit. An agent that “talks well” but picks the wrong tool can still cause outages or run up costs.
A pragmatic metric set for infrastructure agents (incident, ops, FinOps) looks like:
- Tool selection accuracy
- Parameter accuracy (resource IDs, regions, account boundaries)
- Correctness (for factual operational guidance)
- Context relevance (especially with runbooks and KBs)
- Goal success rate (did it actually resolve/complete the workflow)
CloudWatch integration: where this becomes operational
Publishing evaluation scores into CloudWatch matters because it fits existing operational muscle memory:
- Dashboards alongside latency, errors, and tool-call traces
- Alarms when quality drops below thresholds
- Automations (ticket creation, rollback, canary fail, model switch)
If you already run SRE practices, this is the missing piece for agentic systems: quality becomes an observable, alertable signal.
Here’s an example of an SLO-style threshold set for agent quality:
- Correctness score must stay ≥ 0.85 for the last 2 hours
- Tool parameter accuracy must stay ≥ 0.90
- Harmfulness must stay ≤ 0.02
- If breached: trigger an incident, route traffic to a safer fallback workflow, and pin the last known-good agent version
That’s how you keep “AI drift” from becoming “production incident.”
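To wire the first two thresholds above into something that actually pages someone, a standard CloudWatch alarm works. Here’s a minimal boto3 sketch for the correctness threshold; the namespace, metric name, dimension, and SNS topic are placeholders, so substitute whatever AgentCore Evaluations publishes for your agent:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm if average correctness drops below 0.85 over a rolling 2-hour window
# (eight 15-minute periods).
cloudwatch.put_metric_alarm(
    AlarmName="agent-correctness-slo-breach",
    Namespace="Custom/AgentEvaluations",    # placeholder, not the real namespace
    MetricName="CorrectnessScore",          # placeholder metric name
    Dimensions=[{"Name": "AgentId", "Value": "ops-incident-agent"}],  # placeholder
    Statistic="Average",
    Period=900,                             # 15-minute buckets
    EvaluationPeriods=8,                    # 8 x 15 min = 2 hours
    Threshold=0.85,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",           # missing eval data is itself a problem
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:agent-quality-incidents"],
    AlarmDescription="Correctness SLO breach: page on-call, pin last known-good agent version",
)
```

The same pattern covers tool parameter accuracy and harmfulness; the alarm action is where you attach the incident, fallback, and version-pinning automation.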
Custom evaluators: where leads and business value show up
AgentCore supports custom evaluators using a model-as-judge approach (your choice of model and prompt), scored per trace, session, or tool call.
For lead-focused enterprise teams, custom evaluators can measure outcomes that correlate with revenue and operational efficiency, like:
- First-contact resolution for support agents
- Policy compliance (did it follow required disclosures or escalation rules)
- Cost-aware behavior (did it choose a cheaper tool path when appropriate)
- Data minimization (did it avoid pulling more data than needed)
This is where agent evaluation stops being “AI quality theater” and becomes governance you can defend in a security review.
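The exact evaluator configuration is AgentCore’s job; the pattern underneath is simple enough to sketch. Here’s a minimal model-as-judge example for the policy-compliance case, assuming the Bedrock Converse API with a placeholder model ID, rubric, and score scale:

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

RUBRIC = (
    "You are grading an AI support agent. Score from 0.0 to 1.0 how well the "
    "agent followed required disclosure and escalation rules. Reply with JSON: "
    '{"score": <float>, "reason": "<short explanation>"}'
)

def judge_trace(trace_text: str,
                model_id: str = "anthropic.claude-3-haiku-20240307-v1:0") -> dict:
    """Score one agent trace with a model-as-judge prompt (illustrative only)."""
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": f"{RUBRIC}\n\nTRACE:\n{trace_text}"}]}],
        inferenceConfig={"maxTokens": 300, "temperature": 0.0},
    )
    raw = response["output"]["message"]["content"][0]["text"]
    return json.loads(raw)  # in production, validate and guard this parse

# Example: score a single trace, then push the result wherever your evaluation
# pipeline (or a CloudWatch custom metric) expects it.
# result = judge_trace("User asked for a refund; agent skipped the disclosure step...")
```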
Episodic memory and bidirectional streaming: capability expands, risk expands too
Answer first: Episodic memory and bidirectional voice streaming increase agent capability and user experience—but they also increase the need for policy controls and continuous evaluation.
AWS also announced:
- An episodic memory capability in AgentCore Memory (long-term learning from prior interactions)
- Bidirectional streaming in AgentCore Runtime (more natural voice interactions with interruptions)
These features are great for user experience and productivity, but they raise governance questions.
Episodic memory: operational gold if you control it
Episodic memory can reduce repetitive instruction overhead by capturing patterns, outcomes, and “what worked last time.” For ops and cloud teams, that could mean:
- Faster remediation suggestions based on past incidents
- Consistent handling of recurring capacity events
- Better continuity during long-running tickets
But long-term memory is also where sensitive information can accidentally persist. If you enable it, be explicit about:
- What categories of data can be stored
- Retention periods and deletion workflows
- Redaction rules (PII, credentials, customer identifiers)
- Access boundaries across tenants, accounts, and environments
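Redaction is the easiest of those to prototype. A managed PII detector (Amazon Comprehend, for example) is the sturdier long-term answer, but even a crude pre-write filter forces the team to state what memory may retain. Here’s a minimal sketch with assumed patterns and a hypothetical hook name:

```python
import re

# Very rough patterns for illustration only; real deployments should use a
# managed PII detector plus a reviewed allowlist of what memory may retain.
REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[CARD_NUMBER]"),
    (re.compile(r"(?i)\b(api[_-]?key|secret|password)\s*[:=]\s*\S+"), r"\1=[REDACTED]"),
]

def redact_before_memory_write(text: str) -> str:
    """Apply redaction rules before anything is persisted to long-term memory."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text

# Example:
# redact_before_memory_write("Customer jane@example.com paid with 4111 1111 1111 1111")
# -> "Customer [EMAIL] paid with [CARD_NUMBER]"
```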
Bidirectional voice streaming: great for NOC workflows
Bidirectional streaming enables interruption-friendly voice agents—useful in network operations centers (NOCs) or on-call scenarios where speed matters.
If you deploy voice agents for operations, combine:
- Policy enforcement on high-risk tools (infrastructure changes)
- Evaluation alarms on safety/correctness
- Human confirmation steps for destructive actions
Voice feels informal. The backend actions aren’t.
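For the human-confirmation point, one cheap pattern is to gate destructive tool calls behind an explicit approval step regardless of how the request arrived. Here’s a minimal sketch with hypothetical tool names and a confirm callback you’d wire to your own UX:

```python
from typing import Callable

# Tools the agent may propose but never execute without explicit human sign-off.
# These names are placeholders for your own tool registry.
DESTRUCTIVE_TOOLS = {"terminate_instance", "delete_snapshot", "drop_table"}

def execute_tool(tool_name: str, args: dict,
                 run_tool: Callable[[str, dict], object],
                 confirm: Callable[[str], bool]) -> object:
    """Run a tool call, requiring human confirmation for destructive actions."""
    if tool_name in DESTRUCTIVE_TOOLS:
        summary = f"Agent wants to run {tool_name} with {args}. Approve?"
        if not confirm(summary):
            raise PermissionError(f"Human rejected destructive action: {tool_name}")
    return run_tool(tool_name, args)
```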
Real-world traction: what the early numbers suggest
AWS shared adoption signals that match what I’m seeing in the market: high experimentation, now shifting toward production hardening.
- The AgentCore SDK reached 2 million downloads in 5 months (since preview).
- PGA TOUR reported 1,000% faster content writing and a 95% cost reduction in a multi-agent content system.
- Workday’s Planning Agent reduced routine planning analysis time by 30% (about 100 hours/month saved).
- Grupo Elfa achieved 100% traceability of agent decisions and cut problem resolution time by 50%.
Those outcomes share a theme: the winners aren’t just “using AI.” They’re operationalizing it with traceability, controls, and measurable performance.
A production checklist for trusted AI agents in cloud environments
Answer first: If you want agents that are safe, cost-aware, and auditable, implement policy enforcement + continuous evaluations + staged rollout, just like any other critical workload.
Use this as a starting checklist:
- Classify tools by risk (read-only, write, destructive, financial, sensitive data).
- Enforce least-privilege tool access at the gateway (not in prompts).
- Start log-only, then enforce once you’ve observed real tool usage.
- Define agent SLOs (correctness, tool accuracy, goal success, safety) and alert on them.
- Instrument cost controls: rate limits, tool budgets per session, and escalation paths (see the budget sketch after this checklist).
- Version everything: prompts, policies, tools, and evaluators; roll out with canaries.
- Auditability by default: keep traces that link user input → reasoning → tool calls → outputs.
If that list feels like “a lot,” good. Agents are operational software, not a chat widget.
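The cost-controls item is worth one concrete sketch. This is not an AgentCore feature; it’s a minimal per-session budget guard you’d wrap around your own tool dispatcher, with limits and escalation behavior as assumptions to tune per workload:

```python
class SessionToolBudget:
    """Track tool calls and estimated spend for one agent session (illustrative)."""

    def __init__(self, max_calls: int = 50, max_cost_usd: float = 2.00):
        self.max_calls = max_calls
        self.max_cost_usd = max_cost_usd
        self.calls = 0
        self.cost_usd = 0.0

    def charge(self, estimated_cost_usd: float) -> None:
        """Record one tool call; raise so the caller can escalate instead of looping."""
        self.calls += 1
        self.cost_usd += estimated_cost_usd
        if self.calls > self.max_calls or self.cost_usd > self.max_cost_usd:
            raise RuntimeError(
                f"Session budget exceeded: {self.calls} calls, ${self.cost_usd:.2f} spent"
            )

# Example: call budget.charge(0.01) before each tool invocation; catch the error
# to hand off to a human or a cheaper fallback workflow.
```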
Where this is heading for AI in cloud computing & data centers
Trusted AI agents are becoming a new layer of cloud infrastructure—one that must be governed like identity, networking, and compute.
AgentCore’s direction (policy at the gateway + evaluations in CloudWatch) is exactly what production teams need: controls you can audit and metrics you can operate.
If you’re building agents for cloud operations, FinOps, or internal platforms, the next step is straightforward: define what your agent is allowed to do, measure what it’s actually doing, and automate the response when it drifts.
What would it change in your environment if agent quality and agent permissions were treated as seriously as service latency and IAM access?