Red teaming AI at scale takes human expertise plus AI-driven testing. Learn a practical playbook to secure tool-using models in U.S. digital services.

Red Teaming AI: People + Models for Safer Systems
Most organizations treat AI security like a checkbox: run a few prompt tests, add a policy banner, call it “responsible.” That’s how you end up with systems that look polished in demos but behave badly under pressure—exactly the pressure that shows up in defense, national security, and critical digital services.
Advancing red teaming with people and AI is really about a simple shift: stress-testing models has to scale as fast as models do. And the only way to scale that kind of adversarial testing is human expertise paired with AI-assisted red teaming—a setup where people provide creativity, intent, and judgment while models help generate coverage, variation, and speed.
This post is part of our AI in Defense & National Security series, where reliability is the product. If you’re deploying AI in the United States—whether for customer communications, cybersecurity operations, fraud prevention, or mission support—red teaming isn’t a nice-to-have. It’s how you keep AI useful when someone is trying to make it fail.
Why AI red teaming matters more in U.S. digital services
AI red teaming is the practice of intentionally trying to break an AI system—to expose failures in safety, security, privacy, and policy compliance before real adversaries do.
That matters across U.S. digital services because AI is now embedded in workflows that directly affect people: benefits eligibility, financial disputes, healthcare scheduling, emergency call triage, identity verification, and security monitoring. In defense and national security contexts, the stakes rise again: model misuse, prompt injection, data leakage, and tool abuse can create operational risk, not just reputational risk.
Here’s the hard truth I’ve seen play out: if your AI system can take actions—send messages, query tools, summarize sensitive documents—then someone will try to steer it. Sometimes that “someone” is a malicious actor. Sometimes it’s an ordinary user with a weird edge case. Either way, the outcome is the same: you need a disciplined way to find failures.
The modern threat model: it’s not only “bad outputs”
Early AI safety conversations fixated on toxic language or misinformation. That’s still relevant, but it’s not the whole board anymore. For AI-powered digital services, red teaming has expanded into:
- Prompt injection: users embed instructions that override system rules (especially in tool-using agents).
- Data exfiltration: attempts to extract secrets from context windows, logs, connectors, or memory.
- Policy evasion: rephrasing requests until safeguards break.
- Tool misuse: steering the model into calling actions it shouldn’t (refunds, account changes, record pulls).
- Jailbreaks at scale: scripted attack patterns that generate thousands of variants.
In national security and defense-adjacent environments, add:
- Operational security failures (leaking sensitive locations, identifiers, or plans)
- Social engineering amplification (more persuasive phishing or impersonation)
- Model inversion and training data inference risks, depending on architecture
Red teaming is how you turn these from abstract fears into concrete test cases and measurable controls.
The secret ingredient: human judgment, AI scale
The best red teaming is collaborative: humans define what “harm” looks like, while AI helps generate breadth and repeatability.
Humans are still better at:
- Recognizing novel abuse patterns
- Understanding real-world context (what’s sensitive, what’s operationally risky)
- Prioritizing findings (what will actually hurt you in production)
AI can be better at:
- Creating thousands of test variations quickly
- Simulating diverse user behavior (tone, language, intent)
- Turning failure patterns into regression tests you can run every release
If you’re building or buying AI services, this is the maturity marker to look for: Does the team have a red teaming program that combines expert testers with automated adversarial generation and continuous evaluation? If not, they’re relying on luck.
What “AI-assisted red teaming” looks like in practice
A practical approach usually includes three loops:
-
Human-led discovery
- Experts probe the system (including tool calls and integrations) to find real failure modes.
- They document “attack stories,” not just isolated prompts.
-
Model-generated expansion
- Once you have a failure pattern, a model generates many paraphrases, languages, and indirect variants.
- You build a test suite that reflects how adversaries iterate.
-
Automated regression and monitoring
- Run the suite against every model update, prompt change, policy update, or tool addition.
- Track metrics like refusal quality, sensitive data exposure, and tool-call constraint violations.
This is the bridge point to the broader campaign: AI is powering U.S. digital services—and red teaming is how those services stay reliable as they scale.
A red teaming playbook for defense and national security teams
Answer first: If your AI system touches sensitive data or actions, you need a red team plan that tests capabilities, not just content.
Below is a field-tested playbook that fits defense and national security use cases (and works just as well for regulated industries).
1) Start with asset mapping, not prompts
Red teaming fails when it’s treated like clever prompt writing. Start by mapping assets:
- What data can the model see? (tickets, emails, case notes, classified/controlled info)
- What tools can it call? (
search,send_email,create_case,update_record,run_query) - What identity context does it inherit? (roles, group membership, permissions)
- What logs are stored—and who can access them?
A model that can’t access anything sensitive is mostly a reputational risk. A model that can access sensitive systems is an operational risk.
2) Threat-model the workflows, not the model
The model is rarely the single point of failure. The workflow is.
Example workflow risks:
- SOC copilot: model summarizes alerts → analyst trusts summary → misses a lateral movement indicator.
- Intel triage assistant: model ranks reports → bias or prompt injection skews priority.
- Citizen service chatbot: model accesses case status → attacker tricks it into revealing personal data.
In each case, you should test end-to-end: inputs, retrieval, tool calls, and outputs.
3) Build a “harm taxonomy” your org will actually use
A harm taxonomy is a shared language for what you’re testing. Keep it practical.
A solid starting taxonomy for AI in defense & national security:
- Confidentiality harms: PII exposure, sensitive operational detail leakage, connector exfiltration
- Integrity harms: incorrect action execution, tool-call manipulation, record tampering
- Availability harms: overload, denial-of-service via token abuse, workflow dead-ends
- Compliance harms: policy violations, retention violations, audit gaps
- Human harms: harassment, coercion, unsafe guidance
Make it measurable by pairing each harm type with:
- A severity level (1–5)
- A detection method (manual review, automated classifier, rule-based checks)
- A mitigation owner (product, security, legal, ops)
4) Test the two hardest things: tool use and retrieval
Tool-using agents and retrieval-augmented generation (RAG) systems are where red teaming pays off fastest. They’re also where teams get surprised.
High-value tests include:
- Prompt injection in retrieved documents (malicious text hidden in a PDF or ticket)
- Cross-tenant data leakage (one customer’s data retrieved for another)
- Privilege escalation (model uses a tool with broader permissions than the user)
- Action confirmation bypass (model executes without required human approval)
If you remember one line: RAG expands the attack surface from “what users type” to “anything the system can read.”
5) Turn findings into controls, not just fixes
A good red team report doesn’t end with “here’s the prompt that broke it.” It ends with durable controls.
Controls that consistently reduce risk:
- Least-privilege tool scopes (model can’t do more than the user)
- Allowlisted tool actions with strict schemas and validation
- Human-in-the-loop gates for sensitive actions (payments, deletions, disclosures)
- Content filtering plus intent checks (don’t rely on one classifier)
- Structured outputs for critical workflows (reduce ambiguity)
- Audit logs that capture tool calls and decision context
This is where AI maturity shows up: you’re building systems that assume adversarial pressure.
Measuring red teaming: what to track (and what to ignore)
Answer first: Track metrics that reflect real risk: data exposure, unauthorized actions, and repeatable failures across releases.
Vanity metrics—like “number of prompts tested”—don’t tell you much. What you want is evidence that the system is getting harder to break.
Practical metrics for AI security teams:
- Escape rate: % of attempts that bypass refusal or policy controls
- Sensitive data leakage rate: % of tests that output restricted data (or parts of it)
- Unauthorized tool-call rate: tool calls made outside user permission scope
- Time-to-mitigate: days from finding to deployed fix/control
- Regression rate: previously fixed failures that reappear after updates
For defense and national security use cases, add:
- Classification handling accuracy (does the assistant respect markings and compartments?)
- Source attribution quality (does it distinguish unknown vs verified?)
- Operator override frequency (how often humans have to correct it)
If you can’t measure it, you can’t defend it in an audit—or in a real incident.
What business leaders should do before deploying AI copilots
Answer first: Treat AI red teaming like pre-production testing for a system that can be attacked, not a one-time risk review.
If you’re a CTO, CISO, product leader, or program manager rolling out AI in U.S. digital services, here’s a checklist that won’t waste your time:
- Ask where the model can take actions. If it can change anything, require tool-call constraints and logging.
- Require a red team test plan for RAG. Specifically: prompt injection, cross-tenant leakage, and retrieval boundary tests.
- Demand regression tests. A red team finding that can’t be rerun automatically will come back.
- Separate “demo mode” from “production mode.” Production needs stricter permissions, monitoring, and escalation.
- Plan for an AI incident. Define who can disable tools, rotate keys, revoke connectors, and communicate updates.
This connects directly to the campaign’s point: AI is driving growth in U.S. technology and digital services, but reliability is what keeps that growth. Red teaming is how you earn reliability.
Snippet-worthy stance: If your AI product can call tools, your security posture is only as strong as your tool governance.
The direction things are heading in 2026
Holiday week is when a lot of teams quietly ship changes—model upgrades, workflow tweaks, new integrations—because traffic is lower. It’s also when issues get missed because staffing is thin. That’s a good reminder of where AI security is going next year: continuous evaluation, not periodic review.
Expect more organizations to standardize on:
- Always-on red teaming pipelines that run nightly or on every deployment
- Scenario-based testing (multi-step attacks, not single prompts)
- Human-AI collaboration where AI generates variants and humans focus on novel risk
- Stricter procurement requirements for AI vendors (test evidence, audit logs, mitigations)
In defense and national security work, this will become table stakes because the adversary iterates faster than quarterly security reviews.
Most companies get this wrong at first. The fix isn’t complicated: treat red teaming as a product capability. Staff it, automate it, measure it, and keep it running.
Where do you want to be a year from now—arguing that your AI system is safe, or showing the test results that prove it?