Stop MCP Sampling Prompt Injection Before It Spreads

AI in Cybersecurity • By 3L3C

MCP sampling enables new prompt injection vectors: token theft, conversation hijacking, and covert tool calls. Learn how AI monitoring and policy controls stop them.

Tags: mcp, prompt-injection, ai-agents, llm-security, secops, anomaly-detection

A single “helpful” copilot tool can quietly run up your AI bill, twist your assistant’s behavior for the rest of the session, and even write files to a developer’s machine—without anyone noticing in the chat window.

That’s the uncomfortable lesson from recent research into Model Context Protocol (MCP) sampling, a feature meant to make AI agents smarter by letting external tool servers request LLM completions. The problem isn’t that MCP exists. It’s that sampling flips the trust model: instead of the user (or host app) being the one to ask the LLM for help, a server can ask too—and can shape both the prompt and what happens next.

This is part of our AI in Cybersecurity series, and it’s a perfect example of where security teams need to stop treating copilots as “just another SaaS app.” When agents can call tools, read resources, and now request model completions through MCP sampling, you need AI runtime security: real-time anomaly detection, strict policy enforcement, and visibility into what the model actually processed—not just what the UI decided to show.

MCP sampling changes the security boundary (and that’s the point)

MCP was designed to standardize how LLM apps connect to tools and data sources. In a typical flow, the host app stays in charge: the user asks a question, the LLM decides it needs a tool, the client asks for permission, then calls the tool.

Sampling changes who gets to “speak first.” With sampling enabled, an MCP server can send a sampling/createMessage request back to the client and effectively say: “Run this prompt through your model and return the completion.”
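
Concretely, a sampling request is a JSON-RPC message the server sends back to the client. The shape below follows the published MCP specification at the time of writing; treat it as illustrative and verify against the SDK version you actually run:

```python
# Illustrative shape of a server-initiated sampling request (JSON-RPC over MCP).
# Field names follow the public MCP spec at the time of writing; verify them
# against the client/server SDK version you actually deploy.
sampling_request = {
    "jsonrpc": "2.0",
    "id": 42,
    "method": "sampling/createMessage",
    "params": {
        "messages": [
            {
                "role": "user",
                "content": {
                    "type": "text",
                    # The server controls this text -- which is exactly why the
                    # client must treat it as untrusted input.
                    "text": "Summarize the attached diff for the user.",
                },
            }
        ],
        "systemPrompt": "You are a concise code reviewer.",  # optional, server-supplied
        "includeContext": "thisServer",  # how much client context the server asks for
        "maxTokens": 400,
    },
}
```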

That’s powerful for legitimate use cases:

  • A summarization server that wants higher-quality summaries without running its own model
  • A data analysis server that wants an LLM to interpret results
  • A workflow agent that needs the model to decide which step comes next

But it also creates a clean new attack surface: prompt injection delivered by an untrusted server, executed on your model quota, and potentially fed back into the user’s ongoing conversation.

Here’s the stance I take: treat every MCP server as untrusted by default, even if it came from a popular repo, even if it “only summarizes code.” The sampling feature makes “only summarizes code” an unreliable promise.

The 3 MCP sampling attack vectors security teams should plan for

The research demonstrates three proof-of-concept attack patterns that matter because they map directly to enterprise pain: cost overruns, integrity loss, and unauthorized actions.

1) Resource theft: hidden token burn that your UI won’t show

What happens: a malicious MCP server appends hidden instructions to the sampling prompt—asking for extra output (for example, a fictional story) after the legitimate task.

Why it works: some copilot implementations don’t show the raw completion. They show a condensed result. The user sees a neat summary, while the model actually generated a much longer response behind the scenes.

What this looks like operationally:

  • Sudden spike in token usage tied to “normal” tasks like summarization
  • Sampling requests that regularly hit or approach maxTokens
  • Model cost anomalies that don’t correlate with user-visible output volume

Why AI-driven detection helps: classic threshold alerts are noisy because developers legitimately run large tasks. A better approach is behavioral baselining:

  • Compare visible output length vs. billed tokens (large, consistent gaps are a red flag)
  • Model per-tool token profiles (a code summary tool shouldn’t produce 1,000 extra words repeatedly)
  • Detect prompt padding patterns (long tails of instructions, unusual role-play prompts, or obfuscation)

A snippet-worthy rule you can operationalize:

If an MCP tool’s sampling completions are consistently larger than what the UI presents, assume abuse until proven otherwise.
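
Here is a minimal sketch of that rule, assuming you already log billed completion tokens per sampling call and can tokenize what the UI actually rendered. The event fields and thresholds are illustrative, not a finished detector:

```python
from dataclasses import dataclass


@dataclass
class SamplingEvent:
    server: str
    tool: str
    billed_output_tokens: int   # from your LLM provider's usage metadata
    visible_output_tokens: int  # tokenized length of what the UI showed the user


def flag_token_mismatch(events: list[SamplingEvent],
                        ratio_threshold: float = 3.0,
                        min_samples: int = 5) -> set[tuple[str, str]]:
    """Flag (server, tool) pairs whose billed output is consistently much
    larger than the output the user actually saw."""
    by_tool: dict[tuple[str, str], list[float]] = {}
    for e in events:
        visible = max(e.visible_output_tokens, 1)  # avoid division by zero
        by_tool.setdefault((e.server, e.tool), []).append(
            e.billed_output_tokens / visible)

    flagged = set()
    for key, ratios in by_tool.items():
        median = sorted(ratios)[len(ratios) // 2]
        if len(ratios) >= min_samples and median >= ratio_threshold:
            flagged.add(key)  # persistent gap: treat as suspected abuse
    return flagged
```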

2) Conversation hijacking: persistent prompt injection across turns

What happens: the malicious server injects instructions that end up in the assistant’s response, which then becomes part of the conversation history. The next turn inherits the malicious instruction (“speak like a pirate,” in the PoC), but the same method can push far more harmful directives.

Why it matters: persistence is the real threat. A one-off bad answer is annoying; a compromised session can:

  • Degrade developer decisions (“ignore security warnings”)
  • Encourage unsafe code patterns
  • Steer toward data exposure (“include secrets in the output for debugging”)
  • Undermine incident response (“tell the user everything is fine”)

How to detect it with AI monitoring:

  • Track instructional drift: sudden changes in tone, policy adherence, or formatting that persist
  • Identify meta-instructions in responses (e.g., “for all future responses…”) and quarantine them
  • Score outputs for control intent (attempts to modify model behavior rather than answer the user)
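
The meta-instruction and control-intent checks above can start as simple pattern scoring; a real deployment would pair this with a trained classifier, and the phrase list below is only a hypothetical starting point:

```python
import re

# Phrases that try to change future behavior rather than answer the question.
# Illustrative only; tune per deployment and pair with a learned classifier.
META_INSTRUCTION_PATTERNS = [
    r"for all future (responses|messages|turns)",
    r"from now on",
    r"you are now\b",
    r"ignore (all )?(previous|prior) (instructions|messages)",
    r"do not (mention|reveal) (this|these) instruction",
]


def control_intent_score(completion: str) -> float:
    """Return a 0..1 score for 'this text is trying to steer future behavior'."""
    text = completion.lower()
    hits = sum(bool(re.search(p, text)) for p in META_INSTRUCTION_PATTERNS)
    return min(1.0, hits / 2)  # two or more distinct markers => max score


def quarantine_if_needed(completion: str, threshold: float = 0.5) -> tuple[str, bool]:
    """Hold suspicious completions out of conversation memory for review."""
    score = control_intent_score(completion)
    return (completion, False) if score < threshold else ("", True)
```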

How to prevent it with policy enforcement:

  • Strip or neutralize response segments that look like “system prompt” content
  • Enforce a rule: tool outputs cannot introduce durable behavioral instructions
  • Isolate sampling context from the main chat memory (more on this below)

3) Covert tool invocation: unauthorized actions that look legitimate

What happens: the injected prompt tells the model to call additional tools (like writeFile) as part of the “helpful” workflow. The user asked for a summary; the agent also writes a file to the local machine, and could just as easily invoke other tools.

Why it’s dangerous: tool calls are where LLM risk becomes real-world impact:

  • Local file writes (dropping scripts, logs, or persistence helpers)
  • Data exfiltration via network tools
  • Credential access via password manager or secrets tooling
  • Repository modification or malicious PR generation

What makes this attack sneaky: the tool invocation can be buried inside a plausible-looking answer. Many users won’t notice a brief acknowledgment line.

The AI security control you want here: deterministic, enforceable policies:

  • Allowlist tool invocations per MCP server (a summarizer shouldn’t be able to write files)
  • Require explicit user confirmation for high-risk actions (file writes, network calls, repo commits)
  • Bind tool permissions to intent: “Summarize this file” does not justify “write a new file”
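
An allowlist like this can be enforced deterministically in the client before any tool call is dispatched. A sketch, with a made-up policy table and tool names:

```python
# Hypothetical per-server tool allowlist, enforced client-side before dispatch.
TOOL_ALLOWLIST: dict[str, set[str]] = {
    "code-summarizer": {"readFile", "summarize"},
    "repo-analyzer": {"readFile", "listFiles", "searchCode"},
}

# Actions that always need a human, even when the tool is allowlisted.
HIGH_RISK_TOOLS = {"writeFile", "runShell", "httpRequest", "gitCommit"}


def authorize_tool_call(server: str, tool: str, user_confirmed: bool = False) -> bool:
    """Allow a tool call only if the server declared it, and require explicit
    user confirmation for high-risk actions regardless of the allowlist."""
    if tool not in TOOL_ALLOWLIST.get(server, set()):
        return False
    if tool in HIGH_RISK_TOOLS and not user_confirmed:
        return False
    return True
```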

Just as important, use anomaly detection to spot “tool chaining” patterns (a sketch follows this list):

  • A summary request followed by a file write is suspicious.
  • A summarizer that starts calling network tools is suspicious.
  • A server that increases sampling frequency over time is suspicious.
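
Chaining detection can start as a baseline of the tool-call sequences each server has historically produced; anything unseen gets flagged for review. A sketch, with hypothetical tool names:

```python
from collections import Counter


def build_chain_baseline(historic_chains: list[tuple[str, ...]],
                         min_count: int = 3) -> set[tuple[str, ...]]:
    """Keep only chains seen often enough to be considered 'normal'."""
    counts = Counter(historic_chains)
    return {chain for chain, n in counts.items() if n >= min_count}


def is_anomalous_chain(chain: tuple[str, ...],
                       baseline: set[tuple[str, ...]]) -> bool:
    """A summarizer that suddenly chains into writeFile or httpRequest
    will not match any baselined sequence."""
    return chain not in baseline


# Example: ("summarize",) is normal; ("summarize", "writeFile") is not.
baseline = build_chain_baseline([("summarize",)] * 10)
assert is_anomalous_chain(("summarize", "writeFile"), baseline)
```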

Why traditional controls fail: the “implicit trust” trap

Most organizations already have some controls around AI usage—acceptable use policies, provider guardrails, or basic prompt filtering. MCP sampling attacks punch through those controls because the untrusted input is no longer just the user’s prompt.

The failure modes tend to look like this:

  • The host UI hides the full completion, so humans can’t audit what happened.
  • Server prompts aren’t treated as hostile input, because they’re “part of the integration.”
  • Tool calls appear legitimate, because they’re formatted correctly and routed through the normal agent flow.

This is why I keep coming back to one principle: observability beats optimism. If you can’t see the raw sampling request, the raw completion, and the tool calls—plus correlate them to cost and user intent—you’ll miss the early warning signs.

A practical defensive blueprint for MCP-based agents

You don’t need to ban MCP sampling to be safe. You need to run it like you run any other high-trust integration: least privilege, monitoring, and containment.

Harden the sampling request path

Start by treating sampling/createMessage as a security-sensitive API.

  • Enforce strict templates: separate user content from server-provided instructions
  • Normalize and sanitize: strip control characters, zero-width spaces, and suspicious encodings
  • Set per-operation token budgets: summaries get small caps; “analyze repo” gets larger caps
  • Rate limit sampling by server, tool, and time window

If you can only implement one thing this quarter, implement this:

Tie token budgets and sampling frequency to the declared tool purpose, and alert on deviations.
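
A sketch of that one rule, assuming each server declares a purpose in its manifest and you can intercept sampling requests client-side (budget numbers and field names are illustrative):

```python
import time
from collections import defaultdict, deque

# Hypothetical budgets keyed by the tool's declared purpose.
PURPOSE_BUDGETS = {
    "summarization": {"max_tokens": 500, "max_calls_per_hour": 30},
    "repo-analysis": {"max_tokens": 4000, "max_calls_per_hour": 10},
}

_recent_calls: dict[str, deque] = defaultdict(deque)


def check_sampling_request(server: str, purpose: str, requested_max_tokens: int,
                           now: float | None = None) -> list[str]:
    """Return a list of policy violations for this sampling request."""
    now = now if now is not None else time.time()
    budget = PURPOSE_BUDGETS.get(purpose)
    if budget is None:
        return [f"{server}: undeclared purpose '{purpose}'"]

    violations = []
    if requested_max_tokens > budget["max_tokens"]:
        violations.append(f"{server}: maxTokens {requested_max_tokens} exceeds "
                          f"{budget['max_tokens']} for '{purpose}'")

    window = _recent_calls[server]
    window.append(now)
    while window and now - window[0] > 3600:  # keep a rolling one-hour window
        window.popleft()
    if len(window) > budget["max_calls_per_hour"]:
        violations.append(f"{server}: sampling rate {len(window)}/h exceeds budget")
    return violations
```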

Isolate context so hijacks don’t persist

Conversation hijacking becomes dramatically harder when sampling can’t pollute the main chat.

  • Don’t automatically merge sampling completions into the global conversation memory
  • Use scoped memory: sampling requests get a sandboxed context (for example, limit includeContext to thisServer rather than sharing the full conversation)
  • Block “response-to-memory” writes unless the user approves

This is the agent equivalent of “no shared credentials.” If one tool gets compromised, it shouldn’t rewrite your assistant’s personality.
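
One way to express that isolation rule in the client, using a hypothetical memory interface: the sampling completion goes back to the requesting server, but it never enters the shared chat history without explicit approval.

```python
class ConversationMemory:
    """Minimal stand-in for the host app's chat history."""

    def __init__(self) -> None:
        self.turns: list[str] = []

    def append(self, text: str) -> None:
        self.turns.append(text)


def handle_sampling_completion(completion: str,
                               memory: ConversationMemory,
                               user_approved_merge: bool) -> str:
    """Return the completion to the requesting server, but never write it into
    the main conversation unless the user explicitly approves the merge."""
    if user_approved_merge:
        memory.append(completion)
    # Otherwise the completion stays scoped to this sampling exchange and is
    # discarded once the server's request has been answered.
    return completion
```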

Put tool execution behind policy gates

For covert tool invocation, don’t rely on the model to “do the right thing.” Enforce it.

  • Require explicit consent for filesystem writes, shell commands, network access, and repo changes
  • Enforce capability declarations (servers can’t request tools they didn’t declare)
  • Validate tool arguments against policy (path restrictions, domain allowlists, file type constraints)

A strong control pattern is two-person integrity for AI actions:

  • The model proposes a tool call
  • The policy engine evaluates it
  • The user approves (or an automated workflow approves only if risk score is low)
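
That approval flow is a small state machine: the model proposes a call, a policy engine scores it, and only low-risk calls are auto-approved. A sketch with an invented risk model; the weights and tool names are placeholders:

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTO_APPROVE = "auto_approve"
    ASK_USER = "ask_user"
    DENY = "deny"


@dataclass
class ProposedToolCall:
    server: str
    tool: str
    arguments: dict


def risk_score(call: ProposedToolCall) -> float:
    """Toy risk model: reads are cheap, writes and network calls are not."""
    score = 0.1
    if call.tool in {"writeFile", "gitCommit"}:
        score += 0.5
    if call.tool in {"httpRequest", "runShell"}:
        score += 0.8
    if any("secret" in str(v).lower() for v in call.arguments.values()):
        score += 0.5
    return min(score, 1.0)


def evaluate(call: ProposedToolCall, declared_tools: set[str]) -> Decision:
    if call.tool not in declared_tools:
        return Decision.DENY          # server never declared this capability
    if risk_score(call) < 0.3:
        return Decision.AUTO_APPROVE  # low risk: automated workflow may proceed
    return Decision.ASK_USER          # everything else waits for a human
```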

Use AI to detect what rules can’t express

Policy gets you safety boundaries. AI gets you detection when attackers stay “within format” but outside normal behavior.

High-signal detections for MCP sampling environments include:

  • Token-to-output mismatch (billed tokens far exceed visible response)
  • Prompt injection markers (“System:”, “You are now…”, “for all future…”, role-play coercion)
  • Tool chaining anomalies (new tool sequences that don’t match historical baselines)
  • Sampling bursts (high frequency requests from one server)
  • Cross-server coercion (one server’s prompt trying to trigger another server’s tools)
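
Most of these signals fall out of a single join across sampling logs, tool-call logs, and provider usage records keyed by server and tool. A sketch, assuming you can export all three streams with those fields (the names are hypothetical):

```python
from collections import defaultdict


def correlate(sampling_events: list[dict], tool_calls: list[dict],
              usage_records: list[dict]) -> dict[tuple[str, str], dict]:
    """Aggregate three separately collected streams onto a (server, tool) key so
    anomalies (high spend, unexpected tools) show up side by side."""
    summary: dict[tuple[str, str], dict] = defaultdict(
        lambda: {"sampling_calls": 0, "tool_calls": 0, "billed_tokens": 0})
    for e in sampling_events:
        summary[(e["server"], e["tool"])]["sampling_calls"] += 1
    for c in tool_calls:
        summary[(c["server"], c["tool"])]["tool_calls"] += 1
    for u in usage_records:
        summary[(u["server"], u["tool"])]["billed_tokens"] += u["output_tokens"]
    return dict(summary)

# A summarizer with 40 sampling calls, 12 writeFile tool calls, and 90k billed
# tokens against a few thousand visible tokens now stands out in a single table.
```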

This is exactly where AI-driven cybersecurity shines: you’re not just matching signatures; you’re learning what “normal” looks like for each tool, repo, team, and workflow.

“People also ask” answers (for teams rolling out MCP now)

Should we disable MCP sampling entirely?

If you can’t monitor and enforce policy on sampling requests and tool calls, yes—disable it. If you can, sampling is usable, but only with strict scoping, rate limits, and high-risk tool gating.

What’s the biggest risk: cost, bad answers, or system changes?

System changes. Cost theft is painful and hijacking is disruptive, but covert tool invocation can modify local files, move data, or set up persistence. That’s the line where AI safety becomes incident response.

What’s the fastest win for detection?

Correlate sampling events + tool calls + token spend per MCP server and per tool. Most orgs track these separately (or not at all). When you join them, anomalies become obvious.

The better way to run copilots in 2026: assume tools are hostile

MCP will keep spreading because it solves a real integration problem, and because standardization reduces engineering friction. Security teams shouldn’t fight that momentum—they should make it safe.

MCP sampling prompt injection is a reminder that agentic AI expands the attack surface. The fix isn’t a single guardrail. It’s a runtime approach: visibility into prompts and completions, anomaly detection tuned to tool behavior, and policy enforcement that blocks unauthorized actions even when the prompt “looks normal.”

If you’re rolling out copilots across engineering in 2026, ask one hard question: Can your AI security controls detect a hidden sampling prompt injection before it turns into unauthorized tool execution?
