Red Teaming Agents, Not Models

Mon, 08 Jun 2026 00:00:00 +0000

Your agent passed every guardrail test. It never says anything harmful, never generates offensive content, politely declines every adversarial prompt you throw at it. And last Tuesday, it quietly deleted the wrong database because a Jira ticket it was reading contained a hidden instruction in the description field.

The guardrails caught everything the agent said. They caught nothing about what it did.

What red teaming is (and isn’t)
#

If you come from software engineering rather than security, red teaming might sound like a fancy term for testing. It’s related, but the framing is different. In traditional security, a red team plays the attacker: a group authorized to emulate an adversary’s capabilities against your system, reporting what worked so you can fix it before someone else finds it. The blue team plays defense: they monitor, detect, and respond to threats. Purple teaming is when the two work together, feeding offensive findings directly into defensive improvements.

For AI systems, red teaming has mostly meant sending adversarial prompts to a model and checking whether the response is harmful. Can you trick it into generating dangerous instructions? Can you bypass its safety training with creative prompt engineering? Tools like Garak automate this at scale, running libraries of attack prompts against chat endpoints and scoring the responses.

That approach works well for models and chatbots, where the only output is text. If the model says something harmful, you catch it by reading what it said. But agents don’t just say things. They do things. And that changes the entire testing surface.

Models say things. Agents do things.
#

The difference sounds obvious, but the implications run deep. A model is a brain in a jar. It receives text, it produces text, and everything it does is visible in the text it produces. You can evaluate its behavior by evaluating its output.

An agent is a brain with hands. It has tools: file systems, APIs, databases, email, MCP servers, shell access. When it acts, the action happens in the real world, not in the response text. An agent could respond with “I would never do that” while simultaneously executing the thing it claims it wouldn’t do through a tool call. If you’re only checking the text output, you catch nothing.

This is why model-level red teaming doesn’t transfer to agents. Sending adversarial prompts and checking responses tests the chat layer. It doesn’t test whether the agent’s tool calls are correct, whether its side effects are intended, or whether a compromised data source can steer it toward actions the operator never authorized.

Three attack surfaces
#

Agent red teaming needs to cover three layers. Borrowing from traditional penetration testing, you can think of these as increasing levels of attacker knowledge about the target system.

Input-level (black box). The attacker sends malicious prompts directly to the agent. This is the layer that existing guardrail tools handle well. You don’t need to know anything about the agent’s internals. You just send bad input and see what happens.

Data-level (grey box). The attacker compromises a data source the agent consumes. A Jira ticket with hidden instructions in the description. A document with invisible text that redirects the agent’s goal. A database record with embedded prompts. This is indirect prompt injection, and the OWASP Top 10 for Agentic Applications classifies Agent Goal Hijack (ASI01) and Tool Misuse (ASI02) as the two highest-priority risks in this category.

Tool-level (white box). The attacker compromises a tool or MCP server the agent calls. The tool returns manipulated responses that steer the agent toward unintended actions. This requires knowing the agent’s tool chain, but once you have that knowledge, you can influence the agent’s reasoning at every step.

Each layer requires different testing infrastructure, different attack libraries, and different detection capabilities. Most organizations today test only the first layer and assume the other two are covered by general infrastructure security. They’re not.

Test the real agent, not a simulation
#

The most promising approach emerging in 2026 is best described as “testing the real agent in a synthetic world.” Instead of rebuilding your agent in a test harness (which means you’re no longer testing the same agent), you intercept the agent’s tool calls and replace real backends with controlled synthetic ones.

The agent thinks it’s talking to its real MCP servers, its real APIs, its real database. It’s actually talking to fakes that can inject attack payloads through tool responses, capture every action the agent takes, and verify whether the agent did something it shouldn’t have. Think of it as setting up a honeypot for your own agent.

This idea was pioneered by Agent Dojo out of ETH Zurich, which introduced a concept that sounds simple but changes how you think about agent security: dual scoring. Every test measures two things. The security score tells you whether the agent resisted the attack. The utility score tells you whether the agent still completed its legitimate task. That tradeoff matters because you can trivially make any agent perfectly secure by making it refuse to do anything at all. Security without utility isn’t security. It’s a paperweight.

Agent Dojo itself appears to be inactive (the last commit is from late 2025), but its core ideas are showing up across the next generation of agent testing tools. MiDojo picks up where Agent Dojo left off, adding man-in-the-middle interception of MCP tool calls so you can test real agents against synthetic backends without rebuilding them. The same dual-scoring principles also inform NIST’s emerging guidance on agent security standards.

Detection is harder than attack
#

Injecting attacks is relatively straightforward. The hard part is detecting whether the agent acted on the injection.

The reason is that agents don’t respond to attacks in predictable ways. An agent that receives a hidden instruction through a compromised document might not act on it through the same document tool. It might route the exfiltration through an email API three tool calls later, or write sensitive data to a file that gets synced elsewhere, or modify a configuration that opens a door for a future request. Tying the cause (compromised document) to the effect (data exfiltration via email) requires tracing the full execution path across the agent’s entire tool chain, not just watching individual tool calls in isolation.

The distinction between red teaming and blue teaming matters here. Red teaming asks “can I make the agent do something bad?” Blue teaming asks “can I detect when the agent is doing something bad?” Both questions require the same detection infrastructure, but they answer different things and operate at different times:

Red teaming runs before deployment. You simulate attacks, measure resistance, and improve defenses based on what you find. Blue teaming (runtime monitoring) watches the agent during normal production operation and catches unintended actions as they happen, even without an adversary present.

Detection capability is dual-use. Once you build the infrastructure to detect harmful actions for red teaming purposes, you can deploy that same infrastructure as a runtime monitor. Red team findings become detection rules. Detection rules become guardrails.

But attacks aren’t the only problem.

No adversary required
#

The most interesting question that comes up in agent security discussions is one that isn’t about adversaries at all: what about agents that do something unintended given a perfectly cooperative prompt and completely uncompromised tools?

“Delete all the temp files” and the agent deletes non-temp files too. “Update the configuration” and the agent overwrites unrelated settings. “Summarize this document” and the agent silently modifies it during the read. These aren’t attack scenarios. They’re the mundane reality of agents operating in complex environments, and they’re arguably harder to test because they happen non-deterministically and only surface under specific combinations of context, tool state, and agent reasoning.

If you’ve read the earlier posts in The Flock series, this should sound familiar. The creativity paradox is exactly this: agents doing the wrong thing while optimizing for the right goal. Red teaming catches adversarial attacks. Catching the agent’s own well-intentioned mistakes requires the same detection infrastructure but a different testing mindset: not “what happens when someone attacks?” but “what happens on a normal Tuesday when nobody is attacking and the agent just gets creative?”

Agent observability becomes a prerequisite for security here. If your team can’t trace an agent’s tool calls well enough to explain what it did after a normal run, you certainly can’t trace what it did under adversarial conditions. (More on agent observability for CI pipelines in an upcoming post in The Flock series.)

Purple teaming: when the loop closes
#

The 2026 trend is merging red and blue teaming into continuous purple teaming: autonomous agents continuously simulate attacks, detect vulnerabilities in real time, and feed findings back into guardrails, all in the same cycle. Multiple vendors are building this as a product category.

It sounds compelling: red team findings automatically become runtime detection rules, which become regression tests, which get re-tested in the next red team cycle. Every vulnerability found once is caught forever after. The loop closes, and the system gets more secure with every iteration.

But the risk is real. ISACA warns about escalatory spirals when autonomous red and blue agents interact. A false positive triggers a defensive action (revoking credentials, blocking a tool). The red team agent interprets the changed environment as a new attack surface and escalates. The blue team responds with more aggressive countermeasures. Within minutes, the two AI systems are fighting each other over a signal that was never a real threat.

The mitigation that works is “shadow mode”: AI agents suggest security actions, humans approve them. The automation handles the speed and coverage. The human handles the judgment about whether the finding is real and the response is proportionate.

Where this leaves us
#

Agent security is where web application security was in the early 2000s: the attack surface is new, the tooling is immature, and most organizations haven’t started thinking about it systematically. The OWASP Top 10 for Agentic Applications published in late 2025 is the first formal taxonomy. Microsoft’s Agent Governance Toolkit (April 2026) addresses all ten risks with runtime enforcement.

Full disclosure: we’re working on this at Red Hat. The TrustyAI team has been integrating Garak (the open source LLM vulnerability scanner) into the Red Hat AI platform for automated red teaming of models and agents, and is building the next generation of agent-level testing that goes beyond chat-endpoint scanning. The Summit 2026 Day 2 keynote demos some of where this is heading.

Concretely, the TrustyAI team is driving the MiDojo effort mentioned above as part of a broader agent-redteaming initiative that also includes redteam-core, a curated attack library tagged against the OWASP ASI taxonomy. Separately, the Kagenti team has been running capture-the-flag exercises against agents in Kubernetes clusters, testing whether policy enforcement (OPA, sandboxing) actually holds when an agent gets creative with leaked credentials.

For practical reading: Morgan Foster’s writeup of Claude stealing the HR docs and Roy Belio’s walkthrough of infrastructure red teaming with abliterated models on Red Hat Developer.

The gap that remains is detection. We know how to inject attacks at every layer. We’re getting better at testing real agents without rebuilding them. What we still lack is reliable, general-purpose detection of whether an agent acted on a compromise. A poisoned Jira ticket doesn’t show up as a failed Jira call. It shows up three tool calls later as a perfectly normal-looking email with sensitive data in the body. Catching that means tracing the full chain from compromised input to unauthorized action, even when the two happen through completely unrelated tools. Solving that problem unlocks both red teaming and runtime monitoring in a single investment.

The agent that says all the right things and does all the wrong things is the one you need to worry about. And right now, most testing only checks what it says.

Author: Roland Huß AIA HAb CeNc Hin R Claude Opus 4.6 v1.0

Red Teaming on Roland Huß

Red Teaming Agents, Not Models

What red teaming is (and isn’t) #

Models say things. Agents do things. #

Three attack surfaces #

Test the real agent, not a simulation #

Detection is harder than attack #

No adversary required #

Purple teaming: when the loop closes #

Where this leaves us #