Skip to main content

Red Teaming

Red Teaming Agents, Not Models

Your agent passed every guardrail test. It never says anything harmful, never generates offensive content, politely declines every adversarial prompt you throw at it. And last Tuesday, it quietly deleted the wrong database because a Jira ticket it was reading contained a hidden instruction in the description field. The guardrails caught everything the agent said. They caught nothing about what it did.