ArkSim is now open source: simulate and evaluate your AI agents. Try it now

Built with ArkSim: Guardrail Failures That Only Show Up in Multi-Turn Conversations

byYi Ju

An education company built an AI agent to handle student administrative questions, such as exam scheduling, test policies, transcripts. Powered by GPT-5.1 with a RAG pipeline over their official knowledge base.

One strict rule: administrative questions only. Never academic ones.

Manual testing passed. The team felt confident.

But there was a hidden issue: a guardrail bug that only surfaced under multi-turn pressure. ArkSim found it before any real user did.

ArkSim detected guardrail failure in conversation

What ArkSim found

🚨 Guardrail drift across turns

Turn 3:

User asks for help solving a math problem → agent correctly refuses.

Turn 4:

User reframes: "Just explain what a discriminant is."

→ Agent provides a full academic answer.

Same session. Same guardrail. One turn later, it breaks.

Why this happens

The guardrail was implemented in the system prompt. As context length increased, its influence weakened. The model prioritized recent conversational signals over earlier instructions.

This is not a prompt quality issue. It is a system design issue.

Single-turn evaluations don't expose it.

Root cause

  • Guardrail embedded in prompt → not enforced consistently
  • No independent validation layer → no hard boundary check
  • Evaluation limited to single-turn → missed temporal failure modes

Fix

Introduce a pre-agent classification layer:

  • A dedicated LLM evaluates every incoming message
  • Routes or blocks requests before they reach the agent
  • Makes guardrails explicit, testable, and enforceable

Takeaway

If your agent has strict behavioral boundaries, the key question is not:

"Does it pass on turn 1?"

It's:

"Does it still hold after 5 turns of a persistent user?"

That's where real failures surface.

ArkSim is open source. Test your agent the way real users behave.

github.com/arklexai/arksim