Why AI Agents Ignore Your Instructions
We ran experiments to find out why AI agents forget project-specific rules. The results were surprising: less is more, prohibition beats reframing, and recognition doesn't equal adherence.

At Inferal, we run our company through Claude. Not as an experiment, but as our actual operating system. Code reviews, PR merges, email drafts, calendar scheduling, task management. Claude has access to everything through our workspace architecture.
This only works if the agent follows our rules. And it wasn’t following them.
We’d written a 400-line AGENTS.md documenting every convention. Commit message format. Git wrapper commands. File naming rules. Claude would read it, understand it, and then… ignore half of it. Use git status instead of our wrapper. Forget the Problem: prefix. Create files we’d explicitly prohibited.
So we stopped guessing and started measuring. Three experimental phases. Six strategies. Real data on what actually makes AI agents follow instructions.
The Experiment
We created a controlled environment to test instruction adherence using Claude as the subject. The task was simple: make a commit following project conventions. The conventions included:
- Use project-specific tool wrappers instead of standard commands
- Follow a specific commit message format
- Use present tense in problem descriptions
- One commit per problem (no combining)
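For orientation, here is a hypothetical commit that would satisfy all four conventions at once; the wrapper name and message wording are illustrative, not our real ones:

```text
$ ./bin/g commit -m "Problem: settings export times out on large workspaces"
```

One present-tense problem, the required prefix, and the wrapper instead of raw git.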
We then tested six strategies for presenting these rules:
| Strategy | Description |
|---|---|
| baseline | Standard split pattern (CLAUDE.md references AGENTS.md) |
| prominence | Critical rules at document top with strong formatting |
| repetition | Same rules repeated 3x throughout the document |
| aliasing | Frame as equivalence (“X = Y”) rather than prohibition |
| consequences | Explain why each rule matters |
| minimal | Only 3 rules, nothing else |
Each strategy was tested 9-10 times with identical prompts.
The Results
Phase 2: Strategy Comparison
| Strategy | Tool Wrapper % | Format Start % | Present Tense % | Single Problem % | Format End % |
|---|---|---|---|---|---|
| baseline | 20.0% | 30.0% | 100% | 100% | 30.0% |
| prominence | 44.4% | 44.4% | 100% | 100% | 33.3% |
| repetition | 0% | 70.0% | 100% | 80.0% | 70.0% |
| aliasing | 0% | 0% | 100% | 100% | 0% |
| consequences | 11.1% | 77.8% | 88.9% | 66.7% | 66.7% |
| minimal | 22.2% | 100% | 100% | 44.4% | 100% |
Several findings stood out:
Minimal instructions win for structural templates. The 3-line instruction file achieved 100% adherence on commit format, the only strategy to do so. Less noise means more signal.
Prominence wins for tool-specific rules. At 44.4%, prominent positioning was the only strategy that meaningfully improved wrapper usage. Rules that fight against training priors need to be impossible to miss.
Aliasing completely failed. Zero percent adherence across all target metrics. Framing “use X instead of Y” as “X equals Y” confused the model. Prohibition language (“NEVER use X directly”) works better than equivalence reframing.
Trade-offs are real. Strategies that improved one metric often hurt others. Minimal achieved perfect commit format but dropped single-problem adherence to 44.4% (from baseline’s 100%). There’s no silver bullet.
Phase 3: The Recognition Paradox
We then tested whether these findings held with a full 400-line AGENTS.md. We used an adversarial prompt designed to trigger multiple violations: past tense, two problems in one commit, documentation creation, using standard commands instead of wrappers.
The result? 100% violation recognition across all strategies.
| Strategy | Past Tense % | Single Problem % | No Docs % | Proper Refusal % |
|---|---|---|---|---|
| baseline | 100% | 100% | 100% | 100% |
| layered | 100% | 100% | 100% | 100% |
| bookend | 100% | 100% | 100% | 100% |
| inline-reminders | 100% | 100% | 100% | 100% |
| consequences | 100% | 100% | 100% | 100% |
| severity-tiers | 100% | 100% | 100% | 100% |
When explicitly asked to violate rules, the model recognized and refused every time. But this revealed the core insight:
The Key Insight: Recognition ≠ Adherence
The challenge isn’t getting AI agents to recognize violations when pointed out. It’s getting them to proactively remember and apply rules during normal tasks.
| Scenario | Phase 2 (Normal Task) | Phase 3 (Adversarial) |
|---|---|---|
| Run commands | Often used standard commands | Correctly refused when asked to violate |
| Write commit | Often forgot required format | Correctly identified format violation |
| Create docs | Sometimes created prohibited files | Correctly refused explicit request |
Long documents with detailed rules work well for detecting explicit violations. But they may not help with proactive adherence during normal tasks. The model reads them, understands them, and still forgets to apply them when focused on the primary task.
What Actually Works
Based on our experiments:
For Tool-Specific Rules (wrappers, command aliases)
- Place at absolute top of the document
- Use strong formatting: NEVER, ALWAYS, bold, caps
- Keep prohibition language. Don’t try to reframe it as equivalence.
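For illustration, a hedged sketch of what such a preamble might look like; the wrapper path and exact wording are hypothetical, not our actual AGENTS.md:

```markdown
<!-- very top of AGENTS.md -->
## CRITICAL RULES (READ FIRST)

- **NEVER** run git commands directly. **ALWAYS** go through the project wrapper (./bin/g here is a placeholder).
- **NEVER** create documentation files unless explicitly asked to.
```

Contrast this with the aliasing framing from Phase 2 (a “git = ./bin/g” style equivalence), which scored 0% on the same metrics.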
For Structural Templates (commit formats, file naming)
- Make instructions minimal. Strip everything non-essential.
- Show the exact template
- Remove surrounding context that dilutes the rules
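As a hedged sketch (it assumes the Problem: prefix convention described earlier and is not our literal file), a minimal instruction file in this spirit can be as short as three lines:

```text
Commit messages start with "Problem: <present-tense description>".
One problem per commit; never combine problems.
Run every git operation through the project wrapper, never git directly.
```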
Avoid
- Reframing prohibitions as equivalences (0% success rate)
- Excessive context that buries critical rules
- Over-explaining rationale (consequences strategy didn’t improve adherence)
- Repetition without prominence (0% wrapper adherence)
Limitations and Future Directions
This research has clear constraints that limit generalizability:
Single project scope. All experiments used variations of one AGENTS.md structure. Different project types, rule categories, or documentation styles might yield different results.
Claude-specific. We tested only Claude (Sonnet). GPT-4, Gemini, or other models may respond differently to these strategies. Training data, context window handling, and instruction-following capabilities vary significantly across models.
Limited sample sizes. 9-10 runs per strategy provides directional signal but isn’t statistically robust. Edge cases and variance aren’t fully captured.
Specific task types. We focused on commit messages and command-line tools. Rules about code style, architecture decisions, or complex multi-step workflows might behave differently.
Future work could explore:
- Cross-model comparison (does the minimal strategy work for GPT-4?)
- Larger sample sizes with statistical significance testing
- Different rule categories (security, style, architecture)
- Interaction between strategies (minimal + prominent?)
- Tooling enforcement vs instruction-based adherence
Related Research
Our findings align with, and are nuanced by, several threads in the broader research literature.
Supporting: The “Lost in the Middle” Problem
Stanford and Google researchers discovered that LLMs exhibit a U-shaped performance curve when processing long contexts: they handle information at the beginning and end well, but struggle with content in the middle. In multi-document question answering, models showed significant accuracy drops (sometimes over 20%) when relevant information appeared mid-context rather than at boundaries.
This directly explains why our prominence strategy (placing rules at the document top) outperformed other approaches for tool-specific rules. We’re not fighting the model’s attention patterns, we’re working with them.
Reference: Liu et al., “Lost in the Middle: How Language Models Use Long Contexts” (2023)
Supporting: Instruction Forgetting Over Turns
The Multi-IF benchmark tested 14 state-of-the-art models across multi-turn conversations and found consistent degradation. OpenAI’s o1-preview dropped from 87.7% accuracy at turn one to 70.7% by turn three, a 19% relative decline. The researchers introduced an “Instruction Forgetting Ratio” metric showing that models increasingly abandon previously-followed instructions as conversations progress.
This mirrors our Phase 2 vs Phase 3 distinction. Models can follow instructions when explicitly tested, but forget to apply them proactively during extended work sessions.
Reference: Zeng et al., “Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following” (2024)
Supporting: Constraint Type Matters
InFoBench evaluated instruction-following across five constraint categories: Content, Linguistic, Style, Format, and Number. Models performed best on Content and Style constraints, intermediate on Format, and worst on Number and Linguistic constraints. Even GPT-4 left more than 10% of requirements unfulfilled.
This supports our finding that different rule types need different strategies. Structural templates (like commit format) behave differently than tool-specific rules (like command aliases).
Reference: Qin et al., “InFoBench: Evaluating Instruction Following Ability in Large Language Models” (2024)
Nuancing: Context Windows Aren’t the Whole Story
Mem0 research from April 2025 showed that even models with massive context windows (GPT-4’s 128K, Claude’s 200K, Gemini’s 10M tokens) don’t solve the fundamental problem. Longer contexts “merely delay rather than solve” the limitations. Attention mechanisms degrade over distant tokens regardless of window size.
However, their memory-augmented approach achieved 26% improvement over baseline by persisting critical information outside the context window. This suggests our conclusion about tooling may extend beyond enforcement to memory architecture.
Reference: Chhikara et al., “Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory” (2025)
Nuancing: Detailed Prompts Still Matter for Complex Tasks
While our minimal strategy won for structural templates, practitioner research emphasizes that “clear and detailed prompts are the cornerstone of successful prompt engineering” for multi-step workflows. The key insight is that complexity compounds: unclear instructions at step one create cascading errors by step five.
Our results may reflect task complexity: commit format is a single-step structural check, while multi-step reasoning tasks likely still benefit from comprehensive instructions. The minimal approach may hit diminishing returns (or negative returns) as task complexity increases.
Nuancing: Explicit Tests ≠ Natural Tasks
IFEval, the most widely-used instruction-following benchmark, focuses on “verifiable instructions”: crystal-clear commands like word counts and keyword requirements that can be programmatically checked. Models perform well on these.
But IFEval’s strength is also its limitation: explicit, verifiable instructions are precisely the scenario where models excel. Our Phase 3 results (100% recognition of explicit violations) align with IFEval findings. The gap appears in proactive adherence during natural tasks, a scenario benchmarks struggle to capture.
Reference: Zhou et al., “Instruction-Following Evaluation for Large Language Models” (2023)
Conclusion
The most counterintuitive finding: when it comes to instruction adherence, less is often more. A 3-line instruction file outperformed a detailed, well-structured document for template adherence. But for rules that fight training priors, prominence and strong prohibition language matter more than brevity.
The gap between recognition and proactive adherence suggests that current approaches to agent instruction may be fundamentally limited. The model knows the rules, it just doesn’t always apply them when focused on the task at hand.
Perhaps the answer isn’t better instructions, but better tooling: pre-commit hooks that enforce format, wrapper scripts that make violations impossible, linters that catch mistakes before they happen. When adherence matters, don’t rely on instructions alone.
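As one hedged sketch of that direction (the Problem: check below follows this article’s convention, but the exact pattern and file layout are illustrative), a commit-msg hook can reject a malformed message before it ever lands:

```python
#!/usr/bin/env python3
"""Illustrative commit-msg hook: require a 'Problem:' subject line.

Install as .git/hooks/commit-msg and mark it executable. The check is a
sketch of the convention described above, not a complete implementation.
"""
import re
import sys

# Git invokes commit-msg hooks with the path to the message file as argv[1].
with open(sys.argv[1], encoding="utf-8") as f:
    lines = f.read().splitlines()

# The subject is the first non-blank, non-comment line of the message.
subject = next((ln for ln in lines if ln.strip() and not ln.startswith("#")), "")

if not re.match(r"^Problem: \S", subject):
    sys.stderr.write("Commit rejected: subject must start with 'Problem: <description>'.\n")
    sys.exit(1)  # non-zero exit aborts the commit
```

The same logic applies to the wrapper rule: if the wrapper is the only route to git that the agent’s environment exposes, adherence stops being a memory problem.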