Why AI Agents Ignore Your Instructions
We ran experiments to find out why AI agents forget project-specific rules. The results were surprising: less is more, prohibition beats reframing, and recognition doesn't equal adherence.

At Inferal, we run our company through Claude. Not as an experiment, but as our actual operating system. Code reviews, PR merges, email drafts, calendar scheduling, task management. Claude has access to everything through our workspace architecture.
This only works if the agent follows our rules. And it wasn’t following them.
We’d written a 400-line AGENTS.md documenting every convention. Commit message format. Git wrapper commands. File naming rules. Claude would read it, understand it, and then… ignore half of it. Use git status instead of our wrapper. Forget the Problem: prefix. Create files we’d explicitly prohibited.
So we stopped guessing and started measuring. Three experimental phases. Six strategies. Real data on what actually makes AI agents follow instructions.
The Experiment
We created a controlled environment to test instruction adherence using Claude as the subject. The task was simple: make a commit following project conventions. The conventions included:
- Use project-specific tool wrappers instead of standard commands
- Follow a specific commit message format
- Use present tense in problem descriptions
- One commit per problem (no combining)
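For orientation, here is a hypothetical commit that would satisfy all four conventions at once; the wrapper name and message wording are illustrative, not our real ones:

```text
$ ./bin/g commit -m "Problem: settings export times out on large workspaces"
```

One present-tense problem, the required prefix, and the wrapper instead of raw git.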
We then tested six strategies for presenting these rules:
| Strategy | Description |
|---|---|
| baseline | Standard split pattern (CLAUDE.md references AGENTS.md) |
| prominence | Critical rules at document top with strong formatting |
| repetition | Same rules repeated 3x throughout the document |
| aliasing | Frame as equivalence (“X = Y”) rather than prohibition |
| consequences | Explain why each rule matters |
| minimal | Only 3 rules, nothing else |
Each strategy was tested 9-10 times with identical prompts.
The Results
Phase 2: Strategy Comparison
| Strategy | Tool Wrapper % | Format Start % | Present Tense % | Single Problem % | Format End % |
|---|---|---|---|---|---|
| baseline | 20.0% | 30.0% | 100% | 100% | 30.0% |
| prominence | 44.4% | 44.4% | 100% | 100% | 33.3% |
| repetition | 0% | 70.0% | 100% | 80.0% | 70.0% |
| aliasing | 0% | 0% | 100% | 100% | 0% |
| consequences | 11.1% | 77.8% | 88.9% | 66.7% | 66.7% |
| minimal | 22.2% | 100% | 100% | 44.4% | 100% |
Several findings stood out:
Minimal instructions win for structural templates. The 3-line instruction file achieved 100% adherence on commit format, the only strategy to do so. Less noise means more signal.
Prominence wins for tool-specific rules. At 44.4%, prominent positioning was the only strategy that meaningfully improved wrapper usage. Rules that fight against training priors need to be impossible to miss.
Aliasing completely failed. Zero percent adherence across all target metrics. Framing “use X instead of Y” as “X equals Y” confused the model. Prohibition language (“NEVER use X directly”) works better than equivalence reframing.
Trade-offs are real. Strategies that improved one metric often hurt others. Minimal achieved perfect commit format but dropped single-problem adherence to 44.4% (from baseline’s 100%). There’s no silver bullet.
Phase 3: The Recognition Paradox
We then tested whether these findings held with a full 400-line AGENTS.md. We used an adversarial prompt designed to trigger multiple violations: past tense, two problems in one commit, documentation creation, using standard commands instead of wrappers.
The result? 100% violation recognition across all strategies.
| Strategy | Past Tense % | Single Problem % | No Docs % | Proper Refusal % |
|---|---|---|---|---|
| baseline | 100% | 100% | 100% | 100% |
| layered | 100% | 100% | 100% | 100% |
| bookend | 100% | 100% | 100% | 100% |
| inline-reminders | 100% | 100% | 100% | 100% |
| consequences | 100% | 100% | 100% | 100% |
| severity-tiers | 100% | 100% | 100% | 100% |
When explicitly asked to violate rules, the model recognized and refused every time. But this revealed the core insight:
The Key Insight: Recognition ≠ Adherence
The challenge isn’t getting AI agents to recognize violations when pointed out. It’s getting them to proactively remember and apply rules during normal tasks.
| Scenario | Phase 2 (Normal Task) | Phase 3 (Adversarial) |
|---|---|---|
| Run commands | Often used standard commands | Correctly refused when asked to violate |
| Write commit | Often forgot required format | Correctly identified format violation |
| Create docs | Sometimes created prohibited files | Correctly refused explicit request |
Long documents with detailed rules work well for detecting explicit violations. But they may not help with proactive adherence during normal tasks. The model reads them, understands them, and still forgets to apply them when focused on the primary task.
What Actually Works
Based on our experiments:
For Tool-Specific Rules (wrappers, command aliases)
- Place at absolute top of the document
- Use strong formatting: NEVER, ALWAYS, bold, caps
- Keep prohibition language. Don’t try to reframe it as equivalence.
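For illustration, a hedged sketch of what such a preamble might look like; the wrapper path and exact wording are hypothetical, not our actual AGENTS.md:

```markdown
<!-- very top of AGENTS.md -->
## CRITICAL RULES (READ FIRST)

- **NEVER** run git commands directly. **ALWAYS** go through the project wrapper (./bin/g here is a placeholder).
- **NEVER** create documentation files unless explicitly asked to.
```

Contrast this with the aliasing framing from Phase 2 (a “git = ./bin/g” style equivalence), which scored 0% on the same metrics.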
For Structural Templates (commit formats, file naming)
- Make instructions minimal. Strip everything non-essential.
- Show the exact template
- Remove surrounding context that dilutes the rules
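As a hedged sketch (it assumes the Problem: prefix convention described earlier and is not our literal file), a minimal instruction file in this spirit can be as short as three lines:

```text
Commit messages start with "Problem: <present-tense description>".
One problem per commit; never combine problems.
Run every git operation through the project wrapper, never git directly.
```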
Avoid
- Reframing prohibitions as equivalences (0% success rate)
- Excessive context that buries critical rules
- Over-explaining rationale (consequences strategy didn’t improve adherence)
- Repetition without prominence (0% wrapper adherence)
Limitations and Future Directions
This research has clear constraints that limit generalizability:
Single project scope. All experiments used variations of one AGENTS.md structure. Different project types, rule categories, or documentation styles might yield different results.
Claude-specific. We tested only Claude (Sonnet). GPT-4, Gemini, or other models may respond differently to these strategies. Training data, context window handling, and instruction-following capabilities vary significantly across models.
Limited sample sizes. 9-10 runs per strategy provides directional signal but isn’t statistically robust. Edge cases and variance aren’t fully captured.
Specific task types. We focused on commit messages and command-line tools. Rules about code style, architecture decisions, or complex multi-step workflows might behave differently.
Future work could explore:
- Cross-model comparison (does the minimal strategy work for GPT-4?)
- Larger sample sizes with statistical significance testing
- Different rule categories (security, style, architecture)
- Interaction between strategies (minimal + prominent?)
- Tooling enforcement vs instruction-based adherence
Related Research
Our findings align with, and are nuanced by, several threads in the broader research literature.
Supporting: The “Lost in the Middle” Problem
Stanford and Google researchers discovered that LLMs exhibit a U-shaped performance curve when processing long contexts: they handle information at the beginning and end well, but struggle with content in the middle. In multi-document question answering, models showed significant accuracy drops (sometimes over 20%) when relevant information appeared mid-context rather than at boundaries.
This directly explains why our prominence strategy (placing rules at the document top) outperformed other approaches for tool-specific rules. We’re not fighting the model’s attention patterns, we’re working with them.
Reference: Liu et al., “Lost in the Middle: How Language Models Use Long Contexts” (2023)
Supporting: Instruction Forgetting Over Turns
The Multi-IF benchmark tested 14 state-of-the-art models across multi-turn conversations and found consistent degradation. OpenAI’s o1-preview dropped from 87.7% accuracy at turn one to 70.7% by turn three, a 19% relative decline. The researchers introduced an “Instruction Forgetting Ratio” metric showing that models increasingly abandon previously-followed instructions as conversations progress.
This mirrors our Phase 2 vs Phase 3 distinction. Models can follow instructions when explicitly tested, but forget to apply them proactively during extended work sessions.
Reference: Zeng et al., “Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following” (2024)
Supporting: Constraint Type Matters
InFoBench evaluated instruction-following across five constraint categories: Content, Linguistic, Style, Format, and Number. Models performed best on Content and Style constraints, intermediate on Format, and worst on Number and Linguistic constraints. Even GPT-4 left more than 10% of requirements unfulfilled.
This supports our finding that different rule types need different strategies. Structural templates (like commit format) behave differently than tool-specific rules (like command aliases).
Reference: Qin et al., “InFoBench: Evaluating Instruction Following Ability in Large Language Models” (2024)
Nuancing: Context Windows Aren’t the Whole Story
Mem0 research from April 2025 showed that even models with massive context windows (GPT-4’s 128K, Claude’s 200K, Gemini’s 10M tokens) don’t solve the fundamental problem. Longer contexts “merely delay rather than solve” the limitations. Attention mechanisms degrade over distant tokens regardless of window size.
However, their memory-augmented approach achieved 26% improvement over baseline by persisting critical information outside the context window. This suggests our conclusion about tooling may extend beyond enforcement to memory architecture.
Reference: Chhikara et al., “Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory” (2025)
Nuancing: Detailed Prompts Still Matter for Complex Tasks
While our minimal strategy won for structural templates, practitioner research emphasizes that “clear and detailed prompts are the cornerstone of successful prompt engineering” for multi-step workflows. The key insight is that complexity compounds: unclear instructions at step one create cascading errors by step five.
Our results may reflect task complexity: commit format is a single-step structural check, while multi-step reasoning tasks likely still benefit from comprehensive instructions. The minimal approach may hit diminishing returns (or negative returns) as task complexity increases.
Nuancing: Explicit Tests ≠ Natural Tasks
IFEval, the most widely-used instruction-following benchmark, focuses on “verifiable instructions”: crystal-clear commands like word counts and keyword requirements that can be programmatically checked. Models perform well on these.
But IFEval’s strength is also its limitation: explicit, verifiable instructions are precisely the scenario where models excel. Our Phase 3 results (100% recognition of explicit violations) align with IFEval findings. The gap appears in proactive adherence during natural tasks, a scenario benchmarks struggle to capture.
Reference: Zhou et al., “Instruction-Following Evaluation for Large Language Models” (2023)
Conclusion
The most counterintuitive finding: when it comes to instruction adherence, less is often more. A 3-line instruction file outperformed a detailed, well-structured document for template adherence. But for rules that fight training priors, prominence and strong prohibition language matter more than brevity.
The gap between recognition and proactive adherence suggests that current approaches to agent instruction may be fundamentally limited. The model knows the rules, it just doesn’t always apply them when focused on the task at hand.
Perhaps the answer isn’t better instructions, but better tooling: pre-commit hooks that enforce format, wrapper scripts that make violations impossible, linters that catch mistakes before they happen. When adherence matters, don’t rely on instructions alone.
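As one hedged sketch of that direction (the Problem: check below follows this article’s convention, but the exact pattern and file layout are illustrative), a commit-msg hook can reject a malformed message before it ever lands:

```python
#!/usr/bin/env python3
"""Illustrative commit-msg hook: require a 'Problem:' subject line.

Install as .git/hooks/commit-msg and mark it executable. The check is a
sketch of the convention described above, not a complete implementation.
"""
import re
import sys

# Git invokes commit-msg hooks with the path to the message file as argv[1].
with open(sys.argv[1], encoding="utf-8") as f:
    lines = f.read().splitlines()

# The subject is the first non-blank, non-comment line of the message.
subject = next((ln for ln in lines if ln.strip() and not ln.startswith("#")), "")

if not re.match(r"^Problem: \S", subject):
    sys.stderr.write("Commit rejected: subject must start with 'Problem: <description>'.\n")
    sys.exit(1)  # non-zero exit aborts the commit
```

The same logic applies to the wrapper rule: if the wrapper is the only route to git that the agent’s environment exposes, adherence stops being a memory problem.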