
Version Control Wasn't Built for Agents

Git tracks what changed. It doesn't track what the agent knew when it made those changes. We need provenance, not just history.

Yurii Rashkovskii
Founder

I’ve been interested in version control for most of my career: from CVS to Subversion, then Darcs (where I first encountered patch theory), Mercurial, Git. I’ve experimented with Fossil and Pijul, and followed research on semantic versioning of structured data. Each system taught me something about what we’re really trying to capture when we track changes.

Years ago, I worked on a serverless information tracker – a tool for capturing and sharing information using files as the medium of record, no servers or databases required. It never took off, but the core question stayed with me: what’s the minimal infrastructure needed to capture how knowledge evolves?

Agents are forcing me to revisit it.


Version control, code review, CI/CD: all designed around the assumption that a developer makes deliberate changes, submits them for review by a colleague, and iterates based on feedback. This model worked for decades. But agents are changing the game. They work at different velocities, produce different volumes, and make decisions in ways we can’t fully trace. The question isn’t whether our tools are adequate (they aren’t). The question is: what would tools designed for agentic workflows actually look like?

The Commit Granularity Problem

When a developer writes code, they batch changes into logical commits. “Add user authentication.” “Fix login bug.” “Refactor auth middleware.” Each commit represents a coherent unit of work that made sense to the author at the time.

When an agent writes code, it performs a series of atomic operations: read a file, understand context, write changes, read another file, write more changes. The “logical unit” is an emergent property of the session, not something the agent planned upfront.

What if every file write were a commit?

This sounds extreme, but consider what we gain:

  1. Perfect rollback granularity: Undo the last file write, not the last “logical change”
  2. Trace what the agent did: Each commit becomes a step in the agent’s reasoning
  3. Bisect at the right level: Find exactly which file change introduced a bug

The cost is noise: hundreds of commits for a single feature. But noise is a UI problem, not a fundamental one. We can collapse, summarize, squash. The underlying trace is what matters.
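
To make this concrete, here is a minimal sketch of a watcher that turns every write into a commit. It assumes a Git repository and Python’s watchdog package; the event filtering and the mechanical commit messages are illustrative choices, not a finished design:

import subprocess
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class CommitEveryWrite(FileSystemEventHandler):
    def on_modified(self, event):
        # Ignore directories and Git's own bookkeeping to avoid feedback loops.
        if event.is_directory or "/.git/" in event.src_path:
            return
        subprocess.run(["git", "add", "--", event.src_path], check=True)
        # One commit per write; collapsing and summarizing happen later, in the UI.
        subprocess.run(["git", "commit", "-m", f"write: {event.src_path}"])

observer = Observer()
observer.schedule(CommitEveryWrite(), path=".", recursive=True)
observer.start()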

Read Tracing: The Missing Half

Git tracks what changed. It doesn’t track what the agent knew when it made those changes.

When debugging agent behavior, the most common question is: “Why did the agent do X instead of Y?” To answer this, you need to know:

  • Which files did the agent read?
  • What was in those files at the time?
  • How did the agent interpret the user’s request?

Note what’s not on this list: the original user prompt. What matters is the agent’s reformulation, how it understood and framed the task. Storing the interpretation rather than the raw input serves two purposes:

  1. It’s what the agent acted on: The original phrasing doesn’t matter; the interpretation does
  2. It reduces prompt anxiety: Users don’t need to craft “perfect” prompts when they know the agent will reformulate anyway

This is provenance: the lineage of a decision. We have it for data pipelines (where did this number come from?), but not for code changes (where did this line come from?).

Provenance as a DAG

The key insight: provenance isn’t a list of events. It’s a directed acyclic graph where:

  • Nodes are prompts (user requests), observations (reads), interpretations (reformulations), and actions (writes)
  • Edges represent “informed by” relationships
  • Prompt and reads are parallel inputs to interpretation
  • Multiple reads can inform a single interpretation
  • A single read can inform multiple writes
  • Decisions branch and merge as context accumulates
[Figure: a provenance DAG in three columns, PROMPT + READS → INTERPRETATION → WRITES. The prompt P (“fix auth middleware”) and reads R of config.py, auth.py (existing auth logic), and test_auth.py (test patterns) all feed the interpretation I (reformulation: “add JWT validation to middleware”), which informs the writes W to middleware.py (JWT validation added) and test_middleware.py (new test cases). The legend distinguishes “informed by” edges from direct references.]

This structure enables queries that linear logs can’t answer:

  • “What informed this write?” – traverse backwards from the write node
  • “What did this read affect?” – traverse forwards from the read node
  • “What’s the common ancestor of these two writes?” – find shared reads or interpretations
  • “Did the agent see the test file before writing the implementation?” – check edge existence
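
A toy version of these traversals, using node names from the figure above (the ids and the in-memory adjacency maps are illustrative assumptions):

from collections import defaultdict

edges = {  # child -> parents, i.e. "informed by"
    "I:add-jwt": ["P:fix-auth", "R:config.py", "R:auth.py", "R:test_auth.py"],
    "W:middleware.py": ["I:add-jwt"],
    "W:test_middleware.py": ["I:add-jwt"],
}

reverse = defaultdict(list)  # parent -> children
for child, parents in edges.items():
    for parent in parents:
        reverse[parent].append(child)

def informed_by(node):  # traverse backwards: what informed this write?
    seen, stack = set(), list(edges.get(node, []))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(edges.get(n, []))
    return seen

def affected(node):  # traverse forwards: what did this read affect?
    seen, stack = set(), list(reverse.get(node, []))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(reverse.get(n, []))
    return seen

# Did the agent see the test file before writing the implementation?
print("R:test_auth.py" in informed_by("W:middleware.py"))  # True

The common-ancestor query is then just the intersection of two informed_by sets.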

A minimal tracing system would capture:

  • Nodes with timestamps and content hashes
  • Edges with “informed by” semantics
  • Temporal validity (when each edge was active)

This creates a knowledge graph of agent activity that can answer temporal queries: “What did the agent know when it wrote line 42 of auth.py?”
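
In record form, that could be as small as two types. The field names below are assumptions, not a spec:

from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    id: str            # e.g. "R:auth.py"
    kind: str          # "prompt" | "read" | "interpretation" | "write"
    content_hash: str  # hash of what was seen or written
    at: float          # timestamp of the event

@dataclass(frozen=True)
class Edge:
    src: str                       # the informing node
    dst: str                       # the informed node
    valid_from: float              # when this "informed by" edge became active
    valid_to: float | None = None  # None = still active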

Temporal Validity for Code

Every fact has a lifespan. Applied to code:

  • “File X contained function Y from commit A to commit B”
  • “Agent knew about interface Z when it wrote adapter W”
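
Given the Edge records sketched above, “what did the agent know when it wrote W” becomes a filter over edge lifespans; again a sketch, not a query-language proposal:

def known_at(edges, node, t):
    # Nodes whose "informed by" edges into `node` were active at time t.
    return {e.src for e in edges
            if e.dst == node
            and e.valid_from <= t
            and (e.valid_to is None or t < e.valid_to)}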

Some files are hubs, central to understanding. An agent that reads config.py learns things that ripple through every decision. Tracking importance helps reconstruct why the agent made choices.

And context windows overflow. Agents forget. A provenance system could track which context was likely still “hot” when a decision was made, and which had decayed.

This isn’t theoretical. It’s how we’d debug agent behavior at scale. When an agent produces wrong code, we need to trace back through its observations to find where reasoning went wrong.

Review Is Already Broken

Let’s be honest: code review has problems even without agents.

The LGTM epidemic: How many reviews are actually thorough? Most reviews approve changes after a cursory scan. We’ve institutionalized rubber-stamping.

The volume problem: Even before agents, reviewers struggled to keep up. Large PRs get less scrutiny per line than small ones.

The trace problem: We review outcomes, not reasoning. The code looks fine, but was the approach right? Did the author consider alternatives?

Agents make all of this worse:

  • More volume: An agent can produce in hours what took days
  • Different patterns: Agent code might be correct but unfamiliar
  • No authorial intent: A developer can explain their choices; an agent just did what seemed right

The current review model – one reviewer, static diff, approval gate – doesn’t scale to agent-generated code.

Multi-Layer Review

Instead of one reviewer trying to check everything, what if we had specialized agents reviewing different layers?

Business logic layer: Does this change accomplish what the user requested?

Architecture layer: Does this fit existing patterns? Does it introduce inappropriate coupling?

Security layer: Are there vulnerabilities? Input validation gaps?

Performance layer: Are there obvious inefficiencies? Missing indexes?

Each layer is a specialized review pass. You orchestrate and make final calls, but the heavy lifting is distributed.
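
A sketch of that orchestration, with the layer questions lifted from above; ask_agent is a stand-in for whatever model call you would actually use:

LAYERS = {
    "business logic": "Does this change accomplish what the user requested?",
    "architecture": "Does this fit existing patterns? Inappropriate coupling?",
    "security": "Are there vulnerabilities? Input validation gaps?",
    "performance": "Are there obvious inefficiencies? Missing indexes?",
}

def multi_layer_review(diff, ask_agent):
    # One specialized pass per layer; a human orchestrates and makes the final call.
    return {layer: ask_agent(question, diff)
            for layer, question in LAYERS.items()}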

With provenance, each review concern can trace back: “The agent chose approach X at step 14 because it read files A, B, C and understood the task as Y.” Review becomes investigation of the decision path, not just inspection of the outcome.

The Hybrid Case

The interesting case isn’t solo developer or autonomous agent. It’s the messy middle where both collaborate.

[Figure: a hybrid session DAG with developer and agent lanes. The developer’s initial write W1 (api.py) leads to the prompt P (“fix auth”); the agent reads context R (api, auth, ...), forms the interpretation I (“add JWT”), and produces W2 (middleware.py, the agent’s solution); a verifying read R4 (test.py, ✓) and W2 inform the developer’s refining write W3 (middleware.py). Edges are “informed by”; the legend distinguishes developer nodes from agent nodes.]

The session is a DAG, not a log. You see exactly where you handed off (W1 → P), what context the agent built (P + R → I → W2), and where you took back control (W2 + R4 → W3). The seams are visible, and so are the dependencies.

Attribution becomes automatic. No need for Co-Authored-By trailers or guessing. The event stream records who did what.

Review becomes richer. Instead of reviewing a final diff and guessing who wrote what:

  • See the developer’s initial attempt
  • See where they got stuck
  • See the agent’s interpretation
  • See the agent’s solution
  • See the developer’s adjustments

A reviewer can ask: “Did the developer actually understand what the agent did, or did they just accept it?” The event stream answers this.

What About Solo Work?

A provenance system only for agents won’t get adopted. What does this look like for developers working without LLMs?

The insight: make it ambient, not explicit.

Instead of a tool you invoke (git add, git commit), imagine provenance as an OS-level observation layer:

  • Shell hooks capture commands and results
  • Filesystem watchers capture file writes
  • Editor extensions capture file opens
  • One local daemon collects all events

You don’t “use” this system. It observes your work. The shell history is the narrative.

14:00  $ cat src/auth.py
14:01  $ vim src/login.py      → wrote src/login.py
14:05  $ python test.py        → exit 1
14:08  $ vim src/login.py      → wrote src/login.py
14:10  $ python test.py        → exit 0

The command you ran is the intent. The file you wrote is the change. The exit code is the outcome. No commit message needed; the story is already there.
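
A sketch of the collection side, assuming a local daemon that appends JSON lines; the path and schema are illustrative:

import json, os, time

LOG = os.path.expanduser("~/.provenance/events.jsonl")  # hypothetical location

def record(kind, **fields):
    os.makedirs(os.path.dirname(LOG), exist_ok=True)
    with open(LOG, "a") as f:
        f.write(json.dumps({"at": time.time(), "kind": kind, **fields}) + "\n")

# A shell hook would report commands and outcomes:
record("command", argv="python test.py", exit_code=0)
# A filesystem watcher would report writes:
record("write", path="src/login.py")
# An editor extension would report opens, i.e. the reads:
record("read", path="src/auth.py")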

This inverts assumptions we’ve internalized:

  • “I decide when to checkpoint” → Everything is checkpointed. You decide what to query.
  • “My commit message explains the change” → The sequence of events is the explanation. Summarization happens later.
  • “I review a diff” → You review a session: what was read, what was tried, what failed.
  • “Clean history matters” → Raw history matters. Clean summaries are generated views.

The payoff:

  • Never lose work (every write is captured)
  • “What was I doing last Tuesday?” (query your own history)
  • “Why did I write it this way?” (trace back through what you read)
  • New team members can see how experts work, not just what they produced

Beyond Version Control

If we have full provenance, do we need separate issue trackers? Roadmap tools? Bug databases?

Issues are deferred problems. Our Problem/Solution commit format describes what’s wrong and how we fixed it. An issue is the same thing, minus the solution. What if “filing an issue” was just recording a problem in the provenance stream, tagged as unresolved? “Show me open issues” becomes a query: problems WHERE resolved = false.

Bugs are problems discovered after shipping. With provenance, root cause analysis is a temporal query: “This bug was introduced in session X, when the agent read files A and B but missed file C.” No archaeology required.

Roadmaps are predicted work. A roadmap is problems you expect to solve, ordered by priority. With provenance:

  • “What’s planned?” → unresolved problems, sorted
  • “What’s blocked?” → problems with unresolved dependencies
  • “What did we defer?” → problems tagged deferred

The roadmap emerges from querying the problem stream, not from a separate tool synced to your commits.

Issues, bugs, and roadmaps are all views on the same data: problems and their resolution status. Separate tools exist because version control doesn’t capture enough context. Fix that, and project management becomes a query language.
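
As a sketch, with the record shape assumed rather than specified, all three views reduce to filters over one stream:

problems = [
    {"text": "login 500s on expired token", "resolved": False,
     "priority": 1, "tags": [], "depends_on": []},
    {"text": "migrate sessions to JWT", "resolved": False,
     "priority": 3, "tags": ["deferred"], "depends_on": []},
]

open_issues = [p for p in problems if not p["resolved"]]    # the issue tracker
roadmap = sorted(open_issues, key=lambda p: p["priority"])  # the roadmap
deferred = [p for p in problems if "deferred" in p["tags"]]
blocked = [p for p in open_issues
           if any(not d["resolved"] for d in p["depends_on"])]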

Where We Are Now

We already have pieces of this in our practice:

  • Problem/Solution commit format preserves why alongside what
  • W-questions framework (who/what/where/when/why/how) is structured provenance
  • Co-Authored-By: Claude trailers trace AI involvement

What we’re missing:

  • Read tracing (what files informed the commit)
  • Task reformulation (how the agent understood the request)
  • Temporal queries (what was true when)

The path forward might be extending what we have rather than replacing it. Commits gain metadata. Reviews gain trace links. The knowledge graph emerges from accumulated practice.

Our workspace architecture is a stepping stone toward this future. We’ve built a Git-based system where documentation, multi-repo operations, and AI interactions all flow through version control. Claude merges PRs, reviews code, saves research: actions that leave traces in Git. But we’re hitting Git’s limits. It captures what changed, not the decision path that led there. When provenance-native version control arrives, workspaces like ours would be the first beneficiaries: every agent session, every handoff, every reformulation captured and queryable.


Git gave us a superpower: fearless change, because we could always go back. But “back” was about what changed, not why. As agents become collaborators, we need provenance: the full trace of observations, interpretations, and decisions that led to this line of code.

The tools we build next will determine whether agents remain black boxes that happen to write code, or become transparent collaborators whose reasoning we can inspect, trust, and improve.

