Autogenesis: the self-evolution paper that is actually about the substrate
Most self-evolution papers in this space have the same shape. There is a clever loop. The loop mutates a prompt, or a planner, or a tool selection policy. A scoring function decides which mutation survives. The system gets a little better on a benchmark and the paper claims something about emergent improvement. If you have read three of these, you have read all three of these.
The thing that almost none of them do is define the substrate. What is the thing being evolved? In what state is it stored? How do you roll back when the mutation makes the system worse? What does it even mean for a "tool" or an "agent" or a "prompt" to have a version? These questions sound like plumbing, and most papers treat them as plumbing, which means they get hand-waved or buried in the appendix. The actual loop is treated as the contribution.
This paper, Autogenesis, inverts that framing, and that is what made me sit with it. The loop in Autogenesis is almost banal. Three operators: propose, assess, commit. Nothing about the loop itself is new. What is new is that the paper insists on defining the substrate before the loop, and most of the design work is in the substrate. Every component of an agent system (prompts, tools, agents, environments, memory) is modeled as a versioned protocol resource with explicit state and lifecycle. Evolution happens at the resource level. Rollback is a first-class operation. There is auditable lineage on every change.
I read this once, was suspicious, read it again, and now I think the framing might be more important than the empirical results. The empirical results are fine. The framing is a thing I have wanted to exist for a while.
What the paper is proposing
The authors are Wentao Zhang, Zhe Zhao, Haibin Wen, Yingcheng Wu, Ming Yin, Bo An, and Mengdi Wang. The paper introduces the Autogenesis Protocol (AGP), which is split into two layers, and the Autogenesis System (AGS), which is a reference implementation of an agent system that runs on top of the protocol.
The two-layer split is the central design move:
- RSPL, the Resource Substrate Protocol Layer. Defines what counts as a resource (prompts, agents, tools, environments, memory). Each resource has an explicit state, a lifecycle (create, register, update, deprecate, retire), and a versioned interface. You can hold a reference to "agent v3" and the system knows what that means.
- SEPL, the Self Evolution Protocol Layer. Defines a closed loop of operators on resources. Propose generates a candidate change to a resource. Assess evaluates the candidate against criteria. Commit writes the new version into the registry with auditable lineage. There is also a rollback operation, which I will get to.
The authors describe the goal as "separating what evolves from how evolution occurs." Read that sentence twice. Most papers in this area collapse the two. The thing that gets evolved and the algorithm doing the evolving are wired into each other so tightly that you cannot inspect either independently. Autogenesis insists on the separation. RSPL describes the noun. SEPL describes the verb. A given system can swap evolution algorithms without touching its resources, and can swap resources without touching its evolution algorithm.
This is the bit that scratches an itch for me. I have looked at a lot of self-evolving agent code in the wild and almost all of it has this property where you cannot answer the question "what state is the system actually in right now" without reading the entire codebase. The state of the agent is implicit in the prompts, the prompts are implicit in some f-string somewhere, the tool list is hardcoded, and the memory is whatever the framework's session object happens to look like. There is no registry. There is no version. There is certainly no rollback.
Autogenesis is, very explicitly, the proposal that you should not run a self-evolving system without these primitives. The loop is the easy part. The substrate is the part everyone has been skipping.
How RSPL works (what I understood)
RSPL is the layer I sat with longest. Reading the README for the reference implementation helps: there are explicit modules for agent, tool, environment, memory, optimizer, tracer, and version. Each of the first four is a kind of resource. The latter three (optimizer, tracer, version) are the machinery that lets the resources evolve safely.
A resource has, at minimum:
- A type declaration. The system needs to know that this is an agent, not a tool, because the contracts are different.
- A state. Resources are stateful. Memory is the obvious one, but agents and tools also have state in the form of their current configuration, their current prompt, their current dependencies on other resources.
- A lifecycle. There is a moment a resource is created, a moment it is registered into the protocol, a window in which it is active, and an eventual deprecation or retirement. The lifecycle is what makes "agent v3 deprecated, agent v4 active" a meaningful sentence.
- A versioned interface. Other resources reference this one through the version, not through a Python pointer. If agent v4 changes its tool-call schema, the resources that depend on it know to either upgrade their reference or stay pinned to v3.
The thing I want to underline is that prompts are resources too. That sounds tiny but it changes a lot. In most agent code I have read, the prompt is a string in a config file or, worse, an f-string in the agent's run method. There is no way to ask "show me every prompt this agent has used" or "what changed between this run and the last one." When the prompt is a versioned resource, those become trivial queries. The prompt has a history, the prompt has a diff, the prompt has a parent.
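To make the "trivial queries" claim concrete, here is a toy sketch of what those queries look like once prompts are versioned records. The record shape is my own invention, not AGP's schema; the point is only that history and diff become one-liners.

```python
import difflib

# Hypothetical versioned prompt records; field names are my own, not AGP's.
prompt_history = [
    {"version": 1, "parent": None, "text": "You are a helpful coding agent."},
    {"version": 2, "parent": 1, "text": "You are a careful coding agent. Cite the file you edit."},
]

# "Show me every prompt this agent has used" is a walk over the history.
all_texts = [p["text"] for p in prompt_history]

# "What changed between this run and the last one" is a diff between versions.
diff = list(difflib.unified_diff(
    prompt_history[0]["text"].splitlines(),
    prompt_history[1]["text"].splitlines(),
    fromfile="prompt v1", tofile="prompt v2", lineterm="",
))
print("\n".join(diff))
```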
The same logic applies to memory. Memory in most agent frameworks is a session object, sometimes a vector store, sometimes both, almost always implicit. As an RSPL resource, memory has a schema, a version, and a lineage. You can fork a memory store, run an experiment on the fork, and either commit the changes back or throw the fork away.
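The fork-and-commit pattern is easy to sketch with a dict-backed store. This is my toy, not the paper's memory schema; the invariant it demonstrates is that the live store never sees the experiment unless you commit.

```python
import copy

# Hypothetical memory resource: a versioned, dict-backed store.
memory_v1 = {"version": 1, "facts": {"user_lang": "python"}}

# Fork: deep-copy the state so the experiment cannot touch the live store.
fork = copy.deepcopy(memory_v1)
fork["facts"]["experiment_note"] = "tried a new retrieval strategy"

# Either commit the fork back as a new version with lineage to the parent...
memory_v2 = {"version": 2, "parent": 1, "facts": fork["facts"]}

# ...or throw the fork away; the live store never saw the experiment.
assert "experiment_note" not in memory_v1["facts"]
```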
If you have ever worked with database migrations or with a package manager, the design vocabulary should feel familiar. RSPL is doing for agent components roughly what git did for code or what npm did for libraries. It is not a flashy contribution. It is the kind of contribution that a field needs before the flashy contributions can stop being one-offs.
How SEPL works (what I understood)
SEPL sits on top of RSPL and provides the operators. Three of them are central, and a fourth (rollback) is implied by the versioning.
Propose takes a current resource version and generates a candidate next version. The candidate is just another versioned resource, sitting in a kind of staging state. It has not been committed yet. It has a parent, and the lineage is recorded.
Assess takes the candidate and runs it through a set of evaluators. The paper is light on what counts as an evaluator, and I think this is deliberate. The evaluator is whatever the system designer plugs in. It could be a held-out benchmark, a self-play tournament, a critic agent, a hard-coded property test, or a regression check against the previous version.
Commit takes a candidate that passed assessment and registers it as the new active version of the resource. From this point forward, references to "the agent" resolve to the new version. The previous version is retained, deprecated rather than deleted, available for rollback.
Rollback is not framed as an operator in the same way the other three are, but it falls out of the design. Because every commit is versioned, reverting to a prior state is a registry operation. You do not need to "unwind" the change. You just point the reference back.
The reason the loop is banal is that this is roughly what every self-improvement paper does, just usually without the formality. What Autogenesis adds is that the operators always act on RSPL resources, never on raw Python objects or strings. The propose step cannot mutate state in place. It must produce a new versioned candidate. The assess step is run against the candidate, not the live system. The commit step is the only operation that touches the live system, and it is atomic at the registry level.
I keep coming back to the database analogy. This is essentially MVCC for agents. A self-evolution operation is a transaction. If it commits, the registry advances. If it does not commit, the live system is untouched. If it commits and then turns out to be a regression, you roll back to the parent.
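The transaction analogy can be made literal in a few lines. This is my own toy registry, not AGS code: versions are never overwritten, so both commit and rollback reduce to pointer moves on the active map.

```python
# Toy MVCC-style registry: a name maps to an active pointer plus full history.
history = {"agent": [{"version": 1, "prompt": "v1 prompt"}]}
active = {"agent": history["agent"][0]}

# Commit: append the new version, advance the pointer. Nothing is overwritten.
new = {"version": 2, "prompt": "v2 prompt"}
history["agent"].append(new)
active["agent"] = new

# Rollback after a regression: re-point to the retained parent version.
active["agent"] = next(v for v in history["agent"] if v["version"] == 1)
assert active["agent"]["prompt"] == "v1 prompt"
assert len(history["agent"]) == 2  # the bad version stays in the lineage
```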
What I do not (fully) understand
Several things.
The first is the scope of the propose operator. The paper describes propose as generating a candidate change, but it is unclear to me how much of the system the propose step is allowed to touch in a single proposal. Can a single propose operation produce a candidate that changes a prompt, adds a tool, and modifies the agent's planner all at once? Or is the protocol enforcing single-resource changes, where each proposal touches exactly one versioned resource? The reference implementation might tell me, but the paper itself does not seem to commit, and the answer matters a lot. Multi-resource proposals are more powerful but harder to assess. Single-resource proposals are tractable but might miss the kind of co-adapted changes that make self-evolution interesting.
The second is the assessment criteria. The paper says assess evaluates against performance criteria. Fine. But the most interesting evaluation question in self-evolving systems is not "did the new version do better on benchmark X" but "did it not regress somewhere I forgot to measure." A propose-assess-commit loop that only checks the headline benchmark will quietly trade off everything else. I cannot tell from the paper how much of this is left to the system designer (probably most of it) and how much is built into SEPL itself.
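One way a system designer could plug that concern into the assess step, sketched under my own assumptions about the evaluator interface: score the candidate and its parent on every tracked metric and fail the assessment if any metric regresses past a tolerance, not just the headline one.

```python
def regression_guard(candidate_scores: dict[str, float],
                     parent_scores: dict[str, float],
                     tolerance: float = 0.01) -> bool:
    """Reject a candidate that drops more than `tolerance` on ANY tracked
    metric, even if the headline metric improved. Hypothetical helper,
    not part of SEPL."""
    return all(candidate_scores[m] >= parent_scores[m] - tolerance
               for m in parent_scores)

# The headline benchmark improved, but a side metric quietly regressed.
parent = {"benchmark_x": 0.62, "latency_ok": 0.98, "safety_suite": 0.95}
candidate = {"benchmark_x": 0.66, "latency_ok": 0.97, "safety_suite": 0.80}
assert regression_guard(candidate, parent) is False  # safety regressed
```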
The third is the stability of the protocol under recursive self-modification. RSPL itself is implemented as code. SEPL itself is implemented as code. The paper is careful to say that the operators evolve resources, not the protocol. But in practice the line between "a tool that orchestrates evolution" and "a resource that is itself an agent component" gets blurry. If the optimizer is a resource (and it lives in src/optimizer/ in the reference implementation), can the optimizer optimize itself? If yes, the protocol has a meta-level it does not fully formalize. If no, that needs to be a hard rule somewhere in SEPL, and I did not see it stated as a rule.
The fourth is more of a quibble than a confusion. The paper uses the word "auditable lineage" repeatedly, which made me expect a formal definition of what a lineage record contains. There is a tracer module in the implementation, but the paper itself does not, as far as I can tell, define the lineage schema. Lineage in databases has a precise meaning. Lineage in this paper feels more like "we record things." That is fine for a v1, but the schema is the part that matters if you ever want to reason about what happened across many evolution steps.
Sketching the mechanism
Let me try to write down what I think the protocol means in code. This is not the actual implementation. It is my read of the protocol surface, sketched compactly so I can argue with it.
```python
from dataclasses import dataclass, field
from typing import Optional, Callable
from uuid import uuid4

# RSPL: resources are versioned, stateful, typed.
@dataclass
class Resource:
    type: str              # "prompt" | "agent" | "tool" | "environment" | "memory"
    name: str              # stable identity across versions
    version: int           # monotonically increasing per name
    state: dict            # opaque to the protocol, schema set by the type
    parent: Optional[str]  # version-id of the predecessor, None for genesis
    id: str = field(default_factory=lambda: str(uuid4()))

class Registry:
    """The single source of truth. Every resource lookup goes through here."""
    def __init__(self):
        self.active: dict[str, Resource] = {}         # name -> currently active version
        self.history: dict[str, list[Resource]] = {}  # name -> all versions, oldest first

    def register(self, r: Resource) -> None:
        self.history.setdefault(r.name, []).append(r)
        self.active[r.name] = r

    def rollback(self, name: str, to_version: int) -> None:
        prior = next(v for v in self.history[name] if v.version == to_version)
        self.active[name] = prior  # the old version is just re-pointed, not rebuilt

# SEPL: three operators, all routed through the registry.
@dataclass
class Candidate:
    parent: Resource
    proposed: Resource

def propose(current: Resource, mutator: Callable[[Resource], Resource]) -> Candidate:
    next_version = mutator(current)
    next_version.version = current.version + 1
    next_version.parent = current.id
    return Candidate(parent=current, proposed=next_version)

def assess(c: Candidate, evaluators: list[Callable[[Resource], float]]) -> dict:
    # Each evaluator returns a scalar; the system designer decides the aggregation.
    return {ev.__name__: ev(c.proposed) for ev in evaluators}

def commit(c: Candidate, scores: dict, registry: Registry, threshold: float) -> bool:
    if min(scores.values()) < threshold:
        return False  # the candidate fails assessment, registry is untouched
    registry.register(c.proposed)
    return True
```
Two things stand out when I write it down this way.
The first is that the propose operator is parameterized by an arbitrary mutator. The protocol says nothing about what the mutator does. It could be a language model rewriting a prompt. It could be a tree search over tool configurations. It could be a hand-coded heuristic. The substrate does not care, which is the point.
The second is that the registry is the only writable thing. The propose and assess operators do not touch state. Only commit does. This is what makes rollback trivial: the previous version was never overwritten, so reverting is just a pointer move on active.
You could plug almost any of the existing self-evolution algorithms into this surface. That is either a feature (the protocol is general) or a confession (the protocol does not constrain what evolution looks like, only how state is managed). I think it is mostly a feature. The papers that obsess over the loop have not, on the whole, produced agents that are meaningfully better than the baseline. The papers that get the substrate right have a chance of compounding.
What the results say
The empirical section evaluates AGS, the reference system, on GPQA (graduate-level reasoning), GAIA (general AI assistant tasks), LeetCode (coding), and SWE-bench (software engineering). The headline is consistent improvement over strong baselines across all four.
I want to be honest about how I read these numbers. I did not work through the appendix carefully enough to know which subsets of each benchmark were used, what the baselines actually were, or how many runs were averaged. The high-level pattern, that a self-evolving system beats a fixed-prompt baseline on tasks that benefit from iterative refinement, is the expected result for this kind of paper. I would be surprised if it did not hold.
What I find more interesting than the absolute numbers is what the framing of those numbers implies. Because the system is built on RSPL, the authors can in principle answer questions like "which resources changed during the run that produced this score" or "what was the lineage of the prompt at the moment of best performance." I did not see the paper exploit this affordance, and I think it is the most underused dimension of the work. A self-evolving system whose evolution is audited is not just a system that does better. It is a system whose improvement story is reproducible, which almost no self-evolving system in the wild can claim.
If I were to push the empirical work in one direction, it would be that. Not "AGS scores 4 points higher than baseline," but "here is the lineage of the resources that produced the best run, here is the diff of the prompts at each commit, here is the assessment record at each step." The protocol makes that paper writable. The paper they wrote does not write it.
What did land for me
A few things, in rough order of strength.
Resources as versioned protocol entities, not Python objects. This is the move I want to steal regardless of whether anyone ends up using AGP specifically. Even in a system that does not self-evolve, treating prompts and tool definitions as registered resources with versions and lineage solves a class of operational problems that most agent codebases just live with. "Why does this agent behave differently in production than in dev" usually has a one-line answer when you have the lineage and a multi-day investigation when you do not.
The propose-assess-commit operators as protocol primitives, not algorithm details. Pulling these out of the algorithm and into the protocol is the move that makes the whole thing composable. The same registry can host an evolutionary search, a critic-driven refinement, a manual override, all within the same lineage record. The operator surface is the boundary.
Rollback as a first-class affordance. I have written more agent rollbacks than I want to admit, and they are always ad hoc. The fact that rollback in AGP is just a registry pointer move is the kind of design choice that shifts what becomes thinkable. You can imagine a system that aggressively explores, commits things that turn out to be bad, and rolls back without losing the lineage. That is closer to how anyone actually deploys software, and very far from how anyone deploys agents today.
The phrase "separating what evolves from how evolution occurs." I want to put it on a sticker. It is the single most useful sentence I have read about agent system design this year, because it names a confusion I have watched a lot of teams walk into without seeing it. If you cannot say what the noun and the verb are independently, you do not have a self-evolving system. You have a tightly coupled blob whose behavior happens to drift over time.
What I think is missing
The protocol does not say very much about what makes an evolution step safe. It gives you the surface (propose, assess, commit) and the substrate (versioned resources), but the actual safety properties (does this commit break invariants? does it violate a permission? does it introduce a cycle in tool dependencies?) are left to the system designer to enforce inside the assess step. I think for v1 this is the right call, but it leaves the most interesting design questions on the table. What does a "type system for agent evolution" look like? What invariants would you want enforced at the protocol level rather than the application level?
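For a taste of what one protocol-level invariant could look like, here is a sketch (mine, not the paper's) of a commit-time check that a proposed change does not introduce a cycle in tool dependencies. AGP currently leaves this kind of check to the assess step.

```python
def has_cycle(deps: dict[str, list[str]]) -> bool:
    """Detect a cycle in a tool-dependency graph via DFS coloring.
    Hypothetical invariant check, not part of SEPL."""
    state: dict[str, str] = {}

    def visit(node: str) -> bool:
        if state.get(node) == "in_progress":
            return True   # back edge: we re-entered a node on the current path
        if state.get(node) == "done":
            return False  # already fully explored, no cycle through here
        state[node] = "in_progress"
        if any(visit(dep) for dep in deps.get(node, [])):
            return True
        state[node] = "done"
        return False

    return any(visit(node) for node in deps)

# A proposed commit makes search depend on summarize, which depends on search.
assert has_cycle({"search": ["summarize"], "summarize": ["search"]}) is True
assert has_cycle({"search": ["summarize"], "summarize": []}) is False
```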
The protocol is also silent on multi-agent evolution. AGS is a multi-agent system, and the paper hints that resources can evolve concurrently, but I did not see a clear story on what happens when two agents try to commit changes to the same shared resource at the same time. Database literature has answers for this (locking, optimistic concurrency, MVCC). The paper would benefit from picking one and stating it.
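The simplest of those answers, optimistic concurrency, would look roughly like this at the registry level. This is my sketch of one possible policy, not anything AGS specifies: a commit carries the version it expects to supersede, and the registry rejects it if another commit got there first.

```python
class ConflictError(Exception):
    """Raised when a commit's expected parent is no longer the active version."""

# Toy registry with compare-and-swap commits. Hypothetical, not AGS's API.
active_versions: dict[str, int] = {"shared_prompt": 3}

def commit_if_unchanged(name: str, expected_version: int, new_version: int) -> None:
    if active_versions[name] != expected_version:
        raise ConflictError(f"{name} moved past v{expected_version}; re-propose.")
    active_versions[name] = new_version

commit_if_unchanged("shared_prompt", 3, 4)      # agent A wins the race
try:
    commit_if_unchanged("shared_prompt", 3, 4)  # agent B's parent is now stale
except ConflictError:
    pass  # agent B must re-propose against v4
assert active_versions["shared_prompt"] == 4
```

The losing agent re-runs propose against the new version rather than clobbering it, which keeps the lineage a tree instead of a silent overwrite.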
And the protocol leaves the loop choice fully to the system designer, which I noted above as a feature, but it is also a hedge. The paper does not have a strong opinion on what kinds of mutators or evaluators work best. In a sense, that is fine: the contribution is the protocol. In another sense, the empirical section would be more interesting if it showed two completely different evolution algorithms running on the same RSPL substrate, with the substrate being the controlled variable. That would make the case for the protocol much harder to dismiss.
Where my head is at
The thing that pulled me into this paper was the framing, and the thing that kept me reading was that the framing was actually load-bearing in the design, not just a slogan in the abstract. I suspect the protocol will outlast the specific reference implementation, and I suspect the most useful version of this paper has not been written yet, where someone picks up RSPL/SEPL, plugs in an evolution algorithm that is not the authors' own, and shows that it composes. The paper currently couples the substrate and one specific system; the substrate's value comes through clearest when it is decoupled from any particular system at all.
More reading. I want to look at the reference implementation more carefully and see whether the lineage records are as rich as the paper implies. If they are, I think the substrate idea has legs. If they are not, the framing is still the contribution and the next paper is the one where someone takes the framing seriously.
References
- Autogenesis: A Self-Evolving Agent Protocol (arXiv:2604.15034) by Wentao Zhang, Zhe Zhao, Haibin Wen, Yingcheng Wu, Ming Yin, Bo An, Mengdi Wang. Defines the AGP protocol (RSPL plus SEPL) and describes AGS, a reference multi-agent implementation evaluated on GPQA, GAIA, LeetCode, and SWE-bench.
- DVampire/Autogenesis (github.com/DVampire/Autogenesis). The reference implementation. Modules of interest are src/version/ and src/tracer/ for the versioning and lineage machinery, and src/optimizer/ for the propose-assess-commit operators.