
HyperAgents: self-improving agents, or evolutionary prompt engineering?

Here is something I have been thinking about since the SkillNet entry. I spend a lot of my time being the improvement mechanism for my own agents. I build an agent. I watch it fail. I figure out why. I rewrite the prompt, restructure the tool selection, adjust the orchestration logic. Then I do it again. The agent does the task. I do the meta-task of making the agent better at the task.

So when a paper comes along claiming that agents can take over that meta-task, and not just take it over but improve the meta-task itself, I pay attention. The paper is called HyperAgents, from a team spanning UBC, the Vector Institute, and several Meta research groups including FAIR and the Superintelligence Labs. They extend the Darwin Gödel Machine with what they call "metacognitive self-modification." The pitch: agents that improve not only how they solve tasks, but how they improve.

I read it twice. The framing is genuinely exciting. But I am left with more questions than answers about what is actually happening under the hood. This is not a rigorous critique. It is me trying to figure out whether the mechanism lives up to the language around it, and being honest about the places where I cannot tell.


What they are proposing

The setup requires some background. The Darwin Gödel Machine (DGM) is a prior system from some of the same authors where a coding agent evolves through open-ended exploration. It generates modifications to its own code, evaluates the results, and keeps the best variants in an archive. Successful variants become stepping stones for future improvements. Think of it like evolutionary search applied to agent code.

The DGM works well for coding tasks. The catch is that the meta-level mechanism, the part that decides how to generate improvements, is handcrafted and fixed. A human designed the instruction-generation process. The agent can improve its task-solving code, but it cannot improve the process that generates those improvements. The ceiling is baked in by the human who wrote the outer loop.
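To keep the two levels straight in my head, here is roughly what I understand the DGM outer loop to be. This is my own sketch, not the paper's code; propose_modification, evaluate, and the uniform-random parent selection are stand-ins, and the real selection mechanism is more sophisticated than this.

import random
from pathlib import Path
from typing import Callable

def dgm_loop(
    initial_repo: Path,
    propose_modification: Callable[[Path], Path],  # handcrafted and frozen in the DGM
    evaluate: Callable[[Path], float],
    iterations: int = 100,
) -> Path:
    """Evolutionary search over agent code: keep variants in an archive
    so they can serve as stepping stones for later improvements."""
    archive: list[tuple[Path, float]] = [(initial_repo, evaluate(initial_repo))]
    for _ in range(iterations):
        parent, _ = random.choice(archive)    # parent selection (simplified)
        child = propose_modification(parent)  # the step the agent cannot edit
        archive.append((child, evaluate(child)))
    return max(archive, key=lambda pair: pair[1])[0]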

HyperAgents try to remove that ceiling. The idea: merge the task agent and the meta agent into a single editable program called a hyperagent. In the DGM, the coding agent is both the worker and the subject of improvement. In DGM-Hyperagents (DGM-H), the agent is the worker, the improver, and the subject of improvement. Everything is one Python codebase, and the agent can edit any part of it.
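Concretely, I picture the merge like this. A sketch of the structure, not the paper's actual code; fm_call stands in for whatever client talks to the frozen foundation model.

# agent.py -- the entire hyperagent lives in one editable file

def fm_call(prompt: str) -> str:
    """Stub for the frozen FM; in practice, an API client."""
    raise NotImplementedError

def solve_task(task: str) -> str:
    # The worker role. In the initial hyperagent this really is just
    # one raw FM call: no parsing, no post-processing.
    return fm_call(task)

def self_modify(source: str, past_results: list[dict]) -> str:
    # The improver role. It proposes a new version of this whole file,
    # and since this function is part of the file, it is also a valid
    # target of its own edits. That is the metacognitive part.
    return fm_call(
        f"Source:\n{source}\n\nResults:\n{past_results}\n\n"
        "Return an improved version of the full file."
    )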

They call this "metacognitive self-modification." The improvement mechanism is itself subject to improvement.

Here is the part that is important to understand: the foundation model is frozen. The underlying LLM does not change. What changes is the Python code that wraps the FM: the prompts, the control flow, the parsing logic, the tool selection, the memory structures. The "self" that is being improved is the code scaffold, not the neural network. I keep coming back to this because the paper's framing sometimes makes it sound like the agent is getting smarter in some deeper sense. It is not. It is editing its own wrapper.

That distinction matters and I will come back to it.


The thing I keep going back and forth on

The initial hyperagent is deliberately simple. The task agent makes a single FM call and returns the raw output. No parsing. No post-processing. No domain-specific logic. The meta agent gets the agent's repo, past performance results, and a budget of remaining iterations, and is told to modify whatever it wants.

On the paper review benchmark, this initial agent scores 0.0. Not "low." Zero. It fails to produce output in the required format. There is no format parsing, no extraction of an accept/reject decision, nothing. On robotics reward design, it scores 0.060; it occasionally stumbles into a reward function that compiles but rarely produces anything useful.

After 100 evolutionary iterations of DGM-H, paper review goes from 0.0 to 0.710 on the test set. Robotics reward design goes from 0.060 to 0.372. These numbers look dramatic. But here is what I keep getting stuck on: how much of that improvement is the agent discovering something new versus the evolutionary search surfacing things that a competent human engineer would have added in the first ten minutes?

The paper describes the qualitative improvements on paper review. The agent moved from superficial behavioral instructions (like "be rigorous") to explicit multi-stage evaluation pipelines with checklists, decision rules, and clearly defined criteria. On robotics, it built an internal knowledge base of environment constraints, valid state variables, and reward-scaling heuristics.
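To make "multi-stage evaluation pipeline" concrete, here is the shape I imagine the evolved review agent converging on. Entirely my guess: the stage names, criteria, and parsing rule are invented, not taken from the paper.

from typing import Callable

def review_paper(paper: str, fm_call: Callable[[str], str]) -> str:
    """Guessed structure of an evolved reviewer: explicit criteria,
    staged assessment, and the format parsing the initial agent lacked."""
    criteria = ["soundness of claims", "novelty", "experimental rigor"]
    stage_notes = [
        fm_call(f"Assess this paper on {c}. Cite specific passages.\n\n{paper}")
        for c in criteria
    ]
    decision = fm_call(
        "Given these assessments, answer with exactly ACCEPT or REJECT:\n\n"
        + "\n\n".join(stage_notes)
    )
    # The extraction step whose absence made the initial agent score 0.0
    return "ACCEPT" if "ACCEPT" in decision.upper() else "REJECT"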

These sound like good engineering practices. Not novel discoveries. A human building a paper review agent would start with format parsing, then add structured evaluation criteria, then iterate on decision thresholds. The agent arrived at similar patterns through 100 iterations of evolutionary search. Is that self-improvement or is it a very expensive way to do what an experienced engineer would do in an afternoon?

I genuinely do not know the answer. Part of me thinks: "Of course the FM already knows about checklists and multi-stage pipelines. It has seen thousands of engineering blog posts about those patterns. The evolutionary search is just eliciting patterns the model already has." Another part thinks: "But the agent figured out which patterns to apply, in which order, and how to compose them โ€” without being told. The search through the space of possible modifications is non-trivial even if each individual modification is something the FM could generate."

I keep going back and forth on this. The framing says "self-improvement." The mechanism looks more like "evolutionary prompt and code engineering with a frozen model as the generator."


What the numbers actually say (and do not say)

Let me walk through the results more carefully because the headline numbers are more nuanced than they first appear.

Coding (Polyglot benchmark). DGM-H improves from 0.140 to 0.340 on the training subset and from 0.084 to 0.267 on the full benchmark. The original DGM achieves comparable numbers (0.140 to 0.380 on training). So on the domain the DGM was specifically designed for, DGM-H roughly matches it despite not being handcrafted for coding. That is interesting but not a huge win.

Paper review. 0.0 to 0.710 on the test set (CI: 0.590–0.750). Several top-performing agents outperform the AI-Scientist-v2 static baseline, which scores 0.630. This is the strongest result. But note the confidence interval width. And the starting point of zero is artificially low because of the deliberately minimal initial agent.

Robotics reward design. 0.060 to 0.372 (CI: 0.355–0.436). Surpasses the default reward function baseline (0.348). Modest but real.

DGM-H vs. DGM-custom. This is the comparison I find most important. DGM-custom is the original DGM with its instruction-generation mechanism manually customized for each domain by humans. DGM-H achieves higher median performance than DGM-custom on the test tasks in both paper review and robotics. But, and this is a big but, the difference is not statistically significant (p > 0.05). The paper says this directly, and I appreciate the honesty. But it means we cannot confidently say that automated metacognitive self-modification outperforms human-designed self-improvement mechanisms. It might. Or it might be roughly equivalent. We do not have enough signal to tell.
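For my own calibration on what "not statistically significant" buys us: with per-run test scores for both systems, the check is something like the rank test below. This is my assumption, not the paper's stated procedure; they may well use a different test.

from scipy.stats import mannwhitneyu

def medians_distinguishable(dgmh_scores: list[float],
                            custom_scores: list[float]) -> bool:
    """Two-sided rank test on per-run scores. p > 0.05 means we cannot
    reject 'roughly equivalent', which is where the paper lands."""
    _, p_value = mannwhitneyu(dgmh_scores, custom_scores,
                              alternative="two-sided")
    return p_value <= 0.05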

The ablation results. These are actually the cleanest part of the paper. DGM-H without self-improvement (the meta agent held fixed) shows little to no improvement. DGM-H without open-ended exploration (no archive; each new variant replaces the old) also fails to sustain progress. Both components are necessary. You need the evolutionary archive AND the ability to modify the improvement mechanism. Removing either collapses performance. This is a real empirical finding that I do not think is trivially obvious.

Transfer. Meta-level improvements from paper review and robotics transfer to a new domain (Olympiad-level math grading). Transfer hyperagents achieve imp@50 of 0.630, compared to 0.0 for DGM transfer agents and 0.0 for the initial meta agent. That is a large and statistically significant difference. The transfer agents picked up general-purpose meta-level capabilities, things like persistent memory and performance tracking, that remain useful in a new domain.

But here is my question about the transfer result: is transferring "persistent memory" and "performance tracking" really transferring meta-level intelligence, or is it transferring generic software engineering infrastructure? A performance tracker is useful in any domain. Logging your results and comparing them across iterations is basic engineering. The fact that the agent independently discovered the need for it is interesting. But the transfer might be less "the agent learned how to improve" and more "the agent learned to use common engineering tooling."


What I do not understand

I want to be explicit about where my understanding breaks down.

Where does genuinely new capability come from? The FM is frozen. The agent edits Python code. The modifications are generated by the frozen FM. So every modification the agent makes is something the FM was already capable of generating. The evolutionary search explores the space of FM-generatable code modifications and selects the ones that improve benchmark scores. But the ceiling is bounded by what the FM can produce. Is there a mechanism for the agent to discover something genuinely beyond the FM's existing knowledge? I do not see one, and the paper does not address this directly.

Maybe the argument is that the FM can produce any individual modification, but composing 100 sequential modifications creates something no single FM call could produce. The whole is greater than the sum of the parts because each step builds on the scaffold created by previous steps. That makes sense theoretically. But it also means the "self-improvement" is really "sequential code refinement through evolutionary search," which is a less dramatic claim than the paper's framing implies.

The outer loop is not actually self-modifiable. The parent selection mechanism, how the archive chooses which hyperagent to evolve next, is fixed and handcrafted in the main experiments. The evaluation protocol is fixed. The paper acknowledges this and presents preliminary results on modifying parent selection in an appendix, but for the core experiments, the "fully self-referential" framing is aspirational. The agent can modify how it solves tasks and how it generates task-solving modifications. But it cannot modify how it is selected, evaluated, or replaced. That is a meaningful limitation on the "metacognitive" claim.

Overfitting to evaluation. The agent improves on a fixed set of evaluation tasks. How do we know it is genuinely generalizing versus overfitting to the evaluation distribution? The paper uses held-out test sets, which helps. But the evaluation tasks within a domain are presumably similar in structure. An agent that learns to game the evaluation criteria, say by optimizing for the specific format the evaluator expects rather than for actual quality, would show improvement on both training and test sets without genuinely getting better at the underlying task. The paper does not discuss this risk.

Cost. Running 100 evolutionary iterations, where each iteration involves FM calls to generate code modifications, sandbox execution to test them, and FM calls to evaluate them, is not cheap. The paper mentions cost estimates in an appendix, but the main text does not compare the cost of DGM-H improvement against the cost of a human engineer achieving similar results. If DGM-H costs $500 in API calls to reach what a human could build in a day, the value proposition changes. If it costs $5, that is a different story. I did not find this comparison in the paper and I think it matters a lot for whether this approach is practical.
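The back-of-envelope I would want to see, with numbers I am inventing purely for illustration (none of these come from the paper):

# All figures below are placeholders, not the paper's cost estimates.
iterations = 100
fm_calls_per_iteration = 5        # generate + evaluate, guessed
tokens_per_call = 20_000          # prompts carry the full repo plus history
usd_per_million_tokens = 5.00     # placeholder API price

total_tokens = iterations * fm_calls_per_iteration * tokens_per_call
print(f"~${total_tokens * usd_per_million_tokens / 1_000_000:.0f} in API calls")
# ~$50 under these assumptions -- cheap against a day of engineering time,
# but the real answer depends entirely on the actual token counts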


What did land for me

I do not want to come across as purely skeptical. There are genuinely interesting things here that survived my attempts to explain them away.

The ablation structure is convincing. The fact that you need both metacognitive self-modification and open-ended exploration, and that removing either one collapses the result, tells me something real is happening beyond simple evolutionary search. If it were just brute-force search over code modifications, you would expect the archive (open-ended exploration) to be sufficient on its own. The fact that you also need the self-modifying meta agent suggests that the quality of the search improves over time, not just the quantity of variants explored.

The emergent persistent memory is the result I keep coming back to. The paper shows a PerformanceTracker class that the agent autonomously developed: logging metrics across generations, computing trends, storing synthesized insights and forward-looking plans. Nobody told it to build memory. The evolutionary pressure drove it to implement a capability that, in hindsight, obviously helps. But "obvious in hindsight" is different from "trivially generated." The agent had to figure out that persistent memory was the bottleneck, design a solution, implement it in code, and have that implementation survive selection. That sequence is non-trivial even if each individual step draws on the FM's existing knowledge.
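The paper shows the class the agent wrote; I have not reproduced it here. Below is my own reconstruction of what such a tracker plausibly contains, with invented method names, just to fix the idea.

import json
from pathlib import Path

class PerformanceTracker:
    """Reconstruction (not the paper's code): per-generation metrics plus
    free-text insights that persist across self-modification cycles."""

    def __init__(self, store: Path = Path("memory.json")):
        self.store = store
        self.state = (
            json.loads(store.read_text()) if store.exists()
            else {"generations": [], "insights": []}
        )

    def log_generation(self, gen: int, score: float, notes: str) -> None:
        self.state["generations"].append({"gen": gen, "score": score, "notes": notes})
        self._save()

    def trend(self, window: int = 5) -> float:
        """Mean score over the most recent generations."""
        recent = [g["score"] for g in self.state["generations"][-window:]]
        return sum(recent) / len(recent) if recent else 0.0

    def add_insight(self, hypothesis: str) -> None:
        # e.g. "gen 41 over-corrected: stricter checklist hurt recall"
        self.state["insights"].append(hypothesis)
        self._save()

    def _save(self) -> None:
        self.store.write_text(json.dumps(self.state, indent=2))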

The persistent memory entries themselves are interesting. Rather than just logging numbers, the hyperagent stores causal hypotheses about why certain generations performed well, diagnoses over-corrections, and proposes combination strategies. It looks less like a log file and more like an engineer's notebook. I am not sure whether this is "genuine reasoning about improvement" or "the FM generating plausible-sounding engineering notes because that is what it has been trained on." But the functional outcome is the same either way: agents with this memory improve faster than agents without it.

The transfer result is real even if my explanation for it is deflationary. Transfer hyperagents produce imp@50 of 0.630 on math grading (a domain they were never optimized for), while DGM transfer agents achieve 0.0. Even if the transferred capabilities are "just" generic software engineering patterns like performance tracking and persistent memory, the fact that the DGM does not transfer those same patterns while DGM-H does tells me that the metacognitive self-modification process produces something qualitatively different from domain-specific optimization.


Sketching the mechanism to understand it

I wrote some code to try to make the mechanism concrete. This is not from the paper. It is me trying to understand what "a program that edits itself through a frozen FM" actually looks like in practice.

import shutil
import tempfile
from pathlib import Path
from typing import Callable

def clone_and_patch(agent_repo: Path, new_agent_code: str) -> Path:
    """Copy the repo and install the FM-proposed code as a new variant.
    (Simplified: the real system would apply a diff rather than
    overwrite the whole file.)"""
    new_repo = Path(tempfile.mkdtemp(prefix="hyperagent-"))
    shutil.copytree(agent_repo, new_repo, dirs_exist_ok=True)
    (new_repo / "agent.py").write_text(new_agent_code)
    return new_repo

def self_modify(
    agent_repo: Path,
    past_results: list[dict],
    fm_call: Callable[[str], str],
) -> Path:
    """The core loop: read your own code, propose edits, test them."""

    # Read the agent's own source
    agent_code = (agent_repo / "agent.py").read_text()

    # Ask the frozen FM to propose modifications.
    # This is where the "self-improvement" actually happens:
    # an FM call that takes (my code + my past results) and outputs new code.
    modification = fm_call(
        f"""Here is an agent's source code:
{agent_code}

Here are its recent evaluation results:
{past_results}

Propose a modified version of the full file to improve performance.
You may modify ANY part of the code, including this
modification-generation logic itself."""
    )

    # Materialize the proposal as a new variant for the outer loop to evaluate
    return clone_and_patch(agent_repo, modification)

Staring at this sketch, the thing that strikes me is how much work the frozen FM is doing. The "self-modification" is really "ask an FM to suggest code changes." The agent does not have a gradient signal, does not do backpropagation, does not learn in any mechanistic sense. It asks a very capable language model "what should I change?" and the FM answers based on its training data. The evolutionary loop provides selection pressure, but the generation of improvements is entirely dependent on the FM's pre-existing capabilities.

This is why I struggle with the "self-improvement" framing. The agent is not improving itself. The FM is improving the agent's code, and the evolutionary loop selects which FM-generated improvements survive. The FM is the actual improvement engine. The evolutionary search is the selection engine. The "self" in "self-improvement" is a poetic choice, not a mechanistic description.

Or maybe I am being too reductive. Maybe the accumulated context (past results, current code state, persistent memory) constitutes a form of learning that is genuinely different from any single FM call. Each modification is informed by the entire history of previous modifications. The FM is the same, but the context it operates on grows and refines over time. Is that self-improvement? I do not know. It depends on where you draw the line between "using a tool more effectively" and "the tool improving itself."


The safety section is more interesting than I expected

Most papers tack on a safety section as a formality. This one actually engages with a hard problem: what happens when a system can modify the mechanisms that govern its own improvement?

The paper is honest that current safeguards (sandboxing, timeouts, restricted internet access, human oversight) may become strained as self-improving systems grow more capable. They frame the tension well: how do you balance the potential for autonomous scientific discovery against the risk that a system evolves faster than humans can audit?

What caught my attention is that this is not hypothetical. The DGM-H already modifies its own improvement process. If you scaled this to more powerful FMs, more iterations, and less constrained environments, you would have a system that is continuously evolving its own behavior in ways that are only retrospectively interpretable. The paper waves at this but does not propose solutions beyond "more research needed." Fair enough; I do not have solutions either. But I appreciate that they name the problem clearly instead of pretending the sandboxed experiments resolve it.

The outer loop being fixed in the main experiments is actually a safety feature, not just a limitation. By keeping parent selection and evaluation protocols handcrafted and immutable, the researchers maintained a killswitch. The agent can modify anything about how it solves tasks and how it generates improvements, but it cannot modify how it is evaluated or selected. If you removed that constraint, which is the paper's stated aspirational goal, you would be in genuinely uncharted territory.
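The boundary, as I picture it: everything inside the repo is fair game, everything in the host loop is not. My sketch, with invented names; spawn_child stands for a sandboxed run of the repo's own self_modify.

from pathlib import Path
from typing import Callable

def outer_loop(
    archive: list[tuple[Path, float]],
    select_parent: Callable[[list[tuple[Path, float]]], Path],  # immutable
    spawn_child: Callable[[Path], Path],   # sandboxed self_modify run
    evaluate: Callable[[Path], float],     # immutable evaluation protocol
    budget: int,
) -> list[tuple[Path, float]]:
    """The part the hyperagent never sees and cannot edit. Removing this
    boundary is the paper's aspirational goal -- and the uncharted part."""
    for _ in range(budget):
        parent = select_parent(archive)
        child = spawn_child(parent)
        archive.append((child, evaluate(child)))
    return archive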


Where my head is at

I am less sure about this paper than I was about SkillNet. With SkillNet, the gap between the framing and the mechanism was small. They proposed a skill package manager for agents and built something that looks like a skill package manager for agents. With HyperAgents, the gap is wider. They propose self-referential agents that improve their own improvement mechanisms. What they built is an evolutionary search system that uses a frozen FM to generate code modifications and selects the ones that improve benchmark scores.

Those might be the same thing described at different levels of abstraction. Or the framing might be doing more work than the mechanism. I genuinely cannot tell yet.

What I do think is real: the empirical results. Something useful happens when you let an FM iteratively edit agent code with evolutionary selection pressure. The ablations show that both the open-ended archive and the self-modifiable meta agent contribute. The transfer results show that whatever the agents learn about how to improve is at least partially general. These are real findings regardless of whether you call them "metacognitive self-modification" or "evolutionary prompt and code engineering."

What I am not sure about: whether there is a ceiling. If the FM is frozen and the agent can only produce modifications the FM is capable of generating, then the space of achievable improvements is bounded by the FM's knowledge. You can explore that space more efficiently with evolutionary search, but you cannot exceed it. The paper does not address this because the experiments are short enough that the ceiling has not been hit. But it matters for the larger claim about "potentially unbounded self-improvement."

I want to sit with this longer before deciding how it fits into my mental model alongside SkillNet and the Context Platform work. The SkillNet question was about composability of what agents do. The HyperAgents question is supposed to be about composability of how agents improve. But I am not convinced the paper has shown the latter rather than a sophisticated version of the former.

More reading, more thinking. I will probably come back to this after I have spent some time with the code.

