
Diversity Collapse: when more agents and tighter coordination produce fewer ideas

There is a pattern that anyone running multi-agent ideation pipelines has probably hit. You give five agents a prompt and ask them to brainstorm. You get five flavors of the same idea. You scale to fifteen agents, the variance does not actually grow. You add a "critic" or a "supervisor" agent and somehow the spread of ideas tightens further. The outputs are not bad. They are not low quality. They are just samey.

This paper puts a name on it: diversity collapse, driven by something the authors call structural coupling. The work is from a team led by Nuo Chen and Bingsheng He at NUS, with co-authors across a few Singapore labs, and it is in ACL 2026 Findings.

The headline argument is the one that made me read the paper twice. The bottleneck is not the model. It is the wiring. Stronger models do not fix it. More agents do not fix it. Tighter coordination actively makes it worse. The diversity loss comes from the structure of the multi-agent system itself: the model priors, the role hierarchy, and the communication topology compound into a process that progressively narrows what the group can produce.

I want to flag at the start that I do not love every part of this paper. The framing is sharp, the empirical setup is solid in places and weak in others, and the term "structural coupling" gets stretched in ways I do not fully buy. But the core measurement, the thing where they show that authority-driven groups produce less diverse ideas than junior-dominated groups, is the kind of result that I cannot un-see. And it explains something a lot of people have been quietly accepting as "just how multi-agent systems work."


What the paper is proposing

The setup is open-ended idea generation with multiple LLM agents. Pick a prompt like "propose research directions for sustainable urban transportation" or "generate startup ideas in the climate adaptation space." Spin up some number of agents, give them roles, let them talk to each other for some number of rounds, and collect the resulting set of ideas. Measure how diverse the resulting set is.

The expectation, which I think most people who build with agents share, is that more agents means more ideas means more diversity. Bigger model means better individual ideas, which lifts the diversity floor. Tighter coordination (everyone reads everyone else's outputs each round) means the agents can build on each other instead of duplicating work. All three of those expectations turn out to be wrong, or at least very oversold.

The paper structures its findings across three levels:

Model level. They run the same multi-agent ideation pipeline with different base models (a mix of frontier and mid-tier, the appendix has the full list). The finding is what they call the compute efficiency paradox: more capable, more aligned models produce slightly higher per-sample quality but diminishing marginal gains in diversity. The diversity curve flattens hard. Their reading is that alignment training pulls models toward modal answers, and modal answers from a powerful model and modal answers from a less powerful model end up overlapping more than they differ.

Cognition level. They simulate group structures with explicit role hierarchies. A "senior" role (framed as the expert, the lead, the one with authority) and a "junior" role (framed as the new hire, the curious outsider). The finding is that authority-driven groups (senior roles dominant in the conversation graph) produce significantly less semantically diverse idea sets than junior-dominated groups, or even mixed groups with no role weighting. The seniors anchor the discussion early and the juniors anchor onto them.

System level. They vary group size and communication topology. Group size scales with diminishing returns: going from five to ten agents does not double the diversity. Going from ten to twenty barely moves it. Communication topology matters more than size: dense topologies (everyone reads everyone) produce premature convergence, and sparse topologies (small subgroups, occasional information flow) preserve diversity longer.

The umbrella term is structural coupling. Their argument is that the three levels are not independent. Model alignment couples agents to a shared prior. Authority coupling propagates that prior through the role hierarchy. Topological coupling propagates it through the communication graph. Each layer reinforces the next. The collapse is not caused by any single mechanism, it is caused by the way the mechanisms compound.

This is the move I find most interesting and also where I have the most reservations. "Structural coupling" is doing a lot of conceptual work in the paper, and I am not sure all of it is earned.


How the three levels work (what I understood)

The compute efficiency paradox

The paper measures diversity using a couple of metrics. The main one is a semantic diversity score computed from sentence embeddings of the generated ideas, basically the average pairwise distance in embedding space. They also report a lexical diversity score (vocabulary richness, type-token ratios) and a category-level diversity score (how many distinct topical clusters the ideas fall into when you cluster the embeddings).
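
To make the main metric concrete, here is a minimal sketch of an embedding-based diversity score plus a crude lexical one. This is not the paper's implementation: the embedding model name is a stand-in for whichever open sentence-transformer variant they used, and the type-token ratio is the simplest possible version of the lexical metric.

from itertools import combinations

import numpy as np
from sentence_transformers import SentenceTransformer


def semantic_diversity(ideas: list[str]) -> float:
    # Stand-in embedding model, not necessarily the paper's choice.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(ideas, normalize_embeddings=True)
    # Mean pairwise cosine distance: higher = more spread out.
    dists = [1.0 - float(np.dot(emb[i], emb[j]))
             for i, j in combinations(range(len(ideas)), 2)]
    return float(np.mean(dists))


def lexical_diversity(ideas: list[str]) -> float:
    # Crude type-token ratio over the pooled idea text.
    tokens = " ".join(ideas).lower().split()
    return len(set(tokens)) / len(tokens)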

When they swap in stronger models, per-sample quality (rated by a separate evaluator agent, with human spot-checks) goes up. Semantic diversity goes up too, but the slope is shallow. Category-level diversity barely moves. Their interpretation is that stronger models are better at producing high-quality versions of the same idea categories, not at proposing new categories. If your weaker model proposes "electric scooters, bike lanes, electric scooters again, electric scooters with better batteries", your stronger model proposes "well-designed electric scooter networks, considered bike lane policy, electric scooters with optimal battery chemistry." Better outputs, same categories.

I find this empirically convincing. It also matches a thing that shows up casually in any side-by-side comparison of ideation runs: swapping a mid-tier model for a frontier model on the same prompt makes the individual ideas better written but does not actually open up new conceptual territory. It is the difference between hiring a more articulate version of the same person versus hiring someone with a different background.

Authority-driven dynamics

This was the section that landed hardest for me. They run controlled experiments where the only thing that varies is the role distribution in the multi-agent group. Senior-dominant groups (say, three seniors and two juniors, with the seniors getting more turns and longer context windows) versus junior-dominant groups (three juniors and two seniors, juniors get more airtime). Same models. Same prompts. Same number of rounds.

The senior-dominant groups produce ideas that cluster tighter in embedding space. Less variance, more agreement, faster convergence on a small set of themes. The junior-dominant groups produce wider distributions. The mechanism the paper proposes (and they have some ablations to back it up) is that the senior agents' first contributions act as anchors. Subsequent contributions from any agent get pulled toward those anchors because the rest of the group is conditioning its outputs on what was said before, and the senior turns are the highest-status turns in that history.

The cleanest version of this experiment is the one where they hold everything constant except the labeling of which agent is "senior". Same models, same prompts, same conversation history, only the role tag changes. The diversity gap persists. So it is not that senior-tagged agents are producing different content from a different distribution, it is that the group dynamics of treating one agent as authoritative reshape the conversation around it.
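
A sketch of how I read that ablation, with prompt wording that is entirely my invention: the role line is the only token that differs between conditions.

ROLE_LINES = {
    # Hypothetical framings; the paper's exact wording is not reproduced here.
    "senior": "You are the senior lead researcher on this team.",
    "junior": "You are a junior researcher who just joined this team.",
}


def build_turn_prompt(role_tag: str, task: str, shared_history: str) -> str:
    # Everything except ROLE_LINES[role_tag] is held constant across
    # conditions, which isolates the effect of the label itself.
    return (f"{ROLE_LINES[role_tag]}\n\n"
            f"Task: {task}\n\n"
            f"Discussion so far:\n{shared_history}\n\n"
            f"Contribute one new idea.")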

This explains a pattern that shows up everywhere in agent design. A lot of multi-agent setups in the wild have a "lead researcher" or "principal" agent that drafts the initial framing, and then secondary agents that critique or extend. The conventional reading of that pattern is that it is good design: clear authority gradient, fewer redundant outputs, faster convergence. The paper argues that the same pattern is exactly what kills the diversity you would actually want from the brainstorm phase.

Group size and topology

This is the system-level finding and the one I had the most prior intuition about, so it landed less surprisingly but still cleanly.

Group size scales sub-linearly with diversity. They show curves for group sizes from three to forty. The diversity score climbs from three to about ten, plateaus from ten to twenty, and barely moves from twenty to forty. This matches what I see in practice: there is a regime where adding more agents helps and a regime where you are just paying inference cost for redundant ideas.

The topology finding is the more interesting one. They compare four topologies: fully connected (everyone reads everyone every round), star (one central agent that everyone talks to), small-world (clusters of tightly connected agents with sparse inter-cluster edges), and isolated (no inter-agent communication, just independent generation followed by aggregation).

The diversity ranking is roughly: isolated > small-world > star > fully connected. The fully connected topology, which is what most multi-agent frameworks default to, produces the least diverse outputs. The isolated baseline, where the agents do not talk to each other at all, produces the most. And the small-world topology, which is the middle ground I find most plausible for real systems, does almost as well as isolated but with the benefit of allowing some real coordination.

This is uncomfortable for the field. The whole pitch of "multi-agent" is that the agents talk to each other. The paper is showing that for ideation specifically, the talking is the problem.


What I do not (fully) understand

I want to be specific about the gaps because the paper is doing a lot of work and not all of it is equally rigorous.

The metric for semantic diversity is embedding distance, and the embedding model they use is one of the open sentence-transformer variants. They acknowledge this in a footnote and run a small ablation with a different embedding model, but the main results all use the one model. Embedding-based diversity is a proxy. It picks up surface variation in how ideas are phrased, and it picks up some categorical variation, but it conflates the two. I cannot tell from the paper how much of the reported "diversity collapse" is genuine conceptual collapse versus convergence on a shared phrasing style. The category-level metric helps, but their clustering is also embedding-based, so it inherits the same biases.
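
One cheap probe of that conflation worry, reusing the semantic_diversity sketch from earlier (the example ideas are mine, and I have not run this against the paper's setup):

# Two phrasings of one idea versus two genuinely different ideas.
paraphrases = [
    "Build protected bike lanes along arterial roads.",
    "Construct shielded cycling corridors on major streets.",
]
cross_category = [
    "Build protected bike lanes along arterial roads.",
    "Congestion-price the downtown core during peak hours.",
]

# If these two scores come out close, the metric rewards phrasing
# variation about as much as conceptual variation, which is the worry.
print(semantic_diversity(paraphrases))
print(semantic_diversity(cross_category))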

The formal definition of structural coupling is the place where the paper gets handwavy. They define it informally as "the process by which agent interactions inadvertently restrict the exploration space" and gesture at three coupling mechanisms (model, cognition, system). They do not give a single formal definition that would let you measure structural coupling on a new system. What they have is three separate measurements and an argument that they correlate. That is fine, but the language of the paper sometimes implies that "structural coupling" is a single quantity, and I do not think they have shown that.

The model coverage is also narrower than the framing suggests. They run experiments on a handful of models, mostly closed-API frontier models plus a couple of open-weights mid-tier models. The compute-efficiency-paradox claim is general but the data is from a small sample. I would want to see this with more open models, especially the recent ones with weaker alignment training, before I really believe the alignment story.

The authority-driven finding is, to me, the most empirically solid result in the paper, but I am not sure the cognition framing is the right one. They are showing that role-conditioning in the prompt changes group dynamics. Calling that "cognition" suggests something psychological is happening inside the model. I think what is actually happening is more mechanical: role tags shift attention patterns over the conversation history. That is interesting either way, but the cognition framing is doing rhetorical work that I do not think the experiments fully justify.

I followed the experimental setup. I followed the conclusions. The handwave is at the level of theory, where the three findings get bundled into a single conceptual frame. The findings stand on their own. The frame is more of a vibe.


Sketching the structural coupling

I wrote a small toy version of the multi-agent ideation loop to make the topology effect concrete. This is not from the paper. It is me trying to nail down what they are actually varying.

import asyncio
from dataclasses import dataclass, field
from typing import Literal


@dataclass
class Agent:
    name: str
    role: Literal["senior", "junior", "neutral"]
    history: list[str] = field(default_factory=list)


Topology = Literal["full", "star", "small_world", "isolated"]


def neighbors(agents: list[Agent], topology: Topology, idx: int) -> list[int]:
    n = len(agents)
    if topology == "isolated":
        return []
    if topology == "full":
        return [j for j in range(n) if j != idx]
    if topology == "star":
        return [0] if idx != 0 else list(range(1, n))
    if topology == "small_world":
        # Tight clusters of 3, with one bridge edge from each cluster's
        # first member to the first member of the next cluster (wrapping).
        cluster = idx // 3
        local = [j for j in range(n) if j // 3 == cluster and j != idx]
        bridge_to = (cluster + 1) * 3 % n
        # Guard against tiny n, where the wrap can point back at the
        # agent itself or at a neighbor already in its cluster.
        bridge = ([bridge_to]
                  if idx % 3 == 0 and bridge_to != idx and bridge_to not in local
                  else [])
        return local + bridge
    return []


def role_weight(role: str) -> float:
    # Senior turns get more weight in the shared context.
    return {"senior": 2.0, "junior": 1.0, "neutral": 1.0}[role]


async def run_round(agents: list[Agent], topology: Topology, prompt: str):
    new_outputs = []
    for i, agent in enumerate(agents):
        peer_history = []
        for j in neighbors(agents, topology, i):
            peer = agents[j]
            for turn in peer.history:
                # Senior turns appear with higher effective weight.
                peer_history.extend([turn] * int(role_weight(peer.role)))

        context = "\n".join(peer_history)
        idea = await llm_call(prompt, context, role=agent.role)
        new_outputs.append((i, idea))

    for i, idea in new_outputs:
        agents[i].history.append(idea)


async def ideation_run(prompt: str, n_agents: int,
                       topology: Topology, rounds: int,
                       senior_share: float):
    n_seniors = int(n_agents * senior_share)
    agents = [
        Agent(name=f"a{i}", role="senior" if i < n_seniors else "junior")
        for i in range(n_agents)
    ]
    for _ in range(rounds):
        await run_round(agents, topology, prompt)
    return [a.history[-1] for a in agents]

Three things become obvious once you write it down:

The neighbors function is the entire topology effect. Switching from "full" to "isolated" is a one-argument change, and it takes the diversity from "lowest" to "highest" in the paper's experiments. The mechanism is not subtle. It is literally how much of the conversation history each agent conditions its next output on. More history means more anchoring on prior outputs, means more convergence.

The role_weight function is the cognition effect. The seniors do not produce different content per turn, they just take up more of the shared context. Their outputs get repeated more in the input that other agents see. So later agents condition more heavily on senior turns, and the conversation drifts toward whatever the seniors said first.

The model effect is hidden inside llm_call. Stronger and more aligned models produce outputs that sit closer to the modal output for the prompt, so the early turns of the conversation are already less diverse before any of the structural mechanisms kick in. The structural mechanisms then compress an already-narrow distribution further.

When I look at this code, the appeal of the term "structural coupling" gets clearer. The three effects stack. A frontier model produces narrow priors. The senior weighting concentrates those priors in a few turns. The dense topology broadcasts those concentrated turns to everyone. By round three, you are not running an ideation pipeline, you are running a consensus-formation pipeline. And consensus is the opposite of what you wanted.
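
To watch the knob move, here is a small driver that sweeps the topologies, pairing the loop above with the semantic_diversity sketch from earlier. With the stub llm_call the printed numbers are meaningless; the point is the harness shape, where topology is the only thing that varies.

async def compare_topologies(prompt: str) -> None:
    for topo in ("isolated", "small_world", "star", "full"):
        ideas = await ideation_run(prompt, n_agents=6, topology=topo,
                                   rounds=3, senior_share=0.4)
        print(f"{topo:12s} diversity={semantic_diversity(ideas):.3f}")


# asyncio.run(compare_topologies(
#     "propose research directions for sustainable urban transportation"))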


What the results say (and do not say)

The headline numbers are real and they are not enormous. Going from a fully connected topology to an isolated topology improves semantic diversity by something like 20-30% on their main benchmark, with category-level diversity moving by a similar amount. The senior-dominant versus junior-dominant comparison shows roughly a 15% gap. The compute-efficiency-paradox numbers are smaller, in the 5-10% range when you swap a mid-tier model for a frontier one, and the slope flattens fast.

These are not the kind of numbers that make a press release. They are the kind of numbers that change how you build systems. A 20% diversity gain from changing the topology is a free lunch if you are running an ideation pipeline that costs real money in inference. Chasing a comparable gain by upgrading the model instead means paying roughly four times the inference cost for a smaller effect.

Things I want to flag in the results.

The benchmark is on open-ended ideation tasks. The findings might not transfer to tasks where convergence is actually the goal, like collaborative problem-solving with a known answer, or tasks where you want the agents to refine a single idea rather than spread out. The paper acknowledges this and explicitly frames the work as "for divergent thinking phases." I think that framing is right, but it limits the scope of the conclusions more than the title suggests.

The evaluator agent is itself an LLM. They use a frontier model to score idea quality and to cluster categories, and they do human spot-checks. The human checks are reassuring but they are spot-checks, not full-scale annotation. If the evaluator model has correlated biases with the agents being evaluated (which is plausible for models from the same family or trained on similar data), the diversity measurements could be systematically off. They run an ablation with a different evaluator and find similar trends, which helps, but the evaluator coverage is also limited.

The role labels are simple. "Senior" versus "junior" is a single dimension of authority. Real agent systems have richer role structures (specialists, generalists, critics, executors, planners, and so on). I would want to see this study extended with more realistic role structures before I put too much weight on the cognition-level results. The simple senior-junior split is a clean controlled experiment, but the framing of "authority-driven dynamics" is broader than what the experiment actually measures.

What the results do say cleanly: if you are running a multi-agent ideation pipeline with a fully connected topology and a clear authority gradient, you are paying inference costs for an output set that is significantly less diverse than what you would get from running the same agents independently and aggregating at the end. The "naive parallel sampling" baseline beats most of their structured multi-agent setups on diversity. That is a finding with direct implications for almost any production multi-agent ideation system.


What this means for designing multi-agent systems

A lot of multi-agent systems in production today have aggressive coordination patterns. Orchestration pipelines that read every prior agent's output before deciding what to invoke next. Meta-loops where a planner agent sees the full history of every sub-agent's output before generating the next round of plans. These are exactly the topologies the paper says collapse diversity fastest.

The minimal change the paper points to, even though it never states it as a design recommendation: introduce a phase boundary. During the divergent phase, run agents independently or in a small-world topology. During the convergent phase, switch to fully connected. This is, in retrospect, what design thinking has been telling people to do for decades, and it is also what good human research teams do. Work alone first, then come together. The multi-agent framing makes "agents talking to each other" feel like the point, but talking is a tool, not a default.
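
In the toy loop from earlier, the phase boundary is a few lines. A sketch under the same assumptions as before, not anything the paper implements:

async def two_phase_run(prompt: str, n_agents: int,
                        diverge_rounds: int = 3,
                        converge_rounds: int = 2) -> list[str]:
    agents = [Agent(name=f"a{i}", role="neutral") for i in range(n_agents)]
    # Divergent phase: no cross-talk, so nobody anchors on anyone.
    for _ in range(diverge_rounds):
        await run_round(agents, "isolated", prompt)
    # Convergent phase: full visibility, so the group can refine together.
    for _ in range(converge_rounds):
        await run_round(agents, "full", prompt)
    return [a.history[-1] for a in agents]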

The role-tagging finding has the same flavor of practical implication. A lot of agent setups in the wild use explicit role tags ("lead researcher", "senior engineer", "principal architect") on the assumption that authority gradients help agents know what to do. The paper's authority experiments suggest those tags also flatten the diversity of the group's outputs. The trade-off is real: the tags probably are net-positive for output quality and coherence, and net-negative for output spread. Whether the trade is worth it depends entirely on which phase of the pipeline you are in.

The broader thing the paper does is make agent topologies a first-class design object, the same way you think about database indices or API rate limits. It is easy to treat topology as an implementation detail (how agents pass messages) when it is actually a load-bearing design choice (how agents shape each other's outputs). A fully connected topology and an isolated topology produce qualitatively different outputs, and that fact deserves an explicit label in any system diagram, not just a buried decision in the message-routing code.


Where my head is at

I went into this paper expecting to disagree with it. Multi-agent skepticism is having a moment online and I assumed this was another piece in that genre. It is not. It is an empirical study that takes the multi-agent framing seriously and then measures specific failure modes, and the failure modes are real. I came out more convinced of the underlying value of multi-agent systems and also more aware of how easy it is to set them up in the configuration that actively hurts the thing you want them to do.

The thing worth chasing next is whether the topology effects survive at smaller agent counts and tighter latency budgets. Most of the experiments are at five to twenty agents over multiple rounds. The smaller-and-shorter regime, where many production systems actually live, is not where the curves were measured. The curves at that end of the parameter space would be the most useful piece of follow-up work I could imagine for this paper.

There is more reading to do here. The related work the paper is in conversation with, especially the older social-choice and group-decision-making literature it seems to be reaching back into, is next on the list.


References

  • Diversity Collapse in Multi-Agent LLM Systems: Structural Coupling and Collective Failure in Open-Ended Idea Generation (arXiv:2604.18005) by Nuo Chen, Yicheng Tong, Yuzhe Yang, Yufei He, Xueyi Zhang, Qingyun Zou, Qian Wang, Bingsheng He. ACL 2026 Findings.