A Reflection on Multi-Agent Role-Playing


Role-playing was the earliest multi-agent pattern: assign distinct roles via system prompts — one agent as the program manager, another as the systems architect, a third as the engineer. The motivation is intuitive — a prompt that says “you are a security expert reviewing this code” elicits more security-focused analysis than a generic “review this code” — and the wave of influential multi-agent work in 2023-2024 (CAMEL, ChatDev, MetaGPT, Multi-Agent Debate) was built on this idea.

This post surveys three levels of multi-agent role-playing and examines what holds up. Using multiple personas for inference (Level 1) is now largely superseded by stronger models that adopt perspectives without explicit role assignment. Multi-role task decomposition (Level 2) produced real value, but the value came from the structural decomposition — parallelism, context isolation, tool specialization — not from the personas themselves. Multi-agent simulation (Level 3) sounds like the natural next step in system complexity, but has deep fidelity problems that remain unaddressed.

A note on scope: role-playing is also a product category in its own right, not just an engineering technique. Character.ai and Minimax’s Hailuo built consumer-scale businesses on models fine-tuned to maintain persistent character personas in conversation. In that context, optimizing for role alignment and persona consistency is the core objective — not a means to better task performance. This post focuses on role-playing as a technique for building capable agent systems, not as an end-user application.

Persona-Driven Generation: The Starting Point

Before examining multi-agent role-playing, it’s worth isolating the value of role assignment itself. At its simplest, a persona is just a system prompt that biases a single LLM’s generation — no agent interaction, no environment, no multi-turn coordination. The persona acts as a key that unlocks a different slice of the model’s knowledge — or as Reynolds & McDonell [1] put it in the earliest academic treatment of the technique, it helps the model “access a portion of its memory that holds higher quality examples of the task at hand.”

The idea became enormously popular in 2022-2023. Riley Goodside’s GPT-3 prompt experiments on Twitter demonstrated that identity assignment could radically change model behavior. The Awesome ChatGPT Prompts repository (143k+ stars) popularized the “Act As” pattern at scale — and today’s system prompts still follow the same structure, just with more detail:

You are a top-tier Web Product Architect, Full-Stack System Design Expert, and Enterprise Website Template System Consultant. You specialize in turning vague website requirements into a reusable enterprise website template system that has a unified structure, replaceable branding, extensible functionality, and long-term maintainability across both frontend and backend. Your task is not to design a single website page, and not merely to provide visual suggestions. Your task is to produce a reusable website template system design…

The template is consistent: assign a role with specific expertise, then constrain the scope and output expectations. The persona steers the model into a behavioral mode that generic prompting wouldn’t reliably produce.
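The template is mechanical enough to express as a small helper. A minimal sketch (the function name and field choices are my own, not any standard API):

```python
def persona_prompt(role: str, expertise: list[str], task: str, constraints: list[str]) -> str:
    """Build an 'Act As' style system prompt: assign a role with expertise,
    then constrain the scope and output expectations."""
    lines = [
        f"You are {role}, with deep expertise in: {', '.join(expertise)}.",
        f"Your task: {task}",
        "Constraints:",
    ]
    lines += [f"- {c}" for c in constraints]
    return "\n".join(lines)

prompt = persona_prompt(
    role="a senior security engineer",
    expertise=["threat modeling", "secure code review"],
    task="review the following code for vulnerabilities.",
    constraints=[
        "Focus on injection, auth, and secrets handling.",
        "Report findings as a severity-ranked list.",
    ],
)
```

The value is not in the string concatenation, of course, but in the behavioral mode the resulting prompt reliably elicits.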

Persona Hub [2] demonstrates this at scale: 1 billion diverse personas curated from web data, each used to prompt LLMs from varied perspectives — generating training prompts, math problems, instructions, and knowledge-rich text. Each persona produces more diverse synthetic data than generic prompting.

Persona-driven generation is a single-agent technique, but it’s the foundation that multi-agent role-playing builds on. The question is whether interaction between personas adds value beyond what individual role prompting provides. The evidence is nuanced — Kong et al. [3] (NAACL 2024) systematically evaluate role-play prompting across reasoning benchmarks and find that gains are task-dependent and diminish with stronger models.

Three Levels of Multi-Agent Role-Playing

Level 1: Multi-Persona Inference

Multiple roles improve the quality of a single output. The roles debate, critique, or synthesize perspectives to arrive at one answer.

  • Multi-Agent Debate [4]: Multiple LLM instances propose individual answers, then debate over multiple rounds to converge. Improves factuality and reasoning by having agents challenge each other’s claims.

  • Solo Performance Prompting (SPP) [5]: A single LLM dynamically identifies and simulates multiple personas internally, then has them collaborate to solve a problem. The effect only emerges in strong models (GPT-4), not in weaker ones — directly illustrating the “bounded by model capability” limitation.

  • ChatEval [6]: Applies multi-agent debate to evaluation rather than problem-solving. Multiple LLM agents deliberate on text quality, mimicking human annotation panels. Produces more reliable evaluations than single-model scoring.

  • Chain of Agents (CoA) [7] (NeurIPS 2024): Multi-agent collaboration for long-context tasks. Worker agents sequentially process different chunks of a long input, each communicating findings to the next, followed by a manager agent that synthesizes. Each agent sees only a short context, sidestepping long-context degradation. Up to 10% improvement over RAG and full-context baselines.
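CoA’s worker-to-worker chaining can be sketched with a stubbed model call (`call_llm` is a placeholder for any LLM API; the chunking and message-passing shape follows the paper’s description, the prompt wording is illustrative):

```python
def call_llm(prompt: str) -> str:
    # Stub: a real system would call an LLM API here.
    return f"<summary of: {prompt[:40]}...>"

def chain_of_agents(document: str, question: str, chunk_size: int = 1000) -> str:
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    message = ""  # the communication unit passed worker-to-worker
    for chunk in chunks:
        # Each worker sees only its own chunk plus the running message,
        # never the full document, sidestepping long-context degradation.
        message = call_llm(
            f"Previous findings: {message}\nChunk: {chunk}\n"
            f"Update the findings relevant to: {question}"
        )
    # A manager agent synthesizes the answer from the final message alone.
    return call_llm(f"Findings: {message}\nAnswer the question: {question}")

answer = chain_of_agents("x" * 3500, "What is discussed?")
```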

At this level, multi-agent is essentially an inference-time technique — analogous to self-consistency or best-of-N sampling, but with structured role-based interaction rather than independent samples. A natural question arises: how much of the benefit comes from the role structure versus simply spending more compute? The structured role interaction adds overhead (prompt engineering, communication protocol) that independent sampling doesn’t require. Whether role-based debate outperforms equivalent-compute single-model test-time scaling remains an open empirical question — and the answer likely depends on the task and model capability.

Level 2: Multi-Role Task Decomposition

Multiple roles break down a large task into phases or components, each handled by a specialist agent. The output is a complex artifact (software, research report), not a single answer.

  • CAMEL [8] (NeurIPS 2023): The first explicit role-playing framework. Two agents converse via inception prompting to complete a collaborative task. Limited — no tool use or code execution. Key finding: pure role-playing conversation without tool access is insufficient for most real tasks.

  • ChatDev [9] (ACL 2024): A virtual software company where agents fill roles across a waterfall-style development process — CEO, CTO, programmer, reviewer, tester. The key contribution over CAMEL: agents actually execute code and run tests, closing the loop between dialogue and concrete artifacts.

  • MetaGPT [10] (ICLR 2024): Similar setup but agents communicate through structured documents (PRDs, system designs, API specs) rather than free-form dialogue. Incorporates standardized operating procedures (SOPs), making coordination explicit rather than emergent. Reduces hallucination compared to chat-based communication.

  • AgentVerse [11] (ICLR 2024): Addresses rigidity of fixed-role systems with dynamic expert recruitment — a recruiter agent generates expert descriptions on the fly, and group composition is adjusted based on feedback. This moves toward runtime role assignment rather than pre-designed role taxonomies.

  • AutoGen [12]: A general-purpose multi-agent conversation framework from Microsoft. Domain-agnostic — agents are defined by capabilities rather than fixed roles.

The evolution from CAMEL to ChatDev to MetaGPT shows increasing structure in inter-agent communication: free-form chat, then phased chat, then structured documents with SOPs. Each iteration constrains communication further to reduce hallucination — mirroring how human organizations evolve from informal team chat to formal engineering processes.
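The shift from free-form chat to structured documents can be sketched with typed messages. The schema below is illustrative, not MetaGPT’s actual types:

```python
from dataclasses import dataclass, field

@dataclass
class PRD:
    """A structured artifact standing in for free-form PM chat."""
    goal: str
    requirements: list[str] = field(default_factory=list)

@dataclass
class Design:
    prd: PRD
    modules: list[str] = field(default_factory=list)

def product_manager(idea: str) -> PRD:
    # Emits a structured document rather than free-form dialogue.
    return PRD(goal=idea, requirements=[f"Must support: {idea}"])

def architect(prd: PRD) -> Design:
    # Consumes only the structured fields, which constrains what
    # downstream agents can hallucinate about upstream decisions.
    return Design(prd=prd, modules=[f"module_for::{r}" for r in prd.requirements])

design = architect(product_manager("a todo app"))
```

The typed interface is the point: each role can only pass along fields the schema allows, which is the code-level analogue of an SOP.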

The open question: do we need roles at all? These early systems have hard-coded workflows — the PM always talks to the architect who always talks to the engineer. The roles and their communication graph are designed by the system builder for a specific task type. This raises the question of whether named persistent roles add value over simpler alternatives:

graph LR
    subgraph s1 ["Fixed Roles"]
        direction LR
        PM["🏷️ PM"] --> Arch["🏷️ Architect"] --> Eng["🏷️ Engineer"] --> Test["🏷️ Tester"]
    end

    subgraph s2 ["Ad-hoc Sub-agents"]
        direction TB
        C["Coordinator"] --> S1["subtask 1"]
        C --> S2["subtask 2"]
        C --> S3["subtask 3"]
    end

    subgraph s3 ["Parallel Agents"]
        direction TB
        A1["agent"] <--> A2["agent"]
        A2 <--> A3["agent"]
        A3 <--> A1
    end

    s1 ~~~ s2 ~~~ s3

  • Ad-hoc sub-agents: The coordinator spawns a sub-agent with a task-specific prompt on the fly — no pre-defined role, no persistent identity. The sub-agent exists for one subtask and is discarded. More flexible because the coordinator decides the decomposition at runtime based on the actual problem.
  • Parallel agents: Multiple agents with no fixed roles work on the same problem, self-organizing through communication. No hierarchy, no predefined workflow.

The case for named roles is strongest when the task has a recurring, well-understood structure (software development has a stable decomposition into planning/coding/testing). The case weakens for novel or variable tasks where the right decomposition isn’t known in advance. As models become more capable at runtime planning, the value shifts from pre-defined role taxonomies to dynamic task decomposition.
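The ad-hoc pattern reduces to a coordinator that writes a task-specific prompt at runtime and discards the sub-agent afterward. A sketch with a stubbed model call (the decomposition logic and names are illustrative):

```python
def call_llm(prompt: str) -> str:
    return f"result({prompt[:30]})"  # stub for a real model call

def coordinator(task: str) -> list[str]:
    # Decompose at runtime based on the actual problem,
    # rather than routing through a fixed role taxonomy.
    subtasks = [f"{task}: part {i}" for i in range(3)]
    results = []
    for subtask in subtasks:
        # Each sub-agent is just a fresh prompt: no persistent
        # identity, no named role, discarded after one subtask.
        results.append(call_llm(f"You are a specialist for this subtask only.\n{subtask}"))
    return results

outputs = coordinator("summarize a codebase")
```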

Level 3: Multi-Agent Simulation

Multiple agents operate in a shared environment to simulate social dynamics. The goal is emergent collective behavior rather than a concrete output.

  • Generative Agents [13] (UIST 2023): 25 LLM-powered agents inhabit a sandbox environment (reminiscent of The Sims), each with memory, reflection, and planning modules. Agents autonomously plan their days, share news, form relationships, and coordinate group activities.

  • Generative Agent Simulations of 1,000 People [14]: Scales to 1,052 agents grounded in real individuals constructed from qualitative interview data. Agents replicate participants’ responses on the General Social Survey 85% as accurately as participants replicate their own answers two weeks later. Critical finding: grounding agents in actual interview data rather than demographic descriptions reduces accuracy biases across racial and ideological groups.

  • StableAlignment [15]: Uses multi-agent simulation for generating alignment training data. Agents interact in a simulated social environment where they learn value consensus through dialogue, mirroring how humans develop social norms through interaction rather than memorization.

At this level, multi-agent systems are closer to social science tools than engineering tools — they study emergent phenomena rather than optimize for task completion.

The Simulation Fidelity Problem

Level 3 is uniquely vulnerable to fidelity failures because agents are models of people, not tools for task completion. Two failure modes:

Bias and stereotyping. Even when grounded in interview data, agents can hallucinate behaviors or fall back on LLM stereotypes when the interview doesn’t cover a specific scenario. Agents specified only by demographics are worse — the LLM fills in missing detail with statistical stereotypes. This is especially dangerous when simulations inform policy.

Lack of diversity and long-tail coverage. LLMs are trained on majority-dominated data, so simulated agents tend to converge toward mainstream behaviors. A simulation of 1,000 agents may look diverse on surface demographics but behave with suspiciously uniform reasoning patterns underneath.

OmniBehavior [16], built on real user interaction logs from Kuaishou, provides systematic empirical evidence. It identifies four specific biases:

  1. Hyper-activity bias: Real users show positive interaction rates below 10%; LLM simulators overestimate by 40-60%
  2. Emotional suppression: LLMs cluster around neutral/positive sentiment while real users frequently express strong negative emotions
  3. Language homogenization: Simulated utterances are more polite, hedged, and face-saving than real users’ direct communication
  4. Personality erasure: Real users show large inter-user variation and small intra-user variation (consistent personalities); LLM-generated users show heavily overlapping distributions

The best-performing model scored only 44.55 overall on the benchmark, and most models fell under 40% F1 on binary behavior prediction. These results suggest systematic gaps rather than minor calibration issues.
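Personality erasure in particular is measurable: compare variation across users to variation within each user. Real populations show high inter-user and low intra-user variance; homogenized simulators show the reverse. A minimal check on a scalar behavior feature, with made-up numbers purely for illustration:

```python
from statistics import mean, pvariance

def inter_intra_variance(users: list[list[float]]) -> tuple[float, float]:
    """users: per-user observations of one scalar behavior feature
    (e.g., a sentiment score per utterance)."""
    user_means = [mean(u) for u in users]
    inter = pvariance(user_means)                # spread between user averages
    intra = mean(pvariance(u) for u in users)    # average spread within each user
    return inter, intra

# Distinct, self-consistent "real" users vs. near-identical "simulated" users.
real = [[0.9, 0.8, 0.9], [0.1, 0.2, 0.1], [0.5, 0.5, 0.6]]
simulated = [[0.5, 0.6, 0.4], [0.55, 0.45, 0.5], [0.5, 0.5, 0.6]]

real_inter, real_intra = inter_intra_variance(real)
sim_inter, sim_intra = inter_intra_variance(simulated)
```

For the real population, inter-user variance dominates (consistent but distinct personalities); for the homogenized one, within-user noise dominates, which is the overlapping-distributions signature OmniBehavior reports.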

Two applications illustrate the practical stakes:

  • Simulating users (for product testing, UX research): LLM-simulated users tend to be more cooperative, articulate, and predictable than real users. They rarely exhibit the chaotic, irrational, or creative misuse patterns that define real user behavior. A product tested against simulated users will appear to work well, then fail on real edge cases. The simulation creates a false sense of validation.

  • Simulating financial markets: Market behavior emerges from the interaction of heterogeneous agents with private information, irrational biases, and adversarial strategies. LLM agents lack genuine information asymmetry, don’t experience fear or greed, and can’t model reflexive dynamics where beliefs about the market change the market. Simulated markets will miss the fat-tailed distributions and flash crashes that matter most for risk management. (Note the distinction: simulating a market with LLM personas is the problematic pattern here. Placing an agent in a market environment and optimizing for investment outcomes — following the AlphaZero principle — could be legitimate. The bottleneck shifts from persona fidelity to environment fidelity: how accurately does the market simulator reproduce real market dynamics?)

Both cases share a common tension: the simulation is most useful for understanding the behaviors the LLM is least equipped to produce.

The Tail-Risk Paradox

The scenarios where simulation is most valuable are exactly the scenarios where it is least reliable. Simulation’s value proposition is to explore situations you can’t observe directly — natural disasters, financial crises. But these rare events are where: (1) LLM training data is sparsest, (2) human behavior deviates most dramatically from the average, and (3) the cost of getting the simulation wrong is highest.

People under the pressure of a real natural disaster exhibit behaviors — irrational evacuation decisions, resource hoarding, spontaneous cooperation with strangers — that are qualitatively different from what an LLM would extrapolate from everyday behavior patterns. A simulation that captures normal-condition behavior but misses tail-risk dynamics risks providing false confidence — potentially more misleading than having no simulation at all.

Simulate the Decision-Process, Not the Persona

The most successful multi-agent simulation is arguably AlphaZero [19]: a single model plays both sides of a Go match through self-play. The objective is clear (win the game), the environment enforces strict rules, and each player’s “persona” — playstyle, strategy, even creativity — emerges from the optimization process rather than being prescribed.

This points to a fundamental design principle: simulate the decision-making process, not the persona.

AlphaZero works because it has:

  1. A well-defined objective — the reward signal is unambiguous
  2. Environment-enforced constraints — illegal moves are impossible, not just discouraged
  3. Ground-truth feedback — win/loss is verifiable, not judged by another LLM
  4. Emergent identity — the agent’s “personality” is a byproduct of optimization, not a prompt
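These four conditions can be made concrete with a toy environment. In the Nim sketch below (my own toy example, not drawn from any of the cited papers), illegal moves raise rather than being merely discouraged, the winner is computed by the rules rather than judged, and any "playstyle" is a property of the policy, not a prompt:

```python
import random

class Nim:
    """Toy self-play environment: take 1-3 sticks; whoever takes the last stick wins."""
    def __init__(self, sticks: int = 10):
        self.sticks = sticks

    def legal_moves(self) -> list[int]:
        return [n for n in (1, 2, 3) if n <= self.sticks]

    def play(self, n: int) -> None:
        if n not in self.legal_moves():
            raise ValueError("illegal move")  # constraint enforced by the environment
        self.sticks -= n

def self_play(policy_a, policy_b, sticks: int = 10) -> int:
    """Returns 0 if policy_a wins, 1 if policy_b wins (ground truth, not LLM-judged)."""
    game = Nim(sticks)
    player = 0
    while game.sticks > 0:
        policy = (policy_a, policy_b)[player]
        game.play(policy(game))
        if game.sticks == 0:
            return player  # this player took the last stick
        player = 1 - player

random_policy = lambda g: random.choice(g.legal_moves())
# Known Nim strategy: leave the opponent a multiple of 4 sticks whenever possible.
optimal_policy = lambda g: next((n for n in g.legal_moves() if (g.sticks - n) % 4 == 0), 1)

winner = self_play(optimal_policy, random_policy, sticks=10)
```

From 10 sticks, the optimal first player always wins regardless of the opponent: its "style" is a consequence of the win condition, which is exactly the emergent-identity property.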

Level 3 persona simulation operates in a very different regime. The objective is open-ended (“act like this person”), constraints are soft, feedback requires human judgment or proxy LLMs, and identity is prescribed rather than emergent.

This also explains why Level 2 training is more tractable than Level 3: task decomposition has verifiable sub-objectives (did the code compile? did the search find the answer?), making it structurally closer to the AlphaZero setup. The “roles” in Level 2 are instrumental — they exist because they’re useful for the task, not because they’re simulating someone.
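“Did the code compile?” is exactly the kind of sub-objective that makes this regime trainable: binary, machine-checkable, no LLM judge required. A minimal Python version using the built-in compile(), which checks that a candidate program at least parses:

```python
def compiles(source: str) -> bool:
    """Verifiable reward signal: True if the candidate code parses, False otherwise.
    (Syntax only; runtime errors would need an execution sandbox to catch.)"""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

good = compiles("def f(x):\n    return x + 1")
bad = compiles("def f(x) return x")
```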

Two concrete applications where decision-process simulation works:

  • Red-teaming for LLM safety: The attacker agent’s objective is to jailbreak the target model. The target’s refusal or compliance provides ground-truth feedback, and attack strategies emerge from optimization rather than being hand-designed. The attacker doesn’t need a persona (“act like a hacker”); it needs an objective (elicit harmful output) and an environment that scores success. This is why learned red-teaming [17] discovers attack vectors that manual prompt engineering misses.

  • Scientific discovery and optimization. Karpathy’s autoresearch is the minimal version: a single agent in a modify-train-evaluate loop with a clear objective and ground-truth feedback. CORAL [18] scales this to 4 agents working asynchronously — no roles, no personas, just shared filesystem-based memory where agents build on each other’s discoveries. Results are verifiable by execution, and 4-agent co-evolution outperforms best-of-4 independent runs, showing that coordination adds value beyond raw compute scaling.
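Stripped down, the red-teaming setup is an objective plus a scorer, with no persona anywhere. A sketch with stubbed attacker and target (all names and behaviors are placeholders):

```python
def attacker(history: list[str]) -> str:
    # Stub: a trained attacker would generate a new attack conditioned on past failures.
    return f"attack-variant-{len(history)}"

def target(prompt: str) -> str:
    # Stub target model: "complies" only with one specific variant.
    return "HARMFUL" if prompt == "attack-variant-2" else "I can't help with that."

def red_team(max_turns: int = 5):
    history: list[str] = []
    for turn in range(max_turns):
        attack = attacker(history)
        response = target(attack)
        # Success is scored from the target's observable behavior (ground truth),
        # not from whether the attacker "acted like a hacker".
        if "HARMFUL" in response:
            return attack, turn
        history.append(attack)
    return None, max_turns

found, turns = red_team()
```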

The Catch-22 of Simulating People

The AlphaZero analogy has a fundamental limitation when applied to simulating people: human decision processes are not fully rational. Humans satisfice (bounded rationality), exhibit systematic biases (loss aversion, anchoring, hyperbolic discounting), and are influenced by emotion, social pressure, and cognitive load. These deviations from optimality aren’t noise — they are the phenomena that human simulation needs to capture.

This creates a catch-22:

  • Optimize the agent (AlphaZero-style) and it converges toward rational behavior, becoming too optimal to simulate real humans. A perfectly rational financial agent misses panic selling, FOMO, and herd behavior.
  • Don’t optimize and you’re back to persona prompting with all its fidelity problems.
  • Optimize for irrationality and you need a reward signal that captures how humans deviate from rationality — which requires the same kind of behavioral data that the 1,000-person study used. The reward becomes “match this distribution of irrational behavior,” a much harder specification problem than “win the game.”

An LLM can describe cognitive biases fluently — it knows what loss aversion is — but it doesn’t exhibit them under optimization pressure. The knowledge is declarative, not procedural. Training an agent to behave irrationally in the right ways rather than wrong ways is an open problem.
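The “match this distribution of irrational behavior” reward can at least be written down: score a simulator by its divergence from the empirical action distribution of real users. A sketch using KL divergence (the distributions here are invented for illustration):

```python
import math

def kl(p: dict[str, float], q: dict[str, float], eps: float = 1e-9) -> float:
    """KL(p || q) over a shared discrete action space."""
    return sum(pv * math.log(pv / max(q.get(a, 0.0), eps)) for a, pv in p.items() if pv > 0)

def irrationality_reward(simulated: dict[str, float], real: dict[str, float]) -> float:
    # Higher is better: penalize divergence from observed (irrational) human behavior.
    return -kl(real, simulated)

# Real users panic-sell far more often than a rational policy would.
real_actions = {"hold": 0.5, "buy": 0.2, "panic_sell": 0.3}
rational_sim = {"hold": 0.7, "buy": 0.29, "panic_sell": 0.01}
matched_sim = {"hold": 0.5, "buy": 0.2, "panic_sell": 0.3}

r_rational = irrationality_reward(rational_sim, real_actions)
r_matched = irrationality_reward(matched_sim, real_actions)
```

The sketch also shows why the specification is hard: the reward is only as good as the behavioral data behind real_actions, which is precisely the data that is sparsest for the tail events simulation is supposed to explore.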

Limitations by Level

| Limitation | Level 1 (Inference) | Level 2 (Task Decomposition) | Level 3 (Simulation) |
|---|---|---|---|
| Bounded by model capability | Core issue — role-prompting gains shrink as models improve | Less relevant — structural benefits persist regardless | Irrelevant — the goal is fidelity, not performance |
| Roles are harness-dependent | Minor — roles are simple (debater, critic) | Core issue — workflows must be designed per domain | Minor — agent identities are the point |
| Communication overhead | Moderate — debate rounds cost tokens | Core issue — inter-role communication consumes context | Moderate — interactions are the simulation |
| Coordination failures | Low risk — converging to an answer is well-defined | High risk — conflicting outputs without coordination training | N/A — “failures” may be realistic behavior |
| Simulation fidelity | N/A | N/A | Core issue — systematic biases undermine the value proposition |

Reflections

This reflection comes partly from my own experience building multi-agent systems that follow both the task decomposition and simulation patterns — the patterns were effective for the problems and models we had at the time, but the reasons they worked were often more structural than we initially framed them.

It’s worth noting that the influential multi-agent role-playing work — CAMEL, ChatDev, MetaGPT, AgentVerse, Multi-Agent Debate — was overwhelmingly a 2023-2024 phenomenon. The context matters: models at that time were more sensitive to role prompts (explicit persona assignment had a measurable effect on output quality), and their individual capabilities were limited enough that task decomposition across multiple agents was often necessary to complete complex tasks that a single model couldn’t handle alone. Role-playing was a pragmatic response to the models available at the time — a circumstantial design pattern rather than a fundamental architectural principle.

As models have grown more capable — better instruction following, longer context, stronger reasoning — the conditions that made role-playing effective have partially dissolved. Frontier models already adopt appropriate perspectives without explicit role assignment, and a single capable model can often handle tasks that previously required a virtual software company of specialized agents. The three core motivations for multi-agent systems — parallelism, context isolation, and tool specialization — are all structural benefits that don’t inherently require personas. Systems like AlphaZero, learned red-teaming, and CORAL succeed with clear objectives, verifiable feedback, and emergent rather than prescribed identity.

A more promising direction is to move beyond prompt-level role assignment and train models to collaborate better — learning coordination, task decomposition, and result integration through optimization rather than scaffold design. Kimi K2.5’s Agent Swarm [20] (PARL) is an early example: an RL-trained orchestrator learns to dispatch and coordinate sub-agents, with auxiliary rewards that shape exploration away from degenerate coordination patterns. The interesting frontier is not “how do we assign better roles?” but “how do we train agents to decompose and coordinate dynamically?”

This makes recent work that attempts to scale up multi-agent role-playing — more agents, more elaborate role taxonomies, more complex simulated organizations — worth scrutinizing carefully. These systems can produce impressive demos, but the gap between demo and applicability tends to widen as the role-playing scaffold grows more elaborate. Adding more named roles and richer communication protocols increases engineering complexity without addressing the underlying question of whether roles are doing the work, or whether the structural decomposition underneath would suffice with simpler (or no) persona assignment. The risk is that scaling up a circumstantial pattern produces diminishing returns — more moving parts, but not more capability.

References

[1] Reynolds, L., & McDonell, K. (2021). Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm. CHI 2021. arXiv:2102.07350.

[2] Ge, T., Hu, X., Wang, L., Chen, S., Tao, C., Wang, Z., … & Wei, F. (2024). Scaling Synthetic Data Creation with 1,000,000,000 Personas. arXiv:2406.20094.

[3] Kong, A., Zhao, S., Chen, H., Li, Q., Qin, Y., Sun, R., & Zhou, X. (2024). Better Zero-Shot Reasoning with Role-Play Prompting. NAACL 2024. arXiv:2308.07702.

[4] Du, Y., Li, S., Torralba, A., Tenenbaum, J. B., & Mordatch, I. (2023). Improving Factuality and Reasoning in Language Models through Multiagent Debate. arXiv:2305.14325.

[5] Wang, Z., Peng, S., Dong, D., Ma, J., & Lam, W. (2023). Unleashing Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration. arXiv:2307.05300.

[6] Chan, C. M., Chen, W., Su, Y., Yu, J., Xue, W., Zhang, S., … & Liu, Z. (2023). ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate. arXiv:2308.07201.

[7] Zhang, Y., Chen, Y., Jiang, C., Liu, J., Choi, Y., & Oh, J. (2024). Chain of Agents: Large Language Models Collaborating on Long-Context Tasks. NeurIPS 2024. arXiv:2406.02818.

[8] Li, G., Hammoud, H. A. A. K., Itani, H., Khizbullin, D., & Ghanem, B. (2023). CAMEL: Communicative Agents for “Mind” Exploration of Large Language Model Society. NeurIPS 2023. arXiv:2303.17760.

[9] Qian, C., Cong, X., Yang, C., Chen, W., Su, Y., Xu, J., … & Sun, M. (2024). Communicative Agents for Software Development. ACL 2024. arXiv:2307.07924.

[10] Hong, S., Zhuge, M., Chen, J., Zheng, X., Cheng, Y., Zhang, C., … & Wu, Y. (2024). MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. ICLR 2024. arXiv:2308.00352.

[11] Chen, W., Su, Y., Zuo, J., Yang, C., Yuan, C., Chan, C. M., … & Liu, Z. (2024). AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors. ICLR 2024. arXiv:2308.10848.

[12] Wu, Q., Bansal, G., Zhang, J., Wu, Y., Li, B., Zhu, E., … & Wang, C. (2023). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arXiv:2308.08155.

[13] Park, J. S., O’Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., & Bernstein, M. S. (2023). Generative Agents: Interactive Simulacra of Human Behavior. UIST 2023. arXiv:2304.03442.

[14] Park, J. S., Zou, C., Shaw, A., Hill, B. M., Cai, C., Morris, M. R., … & Bernstein, M. S. (2024). Generative Agent Simulations of 1,000 People. arXiv:2411.10109.

[15] Liu, Z., Yang, Y., Xu, H., Tang, H., Liu, Y., & Xiao, T. (2023). Training Socially Aligned Language Models on Simulated Social Interactions. arXiv:2305.16960.

[16] OmniBehavior Benchmark. omnibehavior.github.io.

[17] Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., … & Irving, G. (2022). Red Teaming Language Models with Language Models. arXiv:2202.03286.

[18] Li, J., et al. (2025). CORAL: Co-Evolving LLM Agents for Autonomous Problem Solving. arXiv:2604.01658.

[19] Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., … & Hassabis, D. (2018). A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go through Self-Play. Science, 362(6419). DOI:10.1126/science.aar6404.

[20] Kimi Team. (2026). Kimi K2.5: Scaling Reinforcement Learning with LLMs. arXiv:2602.02276.