FlashEvolve
Accelerating Agent Evolution with Asynchronous Stage Orchestration
LLM-based evolution has emerged as a promising way to improve agents by refining non-parametric artifacts, but its wall-clock cost remains a major bottleneck. We trace this cost to synchronized stage execution and generation imbalance inside each LLM-heavy stage. We present FlashEvolve, an efficient framework that replaces synchronized execution with asynchronous workers and queues, allowing different stages and steps to overlap. To handle the data staleness introduced by asynchrony, FlashEvolve tracks artifact versions and applies different policies to update, discard, or patch stale artifacts. Unlike weight-space staleness in asynchronous RL, language-space staleness is inspectable and repairable: a stale artifact is not just delayed work, but readable evidence that the LLM can reflect on, revise, and turn into a useful evolution signal. On GEPA workloads, FlashEvolve improves proposal throughput by 3.5× on local vLLM and 4.9× on API serving over synchronous GEPA.
Figure 1. Multi-stage execution in agent evolution. Synchronized stage orchestration in existing implementations (GEPA, ACE, Combee) exposes two efficiency challenges: sequential dependencies across stages and sample workload imbalance within each individual stage.
Introduction
A growing line of recent work enables LLM agents to evolve themselves. Instead of updating model weights, these systems iteratively refine the non-parametric components that govern their behavior, including system prompts [1, 11, 16], context and memory [2, 5, 6], harness code [12, 3], and generated programs [4, 7, 8].
By having an LLM reflect on full execution traces rather than optimize against scalar rewards, this paradigm draws a richer learning signal from each rollout: GEPA [1] outperforms GRPO with an average gain of 6% across six reasoning benchmarks, while Meta-Harness [3] automatically discovers agent harnesses that surpass the best hand-engineered baselines.
Despite its algorithmic appeal, agent evolution remains expensive in wall-clock execution time. On IFBench, a single GEPA evolution step already takes ~2 minutes; Combee [23] parallelizes proposal generation but further stretches each step to ~2.8 minutes. Reaching a stable improvement requires more than 2 hours on an H100 GPU.
We trace this high wall-clock cost to two compounding factors:
- Synchronized stage execution — each evolution step runs a sequence of LLM-heavy stages serially (run current artifact, propose candidate, evaluate). A later stage cannot start until the previous one fully completes, preventing overlap across stages.
- Generation imbalance inside each stage — request lengths vary widely across samples, so the longest requests determine the execution time of the whole stage; as shorter requests finish, the effective batch shrinks, leaving the LLM backend idle.
We present FlashEvolve, a framework that improves time efficiency of agent evolution through asynchronous stage orchestration. FlashEvolve treats an evolution loop as a set of LLM-heavy stages connected by queues, turning a synchronized loop into a streaming execution pipeline. To preserve evolution semantics, it tracks artifact-pool versions and applies update, discard, or reflective patch policies for stale items.
Contributions
- Bottleneck analysis. We identify synchronized stage execution and intra-stage generation imbalance as the two system-level bottlenecks in LLM-based agent evolution.
- Asynchronous orchestration framework. FlashEvolve overlaps artifact execution, proposal, evaluation, and pool update through workers and queues; speculative stage completion and adaptive workflow control further reduce intra-stage waiting.
- Language-space staleness handling. We introduce artifact-version tracking and reflective patching — unlike weight-space staleness in async RL, stale language artifacts can be inspected and repaired by the same LLM that drives proposal.
- Throughput gains. On GEPA workloads, FlashEvolve delivers 3.5× higher proposal throughput on local vLLM and 4.9× on API serving over synchronous GEPA, and generalizes to ACE and Meta-Harness.
Background & Motivation
Agent Evolution Beyond Weight Updates
Agent evolution has emerged as a paradigm for adapting LLM-based systems to new data and tasks [13, 14]. A single model reflects on its own trajectories, critiques its own outputs, and proposes new artifacts that govern its own behavior. Crucially, this happens without modifying model weights, sidestepping the training infrastructure of supervised fine-tuning and reinforcement learning [9].
An evolution loop iterates over multiple steps; each step consists of LLM-heavy stages: Generate (run current artifact on tasks to collect trajectories), Propose (reflect on traces to produce a new candidate), and Evaluate (score and filter). A subsequent pool update commits the new artifact.
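To make the loop concrete, here is a minimal sketch of one synchronous step. The helper names (`run_agent`, `reflect`, `score`) and the `pool` interface are assumptions for illustration, not GEPA's actual API.

```python
# A minimal synchronous evolution step (illustrative sketch; helper names assumed).
def evolution_step(pool, tasks, llm):
    artifact = pool.best()                                  # current best artifact, e.g. a prompt
    traces = [run_agent(llm, artifact, t) for t in tasks]   # Generate: run artifact, collect trajectories
    candidate = reflect(llm, artifact, traces)              # Propose: reflect on traces, emit a candidate
    val_score = score(llm, candidate, tasks)                # Evaluate: score and filter the candidate
    if val_score > pool.score(artifact):
        pool.commit(candidate, val_score)                   # pool update: commit the new artifact
    return candidate, val_score
```

Every stage here blocks on the previous one, which is exactly the serial structure profiled next.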
Figure 2. Profiling inefficiency in synchronized agent evolution. (a) Stage execution is serial and time is highly imbalanced. (b) Within a single stage, output lengths show a long-tail distribution — the slowest requests determine stage completion time. (c) Serial execution and intra-stage imbalance reduce effective concurrency, while FlashEvolve keeps more requests in flight.
Why Synchronous Execution Is Slow
Even with vLLM [15] (continuous batching, prefix caching), GEPA with Qwen3-8B takes 50 minutes for 49 evolution steps on IFBench, and 134 minutes for 411 steps on HotpotQA. This stems from two compounding costs:
- Serial chain across stages — total step time becomes the sum of per-stage durations, with no opportunity to overlap rollout, proposal, and evaluation.
- Synchronization barrier within a stage — the entire batch waits for the slowest request; a small number of long outputs dictate stage completion time (formalized in the sketch below).
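In symbols, the two costs compound as a sum of per-stage maxima: the serial chain adds stage durations, and the barrier makes each stage as slow as its slowest request.

```latex
% Per-step wall-clock time under synchronized execution, with S stages
% and per-request latency t_{ij} for request j in stage i's batch B_i:
T_{\text{step}} = \sum_{i=1}^{S} T_i,
\qquad
T_i = \max_{j \in \mathcal{B}_i} t_{ij}
% The sum reflects the serial chain across stages; the max reflects the
% intra-stage barrier, where a few long-tail requests stall the whole batch.
```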
Analogy to Asynchronous RL
These challenges resemble those in synchronous LLM RL systems [17, 19]; async RL addresses them by overlapping rollout with training [20, 21, 22]. Agent evolution differs in two key ways. First, it contains multiple LLM inference stages rather than a single rollout stage. Second, staleness occurs over inspectable language artifacts — prompts, memories, harness code, programs — rather than continuous model weights. This allows a much richer design space for staleness handling.
FlashEvolve: Asynchronous Framework for Agent Evolution
Figure 3. Overview of FlashEvolve. Asynchronous workers connected by queues replace synchronized stage execution. Workers pass partial or completed results through queues instead of waiting for a whole stage to finish. Speculative completion, validation-set reordering, workflow control, and staleness-aware handling further improve throughput while limiting data staleness.
Four Coordinated Mechanisms
FlashEvolve decomposes an evolution loop into asynchronous workers connected by queues, so different stages and evolution steps can overlap. Each queue item carries the artifact state and pool version $v_i$, allowing FlashEvolve to detect stale items. On top of this execution model, FlashEvolve introduces four coordinated mechanisms:
Asynchronous Workers & Queues
Each stage owns an input queue and a worker pool of size $K_i$. Workers continuously process ready items and pass outputs to downstream queues, allowing rollout, proposal, evaluation, and pool update to overlap.
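A minimal asyncio sketch of this execution model follows. `QueueItem`, `stage_worker`, and `launch_stage` are assumed names for illustration; FlashEvolve's real interfaces may differ.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable

@dataclass
class QueueItem:
    artifact: Any        # prompt, context playbook, harness code, ...
    pool_version: int    # pool version v_i observed when the item was created

async def stage_worker(fn: Callable[[QueueItem], Awaitable[QueueItem]],
                       in_q: asyncio.Queue, out_q: asyncio.Queue):
    """One of a stage's K_i workers: pull a ready item, run the LLM-heavy
    stage body, and push the result downstream without waiting for peers."""
    while True:
        item = await in_q.get()
        result = await fn(item)
        await out_q.put(result)
        in_q.task_done()

def launch_stage(fn, in_q, out_q, k: int):
    """Spawn the worker pool of size K_i that owns one stage's input queue."""
    return [asyncio.create_task(stage_worker(fn, in_q, out_q)) for _ in range(k)]
```

Because each stage drains its queue independently, rollout for the next evolution step can start while proposal and evaluation for the current one are still in flight.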
Staleness-Aware Handling
Three policies trade throughput for freshness: Full Async, Guarded Async (discard if version gap > $\Delta_{\max}$), and Reflective Async (inspect & patch stale artifacts via an LLM call).
Speculative Stage Completion
A stage releases partial output after a fraction $\alpha_{\mathrm{spec}}^{i}$ of its requests has finished; downstream workers start from the tentative item. Validation-set reordering keeps discriminative samples in the speculative prefix.
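A sketch of speculative completion under assumed names (`process` stands in for the stage body): the stage emits a tentative item once the finished fraction reaches $\alpha_{\mathrm{spec}}^{i}$, and the full result later supersedes it.

```python
import asyncio

async def run_stage_speculative(requests, process, out_q, alpha_spec=0.25):
    """Release a tentative result after a fraction alpha_spec of requests
    finishes (illustrative sketch). Requests should be pre-ordered so that
    discriminative validation samples land in the speculative prefix."""
    tasks = [asyncio.create_task(process(r)) for r in requests]
    threshold = max(1, int(alpha_spec * len(tasks)))
    done = []
    for fut in asyncio.as_completed(tasks):
        done.append(await fut)
        if len(done) == threshold:
            # Downstream workers start from this tentative item instead of
            # waiting for the long tail of the batch.
            await out_q.put(("speculative", list(done)))
    await out_q.put(("final", done))  # complete result supersedes the prefix
```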
Adaptive Workflow Control
FlashEvolve monitors per-stage production rate and adjusts worker counts: a stage producing at under half the median rate gains a worker, and a stage producing at over twice the median rate loses one, bounded by per-stage min/max limits.
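A sketch of this rebalancing rule, assuming per-stage production-rate counters and min/max bounds (all names illustrative):

```python
from statistics import median

def rebalance(worker_counts: dict, rates: dict, min_k: dict, max_k: dict) -> dict:
    """Adjust per-stage worker counts toward balanced production rates:
    under-half-median stages gain a worker, over-twice-median stages lose
    one, clamped to per-stage [min_k, max_k] bounds. (Illustrative sketch.)"""
    m = median(rates.values())
    for stage, rate in rates.items():
        if rate < 0.5 * m:
            worker_counts[stage] = min(worker_counts[stage] + 1, max_k[stage])
        elif rate > 2.0 * m:
            worker_counts[stage] = max(worker_counts[stage] - 1, min_k[stage])
    return worker_counts
```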
Why Language-Space Staleness Can Be Repaired
Language-space staleness is discrete and inspectable, unlike parameter staleness in async RL, which is continuous and opaque. In RL, a stale item is tied to an older point in weight space, so systems handle it through importance weighting, bounded delay, or discard. In agent evolution, a stale item is text or code — a prompt edit, memory update, harness mutation, or generated program.
FlashEvolve can therefore inspect the stale item together with the intervening artifact history and decide whether the edit is orthogonal, already subsumed, or in conflict with the current pool. This makes repair a first-class operation: stale items can be patched when they contain reusable information, or discarded when they are too specific or inconsistent.
Three Staleness Policies
Full Async preserves all completed work and maximizes throughput but may pollute the pool. Guarded Async discards items when $\Delta_i = v - v_i > \Delta_{\max}$. Reflective Async uses a reflection worker to read the stale item plus intervening pool updates and decide whether to patch or discard — reusing useful stale items while avoiding uncontrolled stale updates.
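The three policies can be read as one dispatch over versioned items, with $\Delta_i = v - v_i$ as above. In this sketch, `reflect_and_patch` stands in for the reflective LLM call and `pool.history` for the intervening pool updates; both are assumed names.

```python
def handle_item(item, pool, policy: str, delta_max: int = 2):
    """Apply one staleness policy to a completed queue item (illustrative sketch)."""
    delta = pool.version - item.pool_version    # version gap Delta_i
    if delta == 0 or policy == "full_async":
        return pool.update(item)                # keep all completed work
    if policy == "guarded_async":
        if delta > delta_max:
            return None                         # discard overly stale items
        return pool.update(item)
    if policy == "reflective_async":
        # Read the stale item plus the updates it missed, then decide whether
        # its edit is orthogonal, already subsumed, or in conflict.
        patched = reflect_and_patch(item, pool.history(since=item.pool_version))
        return pool.update(patched) if patched is not None else None
```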
Experimental Results
We evaluate FlashEvolve against GEPA and Combee on four benchmarks (IFBench, HotpotQA, HoVer, AIME) using Qwen3-8B on a single H100 and GPT-4o-mini via API. FlashEvolve consistently improves both LLM throughput and proposal throughput.
System Throughput on GEPA Workloads
LLM throughput (output token rate) and proposal throughput (rate of new candidate generation) across IFBench, HotpotQA, HoVer, and AIME.
| Method | IFBench tok/s | HotpotQA tok/s | HoVer tok/s | AIME tok/s | IFBench prop/min | HotpotQA prop/min | HoVer prop/min | AIME prop/min |
|---|---|---|---|---|---|---|---|---|
| **vLLM with Qwen3-8B** | | | | | | | | |
| GEPA | 963 | 30 | 461 | 200 | 1.9 | 4.6 | 2.5 | 2.2 |
| Combee (K=10) | 696 | 38 | 810 | 994 | 1.2 | 2.7 | 2.0 | 6.2 |
| Combee (K=40) | 900 | 44 | 891 | 977 | 0.7 | 4.5 | 2.0 | 1.6 |
| FlashEvolve | 2,688 | 93 | 1,255 | 998 | 8.9 | 8.8 | 5.9 | 11.4 |
| **API with GPT-4o-mini** | | | | | | | | |
| GEPA | 361 | 14 | 142 | 103 | 1.7 | 2.4 | 1.8 | 1.3 |
| Combee (K=10) | 397 | 18 | 348 | 211 | 1.0 | 1.4 | 0.8 | 1.0 |
| Combee (K=40) | 389 | 23 | 214 | 336 | 0.8 | 1.2 | 0.7 | 0.6 |
| FlashEvolve | 791 | 32 | 352 | 485 | 10.1 | 8.0 | 9.1 | 6.6 |
Table 1. FlashEvolve improves LLM throughput by 3.4× on average over GEPA on local vLLM and 2.9× on API serving, sustaining 5.9–11.4 proposals/min across settings.
Validation Score within a 30-Minute Budget
Normalized evolution rate measures score improvement within a fixed wall-clock budget, normalized to synchronous GEPA. AIME reports "—" because synchronous GEPA shows no improvement within the budget, so the normalization is undefined.
| Method | IFBench score | Norm. rate | HoVer score | Norm. rate | HotpotQA score | Norm. rate | AIME score | Norm. rate |
|---|---|---|---|---|---|---|---|---|
| GEPA | 87.6 | 1.00 | 39.8 | 1.00 | 63.3 | 1.00 | 10.0 | — |
| Combee (K=10) | 88.5 | 1.39 | 41.2 | 1.09 | 62.5 | 0.94 | 10.0 | — |
| Combee (K=40) | 86.5 | 0.55 | 40.5 | 1.05 | 58.6 | 0.63 | 10.0 | — |
| FlashEvolve | 90.6 | 2.27 | 42.0 | 1.15 | 61.7 | 0.88 | 15.0 | — |
Table 2. Within 30 minutes, FlashEvolve achieves 2.27× the evolution rate of GEPA on IFBench (90.6 vs 87.6 score) and is the only method that improves on AIME (10.0 → 15.0).
Long-Time Evolution
Validation score over a 180-minute budget. FlashEvolve reaches strong scores earlier than synchronous baselines and maintains a higher score on HotpotQA throughout.
Figure 4. Long-time validation curves with Qwen3-8B. On IFBench, FlashEvolve reaches 91% at 39.3 min, while Combee (K=40) needs 104.2 min to reach the same region. On HotpotQA, FlashEvolve peaks at 66.41% at 56.1 min and holds the lead over the full budget.
Ablation Studies
Staleness Handling Policies
Three policies on IFBench. Full Async and Guarded Async behave similarly here; Reflective Async reaches a validation score of 94.3% within the 30-minute budget — the highest of the three.
Figure 5. Staleness handling on IFBench with Qwen3-8B. Left: example of reflective prompt repair — FlashEvolve discards task-specific formulas and distills transferable principles (stricter constraint checking, self-contained reasoning) into a compact patch. Right: Full Async, Guarded Async, and Reflective Async compared; Reflective Async leads.
Worker Concurrency & Speculative Completion
Larger worker counts increase proposal throughput from 7 to 99 artifacts/min, but validation throughput does not scale uniformly. Adaptive control balances queue pressure and achieves the highest accepted proposal throughput.
Figure 6. Ablation of worker concurrency and speculative completion on IFBench with Qwen3-8B. (a) Worker allocation varies throughput. (b) Adaptive worker control achieves the highest accepted proposal throughput. (c) Speculative completion with $\alpha_{\mathrm{spec}}^{3}=0.25$ raises validation throughput to 3.15/min and improves score by 4.49 percentage points within 30 min.
Key Findings
Async Overlap is Essential
Asynchronous stage orchestration keeps the LLM backend busier by overlapping requests across stages and steps, yielding 3.4× higher LLM throughput on average.
Reflective Repair Wins
Stale language artifacts are not just delayed work — Reflective Async turns them into useful signal, achieving 94.3% validation score on IFBench within 30 minutes.
Adaptive Balance Beats Brute Force
Naive worker scaling shifts the bottleneck between stages. Adaptive workflow control balances queue pressure and stage rates, producing the highest accepted proposal throughput.
Algorithm-Agnostic Design
FlashEvolve is not GEPA-specific. The worker-and-queue abstraction also applies to ACE (context evolution) and Meta-Harness (harness-code evolution).
Beyond GEPA: ACE and Meta-Harness
FlashEvolve is algorithm-agnostic: it does not rely on a specific artifact type, requiring only that the evolution loop contain multiple LLM-heavy stages. We evaluate on ACE [2] (context playbooks) and Meta-Harness [3] (harness code).
Figure 7. FlashEvolve on other evolution algorithms. (a)–(b) ACE validation score curves on FiNER (0.60 → 0.66) and Formula (0.66 → 0.70) within a 30-min budget. (c)–(d) Meta-Harness proposal rate (0.3 → 1.4 proposals/min, a 4.7× speedup) and score distributions on Symptom2Disease and AGNews.
(1) ACE: Context Evolution
ACE evolves agent context playbooks rather than prompts. Within the same 30-minute budget, FlashEvolve reaches better validation scores on both datasets: 0.60 → 0.66 on FiNER and 0.66 → 0.70 on Formula, demonstrating higher efficiency on a non-prompt artifact.
(2) Meta-Harness: Code Evolution
Meta-Harness evolves harness code. FlashEvolve improves proposal-and-validation throughput from 0.3 to 1.4 proposals/min — a 4.7× speedup. With higher throughput, FlashEvolve samples and validates more harness candidates within the same time budget, raising the ceiling of harness quality.
Takeaway. The asynchronous worker-and-queue abstraction generalizes across artifact types — prompts, context, and harness code — because they all share the same multi-stage, queueable, pool-updating structure. FlashEvolve treats wall-clock execution as a first-class design target for agent evolution systems.
References
- [1] L. A. Agrawal et al. GEPA: Reflective prompt evolution can outperform reinforcement learning. arXiv:2507.19457, 2025.
- [2] Q. Zhang et al. Agentic context engineering: Evolving contexts for self-improving language models. arXiv:2510.04618, 2025.
- [3] Y. Lee et al. Meta-Harness: End-to-end optimization of model harnesses. arXiv:2603.28052, 2026.
- [4] A. Novikov et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv:2506.13131, 2025.
- [5] S. Ouyang et al. ReasoningBank: Scaling agent self-evolving with reasoning memory. arXiv:2509.25140, 2025.
- [6] G. Zhang et al. MemEvolve: Meta-evolution of agent memory systems. arXiv:2512.18746, 2025.
- [7] R. T. Lange et al. ShinkaEvolve: Towards open-ended and sample-efficient program evolution. arXiv:2509.19349, 2025.
- [8] H. Assumpção et al. CodeEvolve: An open-source evolutionary coding agent. arXiv:2510.14150, 2025.
- [9] D. Guo et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948, 2025.
- [10] X. Wang et al. PromptAgent: Strategic planning with language models for prompt optimization. arXiv:2310.16427, 2023.
- [11] E. Xiao et al. Prompt-MII: Meta-Learning Instruction Induction for LLMs. arXiv:2510.16932, 2025.
- [12] X. Lou et al. AutoHarness: Improving LLM agents by synthesizing a code harness. arXiv:2603.03329, 2026.
- [13] H. Gao et al. A survey of self-evolving agents. arXiv:2507.21046, 2025.
- [14] J. Fang et al. A comprehensive survey of self-evolving AI agents. arXiv:2508.07407, 2025.
- [15] W. Kwon et al. Efficient memory management for LLM serving with PagedAttention (vLLM). SOSP 2023.
- [16] M. Yuksekgonul et al. TextGrad: Automatic "differentiation" via text. arXiv:2406.07496, 2024.
- [17] G. Sheng et al. HybridFlow: A flexible and efficient RLHF framework. EuroSys 2025.
- [18] NVIDIA. NeMo RL: A scalable and efficient post-training library. GitHub, 2025.
- [19] Z. Hu et al. JigsawRL: Assembling RL pipelines for efficient LLM post-training. arXiv:2604.23838, 2026.
- [20] W. Fu et al. AReaL: A large-scale asynchronous RL system for language reasoning. arXiv:2505.24298, 2025.
- [21] Y. Zhong et al. StreamRL: Scalable, heterogeneous, elastic RL with disaggregated stream generation. arXiv:2504.15930, 2025.
- [22] G. Sheng et al. Laminar: A scalable asynchronous RL post-training framework. arXiv:2510.12633, 2025.
- [23] H. Li et al. Combee: Scaling prompt learning for self-improving language model agents. arXiv:2604.04247, 2026.
- [24] A. Yang et al. Qwen3 technical report. arXiv:2505.09388, 2025.
- [25] A. Hurst et al. GPT-4o system card. arXiv:2410.21276, 2024.
- [26] V. Pyatkin et al. Generalizing verifiable instruction following (IFBench). arXiv:2507.02833, 2025.
- [27] Z. Yang et al. HotpotQA: A dataset for diverse, explainable multi-hop QA. EMNLP 2018.
- [28] Y. Jiang et al. HoVer: A dataset for many-hop fact extraction and claim verification. EMNLP Findings 2020.
- [29] L. Loukas et al. FiNER: Financial numeric entity recognition for XBRL tagging. ACL 2022.
- [30] X. Zhang et al. Character-level convolutional networks for text classification (AGNews). NeurIPS 2015.