FlashEvolve
Accelerating Agent Evolution with Asynchronous Stage Orchestration
LLM-based evolution has emerged as a promising way to improve agents by refining non-parametric artifacts, but its wall-clock cost remains a major bottleneck. We trace this cost to synchronized stage execution and generation imbalance inside each LLM-heavy stage. We present FlashEvolve, an efficient framework that replaces synchronized execution with asynchronous workers and queues, allowing different stages and steps to overlap. To handle the data staleness introduced by asynchrony, FlashEvolve tracks artifact versions and applies different policies to update, discard, or patch stale artifacts. Unlike weight-space staleness in asynchronous RL, language-space staleness is inspectable and repairable: a stale artifact is not just delayed work, but readable evidence that the LLM can reflect on, revise, and turn into a useful evolution signal. On GEPA workloads, FlashEvolve improves proposal throughput by 3.5× on local vLLM and 4.9× on API serving over synchronous GEPA.
Figure 1. Multi-stage execution in agent evolution. Synchronized stage orchestration in existing implementations (GEPA, ACE, Combee) exposes two efficiency challenges: sequential dependencies across stages and sample workload imbalance within each individual stage.
Introduction
A growing line of recent work enables LLM agents to evolve themselves. Instead of updating model weights, these systems iteratively refine the non-parametric components that govern their behavior, including system prompts [1, 11, 16], context and memory [2, 5, 6], harness code [12, 3], and generated programs [4, 7, 8].
By having an LLM reflect on full execution traces rather than optimize against scalar rewards, this paradigm draws a richer learning signal from each rollout: GEPA [1] outperforms GRPO with an average gain of 6% across six reasoning benchmarks, while Meta-Harness [3] automatically discovers agent harnesses that surpass the best hand-engineered baselines.
Despite its algorithmic appeal, agent evolution remains expensive in wall-clock execution time. On IFBench, a single GEPA evolution step already takes ~2 minutes; Combee [23] parallelizes proposal generation but further stretches each step to ~2.8 minutes. Reaching a stable improvement requires more than 2 hours on an H100 GPU.
We trace this high wall-clock cost to two compounding factors:
- Synchronized stage execution — each evolution step runs a sequence of LLM-heavy stages serially (run current artifact, propose candidate, evaluate). A later stage cannot start until the previous one fully completes, preventing overlap across stages.
- Generation imbalance inside each stage — request lengths vary widely across samples, so the longest requests determine the execution time of the whole stage; as shorter requests finish, the effective batch shrinks, leaving the LLM backend idle.
We present FlashEvolve, a framework that improves time efficiency of agent evolution through asynchronous stage orchestration. FlashEvolve treats an evolution loop as a set of LLM-heavy stages connected by queues, turning a synchronized loop into a streaming execution pipeline. To preserve evolution semantics, it tracks artifact-pool versions and applies update, discard, or reflective patch policies for stale items.
Contributions
- Bottleneck analysis. We identify synchronized stage execution and intra-stage generation imbalance as the two system-level bottlenecks in LLM-based agent evolution.
- Asynchronous orchestration framework. FlashEvolve overlaps artifact execution, proposal, evaluation, and pool update through workers and queues; speculative stage completion and adaptive workflow control further reduce intra-stage waiting.
- Language-space staleness handling. We introduce artifact-version tracking and reflective patching — unlike weight-space staleness in async RL, stale language artifacts can be inspected and repaired by the same LLM that drives proposal.
- Throughput gains. On GEPA workloads, FlashEvolve delivers 3.5× higher proposal throughput on local vLLM and 4.9× on API serving over synchronous GEPA, and generalizes to ACE and Meta-Harness.
Background & Motivation
Agent Evolution Beyond Weight Updates
Agent evolution has emerged as a paradigm for adapting LLM-based systems to new data and tasks [13, 14]. A single model reflects on its own trajectories, critiques its own outputs, and proposes new artifacts that govern its own behavior. Crucially, this happens without modifying model weights, sidestepping the training infrastructure of supervised fine-tuning and reinforcement learning [9].
An evolution loop iterates over multiple steps; each step consists of LLM-heavy stages: Generate (run current artifact on tasks to collect trajectories), Propose (reflect on traces to produce a new candidate), and Evaluate (score and filter). A subsequent pool update commits the new artifact.
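To make the loop concrete, here is a minimal sketch of one synchronous step. The helper names (`run_agent`, `reflect`, `score`) and the `pool` interface are assumptions for illustration, not GEPA's actual API.

```python
# A minimal synchronous evolution step (illustrative sketch; helper names assumed).
def evolution_step(pool, tasks, llm):
    artifact = pool.best()                                  # current best artifact, e.g. a prompt
    traces = [run_agent(llm, artifact, t) for t in tasks]   # Generate: run artifact, collect trajectories
    candidate = reflect(llm, artifact, traces)              # Propose: reflect on traces, emit a candidate
    val_score = score(llm, candidate, tasks)                # Evaluate: score and filter the candidate
    if val_score > pool.score(artifact):
        pool.commit(candidate, val_score)                   # pool update: commit the new artifact
    return candidate, val_score
```

Every stage here blocks on the previous one, which is exactly the serial structure profiled next.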
Figure 2. Profiling inefficiency in synchronized agent evolution. (a) Stage execution is serial and time is highly imbalanced. (b) Within a single stage, output lengths show a long-tail distribution — the slowest requests determine stage completion time. (c) Serial execution and intra-stage imbalance reduce effective concurrency, while FlashEvolve keeps more requests in flight.
Why Synchronous Execution Is Slow
Even with vLLM [15] (continuous batching, prefix caching), GEPA with Qwen3-8B takes 50 minutes for 49 evolution steps on IFBench, and 134 minutes for 411 steps on HotpotQA. This stems from two compounding costs:
- Serial chain across stages — total step time becomes the sum of per-stage durations, with no opportunity to overlap rollout, proposal, and evaluation.
- Synchronization barrier within a stage — the entire batch waits for the slowest request; a small number of long outputs dictate stage completion time (formalized in the sketch below).
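In symbols, the two costs compound as a sum of per-stage maxima: the serial chain adds stage durations, and the barrier makes each stage as slow as its slowest request.

```latex
% Per-step wall-clock time under synchronized execution, with S stages
% and per-request latency t_{ij} for request j in stage i's batch B_i:
T_{\text{step}} = \sum_{i=1}^{S} T_i,
\qquad
T_i = \max_{j \in \mathcal{B}_i} t_{ij}
% The sum reflects the serial chain across stages; the max reflects the
% intra-stage barrier, where a few long-tail requests stall the whole batch.
```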
Analogy to Asynchronous RL
These challenges resemble those in synchronous LLM RL systems [17, 19]; async RL addresses them by overlapping rollout with training [20, 21, 22]. Agent evolution differs in two key ways. First, it contains multiple LLM inference stages rather than a single rollout stage. Second, staleness occurs over inspectable language artifacts — prompts, memories, harness code, programs — rather than continuous model weights. This allows a much richer design space for staleness handling.
FlashEvolve: Asynchronous Framework for Agent Evolution
Figure 3. Overview of FlashEvolve. Asynchronous workers connected by queues replace synchronized stage execution. Workers pass partial or completed results through queues instead of waiting for a whole stage to finish. Speculative completion, validation-set reordering, workflow control, and staleness-aware handling further improve throughput while limiting data staleness.
Four Coordinated Mechanisms
FlashEvolve decomposes an evolution loop into asynchronous workers connected by queues, so different stages and evolution steps can overlap. Each queue item carries the artifact state and pool version $v_i$, allowing FlashEvolve to detect stale items. On top of this execution model, FlashEvolve introduces four coordinated mechanisms:
Asynchronous Workers & Queues
Each stage owns an input queue and a worker pool of size $K_i$. Workers continuously process ready items and pass outputs to downstream queues, allowing rollout, proposal, evaluation, and pool update to overlap.
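A minimal asyncio sketch of this execution model follows. `QueueItem`, `stage_worker`, and `launch_stage` are assumed names for illustration; FlashEvolve's real interfaces may differ.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable

@dataclass
class QueueItem:
    artifact: Any        # prompt, context playbook, harness code, ...
    pool_version: int    # pool version v_i observed when the item was created

async def stage_worker(fn: Callable[[QueueItem], Awaitable[QueueItem]],
                       in_q: asyncio.Queue, out_q: asyncio.Queue):
    """One of a stage's K_i workers: pull a ready item, run the LLM-heavy
    stage body, and push the result downstream without waiting for peers."""
    while True:
        item = await in_q.get()
        result = await fn(item)
        await out_q.put(result)
        in_q.task_done()

def launch_stage(fn, in_q, out_q, k: int):
    """Spawn the worker pool of size K_i that owns one stage's input queue."""
    return [asyncio.create_task(stage_worker(fn, in_q, out_q)) for _ in range(k)]
```

Because each stage drains its queue independently, rollout for the next evolution step can start while proposal and evaluation for the current one are still in flight.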
Staleness-Aware Handling
Three policies trade throughput for freshness: Full Async, Guarded Async (discard if version gap > $\Delta_{\max}$), and Reflective Async (inspect & patch stale artifacts via an LLM call).
Speculative Stage Completion
A stage releases partial output after a fraction $\alpha_{\mathrm{spec}}^{i}$ of its requests has finished; downstream workers start from the tentative item. Validation-set reordering keeps discriminative samples in the speculative prefix.
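A sketch of speculative completion under assumed names (`process` stands in for the stage body): the stage emits a tentative item once the finished fraction reaches $\alpha_{\mathrm{spec}}^{i}$, and the full result later supersedes it.

```python
import asyncio

async def run_stage_speculative(requests, process, out_q, alpha_spec=0.25):
    """Release a tentative result after a fraction alpha_spec of requests
    finishes (illustrative sketch). Requests should be pre-ordered so that
    discriminative validation samples land in the speculative prefix."""
    tasks = [asyncio.create_task(process(r)) for r in requests]
    threshold = max(1, int(alpha_spec * len(tasks)))
    done = []
    for fut in asyncio.as_completed(tasks):
        done.append(await fut)
        if len(done) == threshold:
            # Downstream workers start from this tentative item instead of
            # waiting for the long tail of the batch.
            await out_q.put(("speculative", list(done)))
    await out_q.put(("final", done))  # complete result supersedes the prefix
```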
Adaptive Workflow Control
FlashEvolve monitors per-stage production rate and adjusts worker counts: a stage producing at under half the median rate gains a worker, and a stage producing at over twice the median rate loses one, bounded by per-stage min/max limits.
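A sketch of this rebalancing rule, assuming per-stage production-rate counters and min/max bounds (all names illustrative):

```python
from statistics import median

def rebalance(worker_counts: dict, rates: dict, min_k: dict, max_k: dict) -> dict:
    """Adjust per-stage worker counts toward balanced production rates:
    under-half-median stages gain a worker, over-twice-median stages lose
    one, clamped to per-stage [min_k, max_k] bounds. (Illustrative sketch.)"""
    m = median(rates.values())
    for stage, rate in rates.items():
        if rate < 0.5 * m:
            worker_counts[stage] = min(worker_counts[stage] + 1, max_k[stage])
        elif rate > 2.0 * m:
            worker_counts[stage] = max(worker_counts[stage] - 1, min_k[stage])
    return worker_counts
```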
Why Language-Space Staleness Can Be Repaired
Language-space staleness is discrete and inspectable, unlike parameter staleness in async RL, which is continuous and opaque. In RL, a stale item is tied to an older point in weight space, so systems handle it through importance weighting, bounded delay, or discard. In agent evolution, a stale item is text or code — a prompt edit, memory update, harness mutation, or generated program.
FlashEvolve can therefore inspect the stale item together with the intervening artifact history and decide whether the edit is orthogonal, already subsumed, or in conflict with the current pool. This makes repair a first-class operation: stale items can be patched when they contain reusable information, or discarded when they are too specific or inconsistent.
Three Staleness Policies
Full Async preserves all completed work and maximizes throughput but may pollute the pool. Guarded Async discards items when $\Delta_i = v - v_i > \Delta_{\max}$. Reflective Async uses a reflection worker to read the stale item plus intervening pool updates and decide whether to patch or discard — reusing useful stale items while avoiding uncontrolled stale updates.
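The three policies can be read as one dispatch over versioned items, with $\Delta_i = v - v_i$ as above. In this sketch, `reflect_and_patch` stands in for the reflective LLM call and `pool.history` for the intervening pool updates; both are assumed names.

```python
def handle_item(item, pool, policy: str, delta_max: int = 2):
    """Apply one staleness policy to a completed queue item (illustrative sketch)."""
    delta = pool.version - item.pool_version    # version gap Delta_i
    if delta == 0 or policy == "full_async":
        return pool.update(item)                # keep all completed work
    if policy == "guarded_async":
        if delta > delta_max:
            return None                         # discard overly stale items
        return pool.update(item)
    if policy == "reflective_async":
        # Read the stale item plus the updates it missed, then decide whether
        # its edit is orthogonal, already subsumed, or in conflict.
        patched = reflect_and_patch(item, pool.history(since=item.pool_version))
        return pool.update(patched) if patched is not None else None
```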
Experimental Results
We evaluate FlashEvolve against GEPA and Combee on four benchmarks (IFBench, HotpotQA, HoVer, AIME) using Qwen3-8B on a single H100 and GPT-4o-mini via API. FlashEvolve consistently improves both LLM throughput and proposal throughput.
System Throughput on GEPA Workloads
LLM throughput (output token rate) and proposal throughput (rate of new candidate generation) across IFBench, HotpotQA, HoVer, and AIME.
| Method | IFBench tok/s | HotpotQA tok/s | HoVer tok/s | AIME tok/s | IFBench prop/min | HotpotQA prop/min | HoVer prop/min | AIME prop/min |
|---|---|---|---|---|---|---|---|---|
| **vLLM with Qwen3-8B** | | | | | | | | |
| GEPA | 963 | 30 | 461 | 200 | 1.9 | 4.6 | 2.5 | 2.2 |
| Combee (K=10) | 696 | 38 | 810 | 994 | 1.2 | 2.7 | 2.0 | 6.2 |
| Combee (K=40) | 900 | 44 | 891 | 977 | 0.7 | 4.5 | 2.0 | 1.6 |
| FlashEvolve | 2,688 | 93 | 1,255 | 998 | 8.9 | 8.8 | 5.9 | 11.4 |
| **API with GPT-4o-mini** | | | | | | | | |
| GEPA | 361 | 14 | 142 | 103 | 1.7 | 2.4 | 1.8 | 1.3 |
| Combee (K=10) | 397 | 18 | 348 | 211 | 1.0 | 1.4 | 0.8 | 1.0 |
| Combee (K=40) | 389 | 23 | 214 | 336 | 0.8 | 1.2 | 0.7 | 0.6 |
| FlashEvolve | 791 | 32 | 352 | 485 | 10.1 | 8.0 | 9.1 | 6.6 |
Table 1. FlashEvolve improves LLM throughput by 3.4× on average over GEPA on local vLLM and 2.9× on API serving, sustaining 5.9–11.4 proposals/min across settings.
Validation Score within a 30-Minute Budget
Normalized evolution rate measures score improvement within a fixed wall-clock budget, normalized to synchronous GEPA. AIME reports "—" because synchronous GEPA shows no improvement within the budget, so the normalization is undefined.
| Method | IFBench score | Norm. rate | HoVer score | Norm. rate | HotpotQA score | Norm. rate | AIME score | Norm. rate |
|---|---|---|---|---|---|---|---|---|
| GEPA | 87.6 | 1.00 | 39.8 | 1.00 | 63.3 | 1.00 | 10.0 | — |
| Combee (K=10) | 88.5 | 1.39 | 41.2 | 1.09 | 62.5 | 0.94 | 10.0 | — |
| Combee (K=40) | 86.5 | 0.55 | 40.5 | 1.05 | 58.6 | 0.63 | 10.0 | — |
| FlashEvolve | 90.6 | 2.27 | 42.0 | 1.15 | 61.7 | 0.88 | 15.0 | — |
Table 2. Within 30 minutes, FlashEvolve achieves 2.27× the evolution rate of GEPA on IFBench (90.6 vs 87.6 score) and is the only method that improves on AIME (10.0 → 15.0).
Long-Time Evolution
Validation score over a 180-minute budget. FlashEvolve reaches strong scores earlier than synchronous baselines and maintains a higher score on HotpotQA throughout.
Figure 4. Long-time validation curves with Qwen3-8B. On IFBench, FlashEvolve reaches 91% at 39.3 min, while Combee (K=40) needs 104.2 min to reach the same region. On HotpotQA, FlashEvolve peaks at 66.41% at 56.1 min and holds the lead over the full budget.
Ablation Studies
Staleness Handling Policies
Three policies on IFBench. Full Async and Guarded Async behave similarly here; Reflective Async reaches a validation score of 94.3% within the 30-minute budget — the highest of the three.
Figure 5. Staleness handling on IFBench with Qwen3-8B. Left: example of reflective prompt repair — FlashEvolve discards task-specific formulas and distills transferable principles (stricter constraint checking, self-contained reasoning) into a compact patch. Right: Full Async, Guarded Async, and Reflective Async compared; Reflective Async leads.
Worker Concurrency & Speculative Completion
Larger worker counts increase proposal throughput from 7 to 99 artifacts/min, but validation throughput does not scale uniformly. Adaptive control balances queue pressure and achieves the highest accepted proposal throughput.
Figure 6. Ablation of worker concurrency and speculative completion on IFBench with Qwen3-8B. (a) Worker allocation varies throughput. (b) Adaptive worker control achieves the highest accepted proposal throughput. (c) Speculative completion with $\alpha_{\mathrm{spec}}^{3}=0.25$ raises validation throughput to 3.15/min and improves score by 4.49 percentage points within 30 min.
Key Findings
Async Overlap is Essential
Asynchronous stage orchestration keeps the LLM backend busier by overlapping requests across stages and steps, yielding 3.4× higher LLM throughput on average.
Reflective Repair Wins
Stale language artifacts are not just delayed work — Reflective Async turns them into useful signal, achieving 94.3% validation score on IFBench within 30 minutes.
Adaptive Balance Beats Brute Force
Naive worker scaling shifts the bottleneck between stages. Adaptive workflow control balances queue pressure and stage rates, producing the highest accepted proposal throughput.
Algorithm-Agnostic Design
FlashEvolve is not GEPA-specific. The worker-and-queue abstraction also applies to ACE (context evolution) and Meta-Harness (harness-code evolution).
Beyond GEPA: ACE and Meta-Harness
FlashEvolve is algorithm-agnostic: it does not rely on a specific artifact type, requiring only that the evolution loop contain multiple LLM-heavy stages. We evaluate on ACE [2] (context playbooks) and Meta-Harness [3] (harness code).
Figure 7. FlashEvolve on other evolution algorithms. (a)–(b) ACE validation score curves on FiNER (0.60 → 0.66) and Formula (0.66 → 0.70) within a 30-min budget. (c)–(d) Meta-Harness proposal rate (0.3 → 1.4 proposals/min, a 4.7× speedup) and score distributions on Symptom2Disease and AGNews.
(1) ACE: Context Evolution
ACE evolves agent context playbooks rather than prompts. Within the same 30-minute budget, FlashEvolve reaches better validation scores on both datasets: 0.60 → 0.66 on FiNER and 0.66 → 0.70 on Formula, demonstrating higher efficiency on a non-prompt artifact.
(2) Meta-Harness: Code Evolution
Meta-Harness evolves harness code. FlashEvolve improves proposal-and-validation throughput from 0.3 to 1.4 proposals/min — a 4.7× speedup. With higher throughput, FlashEvolve samples and validates more harness candidates within the same time budget, raising the ceiling of harness quality.
Takeaway. The asynchronous worker-and-queue abstraction generalizes across artifact types — prompts, context, and harness code — because they all share the same multi-stage, queueable, pool-updating structure. FlashEvolve treats wall-clock execution as a first-class design target for agent evolution systems.
References
- [1] L. A. Agrawal et al. GEPA: Reflective prompt evolution can outperform reinforcement learning. arXiv:2507.19457, 2025.
- [2] Q. Zhang et al. Agentic context engineering: Evolving contexts for self-improving language models. arXiv:2510.04618, 2025.
- [3] Y. Lee et al. Meta-Harness: End-to-end optimization of model harnesses. arXiv:2603.28052, 2026.
- [4] A. Novikov et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv:2506.13131, 2025.
- [5] S. Ouyang et al. ReasoningBank: Scaling agent self-evolving with reasoning memory. arXiv:2509.25140, 2025.
- [6] G. Zhang et al. MemEvolve: Meta-evolution of agent memory systems. arXiv:2512.18746, 2025.
- [7] R. T. Lange et al. ShinkaEvolve: Towards open-ended and sample-efficient program evolution. arXiv:2509.19349, 2025.
- [8] H. Assumpção et al. CodeEvolve: An open-source evolutionary coding agent. arXiv:2510.14150, 2025.
- [9] D. Guo et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948, 2025.
- [10] X. Wang et al. PromptAgent: Strategic planning with language models for prompt optimization. arXiv:2310.16427, 2023.
- [11] E. Xiao et al. Prompt-MII: Meta-Learning Instruction Induction for LLMs. arXiv:2510.16932, 2025.
- [12] X. Lou et al. AutoHarness: Improving LLM agents by synthesizing a code harness. arXiv:2603.03329, 2026.
- [13] H. Gao et al. A survey of self-evolving agents. arXiv:2507.21046, 2025.
- [14] J. Fang et al. A comprehensive survey of self-evolving AI agents. arXiv:2508.07407, 2025.
- [15] W. Kwon et al. Efficient memory management for LLM serving with PagedAttention (vLLM). SOSP 2023.
- [16] M. Yuksekgonul et al. TextGrad: Automatic "differentiation" via text. arXiv:2406.07496, 2024.
- [17] G. Sheng et al. HybridFlow: A flexible and efficient RLHF framework. EuroSys 2025.
- [18] NVIDIA. NeMo RL: A scalable and efficient post-training library. GitHub, 2025.
- [19] Z. Hu et al. JigsawRL: Assembling RL pipelines for efficient LLM post-training. arXiv:2604.23838, 2026.
- [20] W. Fu et al. AReaL: A large-scale asynchronous RL system for language reasoning. arXiv:2505.24298, 2025.
- [21] Y. Zhong et al. StreamRL: Scalable, heterogeneous, elastic RL with disaggregated stream generation. arXiv:2504.15930, 2025.
- [22] G. Sheng et al. Laminar: A scalable asynchronous RL post-training framework. arXiv:2510.12633, 2025.
- [23] H. Li et al. Combee: Scaling prompt learning for self-improving language model agents. arXiv:2604.04247, 2026.
- [24] A. Yang et al. Qwen3 technical report. arXiv:2505.09388, 2025.
- [25] A. Hurst et al. GPT-4o system card. arXiv:2410.21276, 2024.
- [26] V. Pyatkin et al. Generalizing verifiable instruction following (IFBench). arXiv:2507.02833, 2025.
- [27] Z. Yang et al. HotpotQA: A dataset for diverse, explainable multi-hop QA. EMNLP 2018.
- [28] Y. Jiang et al. HoVer: A dataset for many-hop fact extraction and claim verification. EMNLP Findings 2020.
- [29] L. Loukas et al. FiNER: Financial numeric entity recognition for XBRL tagging. ACL 2022.
- [30] X. Zhang et al. Character-level convolutional networks for text classification (AGNews). NeurIPS 2015.