FlashEvolve

Accelerating Agent Evolution with Asynchronous Stage Orchestration

Anonymous Authors*

* Submitted to NeurIPS 2026 (under double-blind review)

LLM-based evolution has emerged as a promising way to improve agents by refining non-parametric artifacts, but its wall-clock cost remains a major bottleneck. We identify that this cost comes from synchronized stage execution and imbalance inside each LLM-heavy stage. We present FlashEvolve, an efficient framework that replaces synchronized execution with asynchronous workers and queues, allowing different stages and steps to overlap. To handle data staleness introduced by asynchrony, FlashEvolve tracks artifact versions and applies different policies to update, discard, or patch stale artifacts. Unlike weight-space staleness in asynchronous RL, language-space staleness is inspectable and repairable: a stale artifact is not just delayed work, but readable evidence that the LLM can reflect on, revise, and turn into useful evolution signal. On GEPA workloads, FlashEvolve improves proposal throughput by 3.5× on local vLLM and 4.9× on API serving over synchronous GEPA.


Figure 1. Multi-stage execution in agent evolution. Existing implementations [2, 34, 15] use synchronized stage orchestration, exposing two efficiency challenges: sequential dependencies across stages and sample workload imbalance within each individual stage.

Introduction

A growing line of recent work enables LLM agents to evolve themselves. Instead of updating model weights, these systems iteratively refine the non-parametric components that govern their behavior, including system prompts, context and memory, harness code, and generated programs. This emerging paradigm of test-time self-evolution relaxes the access requirements of weight-space adaptation: it requires neither labeled trajectories nor gradient updates. By having an LLM reflect on full execution traces rather than optimize against scalar rewards, this paradigm draws a richer learning signal from each rollout.

Despite its algorithmic appeal, agent evolution remains expensive in wall-clock time. On IFBench, a single GEPA evolution step already takes ∼2 minutes, and reaching a stable improvement requires more than 2 hours on an H100 GPU. This cost grows further with data scale, making evolution runs slow to tune and deploy in practice. We trace the cost to three challenges:

  • Sequential dependencies across stages — A later stage cannot start until the previous stage has fully completed, preventing any overlap across rollout, propose, and evaluate.
  • Sample workload imbalance — Output lengths within a stage show a long-tail distribution; a small number of long requests determine stage completion time and leave the LLM backend underutilized.
  • Artifact-level staleness — Naive asynchrony can produce intermediate results from an artifact pool that has already changed before they are consumed.

We present FlashEvolve, a framework that improves the time efficiency of agent evolution through asynchronous stage orchestration. FlashEvolve treats an evolution loop as a set of LLM-heavy stages connected by queues, turning a synchronized loop into a streaming execution pipeline. To preserve evolution semantics, it tracks artifact versions and applies staleness-aware policies that update, discard, or patch stale items. It further reduces waiting inside long stages through speculative completion and uses adaptive workflow control to balance workload across stages.

Contributions

  • Asynchronous Stage Orchestration. A worker-queue execution model that overlaps rollout, proposal, and evaluation across stages and evolution steps, eliminating the serial dependency chain in synchronous agent evolution.
  • Language-Space Staleness Repair. Three policies — Full Async, Guarded Async, and Reflective Async — that exploit the inspectable nature of text artifacts to patch stale proposals rather than discard them.
  • Speculative Completion & Adaptive Control. Mechanisms that release partial stage results early and rebalance worker counts based on observed production rates.
  • SOTA Throughput. On GEPA workloads, FlashEvolve achieves 3.5× proposal throughput on local vLLM and 4.9× on API serving; the same design generalizes to ACE and Meta-Harness.

Background & Motivation

An agent-evolve loop iterates over multiple steps, where each step consists of several LLM-heavy stages: Generate runs the current artifact on tasks to collect trajectories, Propose reflects on these trajectories to produce a new candidate artifact, and Evaluate scores the candidate against task signals. A subsequent pool update commits the new artifact to the artifact pool.
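The synchronized loop above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `ArtifactPool` and the `generate`/`propose`/`evaluate` callables are hypothetical stand-ins for the LLM-heavy stages.

```python
# Illustrative sketch of a synchronized evolution loop; all names here are
# hypothetical stand-ins, not FlashEvolve's or GEPA's actual interfaces.

class ArtifactPool:
    """Versioned pool of (artifact, score) candidates."""
    def __init__(self, artifact, score=0.0):
        self.items = [(artifact, score)]
        self.version = 0
    def best(self):
        return max(self.items, key=lambda x: x[1])[0]
    def commit(self, artifact, score):
        self.items.append((artifact, score))
        self.version += 1            # every pool update bumps the version

def evolve_sync(pool, tasks, steps, generate, propose, evaluate):
    """Each stage must fully finish before the next begins, so the slowest
    request in every stage gates the whole evolution step."""
    for _ in range(steps):
        artifact = pool.best()
        trajs = [generate(artifact, t) for t in tasks]   # Generate (rollout)
        candidate = propose(artifact, trajs)             # Propose
        score = evaluate(candidate, tasks)               # Evaluate
        pool.commit(candidate, score)                    # pool update
    return pool.best()
```

The serial structure is exactly what FlashEvolve's asynchronous orchestration breaks up: each line must wait for the previous one.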

Even with state-of-the-art LLM serving infrastructure such as vLLM [12] (which supports continuous batching and prefix caching), GEPA with Qwen3-8B takes 50 minutes to complete 49 evolution steps on IFBench and 134 minutes for 411 steps on HotpotQA. The inefficiency stems from sequential, synchronized stage execution, with two compounding costs.

Three profiling panels: (a) stage time breakdown across GEPA, Combee, ACE, and Meta-Harness, with Evaluate dominating; (b) token distribution in GEPA evaluate stage with long-tail effect; (c) concurrent requests in evolution showing FlashEvolve sustains far higher batch sizes than baselines.

Figure 2. Profiling inefficiency in synchronized agent evolution. (a) Stage execution is serial, and stage time is highly imbalanced. (b) Within a single stage, output lengths show a long-tail distribution, so the slowest requests determine stage completion time. (c) Serial execution and intra-stage imbalance reduce effective concurrency, while FlashEvolve keeps far more requests in flight.

Analogy to Asynchronous RL

These challenges mirror those of synchronous LLM RL systems, which also suffer from synchronization overhead and workload imbalance. Asynchronous RL addresses them by overlapping rollout generation with training and bounding the degree of off-policyness. Agent evolution differs in two key ways:

  • Multiple inference stages. Evolution contains rollout, proposal, and evaluation as distinct LLM-heavy stages (vs. a single rollout stage in RL), each with its own batched generation and long-tail imbalance.
  • Inspectable language-space staleness. Stale items are text or code — prompts, memories, harness mutations, programs — rather than continuous model weights, enabling a more flexible design space for staleness-handling policies.

FlashEvolve: Asynchronous Workers, Queues & Reflective Repair

Overview of FlashEvolve: top half shows the evolving workflow (Rollout, Propose, Evaluate, with speculative completion, dynamic reordering, and three staleness policies — Fully Async, Guarded Async, Reflective Async). Bottom half shows the system design with workers, queues, an artifact pool, and workflow control adjusting per-stage worker counts.

Figure 3. Overview of FlashEvolve. Asynchronous workers and queues span the rollout, propose, and evaluate stages. Workers pass partial or completed results through queues instead of waiting for a whole stage to finish. FlashEvolve further uses speculative completion, validation-set reordering, workflow control, and staleness-aware handling to improve throughput while limiting data staleness.

Four Mechanisms

Asynchronous Workers & Queues

Each stage i has an input queue and a pool of workers (count $K_i$). Items carry the artifact-pool version $v_i$ at creation; the pool version increases on each update, so FlashEvolve can compare $v_i$ with the current $v$ to detect stale items.
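As a rough sketch of this worker-queue model, with Python threads standing in for stage workers and `Item`/`Stage` as hypothetical names rather than FlashEvolve's API:

```python
# Sketch of versioned queue items and a K_i-worker stage; names are
# illustrative, and threads stand in for the framework's workers.
import threading
import queue
from dataclasses import dataclass

@dataclass
class Item:
    payload: object
    version: int              # v_i: pool version when this item was created

class Stage:
    """K_i workers drain an input queue and stream results downstream,
    so the next stage can start before this one fully finishes."""
    def __init__(self, fn, k, inq, outq, pool_version):
        self.fn, self.inq, self.outq = fn, inq, outq
        self.pool_version = pool_version      # callable returning current v
        self._threads = [threading.Thread(target=self._work, daemon=True)
                         for _ in range(k)]
    def start(self):
        for t in self._threads:
            t.start()
    def stop(self):
        for _ in self._threads:
            self.inq.put(None)                # one sentinel per worker
        for t in self._threads:
            t.join()
    def _work(self):
        while (item := self.inq.get()) is not None:
            delta = self.pool_version() - item.version   # staleness Δ_i
            result = self.fn(item.payload, delta)
            self.outq.put(Item(result, self.pool_version()))
```

Comparing an item's `version` against the current pool version is all the bookkeeping needed to detect staleness downstream.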

Staleness-Aware Handling

Three policies trade throughput for freshness: Full Async preserves all work regardless of staleness, Guarded Async discards items whose staleness $\Delta_i = v - v_i$ exceeds a threshold $\Delta_{\max}$, and Reflective Async uses an extra LLM worker to patch stale items against the intervening artifact history.
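A minimal sketch of the three-way dispatch, under stated assumptions: `history` is the list of artifacts committed since the item was created, and `patch_fn` stands in for the extra LLM worker that Reflective Async uses.

```python
# Hypothetical staleness-policy dispatch; patch_fn stands in for the
# Reflective Async repair worker, and all names are illustrative.

def handle_item(payload, item_version, pool_version, policy,
                delta_max=2, patch_fn=None, history=()):
    delta = pool_version - item_version        # staleness Δ_i = v - v_i
    if delta == 0 or policy == "full_async":
        return payload                         # keep all work as-is
    if policy == "guarded_async":
        # keep mildly stale items, discard ones past the threshold
        return payload if delta <= delta_max else None
    if policy == "reflective_async":
        # repair the stale item against the intervening artifact history
        return patch_fn(payload, history[item_version:pool_version])
    raise ValueError(f"unknown policy: {policy}")
```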

Speculative Stage Completion

A stage releases a tentative output once an $\alpha^i_{\text{spec}} \in (0, 1]$ fraction of its requests has finished. Validation-set reordering moves easy samples (passing $w$ consecutive rounds) out of the speculative prefix to keep the early signal discriminative.

Adaptive Workflow Control

FlashEvolve measures each stage's item-production rate. A stage producing < ½ the median rate gets a worker; a stage producing > 2× the median loses one. Adjustments are bounded per step to avoid oscillation.
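The control rule reads almost directly as code; the bounds and function name in this sketch are illustrative, not FlashEvolve's implementation.

```python
# Sketch of the adaptive rebalancing rule: slow stages gain a worker,
# fast stages lose one, with at most one change per stage per step.
import statistics

def rebalance(worker_counts, rates, k_min=1, k_max=16):
    """Adjust each stage's worker count by at most one per control step."""
    med = statistics.median(rates)
    adjusted = []
    for k, rate in zip(worker_counts, rates):
        if rate < 0.5 * med:
            k = min(k + 1, k_max)      # starved stage gains a worker
        elif rate > 2.0 * med:
            k = max(k - 1, k_min)      # over-producing stage loses one
        adjusted.append(k)
    return adjusted
```

Bounding the adjustment to one worker per stage per step is what prevents the controller from oscillating when production rates are noisy.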

Why Language-Space Staleness Can Be Repaired

Language-space staleness is discrete and inspectable, unlike parameter staleness in asynchronous RL, which is continuous and opaque. In RL, a stale item is tied to an older point in weight space, so systems handle it through importance weighting, bounded delay, or discard. In agent evolution, a stale item is text or code — a prompt edit, memory update, harness mutation, or generated program. FlashEvolve can inspect the stale item together with the intervening artifact history and decide whether the edit is orthogonal, already subsumed, or conflicting with the current pool.

Reflective Patching as a First-Class Operation

Stale items can be patched when they contain reusable information, or discarded when they are too specific or inconsistent. In the IFBench example shown later in Figure 5, FlashEvolve filters out task-specific stale formulas and keeps transferable principles (strict constraint checking, self-contained reasoning) to form a compact prompt patch.

Experimental Results

We compare FlashEvolve against synchronous GEPA [2] and the scaling-oriented baseline Combee [15] on four datasets: IFBench, HotpotQA, HoVer, and AIME. Models are Qwen3-8B served by vLLM on a single NVIDIA H100 80GB, and remote GPT-4o-mini for API serving.

3.5×
Proposal Throughput

vs. synchronous GEPA on local vLLM

4.9×
Proposal Throughput

vs. synchronous GEPA on API serving

3.4×
LLM Token Throughput

Avg. on local vLLM over GEPA

2.27×
Evolution Rate

IFBench, 30-min budget vs. GEPA

Throughput Comparison on GEPA Workloads

FlashEvolve substantially improves both raw LLM token throughput and the rate of new candidate artifact generation. Asynchronous orchestration keeps the LLM backend busier by overlapping requests from different stages and evolution steps.

Method              |  LLM Throughput (token/s)    |  Proposal Throughput (proposal/min)
                    | IFBench HotpotQA HoVer  AIME | IFBench HotpotQA HoVer  AIME
vLLM with Qwen3-8B
GEPA                |   963      30     461    200 |   1.9     4.6     2.5    2.2
Combee (K=10)       |   696      38     810    994 |   1.2     2.7     2.0    6.2
Combee (K=40)       |   900      44     891    977 |   0.7     4.5     2.0    1.6
FlashEvolve         | 2,688      93   1,255    998 |   8.9     8.8     5.9   11.4
API with GPT-4o-mini
GEPA                |   361      14     142    103 |   1.7     2.4     1.8    1.3
Combee (K=10)       |   397      18     348    211 |   1.0     1.4     0.8    1.0
Combee (K=40)       |   389      23     214    336 |   0.8     1.2     0.7    0.6
FlashEvolve         |   791      32     352    485 |  10.1     8.0     9.1    6.6

Table 1. Throughput comparison on GEPA workloads. Across all settings, FlashEvolve sustains more than 5.9 proposals/min and up to 11.4 proposals/min, substantially raising the rate at which new candidate artifacts are tested.

Validation Score over Wall-Clock Time

FlashEvolve reaches strong validation scores earlier than the synchronous baselines and, in several workloads, also improves the final score under a longer time budget.

Validation-score curves over 180 minutes for IFBench (left) and HotpotQA (right). FlashEvolve climbs faster than GEPA and Combee variants and remains at the top throughout.

Figure 4. Validation score over a longer wall-clock budget with Qwen3-8B. On IFBench, FlashEvolve reaches 91% in 39.3 min, while Combee (K=40) needs 104.2 min to reach the same score region. On HotpotQA, FlashEvolve reaches its best score of 66.41% at 56.1 min and stays above all baselines for the full budget.

Ablation Studies

Staleness Handling: Reflective Repair Wins

Three staleness policies are compared on IFBench. Full Async and Guarded Async behave similarly in this setting, but Reflective Async reaches the best score by inspecting and repairing stale prompts rather than discarding them.

Left: example of reflective prompt repair where stale task-specific formulas are discarded while strict-constraint and self-contained reasoning takeaways are patched into a new compact prompt. Right: validation-score curves on IFBench comparing GEPA (Sync), Guarded Async, Full Async, and Reflective Async; Reflective Async reaches 94.3%.

Figure 5. Staleness handling on IFBench with Qwen3-8B. Left: example of reflective prompt repair — FlashEvolve discards task-specific formulas because the accepted prompt already contains general instruction-following rules, while distilling stricter constraint checking and self-contained reasoning into a compact patch. Right: Reflective Async reaches a validation score of 94.3% within the 30-minute budget, well above Full and Guarded Async.

Worker Concurrency & Speculative Completion

Larger worker counts greatly increase raw proposal throughput, but naive scaling can shift the bottleneck across stages. Adaptive control achieves the highest accepted proposal throughput by balancing proposal generation and validation.

Three-panel ablation. (a) Proposal and Validate throughput across Sync, K1=1/K3=1, K1=16/K3=8, and Adaptive — proposals jump from 7 to 140/min. (b) Acceptance throughput rises from 0.033 (Sync) to 0.310 (Adaptive). (c) Speculative completion: Adaptive 2.62 val sweeps/min; α_spec=0.5 starves validate; α_spec=0.25 reaches 3.15 val sweeps/min.

Figure 6. Ablation of worker concurrency and speculative completion on IFBench with Qwen3-8B. (a) Worker allocation varies throughput. (b) Adaptive worker control achieves the highest accepted-proposal throughput by balancing proposal generation and validation (9.4× over Sync). (c) Speculative completion improves validation throughput when the prefix threshold is set well ($\alpha^3_{\text{spec}}=0.25$ gives 3.15 val sweeps/min); too large a threshold ($\alpha^3_{\text{spec}}=0.5$) starves the validate stage.

30-Minute Budget Comparison

Validation score and normalized evolution rate within a fixed 30-minute budget on Qwen3-8B. AIME reports "–" because serial GEPA shows no improvement within the budget; FlashEvolve is the only method that improves over the initial score.

Method          |   IFBench        |   HoVer          |   HotpotQA       |   AIME
                | Score  Norm. Rate | Score  Norm. Rate | Score  Norm. Rate | Score  Norm. Rate
GEPA            | 87.6     1.00     | 39.8     1.00     | 63.3     1.00     | 10.0     –
Combee (K=10)   | 88.5     1.39     | 41.2     1.09     | 62.5     0.94     | 10.0     –
Combee (K=40)   | 86.5     0.55     | 40.5     1.05     | 58.6     0.63     | 10.0     –
FlashEvolve     | 90.6     2.27     | 42.0     1.15     | 61.7     0.88     | 15.0     –

Table 2. Across the three workloads where the GEPA baseline makes measurable progress, FlashEvolve achieves an average normalized evolution rate of 1.43×, with the strongest gain on IFBench.

Key Findings

Asynchrony Pays Off

Replacing the synchronous step boundary with worker queues yields a 3.5× proposal-throughput gain on vLLM and 4.9× on API serving, with no algorithmic change to the underlying evolution loop.

Stale ≠ Useless

Reflective Async patches stale prompts using the intervening artifact history, lifting the IFBench validation score to 94.3%, well above Full Async (which keeps stale items unmodified) and Guarded Async (which discards them).

Adaptive > Fixed Workers

Naive scaling (large $K_1$, $K_3$) shifts the bottleneck. Adaptive control gives 9.4× accepted-proposal throughput by balancing proposal generation and validation rates.

Speculative Completion Helps If Tuned

At $\alpha^3_{\text{spec}}=0.25$, validation sweeps rise to 3.15/min and IFBench score gains 4.49 pp in 30 min. Too large a threshold (0.5) starves the validate stage.

Generalization to ACE & Meta-Harness

FlashEvolve is algorithm-agnostic: it does not rely on a specific artifact type, only that the evolution loop contains multiple stages that need orchestration. We evaluate the same framework on ACE (context playbook evolution) and Meta-Harness (harness code evolution).

0.66
ACE on FiNER

Up from 0.60 (synchronous ACE), 30-min budget

0.70
ACE on Formula

Up from 0.66 (synchronous ACE), 30-min budget

1.4 / min
Meta-Harness Proposals

Up from 0.3 / min (synchronous baseline)

4.7×
Meta-Harness Speedup

On Symptom2Disease & AGNews

Figure 7 (summarized). FlashEvolve on other evolution algorithms. ACE: validation score on FiNER rises 0.60→0.66; on Formula 0.66→0.70. Meta-Harness on Symptom2Disease & AGNews: proposal / validation throughput rises from 0.3 to 1.4 proposals/min, a 4.7× speedup.

(1) Context Evolution (ACE)

ACE evolves agent context playbooks. FlashEvolve reaches higher validation scores within the same 30-minute budget on both FiNER (0.60 → 0.66) and Formula (0.66 → 0.70), demonstrating that the worker-queue abstraction transfers cleanly to a different artifact type and pool-update rule.

(2) Harness-Code Evolution (Meta-Harness)

Meta-Harness evolves the harness code that wraps an LLM agent. Since the open-source model has relatively weak code-generation capability, harness-code evolution progresses slowly in both settings — but FlashEvolve samples and validates 4.7× more candidates per unit time, materially raising the chance of discovering an improved harness within a fixed budget.

Why this matters. The same worker, queue, and staleness-policy machinery covers prompt evolution (GEPA), context evolution (ACE), and harness-code evolution (Meta-Harness). Mapping a new evolution algorithm to FlashEvolve requires only its stages, queue items, artifact state, and update rules — not a redesign of the execution model.
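One way this mapping could look as an interface (the class and method names below are hypothetical illustrations, not FlashEvolve's actual API):

```python
# Hypothetical adapter interface for plugging an evolution algorithm into
# worker-queue orchestration; names are illustrative, not FlashEvolve's API.
from abc import ABC, abstractmethod

class EvolutionAlgorithm(ABC):
    """What asynchronous orchestration needs from an algorithm: its stages,
    its queue items, its artifact state, and its update rule."""

    @abstractmethod
    def stages(self):
        """Ordered LLM-heavy stage callables, e.g. rollout/propose/evaluate."""

    @abstractmethod
    def make_item(self, payload, pool_version):
        """Wrap a payload as a versioned queue item."""

    @abstractmethod
    def update_pool(self, pool, candidate, score):
        """Commit an accepted candidate and advance the pool version."""
```

Under this framing, supporting GEPA, ACE, or Meta-Harness is a matter of implementing these hooks, while workers, queues, and staleness policies stay shared.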

BibTeX

If you find FlashEvolve useful in your research, please consider citing:

@inproceedings{flashevolve2026,
  title     = {FlashEvolve: Accelerating Agent Evolution with Asynchronous Stage Orchestration},
  author    = {Anonymous Authors},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2026}
}

Selected References

  [1] NeMo RL: A Scalable and Efficient Post-Training Library. GitHub repository, 2025.
  [2] L. A. Agrawal et al. GEPA: Reflective prompt evolution can outperform reinforcement learning. arXiv:2507.19457, 2025.
  [12] W. Kwon et al. Efficient memory management for large language model serving with PagedAttention (vLLM). SOSP, 2023.
  [14] Y. Lee et al. Meta-Harness: End-to-end optimization of model harnesses. arXiv:2603.28052, 2026.
  [15] H. Li et al. Combee: Scaling prompt learning for self-improving language model agents. arXiv:2604.04247, 2026.
  [23] V. Pyatkin et al. IFBench: Generalizing verifiable instruction following. arXiv:2507.02833, 2025.
  [30] A. Yang et al. Qwen3 technical report. arXiv:2505.09388, 2025.
  [31] Z. Yang et al. HotpotQA: A dataset for diverse, explainable multi-hop question answering. EMNLP, 2018.
  [34] Q. Zhang et al. ACE: Agentic context engineering — evolving contexts for self-improving language models. arXiv:2510.04618, 2025.