FlashEvolve

An asynchronous worker-queue framework for LLM agent evolution — 3.5× faster than synchronous GEPA, with no algorithmic changes.

Anonymous Authors  ·  UC San Diego · Georgia Tech

3.5×   proposal speedup (vLLM, GEPA)
4.9×   proposal speedup (API serving)
2.27×  evolution rate (IFBench)
4.7×   speedup over Meta-Harness

The Problem

LLM-based agent evolution (GEPA, ACE, Meta-Harness) reflects on execution traces to refine prompts, memories, and harness code — outperforming GRPO by 6% on average. The catch: a single GEPA step takes ~2 minutes, and reaching a stable improvement takes over 2 hours on an H100. Stages run serially, and within each stage, a few long-tail requests hold up the entire batch.

[Figure 1: sequential dependencies and sample imbalance in agent evolution]
Figure 1. Two compounding inefficiencies in synchronous agent evolution: sequential dependencies between Rollout → Propose → Evaluate, and sample workload imbalance within each stage. The whole batch waits for the slowest sample.

The Idea

Treat the evolution loop as a set of LLM-heavy stages connected by asynchronous workers and queues. Stages overlap. When artifacts go stale — produced before the pool was updated — repair them in the language space instead of discarding them. Unlike stale weights in RL, a stale prompt is inspectable: the same LLM that proposed it can also patch it.

[Figure 3: FlashEvolve framework overview]
Figure 3. FlashEvolve executes agent evolution with async workers and queues across stages. Three staleness-handling policies (Full, Guarded, Reflective Async) plus speculative stage completion and adaptive worker control keep throughput high without sacrificing convergence.

How It Works

Three mechanisms working together:

01 · Pipeline

Async Workers & Queues

Each stage gets its own worker pool and queue. Rollout, propose, and evaluate overlap across evolution steps. Queue items carry the artifact pool version v so stale items can be detected.
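The stage/queue discipline above can be sketched in a few lines. This is an illustrative minimal version, not FlashEvolve's actual API — names like `ArtifactPool`, `QueueItem`, and `is_stale` are assumptions, and the real system runs LLM calls where `work` appears:

```python
import queue
import threading
from dataclasses import dataclass

@dataclass
class QueueItem:
    payload: object      # e.g. a rollout trace or a proposed prompt
    pool_version: int    # artifact-pool version v at creation time

class ArtifactPool:
    """Shared pool of evolved artifacts; every update bumps the version,
    implicitly marking older in-flight queue items as stale."""
    def __init__(self):
        self.version = 0
        self._lock = threading.Lock()

    def update(self):
        with self._lock:
            self.version += 1

def is_stale(item: QueueItem, pool: ArtifactPool) -> bool:
    return item.pool_version < pool.version

def run_stage(work, inbox, outbox, pool, n_workers=4):
    """Worker pool for one stage: pull an item, process it, tag the result
    with the current pool version, push downstream. A None item is a
    shutdown signal for one worker."""
    def worker():
        while True:
            item = inbox.get()
            if item is None:
                break
            outbox.put(QueueItem(work(item), pool.version))
    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    return threads
```

Because each result is tagged at push time, a downstream consumer can decide per item whether it is fresh or needs repair, rather than discarding a whole batch when the pool moves.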

02 · Repair

Reflective Patching

When a stale prompt arrives, an LLM worker inspects it against the artifact history, discards task-specific noise, and patches in transferable principles. Reaches 94.3% validation in 30 min on IFBench.
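A sketch of the repair step, with the actual proposer LLM abstracted behind a callable. The prompt wording and the `reflective_patch` / `llm_patch` names are illustrative assumptions; in FlashEvolve the patcher is the same LLM worker that produced the proposal:

```python
def reflective_patch(stale_prompt: str, missed_updates: list, llm_patch) -> str:
    """Repair a stale proposal in language space (Reflective Async).

    `llm_patch` stands in for the proposer LLM call. It sees the stale
    prompt together with the artifact-pool updates the prompt missed, and
    returns a patched prompt that keeps transferable principles while
    dropping details the updates made obsolete.
    """
    context = "\n".join(f"- {u}" for u in missed_updates)
    instruction = (
        "This candidate prompt was proposed before the following "
        f"artifact-pool updates:\n{context}\n"
        "Patch it: keep principles that transfer across tasks, remove "
        "task-specific details the updates have made obsolete.\n"
        f"CANDIDATE:\n{stale_prompt}"
    )
    return llm_patch(instruction)
```

This is the language-space analogue of off-policy correction: instead of reweighting or discarding a stale sample, the worker edits it back into validity.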

03 · Speculation

Speculative Completion

Release partial output after an α fraction of a stage's requests finish — completed samples flow downstream while long-tail requests continue in the background. With α=0.25, validation throughput rises from 0.5 → 3.15 val/min.
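The release rule can be sketched with `asyncio` — again a minimal illustration under assumed names (`speculative_stage`, `worker`), not the framework's implementation:

```python
import asyncio

async def speculative_stage(requests, worker, alpha=0.25):
    """Run one stage's requests concurrently, but release output as soon
    as an alpha fraction has finished; long-tail requests keep running in
    the background and can be collected (or dropped) later."""
    tasks = [asyncio.ensure_future(worker(r)) for r in requests]
    threshold = max(1, int(alpha * len(tasks)))
    released = []
    for fut in asyncio.as_completed(tasks):
        released.append(await fut)
        if len(released) >= threshold:
            break                      # speculative release point
    stragglers = [t for t in tasks if not t.done()]
    return released, stragglers
```

Downstream stages consume `released` immediately, which is what turns the long tail of a batch into background work instead of a barrier.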

Speedup Results

Measured on GEPA workloads with Qwen3-8B (vLLM, single H100) and GPT-4o-mini (API). All baselines run on the same DSPy + vLLM stack, so differences come from the pipeline, not the LLM backend.

[Figure 4: validation score over wall-clock time for GEPA, Combee, and FlashEvolve]
Figure 4. Validation score over wall-clock time (Qwen3-8B). FlashEvolve reaches 91% on IFBench in 39.3 min — Combee (B=40) takes 104 min to hit the same region. On HotpotQA, FlashEvolve maintains the highest score across the full 180-min budget.
3.5×   proposal throughput vs GEPA (vLLM avg)
4.9×   proposal throughput vs GEPA (API avg)
11.4   proposals/min, peak (AIME, vLLM)
10.1   proposals/min, peak (IFBench, API)

Validation-score / time budget (Qwen3-8B, 30 min):

Method          IFBench   Norm. rate   HoVer   HotpotQA   AIME
GEPA            87.6      1.00×        39.8    63.3       10.0
Combee (K=10)   88.5      1.39×        41.2    62.5       10.0
Combee (K=40)   86.5      0.55×        40.5    58.6       10.0
FlashEvolve     90.6      2.27×        42.0    61.7       15.0

See the paper for full ablations on worker concurrency, speculative thresholds, and staleness policies.

When It Fails

Async overhead dominates on tiny workloads. FlashEvolve assumes per-stage requests can be packed into batches large enough to hide queue bookkeeping. When the mini-batch is small or stages already finish in seconds, the per-item queue and version-tracking cost can outweigh the overlap gain. Reflective patching also depends on the LLM's ability to discriminate useful stale signal from noise — on workloads where most stale items are conflicting (e.g. HotpotQA early steps), Full Async and Reflective Async converge to similar scores within a 30-min budget.

Other Algorithms

The same async framework drops onto evolution algorithms beyond GEPA — ACE (context playbook evolution) and Meta-Harness (harness-code evolution) — without changing their semantics.

FlashEvolve raises Meta-Harness's proposal+validation throughput from 0.3 → 1.4 proposals/min, and accelerates ACE on FiNER and FormulaReasoning by similar margins. The framework is algorithm-agnostic: all it requires is a multi-stage loop with queueable intermediates and a shared artifact pool.
4.7×  speedup on Meta-Harness
3×+   speedup on ACE / FiNER