FlashEvolve

An asynchronous worker-queue framework for LLM agent evolution — 3.5× faster than synchronous GEPA, with no algorithmic changes.

Anonymous Authors  ·  UC San Diego · Georgia Tech

3.5×   proposal speedup (vLLM, GEPA)
4.9×   proposal speedup (API serving)
2.27×  evolution rate (IFBench)
4.7×   speedup over Meta-Harness

The Problem

LLM-based agent evolution (GEPA, ACE, Meta-Harness) reflects on execution traces to refine prompts, memories, and harness code — outperforming GRPO by 6% on average. The catch: a single GEPA step takes ~2 minutes, and reaching a stable improvement takes over 2 hours on an H100. Stages run serially, and within each stage, a few long-tail requests hold up the entire batch.

[Figure 1: sequential dependencies and sample imbalance in agent evolution]
Figure 1. Two compounding inefficiencies in synchronous agent evolution: sequential dependencies between Rollout → Propose → Evaluate, and sample workload imbalance within each stage. The whole batch waits for the slowest sample.

The Idea

Treat the evolution loop as a set of LLM-heavy stages connected by asynchronous workers and queues. Stages overlap. When artifacts go stale — produced before the pool was updated — repair them in the language space instead of discarding them. Unlike stale weights in RL, a stale prompt is inspectable: the same LLM that proposed it can also patch it.

[Figure 3: FlashEvolve framework overview]
Figure 3. FlashEvolve executes agent evolution with async workers and queues across stages. Three staleness-handling policies (Full, Guarded, Reflective Async) plus speculative stage completion and adaptive worker control keep throughput high without sacrificing convergence.

How It Works

Three mechanisms working together:

01 · Pipeline

Async Workers & Queues

Each stage gets its own worker pool and queue. Rollout, propose, and evaluate overlap across evolution steps. Queue items carry the artifact pool version v so stale items can be detected.
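The stage/queue discipline above can be sketched in a few lines. This is an illustrative minimal version, not FlashEvolve's actual API — names like `ArtifactPool`, `QueueItem`, and `is_stale` are assumptions, and the real system runs LLM calls where `work` appears:

```python
import queue
import threading
from dataclasses import dataclass

@dataclass
class QueueItem:
    payload: object      # e.g. a rollout trace or a proposed prompt
    pool_version: int    # artifact-pool version v at creation time

class ArtifactPool:
    """Shared pool of evolved artifacts; every update bumps the version,
    implicitly marking older in-flight queue items as stale."""
    def __init__(self):
        self.version = 0
        self._lock = threading.Lock()

    def update(self):
        with self._lock:
            self.version += 1

def is_stale(item: QueueItem, pool: ArtifactPool) -> bool:
    return item.pool_version < pool.version

def run_stage(work, inbox, outbox, pool, n_workers=4):
    """Worker pool for one stage: pull an item, process it, tag the result
    with the current pool version, push downstream. A None item is a
    shutdown signal for one worker."""
    def worker():
        while True:
            item = inbox.get()
            if item is None:
                break
            outbox.put(QueueItem(work(item), pool.version))
    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    return threads
```

Because each result is tagged at push time, a downstream consumer can decide per item whether it is fresh or needs repair, rather than discarding a whole batch when the pool moves.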

02 · Repair

Reflective Patching

When a stale prompt arrives, an LLM worker inspects it against the artifact history, discards task-specific noise, and patches in transferable principles. Reaches 94.3% validation in 30 min on IFBench.
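A sketch of the repair step, with the actual proposer LLM abstracted behind a callable. The prompt wording and the `reflective_patch` / `llm_patch` names are illustrative assumptions; in FlashEvolve the patcher is the same LLM worker that produced the proposal:

```python
def reflective_patch(stale_prompt: str, missed_updates: list, llm_patch) -> str:
    """Repair a stale proposal in language space (Reflective Async).

    `llm_patch` stands in for the proposer LLM call. It sees the stale
    prompt together with the artifact-pool updates the prompt missed, and
    returns a patched prompt that keeps transferable principles while
    dropping details the updates made obsolete.
    """
    context = "\n".join(f"- {u}" for u in missed_updates)
    instruction = (
        "This candidate prompt was proposed before the following "
        f"artifact-pool updates:\n{context}\n"
        "Patch it: keep principles that transfer across tasks, remove "
        "task-specific details the updates have made obsolete.\n"
        f"CANDIDATE:\n{stale_prompt}"
    )
    return llm_patch(instruction)
```

This is the language-space analogue of off-policy correction: instead of reweighting or discarding a stale sample, the worker edits it back into validity.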

03 · Speculation

Speculative Completion

Release partial output after an α fraction of a stage's requests finish — completed samples flow downstream while long-tail requests continue in the background. With α=0.25, validation throughput rises from 0.5 → 3.15 val/min.
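The release rule can be sketched with `asyncio` — again a minimal illustration under assumed names (`speculative_stage`, `worker`), not the framework's implementation:

```python
import asyncio

async def speculative_stage(requests, worker, alpha=0.25):
    """Run one stage's requests concurrently, but release output as soon
    as an alpha fraction has finished; long-tail requests keep running in
    the background and can be collected (or dropped) later."""
    tasks = [asyncio.ensure_future(worker(r)) for r in requests]
    threshold = max(1, int(alpha * len(tasks)))
    released = []
    for fut in asyncio.as_completed(tasks):
        released.append(await fut)
        if len(released) >= threshold:
            break                      # speculative release point
    stragglers = [t for t in tasks if not t.done()]
    return released, stragglers
```

Downstream stages consume `released` immediately, which is what turns the long tail of a batch into background work instead of a barrier.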

Speedup Results

Measured on GEPA workloads with Qwen3-8B (vLLM, single H100) and GPT-4o-mini (API). All baselines run on the same DSPy + vLLM stack, so differences come from the pipeline, not the LLM backend.

[Figure 4: validation score over wall-clock time for GEPA, Combee, and FlashEvolve]
Figure 4. Validation score over wall-clock time (Qwen3-8B). FlashEvolve reaches 91% on IFBench in 39.3 min — Combee (B=40) takes 104 min to hit the same region. On HotpotQA, FlashEvolve maintains the highest score across the full 180-min budget.
3.5×   proposal throughput vs GEPA (vLLM avg)
4.9×   proposal throughput vs GEPA (API avg)
11.4   proposals/min, peak (AIME, vLLM)
10.1   proposals/min, peak (IFBench, API)

Validation-score / time budget (Qwen3-8B, 30 min):

Method          IFBench   Norm. rate   HoVer   HotpotQA   AIME
GEPA            87.6      1.00×        39.8    63.3       10.0
Combee (K=10)   88.5      1.39×        41.2    62.5       10.0
Combee (K=40)   86.5      0.55×        40.5    58.6       10.0
FlashEvolve     90.6      2.27×        42.0    61.7       15.0

See the paper for full ablations on worker concurrency, speculative thresholds, and staleness policies.

When It Fails

Async overhead dominates on tiny workloads. FlashEvolve assumes per-stage requests can be packed into batches large enough to hide queue bookkeeping. When the mini-batch is small or stages already finish in seconds, the per-item queue and version-tracking cost can outweigh the overlap gain. Reflective patching also depends on the LLM's ability to discriminate useful stale signal from noise — on workloads where most stale items are conflicting (e.g. HotpotQA early steps), Full Async and Reflective Async converge to similar scores within a 30-min budget.

Other Algorithms

The same async framework drops onto evolution algorithms beyond GEPA — ACE (context playbook evolution) and Meta-Harness (harness-code evolution) — without changing their semantics.

FlashEvolve raises Meta-Harness's proposal+validation throughput from 0.3 → 1.4 proposals/min, and accelerates ACE on FiNER and FormulaReasoning by similar margins. The framework is algorithm-agnostic: all it requires is a multi-stage loop with queueable intermediates and a shared artifact pool.
4.7×  speedup on Meta-Harness
3×+   speedup on ACE / FiNER