Abstract
LLM-based evolution has emerged as a promising way to improve agents by refining non-parametric artifacts, but its wall-clock cost remains a major bottleneck. We identify that this cost comes from synchronized stage execution and imbalance inside each LLM-heavy stage. We present FlashEvolve, an efficient framework that replaces synchronized execution with asynchronous workers and queues, allowing different stages and steps to overlap. To handle data staleness introduced by asynchrony, FlashEvolve tracks artifact versions and applies different policies to update, discard, or patch stale artifacts. Unlike weight-space staleness in asynchronous RL, language-space staleness is inspectable and repairable: a stale artifact is not just delayed work, but readable evidence that the LLM can reflect on, revise, and turn into useful evolution signal. FlashEvolve further improves throughput and token efficiency with speculative stage completion and adaptive workflow control. On GEPA workloads, FlashEvolve improves proposal throughput by $3.5\times$ on local vLLM and $4.9\times$ on API serving over synchronous GEPA. The same design also applies to ACE and Meta-Harness.
1. Introduction
A growing line of recent work enables LLM agents to evolve themselves. Instead of updating model weights, these systems iteratively refine the non-parametric components that govern their behavior, including system prompts [1, 2, 3], context and memory [4, 5, 6], harness code [7, 8], and generated programs [9, 10]. This emerging paradigm of test-time self-evolution [11] fundamentally relaxes the access requirements of weight-space adaptation: it requires neither labeled trajectories nor gradient updates. By reflecting on full execution traces rather than scalar rewards, it draws a richer learning signal from each rollout: GEPA [1] outperforms GRPO with an average gain of 6% across six reasoning benchmarks, while Meta-Harness [8] automatically discovers agent harnesses that surpass the best hand-engineered baselines.
Despite its algorithmic appeal, agent evolution remains expensive in wall-clock execution time. Existing algorithms pursue "faster" evolution by improving the quality of each step through stronger reflection [1, 4, 12], better proposal and search [10], or larger-batch updates [13], thereby reducing the number of steps needed. However, fewer steps do not necessarily translate into shorter wall-clock time. On IFBench, a single GEPA evolution step already takes ~2 minutes; Combee [13] parallelizes proposal generation, but further stretches each step to ~2.8 minutes. Reaching a stable improvement requires more than two hours on an H100 GPU.
Such high wall-clock cost comes from synchronized stage execution. As shown in Figure 1, each evolution step runs a sequence of LLM-heavy stages — running the current artifact on a mini-batch of inputs, proposing a new candidate artifact, and evaluating the new one — and a later stage cannot start until the previous one fully completes. This serial structure prevents overlap across stages.
The cost is amplified by generation imbalance inside each stage. Request lengths vary widely across samples, creating a long-tail effect: the longest requests determine the execution time of the whole stage. This reduces the effective batch size in both local serving frameworks [14, 15] and API-based remote calls, leading to low resource utilization and inefficient waiting for long samples.
To this end, we present FlashEvolve, a framework that improves the time efficiency of agent evolution through asynchronous stage orchestration. FlashEvolve treats an evolution loop as a set of LLM-heavy stages connected by queues. This allows artifact execution, proposal generation, evaluation, and pool update to overlap in time, turning a synchronized loop into a streaming execution pipeline.
This design introduces new systems challenges. Asynchronous execution can generate stale items because the artifact pool may change while earlier items are still waiting in queues. FlashEvolve handles this with artifact-version tracking and staleness-aware policies that compare versions to discard overly stale items, or reflectively patch stale language artifacts. This repairability is specific to agent evolution: unlike weight updates in SFT or reinforcement learning, evolution artifacts are prompts, memories, harness code, or programs. A stale artifact is therefore an inspectable object: its relation to the current pool can be judged as complementary, redundant, or conflicting, and it can be revised by the same LLM mechanism used for proposal. This makes staleness a semantic repair problem rather than only a scheduling hazard.
FlashEvolve further reduces waiting inside long stages through speculative completion, and uses adaptive workflow control to balance workload across stages. Together, these mechanisms improve throughput while preserving the quality of evolution.
2. Background & Motivation
2.1 Agent evolution: self-improvement beyond weight updates
Agent evolution has emerged as a new paradigm for adapting LLM-based systems to new data and tasks [11]. This stems from the strong reasoning capability of modern LLMs, which enables a single model to reflect on its own trajectories [16], critique its own outputs [17], and propose new artifacts that govern its own behavior, ranging from prompts, memory, and harness code, to generated programs. Crucially, this happens without modifying model weights, sidestepping the training infrastructure of supervised fine-tuning and reinforcement learning while delivering comparable or stronger gains.
For example, GEPA [1] and ACE [4] use reflection on execution traces to evolve system prompts and contextual playbooks. Meta-Harness [8] and AutoHarness [7] use a coding agent to evolve the harness. AlphaEvolve [9] and ShinkaEvolve [10] push this further, evolving the generated programs the agent uses to solve problems.
An evolution loop iterates over multiple steps, each consisting of several stages. The LLM-heavy ones are typically Generate, Propose, and Evaluate. Generate runs the current artifact on tasks to collect trajectories; Propose reflects on these to produce a new candidate; Evaluate scores the candidate against task signals. A subsequent update commits the new artifact to the artifact pool. At the start of each step, new candidates are selected from the pool via methods like Pareto-aware sampling [1] or evolutionary tournaments [9].
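For concreteness, the sketch below shows one synchronous step under these definitions. The `select`, `generate`, `propose`, and `evaluate` callables and the pool interface are illustrative placeholders, not any specific algorithm's API:

```python
# A minimal sketch of one synchronous evolution step. All helpers are
# hypothetical: select/generate/propose/evaluate stand in for the
# algorithm's LLM-heavy stages, and `pool` is an abstract artifact pool.

def evolution_step(pool, tasks, select, generate, propose, evaluate):
    parent = select(pool)                      # e.g. Pareto-aware sampling [1]
    trajectories = generate(parent, tasks)     # Generate: run artifact on a mini-batch
    candidate = propose(parent, trajectories)  # Propose: reflect and edit the artifact
    score = evaluate(candidate, tasks)         # Evaluate: score against task signals
    if score > pool.score(parent):             # Update: commit improving candidates
        pool.add(candidate, score)
    return candidate, score
```

Each call blocks until its whole batch of LLM requests returns; this is exactly the synchronization that Section 2.2 identifies as the bottleneck.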
2.2 Inefficiency: sequential and imbalanced stages
Despite its algorithmic appeal, agent evolution remains expensive in wall-clock time. Even with state-of-the-art LLM serving such as vLLM [14] (continuous batching and prefix caching), GEPA with Qwen3-8B takes 50 minutes to complete 49 evolution steps on IFBench [18], and 134 minutes for 411 steps on HotpotQA [19].
This inefficiency stems from sequential and synchronized stage execution. Each step runs its LLM-heavy stages serially, and each stage internally waits for all parallel LLM requests to finish before advancing. This creates two compounding costs:
- Serial stage chain. Total step time is the sum of per-stage durations, with no opportunity to overlap. As shown in Figure 2(a), stage time is highly imbalanced — different stages become the bottleneck depending on the workload and algorithm.
- Synchronization barriers. Each stage's barrier forces the entire batch to wait for the slowest request. As shown in Figure 2(b), output lengths within a stage show a long-tail distribution. Sequential execution and intra-stage imbalance reduce effective concurrency and leave the LLM backend underutilized (Figure 2(c)).
Such inefficiency cannot be solved by simply launching more LLM requests in parallel. Agent evolution must convert a synchronized multi-stage loop into a streaming workflow while preserving artifact-evolution semantics. This creates two challenges. First, asynchrony introduces artifact-level staleness: intermediate results may be produced from an artifact pool that has already changed before they are consumed. Second, naive parallel scaling can amplify workload imbalance, causing queue buildup, longer staleness windows, and wasted LLM work.
Analogy to asynchronous RL
These challenges resemble those of synchronous LLM RL systems [20, 21, 22], which also suffer from synchronization overhead and workload imbalance. Asynchronous RL addresses this by overlapping rollout generation with training and controlling off-policy optimization [23, 24, 25]. Agent evolution differs in two key ways. First, it contains multiple LLM inference stages rather than a single rollout stage, each with batched generation behavior and its own long-tail imbalance. Second, staleness occurs over inspectable language artifacts (prompts, memories, harness code, programs) rather than continuous model weights. This allows a more flexible design space for staleness handling policies.
3. FlashEvolve: Asynchronous Framework for Agent Evolution
We present FlashEvolve, an asynchronous framework that removes the sequential and imbalanced behavior identified above. FlashEvolve decomposes an evolution loop into asynchronous workers connected by queues, so different stages and evolution steps can overlap. Each queue item carries the artifact state and pool version, allowing FlashEvolve to detect stale items. On top of this execution model, it introduces staleness-aware data handling, speculative stage completion, and adaptive workflow control to improve time efficiency.
3.1 Asynchronous execution with workers and queues
Asynchronous workers. FlashEvolve turns a synchronized evolution step into asynchronous workers connected by queues. Instead of waiting for proposal, validation, and pool update to finish before starting the next step, workers continuously process ready items and pass their outputs to downstream queues. Each stage has an input queue and a set of workers. A queue item carries the artifact being evolved, the input/output, and the artifact-pool version $v_i$ at item creation. The pool version increases after each pool update, so $v_i$ can be compared with the current version $v$ to detect stale items.
Worker concurrency. FlashEvolve assigns each stage $i$ a worker count $K_i$. A larger $K_i$ allows more tasks to issue LLM requests at the same time, which increases per-stage concurrency so the pipeline is not bottlenecked by a single slow stage. The tradeoff is data staleness: larger worker counts increase the chance that queued items were generated from an older artifact pool state.
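A minimal sketch of this execution model follows, assuming in-process queues (e.g. `queue.Queue`) and a pool object exposing a monotonically increasing `version` counter; `Item` mirrors the queue-item contents described above, and `stage_fn` stands in for a stage's LLM-heavy work:

```python
import threading
from dataclasses import dataclass
from typing import Any

@dataclass
class Item:
    artifact: Any      # the prompt / context / harness code being evolved
    payload: Any       # stage input or output (trajectories, candidate, scores)
    pool_version: int  # pool version v_i observed at item creation

def stage_worker(stage_fn, in_q, out_q, pool):
    """One worker loop: pull a ready item, run the stage's LLM-heavy work,
    stamp the current pool version, and push the result downstream."""
    while True:
        item = in_q.get()
        artifact, payload = stage_fn(item)  # issues this stage's LLM calls
        out_q.put(Item(artifact, payload, pool.version))

def launch_stage(stage_fn, in_q, out_q, pool, k):
    """Assign K_i = k workers to a stage. More workers raise concurrency,
    but widen the window in which queued items can go stale."""
    for _ in range(k):
        threading.Thread(target=stage_worker,
                         args=(stage_fn, in_q, out_q, pool),
                         daemon=True).start()
```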
3.2 Staleness-aware data handling
Asynchronous execution improves throughput, but the artifact pool may be updated while earlier items are still waiting in queues. FlashEvolve supports three policies for handling such stale items, each with a different tradeoff (a minimal sketch follows the list):
- Full Async. Does not check artifact-pool versions and allows all items to continue through the pipeline. This maximizes throughput but may inject outdated updates into the artifact pool.
- Guarded Async. Defines the version gap $\Delta_i = v - v_i$ and discards item $i$ when $\Delta_i > \Delta_{\max}$. This keeps highly stale items out of the pipeline, but wastes the tokens already spent on discarded items.
- Reflective Async. Adds a reflection worker stage. For an item with $\Delta_i > 0$, the reflection worker uses the stale item and all pool updates between versions $v_i$ and $v$ to decide whether the item still contributes a useful change. If so, it patches the item against the current pool; otherwise, it discards it. This reuses useful stale items, reducing wasted LLM generations.
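The three policies reduce to a single dispatch over the version gap. In this sketch, `delta_max`, the pool's `updates_since` helper, and the LLM-backed `reflect` call are illustrative names:

```python
def handle_stale(item, pool, policy, delta_max=2, reflect=None):
    """Apply one staleness policy to a dequeued item: return the item to
    keep processing, or None to drop it."""
    if policy == "full_async":
        return item                                  # never check versions
    delta = pool.version - item.pool_version         # version gap Delta_i = v - v_i
    if policy == "guarded_async":
        return item if delta <= delta_max else None  # drop highly stale items
    if policy == "reflective_async":
        if delta == 0:
            return item                              # fresh item, nothing to repair
        history = pool.updates_since(item.pool_version)  # pool updates in (v_i, v]
        return reflect(item, history)  # LLM patches the item, or returns None
    raise ValueError(f"unknown policy: {policy}")
```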
Why language-space staleness can be repaired
Language-space staleness is discrete and inspectable, unlike parameter staleness in asynchronous RL, which is continuous and opaque. In RL, a stale item is tied to an older point in weight space, so systems typically handle it through importance weighting, bounded delay, or discard. In agent evolution, a stale item is text or code — a prompt edit, memory update, harness mutation, or generated program. FlashEvolve can therefore inspect the stale item together with the intervening artifact history and decide whether the edit is orthogonal, already subsumed, or conflicting with the current pool. This makes repair a first-class operation: stale items can be patched when they contain reusable information, or discarded when they are too specific or inconsistent.
3.3 Speculative stage completion
Asynchronous workers remove waiting between stages, but each worker may still wait for all LLM requests in its current stage before writing to the next queue, an intra-stage synchronization barrier that is most pronounced in the Generate and Evaluate stages. To reduce this barrier, FlashEvolve allows a stage to release partial output after a fraction $\alpha_{\mathrm{spec}}^{i} \in (0, 1]$ of its requests has finished. The worker packages the completed results as a tentative queue item and continues the remaining requests in the background, while downstream workers can start from the tentative item.
For evaluation, FlashEvolve adds a score threshold to avoid forwarding weak candidates: if the partial score exceeds the current pool score, the candidate is inserted as a speculative artifact. When full evaluation finishes, the artifact is confirmed if it still passes acceptance; otherwise, it is removed and downstream items derived from it fall back to the staleness-aware policy.
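A sketch of the release mechanism, assuming each stage runs its LLM requests on a thread pool; `run_one` stands in for a single request, and the tentative/final tagging is illustrative:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_stage_speculative(requests, run_one, alpha_spec, out_q):
    """Emit a tentative result once a fraction alpha_spec of the stage's
    LLM requests finish, then the full result once all of them finish."""
    with ThreadPoolExecutor(max_workers=len(requests)) as ex:
        futures = [ex.submit(run_one, r) for r in requests]
        release_at = max(1, int(alpha_spec * len(futures)))
        done = []
        for fut in as_completed(futures):
            done.append(fut.result())
            if len(done) == release_at:
                out_q.put(("tentative", list(done)))  # downstream starts early
    # On rejection of a speculative artifact, items derived from it fall
    # back to the staleness-aware policy of Section 3.2.
    out_q.put(("final", done))
```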
Validation-set reordering. Speculative completion is more reliable when the early validation samples are informative. FlashEvolve reorders the validation set using sample pass history: samples that pass for $w$ consecutive rounds are moved out of the speculative prefix and placed later. We set $w=3$ to avoid reacting to one-round noise while keeping the prefix responsive to artifact improvement.
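A minimal sketch of the reordering rule, assuming `pass_history` maps each (hashable) sample to its per-round pass/fail record:

```python
def reorder_validation(samples, pass_history, w=3):
    """Move samples that passed the last w consecutive rounds out of the
    speculative prefix, so early validation results stay informative."""
    def saturated(s):
        h = pass_history.get(s, [])
        return len(h) >= w and all(h[-w:])
    front = [s for s in samples if not saturated(s)]   # still informative
    back = [s for s in samples if saturated(s)]        # consistently passing
    return front + back  # the speculative prefix reads from the front
```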
3.4 Adaptive workflow control
Different stages produce and consume items at different rates. A stage with short LLM requests can quickly fill the queue of a later stage whose requests are longer. FlashEvolve measures the item production rate of each asynchronous stage and adjusts worker counts: if a stage produces items at less than half the median stage rate, we increase its worker count; if at more than twice the median, we decrease it. Each adjustment changes the worker count by at most one, and each stage has a minimum and maximum worker count. This avoids large swings while still correcting persistent throughput imbalance.
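The rule is a small control loop over measured per-stage production rates; the `k_min`/`k_max` bounds are illustrative:

```python
import statistics

def adjust_workers(rates, workers, k_min=1, k_max=16):
    """Nudge worker counts by at most one per adjustment: stages producing
    below half the median rate gain a worker; stages producing above twice
    the median lose one."""
    median = statistics.median(rates.values())
    for stage, rate in rates.items():
        if rate < 0.5 * median:
            workers[stage] = min(k_max, workers[stage] + 1)
        elif rate > 2.0 * median:
            workers[stage] = max(k_min, workers[stage] - 1)
    return workers
```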
3.5 Implementation
FlashEvolve is implemented in Python with lightweight threads and in-process queues. Pool updates are applied under a lock. For a fair comparison, all open-source baselines and FlashEvolve run on the same LLM serving stack: native LLM calls are replaced by the same DSPy client backed by a local vLLM [14] server with an OpenAI-compatible endpoint. All methods therefore benefit from the same continuous batching and KV-cache reuse; throughput differences mainly reflect the optimization of the evolution pipeline.
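For concreteness, the shared serving setup can be wired up roughly as follows, using DSPy's `dspy.LM` client against vLLM's OpenAI-compatible endpoint (model name and port are illustrative):

```python
import dspy

# Point DSPy at a local vLLM server's OpenAI-compatible endpoint.
# Model name and port are illustrative; a server can be launched with
# e.g. `vllm serve Qwen/Qwen3-8B`.
lm = dspy.LM(
    "openai/Qwen/Qwen3-8B",
    api_base="http://localhost:8000/v1",
    api_key="EMPTY",  # local vLLM does not require a real key by default
)
dspy.configure(lm=lm)
```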
4. Evaluation
4.1 Setup
We evaluate FlashEvolve on three test-time evolution algorithms that optimize different artifacts: GEPA [1] (prompts), ACE [4] (context playbooks), and Meta-Harness [8] (harness code). We compare against the scaling-oriented baseline Combee [13] with fixed-batch variants $B = 10$ and $B = 40$. For open-source experiments we use Qwen3-8B [26] served by vLLM on a single H100 80GB GPU; for API experiments we use GPT-4o-mini [27]. Datasets follow each original algorithm: IFBench [18], HotpotQA [19], HoVer [28], and AIME for GEPA; FiNER [29] and FormulaReasoning for ACE; Symptom2Disease and AGNews [30] for Meta-Harness.
4.2 Throughput on GEPA workloads
Table 1 shows that FlashEvolve substantially improves LLM throughput. Asynchronous orchestration keeps the LLM backend busier by overlapping requests from different stages and steps. On local vLLM, FlashEvolve improves LLM throughput by $3.4\times$ over GEPA and $1.9\times$ over the best Combee setting on average; on API serving, by $2.9\times$ and $1.5\times$ respectively.
Higher LLM throughput translates into faster artifact exploration. On local vLLM, FlashEvolve improves proposal throughput by $3.5\times$ on average over both GEPA and the best Combee. On API serving, the gains grow to $4.9\times$ over GEPA and $8.4\times$ over the best Combee.
**Table 1.** Throughput on GEPA workloads. Columns 2-5 report LLM throughput (tok/s); columns 6-9 report proposal throughput (prop/min).

| Method | IFBench | HotpotQA | HoVer | AIME | IFBench | HotpotQA | HoVer | AIME |
|---|---|---|---|---|---|---|---|---|
| *vLLM with Qwen3-8B* | | | | | | | | |
| GEPA | 963 | 30 | 461 | 200 | 1.9 | 4.6 | 2.5 | 2.2 |
| Combee ($B{=}10$) | 696 | 38 | 810 | 994 | 1.2 | 2.7 | 2.0 | 6.2 |
| Combee ($B{=}40$) | 900 | 44 | 891 | 977 | 0.7 | 4.5 | 2.0 | 1.6 |
| FlashEvolve | 2,688 | 93 | 1,255 | 998 | 8.9 | 8.8 | 5.9 | 11.4 |
| *API with GPT-4o-mini* | | | | | | | | |
| GEPA | 361 | 14 | 142 | 103 | 1.7 | 2.4 | 1.8 | 1.3 |
| Combee ($B{=}10$) | 397 | 18 | 348 | 211 | 1.0 | 1.4 | 0.8 | 1.0 |
| Combee ($B{=}40$) | 389 | 23 | 214 | 336 | 0.8 | 1.2 | 0.7 | 0.6 |
| FlashEvolve | 791 | 32 | 352 | 485 | 10.1 | 8.0 | 9.1 | 6.6 |
4.3 Evolution efficiency and long-time runs
Within a fixed 30-minute budget on the four GEPA workloads, FlashEvolve achieves an average normalized evolution rate of $1.43\times$. The strongest gain is on IFBench: the validation score rises from 87.6 to 90.6 and the normalized rate reaches $2.27\times$. On HoVer, FlashEvolve also achieves the best score, at a $1.15\times$ rate. On AIME, both GEPA and Combee remain at the initial 10.0%, while FlashEvolve reaches 15.0%, the only method to improve over the initial score.
Figure 4 reports validation score over a longer 180-minute budget. FlashEvolve reaches strong scores much earlier than the synchronous baselines. On IFBench, it reaches 91% in 39.3 minutes, while Combee ($B{=}40$) reaches the same region only after 104.2 minutes. On HotpotQA, FlashEvolve reaches its best 66.41% at 56.1 minutes and maintains the highest validation score throughout the full budget.
4.4 Ablations
Staleness handling
Figure 5 compares the three staleness policies on IFBench. Full Async and Guarded Async achieve similar final scores. Reflective Async achieves the best evolution efficiency, reaching a 94.3% validation score within the 30-minute budget. Inspecting the logs shows that stale items are not always useless: when the stale artifact is text, FlashEvolve can discard task-specific noise (e.g., concrete formulas) and reuse transferable principles (e.g., stricter constraint checking, self-contained reasoning) by distilling them into a compact prompt patch. Many score jumps in the Reflective Async curve come from prompts after such patches, suggesting that reflective repair improves the quality of proposals rather than only increasing throughput.
Concurrent workers & speculative completion
Larger worker counts greatly increase proposal throughput, from 7 artifacts/min in the synchronous setting to 99 artifacts/min with $K_1{=}16$, $K_3{=}8$ (Figure 6a). However, validation throughput does not scale uniformly — naive scaling can shift the bottleneck. Adaptive control balances queue pressure and stage rates, achieving the highest accepted proposal throughput (Figure 6b). Speculative completion with $\alpha_{\mathrm{spec}}^3{=}0.25$ increases validation-stage throughput to 3.15 validations/min and improves validation score by 4.49 points within 30 minutes; with $\alpha_{\mathrm{spec}}^3{=}0.5$ the speculative gate becomes less effective (Figure 6c).
4.5 Generality: ACE and Meta-Harness
FlashEvolve is algorithm-agnostic: it does not rely on a specific artifact type, only on the evolution loop containing multiple stages that need orchestration. On ACE, FlashEvolve reaches better validation scores within the same 30-minute budget: from $0.60$ to $0.66$ on FiNER and from $0.66$ to $0.70$ on FormulaReasoning. On Meta-Harness, it improves proposal and validation throughput from $0.3$ to $1.4$ proposals/min, a $4.7\times$ speedup. With higher proposal throughput, FlashEvolve samples and validates more harness candidates within the same time budget, increasing its potential for improvement.
5. Conclusion
We present FlashEvolve, a framework for accelerating agent evolution in wall-clock time. It replaces synchronous stage execution with asynchronous workers and queues, allowing LLM-heavy stages and evolution steps to overlap. It preserves evolution semantics through artifact-pool versioning and staleness-aware policies that update, discard, or patch stale artifacts, and further improves efficiency with speculative stage completion and adaptive workflow control. On GEPA workloads, FlashEvolve achieves $3.5\times$ higher proposal throughput over the synchronous implementation on local vLLM serving. The same execution model also generalizes to context evolution with ACE and harness-code evolution with Meta-Harness.
Limitations
FlashEvolve currently supports only a limited set of agent-evolution algorithms, and each integration still requires algorithm-specific implementation effort. While the worker-and-queue abstraction is general, mapping a new algorithm requires implementing its stages, queue items, artifact state, and update rules. Our evaluation focuses on representative prompt, context, and harness-code workloads; broader coverage of artifact types remains future work.
Future work
We plan to expand FlashEvolve with a more general plugin interface for defining stages, artifacts, staleness policies, and evaluation logic, so that new evolution algorithms can be integrated with less manual engineering. We also plan to extend FlashEvolve to more types of artifact evolution, such as memory, tool-use policies, and generated programs.
References
- [1] GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning. L. A. Agrawal, S. Tan, D. Soylu, N. Ziems, R. Khare, K. Opsahl-Ong, A. Singhvi, et al. arXiv:2507.19457, 2025.
- [2] PromptAgent: Strategic Planning with Language Models Enables Expert-Level Prompt Optimization. X. Wang, C. Li, Z. Wang, F. Bai, H. Luo, J. Zhang, N. Jojic, E. P. Xing, Z. Hu. arXiv:2310.16427, 2023.
- [3] Prompt-MII: Meta-Learning Instruction Induction for LLMs. E. Xiao, Y. Zeng, A. Chen, C.-J. Li, A. Bertsch, G. Neubig. arXiv:2510.16932, 2025.
- [4] Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models. Q. Zhang, C. Hu, S. Upasani, B. Ma, F. Hong, V. Kamanuru, et al. arXiv:2510.04618, 2025.
- [5] ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory. S. Ouyang, J. Yan, I. Hsu, Y. Chen, K. Jiang, Z. Wang, et al. arXiv:2509.25140, 2025.
- [6] MemEvolve: Meta-Evolution of Agent Memory Systems. G. Zhang, H. Ren, C. Zhan, Z. Zhou, J. Wang, H. Zhu, W. Zhou, S. Yan. arXiv:2512.18746, 2025.
- [7] AutoHarness: Improving LLM Agents by Automatically Synthesizing a Code Harness. X. Lou, M. Lázaro-Gredilla, A. Dedieu, C. Wendelken, W. Lehrach, K. P. Murphy. arXiv:2603.03329, 2026.
- [8] Meta-Harness: End-to-End Optimization of Model Harnesses. Y. Lee, R. Nair, Q. Zhang, K. Lee, O. Khattab, C. Finn. arXiv:2603.28052, 2026.
- [9] AlphaEvolve: A Coding Agent for Scientific and Algorithmic Discovery. A. Novikov, N. Vũ, M. Eisenberger, E. Dupont, P.-S. Huang, A. Z. Wagner, et al. arXiv:2506.13131, 2025.
- [10] ShinkaEvolve: Towards Open-Ended and Sample-Efficient Program Evolution. R. T. Lange, Y. Imajuku, E. Cetin. arXiv:2509.19349, 2025.
- [11] A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence. H.-a. Gao, J. Geng, W. Hua, M. Hu, X. Juan, H. Liu, et al. arXiv:2507.21046, 2025.
- [12] TextGrad: Automatic "Differentiation" via Text. M. Yuksekgonul, F. Bianchi, J. Boen, S. Liu, Z. Huang, C. Guestrin, J. Zou. arXiv:2406.07496, 2024.
- [13] Combee: Scaling Prompt Learning for Self-Improving Language Model Agents. H. Li, R. He, Q. Zhang, C. Ji, Q. Mang, X. Chen, L. A. Agrawal, et al. arXiv:2604.04247, 2026.
- [14] Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM). W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, I. Stoica. SOSP, 2023.
- [15] SGLang: Efficient Execution of Structured Language Model Programs. L. Zheng, L. Yin, Z. Xie, C. Sun, J. Huang, C. H. Yu, et al. NeurIPS, 2024.
- [16] Reflexion: Language Agents with Verbal Reinforcement Learning. N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, S. Yao. NeurIPS, 2023.
- [17] Self-Refine: Iterative Refinement with Self-Feedback. A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, et al. NeurIPS, 2023.
- [18] Generalizing Verifiable Instruction Following (IFBench). V. Pyatkin, S. Malik, V. Graf, H. Ivison, S. Huang, P. Dasigi, N. Lambert, H. Hajishirzi. arXiv:2507.02833, 2025.
- [19] HotpotQA: A Dataset for Diverse, Explainable Multi-Hop Question Answering. Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, C. D. Manning. EMNLP, 2018.
- [20] HybridFlow: A Flexible and Efficient RLHF Framework. G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, C. Wu. EuroSys, 2025.
- [21] NeMo RL: A Scalable and Efficient Post-Training Library. NVIDIA. GitHub, 2025.
- [22] JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training. Z. Hu, H. Ouyang, C. Chen, Z. Pan, Y. Guan, Z. Yu, Z. Wang, S. Swanson, Y. Ding. arXiv:2604.23838, 2026.
- [23] AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning. W. Fu, J. Gao, X. Shen, C. Zhu, Z. Mei, C. He, et al. arXiv:2505.24298, 2025.
- [24] StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation. Y. Zhong, Z. Zhang, X. Song, H. Hu, C. Jin, B. Wu, et al. arXiv:2504.15930, 2025.
- [25] Laminar: A Scalable Asynchronous RL Post-Training Framework. G. Sheng, Y. Tong, B. Wan, W. Zhang, C. Jia, X. Wu, et al. arXiv:2510.12633, 2025.
- [26] Qwen3 Technical Report. A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, et al. arXiv:2505.09388, 2025.
- [27] GPT-4o System Card. A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, et al. arXiv:2410.21276, 2024.
- [28] HoVer: A Dataset for Many-Hop Fact Extraction and Claim Verification. Y. Jiang, S. Bordia, Z. Zhong, C. Dognin, M. Singh, M. Bansal. Findings of EMNLP, 2020.
- [29] FiNER: Financial Numeric Entity Recognition for XBRL Tagging. L. Loukas, M. Fergadiotis, I. Chalkidis, E. Spyropoulou, P. Malakasiotis, I. Androutsopoulos, G. Paliouras. ACL, 2022.
- [30] Character-Level Convolutional Networks for Text Classification (AGNews). X. Zhang, J. Zhao, Y. LeCun. NeurIPS, 2015.