
agentic-frameworks
Researchers from Renmin University of China and Kuaishou Technology developed Agentic Entropy-Balanced Policy Optimization (AEPO), an algorithm designed to stabilize and enhance the training of web agents by dynamically balancing entropy during rollout and policy updates. AEPO achieved 47.6% Pass@1 on the GAIA benchmark and reduced tool calls by approximately half compared to other RL methods, demonstrating improved performance and training stability on complex, multi-turn tasks.
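A minimal sketch of the rollout half of this idea, entropy-guided branching with a cap on consecutive high-entropy branches; the threshold, budget handling, and function names below are illustrative assumptions, not the paper's algorithm, and the policy-update half is not shown:

```python
import math
from typing import List

def token_entropy(probs: List[float]) -> float:
    """Shannon entropy of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def plan_rollout_branches(step_entropies: List[float],
                          budget: int,
                          high_entropy_threshold: float = 1.0,
                          max_consecutive: int = 1) -> List[int]:
    """Pick rollout steps at which to branch extra samples: branch where the
    policy is most uncertain, but never at too many adjacent steps, so the
    sampling budget is not spent on a single uncertain region."""
    branch_steps, consecutive = [], 0
    for t, h in enumerate(step_entropies):
        if len(branch_steps) >= budget:
            break
        if h >= high_entropy_threshold and consecutive < max_consecutive:
            branch_steps.append(t)
            consecutive += 1
        else:
            consecutive = 0
    return branch_steps

# Toy example: only the second step is uncertain enough to branch on, and the
# consecutive-branching cap blocks another branch at the third step.
entropies = [token_entropy(p) for p in (
    [0.9, 0.05, 0.05],          # confident step
    [0.25, 0.25, 0.25, 0.25],   # highly uncertain step
    [0.4, 0.3, 0.3],            # moderately uncertain step
)]
print(plan_rollout_branches(entropies, budget=2))  # -> [1]
```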
LabOS is an AI co-scientist system, developed by researchers at Stanford and Princeton, that integrates a self-evolving AI agent with an XR-enabled physical lab interface to accelerate scientific discovery. It achieved over 90% accuracy in real-time error detection for lab procedures and successfully identified novel targets in cancer immunotherapy and cell fusion research.
Researchers from ICT, CAS and collaborating institutions present the first comprehensive survey of Vibe Coding, a novel LLM-powered software development methodology, formalizing its processes and outlining five distinct development models. The work thoroughly analyzes the ecosystem's infrastructure, revealing critical challenges in human-AI collaboration and a shift in developer roles.
Researchers from The Chinese University of Hong Kong developed a framework for assessing large language models' ability to design functional, physically simulated machines using a novel environment and agentic workflows. They demonstrated that while LLMs can generate functional designs, they require advanced techniques like iterative refinement and reinforcement learning to overcome limitations in spatial and physical reasoning.
The Agentic Context Engineering (ACE) framework dynamically evolves and curates comprehensive 'playbook' contexts for large language models, allowing them to continuously improve performance. This enables smaller, open-source models to match or exceed proprietary LLM agent performance on benchmarks like AppWorld, simultaneously reducing adaptation latency by up to 91.5% and token cost by 83.6%.
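A hedged sketch of what an evolving-playbook loop could look like; `run_agent` and `reflect` stand in for LLM-backed components and are assumptions, not the ACE API:

```python
from dataclasses import dataclass, field

@dataclass
class Playbook:
    """An evolving context 'playbook': a list of reusable strategy bullets."""
    bullets: list = field(default_factory=list)

    def render(self) -> str:
        return "\n".join(f"- {b}" for b in self.bullets)

    def apply_delta(self, add: list, remove: list) -> None:
        # Incremental curation: merge new lessons and drop stale ones instead
        # of regenerating the whole context from scratch.
        self.bullets = [b for b in self.bullets if b not in set(remove)]
        for b in add:
            if b not in self.bullets:
                self.bullets.append(b)

def adapt(playbook: Playbook, task, run_agent, reflect):
    """One adaptation step: act with the playbook in context, then curate it."""
    trajectory = run_agent(task, context=playbook.render())
    delta = reflect(trajectory)            # e.g. {"add": [...], "remove": [...]}
    playbook.apply_delta(delta.get("add", []), delta.get("remove", []))
    return trajectory
```

Editing the context with small deltas rather than rewriting it wholesale is one plausible source of the latency and token savings the summary reports.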
Researchers from the University of Wisconsin-Madison, Stanford University, and Salesforce AI Research introduced LiveResearchBench and DeepEval, a novel benchmark and evaluation suite to rigorously assess the deep research capabilities of AI agents. This framework provides 100 expert-curated, user-centric tasks and a comprehensive evaluation methodology across six dimensions, revealing that while current systems excel at information collection, they frequently struggle with analytical depth and reliable citation.
The MemAct framework enables Large Language Model agents to autonomously manage their working memory by treating context curation as learnable actions, addressing a critical bottleneck in long-horizon tasks. This approach achieves 59.1% accuracy on multi-objective QA while reducing average context tokens to 3,447, outperforming larger baselines and improving training efficiency by up to 40%.
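One way to read "context curation as learnable actions" is to put memory edits in the same action space as tool calls, so the same RL objective optimizes both; the JSON schema and `env.call` interface below are illustrative assumptions rather than the paper's API:

```python
import json

def step(agent_output: str, memory: list, env):
    """Dispatch one agent action: either edit working memory or act on the task.
    Because memory edits are ordinary actions, they can be trained end to end."""
    action = json.loads(agent_output)
    if action["type"] == "edit_memory":
        # Replace a span of past observations with a short summary.
        lo, hi = action["span"]
        memory[lo:hi] = [action["summary"]]
        return None                        # no environment step taken
    elif action["type"] == "tool_call":
        obs = env.call(action["name"], **action.get("args", {}))
        memory.append(str(obs))
        return obs
    else:
        raise ValueError(f"unknown action type: {action['type']}")
```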
Researchers from EPIC Lab at Shanghai Jiao Tong University and collaborators introduce AI4Service, a proactive AI assistance paradigm leveraging the Alpha-Service framework deployed on AI glasses. This system anticipates user needs and provides real-time, context-aware assistance by integrating multimodal perception, LLM-based reasoning, external tool access, and personalized long-term memory. It successfully demonstrates capabilities across scenarios such as gaming, museum tours, and shopping.
Information Gain-based Policy Optimization (IGPO) introduces an intrinsic, turn-level reward mechanism for multi-turn LLM agents, deriving rewards from the model's evolving confidence in the ground truth. This approach improved average F1 score by 4.8 points over the best prior method, DeepResearcher, across seven datasets and yielded substantial gains for smaller models.
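A minimal sketch of a turn-level information-gain reward, assuming "confidence" is read off as the policy's probability of the ground-truth answer after each turn; the numbers are made up for illustration:

```python
def information_gain_rewards(ground_truth_probs):
    """Intrinsic turn rewards from the rise in the model's probability of the
    ground-truth answer. ground_truth_probs[t] is p(answer | context through
    turn t), with index 0 holding the pre-interaction prior."""
    return [p_t - p_prev
            for p_prev, p_t in zip(ground_truth_probs, ground_truth_probs[1:])]

# Example: confidence climbs as the agent gathers evidence across three turns.
print(information_gain_rewards([0.05, 0.10, 0.40, 0.85]))
# -> [0.05, 0.30, 0.45]
```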
Researchers from ByteDance Seed, Carnegie Mellon University, and Stanford University developed Context-Folding, a framework allowing Large Language Model (LLM) agents to actively manage their context window for long-horizon tasks. Leveraging a reinforcement learning algorithm called FoldGRPO, the agent learns to dynamically branch for subtasks and condense their trajectories, achieving over 90% context compression and outperforming baseline agents on deep research and agentic software engineering benchmarks.
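A rough sketch of a branch-and-fold control loop under an assumed agent interface (`next_action`, `run_branch`, `summarize`, and `execute` are placeholders, not the paper's API); the learned part, FoldGRPO deciding when to branch and what to keep, is not shown:

```python
def run_with_folding(task, agent, max_steps: int = 50):
    """Branch for a subtask in an isolated sub-context, then fold: keep only a
    short summary of the branch in the main context."""
    context = [task]
    for _ in range(max_steps):
        action = agent.next_action(context)
        if action.kind == "branch":
            trajectory = agent.run_branch(action.subtask)   # isolated sub-context
            context.append(agent.summarize(trajectory))     # fold the branch away
        elif action.kind == "finish":
            return action.answer
        else:
            context.append(agent.execute(action))
    return None
```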
Ax-Prover is a multi-agent framework that equips general-purpose Large Language Models (LLMs) with formal Lean tools, enabling them to reliably prove theorems across diverse scientific domains, including mathematics and quantum physics. The system significantly outperforms specialized provers on new abstract algebra and quantum theory benchmarks, while also identifying a critical error in a published cryptography proof.
A fundamental limitation of current AI agents is their inability to learn complex skills on the fly at test time, often behaving like "clever but clueless interns" in novel environments. This severely limits their practical utility. To systematically measure and drive progress on this challenge, we first introduce the Jericho Test-Time Learning (J-TTL) benchmark. J-TTL is a new evaluation setup where an agent must play the same game for several consecutive episodes, attempting to improve its performance from one episode to the next. On J-TTL, we find that existing adaptation methods such as reflection, memory, or reinforcement learning struggle. To address the challenges posed by our benchmark, we present EvoTest, an evolutionary test-time learning framework that improves an agent without any fine-tuning or gradients, by evolving the entire agentic system after every episode. EvoTest has two roles: the Actor Agent, which plays the game, and the Evolver Agent, which analyzes the episode transcript to propose a revised configuration for the next run. This configuration rewrites the prompt, updates memory by logging effective state-action choices, tunes hyperparameters, and learns tool-use routines. On our J-TTL benchmark, EvoTest consistently improves performance, outperforming not only reflection and memory-only baselines but also more complex online fine-tuning methods. Notably, our method is the only one capable of winning two games (Detective and Library), while all baselines fail to win any.
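A gradient-free sketch of the Actor/Evolver loop described above; `actor` and `evolver` stand in for LLM-backed components, and the configuration fields mirror the abstract rather than the paper's exact schema:

```python
from dataclasses import dataclass, field

@dataclass
class AgentConfig:
    prompt: str
    memory: list = field(default_factory=list)    # logged effective state-action choices
    hyperparams: dict = field(default_factory=lambda: {"temperature": 0.7})
    tool_routines: dict = field(default_factory=dict)

def test_time_learn(env, actor, evolver, config: AgentConfig, episodes: int = 10):
    """Play an episode with the current configuration, then let the Evolver
    rewrite the configuration for the next run; no weights are updated."""
    best_score, best_config = float("-inf"), config
    for _ in range(episodes):
        transcript, score = actor(env, config)     # Actor Agent plays one episode
        if score > best_score:
            best_score, best_config = score, config
        config = evolver(transcript, config)       # Evolver Agent proposes the next config
    return best_config, best_score
```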
Large Language Model (LLM) agents are powering a growing share of interactive web applications, yet remain vulnerable to misuse and harm. Prior jailbreak research has largely focused on single-turn prompts, whereas real harassment often unfolds over multi-turn interactions. In this work, we present the Online Harassment Agentic Benchmark consisting of: (i) a synthetic multi-turn harassment conversation dataset, (ii) a multi-agent (e.g., harasser, victim) simulation informed by repeated game theory, (iii) three jailbreak methods attacking agents across memory, planning, and fine-tuning, and (iv) a mixed-methods evaluation framework. We utilize two prominent LLMs, LLaMA-3.1-8B-Instruct (open-source) and Gemini-2.0-flash (closed-source). Our results show that jailbreak tuning makes harassment nearly guaranteed with an attack success rate of 95.78–96.89% vs. 57.25–64.19% without tuning in LLaMA, and 99.33% vs. 98.46% without tuning in Gemini, while sharply reducing the refusal rate to 1–2% in both models. The most prevalent toxic behaviors are Insult with 84.9–87.8% vs. 44.2–50.8% without tuning, and Flaming with 81.2–85.1% vs. 31.5–38.8% without tuning, indicating weaker guardrails compared to sensitive categories such as sexual or racial harassment. Qualitative evaluation further reveals that attacked agents reproduce human-like aggression profiles, such as Machiavellian/psychopathic patterns under planning, and narcissistic tendencies with memory. Counterintuitively, closed-source and open-source models exhibit distinct escalation trajectories across turns, with closed-source models showing significant vulnerability. Overall, our findings show that multi-turn and theory-grounded attacks not only succeed at high rates but also mimic human-like harassment dynamics, motivating the development of robust safety guardrails to ultimately keep online platforms safe and responsible.
Researchers from The Chinese University of Hong Kong and Tencent AI Lab developed `WebAggregator`, a series of foundation models that improve deep research agents' ability to aggregate information from the web. By introducing the "Explore to Evolve" paradigm for automated data generation, these models achieve 56.3% Pass@1 on GAIA-text, outperforming GPT-4.1 (43.7%) and approaching Claude-3.7-sonnet (60.2%).
SR-Scientist transforms large language models into autonomous AI scientists for symbolic regression by enabling tool-use and long-horizon optimization, achieving superior precision and robustness in scientific equation discovery across multiple disciplines. Developed by researchers at Shanghai Jiao Tong University, this framework also incorporates reinforcement learning for agent self-improvement.
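A compact sketch of the propose-and-evaluate loop such an agent runs, with `propose` standing in for the LLM's tool-using equation writer; the plain mean-squared-error scoring and iteration count are assumptions:

```python
import numpy as np

def sr_agent_loop(propose, X, y, iterations: int = 20):
    """Long-horizon equation discovery: the LLM proposes candidate equations as
    Python callables, a data tool scores them, and the error history is fed
    back for the next proposal."""
    history, best = [], (float("inf"), None)
    for _ in range(iterations):
        candidate = propose(history)                 # e.g. lambda x: 3.1 * x**2 + 0.5
        mse = float(np.mean((candidate(X) - y) ** 2))
        history.append((candidate, mse))
        if mse < best[0]:
            best = (mse, candidate)
    return best                                      # (lowest error, best equation)
```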
KVCOMM introduces a training-free and prompt-adaptive framework for efficient multi-agent LLM systems, enabling robust reuse of Key-Value (KV) caches across diverse contexts. It achieves up to 7.8x prefilling speedup and an average of 6.7x speedup on multi-agent tasks while maintaining or improving accuracy compared to existing baselines.
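For orientation, the simplified baseline below only reuses KV entries when a message segment and its preceding context repeat verbatim across agents; KVCOMM's contribution is reusing caches even when the preceding context differs, which this sketch deliberately does not attempt. The class and `compute_kv` callable are illustrative assumptions:

```python
import hashlib

class SegmentKVCache:
    """Content-addressed reuse of per-segment KV entries across agents."""
    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(prefix: str, segment: str) -> str:
        return hashlib.sha256((prefix + "\x00" + segment).encode()).hexdigest()

    def get_or_compute(self, prefix: str, segment: str, compute_kv):
        key = self._key(prefix, segment)
        if key not in self._store:
            # compute_kv is a hypothetical call that runs prefill and returns KV tensors.
            self._store[key] = compute_kv(prefix, segment)
        return self._store[key]
```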
Scale AI researchers introduced IRIS, a new benchmark evaluating Multimodal Large Language Models (MLLMs) on their ability to actively perceive, transform, and reason with images using external tools, moving beyond passive image interpretation. The evaluation found that current MLLMs, including leading models, struggle significantly with tool-enabled visual reasoning, with the best model achieving an average pass rate of only 18.68%.
Researchers from Tencent Youtu Lab developed Training-Free Group Relative Policy Optimization, a method that enhances LLM agent performance in specialized tasks by learning and integrating experiential knowledge as a token prior without modifying model parameters. This approach achieved substantial performance gains on mathematical reasoning and web searching benchmarks with significantly reduced data and computational costs, leveraging the full capabilities of frozen large LLMs.
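A hedged sketch of one learning step in this spirit: the model stays frozen, a group of rollouts is compared, and the within-group gap is distilled into natural-language experience that is prepended to future prompts as a token prior. All callables and the group size are assumptions, not the paper's interfaces:

```python
def training_free_grpo_step(task, experiences, sample_group, score, distill):
    """Update the experience library (the 'token prior') without touching weights."""
    prompt_prior = "\n".join(f"- {e}" for e in experiences)
    rollouts = sample_group(task, prior=prompt_prior)   # e.g. a group of 8 frozen-model rollouts
    scored = sorted(rollouts, key=score, reverse=True)
    lesson = distill(best=scored[0], worst=scored[-1])  # LLM summarizes what separated them
    if lesson:
        experiences.append(lesson)
    return experiences
```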
Agentic Self-Learning (ASL) is introduced as a framework for training LLM-based agents in open-domain search environments without relying on human-curated datasets or predefined rule-based rewards. The framework enables a closed-loop co-evolution of a prompt generator, policy model, and generative reward model, demonstrating continuous performance improvement, superior long-term accuracy, and robustness against reward hacking.
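A schematic of one co-evolution round under assumed interfaces (`propose_tasks`, `solve`, `grade`, and `update` are placeholders), included only to make the closed loop concrete:

```python
def agentic_self_learning_round(generator, policy, reward_model, update):
    """The generator writes new search tasks, the policy attempts them, the
    generative reward model grades the attempts, and all three components are
    updated from the same batch so they keep pace with one another."""
    tasks = generator.propose_tasks()
    trajectories = [policy.solve(t) for t in tasks]
    grades = [reward_model.grade(t, traj) for t, traj in zip(tasks, trajectories)]
    update(policy, trajectories, grades)          # RL update of the policy
    update(reward_model, trajectories, grades)    # keep the verifier ahead of the policy
    update(generator, trajectories, grades)       # push task difficulty toward the frontier
    return grades
```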
Researchers at UCSD and Intel developed AT-GRPO, an on-policy reinforcement learning framework designed for collaborative large language model agents, coupled with a novel multi-agent system training infrastructure. This approach achieved near-optimal accuracy of 96.0-99.5% in long-horizon planning tasks and yielded substantial performance improvements across coding and mathematical reasoning benchmarks.
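A small sketch of group-relative advantage computation when rollouts are grouped per agent role and per turn, which is one plausible reading of the grouping in AT-GRPO; the grouping keys and the normalization epsilon are assumptions:

```python
from collections import defaultdict
from statistics import mean, pstdev

def grouped_advantages(samples):
    """Each sample is a dict with 'agent', 'turn', and 'reward'; rewards are
    normalized only against rollouts from the same agent role and turn."""
    groups = defaultdict(list)
    for s in samples:
        groups[(s["agent"], s["turn"])].append(s["reward"])
    advantages = []
    for s in samples:
        rewards = groups[(s["agent"], s["turn"])]
        mu, sigma = mean(rewards), pstdev(rewards)
        advantages.append((s["reward"] - mu) / (sigma + 1e-6))
    return advantages
```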