
agentic-frameworks
Researchers from Renmin University of China and Kuaishou Technology developed Agentic Entropy-Balanced Policy Optimization (AEPO), an algorithm designed to stabilize and enhance the training of web agents by dynamically balancing entropy during rollout and policy updates. AEPO achieved 47.6% Pass@1 on the GAIA benchmark and reduced tool calls by approximately half compared to other RL methods, demonstrating improved performance and training stability on complex, multi-turn tasks.
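A minimal sketch of the rollout half of this idea, entropy-guided branching with a cap on consecutive high-entropy branches; the threshold, budget handling, and function names below are illustrative assumptions, not the paper's algorithm, and the policy-update half is not shown:

```python
import math
from typing import List

def token_entropy(probs: List[float]) -> float:
    """Shannon entropy of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def plan_rollout_branches(step_entropies: List[float],
                          budget: int,
                          high_entropy_threshold: float = 1.0,
                          max_consecutive: int = 1) -> List[int]:
    """Pick rollout steps at which to branch extra samples: branch where the
    policy is most uncertain, but never at too many adjacent steps, so the
    sampling budget is not spent on a single uncertain region."""
    branch_steps, consecutive = [], 0
    for t, h in enumerate(step_entropies):
        if len(branch_steps) >= budget:
            break
        if h >= high_entropy_threshold and consecutive < max_consecutive:
            branch_steps.append(t)
            consecutive += 1
        else:
            consecutive = 0
    return branch_steps

# Toy example: only the second step is uncertain enough to branch on, and the
# consecutive-branching cap blocks another branch at the third step.
entropies = [token_entropy(p) for p in (
    [0.9, 0.05, 0.05],          # confident step
    [0.25, 0.25, 0.25, 0.25],   # highly uncertain step
    [0.4, 0.3, 0.3],            # moderately uncertain step
)]
print(plan_rollout_branches(entropies, budget=2))  # -> [1]
```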
LabOS is an AI co-scientist system, developed by researchers at Stanford and Princeton, that integrates a self-evolving AI agent with an XR-enabled physical lab interface to accelerate scientific discovery. It achieved over 90% accuracy in real-time error detection for lab procedures and successfully identified novel targets in cancer immunotherapy and cell fusion research.
Researchers from ICT, CAS and collaborating institutions present the first comprehensive survey of Vibe Coding, a novel LLM-powered software development methodology, formalizing its processes and outlining five distinct development models. The work thoroughly analyzes the ecosystem's infrastructure, revealing critical challenges in human-AI collaboration and a shift in developer roles.
Researchers from The Chinese University of Hong Kong developed a framework for assessing large language models' ability to design functional, physically simulated machines using a novel environment and agentic workflows. They demonstrated that while LLMs can generate functional designs, they require advanced techniques like iterative refinement and reinforcement learning to overcome limitations in spatial and physical reasoning.
The Agentic Context Engineering (ACE) framework dynamically evolves and curates comprehensive 'playbook' contexts for large language models, allowing them to continuously improve performance. This enables smaller, open-source models to match or exceed proprietary LLM agent performance on benchmarks like AppWorld, simultaneously reducing adaptation latency by up to 91.5% and token cost by 83.6%.
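A hedged sketch of what an evolving-playbook loop could look like; `run_agent` and `reflect` stand in for LLM-backed components and are assumptions, not the ACE API:

```python
from dataclasses import dataclass, field

@dataclass
class Playbook:
    """An evolving context 'playbook': a list of reusable strategy bullets."""
    bullets: list = field(default_factory=list)

    def render(self) -> str:
        return "\n".join(f"- {b}" for b in self.bullets)

    def apply_delta(self, add: list, remove: list) -> None:
        # Incremental curation: merge new lessons and drop stale ones instead
        # of regenerating the whole context from scratch.
        self.bullets = [b for b in self.bullets if b not in set(remove)]
        for b in add:
            if b not in self.bullets:
                self.bullets.append(b)

def adapt(playbook: Playbook, task, run_agent, reflect):
    """One adaptation step: act with the playbook in context, then curate it."""
    trajectory = run_agent(task, context=playbook.render())
    delta = reflect(trajectory)            # e.g. {"add": [...], "remove": [...]}
    playbook.apply_delta(delta.get("add", []), delta.get("remove", []))
    return trajectory
```

Editing the context with small deltas rather than rewriting it wholesale is one plausible source of the latency and token savings the summary reports.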
Researchers from the University of Wisconsin-Madison, Stanford University, and Salesforce AI Research introduced LiveResearchBench and DeepEval, a novel benchmark and evaluation suite to rigorously assess the deep research capabilities of AI agents. This framework provides 100 expert-curated, user-centric tasks and a comprehensive evaluation methodology across six dimensions, revealing that while current systems excel at information collection, they frequently struggle with analytical depth and reliable citation.
The MemAct framework enables Large Language Model agents to autonomously manage their working memory by treating context curation as learnable actions, addressing a critical bottleneck in long-horizon tasks. This approach achieves 59.1% accuracy on multi-objective QA while reducing average context tokens to 3,447, outperforming larger baselines and improving training efficiency by up to 40%.
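One way to read "context curation as learnable actions" is to put memory edits in the same action space as tool calls, so the same RL objective optimizes both; the JSON schema and `env.call` interface below are illustrative assumptions rather than the paper's API:

```python
import json

def step(agent_output: str, memory: list, env):
    """Dispatch one agent action: either edit working memory or act on the task.
    Because memory edits are ordinary actions, they can be trained end to end."""
    action = json.loads(agent_output)
    if action["type"] == "edit_memory":
        # Replace a span of past observations with a short summary.
        lo, hi = action["span"]
        memory[lo:hi] = [action["summary"]]
        return None                        # no environment step taken
    elif action["type"] == "tool_call":
        obs = env.call(action["name"], **action.get("args", {}))
        memory.append(str(obs))
        return obs
    else:
        raise ValueError(f"unknown action type: {action['type']}")
```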
Researchers from EPIC Lab at Shanghai Jiao Tong University and collaborators introduce AI4Service, a proactive AI assistance paradigm leveraging the Alpha-Service framework deployed on AI glasses. This system anticipates user needs and provides real-time, context-aware assistance by integrating multimodal perception, LLM-based reasoning, external tool access, and personalized long-term memory. It successfully demonstrates capabilities across scenarios such as gaming, museum tours, and shopping.
Information Gain-based Policy Optimization (IGPO) introduces an intrinsic, turn-level reward mechanism for multi-turn LLM agents, deriving rewards from the model's evolving confidence in the ground truth. This approach improved average F1 score by 4.8 points over the best prior method, DeepResearcher, across seven datasets and yielded substantial gains for smaller models.
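A minimal sketch of a turn-level information-gain reward, assuming "confidence" is read off as the policy's probability of the ground-truth answer after each turn; the numbers are made up for illustration:

```python
def information_gain_rewards(ground_truth_probs):
    """Intrinsic turn rewards from the rise in the model's probability of the
    ground-truth answer. ground_truth_probs[t] is p(answer | context through
    turn t), with index 0 holding the pre-interaction prior."""
    return [p_t - p_prev
            for p_prev, p_t in zip(ground_truth_probs, ground_truth_probs[1:])]

# Example: confidence climbs as the agent gathers evidence across three turns.
print(information_gain_rewards([0.05, 0.10, 0.40, 0.85]))
# -> [0.05, 0.30, 0.45]
```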
Researchers from ByteDance Seed, Carnegie Mellon University, and Stanford University developed Context-Folding, a framework allowing Large Language Model (LLM) agents to actively manage their context window for long-horizon tasks. Leveraging a reinforcement learning algorithm called FoldGRPO, the agent learns to dynamically branch for subtasks and condense their trajectories, achieving over 90% context compression and outperforming baseline agents on deep research and agentic software engineering benchmarks.
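A rough sketch of a branch-and-fold control loop under an assumed agent interface (`next_action`, `run_branch`, `summarize`, and `execute` are placeholders, not the paper's API); the learned part, FoldGRPO deciding when to branch and what to keep, is not shown:

```python
def run_with_folding(task, agent, max_steps: int = 50):
    """Branch for a subtask in an isolated sub-context, then fold: keep only a
    short summary of the branch in the main context."""
    context = [task]
    for _ in range(max_steps):
        action = agent.next_action(context)
        if action.kind == "branch":
            trajectory = agent.run_branch(action.subtask)   # isolated sub-context
            context.append(agent.summarize(trajectory))     # fold the branch away
        elif action.kind == "finish":
            return action.answer
        else:
            context.append(agent.execute(action))
    return None
```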
Ax-Prover is a multi-agent framework that equips general-purpose Large Language Models (LLMs) with formal Lean tools, enabling them to reliably prove theorems across diverse scientific domains, including mathematics and quantum physics. The system significantly outperforms specialized provers on new abstract algebra and quantum theory benchmarks, while also identifying a critical error in a published cryptography proof.
A fundamental limitation of current AI agents is their inability to learn complex skills on the fly at test time, often behaving like "clever but clueless interns" in novel environments. This severely limits their practical utility. To systematically measure and drive progress on this challenge, we first introduce the Jericho Test-Time Learning (J-TTL) benchmark. J-TTL is a new evaluation setup where an agent must play the same game for several consecutive episodes, attempting to improve its performance from one episode to the next. On J-TTL, we find that existing adaptation methods such as reflection, memory, or reinforcement learning struggle. To address the challenges posed by our benchmark, we present EvoTest, an evolutionary test-time learning framework that improves an agent without any fine-tuning or gradients, by evolving the entire agentic system after every episode. EvoTest has two roles: the Actor Agent, which plays the game, and the Evolver Agent, which analyzes the episode transcript to propose a revised configuration for the next run. This configuration rewrites the prompt, updates memory by logging effective state-action choices, tunes hyperparameters, and learns tool-use routines. On our J-TTL benchmark, EvoTest consistently improves performance, outperforming not only reflection and memory-only baselines but also more complex online fine-tuning methods. Notably, our method is the only one capable of winning two games (Detective and Library), while all baselines fail to win any.
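A gradient-free sketch of the Actor/Evolver loop described above; `actor` and `evolver` stand in for LLM-backed components, and the configuration fields mirror the abstract rather than the paper's exact schema:

```python
from dataclasses import dataclass, field

@dataclass
class AgentConfig:
    prompt: str
    memory: list = field(default_factory=list)    # logged effective state-action choices
    hyperparams: dict = field(default_factory=lambda: {"temperature": 0.7})
    tool_routines: dict = field(default_factory=dict)

def test_time_learn(env, actor, evolver, config: AgentConfig, episodes: int = 10):
    """Play an episode with the current configuration, then let the Evolver
    rewrite the configuration for the next run; no weights are updated."""
    best_score, best_config = float("-inf"), config
    for _ in range(episodes):
        transcript, score = actor(env, config)     # Actor Agent plays one episode
        if score > best_score:
            best_score, best_config = score, config
        config = evolver(transcript, config)       # Evolver Agent proposes the next config
    return best_config, best_score
```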
Large Language Model (LLM) agents are powering a growing share of interactive web applications, yet remain vulnerable to misuse and harm. Prior jailbreak research has largely focused on single-turn prompts, whereas real harassment often unfolds over multi-turn interactions. In this work, we present the Online Harassment Agentic Benchmark consisting of: (i) a synthetic multi-turn harassment conversation dataset, (ii) a multi-agent (e.g., harasser, victim) simulation informed by repeated game theory, (iii) three jailbreak methods attacking agents across memory, planning, and fine-tuning, and (iv) a mixed-methods evaluation framework. We utilize two prominent LLMs, LLaMA-3.1-8B-Instruct (open-source) and Gemini-2.0-flash (closed-source). Our results show that jailbreak tuning makes harassment nearly guaranteed with an attack success rate of 95.78–96.89% vs. 57.25–64.19% without tuning in LLaMA, and 99.33% vs. 98.46% without tuning in Gemini, while sharply reducing the refusal rate to 1–2% in both models. The most prevalent toxic behaviors are Insult with 84.9–87.8% vs. 44.2–50.8% without tuning, and Flaming with 81.2–85.1% vs. 31.5–38.8% without tuning, indicating weaker guardrails compared to sensitive categories such as sexual or racial harassment. Qualitative evaluation further reveals that attacked agents reproduce human-like aggression profiles, such as Machiavellian/psychopathic patterns under planning, and narcissistic tendencies with memory. Counterintuitively, closed-source and open-source models exhibit distinct escalation trajectories across turns, with closed-source models showing significant vulnerability. Overall, our findings show that multi-turn and theory-grounded attacks not only succeed at high rates but also mimic human-like harassment dynamics, motivating the development of robust safety guardrails to ultimately keep online platforms safe and responsible.
Researchers from The Chinese University of Hong Kong and Tencent AI Lab developed `WebAggregator`, a series of foundation models that improve deep research agents' ability to aggregate information from the web. By introducing the "Explore to Evolve" paradigm for automated data generation, these models achieve 56.3% Pass@1 on GAIA-text, outperforming GPT-4.1 (43.7%) and approaching Claude-3.7-sonnet (60.2%).
SR-Scientist transforms large language models into autonomous AI scientists for symbolic regression by enabling tool-use and long-horizon optimization, achieving superior precision and robustness in scientific equation discovery across multiple disciplines. Developed by researchers at Shanghai Jiao Tong University, this framework also incorporates reinforcement learning for agent self-improvement.
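A compact sketch of the propose-and-evaluate loop such an agent runs, with `propose` standing in for the LLM's tool-using equation writer; the plain mean-squared-error scoring and iteration count are assumptions:

```python
import numpy as np

def sr_agent_loop(propose, X, y, iterations: int = 20):
    """Long-horizon equation discovery: the LLM proposes candidate equations as
    Python callables, a data tool scores them, and the error history is fed
    back for the next proposal."""
    history, best = [], (float("inf"), None)
    for _ in range(iterations):
        candidate = propose(history)                 # e.g. lambda x: 3.1 * x**2 + 0.5
        mse = float(np.mean((candidate(X) - y) ** 2))
        history.append((candidate, mse))
        if mse < best[0]:
            best = (mse, candidate)
    return best                                      # (lowest error, best equation)
```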
KVCOMM introduces a training-free and prompt-adaptive framework for efficient multi-agent LLM systems, enabling robust reuse of Key-Value (KV) caches across diverse contexts. It achieves up to 7.8x prefilling speedup and an average of 6.7x speedup on multi-agent tasks while maintaining or improving accuracy compared to existing baselines.
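For orientation, the simplified baseline below only reuses KV entries when a message segment and its preceding context repeat verbatim across agents; KVCOMM's contribution is reusing caches even when the preceding context differs, which this sketch deliberately does not attempt. The class and `compute_kv` callable are illustrative assumptions:

```python
import hashlib

class SegmentKVCache:
    """Content-addressed reuse of per-segment KV entries across agents."""
    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(prefix: str, segment: str) -> str:
        return hashlib.sha256((prefix + "\x00" + segment).encode()).hexdigest()

    def get_or_compute(self, prefix: str, segment: str, compute_kv):
        key = self._key(prefix, segment)
        if key not in self._store:
            # compute_kv is a hypothetical call that runs prefill and returns KV tensors.
            self._store[key] = compute_kv(prefix, segment)
        return self._store[key]
```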
Scale AI researchers introduced IRIS, a new benchmark evaluating Multimodal Large Language Models (MLLMs) on their ability to actively perceive, transform, and reason with images using external tools, moving beyond passive image interpretation. The evaluation found that current MLLMs, including leading models, struggle significantly with tool-enabled visual reasoning, with the best model achieving an average pass rate of only 18.68%.
Researchers from Tencent Youtu Lab developed Training-Free Group Relative Policy Optimization, a method that enhances LLM agent performance in specialized tasks by learning and integrating experiential knowledge as a token prior without modifying model parameters. This approach achieved substantial performance gains on mathematical reasoning and web searching benchmarks with significantly reduced data and computational costs, leveraging the full capabilities of frozen large LLMs.
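A hedged sketch of one learning step in this spirit: the model stays frozen, a group of rollouts is compared, and the within-group gap is distilled into natural-language experience that is prepended to future prompts as a token prior. All callables and the group size are assumptions, not the paper's interfaces:

```python
def training_free_grpo_step(task, experiences, sample_group, score, distill):
    """Update the experience library (the 'token prior') without touching weights."""
    prompt_prior = "\n".join(f"- {e}" for e in experiences)
    rollouts = sample_group(task, prior=prompt_prior)   # e.g. a group of 8 frozen-model rollouts
    scored = sorted(rollouts, key=score, reverse=True)
    lesson = distill(best=scored[0], worst=scored[-1])  # LLM summarizes what separated them
    if lesson:
        experiences.append(lesson)
    return experiences
```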
Agentic Self-Learning (ASL) is introduced as a framework for training LLM-based agents in open-domain search environments without relying on human-curated datasets or predefined rule-based rewards. The framework enables a closed-loop co-evolution of a prompt generator, policy model, and generative reward model, demonstrating continuous performance improvement, superior long-term accuracy, and robustness against reward hacking.
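A schematic of one co-evolution round under assumed interfaces (`propose_tasks`, `solve`, `grade`, and `update` are placeholders), included only to make the closed loop concrete:

```python
def agentic_self_learning_round(generator, policy, reward_model, update):
    """The generator writes new search tasks, the policy attempts them, the
    generative reward model grades the attempts, and all three components are
    updated from the same batch so they keep pace with one another."""
    tasks = generator.propose_tasks()
    trajectories = [policy.solve(t) for t in tasks]
    grades = [reward_model.grade(t, traj) for t, traj in zip(tasks, trajectories)]
    update(policy, trajectories, grades)          # RL update of the policy
    update(reward_model, trajectories, grades)    # keep the verifier ahead of the policy
    update(generator, trajectories, grades)       # push task difficulty toward the frontier
    return grades
```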
Researchers at UCSD and Intel developed AT-GRPO, an on-policy reinforcement learning framework designed for collaborative large language model agents, coupled with a novel multi-agent system training infrastructure. This approach achieved near-optimal accuracy of 96.0-99.5% in long-horizon planning tasks and yielded substantial performance improvements across coding and mathematical reasoning benchmarks.
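A small sketch of group-relative advantage computation when rollouts are grouped per agent role and per turn, which is one plausible reading of the grouping in AT-GRPO; the grouping keys and the normalization epsilon are assumptions:

```python
from collections import defaultdict
from statistics import mean, pstdev

def grouped_advantages(samples):
    """Each sample is a dict with 'agent', 'turn', and 'reward'; rewards are
    normalized only against rollouts from the same agent role and turn."""
    groups = defaultdict(list)
    for s in samples:
        groups[(s["agent"], s["turn"])].append(s["reward"])
    advantages = []
    for s in samples:
        rewards = groups[(s["agent"], s["turn"])]
        mu, sigma = mean(rewards), pstdev(rewards)
        advantages.append((s["reward"] - mu) / (sigma + 1e-6))
    return advantages
```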