Baidu Inc.
The advent of Large Language Models (LLMs) is transforming search engines into conversational AI search products, primarily using Retrieval-Augmented Generation (RAG) on web corpora. However, this paradigm has significant industrial limitations. Traditional RAG approaches struggle with real-time needs and structured queries that require accessing dynamically generated content like ticket availability or inventory. Limited to indexing static pages, search engines cannot perform the interactive queries needed for such time-sensitive data. Academic research has focused on optimizing RAG for static content, overlooking complex intents and the need for dynamic sources like databases and real-time APIs. To bridge this gap, we introduce TURA (Tool-Augmented Unified Retrieval Agent for AI Search), a novel three-stage framework that combines RAG with agentic tool-use to access both static content and dynamic, real-time information. TURA has three key components: an Intent-Aware Retrieval module to decompose queries and retrieve information sources encapsulated as Model Context Protocol (MCP) Servers, a DAG-based Task Planner that models task dependencies as a Directed Acyclic Graph (DAG) for optimal parallel execution, and a lightweight Distilled Agent Executor for efficient tool calling. TURA is the first architecture to systematically bridge the gap between static RAG and dynamic information sources for a world-class AI search product. Serving tens of millions of users, it leverages an agentic framework to deliver robust, real-time answers while meeting the low-latency demands of a large-scale industrial system.
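The abstract leaves the planner's implementation unspecified; below is a minimal Python sketch of the general idea behind DAG-ordered parallel tool execution, with a hypothetical call_tool stub and made-up tool names rather than TURA's actual interface.

```python
import asyncio

# Minimal sketch of DAG-ordered parallel tool execution. The tool names and
# the call_tool stub are hypothetical placeholders, not TURA's real interface.

async def call_tool(name: str, inputs: dict) -> dict:
    await asyncio.sleep(0.1)  # stand-in for an MCP server / API round trip
    return {name: f"result of {name} given {sorted(inputs)}"}

async def run_dag(deps: dict[str, list[str]]) -> dict[str, dict]:
    """deps maps each task name to the list of task names it depends on (a DAG)."""
    tasks: dict[str, asyncio.Task] = {}

    async def run(name: str) -> dict:
        # Wait for all parent tasks, merge their outputs, then call this tool.
        parent_results = await asyncio.gather(*(tasks[p] for p in deps[name]))
        inputs = {k: v for result in parent_results for k, v in result.items()}
        return await call_tool(name, inputs)

    # All tasks are created before the event loop starts running any of them,
    # so each `run` coroutine can safely await its parents by name.
    for name in deps:
        tasks[name] = asyncio.create_task(run(name))
    return {name: await task for name, task in tasks.items()}

# "flights" and "hotels" have no dependencies and run concurrently;
# "summarize" starts only after both finish.
print(asyncio.run(run_dag({"flights": [], "hotels": [], "summarize": ["flights", "hotels"]})))
```

A production system would swap the stub for requests to the retrieved MCP Servers and add timeouts and error handling.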
ReasonRank presents a listwise reranker for complex information retrieval tasks, developed using an automated data synthesis framework and a two-stage post-training approach to infuse strong reasoning abilities into large language models. The method achieves state-of-the-art performance on reasoning-intensive benchmarks while demonstrating enhanced efficiency compared to pointwise reasoning rerankers.
Human cognition naturally engages with abstract and fluid concepts, whereas existing reasoning models often rely on generating discrete tokens, potentially constraining their expressive capabilities. Recent advancements aim to address this limitation by enabling large language models (LLMs) to generate soft, abstract tokens, thus facilitating reasoning within a continuous concept space. This paper explores the "Soft Thinking" capabilities of various LLMs by examining the models' internal behavior using a suite of probing techniques. Contrary to the common belief that Soft Thinking enables the simultaneous exploration of diverse reasoning paths, our findings reveal that LLMs predominantly rely on the most influential component of the soft inputs during subsequent decoding steps. This reliance hinders the exploration of different reasoning paths and reduces vanilla Soft Thinking to a form of greedy decoding, obscuring the advantage of transmitting more information through Soft Tokens. To tackle this issue, we explore sampling strategies to introduce randomness, employing methods such as Dirichlet resampling and the Gumbel-Softmax trick. Our experiments demonstrate that incorporating randomness can alleviate the limitations of vanilla approaches and unleash the potential of Soft Thinking. Notably, the Gumbel-Softmax trick provides adequate randomness with controlled smoothness, resulting in superior performance across eight reasoning benchmarks.
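As a rough illustration of the Gumbel-Softmax trick discussed above, the sketch below assumes a soft token is a probability-weighted mixture of token embeddings; the tensor shapes and temperature are placeholders rather than the paper's exact setup.

```python
import torch
import torch.nn.functional as F

# Sketch of the Gumbel-Softmax trick for soft tokens: instead of feeding the
# embedding of the argmax token, feed a randomized convex combination of all
# token embeddings. Shapes and the temperature value are illustrative.

def gumbel_softmax_soft_token(logits: torch.Tensor,
                              embedding_table: torch.Tensor,
                              temperature: float = 0.7) -> torch.Tensor:
    """logits: (vocab,) next-token logits; embedding_table: (vocab, d_model)."""
    # Sample Gumbel(0, 1) noise and add it to the logits to inject randomness.
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    weights = F.softmax((logits + gumbel) / temperature, dim=-1)   # (vocab,)
    # The soft token is the probability-weighted mixture of token embeddings.
    return weights @ embedding_table                               # (d_model,)

# The returned vector replaces the discrete token embedding fed to the model
# at the next decoding step.
vocab, d_model = 32000, 4096
soft_input = gumbel_softmax_soft_token(torch.randn(vocab), torch.randn(vocab, d_model))
```

Lower temperatures push the mixture toward a one-hot (near-greedy) choice, while moderate temperatures preserve the controlled smoothness and randomness the abstract credits for the improved results.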
Baidu Inc.'s PaddleOCR 3.0 provides a robust open-source toolkit for optical character recognition and document understanding, with models of fewer than 100 million parameters achieving accuracy competitive with billion-parameter Vision-Language Models in diverse scenarios. The system significantly improves document parsing and key information extraction, while offering enhanced deployment capabilities and serving as crucial infrastructure for LLM and RAG applications.
A groundbreaking video understanding framework from HKU and Baidu researchers enables retrieval-augmented generation (RAG) for extremely long videos, achieving superior performance through graph-based knowledge organization and multi-modal retrieval while establishing a new benchmark with over 134 hours of content for evaluating long-form video understanding capabilities.
RT-DETRv2, developed by Baidu Inc. and Peking University, refines the real-time detection Transformer architecture by integrating "bag-of-freebies" techniques. These improvements enhance flexibility, practicality, and performance without increasing inference time, demonstrating superior accuracy-speed trade-offs compared to prior RT-DETR and YOLO models on the COCO dataset.
Researchers from Renmin University of China, Baidu Inc., and Carnegie Mellon University developed MMOA-RAG, a framework that models Retrieval-Augmented Generation (RAG) pipelines as cooperative multi-agent reinforcement learning tasks to jointly optimize multiple interacting components. This approach significantly enhances the accuracy and factuality of generated answers on question-answering benchmarks, surpassing existing RAG optimization methods.
A method enhancing Small Language Models (SLMs) for creative writing leverages AI feedback through two distinct strategies: a multi-agent refined reward model and a principle-guided LLM-as-a-Judge. The LLM-as-a-Judge approach, employing adversarial optimization and reflection, achieved state-of-the-art performance in generating Chinese greetings, outperforming larger models while requiring less data and computational resources.
This research introduces Collective Monte Carlo Tree Search (CoMCTS) to empower Multimodal Large Language Models (MLLMs) with step-by-step reasoning and reflection. The method generates a high-quality, tree-structured multimodal reasoning dataset (Mulberry-260k), leading to Mulberry models that demonstrate enhanced reasoning capabilities and improved performance on complex multimodal benchmarks.
A systematic and comprehensive review of tool learning with Large Language Models is presented, organizing fragmented literature by detailing the 'why' and 'how' of LLM-tool integration, surveying existing methods and benchmarks, and identifying open challenges and future research directions.
University of Hong Kong and Baidu researchers evaluate how well Multimodal Large Language Models learn 3D-aware representations for scene understanding by analyzing multi-view feature correspondence across voxelized 3D scenes, revealing a strong positive correlation between 3D representation quality and downstream task performance. This finding motivates their 3DRS framework, which distills knowledge from pretrained 3D foundation models such as VGGT and FLARE into MLLM visual features during training. 3DRS achieves state-of-the-art results across five benchmarks, including 62.9 grounding accuracy on ScanRefer (versus a 58.1 baseline) and 104.8 CIDEr on ScanQA (versus a 102.1 baseline), while adding zero computational overhead at inference through offline target precomputation.
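As a loose sketch of what such a training-time distillation objective could look like (the projection head, feature dimensions, and cosine loss are illustrative assumptions, not the published 3DRS recipe):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of distilling precomputed 3D-foundation-model features into the
# MLLM's visual tokens. Dimensions, the projection head, and the cosine loss
# are assumptions; targets are computed offline, so nothing is added at
# inference time.

class DistillHead(nn.Module):
    def __init__(self, mllm_dim: int = 1024, target_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(mllm_dim, target_dim)

    def forward(self, mllm_feats: torch.Tensor, target_feats: torch.Tensor) -> torch.Tensor:
        """mllm_feats: (tokens, mllm_dim); target_feats: (tokens, target_dim), precomputed."""
        pred = self.proj(mllm_feats)
        # 1 - cosine similarity, averaged over visual tokens.
        return 1.0 - F.cosine_similarity(pred, target_feats, dim=-1).mean()

# During training this loss is added to the usual language-modeling loss;
# the head is simply dropped at inference.
head = DistillHead()
loss = head(torch.randn(196, 1024), torch.randn(196, 768))
```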
A new benchmark and training dataset, TOOLRET, is introduced to evaluate information retrieval models for selecting tools for large language models. The work demonstrates that while existing retrieval models perform poorly on this task, training them on the TOOLRET dataset significantly improves their performance and the end-to-end task success of tool-augmented LLMs.
This survey provides a comprehensive, structured overview of explainability techniques specifically for Large Language Models, categorizing them based on training paradigms (fine-tuning and prompting) and explanation scope (local and global). It reviews existing methodologies, evaluation metrics, practical applications, and outlines key research challenges in making LLMs more transparent and trustworthy.
COBRA (Cascaded sparsE-dense RepresentAtion) is a framework that combines sparse IDs with dense vectors for generative recommendation, improving recommendation performance and offering control over diversity. The approach achieved a 3.8% increase in conversion rate and a 2.7% increase in Average Revenue Per User (ARPU) in online A/B tests on a large-scale industrial platform.
Large Language Models (LLMs), enhanced through agent tuning, have demonstrated remarkable capabilities in Chain-of-Thought (CoT) and tool utilization, significantly surpassing the performance of standalone models. However, the multimodal domain still lacks a large-scale, high-quality agent tuning dataset to unlock the full potential of multimodal large language models. To bridge this gap, we introduce MMAT-1M, the first million-scale multimodal agent tuning dataset designed to support CoT, reflection, and dynamic tool usage. Our dataset is constructed through a novel four-stage data engine: 1) We first curate publicly available multimodal datasets containing question-answer pairs; 2) Then, leveraging GPT-4o, we generate rationales for the original question-answer pairs and dynamically integrate API calls and Retrieval Augmented Generation (RAG) information through a multi-turn paradigm; 3) Furthermore, we refine the rationales through reflection to ensure logical consistency and accuracy, creating a multi-turn dialogue dataset with both Rationale and Reflection (RR); 4) Finally, to enhance efficiency, we optionally compress multi-turn dialogues into a One-turn Rationale and Reflection (ORR) format. By fine-tuning open-source multimodal models on the MMAT-1M, we observe significant performance gains. For instance, the InternVL2.5-8B-RR model achieves an average improvement of 2.7% across eight public benchmarks and 8.8% on the RAG benchmark Dyn-VQA, demonstrating the dataset's effectiveness in enhancing multimodal reasoning and tool-based capabilities. The dataset is publicly available at this https URL.
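As a loose illustration of stage 4 above, the helper below folds a multi-turn rationale-and-reflection dialogue into a single question-answer record; the field names and record layout are hypothetical, not the released MMAT-1M schema.

```python
# Sketch of the One-turn Rationale and Reflection (ORR) compression in stage 4:
# fold a multi-turn rationale/reflection dialogue into a single QA record.
# Field names and the record layout are hypothetical, not the MMAT-1M schema.

def to_orr(turns: list[dict]) -> dict:
    """turns: [{'role': 'user'|'assistant', 'content': str, 'kind': str}, ...]"""
    question = next(t["content"] for t in turns if t["role"] == "user")
    rationale = "\n".join(t["content"] for t in turns
                          if t["role"] == "assistant" and t.get("kind") == "rationale")
    reflection = "\n".join(t["content"] for t in turns
                           if t["role"] == "assistant" and t.get("kind") == "reflection")
    final = turns[-1]["content"]
    return {"question": question,
            "answer": f"Rationale:\n{rationale}\n\nReflection:\n{reflection}\n\nAnswer:\n{final}"}

record = to_orr([
    {"role": "user", "content": "What is shown in the image?", "kind": "question"},
    {"role": "assistant", "content": "The detector and caption both indicate a red bicycle.", "kind": "rationale"},
    {"role": "assistant", "content": "The rationale is consistent with the retrieved context.", "kind": "reflection"},
    {"role": "assistant", "content": "A red bicycle.", "kind": "final"},
])
```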
The Inner Thinking Transformer (ITT) enhances the reasoning capabilities and efficiency of large language models by introducing a dynamic depth scaling architecture that adaptively allocates internal computational resources to critical tokens. Developed by a collaborative team from the Chinese Academy of Sciences and Baidu Inc., ITT enables models to achieve performance comparable to significantly larger models while improving data and computational efficiency.
MAO-ARAG introduces an adaptive Retrieval-Augmented Generation (RAG) framework that dynamically orchestrates multiple specialized agents using reinforcement learning to tailor workflows for diverse queries, achieving higher answer quality while balancing computational costs. The system significantly improves F1 scores on various QA datasets, outperforming existing RAG pipelines, and demonstrates efficient resource utilization.
GraphGPT, developed by researchers from the University of Hong Kong and Baidu Inc., introduces a dual-stage graph instruction tuning framework that injects graph structural knowledge into large language models, leading to a 2-10 times accuracy increase in zero-shot graph learning tasks compared to GNN-based models.
Building upon large language models (LLMs), recent large multimodal models (LMMs) unify cross-modal understanding and generation into a single framework. However, LMMs still struggle to achieve accurate vision-language alignment and are prone to generating text responses that contradict the visual input or failing to follow text-to-image prompts. Current solutions require external supervision (e.g., human feedback or reward models) and only address unidirectional tasks, either understanding or generation. In this work, based on the observation that understanding and generation are naturally inverse dual tasks, we propose SUDER (Self-improving Unified LMMs with Dual sElf-Rewards), a framework reinforcing the understanding and generation capabilities of LMMs with a self-supervised dual reward mechanism. SUDER leverages the inherent duality between understanding and generation tasks to provide self-supervised optimization signals for each other. Specifically, we sample multiple outputs for a given input in one task domain, then reverse the input-output pairs to compute the dual likelihood within the model as self-rewards for optimization. Extensive experimental results on visual understanding and generation benchmarks demonstrate that our method can effectively enhance the performance of the model without any external supervision, achieving especially remarkable improvements in text-to-image tasks.
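A minimal sketch of the dual self-reward idea, with both callables as hypothetical stand-ins for the LMM's sampling and likelihood interfaces rather than SUDER's actual API:

```python
from typing import Any, Callable, List, Tuple

# Sketch of a dual self-reward loop: sample several outputs in one direction
# (e.g. captions for an image) and score each by the model's own likelihood of
# recovering the input in the reverse direction. Both callables are
# hypothetical stand-ins for the LMM, not SUDER's actual interface.

def dual_self_rewards(sample_output: Callable[[Any], str],
                      reverse_log_likelihood: Callable[[str, Any], float],
                      inp: Any,
                      num_samples: int = 4) -> List[Tuple[str, float]]:
    scored = []
    for _ in range(num_samples):
        out = sample_output(inp)                   # forward task (e.g. understanding)
        reward = reverse_log_likelihood(out, inp)  # dual task scores it (e.g. generation)
        scored.append((out, reward))
    return scored

# Toy usage with stand-in functions; in practice the rewards would drive a
# preference-optimization step over the sampled candidates.
print(dual_self_rewards(lambda x: "a photo of a cat", lambda out, x: -12.3, inp="<image>"))
```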
Qwen-LookAgain (Qwen-LA), a vision-language reasoning model developed by researchers at Peking University and Baidu Inc., reduces hallucinations and improves accuracy by enabling spontaneous, visually-grounded reflection during extended reasoning. The model demonstrates leading performance on various multimodal QA benchmarks and significantly lower hallucination rates compared to other state-of-the-art methods.