Baidu Inc.
The advent of Large Language Models (LLMs) is transforming search engines into conversational AI search products, primarily using Retrieval-Augmented Generation (RAG) on web corpora. However, this paradigm has significant industrial limitations. Traditional RAG approaches struggle with real-time needs and structured queries that require accessing dynamically generated content like ticket availability or inventory. Limited to indexing static pages, search engines cannot perform the interactive queries needed for such time-sensitive data. Academic research has focused on optimizing RAG for static content, overlooking complex intents and the need for dynamic sources like databases and real-time APIs. To bridge this gap, we introduce TURA (Tool-Augmented Unified Retrieval Agent for AI Search), a novel three-stage framework that combines RAG with agentic tool-use to access both static content and dynamic, real-time information. TURA has three key components: an Intent-Aware Retrieval module to decompose queries and retrieve information sources encapsulated as Model Context Protocol (MCP) Servers, a DAG-based Task Planner that models task dependencies as a Directed Acyclic Graph (DAG) for optimal parallel execution, and a lightweight Distilled Agent Executor for efficient tool calling. TURA is the first architecture to systematically bridge the gap between static RAG and dynamic information sources for a world-class AI search product. Serving tens of millions of users, it leverages an agentic framework to deliver robust, real-time answers while meeting the low-latency demands of a large-scale industrial system.
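The abstract leaves the planner's implementation unspecified; below is a minimal Python sketch of the general idea behind DAG-ordered parallel tool execution, with a hypothetical call_tool stub and made-up tool names rather than TURA's actual interface.

```python
import asyncio

# Minimal sketch of DAG-ordered parallel tool execution. The tool names and
# the call_tool stub are hypothetical placeholders, not TURA's real interface.

async def call_tool(name: str, inputs: dict) -> dict:
    await asyncio.sleep(0.1)  # stand-in for an MCP server / API round trip
    return {name: f"result of {name} given {sorted(inputs)}"}

async def run_dag(deps: dict[str, list[str]]) -> dict[str, dict]:
    """deps maps each task name to the list of task names it depends on (a DAG)."""
    tasks: dict[str, asyncio.Task] = {}

    async def run(name: str) -> dict:
        # Wait for all parent tasks, merge their outputs, then call this tool.
        parent_results = await asyncio.gather(*(tasks[p] for p in deps[name]))
        inputs = {k: v for result in parent_results for k, v in result.items()}
        return await call_tool(name, inputs)

    # All tasks are created before the event loop starts running any of them,
    # so each `run` coroutine can safely await its parents by name.
    for name in deps:
        tasks[name] = asyncio.create_task(run(name))
    return {name: await task for name, task in tasks.items()}

# "flights" and "hotels" have no dependencies and run concurrently;
# "summarize" starts only after both finish.
print(asyncio.run(run_dag({"flights": [], "hotels": [], "summarize": ["flights", "hotels"]})))
```

A production system would swap the stub for requests to the retrieved MCP Servers and add timeouts and error handling.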
ReasonRank presents a listwise reranker for complex information retrieval tasks, developed using an automated data synthesis framework and a two-stage post-training approach to infuse strong reasoning abilities into large language models. The method achieves state-of-the-art performance on reasoning-intensive benchmarks while demonstrating enhanced efficiency compared to pointwise reasoning rerankers.
Human cognition naturally engages with abstract and fluid concepts, whereas existing reasoning models often rely on generating discrete tokens, potentially constraining their expressive capabilities. Recent advancements aim to address this limitation by enabling large language models (LLMs) to generate soft, abstract tokens, thus facilitating reasoning within a continuous concept space. This paper explores the "Soft Thinking" capabilities of various LLMs by examining the models' internal behavior using a suite of probing techniques. Contrary to the common belief that Soft Thinking enables the simultaneous exploration of diverse reasoning paths, our findings reveal that LLMs predominantly rely on the most influential component of the soft inputs during subsequent decoding steps. This reliance hinders the exploration of different reasoning paths and reduces vanilla Soft Thinking to a form of greedy decoding, obscuring the advantage of transmitting more information through Soft Tokens. To tackle this issue, we explore sampling strategies to introduce randomness, employing methods such as Dirichlet resampling and the Gumbel-Softmax trick. Our experiments demonstrate that incorporating randomness can alleviate the limitations of vanilla approaches and unleash the potential of Soft Thinking. Notably, the Gumbel-Softmax trick provides adequate randomness with controlled smoothness, resulting in superior performance across eight reasoning benchmarks.
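As a rough illustration of the Gumbel-Softmax trick discussed above, the sketch below assumes a soft token is a probability-weighted mixture of token embeddings; the tensor shapes and temperature are placeholders rather than the paper's exact setup.

```python
import torch
import torch.nn.functional as F

# Sketch of the Gumbel-Softmax trick for soft tokens: instead of feeding the
# embedding of the argmax token, feed a randomized convex combination of all
# token embeddings. Shapes and the temperature value are illustrative.

def gumbel_softmax_soft_token(logits: torch.Tensor,
                              embedding_table: torch.Tensor,
                              temperature: float = 0.7) -> torch.Tensor:
    """logits: (vocab,) next-token logits; embedding_table: (vocab, d_model)."""
    # Sample Gumbel(0, 1) noise and add it to the logits to inject randomness.
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    weights = F.softmax((logits + gumbel) / temperature, dim=-1)   # (vocab,)
    # The soft token is the probability-weighted mixture of token embeddings.
    return weights @ embedding_table                               # (d_model,)

# The returned vector replaces the discrete token embedding fed to the model
# at the next decoding step.
vocab, d_model = 32000, 4096
soft_input = gumbel_softmax_soft_token(torch.randn(vocab), torch.randn(vocab, d_model))
```

Lower temperatures push the mixture toward a one-hot (near-greedy) choice, while moderate temperatures preserve the controlled smoothness and randomness the abstract credits for the improved results.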
Baidu Inc.'s PaddleOCR 3.0 provides a robust open-source toolkit for optical character recognition and document understanding, with models of fewer than 100 million parameters achieving accuracy competitive with billion-parameter Vision-Language Models in diverse scenarios. The system significantly improves document parsing and key information extraction, while offering enhanced deployment capabilities and serving as crucial infrastructure for LLM and RAG applications.
A groundbreaking video understanding framework from HKU and Baidu researchers enables retrieval-augmented generation (RAG) for extremely long videos, achieving superior performance through graph-based knowledge organization and multi-modal retrieval while establishing a new benchmark with over 134 hours of content for evaluating long-form video understanding capabilities.
RT-DETRv2, developed by Baidu Inc. and Peking University, refines the real-time detection Transformer architecture by integrating "bag-of-freebies" techniques. These improvements enhance flexibility, practicality, and performance without increasing inference time, demonstrating superior accuracy-speed trade-offs compared to prior RT-DETR and YOLO models on the COCO dataset.
Researchers from Renmin University of China, Baidu Inc., and Carnegie Mellon University developed MMOA-RAG, a framework that models Retrieval-Augmented Generation (RAG) pipelines as cooperative multi-agent reinforcement learning tasks to jointly optimize multiple interacting components. This approach significantly enhances the accuracy and factuality of generated answers on question-answering benchmarks, surpassing existing RAG optimization methods.
A method enhancing Small Language Models (SLMs) for creative writing leverages AI feedback through two distinct strategies: a multi-agent refined reward model and a principle-guided LLM-as-a-Judge. The LLM-as-a-Judge approach, employing adversarial optimization and reflection, achieved state-of-the-art performance in generating Chinese greetings, outperforming larger models while requiring less data and computational resources.
This research introduces Collective Monte Carlo Tree Search (CoMCTS) to empower Multimodal Large Language Models (MLLMs) with step-by-step reasoning and reflection. The method generates a high-quality, tree-structured multimodal reasoning dataset (Mulberry-260k), leading to Mulberry models that demonstrate enhanced reasoning capabilities and improved performance on complex multimodal benchmarks.
A systematic and comprehensive review of tool learning with Large Language Models is presented, organizing fragmented literature by detailing the 'why' and 'how' of LLM-tool integration, surveying existing methods and benchmarks, and identifying open challenges and future research directions.
University of Hong Kong and Baidu researchers evaluate how well Multimodal Large Language Models learn 3D-aware representations for scene understanding by analyzing multi-view feature correspondence across voxelized 3D scenes, revealing a strong positive correlation between 3D representation quality and downstream task performance. This finding motivates their 3DRS framework, which distills knowledge from pretrained 3D foundation models such as VGGT and FLARE into MLLM visual features during training. 3DRS achieves state-of-the-art results across five benchmarks, including 62.9 grounding accuracy on ScanRefer (versus a 58.1 baseline) and 104.8 CIDEr on ScanQA (versus a 102.1 baseline), while adding zero computational overhead at inference through offline target precomputation.
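As a loose sketch of what such a training-time distillation objective could look like (the projection head, feature dimensions, and cosine loss are illustrative assumptions, not the published 3DRS recipe):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of distilling precomputed 3D-foundation-model features into the
# MLLM's visual tokens. Dimensions, the projection head, and the cosine loss
# are assumptions; targets are computed offline, so nothing is added at
# inference time.

class DistillHead(nn.Module):
    def __init__(self, mllm_dim: int = 1024, target_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(mllm_dim, target_dim)

    def forward(self, mllm_feats: torch.Tensor, target_feats: torch.Tensor) -> torch.Tensor:
        """mllm_feats: (tokens, mllm_dim); target_feats: (tokens, target_dim), precomputed."""
        pred = self.proj(mllm_feats)
        # 1 - cosine similarity, averaged over visual tokens.
        return 1.0 - F.cosine_similarity(pred, target_feats, dim=-1).mean()

# During training this loss is added to the usual language-modeling loss;
# the head is simply dropped at inference.
head = DistillHead()
loss = head(torch.randn(196, 1024), torch.randn(196, 768))
```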
A new benchmark and training dataset, TOOLRET, is introduced to evaluate information retrieval models for selecting tools for large language models. The work demonstrates that while existing retrieval models perform poorly on this task, training them on the TOOLRET dataset significantly improves their performance and the end-to-end task success of tool-augmented LLMs.
This survey provides a comprehensive, structured overview of explainability techniques specifically for Large Language Models, categorizing them based on training paradigms (fine-tuning and prompting) and explanation scope (local and global). It reviews existing methodologies, evaluation metrics, practical applications, and outlines key research challenges in making LLMs more transparent and trustworthy.
COBRA (Cascaded sparsE-dense RepresentAtion) is a framework that combines sparse IDs with dense vectors for generative recommendation, improving recommendation performance and offering control over diversity. The approach achieved a 3.8% increase in conversion rate and a 2.7% increase in Average Revenue Per User (ARPU) in online A/B tests on a large-scale industrial platform.
Large Language Models (LLMs), enhanced through agent tuning, have demonstrated remarkable capabilities in Chain-of-Thought (CoT) and tool utilization, significantly surpassing the performance of standalone models. However, the multimodal domain still lacks a large-scale, high-quality agent tuning dataset to unlock the full potential of multimodal large language models. To bridge this gap, we introduce MMAT-1M, the first million-scale multimodal agent tuning dataset designed to support CoT, reflection, and dynamic tool usage. Our dataset is constructed through a novel four-stage data engine: 1) We first curate publicly available multimodal datasets containing question-answer pairs; 2) Then, leveraging GPT-4o, we generate rationales for the original question-answer pairs and dynamically integrate API calls and Retrieval Augmented Generation (RAG) information through a multi-turn paradigm; 3) Furthermore, we refine the rationales through reflection to ensure logical consistency and accuracy, creating a multi-turn dialogue dataset with both Rationale and Reflection (RR); 4) Finally, to enhance efficiency, we optionally compress multi-turn dialogues into a One-turn Rationale and Reflection (ORR) format. By fine-tuning open-source multimodal models on the MMAT-1M, we observe significant performance gains. For instance, the InternVL2.5-8B-RR model achieves an average improvement of 2.7% across eight public benchmarks and 8.8% on the RAG benchmark Dyn-VQA, demonstrating the dataset's effectiveness in enhancing multimodal reasoning and tool-based capabilities. The dataset is publicly available at this https URL.
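As a loose illustration of stage 4 above, the helper below folds a multi-turn rationale-and-reflection dialogue into a single question-answer record; the field names and record layout are hypothetical, not the released MMAT-1M schema.

```python
# Sketch of the One-turn Rationale and Reflection (ORR) compression in stage 4:
# fold a multi-turn rationale/reflection dialogue into a single QA record.
# Field names and the record layout are hypothetical, not the MMAT-1M schema.

def to_orr(turns: list[dict]) -> dict:
    """turns: [{'role': 'user'|'assistant', 'content': str, 'kind': str}, ...]"""
    question = next(t["content"] for t in turns if t["role"] == "user")
    rationale = "\n".join(t["content"] for t in turns
                          if t["role"] == "assistant" and t.get("kind") == "rationale")
    reflection = "\n".join(t["content"] for t in turns
                           if t["role"] == "assistant" and t.get("kind") == "reflection")
    final = turns[-1]["content"]
    return {"question": question,
            "answer": f"Rationale:\n{rationale}\n\nReflection:\n{reflection}\n\nAnswer:\n{final}"}

record = to_orr([
    {"role": "user", "content": "What is shown in the image?", "kind": "question"},
    {"role": "assistant", "content": "The detector and caption both indicate a red bicycle.", "kind": "rationale"},
    {"role": "assistant", "content": "The rationale is consistent with the retrieved context.", "kind": "reflection"},
    {"role": "assistant", "content": "A red bicycle.", "kind": "final"},
])
```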
The Inner Thinking Transformer (ITT) enhances the reasoning capabilities and efficiency of large language models by introducing a dynamic depth scaling architecture that adaptively allocates internal computational resources to critical tokens. Developed by a collaborative team from the Chinese Academy of Sciences and Baidu Inc., ITT enables models to achieve performance comparable to significantly larger models while improving data and computational efficiency.
MAO-ARAG introduces an adaptive Retrieval-Augmented Generation (RAG) framework that dynamically orchestrates multiple specialized agents using reinforcement learning to tailor workflows for diverse queries, achieving higher answer quality while balancing computational costs. The system significantly improves F1 scores on various QA datasets, outperforming existing RAG pipelines, and demonstrates efficient resource utilization.
GraphGPT, developed by researchers from the University of Hong Kong and Baidu Inc., introduces a dual-stage graph instruction tuning framework that injects graph structural knowledge into large language models, leading to a 2-10 times accuracy increase in zero-shot graph learning tasks compared to GNN-based models.
Building upon large language models (LLMs), recent large multimodal models (LMMs) unify cross-modal understanding and generation into a single framework. However, LMMs still struggle to achieve accurate vision-language alignment and are prone to generating text responses that contradict the visual input or failing to follow text-to-image prompts. Current solutions require external supervision (e.g., human feedback or reward models) and only address unidirectional tasks, either understanding or generation. In this work, based on the observation that understanding and generation are naturally inverse dual tasks, we propose SUDER (Self-improving Unified LMMs with Dual sElf-Rewards), a framework reinforcing the understanding and generation capabilities of LMMs with a self-supervised dual reward mechanism. SUDER leverages the inherent duality between understanding and generation tasks to provide self-supervised optimization signals for each other. Specifically, we sample multiple outputs for a given input in one task domain, then reverse the input-output pairs to compute the dual likelihood within the model as self-rewards for optimization. Extensive experimental results on visual understanding and generation benchmarks demonstrate that our method can effectively enhance the performance of the model without any external supervision, achieving especially remarkable improvements in text-to-image tasks.
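A minimal sketch of the dual self-reward idea, with both callables as hypothetical stand-ins for the LMM's sampling and likelihood interfaces rather than SUDER's actual API:

```python
from typing import Any, Callable, List, Tuple

# Sketch of a dual self-reward loop: sample several outputs in one direction
# (e.g. captions for an image) and score each by the model's own likelihood of
# recovering the input in the reverse direction. Both callables are
# hypothetical stand-ins for the LMM, not SUDER's actual interface.

def dual_self_rewards(sample_output: Callable[[Any], str],
                      reverse_log_likelihood: Callable[[str, Any], float],
                      inp: Any,
                      num_samples: int = 4) -> List[Tuple[str, float]]:
    scored = []
    for _ in range(num_samples):
        out = sample_output(inp)                   # forward task (e.g. understanding)
        reward = reverse_log_likelihood(out, inp)  # dual task scores it (e.g. generation)
        scored.append((out, reward))
    return scored

# Toy usage with stand-in functions; in practice the rewards would drive a
# preference-optimization step over the sampled candidates.
print(dual_self_rewards(lambda x: "a photo of a cat", lambda out, x: -12.3, inp="<image>"))
```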
Qwen-LookAgain (Qwen-LA), a vision-language reasoning model developed by researchers at Peking University and Baidu Inc., reduces hallucinations and improves accuracy by enabling spontaneous, visually-grounded reflection during extended reasoning. The model demonstrates leading performance on various multimodal QA benchmarks and significantly lower hallucination rates compared to other state-of-the-art methods.