Discover, Discuss, and Read arXiv papers

Discover new, recommended papers

Researchers at Harvard University developed power sampling, a training-free method that leverages the Metropolis-Hastings algorithm to sample from a sharpened (power) distribution of a base large language model. This technique unlocks latent reasoning capabilities, achieving single-shot performance comparable to or exceeding that of reinforcement learning post-training methods across various tasks, while also preserving generation diversity.
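
The core loop is ordinary Metropolis-Hastings with the base model's log-probability scaled by a power alpha > 1. A minimal sketch, assuming hypothetical helpers `logprob` (sequence log-probability under the base LLM) and `propose` (a candidate sequence plus forward/reverse proposal log-densities); the paper's actual proposal design may differ:

```python
import math
import random

# Sketch of power sampling via Metropolis-Hastings. The target is p(x)^alpha,
# the base LLM's distribution sharpened by alpha > 1. `logprob` and `propose`
# are hypothetical helpers; `propose` must return a candidate sequence and the
# forward/reverse proposal log-densities.

def power_sample(seq, logprob, propose, alpha=4.0, steps=100):
    lp = logprob(seq)
    for _ in range(steps):
        cand, log_q_fwd, log_q_rev = propose(seq)
        lp_cand = logprob(cand)
        # MH acceptance ratio for target p(x)^alpha
        log_accept = alpha * (lp_cand - lp) + log_q_rev - log_q_fwd
        if math.log(random.random()) < log_accept:
            seq, lp = cand, lp_cand
    return seq
```

Because acceptance depends only on log-probability ratios under the frozen base model, no training and no reward model are involved.
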
Representation Autoencoders (RAEs) redefine the latent space for Diffusion Transformers (DiT) by utilizing frozen, pretrained visual encoders with lightweight decoders. This framework achieves state-of-the-art image generation, obtaining an FID of 1.13 on ImageNet 512x512, and demonstrates up to 47x faster convergence rates than prior DiT models.
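
A minimal sketch of the RAE split, assuming a generic frozen encoder and collapsing the decoder to a small MLP; the diffusion transformer that later operates on these latents is omitted:

```python
import torch
import torch.nn as nn

# Sketch: a frozen pretrained encoder defines the latent space; only a
# lightweight decoder is trained. The MLP decoder and flat latent shapes
# are simplifying assumptions, not the paper's exact design.

class RAE(nn.Module):
    def __init__(self, pretrained_encoder: nn.Module, latent_dim: int, out_dim: int):
        super().__init__()
        self.encoder = pretrained_encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)            # encoder is never updated
        self.decoder = nn.Sequential(          # deliberately lightweight decoder
            nn.Linear(latent_dim, 4 * latent_dim), nn.GELU(),
            nn.Linear(4 * latent_dim, out_dim),
        )

    def forward(self, images):
        with torch.no_grad():
            z = self.encoder(images)           # latents from the frozen encoder
        return self.decoder(z), z              # reconstruction loss trains decoder only
```
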
Researchers from S-Lab NTU, SenseTime Research, and Xi’an Jiaotong University introduced NEO, a family of native vision-language models built on a unified primitive and trained end-to-end. NEO demonstrates competitive performance against modular VLMs and surpasses other native approaches on various benchmarks, despite using significantly less pre-training and SFT data.
A tutorial developed by the University of Oxford and Hugging Face guides readers through modern robot learning, detailing the transition from classical methods to data-driven, learning-based paradigms. It provides conceptual understanding and practical tools using the `lerobot` open-source library, covering Reinforcement Learning, Imitation Learning, and generalist Vision-Language-Action policies with end-to-end examples.
Researchers at The University of Hong Kong developed RAG-ANYTHING, an all-in-one framework that addresses the text-centric limitation of existing Retrieval-Augmented Generation (RAG) systems by uniformly processing text, images, tables, and equations. This system leverages a dual-graph construction and cross-modal hybrid retrieval to achieve 63.4% accuracy on DocBench and 42.8% on MMLongBench, showing improved performance, particularly on long, multimodal documents.
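
The dual-graph construction is not spelled out here, but the retrieval side can be pictured as merging dense text-index hits with neighbors reached through a multimodal knowledge graph. A heavily hedged sketch, where `embed`, `text_index`, and `kg` are all hypothetical stand-ins:

```python
# Sketch of cross-modal hybrid retrieval: dense text-index hits are expanded
# through a multimodal knowledge graph so that images, tables, and equations
# linked to matching text are also surfaced.

def hybrid_retrieve(query, embed, text_index, kg, k=10, w_graph=0.5):
    q = embed(query)
    dense_hits = text_index.search(q, k)              # [(node_id, score), ...]
    scores = dict(dense_hits)
    for nid, s in dense_hits:
        for nbr in kg.neighbors(nid):                 # cross-modal neighbors
            scores[nbr] = scores.get(nbr, 0.0) + w_graph * s
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]
```
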
Researchers from Renmin University of China and Kuaishou Technology developed Agentic Entropy-Balanced Policy Optimization (AEPO), an algorithm designed to stabilize and enhance the training of web agents by dynamically balancing entropy during rollout and policy updates. AEPO achieved 47.6% Pass@1 on the GAIA benchmark and reduced tool calls by approximately half compared to other RL methods, demonstrating improved performance and training stability on complex, multi-turn tasks.
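
One half of the idea, entropy-balanced rollout, can be sketched as forking extra rollouts at high-entropy (uncertain) steps under a fixed budget. `env.spawn_branch` and the threshold `tau` are illustrative assumptions, not the paper's exact mechanism:

```python
import math
import random

# Sketch of entropy-balanced rollout: branch additional rollouts where the
# policy is most uncertain, but only while a global branching budget remains.

def rollout_with_entropy_budget(env, policy, max_branches=4, tau=2.0):
    state, trajectory, branches = env.reset(), [], 0
    while not env.done(state):
        probs = policy(state)                          # action distribution
        entropy = -sum(p * math.log(p + 1e-12) for p in probs)
        if entropy > tau and branches < max_branches:
            env.spawn_branch(state)                    # explore this uncertain step
            branches += 1                              # consume the branching budget
        action = random.choices(range(len(probs)), weights=probs)[0]
        trajectory.append((state, action))
        state = env.step(state, action)
    return trajectory
```
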
NP-Edit, developed by researchers at Carnegie Mellon University and Adobe, introduces a training paradigm for instruction-following image editing models that eliminates the need for paired input-target data. The system leverages differentiable feedback from Vision-Language Models and a distribution matching loss, achieving competitive performance and often outperforming larger models in few-step generation on benchmarks such as GEdit-Benchmark and DreamBooth.
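
The training signal can be pictured as two differentiable terms standing in for a paired target: a VLM compliance score and a distribution-matching penalty. A sketch with hypothetical `vlm_compliance` and `distribution_match` callables:

```python
# Sketch of a paired-data-free editing objective: a differentiable VLM scores
# how well the edit follows the instruction, while a distribution-matching
# term keeps edits realistic. Both callables are stand-ins, not the paper's
# exact losses.

def npedit_loss(editor, image, instruction, vlm_compliance, distribution_match, lam=1.0):
    edited = editor(image, instruction)              # no ground-truth target used
    l_follow = -vlm_compliance(edited, instruction)  # gradients flow through the VLM
    l_real = distribution_match(edited)              # e.g., a DMD-style score loss
    return l_follow + lam * l_real
```
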
Researchers from The Chinese University of Hong Kong developed a framework for assessing large language models' ability to design functional, physically simulated machines using a novel environment and agentic workflows. They demonstrated that while LLMs can generate functional designs, they require advanced techniques like iterative refinement and reinforcement learning to overcome limitations in spatial and physical reasoning.
Researchers at the International Digital Economy Academy (IDEA) introduced Rex-Omni, a 3-billion-parameter Multimodal Large Language Model capable of unifying various visual perception tasks. The model achieves state-of-the-art or highly competitive performance across 11 diverse benchmarks by integrating robust language understanding with precise object localization.
Researchers from a collaborative team including Shanghai Qizhi Institute and Tsinghua University developed RL-100, a unified framework for real-world robotic manipulation that achieves 100% success rates across seven challenging tasks by combining imitation learning with iterative offline and online reinforcement learning. This framework addresses the latency of multi-step diffusion policies through consistency distillation, enabling high-frequency control at up to 378 Hz while outperforming human experts in efficiency.
PI-FLOW introduces a framework for few-step generative modeling that predicts a network-free policy to guide dense ODE integration substeps, effectively decoupling the number of costly network evaluations from the precision of integration. This method achieves high image quality and diversity with significantly reduced inference steps across various scales, including 1024x1024 text-to-image generation.
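
The decoupling can be sketched as one network call returning a cheap, network-free policy that is then integrated over many fine substeps. Here the policy is, hypothetically, a callable closed-form velocity model; the paper's parameterization may differ:

```python
# Sketch of decoupled integration: one costly network evaluation predicts a
# network-free policy, which then drives many cheap Euler substeps.

def pi_flow_step(net, x, t, t_next, substeps=16):
    policy = net(x, t)                  # single network evaluation
    dt = (t_next - t) / substeps
    s = t
    for _ in range(substeps):
        x = x + dt * policy(x, s)       # network-free velocity, cheap substep
        s = s + dt
    return x
```

Integration precision scales with `substeps` while the network cost stays fixed at one call per step.
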
The Tiny Recursive Model (TRM) demonstrates that a simplified recursive architecture can tackle complex reasoning tasks with higher efficiency and generalization than larger models and its predecessor, the Hierarchical Reasoning Model (HRM). For instance, TRM-MLP attained an 87.4% test accuracy on Sudoku-Extreme, a substantial increase from HRM's 55.0%, utilizing significantly fewer parameters.
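
A simplified sketch of the recursive loop: one tiny core network repeatedly refines a latent reasoning state z, then refines the answer y, and the whole cycle repeats. Passing x into both updates is a simplification for dimensional convenience, not necessarily the paper's exact wiring:

```python
import torch
import torch.nn as nn

# Sketch of tiny recursive reasoning: the same small core is reused across
# all refinement steps, so depth comes from recursion, not parameters.

class TinyRecursive(nn.Module):
    def __init__(self, core: nn.Module, n_latent: int = 6, n_outer: int = 3):
        super().__init__()
        self.core, self.n_latent, self.n_outer = core, n_latent, n_outer

    def forward(self, x, y, z):
        for _ in range(self.n_outer):
            for _ in range(self.n_latent):
                z = self.core(torch.cat([x, y, z], dim=-1))  # refine reasoning state
            y = self.core(torch.cat([x, y, z], dim=-1))      # refine answer
        return y, z
```
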
Researchers from Shanghai Jiao Tong University, Nanyang Technological University, The Chinese University of Hong Kong, and Alibaba Group developed OMNI-CAPTIONER, a framework that includes an agentic data pipeline, dedicated models, and a benchmark to address the detail-hallucination trade-off in Omni Language Models. Their Omni-Captioner-7B model achieved a new state-of-the-art on the VDC benchmark with 55.0% accuracy and superior performance on the novel Omni-Cloze benchmark, scoring 53.5% accuracy.
Researchers from Fudan University and StepFun developed "WithAnyone," a framework that enables controllable and ID-consistent image generation, effectively resolving the prevalent "copy-paste" artifact in existing models. The framework introduces a large-scale paired multi-person dataset and a comprehensive benchmark, achieving high identity fidelity while allowing flexible control over pose and expression, and outperforming state-of-the-art methods across various metrics.
Researchers systematically analyzed reinforcement learning in agentic LLMs across data, algorithms, and reasoning modes, identifying key strategies for enhanced performance and efficiency. Their resulting DemyAgent-4B model, with 4 billion parameters, achieved state-of-the-art agentic reasoning on benchmarks like AIME2024 (72.6%) and AIME2025 (70.0%), surpassing models up to 32 billion parameters.
The paper introduces "Trajectory Fields" as a novel 4D video representation, mapping each pixel's continuous 3D path over time. The "Trace Anything" neural network predicts these fields in a single pass, achieving state-of-the-art performance in dynamic scene understanding and dense 3D tracking while being orders of magnitude faster than prior methods.
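
The representation itself is compact: each pixel carries the parameters of a continuous 3D curve over time. A sketch that, as an assumption, stores per-pixel polynomial coefficients (the paper's basis may differ):

```python
import torch

# Sketch of the trajectory-field representation only: every pixel holds K
# polynomial coefficients per spatial axis, defining a continuous 3D path.

def eval_trajectory_field(coeffs: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    # coeffs: [H, W, K, 3] per-pixel coefficients; t: 0-dim tensor in [0, 1]
    K = coeffs.shape[2]
    basis = torch.stack([t ** k for k in range(K)])     # [K] monomial basis at t
    return torch.einsum("hwkc,k->hwc", coeffs, basis)   # [H, W, 3] points at time t
```
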
A collaboration between AMAP, Alibaba Group, NVIDIA, and Caltech presents EPG, a self-supervised pre-training framework for end-to-end pixel-space generative modeling. EPG outperforms prior pixel-space methods and rivals latent-space models in image quality and efficiency, while enabling the first VAE-free consistency-model training on high-resolution images.
A framework called QeRL significantly improves the efficiency and performance of Reinforcement Learning for Large Language Models by employing hardware-accelerated NVFP4 quantization and leveraging quantization noise as a dynamic exploration mechanism. It achieves over 1.5x rollout speedup, enables training of a 32B LLM on a single GPU, and outperforms existing methods in reasoning accuracy on mathematical benchmarks.
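
The exploration claim can be mimicked outside any quantization stack by injecting scheduled Gaussian noise into rollout weights; the schedule and injection site below are assumptions, not QeRL's NVFP4-based recipe:

```python
import torch

# Sketch: quantized weights behave like noisy weights, so scheduled Gaussian
# noise on rollout weights stands in for quantization noise as exploration.

def noisy_rollout_weights(weights: torch.Tensor, step: int, total_steps: int,
                          sigma_max: float = 0.02, sigma_min: float = 0.002) -> torch.Tensor:
    sigma = sigma_max + (sigma_min - sigma_max) * step / total_steps  # linear decay
    return weights + sigma * torch.randn_like(weights)                # perturbed copy
```
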
Researchers empirically validate the LLM Brain Rot Hypothesis, demonstrating that continual pre-training on low-quality, engagement-driven web content leads to persistent declines in LLM reasoning, long-context understanding, and adherence to ethical norms, along with shifts in personality traits. The work identifies "thought-skipping" as the primary failure mode and shows that these cognitive impairments resist post-hoc mitigation.
InternVLA-M1 is a spatially guided vision-language-action framework from Intern Robotics and Shanghai AI Laboratory that integrates explicit spatial grounding into generalist robot policy training. The framework achieved superior performance on public benchmarks, including a +14.6% success rate improvement on SimplerEnv Google Robot, alongside enhanced generalization to unseen objects and robustness in dynamic real-world environments.