Discover, Discuss, and Read arXiv papers

Discover new, recommended papers

Researchers at Harvard University developed power sampling, a training-free method that leverages the Metropolis-Hastings algorithm to sample from a sharpened (power) distribution of a base large language model. This technique unlocks latent reasoning capabilities, achieving single-shot performance comparable to or exceeding that of reinforcement learning post-training methods across various tasks, while also preserving generation diversity.
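
The core loop is ordinary Metropolis-Hastings with the base model's log-probability scaled by a power alpha > 1. A minimal sketch, assuming hypothetical helpers `logprob` (sequence log-probability under the base LLM) and `propose` (a candidate sequence plus forward/reverse proposal log-densities); the paper's actual proposal design may differ:

```python
import math
import random

# Sketch of power sampling via Metropolis-Hastings. The target is p(x)^alpha,
# the base LLM's distribution sharpened by alpha > 1. `logprob` and `propose`
# are hypothetical helpers; `propose` must return a candidate sequence and the
# forward/reverse proposal log-densities.

def power_sample(seq, logprob, propose, alpha=4.0, steps=100):
    lp = logprob(seq)
    for _ in range(steps):
        cand, log_q_fwd, log_q_rev = propose(seq)
        lp_cand = logprob(cand)
        # MH acceptance ratio for target p(x)^alpha
        log_accept = alpha * (lp_cand - lp) + log_q_rev - log_q_fwd
        if math.log(random.random()) < log_accept:
            seq, lp = cand, lp_cand
    return seq
```

Because acceptance depends only on log-probability ratios under the frozen base model, no training and no reward model are involved.
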
Representation Autoencoders (RAEs) redefine the latent space for Diffusion Transformers (DiT) by utilizing frozen, pretrained visual encoders with lightweight decoders. This framework achieves state-of-the-art image generation, obtaining an FID of 1.13 on ImageNet 512x512, and demonstrates up to 47x faster convergence rates than prior DiT models.
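
A minimal sketch of the RAE split, assuming a generic frozen encoder and collapsing the decoder to a small MLP; the diffusion transformer that later operates on these latents is omitted:

```python
import torch
import torch.nn as nn

# Sketch: a frozen pretrained encoder defines the latent space; only a
# lightweight decoder is trained. The MLP decoder and flat latent shapes
# are simplifying assumptions, not the paper's exact design.

class RAE(nn.Module):
    def __init__(self, pretrained_encoder: nn.Module, latent_dim: int, out_dim: int):
        super().__init__()
        self.encoder = pretrained_encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)            # encoder is never updated
        self.decoder = nn.Sequential(          # deliberately lightweight decoder
            nn.Linear(latent_dim, 4 * latent_dim), nn.GELU(),
            nn.Linear(4 * latent_dim, out_dim),
        )

    def forward(self, images):
        with torch.no_grad():
            z = self.encoder(images)           # latents from the frozen encoder
        return self.decoder(z), z              # reconstruction loss trains decoder only
```
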
Researchers from S-Lab NTU, SenseTime Research, and Xi’an Jiaotong University introduced NEO, a family of native vision-language models built on a unified primitive and trained end-to-end. NEO demonstrates competitive performance against modular VLMs and surpasses other native approaches on various benchmarks, despite using significantly less pre-training and SFT data.
A tutorial developed by the University of Oxford and Hugging Face guides readers through modern robot learning, detailing the transition from classical methods to data-driven, learning-based paradigms. It provides conceptual understanding and practical tools using the `lerobot` open-source library, covering Reinforcement Learning, Imitation Learning, and generalist Vision-Language-Action policies with end-to-end examples.
Researchers at The University of Hong Kong developed RAG-ANYTHING, an all-in-one framework that addresses the text-centric limitation of existing Retrieval-Augmented Generation (RAG) systems by uniformly processing text, images, tables, and equations. This system leverages a dual-graph construction and cross-modal hybrid retrieval to achieve 63.4% accuracy on DocBench and 42.8% on MMLongBench, showing improved performance, particularly on long, multimodal documents.
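
The dual-graph construction is not spelled out here, but the retrieval side can be pictured as merging dense text-index hits with neighbors reached through a multimodal knowledge graph. A heavily hedged sketch, where `embed`, `text_index`, and `kg` are all hypothetical stand-ins:

```python
# Sketch of cross-modal hybrid retrieval: dense text-index hits are expanded
# through a multimodal knowledge graph so that images, tables, and equations
# linked to matching text are also surfaced.

def hybrid_retrieve(query, embed, text_index, kg, k=10, w_graph=0.5):
    q = embed(query)
    dense_hits = text_index.search(q, k)              # [(node_id, score), ...]
    scores = dict(dense_hits)
    for nid, s in dense_hits:
        for nbr in kg.neighbors(nid):                 # cross-modal neighbors
            scores[nbr] = scores.get(nbr, 0.0) + w_graph * s
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]
```
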
Researchers from Renmin University of China and Kuaishou Technology developed Agentic Entropy-Balanced Policy Optimization (AEPO), an algorithm designed to stabilize and enhance the training of web agents by dynamically balancing entropy during rollout and policy updates. AEPO achieved 47.6% Pass@1 on the GAIA benchmark and reduced tool calls by approximately half compared to other RL methods, demonstrating improved performance and training stability on complex, multi-turn tasks.
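
One half of the idea, entropy-balanced rollout, can be sketched as forking extra rollouts at high-entropy (uncertain) steps under a fixed budget. `env.spawn_branch` and the threshold `tau` are illustrative assumptions, not the paper's exact mechanism:

```python
import math
import random

# Sketch of entropy-balanced rollout: branch additional rollouts where the
# policy is most uncertain, but only while a global branching budget remains.

def rollout_with_entropy_budget(env, policy, max_branches=4, tau=2.0):
    state, trajectory, branches = env.reset(), [], 0
    while not env.done(state):
        probs = policy(state)                          # action distribution
        entropy = -sum(p * math.log(p + 1e-12) for p in probs)
        if entropy > tau and branches < max_branches:
            env.spawn_branch(state)                    # explore this uncertain step
            branches += 1                              # consume the branching budget
        action = random.choices(range(len(probs)), weights=probs)[0]
        trajectory.append((state, action))
        state = env.step(state, action)
    return trajectory
```
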
NP-Edit, developed by researchers at Carnegie Mellon University and Adobe, introduces a training paradigm for instruction-following image editing models that eliminates the need for paired input-target data. The system leverages differentiable feedback from Vision-Language Models and a distribution matching loss, achieving competitive performance and often outperforming larger models in few-step generation on benchmarks such as GEdit-Benchmark and DreamBooth.
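
The training signal can be pictured as two differentiable terms standing in for a paired target: a VLM compliance score and a distribution-matching penalty. A sketch with hypothetical `vlm_compliance` and `distribution_match` callables:

```python
# Sketch of a paired-data-free editing objective: a differentiable VLM scores
# how well the edit follows the instruction, while a distribution-matching
# term keeps edits realistic. Both callables are stand-ins, not the paper's
# exact losses.

def npedit_loss(editor, image, instruction, vlm_compliance, distribution_match, lam=1.0):
    edited = editor(image, instruction)              # no ground-truth target used
    l_follow = -vlm_compliance(edited, instruction)  # gradients flow through the VLM
    l_real = distribution_match(edited)              # e.g., a DMD-style score loss
    return l_follow + lam * l_real
```
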
Researchers from The Chinese University of Hong Kong developed a framework for assessing large language models' ability to design functional, physically simulated machines using a novel environment and agentic workflows. They demonstrated that while LLMs can generate functional designs, they require advanced techniques like iterative refinement and reinforcement learning to overcome limitations in spatial and physical reasoning.
Researchers at the International Digital Economy Academy (IDEA) introduced Rex-Omni, a 3-billion-parameter Multimodal Large Language Model capable of unifying various visual perception tasks. The model achieves state-of-the-art or highly competitive performance across 11 diverse benchmarks by integrating robust language understanding with precise object localization.
Researchers from a collaborative team including Shanghai Qizhi Institute and Tsinghua University developed RL-100, a unified framework for real-world robotic manipulation that achieves 100% success rates across seven challenging tasks by combining imitation learning with iterative offline and online reinforcement learning. This framework addresses the latency of multi-step diffusion policies through consistency distillation, enabling high-frequency control at up to 378 Hz while outperforming human experts in efficiency.
PI-FLOW introduces a framework for few-step generative modeling that predicts a network-free policy to guide dense ODE integration substeps, effectively decoupling the number of costly network evaluations from the precision of integration. This method achieves high image quality and diversity with significantly reduced inference steps across various scales, including 1024x1024 text-to-image generation.
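
The decoupling can be sketched as one network call returning a cheap, network-free policy that is then integrated over many fine substeps. Here the policy is, hypothetically, a callable closed-form velocity model; the paper's parameterization may differ:

```python
# Sketch of decoupled integration: one costly network evaluation predicts a
# network-free policy, which then drives many cheap Euler substeps.

def pi_flow_step(net, x, t, t_next, substeps=16):
    policy = net(x, t)                  # single network evaluation
    dt = (t_next - t) / substeps
    s = t
    for _ in range(substeps):
        x = x + dt * policy(x, s)       # network-free velocity, cheap substep
        s = s + dt
    return x
```

Integration precision scales with `substeps` while the network cost stays fixed at one call per step.
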
The Tiny Recursive Model (TRM) demonstrates that a simplified recursive architecture can tackle complex reasoning tasks with higher efficiency and generalization than larger models and its predecessor, the Hierarchical Reasoning Model (HRM). For instance, TRM-MLP attained an 87.4% test accuracy on Sudoku-Extreme, a substantial increase from HRM's 55.0%, utilizing significantly fewer parameters.
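
A simplified sketch of the recursive loop: one tiny core network repeatedly refines a latent reasoning state z, then refines the answer y, and the whole cycle repeats. Passing x into both updates is a simplification for dimensional convenience, not necessarily the paper's exact wiring:

```python
import torch
import torch.nn as nn

# Sketch of tiny recursive reasoning: the same small core is reused across
# all refinement steps, so depth comes from recursion, not parameters.

class TinyRecursive(nn.Module):
    def __init__(self, core: nn.Module, n_latent: int = 6, n_outer: int = 3):
        super().__init__()
        self.core, self.n_latent, self.n_outer = core, n_latent, n_outer

    def forward(self, x, y, z):
        for _ in range(self.n_outer):
            for _ in range(self.n_latent):
                z = self.core(torch.cat([x, y, z], dim=-1))  # refine reasoning state
            y = self.core(torch.cat([x, y, z], dim=-1))      # refine answer
        return y, z
```
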
Researchers from Shanghai Jiao Tong University, Nanyang Technological University, The Chinese University of Hong Kong, and Alibaba Group developed OMNI-CAPTIONER, a framework that includes an agentic data pipeline, dedicated models, and a benchmark to address the detail-hallucination trade-off in Omni Language Models. Their Omni-Captioner-7B model achieved a new state-of-the-art on the VDC benchmark with 55.0% accuracy and superior performance on the novel Omni-Cloze benchmark, scoring 53.5% accuracy.
Researchers from Fudan University and StepFun developed "WithAnyone," a framework that enables controllable and ID-consistent image generation, effectively resolving the prevalent "copy-paste" artifact in existing models. The framework introduces a large-scale paired multi-person dataset and a comprehensive benchmark, achieving high identity fidelity while allowing flexible control over pose and expression, and outperforming state-of-the-art methods across various metrics.
Researchers systematically analyzed reinforcement learning in agentic LLMs across data, algorithms, and reasoning modes, identifying key strategies for enhanced performance and efficiency. Their resulting DemyAgent-4B model, with 4 billion parameters, achieved state-of-the-art agentic reasoning on benchmarks like AIME2024 (72.6%) and AIME2025 (70.0%), surpassing models up to 32 billion parameters.
The paper introduces "Trajectory Fields" as a novel 4D video representation, mapping each pixel's continuous 3D path over time. The "Trace Anything" neural network predicts these fields in a single pass, achieving state-of-the-art performance in dynamic scene understanding and dense 3D tracking while being orders of magnitude faster than prior methods.
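
The representation itself is compact: each pixel carries the parameters of a continuous 3D curve over time. A sketch that, as an assumption, stores per-pixel polynomial coefficients (the paper's basis may differ):

```python
import torch

# Sketch of the trajectory-field representation only: every pixel holds K
# polynomial coefficients per spatial axis, defining a continuous 3D path.

def eval_trajectory_field(coeffs: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    # coeffs: [H, W, K, 3] per-pixel coefficients; t: 0-dim tensor in [0, 1]
    K = coeffs.shape[2]
    basis = torch.stack([t ** k for k in range(K)])     # [K] monomial basis at t
    return torch.einsum("hwkc,k->hwc", coeffs, basis)   # [H, W, 3] points at time t
```
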
A collaboration between AMAP, Alibaba Group, NVIDIA, and Caltech presents EPG, a self-supervised pre-training framework for end-to-end pixel-space generative modeling. EPG outperforms prior pixel-space methods and rivals latent-space models in image quality and efficiency, while enabling the first VAE-free consistency-model training on high-resolution images.
A framework called QeRL significantly improves the efficiency and performance of Reinforcement Learning for Large Language Models by employing hardware-accelerated NVFP4 quantization and leveraging quantization noise as a dynamic exploration mechanism. It achieves over 1.5x rollout speedup, enables training of a 32B LLM on a single GPU, and outperforms existing methods in reasoning accuracy on mathematical benchmarks.
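
The exploration claim can be mimicked outside any quantization stack by injecting scheduled Gaussian noise into rollout weights; the schedule and injection site below are assumptions, not QeRL's NVFP4-based recipe:

```python
import torch

# Sketch: quantized weights behave like noisy weights, so scheduled Gaussian
# noise on rollout weights stands in for quantization noise as exploration.

def noisy_rollout_weights(weights: torch.Tensor, step: int, total_steps: int,
                          sigma_max: float = 0.02, sigma_min: float = 0.002) -> torch.Tensor:
    sigma = sigma_max + (sigma_min - sigma_max) * step / total_steps  # linear decay
    return weights + sigma * torch.randn_like(weights)                # perturbed copy
```
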
Researchers empirically validate the LLM Brain Rot Hypothesis, demonstrating that continual pre-training on low-quality, engagement-driven web content leads to persistent declines in LLM reasoning, long-context understanding, and adherence to ethical norms, along with shifts in personality traits. The work identifies "thought-skipping" as the primary failure mode and shows that these cognitive impairments resist post-hoc mitigation.
InternVLA-M1 is a spatially guided vision-language-action framework from Intern Robotics and Shanghai AI Laboratory that integrates explicit spatial grounding into generalist robot policy training. The framework achieved superior performance on public benchmarks, including a +14.6% success rate improvement on SimplerEnv Google Robot, alongside enhanced generalization to unseen objects and robustness in dynamic real-world environments.