Credit goes to www.alphaxiv.org

Discover, Discuss, and Read arXiv papers

Discover new, recommended papers

Researchers introduced a predictive framework for Reinforcement Learning (RL) in Large Language Models (LLMs) using a sigmoidal compute-performance curve, enabling performance extrapolation from smaller runs. Their ScaleRL recipe, validated across more than 100,000 GPU-hours of training, achieves an asymptotic reward of 0.61 on verifiable math problems, outperforming established methods while scaling predictably across model size, generation length, and multi-task settings.
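The extrapolation idea can be sketched as fitting a saturating compute-performance curve to small runs and evaluating it at a larger compute budget. The functional form, parameter grid, and run data below are illustrative assumptions, not the paper's exact recipe:

```python
def sigmoid_reward(compute, r_max, c_mid, b):
    """Saturating compute-performance curve: reward approaches the
    asymptote r_max as training compute grows (illustrative form)."""
    return r_max / (1.0 + (c_mid / compute) ** b)

def fit_and_extrapolate(runs, target_compute):
    """Naive grid-search least-squares fit over small runs, then
    extrapolate. `runs` is a list of (gpu_hours, reward) pairs."""
    best, best_err = None, float("inf")
    for r_max in [x / 100 for x in range(40, 81)]:        # 0.40 .. 0.80
        for c_mid in [100, 300, 1000, 3000, 10000]:
            for b in [0.5, 1.0, 1.5, 2.0]:
                err = sum((sigmoid_reward(c, r_max, c_mid, b) - r) ** 2
                          for c, r in runs)
                if err < best_err:
                    best, best_err = (r_max, c_mid, b), err
    return sigmoid_reward(target_compute, *best), best

# Hypothetical observations from cheap small-scale runs:
small_runs = [(500, 0.20), (1000, 0.30), (2000, 0.40), (4000, 0.48)]
pred, params = fit_and_extrapolate(small_runs, 100_000)
```

With this toy data the fit predicts a reward near the curve's asymptote at 100,000 GPU-hours, which is the kind of extrapolation the framework makes from smaller runs.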
Representation Autoencoders (RAEs) redefine the latent space for Diffusion Transformers (DiT) by utilizing frozen, pretrained visual encoders with lightweight decoders. This framework achieves state-of-the-art image generation, obtaining an FID of 1.13 on ImageNet 512x512, and demonstrates up to 47x faster convergence than prior DiT models.
A tutorial developed by the University of Oxford and Hugging Face guides readers through modern robot learning, detailing the transition from classical methods to data-driven, learning-based paradigms. It provides conceptual understanding and practical tools using the `lerobot` open-source library, covering Reinforcement Learning, Imitation Learning, and generalist Vision-Language-Action policies with end-to-end examples.
Researchers at The University of Hong Kong developed RAG-ANYTHING, an all-in-one framework that addresses the text-centric limitation of existing Retrieval-Augmented Generation (RAG) systems by uniformly processing text, images, tables, and equations. This system leverages a dual-graph construction and cross-modal hybrid retrieval to achieve 63.4% accuracy on DocBench and 42.8% on MMLongBench, showing improved performance, particularly on long, multimodal documents.
Tensor Logic introduces a foundational language for AI, demonstrating that neural, symbolic, and statistical paradigms can be unified under a single mathematical construct: the tensor equation. The framework enables sound and transparent reasoning directly within embedding spaces, offering tunable control over the spectrum from deductive to analogical inference.
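The core construct, a tensor equation, can be illustrated by recasting a classic Datalog rule as tensor algebra: summing over a shared index plays the role of the logical join. The tiny relation and entity names below are our own illustration, using boolean matrices as 0/1 tensors:

```python
# Entities: 0=Ann, 1=Bob, 2=Cal. parent[x][y] = 1 iff x is a parent of y.
ENTITIES = ["Ann", "Bob", "Cal"]
parent = [
    [0, 1, 0],   # Ann -> Bob
    [0, 0, 1],   # Bob -> Cal
    [0, 0, 0],
]

def tensor_join(a, b):
    """Tensor equation C[x, z] = A[x, y] * B[y, z]: the shared index y
    is summed out, then clamped to {0, 1}. This is the Datalog rule
    c(x, z) :- a(x, y), b(y, z) expressed as tensor algebra."""
    n = len(a)
    return [[min(1, sum(a[x][y] * b[y][z] for y in range(n)))
             for z in range(n)] for x in range(n)]

grandparent = tensor_join(parent, parent)
```

Replacing the 0/1 entries with learned real-valued tensors is what lets the same equation interpolate between strict deduction and softer, embedding-space inference.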
Researchers at Harvard University developed power sampling, a training-free method leveraging the Metropolis-Hastings algorithm to sample from a sharpened distribution of a base large language model. This technique unlocks latent reasoning capabilities, achieving single-shot performance comparable to or exceeding reinforcement learning post-training methods across various tasks, while also preserving generation diversity.
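The mechanics can be sketched on a toy discrete distribution: an independence Metropolis-Hastings chain targets the sharpened distribution p(x)^α while proposing from the base p itself (standing in here for the base LLM). The distribution and α below are hypothetical:

```python
import random

def mh_power_sample(p, alpha, steps, seed=0):
    """Independence Metropolis-Hastings targeting p(x)**alpha,
    with the base distribution p as the proposal. Returns visit
    counts over the chain."""
    rng = random.Random(seed)
    xs = list(p)
    weights = [p[x] for x in xs]
    x = rng.choices(xs, weights=weights)[0]
    counts = {k: 0 for k in xs}
    for _ in range(steps):
        x_new = rng.choices(xs, weights=weights)[0]
        # Acceptance ratio for target p^alpha under proposal p:
        # (p(x')^a / p(x)^a) * (p(x) / p(x')) = (p(x') / p(x))**(a - 1)
        if rng.random() < min(1.0, (p[x_new] / p[x]) ** (alpha - 1)):
            x = x_new
        counts[x] += 1
    return counts

base = {"A": 0.5, "B": 0.3, "C": 0.2}   # base model's answer distribution
sharp = mh_power_sample(base, alpha=4.0, steps=20_000)
```

With α = 4 the chain concentrates far more mass on the mode "A" than the base distribution does (roughly 0.87 versus 0.5 under p^4), which is the sense in which sharpening "unlocks" the base model's most likely answer without any retraining.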
Researchers from S-Lab NTU, SenseTime Research, and Xi’an Jiaotong University introduced NEO, a family of native vision-language models built on a unified primitive and end-to-end training. NEO demonstrates competitive performance against modular VLMs and surpasses other native approaches on various benchmarks, despite using significantly less pre-training and SFT data.
Researchers from ICT, CAS and collaborating institutions present the first comprehensive survey of Vibe Coding, a novel LLM-powered software development methodology, formalizing its processes and outlining five distinct development models. The work thoroughly analyzes the ecosystem's infrastructure, revealing critical challenges in human-AI collaboration and a shift in developer roles.
Researchers at the International Digital Economy Academy (IDEA) introduced Rex-Omni, a 3-billion-parameter Multimodal Large Language Model capable of unifying various visual perception tasks. The model achieves state-of-the-art or highly competitive performance across 11 diverse benchmarks by integrating robust language understanding with precise object localization.
Researchers from Renmin University of China and Kuaishou Technology developed Agentic Entropy-Balanced Policy Optimization (AEPO), an algorithm designed to stabilize and enhance the training of web agents by dynamically balancing entropy during rollout and policy updates. AEPO achieved 47.6% Pass@1 on the GAIA benchmark and reduced tool calls by approximately half compared to other RL methods, demonstrating improved performance and training stability on complex, multi-turn tasks.
NVIDIA researchers introduce VLA-0, a Vision-Language-Action model that achieves state-of-the-art robotic manipulation by directly representing robot actions as numerical text strings and fine-tuning an unmodified Vision-Language Model. This minimalist design outperforms more complex or extensively pretrained alternatives in both simulation and real-world tasks.
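The action-as-text idea can be sketched as a simple discretize-and-format scheme: each continuous action dimension becomes an integer written out as ordinary text, which an unmodified VLM can emit token by token. The bin count and value range here are illustrative assumptions, not necessarily the paper's exact tokenization:

```python
def action_to_text(action, low=-1.0, high=1.0, bins=1000):
    """Discretize each continuous action dimension into an integer
    bin and emit it as plain text, so an off-the-shelf VLM can
    output actions as an ordinary token sequence."""
    ids = [round((a - low) / (high - low) * (bins - 1)) for a in action]
    return " ".join(str(min(bins - 1, max(0, i))) for i in ids)

def text_to_action(text, low=-1.0, high=1.0, bins=1000):
    """Parse a generated string back into continuous action values."""
    return [low + int(tok) / (bins - 1) * (high - low)
            for tok in text.split()]

cmd = action_to_text([0.25, -0.5, 0.0, 1.0])   # e.g. a 4-DoF command
recovered = text_to_action(cmd)
```

The round trip loses at most half a bin width per dimension, so with enough bins the text representation is effectively lossless for control purposes, and no special action head or vocabulary change is needed.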
Researchers from a collaborative team including Shanghai Qizhi Institute and Tsinghua University developed RL-100, a unified framework for real-world robotic manipulation that achieves 100% success rates across seven challenging tasks by combining imitation learning with iterative offline and online reinforcement learning. This framework addresses the latency of multi-step diffusion policies through consistency distillation, enabling high-frequency control at up to 378 Hz while outperforming human experts in efficiency.
BitNet Distillation by Microsoft Research provides a three-stage framework for converting full-precision Large Language Models into efficient 1.58-bit models for downstream tasks. It achieves up to 10x memory savings and a 2.65x CPU inference speedup while maintaining task performance comparable to full-precision models, addressing the challenge of deploying LLMs at scale.
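The 1.58-bit format corresponds to ternary weights. A minimal sketch of absmean-style ternary quantization (scale by the mean absolute value, then round each weight to {-1, 0, +1}), written over plain Python lists purely for illustration:

```python
def quantize_ternary(weights, eps=1e-8):
    """Absmean-style 1.58-bit quantization: scale by the mean
    absolute weight, then round and clip to the ternary set
    {-1, 0, +1}."""
    gamma = sum(abs(w) for w in weights) / len(weights) + eps
    q = [max(-1, min(1, round(w / gamma))) for w in weights]
    return q, gamma

def dequantize(q, gamma):
    """Approximate reconstruction: scale ternary codes back up."""
    return [v * gamma for v in q]

w = [0.9, -0.05, 0.4, -1.2, 0.02, -0.6]
q, gamma = quantize_ternary(w)
```

Each weight then needs only log2(3) ≈ 1.58 bits, and matrix multiplies reduce to additions and subtractions of the scale, which is where the memory and CPU-speed gains come from.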
AnyUp introduces a universal method for generating high-resolution feature maps from diverse low-resolution vision encoders without requiring model-specific retraining. The approach achieves state-of-the-art performance across various dense prediction tasks and generalizes robustly to unseen feature types and resolutions.
NP-Edit, developed by researchers at Carnegie Mellon University and Adobe, introduces a training paradigm for instruction-following image editing models that eliminates the need for paired input-target data. The system leverages differentiable feedback from Vision-Language Models and a distribution matching loss, achieving competitive performance and often outperforming larger models in few-step generation on benchmarks such as GEdit-Benchmark and DreamBooth.
Researchers from The Chinese University of Hong Kong developed a framework for assessing large language models' ability to design functional, physically simulated machines using a novel environment and agentic workflows. They demonstrated that while LLMs can generate functional designs, they require advanced techniques like iterative refinement and reinforcement learning to overcome limitations in spatial and physical reasoning.
PaddleOCR-VL, an ultra-compact 0.9B parameter vision-language model from Baidu's PaddlePaddle Team, enables efficient and accurate multilingual document parsing by extracting structured information from complex documents in 109 languages. It achieves state-of-the-art performance on benchmarks like OmniDocBench v1.5 with an overall score of 92.56, while demonstrating 15.8% higher page throughput and consuming 40% less GPU memory compared to leading baselines.
LabOS is an AI co-scientist system, developed by researchers at Stanford and Princeton, that integrates a self-evolving AI agent with an XR-enabled physical lab interface to accelerate scientific discovery. It achieved over 90% accuracy in real-time error detection for lab procedures and successfully identified novel targets in cancer immunotherapy and cell fusion research.
Researchers from Shanghai Jiao Tong University and Alibaba Group introduce a method that leverages attention mechanisms to identify a "preplan-and-anchor" reasoning rhythm within Large Language Models. This understanding enables a fine-grained reinforcement learning approach, leading to improved performance and efficiency across diverse reasoning benchmarks.
DriveVLA-W0 integrates world modeling into Vision-Language-Action (VLA) models for autonomous driving, utilizing future image prediction as a dense self-supervision signal. This framework amplifies data scaling laws, enabling VLAs to achieve state-of-the-art performance and enhanced generalization by learning robust environmental representations.