
Discover, Discuss, and Read arXiv papers

Discover new, recommended papers

Researchers from UCSD and Meta AI developed DeepConf, a method that enhances Large Language Model reasoning efficiency and performance by leveraging local confidence signals to dynamically prune low-quality reasoning traces and weight final answer aggregation. DeepConf achieved up to 84.7% reduction in generated tokens and improved accuracy, reaching 99.9% on AIME 2025 with GPT-OSS-120B, without requiring additional model training.
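The pruning-then-weighting idea in the DeepConf summary can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes each reasoning trace already carries a scalar confidence score, and the function name `confidence_weighted_vote` and the `keep_fraction` parameter are hypothetical. How DeepConf actually derives its local confidence signal is not reproduced here.

```python
import math
from collections import defaultdict


def confidence_weighted_vote(traces, keep_fraction=0.5):
    """Pick an answer from (answer, confidence) pairs.

    Hypothetical helper: drops the least confident traces, then lets each
    surviving trace vote with a weight equal to its confidence.
    """
    if not traces:
        raise ValueError("need at least one trace")
    # Keep the most confident traces (always at least one).
    n_keep = max(1, math.ceil(len(traces) * keep_fraction))
    kept = sorted(traces, key=lambda t: t[1], reverse=True)[:n_keep]
    # Sum confidence per distinct answer.
    weights = defaultdict(float)
    for answer, confidence in kept:
        weights[answer] += confidence
    # The answer with the largest total weight wins.
    return max(weights, key=weights.get)


traces = [("42", 0.95), ("17", 0.4), ("17", 0.3), ("17", 0.2)]
print(confidence_weighted_vote(traces, keep_fraction=0.5))
```

In this toy input a plain majority vote would pick "17" (three votes against one), while the confidence-weighted vote prefers the single high-confidence trace.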
VGGT, developed by VGG at the University of Oxford and Meta AI, introduces a 1.2 billion-parameter feed-forward transformer that directly infers camera parameters, depth maps, and 3D point clouds from multiple input images in a single pass. This model achieves state-of-the-art accuracy in 3D reconstruction and camera pose estimation (e.g., 85.3 AUC@30 on RealEstate10K) while significantly reducing inference time to approximately 0.2 seconds per scene.
Flow Matching introduces a simulation-free method for scalably training Continuous Normalizing Flows (CNFs) by leveraging a conditional objective with identical gradients to the marginal objective. The approach achieves state-of-the-art performance on ImageNet, particularly with Optimal Transport paths, leading to faster training convergence and significantly more efficient sample generation through fewer function evaluations.
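The "identical gradients" claim in the Flow Matching summary can be written out as below. The notation is an assumption, since the summary itself gives no symbols: $v_\theta$ is the learned vector field, $u_t(x)$ the marginal target field, $u_t(x \mid x_1)$ the conditional target given a data sample $x_1 \sim q$, and $\sigma_{\min}$ a small constant. The last line is the commonly used Optimal Transport conditional path and its regression target, with $x_0$ drawn from a standard normal.

```latex
\mathcal{L}_{\mathrm{FM}}(\theta)
  = \mathbb{E}_{t,\, p_t(x)} \big\| v_\theta(t, x) - u_t(x) \big\|^2,
\qquad
\mathcal{L}_{\mathrm{CFM}}(\theta)
  = \mathbb{E}_{t,\, q(x_1),\, p_t(x \mid x_1)} \big\| v_\theta(t, x) - u_t(x \mid x_1) \big\|^2,

\nabla_\theta \mathcal{L}_{\mathrm{FM}}(\theta) = \nabla_\theta \mathcal{L}_{\mathrm{CFM}}(\theta),

x_t = \big(1 - (1 - \sigma_{\min})\, t\big)\, x_0 + t\, x_1,
\qquad
u_t(x_t \mid x_1) = x_1 - (1 - \sigma_{\min})\, x_0 .
```

Because the conditional target is available in closed form, training reduces to a regression that needs no ODE simulation, which is what "simulation-free" refers to.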
This paper introduces Diffusion Transformers (DiTs), a new class of diffusion models that replace the conventional U-Net backbone with a transformer architecture. By leveraging the scalability of transformers, DiTs achieve new state-of-the-art Fréchet Inception Distance (FID) scores on class-conditional ImageNet at 256x256 (2.27 FID) and 512x512 (3.04 FID) resolutions while being more compute-efficient.
Llama 3, a new family of foundation models from Meta AI, achieves performance competitive with leading closed-source models across various benchmarks, while its smaller variants set new standards for open-source LLMs. The models demonstrate enhanced capabilities in multilinguality, coding, reasoning, and tool use, alongside robust safety measures.
FAIR at Meta developed V-JEPA 2, a self-supervised video model that learns a general world model from over 1 million hours of internet video, then adapts this model with a small amount of unlabeled robot data to enable zero-shot robot control and strong performance in video understanding and prediction tasks. The model achieves 80% success for pick-and-place tasks with a cup in novel environments and sets new benchmarks in video question-answering, demonstrating the efficacy of learning predictive world models in representation space.
Researchers from MIT, Meta AI, CMU, and NVIDIA developed StreamingLLM, a framework enabling Large Language Models to efficiently process infinitely long input sequences without fine-tuning. This is achieved by leveraging an "attention sink" phenomenon where LLMs disproportionately attend to initial tokens, allowing the model to maintain stable perplexity and achieve up to a 22.2x speedup in decoding latency.
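One way to picture the attention-sink idea is as a cache retention policy: keep the first few tokens permanently and a sliding window of the most recent ones. The sketch below is an illustration under that assumption. The summary does not spell out the cache policy, and the function name, `n_sink`, and `window` are hypothetical parameters, not the framework's API.

```python
def streaming_cache_positions(seq_len, n_sink=4, window=8):
    """Return the token positions retained after seq_len tokens.

    Hypothetical policy: the first n_sink positions act as attention sinks
    and are never evicted; otherwise only the last `window` positions stay.
    """
    if seq_len <= n_sink + window:
        # Nothing needs to be evicted yet.
        return list(range(seq_len))
    sinks = list(range(n_sink))
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent


print(streaming_cache_positions(20, n_sink=2, window=3))
```

The retained set has constant size once the sequence is longer than `n_sink + window`, which is why memory and per-token decoding cost stop growing with input length under this policy.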
This paper introduces Flow Matching (FM) as a comprehensive framework for training generative models across different data modalities.
Chain of Thought (CoT) monitorability offers a distinct capability for AI safety by providing insight into an AI's internal reasoning processes, including potential intent to misbehave. This paper argues that while currently useful for detecting misbehavior and misalignment, this property is fragile and requires proactive research and development to preserve it as AI systems scale.
Researchers introduced a predictive framework for Reinforcement Learning (RL) in Large Language Models (LLMs) using a sigmoidal compute-performance curve, enabling performance extrapolation from smaller runs. Their ScaleRL recipe, demonstrated over 100,000 GPU-hours, achieves an asymptotic reward of 0.61 on verifiable math problems, outperforming established methods while exhibiting predictable scaling across model size, generation length, and multi-task settings.
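A sigmoidal compute-performance curve of the kind described can be sketched as a saturating function of compute. The parametric form below is an assumption chosen for illustration, and only the asymptote value 0.61 comes from the summary; `r0`, `c_mid`, and `b` are hypothetical parameters with made-up values.

```python
def sigmoidal_reward(compute, asymptote=0.61, r0=0.0, c_mid=1000.0, b=1.0):
    """Saturating reward curve (assumed form, not necessarily the paper's).

    compute   : training compute in arbitrary units (must be > 0)
    asymptote : reward approached as compute grows without bound
    r0        : reward at negligible compute
    c_mid     : compute at which half of the total gain is reached
    b         : steepness of the transition
    """
    if compute <= 0:
        raise ValueError("compute must be positive")
    return r0 + (asymptote - r0) / (1.0 + (c_mid / compute) ** b)


for c in (100.0, 1000.0, 10000.0, 1e6):
    print(c, round(sigmoidal_reward(c), 4))
```

Fitting such a curve to small runs and reading off the asymptote is the sense in which performance can be extrapolated to larger budgets.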
Perception Encoder introduces a family of vision models that achieve state-of-the-art performance across diverse vision and vision-language tasks, demonstrating that general, high-quality visual features can be extracted from the intermediate layers of a single, contrastively-trained network. It provides specific alignment tuning methods to make these features accessible for tasks ranging from zero-shot classification to dense spatial prediction and multimodal language understanding.
FAIR (Meta) researchers introduce a transparent, open-sourced methodology for pre-training CLIP on native worldwide web-scale image-text pairs. This approach overcomes the "curse of multilinguality," allowing a ViT-H/14 model to surpass its English-only counterpart by 0.8% on zero-shot ImageNet and establish new state-of-the-art results across multilingual benchmarks like CVQA (57.4%) and Babel-ImageNet (50.2%).
The Vision Language World Model (VLWM) introduces a foundation model for language-based world modeling from natural videos, enabling high-level task planning. It leverages natural language as an abstract world state representation and achieves state-of-the-art performance on visual planning benchmarks and human-preferred plans in evaluations.
Meta FAIR developed SAM 2, a foundation model extending 'segment anything' capabilities to both images and videos. This model demonstrates state-of-the-art performance across promptable image and video segmentation benchmarks, achieved with 3x fewer user interactions for video and a 6x faster image encoder than its predecessor, leveraging the newly collected, 53x larger Segment Anything Video (SA-V) dataset.
The FAIR (Meta), Hugging Face, and AutoGPT teams introduced GAIA, a benchmark with 466 real-world questions designed to evaluate general AI assistants beyond narrow benchmarks. Experiments on GAIA revealed a substantial performance gap between humans (92% success rate) and state-of-the-art AI systems like GPT-4 with plugins (15% success rate), indicating current limitations in multi-step reasoning, tool use, and multimodal understanding.
Researchers from Stanford University, Meta, and UC Berkeley developed ALOHA, a low-cost, open-source hardware system, alongside ACT, a novel imitation learning algorithm, to enable precise fine-grained bimanual manipulation on affordable robots. The system successfully executes complex tasks such as threading zip cable ties, manipulating small objects, and juggling ping pong balls, demonstrating how advanced learning can compensate for hardware limitations.
Meta FAIR's DINO-world introduces an efficient generalist video world model by leveraging a frozen DINOv2 encoder to predict future states in a semantic latent space, outperforming pixel-based models in dense feature forecasting and enabling effective action-conditioned planning from uncurated video data.
This work introduces a scalable reinforcement learning method for training Large Language Models to generate continuous Chains-of-Thought (CoTs) by injecting noise into mixture embeddings, overcoming prior computational and data dependency issues. Models trained with this approach achieve comparable Pass@1 accuracy, superior Pass@32 performance due to increased reasoning diversity, and improved robustness on out-of-domain tasks compared to discrete CoT training.
The paper introduces an information-theoretic framework based on Kolmogorov complexity to quantify language model memorization, differentiating it from generalization, and estimates a capacity of approximately 3.6 bits-per-parameter for GPT-family models. This work demonstrates that phenomena like double descent occur when the dataset size exceeds model capacity, forcing generalization, and presents scaling laws for membership inference attacks, indicating their statistical insignificance for average data points in large, modern LLMs.
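The capacity estimate lends itself to a quick back-of-the-envelope calculation. The sketch below simply multiplies the approximate 3.6 bits-per-parameter figure from the summary by a parameter count. The helper names are hypothetical, and treating capacity as exactly linear in parameter count is a simplifying assumption.

```python
BITS_PER_PARAM = 3.6  # approximate figure quoted in the summary


def capacity_bits(n_params):
    """Estimated memorization capacity in bits (linear-scaling assumption)."""
    return BITS_PER_PARAM * n_params


def exceeds_capacity(dataset_bits, n_params):
    """True when the dataset holds more information than the model can store."""
    return dataset_bits > capacity_bits(n_params)


n_params = 1_000_000_000
print(capacity_bits(n_params) / 8 / 1e6, "MB (approx.)")
```

Under this estimate a model with one billion parameters could memorize on the order of 450 MB of information. Per the summary, once the dataset exceeds that budget the model is pushed toward generalization.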
Post-training of language models (LMs) often prioritizes accuracy and helpfulness at the expense of diversity. This creates a tension: while post-training improves response quality, it also sharpens output distributions and reduces the range of ideas, limiting the usefulness of LMs in creative and exploratory tasks such as brainstorming, storytelling, or problem solving. We address this challenge with Diversity-Aware Reinforcement Learning (DARLING), a framework that jointly optimizes for response quality and semantic diversity. At its core, DARLING introduces a learned partition function to measure diversity beyond surface-level lexical variations. This diversity signal is then combined with a quality reward during online reinforcement learning, encouraging models to generate outputs that are both high-quality and distinct. Experiments across multiple model families and sizes show that DARLING generalizes to two regimes: non-verifiable tasks (instruction following and creative writing) and verifiable tasks (competition math). On five benchmarks in the first setting, DARLING consistently outperforms quality-only RL baselines, producing outputs that are both higher in quality and more novel. In the second setting, DARLING achieves higher pass@1 (solution quality) and pass@k (solution variety). Most strikingly, explicitly optimizing for diversity catalyzes exploration in online RL, which manifests itself as higher-quality responses.