alphaXiv

Discover, Discuss, and Read arXiv papers

Discover new, recommended papers

Meta

6,082

21 Aug 2025

agents chain-of-thought computer-science

Deep Think with Confidence

Meta

University of California, San Diego

Researchers from UCSD and Meta AI developed DeepConf, a method that enhances Large Language Model reasoning efficiency and performance by leveraging local confidence signals to dynamically prune low-quality reasoning traces and weight final answer aggregation. DeepConf achieved up to 84.7% reduction in generated tokens and improved accuracy, reaching 99.9% on AIME 2025 with GPT-OSS-120B, without requiring additional model training.
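The pruning and weighted aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes each reasoning trace has already been reduced to a final answer and a single scalar confidence score, and the function name, the `keep_fraction` parameter, and the example scores are all hypothetical.

```python
from collections import defaultdict


def confidence_weighted_vote(traces, keep_fraction=0.5):
    """Pick an answer from (answer, confidence) pairs.

    Hypothetical sketch: keep only the most confident traces, then let
    each surviving trace vote with a weight equal to its confidence.
    """
    if not traces:
        raise ValueError("need at least one trace")
    # Rank traces from most to least confident.
    ranked = sorted(traces, key=lambda t: t[1], reverse=True)
    # Prune the low-confidence tail, always keeping at least one trace.
    keep = max(1, int(len(ranked) * keep_fraction))
    totals = defaultdict(float)
    for answer, confidence in ranked[:keep]:
        totals[answer] += confidence
    # The answer with the largest summed confidence wins.
    return max(totals, key=totals.get)


# Made-up scores: "41" has the most raw votes, "42" the most confidence.
example_traces = [
    ("42", 0.9), ("41", 0.2), ("42", 0.8),
    ("17", 0.3), ("41", 0.85), ("41", 0.1),
]
winner = confidence_weighted_vote(example_traces, keep_fraction=0.5)
```

In this made-up example an unweighted majority vote would pick "41" (three votes), while filtering and weighting by confidence picks "42".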

49,469

14 Mar 2025

computer-science computer-vision-security computer-vision-and-pattern-recognition

VGGT: Visual Geometry Grounded Transformer

University of Oxford

Meta

VGGT, developed by VGG at the University of Oxford and Meta AI, introduces a 1.2 billion-parameter feed-forward transformer that directly infers camera parameters, depth maps, and 3D point clouds from multiple input images in a single pass. This model achieves state-of-the-art accuracy in 3D reconstruction and camera pose estimation (e.g., 85.3 AUC@30 on RealEstate10K) while significantly reducing inference time to approximately 0.2 seconds per scene.

9,237

19,905

08 Feb 2023

computer-science artificial-intelligence machine-learning

Flow Matching for Generative Modeling

Meta

Weizmann Institute of Science

Flow Matching introduces a simulation-free method for scalably training Continuous Normalizing Flows (CNFs) by leveraging a conditional objective with identical gradients to the marginal objective. The approach achieves state-of-the-art performance on ImageNet, particularly with Optimal Transport paths, leading to faster training convergence and significantly more efficient sample generation through fewer function evaluations.
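To make "simulation-free" concrete, here is a minimal pure-Python sketch of the conditional regression target for the Optimal Transport path. It assumes the linear path x_t = (1 - (1 - sigma_min) t) x0 + t x1 with target velocity x1 - (1 - sigma_min) x0; the function names and the `velocity_fn` callback are hypothetical stand-ins for a learned network, not code from the paper.

```python
def ot_conditional_pair(x0, x1, t, sigma_min=1e-4):
    """Return (x_t, target_velocity) on the OT conditional path.

    x0 is a noise sample, x1 a data sample, t a time in [0, 1].
    Both outputs are closed-form, so no ODE is solved during training.
    """
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    target = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, target


def conditional_fm_loss(velocity_fn, x0, x1, t, sigma_min=1e-4):
    """Squared error between a predicted velocity and the target.

    velocity_fn(x_t, t) is a hypothetical model returning a list of floats.
    """
    x_t, target = ot_conditional_pair(x0, x1, t, sigma_min)
    predicted = velocity_fn(x_t, t)
    return sum((p - u) ** 2 for p, u in zip(predicted, target))


# With sigma_min = 0 the path is a straight line from x0 to x1.
x_mid, u_mid = ot_conditional_pair([0.0, 0.0], [1.0, 2.0], t=0.5, sigma_min=0.0)
```

In training, this loss would be averaged over random draws of t, x0, and x1.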

2,774

24,882

02 Mar 2023

computer-science computer-vision-and-pattern-recognition machine-learning

Scalable Diffusion Models with Transformers

New York University

UC Berkeley

Meta

This paper introduces Diffusion Transformers (DiTs), a new class of diffusion models that replace the conventional U-Net backbone with a transformer architecture. By leveraging the scalability of transformers, DiTs achieve new state-of-the-art Fréchet Inception Distance (FID) scores on class-conditional ImageNet at 256x256 (2.27 FID) and 512x512 (3.04 FID) resolutions while being more compute-efficient.

6,839

30,202

23 Nov 2024

computer-science artificial-intelligence computation-and-language

The Llama 3 Herd of Models

Meta

Laurens van der Maaten

Ian Zhou

Llama 3, a new family of foundation models from Meta AI, achieves performance competitive with leading closed-source models across various benchmarks, while its smaller variants set new standards for open-source LLMs. The models demonstrate enhanced capabilities in multilinguality, coding, reasoning, and tool use, alongside robust safety measures.

8,130

11 Jun 2025

computer-science artificial-intelligence computer-vision-and-pattern-recognition

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Meta

Mila – Quebec AI Institute

Polytechnique Montréal

FAIR at Meta developed V-JEPA 2, a self-supervised video model that learns a general world model from over 1 million hours of internet video, then adapts this model with a small amount of unlabeled robot data to enable zero-shot robot control and strong performance in video understanding and prediction tasks. The model achieves 80% success for pick-and-place tasks with a cup in novel environments and sets new benchmarks in video question-answering, demonstrating the efficacy of learning predictive world models in representation space.

228

4,140

07 Apr 2024

attention-mechanisms computer-science artificial-intelligence

Efficient Streaming Language Models with Attention Sinks

Carnegie Mellon University

Meta

Researchers from MIT, Meta AI, CMU, and NVIDIA developed StreamingLLM, a framework enabling Large Language Models to efficiently process infinitely long input sequences without fine-tuning. This is achieved by leveraging an "attention sink" phenomenon where LLMs disproportionately attend to initial tokens, allowing the model to maintain stable perplexity and achieve up to a 22.2x speedup in decoding latency.
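One way to picture how attention sinks allow unbounded streaming is a key/value cache that always keeps the first few positions and a sliding window of recent ones. The sketch below only computes which positions survive. The function name and the default sizes are assumptions chosen for illustration, and the summary above does not spell out the cache policy.

```python
def sink_cache_positions(seq_len, num_sink=4, window=8):
    """Return the token positions kept in the cache.

    Hypothetical policy: keep the first `num_sink` positions (the
    "attention sinks") plus the most recent `window` positions, and
    drop everything in between.
    """
    if seq_len < 0 or num_sink < 0 or window < 0:
        raise ValueError("sizes must be non-negative")
    if seq_len <= num_sink + window:
        # Everything still fits, so nothing is dropped yet.
        return list(range(seq_len))
    sinks = list(range(num_sink))
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent


kept = sink_cache_positions(20, num_sink=2, window=3)
```

The cache size stays bounded by `num_sink + window` however long the stream grows.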

6,792

38,323

09 Dec 2024

computer-science machine-learning generative-models

Flow Matching Guide and Code

Meta

Weizmann Institute of Science

MIT

Itai Gat

This paper introduces Flow Matching (FM) as a comprehensive framework for training generative models across different data modalities.

1,662

15 Jul 2025

chain-of-thought computer-science artificial-intelligence

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Google DeepMind

Anthropic

Université de Montréal

UC Berkeley

Meta

OpenAI

Amazon

Scale AI

Truthful AI

Center for AI Safety

Apollo Research

METR

Redwood Research

UK AI Security Institute

MAGIC

AI Futures Project

Victoria Krakovna

Chain of Thought (CoT) monitorability offers a distinct capability for AI safety by providing insight into an AI's internal reasoning processes, including potential intent to misbehave. This paper argues that while currently useful for detecting misbehavior and misalignment, this property is fragile and requires proactive research and development to preserve it as AI systems scale.

1,629

15 Oct 2025

computer-science artificial-intelligence machine-learning

The Art of Scaling Reinforcement Learning Compute for LLMs

Harvard University

UT Austin

UCL

UC Berkeley

Meta

Periodic Labs

Researchers introduced a predictive framework for Reinforcement Learning (RL) in Large Language Models (LLMs) using a sigmoidal compute-performance curve, enabling performance extrapolation from smaller runs. Their ScaleRL recipe, demonstrated over 100,000 GPU-hours, achieves an asymptotic reward of 0.61 on verifiable math problems, outperforming established methods while exhibiting predictable scaling across model size, generation length, and multi-task settings.
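A sigmoidal compute-performance curve can be sketched as below. The functional form (a saturating sigmoid in compute with a baseline, an asymptote, an exponent, and a midpoint) is an assumption for illustration. Only the asymptotic reward of 0.61 comes from the summary above, and every other name and value is hypothetical.

```python
def sigmoid_reward(compute, baseline, asymptote, exponent, c_mid):
    """Predicted reward at a given compute budget.

    Assumed form: baseline + (asymptote - baseline) / (1 + (c_mid / compute) ** exponent).
    The reward starts near `baseline`, reaches the halfway point at
    compute == c_mid, and saturates toward `asymptote`.
    """
    if compute <= 0 or c_mid <= 0:
        raise ValueError("compute and c_mid must be positive")
    return baseline + (asymptote - baseline) / (1.0 + (c_mid / compute) ** exponent)


# Hypothetical parameters except the 0.61 asymptote quoted in the summary.
halfway = sigmoid_reward(1000.0, baseline=0.2, asymptote=0.61, exponent=1.5, c_mid=1000.0)
```

Fitting these parameters on small runs and then evaluating the curve at a larger budget is what extrapolation means here. The fitting step is omitted from this sketch.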

126,540

28 Apr 2025

computer-science contrastive-learning computer-vision-and-pattern-recognition

Perception Encoder: The best visual embeddings are not at the output of the network

UT Austin

Fudan University

Meta

MBZUAI

Meta Reality Labs

Daniel Bolya

Andrea Madotto

Perception Encoder introduces a family of vision models that achieve state-of-the-art performance across diverse vision and vision-language tasks, demonstrating that general, high-quality visual features can be extracted from the intermediate layers of a single, contrastively-trained network. It provides specific alignment tuning methods to make these features accessible for tasks ranging from zero-shot classification to dense spatial prediction and multimodal language understanding.

383

1,490

01 Aug 2025

computer-science contrastive-learning computation-and-language

Meta CLIP 2: A Worldwide Scaling Recipe

New York University

Meta

MIT

Princeton University

FAIR (Meta) researchers introduce a transparent, open-sourced methodology for pre-training CLIP on native worldwide web-scale image-text pairs. This approach overcomes the "curse of multilinguality," allowing a ViT-H/14 model to surpass its English-only counterpart by 0.8% on zero-shot ImageNet and establish new state-of-the-art results across multilingual benchmarks like CVQA (57.4%) and Babel-ImageNet (50.2%).

1,688

1,302

06 Sep 2025

agents chain-of-thought computer-science

Planning with Reasoning using Vision Language World Model

University of Southern California

Meta ISIR Sorbonne Université

The Vision Language World Model (VLWM) introduces a foundation model for language-based world modeling from natural videos, enabling high-level task planning. It leverages natural language as an abstract world state representation and achieves state-of-the-art performance on visual planning benchmarks and human-preferred plans in evaluations.

9,822

28 Oct 2024

computer-science artificial-intelligence computer-vision-and-pattern-recognition

SAM 2: Segment Anything in Images and Videos

Meta

Ross Girshick

Meta FAIR developed SAM 2, a foundation model extending 'segment anything' capabilities to both images and videos. This model demonstrates state-of-the-art performance across promptable image and video segmentation benchmarks, achieved with 3x fewer user interactions for video and a 6x faster image encoder than its predecessor, leveraging the newly collected, 53x larger Segment Anything Video (SA-V) dataset.

15,593

3,970

21 Nov 2023

computer-science artificial-intelligence computation-and-language

GAIA: a benchmark for General AI Assistants

Meta

Hugging Face

AutoGPT

The FAIR (Meta), Hugging Face, and AutoGPT teams introduced GAIA, a benchmark with 466 real-world questions designed to evaluate general AI assistants beyond narrow benchmarks. Experiments on GAIA revealed a substantial performance gap between humans (92% success rate) and state-of-the-art AI systems like GPT-4 with plugins (15% success rate), indicating current limitations in multi-step reasoning, tool use, and multimodal understanding.

43,934

23 Apr 2023

computer-science machine-learning robotics

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

UC Berkeley

Stanford University

Meta

Researchers from Stanford University, Meta, and UC Berkeley developed ALOHA, a low-cost, open-source hardware system, alongside ACT, a novel imitation learning algorithm, to enable precise fine-grained bimanual manipulation on affordable robots. The system successfully executes complex tasks such as threading zip cable ties, manipulating small objects, and juggling ping pong balls, demonstrating how advanced learning can compensate for hardware limitations.

1,205

1,151

25 Jul 2025

computer-science computer-vision-and-pattern-recognition generative-models

Back to the Features: DINO as a Foundation for Video World Models

Meta

Meta FAIR's DINO-world introduces an efficient generalist video world model by leveraging a frozen DINOv2 encoder to predict future states in a semantic latent space, outperforming pixel-based models in dense feature forecasting and enabling effective action-conditioned planning from uncurated video data.

1,132

24 Sep 2025

chain-of-thought computer-science artificial-intelligence

Soft Tokens, Hard Truths

University of Amsterdam

New York University

Meta

This work introduces a scalable reinforcement learning method for training Large Language Models to generate continuous Chains-of-Thought (CoTs) by injecting noise into mixture embeddings, overcoming prior computational and data dependency issues. Models trained with this approach achieve comparable Pass@1 accuracy, superior Pass@32 performance due to increased reasoning diversity, and improved robustness on out-of-domain tasks compared to discrete CoT training.

6,599

18 Jun 2025

computer-science computation-and-language generative-models

How much do language models memorize?

Chawin Sitawarin

The paper introduces an information-theoretic framework based on Kolmogorov complexity to quantify language model memorization, differentiating it from generalization, and estimates a capacity of approximately 3.6 bits-per-parameter for GPT-family models. This work demonstrates that phenomena like double descent occur when the dataset size exceeds model capacity, forcing generalization, and presents scaling laws for membership inference attacks, indicating their statistical insignificance for average data points in large, modern LLMs.

907

02 Sep 2025

computer-science computation-and-language machine-learning

Jointly Reinforcing Diversity and Quality in Language Model Generations

Carnegie Mellon University

Meta

Johns Hopkins University

Post-training of Large Language Models (LMs) often prioritizes accuracy and helpfulness at the expense of diversity. This creates a tension: while post-training improves response quality, it also sharpens output distributions and reduces the range of ideas, limiting the usefulness of LMs in creative and exploratory tasks such as brainstorming, storytelling, or problem solving. We address this challenge with Diversity-Aware Reinforcement Learning (DARLING), a framework that jointly optimizes for response quality and semantic diversity. At its core, DARLING introduces a learned partition function to measure diversity beyond surface-level lexical variations. This diversity signal is then combined with a quality reward during online reinforcement learning, encouraging models to generate outputs that are both high-quality and distinct. Experiments across multiple model families and sizes show that DARLING generalizes to two regimes: non-verifiable tasks (instruction following and creative writing) and verifiable tasks (competition math). On five benchmarks in the first setting, DARLING consistently outperforms quality-only RL baselines, producing outputs that are simultaneously of higher quality and novelty. In the second setting, DARLING achieves higher pass@1 (solution quality) and pass@k (solution variety). Most strikingly, explicitly optimizing for diversity catalyzes exploration in online RL, which manifests itself as higher-quality responses.
