
Discover, Discuss, and Read arXiv papers

Discover new, recommended papers

Google's Gemini 2.X model family advances AI capabilities by integrating a native multimodal architecture with a novel 'Thinking' mechanism for enhanced reasoning and extended long-context understanding. This enables the development of next-generation agentic systems, while also improving capability-cost efficiency and maintaining robust safety standards.
ESCAPE introduces a rotation-equivariant 3D shape completion method that leverages dynamically selected anchor points and distance encoding to infer missing geometry without requiring object canonicalization. The approach outperforms state-of-the-art techniques on rotated inputs, achieving a Chamfer Distance (CD-L1 × 1000) of 10.58 on the PCN benchmark and a median CD-L1 of 18.82 on the real-world OmniObject dataset.
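For reference, the CD-L1 figures above are symmetric L1 Chamfer Distances between the completed and ground-truth point clouds. A minimal NumPy sketch of the metric; conventions differ slightly across benchmarks (the 1/2 averaging shown here is common, and PCN-style results scale the value by 1000):

```python
import numpy as np

def chamfer_l1(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric L1 Chamfer Distance between point clouds p (N, 3) and q (M, 3)."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # (N, M) pairwise distances
    # Average nearest-neighbor distance in both directions.
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

# Reported as CD-L1 x 1000, lower is better:
# score = 1000 * chamfer_l1(completed_cloud, ground_truth_cloud)
```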
Eugene Wu of Columbia University establishes a theoretical framework for "faithful database visualization," proposing that visualizations should directly map not only data content but also the underlying database constraints, moving beyond the prevalent single-table data model. This work shows how common visualization designs can be understood as emergent properties of multi-table data structures and the visual preservation of their constraints.
Google researchers conducted a comprehensive empirical study of transfer learning in NLP, culminating in T5 (Text-to-Text Transfer Transformer), a unified model that casts all text-based problems as text-to-text tasks. The study systematically evaluated various model architectures, pre-training objectives, and datasets, and the largest T5 model achieved state-of-the-art results on 18 benchmarks and nearly matched human performance on SuperGLUE.
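The text-to-text framing means every task is expressed as plain text with a task prefix, so one model and one loss cover translation, summarization, classification, and more. A minimal sketch using the Hugging Face transformers library (not the paper's original codebase) and the public t5-small checkpoint:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task becomes text-to-text via a task prefix.
for prompt in [
    "translate English to German: The house is wonderful.",
    "summarize: studies have shown that owning a dog is good for you ...",
]:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=40)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```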
Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations for these "vision-first" tasks, as a supplementary channel to language-based reasoning. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), empowered by GRPO for post-training large vision models, leading to substantial improvements in planning in a selection of representative visual navigation tasks, FrozenLake, Maze, and MiniBehavior. Our visual planning paradigm outperforms all other planning variants that conduct reasoning in the text-only space. Our results establish Visual Planning as a viable and promising supplement to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.
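GRPO avoids a learned value critic by normalizing each rollout's reward against the other rollouts sampled for the same task instance. A minimal sketch of that group-relative advantage computation; the binary goal-reached rewards and the group size are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages: standardize rewards within one group of
    rollouts sampled for the same task instance (no value critic needed)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# e.g. 5 candidate image-sequence plans, scored 1.0 if the plan reaches the goal
adv = grpo_advantages(np.array([1.0, 0.0, 1.0, 0.0, 0.0]))
# Positive entries up-weight successful plans in the policy-gradient update
# applied to the large vision model's generation probabilities.
```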
The TIGER framework introduces a generative retrieval approach for recommender systems by converting item content into semantically meaningful discrete identifiers. This method allows a Transformer model to directly generate item IDs, leading to state-of-the-art performance in sequential recommendation, enhanced cold-start capabilities, and controllable recommendation diversity.
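TIGER's semantically meaningful identifiers come from residually quantizing an item's content embedding into a short, coarse-to-fine tuple of codes. A minimal sketch of the greedy residual-quantization step, assuming the per-level codebooks have already been trained (the paper learns them with an RQ-VAE):

```python
import numpy as np

def semantic_id(embedding: np.ndarray, codebooks: list[np.ndarray]) -> list[int]:
    """Greedy residual quantization: each level picks the nearest codeword
    for the current residual, yielding a coarse-to-fine tuple of indices."""
    residual = embedding.copy()
    codes = []
    for cb in codebooks:                      # cb: (K, d) codewords for this level
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]         # quantize whatever is left over
    return codes                              # e.g. [12, 3, 45] is the item's Semantic ID
```

The Transformer then treats each code as a vocabulary token and autoregressively generates the next item's tuple, which is what enables generalization to cold-start items with similar content.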
Google's Gemini introduces a family of natively multimodal AI models trained across text, image, audio, and video data, setting new state-of-the-art performance across 30 of 32 benchmarks, including becoming the first model to surpass human-expert performance on the MMLU exam.
Unprecedented volumes of Earth observation data are continually collected around the world, but high-quality labels remain scarce given the effort required to make physical measurements and observations. This has led to considerable investment in bespoke modeling efforts translating sparse labels into maps. Here we introduce AlphaEarth Foundations, an embedding field model yielding a highly general geospatial representation that assimilates spatial, temporal, and measurement contexts across multiple sources, enabling accurate and efficient production of maps and monitoring systems from local to global scales. The embeddings generated by AlphaEarth Foundations are the only ones tested that consistently outperform a suite of well-known, widely used featurization approaches on a diverse set of mapping evaluations without re-training. We have released a dataset of global, annual, analysis-ready embedding field layers covering 2017 through 2024.
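The intended usage pattern is that a very simple model on top of the frozen embeddings suffices to turn sparse labels into a map. A hypothetical sketch of a linear probe; the file names are placeholders, and the 64-dimensional embedding size is an assumption matching the released annual layers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder inputs: embedding vectors sampled at labeled locations and the
# corresponding sparse field observations (e.g. land-cover classes).
X = np.load("embeddings_at_labeled_points.npy")  # (n_points, 64)
y = np.load("labels.npy")                        # (n_points,)

# A simple linear probe on frozen embeddings; producing the map then amounts
# to classifying every pixel's embedding vector with the fitted model.
probe = LogisticRegression(max_iter=1000).fit(X, y)
```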
Researchers from Google DeepMind developed a two-stage post-training framework that integrates online reinforcement learning with embodied foundation models, utilizing the model's self-predicted "steps-to-go" as a data-driven reward signal. This approach allows robots to autonomously refine existing skills and acquire novel behaviors, demonstrating substantial performance gains and superior sample efficiency across various robotic tasks.
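The core idea is that the foundation model's own progress estimate can serve as a dense reward. A sketch of the reward shaping this implies; predict_steps_to_go is a hypothetical interface standing in for the model's self-prediction:

```python
def steps_to_go_reward(model, obs_before, obs_after, task: str) -> float:
    """Dense reward = predicted progress: how many fewer steps the model
    believes remain toward completing `task` after the action was taken.
    (`model.predict_steps_to_go` is a hypothetical interface.)"""
    return (model.predict_steps_to_go(obs_before, task)
            - model.predict_steps_to_go(obs_after, task))
```

Actions that move the robot toward completion earn positive reward, so the policy can improve autonomously without hand-engineered reward functions or human labeling of successes.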
Researchers developed a framework to improve diffusion model performance by searching for better sampling noise during inference, rather than solely relying on more denoising steps. This approach yielded substantial gains in sample quality across various tasks, enabling smaller models with search to often surpass the performance of larger models without it.
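The simplest instantiation of such a search is best-of-N over starting noises: denoise from several candidates and keep the sample a verifier scores highest. A minimal sketch; sample_fn and verifier are placeholders for a full denoising run and a scoring model (the paper explores several verifier and search-algorithm choices beyond random search):

```python
import torch

def search_noise(sample_fn, verifier, shape, n_candidates: int = 16):
    """Inference-time search: spend compute trying different initial noises
    instead of only adding more denoising steps (random-search variant)."""
    best, best_score = None, float("-inf")
    for _ in range(n_candidates):
        noise = torch.randn(shape)          # candidate starting noise
        sample = sample_fn(noise)           # full denoising from this noise
        score = verifier(sample)            # e.g. an aesthetic or reward model
        if score > best_score:
            best, best_score = sample, score
    return best
```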
This empirical study investigates the effective training of LLM agents for interleaved reasoning and search using reinforcement learning. The research found that incorporating format rewards is critical, general-purpose LLMs outperform reasoning-specialized models in RL settings, and the quality of the search engine during training significantly influences learned agent behaviors.
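A format reward of the kind the study finds critical simply checks that a rollout is well-formed before any correctness reward is granted. A minimal sketch; the tag names are illustrative, not the paper's exact schema:

```python
import re

def format_reward(response: str) -> float:
    """Reward well-formed interleaved traces: a reasoning block, balanced
    (optional) search calls, and a final answer block."""
    has_think = re.search(r"<think>.*?</think>", response, re.DOTALL) is not None
    has_answer = re.search(r"<answer>.*?</answer>", response, re.DOTALL) is not None
    balanced_search = response.count("<search>") == response.count("</search>")
    return 1.0 if (has_think and has_answer and balanced_search) else 0.0
```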
Google's RT-1 is a Transformer-based model trained on 130,000 real-world robotic demonstrations across 700 language-conditioned tasks, achieving a 76% success rate on unseen instructions and exhibiting strong robustness to distractors and novel environments. It effectively integrates diverse data sources, including simulation and data from other robot morphologies, with minimal performance degradation on original tasks.
1,501
Atlas introduces a set of enhancements to recurrent neural networks that lets them learn to memorize context at test time over long sequences. The model achieves state-of-the-art performance on language modeling and common-sense reasoning, processes contexts of up to 10 million tokens, and outperforms existing recurrent models and standard Transformers on extreme long-context tasks.
This research from Robotics at Google and Everyday Robots introduces SayCan, a framework that enables robots to follow complex natural language instructions by combining the high-level semantic understanding of Large Language Models (LLMs) with learned robotic affordance functions. The system achieves 74% execution success on 101 diverse real-world tasks in a mock kitchen and demonstrates that robot performance scales directly with the size of the underlying LLM.
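SayCan's skill selection multiplies two scores: the LLM's estimate that a skill's text description is useful for the instruction ("say") and a learned value function's estimate that the skill can succeed from the current state ("can"). A minimal sketch of that combination; the scoring interfaces in the comment are assumed placeholders:

```python
import math

def saycan_score(llm_logprob: float, affordance_prob: float) -> float:
    """Combined usefulness-times-feasibility score for one candidate skill."""
    return math.exp(llm_logprob) * affordance_prob

# Greedy selection at each step, then re-plan from the new state:
# best_skill = max(skills, key=lambda s: saycan_score(llm_score(instruction, s),
#                                                     value_fn(state, s)))
```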
EmbeddingGemma, a lightweight, open-source text embedding model from Google with 308M parameters, achieves state-of-the-art performance for its size on MTEB multilingual, English, and code benchmarks, often matching or exceeding much larger models and commercial APIs. It maintains strong performance even with quantized weights and truncated embeddings, demonstrating exceptional cross-lingual capabilities.
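Truncated embeddings work because the model is trained Matryoshka-style, so the leading coordinates carry most of the signal; at query time you keep a prefix and re-normalize. A minimal sketch:

```python
import numpy as np

def truncate_embedding(e: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` coordinates of a full embedding and re-normalize,
    trading a small amount of retrieval quality for memory and latency."""
    t = e[:dim]
    return t / np.linalg.norm(t)

# e.g. shrink a 768-d EmbeddingGemma vector to a 256-d index entry
# small = truncate_embedding(full_vector, 256)
```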
Google's Gemini 1.5 model family dramatically expands the context window for multimodal AI, processing millions of tokens across text, images, video, and audio, while leveraging sparse Mixture-of-Experts architecture and online distillation for improved computational efficiency. The models demonstrate near-perfect recall and enhanced reasoning over vast contexts, outperforming prior models on complex long-form tasks and showing significant productivity gains in professional applications.
ETH Zurich, KAIST, University of Washington, Rensselaer Polytechnic Institute, Google DeepMind, University of Amsterdam, University of Illinois at Urbana-Champaign, University of Cambridge, Heidelberg University, University of Waterloo, Facebook, Carnegie Mellon University, University of Southern California, Google, New York University, University of Stuttgart, UC Berkeley, National University of Singapore, University College London, University of Oxford, LMU Munich, Shanghai Jiao Tong University, University of California, Irvine, Tsinghua University, Stanford University, University of Michigan, University of Copenhagen, The Chinese University of Hong Kong, University of Melbourne, Meta, University of Edinburgh, OpenAI, The University of Texas at Austin, Cornell University, University of California, San Diego, Yonsei University, McGill University, Boston University, University of Bamberg, Nanyang Technological University, Microsoft, KU Leuven, Columbia University, UC Santa Barbara, Allen Institute for AI, German Research Center for Artificial Intelligence (DFKI), University of Pennsylvania, Johns Hopkins University, Arizona State University, University of Maryland, University of Tokyo, University of North Carolina at Chapel Hill, Hebrew University of Jerusalem, Amazon, Tilburg University, University of Massachusetts Amherst, University of Rochester, University of Duisburg-Essen, Sapienza University of Rome, University of Sheffield, Princeton University, HKUST, University of Tübingen, TU Berlin, Saarland University, Bar-Ilan University, Technical University of Darmstadt, University of Haifa, University of Trento, University of Montreal, Bilkent University, University of Cape Town, IBM, University of Mannheim, ServiceNow, Potsdam University, Polish-Japanese Academy of Information Technology, Salesforce, ASAPP, AI21 Labs, Valencia Polytechnic University
A large-scale, diverse benchmark, BIG-bench, was introduced to rigorously evaluate the capabilities and limitations of large language models across 204 tasks. The evaluation revealed that even state-of-the-art models achieve aggregate scores below 20 on a 0-100 normalized scale, well below human-expert performance.
Google researchers introduced a general sequence-to-sequence (seq2seq) architecture built from deep Long Short-Term Memory (LSTM) networks. The model delivered superior performance on large-scale English-to-French machine translation, surpassing a phrase-based statistical machine translation baseline.
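A minimal PyTorch sketch of the architecture (a modern re-implementation, not the paper's original code): two stacked LSTMs where the encoder's final states initialize the decoder, with the source sequence fed in reverse, a trick the paper found to substantially improve translation quality:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal LSTM encoder-decoder in the style of Sutskever et al. (2014):
    the encoder compresses the source sentence into a fixed-size vector
    (its final hidden/cell states), which initializes the decoder."""
    def __init__(self, src_vocab: int, tgt_vocab: int, d: int = 256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d)
        self.tgt_emb = nn.Embedding(tgt_vocab, d)
        self.encoder = nn.LSTM(d, d, num_layers=2, batch_first=True)
        self.decoder = nn.LSTM(d, d, num_layers=2, batch_first=True)
        self.out = nn.Linear(d, tgt_vocab)

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        # Feed the source REVERSED: this shortens early dependencies between
        # source and target words, which the paper found markedly helps.
        _, state = self.encoder(self.src_emb(src.flip(dims=[1])))
        dec, _ = self.decoder(self.tgt_emb(tgt), state)
        return self.out(dec)  # (batch, tgt_len, tgt_vocab) logits
```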
Google and University of Cambridge researchers introduce MASS (Multi-Agent System Search), a comprehensive framework that automatically optimizes both prompts and topologies in multi-agent systems, achieving superior performance across multiple tasks while establishing foundational design principles through a novel three-stage optimization approach.
Researchers from Google, Google DeepMind, and ETH Zurich introduced CaMeL, a system-level defense that secures Large Language Model (LLM) agents against prompt injection attacks by integrating traditional software security principles like control and data flow integrity and capabilities. This approach achieved 0 successful prompt injection attacks on the AgentDojo benchmark, significantly outperforming heuristic methods, while maintaining 77% task success.
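CaMeL's data-flow integrity amounts to tracking where each value came from and refusing to let untrusted values reach security-sensitive actions. An illustrative taint-tracking sketch in that spirit; the class and policy below are simplified assumptions, not CaMeL's actual interpreter or API:

```python
from dataclasses import dataclass, field

@dataclass
class Tainted:
    """A value paired with capability metadata recording its provenance."""
    value: str
    sources: set = field(default_factory=set)  # e.g. {"user"} or {"web"}

def send_email(to: Tainted, body: Tainted) -> None:
    # Policy: arguments derived from untrusted tool output (e.g. a fetched
    # web page) must never control a side-effecting action.
    for arg in (to, body):
        if "web" in arg.sources:
            raise PermissionError("untrusted data may not flow into send_email")
    print(f"sending email to {to.value}")
```

Because the policy is enforced outside the LLM, a prompt injection that manipulates the model's output still cannot route attacker-derived values into protected tools.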