
Discover, Discuss, and Read arXiv papers

Discover new, recommended papers

Google's Gemini 2.X model family advances AI capabilities by integrating a native multimodal architecture with a novel 'Thinking' mechanism for enhanced reasoning and extended long-context understanding. This enables the development of next-generation agentic systems, while also improving capability-cost efficiency and maintaining robust safety standards.
ESCAPE introduces a rotation-equivariant 3D shape completion method that leverages dynamically selected anchor points and distance encoding to infer missing geometry without requiring object canonicalization. The approach outperforms state-of-the-art techniques on rotated inputs, achieving a Chamfer Distance (CD-L1 × 1000) of 10.58 on the PCN benchmark and a median CD-L1 of 18.82 on the real-world OmniObject dataset.
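For reference, the CD-L1 figures above are symmetric L1 Chamfer Distances between the completed and ground-truth point clouds. A minimal NumPy sketch of the metric; conventions differ slightly across benchmarks (the 1/2 averaging shown here is common, and PCN-style results scale the value by 1000):

```python
import numpy as np

def chamfer_l1(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric L1 Chamfer Distance between point clouds p (N, 3) and q (M, 3)."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # (N, M) pairwise distances
    # Average nearest-neighbor distance in both directions.
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

# Reported as CD-L1 x 1000, lower is better:
# score = 1000 * chamfer_l1(completed_cloud, ground_truth_cloud)
```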
Eugene Wu of Columbia University establishes a theoretical framework for "faithful database visualization," proposing that visualizations should directly map not only data content but also the underlying database constraints, moving beyond the prevalent single-table data model. This work shows how common visualization designs can be understood as emergent properties of multi-table data structures and the visual preservation of their constraints.
Google researchers conducted a comprehensive empirical study of transfer learning in NLP, culminating in T5 (Text-to-Text Transfer Transformer), a unified model that casts all text-based problems as text-to-text tasks. The study systematically evaluated various model architectures, pre-training objectives, and datasets, and the largest T5 model achieved state-of-the-art results on 18 benchmarks and nearly matched human performance on SuperGLUE.
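The text-to-text framing means every task is expressed as plain text with a task prefix, so one model and one loss cover translation, summarization, classification, and more. A minimal sketch using the Hugging Face transformers library (not the paper's original codebase) and the public t5-small checkpoint:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task becomes text-to-text via a task prefix.
for prompt in [
    "translate English to German: The house is wonderful.",
    "summarize: studies have shown that owning a dog is good for you ...",
]:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=40)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```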
Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations for these "vision-first" tasks, as a supplementary channel to language-based reasoning. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), empowered by GRPO for post-training large vision models, leading to substantial improvements in planning in a selection of representative visual navigation tasks, FrozenLake, Maze, and MiniBehavior. Our visual planning paradigm outperforms all other planning variants that conduct reasoning in the text-only space. Our results establish Visual Planning as a viable and promising supplement to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.
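GRPO avoids a learned value critic by normalizing each rollout's reward against the other rollouts sampled for the same task instance. A minimal sketch of that group-relative advantage computation; the binary goal-reached rewards and the group size are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages: standardize rewards within one group of
    rollouts sampled for the same task instance (no value critic needed)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# e.g. 5 candidate image-sequence plans, scored 1.0 if the plan reaches the goal
adv = grpo_advantages(np.array([1.0, 0.0, 1.0, 0.0, 0.0]))
# Positive entries up-weight successful plans in the policy-gradient update
# applied to the large vision model's generation probabilities.
```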
The TIGER framework introduces a generative retrieval approach for recommender systems by converting item content into semantically meaningful discrete identifiers. This method allows a Transformer model to directly generate item IDs, leading to state-of-the-art performance in sequential recommendation, enhanced cold-start capabilities, and controllable recommendation diversity.
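TIGER's semantically meaningful identifiers come from residually quantizing an item's content embedding into a short, coarse-to-fine tuple of codes. A minimal sketch of the greedy residual-quantization step, assuming the per-level codebooks have already been trained (the paper learns them with an RQ-VAE):

```python
import numpy as np

def semantic_id(embedding: np.ndarray, codebooks: list[np.ndarray]) -> list[int]:
    """Greedy residual quantization: each level picks the nearest codeword
    for the current residual, yielding a coarse-to-fine tuple of indices."""
    residual = embedding.copy()
    codes = []
    for cb in codebooks:                      # cb: (K, d) codewords for this level
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]         # quantize whatever is left over
    return codes                              # e.g. [12, 3, 45] is the item's Semantic ID
```

The Transformer then treats each code as a vocabulary token and autoregressively generates the next item's tuple, which is what enables generalization to cold-start items with similar content.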
Google's Gemini introduces a family of natively multimodal AI models trained across text, image, audio, and video data, setting new state-of-the-art performance across 30 of 32 benchmarks, including becoming the first model to surpass human-expert performance on the MMLU exam.
Unprecedented volumes of Earth observation data are continually collected around the world, but high-quality labels remain scarce given the effort required to make physical measurements and observations. This has led to considerable investment in bespoke modeling efforts translating sparse labels into maps. Here we introduce AlphaEarth Foundations, an embedding field model yielding a highly general geospatial representation that assimilates spatial, temporal, and measurement contexts across multiple sources, enabling accurate and efficient production of maps and monitoring systems from local to global scales. The embeddings generated by AlphaEarth Foundations are the only ones tested that consistently outperform a suite of well-known, widely used featurization approaches on a diverse set of mapping evaluations without re-training. We have released a dataset of global, annual, analysis-ready embedding field layers covering 2017 through 2024.
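The intended usage pattern is that a very simple model on top of the frozen embeddings suffices to turn sparse labels into a map. A hypothetical sketch of a linear probe; the file names are placeholders, and the 64-dimensional embedding size is an assumption matching the released annual layers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder inputs: embedding vectors sampled at labeled locations and the
# corresponding sparse field observations (e.g. land-cover classes).
X = np.load("embeddings_at_labeled_points.npy")  # (n_points, 64)
y = np.load("labels.npy")                        # (n_points,)

# A simple linear probe on frozen embeddings; producing the map then amounts
# to classifying every pixel's embedding vector with the fitted model.
probe = LogisticRegression(max_iter=1000).fit(X, y)
```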
Researchers from Google DeepMind developed a two-stage post-training framework that integrates online reinforcement learning with embodied foundation models, utilizing the model's self-predicted "steps-to-go" as a data-driven reward signal. This approach allows robots to autonomously refine existing skills and acquire novel behaviors, demonstrating substantial performance gains and superior sample efficiency across various robotic tasks.
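The core idea is that the foundation model's own progress estimate can serve as a dense reward. A sketch of the reward shaping this implies; predict_steps_to_go is a hypothetical interface standing in for the model's self-prediction:

```python
def steps_to_go_reward(model, obs_before, obs_after, task: str) -> float:
    """Dense reward = predicted progress: how many fewer steps the model
    believes remain toward completing `task` after the action was taken.
    (`model.predict_steps_to_go` is a hypothetical interface.)"""
    return (model.predict_steps_to_go(obs_before, task)
            - model.predict_steps_to_go(obs_after, task))
```

Actions that move the robot toward completion earn positive reward, so the policy can improve autonomously without hand-engineered reward functions or human labeling of successes.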
Researchers developed a framework to improve diffusion model performance by searching for better sampling noise during inference, rather than solely relying on more denoising steps. This approach yielded substantial gains in sample quality across various tasks, enabling smaller models with search to often surpass the performance of larger models without it.
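The simplest instantiation of such a search is best-of-N over starting noises: denoise from several candidates and keep the sample a verifier scores highest. A minimal sketch; sample_fn and verifier are placeholders for a full denoising run and a scoring model (the paper explores several verifier and search-algorithm choices beyond random search):

```python
import torch

def search_noise(sample_fn, verifier, shape, n_candidates: int = 16):
    """Inference-time search: spend compute trying different initial noises
    instead of only adding more denoising steps (random-search variant)."""
    best, best_score = None, float("-inf")
    for _ in range(n_candidates):
        noise = torch.randn(shape)          # candidate starting noise
        sample = sample_fn(noise)           # full denoising from this noise
        score = verifier(sample)            # e.g. an aesthetic or reward model
        if score > best_score:
            best, best_score = sample, score
    return best
```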
This empirical study investigates the effective training of LLM agents for interleaved reasoning and search using reinforcement learning. The research found that incorporating format rewards is critical, general-purpose LLMs outperform reasoning-specialized models in RL settings, and the quality of the search engine during training significantly influences learned agent behaviors.
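A format reward of the kind the study finds critical simply checks that a rollout is well-formed before any correctness reward is granted. A minimal sketch; the tag names are illustrative, not the paper's exact schema:

```python
import re

def format_reward(response: str) -> float:
    """Reward well-formed interleaved traces: a reasoning block, balanced
    (optional) search calls, and a final answer block."""
    has_think = re.search(r"<think>.*?</think>", response, re.DOTALL) is not None
    has_answer = re.search(r"<answer>.*?</answer>", response, re.DOTALL) is not None
    balanced_search = response.count("<search>") == response.count("</search>")
    return 1.0 if (has_think and has_answer and balanced_search) else 0.0
```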
Google's RT-1 is a Transformer-based model trained on 130,000 real-world robotic demonstrations across 700 language-conditioned tasks, achieving a 76% success rate on unseen instructions and exhibiting strong robustness to distractors and novel environments. It effectively integrates diverse data sources, including simulation and data from other robot morphologies, with minimal performance degradation on original tasks.
1,501
Atlas introduces a set of enhancements to recurrent neural networks that lets them learn to memorize context at test time over long sequences. The model achieves state-of-the-art performance on language modeling and common-sense reasoning, processes contexts of up to 10 million tokens, and outperforms existing recurrent models and standard Transformers on extreme long-context tasks.
This research from Robotics at Google and Everyday Robots introduces SayCan, a framework that enables robots to follow complex natural language instructions by combining the high-level semantic understanding of Large Language Models (LLMs) with learned robotic affordance functions. The system achieves 74% execution success on 101 diverse real-world tasks in a mock kitchen and demonstrates that robot performance scales directly with the size of the underlying LLM.
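SayCan's skill selection multiplies two scores: the LLM's estimate that a skill's text description is useful for the instruction ("say") and a learned value function's estimate that the skill can succeed from the current state ("can"). A minimal sketch of that combination; the scoring interfaces in the comment are assumed placeholders:

```python
import math

def saycan_score(llm_logprob: float, affordance_prob: float) -> float:
    """Combined usefulness-times-feasibility score for one candidate skill."""
    return math.exp(llm_logprob) * affordance_prob

# Greedy selection at each step, then re-plan from the new state:
# best_skill = max(skills, key=lambda s: saycan_score(llm_score(instruction, s),
#                                                     value_fn(state, s)))
```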
EmbeddingGemma, a lightweight, open-source text embedding model from Google with 308M parameters, achieves state-of-the-art performance for its size on MTEB multilingual, English, and code benchmarks, often matching or exceeding much larger models and commercial APIs. It maintains strong performance even with quantized weights and truncated embeddings, demonstrating exceptional cross-lingual capabilities.
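Truncated embeddings work because the model is trained Matryoshka-style, so the leading coordinates carry most of the signal; at query time you keep a prefix and re-normalize. A minimal sketch:

```python
import numpy as np

def truncate_embedding(e: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` coordinates of a full embedding and re-normalize,
    trading a small amount of retrieval quality for memory and latency."""
    t = e[:dim]
    return t / np.linalg.norm(t)

# e.g. shrink a 768-d EmbeddingGemma vector to a 256-d index entry
# small = truncate_embedding(full_vector, 256)
```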
Google's Gemini 1.5 model family dramatically expands the context window for multimodal AI, processing millions of tokens across text, images, video, and audio, while leveraging sparse Mixture-of-Experts architecture and online distillation for improved computational efficiency. The models demonstrate near-perfect recall and enhanced reasoning over vast contexts, outperforming prior models on complex long-form tasks and showing significant productivity gains in professional applications.
ETH Zurich, KAIST, University of Washington, Rensselaer Polytechnic Institute, Google DeepMind, University of Amsterdam, University of Illinois at Urbana-Champaign, University of Cambridge, Heidelberg University, University of Waterloo, Facebook, Carnegie Mellon University, University of Southern California, Google, New York University, University of Stuttgart, UC Berkeley, National University of Singapore, University College London, University of Oxford, LMU Munich, Shanghai Jiao Tong University, University of California, Irvine, Tsinghua University, Stanford University, University of Michigan, University of Copenhagen, The Chinese University of Hong Kong, University of Melbourne, Meta, University of Edinburgh, OpenAI, The University of Texas at Austin, Cornell University, University of California, San Diego, Yonsei University, McGill University, Boston University, University of Bamberg, Nanyang Technological University, Microsoft, KU Leuven, Columbia University, UC Santa Barbara, Allen Institute for AI, German Research Center for Artificial Intelligence (DFKI), University of Pennsylvania, Johns Hopkins University, Arizona State University, University of Maryland, University of Tokyo, University of North Carolina at Chapel Hill, Hebrew University of Jerusalem, Amazon, Tilburg University, University of Massachusetts Amherst, University of Rochester, University of Duisburg-Essen, Sapienza University of Rome, University of Sheffield, Princeton University, HKUST, University of Tübingen, TU Berlin, Saarland University, Bar-Ilan University, Technical University of Darmstadt, University of Haifa, University of Trento, University of Montreal, Bilkent University, University of Cape Town, IBM, University of Mannheim, ServiceNow, Potsdam University, Polish-Japanese Academy of Information Technology, Salesforce, ASAPP, AI21 Labs, Valencia Polytechnic University
A large-scale, diverse benchmark, BIG-bench, was introduced to rigorously evaluate the capabilities and limitations of large language models across 204 tasks. The evaluation revealed that even state-of-the-art models achieve aggregate scores below 20 on a 0-100 normalized scale, well below human-expert performance.
Google researchers introduced a general sequence-to-sequence (seq2seq) architecture built from deep Long Short-Term Memory (LSTM) networks. The model delivered superior performance on large-scale English-to-French machine translation, surpassing a phrase-based statistical machine translation baseline.
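A minimal PyTorch sketch of the architecture (a modern re-implementation, not the paper's original code): two stacked LSTMs where the encoder's final states initialize the decoder, with the source sequence fed in reverse, a trick the paper found to substantially improve translation quality:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal LSTM encoder-decoder in the style of Sutskever et al. (2014):
    the encoder compresses the source sentence into a fixed-size vector
    (its final hidden/cell states), which initializes the decoder."""
    def __init__(self, src_vocab: int, tgt_vocab: int, d: int = 256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d)
        self.tgt_emb = nn.Embedding(tgt_vocab, d)
        self.encoder = nn.LSTM(d, d, num_layers=2, batch_first=True)
        self.decoder = nn.LSTM(d, d, num_layers=2, batch_first=True)
        self.out = nn.Linear(d, tgt_vocab)

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        # Feed the source REVERSED: this shortens early dependencies between
        # source and target words, which the paper found markedly helps.
        _, state = self.encoder(self.src_emb(src.flip(dims=[1])))
        dec, _ = self.decoder(self.tgt_emb(tgt), state)
        return self.out(dec)  # (batch, tgt_len, tgt_vocab) logits
```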
Google and University of Cambridge researchers introduce MASS (Multi-Agent System Search), a comprehensive framework that automatically optimizes both prompts and topologies in multi-agent systems, achieving superior performance across multiple tasks while establishing foundational design principles through a novel three-stage optimization approach.
Researchers from Google, Google DeepMind, and ETH Zurich introduced CaMeL, a system-level defense that secures Large Language Model (LLM) agents against prompt injection attacks by integrating traditional software security principles like control and data flow integrity and capabilities. This approach achieved 0 successful prompt injection attacks on the AgentDojo benchmark, significantly outperforming heuristic methods, while maintaining 77% task success.
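CaMeL's data-flow integrity amounts to tracking where each value came from and refusing to let untrusted values reach security-sensitive actions. An illustrative taint-tracking sketch in that spirit; the class and policy below are simplified assumptions, not CaMeL's actual interpreter or API:

```python
from dataclasses import dataclass, field

@dataclass
class Tainted:
    """A value paired with capability metadata recording its provenance."""
    value: str
    sources: set = field(default_factory=set)  # e.g. {"user"} or {"web"}

def send_email(to: Tainted, body: Tainted) -> None:
    # Policy: arguments derived from untrusted tool output (e.g. a fetched
    # web page) must never control a side-effecting action.
    for arg in (to, body):
        if "web" in arg.sources:
            raise PermissionError("untrusted data may not flow into send_email")
    print(f"sending email to {to.value}")
```

Because the policy is enforced outside the LLM, a prompt injection that manipulates the model's output still cannot route attacker-derived values into protected tools.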