
Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.



1–15 of 10,633 publications
    FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration
    Diganta Misra
    Yanqi Luo
    Anjali Sridhar
    Justine Gehring
    Silvio Soares Ribeiro Junior
    2026
    Abstract: AI coding assistants are rapidly becoming integral to modern software development. A key challenge in this space is the continual need to migrate and modernize codebases in response to evolving software ecosystems. Traditionally, such migrations have relied on rule-based systems and human intervention. With the advent of powerful large language models (LLMs), AI-driven agentic frameworks offer a promising alternative, but their effectiveness remains underexplored. In this paper, we introduce FreshBrew, a novel benchmark for evaluating AI-based agentic frameworks on project-level Java migrations. We benchmark several such frameworks, powered by state-of-the-art LLMs, and compare their performance against established rule-based tools. Our evaluation of AI agents on this benchmark of 228 repositories shows that the top-performing model, Gemini 2.5 Flash, can successfully migrate 56.5% of projects to JDK 17. Our empirical analysis reveals the critical strengths and limitations of current agentic approaches, offering actionable insights into their real-world applicability. By releasing FreshBrew publicly upon acceptance, we aim to facilitate rigorous, reproducible evaluation and catalyze progress in AI-driven codebase modernization.
    Study of Arterials in the City of Rio de Janeiro for Traffic Coordination
    Ori Rottenstreich
    Eliav Buchnik
    Danny Veikherman
    Dan Karliner
    Tom Kalvari
    Shai Ferster
    Ron Tsibulsky
    Jack Haddad
    2025
    Abstract: Urban traffic congestion is a growing challenge, and optimizing signal timing strategies is crucial for improving traffic flow and reducing emissions. The coordination of signalized intersections improves both traffic operations and environmental aspects. Coordination is particularly important along arterials: sequences of signalized intersections that serve as primary routes and carry a high volume of traffic. In this paper we analyze real data from the city of Rio de Janeiro to study properties of arterials. We examine their length, the distance between intersections, and properties of the traffic-light plans such as cycle time. We then study their in-practice level of coordination in terms of the number of stops and their common locations along the arterials. We dive into particular arterials and provide insights that can be useful for the efficient design of arterials in additional cities. Based on this analysis, we show how simple traffic properties can indicate the potential of coordinating two adjacent intersections as part of an arterial to improve traffic performance.
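The coordination potential described above has a textbook back-of-the-envelope form that is worth making concrete (this is an illustrative green-wave calculation under a constant-speed, one-way assumption, not the paper's method): the ideal offset between two coordinated signals is the platoon's travel time between them, taken modulo the shared cycle time.

```python
def green_wave_offset(distance_m: float, speed_mps: float, cycle_s: float) -> float:
    """Ideal downstream green start (seconds after the upstream green)
    so a platoon released at the upstream signal arrives on green.
    Assumes constant travel speed and a shared cycle time."""
    travel_time = distance_m / speed_mps
    # Offsets repeat every cycle, so only the remainder matters.
    return travel_time % cycle_s

# Two intersections 900 m apart, 10 m/s travel speed, 60 s cycle:
# travel time is 90 s, so the ideal offset is 30 s into the cycle.
```

With real arterial data, the inputs would come from measured distances and observed travel speeds rather than assumed constants.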
    Security Signals: Making Web Security Posture Measurable At Scale
    David Dworken
    Artur Janc
    Santiago (Sal) Díaz
    Workshop on Measurements, Attacks, and Defenses for the Web (MADWeb)
    Abstract: The area of security measurability is gaining increased attention, with a wide range of organizations calling for the development of scalable approaches for assessing the security of software systems and infrastructure. In this paper, we present our experience developing Security Signals, a comprehensive system providing security measurability for web services, deployed in a complex application ecosystem of thousands of web services handling traffic from billions of users. The system collects security-relevant information from production HTTP traffic at the reverse proxy layer, utilizing novel concepts such as synthetic signals augmented with additional risk information to provide a holistic view of the security posture of individual services and the broader application ecosystem. This approach to measurability has enabled large-scale security improvements to our services, including prioritized rollouts of security enhancements and the implementation of automated regression monitoring. Furthermore, it has proven valuable for security research and prioritization of defensive work. Security Signals addresses shortcomings of prior web measurability proposals by tracking a comprehensive set of security properties relevant to web applications, and by extracting insights from collected data for use by both security experts and non-experts. We believe the lessons learned from the implementation and use of Security Signals offer valuable insights for practitioners responsible for web service security, potentially inspiring new approaches to web security measurability.
    Abstract: Electrocardiograms (ECGs) are fundamental to cardiac diagnostics, providing noninvasive insights into cardiovascular conditions. Recent advancements in deep learning have led to foundation models (FMs) capable of learning powerful representations of ECG signals. However, these models often fail to fully exploit the periodic nature and diagnostic frequency bands of ECGs, leading to inefficiencies in computational cost and interpretability. We propose a novel ECG foundation model that learns nested embeddings, where each subset of dimensions encodes progressively higher-frequency information. By explicitly modeling frequency structures and applying a correlation penalty, the method achieves compact, high-rank representations that reduce model size without sacrificing performance. We evaluate our approach on two large-scale datasets for embedding redundancy and prediction performance on downstream clinical tasks such as arrhythmia classification and cardiac condition detection. We observe similar AUROC prediction performance with lower embedding redundancy, offering a computationally efficient and interpretable framework for ECG analysis. Finally, the representations obtained from our model on UK Biobank data capture known cardiovascular variants and detect novel loci, which can be applied to drug discovery.
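The core idea of nested embeddings, where any prefix of the vector must work as a standalone coarser representation, can be sketched as a training objective (the prefix sizes, equal weighting, and function names here are illustrative assumptions, not the paper's formulation):

```python
def nested_objective(embedding, target, loss_fn, prefix_dims=(16, 32, 64)):
    """Sum a task loss over progressively larger prefixes of a single
    embedding, so the first d dimensions must carry useful signal on
    their own before later dimensions add finer (e.g. higher-frequency)
    detail. prefix_dims and the uniform weights are illustrative."""
    return sum(loss_fn(embedding[:d], target) for d in prefix_dims)
```

Optimizing such an objective pressures the model to order information by importance across dimensions, which is what allows truncating the embedding to shrink the model without retraining.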
    Reasoning-SQL: Reinforcement Learning with Partial Rewards for Reasoning-Enhanced Text-to-SQL
    Mohammadreza Pourreza
    Shayan Talaei
    Hailong Li
    Azalia Mirhoseini
    Amin Saberi
    Conference on Language Modeling (COLM) (2025) (to appear)
    Abstract: Text-to-SQL is a challenging task involving multiple reasoning-intensive subtasks, including natural language understanding, database schema comprehension, and precise SQL query formulation. Existing approaches often rely on handcrafted reasoning paths with inductive biases that can limit their overall effectiveness. Motivated by the recent success of reasoning-enhanced models such as DeepSeek R1 and OpenAI o1, which effectively leverage reward-driven self-exploration to enhance reasoning capabilities and generalization, we propose a novel set of partial rewards tailored specifically for the Text-to-SQL task. Our reward set includes schema linking, AI feedback, n-gram similarity, and syntax checking, explicitly designed to address the reward-sparsity issue prevalent in reinforcement learning (RL). Leveraging group relative policy optimization (GRPO), our approach explicitly encourages large language models (LLMs) to develop the intrinsic reasoning skills necessary for accurate SQL query generation. With models of different sizes, we demonstrate that RL-only training with our proposed rewards consistently achieves higher accuracy and superior generalization compared to supervised fine-tuning (SFT). Remarkably, our RL-trained 14B-parameter model significantly outperforms larger proprietary models, e.g., o3-mini by 4% and Gemini-1.5-Pro-002 by 3% on the BIRD benchmark. These results highlight the efficacy of our proposed RL-training framework with partial rewards for enhancing both accuracy and reasoning capabilities in Text-to-SQL tasks.
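A toy sketch of how such partial rewards might be combined into one scalar (the heuristics, weights, and the `partial_reward` function are our illustrative assumptions, not the paper's implementation; a real system would use a proper SQL parser and a learned AI-feedback signal):

```python
import re

def partial_reward(pred_sql: str, gold_sql: str, gold_tables: set) -> float:
    """Combine crude stand-ins for three of the paper's reward ideas:
    syntax check, schema linking, and n-gram similarity."""
    r = 0.0
    # Syntax check: crude proxy (a real reward would parse the SQL).
    if pred_sql.strip().upper().startswith("SELECT"):
        r += 0.2
    tokens = set(re.findall(r"\w+", pred_sql.lower()))
    # Schema linking: fraction of gold tables mentioned in the prediction.
    if gold_tables:
        r += 0.4 * len(gold_tables & tokens) / len(gold_tables)
    # N-gram similarity: unigram overlap with the gold query.
    gold_tokens = set(re.findall(r"\w+", gold_sql.lower()))
    if gold_tokens:
        r += 0.4 * len(gold_tokens & tokens) / len(gold_tokens)
    return r
```

The point of such dense partial credit is that a query which links the right tables but mis-forms the SELECT still receives a gradient signal, instead of the all-or-nothing reward of execution accuracy alone.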
    Test-Time Visual In-Context Tuning
    Jiahao Xie
    Nathalie Rauschmayr
    Bernt Schiele
    2025
    Abstract: Visual in-context learning (VICL), as a new paradigm in computer vision, allows the model to rapidly adapt to various tasks with only a handful of prompts and examples. While effective, the existing VICL paradigm exhibits poor generalizability under distribution shifts. In this work, we propose test-time visual in-context tuning (VICT), a method that can learn adaptive VICL models on the fly with a single test sample. Specifically, we flip the roles of the task prompts and the test sample and use a cycle-consistency loss to reconstruct the original task prompt output. Our key insight is that a model should be aware of a new test distribution if it can successfully recover the original task prompts. Extensive experiments on seven representative vision tasks with 15 corruptions demonstrate that VICT can improve the generalizability of VICL to unseen new domains.
    Abstract: Building on the linear programming approach to competitive equilibrium pricing, we develop a general method for constructing iterative auctions that achieve Vickrey-Clarke-Groves (VCG) outcomes. We show how to transform a linear program characterizing competitive equilibrium prices into one that characterizes universal competitive equilibrium (UCE) prices, which elicit precisely the information needed to compute VCG payments. By applying a primal-dual algorithm to these transformed programs, we derive iterative auctions that maintain a single price path, eliminating the overhead and incentive problems associated with multiple price paths used solely for payment calculations. We demonstrate the versatility of our method by developing novel UCE auctions for multi-unit settings and deriving an iterative UCE variant of the Product-Mix auction. The resulting auctions combine the transparency of iterative price discovery with the efficiency and incentive properties of the VCG mechanism.
    Abstract: Differential privacy can be achieved in a distributed manner, where multiple parties add independent noise such that their sum protects the overall dataset with differential privacy. A common technique here is for each party to sample their noise from the decomposition of an infinitely divisible distribution. We introduce two novel mechanisms in this setting: 1) the generalized discrete Laplace (GDL) mechanism, whose distribution (which is closed under summation) follows from differences of i.i.d. negative binomial shares, and 2) the multi-scale discrete Laplace (MSDLap) mechanism, which follows the sum of multiple i.i.d. discrete Laplace shares at different scales. The mechanisms can be parameterized to have O(Δ^3 e^{-ε}) and O(min(Δ^3 e^{-ε}, Δ^2 e^{-2ε/3})) MSE, respectively, where the latter bound matches known optimality results. Furthermore, the MSDLap mechanism has the optimal MSE, including constants, as ε → ∞. We also show a transformation from the discrete setting to the continuous setting, which allows us to transform both mechanisms to the continuous setting and thereby achieve the optimal O(Δ^2 e^{-2ε/3}) MSE. To our knowledge, these are the first infinitely divisible additive-noise mechanisms that achieve order-optimal MSE under pure differential privacy in either the discrete or the continuous setting, so our work shows formally that there is no separation in utility when query-independent noise-adding mechanisms are restricted to infinitely divisible noise. For the continuous setting, our result improves upon Pagh and Stausholm's Arete distribution, which gives an MSE of O(Δ^2 e^{-ε/4}) [35]. We apply our results to improve a state-of-the-art multi-message shuffle DP protocol from [3] in the high-ε regime.
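The distributed-noise construction underlying the GDL mechanism can be sketched in a few lines (a toy illustration only: the parameters below are not calibrated to any (ε, Δ) privacy target, and the function names are ours, not the paper's). Each party draws a difference of two i.i.d. negative binomial variables, so the shares sum to a symmetric, discrete-Laplace-like distribution that is closed under summation.

```python
import random

def neg_binomial(r: int, p: float, rng: random.Random) -> int:
    """Sample NB(r, p): failures before the r-th success, built as a
    sum of r geometric variables (valid for integer r)."""
    total = 0
    for _ in range(r):
        k = 0
        while rng.random() >= p:  # count failures before one success
            k += 1
        total += k
    return total

def party_share(r: int, p: float, rng: random.Random) -> int:
    """One party's noise share: a difference of two i.i.d. negative
    binomials, giving a symmetric integer-valued share."""
    return neg_binomial(r, p, rng) - neg_binomial(r, p, rng)

def aggregate_noise(n_parties: int, r: int, p: float, seed: int = 0) -> int:
    """Sum the parties' shares; infinite divisibility means the sum has
    the same family of distribution regardless of n_parties."""
    rng = random.Random(seed)
    return sum(party_share(r, p, rng) for _ in range(n_parties))
```

In a real deployment each party would hold its own randomness source; the single seeded generator here is just to keep the sketch deterministic and testable.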
    Abstract: Virtual Reality headsets isolate users from the real world by restricting their perception to the virtual world. Video See-Through (VST) headsets address this by utilizing world-facing cameras to create Augmented Reality experiences. However, directly displaying camera feeds can cause visual discomfort and cybersickness due to inaccurate perception of scale and exaggerated motion parallax. This paper presents initial findings on the potential of geometry-aware passthrough systems to mitigate cybersickness through enhanced depth perception. We introduce a promising protocol for quantitatively measuring the cybersickness experienced by users of VST headsets. Using this protocol, we conduct a user study to compare direct passthrough and geometry-aware passthrough systems. To the best of our knowledge, our study is the first to reveal reduced nausea, disorientation, and total cybersickness scores with geometry-aware passthrough. It also uncovers several potential avenues to further mitigate visually induced discomfort.
    Abstract: Text-to-image generative models have demonstrated great performance in generating realistic images. These generations are assumed to reflect a deep understanding of visual scenes. One interesting question is whether these models possess the zero/few-shot generalization capabilities known from humans. For example, a human can see an example of a new object and a word associated with it, then use their knowledge in a highly general way to recognize or imagine this novel object in a completely different setting or context. In this work, we test the hypothesis that text-to-image models learn familiar objects better than novel objects. We use prompt-tuning methods to learn novel concepts while keeping the text-to-image models fixed, prompt-tune the models to learn familiar concepts as well, and evaluate the generalization ability for novel objects compared to familiar objects by running generation in different contexts/environments. In addition, instead of initializing the embedding vectors with similar concepts, we use randomly initialized embedding vectors for both familiar and novel objects. Our human-survey evaluation results demonstrate that in some settings text-to-image models learn familiar objects better than novel objects.
    Software development is a team sport
    Jie Chen
    Alison Chang
    Rayven Plaza
    Marie Huber
    Claire Taylor
    IEEE Software (2025)
    Abstract: In this article, we describe our human-centered research focused on understanding the role of collaboration and teamwork in productive software development. We describe the creation of a logs-based metric to identify collaboration through observable events and a survey-based multi-item scale to assess team functioning.
    Linear-Time Multilevel Graph Partitioning via Edge Sparsification
    Peter Sanders
    Dominik Rosch
    Nikolai Maas
    Lars Gottesbüren
    Daniel Seemaier
    2025
    Abstract: The current landscape of balanced graph partitioning is divided into high-quality but expensive multilevel algorithms and cheaper approaches with linear running time, such as single-level algorithms and streaming algorithms. We demonstrate how to achieve the best of both worlds with a linear-time multilevel algorithm. Multilevel algorithms construct a hierarchy of increasingly smaller graphs by repeatedly contracting clusters of nodes. Our approach preserves their distinct advantage, allowing refinement of the partition over multiple levels with increasing detail. At the same time, we use edge sparsification to guarantee geometric size reduction between the levels and thus linear running time. We provide a proof of the linear running time as well as additional insights into the behavior of multilevel algorithms, showing that graphs with low modularity are most likely to trigger worst-case running time. We evaluate multiple approaches for edge sparsification and integrate our algorithm into the state-of-the-art multilevel partitioner KaMinPar, maintaining its excellent parallel scalability. As demonstrated in detailed experiments, this results in a 1.49x average speedup (up to 4x for some instances) with only 1% loss in solution quality. Moreover, our algorithm clearly outperforms state-of-the-art single-level and streaming approaches.
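The edge-sparsification step can be sketched minimally as follows (our illustrative assumptions: a heaviest-edge rating and a linear cap of `density_cap * num_nodes` edges; the paper evaluates several sparsification approaches). Capping the edge count linearly in the shrinking node count is what makes the total work across all hierarchy levels a geometric series, and hence linear overall.

```python
def sparsify(num_nodes, edges, density_cap=2.0):
    """Keep at most density_cap * num_nodes of the heaviest edges.

    edges: list of (u, v, weight) tuples for the current level's graph.
    Returns the retained edge list, unchanged if already under the cap.
    """
    limit = int(density_cap * num_nodes)
    if len(edges) <= limit:
        return edges
    # Heaviest-first is one plausible rating of edge importance.
    return sorted(edges, key=lambda e: e[2], reverse=True)[:limit]
```

In a real multilevel partitioner this would run after each contraction round, so every coarser level's edge set stays proportional to its node count.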
    On learnability of distribution classes with adaptive adversaries
    Tosca Lechner
    Alex Bie
    Gautam Kamath
    ICML 2025
    Abstract: We consider the question of learnability of distribution classes in the presence of adaptive adversaries, that is, adversaries capable of inspecting the whole sample requested by a learner and applying their manipulations before passing it on to the learner. We formulate a general notion of learnability with respect to adaptive adversaries, taking into account the budget of the adversary. We show that learnability with respect to additive adaptive adversaries is a strictly stronger condition than learnability with respect to additive oblivious adversaries.
    Synthesizing and Adapting Error Correction Data for Mobile Large Language Model Applications
    Yanxiang Zhang
    Zheng Xu
    Yuanbo Zhang
    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track) (2025)
    Abstract: Error correction is an important capability when applying large language models (LLMs) to facilitate user typing on mobile devices. In this paper, we use LLMs to synthesize a high-quality dataset of error correction pairs to evaluate and improve LLMs for mobile applications. We first prompt LLMs with error correction domain knowledge to build a scalable and reliable addition to the existing data synthesis pipeline. We then adapt the synthetic data distribution to match the mobile application domain by reweighting the samples. The reweighting model is learnt by predicting (a handful of) live A/B test metrics when deploying LLMs in production, given the LLM performance on offline evaluation data and scores from a small privacy-preserving on-device language model. Finally, we present best practices for mixing our synthetic data with other data sources to improve model performance on error correction in both offline evaluation and production live A/B testing.
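The reweighting step can be sketched as score-based weighting of the synthetic samples (an illustrative stand-in: the softmax-with-temperature form and the `score_fn` placeholder, which stands in for signals like the small on-device language model's scores, are our assumptions, not the paper's learned reweighting model):

```python
import math

def reweight(samples, score_fn, temperature=1.0):
    """Assign each synthetic sample a normalized weight from a scoring
    function, so higher-scoring samples count more when adapting the
    synthetic distribution toward the target domain."""
    scores = [score_fn(s) / temperature for s in samples]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

The resulting weights could then be used either for weighted loss terms during fine-tuning or for resampling the synthetic pool before mixing it with other data sources.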