Publications
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.
Sort By
1 - 15 of 10064 publications
Batch Calibration: Rethinking Calibration For In-Context Learning And Prompt Engineering
Lev Proleev
International Conference on Learning Representations (ICLR) (2024)
Preview abstract
Prompting and in-context learning (ICL) have become efficient learning paradigms for large language models (LLMs). However, LLMs suffer from prompt brittleness and various bias factors in the prompt, including but not limited to the formatting, the choice verbalizers, and the ICL examples. To address this problem that results in unexpected performance degradation, calibration methods have been developed to mitigate the effects of these biases while recovering LLM performance. In this work, we first conduct a systematic analysis of the existing calibration methods, where we both provide a unified view and reveal the failure cases. Inspired by these analyses, we propose Batch Calibration (BC), a simple yet intuitive method that controls the contextual bias from the batched input, unifies various prior approaches, and effectively addresses the aforementioned issues. BC is zero-shot, inference-only, and incurs negligible additional costs. In the few-shot setup, we further extend BC to allow it to learn the contextual bias from labeled data. We validate the effectiveness of BC with PaLM 2-(S, M, L) and CLIP models and demonstrate state-of-the-art performance over previous calibration baselines across more than 10 natural language understanding and image classification tasks.
View details
Towards Conversational Diagnostic AI
Anil Palepu
Khaled Saab
Jan Freyberg
Ryutaro Tanno
Amy Wang
Brenna Li
Nenad Tomašev
Karan Singhal
Le Hou
Albert Webson
Kavita Kulkarni
Sara Mahdavi
Juro Gottweis
Joelle Barral
Kat Chou
Arxiv (2024) (to appear)
Preview abstract
At the heart of medicine lies the physician-patient dialogue, where skillful history-taking paves the way for accurate diagnosis, effective management, and enduring trust. Artificial Intelligence (AI) systems capable of diagnostic dialogue could increase accessibility, consistency, and quality of care. However, approximating clinicians' expertise is an outstanding grand challenge. Here, we introduce AMIE (Articulate Medical Intelligence Explorer), a Large Language Model (LLM) based AI system optimized for diagnostic dialogue.
AMIE uses a novel self-play based simulated environment with automated feedback mechanisms for scaling learning across diverse disease conditions, specialties, and contexts. We designed a framework for evaluating clinically-meaningful axes of performance including history-taking, diagnostic accuracy, management reasoning, communication skills, and empathy. We compared AMIE's performance to that of primary care physicians (PCPs) in a randomized, double-blind crossover study of text-based consultations with validated patient actors in the style of an Objective Structured Clinical Examination (OSCE). The study included 149 case scenarios from clinical providers in Canada, the UK, and India, 20 PCPs for comparison with AMIE, and evaluations by specialist physicians and patient actors. AMIE demonstrated greater diagnostic accuracy and superior performance on 28 of 32 axes according to specialist physicians and 24 of 26 axes according to patient actors. Our research has several limitations and should be interpreted with appropriate caution. Clinicians were limited to unfamiliar synchronous text-chat which permits large-scale LLM-patient interactions but is not representative of usual clinical practice. While further research is required before AMIE could be translated to real-world settings, the results represent a milestone towards conversational diagnostic AI.
View details
From Provenance to Aberrations: Image Creator and Screen Reader User Perspectives on Alt Text for AI-Generated Images
Maitraye Das
Alexander J. Fiannaca
CHI Conference on Human Factors in Computing Systems (2024)
Preview abstract
AI-generated images are proliferating as a new visual medium. However, state-of-the-art image generation models do not output alternative (alt) text with
their images, rendering them largely inaccessible to screen reader users (SRUs). Moreover, less is known about what information would be most desirable
to SRUs in this new medium. To address this, we invited AI image creators and SRUs to evaluate alt text prepared from various sources and write their own
alt text for AI images. Our mixed-methods analysis makes three contributions. First, we highlight creators’ perspectives on alt text, as creators are well-positioned
to write descriptions of their images. Second, we illustrate SRUs’ alt text needs particular to the emerging medium of AI images. Finally, we discuss the
promises and pitfalls of utilizing text prompts written as input for AI models in alt text generation, and areas where broader digital accessibility guidelines
could expand to account for AI images.
View details
SAC125 - SSAC Report on Registrar Nameserver Management
Gautam Akiwate
Tim April
kc claffy
Internet Corporation for Assigned Names and Numbers (ICANN), ICANN Security and Stability Advisory Committee (SSAC) Reports and Advisories (2024), pp. 56
Preview abstract
During domain registration, a minimum of two nameservers are typically required, and this
remains a requirement for any future updates to the domain. Often, domains are delegated to
nameservers that are subordinate to some other domains, creating inter-domain dependencies.
This network of dependencies creates a scenario where the functionality of a domain depends
on the operational status of another domain. This setup lacks contractual or procedural
safeguards against disruption or misuse, especially when the nameserver parent domain expires.
Most registries forbid deleting an expired domain if other domains depend on it for name
resolution. These constraints aim to prevent disruptions in DNS resolution for the dependent
domains. However, this also means that the expired domain remains in a liminal state, neither
fully operational nor completely removed. When registrars cannot delete expired domains with
dependents, they are forced to bear the burden of sponsoring the domain without remuneration
from the registrant. A peer-reviewed study, "Risky BIZness: Risks derived from Registrar Name
Management," observed that some registrars have found and utilized a loophole to these
constraints by renaming the host objects that are subordinate to the expiring domain.1 Once
renamed, the host objects are what Akiwate et al.—and subsequently the SSAC—refers to as
sacrificial nameservers.
This report focuses on a specific type of sacrificial nameserver where the parent domains of the renamed host objects are considered to be unsafe because they are registrable. Registrable
parent domains of sacrificial nameservers introduce a new attack surface for domain resolution
hijacking, as malicious actors can exploit unsafe sacrificial nameservers to gain unauthorized
control over the dependent domains, leading to manipulation or disruption. Unlike traditional
domain hijacking techniques that exploit compromised account credentials or manipulate the
resolution protocol, this report focuses on this unforeseen risk arising from a longstanding
practice employed by some registrars.
In this report, the SSAC explores potential solutions to remediate exposed domains and prevent
the creation of new unsafe sacrificial nameservers. The SSAC examines each proposed solution for its feasibility, effectiveness, and potential to reduce the attack surface without introducing undue complexity or new vulnerabilities into the DNS ecosystem.
View details
Context-aware Transliteration of Romanized South Asian Languages
Christo Kirov
Computational Linguistics, 50 (2) (2024), 475–534
Preview abstract
While most transliteration research is focused on single tokens such as named entities -- e.g., transliteration of "અમદાવાદ" from the Gujarati script to the Latin script "Ahmedabad" -- the informal romanization prevalent in South Asia and elsewhere often requires transliteration of full sentences. The lack of large parallel text collections of full sentence (as opposed to single word) transliterations necessitates incorporation of contextual information into transliteration via non-parallel resources, such as via mono-script text collections. In this paper, we present a number of methods for improving transliteration in context for such a use scenario. Some of these methods in fact improve performance without making use of sentential context, allowing for better quantification of the degree to which contextual information in particular is responsible for system improvements. Our final systems, which ultimately rely upon ensembles including large pretrained language models finetuned on simulated parallel data, yield substantial improvements over the best previously reported results for full sentence transliteration from Latin to native script on all 12 languages in the Dakshina dataset (Roark et al. 2020), with an overall 4.8% absolute (27.1% relative) mean word-error rate reduction.
View details
Preview abstract
Proving mathematical theorems at the olympiad level represents a notable milestone in human-level automated reasoning, owing to their reputed difficulty among the world’s best talents in pre-university mathematics. Current machine-learning approaches, however, are not applicable to most mathematical domains owing to the high cost of translating human proofs into machine-verifiable format. The problem is even worse for geometry because of its unique translation challenges, resulting in severe scarcity of training data. We propose AlphaGeometry, a theorem prover for Euclidean plane geometry that sidesteps the need for human demonstrations by synthesizing millions of theorems and proofs across different levels of complexity. AlphaGeometry is a neuro-symbolic system that uses a neural language model, trained from scratch on our large-scale synthetic data, to guide a symbolic deduction engine through infinite branching points in challenging problems. On a test set of 30 latest olympiad-level problems, AlphaGeometry solves 25, outperforming the previous best method that only solves ten problems and approaching the performance of an average International Mathematical Olympiad (IMO) gold medallist. Notably, AlphaGeometry produces human-readable proofs, solves all geometry problems in the IMO 2000 and 2015 under human expert evaluation and discovers a generalized version of a translated IMO theorem in 2004.
View details
Seeking in Cycles: How Users Leverage Personal Information Ecosystems to Find Mental Health Information
Ashlee Milton
Fernando Maestre
Rebecca Umbach
Stevie Chancellor
Proceedings of the CHI Conference on Human Factors in Computing Systems (2024)
Preview abstract
Information is crucial to how people understand their mental health and well-being, and many turn to online sources found through search engines and social media. We present the findings from an interview study (n = 17) of participants who use online platforms to seek information about their mental illnesses. We found that participants leveraged multiple platforms in a cyclical process for finding information from their personal information ecosystems, driven by the adoption of new information and uncertainty surrounding the credibility of information. Concerns about privacy, fueled by perceptions of stigma and platform design, also influenced their information-seeking decisions. Our work proposes theoretical implications for social computing and information retrieval on information seeking in users' personal information ecosystems. We also offer design implications to support users in navigating their personal information ecosystems to find mental health information.
View details
Specifying BGP using TLA+
Aman Shaikh
(2024)
Preview abstract
This presentation is about the TLA+ specification we have written for BGP, the routing protocol underpinning the Internet. The specification also serves as a crucial first-step towards the use of TLA+ for verification of network designs.
View details
Experiencing InstructPipe: Building Multi-modal AI Pipelines via Prompting LLMs and Visual Programming
Zhongyi Zhou
Jing Jin
Xiuxiu Yuan
Jun Jiang
Jingtao Zhou
Yiyi Huang
Kristen Wright
Jason Mayes
Mark Sherwood
Ram Iyengar
Na Li
Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems, ACM, pp. 5
Preview abstract
Foundational multi-modal models have democratized AI access, yet the construction of complex, customizable machine learning pipelines by novice users remains a grand challenge. This paper demonstrates a visual programming system that allows novices to rapidly prototype multimodal AI pipelines. We first conducted a formative study with 58 contributors and collected 236 proposals of multimodal AI pipelines that served various practical needs. We then distilled our findings into a design matrix of primitive nodes for prototyping multimodal AI visual programming pipelines, and implemented a system with 65 nodes. To support users' rapid prototyping experience, we built InstructPipe, an AI assistant based on large language models (LLMs) that allows users to generate a pipeline by writing text-based instructions. We believe InstructPipe enhances novice users onboarding experience of visual programming and the controllability of LLMs by offering non-experts a platform to easily update the generation.
View details
Websites Need Your Permission Too – User Sentiment and Decision Making on Web Permission Prompts in Desktop Chrome
Marian Harbach
CHI 2024, ACM (to appear)
Preview abstract
The web utilizes permission prompts to moderate access to certain capabilities. We present the first investigation of user behavior and sentiment of this security and privacy measure on the web, using 28 days of telemetry data from more than 100M Chrome installations on desktop platforms and experience sampling responses from 25,706 Chrome users. Based on this data, we find that ignoring and dismissing permission prompts are most common for geolocation and notifications. Permission prompts are perceived as more annoying and interrupting when they are not allowed, and most respondents cite a rational reason for the decision they took. Our data also supports that the perceived availability of contextual information from the requesting website is associated with allowing access to a requested capability. More usable permission controls could facilitate adoption of best practices that address several of the identified challenges; and ultimately could lead to better user experiences and a safer web.
View details
V2Meow: Meowing to the Visual Beat via Video-to-Music Generation
Chris Donahue
Dima Kuzmin
Judith Li
Kun Su
Mauro Verzetti
Qingqing Huang
Yu Wang
Vol. 38 No. 5: AAAI-24 Technical Tracks 5, AAAI Press (2024), pp. 4952-4960
Preview abstract
Video-to-music generation demands both a temporally localized high-quality listening experience and globally aligned video-acoustic signatures. While recent music generation models excel at the former through advanced audio codecs, the exploration of video-acoustic signatures has been confined to specific visual scenarios. In contrast, our research confronts the challenge of learning globally aligned signatures between video and music directly from paired music and videos, without explicitly modeling domain-specific rhythmic or semantic relationships. We propose V2Meow, a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types using a multi-stage autoregressive model. Trained on 5k hours of music audio clips paired with video frames mined from in-the-wild music videos, V2Meow is competitive with previous domain-specific models when evaluated in a zero-shot manner. It synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained general purpose visual features extracted from video frames, with optional style control via text prompts. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms various existing music generation systems in terms of visual-audio correspondence and audio quality. Music samples are available at tinyurl.com/v2meow.
View details
Characterizing a Memory Allocator at Warehouse Scale
Zhuangzhuang Zhou
Nilay Vaish
Patrick Xia
Christina Delimitrou
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, Association for Computing Machinery, La Jolla, CA, USA (2024), 192–206
Preview abstract
Memory allocation constitutes a substantial component of warehouse-scale computation. Optimizing the memory allocator not only reduces the datacenter tax, but also improves application performance, leading to significant cost savings.
We present the first comprehensive characterization study of TCMalloc, a warehouse-scale memory allocator used in our production fleet. Our characterization reveals a profound diversity in the memory allocation patterns, allocated object sizes and lifetimes, for large-scale datacenter workloads, as well as in their performance on heterogeneous hardware platforms. Based on these insights, we redesign TCMalloc for warehouse-scale environments. Specifically, we propose optimizations for each level of its cache hierarchy that include usage-based dynamic sizing of allocator caches, leveraging hardware topology to mitigate inter-core communication overhead, and improving allocation packing algorithms based on statistical data. We evaluate these design choices using benchmarks and fleet-wide A/B experiments in our production fleet, resulting in a 1.4% improvement in throughput and a 3.4% reduction in RAM usage for the entire fleet. At our scale, even a single percent CPU or memory improvement translates to significant savings in server costs.
View details
Learning from straggler clients in federated learning
Ehsan Amid
Rohan Anil
Arxiv (2024) (to appear)
Preview abstract
How well do existing federated learning algorithms learn from client devices that return model updates with a significant time delay? Is it even possible to learn effectively from clients that report back minutes, hours, or days after being scheduled? We answer these questions by developing Monte Carlo simulations of client latency that are guided by real-world applications. We compare well-known synchronous optimization algorithms like FedAvg and FedAdam with the state-of-the-art asynchronous FedBuff algorithm, and discover that these existing approaches often struggle to learn from severely delayed clients. To improve upon these, we experiment with modifications including distillation regularization and exponential moving averages of model weights. Finally, we invent two new algorithms, FARe-DUST and FeAST-on-MSG, based on distillation and averaging, respectively. Experiments with the EMNIST, CIFAR-100, and StackOverflow benchmark federated learning tasks demonstrate that our new algorithms outperform existing ones in terms of accuracy for straggler clients, while also providing better trade-offs between training time and total accuracy.
View details
Preview abstract
Given copies of a quantum state $\rho$, a shadow tomography protocol aims to learn all expectation values from a fixed set of observables, to within a given precision $\epsilon$. We say that a shadow tomography protocol is \textit{triply efficient} if it is sample- and time-efficient, and only employs measurements that entangle a constant number of copies of $\rho$ at a time. The classical shadows protocol based on random single-copy measurements is triply efficient for the set of local Pauli observables. This and other protocols based on random single-copy Clifford measurements can be understood as arising from fractional colorings of a graph $G$ that encodes the commutation structure of the set of observables. Here we describe a framework for two-copy shadow tomography that uses an initial round of Bell measurements to reduce to a fractional coloring problem in an induced subgraph of $G$ with bounded clique number. This coloring problem can be addressed using techniques from graph theory known as \textit{chi-boundedness}. Using this framework we give the first triply efficient shadow tomography scheme for the set of local fermionic observables, which arise in a broad class of interacting fermionic systems in physics and chemistry. We also give a triply efficient scheme for the set of all $n$-qubit Pauli observables. Our protocols for these tasks use two-copy measurements, which is necessary: sample-efficient schemes are provably impossible using only single-copy measurements. Finally, we give a shadow tomography protocol that compresses an $n$-qubit quantum state into a $\poly(n)$-sized classical representation, from which one can extract the expected value of any of the $4^n$ Pauli observables in $\poly(n)$ time, up to a small constant error.
View details
Securing the AI Software Supply Chain
Isaac Hepworth
Kara Olive
Kingshuk Dasgupta
Michael Le
Mark Lodato
Mihai Maruseac
Sarah Meiklejohn
Shamik Chaudhuri
Tehila Minkus
Google, Google, 1600 Amphitheatre Parkway, Mountain View, CA, 94043 (2024)
Preview abstract
As AI-powered features gain traction in software applications, we see many of the same problems we’ve faced with traditional software—but at an accelerated pace. The threat landscape continues to expand as AI is further integrated into everyday products, so we can expect more attacks. Given the expense of building models, there is a clear need for supply chain solutions.
This paper explains our approach to securing our AI supply chain using provenance information and provides guidance for other organizations. Although there are differences between traditional and AI development processes and risks, we can build on our work over the past decade using Binary Authorization for Borg (BAB), Supply-chain Levels for Software Artifacts (SLSA), and next-generation cryptographic signing solutions via Sigstore, and adapt these to the AI supply chain without reinventing the wheel. Depending on internal processes and platforms, each organization’s approach to AI supply chain security will look different, but the focus should be on areas where it can be improved in a relatively short time.
Readers should note that the first part of this paper provides a broad overview of “Development lifecycles for traditional and AI software”. Then we delve specifically into AI supply chain risks, and explain our approach to securing our AI supply chain using provenance information. More advanced practitioners may prefer to go directly to the sections on “AI supply chain risks,” “Controls for AI supply chain security,” or even the “Guidance for practitioners” section at the end of the paper, which can be adapted to the needs of any organization.
View details