Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 10015 publications
    An artificial neural network to estimate the foliar and ground cover input variables of the Rangeland Hydrology and Erosion Model
    Mahmoud Saeedimoghaddam
    David Goodrich
    Mariano Hernandez
    David Phillip Guertin
    Loretta J. Metz
    Guillermo Ponce-Campos
    Haiyan Wei
    Shea Burns
    Sarah E. McCord
    Mark A. Nearing
    C. Jason Williams
    Carrie-Ann Houdeshell
    Mashrekur Rahman
    Menberu B. Meles
    Steve Barker
    Journal of Hydrology (2024)
    Preview abstract Models like the Rangeland Hydrology and Erosion Model (RHEM) are useful for estimating soil erosion, however, they rely on input parameters that are sometimes difficult or expensive to measure. Specifically, RHEM requires information about foliar and ground cover fractions that generally must be measured in situ, which makes it difficult to use models like RHEM to produce erosion or soil risk maps for areas exceeding the size of a hillslope such as a large watershed. We previously developed a deep learning emulator of RHEM that has low computational expense and can, in principle, be run over large areas (e.g., over the continental US). In this paper, we develop a deep learning model to estimate the RHEM ground cover inputs from remote sensing time series, reducing the need for extensive field surveys to produce erosion maps. We achieve a prediction accuracy on hillslope runoff of r2=0.9, and on soil loss and sediment yield of r2 = 0.4 at 66,643 field locations within the US. We demonstrate how this approach can be used for mapping by developing runoff, soil loss, and sediment yield maps over a 1356 km2 region of interest in Nebraska. View details
    Preview abstract Large Language Models have been able to replicate their success from text generation to coding tasks. While a lot of work has made it clear that they have remarkable performance on tasks such as code completion and editing, it is still unclear as to why. We help bridge this gap by exploring to what degree do auto-regressive models understand the logical constructs of the underlying programs. We propose CAPP, a counterfactual testing framework to evaluate whether large code models understand programming concepts. With only black-box access to the model, we use CAPP to evaluate 10 popular large code models for 5 different programming concepts. Our findings suggest that current models lack understanding of concepts such as data flow and control flow. View details
    Augmentations vs Algorithms: What Works in Self-Supervised Learning
    Warren Morningstar
    Alex Bijamov
    Chris Duvarney
    Luke Friedman
    Neha Kalibhat
    Philip Mansfield
    Renan Rojas-Gomez
    Karan Singhal
    Bradley Green
    Sushant Prakash
    Arxiv (2024) (to appear)
    Preview abstract We study the relative effects of data augmentations, pretraining algorithms, and model architectures in Self-Supervised Learning (SSL). While the recent literature in this space leaves the impression that the pretraining algorithm is of critical importance to performance, understanding its effect is complicated by the difficulty in making objective and direct comparisons between methods. We propose a new framework which unifies many seemingly disparate SSL methods into a single shared template. Using this framework, we identify aspects in which methods differ and observe that in addition to changing the pretraining algorithm, many works also use new data augmentations or more powerful model architectures. We compare several popular SSL methods using our framework and find that many algorithmic additions, such as prediction networks or new losses, have a minor impact on downstream task performance (often less than 1%), while enhanced augmentation techniques offer more significant performance improvements (2−4%). Our findings challenge the premise that SSL is being driven primarily by algorithmic improvements, and suggest instead a bitter lesson for SSL: that augmentation diversity and data / model scale are more critical contributors to recent advances in self-supervised learning. View details
    Preview abstract Help documents are supposed to aid smartphone users in resolving queries such as "How to block calls from unknown numbers?". However, given a query, identifying the right help document, understanding instructions from the document, and using them to resolve the issue at hand is challenging. The user experience may be enhanced by converting the instructions in the help document to a step-by-step tutorial overlaid on the phone UI. Successful execution of this task requires overcoming research challenges in retrieval, parsing, and grounding in the multilingual-multimodal setting. For example, user queries in one language may have to be matched against instructions in another language, which in turn needs to be grounded in a multimodal UI in yet another language. Moreover, there isn’t any relevant dataset for such a task. In order to bridge this gap, we introduce UGIF-DataSet, a multi-lingual, multi-modal UI grounded dataset for step-by-step task completion on the smartphone, containing 4,184 tasks across 8 languages. The instruction steps in UGIF-DataSet are available only in English, so the challenge involves operations in the cross-modal, cross-lingual setting. We compare the performance of different large language models for this task and find that the end-to-end task completion rate drops from 48% in English to 32% for other languages, demonstrating significant overall headroom for improvement. We are hopeful that UGIF-DataSet and our analysis will aid further research on the important problem of sequential task completion in the multilingual and multimodal setting. View details
    "We Need Structured Output": Towards User-centered Constraints on Large Language Model Output
    Michael Xieyang Liu
    Frederick Liu
    Alex Fiannaca
    Terry Koo
    In Extended Abstract in ACM CHI Conference on Human Factors in Computing Systems (CHI EA '24), ACM (2024), pp. 9 (to appear)
    Preview abstract Large language models can produce creative and diverse responses. However, to integrate them into current developer workflows, it is essential to constrain their outputs to follow specific formats or standards. In this work, we surveyed 51 experienced industry professionals to understand the range of scenarios and motivations driving the need for output constraints from a user-centered perspective. We identified 134 concrete use cases for constraints at two levels: low-level, which ensures the output adhere to a structured format and an appropriate length, and high-level, which requires the output to follow semantic and stylistic guidelines without hallucination. Critically, applying output constraints could not only streamline the currently repetitive process of developing, testing, and integrating LLM prompts for developers, but also enhance the user experience of LLM-powered features and applications. We conclude with a discussion on user preferences and needs towards articulating intended constraints for LLMs, alongside an initial design for a constraint prototyping tool. View details
    Preview abstract Floods are one of the most common natural disasters, with a disproportionate impact in developing countries that often lack dense streamflow gauge networks. Accurate and timely warnings are critical for mitigating flood risks, but hydrological simulation models typically must be calibrated to long data records in each watershed. Here we show that AI-based forecasting achieves reliability in predicting extreme riverine events in ungauged watersheds at up to a 5-day lead time that is similar to or better than the reliability of nowcasts (0-day lead time) from a current state of the art global modeling system (the Copernicus Emergency Management Service Global Flood Awareness System). Additionally, we achieve accuracies over 5-year return period events that are similar to or better than current accuracies over 1-year return period events. This means that AI can provide flood warnings earlier and over larger and more impactful events in ungauged basins. The model developed in this paper was incorporated into an operational early warning system that produces publicly available (free and open) forecasts in real time in over 80 countries. This work highlights a need for increasing the availability of hydrological data to continue to improve global access to reliable flood warnings. View details
    Preview abstract We present a method for generating Streetscapes --- long sequences of views through an on-the-fly synthesized city-scale scene. Our generation is conditioned by language input (e.g., city name, weather), as well as an underlying map/layout hosting the desired trajectory. Compared to recent models for video generation or 3D view synthesis, our method can scale to much longer-range camera trajectories, spanning several city blocks, while maintaining visual quality and consistency. To achieve this goal, we build on recent work on video diffusion, used within an autoregressive framework that can easily scale to long sequences. In particular, we introduce a new temporal imputation method that prevents our autoregressive approach from drifting from the distribution of realistic city imagery. We train our Streetscapes system on a compelling source of data-posed imagery from Google Street View, along with contextual map data-which allows users to generate city views conditioned on any desired city layout, with controllable camera poses. View details
    Practical Performance Guarantees for Pipelined DNN Inference
    Kuikui Liu
    Proceedings of the 41st International Conference on Machine Learning (2024)
    Preview abstract This work optimizes pipeline parallelism of machine learning model inference by partitioning computation graphs into $k$ stages and minimizing the running time of the bottleneck stage. We design practical algorithms for this NP-complete problem and prove they are nearly optimal in practice by comparing against lower bounds obtained from solving novel mixed-integer programming (MIP) formulations. We apply these algorithms and lower-bound techniques to production models to achieve substantial improvements in the approximation guarantees, compared to simple combinatorial lower bounds. For example, our new MIP formulations improve the lower bounds enough to drop the geometric mean approximation ratio from $2.175$ to $1.082$ across production data with $k=16$ pipeline stages. This work shows that while bottleneck partitioning is theoretically hard, in practice we have a handle on the algorithmic side of the problem and much of the remaining challenge is in developing more accurate cost models to give to the partitioning algorithms. View details
    On the predictability of turbulent fluxes from land: PLUMBER2 MIP experimental description and preliminary results
    Gab Abramowitz
    Anna Ukkola
    Sanaa Hobeichi
    Jon Cranko Page
    Mathew Lipson
    Martin De Kauwe
    Sam Green
    Claire Brenner
    Jonathan Frame
    Martyn Clark
    Martin Best
    Peter Anthoni
    Gabriele Arduini
    Souhail Boussetta
    Silvia Caldararu
    Kyeungwoo Cho
    Matthias Cuntz
    David Fairbairn
    Craig Ferguson
    Hyungjun Kim
    Yeonjoo Kim
    Jürgen Knauer
    David Lawrence
    Xiangzhong Luo
    Sergey Malyshev
    Tomoko Nitta
    Jerome Ogee
    Keith Oleson
    Catherine Ottlé
    Phillipe Peylin
    Patricia de Rosnay
    Heather Rumbold
    Bob Su
    Nicolas Vuichard
    Anthony Walker
    Xiaoni Wang-Faivre
    Yunfei Wang
    Yijian Zeng
    Hydrology and Earth Systems Sciences Discussions (2024)
    Preview abstract Accurate representation of the turbulent exchange of carbon, water, and heat between the land surface and the atmosphere is critical for modelling global energy, water, and carbon cycles, both in future climate projections and weather forecasts. We describe a Model Intercomparison Project (MIP) that compares the surface turbulent heat flux predictions of around 20 different land models provided with in-situ meteorological forcing, evaluated with measured surface fluxes using quality-controlled data from 170 eddy-covariance based flux tower sites. Several out-of-sample empirical model predictions of site fluxes are used as benchmarks to quantify the degree to which land model performance could improve across a broad range of metrics. The performance discrepancy between empirical and physically-based model predictions also provides a potential pathway to understand sources of model error. Sites with unusual behaviour, complicated processes, poor data quality or uncommon flux magnitude will be more difficult to predict for both mechanistic and empirical models. Results suggest that latent heat flux and net ecosystem exchange of CO2 are better predicted by land models than sensible heat flux, which at least conceptually would appear to have fewer physical processes controlling it. Land models that are implemented in Earth System Models also appear to perform notably better than stand alone ecosystem (including demographic) models, at least in terms of the fluxes examined here. Flux tower data quality is also explored as an uncertainty source, with the difference between energy-balance corrected versus raw fluxes examined, as well as filtering for low wind speed periods. Land model performance does not appear to improve with energy-balance corrected data, and indeed some results raised questions about whether the correction process itself was appropriate. In both cases results were broadly consistent, with simple out-of-sample empirical models, including linear regression, comfortably outperforming mechanistic land models. The PLUMBER2 approach, and its openly-available data, enable precise isolation of the locations and conditions in which model developers can know that a given land model can improve, allowing information pathways and discrete parametrisations in models to be identified and targeted for model development. View details
    Preview abstract Interruptions in digital services are a common occurrence for users. These disruptions, however, exact a cost in terms of attention, task completion rate, and, most importantly, emotional state. While several methods currently employed by service providers attempt to address this, the paper will argue that browser games or similar interactive interfaces should become a standard mechanism to ease the aforementioned effects. View details
    Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
    Nitesh Bharadwaj Gundavarapu
    Luca Versari
    Kihyuk Sohn
    Agrim Gupta
    Xiuye Gu
    Alex Hauptmann
    Boqing Gong
    Lu Jiang
    ICLR (2024)
    Preview abstract While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary. Equipped with this new tokenizer, we show that LLMs outperform diffusion models on standard image and video generation benchmarks including ImageNet and Kinetics. In addition, we demonstrate that our tokenizer surpasses the previously top-performing video tokenizer on two more tasks: (1) video compression comparable to the next-generation video codec (VCC) according to human evaluations, and (2) learning effective representations for action recognition tasks. View details
    Minimizing Live Experiments in Recommender Systems: User Simulation to Evaluate Preference Elicitation Policies
    Martin Mladenov
    James Pine
    Hubert Pham
    Shane Li
    Xujian Liang
    Anton Polishko
    Ben Scheetz
    Proceedings of he 47th International ACM/SIGIR Conference on Research and Development in Information Retrieval (SIGIR-24), Washington, DC (2024) (to appear)
    Preview abstract Evaluation of policies in recommender systems (RSs) typically involves A/B testing using live experiments on real users to assess a new policy's impact on relevant metrics. This ``gold standard'' comes at a high cost, however, in terms of cycle time, user cost, and potential user retention. In developing policies for onboarding new users, these costs can be especially problematic, since on-boarding occurs only once. In this work, we describe a simulation methodology used to augment (and reduce) the use of live experiments. We illustrate its deployment for the evaluation of preference elicitation algorithms used to onboard new users of the YouTube Music platform. By developing counterfactually robust user behavior models, and a simulation service that couples such models with production infrastructure, we are able to test new algorithms in a way that reliably predicts their performance on key metrics when deployed live, sometimes more reliably than live experiments due to the scale at which simulation can be realized. We describe our domain, our simulation models and platform, results of experiments and deployment, and suggest future steps needed to further realistic simulation as a powerful complement to live experiments. View details
    Preview abstract One of the most basic problems for studying the "price of privacy over time" is the so called private counter problem, introduced by Dwork et al. (2010) and Chan et al. (2011). In this problem, we aim to track the number of events that occur over time, while hiding the existence of every single event. More specifically, in every time step $t\in[T]$ we learn (in an online fashion) that $\Delta_t\geq 0$ new events have occurred, and must respond with an estimate $n_t\approx\sum_{j=1}^t \Delta_j$. The privacy requirement is that all of the outputs together, across all time steps, satisfy event level differential privacy. The main question here is how our error needs to depend on the total number of time steps $T$ and the total number of events $n$. Dwork et al. (2015) showed an upper bound of $O\left(\log(T)+\log^2(n)\right)$, and Henzinger et al. (2023) showed a lower bound of $\Omega\left(\min\{\log n, \log T\}\right)$. We show a new lower bound of $\Omega\left(\min\{n,\log T\}\right)$, which is tight w.r.t. the dependence on $T$, and is tight in the sparse case where $\log^2 n=O(\log T)$. Our lower bound has the following implications: * We show that our lower bound extends to the online thresholds problem, where the goal is to privately answer many "quantile queries" when these queries are presented one-by-one. This resolves an open question of Bun et al. (2017). * Our lower bound implies, for the first time, a separation between the number of mistakes obtainable by a private online learner and a non-private online learner. This partially resolves a COLT'22 open question published by Sanyal and Ramponi. * Our lower bound also yields the first separation between the standard model of private online learning and a recently proposed relaxed variant of it, called private online prediction. View details
    Human Language to Analog Layout Using Glayout Layout Automation
    Ali Hammoud
    Chetanya Goyal
    Sakib Pathen
    Arlene Dai
    Anhang Li
    Mehdi Saligane
    Preview abstract Current approaches to Analog Layout Automation apply ML techniques such as Graph Convolutional Neural Networks (GCN) to translate netlist to layout. While these ML approaches have proven to be effective, they lack the powerful reasoning capabilities, an intuitive human interface, and standard evaluation benchmarks that have been improving at a rapid de- velopment pace in Large Language Models (LLMs). The GLayout framework introduced in this work translates analog layout into an expressive, technology generic, compact text representation. Then, an LLM is taught to understand analog layout through fine-tuning and in-context learning using Retrieval Augmented Generation (RAG). The LLM is able to successfully layout unseen circuits based on new information provided in-context. We train 3.8, 7, and 22 Billion parameter quantized LLMs on a dataset of less than 50 unique circuits, and text documents providing layout knowledge. The 22B parameter model is tuned in 2 hours on a single NVIDIA A100 GPU. The open-source evaluation set is proposed as an automation benchmark for LLM layout automation tasks, and ranges from 2-transistor circuits to a ∆Σ ADC. The 22B model completes 70% of the tasks in the evaluation set, and is able to pass DRC and LVS verification on unseen 4 transistor blocks. View details
    ASTRA-5G: Automated Over-the-Air Security Testing and Research Architecture for 5G SA Devices
    Aanjhan Ranganathan
    Christina Pöpper
    Evangelos Bitsikas
    Michele Guerra
    Syed Khandker
    WiSec '24: Proceedings of the 17th ACM Conference on Security and Privacy in Wireless and Mobile Networks, ACM (2024)
    Preview abstract Despite the widespread deployment of 5G technologies, there exists a critical gap in security testing for 5G Standalone (SA) devices. Existing methods, largely manual and labor-intensive, are ill-equipped to fully uncover the state of security in the implementations of 5G-SA protocols and standards on devices, severely limiting the ability to conduct comprehensive evaluations. To address this issue, in this work, we introduce an novel, open-source framework that auto- mates the security testing process for 5G SA devices. By leveraging enhanced functionalities of 5G SA core and Radio Access Network (RAN) software, our framework offers a streamlined approach to generating, executing, and evaluating test cases, specifically focusing on the Non-Access Stratum (NAS) layer. Our application of this framework across multiple 5G SA devices provides in-depth security insights, significantly improving testing efficiency and breadth. View details