Data structures for density estimation
We study statistical/computational tradeoffs for the following density estimation problem: given k distributions v1, ..., vk over a discrete domain of size n, and sampling access to a distribution p, identify vi that is "close" to p. Our main result is ...
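The truncated abstract does not show the paper's data structure; as a point of reference, the naive baseline for this selection problem is maximum-likelihood hypothesis selection over the k candidates. A minimal sketch (function names and the toy distributions are illustrative, not from the paper):

```python
import math
import random

def closest_hypothesis(hypotheses, samples):
    """Return the index of the hypothesis distribution with the highest
    log-likelihood on the observed samples (maximum-likelihood selection;
    a simple baseline, not the paper's data structure)."""
    best_i, best_ll = 0, float("-inf")
    for i, v in enumerate(hypotheses):
        # small floor avoids log(0) for out-of-support samples
        ll = sum(math.log(v.get(x, 1e-12)) for x in samples)
        if ll > best_ll:
            best_i, best_ll = i, ll
    return best_i

# two hypotheses over the domain {0, 1}; samples drawn from v1
v1 = {0: 0.9, 1: 0.1}
v2 = {0: 0.1, 1: 0.9}
random.seed(0)
samples = [0 if random.random() < 0.9 else 1 for _ in range(100)]
print(closest_hypothesis([v1, v2], samples))  # 0
```

This scans all k hypotheses per query; the statistical/computational tradeoff studied in the paper concerns doing better than such a linear scan.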
ClusterFuG: clustering fully connected graphs by multicut
We propose a graph clustering formulation based on multicut (a.k.a. weighted correlation clustering) on the complete graph. Our formulation does not need specification of the graph topology as in the original sparse formulation of multicut, making our ...
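The multicut (weighted correlation clustering) objective on a complete graph can be illustrated with a toy greedy heuristic — not the ClusterFuG algorithm itself — that merges clusters while the total connecting weight is positive:

```python
def greedy_multicut(n, w):
    """Greedy agglomeration for weighted correlation clustering on the
    complete graph K_n: repeatedly merge the two clusters with the largest
    positive total connecting weight. A heuristic sketch only.
    `w[(i, j)]` with i < j: positive = attractive, negative = repulsive."""
    clusters = [{i} for i in range(n)]

    def weight(a, b):
        return sum(w.get((min(i, j), max(i, j)), 0.0) for i in a for j in b)

    while True:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = weight(clusters[x], clusters[y])
                if s > 0 and (best is None or s > best[0]):
                    best = (s, x, y)
        if best is None:
            return clusters
        _, x, y = best
        clusters[x] |= clusters[y]
        del clusters[y]

w = {(0, 1): 2.0, (1, 2): 1.5, (0, 2): -0.5,
     (2, 3): -1.0, (0, 3): -1.0, (1, 3): -1.0}
print(greedy_multicut(4, w))  # [{0, 1, 2}, {3}]
```

Node 2 joins {0, 1} because its net weight to that cluster (1.5 − 0.5) is positive, even though edge (0, 2) is repulsive — exactly the kind of global tradeoff the multicut objective encodes.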
Generalization on the unseen, logic reasoning and degree curriculum
This paper considers the learning of logical (Boolean) functions with focus on the generalization on the unseen (GOTU) setting, a strong case of out-of-distribution generalization. This is motivated by the fact that the rich combinatorial nature of data ...
Toward large kernel models
Recent studies indicate that kernel machines can often perform similarly or better than deep neural networks (DNNs) on small datasets. The interest in kernel machines has been additionally bolstered by the discovery of their equivalence to wide neural ...
Expertise trees resolve knowledge limitations in collective decision-making
Experts advising decision-makers are likely to display expertise which varies as a function of the problem instance. In practice, this may lead to sub-optimal or discriminatory decisions against minority cases. In this work, we model such changes in depth ...
Comparison of meta-learners for estimating multi-valued treatment heterogeneous effects
Conditional Average Treatment Effects (CATE) estimation is one of the main challenges in causal inference with observational data. In addition to Machine Learning-based models, nonparametric estimators called meta-learners have been developed to estimate ...
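One standard meta-learner in the class the abstract compares is the T-learner: fit one outcome model per treatment arm and take differences of predictions. A minimal sketch for multi-valued treatments, with per-arm mean models standing in for real regressors (data and names are illustrative):

```python
def t_learner_cate(X, t, y, fit, treatment_values):
    """T-learner for multi-valued treatments: fit one outcome model per
    treatment arm, then estimate the effect of arm a vs. the baseline arm
    as mu_a(x) - mu_0(x). `fit` is any regression routine returning a
    predict function."""
    models = {}
    for a in treatment_values:
        Xa = [x for x, ti in zip(X, t) if ti == a]
        ya = [yi for yi, ti in zip(y, t) if ti == a]
        models[a] = fit(Xa, ya)

    def cate(x, a):
        return models[a](x) - models[treatment_values[0]](x)
    return cate

# toy data: outcome rises by ~2 under arm 1 and ~5 under arm 2
mean_fit = lambda X, y: (lambda x, m=sum(y) / len(y): m)  # constant model
X = [[0]] * 6
t = [0, 0, 1, 1, 2, 2]
y = [1.0, 1.2, 3.0, 3.2, 6.0, 6.2]
cate = t_learner_cate(X, t, y, mean_fit, [0, 1, 2])
print(round(cate([0], 2), 1))  # 5.0
```

Swapping `mean_fit` for any supervised learner recovers the usual plug-in T-learner; other meta-learners (S-, X-learners) differ in how they share data across arms.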
BNN-DP: robustness certification of Bayesian neural networks via dynamic programming
In this paper, we introduce BNN-DP, an efficient algorithmic framework for analysis of adversarial robustness of Bayesian Neural Networks (BNNs). Given a compact set of input points T ⊂ ℝⁿ, BNN-DP computes lower and upper bounds on the BNN's predictions ...
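The dynamic program itself is not shown in the truncated abstract; for intuition, plain interval arithmetic through a single affine + ReLU layer illustrates how intervals on inputs and (Bayesian) weights yield prediction bounds — far looser than what BNN-DP computes, but the same kind of object:

```python
def affine_relu_bounds(x_lo, x_hi, W_lo, W_hi, b):
    """Propagate input and weight intervals through y = relu(W x + b):
    for each output unit, pick the interval endpoints of x and W that
    minimize / maximize each product term. Plain interval arithmetic —
    an illustrative sound bound, not BNN-DP's dynamic program."""
    y_lo, y_hi = [], []
    for row_lo, row_hi, bi in zip(W_lo, W_hi, b):
        lo = hi = bi
        for xl, xh, wl, wh in zip(x_lo, x_hi, row_lo, row_hi):
            products = [xl * wl, xl * wh, xh * wl, xh * wh]
            lo += min(products)
            hi += max(products)
        # relu is monotone, so it maps interval endpoints to endpoints
        y_lo.append(max(0.0, lo))
        y_hi.append(max(0.0, hi))
    return y_lo, y_hi

lo, hi = affine_relu_bounds([0.0, 1.0], [1.0, 2.0],
                            [[1.0, -1.0]], [[1.5, -0.5]], [0.5])
print(lo, hi)  # [0.0] [1.5]
```

Composing such bounds layer by layer grows conservatism quickly, which is one motivation for tighter schemes like the dynamic-programming approach in the paper.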
SAM operates far from home: eigenvalue regularization as a dynamical phenomenon
The Sharpness Aware Minimization (SAM) optimization algorithm has been shown to control large eigenvalues of the loss Hessian and provide generalization benefits in a variety of settings. The original motivation for SAM was a modified loss function which ...
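The SAM update the abstract refers to is standard: ascend to the worst-case point within a radius-ρ ball, then descend using the gradient taken there. A one-parameter sketch on f(w) = w² (values are illustrative):

```python
def sam_step(w, grad, lr=0.1, rho=0.05):
    """One Sharpness-Aware Minimization step for a scalar parameter:
    perturb by rho in the ascent direction (rho * g / |g| in 1-D), then
    descend with the gradient evaluated at the perturbed point."""
    g = grad(w)
    eps = rho * (1 if g >= 0 else -1)
    return w - lr * grad(w + eps)

# f(w) = w^2, grad = 2w
grad = lambda w: 2.0 * w
w = 1.0
for _ in range(50):
    w = sam_step(w, grad)
# w hovers near the minimum with a residual oscillation of order lr * rho,
# rather than converging exactly -- the perturbation never vanishes
print(abs(w) < 1e-2)  # True
```

Even in this toy setting the iterate does not sit at the minimum: the ρ-perturbation biases the gradient, which connects to the paper's point that SAM's behavior is a dynamical phenomenon rather than simple minimization of a modified loss.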
Second-order regression models exhibit progressive sharpening to the edge of stability
Recent studies of gradient descent with large step sizes have shown that there is often a regime with an initial increase in the largest eigenvalue of the loss Hessian (progressive sharpening), followed by a stabilization of the eigenvalue near the ...
Global optimality of Elman-type RNNs in the mean-field regime
We analyze Elman-type Recurrent Neural Networks (RNNs) and their training in the mean-field regime. Specifically, we show convergence of gradient descent training dynamics of the RNN to the corresponding mean-field formulation in the large width limit. We ...
SemSup-XC: semantic supervision for zero and few-shot extreme classification
Extreme classification (XC) involves predicting over large numbers of classes (thousands to millions), with real-world applications like news article classification and e-commerce product tagging. The zero-shot version of this task requires generalization ...
Adaptive IMLE for few-shot pretraining-free generative modelling
Despite their success on large datasets, GANs have been difficult to apply in the few-shot setting, where only a limited number of training examples are provided. Due to mode collapse, GANs tend to ignore some training examples, causing overfitting to a ...
Scaling laws for generative mixed-modal language models
- Armen Aghajanyan,
- Lili Yu,
- Alexis Conneau,
- Wei-Ning Hsu,
- Karen Hambardzumyan,
- Susan Zhang,
- Stephen Roller,
- Naman Goyal,
- Omer Levy,
- Luke Zettlemoyer
Generative language models define distributions over sequences of tokens that can represent essentially any combination of data modalities (e.g., any permutation of image tokens from VQ-VAEs, speech tokens from HuBERT, BPE tokens for language or code, and ...
Hypothesis transfer learning with surrogate classification losses: generalization bounds through algorithmic stability
Hypothesis transfer learning (HTL) contrasts with domain adaptation by allowing the leverage of a previous task, named the source, in a new one, the target, without requiring access to the source data. Indeed, HTL relies only on a hypothesis learnt from such ...
Constrained causal Bayesian optimization
We propose constrained causal Bayesian optimization (cCBO), an approach for finding interventions in a known causal graph that optimize a target variable under some constraints. cCBO first reduces the search space by exploiting the graph structure and, if ...
Explaining the effects of non-convergent MCMC in the training of energy-based models
In this paper, we quantify the impact of using non-convergent Markov chains to train Energy-Based Models (EBMs). In particular, we show analytically that EBMs trained with non-persistent short runs to estimate the gradient can perfectly reproduce a set of ...
Using large language models to simulate multiple humans and replicate human subject studies
We introduce a new type of test, called a Turing Experiment (TE), for evaluating to what extent a given language model, such as GPT models, can simulate different aspects of human behavior. A TE can also reveal consistent distortions in a language model's ...
Interventional causal representation learning
Causal representation learning seeks to extract high-level latent factors from low-level sensory data. Most existing methods rely on observational data and structural assumptions (e.g., conditional independence) to identify the latent factors. However, ...
Sequential underspecified instrument selection for cause-effect estimation
Instrumental variable (IV) methods are used to estimate causal effects in settings with unobserved confounding, where we cannot directly experiment on the treatment variable. Instruments are variables which only affect the outcome indirectly via the ...
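The classical IV estimator underlying such methods is two-stage least squares; a single-instrument sketch (the paper's sequential, underspecified-instrument selection procedure is not shown here):

```python
def ols_slope(x, y):
    """Simple-regression slope: cov(x, y) / var(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var

def two_stage_least_squares(z, t, y):
    """2SLS with one instrument z: regress treatment t on z (first stage),
    then regress outcome y on the fitted treatment values (second stage).
    The fitted values keep only the variation in t induced by z, which is
    uncorrelated with the unobserved confounder by assumption."""
    a = ols_slope(z, t)
    mz, mt = sum(z) / len(z), sum(t) / len(t)
    t_hat = [mt + a * (zi - mz) for zi in z]
    return ols_slope(t_hat, y)

# noiseless illustration: t = 2z, y = 3t, so the causal effect is 3
z = [0.0, 1.0, 2.0, 3.0]
t = [2 * zi for zi in z]
y = [3 * ti for ti in t]
print(two_stage_least_squares(z, t, y))  # 3.0
```

With several candidate instruments of varying strength and validity, which ones to use becomes a selection problem — the setting the paper addresses.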
Atari-5: distilling the arcade learning environment down to five games
The Arcade Learning Environment (ALE) has become an essential benchmark for assessing the performance of reinforcement learning algorithms. However, the computational cost of generating results on the entire 57-game dataset limits ALE's use and makes the ...
Towards credible visual model interpretation with path attribution
With its inspirational roots in game theory, the path attribution framework stands out among post-hoc model interpretation techniques due to its axiomatic nature. However, recent developments show that despite being axiomatic, path attribution methods can ...
Convergence of first-order methods for constrained nonconvex optimization with dependent data
We focus on analyzing the classical stochastic projected gradient methods under a general dependent data sampling scheme for constrained smooth nonconvex optimization. We show the worst-case rate of convergence Õ(t^(-1/4)) and complexity Õ(ε^(-4)) for achieving ...
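The projected gradient step analyzed in the paper is the classical one: take a gradient step, then project back onto the constraint set. A deterministic sketch on a box constraint (the paper's setting additionally has stochastic, dependent samples):

```python
def projected_gradient_descent(grad, project, x0, lr=0.1, steps=200):
    """Projected gradient method: gradient step followed by Euclidean
    projection onto the feasible set at every iteration."""
    x = x0
    for _ in range(steps):
        x = project(x - lr * grad(x))
    return x

# minimize (x - 2)^2 subject to x in [0, 1]; the constrained optimum is x = 1
grad = lambda x: 2 * (x - 2)
project = lambda x: min(1.0, max(0.0, x))  # projection onto the box [0, 1]
print(projected_gradient_descent(grad, project, x0=0.0))  # 1.0
```

The iterate settles on the boundary point x = 1 because the unconstrained minimizer x = 2 lies outside the feasible box.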
Recasting self-attention with holographic reduced representations
In recent years, self-attention has become the dominant paradigm for sequence modeling in a variety of domains. However, in domains with very long sequence lengths the O(T²) memory and O(T²H) compute costs can make using transformers infeasible. Motivated ...
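The O(T²) cost comes from materializing the full T × T score matrix; a naive pure-Python attention makes this explicit (the paper's holographic-reduced-representation reformulation avoids exactly this step):

```python
import math

def attention(Q, K, V):
    """Naive softmax attention. For each of the T queries we compute T
    scores against the keys -- the T x T score structure whose O(T^2)
    memory and compute the abstract refers to."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0], [2.0]]
print([round(row[0], 2) for row in attention(Q, K, V)])  # [1.33, 1.67]
```

Each query attends most to its matching key, so the outputs are pulled toward the corresponding values while still mixing in the other.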
The saddle-point method in differential privacy
We characterize the differential privacy guarantees of privacy mechanisms in the large-composition regime, i.e., when a privacy mechanism is sequentially applied a large number of times to sensitive data. Via exponentially tilting the privacy loss random ...
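For context on the large-composition regime: the classical basic and advanced composition theorems bound the total privacy loss of k-fold application, and methods like the paper's saddle-point analysis tighten such bounds further. A numeric comparison of the two classical bounds:

```python
import math

def basic_composition(eps, delta, k):
    """k-fold basic composition: the privacy parameters simply add up."""
    return k * eps, k * delta

def advanced_composition(eps, delta, k, delta_prime):
    """Advanced composition theorem: epsilon grows roughly like sqrt(k)
    for small eps, at the cost of an extra failure probability
    delta_prime."""
    eps_total = (math.sqrt(2 * k * math.log(1 / delta_prime)) * eps
                 + k * eps * (math.exp(eps) - 1))
    return eps_total, k * delta + delta_prime

eps, delta, k = 0.01, 1e-6, 1000
print(round(basic_composition(eps, delta, k)[0], 6))        # 10.0
print(advanced_composition(eps, delta, k, 1e-6)[0] < 10.0)  # True
```

For k = 1000 applications of a (0.01, 10⁻⁶)-DP mechanism, basic composition gives ε = 10 while advanced composition gives ε ≈ 1.8; saddle-point-style analyses of the privacy loss random variable sharpen this gap further.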
Nonlinear advantage: trained networks might not be as complex as you think
We perform an empirical study of the behaviour of deep networks when fully linearizing some of their feature channels through a sparsity prior on the overall number of nonlinear units in the network. In experiments on image classification and machine ...
A simple zero-shot prompt weighting technique to improve prompt ensembling in text-image models
- James Urquhart Allingham,
- Jie Ren,
- Michael W. Dusenberry,
- Xiuye Gu,
- Yin Cui,
- Dustin Tran,
- Jeremiah Zhe Liu,
- Balaji Lakshminarayanan
Contrastively trained text-image models have the remarkable ability to perform zero-shot classification, that is, classifying previously unseen images into categories that the model has never been explicitly trained to identify. However, these zero-shot ...
On the privacy-robustness-utility trilemma in distributed learning
The ubiquity of distributed machine learning (ML) in sensitive public domain applications calls for algorithms that protect data privacy, while being robust to faults and adversarial behaviors. Although privacy and robustness have been extensively studied ...
Differentially private distributed Bayesian linear regression with MCMC
We propose a novel Bayesian inference framework for distributed differentially private linear regression. We consider a distributed setting where multiple parties hold parts of the data and share certain summary statistics of their portions in privacy-...
Robust and scalable Bayesian online changepoint detection
This paper proposes an online, provably robust, and scalable Bayesian approach for change-point detection. The resulting algorithm has key advantages over previous work: it provides provable robustness by leveraging the generalised Bayesian perspective, ...
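A vanilla (non-robust) version of Bayesian online changepoint detection is the Adams–MacKay run-length recursion; a minimal sketch for unit-variance Gaussian data with a conjugate Normal prior on the mean. The paper's contributions — provable robustness via generalised Bayes and improved scalability — are not reflected in this baseline:

```python
import math

def bocpd(data, hazard=0.05, mu0=0.0, kappa0=1.0):
    """Minimal Bayesian online changepoint detection: maintain a posterior
    over the current run length r, growing each run with probability
    (1 - hazard) and resetting with probability hazard. Observations are
    unit-variance Gaussian with a Normal(mu0, 1/kappa0) prior on the mean.
    Returns the most probable run length after each observation."""
    r = [1.0]                      # run-length posterior
    params = [(mu0, kappa0)]       # per-run-length posterior (mean, kappa)
    argmax_runs = []
    for x in data:
        # predictive N(x; mu, 1 + 1/kappa) for each current run length
        pred = []
        for mu, kappa in params:
            var = 1.0 + 1.0 / kappa
            pred.append(math.exp(-(x - mu) ** 2 / (2 * var))
                        / math.sqrt(2 * math.pi * var))
        growth = [ri * pi * (1 - hazard) for ri, pi in zip(r, pred)]
        cp = sum(ri * pi * hazard for ri, pi in zip(r, pred))
        r = [cp] + growth
        z = sum(r)
        r = [ri / z for ri in r]
        # new run restarts from the prior; existing runs absorb x
        params = [(mu0, kappa0)] + [((kappa * mu + x) / (kappa + 1), kappa + 1)
                                    for mu, kappa in params]
        argmax_runs.append(max(range(len(r)), key=r.__getitem__))
    return argmax_runs

data = [0.1, -0.2, 0.0, 0.1, 5.0, 5.2, 4.9, 5.1]
print(bocpd(data))  # run length grows 1, 2, 3, ... then resets after the jump at index 4
```

A single outlier is enough to make this vanilla recursion declare a changepoint, which is precisely the fragility that the generalised Bayesian perspective in the paper is designed to fix.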
Neural Wasserstein gradient flows for discrepancies with Riesz kernels
Wasserstein gradient flows of maximum mean discrepancy (MMD) functionals with nonsmooth Riesz kernels show a rich structure as singular measures can become absolutely continuous ones and conversely. In this paper we contribute to the understanding of such ...
Index Terms
- Proceedings of the 40th International Conference on Machine Learning