Search | arXiv e-print repository

On scalable oversight with weak LLMs judging strong LLMs

Authors: Zachary Kenton, Noah Y. Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman, Rohin Shah

Abstract: Scalable oversight protocols aim to enable humans to accurately supervise superhuman AI. In this paper we study debate, where two AIs compete to convince a judge; consultancy, where a single AI tries to convince a judge that asks questions; and compare both to a baseline of direct question-answering, where the judge just answers outright without the AI. We use large language models (LLMs) as both AI agents and as stand-ins for human judges, taking the judge models to be weaker than agent models. We benchmark on a diverse range of asymmetries between judges and agents, extending previous work on a single extractive QA task with information asymmetry, to also include mathematics, coding, logic and multimodal reasoning asymmetries. We find that debate outperforms consultancy across all tasks when the consultant is randomly assigned to argue for the correct/incorrect answer. Comparing debate to direct question answering, the results depend on the type of task: in extractive QA tasks with information asymmetry debate outperforms direct question answering, but in other tasks without information asymmetry the results are mixed. Previous work assigned debaters/consultants an answer to argue for. When we allow them to instead choose which answer to argue for, we find judges are less frequently convinced by the wrong answer in debate than in consultancy. Further, we find that stronger debater models increase judge accuracy, though more modestly than in previous studies.

Submitted 12 July, 2024; v1 submitted 5 July, 2024; originally announced July 2024.

Comments: 15 pages (53 including appendices). V2: minor correction to Figure 3; add Figure A.9 comparing open vs assigned consultancy; add a reference

arXiv:2311.14125

Scalable AI Safety via Doubly-Efficient Debate

Authors: Jonah Brown-Cohen, Geoffrey Irving, Georgios Piliouras

Abstract: The emergence of pre-trained AI systems with powerful capabilities across a diverse and ever-increasing set of complex domains has raised a critical challenge for AI safety as tasks can become too complicated for humans to judge directly. Irving et al. [2018] proposed a debate method in this direction with the goal of pitting the power of such AI models against each other until the problem of identifying (mis)-alignment is broken down into a manageable subtask. While the promise of this approach is clear, the original framework was based on the assumption that the honest strategy is able to simulate deterministic AI systems for an exponential number of steps, limiting its applicability. In this paper, we show how to address these challenges by designing a new set of debate protocols where the honest strategy can always succeed using a simulation of a polynomial number of steps, whilst being able to verify the alignment of stochastic AI systems, even when the dishonest strategy is allowed to use exponentially many simulation steps.

Submitted 23 November, 2023; originally announced November 2023.

arXiv:2310.17567

Skill-Mix: a Flexible and Expandable Family of Evaluations for AI models

Authors: Dingli Yu, Simran Kaur, Arushi Gupta, Jonah Brown-Cohen, Anirudh Goyal, Sanjeev Arora

Abstract: With LLMs shifting their role from statistical modeling of language to serving as general-purpose AI agents, how should LLM evaluations change? Arguably, a key ability of an AI agent is to flexibly combine, as needed, the basic skills it has learned. The capability to combine skills plays an important role in (human) pedagogy and also in a paper on emergence phenomena (Arora & Goyal, 2023). This work introduces Skill-Mix, a new evaluation to measure the ability to combine skills. Using a list of $N$ skills, the evaluator repeatedly picks random subsets of $k$ skills and asks the LLM to produce text combining that subset of skills. Since the number of subsets grows like $N^k$, for even modest $k$ this evaluation will, with high probability, require the LLM to produce text significantly different from any text in the training set. The paper develops a methodology for (a) designing and administering such an evaluation, and (b) automatic grading (plus spot-checking by humans) of the results using GPT-4 as well as the open LLaMA-2 70B model. Administering a version of Skill-Mix to popular chatbots gave results that, while generally in line with prior expectations, contained surprises. Sizeable differences exist among model capabilities that are not captured by their ranking on popular LLM leaderboards ("cramming for the leaderboard"). Furthermore, simple probability calculations indicate that GPT-4's reasonable performance on $k=5$ is suggestive of going beyond "stochastic parrot" behavior (Bender et al., 2021), i.e., it combines skills in ways that it had not seen during training. We sketch how the methodology can lead to a Skill-Mix based ecosystem of open evaluations for AI capabilities of future models.

Submitted 26 October, 2023; originally announced October 2023.
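
The sampling step described in the Skill-Mix abstract above (pick a random subset of $k$ skills out of $N$, then ask the model for text combining them) can be sketched roughly as follows. This is only an illustration, not the authors' code: the skill names, the prompt wording, the value N = 100, and the helper names `sample_skill_subset` and `num_skill_subsets` are all hypothetical.

```python
import math
import random


def sample_skill_subset(skills, k, rng):
    """Pick k distinct skills uniformly at random (hypothetical helper)."""
    return rng.sample(skills, k)


def num_skill_subsets(n, k):
    """Number of distinct k-subsets of n skills; grows roughly like n**k / k!."""
    return math.comb(n, k)


# Hypothetical skill list; the abstract does not enumerate the real skills.
skills = [f"skill_{i}" for i in range(100)]
rng = random.Random(0)  # fixed seed so the sketch is reproducible

subset = sample_skill_subset(skills, 5, rng)
# Hypothetical prompt wording, for illustration only.
prompt = "Write a short text that demonstrates all of: " + ", ".join(subset)

print(prompt)
print(num_skill_subsets(100, 5))  # 75287520 possible 5-subsets
```

Even with these made-up numbers, the count of possible subsets is large enough to suggest why a randomly drawn combination is unlikely to match any single training text.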

arXiv:2306.05873

Detecting Adversarial Directions in Deep Reinforcement Learning to Make Robust Decisions

Authors: Ezgi Korkmaz, Jonah Brown-Cohen

Abstract: Learning in MDPs with highly complex state representations is currently possible due to multiple advancements in reinforcement learning algorithm design. However, this increase in complexity, together with the increase in the dimensions of the observation, has come at the cost of volatility that can be exploited via adversarial attacks (i.e. moving along worst-case directions in the observation space). To solve this policy instability problem we propose a novel method to detect the presence of these non-robust directions via local quadratic approximation of the deep neural policy loss. Our method provides a theoretical basis for the fundamental cut-off between safe observations and adversarial observations. Furthermore, our technique is computationally efficient, and does not depend on the methods used to produce the worst-case directions. We conduct extensive experiments in the Arcade Learning Environment with several different adversarial attack techniques. Most significantly, we demonstrate the effectiveness of our approach even in the setting where non-robust directions are explicitly optimized to circumvent our proposed method.

Submitted 9 June, 2023; originally announced June 2023.

Comments: Published in ICML 2023

arXiv:2112.13832

Faster Algorithms and Constant Lower Bounds for the Worst-Case Expected Error

Authors: Jonah Brown-Cohen

Abstract: The study of statistical estimation without distributional assumptions on data values, but with knowledge of data collection methods was recently introduced by Chen, Valiant and Valiant (NeurIPS 2020). In this framework, the goal is to design estimators that minimize the worst-case expected error. Here the expectation is over a known, randomized data collection process from some population, and the data values corresponding to each element of the population are assumed to be worst-case. Chen, Valiant and Valiant show that, when data values are $\ell_{\infty}$-normalized, there is a polynomial time algorithm to compute an estimator for the mean with worst-case expected error that is within a factor $\frac{\pi}{2}$ of the optimum within the natural class of semilinear estimators. However, their algorithm is based on optimizing a somewhat complex concave objective function over a constrained set of positive semidefinite matrices, and thus does not come with explicit runtime guarantees beyond being polynomial time in the input. In this paper we design provably efficient algorithms for approximating the optimal semilinear estimator based on online convex optimization. In the setting where data values are $\ell_{\infty}$-normalized, our algorithm achieves a $\frac{\pi}{2}$-approximation by iteratively solving a sequence of standard SDPs. When data values are $\ell_2$-normalized, our algorithm iteratively computes the top eigenvector of a sequence of matrices, and does not lose any multiplicative approximation factor. We complement these positive results by stating a simple combinatorial condition which, if satisfied by a data collection process, implies that any (not necessarily semilinear) estimator for the mean has constant worst-case expected error.

Submitted 27 December, 2021; originally announced December 2021.

arXiv:1911.02911

Extended Formulation Lower Bounds for Refuting Random CSPs

Authors: Jonah Brown-Cohen, Prasad Raghavendra

Abstract: Random constraint satisfaction problems (CSPs) such as random $3$-SAT are conjectured to be computationally intractable. The average case hardness of random $3$-SAT and other CSPs has broad and far-reaching implications on problems in approximation, learning theory and cryptography. In this work, we show subexponential lower bounds on the size of linear programming relaxations for refuting random instances of constraint satisfaction problems. Formally, suppose $P : \{0,1\}^k \to \{0,1\}$ is a predicate that supports a $(t-1)$-wise uniform distribution on its satisfying assignments. Consider the distribution of random instances of CSP $P$ with $m = \Delta n$ constraints. We show that any linear programming extended formulation that can refute instances from this distribution with constant probability must have size at least $\Omega\left(\exp\left(\left(\frac{n^{t-2}}{\Delta^2}\right)^{\frac{1-\nu}{k}}\right)\right)$ for all $\nu > 0$. For example, this yields a lower bound of size $\exp(n^{1/3})$ for random $3$-SAT with a linear number of clauses. We use the technique of pseudocalibration to directly obtain extended formulation lower bounds from the planted distribution. This approach bypasses the need to construct Sherali-Adams integrality gaps in proving general LP lower bounds. As a corollary, one obtains a self-contained proof of subexponential Sherali-Adams LP lower bounds for these problems. We believe the result sheds light on the technique of pseudocalibration, a promising but conjectural approach to LP/SDP lower bounds.

Submitted 7 November, 2019; originally announced November 2019.
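
As a worked instance of the bound in the abstract above, the random $3$-SAT example can be recovered by substitution. The parameter choice is an assumption not spelled out in the abstract: the $3$-SAT predicate has $k = 3$ and supports a pairwise uniform distribution on its satisfying assignments (for instance, the uniform distribution over odd-parity assignments), so $t - 1 = 2$; a linear number of clauses means $\Delta$ is a constant.

```latex
\[
\Omega\!\left(\exp\!\left(\left(\frac{n^{t-2}}{\Delta^{2}}\right)^{\frac{1-\nu}{k}}\right)\right)
\;\Bigg|_{\,k=3,\; t=3}
= \Omega\!\left(\exp\!\left(\left(\frac{n}{\Delta^{2}}\right)^{\frac{1-\nu}{3}}\right)\right)
= \Omega\!\left(\exp\!\left(c\, n^{\frac{1-\nu}{3}}\right)\right),
\qquad c = \Delta^{-\frac{2(1-\nu)}{3}} .
\]
```

Since this holds for every $\nu > 0$, the exponent can be taken arbitrarily close to $n^{1/3}$, which is the sense in which the abstract quotes a lower bound of size $\exp(n^{1/3})$.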

arXiv:1809.06528

Formal Barriers to Longest-Chain Proof-of-Stake Protocols

Authors: Jonah Brown-Cohen, Arvind Narayanan, Christos-Alexandros Psomas, S. Matthew Weinberg

Abstract: The security of most existing cryptocurrencies is based on a concept called Proof-of-Work, in which users must solve a computationally hard cryptopuzzle to authorize transactions ("one unit of computation, one vote"). This leads to enormous expenditure on hardware and electricity in order to collect the rewards associated with transaction authorization. Proof-of-Stake is an alternative concept that instead selects users to authorize transactions proportional to their wealth ("one coin, one vote"). Some aspects of the two paradigms are the same. For instance, obtaining voting power in Proof-of-Stake has a monetary cost just as in Proof-of-Work: a coin cannot be freely duplicated any more easily than a unit of computation. However, some aspects are fundamentally different. In particular, exactly because Proof-of-Stake is wasteless, there is no inherent resource cost to deviating (commonly referred to as the "Nothing-at-Stake" problem). In contrast to prior work, we focus on incentive-driven deviations (any participant will deviate if doing so yields higher revenue) instead of adversarial corruption (an adversary may take over a significant fraction of the network, but the remaining players follow the protocol). The main results of this paper are several formal barriers to designing incentive-compatible proof-of-stake cryptocurrencies (that don't apply to proof-of-work).

Submitted 18 September, 2018; originally announced September 2018.
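
The "one coin, one vote" selection rule mentioned in the abstract above can be illustrated with a toy stake-weighted lottery. This is a minimal sketch of proportional selection only, not any protocol analyzed in the paper; the account names, the stake values, and the helper `pick_validator` are hypothetical, and a real system would not draw its randomness from a local pseudorandom generator.

```python
import random


def pick_validator(stakes, rng):
    """Select one account with probability proportional to its stake."""
    holders = list(stakes)
    weights = [stakes[h] for h in holders]
    return rng.choices(holders, weights=weights, k=1)[0]


# Hypothetical stake distribution (coins held per account).
stakes = {"alice": 60, "bob": 30, "carol": 10}
rng = random.Random(0)  # fixed seed for reproducibility of the sketch

# Tally how often each account is selected over many rounds.
counts = {h: 0 for h in stakes}
for _ in range(10_000):
    counts[pick_validator(stakes, rng)] += 1

print(counts)  # selection frequencies should roughly track the 60/30/10 split
```

Unlike Proof-of-Work, nothing in this lottery consumes an external resource, which is the property the abstract ties to the "Nothing-at-Stake" problem.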

arXiv:1504.00703

The matching problem has no small symmetric SDP

Authors: Gábor Braun, Jonah Brown-Cohen, Arefin Huq, Sebastian Pokutta, Prasad Raghavendra, Aurko Roy, Benjamin Weitz, Daniel Zink

Abstract: Yannakakis showed that the matching problem does not have a small symmetric linear program. Rothvoß recently proved that any, not necessarily symmetric, linear program also has exponential size. It is natural to ask whether the matching problem can be expressed compactly in a framework such as semidefinite programming (SDP) that is more powerful than linear programming but still allows efficient optimization. We answer this question negatively for symmetric SDPs: any symmetric SDP for the matching problem has exponential size. We also show that an $O(k)$-round Lasserre SDP relaxation for the metric traveling salesperson problem yields at least as good an approximation as any symmetric SDP relaxation of size $n^k$. The key technical ingredient underlying both these results is an upper bound on the degree needed to derive polynomial identities that hold over the space of matchings or traveling salesperson tours.

Submitted 30 November, 2016; v1 submitted 2 April, 2015; originally announced April 2015.

Comments: 18 pages

MSC Class: 68Q17; 68R10

Journal ref: Proceedings of SODA 2016, 1067-1078

arXiv:1501.01598

Combinatorial Optimization Algorithms via Polymorphisms

Authors: Jonah Brown-Cohen, Prasad Raghavendra

Abstract: An elegant characterization of the complexity of constraint satisfaction problems has emerged in the form of the algebraic dichotomy conjecture of [BKJ00]. Roughly speaking, the characterization asserts that a CSP Λ is tractable if and only if there exist certain non-trivial operations known as polymorphisms to combine solutions to Λ to create new ones. In an entirely separate line of work, the unique games conjecture yields a characterization of approximability of Max-CSPs. Surprisingly, this characterization for Max-CSPs can also be reformulated in the language of polymorphisms. In this work, we study whether existence of non-trivial polymorphisms implies tractability beyond the realm of constraint satisfaction problems, namely in the value-oracle model. Specifically, given a function f in the value-oracle model along with an appropriate operation that never increases the value of f, we design algorithms to minimize f. In particular, we design a randomized algorithm to minimize a function f that admits a fractional polymorphism which is measure preserving and has a transitive symmetry. We also reinterpret known results on Max-CSPs and thereby reformulate the unique games conjecture as a characterization of approximability of Max-CSPs in terms of their approximate polymorphisms.

Submitted 7 January, 2015; originally announced January 2015.

Showing 1–9 of 9 results for author: Brown-Cohen, J