Challenges with unsupervised LLM knowledge discovery

Farquhar, Sebastian; Varma, Vikrant; Kenton, Zachary; Gasteiger, Johannes; Mikulik, Vladimir; Shah, Rohin

arXiv:2312.10029 (cs)

[Submitted on 15 Dec 2023 (v1), last revised 18 Dec 2023 (this version, v2)]

Abstract: We show that existing unsupervised methods on large language model (LLM) activations do not discover knowledge -- instead they seem to discover whatever feature of the activations is most prominent. The idea behind unsupervised knowledge elicitation is that knowledge satisfies a consistency structure, which can be used to discover knowledge. We first prove theoretically that arbitrary features (not just knowledge) satisfy the consistency structure of a particular leading unsupervised knowledge-elicitation method, contrast-consistent search (Burns et al., arXiv:2212.03827). We then present a series of experiments showing settings in which unsupervised methods result in classifiers that do not predict knowledge, but instead predict a different prominent feature. We conclude that existing unsupervised methods for discovering latent knowledge are insufficient, and we contribute sanity checks to apply to evaluating future knowledge elicitation methods. Conceptually, we hypothesise that the identification issues explored here, e.g. distinguishing a model's knowledge from that of a simulated character, will persist for future unsupervised methods.
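
The consistency structure mentioned in the abstract can be made concrete with a small sketch. The loss below follows the form given in Burns et al. for contrast-consistent search (a consistency term plus a confidence term). The toy labels, the `arbitrary_feature` list, and the helper names `ccs_loss` and `mean_loss` are hypothetical and are not taken from this paper's experiments. Under those assumptions, it illustrates the flavour of the theoretical claim: an idealised probe that tracks an arbitrary binary feature can reach the same minimal loss as one that tracks truth.

```python
def ccs_loss(p_pos, p_neg):
    """Per-pair CCS loss in the form given by Burns et al.

    consistency: p(x+) should equal 1 - p(x-)
    confidence:  discourages the degenerate answer p(x+) = p(x-) = 0.5
    """
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    confidence = min(p_pos, p_neg) ** 2
    return consistency + confidence


def mean_loss(labels):
    """Mean loss of an idealised probe with p(x+) = label and p(x-) = 1 - label."""
    losses = [ccs_loss(float(h), 1.0 - float(h)) for h in labels]
    return sum(losses) / len(losses)


# Hypothetical toy data: one binary label per question (assumption, not paper data).
truth = [1, 0, 1, 1, 0]              # whether each statement is actually true
arbitrary_feature = [0, 0, 1, 0, 1]  # some unrelated binary property of the prompt

loss_truth = mean_loss(truth)
loss_arbitrary = mean_loss(arbitrary_feature)
loss_uninformative = ccs_loss(0.5, 0.5)  # a probe that always outputs 0.5

print(loss_truth, loss_arbitrary, loss_uninformative)  # prints: 0.0 0.0 0.25
```

Both idealised probes reach zero loss, so the loss alone cannot tell them apart. Which feature a trained probe actually recovers is an empirical question that this sketch does not address.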

Comments:	12 pages (38 including references and appendices). First three authors equal contribution, randomised order
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2312.10029 [cs.LG]
	(or arXiv:2312.10029v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2312.10029

Submission history

From: Zachary Kenton
[v1] Fri, 15 Dec 2023 18:49:43 UTC (3,216 KB)
[v2] Mon, 18 Dec 2023 16:43:35 UTC (3,217 KB)
