Latent Causal Probing:
A Formal Perspective on Probing with Causal Models of Data

Charles Jin
MIT CSAIL
[email protected]
Abstract

As language models (LMs) deliver increasing performance on a range of NLP tasks, probing classifiers have become an indispensable technique in the effort to better understand their inner workings. A typical setup involves (1) defining an auxiliary task consisting of a dataset of text annotated with labels, then (2) supervising small classifiers to predict the labels from the representations of a pretrained LM as it processed the dataset. A high probing accuracy is interpreted as evidence that the LM has learned to perform the auxiliary task as an unsupervised byproduct of its original pretraining objective. Despite the widespread usage of probes, however, the robust design and analysis of probing experiments remains a challenge. We develop a formal perspective on probing using structural causal models (SCM). Specifically, given an SCM which explains the distribution of tokens observed during training, we frame the central hypothesis as whether the LM has learned to represent the latent variables of the SCM. Empirically, we extend a recent study of LMs in the context of a synthetic grid-world navigation task, where having an exact model of the underlying causal structure allows us to draw strong inferences from the result of probing experiments. Our techniques provide robust empirical evidence for the ability of LMs to learn the latent causal concepts underlying text.

1 Introduction

As large LMs pretrained on massive amounts of unlabeled text continue to reach new heights in NLP tasks (and beyond), the question of what kinds of information such models encode about their training data remains a topic of intense discussion and research. One prominent technique is to supervise small probing classifiers to extract some linguistically relevant property from the representations of the pretrained LM (Shi et al., 2016; Adi et al., 2017; Alain & Bengio, 2018), with the intuition being that the success of the probe reveals the LM has, in fact, learned to encode the property of interest as a byproduct of its training.

Despite their widespread usage, however, probes themselves are also an active area of research, with a number of interconnected open questions in the design and interpretation of probing experiments (Belinkov, 2022), including:

(Q1) Control and interpretation.

Given that the probe itself is directly supervised to perform the auxiliary task, the observed outcomes could depend not only on the information inherently encoded in the LM but also the ability of the probe to extract the information itself. For instance, researchers have found that training probes to predict randomized labels can often yield comparably high accuracies on certain tasks, calling into question the significance of prior results (Hewitt & Liang, 2019). As a result, drawing robust conclusions from the classification accuracy of a probe remains up for debate.

(Q2) Classifier selection and training.

To combat the risk of measuring the probe’s capacity to learn the auxiliary task, researchers often limit probes to low capacity architectures such as linear classifiers (Maudslay et al., 2020). However, other works have countered with evidence that LMs encode more complex concepts using non-linear representations, which can only be accurately measured using higher capacity classifiers (Belinkov & Glass, 2019; Li et al., 2022). A related question which has received little attention is how the training procedure itself (e.g., optimizer selection, training hyperparameters, auxiliary dataset size) interacts with the outcome of the probing experiment.

(Q3) Auxiliary task design.

Finally, as large, pretrained LMs have progressed from producing human-like text to exhibiting increasingly “intelligent” behaviors such as reasoning and in-context learning (Brown et al., 2020), there is an emerging need to better understand the limitations and capabilities of LMs along dimensions such as world knowledge and theory of mind. These domains present a distinct set of challenges compared to traditional linguistic tasks such as part-of-speech tagging and dependency parsing.

The theoretical section of this paper develops a formal perspective on probing using the language of structural causal models (SCM). Specifically, given a causal model which explains the distribution of tokens observed during training, we pose the central hypothesis as determining whether the LM has learned to represent the latent causal variables of the SCM: concepts that explain how the text was generated, but are never directly observed during training. We then introduce probes as a means of empirically testing such hypotheses, by extracting the value of the latent concepts given only the LM representations as input. Our setting naturally captures broader questions about the inductive bias of LMs trained solely on text, and the latent concepts they acquire over the course of training (Q3).

Next, by extending the SCM beyond the data generation process to cover the training of the LM (unsupervised) and probe (supervised), we further show that Q1 and Q2 can be understood as the mediating and moderating effects of the probe, respectively. We propose a general technique based on causal mediation analysis which isolates the causal path through the LM while excluding the probe’s influence. Our analysis yields clear, testable conditions for accepting or rejecting our hypotheses based on a probing experiment’s outcomes.

Finally, we conduct an empirical study that extends the experimental setting introduced by Jin & Rinard (2023), who use probes to quantify the extent to which LMs are capable of learning “meaning” from text, as operationalized by the semantics of a synthetic programming language for grid-world navigation. By leveraging the proposed latent causal probing framework, our experiments allow us to draw precise conclusions about the causal relationship between the latent dynamics that generated the training data and what is learned by the LM. In particular, we find evidence that (1) the LM has, in fact, learned to represent the latent variables corresponding to the underlying semantics of the language, and (2) the LM representations exhibit an inductive bias that generalizes to novel action sequences. Our study marks the first rigorous empirical evaluation of the hypothesis that language models are latent concept learners, revealing intriguing insights into how language models might acquire an understanding of language.

2 Structural causal models of text

This section introduces the setting of our framework for probing, which is based on the idea that the text used to train LMs may exhibit latent causal structure; we formalize these concepts using the approach of structural causal models.

2.1 Background: structural causal models

Figure 1: An SCM for bringing an umbrella to work. (a) The original SCM. (b) Intervening on the forecast.

Structural causal models represent the causal relationships in a data generation process as a directed graphical model (Pearl et al., 2000). We illustrate the key concepts with an example; we refer the reader to Pearl (2010) for a more comprehensive overview. Suppose that we are interested in the effect the weather has on employees bringing an umbrella to work. In this case, we may hypothesize an SCM like the one in Figure 1(a). Each node represents a different random variable: the weather, the weather forecast, whether the employee’s morning gets off to a late start, and whether the employee brings an umbrella to work. Nodes without parents are exogenous variables, whose causes are left unexplained; they are often used to model nature, randomness, or other aspects of physical reality, such as genetic or environmental factors. The exogenous variables in Figure 1(a) are the weather and having a late start. Nodes with a parent indicate the possibility of a causal relationship, e.g., the edge from weather to forecast indicates that the weather might influence the forecast. In particular, every missing edge in the SCM asserts the lack of a causal relationship. A standard assumption of causal analysis is that the underlying causal graph is Markovian (or acyclic); we adopt this assumption as well.

Mediators and moderators.

Returning now to our original question of how the weather affects employees bringing an umbrella to work, we note the SCM hypothesizes 3 possible causes: the weather, the weather forecast, and having a late start. The forecast is a mediator because the total causal effect of the weather on the umbrella is partially transferred by the path-specific effect over the weather-forecast-umbrella pathway (Avin et al., 2005; Imai et al., 2010). A natural question is how much the forecast is responsible for the increase in likelihood that an employee brings an umbrella to work when, for instance, the weather changes from sunny to rainy. One answer is given by the necessary indirect effect, which quantifies how much the presence of the causal path through the mediator contributes to the total measured effect (Weinberger, 2019):

$$\text{NIE}_{\text{rain},\text{sun}}(\text{Forecast}) = \mathbb{E}\big[\text{Umbrella} \mid \text{Weather}=\text{rain}\big] - \mathbb{E}\big[\text{Umbrella} \mid \text{Weather}=\text{rain},\ do(\text{Forecast}=\text{sun})\big],$$

where $do(\text{Forecast}=\text{sun})$ is a causal intervention. The intervention can be conceptualized as forcing the weather station to forecast a sunny day independent of the weather, thereby severing the weather-forecast-umbrella pathway. Figure 1(b) depicts the SCM post-intervention.

The late start variable is not a mediator of the weather-umbrella causal effect (because there is no path from the weather to umbrella that passes through it), but it could still be a moderator: a variable that does not directly mediate a causal effect, but affects the strength (and possibly direction) of another causal path (Baron & Kenny, 1986). For instance, the forecast’s effect on whether an employee brings an umbrella (i.e., the NIE) might be lower if the employee has a late start and rushes out the door without checking the forecast.
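To make the mediator and moderator roles concrete, the following minimal sketch computes the NIE of the forecast by exact enumeration over a toy version of the umbrella SCM. All conditional probabilities and function names are hypothetical values chosen for illustration; none come from the paper or from data.

```python
# Toy umbrella SCM. Every probability below is a made-up, illustrative value.
P_LATE = 0.3  # P(LateStart = True); late start is exogenous


def p_forecast_rain(weather):
    """P(Forecast = rain | Weather)."""
    return 0.9 if weather == "rain" else 0.1


def p_umbrella(weather, forecast, late):
    """P(Umbrella = 1 | Weather, Forecast, LateStart)."""
    p = 0.1
    if weather == "rain":
        p += 0.3  # direct path: the employee sees the rain
    if forecast == "rain" and not late:
        p += 0.5  # mediated path, switched off by a late start (moderation)
    return p


def expected_umbrella(weather, do_forecast=None, late=None):
    """E[Umbrella | Weather], optionally under do(Forecast) and a fixed LateStart."""
    if late is None:
        late_values = [(True, P_LATE), (False, 1.0 - P_LATE)]
    else:
        late_values = [(late, 1.0)]
    total = 0.0
    for late_val, p_l in late_values:
        if do_forecast is None:
            pf = p_forecast_rain(weather)
            forecasts = [("rain", pf), ("sun", 1.0 - pf)]
        else:
            # The intervention severs the weather -> forecast edge.
            forecasts = [(do_forecast, 1.0)]
        for f, p_f in forecasts:
            total += p_l * p_f * p_umbrella(weather, f, late_val)
    return total


def nie(late=None):
    """NIE_{rain,sun}(Forecast), optionally conditioned on LateStart."""
    return (expected_umbrella("rain", late=late)
            - expected_umbrella("rain", do_forecast="sun", late=late))


print(nie(), nie(late=True), nie(late=False))
```

Under these assumed numbers the NIE vanishes when the employee has a late start and is largest otherwise, which is the sense in which the late start moderates the mediated effect without itself lying on a path from the weather to the umbrella.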

2.2 Case study: causal structure in programming languages

Figure 2: An SCM of the data generation process for the grid world corpus. The exogenous variables are the initial state and the actions; latent variables are green and observed variables are gray. The training corpus consists of programs of length between 6 and 10.

Jin & Rinard (2023) propose an experiment to study whether LMs are able to ground a sequence of actions into a sequence of states, having only seen examples of the initial and final state during training. Specifically, they train a 350M parameter Transformer (Vaswani et al., 2017) on a corpus of specification-program pairs using a standard causal language modeling objective. The programs are strings in a grid-world navigation language with 5 actions (move, turn_right, turn_left, put_marker, pick_marker), sampled uniformly between lengths 6 and 10, inclusive. The specifications consist of the initial and final state, which are 8x8 grids. Executing the program navigates a single robot in the initial state to the final state. We refer to Jin & Rinard (2023) for more details about the language.

Figure 2 displays an SCM of the data generation process (along with an example assignment of values to each variable). The exogenous variables are the initial state and the program actions. Each action produces a latent state (green), save for the last action, which is observed as the final state. A training sample consists of the sequence $s_0, s_n, p_1, \ldots, p_n$, where each grid world is converted to text by scanning in row order, with one token per entry.
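The data generation process can be sketched in a few lines of code. The transition rules below are a simplified, hypothetical stand-in for the actual language semantics (see Jin & Rinard (2023) for those); the state encoding, the `step` and `sample` names, and the boundary behavior are all assumptions, and the row-order tokenization of grids is omitted.

```python
import random

ACTIONS = ["move", "turn_right", "turn_left", "put_marker", "pick_marker"]
DIRS = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # north, east, south, west (clockwise)
SIZE = 8  # 8x8 grid


def step(state, action):
    """Toy transition function: state = (row, col, direction, markers)."""
    row, col, d, markers = state
    if action == "move":
        r, c = row + DIRS[d][0], col + DIRS[d][1]
        if 0 <= r < SIZE and 0 <= c < SIZE:  # assumed: moving off-grid is a no-op
            row, col = r, c
    elif action == "turn_right":
        d = (d + 1) % 4
    elif action == "turn_left":
        d = (d - 1) % 4
    elif action == "put_marker":
        markers = markers | {(row, col)}
    elif action == "pick_marker":
        markers = markers - {(row, col)}
    return (row, col, d, markers)


def sample(rng):
    """Draw exogenous variables, run the program, split observed from latent."""
    s0 = (rng.randrange(SIZE), rng.randrange(SIZE), rng.randrange(4), frozenset())
    n = rng.randint(6, 10)  # program length, inclusive on both ends
    program = [rng.choice(ACTIONS) for _ in range(n)]
    states = [s0]
    for action in program:
        states.append(step(states[-1], action))
    observed = (states[0], states[-1], program)  # s_0, s_n, p_1..p_n
    latent = states[1:-1]                        # s_1..s_{n-1}, never shown to the LM
    return observed, latent


observed, latent = sample(random.Random(0))
print(len(observed[2]), len(latent))
```

The point of the sketch is the split at the end of `sample`: only the initial state, final state, and program reach the training corpus, while the intermediate states remain latent.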

Consider now modeling a distribution of text drawn from this SCM. In particular, for each sample $x$ there is an assignment $e$ to the exogenous variables in the SCM $M$ such that $M(e)=x$. One strategy would be to learn a model of the SCM depicted in Figure 2, and integrate the latent variables during inference. For instance, knowing that the robot is one space away from $s_n$ in $s_{n-1}$ could help a learner predict $p_n=\texttt{move}$.

More generally, given observations generated according to some unknown causal mechanism, a learner could propose various SCMs of the underlying causal structure consistent with the observations, then use these SCMs to inform future predictions. A fundamental challenge in learning causal structure is the problem of latent variable induction, or inferring what latent variables to define candidate SCMs over. In this work, we focus on the causal structure of programming languages, where the underlying causal dynamics are governed by a precise formal semantics, and the latent variables are given by program states. Having formally defined semantics and latent variables enables us to interpret the results of our probing experiments in an unambiguous way; we refer the reader to Sloman (2005); Feder et al. (2022) for surveys of causal structure in natural language.

3 Latent causal probing

We present latent causal probing, a formal framework for empirically testing the hypothesis

Language models are latent causal concept learners.

At a high level, given an SCM that models the training data as the observed variables, we probe the LM for representations of the latent variables of the SCM. Our main insight, as illustrated in Figure 2, is that knowing the latent value of $s_{n-1}$ could help predict the observed value of $p_n$; hence, an LM trained to predict $p_n$ might eventually induce the existence of the latent variable $s_{n-1}$. More generally, we refer to any latent variable with a causal effect on the distribution of the training data as a latent causal concept.

3.1 Probing for latent variables

We begin by defining the auxiliary task and dataset for probing. Fix some structural causal model $M$, and let $v_M$ be the latent variable of interest. Given some text $x$, we use $v_M(x)$ to denote the value of the latent variable in the SCM of text $x$. For instance, the value of $v_M = s_1$ in the sample $x$ from Figure 2 is the grid depicted in the $s_1$ node. We assume that the value of each latent variable is uniquely determined by $x$ and $M$.

Given a language model $LM$ with parameters $\theta$, we denote an arbitrary representation function as $LM(x;\theta)$. The auxiliary dataset consists of input features $\{LM(x;\theta)\mid x\in D\}$ and labels $\{v_M(x)\mid x\in D\}$, where $D=\{x_i\}_{i=1}^{N}$ is a corpus of text. We then split $D$ into two auxiliary datasets: one for calibration and one for measurement. The probe is trained to predict $v_M(x)$ given $LM(x;\theta)$ on the calibration data, and the accuracy is taken over the measurement data. We next discuss the design and interpretation of these two datasets.
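The calibration and measurement protocol can be illustrated with a small self-contained sketch. It uses a nearest-centroid classifier as a stand-in for the probe and synthetic vectors as a stand-in for $LM(x;\theta)$; both are assumptions made for illustration, and the function names are hypothetical rather than taken from the experiments in this paper.

```python
import random


def fit_centroid_probe(features, labels):
    """Calibrate a nearest-centroid probe: one mean vector per label value."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(f))
        for i, v in enumerate(f):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}


def predict(probe, f):
    """Return the label whose centroid is closest to the representation f."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(f, centroid))
    return min(probe, key=lambda y: dist2(probe[y]))


def probe_accuracy(features, labels, calib_idx, meas_idx):
    """Fit on the calibration split, report accuracy on the measurement split."""
    probe = fit_centroid_probe([features[i] for i in calib_idx],
                               [labels[i] for i in calib_idx])
    hits = sum(predict(probe, features[i]) == labels[i] for i in meas_idx)
    return hits / len(meas_idx)


# Synthetic stand-ins: labels play the role of v_M(x) (e.g., a facing direction),
# features play the role of LM(x; theta) and noisily encode the label.
rng = random.Random(0)
labels = [rng.randrange(4) for _ in range(400)]
features = [[(1.0 if j == y else 0.0) + rng.gauss(0, 0.1) for j in range(4)]
            for y in labels]
calib_idx, meas_idx = list(range(300)), list(range(300, 400))
accuracy = probe_accuracy(features, labels, calib_idx, meas_idx)
print(accuracy)
```

Keeping the two index lists as explicit arguments makes it straightforward to calibrate on one split and measure on another, which is what the bound and free splits discussed next require.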

Bound vs. free latent variable outcomes.

In general, there may exist several causal dynamics that explain the data equally well. For instance, the following dynamics could also generate the data in Figure 2:

  1. put_marker: Jump to a random location.

  2. turn_right: Return to the last position, put a marker, then turn right.

These dynamics, which we denote $M'$, assign a different value to $s_1$, but explain the observed variables equally well. Assuming the training corpus consists entirely of this single example, it would be impossible to distinguish between $M$ and $M'$ on the basis of data alone. In other words:

  1. $M$ and $M'$ share the same set of latent, observed, and exogenous variables;

  2. $M$ and $M'$ agree on the observed data; and

  3. there exists an assignment $e$ to the exogenous variables such that $v_M(x) \neq v_{M'}(x')$ for $x = M(e)$ and $x' = M'(e)$.

In this case, we say that the latent variable $v$ is free over the assignment $e$. More generally, given a hypothesis class $\mathcal{M}$ of SCMs over the same set of variables, denote the LM training data as $D_{\text{train}}$ and define $\mathcal{M}|_{\text{train}}$ to be the subset of SCMs that generate $D_{\text{train}}$. The free latent variable outcomes consist of pairs of latent variables and assignments $(v,e)$ such that there exist $M, M' \in \mathcal{M}|_{\text{train}}$ where $v_M(M(e)) \neq v_{M'}(M'(e))$. Any latent variable outcome $(v,e)$ which is not free is bound, i.e., the training data $D_{\text{train}}$ fully specifies the outcome of $v$ on the assignment $e$, given the hypothesis class $\mathcal{M}$.
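The definition can be checked mechanically on a small example. The sketch below enumerates a hypothetical three-model hypothesis class over one exogenous observed variable, one latent variable $v$, and one observed outcome; the specific dynamics, the `make_scm` helper, and the variable names are all illustrative assumptions.

```python
E = [0, 1, 2, 3]  # all assignments to the single exogenous variable s_0


def make_scm(latent_fn, final_fn):
    """Build an SCM s_0 -> v -> s_1 from its two structural functions."""
    def run(e):
        v = latent_fn(e)
        return {"observed": (e, final_fn(v)), "latent": {"v": v}}
    return run


# Target dynamics: the latent state is one step ahead, the outcome one step further.
M = make_scm(lambda e: e + 1, lambda v: (v + 1) % 4)
# Alternative dynamics: same observations, but a different latent value when e == 3.
M_alt = make_scm(lambda e: e + 5 if e == 3 else e + 1, lambda v: (v + 1) % 4)
# Dynamics that disagree with the observed data and are therefore ruled out.
M_bad = make_scm(lambda e: e + 1, lambda v: (v + 2) % 4)


def generates(scm, data):
    """True if every observed sample can be produced by some assignment e."""
    producible = {scm(e)["observed"] for e in E}
    return all(x in producible for x in data)


def free_outcomes(hypotheses, data, latent_names):
    """Pairs (v, e) on which two data-consistent SCMs assign different values."""
    consistent = [m for m in hypotheses if generates(m, data)]
    free = set()
    for e in E:
        for name in latent_names:
            values = {m(e)["latent"][name] for m in consistent}
            if len(values) > 1:
                free.add((name, e))
    return free


D_train = [M(e)["observed"] for e in E]
free = free_outcomes([M, M_alt, M_bad], D_train, ["v"])
bound = {("v", e) for e in E} - free
print(sorted(free), sorted(bound))
```

In this toy class, only the outcome of $v$ at $e=3$ is free, because the two SCMs consistent with the training data disagree there and nowhere else; every other outcome is bound.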

                      measurement: bound              measurement: free
calibration: bound    deductive knowledge             inductive bias (inference)
calibration: free     deductive bias (consistency)    inductive knowledge

Table 1: Interpreting probing with different calibration and measurement datasets.

Probing with free vs. bound splits.

Table 1 details four possible probing setups when separating the auxiliary dataset $D$ into free and bound splits. In particular, when calibration and measurement occur on the same split, the probe quantifies the knowledge, or information content, that can be extracted from the LM representations; conversely, probing with different splits measures the transferability of the representations across different splits, which relies on implicit bias. Additionally, because the bound variable outcomes can, by definition, be deduced from the given data (and hypothesis class $\mathcal{M}$), measuring on the bound split relates to the deductive ability of the LM; conversely, measuring on the free split is inherently an inductive process. We highlight that the inductive bias can be understood as quantifying the capacity of the LM representations to infer values in unseen data by applying theories derived from known data, a form of inductive inference; while the deductive bias measures the extent to which the LM representations produce theories of unseen data that are consistent with the observed data, a key tenet of deductive logic.

3.2 Causal mediation analysis of probing

Figure 3: An SCM depicting the LM training, probe calibration, and probe measurement. We use plate notation for repeated iid samples, e.g., we draw $N$ samples for LM training.

We next turn to controlling for the probe (Q1). Intuitively, the challenge is that any measurement using a supervised probe conflates the LM’s representation of the auxiliary labels with the probe’s ability to learn the auxiliary task (Hewitt & Liang, 2019). While there exist a number of proposals for controlling for the contribution of the probe, such techniques typically do not provide any formal guarantees, rendering their correct application and interpretation a challenge (Belinkov, 2022).

We propose a method of disentangling the two effects using the formal framework of causal mediation analysis, and specifically, path-specific effects, which analyze how causal effects decompose over multiple causal paths (Avin et al., 2005; Imai et al., 2010). To begin, we extend the SCM of the data generation process to include (1) the LM training, (2) the probe calibration, and (3) the probe measurement. Figure 3 illustrates an example where the hypothesis class $\mathcal{M}$ consists of 3-variable SCMs with a single exogenous, observed variable $s_0$, a latent variable $v_M$, and another observed variable $s_1$.

Observe that there are three causal paths from the true SCM of the data generation process to the auxiliary task accuracy, each of which is mediated by a different set of the latent variables $v_M$: (1) during LM training, the LM is trained on a dataset whose text is causally affected by $v_M$; (2) during probe calibration, the probe is calibrated using $v_M$ directly; and (3) during probe measurement, the probe is evaluated for accuracy on $v_M$ directly. However, we only care to measure the effect over the first of these causal paths, i.e.:

To what extent can the auxiliary task performance be attributed to what the LM learns from the latent causal concepts in its training data?

This question can be posed formally as the necessary indirect effect of the paths mediated by the LM’s learned representation for some baseline causal dynamics $M'$:

$$\text{NIE}_{M,M'}(\theta_{\text{LM}}) = \mathbb{E}\big[\text{accuracy} \mid \text{LM is trained on } M,\ \text{probe is calibrated and measured on } M\big] - \mathbb{E}\big[\text{accuracy} \mid do(\text{LM is trained on } M'),\ \text{probe is calibrated and measured on } M\big].$$

Although path-specific effects offer a crisp conceptual framework for isolating the contribution of the LM in probing experiments, actually computing the NIE is not straightforward. First, picking a proper baseline $M'$ is critical; intuitively, if we pick an inappropriate $M'$, then the NIE will measure the difference between $M$ and $M'$ in addition to the latent causal concepts hypothetically mediated by the LM representations. Second, measuring the effect requires intervening along the path of interest while holding the other paths constant, i.e., we would need to retrain the LM according to the baseline $M'$, which would render the technique prohibitively expensive for large pretrained LMs.

Let $acc(M_0, M_1)$ denote the (expected) auxiliary task accuracy after the LM is trained using the SCM $M_0$ and the probe is calibrated and measured on $M_1$. The following result addresses these challenges (proof in Appendix C).

Definition 3.1 (Valid baseline).

$M'$ is a valid baseline for $M$ if

$$acc(M', M') \geq acc(M, M) \tag{1}$$
$$acc(M, M') \geq acc(M', M). \tag{2}$$

Proposition 3.2.

Let $M'$ be a valid baseline for $M$. Then

$$acc(M,M) - acc(M,M') > 0$$

implies both $\text{NIE}_{M,M'}(\theta_{\text{LM}}) > 0$ and $\text{NIE}_{M',M}(\theta'_{\text{LM}}) > 0$.
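One way to see why the conclusion follows is to rewrite each NIE in terms of $acc$ and apply the two conditions of Definition 3.1 together with the hypothesis $acc(M,M) - acc(M,M') > 0$. The chain below is an informal sketch under the notation above, assuming $\text{NIE}_{M,M'}(\theta_{\text{LM}}) = acc(M,M) - acc(M',M)$ as read off from the definitions of the NIE and of $acc$; it is not a reproduction of the proof in Appendix C.

```latex
\begin{align*}
\text{NIE}_{M,M'}(\theta_{\text{LM}})
  &= acc(M,M) - acc(M',M) \\
  &\geq acc(M,M) - acc(M,M') && \text{by (2)} \\
  &> 0, \\[4pt]
\text{NIE}_{M',M}(\theta'_{\text{LM}})
  &= acc(M',M') - acc(M,M') \\
  &\geq acc(M,M) - acc(M,M') && \text{by (1)} \\
  &> 0.
\end{align*}
```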

Intuitively, $M'$ is a valid baseline when measuring $M'$ is easier than measuring $M$ under both normal and intervened circumstances. The conclusion then states that, so long as $acc(M,M)-acc(M,M')>0$ (which can be evaluated by running probe calibration and measurement twice rather than training the LM twice), there is no bias in which SCM is used to train the LM and which is the baseline: the LM representations always mediate a positive amount of the measured effect.
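As a small illustration of how these quantities relate, the sketch below checks the valid-baseline conditions and evaluates both NIEs from a table of accuracies. The accuracy values are hypothetical numbers invented for the example, and the function names are assumptions; in practice only $acc(M,M)$ and $acc(M,M')$ are obtained without retraining the LM, and the remaining two entries are included here purely to show what the proposition guarantees.

```python
def is_valid_baseline(acc, target="M", baseline="M'"):
    """Check conditions (1) and (2) of the valid-baseline definition.

    acc[(m0, m1)] is the accuracy when the LM is trained on m0 and the probe
    is calibrated and measured on m1.
    """
    cond1 = acc[(baseline, baseline)] >= acc[(target, target)]
    cond2 = acc[(target, baseline)] >= acc[(baseline, target)]
    return cond1 and cond2


def nie(acc, target, baseline):
    """NIE of the LM path: retrain the LM on the baseline, keep the probe on target."""
    return acc[(target, target)] - acc[(baseline, target)]


def measured_gap(acc, target="M", baseline="M'"):
    """The cheap quantity: only the probe is re-run, the LM is trained once."""
    return acc[(target, target)] - acc[(target, baseline)]


# Hypothetical accuracies, for illustration only.
acc = {
    ("M", "M"): 0.85,
    ("M", "M'"): 0.60,
    ("M'", "M'"): 0.90,
    ("M'", "M"): 0.55,
}

print(is_valid_baseline(acc), measured_gap(acc))
print(nie(acc, "M", "M'"), nie(acc, "M'", "M"))
```

With these assumed values the baseline is valid and the measured gap is positive, so both NIEs come out positive, matching the statement of the proposition.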

A positive NIE now also has a rigorous interpretation as the LM having learned latent causal concepts, as some positive amount of causal effect is transferred through the representations of the LM. For instance, a positive mediated measurement for inductive bias implies that

The presence of latent causal concepts in the pretraining data causes the LM to learn representations that generalize to unknown data.

3.3 Discussion

We summarize the latent causal probing framework as follows:

  1. Select a set of exogenous, latent, and observed variables and pick a hypothesis class $\mathcal{M}$ of SCMs (from the set of all Markovian SCMs over the variables).

  2. Fix a specific target SCM $M \in \mathcal{M}$ and a set of latent variables $v \in M$ to test.

  3. Construct the auxiliary dataset and create the bound vs. free partition (if possible).

  4. Identify a valid baseline $M'$ and perform the mediation analysis.

A significant measurement $acc(M,M)-acc(M,M')>0$ is interpreted as evidence that the LM encodes causal concepts in its representations. We conclude with some remarks.

Interventions, and probing for non-causal latent variables.

Our mediation technique requires knowing “what would the text have been if the underlying dynamics were different?”, which could be difficult (especially in non-synthetic domains). Similarly, for non-causal latent variables, such as part-of-speech, producing a hypothesis class $\mathcal{M}$ with more than one SCM may not be possible: what would the data look like in a counterfactual world in which “dog” is actually an adverb? Unfortunately, our analysis shows that a baseline $M'$ which induces a different distribution of text is a necessary precondition, since otherwise $\text{NIE}_{M,M'}(\theta_{\text{LM}})$ and $\text{NIE}_{M',M}(\theta'_{\text{LM}})$ cannot both be positive (as $M$ and $M'$ are indistinguishable when used to train the LM). Intuitively, we interpret this result as saying any measurement is inherently biased when the auxiliary task has only one “right” answer.

Probe architecture and hyperparameters.

Our framework also explicates the role of the probe’s architecture and other hyperparameters in the training process, such as the optimizer, learning rate, dataset size, etc., as potential moderators, but not mediators (Q2). In other words, so long as there exists a choice of hyperparameters such that the NIE is positive, the analysis concludes that there exists a causal effect mediated by the model’s parameters (although certain settings of the moderator variables could offer additional interpretations). Practically speaking, our framework also offers a novel way to interpret (and justify) complex probes (Voita & Titov, 2020; Pimentel & Cotterell, 2021).

4 Experiments

We conduct an empirical study of whether an LM, trained from scratch on a corpus of program data, learns the latent causal concepts in the underlying data generation process.

4.1 Methods

We describe the key steps according to the framework in Section 3.3; Section A.1 contains full experimental details (e.g., LM and probe architecture and training, LM representations).

Hypothesis class.

The exogenous variables are the initial state and program. The latent variables are the intermediate states, and the observed variables are the initial and final state and the program. The hypothesis class $\mathcal{M}$ is all Markovian SCMs over the variables.

Target SCM and latent variables.

The target SCM $M \in \mathcal{M}$ is the true data generation process in Figure 2. The target latent variables consist of the robot’s position, facing direction, and whether the robot is facing a rock for each intermediate state.

Auxiliary dataset construction and bound and free latent variable outcomes.

For the auxiliary dataset, we use the same data generation process, except that programs range in length between 1 and 15, and we replace the final state in the specification with the initial state. Due to the size of the training corpus (1 million samples), we assume the LM observed all combinations of the exogenous variables. Because the programs in the LM training corpus are between length 5 and 9, the bound latent variables are $s_6$ to $s_{10}$ (they are observed during training as the final state). The free latent variables are $s_1$ to $s_5$ and $s_{11}$ to $s_{15}$.
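The split into bound and free latent variables can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `split_latents` and its parameters are hypothetical, and the index ranges are copied directly from the paragraph above.

```python
def split_latents(max_program_len=15, bound_first=6, bound_last=10):
    """Partition intermediate-state indices into bound and free sets.

    Index ranges follow the text: s_6..s_10 are bound, all other
    intermediate states up to s_15 are free. The function name and
    arguments are illustrative assumptions.
    """
    all_states = range(1, max_program_len + 1)
    bound = [i for i in all_states if bound_first <= i <= bound_last]
    free = [i for i in all_states if i not in bound]
    return bound, free


bound, free = split_latents()
print("bound:", [f"s_{i}" for i in bound])
print("free:", [f"s_{i}" for i in free])
```

Keeping the two index sets separate makes it straightforward to calibrate a probe on one set of outcomes and measure it on the other.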

Valid baseline.

We construct valid baselines according to a counterfactual state of the world where the intermediate states are generated by executing the program according to a different set of causal dynamics. Specifically, we define $M'$ using the same SCM, but permute the causal dynamics of the turn_right, turn_left, and move actions (e.g., the robot turns left when executing a turn_right action). As $M$ and $M'$ are clearly symmetric from a language modeling perspective, Definition 3.1 (and hence Proposition 3.2) holds.

Figure 4: Results from the main experiments. Panels: (a) Deductive knowledge; (b) Deductive knowledge, mediated; (c) Inductive bias; (d) Inductive bias, mediated; (e) Deductive bias; (f) Deductive bias, mediated; (g) Inductive knowledge; (h) Inductive knowledge, mediated. Solid, dashed, and dotted green lines plot the accuracy of a linear, 1-layer MLP, and 2-layer MLP probing classifier, respectively. The final model achieves 92.4% accuracy on generating correct programs for unseen specifications.

4.2 Results

Figure 4 plots the main results. For all four measurements (deductive knowledge, inductive bias, deductive bias, and inductive knowledge) and across all three probes (linear, 1-layer MLP, 2-layer MLP), the mediated measurements are significantly positive (above the dashed line at 0%) by the end of training. We thus conclude that a positive fraction of the observed measurements of causal concepts can be attributed to what is learned by the LM’s representations. We also note that the linear probe exhibits the lowest mediated measurements of the 3 probes across all four tasks. This suggests that more complex auxiliary tasks require more complex probes and, more generally, highlights the importance of probing frameworks that can accommodate deeper probes. Section A.2 contains further results and analyses.

5 Related work

Causal interpretability of LMs

Several prior lines of work apply causal techniques to the interpretability of LMs. These works typically intervene on either the model’s representations (Elazar et al., 2021; Geiger et al., 2021; Meng et al., 2022; Abraham et al., 2022; Li et al., 2022) or the model’s inputs (Kaushik et al., 2020; Vig et al., 2020; Gangal & Hovy, 2020; Amini et al., 2023), and analyze the causal effect on the LM’s outputs. In contrast, we present a formal framework that, conceptually, intervenes on the model’s training data, and measures the causal effect on the LM’s internal representations as a proxy for knowledge. To the best of our knowledge, Elazar et al. (2022) is the only other work that studies the causal relationship between the training data and the LM, presenting a technique for estimating the causal effect of dataset statistics on the factuality of LM’s outputs.

Causal knowledge and reasoning in LMs

A number of benchmarks test for causal knowledge in pretrained LMs by eliciting outputs on causal reasoning tasks; we refer to Yang et al. (2023) for a survey. Recently, however, researchers have raised concerns that performance on such benchmarks may not correspond to true causal reasoning (Zhang et al., 2023; Zečević et al., 2023). As Yang et al. (2023) state, “The issue here is that the LLM does not need to actually reason at all: it can simply access its training dataset, which contains millions of stories about weather and umbrellas, and approximately retrieve a response.” Similarly, Wu et al. (2023) present empirical evidence that pretrained LMs fail to reason on counterfactual task variants, and attribute their apparent reasoning capabilities to recall from the training data. In contrast, we conduct our experiments in a controlled setting where the LM is trained from scratch. Our focus is also on the ability of the LM to represent and generalize the latent causal concepts, rather than knowledge of the causal relationships between the concepts. Finally, the hypothesis that LMs can learn latent causal concepts is also highly related to the position developed by Andreas (2022), who argues that LMs could model properties of agents that are likely to have uttered the language in their training data.

Frameworks for probing

A number of works have proposed frameworks toward a more rigorous understanding of probing. One line takes an information-theoretic view on the information represented by the LM (Zhu & Rudzicz, 2020; Pimentel et al., 2020; Voita & Titov, 2020; Pimentel & Cotterell, 2021). In contrast, we use probes to identify causal effects and inductive biases. Immer et al. (2022) propose an interpretation of probing as quantifying the inductive bias of pretrained representations for downstream tasks, but their framework differs significantly from ours in that the model is understood as a representation-probe pair. In contrast, our approach unifies the LM training and probe calibration procedures under a single causal framework for analysis, yielding formal guarantees for the control of probes. Our analysis also reveals settings in which prior techniques, such as control tasks (Hewitt & Liang, 2019), can yield biased estimates of the intended auxiliary measurement.

6 Conclusion

This paper presents latent causal probing, a probing framework that studies whether LMs learn latent causal concepts as a byproduct of the language modeling objective. Our framework offers robust tools for interpreting experiment results through the lens of causal analysis, and in particular, rigorously controls for the probe’s contribution in learning the auxiliary task. Experimentally, we extend a previous study of whether Transformers can infer the intermediate states that underlie a sequence of actions. Our results provide strong empirical evidence that LMs can induce latent causal concepts from textual pretraining.

References

  • Abraham et al. (2022) Eldar David Abraham, Karel D’Oosterlinck, Amir Feder, Yair Ori Gat, Atticus Geiger, Christopher Potts, Roi Reichart, and Zhengxuan Wu. CEBab: Estimating the causal effects of real-world concepts on NLP model behavior. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=3AbigH4s-ml.
  • Adi et al. (2017) Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=BJh6Ztuxl.
  • Alain & Bengio (2018) Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes, 2018.
  • Amini et al. (2023) Afra Amini, Tiago Pimentel, Clara Meister, and Ryan Cotterell. Naturalistic Causal Probing for Morpho-Syntax. Transactions of the Association for Computational Linguistics, 11:384–403, 05 2023. ISSN 2307-387X. doi: 10.1162/tacl_a_00554. URL https://doi.org/10.1162/tacl_a_00554.
  • Andreas (2022) Jacob Andreas. Language models as agent models. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 5769–5779, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-emnlp.423. URL https://aclanthology.org/2022.findings-emnlp.423.
  • Avin et al. (2005) Chen Avin, Ilya Shpitser, and Judea Pearl. Identifiability of path-specific effects. In IJCAI International Joint Conference on Artificial Intelligence, pp.  357–363, 2005.
  • Baron & Kenny (1986) Reuben M Baron and David A Kenny. The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of personality and social psychology, 51(6):1173, 1986.
  • Belinkov (2022) Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219, March 2022. doi: 10.1162/coli_a_00422. URL https://aclanthology.org/2022.cl-1.7.
  • Belinkov & Glass (2019) Yonatan Belinkov and James Glass. Analysis Methods in Neural Language Processing: A Survey. Transactions of the Association for Computational Linguistics, 7:49–72, 04 2019. ISSN 2307-387X. doi: 10.1162/tacl_a_00254. URL https://doi.org/10.1162/tacl_a_00254.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Elazar et al. (2021) Yanai Elazar, Shauli Ravfogel, Alon Jacovi, and Yoav Goldberg. Amnesic Probing: Behavioral Explanation with Amnesic Counterfactuals. Transactions of the Association for Computational Linguistics, 9:160–175, 03 2021. ISSN 2307-387X. doi: 10.1162/tacl_a_00359. URL https://doi.org/10.1162/tacl_a_00359.
  • Elazar et al. (2022) Yanai Elazar, Nora Kassner, Shauli Ravfogel, Amir Feder, Abhilasha Ravichander, Marius Mosbach, Yonatan Belinkov, Hinrich Schütze, and Yoav Goldberg. Measuring causal effects of data statistics on language model’s ‘factual’ predictions. arXiv preprint arXiv:2207.14251, 2022.
  • Feder et al. (2022) Amir Feder, Katherine A Keith, Emaad Manzoor, Reid Pryzant, Dhanya Sridhar, Zach Wood-Doughty, Jacob Eisenstein, Justin Grimmer, Roi Reichart, Margaret E Roberts, et al. Causal inference in natural language processing: Estimation, prediction, interpretation and beyond. Transactions of the Association for Computational Linguistics, 10:1138–1158, 2022.
  • Gangal & Hovy (2020) Varun Gangal and Eduard Hovy. BERTering RAMS: What and how much does BERT already know about event arguments? - a study on the RAMS dataset. In Afra Alishahi, Yonatan Belinkov, Grzegorz Chrupała, Dieuwke Hupkes, Yuval Pinter, and Hassan Sajjad (eds.), Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pp. 1–10, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.blackboxnlp-1.1. URL https://aclanthology.org/2020.blackboxnlp-1.1.
  • Geiger et al. (2021) Atticus Geiger, Hanson Lu, Thomas F Icard, and Christopher Potts. Causal abstractions of neural networks. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=RmuXDtjDhG.
  • Hewitt & Liang (2019) John Hewitt and Percy Liang. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  2733–2743, 2019.
  • Imai et al. (2010) Kosuke Imai, Luke Keele, and Dustin Tingley. A general approach to causal mediation analysis. Psychological methods, 15(4):309, 2010.
  • Immer et al. (2022) Alexander Immer, Lucas Torroba Hennigen, Vincent Fortuin, and Ryan Cotterell. Probing as quantifying inductive bias. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  1839–1851, 2022.
  • Jin & Rinard (2023) Charles Jin and Martin Rinard. Evidence of meaning in language models trained on programs, 2023.
  • Kaushik et al. (2020) Divyansh Kaushik, Eduard Hovy, and Zachary Lipton. Learning the difference that makes a difference with counterfactually-augmented data. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Sklgs0NFvr.
  • Li et al. (2022) Kenneth Li, Aspen K Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Emergent world representations: Exploring a sequence model trained on a synthetic task. In The Eleventh International Conference on Learning Representations, 2022.
  • Loshchilov & Hutter (2019) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
  • Maudslay et al. (2020) Rowan Hall Maudslay, Josef Valvoda, Tiago Pimentel, Adina Williams, and Ryan Cotterell. A tale of a probe and a parser. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7389–7395, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.659. URL https://aclanthology.org/2020.acl-main.659.
  • Meng et al. (2022) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. Advances in Neural Information Processing Systems, 35:17359–17372, 2022.
  • Nijkamp et al. (2023) Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis. ICLR, 2023.
  • Pearl (2010) Judea Pearl. An introduction to causal inference. The international journal of biostatistics, 6(2), 2010.
  • Pearl et al. (2000) Judea Pearl et al. Models, reasoning and inference. Cambridge, UK: Cambridge University Press, 19(2):3, 2000.
  • Pimentel & Cotterell (2021) Tiago Pimentel and Ryan Cotterell. A Bayesian framework for information-theoretic probing. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 2869–2887, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.229. URL https://aclanthology.org/2021.emnlp-main.229.
  • Pimentel et al. (2020) Tiago Pimentel, Josef Valvoda, Rowan Hall Maudslay, Ran Zmigrod, Adina Williams, and Ryan Cotterell. Information-theoretic probing for linguistic structure. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  4609–4622, 2020.
  • Shi et al. (2016) Xing Shi, Inkit Padhi, and Kevin Knight. Does string-based neural mt learn source syntax? In Proceedings of the 2016 conference on empirical methods in natural language processing, pp.  1526–1534, 2016.
  • Sloman (2005) Steven Sloman. Locating Causal Structure in Language. In Causal Models: How People Think about the World and Its Alternatives. Oxford University Press, 08 2005. ISBN 9780195183115. doi: 10.1093/acprof:oso/9780195183115.003.0011. URL https://doi.org/10.1093/acprof:oso/9780195183115.003.0011.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Vig et al. (2020) Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. Investigating gender bias in language models using causal mediation analysis. Advances in neural information processing systems, 33:12388–12401, 2020.
  • Voita & Titov (2020) Elena Voita and Ivan Titov. Information-theoretic probing with minimum description length. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 183–196, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.14. URL https://aclanthology.org/2020.emnlp-main.14.
  • Weinberger (2019) Naftali Weinberger. Path-specific effects. British Journal for the Philosophy of Science, 70(1):53–76, 2019. doi: 10.1093/bjps/axx040.
  • Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45, Online, October 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-demos.6. URL https://aclanthology.org/2020.emnlp-demos.6.
  • Wu et al. (2023) Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Akyürek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, and Yoon Kim. Reasoning or reciting? exploring the capabilities and limitations of language models through counterfactual tasks. arXiv preprint arXiv:2307.02477, 2023.
  • Yang et al. (2023) Linying Yang, Oscar Clivio, Vik Shirvaikar, and Fabian Falck. A critical review of causal inference benchmarks for large language models. In AAAI 2024 Workshop on “Are Large Language Models Simply Causal Parrots?”, 2023. URL https://openreview.net/forum?id=mRwgczYZFJ.
  • Zečević et al. (2023) Matej Zečević, Moritz Willig, Devendra Singh Dhami, and Kristian Kersting. Causal parrots: Large language models may talk causality but are not causal. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=tv46tCzs83.
  • Zhang et al. (2023) Cheng Zhang, Stefan Bauer, Paul Bennett, Jiangfeng Gao, Wenbo Gong, Agrin Hilmkil, Joel Jennings, Chao Ma, Tom Minka, Nick Pawlowski, et al. Understanding causality with large language models: Feasibility and opportunities. arXiv preprint arXiv:2304.05524, 2023.
  • Zhu & Rudzicz (2020) Zining Zhu and Frank Rudzicz. An information theoretic view on selecting linguistic probes. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  9251–9262, 2020.

Appendix A Additional experimental details and results

A.1 Language model and probe details

Following Jin & Rinard (2023), the language model is a 350M parameter CodeGen model (Nijkamp et al., 2023) taken from the HuggingFace Transformers library (Wolf et al., 2020). The model was trained for 2.5 billion tokens, which was roughly 6 passes or 80000 training batches over the training corpus. We refer to Jin & Rinard (2023) for further details.

We next describe the design and training of the probing classifiers; these notes apply to all the probing experiments, unless otherwise noted. The linear probe is a single linear layer. The MLP probes have ReLU, batch_norm, then dropout(p=.2) after each linear layer. The hidden dimensions of the 1-layer and 2-layer MLP probes were (256,) and (256, 1024), respectively. The auxiliary datasets consisted of 500000 randomly selected samples. To extract representations from the LM, we use the same strategy as Jin & Rinard (2023), averaging the LM hidden states over the layer dimension after processing each program token. Probes were trained using AdamW (Loshchilov & Hutter, 2019) with weight decay of 1e-4. The learning rate starts at 0.01, then decays by .1 at 75% and 90% through training. All probes are trained for 2000000 steps using a batch size of 256.
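The learning-rate schedule described above can be sketched in plain Python. This is an illustration only: it assumes that “decays by .1” means multiplying the rate by 0.1 at each milestone, and the function name `probe_learning_rate` and its parameters are hypothetical.

```python
def probe_learning_rate(step, total_steps, base_lr=0.01, gamma=0.1,
                        milestones=(0.75, 0.90)):
    """Step-decay schedule for the probe optimizer.

    Assumption: the rate is multiplied by `gamma` once training passes
    each milestone, given as a fraction of `total_steps`.
    """
    passed = sum(1 for m in milestones if step >= m * total_steps)
    return base_lr * gamma ** passed


total = 2_000_000  # number of probe training steps stated in the text
for step in (0, 1_600_000, 1_900_000):
    print(step, probe_learning_rate(step, total))
```

In a framework such as PyTorch, the same behavior would typically be obtained from a built-in multi-step scheduler rather than a hand-written function.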

For the mediated results reported in Figure 4, we generated the auxiliary dataset using an SCM that maps turn_right to turn_left, turn_left to move, and move to turn_right.
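To make the permuted dynamics concrete, the following is a minimal sketch of how intermediate states could be relabeled under the baseline SCM. It assumes an obstacle-free, unbounded grid with the robot starting at the origin facing north; the actual grid world (with rocks and walls) is richer, and the names `step`, `run`, `M`, and `M_PRIME` are hypothetical.

```python
# Headings in clockwise order: north, east, south, west.
HEADINGS = [(0, 1), (1, 0), (0, -1), (-1, 0)]


def step(state, primitive):
    """Apply one primitive effect to a state (x, y, heading_index)."""
    x, y, h = state
    if primitive == "turn_right":
        return (x, y, (h + 1) % 4)
    if primitive == "turn_left":
        return (x, y, (h - 1) % 4)
    if primitive == "move":
        dx, dy = HEADINGS[h]
        return (x + dx, y + dy, h)
    return state  # assumption: any other action leaves the pose unchanged


# Original dynamics: every action has its usual effect.
M = {"turn_right": "turn_right", "turn_left": "turn_left", "move": "move"}
# Baseline from the text: turn_right -> turn_left, turn_left -> move,
# move -> turn_right.
M_PRIME = {"turn_right": "turn_left", "turn_left": "move", "move": "turn_right"}


def run(program, dynamics, start=(0, 0, 0)):
    """Return the intermediate states s_1..s_n of `program` under `dynamics`."""
    states, state = [], start
    for action in program:
        state = step(state, dynamics.get(action, action))
        states.append(state)
    return states


program = ["move", "turn_right", "move"]
print(run(program, M))
print(run(program, M_PRIME))
```

The program text is identical in both runs; only the labels attached to the intermediate states change, which is what lets the baseline isolate the contribution of the LM representations.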

A.2 Ablation studies

This section presents some ablation studies on the setup of the probing experiments.

A.2.1 Valid baseline selection

Figure 5: Mediating with the valid baseline that swaps move and turn_left. Panels: (a) Deductive knowledge, mediated; (b) Inductive bias, mediated; (c) Deductive bias, mediated; (d) Inductive knowledge, mediated.
Figure 6: Mediating with the valid baseline that swaps turn_right and turn_left. Panels: (a) Deductive knowledge, mediated; (b) Inductive bias, mediated; (c) Deductive bias, mediated; (d) Inductive knowledge, mediated.

To test the sensitivity of the mediated results (and hence, overall conclusions) to the choice of valid baseline, we generate two additional auxiliary datasets with the following SCMs:

  1. swap move and turn_left

  2. swap turn_right and turn_left

The results are plotted in Figure 5 and Figure 6, respectively. We find that the mediated measurements in the first case are nearly identical to those in Figure 4, despite only swapping two actions (instead of permuting 3). However, in the second case, the mediated measurements are essentially noise, centered around 0. We attribute this to the fact that the resulting labels are extremely similar, as, in most cases, the robot is simply reflected along the starting axis. We emphasize that a negative result for one valid baseline does not constitute evidence to reject the hypothesis, and that a single positive result from a valid baseline is sufficient to accept the hypothesis.
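The reflection argument can be illustrated with a small simulation. This is a sketch under simplifying assumptions (an obstacle-free, unbounded grid, with the robot starting at the origin facing north), not the authors' environment; the function name `positions` is hypothetical.

```python
# Headings in clockwise order: north, east, south, west.
HEADINGS = [(0, 1), (1, 0), (0, -1), (-1, 0)]


def positions(program, swap_turns=False, start=(0, 0, 0)):
    """Return the robot's (x, y) position after each action.

    With swap_turns=True, turn_left and turn_right exchange effects,
    mimicking the second baseline in the list above.
    """
    x, y, h = start
    out = []
    for action in program:
        if swap_turns and action in ("turn_left", "turn_right"):
            action = "turn_left" if action == "turn_right" else "turn_right"
        if action == "turn_right":
            h = (h + 1) % 4
        elif action == "turn_left":
            h = (h - 1) % 4
        elif action == "move":
            dx, dy = HEADINGS[h]
            x, y = x + dx, y + dy
        out.append((x, y))
    return out


program = ["move", "turn_right", "move", "move", "turn_left", "move"]
original = positions(program)
swapped = positions(program, swap_turns=True)
# Mirror the original trajectory across the axis of the starting heading.
mirrored = [(-x, y) for x, y in original]
print(swapped == mirrored)
```

Under these assumptions the swapped trajectory is exactly the mirror image of the original, so any label that is symmetric under this reflection is unchanged by the baseline.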

A.2.2 Probe architecture and hyperparameters

Figure 7: Mediating with the valid baseline from the main text. Both the original and the mediated measurements are retaken using the probe architecture and hyperparameters from Jin & Rinard (2023). Panels: (a) Deductive knowledge, mediated; (b) Inductive bias, mediated; (c) Deductive bias, mediated; (d) Inductive knowledge, mediated.

We next ablate the probe architecture and hyperparameters by adopting the settings used in Jin & Rinard (2023). The differences are: no dropout, a batch size of 1024, training the probe for 10000000 steps, and using 100000 samples in the auxiliary dataset. We use the same valid baseline as in Figure 4 of the main text.

The results are plotted in Figure 7. We observe that the general trends are preserved, with all four mediated measurements ending above 0% by the end of training. However, we note that both deductive and inductive knowledge measure slightly lower, which is an example of the moderating effect of the probe architecture and training hyperparameters. We attribute the effect to the increased batch size and lack of dropout, which could encourage the probe to converge more quickly to a global optimum, given that the risk of overfitting is low (due to the large size and high quality of the training dataset). This is also consistent with (1) the general intuition that simpler (or less optimal) probes are a proxy for “ease of extraction,” which is often interpreted as evidence that the representations are “more aligned” with the target features (Hewitt & Liang, 2019), and (2) the theoretical findings in Pimentel et al. (2020), who conclude that probes of infinite capacity are most informative for measuring syntactic knowledge.

Appendix B Comparison with Jin & Rinard (2023)

In this section, we highlight several key departures from the experimental design in Jin & Rinard (2023).

First, they do not split their auxiliary dataset into bound and free latent variable outcomes, and hence their results do not yield fine-grained interpretations about probing with different calibration and measurement datasets.

Second, our analysis reveals the presence of possible confounders in the design of their interventional baseline, leading to uncontrolled effects. In particular, the auxiliary dataset is constructed using programs generated by the LM itself, rather than randomly sampled as we do. Intuitively, this means that the LM “sees” both $s_0$ and $s_n$, which reveals information about the original causal dynamics. Formally, the representations of the LM used for probing mediate all 3 causal pathways, rather than the simple causal pathway from the LM training data (as in Figure 3), and hence their interventional baseline is not a proper measurement of the causal effect mediated by the LM representations. Our solution is to use randomly sampled programs and replace the occurrence of $s_n$ with $s_0$ in the construction of the auxiliary dataset, which breaks this causal dependence.

Finally, Jin & Rinard (2023) do not verify that their interventional baselines satisfy the conditions in Equations 1 and 2. In particular, one of their baselines maps the put_marker and pick_marker actions to turn_right and turn_left, respectively, in addition to permuting the turn_right, turn_left, and move actions. Because the extracted features all relate to the position and direction of the robot, the new dynamics could present a more difficult task (for both the LM and the probe) due to replacing what were effectively no-ops (put_marker and pick_marker) with new operations that affect the position or direction (turn_right, turn_left, and move). Hence, the observed drop in accuracy post-intervention could be attributable to increased task difficulty, rather than the learned representations of the LM.

Appendix C Proofs

Proof of Proposition 3.2.

The proof follows directly from substituting the appropriate assumptions into the definitions of NIE. Recall that

\begin{align}
\text{NIE}_{M,M'}(\theta_{\text{LM}}) &:= acc(M,M) - acc(M',M) \tag{3} \\
\text{NIE}_{M',M}(\theta'_{\text{LM}}) &:= acc(M',M') - acc(M,M'), \tag{4}
\end{align}

and, by Definition 3.1,

\begin{align}
acc(M',M') &\geq acc(M,M) \tag{5} \\
acc(M,M') &\geq acc(M',M). \tag{6}
\end{align}

Applying Equation 5 to the definitions of NIE,

\begin{align}
\text{NIE}_{M,M'}(\theta_{\text{LM}}) &\leq acc(M',M') - acc(M',M) \tag{7} \\
&= \text{NIE}_{M',M}(\theta'_{\text{LM}}). \tag{8}
\end{align}

Applying Equation 6 to the definition of NIE,

\begin{align}
\text{NIE}_{M,M'}(\theta_{\text{LM}}) &= acc(M,M) - acc(M',M) \tag{9} \\
&\geq acc(M,M) - acc(M,M'). \tag{10}
\end{align}

Hence,

\begin{align}
acc(M,M) - acc(M,M') \leq \text{NIE}_{M,M'}(\theta_{\text{LM}}) \leq \text{NIE}_{M',M}(\theta'_{\text{LM}}) \tag{11}
\end{align}
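As a numerical illustration of the lower bound in Equations 9 and 10, the sketch below computes the practical measurement and the NIE from a table of accuracies. All accuracy values are hypothetical and chosen only to satisfy Equation 6; the names `nie`, `measurement`, and the string keys are assumptions made for this example.

```python
def nie(acc, lm, other):
    """NIE_{lm,other}: acc(lm, lm) - acc(other, lm), as in Equations 3 and 4."""
    return acc[(lm, lm)] - acc[(other, lm)]


def measurement(acc, lm, other):
    """Practical measurement acc(lm, lm) - acc(lm, other)."""
    return acc[(lm, lm)] - acc[(lm, other)]


# Hypothetical accuracies keyed by the two arguments of acc(., .).
acc = {
    ("M", "M"): 0.90,
    ("M'", "M'"): 0.95,
    ("M", "M'"): 0.60,
    ("M'", "M"): 0.58,
}

# Equation 6 is the only assumption needed for the lower bound.
assert acc[("M", "M'")] >= acc[("M'", "M")]

lower = measurement(acc, "M", "M'")
effect = nie(acc, "M", "M'")
print(f"measurement = {lower:.2f}, NIE = {effect:.2f}")
```

Because the measurement only requires a single trained LM, a positive value gives a conservative estimate of the effect whenever Equation 6 holds.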