research-article

Open access

Cross-Model Comparative Loss for Enhancing Neuronal Utility in Language Understanding

Authors:

Xueqi ChengAuthors Info & Claims

ACM Transactions on Information Systems, Volume 42, Issue 5

Article No.: 123, Pages 1 - 29

https://doi.org/10.1145/3652599

Published: 27 April 2024 Publication History

PDF eReader

Abstract

Current natural language understanding (NLU) models have been continuously scaling up, both in terms of model size and input context, introducing more hidden and input neurons. While this generally improves performance on average, the extra neurons do not yield a consistent improvement for all instances. This is because some hidden neurons are redundant, and the noise mixed in input neurons tends to distract the model. Previous work mainly focuses on extrinsically reducing low-utility neurons by additional post- or pre-processing, such as network pruning and context selection, to avoid this problem. Beyond that, can we make the model reduce redundant parameters and suppress input noise by intrinsically enhancing the utility of each neuron? If a model can efficiently utilize neurons, no matter which neurons are ablated (disabled), the ablated submodel should perform no better than the original full model. Based on such a comparison principle between models, we propose a cross-model comparative loss for a broad range of tasks. Comparative loss is essentially a ranking loss on top of the task-specific losses of the full and ablated models, with the expectation that the task-specific loss of the full model is minimal. We demonstrate the universal effectiveness of comparative loss through extensive experiments on 14 datasets from three distinct NLU tasks based on five widely used pre-trained language models and find it particularly superior for models with few parameters or long input.

1 Introduction

Natural Language Understanding (NLU) has been pushed a remarkable step forward by deep neural models. To further enhance the performance of deep models, enlarging model size [7, 8, 33, 57] and input context [6, 32, 67] are two conventional and effective ways, where the former introduces more hidden neurons and the latter brings more input neurons. Although neural models with more hidden or input neurons have higher accuracy on average, large-scale models do not always beat small models. For example, on one hand, many network pruning methods have shown that compressed models with significantly reduced parameters (neuron connections) can maintain accuracy [27, 39, 45] and even improve generalization [2], Meyes et al. [48] find that ablation of neurons can consistently improve performance in some specific classes, and Zhong et al. [77] empirically demonstrate that larger language models indeed perform worse on a non-negligible fraction of instances. These phenomena indicate that some hidden neurons in the currently trained model are dispensable or even obstructive. On the other hand, much of the work on Question Answering (QA) [14, 74] and query understanding [18, 50, 81] has noted that feeding more contextual information is more likely to distract the model and hurt performance. This is not surprising, as more input neurons not only mean more relevant features but are also likely to introduce more noise that interferes with the model. Similar to network pruning that cuts out inefficient parameters through post-processing, many context selection methods [23, 49, 62, 76] trim off noisy segments from the input context by pre-processing. In essence, both network pruning and context selection reduce inefficient hidden or input neurons through additional processing. However, apart from extrinsically reducing inefficient neurons, can we intrinsically improve the utility of neurons during model training?

Imagine an ideal neural network in which all its neurons should be able to cooperate efficiently to maximize the utility of each neuron. If a fraction of the input or hidden neurons in this network are ablated¹ (disabling partial input context or model parameters), the ablated submodel is not supposed to perform better, even if the ablated neurons are noisy. This is because an efficient² model should have already suppressed these noises. Following this intuition, we can roughly find a comparison principle between the original full model and its ablated model: the fewer neurons are ablated in the model, the better the model should perform. During training, we can use task-specific losses as a proxy for model performance on training samples, with lower task-specific losses implying better performance. For example, the task-specific loss of the efficient full model (a) in Figure 1 is supposed to be minimal, and if the ablated model (b) is also efficient with respect to its restricted parameter space, the task-specific loss of the ablated model (d) is supposed to be greater than that of (b) because (d) ablates one more input neuron than (b).

Fig. 1.

Noting the gap between the ideal model and reality [49, 77], we aim to ensure this necessity (comparison principle) during the training to improve the model’s utilization of neurons. Based on the natural comparison principle between models, we propose a cross-model comparative loss to train models without additional manual supervision. In general, the comparative loss is a ranking loss on top of multiple task-specific losses. First, these task-specific losses are derived from the full neural model and several comparable ablated models whose neurons are ablated to varying degrees. Next, the ranking loss is a pairwise hinge loss that penalizes models that have fewer ablated neurons but larger task-specific losses. Concretely, if a model with fewer ablated neurons acquires a larger task-specific loss than another model with more ablated neurons, then the difference between the task-specific losses of the pair will be taken into account in the final comparative loss; otherwise, the pair complies with the comparison principle and does not incur any training loss. In this way, the comparative loss can drive the order of task-specific losses to be consistent with the order of the ablation degrees. Through theoretical derivation, we also show that comparative loss can be viewed as a dynamic weighting of multiple task-specific losses, enabling adaptive assignment of weights depending on the performance of the full/ablated models.

The comparability among multiple ablated models is a fundamental prerequisite for comparative loss. As a counterexample, although the ablated model (c) in Figure 1 ablates less neurons than (d), they are not comparable and so no comparative loss can be applied. To make the ablated models comparable with each other, we progressively ablate the models. The first ablated model is obtained by performing one ablation on the basis of the full model. If more ablated models are needed, in each subsequent ablation step we construct a new ablated model by performing a further ablation on top of the ablated model from the previous step, which makes the newly ablated model certainly a comparable submodel of the previous ones. We provide two alternative controlled ablation methods for each ablation step, called CmpDrop and CmpCrop. CmpDrop ablates hidden neurons by the dropout [30] technique, which is theoretically applicable to all dropout-compatible models, whereas CmpCrop ablates input neurons by cropping extraneous context segments and is theoretically applicable to all tasks that contain extraneous content in the input context.

We apply comparative loss with CmpDrop or/and CmpCrop on 14 datasets from three NLU tasks (text classification, QA, and query understanding) with distinct prediction types (classification, extraction, and ranking) on top of five widely used Pre-trained Language Models (PLMs) [3, 15, 21, 37, 44]. The empirical results demonstrate the effectiveness of comparative loss over state-of-the-art baselines, as well as the enhanced utility of parameters and context. Our analysis also confirms that comparative losses can indeed more appropriately weight multiple task-specific losses, as indicated by our derivation. By exploring different comparison strategies, we observe that comparing the models ablated by first CmpCrop and then CmpDrop can bring the greatest improvement. Interestingly, we find that comparative loss is particularly effective for models with few parameters or long inputs. This may imply that comparative loss can help models with lower capacity fit the more or longer training samples better, whereas models with higher capacity are inherently prone to fit less data, so comparative loss is less helpful. Moreover, we discover that different ablation methods have different effects on training, with CmpDrop helping task-specific loss decrease to lower levels faster and CmpCrop alleviating overfitting to some extent.

The main contributions can be summarized as follows:

—

We propose comparative loss, a cross-model loss function based on the comparison principle between the full model and its ablated models, to improve the neuronal utility without additional human supervision.

—

We progressively ablate the models to make multiple ablated models comparable and present two controlled ablation methods based on dropout and context cropping, applicable to a wide range of tasks and models.

—

We theoretically show how comparative loss works and empirically demonstrate its effectiveness through experiments on three distinct NLU tasks. We release the code and processed data on GitHub (https://github.com/zycdev/CmpLoss).

2 Preliminaries

Before introducing our cross-model comparative loss, we review some of the concepts and notations needed afterward. We first introduce typical training methods for the model, followed by formalizations of network pruning and context selection methods that can further improve the model performance by removing inefficient inputs or hidden neurons. Finally, we elaborate on the concept of ablation, which recurs throughout the article.

2.1 Conventional Training

Given a training dataset $\mathcal {D}$ for a specified task and a neural network f parameterized by $\boldsymbol {\theta } \in \mathbb {R}^{|\boldsymbol {\theta }|}$, the training objective for each sample $(x, y) \in \mathcal {D}$ is to minimize empirical risk

\begin{equation} \mathcal {L}_\mathrm{emp}(x, y, \boldsymbol {\theta }) = L(y, f(x; \boldsymbol {\theta })), \end{equation}

(1)

where x is the input context, y in output space $\mathcal {Y}$ is the label, and $L: \mathcal {Y} \times \mathcal {Y} \rightarrow \mathbb {R}_{\ge 0}$ is the task-specific loss function, with $\mathbb {R}_{\ge 0}$ denoting the set of non-negative real numbers. In NLU tasks, x is typically a sequence of words, whereas y can be a single category label for classification [55, 66, 73], or a pair of start and end boundaries for extraction [54, 59, 74], or a sequence of relevance levels for ranking [51, 52, 56, 78].

2.2 Network Pruning

After training a neural model $f(x; \boldsymbol {\theta })$, to reduce memory and computation requirements at test time, network pruning [5] entails producing a smaller model $f(x; \boldsymbol {m} \odot \boldsymbol {\theta }^{\prime })$ with similar accuracy through post-hoc processing. Here, $\boldsymbol {m} \in \lbrace 0, 1\rbrace ^{|\theta |}$ is a binary mask that fixes certain pruned parameters to 0 through elementwise product $\odot$, and the parameter vector $\boldsymbol {\theta }^{\prime }$ may be different from $\boldsymbol {\theta }$ because $\boldsymbol {m} \odot \boldsymbol {\theta }^{\prime }$ is usually retrained from $\boldsymbol {m} \odot \boldsymbol {\theta }$ to fit the pruned network structure.

Although pruning is often viewed as a way to compress models, it has also been motivated by the desire to prevent overfitting. Pruning systematically removes redundant parameters and neurons that do not significantly contribute to performance and thus have much less prediction variance, which makes us reminiscent of dropout [36], another widely used technique to avoid overfitting. Similarly, dropout also uses a mask to disable a fraction (e.g., $p\%$) of parameters or neurons. The significant difference, though, is that the mask $\boldsymbol {m}$ in dropout is randomly sampled from a $\mathrm{Bernoulli }(1-p\%)$ distribution, rather than deterministically defined by a criterion (e.g., the bottom $p\%$ of parameters in magnitude should be masked) as in pruning. This in turn brings convenience: a model trained with dropout does not need to be retrained for a specific mask, because the model’s neurons have already started to learn how to adapt to the absence of some neurons in the previous training.

2.3 Context Selection

To eliminate the noisy content in the input context x and further improve the model performance, context selection selectively crops out a condensed context $x^{\prime } \sqsubseteq x$ to produce the final model prediction. In general, the model requires specialized training to fit the selected context. Therefore, context selection is pre-hoc processing relative to training, requiring removing the noise from the training samples in advance. With a slight abuse of notation, here we use $x^{\prime } \sqsubseteq x$ to denote that $x^{\prime }$ is a condensed subsequence (possibly equal) of x. In general, $x^{\prime }$ is an ordered combination of segments of x, where the segments are usually at the sentence [49, 53], chunk [76], paragraph [14], or document [23, 62] granularity. It is worth noting that the selector for segment selection generally requires additional supervised training and needs to be run in advance of the prediction, which introduces additional computation overhead.

2.4 Ablation

To assess the contribution of certain components to the overall model, ablation studies investigate model behavior by removing or replacing these components in a controlled setting [17]. Here, in the context of machine learning, “ablation” refers to the removal of components of the model, which is an analogy to ablative brain surgery (removal of components of an organism) in biology [48]. We refer to the model after component removal as the “ablated model,” which should continue to work. However, if the removed components are responsible for performance improvement, the performance of the ablated model is expected to be worse [24].

In this article, we use “ablation” to refer specifically to the removal of some neurons of a neural model—that is, to set the output of some specific neurons to zero. From such a neuronal perspective, network pruning and context selection can be viewed as two kinds of ablation, the former removing some low-contributing hidden neurons after training and the latter removing some low-information input neurons before training. However, in contrast to ablation studies that aim to investigate the role of the ablated neurons, we aim to learn to improve the utility of the ablated neurons.

3 Methodlogy

The primary motivation of this work is to inherently improve the utility of neurons in NLU models through a cross-model training objective rather than post-hoc network pruning or pre-hoc context selection to eliminate inefficient neurons. In the following, we first describe a comparison principle. Then, we propose a novel comparative loss based on the corollary of the comparison principle and present how to train models with comparative loss by two controlled ablation methods. Finally, we discuss how comparative loss works.

3.1 Comparison Principle

For an efficient model, we believe that all of its neurons should be able to work together efficiently to maximize the utility of each neuron. This means that each neuron should contribute to the overall model, or at least be harmless, because the cooperation of neurons is supposed to eliminate the negative effects of noise that may be produced by individual neurons. Thus, if we ablate some neurons, even those that produce noise, due to the missing contribution of the ablated neurons, then the ablated submodel should perform no better than the original full model—in other words, its task-specific loss should be no smaller than the original.

Formally, we define a neural model as an efficient model if and only if it performs no weaker than any of its ablation models, and we formalize the comparison principle between an efficient model and its ablation models as follows.

Comparison Principle. Suppose $f(x; \boldsymbol {\theta })$ is an efficient neural model for the input x with respect to the parameter space $\mathbb {R}^{|\boldsymbol {\theta }|}$; let $x^{\prime } \sqsubset x$ be the ablated input and $\boldsymbol {\theta }^{\prime } = \boldsymbol {m} \odot \boldsymbol {\theta }$ be the ablated parameters. Then, for any subsequence $x^{\prime }$ of x whose label is still y, the input-ablated model $f(x^{\prime }; \boldsymbol {\theta })$ should not perform better than the full model $f(x; \boldsymbol {\theta })$, and for any parameters $\boldsymbol {\theta }^{\prime }$ masked by arbitrary $\boldsymbol {m}$, the parameter-ablated model $f(x; \boldsymbol {\theta }^{\prime })$ should not perform better than $f(x; \boldsymbol {\theta })$—that is,

\begin{equation} L(y, f(x; \boldsymbol {\theta })) \le L(y, f(x^{\prime }; \boldsymbol {\theta })), \forall x^{\prime } \sqsubset x \text{ with } g(x^{\prime }) = y, \end{equation}

(2)

\begin{equation} L(y, f(x; \boldsymbol {\theta })) \le L(y, f(x; \boldsymbol {\theta }^{\prime })), \forall \boldsymbol {\theta }^{\prime } = \boldsymbol {m} \odot \boldsymbol {\theta } \text{ with } \boldsymbol {m} \in \lbrace 0, 1\rbrace ^{|\boldsymbol {\theta }|}, \end{equation}

(3)

where $g(x^{\prime })$ means the ground-truth output of $x^{\prime }$.

In the preceding definition, we consider that an efficient neural model should be input-efficient and parameter-efficient. In particular, the input-efficient property means that the model can efficiently utilize the input neurons (words). If the model $f(x; \boldsymbol {\theta })$ satisfies Equation (2), we say that $f(\cdot ; \boldsymbol {\theta })$ is input-efficient for the input x. The parameter-efficient property means that the model can utilize the hidden neurons efficiently. If the model $f(x; \boldsymbol {\theta })$ satisfies Equation (3), we say $f(x; \boldsymbol {\theta })$ is parameter-efficient for the input x with respect to the parameter space $\mathbb {R}^{|\boldsymbol {\theta }|}$. According to Equation (3), we can definitely find at least one vector $\boldsymbol {\theta }$ that is parameter-efficient for the input x—that is, the zero vector and the optimal parameter vector that minimizes the empirical risk. If the parameter space is large enough, from those vectors parameter-efficient for x, we can find some parameters that simultaneously satisfy Equation (2) (i.e., the input-efficient property). In other words, if the parameter space $\mathbb {R}^{|\boldsymbol {\theta }|}$ is large enough, there exists at least one parameter vector $\boldsymbol {\theta }$ that makes the model $f(x; \boldsymbol {\theta })$ efficient for x. Specially, if all activation functions in the neural model f have zero output values for zero, then $f(x^{\prime }; \boldsymbol {0}) = f(x; \boldsymbol {0})\ \forall x^{\prime } \sqsubset x$, and hence the parameter vector $\boldsymbol {\theta } = \boldsymbol {0}$ is efficient for any x.

Notably, we restrict the ablated input $x^{\prime }$ for comparison to only those subsequences whose ground-truth output $g(x^{\prime })$ remains unchanged (i.e., $g(x^{\prime }) = g(x) = y$). This is because ablation may remove some key information from the original input x, such as the trigger words in the classification, resulting in an unknown change in the label y. In this unusual case, $L(y, f(x^{\prime }; \boldsymbol {\theta }))$ will no longer be a reasonable proxy of the performance of the input-ablated model, so it makes no sense to compare it to the task-specific loss of the original model. For example, in binary classification ($y \in \lbrace 0, 1\rbrace$), for the original input x, $f(x; \boldsymbol {\theta })$ predicts the correct category y with low confidence, whereas for the ablated input $x^{\prime }$ whose category label changes to $g(x^{\prime })=1-y$, $f(x^{\prime }; \boldsymbol {\theta })$ predicts $1-y$ with high confidence. Even though $L(y, f(x; \boldsymbol {\theta })) \le L(y, f(x^{\prime }; \boldsymbol {\theta }))$, the input-ablated model $f(x^{\prime }; \boldsymbol {\theta })$ actually outperforms the original $f(x; \boldsymbol {\theta })$ (i.e., $L(y, f(x; \boldsymbol {\theta })) \ge L(1-y, f(x^{\prime }; \boldsymbol {\theta }))$), and we cannot consider $f(\cdot ; \boldsymbol {\theta })$ to be input-efficient for x. Although we can use $L(g(x^{\prime }), f(x^{\prime }; \boldsymbol {\theta }))$ as a performance proxy for the input-ablated model in Equation (2), in practice it is difficult to know how the labels of the ablated inputs will change, so we try to avoid such label-changing scenarios. For the sake of concision, from here on, we default the ablation of the input context does not change the output label if not otherwise specified (i.e., $g(x^{\prime })=y$).

Further, we can ablate a full model $f(x^{(0)}; \boldsymbol {\theta }^{(0)})$ multiple (c) times, but are these ablated models $\lbrace f(x^{(i)}; \boldsymbol {\theta }^{(i)})\rbrace _{i=1}^{c}$ comparable to each other? The comparison principle only points out the comparative relation between an efficient model and any of its ablated models, and cannot be directly applied to multiple ablated models. However, if we assume that these ablated models are constructed step by step (i.e., each ablated model $f(x^{(j)}; \boldsymbol {\theta }^{(j)})$ is obtained by progressively ablating the input ($x^{(j)} \sqsubset x^{(j-1)}$) xor parameters ($\boldsymbol {\theta }^{(j)} = \boldsymbol {m}^{(j)} \odot \boldsymbol {\theta }^{(j-1)}$) based on its previous model $f(x^{(j-1)}; \boldsymbol {\theta }^{(j-1)})$), then $f(x^{(j)}; \boldsymbol {\theta }^{(j)})$ can be considered as an ablated model of all its ancestor models $\lbrace f(x^{(i)}; \boldsymbol {\theta }^{(i)})\rbrace _{i=0}^{j-1}$. For simplicity, we abbreviate their task-specific losses as $l^{(i)} = L(y, f(x^{(i)}; \boldsymbol {\theta }^{(i)}))$. Furthermore, if all $\lbrace f(x^{(i)}; \boldsymbol {\theta }^{(i)})\rbrace _{i=0}^{c-1}$ are simultaneously assumed to be efficient with respect to their parameter spaces $\mathbb {R}^{\Vert \boldsymbol {m}^{(i)}\Vert _0}$,³ we can apply the comparison principle to compare the task-specific losses of any two models (i.e., $l^{(i)} \le l^{(j)}, \forall i \lt j$).

Formally, we define an efficient model to be hereditarily efficient if and only if its ablated models are all efficient. Similarly, if the parameter-ablated models of a parameter-efficient model are all parameter-efficient, we call this parameter-efficient model hereditarily parameter-efficient. And the hereditarily input-efficient model is defined in the same way. Specifically, the parameter vector $\boldsymbol {\theta } = \boldsymbol {0}$ is hereditarily parameter-efficient, and $f(x; \boldsymbol {0})$ is also hereditarily efficient for any x if all activation functions of f have zero output values for zero.

Based on the definition of the hereditarily efficient model and the comparison principle, we can draw the following corollary.

Corollary 3.1.

Suppose $f(x^{(0)}; \boldsymbol {\theta }^{(0)})$ is a hereditarily efficient neural model for the input $x^{(0)}$ with respect to the parameter space $\mathbb {R}^{|\boldsymbol {\theta }^{(0)}|}$; let $\lbrace f(x^{(i)}; \boldsymbol {\theta }^{(i)})\rbrace _{i=1}^{c}$ be its multiple progressively ablated models, where $x^{(i)} \sqsubset x^{(i-1)}$ xor $\boldsymbol {\theta }^{(i)} = \boldsymbol {m}^{(i)} \odot \boldsymbol {\theta }^{(i-1)}$. Then, all $\lbrace f(x^{(i)}; \boldsymbol {\theta }^{(i)})\rbrace _{i=1}^{c-1}$ are also efficient models with respect to their parameter spaces $\mathbb {R}^{\Vert \boldsymbol {m}^{(i)}\Vert _0}$, and their task-specific losses should be monotonically non-decreasing with the degrees of ablation—that is,

\begin{equation} l^{(0)} \le l^{(1)} \cdots \le l^{(i)} \cdots \le l^{(c)}. \end{equation}

(4)

In brief, the corollary describes a desirable transitive comparison between a hereditarily efficient neural model and its ablated models (i.e., the less ablation, the better the performance). Unfortunately, this natural property has been largely ignored before, which motivates us to exploit it to train models that utilize neurons more efficiently.

3.2 Comparative Loss

Based on Corollary 3.1, we can train a hereditarily efficient model with the objective of ordered comparative relation in Equation (4). To measure the difference from the desirable order, we can use pairwise hinge loss [29] to evaluate the ranking of the task-specific losses of the full model and its ablated models, like $\sum _{i=0}^{c-1}\sum _{j=i+1}^{c} \max (0, l^{(i)} - l^{(j)})$. However, optimizing this ranking loss alone cannot guarantee that these task-specific losses are minimized—that is, the full/ablated models may not be Empirical Risk Minimized (ERM) [63] with respect to their parameter spaces. To push these models to be ERM, we introduce a special scalar b as the baseline value of the task-specific loss and assume that it is derived from a dummy ablated model $f(x^{(c+1)}; \boldsymbol {\theta }^{(c+1)})$. The dummy model is set to have the highest degree of ablation, and in principle, its task-specific loss $l^{(c+1)}$ should be the highest. However, to push the task-specific losses of the real models $\lbrace f(x^{(i)},\boldsymbol {\theta }^{(i)})\rbrace _{i=0}^{c}$ down, we usually set $l^{(c+1)}=b$ to a small value (e.g., 0) and expect all $\lbrace l^{(i)}\rbrace _{i=0}^{c}$ to be reduced by this target. In this way, our comparative losses can still be written as a pairwise ranking loss, except that on top of the $c + 2$ task-specific losses,

\begin{equation} \mathcal {L}_\mathrm{cmp}(x, y, \boldsymbol {\theta }) = \sum _{i=0}^{c}\sum _{j=i+1}^{c+1}\max \big (0, l^{(i)} - l^{(j)}\big). \end{equation}

(5)

Figure 2 visualizes the localization (central grid region) of the ideal model of comparative loss, which is both ERM and hereditarily efficient. The hereditarily efficient is a subset of the efficient, and the efficient is the intersection of the input-efficient and parameter-efficient. In this light, comparative loss sets a stricter training objective than ERM. When we set c and b to 0, Equation (5) can degenerate to Equation (1). Further, the comparative loss is equivalent to

\begin{equation*} \sum _{l^{(i)} \gt b}l^{(i)} + \sum _{i=0}^{c-1}\sum _{j=i+1}^{c}\max \big (0, l^{(i)} - l^{(j)}\big), \end{equation*}

where the first term is to minimize the empirical risk of those not reaching the target b, and the second term constrains the comparative relation to pursue the full model being hereditarily efficient.

Fig. 2.

To train using comparative loss, we first need to obtain several comparable ablated models and task-specific losses. As shown in Figure 3, we consider the original model with the input of the entire context as the full model $f(x^{(0)}; \boldsymbol {\theta }^{(0)})$. According to Corollary 3.1, we progressively perform c-step ablation based on the full model. At the i-th ($1 \le i \le c$) ablation step, we use CmpCrop or CmpDrop to ablate a small portion of the input or hidden neurons based on the model $f(x^{(i-1)}; \boldsymbol {\theta }^{(i-1)})$ from the previous step, which makes the newly ablated model $f(x^{(i)}; \boldsymbol {\theta }^{(i)})$ comparable to all of its ancestor models. After all of these models have predicted once, we have $c+1$ comparable task-specific losses. Together with $l^{(c+1)}=b$ from the dummy ablated model, we can calculate the final loss using Equation (5). Using stochastic gradient descent optimization as an example, Algorithm 1 illustrates the training process more formally.

Fig. 3.

CmpDrop and CmpCrop in Algorithm 1 are the two alternative ablation methods we present for each ablation step, the former for ablating the parameters (hidden neurons) and the latter for ablating the input context (input neurons). They both randomly ablate neurons in a controlled manner on top of the previous model, which allows the coverage of all potential ablated models without retraining each ablated model. This is because the randomly ablated models are jointly trained and adapt to the absence of some neurons during the training process. As for which one to use at each ablation step can be specific to the model and task dataset. Ideally, CmpDrop can be used as long as the model is dropout compatible, and CmpCrop can be used as long as the input context of the task contains dispensable segments. In the following, we introduce CmpDrop and CmpCrop in detail.

3.2.1 CmpDrop: Ablate Parameters by Dropout.

Dropout randomly disables each neuron with probability p, which coincides with our need to randomly ablate hidden neurons. To obtain a model $f(\cdot ; \boldsymbol {\theta }^{(i)})$ with more ablated parameters, instead of simply applying a larger dropout rate on the original model $f(\cdot ; \boldsymbol {\theta }^{(0)})$, we ablate the surviving neurons from the previous ablated model $f(\cdot ; \boldsymbol {\theta }^{(i-1)})$ with probability p consistently. Specifically, the output values of those dropped neurons are set to zeros, and the output values of the surviving neurons are scaled by $1/(1-p)$ to ensure consistency with the expected output value of a neuron in all full/ablated models [36]. This is equivalent to applying a mask with scaling⁴ $\boldsymbol {m}^{(i)} \in \lbrace 0, 1, 1/(1-p)\rbrace ^{|\boldsymbol {\theta }|}$ to the previous parameters $\boldsymbol {\theta }^{(i-1)}$ to obtain the ablated parameters $\boldsymbol {\theta }^{(i)} = \boldsymbol {m}^{(i)} \odot \boldsymbol {\theta }^{(i-1)}$. Each element in $\boldsymbol {m}^{(i)}$ corresponds to the scaling factor of each parameter,

\begin{equation*} m_k^{(i)} = {\left\lbrace \begin{array}{ll} 1, & \text{if no dropout in $\theta _k$'s layer;} \\ \frac{1}{1-p}, & \text{if neurons at both ends of $\theta _k$ survive;} \\ 0, & \text{otherwise.} \end{array}\right.} \end{equation*}

For the third case in the preceding equation, once a neuron is newly ablated, all connection parameters from and to it are set to zero; in addition, parameters that have been ablated by $\boldsymbol {m}^{(i-1)}$ are still set to zero.

In practice, we can leverage the existing dropout to implement it. However, for the comparability of the ablated models, we must use the same random seed and the same state of the random number generator in each CmpDrop. In this way, assuming that the current ablation step is the n-th execution of CmpDrop, we can simply run the model with the dropout rate of $1-(1-p)^n$.

3.2.2 CmpCrop: Ablate Input by Cropping.

Given an input context x, CmpCrop aims to crop out a condensed context $x^{\prime }$ that does not change the original ground-truth output (i.e., $x^{\prime } \sqsubset x$ and $g(x^{\prime }) = g(x) = y$). Assume that we know the minimum support context $x^{\star }$ for x at training time (i.e., $\forall x^{\prime } \sqsupseteq x^{\star }\ g(x^{\prime }) = g(x^{\star })$ and $\forall x^{\prime } \sqsubset x^{\star }\ g(x^{\prime }) \ne g(x^{\star }))$. Then, CmpCrop can produce a streamlined context by randomly cropping out several insignificant segments from the non-support context $x \setminus x^{\star }$. In this way, the trimmed streamlined context is sure to contain the minimum support context, so the ground-truth output does not change.

In practice, to use CmpCrop, we must ensure that enough insignificant segments are set aside in the original context $x^{(0)}$ for cropping. The segments can be of document, paragraph, or sentence granularity. For example, in QA, an insignificant segment can be any retrieved paragraph that does not affect the answer to the question. If the dataset does not annotate the minimal support context, we can manually inject a few extraneous noise segments into $x^{(0)}$.

3.3 Discussion

Further deriving Equation (5), we find that comparative loss can be viewed as a dynamic weighting of multiple task-specific losses. In particular, the loss can be rewritten as follows:

\begin{equation} \begin{aligned}\mathcal {L}_\mathrm{cmp} &= \sum _{i=0}^{c}\sum _{j=i+1}^{c+1}\max \big (0, l^{(i)} - l^{(j)}\big) \\ &= \sum _{i=0}^{c}\sum _{j=i+1}^{c+1}{1\!\!1}_{l^{(i)}\gt l^{(j)}} \cdot l^{(i)} - \sum _{i=0}^{c}\sum _{j=i+1}^{c+1}{1\!\!1}_{l^{(i)}\gt l^{(j)}} \cdot l^{(j)} \\ &= \sum _{i=0}^{c+1}\sum _{j=i+1}^{c+1}{1\!\!1}_{l^{(i)}\gt l^{(j)}} \cdot l^{(i)} - \sum _{j=0}^{c}\sum _{i=j+1}^{c+1}{1\!\!1}_{l^{(j)}\gt l^{(i)}} \cdot l^{(i)} \\ &= \sum _{i=0}^{c+1}\sum _{j=i+1}^{c+1}{1\!\!1}_{l^{(i)}\gt l^{(j)}} \cdot l^{(i)} - \sum _{i=1}^{c+1}\sum _{j=0}^{i-1}{1\!\!1}_{l^{(j)}\gt l^{(i)}} \cdot l^{(i)} \\ &= \sum _{i=0}^{c+1}\sum _{j=i+1}^{c+1}{1\!\!1}_{l^{(i)}\gt l^{(j)}} \cdot l^{(i)} - \sum _{i=0}^{c+1}\sum _{j=0}^{i-1}{1\!\!1}_{l^{(j)}\gt l^{(i)}} \cdot l^{(i)} \\ &= \sum _{i=0}^{c+1}\left[ \sum _{j=i+1}^{c+1}{1\!\!1}_{l^{(i)}\gt l^{(j)}} \cdot l^{(i)} - \sum _{j=0}^{i-1}{1\!\!1}_{l^{(j)}\gt l^{(i)}} \cdot l^{(i)} \right] \\ &= \sum _{i=0}^{c+1} l^{(i)} \cdot \left[ \sum _{j=i+1}^{c+1}{1\!\!1}_{l^{(i)}\gt l^{(j)}} - \sum _{j=0}^{i-1}{1\!\!1}_{l^{(i)}\lt l^{(j)}} \right] \\ &= \sum _{i=0}^{c+1} \sum _{j \ne i}\mathrm{CMP}(i, j, l^{(i)}, l^{(j)}) \cdot l^{(i)}, \end{aligned} \end{equation}

(6)

where ${1\!\!1}_C$ is an indicator function equal to 1 if condition C is true and 0 otherwise, and the $\mathrm{CMP}$ function determines whether model $f(x^{(i)}; \boldsymbol {\theta }^{(i)})$ complies with the comparison principle compared to $f(x^{(j)}; \boldsymbol {\theta }^{(j)})$ and adjusts the weight of $l^{(i)}$. There are two cases of non-compliance: for the case where $f(x^{(i)}; \boldsymbol {\theta }^{(i)})$ is less ablated ($i \lt j$) but more loss is obtained, we increase the weight of $l^{(i)}$; for the case where $f(x^{(i)}; \boldsymbol {\theta }^{(i)})$ is more ablated ($i \gt j$) but less loss is obtained, we decrease the weight of $l^{(i)}$. Formally, the $\mathrm{CMP}$ function can be written as

\begin{equation*} \mathrm{CMP}(i, j, l^{(i)}, l^{(j)}) = {\left\lbrace \begin{array}{ll} 1, & \text{if $i \lt j$ and $l^{(i)} \gt l^{(j)}$;} \\ -1, & \text{if $i \gt j$ and $l^{(i)} \lt l^{(j)}$;} \\ 0, & \text{otherwise.} \end{array}\right.} \end{equation*}

Here we can notice that for a pair of models that do not conform to the comparison principle, we increase (+1) the weight of the task-specific loss of the model that is ablated less and equally decrease (–1) the weight of the loss of the model that is ablated more. Thus, let $\alpha ^{(i)} = \sum _{j \ne i}\mathrm{CMP}(i, j, l^{(i)}, l^{(j)})$ denote the weight of $l^{(i)}$, then the sum of the weights of all task-specific losses (including the dummy one) is 0 (i.e., $\sum _{i=0}^{c}\alpha ^{(i)}=-\alpha ^{(c+1)}=\sum _{i=0}^{c}{1\!\!1}_{l^{(i)}\gt b}$). Since $l^{(c+1)}=b$ is a constant, Equation (6) is also equivalent to $\sum _{i=0}^{c}\alpha ^{(i)}l^{(i)}$ (i.e., the total weight equals the number of task-specific losses worse than the virtual baseline b) and is adaptively assigned to the $c+1$ losses according to their performance. In this way, poorly performing full/ablated models will be more heavily optimized. And we empirically compare other heuristic weighting strategies in Section 5.1.

For parameter ablation, in addition to being able to weight each task-specific loss differentially, the comparative loss with CmpDrop can also differentially calculate the gradients of the parameters in different parts. According to Equation (6), comparative loss is equal to the sum of all differences of task-specific loss pairs that violate the comparison principle (i.e., $\sum _{i\lt j\,\wedge \,l^{(i)}\gt l^{(j)}}\big (l^{(i)}-l^{(j)}\big)$), so we can analyze the gradient of comparative loss from the difference of each task-specific loss pair. For ease of illustration, we take the original model $f(x; \boldsymbol {\theta })$ and a model $f(x^{\prime }; \boldsymbol {\theta }^{\prime })$ whose parameters have been ablated n times as an example, and other model pairs with different parameters are similar. Assume that the original parameters $\boldsymbol {\theta }=(\boldsymbol {u},\boldsymbol {v},\boldsymbol {w})$ and the ablated parameters $\boldsymbol {\theta }^{\prime }=(\boldsymbol {u}^{\prime },\boldsymbol {v}^{\prime },\boldsymbol {w}^{\prime })=(\boldsymbol {u},\boldsymbol {v}/(1-p)^{n},\boldsymbol {0})$, where $\boldsymbol {u}$ is the parameters from the layers without dropout, $\boldsymbol {w}$ is the parameters ablated by n times CmpDrop, and $\boldsymbol {v}^{\prime }$ is the scaled parameters surviving from n times dropout. Then, if their task-specific losses violate the comparison principle (i.e., $l\gt l^{\prime }$), the gradient of the comparative loss contributed by this model pair is

\begin{equation*} \begin{aligned}\nabla _{\boldsymbol {\theta }}(l-l^{\prime }) &= (\nabla _{\boldsymbol {u}}(l-l^{\prime }), \nabla _{\boldsymbol {v}}(l-l^{\prime }), \nabla _{\boldsymbol {w}}l). \end{aligned} \end{equation*}

We can see that the comparative loss with respect to $\boldsymbol {w}$ is higher than the comparative loss with respect to the other parameters. This is intuitive because the model instead performs better after ablating away $\boldsymbol {w}$ indicating that the current $\boldsymbol {w}$ is inefficient, so we need to focus on updating $\boldsymbol {w}$.

In addition to the dynamic weighting perspective, comparative loss can also be considered as an “inverse ablation study” during training. This is because, in contrast to ablation studies that determine the contribution of removed components during validation, comparative loss believes that the ablated neurons should contribute and optimizes parameters with this objective.

For training complexity, given a generally small number of comparisons c (i.e., number of ablation steps), the overhead of computing the final comparative loss is negligibly small, and the increased computation overhead per update step comes mainly from the multiple forward and backward propagations of the models. Specifically, the overhead of a training step using comparison loss is $1+c$ times that of conventional training for the same batch size. For inference complexity, however, models trained using comparative loss are the same as conventionally trained models at test time.

4 Experiments

To evaluate the effectiveness and generalizability of our approach for NLU, we conduct experiments on three tasks with representative output types, including classification (eight datasets), extraction (two datasets), and ranking (four datasets). Among them, the classification task requires predicting a single category for a piece of text or a text pair, the extraction task requires predicting a pair of boundary positions to extract the span between the start and end boundaries, and the ranking task requires predicting a list of relevance level to rank candidates. Specifically, the three distinct tasks are text classification (see Section 4.1), Reading Comprehension (RC) (extraction, see Section 4.2), and Pseudo-Relevance Feedback (PRF) (ranking, see Section 4.3), respectively. We evaluate the comparative loss with just CmpDrop in text classification and RC, the comparative loss with just CmpCrop in RC and PRF, and the comparative loss with both CmpDrop and CmpCrop in RC. For each task, we first introduce the dataset used, then present the implementation of our models as well as the baselines, and finally show the experimental results.

Before we start each experiment, we explain some common experimental settings. For the baseline value b of the task-specific loss in Algorithm 1, we provide two setting options. One is to simply set $b=0$, which is equivalent to setting an unreachable target value for all full/ablated models and thus pushing their task-specific losses to decrease. However, this results in the exposure of all training data to the full model and may aggravate overfitting. Therefore, to reduce the times the full model is optimized, our second option is to set the baseline value to the task-specific loss of the full model (i.e., $b=l^{(0)}$). In this way, the full model is optimized only when it performs worse than its ablated model. In practice, we prefer setting $b=0$, and change to setting $b=l^{(0)}$ if we find that the model is prone to overfitting on the dataset. For the dropout rate p in each CmpDrop, we use the same setting as the baseline models, which is 0.1 in all of our experiments. For other conventional training hyperparameters, such as batch size and learning rate, we also keep the same as the carefully tuned baseline models if not specifically specified. We implement our models and baseline models in PyTorch with HuggingFace Transformers [71]. All models are trained on Tesla V100 GPUs. In the text classification and RC tasks, we trained each model with five random seeds. In the PRF task, we trained all models with a fixed random seed 42. In the tables, the results presented as mean$_{\pm \text{standard deviation}}$ are tallied on the evaluation results of the five random seeds, and otherwise the performance of the model trained with the random seed 42. For convenience, in the tables, we use ‘Cmp’ to represent the comparative loss and use ‘Drop’ and ‘Crop’ in parentheses to refer to CmpDrop and CmpCrop, respectively.

4.1 Classification: Application to Text Classification

Text classification is a fundamental task in NLU, which aims to assign a pre-defined category to a piece or a group of text. In many text classification datasets, all segments of the input context seem to play an important role in the text category and there is almost no annotation of the minimal support context, so it is difficult for us to construct an input-ablated model by directly cropping the original input without changing the classification label. In other words, it is likely to violate the constraint that the label of the ablated input is unchanged in the comparison principle, and thus we cannot apply CmpCrop to this task. However, many current neural classification models use dropout during training, so in this task, we only validate the comparative loss that uses just CmpDrop.

4.1.1 Datasets.

The General Language Understanding Evaluation (GLUE) benchmark [66] is a collection of diverse NLU tasks. Following Devlin et al. [21], we exclude the problematic WNLI set and conduct experiments on the following eight datasets. Multi-Genre Natural Language Inference (MNLI) [70] is a sentence pair classification task that aims to predict whether the second sentence is an entailment, contradiction, or neutral to the first one. Microsoft Research Paraphrase Corpus (MRPC) [22] aims to predict if two sentences in the pair are semantically equivalent. Question Natural Language Inference (QNLI) [66] is a binary sentence pair classification task that aims to predict whether a sentence contains the correct answer to a question. Quora Question Pairs (QQP) [13] is a binary sentence pair classification task that aims to predict whether two questions asked on Quora are semantically equivalent. Recognizing Textual Entailment (RTE) [4] is a binary entailment task similar to MNLI, but with much fewer training samples. Stanford Sentiment Treebank (SST-2) [61] is a binary sentence sentiment classification task consisting of sentences extracted from movie reviews. The Semantic Textual Similarity Benchmark (STS-B) [9] is a sentence pair classification task that aims to determine how two sentences are semantically similar. The Corpus of Linguistic Acceptability (CoLA) [69] is a binary sentence classification task aimed at judging whether a single English sentence conforms to linguistics.

4.1.2 Models and Training.

Following R-Drop [40], another state-of-the-art training method leveraging dropout, we validate our comparative loss on the popular classification models based on PLMs.⁵ Specifically, we take BERT$_\mathrm{base}$ [21], RoBERTa$_\mathrm{base}$ [44], and ALBERT$_\mathrm{base}$ [37] as our backbones to perform fine-tuning. The task-specific loss is mean squared error for STS-B and cross-entropy for other datasets. We use different training hyperparameters for each dataset. For baseline models and our models trained with comparative loss, we independently select the learning rate within {1e-5, 2e-5, 3e-5, 4e-5}, warmup rate within {0, 0.1}, the batch size within {16, 24, 32}, and the number of epochs from 2 to 5. For our models, we tune the number of ablation steps c (i.e., the number of CmpDrop) from 1 to 4. Following the hyperparameter setup in R-Drop [40], we implement R-Drop for all backbone models as well to serve as a competitor, which performs dropout multiple times as CmpDrop does.

4.1.3 Results.

We present classification performance in Table 1, where the evaluation metrics are Pearson’s correlation for STS-B, Matthew’s correlation for CoLA, and Accuracy for the others. For models based on BERT$_\mathrm{base}$, we can see that our model (+ Cmp) comprehensively outperforms the well-tuned baseline BERT$_\mathrm{base}$ and achieves an improvement of 1.04 points (on average), which proves the effectiveness of comparative loss in classification tasks. Moreover, our model trained with comparative loss also outperforms the model trained with state-of-the-art R-Drop by 0.58 points on average, which demonstrates the superiority of comparative loss. For models based on other more advanced RoBERTa$_\mathrm{base}$ and ALBERT$_\mathrm{base}$, we can find consistent improvement. In addition, since ALBERT reuses parameters across multiple layers, it has the smallest boostable space for parameter utilization, which is consistent with our observation that comparative loss brings the smallest boost to ALBERT.

Table 1.

Model	MNLI	MRPC	QNLI	QQP	RTE	SST-2	STS-B	CoLA	Average
BERT$_\mathrm{base}$ [21]	84.2$_{\pm 0.3}$	85.9$_{\pm 0.5}$	91.0$_{\pm 0.1}$	91.0$_{\pm 0.1}$	68.2$_{\pm 1.7}$	92.2$_{\pm 0.2}$	88.9$_{\pm 1.0}$	61.9$_{\pm 1.1}$	82.92$_{\pm 0.14}$
+ R-Drop [40]	84.3$_{\pm 0.2}$	86.5$_{\pm 0.5}$	91.6$_{\pm 0.1}$	91.4$_{\pm 0.1}$	68.8$_{\pm 1.2}$	92.2$_{\pm 0.3}$	89.4$_{\pm 0.8}$	62.9$_{\pm 0.7}$	83.38$_{\pm 0.14}$
+ Cmp (c Drop)	84.8$_{\pm 0.2}$	87.1$_{\pm 1.0}$	91.9$_{\pm 0.2}$	91.5$_{\pm 0.1}$	70.5$_{\pm 1.2}$	93.1$_{\pm 0.4}$	89.7$_{\pm 0.6}$	63.4$_{\pm 1.1}$	83.96$_{\pm 0.31}$
RoBERTa$_\mathrm{base}$ [44]	87.4$_{\pm 0.3}$	89.9$_{\pm 0.6}$	92.8$_{\pm 0.2}$	91.4$_{\pm 0.1}$	76.5$_{\pm 0.9}$	95.1$_{\pm 0.2}$	90.8$_{\pm 0.1}$	62.7$_{\pm 0.6}$	85.82$_{\pm 0.13}$
+ R-Drop [40]	87.6$_{\pm 0.1}$	90.0$_{\pm 0.6}$	92.9$_{\pm 0.1}$	91.5$_{\pm 0.0}$	78.5$_{\pm 1.5}$	95.4$_{\pm 0.2}$	91.1$_{\pm 0.1}$	64.0$_{\pm 0.5}$	86.37$_{\pm 0.26}$
+ Cmp (c Drop)	88.0$_{\pm 0.1}$	90.5$_{\pm 0.5}$	93.3$_{\pm 0.1}$	91.9$_{\pm 0.1}$	79.3$_{\pm 1.2}$	95.5$_{\pm 0.2}$	91.3$_{\pm 0.1}$	65.4$_{\pm 0.8}$	86.91$_{\pm 0.20}$
ALBERT$_\mathrm{base}$ [37]	85.0$_{\pm 0.3}$	88.4$_{\pm 0.6}$	92.1$_{\pm 0.3}$	90.4$_{\pm 0.1}$	77.8$_{\pm 0.8}$	93.0$_{\pm 0.4}$	90.9$_{\pm 0.2}$	59.8$_{\pm 1.7}$	84.68$_{\pm 0.18}$
+ R-Drop [40]	85.4$_{\pm 0.4}$	88.7$_{\pm 0.8}$	92.1$_{\pm 0.2}$	90.5$_{\pm 0.1}$	78.0$_{\pm 1.4}$	93.3$_{\pm 0.3}$	90.9$_{\pm 0.2}$	59.7$_{\pm 0.4}$	84.83$_{\pm 0.14}$
+ Cmp (c Drop)	85.7$_{\pm 0.1}$	89.5$_{\pm 0.4}$	92.3$_{\pm 0.1}$	91.0$_{\pm 0.1}$	78.3$_{\pm 1.4}$	93.5$_{\pm 0.2}$	91.0$_{\pm 0.2}$	60.3$_{\pm 0.9}$	85.19$_{\pm 0.24}$

Table 1. Classification Performance on the Development Sets of the GLUE Language Understanding Benchmark

4.2 Extraction: Application to RC

Extractive RC [43, 59] is an essential technical branch of QA [11, 31, 41, 60, 79]. Given a question and a context, extractive RC aims to extract a span from the context as the predicted answer. Current dominant RC models basically use pre-trained Transformer [64] architectures, which employ dropout in many layers during fine-tuning. This allows us to use CmpDrop to improve the utility of the model parameters. Additionally, the given context is usually lengthy and contains many distracting noise segments, which also allows us to use CmpCrop to improve the model’s utilization of the context by randomly deleting the labeled distracting paragraphs. Therefore, we intend to verify the effectiveness of comparative loss using CmpDrop or/and CmpCrop in this task.

4.2.1 Datasets.

We evaluate the comparative loss using only CmpDrop on SuQAD [59], which contains 100K single-hop questions with 9,832 for validation, and HotpotQA [74], which contains 113K multi-hop questions with 7,405 for validation. For HotpotQA, we consider the distractor setting, where the context of each question contains 10 paragraphs, but only 2 of them are useful for answering the question, and the remaining 8 are retrieved distracting paragraphs that are relevant but do not support the answer. This allows us to evaluate the comparative loss with CmpCrop on HotpotQA distractor.

4.2.2 Models and Training.

We follow simple but effective RC models based on PLMs [3, 15, 21, 37, 44], which take as input a concatenation of the question and the context and use a linear layer to predict the start and end positions of the answer. And we use cross-entropy of answer boundaries as the task-specific loss function following Devlin et al. [21] and use a learning rate warmup over the first 10% steps. For SQuAD, we use the popular BERT [21], RoBERTa [44], ELECTRA [15], and ALBERT [37] with a maximum sequence length of 512 as the backbone, all of which have successively achieved top rankings in multiple QA benchmarks [58, 59, 74]. We first tune the learning rate in range {1e-5, 3e-5, 5e-5, 8e-5, 1e-4, 2e-4}, batch size in {8, 12, 32}, and number of epochs in {1, 2, 3} for baseline models. Then, setting $c=2$, we take these hyperparameters along and train our models using the comparative loss with two CmpDrop. For HotpotQA, we use the state-of-the-art Longformer [3] with a maximum sequence length of 2048 as the backbone, which is fed with the format <s> [YES] [NO] [Q] question </s> [T] title$_1$ [P] paragraph$_1$$\cdots$ [T] title$_{10}$ [P] paragraph$_{10}$ </s>. The special tokens [YES]/[NO], [Q], [T], and [P] represent yes/no answers and the beginning of questions, titles, and paragraphs, respectively. Similarly, we select the learning rate in {1e-5, 3e-5}, batch size in {6, 9, 12}, and number of epochs in {3, 5, 8} for the baseline model. We then train our models with three comparative losses, respectively, the first two applying one CmpDrop/CmpCrop ($c=1$), and the third applying one CmpCrop followed by one CmpDrop ($c=2$). Besides, inheriting common hyperparameters and searching for coefficient weights $\alpha$ in {0.1, 0.5, 1, 1.5}, we also implement R-Drop [40] as a competitor to CmpDrop.

4.2.3 Results.

Since we focus on extraction here, we only measure the extracted answers using EM (exact match) and F1, which is a little different from the official HotpotQA setting that simultaneously evaluates the identification of support facts. From Table 2, we can see that our implemented baseline models trained directly using the task-specific loss Equation (1) largely achieve better results than those reported in their original papers. Once trained using comparative loss Equation (5) instead, our models can still significantly outperform these well-tuned baseline models even without re-searching the training hyperparameters, demonstrating the effectiveness of comparative loss on the extraction task. Additionally, the consistent improvement based on the three different PLMs demonstrates the model-agnostic nature of comparative loss. Furthermore, from the results on HotpotQA, we can find that although both CmpDrop and CmpCrop deliver significant improvement, CmpCrop + CmpDrop achieves the best results, suggesting that CmpDrop and CmpCrop may bring different benefits to the trained models.

Table 2.

Model	EM	F1
3cSQuAD
BERT$_\mathrm{base}$ [21]	80.8	88.5
BERT$_\mathrm{base}$ (our implementation)	81.3$_{\pm 0.2}$	88.5$_{\pm 0.1}$
+ R-Drop [40]	82.2$_{\pm 0.1}$	89.1$_{\pm 0.1}$
+ Cmp (2 Drop)	82.3$_{\pm 0.2}$	89.3$_{\pm 0.1}$
RoBERTa$_\mathrm{base}$ [44] (our implementation)	85.8$_{\pm 0.1}$	92.2$_{\pm 0.1}$
+ R-Drop [44]	86.4$_{\pm 0.1}$	92.3$_{\pm 0.1}$
+ Cmp (2 Drop)	86.5$_{\pm 0.2}$	92.6$_{\pm 0.1}$
ELECTRA$_\mathrm{base}$ [15]	84.5	90.8
ELECTRA$_\mathrm{base}$ (our implementation)	85.9$_{\pm 0.3}$	92.3$_{\pm 0.2}$
+ R-Drop [40]	86.5$_{\pm 0.1}$	92.3$_{\pm 0.1}$
+ Cmp (2 Drop)	86.6$_{\pm 0.1}$	92.7$_{\pm 0.1}$
ALBERT$_\mathrm{base}$ [37]	82.3	89.3
ALBERT$_\mathrm{base}$ (our implementation)	83.6$_{\pm 0.2}$	90.6$_{\pm 0.1}$
+ R-Drop [40]	83.7$_{\pm 0.2}$	90.7$_{\pm 0.2}$
+ Cmp (2 Drop)	84.4$_{\pm 0.1}$	91.0$_{\pm 0.1}$
HotpotQA
Longformer$_\mathrm{base}$$^\dagger$ [3]	60.3	74.3
Longformer$_\mathrm{base}$ (our implementation)	61.9$_{\pm 0.4}$	75.6$_{\pm 0.3}$
+ R-Drop [40]	62.0$_{\pm 0.2}$	76.0$_{\pm 0.2}$
+ Cmp (1 Drop)	63.0$_{\pm 0.4}$	77.0$_{\pm 0.4}$
+ Cmp (1 Crop)	62.6$_{\pm 0.2}$	76.4$_{\pm 0.2}$
+ Cmp (1 Crop 1 Drop)	63.5$_{\pm 0.3}$	77.2$_{\pm 0.3}$

Table 2. Question Answering Performance on the Development Sets of SQuAD and HotpotQA Distractor

The results with ${\dagger }$ are inquired from the authors of its paper.

4.3 Ranking: Application to PRF

PRF [1] is an effective query understanding [10] technique to improve ranking accuracy, which aims to alleviate the mismatch of linguistic expressions between a query and its potential relevant documents. Given an original query q and a document collection C, a base ranking model returns a ranked list $D = (d_1, d_2, \ldots , d_{|D|})$. Let $D_{\le k}$ denote the feedback set containing the top k documents, where k is usual referred to as the PRF depth. The goal of PRF is to reformulate the original query q into a new representation $q^{(k)}$ using the query-relevant information in $D_{\le k}$—that is, $q^{(k)}=f((q, D_{\le k}); \boldsymbol {\theta })$, where $q^{(k)}$ is expected to yield better ranking results. Although PRF methods do usually improve ranking performance on average [16], individual reformulated queries inevitably suffer from query drift [50, 81] due to the objectively present noise in the feedback set, causing them to be inferior to the original ones. Therefore, we can use comparative loss with CmpCrop to train PRF models to suppress the extra increased noise by comparing the effect of queries reformulated using feedback sets with different PRF depths.

4.3.1 Datasets.

We conduct experiments on MS MARCO passage [51] collection, which consists of 8.8M English passages collected from the search results of Bing’s 1M real-world queries. The Train set of MS MARCO contains 530K queries (about 1.1 relevant passages per query on average), the Dev set contains 6,980 queries, and the online Eval set contains 6,837 queries. Apart from these, we also consider TREC DL 2019 [20], TREC DL 2020 [19], and DL-HARD [47], three offline evaluation benchmarks based on the MS MARCO passage collection, which contain 43, 54, and 50 fine-grained (relevance grades from 0 to 3) labeled queries, respectively. Among them, DL-HARD [47] is a recent evaluation benchmark focusing on complex queries. We use the MS MARCO Train set to train models, and we evaluate trained models on the MS MARCO Dev set to tune hyperparameters and select model checkpoints. The selected models are finally evaluated on the online MS MARCO Eval⁶ and three other offline benchmarks.

4.3.2 Models and Training.

We carry out PRF experiments on two base retrieval models, ANCE [72] (dense retrieval) and uniCOIL [42] (sparse retrieval), respectively. For their PRF models, we do not explicitly modify the query text but directly generate a new query vector for retrieval following the current state-of-the-art method ANCE-PRF [75]. This allows us to directly optimize the retrieval of reformulated queries end-to-end with the negative log likelihood of the positive document [34] as the task-specific loss:

\begin{equation*} L(\boldsymbol {q}^{(k)}) = -\log \frac{e^{\mathrm{sim}(\boldsymbol {q}^{(k)}, \boldsymbol {d}^+)}}{e^{\mathrm{sim}(\boldsymbol {q}^{(k)}, \boldsymbol {d}^+)} + \sum _{d^- \in D^-} e^{\mathrm{sim}(\boldsymbol {q}^{(k)}, \boldsymbol {d}^-)}}, \end{equation*}

where $\boldsymbol {d}^+$ is the vector of a sampled document relevant to q and $\boldsymbol {q}^{(k)}$, $\mathrm{sim}(\cdot , \cdot)$ is the dot product of two vectors, and $D^-$ is the collection of negative documents for them. Since only vectors of queries are updated,⁷ we mine a lite collection (5.3M for dense retrieval and 3.7M for sparse retrieval) containing positive and hard negative documents of all training queries. In this way, for each query, all documents in the lite collection except its positive documents can be used as its $D^-$. In general, our PRF model consists of an encoder, a vector projector, and a pooler. First, the original query q and feedback documents in $D_{\le k}$ are concatenated in order with [SEP] as separator and input to the encoder to get the contextual embedding of each token. Then, the projector maps the contextual embeddings to vectors with the same dimension as the document vectors. Finally, all token vectors are pooled into a single query vector. For dense retrieval, the encoder is initialized from ANCE$_\mathrm{FirstP}$,⁸ the projector is a linear layer, and the pooler applies a layer normalization on the first vector ([CLS]) in the sequence, as in the previous work [75]. For sparse retrieval, the encoder and projector are initialized from BERT$_\mathrm{base}$ with the masked language model head, where the projector is an MLP with GeLU [28] activation and layer normalization, and the pooler is composed of a max pooling operation and an L2 normalization.⁹ We fine-tune PRF baseline models for up to 12 epochs with a batch size of 96, a learning rate selected from {2e-5, 1e-5, 5e-6}, and PRF depth k randomly sampled from 0 to 5 for each query. We then fine-tune our PRF models using the comparative loss of $c=1$ CmpCrop for up to 6 epochs with a batch size of 48. In this way, the maximum number of training steps for our models remains the same as the baseline models (i.e., up to 12 optimizations per original query). Due to the large training costs of using multiple random seeds, we used a paired t-test to calculate significant differences in retrieval performance.

4.3.3 Results.

We report the official metrics (MRR@10 for MARCO and NDCG@10 for others) and Recall@1K of the models on multiple benchmarks in Table 3. In addition to reporting results for the best-performing PRF depths (numbers in superscript brackets), for a fair comparison with ANCE-PRF$^{(3)}$ (second row), we also present the results of ANCE-PRF + Cmp$^{(3)}$, both of which use the first three documents as feedback. We can see that PRF baseline models (+ PRF) indeed generally outperform their base retrieval models, except that uniCOIL-PRF degrades by 0.67 percentage points in NDCG@10 of TREC DL 2019, which reflects the presence of query drift. Our PRF models (+ Cmp) trained with comparative loss, however, outperform their base retrieval model across the board. Under the same use of three feedback documents, our ANCE-PRF + Cmp also outperforms the published state-of-the-art ANCE-PRF [75] on all metrics except NDCG@10 on DL-HARD. Moreover, when five feedback documents are used, ANCE-PRF + Cmp achieves a go-ahead over ANCE-PRF on NDCG@10 of DL-HARD. For sparse retrieval, our PRF model (+ Cmp) trained with comparative loss also surpasses the strong baseline uniCOIL-PRF implemented following ANCE-PRF. All of the preceding results demonstrate the effectiveness of comparative loss on the ranking task.

Table 3.

Model	MARCO Dev			MARCO Eval	TREC DL 2019		TREC DL 2020		DL-HARD
Model	NDCG@10	MRR@10	R@1K	MRR@10	NDCG@10	R@1K	NDCG@10	R@1K	NDCG@10	R@1K
ANCE [72]	38.76	33.01	95.84	31.70	64.76	75.70	64.58	77.64	33.39	76.65
+ PRF$^{(3)}$ [75]	40.10	34.40	95.90	33.00	68.10	79.10	69.50	81.50	36.50	76.10
+ Cmp$^{(3)}$ (1 Crop)	40.68$^*$	34.84$^*$	96.94$^*$	–	68.42	80.10$^*$	69.58	81.77	35.61	79.39$^*$
+ Cmp$^{(5)}$ (1 Crop)	41.01$^*$	35.14$^*$	97.03$^*$	34.17	69.58$^*$	80.81$^*$	70.44$^*$	82.77$^*$	37.44$^*$	79.55$^*$
uniCOIL [42]	41.21	35.13	95.81	34.42	70.09	82.83	67.35	84.42	35.96	76.85
+ PRF$^{(3)}$	41.76	35.48	96.85	–	69.42	83.32	69.25	84.44	36.53	77.48
+ Cmp$^{(3)}$ (1 Crop)	42.02$^*$	35.75$^*$	96.91	35.14	70.10$^*$	83.58	69.70$^*$	84.51	36.90$^*$	77.67

Table 3. Retrieval Performance on Benchmarks Built on the MS MARCO Passage Collection

ANCE and uniCOIL are base retrieval models, + PRF denotes the PRF baseline model, + Cmp denotes our PRF model trained with the comparative loss of 1 CmpCrop, and superscript $^{(k)}$ represents the PRF depth used during testing. Superscript $^*$ indicates statistically significant improvements over its PRF baseline model with $p \le 0.1$.

5 Analysis

In this section, we further conduct several experiments for a more thorough analysis. First, from the dynamic weighting perspective found in Section 3.3, we examine whether the adaptive weighting of comparative loss is more effective than other weighting strategies (Section 5.1). Next, we try several other comparison strategies to find some guiding experience in choosing the number of ablations and ablation methods in practice (Section 5.2). Then, to confirm the enhancement of comparative loss on the utility of hidden and input neurons, we investigate the performance of models with different numbers of parameters (Section 5.3) and context lengths (Section 5.4). Furthermore, we visualize the loss curves to find the impact of the comparative losses with different ablation methods on the task-specific loss (Section 5.5). Finally, we show the actual training overhead of comparative loss in detail (Section 5.6).

5.1 Effect of Weighting Strategy

To verify the role of comparative loss from the dynamic weighting perspective, we keep all of the training settings of Longformer + CmpCrop + CmpDrop from the last row of Table 2 unchanged and replace only the weighting strategy of task-specific losses with some heuristics. Table 4 shows their performance on the HotpotQA development set. AVERAGE, FIRST, and LAST are three static weighting strategies. AVERAGE assigns equal weights to all task-specific losses, whereas FIRST and LAST assign weight to only the first and last task-specific loss, respectively—that is, FIRST optimizes $l^{(0)}$ of the full model without dropout and LAST optimizes $l^{(2)}$ of the model with regular dropout rate p (equivalent to the baseline Longformer in Table 2). MAX is another dynamic weighting strategy that assigns weight to only the largest task-specific loss. We can see that dynamic weighting in comparative losses is significantly better than these heuristic weighting strategies, which proves that comparative loss can assign weights more appropriately. In addition, AVERAGE is better than the latter three strategies that consider only one task-specific loss, indicating that it is beneficial to consider multiple task-specific losses. Moreover, although the latter three are all assigned to only one task-specific loss, MAX is better than the other two, which indicates that dynamic assignment is better than static assignment.

Table 4.

Weighting Method	EM	F1
Cmp	63.5$_{\pm 0.3}$	77.2$_{\pm 0.3}$
AVERAGE	63.2$_{\pm 0.3}$	76.7$_{\pm 0.3}$
FIRST	62.1$_{\pm 0.1}$	75.8$_{\pm 0.1}$
LAST	61.9$_{\pm 0.4}$	75.6$_{\pm 0.3}$
MAX	63.1$_{\pm 0.3}$	76.7$_{\pm 0.3}$

Table 4. QA Performance on the Development Set of HotpotQA Distractor with Different Weighting Strategies

Cmp refers to Longformer + CmpCrop + CmpDrop that adaptively weights multiple task-specific losses through comparative loss. The others are heuristics, where AVERAGE assigns the same weights to all task-specific losses, FIRST and LAST assign weight only to the first or last, and MAX dynamically assigns weight only to the largest one.

Notably, the FIRST that directly optimizes the full model outperforms the LAST that is trained with dropout, suggesting that the inconsistency of dropout between the training and inference stages [82] may indeed lead to underfitting of the full model. And the fact that Cmp far outperforms FIRST and LAST indicates that comparative loss can automatically strike a balance between ensuring training-inference consistency and preventing overfitting.

5.2 Effect of Comparison Strategy

To study the impact of comparison strategies (i.e., how many ablation steps we should use for comparison and which ablation method we should choose at each step), we try a variety of comparison strategies on HotpotQA with different numbers of comparisons and ablation orders. As shown in Table 5, the results are not significantly further improved when we repeat CmpDrop/CmpCrop twice, but the results are further improved when we apply CmpCrop first and then CmpDrop. This indicates that comparing multiple models ablated by the same method (i.e., encouraging the model be either hereditarily input-efficient or hereditarily parameter-efficient) seems to have little effect on the performance of the full model, but the successive use of two different ablation methods (i.e., encouraging the model be efficient (both input-efficient and parameter-efficient)) is helpful. However, applying CmpDrop followed by CmpCrop did not perform as well as applying CmpDrop only, suggesting that the order of the ablation methods is important and perhaps the ablation should be done in the order of the information flow in the model.

Table 5.

c	Ablation Order	EM	F1
1	CmpDrop	63.1	77.0
1	CmpCrop	63.1	76.8
2	CmpDrop x 2	63.4	77.1
	CmpCrop x 2	63.0	76.7
	CmpDrop + CmpCrop	63.2	76.8
	CmpCrop + CmpDrop	63.6	77.4

Table 5. QA Performance on the Development Set of HotpotQA Distractor with Different Comparison Strategies

c is the number of ablation steps. x 2 indicates that an ablation method is repeated twice, and $A+B$ means that A is used followed by B.

To further confirm the influence of the number of ablation steps c, we show in Figure 4 the relationship between the model’s Average metric over the eight GLUE datasets and the number of ablations. We can find little difference in the average performance of the models trained with different numbers of CmpDrop, with the model trained with one CmpDrop performing significantly best mainly because its huge advantage on two of the datasets pulls up the average. Therefore, if there is no extreme demand for performance, we usually do not need to tune the hyperparameter c.

Fig. 4.

5.3 Effect of Model Parameters

To investigate the impact of model parameters, we explore the application of the comparative loss with CmpDrop on different-sized versions of BERT, RoBERTa, and ELECTRA. From Table 6, we can see that the comparative loss with CmpDrop achieves a consistent improvement over the baselines based on these backbone models, which indicates that the comparative loss can improve model performance by increasing parameter utility without increasing the number of parameters. Moreover, except for the one outlier of BERT$_\mathrm{Medium}$, we can roughly find that the less the model parameters, the greater the relative gain from comparative loss. This is reasonable because the individual hidden neurons in a model with lower capacity play a larger role, so the improvement in the utility of hidden neurons can be more reflected in the final performance, whereas for a model of higher capacity, it is easier to fit less training data (i.e., its task-specific loss is already low), so comparative loss has less room to play in reducing task-specific loss further. In addition, we observe that the boost to BERT from the comparative loss with CmpDrop is generally higher compared to RoBERTa and ELECTRA with more complicated pre-training, suggesting that the comparative loss helps the model escape from local optima due to parameter initialization.

Table 6.

# Parameters	BERT		ELECTRA		RoBERTa
# Parameters	Baseline	Gain (%)	Baseline	Gain (%)	Baseline	Gain (%)
Tiny: 4M	41.5$_{\pm 0.5}$/54.2$_{\pm 0.3}$	2.2$_{\pm 1.4}$/1.7$_{\pm 1.0}$	–	–	–	–
Small: 14M	74.2$_{\pm 0.6}$/82.7$_{\pm 0.5}$	2.2$_{\pm 0.8}$/1.6$_{\pm 0.6}$	78.1$_{\pm 0.2}$/85.9$_{\pm 0.1}$	1.6$_{\pm 0.5}$/1.2$_{\pm 0.3}$	–	–
Medium: 42M	78.0$_{\pm 0.4}$/85.8$_{\pm 0.2}$	1.6$_{\pm 0.7}$/1.3$_{\pm 0.4}$	–	–	–	–
Base: 110M	81.3$_{\pm 0.2}$/88.5$_{\pm 0.1}$	1.2$_{\pm 0.2}$/0.9$_{\pm 0.2}$	85.9$_{\pm 0.3}$/92.3$_{\pm 0.2}$	0.8$_{\pm 0.4}$/0.5$_{\pm 0.3}$	85.8$_{\pm 0.1}$/92.2$_{\pm 0.1}$	0.8$_{\pm 0.2}$/0.5$_{\pm 0.1}$
Large: 335M	83.9$_{\pm 0.2}$/90.8$_{\pm 0.1}$	1.2$_{\pm 0.2}$/0.7$_{\pm 0.1}$	89.0$_{\pm 0.1}$/94.7$_{\pm 0.0}$	0.7$_{\pm 0.2}$/0.3$_{\pm 0.0}$	89.0$_{\pm 0.3}$/94.7$_{\pm 0.0}$	0.7$_{\pm 0.3}$/0.3$_{\pm 0.1}$

Table 6. Evaluation Results of Baselines with Different Model Sizes and Initializations on the SQuAD Development Set (EM/F1), and Relative Gains of our Models Trained Using Comparative Loss with CmpDrop over Baselines

5.4 Effect of Input Context

To review the utility of the input context (i.e., input neurons) to models, we plot in Figure 5 the performance trends of the models using different context sizes. First, in both datasets, our models trained with comparative loss consistently outperform the baseline models for all context sizes, indicating that our models are able to utilize input neurons more efficiently with equal amounts of input context. Second, this shows that our comparative loss can further improve the model performance after streamlining the input with context selection. In addition, we notice that our ANCE-PRF + CmpCrop in Figure 5(a) improves retrieval performance as expected as the number of feedback documents increases, whereas ANCE-PRF reaches peak performance at four feedback documents and then suffers performance degradation, implying that our model is more robust and able to mine and exploit relevant information in the added feedback documents. In contrast to PRF, for HotpotQA in Figure 5(b), the performance of all RC models decreases as the number of paragraphs increases. This is understandable, since only 2 paragraphs in HotpotQA are supporting facts, and the remaining 8 mostly serve as a distraction, so the ideal performance curve can just be a horizontal line that does not drop when the paragraph number increases. Interestingly, we find that the degradation of Longformer + CmpDrop (2.7%) and Longformer + CmpCrop + CmpDrop (3.0%) from the oracle setting (2 gold paragraphs) to the distractor setting (10 paragraphs) is lower than that of the baseline Longformer (3.4%). This suggests that comparative loss can help the models suppress the noisy information in the added context. Although Longformer + CmpCrop (3.7%) has a larger degradation than Longformer, we believe this is because Longformer + CmpCrop needs to be optimized for various numbers of paragraphs, unlike other models without CmpCrop that focus on learning for one input form (i.e., always 10 paragraphs). However, this variety of input forms makes Longformer + CmpCrop perform better than Longformer + CmpDrop when the number of paragraphs is small ($\le 5$).

Fig. 5.

To further quantitatively demonstrate the help of comparative loss in the robustness of the PRF model to context size, we report in Table 7 the robustness indexes [18] of ANCE-PRF + CmpCrop and ANCE-PRF at different numbers of feedback documents. The robustness index is defined as $\frac{N_{+} - N_{-}}{|Q|}$, where $|Q|$ is the total number of evaluated queries and $N_+$ and $N_-$ are the number of queries that the PRF model improves or downgrades when one more feedback document is used. The value of robustness index is in [–1, 1], and the model with higher robustness index is more robust. We can see that the PRF model trained using comparative loss with CmpCrop is significantly more robust than the baseline model. Besides, from the gaps in their robustness indexes (only 0.03 or 0.02 for one or two documents, but 0.05 for more documents), we can find that the comparative loss is more helpful for long-form inputs.

Table 7.

k	1	2	3	4	5
ANCE-PRF	0.51	0.54	0.58	0.58	0.61
ANCE-PRF + Cmp (1 Crop)	0.54	0.56	0.63	0.63	0.66

Table 7. Robustness Index of $\boldsymbol {q}^{(k)}$ with Respect to $\boldsymbol {q}^{(k-1)}$ on MARCO Dev at Each PRF Depth k, Where $\boldsymbol {q}^{(k)}$ and $\boldsymbol {q}^{(k-1)}$ Are Reformulated Query Vectors by the PRF Model, the Latter Having One Less Document in the Input Context Than the Former

5.5 Loss Visualization

To figure out the impact of comparative loss on task-specific loss, we plot the curves of task-specific loss for the full model (i.e., $l^{(0)}$) in Figure 6. From Figure 6(a) and (b), we can see that with the same batch size, the comparative loss can help our models fit better compared to the baseline Longformer. Comparing Longformer + CmpDrop and Longformer + CmpCrop, we can find that the training loss of the former is significantly smaller, which indicates that the comparative loss with CmpDrop helps the model fit the training data better, whereas the evaluation loss of Longformer + CmpCrop rises less in the later stage, which indicates that the comparative loss with CmpCrop can mitigate the overfitting to some extent. Since the number of task-specific losses per sample optimized by comparative loss is $1+c$ times that of conventional training, we also plot the task-specific loss curves for PRF models in Figure 6(c) and (d), where the batch size of our uniCOIL-PRF + CmpCrop is $1/(1+c)$ of the baseline uniCOIL-PRF. In this way, the number of task-specific losses optimized in one batch for our model and the baseline is the same, which helps to further clarify the role of the comparative loss with CmpCrop. We can see that while the training loss of our model in Figure 6(c) does not drop as low as the baseline, its evaluation loss in Figure 6(d) drops to a lower level and significantly mitigates the overfitting.

Fig. 6.

5.6 Training Efficiency

We present in Table 8 the performance gain and relative change in training FLOPs of BERT$_\mathrm{base}$ + Cmp compared to BERT$_\mathrm{base}$, as well as the specific number of comparisons (i.e., number of ablation steps c) chosen for each dataset. We find that the actual overhead of training with comparative loss is usually less than $1+c$ times that of conventional training, and even less than that of conventional training (e.g., on QQP). This is because models trained with comparative loss tend to converge earlier than baselines. Combined with the insensitivity of comparative loss to the number of comparisons found from Figure 4, we believe that setting c to 1 or 2 can lead to effective and fast training when data is sufficient.

Table 8.

	MNLI	MRPC	QNLI	QQP	RTE	SST-2	STS-B	CoLA
c	3	1	4	2	1	2	4	4
Performance (%) $\uparrow$	+1.3	+3.8	+0.9	+0.5	+4.1	+0.4	+0.6	+1.6
FLOPs $\uparrow$	x3.5	x1.6	x3.5	x0.9	x2.1	x0.7	x4.8	x3.9

Table 8. Specific Settings for the Number of Ablation Steps of BERT + Cmp on Each GLUE Dataset, as Well as the Performance Gain and Increase in Training Computation Overhead Compared to BERT

6 Related Work

In this section, we introduce and discuss some work that has different motivations but is technically relevant to us, starting with contrastive learning [38] that learns by comparing, followed by recent training methods that also use dropout multiple times.

6.1 Contrastive Learning

Contrastive learning has recently achieved significant success in representation learning in computer vision and natural language processing. At its core, contrastive learning aims to learn effective representations by pulling semantically similar neighbors together and pushing apart non-neighbors [26]. Instead of learning a signal from individual data samples one at a time, it learns by comparing different samples [38]. The comparison is performed between positive pairs of similar samples and negative pairs of dissimilar samples. The positive pair must ensure that the two samples are similar, which can be constructed either by using supervised similarity annotation or by self-supervision. In self-supervised contrastive learning, a positive pair can consist of an original sample and its data augmentation. For example, SimCLR [12] in computer vision uses a crop, flip, distortion, or rotation of an original image as its similar view, and SimCSE [25] in natural language processing applies two dropout masks to an input sentence to create two slightly different sentence embeddings that are then used as a positive pair of sentence embeddings. To share more computation and save cost, negative pairs usually consist of two dissimilar samples within the same training batch. Although both learn through comparison, contrastive learning aims at pursuing alignment and uniformity [68] of representations, whereas our comparative loss aims at pursuing orderliness of the task-specific losses of the full model and its ablated model. Moreover, as the lexical meaning suggests, contrastive learning only classifies the relationship (i.e., similar or dissimilar) between two data samples in a binary manner, whereas our comparative loss compares multiple full/ablated models by ranking. However, these two are not in conflict, and our comparative loss can be used over the contrastive losses that served as task-specific losses.

6.2 Dropout-Based Comparison

Dropout is a family of stochastic techniques used in neural network training or inference that have attracted extensive research interest and are widely used in practice. The standard dropout [30] aims to avoid overfitting of the network by reducing the co-adaptation of neurons, where the outputs of individual neurons only provide useful information in combination with other neuron outputs. After this, a line of research focused on improving the standard dropout by employing other strategies for dropping neurons, such as DropConnect [65] and variational dropout [35].

A line of research that is relevant to us is the use of dropout multiple times in training. SimCSE [25] forwards the model twice with different dropout masks of the same rate and uses a contrastive loss to constrain the distribution of model outputs in the representation space. A possible side effect of dropout revealed by the existing literature [46, 82] is the non-negligible inconsistency between the training and inference stages of the model—that is, the submodels are optimized during training, but the full model without dropout is used during inference. To address this inconsistency, R-Drop [40] forward runs the model multiple times with different dropout masks to obtain multiple predicted probability distributions and applies KL-divergence on them to constrain their consistency. Unlike their multiple dropout masks that are sampled independently, the multiple dropout rates are increasing and the masks are progressive in our CmpDrop, with the subsequent mask obtained by further randomly discarding elements based on the previous one. In addition, we impose constraints on the task-specific losses at the end rather than on the representations and probabilities upstream. Notably, the full model is also optimized in due time when trained using the comparative loss with CmpDrop, which we argue is important to mitigate the inconsistency between training and inference. This is because while dropout avoids co-adaptation of neurons, it also weakens the cooperation between neurons (Section 5.1 gives some empirical support). In particular, in cases where all neurons are involved, the full model trained with dropout has not been taught how to make them work together efficiently and thus cannot be fully exploited during testing. Surprisingly, our comparative loss with CmpDrop can balance between promoting the cooperation of neurons and preventing their co-adaptation.

7 Conclusion

In this article, we proposed cross-model comparative loss, a simple task-agnostic loss function, to improve the utility of neurons in NLU models. Comparative loss is essentially a ranking loss based on the comparison principle between the full model and its ablated models, with the expectation that the less ablation there is, the smaller the task-specific loss. To ensure comparability among multiple ablated models, we progressively ablated the models and provided two controlled ablation methods based on dropout and context cropping, applicable to a wide range of tasks and models. We showed theoretically how comparative loss works, suggesting that it can adaptively assign weights to multiple task-specific losses. Extensive experiments and analysis on 14 datasets from three distinct NLU tasks demonstrated the universal effectiveness of comparative loss. Interestingly, our analysis confirmed that comparative loss can indeed assign weights more appropriately, and found that comparative loss is particularly effective for models with few parameters or long input.

In the future, we would like to apply comparative loss in other domains, such as natural language generation and computer vision, and explore its applications on other model architectures beyond Transformer. It could also be interesting to explore the application of comparative loss on top of self-supervised losses (e.g., contrastive loss) during pre-training. For training costs, how to reduce the overhead by reusing more shared computations is a direction worth exploring. Further, more advanced ablation methods in training, such as DropConnect [65] rather than standard dropout and adversarial rather than stochastic, may deserve future research efforts.

Footnotes

The output value of the neuron is set to 0, which is equivalent to all the connection weights to and from this neuron being set to 0.

In this work, “efficient” refers specifically to the high utility of neurons.

The number of non-zeros (L0 norm) in the mask determines the number of available parameters.

⁴

Slightly different from the binary mask in the comparison principle, we incorporate the scaling factors together into the mask to still express the parameter ablation concisely by $\boldsymbol {\theta }^{(i)} = \boldsymbol {m}^{(i)} \odot \boldsymbol {\theta }^{(i-1)}$.

⁵

https://github.com/huggingface/transformers/blob/v4.19.2/examples/pytorch/text-classification/README.md

⁶

https://microsoft.github.io/MSMARCO-Passage-Ranking-Submissions/leaderboard/

⁷

The fixed vectors of documents are restored from the index pre-built by https://github.com/castorini/pyserini.

⁸

https://github.com/microsoft/ANCE

⁹

We find that L2 normalization helps the model train stably, and it does not change the relevance ranking of documents to a query.

References

[1]

R. Attar and A. S. Fraenkel. 1977. Local feedback in full-text retrieval systems. Journal of the ACM 24, 3 (July1977), 397–417. DOI:

Model	MNLI	MRPC	QNLI	QQP	RTE	SST-2	STS-B	CoLA	Average
BERT\(_\mathrm{base}\) [21]	84.2\(_{\pm 0.3}\)	85.9\(_{\pm 0.5}\)	91.0\(_{\pm 0.1}\)	91.0\(_{\pm 0.1}\)	68.2\(_{\pm 1.7}\)	92.2\(_{\pm 0.2}\)	88.9\(_{\pm 1.0}\)	61.9\(_{\pm 1.1}\)	82.92\(_{\pm 0.14}\)
+ R-Drop [40]	84.3\(_{\pm 0.2}\)	86.5\(_{\pm 0.5}\)	91.6\(_{\pm 0.1}\)	91.4\(_{\pm 0.1}\)	68.8\(_{\pm 1.2}\)	92.2\(_{\pm 0.3}\)	89.4\(_{\pm 0.8}\)	62.9\(_{\pm 0.7}\)	83.38\(_{\pm 0.14}\)
+ Cmp (c Drop)	84.8\(_{\pm 0.2}\)	87.1\(_{\pm 1.0}\)	91.9\(_{\pm 0.2}\)	91.5\(_{\pm 0.1}\)	70.5\(_{\pm 1.2}\)	93.1\(_{\pm 0.4}\)	89.7\(_{\pm 0.6}\)	63.4\(_{\pm 1.1}\)	83.96\(_{\pm 0.31}\)
RoBERTa\(_\mathrm{base}\) [44]	87.4\(_{\pm 0.3}\)	89.9\(_{\pm 0.6}\)	92.8\(_{\pm 0.2}\)	91.4\(_{\pm 0.1}\)	76.5\(_{\pm 0.9}\)	95.1\(_{\pm 0.2}\)	90.8\(_{\pm 0.1}\)	62.7\(_{\pm 0.6}\)	85.82\(_{\pm 0.13}\)
+ R-Drop [40]	87.6\(_{\pm 0.1}\)	90.0\(_{\pm 0.6}\)	92.9\(_{\pm 0.1}\)	91.5\(_{\pm 0.0}\)	78.5\(_{\pm 1.5}\)	95.4\(_{\pm 0.2}\)	91.1\(_{\pm 0.1}\)	64.0\(_{\pm 0.5}\)	86.37\(_{\pm 0.26}\)
+ Cmp (c Drop)	88.0\(_{\pm 0.1}\)	90.5\(_{\pm 0.5}\)	93.3\(_{\pm 0.1}\)	91.9\(_{\pm 0.1}\)	79.3\(_{\pm 1.2}\)	95.5\(_{\pm 0.2}\)	91.3\(_{\pm 0.1}\)	65.4\(_{\pm 0.8}\)	86.91\(_{\pm 0.20}\)
ALBERT\(_\mathrm{base}\) [37]	85.0\(_{\pm 0.3}\)	88.4\(_{\pm 0.6}\)	92.1\(_{\pm 0.3}\)	90.4\(_{\pm 0.1}\)	77.8\(_{\pm 0.8}\)	93.0\(_{\pm 0.4}\)	90.9\(_{\pm 0.2}\)	59.8\(_{\pm 1.7}\)	84.68\(_{\pm 0.18}\)
+ R-Drop [40]	85.4\(_{\pm 0.4}\)	88.7\(_{\pm 0.8}\)	92.1\(_{\pm 0.2}\)	90.5\(_{\pm 0.1}\)	78.0\(_{\pm 1.4}\)	93.3\(_{\pm 0.3}\)	90.9\(_{\pm 0.2}\)	59.7\(_{\pm 0.4}\)	84.83\(_{\pm 0.14}\)
+ Cmp (c Drop)	85.7\(_{\pm 0.1}\)	89.5\(_{\pm 0.4}\)	92.3\(_{\pm 0.1}\)	91.0\(_{\pm 0.1}\)	78.3\(_{\pm 1.4}\)	93.5\(_{\pm 0.2}\)	91.0\(_{\pm 0.2}\)	60.3\(_{\pm 0.9}\)	85.19\(_{\pm 0.24}\)

Weighting Method	EM	F1
Cmp	63.5\(_{\pm 0.3}\)	77.2\(_{\pm 0.3}\)
AVERAGE	63.2\(_{\pm 0.3}\)	76.7\(_{\pm 0.3}\)
FIRST	62.1\(_{\pm 0.1}\)	75.8\(_{\pm 0.1}\)
LAST	61.9\(_{\pm 0.4}\)	75.6\(_{\pm 0.3}\)
MAX	63.1\(_{\pm 0.3}\)	76.7\(_{\pm 0.3}\)

Abstract

1 Introduction

2 Preliminaries

2.1 Conventional Training

2.2 Network Pruning

2.3 Context Selection

2.4 Ablation

3 Methodlogy

3.1 Comparison Principle

3.2 Comparative Loss

3.2.1 CmpDrop: Ablate Parameters by Dropout.

3.2.2 CmpCrop: Ablate Input by Cropping.

3.3 Discussion

4 Experiments

4.1 Classification: Application to Text Classification

4.1.1 Datasets.

4.1.2 Models and Training.

4.1.3 Results.

4.2 Extraction: Application to RC

4.2.1 Datasets.

4.2.2 Models and Training.

4.2.3 Results.

4.3 Ranking: Application to PRF

4.3.1 Datasets.

4.3.2 Models and Training.

4.3.3 Results.

5 Analysis

5.1 Effect of Weighting Strategy

5.2 Effect of Comparison Strategy

5.3 Effect of Model Parameters

5.4 Effect of Input Context

5.5 Loss Visualization

5.6 Training Efficiency

6 Related Work

6.1 Contrastive Learning

6.2 Dropout-Based Comparison

7 Conclusion

Footnotes

References

Index Terms

Recommendations

Neuromimetic model of a neuronal filter

Dual oscillator model of the respiratory neuronal network generating quantal slowing of respiratory rhythm

Understanding effects on excitability of simulated I h modulation in simple neuronal models

Comments

Information

Published In

Publisher

Publication History

Check for updates

Author Tags

Qualifiers

Funding Sources

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

View options

PDF

eReader

Get Access

Login options

Full Access

Figures

Other

Share

Share this Publication link

Share on social media

Affiliations

Understanding effects on excitability of simulated I _h modulation in simple neuronal models