Department of Computer Science, University of Bucharest, Bucharest, Romania

Learning Using Generated Privileged Information by Text-to-Image Diffusion Models

Rafael-Edy Menadil, Mariana-Iuliana Georgescu, Radu Tudor Ionescu (corresponding author: [email protected])

Abstract

Learning Using Privileged Information is a particular type of knowledge distillation where the teacher model benefits from an additional data representation during training, called privileged information, improving the student model, which does not see the extra representation. However, privileged information is rarely available in practice. To this end, we propose a text classification framework that harnesses text-to-image diffusion models to generate artificial privileged information. The generated images and the original text samples are further used to train multimodal teacher models based on state-of-the-art transformer-based architectures. Finally, the knowledge from multimodal teachers is distilled into a text-based (unimodal) student. Hence, by employing a generative model to produce synthetic data as privileged information, we guide the training of the student model. Our framework, called Learning Using Generated Privileged Information (LUGPI), yields noticeable performance gains on four text classification data sets, demonstrating its potential in text classification without any additional cost during inference.

Keywords: Learning Using Privileged Information · Knowledge Distillation · Text Classification · Diffusion Models · Data Augmentation

1 Introduction

In the quest to develop effective and efficient machine learning models, researchers introduced the knowledge distillation framework [5, 21], in which the outputs of one [5, 31] or more [21, 46, 47] typically heavy models, called teachers, are used as targets for a typically lightweight model, called student. This framework is primarily used to compress very deep models into shallower, yet effective models [12, 27, 31, 44, 46, 47]. A secondary use of knowledge distillation is to leverage additional data representations, available only at training time, to improve the performance of a model which does not have access to the extra representation. This latter framework, called Learning Using Privileged Information (LUPI) [42], was introduced well before the era of deep learning, but it was later shown [27] that it represents a particular kind of knowledge distillation.

Figure 1: An illustration of our Learning Using Generated Privileged Information (LUGPI) framework. For each text sample, a diffusion model generates an image. The original text sample and the generated image are used to train a multimodal teacher model. Then, a text-based student model is trained via knowledge distillation from the teacher. The distillation is carried out at two levels.

Although LUPI is an interesting and useful framework, it has rarely been applied in solving mainstream machine learning problems [1, 16, 17, 23, 42, 50], since finding additional modalities to represent the training data is not an easy task. With the advent of diffusion models [6, 22, 40, 41], which demonstrated impressive capabilities in generating realistic and diverse images based on text prompts [3, 20, 34, 35], we can now automatically generate image representations of text samples without much effort. To this end, we propose a novel framework called Learning Using Generated Privileged Information (LUGPI), which harnesses a state-of-the-art text-to-image diffusion model, namely Stable Diffusion v2 [34], to generate the privileged information. Our framework is applied to text classification tasks, where the original modality is represented by text samples and the additional modality is represented by images. Next, we train multimodal teacher models that combine state-of-the-art transformer-based architectures, such as Distilled Bidirectional Encoder Representations from Transformers (DistilBERT) [37], Vision Transformer (ViT) [11] and Contrastive Language-Image Pre-Training (CLIP) [33]. Remarkably, we find that our multimodal teachers outperform the standalone text-based (unimodal) model. However, employing the multimodal teachers during inference would inherently imply the use of the diffusion model to generate the images. This greatly impacts the inference time of the whole framework, since diffusion models are known to be computationally expensive [6]. For instance, Stable Diffusion v2 [34] comprises about 865 million learnable parameters, requiring about 17 seconds to generate a single image on an NVIDIA GeForce RTX 3090 24GB GPU. To address this limitation, we distill the knowledge from a multimodal teacher into a text-based student model, as shown in Figure 1. This completely eliminates the need to generate images during inference. Thus, LUGPI does not increase the computational cost at test time.

We carry out experiments on four text classification data sets to evaluate the proposed framework and compare it with the conventional training approach based on pre-training and fine-tuning, while preserving the underlying DistilBERT architecture [37]. Our empirical results indicate that LUGPI brings significant performance gains on all four data sets.

In summary, our contribution is threefold:

• We propose to harness diffusion models in order to artificially generate an extra data modality in the form of images, complementing the text modality, which enables us to train more powerful multimodal neural models.

• We introduce the novel Learning Using Generated Privileged Information framework to distill knowledge from our multimodal teachers into text-based (unimodal) students.

• We conduct experiments on four benchmarks, showing that the proposed framework improves the accuracy rates of text-based models by noticeable margins, without any extra cost during inference.

2 Related Work

Learning Using Privileged Information. There are two types of knowledge distillation frameworks, which were independently introduced in the literature, namely model compression [5, 21] and learning using privileged information [42]. In 2016, Lopez-Paz et al. [27] unified the model compression and learning using privileged information paradigms into the knowledge distillation framework.

The model compression technique [5, 21] is mainly aimed at training a shallow and efficient student architecture using one or more deeper and more powerful teachers. In this way, a shallow student can benefit from the knowledge gained by a deep teacher, while having fewer parameters and, consequently, a lower running time during inference.

The learning using privileged information paradigm [42] was introduced to transfer the knowledge from a teacher model, which is trained with privileged information, to a student model, which does not have access to the privileged data. In this scenario, the teacher and the student can share the same architecture, the main difference being the data used to train the two models. Many recent works [1, 14, 15, 17, 16, 25, 26, 48] applied the LUPI framework to improve the performance of the student without using additional information at test time. For example, Yuan et al. [48] trained a student to estimate the 3D hand pose using only the RGB image at test time. The knowledge about the depth channel was transferred from the teacher during the knowledge distillation process. Alehdaghi et al. [1] decreased the gap between RGB and infrared images used in the person re-identification task by applying the LUPI framework. They proposed to create an intermediate virtual domain that acts as a bridge between the two image modalities. The intermediate virtual domain was used as privileged information for the student model during training. Georgescu et al. [17] applied LUPI for facial expression recognition under strong occlusion, where the teacher learns from completely visible faces, but the student can only use occluded faces as input. They later extended their approach to age estimation and gender prediction from faces [16].

Similar to the aforementioned works [1, 14, 15, 17, 16, 25, 26, 48], we use extra data as privileged information during training. Different from the related studies on LUPI, our method does not require the existence of additional representations, since it generates the privileged data using a generative diffusion model. Hence, our framework broadens the applicability of LUPI to text-based corpora that do not have additional representations of the data samples.

Data augmentation. Our approach can also be seen as a rather unconventional data augmentation technique. However, data augmentation is usually employed to improve the robustness to data variation [9, 49], while in our case, we employ it to obtain privileged information. In general, data augmentation plays an important role in increasing the performance of deep learning architectures [9, 49], especially when the available training data is limited. The most common data augmentation methods used in computer vision are methods based on rotating, cropping and flipping the images [7]. Although techniques like these can offer better performance than just training on the original data, they lack the capability of creating a completely different data point, instead relying on the existing data and manipulating it just enough to have a variety within the training data.

In recent years, generative models, such as Generative Adversarial Networks (GANs) [19] and diffusion models, have been successfully used to augment data and improve the accuracy of various models [2, 4, 32, 36]. Generative models can create new data points that closely resemble the training data distribution, often being mistaken for natural data points. Therefore, classification models can leverage this new data variety to offer high performance without having to gather any new data points. Furthermore, there are some examples that successfully use generative models when conventional techniques fall short [39, 43]. Yang et al. [43] proposed to use diffusion models to generate images illustrating human-object interactions, conditioned by prompts explaining the interactions. Shivashankar et al. [39] trained a GAN model to generate images along with their segmentation labels for medical and face segmentation data sets. In these cases, conventional data augmentation methods provide suboptimal results when compared with generative models. This is because the latter models can generate new data points that resemble the training data distribution, aside from being able to generate variations of existing data points conditioned by some specific features that need to be present in the generated output.

Unlike other data augmentation techniques, we propose to generate image-based representations from text samples, essentially obtaining a new modality. Thus, our technique requires employing multimodal models to benefit from the extra data representation. To return to using a unimodal input while keeping the benefits of the multimodal data, we employ knowledge distillation.

3 Method

Overview and motivation. Learning Using Privileged Information [42] is suitable for machine learning tasks where the training data is represented by multiple modalities. However, the majority of machine learning problems only involve a single modality, rendering LUPI inapplicable. To overcome this challenge in the area of natural language processing and text classification, we propose to utilize a text-to-image diffusion model to generate privileged information in the form of images, in order to solve text classification problems where privileged information is not typically available.

We believe that our proposal is grounded in how the human mind works. For instance, humans use their imagination to mentally visualize objects, colors, textures or other visual aspects evoked in a text. This process helps humans in reaching a better and deeper text comprehension [13]. In a similar way, we conjecture that imaginary pictures can boost the performance of neural models such as BERT [8] or DistilBERT [37], provided that the visualizations are sufficiently representative. To increase the chances of successfully implementing our proposal, we make use of diffusion models, which are considered by many researchers as state-of-the-art text-to-image generators [10], surpassing previous models based on GANs.

To harness the generated images, a straightforward approach is to employ models on both text and image modalities in order to improve text classification performance. However, this approach is suboptimal in terms of speed, requiring additional time to generate and process images during inference. Our framework addresses this issue through knowledge distillation, i.e. the knowledge learned by the multimodal model, called teacher, is distilled into a text-based model, called student. At test time, we employ the student model to make predictions, thus eliminating the need to generate and process images. Our training framework is formally introduced in Algorithm 1. We first introduce the notations, then continue by presenting the three stages of our algorithm, namely image generation, teacher model training and knowledge distillation.

Algorithm 1: Learning Using Generated Privileged Information

Input:
$\mathcal{D}$ - the training set of labeled text samples,
$G$ - the text-conditional diffusion model,
$\theta_{G}$ - the weights of the diffusion model,
$T$ - the multimodal teacher model,
$S$ - the student model,
$\theta^{*}_{T}$ - (optional) pre-trained weights for the teacher,
$\theta^{*}_{S}$ - (optional) pre-trained weights for the student,
$\eta_{T}$ - the teacher’s learning rate,
$\eta_{S}$ - the student’s learning rate,
$\alpha$ - the importance of the cross-entropy between the teacher and the student,
$\beta$ - the importance of the mean squared error between the teacher and student embeddings.

Output:
$\theta_{S}$ - the trained weights of the student model.

1: $n\leftarrow\lvert\mathcal{D}\rvert$; $\lhd$ get the number of training samples
2: $X^{\prime}\leftarrow\emptyset$; $\lhd$ initialize the set of generated images
3: foreach $i\in\{1,2,...,n\}$ do
4:     $x^{\prime}_{i}\leftarrow G(x_{i},\theta_{G})$; $\lhd$ generate an image for the text sample $x_{i}$
5:     $X^{\prime}\leftarrow X^{\prime}\cup\{x^{\prime}_{i}\}$; $\lhd$ add the generated image to the set $X^{\prime}$
6: if $\theta^{*}_{T}\neq\emptyset$ then
7:     $\theta_{T}\leftarrow\theta^{*}_{T}$; $\lhd$ initialize weights of teacher using pre-trained weights
8: else
9:     $\theta_{T}\sim\mathcal{N}\left(0,\frac{2}{d_{in}+d_{out}}\right)$; $\lhd$ initialize weights of teacher using Xavier init [18]
10: repeat
11:     foreach $i\in\{1,2,...,n\}$ do
12:         $t_{i}\leftarrow T(x_{i},x^{\prime}_{i},\theta_{T})$; $\lhd$ get class probabilities predicted by the teacher
13:         $\theta_{T}\leftarrow\theta_{T}-\eta_{T}\cdot\nabla\mathcal{L}_{\scriptsize{\mbox{CE}}}(y_{i},t_{i})$; $\lhd$ train the teacher using cross-entropy
14: until convergence;
15: if $\theta^{*}_{S}\neq\emptyset$ then
16:     $\theta_{S}\leftarrow\theta^{*}_{S}$; $\lhd$ initialize weights of student using pre-trained weights
17: else
18:     $\theta_{S}\sim\mathcal{N}\left(0,\frac{2}{d_{in}+d_{out}}\right)$; $\lhd$ initialize weights of student using Xavier init [18]
19: repeat
20:     foreach $i\in\{1,2,...,n\}$ do
21:         $t_{i},e^{T}_{i}\leftarrow T(x_{i},x^{\prime}_{i},\theta_{T})$; $\lhd$ get probabilities and embedding from teacher
22:         $s_{i},e^{S}_{i}\leftarrow S(x_{i},\theta_{S})$; $\lhd$ get probabilities and embedding from student
23:         $\mathcal{L}_{\scriptsize{\mbox{KD}}}\leftarrow\mathcal{L}_{\scriptsize{\mbox{CE}}}(y_{i},s_{i})+\alpha\cdot\mathcal{L}^{T}_{\scriptsize{\mbox{CE}}}(t_{i},s_{i})+\beta\cdot\mathcal{L}^{T}_{l_{2}}(e^{T}_{i},e^{S}_{i})$; $\lhd$ apply Eq. (2)
24:         $\theta_{S}\leftarrow\theta_{S}-\eta_{S}\cdot\nabla\mathcal{L}_{\scriptsize{\mbox{KD}}}$; $\lhd$ train the student using the joint loss
25: until convergence;

Notations. Let $\mathcal{D}=(X,Y)=\{(x_{1},y_{1}),(x_{2},y_{2}),...,(x_{n},y_{n})\}$ represent a training set of text samples, where $n$ is the number of samples in the data set, and $y_{i}$ is the ground-truth label associated with text sample $x_{i}$. Let $T$ and $\theta_{T}$ represent the multimodal teacher model and its weights, respectively. Similarly, let $S$ and $\theta_{S}$ represent the text-based student model and its weights. The weights of the teacher and student models are updated using the learning rates $\eta_{T}$ and $\eta_{S}$, respectively. Let $X^{\prime}=\{x^{\prime}_{1},x^{\prime}_{2},...,x^{\prime}_{n}\}$ represent the set of images generated by a diffusion model $G$ with the weights $\theta_{G}$. Let $\mathcal{N}(\mu,\sigma^{2})$ represent the normal distribution of mean $\mu$ and standard deviation $\sigma$. In Algorithm 1, $d_{in}$ and $d_{out}$ denote the number of input and output units of the layer being initialized, following Xavier initialization [18]. Let $e^{T}_{i}$ and $e^{S}_{i}$ denote the embedding vectors produced by the teacher and the student for the $i$-th data sample, respectively. The embedding vectors are taken just before the classification layer of each model.
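To make the fallback initialization from Algorithm 1 concrete, the following minimal sketch samples a weight matrix from $\mathcal{N}(0, 2/(d_{in}+d_{out}))$ using only the Python standard library. The function name `xavier_normal`, the `seed` argument and the layer sizes are illustrative assumptions, not part of the original implementation.

```python
import math
import random


def xavier_normal(d_in, d_out, seed=0):
    """Sample a d_out x d_in weight matrix from N(0, 2 / (d_in + d_out))."""
    rng = random.Random(seed)
    # N(mu, sigma^2) is parameterized by its variance, so the sampler
    # receives the square root of 2 / (d_in + d_out) as standard deviation.
    std = math.sqrt(2.0 / (d_in + d_out))
    return [[rng.gauss(0.0, std) for _ in range(d_in)] for _ in range(d_out)]


# Illustrative layer sizes (assumed, not taken from the paper).
weights = xavier_normal(d_in=4, d_out=3)
```

In the reported experiments this branch is not exercised, since both the teacher encoders and the student start from pre-trained weights.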

Image generation. In steps 2-5 of Algorithm 1, we utilize a pre-trained text-to-image diffusion model to generate privileged information in the form of images. In step 4, the generator $G$ generates an image denoted by $x_{i}^{\prime}$ conditioned on the text sample $x_{i}$. In step 5, the generated image is added to the set $X^{\prime}$. Steps 4 and 5 are repeated until all training examples are passed through $G$.
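The generation loop can be sketched as follows. This is a minimal illustration in which `generator` is any callable mapping a text prompt to an image; `dummy_generator` is a hypothetical stand-in for the frozen diffusion model, used only to keep the example self-contained. A list is used instead of a set so that the $i$-th image stays aligned with the $i$-th text sample.

```python
def generate_privileged_images(texts, generator):
    """Steps 2-5 of Algorithm 1: one generated image per text sample."""
    images = []  # step 2: X' starts empty
    for text in texts:  # step 3: iterate over the training samples
        image = generator(text)  # step 4: x'_i = G(x_i, theta_G)
        images.append(image)  # step 5: add x'_i to X'
    return images


def dummy_generator(prompt):
    """Hypothetical stand-in for a frozen text-to-image diffusion model."""
    return {"prompt": prompt, "pixels": None}


texts = ["a review about a road movie", "a post about hockey"]
images = generate_privileged_images(texts, dummy_generator)
```

Because the generator is frozen, this stage runs once per data set and its outputs can be cached for all subsequent teacher training runs.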

We choose the Stable Diffusion v2 [34] model trained on the LAION-5B [38] data set as our generator $G$. We prefer this model over another open-source diffusion model, namely GLIDE [30]. To decide which generator to use, we visually inspected their outputs on a subset of 100 prompts from the chosen data sets. We observed that the images generated by Stable Diffusion v2 are usually better aligned with the provided text prompts than those generated by GLIDE, which led us to choose the former model.

Teacher training. The second stage of our pipeline is dedicated to training the teacher model. This stage corresponds to steps 6-14 of Algorithm 1. The teacher model is a multimodal architecture comprising three transformer-based encoders: a text encoder, an image encoder, and a multimodal encoder. As illustrated in Figure 1, the tokens produced by the text encoder are concatenated with the tokens given by the image encoder. The concatenated set of tokens is further passed through the multimodal encoder, which comprises a vanilla transformer block based on multi-head attention, having 8 attention heads. The multimodal encoder learns to perform cross-modal attention, strengthening relations across the text and image modalities. From the resulting set of multimodal tokens, we keep the classification token $U_{\scriptsize{\mbox{CLS}}}$ from the text modality and the classification token $V_{\scriptsize{\mbox{CLS}}}$ from the image modality, discarding the other tokens. This is a conventional procedure when transformers are applied to downstream classification tasks [8, 11]. Next, the classification tokens are concatenated and given as input to a multi-layer perceptron (MLP) with two layers, where the first layer comprises 786 neurons and the second one comprises $k$ neurons, where $k$ is the number of classes. A softmax function computes the output probabilities.
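The teacher's forward pass described above can be summarized compactly as follows. The symbols $E_{\text{text}}$, $E_{\text{img}}$ and $E_{\text{mm}}$ (the three encoders) and $A$, $B$ (the unimodal token sets) are notation introduced here for illustration only; the activation inside the MLP is not specified in the text and is left abstract.

```latex
\begin{aligned}
A &= E_{\text{text}}(x_{i}), \qquad B = E_{\text{img}}(x^{\prime}_{i}), \\
[U; V] &= E_{\text{mm}}([A; B]), \\
t_{i} &= \mathrm{softmax}\big(\mathrm{MLP}([U_{\text{CLS}}; V_{\text{CLS}}])\big),
\end{aligned}
```

where $[\cdot\,;\cdot]$ denotes concatenation, and $U_{\text{CLS}}$ and $V_{\text{CLS}}$ are the classification tokens kept from the text and image parts of the multimodal output, respectively.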

In order to make the prediction $t_{i}$, the teacher model $T$ takes the text sample $x_{i}$ and the generated image $x^{\prime}_{i}$ as input, according to step 12 of Algorithm 1. In step 13, the weights of the teacher $\theta_{T}$ are updated using gradient descent, where the gradient is computed with respect to the cross-entropy loss. For the vector of predicted class probabilities $t_{i}$ and the one-hot label encoding $y_{i}$, the cross-entropy loss is given by:

$$\mathcal{L}_{\scriptsize{\mbox{CE}}}(y_{i},t_{i})=-\sum_{j=1}^{k}y_{ij}\cdot\log(t_{ij}),\;\forall i\in\{1,2,...,n\}, \tag{1}$$

where $k$ is the number of classes.
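As a small worked example of Eq. (1), the sketch below evaluates the cross-entropy for a single sample with $k=3$ classes. The small constant `eps` is an assumption added for numerical stability and is not part of Eq. (1); the label and probability values are made up for illustration.

```python
import math


def cross_entropy(y, t, eps=1e-12):
    """Eq. (1) for one sample: -sum_j y_ij * log(t_ij)."""
    return -sum(y_j * math.log(t_j + eps) for y_j, t_j in zip(y, t))


y_i = [0.0, 1.0, 0.0]  # one-hot label: the true class is the second one
t_i = [0.2, 0.7, 0.1]  # class probabilities predicted by the teacher
loss = cross_entropy(y_i, t_i)  # reduces to -log(0.7), up to eps
```

With a one-hot label, only the term of the true class survives, so the loss decreases as the teacher assigns more probability mass to that class.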

In our implementation, we choose to use pre-trained architectures for the text and image encoders. For a fair and representative evaluation, we use the same text encoder as the baseline and the student models, namely DistilBERT [37]. This is to ensure that the observed performance gains are not due to the use of a more powerful text encoder for the teacher model, but rather due to the extra image modality. For the image encoder, we consider two alternative architectures, namely ViT [11] and CLIP Image [33].

Knowledge distillation. After training the teacher, we apply the knowledge distillation procedure to transfer the knowledge from the multimodal teacher to the student. This stage corresponds to steps 15-25 of Algorithm 1. According to steps 15-18, the student can optionally be pre-trained in a standard fashion, prior to the knowledge distillation procedure. We utilize this option to ensure a fair comparison with the baseline model. More precisely, both the baseline DistilBERT and our student DistilBERT are pre-trained. In general, when there are no pre-trained weights for the student, we can simply initialize the model using a conventional approach (step 18), such as Xavier initialization [18].

The student model jointly optimizes three objectives. On the one hand, the student has to minimize the cross-entropy loss with respect to the ground-truth (hard) labels, to ensure that its predictions are correct. On the other hand, the student has to minimize the cross-entropy with respect to the probabilities (soft labels) predicted by the teacher, as well as the mean squared error between the corresponding embeddings produced by the teacher and the student, which enables the student to learn knowledge from the teacher model. Formally, for the $i$-th data sample, the joint objective is computed as follows:

$$\begin{split}\mathcal{L}_{\scriptsize{\mbox{KD}}}&=\mathcal{L}_{\scriptsize{\mbox{CE}}}(y_{i},s_{i})+\alpha\cdot\mathcal{L}^{T}_{\scriptsize{\mbox{CE}}}(t_{i},s_{i})+\beta\cdot\mathcal{L}^{T}_{l_{2}}(e^{T}_{i},e^{S}_{i})\\ &=-\sum_{j=1}^{k}y_{ij}\cdot\log(s_{ij})-\alpha\cdot\sum_{j=1}^{k}t_{ij}\cdot\log(s_{ij})+\beta\cdot\lVert e^{T}_{i}-e^{S}_{i}\rVert^{2}_{2},\;\forall i\in\{1,...,n\},\end{split} \tag{2}$$

where $\alpha,\beta\geq 0$ are two hyperparameters that control the importance of the knowledge distillation objectives. Note that the distillation is carried out at two levels, namely with respect to the embedding space and the output space. Our ablation study shows the importance of distilling knowledge at both levels.
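The joint objective in Eq. (2) can be sketched for a single sample as follows. This is a minimal illustration under two assumptions: the teacher and student embeddings have the same dimension, and a small `eps` is added inside the logarithm for numerical stability. The helper names and the toy inputs are hypothetical.

```python
import math


def soft_cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(p * math.log(q + eps) for p, q in zip(target, pred))


def squared_l2(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def kd_loss(y, s, t, e_teacher, e_student, alpha, beta):
    """Eq. (2): hard-label CE + alpha * soft-label CE + beta * embedding term."""
    hard = soft_cross_entropy(y, s)  # L_CE(y_i, s_i)
    soft = soft_cross_entropy(t, s)  # L^T_CE(t_i, s_i)
    emb = squared_l2(e_teacher, e_student)  # L^T_l2(e^T_i, e^S_i)
    return hard + alpha * soft + beta * emb


# Toy inputs (made up); alpha and beta follow the values reported as optimal.
loss = kd_loss(
    y=[1.0, 0.0], s=[0.8, 0.2], t=[0.9, 0.1],
    e_teacher=[1.0, 0.0], e_student=[0.5, 0.5],
    alpha=3.0, beta=1.0,
)
```

Setting `alpha=0` and `beta=0` recovers the conventional fine-tuning objective, which corresponds to the rows without distillation in the ablation study.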

4 Experiments

We conduct experiments on four data sets covering three tasks: opinion mining, text categorization by topic, and complex word identification. The data sets are chosen to provide a comprehensive evaluation of image generation and privileged information in different target tasks.

4.1 Data Sets

IMDB Large Movie Review. The IMDB Large Movie Review data set [29] is a well-known benchmark for polarity classification, which is composed of 50,000 movie reviews separated into 25,000 for training and 25,000 for testing. We keep $10\%$ of the training set for validation purposes. The task is to predict the polarity of the sentiment (positive or negative) expressed in each review.

20 Newsgroups. The 20 Newsgroups data set [24] is a popular benchmark for text categorization by topic. It comprises 18,828 documents that are assigned to one of 20 different categories, ranging from technology to sports and religion. In our experiments, we divide the data set into 11,353 training documents, 1,261 validation documents and 6,214 test documents.

English News. The English News corpus [45] comprises 17,861 sentences with marked words or multi-word phrases that are annotated with complexity levels by native and non-native English speakers. The task is to determine if the target words or multi-word phrases are complex or not. The corpus is divided into 14,002 training sentences, 1,764 validation sentences and 2,095 test sentences.

English WikiNews. Another corpus for complex word identification introduced by Yimam et al. [45] is English WikiNews. It has a similar format to English News. The English WikiNews data set is divided into 7,746 training sentences, 870 validation sentences and 1,287 test sentences.

4.2 Experimental Setup

Table 1: Accuracy rates on IMDB Large Movie Review [29], 20 Newsgroups [24], English News [45] and English WikiNews [45] data sets. Our teacher and student models are compared with the fine-tuned vanilla DistilBERT [37]. For reference, we report results with the independent image encoders, namely ViT [11] and CLIP [33]. The best accuracy on each corpus is highlighted in bold. Significantly better results (at a p-value of 0.001) based on McNemar / Cochran Q testing are marked with ‡.

Model	Text	Image	IMDB Reviews	20 Newsgroups	English News	English WikiNews
DistilBERT [37]	✓		0.919	0.918	0.861	0.842
ViT [11]		✓	0.559	0.137	0.832	0.754
CLIP Image [33]		✓	0.549	0.523	0.822	0.746
DistilBERT+ViT (Teacher 1)	✓	✓	0.920	0.919	0.867‡	0.843
DistilBERT+CLIP (Teacher 2)	✓	✓	0.931‡	0.926‡	0.868‡	0.846
DistilBERT (Student 1)	✓		0.930‡	0.928‡	0.869‡	0.843
DistilBERT (Student 2)	✓		0.931‡	0.929‡	0.871‡	0.848‡
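The significance marks in Table 1 rely on McNemar / Cochran Q testing. As a rough illustration of the pairwise case, the sketch below computes a continuity-corrected McNemar p-value from the two discordant counts. The exact variant used for Table 1 is not stated, so the continuity correction is an assumption, and the counts in the example are hypothetical.

```python
import math


def mcnemar_p_value(b, c):
    """McNemar test with continuity correction.

    b: test samples that only model A classifies correctly.
    c: test samples that only model B classifies correctly.
    """
    if b + c == 0:
        return 1.0  # the two models never disagree
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical discordant counts between a baseline and a student.
p = mcnemar_p_value(b=30, c=80)
significant = p < 0.001
```

Only the samples on which the two models disagree enter the statistic, which is why two models with similar accuracy rates can still differ significantly.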

Baselines and backbones. As baseline, we choose the DistilBERT model [37], a variant of BERT [8] that exhibits good performance with a reasonable number of learnable parameters. For a fair comparison with the baseline, we employ the DistilBERT architecture for our students as well. Moreover, the text encoder inside the multimodal teachers is also based on DistilBERT. To encode the generated images, we alternatively employ the pre-trained image encoder of the CLIP architecture [33], or the pre-trained ViT [11] model. We thus obtain a teacher based on DistilBERT+ViT (Teacher 1), and a teacher based on DistilBERT+CLIP (Teacher 2). We distill the knowledge from Teacher 1 into a student based on DistilBERT (Student 1), and the knowledge from Teacher 2 into a different student (Student 2), which is also based on DistilBERT. We underline that the two students have the same architecture, but they differ in terms of the source providing the privileged information.

Hyperparameters. We train the models with the AdamW [28] optimizer using a learning rate of $5\cdot 10^{-5}$ with linear decay, which converges to good optima across all our experiments. The baseline DistilBERT, the teachers and the students are each trained for 100 epochs on an NVIDIA GeForce GTX 1080Ti GPU with 11 GB of VRAM. In all the experiments, we use a mini-batch size of 14 samples. Following previous works on knowledge distillation [5, 27], we soften the output of the teacher using the temperature $\tau$. We validate this hyperparameter in the range $1$-$10$, achieving optimal results with $\tau=8$. The hyperparameters $\alpha$ and $\beta$ from Eq. (2) are validated in the range from $0.1$ to $5$. The optimal values are $\alpha=3$ and $\beta=1$.
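To illustrate the effect of the temperature, the following sketch applies a softmax to logits divided by $\tau$. Dividing the logits by $\tau$ is the common formulation of temperature scaling and is assumed here; the logit values are made up for illustration.

```python
import math


def softmax_with_temperature(logits, tau=1.0):
    """Softmax over logits / tau; a larger tau yields a flatter distribution."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


teacher_logits = [4.0, 1.0, 0.0]  # hypothetical teacher outputs
sharp = softmax_with_temperature(teacher_logits, tau=1.0)
soft = softmax_with_temperature(teacher_logits, tau=8.0)
```

With $\tau=8$, the ranking of the classes is preserved, but the probabilities of the non-maximal classes become larger, exposing more of the teacher's relative preferences to the student.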

Data preprocessing. Before generating images with Stable Diffusion v2 [34], we perform some preprocessing steps to clean up the text samples. For the IMDB data set, we remove the HTML tags that are sometimes present in movie reviews. For the 20 Newsgroups data set, we discard email addresses and subjects, using the remaining content as text prompt. For the English News and English WikiNews data sets, we provide the target word or multi-word phrase in each sentence as input for the text-conditional diffusion model. This is because the task is to identify the complexity of the target words, not of the whole sentences.

To process the examples from the English News and English WikiNews corpora with DistilBERT, we modify each sentence by marking the target words or multi-word phrases with the [SEP] token. No further preprocessing is required for the other data sets.
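The preprocessing steps can be sketched as follows. This is a minimal illustration: the regular expression used for HTML removal and the exact placement of the [SEP] markers (one before and one after the target span) are assumptions, and the helper names are hypothetical.

```python
import re


def strip_html(text):
    """Remove HTML tags (e.g. <br />) and collapse the leftover whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", no_tags).strip()


def mark_target(sentence, start, end, marker="[SEP]"):
    """Surround the target span sentence[start:end] with the marker token."""
    target = sentence[start:end]
    return f"{sentence[:start]}{marker} {target} {marker}{sentence[end:]}"


clean = strip_html("Great film.<br /><br />Loved it.")

sentence = "The committee ratified the treaty."
start = sentence.index("ratified")
marked = mark_target(sentence, start, start + len("ratified"))
```

For the complex word identification corpora, the diffusion prompt would be the target span alone, while the marked sentence is the input given to DistilBERT.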

4.3 Results

We present the results obtained on the IMDB, 20 Newsgroups, English News and English WikiNews data sets in Table 1.

IMDB. The baseline DistilBERT model [37], which is trained using only text data, reaches an accuracy of $91.9\%$, while the image encoders barely surpass the random chance baseline. The best multimodal teacher, which employs the CLIP image encoder, reaches an accuracy of $93.1\%$. Our first student outperforms its teacher by $1\%$, while our second student is on par with its teacher. Notably, both students surpass the baseline model by at least $1.1\%$.

20 Newsgroups. The baseline DistilBERT [37] obtains a performance of $91.8\%$, while the individual image encoders lag far behind. Since ViT is much worse than CLIP, the corresponding teacher (DistilBERT+ViT) barely surpasses the baseline model, while DistilBERT+CLIP (Teacher 2) reaches an accuracy of $92.6\%$. Meanwhile, our students based on privileged information surpass their teachers, showing considerable performance gains over the baseline DistilBERT.

English News. On the English News corpus, the baseline DistilBERT obtains an accuracy of $86.1\%$. The ViT and CLIP image encoders obtain competitive results, being less than $4\%$ behind DistilBERT [37]. Both multimodal teachers outperform the baseline DistilBERT. Moreover, our student models surpass their teachers. The best student outperforms the baseline DistilBERT by $1\%$, reaching an accuracy of $87.1\%$ in complex word identification.

English WikiNews. The results on the English WikiNews corpus are consistent with those on the English News corpus. Indeed, the independent image encoders obtain fairly good results, given that they only take generated images as input. The multimodal teachers outperform the baseline DistilBERT. The first student matches its teacher, while the second student surpasses its teacher, reaching the best accuracy of $84.8\%$.

Overall. We notice that the text modality leads to better results than the image modality, regardless of the data set. This is a natural consequence of the fact that the images are generated by a diffusion model, which can produce images that do not reflect the label. Another generic observation is that the multimodal teacher based on the CLIP image encoder (Teacher 2) is generally better than the other teacher. This leads to a better DistilBERT student (Student 2). Furthermore, we observe that the students generally surpass their teachers. We explain this observation through the fact that the multimodal teachers assign equal importance to the text and image modalities, although the image modality is naturally inferior. In contrast, the students focus on the original text modality, obtaining information about the image modality only through knowledge distillation.

Since both students surpass the baseline DistilBERT in each and every case, we conclude that our LUGPI framework is beneficial in various text classification tasks, such as polarity classification, text categorization by topic, and complex word identification.

Qualitative results. In Figure 2, we illustrate some examples which are incorrectly classified by the baseline DistilBERT, but are correctly classified by our second teacher model (DistilBERT+CLIP). Remarkably, we observe that the images generated by Stable Diffusion v2 contain important clues. For instance, a car is generated when the prompt is about cars, even though the word “car” is never mentioned inside the prompt. For the complex word identification task, we observe that the images generated for simple (non-complex) words tend to be less abstract, while those generated for complex words tend to be more abstract. In summary, the illustrated examples show that the generated images can complement the corresponding text samples. Although our students do not see these images at test time, our quantitative results presented in Table 1 show that the students clearly benefit from the privileged information transferred from the multimodal teachers.

Table 2: Accuracy rates on IMDB Large Movie Review [29], 20 Newsgroups [24], English News [45] and English WikiNews [45] data sets, while ablating the knowledge distillation components of our loss defined in Eq. (2). The best accuracy on each corpus is highlighted in bold.

Model	$\mathcal{L}^{T}_{\scriptsize{\mbox{CE}}}$	$\mathcal{L}^{T}_{l_{2}}$	IMDB Reviews	20 Newsgroups	English News	English WikiNews
DistilBERT (Student 1)			0.919	0.918	0.861	0.842
DistilBERT (Student 2)			0.919	0.918	0.861	0.842
DistilBERT (Student 1)	✓		0.913	0.922	0.765	0.842
DistilBERT (Student 2)	✓		0.923	0.926	0.870	0.844
DistilBERT (Student 1)		✓	0.911	0.926	0.869	0.840
DistilBERT (Student 2)		✓	0.919	0.925	0.865	0.843
DistilBERT (Student 1)	✓	✓	0.930	0.928	0.869	0.843
DistilBERT (Student 2)	✓	✓	0.931	0.929	0.871	0.848

Ablation study. Our LUGPI framework performs the distillation at two network levels, via two distinct loss terms. To demonstrate the utility of both terms, we perform an ablation study of the knowledge distillation loss terms $\mathcal{L}^{T}_{\scriptsize{\mbox{CE}}}$ and $\mathcal{L}^{T}_{l_{2}}$ from Eq. (2). We present the corresponding results in Table 2. Distilling knowledge at the output level via $\mathcal{L}^{T}_{\scriptsize{\mbox{CE}}}$ is not beneficial for the first student. In contrast, distilling knowledge at the embedding level via $\mathcal{L}^{T}_{l_{2}}$ helps both students on three data sets (except IMDB). In summary, the ablation study shows that both distillation losses are required to obtain consistent improvements.
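To make the ablated configurations concrete, the following is a minimal sketch of how the two distillation terms could be switched on and off. It assumes that the loss in Eq. (2) adds a cross-entropy term with respect to the teacher's class probabilities and a squared $l_2$ distance between student and teacher embeddings to the usual cross-entropy with respect to the ground-truth label. The function names, the weights `alpha` and `beta`, and the toy inputs are hypothetical and serve only for illustration; they are not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, pred_probs, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, pred_probs))


def l2_distance_sq(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def distillation_loss_sketch(student_logits, student_emb, teacher_probs,
                             teacher_emb, label, use_ce_t=True, use_l2_t=True,
                             alpha=1.0, beta=1.0):
    """Supervised loss plus optional output-level and embedding-level terms.

    alpha and beta are hypothetical weights (assumed, not taken from Eq. (2)).
    """
    probs = softmax(student_logits)
    one_hot = [1.0 if i == label else 0.0 for i in range(len(probs))]
    loss = cross_entropy(one_hot, probs)  # supervised term on the true label
    if use_ce_t:
        # output-level distillation: match the teacher's class probabilities
        loss += alpha * cross_entropy(teacher_probs, probs)
    if use_l2_t:
        # embedding-level distillation: match the teacher's embedding
        loss += beta * l2_distance_sq(student_emb, teacher_emb)
    return loss


if __name__ == "__main__":
    # Toy inputs (made up for illustration only).
    args = dict(student_logits=[2.0, 0.5], student_emb=[0.1, 0.2, 0.3],
                teacher_probs=[0.9, 0.1], teacher_emb=[0.0, 0.2, 0.5], label=0)
    # The four rows of the ablation correspond to the four flag combinations.
    for use_ce_t in (False, True):
        for use_l2_t in (False, True):
            value = distillation_loss_sketch(use_ce_t=use_ce_t,
                                             use_l2_t=use_l2_t, **args)
            print(f"CE^T={use_ce_t!s:5} l2^T={use_l2_t!s:5} loss={value:.4f}")
```

With both flags disabled, the sketch reduces to the loss of the baseline DistilBERT, which matches the first two rows of Table 2, where neither distillation term is active.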

Training and inference time. The inference time of our final model is identical to that of the vanilla DistilBERT. However, the training time of our pipeline is between $2.3\times$ and $2.8\times$ longer than that of the student, depending on the data set and the vision model. This includes the time for generating the images with the pre-trained Stable Diffusion model. Note that Stable Diffusion is kept frozen in our pipeline.
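One way to read the reported factor is as a ratio of total pipeline time to student training time. The decomposition below is an assumption introduced only for illustration: $t_{\text{gen}}$ (image generation), $t_{\text{teacher}}$ (teacher training) and $t_{\text{student}}$ (student training) are hypothetical symbols that do not appear elsewhere in the paper, and the exact accounting may differ.

$$
r = \frac{t_{\text{gen}} + t_{\text{teacher}} + t_{\text{student}}}{t_{\text{student}}}, \qquad 2.3 \leq r \leq 2.8 .
$$

Under this reading, $t_{\text{gen}}$ and $t_{\text{teacher}}$ are incurred only during training, which is consistent with the inference time being unchanged.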

5 Conclusion

In this work, we proposed the Learning Using Generated Privileged Information framework, which employs a diffusion model to generate privileged images. These images were then used to train a multimodal teacher taking both text and image data as input. A unimodal student was subsequently trained by distilling privileged information from the multimodal teacher. We performed experiments on four text classification data sets, namely IMDB Movie Reviews, 20 Newsgroups, English News and English WikiNews. We alternatively employed two different image encoders to extract image features, demonstrating accuracy gains in both cases. All our distilled students outperformed the baseline model and even the multimodal teachers, without any extra cost during inference. In future work, we aim to extend our framework to more NLP tasks.

References

[1] Alehdaghi, M., Josi, A., Cruz, R.M.O., Granger, E.: Visible-Infrared Person Re-Identification Using Privileged Intermediate Information. In: Proceedings of ECCVW. pp. 720–737 (2022)
[2] Antoniou, A., Storkey, A., Edwards, H.: Augmenting image classifiers using data augmentation generative adversarial networks. In: Proceedings of ICANN. pp. 594–603 (2018)
[3] Avrahami, O., Lischinski, D., Fried, O.: Blended diffusion for text-driven editing of natural images. In: Proceedings of CVPR. pp. 18208–18218 (2022)
[4] Azizi, S., Kornblith, S., Saharia, C., Norouzi, M., Fleet, D.J.: Synthetic Data from Diffusion Models Improves ImageNet Classification. arXiv preprint arXiv:2304.08466 (2023)
[5] Ba, J., Caruana, R.: Do deep nets really need to be deep? In: Proceedings of NIPS. pp. 2654–2662 (2014)
[6] Croitoru, F.A., Hondru, V., Ionescu, R.T., Shah, M.: Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(9), 10850–10869 (2023)
[7] Cubuk, E.D., Zoph, B., Shlens, J., Le, Q.: RandAugment: Practical Automated Data Augmentation with a Reduced Search Space. In: Proceedings of NeurIPS. vol. 33, pp. 18613–18624 (2020)
[8] Devlin, J., Chang, M.W., Lee, K., Toutanova, L.K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of NAACL-HLT. pp. 4171–4186 (2019)
[9] DeVries, T., Taylor, G.W.: Improved Regularization of Convolutional Neural Networks with Cutout. arXiv preprint arXiv:1708.04552 (2017)
[10] Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. In: Proceedings of NeurIPS. vol. 34, pp. 8780–8794 (2021)
[11] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021)
[12] Feng, Y., Wang, H., Hu, R., Yi, D.T.: Triplet distillation for deep face recognition. In: Proceedings of ICIP. pp. 808–812 (2020)
[13] Gambrell, L.B., Jawitz, P.B.: Mental Imagery, Text Illustrations, and Children’s Story Comprehension and Recall. Reading Research Quarterly 28, 264–276 (1993)
[14] Gao, Z., Wu, S., Liu, Z., Luo, J., Zhang, H., Gong, M., Li, S.: Learning the implicit strain reconstruction in ultrasound elastography using privileged information. Medical Image Analysis 58, 101534 (2019)
[15] Garcia, N.C., Morerio, P., Murino, V.: Learning with privileged information via adversarial discriminative modality distillation. IEEE Transactions on Pattern Analysis and Machine Intelligence 42(10), 2581–2593 (2019)
[16] Georgescu, M.I., Duţǎ, G.E., Ionescu, R.T.: Teacher–student training and triplet loss to reduce the effect of drastic face occlusion: Application to emotion recognition, gender identification and age estimation. Machine Vision and Applications 33(1), 12 (2022)
[17] Georgescu, M.I., Ionescu, R.T.: Teacher-student training and triplet loss for facial expression recognition under occlusion. In: Proceedings of ICPR. pp. 2288–2295 (2021)
[18] Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of AISTATS. pp. 249–256 (2010)
[19] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Proceedings of NIPS. vol. 27, pp. 2672–2680 (2014)
[20] Gu, S., Chen, D., Bao, J., Wen, F., Zhang, B., Chen, D., Yuan, L., Guo, B.: Vector quantized diffusion model for text-to-image synthesis. In: Proceedings of CVPR. pp. 10696–10706 (2022)
[21] Hinton, G., Vinyals, O., Dean, J.: Distilling the Knowledge in a Neural Network. In: Proceedings of NIPS Deep Learning and Representation Learning Workshop (2014)
[22] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Proceedings of NeurIPS. vol. 33, pp. 6840–6851 (2020)
[23] Jung, B., Johansson, F.D.: Efficient learning of nonlinear prediction models with time-series privileged information. In: Proceedings of NeurIPS. vol. 35, pp. 19048–19060 (2022)
[24] Lang, K.: NewsWeeder: Learning to Filter Netnews. In: Proceedings of ICML. pp. 331–339 (1995)
[25] Lee, W., Lee, J., Kim, D., Ham, B.: Learning with privileged information for efficient image super-resolution. In: Proceedings of ECCV. pp. 465–482 (2020)
[26] Liu, Z., Wei, J., Li, R., Zhou, J.: Learning multi-modal brain tumor segmentation from privileged semi-paired MRI images with curriculum disentanglement learning. Computers in Biology and Medicine 159, 106927 (2023)
[27] Lopez-Paz, D., Bottou, L., Schölkopf, B., Vapnik, V.: Unifying distillation and privileged information. In: Proceedings of ICLR (2016)
[28] Loshchilov, I., Hutter, F.: Decoupled Weight Decay Regularization. In: Proceedings of ICLR (2019)
[29] Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C.: Learning Word Vectors for Sentiment Analysis. In: Proceedings of ACL. pp. 142–150 (2011)
[30] Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., Chen, M.: GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In: Proceedings of ICML. pp. 16784–16804 (2021)
[31] Park, W., Kim, D., Lu, Y., Cho, M.: Relational Knowledge Distillation. In: Proceedings of CVPR. pp. 3962–3971 (2019)
[32] Qian, Y., Hu, H., Tan, T.: Data augmentation using generative adversarial networks for robust speech recognition. Speech Communication 114, 1–9 (2019)
[33] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning Transferable Visual Models From Natural Language Supervision. In: Proceedings of ICML. pp. 8748–8763 (2021)
[34] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-Resolution Image Synthesis with Latent Diffusion Models. In: Proceedings of CVPR. pp. 10684–10695 (2022)
[35] Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S.K.S., Ayan, B.K., Mahdavi, S.S., Lopes, R.G., et al.: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. In: Proceedings of NeurIPS. vol. 35, pp. 36479–36494 (2022)
[36] Sandfort, V., Yan, K., Pickhardt, P.J., Summers, R.M.: Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks. Scientific Reports 9(1), 16884 (2019)
[37] Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In: Proceedings of EMC² (2019)
[38] Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: LAION-5B: An open large-scale dataset for training next generation image-text models. In: Proceedings of NeurIPS. vol. 35, pp. 25278–25294 (2022)
[39] Shivashankar, C., Miller, S.: Semantic Data Augmentation with Generative Models. In: Proceedings of CVPRW. pp. 863–873 (2023)
[40] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using non-equilibrium thermodynamics. In: Proceedings of ICML. pp. 2256–2265 (2015)
[41] Song, Y., Ermon, S.: Generative modeling by estimating gradients of the data distribution. In: Proceedings of NeurIPS. vol. 32, pp. 11918–11930 (2019)
[42] Vapnik, V., Vashist, A.: A new learning paradigm: Learning using privileged information. Neural Networks 22(5–6), 544–557 (2009)
[43] Yang, J., Li, B., Yang, F., Zeng, A., Zhang, L., Zhang, R.: Boosting human-object interaction detection with text-to-image diffusion model. arXiv preprint arXiv:2305.12252 (2023)
[44] Yim, J., Joo, D., Bae, J., Kim, J.: A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning. In: Proceedings of CVPR. pp. 7130–7138 (2017)
[45] Yimam, S.M., Štajner, S., Riedl, M., Biemann, C.: Multilingual and Cross-Lingual Complex Word Identification. In: Proceedings of RANLP. pp. 813–822 (2017)
[46] You, S., Xu, C., Xu, C., Tao, D.: Learning from multiple teacher networks. In: Proceedings of KDD. pp. 1285–1294 (2017)
[47] Yu, L., Yazici, V.O., Liu, X., van de Weijer, J., Cheng, Y., Ramisa, A.: Learning Metrics from Teachers: Compact Networks for Image Embedding. In: Proceedings of CVPR. pp. 2907–2916 (2019)
[48] Yuan, S., Stenger, B., Kim, T.K.: RGB-based 3D hand pose estimation via privileged learning with depth images. arXiv preprint arXiv:1811.07376 (2018)
[49] Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond Empirical Risk Minimization. In: Proceedings of ICLR (2018)
[50] Zhao, P., Xie, L., Wang, J., Zhang, Y., Tian, Q.: Progressive privileged knowledge distillation for online action detection. Pattern Recognition 129, 108741 (2022)