Department of Computer Science, University of Bucharest, Bucharest, Romania

Learning Using Generated Privileged Information by Text-to-Image Diffusion Models

Rafael-Edy Menadil, Mariana-Iuliana Georgescu, Radu Tudor Ionescu (corresponding author: [email protected])
Abstract

Learning Using Privileged Information is a particular type of knowledge distillation where the teacher model benefits from an additional data representation during training, called privileged information, improving the student model, which does not see the extra representation. However, privileged information is rarely available in practice. To this end, we propose a text classification framework that harnesses text-to-image diffusion models to generate artificial privileged information. The generated images and the original text samples are further used to train multimodal teacher models based on state-of-the-art transformer-based architectures. Finally, the knowledge from multimodal teachers is distilled into a text-based (unimodal) student. Hence, by employing a generative model to produce synthetic data as privileged information, we guide the training of the student model. Our framework, called Learning Using Generated Privileged Information (LUGPI), yields noticeable performance gains on four text classification data sets, demonstrating its potential in text classification without any additional cost during inference.

Keywords: Learning Using Privileged Information · Knowledge Distillation · Text Classification · Diffusion Models · Data Augmentation

1 Introduction

In the quest to develop effective and efficient machine learning models, researchers introduced the knowledge distillation framework [5, 21], in which the outputs of one [5, 31] or more [21, 46, 47] typically heavy models, called teachers, are used as targets for a typically lightweight model, called student. This framework is primarily used to compress very deep models into shallower, yet effective models [12, 27, 31, 44, 46, 47]. A secondary use of knowledge distillation is to leverage additional data representations, available only at training time, to improve the performance of a model which does not have access to the extra representation. This latter framework, called Learning Using Privileged Information (LUPI) [42], was introduced well before the era of deep learning, but it was later shown [27] that it represents a particular kind of knowledge distillation.

Figure 1: An illustration of our Learning Using Generated Privileged Information (LUGPI) framework. For each text sample, a diffusion model generates an image. The original text sample and the generated image are used to train a multimodal teacher model. Then, a text-based student model is trained via knowledge distillation from the teacher. The distillation is carried out at two levels.

Although LUPI is an interesting and useful framework, it has rarely been applied to mainstream machine learning problems [1, 16, 17, 23, 42, 50], since finding additional modalities to represent the training data is not an easy task. With the advent of diffusion models [6, 22, 40, 41], which demonstrated impressive capabilities in generating realistic and diverse images based on text prompts [3, 20, 34, 35], we can now automatically generate image representations of text samples without much effort. To this end, we propose a novel framework called Learning Using Generated Privileged Information (LUGPI), which harnesses a state-of-the-art text-to-image diffusion model, namely Stable Diffusion v2 [34], to generate the privileged information. Our framework is applied to text classification tasks, where the original modality is represented by text samples and the additional modality is represented by images. Next, we train multimodal teacher models that combine state-of-the-art transformer-based architectures, such as Distilled Bidirectional Encoder Representations from Transformers (DistilBERT) [37], Vision Transformer (ViT) [11] and Contrastive Language-Image Pre-Training (CLIP) [33]. Remarkably, we find that our multimodal teachers outperform the standalone text-based (unimodal) model. However, employing the multimodal teachers during inference would require running the diffusion model to generate the images. This greatly increases the inference time of the whole framework, since diffusion models are notorious for being computationally expensive [6]. For instance, Stable Diffusion v2 [34] comprises about 865 million learnable parameters, requiring about 17 seconds to generate a single image on an NVIDIA GeForce RTX 3090 24GB GPU. To address this limitation, we distill the knowledge from a multimodal teacher into a text-based student model, as shown in Figure 1. This completely eliminates the need to generate images during inference. Thus, LUGPI does not increase the computational cost at test time.

We carry out experiments on four text classification data sets to evaluate the proposed framework and compare it with the conventional training approach based on pre-training and fine-tuning, while preserving the underlying DistilBERT architecture [37]. Our empirical results indicate that LUGPI brings significant performance gains on all four data sets.

In summary, our contribution is threefold:

  • We propose to harness diffusion models in order to artificially generate an extra data modality in the form of images, complementing the text modality, which enables us to train more powerful multimodal neural models.

  • We introduce the novel Learning Using Generated Privileged Information framework to distill knowledge from our multimodal teachers into text-based (unimodal) students.

  • We conduct experiments on four benchmarks, showing that the proposed framework improves the accuracy rates of text-based models by noticeable margins, without any extra cost during inference.

2 Related Work

Learning Using Privileged Information. There are two types of knowledge distillation frameworks, which were independently introduced in the literature, namely model compression [5, 21] and learning using privileged information [42]. In 2016, Lopez-Paz et al. [27] unified the model compression and learning using privileged information paradigms into the knowledge distillation framework.

The model compression technique [5, 21] is mainly aimed at training a shallow and efficient student architecture using one or more deeper and more powerful teachers. In this way, a shallow student can benefit from the knowledge gained by a deep teacher, while having fewer parameters and, consequently, a lower running time during inference.

The learning using privileged information paradigm [42] was introduced to transfer the knowledge from a teacher model, which is trained with privileged information, to a student model, which does not have access to the privileged data. In this scenario, the teacher and the student can share the same architecture, the main difference being the data used to train the two models. Many recent works [1, 14, 15, 17, 16, 25, 26, 48] applied the LUPI framework to improve the performance of the student without using additional information at test time. For example, Yuan et al. [48] trained a student to estimate the 3D hand pose using only the RGB image at test time. The knowledge about the depth channel was transferred from the teacher during the knowledge distillation process. Alehdaghi et al. [1] decreased the gap between RGB and infrared images used in the person re-identification task by applying the LUPI framework. They proposed to create an intermediate virtual domain that acts as a bridge between the two image modalities. The intermediate virtual domain was used as privileged information for the student model during training. Georgescu et al. [17] applied LUPI for facial expression recognition under strong occlusion, where the teacher learns from completely visible faces, but the student can only use occluded faces as input. They later extended their approach to age estimation and gender prediction from faces [16].

Similar to the aforementioned works [1, 14, 15, 17, 16, 25, 26, 48], we use extra data as privileged information during training. Different from the related studies on LUPI, our method does not require the existence of additional representations, since it generates the privileged data using a generative diffusion model. Hence, our framework broadens the applicability of LUPI to text-based corpora that do not have additional representations of the data samples.

Data augmentation. Our approach can also be seen as a rather unconventional data augmentation technique. However, data augmentation is usually employed to improve the robustness to data variation [9, 49], while in our case, we employ it to obtain privileged information. In general, data augmentation plays an important role in increasing the performance of deep learning architectures [9, 49], especially when the available training data is limited. The most common data augmentation methods used in computer vision are methods based on rotating, cropping and flipping the images [7]. Although techniques like these can offer better performance than just training on the original data, they lack the capability of creating a completely different data point, instead relying on the existing data and manipulating it just enough to have a variety within the training data.

In recent years, generative models, such as Generative Adversarial Networks (GANs) [19] and diffusion models, have been used to successfully augment data and improve the accuracy of various models [2, 4, 32, 36]. Generative models can create new data points that closely resemble the training data distribution, often being mistaken for natural data points. Therefore, classification models can leverage this new data variety to offer high performance without having to gather any new data points. Furthermore, there are some examples that successfully use generative models when conventional techniques fall short [39, 43]. Yang et al. [43] proposed to use diffusion models to generate images illustrating human-object interactions, conditioned by prompts explaining the interactions. Shivashankar et al. [39] trained a GAN model to generate images along with their segmentation labels for medical and face segmentation data sets. In these cases, conventional data augmentation methods provide suboptimal results when compared with generative models. This is because the latter models can generate new data points that resemble the training data distribution, aside from being able to generate variations of existing data points conditioned by some specific features that need to be present in the generated output.

Unlike other data augmentation techniques, we propose to generate image-based representations from text samples, essentially obtaining a new modality. Thus, our technique requires employing multimodal models to benefit from the extra data representation. To return to using a unimodal input while keeping the benefits of the multimodal data, we employ knowledge distillation.

3 Method

Overview and motivation. Learning Using Privileged Information [42] is suitable for machine learning tasks where the training data is represented by multiple modalities. However, the majority of machine learning problems only involve a single modality, rendering LUPI inapplicable. To overcome this challenge in the area of natural language processing and text classification, we propose to utilize a text-to-image diffusion model to generate privileged information in the form of images, in order to solve text classification problems where privileged information is not typically available.

We believe that our proposal is grounded in how the human mind works. For instance, humans use their imagination to mentally visualize objects, colors, textures or other visual aspects evoked in a text. This process helps humans in reaching a better and deeper text comprehension [13]. In a similar way, we conjecture that imaginary pictures can boost the performance of neural models such as BERT [8] or DistilBERT [37], provided that the visualizations are sufficiently representative. To increase the chances of successfully implementing our proposal, we make use of diffusion models, which are considered by many researchers as state-of-the-art text-to-image generators [10], surpassing previous models based on GANs.

To harness the generated images, a straightforward approach is to employ models on both text and image modalities in order to improve text classification performance. However, this approach is suboptimal in terms of speed, requiring additional time to generate and process images during inference. Our framework addresses this issue through knowledge distillation, i.e. the knowledge learned by the multimodal model, called teacher, is distilled into a text-based model, called student. At test time, we employ the student model to make predictions, thus eliminating the need to generate and process images. Our training framework is formally introduced in Algorithm 1. We first introduce the notations, then continue by presenting the three stages of our algorithm, namely image generation, teacher model training and knowledge distillation.

Input: $\mathcal{D}$ - the training set of labeled text samples, $G$ - the text-conditional diffusion model, $\theta_G$ - the weights of the diffusion model, $T$ - the multimodal teacher model, $S$ - the student model, $\theta^*_T$ - (optional) pre-trained weights for the teacher, $\theta^*_S$ - (optional) pre-trained weights for the student, $\eta_T$ - the teacher's learning rate, $\eta_S$ - the student's learning rate, $\alpha$ - the importance of the cross-entropy between the teacher and the student, $\beta$ - the importance of the mean squared error between the teacher and student embeddings.
Output: $\theta_S$ - the trained weights of the student model.
1 $n \leftarrow \lvert\mathcal{D}\rvert$; $\lhd$ get the number of training samples
2 $X' \leftarrow \emptyset$; $\lhd$ initialize the set of generated images
3 foreach $i \in \{1, 2, \dots, n\}$ do
4     $x'_i \leftarrow G(x_i, \theta_G)$; $\lhd$ generate an image for the text sample $x_i$
5     $X' \leftarrow X' \cup \{x'_i\}$; $\lhd$ add the generated image to the set $X'$
6 if $\theta^*_T \neq \emptyset$ then
7     $\theta_T \leftarrow \theta^*_T$; $\lhd$ initialize weights of teacher using pre-trained weights
8 else
9     $\theta_T \sim \mathcal{N}\left(0, \frac{2}{d_{in}+d_{out}}\right)$; $\lhd$ initialize weights of teacher using Xavier init [18]
10 repeat
11     foreach $i \in \{1, 2, \dots, n\}$ do
12         $t_i \leftarrow T(x_i, x'_i, \theta_T)$; $\lhd$ get class probabilities predicted by the teacher
13         $\theta_T \leftarrow \theta_T - \eta_T \cdot \nabla\mathcal{L}_{\text{CE}}(y_i, t_i)$; $\lhd$ train the teacher using cross-entropy
14 until convergence;
15 if $\theta^*_S \neq \emptyset$ then
16     $\theta_S \leftarrow \theta^*_S$; $\lhd$ initialize weights of student using pre-trained weights
17 else
18     $\theta_S \sim \mathcal{N}\left(0, \frac{2}{d_{in}+d_{out}}\right)$; $\lhd$ initialize weights of student using Xavier init [18]
19 repeat
20     foreach $i \in \{1, 2, \dots, n\}$ do
21         $t_i, e^T_i \leftarrow T(x_i, x'_i, \theta_T)$; $\lhd$ get probabilities and embedding from teacher
22         $s_i, e^S_i \leftarrow S(x_i, \theta_S)$; $\lhd$ get probabilities and embedding from student
23         $\mathcal{L}_{\text{KD}} \leftarrow \mathcal{L}_{\text{CE}}(y_i, s_i) + \alpha \cdot \mathcal{L}^T_{\text{CE}}(t_i, s_i) + \beta \cdot \mathcal{L}^T_{l_2}(e^T_i, e^S_i)$; $\lhd$ apply Eq. (2)
24         $\theta_S \leftarrow \theta_S - \eta_S \cdot \nabla\mathcal{L}_{\text{KD}}$; $\lhd$ train the student using the joint loss
25 until convergence;
Algorithm 1: Learning Using Generated Privileged Information

Notations. Let $\mathcal{D} = (X, Y) = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$ represent a training set of text samples, where $n$ is the number of samples in the data set, and $y_i$ is the ground-truth label associated with text sample $x_i$. Let $T$ and $\theta_T$ represent the multimodal teacher model and its weights, respectively. Similarly, let $S$ and $\theta_S$ represent the text-based student model and its weights. The weights of the teacher and student models are updated using the learning rates $\eta_T$ and $\eta_S$, respectively. Let $X' = \{x'_1, x'_2, \dots, x'_n\}$ represent the set of images generated by a diffusion model $G$ with the weights $\theta_G$. Let $\mathcal{N}(\mu, \sigma^2)$ represent the normal distribution of mean $\mu$ and standard deviation $\sigma$. Let $e^T_i$ and $e^S_i$ denote the embedding vectors produced by the teacher and the student for the $i$-th data sample, respectively. The embedding vectors are taken just before the classification layer of each model.

Image generation. In steps 2-5 of Algorithm 1, we utilize a pre-trained text-to-image diffusion model to generate privileged information in the form of images. In step 4, the generator $G$ generates an image denoted by $x'_i$ conditioned on the text sample $x_i$. In step 5, the generated image is added to the set $X'$. Steps 4 and 5 are repeated until all training examples are passed through $G$.

We choose the Stable Diffusion v2 [34] model trained on the LAION-5B [38] data set as our generator $G$. We prefer this model over another open-source diffusion model, namely GLIDE [30]. To decide which generator to use, we visually inspected their outputs on a subset of 100 prompts from the chosen data sets. We observed that the outputs of Stable Diffusion v2 are usually better aligned with the provided text prompts than those of GLIDE, which led us to choose the former model.
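The generation loop in steps 2-5 of Algorithm 1 can be sketched as follows. This is only an illustrative sketch: `generate_privileged_images` and `dummy_generator` are hypothetical names, and the stand-in generator returns a string instead of calling a real diffusion model. A list is used in place of the set $X'$ so that the $i$-th image stays aligned with the $i$-th text sample.

```python
from typing import Callable, List, Sequence, TypeVar

Image = TypeVar("Image")


def generate_privileged_images(
    texts: Sequence[str],
    generator: Callable[[str], Image],
) -> List[Image]:
    """Generate one image per text sample (steps 2-5 of Algorithm 1)."""
    images: List[Image] = []           # step 2: start with no generated images
    for x_i in texts:                  # step 3: iterate over the text samples
        x_prime_i = generator(x_i)     # step 4: image conditioned on x_i
        images.append(x_prime_i)       # step 5: store it, keeping index i
    return images


def dummy_generator(prompt: str) -> str:
    """Hypothetical stand-in for a text-to-image model (assumption)."""
    return f"<image for: {prompt[:20]}>"


texts = ["a great movie", "the goalkeeper saved a penalty"]
images = generate_privileged_images(texts, dummy_generator)
```

Since the images are only needed for training, they can be generated once and cached, which is why the generator is passed in as a plain callable here.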

Teacher training. The second stage of our pipeline is dedicated to training the teacher model. This stage corresponds to steps 6-14 of Algorithm 1. The teacher model is a multimodal architecture comprising three transformer-based encoders: a text encoder, an image encoder, and a multimodal encoder. As illustrated in Figure 1, the tokens produced by the text encoder are concatenated with the tokens given by the image encoder. The concatenated set of tokens is further passed through the multimodal encoder, which comprises a vanilla transformer block based on multi-head attention, having 8 attention heads. The multimodal encoder learns to perform cross-modal attention, strengthening relations across the text and image modalities. From the resulting set of multimodal tokens, we keep the classification token $U_{\text{CLS}}$ from the text modality and the classification token $V_{\text{CLS}}$ from the image modality, discarding the other tokens. This is a conventional procedure when transformers are applied to downstream classification tasks [8, 11]. Next, the classification tokens are concatenated and given as input to a multi-layer perceptron (MLP) with two layers, where the first layer comprises 786 neurons and the second one comprises $k$ neurons, where $k$ is the number of classes. A softmax function computes the output probabilities.
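The token bookkeeping described above can be illustrated with a short sketch. It assumes that each encoder emits its classification token first and that the multimodal encoder preserves the number and order of tokens; the function name `fuse_and_select_cls` and the identity encoder used as a default are hypothetical placeholders, and the MLP head is omitted.

```python
from typing import Callable, List, Sequence

Token = List[float]


def fuse_and_select_cls(
    text_tokens: Sequence[Token],
    image_tokens: Sequence[Token],
    multimodal_encoder: Callable[[List[Token]], List[Token]] = lambda toks: toks,
) -> Token:
    """Concatenate both token sets, encode them jointly, keep the two CLS tokens."""
    # Concatenate text tokens and image tokens into one sequence.
    joint = list(text_tokens) + list(image_tokens)
    # Placeholder for the transformer block with cross-modal attention.
    fused = multimodal_encoder(joint)
    # Assumption: the CLS token is the first token of each modality.
    u_cls = fused[0]                    # text classification token
    v_cls = fused[len(text_tokens)]     # image classification token
    # The concatenated CLS tokens form the input of the MLP classifier.
    return list(u_cls) + list(v_cls)


text_tokens = [[1.0, 1.0], [0.1, 0.2], [0.3, 0.4]]   # U_CLS followed by 2 tokens
image_tokens = [[2.0, 2.0], [0.5, 0.6]]              # V_CLS followed by 1 token
mlp_input = fuse_and_select_cls(text_tokens, image_tokens)
```

With the identity placeholder, `mlp_input` is simply the two classification tokens placed side by side; a real multimodal encoder would first mix information across all tokens.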

In order to make the prediction $t_i$, the teacher model $T$ takes the text sample $x_i$ and the generated image $x'_i$ as input, according to step 12 of Algorithm 1. In step 13, the weights of the teacher $\theta_T$ are updated using gradient descent, where the gradient is computed with respect to the cross-entropy loss. For the vector of predicted class probabilities $t_i$ and the one-hot label encoding $y_i$, the cross-entropy loss is given by:

$$\mathcal{L}_{\text{CE}}(y_i, t_i) = -\sum_{j=1}^{k} y_{ij} \cdot \log(t_{ij}), \forall i \in \{1, 2, \dots, n\}, \tag{1}$$

where $k$ is the number of classes.
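As a small worked example of Eq. (1), the sketch below evaluates the cross-entropy for a single sample. The clamping constant `eps` is an assumption added only to avoid taking the logarithm of zero; it is not part of the equation.

```python
import math
from typing import Sequence


def cross_entropy(target: Sequence[float], predicted: Sequence[float],
                  eps: float = 1e-12) -> float:
    """Eq. (1): minus the sum over classes of target_j * log(predicted_j)."""
    if len(target) != len(predicted):
        raise ValueError("target and predicted must have the same length")
    # eps (assumption) guards against log(0) for zero probabilities.
    return -sum(y * math.log(max(p, eps)) for y, p in zip(target, predicted))


# k = 3 classes; the ground-truth class is the second one.
y_i = [0.0, 1.0, 0.0]        # one-hot label encoding
t_i = [0.2, 0.7, 0.1]        # class probabilities predicted by the teacher
loss = cross_entropy(y_i, t_i)   # only the true class contributes: -log(0.7)
```

Because $y_i$ is one-hot, the sum reduces to the negative log-probability assigned to the correct class.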

In our implementation, we choose to use pre-trained architectures for the text and image encoders. For a fair and representative evaluation, we use the same text encoder as the baseline and the student models, namely DistilBERT [37]. This is to ensure that the observed performance gains are not due to the use of a more powerful text encoder for the teacher model, but rather due to the extra image modality. For the image encoder, we consider two alternative architectures, namely ViT [11] and CLIP Image [33].

Knowledge distillation. After training the teacher, we apply the knowledge distillation procedure to transfer the knowledge from the multimodal teacher to the student. This stage corresponds to steps 15-25 of Algorithm 1. According to steps 15-18, the student can optionally be pre-trained in a standard fashion, prior to the knowledge distillation procedure. We utilize this option to ensure a fair comparison with the baseline model. More precisely, both the baseline DistilBERT and our student DistilBERT are pre-trained. In general, when there are no pre-trained weights for the student, we can simply initialize the model using a conventional approach (step 18), such as Xavier initialization [18].
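For the case without pre-trained weights, Xavier initialization [18] as written in Algorithm 1 draws each weight from a zero-mean normal distribution with variance $2/(d_{in}+d_{out})$. A minimal sketch for a single weight matrix is given below; the function name `xavier_normal` and the fixed seed are illustrative assumptions.

```python
import math
import random
from typing import List


def xavier_normal(d_in: int, d_out: int, seed: int = 0) -> List[List[float]]:
    """Sample a d_in x d_out weight matrix from N(0, 2 / (d_in + d_out))."""
    rng = random.Random(seed)                 # fixed seed (assumption) for repeatability
    std = math.sqrt(2.0 / (d_in + d_out))     # standard deviation, i.e. sqrt of the variance
    return [[rng.gauss(0.0, std) for _ in range(d_out)] for _ in range(d_in)]


weights = xavier_normal(d_in=64, d_out=64)
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
empirical_var = sum((w - mean) ** 2 for w in flat) / len(flat)
target_var = 2.0 / (64 + 64)   # 0.015625
```

With 4,096 sampled weights, `empirical_var` should lie close to `target_var`, although the exact value depends on the seed.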

The student model jointly optimizes three objectives. On the one hand, the student has to minimize the cross-entropy loss with respect to the ground-truth (hard) labels, to ensure that its predictions are correct. On the other hand, the student has to minimize the cross-entropy with respect to the probabilities (soft labels) predicted by the teacher, as well as the mean squared error between the corresponding embeddings produced by the teacher and the student, which enables the student to learn knowledge from the teacher model. Formally, for the $i$-th data sample, the joint objective is computed as follows:

$$\begin{split}\mathcal{L}_{\text{KD}} &= \mathcal{L}_{\text{CE}}(y_i, s_i) + \alpha \cdot \mathcal{L}^T_{\text{CE}}(t_i, s_i) + \beta \cdot \mathcal{L}^T_{l_2}(e^T_i, e^S_i) \\ &= -\sum_{j=1}^{k} y_{ij} \cdot \log(s_{ij}) - \alpha \cdot \sum_{j=1}^{k} t_{ij} \cdot \log(s_{ij}) + \beta \cdot \lVert e^T_i - e^S_i \rVert^2_2, \forall i \in \{1, \dots, n\},\end{split} \tag{2}$$

where $\alpha,\beta\geq 0$ are two hyperparameters that control the importance of the knowledge distillation objectives. Note that the distillation is carried out at two levels, namely with respect to the embedding space and the output space. Our ablation study shows the importance of distilling knowledge at both levels.
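As a minimal sketch of how the per-sample objective in Eq. (2) could be evaluated, the function below takes class-probability vectors and embedding vectors as plain Python lists. The function name `lugpi_loss`, the `eps` stabilizer and the list-based interface are illustrative assumptions, not part of the original implementation; the default weights follow the values reported under Hyperparameters.

```python
import math


def lugpi_loss(y, s, t, e_teacher, e_student, alpha=3.0, beta=1.0, eps=1e-12):
    """Per-sample loss of Eq. (2).

    y: one-hot ground-truth label, s: student class probabilities,
    t: teacher class probabilities, e_teacher / e_student: embeddings.
    eps is an assumed numerical stabilizer for log(0).
    """
    # Cross-entropy between the ground-truth label and the student output.
    ce_label = -sum(y_j * math.log(s_j + eps) for y_j, s_j in zip(y, s))
    # Cross-entropy between the teacher output and the student output.
    ce_teacher = -sum(t_j * math.log(s_j + eps) for t_j, s_j in zip(t, s))
    # Squared l2 distance between teacher and student embeddings.
    l2 = sum((a - b) ** 2 for a, b in zip(e_teacher, e_student))
    return ce_label + alpha * ce_teacher + beta * l2
```

Setting `alpha=0` and `beta=0` reduces the function to the standard cross-entropy loss, which corresponds to training the student without any distillation.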

4 Experiments

We conduct experiments on four data sets covering three tasks: opinion mining, text categorization by topic, and complex word identification. The data sets are chosen to provide a comprehensive evaluation of image generation and privileged information in different target tasks.

4.1 Data Sets

IMDB Large Movie Review. The IMDB Large Movie Review data set [29] is a well-known benchmark for polarity classification, which is composed of 50,000 movie reviews separated into 25,000 for training and 25,000 for testing. We keep 10% of the training set for validation purposes. The task is to predict the polarity of the sentiment (positive or negative).

20 Newsgroups. The 20 Newsgroups data set [24] is a popular benchmark for text categorization by topic. It comprises 18,828 documents that are assigned to one of 20 different categories, ranging from technology to sports and religion. In our experiments, we divide the data set into 11,353 training documents, 1,261 validation documents and 6,214 test documents.

English News. The English News corpus [45] comprises 17,861 sentences with marked words or multi-word phrases that are annotated with complexity levels by native and non-native English speakers. The task is to determine if the target words or multi-word phrases are complex or not. The corpus is divided into 14,002 training sentences, 1,764 validation sentences and 2,095 test sentences.

English WikiNews. Another corpus for complex word identification introduced by Yimam et al. [45] is English WikiNews. It has a similar format to English News. The English WikiNews data set is divided into 7,746 training sentences, 870 validation sentences and 1,287 test sentences.

4.2 Experimental Setup

Table 1: Accuracy rates on IMDB Large Movie Review [29], 20 Newsgroups [24], English News [45] and English WikiNews [45] data sets. Our teacher and student models are compared with the fine-tuned vanilla DistilBERT [37]. For reference, we report results with the independent image encoders, namely ViT [11] and CLIP [33]. The best accuracy on each corpus is highlighted in bold. Significantly better results (at a p-value of 0.001) based on McNemar / Cochran Q testing are marked with ‡.

| Model | Text | Image | IMDB Reviews | 20 Newsgroups | English News | English WikiNews |
|---|---|---|---|---|---|---|
| DistilBERT [37] | ✓ | - | 0.919 | 0.918 | 0.861 | 0.842 |
| ViT [11] | - | ✓ | 0.559 | 0.137 | 0.832 | 0.754 |
| CLIP Image [33] | - | ✓ | 0.549 | 0.523 | 0.822 | 0.746 |
| DistilBERT+ViT (Teacher 1) | ✓ | ✓ | 0.920 | 0.919 | 0.867 | 0.843 |
| DistilBERT+CLIP (Teacher 2) | ✓ | ✓ | 0.931 | 0.926 | 0.868 | 0.846 |
| DistilBERT (Student 1) | ✓ | - | 0.930 | 0.928 | 0.869 | 0.843 |
| DistilBERT (Student 2) | ✓ | - | 0.931 | 0.929 | 0.871 | 0.848 |
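The significance test named in the caption of Table 1 can be sketched as follows for a pair of models evaluated on the same test set. This is a hedged illustration: the text does not state which variant of McNemar's test is used, so the continuity-corrected chi-square form is an assumption, and the function name `mcnemar_test` is hypothetical.

```python
import math


def mcnemar_test(baseline_correct, model_correct):
    """Continuity-corrected McNemar test on paired per-sample correctness flags.

    Returns (chi-square statistic, p-value). The variant is an assumption;
    the paper only names the test.
    """
    pairs = list(zip(baseline_correct, model_correct))
    # b: baseline right, model wrong; c: baseline wrong, model right.
    b = sum(1 for x, y in pairs if x and not y)
    c = sum(1 for x, y in pairs if not x and y)
    if b + c == 0:
        return 0.0, 1.0  # the two models never disagree
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

Only the samples on which the two models disagree influence the statistic, which is why the test suits comparisons between a baseline and a student that share most predictions.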

Baselines and backbones. As baseline, we choose the DistilBERT model [37], a variant of BERT [8] that exhibits good performance with a reasonable number of learnable parameters. For a fair comparison with the baseline, we employ the DistilBERT architecture for our students as well. Moreover, the text encoder inside the multimodal teachers is also based on DistilBERT. To encode the generated images, we alternatively employ the pre-trained image encoder of the CLIP architecture [33], or the pre-trained ViT [11] model. We thus obtain a teacher based on DistilBERT+ViT (Teacher 1), and a teacher based on DistilBERT+CLIP (Teacher 2). We distill the knowledge from Teacher 1 into a student based on DistilBERT (Student 1), and the knowledge from Teacher 2 into a different student (Student 2), which is also based on DistilBERT. We underline that the two students have the same architecture, but they differ in terms of the source providing the privileged information.

Hyperparameters. We train the models with the AdamW [28] optimizer using a learning rate of $5\cdot 10^{-5}$ with linear decay, which converges to good optima across all our experiments. The baseline DistilBERT, the teachers and the students are each trained for 100 epochs on an Nvidia GeForce GTX 1080Ti GPU with 11 GB of VRAM. In all the experiments, we use a mini-batch size of 14 samples. Following previous works on knowledge distillation [5, 27], we soften the output of the teacher using the temperature $\tau$. We validate this hyperparameter in the range from 1 to 10, achieving optimal results with $\tau=8$. The hyperparameters $\alpha$ and $\beta$ from Eq. (2) are validated in the range from 0.1 to 5. The optimal values are $\alpha=3$ and $\beta=1$.
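To make the two training ingredients above concrete, the sketch below shows temperature softening of a logit vector and a linear learning-rate decay. Both helper names are hypothetical, and the schedule decaying all the way to zero at the last step is an assumption, since the text only states that the decay is linear.

```python
import math


def softened_softmax(logits, tau=8.0):
    """Softmax over logits divided by the temperature tau (tau=8 as reported)."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def linear_decay_lr(step, total_steps, base_lr=5e-5):
    """Learning rate decayed linearly from base_lr to 0 (assumed end point)."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return base_lr * remaining
```

A higher temperature flattens the teacher distribution, so the student also receives a training signal from the classes that the teacher considers unlikely.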

Data preprocessing. Before generating images with Stable Diffusion v2 [34], we perform some preprocessing steps to clean up the text samples. For the IMDB data set, we remove the HTML tags that are sometimes present in movie reviews. For the 20 Newsgroups data set, we discard email addresses and subjects, using the remaining content as text prompt. For the English News and English WikiNews data sets, we provide the target word or multi-word phrase in each sentence as input for the text-conditional diffusion model. This is because the task is to identify the complexity of the target words, not of the whole sentences.
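The clean-up steps for IMDB and 20 Newsgroups could look roughly like the sketch below. The regular expressions and function names are illustrative assumptions; in particular, the assumption that subjects appear as `Subject:` header lines is ours and is not stated in the text.

```python
import re

# Assumed patterns: any <...> markup, any token containing "@",
# and header lines starting with "Subject:".
HTML_TAG = re.compile(r"<[^>]+>")
EMAIL = re.compile(r"\S+@\S+")
SUBJECT_LINE = re.compile(r"^Subject:.*$", re.MULTILINE | re.IGNORECASE)


def clean_imdb_review(text):
    """Remove HTML tags and collapse the resulting whitespace."""
    no_tags = HTML_TAG.sub(" ", text)
    return " ".join(no_tags.split())


def clean_newsgroup_post(text):
    """Drop subject lines and email addresses, then collapse whitespace."""
    no_subject = SUBJECT_LINE.sub(" ", text)
    no_email = EMAIL.sub(" ", no_subject)
    return " ".join(no_email.split())
```

The cleaned string is what would be passed as the prompt to the text-to-image model.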

To process the examples from the English News and English WikiNews corpora with DistilBERT, we modify each sentence by marking the target words or multi-word phrases with the [SEP] token. No further preprocessing is required for the other data sets.
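One possible way to mark a target span is sketched below. The exact placement of the [SEP] tokens is not specified in the text, so wrapping the target on both sides, the character-offset interface and the name `mark_target` are all assumptions made for illustration.

```python
def mark_target(sentence, start, end, marker="[SEP]"):
    """Wrap the characters sentence[start:end] with the marker token.

    Placing the marker on both sides of the target is an assumed convention.
    """
    if not (0 <= start < end <= len(sentence)):
        raise ValueError("target span must lie inside the sentence")
    target = sentence[start:end]
    return f"{sentence[:start]}{marker} {target} {marker}{sentence[end:]}"
```

For example, marking "unanimous" in "The committee reached a unanimous verdict." yields "The committee reached a [SEP] unanimous [SEP] verdict.".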

4.3 Results

We present the results obtained on the IMDB, 20 Newsgroups, English News and English WikiNews data sets in Table 1.

IMDB. The baseline DistilBERT model [37], which is trained using only text data, reaches an accuracy of 91.9%, while the image encoders barely surpass the random chance baseline. The best multimodal teacher, which employs the CLIP image encoder, reaches an accuracy of 93.1%. Our first student outperforms its teacher by 1%, while our second student is on par with its teacher. Notably, both students surpass the baseline model by at least 1.1%.

20 Newsgroups. The baseline DistilBERT [37] obtains a performance of 91.8%, while the individual image encoders lag far behind. Since ViT is much worse than CLIP, the corresponding teacher (DistilBERT+ViT) barely surpasses the baseline model, while DistilBERT+CLIP (Teacher 2) reaches an accuracy of 92.6%. Meanwhile, our students based on privileged information surpass their teachers, showing considerable performance gains over the baseline DistilBERT.

English News. On the English News corpus, the baseline DistilBERT obtains an accuracy of 86.1%. The ViT and CLIP image encoders obtain competitive results, being less than 4% behind DistilBERT [37]. Both multimodal teachers outperform the baseline DistilBERT. Moreover, our student models surpass their teachers. The best student outperforms the baseline DistilBERT by 1%, reaching an accuracy of 87.1% in complex word identification.

English WikiNews. The results on the English WikiNews corpus are consistent with those on the English News corpus. Indeed, the independent image encoders obtain fairly good results, given that they only take generated images as input. The multimodal teachers outperform the baseline DistilBERT, while the students match or surpass their teachers: the first student equals its teacher (84.3%), and the second student reaches the best accuracy of 84.8%.

Figure 2: Text samples and generated images that are correctly classified by the multimodal teacher based on DistilBERT+CLIP. The target label is displayed on top of each sample. The examples on top belong to the 20 Newsgroups [24] data set, while the examples below are taken from English News [45] and English WikiNews [45].

Overall. We notice that the text modality leads to better results than the image modality, regardless of the data set. This is a natural consequence of the fact that the images are generated by a diffusion model, which can produce images that do not reflect the label. Another generic observation is that the multimodal teacher based on the CLIP image encoder (Teacher 2) is generally better than the other teacher. This leads to a better DistilBERT student (Student 2). Furthermore, we observe that the students generally surpass their teachers. We explain this observation through the fact that the multimodal teachers assign equal importance to the text and image modalities, although the image modality is naturally inferior. In contrast, the students focus on the original text modality, obtaining information about the image modality only through knowledge distillation.

Since both students surpass the baseline DistilBERT in each and every case, we conclude that our LUGPI framework is beneficial in various text classification tasks, such as polarity classification, text categorization by topic, and complex word identification.
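The claim above can be checked directly against the accuracy rates reported in Table 1. The short sketch below recomputes the gain of each student over the baseline; the dictionary layout and the helper name `gains_over_baseline` are illustrative choices, while the numbers are copied from Table 1.

```python
# Accuracy rates copied from Table 1, in the order given by DATA_SETS.
DATA_SETS = ["IMDB", "20 Newsgroups", "English News", "English WikiNews"]
TABLE_1 = {
    "DistilBERT (baseline)": [0.919, 0.918, 0.861, 0.842],
    "Student 1": [0.930, 0.928, 0.869, 0.843],
    "Student 2": [0.931, 0.929, 0.871, 0.848],
}


def gains_over_baseline(table, student):
    """Absolute accuracy gain of a student over the baseline, per data set."""
    base = table["DistilBERT (baseline)"]
    return [round(s - b, 3) for s, b in zip(table[student], base)]


for name in ("Student 1", "Student 2"):
    for data_set, gain in zip(DATA_SETS, gains_over_baseline(TABLE_1, name)):
        print(f"{name} on {data_set}: +{gain:.3f}")
```

All eight gains are positive, ranging from 0.001 (Student 1 on English WikiNews) to 0.012 (Student 2 on IMDB).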

Qualitative results. In Figure 2, we illustrate some examples which are incorrectly classified by the baseline DistilBERT, but are correctly classified by our second teacher model (DistilBERT+CLIP). Remarkably, we observe that the images generated by Stable Diffusion v2 contain important clues. For instance, a car is generated when the prompt is about cars, even though the word “car” is never mentioned inside the prompt. For the complex word identification task, we observe that the images generated for simple (non-complex) words tend to be less abstract, while those generated for complex words tend to be more abstract. In summary, the illustrated examples show that the generated images can complement the corresponding text samples. Although our students do not see these images at test time, our quantitative results presented in Table 1 show that the students clearly benefit from the privileged information transferred from the multimodal teachers.

Table 2: Accuracy rates on IMDB Large Movie Review [29], 20 Newsgroups [24], English News [45] and English WikiNews [45] data sets, while ablating the knowledge distillation components of our loss defined in Eq. (2). The best accuracy on each corpus is highlighted in bold.
| Model | $\mathcal{L}^{T}_{\mathrm{CE}}$ | $\mathcal{L}^{T}_{l_2}$ | IMDB Reviews | 20 Newsgroups | English News | English WikiNews |
|---|---|---|---|---|---|---|
| DistilBERT (Student 1) | - | - | 0.919 | 0.918 | 0.861 | 0.842 |
| DistilBERT (Student 2) | - | - | 0.919 | 0.918 | 0.861 | 0.842 |
| DistilBERT (Student 1) | ✓ | - | 0.913 | 0.922 | 0.765 | 0.842 |
| DistilBERT (Student 2) | ✓ | - | 0.923 | 0.926 | 0.870 | 0.844 |
| DistilBERT (Student 1) | - | ✓ | 0.911 | 0.926 | 0.869 | 0.840 |
| DistilBERT (Student 2) | - | ✓ | 0.919 | 0.925 | 0.865 | 0.843 |
| DistilBERT (Student 1) | ✓ | ✓ | 0.930 | 0.928 | 0.869 | 0.843 |
| DistilBERT (Student 2) | ✓ | ✓ | 0.931 | 0.929 | 0.871 | 0.848 |

Ablation study. Our LUGPI framework performs the distillation at two network levels, via two distinct loss terms. To demonstrate the utility of both terms, we perform an ablation study of the knowledge distillation loss terms $\mathcal{L}^{T}_{\mathrm{CE}}$ and $\mathcal{L}^{T}_{l_2}$ from Eq. (2). We present the corresponding results in Table 2. Distilling knowledge at the output level via $\mathcal{L}^{T}_{\mathrm{CE}}$ is not beneficial for the first student. In contrast, distilling knowledge at the embedding level via $\mathcal{L}^{T}_{l_2}$ helps both students on three data sets (except IMDB). In summary, the ablation study shows that both distillation losses are required to obtain consistent improvements.

Training and inference time. The inference time of our final model is identical to that of the vanilla DistilBERT. However, the training time of our pipeline is between $2.3\times$ and $2.8\times$ higher (depending on the data set and the vision model) than that of the student. This includes the time for generating the images with the pre-trained Stable Diffusion model. Note that Stable Diffusion is kept frozen in our pipeline.

5 Conclusion

In this work, we proposed the Learning Using Generated Privileged Information framework, which employs a diffusion model to generate privileged images. These images were further used to train a multimodal teacher taking both text and image data as input. A unimodal student was subsequently trained by distilling privileged information from the multimodal teacher. We performed experiments on four text classification data sets, namely IMDB Movie Reviews, 20 Newsgroups, English News and English WikiNews. We alternatively employed two different image encoders to extract image features, demonstrating accuracy gains in both cases. All our distilled students outperformed the baseline model and matched or surpassed their multimodal teachers, without any extra cost during inference. In future work, we aim to extend our framework to more NLP tasks.

References

  • [1] Alehdaghi, M., Josi, A., Cruz, R.M.O., Granger, E.: Visible-Infrared Person Re-Identification Using Privileged Intermediate Information. In: Proceedings of ECCVW. pp. 720–737 (2022)
  • [2] Antoniou, A., Storkey, A., Edwards, H.: Augmenting image classifiers using data augmentation generative adversarial networks. In: Proceedings of ICANN. pp. 594–603 (2018)
  • [3] Avrahami, O., Lischinski, D., Fried, O.: Blended diffusion for text-driven editing of natural images. In: Proceedings of CVPR. pp. 18208–18218 (2022)
  • [4] Azizi, S., Kornblith, S., Saharia, C., Norouzi, M., Fleet, D.J.: Synthetic Data from Diffusion Models Improves ImageNet Classification. arXiv preprint arXiv:2304.08466 (2023)
  • [5] Ba, J., Caruana, R.: Do deep nets really need to be deep? In: Proceedings of NIPS. pp. 2654–2662 (2014)
  • [6] Croitoru, F.A., Hondru, V., Ionescu, R.T., Shah, M.: Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(9), 10850–10869 (2023)
  • [7] Cubuk, E.D., Zoph, B., Shlens, J., Le, Q.: RandAugment: Practical Automated Data Augmentation with a Reduced Search Space. In: Proceedings of NeurIPS. vol. 33, pp. 18613–18624 (2020)
  • [8] Devlin, J., Chang, M.W., Lee, K., Toutanova, L.K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of NAACL-HLT. pp. 4171–4186 (2019)
  • [9] DeVries, T., Taylor, G.W.: Improved Regularization of Convolutional Neural Networks with Cutout. arXiv preprint arXiv:1708.04552 (2017)
  • [10] Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. In: Proceedings of NeurIPS. vol. 34, pp. 8780–8794 (2021)
  • [11] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021)
  • [12] Feng, Y., Wang, H., Hu, R., Yi, D.T.: Triplet distillation for deep face recognition. In: Proceedings of ICIP. pp. 808–812 (2020)
  • [13] Gambrell, L.B., Jawitz, P.B.: Mental Imagery, Text Illustrations, and Children’s Story Comprehension and Recall. Reading Research Quarterly 28, 264–276 (1993)
  • [14] Gao, Z., Wu, S., Liu, Z., Luo, J., Zhang, H., Gong, M., Li, S.: Learning the implicit strain reconstruction in ultrasound elastography using privileged information. Medical Image Analysis 58, 101534 (2019)
  • [15] Garcia, N.C., Morerio, P., Murino, V.: Learning with privileged information via adversarial discriminative modality distillation. IEEE Transactions on Pattern Analysis and Machine Intelligence 42(10), 2581–2593 (2019)
  • [16] Georgescu, M.I., Duţǎ, G.E., Ionescu, R.T.: Teacher–student training and triplet loss to reduce the effect of drastic face occlusion: Application to emotion recognition, gender identification and age estimation. Machine Vision and Applications 33(1), 12 (2022)
  • [17] Georgescu, M.I., Ionescu, R.T.: Teacher-student training and triplet loss for facial expression recognition under occlusion. In: Proceedings of ICPR. pp. 2288–2295 (2021)
  • [18] Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of AISTATS. pp. 249–256 (2010)
  • [19] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Proceedings of NIPS. vol. 27, pp. 2672–2680 (2014)
  • [20] Gu, S., Chen, D., Bao, J., Wen, F., Zhang, B., Chen, D., Yuan, L., Guo, B.: Vector quantized diffusion model for text-to-image synthesis. In: Proceedings of CVPR. pp. 10696–10706 (2022)
  • [21] Hinton, G., Vinyals, O., Dean, J.: Distilling the Knowledge in a Neural Network. In: Proceedings of NIPS Deep Learning and Representation Learning Workshop (2014)
  • [22] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Proceedings of NeurIPS. vol. 33, pp. 6840–6851 (2020)
  • [23] Jung, B., Johansson, F.D.: Efficient learning of nonlinear prediction models with time-series privileged information. In: Proceedings of NeurIPS. vol. 35, pp. 19048–19060 (2022)
  • [24] Lang, K.: NewsWeeder: Learning to Filter Netnews. In: Proceedings of ICML. pp. 331–339 (1995)
  • [25] Lee, W., Lee, J., Kim, D., Ham, B.: Learning with privileged information for efficient image super-resolution. In: Proceedings of ECCV. pp. 465–482 (2020)
  • [26] Liu, Z., Wei, J., Li, R., Zhou, J.: Learning multi-modal brain tumor segmentation from privileged semi-paired MRI images with curriculum disentanglement learning. Computers in Biology and Medicine 159, 106927 (2023)
  • [27] Lopez-Paz, D., Bottou, L., Schölkopf, B., Vapnik, V.: Unifying distillation and privileged information. In: Proceedings of ICLR (2016)
  • [28] Loshchilov, I., Hutter, F.: Decoupled Weight Decay Regularization. In: Proceedings of ICLR (2019)
  • [29] Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C.: Learning Word Vectors for Sentiment Analysis. In: Proceedings of ACL. pp. 142–150 (2011)
  • [30] Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., Chen, M.: GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In: Proceedings of ICML. pp. 16784–16804 (2021)
  • [31] Park, W., Kim, D., Lu, Y., Cho, M.: Relational Knowledge Distillation. In: Proceedings of CVPR. pp. 3962–3971 (2019)
  • [32] Qian, Y., Hu, H., Tan, T.: Data augmentation using generative adversarial networks for robust speech recognition. Speech Communication 114, 1–9 (2019)
  • [33] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning Transferable Visual Models From Natural Language Supervision. In: Proceedings of ICML. pp. 8748–8763 (2021)
  • [34] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-Resolution Image Synthesis with Latent Diffusion Models. In: Proceedings of CVPR. pp. 10684–10695 (2022)
  • [35] Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S.K.S., Ayan, B.K., Mahdavi, S.S., Lopes, R.G., et al.: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. In: Proceedings of NeurIPS. vol. 35, pp. 36479–36494 (2022)
  • [36] Sandfort, V., Yan, K., Pickhardt, P.J., Summers, R.M.: Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks. Scientific Reports 9(1), 16884 (2019)
  • [37] Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In: Proceedings of EMC2 (2019)
  • [38] Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: LAION-5B: An open large-scale dataset for training next generation image-text models. In: Proceedings of NeurIPS. vol. 35, pp. 25278–25294 (2022)
  • [39] Shivashankar, C., Miller, S.: Semantic Data Augmentation with Generative Models. In: Proceedings of CVPRW. pp. 863–873 (2023)
  • [40] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using non-equilibrium thermodynamics. In: Proceedings of ICML. pp. 2256–2265 (2015)
  • [41] Song, Y., Ermon, S.: Generative modeling by estimating gradients of the data distribution. In: Proceedings of NeurIPS. vol. 32, pp. 11918–11930 (2019)
  • [42] Vapnik, V., Vashist, A.: A new learning paradigm: Learning using privileged information. Neural Networks 22(5–6), 544–557 (2009)
  • [43] Yang, J., Li, B., Yang, F., Zeng, A., Zhang, L., Zhang, R.: Boosting human-object interaction detection with text-to-image diffusion model. arXiv preprint arXiv:2305.12252 (2023)
  • [44] Yim, J., Joo, D., Bae, J., Kim, J.: A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning. In: Proceedings of CVPR. pp. 7130–7138 (2017)
  • [45] Yimam, S.M., Štajner, S., Riedl, M., Biemann, C.: Multilingual and Cross-Lingual Complex Word Identification. In: Proceedings of RANLP. pp. 813–822 (2017)
  • [46] You, S., Xu, C., Xu, C., Tao, D.: Learning from multiple teacher networks. In: Proceedings of KDD. pp. 1285–1294 (2017)
  • [47] Yu, L., Yazici, V.O., Liu, X., van de Weijer, J., Cheng, Y., Ramisa, A.: Learning Metrics from Teachers: Compact Networks for Image Embedding. In: Proceedings of CVPR. pp. 2907–2916 (2019)
  • [48] Yuan, S., Stenger, B., Kim, T.K.: RGB-based 3D hand pose estimation via privileged learning with depth images. arXiv preprint arXiv:1811.07376 (2018)
  • [49] Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond Empirical Risk Minimization. In: Proceedings of ICLR (2018)
  • [50] Zhao, P., Xie, L., Wang, J., Zhang, Y., Tian, Q.: Progressive privileged knowledge distillation for online action detection. Pattern Recognition 129, 108741 (2022)