Research Article | Open Access

Distilling Knowledge in Machine Translation of Agglutinative Languages with Backward and Morphological Decoders

Published: 18 January 2025

Abstract

Agglutinative languages often have morphologically complex words (MCWs) composed of multiple morphemes arranged in a hierarchical structure, posing significant challenges in translation tasks. We present a novel Knowledge Distillation approach tailored for improving the translation of such languages. Our method involves an encoder, a forward decoder, and two auxiliary decoders: a backward decoder and a morphological decoder. The forward decoder generates target morphemes autoregressively and is augmented by distilling knowledge from the auxiliary decoders. The backward decoder incorporates future context, while the morphological decoder integrates target-side morphological information. We have also designed a reliability estimation method to selectively distill only the reliable knowledge from these auxiliary decoders. Our approach relies on morphological word segmentation. We show that the word segmentation method based on unsupervised morphology learning outperforms the commonly used Byte Pair Encoding method on highly agglutinative languages in translation tasks. Our experiments conducted on English-Tamil, English-Manipuri, and English-Marathi datasets show that our proposed approach achieves significant improvements over strong Transformer-based NMT baselines.

1 Introduction

Machine translation into morphologically rich languages is challenging due to lexical sparsity and the wide variety of grammatical features expressed through morphology. Agglutinative languages, such as Manipuri, Tamil, and Marathi, often use morphologically complex words (MCWs), which are formed by combining multiple morphemes, each carrying a distinct semantic or syntactic meaning. MCWs in these languages pose significant difficulties for conventional NMT models due to their hierarchical structure and the intricate interplay of morphemes. To illustrate agglutination in MCWs, Table 1 provides an example from the Manipuri language. The correct generation of target MCWs relies heavily on understanding the global context and accurately capturing morphological information. However, standard autoregressive decoding, which processes words one at a time (see Footnote 1) without access to future context, often falls short of capturing these subtleties, leading to errors in MCW generation. This is particularly problematic in agglutinative languages, where minor morphological errors can significantly alter meaning.
Table 1.
Word | Morphemes | Translation
puba | pu, ba | carry
pukhi | pu, khi | carried
pusinkhi | pu, sin, khi | carried in
pusinningkhi | pu, sin, ning, khi | wanted to carry in
pusinningkhide | pu, sin, ning, khi, de | not wanted to carry in
Table 1. Case Study of Agglutination in MCWs from Manipuri Language
The bold part in the morphemes indicates the stem.
Many efforts have been made to incorporate linguistic tags on the source side [8, 22, 31, 49]. However, incorporating these features on the target side of the translation system is challenging, as it requires incrementally tagging and parsing the hypotheses at test time [31]. Tagging and parsing require the whole sentence \(y_1, y_2, \dots , y_{n}\), which is not available at test time. Furthermore, several approaches have been proposed to incorporate future information, including reinforcement learning [1, 45], additional decoding passes, and an additional decoder [9, 12, 44, 50], but these methods did not take the reliability of the future information into account.
In this paper, we propose a Knowledge Distillation method specifically designed to generate MCWs accurately and thereby enhance neural machine translation for agglutinative languages. Our method involves training an encoder and three separate decoders: a forward decoder for autoregressive generation, a backward decoder to capture future contextual information, and a morphological decoder that explicitly integrates target-side morphological data, such as morphological tags or POS tags. We incorporate future information through knowledge distillation from the backward decoder to the forward decoder, inspired by a recently proposed technique called Twin Networks [33]. To integrate morphological information of the target-side words, we use knowledge distillation from a morphological decoder. In addition to the word embeddings of \(y_1, y_2, \dots , y_{t-1}\), the morphological decoder is trained with additional morphological information of each of \(y_1, \dots , y_t\) to predict \(y_t\). For our experiments, we use morphological class information and incorporate stem and affix information on the target side.
We explore the idea of learning future target information and morphological information from the backward decoder and the morphological decoder within the knowledge distillation framework [16]. Unlike typical knowledge distillation approaches [9, 30, 33, 51], our method distills knowledge from an additional decoder only when that knowledge is reliable. While the backward decoder and the morphological decoder provide additional information to the forward decoder, they are not perfect next-word predictors and can give incorrect guidance to the forward decoder. To address this issue, we introduce a novel approach that dynamically and selectively distills knowledge based on a reliability estimate. We use the term “reliability” to indicate the level of confidence we have in the prediction made by the model on a single example, as used in [20, 25]. Specifically, we estimate the reliability of the predictions from the backward and morphological decoders by computing the inverse of the cross-entropy between the model’s prediction and the ground truth.
Our approach also relies on effective morphological word segmentation, which plays a critical role in the accurate representation of MCWs. Traditional purely statistical models, such as Byte Pair Encoding (BPE), often fall short when applied to highly agglutinative languages, as they may not effectively capture the intricate morphological structures inherent in these languages. In contrast, we utilize a linguistically motivated morphological analyzer called Morfessor. This method enables more precise modeling of the underlying morphological relationships and enhances the quality of translation.
Our experimental results on the English-Manipuri, English-Tamil and English-Marathi datasets show that the proposed method significantly improves over strong Transformer-based NMT baselines.
The main contributions of this paper can be summarized as follows:
We show that morphologically motivated segmentation outperforms Byte-Pair Encoding in translating highly agglutinative languages.
We investigate using backward and morphological decoders to improve the generation of Morphologically Complex Words (MCWs) in neural machine translation for agglutinative languages through knowledge distillation.
We propose a new reliable knowledge distillation method that dynamically distills knowledge selectively based on our reliability estimation method at the likelihood level.
We introduce a new morphological decoder that distills morphological information to the forward decoder to incorporate a morphological feature on the target side.

2 Related Work

In recent years, several encoder-decoder architectures have been proposed. These include the attention-based models of Bahdanau et al. [2], Chen et al. [6], Gehring et al. [11], Johnson et al. [17], and Vaswani et al. [41]. Several methods have been proposed for machine translation of agglutinative languages which are designed to deal with morphological complexities. Our work is inspired by two lines of research: linguistic knowledge-informed translation and future-aware translation.

2.1 Incorporating Linguistic Knowledge

In the realm of Statistical Machine Translation (SMT), Koehn and Hoang [19] pioneered the concept of factored translation models, integrating diverse morphological features into the translation process. They accomplished this by augmenting word representations with additional morphological and syntactic features. Unlike SMT, Neural Machine Translation (NMT) models use a fixed vocabulary and hence rely on a vocabulary reduction method like BPE [32]. However, previous works like Banerjee and Bhattacharyya [4] show that using the linguistically motivated Morfessor improves translation in some languages, while Weller-Di Marco and Fraser [42] use a rule-based morphological analyzer for English-German translation. Subsequently, in Neural Machine Translation, a stream of research has emerged aiming to enhance word vector representations by incorporating valuable linguistic features into either the source encoder or the target decoder. For instance, Sennrich and Haddow [31] broadened the embedding layer of an NMT encoder by incorporating a blend of morphological and syntactic features, encompassing word lemma, morphological attributes, POS, and dependency labels. Their model utilized concatenated feature embedding vectors as input word embeddings while keeping the rest of the NMT model unaltered. Conversely, Bandyopadhyay [3], Song et al. [39], and Tamchyna et al. [40] factored words into morphological (lemma) and syntactic (factors) features at the output decoder of NMT. Their model, augmented with a heuristic morphological synthesizer, generated unseen word forms based on predicted lemma and factors. This method aimed to address challenges related to large vocabulary and out-of-vocabulary (OOV) instances during translation. Regrettably, this approach did not yield significant improvements in experimental outcomes. Incorporating these features on the target side of the translation system is challenging, since it requires incrementally tagging and parsing the hypotheses at test time, which in turn requires the whole sentence that is not available at test time. Nzeyimana [26] incorporated linguistic tags through multi-task, multi-label training. Our paper advocates integrating linguistic knowledge into the target-side words by distilling knowledge from a morphological decoder.
Another line of research explores the integration of external knowledge through multi-source or multi-task learning. This involves considering additional sources as valuable, distinct information that augments the learning process of a translation model. The idea behind these approaches is to integrate the linguistic features into the model architecture, such as in a multi-encoder [22], modified attention [5, 49], or multi-task learning [8].

2.2 Future Aware Translation

To incorporate future information into neural machine translation (NMT), various approaches have been proposed. One approach is to use reinforcement learning, such as the REINFORCE algorithm [34, 43, 45, 46] or the actor-critic algorithm [1, 21].
A separate set of techniques incorporates future information into the inference process by employing additional decoding passes or supplementary components during testing. For instance, Xia et al. [44] and Zhang et al. [50] advocated a two-pass decoding algorithm, generating an initial draft translation followed by a refined final translation based on the draft. In a similar vein, Zhang et al. [48] and Zhou et al. [54] maintained both forward and backward decoders, decoding simultaneously and interacting when making predictions. From a distinct perspective, some researchers focus on integrating future information, such as Feng et al. [9] and Zhang et al. [47], who propose leveraging future source information to guide machine translation with knowledge distillation, mitigating source incompleteness. Another approach is to model past and future information for the source to help the decoder focus on untranslated source information, as in Zheng et al. [53] and Zheng et al. [52]. However, our method differs from previous approaches in that it distills knowledge from the additional decoder only when the knowledge is reliable, taking the reliability of the future information into account.

3 Morphology and MT

Morphemes are the smallest meaningful units of a language. Some morphemes, called stems, express core meanings, while others, called affixes, express one or more dependent features of the core meaning, such as person, gender, or aspect. In agglutinative languages, each morpheme typically corresponds to a single feature, and words are constructed by concatenating morphemes with clear boundaries between them.
Our approach relies on the morphological segmentation of words. Before considering the broader problem of integrating the morphological decoder and the backward decoder, we perform an initial study to verify the usefulness of a linguistically motivated morphological analyzer in MT. While previous works such as Banerjee and Bhattacharyya [4], Singh et al. [36], and Weller-Di Marco and Fraser [42] have shown that using a morphological analyzer improves translation in some languages, there has not been a comprehensive investigation into which types of agglutinative languages are best suited to morphological analysis in MT. In this section, we discuss methods for measuring morphological complexity and segmentation techniques. We then explore the usefulness of morphological analyzers for segmenting agglutinative languages in translation tasks. Finally, we demonstrate that the morphologically motivated segmentation method consistently outperforms Byte-Pair Encoding (BPE) in translation tasks for highly agglutinative languages.

3.1 Measuring Morphological Complexity

Morphological complexity varies across languages, and it is important to identify this complexity in order to apply our method. Instead of relying on expert linguistic descriptions, we adopt corpus-based metrics such as Types, Type-Token Ratio (TTR), and Moving-Average Type-Token Ratio (MATTR). These measures are effective in assessing and ranking languages based on morphological complexity, as suggested by Kettunen [18]. Table 2 summarizes these metrics.
Table 2.
Metrics | Definition
Types | Count of unique word forms (types) in a text or corpus
Type-Token Ratio (TTR) | The ratio of types to the total number of tokens in a text or corpus
Moving-Average Type-Token Ratio (MATTR) | The average TTR derived from fixed-size overlapping segments (e.g., 50 words in our case) within a text
Table 2. Definitions of the Various Metrics used to Assess Morphological Complexity
Languages with rich morphology tend to exhibit high values for these metrics due to their diverse word forms, while languages with simpler morphology show lower values. By applying these measures to our selected languages, we obtain a quantitative basis for comparing their morphological characteristics, which will later inform the analysis of segmentation techniques in translation tasks.
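To make the measures in Table 2 concrete, the short Python sketch below computes types, TTR, and MATTR for a whitespace-tokenized text. The sliding-window formulation of MATTR and the toy corpus are illustrative assumptions, not the exact scripts used for the analysis in this paper.

```python
def type_token_ratio(tokens):
    """TTR: number of unique types divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens)

def mattr(tokens, window=50):
    """Moving-Average TTR over fixed-size overlapping (sliding) windows."""
    if len(tokens) <= window:
        return type_token_ratio(tokens)
    ratios = [type_token_ratio(tokens[i:i + window])
              for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)

# Toy example with whitespace tokenization; a real comparison would use the FLORES text.
corpus = "the cats chased the mice and the mice ran away".split()
print(len(set(corpus)), type_token_ratio(corpus), mattr(corpus, window=5))
```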

3.2 Segmentation Methods

Word-level models would predict “unknown” for out-of-vocabulary word tokens, making them unsuitable for morphologically rich languages with sparse vocabularies. Therefore, we train our translation models using two segmentation methods: BPE (Byte-Pair Encoding), a purely statistical method, and Morfessor, a linguistically motivated morphological analyzer.
Byte Pair Encoding. BPE [32] starts with character segmentation and merges characters into larger units based on their frequencies. This results in units that fall between characters and words, with the number of merge operations serving as a hyperparameter.
Morfessor. The default implementation [7] employs a unigram language model to identify morph-like structures. It selects segments in a top-down manner and includes a prior term on segment length, encouraging segments that resemble plausible morphemes. We use the Morfessor FlatCat variant [14] in our implementation.
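As a concrete illustration of the BPE procedure described above, the toy Python sketch below learns merge operations over a tiny vocabulary of Manipuri word forms from Table 1. Real experiments would use the subword-nmt or SentencePiece implementations and the Morfessor FlatCat toolkit rather than this simplified code.

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count frequencies of adjacent symbol pairs over the current vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Apply one merge operation to every word in the vocabulary."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words are space-separated characters with an end-of-word marker.
vocab = {"p u b a </w>": 5, "p u k h i </w>": 6, "p u s i n k h i </w>": 3}
for _ in range(8):  # the number of merge operations is the hyperparameter
    stats = get_pair_stats(vocab)
    if not stats:
        break
    vocab = merge_pair(max(stats, key=stats.get), vocab)
print(vocab)
```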

3.3 Evaluation

We now present the evaluation of our morphological measures and segmentation models. We experiment with the following languages: Tamil, Marathi, Manipuri, Japanese and Indonesian. These languages represent different families and degrees of morphological complexity, as reflected in our analysis.
Morphological Complexity Analysis. For this analysis, we use the multi-way parallel FLORES dataset [13]. Figure 1 compares the morphological measures across the five languages, alongside English. As seen in the figure, Tamil, Marathi, and Manipuri exhibit high morphological complexity, indicating their high agglutinative nature. In contrast, Japanese and Indonesian have relatively lower levels of agglutination.
Fig. 1. Comparison of morphological complexity across languages. Tamil, Manipuri, and Marathi are highly agglutinative, while Japanese and Indonesian exhibit lower degrees of agglutination.
Effects of Segmentation Methods on Machine Translation. To compare different segmentation models, we use the IWSLT 2017 dataset for Japanese and Indonesian. The English-Japanese dataset has 220,000 parallel sentences, and the English-Indonesian dataset has 76,000 parallel sentences. The experimental settings and dataset details for Tamil, Marathi, and Manipuri are provided in Section 5. Figure 2 shows the BLEU scores for both segmentation methods across these languages.
Fig. 2. Comparison of BPE and Morfessor across languages, showing that Morfessor improves translation for highly agglutinative languages (Manipuri (Mni), Tamil (Tam), Marathi (Mar)) but offers no benefit for less agglutinative ones (Japanese (Jpn), Indonesian (Ind)).
Our results demonstrate that Morfessor significantly improves translation quality for highly agglutinative languages. However, it does not offer an improvement for languages with a lower degree of agglutination, such as Japanese and Indonesian. In our further experiments, we will concentrate on the three highly agglutinative languages.

4 Proposed Method

Comprehension of sentences in human cognition is influenced by the preceding and succeeding discourse [10]. Native speakers’ brains subconsciously anticipate the presence of a suffix when encountering a specific stem, reflecting the influence of priming [38]. Our proposed framework uses multi-decoder training with reliable knowledge distillation to generate the target sentence, simulating the aforementioned processes by leveraging relevant full-context information and morphological information.

4.1 Model

Our model is an extension of the dominant encoder-decoder model called the Transformer [41]. It employs a multi-decoder approach, in which separate decoders generate the target sentence in different ways.
Encoder and Forward Decoder. In the forward decoder, the target sentence \(Y={y_1,\ldots ,y_n}\) is generated by maximizing the conditional probability of each token \(y_t\) given the source sentence X and the preceding words \(y_{\lt t}\):
\begin{equation} \mathbb {P}(Y|X)=\prod _{t=1}^{n}p(y_t|y_{\lt t},X) \end{equation}
(1)
The input sequence X is first fed into the encoder, which encodes it into m context vectors \(C = {c_1,c_2, \ldots , c_m}\) where m is the length of X. Specifically, this is achieved through the self-attention network (SAN), which produces the context vectors \(C = SAN(X)\).
The forward decoder generates the target sentence Y word-by-word based on the representation C from the encoder, using attention mechanisms to take into account the generated target fragment. Given a sequence of word embeddings \(\mathbf {E}(y_1),\ldots , \mathbf {E}(y_{t-1})\) in the generated target fragment at the t-th timestep, they are transformed into a key matrix \(\mathbf {K}_{t-1}\) and a value matrix \(\mathbf {V}_{t-1}\). A Self-Attention (SelfATTs) module is then used to learn the target representation \(s_t\):
\begin{equation} s_t = SelfATTs(\overrightarrow{h}_{t-1}, \mathbf {K}_{t-1}, \mathbf {V}_{t-1}) \end{equation}
(2)
where \(\overrightarrow{h}_{t-1} \in \mathbb {R}^{d_{model}}\) is the previous context vector. \(s_t\) is then fed into another Attention (ATTc) module to compute the time-dependent context vector \(h_t\):
\begin{equation} \overrightarrow{h_t} = ATTc(s_t, \mathbf {K}_e, \mathbf {V}_e) \end{equation}
(3)
where \(\mathbf {K}_e\) and \(\mathbf {V}_e\) are key and value matrices, respectively from encoder, that are transformed from the source representation C.
The probability distribution \(p(y_t|y_{\lt t}, X)\) of the generated target word \(y_t\) is computed using a multi-layer perceptron (MLP) layer:
\begin{equation} p(y_t|y_{\lt t}, X) = softmax(\mathbf {W}_o\overrightarrow{h_t}+b) \end{equation}
(4)
The target word \(y_t\) with the maximum probability is selected as the output of the decoder at the t-th timestep.
Backward Decoder. In the backward decoder, the target sentence is generated by taking into account the source sentence and the succeeding words:
\begin{equation} \mathbb {P}(Y|X)=\prod _{t=1}^{n}p(y_t|y_{\gt t},X) \end{equation}
(5)
To compute the probability of a target token \(y_t\), the decoder first computes the representation of the succeeding words using the self-attention mechanism. Given a sequence of word embeddings \(\mathbf {E}(y_{t+1}), ..., \mathbf {E}(y_{n})\) of the succeeding words at the t-th timestep, they are transformed into key matrix \(\mathbf {K}^{^{\prime }}_{t+1}\) and value matrix \(\mathbf {V}^{^{\prime }}_{t+1}\) . The target representation \(s^{^{\prime }}_t\) is computed using attention mechanism as:
\begin{equation} s^{^{\prime }}_t = SelfATTs(\overleftarrow{h}_{t+1}, \mathbf {K}^{^{\prime }}_{t+1}, \mathbf {V}^{^{\prime }}_{t+1}) \end{equation}
(6)
The context vector \(\overleftarrow{h_t}\) is then computed using attention mechanism as:
\begin{equation} \overleftarrow{h_t} = ATTc(s^{^{\prime }}_t, \mathbf {K}_e, \mathbf {V}_e) \end{equation}
(7)
The probability distribution \(p(y_t|y_{\gt t}, X)\) of the generated target word \(y_t\) is computed as:
\begin{equation} p(y_t|y_{\gt t},X) = softmax(\mathbf {W}_o\overleftarrow{h_t}+b) \end{equation}
(8)
Figure 3 depicts the backward decoder (top) and forward decoder (bottom) with Reliable Knowledge Distillation.
Fig. 3. Backward decoder (Top) and Forward decoder (Bottom) with Reliable Knowledge Distillation. During the evaluation, the backward decoder part is discarded.
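As a rough sketch of Equations (5)-(8), and not the authors' exact implementation, the backward decoder can be realized by running a standard causal Transformer decoder over the reversed target sequence, so that each position is conditioned on the succeeding tokens. The dimensions below follow Section 5.2, and the usual input/output shift (BOS/EOS handling) is omitted for brevity.

```python
import torch
import torch.nn as nn

d_model, vocab_size, n_heads = 256, 8000, 4
embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerDecoderLayer(d_model, n_heads, dim_feedforward=512, batch_first=True)
backward_decoder = nn.TransformerDecoder(layer, num_layers=3)
out_proj = nn.Linear(d_model, vocab_size)  # W_o and b of Equation (8)

def backward_distributions(tgt_tokens, encoder_out):
    """Approximate p(y_t | y_{>t}, X) for every position t (batch_first tensors)."""
    rev = torch.flip(tgt_tokens, dims=[1])                        # read the target right-to-left
    causal = nn.Transformer.generate_square_subsequent_mask(rev.size(1))
    h = backward_decoder(embed(rev), encoder_out, tgt_mask=causal)
    logits = out_proj(h)
    return torch.flip(logits, dims=[1]).softmax(-1)               # re-align to forward order

src_repr = torch.randn(2, 7, d_model)        # stand-in for the encoder context vectors C
tgt = torch.randint(0, vocab_size, (2, 5))
print(backward_distributions(tgt, src_repr).shape)  # torch.Size([2, 5, 8000])
```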
Morphological Decoder. The morphological decoder generates the target sentence by taking into account the source sentence, the preceding words, and the morphological classes of the preceding tokens and the next token. The probability of a target token, \(y_t\), is computed using the classes of the preceding tokens and the next token \(class(y_{\le t})\), the preceding words, and the source sentence.
\begin{equation} \mathbb {P}(Y|X)=\prod _{t=1}^{n}p(y_t|class(y_{\le t}),y_{\lt t},X) \end{equation}
(9)
In this morphological decoder, the word embeddings \(\mathbf {E}(y_1),\ldots , \mathbf {E}(y_{t-1})\) are combined with respective morphological class embeddings \(\mathbf {E}_c(class(y_1)),\ldots ,\mathbf {E}_c(class(y_{t-1}))\) by passing through a linear layer. The decoder is made aware of the morphological class of the next token to be generated in advance, allowing it to better choose the next token to generate based on this information. The morphological class of the next token, \(class(y_t)\), is passed to a class embedding, \(\mathbf {E}_c\), which generates its class embedding and combines it with the hidden state, \(h_t\) by passing through a linear layer. The hidden state \(h_t\) is computed in a manner similar to the forward decoder.
The probability distribution of the generated target word \(y_t\), \(p(y_t|class(y_{\le t}),y_{\lt t}, X)\), is computed as:
\begin{equation} p(y_t|class(y_{\le t}),y_{\lt t},X) = softmax(\mathbf {W}_oh_t^{(M)}+b) \end{equation}
(10)
Figure 4 depicts the morphological decoder (top) and forward decoder (bottom) with Reliable Knowledge Distillation. In our experiments, the morphological classes are stem and affix.
Fig. 4. Morphological decoder (Top) and Forward decoder (Bottom) with Reliable Knowledge Distillation. During the evaluation, the morphological decoder part is discarded.
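The fusion steps described above can be sketched as follows. The exact dimensions and the use of concatenation followed by a linear layer are our assumptions about the unspecified details, and the decoder's attention stack itself is omitted.

```python
import torch
import torch.nn as nn

d_model, vocab_size, n_classes = 256, 8000, 2    # classes: stem vs. affix (Section 5.1)
word_embed = nn.Embedding(vocab_size, d_model)   # E
class_embed = nn.Embedding(n_classes, d_model)   # E_c
fuse_in = nn.Linear(2 * d_model, d_model)        # combines E(y_i) with E_c(class(y_i))
fuse_out = nn.Linear(2 * d_model, d_model)       # combines h_t with E_c(class(y_t))
out_proj = nn.Linear(d_model, vocab_size)

def morph_decoder_step(prev_tokens, prev_classes, next_class, h_t):
    """Build inputs from preceding tokens/classes and fuse the next token's class into h_t."""
    dec_inputs = fuse_in(torch.cat([word_embed(prev_tokens),
                                    class_embed(prev_classes)], dim=-1))
    # dec_inputs would feed the decoder's self-attention; here we only show the fusion.
    h_m = fuse_out(torch.cat([h_t, class_embed(next_class)], dim=-1))
    return dec_inputs, out_proj(h_m).softmax(-1)  # p(y_t | class(y_<=t), y_<t, X)

prev = torch.randint(0, vocab_size, (2, 4))
prev_cls = torch.randint(0, n_classes, (2, 4))
nxt_cls = torch.randint(0, n_classes, (2,))
h_t = torch.randn(2, d_model)
inputs, probs = morph_decoder_step(prev, prev_cls, nxt_cls, h_t)
print(inputs.shape, probs.shape)
```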

4.2 Reliable Knowledge Distillation

In this paper, we use knowledge distillation [16] to regularize the forward decoder in our machine translation model. The forward decoder has to match the probability distributions predicted by the backward decoder and the morphological decoder. The knowledge distillation losses with the backward decoder and the morphological decoder are respectively expressed as:
\begin{align} \mathcal {L}(\overrightarrow{Y},\overleftarrow{Y}) = - \sum _{t=1}^{n}&\beta _t p(y_t|y_{\gt t},X) . log(p(y_t|y_{\lt t},X)) \end{align}
(11)
\begin{align} \mathcal {L}(\overrightarrow{Y},Y^{(M)}) = - &\sum _{t=1}^{n}\gamma _t p(y_t|class(y_{\le t}),y_{\lt t},X) . log(p(y_t|y_{\lt t},X)) \end{align}
(12)
To enhance the distillation effect, we dynamically weight the knowledge distillation losses, which allows us to distill only reliable knowledge. We use the term “reliability” to indicate the level of confidence that we have in the prediction made by the model on a single example, as used in Kukar and Kononenko [20] and Nicora et al. [25]. Essentially, when the prediction error of the backward decoder or morphological decoder is high, the forward decoder learns exclusively from the ground truth. The weighting factors \(\beta _t\) and \(\gamma _t\) are obtained from the inverse of the cross-entropy between the ground truth and the predicted probability distribution as follows:
\begin{equation} \beta _t = 1 - \frac{H(p^G(y_t),p(y_t|y_{\gt t},X))}{\mathbb {K}} \end{equation}
(13)
\begin{equation} \gamma _t = 1 - \frac{H(p^G(y_t),p(y_t|class(y_{\le t}),y_{\lt t},X))}{\mathbb {K}} \end{equation}
(14)
where \(p^G(y_t)\) is the ground truth distribution and \(\mathbb {K}\) is a batchwise min-max normalization term.
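A minimal PyTorch reading of Equations (11)-(14) is sketched below. Treating \(\mathbb{K}\) as a batchwise min-max normalizer that maps the per-token cross-entropies into [0, 1] is our interpretation, and the teacher distribution is detached so the auxiliary decoder receives no gradient from the distillation term.

```python
import torch
import torch.nn.functional as F

def reliability_weights(aux_logits, gold):
    """beta_t = 1 - H(p_gold, p_aux) / K, with K a batchwise min-max normalizer."""
    ce = F.cross_entropy(aux_logits.transpose(1, 2), gold, reduction="none")  # (B, T)
    norm = (ce - ce.min()) / (ce.max() - ce.min() + 1e-8)                     # in [0, 1]
    return 1.0 - norm

def weighted_kd_loss(fwd_logits, aux_logits, weights):
    """Distill the auxiliary distribution into the forward decoder, token-weighted."""
    teacher = aux_logits.softmax(-1).detach()       # no gradient into the teacher
    log_student = fwd_logits.log_softmax(-1)
    kd = -(teacher * log_student).sum(-1)           # cross-entropy per token
    return (weights * kd).mean()

B, T, V = 2, 5, 8000
fwd, aux = torch.randn(B, T, V, requires_grad=True), torch.randn(B, T, V)
gold = torch.randint(0, V, (B, T))
beta = reliability_weights(aux, gold)
print(weighted_kd_loss(fwd, aux, beta))
```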

4.3 Training Objective

In our approach, we train all three decoders using cross-entropy loss. The cross-entropy losses for the forward decoder, backward decoder, and morphological decoder are calculated as follows:
\begin{equation} \mathcal {L}_1=-\sum _{t=1}^{n}log(p(y_t|y_{\lt t},x)) \end{equation}
(15)
\begin{equation} \mathcal {L}_2 = -\sum _{t=1}^{n}log(p(y_t|y_{\gt t},x)) \end{equation}
(16)
\begin{equation} \mathcal {L}_3 = -\sum _{t=1}^{n}log(p(y_t|class(y_{\le t}),y_{\lt t},x)) \end{equation}
(17)
where n is the number of tokens in each example.
We combine these cross-entropy losses with the knowledge distillation losses to obtain the final training loss as follows:
\begin{equation} \mathcal {L}= ~\mathcal {L}_1 + \mathcal {L}_2 + \mathcal {L}_3 + \mathcal {L}(\overrightarrow{Y},\overleftarrow{Y}) + \mathcal {L}(\overrightarrow{Y},Y^{(M)}) \end{equation}
(18)
We jointly train and update the parameters of all decoders. However, the knowledge distillation losses are not used to update the parameters of the backward decoder and the morphological decoder; they are backpropagated only to update the parameters of the forward decoder.
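A compact sketch of the combined objective in Equation (18) follows. Detaching the teacher distributions inside the distillation terms is one straightforward way to ensure those losses update only the forward decoder, as described above; the per-token weights beta and gamma correspond to those of the previous sketch.

```python
import torch
import torch.nn.functional as F

def kd_term(student_logits, teacher_logits, weights):
    """Token-weighted distillation loss; the teacher is detached (stop-gradient)."""
    teacher = teacher_logits.softmax(-1).detach()
    return (weights * -(teacher * student_logits.log_softmax(-1)).sum(-1)).mean()

def total_loss(fwd_logits, bwd_logits, morph_logits, gold, beta, gamma):
    def ce(logits):                                    # Equations (15)-(17)
        return F.cross_entropy(logits.transpose(1, 2), gold)
    kd_b = kd_term(fwd_logits, bwd_logits, beta)       # Equation (11)
    kd_m = kd_term(fwd_logits, morph_logits, gamma)    # Equation (12)
    return ce(fwd_logits) + ce(bwd_logits) + ce(morph_logits) + kd_b + kd_m  # Equation (18)

B, T, V = 2, 5, 100
gold = torch.randint(0, V, (B, T))
fwd, bwd, morph = (torch.randn(B, T, V, requires_grad=True) for _ in range(3))
loss = total_loss(fwd, bwd, morph, gold, torch.rand(B, T), torch.rand(B, T))
loss.backward()
print(loss.item())
```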

5 Experimental Setup

5.1 Dataset

We evaluate our method on the following three datasets.
WAT2021 English\(\rightarrow\)Tamil (140K pairs). Parallel data from PMIndia [15] and the PIB dataset [37]; we evaluate on the WAT2021 validation and test sets.
English\(\rightarrow\)Manipuri (120K pairs). We use a training set of 120K parallel sentences from PMIndia [15, 35] and the PIB dataset. The validation and test sets consist of approximately 1K sentence pairs each, sampled from the corpus.
WAT2021 English\(\rightarrow\)Marathi (132K pairs). Parallel data from PMIndia and the PIB dataset; we evaluate on the WAT2021 validation and test sets.
To preprocess the data, we segment English words into subword units using Byte-Pair Encoding (BPE) [32] with 16,000 merge operations. Manipuri, Tamil, and Marathi words are split into morphemes using the Morfessor FlatCat tool [14]. The morphological class used in the morphological decoder consists of binary labels, “stem” vs. “affix”, for each subword token. We add a special marker “$$” before suffixes and “+” between other morphemes of a word to ensure the reversibility of the morphological splitting. Certain words, such as compound and reduplicative words, contain multiple stems, which are indicated by the “+” symbol.
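To illustrate how the markers keep the segmentation reversible, here is a small sketch that rejoins morpheme tokens into surface words. Whether the markers are attached to the morphemes or emitted as standalone tokens is not specified above, so the attached-marker convention used here is an assumption.

```python
def desegment(tokens):
    """Rejoin marked morpheme tokens into surface words (illustrative only)."""
    words = []
    for tok in tokens:
        if words and (tok.startswith("$$") or tok.startswith("+")):
            words[-1] += tok.lstrip("$+")   # suffix or additional stem: attach to current word
        else:
            words.append(tok)               # unmarked token starts a new word
    return " ".join(words)

# Manipuri example from Table 1 ("wanted to carry in")
print(desegment(["pu", "$$sin", "$$ning", "$$khi"]))   # -> pusinningkhi
```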

5.2 Settings

We conduct experiments on the following systems.
Transformer Baseline. A standard transformer model that is trained on the given dataset without incorporating any additional decoders.
Knowledge Distillation with Backward Decoder (KD with BD). A variation of our model that includes a backward decoder, which utilizes future information.
Reliable Knowledge Distillation with Backward Decoder (RKD with BD). A variation of KD with BD that distills only reliable knowledge.
Knowledge Distillation with Morphological Decoder (KD with MD). Another variation that includes a morphological decoder, which incorporates a linguistic feature on the target side.
Reliable Knowledge Distillation with Morphological Decoder (RKD with MD). A variation of KD with MD that distills only reliable knowledge.
ABDNMT. An NMT model based on the Transformer architecture, following the method proposed by Zhang et al. [50].
Twin Networks. The method proposed by Serdyuk et al. [33], which incorporates an L2 loss term.
SEER Forcing. An NMT model that utilizes the Seer forcing technique with knowledge distillation weight set to 0.5, following the method proposed by Feng et al. [9].
Factored Translation. The class information (stem or affix) of each token is obtained, and both words and classes are generated, following Bandyopadhyay [3] and Tamchyna et al. [40].
In this study, we conduct experiments using the Transformer [41] model, implementing our models by adapting the open-source toolkit Fairseq-py [27]. There are three input and output layers with an embedding dimension of 256, the inner feedforward layer dimension is 512, and the number of heads in the multi-head attention modules in both the encoder and decoder layers is 4. The training batches consist of sets of 4,096 source and target tokens. The models are trained and evaluated on two Tesla P100 GPUs. The test set is evaluated using a single model obtained by taking the best checkpoint, validated on the development set at each epoch. Translation performance is evaluated with the BLEU metric [28] computed using SacreBLEU [29].
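For reference, corpus-level BLEU with SacreBLEU [29] can be computed through its Python API as below. The toy hypothesis and reference strings are placeholders, and detokenized text is assumed.

```python
import sacrebleu

# Hypotheses produced by the model and a single reference stream (toy strings).
hyps = ["he carried it in", "the meeting was held yesterday"]
refs = [["he carried it inside", "the meeting took place yesterday"]]
bleu = sacrebleu.corpus_bleu(hyps, refs)
print(f"BLEU = {bleu.score:.2f}")
```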

6 Result and Analysis

6.1 Main Result

Table 3 presents the results of our models and related works.
Table 3.
Group | Models | En\(\rightarrow\)Mni | En\(\rightarrow\)Ta | En\(\rightarrow\)Mr | Avg
Baseline | BPE | 13.61 | 8.31 | 13.02 | 11.64
Baseline | Morfessor | 14.32 | 9.22 | 13.54 | 12.36
Our Implementations | KD with BD | 15.13 | 10.18 | 14.02 | 13.11
Our Implementations | RKD with BD | 15.76 | 10.65 | 14.46 | 13.62
Our Implementations | KD with MD | 14.79 | 9.61 | 13.92 | 12.77
Our Implementations | RKD with MD | 15.21 | 9.86 | 14.12 | 13.06
Our Implementations | RKD with BD & MD | 16.02 | 10.94 | 14.77 | 13.91
Related Works | Twin Networks [33] | 15.02 | 10.01 | 13.85 | 12.96
Related Works | ABDNMT [50] | 14.82 | 9.75 | 14.01 | 12.86
Related Works | Seer Forcing [9] | 15.20 | 10.22 | 14.05 | 13.15
Related Works | Factored Translation | 14.53 | 9.38 | 13.59 | 12.50
Table 3. BLEU Scores on En\(\rightarrow\)Mni, En\(\rightarrow\)Ta, and En\(\rightarrow\)Mr Test Sets
We compare the Transformer (“Baseline”, [41]); conventional or reliable knowledge distillation of the forward decoder with the backward decoder (“KD/RKD with BD”); conventional or reliable knowledge distillation of the forward decoder with the morphological decoder (“KD/RKD with MD”); and related works.
Morfessor models outperform BPE models. The baseline model, a standard Transformer trained without additional decoders, achieves moderate BLEU scores. Using BPE as the subword unit, the baseline obtains a BLEU score of 13.61 for English-to-Manipuri, 8.31 for English-to-Tamil, and 13.02 for English-to-Marathi. Using Morfessor for segmentation improves the performance to 14.32, 9.22, and 13.54, respectively. Therefore, we use Morfessor for segmentation in our proposed models.
Our models with additional decoders outperform baselines with only forward decoder. To assess the impact of different decoders and knowledge distillation approaches, we compare our model to various variations. The KD/RKD models with a backward decoder (BD) surpass the baseline in all three translation tasks. For English-to-Manipuri with Morfessor, the KD model achieves a BLEU score of 15.13, while the RKD model achieves a higher score of 15.76. Similarly, for English-to-Tamil with Morfessor, the KD model attains a score of 10.18, whereas the RKD model outperforms it with a score of 10.65. Similarly, for English-to-Marathi with Morfessor, the KD model attains a score of 14.02, whereas the RKD model outperforms it with a score of 14.46. Additionally, incorporating a morphological decoder (MD) enhances the performance of the baseline. The KD model with MD achieves BLEU scores of 14.79, 9.61 and 13.92 respectively, while the RKD model with MD achieves scores of 15.21, 9.86 and 14.12 respectively. Combining both backward decoder (BD) and morphological decoder (MD) further improves the score.
Our proposed models are competitive with related works. Comparing our proposed KD/RKD models with related works, we observe competitive performance. The Twin Networks approach, incorporating future information, achieves BLEU scores of 15.02 for English-to-Manipuri, 10.01 for English-to-Tamil and 13.85 for English-to-Marathi. ABDNMT, another model incorporating future information, achieves scores of 14.82, 9.75, and 14.01 for the respective tasks. SEER Forcing, employing the Seer forcing technique, achieves scores of 15.20, 10.22 and 14.05. Factored Translation, which augments class information, attains scores of 14.61, 9.43, and 13.52 for English-to-Manipuri and English-to-Tamil and English-to-Marathi, respectively.

6.2 Contribution of Morphological Decoder

As the morphological decoder has knowledge of the morphological class information of the next word and the preceding words, we expect the forward decoder to better predict the next word by distilling knowledge from the morphological decoder. To gain insight into whether the empirical usefulness indeed comes from the morphological decoder, we perform two ablation tests. For “Gaussian Noise,” the additional decoder’s probability distribution is randomly sampled from a Gaussian distribution, so the forward decoder is trained to match white noise. For “AR,” the additional decoder’s probability distribution is set to zero, inspired by Merity et al. [23]. The results in Table 4 show that the information included in the morphological states is indeed useful for obtaining a significant improvement.
Table 4.
Model | BLEU (En\(\rightarrow\)Mni) | BLEU (En\(\rightarrow\)Ta)
Baseline + Gaussian | 14.03 | 8.90
Baseline + AR | 14.34 | 9.44
Baseline + MD | 14.79 | 9.73
Table 4. Comparison of the Forward Decoder Guided by the Morphological Decoder (“MD”) Against “Gaussian Noise,” Where the Additional Decoder’s Probability Distribution is Randomly Sampled from a Gaussian Distribution, and “AR,” Where It is Set to Zero, Inspired by [23]
In our approach, we have integrated the classes of the previous tokens \(class(y_{\lt t})\) and the upcoming token \(class(y_{t})\) into the morphological decoder. After conducting experiments, we found that incorporating the class information for the next token \(class(y_{t})\) yields a modest improvement in performance compared to solely including the class information for the preceding tokens \(class(y_{\lt t})\), as demonstrated in Table 5. This implies that accounting for the class of the upcoming token provides more valuable contextual information for the model to generate the next token.
Table 5.
Class Information | BLEU (En\(\rightarrow\)Mni) | BLEU (En\(\rightarrow\)Ta)
\(class(y_{\lt t})\) | 14.62 | 9.47
\(class(y_{\le t})\) | 14.79 | 9.61
Table 5. Performance in KD with MD with Preceding Word Class Information Alone and with Both Preceding and Next Word Class Information

6.3 Contribution and Superiority of Backward Decoder

One of the key contributions of our approach is the use of a backward decoder, which possesses knowledge of future target information. We expect this knowledge to be transferred to the forward decoder through knowledge distillation.
Our experimental results, presented in both Table 3 and Table 5, clearly demonstrate the superiority of our approach over the baseline. In particular, we observe remarkable enhancements in performance when employing knowledge distillation with the backward decoder.
Interestingly, our results also show that the backward decoder outperforms the morphological decoder, indicating that our approach is able to leverage the additional information provided by the backward decoder to improve overall performance. Taken together, our findings suggest that the use of a backward decoder and knowledge distillation can significantly enhance the performance of neural language models.

6.4 Impact of Different Approaches to Reliability Estimation

In the context of knowledge distillation with the backward decoder, we conducted an investigation into various approaches for estimating reliability, and the results are presented in Table 6. Initially, we attempted to assign a binary reliability weight, represented by \(\beta _t\), with values of either 0 or 1, to indicate reliability or unreliability discretely. However, we found that this approach did not yield satisfactory results. As a result, we explored alternative approaches, including predicting the reliability weight using the hidden state \(\overleftarrow{h_t}\), by passing it through a linear layer and a sigmoid activation function. However, this approach also did not perform well.
Table 6.
Reliability Estimation | BLEU (En\(\rightarrow\)Mni) | BLEU (En\(\rightarrow\)Ta)
Hidden State | 15.23 | 10.17
Cross-entropy + Discrete | 15.32 | 10.37
Cross-entropy + Continuous | 15.76 | 10.65
Table 6. Comparison of Reliability Estimated using Only the Hidden State of the Backward Decoder (“Hidden State”) and using the Cross-entropy between the Ground Truth and the Predicted Probability Distribution of the Backward Decoder in Discrete and Continuous Forms (“Cross-entropy + Discrete”, “Cross-entropy + Continuous”)
Finally, we computed a continuous weight as the inverse of the cross-entropy between the ground truth and the predicted probability distribution. This approach yielded the best results, outperforming the previous two: the continuous weight allows a more nuanced estimation of reliability than a binary or discrete classification. We therefore adopt it for reliability estimation in knowledge distillation with the backward decoder.

6.5 Effect of Target Sentence Length

As part of our evaluation, we analyzed the impact of target sentence length, measured in number of words, on our proposed method using the En\(\rightarrow\)Mni dataset. We divided the test set into four distinct groups based on the length of the target sentences. The results of this analysis, presented in Figure 5, show that the performance of our proposed method depends on the length of the target sentences.
Fig. 5. BLEU scores over different sentence lengths.
In particular, we observed that our method was less effective when the length of the target sentence was between 0 and 10 words, with only minimal differences in BLEU scores compared to the baseline model. However, for sentences longer than 10 words, our proposed method showed a significant improvement in BLEU scores compared to the baseline Transformer model. This improvement was even more pronounced as the length of the target translations increased, indicating that our method is particularly effective for longer sentences.
Overall, this analysis shows that our proposed method is a promising solution for improving the quality of translations, especially for longer sentences.
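The bucketed evaluation described above can be reproduced with a few lines of Python. The exact bucket boundaries beyond the 0-10 range and the use of SacreBLEU per bucket are assumptions for illustration.

```python
import sacrebleu

def bleu_by_length(hyps, refs, edges=(10, 20, 30)):
    """Group test pairs by reference length (in words) and score each group separately."""
    buckets = {}
    for hyp, ref in zip(hyps, refs):
        n = len(ref.split())
        label = next((f"<={e}" for e in edges if n <= e), f">{edges[-1]}")
        h, r = buckets.setdefault(label, ([], []))
        h.append(hyp)
        r.append(ref)
    return {label: sacrebleu.corpus_bleu(h, [r]).score for label, (h, r) in buckets.items()}

hyps = ["he carried it in", "the long sentence " * 6]
refs = ["he carried it inside", "the long sentence " * 6]
print(bleu_by_length(hyps, refs))
```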

6.6 Analysis of Time Consumption and Parameter Size

In this section, we provide an analysis of the training and decoding times, as well as the parameter sizes, of the En\(\rightarrow\)Mni translation models. We present the results in Table 7.
Table 7.
Models | #Time1 | #Time2 | #Param
Baseline | 1.0 | 1.0 | 13.6M
RKD with BD | 1.7 | 1.0 | 16.8M
RKD with MD | 1.7 | 1.0 | 16.8M
Table 7. Training Time, Test Time, and Size of the Model Parameters in En\(\rightarrow\)Mni Translation Models
“Time1” denotes the training time (in ratio), “Time2” denotes the decoding time (in ratio), and “Param” denotes the size of model parameters (M for million).
The table compares three En\(\rightarrow\)Mni translation models, namely Baseline, RKD with BD, and RKD with MD. The models are evaluated based on their training and decoding times, as well as the size of their model parameters.
In terms of training and decoding times, the Baseline model has the lowest values, with a ratio of 1.0 for both training and decoding times. The RKD models, on the other hand, have a training time ratio of 1.7 and a decoding time ratio of 1.0.
When it comes to the size of the model parameters, the baseline Transformer model has fewer parameters (13.6 million) than the RKD models (16.8 million).
Based on these findings, the RKD models require more training time than the Baseline model and also involve extra parameters. However, it is noteworthy that the backward decoder and the morphological decoder are used only at training time and are discarded at inference time. This makes the inference time of the proposed RKD models match that of the baseline Transformer model.

6.7 Scalability Discussion

While this paper adopts a theoretical focus by using smaller, domain-specific datasets to train neural machine translation (NMT) models, it is also essential to consider the practical implications of scaling the approach to high-resource settings. To this end, we extend the English-Tamil dataset by incorporating additional publicly available corpora, including JW300, NLP Contributions (nlpc), OpenSubtitles, TED2020, and WikiMatrix, as provided in Nakazawa et al. [24]. The aggregated dataset contains a total of 1.35 million parallel sentences, providing a much larger resource for robust model training.
In this expanded setting, the model architecture consists of six input and output layers with an embedding dimension of 1024. The inner feedforward layers have a dimensionality of 4096, and the multi-head attention modules use 16 heads in both the encoder and decoder layers. Training is performed using mini-batches of 8,000 source and target tokens, which helps maintain stable gradient updates and efficient use of computational resources. Since the WAT2021 test set is in the PMI domain, after training on the expanded dataset we fine-tune on the PMI dataset.
The results in Table 8 highlight the impact of different segmentation strategies on translation quality. While the baseline model with Byte Pair Encoding (BPE) achieves a BLEU score of 10.21, replacing BPE with linguistically motivated Morfessor segmentation improves performance, yielding a BLEU score of 11.54. This improvement can be attributed to Morfessor’s ability to capture meaningful morphological patterns, especially useful for agglutinative languages like Tamil. Furthermore, our proposed Reliable Knowledge Distillation (RKD) approach, which integrates a backward decoder and a morphological decoder, achieves a BLEU score of 13.04. This demonstrates the effectiveness of our method for morphologically rich agglutinative languages, outperforming both BPE and Morfessor-based baselines.
Table 8.
Model | BLEU (En\(\rightarrow\)Ta)
Baseline BPE | 10.21
Baseline Morfessor | 11.54
RKD with BD & MD | 13.04
Table 8. Comparison of the Baseline Transformer Models with BPE Segmentation and Morfessor Segmentation, along with Our Model, Reliable Knowledge Distillation with Backward Decoder and Morphological Decoder (RKD with BD & MD), in the High-Resource Setting

7 Conclusion

In this paper, we proposed a novel knowledge distillation approach tailored to enhance neural machine translation for agglutinative languages, which pose unique challenges due to their morphologically complex words (MCWs). Our method leverages an auxiliary backward decoder to capture future context, and a morphological decoder to integrate target-side morphological information like stems and affixes. Through knowledge distillation, the predictions from these decoders are selectively distilled to the main forward autoregressive decoder based on a reliability estimation. To address the challenge of unreliable guidance from auxiliary decoders, we introduced a reliability estimation mechanism that ensures only reliable predictions are distilled to the forward decoder.
Our experiments on English-Manipuri, English-Tamil, and English-Marathi datasets confirm that the proposed method outperforms strong Transformer-based NMT baselines, achieving better accuracy in generating MCWs. Furthermore, we demonstrated that using unsupervised morphology-based word segmentation yields superior results compared to the widely adopted Byte Pair Encoding (BPE) method for highly agglutinative languages.
In the future, we will explore directions including integrating richer linguistic knowledge such as syntax and semantics, and evaluating other language families and domains such as speech translation. With the increasing popularity of large pretrained models, integrating our approach with them could be another direction.

Footnote

1. The term “word” is used interchangeably for any type of “token”.

References

[1]
Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio. 2016. An actor-critic algorithm for sequence prediction. CoRR abs/1607.07086 (2016). arXiv:1607.07086. http://arxiv.org/abs/1607.07086
[2]
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. http://arxiv.org/abs/1409.0473
[3]
Saptarashmi Bandyopadhyay. 2019. Factored neural machine translation at LoResMT 2019. In Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages. European Association for Machine Translation, Dublin, Ireland, 68–71. https://aclanthology.org/W19-6811
[4]
Tamali Banerjee and Pushpak Bhattacharyya. 2018. Meaningless yet meaningful: Morphology grounded subword-level NMT. In Proceedings of the Second Workshop on Subword/Character Level Models, Manaal Faruqui, Hinrich Schütze, Isabel Trancoso, Yulia Tsvetkov, and Yadollah Yaghoobzadeh (Eds.). Association for Computational Linguistics, New Orleans, LA, USA, 55–60.
[5]
Emanuele Bugliarello and Naoaki Okazaki. 2020. Enhancing machine translation with dependency-aware self-attention. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 1618–1627.
[6]
Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Mike Schuster, Noam Shazeer, Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. 2018. The best of both worlds: Combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, 76–86.
[7]
Mathias Creutz and Krista Lagus. 2002. Unsupervised discovery of morphemes. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning. Association for Computational Linguistics, 21–30.
[8]
Anna Currey and Kenneth Heafield. 2018. Multi-source syntactic neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, 2961–2966.
[9]
Yang Feng, Shuhao Gu, Dengji Guo, Zhengxin Yang, and Chenze Shao. 2021. Guiding teacher forcing with seer forcing for neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 2862–2872.
[10]
Lyn Frazier and Keith Rayner. 1982. Making and correcting errors during sentence comprehension: Eye movements in the analysis of structurally ambiguous sentences. Cognitive Psychology 14 (1982), 178–210.
[11]
Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. 1243–1252.
[12]
Xinwei Geng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2018. Adaptive multi-pass decoder for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, 523–532.
[13]
Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc’Aurelio Ranzato, Francisco Guzmán, and Angela Fan. 2022. The Flores-101 evaluation benchmark for low-resource and multilingual machine translation. Transactions of the Association for Computational Linguistics 10 (2022), 522–538.
[14]
Stig-Arne Grönroos, Sami Virpioja, Peter Smit, and Mikko Kurimo. 2014. Morfessor FlatCat: An HMM-based method for unsupervised and semi-supervised learning of morphology. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. Dublin City University and Association for Computational Linguistics, Dublin, Ireland, 1177–1185. https://aclanthology.org/C14-1111
[15]
Barry Haddow and Faheem Kirefu. 2020. PMIndia - A collection of parallel corpora of languages of India. ArXiv abs/2001.09907 (2020).
[16]
Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. ArXiv abs/1503.02531 (2015).
[17]
Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5 (2017), 339–351.
[18]
Kimmo Kettunen. 2014. Can type-token ratio be used to show morphological complexity of languages? Journal of Quantitative Linguistics 21, 3 (2014), 223–245.
[19]
Philipp Koehn and Hieu Hoang. 2007. Factored translation models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). Association for Computational Linguistics, Prague, Czech Republic, 868–876. https://aclanthology.org/D07-1091
[20]
Matjaž Kukar and Igor Kononenko. 2002. Reliable classifications with machine learning. In Machine Learning: ECML 2002: 13th European Conference on Machine Learning Helsinki, Finland, August 19–23, 2002 Proceedings 13. Springer, 219–231.
[21]
Haichao Li, Minh-Thang Luong, and Christopher D. Manning. 2017. Deep reinforcement learning for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 743–752.
[22]
Junhui Li, Deyi Xiong, Zhaopeng Tu, Muhua Zhu, Min Zhang, and Guodong Zhou. 2017. Modeling source syntax for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vancouver, Canada, 688–697.
[23]
Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and optimizing LSTM language models. ArXiv abs/1708.02182 (2017).
[24]
Toshiaki Nakazawa, Hideki Nakayama, Chenchen Ding, Raj Dabre, Shohei Higashiyama, Hideya Mino, Isao Goto, Win Pa Pa, Anoop Kunchukuttan, Shantipriya Parida, Ondřej Bojar, Chenhui Chu, Akiko Eriguchi, Kaori Abe, Yusuke Oda, and Sadao Kurohashi. 2021. Overview of the 8th workshop on Asian translation. In Proceedings of the 8th Workshop on Asian Translation (WAT2021), Toshiaki Nakazawa, Hideki Nakayama, Isao Goto, Hideya Mino, Chenchen Ding, Raj Dabre, Anoop Kunchukuttan, Shohei Higashiyama, Hiroshi Manabe, Win Pa Pa, Shantipriya Parida, Ondřej Bojar, Chenhui Chu, Akiko Eriguchi, Kaori Abe, Yusuke Oda, Katsuhito Sudoh, Sadao Kurohashi, and Pushpak Bhattacharyya (Eds.). Association for Computational Linguistics, Online, 1–45.
[25]
Giovanna Nicora, Miguel Rios, Ameen Abu-Hanna, and Riccardo Bellazzi. 2022. Evaluating pointwise reliability of machine learning prediction. Journal of Biomedical Informatics 127 (2022), 103996.
[26]
Antoine Nzeyimana. 2024. Low-resource neural machine translation with morphological modeling. In Findings of the Association for Computational Linguistics: NAACL 2024, Kevin Duh, Helena Gomez, and Steven Bethard (Eds.). Association for Computational Linguistics, Mexico City, Mexico, 182–195.
[27]
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations). Association for Computational Linguistics, Minneapolis, MN, USA, 48–53.
[28]
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Philadelphia, PA, USA, 311–318.
[29]
Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers. Association for Computational Linguistics, Brussels, Belgium, 186–191.
[30]
Mirco Ravanelli, Dmitriy Serdyuk, and Yoshua Bengio. 2018. Twin regularization for online speech recognition. In Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018, B. Yegnanarayana (Ed.). ISCA, 3718–3722.
[31]
Rico Sennrich and Barry Haddow. 2016. Linguistic input features improve neural machine translation. In Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers. Association for Computational Linguistics, Berlin, Germany, 83–91.
[32]
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, 1715–1725.
[33]
Dmitriy Serdyuk, Nan Rosemary Ke, Alessandro Sordoni, Adam Trischler, Christopher Joseph Pal, and Yoshua Bengio. 2017. Twin networks: Matching the future for sequence generation. arXiv: Learning (2017).
[34]
Jie Shao, Xiaodong Zhang, Lidong Li, and Ming Liu. 2019. Dynamic reinforcement learning for neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2674–2683.
[35]
Telem Joyson Singh, Sanasam Ranbir Singh, and Priyankoo Sarmah. 2021. English-Manipuri machine translation: An empirical study of different supervised and unsupervised methods. 2021 International Conference on Asian Language Processing (IALP) (2021), 142–147.
[36]
Telem Joyson Singh, Sanasam Ranbir Singh, and Priyankoo Sarmah. 2023. Subwords to word back composition for morphologically rich languages in neural machine translation. In Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation. Association for Computational Linguistics, Hong Kong.
[37]
Shashank Siripragada, Jerin Philip, Vinay P. Namboodiri, and C. V. Jawahar. 2020. A multilingual parallel corpora collection effort for Indian languages. (May 2020), 3743–3751. https://aclanthology.org/2020.lrec-1.462
[38]
Pelle Söderström, Merle Horne, and Mikael Roll. 2016. Stem tones pre-activate suffixes in the brain. Journal of Psycholinguistic Research 46 (2016), 271–280.
[39]
Kai Song, Yue Zhang, Min Zhang, and Weihua Luo. 2018. Improved English to Russian translation by neural suffix prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
[40]
Aleš Tamchyna, Marion Weller-Di Marco, and Alexander Fraser. 2017. Modeling target-side inflection in neural machine translation. In Proceedings of the Second Conference on Machine Translation. Association for Computational Linguistics, Copenhagen, Denmark, 32–42.
[41]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
[42]
Marion Weller-Di Marco and Alexander Fraser. 2020. Modeling word formation in English–German neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online, 4227–4232.
[43]
Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 3-4 (1992), 229–256.
[44]
Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin, Tao Qin, Nenghai Yu, and Tie-Yan Liu. 2017. Deliberation networks: Sequence generation beyond one-pass decoding. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2017/file/c6036a69be21cb660499b75718a3ef24-Paper.pdf
[45]
Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. 2018. Improving neural machine translation with conditional sequence generative adversarial nets. (June 2018), 1346–1355.
[46]
Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. Proceedings of the AAAI Conference on Artificial Intelligence 31, 1 (Feb. 2017).
[47]
Biao Zhang, Deyi Xiong, Jinsong Su, and Jiebo Luo. 2019. Future-aware knowledge distillation for neural machine translation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27, 12 (2019), 2278–2287.
[48]
Jiajun Zhang, Long Zhou, Yang Zhao, and Chengqing Zong. 2020. Synchronous bidirectional inference for neural sequence generation. Artificial Intelligence 281 (2020), 103234.
[49]
Meishan Zhang, Zhenghua Li, Guohong Fu, and Min Zhang. 2019. Syntax-enhanced neural machine translation with syntax-aware word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, MN, USA, 1151–1161.
[50]
Xiangwen Zhang, Jinsong Su, Yue Qin, Yang Liu, R. Ji, and Hongji Wang. 2018. Asynchronous bidirectional decoding for neural machine translation. ArXiv abs/1801.05122 (2018).
[51]
Zhirui Zhang, Shuangzhi Wu, Shujie Liu, Mu Li, Ming Zhou, and Enhong Chen. 2018. Regularizing neural machine translation by target-bidirectional agreement. In AAAI Conference on Artificial Intelligence.
[52]
Zaixiang Zheng, Shujian Huang, Zhaopeng Tu, Xin-Yu Dai, and Jiajun Chen. 2019. Dynamic past and future for neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 931–941.
[53]
Zaixiang Zheng, Hao Zhou, Shujian Huang, Lili Mou, Xinyu Dai, Jiajun Chen, and Zhaopeng Tu. 2018. Modeling past and future for neural machine translation. Transactions of the Association for Computational Linguistics 6 (2018), 145–157.
[54]
Long Zhou, Jiajun Zhang, and Chengqing Zong. 2019. Synchronous bidirectional neural machine translation. Transactions of the Association for Computational Linguistics 7 (2019), 91–105.
