Research Article | Open Access

Distilling Knowledge in Machine Translation of Agglutinative Languages with Backward and Morphological Decoders

Published: 18 January 2025

Abstract

Agglutinative languages often have morphologically complex words (MCWs) composed of multiple morphemes arranged in a hierarchical structure, posing significant challenges in translation tasks. We present a novel Knowledge Distillation approach tailored for improving the translation of such languages. Our method involves an encoder, a forward decoder, and two auxiliary decoders: a backward decoder and a morphological decoder. The forward decoder generates target morphemes autoregressively and is augmented by distilling knowledge from the auxiliary decoders. The backward decoder incorporates future context, while the morphological decoder integrates target-side morphological information. We have also designed a reliability estimation method to selectively distill only the reliable knowledge from these auxiliary decoders. Our approach relies on morphological word segmentation. We show that the word segmentation method based on unsupervised morphology learning outperforms the commonly used Byte Pair Encoding method on highly agglutinative languages in translation tasks. Our experiments conducted on English-Tamil, English-Manipuri, and English-Marathi datasets show that our proposed approach achieves significant improvements over strong Transformer-based NMT baselines.

1 Introduction

Machine translation into morphologically rich languages is challenging due to lexical sparsity and the wide variety of grammatical features expressed through morphology. Agglutinative languages, such as Manipuri, Tamil, and Marathi, often use morphologically complex words (MCWs), which are formed by combining multiple morphemes, each carrying a distinct semantic or syntactic meaning. MCWs in these languages pose significant difficulties for conventional NMT models due to their hierarchical structure and the intricate interplay of morphemes. To illustrate agglutination in MCWs, Table 1 provides an example from the Manipuri language. The correct generation of target MCWs relies heavily on understanding the global context and accurately capturing morphological information. However, standard autoregressive decoding, which processes words one at a time (see Footnote 1) without access to future context, often falls short of capturing these subtleties, leading to errors in MCW generation. This is particularly problematic in agglutinative languages, where minor morphological errors can significantly alter meaning.
Table 1.
Word | Morphemes | Translation
puba | pu, ba | carry
pukhi | pu, khi | carried
pusinkhi | pu, sin, khi | carried in
pusinningkhi | pu, sin, ning, khi | wanted to carry in
pusinningkhide | pu, sin, ning, khi, de | not wanted to carry in
Table 1. Case Study of Agglutination in MCWs from Manipuri Language
The bold part in the morphemes indicates the stem.
Many efforts have been made to incorporate linguistic tags on the source side [8, 22, 31, 49]. However, incorporating these features on the target side of the translation system is challenging, as it requires incrementally tagging and parsing the hypotheses at test time [31]. Tagging and parsing require the whole sentence \(y_1, y_2, \dots , y_{n}\), which is not available at test time. Furthermore, several approaches have been proposed to incorporate future information, including reinforcement learning [1, 45], additional decoding passes, and an additional decoder [9, 12, 44, 50], but these methods did not take the reliability of the future information into account.
In this paper, we propose a Knowledge Distillation method specifically designed to generate MCWs accurately and thereby enhance neural machine translation for agglutinative languages. Our method involves training an encoder and three separate decoders: a forward decoder for autoregressive generation, a backward decoder to capture future contextual information, and a morphological decoder that explicitly integrates target-side morphological data, such as morphological tags or POS tags. We incorporate future information through knowledge distillation from the backward decoder to the forward decoder, inspired by a recently proposed technique called Twin Networks [33]. To integrate morphological information of the target-side words, we use knowledge distillation from a morphological decoder. In addition to the word embeddings of \(y_1, y_2, \dots , y_{t-1}\), the morphological decoder is trained with additional morphological information of each of \(y_1, \dots , y_t\) to predict \(y_t\). For our experiments, we use morphological class information and incorporate stem and affix information on the target side.
We explore the idea of learning future target information and morphological information from the backward decoder and the morphological decoder within the knowledge distillation framework [16]. Unlike typical knowledge distillation approaches [9, 30, 33, 51], our method distills knowledge from an additional decoder only when that knowledge is reliable. While the backward decoder and the morphological decoder provide additional information to the forward decoder, they are not perfect next-word predictors and can give incorrect guidance to the forward decoder. To address this issue, we introduce a novel approach that dynamically and selectively distills knowledge based on a reliability estimate. We use the term “reliability” to indicate the level of confidence we have in the prediction made by the model on a single example, as used in [20, 25]. Specifically, we estimate the reliability of the predictions from the backward and morphological decoders by computing the inverse of the cross-entropy between the model’s prediction and the ground truth.
Our approach also relies on effective morphological word segmentation, which plays a critical role in the accurate representation of MCWs. Traditional purely statistical models, such as Byte Pair Encoding (BPE), often fall short when applied to highly agglutinative languages, as they may not effectively capture the intricate morphological structures inherent in these languages. In contrast, we utilize a linguistically motivated morphological analyzer called Morfessor. This method enables more precise modeling of the underlying morphological relationships and enhances the quality of translation.
Our experimental results on the English-Manipuri, English-Tamil and English-Marathi datasets show that the proposed method significantly improves over strong Transformer-based NMT baselines.
The main contributions of this paper can be summarized as follows:
We show that morphologically motivated segmentation outperforms Byte-Pair Encoding in translating highly agglutinative languages.
We investigate using backward and morphological decoders to improve the generation of Morphologically Complex Words (MCWs) in neural machine translation for agglutinative languages through knowledge distillation.
We propose a new reliable knowledge distillation method that dynamically distills knowledge selectively based on our reliability estimation method at the likelihood level.
We introduce a new morphological decoder that distills morphological information to the forward decoder to incorporate a morphological feature on the target side.

2 Related Work

In recent years, several encoder-decoder architectures have been proposed. These include the attention-based models of Bahdanau et al. [2], Chen et al. [6], Gehring et al. [11], Johnson et al. [17], and Vaswani et al. [41]. Several methods have been proposed for machine translation of agglutinative languages which are designed to deal with morphological complexities. Our work is inspired by two lines of research: linguistic knowledge-informed translation and future-aware translation.

2.1 Incorporating Linguistic Knowledge

In the realm of Statistical Machine Translation (SMT), Koehn and Hoang [19] pioneered the concept of factored translation models, integrating diverse morphological features into the translation process. They accomplished this by augmenting word representations with additional morphological and syntactic features. Unlike SMT, Neural Machine Translation (NMT) models use a fixed vocabulary and hence rely on a vocabulary reduction method like BPE [32]. However, previous works like Banerjee and Bhattacharyya [4] show that using the linguistically motivated Morfessor improves translation in some languages, while Weller-Di Marco and Fraser [42] use a rule-based morphological analyzer for English-German translation. Subsequently, in Neural Machine Translation, a stream of research has emerged aiming to enhance word vector representations by incorporating valuable linguistic features into either the source encoder or the target decoder. For instance, Sennrich and Haddow [31] broadened the embedding layer of an NMT encoder by incorporating a blend of morphological and syntactic features, encompassing word lemma, morphological attributes, POS, and dependency labels. Their model utilized concatenated feature embedding vectors as input word embeddings while keeping the rest of the NMT model unaltered. Conversely, Bandyopadhyay [3], Song et al. [39], and Tamchyna et al. [40] factored words into morphological (lemma) and syntactic (factors) features at the output decoder of NMT. Their model, augmented with a heuristic morphological synthesizer, generated unseen word forms based on predicted lemma and factors. This method aimed to address challenges related to large vocabulary and out-of-vocabulary (OOV) instances during translation. Regrettably, this approach did not yield significant improvements in experimental outcomes. Incorporating these features on the target side of the translation system is challenging, since it requires incrementally tagging and parsing the hypotheses at test time, which in turn requires the whole sentence that is not available at test time. Nzeyimana [26] incorporated linguistic tags through multi-task, multi-label training. Our paper advocates integrating linguistic knowledge into the target-side words by distilling knowledge from a morphological decoder.
Another line of research explores the integration of external knowledge through multi-source or multi-task learning. This involves considering additional sources as valuable, distinct information that augments the learning process of a translation model. The idea behind these approaches is to integrate the linguistic features into the model architecture, such as in a multi-encoder [22], modified attention [5, 49], or multi-task learning [8].

2.2 Future Aware Translation

To incorporate future information into neural machine translation (NMT), various approaches have been proposed. One approach is to use reinforcement learning, such as the REINFORCE algorithm [34, 43, 45, 46] or the actor-critic algorithm [1, 21].
A separate set of techniques incorporates future information into the inference process by employing additional decoding passes or supplementary components during testing. For instance, Xia et al. [44] and Zhang et al. [50] advocated a two-pass decoding algorithm, generating an initial draft translation followed by a refined final translation based on the draft. In a similar vein, Zhang et al. [48] and Zhou et al. [54] maintained both forward and backward decoders, decoding simultaneously and interacting when making predictions. From a distinct perspective, some researchers focus on integrating future information, such as Feng et al. [9] and Zhang et al. [47], who propose leveraging future source information to guide machine translation with knowledge distillation, mitigating source incompleteness. Another approach is to model past and future information for the source to help the decoder focus on untranslated source information, as in Zheng et al. [53] and Zheng et al. [52]. However, our method differs from previous approaches in that it distills knowledge from the additional decoder only when the knowledge is reliable, taking the reliability of the future information into account.

3 Morphology and MT

Morphemes are the smallest meaningful units of a language. Some morphemes, called stems, express core meanings, while others, called affixes, express one or more dependent features of the core meaning, such as person, gender, or aspect. In agglutinative languages, each morpheme typically corresponds to a single feature, and words are constructed by concatenating morphemes with clear boundaries between them.
Our approach relies on the morphological segmentation of words. Before considering the broader problem of integrating the morphological decoder and the backward decoder, we perform an initial study to verify the usefulness of a linguistically motivated morphological analyzer in MT. While previous works such as Banerjee and Bhattacharyya [4], Singh et al. [36], and Weller-Di Marco and Fraser [42] have shown that using a morphological analyzer improves translation in some languages, there has not been a comprehensive investigation into which types of agglutinative languages are best suited to morphological analysis in MT. In this section, we discuss methods for measuring morphological complexity and segmentation techniques. We then explore the usefulness of morphological analyzers for segmenting agglutinative languages in translation tasks. Finally, we demonstrate that the morphologically motivated segmentation method consistently outperforms Byte-Pair Encoding (BPE) in translation tasks for highly agglutinative languages.

3.1 Measuring Morphological Complexity

Morphological complexity varies across languages, and it is important to identify this complexity in order to apply our method. Instead of relying on expert linguistic descriptions, we adopt corpus-based metrics such as Types, Type-Token Ratio (TTR), and Moving-Average Type-Token Ratio (MATTR). These measures are effective in assessing and ranking languages based on morphological complexity, as suggested by Kettunen [18]. Table 2 summarizes these metrics.
Table 2.
Metrics | Definition
Types | Count of unique word forms (types) in a text or corpus
Type-Token Ratio (TTR) | The ratio of types to the total number of tokens in a text or corpus
Moving-Average Type-Token Ratio (MATTR) | The average TTR derived from fixed-size overlapping segments (e.g., 50 words in our case) within a text
Table 2. Definitions of the Various Metrics used to Assess Morphological Complexity
Languages with rich morphology tend to exhibit high values for these metrics due to their diverse word forms, while languages with simpler morphology show lower values. By applying these measures to our selected languages, we obtain a quantitative basis for comparing their morphological characteristics, which will later inform the analysis of segmentation techniques in translation tasks.
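To make the measures in Table 2 concrete, the short Python sketch below computes types, TTR, and MATTR for a whitespace-tokenized text. The sliding-window formulation of MATTR and the toy corpus are illustrative assumptions, not the exact scripts used for the analysis in this paper.

```python
def type_token_ratio(tokens):
    """TTR: number of unique types divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens)

def mattr(tokens, window=50):
    """Moving-Average TTR over fixed-size overlapping (sliding) windows."""
    if len(tokens) <= window:
        return type_token_ratio(tokens)
    ratios = [type_token_ratio(tokens[i:i + window])
              for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)

# Toy example with whitespace tokenization; a real comparison would use the FLORES text.
corpus = "the cats chased the mice and the mice ran away".split()
print(len(set(corpus)), type_token_ratio(corpus), mattr(corpus, window=5))
```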

3.2 Segmentation Methods

Word-level models would predict “unknown” for out-of-vocabulary word tokens, making them unsuitable for morphologically rich languages with sparse vocabularies. Therefore, we train our translation models using two segmentation methods: BPE (Byte-Pair Encoding), a purely statistical method, and Morfessor, a linguistically motivated morphological analyzer.
Byte Pair Encoding. BPE [32] starts with character segmentation and merges characters into larger units based on their frequencies. This results in units that fall between characters and words, with the number of merge operations serving as a hyperparameter.
Morfessor. The default implementation [7] employs a unigram language model to identify morph-like structures. It selects segments in a top-down manner and includes a prior term on segment length, encouraging segments that resemble plausible morphemes. We use the Morfessor FlatCat variant [14] in our implementation.
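As a concrete illustration of the BPE procedure described above, the toy Python sketch below learns merge operations over a tiny vocabulary of Manipuri word forms from Table 1. Real experiments would use the subword-nmt or SentencePiece implementations and the Morfessor FlatCat toolkit rather than this simplified code.

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count frequencies of adjacent symbol pairs over the current vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Apply one merge operation to every word in the vocabulary."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words are space-separated characters with an end-of-word marker.
vocab = {"p u b a </w>": 5, "p u k h i </w>": 6, "p u s i n k h i </w>": 3}
for _ in range(8):  # the number of merge operations is the hyperparameter
    stats = get_pair_stats(vocab)
    if not stats:
        break
    vocab = merge_pair(max(stats, key=stats.get), vocab)
print(vocab)
```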

3.3 Evaluation

We now present the evaluation of our morphological measures and segmentation models. We experiment with the following languages: Tamil, Marathi, Manipuri, Japanese and Indonesian. These languages represent different families and degrees of morphological complexity, as reflected in our analysis.
Morphological Complexity Analysis. For this analysis, we use the multi-way parallel FLORES dataset [13]. Figure 1 compares the morphological measures across the five languages, alongside English. As seen in the figure, Tamil, Marathi, and Manipuri exhibit high morphological complexity, indicating their high agglutinative nature. In contrast, Japanese and Indonesian have relatively lower levels of agglutination.
Fig. 1. Comparison of morphological complexity across languages. Tamil, Manipuri, and Marathi are highly agglutinative, while Japanese and Indonesian exhibit lower degrees of agglutination.
Effects of Segmentation Methods on Machine Translation. To compare different segmentation models, we use the IWSLT 2017 dataset for Japanese and Indonesian. The English-Japanese dataset has 220,000 parallel sentences, and the English-Indonesian dataset has 76,000 parallel sentences. The experimental settings and dataset details for Tamil, Marathi, and Manipuri are provided in Section 5. Figure 2 shows the BLEU scores for both segmentation methods across these languages.
Fig. 2. Comparison of BPE and Morfessor across languages, showing that Morfessor improves translation for highly agglutinative languages (Manipuri (Mni), Tamil (Tam), Marathi (Mar)) but offers no benefit for less agglutinative ones (Japanese (Jpn), Indonesian (Ind)).
Our results demonstrate that Morfessor significantly improves translation quality for highly agglutinative languages. However, it does not offer an improvement for languages with a lower degree of agglutination, such as Japanese and Indonesian. In our further experiments, we will concentrate on the three highly agglutinative languages.

4 Proposed Method

Comprehension of sentences in human cognition is influenced by the preceding and succeeding discourse [10]. Native speakers’ brains subconsciously anticipate the presence of a suffix when encountering a specific stem, reflecting the influence of priming [38]. Our proposed framework uses multi-decoder training with reliable knowledge distillation to generate the target sentence, simulating the aforementioned processes by leveraging relevant full-context information and morphological information.

4.1 Model

Our model is an extension of the dominant encoder-decoder model called the Transformer [41]. It employs a multi-decoder approach, in which separate decoders generate the target sentence in different ways.
Encoder and Forward Decoder. In the forward decoder, the target sentence \(Y={y_1,\ldots ,y_n}\) is generated by maximizing the conditional probability of each token \(y_t\) given the source sentence X and the preceding words \(y_{\lt t}\):
\begin{equation} \mathbb {P}(Y|X)=\prod _{t=1}^{n}p(y_t|y_{\lt t},X) \end{equation}
(1)
The input sequence X is first fed into the encoder, which encodes it into m context vectors \(C = {c_1,c_2, \ldots , c_m}\) where m is the length of X. Specifically, this is achieved through the self-attention network (SAN), which produces the context vectors \(C = SAN(X)\).
The forward decoder generates the target sentence Y word-by-word based on the representation C from the encoder, using attention mechanisms to take into account the generated target fragment. Given a sequence of word embeddings \(\mathbf {E}(y_1),\ldots , \mathbf {E}(y_{t-1})\) in the generated target fragment at the t-th timestep, they are transformed into a key matrix \(\mathbf {K}_{t-1}\) and a value matrix \(\mathbf {V}_{t-1}\). A Self-Attention (SelfATTs) module is then used to learn the target representation \(s_t\):
\begin{equation} s_t = SelfATTs(\overrightarrow{h}_{t-1}, \mathbf {K}_{t-1}, \mathbf {V}_{t-1}) \end{equation}
(2)
where \(\overrightarrow{h}_{t-1} \in \mathbb {R}^{d_{model}}\) is the previous context vector. \(s_t\) is then fed into another Attention (ATTc) module to compute the time-dependent context vector \(h_t\):
\begin{equation} \overrightarrow{h_t} = ATTc(s_t, \mathbf {K}_e, \mathbf {V}_e) \end{equation}
(3)
where \(\mathbf {K}_e\) and \(\mathbf {V}_e\) are key and value matrices, respectively from encoder, that are transformed from the source representation C.
The probability distribution \(p(y_t|y_{\lt t}, X)\) of the generated target word \(y_t\) is computed using a multi-layer perceptron (MLP) layer:
\begin{equation} p(y_t|y_{\lt t}, X) = softmax(\mathbf {W}_o\overrightarrow{h_t}+b) \end{equation}
(4)
The target word \(y_t\) with the maximum probability is selected as the output of the decoder at the t-th timestep.
Backward Decoder. In the backward decoder, the target sentence is generated by taking into account the source sentence and the succeeding words:
\begin{equation} \mathbb {P}(Y|X)=\prod _{t=1}^{n}p(y_t|y_{\gt t},X) \end{equation}
(5)
To compute the probability of a target token \(y_t\), the decoder first computes the representation of the succeeding words using the self-attention mechanism. Given a sequence of word embeddings \(\mathbf {E}(y_{t+1}), ..., \mathbf {E}(y_{n})\) of the succeeding words at the t-th timestep, they are transformed into key matrix \(\mathbf {K}^{^{\prime }}_{t+1}\) and value matrix \(\mathbf {V}^{^{\prime }}_{t+1}\) . The target representation \(s^{^{\prime }}_t\) is computed using attention mechanism as:
\begin{equation} s^{^{\prime }}_t = SelfATTs(\overleftarrow{h}_{t+1}, \mathbf {K}^{^{\prime }}_{t+1}, \mathbf {V}^{^{\prime }}_{t+1}) \end{equation}
(6)
The context vector \(\overleftarrow{h_t}\) is then computed using attention mechanism as:
\begin{equation} \overleftarrow{h_t} = ATTc(s^{^{\prime }}_t, \mathbf {K}_e, \mathbf {V}_e) \end{equation}
(7)
The probability distribution \(p(y_t|y_{\gt t}, X)\) of the generated target word \(y_t\) is computed as:
\begin{equation} p(y_t|y_{\gt t},X) = softmax(\mathbf {W}_o\overleftarrow{h_t}+b) \end{equation}
(8)
Figure 3 depicts the backward decoder (top) and forward decoder (bottom) with Reliable Knowledge Distillation.
Fig. 3. Backward decoder (Top) and Forward decoder (Bottom) with Reliable Knowledge Distillation. During the evaluation, the backward decoder part is discarded.
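As a rough sketch of Equations (5)-(8), and not the authors' exact implementation, the backward decoder can be realized by running a standard causal Transformer decoder over the reversed target sequence, so that each position is conditioned on the succeeding tokens. The dimensions below follow Section 5.2, and the usual input/output shift (BOS/EOS handling) is omitted for brevity.

```python
import torch
import torch.nn as nn

d_model, vocab_size, n_heads = 256, 8000, 4
embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerDecoderLayer(d_model, n_heads, dim_feedforward=512, batch_first=True)
backward_decoder = nn.TransformerDecoder(layer, num_layers=3)
out_proj = nn.Linear(d_model, vocab_size)  # W_o and b of Equation (8)

def backward_distributions(tgt_tokens, encoder_out):
    """Approximate p(y_t | y_{>t}, X) for every position t (batch_first tensors)."""
    rev = torch.flip(tgt_tokens, dims=[1])                        # read the target right-to-left
    causal = nn.Transformer.generate_square_subsequent_mask(rev.size(1))
    h = backward_decoder(embed(rev), encoder_out, tgt_mask=causal)
    logits = out_proj(h)
    return torch.flip(logits, dims=[1]).softmax(-1)               # re-align to forward order

src_repr = torch.randn(2, 7, d_model)        # stand-in for the encoder context vectors C
tgt = torch.randint(0, vocab_size, (2, 5))
print(backward_distributions(tgt, src_repr).shape)  # torch.Size([2, 5, 8000])
```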
Morphological Decoder. The morphological decoder generates the target sentence by taking into account the source sentence, the preceding words, and the morphological classes of the preceding tokens and the next token. The probability of a target token, \(y_t\), is computed using the classes of the preceding tokens and the next token \(class(y_{\le t})\), the preceding words, and the source sentence.
\begin{equation} \mathbb {P}(Y|X)=\prod _{t=1}^{n}p(y_t|class(y_{\le t}),y_{\lt t},X) \end{equation}
(9)
In this morphological decoder, the word embeddings \(\mathbf {E}(y_1),\ldots , \mathbf {E}(y_{t-1})\) are combined with respective morphological class embeddings \(\mathbf {E}_c(class(y_1)),\ldots ,\mathbf {E}_c(class(y_{t-1}))\) by passing through a linear layer. The decoder is made aware of the morphological class of the next token to be generated in advance, allowing it to better choose the next token to generate based on this information. The morphological class of the next token, \(class(y_t)\), is passed to a class embedding, \(\mathbf {E}_c\), which generates its class embedding and combines it with the hidden state, \(h_t\) by passing through a linear layer. The hidden state \(h_t\) is computed in a manner similar to the forward decoder.
The probability distribution of the generated target word \(y_t\), \(p(y_t|class(y_{\le t}),y_{\lt t}, X)\), is computed as:
\begin{equation} p(y_t|class(y_{\le t}),y_{\lt t},X) = softmax(\mathbf {W}_oh_t^{(M)}+b) \end{equation}
(10)
Figure 4 depicts the morphological decoder (top) and forward decoder (bottom) with Reliable Knowledge Distillation. In our experiments, the morphological classes are stem and affix.
Fig. 4. Morphological decoder (Top) and Forward decoder (Bottom) with Reliable Knowledge Distillation. During the evaluation, the morphological decoder part is discarded.
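The fusion steps described above can be sketched as follows. The exact dimensions and the use of concatenation followed by a linear layer are our assumptions about the unspecified details, and the decoder's attention stack itself is omitted.

```python
import torch
import torch.nn as nn

d_model, vocab_size, n_classes = 256, 8000, 2    # classes: stem vs. affix (Section 5.1)
word_embed = nn.Embedding(vocab_size, d_model)   # E
class_embed = nn.Embedding(n_classes, d_model)   # E_c
fuse_in = nn.Linear(2 * d_model, d_model)        # combines E(y_i) with E_c(class(y_i))
fuse_out = nn.Linear(2 * d_model, d_model)       # combines h_t with E_c(class(y_t))
out_proj = nn.Linear(d_model, vocab_size)

def morph_decoder_step(prev_tokens, prev_classes, next_class, h_t):
    """Build inputs from preceding tokens/classes and fuse the next token's class into h_t."""
    dec_inputs = fuse_in(torch.cat([word_embed(prev_tokens),
                                    class_embed(prev_classes)], dim=-1))
    # dec_inputs would feed the decoder's self-attention; here we only show the fusion.
    h_m = fuse_out(torch.cat([h_t, class_embed(next_class)], dim=-1))
    return dec_inputs, out_proj(h_m).softmax(-1)  # p(y_t | class(y_<=t), y_<t, X)

prev = torch.randint(0, vocab_size, (2, 4))
prev_cls = torch.randint(0, n_classes, (2, 4))
nxt_cls = torch.randint(0, n_classes, (2,))
h_t = torch.randn(2, d_model)
inputs, probs = morph_decoder_step(prev, prev_cls, nxt_cls, h_t)
print(inputs.shape, probs.shape)
```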

4.2 Reliable Knowledge Distillation

In this paper, we use knowledge distillation [16] to regularize the forward decoder in our machine translation model. The forward decoder has to match the probability distributions predicted by the backward decoder and the morphological decoder. The knowledge distillation losses with the backward decoder and the morphological decoder are respectively expressed as:
\begin{align} \mathcal {L}(\overrightarrow{Y},\overleftarrow{Y}) = - \sum _{t=1}^{n}&\beta _t p(y_t|y_{\gt t},X) . log(p(y_t|y_{\lt t},X)) \end{align}
(11)
\begin{align} \mathcal {L}(\overrightarrow{Y},Y^{(M)}) = - &\sum _{t=1}^{n}\gamma _t p(y_t|class(y_{\le t}),y_{\lt t},X) . log(p(y_t|y_{\lt t},X)) \end{align}
(12)
To enhance the distillation effect, we dynamically weight the knowledge distillation losses, which allows us to distill only reliable knowledge. We use the term “reliability” to indicate the level of confidence that we have in the prediction made by the model on a single example, as used in Kukar and Kononenko [20] and Nicora et al. [25]. Essentially, when the prediction error of the backward decoder or morphological decoder is high, the forward decoder learns exclusively from the ground truth. The weighting factors \(\beta _t\) and \(\gamma _t\) are obtained from the inverse of the cross-entropy between the ground truth and the predicted probability distribution as follows:
\begin{equation} \beta _t = 1 - \frac{H(p^G(y_t),p(y_t|y_{\gt t},X))}{\mathbb {K}} \end{equation}
(13)
\begin{equation} \gamma _t = 1 - \frac{H(p^G(y_t),p(y_t|class(y_{\le t}),y_{\lt t},X))}{\mathbb {K}} \end{equation}
(14)
where \(p^G(y_t)\) is the ground truth distribution and \(\mathbb {K}\) is a batchwise min-max normalization term.
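A minimal PyTorch reading of Equations (11)-(14) is sketched below. Treating \(\mathbb{K}\) as a batchwise min-max normalizer that maps the per-token cross-entropies into [0, 1] is our interpretation, and the teacher distribution is detached so the auxiliary decoder receives no gradient from the distillation term.

```python
import torch
import torch.nn.functional as F

def reliability_weights(aux_logits, gold):
    """beta_t = 1 - H(p_gold, p_aux) / K, with K a batchwise min-max normalizer."""
    ce = F.cross_entropy(aux_logits.transpose(1, 2), gold, reduction="none")  # (B, T)
    norm = (ce - ce.min()) / (ce.max() - ce.min() + 1e-8)                     # in [0, 1]
    return 1.0 - norm

def weighted_kd_loss(fwd_logits, aux_logits, weights):
    """Distill the auxiliary distribution into the forward decoder, token-weighted."""
    teacher = aux_logits.softmax(-1).detach()       # no gradient into the teacher
    log_student = fwd_logits.log_softmax(-1)
    kd = -(teacher * log_student).sum(-1)           # cross-entropy per token
    return (weights * kd).mean()

B, T, V = 2, 5, 8000
fwd, aux = torch.randn(B, T, V, requires_grad=True), torch.randn(B, T, V)
gold = torch.randint(0, V, (B, T))
beta = reliability_weights(aux, gold)
print(weighted_kd_loss(fwd, aux, beta))
```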

4.3 Training Objective

In our approach, we train all three decoders using cross-entropy loss. The cross-entropy losses for the forward decoder, backward decoder, and morphological decoder are calculated as follows:
\begin{equation} \mathcal {L}_1=-\sum _{t=1}^{n}log(p(y_t|y_{\lt t},x)) \end{equation}
(15)
\begin{equation} \mathcal {L}_2 = -\sum _{t=1}^{n}log(p(y_t|y_{\gt t},x)) \end{equation}
(16)
\begin{equation} \mathcal {L}_3 = -\sum _{t=1}^{n}log(p(y_t|class(y_{\le t}),y_{\lt t},x)) \end{equation}
(17)
where n is the number of tokens in each example.
We combine these cross-entropy losses with the knowledge distillation losses to obtain the final training loss as follows:
\begin{equation} \mathcal {L}= ~\mathcal {L}_1 + \mathcal {L}_2 + \mathcal {L}_3 + \mathcal {L}(\overrightarrow{Y},\overleftarrow{Y}) + \mathcal {L}(\overrightarrow{Y},Y^{(M)}) \end{equation}
(18)
We jointly train and update the parameters of all decoders. However, the knowledge distillation losses are not used to update the parameters of the backward decoder and the morphological decoder; they are backpropagated only to update the parameters of the forward decoder.
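A compact sketch of the combined objective in Equation (18) follows. Detaching the teacher distributions inside the distillation terms is one straightforward way to ensure those losses update only the forward decoder, as described above; the per-token weights beta and gamma correspond to those of the previous sketch.

```python
import torch
import torch.nn.functional as F

def kd_term(student_logits, teacher_logits, weights):
    """Token-weighted distillation loss; the teacher is detached (stop-gradient)."""
    teacher = teacher_logits.softmax(-1).detach()
    return (weights * -(teacher * student_logits.log_softmax(-1)).sum(-1)).mean()

def total_loss(fwd_logits, bwd_logits, morph_logits, gold, beta, gamma):
    def ce(logits):                                    # Equations (15)-(17)
        return F.cross_entropy(logits.transpose(1, 2), gold)
    kd_b = kd_term(fwd_logits, bwd_logits, beta)       # Equation (11)
    kd_m = kd_term(fwd_logits, morph_logits, gamma)    # Equation (12)
    return ce(fwd_logits) + ce(bwd_logits) + ce(morph_logits) + kd_b + kd_m  # Equation (18)

B, T, V = 2, 5, 100
gold = torch.randint(0, V, (B, T))
fwd, bwd, morph = (torch.randn(B, T, V, requires_grad=True) for _ in range(3))
loss = total_loss(fwd, bwd, morph, gold, torch.rand(B, T), torch.rand(B, T))
loss.backward()
print(loss.item())
```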

5 Experimental Setup

5.1 Dataset

We evaluate our method on the following three datasets.
WAT2021 English\(\rightarrow\)Tamil (140K pairs). Parallel data from PMIndia [15] and the PIB dataset [37]; we evaluate on the WAT2021 validation and test sets.
English\(\rightarrow\)Manipuri (120K pairs). We use a training set of 120K parallel sentences from PMIndia [15, 35] and the PIB dataset. The validation and test sets consist of approximately 1K sentence pairs each, sampled from the corpus.
WAT2021 English\(\rightarrow\)Marathi (132K pairs). Parallel data from PMIndia and the PIB dataset; we evaluate on the WAT2021 validation and test sets.
To preprocess the data, we segment English words into subword units using Byte-Pair Encoding (BPE) [32] with 16,000 merge operations. Manipuri, Tamil, and Marathi words are split into morphemes using the Morfessor FlatCat tool [14]. The morphological class used in the morphological decoder consists of binary labels, “stem” vs. “affix”, for each subword token. We add a special marker “$$” before suffixes and “+” between other morphemes of a word to ensure the reversibility of the morphological splitting. Certain words, such as compound and reduplicative words, contain multiple stems, which are indicated by the “+” symbol.
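To illustrate how the markers keep the segmentation reversible, here is a small sketch that rejoins morpheme tokens into surface words. Whether the markers are attached to the morphemes or emitted as standalone tokens is not specified above, so the attached-marker convention used here is an assumption.

```python
def desegment(tokens):
    """Rejoin marked morpheme tokens into surface words (illustrative only)."""
    words = []
    for tok in tokens:
        if words and (tok.startswith("$$") or tok.startswith("+")):
            words[-1] += tok.lstrip("$+")   # suffix or additional stem: attach to current word
        else:
            words.append(tok)               # unmarked token starts a new word
    return " ".join(words)

# Manipuri example from Table 1 ("wanted to carry in")
print(desegment(["pu", "$$sin", "$$ning", "$$khi"]))   # -> pusinningkhi
```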

5.2 Settings

We conduct experiments on the following systems.
Transformer Baseline. A standard transformer model that is trained on the given dataset without incorporating any additional decoders.
Knowledge Distillation with Backward Decoder (KD with BD). A variation of our model that includes a backward decoder, which utilizes future information.
Reliable Knowledge Distillation with Backward Decoder (RKD with BD). A variation of KD with BD that distills only reliable knowledge.
Knowledge Distillation with Morphological Decoder (KD with MD). Another variation that includes a morphological decoder, which incorporates a linguistic feature on the target side.
Reliable Knowledge Distillation with Morphological Decoder (RKD with MD). A variation of KD with MD that distills only reliable knowledge.
ABDNMT. An NMT model based on the Transformer architecture, following the method proposed by Zhang et al. [50].
Twin Networks. The method proposed by Serdyuk et al. [33], which incorporates an L2 loss term.
SEER Forcing. An NMT model that utilizes the Seer forcing technique with knowledge distillation weight set to 0.5, following the method proposed by Feng et al. [9].
Factored Translation. The class information (stem or affix) of each token is obtained, and both words and classes are generated, following Bandyopadhyay [3] and Tamchyna et al. [40].
In this study, we conduct experiments using the Transformer [41] model, implementing our models by adapting the open-source toolkit Fairseq-py [27]. There are three input and output layers with an embedding dimension of 256, the inner feedforward layer dimension is 512, and the number of heads in the multi-head attention modules in both the encoder and decoder layers is 4. The training batches consist of sets of 4,096 source and target tokens. The models are trained and evaluated on two Tesla P100 GPUs. The test set is evaluated using a single model obtained by taking the best checkpoint, validated on the development set at each epoch. Translation performance is evaluated with the BLEU metric [28] computed using SacreBLEU [29].
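For reference, corpus-level BLEU with SacreBLEU [29] can be computed through its Python API as below. The toy hypothesis and reference strings are placeholders, and detokenized text is assumed.

```python
import sacrebleu

# Hypotheses produced by the model and a single reference stream (toy strings).
hyps = ["he carried it in", "the meeting was held yesterday"]
refs = [["he carried it inside", "the meeting took place yesterday"]]
bleu = sacrebleu.corpus_bleu(hyps, refs)
print(f"BLEU = {bleu.score:.2f}")
```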

6 Result and Analysis

6.1 Main Result

Table 3 presents the results of our models and related works.
Table 3.
Group | Models | En\(\rightarrow\)Mni | En\(\rightarrow\)Ta | En\(\rightarrow\)Mr | Avg
Baseline | BPE | 13.61 | 8.31 | 13.02 | 11.64
Baseline | Morfessor | 14.32 | 9.22 | 13.54 | 12.36
Our Implementations | KD with BD | 15.13 | 10.18 | 14.02 | 13.11
Our Implementations | RKD with BD | 15.76 | 10.65 | 14.46 | 13.62
Our Implementations | KD with MD | 14.79 | 9.61 | 13.92 | 12.77
Our Implementations | RKD with MD | 15.21 | 9.86 | 14.12 | 13.06
Our Implementations | RKD with BD & MD | 16.02 | 10.94 | 14.77 | 13.91
Related Works | Twin Networks [33] | 15.02 | 10.01 | 13.85 | 12.96
Related Works | ABDNMT [50] | 14.82 | 9.75 | 14.01 | 12.86
Related Works | Seer Forcing [9] | 15.20 | 10.22 | 14.05 | 13.15
Related Works | Factored Translation | 14.53 | 9.38 | 13.59 | 12.50
Table 3. BLEU Scores on En\(\rightarrow\)Mni, En\(\rightarrow\)Ta, and En\(\rightarrow\)Mr Test Sets
We compare the Transformer (“Baseline”, [41]); conventional or reliable knowledge distillation of the forward decoder with the backward decoder (“KD/RKD with BD”); conventional or reliable knowledge distillation of the forward decoder with the morphological decoder (“KD/RKD with MD”); and related works.
Morfessor models outperform BPE models. The baseline model, a standard Transformer trained without additional decoders, achieves moderate BLEU scores. Using BPE as the subword unit, the baseline obtains a BLEU score of 13.61 for English-to-Manipuri, 8.31 for English-to-Tamil, and 13.02 for English-to-Marathi. Using Morfessor for segmentation improves the performance to 14.32, 9.22, and 13.54, respectively. Therefore, we use Morfessor for segmentation in our proposed models.
Our models with additional decoders outperform baselines with only forward decoder. To assess the impact of different decoders and knowledge distillation approaches, we compare our model to various variations. The KD/RKD models with a backward decoder (BD) surpass the baseline in all three translation tasks. For English-to-Manipuri with Morfessor, the KD model achieves a BLEU score of 15.13, while the RKD model achieves a higher score of 15.76. Similarly, for English-to-Tamil with Morfessor, the KD model attains a score of 10.18, whereas the RKD model outperforms it with a score of 10.65. Similarly, for English-to-Marathi with Morfessor, the KD model attains a score of 14.02, whereas the RKD model outperforms it with a score of 14.46. Additionally, incorporating a morphological decoder (MD) enhances the performance of the baseline. The KD model with MD achieves BLEU scores of 14.79, 9.61 and 13.92 respectively, while the RKD model with MD achieves scores of 15.21, 9.86 and 14.12 respectively. Combining both backward decoder (BD) and morphological decoder (MD) further improves the score.
Our proposed models are competitive with related works. Comparing our proposed KD/RKD models with related works, we observe competitive performance. The Twin Networks approach, incorporating future information, achieves BLEU scores of 15.02 for English-to-Manipuri, 10.01 for English-to-Tamil and 13.85 for English-to-Marathi. ABDNMT, another model incorporating future information, achieves scores of 14.82, 9.75, and 14.01 for the respective tasks. SEER Forcing, employing the Seer forcing technique, achieves scores of 15.20, 10.22 and 14.05. Factored Translation, which augments class information, attains scores of 14.61, 9.43, and 13.52 for English-to-Manipuri and English-to-Tamil and English-to-Marathi, respectively.

6.2 Contribution of Morphological Decoder

As the morphological decoder has knowledge of the morphological class information of the next word and the preceding words, we expect the forward decoder to better predict the next word by distilling knowledge from the morphological decoder. To gain insight into whether the empirical usefulness indeed comes from the morphological decoder, we perform two ablation tests. For “Gaussian Noise,” the additional decoder’s probability distribution is randomly sampled from a Gaussian distribution, so the forward decoder is trained to match white noise. For “AR,” the additional decoder’s probability distribution is set to zero, inspired by Merity et al. [23]. The results in Table 4 show that the information included in the morphological states is indeed useful for obtaining a significant improvement.
Table 4.
Model | BLEU (En\(\rightarrow\)Mni) | BLEU (En\(\rightarrow\)Ta)
Baseline + Gaussian | 14.03 | 8.90
Baseline + AR | 14.34 | 9.44
Baseline + MD | 14.79 | 9.73
Table 4. Comparison of the Forward Decoder Guided by the Morphological Decoder (“MD”) Against “Gaussian Noise,” Where the Additional Decoder’s Probability Distribution is Randomly Sampled from a Gaussian Distribution, and “AR,” Where It is Set to Zero, Inspired by [23]
In our approach, we have integrated the classes of the previous tokens \(class(y_{\lt t})\) and the upcoming token \(class(y_{t})\) into the morphological decoder. After conducting experiments, we found that incorporating the class information for the next token \(class(y_{t})\) yields a modest improvement in performance compared to solely including the class information for the preceding tokens \(class(y_{\lt t})\), as demonstrated in Table 5. This implies that accounting for the class of the upcoming token provides more valuable contextual information for the model to generate the next token.
Table 5.
Class Information | BLEU (En\(\rightarrow\)Mni) | BLEU (En\(\rightarrow\)Ta)
\(class(y_{\lt t})\) | 14.62 | 9.47
\(class(y_{\le t})\) | 14.79 | 9.61
Table 5. Performance in KD with MD with Preceding Word Class Information Alone and with Both Preceding and Next Word Class Information

6.3 Contribution and Superiority of Backward Decoder

One of the key contributions of our approach is the use of a backward decoder, which possesses knowledge of future target information. We expect this knowledge to be transferred to the forward decoder through knowledge distillation.
Our experimental results, presented in both Table 3 and Table 5, clearly demonstrate the superiority of our approach over the baseline. In particular, we observe remarkable enhancements in performance when employing knowledge distillation with the backward decoder.
Interestingly, our results also show that the backward decoder outperforms the morphological decoder, indicating that our approach is able to leverage the additional information provided by the backward decoder to improve overall performance. Taken together, our findings suggest that the use of a backward decoder and knowledge distillation can significantly enhance the performance of neural language models.

6.4 Impact of Different Approaches to Reliability Estimation

In the context of knowledge distillation with the backward decoder, we conducted an investigation into various approaches for estimating reliability, and the results are presented in Table 6. Initially, we attempted to assign a binary reliability weight, represented by \(\beta _t\), with values of either 0 or 1, to indicate reliability or unreliability discretely. However, we found that this approach did not yield satisfactory results. As a result, we explored alternative approaches, including predicting the reliability weight using the hidden state \(\overleftarrow{h_t}\), by passing it through a linear layer and a sigmoid activation function. However, this approach also did not perform well.
Table 6.
Reliability Estimation | BLEU (En\(\rightarrow\)Mni) | BLEU (En\(\rightarrow\)Ta)
Hidden State | 15.23 | 10.17
Cross-entropy + Discrete | 15.32 | 10.37
Cross-entropy + Continuous | 15.76 | 10.65
Table 6. Comparison of Reliability Estimated using Only the Hidden State of the Backward Decoder (“Hidden State”) and using the Cross-entropy between the Ground Truth and the Predicted Probability Distribution of the Backward Decoder in Discrete and Continuous Forms (“Cross-entropy + Discrete”, “Cross-entropy + Continuous”)
Finally, we computed a continuous weight as the inverse of the cross-entropy between the ground truth and the predicted probability distribution. This approach yielded the best results, outperforming the previous two: the continuous weight allows a more nuanced estimation of reliability than a binary or discrete classification. We therefore adopt it for reliability estimation in knowledge distillation with the backward decoder.

6.5 Effect of Target Sentence Length

As part of our evaluation, we analyzed the impact of target sentence length, measured in number of words, on our proposed method using the En\(\rightarrow\)Mni dataset. We divided the test set into four distinct groups based on the length of the target sentences. The results of this analysis, presented in Figure 5, show that the performance of our proposed method depends on the length of the target sentences.
Fig. 5. BLEU scores over different sentence lengths.
In particular, we observed that our method was less effective when the length of the target sentence was between 0 and 10 words, with only minimal differences in BLEU scores compared to the baseline model. However, for sentences longer than 10 words, our proposed method showed a significant improvement in BLEU scores compared to the baseline Transformer model. This improvement was even more pronounced as the length of the target translations increased, indicating that our method is particularly effective for longer sentences.
Overall, this analysis shows that our proposed method is a promising solution for improving the quality of translations, especially for longer sentences.
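The bucketed evaluation described above can be reproduced with a few lines of Python. The exact bucket boundaries beyond the 0-10 range and the use of SacreBLEU per bucket are assumptions for illustration.

```python
import sacrebleu

def bleu_by_length(hyps, refs, edges=(10, 20, 30)):
    """Group test pairs by reference length (in words) and score each group separately."""
    buckets = {}
    for hyp, ref in zip(hyps, refs):
        n = len(ref.split())
        label = next((f"<={e}" for e in edges if n <= e), f">{edges[-1]}")
        h, r = buckets.setdefault(label, ([], []))
        h.append(hyp)
        r.append(ref)
    return {label: sacrebleu.corpus_bleu(h, [r]).score for label, (h, r) in buckets.items()}

hyps = ["he carried it in", "the long sentence " * 6]
refs = ["he carried it inside", "the long sentence " * 6]
print(bleu_by_length(hyps, refs))
```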

6.6 Analysis of Time Consumption and Parameter Size

In this section, we provide an analysis of the training and decoding times, as well as the parameter sizes, of the En\(\rightarrow\)Mni translation models. We present the results in Table 7.
Table 7.
Models | #Time1 | #Time2 | #Param
Baseline | 1.0 | 1.0 | 13.6M
RKD with BD | 1.7 | 1.0 | 16.8M
RKD with MD | 1.7 | 1.0 | 16.8M
Table 7. Training Time, Test Time, and Size of the Model Parameters in En\(\rightarrow\)Mni Translation Models
“Time1” denotes the training time (in ratio), “Time2” denotes the decoding time (in ratio), and “Param” denotes the size of model parameters (M for million).
The table compares three En\(\rightarrow\)Mni translation models, namely Baseline, RKD with BD, and RKD with MD. The models are evaluated based on their training and decoding times, as well as the size of their model parameters.
In terms of training and decoding times, the Baseline model has the lowest values, with a ratio of 1.0 for both training and decoding times. The RKD models, on the other hand, have a training time ratio of 1.7 and a decoding time ratio of 1.0.
When it comes to the size of the model parameters, the baseline Transformer model has fewer parameters (13.6 million) than the RKD models (16.8 million).
Based on these findings, the RKD models require more training time than the Baseline model and also involve extra parameters. However, it is noteworthy that the backward decoder and the morphological decoder are used only at training time and are discarded at inference time. This makes the inference time of the proposed RKD models match that of the baseline Transformer model.

6.7 Scalability Discussion

While this paper adopts a theoretical focus by using smaller, domain-specific datasets to train neural machine translation (NMT) models, it is also essential to consider the practical implications of scaling the approach to high-resource settings. To this end, we extend the English-Tamil dataset by incorporating additional publicly available corpora, including JW300, NLP Contributions (nlpc), OpenSubtitles, TED2020, and WikiMatrix, as provided in Nakazawa et al. [24]. The aggregated dataset contains a total of 1.35 million parallel sentences, providing a much larger resource for robust model training.
In this expanded setting, the model architecture consists of six input and output layers with an embedding dimension of 1024. The inner feedforward layers have a dimensionality of 4096, and the multi-head attention modules use 16 heads in both the encoder and decoder layers. Training is performed using mini-batches of 8,000 source and target tokens, which helps maintain stable gradient updates and efficient use of computational resources. Since the WAT2021 test set is in the PMI domain, after training on the expanded dataset we fine-tune on the PMI dataset.
The results in Table 8 highlight the impact of different segmentation strategies on translation quality. While the baseline model with Byte Pair Encoding (BPE) achieves a BLEU score of 10.21, replacing BPE with linguistically motivated Morfessor segmentation improves performance, yielding a BLEU score of 11.54. This improvement can be attributed to Morfessor’s ability to capture meaningful morphological patterns, especially useful for agglutinative languages like Tamil. Furthermore, our proposed Reliable Knowledge Distillation (RKD) approach, which integrates a backward decoder and a morphological decoder, achieves a BLEU score of 13.04. This demonstrates the effectiveness of our method for morphologically rich agglutinative languages, outperforming both BPE and Morfessor-based baselines.
Table 8.
Model | BLEU (En\(\rightarrow\)Ta)
Baseline BPE | 10.21
Baseline Morfessor | 11.54
RKD with BD & MD | 13.04
Table 8. Comparison of the Baseline Transformer Models with BPE Segmentation and Morfessor Segmentation, along with Our Model, Reliable Knowledge Distillation with Backward Decoder and Morphological Decoder (RKD with BD & MD), in the High-Resource Setting

7 Conclusion

In this paper, we proposed a novel knowledge distillation approach tailored to enhance neural machine translation for agglutinative languages, which pose unique challenges due to their morphologically complex words (MCWs). Our method leverages an auxiliary backward decoder to capture future context, and a morphological decoder to integrate target-side morphological information like stems and affixes. Through knowledge distillation, the predictions from these decoders are selectively distilled to the main forward autoregressive decoder based on a reliability estimation. To address the challenge of unreliable guidance from auxiliary decoders, we introduced a reliability estimation mechanism that ensures only reliable predictions are distilled to the forward decoder.
Our experiments on English-Manipuri, English-Tamil, and English-Marathi datasets confirm that the proposed method outperforms strong Transformer-based NMT baselines, achieving better accuracy in generating MCWs. Furthermore, we demonstrated that using unsupervised morphology-based word segmentation yields superior results compared to the widely adopted Byte Pair Encoding (BPE) method for highly agglutinative languages.
In the future, we will explore directions including integrating richer linguistic knowledge such as syntax and semantics, and evaluating other language families and domains such as speech translation. With the increasing popularity of large pretrained models, integrating our approach with them could be another direction.

Footnote

1. The term “word” is used interchangeably for any type of “token”.

References

[1]
Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio. 2016. An actor-critic algorithm for sequence prediction. CoRR abs/1607.07086 (2016). arXiv:1607.07086. http://arxiv.org/abs/1607.07086
[2]
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. http://arxiv.org/abs/1409.0473
[3]
Saptarashmi Bandyopadhyay. 2019. Factored neural machine translation at LoResMT 2019. In Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages. European Association for Machine Translation, Dublin, Ireland, 68–71. https://aclanthology.org/W19-6811
[4]
Tamali Banerjee and Pushpak Bhattacharyya. 2018. Meaningless yet meaningful: Morphology grounded subword-level NMT. In Proceedings of the Second Workshop on Subword/Character Level Models, Manaal Faruqui, Hinrich Schütze, Isabel Trancoso, Yulia Tsvetkov, and Yadollah Yaghoobzadeh (Eds.). Association for Computational Linguistics, New Orleans, LA, USA, 55–60.
[5]
Emanuele Bugliarello and Naoaki Okazaki. 2020. Enhancing machine translation with dependency-aware self-attention. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 1618–1627.
[6]
Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Mike Schuster, Noam Shazeer, Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. 2018. The best of both worlds: Combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, 76–86.
[7]
Mathias Creutz and Krista Lagus. 2002. Unsupervised discovery of morphemes. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning. Association for Computational Linguistics, 21–30.
[8]
Anna Currey and Kenneth Heafield. 2018. Multi-source syntactic neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, 2961–2966.
[9]
Yang Feng, Shuhao Gu, Dengji Guo, Zhengxin Yang, and Chenze Shao. 2021. Guiding teacher forcing with seer forcing for neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 2862–2872.
[10]
Lyn Frazier and Keith Rayner. 1982. Making and correcting errors during sentence comprehension: Eye movements in the analysis of structurally ambiguous sentences. Cognitive Psychology 14 (1982), 178–210.
[11]
Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. 1243–1252.
[12]
Xinwei Geng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2018. Adaptive multi-pass decoder for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, 523–532.
[13]
Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc’Aurelio Ranzato, Francisco Guzmán, and Angela Fan. 2022. The Flores-101 evaluation benchmark for low-resource and multilingual machine translation. Transactions of the Association for Computational Linguistics 10 (2022), 522–538.
[14]
Stig-Arne Grönroos, Sami Virpioja, Peter Smit, and Mikko Kurimo. 2014. Morfessor FlatCat: An HMM-based method for unsupervised and semi-supervised learning of morphology. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. Dublin City University and Association for Computational Linguistics, Dublin, Ireland, 1177–1185. https://aclanthology.org/C14-1111
[15]
Barry Haddow and Faheem Kirefu. 2020. PMIndia - A collection of parallel corpora of languages of India. ArXiv abs/2001.09907 (2020).
[16]
Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. ArXiv abs/1503.02531 (2015).
[17]
Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5 (2017), 339–351.
[18]
Kimmo Kettunen. 2014. Can type-token ratio be used to show morphological complexity of languages? Journal of Quantitative Linguistics 21, 3 (2014), 223–245.
[19]
Philipp Koehn and Hieu Hoang. 2007. Factored translation models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). Association for Computational Linguistics, Prague, Czech Republic, 868–876. https://aclanthology.org/D07-1091
[20]
Matjaž Kukar and Igor Kononenko. 2002. Reliable classifications with machine learning. In Machine Learning: ECML 2002: 13th European Conference on Machine Learning Helsinki, Finland, August 19–23, 2002 Proceedings 13. Springer, 219–231.
[21]
Haichao Li, Minh-Thang Luong, and Christopher D. Manning. 2017. Deep reinforcement learning for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 743–752.
[22]
Junhui Li, Deyi Xiong, Zhaopeng Tu, Muhua Zhu, Min Zhang, and Guodong Zhou. 2017. Modeling source syntax for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vancouver, Canada, 688–697.
[23]
Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and optimizing LSTM language models. ArXiv abs/1708.02182 (2017).
[24]
Toshiaki Nakazawa, Hideki Nakayama, Chenchen Ding, Raj Dabre, Shohei Higashiyama, Hideya Mino, Isao Goto, Win Pa Pa, Anoop Kunchukuttan, Shantipriya Parida, Ondřej Bojar, Chenhui Chu, Akiko Eriguchi, Kaori Abe, Yusuke Oda, and Sadao Kurohashi. 2021. Overview of the 8th workshop on Asian translation. In Proceedings of the 8th Workshop on Asian Translation (WAT2021), Toshiaki Nakazawa, Hideki Nakayama, Isao Goto, Hideya Mino, Chenchen Ding, Raj Dabre, Anoop Kunchukuttan, Shohei Higashiyama, Hiroshi Manabe, Win Pa Pa, Shantipriya Parida, Ondřej Bojar, Chenhui Chu, Akiko Eriguchi, Kaori Abe, Yusuke Oda, Katsuhito Sudoh, Sadao Kurohashi, and Pushpak Bhattacharyya (Eds.). Association for Computational Linguistics, Online, 1–45.
[25]
Giovanna Nicora, Miguel Rios, Ameen Abu-Hanna, and Riccardo Bellazzi. 2022. Evaluating pointwise reliability of machine learning prediction. Journal of Biomedical Informatics 127 (2022), 103996.
[26]
Antoine Nzeyimana. 2024. Low-resource neural machine translation with morphological modeling. In Findings of the Association for Computational Linguistics: NAACL 2024, Kevin Duh, Helena Gomez, and Steven Bethard (Eds.). Association for Computational Linguistics, Mexico City, Mexico, 182–195.
[27]
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations). Association for Computational Linguistics, Minneapolis, MN, USA, 48–53.
[28]
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Philadelphia, PA, USA, 311–318.
[29]
Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers. Association for Computational Linguistics, Brussels, Belgium, 186–191.
[30]
Mirco Ravanelli, Dmitriy Serdyuk, and Yoshua Bengio. 2018. Twin regularization for online speech recognition. In Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018, B. Yegnanarayana (Ed.). ISCA, 3718–3722.
[31]
Rico Sennrich and Barry Haddow. 2016. Linguistic input features improve neural machine translation. In Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers. Association for Computational Linguistics, Berlin, Germany, 83–91.
[32]
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, 1715–1725.
[33]
Dmitriy Serdyuk, Nan Rosemary Ke, Alessandro Sordoni, Adam Trischler, Christopher Joseph Pal, and Yoshua Bengio. 2017. Twin networks: Matching the future for sequence generation. arXiv: Learning (2017).
[34]
Jie Shao, Xiaodong Zhang, Lidong Li, and Ming Liu. 2019. Dynamic reinforcement learning for neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2674–2683.
[35]
Telem Joyson Singh, Sanasam Ranbir Singh, and Priyankoo Sarmah. 2021. English-Manipuri machine translation: An empirical study of different supervised and unsupervised methods. 2021 International Conference on Asian Language Processing (IALP) (2021), 142–147.
[36]
Telem Joyson Singh, Sanasam Ranbir Singh, and Priyankoo Sarmah. 2023. Subwords to word back composition for morphologically rich languages in neural machine translation. In Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation. Association for Computational Linguistics, Hong Kong.
[37]
Shashank Siripragada, Jerin Philip, Vinay P. Namboodiri, and C. V. Jawahar. 2020. A multilingual parallel corpora collection effort for Indian languages. (May 2020), 3743–3751. https://aclanthology.org/2020.lrec-1.462
[38]
Pelle Söderström, Merle Horne, and Mikael Roll. 2016. Stem tones pre-activate suffixes in the brain. Journal of Psycholinguistic Research 46 (2016), 271–280.
[39]
Kai Song, Yue Zhang, Min Zhang, and Weihua Luo. 2018. Improved English to Russian translation by neural suffix prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
[40]
Aleš Tamchyna, Marion Weller-Di Marco, and Alexander Fraser. 2017. Modeling target-side inflection in neural machine translation. In Proceedings of the Second Conference on Machine Translation. Association for Computational Linguistics, Copenhagen, Denmark, 32–42.
[41]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
[42]
Marion Weller-Di Marco and Alexander Fraser. 2020. Modeling word formation in English–German neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online, 4227–4232.
[43]
Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 3-4 (1992), 229–256.
[44]
Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin, Tao Qin, Nenghai Yu, and Tie-Yan Liu. 2017. Deliberation networks: Sequence generation beyond one-pass decoding. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2017/file/c6036a69be21cb660499b75718a3ef24-Paper.pdf
[45]
Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. 2018. Improving neural machine translation with conditional sequence generative adversarial nets. (June 2018), 1346–1355.
[46]
Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. Proceedings of the AAAI Conference on Artificial Intelligence 31, 1 (Feb. 2017).
[47]
Biao Zhang, Deyi Xiong, Jinsong Su, and Jiebo Luo. 2019. Future-aware knowledge distillation for neural machine translation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27, 12 (2019), 2278–2287.
[48]
Jiajun Zhang, Long Zhou, Yang Zhao, and Chengqing Zong. 2020. Synchronous bidirectional inference for neural sequence generation. Artificial Intelligence 281 (2020), 103234.
[49]
Meishan Zhang, Zhenghua Li, Guohong Fu, and Min Zhang. 2019. Syntax-enhanced neural machine translation with syntax-aware word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, MN, USA, 1151–1161.
[50]
Xiangwen Zhang, Jinsong Su, Yue Qin, Yang Liu, R. Ji, and Hongji Wang. 2018. Asynchronous bidirectional decoding for neural machine translation. ArXiv abs/1801.05122 (2018).
[51]
Zhirui Zhang, Shuangzhi Wu, Shujie Liu, Mu Li, Ming Zhou, and Enhong Chen. 2018. Regularizing neural machine translation by target-bidirectional agreement. In AAAI Conference on Artificial Intelligence.
[52]
Zaixiang Zheng, Shujian Huang, Zhaopeng Tu, Xin-Yu Dai, and Jiajun Chen. 2019. Dynamic past and future for neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 931–941.
[53]
Zaixiang Zheng, Hao Zhou, Shujian Huang, Lili Mou, Xinyu Dai, Jiajun Chen, and Zhaopeng Tu. 2018. Modeling past and future for neural machine translation. Transactions of the Association for Computational Linguistics 6 (2018), 145–157.
[54]
Long Zhou, Jiajun Zhang, and Chengqing Zong. 2019. Synchronous bidirectional neural machine translation. Transactions of the Association for Computational Linguistics 7 (2019), 91–105.
