Comprehension of sentences in human cognition is influenced by the preceding and succeeding discourse [10]. Native speakers’ brains subconsciously anticipate a suffix when encountering a specific stem, reflecting the influence of priming [38]. Our proposed framework simulates these processes through multi-decoder training with reliable knowledge distillation, generating the target sentence by leveraging relevant full-context information and morphological information.
4.1 Model
Our model extends the dominant encoder-decoder architecture, the Transformer [41]. It employs a multi-decoder approach in which separate decoders generate the target sentence in different ways.
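As a concrete illustration, the following is a minimal PyTorch-style sketch of this layout, with one shared encoder and three decoders. The class name MultiDecoderNMT, the use of the torch.nn Transformer modules, the shared output projection, and all hyper-parameter values are our own assumptions for illustration, not the implementation used in this work.

```python
import torch
import torch.nn as nn

class MultiDecoderNMT(nn.Module):
    # Illustrative sketch (assumed structure, not the implementation used here):
    # one shared Transformer encoder and three decoders that read the target
    # left-to-right, right-to-left, and with morphological class information.
    # Positional encodings and loss computation are omitted for brevity.
    def __init__(self, src_vocab, tgt_vocab, n_classes=2, d_model=512, nhead=8, layers=6):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.cls_embed = nn.Embedding(n_classes, d_model)         # E_c: e.g. stem / affix
        self.fuse = nn.Linear(2 * d_model, d_model)               # combine E(y) with E_c(class(y))
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)   # C = SAN(X)
        self.fwd_dec = nn.TransformerDecoder(dec_layer, layers)   # forward decoder
        self.bwd_dec = nn.TransformerDecoder(dec_layer, layers)   # backward decoder (reversed target)
        self.mor_dec = nn.TransformerDecoder(dec_layer, layers)   # morphological decoder
        self.out = nn.Linear(d_model, tgt_vocab)                  # shared vocabulary projection (assumed)

    def forward(self, src, tgt_fwd, tgt_bwd, tgt_cls):
        # src: source token ids; tgt_fwd / tgt_bwd: target ids in original / reversed
        # order (padded to the same length); tgt_cls: one class id per target token.
        memory = self.encoder(self.src_embed(src))                # context vectors C
        mask = nn.Transformer.generate_square_subsequent_mask(tgt_fwd.size(1)).to(src.device)
        fwd = self.fwd_dec(self.tgt_embed(tgt_fwd), memory, tgt_mask=mask)
        bwd = self.bwd_dec(self.tgt_embed(tgt_bwd), memory, tgt_mask=mask)
        mor_in = self.fuse(torch.cat([self.tgt_embed(tgt_fwd),
                                      self.cls_embed(tgt_cls)], dim=-1))
        mor = self.mor_dec(mor_in, memory, tgt_mask=mask)
        return self.out(fwd), self.out(bwd), self.out(mor)        # logits from each decoder
```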
Encoder and Forward Decoder. In the forward decoder, the target sentence \(Y=\{y_1,\ldots ,y_n\}\) is generated by maximizing the conditional probability of each token \(y_t\) given the source sentence \(X\) and the preceding words \(y_{\lt t}\):
\[ P(Y \mid X) = \prod_{t=1}^{n} p(y_t \mid y_{\lt t}, X). \]
The input sequence \(X\) is first fed into the encoder, which encodes it into \(m\) context vectors \(C = \{c_1, c_2, \ldots , c_m\}\), where \(m\) is the length of \(X\). Specifically, this is achieved through the self-attention network (SAN), which produces the context vectors \(C = \mathrm{SAN}(X)\).
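For reference, the SAN and the decoder attention modules below build on the scaled dot-product attention of the Transformer [41],
\[ \mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_k}}\right)\mathbf{V}, \]
where \(d_k\) is the dimensionality of the keys.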
The forward decoder generates the target sentence \(Y\) word by word based on the representation \(C\) from the encoder, using attention mechanisms to take the already generated target fragment into account. Given the sequence of word embeddings \(\mathbf {E}(y_1),\ldots , \mathbf {E}(y_{t-1})\) of the generated target fragment at the \(t\)-th timestep, these embeddings are transformed into a key matrix \(\mathbf {K}_{t-1}\) and a value matrix \(\mathbf {V}_{t-1}\). Another self-attention (SelfATT) module then combines the previous context vector \(\overrightarrow{h}_{t-1} \in \mathbb {R}^{d_{model}}\) with \(\mathbf {K}_{t-1}\) and \(\mathbf {V}_{t-1}\) to learn the target representation \(s_t\). The representation \(s_t\) is then fed, together with the key and value matrices \(\mathbf {K}_e\) and \(\mathbf {V}_e\) transformed from the source representation \(C\) by the encoder, into another attention (ATTc) module to compute the time-dependent context vector \(h_t\).
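Under the notation above, and assuming that the previous context vector \(\overrightarrow{h}_{t-1}\) and the target representation \(s_t\) serve as the respective queries, these two steps can be summarized as
\[ s_t = \mathrm{SelfATT}\big(\overrightarrow{h}_{t-1}, \mathbf {K}_{t-1}, \mathbf {V}_{t-1}\big), \qquad h_t = \mathrm{ATT}_c\big(s_t, \mathbf {K}_e, \mathbf {V}_e\big), \]
where both modules are assumed to be standard Transformer attention layers.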
The probability distribution \(p(y_t|y_{\lt t}, X)\) of the generated target word \(y_t\) is computed using a multi-layer perceptron (MLP) layer. The target word \(y_t\) with the maximum probability is selected as the output of the decoder at the \(t\)-th timestep.
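Assuming the MLP is applied to the time-dependent context vector \(h_t\) and followed by a softmax over the target vocabulary, this step takes the form
\[ p(y_t \mid y_{\lt t}, X) = \mathrm{softmax}\big(\mathrm{MLP}(h_t)\big). \]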
Backward Decoder. In the backward decoder, the target sentence is generated by taking into account the source sentence and the succeeding words:
\[ P(Y \mid X) = \prod_{t=1}^{n} p(y_t \mid y_{\gt t}, X). \]
To compute the probability of a target token \(y_t\), the decoder first computes a representation of the succeeding words using the self-attention mechanism. Given the sequence of word embeddings \(\mathbf {E}(y_{t+1}),\ldots , \mathbf {E}(y_{n})\) of the succeeding words at the \(t\)-th timestep, these embeddings are transformed into a key matrix \(\mathbf {K}^{\prime }_{t+1}\) and a value matrix \(\mathbf {V}^{\prime }_{t+1}\), from which the target representation \(s^{\prime }_t\) is computed with the attention mechanism. The context vector \(\overleftarrow{h}_t\) is then computed with the attention mechanism, mirroring the forward decoder, and the probability distribution \(p(y_t|y_{\gt t}, X)\) of the generated target word \(y_t\) is obtained analogously.
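Assuming the backward decoder exactly mirrors the forward decoder, with \(\overleftarrow{h}_{t+1}\) denoting the previous right-to-left context vector, these steps can be written as
\[ s^{\prime }_t = \mathrm{SelfATT}\big(\overleftarrow{h}_{t+1}, \mathbf {K}^{\prime }_{t+1}, \mathbf {V}^{\prime }_{t+1}\big), \qquad \overleftarrow{h}_t = \mathrm{ATT}_c\big(s^{\prime }_t, \mathbf {K}_e, \mathbf {V}_e\big), \qquad p(y_t \mid y_{\gt t}, X) = \mathrm{softmax}\big(\mathrm{MLP}(\overleftarrow{h}_t)\big). \]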
Figure 3 depicts the backward decoder (top) and the forward decoder (bottom) with Reliable Knowledge Distillation.
Morphological Decoder. The morphological decoder generates the target sentence by taking into account the source sentence, the preceding words, and the morphological classes of the preceding tokens and of the next token. The probability of a target token \(y_t\) is therefore conditioned on the classes \(class(y_{\le t})\) of the preceding tokens and the next token, the preceding words, and the source sentence.
In this morphological decoder, the word embeddings \(\mathbf {E}(y_1),\ldots , \mathbf {E}(y_{t-1})\) are combined with the corresponding morphological class embeddings \(\mathbf {E}_c(class(y_1)),\ldots ,\mathbf {E}_c(class(y_{t-1}))\) by passing them through a linear layer. The decoder is also informed in advance of the morphological class of the next token to be generated, allowing it to choose that token more accurately. The class of the next token, \(class(y_t)\), is mapped to its class embedding through \(\mathbf {E}_c\) and combined with the hidden state \(h_t\) through a linear layer. The hidden state \(h_t\) is computed in the same manner as in the forward decoder.
The probability distribution \(p(y_t|class(y_{\le t}),y_{\lt t}, X)\) of the generated target word \(y_t\) is then computed from the resulting representation.
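One way to write this class-aware computation, assuming concatenation followed by the linear layers (with hypothetical fusion weights \(\mathbf {W}_1\) and \(\mathbf {W}_2\)), is
\[ \tilde{\mathbf {E}}(y_i) = \mathbf {W}_1\big[\mathbf {E}(y_i); \mathbf {E}_c(class(y_i))\big], \qquad \tilde{h}_t = \mathbf {W}_2\big[h_t; \mathbf {E}_c(class(y_t))\big], \]
\[ p(y_t \mid class(y_{\le t}), y_{\lt t}, X) = \mathrm{softmax}\big(\mathrm{MLP}(\tilde{h}_t)\big). \]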
Figure 4 depicts the morphological decoder (top) and the forward decoder (bottom) with Reliable Knowledge Distillation. In our experiments, each token's morphological class is either stem or affix.