1 Introduction
Relation and event extraction are two key components of information extraction that provide useful information for many natural language processing (NLP) downstream tasks, such as question answering [3, 47], document summarization [37, 39], and knowledge base construction [20, 53].
Relation extraction (RE) aims to classify the relation type between a pair of entity mentions. Given the sentence "Chris hit Scott with a baseball," an RE system seeks to find a tuple such as (Chris, PER-SOC:Lasting-Personal, Scott). Event extraction can be divided into two sub-tasks: event detection and event argument role labelling (EARL). The former identifies event triggers (e.g., hit as an Injure-type trigger), and the latter extracts (trigger, argument role, argument) triples based on a given event trigger (e.g., the triple (hit, victim, Scott)). As a more challenging task, zero-shot cross-lingual relation and event extraction is conducted in multi-lingual situations and has achieved promising progress [1, 33, 44]. The workflow is illustrated in Figure 1, where the zero-shot setting refers to training on the source language and directly testing on another target language. Meanwhile, following the settings in Reference [1], we assume all entities and event mentions (including event triggers and event arguments) are provided, and we focus on the study of RE and EARL.
Most existing pre-trained models are trained in monolingual settings, but there is a huge gap between the features of words in different languages. If a model trained on the source language is directly applied to target-language tasks, its performance is typically unsatisfactory. To alleviate this problem, there has been a trend to utilize a universal encoder such as multilingual BERT (mBERT) [8] or XLM-R [7] to produce cross-lingual contextualized representations, so that a model learned on one language can be easily transferred to others. Moreover, in zero-shot cross-lingual scenarios, it is crucial to discover features that do not change heavily across languages, which we refer to as language-agnostic features. On the one hand, the features of the target language are not available. On the other hand, since the distributions of data during training and testing are quite different, relying heavily on language-dependent features from the source language may result in over-fitting. Research from References [1, 33, 44] demonstrates that dependency structures extracted by syntactic dependency parsing (DP) can be regarded as language-agnostic knowledge and effectively boost cross-lingual RE and EARL performance. As shown in the upper portion of Figure 2, DP seeks to find the grammatical relations between phrases. Rather than leveraging the features of the whole sentence, dependency structures can effectively reduce the distance between words, since words are skip-connected. In addition, they can mitigate the word-order difference issue [2] across diverse languages to some extent.
However, since various languages have different styles of expression and grammatical structures, sentences stating the same meaning may vary. Take the sentences in the middle of Figure 2 as an example, where an active English sentence is translated into Chinese and rendered in the passive voice. We observe that even when two sentences in different languages express the same message, their syntactic structures change. Although there is some implicit correlation between active and passive sentences, the features to be learned during training still differ. This phenomenon hinders leveraging knowledge from the source language and migrating it to other languages, especially in cross-lingual zero-shot settings, where samples from the target language are not available. This finding also indicates that merely considering syntactic structures is not enough for RE and EARL tasks. To alleviate this problem, we turn to semantic knowledge, which is rarely considered in cross-lingual tasks compared with syntactic dependency information. As shown in the lower part of Figure 2, the semantics basically remain the same; for instance, in both the Chinese and English sentences, Chris consistently plays the semantic role labelling (SRL) role ARG0, and Chris is still the Arg1 semantic dependency parsing (SDP) argument of hit. This illustrates that semantics are consistent and can serve as extra language-agnostic features.
Semantic analysis attempts to discover who did what to whom, when, and why with respect to the central meaning of a sentence, providing an in-depth parsing of sentences. There are two practical approaches to extracting semantic information: SRL [58] and SDP [14]. As shown in the middle of Figure 2, SRL tries to extract predicate–argument structures (e.g., hit as a predicate and Chris as one of its arguments). Meanwhile, SDP seeks to find the semantic factual or logical relationships between words (e.g., baseball is an argument of Chris). There are two advantages to introducing SRL and SDP. First, SRL and SDP provide more consistent parsing results across languages and can be regarded as a bridge between them. Utilizing such knowledge can effectively handle expression-difference issues (e.g., the active–passive sentence problem), which cannot be appropriately solved by syntactic analysis. Second, since SRL and event argument role labelling have similar task settings (predicate–argument and trigger–argument structures are analogous), SRL can serve as a supplemental task and provide more prior knowledge.
In this article, we propose the Syntax and Semantic Driven Network (SSDN) to exploit syntactic and semantic information simultaneously. To utilize SRL information, we map the discrete word-level parsing results to continuous representations and then integrate them into the word embeddings. For the results from SDP, not only is the category of each word critical, but the labelled edges connecting the words also reflect the semantic relations. Therefore, we introduce a semantic-aware relational graph convolutional network (Sem-RGCN) to model the tree-like semantic dependency structures and the corresponding semantic dependency relation types. In addition, a transformer-based architecture is used for encoding syntactic information, where the multi-head attentions are constrained by dependency-tree distance to control how far information spreads. A fusion module is leveraged to adaptively combine the output syntactic and semantic representations. Finally, since the soft labels and features from trained models carry useful information, a knowledge distillation mechanism is further introduced to boost performance by transferring knowledge from a well-trained teacher model to a student model at both the logit level and the feature level.
Extensive experiments are conducted to evaluate the proposed SSDN on the English, Chinese, and Arabic portions of the Automatic Content Extraction (ACE) 2005 datasets. The results show that the SSDN model achieves state-of-the-art performance. In addition, SSDN performs well across different languages, demonstrating the robustness and superiority of semantic information. The contributions of this article can be summarized as follows:
•
In addition to syntactic information, we propose to leverage semantic features to enhance the transfer capability of cross-lingual models. To the best of our knowledge, this is the first work to simultaneously consider semantic and syntactic information in zero-shot cross-lingual relation and event extraction tasks.
•
We design SSDN to explicitly incorporate syntactic and semantic information. Discrete predicate–argument structures are integrated into the word representations after semantic role labelling. In addition, semantic dependency structures and the corresponding semantic dependency relations are fused by a Sem-RGCN.
•
Experiments on the widely used ACE2005 English, Chinese, and Arabic datasets show that the proposed method achieves state-of-the-art performance in most single-source and multi-source transfer scenarios. A further study illustrates that SSDN is less sensitive to the source language, indicating the robustness of semantics.
3 Methodology
This section describes the architecture of our SSDN model in detail, which explicitly incorporates semantic and syntactic information. The framework of the model is illustrated in Figure 3. Specifically, we first obtain the SRL, syntactic DP, and SDP results and concatenate the mapped continuous SRL embeddings to the multi-lingual word embeddings. Then, a transformer-based encoder is utilized to incorporate syntactic dependencies, and we introduce Sem-RGCN to encode semantic dependency structures with the corresponding relation types. A fusion module is leveraged to adaptively select semantic and syntactic output representations. In addition, a knowledge distillation mechanism further transfers knowledge to a student model, which has the same architecture as the teacher model.
In the experiments, we assume all entity and event mentions are provided, and we focus on the RE and EARL tasks. Formally, given a pair of entities \(e_s\) and \(e_o\) from a sentence, the RE task seeks to classify the relation \(r_r\in \mathcal {R}_r \cup \lbrace None\rbrace\) , where the subscript r indicates the RE task, \(r_r\) is the gold-standard category label, and \(\mathcal {R}_r\) is the pre-defined set of relation labels. Likewise, given an event trigger \(e_t\) and a candidate event argument \(e_a\) , EARL refers to predicting the argument role \(r_a\in \mathcal {R}_a \cup \lbrace None\rbrace\) , where the subscript a denotes the EARL task.
3.1 Parsing and Encoding
Our preliminary analysis illustrates that not only syntactic dependencies but also semantic knowledge can serve as language-agnostic information and provide more stable results across languages. To obtain such information, we use open-source tools to acquire the SRL, DP, and SDP results. Specifically, the Stanford CoreNLP Toolkit [35] is utilized to parse the syntactic dependencies; it is also employed to obtain the part-of-speech tags, entity types, and syntactic dependency types. In addition, the SRL and SDP outputs are obtained from the Electra small model [6] and the Biaffine SDP model [13] integrated in the HanLP toolkit, respectively. Because the semantic labels for diverse languages are slightly different, we select uniform labels across all languages.
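For illustration, the following is a minimal preprocessing sketch using the stanza and HanLP Python packages (stanza is used here as a stand-in for the CoreNLP toolkit, and the HanLP model identifier is illustrative rather than the exact configuration of our experiments):

```python
import stanza
import hanlp

# Syntactic dependencies, POS tags, and entity types.
# stanza.download("en")  # run once to fetch the English models
nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse,ner")
doc = nlp("Chris hit Scott with a baseball")
for word in doc.sentences[0].words:
    print(word.text, word.upos, word.head, word.deprel)  # token, POS, head, relation

# SRL and SDP from a HanLP multi-task model (identifier is illustrative;
# the experiments use the Electra-small SRL and Biaffine SDP models).
srl_sdp = hanlp.load(hanlp.pretrained.mtl.UD_ONTONOTES_TOK_POS_LEM_FEA_NER_SRL_DEP_SDP_CON_XLMR_BASE)
result = srl_sdp(["Chris hit Scott with a baseball"], tasks=["srl", "sdp"])
print(result["srl"])  # predicate–argument structures
print(result["sdp"])  # labelled semantic dependency arcs
```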
At the encoding stage, the input sentence is converted into an embedding matrix and used as the input of the transformer-based encoder. mBERT [8] is leveraged to build a contextual representation for each word, and we utilize the hidden states from the last layer of mBERT to represent each token. Formally, given a sentence with N words \(\lbrace x_1,x_2,\ldots ,x_i,\ldots ,x_N\rbrace\) , where \(x_i\) indicates the ith word, the encoded features are calculated as
\[ \lbrace h_1^{bert},h_2^{bert},\ldots ,h_N^{bert}\rbrace = \mathrm{mBERT}\left(\lbrace x_1,x_2,\ldots ,x_N\rbrace \right), \]
where \(h_i^{bert}\) is the contextualized representation of the ith word.
Considering that the predicate–argument structures extracted by SRL tools may span many words, the BIO tagging scheme [9] from sequence tagging is leveraged to model the connection between words. For instance, given the span "with a baseball" with the semantic role "ARGM-MNR," we pre-process it into "B-ARGM-MNR I-ARGM-MNR I-ARGM-MNR." Since the word-level labels acquired from SRL are discrete, we construct a mapping dictionary whose keys are semantic labels and whose values are randomly initialized continuous embeddings. Formally, given the ith word with discrete semantic role label \(l_i^{srl}\) , we obtain the mapped continuous semantic-aware embedding \(h_i^{srl}\) by looking up the mapping dictionary:
\[ h_i^{srl} = \mathrm{mapping}\left(l_i^{srl}\right), \]
where mapping is the operation of finding continuous embeddings in the dictionary. Because the mapping dictionary is randomly initialized, the semantic-aware embedding carries no meaning at the beginning; we therefore keep it updated during training so that the embeddings improve the results. It should be noted that a sentence may yield many sets of SRL results; only the top-S results that contain the most SRL labels are chosen in our experiments, where S is a hyper-parameter that controls the number of SRL results kept.
Then we concatenate the embedding \(h_i^{srl}\) with the corresponding word-level mBERT-encoded features \(h_i^{bert}\) . Similarly, we obtain the ith continuous embeddings \(h_i^{pos}\) , \(h_i^{et}\) , and \(h_i^{dr}\) of the part-of-speech tag, entity type, and dependency relation, respectively, and concatenate them to the mBERT output, yielding the word representation
\[ h_i^{in} = \left[h_i^{bert}, h_i^{srl}, h_i^{pos}, h_i^{et}, h_i^{dr}\right], \]
where \([\cdot ,\ldots ,\cdot ]\) is the concatenation operation and \(h_i^{in}\in {\mathbb {R}^d}\) , where d is the dimension of the concatenated word embedding.
3.2 Transformer-based Encoder
The transformer-based encoder [1] leverages the self-attention mechanism to consider syntactic structure and distances between words simultaneously. The key idea is to manipulate the mask matrix to impose the graph structure and retrofit the attention weights based on pairwise syntactic distances. Specifically, we stack the input word embeddings \(\lbrace h_1^{in},h_2^{in},\ldots ,h_N^{in}\rbrace\) to form the sentence embedding matrix \(H^{in}\in {\mathbb {R}^{N\times d}}\) , and the self-attention mechanism is formulated as
\[ Q = H^{in}W_q,\quad K = H^{in}W_k,\quad V = H^{in}W_v, \]
where \(W_q \in \mathbb {R}^{d \times d_q}\) , \(W_k \in \mathbb {R}^{d \times d_k}\) , and \(W_v \in \mathbb {R}^{d \times d_v}\) are the projection matrices, and \(d_q\) , \(d_k\) , and \(d_v\) are the dimensions of each self-attention head.
The output attention A can be calculated as
\[ A = softmax\left(\frac{QK^{T}}{\sqrt{d_k}} + M\right)V, \]
where \(softmax(\cdot)\) denotes the softmax activation function. The mask matrix \(M\in {\mathbb {R}^{N\times N}}\) is based on the syntactic dependency distance and formulated by
\[ M_{ij} = {\begin{cases}0, & D_{ij} \le \delta , \\ -\infty , & \text{otherwise,}\end{cases}} \]
where \(D_{ij}\) is the syntactic distance between the ith and jth words, and \(\delta\) is a hyper-parameter that controls how many hops of dependencies are kept for diverse self-attention heads. For example, if \(\delta\) is 4 and the dependency distance is 1, then the corresponding element is set to 0, so attention between the two words is allowed.
Assuming the transformer-based encoder consists of \(L^{syn}\) self-attention layers, after applying the above operation \(L^{syn}\) times, we finally obtain the syntax-aware sentence representation \(H^{syn} \in \mathbb {R}^{N \times d}\) .
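A minimal PyTorch sketch of one syntactically masked self-attention head, under the mask definition above (shapes and the toy dependency-distance matrix are illustrative):

```python
import math
import torch
import torch.nn.functional as F

def syntax_masked_attention(H, W_q, W_k, W_v, D, delta):
    """Single self-attention head whose receptive field is limited to word
    pairs within `delta` hops in the syntactic dependency tree.

    H:     (N, d) input word representations.
    D:     (N, N) pairwise distances in the dependency tree.
    delta: maximum number of dependency hops a head may attend over.
    """
    Q, K, V = H @ W_q, H @ W_k, H @ W_v
    scores = Q @ K.transpose(0, 1) / math.sqrt(K.size(-1))  # (N, N)
    mask = torch.zeros_like(scores)
    mask[D > delta] = float("-inf")  # block pairs farther than delta hops
    return F.softmax(scores + mask, dim=-1) @ V

# Toy example: 4 words, chain-shaped dependency tree, delta = 2.
N, d = 4, 16
H = torch.randn(N, d)
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
D = torch.tensor([[0., 1., 2., 3.],
                  [1., 0., 1., 2.],
                  [2., 1., 0., 1.],
                  [3., 2., 1., 0.]])
out = syntax_masked_attention(H, W_q, W_k, W_v, D, delta=2)
print(out.shape)  # torch.Size([4, 16])
```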
3.3 Semantic-aware Relational Graph Convolutional Network
Different from previous works that encode unlabelled syntactic structures via GCNs [32, 44], we further introduce a Sem-RGCN module to explicitly encode semantic dependency structures and the corresponding dependency relations. Sem-RGCN can effectively incorporate the labelled semantic graphs by learning a separate projection matrix for each semantic relation.
We treat the words in a sentence as the nodes and the dependency structures as the edges of Sem-RGCN. Formally, the semantic dependency structure of a sentence is defined as a directed and labelled graph \(G=(\mathcal {V},\mathcal {E},\mathcal {R})\) , where \(\mathcal {V}=\lbrace x_1,x_2,\ldots ,x_N\rbrace\) is the set of nodes, \(\mathcal {E}=\lbrace e_1,e_2,\ldots ,e_M\rbrace\) is the set of language-universal semantic dependency edges, and \(\mathcal {R}\) contains the corresponding relation types between two nodes (including inverse relations for inverse edges). N is the number of words in the sentence, and M is the number of dependency relations between words. It should be noted that \(\mathcal {R}\) could be either the pre-defined relation class set \(\mathcal {R}_r\) or the event argument class set \(\mathcal {R}_a\) .
The semantic dependency structure of a sentence with N words is converted into an \(N\times N\) adjacency matrix A, where the element \(a_{i,j}\) in the ith row and jth column is set to 1 if there is a connection in the parsed semantic dependency graph. In addition, self-connections at each node are adopted to help capture information about the current node itself. The node embedding is initialized as the concatenation of the word embedding \(h^{in}\) and the syntactic output \(h^{syn}\) , followed by a learned affine transformation and a ReLU nonlinearity:
\[ o_i^{(0)} = \mathrm{ReLU}\left(W_{e}\left[h_i^{in}, h_i^{syn}\right]\right), \]
where \(W_{e}\in \mathbb {R}^{d \times 2d}\) is a learned matrix, the superscript of o denotes the layer number, and \(o^{(0)}\) refers to the input of Sem-RGCN.
Assuming there are \(L^{sem}\) Sem-RGCN layers, for each layer l, the node hidden representations are propagated from their direct neighbors:
\[ o_{i}^{(l+1)}=\mathrm{ReLU}\left(\sum _{r \in \mathcal {R}} \sum _{j \in \mathcal {N}_{i}^{r}} \frac{1}{\left|\mathcal {N}_{i}^{r}\right|} W_{r}^{(l)} o_{j}^{(l)}+W_{0}^{(l)} o_{i}^{(l)}\right), \]
where \(r\in {\mathcal {R}}\) is the pre-defined relation type of RE or EARL, \(\mathcal {N}_{i}^{r}\) refers to the neighbors of the ith node with relation r, and \({W}_{r}\) and \({W}_{0}\) are learnable parameters indicating a relation-specific transformation and a self-loop transformation, respectively. In the experiments, only the one-hop neighbors of each node are aggregated at each iteration.
After \(L^{sem}\) layers of iteration, we take the output of the final Sem-RGCN layer as the semantic-aware sentence representation \(H^{sem} \in \mathbb {R}^{N \times d}\) .
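A minimal PyTorch sketch of one Sem-RGCN layer implementing the propagation rule above (relation counts and dimensions are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemRGCNLayer(nn.Module):
    """One relational graph-convolution layer over a labelled semantic graph.

    Each semantic dependency relation type gets its own projection matrix
    W_r, plus a self-loop transformation W_0, as in the equation above.
    """

    def __init__(self, d, num_relations):
        super().__init__()
        self.w_rel = nn.Parameter(torch.randn(num_relations, d, d) * 0.02)
        self.w_self = nn.Linear(d, d, bias=False)

    def forward(self, o, adj):
        """o: (N, d) node states; adj: (num_relations, N, N) per-relation adjacency."""
        out = self.w_self(o)  # self-loop term W_0 o_i
        for r in range(adj.size(0)):
            deg = adj[r].sum(dim=-1, keepdim=True).clamp(min=1)  # |N_i^r|
            # Mean over the neighbors connected by relation r, projected by W_r.
            out = out + (adj[r] @ o / deg) @ self.w_rel[r]
        return F.relu(out)

# Toy example: 4 words, 3 semantic relation types.
layer = SemRGCNLayer(d=16, num_relations=3)
o0 = torch.randn(4, 16)
adj = torch.zeros(3, 4, 4)
adj[0, 1, 0] = 1.0  # e.g., word 0 is a relation-0 neighbor of word 1
o1 = layer(o0, adj)
print(o1.shape)  # torch.Size([4, 16])
```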
3.4 Fusion Module
A fusion module is adopted to incorporate syntactic and semantic knowledge. Motivated by Reference [16], a gate mechanism is utilized to dynamically integrate the syntactic representation \(H^{syn}\) from the transformer-based encoder and the semantic representation \(H^{sem}\) generated by the Sem-RGCN module. Specifically, the gate G is calculated as
\[ G = sigmoid\left(\left[H^{syn}, H^{sem}\right]W_g^{T} + b_g\right), \]
where \(W_g \in \mathbb {R}^{1 \times 2d}\) and \(b_g\) are trainable variables of the gate, \(sigmoid(\cdot)\) is the sigmoid activation function, and G is a one-dimensional vector whose ith element \(g_i \in [0,1]\) gates the ith word. We leverage the gate G to form the final representation as
\[ H = G \odot H^{syn} + (1 - G) \odot H^{sem}, \]
where \(\odot\) denotes the element-wise product. In this way, G controls the proportion of each input, and the output sentence representation \(H\in \mathbb {R}^{N \times d}\) adaptively integrates the syntactic and semantic knowledge.
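A minimal PyTorch sketch of this gated fusion (dimensions are illustrative):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Gated fusion of syntactic and semantic sentence representations."""

    def __init__(self, d):
        super().__init__()
        self.gate = nn.Linear(2 * d, 1)  # W_g and b_g

    def forward(self, h_syn, h_sem):
        """h_syn, h_sem: (N, d). Returns the fused representation (N, d)."""
        g = torch.sigmoid(self.gate(torch.cat([h_syn, h_sem], dim=-1)))  # (N, 1)
        # Per-word convex combination of the two views.
        return g * h_syn + (1 - g) * h_sem

fusion = GatedFusion(d=16)
H = fusion(torch.randn(5, 16), torch.randn(5, 16))
print(H.shape)  # torch.Size([5, 16])
```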
3.5 Training and Knowledge Distillation
Given the fused representation H, we aim to identify the label from the pre-defined categories. For the relation extraction task, considering that entities may have different lengths, given a pair of subject entity \(e_s\) and object entity \(e_o\) , we perform a max-pooling operation to get the fixed-length entity representations \(h_s\) and \(h_o\) . Following Reference [44], we also obtain the sentence representation \(h_{sent}\) by conducting a max-pooling operation over the encoded sequence H of the whole sentence. Finally, the concatenated vector \([h_s,h_o,h_{sent}]\) is fed to a linear classifier followed by a softmax layer to predict the relation type,
\[ y_r = softmax\left(\frac{W_r^{T}\left[h_s, h_o, h_{sent}\right] + b_r}{T}\right), \]
where \(y_r\) is the predicted probability distribution, \(W_r \in \mathbb {R}^{3d \times \left| \mathcal {R}_r \right| }\) and \(b_r\) are learnable parameters of the last feed-forward layer, \(\left| \mathcal {R}_r \right|\) indicates the number of pre-defined categories of the relation extraction task, and T is a temperature factor that controls the smoothness of the output distribution.
Likewise, for the event argument role labelling task, we conduct max-pooling over the argument candidate, the event trigger, and the sentence and get the vectors \(h_a\) , \(h_t\) , and \(h_{sent}\) . Then we concatenate the vectors \([h_a,h_t,h_{sent}]\) and pass them through a linear classifier and softmax layer to predict the argument role label,
\[ y_a = softmax\left(\frac{W_a^{T}\left[h_a, h_t, h_{sent}\right] + b_a}{T}\right), \]
where \(y_a\) is the distribution of prediction probabilities and \(W_a\) and \(b_a\) are the corresponding classifier parameters.
The relation extraction and event argument role labelling models are optimized by minimizing the Cross Entropy (CE) loss:
\[ \mathcal {L}_{ce} = -\frac{1}{K}\sum _{k=1}^{K} \hat{y}_k \log \left(y_k\right), \]
where K denotes the number of training samples, \(y_k\) is the predicted distribution of the kth sample from either the RE or the EARL model, and \(\hat{y}_k\) is the gold-standard label.
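A minimal PyTorch sketch of the RE classification head described above (the span format, dimensions, and class count are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def span_max_pool(H, span):
    """Max-pool the rows of H (N, d) covering a mention span [start, end)."""
    return H[span[0]:span[1]].max(dim=0).values

class RelationClassifier(nn.Module):
    """Pooled subject, object, and sentence vectors feed a linear layer
    with a temperature-scaled softmax, as in the equation above."""

    def __init__(self, d, num_relations, temperature=1.0):
        super().__init__()
        self.fc = nn.Linear(3 * d, num_relations)
        self.temperature = temperature

    def forward(self, H, subj_span, obj_span):
        h_s = span_max_pool(H, subj_span)   # subject entity
        h_o = span_max_pool(H, obj_span)    # object entity
        h_sent = H.max(dim=0).values        # whole-sentence max-pooling
        logits = self.fc(torch.cat([h_s, h_o, h_sent], dim=-1))
        return F.softmax(logits / self.temperature, dim=-1)

clf = RelationClassifier(d=16, num_relations=7, temperature=2.0)
H = torch.randn(6, 16)                      # fused representation of 6 words
y_r = clf(H, subj_span=(0, 1), obj_span=(2, 3))
loss = -torch.log(y_r[4])                   # CE loss for a sample with gold label 4
```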
Knowledge distillation focuses on utilizing the final output layer or the intermediate layers of a teacher model; the hypothesis is that the student model learns to mimic the predictions of the teacher. This is achieved through a loss function, termed the distillation loss, that captures the difference between the logits of the student and the teacher models. As this loss is minimized during training, the student becomes better at making the same predictions as the teacher. For example, in a classification task the original hard target may be [0, 1, 0], while the soft target could be [0.35, 0.6, 0.05], which encodes the implicit relationships between the labels. Training a neural network on hard labels discards this information and makes the data easier to fit, which may produce over-fitting and reduce the generalization ability of the model. With soft labels, the model must learn more knowledge, such as the similarities and differences between close probabilities, which challenges the model's fitting ability and enhances its generalization.
To make the most of the data from the source language, since soft targets carry more helpful information than hard targets [15], a knowledge distillation mechanism is further utilized. The workflow is as follows: First, we train a teacher model \(M^{tea}\) on the source language and fix its weights. By feeding the source-language data to the teacher model, the corresponding output logits \(y^{tea}\) and fused features \(h^{tea}\) are obtained. Second, we train a student model \(M^{stu}\) sharing the same architecture as the teacher \(M^{tea}\) but with separate parameters, where \(M^{stu}\) is initialized with the pre-trained weights of mBERT. We obtain the output logits \(y^{stu}\) and features \(h^{stu}\) by feeding the source-language data to the student model. Finally, the mean squared error (MSE) and the Kullback–Leibler (KL) divergence are utilized as loss functions to minimize the difference between the teacher and student models. The KL divergence is a widely used metric in knowledge distillation to measure the difference between two probability distributions (here, the output distributions of the teacher and the student). In essence, the KL divergence is cross entropy minus information entropy, that is, the extra number of bits required to encode the true distribution using the estimated distribution. By minimizing the KL divergence during training, the distribution of the student model gradually converges to that of the teacher model, allowing the student to "learn" knowledge from the teacher.
Formally, the MSE loss utilized to minimize the gap between the teacher model \(M^{tea}\) and the student model \(M^{stu}\) in the fused output features is calculated as
\[ \mathcal {L}_{mse} = \frac{1}{K}\sum _{k=1}^{K} \left\Vert h_k^{tea} - h_k^{stu} \right\Vert ^{2}. \]
For the output logits, the KL loss between the teacher and student models is obtained as
\[ \mathcal {L}_{kl} = \sum _{k=1}^{K} KL\left(y_k^{tea} \,\Vert \, y_k^{stu}\right). \]
During the training process, in addition to the ground-truth labels, the student model is also supervised by the soft labels and the fused output features from the teacher model. The overall loss of the student model under the knowledge distillation framework is formulated as
\[ \mathcal {L} = \mathcal {L}_{ce} + \alpha \mathcal {L}_{kl} + \beta \mathcal {L}_{mse}, \]
where \(\alpha\) and \(\beta\) are two weight coefficients.
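A minimal PyTorch sketch of the overall student objective (the hyper-parameter values are illustrative, and the temperature-softened KL term reuses the temperature T defined earlier):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_feats, teacher_feats,
                      targets, alpha=0.5, beta=0.5, T=2.0):
    """Student objective: CE on gold labels, plus KL against the teacher's
    softened logits and MSE against the teacher's fused features."""
    ce = F.cross_entropy(student_logits, targets)
    # KL(teacher || student) between temperature-softened distributions.
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean")
    mse = F.mse_loss(student_feats, teacher_feats)
    return ce + alpha * kl + beta * mse

# Toy batch: 4 samples, 7 relation classes, 16-dim fused features.
s_logits = torch.randn(4, 7, requires_grad=True)
t_logits = torch.randn(4, 7)
s_feats = torch.randn(4, 16, requires_grad=True)
t_feats = torch.randn(4, 16)
loss = distillation_loss(s_logits, t_logits, s_feats, t_feats,
                         targets=torch.tensor([0, 3, 2, 6]))
loss.backward()
```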