1 Introduction

Named entity recognition (NER) is an information extraction task that aims to identify pre-defined entities in unstructured text and plays an indispensable role in many downstream natural language processing tasks, such as event extraction, information retrieval, and question answering (Dhiman et al. 2022; Hao et al. 2023; Huai et al. 2023; Hong et al. 2022; Izacard et al. 2022; Wu et al. 2023; Hao et al. 2022; Brandsen et al. 2022). Early NER systems relied mainly on rule-based and dictionary-based methods, which suffered from several shortcomings, including the need for extensive domain expertise, low efficiency, high cost, and limited portability (Zhang et al. 2022). With the advent of deep learning, NER performance has improved significantly (Liu et al. 2022; Deng et al. 2021). English text is composed of words whose boundaries are naturally marked by spaces. Chinese text, in contrast, consists of continuous characters without natural word boundaries (Cheng et al. 2021). Moreover, Chinese words typically consist of one or more characters, and the position and combination of these characters within a word can influence its meaning. Consequently, Chinese named entity recognition (CNER) is more difficult, and researchers have approached it from various perspectives. In CNER, performing entity recognition directly with character-level feature enhancement usually works well (Dai et al. 2019; Li et al. 2020; Wang et al. 2020; Zhu et al. 2023). However, many approaches show that fusing lexical and character information allows the model to learn more comprehensive features and yields better results. Our study is also inspired by this idea of feature fusion.

The Lattice-LSTM approach is initially proposed to leverage lexical information to enhance character-level representations in NER (Zhang and Yang 2018). Despite its benefits, it falls short in fully exploiting both the self-matching words of each character and the contextual words near it. In particular, it may not capture all of the available self-matching lexical information, which is critical for understanding the meaning of words and phrases in context, and it may disregard contextual words near a character that could provide further insight into its meaning. Subsequently, SoftLexicon is proposed to classify all self-matching words of each character into different sets, compressing and concatenating the sets to obtain vectors that represent lexical information (Ma et al. 2020). Although the self-matching words are extracted, they are simply concatenated with the character-level information, so the use of lexical information remains incomplete. The DCSAN model is later proposed to optimize SoftLexicon (Zhao et al. 2021); it fuses character and lexical features through a cross-lattice module with a gated word-character semantic fusion unit and a self-lattice attention module. However, this model overlooks the contextual features of the text.

Additionally, graph neural networks (GNNs) can be used to integrate character and lexical information. In a graph representation, the adjacency relations between nodes clearly describe the relationship between characters and lexical items, and through information interaction the graph-based network can mitigate the loss incurred during information transfer, resulting in a more robust and better-performing model. For example, the lexicon-based graph network (LGN) (Gui et al. 2019) and the collaborative graph network (CGN) (Sui et al. 2019) are two classic graph neural networks applied to named entity recognition, so GNNs are a natural choice for CNER as well. However, existing GNN-based NER works focus on fusing lexical information into characters and only use the character nodes for decoding. As a result, they cannot capture the deeper correlation between characters and lexical information, and the fusion process neglects to preserve the information of the nodes themselves.

To address these limitations, we propose an interactive fusion approach that integrates characters and lexical information for CNER. Here, "interactive" refers to the mutual influence and information transfer between character information and lexical information. By integrating lexical information into character information, character features are influenced by lexical information, which improves their expressive ability; by integrating character information into lexical information, lexical features obtain richer character-level details, which enhances their expressive ability in turn. This two-way transfer lets character and lexical information complement and enrich each other, further improving the feature representation. In addition, we introduce feedforward neural networks, residual connections, and layer normalization to enhance the feature representation of characters and lexical information. In summary, our main contributions are as follows.

  1. To capture the deeper correlation between characters and lexical information and thereby improve CNER performance, we propose an interactive fusion approach between characters and lexical information via a graph attention network.

  2. To enhance the feature representation of characters and lexical information, we further incorporate residual connections and layer normalization, which fully exploit the effect of the graph attention network.

  3. To fully validate the performance of our proposed model, we conduct experiments on multiple datasets. The results demonstrate the effectiveness of our model, which achieves state-of-the-art performance on several datasets.

2 Related work

Several methods have been proposed to incorporate lexical information into character-level feature representations to improve NER performance. In addition, several studies have demonstrated the effectiveness of GNNs in fully capturing information from neighboring nodes. We present related works below, as our work is inspired by these studies.

2.1 The fusion between characters and lexical information

In Chinese named entity recognition, some models rely only on character information, such as the BERT_BiLSTM_CRF model (Dai et al. 2019) and the LSTM_CNN model (Wang et al. 2020). Although these methods can achieve good results, the information the model can learn is limited, so fusing lexical information is a promising direction for improvement. A number of methods have been proposed to fuse lexical information into characters and thus provide a more comprehensive character-level feature representation. Lattice-LSTM uses a lexicon to match words in the input sequence to improve NER performance (Zhang and Yang 2018). FLAT assigns head and tail position indexes to all characters and words in the input sequence and, through these indexes, transforms the lattice structure into a flat lattice structure (Li et al. 2020). NFLAT further improves FLAT by reducing its computational cost (Wu et al. 2022), and TRAMA improves NER performance by contextualizing the vocabulary (Huang et al. 2022). SoftLexicon incorporates lexical information in the embedding layer: matched words are classified into four categories according to the position of the character within the matched word, and the semantic information of the words is fused from these four dimensions (Ma et al. 2020). This approach avoids complex sequence modeling while achieving good performance. To further optimize SoftLexicon, DCSAN extracts word information with SoftLexicon and then captures the dense interactions of the word-character lattice structure through a cross-lattice module, a gated word-character semantic fusion unit, and a self-lattice attention module (Zhao et al. 2021). We note, however, that the lexical information extracted by SoftLexicon is only directly concatenated with the vector representing the character information, which does not take full advantage of the fusion between lexical and character information.

2.2 Graph-based neural network for feature enhancement

Several studies have demonstrated the remarkable ability of graph neural networks to fully capture information from neighboring nodes in the NER task (Xu et al. 2023; El-Allaly et al. 2022; Tian et al. 2021; Liao et al. 2021; Liang et al. 2022). In this paper, we aim to exploit this capability to fuse lexical and character information with a graph neural network. Specifically, Chen and Kong (2021) propose to enhance the internal dependencies of phrases via GAT to enrich textual feature representations. LGN (Gui et al. 2019) represents characters as nodes and words as edges and constructs a global relay node to collect information from nodes and edges. The information of edges adjacent to each node is aggregated into the node, the information of nodes adjacent to each edge is aggregated into the edge, and finally the information of both nodes and edges is aggregated into the global relay node. This effective graph characterization leads to strong performance in the NER task. CGN (Sui et al. 2019) constructs three character-word interaction graphs with both characters and words as nodes; because the graph structures differ, the interaction graphs capture different lexical knowledge. Although the three graphs constructed by CGN have overlapping adjacencies, its use of GAT (Veličković et al. 2018) to fuse lexical information is inspiring and provides valuable insights for our research.

In summary, we propose that interactively utilizing both character-level and lexical information yields a more comprehensive feature representation and thus improves CNER performance. We design a novel approach that uses an optimized graph attention network to effectively integrate characters and lexical information, addressing the limitations of existing methods in fully exploiting lexical information.

3 The proposed approach

In this section, we present a comprehensive explanation of our proposed approach, which achieves interactive fusion between characters and lexical information through an optimized graph attention network. The overall architecture is shown in Fig. 1, and Algorithm 2 gives the corresponding pseudocode. In brief, we first use the SoftLexicon strategy to obtain feature vectors for lexical information and apply the BERT model to obtain feature representations of characters. Next, we employ BiLSTM to extract contextual features from the text and then perform the interactive fusion of characters and lexical information with the graph attention network. Finally, we incorporate feedforward neural networks, residual connections, and layer normalization for a more comprehensive feature enhancement.

Fig. 1

Model architecture diagram. Firstly, characters and words are represented as feature vectors using the BERT model and the Softlexicon method. Next, two independent BiLSTM networks are employed to extract context information at the character and word levels. Then, a directed graph model is constructed to preliminarily integrate character and word information using GAT. Mechanisms such as feedforward neural networks and residual connections are introduced to enhance effectiveness. Subsequently, a secondary fusion of character and word information is performed at the integration layer. Finally, decoding is carried out using CRF

3.1 Embedding layer

Character-level and lexical-level feature vectors are obtained in two different ways, which we explain in detail below. Given a Chinese text sequence \(Sc= \{ {c_1},{c_2}, \ldots ,{c_n}\}\), where \({c_i}\) denotes the ith character in the sequence, we map each discrete character to a feature vector that represents its semantics using the BERT model (Kenton and Toutanova 2019), denoted as:

$$\begin{aligned} \mathbf {x_i} = BERT({c_i}),\mathbf {x_i} \in {R^d}\end{aligned}$$
(1)
$$\begin{aligned} \textbf{X} = [\mathbf {x_1}, \ldots ,\mathbf {x_n}] \in {R^{n*d}} \end{aligned}$$
(2)
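As a concrete illustration of Eqs. (1)-(2), the sketch below obtains character-level BERT vectors with the HuggingFace transformers library and the "bert-base-chinese" checkpoint mentioned in Sect. 4.1; the variable names are illustrative and not taken from the authors' implementation.

```python
# Minimal sketch of Eqs. (1)-(2): mapping a character sequence to BERT vectors.
# Assumes the HuggingFace `transformers` package and "bert-base-chinese".
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

sentence = "局部肠壁增厚"                     # Sc = {c_1, ..., c_n}
chars = list(sentence)

# bert-base-chinese tokenizes Chinese text character by character, so each
# character c_i is aligned with one wordpiece (plus [CLS]/[SEP] specials).
enc = tokenizer(chars, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    out = bert(**enc).last_hidden_state          # (1, n + 2, 768)

X = out[0, 1:-1]                                 # drop [CLS]/[SEP]: X in R^{n*d}, d = 768
print(X.shape)                                   # torch.Size([6, 768])
```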

To extract the lexical feature vectors aligned with characters, we first need to construct word sets, as illustrated in Fig. 2. Specifically, for each character \({c_i}\) in the input sequence Sc, all matching words are divided into the four word sets “BMES” according to the position of the character in the word. We use \({c_{j,k}}\) to denote the word consisting of the characters from position j to position k in the input sequence; e.g., for the input sequence “局部肠壁增厚... (localized intestinal wall thickening...)”, \({c_{3,4}}\) denotes the word “肠壁 (intestinal wall)”. \(B({c_i})\) denotes the set of matching words in which the character \({c_i}\) appears at the beginning, \(M({c_i})\) the set of matching words in which \({c_i}\) appears in the middle, \(E({c_i})\) the set of matching words in which \({c_i}\) appears at the end, and \(S({c_i})\) the single-character word consisting of \({c_i}\) itself. If a word set is empty, we add the special word “None” to it. For example, for the character \({c_3}\) “肠 (intestine)”, \(B({c_3})\) consists of “肠壁 (intestinal wall)” and “肠壁增厚 (intestinal wall thickening)”, \(M({c_3})\) consists of “局部肠壁 (localized intestinal wall)”, \(E({c_3})\) consists of the special word “None”, and \(S({c_3})\) consists of “肠 (intestine)”.

Fig. 2

Word set construction example
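The following sketch illustrates the BMES word-set construction shown in Fig. 2 with a toy lexicon; in the actual model the lexicon comes from the pre-trained word-vector table, and the indices here are 0-based rather than the 1-based notation used in the text.

```python
# Sketch of the "BMES" word-set construction (Fig. 2) for a toy lexicon.
def build_bmes_sets(chars, lexicon):
    n = len(chars)
    B = [set() for _ in range(n)]
    M = [set() for _ in range(n)]
    E = [set() for _ in range(n)]
    S = [set() for _ in range(n)]
    for j in range(n):
        for k in range(j, n):
            word = "".join(chars[j:k + 1])        # c_{j,k}
            if word not in lexicon:
                continue
            if j == k:
                S[j].add(word)                    # the character itself is a word
            else:
                B[j].add(word)                    # word begins at c_j
                E[k].add(word)                    # word ends at c_k
                for m in range(j + 1, k):
                    M[m].add(word)                # c_m lies inside the word
    # empty sets receive the special word "None"
    for sets in (B, M, E, S):
        for s in sets:
            if not s:
                s.add("None")
    return B, M, E, S

chars = list("局部肠壁增厚")
lexicon = {"局部", "肠", "肠壁", "肠壁增厚", "局部肠壁", "增厚"}
B, M, E, S = build_bmes_sets(chars, lexicon)
print(B[2], M[2], E[2], S[2])   # word sets for c_3 = "肠" (0-based position 2)
```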

After the four word sets are constructed, each set is compressed into a fixed-dimensional vector, and the four vectors are concatenated. The vector dimension is then reduced by a linear mapping to obtain the feature representation of the lexical information aligned with each character, as follows:

$$\begin{aligned} \mathbf {y_i} = [\mathbf {v(B)};\mathbf {v(M)};\mathbf {v(E)};\mathbf {v(S)}],\mathbf {y_i} \in {R^{4d}}\end{aligned}$$
(3)
$$\begin{aligned} \textbf{Y} = Linear[\mathbf {y_1}, \ldots ,\mathbf {y_n}] \in {R^{n*d}} \end{aligned}$$
(4)

Here, Linear[] maps high-dimensional vectors to a lower-dimensional space through a linear transformation, implemented as a matrix multiplication; it reduces the dimensionality of the vectors from \({R^{4d}}\) to \({R^{d}}\) and therefore requires a matrix of size \(4d \times d\). v denotes the function that maps a word set to a feature vector. For example, to map the word set S to a feature vector, the formula is as follows:

$$\begin{aligned} \mathbf {v(S)} = \frac{4}{Z}\sum \limits _{w \in S} {z(w){e^w}(w)} \end{aligned}$$
(5)

where

$$\begin{aligned} Z = \sum \limits _{w \in B \cup M \cup E \cup S} {z(w)} \end{aligned}$$
(6)

Here, w represents a specific word, \({e^w}\) represents the word embedding lookup table, and z(w) represents the frequency of occurrence of w in the data.
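A minimal sketch of Eqs. (3)-(6) for a single character is given below, assuming a toy embedding table `emb`, a frequency table `freq` for z(w), and a linear projection `proj`; these names are illustrative and not taken from the paper's implementation.

```python
# Sketch of Eqs. (3)-(6): compressing one character's four word sets into a
# single lexical vector aligned with that character.
import torch
import torch.nn as nn

def weighted_set_vector(word_set, emb, freq, Z):
    # v(S) = (4 / Z) * sum_{w in S} z(w) * e^w(w)              (Eq. 5)
    vecs = torch.stack([freq.get(w, 1) * emb[w] for w in word_set])
    return 4.0 / Z * vecs.sum(dim=0)

def lexicon_feature(B, M, E, S, emb, freq, proj):
    Z = sum(freq.get(w, 1) for w in B | M | E | S)             # Eq. (6)
    y = torch.cat([weighted_set_vector(s, emb, freq, Z) for s in (B, M, E, S)])
    return proj(y)                                             # Linear: R^{4d} -> R^{d}  (Eq. 4)

d = 200
emb = {w: torch.randn(d) for w in ["肠壁", "肠壁增厚", "局部肠壁", "肠", "None"]}
freq = {"肠壁": 120, "肠壁增厚": 15, "局部肠壁": 8, "肠": 300}   # z(w), toy counts
proj = nn.Linear(4 * d, d)

# word sets of the character "肠" from the running example
y3 = lexicon_feature({"肠壁", "肠壁增厚"}, {"局部肠壁"}, {"None"}, {"肠"}, emb, freq, proj)
print(y3.shape)    # torch.Size([200])
```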

3.2 Encoding layer

After obtaining the character vectors and the lexicon vectors aligned to them, we employ two independent BiLSTM networks to compute the character vector \(\textbf{X} = [\mathbf {x_1},\mathbf {x_2}, \ldots ,\mathbf {x_n}]\) and the lexicon vector \(\textbf{Y} = [\mathbf {y_1},\mathbf {y_2}, \ldots ,\mathbf {y_n}]\) separately, which can capture the contextual features of the text at the character level and the lexicon level respectively. Such an architecture allows the model to take into account both character-level and lexicon-level information when processing text, thus obtaining a more comprehensive and richer contextual representation.

BiLSTM consists of a forward LSTM and a backward LSTM (Graves and Graves 2012). Applying BiLSTM to \(\textbf{X}\) and \(\textbf{Y}\) gives:

$$\begin{aligned} \textbf{P} = \{ \mathbf {p_1},\mathbf {p_2}, \ldots ,\mathbf {p_n}\} = BiLSTM(\textbf{X} = \{ \mathbf {x_1},\mathbf {x_2}, \ldots ,\mathbf {x_n}\} )\end{aligned}$$
(7)
$$\begin{aligned} \textbf{Q} = \{ \mathbf {q_1},\mathbf {q_2}, \ldots ,\mathbf {q_n}\} = BiLSTM(\textbf{Y} = \{ \mathbf {y_1},\mathbf {y_2}, \ldots ,\mathbf {y_n}\} ) \end{aligned}$$
(8)
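A minimal sketch of this encoding layer (Eqs. (7)-(8)) with two independent BiLSTMs is shown below; the hidden size of 256 follows Sect. 4.1, while the batch size and sequence length are illustrative.

```python
# Sketch of Eqs. (7)-(8): two independent BiLSTMs for character and lexical features.
import torch
import torch.nn as nn

n, d_char, d_word, hidden = 6, 768, 200, 256

char_bilstm = nn.LSTM(d_char, hidden, batch_first=True, bidirectional=True)
word_bilstm = nn.LSTM(d_word, hidden, batch_first=True, bidirectional=True)

X = torch.randn(1, n, d_char)      # character features from BERT
Y = torch.randn(1, n, d_word)      # aligned lexical features from SoftLexicon

P, _ = char_bilstm(X)              # (1, n, 2*hidden)  character-level context
Q, _ = word_bilstm(Y)              # (1, n, 2*hidden)  lexicon-level context
print(P.shape, Q.shape)
```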

3.3 Graph attention network layer

After obtaining the features from BiLSTM, \(\textbf{P}\) and \(\textbf{Q}\) are concatenated into \(\textbf{Node} = [\mathbf {p_1},\mathbf {p_2}, \ldots ,\mathbf {p_n},\mathbf {q_1},\mathbf {q_2}, \ldots ,\mathbf {q_n}]\), which serves as the node input for constructing the directed graph. When constructing the directed graph, each \(\mathbf {p_i}\) and its corresponding \(\mathbf {q_i}\) are connected by directed edges in both directions to form a loop, and each of \(\mathbf {p_i}\) and \(\mathbf {q_i}\) also receives a self-loop edge. This architecture enables each node to attend both to its own semantic information and to that of its neighboring node, facilitating the integration of both sources of information during graph neural network training. By restricting the focus to one neighbor at a time, the approach fully integrates information from the neighboring node in an interactive manner while preserving the essential semantic features of each node. The constructed directed graph is represented by an adjacency matrix: each position in the matrix corresponds to a pair of nodes, and its value is set to 1 if there is an edge between them and 0 otherwise. The process of constructing the adjacency matrix is shown in Algorithm 1. The resulting matrix has ones only on the main diagonal, the upper-right diagonal, and the lower-left diagonal, and zeros elsewhere, which is a simple structure. Since our aim is to fully integrate character and lexical information through an interactive structure, the constructed adjacency matrix is necessarily sparse; this also avoids information redundancy and unnecessary computation, thereby improving the efficiency of the model.

Algorithm 1

Construct the adjacency matrix \(\textbf{A}\).

The adjacency matrix \(\textbf{A}\) is an \(N*N\) matrix, where \(N = 2*n\). The first n nodes correspond to characters, and the last n nodes correspond to words. We model the directed graph using graph attention networks, assuming that the node features \(\textbf{P}\) and \(\textbf{Q}\) are uniformly represented by \(\textbf{H} = \{ \mathbf {h_1},\mathbf {h_2},...,\mathbf {h_N}\},\mathbf {h_i} \in {R^d}\). The first n vectors in \(\textbf{H}\) are \(\textbf{P}\), and the last n vectors are \(\textbf{Q}\). The inputs to GAT are the node features \(\textbf{H}\) and the adjacency matrix \(\textbf{A}\).
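Since Algorithm 1 is given as a figure, the following is one possible reading of it: a \(2n \times 2n\) matrix with self-loops on the main diagonal and the character-word cross edges on the upper-right and lower-left diagonals.

```python
# One possible reading of Algorithm 1: character node i and its aligned word
# node n+i are connected in both directions, and every node has a self-loop.
import torch

def build_adjacency(n):
    N = 2 * n
    A = torch.zeros(N, N)
    idx = torch.arange(N)
    A[idx, idx] = 1                       # self-loop for every node (main diagonal)
    A[idx[:n], idx[:n] + n] = 1           # character i -> word i   (upper-right diagonal)
    A[idx[:n] + n, idx[:n]] = 1           # word i -> character i   (lower-left diagonal)
    return A

print(build_adjacency(3))
```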

When fusing node information, the graph attention network first computes the attention coefficient of each neighboring node with respect to the node, which is formulated as follows:

$$\begin{aligned} {\alpha _{ij}} = \frac{\exp \left( \mathrm {LeakyReLU}\left( \mathbf {a}^{T}[\textbf{W}\mathbf {h_i}\,||\,\textbf{W}\mathbf {h_j}]\right) \right) }{\sum \nolimits _{k \in {N_i}} \exp \left( \mathrm {LeakyReLU}\left( \mathbf {a}^{T}[\textbf{W}\mathbf {h_i}\,||\,\textbf{W}\mathbf {h_k}]\right) \right) } \end{aligned}$$
(9)

where \(\textbf{a} \in {R^{2d'}}\) and \(\textbf{W} \in {R^{d'*d}}\) are trainable parameters, || denotes the splicing operation, and \({N_i}\) expresses all nodes adjacent to node i, i.e., all nodes with an edge relationship to node i. Based on the calculated attention coefficients, the feature vectors are weighted and summed to obtain the output vector of the graph attention network. The formula is as follows:

$$\begin{aligned} \mathbf {h_i'} = \sigma \left( \sum \limits _{j \in {N_i}} {\alpha _{ij}}\textbf{W}\mathbf {h_j} \right) \end{aligned}$$
(10)

Here \(\sigma\) denotes the activation function. To obtain better generalization ability, the training process uses multi-head attention, as in the following equation:

$$\begin{aligned} \mathbf {h_i'} = \mathop {\Vert }\limits _{k = 1}^{K} \sigma \left( \sum \limits _{j \in {N_i}} \alpha _{ij}^{k}\mathbf {W}^{k}\mathbf {h_j} \right) \end{aligned}$$
(11)

Here K denotes the number of attention heads. We then denote the output of the GAT by \(\textbf{G} \in {R^{d'*N}}\), computed as:

$$\begin{aligned} \textbf{G} = GAT(\textbf{H},\textbf{A}) \end{aligned}$$
(12)

The use of graph attention networks aims to integrate character and lexical information within an interactive graph structure. Through the processing of graph attention networks, character feature vectors are influenced by lexical information, thereby enhancing the expressiveness of character features. Simultaneously, lexical feature vectors can also capture rich character-level details, thereby strengthening the expressiveness of lexical feature vectors. Additionally, employing an interactive graph structure simplifies the model’s architecture without significantly increasing computational burden.
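For concreteness, the sketch below implements a single-head GAT layer following Eqs. (9)-(10); it is not the authors' implementation, and the multi-head case of Eq. (11) would simply concatenate K such heads.

```python
# Minimal single-head GAT layer (Eqs. 9-10); a sketch under assumed dimensions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)     # W in R^{d' x d}
        self.a = nn.Parameter(torch.randn(2 * d_out))   # a in R^{2d'}

    def forward(self, H, A):
        # H: (N, d_in) node features, A: (N, N) adjacency matrix (1 = edge, incl. self-loops)
        Wh = self.W(H)                                               # (N, d')
        N = Wh.size(0)
        # e_ij = LeakyReLU(a^T [W h_i || W h_j]) for every ordered node pair
        pair = torch.cat([Wh.unsqueeze(1).expand(N, N, -1),
                          Wh.unsqueeze(0).expand(N, N, -1)], dim=-1)  # (N, N, 2d')
        e = F.leaky_relu(pair @ self.a)                               # (N, N)
        e = e.masked_fill(A == 0, float("-inf"))                      # keep only j in N_i
        alpha = torch.softmax(e, dim=-1)                              # Eq. (9)
        return F.elu(alpha @ Wh)                                      # Eq. (10), sigma = ELU here

n, d_in, d_out = 6, 512, 256
layer = GATLayer(d_in, d_out)
H = torch.randn(2 * n, d_in)                     # stacked character and word nodes
A = torch.eye(2 * n)                             # self-loops
A[torch.arange(n), torch.arange(n) + n] = 1      # character i -> word i
A[torch.arange(n) + n, torch.arange(n)] = 1      # word i -> character i
G = layer(H, A)
print(G.shape)                                   # torch.Size([12, 256])
```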

3.4 Fusion layer

We further process the output of the graph attention network with a feedforward neural network, residual connections, and layer normalization, as follows:

$$\begin{aligned} FFN(\textbf{G}) = \max (0,\textbf{G}\mathbf {W_1} + \mathbf {b_1})\mathbf {W_2} + \mathbf {b_2}\end{aligned}$$
(13)
$$\begin{aligned} \mathbf {G'} = LayerNorm(FFN(\textbf{G}) + \textbf{G}) \end{aligned}$$
(14)

\(\textbf{G}\) is the graph attention network layer’s output vector, and FFN stands for feedforward neural network.

We split \(\mathbf {G'} \in {R^{d'*(2*n)}}\) into two parts, the first n vectors and the last n vectors, which correspond to the character nodes and the word nodes, respectively. The character vectors and the word vectors are then fused a second time, as follows:

$$\begin{aligned} \mathbf {G''} = \mathbf {M_1}*\mathbf {G'}[:,0:n] + \mathbf {M_2}*\mathbf {G'}[:,n:] \end{aligned}$$
(15)

where \(\mathbf {M_1} \in {R^{d'*d'}}\) and \(\mathbf {M_2} \in {R^{d'*d'}}\) are trainable parameters, and \(\mathbf {G''} \in {R^{d'*n}}\) is the feature vector obtained by fusion.
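The fusion layer of Eqs. (13)-(15) can be sketched as follows; node features are stored row-wise as (2n, d'), so the paper's column slices become row slices, and the inner FFN width of 1024 is an assumption.

```python
# Sketch of the fusion layer, Eqs. (13)-(15): FFN + residual + LayerNorm, then a
# second fusion of the character and word halves with trainable M1, M2.
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    def __init__(self, d, d_ff=1024):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
        self.norm = nn.LayerNorm(d)
        self.M1 = nn.Linear(d, d, bias=False)    # M1 in R^{d' x d'}
        self.M2 = nn.Linear(d, d, bias=False)    # M2 in R^{d' x d'}

    def forward(self, G):
        # G: (2n, d') output of the GAT layer
        G_prime = self.norm(self.ffn(G) + G)     # Eqs. (13)-(14)
        n = G.size(0) // 2
        chars, words = G_prime[:n], G_prime[n:]  # first n nodes: characters, last n: words
        return self.M1(chars) + self.M2(words)   # second fusion, Eq. (15) -> (n, d')

fusion = FusionLayer(d=256)
G = torch.randn(12, 256)                          # 2n = 12 GAT node outputs
print(fusion(G).shape)                            # torch.Size([6, 256])
```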

Algorithm 2

Pseudo-code for the model

3.5 Decoding layer

We employ conditional random fields (CRF) (Lafferty et al. 2001) to predict the tag sequences. Given the sentence \(Sc = \{ {c_1},{c_2}, \ldots ,{c_n}\}\), the probability formula for predicting the sequence \(L = \{ {l_1},{l_2}, \ldots ,{l_n}\}\) is as follows:

$$\begin{aligned} p(L|Sc) = \frac{{{e^{score(Sc,L)}}}}{{\sum \limits _{\tilde{L} \in {L_{Sc}}} {{e^{score(Sc,\tilde{L})}}} }}\end{aligned}$$
(16)
$$\begin{aligned} score(Sc,L) = \sum \limits _{i = 1}^n {\mathbf {A_{{l_i},{l_{i + 1}}}}} + \sum \limits _{i = 1}^n {\mathbf {p_{i,{l_i}}}} \end{aligned}$$
(17)

Here \({L_{Sc}}\) denotes all possible label sequences, \(\mathbf {p_{i,{l_i}}}\) denotes the emission score of the ith character for label \({l_i}\), and \(\mathbf {A_{{l_i},{l_{i + 1}}}}\) denotes the transition score from label \({l_i}\) to label \({l_{i + 1}}\). At prediction time, the label sequence with the highest score is output:

$$\begin{aligned} L* = \mathop {\arg \max }\limits _{\tilde{L} \in {L_{Sc}}} score(Sc,\tilde{L}) \end{aligned}$$
(18)
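As a sketch of this decoding step, the following Viterbi routine computes the arg max of Eq. (18) given emission scores p and a label-transition matrix A; it is a generic implementation, not the authors' CRF code.

```python
# Minimal Viterbi decoder for Eq. (18): return the highest-scoring label sequence
# under emission scores p_{i,l} and transition scores A_{l,l'}.
import torch

def viterbi_decode(emissions, transitions):
    # emissions: (n, num_labels), transitions: (num_labels, num_labels)
    n, L = emissions.shape
    score = emissions[0].clone()                  # best score ending in each label at step 0
    backptr = []
    for i in range(1, n):
        # score of extending every previous label l to every current label l'
        cand = score.unsqueeze(1) + transitions + emissions[i].unsqueeze(0)   # (L, L)
        score, best_prev = cand.max(dim=0)
        backptr.append(best_prev)
    # follow the back-pointers to recover L* = argmax score(Sc, L)
    best_last = int(score.argmax())
    path = [best_last]
    for best_prev in reversed(backptr):
        path.append(int(best_prev[path[-1]]))
    return list(reversed(path))

emissions = torch.randn(6, 5)                     # n = 6 characters, 5 BIO-style labels
transitions = torch.randn(5, 5)
print(viterbi_decode(emissions, transitions))
```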

4 Experiment

In this section, we conduct a series of experiments on four public datasets to evaluate the performance of our model, and we use precision, recall, and F1-score under strict matching as evaluation metrics. An ablation study is further provided to validate the effectiveness of the proposed model.

4.1 Experimental settings

Table 1 Statistics of datasets

Character embeddings are generated by the BERT model, using the "bert-base-chinese" checkpoint released in 2018 (Kenton and Toutanova 2019), with an embedding size of 768. We use a word-embedding dictionary containing 2 million word vectors, with a word embedding size of 200 (Song et al. 2018). The hidden-state dimension of both the BiLSTM and the GAT is set to 256. The GAT has 3 attention heads and 2 layers. A dropout rate of 0.5 is applied after the BiLSTM layer to prevent overfitting. The batch size is set to 20 for all datasets. During training, the Adamax optimizer is used to optimize the model parameters, with an initial learning rate of 0.005 and a decay rate of 0.05 for all datasets. The code for all experiments was developed with the PyTorch framework and run on a server (OS: Ubuntu) with a single GPU (NVIDIA TITAN V, 11 GB memory).

The datasets in this paper include the CCKS2020 medical dataset (Li et al. 2021), the Weibo social media dataset (Peng and Dredze 2015), the Ontonotes news dataset (Weischedel et al. 2011), and the resume dataset (Zhang and Yang 2018). Specifically, the CCKS2020 dataset contains six types of entities: diseases and diagnoses, imaging examinations, laboratory tests, drugs, surgeries, and anatomical sites. As the dataset only provides a training set, we divided it into a training set, a validation set, and a test set in an 8:1:1 ratio for our experiments. The text in this dataset is relatively long, so we partitioned it into segments with a maximum length of 250 for training. The detailed information on four datasets is shown in Table 1.

4.2 Experimental results analysis

We compare our proposed model with other models on four datasets from different domains to demonstrate its validity and generalization ability. The results are presented in Table 2. Specifically, the baselines include the BBC and FT-BBC models, which are abbreviations for the BERT_BiLSTM_CRF model (Dai et al. 2019) and the FT-BERT_BiLSTM_CRF model (Li et al. 2020), respectively. These two models use pre-trained language models to map the characters of the input text to feature vectors and then learn contextual information with BiLSTM structures, but they use character information only. The LSTM_CNN model (Wang et al. 2020) first uses BiLSTM to learn the contextual information of the text sequence and then uses a CNN module to further enhance feature extraction; this structure helps the model focus on adjacent characters, but it still uses only character information. The DSpERT model (Zhu et al. 2023) uses a standard transformer together with a span transformer to obtain deep semantic span representations for named entity recognition, but it does not use lexical information either. Comparison with these character-only models verifies the effectiveness of introducing lexical information into the model. Besides, the CGN (Sui et al. 2019) and LGN (Gui et al. 2019) models apply GNNs to incorporate lexical information into character information, and the comparison with these two models demonstrates the validity of the graph structure in our model. In addition, SoftLexicon (Ma et al. 2020), DCSAN (Zhao et al. 2021), NFLAT (Wu et al. 2022), and TRAMA (Huang et al. 2022) leverage lexical information to enhance character-level representations; comparisons with them demonstrate that our proposed interactive fusion of character and lexical information makes fuller use of lexical information.

Table 2 Performance of different models on four datasets

Since lexical information is used, a word-vector set is required to implement the model, and different word-vector sets provide different lexicon coverage, with larger lexicons usually having a more complete vocabulary. Our model obtains lexical information with the SoftLexicon method. The DCSAN model, which also extracts word information with SoftLexicon and was proposed more recently than SoftLexicon, uses the Tencent word-vector set in our reproduction. To avoid the influence of the word-vector set and to reduce the differences caused by different lexicon coverage when comparing models, we use the tencent-ailab-embedding-zh-d200-v0.2.0-s.txt word-vector set (Song et al. 2018) both in the implementation of our proposed model and in the reproduction of the DCSAN model.

To reduce the effect of chance errors and more accurately reflect the performance of each model on the corresponding dataset, Table 2 reports the average of five runs for each model on each dataset. In the table, P, R, and F1 denote precision, recall, and F1-score, respectively. Precision (P) is the proportion of correctly identified positive instances among all instances predicted as positive, recall (R) is the proportion of correctly identified positive instances among all actual positive instances, and the F1-score (F1) is the harmonic mean of precision and recall, a single value that balances the two. The formulas are as follows:

$$\begin{aligned} P = \frac{{TP}}{{TP + FP}}\end{aligned}$$
(19)
$$\begin{aligned} R = \frac{{TP}}{{TP + FN}}\end{aligned}$$
(20)
$$\begin{aligned} F1 = \frac{{2 \times P \times R}}{{P + R}} \end{aligned}$$
(21)

where TP denotes the number of true positives, i.e., entities predicted as entities that are actually entities; FP denotes the number of false positives, i.e., predictions labeled as entities that are actually non-entities; and FN denotes the number of false negatives, i.e., actual entities that the model predicts as non-entities.
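For illustration, strict-match entity-level metrics can be computed as below, where an entity counts as a true positive only if its span and type both match a gold entity exactly; the entity types in the example are hypothetical.

```python
# Entity-level strict matching (Eqs. 19-21): TP requires an exact span-and-type match.
def strict_prf(gold_entities, pred_entities):
    # each entity is a (start, end, type) triple
    gold, pred = set(gold_entities), set(pred_entities)
    tp = len(gold & pred)
    fp = len(pred - gold)
    fn = len(gold - pred)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = [(2, 3, "ANATOMY"), (5, 8, "DISEASE")]
pred = [(2, 3, "ANATOMY"), (5, 7, "DISEASE")]     # second span is off by one -> one FP and one FN
print(strict_prf(gold, pred))                     # (0.5, 0.5, 0.5)
```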

Our experiments on the CCKS2020 dataset demonstrate that our model outperforms the other graph-based neural network models in terms of F1: its F1 score is 1.47% higher than CGN and 1.77% higher than LGN. The NFLAT method, which combines character-level and lexical-level features for CNER, is the second best-performing method on this dataset, and our model surpasses it by 0.22% in F1. These results highlight the effectiveness of our proposed approach, which leverages the interactive fusion of characters and lexical information. CCKS2020 is a medical-domain dataset, and the specificity of the medical domain, with its larger vocabulary and longer named entities, makes entity recognition more difficult than in other domains. Our method performs best on this dataset because it utilizes the lexical information more fully.

Our proposed model also performs best on the Ontonotes dataset, with an F1 value exceeding the BBC model by 0.72% and the FT-BBC model by 0.95%, indicating that our model effectively uses lexical information to improve entity recognition compared to models that use only character information. The SoftLexicon model, which directly concatenates lexical information with character information, performs second best on this dataset, and our model exceeds it by 0.38%. Ontonotes is a news-domain dataset; news covers a variety of topics and events, so its vocabulary is rich and varied, and a model that can make full use of the lexical information obtains better results. In addition, news text has extremely rich contextual information, and our model uses two independent BiLSTMs to capture the contextual features of the text at the character level and the word level, respectively. Therefore, our model obtains the best results on this dataset.

For the Weibo and Resume datasets, our proposed method achieves an F1 score competitive with DCSAN and outperforms the others in terms of precision on the Weibo dataset and recall on the Resume dataset. Although the F1 values of DCSAN are slightly higher than ours on Weibo and Resume, DCSAN performs significantly worse than ours on CCKS2020 and Ontonotes, so our proposed model remains a highly effective approach overall. Our model fails to achieve state-of-the-art performance on the Weibo and Resume datasets, likely because it primarily enhances performance by integrating character and lexical information through interaction, whereas these two datasets exhibit different feature distributions and language styles. In particular, the Weibo dataset is characterized by "zero grammar, zero eloquence, and fragmentation", containing colloquial and internet-based text with weak logic and substantial randomness (Liu et al. 2017), while the Resume dataset contains specific industry terms and structures; as a result, neither dataset offers rich lexical features. These factors account for the performance discrepancy of the model.

Fig. 3

Box plots of different models on each dataset

We further present box plots of the F1 values of all models in Fig. 3, where the red line marks the median and the blue line marks the mean. These plots give a clear picture of the range, distribution, and outliers of the F1 scores for each model, allowing a more objective comparison than a single metric. The boxes of our model are shorter, indicating that its F1 values fall within a smaller range and that it is more robust than the other models. These results show that our model makes full use of the interactive fusion between lexical information and characters, capturing deeper correlations between character-level and lexical information. In addition to higher accuracy, our model thus also exhibits stronger robustness.

In Table 3, we compare the time required for a single training run of each model on the corresponding dataset, measured in seconds. As seen from the table, our proposed model does not have the shortest training time, primarily due to the use of graph networks: during training, a graph network must propagate and aggregate neighbor information within the graph, which requires additional processing time. However, the graph structure we construct is relatively simple, with each node interacting with only one neighbor node. Therefore, compared to the CGN and LGN models, which also use graph neural networks to integrate character and word information, our proposed model significantly reduces training time.

Table 3 The comparison results of training time

In summary, our model performs best on datasets rich in lexical information and demonstrates greater robustness. Additionally, despite the introduction of graph networks, there has been no significant increase in training time. Moreover, compared to other models utilizing graph networks, our model has the shortest training time.

4.3 Ablation study

To validate each component of our model, we perform ablation experiments on four datasets, and present the results in Table 4. In Fig. 4, we have highlighted with gray rectangles the specific structures that need to be removed or modified from the complete model for the ablation experiments. For instance, the "-BiLSTM" and "-Residual" experiments require removing the corresponding parts from the complete model as shown in Fig. 4, while the "-Self-loop" and "-Second fusion" experiments necessitate replacing the corresponding parts in the complete model with the structures depicted in Fig. 4. The reported results are the average performance of the model over five runs on the corresponding datasets. We propose that preserving the original semantics of characters and lexical nodes during the interactive fusion of characters and lexical information enables effective feature representation. This, in turn, results in all nodes in our constructed directed graph having a self-looping edge. Specifically, in the "-Self-loop" experiment, we remove this self-loop edge, and the results show that the F1 value decreases by 0.51%, 1.48%, 1.56%, and 0.16% on the CCKS2020, Ontonotes, Weibo, and Resume datasets, respectively, which proves the importance of maintaining the original semantic information of the nodes.

Fig. 4

Ablation experiment model structure diagram

Furthermore, we propose that the structure using the residual connection and layer normalization can comprehensively exploit the effect of GAT in fusing character-level information and lexical information. In the "-Residual" experiment, we remove the residual connection and layer normalization structure to investigate their impact on the model’s performance. Our results show that the model’s performance decreases on four datasets: CCKS2020 (\(-\)0.47%), Ontonotes (\(-\)0.96%), Weibo (\(-\)1.35%), and Resume (\(-\)0.26%), respectively. This finding provides evidence that the residual connection and layer normalization structure facilitate the enhanced feature representation of node information. The significance of this structural component is further underscored by the fact that its removal consistently leads to inferior performance across all four datasets.

We propose that using the BiLSTM network to compute character vectors and word vectors separately yields a more comprehensive and richer contextual representation and thus improves named entity recognition. To verify this, we remove the BiLSTM structure in the "-BiLSTM" experiment. The results decrease on all four datasets, CCKS2020 (\(-\)0.73%), Ontonotes (\(-\)1.74%), Weibo (\(-\)0.71%), and Resume (\(-\)0.68%), confirming that computing contextual representations for the character vectors and word vectors is worthwhile.

Table 4 The ablation study of our proposed model

In our graph structure, the nodes representing characters fuse the information of the lexical nodes connected to them, while the nodes representing words can likewise learn the information of the character nodes connected to them. We believe that after the first fusion of character and lexical information by GAT, a secondary fusion should be performed to make full use of the lexical information. To support this claim, we conduct the "-Second fusion" experiment, in which the secondary fusion is removed from the model. The results on the four datasets, CCKS2020 (\(-\)0.57%), Ontonotes (\(-\)0.52%), Weibo (\(-\)0.89%), and Resume (\(-\)0.26%), show a decrease in model performance, which serves as evidence of the importance and effectiveness of the secondary fusion.

Furthermore, the ’Training time’ column of Table 4 compares the time required for a single training run of the complete model and of the models with certain structures removed on each dataset, measured in seconds. The training time of the complete model is slightly higher than that of the ablated models, but the increase is not significant; sacrificing some training time to improve model performance is worthwhile.

4.4 Case study

We analyze several examples from the CCKS2020 dataset to visually demonstrate that our model makes fuller use of lexical information to improve named entity recognition; the results are shown in Table 5. In the first example, the SoftLexicon model and our model correctly recognize the entity “左下肢 (left lower limb)”, but the BBC model only recognizes “下肢 (lower limb)”, probably because the BBC model does not utilize lexical information and can learn only limited knowledge. In the second example, “肠息肉切除术 (colon polyp removal surgery)” as a whole is an entity; our model and the SoftLexicon model identify it correctly, but the BBC model identifies “肠息肉切除 (colon polyp removal)”, again because the BBC model learns less than models that use lexical information. In the third example, our model correctly identifies the entity “左侧颈动脉中段 (left middle carotid artery)”, whereas the BBC and SoftLexicon models identify “左侧颈动脉 (left carotid artery)”; the reason is that our model utilizes lexical information more fully than SoftLexicon. These results show that using lexical information can significantly improve the accuracy of entity recognition, and that the interactive fusion of character and lexical information helps the model learn features more fully.

Table 5 Case study, with bold and italic indicating correctly and incorrectly identified entities, respectively

5 Conclusion

In the CNER task, prior studies have mainly exploited lexical information to enhance character decoding and have failed to investigate the interactive relationship between characters and lexical information. In this paper, we propose an interactive fusion approach that integrates characters and lexical information using a graph attention network. By employing two independent BiLSTM networks to process character and lexical information separately, we obtain richer contextual information. Through an interactive graph structure, character features are influenced by lexical information, enhancing their expressiveness, while lexical features acquire richer character-level details, enhancing their expressiveness in turn. By introducing feedforward neural networks, residual connections, and layer normalization after the graph attention network, we fully exploit its fusion effect, and a secondary fusion is then performed. Ablation experiments demonstrate the effectiveness of each part of our proposed model, and comparative experiments show that, although our model’s training time is not the shortest, it outperforms competing models in utilizing lexical information and achieves state-of-the-art performance on some datasets. This paper contributes to the field of CNER and suggests potential avenues for future work to further advance the performance of CNER models.