Protein Representation Learning with
Sequence Information Embedding:
Does it Always Lead to a Better Performance?
This work was supported by the National Natural Science Foundation of China (11974239; 62302291), the Innovation Program of Shanghai Municipal Education Commission (2019-01-07-00-02-E00076), Shanghai Jiao Tong University Scientific and Technological Innovation Funds (21X010200843), the Student Innovation Center at Shanghai Jiao Tong University, and Shanghai Artificial Intelligence Laboratory.
Abstract
Deep learning has become a crucial tool in studying proteins. While the significance of modeling protein structure has been discussed extensively in the literature, amino acid types are typically included in the input as a default operation for many inference tasks. Using the structure alignment task, this study demonstrates that embedding amino acid types does not always help a deep learning model learn better representations. To this end, we propose ProtLOCA, a local geometry alignment method based solely on amino acid structure representation. The effectiveness of ProtLOCA is examined by a global structure-matching task on protein pairs with an independent test dataset based on CATH labels. Our method outperforms existing sequence- and structure-based representation learning methods by matching structurally consistent protein domains more quickly and accurately. Furthermore, in local structure pairing tasks, ProtLOCA for the first time provides a valid solution to highlight common local structures among proteins with different overall structures but the same function. This suggests a new possibility for using deep learning methods to analyze protein structure to infer function.
Index Terms:
Protein Structure Alignment, Protein Representation Learning, Deep Learning, Graph Neural Networks
I Introduction
In recent years, an increasing number of studies have designed deep learning-based solutions to understand the construction principles of proteins, including tasks like structure folding [1], sequence design [2], and function prediction [3]. These attempts are based on the important relationship deduced by biologists that protein sequence determines structure and structure determines function [4]. However, the complex composition and numerous variables of protein molecules make this relationship extremely intricate, preventing the creation of a simple theoretical system. Additionally, experimental validation is too costly to test all possible proteins. Therefore, deep learning algorithms have been designed to help discover from large databases the mapping relationships between protein sequence, structure, and function. An increasing number of studies have developed deep learning methods to solve specific biological problems, achieving great success in validation across various downstream tasks [5].
The dominant methods for protein representation learning currently focus on feature extraction from protein sequences, due to the abundance of amino acid sequence data and the development of language models. On the other hand, with the development of structure prediction models [6, 7], protein structure datasets have also become significantly enriched, leading some studies to utilize geometric deep learning methods [8, 9, 10] to extract three-dimensional protein structures and incorporate them into hidden representations. Moreover, recent studies have found that incorporating both sequence and structure information can further enhance the expressivity of the embeddings, leading to better performance in prediction tasks such as mutation effect prediction [11, 12] and binding-affinity prediction [13].
Nowadays, unless the task is sequence inference (i.e., sequence data is not used as input), amino acid sequence information is always included in the model input. Unlike the necessity of structure embedding, which has been discussed extensively by various studies [14, 15, 16], to the best of our knowledge, no work has explicitly explored the role of incorporating sequence information into neural networks in any form. This leads us to propose the following research question: Is sequence information really always a beneficial element for any protein representation learning task? To address this question, we delve into the structure alignment task, where the inference objective is to determine the similarity between two protein structures. We compare the performance of models trained with and without amino acid sequence information. As shown in Table I and Fig. 3, we find that including sequence information interferes with the prediction in the structure matching task. This observation aligns with biological intuition: highly dissimilar sequences can still fold into similar structures. Therefore, incorporating sequence features when summarizing protein structural characteristics may dilute the important information in the learned embeddings, leading to significant matching errors. On the other hand, although sequences generally determine protein structures, in some special cases, highly similar protein sequences can fold into different structures. Hence, matching structures based on sequence information may introduce additional errors.
Given that sequence information is not always beneficial for inference tasks on proteins, such as local structure alignment, it is natural to ask: how can structural information be encoded effectively for amino acids in sequence-irrelevant tasks? This paper introduces ProtLOCA for PROTein LOCal structure Alignment. The model processes the three-dimensional structure of the protein with roto-equivariant graph neural networks to extract vector representations of the amino acid local geometry. The proposed ProtLOCA is validated on two protein structure alignment tasks. In global protein structure matching (Fig. 1), we assign binary classification labels for protein domains, where the ground-truth label is defined by the CATH classification system [17]. ProtLOCA achieves state-of-the-art performance over various sequence-based and structure-based protein feature extraction methods. For the second task of local structure alignment (Fig. 2), we leverage ProtLOCA to find common local folding in proteins that have different overall structures. We select DNA binding proteins, a crucial type of regulator in gene processes, whose local structures for DNA regulation share a similar fold while their overall structures differ [18]. For two DNA binding proteins from different species, ProtLOCA effectively identifies the common local structure that is crucial for their function, even though their overall structures differ. In comparison, existing global alignment methods like TM-align [19] fail to locate such local similarity.
In summary, this study contributes in three aspects.
1. We find that amino acid sequence information is not always beneficial for encoding effective representations for protein inference tasks and demonstrate this through structure alignment, an important structural biology task.
2. We separate an independent subset from CATH4.3 and introduce CATH-aligns and CATH-aligns+, two standard structure matching benchmark datasets based on high-quality protein domain labels. We also provide a comprehensive comparison of popular sequence-based and structure-based protein encoding methods on the two benchmarks.
3. We propose ProtLOCA, a protein structure embedding method that achieves state-of-the-art performance on global protein structure alignment tasks. Additionally, we validate ProtLOCA on a specific task, demonstrating its effectiveness in identifying similar local structures.
II Global Structure Matching
II-A Problem Formulation
Consider three arbitrary peptide chains $P_1$, $P_2$, and $P_3$, where $P_1$ and $P_2$ share a similar global structure. While $P_3$ has a significantly different overall structure, it contains a common substructure with $P_1$ and $P_2$. A global structure matching method evaluates the overall similarity of peptide chain pairs, e.g., it assigns a high similarity score to the pair $(P_1, P_2)$ and a low similarity score to both $(P_1, P_3)$ and $(P_2, P_3)$.
II-B Feature Representation
Define a graph representation of a peptide chain’s backbone. Each node represents an amino acid, and spatially close nodes (i.e., with Euclidean distance smaller than 10Å) are connected by directed edges. For the $i$th amino acid, the node feature is composed of scalars and vectors. The scalar feature contains one-hot encodings of structure tokens, such as DSSP-based secondary structure [20] or FoldSeek embedding [21]. The vector feature summarizes the spatial relationship of neighboring heavy atoms along the sequence, including two directional vectors given by the coordinates of the C$\alpha$ atoms ($\mathrm{C}\alpha_{i+1} - \mathrm{C}\alpha_i$; $\mathrm{C}\alpha_{i-1} - \mathrm{C}\alpha_i$) and a tetrahedral geometry unit vector computed from the backbone bond vectors $\mathrm{N}_i - \mathrm{C}\alpha_i$ and $\mathrm{C}_i - \mathrm{C}\alpha_i$.
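The node vector features described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `node_vector_features`, the tuple-based coordinate format, and the zero padding at the chain ends are hypothetical, and the tetrahedral coefficients follow the idealized backbone geometry commonly used with GVP [8].

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def scale(a, k):
    return tuple(k * x for x in a)


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(a, eps=1e-8):
    """Scale a 3D vector to unit length; near-zero vectors become zeros."""
    norm = math.sqrt(sum(x * x for x in a))
    return scale(a, 1.0 / norm) if norm > eps else (0.0, 0.0, 0.0)


def node_vector_features(n_xyz, ca_xyz, c_xyz):
    """Per residue, return (forward, backward, imputed side-chain) unit vectors.

    Each argument is a list of (x, y, z) tuples, one per residue, holding the
    backbone N, C-alpha, and C coordinates.
    """
    zero = (0.0, 0.0, 0.0)
    last = len(ca_xyz) - 1
    features = []
    for i, ca in enumerate(ca_xyz):
        # Directions to the next and previous C-alpha (assumed zero at chain ends).
        forward = unit(sub(ca_xyz[i + 1], ca)) if i < last else zero
        backward = unit(sub(ca_xyz[i - 1], ca)) if i > 0 else zero
        n = unit(sub(n_xyz[i], ca))
        c = unit(sub(c_xyz[i], ca))
        bisector = unit(add(n, c))
        perp = unit(cross(n, c))
        # With these coefficients the result has cosine -1/3 with both n and c
        # whenever the N-CA-C angle is itself tetrahedral.
        side = sub(scale(perp, math.sqrt(2 / 3)), scale(bisector, math.sqrt(1 / 3)))
        features.append((forward, backward, side))
    return features
```

Because the bisector and the cross product are orthogonal, the imputed vector always has unit length, regardless of the actual N-C$\alpha$-C angle.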
Similarly, on the edge connecting node $j$ to node $i$, we define edge features by scalar features and vector features. The scalar feature concatenates the radial basis function (RBF) representation (we use 16 Gaussian radial basis functions with centers evenly spaced between 0 and 20Å) and the sinusoidal positional encoding (we use the positional encoding method described in Transformer [22]) of the relative Euclidean distance between the two nodes. The vector feature is defined by the direction of the edge.
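A compact way to picture the edge construction and the edge scalar features is the following plain-Python sketch. The 10Å cutoff, the 16 Gaussian centers, and the 0 to 20Å range come from the text; the function names, the Gaussian width `sigma`, and the encoding dimension are assumptions introduced only for illustration.

```python
import math


def radius_edges(ca_xyz, cutoff=10.0):
    """Directed edges (j, i, distance) between distinct residues closer than cutoff."""
    edges = []
    for i, a in enumerate(ca_xyz):
        for j, b in enumerate(ca_xyz):
            d = math.dist(a, b)
            if i != j and d < cutoff:
                edges.append((j, i, d))
    return edges


def rbf(d, d_min=0.0, d_max=20.0, n_centers=16):
    """Gaussian radial basis expansion with evenly spaced centers."""
    centers = [d_min + k * (d_max - d_min) / (n_centers - 1) for k in range(n_centers)]
    sigma = (d_max - d_min) / n_centers  # assumed width, not stated in the text
    return [math.exp(-((d - mu) / sigma) ** 2) for mu in centers]


def sinusoidal_encoding(x, dim=16):
    """Transformer-style encoding; cos and sin halves are concatenated, not interleaved."""
    freqs = [math.exp(-math.log(10000.0) * k / dim) for k in range(0, dim, 2)]
    return [math.cos(x * f) for f in freqs] + [math.sin(x * f) for f in freqs]


def edge_scalar_features(ca_xyz):
    """Map each directed edge (j, i) to its concatenated scalar features."""
    # The distance d is encoded here, following the wording of the text.
    # GVP [8] instead encodes the sequence offset j - i; swap d for (j - i)
    # in the second term to follow that convention.
    return {(j, i): rbf(d) + sinusoidal_encoding(d) for j, i, d in radius_edges(ca_xyz)}
```

The nested loop is quadratic in chain length, which is acceptable for single domains; a spatial index would be the usual replacement for larger inputs.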
II-C Model Architecture
ProtLOCA implements geometric vector perceptrons (GVP) [8] to extract scalar and vector features from the nodes and edges of protein graphs. For an arbitrary protein graph , a GVP layer computes embeddings for the scalar feature and the vector feature , i.e.,
(1)
The key to a layer is composed of multiple iterations of scalar-vector propagations, defined in (1). At the th () iteration,
(2)
Here and are learnable parameters for this layer, denote row-wise multiplication, denotes row-wise normalization, and represents the sigmoid activation function. At , the initial input . At the last layer when , it outputs . We set in each of the layers.
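As an illustration of one scalar-vector propagation step, the sketch below follows the geometric vector perceptron formulation of [8] in plain Python. It is not a transcription of (2): the weight names `W_h`, `W_mu`, `W_m`, the bias `b`, and the use of a sigmoid for both gates are assumptions made for the example.

```python
import math


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def row_norms(V):
    """Euclidean norm of every row of V."""
    return [math.sqrt(sum(x * x for x in row)) for row in V]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gvp(s, V, W_h, W_mu, W_m, b):
    """One propagation step on scalar channels s and vector channels V.

    s: list of n scalars; V: nu rows of 3D vectors.
    W_h: h x nu, W_mu: mu x h, W_m: m x (h + n), b: list of m biases.
    Returns (m updated scalars, mu updated 3D vectors).
    """
    V_h = matmul(W_h, V)            # mix vector channels; coordinates untouched
    V_mu = matmul(W_mu, V_h)        # project to the output vector channels
    s_cat = row_norms(V_h) + list(s)  # rotation-invariant norms join the scalars
    s_out = [sigmoid(sum(w * x for w, x in zip(row, s_cat)) + bias)
             for row, bias in zip(W_m, b)]
    # Row-wise gating rescales each output vector without changing its direction.
    V_out = [[sigmoid(norm) * x for x in row]
             for row, norm in zip(V_mu, row_norms(V_mu))]
    return s_out, V_out
```

Because vectors enter only through linear channel mixing, row norms, and row-wise gating, rotating the input vectors rotates the output vectors by the same rotation and leaves the output scalars unchanged.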
The separately encoded scalar and vector representations , before sending to further prediction, are combined to obtain an AA-level matrix representation. This requires additional transformations, which we define as a GVP Transform layer. As introduced below, we first define a concatenated feature . For the th node and the edge of connected nodes , we define:
(3)
where each feature vector is a concatenation of scalar features and vector features , is the number of incoming messages from ’s neighbors, and and are the embedding of scalars and vectors for node and edge .
We also add an extra feed-forward layer when updating the node representation
(4)
where denotes a GVP layer with . We use superscripts to distinguish it from the previous GVP layers in (1) and (3), which includes three layers of scalar-vector propagations defined in (2). In comparison, in (4) only applies 2 layers of scalar-vector propagation.
The stack of GVP convolution and feed-forward transformation defined in (1)-(4) constructs a GVP-GNN block. The block is repeated multiple times to obtain expressive node representations. The feed-forward layer is applied at the end of every GVP-GNN block except for the last block. In implementation, we set the number of repetitions to . See the ablation studies in Section IV for more details.
The node representation is sent to readout layers for label prediction. In the training phase, a dense layer is employed to recover the input tokens:
(5)
For prediction tasks, i.e., global structure matching, we obtain a vector representation for the input protein with a normalized average pooling layer, i.e.,
(6)
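To measure the similarity of a protein pair with the respective learned vector representations $\mathbf{z}_1$ and $\mathbf{z}_2$, we define the cosine similarity:
$$\mathrm{sim}(\mathbf{z}_1, \mathbf{z}_2) = \frac{\mathbf{z}_1^\top \mathbf{z}_2}{\|\mathbf{z}_1\|_2\,\|\mathbf{z}_2\|_2}. \tag{7}$$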
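The readout and scoring steps can be sketched as follows. The sketch reads "normalized average pooling" as averaging the per-residue vectors and rescaling the mean to unit length, which is an assumption; the function names are hypothetical.

```python
import math


def l2_normalize(v, eps=1e-12):
    """Rescale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]


def pooled_embedding(H):
    """Average the rows of H (one row per residue), then L2-normalize.

    Proteins of different lengths map to vectors of the same dimension.
    """
    n = len(H)
    mean = [sum(col) / n for col in zip(*H)]
    return l2_normalize(mean)


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal dimension."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Since the pooled vectors already have unit length under this reading, their dot product and their cosine similarity coincide.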
II-D Training Objective
Training ProtLOCA only involves the scalar and vector features extracted from backbone coordinates, excluding inputs directly related to amino acid types. The model is trained in a self-supervised learning manner with the objective of denoising the perturbed node features. Two types of corruption approaches are considered for adding noise, including masking and permutation. In the former, values in are set to with a probability ; in the latter, values in are randomly replaced by another value with a probability of . Additional discussion on the tunable parameter can be found in Section IV.
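The two corruption schemes can be illustrated with a short sketch on integer structure tokens. Everything beyond the text is an assumption: the function name, the dedicated `mask_id`, the choice to apply masking and permutation as mutually exclusive events, and any probability values passed in.

```python
import random


def corrupt_tokens(tokens, vocab_size, mask_id, p_mask, p_perm, rng=None):
    """Return (noisy_tokens, targets) for a denoising objective.

    With probability p_mask a token is replaced by mask_id; with probability
    p_perm it is replaced by a different token drawn uniformly from the
    vocabulary; otherwise it is kept. Targets are the clean tokens.
    """
    if rng is None:
        rng = random.Random()
    noisy = []
    for t in tokens:
        u = rng.random()
        if u < p_mask:
            noisy.append(mask_id)
        elif u < p_mask + p_perm:
            # Draw a replacement that differs from the original token.
            noisy.append(rng.choice([k for k in range(vocab_size) if k != t]))
        else:
            noisy.append(t)
    return noisy, list(tokens)
```

A training step would feed the noisy tokens to the encoder and compare the dense-layer output of (5) with the clean targets, for example with a cross-entropy loss.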
III Local Structure Alignment
III-A Problem Formulation
Consider two arbitrary peptide chains $P_1$ and $P_2$ with different sequence lengths and overall structures but a common substructure. A local structure alignment task aims to identify highly similar local regions in the input pair $(P_1, P_2)$.
III-B ProtLOCA for Local Structure Alignment
In the global structure matching task, we employ average pooling on the amino acid representations of the protein pairs to obtain vector representations for comparing protein-level similarity. However, this simplified method cannot provide insights into the alignment of protein local regions. Because functionally similar proteins may share only similar active regions while differing in overall structure, discovering local alignments of proteins can be essential for functional region identification and analysis. To this end, we introduce a modified ProtLOCA with a simple heuristic algorithm to highlight similar regions for protein pairs. After extracting the hidden representation for nodes by (4), we conduct the following three steps for local alignment identification.
III-B1 Candidate Selection
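For two proteins $P_1$ and $P_2$ with $n_1$ and $n_2$ amino acids, respectively, ProtLOCA extracts $d$-dimensional representations $\mathbf{H}_1 \in \mathbb{R}^{n_1 \times d}$ and $\mathbf{H}_2 \in \mathbb{R}^{n_2 \times d}$. Similar to the global matching task, we score the similarity between the two matrices by the cosine similarity:
$$S_{ij} = \frac{\mathbf{h}_{1,i}^\top \mathbf{h}_{2,j}}{\|\mathbf{h}_{1,i}\|_2\,\|\mathbf{h}_{2,j}\|_2}, \quad i = 1, \dots, n_1,\; j = 1, \dots, n_2, \tag{8}$$
where $\mathbf{h}_{1,i}$ and $\mathbf{h}_{2,j}$ denote the $i$th row of $\mathbf{H}_1$ and the $j$th row of $\mathbf{H}_2$. We use the output similarity matrix $\mathbf{S} \in \mathbb{R}^{n_1 \times n_2}$ to identify structurally aligned regions between the two proteins. Intuitively speaking, the similarity scores on the diagonal indicate the point-to-point alignment of the two proteins. By selecting high values on the diagonal, the corresponding structurally aligned local regions of the two proteins are recognized.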
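The residue-level similarity matrix can be computed as in the following plain-Python sketch. The function name and the list-of-rows input format are assumptions; rows with zero norm are left unscaled to avoid a division by zero.

```python
import math


def cosine_similarity_matrix(H1, H2):
    """Entry [i][j] is the cosine similarity between row i of H1 and row j of H2.

    H1 has one row per residue of the first protein, H2 one row per residue
    of the second; both must share the same embedding dimension.
    """
    def normalize(row):
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        return [x / norm for x in row]

    A = [normalize(r) for r in H1]
    B = [normalize(r) for r in H2]
    return [[sum(a * b for a, b in zip(ra, rb)) for rb in B] for ra in A]
```

For two proteins with 3 and 2 residues the result is a 3 by 2 matrix, so the two chains do not need to have the same length.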
III-B2 Redundancy Removal
To further investigate the regional similarity of the two proteins, we set a similarity threshold for the diagonal and a minimum structure size for the similar local structure of interest. We first iterate over all possible subset blocks along the diagonal line with the size from until . The possible are those that . We record the mean and variance for all candidate s. The second step removes redundant blocks from the candidate group with an overlap threshold . We traverse all candidate . For two arbitrary , if more than rows or columns are overlapped, the smaller block matrix will be dropped. After the two steps, we obtain a set of non-overlapping candidate regions. In this study, we set , , and .
III-B3 Unconditional Ranking
To identify the best matching local structures, we sort the obtained candidates by their variance (calculated from the first step of candidate selection) in ascending order. This unconditional ranking approach assumes no prior knowledge about the specific region to be matched (e.g., active site). In cases where the target region is known, we can optionally employ a conditional ranking method. This method sorts the candidates based on the degree of index-level overlap between the query structure and the candidate structures in descending order.
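The three steps above can be sketched as a small heuristic. This is one possible reading, not the authors' implementation: the sketch assumes that a candidate block is a diagonal run whose scores all reach the threshold, that overlap is measured as a fraction of the smaller block's size, and that larger blocks are visited first. All function and parameter names are hypothetical.

```python
from statistics import mean, pvariance


def candidate_blocks(S, threshold, min_size):
    """Enumerate diagonal runs of S with at least min_size scores, all >= threshold."""
    n1, n2 = len(S), len(S[0])
    blocks = []
    for i in range(n1):
        for j in range(n2):
            diag = []
            for t in range(min(n1 - i, n2 - j)):
                if S[i + t][j + t] < threshold:
                    break
                diag.append(S[i + t][j + t])
                if len(diag) >= min_size:
                    k = len(diag)
                    # Index ranges are half-open: rows [i, i + k), cols [j, j + k).
                    blocks.append({"rows": (i, i + k), "cols": (j, j + k),
                                   "mean": mean(diag), "var": pvariance(diag)})
    return blocks


def overlap(a, b):
    """Length of the intersection of two half-open index ranges."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def remove_redundant(blocks, max_overlap):
    """Keep larger blocks first; drop a block that shares more than
    max_overlap (fraction of its own size) of rows or columns with a kept one."""
    kept = []
    for blk in sorted(blocks, key=lambda b: b["rows"][1] - b["rows"][0], reverse=True):
        size = blk["rows"][1] - blk["rows"][0]
        if all(overlap(blk["rows"], k["rows"]) <= max_overlap * size and
               overlap(blk["cols"], k["cols"]) <= max_overlap * size
               for k in kept):
            kept.append(blk)
    return kept


def rank_unconditional(blocks):
    """Sort candidates by the variance of their diagonal scores, ascending."""
    return sorted(blocks, key=lambda b: b["var"])
```

A conditional ranking would replace the sort key with the index overlap between each candidate and a known query region, sorted in descending order.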
Type | Name | Version | # Params | AA input | Structure input | CATH-aligns average | CATH-aligns fold 1 | CATH-aligns fold 2 | CATH-aligns fold 3 | CATH-aligns+ average | CATH-aligns+ fold 1 | CATH-aligns+ fold 2 | CATH-aligns+ fold 3
Alignment | FoldSeek [21] | 3Di | - | ✗ | ✓ | 0.900 | 0.903 | 0.901 | 0.897 | 0.891 | 0.893 | 0.892 | 0.888
Alignment | FoldSeek [21] | 3Di-AA | - | ✓ | ✓ | 0.888 | 0.889 | 0.888 | 0.886 | 0.881 | 0.882 | 0.881 | 0.879
Embedding | ESM2 [23] | t33_650M | 650M | ✓ | ✗ | 0.685 | 0.685 | 0.684 | 0.687 | 0.672 | 0.672 | 0.674 | 0.671
Embedding | ESM2 [23] | t36_3B | 3,000M | ✓ | ✗ | 0.700 | 0.697 | 0.699 | 0.704 | 0.685 | 0.685 | 0.687 | 0.682
Embedding | ESM2 [23] | t48_15B | 15,000M | ✓ | ✗ | 0.814 | 0.813 | 0.814 | 0.814 | 0.788 | 0.788 | 0.790 | 0.786
Embedding | ProstT5 [24] | AA2fold | 3,000M | ✓ | ✗ | 0.907 | 0.905 | 0.909 | 0.908 | 0.851 | 0.851 | 0.852 | 0.850
Embedding | ProstT5 [24] | fold2AA | 3,000M | ✗ | ✓ | 0.921 | 0.921 | 0.920 | 0.922 | 0.838 | 0.841 | 0.839 | 0.834
Embedding | ESM-IF [14] | - | 148M | ✓ | ✓ | 0.625 | 0.624 | 0.625 | 0.627 | 0.851 | 0.853 | 0.851 | 0.849
Embedding | MIF-ST [15] | - | 643M | ✓ | ✓ | 0.882 | 0.897 | 0.873 | 0.877 | 0.614 | 0.611 | 0.616 | 0.616
Embedding | ProtLOCA (Ours) | - | 5.9M | ✗ | ✓ | 0.965 | 0.966 | 0.964 | 0.964 | 0.895 | 0.895 | 0.895 | 0.895
IV Experimental Analysis
ProtLOCA is pre-trained on an unlabeled protein structure dataset from CATH4.3 (introduced below). We examine ProtLOCA on protein structure alignment tasks involving both global structure matching and local structure alignment. For the global structure matching task, we provide quantitative comparisons with baseline methods on two independent benchmark datasets, CATH-aligns and CATH-aligns+. For the local structure alignment, due to the lack of appropriate datasets and quantitative evaluation metrics, we investigate the model’s performance through a case study. All experiments were conducted on 8 A800 GPUs, each with 80GB VRAM. The implementation will be released upon acceptance.
IV-A CATH-aligns: Benchmark for Structure Alignment
We construct CATH-aligns, a new benchmark with standard quantitative evaluation criteria. We process the dataset from CATH 4.3 (the official dataset can be found at http://download.cathdb.info/cath/releases/all-releases/v4_3_0/), a comprehensive dataset with experimentally determined protein domain structures. All structures are labeled with a four-level CATH classification code [25] that classifies the protein’s structural type from different perspectives. We remove incomplete protein entities that include missing atomic coordinates for and . All proteins are below of sequence identity to each other. A total of are left for constructing the independent test set CATH-aligns.
For structure alignment prediction, we define a binary classification task with the split test subset from CATH4.3. We consider two levels of classification difficulty and name them CATH-aligns and CATH-aligns+, respectively. CATH-aligns defines negative pairs as protein domains whose CATH classification codes differ at all four levels and positive pairs as those sharing an identical code at any of the four levels. CATH-aligns+ defines a more difficult task, where structure pairs with identical CATH codes at all four levels are considered positive sample pairs, while pairs differing at any level are considered negative sample pairs. To ensure computational efficiency and balance the number of positive and negative samples, we prepare three folds for evaluation, each containing positive and negative pairs that are randomly sampled from the complete pairs of CATH-aligns. The prediction results are assessed using the AUC (area under the curve) metric, where an AUC closer to 1 indicates better predictive performance.
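The labeling rules and the evaluation metric can be made concrete with a short sketch. It assumes each domain's CATH code is given as a 4-tuple and that levels are compared position by position; a hierarchical prefix comparison would be an alternative reading. The function names are hypothetical, and the AUC is computed by direct pairwise comparison rather than with a library routine.

```python
def pair_label(code_a, code_b, strict=False):
    """Binary label for a pair of CATH codes given as (C, A, T, H) tuples.

    strict=False follows the CATH-aligns rule: positive if any level matches.
    strict=True follows the CATH-aligns+ rule: positive only if all levels match.
    """
    matches = [x == y for x, y in zip(code_a, code_b)]
    return int(all(matches)) if strict else int(any(matches))


def auc(labels, scores):
    """Probability that a random positive pair outscores a random negative pair.

    Ties count as one half. Quadratic in the number of pairs, which is fine
    for a sketch; a rank-based formula scales better.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For instance, scores `[0.9, 0.8, 0.3, 0.1]` with labels `[1, 0, 1, 0]` give three correctly ordered positive-negative pairs out of four, i.e., an AUC of 0.75.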
IV-B Experimental Protocol
Training Setup
ProtLOCA is optimized with AdamW [26] with a learning rate of . The maximum number of training epochs is set to , and early stopping is applied with a patience of epochs. To keep GPU memory usage stable during training, the maximum number of nodes per batch is set to . The GVP module consists of layers and a dropout ratio of . The embedding dimensions are set to for , for , for , and for . During inference, the input is masked to to obtain the representation vectors for each point in the protein structure. All experiments are conducted on an A800 GPU with 80GB of memory, and the training process is logged using WandB.
Dataset for Self-Supervised Learning
We use unlabeled CATH4.3_s40 for training our graph representation learning model. All structures in the dataset are processed with the similarity threshold at , containing a total of protein domain structures. The training target is to recover the noisy input tokens defined in Section II-D. A subset of domains is split randomly for model validation. Although the training dataset is unlabeled, we further ensure that the sequence identity between the training set and the test datasets CATH-aligns is below to avoid data leakage.
Baseline Methods
We compare ProtLOCA with a set of alignment-based and embedding-based deep learning methods. For alignment methods, we consider two variants of FoldSeek [21], using 3Di with pure structural input and 3Di with both structural and amino acid (AA) input. This method encodes local structures and uses traditional alignment algorithms for point-by-point comparison of structures. For the global structure matching task, we exclude TM-align [19] from the baseline list because of its high computational cost. To compute the similarity of all structure pairs in the test dataset, TM-align would consume approximately hours. In comparison, ProtLOCA spends less than hour, including the data preprocessing and scoring steps. For embedding methods, our comparison includes the pre-trained sequence-based language model ESM2 [23] at different model scales. The structure-aware pre-trained model ProstT5 [24] supports both AA2fold and fold2AA modes for translation tasks; we take amino acid sequences and FoldSeek sequences as input, respectively, to obtain embeddings. We also include two inverse-folding methods, ESM-IF [14] and MIF-ST [15], which take amino acid sequences as input. Unlike alignment methods, embedding methods average the per-residue embeddings of a protein to obtain a protein-level vector and use the dot product of these vectors to measure overall protein similarity.
IV-C Results Analysis
Baseline Comparison
Table I reports the performance comparison of ProtLOCA and other baseline models on CATH-aligns and CATH-aligns+. In both alignment tasks, ProtLOCA outperforms the other embedding methods and even exceeds the performance of the classic alignment-based baseline FoldSeek, with an average AUC of 0.965 on CATH-aligns and 0.895 on CATH-aligns+. Note that the training cost for ProtLOCA is lower than that of all baseline methods due to a significantly smaller number of trainable parameters (5.9M). Additionally, it is trained on a relatively small dataset of approximately samples. This training set size is smaller than what is typically required for deep protein models, which usually demand millions or more samples to train effectively. Furthermore, structure-based algorithms (e.g., ESM-IF) generally perform better than sequence-based methods. Notably, ESM2, despite achieving state-of-the-art performance in many downstream tasks, does not perform well in the structure alignment task. Additionally, the results of both FoldSeek and ProtLOCA demonstrate that incorporating amino acid information during training can indeed reduce the overall predictive performance of the models. These experimental results support our initial claim that amino acid information does not always contribute to learning more expressive hidden embeddings, and embeddings learned with sequence information do not consistently enhance prediction performance across downstream tasks.
Sensitivity Analysis
We examine the impact of two hyperparameters on the performance of ProtLOCA: the masking noise ratio (with the permutation ratio being ) and the number of GVP layers. The results are visualized in the left two subplots in Fig. 3. The prediction accuracy is insensitive to both hyperparameters, with changes of less than observed over a considerably large range. We use and GVP layers as the default settings for the model.
Input and Denoising Token
Fig. 3 (right) compares the effect of different types of input node features. We consider three types of node features: the classic amino acid type, the secondary structure codes (DSSP), and the hidden structure codes (3Di). Overall, using 3Di encoding yields the best prediction performance on the structure alignment task. More importantly, incorporating amino acid information during the model training significantly degrades model performance (green bars). This observation is consistent with the previous analysis and our key assumption, where considering amino acid information in the structure alignment task may introduce unnecessary interference, leading to poor prediction performance in downstream tasks.
IV-D Case Study: HTH Functional Structure Alignment
The helix-turn-helix (HTH) motif is a crucial structural component in DNA binding proteins, including transcription factors regulating gene expression [18]. It comprises two alpha-helices joined by a ‘turn’, with the second helix, known as the recognition helix (Fig. 4A), specifically interacting with DNA (Fig. 4B). This interaction is essential for gene regulation, as it enables proteins containing the HTH motif to control the transcription process by attaching to DNA’s promoters or operators [27]. We first use TM-align to identify the HTH motif in the phage lambda cII protein (Fig. 4C) [28] and the transcriptional regulator PA2196 (Fig. 4D) [29]. In the phage lambda cII protein, the HTH motif identified by TM-align lies at a different position in the protein than the ground-truth motif. For the transcriptional regulator PA2196, the HTH motif identified by TM-align is much shorter than the ground-truth motif. These cases demonstrate TM-align’s limitations in accurately identifying the correct HTH motif in DNA binding proteins. In contrast, ProtLOCA effectively identifies the correct HTH motif in these two proteins, despite their different overall folds. Thus, ProtLOCA performs better than TM-align in identifying critical motifs in proteins that share the same function but differ in overall structure.
V Related Work
V-A Sequence Representation
With the growth of protein sequences and advancements in natural language modeling methods, the most commonly used approaches in protein representation learning typically involve unsupervised training on protein sequences, without considering protein structural information. For example, ESM2 [23], ESM-1v [30], and ESM-1b [31] use different redundancy levels of the Uniref dataset [32], employing the BERT [33] architecture and a masked language modeling unsupervised training objective to train models for downstream tasks related to representation learning or zero-shot mutation tasks in protein engineering. ProtTrans [34] has introduced a series of protein language representation models, such as ProtBert, ProtT5, and ProtAlBert, based on BERT [33], T5 [35] or AlBert [36] architectures, primarily applied to various downstream tasks of representation learning. Ankh [37] uses an asymmetric encoder-decoder approach and explores a series of training parameters to train language models that perform well on downstream tasks. Additionally, methods like CARP [38] and ProteinBert [39] use 1D-CNN instead of the attention mechanism to improve training efficiency for processing longer sequences.
V-B Structure Representation
With the increase in crystal structures and advancements in folding techniques [40, 23], protein structure databases have become increasingly large [41]. Currently, mainstream methods use sequence information as the training target for structural inputs or as auxiliary node features, with few models considering only protein structures while discarding amino acid types. For instance, GearNet [10] uses contrastive learning to enhance representation quality for protein enzyme commission (EC) number prediction, and ProtLGN [42] employs multi-task learning and denoising training objectives to improve zero-shot prediction capabilities for protein mutations. Additionally, models like GVP [8] and EGNN [43] use graph neural networks to model the equivariance and invariance of proteins for protein quality prediction tasks. Some inverse folding methods use protein structures as input to restore amino acid information, achieving structure-aware training. For example, ESM-IF [14] uses GVP to initialize transformer node features, and ProstT5 [24] uses Foldseek’s structural tokens as input and amino acid sequences as output (or vice versa) for machine translation training. Furthermore, some approaches combine language models and graph neural networks to enhance the quality of representation learning. Examples include MIF-ST [15], which integrates CARP [38] and Struct GNN, ProtSSN [11], which combines ESM2-650M and EGNN structures, and LM-GVP [12].
VI Conclusion and Discussion
Protein function annotation and analysis typically rely on protein sequences and overall structural information. However, these approaches come with their own set of challenges. Sequence-based analysis, such as EC numbers and Pfam datasets, does not consistently yield accurate results. This is partly because pinning down a protein’s position on the evolutionary tree can be problematic when only its sequence is considered. In addition, methods that align overall protein structures, such as TM-align, may overlook proteins that are characterized by local structural conservation amid overall structural variability. Protein functions are mainly determined by key sub-structures, such as catalytic regions and binding pockets, while the remaining structures determine the physical properties of proteins. In light of these limitations of current methodologies and their biological significance, we developed ProtLOCA, which focuses on local structural matches within proteins with diverse overall folds. This tool unlocks new perspectives on protein functional and structural evolution.
References
- [1] A. W. Senior, R. Evans, J. Jumper, J. Kirkpatrick, L. Sifre, T. Green, C. Qin, A. Žídek, A. W. Nelson, A. Bridgland et al., “Improved protein structure prediction using potentials from deep learning,” Nature, vol. 577, no. 7792, pp. 706–710, 2020.
- [2] A. Madani, B. Krause, E. R. Greene, S. Subramanian, B. P. Mohr, J. M. Holton, J. L. Olmos, C. Xiong, Z. Z. Sun, R. Socher et al., “Large language models generate functional protein sequences across diverse families,” Nature Biotechnology, vol. 41, no. 8, pp. 1099–1106, 2023.
- [3] T. Yu, H. Cui, J. C. Li, Y. Luo, G. Jiang, and H. Zhao, “Enzyme function prediction using contrastive learning,” Science, vol. 379, no. 6639, pp. 1358–1363, 2023.
- [4] J. Koehler Leman, P. Szczerbiak, P. D. Renfrew, V. Gligorijevic, D. Berenberg, T. Vatanen, B. C. Taylor, C. Chandler, S. Janssen, A. Pataki et al., “Sequence-structure-function relationships in the microbial protein universe,” Nature Communications, vol. 14, no. 1, p. 2351, 2023.
- [5] N. Sapoval, A. Aghazadeh, M. G. Nute, D. A. Antunes, A. Balaji, R. Baraniuk, C. Barberan, R. Dannenfelser, C. Dun, M. Edrisi et al., “Current progress and open challenges for applying deep learning across the biosciences,” Nature Communications, vol. 13, no. 1, p. 1728, 2022.
- [6] J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool, R. Bates, A. Žídek, A. Potapenko et al., “Highly accurate protein structure prediction with AlphaFold,” Nature, vol. 596, no. 7873, pp. 583–589, 2021.
- [7] J. Abramson, J. Adler, J. Dunger, R. Evans, T. Green, A. Pritzel, O. Ronneberger, L. Willmore, A. J. Ballard, J. Bambrick et al., “Accurate structure prediction of biomolecular interactions with AlphaFold 3,” Nature, pp. 1–3, 2024.
- [8] B. Jing, S. Eismann, P. Suriana, R. J. L. Townshend, and R. Dror, “Learning from protein structure with geometric vector perceptrons,” in ICLR, 2020.
- [9] B. Zhou, L. Zheng, B. Wu, Y. Tan, O. Lv, K. Yi, G. Fan, and L. Hong, “Protein engineering with lightweight graph denoising neural networks,” Journal of Chemical Information and Modeling, 2023.
- [10] Z. Zhang, M. Xu, A. Jamasb, V. Chenthamarakshan, A. Lozano, P. Das, and J. Tang, “Protein representation learning by geometric structure pretraining,” arXiv preprint arXiv:2203.06125, 2022.
- [11] Y. Tan, B. Zhou, L. Zheng, G. Fan, and L. Hong, “Semantical and topological protein encoding toward enhanced bioactivity and thermostability,” bioRxiv, pp. 2023–12, 2023.
- [12] Z. Wang, S. A. Combs, R. Brand, M. R. Calvo, P. Xu, G. Price, N. Golovach, E. O. Salawu, C. J. Wise, S. P. Ponnapalli et al., “LM-GVP: an extensible sequence and structure informed deep learning framework for protein property prediction,” Scientific Reports, vol. 12, no. 1, p. 6832, 2022.
- [13] S. Li, J. Zhou, T. Xu, L. Huang, F. Wang, H. Xiong, W. Huang, D. Dou, and H. Xiong, “Structure-aware interactive graph neural networks for the prediction of protein-ligand binding affinity,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 975–985.
- [14] C. Hsu, R. Verkuil, J. Liu, Z. Lin, B. Hie, T. Sercu, A. Lerer, and A. Rives, “Learning inverse folding from millions of predicted structures,” in ICML. PMLR, 2022, pp. 8946–8970.
- [15] K. K. Yang, N. Zanichelli, and H. Yeh, “Masked inverse folding with sequence transfer for protein representation learning,” Protein Engineering, Design and Selection, vol. 36, 2023.
- [16] P. Notin, A. W. Kollasch, D. Ritter, L. Van Niekerk, S. Paul, H. Spinner, N. J. Rollins, A. Shaw, R. Weitzman, J. Frazer et al., “ProteinGym: Large-scale benchmarks for protein fitness prediction and design,” in NeurIPS, 2023.
- [17] I. Sillitoe, N. Bordin, N. Dawson, V. P. Waman, P. Ashford, H. M. Scholes, C. S. Pang, L. Woodridge, C. Rauer, N. Sen et al., “CATH: increased structural coverage of functional space,” Nucleic Acids Research, vol. 49, no. D1, pp. D266–D273, 2021.
- [18] Y. Takeda, D. Ohlendorf, W. Anderson, and B. Matthews, “DNA-binding proteins,” Science, vol. 221, no. 4615, pp. 1020–1026, 1983.
- [19] Y. Zhang and J. Skolnick, “TM-align: a protein structure alignment algorithm based on the TM-score,” Nucleic Acids Research, vol. 33, no. 7, pp. 2302–2309, 2005.
- [20] W. Kabsch and C. Sander, “Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features,” Biopolymers: Original Research on Biomolecules, vol. 22, no. 12, pp. 2577–2637, 1983.
- [21] M. Van Kempen, S. S. Kim, C. Tumescheit, M. Mirdita, J. Lee, C. L. Gilchrist, J. Söding, and M. Steinegger, “Fast and accurate protein structure search with Foldseek,” Nature Biotechnology, vol. 42, no. 2, pp. 243–246, 2024.
- [22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
- [23] Z. Lin, H. Akin, R. Rao, B. Hie, Z. Zhu, W. Lu, N. Smetanin, R. Verkuil, O. Kabeli, Y. Shmueli et al., “Evolutionary-scale prediction of atomic-level protein structure with a language model,” Science, vol. 379, no. 6637, pp. 1123–1130, 2023.
- [24] M. Heinzinger, K. Weissenow, J. G. Sanchez, A. Henkel, M. Steinegger, and B. Rost, “ProstT5: Bilingual language model for protein sequence and structure,” bioRxiv, pp. 2023–07, 2023.
- [25] C. Orengo, A. Michie, S. Jones, D. Jones, M. Swindells, and J. Thornton, “CATH – a hierarchic classification of protein domain structures,” Structure, vol. 5, no. 8, pp. 1093–1109, 1997.
- [26] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations, 2015.
- [27] A. Ishihama, “Prokaryotic genome regulation: multifactor promoters, multitarget regulators and hierarchic networks,” FEMS Microbiology Reviews, vol. 34, no. 5, pp. 628–645, 2010.
- [28] D. Jain, Y. Kim, K. L. Maxwell, S. Beasley, R. Zhang, G. N. Gussin, A. M. Edwards, and S. A. Darst, “Crystal structure of bacteriophage CII and its DNA complex,” Molecular Cell, vol. 19, no. 2, pp. 259–269, 2005.
- [29] Y. Kim, Y. Kang, and J. Choe, “Crystal structure of Pseudomonas aeruginosa transcriptional regulator PA2196 bound to its operator DNA,” Biochemical and Biophysical Research Communications, vol. 440, no. 2, pp. 317–321, 2013.
- [30] J. Meier, R. Rao, R. Verkuil, J. Liu, T. Sercu, and A. Rives, “Language models enable zero-shot prediction of the effects of mutations on protein function,” in NeurIPS, vol. 34, 2021, pp. 29287–29303.
- [31] A. Rives, J. Meier, T. Sercu, S. Goyal, Z. Lin, J. Liu, D. Guo, M. Ott, C. L. Zitnick, J. Ma et al., “Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences,” Proceedings of the National Academy of Sciences, vol. 118, no. 15, p. e2016239118, 2021.
- [32] B. E. Suzek, Y. Wang, H. Huang, P. B. McGarvey, C. H. Wu, and the UniProt Consortium, “UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches,” Bioinformatics, vol. 31, no. 6, pp. 926–932, 2015.
- [33] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv:1810.04805, 2018.
- [34] A. Elnaggar, M. Heinzinger, C. Dallago, G. Rehawi, W. Yu, L. Jones, T. Gibbs, T. Feher, C. Angerer, M. Steinegger, D. Bhowmik, and B. Rost, “ProtTrans: Towards cracking the language of life's code through self-supervised deep learning and high performance computing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
- [35] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of Machine Learning Research, vol. 21, no. 140, pp. 1–67, 2020.
- [36] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “ALBERT: A lite BERT for self-supervised learning of language representations,” arXiv preprint arXiv:1909.11942, 2019.
- [37] A. Elnaggar, H. Essam, W. Salah-Eldin, W. Moustafa, M. Elkerdawy, C. Rochereau, and B. Rost, “Ankh: Optimized protein language model unlocks general-purpose modelling,” arXiv preprint arXiv:2301.06568, 2023.
- [38] K. K. Yang, A. X. Lu, and N. Fusi, “Convolutions are competitive with transformers for protein sequence pretraining,” in ICLR Machine Learning for Drug Discovery, 2022.
- [39] N. Brandes, D. Ofer, Y. Peleg, N. Rappoport, and M. Linial, “ProteinBERT: A universal deep-learning model of protein sequence and function,” Bioinformatics, vol. 38, no. 8, pp. 2102–2110, 2022.
- [40] J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool, R. Bates, A. Žídek, A. Potapenko et al., “Highly accurate protein structure prediction with AlphaFold,” Nature, vol. 596, no. 7873, pp. 583–589, 2021.
- [41] M. Varadi, S. Anyango, M. Deshpande, S. Nair, C. Natassia, G. Yordanova, D. Yuan, O. Stroe, G. Wood, A. Laydon et al., “AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models,” Nucleic Acids Research, vol. 50, no. D1, pp. D439–D444, 2022.
- [42] B. Zhou, L. Zheng, B. Wu, Y. Tan, O. Lv, K. Yi, G. Fan, and L. Hong, “Protein engineering with lightweight graph denoising neural networks,” bioRxiv, 2023.
- [43] V. G. Satorras, E. Hoogeboom, and M. Welling, “E(n) equivariant graph neural networks,” in International Conference on Machine Learning. PMLR, 2021, pp. 9323–9332.