Protein Representation Learning with
Sequence Information Embedding:
Does it Always Lead to a Better Performance?
thanks: This work was supported by the National Natural Science Foundation of China (11974239; 62302291), the Innovation Program of Shanghai Municipal Education Commission (2019-01-07-00-02-E00076), Shanghai Jiao Tong University Scientific and Technological Innovation Funds (21X010200843), the Student Innovation Center at Shanghai Jiao Tong University, and Shanghai Artificial Intelligence Laboratory.

Yang Tan Shanghai Jiao Tong University
Shanghai, China
[email protected]
   Lirong Zheng University of Michigan
MI, USA
[email protected]
   Bozitao Zhong Shanghai Jiao Tong University
Shanghai, China
[email protected]
   Liang Hong Shanghai Jiao Tong University
Shanghai, China
[email protected]
   Bingxin Zhou Shanghai Jiao Tong University
Shanghai, China
[email protected]
Abstract

Deep learning has become a crucial tool in studying proteins. While the significance of modeling protein structure has been discussed extensively in the literature, amino acid types are typically included in the input by default for many inference tasks. This study demonstrates, using the structure alignment task, that embedding amino acid types does not always help a deep learning model learn better representations. To this end, we propose ProtLOCA, a local geometry alignment method based solely on amino acid structure representation. The effectiveness of ProtLOCA is examined on a global structure-matching task on protein pairs with an independent test dataset based on CATH labels. Our method outperforms existing sequence- and structure-based representation learning methods by matching structurally consistent protein domains more quickly and accurately. Furthermore, in local structure pairing tasks, ProtLOCA for the first time provides a valid solution to highlight common local structures among proteins with different overall structures but the same function. This suggests a new possibility for using deep learning methods to analyze protein structure to infer function.

Index Terms:
Protein Structure Alignment, Protein Representation Learning, Deep Learning, Graph Neural Networks

I Introduction

In recent years, an increasing number of studies have designed deep learning-based solutions to understand the construction principles of proteins, including tasks like structure folding [1], sequence design [2], and function prediction [3]. These attempts are based on the important relationship deduced by biologists that protein sequence determines structure and structure determines function [4]. However, the complex composition and numerous variables of protein molecules make this relationship extremely intricate, preventing the creation of a simple theoretical system. Additionally, experimental validation is too costly to test all possible proteins. Therefore, deep learning algorithms have been designed to help discover from large databases the mapping relationships between protein sequence, structure, and function. An increasing number of studies have developed deep learning methods to solve specific biological problems, achieving great success in validation across various downstream tasks [5].

The dominant methods for protein representation learning currently focus on feature extraction from protein sequences, due to the abundance of amino acid sequence data and the development of language models. On the other hand, with the development of structure prediction models [6, 7], protein structure datasets have also become significantly enriched, leading some studies to utilize geometric deep learning methods [8, 9, 10] to extract three-dimensional protein structures and incorporate them into hidden representations. Moreover, recent studies have found that incorporating both sequence and structure information can further enhance the expressivity of the embeddings, leading to better performance in prediction tasks such as mutation effect prediction [11, 12] and binding-affinity prediction [13].

Figure 1: An illustrative pipeline of ProtLOCA for structure pairing (see Section II). We employ ProtLOCA to extract protein vector representations for protein structures and calculate the cosine similarity between the learned hidden representation of protein pairs.

Nowadays, unless the task is sequence inference (i.e., sequence data is not used as input), amino acid sequence information is always included in the model input. Unlike the necessity of structure embedding, which has been discussed extensively by various studies [14, 15, 16], to the best of our knowledge, no work has explicitly explored the role of incorporating sequence information into neural networks in any form. This leads us to propose the following research question: Is sequence information really always a beneficial element for any protein representation learning task? To address this question, we delve into the structure alignment task, where the inference objective is to determine the similarity between two protein structures. We compare the performance of models trained with and without amino acid sequence information. As shown in Table I and Fig. 3, we find that including sequence information interferes with the prediction in the structure matching task. This observation aligns with biological intuition: highly dissimilar sequences can still fold into similar structures. Therefore, incorporating sequence features when summarizing protein structural characteristics may dilute the important information in the learned embeddings, leading to significant matching errors. Conversely, although sequences generally determine protein structures, in some special cases highly similar protein sequences can fold into different structures. Hence, matching structures based on sequence information may introduce additional errors.

Since sequence information is not always beneficial for certain inference tasks on proteins, such as local structure alignment, it is natural to ask how structural information can be encoded effectively for amino acids in sequence-irrelevant tasks. This paper introduces ProtLOCA for PROTein LOCal structure Alignment. The model processes the three-dimensional structure of the protein with rotation-equivariant graph neural networks to extract vector representations of the amino acid local geometry. The proposed ProtLOCA is validated on two protein structure alignment tasks. In global protein structure matching (Fig. 1), we assign binary classification labels to pairs of protein domains, where the ground-truth label is defined by the CATH classification system [17]. ProtLOCA achieves state-of-the-art performance over various sequence-based and structure-based protein feature extraction methods. For the second task of local structure alignment (Fig. 2), we leverage ProtLOCA to find common local folds in proteins that have different overall structures. We select a crucial type of regulator in gene processes called DNA binding proteins, whose local structures for DNA regulation share a similar fold while their overall structures differ [18]. For two DNA binding proteins from different species, ProtLOCA effectively identifies the common local structure that is crucial for their function, even though their overall structures differ. In comparison, existing global alignment methods like TM-align [19] fail to locate such local similarity.

In summary, this study contributes in three aspects.

  1. We find that amino acid sequence information is not always beneficial for encoding effective representations for protein inference tasks, and we demonstrate this through structure alignment, an important task in structural biology.

  2. We separate an independent subset from CATH4.3 and introduce CATH-aligns and CATH-aligns+, two standard structure matching benchmark datasets based on high-quality protein domain labels. We also provide a comprehensive comparison of popular sequence-based and structure-based protein encoding methods on the two benchmarks.

  3. We propose ProtLOCA, a protein structure embedding method that achieves state-of-the-art performance on global protein structure alignment tasks. Additionally, we validate ProtLOCA on a specific task, demonstrating its effectiveness in identifying similar local structures.

II Global Structure Matching

II-A Problem Formulation

Consider three arbitrary peptide chains $\mathcal{P}_1$, $\mathcal{P}_2$, and $\mathcal{P}_3$, where $\mathcal{P}_1$ and $\mathcal{P}_2$ share a similar global structure. While $\mathcal{P}_3$ has a significantly different overall structure, it shares a common substructure $\mathcal{P}'$ with $\mathcal{P}_1$ and $\mathcal{P}_2$. Global structure matching evaluates the overall similarity of peptide chain pairs, e.g., it assigns a high similarity score to the $(\mathcal{P}_1,\mathcal{P}_2)$ pair and a low similarity score to both $(\mathcal{P}_1,\mathcal{P}_3)$ and $(\mathcal{P}_2,\mathcal{P}_3)$.

Figure 2: An illustrative pipeline of ProtLOCA for local structure alignment (see Section III). We employ ProtLOCA for residue-level point-to-point matching, which identifies similar local structures on proteins with different overall structures.

II-B Feature Representation

Define $\mathcal{G}=(\mathcal{V},\mathcal{E})$ as the graph representation of a peptide chain's backbone. Each node $v\in\mathcal{V}$ represents an amino acid, and spatially close nodes (i.e., with Euclidean distance smaller than 10Å) are connected by directed edges $e\in\mathcal{E}$. For the $i$th amino acid, the node feature is composed of scalars and vectors, i.e., $\bm{H}_v^i=(\bm{N}_S^i,\bm{N}_V^i)$. The scalar feature $\bm{N}_S^i$ contains one-hot encodings of structure tokens, such as DSSP-based secondary structure [20] or FoldSeek embedding [21].
The vector feature $\bm{N}_V^i\in\mathbb{R}^{3\times 3}$ summarizes the spatial relationship of neighboring heavy atoms along the sequence, including two directional vectors defined by the coordinates of the $C_\alpha$ atoms ($\bm{N}_{V,1}^i=C_{\alpha_{i+1}}-C_{\alpha_i}$; $\bm{N}_{V,2}^i=C_{\alpha_{i-1}}-C_{\alpha_i}$) and a tetrahedral geometry unit vector

$$\bm{N}_{V,3}^i=\sqrt{\frac{1}{3}}\frac{(\mathbf{n}\times\mathbf{c})}{\|\mathbf{n}\times\mathbf{c}\|_2}-\sqrt{\frac{2}{3}}\frac{(\mathbf{n}+\mathbf{c})}{\|\mathbf{n}+\mathbf{c}\|_2},$$

where $\mathbf{n}=N_i-C_{\alpha_i}$ and $\mathbf{c}=C_i-C_{\alpha_i}$.
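The vector features above can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: the helper names (`node_vector_features`, `unit`, and so on) are hypothetical, coordinates are plain 3-element lists, and terminal residues, which lack one of the two $C_\alpha$ neighbors, are not handled.

```python
import math

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def scale(a, s):
    return [x * s for x in a]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def unit(a):
    norm = math.sqrt(sum(x * x for x in a))
    return [x / norm for x in a]

def node_vector_features(n_atom, ca_prev, ca, ca_next, c_atom):
    """Return the three rows of N_V^i for one interior residue (hypothetical helper)."""
    forward = sub(ca_next, ca)     # N_{V,1}: C-alpha(i+1) - C-alpha(i)
    backward = sub(ca_prev, ca)    # N_{V,2}: C-alpha(i-1) - C-alpha(i)
    n = sub(n_atom, ca)            # n = N_i - C-alpha(i)
    c = sub(c_atom, ca)            # c = C_i - C-alpha(i)
    tetra = sub(scale(unit(cross(n, c)), math.sqrt(1 / 3)),
                scale(unit(add(n, c)), math.sqrt(2 / 3)))  # N_{V,3}
    return [forward, backward, tetra]
```

Because $\mathbf{n}\times\mathbf{c}$ is orthogonal to $\mathbf{n}+\mathbf{c}$, the squared length of the third row is $1/3+2/3=1$, which is why it can be called a unit vector.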

Similarly, on the edge between two connected nodes from $v_i$ to $v_j$, we define the edge features $\bm{H}_e^{ij}$ by scalar features and vector features. The scalar feature $\bm{E}_S^{ij}\in\mathbb{R}^{32}$ concatenates the radial basis function (RBF) representation of $\|C_{\alpha_j}-C_{\alpha_i}\|_2$ (we use 16 Gaussian radial basis functions with centers evenly spaced between 0 and 20Å) and the sinusoidal positional encoding of the relative Euclidean distance between $v_i$ and $v_j$ (we use the positional encoding method described in Transformer [22]).
The vector feature $\bm{E}_V^{ij}\in\mathbb{R}^{3}$ is defined by the direction of $C_{\alpha_i}-C_{\alpha_j}$.
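As a rough sketch of how such a 32-dimensional edge scalar feature could be assembled, the snippet below expands a $C_\alpha$ distance with 16 Gaussian RBFs and a 16-dimensional Transformer-style sinusoidal encoding. The even 16 + 16 split, the Gaussian width `sigma`, and all function names are assumptions made for illustration; the text only fixes the number of RBFs, their range, and the total dimension.

```python
import math

def rbf(distance, d_min=0.0, d_max=20.0, count=16):
    """Gaussian RBF expansion with centers evenly spaced in [d_min, d_max]."""
    step = (d_max - d_min) / (count - 1)
    centers = [d_min + k * step for k in range(count)]
    sigma = (d_max - d_min) / count   # assumed width, not given in the text
    return [math.exp(-((distance - mu) / sigma) ** 2) for mu in centers]

def sinusoidal_encoding(value, dim=16):
    """Transformer-style encoding: interleaved sin/cos at geometrically spaced frequencies."""
    enc = []
    for k in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * k / dim))
        enc.append(math.sin(value * freq))
        enc.append(math.cos(value * freq))
    return enc

def edge_scalar_feature(ca_i, ca_j):
    """Concatenate both encodings of the C-alpha distance (16 + 16 = 32 values)."""
    d = math.dist(ca_i, ca_j)
    return rbf(d) + sinusoidal_encoding(d)
```

For example, `edge_scalar_feature([0.0, 0.0, 0.0], [3.0, 4.0, 0.0])` encodes a distance of 5Å into a list of 32 values.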

II-C Model Architecture

ProtLOCA implements geometric vector perceptrons (GVP) [8] to extract scalar and vector features from the nodes and edges of protein graphs. For an arbitrary protein graph $\mathcal{G}$, a GVP layer computes embeddings for the scalar feature $\bm{H}_S$ and the vector feature $\bm{H}_V$, i.e.,

$$(\bm{H}_S',\bm{H}_V')=\mathrm{GVP}(\bm{H}_S,\bm{H}_V). \tag{1}$$

A $\mathrm{GVP}(\cdot)$ layer, as defined in (1), is composed of multiple iterations of scalar-vector propagation. At the $(\ell+1)$th iteration ($0\leq\ell<L$),

$$\begin{aligned}
\bm{H}_S^{(\ell+1)} &= \sigma\big(\bm{W}_1\cdot\mathrm{concat}(\mathrm{norm}(\bm{W}_2\bm{H}_V^{(\ell)}),\bm{H}_S^{(\ell)})+\bm{b}\big), \\
\bm{H}_V^{(\ell+1)} &= \sigma\big(\mathrm{norm}(\bm{V}^{(\ell+1)})\big)\odot\bm{V}^{(\ell+1)}, \\
\text{where } \bm{V}^{(\ell+1)} &= \bm{W}_4\bm{W}_2\bm{H}_V^{(\ell)}.
\end{aligned} \tag{2}$$

Here $\bm{W}_1,\bm{W}_2,\bm{W}_4$ and $\bm{b}$ are learnable parameters for this layer, $\odot$ denotes row-wise multiplication, $\mathrm{norm}(\cdot)$ denotes the row-wise $\mathbb{L}_2$ norm, and $\sigma(\cdot)$ represents the sigmoid activation function. At $\ell=0$, the initial input is $(\bm{H}_S^{(0)},\bm{H}_V^{(0)})=(\bm{H}_S,\bm{H}_V)$. At the last iteration, when $\ell+1=L$, the layer outputs $(\bm{H}_S',\bm{H}_V')=(\bm{H}_S^{(L)},\bm{H}_V^{(L)})$.
We set $L=3$ in each of the $\mathrm{GVP}(\cdot)$ layers.
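A single scalar-vector propagation can be sketched in pure Python as follows. This is an illustrative reading of (2), not the authors' implementation: it assumes that the second argument of the concatenation is the current scalar feature, that `norm` returns one $\mathbb{L}_2$ norm per row, and that the matrix shapes are as given in the docstring. All function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def row_norms(m):
    return [math.sqrt(sum(x * x for x in row)) for row in m]

def gvp_step(h_s, h_v, w1, w2, w4, b):
    """One scalar-vector propagation step.

    Assumed shapes: h_s has n scalars, h_v is nu x 3, w2 is h x nu,
    w4 is mu x h, w1 is m x (h + n), and b has m entries.
    """
    v_hidden = matmul(w2, h_v)                      # W_2 H_V
    feats = row_norms(v_hidden) + h_s               # concat(norm(W_2 H_V), scalars)
    new_s = [sigmoid(sum(w * f for w, f in zip(row, feats)) + bias)
             for row, bias in zip(w1, b)]
    v = matmul(w4, v_hidden)                        # V = W_4 W_2 H_V
    # Gate every vector channel by the sigmoid of its own length.
    new_v = [[sigmoid(g) * x for x in row] for g, row in zip(row_norms(v), v)]
    return new_s, new_v
```

Because the scalar branch only sees vector norms and the vector branch only mixes and rescales whole vectors, rotating the input vectors leaves the scalar output unchanged and rotates the vector output in the same way.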

The separately encoded scalar and vector representations $(\bm{H}_S',\bm{H}_V')$ are combined into an amino acid-level (AA-level) matrix representation before being passed on for prediction. This requires additional transformations, which we define as a GVP Transform layer. We first define a concatenated feature $\bm{H}=\mathrm{concat}(\bm{H}_S',\bm{H}_V')$. For the $i$th node and the edge between connected nodes $i\rightarrow j$, we define:

$$\begin{aligned}
\mathbf{h}_m^{ij} &:= \mathrm{GVP}\left(\mathrm{concat}\left(\mathbf{h}_v^{j},\mathbf{h}_e^{ij}\right)\right) \\
\mathbf{h}_v^{i} &\leftarrow \mathrm{LayerNorm}\left(\mathbf{h}_v^{i}+\frac{1}{k}\mathrm{Dropout}\left(\sum_{j:\mathbf{e}_{ij}\in\mathcal{E}}\mathbf{h}_m^{ij}\right)\right),
\end{aligned} \tag{3}$$

where each feature vector $\mathbf{h}$ is a concatenation of scalar features $\bm{H}_S'$ and vector features $\bm{H}_V'$, $k$ is the number of incoming messages from $i$'s neighbors, and $\mathbf{h}_v^{i}$ and $\mathbf{h}_m^{ij}$ are the embeddings of scalars and vectors for node $i$ and edge $i\to j$.
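The node update in (3) can be illustrated with a simplified sketch. Several simplifications are assumed here: features are flat lists rather than separate scalar and vector channels, dropout is omitted (as at inference time), and `message_fn` is a hypothetical stand-in for the GVP that must return a message of the same length as the node feature.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a feature list to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def update_nodes(h, edge_feats, message_fn):
    """Apply the residual mean-aggregation update once, without dropout.

    h: dict mapping node id -> feature list.
    edge_feats: dict mapping (i, j) -> feature list of edge i -> j.
    message_fn: hypothetical stand-in for the GVP acting on concat(h_j, h_e_ij).
    """
    new_h = {}
    for i, h_i in h.items():
        messages = [message_fn(h[j] + e_ij)
                    for (src, j), e_ij in edge_feats.items() if src == i]
        k = len(messages)
        if k:
            aggregated = [sum(col) / k for col in zip(*messages)]
        else:
            aggregated = [0.0] * len(h_i)   # isolated node: no incoming messages
        new_h[i] = layer_norm([a + m for a, m in zip(h_i, aggregated)])
    return new_h
```

Dividing the summed messages by $k$ keeps the update on a comparable scale for residues with few and with many spatial neighbors.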

We also add an extra feed-forward layer when updating the node representation:

$$\mathbf{h}_v^{i}\leftarrow\mathrm{LayerNorm}\left(\mathbf{h}_v^{i}+\mathrm{Dropout}\left(\mathrm{GVP}'\left(\mathbf{h}_v^{i}\right)\right)\right), \tag{4}$$

where $\mathrm{GVP}'$ denotes a GVP layer with $L=2$. We use the prime to distinguish it from the GVP layers in (1) and (3), which include three iterations of the scalar-vector propagation defined in (2). In comparison, $\mathrm{GVP}'$ in (4) applies only two iterations of scalar-vector propagation.

The stack of GVP convolution and feed-forward transformation defined in (1)-(4) constitutes a GVP-GNN block. The block is repeated multiple times to obtain expressive node representations. The feed-forward layer is applied at the end of every GVP-GNN block except the last. In our implementation, we set the number of repetitions to $6$. See the ablation studies in Section IV for more details.

The node representation is sent to readout layers for label prediction. In the training phase, a dense layer is employed to recover the input tokens:

$$y=\mathbf{W}(\mathrm{ReLU}(\mathrm{DropOut}(\mathbf{W}\mathbf{H}_S))). \tag{5}$$

For prediction tasks, i.e., global structure matching, we obtain a vector representation for the input protein with a normalized average pooling layer, i.e.,

$$\mathbf{h}_v=\Big\|\frac{1}{n}\sum_{i=1}^{n}\mathbf{h}_v^{i}\Big\|. \tag{6}$$

To measure the similarity of a protein pair $(\mathcal{P}_1,\mathcal{P}_2)$ with the respective learned vector representations $(\bm{h}_1,\bm{h}_2)$, we define the cosine similarity:

$$\mathrm{sim}(\mathcal{P}_1,\mathcal{P}_2)=\frac{\bm{h}_1\cdot\bm{h}_2}{\|\bm{h}_1\|\cdot\|\bm{h}_2\|}. \tag{7}$$
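The pooling and scoring steps in (6) and (7) amount to only a few operations. The sketch below assumes that node embeddings are given as a list of equal-length lists and that the normalization in (6) rescales the mean vector to unit length; the function names are illustrative.

```python
import math

def normalized_mean_pool(node_embeddings):
    """Average the node embeddings and rescale the result to unit length."""
    n = len(node_embeddings)
    mean = [sum(col) / n for col in zip(*node_embeddings)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]

def cosine_similarity(h1, h2):
    """Cosine similarity between two protein-level vectors."""
    dot = sum(a * b for a, b in zip(h1, h2))
    norm1 = math.sqrt(sum(a * a for a in h1))
    norm2 = math.sqrt(sum(b * b for b in h2))
    return dot / (norm1 * norm2)
```

Since the pooled vectors have fixed dimension regardless of chain length, two proteins with different numbers of amino acids can be compared directly.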

II-D Training Objective

Training ProtLOCA only involves the scalar and vector features extracted from backbone coordinates, excluding inputs directly related to amino acid types. The model is trained in a self-supervised manner with the objective of denoising perturbed node features. Two corruption approaches are considered for adding noise: masking and permutation. In the former, values in $\bm{N}_S$ are set to $0$ with probability $p$; in the latter, values in $\bm{N}_S$ are randomly replaced by another $\bm{N}_S$ value with probability $1-p$. Additional discussion on the tunable parameter $p$ can be found in Section IV.
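The two corruption schemes can be sketched as follows. The granularity is an assumption: here each node's whole scalar feature row is masked or replaced, and replacements are drawn uniformly from the rows of the same protein. The function names and the explicit `rng` argument are illustrative choices.

```python
import random

def mask_features(n_s, p, rng):
    """Set each node's scalar feature row to zeros with probability p."""
    return [[0.0] * len(row) if rng.random() < p else list(row) for row in n_s]

def permute_features(n_s, p, rng):
    """Replace each node's scalar feature row by a randomly drawn row with probability 1 - p."""
    return [list(rng.choice(n_s)) if rng.random() < 1 - p else list(row) for row in n_s]
```

Passing a seeded generator such as `random.Random(0)` makes the corruption reproducible; the denoising objective then asks the model to recover the original rows from the corrupted ones.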

III Local Structure Alignment

III-A Problem Formulation

Consider two arbitrary peptide chains $\mathcal{P}_1$ and $\mathcal{P}_3$ with different sequence lengths and overall structures but a common substructure $\mathcal{P}'$. A local structure alignment task aims to identify the highly similar local regions $\mathcal{P}'$ in the input pair $(\mathcal{P}_1,\mathcal{P}_3)$.

III-B ProtLOCA for Local Structure Alignment

In the global structure matching task, we employ average pooling on the amino acid representations of the protein pairs to obtain vector representations for comparing protein-level similarity. However, this simplified method cannot provide insights into the alignment of local protein regions. Because functionally similar proteins may share only similar active regions while differing in overall structure, discovering local alignments of proteins could be essential for functional region identification and analysis. To this end, we introduce a modified ProtLOCA with a simple heuristic algorithm to highlight similar regions for protein pairs. After extracting the hidden representation for nodes by (4), we conduct the following three steps for local alignment identification.

III-B1 Candidate Selection

For two proteins 𝒫1subscript𝒫1{\mathcal{P}}_{1}caligraphic_P start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and 𝒫3subscript𝒫3{\mathcal{P}}_{3}caligraphic_P start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT with m𝑚mitalic_m and n𝑛nitalic_n amino acids, respectively, ProtLOCA extracts 256256256256-dimensional representations 𝑯1m×256subscript𝑯1superscript𝑚256{\bm{H}}_{1}\in\mathbb{R}^{m\times 256}bold_italic_H start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_m × 256 end_POSTSUPERSCRIPT and 𝑯3n×256subscript𝑯3superscript𝑛256{\bm{H}}_{3}\in\mathbb{R}^{n\times 256}bold_italic_H start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × 256 end_POSTSUPERSCRIPT. Similar to the global matching task, we score the similarity between the two matrices by the cosine similarity:

$$\mathrm{sim}(\mathcal{P}_1,\mathcal{P}_3)=\frac{\bm{H}_1\cdot\bm{H}_3}{\|\bm{H}_1\|\cdot\|\bm{H}_3\|} \tag{8}$$

We use the output similarity matrix $\mathrm{sim}(\mathcal{P}_1,\mathcal{P}_3)\in\mathbb{R}^{m\times n}$ to identify structurally aligned regions between the two proteins. Intuitively, the similarity scores on the diagonal indicate the point-to-point alignment of the two proteins. By selecting high values on the diagonal, the corresponding structurally aligned local regions of the two proteins are recognized.
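As a minimal sketch of this scoring step, the snippet below reads Eq. (8) residue-wise: each row of the two representation matrices is normalised to unit length, so that their product gives an $m\times n$ matrix of cosine similarities. The function name `residue_similarity`, the `eps` guard against zero vectors, and the random stand-in inputs are illustrative assumptions, not the released implementation.

```python
import numpy as np


def residue_similarity(h1: np.ndarray, h3: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Return the (m, n) matrix of cosine similarities between rows of h1 and h3."""
    # Normalise every residue embedding (row) to unit length.
    h1_unit = h1 / np.maximum(np.linalg.norm(h1, axis=1, keepdims=True), eps)
    h3_unit = h3 / np.maximum(np.linalg.norm(h3, axis=1, keepdims=True), eps)
    # Entry (i, j) compares residue i of the first protein with residue j of the second.
    return h1_unit @ h3_unit.T


# Random stand-ins for the 256-dimensional residue representations (assumed shapes).
rng = np.random.default_rng(0)
H1 = rng.normal(size=(120, 256))  # m = 120 residues
H3 = rng.normal(size=(95, 256))   # n = 95 residues
S = residue_similarity(H1, H3)
print(S.shape)
```

Normalising once per protein and then taking a single matrix product keeps the cost at one $m\times 256$ by $256\times n$ multiplication per pair.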

III-B2 Redundancy Removal

To further investigate the regional similarity of the two proteins, we set a similarity threshold $\mu$ for the diagonal and a minimum structure size $s$ for the similar local structure of interest. We first iterate over all possible sub-blocks $\bm{H}'\in\bm{H}$ along the diagonal line, with sizes ranging from $s\times s$ up to $m\times n$. The candidate $\bm{H}'$ are those satisfying $\mathrm{mean}(\mathrm{diag}(\bm{H}'))>\mu$. We record the mean and variance for all candidate $\bm{H}'$s. The second step removes redundant blocks from the candidate group with an overlap threshold $d$. We traverse all candidate $\bm{H}'$. For two arbitrary candidates $\bm{H}'_1,\bm{H}'_3$, if more than $d$ rows or columns overlap, the smaller block matrix is dropped. After the two steps, we obtain a set of non-overlapping candidate regions. In this study, we set $\mu=0.8$, $s=10$, and $d=5$.
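The two steps above can be sketched as follows. This is one possible reading, under several assumptions that the text does not fix: a block is treated as a contiguous diagonal run that may start at any cell of the similarity matrix, the recorded variance is that of the diagonal entries, and redundancy is resolved greedily from the largest block down. The function names and the toy thresholds in the usage lines are hypothetical.

```python
from statistics import fmean, pvariance


def select_candidates(sim, mu, s):
    """Enumerate diagonal runs of length >= s whose mean similarity exceeds mu.

    Each candidate is (row_start, col_start, length, mean, variance).
    """
    m, n = len(sim), len(sim[0])
    candidates = []
    for i in range(m):
        for j in range(n):
            max_len = min(m - i, n - j)
            # Similarities along the diagonal that starts at cell (i, j).
            diag = [sim[i + k][j + k] for k in range(max_len)]
            for length in range(s, max_len + 1):
                values = diag[:length]
                mean = fmean(values)
                if mean > mu:
                    candidates.append((i, j, length, mean, pvariance(values)))
    return candidates


def remove_redundant(candidates, d):
    """Visit larger blocks first; drop a block overlapping a kept one by more than d rows or columns."""
    kept = []
    for cand in sorted(candidates, key=lambda c: c[2], reverse=True):
        i, j, length = cand[:3]
        redundant = False
        for ki, kj, klen, *_ in kept:
            row_overlap = min(i + length, ki + klen) - max(i, ki)
            col_overlap = min(j + length, kj + klen) - max(j, kj)
            if row_overlap > d or col_overlap > d:
                redundant = True
                break
        if not redundant:
            kept.append(cand)
    return kept


# Toy 6 x 7 similarity matrix with one high-similarity diagonal run of length 4.
sim = [[0.1] * 7 for _ in range(6)]
for k in range(4):
    sim[1 + k][2 + k] = 0.9
candidates = select_candidates(sim, mu=0.8, s=3)
regions = remove_redundant(candidates, d=1)
print(regions)
```

The exhaustive enumeration is cubic or worse in the protein lengths, so a practical version would likely use running sums along each diagonal; the sketch favours readability over speed.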

III-B3 Unconditional Ranking

To identify the best matching local structures, we sort the obtained candidates by their variance (calculated from the first step of candidate selection) in ascending order. This unconditional ranking approach assumes no prior knowledge about the specific region to be matched (e.g., active site). In cases where the target region is known, we can optionally employ a conditional ranking method. This method sorts the candidates based on the degree of index-level overlap between the query structure and the candidate structures in descending order.
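Both ranking modes reduce to a sort with a different key, as the following sketch illustrates. The `Candidate` record, the function names, and the convention that the query is a half-open residue range on the first protein are assumptions made for illustration only.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    row_start: int    # first aligned residue index in the first protein
    col_start: int    # first aligned residue index in the second protein
    length: int       # number of aligned residues
    mean: float       # mean similarity along the block diagonal
    variance: float   # variance of the similarities along the block diagonal


def rank_unconditional(cands):
    """Sort candidates by variance in ascending order (no prior knowledge)."""
    return sorted(cands, key=lambda c: c.variance)


def rank_conditional(cands, query_start, query_end):
    """Sort by residue overlap with the query range [query_start, query_end), descending."""
    def overlap(c):
        return max(0, min(c.row_start + c.length, query_end) - max(c.row_start, query_start))
    return sorted(cands, key=overlap, reverse=True)


cands = [
    Candidate(0, 5, 12, 0.85, 0.020),
    Candidate(30, 28, 15, 0.90, 0.004),
    Candidate(60, 70, 10, 0.82, 0.010),
]
best_unconditional = rank_unconditional(cands)[0]
best_conditional = rank_conditional(cands, query_start=55, query_end=75)[0]
print(best_unconditional.row_start, best_conditional.row_start)
```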

TABLE I: Performance comparison of baseline models on CATH-aligns for structure alignment. The classification performance is evaluated by AUC. Both the average AUC and the detailed fold-wise AUC are reported.

| Type | Name | Version | # Params | Input: AA | Input: Structure | CATH-aligns average | CATH-aligns fold 1 | CATH-aligns fold 2 | CATH-aligns fold 3 | CATH-aligns+ average | CATH-aligns+ fold 1 | CATH-aligns+ fold 2 | CATH-aligns+ fold 3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alignment | FoldSeek [21] | 3Di | - | | | 0.900 | 0.903 | 0.901 | 0.897 | 0.891 | 0.893 | 0.892 | 0.888 |
| | | 3Di-AA | - | | | 0.888 | 0.889 | 0.888 | 0.886 | 0.881 | 0.882 | 0.881 | 0.879 |
| Embedding | ESM2 [23] | t33_650M | 650M | | | 0.685 | 0.685 | 0.684 | 0.687 | 0.672 | 0.672 | 0.674 | 0.671 |
| | | t36_3B | 3,000M | | | 0.700 | 0.697 | 0.699 | 0.704 | 0.685 | 0.685 | 0.687 | 0.682 |
| | | t48_15B | 15,000M | | | 0.814 | 0.813 | 0.814 | 0.814 | 0.788 | 0.788 | 0.790 | 0.786 |
| | ProstT5 [24] | AA2fold | 3,000M | | | 0.907 | 0.905 | 0.909 | 0.908 | 0.851 | 0.851 | 0.852 | 0.850 |
| | | fold2AA | 3,000M | | | 0.921 | 0.921 | 0.920 | 0.922 | 0.838 | 0.841 | 0.839 | 0.834 |
| | ESM-IF [14] | - | 148M | | | 0.625 | 0.624 | 0.625 | 0.627 | 0.851 | 0.853 | 0.851 | 0.849 |
| | MIF-ST [15] | - | 643M | | | 0.882 | 0.897 | 0.873 | 0.877 | 0.614 | 0.611 | 0.616 | 0.616 |
| | ProtLOCA (Ours) | - | 5.9M | | | 0.965 | 0.966 | 0.964 | 0.964 | 0.895 | 0.895 | 0.895 | 0.895 |
Figure 3: Model performance under different (left) perturbation probabilities $p$ for mask corruption; (middle) numbers of GVP layers; (right) pre-training targets.

IV Experimental Analysis

ProtLOCA is pre-trained on an unlabeled protein structure dataset from CATH4.3 (introduced below). We examine ProtLOCA on protein structure alignment tasks involving both global structure matching and local structure alignment. For the global structure matching task, we provide quantitative comparisons with baseline methods on two independent benchmark datasets, CATH-aligns and CATH-aligns+. For the local structure alignment, due to the lack of appropriate datasets and quantitative evaluation metrics, we investigate the model’s performance through a case study. All experiments were conducted on 8 A800 GPUs, each with 80GB VRAM. The implementation will be released upon acceptance.

IV-A CATH-aligns: Benchmark for Structure Alignment

We construct CATH-aligns, a new benchmark with standard quantitative evaluation criteria. We process the dataset from CATH 4.3 (the official dataset can be found at http://download.cathdb.info/cath/releases/all-releases/v4_3_0/), a comprehensive dataset of experimentally determined protein domain structures. All structures are labeled with a four-level CATH classification code [25] that classifies the protein's structural type from different perspectives. We remove incomplete protein entries with missing atomic coordinates for $C_\alpha$ and $N$. All proteins have below $20\%$ sequence identity to each other. A total of $14{,}654$ domains remain for constructing the independent test set CATH-aligns.

For structure alignment prediction, we define a binary classification task with the split test subset from CATH4.3. We consider two levels of classification difficulty and name them CATH-aligns and CATH-aligns+, respectively. The former, CATH-aligns, defines negative pairs as protein domains with all the four-level CATH classification codes being different and positive pairs as any of the four codes being identical. The latter, CATH-aligns+, defines a more difficult task, where structure pairs with identical CATH codes at all four levels are considered positive sample pairs, while pairs differing at any level are considered negative sample pairs. To ensure computational efficiency and balance the number of positive and negative samples, we prepare three folds for evaluation, each containing $10{,}000$ positive and $10{,}000$ negative pairs that are randomly sampled from the complete $14{,}654\times 14{,}654$ pairs of CATH-aligns. The prediction results are assessed using the AUC (area under the curve) metric, where an AUC closer to $1$ indicates better predictive performance.
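The two labeling rules and the evaluation metric can be illustrated with a short sketch. It assumes that CATH codes are dot-separated strings and compares them level by level, which is one literal reading of the definitions above (a hierarchical prefix comparison would be another). The function names, example codes, and example scores are hypothetical, and the pairwise AUC is a plain-Python stand-in for a library routine.

```python
def cath_levels(code):
    """Split a dot-separated CATH code such as '1.10.10.10' into its four levels."""
    return tuple(code.split("."))


def label_cath_aligns(code_a, code_b):
    """CATH-aligns: positive (1) if any level matches, negative (0) if all four differ."""
    return int(any(x == y for x, y in zip(cath_levels(code_a), cath_levels(code_b))))


def label_cath_aligns_plus(code_a, code_b):
    """CATH-aligns+: positive (1) only if all four levels are identical."""
    return int(cath_levels(code_a) == cath_levels(code_b))


def auc(labels, scores):
    """AUC as the probability that a positive pair outscores a negative pair (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


pairs = [
    ("1.10.10.10", "1.10.10.10"),
    ("1.10.10.10", "1.10.10.20"),
    ("1.10.10.10", "3.40.50.300"),
]
easy_labels = [label_cath_aligns(a, b) for a, b in pairs]
hard_labels = [label_cath_aligns_plus(a, b) for a, b in pairs]
example_auc = auc([1, 1, 0, 0], [0.9, 0.6, 0.7, 0.2])
print(easy_labels, hard_labels, example_auc)
```

The quadratic loop in `auc` is only meant to make the definition explicit; for folds of the size used here, a rank-based implementation would be the practical choice.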

IV-B Experimental Protocol

Training Setup

ProtLOCA is optimized with AdamW [26] with a learning rate of $0.0001$. The maximum number of training epochs is set to $50$, and early stopping is applied with a patience of $5$ epochs. For stable GPU memory usage during training, the maximum number of nodes per batch is set to $10{,}000$. The GVP module consists of $6$ layers with a dropout ratio of $0.2$. The embedding dimensions are set to $256$ for $\bm{N}_S$, $32$ for $\bm{N}_V$, $64$ for $\bm{E}_S$, and $2$ for $\bm{E}_V$. During inference, the input $\bm{N}_S$ is masked to $0$ to obtain the representation vectors for each point in the protein structure. All experiments are conducted on an A800 GPU with 80GB of memory, and the training process is logged using Weights & Biases (WandB).
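Two pieces of this setup, the node budget per batch and early stopping, are simple enough to sketch in plain Python. The greedy grouping strategy, the use of validation loss as the stopping criterion, and all names below are assumptions for illustration; the text above only fixes the budget of $10{,}000$ nodes, the patience of $5$ epochs, and the cap of $50$ epochs.

```python
def batch_by_node_budget(lengths, max_nodes=10_000):
    """Greedily group protein indices so that each batch holds at most max_nodes residues.

    A single protein longer than max_nodes still forms its own batch.
    """
    batches, current, used = [], [], 0
    for idx, n_nodes in enumerate(lengths):
        if current and used + n_nodes > max_nodes:
            batches.append(current)
            current, used = [], 0
        current.append(idx)
        used += n_nodes
    if current:
        batches.append(current)
    return batches


class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without a new best validation loss."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


batches = batch_by_node_budget([4000, 5000, 3000, 9000, 500])

stopper = EarlyStopping(patience=5)
val_losses = [1.0, 0.8, 0.81, 0.82, 0.83, 0.84, 0.85, 0.7]  # made-up values
stopped_at = None
for epoch, loss in enumerate(val_losses[:50]):  # at most 50 epochs
    if stopper.step(loss):
        stopped_at = epoch
        break
print(batches, stopped_at)
```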

Dataset for Self-Supervised Learning

We use unlabeled CATH4.3_s40 for training our graph representation learning model. All structures in the dataset are processed with the similarity threshold at $40\%$, containing a total of $31{,}070$ protein domain structures. The training target is to recover the noisy input tokens defined in Section II-D. A subset of $200$ domains is split randomly for model validation. Although the training dataset is unlabeled, we further ensure that the sequence identity between the training set and the test dataset CATH-aligns is below $20\%$ to avoid data leakage.

Figure 4: Example of using ProtLOCA and TM-align to find the helix-turn-helix (HTH) motif in DNA binding proteins. (A) HTH motif in the Tox repressor (PDB: 1F5T). The HTH motif is colored in red, DNA in yellow, and protein in white. (B) The HTH motif serves as the binding site of the protein to DNA, shown here for the Tox repressor. The HTH motif is colored in pink, the protein in white, the DNA in yellow, and the hydrogen bonds between the HTH motif and DNA are marked in red. (C) Phage lambda cII protein (PDB: 1ZS4) HTH motif from ground truth (red), TM-align (blue), and ProtLOCA (green). (D) Transcriptional regulator PA2196 (PDB: 4L62) HTH motif from ground truth (red), TM-align (blue), and ProtLOCA (green).
Baseline Methods

We compare ProtLOCA with a set of alignment-based and embedding-based deep learning methods. For alignment methods, we consider two variants of FoldSeek [21], using 3Di with pure structural input and 3Di with both structural and amino acid (AA) input. This method encodes local structures and uses traditional alignment algorithms for point-by-point comparison of structures. For the global structure matching task, we exclude TM-align [19] from the baseline list because of its prohibitive computational cost. To compute the similarity of all $14{,}654\times 14{,}654$ structure pairs in the test dataset, TM-align would consume approximately $30{,}000$ hours. In comparison, ProtLOCA spends less than $1$ hour, including the data preprocessing and scoring steps. For embedding methods, our comparison includes the pre-trained sequence-based language model ESM2 [23] at different model scales. For the structure-aware pre-trained model ProstT5 [24], which supports both AA2fold and fold2AA translation modes, we take amino acid sequences and Foldseek sequences as input, respectively, to obtain embeddings. We also include two inverse-folding methods, ESM-IF [14] and MIF-ST [15], which take amino acid sequences as input. Unlike alignment methods, embedding methods average the residue-level embeddings along the protein sequence to obtain a protein-level embedding and use the dot product of these vectors to measure overall protein similarity.
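The gap in running time follows from how embedding methods score pairs: each protein is encoded once, and all pairwise scores then come from a single matrix product. The sketch below illustrates this with random stand-in embeddings; the function names and array shapes are assumptions, and the per-pair cost of TM-align is simply derived from the two figures quoted above.

```python
import numpy as np


def protein_embedding(residue_embeddings: np.ndarray) -> np.ndarray:
    """Average residue-level embeddings of shape (length, dim) into one protein vector."""
    return residue_embeddings.mean(axis=0)


def pairwise_scores(embeddings: np.ndarray) -> np.ndarray:
    """Dot-product similarity between every pair of protein vectors, shape (N, N)."""
    return embeddings @ embeddings.T


# Random stand-ins for four proteins of different lengths (assumed 256-dim features).
rng = np.random.default_rng(0)
proteins = [rng.normal(size=(length, 256)) for length in (80, 150, 95, 210)]
pooled = np.stack([protein_embedding(p) for p in proteins])
scores = pairwise_scores(pooled)
print(scores.shape)

# Implied average cost of pairwise alignment from the figures quoted in the text.
n_domains = 14_654
tm_align_hours = 30_000
seconds_per_pair = tm_align_hours * 3600 / n_domains**2
print(round(seconds_per_pair, 2))
```

With pooled vectors, the number of encoder passes grows linearly with the number of domains, whereas alignment-based scoring must process every pair separately.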

IV-C Results Analysis

Baseline Comparison

Table I reports the performance comparison of ProtLOCA and other baseline models on CATH-aligns and CATH-aligns+. In both alignment tasks, ProtLOCA significantly outperforms other embedding methods and even exceeds the performance of the classic alignment-based baseline FoldSeek. Note that the training cost for ProtLOCA is lower than that of all baseline methods due to a significantly smaller number of trainable parameters. Additionally, it is trained on a comparatively small dataset of approximately $30{,}000$ samples. This training set size is smaller than what is typically required for deep protein models, which usually demand millions or more samples to train effectively. Furthermore, structure-based algorithms (e.g., ESM-IF) generally perform better than sequence-based methods. Notably, ESM2, despite achieving state-of-the-art performance in many downstream tasks, does not perform well in the structure alignment task. Additionally, the results of both FoldSeek and ProtLOCA demonstrate that incorporating amino acid information during training can indeed reduce the overall predictive performance of the models. These experimental results strongly support our initial claim that amino acid information does not always contribute to learning more expressive hidden embeddings, and embeddings learned with sequence information do not consistently enhance the prediction performance in every downstream task.

Sensitivity Analysis

We examine the impact of two hyperparameters on the performance of ProtLOCA: the masking noise ratio $p$ (with the permutation ratio being $1-p$) and the number of GVP layers. The results are visualized in the left two subplots in Fig. 3. The prediction accuracy is insensitive to both hyperparameters, with less than $1\%$ change observed over a considerably large range. We use $p=0.5$ and $6$ GVP layers as the default settings for the model.

Input and Denoising Token

Fig. 3 (right) compares the effect of different types of input node features. We consider three types of node features: the classic amino acid type, the secondary structure codes (DSSP), and the hidden structure codes (3Di). Overall, using 3Di encoding yields the best prediction performance on the structure alignment task. More importantly, incorporating amino acid information during the model training significantly degrades model performance (green bars). This observation is consistent with the previous analysis and our key assumption, where considering amino acid information in the structure alignment task may introduce unnecessary interference, leading to poor prediction performance in downstream tasks.

IV-D Case Study: HTH Functional Structure Alignment

The helix-turn-helix (HTH) motif is a crucial structural component in DNA binding proteins, including transcription factors regulating gene expression [18]. It comprises two alpha-helices joined by a 'turn', with the second helix, known as the recognition helix (Fig. 4A), specifically interacting with DNA (Fig. 4B). This interaction is essential for gene regulation, as it enables proteins containing the HTH motif to control the transcription process by attaching to DNA's promoters or operators [27]. We first use TM-align to identify the HTH motif in the phage lambda cII protein (Fig. 4C) [28] and the transcriptional regulator PA2196 (Fig. 4D) [29]. In the phage lambda cII protein, the HTH motif identified by TM-align lies at a different position in the protein than in the ground truth. For transcriptional regulator PA2196, the HTH motif identified by TM-align is much shorter than the one in the ground truth. These cases demonstrate TM-align's limitations in accurately identifying the correct HTH motif in DNA binding proteins. In contrast, ProtLOCA effectively identifies the correct HTH motif in these two proteins, despite their different overall folds. Thus, ProtLOCA demonstrates better performance than TM-align in identifying critical motifs in proteins with the same functions when their overall structures vary.

V Related Work

V-A Sequence Representation

With the growth of protein sequences and advancements in natural language modeling methods, the most commonly used approaches in protein representation learning typically involve unsupervised training on protein sequences, without considering protein structural information. For example, ESM2 [23], ESM-1v [30], and ESM-1b [31] use different redundancy levels of the Uniref dataset [32], employing the BERT [33] architecture and a masked language modeling unsupervised training objective to train models for downstream tasks related to representation learning or zero-shot mutation tasks in protein engineering. ProtTrans [34] has introduced a series of protein language representation models, such as ProtBert, ProtT5, and ProtAlBert, based on BERT [33], T5 [35] or AlBert [36] architectures, primarily applied to various downstream tasks of representation learning. Ankh [37] uses an asymmetric encoder-decoder approach and explores a series of training parameters to train language models that perform well on downstream tasks. Additionally, methods like CARP [38] and ProteinBert [39] use 1D-CNN instead of the attention mechanism to improve training efficiency for processing longer sequences.

V-B Structure Representation

With the increase in crystal structures and advancements in folding techniques [40, 23], protein structure databases have become increasingly large [41]. Currently, mainstream methods use sequence information as the training target for structural inputs or as auxiliary node features, with few models considering only protein structures while discarding amino acid types. For instance, GearNet [10] uses contrastive learning to enhance representation quality for protein enzyme commission (EC) number prediction, and ProtLGN [42] employs multi-task learning and denoising training objectives to improve zero-shot prediction capabilities for protein mutations. Additionally, models like GVP [8] and EGNN [43] use graph neural networks to model the equivariance and invariance of proteins for protein quality prediction tasks. Some inverse folding methods use protein structures as input to restore amino acid information, achieving structure-aware training. For example, ESM-IF [14] uses GVP to initialize transformer node features, and ProstT5 [24] uses Foldseek’s structural tokens as input and amino acid sequences as output (or vice versa) for machine translation training. Furthermore, some approaches combine language models and graph neural networks to enhance the quality of representation learning. Examples include MIF-ST [15], which integrates CARP [38] and Struct GNN, ProtSSN [11], which combines ESM2-650M and EGNN structures, and LM-GVP [12].

VI Conclusion and Discussion

Protein function annotation and analysis typically rely on protein sequences and overall structural information. However, these approaches come with their own challenges. Sequence-based analysis, such as that based on EC numbers and Pfam datasets, does not consistently yield accurate results. This is partly because pinning down a protein's position on the evolutionary tree can be problematic when only its sequence is considered. In addition, methods that align overall protein structures, such as TM-align, may overlook proteins that are characterized by local structural conservation amid overall structural variability. Protein functions are mainly determined by key substructures, such as catalytic regions and binding pockets, while the remaining structures determine the physical properties of proteins. In light of these issues with current methodologies and the biological significance of the problem, we developed ProtLOCA, which focuses on local structural matches within proteins with diverse overall folds. This tool unlocks new perspectives on protein functional and structural evolution.

References

  • [1] A. W. Senior, R. Evans, J. Jumper, J. Kirkpatrick, L. Sifre, T. Green, C. Qin, A. Žídek, A. W. Nelson, A. Bridgland et al., “Improved protein structure prediction using potentials from deep learning,” Nature, vol. 577, no. 7792, pp. 706–710, 2020.
  • [2] A. Madani, B. Krause, E. R. Greene, S. Subramanian, B. P. Mohr, J. M. Holton, J. L. Olmos, C. Xiong, Z. Z. Sun, R. Socher et al., “Large language models generate functional protein sequences across diverse families,” Nature Biotechnology, vol. 41, no. 8, pp. 1099–1106, 2023.
  • [3] T. Yu, H. Cui, J. C. Li, Y. Luo, G. Jiang, and H. Zhao, “Enzyme function prediction using contrastive learning,” Science, vol. 379, no. 6639, pp. 1358–1363, 2023.
  • [4] J. Koehler Leman, P. Szczerbiak, P. D. Renfrew, V. Gligorijevic, D. Berenberg, T. Vatanen, B. C. Taylor, C. Chandler, S. Janssen, A. Pataki et al., “Sequence-structure-function relationships in the microbial protein universe,” Nature communications, vol. 14, no. 1, p. 2351, 2023.
  • [5] N. Sapoval, A. Aghazadeh, M. G. Nute, D. A. Antunes, A. Balaji, R. Baraniuk, C. Barberan, R. Dannenfelser, C. Dun, M. Edrisi et al., “Current progress and open challenges for applying deep learning across the biosciences,” Nature Communications, vol. 13, no. 1, p. 1728, 2022.
  • [6] J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool, R. Bates, A. Žídek, A. Potapenko et al., “Highly accurate protein structure prediction with alphafold,” Nature, vol. 596, no. 7873, pp. 583–589, 2021.
  • [7] J. Abramson, J. Adler, J. Dunger, R. Evans, T. Green, A. Pritzel, O. Ronneberger, L. Willmore, A. J. Ballard, J. Bambrick et al., “Accurate structure prediction of biomolecular interactions with alphafold 3,” Nature, pp. 1–3, 2024.
  • [8] B. Jing, S. Eismann, P. Suriana, R. J. L. Townshend, and R. Dror, “Learning from protein structure with geometric vector perceptrons,” in ICLR, 2020.
  • [9] B. Zhou, L. Zheng, B. Wu, Y. Tan, O. Lv, K. Yi, G. Fan, and L. Hong, “Protein engineering with lightweight graph denoising neural networks,” Journal of Chemical Information and Modeling, 2023.
  • [10] Z. Zhang, M. Xu, A. Jamasb, V. Chenthamarakshan, A. Lozano, P. Das, and J. Tang, “Protein representation learning by geometric structure pretraining,” arXiv preprint arXiv:2203.06125, 2022.
  • [11] Y. Tan, B. Zhou, L. Zheng, G. Fan, and L. Hong, “Semantical and topological protein encoding toward enhanced bioactivity and thermostability,” bioRxiv, pp. 2023–12, 2023.
  • [12] Z. Wang, S. A. Combs, R. Brand, M. R. Calvo, P. Xu, G. Price, N. Golovach, E. O. Salawu, C. J. Wise, S. P. Ponnapalli et al., “Lm-gvp: an extensible sequence and structure informed deep learning framework for protein property prediction,” Scientific reports, vol. 12, no. 1, p. 6832, 2022.
  • [13] S. Li, J. Zhou, T. Xu, L. Huang, F. Wang, H. Xiong, W. Huang, D. Dou, and H. Xiong, “Structure-aware interactive graph neural networks for the prediction of protein-ligand binding affinity,” in Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining, 2021, pp. 975–985.
  • [14] C. Hsu, R. Verkuil, J. Liu, Z. Lin, B. Hie, T. Sercu, A. Lerer, and A. Rives, “Learning inverse folding from millions of predicted structures,” in ICML.   PMLR, 2022, pp. 8946–8970.
  • [15] K. K. Yang, N. Zanichelli, and H. Yeh, “Masked inverse folding with sequence transfer for protein representation learning,” Protein Engineering, Design and Selection, vol. 36, 2023.
  • [16] P. Notin, A. W. Kollasch, D. Ritter, L. Van Niekerk, S. Paul, H. Spinner, N. J. Rollins, A. Shaw, R. Weitzman, J. Frazer et al., “ProteinGym: Large-scale benchmarks for protein fitness prediction and design,” in NeurIPS, 2023.
  • [17] I. Sillitoe, N. Bordin, N. Dawson, V. P. Waman, P. Ashford, H. M. Scholes, C. S. Pang, L. Woodridge, C. Rauer, N. Sen et al., “Cath: increased structural coverage of functional space,” Nucleic acids research, vol. 49, no. D1, pp. D266–D273, 2021.
  • [18] Y. Takeda, D. Ohlendorf, W. Anderson, and B. Matthews, “Dna-binding proteins,” Science, vol. 221, no. 4615, pp. 1020–1026, 1983.
  • [19] Y. Zhang and J. Skolnick, “Tm-align: a protein structure alignment algorithm based on the tm-score,” Nucleic acids research, vol. 33, no. 7, pp. 2302–2309, 2005.
  • [20] W. Kabsch and C. Sander, “Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features,” Biopolymers: Original Research on Biomolecules, vol. 22, no. 12, pp. 2577–2637, 1983.
  • [21] M. Van Kempen, S. S. Kim, C. Tumescheit, M. Mirdita, J. Lee, C. L. Gilchrist, J. Söding, and M. Steinegger, “Fast and accurate protein structure search with foldseek,” Nature Biotechnology, vol. 42, no. 2, pp. 243–246, 2024.
  • [22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [23] Z. Lin, H. Akin, R. Rao, B. Hie, Z. Zhu, W. Lu, N. Smetanin, R. Verkuil, O. Kabeli, Y. Shmueli et al., “Evolutionary-scale prediction of atomic-level protein structure with a language model,” Science, vol. 379, no. 6637, pp. 1123–1130, 2023.
  • [24] M. Heinzinger, K. Weissenow, J. G. Sanchez, A. Henkel, M. Steinegger, and B. Rost, “Prostt5: Bilingual language model for protein sequence and structure,” bioRxiv, pp. 2023–07, 2023.
  • [25] C. Orengo, A. Michie, S. Jones, D. Jones, M. Swindells, and J. Thornton, “CATH – a hierarchic classification of protein domain structures,” Structure, vol. 5, no. 8, pp. 1093–1109, 1997.
  • [26] D. P. Kingma and J. Ba, “ADAM: A method for stochastic optimization,” in International Conference on Learning Representation, 2015.
  • [27] A. Ishihama, “Prokaryotic genome regulation: multifactor promoters, multitarget regulators and hierarchic networks,” FEMS microbiology reviews, vol. 34, no. 5, pp. 628–645, 2010.
  • [28] D. Jain, Y. Kim, K. L. Maxwell, S. Beasley, R. Zhang, G. N. Gussin, A. M. Edwards, and S. A. Darst, “Crystal structure of bacteriophage λcII and its DNA complex,” Molecular cell, vol. 19, no. 2, pp. 259–269, 2005.
  • [29] Y. Kim, Y. Kang, and J. Choe, “Crystal structure of pseudomonas aeruginosa transcriptional regulator pa2196 bound to its operator dna,” Biochemical and Biophysical Research Communications, vol. 440, no. 2, pp. 317–321, 2013.
  • [30] J. Meier, R. Rao, R. Verkuil, J. Liu, T. Sercu, and A. Rives, “Language models enable zero-shot prediction of the effects of mutations on protein function,” in NeurIPS, vol. 34, 2021, pp. 29 287–29 303.
  • [31] A. Rives, J. Meier, T. Sercu, S. Goyal, Z. Lin, J. Liu, D. Guo, M. Ott, C. L. Zitnick, J. Ma et al., “Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences,” Proceedings of the National Academy of Sciences, vol. 118, no. 15, p. e2016239118, 2021.
  • [32] B. E. Suzek, Y. Wang, H. Huang, P. B. McGarvey, C. H. Wu, and U. Consortium, “Uniref clusters: a comprehensive and scalable alternative for improving sequence similarity searches,” Bioinformatics, vol. 31, no. 6, pp. 926–932, 2015.
  • [33] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv:1810.04805, 2018.
  • [34] A. Elnaggar, M. Heinzinger, C. Dallago, G. Rehawi, W. Yu, L. Jones, T. Gibbs, T. Feher, C. Angerer, M. Steinegger, D. Bhowmik, and B. Rost, “ProtTrans: Towards cracking the language of lifes code through self-supervised deep learning and high performance computing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  • [35] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of machine learning research, vol. 21, no. 140, pp. 1–67, 2020.
  • [36] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language representations,” arXiv preprint arXiv:1909.11942, 2019.
  • [37] A. Elnaggar, H. Essam, W. Salah-Eldin, W. Moustafa, M. Elkerdawy, C. Rochereau, and B. Rost, “Ankh: Optimized protein language model unlocks general-purpose modelling,” arXiv preprint arXiv:2301.06568, 2023.
  • [38] K. K. Yang, A. X. Lu, and N. Fusi, “Convolutions are competitive with transformers for protein sequence pretraining,” in ICLR Machine Learning for Drug Discovery, 2022.
  • [39] N. Brandes, D. Ofer, Y. Peleg, N. Rappoport, and M. Linial, “ProteinBERT: A universal deep-learning model of protein sequence and function,” Bioinformatics, vol. 38, no. 8, pp. 2102–2110, 2022.
  • [40] J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool, R. Bates, A. Žídek, A. Potapenko et al., “Highly accurate protein structure prediction with AlphaFold,” Nature, vol. 596, no. 7873, pp. 583–589, 2021.
  • [41] M. Varadi, S. Anyango, M. Deshpande, S. Nair, C. Natassia, G. Yordanova, D. Yuan, O. Stroe, G. Wood, A. Laydon et al., “Alphafold protein structure database: massively expanding the structural coverage of protein-sequence space with high-accuracy models,” Nucleic acids research, vol. 50, no. D1, pp. D439–D444, 2022.
  • [42] B. Zhou, L. Zheng, B. Wu, Y. Tan, O. Lv, K. Yi, G. Fan, and L. Hong, “Protein engineering with lightweight graph denoising neural networks,” bioRxiv, 2023.
  • [43] V. G. Satorras, E. Hoogeboom, and M. Welling, “E (n) equivariant graph neural networks,” in International conference on machine learning.   PMLR, 2021, pp. 9323–9332.