PIPENN: protein interface prediction from sequence with an ensemble of neural nets
Bas Stringer, Hans de Ferrante, Sanne Abeln, Jaap Heringa, K. Anton Feenstra, and Reza Haydarlou
Introduction: Understanding protein function is greatly helped by knowing how and where a protein interacts with other molecules, and predicted binding interfaces allow experiments to be focused on the most promising regions. Here, we tackle the problem of predicting, from sequence, the interfaces through which proteins bind other proteins (PPI), DNA/RNA (nucleic acids) or small molecules. Many computational and machine learning approaches have been developed over the years to predict such interface residues from sequence [1-4]. However, the effectiveness of different deep learning architectures and learning strategies for protein interface prediction has not yet been investigated in great detail. Therefore, we develop an extensive dataset for Deep Learning (DL), dubbed BioDL, and explore six DL architectures and various learning strategies with sequence-derived input features for the prediction of protein interfaces.
Methods: We introduce a new and extensive benchmarking dataset, ‘BioDL’, which comprises protein–protein interactions from the PDB and DNA/RNA and small-molecule interactions from the BioLip database, yielding BioDL_P, BioDL_N and BioDL_S, respectively, as well as the generic BioDL_A, which encompasses all three interaction types. We use BioDL to train and test our deep learning predictors, and assess the significance of the differences observed throughout, so that solid conclusions may be drawn. In addition, we use the previously developed homo/heteromeric (hhc) datasets from our group [1, 2] and the ZK448 dataset from Zhang & Kurgan. We apply six major deep learning architectures and an ensemble predictor, dubbed PIPENN, which we implemented as an additional layer on top of the six architectures.
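As an illustration of what "an additional layer on top of" six base predictors can look like, the sketch below combines per-residue interface probabilities from several models through a single logistic unit. This is a minimal sketch under stated assumptions, not the PIPENN implementation: the function names, the use of logits as inputs, and the example weights are all hypothetical, and in practice the weights would be learned from training data.

```python
import math
from typing import List, Sequence


def logit(p: float, eps: float = 1e-7) -> float:
    """Log-odds of a probability, clipped away from 0 and 1 for stability."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def ensemble_predict(
    base_probs: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float = 0.0,
) -> List[float]:
    """Combine per-residue probabilities from several base models.

    base_probs[m][i] is the interface probability that base model m assigns
    to residue i. The combination rule (one logistic unit over the base
    models' logits) is an assumption made for illustration only.
    """
    if len(base_probs) != len(weights):
        raise ValueError("need exactly one weight per base model")
    n_residues = len(base_probs[0])
    if any(len(p) != n_residues for p in base_probs):
        raise ValueError("all base models must score the same residues")

    combined = []
    for i in range(n_residues):
        z = bias + sum(w * logit(p[i]) for w, p in zip(weights, base_probs))
        combined.append(sigmoid(z))
    return combined


# Hypothetical usage: six base models scoring a three-residue sequence.
six_models = [
    [0.20, 0.70, 0.90],
    [0.30, 0.60, 0.80],
    [0.25, 0.65, 0.85],
    [0.10, 0.75, 0.95],
    [0.35, 0.55, 0.70],
    [0.20, 0.80, 0.90],
]
uniform_weights = [1.0 / 6.0] * 6  # placeholder values, not learned weights
ensemble_scores = ensemble_predict(six_models, uniform_weights)
```

With uniform weights and zero bias this reduces to averaging the base models in log-odds space; learned weights would let the ensemble favour the architectures that are more reliable on the training data.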
Preliminary data: We first explored a number of variations in building blocks for the deep learning models. The best performing combination in the dnet architecture yields an AUC-ROC of 0.733, and uses HeUniform kernel initialization, a CrossEntropy loss function, 1-dimensional spatial form, the PReLU activation function, no MaxPooling, and 1-Hot sequence encoding. Changing the kernel initialization to GlorotNormal and the loss function to MeanSquaredError does not significantly impact accuracy (<0.005 drop in AUC). Removing batch normalization, padding or dropout has a significant and large impact (>0.02 drop in AUC). The six individual architectures yield AUC-ROC prediction accuracies from 0.717 to 0.730 when trained and tested on the BioDL_P dataset. Excitingly, the PIPENN ensemble predictor for PPI significantly outperforms each of them, reaching an AUC-ROC of 0.755 for protein–protein interface prediction. Analysis of feature importance using SHAP values shows that length ranks highest, as we observed previously [1-3], along with amino acid type (AA) and profile scores (pssm). We trained a specific predictor on the BioDL dataset for each of the three interaction types: PPI (p), protein–small molecule (s), and protein–DNA/RNA (n), as well as a generic predictor for any of the three types (a). In all cases, the specific predictor performs better than the generic predictor. For PPI, the specific predictor reaches an AUC-ROC of 0.755, while the generic predictor obtains 0.733. For small molecules these values are 0.864 and 0.826, respectively, and for nucleic acids 0.894 and 0.835. Using the separate independent test sets for PPI prediction, hhc and ZK448, we show that PIPENN significantly outperforms all currently published methods for sequence-based prediction of the binding interface of proteins with other proteins, with an AUC-ROC of 0.769 for hhc and 0.729 for ZK448. Other methods tested on ZK448 do not reach beyond 0.715.
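Two of the ingredients named above, 1-Hot sequence encoding and the AUC-ROC metric, can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used in the study: the 20-letter alphabet ordering, the all-zero vector for non-standard residues, and the function names are assumptions. AUC-ROC is computed here through its rank interpretation, the probability that a randomly chosen interface residue scores higher than a randomly chosen non-interface residue, with ties counted as one half.

```python
from typing import List, Sequence

# Assumed alphabet: the 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence: str) -> List[List[int]]:
    """Encode each residue as a 20-dimensional indicator vector.

    Residues outside the assumed alphabet become an all-zero vector.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for residue in sequence.upper():
        vector = [0] * len(AMINO_ACIDS)
        if residue in index:
            vector[index[residue]] = 1
        encoded.append(vector)
    return encoded


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

    labels are 1 for interface residues and 0 otherwise. This pairwise form
    is quadratic in the number of residues; it is meant for clarity, not speed.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one positive and one negative")

    correctly_ranked = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correctly_ranked += 1.0
            elif p == n:
                correctly_ranked += 0.5
    return correctly_ranked / (len(positives) * len(negatives))


# Toy example with made-up labels and scores for four residues.
toy_auc = auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs correct: 0.75
```

On this scale 0.5 corresponds to random ranking and 1.0 to perfect separation, which puts the reported differences between the individual architectures and the ensemble into context.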
Novel aspect: Accurate prediction of protein interaction interfaces from protein sequence alone; the method needs neither experimental nor structural information and is thus easily applied to any protein.