Current Bioinformatics

ISSN (Print): 1574-8936
ISSN (Online): 2212-392X

Research Article

An Extended Feature Representation Technique for Predicting Sequenced-based Host-pathogen Protein-protein Interaction

Author(s): Jerry Emmanuel, Itunuoluwa Isewon, Grace Olasehinde and Jelili Oyelade*

Volume 20, Issue 3, 2025

Published on: 07 March, 2024

Page: [229 - 245] Pages: 17

DOI: 10.2174/0115748936286848240108074303

Abstract

Background: Machine learning models for sequence-based protein-protein interaction (PPI) prediction typically require amino acid sequences to be converted into feature vectors. Two approaches to this transformation appear in the literature: the Independent Protein Feature (IPF) and Merged Protein Feature (MPF) extraction methods. Most studies have adopted the IPF approach, while others have preferred the MPF method, in which host and pathogen sequences are concatenated before feature encoding.
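
The difference between the two approaches can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the paper: amino acid composition (AAC) stands in for the feature encoder, IPF is read as encoding each protein separately and joining the two vectors, the sequences are made up, and the function names `aac`, `ipf`, and `mpf` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(seq):
    """Amino acid composition: fraction of each standard residue in seq."""
    seq = seq.upper()
    n = len(seq) or 1  # guard against empty input
    return [seq.count(a) / n for a in AMINO_ACIDS]


def ipf(host, pathogen):
    """Assumed IPF: encode each protein on its own, then join the vectors."""
    return aac(host) + aac(pathogen)


def mpf(host, pathogen):
    """MPF: concatenate the two sequences first, then encode once."""
    return aac(host + pathogen)


# Illustrative sequences only, not real host or pathogen proteins.
host_seq = "MKTAYIAKQR"
pathogen_seq = "GAVLIMKT"

ipf_vec = ipf(host_seq, pathogen_seq)
mpf_vec = mpf(host_seq, pathogen_seq)
print(len(ipf_vec), len(mpf_vec))  # prints: 40 20
```

With this encoder, the IPF vector keeps the host and pathogen compositions in separate blocks, whereas the MPF vector pools residues from both proteins into a single composition.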

Objective: The coexistence of these two approaches raises the question of which should be adopted for improved host-pathogen protein-protein interaction (HPPPI) prediction. This work therefore introduces the Extended Protein Feature (EPF) method.

Methods: The proposed method combines the predictive capabilities of IPF and MPF by extracting essential features, handling multicollinearity, and removing features with zero importance. EPF, IPF, and MPF were tested on bacteria, parasite, virus, and plant HPPPI datasets and evaluated with machine learning models, including Random Forest (RF), Support Vector Machine (SVM), Multilayer Perceptron (MLP), Naïve Bayes (NB), Logistic Regression (LR), and Deep Forest (DF).
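
One possible reading of the combination step is sketched below in Python. The abstract does not specify the feature layout, the correlation measure, or any threshold, so everything here is assumed for illustration: AAC is the encoder, the EPF vector is the IPF block followed by the MPF block, multicollinearity is handled with a greedy Pearson filter at an arbitrary 0.95 cutoff, and the names `epf`, `pearson`, and `drop_correlated` are hypothetical. Removal of zero-importance features is omitted because it requires a fitted model.

```python
from statistics import mean

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(seq):
    """Amino acid composition (assumed encoder, not the paper's)."""
    n = len(seq) or 1
    return [seq.count(a) / n for a in AMINO_ACIDS]


def epf(host, pathogen):
    """Assumed EPF layout: IPF block (host, pathogen) followed by MPF block."""
    return aac(host) + aac(pathogen) + aac(host + pathogen)


def pearson(x, y):
    """Pearson correlation; constant columns are treated as uncorrelated."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / (sxx * syy) ** 0.5


def drop_correlated(rows, threshold=0.95):
    """Keep a column only if |r| <= threshold against every kept column."""
    cols = list(zip(*rows))
    kept = []
    for j, col in enumerate(cols):
        if all(abs(pearson(col, cols[k])) <= threshold for k in kept):
            kept.append(j)
    return kept, [[row[j] for j in kept] for row in rows]


# Made-up (host, pathogen) sequence pairs, for illustration only.
pairs = [
    ("MKTAYIAKQR", "GAVLIMKT"),
    ("ACDEFGHIKL", "MNPQRSTVWY"),
    ("AAAAKKKK", "CCDDEE"),
    ("MKVLA", "TTRRQQ"),
]
X = [epf(h, p) for h, p in pairs]
kept, X_reduced = drop_correlated(X)
print(len(X[0]), len(kept))  # features before and after the filter
```

Because the MPF block is computed from the same residues as the IPF block, some of its columns are likely to correlate strongly with IPF columns, which is one reason a multicollinearity step is useful after combining the two.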

Results: The results indicated that MPF exhibited the lowest performance overall, whereas IPF performed better with decision tree-based models, such as RF and DF. In contrast, EPF demonstrated improved performance with SVM, LR, NB, and MLP and also yielded competitive results with DF and RF.

Conclusion: The EPF approach developed in this study exhibits substantial improvements in four out of the six models evaluated. This suggests that EPF is competitive with IPF and is particularly well-suited for traditional machine learning models.

Keywords: Protein-protein interaction, feature representation, host-pathogen interaction, machine learning, protein sequence, feature vectors.
