Deep learning based protein-protein interaction prediction using universal protein sequence representations

dc.contributor.advisor: Morgunov, Alekszej
dc.contributor.advisor: Loog, Mart
dc.contributor.advisor: Anbarjafari, Gholamreza
dc.contributor.author: Jermakovs, Klavs
dc.date.accessioned: 2021-05-27T09:00:15Z
dc.date.available: 2021-05-27T09:00:15Z
dc.date.issued: 2020
dc.description.abstract: Protein-protein interactions (PPI) govern key biological events in the cell and serve as a basis for understanding disease mechanisms and developing treatments. Currently used PPI prediction methods that rely on information from multiple sequence alignments are ineffective on proteins with few known homologs. Recent advances in self-supervised learning permit extracting complex features directly from the protein sequence (sequence embeddings) for later use with predictive algorithms. In this thesis, several sequence embedding methods were used in combination with a Siamese deep neural network-based classifier architecture for PPI prediction. An average AUROC score of 0.70 on the C1 test set suggests that more complex embedding methods such as UniRep and PLUS-RNN are able to extract more information relevant to PPI prediction from the protein sequence. Performance of all methods dropped markedly on the C2 and C3 test sets, to 0.62 for UniRep and 0.58 for PLUS-RNN, suggesting that further improvements are necessary to develop models that are more general in their coverage of the protein sequence space. The results of this work confirm that using pre-trained protein representations with deep learning based classifiers is a viable approach to PPI prediction from sequence alone. [en]
dc.identifier.uri: http://hdl.handle.net/10062/72065
dc.language.iso: eng [et]
dc.rights: openAccess [et]
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International [*]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0/ [*]
dc.subject: Protein-protein interactions [en]
dc.subject: Protein Representations [en]
dc.subject: Siamese Neural Network [en]
dc.subject: Transfer Learning [en]
dc.subject: Sequence Embeddings [en]
dc.subject: Self-supervised pre-training [en]
dc.subject: Deep Learning [en]
dc.subject: Proteiin-proteiini vastastiktoime [et]
dc.subject: proteiini kujutusviisid [et]
dc.subject: siiami närvivõrk [et]
dc.subject: siirdeõpe [et]
dc.subject: sekventsi kodeerimine [et]
dc.subject: iseenesliku õppimise eeltreenimine [et]
dc.subject: süvaõpe [et]
dc.title: Deep learning based protein-protein interaction prediction using universal protein sequence representations [en]
dc.title.alternative: Süvaõppel põhinev proteiin-proteiini vastastiktoime ennustamine kasutades universaalseid proteiinisekventsi kujutusviise [et]
dc.type: info:eu-repo/semantics/bachelorThesis [et]

Files

Original bundle
Name: Jermakovs_BSc2020.pdf
Size: 1.92 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 1.67 KB
Format: Item-specific license agreed upon to submission