Combining Support Vector Machines to Predict Angiogenesis-Related Genes
Date
2010
Publisher
Tartu Ülikool
Abstract
Cancer is one of the most common and dangerous diseases today, causing 13% of all deaths worldwide each year. Despite years of effort, no effective treatment for the disease has yet been found. It is known, however, that angiogenesis plays an important role in cancer development: during this process, the tumor induces the surrounding blood vessels to branch and grow. A better understanding of this process could potentially enable new and more effective treatments.
In experiments conducted over the years, the expression of most human genes has been measured under more than 5000 conditions. In addition, our collaborators have compiled a list of 341 genes associated with blood vessel formation. The aim of this thesis is to investigate how new angiogenesis-related genes can be predicted from gene expression data and a small set of known angiogenesis genes.
To this end, we first compare several existing machine learning methods and publicly available bioinformatics tools that could be used to predict candidate genes. All of these methods are given inputs that are as similar as possible, and 10-fold cross-validation is then used to measure how well each one recovers the already known angiogenesis genes.
The second part of the thesis proposes a novel method, Comb-SVM, for predicting candidate genes. Its main idea consists of three steps. First, the known angiogenesis genes and randomly selected negative genes are used to train several Support Vector Machine (SVM) classifiers in parallel. Next, these classifiers are used to predict new angiogenesis genes. Finally, the results of all classifiers are aggregated into a single prediction.
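The three steps can be illustrated with a minimal, standard-library-only sketch. It is not the thesis implementation: the linear SVM trained by batch sub-gradient descent, the mean-rank aggregation, the choice to score every non-positive gene in every round, and all names (`train_linear_svm`, `comb_svm`, the toy gene labels and two-dimensional "expression profiles") are assumptions made for illustration only.

```python
import random

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit a linear soft-margin SVM by batch sub-gradient descent on the hinge loss."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 penalty (lam/2)*||w||^2
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # hinge loss is active for this sample
                for j in range(dim):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def decision_value(model, x):
    """Signed score of x under the linear model; larger means more 'positive'."""
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def comb_svm(expr, positives, n_models=5, seed=0):
    """Rank all non-positive genes by their mean rank across several SVMs."""
    rng = random.Random(seed)
    pos_list = sorted(positives)
    candidates = sorted(g for g in expr if g not in positives)
    rank_sums = {g: 0 for g in candidates}
    for _ in range(n_models):
        # Step 1: train on known positives vs. a fresh random set of negatives.
        negatives = rng.sample(candidates, len(pos_list))
        X = [expr[g] for g in pos_list + negatives]
        y = [1] * len(pos_list) + [-1] * len(negatives)
        model = train_linear_svm(X, y)
        # Step 2: score and rank every candidate gene with this classifier
        # (simplification: genes drawn as negatives are scored as well).
        ordered = sorted(candidates,
                         key=lambda g: decision_value(model, expr[g]),
                         reverse=True)
        for rank, g in enumerate(ordered, start=1):
            rank_sums[g] += rank
    # Step 3: aggregate the per-classifier rankings (assumed here: mean rank).
    return sorted(candidates, key=lambda g: rank_sums[g] / n_models)

# Toy data: five known positives, one hidden positive, twelve background genes.
expr = {f"pos{i}": (2.0 + 0.1 * i, 2.4 - 0.1 * i) for i in range(5)}
expr["hidden"] = (2.2, 2.1)
for i in range(12):
    expr[f"bg{i}"] = (-2.0 - 0.1 * (i % 3), -2.0 - 0.1 * (i % 4))
positives = {f"pos{i}" for i in range(5)}

ranking = comb_svm(expr, positives)
print(ranking[:3])
```

Drawing a different random negative set for each classifier is what removes the need for a curated negative class: any single draw may accidentally contain true angiogenesis genes, but aggregating over many draws should dampen the effect of such mislabeled samples.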
Finally, it is shown that, according to 10-fold cross-validation, Comb-SVM is more accurate than most existing methods. It is also shown that Comb-SVM predictions are considerably more stable under small changes in the training data than the results of the second-best algorithm. Lastly, the scientific literature and the Gene Ontology database are used to confirm that the newly predicted genes are indeed related to angiogenesis.
Angiogenesis is the process of growing new blood vessels. It is part of normal bodily functions like wound healing, but it also plays an important role in cancer development. Without angiogenesis, tumors would not be able to grow larger than 1-2 millimeters in diameter due to the lack of oxygen and nutrients. However, only some of the genes involved in angiogenesis are known. In this work, we proposed Comb-SVM, a new machine learning method for predicting new members of the positive class that does not require clearly defined negative examples. The idea is to train multiple Support Vector Machines (SVMs) using known genes as positive samples and various randomly selected sets of genes as negative examples. The multiple SVMs are then used to separately classify all remaining human genes, and the results are finally aggregated using a rank aggregation algorithm. The outcome is a list of genes ranked according to their similarity to the known input genes. We applied this method to 341 known angiogenesis genes. Experiments were conducted on a large Affymetrix microarray gene expression matrix consisting of 5732 experiments and 22283 probe sets obtained from ArrayExpress. We compared Comb-SVM to many other state-of-the-art approaches. According to cross-validation experiments, our method outperformed most of the existing methods in terms of the areas under the Receiver Operating Characteristic and Precision-Recall curves. We also determined that our method gave significantly more stable results than the second-best approach. Finally, we verified the biological relevance of the predicted genes by searching the literature and Gene Ontology.
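As an illustration of how a ranked gene list can be scored against held-out known genes in one cross-validation fold, the sketch below computes the area under the ROC curve directly from a ranking. It assumes a strict ranking without ties and treats every gene that was not held out as a negative. The function name and the gene labels are hypothetical, and this is not necessarily how the evaluation was implemented in the work described here.

```python
def roc_auc_from_ranking(ranking, held_out):
    """Area under the ROC curve for a strict ranking (best-ranked gene first).

    Equals the fraction of (held-out gene, other gene) pairs in which the
    held-out gene is ranked above the other gene.
    """
    held_out = set(held_out)
    n_pos = sum(1 for g in ranking if g in held_out)
    n_neg = len(ranking) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ranking needs at least one held-out and one other gene")
    others_below = n_neg  # non-held-out genes ranked below the current position
    wins = 0
    for g in ranking:
        if g in held_out:
            wins += others_below  # this gene outranks every other gene still below it
        else:
            others_below -= 1
    return wins / (n_pos * n_neg)

# Hypothetical fold: geneA and geneB were hidden from training.
example_ranking = ["geneA", "geneX", "geneB", "geneY", "geneZ"]
auc = roc_auc_from_ranking(example_ranking, {"geneA", "geneB"})
print(auc)
```

A value of 1.0 means every held-out gene is ranked above every other gene, while 0.5 corresponds to a random ordering.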