Fast approximate queries for finding the maximum correlation
Date
2013
Publisher
Tartu Ülikool
Abstract
Finding the most correlated pairs in large high-dimensional datasets is a very important task with uses in many real-world applications. As data volumes grow rapidly, the task becomes ever more relevant. To our knowledge, the current solution relies on an exhaustive scan that computes the correlation for every possible pair of data points. This approach is too slow to use in practice.
We demonstrate that the most correlated pair can be found by standardizing all vectors in the dataset and then searching for the pair with the minimum Euclidean distance.
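The reason this reduction works can be sketched in one line. The notation here is an assumption added for illustration and is not taken from the thesis: x and y are vectors of length n, the tilde marks their standardized versions (mean subtracted, divided by the population standard deviation), and r(x, y) is the Pearson correlation.

```latex
\[
\|\tilde{x} - \tilde{y}\|^2
  = \|\tilde{x}\|^2 + \|\tilde{y}\|^2 - 2\,\tilde{x}\cdot\tilde{y}
  = n + n - 2n\,r(x, y)
  = 2n\,\bigl(1 - r(x, y)\bigr)
\]
```

Since the squared distance is a decreasing function of r, the pair with the smallest distance after standardization is the pair with the largest correlation.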
Next, we study how to implement this idea with nearest neighbor indexing methods. We implemented three state-of-the-art methods: coordinate-wise search (an exact method) and the KD tree and RP tree structures (approximate methods). All of these algorithms begin by preprocessing (indexing) the data with the given structure. This allows the nearest neighbor of each point to be found efficiently.
We ran two different tests on artificial datasets to measure the running time and accuracy of the algorithms. To evaluate running time, we compared the performance of all three methods with that of the same baseline method. Both hierarchical data structures showed linear time complexity in all tests. The coordinate-based method has quadratic complexity, but it still performs better than the naive exhaustive scan. The tests show that the accuracy of the answers found by both approximate algorithms decreases as the dataset grows, but it remains high enough for the algorithms to be used on real-world problems.
The detection of the most correlated items in large high-dimensional datasets is a very important problem for a variety of real-world applications. This task is becoming more and more relevant as the volume of information in the world keeps growing. To our knowledge, it is currently solved by computing all pairwise correlations in the dataset, which takes an impractically large amount of time. In this thesis we propose a faster solution to this problem. We demonstrate that the time needed to find the most correlated pairs can be improved: first we standardize all vectors in the dataset, and then we find the pair with the smallest Euclidean distance. Next, we propose a solution to the original problem that is based on nearest neighbor indexing. In particular, we implemented three state-of-the-art methods: coordinate-wise search (exact) and the KD tree and RP tree data structures (approximate). All these algorithms start by building a data structure that indexes the points of a given dataset, which later makes it possible to find the nearest neighbors of a query point efficiently. In our work we focused mostly on the two approximate methods. We ran two different types of tests on simulated data to measure the running time and the quality of the proposed solution. To evaluate its running time, we compared the performance of all three methods with that of the baseline approach. Both hierarchical data structures showed linear time complexity in all tests. Although coordinate-wise search has quadratic time complexity, it still substantially outperforms the brute-force method. In terms of quality, the tests show that the accuracy of the results degrades with the size of the input set for both approximate methods, but it nevertheless stays high enough to be useful for most real-world problems.
To demonstrate this, we tested our solution on a dataset of methylation values of different genes in different individuals. The results show that our approximate methods can detect pairs of genes with highly correlated expression that belong to distant regions, which was not possible with existing bioinformatics tools.
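The reduction described in the abstract can be checked on a small example. The following is a minimal sketch and not the thesis implementation: all function names are hypothetical, the data are synthetic, and both searches are done by brute force over all pairs purely to show that they agree. In the thesis, the distance search is the step handed to a nearest neighbor index.

```python
import math
import random
from itertools import combinations


def standardize(v):
    """Subtract the mean and divide by the population standard deviation.

    Assumes v is not constant (a zero standard deviation would divide by zero).
    """
    n = len(v)
    mean = sum(v) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in v) / n)
    return [(x - mean) / std for x in v]


def pearson(a, b):
    """Pearson correlation, computed as the mean product of z-scores."""
    za, zb = standardize(a), standardize(b)
    return sum(x * y for x, y in zip(za, zb)) / len(a)


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def most_correlated_pair(data):
    """Baseline: compute the correlation of every pair and keep the largest."""
    pairs = combinations(range(len(data)), 2)
    return max(pairs, key=lambda p: pearson(data[p[0]], data[p[1]]))


def closest_standardized_pair(data):
    """Reduction: standardize once, then look for the closest pair."""
    z = [standardize(v) for v in data]
    pairs = combinations(range(len(z)), 2)
    return min(pairs, key=lambda p: sq_dist(z[p[0]], z[p[1]]))


# Synthetic data: 30 random vectors of length 20, with vector 7 planted
# as a noisy linear function of vector 2.
random.seed(0)
n_dims = 20
data = [[random.gauss(0, 1) for _ in range(n_dims)] for _ in range(30)]
data[7] = [3 * x + 5 + random.gauss(0, 0.05) for x in data[2]]

pair_by_corr = most_correlated_pair(data)
pair_by_dist = closest_standardized_pair(data)
print(pair_by_corr, pair_by_dist)
```

Both searches return the same pair of indices here. Standardizing each vector once up front is what turns the correlation search into an ordinary closest-pair problem in Euclidean space, which is the form that structures such as KD trees and RP trees are built to index.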