Sparse Matrix Algorithms Using GPGPU
Date
2012
Publisher
Tartu Ülikool
Abstract
The aim of this bachelor's thesis was to solve computationally demanding tasks as efficiently as possible using GPGPU, i.e. general-purpose computing on graphics processing units. As a concrete example, the product of a sparse matrix and a vector was examined. Matrix-vector multiplication underlies many algorithms, for example in image processing and machine learning.
A sparse matrix is a matrix in which most of the elements are zero. Since zeros do not affect the result of multiplication with a vector, the goal is to avoid unnecessary multiplications by zero. This can be achieved by changing both the algorithm used and the way the matrix is stored.
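The idea of skipping the zeros can be illustrated with the coordinate representation, where only the nonzero entries are kept as (row, column, value) triplets. A minimal sketch (the names are illustrative, not code from the thesis):

```cpp
#include <cstddef>
#include <vector>

// One nonzero entry of a sparse matrix.
struct Entry { std::size_t row, col; double val; };

// Multiply a sparse matrix, stored as nonzero triplets, with a vector.
// Only the stored nonzeros are touched, so no time is spent on zeros.
std::vector<double> coo_spmv(const std::vector<Entry>& nnz,
                             std::size_t rows,
                             const std::vector<double>& x) {
    std::vector<double> y(rows, 0.0);
    for (const Entry& e : nnz)
        y[e.row] += e.val * x[e.col];
    return y;
}
```

A dense loop over an n-by-m matrix always performs n*m multiplications; this version performs exactly one per stored nonzero.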
Five different sparse matrix storage formats were tested in the thesis. Under examination were the performance and memory requirements of matrix-vector multiplication algorithms tailored to the characteristics of each format. The formats were "full", "coordinate", "ELLPACK", "compressed sparse row" and "compressed diagonal". A further goal was to assess the performance difference between running the algorithms on a CPU and on a GPU.
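As a concrete illustration of one of the listed formats, the compressed sparse row layout and its multiplication loop might look as follows (a generic sketch of the standard CSR scheme, not code from the thesis):

```cpp
#include <cstddef>
#include <vector>

// CSR keeps three arrays: the nonzero values, their column indices,
// and for each row the offset of its first nonzero in those arrays.
struct Csr {
    std::vector<double>      val;      // nonzero values, row by row
    std::vector<std::size_t> col;      // column index of each value
    std::vector<std::size_t> row_ptr;  // row i occupies [row_ptr[i], row_ptr[i+1])
};

std::vector<double> csr_spmv(const Csr& A, const std::vector<double>& x) {
    std::size_t rows = A.row_ptr.size() - 1;
    std::vector<double> y(rows, 0.0);
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            y[i] += A.val[k] * x[A.col[k]];
    return y;
}
```

For example, the matrix [[1, 0, 2], [0, 0, 3]] is stored as val = {1, 2, 3}, col = {0, 2, 2}, row_ptr = {0, 2, 3}.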
The algorithms were implemented in OpenCL. OpenCL is a framework for writing programs that can execute on both CPUs and GPUs. The main difficulty lies in dividing the task into smaller parts so that they can be solved in parallel, making more effective use of the available computing power.
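A common way to divide sparse matrix-vector multiplication among OpenCL work-items is to assign one output row per work-item. The sketch below shows this textbook pattern (an assumption about the approach, not the thesis' actual kernel): the kernel source as it would be handed to the OpenCL compiler, with the same per-work-item body emulated on the CPU by looping over the global IDs.

```cpp
#include <cstddef>
#include <vector>

// A generic one-row-per-work-item CSR kernel might read:
//
//   __kernel void csr_spmv(__global const double* val,
//                          __global const uint*   col,
//                          __global const uint*   row_ptr,
//                          __global const double* x,
//                          __global double*       y) {
//       uint i = get_global_id(0);          // one work-item per row
//       double sum = 0.0;
//       for (uint k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
//           sum += val[k] * x[col[k]];
//       y[i] = sum;
//   }
//
// The same body, emulated sequentially on the CPU:
void csr_row(std::size_t i,                        // plays get_global_id(0)
             const std::vector<double>& val,
             const std::vector<std::size_t>& col,
             const std::vector<std::size_t>& row_ptr,
             const std::vector<double>& x,
             std::vector<double>& y) {
    double sum = 0.0;
    for (std::size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
        sum += val[k] * x[col[k]];
    y[i] = sum;
}
```

Because each row's sum is independent, all rows can run concurrently with no synchronization between work-items.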
Tests were run on the author's desktop computer and on compute nodes of EENet (the Estonian Education and Research Network). EENet also offered the additional opportunity to try dividing computational tasks between two graphics cards.
With the "full" storage format, using OpenCL was 3.6 times faster than plain C++ code. With the more complex formats, the performance gain was even more pronounced.
The results showed that performance depends heavily on the structure of the matrices and on the hardware used. For example, the "coordinate" format performed roughly 30 times worse on the nVidia graphics card than on the ATI card.
With the ELLPACK format, representing the vector as a texture gave the nVidia card additional performance, whereas the ATI card's result was about 50% worse than with the plain vector representation.
Based on the tests, the "compressed sparse row" format seemed the best all-round choice, giving the best result for every matrix. This may, however, change with a different selection of matrices.
Dividing the algorithms between two graphics cards yielded speedups of up to 60% for matrices with larger element counts. Using a second device, however, means a larger number of API calls and higher overhead. This means that for smaller matrices, where the computation itself takes less time, two devices may not improve performance. Indeed, the test results showed that for smaller matrices the two-GPU result was slower than with a single card.
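One plausible way to divide the work between two devices (the exact partitioning scheme is not described in the abstract, so this is an assumption) is to split the output rows into two contiguous ranges, each handed to its own command queue:

```cpp
#include <cstddef>
#include <utility>

// Split `rows` output rows between two devices: device 0 computes rows
// [0, mid), device 1 computes rows [mid, rows). Each half would be
// enqueued on its own OpenCL command queue; for small matrices the extra
// queue setup and kernel-launch overhead can outweigh the gain.
std::pair<std::size_t, std::size_t> split_rows(std::size_t rows) {
    std::size_t mid = rows / 2;
    return {mid, rows - mid};   // rows assigned to device 0 and device 1
}
```

An even row split assumes the nonzeros are spread evenly; for very unbalanced matrices, splitting by nonzero count instead would balance the load better.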
The purpose of this thesis was to benchmark and compare different representations of sparse matrices and the corresponding algorithms for multiplying them with a vector, and to measure the performance differences of running the algorithms on a CPU and on GPU(s). Five different storage formats were tested: full matrix storage, coordinate storage (COO), ELLPACK (ELL), compressed sparse row (CSR) and compressed diagonal storage (CDS).
Performance tests were run on a desktop computer and on compute nodes of EENet (the Estonian Education and Research Network). The EENet nodes also offered the opportunity to divide the workload between their two GPUs.
Using OpenCL gave a speedup of 3.6 times over pure C++ code when using the full storage format and the basic algorithm. With the more complex storage formats, the speed gain was even more distinct. Converting the vectors into images did not give the expected speedup in most cases, although it performed slightly better on the nVidia hardware. One option in such a situation would be to create both kinds of kernels, one with image support and another with normal memory access, and see which performs better; the conversion into textures requires only slight modifications to the kernel and host code.
The main difficulty in using OpenCL is the need to parallelize the original task effectively and to use as much of the available computing power as possible. As the results show, performance is highly dependent on the type of matrix and the hardware used. As an all-round choice, CSR seems the best, being the fastest in all tests. This may of course change with a different selection of matrices and further optimization of the kernels.
The performance benefit of using multiple devices also depends on the type of matrices used: with smaller ones, the additional overhead of creating a new command queue and launching kernels can nullify the advantage of the extra processing power. With larger matrices, speedups of up to 60% were noted.