Analyzing EEG data and improving data partitioning for machine learning algorithms
Date
2017-10-23
Abstract
During the doctoral work, a new method was developed for using machine learning data more efficiently.
In classical statistics, models are simple enough that, together with assumptions about the data, they can tell whether the results obtained are statistically significant, that is, whether the data contain any signal distinguishable from noise. Machine learning algorithms, such as deep neural networks, often contain hundreds of millions of parameters, which changes the logic of the whole workflow. These models can always describe the data 100%, regardless of whether a signal is present. In machine learning terms, this is overfitting.
For this reason, machine learning uses a different method for measuring statistical significance. Part of the original data is set aside, meaning it is not used to train the model. Once the best model has been built from the data that were used, it is tested on the data set aside earlier. The problem is that machine learning algorithms need a great deal of data, and everything that is set aside is wasted as far as training the model is concerned.
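The hold-out idea described above can be illustrated with a minimal sketch. It is not taken from the thesis: the one-nearest-neighbour "model", the pure-noise data and the 300/100 split are all assumptions chosen only to show that a model able to memorise its training data scores perfectly there even when no signal exists, while the set-aside data reveal chance-level performance.

```python
import random

def nearest_neighbour_label(train, x):
    """Return the label of the training point closest to x (a memorising model)."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def accuracy(train, points):
    """Fraction of points whose label the memorising model predicts correctly."""
    hits = sum(nearest_neighbour_label(train, x) == y for x, y in points)
    return hits / len(points)

rng = random.Random(0)
# Pure noise (assumption for illustration): labels are coin flips
# that do not depend on the feature at all.
data = [(rng.random(), rng.randint(0, 1)) for _ in range(400)]

# Set part of the data aside; it plays no role in "training".
train, held_out = data[:300], data[300:]

train_accuracy = accuracy(train, train)        # each point finds itself
held_out_accuracy = accuracy(train, held_out)  # honest estimate on unseen data
print(train_accuracy, held_out_accuracy)
```

The training score says nothing about whether a signal is present; only the score on the set-aside points does, and those 100 points were not available for training.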
Researchers have long looked for ways to ease this problem, and several methods have been adopted, but unfortunately each has its own shortcomings. With cross-validation, for example, all of the data can be used very efficiently, but the model parameters cannot be interpreted. If we set data aside instead, that information is available, but the model itself is less efficient.
In this thesis we devised a new way of partitioning the data. As before, a test set is first set aside, and the model parameters are then fixed using cross-validation. Testing on the set-aside data, however, is done in several parts, and the data left over in each part are used again to train the model.
We thus reuse all of the data, yet the parameters remain interpretable, so in the end we know whether the linear or the exponential model won, or the three-layer or the four-layer neural network. With complex data in the natural sciences this is often exactly what is needed: being able to state at the end of a scientific article which model was best. All the values of the model weights, on the other hand, are often not needed. In that situation the new method is, to our knowledge, currently the most efficient and best in the world.
This thesis presents a novel, more efficient data handling method for machine learning. In classical statistics, models are rather simple, and together with some assumptions about the data it is possible to say whether a given result is statistically significant. Machine learning algorithms, on the other hand, can have hundreds of millions of model weights. Such models can explain any data with 100% accuracy, which changes the rules of the game. This issue is solved by evaluating the models on a separate test set: some data points are not used in the model fitting phase, and once the best model has been found, its quality is evaluated on that test set. This method works well, but some of the precious data is spent on testing the model and not used in training. Researchers have come up with many solutions to improve the efficiency of data usage. One of the main methods, nested cross-validation, uses data very efficiently but makes it very difficult to interpret model parameters. In this thesis, we invented a novel approach to data partitioning that we termed "Cross-validation and cross-testing". First, cross-validation is used on part of the data to determine and lock the model. Then the model is tested on a separate test set in a novel way, such that on each testing cycle part of the data is also used in a model training phase. This gives us an improved system for using machine learning algorithms when we need to interpret model parameters but not the model weights. For example, it lets us state that the data has a linear relationship rather than a quadratic one, or that the best neural network has five hidden layers.
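The two-stage procedure described in the abstract might be sketched as follows. This is an illustrative reading of the description, not the authors' implementation: the k-nearest-neighbour classifier, the toy one-dimensional data, the 30% test fraction, the five folds and all function names are assumptions. The sketch also assumes that, in each testing cycle, the model with the locked parameter is retrained on the cross-validation data plus the test folds not currently being scored.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (1-D feature)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def accuracy(train, points, k):
    return sum(knn_predict(train, x, k) == y for x, y in points) / len(points)

def folds(data, n_folds):
    """Yield (rest, held_out) pairs; every point is held out exactly once."""
    for i in range(n_folds):
        held_out = data[i::n_folds]
        rest = [p for j, p in enumerate(data) if j % n_folds != i]
        yield rest, held_out

def cross_validate(data, k, n_folds):
    scores = [accuracy(rest, held, k) for rest, held in folds(data, n_folds)]
    return sum(scores) / len(scores)

def cross_validation_and_cross_testing(data, candidates,
                                       test_fraction=0.3, n_folds=5):
    n_test = int(len(data) * test_fraction)
    test_part, cv_part = data[:n_test], data[n_test:]

    # Stage 1: choose and lock the interpretable parameter (here k)
    # using cross-validation on the non-test data only.
    best_k = max(candidates, key=lambda k: cross_validate(cv_part, k, n_folds))

    # Stage 2 (cross-testing): score each test fold with a model trained on
    # the cross-validation data plus the remaining test folds. The locked
    # parameter never changes, and no point is scored by a model that saw it.
    hits = 0
    for rest_of_test, held_out in folds(test_part, n_folds):
        train = cv_part + rest_of_test
        hits += sum(knn_predict(train, x, best_k) == y for x, y in held_out)
    return best_k, hits / len(test_part)

# Toy data (assumption): two classes with feature means 0 and 2.
rng = random.Random(0)
data = []
for _ in range(200):
    label = rng.randint(0, 1)
    data.append((rng.gauss(2.0 * label, 1.0), label))

best_k, test_accuracy = cross_validation_and_cross_testing(data, [1, 5, 15])
print(best_k, test_accuracy)
```

In this sketch the reported parameter `best_k` is a single interpretable value, while the fitted models differ between testing cycles, which matches the case where the parameters matter but the individual model weights do not.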
Description
The electronic version of the dissertation does not include the publications
Keywords
electroencephalography, electroencephalogram, data analysis, automatic learning, algorithms