Developing a scikit-learn module for a new machine learning data partitioning scheme
Date
2017
Abstract
Machine learning is a field in which predictions are made on the basis of data and statistical models. Data partitioning lets developers efficiently test and report the accuracy or error rate of their models when data is limited. Depending on the partitioning scheme, these methods also return various quantities that describe the model, such as hyperparameters. A new data partitioning method called cross-validation and cross-testing has been discovered. However, it has not yet found wide use, because no open-source machine learning library includes it. In this thesis we develop a module suitable for scikit-learn and apply it to various tasks. The developed module is released under an open-source license, which means that everyone can use it freely. Initial experiments show that the new data partitioning method may give worse results on regression tasks than we initially expected. Cross-validation and cross-testing must therefore be studied further in order to better understand and make wider use of this new data partitioning scheme.
Machine learning is the field of using data and statistical models to make predictions. With the help of data partitioning schemes, researchers are able to efficiently test and report the accuracies or error values of their models with limited data. Depending on the partitioning scheme, other helpful results, such as hyper-parameters of the model, can be returned. A new data partitioning scheme, cross-validation and cross-testing, has been discovered. However, it is not yet widely used, because currently no open-source machine learning library has a function for it. In this thesis we publish a scikit-learn-compatible function on GitHub and apply it to different tasks. This new function can be used by anybody under an open-source license. Our tests showed that this new partitioning scheme might perform slightly worse on regression tasks than was previously thought. Cross-validation and cross-testing must therefore be studied further, to better understand the scheme and to facilitate its use.