Representation Learning on Free Text Medical Data

dc.contributor.advisorKolde, Raivo, supervisor
dc.contributor.advisorLaur, Sven, supervisor
dc.contributor.authorPerli, Meelis
dc.contributor.otherTartu Ülikool. Loodus- ja täppisteaduste valdkond
dc.contributor.otherTartu Ülikool. Arvutiteaduse instituut
dc.date.accessioned2023-09-14T09:42:11Z
dc.date.available2023-09-14T09:42:11Z
dc.date.issued2021
dc.description.abstractOver 99% of the clinical records in Estonia are digitized. This is a great resource for clinical research; however, much of this data cannot be easily used, because a lot of the information is in free-text format. In recent years, deep learning models have revolutionized the field of natural language processing, enabling faster and more accurate ways to perform various tasks, including named entity recognition and text classification. To facilitate the use of such methods on Estonian medical records, this thesis explores methods for pre-training BERT models on the notes from "Digilugu". Three BERT models were pre-trained on these notes. Two of the models were pre-trained from scratch: one on the clinical notes only, the other also using texts from the Estonian National Corpus 2017. The third model is an optimized version of EstBERT, a previously pre-trained model. To demonstrate the utility of such models and compare their performance, all four models were fine-tuned and evaluated on three classification downstream tasks and one named entity recognition downstream task. The best performance was achieved with the model trained only on the notes. The transfer-learning approach used to adapt EstBERT to the clinical notes improved pre-training speed and performance, but the resulting model still performed slightly worse than the best model pre-trained in this thesis.
dc.identifier.urihttps://hdl.handle.net/10062/92194
dc.language.isoeng
dc.publisherTartu Ülikool
dc.rightsopenAccess
dc.rightsAttribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.urihttp://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subjectartificial intelligence
dc.subjecttransfer learning
dc.subjectnatural language processing
dc.subjectdeep learning
dc.subject.othermagistritööd
dc.subject.otherinformaatika
dc.subject.otherinfotehnoloogia
dc.subject.otherinformatics
dc.subject.otherinformation technology
dc.titleRepresentation Learning on Free Text Medical Data
dc.typeThesis

Files

Original bundle
Name: perli_computerscience_2021.pdf
Size: 909.27 KB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 1.71 KB
Description: Item-specific license agreed upon at submission