Representation Learning on Free Text Medical Data

dc.contributor.advisorKolde, Raivo, supervisor
dc.contributor.advisorLaur, Sven, supervisor
dc.contributor.authorPerli, Meelis
dc.contributor.otherTartu Ülikool. Loodus- ja täppisteaduste valdkond
dc.contributor.otherTartu Ülikool. Arvutiteaduse instituut
dc.date.accessioned2023-09-14T09:42:11Z
dc.date.available2023-09-14T09:42:11Z
dc.date.issued2021
dc.description.abstractOver 99% of the clinical records in Estonia are digitized. This is a great resource for clinical research; however, much of this data cannot be easily used, because a lot of the information is in free-text format. In recent years, deep learning models have revolutionized the field of natural language processing, enabling faster and more accurate ways to perform various tasks, including named entity recognition and text classification. To facilitate the use of such methods on Estonian medical records, this thesis explores methods for pre-training BERT models on the notes from "Digilugu". Three BERT models were pre-trained on these notes. Two of the models were pre-trained from scratch: one on the clinical notes only, the other also using texts from the Estonian National Corpus 2017. The third model is an optimized version of EstBERT, a previously pre-trained model. To demonstrate the utility of such models and compare their performance, all four models were fine-tuned and evaluated on three classification downstream tasks and one named entity recognition downstream task. The best performance was achieved with the model trained only on the notes. The transfer-learning approach used to adapt EstBERT to the clinical notes improved pre-training speed and performance, but the resulting model still performed slightly worse than the best model pre-trained in this thesis.
dc.identifier.urihttps://hdl.handle.net/10062/92194
dc.language.isoeng
dc.publisherTartu Ülikool
dc.rightsopenAccess
dc.rightsAttribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.urihttp://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subjectartificial intelligence
dc.subjecttransfer learning
dc.subjectnatural language processing
dc.subjectdeep learning
dc.subject.othermagistritööd
dc.subject.otherinformaatika
dc.subject.otherinfotehnoloogia
dc.subject.otherinformatics
dc.subject.otherinformation technology
dc.titleRepresentation Learning on Free Text Medical Data
dc.typeThesis

Files

Original bundle
Name: perli_computerscience_2021.pdf
Size: 909.27 KB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 1.71 KB
Description: Item-specific license agreed upon at submission