Patient Treatment Trajectories Using Vector Embeddings

Siimon, Õie Renata

Patient Treatment Trajectories Using Vector Embeddings

dc.contributor.advisor	Laur, Sven, juhendaja
dc.contributor.author	Siimon, Õie Renata
dc.contributor.other	Tartu Ülikool. Loodus- ja täppisteaduste valdkond	et
dc.contributor.other	Tartu Ülikool. Arvutiteaduse instituut	et
dc.date.accessioned	2023-10-19T10:43:42Z
dc.date.available	2023-10-19T10:43:42Z
dc.date.issued	2023
dc.description.abstract	In this thesis, data from Estonian Health Insurance Fund (Haigekassa) in 2010–2019 was used to construct vector representations of patient treatment trajectories with BERT, and for comparison, with word2vec. The goal was to see how well such natural language processing (NLP) models perform when sequences of medical services are used as input instead of sentences, and if BERT performs better than word2vec. So far, research on how well NLP models work with non-natural language sequences is limited, and this thesis contributes to filling this gap. In this thesis, treatment trajectories were built as sequences of service codes appearing on 41 million medical invoices. Models in this thesis were constructed in two stages. First, service code embeddings were trained with BERT and word2vec. Then, classification models were built by fine-tuning BERT and training KNN and SVM classifiers on top of word2vec embeddings. Results showed that despite poor performance of BERT in pre-training stage, it outperformed models built on top of word2vec embeddings in all seven classification tasks. The highest accuracy (0.9918) was achieved in classifying treatment types (5 classes) and the lowest (0.4121) in classifying diagnosis (174 classes). It was concluded that BERT indeed proved useful with this type of non-natural language input data, and that the contextual embeddings of BERT worked better than non-contextual ones of word2vec. From among the four BERT models built in this thesis, the second largest was the overall best, showing that if the ‘language’ used is simpler than natural language, then BERT models with reduced dimensions might work better.	et
dc.identifier.uri	https://hdl.handle.net/10062/93620
dc.language.iso	eng	et
dc.publisher	Tartu Ülikool	et
dc.rights	openAccess	et
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/	*
dc.subject	machine learning	et
dc.subject	treatment trajectory	et
dc.subject	medical bill	et
dc.subject	word2vec	et
dc.subject	BERT	et
dc.subject.other	magistritööd	et
dc.subject.other	informaatika	et
dc.subject.other	infotehnoloogia	et
dc.subject.other	informatics	et
dc.subject.other	infotechnology	et
dc.title	Patient Treatment Trajectories Using Vector Embeddings	et
dc.type	Thesis	et

Failid

Originaal pakett

Nüüd näidatakse 1 - 1 1

Nimi:: Siimon_MSc_DataScience_2023.pdf
Suurus:: 1.82 MB
Formaat:: Adobe Portable Document Format
Kirjeldus:

Lae alla

Litsentsi pakett

Nüüd näidatakse 1 - 1 1

Nimi:: license.txt
Suurus:: 1.71 KB
Formaat:: Item-specific license agreed upon to submission
Kirjeldus:

Lae alla

Kollektsioonid

LTAT magistritööd – Master's theses