Browse by keyword "BERT"

Now showing 1 - 7 of 7

Open access
BERT mudeli kohandamine eesti keelele [Adapting the BERT Model to Estonian]
(University of Tartu, 2023) Niit, Raul; Laur, Sven, supervisor; Šuvalov, Hendrik, supervisor; University of Tartu. Faculty of Science and Technology; University of Tartu. Institute of Computer Science
The rapid development of language models has turned computers into skilled users of human language, making it easy today to solve many different kinds of language tasks, whether translating texts, classifying them, or generating new text. The BERT language model, created by Google researchers in 2018, remains one of the most popular language models thanks to its powerful architecture and open source code. To improve on the model, language-specific BERT models have also been created, such as ESTBERT, created in 2020 and adapted for Estonian-language tasks. The aim of this master's thesis is to modify the architecture of the BERT model so that the model can use additional morphological information about the input, such as the lemmas and forms of words. In the thesis, a model with the modified architecture is trained and its capability is analysed on four language tasks.
Open access
Exploring Social Bias in Language Models through the Lens of Cinema
(University of Tartu, 2025) Rikanson, Liisa; Sabir, Ahmed Abdulmajeed A, supervisor; University of Tartu. Faculty of Science and Technology; University of Tartu. Institute of Computer Science
Language models have revolutionized natural language processing, becoming an integral part of many applications. However, these models often exhibit societal biases embedded in their training data, raising concerns about their fairness and ethical deployment. Measuring these biases usually requires creating datasets with time-consuming human annotation, which is costly and hard to expand. To address this challenge, we propose a data curation framework and CineBias, a novel dataset of 1,012 stereotypical sentence pairs covering seven bias categories, extracted from Hollywood movie subtitles with minimal human intervention. We evaluate the language models BERT, RoBERTa, and ModernBERT using the CrowS-Pairs Score (CPS) on CineBias, and find bias levels comparable to established benchmarks (e.g., BERT 61.2% CPS). This shows that CineBias provides a scalable way to measure bias. We also demonstrate its applicability to low-resource languages with an Estonian case study.
Open access
Metaphor Identification for Estonian
(University of Tartu, 2021) Kittask, Claudia; Barbu, Eduard, supervisor; University of Tartu. Faculty of Science and Technology; University of Tartu. Institute of Computer Science
Metaphors are a common facet of written and spoken language. Humans identify and interpret metaphors fairly easily, but machines struggle to match this capability. Much research on metaphors has been done in recent decades, but mainly for English, using approaches ranging from rule-based to deep learning-based systems. As of the date of this thesis, no research has been done on computational metaphor processing for the Estonian language. In this thesis, research in the field of computational metaphor processing is applied explicitly to the Estonian language. All the methods implemented are unsupervised or semi-supervised because metaphor resources for Estonian do not exist. This thesis also attempts to incorporate contextualized embeddings from the BERT language model into metaphor identification systems to enhance performance. To test the performance of the methods, a new evaluation dataset for the Estonian language was created. This dataset contains 500 sentences: 232 contain a VERB-NOUN phrase in which the VERB is used metaphorically, and 268 contain one in which the VERB is used literally. The best results were obtained using BERT embeddings together with information from Estonian WordNet.
Open access
NER som ett källidentifieringsverktyg. Erfarenheter av svenska BERT för digital historia 1.25 [NER as a source identification tool. Experiences of Swedish BERT for digital history 1.25]
(Tartu University Library, 2025) Norrby, Jens; Nermo, Magnus; Papadopoulou Skarp, Frantzeska; Tienken, Susanne; Widholm, Andreas; Blåder, Anna; Verhagen, Harko; Fridlund, Mats
The paper explores my experiences of working with Named Entity Recognition (NER) in Swedish parliamentary records. As such, it provides a practical account of my methodology in employing the Swedish BERT and its NER functionality on a historical dataset. It also reflects on the relevance of this case to the broader relationship between digital and traditional intellectual history. The study described used NER to identify the geographical areas and placenames within Swedish parliamentary discourse from 1887 to 1914. Taken together, this list of locations could be used to determine the aggregate frequencies of geographical groupings, in this case predominantly nations. The quantitative findings were subsequently used to navigate the dataset and identify the most relevant texts for qualitative, contextual close readings. This paper argues that there are strengths in employing digital tools while maintaining the framework of traditional intellectual history, in accordance with ‘digital history 1.25’.
Open access
Nimeüksuste tuvastaja loomine puudepanga korpuse põhjal [Creating a Named Entity Recognizer Based on a Treebank Corpus]
(University of Tartu, 2025) Kivisikk, Martin; Orasmaa, Siim, supervisor; University of Tartu. Faculty of Science and Technology; University of Tartu. Institute of Computer Science
In natural language processing, named entity recognition aims to tag information units in text, such as names of people, organizations and locations. Named entity tags have recently been added to the Estonian UD treebanks, but no named entity recognition models using the datasets have been made. In this thesis, models based on BERT were fine-tuned on both individual and combined training sets. The best model turned out to be Est-RoBERTa fine-tuned on the combined training set, which achieved an F-score of 0.828 on the test set. The study revealed that models perform worse on external datasets, as named entities are not necessarily defined and annotated consistently across different corpora.
Open access
Patient Treatment Trajectories Using Vector Embeddings
(University of Tartu, 2023) Siimon, Õie Renata; Laur, Sven, supervisor; University of Tartu. Faculty of Science and Technology; University of Tartu. Institute of Computer Science
In this thesis, data from the Estonian Health Insurance Fund (Haigekassa) for 2010–2019 was used to construct vector representations of patient treatment trajectories with BERT and, for comparison, with word2vec. The goal was to see how well such natural language processing (NLP) models perform when sequences of medical services are used as input instead of sentences, and whether BERT performs better than word2vec. So far, research on how well NLP models work with non-natural language sequences is limited, and this thesis contributes to filling this gap. Treatment trajectories were built as sequences of service codes appearing on 41 million medical invoices. The models were constructed in two stages. First, service code embeddings were trained with BERT and word2vec. Then, classification models were built by fine-tuning BERT and by training KNN and SVM classifiers on top of word2vec embeddings. Results showed that, despite the poor performance of BERT in the pre-training stage, it outperformed the models built on top of word2vec embeddings in all seven classification tasks. The highest accuracy (0.9918) was achieved in classifying treatment types (5 classes) and the lowest (0.4121) in classifying diagnoses (174 classes). It was concluded that BERT proved useful with this type of non-natural language input data, and that the contextual embeddings of BERT worked better than the non-contextual ones of word2vec. Of the four BERT models built in this thesis, the second largest was the best overall, showing that if the ‘language’ used is simpler than natural language, then BERT models with reduced dimensions might work better.
Open access
Raamistik närvivõrgupõhiste infoeraldustöövoogude loomiseks [A Framework for Creating Neural Network-Based Information Extraction Workflows]
(University of Tartu, 2022) Šuvalov, Hendrik; Särg, Dage, supervisor; Kolde, Raivo, supervisor; Laur, Sven, supervisor; University of Tartu. Faculty of Science and Technology; University of Tartu. Institute of Computer Science
Medical texts, such as diagnoses and discharge summaries, mostly occur in unstructured form, often as free text. To extract valuable information (named entities and the semantic relations between them) from these texts, rule- and pattern-based approaches, including regular expressions, are generally used. In most cases this is the fastest and most efficient approach, but particularly in this domain it can be difficult when the text contains many spelling errors or when we do not know exactly which patterns to look for. In such cases, neural networks would do the job more successfully than rule-based approaches, because they can learn the meanings of words according to the context in which they occur. The result of this thesis is a workflow that lets the user create information extraction workflows on medical texts using the EstMedBERT language model, which has been pre-trained specifically on Estonian medical texts and can be fine-tuned to classify tokens. Once the model has learned the task from the initial data, it can be used to annotate subsequent texts, which the user checks and on which the model is then trained iteratively with ever more data. This type of training is called human-in-the-loop learning and is a part of active learning. Such an approach may be more useful for certain types of information extraction tasks, and creating workflows for finding new named entities may be easier for the user with this approach, since it requires no technical skills from them. In addition to the completed work, we also used the workflow we developed to build a tagger that uses our EstMedBERT model, applied it to texts, and analysed both our approach and the results.