Browsing by Author "Sirts, Kairit, juhendaja"

Automaatse lausestamise ja sõnestamise hindamine uue meedia keele korpusel
(Tartu Ülikool, 2020) Peekman, Kairit; Sirts, Kairit, juhendaja; Tartu Ülikool. Loodus- ja täppisteaduste valdkond; Tartu Ülikool. Arvutiteaduse instituut
The web contains many texts that are not orthographically correct (e.g. forum posts and person-to-person communication in comments, chat rooms, etc.). This is the so-called new media language, or internet language. This bachelor's thesis asks how well three text-processing tools (EstNLTK, UDPipe and StanfordNLP) perform sentence segmentation and tokenization on new media language text. EstNLTK tokenizes with rules and segments sentences with a model followed by a rule-based post-check; UDPipe and StanfordNLP have pretrained Estonian models for tokenization and sentence segmentation. All three still have room for improvement in sentence segmentation of new media language texts, but EstNLTK and StanfordNLP performed better than UDPipe. Tokenization results differed less and were generally good, with F-scores above 95%.
Automated cognitive distortion detection and classification of Reddit posts using machine learning
(Tartu Ülikool, 2021) Sochynskyi, Stanislav; Sirts, Kairit, juhendaja; Tartu Ülikool. Loodus- ja täppisteaduste valdkond; Tartu Ülikool. Arvutiteaduse instituut
A vicious circle of exaggerated thinking patterns, also known as cognitive distortions, can lead a person to anxiety and major depression. Automatic detection and classification of cognitive distortions can be beneficial for initial mental health screening, better use of counselling time, and improved accessibility of mental healthcare services. In this work, we apply logistic regression, Support Vector Machine (SVM), and fasttext classifiers to identify cognitive distortions in real-world data from Reddit. For binary classification, the best F-score of 0.71 was achieved with the fasttext classifier. For the multiclass classification task, the best F-score of 0.23 was achieved with an SVM using tf-idf vectorisation. However, the metrics of some classes do not exceed the random chance baseline. A possible explanation is that the created dataset is sufficient to build a binary classifier, but more accurate models require more data to distinguish a larger number of classes. Additionally, we experimented with unsupervised clustering and topic modelling algorithms and did not find evidence that unsupervised methods could extract the patterns of cognitive distortions from a text. We developed an annotation guideline for manual annotation of cognitive distortions and applied it to annotate 2021 Reddit posts. We achieved a kappa score of 0.569 for the binary case and 0.424 for the multiclass case, indicating moderate agreement between annotators. A higher number of classes leads to lower annotation agreement, mainly due to overlapping definitions of cognitive distortions. Consequently, automated methods cannot be expected to achieve high scores in cognitive distortion classification.
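The agreement figures in this abstract can be illustrated with a small calculation. The sketch below assumes Cohen's kappa for two annotators (the abstract does not name the kappa variant or the number of annotators), and the label lists are made-up toy data, not the thesis annotations.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement expected from each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy annotations (1 = distortion present, 0 = absent).
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

On the commonly used Landis and Koch scale, values between 0.41 and 0.60 are read as moderate agreement, which is consistent with how the abstract interprets 0.569 and 0.424.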
Extracting information from app reviews to facilitate software development activities
(2020-01-14) Shah, Faiz Ali; Pfahl, Dietmar Alfred Paul Kurt, juhendaja; Sirts, Kairit, juhendaja; Tartu Ülikool. Loodus- ja täppisteaduste valdkond
Assessing users' needs and expectations is important for developers who want to improve the quality of their software applications. Reviews submitted to mobile app platforms are a useful source of information for assessing users' constantly changing needs. However, the large volume of reviews submitted to app platforms every day calls for automatic methods to find the useful information in them. Text classification models can be used to categorize reviews automatically, e.g. as a bug report or a feature request. Automatically mining app features from reviews helps summarize user sentiment about an app's existing features. We first experiment with different text classification models and compare simple models that use lexical features with more complex ones that use rich linguistic features or are based on artificial neural networks. To study how different factors affect feature mining methods, we first establish the baseline accuracy of the different methods by applying them under the same experimental conditions. We then compare the methods under different conditions, varying the annotated datasets used for training and the evaluation methods. Because mining methods based on supervised machine learning are more sensitive than rule-based ones to (1) the annotation guidelines used to label the data and (2) the size of the labelled dataset, we studied the effect of these factors in the supervised setting and proposed new annotation guidelines that may help improve the accuracy of feature mining. This doctoral project also produced a proof-of-concept tool for comparing competing apps. The tool combines review text classification with app feature mining. Ten software developers who evaluated the tool found that it can be useful for improving app quality.
Lexicon-Enhanced Neural Lemmatization for Estonian
(Tartu Ülikool, 2020) Milintsevich, Kirill; Sirts, Kairit, juhendaja; Tartu Ülikool. Loodus- ja täppisteaduste valdkond; Tartu Ülikool. Arvutiteaduse instituut
The problem of lemmatization, i.e. recovering the normal, or dictionary, form of a word from the text, is a crucial part of natural language processing applications. It is important for text preprocessing, the step of cleaning and preparing the data for use in NLP models and algorithms. This step can greatly improve the performance of a model if done correctly or, on the other hand, drastically reduce the quality of the output if neglected. Nowadays, neural networks dominate the field of NLP, including the problem of lemmatization. Most recent papers report 95-96% accuracy, but there is still plenty of room for improvement. As with most neural network architectures, the lack of training data can be a major drawback during model creation. There exist many smaller languages that cannot afford to have large annotated datasets. The Estonian language, being somewhat in the middle in terms of its data size, can benefit from additional data. In this thesis, we propose a novel approach for lemmatization. In addition to the regular input, the lemmatization model either takes the predictions from another, weaker rule-based lemmatizer or uses a lexicon built from the training data to enhance the lemma prediction. With the combination of several attention layers, the model manages to choose the best of the two inputs and produce more accurate lemmas.
Parameter-efficient fine-tuning in reading comprehension
(Tartu Ülikool, 2023) Abdumalikov, Rustam; Kementchedjhieva, Yova, juhendaja; Sirts, Kairit, juhendaja; Tartu Ülikool. Loodus- ja täppisteaduste valdkond; Tartu Ülikool. Arvutiteaduse instituut
Question Answering is an important task in Natural Language Processing. There are different approaches to answering questions, such as using the knowledge learned during pre-training or extracting an answer from a given context, which is commonly known as reading comprehension. One problem with the knowledge learned during pre-training is that it can become outdated because the model is trained only once. Instead of replacing outdated information in the model, an alternative approach is to add updated information to the model input. However, there is a risk that the model may rely too much on its memorized knowledge and ignore new information, which can cause errors. Our study aims to analyze whether parameter-efficient fine-tuning methods would improve the model’s ability to handle new information. We assess the effectiveness of these techniques in comparison to traditional fine-tuning for reading comprehension on an augmented NaturalQuestions dataset. Our findings indicate that parameter-efficient fine-tuning leads to a marginal improvement in performance compared to traditional fine-tuning. Furthermore, we observed that data augmentations contributed the most substantial performance enhancements.
Predicting Cognitive Distortions from Reddit Posts by Using Supervised Machine Learning Methods
(Tartu Ülikool, 2022) Grents, Linda Katariina; Sirts, Kairit, juhendaja; Tartu Ülikool. Loodus- ja täppisteaduste valdkond; Tartu Ülikool. Arvutiteaduse instituut
The importance of mental health has gained great attention in modern societies. People have become more open about discussing their thoughts with the public, especially online. One platform that people use for this is Reddit. The aim of this thesis is to predict cognitive distortions from texts retrieved from the Anxiety subreddit. Cognitive distortions are important to detect as they can potentially have a negative impact on people’s lives. Predictions in this work are made using supervised machine learning methods, such as logistic regression, support vector machine and fasttext (also with pre-trained word vectors). In addition, inter-annotator agreement is assessed with Cohen’s Kappa and Krippendorff’s Alpha. The results show that predicting cognitive distortions from text is a challenge on its own, since the classifiers were not able to produce satisfactory results. This corresponds to related works where predicting different types of distortions has not given very good results. It is assumed that it would be more reasonable to predict the existence of cognitive distortions from the text rather than predicting different types of distortions, as this prediction shows better results. Predicting the existence of some distortion might be of more help to people suffering from anxiety or depression. It might also be useful to predict only the most prevalent distortions from the text, as some distortions are probably more prevalent than others. It is important to note that a major constraint in this work is related to the dataset, as it is relatively small in size and noisy. If there is a need to predict different types of cognitive distortions, it is suggested to use a larger dataset of better quality. However, this remains a challenge on its own in the natural language processing and clinical psychology research areas.
Predicting Depression Symptoms Based on Reddit Posts
(Tartu Ülikool, 2022) Koljal, Kaire; Sirts, Kairit, juhendaja; Tartu Ülikool. Loodus- ja täppisteaduste valdkond; Tartu Ülikool. Arvutiteaduse instituut
Using social media posts to predict mental health problems has become a popular topic in Natural Language Processing (NLP). Machine learning has been used for detecting a diagnosis or single symptoms associated with depression. As the clinical picture of depression can differ between people, it is better to detect symptoms rather than a diagnosis from social media posts. In this work, depression symptoms are predicted based on posts from the Reddit page r/depression using NLP methods and multi-label classification. This work focuses on evaluating the quality of the annotations and analysing whether such data can be used to train a predictive model. Each post is annotated by three annotators and the labels are aggregated in three ways to create three datasets that are used to train Transformer models. The results of this work reveal that on a small dataset with lower annotation agreement, a majority vote over annotations gives the most reliable dataset and results. The RoBERTa model shows the best learning and generalization ability in this work.
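A minimal sketch of majority-vote aggregation for multi-label annotations, as mentioned in this abstract. It assumes each annotator provides a set of symptom labels per post and that a label is kept when more than half of the annotators assigned it; the thesis's exact aggregation rules and label names are not given in the abstract, so the function and the labels below are hypothetical.

```python
from collections import Counter


def majority_vote(annotations, min_votes=None):
    """Keep every label that a strict majority of annotators assigned.

    annotations: list with one collection of labels per annotator.
    min_votes: optional override of the vote threshold (assumption:
    the default is floor(n / 2) + 1 for n annotators).
    """
    if not annotations:
        raise ValueError("need at least one annotator")
    if min_votes is None:
        min_votes = len(annotations) // 2 + 1
    # Count each label at most once per annotator.
    votes = Counter(label for labels in annotations for label in set(labels))
    return {label for label, count in votes.items() if count >= min_votes}


# Hypothetical annotations of a single post by three annotators.
post_annotations = [
    {"sleep_problems", "fatigue"},
    {"sleep_problems"},
    {"sleep_problems", "fatigue", "low_mood"},
]
aggregated = majority_vote(post_annotations)
print(sorted(aggregated))
```

Setting `min_votes=1` or `min_votes=len(annotations)` turns the same function into a union or an intersection of the annotators' labels, two other common ways to aggregate; whether these match the other two schemes used in the thesis is not stated.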
Premorbiidse võimekuse hindamismeetodi välja arendamine Eestis WAIS-III andmete põhjal
(Tartu Ülikool, 2022) Viiret, Aleksander; Anni, Kätlin, juhendaja; Sirts, Kairit, juhendaja; Tartu Ülikool. Sotsiaalteaduste valdkond; Tartu Ülikool. Psühholoogia instituut
Pressinõukogule esitatud kaebuste otsuste ennustamine masinõppe abil
(Tartu Ülikool, 2022) Rämson, Anne-Liis; Sirts, Kairit, juhendaja; Tartu Ülikool. Loodus- ja täppisteaduste valdkond; Tartu Ülikool. Arvutiteaduse instituut
This thesis examines complaints submitted to the Press Council (Pressinõukogu) in 2001-2021 and the media texts those complaints concern. The aims are to give a statistical overview of the complaints and of the mentions of code-of-ethics clauses in the Press Council's decisions, to apply classification methods to the media texts the complaints concern, and to find a classification model that distinguishes media texts that received an acquitting decision from those that received a condemning one. The theoretical part gives an overview of text mining, classification models (logistic regression, support vector classifier, fastText) and evaluation metrics for classification models. The analysis of the complaints showed that the larger Estonian publications fall into two groups according to which code-of-ethics clauses are mentioned. For three large publications, clause 4.2 received the most mentions in the complaints; for three others, clause 1.4 received the most mentions. Condemning decisions were predicted best by the fastText classifier on lemmatized texts.
Psühhoosi prodroomi sümptomite eraldamine meditsiinitekstidest treeningandmestike loomiseks
(Tartu Ülikool, 2024) Agu, Kristel; Reisberg, Sulev, juhendaja; Sirts, Kairit, juhendaja; Tartu Ülikool. Loodus- ja täppisteaduste valdkond; Tartu Ülikool. Arvutiteaduse instituut
This master's thesis aimed to create three annotated training datasets for the extraction of psychosis prodromal symptoms from medical texts using semi-automatic methods. For this purpose, a dataset of medical documents from a randomly selected 10% of the Estonian population in the years 2012-2019 was used. These documents were filtered by the ICD-10 diagnoses evident during psychosis prodrome (2780 texts) and split into sentences (31 009) to simplify the further workflow. A dataset was created from the sentences, which were filtered using a regular expression and annotated manually by the author, and used to train an initial logistic regression model. To create the features for the logistic regression model, word embeddings were found for each word in a sentence using the Word2Vec model pre-trained on the Estonian Reference Corpus, and an average embedding was calculated for the whole sentence. After that, an iterative process was initiated in which more sentences containing the symptom were predicted from the remaining data, annotated by the author and added to the existing dataset; this was repeated until the model found no new sentences. Using the logistic regression model for the extraction of psychosis prodromal symptoms simplified the dataset creation process and reduced the amount of work put into searching for the sentences manually. As a result of this master's thesis, an annotated training dataset with 799 sentences for extracting the psychosis prodrome symptom “odd behaviour”, a dataset with 643 sentences for the symptoms “depersonalization” and/or “derealization” and a dataset with 1176 sentences for the symptoms “paranoid delusions” and/or “suspiciousness” were created.
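The sentence features described in this abstract (one word embedding per word, averaged over the sentence) can be sketched as follows. This is an illustration under stated assumptions, not the thesis code: the three-dimensional embedding table is made-up toy data standing in for the pre-trained Word2Vec model, and returning a zero vector for sentences with no known words is an assumption the abstract does not confirm.

```python
def average_embedding(tokens, embeddings, dim):
    """Average the vectors of all tokens found in the embedding table."""
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        # Assumption: sentences without any known word get an all-zero vector.
        return [0.0] * dim
    # zip(*vectors) groups the values of each dimension across the word vectors.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical toy embedding table; a real one would come from Word2Vec.
toy_embeddings = {
    "patient": [1.0, 0.0, 2.0],
    "behaves": [0.0, 2.0, 0.0],
    "oddly": [2.0, 1.0, 1.0],
}

# "today" is not in the table and is skipped.
features = average_embedding(
    ["patient", "behaves", "oddly", "today"], toy_embeddings, dim=3
)
print(features)
```

The resulting fixed-length vector can then serve as the input row for a classifier such as logistic regression, regardless of how many words the sentence contains.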
Russian invasion of Ukraine - topical evaluation of world news sources with machine learning
(Tartu Ülikool, 2022) Hladkyi, Ivan; Sirts, Kairit, juhendaja; Tartu Ülikool. Loodus- ja täppisteaduste valdkond; Tartu Ülikool. Arvutiteaduse instituut
On the morning of the 24th of February 2022, Russia launched a full-scale invasion of Ukrainian territory. The war erupted in many different places in Ukraine, the Russian armies bombed the infrastructure of almost every major city, and as of August 2022, the conflict is still ongoing. The attention of the whole world is focused on the events unfolding in Ukraine through numerous international news media sources. Different information resources can spotlight the same event from different perspectives depending on factors like audience type, political agenda, degree of freedom of speech, etc. The goal of this thesis was to collect a dataset of news from such resources and then build a pipeline for topic modelling and sentiment classification to analyze the differences and similarities between the news sources. Firstly, we selected several of the most prominent world information resources and collected a dataset of news. Secondly, we created a topic modelling and sentiment analysis pipeline supported by visualization tools. Finally, we analyzed the outcomes of the pipeline and discovered differences in the most frequently discussed topics, in sentiment, and in how the popularity of these topics changed over time. The practical contribution of the thesis is twofold: a novel dataset of news from various sources that cover the war, which can be used for further study, and a topical analysis pipeline that consists of topic modelling and sentiment analysis parts.
Tehisnärvivõrgul põhinevate lemmatiseerijate võrdlev analüüs eesti keeles
(Tartu Ülikool, 2019-06) Leman, Laura Katrin; Sirts, Kairit, juhendaja; Tartu Ülikool. Humanitaarteaduste ja kunstide valdkond; Tartu Ülikool. Eesti ja üldkeeleteaduse instituut
Weakly-Supervised Text Classification for Estonian Sentiment Analysis
(Tartu Ülikool, 2022) Pung, Andreas; Sirts, Kairit, juhendaja; Tartu Ülikool. Loodus- ja täppisteaduste valdkond; Tartu Ülikool. Arvutiteaduse instituut
Text Classification is one of the most fundamental tasks in Natural Language Processing. Hand-labelling texts is costly and might need specialised domain knowledge; this is where unsupervised and weakly-supervised approaches could be useful. In this Master’s Thesis, the weakly-supervised text classification paradigm is used to classify the sentiment of Estonian texts. In this paradigm, the weak labels are created using labelling functions (Ratner et al., 2016). The aim of this thesis is to assess the applicability of weakly-supervised models trained on a dataset around 40× larger, in contrast to hand-labelling a smaller number of texts to train a fully-supervised classifier. The compared models are fully and weakly-supervised BERT (Devlin et al., 2019), weakly-supervised COSINE (Yu et al., 2021) and WeaSEL (Cachay et al., 2021). Human evaluation is performed on texts where the models disagreed the most. As a result, we find that the fully-supervised models have the best performance. The best-performing weakly-supervised model trained on the larger dataset had an average classification accuracy 7.29% worse (and a weighted F1-score 7.05% worse) than the fully-supervised BERT model. The lower performance of weakly-supervised models might be caused by the low quality of the labelling functions; developing them further might lead to better results.