Browse by Date, starting with "2025-03"
Now showing items 1 - 20 of 162
Item: Braxen 1.0 (University of Tartu Library, 2025-03)
Tånnander, Christina; Edlund, Jens; Johansson, Richard; Stymne, Sara
With this paper, we release a Swedish pronunciation lexicon resource, Braxen 1.0, which is the result of almost 20 years of development carried out at the Swedish Agency for Accessible Media (MTM). The lexicon originated as a basic word list but has continuously been expanded with new entries, mainly acquired from university textbooks and news text. Braxen consists of around 850 000 entries, of which around 150 000 are proper names. The lexicon is released under the CC BY 4.0 license and is accessible for public use.

Item: Adding Metadata to Existing Parliamentary Speech Corpus (University of Tartu Library, 2025-03)
Parsons, Phoebe; Solberg, Per Erik; Kvale, Knut; Svendsen, Torbjørn; Salvi, Giampiero; Johansson, Richard; Stymne, Sara
Parliamentary proceedings are convenient data sources for creating corpora for speech technology. Given their public nature, there is an abundance of extra information about the speakers that can be legally and ethically harvested to enrich such corpora. This paper describes the methods we have used to add speaker metadata to the Stortinget Speech Corpus (SSC), which contains over 5,000 hours of Norwegian speech with non-verbatim transcripts but without speaker metadata. The additional metadata for each speech segment includes speaker ID, gender, date of birth, municipality of birth, and counties represented. We also infer speaker dialect from their municipality of birth using a manually designed mapping between municipalities and Norwegian dialects. We provide observations on the SSC data and give suggestions for how it may be used for tasks other than speech recognition. Finally, we demonstrate the utility of this new metadata through a dialect identification task. The described methods can be adapted to add metadata to parliamentary corpora in other languages.
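A minimal sketch of the municipality-to-dialect enrichment step described in the SSC entry above, assuming a simple lookup table; the mapping entries and field names here are illustrative stand-ins, not the paper's actual tables.

```python
# Sketch: enrich a speech segment with a dialect inferred from the speaker's
# municipality of birth. The tiny mapping below is a hypothetical stand-in
# for the manually designed municipality-to-dialect mapping in the paper.

MUNICIPALITY_TO_DIALECT = {
    "Bergen": "vestnorsk",
    "Trondheim": "trøndersk",
    "Tromsø": "nordnorsk",
    "Oslo": "østnorsk",
}

def enrich_segment(segment: dict) -> dict:
    """Add a 'dialect' field based on the speaker's municipality of birth."""
    municipality = segment.get("birth_municipality")
    segment["dialect"] = MUNICIPALITY_TO_DIALECT.get(municipality, "unknown")
    return segment

segment = {"speaker_id": "rep_0042", "birth_municipality": "Trondheim"}
print(enrich_segment(segment))  # ... 'dialect': 'trøndersk'
```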
Item: Communicating urgency to prevent environmental damage: insights from a linguistic analysis of the WWF24 multilingual corpus (University of Tartu Library, 2025-03)
Bosco, Cristina; Pagano, Adriana Silvina; Chierchiello, Elisa; Basile, Valerio; Bosco, Cristina; Grasso, Francesca; Ibrahim, Muhammad Okky; Skeppstedt, Maria; Stede, Manfred
Contemporary environmental discourse focuses on effectively communicating ecological vulnerability to raise public awareness and encourage positive actions. Hence there is a need for studies to support accurate and adequate discourse production, both by humans and computers. Two main challenges need to be tackled. On the one hand, the language used to communicate about environmental issues can be very complex for human and automatic analysis, and there are few resources for training and testing NLP tools. On the other hand, in the current international scenario, most texts are written in multiple languages or translated from a major to a minor language, resulting in different meanings in different languages and cultural contexts. This paper presents a novel parallel corpus comprising the text of the World Wide Fund (WWF) 2024 Annual Report in English and its translations into Italian and Brazilian Portuguese, and analyses their linguistic features.

Item: Evaluating LLM-Generated Explanations of Metaphors – A Culture-Sensitive Study of Danish (University of Tartu Library, 2025-03)
Pedersen, Bolette S.; Sørensen, Nathalie; Nimb, Sanni; Hansen, Dorte Haltrup; Olsen, Sussi; Al-Laith, Ali; Johansson, Richard; Stymne, Sara
In this study, we examine how well Danish culture-specific metaphors are explained by two of the best performing language models for Danish, namely ChatGPT and Llama. For comparison, the explanations are measured against how well cross-lingual (or 'universal') metaphors are explained by the models, referring here to metaphors that exist in Danish as well as across cultures and languages, and in particular in English. To perform our study, we compile a pilot dataset of 150 Danish metaphors and idioms divided tentatively by culture specificity. We prompt the two models and perform a careful qualitative evaluation of the explanations against a four-grade scale. Our studies show that both models are heavily biased towards English, since they have much more success in explaining the metaphors that also exist in English than the culture-specific ones, relying presumably on erroneous transfer from English when dealing with the latter. In particular, the sentiment of the culture-specific metaphors often seems to be 'lost in translation'. We further claim that this strong colouring towards English poses a serious problem in the era of LLMs with regard to developing and maintaining cultural and linguistic diversity in other languages.

Item: Prompt Engineering Enhances Faroese MT, but Only Humans Can Tell (University of Tartu Library, 2025-03)
Scalvini, Barbara; Simonsen, Annika; Debess, Iben Nyholm; Einarsson, Hafsteinn; Johansson, Richard; Stymne, Sara
This study evaluates GPT-4's English-to-Faroese translation capabilities, comparing it with multilingual models on the FLORES-200 and Sprotin datasets. We propose a prompt optimization strategy using Semantic Textual Similarity (STS) to improve translation quality. Human evaluation confirms the effectiveness of STS-based few-shot example selection, though automated metrics fail to capture these improvements. Our findings advance LLM applications for low-resource language translation while highlighting the need for better evaluation methods in this context.
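To make the STS-based few-shot selection in the Faroese MT entry above concrete, here is a rough sketch under stated assumptions: the embedding model, the example pool, and the prompt layout are all illustrative choices, not the paper's actual setup.

```python
# Sketch: pick few-shot translation examples by Semantic Textual Similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

# Pool of (English, Faroese) example pairs; contents are placeholders.
pool = [
    ("The weather is nice today.", "Veðrið er gott í dag."),
    ("Where is the harbour?", "Hvar er havnin?"),
    ("The sheep are on the mountain.", "Seyðurin er á fjallinum."),
]

def select_examples(source: str, k: int = 2):
    """Return the k pool pairs whose English side is most similar to `source`."""
    src_emb = model.encode(source, convert_to_tensor=True)
    pool_emb = model.encode([en for en, _ in pool], convert_to_tensor=True)
    scores = util.cos_sim(src_emb, pool_emb)[0]
    top = scores.argsort(descending=True)[:k]
    return [pool[int(i)] for i in top]

shots = select_examples("Where can I buy wool from the mountain farms?")
prompt = "\n".join(f"English: {en}\nFaroese: {fo}" for en, fo in shots)
print(prompt)  # few-shot block to prepend before the actual translation request
```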
Item: OCR Error Post-Correction with LLMs in Historical Documents: No Free Lunches (University of Tartu Library, 2025-03)
Kanerva, Jenna; Ledins, Cassandra; Käpyaho, Siiri; Ginter, Filip; Tudor, Crina Madalina; Debess, Iben Nyholm; Bruton, Micaella; Scalvini, Barbara; Ilinykh, Nikolai; Holdt, Špela Arhar
Optical Character Recognition (OCR) systems often introduce errors when transcribing historical documents, leaving room for post-correction to improve text quality. This study evaluates the use of open-weight LLMs for OCR error correction in historical English and Finnish datasets. We explore various strategies, including parameter optimization, quantization, segment length effects, and text continuation methods. Our results demonstrate that while modern LLMs show promise in reducing character error rates (CER) in English, practically useful performance for Finnish was not reached. Our findings highlight the potential and limitations of LLMs in scaling OCR post-correction for large historical corpora.

Item: Temporal Relation Classification: An XAI Perspective (University of Tartu Library, 2025-03)
Terenziani, Sofia Elena; Johansson, Richard; Stymne, Sara
Temporal annotations are used to identify and mark up temporal information, offering insight into how it is expressed through linguistic properties in text. This study investigates various discriminative pre-trained language models of differing sizes on a temporal relation classification task. We define valid reasoning strategies based on the linguistic principles that guide commonly used temporal annotations. Using a combination of saliency-based and counterfactual explanations, we examine whether the models' decisions are in line with these strategies. Our findings suggest that the selected models do not rely on the expected linguistic cues for processing temporal information effectively.

Item: The Devil's in the Details: the Detailedness of Classes Influences Personal Information Detection and Labeling (University of Tartu Library, 2025-03)
Szawerna, Maria Irena; Dobnik, Simon; Muñoz Sánchez, Ricardo; Volodina, Elena; Johansson, Richard; Stymne, Sara
In this paper, we experiment with the effect of different levels of detailedness or granularity—understood as i) the number of classes, and ii) the classes' semantic depth in the sense of hypernym and hyponym relations—of the annotation of Personally Identifiable Information (PII) on automatic detection and labeling of such information. We fine-tune a Swedish BERT model on a corpus of Swedish learner essays annotated with a total of six PII tagsets at varying levels of granularity. We also investigate whether the presence of grammatical and lexical correction annotation in the tokens and class prevalence have an effect on predictions. We observe that the fewer total categories there are, the better the overall results are, but having a more diverse annotation facilitates fewer misclassifications for tokens containing correction annotation. We also note that the classes' internal diversity has an effect on labeling. We conclude from the results that while labeling based on the detailed annotation is difficult because of the number of classes, it is likely that models trained on such annotation rely more on the semantic content captured by contextual word embeddings rather than just the form of the tokens, making them more robust against nonstandard language.
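A minimal sketch of the kind of token-classification fine-tuning setup described in the PII entry above, assuming a public Swedish BERT checkpoint and a toy tagset; the paper's six tagsets and training data are not reproduced here.

```python
# Sketch: set up a Swedish BERT for PII token classification.
# "KB/bert-base-swedish-cased" is a public Swedish BERT; the coarse tagset
# below is a hypothetical stand-in for the paper's PII tagsets.
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-NAME", "I-NAME", "B-PLACE", "I-PLACE"]
tokenizer = AutoTokenizer.from_pretrained("KB/bert-base-swedish-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "KB/bert-base-swedish-cased",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# From here, training follows the usual token-classification recipe:
# align word-level PII tags to subword tokens, then train with Trainer.
```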
Item: Mind the Gap: Diverse NMT Models for Resource-Constrained Environments (University of Tartu Library, 2025-03)
Gibert, Ona de; O'Brien, Dayyán; Variš, Dušan; Tiedemann, Jörg; Johansson, Richard; Stymne, Sara
We present fast Neural Machine Translation models for 17 diverse languages, developed using Sequence-level Knowledge Distillation. Our selected languages span multiple language families and scripts, including low-resource languages. The distilled models achieve comparable performance while being 10 times faster than transformer-base and 35 times faster than transformer-big architectures. Our experiments reveal that teacher model quality and capacity, as well as the language script, strongly influence distillation success. We also explore the effectiveness of multilingual students. We publicly release our code and models in our GitHub repository: anonymised.

Item: How Well do LLMs know Finno-Ugric Languages? A Systematic Assessment (University of Tartu Library, 2025-03)
Kuulmets, Hele-Andra; Purason, Taido; Fishel, Mark; Johansson, Richard; Stymne, Sara
We present a systematic evaluation of multilingual capabilities of open large language models (LLMs), specifically focusing on five Finno-Ugric (FiU) languages. Our investigation covers multiple prompting strategies across several benchmarks and reveals that Llama-2 7B and Llama-2 13B perform weakly on most FiU languages. In contrast, Llama 3.1 models show impressive improvements, even for extremely low-resource languages such as Võro and Komi, indicating successful cross-lingual knowledge transfer inside the models. Finally, we show that stronger base models outperform weaker, language-adapted models, thus emphasizing the importance of the base model in successful language adaptation.

Item: A Mansi FST and spellchecker (University of Tartu Library, 2025-03)
Rueter, Jack; Horváth, Csilla; Trosterud, Trond; Wiechetek, Linda; Pirinen, Flammie
The article presents a finite state transducer and spellchecker for Mansi, an Ob-Ugric Uralic language spoken in northwestern Siberia. Mansi has a rich, mostly agglutinative morphology, with a morphophonology dominated by sandhi phenomena. With a small set of morphophonological rules (32 twolc rules) and a lexicon consisting of 12,000 Mansi entries and a larger set of proper nouns, we were able to build a transducer covering 98.9% of a large (700k) newspaper corpus. As part of the GiellaLT infrastructure, the transducer was turned into a spellchecker. The most common spelling error in Mansi is the omission of length marks on vowels, and for the 1,000 most common words containing long vowels, the spellchecker was able to give a correct suggestion among its top five in 98.3% of the cases, and as the first suggestion in 91.3% of the cases.
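To make the recipe in the "Mind the Gap" entry above concrete, here is a rough sketch of the data-generation step of sequence-level knowledge distillation: a teacher translates the training sources, and the student is later trained on the (source, teacher translation) pairs. The teacher checkpoint and file name are illustrative assumptions, not the paper's actual setup.

```python
# Sketch: generate distillation data with a teacher translation model.
from transformers import pipeline

teacher = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fi")

sources = [
    "The committee will meet on Tuesday.",
    "Snow is expected in the north.",
]

with open("distill.en-fi.tsv", "w", encoding="utf-8") as out:
    for src in sources:
        hyp = teacher(src, max_length=256)[0]["translation_text"]
        out.write(f"{src}\t{hyp}\n")  # one (source, teacher output) student pair
```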
Item: Hotter and Colder: A New Approach to Annotating Sentiment, Emotions, and Bias in Icelandic Blog Comments (University of Tartu Library, 2025-03)
Friðriksdóttir, Steinunn Rut; Saattrup Nielsen, Dan; Einarsson, Hafsteinn; Johansson, Richard; Stymne, Sara
This paper presents Hotter and Colder, a dataset designed to analyze various types of online behavior in Icelandic blog comments. Building on previous work, we used GPT-4o mini to annotate approximately 800,000 comments for 25 tasks, including sentiment analysis, emotion detection, hate speech, and group generalizations. Each comment was automatically labeled on a 5-point Likert scale. In a second annotation stage, comments with high or low probabilities of containing each examined behavior were subjected to manual revision. By leveraging crowdworkers to refine these automatically labeled comments, we ensure the quality and accuracy of our dataset, resulting in 12,232 uniquely annotated comments and 19,301 annotations. Hotter and Colder provides an essential resource for advancing research in content moderation and automatically detecting harmful online behaviors in Icelandic. We release both the dataset and annotation interface.

Item: Mining for Species, Locations, Habitats, and Ecosystems from Scientific Papers in Invasion Biology: A Large-Scale Exploratory Study with Large Language Models (University of Tartu Library, 2025-03)
D'Souza, Jennifer; Laubach, Zachary; Mustafa, Tarek Al; Zarrieß, Sina; Frühstückl, Robert; Illari, Phyllis; Basile, Valerio; Bosco, Cristina; Grasso, Francesca; Ibrahim, Muhammad Okky; Skeppstedt, Maria; Stede, Manfred
This study explores the use of large language models (LLMs), specifically GPT-4o, to extract key ecological entities—species, locations, habitats, and ecosystems—from invasion biology literature. This information is critical for understanding species spread, predicting future invasions, and informing conservation efforts. Without domain-specific fine-tuning, we assess the potential and limitations of GPT-4o, out of the box, for this task, highlighting the role of LLMs in advancing automated knowledge extraction for ecological research and management.

Item: The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective (University of Tartu Library, 2025-03)
Rosa, Javier de la; Mikhailov, Vladislav; Zhang, Lemei; Wetjen, Freddy; Samuel, David; Liu, Peng; Braaten, Rolv-Arild; Mæhlum, Petter; Birkenes, Magnus Breder; Kutuzov, Andrey; Enstad, Tita; Farsethås, Hans Christian; Brygfjeld, Svein Arne; Gulla, Jon Atle; Oepen, Stephan; Velldal, Erik; Østgulen, Wilfred; Øvrelid, Lilja; Myhre, Aslak Sira; Johansson, Richard; Stymne, Sara
The use of copyrighted materials in training language models raises critical legal and ethical questions. This paper presents a framework for, and the results of, empirically assessing the impact of publisher-controlled copyrighted corpora on the performance of generative large language models (LLMs) for Norwegian. When evaluated on a diverse set of tasks, we found that adding both books and newspapers to the data mixture of LLMs tends to improve their performance, while the addition of fiction works seems to be detrimental. Our experiments could inform the creation of a compensation scheme for authors whose works contribute to AI development.
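A minimal sketch of the first-stage automatic annotation described in the Hotter and Colder entry above, assuming the standard OpenAI chat API; the prompt wording and the single-digit answer contract are assumptions, not the paper's actual prompts.

```python
# Sketch: Likert-scale annotation of one comment with gpt-4o-mini.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def rate_comment(comment: str, behavior: str = "hate speech") -> str:
    """Ask the model for a 1-5 Likert rating of one behavior in one comment."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": f"Rate how strongly the comment expresses {behavior} "
                        "on a 1-5 Likert scale. Answer with a single digit."},
            {"role": "user", "content": comment},
        ],
    )
    return response.choices[0].message.content.strip()

print(rate_comment("Þetta er frábær grein, takk fyrir!"))  # expected: "1"
```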
Item: Second language Korean Universal Dependency treebank v1.2: Focus on Data Augmentation and Annotation Scheme Refinement (University of Tartu Library, 2025-03)
Sung, Hakyung; Shin, Gyu-Ho; Tudor, Crina Madalina; Debess, Iben Nyholm; Bruton, Micaella; Scalvini, Barbara; Ilinykh, Nikolai; Holdt, Špela Arhar
We expand the second language (L2) Korean Universal Dependencies (UD) treebank with 5,454 manually annotated sentences. The annotation guidelines are also revised to better align with the UD framework. Using this enhanced treebank, we fine-tune three Korean language models—Stanza, spaCy, and Trankit—and evaluate their performance on in-domain and out-of-domain L2-Korean datasets. The results show that fine-tuning significantly improves their performance across various metrics, highlighting the importance of using well-tailored L2 datasets for fine-tuning first-language-based, general-purpose language models for the morphosyntactic analysis of L2 data.

Item: Entailment Progressions: A Robust Approach to Evaluating Reasoning Within Larger Discourse (University of Tartu Library, 2025-03)
Shastry, Rishabh; Chiril, Patricia; Charney, Joshua; Uminsky, David; Johansson, Richard; Stymne, Sara
Textual entailment, or the ability to deduce whether a proposed hypothesis is logically supported by a given premise, has historically been applied to the evaluation of language modelling efficiency in tasks like question answering and text summarization. However, we hypothesize that these zero-shot entailment evaluations can be extended to the task of evaluating discourse within larger textual narratives. In this paper, we propose a simple but effective method that sequentially evaluates changes in textual entailment between sentences within a larger text, in an approach we denote as "Entailment Progressions". These entailment progressions aim to capture the inference relations between sentences as an underlying component capable of distinguishing texts generated from various models and procedures. Our results suggest that entailment progressions can be used to effectively distinguish between machine-generated and human-authored texts across multiple established benchmark corpora and our own EP4MGT dataset. Additionally, our method displays robustness in performance when evaluated on paraphrased texts, a technique that has historically affected the performance of well-established metrics when distinguishing between machine-generated and human-authored texts.

Item: Transfer-Learning German Metaphors Inspired by Second Language Acquisition (University of Tartu Library, 2025-03)
Berger, Maria; Johansson, Richard; Stymne, Sara
A major part of figurative meaning prediction is based on English-language training corpora. One strategy to apply techniques to languages other than English lies in applying transfer learning techniques to correct this imbalance. However, in previous studies we learned that the bilingual representations of current transformer models are incapable of encoding the deep semantic knowledge necessary for a transfer learning step, especially for metaphor prediction. Hence, inspired by second language acquisition, we attempt to improve German metaphor prediction in transfer learning by modifying the context windows of our input samples to align with lower readability indices, achieving up to a 13% higher F1 score.
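A rough sketch of the sequential scoring idea in the Entailment Progressions entry above: label the entailment relation between each adjacent sentence pair with an off-the-shelf NLI model. The checkpoint and the discrete-label output are assumptions; the paper's exact scoring setup may differ.

```python
# Sketch: an "entailment progression" over consecutive sentences.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "roberta-large-mnli"  # assumed NLI model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

def progression(sentences):
    """Label the entailment relation between each adjacent sentence pair."""
    labels = []
    for premise, hypothesis in zip(sentences, sentences[1:]):
        inputs = tokenizer(premise, hypothesis, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        labels.append(model.config.id2label[int(logits.argmax())])
    return labels

text = ["It rained all night.", "The streets were wet.", "The match was cancelled."]
print(progression(text))  # e.g. ['ENTAILMENT', 'NEUTRAL']
```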
Item: Automatic Validation of the Non-Validated Spanish Speech Data of Common Voice 17.0 (University of Tartu Library, 2025-03)
Hernández Mena, Carlos Daniel; Scalvini, Barbara; Lág, Dávid í; Tudor, Crina Madalina; Debess, Iben Nyholm; Bruton, Micaella; Scalvini, Barbara; Ilinykh, Nikolai; Holdt, Špela Arhar
Mozilla Common Voice is a crowdsourced project that aims to create a public, multilingual dataset of voice recordings for training speech recognition models. In Common Voice, anyone can contribute by donating or validating recordings in various languages. However, despite the availability of many recordings in certain languages, a significant percentage remains unvalidated by users. This is the case for Spanish: in version 17.0 of Common Voice, 75% of the 2,220 hours of recordings are unvalidated. In this work, we used the Whisper recognizer to automatically validate approximately 784 hours of recordings, more than the 562 hours validated by users. To verify the accuracy of the validation, we developed a speech recognition model based on a version of NVIDIA NeMo's Parakeet, which does not have an official Spanish version. Our final model achieved a WER of less than 4% on the test and validation splits of Common Voice 17.0. Both the model and the speech corpus are publicly available on Hugging Face.

Item: Optimizing Estonian TV Subtitles with Semi-supervised Learning and LLMs (University of Tartu Library, 2025-03)
Fedorchenko, Artem; Alumäe, Tanel; Johansson, Richard; Stymne, Sara
This paper presents an approach for generating high-quality, same-language subtitles for Estonian TV content. We fine-tune the Whisper model on human-generated Estonian subtitles and enhance it with iterative pseudo-labeling and large language model (LLM) based post-editing. Our experiments demonstrate notable subtitle quality improvement through pseudo-labeling with an unlabeled dataset. We find that applying LLM-based editing at test time enhances subtitle accuracy, while its use during training does not yield further gains. This approach holds promise for creating subtitle quality close to the human standard and could be extended to real-time applications.

Item: Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway (University of Tartu Library, 2025-03)
Enstad, Tita; Trosterud, Trond; Røsok, Marie Iversdatter; Beyer, Yngvil; Roald, Marie; Johansson, Richard; Stymne, Sara
Optical Character Recognition (OCR) is crucial to the National Library of Norway's (NLN) digitisation process, as it converts scanned documents into machine-readable text. However, for the Sámi documents in NLN's collection, the OCR accuracy is insufficient. Given that OCR quality affects downstream processes, evaluating and improving OCR for text written in Sámi languages is necessary to make these resources accessible. To address this need, this work fine-tunes and evaluates three established OCR approaches, Transkribus, Tesseract and TrOCR, for transcribing Sámi texts from NLN's collection. Our results show that Transkribus and TrOCR outperform Tesseract on this task, while Tesseract achieves superior performance on an out-of-domain dataset. Furthermore, we show that fine-tuning pre-trained models and supplementing manual annotations with machine annotations and synthetic text images can yield accurate OCR for Sámi languages, even with a moderate amount of manually annotated data.
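A minimal sketch of the automatic validation idea in the Common Voice entry above: transcribe an unvalidated clip with Whisper and accept it if the transcript stays close to the prompt. The model size, WER threshold, and text normalization are assumptions, not the paper's actual criteria.

```python
# Sketch: accept or reject an unvalidated Common Voice clip by comparing
# Whisper's transcript against the prompt text with a WER budget.
import whisper
from jiwer import wer

model = whisper.load_model("small")  # assumed model size

def auto_validate(audio_path: str, prompt: str, max_wer: float = 0.1) -> bool:
    """Return True if the transcript's WER against the prompt is low enough."""
    hypothesis = model.transcribe(audio_path, language="es")["text"]
    return wer(prompt.lower(), hypothesis.lower()) <= max_wer

print(auto_validate("clip_00001.mp3", "buenos días a todos"))
```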