Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
Permanent URI for this collection: https://hdl.handle.net/10062/107190
Now showing 1 - 20 of 83
Braxen 1.0 (University of Tartu Library, 2025-03)
Tånnander, Christina; Edlund, Jens; Johansson, Richard; Stymne, Sara
With this paper, we release a Swedish pronunciation lexicon resource, Braxen 1.0, which is the result of almost 20 years of development carried out at the Swedish Agency for Accessible Media (MTM). The lexicon originated with a basic word list but has continuously been expanded with new entries, mainly acquired from university textbooks and news text. Braxen consists of around 850 000 entries, of which around 150 000 are proper names. The lexicon is released under the CC BY 4.0 license and is accessible for public use.

Adding Metadata to Existing Parliamentary Speech Corpus (University of Tartu Library, 2025-03)
Parsons, Phoebe; Solberg, Per Erik; Kvale, Knut; Svendsen, Torbjørn; Salvi, Giampiero; Johansson, Richard; Stymne, Sara
Parliamentary proceedings are convenient data sources for creating corpora for speech technology. Given their public nature, there is an abundance of extra information about the speakers that can be legally and ethically harvested to enrich such corpora. This paper describes the methods we have used to add speaker metadata to the Stortinget Speech Corpus (SSC), which contains over 5,000 hours of Norwegian speech with non-verbatim transcripts but without speaker metadata. The additional metadata for each speech segment includes speaker ID, gender, date of birth, municipality of birth, and counties represented. We also infer speaker dialect from the municipality of birth using a manually designed mapping between municipalities and Norwegian dialects. We provide observations on the SSC data and give suggestions for how it may be used for tasks other than speech recognition. Finally, we demonstrate the utility of this new metadata through a dialect identification task.
The described methods can be adapted to add metadata to parliamentary corpora in other languages.

Evaluating LLM-Generated Explanations of Metaphors – A Culture-Sensitive Study of Danish (University of Tartu Library, 2025-03)
Pedersen, Bolette S.; Sørensen, Nathalie; Nimb, Sanni; Hansen, Dorte Haltrup; Olsen, Sussi; Al-Laith, Ali; Johansson, Richard; Stymne, Sara
In this study, we examine how well Danish culture-specific metaphors are explained by two of the best-performing language models for Danish, namely ChatGPT and Llama. For comparison, the explanations are measured against how well cross-lingual (or 'universal') metaphors are explained by the models, referring here to metaphors that exist in Danish as well as across cultures and languages, in particular in English. To perform our study, we compile a pilot dataset of 150 Danish metaphors and idioms, divided tentatively by culture specificity. We prompt the two models and perform a careful qualitative evaluation of the explanations against a four-point scale. Our studies show that both models are heavily biased towards English: they have much more success in explaining the metaphors that also exist in English than the culture-specific ones, presumably relying on erroneous transfer from English when dealing with the latter. In particular, the sentiment of the culture-specific metaphors often seems to be 'lost in translation'.
We further claim that this strong colouring towards English poses a serious problem in the era of LLMs with regard to developing and maintaining cultural and linguistic diversity in other languages.

Prompt Engineering Enhances Faroese MT, but Only Humans Can Tell (University of Tartu Library, 2025-03)
Scalvini, Barbara; Simonsen, Annika; Debess, Iben Nyholm; Einarsson, Hafsteinn; Johansson, Richard; Stymne, Sara
This study evaluates GPT-4's English-to-Faroese translation capabilities, comparing it with multilingual models on the FLORES-200 and Sprotin datasets. We propose a prompt optimization strategy using Semantic Textual Similarity (STS) to improve translation quality. Human evaluation confirms the effectiveness of STS-based few-shot example selection, though automated metrics fail to capture these improvements. Our findings advance LLM applications for low-resource language translation while highlighting the need for better evaluation methods in this context.

Temporal Relation Classification: An XAI Perspective (University of Tartu Library, 2025-03)
Terenziani, Sofia Elena; Johansson, Richard; Stymne, Sara
Temporal annotations are used to identify and mark up temporal information, offering insight into how it is expressed through linguistic properties in text. This study investigates various discriminative pre-trained language models of differing sizes on a temporal relation classification task. We define valid reasoning strategies based on the linguistic principles that guide commonly used temporal annotations. Using a combination of saliency-based and counterfactual explanations, we examine whether the models' decisions are in line with these strategies.
Our findings suggest that the selected models do not rely on the expected linguistic cues for processing temporal information effectively.

The Devil's in the Details: the Detailedness of Classes Influences Personal Information Detection and Labeling (University of Tartu Library, 2025-03)
Szawerna, Maria Irena; Dobnik, Simon; Muñoz Sánchez, Ricardo; Volodina, Elena; Johansson, Richard; Stymne, Sara
In this paper, we experiment with how the level of detailedness or granularity of the annotation of Personally Identifiable Information (PII), understood as (i) the number of classes and (ii) the classes' semantic depth in the sense of hypernym and hyponym relations, affects the automatic detection and labeling of such information. We fine-tune a Swedish BERT model on a corpus of Swedish learner essays annotated with a total of six PII tagsets at varying levels of granularity. We also investigate whether the presence of grammatical and lexical correction annotation in the tokens, as well as class prevalence, has an effect on predictions. We observe that the fewer total categories there are, the better the overall results, but a more diverse annotation leads to fewer misclassifications for tokens containing correction annotation. We also note that the classes' internal diversity has an effect on labeling.
We conclude from the results that while labeling based on the detailed annotation is difficult because of the number of classes, models trained on such annotation likely rely more on the semantic content captured by contextual word embeddings than on the mere form of the tokens, making them more robust against nonstandard language.

Mind the Gap: Diverse NMT Models for Resource-Constrained Environments (University of Tartu Library, 2025-03)
Gibert, Ona de; O'Brien, Dayyán; Variš, Dušan; Tiedemann, Jörg; Johansson, Richard; Stymne, Sara
We present fast Neural Machine Translation models for 17 diverse languages, developed using Sequence-level Knowledge Distillation. Our selected languages span multiple language families and scripts, including low-resource languages. The distilled models achieve comparable performance while being 10 times faster than transformer-base and 35 times faster than transformer-big architectures. Our experiments reveal that teacher model quality and capacity, as well as the language script, strongly influence distillation success. We also explore the effectiveness of multilingual students. We publicly release our code and models in our GitHub repository: anonymised.

How Well do LLMs know Finno-Ugric Languages? A Systematic Assessment (University of Tartu Library, 2025-03)
Kuulmets, Hele-Andra; Purason, Taido; Fishel, Mark; Johansson, Richard; Stymne, Sara
We present a systematic evaluation of the multilingual capabilities of open large language models (LLMs), specifically focusing on five Finno-Ugric (FiU) languages. Our investigation covers multiple prompting strategies across several benchmarks and reveals that Llama-2 7B and Llama-2 13B perform weakly on most FiU languages.
In contrast, Llama 3.1 models show impressive improvements, even for extremely low-resource languages such as Võro and Komi, indicating successful cross-lingual knowledge transfer inside the models. Finally, we show that stronger base models outperform weaker, language-adapted models, emphasizing the importance of the base model in successful language adaptation.

Hotter and Colder: A New Approach to Annotating Sentiment, Emotions, and Bias in Icelandic Blog Comments (University of Tartu Library, 2025-03)
Friðriksdóttir, Steinunn Rut; Saattrup Nielsen, Dan; Einarsson, Hafsteinn; Johansson, Richard; Stymne, Sara
This paper presents Hotter and Colder, a dataset designed to analyze various types of online behavior in Icelandic blog comments. Building on previous work, we used GPT-4o mini to annotate approximately 800,000 comments for 25 tasks, including sentiment analysis, emotion detection, hate speech, and group generalizations. Each comment was automatically labeled on a 5-point Likert scale. In a second annotation stage, comments with high or low probabilities of containing each examined behavior were manually revised. By leveraging crowdworkers to refine these automatically labeled comments, we ensure the quality and accuracy of our dataset, resulting in 12,232 uniquely annotated comments and 19,301 annotations. Hotter and Colder provides an essential resource for advancing research in content moderation and in automatically detecting harmful online behaviors in Icelandic.
We release both the dataset and the annotation interface.

The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective (University of Tartu Library, 2025-03)
Rosa, Javier de la; Mikhailov, Vladislav; Zhang, Lemei; Wetjen, Freddy; Samuel, David; Liu, Peng; Braaten, Rolv-Arild; Mæhlum, Petter; Birkenes, Magnus Breder; Kutuzov, Andrey; Enstad, Tita; Farsethås, Hans Christian; Brygfjeld, Svein Arne; Gulla, Jon Atle; Oepen, Stephan; Velldal, Erik; Østgulen, Wilfred; Øvrelid, Lilja; Myhre, Aslak Sira; Johansson, Richard; Stymne, Sara
The use of copyrighted materials in training language models raises critical legal and ethical questions. This paper presents a framework for, and the results of, empirically assessing the impact of publisher-controlled copyrighted corpora on the performance of generative large language models (LLMs) for Norwegian. Evaluating on a diverse set of tasks, we found that adding both books and newspapers to the data mixture of LLMs tends to improve their performance, while the addition of fiction works seems to be detrimental. Our experiments could inform the creation of a compensation scheme for authors whose works contribute to AI development.

Entailment Progressions: A Robust Approach to Evaluating Reasoning Within Larger Discourse (University of Tartu Library, 2025-03)
Shastry, Rishabh; Chiril, Patricia; Charney, Joshua; Uminsky, David; Johansson, Richard; Stymne, Sara
Textual entailment, or the ability to deduce whether a proposed hypothesis is logically supported by a given premise, has historically been applied to the evaluation of language-modelling efficiency in tasks like question answering and text summarization. However, we hypothesize that these zero-shot entailment evaluations can be extended to the task of evaluating discourse within larger textual narratives.
In this paper, we propose a simple but effective method that sequentially evaluates changes in textual entailment between sentences within a larger text, an approach we denote "Entailment Progressions". These entailment progressions aim to capture the inference relations between sentences as an underlying signal capable of distinguishing texts generated by various models and procedures. Our results suggest that entailment progressions can effectively distinguish between machine-generated and human-authored texts across multiple established benchmark corpora and our own EP4MGT dataset. Additionally, our method remains robust when evaluated on paraphrased texts, a technique that has historically degraded the performance of well-established metrics for distinguishing between machine-generated and human-authored texts.

Transfer-Learning German Metaphors Inspired by Second Language Acquisition (University of Tartu Library, 2025-03)
Berger, Maria; Johansson, Richard; Stymne, Sara
A major part of figurative meaning prediction is based on English-language training corpora. One strategy for extending these techniques to languages other than English lies in transfer learning. However, in previous studies we learned that the bilingual representations of current transformer models are incapable of encoding the deep semantic knowledge necessary for a transfer-learning step, especially for metaphor prediction.
Hence, inspired by second language acquisition, we attempt to improve German metaphor prediction in transfer learning by modifying the context windows of our input samples to align with lower readability indices, achieving up to 13% higher F1 scores.

Optimizing Estonian TV Subtitles with Semi-supervised Learning and LLMs (University of Tartu Library, 2025-03)
Fedorchenko, Artem; Alumäe, Tanel; Johansson, Richard; Stymne, Sara
This paper presents an approach for generating high-quality, same-language subtitles for Estonian TV content. We fine-tune the Whisper model on human-generated Estonian subtitles and enhance it with iterative pseudo-labeling and large language model (LLM) based post-editing. Our experiments demonstrate notable subtitle quality improvements through pseudo-labeling with an unlabeled dataset. We find that applying LLM-based editing at test time enhances subtitle accuracy, while its use during training does not yield further gains. This approach holds promise for creating subtitles of near-human quality and could be extended to real-time applications.

Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway (University of Tartu Library, 2025-03)
Enstad, Tita; Trosterud, Trond; Røsok, Marie Iversdatter; Beyer, Yngvil; Roald, Marie; Johansson, Richard; Stymne, Sara
Optical Character Recognition (OCR) is crucial to the National Library of Norway's (NLN) digitisation process, as it converts scanned documents into machine-readable text. However, for the Sámi documents in NLN's collection, OCR accuracy is insufficient. Given that OCR quality affects downstream processes, evaluating and improving OCR for text written in Sámi languages is necessary to make these resources accessible.
To address this need, this work fine-tunes and evaluates three established OCR approaches, Transkribus, Tesseract, and TrOCR, for transcribing Sámi texts from NLN's collection. Our results show that Transkribus and TrOCR outperform Tesseract on this task, while Tesseract achieves superior performance on an out-of-domain dataset. Furthermore, we show that fine-tuning pre-trained models and supplementing manual annotations with machine annotations and synthetic text images can yield accurate OCR for Sámi languages, even with a moderate amount of manually annotated data.

MC-19: A Corpus of 19th Century Icelandic Texts (University of Tartu Library, 2025-03)
Steingrímsson, Steinþór; Sigurðsson, Einar Freyr; Jasonarson, Atli; Johansson, Richard; Stymne, Sara
We present MC-19, a new Icelandic historical corpus containing texts from the period 1800-1920. We describe approaches for enhancing a corpus of historical texts by preparing the texts so that they can be processed using state-of-the-art NLP tools. We train encoder-decoder models to reduce the number of OCR errors while leaving other orthographic variation intact. We generate a separate modern-spelling layer by normalizing the spelling to comply with modern spelling rules, using a statistical modernization ruleset as well as a dictionary of the most common words. This allows the texts to be PoS-tagged and lemmatized using available tools, facilitating use of the corpus by researchers and language technologists.
The published version of the corpus contains over 270 million tokens.

Playing by the Rules: A Benchmark Set for Standardized Icelandic Orthography (University of Tartu Library, 2025-03)
Ármannsson, Bjarki; Hafsteinsson, Hinrik; Sigtryggsson, Jóhannes B.; Jasonarson, Atli; Sigurðsson, Einar Freyr; Steingrímsson, Steinþór; Johansson, Richard; Stymne, Sara
We present the Icelandic Standardization Benchmark Set: Spelling and Punctuation (IceStaBS:SP), a dataset designed to provide standardized text examples for Icelandic orthography. The dataset includes examples of non-standard orthography and their standardized counterparts, along with detailed explanations based on official Icelandic spelling rules. IceStaBS:SP aims to support the development and evaluation of automatic spell and grammar checkers, particularly in educational settings. We evaluate various spell and grammar checkers using IceStaBS:SP, demonstrating its utility as a benchmarking tool and highlighting areas for future improvement.

Better Benchmarking LLMs for Zero-Shot Dependency Parsing (University of Tartu Library, 2025-03)
Ezquerro, Ana; Gómez-Rodríguez, Carlos; Vilares, David; Johansson, Richard; Stymne, Sara
While LLMs excel at zero-shot tasks, their performance on linguistic challenges like syntactic parsing has been less scrutinized. This paper studies state-of-the-art open-weight LLMs on the task by comparing them to baselines that do not have access to the input sentence, including baselines that have not been used in this context before, such as random projective trees or optimal linear arrangements. The results show that most of the tested LLMs cannot outperform the best uninformed baselines, with only the newest and largest versions of LLaMA doing so for most languages, and still achieving rather low performance.
Thus, accurate zero-shot syntactic parsing is not forthcoming with open LLMs.

How to Tune a Multilingual Encoder Model for Germanic Languages: A Study of PEFT, Full Fine-Tuning, and Language Adapters (University of Tartu Library, 2025-03)
Oji, Romina; Kunz, Jenny; Johansson, Richard; Stymne, Sara
This paper investigates the optimal use of the multilingual encoder model mDeBERTa for tasks in three Germanic languages (German, Swedish, and Icelandic) representing varying levels of presence, and likely data quality, in mDeBERTa's pre-training data. We compare full fine-tuning with the parameter-efficient fine-tuning (PEFT) methods LoRA and Pfeiffer bottleneck adapters, finding that PEFT is more effective for the higher-resource language, German. However, results for Swedish and Icelandic are less consistent. We also observe differences between tasks: while PEFT tends to work better for question answering, full fine-tuning is preferable for named entity recognition. Inspired by previous research on modular approaches that combine task and language adapters, we evaluate the impact of adding PEFT modules trained on unstructured text, finding that this approach is not beneficial.

Benchmarking Abstractive Summarisation: A Dataset of Human-authored Summaries of Norwegian News Articles (University of Tartu Library, 2025-03)
Touileb, Samia; Mikhailov, Vladislav; Kroka, Marie Ingeborg; Velldal, Erik; Øvrelid, Lilja; Johansson, Richard; Stymne, Sara
We introduce a dataset of high-quality human-authored summaries of news articles in Norwegian, intended for benchmarking the abstractive summarisation capabilities of generative language models. Each document in the dataset is provided with three different candidate gold-standard summaries written by native Norwegian speakers, and all summaries are provided in both written variants of Norwegian: Bokmål and Nynorsk.
The paper describes the data creation effort as well as an evaluation of existing open LLMs for Norwegian on the dataset. We also provide insights from a manual human evaluation comparing human-authored to model-generated summaries. Our results indicate that the dataset provides a challenging LLM benchmark for Norwegian summarisation capabilities.

Small Languages, Big Models: A Study of Continual Training on Languages of Norway (University of Tartu Library, 2025-03)
Samuel, David; Mikhailov, Vladislav; Velldal, Erik; Øvrelid, Lilja; Charpentier, Lucas Georges Gabriel; Kutuzov, Andrey; Oepen, Stephan; Johansson, Richard; Stymne, Sara
Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages like Norwegian, and even more so for truly low-resource languages like Northern Sámi. To address this issue, we present a novel three-stage continual training approach that substantially improves both downstream performance and inference efficiency for the target languages. Based on our findings, we train, evaluate, and openly release a new generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.