Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
Permanent URI for this collection: https://hdl.handle.net/10062/107190
Recent Submissions
Item: Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025): Proceedings of the Conference: March 3-4, 2025 (University of Tartu Library, 2025-03). Johansson, Richard; Stymne, Sara.

Item: Got Compute, but No Data: Lessons From Post-training a Finnish LLM (University of Tartu Library, 2025-03). Zosa, Elaine; Komulainen, Ville; Pyysalo, Sampo; Johansson, Richard; Stymne, Sara.
As LLMs gain popularity as chatbots and general assistants, methods have been developed to enable LLMs to follow instructions and align with human preferences. These methods have found success in the field, but their effectiveness has not been demonstrated outside of high-resource languages. In this work, we discuss our experiences in post-training an LLM for instruction following in English and Finnish. We use a multilingual LLM to translate instruction and preference datasets from English to Finnish. We perform instruction tuning and preference optimization in English and Finnish and evaluate the instruction-following capabilities of the model in both languages. Our results show that with a few hundred Finnish instruction samples we can obtain competitive performance in Finnish instruction following. We also found that although preference optimization in English offers some cross-lingual benefits, we obtain our best results by using preference data from both languages. We release our model, datasets, and recipes under open licenses at https://huggingface.co/LumiOpen/Poro-34B-chat-OpenAssistant.

Item: SnakModel: Lessons Learned from Training an Open Danish Large Language Model (University of Tartu Library, 2025-03). Zhang, Mike; Müller-Eberstein, Max; Bassignana, Elisa; Goot, Rob van der; Johansson, Richard; Stymne, Sara.
We present SnakModel, a Danish large language model (LLM) based on Llama2-7B, which we continuously pre-train on 13.6B Danish words and further tune on 3.7M Danish instructions. As best practices for creating LLMs for smaller language communities have yet to be established, we examine the effects of early modeling and training decisions on downstream performance throughout the entire training pipeline, including (1) the creation of a strictly curated corpus of Danish text from diverse sources; (2) the language modeling and instruction-tuning process itself, including the analysis of intermediate training dynamics and ablations across different hyperparameters; and (3) an evaluation on eight language- and culture-specific tasks. Across these experiments, SnakModel achieves the highest overall performance, outperforming multiple contemporary Llama2-7B-based models. By making SnakModel, the majority of our pre-training corpus, and the associated code available under open licenses, we hope to foster further research and development in Danish Natural Language Processing and establish training guidelines for languages with similar resource constraints.

Item: NorEventGen: generative event extraction from Norwegian news (University of Tartu Library, 2025-03). You, Huiling; Touileb, Samia; Velldal, Erik; Øvrelid, Lilja; Johansson, Richard; Stymne, Sara.
In this work, we approach event extraction from Norwegian news text using a generation-based approach which formulates the task as text-to-structure generation. We present experiments assessing the effect of different modeling configurations and provide an analysis of the model predictions and typical system errors. Finally, we apply our system to a large corpus of raw news texts and analyze the resulting distribution of event structures in a fairly representative snapshot of the Norwegian news landscape.
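As a rough illustration of the text-to-structure formulation described in the NorEventGen abstract, the sketch below prompts a generic sequence-to-sequence model to emit a linearized event structure. The checkpoint name, example sentence, and output format are assumptions for illustration, not the paper's actual recipe.

```python
# Minimal sketch: event extraction framed as text-to-structure generation.
# "google/mt5-base" is a placeholder checkpoint, not NorEventGen's base model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/mt5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "Stortinget vedtok i går den nye klimaloven."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)

# After task-specific fine-tuning, the decoded string would be a linearized
# event structure, e.g. "<event> trigger=vedtok args=[Stortinget, klimaloven] </event>"
# (an invented format for illustration only).
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```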
Item: Danoliteracy of Generative Large Language Models (University of Tartu Library, 2025-03). Vejlgaard Holm, Søren; Hansen, Lars Kai; Nielsen, Martin Carsten; Johansson, Richard; Stymne, Sara.
The language technology moonshot moment of Generative Large Language Models (GLLMs) was not limited to English: these models brought a surge of technological applications, investments, and hype to low-resource languages as well. However, the capabilities of these models in languages such as Danish were, until recently, difficult to verify beyond qualitative demonstrations due to a lack of applicable evaluation corpora. We present a GLLM benchmark to evaluate Danoliteracy, a measure of Danish language and cultural competency, across eight diverse scenarios such as Danish citizenship tests and abstractive social media question answering. This limited-size benchmark was found to produce a robust ranking that correlates with human feedback at $\rho \sim 0.8$, with GPT-4 and Claude Opus models achieving the highest rankings. Analyzing these model results across scenarios, we find one strong underlying factor explaining $95\%$ of scenario performance variance for GLLMs in Danish, suggesting a $g$ factor of model consistency in language adaptation.

Item: Dialectal treebanks and their relation with the standard variety: The case of East Cretan and Standard Modern Greek (University of Tartu Library, 2025-03). Vakirtzian, Socrates; Stamou, Vivian; Kazos, Yannis; Markantonatou, Stella; Johansson, Richard; Stymne, Sara.
We report on the development of the first treebank and parser for Eastern Cretan in the framework of Universal Dependencies (UD). Eastern Cretan is a living but under-resourced dialect of Modern Greek. We have worked on the transcription of oral material and relied on active annotation and knowledge transfer from GUD, a treebank of Standard Modern Greek. Along with its other phonological and morphosyntactic differences from Standard Modern Greek, Eastern Cretan (like other varieties of Modern Greek) makes heavy use of euphonics and voicing, which have not been included in the UD annotation guidelines so far. We have provided annotation guidelines for East Cretan euphonics and voicing and included them in the models. Knowledge transfer from the treebank of Standard Modern Greek to the dialectal models helped to initiate annotation via an active annotation procedure.

Item: SweClinEval: A Benchmark for Swedish Clinical Natural Language Processing (University of Tartu Library, 2025-03). Vakili, Thomas; Hansson, Martin; Henriksson, Aron; Johansson, Richard; Stymne, Sara.
The lack of benchmarks in certain domains and for certain languages makes it difficult to track progress regarding the state of the art of NLP in those areas, potentially impeding progress in important, specialized domains. Here, we introduce the first Swedish benchmark for clinical NLP: SweClinEval. The first iteration of the benchmark consists of six clinical NLP tasks, encompassing both document-level classification and named entity recognition tasks, with real clinical data. We evaluate nine different encoder models, both Swedish and multilingual. The results show that domain-adapted models outperform generic models on sequence-level classification tasks, while certain larger generic models outperform the clinical models on named entity recognition tasks. We describe how the benchmark can be managed despite limited possibilities to share sensitive clinical data, and discuss plans for extending the benchmark in future iterations.
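One plausible shape of the encoder evaluation the SweClinEval abstract describes is sketched below: fine-tuning a Swedish encoder for document-level classification. The clinical data cannot be shared, so the two example notes are invented stand-ins; the checkpoint is a real public Swedish encoder but not necessarily among the nine evaluated.

```python
# Minimal sketch: document-level classification with a Swedish encoder.
# Notes and labels are fabricated placeholders, not benchmark data.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "KB/bert-base-swedish-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

notes = ["Patienten uppvisar feber och hosta.", "Inga avvikande fynd vid undersökning."]
labels = torch.tensor([1, 0])

batch = tokenizer(notes, padding=True, truncation=True, return_tensors="pt")
loss = model(**batch, labels=labels).loss  # standard fine-tuning objective
loss.backward()  # one gradient step of the usual training loop
```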
Item: Analyzing the Effect of Linguistic Instructions on Paraphrase Generation (University of Tartu Library, 2025-03). Vahtola, Teemu; Hu, Songbo; Creutz, Mathias; Korhonen, Anna; Vulić, Ivan; Tiedemann, Jörg; Johansson, Richard; Stymne, Sara.
Recent work has demonstrated that large language models can often generate fluent and linguistically correct text that adheres to given instructions. However, to what extent can they execute complex instructions requiring knowledge of fundamental linguistic concepts and elaborate semantic reasoning? Our study connects an established linguistic theory of paraphrasing with LLM-based practice to analyze which specific types of paraphrases LLMs can accurately produce and where they still struggle. To this end, we investigate a method of analyzing paraphrases generated by LLMs prompted with a comprehensive set of systematic linguistic instructions. We conduct a case study using GPT-4, which has shown strong performance across various language generation tasks, and we believe that other LLMs may face similar challenges in comparable scenarios. We examine GPT-4 from a linguistic perspective to explore its potential contributions to linguistic research regarding paraphrasing, systematically assessing how accurately the model generates paraphrases that adhere to specified transformation rules. Our results suggest that GPT-4 frequently prioritizes simple lexical or syntactic alternations, often disregarding the transformation guidelines if they overly complicate the primary task.

Item: Efficient Elicitation of Fictitious Nursing Notes from Volunteer Healthcare Professionals (University of Tartu Library, 2025-03). Vaaben Bornerup, Jesper; Hardmeier, Christian; Johansson, Richard; Stymne, Sara.
Reliable automatic solutions for extracting structured information from free-text nursing notes could bring important efficiency gains in healthcare, but their development is hampered by the sensitivity and limited availability of example data. We describe a method for eliciting fictitious nursing documentation and associated structured documentation from volunteers, and a resulting dataset of 397 Danish notes collected and annotated through a custom web application from 98 participating nurses. After some manual refinement, we obtained a high-quality dataset containing nurse notes with relevant entities identified. We describe the implementation and limitations of our approach as well as initial experiments in a named entity tagging setup.

Item: Benchmarking Abstractive Summarisation: A Dataset of Human-authored Summaries of Norwegian News Articles (University of Tartu Library, 2025-03). Touileb, Samia; Mikhailov, Vladislav; Kroka, Marie Ingeborg; Velldal, Erik; Øvrelid, Lilja; Johansson, Richard; Stymne, Sara.
We introduce a dataset of high-quality human-authored summaries of news articles in Norwegian. The dataset is intended for benchmarking the abstractive summarisation capabilities of generative language models. Each document in the dataset is provided with three different candidate gold-standard summaries written by native Norwegian speakers, and all summaries are provided in both of the written variants of Norwegian: Bokmål and Nynorsk. The paper describes the data creation effort as well as an evaluation of existing open LLMs for Norwegian on the dataset. We also provide insights from a manual human evaluation comparing human-authored to model-generated summaries. Our results indicate that the dataset provides a challenging LLM benchmark for Norwegian summarisation capabilities.
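With three gold summaries per article, one common multi-reference convention is to score a model summary against each reference and keep the best. The sketch below applies that convention with ROUGE; the example strings are invented, and the paper's exact evaluation protocol may differ.

```python
# Minimal sketch: multi-reference ROUGE scoring, taking the best match
# across the dataset's three human-authored gold summaries.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=False)

gold_summaries = [  # invented toy references, one article
    "Regjeringen legger fram ny klimalov.",
    "Ny klimalov lagt fram av regjeringen.",
    "Klimaloven ble lagt fram i dag.",
]
model_summary = "Regjeringen la i dag fram en ny klimalov."

best_rougeL = max(
    scorer.score(ref, model_summary)["rougeL"].fmeasure for ref in gold_summaries
)
print(f"ROUGE-L (best of 3 references): {best_rougeL:.3f}")
```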
Item: Temporal Relation Classification: An XAI Perspective (University of Tartu Library, 2025-03). Terenziani, Sofia Elena; Johansson, Richard; Stymne, Sara.
Temporal annotations are used to identify and mark up temporal information, offering insight into how it is expressed through linguistic properties in text. This study investigates various discriminative pre-trained language models of differing sizes on a temporal relation classification task. We define valid reasoning strategies based on the linguistic principles that guide commonly used temporal annotations. Using a combination of saliency-based and counterfactual explanations, we examine whether the models' decisions are in line with these strategies. Our findings suggest that the selected models do not rely on the expected linguistic cues for processing temporal information effectively.

Item: Braxen 1.0 (University of Tartu Library, 2025-03). Tånnander, Christina; Edlund, Jens; Johansson, Richard; Stymne, Sara.
With this paper, we release a Swedish pronunciation lexicon resource, Braxen 1.0, which is the result of almost 20 years of development carried out at the Swedish Agency for Accessible Media (MTM). The lexicon originated with a basic word list but has continuously been expanded with new entries, mainly acquired from university textbooks and news text. Braxen consists of around 850 000 entries, of which around 150 000 are proper names. The lexicon is released under the CC BY 4.0 license and is accessible for public use.

Item: The Devil's in the Details: the Detailedness of Classes Influences Personal Information Detection and Labeling (University of Tartu Library, 2025-03). Szawerna, Maria Irena; Dobnik, Simon; Sánchez, Ricardo Muñoz; Volodina, Elena; Johansson, Richard; Stymne, Sara.
In this paper, we experiment with the effect of different levels of detailedness or granularity of the annotation of Personally Identifiable Information (PII), understood as (i) the number of classes and (ii) the classes' semantic depth in the sense of hypernym and hyponym relations, on the automatic detection and labeling of such information. We fine-tune a Swedish BERT model on a corpus of Swedish learner essays annotated with a total of six PII tagsets at varying levels of granularity. We also investigate whether the presence of grammatical and lexical correction annotation in the tokens and class prevalence have an effect on predictions. We observe that the fewer total categories there are, the better the overall results, but a more diverse annotation leads to fewer misclassifications for tokens containing correction annotation. We also note that the classes' internal diversity has an effect on labeling. We conclude from the results that while labeling based on the detailed annotation is difficult because of the number of classes, models trained on such annotation likely rely more on the semantic content captured by contextual word embeddings than on just the form of the tokens, making them more robust against nonstandard language.
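The granularity manipulation described in the PII abstract can be pictured as collapsing a fine-grained tagset into a coarser one along hypernym relations. The toy sketch below shows the idea; the tag names are invented illustrations, not the six tagsets used in the paper.

```python
# Toy sketch: collapsing fine-grained PII tags into coarser hypernym classes.
# All tag names are hypothetical examples.
FINE_TO_COARSE = {
    "FIRST_NAME": "NAME",
    "LAST_NAME": "NAME",
    "CITY": "LOCATION",
    "COUNTRY": "LOCATION",
    "EMAIL": "CONTACT",
    "PHONE": "CONTACT",
}

def coarsen(tags):
    """Map a sequence of fine-grained PII tags onto the coarser tagset."""
    return [FINE_TO_COARSE.get(t, t) for t in tags]

print(coarsen(["FIRST_NAME", "O", "CITY"]))  # -> ['NAME', 'O', 'LOCATION']
```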
Item: Surface-Level Morphological Segmentation of Low-resource Inuktitut Using Pre-trained Large Language Models (University of Tartu Library, 2025-03). Stenlund, Mathias; Myneni, Hemanadhan; Riedel, Morris; Johansson, Richard; Stymne, Sara.
Segmenting languages at morpheme boundaries, instead of relying on language-independent segmentation algorithms like Byte-Pair Encoding (BPE), has been shown to benefit downstream Natural Language Processing (NLP) task performance. This can, however, be tricky for polysynthetic languages like Inuktitut due to a high morpheme-to-word ratio and the lack of appropriately sized annotated datasets. In this work, we demonstrate the potential of using pre-trained Large Language Models (LLMs) for surface-level morphological segmentation of Inuktitut by treating it as a binary classification task. We fine-tune on tasks derived from automatically annotated Inuktitut words written in Inuktitut syllabics. Our approach shows good potential when compared to previous neural approaches. We share our best model to encourage further studies on downstream NLP tasks for Inuktitut written in syllabics.

Item: MC-19: A Corpus of 19th Century Icelandic Texts (University of Tartu Library, 2025-03). Steingrímsson, Steinþór; Sigurðsson, Einar Freyr; Jasonarson, Atli; Johansson, Richard; Stymne, Sara.
We present MC-19, a new Icelandic historical corpus containing texts from the period 1800-1920. We describe approaches for enhancing a corpus of historical texts by preparing them so that they can be processed using state-of-the-art NLP tools. We train encoder-decoder models to reduce the number of OCR errors while leaving other orthographic variation intact. We generate a separate modern spelling layer by normalizing the spelling to comply with modern spelling rules, using a statistical modernization ruleset as well as a dictionary of the most common words. This allows the texts to be PoS-tagged and lemmatized using available tools, facilitating use of the corpus by researchers and language technologists. The published version of the corpus contains over 270 million tokens.

Item: Generative AI for Technical Writing: Comparing Human and LLM Assessments of Generated Content (University of Tartu Library, 2025-03). Souza, Karen de; Nikolaev, Alexandre; Koponen, Maarit; Johansson, Richard; Stymne, Sara.
Large language models (LLMs) have recently gained significant attention for their capabilities in natural language processing (NLP), particularly generative artificial intelligence (AI). LLMs can also be useful tools for software documentation technical writers. We present an assessment of technical documentation content generated by three different LLMs using retrieval-augmented generation (RAG) with product documentation as a knowledge base. The LLM-generated responses were analyzed in three ways: 1) manual error analysis by a technical writer, 2) automatic assessment using deterministic metrics (BLEU, ROUGE, token overlap), and 3) evaluation of correctness using an LLM as a judge. The results of these assessments were compared using network analysis and linear regression models to investigate statistical relationships, model preferences, and the distribution of human and LLM scores. The analyses conclude that human quality evaluation is more closely related to the LLM correctness judgments than to the deterministic metrics, even when using different analysis frameworks.
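Of the deterministic metrics the Souza et al. abstract lists, token overlap is the simplest; the sketch below implements one plausible reading of it, a Jaccard coefficient over lowercased whitespace tokens. The paper's exact definition may differ, and the example strings are invented.

```python
# Minimal sketch: token overlap between a generated answer and a documentation
# reference, computed as a Jaccard coefficient (one plausible definition).
def token_overlap(reference: str, candidate: str) -> float:
    ref, cand = set(reference.lower().split()), set(candidate.lower().split())
    return len(ref & cand) / len(ref | cand) if ref | cand else 0.0

print(token_overlap(
    "Open the settings panel and enable logging.",
    "Enable logging from the settings panel.",
))
```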
Item: Entailment Progressions: A Robust Approach to Evaluating Reasoning Within Larger Discourse (University of Tartu Library, 2025-03). Shastry, Rishabh; Chiril, Patricia; Charney, Joshua; Uminsky, David; Johansson, Richard; Stymne, Sara.
Textual entailment, the ability to deduce whether a proposed hypothesis is logically supported by a given premise, has historically been applied to evaluating language modelling efficiency in tasks like question answering and text summarization. However, we hypothesize that these zero-shot entailment evaluations can be extended to the task of evaluating discourse within larger textual narratives. In this paper, we propose a simple but effective method that sequentially evaluates changes in textual entailment between sentences within a larger text, an approach we denote as "Entailment Progressions". These entailment progressions aim to capture the inference relations between sentences as an underlying component capable of distinguishing texts generated by various models and procedures. Our results suggest that entailment progressions can be used to effectively distinguish between machine-generated and human-authored texts across multiple established benchmark corpora and our own EP4MGT dataset. Additionally, our method remains robust when evaluated on paraphrased texts, a technique that has historically degraded the performance of well-established metrics for distinguishing between machine-generated and human-authored texts.

Item: Profiling Bias in LLMs: Stereotype Dimensions in Contextual Word Embeddings (University of Tartu Library, 2025-03). Schuster, Carolin M.; Roman, Maria-Alexandra; Ghatiwala, Shashwat; Groh, Georg; Johansson, Richard; Stymne, Sara.
Large language models (LLMs) are the foundation of the current successes of artificial intelligence (AI); however, they are unavoidably biased. To effectively communicate the risks and encourage mitigation efforts, these models need adequate and intuitive descriptions of their discriminatory properties, appropriate for all audiences of AI. We suggest bias profiles with respect to stereotype dimensions based on dictionaries from social psychology research. Along these dimensions, we investigate gender bias in contextual embeddings, across contexts and layers, and generate stereotype profiles for twelve different LLMs, demonstrating their intuitiveness and their usefulness for exposing and visualizing bias.

Item: Interactive maps for corpus-based dialectology (University of Tartu Library, 2025-03). Scherrer, Yves; Kuparinen, Olli; Johansson, Richard; Stymne, Sara.
Traditional data collection methods in dialectology rely on structured surveys, whose results can easily be presented on printed or digital maps. In recent years, however, corpora of transcribed dialect speech have become a valuable alternative data source for data-driven linguistic analysis. For example, topic models can be advantageously used to discover both general dialectal variation patterns and the specific linguistic features that are most characteristic of certain dialects. Multilingual (or rather, multilectal) language modeling tasks can also be used to learn speaker-specific embeddings. In connection with this paper, we introduce a website that presents the results of two recent studies in the form of interactive maps, allowing visitors to explore the effects of various parameter settings. The website covers two tasks (topic models and speaker embeddings) and three language areas (Finland, Norway, and German-speaking Switzerland). It is available at https://www.corcodial.net/.
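The topic-modeling idea in the Scherrer and Kuparinen abstract can be sketched generically: fit a topic model over dialect-transcript "documents" and inspect which surface variants weigh most in each topic. The snippet below uses plain LDA on invented toy transcripts; the studies' actual models, data, and hyperparameters are not reproduced here.

```python
# Toy sketch: LDA over dialect transcripts; top-weighted terms per topic
# approximate "characteristic variants". Transcripts are fabricated examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

transcripts = [
    "mie käyn huomenna kaupassa",    # invented dialect-style variant
    "minä menen huomenna kauppaan",  # standard-like variant
    "mää meen huomen kauppaan",      # another invented variant
]
vec = CountVectorizer()
X = vec.fit_transform(transcripts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for topic in lda.components_:
    print([terms[i] for i in topic.argsort()[-3:]])  # 3 most characteristic terms
```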
Item: Prompt Engineering Enhances Faroese MT, but Only Humans Can Tell (University of Tartu Library, 2025-03). Scalvini, Barbara; Simonsen, Annika; Debess, Iben Nyholm; Einarsson, Hafsteinn; Johansson, Richard; Stymne, Sara.
This study evaluates GPT-4's English-to-Faroese translation capabilities, comparing it with multilingual models on the FLORES-200 and Sprotin datasets. We propose a prompt optimization strategy using Semantic Textual Similarity (STS) to improve translation quality. Human evaluation confirms the effectiveness of STS-based few-shot example selection, though automated metrics fail to capture these improvements. Our findings advance LLM applications for low-resource language translation while highlighting the need for better evaluation methods in this context.
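A minimal sketch of the STS-based few-shot selection mentioned in the Scalvini et al. abstract: embed the source sentence and a pool of translation examples, then put the most similar pairs in the prompt. The encoder checkpoint and example pool are assumptions; the paper's STS setup may differ.

```python
# Minimal sketch: pick few-shot translation examples by semantic similarity
# to the sentence being translated. Pool pairs are invented placeholders.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed generic STS encoder

pool = [
    ("The weather is nice today.", "Veðrið er gott í dag."),
    ("Where is the harbour?", "Hvar er havnin?"),
    ("I would like some coffee.", "Eg vil fegin hava kaffi."),
]
source = "The weather was terrible yesterday."

scores = util.cos_sim(
    encoder.encode(source, convert_to_tensor=True),
    encoder.encode([en for en, _ in pool], convert_to_tensor=True),
)[0]
# Keep the two most similar pairs as in-context examples for the MT prompt.
few_shot = [pool[int(i)] for i in scores.argsort(descending=True)[:2]]
print(few_shot)
```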