Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
Permanent URI for this collection: https://hdl.handle.net/10062/107190
Browsing this collection by Issue Date
Now showing 1 - 20 of 83
Open Access Item: Does Preprocessing Matter? An Analysis of Acoustic Feature Importance in Deep Learning for Dialect Classification (University of Tartu Library, 2025-03)
Fischbach, Lea; Kleen, Caroline; Flek, Lucie; Lameli, Alfred; Johansson, Richard; Stymne, Sara
This paper examines the effect of preprocessing techniques on spoken dialect classification using raw audio data. We focus on modifying Root Mean Square (RMS) amplitude, DC offset, articulation rate (AR), pitch, and Harmonics-to-Noise Ratio (HNR) to assess their impact on model performance. Our analysis determines whether these features are important, irrelevant, or misleading for the classification task. To evaluate these effects, we use a pipeline that tests the significance of each acoustic feature through distortion and normalization techniques. While preprocessing did not directly improve classification accuracy, our findings reveal three key insights: deep learning models for dialect classification are generally robust to variations in the tested audio features, suggesting that normalization may not be necessary. We identify articulation rate as a critical factor, directly affecting the amount of information in audio chunks. Additionally, we demonstrate that intonation, specifically the pitch range, plays a vital role in dialect recognition.

Open Access Item: Opinion Units: Concise and Contextualized Representations for Aspect-Based Sentiment Analysis (University of Tartu Library, 2025-03)
Häglund, Emil; Björklund, Johanna; Johansson, Richard; Stymne, Sara
We introduce opinion units, a contribution to the field of Aspect-Based Sentiment Analysis (ABSA) that extends aspect-sentiment pairs by including substantiating excerpts, derived through hybrid abstractive-extractive summarisation. The goal is to provide fine-grained information without sacrificing succinctness and abstraction.
Evaluations on review datasets demonstrate that large language models (LLMs) can accurately extract opinion units through few-shot learning. The main types of errors are providing incomplete contexts for opinions and mischaracterising objective statements as opinions. The method reduces the need for labelled data and allows the LLM to dynamically define aspect types. As a practical evaluation, we present a case study on similarity search across academic datasets and public review data. The results indicate that searches leveraging opinion units are more successful than those relying on traditional data-segmentation strategies, showing robustness across datasets and embeddings.

Open Access Item: Dialectal treebanks and their relation with the standard variety: The case of East Cretan and Standard Modern Greek (University of Tartu Library, 2025-03)
Vakirtzian, Socrates; Stamou, Vivian; Kazos, Yannis; Markantonatou, Stella; Johansson, Richard; Stymne, Sara
We report on the development of the first treebank and parser for Eastern Cretan in the framework of Universal Dependencies (UD). Eastern Cretan is a living but under-resourced dialect of Modern Greek. We have worked on the transcription of oral material and relied on active annotation and knowledge transfer from GUD, a treebank of Standard Modern Greek. Along with its other phonological and morphosyntactic differences from Standard Modern Greek, Eastern Cretan (and other varieties of Modern Greek) makes heavy use of euphonics and voicing phenomena that have not been included in the UD annotation guidelines so far. We have provided annotation guidelines for East Cretan euphonics and voicing and included them in the models.
Knowledge transfer from the treebank of Standard Modern Greek to the dialectal models helped to initiate annotation via an active annotation procedure.

Open Access Item: A Comparative Study of PEFT Methods for Python Code Generation (University of Tartu Library, 2025-03)
Männistö, Johanna; Attieh, Joseph; Tiedemann, Jörg; Johansson, Richard; Stymne, Sara
Fine-tuning language models incurs high costs in training, inference and storage. Parameter-efficient fine-tuning (PEFT) methods have emerged as a more cost-effective alternative to full fine-tuning. However, limited work has compared different PEFT approaches for tasks like code generation. In this study, we examine the effect of various PEFT training methods on model performance in the task of Python code generation. We fine-tune four model families, ranging from 124M to 7B parameters, using three PEFT approaches alongside standard full fine-tuning. Our findings reveal that the effectiveness of each PEFT method varies with the model size and the corpus used.

Open Access Item: Modeling Multilayered Complexity in Literary Texts (University of Tartu Library, 2025-03)
Feldkamp, Pascale; Kardos, Márton; Nielbo, Kristoffer; Bizzoni, Yuri; Johansson, Richard; Stymne, Sara
We explore the relationship between stylistic and sentimental complexity in literary texts, analyzing how they interact and affect overall complexity. Using a dataset of over 9,000 English novels (19th-20th century), we find that complexity at the stylistic/syntactic and sentiment levels tends to show a linear association.
Finally, using dedicated datasets, we show that both stylistic/syntactic features, particularly those relating to information density, and sentiment features are related to text difficulty rank as well as average processing time.

Open Access Item: Question-parsing with Abstract Meaning Representation enhanced by adding small datasets (University of Tartu Library, 2025-03)
Heinecke, Johannes; Boritchev, Maria; Herledan, Frédéric; Johansson, Richard; Stymne, Sara
Abstract Meaning Representation (AMR) is a graph-based formalism for representing meaning in sentences. As the annotation is quite complex, few annotated corpora exist. The most well-known and widely used corpora are LDC's AMR 3.0 and the datasets available on the new AMR website. Models trained on the LDC corpora work well on texts of similar genre and style: sentences extracted from news articles and Wikipedia articles. However, other types of texts, in particular questions, are less well processed by models trained on this data. We analyse how adding a few sentence-type-specific annotations can steer the model to improve parsing in the case of questions in English.

Open Access Item: Comparing Human and Machine Translations of Generative Language Model Evaluation Datasets (University of Tartu Library, 2025-03)
de Vroe, Sander Bijl; Stampoulidis, George; Hakala, Kai; Rouhe, Aku; van Heeswijk, Mark; Karlgren, Jussi; Johansson, Richard; Stymne, Sara
The evaluation of Large Language Models (LLMs) is one of the crucial current challenges in the field of Natural Language Processing (NLP) and becomes even more challenging in the multilingual setting. Since the majority of the community's benchmarks exist only in English, test sets are now being machine translated at scale into dozens of languages.
This work explores the feasibility of that approach, comparing a Finnish machine translation (MT) of ARC-Challenge with a new human-translated version. Our findings suggest that since absolute scores are fairly close and model size rankings are preserved, machine translation is adequate in this case. Surprisingly, however, the datasets reverse the order of base models compared to their chat-finetuned counterparts.

Open Access Item: Poro 34B and the Blessing of Multilinguality (University of Tartu Library, 2025-03)
Luukkonen, Risto; Burdge, Jonathan; Zosa, Elaine; Talman, Aarne; Komulainen, Ville; Hatanpää, Väinö; Sarlin, Peter; Pyysalo, Sampo; Johansson, Richard; Stymne, Sara
The pretraining of state-of-the-art large language models now requires trillions of words of text, which is orders of magnitude more than is available for the vast majority of languages. While including text in more than one language is an obvious way to acquire more pretraining data, multilinguality is often seen as a curse, and most model training efforts continue to focus near-exclusively on individual large languages. We believe that multilinguality can be a blessing: when the lack of training data is a constraint for effectively training larger models for a target language, augmenting the dataset with other languages can offer a way to improve over the capabilities of monolingual models for that language. In this study, we introduce Poro 34B, a 34 billion parameter model trained for 1 trillion tokens of Finnish, English, and programming languages, and demonstrate that a multilingual training approach can produce a model that substantially advances over the capabilities of existing models for Finnish and excels in translation, while also achieving competitive performance in its class for English and programming languages.
We release the model parameters, scripts, and data under open licenses at https://huggingface.co/LumiOpen/Poro-34B.

Open Access Item: Rethinking Low-Resource MT: The Surprising Effectiveness of Fine-Tuned Multilingual Models in the LLM Age (University of Tartu Library, 2025-03)
Scalvini, Barbara; Debess, Iben Nyholm; Simonsen, Annika; Einarsson, Hafsteinn; Johansson, Richard; Stymne, Sara
This study challenges the current paradigm shift in machine translation, where large language models (LLMs) are gaining prominence over traditional neural machine translation models, with a focus on English-to-Faroese translation. We compare the performance of various models, including fine-tuned multilingual models, LLMs (GPT-SW3, Llama 3.1), and closed-source models (Claude 3.5, GPT-4). Our findings show that a fine-tuned NLLB model outperforms most LLMs, including some larger models, in both automatic and human evaluations. We also demonstrate the effectiveness of using LLM-generated synthetic data for fine-tuning. While closed-source models like Claude 3.5 perform best overall, the competitive performance of smaller, fine-tuned models suggests a more nuanced approach to low-resource machine translation. Our results highlight the potential of specialized multilingual models and the importance of language-specific knowledge.
We discuss implications for resource allocation in low-resource settings and suggest future directions for improving low-resource machine translation, including targeted data creation and more comprehensive evaluation methodologies.

Open Access Item: Revisiting Projection-based Data Transfer for Cross-Lingual Named Entity Recognition in Low-Resource Languages (University of Tartu Library, 2025-03)
Politov, Andrei; Shkalikov, Oleh; Jäkel, Rene; Färber, Michael; Johansson, Richard; Stymne, Sara
Cross-lingual Named Entity Recognition (NER) leverages knowledge transfer between languages to identify and classify named entities, making it particularly useful for low-resource languages. We show that data-based cross-lingual transfer is an effective technique for cross-lingual NER and can outperform multilingual language models for low-resource languages. This paper introduces two key enhancements to the annotation projection step in cross-lingual NER for low-resource languages. First, we explore refining word alignments using back-translation to improve accuracy. Second, we present a novel formalized projection approach that matches source entities with extracted target candidates. Through extensive experiments on two datasets spanning 57 languages, we demonstrate that our approach surpasses existing projection-based methods in low-resource settings.
These findings highlight the robustness of projection-based data transfer as an alternative to model-based methods for cross-lingual named entity recognition in low-resource languages.

Open Access Item: Testing relevant linguistic features in automatic CEFR skill level classification for Icelandic (University of Tartu Library, 2025-03)
Richter, Caitlin Laura; Ingason, Anton Karl; Glišić, Isidora; Johansson, Richard; Stymne, Sara
This paper explores the use of various linguistic features to develop models for automatic classification of language proficiency on the CEFR scale for Icelandic, a low-resourced and morphologically complex language. We train two classifiers to assess the skill level of learner texts. One serves as a baseline: it takes in the original, unaltered text written by a learner and uses predominantly surface features to assess the level. The other uses surface as well as morphological and lexical features, together with context vectors from a transformer (IceBERT). It takes in both the original and corrected versions of the text and takes into account errors/deviations of the original texts compared to the corrected versions. Both classifiers show promising results, with baseline models achieving 62.2-67.1% accuracy and dual-version models 75-80.3%.

Open Access Item: How Aunt-Like Are You? Exploring Gender Bias in the Genderless Estonian Language: A Case Study (University of Tartu Library, 2025-03)
Kaukonen, Elisabeth; Sabir, Ahmed; Sharma, Rajesh; Johansson, Richard; Stymne, Sara
This paper examines gender bias in Estonian, a grammatically genderless Finno-Ugric language that has neither a gendered noun system nor gendered pronouns, but expresses gender through vocabulary.
In this work, we focus on the male-female compound words ending in -tädi ‘aunt’ and -onu ‘uncle’, aiming to pinpoint the occupations these words signify for women and men, and to examine whether they reveal occupational differentiation and gender stereotypes. The findings indicate that these compounds go beyond occupational titles and highlight prevalent gender bias.

Open Access Item: Localizing AI: Evaluating Open-Weight Language Models for Languages of Baltic States (University of Tartu Library, 2025-03)
Bergmanis, Toms; Pinnis, Mārcis; Kapočiūtė-Dzikienė, Jurgita; Johansson, Richard; Stymne, Sara
Although large language models (LLMs) have transformed our expectations of modern language technologies, concerns over data privacy often restrict the use of commercially available LLMs hosted outside of EU jurisdictions. This limits their application in governmental, defense, and other data-sensitive sectors. In this work, we evaluate the extent to which locally deployable open-weight large language models support lesser-spoken languages such as Lithuanian, Latvian, and Estonian. We examine various size and precision variants of the top-performing multilingual open-weight models, Llama 3, Gemma 2, Phi, and NeMo, on machine translation, multiple-choice question answering, and free-form text generation. The results indicate that while certain models like Gemma 2 perform close to the top commercially available models, many LLMs struggle with these languages.
Most surprisingly, however, we find that these models, while showing close to state-of-the-art translation performance, are still prone to lexical hallucinations, with errors in at least 1 in 20 words for all open-weight multilingual LLMs.

Open Access Item: Towards a Derivational Semantics Resource for Latvian (University of Tartu Library, 2025-03)
Lokmane, Ilze; Grasmanis, Mikus; Klints, Agute; Nešpore-Bērzkalne, Gunta; Paikens, Pēteris; Pretkalniņa, Lauma; Rituma, Laura; Stāde, Madara; Tauriņa, Evelīna; Johansson, Richard; Stymne, Sara
In this paper we describe the implementation of the first structured resource of semantic derivational links for Latvian, basing it on the largest online dictionary, Tēzaurs.lv, and linking it to the Latvian WordNet. We separate two kinds of derivational links: semantic derivation links between senses and morphological derivation links between lexemes. The semantic links between senses are defined as a pair of semantic labels assigned to both ends of the link. The process of semantic linking involves revising the sense inventory of both the base word and the derivative, defining semantic labels for lexemes of four basic word classes (nouns, verbs, adjectives and adverbs), and adding the appropriate labels to the corresponding senses.
We exemplify our findings with a detailed representation of sense relations between a base verb and its nominal derivatives.

Open Access Item: The BRAGE Benchmark: Evaluating Zero-shot Learning Capabilities of Large Language Models for Norwegian Customer Service Dialogues (University of Tartu Library, 2025-03)
Riess, Mike; Jørgensen, Tollef Emil; Johansson, Richard; Stymne, Sara
This study explores the capabilities of open-weight Large Language Models in a zero-shot learning setting, testing their ability to classify the content of customer service dialogues in Norwegian from a single instruction, named the BRAGE benchmark. By comparing results against widely used downstream tasks such as question-answering and named entity recognition, we find that (1) instruction-tuned models greatly exceed base models on the benchmark, (2) both English and multilingual instruction models outperform the tested Norwegian models of similar sizes, and (3) the difference between base and instruction models is less pronounced than in other generative tasks, suggesting that BRAGE is a challenging benchmark that requires precise and generalizable instruction-tuning.

Open Access Item: LAG-MMLU: Benchmarking Frontier LLM Understanding in Latvian and Giriama (University of Tartu Library, 2025-03)
Etori, Naome A.; Kanepajs, Arturs; Lu, Kevin; Karisa, Randu; Johansson, Richard; Stymne, Sara
This paper evaluates the language understanding capabilities of various large language models (LLMs) through an analysis of 112 translated and human-edited questions from the Massive Multitask Language Understanding (MMLU) dataset, focusing specifically on two underrepresented languages: Latvian and Giriama.
The study compares the performance of six state-of-the-art (SOTA) models, with OpenAI's o1-preview model demonstrating superior performance across all languages, significantly outperforming non-proprietary models in Latvian and all other models in Giriama. Human editing of automated translations from English to Latvian yielded only a small, statistically insignificant improvement in performance estimates, suggesting that machine-translated benchmarks may be sufficient for comparing model performance in languages with established digital resources like Latvian. However, automated translation to Giriama proved infeasible, and model performance in Giriama remained poor, highlighting the persistent challenges LLMs face with low-resource languages. These findings underscore the need for more comprehensive datasets and improved machine translation capabilities for underrepresented languages, while emphasizing the importance of localized benchmarks and human evaluation in addressing cultural and contextual limitations in AI models.

Open Access Item: Generative AI for Technical Writing: Comparing Human and LLM Assessments of Generated Content (University of Tartu Library, 2025-03)
Souza, Karen de; Nikolaev, Alexandre; Koponen, Maarit; Johansson, Richard; Stymne, Sara
Large language models (LLMs) have recently gained significant attention for their capabilities in natural language processing (NLP), particularly generative artificial intelligence (AI). LLMs can also be useful tools for software documentation technical writers. We present an assessment of technical documentation content generated by three different LLMs using retrieval-augmented generation (RAG) with product documentation as a knowledge base.
The LLM-generated responses were analyzed in three ways: 1) manual error analysis by a technical writer, 2) automatic assessment using deterministic metrics (BLEU, ROUGE, token overlap), and 3) evaluation of correctness by an LLM acting as a judge. The results of these assessments were compared using network analysis and linear regression models to investigate statistical relationships, model preferences, and the distribution of human and LLM scores. The analyses concluded that human quality evaluation is more closely related to the LLM correctness judgment than to deterministic metrics, even when using different analysis frameworks.

Open Access Item: Incorporating Target Fuzzy Matches into Neural Fuzzy Repair (University of Tartu Library, 2025-03)
Nieminen, Tommi; Tiedemann, Jörg; Virpioja, Sami; Johansson, Richard; Stymne, Sara
Neural fuzzy repair (NFR) is a simple implementation of retrieval-augmented translation (RAT), based on data augmentation. In NFR, a translation database is searched for translation examples whose source sentence is similar to the sentence being translated, and the target side of the example is concatenated with the source sentence. We experiment with introducing retrieval based on target similarity to NFR during training. The results of our experiments confirm that including target-similarity matches during training supplements source-similarity matches and leads to better translations at translation time.

Open Access Item: Aligning Language Models for Icelandic Legal Text Summarization (University of Tartu Library, 2025-03)
Harðarson, Þórir Hrafn; Loftsson, Hrafn; Ólafsson, Stefán; Johansson, Richard; Stymne, Sara
The integration of language models in the legal domain holds considerable promise for streamlining processes and improving efficiency in managing extensive workloads.
However, the specialized terminology, nuanced language, and formal style of legal texts can present substantial challenges. This study examines whether preference-based training techniques, specifically Reinforcement Learning from Human Feedback and Direct Preference Optimization, can enhance models' performance in generating Icelandic legal summaries that align with domain-specific language standards and user preferences. We compare models fine-tuned with preference training to those trained with conventional supervised learning. Results indicate that preference training improves the legal accuracy of generated summaries over standard fine-tuning but does not significantly enhance the overall quality of Icelandic language usage. Discrepancies between automated metrics and human evaluations further underscore the importance of qualitative assessment in developing language models for the legal domain.

Open Access Item: Mapping Faroese in the Multilingual Representation Space: Insights for ASR Model Optimization (University of Tartu Library, 2025-03)
Lág, Dávid í; Scalvini, Barbara; Gudnason, Jon; Johansson, Richard; Stymne, Sara
ASR development for low-resource languages like Faroese faces significant challenges due to the scarcity of large, diverse datasets. While fine-tuning multilingual models using related languages is a common practice, there is no standardized method for selecting these auxiliary languages, leading to a computationally expensive trial-and-error process. By analyzing Faroese's positioning among other languages in wav2vec2's multilingual representation space, we find that Faroese's closest neighbors are influenced not only by linguistic similarity but also by historical, phonetic, and cultural factors. These findings open new avenues for auxiliary language selection to improve Faroese ASR and underscore the potential value of data-driven factors in ASR fine-tuning.
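
The representation-space analysis described in the last abstract can be sketched in a few lines: given one pooled embedding vector per language, ranking candidate auxiliary languages reduces to cosine similarity against the target-language vector. This is a minimal illustrative sketch, not the authors' code; the language set and the toy vectors below are invented for the example, and in practice each vector would be a mean-pooled hidden state from a model such as wav2vec2.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical per-language centroid embeddings (3-d for readability;
# real pooled speech representations would have hundreds of dimensions).
centroids = {
    "faroese":   [0.9, 0.1, 0.3],
    "icelandic": [0.85, 0.15, 0.35],
    "danish":    [0.7, 0.3, 0.2],
    "finnish":   [0.2, 0.9, 0.1],
}

# Rank all other languages by similarity to the Faroese centroid.
target = centroids["faroese"]
ranking = sorted(
    ((lang, cosine(vec, target))
     for lang, vec in centroids.items() if lang != "faroese"),
    key=lambda pair: pair[1],
    reverse=True,
)
for lang, sim in ranking:
    print(f"{lang}: {sim:.3f}")
```

With these toy vectors the closest neighbor comes out as Icelandic, but the ranking depends entirely on the embeddings supplied; the paper's point is precisely that such data-driven rankings need not match purely linguistic expectations.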